Serendipitous, Open Big Data Management and Analytics: The SeDaSOMA Framework

Cuzzocrea, Alfredo; Ciancarini, Paolo

doi:10.3390/modelling5030061


Serendipitous, Open Big Data Management and Analytics: The SeDaSOMA Framework^†

by Alfredo Cuzzocrea ^1,2,* and Paolo Ciancarini ^3

^1 iDEA Lab, University of Calabria, 87036 Rende, Italy
^2 Department of Computer Science, University of Paris City, 75013 Paris, France
^3 DISI Department, University of Bologna, 40126 Bologna, Italy
^* Author to whom correspondence should be addressed.
^† This paper is the extension of our paper: Cuzzocrea, A.; Ciancarini, P. SeDaSOMA: A Framework for Supporting Serendipitous, Data-As-A-Service-Oriented, Open Big Data Management and Analytics. In Proceedings of the 5th International Conference on Cloud and Big Data Computing, Liverpool, UK, 20–22 August 2021. https://doi.org/10.1145/3481646.3481647.

Modelling 2024, 5(3), 1173-1196; https://doi.org/10.3390/modelling5030061

Submission received: 24 November 2023 / Revised: 14 May 2024 / Accepted: 1 August 2024 / Published: 4 September 2024


Abstract

This paper presents and delves into the architecture and intricacies of SeDaSOMA, a sophisticated framework supporting Serendipitous, Data-as-a-Service-oriented, Open big data Management and Analytics. SeDaSOMA meticulously addresses the multifaceted challenges inherent in open big data management and analytics. SeDaSOMA stands as a testament to the evolving landscape of big data management and analytics, embodying a commitment to harnessing advanced functionalities through a synthesis of innovative research findings and cutting-edge tools. In the context of this framework, the paper not only elucidates its structural components but also underscores its pivotal role in facilitating the seamless integration, processing, and analysis of massive and diverse datasets. By providing a comprehensive overview of SeDaSOMA, this paper contributes to the ongoing discourse within the field of big data management and analytics, shedding light on the intricate interplay between technological innovation and practical application. Moreover, as a complement to the discussion on SeDaSOMA, the paper offers a critical exploration of the emerging topics in the realm of big data research. By delineating current state-of-the-art methodologies and forecasting future research trajectories, this overview enriches the scholarly dialogue surrounding the evolving landscape of big data management and analytics, offering insights into the broader implications and potential advancements in the field.

Keywords: big data; big data management; big data analytics; big data methodologies; big data as a service; open big data; distributed environments; privacy-preserving big data; approximate big data

1. Introduction

The amalgamation of big data management and analytics represents a significant and evolving area of interdisciplinary research, which draws upon diverse studies spanning database systems, data warehousing, data mining, and machine learning (e.g., [1,2]). This integrated approach synthesizes various research endeavors conducted over time. Notably, the application of big data management and analytics has witnessed a proliferation across a spectrum of domains, ranging from smart cities to social networks, from sensor networks to intelligent transportation systems, and from supply-chain analysis to economic development protocols. For instance, in smart cities, big data management and analytics facilitate efficient resource allocation and urban planning [3]. Similarly, in social networks, these methodologies enable insightful analyses of user behavior and network dynamics [4]. Moreover, their application extends to domains such as sensor networks [5] where they contribute to data-driven decision-making processes and to intelligent transportation systems [6] where they enhance traffic management and optimize route planning. Additionally, in areas like supply-chain analysis [7] and economic development protocols [8], big data management and analytics play pivotal roles in optimizing operational efficiency and fostering sustainable growth strategies. This multidisciplinary convergence underscores the pervasive impact and versatile applications of big data management and analytics across diverse domains.

Indeed, early research, exemplified by OLAP data cube processing [9,10] for big data management and OLAP mining [11,12,13,14] for big data analytics, has elucidated that a direct transference of findings to the context of big data is not feasible due to emergent challenges. Consequently, the necessity for innovative techniques and algorithms arises [15,16]. This imperative is further underscored by the escalating requirement for intelligent big data applications within the information and communication community [5,17,18,19].

Following this trending research topic, this paper presents the architecture of a reference framework, namely SeDaSOMA, which supports Serendipitous, Data-as-a-Service-oriented, Open big data Management and Analytics. The main goal of this framework is to provide advanced big data management and analytics on the basis of innovative research achievements and next-generation big data tools.

SeDaSOMA presents a software architecture tailored for serendipitous, Data-as-a-Service (DaaS)-oriented management of big data and for predictive analytics. The architecture integrates a comprehensive suite of models, techniques, and algorithms designed to acquire, represent, manage, and secure big data within modern Cloud-based DaaS paradigms (e.g., [20]). Specifically, it aims to establish a novel data marketplace environment and foster an open big data ecosystem capable of supporting scalable big data analytics. To achieve this goal, the framework leverages innovative methods for big data query answering such as the following: (i) Distributed query processing; (ii) Approximate query processing; (iii) Query optimization for heterogeneous data sources. In addition, it leverages context-aware approaches such as the following: (i) Semantic data integration; (ii) Adaptive analytics frameworks; (iii) Context-sensitive anomaly detection. Furthermore, it leverages big data preparation techniques tailored for developing effective big data analytics solutions such as the following: (i) Data cleansing and preprocessing; (ii) Feature engineering and selection; (iii) Scalable data transformation pipelines. The SeDaSOMA reference architecture contains several main layers that represent the functional and procedural basis of the framework, which is similar to other initiatives in the field (e.g., [21]). These functions and procedures typically target data-intensive tasks as well as a wide range of Cloud-aware big data vertical applications. These applications span the main challenge areas of the current information community, as specified by the European Commission's data management guidelines [22], with particular reference to the following: (i) Social big data management for workplace safety and health; (ii) Integration of (big) bank data and (big) customer data for supporting big data intelligence; (iii) Big data advertisement on the Web for opportunity finding. These are only some of the case studies that have been considered as critical examples in modern big data settings.

This paper is organized as follows. Section 2 describes related works relevant to our research and analyzes the main contributions of the SeDaSOMA framework over conventional state-of-the-art approaches. Section 3 provides a detailed description of the SeDaSOMA framework and its composition. Section 4 provides a critical overview of some emerging topics of big data research that are close to our work, highlighting the current state of the art and future research efforts in the field. Section 5 introduces a practical implementation of SeDaSOMA that focuses on big clinical data. Finally, Section 6 gives the conclusions of our research along with the future work that we plan to undertake to extend it. This paper substantially extends the paper [23], where we first introduced the framework idea and the main design guidelines. Other past contributions that are extended in depth in this work focus on privacy-preserving and approximate big data management [24] and on the integration of multidimensional modelling and Artificial Intelligence (AI) [25].

2. Related Work

In recent years, the management and analysis of open big data have garnered significant attention in both academia and industry (e.g., [26]). In this Section, we provide a comprehensive review of relevant literature by focusing on key developments and methodologies in the field of Open Big Data Management and Analytics (e.g., [27]).

In [28], the authors adopt a complexity-theory approach and suggest that organizations develop different approaches to leverage their big data analytics resources toward the attainment of organizational goals [29]. They build on a sample of 175 survey responses from IT managers in Greek firms and examine the patterns of big data analytics resources that lead to high levels of performance. The authors also apply a configurational approach through a novel methodological tool named fsQCA, which allows the examination of such complex phenomena and the reduction of solutions to a core set of elements. In addition, the authors examine three case studies to uncover how these elements, as well as other core enablers or inhibitors, can emerge and how they coalesce and impact performance.

Ref. [30] provides a systematic literature review on Big Data across business management between 2009 and 2014. Although the field is in its earliest stages of scholarly development, the authors found clear evidence of the energy and increasing interest focused on Big Data studies in business. Numerous studies have examined Big Data initiatives in organizations, including the Big Data revolution in corporate strategies and management. The purpose is to offer a conceptual framework and to define an empirical interpretation of a Big Data methodological approach in organizational Competitive Intelligence (CI) cycles. To this end, the authors provide a comprehensive description of the evidence received from the selected organizations on Big Data practices as well as analytical approaches in corporate CI processes. In particular, they introduce basic Big Data analytics such as competitor analysis, SWOT, segmentation analysis, and five forces analysis; however, these have not been applied in real time on larger data. The authors also employ advanced techniques related to Big Data, such as text mining and natural language processing. Overall, the research findings indicate that firms have yet to consider Big Data technologies in CI processes; these results are a first step toward building the process for Big Data in organizational CI cycles.

Ref. [31] recognizes that the emergence of cloud computing techniques and big data processing models, such as MapReduce, has made big data analytics and knowledge management a trending topic in the research world. However, despite the wide adoption of applications built following the MapReduce paradigm on public Clouds, the lack of trust in the participating virtual machines deployed on these Clouds has hindered further adoption of these applications. Therefore, the authors extend the existing hybrid Cloud MapReduce architecture to multiple public Clouds and propose IntegrityMR, i.e., an integrity assurance framework for big data analytics and management applications based on this type of architecture. Furthermore, the authors investigate integrity check techniques applied at two alternative software layers: the MapReduce task layer and the application layer. Based on Apache Hadoop MapReduce and Pig Latin, they designed and implemented their system at both layers and evaluated it through a series of experiments using common big data analytics and management tools such as Apache Mahout and Pig on commercial public Clouds (Amazon EC2 and Microsoft Azure) and local cluster environments. The results obtained with the task layer approach show high integrity (98% with a credit threshold of 5) with non-negligible performance overhead (18% to 82% extra running time compared to the original MapReduce).

In [32], the authors describe the Ophidia project, i.e., a research work that deals with big data analytics requirements, problems, and challenges for e-science. The authors present Ophidia, an analytics framework responsible for atomically processing, transforming, and manipulating array-based data, which is also capable of running large clusters of analytics tasks on big datasets. Additionally, the paper describes the design principles, the algorithm, and the most important implementation aspects of the Ophidia analytics framework, and presents experimental results for some data analytics operators in a real cluster environment.

Ref. [33] argues that Cloud computing and big data analysis are gaining lots of interest across a range of applications, including disaster management. These two technologies together provide the capability of real-time data analysis not only to detect emergencies in disaster areas but also to rescue the affected people. The paper presents a framework that supports emergency event detection and alert generation by analyzing the data stream, which includes efficient data collection, data aggregation, and alert dissemination. One of the goals of such a framework is to support an end-to-end security architecture to protect the data stream from unauthorized manipulation as well as leakage of sensitive information. The proposed system provides support for both data security punctuation and query security punctuation. The paper presents the proposed architecture with a specific focus on data stream security. It also briefly describes the implementation of security aspects of the architecture.

In [34], the authors focus their attention on decision making in the context of natural disaster management, where several challenges need to be faced. When disasters happen, the government, being a response organization, has to take immediate and accurate decisions in order to rapidly assist the victims involved and support their effective recovery. This paper aims to support strategic decision making in government regarding disaster management by employing Big Data Analytics (BDA) techniques. BDA methodologies are integrated as solutions for managing, employing, maximizing, and displaying climate change data insights when dealing with water-related natural disasters. NAHRIM is a government agency responsible for conducting research on water and its environment. The agency proposed a natural disaster management BDA framework for utilizing NAHRIM historical and simulated projected hydroclimate datasets. The development of this framework aims at assisting the government in the decision-making process regarding natural disaster management by fully employing NAHRIM datasets. This BDA framework consists of three stages (Data Acquisition, Data Computation, and Data Interpretation) and seven layers, including Data Source, Data Management, Analysis, Data Visualization, Disaster Management, and Decision. It is anticipated that this framework will have an effective impact on the prevention, minimization, preparation, adaptation, response, and recovery of water-related natural disasters.

Ref. [35] considers that advanced digitalization, together with the rise of disruptive Internet technologies, are key enablers of a fundamental paradigm shift observed in industrial production. This is known as the fourth industrial revolution (Industry 4.0), which proposes the integration of the new generation of ICT solutions for the monitoring, adaptation, simulation, and optimization of factories. With the democratization of sensors and actuators, factories and machine tools can now be sensorized, and the data generated by these devices can be exploited, for instance, to optimize the utilization of the machines as well as their operation and maintenance. However, analyzing the vast amount of generated data is resource-demanding both in terms of computing power and network bandwidth, thus requiring highly scalable solutions. The paper presents a novel big data approach and analytics framework for the management and analysis of machine-generated data in the Cloud. It brings together standard open-source technologies and the exploitation of elastic computing, which, as a whole, can be adapted to and deployed on different Cloud computing platforms. This enables the reduction of infrastructure costs, the minimization of deployment difficulty, and the provision of on-demand access to a virtually infinite set of computing power, storage, and network resources.

Finally, in [36], the authors recognize that, due to the increased usage of ICT technologies in the field of smart cities, the amount of generated data has increased manifold. This data is heterogeneous in nature, as it varies with respect to time and exhibits the properties of all the essential V’s of big data. Therefore, handling such an enormous amount of data requires various big data processing techniques. To cope with these issues, the paper presents a tensor-based big data management technique to reduce the dimensionality of data gathered from the Internet-of-Energy (IoE) environment in a smart city. The core data is extracted from the gathered data by using tensor operations such as matricization, vectorization, and tensorization with the help of higher-order singular value decomposition. This core data is then stored on the Cloud in reduced form. The reduced data can be used to provide many services in smart cities, and the paper discusses its application to Demand Response (DR) services. For this purpose, a Support Vector Machine (SVM)-based classifier is used to classify the end-users (residential and commercial) into normal, overloaded, and underloaded categories from the core data. Once such users are identified as taking part in the DR mechanism, utilities generate commands to handle their DR in order to alter load requirements so that the overall load is optimized. Results obtained on the Open Energy Information and PJM datasets indicate that the proposed tensor-based scheme outperforms the traditional scheme for DR management.
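To make the tensor operations mentioned for [36] more concrete, the following is a minimal sketch of a truncated higher-order singular value decomposition in Python with NumPy. It is an illustration under our own assumptions and not the implementation of [36]: the function names (`unfold`, `mode_product`, `truncated_hosvd`, `reconstruct`) and the choice of target ranks are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricization: bring `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` (shape J x I_mode) along the given mode."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return a reduced core tensor and one factor matrix per mode."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Left singular vectors of each unfolding span that mode's subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Tensorization back to the original shape (approximate if truncated)."""
    tensor = core
    for mode, u in enumerate(factors):
        tensor = mode_product(tensor, u, mode)
    return tensor

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    readings = rng.normal(size=(6, 5, 4))  # hypothetical users x hours x days
    core, factors = truncated_hosvd(readings, ranks=(3, 2, 2))
    print(core.shape)  # reduced core that would be stored instead of raw data
```

Only the core and the factor matrices need to be stored; with full ranks the reconstruction is exact, while smaller ranks trade accuracy for storage.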

Innovations of SeDaSOMA over Conventional State-of-the-Art Approaches

The SeDaSOMA framework introduces some important innovations over conventional state-of-the-art approaches, like those provided and discussed in our literature analysis. These innovations are fully discussed in Section 3, along with all the specific properties of SeDaSOMA, but here, we highlight those that fill the gap with respect to literature contributions. In the following, we summarize these innovations:

Support for Big Data Crowdsourcing: One of the most prominent characteristics of SeDaSOMA is that its big data layer is fed via crowdsourcing methodologies, as in some recent initiatives (e.g., [37]). By contrast, in conventional approaches, big data repositories are fed via methods that are essentially inspired by traditional ETL (Extraction, Transformation and Loading) procedures (e.g., [38]). The usage of crowdsourcing methodologies ensures more dynamicity in the data sources, more scalability, and more heterogeneity;
Support for Big Data Marketplacing: SeDaSOMA provides support for big data marketplacing (e.g., [39]). This means that, after big data within the SeDaSOMA data layer are consolidated, they are not exposed to the big data analytics layer via conventional methods (e.g., publish-subscribe, push-down, ETL, etc.) but, rather, via the innovative data marketplace model, according to which client users/applications can access a marketplace of (consolidated) big data sources and select the ones that are more appropriate and “convenient” to them based on their big data analytics goals, preferences, workload issues, and so forth;
Support for Context-Aware Big Data Processing: In addition to the above-listed innovations, the SeDaSOMA framework processes big data depending on the context with respect to which they are produced, located, elaborated, and consumed. This topic is currently attracting a great deal of attention (e.g., [40]), as it enables more intelligent big data processing, which plays a critical role with respect to specific and advanced challenges of big data, such as big data understanding and big data fruition;
Support for Big Data Preparation for Analytics: Finally, SeDaSOMA applies data preparation techniques (e.g., [41]) to support big data analytics tasks effectively and efficiently. Indeed, due to the well-known 3V properties of big data, big data cannot be processed by analytical tasks as they are; they need specific data preparation solutions (e.g., normalization, scaling, and so forth).
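As a small illustration of the preparation steps named in the last item (normalization and scaling), the following Python sketch shows min-max scaling and z-score normalization on a single numeric attribute. It is a didactic example under our own assumptions; the function names are hypothetical and are not part of SeDaSOMA.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

if __name__ == "__main__":
    temperatures = [2, 4, 6]  # hypothetical raw attribute
    print(min_max_scale(temperatures))
    print(z_score(temperatures))
```

Both functions guard against constant attributes, which would otherwise cause a division by zero.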

3. SeDaSOMA’s Anatomy

In Figure 1, we present the layered architecture of the SeDaSOMA framework. The main challenge this framework faces is to provide and experiment with a novel vision for big data management and analytics that surpasses traditional approaches, addresses issues regarding the scalability and responsiveness of analytics over big data, and achieves novel paradigms, mechanisms, and schemes in the context of critical Cloud-aware big data vertical applications. In turn, these applications are associated with different major areas of current information community challenges specified by the European guidelines.

The architecture of SeDaSOMA is typically nested and multi-layered, where each layer consists of multiple components, as shown in Figure 1. SeDaSOMA includes the following four layers:

Big Data Source Layer, which locates and manages different sources of big data, which are heterogeneous in nature, distributed, and streamed inherently.
Big Data Repository and Provisioning Layer, which stores and processes big data, then transmits it to the higher layers.
Big Data Analytics Layer, which extracts useful knowledge from big data by running complex and scalable analytics over these huge amounts of data.
Big Data Application Layer, which includes Cloud-aware big data vertical applications that completely rely on the underlying layers. These applications focus on the main domains of modern information community challenges specified by European Union data management guidelines.
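The flow of data through the four layers listed above can be sketched as a small Python pipeline. This is only a toy illustration of the layering idea under our own assumptions: the class and function names (`Record`, `Repository`, `source_layer`, `analytics_layer`, `application_layer`) are hypothetical and do not reflect the actual SeDaSOMA implementation.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator

@dataclass
class Record:
    source: str
    value: float

def source_layer(streams: dict[str, Iterable[float]]) -> Iterator[Record]:
    """Big Data Source Layer: tag raw readings with their originating source."""
    for name, stream in streams.items():
        for value in stream:
            yield Record(name, value)

@dataclass
class Repository:
    """Big Data Repository and Provisioning Layer: store, then serve upward."""
    records: list[Record] = field(default_factory=list)

    def store(self, incoming: Iterable[Record]) -> None:
        self.records.extend(incoming)

    def provision(self, source: str) -> list[float]:
        return [r.value for r in self.records if r.source == source]

def analytics_layer(values: list[float]) -> dict:
    """Big Data Analytics Layer: derive a simple summary from provisioned data."""
    if not values:
        return {"count": 0, "mean": None}
    return {"count": len(values), "mean": sum(values) / len(values)}

def application_layer(repo: Repository, source: str) -> str:
    """Big Data Application Layer: a vertical application consuming analytics."""
    summary = analytics_layer(repo.provision(source))
    return f"{source}: {summary['count']} readings, mean={summary['mean']}"

if __name__ == "__main__":
    repo = Repository()
    repo.store(source_layer({"sensor_a": [1.0, 3.0], "sensor_b": [10.0]}))
    print(application_layer(repo, "sensor_a"))
```

Each layer only calls the one directly beneath it, which mirrors the dependency of the application layer on the underlying layers.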

Each layer is associated with a particular set of big data management and analytics challenges (which, however, have already been investigated in the current literature, e.g., [42]). All layers share the same goal, which is to achieve innovation and advancements in critical knowledge and research, along with authoritative implementations, by employing ad-hoc Cloud-aware vertical applications over big data. Additionally, the SeDaSOMA framework highlights the organization and relationships between the different layers within its composite architecture, which is considered innovative by itself, as it introduces a novel and sustainable vision of big data management and analytics that has no similarities in the current literature.

3.1. SeDaSOMA Components

In this part, we focus on the layers of the SeDaSOMA framework, where each layer consists of several components, as follows.

Big Data Source Layer, which contains the following components:

Physical Resource Management, which handles the physical aspects of resources that are capable of generating data streams (e.g., sensors and datacenters) by exploring innovative aspects such as noise detection and cleaning, stream pipelining, resource/stream synchronization, etc. This can be further extended to other advanced topics, such as designing high-performance infrastructures that solve the traditional issues regarding big data streams, including the common 3Vs of big data, i.e., Volume, Variety, Velocity.
Heterogeneous Distributed Big Data Stream Management, which is responsible for managing big data streams at the framework’s input interface, where big data is considered as data objects on top of the underlying physical resources (e.g., sensors and datacenters that are capable of generating big data streams). The goal here is to overcome the following major challenges: (i) dealing with the streaming nature of big data; (ii) dealing with multi-rate arrivals and inter-arrivals; (iii) dealing with the massive amounts of (streaming) big data, etc.
Big Data Crowdsourcing, which tackles the problem of big data collection using the innovative crowdsourcing paradigm. This component aims to develop models, techniques, and algorithms for applying this novel paradigm to big data collection/acquisition problems.
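The multi-rate arrival challenge mentioned for stream management can be illustrated with a minimal Python sketch that merges streams arriving at different rates into one time-ordered sequence and then counts events per tumbling window. The event layout `(timestamp, source, value)` and the function names are our own assumptions for illustration only.

```python
import heapq

def merge_streams(*streams):
    """Lazily merge streams of (timestamp, source, value) events.

    Each input stream is assumed to be already ordered by timestamp.
    """
    return heapq.merge(*streams, key=lambda event: event[0])

def tumbling_window_counts(events, width):
    """Count events per non-overlapping window of `width` time units."""
    counts = {}
    for timestamp, _source, _value in events:
        window_start = (timestamp // width) * width
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

if __name__ == "__main__":
    # Hypothetical sources: one emits every time unit, the other every three.
    fast = [(0, "fast", 1.0), (1, "fast", 1.1), (2, "fast", 0.9), (3, "fast", 1.2)]
    slow = [(0, "slow", 20.0), (3, "slow", 21.0)]
    merged = list(merge_streams(fast, slow))
    print(tumbling_window_counts(merged, width=2))
```

Because `heapq.merge` is lazy, the merge itself does not need to hold whole streams in memory; the example materializes the result only for printing.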

Big Data Repository and Provisioning Layer, which contains the following components:

Big Data Storage, which is a major component of the SeDaSOMA framework. It tackles big data representation and storage problems, where scalability is one of the key requirements. In this context, several challenges exist, which range from advanced data structures for big data storage to big data indexing, and from big data partitioning techniques to innovative data management policies and regulations focusing on elastic big data storage solutions, etc.
Big Data Warehousing, which ensures the reliability of data management within the SeDaSOMA framework. Given the complex nature of big data, reliability is a key feature of warehousing methodologies and solutions. In our case, these methodologies are captured and implemented by the Big Data Warehousing component in combination with the Cloud-based paradigm. The component is built on top of the Big Data Storage component, and modern NoSQL architectures (e.g., in distributed settings) are used in its implementation. This component addresses the following main issues: (i) multidimensional big data models, which are critical for supporting big data analytics; (ii) compressed big multidimensional data representations; (iii) MapReduce-based big data warehousing, etc.
Secure Big Data Processing, which represents a major concern of the SeDaSOMA framework, as it is critical to guarantee the security of big data stored in the Big Data Warehousing component; this is because these (big) data are accessed and processed within an open environment, i.e., the Big Data Marketplace component. As a result, the SeDaSOMA framework is required to incorporate ad-hoc solutions in order to secure big data processing. The Secure Big Data Processing component directly interfaces with the Big Data Warehousing component to achieve this goal. In this context, some of the relevant existing challenges are as follows: (i) ensuring scalability and security when accessing big data; (ii) implementing innovative big data encryption methodologies; (iii) defining novel techniques for big data provenance, etc.
Big Data Marketplace, which embodies one of the innovative characteristics of the SeDaSOMA framework: its core data layer makes big data available to higher layers and consumer (Cloud-aware) applications by employing a novel data marketplace (DaaS) paradigm instead of a traditional big data fruition scheme. This paradigm targets modern Cloud computing environments and anticipates an applicative setting that uses service-oriented primitives in order to make big data available to consumer applications. The main goal of this component is to make the big data stored in the Big Data Warehousing component available to consumer applications following the DaaS paradigm, while ensuring security (which is guaranteed by the Secure Big Data Processing component). Several open problems emerge in this context: from innovative service-oriented big data provisioning to designing models for the big data marketplace, and from challenges related to scalability to Cloud-compliant heterogeneity problems, etc.
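The idea of selecting sources from a marketplace according to analytics goals and preferences can be sketched as follows in Python. This is a hypothetical toy model: the `DataOffer` attributes (topics, price, freshness) and the ranking rule in `select_offers` are assumptions made for illustration and are not prescribed by the framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataOffer:
    name: str
    topics: frozenset
    price_per_gb: float
    freshness_hours: int

def select_offers(catalog, wanted_topics, max_price, max_age_hours):
    """Return offers that satisfy the client's constraints, best match first."""
    wanted = set(wanted_topics)
    matches = [
        offer for offer in catalog
        if wanted & offer.topics
        and offer.price_per_gb <= max_price
        and offer.freshness_hours <= max_age_hours
    ]
    # Prefer offers covering more wanted topics; break ties by lower price.
    return sorted(matches, key=lambda o: (-len(wanted & o.topics), o.price_per_gb))

if __name__ == "__main__":
    catalog = [
        DataOffer("traffic-feed", frozenset({"traffic", "geo"}), 0.5, 1),
        DataOffer("weather-archive", frozenset({"weather"}), 0.1, 48),
        DataOffer("city-sensors", frozenset({"traffic", "weather", "geo"}), 0.8, 2),
    ]
    for offer in select_offers(catalog, {"traffic", "weather"}, 1.0, 24):
        print(offer.name)
```

In a DaaS setting, such a selection function would sit behind a service interface, with access control delegated to a separate security component.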

The Big Data Analytics Layer consists of the following components:

Big Data Query Answering, which addresses challenges related to implementing effective and efficient algorithms that support query answering over big data, along with scalability issues. The query answering algorithms that have been proposed and developed for traditional data-intensive scenarios (e.g., very large databases) are not capable of processing large amounts of big data; as a result, new techniques need to be introduced. The major relevant challenges in this context are the following: (i) the use of data compression paradigms to enhance query answering performance while ensuring the accuracy of answers; (ii) evaluating preference-based query processing techniques and how they handle big data; (iii) scalability problems related to querying big data, etc.
Context-Aware Big Data Processing. As highlighted before, the SeDaSOMA framework aims primarily at providing powerful data-intensive analytics in next-generation big data applications, where these analytics can be exploited based on the successful Cloud computing paradigm. Following this concept, it is critical to incorporate context-aware methods in order to provide “the right (big) data to the right application”, given the large amounts and the strong heterogeneity of big data. This issue has become very relevant lately, especially due to its strong relationship with topics such as advertisements and recommendations in big data. In the SeDaSOMA framework, we guarantee this requirement via the Context-Aware Big Data Processing component. The major relevant challenges in this context are as follows: (i) context-awareness in the proposed algorithms and techniques for big data; (ii) preference-based big data processing; (iii) scalability in context-aware big data processing, etc.
Big Data Preparation for Analytics, which addresses a major challenge within the process of designing and executing analytics over big data, i.e., how to perform pre-processing over big data in order to make analytics more effective and more efficient. This is a critical issue due to the characteristics of big data, such as the extremely massive size and strong heterogeneity. In addition, analytics may lose relevance (e.g., in terms of “responsiveness”) when a “unifying” schema is lacking. The Big Data Preparation for Analytics component is responsible for providing solutions that fulfil some technical needs in this context, such as refining existing data, reducing unnecessary data, normalizing data, and successfully extracting attributes and features from big data that guide the analytics process. Moreover, since these requirements can only be answered during the analytics phase, the Big Data Preparation for Analytics component needs to provide on-demand preparation primitives in order to be able to respond to requests from the analytics component. Finally, all preparation primitives must be used in a privacy-preserving manner, especially when managing user data. SOLID is an ecosystem that we plan to refer to in order to handle and overcome the challenges of privacy (e.g., [43,44]).
Scalable Big Data Analytics, which is the component that ensures the main goal of the whole SeDaSOMA proposal, i.e., supporting scalable big data analytics. To this end, the Scalable Big Data Analytics component addresses the problem of defining novel paradigms for next-generation analytics that are characterized by high responsiveness and high scalability. This component plays a critical role, as Cloud-aware big data vertical applications (which are devoted to assessing and possibly showing the effectiveness and the reliability of the framework) rely on the SeDaSOMA framework and on successfully exploiting the results and the interactions of analytics in order to attain their respective applicative goals. Several relevant research problems can be faced regarding this aspect, such as the following: (i) data-intensive analytics, i.e., analytics interacting with large-scale repositories such as big data; (ii) declarative versus procedural analytics; (iii) quality-aware analytics and measures of the “quality” of analytics; (iv) responsiveness issues related to analytics; (v) scalability issues related to analytics, etc.
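The trade-off between query answering performance and accuracy of answers, mentioned for the Big Data Query Answering component, can be illustrated with a sampling-based approximate aggregate in Python. Uniform random sampling is used here as a simple stand-in for the data compression paradigms named above; it is our own illustrative choice, and the function name and parameters are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def approximate_mean(values, sample_size, seed=0, z=1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where `margin` is an approximate 95%
    confidence half-width under a normal approximation (z = 1.96).
    The finite-population correction is omitted for simplicity.
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = mean(sample)
    margin = z * stdev(sample) / math.sqrt(sample_size)
    return estimate, margin

if __name__ == "__main__":
    population = list(range(100_000))  # hypothetical large attribute column
    estimate, margin = approximate_mean(population, sample_size=1_000)
    print(f"approximate mean: {estimate:.1f} +/- {margin:.1f}")
```

The answer is computed from a small fraction of the data, and the reported margin makes the accuracy loss explicit to the consumer of the query result.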

The Big Data Analytics Layer serves as the principal underlying layer for several Cloud-aware big data vertical applications, which are used to assess the effectiveness and reliability of the SeDaSOMA framework to which they are exposed. These applications depend strongly on the Cloud computing paradigm and, in particular, make use of the underlying analytics provided by the SeDaSOMA framework to attain their respective applicative results, depending on user/environment requirements.

With respect to the SeDaSOMA framework, critical applications that address current information society challenges, as specified by the European data management guidelines, fall into several major fields, such as the following:

Social Big Data Management for Workplace Safety and Health, where the vertical application focuses on the “serendipity” issue when exploring and integrating (big) data collected from different media sources such as blogs, social networks, emails, forums, communities, etc., and on mining such data in order to share it among users and to successfully exploit it for workplace safety and health. In this case, people need to play an “active” role, meaning that they are able to share their opinions, both positive and negative, as well as their criticisms and, therefore, create a “real” (big-data-based) community with powerful analytics capabilities.
Integration of (Big) Bank Data and (Big) Customer Data for Supporting Big Data Intelligence. Vertical applications in this context aim to improve the “on-line” experience of their customers through typical banking services delivered via different types of devices. Both types of such data (i.e., bank data and customer data) are naturally large and also introduce the common characteristics of big data (i.e., 3Vs of big data). These applications will also be exposed to dealing with heterogeneous types of data, such as location data related to customers, and services, such as health services that may take advantage of their integration with bank data associated with customers, which are, as a result, considered as citizens by health services.
Big Data Advertisement on the Web for Opportunity Funding. In this context, vertical applications consider the issues related to using big data analytics (also combined with artificial intelligence algorithms) to support advanced Web marketing in various domains (work, investments, etc.). This represents a very exciting challenge for the next-generation community, as it is encouraged by already widely-adopted technologies and platforms for the Web (e.g., Google, Amazon, Alibaba, etc.) and for social networks (e.g., Facebook, Instagram, LinkedIn, etc.), which give providers and vendors the means of collecting and storing immense big data repositories representing user profiles, user preferences, user goals, etc. Artificial intelligence algorithms are then run over these big data repositories in order to discover the best sub-optimal solution in a very large set of Web application cases, such as buying, investing, dating, etc.

As we demonstrate throughout the paper, big data management and analytics plays a critical role in the current research community. The SeDaSOMA proposal is a direct consequence of this clear trend. SeDaSOMA includes many research innovations, among which the serendipitous paradigm is the major one.

As regards future developments, we envision that critical big data applications will increasingly teach us what the novel requirements and features to capture will be, as was the situation in other relevant cases (e.g., COVID-19 pandemic management and analytics) (e.g., [45]). On the other hand, foundational and methodological aspects of big data management and analytics still need to establish themselves largely due to the fact that, on the application side, the evolution of big data management and analytics has been incredibly fast during the last few years.

All these considerations confirm for us the relevance of the research area we investigate and, to this end, the milestone represented by the SeDaSOMA proposal.

3.2. SeDaSOMA Implementation

As regards the specific implementation of the SeDaSOMA framework, we employ well-understood big data technologies. Table 1 reports, for each of the main layers of SeDaSOMA (see Figure 1), the technology adopted.

At the Big Data Source Layer, there are the following: (i) Oracle NoSQL Database provides support for key–value data; (ii) MongoDB provides support for document-shaped big data (e.g., JSON); (iii) Neo4J provides support for big graph data; (iv) Apache Streaming provides support for streaming big data. The Big Data Repository and Provisioning Layer is completely implemented on top of Apache Hadoop. Apache Spark is the fundamental technology for the Big Data Analytics Layer, with its well-known task-focused libraries such as Spark SQL for SQL and structured data, MLlib for Machine Learning, Spark Streaming for stream processing, and GraphX for graph analytics. Finally, common high-level programming languages, such as Java, C#, J#, and Python, form the basis of the Big Data Application Layer.

In more detail, our proposed framework SeDaSOMA is integrated as a component of the Big Data Management and Analytics System architecture, as shown in Figure 2. This architecture consists of the following main components.

Big Data Sources. In contemporary data environments, enterprises utilize a combination of cloud and on-premise resources to access and manage diverse big data sources. Cloud platforms offer scalability and accessibility, allowing enterprises to store and process large volumes of data efficiently. Within these cloud environments, enterprises integrate various data sources, including social media platforms, which provide valuable insights into consumer behavior and market trends through user-generated content and engagement metrics. Additionally, enterprises leverage activity generated from their own digital channels, such as websites and mobile applications, to gain real-time insights into user interactions and preferences. On-premise infrastructure complements cloud resources by hosting proprietary archives of historical data and legacy systems, ensuring data continuity and regulatory compliance. Furthermore, enterprises tap into public data repositories and external sources to enrich their datasets with contextual information from governmental databases, open data initiatives, and third-party APIs. Regarding the handling of the different types of big data at this level of the architecture, the following technologies provide support: (i) Oracle NoSQL Database for key–value data; (ii) MongoDB for document-shaped big data (e.g., JSON); (iii) Neo4J for big graph data; (iv) Apache Streaming for big stream data. This hybrid approach to big data management enables enterprises to leverage a diverse array of sources to support decision-making processes and drive innovation.
Big Data Archives. Within the architecture of big data management and analytics systems, the big data archive component plays a crucial role in facilitating the storage, preservation, and accessibility of vast volumes of data. Serving as a repository for historical and infrequently accessed data, the big data archive component ensures the long-term retention of valuable information while optimizing storage resources and cost-effectiveness. Leveraging scalable storage technologies and efficient data compression techniques, this component accommodates the ever-growing influx of data generated by diverse sources. Moreover, the big data archive component incorporates robust data governance [46] and security mechanisms to safeguard sensitive information and adhere to regulatory compliance requirements. By seamlessly integrating with other components of the big data ecosystem, such as data processing and analytics modules, the archive component enables the efficient retrieval and analysis of archived data, thereby facilitating informed decision-making and fostering innovation in data-driven enterprises.
Transactional Systems. This component assumes a pivotal role in enabling real-time processing and management of high-volume transactional data streams. This component encompasses sophisticated data processing frameworks and distributed transaction processing engines designed to handle massive concurrent transactions with low latency and high throughput. Leveraging scalable and fault-tolerant architectures, the big data transactional system ensures the reliability and integrity of transactional data in dynamic and distributed environments. Furthermore, it integrates with various data sources and downstream analytics modules to enable continuous data ingestion, processing, and analysis, thus supporting timely decision-making and operational intelligence in enterprise settings. Additionally, the big data transactional system component incorporates advanced transaction management functionalities, including distributed locking mechanisms, transaction isolation levels, and conflict resolution strategies, to maintain data consistency and concurrency control across distributed computing nodes.
Big Data Engine. This comprises the Hadoop Distributed File System (HDFS) and MapReduce, therefore constituting a fundamental infrastructure for scalable and parallel processing of large datasets. Hadoop HDFS serves as a distributed file system designed to store and replicate data across a cluster of commodity hardware, ensuring fault tolerance and high availability. Concurrently, MapReduce provides a programming model and runtime environment for distributed data processing, enabling efficient parallel execution of computational tasks across distributed computing nodes. Together, Hadoop HDFS and MapReduce form the backbone of big data processing frameworks, facilitating the distributed storage and processing of massive datasets in a fault-tolerant and cost-effective manner. This component plays a pivotal role in supporting diverse analytics workflows, ranging from batch processing and data warehousing to real-time stream processing and machine learning, thereby enabling enterprises to derive actionable insights and drive innovation through data-driven decision-making.
Operational Data Store (ODS). This component assumes a critical role as an intermediary storage layer facilitating real-time access and integration of heterogeneous data sources. The ODS acts as a centralized repository for ingesting, cleansing, and harmonizing streaming and batch data from diverse operational systems, sensors, and external sources. By providing a unified view of operational data in near real-time, the ODS enables organizations to make informed decisions, monitor performance, and respond promptly to evolving business needs. Leveraging scalable distributed architectures and advanced data processing techniques, the ODS ensures data consistency, reliability, and timeliness, thus serving as a foundational component for downstream analytics, reporting, and decision support applications. Moreover, the ODS fosters interoperability and data sharing across disparate systems and departments, driving collaboration and innovation in data-driven enterprises.
Big Data Analytics (SeDaSOMA). Our proposed framework assumes a central role, employing a suite of advanced tools and frameworks to extract actionable insights from vast and diverse datasets. Utilizing Spark SQL for SQL-based querying and processing of structured data, this component enables efficient and scalable data manipulation and analysis, facilitating exploratory data analysis, data visualization, and ad-hoc querying tasks. Complementing Spark SQL, MLlib provides a comprehensive library of machine learning algorithms and utilities, empowering users to build and deploy scalable machine learning models for classification, regression, clustering, and collaborative filtering tasks. Furthermore, Spark Streaming facilitates real-time stream processing and analysis of continuous data streams, enabling organizations to derive timely insights and respond promptly to evolving trends and events. Additionally, GraphX offers a powerful framework for graph analytics, supporting graph processing algorithms and graph-based analytics applications such as social network analysis, recommendation systems, and fraud detection. Together, these components form a robust and versatile analytics ecosystem, empowering organizations to unlock the full potential of big data and drive innovation through data-driven decision-making processes.
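
To make the programming model of the Big Data Engine more concrete, the following is a minimal, single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop code and is not taken from the SeDaSOMA implementation: the function names (map_phase, shuffle, reduce_phase, count_by_source) and the sample records are hypothetical, and they only illustrate how map, shuffle, and reduce steps could count incoming records per data source.

```python
from collections import defaultdict

# Hypothetical input records: (source, payload) pairs standing in for
# data arriving from the big data sources.
RECORDS = [
    ("social", "post"),
    ("web", "click"),
    ("social", "like"),
    ("sensor", "reading"),
    ("web", "click"),
]


def map_phase(records):
    """Map step: emit one (key, 1) pair per input record."""
    for source, _payload in records:
        yield source, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_source(records):
    """Run the three steps in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(count_by_source(RECORDS))
```

In an actual Hadoop deployment, the map and reduce functions would run on different nodes over HDFS blocks, and the shuffle would be performed by the framework rather than by user code.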

4. Discussion

Distributed environments are the natural ground for big data management and analytics tasks. Cloud computing systems are among the most important of these, and they have been further boosted by recent technological developments that have greatly improved the ICT sector (e.g., [47]). Real-world Cloud-based applications, like smart cities, intelligent transportation systems, marketplace tools, and so on, increasingly create new challenges for big data research, which in turn advances the field.

Big data management and big data analytics in distributed systems effectively merge inside a common, unifying framework whose primary concerns and challenges call for common solutions to the difficult problem of managing and supporting knowledge discovery from huge amounts of data.

Here, three top-class topics have thus emerged in current studies:

distributed big data management and analytics: complex methodologies;
privacy-preservation methodologies for distributed big data management and analytics;
distributed uncertain and imprecise big data management and analytics.

It should be highlighted that all of these themes not only pose pertinent theoretical challenges but also serve as reminders of substantial practical achievements in actual Cloud-based systems and applications. To become convinced of this, consider the clear example offered by real bio-informatics systems (e.g., [48]), where such concerns assume a significant role. In this specific context, SeDaSOMA can be successfully exploited to represent, process, and mine genome datasets, which are usually created within major collaborations or as comprehensive compilations of data generated by individual researchers in their respective laboratories and shared upon publication. For instance, ArrayExpress, a repository of publicly accessible gene expression data, houses over 1.3 million genome-wide assays derived from 45,000+ experiments. More precisely, what can be mined? Unsupervised techniques are employed to reveal the intrinsic organizational structure of the data, such as identifying patterns in gene expression related to cancers. These methodologies commonly detect prominent recurring characteristics within the data and may be influenced by various confounding factors. For instance, Principal Component Analysis (PCA) is an unsupervised method used to unveil the hidden data features that carry the most substantial information signal.
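
As an illustration of the PCA step mentioned above, the following minimal sketch extracts the first principal component of a tiny expression matrix in plain Python. It makes several assumptions that are not part of the original discussion: the matrix values are invented, rows are taken to be samples and columns genes, and power iteration is used in place of a full eigendecomposition purely to keep the example dependency-free.

```python
import math

# Hypothetical expression matrix: rows are samples, columns are genes.
# Genes 1 and 2 co-vary strongly; gene 3 is almost constant.
EXPRESSION = [
    [2.0, 4.1, 1.0],
    [3.0, 6.0, 1.1],
    [4.0, 7.9, 0.9],
    [5.0, 10.0, 1.0],
]


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(rows, iterations=200):
    """Approximate the first principal component by power iteration."""
    cov = covariance(center(rows))
    d = len(cov)
    # Asymmetric start vector; power iteration fails only if the start
    # happens to be orthogonal to the dominant eigenvector.
    start = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left along the current direction
            return v
        v = [x / norm for x in w]
    return v


if __name__ == "__main__":
    print(first_component(EXPRESSION))
```

On this toy input, the returned unit vector loads mainly on the two co-varying genes and almost not at all on the constant one, which is the kind of "hidden feature" PCA is meant to expose. On real genome-scale data, a library implementation would be the more sensible choice.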

These important subjects are currently having a significant impact on the research community and will continue to play a significant part in both current and upcoming research endeavors. Following this unambiguous proof, we present the state-of-the-art on the referenced topics in this Section, along with ideas for directing additional field research.

4.1. Distributed Big Data Management and Analytics: Complex Methodologies

Numerous big data management and analytics systems, some of which have achieved excellence (e.g., [49,50,51]), have been proposed over time in the research literature. The next stage is moving toward advanced big data management and analytics in distributed contexts by looking into particular and brand-new research subjects (like those discussed in [52]). These topics first focus on effectively and efficiently supporting big data management and analytics tasks over challenging big data types, such as big social data (e.g., [53]), big multidimensional data (e.g., [54]), big graph data (e.g., [55]), big healthcare data (e.g., [56]), key–value stores (e.g., [57]), and so forth.

At the same time, designing innovative big data analytics tools that can extend the capabilities of the most advanced methods is another hot topic. Innovative big data analytics tools, such as multidimensional analysis (e.g., [58]), process mining analytics, reinforcement learning of big data analytics (e.g., [59]), etc., call for renewed effort in the literature. This area has advanced considerably, with many promising scientific results.

Therefore, in line with these key research trends, the following are possible guidelines for next-generation research in the context of advanced big data management and analysis in a distributed environment:

analysis of authoritative proposals in the investigated scientific area;
selection of reference complex big data types;
integration of complex big data repositories in specific Cloud data stores, such as NoSQL data layers;
definition of novel big data management tasks on top of these data stores;
design and implementation of big data management tasks;
definition of novel big data analytics tools on top of these data stores;
design and implementation of big data analytics tools.

4.2. Privacy-Preservation Methodologies for Distributed Big Data Management and Analytics

As with other principal research topics, privacy-preservation methodologies for distributed big data management and analytics (e.g., [60,61]) play a critical role, especially given the wide range of big data application scenarios that have emerged recently, which vary from social networks to bio-informatics, from sensor networks to web recommendation tools, from e-science systems to e-government systems, and so forth. Protecting the privacy of sensitive information can be regarded as an enabling technology in all these applicative settings, whether for personal data (e.g., [62,63,64]) or aggregate data (e.g., [65,66,67]). Blockchain technology (e.g., [68,69]) is one of the most significant emerging topics in this context.

On the other hand, the process of privacy-preserving big data management and analytics is strongly related to the big data security research area (e.g., [70]), which investigates the means of accessing and managing big data repositories in a secure manner. However, there remains a problem that needs to be assessed, which is combining the privacy and security of big data in distributed environments. It is one of the main future research directions in big data.

In response to the growing interest in this topic in recent years, the research community has already produced a large body of literature on privacy-preserving big data management and analytics in distributed environments (e.g., [62,71]). This shows how mature the subject is. However, theoretical tools for supporting big data management and analytics in distributed systems while preserving privacy remain an exciting topic that has to be investigated in future work. In this regard, an interesting research direction is to further extend well-consolidated theoretical models for privacy-preserving OLAP (e.g., [72]) to developing technologies like differential privacy (e.g., [73]). This paradigm should be extended even further to more broadly applicable privacy-preserving big data publishing problems (e.g., [71]). The combination of these problems with cutting-edge, sophisticated machine learning tools, like tensor-based big data analytics (e.g., [74]), represents a thriving field of research with excellent results in terms of both theoretical contributions and real-world applications. As for big data analytics proper, the problem of supporting long-running big data analytics query processing in distributed environments (e.g., [75]), for instance, Cloud stores (e.g., [76]), in a privacy-preserving manner represents another interesting line of research. The primary challenge here is how to integrate the privacy preservation of the singleton query (e.g., an OLAP query [77]) with the overall distributed privacy preservation of the big data analytics task, in order to build even more powerful and more expressive methodologies.
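
To give a flavor of how differential privacy could be attached to a single aggregate query, the sketch below applies the standard Laplace mechanism to a COUNT query. It is only an illustration under stated assumptions: the record layout, the predicate, and the function names (laplace_noise, private_count) are hypothetical, and the cited works may use different mechanisms.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace(0, scale) distribution.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Answer a COUNT query with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical patient records; only the "age" field is used here.
    patients = [{"age": a} for a in (25, 40, 61, 70, 33)]
    noisy = private_count(patients, lambda p: p["age"] >= 40, epsilon=0.5)
    print(noisy)
```

Smaller values of epsilon give stronger privacy and noisier answers. Composing many such noisy answers across a long-running, distributed analytics task is exactly where the privacy budget becomes difficult to manage, which is the integration challenge discussed above.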

In the following, we provide some promising suggestions for future works in the field of privacy-preserving distributed big data management and analytics:

sophisticated recommendation analytics for big data management and analytics that guarantee privacy in a distributed environment;
identification of case studies for privacy-preserving big data management and analytics in distributed environments (e.g., IoT, social networks, Cloud storage, etc.);
specification of the big data management and analytics tools/processes to be handled in distributed environments (e.g., OLAP, data publishing, tensor-based analytics, long-running big data analytics, etc.);
development of novel tools for privacy-preserving big data management and analytics in distributed environments, based, for example, on differential privacy theory;
design, implementation, and testing of privacy-preserving big data management and analytics algorithms in a distributed environment;
creation and deployment of benchmark case studies to comprehensively assess privacy-preserving and secure big data management and analytics in distributed environments;
creation and implementation of tuning solutions for predefined case studies.

4.3. Distributed Uncertain and Imprecise Big Data Management and Analytics

One of the main issues with big data is that they are often uncertain and imprecise (e.g., [78,79]). In sensor networks, which are among the most popular sources of big data (e.g., [80,81]), sensor-generated big stream data (e.g., [82,83]) are inherently uncertain and imprecise, as happens, for instance, with environmental sensor networks. In fact, there is usually some uncertainty in the monitored parameters, such as temperature, pressure, humidity, and so forth.

This clear evidence has led to the recent emergence of approximate big data management and analytics models, methodologies, and approaches (e.g., [84,85]) as one of the top issues in big data research. Basically, approximation techniques and algorithms are devised and built to support big data management and analytics over (uncertain and imprecise) big data repositories. Big data that are imprecise and uncertain are usually modeled using probabilistic data models (e.g., [86]). The aim of these techniques and algorithms is to support the basic procedures on which more sophisticated approximate big data management and analytics techniques are built. Among the most intriguing proposals, we identify approximate query answering algorithms (e.g., [87]), approximate search (e.g., [88]), approximation paradigms for supporting machine learning (e.g., [89]) and clustering (e.g., [90]) tasks, and so forth. Social networks (e.g., [91,92]) offer a perfect example of a case study in which all of these techniques are applicable.
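
One simple way to picture a probabilistic data model is tuple-level uncertainty, where every tuple carries the probability that it is valid. The sketch below, which assumes independent tuples and uses invented sensor readings and hypothetical names (UncertainReading, expected_count, expected_sum, count_variance), computes expected aggregates under that assumption. It is not the specific model of the cited works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainReading:
    """A sensor reading that is valid only with probability `prob`."""
    sensor: str
    value: float
    prob: float


# Hypothetical temperature readings with validity probabilities.
READINGS = [
    UncertainReading("t1", 20.0, 0.9),
    UncertainReading("t2", 22.0, 0.5),
    UncertainReading("t3", 35.0, 0.1),
]


def expected_count(readings):
    """Expected number of valid readings: the sum of the probabilities."""
    return sum(r.prob for r in readings)


def expected_sum(readings):
    """Expected SUM of the values, weighting each value by its probability."""
    return sum(r.prob * r.value for r in readings)


def count_variance(readings):
    """Variance of COUNT, assuming readings are mutually independent."""
    return sum(r.prob * (1.0 - r.prob) for r in readings)


if __name__ == "__main__":
    print(expected_count(READINGS), expected_sum(READINGS))
```

The variance gives a rough measure of how much the approximate answer can be trusted. Correlated tuples would require a richer model than the independence assumption made here.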

When all of these topics are combined, approximate big data management and analytics techniques and algorithms over imprecise and ambiguous big data repositories in distributed environments become one of the core contexts for next-generation big data research, which effectively merges significant technological advancements with solid theoretical frameworks.

In the active literature, there are many proposals focusing on approximate paradigms that support big data management and analytics over uncertain and imprecise big data repositories in distributed environments; some of these proposals, like [80,93,94], serve as important milestones for the scientific field, but problems and challenges persist. This research line aims to extend present achievements and solve these issues and difficulties by covering theoretical and technological gaps and enhancing the state of the art.

The challenge of computing OLAP aggregates over imprecise and uncertain big data streams (e.g., [95]) is a well-established problem in this regard, as these aggregates form the foundation of widely used big data analytics tools (e.g., [58]). With sampling-based techniques that have previously been shown to be successful in approximate query processing, this paradigm readily extends to the more general challenge of providing scalable joins via flexible sample synopses (e.g., [96]). Furthermore, these ideas can be easily extended to more demanding scenarios, such as in [97], where sophisticated machine learning tasks leverage approximate computations over imprecise and uncertain big data sources as baseline operations. However, query processing over uncertain big data also includes other types of queries that are worth exploring, such as top-k queries (e.g., [98]), RDF queries (e.g., [99]), pattern matching on graphs (e.g., [100]), and so on. It should be highlighted that, as with big data in general, these concerns become especially interesting when applied to distributed environments. Here, it is necessary to genuinely extend classical approaches, which were primarily developed for centralized systems, to these new settings while maintaining a high level of creativity and innovation and taking into account the particular needs of distributed environments.
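
The sampling-based idea can be sketched with reservoir sampling, a classical technique for keeping a fixed-size uniform sample of a stream of unknown length, followed by a scaled-up estimate of a SUM aggregate. This is a generic illustration rather than the method of any cited work. The function names and the choice of SUM as the aggregate are assumptions.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of at most k items from a stream.

    Returns the sample together with the number of items seen.
    """
    reservoir = []
    seen = 0
    for item in stream:
        seen += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir, seen


def approximate_sum(stream, k, rng=None):
    """Estimate SUM over the stream as (items seen) * (sample mean)."""
    if rng is None:
        rng = random.Random()
    sample, seen = reservoir_sample(stream, k, rng)
    if not sample:
        return 0.0
    return seen * sum(sample) / len(sample)


if __name__ == "__main__":
    readings = (float(i % 50) for i in range(100_000))
    print(approximate_sum(readings, k=1_000))
```

The sample size k trades accuracy for memory. In a distributed setting, each node could keep its own reservoir, with the per-node estimates combined afterwards.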

The following reports some potential directions for future research in the context of approximation methods and algorithms for big data management and analytics tasks over big data repositories in distributed environments:

evaluation of state-of-the-art proposals in the context of approximate paradigms for supporting big data management and analytics in distributed environments;
development of target uncertain and imprecise big data management and analytics scenarios in distributed environments for usage as case studies (e.g., Internet of Things, social networks, Cloud storage, and so forth);
definition of target big data management and analytics tools and processes in distributed environments to be addressed, which are characterized by uncertainty and imprecision (e.g., OLAP over streaming big data, sensor networks, social networks, etc.);
establishment of novel approximate big data analytics and management tools over imprecise and uncertain big data repositories in distributed environments, such as those based on probability theory;
creation, execution, and evaluation of approximation-based big data management and analytics algorithms for uncertain and imprecise big data repositories in distributed environments;
establishment and execution of benchmark case studies aimed at comprehensively evaluating approximate big data management and analytics techniques over uncertain and imprecise big data repositories in distributed environments.

5. A Practical Implementation: The CORE-BCD-mAI Framework

In this Section, we introduce a practical implementation of SeDaSOMA, the CORE-BCD-mAI framework, namely “A COmposite Framework for REpresenting, Querying, and Analyzing Big Clinical Data by means of multidimensional AI Tools”. The proposed implemented framework considers the definition of innovative models, methodologies, techniques and algorithms, along with their experimental evaluation. These innovations are oriented to support the following: (i) representation; (ii) querying; (iii) analytics over big clinical data (e.g., [101]). This is achieved by exploiting meaningful multidimensional AI (Artificial Intelligence) tools such as the CORE-BCD-mAI framework. As will be made clear throughout this Section, CORE-BCD-mAI is a true realization of SeDaSOMA, which focuses on a specific case study (i.e., big clinical data).

To reach this ambitious goal, CORE-BCD-mAI introduces two main concepts: (i) big clinical data; (ii) multidimensional AI tools.

Clinical data (i.e., medical tests, patient records, HRV (Heart Rate Variability)/ECG (ElectroCardioGram) data, omics, clinical trial data, etc.) is increasingly becoming a type of big data repository due to its intrinsic characteristics of volume, streaming, and diversity. This is reflected in the term big clinical data (see, for example, [102]). Among other characteristics, big clinical data is highly integrated with external data sources (e.g., census, social data, pharmacovigilance data, etc.), which further reinforces its (strong) variety. The latter characteristic opens the door to multi-level analytics methodologies that seek to analyze (big) clinical data at different levels of granularity (i.e., hospital level, regional level, national level) to support epidemiological studies oriented to health policy decision making.

The term “multidimensional AI tools” refers to a set of models, methods and analytics tools that are designed to support big data representation, management and analytics based on well-researched multidimensional frameworks. The most prominent multidimensional model among them is OLAP (OnLine Analytical Processing) (e.g., [16,103]). OLAP enables us to visualize and analyze data using apt multidimensional metaphors, such as dimensions, measures, levels, hierarchies, etc., by constructing highly intuitive multidimensional environments with analysis capabilities that exceed those of traditional SQL (Structured Query Language)-based analysis tools. In recent years, the research community has paid a lot of attention to the challenge of managing and analyzing big data through these paradigms (e.g., [13]).
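
The OLAP notions of dimension, level, hierarchy, and measure can be illustrated with a roll-up along the hospital, region, nation hierarchy mentioned earlier for multi-level clinical analytics. The fact table, its values, and the function name roll_up below are hypothetical and serve only as a minimal sketch of the operation.

```python
from collections import defaultdict

# Hypothetical fact table: one row per hospital, with the levels of a
# location hierarchy (hospital -> region -> nation) and one measure.
FACTS = [
    {"hospital": "H1", "region": "Calabria", "nation": "Italy", "admissions": 120},
    {"hospital": "H2", "region": "Calabria", "nation": "Italy", "admissions": 80},
    {"hospital": "H3", "region": "Emilia-Romagna", "nation": "Italy", "admissions": 200},
    {"hospital": "H4", "region": "Ile-de-France", "nation": "France", "admissions": 150},
]

# Levels of the location hierarchy, from finest to coarsest.
LEVELS = ("hospital", "region", "nation")


def roll_up(facts, level, measure="admissions"):
    """Aggregate the measure at the requested level of the hierarchy."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[level]] += fact[measure]
    return dict(totals)


if __name__ == "__main__":
    for level in LEVELS:
        print(level, roll_up(FACTS, level))
```

Moving from the hospital level to the region and nation levels corresponds to the roll-up operator. The inverse navigation, from coarse to fine, is the drill-down.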

Therefore, it is important to acknowledge that the representation, querying, and analysis of big clinical data, such as patient records, HRV measurements and clinical trial data, through big data tools, and more specifically, multidimensional artificial intelligence tools, is a new research challenge with important results for society.

It is no secret that ICT (Information and Communication Technologies) can play a significant role in lowering the costs of national health care systems, as well as in providing the necessary guidance for shaping future health care policies (see, for example, [104]).

Today, there is a lot of focus on how to effectively and efficiently manage and analyze large amounts of clinical data (e.g., [105]), especially because it has a significant impact on national health systems (and hence on society). In fact, there are several (large) national projects within this sector, such as in the USA (for example, [106]), Europe (for example, [107]) and South Korea (for example, [108]). However, the reality is that most of today’s Healthcare Information Systems (HIS) are still managed as (traditional) legacy IT systems, without taking advantage of the opportunities that big data technologies offer. In a sense, the primary objective of CORE-BCD-mAI is to fill this gap.

While big data technologies have been widely adopted, CORE-BCD-mAI exhibits a high degree of originality, in particular because of the adaptation of multidimensional AI tools (which breaks with the tradition of classical AI tools). On the other hand, the potential of CORE-BCD-mAI is high because the representation, query and analysis methodologies implemented within the framework can be extended to other classes of data (e.g., social data or government data, just to name two well-known examples). CORE-BCD-mAI’s methodology focuses on designing and experimentally testing innovative AI models over synthetic and real-life big multidimensional clinical data sets.

Methodologies, Methods and Main Functionalities of CORE-BCD-mAI

The primary research claim of CORE-BCD-mAI is that big data techniques, and more specifically multidimensional AI tools, are ideal for managing, processing, and supporting analytics over big clinical data. This claim has also been the subject of other recent proposals in this area. The first step in achieving this goal, and the central idea behind CORE-BCD-mAI, is the development of an appropriate multidimensional data model. To this end, CORE-BCD-mAI presents the big multidimensional NoSQL data model, a particular data model that builds multidimensional structures (such as OLAP data cubes [103]) on top of NoSQL data, as the latter are acknowledged as the best option for representing big clinical data (e.g., [109]).
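
To suggest what building multidimensional structures on top of NoSQL data might look like, the sketch below computes every group-by (cuboid) of a small data cube directly from schema-less, JSON-like documents. It is an assumption-laden illustration and not the CORE-BCD-mAI data model itself: the documents, the dimension names, the "unknown" placeholder for missing fields, and the function build_cube are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical clinical documents, as they might be stored in a
# document-oriented NoSQL store; fields may be missing.
DOCS = [
    {"ward": "cardiology", "year": 2023, "sex": "F", "tests": 3},
    {"ward": "cardiology", "year": 2024, "sex": "M", "tests": 5},
    {"ward": "oncology", "year": 2024, "tests": 2},  # "sex" is absent
]


def build_cube(docs, dimensions, measure):
    """Compute SUM(measure) for every subset of the given dimensions.

    The result maps each tuple of dimension names (a cuboid) to a
    dictionary from cell coordinates to the aggregated measure.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cells = defaultdict(int)
            for doc in docs:
                # Missing fields are mapped to an explicit placeholder.
                key = tuple(doc.get(d, "unknown") for d in dims)
                cells[key] += doc.get(measure, 0)
            cube[dims] = dict(cells)
    return cube


if __name__ == "__main__":
    cube = build_cube(DOCS, ("ward", "year", "sex"), "tests")
    print(cube[("ward",)])
    print(cube[()])
```

With n dimensions, this naive approach materializes 2^n cuboids, each with a full scan of the documents, so it is only practical for small n. Scalable cube computation over big data is a research problem in its own right.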

Classification, clustering, frequent itemset mining, association rule mining, and other classical data mining tools can all be integrated with big data techniques. However, in order to benefit from rich 3V (e.g., [110]) data availability and the efficiency of consolidated mining approaches, these techniques must be parallelizable. Given the effectiveness and broad adoption of big data processing, administration, and mining tools and systems now in use, we can thus confidently define big data approaches as an enabling technology for the aims of CORE-BCD-mAI.
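
The parallelizability requirement can be illustrated with a small sketch of support counting, the core step of frequent itemset mining. Each data partition is counted independently and the partial counts are merged by addition. The function names, the item labels and the use of a thread pool are assumptions made for illustration only; a real deployment would distribute the partitions over a cluster rather than over local threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def local_pair_counts(transactions):
    """Count co-occurring item pairs within one partition."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so that each pair has one canonical form.
        counts.update(combinations(sorted(set(items)), 2))
    return counts


def frequent_pairs(partitions, min_support):
    """Merge per-partition counts and keep pairs meeting min_support."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_pair_counts, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return {pair: n for pair, n in total.items() if n >= min_support}


# Two hypothetical partitions of patient-level "transactions".
partitions = [
    [["ecg", "hrv", "omics"], ["ecg", "hrv"]],
    [["ecg", "omics"], ["hrv", "ecg"]],
]
result = frequent_pairs(partitions, min_support=2)
```

The step parallelizes because the partial counts are combined with an associative and commutative operation, so the order in which partitions are processed does not affect the result.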

The primary goal of CORE-BCD-mAI is to develop models, methodologies, techniques, and algorithms based on multidimensional AI tools that support advanced big data analytics (including emerging big data visualization methodologies, e.g., [111,112]) for decision-making purposes. The focus is on the particular case of big clinical data, i.e., HVR measurements and statistics, ECG traces, patient record information, omics data, clinical guidelines, and so forth (including external data sources). This enables us to develop, execute, and experimentally evaluate tailored big data solutions that support CORE-BCD-mAI objectives, and to communicate novel best practices that can be taken into account in further research.

To support the different stages of the previously mentioned data, information, and decisional processing flows, CORE-BCD-mAI also aims to define and implement a big multidimensional clinical data framework, known as the CORE-BCD-mAI framework. To achieve this, CORE-BCD-mAI uses well-known open-source big data processing technologies. This allows the framework to be used as widely as possible in both academic and industrial research communities, and maximizes the benefits of interoperability paradigms over freely available open-source data representation, management, and mining tools.

In Figure 3, the various stages of the research activities are emphasized, including the following: (i) Big Multidimensional Clinical Data Representation; (ii) Big Multidimensional Clinical Data Querying; (iii) Big Multidimensional Clinical Data Analytics; (iv) Big Multidimensional Clinical Data Visualization. This figure presents the overall data, information, and workflows of CORE-BCD-mAI and their interconnections.

Regarding implementation details, the proposed framework, which is fully based on SeDaSOMA (see Section 3.2), is built upon well-known Cloud Computing architectures (e.g., [113,114]), in which Hadoop (e.g., [115]) integrates the various components (namely, big multidimensional clinical data representation, querying, analytics, and visualization). These components are mostly based on MapReduce (e.g., [116]), a widely used big data processing paradigm. This strategy builds on successful cases in which the Cloud Computing paradigm has already demonstrated its efficacy and efficiency in handling enormous volumes of big data.
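
For readers less familiar with the MapReduce paradigm [116], the following pure-Python sketch mimics its three steps (map, shuffle, reduce) on a toy aggregation. It does not use the Hadoop API, and the record fields (`ward`, `year`, `hr`) and function names are hypothetical; it only illustrates how a multidimensional aggregate can be expressed as key-value processing.

```python
from collections import defaultdict


def map_phase(record):
    """Emit ((ward, year), (sum, count)) for one input record."""
    yield (record["ward"], record["year"]), (record["hr"], 1)


def shuffle(pairs):
    """Group the emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the partial (sum, count) pairs into a mean per key."""
    total = sum(v[0] for v in values)
    count = sum(v[1] for v in values)
    return key, total / count


def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


records = [
    {"ward": "cardiology", "year": 2023, "hr": 70},
    {"ward": "cardiology", "year": 2023, "hr": 80},
    {"ward": "neurology", "year": 2023, "hr": 60},
]
means = run_job(records)
```

Emitting (sum, count) pairs instead of raw values is a deliberate choice: the pairs can be pre-aggregated on each node before the shuffle, which a plain average would not allow.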

Furthermore, the advantages of CORE-BCD-mAI over traditional data analytics tools for clinical data (e.g., [117]) should be highlighted. Unlike such tools, CORE-BCD-mAI enables us to combine several types of data (e.g., clinical, patient, HVR/ECG, omics, and clinical trial data) into a single integrated big data view. When paired with multidimensional modeling, this feature can enhance the accuracy of the entire big data analytics process.

In contrast to traditional data warehousing approaches, the different types of data sources are kept apart in the big data layer of CORE-BCD-mAI. This allows for the development of specialized multidimensional artificial intelligence (mAI) tools for specific types of original datasets, supporting correlation analysis and advanced knowledge discovery, possibly in comparison with knowledge findings derived from the integrated big data view (e.g., [118]).
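
The relationship between separately kept sources and the integrated big data view can be sketched as follows. The sources stay untouched as independent collections, and the integrated view is derived from them on demand. The source names, the `patient_id` join key and the field values are assumptions made purely for illustration.

```python
def integrated_view(sources):
    """Derive a per-patient view from separately stored sources.

    Fields are prefixed with the source name so that their origin stays
    visible. If a source holds several records for one patient, the last
    one wins in this simplified sketch.
    """
    view = {}
    for source_name, records in sources.items():
        for rec in records:
            patient = view.setdefault(rec["patient_id"], {})
            for field, value in rec.items():
                if field != "patient_id":
                    patient[f"{source_name}.{field}"] = value
    return view


# Hypothetical sources, each of which could feed its own specialized tool.
sources = {
    "clinical": [{"patient_id": "p1", "diagnosis": "AF"}],
    "ecg": [
        {"patient_id": "p1", "mean_hr": 88},
        {"patient_id": "p2", "mean_hr": 61},
    ],
}
view = integrated_view(sources)
```

Because the view is computed rather than stored, findings obtained on a single source can be compared with findings obtained on the integrated view without duplicating the original data.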

This Section has emphasized that CORE-BCD-mAI incorporates multidimensional modeling and analytic methodologies at every system tier, not just in the final stage. This is the proposal’s most significant contribution. Multidimensional modeling and analytic techniques have previously demonstrated their value in various contexts (e.g., [83]).

6. Conclusions and Future Work

This paper has focused on emerging trends in big data and introduced the main architecture of SeDaSOMA, a layered framework that supports Serendipitous, Data-as-a-Service-oriented, Open big data Management and Analytics. Our proposed framework is capable of supporting advanced big data management and analytics on different “hot” Cloud-aware application cases such as smart cities, social networks, sensor networks, intelligent transportation systems, etc. We further extended our contributions by providing some examples of Cloud-aware big data vertical applications of SeDaSOMA in specific scenarios that attract a great deal of interest. In addition, we have provided a critical overview of some emerging topics of big data research that are close to our work, highlighting the current state of the art and future research efforts in the field. Finally, we have introduced a practical implementation of SeDaSOMA that focuses on big clinical data.

Our proposal is thus centered around the novel and emerging topic of open big data, which is an exciting paradigm for future years. Mostly, our proposal is concerned with paradigm aspects along with a reference framework and its architectural realization. We believe that these are relevant innovations in the field.

Future work is mainly oriented toward several directions: (i) design and implementation of a wide and comprehensive experimental campaign for assessing the performance of systems and tools developed according to the SeDaSOMA principles; (ii) improving the overall proposal through formal methods analysis, thus moving toward a more rigorous realization; (iii) exploiting SeDaSOMA as the foundational basis of future real-life research projects in order to increase the overall impact of our proposal. In addition to these practical goals, and looking at methodological perspectives, we aim at integrating our proposed framework with special big data features (e.g., [119]) such as the following: (i) visualization metaphors (e.g., [120]); (ii) flexible methodologies (e.g., [121,122,123]); (iii) uncertainty management paradigms (e.g., [99,124,125]); (iv) intelligent data exchange approaches (e.g., [68,126,127]); (v) deep learning (e.g., [37]); (vi) federated learning (e.g., [128,129,130]).

Author Contributions

Conceptualization, A.C.; methodology, A.C. and P.C.; validation, P.C.; investigation, A.C. and P.C.; resources, A.C.; writing—original draft preparation, A.C.; writing—review and editing, A.C. and P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created.

Acknowledgments

This research is supported by the ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing within the NextGenerationEU program (Project Code: PNRR CN00000013).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, J.; Chen, Y.; Du, X.; Li, C.; Lu, J.; Zhao, S.; Zhou, X. Big Data Challenge: A Data Management Perspective. Front. Comput. Sci. 2013, 7, 157–164. [Google Scholar] [CrossRef]
Russom, P. Big Data Analytics. TDWI Best Pract. Rep. 2011, 19, 1–34. [Google Scholar]
Hashem, I.A.T.; Chang, V.; Anuar, N.B.; Adewole, K.S.; Yaqoob, I.; Gani, A.; Ahmed, E.; Chiroma, H. The Role of Big Data in Smart City. Int. J. Inf. Manag. 2016, 36, 748–758. [Google Scholar] [CrossRef]
Tan, W.; Blake, M.B.; Saleh, I.; Dustdar, S. Social-Network-Sourced Big Data Analytics. IEEE Internet Comput. 2013, 17, 62–69. [Google Scholar] [CrossRef]
Bonifati, A.; Cuzzocrea, A. Storing and Retrieving XPath Fragments in Structured P2P Networks. Data Knowl. Eng. 2006, 59, 247–269. [Google Scholar] [CrossRef]
Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big Data Analytics in Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2019, 20, 383–398. [Google Scholar] [CrossRef]
Baqleh, L.A.; Alateeq, M.M. The Impact of Supply Chain Management Practices on Competitive Advantage: The Moderating Role of Big Data Analytics. Int. J. Prof. Bus. Rev. 2023, 8, 3. [Google Scholar] [CrossRef]
Zhou, Y. Integrated Development of Industrial and Regional Economy using Big Data Technology. Comput. Electr. Eng. 2023, 109, 108764. [Google Scholar] [CrossRef]
Cuzzocrea, A. Approximate OLAP Query Processing over Uncertain and Imprecise Multidimensional Data Streams. In Proceedings of the 24th International Conference on Database and Expert Systems Applications, DEXA 2013, Prague, Czech Republic, 26–29 August 2013. [Google Scholar]
Cuzzocrea, A.; Serafino, P. LCS-Hist: Taming Massive High-dimensional Data Cube Compression. In Proceedings of the 12th International Conference on Extending Database Technology, EDBT 2009, Saint Petersburg, Russia, 24–26 March 2009. [Google Scholar]
Ceci, M.; Cuzzocrea, A.; Malerba, D. Effectively and Efficiently Supporting Roll-up and Drill-down OLAP Operations over Continuous Dimensions via Hierarchical Clustering. J. Intell. Inf. Syst. 2015, 44, 309–333. [Google Scholar] [CrossRef]
Cuzzocrea, A. OLAP Intelligence: Meaningfully Coupling OLAP and Data Mining Tools and Algorithms. Int. J. Bus. Intell. Data Min. 2009, 4, 213–218. [Google Scholar]
Cuzzocrea, A. Scalable OLAP-based Big Data Analytics over Cloud Infrastructures: Models, Issues, Algorithms. In Proceedings of the 2017 International Conference on Cloud and Big Data Computing, ICCBDC 2017, London, UK, 17–19 September 2017. [Google Scholar]
Han, J. OLAP Mining: Integration of OLAP with Data Mining. In Proceedings of the 7th Conference on Database Semantics, DS-7, Leysin, Switzerland, 7–10 October 1997. [Google Scholar]
Adadi, A. A Survey on Data-Efficient Algorithms in Big Data Era. J. Big Data 2021, 8, 24. [Google Scholar] [CrossRef]
Chaudhuri, S.; Dayal, U. An Overview of Data Warehousing and OLAP Technology. SIGMOD Rec. 1997, 26, 65–74. [Google Scholar] [CrossRef]
Aidala, C.A.; Burr, C.; Cattaneo, M.; Fitzgerald, D.S.; Morris, A.; Neubert, S.; Tropmann, D. Ntuple Wizard: An Application to Access Large-Scale Open Data from LHCb. Comput. Softw. Big Sci. 2023, 7, 6. [Google Scholar] [CrossRef]
Coronato, A.; Cuzzocrea, A. An Innovative Risk Assessment Methodology for Medical Information Systems. IEEE Trans. Knowl. Data Eng. 2022, 34, 3095–3110. [Google Scholar] [CrossRef]
Khalil, M.; Esseghir, M.; Merghem-Boulahia, L. Privacy-Preserving Federated Learning: An Application for Big Data Load Forecast in Buildings. Comput. Secur. 2023, 131, 103211. [Google Scholar] [CrossRef]
Zheng, Z.; Zhu, J.; Lyu, M.R. Service-Generated Big Data and Big Data-as-a-Service: An Overview. In Proceedings of the IEEE International Congress on Big Data, BigData Congress 2013, Santa Clara, CA, USA, 27 June–2 July 2013. [Google Scholar]
Fahmideh, M.; Beydoun, G. Big Data Analytics Architecture Design—An Application in Manufacturing Systems. Comput. Ind. Eng. 2019, 128, 948–963. [Google Scholar] [CrossRef]
European Commission. Horizon Europe–The EU Framework Programme for Research and Innovation; European Commission: Brussels, Belgium, 2022; Available online: https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-europe_en (accessed on 1 April 2023).
Cuzzocrea, A.; Ciancarini, P. SeDaSOMA: A Framework for Supporting Serendipitous, Data-As-A-Service-Oriented, Open Big Data Management and Analytics. In Proceedings of the 5th International Conference on Cloud and Big Data Computing, ICCBDC 2021, Liverpool, UK, 13–15 August 2021. [Google Scholar]
Cuzzocrea, A. Advanced, Privacy-Preserving and Approximate Big Data Management and Analytics in Distributed Environments: What is Now and What is Next. In Proceedings of the 44th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2020, Madrid, Spain, 13–17 July 2020. [Google Scholar]
Cuzzocrea, A.; Bringas, P.G. CORE-BCD-mAI: A Composite Framework for Representing, Querying, and Analyzing Big Clinical Data by Means of Multidimensional AI Tools. In Proceedings of the 17th International Conference on Hybrid Artificial Intelligent Systems, HAIS 2022, Salamanca, Spain, 5–7 September 2022. [Google Scholar]
Pavlopoulou, C.; Carey, M.J.; Tsotras, V.J. Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems. SIGMOD Rec. 2023, 52, 104–113. [Google Scholar] [CrossRef]
Siddiqa, A.; Hashem, I.A.T.; Yaqoob, I.; Marjani, M.; Shamshirband, S.; Gani, A.; Nasaruddin, F. A Survey of Big Data Management: Taxonomy and State-of-the-art. J. Netw. Comput. Appl. 2016, 71, 151–166. [Google Scholar] [CrossRef]
Mikalef, P.; Boura, M.; Lekakos, G.; Krogstie, J. Big Data Analytics and Firm Performance: Findings from a Mixed-Method Approach. J. Bus. Res. 2019, 98, 261–276. [Google Scholar] [CrossRef]
Woodside, A.G. Embrace• Perform• Model: Complexity Theory, Contrarian Case Analysis, and Multiple Realities. J. Bus. Res. 2014, 67, 2495–2503. [Google Scholar] [CrossRef]
Ranjan, J.; Foropon, C. Big Data Analytics in Building the Competitive Intelligence of Organizations. Int. J. Inf. Manag. 2021, 56, 102231. [Google Scholar] [CrossRef]
Wang, Y.; Wei, J.; Srivatsa, M.; Duan, Y.; Du, W. IntegrityMR: Integrity Assurance Framework for Big Data Analytics and Management Applications. In Proceedings of the 2013 IEEE International Conference on Big Data, BigData 2013, Santa Clara, CA, USA, 6–9 October 2013. [Google Scholar]
Fiore, S.; Palazzo, C.; D’Anca, A.; Foster, I.T.; Williams, D.N.; Aloisio, G. A Big Data Analytics Framework for Scientific Data Management. In Proceedings of the 2013 IEEE International Conference on Big Data, BigData 2013, Santa Clara, CA, USA, 6–9 October 2013. [Google Scholar]
Puthal, D.; Nepal, S.; Ranjan, R.; Chen, J. A Secure Big Data Stream Analytics Framework for Disaster Management on the Cloud. In Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications; 14th IEEE International Conference on Smart City; 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016, Sydney, Australia, 12–14 December 2016. [Google Scholar]
Abdullah, M.F.; Ibrahim, M.; Zulkifli, H. Big Data Analytics Framework for Natural Disaster Management in Malaysia. In Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security, IoTBDS 2017, Porto, Portugal, 24–26 April 2017. [Google Scholar]
Terrazas, G.; Ferry, N.; Ratchev, S.M. A Cloud-based Framework for Shop Floor Big Data Management and Elastic Computing Analytics. Comput. Ind. 2019, 109, 204–214. [Google Scholar] [CrossRef]
Jindal, A.; Kumar, N.; Singh, M. A Unified Framework for Big Data Acquisition, Storage, and Analytics for Demand Response Management in Smart Cities. Future Gener. Comput. Syst. 2020, 108, 921–934. [Google Scholar] [CrossRef]
Almagrabi, A.O.; Ali, R.; Alghazzawi, D.M.; Albarakati, A.; Khurshaid, T. A Reinforcement Learning-Based Framework for Crowdsourcing in Massive Health Care Internet of Things. Big Data 2022, 10, 161–170. [Google Scholar] [CrossRef]
Mehmood, E.; Anees, T. Distributed Real-Time ETL Architecture for Unstructured Big Data. Knowl. Inf. Syst. 2022, 64, 3419–3445. [Google Scholar] [CrossRef]
Miltiadou, D.; Pitsios, S.; Spyropoulos, D.; Alexandrou, D.; Lampathaki, F.; Messina, D.; Perakis, K. A Big Data Intelligence Marketplace and Secure Analytics Experimentation Platform for the Aviation Industry. In Proceedings of the 10th EAI International Conference on Big Data Technologies and Applications and 13th EAI International Conference on Wireless Internet, BDTA/WiCON 2020, Virtual Event, 11 December 2020. [Google Scholar]
Dinh, L.T.N.; Karmakar, G.C.; Kamruzzaman, J. A Survey on Context Awareness in Big Data Analytics for Business Applications. Knowl. Inf. Syst. 2020, 62, 3387–3415. [Google Scholar] [CrossRef]
Doherty, A.J.; Murphy, R.; Schieweck, A.; Clancy, S.; Breathnach, C.; Margaria, T. CensusIRL: Historical Census Data Preparation with MDD Support. In Proceedings of the 2022 IEEE International Conference on Big Data, BigData 2022, Osaka, Japan, 17–20 December 2022. [Google Scholar]
Zhang, H.; Chen, G.; Ooi, B.C.; Tan, K.-L.; Zhang, M. In-Memory Big Data Management and Processing: A Survey. IEEE Trans. Knowl. Data Eng. 2015, 27, 1920–1948. [Google Scholar] [CrossRef]
Buyle, R.; Taelman, R.; Mostaert, K.; Joris, G.; Mannens, E.; Verborgh, R.; Berners-Lee, T. Streamlining Governmental Processes by Putting Citizens in Control of their Personal Data. In Proceedings of the 6th International Conference on Electronic Governance and Open Society: Challenges in Eurasia, EGOSE 2019, St. Petersburg, Russia, 13–14 November 2019. [Google Scholar]
Cuzzocrea, A.; Damiani, E. Making the Pedigree to Your Big Data Repository: Innovative Methods, Solutions, and Algorithms for Supporting Big Data Privacy in Distributed Settings via Data-Driven Paradigms. In Proceedings of the 43rd IEEE Annual Computer Software and Applications Conference, COMPSAC 2019, Milwaukee, WI, USA, 15–19 July 2019. [Google Scholar]
Elmeiligy, M.A.; El-Desouky, A.I.; El-Ghamrawy, S.M. A Multi-Dimensional Big Data Storing System for Generated COVID-19 Large-Scale Data using Apache Spark. arXiv 2020, arXiv:2005.05036. [Google Scholar] [CrossRef]
Alaoui, S.S.; Farhaoui, Y.; Aksasse, B. Data Openness for Efficient E-Governance in the Age of Big Data. Int. J. Cloud Comput. 2021, 10, 522–532. [Google Scholar] [CrossRef]
Xiao, F.; Xie, J.; Chen, Z.; Li, F.; Chen, Z.; Liu, J.; Liu, Y. Ganos Aero: A Cloud-Native System for Big Raster Data Management and Processing. Proc. VLDB Endow. 2023, 16, 3966–3969. [Google Scholar] [CrossRef]
Mehta, N.; Pandit, A.; Shukla, S. Transforming Healthcare with Big Data Analytics and Artificial Intelligence: A Systematic Mapping Study. J. Biomed. Inform. 2019, 100, 103311. [Google Scholar] [CrossRef]
Galakatos, A.; Markovitch, M.; Binnig, C.; Fonseca, R.; Kraska, T. FITing-Tree: A Data-aware Index Structure. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD/PODS 2019, Amsterdam, The Netherlands, 30 June–5 July 2019. [Google Scholar]
Gu, J.; Watanabe, Y.H.; Mazza, W.A.; Shkapsky, A.; Yang, M.; Ding, L.; Zaniolo, C. RaSQL: Greater Power and Performance for Big Data Analytics with Recursive-Aggregate-SQL on Spark. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD/PODS 2019, Amsterdam, The Netherlands, 30 June–5 July 2019. [Google Scholar]
Xie, T.; Chandola, V.; Kennedy, O. Query Log Compression for Workload Analytics. Proc. VLDB Endow. 2018, 12, 183–196. [Google Scholar] [CrossRef]
Chatzimilioudis, G.; Cuzzocrea, A.; Gunopulos, D.; Mamoulis, N. A Novel Distributed Framework for Optimizing Query Routing Trees in Wireless Sensor Networks via Optimal Operator Placement. J. Comput. Syst. Sci. 2013, 79, 349–368. [Google Scholar] [CrossRef]
Nguyen, D.T.; Jung, J.E. Real-Time Event Detection for Online Behavioral Analysis of Big Social Data. Future Gener. Comput. Syst. 2017, 66, 137–145. [Google Scholar] [CrossRef]
Cuzzocrea, A.; Song, I.Y.; Davis, K.C. Analytics over Large-Scale Multidimensional Data: The Big Data Revolution! In Proceedings of the ACM 14th International Workshop on Data Warehousing and OLAP, DOLAP 2011, Glasgow, UK, 28 October 2011. [Google Scholar]
Han, G.; Sethu, H. Closed Walk Sampler: An Efficient Method for Estimating Eigenvalues of Large Graphs. IEEE Trans. Big Data 2020, 6, 29–42. [Google Scholar] [CrossRef]
Islam, M.M.; Razzaque, M.A.; Hassan, M.M.; Ismail, W.N.; Song, B. Mobile Cloud-Based Big Healthcare Data Processing in Smart Cities. IEEE Access 2017, 5, 11887–11899. [Google Scholar] [CrossRef]
Zhang, J.; Wu, S.; Tan, Z.; Chen, G.; Cheng, Z.; Cao, W.; Gao, Y.; Feng, X. S3: A Scalable In-memory Skip-List Index for Key-Value Store. Proc. VLDB Endow. 2019, 12, 2183–2194. [Google Scholar] [CrossRef]
Cuzzocrea, A. Aggregation and Multidimensional Analysis of Big Data for Large-Scale Scientific Applications: Models, Issues, Analytics, and Beyond. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM 2015, La Jolla, CA, USA, 29 June–1 July 2015. [Google Scholar]
Zhang, J.; Liu, Y.; Zhou, K.; Li, G.; Xiao, Z.; Cheng, B.; Xing, J.; Wang, Y.; Cheng, T.; Liu, L.; et al. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD 2019, Amsterdam, The Netherlands, 30 June–5 July 2019. [Google Scholar]
Lu, R.; Zhu, H.; Liu, X.; Liu, J.K.; Shao, J. Toward Efficient and Privacy-Preserving Computing in Big Data Era. IEEE Netw. 2014, 28, 46–50. [Google Scholar] [CrossRef]
Tran, H.-Y.; Hu, J. Privacy-Preserving Big Data Analytics A Comprehensive Survey. J. Parallel Distrib. Comput. 2019, 134, 207–218. [Google Scholar] [CrossRef]
Au, M.H.; Liang, K.; Liu, J.K.; Lu, R.; Ning, J. Privacy-Preserving Personal Data Operation on Mobile Cloud-Chances and Challenges over Advanced Persistent Threat. Future Gener. Comput. Syst. 2018, 79, 337–349. [Google Scholar] [CrossRef]
Komishani, E.G.; Abadi, M.; Deldar, F. PPTD: Preserving Personalized Privacy in Trajectory Data Publishing by Sensitive Attribute Generalization and Trajectory Local Suppression. Knowl. Based Syst. 2016, 94, 43–59. [Google Scholar] [CrossRef]
Liang, P.; Zhang, L.; Kang, L.; Ren, J. Privacy-Preserving Decentralized ABE for Secure Sharing of Personal Health Records in Cloud Storage. J. Inf. Secur. Appl. 2019, 47, 258–266. [Google Scholar] [CrossRef]
Boubiche, S.; Boubiche, D.E.; Bilami, A.; Toral-Cruz, H. Big Data Challenges and Data Aggregation Strategies in Wireless Sensor Networks. IEEE Access 2018, 6, 20558–20571. [Google Scholar] [CrossRef]
Cuzzocrea, A. Privacy-Preserving Big Data Management: The Case of OLAP. In Big Data-Algorithms, Analytics, and Applications; Chapman and Hall/CRC: Boca Raton, FL, USA, 2015; pp. 301–326. [Google Scholar]
Cuzzocrea, A.; Saccà, D. A Constraint-Based Framework for Computing Privacy Preserving OLAP Aggregations on Data Cubes. In Proceedings of the 15th East-European Conference on Advances in Databases and Information Systems, ADBIS 2011, Vienna, Austria, 20–23 September 2011. [Google Scholar]
Chen, Y.; Guo, J.; Li, C.; Ren, W. FaDe: A Blockchain-Based Fair Data Exchange Scheme for Big Data Sharing. Future Internet 2019, 11, 225. [Google Scholar] [CrossRef]
Zheng, Z.; Xie, S.; Dai, H.; Chen, X.; Wang, H. An Overview of Blockchain Technology: Architecture, Consensus, and Future Trends. In Proceedings of the 2017 IEEE International Congress on Big Data, BigData Congress 2017, Honolulu, HI, USA, 25–30 June 2017. [Google Scholar]
Tankard, C. Big Data Security. Netw. Secur. 2012, 2012, 5–8. [Google Scholar] [CrossRef]
Zakerzadeh, H.; Aggarwal, C.C.; Barker, K. Privacy-Preserving Big Data Publishing. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management, SSDBM 2015, La Jolla, CA, USA, 29 June–1 July 2015. [Google Scholar]
Cuzzocrea, A.; Bertino, E.; Saccà, D. Towards a Theory for Privacy Preserving Distributed OLAP. In Proceedings of the 2012 Joint EDBT/ICDT Workshops, EDBT/ICDT 2012, Berlin, Germany, 30 March 2012. [Google Scholar]
Dwork, C. Differential Privacy: A Survey of Results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC 2008, Xi’an, China, 25–29 April 2008. [Google Scholar]
Song, Q.; Ge, H.; Caverlee, J.; Hu, X. Tensor Completion Algorithms in Big Data Analytics. ACM Trans. Knowl. Discov. Data 2019, 13, 1–48. [Google Scholar] [CrossRef]
Qaosar, M.; Alam, K.M.R.; Li, C.; Morimoto, Y. Privacy-Preserving Top-K Dominating Queries in Distributed Multi-Party Databases. In Proceedings of the 2019 IEEE International Conference on Big Data, BigData 2019, Los Angeles, CA, USA, 9–12 December 2019. [Google Scholar]
Grolinger, K.; Higashino, W.A.; Tiwari, A.; Capretz, M.A.M. Data Management in Cloud Environments: NoSQL and NewSQL Data Stores. J. Cloud Comput. 2013, 2, 22. [Google Scholar] [CrossRef]
Wang, T.; Ding, B.; Zhou, J.; Hong, C.; Huang, Z.; Li, N.; Jha, S. Answering Multi-Dimensional Analytical Queries under Local Differential Privacy. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD/PODS 2019, Amsterdam, The Netherlands, 30 June–5 July 2019. [Google Scholar]
Braun, P.; Cuzzocrea, A.; Jiang, F.; Leung, C.K.-S.; Pazdor, A.G.M. MapReduce-Based Complex Big Data Analytics over Uncertain and Imprecise Social Networks. In Proceedings of the 19th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2017, Lyon, France, 28–31 August 2017. [Google Scholar]
Hariri, R.H.; Fredericks, E.M.; Bowers, K.M. Uncertainty in Big Data Analytics: Survey, Opportunities, and Challenges. J. Big Data 2019, 6, 44. [Google Scholar] [CrossRef]
Mouratidis, K.; Tang, B. Exact Processing of Uncertain Top-K Queries in Multi-Criteria Settings. Proc. VLDB Endow. 2018, 11, 866–879. [Google Scholar] [CrossRef]
Muzammal, M.; Gohar, M.; Rahman, A.U.; Qu, Q.; Ahmad, A.; Jeon, G. Trajectory Mining Using Uncertain Sensor Data. IEEE Access 2018, 6, 4895–4903. [Google Scholar] [CrossRef]
Cuzzocrea, A. CAMS: OLAPing Multidimensional Data Streams Efficiently. In Proceedings of the 11th International Conference on Big Data Analytics and Knowledge Discovery, DaWaK 2009, Linz, Austria, 31 August–2 September 2009. [Google Scholar]
Hershberger, J.; Shrivastava, N.; Suri, S.; Tóth, C.D. Adaptive Spatial Partitioning for Multidimensional Data Streams. Algorithmica 2006, 46, 97–117. [Google Scholar] [CrossRef][Green Version]
Feng, Y.; Zhou, Y.; Tarokh, V. Recurrent Neural Network-Assisted Adaptive Sampling for Approximate Computing. In Proceedings of the 2019 IEEE International Conference on Big Data, BigData 2019, Los Angeles, CA, USA, 9–12 December 2019. [Google Scholar]
Ma, S.; Huai, J. Approximate Computation for Big Data Analytics. ACM SIGWEB Newsl. 2021, 2021, 1–8. [Google Scholar] [CrossRef]
Pei, J. Some New Progress in Analyzing and Mining Uncertain and Probabilistic Data for Big Data Analytics. In Proceedings of the 14th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, RSFDGrC 2013, Halifax, NS, Canada, 11–14 October 2013. [Google Scholar]
Kantere, V. Approximate Queries on Big Heterogeneous Data. In Proceedings of the 2015 IEEE International Congress on Big Data, BigData Congress 2015, New York City, NY, USA, 27 June–2 July 2015. [Google Scholar]
Zhou, Z.; Zhang, H.; Li, S.; Du, X. Hermes: A Privacy-Preserving Approximate Search Framework for Big Data. IEEE Access 2018, 6, 20009–20020. [Google Scholar] [CrossRef]
Cech, P.; Lokoc, J.; Silva, Y.N. Pivot-Based Approximate k-NN Similarity Joins for Big High-Dimensional Data. Inf. Syst. 2020, 87, 101410. [Google Scholar] [CrossRef]
Salloum, S.; Wu, Y.; Huang, J.Z. A Sampling-Based System for Approximate Big Data Analysis on Computing Clusters. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, 3–7 November 2019. [Google Scholar]
Paredes, P.; Ribeiro, P.M.P. Rand-FaSE: Fast Approximate Subgraph Census. Soc. Netw. Anal. Min. 2015, 5, 17:1–17:18. [Google Scholar] [CrossRef]
Perozzi, B.; McCubbin, C.; Halbert, J.T. Scalable Graph Clustering with Parallel Approximate PageRank. Soc. Netw. Anal. Min. 2014, 4, 179. [Google Scholar] [CrossRef]
Park, Y.; Mozafari, B.; Sorenson, J.; Wang, J. VerdictDB: Universalizing Approximate Query Processing. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, Houston, TX, USA, 10–15 June 2018. [Google Scholar]
Peng, J.; Zhang, D.; Wang, J.; Pei, J. AQP++: Connecting Approximate Query Processing with Aggregate Precomputation for Interactive Analytics. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD 2018, Houston, TX, USA, 10–15 June 2018. [Google Scholar]
Zeng, K.; Agarwal, S.; Stoica, I. IOLAP: Managing Uncertainty for Efficient Incremental OLAP. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD 2016, San Francisco, CA, USA, 26 June–1 July 2016. [Google Scholar]
Yu, F.; Hou, W.-C. CS*: Approximate Query Processing on Big Data using Scalable Join Correlated Sample Synopsis. In Proceedings of the 2019 IEEE International Conference on Big Data, BigData 2019, Los Angeles, CA, USA, 9–12 December 2019. [Google Scholar]
Hasani, S.; Thirumuruganathan, S.; Asudeh, A.; Koudas, N.; Das, G. Efficient Construction of Approximate Ad-Hoc ML Models Through Materialization and Reuse. Proc. VLDB Endow. 2018, 11, 1468–1481. [Google Scholar] [CrossRef]
Xiao, G.; Li, K.; Zhou, X.; Li, K. Efficient Monochromatic and Bichromatic Probabilistic Reverse Top-K Query Processing for Uncertain Big Data. J. Comput. Syst. Sci. 2017, 89, 92–113. [Google Scholar] [CrossRef]
Benbernou, S.; Ouziri, M. Query Answering on Uncertain Big RDF Data Using Apache Spark Framework. In Proceedings of the 2018 IEEE International Conference on Big Data, BigData 2018, Seattle, WA, USA, 10–13 December 2018. [Google Scholar]
Yuan, Y.; Wang, G.; Chen, L.; Ning, B. Efficient Pattern Matching on Big Uncertain Graphs. Inf. Sci. 2016, 339, 369–394. [Google Scholar] [CrossRef]
Perez-Arriaga, M.O.; Poddar, K.A. Clinical Trials Data Management in the Big Data Era. In Proceedings of the 2020 IEEE International Congress on Big Data, BigData Congress 2020, Honolulu, HI, USA, 18–20 September 2020. [Google Scholar]
Shae, Z.-Y.; Tsai, J.J.P. A Clinical Kidney Intelligence Platform Based on Big Data, Artificial Intelligence, and Blockchain Technology. Int. J. Artif. Intell. Tools 2022, 31, 2241007. [Google Scholar] [CrossRef]
Gray, J.; Chaudhuri, S.; Bosworth, A.; Layman, A.; Reichart, D.; Venkatrao, M.; Pellow, F.; Pirahesh, H. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Min. Knowl. Discov. 1997, 1, 29–53.
Shahbaz, M.; Gao, C.-Y.; Zhai, L.; Shahzad, F.; Hu, Y. Investigating the Adoption of Big Data Analytics in Healthcare: The Moderating Role of Resistance to Change. J. Big Data 2019, 6, 6.
Chrimes, D.; Zamani, H. Using Distributed Data over HBase in Big Data Analytics Platform for Clinical Services. Comput. Math. Methods Med. 2017, 2017, 6120820.
Groves, P.; Kayyali, B.; Knott, D.; Kuiken, S.V. The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation; McKinsey Tech Rep: New York, NY, USA, 2016.
Habl, C.; Renner, A.T.; Bobek, J.; Laschkolnig, A. Study on Big Data in Public Health, Telemedicine and Healthcare; European Commission Tech Rep: Brussels, Belgium, 2016.
Nam, J.; Kwon, H.W.; Lee, H.; Ahn, E.K. National Healthcare Service and Its Big Data Analytics. Healthc. Inform. Res. 2018, 24, 247–249.
Yang, E.; Scheff, J.D.; Shen, S.C.; Farnum, M.; Sefton, J.; Lobanov, V.S.; Agrafiotis, D.K. A Late-Binding, Distributed, NoSQL Warehouse for Integrating Patient Data from Clinical Trials. Database J. Biol. Databases Curation 2019, 2019, baz032.
Laney, D. 3D Data Management: Controlling Data Volume, Velocity, and Variety; Technical Report; META Group Inc.: Stamford, CT, USA, 2001.
Barkwell, K.E.; Cuzzocrea, A.; Leung, C.K.; Ocran, A.A.; Sanderson, J.M. Big Data Visualization and Visual Analytics for Music Data Mining. In Proceedings of the 22nd International Conference Information Visualisation, IV 2018, Fisciano, Italy, 10–13 July 2018.
Keim, D.A.; Qu, H.; Ma, K.-L. Big-Data Visualization. IEEE Comput. Graph. Appl. 2013, 33, 20–21.
Armbrust, M.; Fox, A.; Griffith, R.; Joseph, A.D.; Katz, R.H.; Konwinski, A.; Lee, G.; Patterson, D.A.; Rabkin, A.; Stoica, I.; et al. A View of Cloud Computing. Commun. ACM 2010, 53, 50–58.
Buyya, R.; Yeo, C.S.; Venugopal, S.; Broberg, J.; Brandic, I. Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility. Future Gener. Comput. Syst. 2009, 25, 599–616.
White, T. Hadoop: The Definitive Guide; O’Reilly Media Inc.: Sebastopol, CA, USA, 2009.
Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 2008, 51, 107–113.
Gale, C.; Statnikov, Y.; Jawad, S.; Uthaya, S.N.; Modi, N. Neonatal Brain Injuries in England: Population-Based Incidence Derived from Routinely Recorded Clinical Data Held in the National Neonatal Research Database. ADC Fetal Neonatal Ed. 2017, 103, 301–3415.
Wu, X.; Duan, J.; Pan, Y.; Li, M. Medical Knowledge Graph: Data Sources, Construction, Reasoning, and Applications. Big Data Min. Anal. 2023, 6, 201–217.
Minatogawa, V.L.F.; Franco, M.M.V.; Rampasso, I.S.; Anholon, R.; Quadros, R.; Durán, O.; Batocchio, A. Operationalizing Business Model Innovation through Big Data Analytics for Sustainable Organizations. Sustainability 2020, 12, 277.
Sun, Y.; Xiong, H.; Yiu, S.M.; Lam, K.-Y. BitAnalysis: A Visualization System for Bitcoin Wallet Investigation. IEEE Trans. Big Data 2023, 9, 621–636.
Íñiguez, L.; Galar, M. A Scalable and Flexible Open Source Big Data Architecture for Small and Medium-Sized Enterprises. In Proceedings of the 16th International Conference on Soft Computing Models in Industrial and Environmental Applications, SOCO 2021, Bilbao, Spain, 22–24 September 2021.
Stergiou, C.L.; Psannis, K.E.; Gupta, B.B. InFeMo: Flexible Big Data Management Through a Federated Cloud System. ACM Trans. Internet Techn. 2022, 22, 1–22.
Teng, D.; Kong, J.; Wang, F. Scalable and Flexible Management of Medical Image Big Data. Distrib. Parallel Databases 2019, 37, 235–250.
Haseeb, K.; Saba, T.; Rehman, A.; Ahmed, I.; Lloret, J. Efficient Data Uncertainty Management for Health Industrial Internet of Things Using Machine Learning. Int. J. Commun. Syst. 2021, 34, 4948.
Shukla, A.K.; Muhuri, P.K. Big-data Clustering with Interval Type-2 Fuzzy Uncertainty Modeling in Gene Expression Datasets. Eng. Appl. Artif. Intell. 2019, 77, 268–282.
Koshizuka, N.; Mano, H. DATA-EX: Infrastructure for Cross-Domain Data Exchange Based on Federated Architecture. In Proceedings of the IEEE International Conference on Big Data, Big Data 2022, Osaka, Japan, 17–20 December 2022.
Li, T.; Ren, W.; Xiang, Y.; Zheng, X.; Zhu, T.; Choo, K.-K.R.; Srivastava, G. FAPS: A Fair, Autonomous and Privacy-Preserving Scheme for Big Data Exchange Based on Oblivious Transfer, Ether Cheque and Smart Contracts. Inf. Sci. 2021, 544, 469–484.
Kang, Q.; Liu, J.; Yang, S.; Xiong, H.; An, H.; Li, X.; Feng, Z.; Wang, L.; Dou, D. Quasi-Optimal Data Placement for Secure Multi-tenant Data Federation on the Cloud. In Proceedings of the 2020 IEEE International Conference on Big Data, BigData 2020, Atlanta, GA, USA, 10–13 December 2020.
Liu, J.; Zhou, X.; Mo, L.; Ji, S.; Liao, Y.; Li, Z.; Gu, Q.; Dou, D. Distributed and Deep Vertical Federated Learning with Big Data. Concurr. Comput. Pract. Exp. 2023, 35, e7697.
Nair, A.K.; Sahoo, J.; Raj, E.D. Privacy Preserving Federated Learning Framework for IoMT Based Big Data Analysis using Edge Computing. Comput. Stand. Interfaces 2023, 86, 103720.

Figure 1. The SeDaSOMA framework.

Figure 2. Big Data Management and Analytics System Architecture.

Figure 3. CORE-BCD-mAI data/information/workflow structure.

Table 1. SeDaSOMA technologies.

Layers	Technology
Big Data Source Layer	Oracle NoSQL Database, MongoDB, Neo4J, Apache Streaming
Big Data Repository and Provisioning Layer	Apache Hadoop
Big Data Analytics Layer	Apache Spark
Big Data Application Layer	Java, C#, J#, Python

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cuzzocrea, A.; Ciancarini, P. Serendipitous, Open Big Data Management and Analytics: The SeDaSOMA Framework. Modelling 2024, 5, 1173-1196. https://doi.org/10.3390/modelling5030061

