
5 November 2025

Big Data Sharing: A Comprehensive Survey

School of Software Engineering, Sun Yat-sen University, Zhuhai 519082, China

Abstract

The transformative potential of big data across various industries has been demonstrated. However, the data held by different stakeholders often lack interoperability, resulting in isolated data silos that limit the overall value. Collaborative data efforts can enhance the total value beyond the sum of individual parts. Thus, big data sharing is crucial for transitioning from isolated data silos to integrated data ecosystems, thereby maximizing the value of big data. Despite its potential, big data sharing faces numerous challenges, including data heterogeneity, the absence of pricing models, and concerns about data security. A substantial body of research has been dedicated to addressing these issues. This paper offers the first comprehensive survey that formally defines and delves into the technical details of big data sharing. Initially, we formally define big data sharing as the act by which data sharers make big data available so that sharees can find, access, and use it in agreed ways, and we differentiate it from related concepts such as open data, data exchange, and big data trading. We clarify the general procedures, benefits, requirements, and applications associated with big data sharing. Subsequently, we examine existing big data-sharing platforms, categorizing them into data-hosting centers, data aggregation centers, and decentralized solutions. We then identify the challenges in developing big data-sharing solutions and provide explanations of the existing approaches to these challenges. Finally, the survey concludes with a discussion on future research directions. This survey presents the latest developments and research in the field of big data sharing and aims to inspire further scholarly inquiry.

1. Introduction

The proliferation of the Internet of Things (IoT) [,], social media platforms [], and related technologies has led to an unprecedented surge in data generation, collection, and processing. This high-volume, high-velocity, and high-diversity data, commonly referred to as big data, represents a paradigm shift in information technology []. Beyond its scale, big data delivers measurable value across domains: it reduces operational costs, enhances efficiency, and enables data-driven decision making in industries [], commerce [], and public services []. For instance, Google’s advertising engine, powered by big data, accounted for more than 75% of the company’s total revenue in 2024.
Recently, big data sharing, referring to the act by which data sharers make big data available so that sharees can find, access, and use it in agreed ways, has been receiving extensive attention from industry and academia because the total value of shared data greatly exceeds the sum of its individual parts []. Moreover, certain tasks cannot be completed without big data from different stakeholders []. For example, accurate disease diagnosis may require a large number of hospitalized cases from around the world, whereas the records of a single hospital are far from sufficient [].
Despite the recognized benefits of large-scale data sharing, data controllers, including enterprises, healthcare institutions, and other organizations, frequently withhold their datasets. This reluctance stems from a range of technical, legal, and strategic concerns []. First, different enterprises have conflicts of interest and are unwilling to share big data’s great value with others []. Second, many types of data, e.g., electronic health records and personal bank statements, are sensitive, and sharing them raises severe privacy concerns []. Furthermore, sharing big data across countries and regions can even be illegal under the diverse legal regulations governing digital data and copyright worldwide [].
These overarching barriers give rise to a set of formidable technical challenges that hinder the maturation of big data sharing [,,]. To overcome the reluctance to share, a systematic framework must address these core problems. For instance, data from different sources often exist in heterogeneous formats that must be standardized for interoperability []. As a tradable commodity, big data requires a uniform value assessment mechanism and rational pricing models []. Most critically, robust security and privacy-preserving techniques are non-negotiable prerequisites to protect data before, during, and after the sharing process []. Effectively enabling big data sharing, therefore, hinges on systematically resolving these challenges.
Although the literature offers a variety of partial remedies, a holistic survey dedicated to big data sharing is still missing. Existing survey papers either treat big data and big data sharing in the abstract [,,] or restrict their scope to a single facet, e.g., security and privacy [,], incentives [], spatio-temporal data [,], machine-learning workflows [], health informatics [], Internet-of-Things streams [], multimedia content [], digital forensics investigations [], or social media analytics []. Likewise, domain-specific studies concentrate on high-value silos such as scholarly repositories [] or electronic health records [], while cryptographic surveys focus on isolated mechanisms, e.g., proxy re-encryption []. Consequently, current knowledge remains fragmented; a comprehensive map of the technical landscape and application demands for general big data sharing is yet to be drawn.
To this end, we conduct a comprehensive survey about big data sharing from the perspectives of definition, applications, platforms, challenges, solutions, future directions, etc. We answer the following important questions, which are of broad concern to academia and industry. First, what is big data sharing and why is it so important? Second, what are the requirements and possible system architectures for developing big data-sharing platforms? Third, what are the technical challenges and feasible solutions to deliver a big data-sharing solution? Finally, how can emerging technologies help big data sharing? The unique contributions of this paper are as follows:
  • To the best of our knowledge, this paper is the first comprehensive survey that formally defines and delves into the technical details of big data sharing.
  • We present the readers with the state-of-the-art development and research of big data sharing by articulating the definition, general workflow, and requirements and summarizing the existing popular platforms, challenging issues, and solutions.
  • We identify promising future directions, i.e., blockchain-based big data sharing and the edge as big data-sharing infrastructure, which may inspire future research.
The survey structure is depicted in Figure 1. Section 2 clarifies the concept of big data sharing: we define the term, outline a generic sharing workflow, itemize its benefits and requirements, catalogue application domains, and compare it with related notions to expose subtle but important differences. Section 3 reviews representative platforms, classifying them by architecture into data-hosting centers, data aggregation centers, and fully decentralized solutions. Section 4 distills the key technical impediments to sharing and surveys the countermeasures proposed to date. Finally, Section 5 charts open research avenues.
Figure 1. The structure of this survey, covering the concept, platforms, challenges, and solutions of big data sharing.

2. Preliminaries of Big Data Sharing

2.1. Basics of Big Data

A foundational understanding of big data is essential prior to examining its sharing paradigms. The term, in circulation since the late 1990s, gained significant traction after being featured in Communications of the ACM in 2009 []. The canonical definition of big data is encapsulated by the “3Vs” model, as depicted in Figure 2, which is a framework that has been widely adopted by industry leaders such as Gartner, IBM, and Microsoft. The first dimension, volume, describes datasets of a scale so vast that they overwhelm the capacity of conventional software to capture, manage, and process within a reasonable timeframe []. The second, velocity, signifies the rapid rate at which these data are generated and must be processed. Finally, variety refers to the heterogeneous nature of the data, which encompasses diverse formats and modalities ranging from structured tables to unstructured text and media.
Figure 2. Primary characteristics of big data.
In addition to volume, variety, and velocity, another dimension, value, has garnered significant attention from both industry and academia in recent years. A paradigmatic illustration of big data’s economic and epidemiological value is Google LLC’s now-classic 2009 influenza study []. By mining billions of anonymized search queries, the company produced real-time estimates of influenza activity that were markedly more timely than those disseminated by traditional public health surveillance centers. During influenza outbreaks, users’ query profiles deviated systematically from baseline patterns; the volume and geographic distribution of influenza-related keywords correlated strongly with subsequent laboratory-confirmed cases. Google distilled 45 query clusters whose temporal dynamics tracked the ongoing outbreak, embedded these signals in a linear regression model, and generated weekly forecasts of influenza prevalence at the U.S. state level. The results, reported in Nature [], demonstrated that passively collected digital traces can function as an early-warning system, underscoring the transformative potential of large-scale behavioral data for population health. Beyond social impact, big data holds the potential to generate revenue and reduce costs for enterprises, such as enhancing recommendation systems for e-commerce platforms and optimizing pricing strategies for airlines.
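Purely as an illustration of this modelling idea, the sketch below regresses hypothetical weekly influenza-like-illness (ILI) rates on made-up query-cluster frequencies using ordinary least squares; it is not Google's actual model or data.

```python
import numpy as np

# Hypothetical weekly data: each row is one week, each column the relative
# frequency of one flu-related query cluster (all values are made up).
query_freqs = np.array([
    [0.12, 0.08, 0.05],
    [0.15, 0.09, 0.06],
    [0.22, 0.14, 0.09],
    [0.30, 0.21, 0.15],
])
# Reported influenza-like-illness (ILI) rates for the same weeks (made up).
ili_rates = np.array([1.1, 1.4, 2.3, 3.5])

# Fit a simple linear model ili ~ X @ w by least squares (last column is a bias term).
X = np.hstack([query_freqs, np.ones((len(query_freqs), 1))])
coeffs, *_ = np.linalg.lstsq(X, ili_rates, rcond=None)

# Estimate next week's ILI rate from newly observed query frequencies.
new_week = np.array([0.35, 0.24, 0.18, 1.0])
print("Estimated ILI rate:", new_week @ coeffs)
```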

2.2. Definition of Big Data Sharing

While big data holds the potential to generate significant value for both society and enterprises, several challenges impede its development, including inefficient data representation, suboptimal analytical mechanisms, and a lack of data cooperation. This paper addresses the issue of data cooperation through the lens of big data sharing.
Formally speaking, big data sharing refers to the act by which data sharers make big data available so that sharees can find, access, and use it in agreed ways. Normally, the volume of the shared data is at least at the gigabyte level. Several related concepts, such as open data [], data exchange [,], and big data trading [], bear similarities to big data sharing but also exhibit distinct differences. The following discussion elucidates these concepts and their distinctions from big data sharing.
In its conventional usage, open data denotes the provision of research-generated and governmental datasets to the wider academic community and even the public. The scope of these datasets is typically restricted to scholarly outputs or official statistics about citizens, and the ethos is one of maximal openness. By contrast, “big data sharing” embraces data assets whose volume, heterogeneity, and economic value far exceed those of traditional research and governmental repositories. The sheer scale and sensitivity of such resources render full openness untenable; instead, they mandate high-performance infrastructures, uniform semantic representations, and security architectures capable of guaranteeing confidentiality, integrity, and controlled access. Consequently, while open data prioritizes transparency, big data sharing is defined by the imperative of secure, yet still efficient, dissemination.
Data exchange is construed in two principal senses. The first, predominant in the database literature, denotes the algorithmic transformation of an instance valid under a source schema into an instance conforming to a target schema while preserving semantic fidelity. Because this interpretation concerns schema mapping rather than dissemination, it lies outside the scope of the present study. The second construal, adopted here, characterizes data exchange as the bilateral transfer of usage rights over a dataset between autonomous parties. Under this reading, the term “exchange” presupposes symmetry: each participant simultaneously cedes and acquires identical legal–technical entitlements, namely, the prerogative to access and exploit the data in question.
Big data sharing, by contrast, is not predicated on reciprocity; its objective is the scalable discovery, secure access, and value-generating utilization of massive, heterogeneous datasets. Consequently, the two paradigms diverge in purpose: data exchange centers on the equitable reallocation of rights, whereas big data sharing emphasizes the facilitation of downstream use.
Big data trading involves the buying and selling of large datasets []. This topic has gained prominence in recent years as enterprises and commercial organizations recognize the potential to monetize valuable collected data. Concurrently, other entities, such as universities and companies, require data for purposes like research and product quality improvement. Big data trading can be viewed as a subset of big data sharing, as it is confined to commercial use, whereas big data sharing is not restricted by commercial considerations.
As summarized in Table 1, the nuanced differences between big data sharing, open data, data exchange, and big data trading can be understood across the dimensions of data types, incentives, and commerciality. Big data sharing serves as the most encompassing paradigm, which is characterized by its flexibility across all three dimensions. In contrast, open data is a significantly narrower term, which is typically confined to non-commercial, non-remunerative exchanges of scholarly and governmental data. Data exchange is distinguished by its transactional logic, which is predicated on a reciprocal transfer of usage rights rather than the unrestricted forms of reward possible in big data sharing. Finally, big data trading is conceptualized as a specialized subset of big data sharing, defined explicitly by its commercial imperative, whereas the parent concept remains agnostic to commerciality.
Table 1. Relationship and differences between big data sharing, open data, data exchange, and big data trading.

2.3. General Procedures of Big Data Sharing

The process of big data sharing can be conceptualized as a three-stage workflow: data publishing, data search, and the final act of data sharing. The process commences with data publishing, wherein data owners prepare and announce their datasets. The objective of this initial phase is not necessarily to release the raw data but rather to populate the sharing platform’s catalog with descriptive metadata, thereby informing potential users of the data’s existence and content while allowing owners to retain control and implement access policies. Subsequently, prospective users engage in data searches, querying the platform to discover datasets that meet their specific requirements []. This discovery phase is critical, as the platform’s utility is largely determined by its ability to efficiently connect its user base with valuable data. The workflow culminates in the terminal phase of data sharing, which is a transactional step initiated by an access request from a user. The data owner retains unilateral authority to approve or deny this request. Upon approval, the transaction is recorded, formally establishing the owner as the sharer and the user as the sharee.
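The three-stage workflow can be made concrete with a minimal sketch of a sharing platform that stores only descriptive metadata in its catalog, answers keyword searches against it, and records a transaction once the owner approves an access request. The class and field names below are illustrative, not drawn from any particular platform.

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Catalog entry: descriptive metadata only, never the raw data."""
    dataset_id: str
    owner: str
    title: str
    keywords: list
    access_policy: str              # e.g., "preview-only", "download-on-approval"

class SharingPlatform:
    """Minimal sketch of the publish / search / share workflow."""

    def __init__(self):
        self.catalog = {}           # dataset_id -> DatasetRecord
        self.transactions = []      # audit trail of completed shares

    def publish(self, record: DatasetRecord):
        # Data publishing: only metadata enters the catalog.
        self.catalog[record.dataset_id] = record

    def search(self, keyword: str):
        # Data search: match the keyword against catalog metadata.
        return [r for r in self.catalog.values()
                if keyword.lower() in r.title.lower() or keyword in r.keywords]

    def request_share(self, dataset_id: str, sharee: str, owner_approves: bool):
        # Data sharing: the owner retains unilateral authority to approve.
        if not owner_approves:
            return None
        record = self.catalog[dataset_id]
        self.transactions.append((record.owner, sharee, dataset_id))
        return record               # in a real system, a transfer or access grant follows
```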

2.4. Benefits of Big Data Sharing

Big data sharing generates multi-level value. Data sharees obtain access to high-volume, heterogeneous datasets that can be repurposed for research, innovation, or operational optimization. Data sharers, in turn, accrue reputational capital, expanded market visibility, and direct monetary remuneration where licensing permits. At the societal level, aggregated data assets fuel scientific discovery, evidence-based policy, and infrastructural efficiencies, thereby advancing the public good. Benefits accruing to individual sharees are deliberately omitted here, as they are contingent upon domain-specific use cases and cannot be meaningfully generalized.
The benefits for data sharers can be summarized as follows:
  • For researchers, sharing scholarly data increases the visibility of their work and can strengthen their academic reputations. Shared materials typically comprise full texts, source code, experimental tools, and evaluation datasets. Open data encourage replication and comparative studies: open-access articles record 89% more full-text downloads and 42% more PDF downloads than paywalled equivalents [], while publicly available medical datasets attract 69% more citations after controlling for journal impact, publication date, and institutional affiliation [].
  • For enterprises and public sectors, big data sharing can enhance recognition and foster ongoing collaboration. Over the past decade, national open-government portals, exemplified by data.gov.uk, data.gov, and data.gov.sg, have proliferated, furnishing citizens, firms, and researchers with standardized access to public-sector datasets. Empirical studies indicate that such transparency measures enhance institutional trust and stimulate civic engagement []. Parallel developments are evident in the private sector, where enterprises leverage big data sharing as a strategic marketing and innovation instrument. A prominent modality is the datathon: firms release curated datasets to the public and sponsor predictive-modelling contests. Kaggle, the largest online community of data scientists and machine-learning practitioners, currently hosts several thousand public datasets together with reproducible code notebooks, thereby lowering transaction costs and fostering collaborative analytics between industry and academia.
  • Monetary rewards are a clear incentive for sharers engaging in big data sharing, particularly from a commercial perspective. The daily deluge of high-value data produced by billions of low-cost devices and users constitutes a major commercial asset. Facebook, for instance, with over two billion monthly active accounts, generates approximately four petabytes of new data each day. Legal prohibitions on direct sale, imposed by privacy statutes and platform terms of service, do not diminish this asset’s worth; instead, the expected future monetization of these data underpins Facebook’s market capitalization, which surpassed USD 1.45 trillion in October 2024. The immense volume and value of data create unprecedented opportunities for monetization and new business models. In particular, dedicated trading venues such as Japan Data Exchange Inc. and Shanghai Data Exchange Corp. have emerged to facilitate the compliant, market-mediated exchange of big data rights while respecting regulatory constraints.
The benefits of big data sharing for the public good can be summarized as follows:
  • Promotion of academic integrity: Big data sharing fosters academic integrity, which is the ethical standard that mandates the avoidance of plagiarism and cheating in academic endeavors. By making scholarly data accessible, research findings become more reproducible, as others can replicate specific experiments. This transparency encourages researchers to exercise greater caution when publishing their findings, thereby creating a virtuous cycle that enhances academic integrity. More broadly, sharing big data ensures that the evidence underpinning scientific results is preserved, which is crucial for the advancement of science.
  • Incentivization for data quality management: Sharing high-velocity data (e.g., from IoT networks) facilitates real-time decision making in domains like smart cities and supply chain management. However, high-velocity data are often of low quality, and making these data publicly available creates reputational incentives for researchers to implement rigorous data management workflows and to enforce stringent quality-control procedures. Large-scale repositories invariably contain redundant records that inflate storage costs and degrade query performance. Sharers can exploit big-data reduction techniques, e.g., deduplication, stratified sampling, or lossless compression, to eliminate superfluous information while preserving analytical utility. The resulting high-quality datasets not only attract a broader user base but also lower the indirect costs (bandwidth, replication, and curation) imposed on the hosting infrastructure without eroding the intrinsic scientific value of the resource.
  • Facilitation of collaboration and innovation: Big data sharing encourages increased collaboration and connectivity among researchers, potentially leading to significant new discoveries within a field. Data serve as the bedrock of scientific progress and are typically acquired through substantial effort and publicly funded projects. However, their utility is often confined to generating scientific publications, leaving much data underutilized. Big data sharing offers a more efficient approach by enabling researchers to share resources. Furthermore, big data sharing enables societal-level insights that are impossible with smaller datasets. Examples include the Google Flu study [] and large-scale climate modelling [].
While the benefits of big data sharing are evident from the perspectives of data providers and the public good, it is important to acknowledge that there are potential drawbacks. For instance, sharing sensitive data can raise privacy concerns, and determining data ownership can become complex during the sharing process. However, these challenges are not the focus of this paper. Readers interested in exploring the potential drawbacks of big data sharing are encouraged to consult [,] for further information.

2.5. Requirements of Big Data-Sharing Solutions

While big data sharing offers numerous benefits, designing an effective solution is complex due to several critical requirements. This section identifies and elaborates on the fundamental requirements for big data-sharing solutions: security and privacy, flexible access control, reliability, and high performance.
First, security and privacy are paramount in big data sharing: without rigorous safeguards, prospective users will withhold both their data and their participation. These requirements decompose into three interrelated dimensions: data security (protection against unauthorized access or alteration), user privacy (preservation of the identity and behavior of data consumers), and data privacy (assurance that the content itself does not reveal sensitive information):
  • Data security. Big data sharing is inherently dyadic: a trustworthy solution must guarantee that no entity other than the two designated parties can read or modify the dataset. Access and alteration rights must be strictly predicated on explicit, fine-grained authorizations issued by the data sharer. Moreover, the architecture has to provide verifiable recovery mechanisms that can reconstruct both the data and the immutable sharing log in the event of corruption or malicious destruction.
  • User privacy. Within big data-sharing ecosystems, the identities of both the data sharer and sharee must be shielded from external observers; ideally, they should also remain mutually concealed. The transaction should foreground the dataset itself while rendering the participating parties provably anonymous.
  • Data privacy. Datasets such as electronic health records combine high analytic value with extreme sensitivity. To preserve privacy while enabling big data sharing, custodians must apply protective transformations, e.g., masking, generalization, and cryptographic obfuscation, before any external release.
Second, a big data-sharing platform must implement fine-grained, policy-based, flexible access control that explicitly answers three questions: (i) what is being shared (the data object); (ii) with whom it is shared (the recipient set); and (iii) how it is shared (the technical modality). Candidate modalities include the following (a minimal policy sketch is given after the list):
  • Big data preview. Under a preview regime, the sharee receives only a down-sampled or fragmentary surrogate, e.g., a textual excerpt, a low-resolution video frame, or an audio snippet, rendered through a functionality-restricted viewer. This partial disclosure permits value assessment while withholding the native dataset.
  • Search over big data. The sharer exposes a controlled query interface that accepts only pre-approved query types, e.g., keyword, range, ranked, or similarity search, while restricting the sharee to search operations. The sharer retains full authority over permissible query grammars and returned data formats, thereby prohibiting bulk retrieval or direct inspection of the underlying corpus.
  • Nearline computation. The sharee can perform operations using a combination of predefined interfaces, extending beyond search to include actions such as addition, deletion, and updates. “Nearline” indicates that computation is nearly online and quickly accessible without human intervention.
  • Big data transfer. The sharer directly transfers the data to the sharee, allowing for a wide range of operations. Post-transfer operations depend on the contractual agreement between the parties. For instance, if data ownership is not transferred, the sharee is legally prohibited from further disseminating the data.
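The following minimal sketch, referenced above, illustrates how the three policy questions and the four modalities might be encoded as a machine-checkable access-control policy; the types and names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    PREVIEW = "big data preview"          # down-sampled surrogate only
    SEARCH = "search over big data"       # pre-approved query types only
    NEARLINE = "nearline computation"     # predefined compute interfaces
    TRANSFER = "big data transfer"        # raw data handed to the sharee

@dataclass
class SharingPolicy:
    dataset_id: str                       # what is being shared
    recipients: set                       # with whom it is shared
    modality: Modality                    # how it is shared
    allowed_queries: tuple = ()           # e.g., ("keyword", "range") for SEARCH

def authorize(policy: SharingPolicy, user: str, requested: Modality) -> bool:
    """Grant access only if the user is a recipient and the modality matches."""
    return user in policy.recipients and requested == policy.modality

# Example: a policy that lets two partners run keyword and range queries only.
policy = SharingPolicy("ehr-2024", {"hospital-a", "lab-b"},
                       Modality.SEARCH, ("keyword", "range"))
print(authorize(policy, "hospital-a", Modality.SEARCH))    # True
print(authorize(policy, "hospital-a", Modality.TRANSFER))  # False
```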
Third, a big data-sharing architecture must guarantee demonstrably high reliability: any protracted outage or data loss immediately erodes trust and undermines the economic and scientific incentives for sharing. The platform should exhibit minimal risk of malfunction, as system failures could result in significant data value loss. Avoiding a single point of failure is key, and decentralization is critical to mitigating this risk. Reliability must encompass all platform functions: data publishing, data search, and data sharing. Once published, data should remain searchable and shareable under the data owner’s control. Search results must be accurate and comprehensive, and predefined rules must be upheld post-sharing.
Finally, a big data-sharing platform must deliver sustained high performance because suboptimal throughput or latency quickly negates the utility of massive datasets and discourages prospective users. The platform must efficiently handle numerous users and transactional records, including publishing, searching, and sharing activities. Users span enterprises, organizations, government sectors, and individuals, while the data involved are of high volume. The platform should facilitate efficient data publication and sharing with high-performance data searches enabling sharees to promptly discover needed datasets.

2.6. Big Data-Sharing Applications

Big data sharing finds applications across various domains, including healthcare [], supply chain management [], open government [], and clean energy []. This discussion focuses on two prominent applications that have garnered significant industrial and academic interest in recent years: healthcare and data trading.
Healthcare is an essential societal domain: morbidity, trauma, and medical emergencies occur continuously, generating an unremitting demand for accurate diagnosis, effective intervention, and long-term condition management. These activities produce vast, heterogeneous data assets, e.g., electronic health records (EHRs) generated by hospitals, trial datasets curated by research organizations, and real-time wellness measurements collected in smart-home environments. Interoperability among these disparate sources, achieved through secure big data sharing, promises to accelerate biomedical discovery, enhance clinical decision making, and optimize population-level health outcomes. In 2020, the U.S. National Institutes of Health, the world’s largest funder of biomedical research, finalized a policy supporting healthcare data sharing [].
First, big data sharing can enhance the understanding of individual hospital cases []. When a hospital admits a new patient, particularly one with atypical symptoms, physicians must reference past cases with similar characteristics. Additionally, the patient’s historical EHRs are invaluable for accurate disease diagnosis. In the medical field, big data sharing, known as healthcare information exchange (HIE), has become prevalent for facilitating the acquisition of EHRs between hospitals. This exchange enables doctors to gain a deeper understanding of various cases and diseases.
Moreover, big data sharing aids in the discovery of scientific insights, such as disease prediction and therapeutic regimen development, through the aggregation and analysis of clinical trials []. In recent years, machine learning-based medical data analysis has attracted considerable attention from governments, industry, and academia. Machine learning approaches require extensive datasets as input, which a single medical organization may struggle to provide. Consequently, the aggregation of clinical trials is crucial, and big data sharing serves as a means to achieve this.
Finally, sharing data from smart homes significantly contributes to precision medicine []. With advancements in Internet of Things technology, smart healthcare devices have become commonplace in daily life, such as smartwatches for heart rate monitoring and smart sphygmomanometers for blood pressure measurement. The vast array of personal health data collected is invaluable for precision medicine, allowing doctors to consider individual variability in environment and lifestyle when devising treatment plans or offering preventive advice.
Figure 3 illustrates a prototypical big data-sharing ecosystem in healthcare. Clinical trial datasets are pooled among research institutes to expedite investigations of existing diseases and to evaluate novel therapeutic regimens. Hospitals integrate patients’ longitudinal EHRs with real-time symptomatology and with EHR cohorts exhibiting comparable phenotypes to refine diagnoses. Concurrently, personal health streams generated in smart home environments are channeled into precision-medicine pipelines, enabling individualized therapeutic strategies.
Figure 3. Big data sharing for healthcare.
Big data trading has crystallized into a distinct business paradigm driven by the concomitant surge in data volume and commercial demand []. Both industry consortia and academic communities are actively articulating architectural and economic frameworks tailored to this emergent commodity, whose defining characteristics, including non-rivalry, experience good properties, and economies of scale, differentiate it sharply from conventional assets.
Industrial uptake. QUODD (https://www.quodd.com/, accessed on 1 September 2025) curates a vertically integrated marketplace specializing in anonymized feeds from global financial institutions and FinTech operators. Amazon’s AWS Data Exchange federates hundreds of licensed data providers with downstream analytics consumers through a fully managed, API-driven brokerage. In China, state-level industrial policy has incubated quasi-public institutes: the Guiyang Global Big Data Exchange (2014) (https://www.gzdex.com.cn/, accessed on 1 September 2025) and the Shanghai Data Exchange (2016) (https://www.chinadep.com/, accessed on 1 September 2025) operate under regulatory sandboxes that recognize data as a factor of production equivalent to land, labor, and capital.
Academic advances. Complementing these commercial deployments, a growing corpus of scholarship interrogates the micro-economic and algorithmic underpinnings of data markets. Zheng et al. introduce Arete, which was the first decentralized architecture for mobile crowd-sensed data that dynamically aligns online pricing with contributor remuneration through a Nash-bargaining reward-splitting mechanism []. Oh et al. formulate a multi-stage, non-cooperative game in which heterogeneous providers optimize the trade-off between privacy-preserving noise injection and revenue-maximizing valuation []. Zhao et al. quantify how increased data variety alters the strategic interaction between content providers and internet-service providers under sponsored-data tariffs, revealing a non-monotonic relationship between variety and market surplus [].

3. Existing Platforms and Categorization

Big data sharing has attracted extensive attention from academia, industries, and governments, and various big data-sharing platforms have been developed. This section first surveys representative big data-sharing platforms currently operational worldwide; subsequently, we propose a systematic taxonomy of these systems and delineate their underlying mechanisms.

3.1. Existing Platforms

There are many big data-sharing platforms worldwide. In the following, we introduce five representative big data-sharing platforms from different regions, for different purposes (academic/commercial), and with different system architectures (centralized/decentralized).

3.1.1. Epimorphics Linked Data Platform

The Linked Data Platform (LDP) (https://www.epimorphics.com/, accessed on 1 September 2025) is an advanced solution for big data sharing developed by Epimorphics Ltd., Bristol, UK. The LDP serves dual purposes. First, it functions as a software solution that can be deployed as local infrastructure to facilitate big data sharing. For instance, a university might implement the LDP to enable data sharing among its faculties, departments, and administrative offices. Second, the LDP acts as a platform utilized by the U.K. government to host data accessible to a broad spectrum of public and private sectors. This paper focuses exclusively on the latter application as it pertains to existing platforms for big data sharing.
Users of the LDP can publish datasets and display their descriptions and links. As illustrated in Figure 4, these descriptions encompass various elements, including the title, publisher, publication date, latest update, topic, license, summary, data links, and contact information. The data links section may contain multiple datasets, each containing a URL to the database, data format, publication date, and a preview option if enabled by the data provider. Essentially, data providers retain their data on their own servers while publishing descriptions of their datasets on the LDP.
Figure 4. An example dataset provided by Epimorphics LDP.
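To make the structure of such a catalog entry concrete, the following hypothetical record mirrors the description fields listed above; the field names and values are illustrative only and do not reproduce the LDP’s actual schema.

```python
# Hypothetical catalog entry mirroring the description fields listed above;
# the schema and values are illustrative, not the LDP's actual data model.
ldp_dataset_description = {
    "title": "Example Air Quality Dataset",
    "publisher": "Example Environment Agency",
    "published": "2023-05-01",
    "latest_update": "2025-08-15",
    "topic": "environment",
    "license": "Open Government Licence",
    "summary": "Hourly air-quality readings from monitoring stations.",
    "contact": "data@example.org",
    "data_links": [
        {   # the data themselves stay on the provider's own server
            "url": "https://provider.example.org/air-quality.csv",
            "format": "CSV",
            "published": "2025-08-15",
            "preview_enabled": True,
        },
    ],
}
```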
In addition to data publishing, the LDP offers robust data search capabilities. Users can search for big data using general keywords; datasets with descriptions containing these keywords will be displayed. Moreover, users can refine their searches by filtering datasets based on criteria such as publisher, topic, license, and data format. When a dataset’s description is displayed, related datasets are also suggested to the user. Regarding data sharing, the LDP supports only the download of raw data, as the original datasets are maintained by the data owners themselves. It is important to note that the LDP does not track the usage of the data once it has been downloaded from the data owners.
The LDP’s primary strength lies in managing high semantic variety, allowing data owners to publish rich, standardized metadata about their datasets without hosting the data itself. However, because data providers retain their data on their own servers, the LDP does not directly address the challenges of high volume or high velocity; these responsibilities are delegated to the original data owners.

3.1.2. HKSTP Data Studio

The Hong Kong Science and Technology Parks Corporation (HKSTP) established a pivotal platform, originally launched as Data Studio (https://sp.hkstp.org/, accessed on 1 September 2025) in 2007, to advance the government’s smart city initiative. This platform has since evolved into the Digital Service Hub, broadening its mission to foster a comprehensive ecosystem for digital innovation centered on AI models, big data, and high-performance computing. Its primary objective is to serve as a nexus where public and private entities can collaborate, leveraging a rich repository of data to develop cutting-edge solutions. The Hub hosts foundational government data spanning areas like city management and employment while also providing exclusive access to high-value, specialized datasets from strategic partners such as the Hospital Authority (HA) and Radio Television Hong Kong (RTHK).
The methodology for data interaction on Data Studio has matured significantly beyond its initial offerings. Initially, Data Studio enabled static data sharing via shareable links and high-velocity, real-time data sharing through Application Programming Interfaces (APIs). Later, the platform introduced far more sophisticated and secure data collaboration models. A key innovation lies in the establishment of controlled environments like the Data Collaboration Lab, which was launched in partnership with the Hospital Authority. The so-called “data clean room” model allows users to analyze sensitive, anonymous clinical data without direct access or extraction, thereby ensuring privacy, security, and full auditability. It is further complemented by integrated services like “HPC as a Service”, which provides the robust computational power necessary to process these large-scale datasets directly within the Hub’s secure infrastructure, representing a shift toward providing holistic, value-added analytical environments. The “data clean room” model is a direct architectural solution to the data privacy and fine-grained access control requirements mentioned in Section 2.5.
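A data clean room of this kind can be thought of as a query gatekeeper that exposes only aggregate answers and writes every request to an audit log. The sketch below is a hypothetical illustration of that pattern, not HKSTP’s implementation.

```python
import datetime

class DataCleanRoom:
    """Sketch of a clean-room gatekeeper: analysts submit queries, never touch
    raw records, and every request is written to an audit log (illustrative only)."""

    ALLOWED = {"count", "mean"}          # aggregate-only query types

    def __init__(self, records):
        self._records = records          # sensitive rows, never exported
        self.audit_log = []

    def query(self, analyst: str, kind: str, field: str):
        # Every request is logged before it is evaluated.
        timestamp = datetime.datetime.now(datetime.timezone.utc)
        self.audit_log.append((timestamp, analyst, kind, field))
        if kind not in self.ALLOWED:
            raise PermissionError("only aggregate queries are permitted")
        values = [r[field] for r in self._records]
        return len(values) if kind == "count" else sum(values) / len(values)

# Example: an analyst obtains an aggregate without seeing individual rows.
room = DataCleanRoom([{"age": 54}, {"age": 61}, {"age": 47}])
print(room.query("analyst-1", "mean", "age"))   # 54.0
print(len(room.audit_log))                      # 1 logged request
```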
This evolution addresses critical limitations inherent in earlier data-sharing platforms. The previous challenge of being unable to track how shared data are utilized is fundamentally resolved by the data clean room model, where all queries and analytical activities are logged, providing data owners with robust governance and oversight. This controlled environment inherently mitigates copyright and usage concerns far more effectively than a simple user registry for APIs. However, the Digital Service Hub remains a proprietary, service-oriented platform focused on the Hong Kong ecosystem. Its source code is not publicly available, and its operational model is to function as a central innovation hub rather than to offer a replicable, open-source data-sharing solution for other enterprises to deploy independently.
HKSTP Data Studio is designed to handle data characterized by high variety and velocity. It aggregates diverse datasets from public and private sectors and supports real-time data sharing via APIs, making it suitable for dynamic smart city applications. To tackle the computational demands of high-volume datasets, the platform integrates “HPC as a Service,” providing the necessary power for large-scale analysis within its secure environment, thus offering a comprehensive solution across all three dimensions of big data.

3.1.3. SEEK

SEEK (https://seek4science.org/, accessed on 1 September 2025) is a big data-sharing platform specifically developed by a consortium of scientists to facilitate the sharing of datasets and models among researchers in the field of systems biology []. Systems biology, characterized by the computational and mathematical modelling of intricate biological networks, demands the integration of heterogeneous datasets and corresponding analytical models. SEEK was conceived to satisfy this requirement by federating otherwise isolated experimental and modelling resources across institutional boundaries; the platform is further released as open-source software, enabling community-driven extension and transparent reuse.
Researchers can publish their high-variety data within SEEK in the form of projects, which may include raw datasets, standard operating procedures (SOPs), models, publications, and presentations. SEEK supports version control and allows researchers to update shared data as needed. Models can be simulated within SEEK if they adhere to the Systems Biology Markup Language (SBML). Upon publication, the metadata from each project are extracted using the Resource Description Framework (RDF), enabling semantic query-based searches. A distinctive feature of systems biology is the inherent interconnection of datasets and models. To facilitate the understanding and exploration of these links, SEEK employs the ISA (Investigation, Study, and Assay) data model [], as depicted in Figure 5.
Figure 5. ISA model in SEEK.
However, SEEK has certain limitations. It is not designed for the sharing of general big data, as it is tailored specifically for systems biology research. Furthermore, similar to LDP and Data Studio, once data are shared on SEEK and downloaded by users, there is no mechanism to track how the data are subsequently utilized.
The SEEK platform’s primary strength lies in its sophisticated management of high variety within the domain of systems biology. It uses the ISA data model to federate complex and interconnected project assets. While it is not designed to handle the extreme volume or velocity seen in commercial platforms, the semantic richness and intricate relationships between its heterogeneous data assets firmly place it within the big data paradigm, prioritizing complexity over sheer scale.

3.1.4. InterPlanetary File System

The InterPlanetary File System (IPFS) (https://ipfs.tech/, accessed on 1 September 2025) is a distributed peer-to-peer (P2P) network designed for high-reliability data storage and sharing initiated by Protocol Labs. To publish an artifact, a user serializes the content, computes its cryptographic hash, and derives a self-describing content identifier (CID). This CID functions as a location-independent address; any peer possessing the CID can retrieve the corresponding artifact from the IPFS network. Content availability is advertised through a distributed hash table (DHT) that maps each CID to the set of peers currently storing the blocks. Consequently, every participant curates a local repository by declaring which CIDs it seeds, which it wishes to cache, and which it explicitly discards.
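The content-addressing and DHT mechanism can be illustrated with the simplified sketch below; note that real IPFS CIDs use multihash/multibase encodings over chunked Merkle DAGs, so a bare SHA-256 hex digest is only a stand-in.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Simplified content identifier: a hash of the content, not a real CID."""
    return hashlib.sha256(data).hexdigest()

# A distributed hash table maps each CID to the peers currently storing it.
dht = {}

def publish(peer: str, data: bytes) -> str:
    cid = content_id(data)
    dht.setdefault(cid, set()).add(peer)   # advertise availability
    return cid

def providers(cid: str) -> set:
    return dht.get(cid, set())

# Example: peer A publishes a block; any peer holding the CID can locate it.
cid = publish("peer-A", b"some shared dataset block")
print(cid[:16], providers(cid))            # location-independent address -> {'peer-A'}
```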
Within the IPFS P2P network, all users are considered equal. As illustrated in Figure 6, a user seeking a particular piece of data can retrieve it from other users, thereby becoming a data provider themselves. Despite its innovative approach, the IPFS encounters several challenges in the realm of big data sharing. A significant issue is the lack of motivation among users to share large datasets, as doing so demands substantial storage and network resources. In contrast, public cloud storage offers a more user-friendly alternative, requiring no maintenance. This lack of incentive results in a scarcity of nodes within the IPFS network, undermining its reliability as a distributed system.
Figure 6. A running example of IPFS.
In response to these challenges, Protocol Labs has introduced Filecoin, which is a platform that leverages blockchain technology to incentivize the use of IPFS by offering monetary rewards for providing data storage resources []. However, users of the platform must incur significant costs for data storage. Consequently, Filecoin faces the formidable task of competing with established cloud storage providers, which may prove to be a considerable challenge.
IPFS offers a decentralized architectural solution for storing and sharing large-volume data by distributing it across a peer-to-peer network. Its content-addressing model is inherently well suited to managing variety, as any type of static file can be uniformly addressed and retrieved. However, the platform is not designed for high-velocity streaming data, functioning as a file retrieval system rather than a real-time processing engine, and its performance can be inconsistent depending on network participation.

3.1.5. Amazon Web Services Data Exchange

In November 2019, Amazon, a leading cloud service provider, introduced a new service known as Amazon Web Services (AWS) Data Exchange (https://aws.amazon.com/data-exchange/, accessed on 1 September 2025). This service is designed to offer customers a secure means of discovering, subscribing to, and utilizing third-party data. At its launch, AWS Data Exchange featured contributions from over 100 data providers, offering more than 1000 datasets, as announced by Amazon.
AWS Data Exchange is structured around two principal actor types: data providers and data subscribers. Providers publish products, either gratis or fee-based, together with machine-readable license clauses that encode pricing schedules and permissible use. An integrated versioning facility allows in situ updates without invalidating extant subscriptions. Subscribers acquire access through monthly or annual entitlements; during the active term, they may download the corpus for local processing and automatically receive notifications of any revision. The service’s native integration with the wider AWS ecosystem constitutes a key advantage: datasets can be streamed directly into Amazon S3 via the console, representational state transfer (RESTful) APIs, or command-line interface (CLI), while update events are pushed to subscribers through CloudWatch in near real time.
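As a rough illustration of how a subscriber might enumerate entitled datasets programmatically, the following sketch uses the AWS SDK for Python (boto3); it assumes valid AWS credentials and an active Data Exchange subscription, and the operation names should be verified against the current SDK documentation.

```python
import boto3

# Sketch of discovering entitled datasets and their revisions via the AWS SDK.
dx = boto3.client("dataexchange", region_name="us-east-1")

# Datasets the account is entitled to through its subscriptions.
entitled = dx.list_data_sets(Origin="ENTITLED")
for ds in entitled.get("DataSets", []):
    print(ds["Id"], ds["Name"])
    # Each dataset is versioned; revisions carry the actual assets.
    revisions = dx.list_data_set_revisions(DataSetId=ds["Id"])
    for rev in revisions.get("Revisions", []):
        print("  revision:", rev["Id"], "finalized:", rev.get("Finalized"))
```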
The platform has been widely adopted for disseminating large-scale reference corpora, including around 200 COVID-19-related datasets catalogued in Table 2. It handles the high-volume data via integration with the high-performance Amazon S3 data storage service. Nevertheless, several limitations persist. First, providers bear the full cost of Amazon S3 storage; for voluminous big data assets, this expenditure can outweigh the marginal licensing revenue. Second, subscription mechanics require subscribers to disclose personally identifying information to each publisher, engendering user privacy leakage. Third, although governed by comprehensive contractual clauses, datasets ultimately reside on AWS-controlled infrastructure, leaving open the residual risk of insider misuse and data privacy concerns.
Table 2. COVID-19 datasets on AWS Data Exchange.
AWS Data Exchange excels at handling massive volumes through its native integration with Amazon S3, enabling the transfer of petabyte-scale datasets. It is also highly effective at managing variety, as its marketplace is data-agnostic and supports virtually any file-based data product. The platform is less focused on real-time velocity, however, as its subscription-based model is oriented around batch updates and discrete dataset revisions, positioning it as a marketplace for curated data products rather than live data streams.

3.2. Categorization of Existing Platforms

This section categorizes the platforms described in Section 3.1 into data-hosting centers, data aggregation centers, and decentralized data-sharing solutions. Furthermore, we identify their unique features and challenges.

3.2.1. Data-Hosting Center

The functions of a data-hosting center (DHC) bear similarities to those of a portfolio manager in the financial sector. A portfolio manager raises capital from investors and allocates it to stocks or bonds to generate financial returns. Analogously, a DHC aggregates original big data from data owners and identifies potential data users who can share these data in exchange for rewards. Both portfolio managers and DHCs exercise complete control over the funds from investors and the big data from data owners, respectively. However, DHCs differ from portfolio managers due to the replicable nature of big data. Unlike money, big data can be easily duplicated, which often leads DHCs to make data publicly accessible. SEEK and Amazon Web Services Data Exchange are typical data-hosting centers.
Figure 7 depicts the canonical workflow of a DHC. In the initial phase, data owners (e.g., Users A and B) deposit their datasets (denoted data 1 and 2) with the DHC. User A subsequently issues a keyword query (“blockchain”) against the center’s consolidated catalogue; the DHC returns the corresponding metadata descriptions. After reviewing the results, User A requests dataset 2, whereupon the DHC mediates the transfer: it delivers data 2 to User A, receives an integrity acknowledgment, and relays that acknowledgment to User B to complete the transaction audit trail.
Figure 7. A running example of a data hosting center.
A defining characteristic of a DHC is its dual mandate: (i) to aggregate voluminous user-generated corpora and (ii) to act as an authorized redistribution hub. SEEK exemplifies this model: investigators upload project-derived datasets to the SEEK repository, which, having obtained explicit dissemination consent, renders the content universally accessible under open-access licensing terms.
One of the primary advantages of DHCs is their high efficiency. As centralized repositories for shared big data, DHCs benefit from the general optimization methods applicable to data storage systems. For instance, frequently accessed big data can be pre-replicated to enhance data transfer speeds, and caching techniques can be employed to expedite big data queries. Another advantage of DHCs is their ability to ensure the authenticity of big data. Without DHCs, data owners might falsely claim possession of data or refuse to share it despite prior commitments. DHCs act as intermediaries between data owners and users, ensuring reliable data exchange.
However, DHCs also have disadvantages, with data privacy being a significant concern. Specifically, DHCs may replicate big data without the consent of data owners, and there is no guarantee that data will be shared in accordance with the policies set by the data owners.

3.2.2. Data Aggregation Center

The function of a data aggregation center (DAC) in the realm of big data parallels the role of a real estate agency in the property market. A real estate agency compiles information about properties from owners and advertises them to attract potential buyers, thereby facilitating transactions between property owners and buyers. Similarly, as depicted in Figure 8, a DAC gathers data descriptions from data owners, offers a platform for potential users to search for big data, and facilitates the sharing of big data between data owners and users. While both real estate agencies and DACs possess general information about properties and big data, respectively, they do not have control over their disposition. However, DACs differ from real estate agencies in that the authenticity of big data is not as easily verifiable as that of real estate. Epimorphics Linked Data Platform and HKSTP Data Studio are typical data aggregation centers.
Figure 8. A running example of a data aggregation center.
In the DAC model, the center connects data services among agencies through an API interface. Data agencies are not required to report or upload data to the DAC in advance; instead, they retain ownership and management of their data. When an agency needs to search for data, it interacts with the DAC in real time to send a data request. The DAC then relays and broadcasts this request to other agencies. Agencies possessing the requested data respond, and the DAC collects and forwards the data back to the requester. However, it is evident that the DAC has both the capability and the opportunity to retain data. Over time, as the DAC accumulates data during the sharing process, it may gradually evolve into a DHC.
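The relay-and-broadcast behavior of a DAC can be sketched as follows: the center keeps only an agency registry, forwards each request, and caches nothing. The names and interfaces are illustrative.

```python
class DataAggregationCenter:
    """Sketch of a DAC: it relays requests and responses in real time but
    deliberately stores nothing beyond the agency registry."""

    def __init__(self):
        self.agencies = {}    # name -> callable mapping a keyword to datasets or None

    def register(self, name, handler):
        self.agencies[name] = handler

    def request(self, requester: str, keyword: str):
        results = []
        # Broadcast the request to every other agency and forward the replies.
        for name, handler in self.agencies.items():
            if name == requester:
                continue
            reply = handler(keyword)
            if reply:
                results.extend(reply)
        return results            # nothing is cached at the center

# Example: two agencies keep their own data; the DAC only brokers the query.
dac = DataAggregationCenter()
dac.register("agency-a", lambda kw: ["traffic-2024.csv"] if kw == "traffic" else None)
dac.register("agency-b", lambda kw: None)
print(dac.request("agency-b", "traffic"))   # ['traffic-2024.csv']
```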

3.2.3. Decentralized Big Data Sharing Solutions

DHCs and DACs represent centralized solutions that require central authorities to manage big data-sharing platforms. In contrast, decentralized solutions operate without the need for a central authority []. In these decentralized systems, data providers and recipients form a peer-to-peer (P2P) network, maintaining a list of available data. When a data owner wishes to publish data, they broadcast a message regarding data ownership across the entire network. Similarly, data search and transfer requests are handled within this network.
Decentralized solutions offer several advantages, the most significant being their immunity to single points of failure or malicious central authorities due to the absence of central control. Additionally, data transfer speeds can be rapid if the data are “hot,” meaning a large number of peers in the network possess copies of the desired data. The InterPlanetary File System is a typical decentralized big data sharing solution.
However, decentralized solutions also present certain drawbacks. First, maintaining the list of available data imposes additional burdens on both data providers and recipients, potentially overwhelming lightweight devices that may lack the capacity for such maintenance. Second, the incentive mechanisms for data providers and recipients to participate in the P2P network are less clear compared to centralized solutions, where participants benefit from structured services. Lastly, the performance of decentralized solutions can be inconsistent. If the P2P network has few participants or if the requested data are “cold” (i.e., only a small number of peers possess copies of the data), the system’s performance may be significantly compromised.
As shown in Table 3, the three models, DHCs, DACs, and decentralized solutions, represent distinct architectural philosophies for sharing big data. Each approach offers a unique balance of control, efficiency, privacy, and resilience. A DHC acts as a centralized custodian of data, a DAC serves as a centralized broker of metadata, and a decentralized solution eliminates the central entity altogether in favor of a peer-to-peer network.
Table 3. Comparison of data-hosting center (DHC), data aggregation center (DAC), and decentralized big data-sharing solutions.
The choice between these three models involves a fundamental trade-off between centralization and decentralization.
  • DHCs prioritize efficiency and data authenticity at the cost of data owner control and privacy. They are suitable for scenarios where performance and reliability are paramount, and data owners are willing to entrust their data to a central authority.
  • DACs offer a middle ground, preserving data owner control while providing a centralized mechanism for data discovery. However, this model introduces concerns about the center’s potential to overstep its role and the difficulty in verifying data authenticity.
  • Decentralized solutions champion resilience, censorship resistance, and the elimination of a central authority, which removes single points of failure and control. This comes at the price of performance consistency, a higher maintenance burden on participants, and less defined incentive structures. This model is ideal for environments where trust in a central entity is low or non-existent.

4. Challenges and Existing Solutions

Unlike traditional commodities, data are non-exclusive, exponentially growing, and costless to replicate; their ultimate value is unknowable ex ante, ownership is difficult to establish, and circulation channels are inherently hard to police. These idiosyncrasies frustrate the design of trading mechanisms that are simultaneously efficient, credible, fair, and secure. Meanwhile, the requirements of big data sharing articulated in Section 2.5 demand technical solutions capable of satisfying them. This section delineates the principal challenges confronting big data sharing and surveys the state-of-the-art proposals advanced to address each one.

4.1. Standardization of Heterogeneous Data

Standardization is crucial in facilitating an efficient and meaningful analysis of shared data []. First, a thorough data audit is necessary to ensure that the data are stored correctly, as required by the data owner. Subsequently, normalizing the structure of the dataset is essential to produce an efficient and well-organized dataset. Finally, providing a data summary enables data users to better understand the characteristics of the data. Data-hosting centers face severe challenges in standardizing heterogeneous data so that they can provide reliable data storage services to data owners and trustworthy data retrieval services to data users.
Data owners increasingly delegate their corpora to cloud repositories that may behave maliciously or simply fail to meet service-level agreements. A rational yet unscrupulous provider can silently discard cold data to reclaim storage and bandwidth, thereby violating integrity guarantees. Concretely, the owner needs an efficient mechanism that both (i) continuously certifies that the outsourced dataset is intact and fully retrievable and (ii) cryptographically binds that guarantee to the owner’s private key. The problem is variously termed proof of retrievability, proof of storage, or provable data possession [,].
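To convey the intuition behind such audits, the simplified sketch below spot-checks randomly chosen blocks against per-block MAC tags retained by the owner. Real proof-of-retrievability and provable-data-possession schemes use homomorphic tags and compact proofs so that the verifier need not store per-block tags or retrieve the blocks themselves; this is only an illustration of the challenge–response idea.

```python
import hashlib
import hmac
import os
import random

BLOCK = 4096  # bytes per block

def tag_blocks(data: bytes, key: bytes):
    """Owner side: keep one small MAC tag per block before outsourcing the data."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    return [hmac.new(key, b, hashlib.sha256).digest() for b in blocks]

def audit(storage_blocks, tags, key: bytes, challenges: int = 3) -> bool:
    """Verifier side: spot-check randomly chosen blocks against the stored tags."""
    for i in random.sample(range(len(tags)), k=min(challenges, len(tags))):
        actual = hmac.new(key, storage_blocks[i], hashlib.sha256).digest()
        if not hmac.compare_digest(tags[i], actual):
            return False   # the provider returned a corrupted or missing block
    return True

# Example: the audit fails after the provider silently alters one block.
key = os.urandom(32)
data = os.urandom(3 * BLOCK)
tags = tag_blocks(data, key)
outsourced = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
outsourced[1] = os.urandom(BLOCK)            # simulated tampering
print(audit(outsourced, tags, key))          # False (with high probability)
```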
The recent literature has produced practically oriented solutions. He et al. present DeyPoS, a dynamic proof-of-storage framework that couples authenticated skip-list indexing with secure cross-user deduplication, allowing outsourced files to be updated and audited in logarithmic time while eliminating redundant ciphertext []. Yu et al. address the realistic threat of the ephemeral client–key compromise; they construct a leakage-resilient cloud-audit scheme that refreshes secret keys after each verification epoch, thereby containing the damage caused by exposure [].
The analytical utility of shared datasets is fundamentally constrained by their inherent heterogeneity across dimensions such as type, structure, semantics, granularity, and accessibility []. The propagation of incorrect or inconsistent data artifacts not only diminishes the value of the source but also invalidates subsequent analyses. Accordingly, a systematic preprocessing regimen, encompassing data cleaning, standardization, calibration, fusion, and desensitization, is indispensable prior to dissemination []. For voluminous sources like sensor networks, this curation process may involve significant data reduction, yet it introduces non-trivial challenges in designing information-preserving filters and automating metadata generation. The objectives of this phase are twofold: to establish an efficient data representation that reflects its structural and semantic properties and to facilitate information extraction from underlying resources into a structured, analyzable format [].
Subsequently, the presentation of these curated datasets on sharing platforms necessitates a strategic balance between informational transparency and asset protection. While detailed descriptions are required to attract prospective users, full disclosure invites unauthorized replication and compromises data value. Data summarization is the principal technique used to resolve this tension, offering a condensed representation of the dataset []. Although existing platforms often utilize manual metadata and summary generation [], automation is imperative for achieving scalable efficiency and accuracy. The synthesis of a meaningful summary is a task of greater complexity than metadata generation, as it presupposes a semantic interpretation of the data []. Advanced methodologies, including machine learning models [], are being developed to address this, but their viability hinges on achieving computational efficiency, ensuring data security, and accommodating demands for personalized abstracts.
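As a minimal illustration of automated summary and metadata generation, the sketch below derives basic structural metadata (schema, scanned row count, per-column missing-value ratios, and simple numeric statistics) from a tabular dataset. The field names and the sampling cap are assumptions for this sketch; semantic summaries of the kind discussed above require far richer analysis.

```python
import csv
import statistics
from collections import Counter

def summarize_csv(path: str, sample_limit: int = 100_000) -> dict:
    """Produce a lightweight dataset summary suitable for a sharing-platform listing."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        rows, missing, numeric = 0, Counter(), {c: [] for c in columns}
        for row in reader:
            rows += 1
            for c in columns:
                value = (row.get(c) or "").strip()
                if not value:
                    missing[c] += 1
                    continue
                try:
                    numeric[c].append(float(value))
                except ValueError:
                    pass  # non-numeric column entry
            if rows >= sample_limit:  # cap the scan for very large files
                break
    summary = {"columns": columns, "rows_scanned": rows, "per_column": {}}
    for c in columns:
        stats = {"missing_ratio": round(missing[c] / rows, 4) if rows else None}
        if numeric[c]:
            stats.update(min=min(numeric[c]), max=max(numeric[c]),
                         mean=round(statistics.fmean(numeric[c]), 4))
        summary["per_column"][c] = stats
    return summary
```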

4.2. Value Assessment and Pricing Model

When data are commoditized, their quality and prospective economic value must be estimated rigorously and communicated transparently []. Accurate valuation and pricing safeguard the interests of all market participants, underwrite the emergence of standardized and trustworthy exchanges, and foster a sustainable data-sharing ecosystem. Quality appraisal concentrates on intrinsic content attributes: completeness, accuracy, consistency, timeliness, and provenance integrity []. Value assessment extends this analysis by incorporating (i) the historical cost of data generation, curation, and storage, and (ii) the expected marginal utility that the dataset yields across heterogeneous downstream applications [].
The rigorous evaluation of data quality and value is a critical prerequisite for any data transaction, enabling consumers to make informed purchasing decisions and providers to establish fair market prices. This evaluative function is also incumbent upon the sharing platform, which must actively monitor data to maintain market health. Key platform responsibilities include filtering substandard data, preventing malicious pricing from disrupting the market, and recommending cost-effective datasets to users. Ultimately, these governance measures are vital for boosting user satisfaction and safeguarding the platform’s long-term credibility. Value assessment is even more challenging for data aggregation centers because they do not access the original big data.
State-of-the-art quality and value assessment frameworks are hindered by four inter-locking challenges: (1) the absence of reproducible, quantitative metrics for contextual or semantic features; (2) prohibitive computational overhead when sampling large-scale, high-velocity corpora; (3) opaque cost structures that obscure the true economic expense of data generation and curation; and (4) epistemic uncertainty in forecasting downstream utility. These difficulties are compounded by the non-stationarity of value (a dataset’s worth can appreciate or depreciate within hours), the trivial ease of perfect replication, and the fragility of ex post access control. Consequently, ex ante certification of quality and value demands assessment protocols that are simultaneously accurate, inexpensive, and dynamically updateable.
Research has identified five critical dimensions of data quality: intrinsic quality, presentation quality, contextual quality, accessibility, and reliability. These dimensions are essential for evaluating data quality and ensuring that it meets the requirements of data users.
  • Intrinsic quality: the conformance of a dataset to elementary syntactic and semantic criteria such as volume, accuracy, completeness, timeliness, uniqueness, internal consistency, security posture, and provenance reliability [].
  • Presentation quality: the clarity of structure and semantics conveyed to the consumer, encompassing conciseness, interpretability, syntactic uniformity, and cognitive ease of comprehension.
  • Contextual quality: the degree to which data content aligns with the specific decision-making context and is fit for the intended analytical or operational purpose.
  • Accessibility: the ease and economy with which the buyer can locate, negotiate, and physically retrieve the dataset, including communication latency and any associated transactional overheads.
  • Reliability: the cumulative reputation and verifiable trustworthiness of both the data originator and the vendor, evaluated through historical performance and third-party attestations.
The economic value of a dataset is contingent upon its quality, but it is also mediated by production cost and prevailing market conditions. Conventional intangible-asset valuation techniques, including cost, market, and income approaches, can be adapted to data, yet each has distinct limitations.
  • Cost approach: Value is anchored to the historic expenditure incurred during collection, cleansing, storage, and maintenance. Owing to joint-production effects and indivisible overheads, marginal cost is rarely observable, so the method often understates the option value and fails to capture future rent-generating potential.
  • Market approach: Value is inferred from recent transaction prices of allegedly comparable datasets. The paucity of transparent exchanges and the heterogeneity of data attributes (schema granularity, provenance, timeliness) render the identification of true comparables problematic, producing wide confidence intervals.
  • Income (revenue) approach: Value equates to the discounted stream of incremental cash flows attributable to the dataset across its economic life. Because forecast benefits are application-specific and buyer-specific, the approach is inherently subjective; valuations can diverge by orders of magnitude across prospective licensees. A minimal discounted-cash-flow sketch follows this list.
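The sketch below makes the income approach concrete by discounting a stream of projected incremental cash flows attributable to a dataset. The cash-flow figures, discount rate, and horizon are hypothetical inputs that a valuer would have to justify case by case.

```python
def income_approach_value(cash_flows, discount_rate):
    """Discounted-cash-flow value of a dataset: sum of CF_t / (1 + r)^t."""
    return sum(cf / (1 + discount_rate) ** t
               for t, cf in enumerate(cash_flows, start=1))

# Hypothetical example: five years of incremental revenue attributable to the
# dataset, discounted at 12% to reflect uncertainty about downstream utility.
projected = [120_000, 150_000, 160_000, 140_000, 100_000]
print(round(income_approach_value(projected, 0.12), 2))
```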
By understanding the dimensions of data quality and the methods for evaluating data value, data-sharing platforms can provide accurate and reliable quality and value evaluation of data, enabling data users to make informed decisions and ensuring a healthy data-sharing environment.
Despite the growing importance of big data sharing, developing models and methods for accurate assessment of the full value of data remains a significant challenge. The vast volume and diverse types of data in the market pose several obstacles to data quality and value evaluation, including the following:
  • Multi-dimensional quantitative evaluation of quality. Although the literature proposes extensive taxonomies of data-quality dimensions, most remain conceptual schemata supported by qualitative heuristics; operational, quantitative models are conspicuously absent. This deficit is exacerbated when repositories contain massive unstructured corpora, such as text, imagery, and sensor streams, whose semantic content resists automatic, scalable, and reproducible metrology.
  • Data collection quality assessment. Most current approaches assess data quality at the level of individual data units (e.g., a single text or image). However, data sharing and trading platforms typically involve large datasets (e.g., 10,000 texts or 100,000 images). Evaluating the overall quality of these datasets by aggregating the quality statistics of individual data units ignores the relationships between data units and their impact on the overall quality of the dataset.
  • Dynamic evaluation of data value. Quantifying the value of a dataset is an inherently complex task, requiring an assessment of factors such as its rarity, acquisition difficulty, and intrinsic quality. A significant limitation of existing evaluation frameworks, however, is their tendency to focus on static measures of quality while neglecting the dynamic nature of data’s true value. The value of data is not fixed; it evolves in response to technological advancements in collection and storage, the optimization of data-mining models, and shifts in application scenarios and consumer needs. This temporal dynamism introduces a profound layer of complexity, rendering simplistic, static assessments inadequate and making robust value estimation a persistent challenge.
Developing more sophisticated models and methods for data quality and value evaluation is essential to overcome these challenges. The models and methods involve the following:
  • Developing quantitative models and methods for evaluating data quality. Researchers should focus on creating specific quantitative models and methods for evaluating data quality, particularly for unstructured data.
  • Assessing data collection quality. New approaches should be developed to assess the overall quality of large datasets, taking into account the relationships between data units and their impact on the overall quality of the dataset.
  • Evaluating data value dynamically. Researchers should develop methods that can reasonably evaluate the dynamic characteristics of data value, including rarity, difficulty in obtaining, and changes in data collection, storage, and application scenarios.
By addressing these challenges, developing more accurate and reliable data quality and value evaluation models is possible, ultimately facilitating more efficient and effective big data sharing.

4.3. Sharing Security

Data security is a multifaceted concept encompassing three primary attributes: confidentiality, integrity, and availability, collectively known as the CIA triad. These attributes are crucial in protecting sensitive data from unauthorized access, modification, or disruption, regardless of the architecture of the big data-sharing solution. Confidentiality is paramount in big data sharing, ensuring that all inputs, outputs, and intermediate computation states remain hidden from potentially adversarial or untrusted entities.
Confidentiality is achieved through the protection of data from unauthorized access []. Access control and encryption technologies are effective means of ensuring data confidentiality []. Access control technology, in particular, plays a vital role in protecting data from unauthorized access and managing authorized users. In scenarios involving big data sharing, fine-grained [] and flexible [] access control is often required to meet the following requirements (a minimal policy sketch follows the list):
  • Time-limited authorization: determining whether user authorization has time constraints.
  • Authority division: distinguishing between data ownership and usage rights.
  • Re-sharing permissions: deciding whether users can re-share data.
  • Flexible revocation: allowing for the complete revocation of user permissions.
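The sketch below encodes these four requirements as fields of a simple grant record and enforces them at access time. The `Grant` and `PolicyStore` names, their fields, and the in-memory store are illustrative assumptions rather than any standardized access-control model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Grant:
    sharee: str
    dataset: str
    expires_at: datetime          # time-limited authorization
    usage_only: bool = True       # authority division: usage right, not ownership
    may_reshare: bool = False     # re-sharing permission
    revoked: bool = False         # flexible revocation

class PolicyStore:
    def __init__(self):
        self._grants = {}         # (sharee, dataset) -> Grant

    def grant(self, g: Grant):
        self._grants[(g.sharee, g.dataset)] = g

    def revoke(self, sharee: str, dataset: str):
        if (sharee, dataset) in self._grants:
            self._grants[(sharee, dataset)].revoked = True

    def may_access(self, sharee: str, dataset: str, action: str = "use") -> bool:
        g = self._grants.get((sharee, dataset))
        if g is None or g.revoked or datetime.utcnow() > g.expires_at:
            return False
        if action == "reshare":
            return g.may_reshare
        if action == "transfer_ownership":
            return not g.usage_only
        return True

# Usage: a 30-day, non-transferable, non-resharable grant.
store = PolicyStore()
store.grant(Grant("alice", "ds-42", datetime.utcnow() + timedelta(days=30)))
assert store.may_access("alice", "ds-42") and not store.may_access("alice", "ds-42", "reshare")
```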
This survey examines the requirements for access control from the following perspectives:
  • Consolidation and integration of access control strategies. In many cases, data users require access to multiple heterogeneous data sources. Integrating access control policies from these sources is essential, but automated or semi-automated strategic integration systems are needed to resolve conflict issues []. Allowing data providers to develop their access strategies can complicate data sharing, and the automatic integration and merging of these strategies remains challenging.
  • Authorization management. Fine-grained access control requires efficient authorization management, which can be resource-intensive for large datasets. Automatic authorization technologies are necessary based on the user’s digital identity, profile, context, and data content and metadata. While initial steps have been taken in developing machine learning-based permission assignments [], more advanced methods are needed to address dynamically changing contexts and situations.
  • Implementation of access control on big data platforms. The rise of big data platforms has introduced new challenges in implementing fine-grained access control for diverse users. Although initial work has focused on injecting access control policies into submitted work, further research is needed to study the effective implementation of such strategies in big data storage, particularly in fine-grained encryption.
In the realm of cloud computing security, data encryption plays a pivotal role in ensuring confidentiality. Various encryption methodologies, such as functional encryption [], identity-based encryption [], and attribute-based encryption [], are instrumental in safeguarding data. Notably, attribute-based encryption is further categorized into key-policy attribute-based encryption [] and ciphertext-policy attribute-based encryption []. Despite the protective capabilities of data encryption, its efficacy is often constrained by the proliferation of keys and the complexities inherent in key management. Additionally, the computational demands of existing encryption schemes present significant challenges. Consequently, there remains a pressing need for the development of more lightweight and adaptable encryption algorithms to facilitate the secure sharing of large datasets.
A particularly noteworthy encryption technique for ensuring data confidentiality is homomorphic encryption (HE) [,]. The pioneering fully homomorphic encryption scheme was introduced by Gentry in 2009 []. Although HE represents a ground-breaking advancement in cryptographic technology, its practical application is hindered by inefficiencies. Subsequent research efforts have focused on enhancing the scheme’s efficiency, notably reducing computation time []. Nevertheless, the performance of fully homomorphic encryption remains suboptimal for most practical applications. Beyond inefficiency, homomorphic encryption presents additional limitations. For instance, it necessitates that all sensors and the ultimate recipient share a common key for encryption and decryption, posing logistical challenges when these entities belong to disparate organizations. Furthermore, homomorphic encryption does not support computations on data encrypted with different keys without incurring substantial overhead, thereby restricting differential access to contributed data. While alternative encryption methods, such as attribute-based and functional encryption, address some of these limitations, homomorphic encryption solely guarantees data confidentiality and not integrity. It must be combined with mechanisms ensuring correct computations to provide comprehensive security assurances.
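To illustrate the homomorphic property in its simplest, additive-only form, the following is a minimal sketch of a textbook Paillier cryptosystem. It is not a fully homomorphic scheme, the hard-coded primes are toy parameters that would be insecure in practice, and it is included only to show how ciphertexts can be combined without decryption.

```python
import math
import secrets

def paillier_keygen(p: int = 10007, q: int = 10009):
    # Toy primes for illustration only; real deployments require large random primes.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                                 # valid because g = n + 1
    return n, (lam, mu)

def encrypt(n: int, m: int) -> int:
    n2, g = n * n, n + 1
    while True:
        r = secrets.randbelow(n)
        if r and math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

def add_encrypted(n: int, c1: int, c2: int) -> int:
    # Additive homomorphism: Dec(c1 * c2 mod n^2) = m1 + m2 mod n.
    return (c1 * c2) % (n * n)

n, priv = paillier_keygen()
c = add_encrypted(n, encrypt(n, 41), encrypt(n, 1))
assert decrypt(n, priv, c) == 42
```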
The integrity of data is a critical attribute, ensuring that any unauthorized modifications are detectable []. Moreover, it guarantees that the outputs of computations on sensitive data are accurate and consistent with the input data. In essence, integrity signifies that data remains unaltered by unauthorized entities. With the proliferation of internet usage, the demand for data has surged, leading to an expanded concept of data integrity, now encompassing data trustworthiness. This broader notion ensures that data are not only unmodified by unauthorized parties but also error-free, current, and sourced from reputable origins. Addressing data trustworthiness is a complex challenge, which is often contingent on the specific application domain. Solutions typically involve a synergy of various technologies, including cryptographic techniques for digital signatures [], access control to restrict data modifications to authorized parties [], data quality techniques for automatic error detection and correction, source verification technologies [], and reputation systems to assess data source credibility []. Availability, another crucial attribute, ensures that data are accessible to authorized users and that users can retrieve data as needed. The triad of confidentiality, integrity, and availability remains paramount in contemporary data security. As data collection and sharing activities intensify, the complexity of data attacks has escalated, expanding the attack surface and rendering the fulfillment of these security requirements increasingly challenging.

4.4. Sharing Privacy

In recent years, the increasing demand for data and the evolution of data sharing, alongside the attributes of confidentiality, integrity, and availability, have elevated privacy to a critical requirement. This paper addresses privacy concerns by categorizing them into data privacy and user privacy.
Numerous definitions of data privacy have been proposed over time, reflecting the evolving methods of acquiring personal information. A widely accepted definition is provided by Alan Westin, who describes data privacy as the ability of individuals, groups, or institutions to control the timing, manner, and extent to which information about them is shared with others [].
While data privacy is often equated with data confidentiality, there are distinct differences between these two concepts. Data privacy inherently requires the protection of data confidentiality, as unauthorized access undermines privacy. However, privacy encompasses additional considerations, including compliance with legal requirements, privacy regulations, and individual privacy preferences []. For instance, data sharing poses significant privacy challenges, as individuals may have differing views on sharing their data for research purposes. Consequently, systems managing privacy-sensitive data must accommodate and record the privacy preferences of individuals to whom the data pertains. Furthermore, these preferences may evolve over time. Therefore, addressing privacy issues necessitates not only the implementation of organizational access control policies but also adherence to the legal and regulatory frameworks governing data subjects’ preferences.
Beyond robust access control mechanisms, data encryption technologies are indispensable in safeguarding privacy. Privacy management in the context of big data often relies on cloud platforms, where key concerns include secure storage, computation on encrypted data, and secure communication []. Data encryption technologies address these challenges. Applications on cloud platforms typically depend on secure data storage, indexing, retrieval, and the trustworthiness of the cloud provider. Homomorphic encryption and functional encryption, previously discussed in the context of data confidentiality, are prevalent methods for protecting individual data privacy. Hu et al. introduced key-value privacy storage methods and multi-level index processing technologies utilizing homomorphic encryption to ensure that neither the data owner nor the cloud platform can be identified during the node retrieval process of user queries [].
Encryption technology and access control serve as heuristic protection mechanisms against known external threats. However, in the face of novel attacks, it becomes imperative to reformulate these protective strategies. These methods are not universally applicable within the big data environment due to the absence of a robust mathematical framework to define data privacy and potential loss. Differential privacy has emerged to address this deficiency []. This model represents a novel and robust privacy-protection technology underpinned by mathematical theory. According to its formal definition, differential privacy regulates the degree of privacy protection and the extent of privacy loss through privacy parameters, ensuring that the insertion or deletion of a single record changes the probability of any computation outcome only within a bound governed by those parameters. Furthermore, this method remains effective regardless of the attacker’s background knowledge; even if an attacker possesses information on all records except one, the privacy of that particular record remains intact. This characteristic endows differential privacy with excellent scalability. Consequently, differential privacy has become a focal point in contemporary privacy protection research. The academic community posits that differential privacy is inherently suited to big data, as the vast volume and diversity of big data render the addition or removal of a single data point minimally impactful on the overall dataset. This aligns with the fundamental principles of differential privacy.
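One canonical instantiation of differential privacy is the Laplace mechanism, which calibrates noise to the query sensitivity and the privacy parameter ε. The sketch below releases a counting query (sensitivity 1) under ε-differential privacy; the records, predicate, and ε value are illustrative.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace(0, scale)-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon: float) -> float:
    """Counting query with sensitivity 1, released under epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Smaller epsilon => more noise => stronger privacy but lower utility.
ages = [23, 35, 41, 29, 62, 58, 33]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
```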
Despite its advantages, differential privacy technology has certain limitations, such as the inability to actively manage privacy parameters. Small privacy parameters result in high privacy but low data utility, whereas large parameters yield high utility but low privacy. Thus, managing these parameters presents a challenge. Additionally, the inherent correlations within big data may diminish the effectiveness of differential privacy protection. An alternative approach to safeguarding data privacy involves conducting searches directly on ciphertext. Ciphertext retrieval technology is categorized into symmetric encryption and public-key encryption methods. Kamara et al. introduced a searchable symmetric encryption scheme that supports dynamic retrieval, offering enhanced security and retrieval efficiency []. Similarly, Abdalla et al. developed searchable public-key encryption technology that supports keyword retrieval [].
Moreover, private information retrieval technology is commonly employed to ensure query security when outsourcing data []. This technology enables users to query data on untrusted service platforms without disclosing sensitive information about the queried data. While the queried data can be public and anonymous, the service platform cannot discern its specific content. Although encrypted search technologies can also manage queries, the computational overhead associated with these methods often renders them impractical []. Privacy retrieval technologies are generally classified into two types: information theory-based retrieval methods and hardware-based computational retrieval methods. Information theory-based retrieval typically involves transmitting all data to the client for local decoding, which is unsuitable for big data due to high transmission costs. Hardware-accelerated private-information-retrieval protocols are now commonplace in security-critical domains, exemplified by applications such as genomic sequence alignment, content-based image retrieval, and location privacy services. Yet their transplantation to the big-data arena remains challenging: the confluence of petabyte-scale corpora, high query throughput, and heterogeneous storage tiers compounds bandwidth bottlenecks and cryptographic overhead, demanding new algorithmic and architectural optimizations.
Beyond protecting the data themselves, safeguarding the identity information of data users and providers is crucial, a concern known as the user privacy issue. Anonymization and cryptographic signature technologies are commonly employed to protect user privacy. The user privacy issue is especially pronounced for decentralized big data-sharing solutions because data-sharing records are, in general, publicly visible.
Anonymization involves concealing or obscuring data and its sources. This technology typically employs techniques such as suppression, generalization, analysis, slicing, and separation to anonymize data. A prominent method in this domain is k-anonymity []. When publishing relational data, k-anonymity requires that each generalized equivalence class contains at least k indistinguishable data entries, ensuring that any single data entry is indistinguishable from at least k − 1 others. However, k-anonymity has limitations, particularly in not restricting the sensitive attributes within equivalence classes, which can lead to de-anonymization. For instance, if all sensitive attributes in an equivalence class are identical, an attacker can easily infer the sensitive value. In contrast, the l-diversity method ensures that each equivalence class contains at least l distinct sensitive attribute values when anonymizing relational data []. Although l-diversity enhances the diversity of sensitive attributes, it overlooks the global distribution of these attributes, allowing attackers to deduce sensitive values with high probability. The t-closeness method was introduced to address this, requiring that the distribution of sensitive attribute values within equivalence classes aligns with the global distribution []. Additionally, methods such as min-variance and HD-composition extend k-anonymity, l-diversity, and t-closeness to handle dynamic or incremental data releases, ensuring privacy is maintained []. However, anonymization in the context of big data is more complex. Traditional anonymity technologies, designed for single datasets, take a passive approach to preventing privacy leakage; the integration and fusion of multi-source data and correlation analysis in big data environments, together with the sheer volume and diversity of the data, render them far less effective.
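The sketch below checks whether an already generalized table satisfies k-anonymity and l-diversity over a chosen set of quasi-identifiers; the record layout, attribute names, and generalization levels are illustrative.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records by their (generalized) quasi-identifier values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def satisfies_k_anonymity(records, quasi_identifiers, k):
    return all(len(group) >= k
               for group in equivalence_classes(records, quasi_identifiers).values())

def satisfies_l_diversity(records, quasi_identifiers, sensitive, l):
    return all(len({r[sensitive] for r in group}) >= l
               for group in equivalence_classes(records, quasi_identifiers).values())

# Ages and ZIP codes are already generalized into ranges/prefixes.
table = [
    {"age": "20-30", "zip": "519**", "disease": "flu"},
    {"age": "20-30", "zip": "519**", "disease": "asthma"},
    {"age": "30-40", "zip": "510**", "disease": "flu"},
    {"age": "30-40", "zip": "510**", "disease": "diabetes"},
]
print(satisfies_k_anonymity(table, ["age", "zip"], k=2))           # True
print(satisfies_l_diversity(table, ["age", "zip"], "disease", 2))  # True
```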
Secure multi-party computation (MPC) furnishes a further cryptographic layer for user-centric privacy []. In an MPC protocol, n distrusting parties jointly evaluate a public function f(x₁, x₂, …, xₙ) so that no coalition of t < n participants learns anything beyond the function output and their own inputs []. Originally deployed for privacy-preserving distributed data mining, MPC has since been scaled to big-data primitives such as undirected graph products and high-dimensional vector addition []. Complementary safeguards include ring signatures, which anonymize transaction originators among ad hoc groups [], and zero-knowledge proofs that allow a prover to demonstrate statement validity without revealing the witness [].
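One of the simplest MPC building blocks is additive secret sharing: each party splits its private input into random shares that sum to the input modulo a public modulus, and only the aggregate is ever reconstructed. The sketch below assumes a three-party secure-sum setting and a modulus chosen purely for illustration.

```python
import secrets

PRIME = 2_147_483_647  # public modulus for the share arithmetic (illustrative)

def share(secret: int, n_parties: int):
    """Split `secret` into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_sum(private_inputs):
    n = len(private_inputs)
    # Each party i shares its input; party j ends up holding one share from everyone.
    all_shares = [share(x, n) for x in private_inputs]
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Publishing only the partial sums reveals the total, not any individual input.
    return sum(partial_sums) % PRIME

assert secure_sum([10, 20, 12]) == 42
```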
While the aforementioned research offers valuable insights into big data privacy management, the limitations of these technologies are evident. Similar to anonymization, these technologies provide passive protection for specific data types. In the big data environment, characterized by high volume and diversity, these technologies often fall into a cyclical pattern. New encryption methods must be continually developed to address privacy leaks in emerging applications.

4.5. Data Traceability and Accountability

In an ethical and sustainable data-trading market, ensuring the traceability of data dissemination post-transaction is crucial for maintaining the system’s reliability. This traceability directly influences user satisfaction and trust in the system. While significant attention is often given to the data-sharing process itself, it is equally important to track data after they have been shared. However, designing a traceable data-sharing mechanism presents considerable challenges [] and is usually overlooked by existing solutions.
First, guaranteeing data traceability is difficult because attackers may employ various strategies to evade detection and identification during data propagation. Second, detecting plagiarism is challenging, as users might alter a small portion of the data and present it as a new dataset. Lastly, data agents face difficulties in identifying illegal offline data transactions. In recent years, researchers have made efforts to address these issues. One approach involves introducing a third-party trusted institution to oversee all shared transactions, embedding a watermark in each transaction’s data, and verifying the existing watermark before permitting data operations to ensure compliance with sharing rules. However, collusive users and offline data circulation can circumvent such surveillance, and issues of data plagiarism persist []. To address data originality, researchers have defined originality indices for various data types and validated these indices through experiments. Tools such as Merkle trees, digital signatures, and locally sensitive hashes are employed to detect data tampering and duplication.
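As a minimal illustration of how a Merkle tree supports tamper detection, the sketch below records the Merkle root of a dataset at sharing time and recomputes it later; any modification of a block changes the root. The block granularity and hash function are assumptions of this sketch.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Compute the Merkle root of a list of data blocks."""
    level = [_h(b) for b in blocks] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:                  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"record-1", b"record-2", b"record-3"]
root_at_share_time = merkle_root(blocks)

tampered = [b"record-1", b"record-2 (edited)", b"record-3"]
assert merkle_root(tampered) != root_at_share_time   # modification is detected
```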
Contemporary data-marketplace operators routinely advertise blockchain-based traceability as a core differentiator, citing the blockchain technology’s immutability, Byzantine fault-tolerance, and decentralized time-stamping []. Bitcoin, the first operational blockchain, fused asymmetric cryptography, digital signatures, Merkle trees, and proof-of-work to enable value transfer without a trusted third party []. Researchers subsequently repurposed this toolkit for general record-keeping, culminating in Ethereum, which is an open, programmable ledger proposed by Vitalik Buterin in 2013 that embeds Turing-complete smart contracts []. On Ethereum, every instruction, including state transitions, payment logic, and application-specific computations, is executed deterministically across the peer-to-peer network and immutably archived, yielding a transparent and tamper-evident audit trail.
These primitives have been instantiated in a variety of data-trading architectures. A common design deploys a permissionless blockchain as a settlement layer: each dataset is bound to a non-fungible file contract that registers metadata, checksum, licensing terms, and ownership assertions on-chain; purchase orders are processed atomically by invoking the contract’s transfer function, while content hashes and copyright notices persist indefinitely for provenance and anti-counterfeiting. Hybrid schemes supplement on-chain registries with off-chain contract repositories to mitigate gas fees and throughput bottlenecks.
Nevertheless, on-chain provenance does not avert downstream piracy. Once plaintext is delivered, the buyer can create indistinguishable offline duplicates or lightly modified derivatives []. Such leakage is endemic to digital goods, e.g., e-books, audio, and video, and it has historically been countered by Digital Rights Management (DRM) ecosystems that couple encryption to specialized hardware or software. Typical controls include proprietary viewers, product-key activation, tethered streaming, and continuous online identity attestation. Analogous mechanisms are indispensable in data markets: releasing cleartext exposes the vendor to irreversible copyright erosion, so usage must be constrained through fine-grained, cryptographically enforced licenses. Current DRM prototypes, however, remain inadequate for the scale, velocity, and heterogeneity of big-data assets; the following enhancements are required to align DRM with large-scale data sharing:
  • High-speed real-time online access: Unlike streaming media, data access in big data environments can be sporadic and arbitrary. Therefore, improvements in the efficiency and speed of online access are essential.
  • Improvement of dedicated software: Existing dedicated software for copyright management typically restricts users to browsing activities, such as watching videos or listening to music. In the context of data transactions, such software must also support data computation and visualization for buyers.
  • Function restriction mechanism: Certain software applications should disable screen capture functionalities and implement mechanisms to prevent buyers from capturing images or videos of the screen, thereby preventing indirect infringement.
  • Infringement detection: Beyond using product keys and continuous online identity verification to prevent infringement, it is crucial to detect instances of infringement. This can be achieved by recording the devices involved in data transmission and integrating copyright restrictions within the data transaction contract. The software should automatically assess whether infringement has occurred and, if so, penalize the buyer by disrupting data access or completely revoking usage rights.

4.6. High Quality of Service

The proliferation of network services generates substantial data, attracting users worldwide who are interested in data produced remotely. The internet’s universal infrastructure facilitates the sharing of scientific data for research and engineering data for manufacturing, and doing so has become a modern trend. Consequently, delivering big data to users according to their specific needs is imperative, a concept known as big data services [].
Contemporary big data services struggle to satisfy the concomitant hardware and software demands of petabyte-scale analytics, regardless of the underlying system architecture, necessitating the design of high-performance, end-to-end solutions []. The salient bottlenecks are summarized below.
  • Storage subsystem. Conventional electromechanical hard-disk drives exhibit random-access latency and throughput that are orders of magnitude below the ingestion and query rates required for real-time big-data workloads. Although solid-state drives (SSDs) and phase-change memory (PCM) offer substantially higher IOPS, their unit cost and limited write endurance have slowed enterprise-wide deployment.
  • Index-management algorithms. Existing data structure and indexing techniques are not co-designed with modern storage hierarchies; as a result, point and range queries remain CPU-bound despite abundant secondary-storage bandwidth. Cache-conscious, compression-aware index layouts must therefore be re-engineered to exploit both byte-addressable NVM and parallel flash arrays.
  • Secure high-bandwidth transport. Because data acquisition and service delivery are predominantly cloud-resident, multi-gigabit, wide-area transfers are routine. Packet loss, jitter, and man-in-the-middle attacks can silently corrupt or exfiltrate in-flight segments; hence, low-overhead, line-rate encryption and loss-resilient integrity checks are mandatory.
  • Compute-power scalability. Aggregate data volume now grows super-linearly with transistor density, whereas single-core clock frequencies have plateaued since 2005. Sustained quality of service, therefore, hinges on heterogeneous parallelism (GPUs, FPGAs, domain-specific accelerators) and energy-efficient cluster fabrics rather than on frequency scaling alone.
  • Timeliness guarantees. Meeting subsecond latency service-level objectives for complex analytics (streaming joins, iterative graph algorithms, deep-learning inference) demands a synergistic redesign of compute architectures, scheduling policies, and approximation algorithms; failure at any stratum propagates delay and violates business-critical deadlines.

5. Future Directions

In this section, we identify two promising future directions that use cutting-edge blockchain [] and edge-computing [] technologies to address the challenges in big data sharing.

5.1. Blockchain-Based Big Data Sharing

In recent years, blockchain technology has attracted extensive attention from industry and academia because it enables trustless data storage with auditability []. Generally speaking, a blockchain is a growing list of blocks linked by cryptographic hash functions, in which each block contains a set of transactions. The blockchain structure is maintained by a set of nodes connected in a P2P manner, collectively forming the blockchain network. The main properties of blockchain technology, illustrated by the minimal sketch after the list below, are as follows:
  • Decentralization: The blockchain is maintained by a P2P network, in which all nodes are identical and there is no central authority.
  • Transparency: The blocks and transactions are visible to all the nodes in the blockchain network and, in public blockchains, even to everyone.
  • Immutability: The data cannot be changed once stored on the blockchain because the blocks are generated individually and securely linked via cryptographic functions.
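A minimal sketch of the hash linking behind the immutability property is given below, assuming a toy in-memory chain: each block commits to the hash of its predecessor, so altering any historical block invalidates every later link. Consensus, signatures, and networking are deliberately omitted.

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "timestamp": time.time(),
                  "transactions": transactions, "prev_hash": prev})

def verify_chain(chain: list) -> bool:
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain: list = []
append_block(chain, [{"dataset": "ds-42", "action": "publish-metadata"}])
append_block(chain, [{"dataset": "ds-42", "action": "access", "sharee": "alice"}])
assert verify_chain(chain)

chain[0]["transactions"][0]["action"] = "delete"   # tampering attempt
assert not verify_chain(chain)                     # detected by the broken link
```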
Blockchain’s transparency, immutability, and decentralization furnish a natural antidote to the lack of traceability and accountability, as well as the authenticity disputes, that plague centralized sharing platforms []. Custody remains with data owners, eliminating the need to trust a third-party repository, while consensus-based validation by network nodes certifies provenance and integrity. Consequently, blockchain-mediated architectures are widely regarded as a paradigmatic solution for secure big-data exchange.
Figure 9 illustrates a reference implementation. Data sharers announce datasets by registering metadata or access pointers through blockchain-facing interfaces; the corresponding raw bytes are retained locally inside a hardware-enforced trusted execution environment (TEE). Sharees retrieve and process the data, generating immutable usage receipts. Two purpose-built ledgers, an immutable metadata chain that indexes every published asset and a sharing data chain that logs each access event, are maintained by the same peer-to-peer overlay. Both chains exploit a unified, modular stack comprising smart contract, consensus, storage, and network layers, thereby guaranteeing end-to-end verifiability without sacrificing performance or confidentiality.
Figure 9. System architecture of blockchain-based big data sharing.
The use of blockchain technology to share medical data has been extensively studied in the literature; other application domains include vehicular networks, network management, and smart grids. In these settings, blockchain technology primarily provides traceability, auditability, and immutability. There is also a substantial body of work on access control, security, privacy, incentive mechanisms, and efficiency in blockchain-based big data sharing.
Despite the significance of blockchain technology for big data sharing, many existing studies remain conceptual and lack real-world deployment. Furthermore, several challenging issues in blockchain-based big data sharing remain to be addressed. For example, because large volumes of data cannot simply be stored on a blockchain, it is unclear how to provide high performance without compromising security and privacy. Likewise, once big data are transferred from the data sharers to the sharees, it remains unknown how to prevent unauthorized re-sharing. These open problems demand insightful solutions.

5.2. Edge as Big Data-Sharing Infrastructure

Edge computing is a distributed computing paradigm that pushes data storage and computation to the location near the data source to improve the operation response time and save the network bandwidth []. In the edge-computing paradigm, multiple edge servers form a P2P network and provide services to the end devices []. Edge servers are closer to the end devices than cloud servers, meaning low latency can be achieved.
Edge computing can provide high-performance big data sharing in terms of low latency and high reliability and enhance the quality of service []. In particular, multiple edge servers form a decentralized network and provide big data-sharing services to the users, in which the decentralization leads to high reliability. Moreover, the nearby edge servers can respond to requests for data publishing, searching, and sharing with low latency. As a result, integrating big data sharing into edge computing infrastructure is promising.
Figure 10 shows a possible system architecture of edge computing as a big data-sharing infrastructure. The data sharers and sharees act as end devices and enjoy the content services provided by the collaborative edge networks. The requests for data publishing, searching, and sharing are responded to using mechanisms of content caching, distributed searching, task scheduling, etc.
Figure 10. System architecture of edge as big data-sharing infrastructure.
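As a minimal illustration of the content-caching mechanism mentioned above, the sketch below implements a least-recently-used (LRU) cache that an edge server might consult before fetching a dataset from the cloud or a peer. The class name, capacity, and fetch callback are assumptions of this sketch.

```python
from collections import OrderedDict
from typing import Callable

class EdgeContentCache:
    """LRU cache of dataset objects held at an edge server."""
    def __init__(self, capacity: int, fetch_from_origin: Callable[[str], bytes]):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self._store = OrderedDict()   # dataset_id -> content

    def get(self, dataset_id: str) -> bytes:
        if dataset_id in self._store:
            self._store.move_to_end(dataset_id)       # mark as most recently used
            return self._store[dataset_id]
        content = self.fetch_from_origin(dataset_id)  # cache miss: go to cloud/peer
        self._store[dataset_id] = content
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)           # evict the least recently used
        return content

# Usage with a stand-in origin fetcher.
cache = EdgeContentCache(capacity=2, fetch_from_origin=lambda ds: f"payload:{ds}".encode())
cache.get("ds-1"); cache.get("ds-2"); cache.get("ds-1"); cache.get("ds-3")  # evicts ds-2
```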
A substantial body of research targets big data sharing in edge computing. Various properties, e.g., revocability and privacy, have been studied for edge-empowered big data sharing. Traditional problems in edge computing, such as computation offloading, content caching, and task scheduling, have also been revisited in this context. Recently, some work has explored how blockchain and edge computing jointly empower big data sharing.
Many challenges remain for the edge as a big data-sharing infrastructure. Scheduling tasks, caching content, offloading computation, and searching data in a time-efficient way are all non-trivial in edge-based big data sharing. Moreover, security and authenticity issues must be addressed. These open problems are worth exploring.

6. Conclusions

The transition from isolated data silos to integrated data ecosystems is an imperative for realizing the full transformative potential of big data. To this end, this paper establishes a comprehensive and systematic foundation for the field of big data sharing, a domain previously characterized by fragmented knowledge and domain-specific studies. By formally defining the concept, delineating its core procedures, and cataloging its benefits and requirements, this work provides a unified conceptual framework.
A unique contribution of this survey lies in the architectural taxonomy that categorizes existing platforms into data-hosting centers, data aggregation centers, and decentralized solutions. This categorization reveals a fundamental trade-off between centralization and decentralization, compelling system designers to balance the high efficiency and data authenticity offered by centralized custodians against the resilience, owner autonomy, and trustlessness afforded by peer-to-peer networks. The analysis demonstrates that no single architecture is universally optimal; rather, the appropriate choice is contingent upon the specific requirements for control, performance, and trust within a given sharing scenario.
Furthermore, the paper synthesizes the primary technical impediments that hinder the maturation of big data sharing. These challenges span the entire data lifecycle, from the foundational prerequisites of standardizing heterogeneous data and establishing rational value assessment models, to the non-negotiable guarantees of security and privacy during sharing, and finally to the persistent post-transaction issues of traceability and accountability. The survey methodically maps these challenges to state-of-the-art countermeasures, highlighting the critical role of advanced cryptographic techniques, robust access control mechanisms, and privacy-preserving technologies.
Looking forward, the survey identifies blockchain technology and edge computing as two paradigmatic future directions. Blockchain emerges as a compelling solution to instill trust, auditability, and accountability in a decentralized manner, directly addressing the core challenges of traceability and provenance. Concurrently, edge computing presents a viable infrastructure to mitigate latency and bandwidth constraints, thereby enhancing the quality of service for real-time big data applications. By charting these open research avenues, this survey not only provides a one-stop reference for researchers and practitioners but also aims to catalyze future innovation toward the development of secure, efficient, and scalable big data-sharing ecosystems.

Funding

This research was funded by the Hong Kong Research Grant Council Theme-based Research Scheme (No. T43-513/23-N).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Ajagbe, S.A.; Mudali, P.; Adigun, M.O. Internet of things with deep learning techniques for pandemic detection: A comprehensive review of current trends and open issues. Electronics 2024, 13, 2630. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Liu, M.; Sun, M.; Deng, R.; Cheng, P.; Niyato, D.; Chow, M.Y.; Chen, J. Vulnerability of machine learning approaches applied in iot-based smart grid: A review. IEEE Internet Things J. 2024, 11, 18951–18975. [Google Scholar] [CrossRef]
  3. Salaris, S.; Ocagli, H.; Casamento, A.; Lanera, C.; Gregori, D. Foodborne event detection based on social media mining: A systematic review. Foods 2025, 14, 239. [Google Scholar] [CrossRef]
  4. Chen, M.; Mao, S.; Liu, Y. Big data: A survey. Mob. Netw. Appl. 2014, 19, 171–209. [Google Scholar] [CrossRef]
  5. Liu, X.; Cao, J.; Yang, Y.; Jiang, S. CPS-based smart warehouse for industry 4.0: A survey of the underlying technologies. Computers 2018, 7, 13. [Google Scholar] [CrossRef]
  6. Jamarani, A.; Haddadi, S.; Sarvizadeh, R.; Haghi Kashani, M.; Akbari, M.; Moradi, S. Big data and predictive analytics: A systematic review of applications. Artif. Intell. Rev. 2024, 57, 176. [Google Scholar] [CrossRef]
  7. Latupeirissa, J.J.P.; Dewi, N.L.Y.; Prayana, I.K.R.; Srikandi, M.B.; Ramadiansyah, S.A.; Pramana, I.B.G.A.Y. Transforming public service delivery: A comprehensive review of digitization initiatives. Sustainability 2024, 16, 2818. [Google Scholar] [CrossRef]
  8. Wang, R.; Xu, C.; Dong, R.; Luo, Z.; Zheng, R.; Zhang, X. A secured big-data sharing platform for materials genome engineering: State-of-the-art, challenges and architecture. Future Gener. Comput. Syst. 2023, 142, 59–74. [Google Scholar] [CrossRef]
  9. Ye, M.; Shen, W.; Du, B.; Snezhko, E.; Kovalev, V.; Yuen, P.C. Vertical federated learning for effectiveness, security, applicability: A survey. ACM Comput. Surv. 2025, 57, 1–32. [Google Scholar] [CrossRef]
  10. Mello, M.M.; Lieou, V.; Goodman, S.N. Clinical trial participants’ views of the risks and benefits of data sharing. N. Engl. J. Med. 2018, 378, 2202–2211. [Google Scholar] [CrossRef] [PubMed]
  11. Figueiredo, A.S. Data sharing: Convert challenges into opportunities. Front. Public Health 2017, 5, 327. [Google Scholar] [CrossRef]
  12. Agapito, G.; Cannataro, M. An overview on the challenges and limitations using cloud computing in healthcare corporations. Big Data Cogn. Comput. 2023, 7, 68. [Google Scholar] [CrossRef]
  13. Hajian, A.; Prybutok, V.R.; Chang, H.C. An empirical study for blockchain-based information sharing systems in electronic health records: A mediation perspective. Comput. Hum. Behav. 2023, 138, 107471. [Google Scholar] [CrossRef]
  14. Rhahla, M.; Allegue, S.; Abdellatif, T. Guidelines for GDPR compliance in Big Data systems. J. Inf. Secur. Appl. 2021, 61, 102896. [Google Scholar] [CrossRef]
  15. Wang, J.; Gao, F.; Zhou, Y.; Guo, Q.; Tan, C.W.; Song, J.; Wang, Y. Data sharing in energy systems. Adv. Appl. Energy 2023, 10, 100132. [Google Scholar] [CrossRef]
  16. Liu, Z.; Huang, B.; Li, Y.; Sun, Q.; Pedersen, T.B.; Gao, D.W. Pricing game and blockchain for electricity data trading in low-carbon smart energy systems. IEEE Trans. Ind. Inform. 2024, 20, 6446–6456. [Google Scholar] [CrossRef]
  17. Deepa, N.; Pham, Q.V.; Nguyen, D.C.; Bhattacharya, S.; Prabadevi, B.; Gadekallu, T.R.; Maddikunta, P.K.R.; Fang, F.; Pathirana, P.N. A survey on blockchain for big data: Approaches, opportunities, and future directions. Future Gener. Comput. Syst. 2022, 131, 209–226. [Google Scholar] [CrossRef]
  18. Khan, N.; Yaqoob, I.; Hashem, I.A.T.; Inayat, Z.; Mahmoud Ali, W.K.; Alam, M.; Shiraz, M.; Gani, A. Big data: Survey, technologies, opportunities, and challenges. Sci. World J. 2014, 2014, 712826. [Google Scholar] [CrossRef] [PubMed]
  19. Arora, S.; Kumar, M.; Johri, P.; Das, S. Big heterogeneous data and its security: A survey. In Proceedings of the 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 29–30 April 2016; pp. 37–40. [Google Scholar]
  20. Yang, P.; Xiong, N.; Ren, J. Data security and privacy protection for cloud storage: A survey. IEEE Access 2020, 8, 131723–131740. [Google Scholar] [CrossRef]
  21. Ferradi, H.; Cao, J.; Jiang, S.; Cao, Y.; Saxena, D. Security and privacy in big data sharing: State-of-the-art and research directions. arXiv 2022, arXiv:2210.09230. [Google Scholar]
  22. Liu, L.; Han, M. Data sharing and exchanging with incentive and optimization: A survey. Discov. Data 2024, 2, 2. [Google Scholar] [CrossRef]
  23. Liang, H.; Zhang, Z.; Hu, C.; Gong, Y.; Cheng, D. A Survey on Spatio-temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Trans. Big Data 2024, 10, 174–193. [Google Scholar] [CrossRef]
  24. Almeida, A.; Brás, S.; Sargento, S.; Pinto, F.C. Time series big data: A survey on data stream frameworks, analysis and algorithms. J. Big Data 2023, 10, 83. [Google Scholar] [CrossRef]
  25. Selmy, H.A.; Mohamed, H.K.; Medhat, W. Big data analytics deep learning techniques and applications: A survey. Inf. Syst. 2023, 120, 102318. [Google Scholar] [CrossRef]
  26. Lv, Z.; Qiao, L. Analysis of healthcare big data. Future Gener. Comput. Syst. 2020, 109, 103–110. [Google Scholar] [CrossRef]
  27. Talebkhah, M.; Sali, A.; Marjani, M.; Gordan, M.; Hashim, S.J.; Rokhani, F.Z. IoT and big data applications in smart cities: Recent advances, challenges, and critical issues. IEEE Access 2021, 9, 55465–55484. [Google Scholar] [CrossRef]
  28. Kumar, A.; Sangwan, S.R.; Nayyar, A. Multimedia social big data: Mining. In Multimedia Big Data Computing for IoT Applications: Concepts, Paradigms and Solutions; Springer: Berlin/Heidelberg, Germany, 2020; pp. 289–321. [Google Scholar]
  29. Nelufule, N.; Senamela, P.; Moloi, P. Digital Forensics Investigations on Evolving Digital Ecosystems and Big Data Sharing: A Survey of Challenges and Potential Opportunities. In Proceedings of the 2025 IST-Africa Conference, Nairobi, Kenya, 28–30 May 2025; pp. 1–12. [Google Scholar]
  30. Hemmati, A.; Arzanagh, H.M.; Rahmani, A.M. A taxonomy and survey of big data in social media. Concurr. Comput. Pract. Exp. 2024, 36, e7875. [Google Scholar] [CrossRef]
  31. Khan, S.; Liu, X.; Shakil, K.A.; Alam, M. A survey on scholarly data: From big data perspective. Inf. Process. Manag. 2017, 53, 923–944. [Google Scholar] [CrossRef]
  32. Adler-Milstein, J.; Garg, A.; Zhao, W.; Patel, V. A survey of health information exchange organizations in advance of a nationwide connectivity framework. Health Aff. 2021, 40, 736–744. [Google Scholar] [CrossRef]
  33. Manzoor, A.; Braeken, A.; Kanhere, S.S.; Ylianttila, M.; Liyanage, M. Proxy re-encryption enabled secure and anonymous IoT data sharing platform based on blockchain. J. Netw. Comput. Appl. 2021, 176, 102917. [Google Scholar] [CrossRef]
  34. Jacobs, A. The pathologies of big data. Commun. ACM 2009, 52, 36–44. [Google Scholar] [CrossRef]
  35. Lazer, D.; Kennedy, R.; King, G.; Vespignani, A. The parable of Google Flu: Traps in big data analysis. Science 2014, 343, 1203–1205. [Google Scholar] [CrossRef]
  36. Ginsberg, J.; Mohebbi, M.H.; Patel, R.S.; Brammer, L.; Smolinski, M.S.; Brilliant, L. Detecting influenza epidemics using search engine query data. Nature 2009, 457, 1012–1014. [Google Scholar] [CrossRef]
  37. Hossain, M.A.; Dwivedi, Y.K.; Rana, N.P. State-of-the-art in open data research: Insights from existing literature and a research agenda. J. Organ. Comput. Electron. Commer. 2016, 26, 14–40. [Google Scholar] [CrossRef]
  38. Wu, H.; Cao, J.; Jiang, S.; Yang, R.; Yang, Y.; Hey, J. TSAR: A fully-distributed trustless data sharing platform. In Proceedings of the 2018 IEEE International Conference on Smart Computing (SMARTCOMP), Taormina, Italy, 18–20 June 2018; pp. 350–355. [Google Scholar]
  39. Cuzzocrea, A.; Damiani, E. Privacy-preserving big data exchange: Models, issues, future research directions. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 5081–5084. [Google Scholar]
  40. Zhang, M.; Beltrán, F.; Liu, J. A survey of data pricing for data marketplaces. IEEE Trans. Big Data 2023, 9, 1038–1056. [Google Scholar] [CrossRef]
  41. Azcoitia, S.A.; Laoutaris, N. A survey of data marketplaces and their business models. ACM SIGMOD Rec. 2022, 51, 18–29. [Google Scholar] [CrossRef]