1. Introduction
The effective application of data helps enterprise supply chains transform business processes, innovate in products and services, and improve supply chain operations. Enterprises that want to maintain their competitive advantage must pay attention to the application of big data, which requires extensive access to data from internal and external sources. Data trading across organizations and between chains therefore becomes an important means to strengthen data resource integration, open up information silos, and activate data assets. However, data trading faces risks in practice, such as unclear ownership, complicated authorization, lack of transaction transparency, and privacy leakage.
Figure 1 shows a general data trading scenario. The data user sends a data purchase request, and the data owner responds to the request. When the data user pays, the data owner embeds the watermark into the data and sends it to the data user. However, because data are virtual, non-exclusive, and can be copied without loss, they are easily tampered with, resold, leaked, and used beyond the agreed scope during circulation. The zero marginal production cost and the difficulty of complete physical delivery make it impossible for the ownership, use, and control of data to be delivered uniformly. Therefore, the static data watermarking model cannot be applied to dynamic data market transactions.
Blockchain, first proposed by Satoshi Nakamoto [1], is a public ledger maintained by decentralized nodes for the distributed sharing and storage of data. Blockchain has the characteristics of decentralization, anonymity, privacy, traceability, and tamper resistance, which have attracted great attention from academia and industry (such as supply chain, Internet of things, and medical fields) [2]. New concepts, such as smart contracts [3] and smart attributes that originated from blockchain technology, were quickly accepted by the economic market. A smart contract is a computer transaction protocol that enforces the terms of a contract, allows trusted transactions without third parties, and ensures that those transactions are traceable and irreversible. In recent years, blockchain technology has been successfully applied to IoT platforms [4,5], medical data sharing systems [6], data privacy protection [7], supply chains [8,9,10], biomedical research [11], and financial transactions [12]. The decentralization of the blockchain paves the way for data transactions in the supply chain data marketplace.
Related work. Zhao, Y. et al. [13] proposed a new protocol for distributed data transactions that uses ring signatures to enhance the privacy of data provider identities. In a ring signature, a user selects a group of users, called the ring, to generate a signature; the verifier can be confident that the signature was generated by a member of the ring but cannot determine which member actually generated it. The protocol also extends double-authentication-preventing signatures (DAPS) to penalize signers who generate two signatures for messages with the same title and different payloads, which guarantees the fairness of transactions between data providers and data consumers. Xiang, Y. et al. [14] proposed a smart-contract-based data trading scheme. The scheme uses smart contracts to ensure fairness of data sharing and data copyright in transactions and minimizes the risk of partial/combined resale or leakage of data by using a multi-type-based watermarking strategy. Jing, N. et al. [15] proposed a blockchain-based code copyright management system. An originality verification model for code, based on the abstract syntax tree, is applied to the verification process of the blockchain to realize copyright verification and protection of original code. However, the system suffers from the cost of originality verification and the verifier's dilemma [16]. Xu, Y. et al. [17] proposed a game-theory-based Nash equilibrium model between watermarking robustness and data quality. The model uses a secure hashing algorithm to establish the mapping between data groups and watermark bits, uses an improved particle swarm optimization algorithm to find the optimal data variation for each data group under the data availability constraint, and then modifies the data accordingly to embed the watermark bits and protect the copyright of the data. Kumar, R. et al. [18] proposed a distributed image- and video-sharing platform based on IPFS (InterPlanetary File System). The platform detects copyright infringement of multimedia by calculating the similarity between perceptual hashes (pHash) stored in the blockchain. Nasonov, D. et al. [19] proposed a distributed big data platform in which a blockchain-based distributed digital data market is used to ensure the integrity of data transactions. Zhou, J. et al. [20] addressed the trade-off between the effectiveness of data retrieval and the leakage risk of data indexing in distributed data transactions and proposed a framework for distributed data transactions (DDV) that combines data embedding and similarity learning. The framework uses a privacy-preserving data-embedding procedure as an input to measure the similarity between data entries and achieves effective retrieval in data transactions while preserving data privacy. Elias Strehle and Martin Maurer [21] proposed the DibiChain protocol for the discovery and exchange of supply chain information. It is built on top of a distributed data store and maintains a high degree of anonymity and unlinkability while ensuring a high degree of privacy by minimizing data in the shared data store, avoiding persistent user identifiers, and communicating anonymously with minimal intermediaries. Nawaz, A. et al. [22] proposed EdgeBoT, a smart-contract-based platform for IoT that considers the potential changes in interaction topology in data transaction scenarios. EdgeBoT enables more diverse interaction topologies between nodes in the network and external services, allowing direct data transactions at edge devices while guaranteeing data ownership and end-user privacy.
However, most of the current research on data ownership confirmation in data trading is focused on improving digital watermarking technology and similarity detection. This can only cover the detection of illegal data and cannot fundamentally cover the accurate tracing and timely accountability of illegal data. The current trading platform construction has no standard system for data ownership verification, traceability, and accountability.
Our contributions. This paper proposes a data ownership confirmation scheme (DOCS) for distributed data asset trading in the supply chain system, with a credible and accountable architecture. We study in detail the structural methods of data storage, traceability, and accountability. (1) We adopt data signatures and similarity learning to enhance the reliable mapping between on-chain data ownership and off-chain data entities, which effectively maintains the integrity of off-chain data. (2) We propose a smart-contract-based data fingerprint generation protocol with a two-part structure of mutual identity verification and data fingerprint generation. This ensures channel security in anonymous transaction networks and also achieves accurate traceability and market tracking of illegal data transactions. (3) We design a market supervision mechanism empowered by smart contracts to encourage market users to assist in prosecuting illegal data transactions in a timely manner.
The rest of this paper is structured as follows.
Section 2 introduces the basic applications of DOCS, including data signatures, similarity learning, and smart contracts.
Section 3 describes the structure of DOCS and the workflow and defines common data tenure attack models.
Section 4 provides a security analysis of DOCS and demonstrates that DOCS can resist attacks on data tenure in data transactions.
Section 5 evaluates the encoding performance and decoding performance of data-embedding techniques with supply chain data, and the experimental results show that data signatures can be used as reliable credentials for data ownership confirmation.
Section 6 provides a conclusion.
3. DOCS
3.1. DOCS Overview
In DOCS, Blockchain Ethereum serves as the underlying blockchain infrastructure for the transaction network, on which a combination of smart contract features can be enabled. There are three participants: the data owner, the data user, and market users.
Blockchain Ethereum: Blockchain Ethereum is an open source public blockchain with smart contract functionality. The data owner and data user trade data on the Blockchain Ethereum.
Data owner: The data owner is usually the producer of the data. They maintain a list of topics to advertise the data for sale, register the published data with the Blockchain Ethereum, and generate a topic transaction.
Data user: The data user is usually a buyer of data. They query the data list through the Blockchain Ethereum and generate payment transactions to purchase the data.
Market users: Market users are data users who collect and trade data through the black market. They are rewarded for assisting in the prosecution of data users for illegal transactions.
The workflow of DOCS is shown in Figure 3. The data owner first publishes the list of topics on the Blockchain Ethereum. The data user searches the Blockchain Ethereum and requests data on a specific topic, and the transaction enters the publication stage. The data owner generates the data signature through the data-embedding function and requests to upload it to the transaction list. After the data user checks the availability of the data, the transaction enters the verification stage. The two parties verify each other's identity through the smart-contract-based data fingerprint generation protocol, and after the verification passes, the transaction enters the payment stage. After the data user completes the payment, they obtain the subject data with the data fingerprint embedded, and the transaction enters the supervision stage. Within the validity period of the supervision stage, market users obtain rewards by assisting in prosecuting data users for illegal transactions; market users may also choose to resell data for profit.
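The stage sequence described above can be sketched as a small state machine. This is only an illustrative sketch: the stage names come from the text, while the event names (`data_available`, `identities_verified`, and so on) and the `TERMINATED` stage label are hypothetical and not taken from the scheme.

```python
from enum import Enum


class Stage(Enum):
    """Transaction stages named in the DOCS workflow."""
    PUBLICATION = "publication"
    VERIFICATION = "verification"
    PAYMENT = "payment"
    SUPERVISION = "supervision"
    TERMINATED = "terminated"  # hypothetical label for a failed payment


# Allowed transitions; the event names are hypothetical labels.
TRANSITIONS = {
    (Stage.PUBLICATION, "data_available"): Stage.VERIFICATION,
    (Stage.VERIFICATION, "identities_verified"): Stage.PAYMENT,
    (Stage.PAYMENT, "payment_completed"): Stage.SUPERVISION,
    (Stage.PAYMENT, "payment_failed"): Stage.TERMINATED,
}


def advance(stage: Stage, event: str) -> Stage:
    """Return the next stage, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in stage {stage.value}"
        ) from None


stage = Stage.PUBLICATION
for event in ("data_available", "identities_verified", "payment_completed"):
    stage = advance(stage, event)
print(stage.value)  # supervision
```

Keeping the transitions in one table makes it easy to see that a transaction cannot reach the supervision stage without passing verification and payment first.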
Adversary models. Following the real process of data trading, we identify six typical attacks in distributed data transaction scenarios. They are the most common attacks on data ownership and the most threatening attacks on data transaction systems. The specific definitions of these attacks are as follows:
Definition 1. False identity attack. During the transaction process, the counterparty uses a false identity to evade the tracking of data fingerprint.
Definition 2. Repeat confirmation attack. After the adversary obtains the data copy of the data owner, it slightly modifies the data copy to obtain a new data ownership certificate, which is confirmed and traded on the chain.
Definition 3. Data corruption attack. Data are often stored and processed with the risk of data corruption, such as data loss and data distortion.
Definition 4. Illegal distribution. After the adversary obtains the data copy of the data owner, it circumvents the on-chain transaction network and conducts anonymous transactions off-chain.
Definition 5. Shared key attack. The adversary gets access to the data owner’s encrypted data and causes data leakage by sharing the data decryption key.
Definition 6. Transaction fraud. In a transaction, the buyer and seller do not stay synchronized in the process of payment and delivery, specifically one peer is spoofed by another peer, resulting in the loss of data or tokens.
The key symbols used in data trading are presented in Table 1.
3.2. Workflow of DOCS
Publish. We argue that publishing data to the Blockchain Ethereum constitutes the confirmation of data ownership and can serve as valid proof of data ownership. Algorithm 1 describes the publishing process implemented with a smart contract.
Step 1: The data owner uploads a data signature based on data-embedding technology, and the miner adds it to the blockchain transaction list L after verifying its legitimacy. The data user retrieves the data signature and, after verifying the availability of the data through similarity learning, moves to the transaction verification phase.
Algorithm 1: Contract_publish
Input: , Issure, contract_state
Output: , contract_state
1. if = true
2.     update to L
3.     renew
4.     contract_state = verification
5. else
6.     return an error
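The control flow of Algorithm 1 can be sketched in Python as follows. This is a toy stand-in and not smart contract code: the `is_legitimate` callback represents the miner's legitimacy check, whose details are not specified here, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class PublishContract:
    """Toy model of Contract_publish; names are illustrative only."""
    # Hypothetical legitimacy check performed by the miner.
    is_legitimate: Callable[[Sequence[float]], bool]
    # Transaction list L holding accepted data signatures.
    transaction_list: List[Sequence[float]] = field(default_factory=list)
    contract_state: str = "publication"

    def publish(self, signature: Sequence[float]) -> str:
        """Append a verified signature to L and move to verification."""
        if not self.is_legitimate(signature):
            # Corresponds to the "return an error" branch.
            raise ValueError("data signature failed verification")
        self.transaction_list.append(signature)
        self.contract_state = "verification"
        return self.contract_state


# Example: accept any non-empty signature vector (an arbitrary rule).
contract = PublishContract(is_legitimate=lambda s: len(s) > 0)
print(contract.publish([0.12, 0.55, 0.31]))  # verification
```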
Verification. Before the data user can pay, an identifiable data fingerprint is needed. In data transaction scenarios, data fingerprinting protocols that rely on third parties do not support anonymity, and leakage of fingerprint information also creates risks for transaction participants. DOCS relies on data fingerprints to trace user and owner identities, so the security and trustworthiness of fingerprints are very important for member management and accountability tracking. In DOCS, a necessary but not sufficient condition for the credibility of a data fingerprint is verifying the identity of the other party. We therefore propose a data fingerprint generation protocol based on smart contracts. The framework of the protocol is shown in Figure 4. The protocol requires mutual authentication of the participants' identities, confirmation of the sender's identity, and channel security, and it then generates the data fingerprint. The process is as follows:
(1) Authentication initialization
Step 2: The data owner and data user obtain their certificates through CA authentication. The certificate structure is as follows:
The initialization smart contract generates random numbers and , the data owner computes , and the data user computes and uploads them to the smart contract along with the certificate.
Step 3: The data user computes and uploads and to the smart contract; the data owner computes and uploads and to the smart contract.
(2) Session key authentication
Step 4: The data owner decrypts to get and . The data owner then sends to the smart contract. The data user decrypts to get and . The data user then sends to the smart contract.
(3) Identity verification
Step 5: The data owner gets ; the data user gets .
(4) Generate data fingerprint
Step 6: Smart contract generates a
by
.
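The final step of the protocol, in which a fingerprint is produced only after both identities have been verified, can be illustrated with the sketch below. The derivation shown here (a SHA-256 hash over both parties' identifiers and the two contract-generated random numbers) is an assumption made for illustration; the actual generation function of the scheme is not reproduced here, and all parameter names are hypothetical.

```python
import hashlib
import secrets


def generate_fingerprint(owner_id: bytes, user_id: bytes,
                         r_owner: bytes, r_user: bytes,
                         owner_verified: bool, user_verified: bool) -> str:
    """Derive a fingerprint bound to both parties (assumed construction)."""
    # Mutual identity verification is a precondition for the fingerprint.
    if not (owner_verified and user_verified):
        raise PermissionError("mutual identity verification has not passed")
    h = hashlib.sha256()
    for part in (owner_id, user_id, r_owner, r_user):
        # Length-prefix each field so field boundaries are unambiguous.
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.hexdigest()


# Random numbers standing in for the values generated by the contract.
r_owner, r_user = secrets.token_bytes(16), secrets.token_bytes(16)
fp = generate_fingerprint(b"owner-cert", b"user-cert",
                          r_owner, r_user, True, True)
print(len(fp))  # 64 hex characters
```

Binding the fingerprint to both identities means that a copy found later can be linked to one specific transaction between one owner and one user, which is the property the supervision stage depends on.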
Payment. Before the data user pays, both parties place a deposit in the smart contract. If the payment succeeds, the deposits are returned after a time limit, and the transaction enters the supervision stage. If the payment fails, the data user compensates the data owner for the loss, and the transaction is terminated. Algorithm 2 describes the payment process implemented with smart contracts.
Step 7: The data user pays the data price; the data owner embeds the data fingerprint into the corresponding subject data, encrypts the result with the session key, and uploads the encrypted data to the cloud storage; the data user then downloads and decrypts the data.
Algorithm 2: Contract_payment
Input: , , , contract_state
Output: contract_state
1. if DU initiates a payment to DO
2.     DO to embed in the data
3.     DO sends to cloud storage
4.     return to DO after
5.     return to DU after
6.     contract_state = supervision
7. else
8.     termination transaction
9.     destroy
10.    return to DO
11.    DU compensates the loss from to the DO
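The two branches of Algorithm 2 can be sketched as a settlement function. This is a simplified illustration under stated assumptions: the time limit before deposits are refunded is not modeled, the compensation is assumed to be taken from the data user's deposit, and the names `settle_payment` and `PaymentOutcome` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PaymentOutcome:
    contract_state: str
    owner_receives: float  # total paid out to the data owner (DO)
    user_receives: float   # total paid back to the data user (DU)


def settle_payment(paid: bool, price: float, owner_deposit: float,
                   user_deposit: float, compensation: float) -> PaymentOutcome:
    """Settle deposits for a successful or failed payment."""
    if compensation > user_deposit:
        raise ValueError("compensation cannot exceed the data user's deposit")
    if paid:
        # Success: DO gets the price; both deposits are refunded
        # (in the scheme this happens after a time limit, omitted here).
        return PaymentOutcome("supervision",
                              owner_deposit + price, user_deposit)
    # Failure: the transaction terminates, DO's deposit is returned,
    # and DU compensates DO out of DU's own deposit.
    return PaymentOutcome("terminated",
                          owner_deposit + compensation,
                          user_deposit - compensation)


print(settle_payment(True, 10.0, 5.0, 5.0, 2.0).contract_state)   # supervision
print(settle_payment(False, 10.0, 5.0, 5.0, 2.0).contract_state)  # terminated
```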
Supervision. There are two ways in which market users can obtain illegal copies of data: first, direct transactions between a data user and market users; second, transactions between market users. We strongly encourage all participants in the data market to assist in prosecuting a data user for unlawful conduct. When a market user purchases an illegal copy of the data, the smart contract uploads the data fingerprint and sends a verification request to the miner. If the data fingerprint is valid and its upload time t falls within an ownership protection period, 1 is returned to Contract_prosecution and the market user's prosecution succeeds. If the data fingerprint is invalid, or t exceeds one ownership protection period, 0 is returned and the prosecution fails. Algorithm 3 shows this process implemented with a smart contract. If 1 is returned, the smart contract Contract_payment issues a reward from the deposit to the market user who successfully assisted with the prosecution. It then pays compensation to the data owner, calculated from the initial price of the data. Algorithm 4 describes the reward process implemented with a smart contract.
Definition 7. Reward mechanism. The price paid by market user for a copy of the data is . The reward for a successful prosecution is . There is a scale factor for and for the price of the data, , and decreases in steps as the number of Contract_prosecution triggers increases.
We assume that the data compensation is gradually reduced but not to zero, so that the deposit of data user always meets the requirements. The reason why this assumption can be made is that when the reward is low enough, the data owner has received enough compensation, and they are also satisfied that the data user will pay enough for the illegal distribution of data copies. Therefore, this game model is still valid.
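A stepwise decreasing reward that never reaches zero can be sketched as follows. The schedule of scale factors is purely hypothetical; the text only states that the scale factor decreases in steps as the number of Contract_prosecution triggers grows and that the compensation is never reduced to zero.

```python
from typing import Sequence

# Hypothetical step schedule of scale factors (not from the scheme).
DEFAULT_SCALES = (0.5, 0.4, 0.3, 0.2, 0.1)


def reward(price: float, trigger_count: int,
           scales: Sequence[float] = DEFAULT_SCALES) -> float:
    """Reward for a successful prosecution after `trigger_count` earlier ones."""
    if trigger_count < 0:
        raise ValueError("trigger_count must be non-negative")
    # Clamp to the last step so the reward stays above zero.
    index = min(trigger_count, len(scales) - 1)
    return scales[index] * price


# Rewards for the first seven prosecutions of data priced at 100.
rewards = [reward(100.0, k) for k in range(7)]
print(rewards[0] > rewards[-1] > 0)  # True
```

Clamping at the last step reflects the assumption that the reward decreases but does not vanish, so later prosecutors still have an incentive to sue.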
Algorithm 3: Contract_prosecution
Input: account()
Output: return 0 or 1
1. if
2.
3.     return 1 to contract_payment.rep(T)
4. else
5.     return 0 to contract_payment.rep(T)
6.     termination transaction
Algorithm 4: Expansion of Contract_payment
1. func rep( ):
2.     var {account(), T, b, t}
3.     if (T = 1) (t )
4.         successful prosecution
5.         send bi to account() from
6.         send to DO
7.     else if (T = 0) (t )
8.         prosecution failed
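The interplay of Algorithms 3 and 4 can be sketched as two functions: one that checks a prosecution request and one that distributes funds. This is a sketch under assumptions: the fingerprint check is modeled as membership in a set of registered fingerprints, time is measured from the start of the ownership protection period, and both the reward and the compensation are assumed to be drawn from a single deposit. All parameter names are hypothetical.

```python
from typing import Set, Tuple


def contract_prosecution(fingerprint: str, registered: Set[str],
                         upload_time: float,
                         protection_period: float) -> int:
    """Return 1 for a valid, timely prosecution and 0 otherwise."""
    valid = fingerprint in registered
    in_period = 0 <= upload_time <= protection_period
    return 1 if (valid and in_period) else 0


def rep(result: int, deposit: float, reward: float,
        compensation: float) -> Tuple[float, float, float]:
    """Return (to_market_user, to_data_owner, remaining_deposit)."""
    if result != 1:
        # Prosecution failed: nothing is paid out.
        return (0.0, 0.0, deposit)
    if reward + compensation > deposit:
        raise ValueError("deposit cannot cover reward and compensation")
    return (reward, compensation, deposit - reward - compensation)


registered = {"fp-123"}
t = contract_prosecution("fp-123", registered, upload_time=3.0,
                         protection_period=30.0)
print(rep(t, deposit=10.0, reward=2.0, compensation=3.0))  # (2.0, 3.0, 5.0)
```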
4. Security Analysis
In this section, we prove that DOCS can defend against various types of attacks on data ownership in data transaction scenarios.
Theorem 1. DOCS can resist false identity attacks.
Proof. There are two ways in which a data user can provide a false identity, which are analyzed as follows:
(1) The certificate itself is invalid. In DOCS, when the data owner receives , it will be verified by to . If the verification is successful, it means that is valid. If the verification fails, it means that the data user holds an invalid certificate. Likewise, the data user can verify the data owner's certificate in the same way.
(2) Whether the data subject is the true owner of the certificate. A data user may try to send another person's certificate to circumvent the smart-contract-based fingerprint generation protocol and avoid fingerprint tracking. The most effective way for DOCS to verify that the data subject is the true owner of the certificate is to verify that the data subject actually owns the private key of . The data owner can obtain , and of the data user during the transaction verification stage. From , it can be reversibly deduced to , and the data owner calculates . If and , it means that the data user’s is correct. If , the identity of the data user is correct. Likewise, the data user can authenticate the data owner in the same way. The data user can obtain , and of the data owner during the transaction verification stage. From , it can be reversibly deduced to , and the data user can calculate . If and , it means that the data owner’s is correct. If , the identity of the data owner is correct.
Therefore, no matter how the adversary provides false identity information, it will be detected, and DOCS can resist false identity attacks. □
Theorem 2. DOCS can defend against repeat confirmation attacks.
Proof. After the data user purchases and obtains the data entity X, corresponding to , they attempt to slightly modify it and obtain a new data signature to upload to the blockchain network for ownership confirmation and to initiate a transaction. Data signatures based on data-embedding techniques are essentially abstract features of the data entity X. A data signature can be represented by a vector ; Equation (5) calculates the distance between different data signatures, and the metric learning algorithm can be extended to a multi-task setting when there are many tasks (Equation (6)). In the following, we describe how this attack can be intercepted by a combination of data signatures and similarity learning. □
We introduce YODA [31] into the defense process of DOCS to demonstrate that our defense is more robust. First, the anchor node broadcasts ’s similarity learning request R to the entire network, and its retrieval scope includes the list of all serialized data signatures on the Blockchain Ethereum. The initialization smart contract pseudo-randomly selects miner to join the execution set of R. performs the similarity retrieval task of independently, returns the execution result and broadcasts it to other miners in , where bool represents whether the execution result of R is true or false; represents the signature of miner , represents the result of RICE [31], and the miners who execute it reach a consensus result through the PBFT consensus protocol. Then, the anchor node broadcasts to , regenerates and re-executes R. maintains the result set , where means X is original data, means X is duplicate data, and means X is in dispute and must be submitted manually for verification. Finally, decides the final execution result from the result set through likelihood estimation; serializes the result and sends feedback to to terminate the computation. Therefore, the repeat confirmation attack of the data user can always be blocked by DOCS.
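The core idea of the defense, in which several miners independently compare a candidate signature against the on-chain signatures and their results are combined, can be sketched as follows. The sketch makes two substitutions that should be stated plainly: Euclidean distance with a fixed threshold stands in for the learned similarity metric, and a simple majority vote stands in for the PBFT consensus and likelihood estimation steps. The function names and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence


def is_duplicate(candidate: Sequence[float],
                 ledger: List[Sequence[float]],
                 threshold: float) -> bool:
    """One miner's check: is any on-chain signature within `threshold`?"""
    return any(math.dist(candidate, sig) <= threshold for sig in ledger)


def decide(votes: List[bool]) -> str:
    """Combine miners' boolean results by simple majority (an assumption)."""
    duplicate_votes = sum(votes)
    original_votes = len(votes) - duplicate_votes
    if duplicate_votes > original_votes:
        return "duplicate"
    if original_votes > duplicate_votes:
        return "original"
    return "disputed"  # tie: submit manually for verification


# Two signatures already on chain; the candidate is a slight modification.
ledger = [(0.0, 0.0), (1.0, 1.0)]
votes = [is_duplicate((0.01, 0.0), ledger, threshold=0.1) for _ in range(3)]
print(decide(votes))  # duplicate
```

A slightly modified copy produces a signature close to an existing one, so it is flagged as a duplicate, whereas a signature far from every on-chain entry would be accepted as original.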
Theorem 3. DOCS can defend against data corruption attacks.
Proof. Data storage and processing are often accompanied by risks such as data loss and data distortion. Since the data embedding is reversible, the data owner can decode the data signature with the decoder and recover X. The data signature is stored in the Blockchain Ethereum as an ownership credential, so the recovered X is permanently trustworthy because of the tamper-proof nature of the blockchain. The data owner can use X as the credential to audit the off-chain data entity and effectively maintain its integrity. Therefore, DOCS can resist data corruption attacks. □
Theorem 4. DOCS can defend against illegal distribution.
Proof. The illegal distribution of data cannot be realized in our data transaction network based on Blockchain Ethereum because it will be blocked in the transaction response stage. Data users therefore often choose to avoid on-chain transactions and resell copies of data on the black market. DOCS relies on credible data fingerprints and timely incentives to encourage market users to sue data users for illegal distribution in an anonymous network. We denote the set of market users who purchase data copies in the black market as and ; market users want to maximize their own profits no matter how they obtain data copies. The policy space of is , and the action set of can be expressed as . We analyze the market users' benefit matrix under the four actions, as shown in Table 2. □
Starting from the row of Table 2, the user gets the greatest benefit when executing ; starting from the column of Table 2, the user who sues the earliest can always get the greatest benefit. Therefore, the illegal distribution of the data user can always be traced back in time. In Table 3, we analyze the payoff matrix of market users, data owner, and data user in the case of action .
From Table 3, we can see that if the th market user successfully sues, the return to the is , the compensation to the data owner is , and the loss to the data user is . Therefore, the sued data user will face huge compensation beyond the value of the data itself. Under this kind of game, rational data users will not distribute copies of data illegally on the black market. Even if data users choose to distribute copies of data illegally, data owners will not suffer losses.
In the supervision model, the earlier a market user sues, the more rewards they can receive. A market user has to sue before other market users in order to get higher rewards, so this creates a competitive relationship between market users. In most cases, market users do not know the source of illegal data copies. We can rely on this competitive relationship to encourage market users to initiate timely assistance with prosecutions and improve the timeliness of the monitoring model. Although we cannot eradicate the continuous distribution of illegal data copies and need to rely on competition to encourage market users to file lawsuits in a timely manner, we have established a game relationship between market users and data users in this way. If data users choose to illegally distribute copies of data to the black market, they are likely to face high penalties in a short space of time. Moreover, the penalties are much higher than the benefits obtained by illegally distributing data copies. In this game, data users are forced to remain rational.
Theorem 5. DOCS can defend against shared key attacks.
Proof. In DOCS, the data owner encrypts the data with a randomly generated temporary session key. When generating the session key, the data owner can set permissions and policies so that the key can only be used within the specified scope of permissions. For example, the key can become invalid after the data are decrypted by the data user, or it can expire after a certain time limit has been exceeded. Therefore, the data user cannot leak data by sharing the key. □
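The two example policies, single use and expiry after a time limit, can be sketched as a small key object. This sketch only models the policy check itself; how such a policy is enforced on the decrypting side is not specified in the text, and the class name, field names, and default values are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionKey:
    """Temporary session key with a use limit and a time limit."""
    ttl_seconds: float
    max_uses: int = 1
    key: bytes = field(default_factory=lambda: secrets.token_bytes(32))
    created_at: float = field(default_factory=time.monotonic)
    uses: int = 0

    def release(self, now: Optional[float] = None) -> bytes:
        """Hand out the key only while the policy still allows it."""
        now = time.monotonic() if now is None else now
        if now - self.created_at > self.ttl_seconds:
            raise PermissionError("session key expired")
        if self.uses >= self.max_uses:
            raise PermissionError("session key already used")
        self.uses += 1
        return self.key


session = SessionKey(ttl_seconds=60.0)
first = session.release()
print(len(first))  # 32
```

A second call to `session.release()` raises `PermissionError`, which mirrors the statement that the key becomes invalid after the data have been decrypted once.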
Theorem 6. DOCS can defend against transaction fraud.
Proof. Case 1: the data owner attempts to refuse delivery of the data after the data user has paid. In this case, the data user may set a return value and a time limit in the payment phase of the smart contract, and return the value that triggers the smart contract to take effect only after receiving the complete data. If the data user does not return the value in time after receiving the data, the smart contract automatically takes effect once the time limit is exceeded, and the data owner is paid.
Case 2: the data user attempts to refuse payment after the data owner delivers the data. In DOCS, the data owner does not send the encrypted data until the data user has paid, so this case cannot occur. □
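The release rule for the locked payment can be sketched as a single predicate. This is an illustrative sketch with hypothetical parameter names: the payment is released when the data user confirms receipt, or automatically once the time limit after delivery has passed, and it stays locked if nothing was delivered.

```python
from typing import Optional


def owner_is_paid(delivered_at: Optional[float],
                  confirmed_at: Optional[float],
                  now: float, time_limit: float) -> bool:
    """Decide whether the locked payment is released to the data owner."""
    if delivered_at is None:
        # Nothing was delivered, so the payment stays locked.
        return False
    if confirmed_at is not None:
        # The data user returned the agreed value after receiving the data.
        return True
    # No confirmation: release automatically once the time limit passes.
    return now - delivered_at >= time_limit


print(owner_is_paid(None, None, now=100.0, time_limit=10.0))   # False
print(owner_is_paid(50.0, 55.0, now=56.0, time_limit=10.0))    # True
print(owner_is_paid(50.0, None, now=61.0, time_limit=10.0))    # True
```

The first call corresponds to an owner who never delivers, and the third to a user who receives the data but stays silent; in neither case can one side gain by stalling.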
5. Performance Evaluation
In this section, we evaluate the embedding performance and recovery performance of DOCS on real datasets. We use the supply chain geographic proximity data of nearly 30,000 listed enterprises as our simulation dataset; the dataset contains fields such as user/supplier ID, spatial distance, and distance to user/supplier. We can unify the data standards in the initializing smart contract and build a data standard repository chain network using Blockchain Ethereum, which is jointly constructed and maintained by all the nodes that join. After new data standards are verified by the consensus algorithm of each node, they are linked to the standard library chain to ensure the stability, openness, and transparency of the data standard library.
Our simulation platform is an Intel(R) Core(TM) i5-5350U CPU @ 1.80 GHz with 8.00 GB of RAM, running the Windows 10 operating system. We evaluate the performance of two autoencoders, the stacked denoising autoencoder (SDAE) and the convolutional autoencoder (CAE), in Figure 5 and Figure 6, and the decoding efficiency in different dimensions in Table 4.
From Figure 5, it can be seen that the self-supervised training loss of CAE converges to about 0.07, which is smaller than the 0.096 of SDAE. From Figure 6, the accuracy of CAE’s supervised training process rises to 0.99, which is higher than SDAE’s accuracy of 0.95. Therefore, CAE has certain advantages in reconstruction and classification tasks on this dataset.
Table 4 shows that the recovery performance of the decoder increases only slowly as the data dimension increases, illustrating that the recovery efficiency tends to saturate as the signature vector size increases. In Table 4, it can be seen that beyond 1000 dimensions, the improvement in recovery efficiency decreases significantly with the increase in embedding dimension. To achieve scalable feature representation and retrieval performance, we would like to use an embedding size that stays within a reasonable range, which compresses the raw data sufficiently without significantly sacrificing embedding accuracy. Therefore, we suggest using a 1000-dimensional embedding representation because it provides more than 30 times the compression of the original supply chain data while preserving most of the sparsity and temporal properties of the downstream tasks.
6. Conclusions
In this paper, we propose a distributed data ownership confirmation scheme (called DOCS) in a data transaction scenario. The advantage of a data transaction network built on Blockchain Ethereum is that it eliminates the single point of failure in the big data market. We describe the data signature and fingerprint generation protocols in the DOCS architecture, as well as the market supervision mechanism empowered by smart contracts, and build a standard system for data ownership verification, traceability, and accountability, maintaining data integrity and enabling accurate traceability and timely accountability for illegal data transactions. We demonstrated that DOCS can resist different types of attacks. We analyzed the encoding performance and decoding performance of different autoencoders through supply chain data.
Most smart contract applications, including DOCS, face privacy concerns because of the conflict between privacy needs and the transparency of blockchains and smart contracts. In the Blockchain Ethereum, anyone can view the current state of the smart contract, which also contains information about personal consumption and more. An effective smart contract access control mechanism plays an important role in resolving the above conflicts, and our future research will be carried out on this basis.