A Parallel Multi-Party Privacy-Preserving Record Linkage Method Based on a Consortium Blockchain

Han, Shumin; Wang, Zikang; Shen, Dengrong; Wang, Chuang

doi:10.3390/math12121854

by Shumin Han ¹,*, Zikang Wang ¹, Dengrong Shen ² and Chuang Wang ¹

¹ School of Artificial Intelligence and Software, Liaoning Petrochemical University, Fushun 113001, China
² School of Computer Science and Engineering, Northeastern University, Shenyang 110167, China
* Author to whom correspondence should be addressed.

Mathematics 2024, 12(12), 1854; https://doi.org/10.3390/math12121854

Submission received: 11 May 2024 / Revised: 30 May 2024 / Accepted: 11 June 2024 / Published: 14 June 2024

(This article belongs to the Special Issue Mathematical Modeling for Parallel and Distributed Processing, 2nd Edition)

Abstract

Privacy-preserving record linkage (PPRL) is the process of linking records from various data sources, ensuring that matching records for the same entity are shared among parties while not disclosing other sensitive data. However, most existing PPRL approaches currently rely on third parties for linking, posing risks of malicious tampering and privacy breaches, making it difficult to ensure the security of the linkage. Therefore, we propose a parallel multi-party PPRL method based on consortium blockchain technology which can effectively address the issue of semi-trusted third-party validation, auditing all parties involved in the PPRL process for potential malicious tampering or attacks. To improve the efficiency and security of consensus within a consortium blockchain, we propose a practical Byzantine fault tolerance consensus algorithm based on matching efficiency. Additionally, we have incorporated homomorphic encryption into Bloom filter encoding to enhance its security. To optimize computational efficiency, we have adopted the MapReduce model for parallel encryption and utilized a binary storage tree as the data structure for similarity computation. The experimental results show that our method can effectively ensure data security while also exhibiting relatively high linkage quality and scalability.

Keywords: privacy-preserving record linkage; bloom filter; consortium blockchain; consensus algorithm; MapReduce model; homomorphic encryption

MSC: 68W10

1. Introduction

With the widespread adoption of internet applications, the volume of data is rapidly increasing, making data sharing increasingly critical [1,2]. Numerous organizations have started gathering and processing data from a variety of sources to obtain valuable insights [3]. In these scenarios, the record linkage (RL) task is commonly employed to identify matching entities across various data sources [4]. However, due to privacy and confidentiality concerns, sharing or exchanging such valuable information among different organizations is often prohibited [5,6]. Therefore, in scenarios where entity privacy is crucial, PPRL technology is commonly adopted. PPRL technology ensures that only matching entities are shared among data sources during the record linkage process, while unmatched data are effectively protected and not disclosed. For example, in the medical field, PPRL technology enables the mutual identification of diagnosis and treatment information from different hospitals, allowing for more precise analysis of medical conditions while safeguarding patient privacy [7]. PPRL technology provides a feasible solution for integrating massive datasets while preserving privacy. By regulating the process of data flow, PPRL technology effectively avoids the risks of data leakage, loss, and misuse.

Researchers have proposed many PPRL methods, with most of them being designed based on the honest but curious (HBC) model [8]. In the HBC assumption, mutual trust among participants is required, where they are expected not to engage in malicious attacks or tampering. However, this model is difficult to apply in practical scenarios due to its reliance on trust. Furthermore, these approaches often rely on a semi-trusted third party (STTP) for similarity computation. However, as the STTP is not entirely reliable, it introduces concerns regarding data privacy breaches. Therefore, researchers have proposed the utilization of blockchain technology to verify the trustworthiness of all participants and third parties in the PPRL process [9]. This approach allows honest parties to detect such improper behavior with a high probability, even when adversaries attempt to deviate arbitrarily from the protocol to deceive them. Although this approach to some extent safeguards the security of data, there still remains the risk of malicious parties stealing private data, and the consensus process of blockchain updates nodes relatively slowly. To further enhance data security, the introduction of encryption technology is a viable option. However, this approach also entails significant time costs, as the encryption and decryption processes may consume substantial computational resources and time.

Based on the above issues, we propose a multi-party parallel PPRL method based on consortium blockchain technology. We explore the auditability of a consortium blockchain in the PPRL process, investigate parallel encryption, and study data structures to enhance computational similarity efficiency. Furthermore, we delve into consensus algorithms for PPRL within consortium blockchains.

Contributions:

We utilize a consortium blockchain to validate the trustworthiness of third parties and all involved parties and to control data access, and we introduce a consensus algorithm to enhance the efficiency and security of the consortium blockchain;
We encrypt the Bloom filter encoding with a homomorphic encryption technique and incorporate the MapReduce model into the encryption process for parallel encryption. This not only enhances computational efficiency but also strengthens the security of the encoding process;
We store the data to be linked in a binary storage tree and employ the Jaccard similarity function to calculate the similarity among the splitting Bloom filter encodings, which effectively reduces the number of comparisons among records;
The experimental results show that our method can effectively ensure data security while also exhibiting relatively high linkage quality and scalability.

Outline: The remaining structure of this paper is outlined as follows: In the second section, previous research efforts relevant to the proposed method are reviewed to provide a comprehensive understanding of the existing literature in this field. The third section introduces the definitions of relevant theoretical issues, laying the foundation for the subsequent discussion. In the fourth section, a detailed description of the process and algorithms used in the parallel multi-party PPRL method based on a consortium blockchain is provided. The fifth section presents the experimental results, accompanied by in-depth analysis and discussion. Finally, the sixth section concludes the main findings of this paper and explores potential directions for future research.

2. Related Works

At present, PPRL methods are mostly studied in three aspects: improving the linkage quality of record linkage, enhancing privacy security, and improving computing performance. Most early PPRL methods utilized embedding space technology to transform the data to be linked into a different space, preserving the similarity between data attributes [10,11]. However, this technology is primarily effective for linking numerical data. In subsequent research [12], Durham et al. suggested using Bloom filters to encode the data, which enhances linkage efficiency. However, this method is vulnerable to frequency attacks during the linkage process. Subsequent methods, as discussed in [13,14], are all based on PPRL technology using Bloom filter encoding. In [15], Han et al. proposed a method that combines Bloom filter coding with Multi-LUs to counter frequency-based attacks while also introducing a blocking technique based on the sorted nearest-neighborhood approach to group similar Bloom filters. In [16], Yao et al. improved the Bloom filter to enhance security and employed Siamese neural networks to enhance matching, thereby improving the quality of connections.

The honest but curious model and the malicious adversary model are two common models in PPRL technology [17]. Most existing PPRL methods are designed under the honest but curious assumption, which requires mutual trust among participants and assumes that no malicious attacks or tampering occur. However, this assumption is difficult to satisfy in practical scenarios. In the malicious adversary model, ensuring trustworthiness and security is extremely challenging. When a participant deviates from security protocols and attempts to obtain the private data of other participants, existing solutions find it difficult to identify the offending participant, making participant verifiability hard to achieve. Randall et al. proposed a PPRL method based on the malicious adversary model, utilizing homomorphic encryption techniques to ensure data security [17]. While homomorphic encryption can effectively defend against attacks by malicious adversaries, it cannot identify participants that deviate from the security protocol. To address this issue, Nobrega et al. applied blockchain technology to develop a verifiable PPRL method. This method can detect malicious behavior of participants in similarity computation with high probability, making it the first verifiable solution in the PPRL field and a significant advance [13]. However, in the same year, Christen et al. proposed attack methods against this approach and highlighted the inefficiency caused by the need to update all nodes in blockchain technology [18]. Yao et al. further proposed a PPRL method based on blockchain technology, encrypting split Bloom filters using homomorphic encryption techniques to effectively defend against attacks from malicious adversaries, but its computational efficiency remains low [19]. A summary of PPRL-related methods is shown in Table 1.

Our work involves utilizing a consortium blockchain to validate the trustworthiness of third parties and all participants and leveraging the data access control features of consortium blockchain to share data only with successful matches. Based on the matching efficiency characteristic of PPRL, we propose a consensus algorithm to improve the efficiency and security of consensus on a consortium blockchain. Simultaneously, we employ a homomorphic encryption technique to encrypt the splitting Bloom filter encoding, integrating the MapReduce model into the encryption process for parallel encryption.

3. Preliminaries and Background

To better understand our proposed approach, related concepts and techniques are explained in this section.

3.1. PPRL

Privacy-preserving record linkage is the process of linking records from various data sources, ensuring that matching records for the same entity are shared among parties while not disclosing other sensitive data. Given $p$ participants $(P_{1}, \dots, P_{p})$ with $p \geq 3$, each with their respective data source $D_{i}$, where $i$ ranges from $1$ to $p$, each data source $D_{i} = \{e_{1}, \dots, e_{n}\}_{\neq}$ is composed of a set of different entities. The objective of PPRL is to accurately identify and group records that correspond to the same entity across these different data sources without compromising the privacy of the entities involved.

3.2. Bloom Filter

A Bloom filter (BF) [20] is an efficient data structure that is known for its low time and space requirements, consisting of a binary vector and random mapping functions. It is widely used as an anonymization technique in privacy-preserving record linkage. The quality of anonymization depends on the parameterization of the BF, namely the number of hash functions ($k$) and the filter length ($l$). In its initial state, a BF is a bit array of length $l$, with each bit set to 0. For a set containing $n$ elements, a BF utilizes $k$ mutually independent hash functions $h_{1}(x_{i}), h_{2}(x_{i}), \dots, h_{k}(x_{i})$ to map each element $x_{i}$ to the range $1, \dots, l$, setting the corresponding positions in the BF to 1 [21].
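As an illustration of the encoding just described, the following minimal sketch sets $k$ positions per element in a bit array of length $l$. The salted SHA-256 hash family, the helper names `bloom_encode` and `bigrams`, and the use of character bigrams as the elements $x_{i}$ are assumptions made for this example and are not specified in the text; positions are 0-indexed here.

```python
import hashlib


def bloom_encode(tokens, l=64, k=3):
    """Return a length-l bit list with up to k positions set per token."""
    bits = [0] * l
    for token in tokens:
        for i in range(k):
            # Assumed hash family: SHA-256 salted with the hash index i.
            digest = hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()
            pos = int.from_bytes(digest, "big") % l  # position in 0..l-1
            bits[pos] = 1
    return bits


def bigrams(value):
    """Split a string into its set of character bigrams (assumed tokenization)."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


bf = bloom_encode(bigrams("smith"), l=64, k=3)
print(sum(bf), "of", len(bf), "bits set")
```

Because the hash functions are deterministic, two parties that agree on $k$, $l$, and the hash family obtain identical encodings for identical values, which is what makes the encodings comparable.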

3.3. Jaccard Similarity Function

The Jaccard similarity function evaluates the similarity between two sets by determining the ratio of their intersection to their union. This method is frequently employed to compare the similarity of multiple bit arrays. Assuming there are $p$ bit arrays $B_{1}, B_{2}, \dots, B_{p}$, the formula is as follows [13,22]:

$$\mathrm{Jaccard}(B_{1}, \dots, B_{p}) = \frac{|B_{1} \cap B_{2} \cap \dots \cap B_{p}|}{|B_{1} \cup B_{2} \cup \dots \cup B_{p}|} \tag{1}$$

where $|B_{i}|$ represents the number of 1s in the $i$-th bit array.
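Formula (1) can be checked on small inputs with the sketch below, which represents each bit array as a Python list of 0/1 values. The function name `jaccard` and the convention of returning 0.0 for an all-zero union are assumptions of this example.

```python
from functools import reduce


def jaccard(*bit_arrays):
    """Multi-party Jaccard: |B1 & ... & Bp| / |B1 | ... | Bp| over equal-length bit lists."""
    inter = reduce(lambda a, b: [x & y for x, y in zip(a, b)], bit_arrays)
    union = reduce(lambda a, b: [x | y for x, y in zip(a, b)], bit_arrays)
    ones_in_union = sum(union)
    # Assumed convention: similarity of all-zero arrays is 0.0.
    return sum(inter) / ones_in_union if ones_in_union else 0.0


b1 = [1, 1, 0, 1, 0, 0]
b2 = [1, 0, 0, 1, 1, 0]
b3 = [1, 1, 0, 1, 0, 1]
print(jaccard(b1, b2))      # 2 shared ones over 4 ones in the union -> 0.5
print(jaccard(b1, b2, b3))  # 2 shared ones over 5 ones in the union -> 0.4
```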

3.4. Splitting Bloom Filter

The splitting Bloom filter (SBF) [13] aims to divide the original Bloom filter (BF) into segments, where each segment represents a portion of the original BF’s length. The core concept of the SBF is to conduct iterative similarity computations using only a small segment of the original BF, thereby minimizing the amount of information exchanged during the comparison phase of privacy-preserving record linkage. Assuming the BF of the entity $e$ is $BF(e) = e^{\tau}$, the SBF is $SBF(e^{\tau}, s) = [\phi^{0}, \dots, \phi^{s-1}]$, where $\phi^{i} = [b_{j}, \dots, b_{j + (\frac{l}{s} - 1)}]$ and $j = i \times \frac{l}{s}$, $\forall i \mid 0 \leq i \leq s - 1$.

Based on the SBF, the similarity between two distinct entities, $\mathrm{Jaccard}(e_{a}^{\tau}, e_{b}^{\tau})$, is calculated by determining the similarity of each split separately, as illustrated in Formula (2):

$$\mathrm{Jaccard}_{SBF}(e_{a}^{\tau}, e_{b}^{\tau}, s) = \frac{1}{s} \sum_{i=0}^{s-1} \frac{|\phi_{a}^{i} \cap \phi_{b}^{i}|}{|\phi_{a}^{i} \cup \phi_{b}^{i}|} \tag{2}$$

where $s$ represents the number of splits, and $\phi_{a}^{i}$ and $\phi_{b}^{i}$ represent the $i$-th split in $SBF(e_{a}^{\tau}, s)$ and $SBF(e_{b}^{\tau}, s)$, respectively [13].

Assuming the segmentation of the SBF varies, we also consider that the similarity of the segments might differ slightly from the BF similarity. That is, there is an error value $\varepsilon$, $0 \leq \varepsilon \leq 1$. The calculation is shown in Formula (3):

$$\mathrm{Jaccard}(e_{a}^{\tau}, e_{b}^{\tau}) = \mathrm{Jaccard}(\phi_{a}^{i}, \phi_{b}^{i}) + \varepsilon \tag{3}$$

Figure 1 illustrates the difference in similarity between the SBF and BF, with a BF similarity of 0.4.
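The splitting rule and Formula (2) can be illustrated with the short sketch below. The function names are hypothetical, and the example assumes that $l$ is divisible by $s$, as the index definition $j = i \times l/s$ implies. It also prints how far one split's similarity lies from the full-BF similarity, which is the quantity that the error term in Formula (3) accounts for.

```python
def split_bloom_filter(bits, s):
    """Cut a length-l bit list into s consecutive splits of l/s bits each."""
    l = len(bits)
    if l % s != 0:
        raise ValueError("filter length l must be divisible by split count s")
    width = l // s
    return [bits[i * width:(i + 1) * width] for i in range(s)]


def jaccard_pair(a, b):
    """Jaccard similarity of two equal-length bit lists (0.0 if both are empty)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0


def jaccard_sbf(a_bits, b_bits, s):
    """Formula (2): mean of the per-split Jaccard similarities."""
    splits_a = split_bloom_filter(a_bits, s)
    splits_b = split_bloom_filter(b_bits, s)
    return sum(jaccard_pair(x, y) for x, y in zip(splits_a, splits_b)) / s


a = [1, 1, 0, 0, 1, 0, 1, 0]
b = [1, 0, 0, 0, 1, 1, 1, 0]
full = jaccard_pair(a, b)                       # 3 / 5 = 0.6
first_split = jaccard_pair(a[:4], b[:4])        # 1 / 2 = 0.5
print(full, jaccard_sbf(a, b, 2), abs(full - first_split))
```

In this toy case the full similarity is 0.6, the first split alone gives 0.5, and the mean over both splits is 7/12, so a single split is only an approximation of the full comparison.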

3.5. Homomorphic Encryption

Homomorphic encryption is a cryptographic technique enabling meaningful operations on ciphertext without decryption, ensuring that the decrypted result matches the outcome of performing the same computation on plaintext. By encrypting BF fragments using homomorphic encryption, it is possible to calculate the similarity while the data remains encrypted, ensuring security and privacy during transmission and processing. Even if intercepted during data exchange, attackers cannot directly decipher the content since the data are encrypted before they are uploaded to the blockchain. In the PPRL process on a consortium blockchain, data are used in ciphertext form, ensuring that sensitive information is not accessible to unauthorized participants or third parties while maintaining the functionality of the consortium blockchain as a trusted data exchange platform. To implement homomorphic encryption, one can consider using Microsoft’s SEAL (Simple Encrypted Arithmetic Library) and IBM’s HElib (Homomorphic Encryption Library). SEAL supports partially homomorphic encryption, somewhat homomorphic encryption, and fully homomorphic encryption. Moreover, researchers have made hardware improvements to the SEAL library, significantly enhancing its speed [11]. The HElib library, with its flexibility and rich functionality, is well suited for various homomorphic encryption research projects and applications [23,24,25].

3.6. MapReduce Model

The MapReduce model is a programming model and processing framework used for large-scale data processing. Its basic process involves partitioning input data into multiple data blocks and distributing these blocks to different nodes for processing. In the Map phase, complex tasks are decomposed into simpler sub-tasks, which are then processed in parallel on nodes. The Reduce phase is responsible for aggregating and computing the intermediate results obtained from the Map phase, ultimately generating the final output.

Homomorphically encrypting the bits of the splitting Bloom filters is time-consuming, especially for large data volumes, because encryption and decryption involve data transformation and key-based operations that introduce significant latency. In contrast, the additional overhead of similarity computation typically does not consume much time, especially with optimized homomorphic encryption algorithms. Improvements in homomorphic encryption algorithms include advancements in mathematical operations, hardware acceleration, and other areas. For example, designing specialized hardware (ASIC or FPGA) to accelerate key operations in homomorphic encryption, such as large-integer arithmetic and polynomial multiplication, can significantly enhance performance [23,24,25]. To improve the efficiency of this process, parallel encryption and decryption can be achieved using the MapReduce model, enhancing both efficiency and security [7].
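The partition, Map, and Reduce steps described above can be sketched in a few lines. This is only a single-machine illustration of the pattern, not a Hadoop job: the function names are hypothetical, a thread pool stands in for the worker nodes, and the Map task simply counts set bits as a placeholder for the per-block work (encryption in our setting).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(bits, block_size):
    """Split the input into fixed-size data blocks."""
    return [bits[i:i + block_size] for i in range(0, len(bits), block_size)]


def map_task(block):
    """Placeholder per-block work: count the bits set to 1 in this block."""
    return sum(block)


def reduce_task(acc, partial):
    """Aggregate one intermediate result into the running total."""
    return acc + partial


def map_reduce(bits, block_size, workers=4):
    blocks = partition(bits, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, blocks))  # Map phase, block by block
    return reduce(reduce_task, partials, 0)          # Reduce phase


print(map_reduce([1, 0, 1, 1, 0, 1, 0, 0, 1], block_size=4))
```

The key property is that each block is processed independently, so the Map phase can be spread over as many workers as there are blocks.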

3.7. Consortium Blockchain

Blockchain was initially introduced by Satoshi Nakamoto in 2008 as part of the Bitcoin concept. The operation of blockchain involves distributed data storage and peer-to-peer transmission, requiring the application of computer technologies such as consensus mechanisms and cryptographic algorithms. Essentially, blockchain is a continuously growing distributed ledger database characterized by decentralization, immutability, and traceability. Blockchain can be categorized into three types based on the level of decentralization: public, consortium, and private blockchains. The consortium blockchain is typically used in scenarios where multiple participants share data. The consortium blockchain only allows authorized nodes to join, and nodes can be added or removed at any time, providing excellent scalability. Access to and manipulation of data are restricted by permission management, making it more suitable for scenarios involving sensitive data matching. Due to the authorized nature of the nodes, data privacy and access control can be managed more flexibly. Additionally, the design of the consortium blockchain enables the adoption of more efficient consensus mechanisms to enhance the efficiency of the PPRL process [26].

3.8. Consensus Algorithm

In blockchain, the consensus algorithm is pivotal for nodes in the network to agree on transaction validity. Commonly employed consensus algorithms comprise Proof of Work (PoW), Proof of Stake (PoS), and practical Byzantine fault tolerance (PBFT). In public blockchains, the Proof of Work (PoW) mechanism is commonly employed. Despite being highly effective in preventing malicious behavior, PoW suffers from issues such as high resource consumption and energy wastage. The consensus mechanism on consortium blockchains focuses more on ensuring legitimacy and security. Unlike the PoW mechanism in public blockchains, it can achieve efficient consensus through algorithms like PBFT, reducing resource wastage and promoting trust and collaboration among nodes. The core principle of the PBFT algorithm is to achieve consensus through mutual communication and verification among nodes, thereby confirming the legitimacy of transactions and data. Therefore, we propose a practical Byzantine fault tolerance algorithm based on matching efficiency as the consensus algorithm, enhancing security during the consensus process and enabling a more efficient consensus to be reached during matching.
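For orientation, the fault-tolerance arithmetic of standard PBFT can be written down directly: with $n$ nodes, at most $f = \lfloor (n-1)/3 \rfloor$ Byzantine nodes are tolerated, and a decision needs $2f + 1$ matching messages. The sketch below covers only this standard quorum rule, with hypothetical function names; it does not model the matching-efficiency-based variant proposed in this paper.

```python
def max_faulty(n):
    """Largest number of Byzantine nodes f such that n >= 3f + 1."""
    return (n - 1) // 3


def quorum(n):
    """Number of matching messages needed in standard PBFT: 2f + 1."""
    return 2 * max_faulty(n) + 1


def has_quorum(n, matching_votes):
    """True if enough nodes (including the local one) agree on the request."""
    return matching_votes >= quorum(n)


for nodes in (4, 7, 10):
    print(nodes, max_faulty(nodes), quorum(nodes))
```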

4. Methods

The parallel multi-party PPRL method based on consortium blockchain technology (MP-PPRL-CBT) proposed in this paper consists of three modules: a data preparation and generation module, an approximate matching module, and an auditable module, as illustrated in Figure 2.

Data Preparation and Generation Module: This module determines the privacy and encoding parameters for each data source, and encodes the data using Bloom filters. Subsequently, each participant independently splits the Bloom filters and generates corresponding splitting Bloom filter encodings to minimize data sharing. Next, homomorphic encryption is applied to the encoded data, employing a parallel encryption approach using the MapReduce model and utilizing the corresponding homomorphic encryption algorithm for this model.

Approximate Matching Module: Participants upload encrypted splitting Bloom filters to various nodes on the consortium blockchain. Initially, each participant sends a small portion of encrypted data to the semi-trusted third party (STTP). After the computation concludes, the STTP releases a preliminary similarity table to all involved parties. Subsequently, participants calculate the remaining split encoding similarity based on the similarity table. If the calculated similarity falls within the error range of the table’s similarity, matching parties share data, while unmatched parties do not receive any data. Moreover, in this module, we utilize binary tree storage of encoded data to enhance computational efficiency. The Jaccard similarity function is utilized to calculate the similarity within the binary-tree-encoded data. Upon obtaining the matching results, they are shared among participants. Additionally, in this module, we employ an improved consensus algorithm based on matching efficiency to generate new blocks, thereby enhancing consensus speed.

Auditable Module: In this module, we audit the similarity calculation and transform it into a smart contract hosted on a consortium blockchain, effectively converting the semi-trusted third party (STTP) into a piece of code executed within the consortium blockchain environment. Each computation and interaction log will be recorded on the accounting nodes of the consortium blockchain, ensuring the transparency and integrity of the calculation process, effectively preventing potential malicious tampering.

4.1. Data Preparation and Generation Module

Each participant converts data into a pre-agreed standard format, achieving consensus on input parameters including anonymization parameters, splitting parameters, encryption parameters, etc. The parameters used in the method described in this paper and their meanings are detailed in Table 2 [13].

Participants first anonymize each entity, assigning a unique ID to each one. They then utilize $k$ hash functions to convert the dataset $D_{p}^{\tau}$ into a bit array with a specified length $l$. This Bloom filter is then split into multiple smaller filters, each termed a split Bloom filter $\phi$, with a defined split count $s$. These SBFs are encrypted using homomorphic encryption technology to produce ciphertexts suitable for homomorphic calculations, $\phi' = E(\phi)$.

Homomorphic encryption is time-consuming, which lowers the linkage efficiency of the entire process. To improve efficiency, we incorporate the MapReduce model and adapt the homomorphic encryption algorithm to its parallelism. Parallel encryption processing is applied to the SBF, resulting in improved efficiency and increased security. The process of the data preparation and generation module is illustrated in Figure 3.

The MapReduce model submits tasks to the Job Tracker through the Client. The Job Tracker is responsible for resource monitoring and job scheduling, overseeing the operation status of all Task Trackers. Once tasks are assigned, Task Trackers are responsible for initiating Map Tasks and Reduce Tasks. In the Map Task phase, data are parsed and processed through the Map function to generate intermediate results after homomorphic encryption. Each Map Task independently processes a portion of the data, employing an improved homomorphic encryption algorithm to encrypt the SBF. Subsequently, in the Reduce Task phase, Reduce Tasks gather intermediate results from various Map Task outputs and merge them to produce the result. Reduce Tasks are accountable for aggregating all encrypted data and generating the result through parallel homomorphic encryption. This approach maximizes the parallel processing capabilities of the MapReduce framework, effectively enhancing the efficiency and performance of privacy-preserving record linkage through judicious task allocation and processing [27].

We improve the homomorphic encryption algorithm for parallel homomorphic encryption of the SBF. We use Paillier homomorphic encryption [28] as an example to illustrate how to perform parallel homomorphic encryption. In practical applications, an optimized homomorphic encryption algorithm with better performance can be chosen. The specific encryption steps are as follows:

4.1.1. Generate Key

Let $n = p \cdot q$, where $p$ and $q$ are two random large prime numbers, and let the Euler function be $\phi(n) = (p - 1)(q - 1)$. Set $L(x) = (x - 1)/n$ and $S_{n} = \{x \mid 0 < x < n^{2},\ x \equiv 1 \bmod n\}$. Let $\lambda$ be the least common multiple of $p - 1$ and $q - 1$. Then, randomly select an integer $g \in Z_{n^{2}}^{*}$ such that $g$ satisfies $\gcd(L(g^{\lambda} \bmod n^{2}), n) = 1$. The public key is $(n, g)$, and the private key is $\lambda$.

4.1.2. Encryption Process

Set the plaintext as $k$ (MB) and divide it into $x$ groups of 256-bit-long packets, where the number of packets is $x = \lceil \frac{k \cdot 2^{20}}{32} \rceil$. Convert all packet plaintext data into a large integer type and set the value to $m$, where $m = m_{1} + m_{2}$. If $m$ is an even number, then $m_{1} = m_{2} = m/2$ ($m \bmod 2 = 0$). If $m$ is an odd number, then the values of $m_{1}$ and $m_{2}$ are as follows:

$$\begin{cases} m_{1} = (m - 1)/2 \\ m_{2} = (m + 1)/2 \end{cases} \tag{4}$$

Randomly select two integers $r_{1}, r_{2} \in Z_{n^{2}}^{*}$ and then use the encryption formula to calculate $E(m_{1})$ and $E(m_{2})$. The formula is as follows:

$$\begin{cases} E(m_{1}) = g^{m_{1}} r_{1}^{n} \bmod n^{2} \\ E(m_{2}) = g^{m_{2}} r_{2}^{n} \bmod n^{2} \end{cases} \tag{5}$$

Then, calculate the encrypted ciphertext as follows:

$$c = E(m_{1}) \times E(m_{2}) \tag{6}$$

4.1.3. Decryption Process

To obtain the plaintext $m$, the ciphertext $c$ is decrypted with the private key $\lambda$; the plaintext $m$ can be calculated by using the following formula:

$$m = \frac{L(c^{\lambda} \bmod n^{2})}{L(g^{\lambda} \bmod n^{2})} \bmod n \tag{7}$$

The decryption formula for the summed data is as follows:

$$D(c_{1}, c_{2}) = (m_{1} + m_{2}) \bmod n \tag{8}$$
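The key generation, encryption, and decryption steps above can be traced with the following minimal sketch. It is a toy illustration only: the primes 1009 and 1013 are far too small for real use, the function names are hypothetical, products are reduced modulo $n^{2}$, and the modular division in Formula (7) is carried out with a modular inverse (three-argument `pow` with exponent -1, available from Python 3.8).

```python
import math
import secrets


def L(x, n):
    """L(x) = (x - 1) / n, defined for x congruent to 1 modulo n."""
    return (x - 1) // n


def generate_key(p, q):
    n, n2 = p * q, (p * q) ** 2
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    while True:
        g = secrets.randbelow(n2 - 2) + 2               # candidate in [2, n^2 - 1]
        if math.gcd(g, n) == 1 and math.gcd(L(pow(g, lam, n2), n), n) == 1:
            return (n, g), lam


def encrypt(m, public_key):
    """Formula (5): E(m) = g^m * r^n mod n^2 for a random unit r."""
    n, g = public_key
    n2 = n * n
    while True:
        r = secrets.randbelow(n2 - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n2) * pow(r, n, n2) % n2


def encrypt_split(m, public_key):
    """Formulas (4) and (6): split m into m1 + m2 and multiply the ciphertexts."""
    n, _ = public_key
    m1 = m // 2   # m / 2 if m is even, (m - 1) / 2 if m is odd
    m2 = m - m1   # m / 2 if m is even, (m + 1) / 2 if m is odd
    return encrypt(m1, public_key) * encrypt(m2, public_key) % (n * n)


def decrypt(c, public_key, lam):
    """Formula (7), with division realised as a modular inverse."""
    n, g = public_key
    n2 = n * n
    return L(pow(c, lam, n2), n) * pow(L(pow(g, lam, n2), n), -1, n) % n


public_key, private_key = generate_key(1009, 1013)
c = encrypt_split(12345, public_key)
print(decrypt(c, public_key, private_key))
```

Because multiplying two ciphertexts adds their plaintexts, decrypting $E(m_{1}) \times E(m_{2})$ returns $m_{1} + m_{2} = m$, which is exactly the property that Formula (8) states.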

Parallel homomorphic encryption is implemented based on the MapReduce model. During the encryption process, the Split function divides the input data into fixed-size data blocks according to the actual requirements for subsequent processing. On the consortium blockchain, the nodes determine whether they are master nodes or not. If they are master nodes, they are responsible for distributing Map tasks and Reduce tasks to different processors based on scheduling mechanisms. Otherwise, the nodes need to further determine whether their role is a Map node or a Reduce node. For Map nodes, they utilize the parallel homomorphic encryption algorithm to encrypt the SBF. This improved algorithm allows for independent encryption computation for each data block, enabling simultaneous allocation to multiple Map nodes for encryption operations. Once encryption is completed, Map nodes pass the encrypted data to Reduce nodes. Reduce nodes are responsible for aggregating the encrypted data results from all Map nodes and ultimately generating parallel homomorphic encrypted ciphertext. A brief description of the parallel homomorphic encryption algorithm is as follows (Algorithm 1).

Algorithm 1 Parallel Homomorphic Encryption Algorithm

Input: Part of the SBF for $P$ participants
Output: The encrypted SBF for $P$ participants
1: for each set of SBF in $D_{i}$, $1 \leq i \leq P$ do
2: public key $(n, g)$, private key $\lambda$ $\leftarrow$ generate_key$(p, q)$
3: Set the plaintext as $k$ (MB), divide it into $x$ groups of 256-bit-long packets
4: The number of packets is $x = \lceil \frac{k \cdot 2^{20}}{32} \rceil$
5: for each set of SBF in $D_{i}$, $1 \leq i \leq P$ do
6: Convert each packet plaintext into a large integer type, set the value to $m = m_{1} + m_{2}$
7: if $m$ is an even number then
8: $m_{1} = m_{2} = m/2$ ($m \bmod 2 = 0$)
9: else
10: $m_{1} = (m - 1)/2$, $m_{2} = (m + 1)/2$
11: end if
12: Randomly select integers $r_{1}, r_{2}$
13: $E(m_{1}) = g^{m_{1}} r_{1}^{n} \bmod n^{2}$, $E(m_{2}) = g^{m_{2}} r_{2}^{n} \bmod n^{2}$
14: $c = E(m_{1}) \times E(m_{2})$

In Algorithm 1, lines 1–2 of the code generate the key before encryption, input the SBF of each party, and serially generate the public key $(n, g)$ and the private key $\lambda$. Lines 3–4 of the code divide the plaintext file into data blocks and transmit them to the corresponding processor. Multiple processors will execute the following steps in parallel: divide each data block into groups with a data length of 256 bits, where the data are sequentially read into the cache and converted into large-integer types. Lines 5–14 of the code are the encryption process. Lines 7–11 of the code use different formulas to calculate the values of $m_{1}$ and $m_{2}$ depending on whether the converted integer is an odd number or an even number. Lines 12–14 of the code randomly select two integers $r_{1}, r_{2} \in Z_{n^{2}}^{*}$, and the two values $E(m_{1})$ and $E(m_{2})$ are calculated by using $E(m_{1}) = g^{m_{1}} r_{1}^{n} \bmod n^{2}$ and $E(m_{2}) = g^{m_{2}} r_{2}^{n} \bmod n^{2}$. Then, $c = E(m_{1}) \times E(m_{2})$ is used to calculate the ciphertext $c$ after encrypting the plaintext $m$. All grouped data are summarized to generate an intermediate file. Subsequently, all processor results are aggregated to obtain the ciphertext after parallel homomorphic encryption.
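A compact way to see the structure of Algorithm 1 is the sketch below, which cuts the input into 256-bit packets, encrypts each packet independently (the Map step), and collects the ciphertexts in order (the Reduce step). Several choices here are assumptions for illustration rather than part of the method: the fixed Mersenne primes serve only as demo parameters large enough for 256-bit packets, $g = n + 1$ replaces the randomly selected $g$, $r$ is drawn from $Z_{n}^{*}$, and a thread pool stands in for the Map nodes. In CPython, threads show the task structure but give little speed-up for this CPU-bound work, so a process pool or a real MapReduce cluster would be needed in practice.

```python
import math
import secrets
from concurrent.futures import ThreadPoolExecutor

# Demo-only parameters (assumptions): two Mersenne primes, not a secure key.
P, Q = 2**521 - 1, 2**607 - 1
N = P * Q
N2 = N * N
G = N + 1                                              # simple valid choice of g
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)      # lcm(p - 1, q - 1)
PACKET_BYTES = 32                                      # 256-bit packets


def encrypt(m):
    """E(m) = g^m * r^n mod n^2 with a fresh random unit r."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return pow(G, m, N2) * pow(r, N, N2) % N2


def encrypt_packet(packet):
    """Map task: convert one packet to an integer, split it, encrypt both halves."""
    m = int.from_bytes(packet, "big")
    m1 = m // 2
    m2 = m - m1
    return encrypt(m1) * encrypt(m2) % N2


def decrypt(c):
    u = (pow(c, LAM, N2) - 1) // N
    v = (pow(G, LAM, N2) - 1) // N
    return u * pow(v, -1, N) % N


def split_packets(data):
    return [data[i:i + PACKET_BYTES] for i in range(0, len(data), PACKET_BYTES)]


def parallel_encrypt(data, workers=4):
    packets = split_packets(data)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves packet order, so the aggregated list is the ciphertext.
        return list(pool.map(encrypt_packet, packets))


data = bytes(range(64)) + b"\x01" * 10
ciphertexts = parallel_encrypt(data)
print(len(ciphertexts), "packets encrypted")
```

Since every packet is encrypted with its own randomness and no packet depends on another, the Map tasks need no coordination beyond the final ordered collection.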

4.2. Approximate Matching Module

In the process of realizing the matching of sensitive data records from multiple data sources, the use of the consortium blockchain has some significant advantages over the public blockchain. The consortium blockchain only allows authorized participants to join, and they can join or leave the network at any time, demonstrating excellent scalability. On the consortium blockchain, data access and the manipulation of data are restricted by permission management, making it more suitable for scenarios involving sensitive data matching. Since nodes are authorized, they can manage data privacy and access control more flexibly. Additionally, the design of the consortium blockchain allows for the adoption of lightweight and efficient consensus mechanisms, which can enhance the efficiency of the overall process.

The entire process of the approximate matching module is conducted on the consortium blockchain, where data from each participant are used for similarity calculation and matching. The basic process is illustrated in Figure 4.

In the data preparation and generation module, the SBF after homomorphic encryption for each participant can be obtained. Subsequently, these encrypted data need to be uploaded to various nodes on the consortium blockchain for similarity matching.

All participants collectively establish the consortium blockchain platform, defining consensus mechanisms, identity verification rules, and privacy protection protocols. After the nodes are authenticated and authorized, permitted validating participants can join the consortium blockchain. Access control mechanisms are implemented on the consortium blockchain through smart contracts, ensuring that only authorized participants can obtain specific data. In this scenario, although only a subset of participants can access and use the data, the matching operations are verified and recorded on the consortium blockchain through the consensus of all participants. This ensures the integrity and immutability of the data, even if other participants cannot directly access these data.

Each participant stores homomorphically encrypted SBFs along with necessary metadata on the consortium blockchain. The consortium blockchain operates smart contracts, which receive encrypted data from all parties on the chain, perform calculations based on the specified similarity method, and return matching results.

First, the parties send a small portion of the encrypted SBFs to the STTP. The STTP performs similarity calculations using only one split of the original Bloom filter and filters out entity pairs whose similarity values fall below a specified threshold β. The STTP does not know the data anonymization parameters or which split was chosen. Because it receives only a fraction of the original Bloom filter, cryptanalysis attacks are difficult to execute. The STTP then publishes a list ζ containing all entity pairs with similarity values greater than β, and this list is disclosed to the participants.

After obtaining the preliminary similarity table, the participants build secure channels on the consortium blockchain so that data can be transmitted without interference from other participants. The parties then perform similarity calculations on the remaining splits of the entities disclosed by the STTP. To audit the execution of the protocol, all parties can compare the similarity calculated by each party with the similarity disclosed by the STTP. The similarity threshold used in the STTP calculations is β, while the similarity threshold used by each participant is α. The thresholds must factor in the error of the SBF, so β is calculated as β = α − error.

For each split (s) of the entities stored in ζ, the parties alternately exchange the splits among themselves, one at a time. At the end, each participant receives $ϕ'$ splits, where $|φ| = \frac{s - 1}{|P|}$ [13]. The parties utilize the splits they have exchanged to compute entity similarities using Formula (2). If the disparity between this computed similarity and the value determined by the STTP surpasses the error threshold, the participant identifies misbehavior and halts protocol execution. The similarities calculated on the splits are then exchanged among the parties to adjust the overall similarity of the entities. The parties compare the exchanged similarity with the value stored in ζ; a discrepancy higher than the error threshold signifies misbehavior and leads to protocol termination. Following this, the parties update the similarity values of the entities in ζ. Finally, they select entities with similarity values exceeding the α threshold [13,18,29].
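The two-threshold filter and the audit check can be sketched in Python as follows. The sketch works on plaintext sets of bit positions rather than on homomorphically encrypted SBFs, and the function names, the sample splits, and the numeric values of α and the error are illustrative assumptions, not values from the experiments.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of 1-bit positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sttp_prefilter(first_splits, alpha, error):
    """Keep pairs whose single-split similarity exceeds beta = alpha - error."""
    beta = alpha - error
    zeta = {}
    for pair_id, (split_a, split_b) in first_splits.items():
        similarity = jaccard(split_a, split_b)
        if similarity > beta:
            zeta[pair_id] = similarity
    return zeta


def audit(reported, recomputed, error):
    """True if a reported similarity agrees with a recomputed one within the error."""
    return abs(reported - recomputed) <= error


def overall_similarity(splits_a, splits_b):
    """Mean of the per-split Jaccard values, in the spirit of Formula (2)."""
    values = [jaccard(a, b) for a, b in zip(splits_a, splits_b)]
    return sum(values) / len(values)


# Hypothetical entities, each encoded as three splits of bit positions.
entity_a = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9}]
entity_b = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9}]
entity_c = [{1, 2, 3}, {5, 6, 20}, {8, 9, 30}]
alpha, error = 0.8, 0.1

# The STTP sees only the first split of every entity.
zeta = sttp_prefilter(
    {"AB": (entity_a[0], entity_b[0]), "AC": (entity_a[0], entity_c[0])},
    alpha,
    error,
)

# A participant recomputes the value the STTP published for the pair AC.
sttp_is_consistent = audit(zeta["AC"], jaccard(entity_a[0], entity_c[0]), error)

# Final selection over all splits with the stricter threshold alpha.
candidates = {"AB": (entity_a, entity_b), "AC": (entity_a, entity_c)}
matches = [
    pair_id
    for pair_id in zeta
    if overall_similarity(*candidates[pair_id]) > alpha
]
```

Here both pairs pass the looser threshold β at the STTP, but only the pair AB survives the final check against α once all splits are taken into account.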

The computing node shares the matching results with the participants whose records were matched. A consensus node is then selected by using the PBFT consensus algorithm based on matching efficiency, the validity of the data change is confirmed, and the transaction is broadcast to all nodes on the consortium blockchain to generate a new block. The matching results are then regarded as valid and recorded on the consortium blockchain. Participants can check the matching results according to their own permissions without having to directly access the original data of other parties. In this process, the log of each calculation and interaction carries the consortium blockchain certificate to the accounting node, which keeps the calculation process open and transparent and helps prevent malicious tampering.

4.2.1. Binary Storage Tree

To enhance the efficiency of similarity calculations, a binary storage tree is employed as the storage structure for the SBF, reducing the number of similarity calculation operations. Blocking means putting similar pairs of records into the same block and filtering out obviously mismatched pairs of records to reduce the search space of the subsequent matching process. Binary storage trees are similar to blocking in that both reduce the number of comparisons and the amount of data processed during the linking process [30,31]. In this structure, the left subtree holds a fixed homomorphically encrypted SBF, while the right subtree stores the homomorphically encrypted values of SBFs sent by other participants. Using the Jaccard similarity function, each participant computes the similarity value between the left and right subtrees. The structure of this binary storage tree is depicted in Figure 5 [19].

In the approximate matching module, each participant will run the approximate matching algorithm to obtain the final successfully matched records by calculating the Jaccard similarity function (Algorithm 2).

Algorithm 2 Similarity Calculation Algorithm

Input: Part of the SBF encrypted by the $P$ participants
Output: The record group successfully matched by each participant
1: Build the binary storage tree to store the SBF of participants
2: for each fixed encrypted value $ϕ_{i}$ in $D_{i}$, $1 \leq i \leq P$ do
3:   for each other encrypted value $ϕ_{j}$ in $D_{j}$, $j \neq i$, $1 \leq j \leq P$ do
4:     $LeftChildNode \leftarrow ϕ_{i}$
5:     $RightChildNode \leftarrow ϕ_{j}$
6:     Calculate the Jaccard similarity value between the left and right subtree SBFs
7:     $Jaccard(ϕ_{i}, ϕ_{j}) = \frac{|ϕ_{i} \cap ϕ_{j}|}{|ϕ_{i} \cup ϕ_{j}|}$
8:     Compare with the similarity $γ$ in the table $ξ$, where $ϵ$ is the error value
9:     if $|Jaccard(ϕ_{i}, ϕ_{j}) - γ| < ϵ$ then
10:      Indicate that the two split Bloom filters $ϕ_{i}$, $ϕ_{j}$ match
11:    end if
12:  end for
13: end for

In Algorithm 2, lines 1–5 of the code are for each participant to construct the binary storage tree and store the encrypted value of each participant in the left and right subtrees of the binary storage tree. Lines 6–10 of the code compute the similarity between the left and right subtrees using the Jaccard similarity function. Leveraging the transitive properties inherent in binary trees, if $ϕ_{1}'$ and $ϕ_{2}'$ match, and $ϕ_{1}'$ and $ϕ_{3}'$ match, it can be inferred that $ϕ_{2}'$ and $ϕ_{3}'$ match. Hence, the similarity of various split Bloom filters (SBFs) can be inferred without computing all possible record pairs from the right subtrees [19].
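One way to picture the saving from transitive inference is a union-find structure: once $ϕ_{1}'$ has been matched with $ϕ_{2}'$ and with $ϕ_{3}'$, the pair $(ϕ_{2}', ϕ_{3}')$ already sits in the same group and need not be compared. The Python sketch below illustrates that idea only; it is not the binary storage tree of [19], and the class name, the record identifiers, and the `matches` predicate are hypothetical.

```python
class MatchGroups:
    """Union-find over record identifiers to propagate matches transitively."""

    def __init__(self, ids):
        self.parent = {record_id: record_id for record_id in ids}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same_group(self, a, b):
        return self.find(a) == self.find(b)


def link(fixed_id, other_ids, matches):
    """Compare the fixed (left) record once with each right-hand record."""
    groups = MatchGroups([fixed_id, *other_ids])
    comparisons = 0
    for other in other_ids:
        comparisons += 1
        if matches(fixed_id, other):
            groups.union(fixed_id, other)
    return groups, comparisons


# Hypothetical outcome of the Jaccard test for each (left, right) pair.
known = {
    ("phi1", "phi2"): True,
    ("phi1", "phi3"): True,
    ("phi1", "phi4"): False,
}
groups, comparisons = link(
    "phi1", ["phi2", "phi3", "phi4"], lambda a, b: known[(a, b)]
)
```

Three comparisons against the fixed record are enough here to place `phi2` and `phi3` in the same group, whereas comparing all pairs among four records would take six.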

4.2.2. Consensus Algorithm

On the consortium blockchain, members confirm the matching results through the consensus mechanism. Once a consensus is reached, the matching result is considered valid and recorded on the consortium blockchain. Participants can check the matching results according to their permissions without having to directly access the original data of other parties.

To further improve efficiency and fairness, we propose a practical Byzantine fault tolerance algorithm based on matching efficiency, namely match effective–practical Byzantine fault tolerance (ME-PBFT). The improved consensus algorithm aims to ensure data consistency while minimizing computing and communication overhead so that a consensus can be reached more efficiently during the matching process. In the process of consensus, the consortium blockchain can pay more attention to the real data processing capacity of the nodes and select the nodes that process data efficiently so as to improve system performance and energy efficiency. The algorithm not only improves security but also enhances the ability of the system to deal with faults.

ME-PBFT takes the past data processing performance and the reputation of the nodes as the main indicators, and nodes enter the consensus process probabilistically. The basic formula is as follows:

$$ME' = \frac{M}{T} \cdot 100\% \tag{9}$$

$M$ represents the amount of matched data or the amount of shared data accepted by each node, and $T$ represents the time required to process these data. The matching efficiency of a node is obtained by calculating their ratio. The higher the matching efficiency of the node, the higher the activity and efficiency of the node in the consensus process, which indicates that the node invests more resources in data processing and accounting.

The similarity calculation error index $R$ is used to evaluate the difference between the similarity value calculated by the participants in the matching process and the standard or expected similarity value. In this case, the similarity value calculated by the STTP is regarded as a reference standard. By comparing the similarity value calculated by the participants and the similarity value calculated by the STTP, we can evaluate whether the participants follow the protocol and whether their calculation results are within a reasonable range. The evaluation criteria are as follows:

$$R = \begin{cases} 1 & \text{Error occurred many times} \\ 0 & \text{No error occurred} \\ -1 & \text{Two or more errors} \end{cases} \tag{10}$$

In addition to these two main indicators, response time and block time need to be added as evaluation criteria. The evaluation criteria of block time are as follows:

$$ME'' = \begin{cases} 1 & \text{Selected and produced blocks on time} \\ 0 & \text{Not selected} \\ -1 & \text{Selected and failed to produce blocks on time} \end{cases} \tag{11}$$

Response time $ME'''$ refers to whether the corresponding node can make a signature response that meets the requirements in the consensus phase. If the node responds within the specified time, it is assigned a value of 1; otherwise, it is assigned a value of 0. The above four indicators are weighted by 0.5, 0.3, 0.1, and 0.1 according to their impact, and the final node matching efficiency value $ME_{i}$ is calculated as follows:

$$ME_{i} = ME' \times 0.5 + R \times 0.3 + ME'' \times 0.1 + ME''' \times 0.1 \tag{12}$$
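As an illustration of Equations (9)–(12), the following Python sketch computes the weighted score for a few nodes and ranks them. The `NodeStats` container, the sample figures, and the decision to treat the $100\%$ factor in Equation (9) as a unit label rather than a multiplier are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    matched: float    # M: amount of matched or shared data handled
    seconds: float    # T: time needed to process that data
    error_index: int  # R from Equation (10): 1, 0 or -1
    block_index: int  # ME'' from Equation (11): 1, 0 or -1
    responded: int    # ME''': 1 if the node answered in time, else 0


def matching_efficiency(node):
    """Weighted score ME_i from Equation (12)."""
    me_prime = node.matched / node.seconds  # Equation (9), percent factor omitted (assumption)
    return (
        0.5 * me_prime
        + 0.3 * node.error_index
        + 0.1 * node.block_index
        + 0.1 * node.responded
    )


def rank_nodes(nodes):
    """Nodes sorted from highest to lowest matching efficiency."""
    return sorted(nodes, key=matching_efficiency, reverse=True)


# Hypothetical node statistics.
nodes = [
    NodeStats("A", matched=100, seconds=50, error_index=0, block_index=1, responded=1),
    NodeStats("B", matched=80, seconds=80, error_index=-1, block_index=0, responded=1),
    NodeStats("C", matched=120, seconds=40, error_index=0, block_index=-1, responded=1),
    NodeStats("D", matched=60, seconds=60, error_index=0, block_index=0, responded=0),
]
ranking = [node.name for node in rank_nodes(nodes)]
```

Because $ME'$ carries half of the total weight, a node that processes data quickly (node C here) can still rank first after a missed block.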

The consensus process of the algorithm includes five steps. The first step is the request stage, in which the matching node sends the request to the other nodes. The second step is the pre-prepare stage: the node that receives the request from the matching node first verifies the validity of the request and then broadcasts the message to the remaining nodes with its own signature and timestamp. The third step is the prepare stage. It is assumed that $f$ nodes fail to receive messages or deliver outdated messages due to insufficient network bandwidth or a long waiting time. Each node verifies the validity of the message, and when it receives $2f + 1$ prepare messages, it broadcasts a confirmation message with its signature. The fourth step is the commit stage: when a node that received the request of the matching node receives $2f + 1$ valid confirmation messages, it replies to the matching node with its signature and numbering information. The last step is the reply stage: after receiving $2f + 1$ messages, the matching node considers that the system has reached agreement on the message.
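The $2f + 1$ thresholds can be illustrated with a small quorum counter. The sketch assumes the standard PBFT bound of $n \geq 3f + 1$ nodes for tolerating $f$ faulty nodes; the message representation (a sender name and a digest, without real signatures) is a hypothetical simplification.

```python
def max_faulty(node_count):
    """Largest f such that node_count >= 3f + 1."""
    return (node_count - 1) // 3


def quorum(node_count):
    """Number of matching messages needed: 2f + 1."""
    return 2 * max_faulty(node_count) + 1


def has_quorum(messages, digest, node_count):
    """True once 2f + 1 distinct senders vouch for the same digest."""
    senders = {sender for sender, message_digest in messages if message_digest == digest}
    return len(senders) >= quorum(node_count)


# Hypothetical prepare messages in a four-node network (f = 1).
prepare_messages = [("n1", "abc"), ("n2", "abc"), ("n2", "abc"), ("n3", "xyz")]
before = has_quorum(prepare_messages, "abc", node_count=4)
after = has_quorum(prepare_messages + [("n4", "abc")], "abc", node_count=4)
```

Counting distinct senders rather than raw messages means the duplicate from `n2` does not help reach the threshold.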

Although the communication overhead of Byzantine fault tolerance is significant, for security reasons, the consortium blockchain cannot use fixed block-producing nodes. Therefore, when electing block-producing nodes, the voting mechanism incorporates the efficiency value of the nodes. The voting mechanism of the consensus algorithm based on matching efficiency is as follows. The consortium blockchain first sorts all $n$ nodes based on their matching efficiency value $ME_{i}$. The top half of the nodes in the ranking are designated as priority candidate nodes for the current round of consensus; that is, if the total number of candidate nodes for this round is $n$, the number of priority candidate nodes is $n/2$. Each node can cast $t$ votes in the election, where $t$ is bounded as follows:

$$(n / 2) + 1 \leq t \leq n \tag{13}$$

Considering the efficiency of the consensus process in the consortium blockchain, each node will first allocate $(n / 2) + 1$ votes to the priority candidate nodes, and the remaining $t - (n / 2) - 1$ votes will be distributed probabilistically among the remaining $n / 2$ nodes. Prioritizing votes for priority candidate nodes increases their likelihood of becoming consensus nodes, while the remaining nodes still have a chance to become consensus nodes, preserving the system’s flexibility and fairness. The number of votes is defined as follows:

$$t = \exp\left\{ -\left\{ \frac{\tan\left(\frac{π}{2} \times λ\right)}{μ} + \ln 2 \right\} + \frac{1}{2} \right\} \times r \tag{14}$$

In the equation, $λ$ and $μ$ represent the efficiency and security factors, respectively. A higher value of $λ$ or $μ$ indicates higher efficiency or greater security. The node’s voting count is adjusted through the efficiency factor $λ$ and the security factor $μ$: a higher $λ$ value makes nodes with higher efficiency values more likely to become consensus nodes, while a higher $μ$ value gives other nodes a better chance to become block-producing nodes. Using this consensus algorithm not only improves the security of the system but also contributes to its sharing performance and resistance to interference. The node voting algorithm is shown in Algorithm 3.

Algorithm 3 Node voting algorithm

Input: The number of pending votes for all nodes
Output: Number of votes cast on all nodes
1: for i = 1:nodenumber
2:   for j = 1:nodenumber/2
3:     selected_node = randi(nodenumber/2)
4:     nodes_getticket(selected_node) = nodes_getticket(selected_node) + 1
5:   end
6:   for k = 1:ticket-nodenumber/2
7:     $t = \exp\left\{ -\left\{ \frac{\tan\left(\frac{π}{2} \times λ\right)}{μ} + \ln 2 \right\} + \frac{1}{2} \right\} \times r$
8:     selected_node = randi([t, nodenumber])
9:     nodes_getticket(selected_node) = nodes_getticket(selected_node) + 1
10:  end
11: end

In Algorithm 3, lines 1–5 of the code dictate that each node first assigns tickets to the priority candidate nodes. Lines 6–10 of the code distribute the remaining tickets probabilistically among the remaining nodes according to the formula in line 7.
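A minimal Python rendering of the two voting loops is given below as a sketch. It assumes that nodes are already sorted by descending $ME_{i}$ so that the first half are the priority candidates, and it takes the lower bound `t` of the second draw as an argument instead of recomputing it, because the quantity $r$ is not defined in this section. The function and parameter names are hypothetical.

```python
import random


def cast_votes(node_count, tickets_per_node, t, rng=None):
    """Return the number of tickets each node receives.

    Nodes are indexed 0 .. node_count - 1 in descending order of matching
    efficiency, so indices below node_count // 2 are priority candidates.
    `t` is the 1-based lower rank bound for the second draw (lines 7-8).
    """
    rng = rng or random.Random()
    half = node_count // 2
    received = [0] * node_count
    for _voter in range(node_count):
        # Lines 2-5: one ticket per draw to a random priority candidate.
        for _ in range(half):
            received[rng.randrange(half)] += 1
        # Lines 6-10: remaining tickets go to ranks t .. node_count (1-based).
        for _ in range(tickets_per_node - half):
            received[rng.randint(t, node_count) - 1] += 1
    return received


# Hypothetical round: 8 nodes, 6 tickets each, second draw limited to ranks 5-8.
votes = cast_votes(node_count=8, tickets_per_node=6, t=5, rng=random.Random(0))
```

With `t` set to the first rank after the priority candidates, the two halves of the ranking receive disjoint shares of the tickets; a smaller `t` would let priority candidates collect tickets in the second draw as well.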

ME-PBFT uses the concept of matching efficiency to select nodes for the consensus process. This efficiency is determined by evaluating each node’s historical data processing performance and reputation. By prioritizing nodes with higher matching efficiency, the algorithm ensures that more capable and reliable nodes are involved in the consensus process. This reduces the likelihood of delays and inefficiencies caused by less reliable nodes, thereby improving the overall throughput and reducing the time delay.

ME-PBFT minimizes communication overhead by limiting the number of messages exchanged during the consensus process. Only the nodes with the highest matching efficiency are selected to participate actively in each consensus round. Fewer nodes communicating with others reduces the total number of messages exchanged, which directly decreases the network’s communication load. This streamlined communication enhances throughput and lowers latency.

The algorithm incorporates a similarity calculation and error metrics to evaluate the consistency of the data processed by the nodes. By ensuring that the nodes’ processing results are within an acceptable error range, the algorithm maintains data consistency without excessive communication and reprocessing. This mechanism helps in achieving a faster consensus while maintaining data integrity.

ME-PBFT uses a probabilistic approach to select nodes based on their matching efficiency values. Nodes are sorted and the top half with the highest efficiency values are chosen as the primary candidates for the consensus. This selection mechanism ensures that the most efficient nodes are primarily responsible for the consensus, which enhances the speed of reaching an agreement and reduces the time needed for the consensus process.

The algorithm enhances fault tolerance by maintaining detailed logs of the nodes’ response times and block times, which are used to detect and mitigate malicious behavior or failures. Improved fault tolerance mechanisms ensure the reliability of the consensus process, even in the presence of faulty or malicious nodes. This contributes to a more stable and resilient system, ultimately enhancing performance by preventing disruptions.

4.3. Auditable Module

The auditability of PPRL is based on the characteristics of the consortium blockchain, which possesses features such as tamper-resistance, decentralization, and auditability. During the comparison and classification steps of PPRL, these operations are conducted on the consortium blockchain. The STTP is deployed on the consortium blockchain and transformed into a smart contract hosted on the consortium blockchain, converting the STTP into a small piece of code executed within the consortium blockchain environment. Once the smart contract is deployed on the consortium blockchain, it cannot be modified, thereby reducing the possibility of malicious parties altering the smart contract code.

The consortium blockchain has decentralized characteristics, in which each participant reaches an agreement through a consensus mechanism and jointly maintains the status of the ledger. The consortium blockchain has auditability, allowing all nodes to access the STTP’s code, inputs, and outputs once stored on the consortium blockchain. To ensure data security, the consortium blockchain has established access permissions, whereby shared data are only disclosed to participants using PPRL while being hidden from external malicious entities. The consortium blockchain is exclusively utilized for verifying computations and updating similarity values stored in ζ.

Each participant can audit the execution of smart contracts, verify the correctness of computations, and ensure the protection of data privacy. Audit logs serve as crucial sources of information, documenting all operations related to data linkage, including data transmission, encryption, and decryption and the execution of matching algorithms. Audit logs contain detailed information such as timestamps, participant identities, operation types, parameter changes, etc. Storing audit logs in the blocks of the consortium blockchain ensures the immutability and permanence of the logs.

To protect entity privacy from attacks by various participants of PPRL and the STTP, we utilize SBF encoding for similarity computation, reducing the amount of shared information. Additionally, each participant will also verify similarity computation values between each other and the STTP to determine when participants or the STTP attempt to deviate from the protocol by computing or transmitting incorrect entity similarities (Algorithm 4).

Algorithm 4 Auditable algorithm

Input: Encryption value $ϕ_{i}'$, $φ_{p}$, threshold $α$, a table $ξ$ consisting of approximately matching candidate record groups
Output: The set $M$ composed of the true matching candidate record groups
1: The input of each participant is the SBF $ϕ_{i}'$; the STTP verifies the input of each participant
2: The STTP performs the similarity calculation and stores entity pairs with high similarity probability in table $ξ$
3: Send the table $ξ$ to all participants
4: The participants exchange the rest $ϕ_{i}'$ of the entities in the table
5: $Jaccard(ϕ_{i}, ϕ_{j}) = \frac{|ϕ_{i} \cap ϕ_{j}|}{|ϕ_{i} \cup ϕ_{j}|}$
6: Compare with the similarity $γ$ in the table $ξ$, where $ϵ$ is the error value
7: if $|Jaccard(ϕ_{i}, ϕ_{j}) - γ| < ϵ$ then
8:   Indicate that the STTP is trusted
9: end if
10: The participants exchange the similarity calculated in the previous step and audit whether the participants are trustworthy
11: $Jaccard(e_{α}^{τ}, e_{β}^{τ}, s) = \frac{1}{s} \sum_{i = 0}^{s} \frac{|ϕ_{α}^{i} \cap ϕ_{β}^{i}|}{|ϕ_{α}^{i} \cup ϕ_{β}^{i}|}$
12: if $Jaccard(e_{α}^{τ}, e_{β}^{τ}, s) > α$ then
13:   Add it to $M$
14: end if

In Algorithm 4, lines 1–3 involve the STTP computing the similarity of the data sent by each participant using Formula (1), storing entities with similarity > $α$ in table $ζ$, and then transmitting the table to each participant. In lines 4–9, each participant calculates the similarity between their SBFs based on the entity IDs in table $ζ$ using Formula (1) and compares it with the similarity stored in the table. If the difference is less than the error value $ε$, the STTP is considered trustworthy. Lines 10–14 involve exchanging the calculated similarity values among participants to verify their trustworthiness. Subsequently, the similarity value of the complete Bloom filter encoding is computed from the SBFs using Formula (2), and entities with similarity values greater than $α$ are stored in the matching entity set $M$ [19].

5. Security Analysis

In our proposed MP-PPRL-CBT method, it is crucial to ensure that the participants do not engage in malicious behavior during the similarity calculation process. To achieve this goal, a variety of audit mechanisms are designed, combining blockchain technology and smart contracts to ensure the transparency and reliability of data processing.

An important feature of the consortium blockchain is its data transparency and immutability. We achieve traceability and verifiability of the calculation process by recording the similarity calculation process of the participants on the blockchain. The similarity calculation process of each participant is recorded as a transaction on the blockchain. Other participants and auditing agencies can view these records to ensure the transparency of the calculation process. Due to the immutability of the blockchain, any modification of the calculation results will be recorded and noticed, preventing malicious tampering.

Smart contracts are self-executing code deployed on the blockchain. This paper uses smart contracts to realize automatic auditing of similarity calculations. The smart contracts contain predefined similarity calculation rules and thresholds. When a participant submits a similarity calculation result, the smart contract automatically verifies whether the result complies with the predefined rules. If the similarity calculation result deviates from expectations or does not comply with the rules, the smart contract will automatically trigger and record the abnormal behavior to prevent malicious behavior from going unnoticed.

We designed a multi-party verification mechanism, that is, the similarity calculation results need to be verified by multiple parties before they can be considered valid. Multiple parties participate in the similarity calculation and verify the calculation results with each other. The calculation results of one party need to be consistent with the results of other parties in order to pass the verification. The consistency of the results is checked by comparing the calculation results of different parties. If significant differences are found, the system will conduct further reviews to determine whether there is malicious behavior.

During data processing, all calculation steps and results generate detailed log records; the system monitors the calculation logs in real time and records abnormal behavior immediately when it is found. The calculation logs are audited regularly to ensure that all calculation steps comply with the specifications and that there is no malicious behavior.

Our method can effectively resist frequency attacks and Sybil attacks.

A frequency attack analyzes the frequency of occurrence of elements in encrypted data and infers the actual content of the encrypted data from these frequencies. Homomorphic encryption allows encrypted data to be operated on without decryption. This means that during data processing, the data always remain encrypted. Since attackers cannot directly access the plaintext data, they cannot infer the content of the original data through frequency analysis. Homomorphic encryption technology effectively hides the actual value of the data and prevents the occurrence of frequency attacks. In the PPRL process, security vulnerabilities may exist due to malicious eavesdropping on the communication data stream between the data source and the cloud server. However, since all private data are homomorphically encrypted in parallel and carried in the communication data stream in ciphertext form, even if the attacker intercepts the encrypted data sent by the participants, it is impossible to decrypt the sensitive information contained therein.

A Sybil attack is a common security threat in which the attacker disrupts the normal operation of the distributed network by creating multiple forged identities (Sybil nodes). In the proposed MP-PPRL-CBT method, resistance to Sybil attacks is mainly reflected in the following aspects. The consortium blockchain is a private blockchain that only allows authorized nodes to join the network. Through this permission management mechanism, the consortium blockchain can effectively restrict the joining of malicious nodes, thereby preventing Sybil attacks. Each participant needs to be verified and authorized before joining the network, which makes it difficult for attackers to create and control a large number of forged nodes.

We adopt an improved matching efficiency-based practical Byzantine fault tolerance (PBFT) consensus algorithm to improve the consensus efficiency and security of the consortium blockchain. The PBFT algorithm reaches a consensus through mutual communication and verification between nodes to ensure the legitimacy of the transactions and data. Even if there are malicious nodes, the algorithm can tolerate a certain number of Byzantine nodes (i.e., malicious nodes) without affecting the normal operation of the entire system. This mechanism greatly increases the difficulty of a Sybil attack, because the attacker needs to control more than one-third of the total nodes to disrupt the consensus process.

In the consortium blockchain, data access and operations are subject to strict permission management control. Only authorized nodes can access and process specific data. This control mechanism ensures that even if Sybil nodes enter the network, they cannot arbitrarily access or tamper with sensitive data, thereby protecting the security and integrity of the data. During data uploading and processing, all participants need to pass identity authentication. This verification is performed not only when participants first join the network, but also during each important operation (such as data uploading, consensus participation, etc.). Through a strict identity authentication mechanism, attackers can be effectively prevented from disguising themselves as multiple false identities, thereby resisting Sybil attacks.

Our method uses smart contracts in the consortium blockchain to automate and record all calculations and interaction operations. These operation logs are stored on the accounting nodes of the blockchain, ensuring the transparency and integrity of the calculation process. The use of smart contracts effectively transforms the role of a semi-trusted third party (STTP) into code executed in the consortium blockchain environment, so that even if Sybil nodes attempt to perform malicious operations, they can be detected and prevented by the audit mechanism. Through homomorphic encryption technology, the data remain encrypted during transmission and processing. Even if the Sybil nodes are able to intercept the data, they cannot decrypt and use them. This encryption mechanism further enhances the security of the data throughout the linkage process.

6. Experimental Results

6.1. Experiment Preparation

The datasets used in this study include the North Carolina Voter Registration List (NCVR), the SCHOLAR dataset, and the ACM dataset. These three datasets are publicly available. Among them, the NCVR dataset is widely used in the research of privacy-preserving record linkage and has become the standard test dataset in this field. It can be downloaded from https://www.ncsbe.gov/results-data/. The methods mentioned in this paper are implemented using Python (version 3.8).

We evaluate the overall scalability and linkage quality of our proposed method using four metrics: runtime, precision, recall, and F-measure. Precision refers to the ratio of the number of actual matching record pairs to the total number of candidate record pairs. Recall is the ratio of the number of actual matching record pairs to the total number of truly matching record pairs. The F-measure is used to comprehensively evaluate the results of the method, and its formula is as follows.

$$F = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{15}$$
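For concreteness, the three linkage quality metrics can be computed from sets of record pairs as in the short Python sketch below. The function name and the sample pairs are hypothetical and do not come from the experimental datasets.

```python
def linkage_quality(predicted_pairs, true_pairs):
    """Return (precision, recall, F-measure) for sets of record pairs."""
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)  # Equation (15)
    return precision, recall, f_measure


# Hypothetical output of a linkage run versus the ground truth.
predicted = {(1, 1), (2, 2), (3, 4), (5, 5)}
truth = {(1, 1), (2, 2), (3, 3), (5, 5), (6, 6)}
precision, recall, f_measure = linkage_quality(predicted, truth)
```

In this example three of the four predicted pairs are correct and three of the five true pairs are found, so precision is higher than recall and the F-measure lies between the two.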

For the consensus algorithm proposed by us, we evaluate it based on two metrics: throughput and time delay.

We compare our proposed multi-party parallel PPRL method based on consortium blockchain technology (MP-PPRL-CBT) with three existing relevant methods. These include methods proposed by Randall [17], Karapiperis [10], and Nóbrega [13]. All three methods are related to the method proposed in this paper. The method of Randall employs homomorphic encryption to address the vulnerability of Bloom filter encoding to frequency attacks. The ABEL method proposed by Nóbrega is the first to use blockchain technology to verify the trustworthiness of the STTP and the parties involved in PPRL methods. The method proposed by Karapiperis achieves data security through improvements in Bloom filter encoding.

The independent variables in the experiment include the size of the data source, the number of participating parties, and the disturbance ratios. In the experiment, the sizes of the datasets selected are 5 K, 10 K, 50 K, 100 K, and 500 K records. The numbers of participating parties are chosen as 3, 5, 7, and 9.

We generate three perturbed datasets based on the original dataset by introducing spelling errors, semantic perturbations, and structural perturbations. The errors in the perturbed datasets may include malicious deletion of a character, random substitution of a character, word reordering, character insertion, and other common errors. Perturbed dataset 1 ensures that each participant’s record information has at most one error (Mod-1), perturbed dataset 2 has at most two errors (Mod-2), and perturbed dataset 3 has at most three errors (Mod-3).

6.2. Experimental Results and Analysis

6.2.1. Scalability Assessment

In evaluating scalability, we primarily use runtime as a metric. We assess how the runtime of our method changes with the increase in the size of the data source. Here, the perturbation ratio is 30%, and the number of splits is $s = 3$. To ensure security, our method uses consortium blockchain and homomorphic encryption technology, which greatly reduce its efficiency, so we use a binary storage tree and the MapReduce model to accelerate the process. We compare our method (using both a binary storage tree and the MapReduce model) with the method using only a binary storage tree and the method using only the MapReduce model. The experimental results are shown in Figure 6. When the number of participants is 3, the running time of the method using the MapReduce model is significantly lower than that of the method using a binary tree. In comparison, the MapReduce model contributes more to improving performance because the calculation of homomorphic encryption is quite time-consuming. As the dataset continues to grow, the runtime does not increase much compared to when no relevant model is introduced, which shows that the proposed solution has relatively good scalability.

As the size of the dataset increases, the specific runtime of each method is shown in Table 3.
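To make the relative contribution of each acceleration technique concrete, the runtimes in Table 3 can be compared directly. The sketch below only divides the values reported in Table 3; the dictionary layout and the helper name `slowdown` are illustrative assumptions, and the time unit is left unspecified because Table 3 does not state it.

```python
# Runtimes copied from Table 3 (the table does not state the time unit).
RUNTIME = {
    "both": {"5 K": 12, "50 K": 90, "500 K": 1300},             # MP-PPRL-CBT
    "tree_only": {"5 K": 70, "50 K": 600, "500 K": 8800},       # binary storage tree only
    "mapreduce_only": {"5 K": 18, "50 K": 130, "500 K": 2100},  # MapReduce only
}

def slowdown(variant: str) -> dict:
    """Runtime of a single-technique variant divided by the runtime of the full method."""
    return {
        size: round(RUNTIME[variant][size] / RUNTIME["both"][size], 2)
        for size in RUNTIME["both"]
    }

for variant in ("tree_only", "mapreduce_only"):
    print(variant, slowdown(variant))

# Growth of the full method's runtime when the data grows 100-fold (5 K to 500 K).
growth = RUNTIME["both"]["500 K"] / RUNTIME["both"]["5 K"]
print(round(growth, 1))
```

On the values in Table 3, the variant without MapReduce is roughly 5.8 to 6.8 times slower than the full method, whereas the variant without the binary storage tree is roughly 1.4 to 1.6 times slower, which is consistent with the observation that the MapReduce model contributes more to the speed-up.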

We evaluate the proposed consensus algorithm on two metrics: throughput and time delay. Throughput is defined as

$$S = \frac{N}{T} \tag{16}$$

where $S$ represents throughput, $N$ represents the number of transactions on which consensus is successfully reached, and $T$ represents the total time. In the experiment, the client sends 1500 messages, and we record the number of transactions on which consensus is successfully reached per second. We compare the improved PBFT algorithm with the standard PBFT algorithm, where a represents our algorithm and b represents the PBFT algorithm. The experimental results are illustrated in Figure 7.

The consensus latency, as a crucial metric for evaluating consensus algorithms, represents the time difference from when a transaction is initiated to when it is completed in blockchain systems. Lower latency enhances the availability and security of the blockchain. The formula for testing latency is expressed as follows:

$$\mathit{DelayTime} = T_{\mathrm{submit}} - T_{\mathrm{authentication}} \tag{17}$$

where $T_{\mathrm{submit}}$ represents the time at which consensus completion is confirmed, and $T_{\mathrm{authentication}}$ represents the time at which the consensus authentication phase begins. We compare the improved PBFT algorithm with the standard PBFT algorithm, where A represents our algorithm and B represents the PBFT algorithm. In the experiment, averages are taken every 150 transactions, and tests are conducted with different numbers of nodes. The experimental results are depicted in Figure 8.
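The two consensus metrics in Equations (16) and (17) reduce to simple arithmetic over logged timestamps. The following Python sketch shows one way to compute them; the function names, the `(t_authentication, t_submit)` record layout, and the sample numbers are assumptions for illustration and are not taken from the experimental code.

```python
from statistics import mean

def throughput(num_committed: int, total_seconds: float) -> float:
    """Equation (16): S = N / T, in transactions per second."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return num_committed / total_seconds

def delay_time(t_submit: float, t_authentication: float) -> float:
    """Equation (17): time from the start of authentication to consensus confirmation."""
    return t_submit - t_authentication

def batch_mean_delays(records, batch_size=150):
    """Average the delay over consecutive batches of transactions.

    records is a list of (t_authentication, t_submit) pairs in seconds (assumed layout).
    """
    delays = [delay_time(t_submit, t_auth) for t_auth, t_submit in records]
    return [mean(delays[i:i + batch_size]) for i in range(0, len(delays), batch_size)]

# Made-up example: 1500 committed transactions in 10 s, and two batches of timestamps.
print(throughput(1500, 10.0))
sample = [(0.0, 0.5)] * 150 + [(1.0, 2.0)] * 150
print(batch_mean_delays(sample))
```

Averaging over fixed-size batches, as in the 150-transaction windows used in the experiment, smooths out per-transaction jitter so that curves for different node counts can be compared.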

6.2.2. Method Performance Evaluation

To comprehensively evaluate the linkage quality of the proposed method, we assess the MP-PPRL-CBT method in terms of three metrics: precision, recall, and F-score. We examine how these metrics change for the MP-PPRL-CBT method and the compared methods as the number of participants increases, on datasets with three different degrees of perturbation.
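For reference, the three linkage-quality metrics can be computed from the set of predicted matches and the set of true matches. The sketch below assumes the standard definitions (precision = TP / predicted, recall = TP / true, F-score as their harmonic mean) and represents a multi-party match as a tuple of record identifiers, one per participant; the identifiers and function names are made up for the example.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def linkage_quality(predicted: set, truth: set) -> tuple:
    """Return (precision, recall, F-score) for a set of predicted matches."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f_score(precision, recall)

# Hypothetical three-party example: each match is (id at party A, id at party B, id at party C).
truth = {("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3"), ("a4", "b4", "c4")}
predicted = {("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b9", "c3")}

precision, recall, f = linkage_quality(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f, 3))
```

As a check against the reported values, plugging the MP-PPRL-CBT recall (0.87, Table 4) and precision (0.82, Table 5) for Mod-1 with three participants into `f_score` gives approximately 0.84, which agrees with Table 6.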

When the dataset size is 100 K and the split count is s = 3, on the perturbed dataset Mod-1, our proposed method surpasses the ABEL method when the number of participants is 5, and its recall remains consistently above 0.4. On the perturbed dataset Mod-2, our proposed method performs similarly to the ABEL method when the number of participants is 3 or 5. On the perturbed dataset Mod-3, as the number of participants increases, our proposed method consistently outperforms Karapiperis’ method but falls below the ABEL and Randall methods. Overall, across the three perturbed datasets, as the number of participants and the degree of perturbation increase, the recall of our proposed method remains slightly lower than that of the ABEL and Randall methods but higher than that of the Karapiperis method. This is because our proposed method uses a binary storage tree as the storage structure, which reduces the number of similarity comparisons and thereby improves efficiency. However, because the algorithm does not exhaustively traverse all possible matching record pairs, a small number of matching records may be missed. The variation in recall with the number of participants in the three perturbed datasets is illustrated in Figure 9.

The specific values of recall with the number of participants in the three disturbed datasets are shown in Table 4.

When the dataset size is 100 K and the split count is s = 3, we evaluate the changes in precision rates of the MP-PPRL-CBT method and other methods as the number of participants increases. Overall, across different perturbed datasets, there is a decreasing trend in precision rates with an increase in the number of participants for all methods. However, our proposed method exhibits a lower rate of decline compared to the other four methods. As the number of participants increases, the precision rate generally remains above 0.4 for our proposed method. Our proposed method combines homomorphic encryption technology to enhance the security of encoding, resulting in a slightly lower precision rate compared to the ABEL method overall. The results of precision changes with the number of participants in the three perturbed datasets are illustrated in Figure 10.

The specific values of precision with the number of participants in the three disturbed datasets are shown in Table 5.

When the dataset size is 100 K and the split count is s = 3, we evaluate the changes in the F-score of the MP-PPRL-CBT method proposed in this paper and other methods as the number of participants increases. Overall, with an increase in the number of participants, the F-score of our proposed method consistently remains at a moderate level. As the number of participants increases, even with a higher degree of dataset perturbation, the F-score of our proposed method generally stays above 0.4. Overall, our proposed method shows relatively good linkage quality with significantly improved security. The results of F-measure changing with the number of participants in the three disturbed datasets are illustrated in Figure 11.

The specific values of the F-measure with the number of participants in the three disturbed datasets are shown in Table 6.

As the level of perturbation increases, genuine matching records are more prone to being lost during record linkage. Therefore, at every degree of dataset perturbation, the precision, recall, and F-score of the proposed method and the baseline methods all exhibit a downward trend as the number of participants increases. As the disturbance level or the number of participants increases, the F-measure of each method shows a downward trend, as shown in Figure 12.

Comparing the F-measure values of our proposed method with those of the ABEL method at different disturbance levels and with different numbers of participants, we can see that the F-measure values are almost the same, as shown in Figure 13. The method proposed in this paper is mainly an extension of the ABEL method. It improves security while the linkage quality changes little, so multiple datasets can still be linked effectively.
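The size of the difference between the two methods can be read off Table 6. The following sketch subtracts the F-measure values reported there; only the table values come from the paper, while the dictionary layout and variable names are illustrative assumptions.

```python
# F-measure values copied from Table 6, ordered by number of participants (3, 5, 7, 9).
F_MEASURE = {
    "ABEL": {
        "Mod-1": (0.91, 0.80, 0.67, 0.62),
        "Mod-2": (0.83, 0.69, 0.59, 0.53),
        "Mod-3": (0.67, 0.58, 0.48, 0.38),
    },
    "MP-PPRL-CBT": {
        "Mod-1": (0.84, 0.76, 0.61, 0.53),
        "Mod-2": (0.78, 0.64, 0.54, 0.48),
        "Mod-3": (0.65, 0.54, 0.45, 0.35),
    },
}

# Gap = ABEL minus MP-PPRL-CBT, rounded to the two decimals reported in the table.
gaps = {
    level: [
        round(abel - ours, 2)
        for abel, ours in zip(F_MEASURE["ABEL"][level], F_MEASURE["MP-PPRL-CBT"][level])
    ]
    for level in F_MEASURE["ABEL"]
}

for level, values in gaps.items():
    print(level, values)

all_gaps = [g for values in gaps.values() for g in values]
print(min(all_gaps), max(all_gaps))
```

On these values the gap in favor of ABEL lies between 0.02 and 0.09 across all disturbance levels and participant counts.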

7. Conclusions

We propose a multi-party parallel privacy-preserving record linkage method based on consortium blockchain technology. By utilizing a consortium blockchain, we can effectively address the issue of semi-trusted third-party verification and audit whether any party involved in the PPRL process has engaged in malicious tampering or attacks. For the consensus process on the consortium blockchain, we introduce a consensus algorithm that enhances the efficiency and security of consensus. We employ SBFs for similarity calculations, which reduces the amount of information shared among participants and thus lowers the potential for cryptanalysis. Additionally, to prevent participants or the STTP from inferring the complete Bloom filter encoding through the SBF, we also subject the SBF to homomorphic encryption. To improve computational efficiency, we utilize the MapReduce model for parallel encryption and employ a binary storage tree as the data storage structure for similarity calculations. The experimental results demonstrate that our method effectively ensures data security and achieves relatively high linkage quality and scalability. In conclusion, the proposed method is feasible and practical. However, certain challenges persist, such as enhancing computational efficiency and addressing potential malicious attacks. Future research will focus on improving the computational efficiency of the consortium blockchain and adopting parallel similarity calculations without compromising security, as well as improving the linkage quality under parallel computing.

Author Contributions

Conceptualization, S.H. and Z.W.; methodology, S.H. and Z.W.; software, S.H. and Z.W.; validation, S.H. and Z.W.; formal analysis, S.H., C.W., and D.S.; investigation, S.H., D.S., and C.W.; resources, S.H. and Z.W.; data curation, S.H. and Z.W.; writing—original draft preparation, S.H. and Z.W.; writing—review and editing, S.H. and Z.W.; visualization, S.H., Z.W., and D.S.; supervision, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (62172082) and the Education Department of Liaoning Province, Youth Project (LJKQZ20222440).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, T.; Gu, Y.; Zhou, X.; Ma, Q.; Yu, G. An effective and efficient truth discovery framework over data streams. In Proceedings of the EDBT, Venice, Italy, 21–24 March 2017; pp. 180–191.
Wang, J.; Li, T.; Wang, A.; Liu, X.; Chen, L.; Chen, J.; Liu, J.; Wu, J.; Li, F.; Gao, Y. Real-time Workload Pattern Analysis for Large-scale Cloud Databases. arXiv 2023, arXiv:2307.02626.
Vatsalan, D.; Christen, P.; Verykios, V.S. A taxonomy of privacy-preserving record linkage techniques. Inf. Syst. 2013, 38, 946–969.
Christen, P.; Vatsalan, D. A Flexible Data Generator for Privacy-Preserving Data Mining and Record Linkage; The Australian National University: Canberra, Australia, 2012.
Christen, P.; Vatsalan, D.; Verykios, V.S. Challenges for privacy preservation in data integration. J. Data Inf. Qual. (JDIQ) 2014, 5, 1–3.
Vatsalan, D.; Karapiperis, D.; Gkoulalas-Divanis, A. An Overview of Big Data Issues in Privacy-Preserving Record Linkage. In Proceedings of the Algorithmic Aspects of Cloud Computing: 4th International Symposium, ALGOCLOUD 2018, Helsinki, Finland, 20–21 August 2018; Revised Selected Papers 4. 2019; pp. 118–136.
Pita, R.; Pinto, C.; Melo, P.; Silva, M.; Barreto, M.; Rasella, D. A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In Proceedings of the EDBT/ICDT Workshops, Brussels, Belgium, 27 March 2015; pp. 17–26.
Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. A survey of blocking and filtering techniques for entity resolution. arXiv 2019, arXiv:1905.06167.
El-Hindi, M.; Heyden, M.; Binnig, C.; Ramamurthy, R.; Arasu, A.; Kossmann, D. Blockchaindb-towards a shared database on blockchains. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1905–1908.
Karapiperis, D.; Gkoulalas-Divanis, A.; Verykios, V.S. FEDERAL: A framework for distance-aware privacy-preserving record linkage. IEEE Trans. Knowl. Data Eng. 2017, 30, 292–304.
Karapiperis, D.; Gkoulalas-Divanis, A.; Verykios, V.S. Distance-aware encoding of numerical values for privacy-preserving record linkage. In Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering (ICDE), San Diego, CA, USA, 19–22 April 2017; pp. 135–138.
Durham, E.A.; Kantarcioglu, M.; Xue, Y.; Toth, C.; Kuzu, M.; Malin, B. Composite bloom filters for secure record linkage. IEEE Trans. Knowl. Data Eng. 2013, 26, 2956–2968.
Nóbrega, T.; Pires, C.E.S.; Nascimento, D.C. Blockchain-based privacy-preserving record linkage: Enhancing data privacy in an untrusted environment. Inf. Syst. 2021, 102, 101826.
Vatsalan, D.; Christen, P. Multi-party privacy-preserving record linkage using bloom filters. arXiv 2016, arXiv:1612.08835.
Han, S.; Shen, D.; Nie, T.; Kou, Y.; Yu, G. An enhanced privacy-preserving record linkage approach for multiple databases. Clust. Comput. 2022, 25, 3641–3652.
Yao, S.; Ren, Y.; Wang, D.; Wang, Y.; Yin, W.; Yuan, L. SNN-PPRL: A secure record matching scheme based on siamese neural network. J. Inf. Secur. Appl. 2023, 76, 103529.
Randall, S.M.; Brown, A.P.; Ferrante, A.M.; Boyd, J.H.; Semmens, J.B. Privacy preserving record linkage using homomorphic encryption. In Proceedings of the First International Workshop on Population Informatics for Big Data (PopInfo’15), Sydney, Australia, 10 August 2015.
Christen, P.; Schnell, R.; Ranbaduge, T.; Vidanage, A. A critique and attack on “Blockchain-based privacy-preserving record linkage”. Inf. Syst. 2022, 108, 101930.
Yao, H.; Wei, H.; Han, S.; Shen, D. Efficient multi-party privacy-preserving record linkage based on blockchain. In Proceedings of the International Conference on Web Information Systems and Applications, Dalian, China, 16–18 September 2022; pp. 649–660.
Christen, P.; Ranbaduge, T.; Vatsalan, D.; Schnell, R. Precise and fast cryptanalysis for Bloom filter based privacy-preserving record linkage. IEEE Trans. Knowl. Data Eng. 2018, 31, 2164–2177.
Vidanage, A.; Ranbaduge, T.; Christen, P.; Schnell, R. Efficient pattern mining based cryptanalysis for privacy-preserving record linkage. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1698–1701.
Li, T.; Huang, R.; Chen, L.; Jensen, C.S.; Pedersen, T.B. Compression of uncertain trajectories in road networks. Proc. VLDB Endow. 2020, 13, 1050–1063.
Di Matteo, S.; Gerfo, M.L.; Saponara, S. VLSI Design and FPGA implementation of an NTT hardware accelerator for Homomorphic seal-embedded library. IEEE Access 2023, 11, 72498–72508.
Doröz, Y.; Öztürk, E.; Sunar, B. Accelerating fully homomorphic encryption in hardware. IEEE Trans. Comput. 2014, 64, 1509–1521.
Jung, W.; Lee, E.; Kim, S.; Kim, J.; Kim, N.; Lee, K.; Min, C.; Cheon, J.H.; Ahn, J.H. Accelerating fully homomorphic encryption through architecture-centric analysis and optimization. IEEE Access 2021, 9, 98772–98789.
Tien Tuan Anh, D.; Ji, W.; Gang, C.; Rui, L. Blockbench: A Framework for Analyzing Private Blockchains. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; pp. 14–19.
Boussis, D.; Dritsas, E.; Kanavos, A.; Sioutas, S.; Tzimas, G.; Verykios, V.S. MapReduce Implementations for Privacy Preserving Record Linkage. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, Athens, Greece, 9–12 July 2018; pp. 1–4.
Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Prague, Czech Republic, 2–6 May 1999; pp. 223–238.
Nóbrega, T.; Pires, C.E.S.; Nascimento, D.C. Explanation and answers to critiques on: Blockchain-based Privacy-Preserving Record Linkage. Inf. Syst. 2022, 108, 101935.
Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B. TRACE: Real-time compression of streaming trajectories in road networks. Proc. VLDB Endow. 2021, 14, 1175–1187.
Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B.; Gao, Y.; Hu, J. Evolutionary clustering of moving objects. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2399–2411.

Figure 1. The similarity and error between the BF and SBF.

Figure 2. The overall process of multi-party PPRL.

Figure 3. The process of the data preparation and generation module.

Figure 4. The process of the approximate matching module.

Figure 5. The structure of the binary storage tree.

Figure 6. Runtime with different values for dataset sizes.

Figure 7. Throughput with different values for the number of nodes.

Figure 8. Consensus delay with different values for the number of nodes.

Figure 9. The variation in recall with the number of participants in the three disturbed datasets.

Figure 10. The variation in precision with the number of participants in the three disturbed datasets.

Figure 11. The variation in the F-measure with the number of participants in the three disturbed datasets.

Figure 12. The distribution of F-measure values with different disturbance levels of datasets and different numbers of participants.

Figure 13. The F-measure distribution of the ABEL method and our proposed method.

Table 1. Summary of PPRL-related methods.

Method	Advantages	Disadvantages	Solutions
The PPRL methods based on the honest but curious model	Linkage quality and efficiency are high	Typically, the linkage needs to be entrusted to STTPs	Consortium blockchain
The PPRL methods based on the malicious adversary model	The security of linkage is high	Efficiency is low, and it cannot be verified which malicious party deviated from the protocol	Consortium blockchain and MapReduce model
The PPRL methods based on blockchain	All parties involved in the PPRL process can be audited for potential malicious tampering or attacks	Efficiency is low, and security issues remain	Homomorphic encryption, consensus algorithm, and binary storage tree

Table 2. The parameters used in this paper.

Parameters	Description
$P$	Participant of PPRL
$e$	Entity
$e^{τ}$	Anonymized entity
$D_{p}$	Dataset of participant $p$
$D_{p}^{τ}$	Anonymized dataset of participant $p$
$l$	Bloom filter length
$n$	Collection of q-grams to be added to Bloom filter
$k$	hash functions $h_{1}, h_{2}, \dots, h_{k}$
$s$	Number of splits
$ϕ$	Splitting Bloom filter (SBF)
$ϕ^{'}$	Homomorphic encryption value of SBF
$α$	Threshold α
$β$	Threshold β (β = α − error)
$ς$	List of entity (id) pairs with their similarity values
$φ$	A set of SBF ( $ϕ^{'}$ )
$ε$	Error

Table 3. The specific runtime of each method with different values for dataset sizes.

Method	Dataset Size = 5 K	Dataset Size = 50 K	Dataset Size = 500 K
MP-PPRL-CBT	12	90	1300
MP-PPRL-CBT (Binary Storage Tree)	70	600	8800
MP-PPRL-CBT (MapReduce)	18	130	2100

Table 4. The variation in recall with the number of participants in the three disturbed datasets.

Method	Participants = 3	Participants = 5	Participants = 7	Participants = 9
Randall	0.95 (Mod-1)	0.88 (Mod-1)	0.74 (Mod-1)	0.65 (Mod-1)
	0.91 (Mod-2)	0.74 (Mod-2)	0.64 (Mod-2)	0.55 (Mod-2)
	0.77 (Mod-3)	0.72 (Mod-3)	0.66 (Mod-3)	0.53 (Mod-3)
MP-PPRL-CBT	0.87 (Mod-1)	0.78 (Mod-1)	0.58 (Mod-1)	0.53 (Mod-1)
	0.75 (Mod-2)	0.58 (Mod-2)	0.48 (Mod-2)	0.43 (Mod-2)
	0.71 (Mod-3)	0.61 (Mod-3)	0.48 (Mod-3)	0.42 (Mod-3)
ABEL	0.91 (Mod-1)	0.79 (Mod-1)	0.63 (Mod-1)	0.57 (Mod-1)
	0.81 (Mod-2)	0.63 (Mod-2)	0.53 (Mod-2)	0.47 (Mod-2)
	0.73 (Mod-3)	0.66 (Mod-3)	0.51 (Mod-3)	0.46 (Mod-3)
Karapiperis	0.72 (Mod-1)	0.66 (Mod-1)	0.53 (Mod-1)	0.45 (Mod-1)
	0.69 (Mod-2)	0.54 (Mod-2)	0.43 (Mod-2)	0.35 (Mod-2)
	0.52 (Mod-3)	0.37 (Mod-3)	0.29 (Mod-3)	0.20 (Mod-3)

Table 5. The variation in precision with the number of participants in the three disturbed datasets.

Method	Participants = 3	Participants = 5	Participants = 7	Participants = 9
Randall	0.94 (Mod-1)	0.85 (Mod-1)	0.76 (Mod-1)	0.71 (Mod-1)
	0.90 (Mod-2)	0.81 (Mod-2)	0.73 (Mod-2)	0.69 (Mod-2)
	0.78 (Mod-3)	0.68 (Mod-3)	0.51 (Mod-3)	0.41 (Mod-3)
MP-PPRL-CBT	0.82 (Mod-1)	0.75 (Mod-1)	0.66 (Mod-1)	0.55 (Mod-1)
	0.82 (Mod-2)	0.72 (Mod-2)	0.63 (Mod-2)	0.55 (Mod-2)
	0.60 (Mod-3)	0.50 (Mod-3)	0.43 (Mod-3)	0.31 (Mod-3)
ABEL	0.91 (Mod-1)	0.81 (Mod-1)	0.72 (Mod-1)	0.68 (Mod-1)
	0.87 (Mod-2)	0.78 (Mod-2)	0.69 (Mod-2)	0.61 (Mod-2)
	0.62 (Mod-3)	0.53 (Mod-3)	0.45 (Mod-3)	0.33 (Mod-3)
Karapiperis	0.75 (Mod-1)	0.66 (Mod-1)	0.54 (Mod-1)	0.45 (Mod-1)
	0.71 (Mod-2)	0.64 (Mod-2)	0.51 (Mod-2)	0.41 (Mod-2)
	0.51 (Mod-3)	0.47 (Mod-3)	0.30 (Mod-3)	0.21 (Mod-3)

Table 6. The variation in the F-measure with the number of participants in the three disturbed datasets.

Method	Participants = 3	Participants = 5	Participants = 7	Participants = 9
Randall	0.94 (Mod-1)	0.86 (Mod-1)	0.74 (Mod-1)	0.67 (Mod-1)
	0.90 (Mod-2)	0.77 (Mod-2)	0.68 (Mod-2)	0.61 (Mod-2)
	0.77 (Mod-3)	0.69 (Mod-3)	0.57 (Mod-3)	0.46 (Mod-3)
MP-PPRL-CBT	0.84 (Mod-1)	0.76 (Mod-1)	0.61 (Mod-1)	0.53 (Mod-1)
	0.78 (Mod-2)	0.64 (Mod-2)	0.54 (Mod-2)	0.48 (Mod-2)
	0.65 (Mod-3)	0.54 (Mod-3)	0.45 (Mod-3)	0.35 (Mod-3)
ABEL	0.91 (Mod-1)	0.80 (Mod-1)	0.67 (Mod-1)	0.62 (Mod-1)
	0.83 (Mod-2)	0.69 (Mod-2)	0.59 (Mod-2)	0.53 (Mod-2)
	0.67 (Mod-3)	0.58 (Mod-3)	0.48 (Mod-3)	0.38 (Mod-3)
Karapiperis	0.73 (Mod-1)	0.66 (Mod-1)	0.53 (Mod-1)	0.45 (Mod-1)
	0.69 (Mod-2)	0.58 (Mod-2)	0.46 (Mod-2)	0.37 (Mod-2)
	0.51 (Mod-3)	0.41 (Mod-3)	0.29 (Mod-3)	0.20 (Mod-3)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
