Opinion

A Framework for the Design of Privacy-Preserving Record Linkage Systems

RTI International, 3040 E Cornwallis Rd, Research Triangle Park, NC 27709, USA
*
Author to whom correspondence should be addressed.
J. Cybersecur. Priv. 2025, 5(3), 44; https://doi.org/10.3390/jcp5030044
Submission received: 28 May 2025 / Revised: 3 July 2025 / Accepted: 4 July 2025 / Published: 9 July 2025
(This article belongs to the Section Privacy)

Abstract

Record linkage can enhance the utility of data by bringing data together from different sources, increasing the available information about data subjects and providing more holistic views. Doing so, however, can increase privacy risks. To mitigate these risks, a family of methods known as privacy-preserving record linkage (PPRL) was developed, using techniques such as cryptography, de-identification, and the strict separation of roles to ensure data subjects’ privacy remains protected throughout the linkage process, and the resulting linked data poses no additional privacy risks. Building privacy protections into the architecture of the system (for instance, ensuring that data flows between different parties in the system do not allow for transmission of private information) is just as important as the technology used to obfuscate private information. In this paper, we present a technology-agnostic framework for designing PPRL systems that is focused on privacy protection, defining key roles, providing a system architecture with data flows, detailing system controls, and discussing privacy evaluations that ensure the system protects privacy. We hope that the framework presented in this paper can both help elucidate how currently deployed PPRL systems protect privacy and help developers design future PPRL systems.

1. Introduction

The linkage of data from multiple sources can provide rich and varied datasets that give holistic views of individuals, providing significant increases in data utility as compared with separate, disparate datasets. For example, the linkage of an electronic health record (EHR) dataset with a social determinants of health (SDOH) dataset can illuminate social factors that are related to health outcomes, which is of interest to policymakers seeking to make evidence-based policy decisions [1,2,3,4,5]. Linkage can also streamline data management, usage, and analysis, making data use more efficient while enabling previously unavailable analyses.
However, linkage presents significant risks to individual privacy, not only because the resulting linked data can contain more information about the individual but also because traditional record linkage involves using personally identifiable information (PII) such as name, phone number, email address, unique codes like social security numbers, and indirect identifiers like date of birth, sex, ZIP code of residence, and dates of special events (e.g., a surgery or hospital admission) to match records across datasets [6]. Handling significant amounts of identifiable information can pose privacy and security risks, the mitigation of which can create barriers to usage and sharing. For instance, data that contains identifiable information often cannot be used outside the purposes for which it was initially gathered without either obtaining additional consent or ensuring that the usage falls under specific allowances (e.g., public health emergencies) [7]. This limits the ability of researchers, analysts, and policymakers to use these data in identifiable formats, creating the need for de-identified data with sufficient privacy protections to ensure de-identification remains in place throughout the data usage lifecycle.
After data is de-identified, traditional methods of record linkage cannot be performed because the identifying information necessary for linkage has been removed. To enable such linkage, a family of techniques known as privacy-preserving record linkage (PPRL) has been developed that leverages cryptography, secret sharing, and careful system design to ensure that individual privacy remains protected throughout the linkage process [8]. These techniques have undergone comprehensive review and evaluation for their utility in federal data linkage projects with investigations of both open-source and proprietary commercial systems [9,10,11,12] and have been deployed in real-life data linkage and analysis systems such as that used by the N3C collaborative [13,14].
Missing from the literature is a framework for the design and architecture of a PPRL system, describing the roles and responsibilities of the different parties involved within the system, the transfers of data and information flow between the different parties, the protections that need to be put in place, and the evaluations that need to be performed. The existing literature focuses on tooling and techniques, leaving information gaps around how to structure processes and organize data flows between the different parties. This creates the misconception that PPRL can be achieved solely through technical measures, overlooking that the process and people involved are just as important to ensuring privacy protection as the technology. In this paper, we provide a framework for the design of PPRL systems, leveraging experience with real-world PPRL systems to provide a system architecture with broad applicability across many use cases. The focus of our framework is privacy protection; with this focus, we can use the framework both to design systems that better protect privacy when linking data and to explain to others how the system protects privacy in a simple and understandable manner. Our framework can be used to simplify the complexities within existing systems, making the data flows and privacy-preservation techniques in use easier to understand, and it provides developers and deployers of PPRL systems with a broadly applicable technology-agnostic architecture that can be used to build their own systems. We hope that this framework enables the construction of more PPRL systems, supporting greater data sharing and collaboration in a manner that protects the privacy of data subjects.
This paper is organized as follows. We begin with a section explaining the basic aims of PPRL and the techniques that enable it, to establish a common baseline of understanding. We then move on to our system design framework, listing the key roles in a PPRL system, describing the data flow architecture for a PPRL system designed under our framework, detailing system controls to be put in place around a PPRL system to protect privacy, and outlining what the qualified expert evaluation is and what should be evaluated. In the discussion section, we describe how the framework can be used, focusing on the two primary use cases of providing clarity as to the architecture of an already existing PPRL system and designing a new system.

2. What Is PPRL?

PPRL is a family of techniques for enabling data linkage while protecting the privacy of data subjects. PPRL is intended to mitigate privacy risks and reduce barriers to data linkage and data use, enabling the access to and usage of linked data without needing to process large amounts of identifiable information [10]. Key to privacy protection in these systems is the requirement that no party within a PPRL system can access PII they do not already control [11,15]. PPRL does this by enabling record linkages without using PII via tokenization, where the PII traditionally used for linkage is transformed in a reproducible manner into a unique identifier for data subjects and events associated with those data subjects [16,17]. The tokens can then be matched across datasets to perform linkage instead of using the PII. Tokenization has the additional benefit of de-identifying the component datasets, enabling secondary usage without the need to obtain data subject consent [15]. Figure 1 shows an example of the design of a tokenizer.
The current state of the art in tokenization uses a two-step process to generate linkage tokens and controller-specific tokens [12,17]. First, PII used for linkage gets passed into the tokenizer, where a master salt (i.e., a special set of characters used by the tokenizer for linkage) gets added to the PII. The PII plus master salt is then passed through a one-way hashing algorithm to create the linkage token. This token will be the same for matching sets of PII regardless of the source, which is how it enables data linkage. The linkage token then undergoes a secondary encryption process using keys that are specific to each data controller, creating the controller-specific tokens. These tokens can be used by data controllers as unique identifiers for their data subjects, but controllers cannot use them for linkage on their own. Instead, a third party who holds the keys to decrypt the controller-specific tokens can transform them back into the linkage tokens and use those to perform linkage for data controllers. With secondary encryption, controllers can be sure that they do not reveal data subjects present in their data to other parties in the linkage process.
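The two-step process described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular tokenization product. The normalization step, the salt, the keys, and all function names are hypothetical, and the small Feistel construction built from HMAC-SHA-256 is only a standard-library stand-in for a real cipher such as AES.

```python
import hashlib
import hmac

def normalize_pii(first_name: str, last_name: str, dob: str) -> str:
    """Assumed pre-processing: trim and lowercase so equal PII hashes equally."""
    return "|".join(part.strip().lower() for part in (first_name, last_name, dob))

def make_linkage_token(pii: str, master_salt: bytes) -> bytes:
    """Step 1: add the master salt to the PII and apply a one-way hash (SHA-256)."""
    return hashlib.sha256(master_salt + pii.encode("utf-8")).digest()

def _round(key: bytes, index: int, half: bytes) -> bytes:
    """Keyed round function for the toy Feistel cipher (16-byte output)."""
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[:16]

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def to_controller_token(linkage_token: bytes, key: bytes) -> bytes:
    """Step 2: reversible, deterministic encryption under a controller-specific key.
    Illustrative stand-in only; a deployed system would use a vetted cipher."""
    left, right = linkage_token[:16], linkage_token[16:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right

def to_linkage_token(controller_token: bytes, key: bytes) -> bytes:
    """Inverse of step 2, available only to the party holding the key."""
    left, right = controller_token[:16], controller_token[16:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right

# Hypothetical secrets; real values would be generated and stored securely.
MASTER_SALT = b"hypothetical-master-salt"
KEY_A = b"hypothetical-key-for-controller-A"
KEY_B = b"hypothetical-key-for-controller-B"

# The same person as recorded by two controllers, with formatting differences.
link_a = make_linkage_token(normalize_pii("Ada", "Lovelace", "1815-12-10"), MASTER_SALT)
link_b = make_linkage_token(normalize_pii(" ada ", "LOVELACE", "1815-12-10"), MASTER_SALT)

token_a = to_controller_token(link_a, KEY_A)
token_b = to_controller_token(link_b, KEY_B)

# Only a key holder can recover the shared linkage token and see the match.
match = to_linkage_token(token_a, KEY_A) == to_linkage_token(token_b, KEY_B)
```

In this sketch, the two controller-specific tokens differ even though they refer to the same person, so neither controller can match its tokens against the other's without the party that holds both keys.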
While tokenization enables PPRL, additional technical and administrative controls need to be designed into the record linkage system to ensure privacy preservation throughout the linkage process. While the requirement that no party within a PPRL system can access PII they do not already control is easy to understand, designing a record linkage system that implements this requirement can be difficult, with non-obvious pitfalls that put the privacy preservation of the system at risk. For instance, PPRL systems should allow original data controllers to be able to re-identify subjects in the data they provided. However, they should not be able to re-identify those same data subjects in data provided by other parties. Additionally, in some use cases, controllers need to identify whether a data subject within their data can be found in another controller’s data (a process known as de-duplication) without re-identifying the data subject in the other parties’ data. Creating a system that can satisfy such complex and seemingly conflicting requirements requires careful design to ensure that privacy protections remain in place.

3. System Design Framework for PPRL

In this section, we present our framework for PPRL system design. This framework aims to capture all the key aspects of PPRL systems necessary for privacy preservation throughout the linkage process. We start by presenting an overview of the key roles within a PPRL system, then move on to discuss the design of the system itself, presenting a data flow diagram with different parties marked out and the flow of information between the different parties. We present a system architecture for creating a linked dataset for analysis that can be modified for other use cases. We demonstrate a modification of this architecture to enable de-duplication and data harmonization between multiple data controllers. We then discuss the protections that need to be put in place to enable PPRL and conclude with key points that need to be part of the privacy evaluation.

3.1. Key Roles in a PPRL System

Table 1 describes the roles and responsibilities of different parties within a PPRL system. This table can be used to identify roles and establish responsibilities for different parties to a PPRL system. It can also help identify which roles have not been filled by existing parties so that either existing parties take up multiple roles or new parties can be engaged to take on those roles. All roles should be filled to ensure privacy protections in a PPRL system.

3.2. Architecture of PPRL Systems

As the primary goal of PPRL is to link data while protecting privacy, that is, to link data without any party to the linkage obtaining PII they do not already possess, the system must be designed so that flows of data do not involve transfer of PII outside of any individual party’s control. Key to enabling this goal are the Tokenizer and the Honest Broker; the Tokenizer masks PII used for linkage before it gets transferred outside a data controller’s security perimeter, and the Honest Broker facilitates the linkage so that controllers do not need to handle tokens generated from another controller’s data. Figure 2 shows the architecture of the PPRL system, focusing on data flows between different parties in the system, with the goal of creating a linked dataset for analysis.
In this system, linked de-identified data from two controllers is meant to be provided to users of a data platform for analysis. Each step of the process is marked in the figure and listed below.
  • Data controllers send the PII they use for linkage through a Tokenizer to create linkage tokens and controller-specific tokens.
  • Tokens are then sent to the Honest Broker.
  • Data controllers provide the data platform with de-identified data carrying the controller-specific tokens generated by the Tokenizer.
  • The Honest Broker facilitates linkage by either providing mapping tables between the linkage tokens and controller-specific tokens or by providing the encryption keys that can decrypt the controller-specific tokens back into the linkage tokens. The linkage process can be performed automatically by the platform using the information provided by data controllers and the Honest Broker.
  • A qualified expert determiner examines the system and the processes performed by the data platform to ensure that no PII gets shared at any point in this process and may make recommendations for mitigating any privacy gaps if necessary. These recommendations could include modifications to the data flow, additional encryption steps, increasing the security context of the data platform, adding access controls, limiting the number or type of data queries, perturbing the linked data to prevent the data controllers from linking it to their original data, or others.
  • Data users are able to submit queries to the platform and receive the linked de-identified data they need for their analyses. They will not be able to see the process performed by the data platform to create the linked dataset.
This system architecture is flexible in that it can accommodate any number of data controllers (our example shows just two for simplicity), multiple data platforms for analysis, multiple Honest Brokers for different linkage purposes, or even multiple Tokenizer tools. It is technology-agnostic, not requiring any specific hardware or software to be implementable. To ensure the system is privacy-preserving, the data flows as described between the different key roles in the system must be in place.
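As an illustration of the linkage step performed on the data platform, the following Python sketch assumes the Honest Broker supplies mapping tables from controller-specific tokens to linkage tokens. All token values, field names, and the function name are hypothetical, and the example follows the EHR and SDOH scenario from the Introduction.

```python
# De-identified records as delivered to the data platform, keyed by each
# controller's own controller-specific token (hypothetical values).
ehr_records = {
    "A-001": {"diagnosis": "asthma"},
    "A-002": {"diagnosis": "diabetes"},
}
sdoh_records = {
    "B-917": {"housing": "unstable"},
    "B-530": {"housing": "stable"},
}

# Mapping tables supplied by the Honest Broker:
# controller-specific token -> linkage token.
broker_map_a = {"A-001": "L-77", "A-002": "L-12"}
broker_map_b = {"B-917": "L-77", "B-530": "L-45"}

def link_records(records_a, map_a, records_b, map_b):
    """Join two de-identified datasets on the linkage token.

    The output rows carry no tokens at all, so data users cannot trace a
    linked row back to either controller's identifiers.
    """
    # Re-key the second dataset by linkage token for constant-time lookups.
    by_linkage_b = {map_b[token]: record for token, record in records_b.items()}
    linked = []
    for token, record in records_a.items():
        match = by_linkage_b.get(map_a[token])
        if match is not None:
            linked.append({**record, **match})
    return linked

linked_dataset = link_records(ehr_records, broker_map_a, sdoh_records, broker_map_b)
```

Dropping every token from the output is a design choice made for this sketch; a platform that needs to update linked records over time might instead keep a platform-specific identifier.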
The architecture of this system can be modified for other use cases as well. For instance, in some PPRL frameworks, instead of linking data, data controllers may be trying to determine whether they hold duplicate or redundant information between themselves. They may be performing this as a required step in data harmonization (e.g., removing duplicate records prior to combining their data, a process used within N3C [13]) or to identify individuals that are of interest because they appear in two different datasets (e.g., two states comparing data to find individuals who may have crossed state lines, a use case present in a pilot program that looked at linkage of Supplemental Nutrition Assistance Program (SNAP) recipients across state lines to prevent duplicate issuances of SNAP benefits [18]). In this case, the Honest Broker identifies overlapping linkage tokens, maps them back to controller tokens, and delivers the matched controller tokens back to their original data controller. Figure 3 is an architecture diagram showing the data flows for this use case, which is a simplified version of the diagram in Figure 2.
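A minimal Python sketch of the de-duplication flow is given below, under the assumption that the Honest Broker has already recovered, for each controller, a table from linkage token to controller-specific token. The token values and the function name are hypothetical.

```python
def find_duplicates(tokens_a, tokens_b):
    """Return, for each controller, its own tokens for subjects found in both datasets.

    tokens_a and tokens_b map linkage token -> controller-specific token.
    Each controller receives only tokens it generated itself, never the
    other controller's tokens or the linkage tokens.
    """
    overlap = set(tokens_a) & set(tokens_b)
    back_to_a = sorted(tokens_a[linkage] for linkage in overlap)
    back_to_b = sorted(tokens_b[linkage] for linkage in overlap)
    return back_to_a, back_to_b

# Hypothetical broker-side tables for two state agencies.
state_a = {"L-77": "A-001", "L-12": "A-002", "L-30": "A-003"}
state_b = {"L-77": "B-917", "L-45": "B-530", "L-30": "B-222"}

matched_for_a, matched_for_b = find_duplicates(state_a, state_b)
```

Each controller can then re-identify the matched subjects within its own data, since it already controls the PII behind its own tokens.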

3.3. System Controls for Privacy Protection

To ensure the PPRL process remains privacy-preserving, in addition to the system architecture, privacy and security controls need to be put in place. The establishment of a set of controls typically involves working with data governance and privacy officers, performing evaluations such as Privacy Impact Assessments, Data Protection Impact Assessments, Vendor Risk Assessments, Business Impact Assessments, Enterprise Risk Assessments, and potentially Transfer Impact Assessments (used for transfers of data out of the European Union) to identify risks and then determining the best controls to deploy to mitigate those risks. Privacy and security controls utilized to mitigate risks typically include the following:
  • Data sharing agreements between controllers specifying the data they plan to share, who will have access and for what purpose, allowable data usages, term limits, and liabilities;
  • Data use agreements for end users specifying the specific terms of use, including prohibitions against further data sharing without approval and re-identification of data subjects;
  • Honest Broker agreements between controllers and Honest Brokers within the PPRL system, specifying the data Honest Brokers will receive from controllers, the allowed usages for said data, and whom the Honest Brokers can share data with;
  • Business agreements with Tokenizer providers should Tokenization software need to be purchased or licensed;
  • Policies at organizations receiving data (such as the Honest Broker, the host of the data platform, and the data users) that establish roles-based access to data, data protection, secure data storage, data retention and destruction, identity and authentication, regular monitoring of data access and data use, and oversight over the PPRL system;
  • Sanctions for people who violate terms of agreements and privacy and security policies;
  • Personnel granted with authority and responsibility for oversight over the system, which includes IT security, data privacy and protection officers, data governance and ethics councils, and data stewardship committees;
  • Training on privacy, security, and confidentiality for staff who have access to the system and users of the data;
  • Establishment of protocols for privacy breaches, including playbooks for containment, response, and informing those who are affected;
  • IT technical controls that enable identity and authentication, role-based access, secure data storage, data retention and destruction, and monitoring of data access and data use;
  • Secure transmission of data with encryption in transit to ensure sensitive information does not get leaked when transferring data between different parties in the system;
  • Encryption and hardening of systems that hold sensitive information, such as the Honest Broker’s environments that hold lookup tables between different sets of tokens;
  • Implementation of anti-virus and anti-malware programs on all systems where data is stored;
  • Physical and technical security for all systems where data is stored; and
  • Regular review and security audits of the system.
Several frameworks for establishing these controls exist that can be used to guide their implementation, such as the National Institute of Standards and Technology (NIST) Privacy Framework [19] and Cybersecurity Framework [20], International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) standards ISO/IEC 27001:2022 and ISO/IEC 27701:2019 [21,22], Federal Information Security Modernization Act (FISMA) control levels [23], System and Organization Controls 2 (SOC 2) [24], and Health Information Trust Alliance (HITRUST) cybersecurity framework [25].

3.4. Qualified Expert Evaluations

A key part of the PPRL system is an evaluation performed by qualified experts to ensure that privacy protections are in place throughout the record linkage process. This kind of evaluation is similar to the Expert Determination as defined under the HIPAA Privacy Rule [26]. The expert is not only looking for whether the resulting linked data is de-identified but also looking through the linkage process to ensure that data is not identifiable in any part of the linkage process. As such, the expert typically evaluates the following:
  • Whether any PII gets transferred out of a controller’s environment during the tokenization process;
  • Whether the tokens generated through the tokenization process could still be considered PII;
  • Whether the information transferred by the Honest Broker back to data controllers or the data platform could be considered PII;
  • Any additional re-identification risk to the de-identified datasets arising from the data linkage;
  • Whether the controls implemented on the system ensure privacy protection throughout the PPRL process.
After performing a holistic evaluation on the entire PPRL system, the expert provides their report, which acts as evidence that the system can successfully conduct PPRL. If they discover issues during their evaluation, they may also recommend mitigations in the report to ensure privacy preservation. While these evaluations have no set timeframe for expiry and renewal, due to changing technologies and the evolution of privacy regulations, setting up a regular schedule of reviews (e.g., annually) is recommended. Should there be significant changes to the system, such as the addition of new data controllers participating in the linkage system, the addition of new data to link, the switching of Honest Brokers, changing tokenization methods or providers, changing methods by which data users access and analyze the data, or changes to the data flow between different parties, then a re-evaluation may need to be performed.
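One of the checks listed above, whether any PII leaves a controller's environment, can be partially automated as a first-pass screen over a declared inventory of data flows. The Python sketch below is illustrative only: the field names, party names, and the DataFlow structure are hypothetical, and a name-based screen cannot replace the expert's judgment on questions such as whether the tokens themselves could be considered PII.

```python
from dataclasses import dataclass

# Hypothetical field names treated as PII for this screen.
PII_FIELDS = frozenset({"name", "date_of_birth", "ssn", "zip_code", "email"})

@dataclass(frozen=True)
class DataFlow:
    sender: str
    receiver: str
    fields: frozenset

def find_pii_leaks(flows):
    """Return every flow between two different parties that carries a PII field."""
    return [
        flow for flow in flows
        if flow.sender != flow.receiver and flow.fields & PII_FIELDS
    ]

# Declared flows for a hypothetical two-controller system. Tokenization is
# assumed to run inside each controller's own security perimeter.
flows = [
    DataFlow("Controller A", "Honest Broker", frozenset({"controller_token"})),
    DataFlow("Controller A", "Data Platform", frozenset({"controller_token", "diagnosis"})),
    DataFlow("Controller B", "Honest Broker", frozenset({"controller_token"})),
    # Misconfigured flow: date of birth was left in the de-identified extract.
    DataFlow("Controller B", "Data Platform", frozenset({"controller_token", "date_of_birth"})),
    DataFlow("Honest Broker", "Data Platform", frozenset({"token_mapping"})),
    DataFlow("Data Platform", "Data User", frozenset({"diagnosis", "housing"})),
]

leaks = find_pii_leaks(flows)
```

A screen like this could be rerun whenever the data flows change, which is one of the triggers for re-evaluation noted above.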

4. Discussion

Our goal in presenting this framework is to describe a system architecture focused on privacy protection that is broadly applicable to many domains while being technology-agnostic. This framework can be used in two ways: (1) when looking at an existing PPRL system, we can use this framework to determine the roles of each party in the system, map out data flows between the different parties, and explain how the system protects privacy while producing linked data, and (2) when designing a new system, we can use this framework to identify the roles and responsibilities of the different parties who participate in the system and create a system architecture for the data flows between them. We go into further detail on how this framework supports each use case in the following sections.

4.1. Elucidation of Existing Systems

The usage of this framework can help with creating easy-to-understand explanations of the architectures of existing PPRL systems, as well as identifying the roles different parties play in the system and detailing their responsibilities. As an example, we can take the PPRL system currently in use by N3C to facilitate linkages for clinical research. When applying the framework, we can map roles to the different parties in the system; i.e., the clinical sites are data controllers, the Regenstrief Institute acts as the Honest Broker, Datavant is the Tokenization provider, N3C hosts the data platform and provides data governance for the system, researchers who apply for research access are data users, the National Center for Advancing Translational Sciences (NCATS) privacy office acts as the privacy officer overseeing the system, and Datavant and Regenstrief Institute provided their expert evaluation for the system [13,27]. After mapping out the roles, our framework also helps clarify the data transfers between different parties, which allows us to create a simple data flow diagram for the N3C ecosystem using the architecture presented in Figure 2 (shown in Figure 4). The N3C PPRL system also has many of the privacy-protecting system controls listed in Section 3.3 (as well as additional controls for government data systems) [14]. Thus, with this framework, we are able to reduce the complexity of the N3C PPRL system, gaining a clear understanding of the different parties’ roles and responsibilities, what data flows between these different parties, and how privacy is protected while data gets shared, linked, and analyzed.

4.2. Design of New PPRL Systems

The usage of this framework can provide developers of new PPRL systems with an overall system architecture that can enable PPRL, identifying all the different parties who should take part in the system, highlighting privacy and security controls that should be in place, and providing guidance on what to look for when engaging experts for an evaluation. Starting with the roles and responsibilities, the utilization of this framework ensures that the different parties participating in the linkage process get mapped to the key roles and identifies roles that are not filled by existing parties. Afterwards, either existing parties to the linkage take up multiple roles (when allowed), or the developers engage with new parties to fill those roles. Developers can then use the system architecture to map out the data flows between the parties. By following the architecture, developers ensure that no party within the system can access PII they do not already control. They then work with data governance and privacy officers to perform necessary assessments of the system and establish a set of system controls to mitigate identified risks. Finally, they have a qualified third party perform an expert evaluation on the system to ensure privacy protection throughout the linkage process.
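The role-mapping step can be sketched in Python as a simple coverage check. The role names below are an assumption drawn from the roles mentioned in the text and may not match Table 1 exactly. The parties are hypothetical.

```python
# Assumed set of key roles; adjust to match the roles table of the framework.
REQUIRED_ROLES = frozenset({
    "data controller",
    "tokenization provider",
    "honest broker",
    "data platform host",
    "data user",
    "privacy officer",
    "qualified expert",
})

def unfilled_roles(assignments):
    """Return the required roles that no party has taken on yet.

    assignments maps a party name to the set of roles it fills; a party
    may fill more than one role where that is allowed.
    """
    filled = set().union(*assignments.values())
    return REQUIRED_ROLES - filled

# Hypothetical parties in a system that is still being designed.
assignments = {
    "State health department": {"data controller"},
    "State social services agency": {"data controller"},
    "University research center": {"data platform host", "privacy officer"},
    "Approved analysts": {"data user"},
    "Tokenization vendor": {"tokenization provider"},
}

gaps = unfilled_roles(assignments)
```

Here the check reports that no party yet acts as Honest Broker or qualified expert, so the developers would either assign those roles to existing parties where allowed or engage new ones.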

5. Conclusions

The key benefit to using this framework is that it can reduce complexities to focus on the privacy-protecting aspects of PPRL systems. It covers the key roles that should be present within such systems, how the data should flow between different parties to enable linkage while preventing the sharing of PII, the controls to be put on the system to ensure privacy protection, and the necessary components of privacy-protection evaluations. This framework can be used to clarify complexities within currently deployed PPRL systems to explain how they protect privacy while still enabling record linkage. It also helps developers designing PPRL systems build their system with a focus on privacy preservation while being technology agnostic.
Future research plans include the utilization of the framework to create new PPRL systems across various domains. Beyond the healthcare and health research domains where PPRL is already in use, the authors believe PPRL can also be applied to criminal justice data (e.g., facilitating linkages between criminal justice agencies to track criminals across jurisdictions or to map out criminal case proceedings), socioeconomic data (e.g., facilitating linkages between different social and economic surveys), other government data (e.g., data used for determination of government benefit eligibility and identification of benefits fraud), and cross-domain research (e.g., linkage of healthcare data with social survey data to identify social determinants of health). In cases where there are potential overlapping populations and PII should not be shared, PPRL can be an option to enable linkages to create datasets with greater analytic utility while preserving privacy. The utilization of the framework presented in this paper can help researchers across many domains construct PPRL systems to create linked data for their research while ensuring privacy protection.

Author Contributions

The following describes the author contributions to this paper: Conceptualization, Z.N., B.T., D.B., E.G. and E.P.; Funding acquisition, A.B.; Methodology, Z.N.; Project administration, A.B.; Supervision, A.B.; Writing—Original draft, Z.N.; Writing—Review and editing, Z.N., B.T., D.B., E.G. and E.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDPI     Multidisciplinary Digital Publishing Institute
DOAJ     Directory of open access journals
PPRL     Privacy-preserving record linkage
PII      Personal identifying information
EHR      Electronic health record
SDOH     Social determinants of health
N3C      National Clinical Cohorts Collaborative
NCATS    National Center for Advancing Translational Sciences
SHA-2    Secure Hash Algorithm 2
AES      Advanced Encryption Standard
SNAP     Supplemental Nutrition Assistance Program
NIST     National Institute of Standards and Technology
FISMA    Federal Information Security Modernization Act
HITRUST  Health Information Trust Alliance
SOC 2    System and Organization Controls 2
ISO      International Organization for Standardization
IEC      International Electrotechnical Commission

References

  1. Vo, A.; Tao, Y.; Li, Y.; Albarrak, A. The Association Between Social Determinants of Health and Population Health Outcomes: Ecological Analysis. JMIR Public Health Surveill. 2023, 9, e44070.
  2. Burström, B.; Tao, W. Social determinants of health and inequalities in COVID-19. Eur. J. Public Health 2020, 30, 617–618.
  3. Howell, C.R.; Zhang, L.; Yi, N.; Mehta, T.; Garvey, W.T.; Cherrington, A.L. Race Versus Social Determinants of Health in COVID-19 Hospitalization Prediction. Am. J. Prev. Med. 2022, 63, S103–S108.
  4. Park, H.S.; White, R.S.; Ma, X.; Lui, B.; Pryor, K.O. Social determinants of health and their impact on postcolectomy surgery readmissions: A multistate analysis, 2009–2014. J. Comp. Eff. Res. 2019, 8, 1365–1379.
  5. Washington, D.L.; Bean-Mayberry, B.; Riopelle, D.; Yano, E.M. Access to care for women veterans: Delayed healthcare and unmet need. J. Gen. Intern. Med. 2011, 26 (Suppl. S2), 655–661.
  6. Dusetzina, S.B.; Tyree, S.; Meyer, A.-M.; Meyer, A.; Green, L.; Carpenter, W.R. An Overview of Record Linkage Methods. In Linking Data for Health Services Research: A Framework and Instructional Guide [Internet]; Agency for Healthcare Research and Quality (US): Rockville, MD, USA, 2014. Available online: https://www.ncbi.nlm.nih.gov/books/NBK253312/ (accessed on 29 April 2025).
  7. Office for Civil Rights (OCR). Summary of the HIPAA Privacy Rule. Available online: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (accessed on 29 April 2025).
  8. Pathak, A.; Serrer, L.; Zapata, D.; King, R.; Mirel, L.B.; Sukalac, T.; Srinivasan, A.; Baier, P.; Bhalla, M.; David-Ferdon, C.; et al. Privacy preserving record linkage for public health action: Opportunities and challenges. J. Am. Med. Inform. Assoc. 2024, 31, 2605–2612.
  9. Mirel, L.B. Privacy Preserving Techniques: Case Studies from the Data Linkage Program. 19 May 2021. Available online: https://stacks.cdc.gov/view/cdc/114623 (accessed on 15 November 2023).
  10. Mirel, L.B.; Resnick, D.M.; Aram, J.; Cox, C.S. A methodological assessment of privacy preserving record linkage using survey and administrative data. Stat. J. IAOS 2022, 38, 413–421.
  11. Landscape Analysis of Privacy Preserving Patient Record Linkage Software (P3RLS). National Cancer Institute (NCI), National Institutes of Health (NIH), Department of Health and Human Services (HHS), Final Report Prepared by Synectics for Management Decisions, Inc., January 2020. Available online: https://surveillance.cancer.gov/reports/TO-P1-PPRLS-Landscape-Analysis.pdf (accessed on 3 July 2025).
  12. Evaluating the Performance of Privacy Preserving Record Linkage Systems (PPRLS). Evaluation Performed by Information Management Services (IMS) for Leidos Biomedical Research (LBR) Under the Agreement 20Q035TO01, Issued as a Subcontract Under Contract HHSN2612015000031, Task Order No. HHSN26100038 Issued by the National Cancer Institute (NCI), National Institutes of Health (NIH), Department of Health and Human Services (HHS). March 2023. Available online: https://surveillance.cancer.gov/reports/TO-P2-PPRLS-Evaluation-Report.pdf (accessed on 3 July 2025).
  13. Tachinardi, U.; Grannis, S.J.; Michael, S.G.; Misquitta, L.; Dahlin, J.; Sheikh, U.; Kho, A.; Phua, J.; Rogovin, S.S.; Amor, B.; et al. Privacy-preserving record linkage across disparate institutions and datasets to enable a learning health system: The national COVID cohort collaborative (N3C) experience. Learn. Health Syst. 2024, 8, e10404. [Google Scholar] [CrossRef] [PubMed]
  14. N3C Consortium. N3C Privacy-Preserving Record Linkage and Linked Data Governance; Zenodo: Geneva, Switzerland, 2021. [Google Scholar] [CrossRef]
  15. Petersen, S.; Lieberthal, R.; Miller, K.; Vakil, N. Privacy Preserving Record Linkage (PPRL) Strategy and Recommendations; PPRL Linkage Strategies Report; The MITRE Corporation: Mclean, VA, USA, 2023. Available online: https://www.nia.nih.gov/sites/default/files/2023-08/pprl-linkage-strategies-preliminary-report.pdf (accessed on 23 May 2025).
  16. Eckrote, M.J.; Nielson, C.M.; Lu, M.; Alexander, T.; Gupta, R.S.; Low, K.W.; Zhang, Z.; Eliazar, A.; Klesh, R.; Kress, A.; et al. Linking clinical trial participants to their U.S. real-world data through tokenization: A practical guide. Contemp. Clin. Trials Commun. 2024, 41, 101354. [Google Scholar] [CrossRef] [PubMed]
  17. Datavant. Overview: Tokenization Technology for Structured Data. January 2024. Available online: https://assets-global.website-files.com/655ba3a14f5a76dc96d65e09/65a8755ffd1a65fe7b1e5a53_LEPS_Whitepaper_Datavant%20Connect%20Overview%20-%20Tokenization%20Structured%20Data_Jan24.pdf (accessed on 3 July 2025).
  18. Supplemental Nutrition Assistance Program: Requirement for Interstate Data Matching to Prevent Duplicate Issuances. Federal Register. Available online: https://www.federalregister.gov/documents/2022/10/03/2022-21011/supplemental-nutrition-assistance-program-requirement-for-interstate-data-matching-to-prevent (accessed on 6 May 2025).
  19. NIST. NIST Privacy Framework 1.1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2025; NIST CSWP 40 ipd. [CrossRef]
  20. Pascoe, C.; Quinn, S.; Scarfone, K. The NIST Cybersecurity Framework (CSF) 2.0; NIST: Gaithersburg, MD, USA, 2024. Available online: https://www.nist.gov/publications/nist-cybersecurity-framework-csf-20 (accessed on 29 April 2025).
  21. ISO/IEC 27001:2022; Information Security, Cybersecurity and Privacy Protection—Information Security Management Systems—Requirements. ISO: Geneva, Switzerland, 2022. Available online: https://www.iso.org/standard/27001 (accessed on 29 April 2025).
  22. ISO/IEC 27701:2019; Security Techniques—Extension to ISO/IEC 27001 and ISO/IEC 27002 for Privacy Information Management—Requirements and Guidelines. ISO: Geneva, Switzerland, 2019. Available online: https://www.iso.org/standard/71670.html (accessed on 29 April 2025).
  23. An Act to Amend Chapter 35 of Title 44, United States Code, to Provide for Reform to Federal Information Security. U.S. Government Publishing Office. 18 December 2014. Available online: https://www.govinfo.gov/app/details/PLAW-113publ283 (accessed on 29 April 2025).
  24. SOC 2®—SOC for Service Organizations: Trust Services Criteria. Available online: https://www.aicpa-cima.com/topic/audit-assurance/audit-and-assurance-greater-than-soc-2 (accessed on 29 April 2025).
  25. HITRUST Framework for Cybersecurity and Compliance Success. Available online: https://hitrustalliance.net/hitrust-framework (accessed on 29 April 2025).
  26. Office for Civil Rights (OCR). Guidance Regarding Methods for De-Identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. HHS.gov. Available online: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html (accessed on 21 March 2023).
  27. Suver, C.; Harper, J.; Loomba, J.; Saltz, M.; Solway, J.; Anzalone, A.J.; Walters, K.; Pfaff, E.; Walden, A.; McMurry, J.; et al. The N3C governance ecosystem: A model socio-technical partnership for the future of collaborative analytics at scale. J. Clin. Transl. Sci. 2023, 7, e252. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Example tokenizer design. Tokenization employs current-generation cryptographic techniques such as Secure Hash Algorithm 2 (SHA-2) and Advanced Encryption Standard (AES). However, it is meant to be reversible; i.e., a mapping table between the tokens and the original PII allows data controllers to recover the original PII if needed.
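The reversible tokenization idea in the Figure 1 caption can be illustrated with a minimal Python sketch. It derives a token from normalized PII using a keyed SHA-256 hash (HMAC, a SHA-2 construction) and keeps a token-to-PII mapping table that only the data controller would hold. The `Tokenizer` class, the `normalize` helper, the choice of fields, and the key handling are hypothetical names and assumptions made for illustration; they are not the design shown in the figure. The AES layer mentioned in the caption is omitted here because it would require a third-party library.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase the value and drop accents, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


class Tokenizer:
    """Hypothetical tokenizer: keyed SHA-256 tokens plus a local mapping table."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        # token -> original PII; assumed to stay with the data controller
        self._mapping = {}

    def tokenize(self, first_name: str, last_name: str, dob: str) -> str:
        # Normalizing first lets formatting variants map to the same token.
        message = "|".join(normalize(p) for p in (first_name, last_name, dob))
        token = hmac.new(self._key, message.encode("utf-8"), hashlib.sha256).hexdigest()
        # Keep the first PII seen for this token so the mapping is stable.
        self._mapping.setdefault(token, (first_name, last_name, dob))
        return token

    def recover(self, token: str):
        """Look up the original PII for a token (data controller only)."""
        return self._mapping[token]
```

In this sketch, two parties sharing the same secret key would produce identical tokens for the same person, which is what makes linkage on tokens possible, while a party without the key or the mapping table could not readily recover the PII from a token alone.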
Figure 2. PPRL data flow architecture to create linked data.
Figure 3. PPRL data flow architecture for de-duplication.
Figure 4. PPRL data flow diagram for N3C PPRL System based on our architecture.
Table 1. Roles and responsibilities of different parties in a PPRL system.
Columns: Party; Roles and Responsibilities

Party: Data Controllers
Role: Custodians of data subjects’ information.
Responsibility: Have the legal obligation to protect the privacy rights of data subjects whose data they control. Must enact physical, administrative, and technical controls to protect data. Must ensure that any data processors who access and use data they control are able to meet the same privacy and security standards that the controllers themselves follow.

Party: Tokenizer
Role: Provides a technical solution for the creation of non-identifiable tokens to represent data subjects and facilitate linkages between datasets.
Responsibility: Provides a secure solution for the creation of data subject tokens. This solution often involves deploying tokenization software on-site that does not require connection to the internet, cloud computing, or any external system to operate, which requires the solution to be modular and containerizable.

Party: Honest Broker
Role: Facilitates data linkages in a privacy-preserving manner. This can include performing de-identification and data encryption, or it can be limited to acting as an escrow service for encryption keys and patient token mappings.
Responsibility: Is a neutral third party that provides protections for data subject privacy within a data use and sharing system. Other responsibilities may be defined in an Honest Broker Agreement with contractual controls in place that specify the broker's role in the system and what the broker can and cannot do with the data.

Party: Expert Determiner
Role: Evaluates and provides evidence (i.e., a report) showing that what is being planned with the data does not violate the privacy rights of data subjects.
Responsibility: Provides an unbiased evaluation of the privacy risks to data subjects from the anticipated usage and data linkages and, if there are risks, provides recommendations for mitigation strategies to address those risks. The expert determiner may evaluate data subject identifiability before and after linkage, privacy and security controls on a system, end data user motives and capacity for privacy violations (including data access and use limitations), and likelihood of data breach. The determination may recommend that additional transformations be applied to the data or that additional controls be put in place.

Party: Privacy Officer
Role: Knowledgeable expert in privacy-related matters who can help identify and address privacy-related issues when handling PII.
Responsibility: Provides guidance on privacy-related matters on projects where PII is being handled to ensure compliance with regulations and policies. Assists with system design to ensure privacy principles are followed throughout the data lifecycle.

Party: Data Platform
Role: Provides the hosting environment where individual unlinked datasets and linked datasets are stored and may also provide a computational platform for data users to access, process, and analyze said data.
Responsibility: Implements the necessary privacy and security controls to protect the data hosted on the platform and provides analytic tools to end users for data processing and analysis. Platforms may be the enablers of PPRL, as the actual linkage process may take place on the platform using analytic tools provided within the platform. Platforms may also be the best place to implement technical privacy and security controls, such as limits on downloading data, access controls, and logging and monitoring of data use.

Party: Data Governance
Role: Provides the rules within which the system operates and oversight to ensure rules are followed.
Responsibility: Develops the rules that govern the data system, including the privacy and security controls that must be followed, access control procedures, roles and responsibilities of different parties in the system, contractual terms and agreements, data retention and destruction policies, quality rules, data and metadata standards, and regular monitoring and auditing. Oversees these rules to ensure that they are followed.

Party: Data Users
Role: End users of the data product.
Responsibility: Process and analyze the data according to the requirements and limitations set out in the data access and use agreements they sign.

Share and Cite

MDPI and ACS Style

Nie, Z.; Tyndall, B.; Brannock, D.; Gentles, E.; Parish, E.; Banger, A. A Framework for the Design of Privacy-Preserving Record Linkage Systems. J. Cybersecur. Priv. 2025, 5, 44. https://doi.org/10.3390/jcp5030044
