Review

Cryptographic Foundations of Pseudonymisation for Personal Data Protection

by
Konstantinos Limniotis
Hellenic Data Protection Authority, Kifissias 1-3, 11523 Athens, Greece
Cryptography 2026, 10(2), 18; https://doi.org/10.3390/cryptography10020018
Submission received: 30 January 2026 / Revised: 5 March 2026 / Accepted: 8 March 2026 / Published: 11 March 2026
(This article belongs to the Collection Survey of Cryptographic Topics)

Abstract

Pseudonymisation constitutes an essential technical and organisational measure for implementing personal data-protection safeguards. Its main goal is to hide the identities of individuals, thus reducing data-protection and privacy risks by facilitating the fulfilment of several principles such as data minimisation and security. However, selecting and deploying appropriate pseudonymisation mechanisms in a risk-based approach, tailored to the specific data-processing context, remains a non-trivial task. This survey paper presents how cryptography can be put at the service of pseudonymisation, placing emphasis not only on traditional approaches but also on advanced cryptographic techniques that have been proposed to address special pseudonymisation challenges. To this end, we systematically classify existing approaches according to a taxonomy that captures key design dimensions relevant to specific data-protection challenges. Finally, since the notion of pseudonymisation adopted in this work is grounded in European data-protection law, we also discuss recent legal developments, in particular the CJEU’s latest judgment, which refined the interpretation of pseudonymous data.

1. Introduction

In an era of rapidly increasing data-driven applications, including decision-making machine learning (ML) systems, the protection of personal information and privacy has become of utmost importance and poses a major challenge from a technical perspective (see, e.g., [1]). To this end, pseudonymisation is a well-known privacy-engineering technique that is very promising in terms of ensuring that appropriate safeguards are in place [2]. In fact, pseudonymisation does not only constitute an enabler for providing privacy guarantees, but may also be a prerequisite for ensuring the lawfulness of a personal data process [3,4]. Indeed, pseudonymisation is explicitly mentioned several times within the General Data Protection Regulation (GDPR) [5], which is the main legal instrument in Europe for personal data protection. It is also present in a number of regulatory texts related to the so-called Common European Data Spaces (see, e.g., the European Health Data Space Regulation [6], the Data Governance Act [7] and the Data Act [8]), with the aim of facilitating trustworthy and secure environments while making more data available for access and reuse.
Although traditional pseudonymisation methods often rely on basic techniques, such as masking individuals’ identifiers or replacing them with random values, these approaches either have significant practical limitations or cannot be considered sufficient for every possible use case (see, e.g., [9]). This in turn necessitates a more rigorous approach to implementing pseudonymisation. To this end, cryptographic approaches with well-known security properties pave the way for implementing robust pseudonymisation schemes. Indeed, cryptography provides a rich set of primitives for building privacy-enhancing mechanisms, with pseudonymisation being a prominent such mechanism [10]. In this context, techniques such as keyed hashing, as well as symmetric and asymmetric encryption, are widely used as building blocks for pseudonymisation, whilst more advanced cryptographic techniques, including Secure Multiparty Computation and Oblivious Pseudorandom Functions, can help address more complex challenges in the context of pseudonymisation. Interestingly enough, however, whilst there exist several rigorous cryptographic techniques that are capable of ensuring the fulfilment of specific properties implied by legal requirements, there is still a lack of widespread implementation.
This paper aims to provide a survey of how cryptography can be used to implement effective and secure pseudonymisation solutions. We examine the possible uses of cryptography in this context, starting from classical approaches and also covering advanced cryptographic techniques. In the process, common pitfalls in existing methods are discussed. We also present a taxonomy of such cryptography-based approaches, mainly outlining their underlying properties. Our ultimate goal is to bridge the gap between regulatory concepts and technical perspectives, offering a framework that helps identify how specific cryptographic techniques may facilitate alignment with data-protection and privacy legal requirements. To this end, a new taxonomy of pseudonymisation techniques is provided. It should be stressed that, unlike other works (e.g., the relevant reports in [3,4,9]), this new taxonomy allows the classification of pseudonymisation techniques at an architectural level, accounting for structural properties such as knowledge separation and control distribution; thus, it provides a framework that enables systematic risk analysis and governance assessment beyond implementation details, thereby combining high-level architectural analysis—based on the relevant legal provisions—with practical guidance for implementation.
The paper is organized as follows: Section 2 sets out the basic background for the notion of pseudonymisation, describing both technical and legal aspects, and clarifies the contribution of the present work with respect to other surveys. Section 3 introduces the taxonomy used throughout the paper to classify cryptography-based pseudonymisation techniques. The main part of the paper lies in Section 4 and Section 5, which survey pseudonymisation techniques based on symmetric and asymmetric cryptographic primitives, respectively, in relation to the aforementioned taxonomy. Section 6 offers an overall summary and discussion of the various approaches; emphasis is placed on implementation and on the importance of secure key management, while the necessary trust assumptions for the data controller are also discussed for each technique. Section 7 examines a recent judgment of the Court of Justice of the European Union that establishes a revised basis for assessing whether data should be regarded as pseudonymous or anonymous; an initial analysis of how this may affect the interpretation of cryptographic pseudonymisation techniques is provided. Finally, concluding remarks are given in Section 8.

2. Background

Although this survey focuses on cryptographic mechanisms for pseudonymisation and therefore addresses readers with a background in cryptography, pseudonymisation under the GDPR constitutes both a legal and a technical concept whose interpretation cannot be inferred from cryptographic definitions alone. We therefore briefly clarify the regulatory definition and its engineering implications before analysing concrete cryptographic mechanisms.

2.1. Privacy and Personal Data Protection

The right to privacy is recognized as a fundamental human right by several key international instruments, including the Universal Declaration of Human Rights, the International Covenant on Civil and Political Rights, the Charter of Fundamental Rights of the European Union (EU), and other global treaties. While privacy and personal data protection are closely connected, the EU Charter treats them as distinct fundamental rights [11]. Specifically, the Charter affirms the right to privacy as follows:
Everyone has the right to respect for his or her private and family life, home and communications.
In relation to personal data protection, the Charter further provides:
Everyone has the right to the protection of personal data concerning him or her. Such data must be processed fairly for specified purposes and on the basis of the consent of the person concerned or some other legitimate basis laid down by law. Everyone has the right of access to data which has been collected concerning him or her, and the right to have it rectified (…).
The General Data Protection Regulation (GDPR) is the primary legal framework governing personal data protection in Europe [5]. Although the GDPR is an EU regulation, its territorial scope extends beyond the European Union, applying to any organization—regardless of location—that processes the personal data of individuals residing in the EU. Under the GDPR, personal data is defined as any information relating to an identified or identifiable natural person (referred to as the data subject). An identifiable individual is one who can be recognized, either directly or indirectly, particularly by reference to identifiers such as a name, identification number, location data, online identifiers, or attributes specific to the individual’s physical, physiological, genetic, mental, economic, cultural, or social identity. Given this definition, the scope of what qualifies as personal data is very broad. In this regard, it differs from the—also widely used—notion of Personally Identifiable Information (PII), mainly defined under various U.S. privacy laws (e.g., HIPAA), since the latter refers to information that directly identifies an individual. For instance, even device-related identifiers, such as an IP address, are considered personal data (see, e.g., [12]).
The GDPR lays down a set of core principles that must be respected whenever personal data are processed and imposes specific obligations on those processing personal data. These principles include the following (amongst others):
  • Purpose limitation: Personal data should be collected for specified, explicit, and legitimate purposes and not further processed in a way that is incompatible with those purposes.
  • Data minimisation: Personal data must be adequate, relevant, and limited to what is necessary in relation to the purposes for which it is processed.
  • Integrity and confidentiality: Personal data must be processed in a manner that ensures appropriate security, including protection against unauthorised or unlawful processing and against accidental loss, destruction, or damage, using suitable technical or organisational measures.
The GDPR, through its principles, clarifies that anonymous data refers to information that no longer permits the identification of a specific individual. In such cases, where identification is definitively precluded, the data fall outside the scope of the Regulation and are not subject to its provisions. Nevertheless, the GDPR emphasizes that determining whether data are truly anonymous requires a contextual assessment: specifically, one must consider all means that are reasonably likely to be used to identify the individual, whether directly or indirectly. This sets a high threshold for achieving true anonymity. For further analysis on the challenges of anonymity in practice, see, for example, ref. [13].

2.2. The Notion of Pseudonymisation: Technical and Legal Perspectives

From an engineering point of view, pseudonymisation is traditionally understood as the process of replacing original identifiers with pseudonyms, which in turn are considered identifiers of a subject other than one of the subject’s real names [14]. In fact, a pseudonym is a way to identify an individual in a particular context as an alternative to their real name, which in turn remains hidden [15]; by these means, pseudonymisation relates to de-identification (since the real identity, determined by the original direct identifier(s), is hidden).
Depending on the ultimate goal of the pseudonymisation, as well as on the corresponding risks that need to be addressed, there are several pseudonymisation policies that could be adopted [4]. More precisely:
  • Deterministic pseudonymisation. In this case, the same pseudonym is always assigned to the same individual, whether within the same database (if more than one entry corresponds to this individual) or across different databases. This is also known as consistent pseudonymisation.
  • Randomized pseudonymisation. Here, a different pseudonym is assigned to the same user each time, even within the same database (if more than one entry corresponds to this individual) or across different databases. This enables the provision of unlinkability [14]—i.e., from the perspective of an attacker who aims to gain more information than they are authorised to have, any two pseudonymous entries corresponding to the same individual are no more and no less related than they are with respect to the attacker’s a priori knowledge.
The scope of the overall processing will in fact determine whether consistent or unlinkable pseudonyms are used. For example, a medical research centre collecting various pseudonymous data for a research project may need to have consistent pseudonyms in order to conduct its research (i.e., to associate different entry values corresponding to the same individual, yet without having knowledge of the individuals’ identities). On the other hand, different databases for different purposes typically necessitate a randomised pseudonymisation policy, because in such a case the unlinkability property is essential to fulfil data-protection requirements such as, e.g., purpose limitation and data minimisation.
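The two policies can be sketched as follows; this is an illustrative, standard-library-only sketch in which the key handling, identifier values, and mapping-table layout are hypothetical simplifications rather than a production design:

```python
import hashlib
import hmac
import secrets

def deterministic_pseudonym(key: bytes, identifier: str) -> str:
    """Consistent policy: the same identifier always maps to the same pseudonym."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:16]

def randomized_pseudonym(table: dict, identifier: str) -> str:
    """Unlinkable policy: a fresh random pseudonym is drawn for every record.

    The mapping table is the 'additional information' in the GDPR sense and
    must be stored separately under appropriate safeguards if re-identification
    is ever needed.
    """
    pseudo = secrets.token_hex(8)
    table.setdefault(identifier, []).append(pseudo)
    return pseudo

key = secrets.token_bytes(32)
# Deterministic: repeated records for the same person share one pseudonym.
assert deterministic_pseudonym(key, "patient-42") == deterministic_pseudonym(key, "patient-42")
# Randomized: two records for the same person get unlinkable pseudonyms.
table = {}
p1 = randomized_pseudonym(table, "patient-42")
p2 = randomized_pseudonym(table, "patient-42")
assert p1 != p2 and table["patient-42"] == [p1, p2]
```

Note that the deterministic variant needs no stored mapping (the key suffices to recompute pseudonyms), whereas the randomized variant is only reversible through the separately kept table.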
In [4], other classifications of pseudonymisation are also presented, in terms of the entity that performs the pseudonymisation. For example, pseudonymisation may be performed by the organisation that also holds the original data, e.g., for enhancing security or for sharing the data with a third party that does not need access to the original identifiers. Another option is to have a trusted third party perform the pseudonymisation, whilst in some settings pseudonyms are generated by the individuals themselves.
In any case, as explicitly stated in [16], the context in which pseudonymisation is intended to preclude attribution of data to specific individuals is a crucial aspect, which determines the so-called pseudonymisation domain. The pseudonymisation domain, which relies on a risk-based approach, may coincide with the set of foreseen legitimate recipients of the pseudonymised data, whilst it can also include persons who are not legitimate recipients but may attempt to gain access to the data anyway. In fact, pseudonymisation should ensure that actors within the pseudonymisation domain are not able to reverse it. Although it is not explicitly mentioned in [16], the definition of the pseudonymisation domain actually affects, among other things, the choice of the pseudonymisation policy.
From a legal perspective, pseudonymisation is defined in Article 4 GDPR as the processing of personal data in such a manner that they can no longer be attributed to a specific data subject without additional information, provided that this information is kept separately and protected by appropriate technical and organisational measures. The GDPR refers to pseudonymisation at several points as a means of enhancing data security and supporting data minimisation. However, it explicitly distinguishes pseudonymisation from anonymisation, even though pseudonymisation may reduce risks to individuals (see, e.g., [10,13]). Accordingly, pseudonymised data remain subject to the GDPR.
Recent legislative initiatives in the EU, such as the Data Governance Act (DGA), the Data Act (DA), and the European Health Data Space (EHDS) Regulation, promote data sharing while emphasizing the importance of privacy and the protection of personal data. Within this framework, pseudonymisation is explicitly mentioned as a tool that can help mitigate potential negative impacts on individuals, towards striking a proper balance between enabling data sharing and safeguarding fundamental rights.

2.3. Contribution of This Work

The purpose of this work is to study how cryptography can provide the means for various pseudonymisation techniques. Since each of these techniques has specific properties, the ultimate goal is to provide a survey of this field, offering a comparative study in terms of well-determined characteristics and, thus, facilitating proper implementation whenever pseudonymisation is to be performed.
Existing surveys on pseudonymisation primarily focus on specific application areas (e.g., IoT systems [17], medical research [18], network traffic [19], etc.) and/or on examining pseudonymisation in conjunction with other privacy-enhancing technologies, in terms of providing a catalogue of techniques (see, e.g., [20,21,22]). The present survey, apart from providing a more comprehensive list of cryptography-oriented pseudonymisation techniques than other relevant works, also offers a new cryptography-oriented taxonomy of pseudonymisation mechanisms, classifying schemes according to features that are important for legal data-protection requirements from both security and data-minimisation aspects. This taxonomy enables a systematic comparison of schemes, allowing conclusions to be drawn on which techniques fit well with specific application scenarios and data-protection challenges.
It should be pointed out that pseudonymisation does not necessarily rely on cryptography; for example, replacing identifiers with the output of a counter mechanism (e.g., with entries such as ‘1’, ‘2’, ‘3’, etc.) constitutes pseudonymisation without any cryptographic mechanism. In this paper, though, we focus explicitly on cryptography-oriented pseudonymisation.
Moreover, an important aspect is that pseudonymised data may not suffice to fully hide an individual’s identity if re-identification is feasible through the so-called quasi-identifiers—i.e., attributes that remain unaltered during pseudonymisation. For example, in a patient database, pseudonymisation may replace each user’s social security number with a pseudonym and remove all other directly identifying information (such as the person’s name or surname). However, other attributes such as date of birth, gender, and postal code—serving as quasi-identifiers—may still enable third parties to re-identify individuals [23]. Clearly, such a pseudonymisation cannot be considered a strong one. Nonetheless, this work does not address re-identification risks arising from quasi-identifiers (which could be addressed by other means such as, e.g., attribute generalisation); instead, it focuses on the protection of direct identifiers through their substitution with cryptographically derived pseudonyms.

3. Taxonomy of Cryptographic Techniques for Pseudonymisation

Pseudonymisation can be supported by a variety of cryptographic primitives, each offering different trade-offs in terms of reversibility, linkability, and other privacy-oriented properties. More precisely, towards presenting a proper taxonomy of cryptographic techniques applicable to pseudonymisation, grouped by their operational model and security properties, we introduce the following notions:
  • Reversible vs. Irreversible: A pseudonymisation is considered reversible if the entity that determines the purpose and the means of the pseudonymisation (hereinafter, the pseudonymisation entity) can directly reverse it—i.e., given the pseudonym and any auxiliary information used during pseudonymisation, mapping the pseudonym back to the original identifier is straightforward. On the other hand, irreversible pseudonymisation refers to a one-way process for which no efficient method exists to reconstruct the original identifier from the pseudonym. It should be noted, though, that irreversibility does not preclude verifiability—that is, the ability to efficiently determine whether a given pseudonym corresponds to a specific identifier (verifiability is, of course, also present in any reversible pseudonymisation).
Example 1.
In cases where a hospital needs to provide patients’ health data to a research group, it may at the same time need to re-identify patients, if needed, for a medical follow-up based on the researchers’ output. Here, reversible pseudonymisation is the proper option (note that the researchers shall not be in a position to re-identify individuals; hence, reversibility is in fact relevant only for the entity that performs the pseudonymisation).
On the other hand, irreversible pseudonymisation may occur when a mobile application derives a random device identifier (ID) by applying a one-way function to a unique device ID before transmitting usage data to the service provider, thereby ensuring that the provider never accesses or processes the user’s real identity or original device identifier. The application operates in this manner because it has been designed in such a way by the service provider (and, thus, from a legal perspective, the pseudonymisation entity is the service provider).
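The coexistence of irreversibility and verifiability can be illustrated with a keyed one-way function; the sketch below is a simplified illustration (the function names and identifier values are hypothetical, not a scheme prescribed by the surveyed literature):

```python
import hashlib
import hmac
import secrets

def derive_pseudonym(key: bytes, identifier: str) -> str:
    # One-way: no efficient method recovers the identifier from the pseudonym.
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()

def verify(key: bytes, identifier: str, pseudonym: str) -> bool:
    # Verifiability: the key holder recomputes the pseudonym and compares it
    # (in constant time) with the candidate, without ever inverting it.
    return hmac.compare_digest(derive_pseudonym(key, identifier), pseudonym)

key = secrets.token_bytes(32)
p = derive_pseudonym(key, "device-7f3a")
assert verify(key, "device-7f3a", p)        # a claimed match can be checked
assert not verify(key, "device-0000", p)    # a wrong identifier is rejected
```

Verification here requires both the key and a candidate identifier; without the key, neither inversion nor verification is feasible.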
  • Deterministic vs. Probabilistic: A pseudonymisation is considered deterministic if it ensures that the same pseudonym is generated for the same individual within the same context—for example, if a database has more than one entry corresponding to the same individual, then in each of them the individual’s direct identifier(s) is/are replaced by the same pseudonym. On the other hand, in a probabilistic pseudonymisation setting, different pseudonyms are always generated for the same individual; this ensures that, e.g., within the same database, unlinkability between different entries corresponding to the same individual is achieved.
    It should be pointed out that unlinkability is a broad concept that refers to the property that different pieces of information relating to the same individual cannot be connected [14]. As such, unlinkability also encompasses situations where correlations between different datasets (e.g., separate databases) also need to be prevented. While probabilistic pseudonymisation suffices to ensure this property, it is not the only approach; indeed, each dataset could also be pseudonymised deterministically, provided that distinct pseudonyms are used for the same individual across different datasets to prevent linking (although, in such a case, linkage within the same database remains feasible).
Example 2.
In cases where a university seeks to examine library usage per student for statistical purposes, a deterministic pseudonymisation approach is required to ensure that the same pseudonym is consistently generated for each student across different systems, without disclosing the actual student ID.
Conversely, a survey platform conducting multiple studies should assign a new pseudonym to each participant for every submission; therefore, probabilistic pseudonymisation is required to prevent linking responses across different studies.
  • Blind vs. Non-blind: A pseudonymisation process is considered as blind if the entity that performs pseudonymisation does not obtain access to the original identifiers in any stage of the procedure. We refer to non-blind pseudonymisation for any other case.
Example 3.
Any reversible pseudonymisation is inherently non-blind; conversely, blind pseudonymisation is inherently irreversible (although the pseudonymisation entity may consist of several modules: all modules jointly can recover the original identifiers if they collude, yet some modules may derive pseudonyms blindly). However, irreversible pseudonymisation is not necessarily blind; this is why we introduce this distinct class in the taxonomy. For example, a research team collecting network data for research on intrusion detection may irreversibly pseudonymise identifiers such as IP addresses; however, since it initially obtains access to these identifiers, the pseudonymisation is not blind.
An interesting case of blind pseudonymisation occurs with user-generated pseudonyms. For example, in an online survey where participants, following the platform’s prescribed procedure, create their own pseudonyms before submitting responses, the process constitutes blind pseudonymisation. In this setting, the platform, acting as the pseudonymisation entity, deliberately adopts a user-generated pseudonym approach to ensure that it never gains access to the original users’ identifiers.
  • Centralised vs. Distributed: A pseudonymisation is considered centralised if the relevant process is performed by a single entity—i.e., the so-called pseudonymisation entity. On the contrary, there are techniques that necessitate, towards deriving pseudonyms, the active collaboration of more than one entity; in such a case, we refer to distributed pseudonymisation.
Example 4.
A centralised pseudonymisation occurs, for example, when a healthcare provider pseudonymises the data before sharing it with researchers for, e.g., clinical research. On the other hand, if several healthcare providers need to provide their data to a national health analytics platform for a specific purpose defined by the platform, a distributed pseudonymisation model could be as follows: each healthcare provider first locally pseudonymises patient data before sending it to the platform; in turn, after receiving the data, the platform applies a second layer of pseudonymisation (to generate, e.g., consistent identifiers).
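A two-stage model of this kind can be sketched as nested keyed one-way transformations, with each key held by a different party. This is a hypothetical, simplified illustration (the keys are hard-coded placeholders, and consistent identifiers across providers would additionally require all providers to agree on the first stage):

```python
import hashlib
import hmac

def stage(key: bytes, value: bytes) -> bytes:
    # One keyed, one-way pseudonymisation layer.
    return hmac.new(key, value, hashlib.sha256).digest()

provider_key = b"held-only-by-the-healthcare-provider"
platform_key = b"held-only-by-the-analytics-platform"

# Stage 1 (provider side): local pseudonymisation before transmission.
local = stage(provider_key, b"national-patient-id-123")
# Stage 2 (platform side): a second layer applied upon receipt.
final = stage(platform_key, local).hex()

# The platform never sees the original identifier, and the provider cannot
# compute the final pseudonym on its own without the platform's key.
assert final == stage(platform_key, stage(provider_key, b"national-patient-id-123")).hex()
assert len(final) == 64  # SHA-256 output in hex
```

The privacy benefit of the distributed model is visible in the structure: reversing or even recomputing the final pseudonym requires material split between two parties.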
The rationale for selecting the above notions as the foundation of our taxonomy is grounded in the requirements and principles of the GDPR and is detailed below.
  • Irreversibility is closely linked to both security guarantees and the GDPR principle of data minimisation. In scenarios where data controllers, acting as pseudonymisation entities, do not require the ability to recover original identifiers, irreversible pseudonymisation represents an appropriate design choice. Conversely, in cases where the controller must be able to re-establish the association between pseudonyms and data subjects (for example, to enable the exercise of data subject rights), a reversible pseudonymisation technique is necessary.
  • Deterministic pseudonymisation is required in use cases where consistent pseudonyms are needed, that is, where the same data subject must be mapped to the same pseudonym across records or processing operations. When such consistency is not required, probabilistic pseudonymisation provides stronger protection by limiting linkability and reducing the risk of tracking data subjects across pseudonymised datasets, in line with the GDPR’s objective of mitigating re-identification risks.
  • Blind pseudonymisation is particularly relevant in settings where the data controller must never obtain access to the original identifiers, thereby enforcing strict separation of knowledge. This aligns with the GDPR’s emphasis on reducing unnecessary access to personal data and limiting the exposure of identifiers to entities that do not require them for their processing purposes. However, blind pseudonymisation typically involves increased implementation and operational complexity, as it relies on either interactive protocols or advanced cryptographic mechanisms to guarantee that identifiers remain hidden throughout the process.
  • The distinction between centralised and distributed approaches is significant from both a technical and an organisational standpoint. Distributed pseudonymisation limits the amount of personal data accessible to any single entity, thereby reinforcing the GDPR principle of data minimisation and reducing the risk of undue re-identification. Since no single party can perform all required operations or reconstruct full identifiers, distributed approaches inherently constrain data exposure. At the same time, they typically entail higher implementation and operational complexity, as they require coordination among multiple parties and the deployment of more sophisticated cryptographic or protocol mechanisms. Nevertheless, they mitigate risks associated with single points of failure and contribute to the principles of integrity and confidentiality.

4. Pseudonymisation Based on Symmetric Cryptography

In this section, we present pseudonymisation techniques that rely on symmetric cryptographic primitives.

4.1. Pseudonymisation Based on Symmetric Encryption

This is probably the best-known type of cryptography-oriented pseudonymisation. In this case, the pseudonyms are generated by encrypting the individuals’ identifiers with a symmetric cipher (a notable example being the AES cryptographic standard [24]); as long as the encryption key remains secret and protected, the derived pseudonyms (i.e., ciphertexts playing the role of pseudonyms) cannot be reversed by any party without the key. On the other hand, the entity that has access to this key can easily decrypt and, thus, reverse the pseudonymisation.
Hence, for an identifier id and a symmetric encryption function E_K, where K is the relevant secret key, the pseudonym pseudo_id is generated as follows [3]:
pseudo_id = E_K(id)
This process is reversible through the corresponding decryption function D_K with the same key K:
id = D_K(pseudo_id)
The above principles are illustrated in Figure 1; it should be noted that, in principle, the identifier id may consist of several of an individual’s attributes that jointly suffice to uniquely identify the individual in the specific context—for example, id may be the combination of a name, an email address, and a social security number.
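The reversible principle above can be shown with a self-contained toy sketch. Since Python’s standard library lacks a block cipher, the construction below (an HMAC-derived keystream XORed with the identifier) is only an illustrative stand-in for a real cipher such as AES; it demonstrates reversibility and the role of the IV, and is not a recommendation for deployment:

```python
import hashlib
import hmac

def _keystream(key: bytes, iv: bytes, length: int) -> bytes:
    # Expand (key, IV) into a pseudorandom keystream, one HMAC block at a time.
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, iv + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, iv: bytes, identifier: bytes) -> bytes:
    # pseudo_id = E_K(id): XOR the identifier with the keyed keystream.
    ks = _keystream(key, iv, len(identifier))
    return bytes(x ^ y for x, y in zip(identifier, ks))

decrypt = encrypt  # XORing with the same keystream inverts the operation

key, iv = b"demo-secret-key", b"fixed-iv"  # a fixed IV makes the scheme deterministic
pseudo = encrypt(key, iv, b"patient-42")
assert decrypt(key, iv, pseudo) == b"patient-42"   # reversible with the key
assert pseudo == encrypt(key, iv, b"patient-42")   # deterministic (same IV)
assert encrypt(key, b"other-iv", b"patient-42") != pseudo  # new IV, new pseudonym
```

The final two assertions preview the deterministic-versus-probabilistic behaviour discussed next: fixing the IV yields consistent pseudonyms, while varying it yields unlinkable ones.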
With respect to our taxonomy, this pseudonymisation has the following properties:
  • It is reversible, since knowledge of the secret key allows decryption and, thus, re-identification.
  • It is mainly deterministic, provided that the same key K is used and that the relevant Initialisation Vector (IV) is fixed (recall that the IV is a non-secret, random or pseudorandom value used in conjunction with the symmetric key to initialize the encryption process). However, if the IV changes within the same pseudonymisation application, the scheme becomes probabilistic.
  • It is a non-blind process, since the pseudonymisation entity has direct access to the original identifiers.
  • It shall be considered a centralised approach, since the pseudonyms are generated in one place by the pseudonymisation entity.
In such schemes, the secret key constitutes the crucial element allowing re-identification (and hence, according to the GDPR’s provisions, it can be considered the additional information required to identify the individual from the pseudonymous data). Therefore, if the key used for pseudonymisation is destroyed, the pseudonymisation becomes irreversible.
Symmetric encryption has been used for pseudonymisation in many cases, such as:
  • The pseudonymisation of clinical data for research applications is discussed in [25], in which patient identifiers are encrypted using a symmetric cryptographic algorithm to produce reversible pseudonyms. This approach allows an authorized entity to subsequently decrypt the pseudonym and re-establish the link between the research data and the corresponding patient, while ensuring that all other parties have access only to the pseudonymised identifiers.
  • An e-health architecture is described in [26], in which patient identifiers are replaced with pseudonyms generated via symmetric encryption and maintained independently from the corresponding clinical data. Patient re-identification for authorized primary health care purposes is feasible due to the symmetric keys, while the original identifiers remain inaccessible to researchers analysing the pseudonymised dataset.
  • A translational research scenario related to the pseudonymisation of clinical data for secondary use in a research database is discussed in [27]. This scenario requires a single, distinct, deterministic pseudonym per patient; to this end, the use of a block cipher is proposed to unambiguously transform the unique identifiers into pseudonyms.
  • The authors of [28] utilize the Advanced Encryption Standard for creating pseudonyms for the purpose of allowing health data exchange for secondary, cross-institutional clinical research.
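To illustrate the deterministic, reversible nature of encryption-based pseudonymisation, the following minimal Python sketch uses a toy Feistel cipher built from HMAC-SHA256. This is an illustrative construction only (production systems should use a standardised cipher such as AES); the key and identifier values are hypothetical.

```python
import hmac
import hashlib

def _round_fn(key: bytes, rnd: int, half: bytes) -> bytes:
    """Round function: an HMAC-SHA256-based PRF, truncated to the half-block size."""
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[: len(half)]

def encrypt(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Deterministic toy Feistel cipher; the block must have even length."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for rnd in range(rounds):
        left, right = right, bytes(a ^ b for a, b in zip(left, _round_fn(key, rnd, right)))
    return left + right

def decrypt(key: bytes, block: bytes, rounds: int = 4) -> bytes:
    """Inverse transformation: runs the rounds in reverse order."""
    half = len(block) // 2
    left, right = block[:half], block[half:]
    for rnd in reversed(range(rounds)):
        left, right = bytes(a ^ b for a, b in zip(right, _round_fn(key, rnd, left))), left
    return left + right

key = b"pseudonymisation-secret-key"
pseudonym = encrypt(key, b"patient-0042")
assert encrypt(key, b"patient-0042") == pseudonym    # deterministic: same key, same pseudonym
assert decrypt(key, pseudonym) == b"patient-0042"    # reversible only for the key holder
```

Destroying `key` makes the pseudonym irreversible, mirroring the property discussed above.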

Format-Preserving Encryption

Format-preserving encryption (FPE), a specific type of symmetric encryption, is particularly useful for structured data fields (e.g., credit card numbers, social security numbers), allowing the ciphertext to have the same format as the original data (see the FF1 algorithm in [29]). FPE constitutes a key technique for protecting data, particularly in legacy systems and established applications. By producing ciphertexts that preserve the format of the original data, FPE enables secure storage and processing without necessitating substantial modifications to existing infrastructures, thereby reducing both development effort and deployment costs.
If FPE is applied to derive pseudonyms from original identifiers, the resulting values, while preventing third parties from recovering the original identifiers, retain the same structural format. This property is particularly valuable for pseudonymisation, as it enables pseudonymous data to remain interoperable with existing validation rules, databases, and processing operations, thereby supporting data-protection objectives without disrupting established system semantics. As a symmetric cryptographic primitive, FPE as a pseudonymisation enabler shares the same properties as the symmetric ciphers discussed above.
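The balanced-Feistel idea underlying format-preserving designs can be sketched as follows for numeric strings of even length. This is a simplified illustration in the spirit of FF1, not the NIST-specified FF1 algorithm itself; the key and input values are hypothetical.

```python
import hmac
import hashlib

def _prf(key: bytes, rnd: int, value: int) -> int:
    """Round function: HMAC-SHA256 interpreted as a large integer."""
    digest = hmac.new(key, f"{rnd}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def fpe_encrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Map a numeric string of even length to another numeric string of the same length."""
    half = len(digits) // 2
    mod = 10 ** half
    left, right = int(digits[:half]), int(digits[half:])
    for rnd in range(rounds):
        left, right = right, (left + _prf(key, rnd, right)) % mod
    return f"{left:0{half}d}{right:0{half}d}"

def fpe_decrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Inverse transformation: runs the Feistel rounds in reverse."""
    half = len(digits) // 2
    mod = 10 ** half
    left, right = int(digits[:half]), int(digits[half:])
    for rnd in reversed(range(rounds)):
        left, right = (right - _prf(key, rnd, left)) % mod, left
    return f"{left:0{half}d}{right:0{half}d}"

key = b"fpe-demo-key"
card = "4111111111111111"
token = fpe_encrypt(key, card)
assert len(token) == len(card) and token.isdigit()   # format preserved
assert fpe_decrypt(key, token) == card               # reversible for the key holder
```

Because the pseudonym is again a 16-digit numeric string, it passes the same structural validation rules as the original value, which is precisely the interoperability benefit described above.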

4.2. Pseudonymisation Based on Hash Functions

Cryptographic hash functions have also been widely used to implement pseudonymisation schemes. Their main property, in contrast to symmetric encryption, is their inherent irreversibility: being one-way functions, they offer no direct way to compute the identifiers back from the derived pseudonyms.

4.2.1. Use of Unkeyed Hash Functions

A common pitfall, nicely presented in [30], is the assumption that classical unkeyed cryptographic hash functions (such as SHA-2 and SHA-3) allow for efficient, irreversible, deterministic pseudonymisation; in fact, this does not constitute robust pseudonymisation. More precisely, for an identifier id and a cryptographic hash function H, the derived pseudonym in such a case is given by the following:
pseudo_id = H(id)
At first sight, this is a deterministic irreversible pseudonymisation. However, the absence of any secret pseudonymisation information (such as, e.g., a secret key or a secret salt) renders this pseudonymisation vulnerable to re-identification attacks; indeed, an attacker having some information on the pool of possible identifiers can easily verify whether identifiers from this pool correspond to given pseudonyms, in a type of dictionary attack (see, e.g., [3,4]). More precisely, the hashes of well-known identifiers can be pre-computed by an adversary, in a form similar to the well-known rainbow-table attacks for password cracking (i.e., through constructing lookup tables), so as to allow the attacker to map pseudonyms to specific identifiers. Even if such a pre-computation is not feasible, if the identifiers have low entropy (i.e., their values are drawn from specific, well-structured input spaces), then an attacker may still be able to compute hashes for all candidate inputs (such low-entropy identifiers could be, e.g., email addresses, IP addresses, sequential customer numbers, etc.).
Therefore, the use of unkeyed hash functions for deriving pseudonyms shall not be considered a robust pseudonymisation approach; it could only be an acceptable option in some cases, depending on the context and the underlying risks (for example, in cases where the pool of possible original identifiers is neither known nor predictable); in principle, however, it shall be considered vulnerable to dictionary-type attacks, as described above.
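The dictionary attack described above can be demonstrated in a few lines of Python (the identifiers are hypothetical):

```python
import hashlib

def naive_pseudonym(identifier: str) -> str:
    """Unkeyed SHA-256 'pseudonym' -- deterministic, but with no secret element."""
    return hashlib.sha256(identifier.encode()).hexdigest()

# A data holder publishes unkeyed hashes of email addresses as "pseudonyms".
published = [naive_pseudonym("alice@example.com"), naive_pseudonym("bob@example.com")]

# An attacker who can enumerate the (low-entropy) identifier pool builds a lookup table...
candidate_pool = [f"{name}@example.com" for name in ("alice", "bob", "carol", "dave")]
lookup = {naive_pseudonym(c): c for c in candidate_pool}

# ...and re-identifies every published pseudonym.
recovered = [lookup.get(pseud) for pseud in published]
assert recovered == ["alice@example.com", "bob@example.com"]
```

With a secret key or salt in place, the attacker could no longer precompute the table, which is exactly the motivation for the keyed constructions discussed next.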

4.2.2. Use of Keyed Hash Functions

Regarding the use of hash functions as a pseudonymisation technique, things change significantly in the case of a keyed hash function (e.g., through message authentication codes (MACs)). By these means, pseudonyms are irreversible for anyone who does not have access to the secret key (i.e., dictionary attacks are not feasible), whilst even the entity that has access to the secret key cannot directly reverse the pseudonym; this entity can only verify whether a given identifier corresponds to a specific pseudonym or not (i.e., the pseudonym's verifiability).
In other words, for an identifier id and a cryptographic keyed hash function H_K, the derived pseudonym in such a case is given by the following [4]:
pseudo_id = H_K(id)
whilst there is no inverse function H_K^{-1}.
With respect to our taxonomy, this pseudonymisation has the following properties:
  • It is irreversible (i.e., a one-way transformation).
  • It is mainly deterministic, provided that the same key K is used in the same pseudonymisation process.
  • It is a non-blind process, since the pseudonymisation entity has direct access to the original identifiers.
  • It shall be considered as a centralised approach, since the pseudonyms are generated in one place by the pseudonymisation entity.
In such schemes, the secret key constitutes the crucial element allowing verifiability of the pseudonym (and not re-identification by itself, since the transformation is irreversible). If the key used for pseudonymisation is destroyed, even the verifiability property is lost.
Keyed hash functions have been used for pseudonymisation in many cases:
  • To derive statistics on the usage of smart TVs by customers [31].
  • To propose a generic approach for implementing pseudonymisation so as to generate stable pseudonyms that preserve linkability across records while preventing unauthorized re-identification [32].
  • For log pseudonymisation [33].
  • In the case of the OpenPseudonymiser tool, created by the University of Nottingham [34].
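A minimal HMAC-based sketch of keyed-hash pseudonymisation, illustrating both determinism and the key holder's verifiability property (the key and identifiers are hypothetical):

```python
import hmac
import hashlib

def pseudonymise(key: bytes, identifier: str) -> str:
    """Keyed hash (HMAC-SHA256): irreversible even for the key holder."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()

def verify(key: bytes, identifier: str, pseudonym: str) -> bool:
    """The key holder can only check whether an identifier matches a pseudonym."""
    return hmac.compare_digest(pseudonymise(key, identifier), pseudonym)

key = b"secret-pseudonymisation-key"
pseud = pseudonymise(key, "alice@example.com")

assert pseudonymise(key, "alice@example.com") == pseud   # deterministic under the same key
assert verify(key, "alice@example.com", pseud)           # verifiability for the key holder
assert not verify(key, "bob@example.com", pseud)
```

Without `key`, an attacker cannot mount the dictionary attack that defeats unkeyed hashing, since the table of candidate hashes cannot be computed.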

5. Pseudonymisation Based on Asymmetric Cryptography

Next, we present how asymmetric (i.e., public key) cryptography can be used to implement pseudonymisation. As will become evident, more advanced pseudonymisation techniques in fact rely on asymmetric cryptography due to their underlying mathematical properties.

5.1. Use of Classical Public Key Encryption

The main difference between symmetric and asymmetric cryptography is that, in the latter, the decryption key (which remains private, known only to its owner) is not the same as the encryption key (which in turn is public). Therefore, to each public encryption key e there corresponds a uniquely defined private decryption key d. In the context of pseudonymisation, this property yields the following technique for deriving a pseudonym pseudo_id from an identifier id based on an asymmetric encryption function E with a public key e:
pseudo_id = E_e(id)
whilst this process is reversible through the corresponding decryption function D with the private key d:
id = D_d(pseudo_id)
Although this pseudonymisation process seems to possess several similarities with the one relying on symmetric encryption, there is a main difference: asymmetric encryption is by default probabilistic (i.e., the encryption process always introduces some randomness); otherwise, the same message encrypted with the same public key would always produce the same ciphertext, yielding security issues, especially since the encryption key is public. Therefore, in principle, any such pseudonymisation technique is probabilistic (and not deterministic), which in turn enables unlinkability. This is illustrated in Figure 2 [3].
Apart from this, the aforementioned process is reversible, and it is also non-blind. However, there is an additional property with respect to the entities involved in the whole pseudonymisation process. More precisely, the pseudonymisation is reversible only for the entity that owns the corresponding private key d, whilst the pseudonymisation entity could be a different one (i.e., any entity having access to the corresponding public key). Hence, although a single entity performs the pseudonymisation, that entity may not have access to the secret information, which coincides with the private key.
Public key cryptography heavily relies on the need to ensure the validity of a user’s public key, which in turn typically rests with the use of digital certificates that are being issued—and digitally signed—by trusted third parties (Certification Authorities).
Classical asymmetric ciphers are not typically used as in Equations (5) and (6) for pseudonymisation, mainly because asymmetric cryptography is computationally costly compared to symmetric primitives, involving large keys and producing large outputs; however, advanced pseudonymisation techniques with additional properties do rely on asymmetric cryptography, as discussed next.
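For illustration, the following sketch implements textbook ElGamal over a small multiplicative group (toy parameters, far too small to be secure) and shows the probabilistic property: encrypting the same identifier twice yields different pseudonyms, both reversible only with the private key.

```python
import secrets

# Toy group: the order-q subgroup of squares in Z_p*, with p = 2q + 1 a safe prime.
q = 1019
p = 2 * q + 1      # 2039
g = 4              # generator of the order-q subgroup

x = secrets.randbelow(q - 1) + 1    # private decryption key d
X = pow(g, x, p)                    # public encryption key e

def encrypt(m: int) -> tuple[int, int]:
    r = secrets.randbelow(q - 1) + 1            # fresh randomness on every encryption
    return pow(g, r, p), (m * pow(X, r, p)) % p

def decrypt(ct: tuple[int, int]) -> int:
    b, c = ct
    return (c * pow(pow(b, x, p), p - 2, p)) % p   # c / b^x via Fermat inversion

identifier = pow(123, 2, p)     # embed the identifier as a square, i.e., a subgroup element
ct1, ct2 = encrypt(identifier), encrypt(identifier)
# ct1 and ct2 almost surely differ (fresh r), so the pseudonyms are unlinkable,
# yet both decrypt to the same identifier for the private-key holder:
assert decrypt(ct1) == decrypt(ct2) == identifier
```

Any party holding the public key X can act as the pseudonymisation entity, while only the holder of x can reverse the pseudonyms, matching the separation of roles described above.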

5.2. Identity-Based Encryption

Identity-Based Encryption (IBE) is a public key cryptographic paradigm in which a user's public key is derived directly from a unique identifier, such as an email address or username. A trusted authority, called the Private Key Generator (PKG), uses a master secret to generate the corresponding private keys, thus eliminating the need for traditional public key certificates [35]. However, this unique identifier could be any string, which means that it could also be a pseudonym (derived by any other means). Hence, IBE enables cryptographic protection of the data, so that only the user with a given identifier ('pseudonym') can decrypt them.
On the basis of the above, it becomes evident that IBE could be considered as a technological enabler to protect identities, allowing encryption of data for any identity even before the corresponding private key is generated, which is of importance in large-scale systems. Therefore, IBE does not implement pseudonymisation per se but supports pseudonymisation since identities used for encryption can be detached from real-world identities, allowing access control to data based on the pseudonyms.
In 2001, Boneh and Franklin introduced the first practical Identity-Based Encryption scheme (a full version of this work is available in [36]). However, such schemes come with the key escrow issue, since the users' private keys are generated and known by the PKG. This means that the PKG can, in principle, decrypt any ciphertext. As a result, the entity operating the PKG possesses a built-in re-identification capability, creating a single point of trust and a potential single point of failure, thus constituting a structural limitation in terms of separation of duties. An approach alleviating this issue is proposed in [37], which relies on IBE to implement a privacy-preserving solution for an e-health application, allowing patients to be monitored remotely and enabling direct medical advice. In this work, pseudonyms are created by an Identity Certification Authority, which receives the original identifiers and produces pseudonyms based on a hash function and a secret master key, whilst there is another entity deriving the private keys that never learns the original identifiers.
Focusing explicitly on pseudonymisation, solutions such as the above have the following properties:
  • Pseudonymisation could be either reversible or irreversible, depending on how the strings used as pseudonyms are generated (for instance, in the aforementioned example based on a hash function, we have an irreversible pseudonymisation).
  • The pseudonymisation is deterministic, since the same master key is used to derive pseudonyms.
  • The pseudonymisation is not blind, since the pseudonymisation entity obtains access to the original identifiers.
  • The pseudonymisation is centralised (one single entity suffices to derive the pseudonyms).

5.3. Polymorphic Encryption and Pseudonymisation

A notable example of advanced asymmetric cryptography ensuring several security and privacy properties, including pseudonymisation, is the so-called polymorphic encryption and pseudonymisation (PEP), first introduced in [38].
More precisely, PEP achieves the following:
  • With regard to encryption, personal data can be encrypted and stored at a central point in such a way that there is no need to fix a priori who can decrypt the data later; this can be decided later on, via a transformation of the ciphertext which allows it to be locally decrypted via locally different cryptographic keys. This transformation can be performed blindly, without the party performing it (called the transcryptor in [38]) being able to see the original plaintext.
  • With regard to pseudonymisation, PEP proceeds similarly; again, the role of the transcryptor is crucial, since it generates pseudonyms via cryptographic transformations in a "blind" way: the transcryptor "changes" the content of the ciphertext so that the original user's identifier in the corresponding plaintext is transformed into a meaningless irreversible pseudonym before the transcryptor "allows" the ciphertext to become locally decryptable for a legitimate user. Different recipients receive different pseudonyms for the same individual, thus promoting unlinkability.
Before proceeding to the mathematical description of PEP, we first provide a high-level overview of its main features. Let us assume that health data, held by health service providers (e.g., hospitals), are to be shared with researchers through an intermediary service (a scenario described in the original work on PEP [38]); each patient is associated with a unique identifier. First, the corresponding health data are stored encrypted at the intermediary service; this service has an important element called the transcryptor, which performs pseudonymisation but has no access to the original patients' identifiers. When a researcher R_1 requires a dataset for research purposes, the transcryptor re-encrypts the data so as to provide a ciphertext that can be decrypted only by R_1 (the only recipient of this ciphertext); at the same time, pseudonymisation is also performed on the encrypted data (through a so-called re-shuffling transformation), so as to ensure that a recipient-specific pseudonym is generated for each patient; that is, when R_1 decrypts, it obtains irreversible pseudonyms instead of any patient-identifying information. Since these pseudonyms are recipient-specific (in fact, they are domain-specific), if the same dataset is shared with another researcher R_2, then different pseudonyms are produced for each individual, thus establishing unlinkability. The transcryptor that performs pseudonymisation never learns the derived pseudonyms. This is also illustrated in Figure 3, which shows that the two recipients always obtain different pseudonyms for the same individual, so that linkability through pseudonyms is not possible.
The PEP scheme presented in [38] is based on the well-known probabilistic ElGamal public key encryption. Although the encryption is inherently randomized, a legitimate recipient deterministically obtains the same local pseudonym upon decryption for a given underlying identity. Consequently, the same individual is consistently represented by the same identifier within a recipient’s domain. At the same time, the probabilistic nature of the ciphertext prevents any third party from linking multiple encrypted representations of the same identity. Furthermore, the scheme enforces domain separation: distinct recipients decrypt the same underlying identity to different pseudonyms, thereby ensuring unlinkability across domains.
More precisely, PEP relies on the ElGamal public key encryption in a group G of prime order p (instantiated over elliptic curves; we use multiplicative notation throughout for uniformity). Initially, for the key generation, the server chooses x ∈ Z_p and computes X = g^x ∈ G, where g is a generator of G. By these means, x is the server's secret (private) key and X is the public key. ElGamal encrypts a message M under a public key X as:
EG(r, M, X) = (g^r, M · X^r)
where r ∈ F_p is a random value, thus rendering the cipher probabilistic.
The PEP scheme defines three fundamental operations on ElGamal ciphertexts:
1. Re-randomisation (RR): Adds new randomness s to an existing ciphertext without changing the underlying message.
2. Re-keying (RK): Changes the effective public key under which a ciphertext can be decrypted. Specifically, if the original ciphertext has been produced based on a public key X, re-keying with a factor k yields a ciphertext decryptable under the key k·x.
3. Re-shuffling (RS): Scales the ciphertext and the embedded message; it is used in pseudonym derivation.
These operations exploit the underlying homomorphic structure of ElGamal.
In the specific context of the PEP scheme in [38], each user A has an original identifier pid_A ∈ G, such as a numeric identity. The system, through a Registration Authority, first creates a polymorphic pseudonym by encrypting that identity under the master public key X:
ppid_A = EG(r, pid_A, X) = (g^r, pid_A · X^r)
This is a first layer of pseudonymisation; an attacker without access to the private key x cannot learn pid_A (even the Registration Authority does not know the master private key).
To tailor this pseudonym (ppid_A) for a particular recipient B, the scheme uses a two-factor transformation involving:
  • A key factor K_{A,B} controlling how the ciphertext's decryption key is adapted for the recipient. This factor stems from another entity called the Access Manager, which in turn never learns the aforementioned pseudonym (i.e., the ciphertext). In fact, K_{A,B} controls who can decrypt.
  • A pseudonym factor S_B, specific to the destination, determining the local pseudonym for that domain. Again, this comes from the Access Manager.
These are secret scalars in F_p, and the recipient's effective public key becomes:
X_B = X^{K_{A,B}}
Although the subsequent transformations are performed by the transcryptor, the latter does not learn the scalar factor K_{A,B}: the Access Manager computes a re-key token, derived from K_{A,B}, that is an element of the group, so the transcryptor learns only the group element g^{r·K_{A,B}} and not the exponent K_{A,B}.
The transformation performed by the transcryptor consists of re-keying and re-shuffling, i.e., applying re-keying with K_{A,B} and re-shuffling with S_B to the encrypted pid (i.e., to ppid_A):
epid_{A,B} = RS(RK(ppid_A, K_{A,B}), S_B) = (g^{r·K_{A,B}·S_B}, pid_A^{K_{A,B}·S_B} · X^{r·K_{A,B}·S_B})
Due to the properties of ElGamal, this yields, after decrypting with the private key corresponding to the new public key X_B (known only to the recipient B):
pid_{A,B} = pid_A^{K_{A,B}·S_B}
The above is a domain-specific pseudonym for the user A at recipient B. By these means, this construction ensures the derivation of a pseudonym tied to both the original identifier and the intended recipient.
The above pseudonymisation process has the following properties:
  • It is mathematically reversible, due to the fact that the Registration Authority is able to associate each pid with the corresponding ppid (unless, of course, it destroys any such associations). Moreover, there always exists a corresponding private key allowing for decryption (i.e., the master private key), despite the fact that the system is designed so that this private key is not known to any party and not used. As described in [38], this key is securely stored by a trusted Key Server, in secure hardware. From a cryptographic point of view, however, this yields a reversible pseudonymisation.
  • It is a probabilistic process: for the same individual, different pseudonyms are always created for different recipients. Even for the same recipient B, who decrypts to the same pseudonym pid_{A,B} for a given individual A, the corresponding encrypted value of the pseudonym is always different, due to the cipher's probabilistic nature.
  • The process is non-blind in terms of the reversibility described above. However, the transcryptor, being the main part of the pseudonymisation entity that actually produces the final pseudonyms, does not obtain access to the original identifiers (thus operating in a blind way).
  • The process is centralised, since in fact there is one main central point performing pseudonymisation.
PEP has been used in research on Parkinson's disease [39].
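The core transformations can be sketched over a small multiplicative group as follows. This is a simplified, insecure toy (tiny parameters, all factors hypothetical), following the RK and RS operations of the ElGamal-based construction; re-keying is applied here via the inverse factor so that the result decrypts under the adapted key, and in this simplified sketch the local pseudonym obtained by a recipient is the identifier raised to that recipient's pseudonym factor.

```python
import secrets

q = 1019
p = 2 * q + 1      # toy safe prime; real PEP works over an elliptic-curve group
g = 4              # generator of the order-q subgroup

x = 5                         # master private key (held by the Key Server)
X = pow(g, x, p)              # master public key used by the Registration Authority

def eg_encrypt(m: int, pub: int) -> tuple[int, int]:
    r = secrets.randbelow(q - 1) + 1
    return pow(g, r, p), (m * pow(pub, r, p)) % p

def eg_decrypt(ct: tuple[int, int], priv: int) -> int:
    b, c = ct
    return (c * pow(pow(b, priv, p), p - 2, p)) % p

def rekey(ct: tuple[int, int], k: int) -> tuple[int, int]:
    """RK: make the ciphertext decryptable under the key k*x instead of x."""
    b, c = ct
    return pow(b, pow(k, -1, q), p), c

def reshuffle(ct: tuple[int, int], s: int) -> tuple[int, int]:
    """RS: raise the embedded message (the identifier) to the power s."""
    b, c = ct
    return pow(b, s, p), pow(c, s, p)

pid = pow(7, 2, p)                  # identifier embedded as a subgroup element
ppid = eg_encrypt(pid, X)           # polymorphic pseudonym, created once

k_b, s_b = 11, 13                   # key and pseudonym factors for recipient B
k_c, s_c = 17, 19                   # factors for another recipient C

pseudo_b = eg_decrypt(reshuffle(rekey(ppid, k_b), s_b), (k_b * x) % q)
pseudo_c = eg_decrypt(reshuffle(rekey(ppid, k_c), s_c), (k_c * x) % q)

assert pseudo_b == pow(pid, s_b, p)   # B always obtains the same local pseudonym
assert pseudo_b != pseudo_c           # different recipients: unlinkable pseudonyms
```

Despite the fresh randomness in every encryption, each recipient deterministically recovers its own domain-specific pseudonym, which is the property described in the taxonomy above.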

5.4. Oblivious Pseudorandom Functions (OPRFs)

An Oblivious Pseudorandom Function (OPRF) enables a client to obtain F_K(x) from a server holding the secret key K, such that (i) the client learns only the value F_K(x) and not the key K, and (ii) the server learns nothing about x. The OPRF guarantees that, as long as the server is honest, no efficient computation allows the client to distinguish F_K(x) from the output of a random function (see Figure 4). To achieve these purposes, OPRFs heavily rely on asymmetric cryptographic primitives.
OPRFs appear in various applications (for a helpful survey of the various techniques, see [40]). Their underlying properties, though, also fit well with pseudonymisation; more precisely, they let one party obtain a deterministic token (i.e., a pseudonym) of a value (i.e., an identifier) without revealing the value to the party that holds the key, and without learning the key used to generate it. Hence, as a pseudonymisation service, an OPRF achieves the following properties:
  • Irreversibility, in the sense that the pseudonymisation entity which holds the secret key does not even obtain access to the derived pseudonym (of course, if this entity somehow learns a pseudonym, the original identifier can be recovered; however, OPRFs are used exactly in cases where the pseudonymisation entity does not need to know the original identifier).
  • Deterministic, under the assumption that, for a specific pseudonymisation process, the same key is used by the pseudonymisation entity to derive pseudonyms.
  • Blind, since the pseudonymisation entity never learns the original identifier.
  • Since OPRFs are in fact two-party computation protocols (for a specific purpose), they should be considered not centralised but distributed: the pseudonymisation entity cannot perform any pseudonymisation without the active participation of the entity whose data are to be pseudonymised.
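A minimal sketch of a Diffie-Hellman-style OPRF (in the spirit of the 2HashDH construction) over a toy group follows. The blinding ensures the server sees neither the identifier nor the output, while the client consistently recovers the same pseudonym; the parameters are illustrative only and far too small to be secure.

```python
import hashlib
import secrets

# Toy group parameters (illustrative, not secure):
q = 1019           # prime order of the subgroup
p = 2 * q + 1      # 2039, a safe prime
g = 4              # generator of the order-q subgroup of Z_p*

def hash_to_group(identifier: str) -> int:
    """Embed an identifier as a non-trivial element of the order-q subgroup."""
    h = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(g, (h % (q - 1)) + 1, p)

def client_blind(identifier: str) -> tuple[int, int]:
    r = secrets.randbelow(q - 1) + 1
    return r, pow(hash_to_group(identifier), r, p)   # the server sees only H(id)^r

def server_evaluate(blinded: int, key: int) -> int:
    return pow(blinded, key, p)                      # the server never learns id

def client_unblind(evaluated: int, r: int) -> str:
    value = pow(evaluated, pow(r, -1, q), p)         # removes r: equals H(id)^key
    return hashlib.sha256(str(value).encode()).hexdigest()

server_key = 123   # held only by the server

def oprf_pseudonym(identifier: str) -> str:
    r, blinded = client_blind(identifier)
    return client_unblind(server_evaluate(blinded, server_key), r)

# Deterministic despite fresh blinding on every invocation:
assert oprf_pseudonym("alice@example.com") == oprf_pseudonym("alice@example.com")
```

Note how the roles align with the taxonomy above: the key holder (server) learns neither the identifier nor the derived pseudonym, and no pseudonym can be produced without both parties participating.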
Amongst the several applications of OPRFs for pseudonymisation, we focus on the idea presented in the so-called ScrambleDB [41], which achieves notable properties that cannot be easily obtained by other schemes. More precisely, this approach fits well with a scenario in which several data holders (i.e., entities that hold personal data) need to store pseudonymous data at a central secure point (an intermediary service), which in turn facilitates further secure and privacy-oriented data sharing with legitimate recipients. Hence, according to the analysis in [41]:
  • Pseudonymisation is performed by a central yet fully oblivious intermediary. This pseudonymisation entity learns neither the original identifiers nor the pseudonyms it generates, which are computed as values of a pseudorandom function (PRF). The entity additionally serves as a central storage point, acting as an intermediary for subsequent data-sharing operations.
  • When data are stored at rest at this central point, they are pseudonymised in a fully unlinkable and irreversible manner. Concretely, if the original dataset of a single data provider contains, besides identifiers, N attributes, the intermediary stores N separate tables, each consisting of two columns: one holding a pseudonym and the other a single attribute. The pseudonyms used across these N tables are pairwise distinct, even when the corresponding entries relate to the same individual. As a result, neither re-identification nor cross-attribute linkage is possible at the level of the pseudonymisation entity.
  • When specific subsets of data are requested by an authorised data recipient, linkage is established only at that point through a controlled, non-transitive join operation (which means that when data is joined for a specific query, the resulting pseudonymous identifiers are fresh and cannot be linked across different join results. In other words, even if two join operations involve overlapping data, the pseudonymisation linkage from one join does not automatically propagate to another, preventing unintended correlation across joins). Consequently, only the requesting data recipient obtains the correlated pseudonymous data, while different data recipients accessing different subsets are unable to correlate their respective views.
  • A notable property of this approach is that the join operation enables the derivation, for each individual, of a consistent yet irreversible pseudonym, thereby yielding a deterministic pseudonymisation outcome. This holds despite the fact that the intermediary stores multiple, mutually unlinkable pseudonyms corresponding to the same individual. Importantly, this derivation is carried out blindly by the pseudonymisation entity using an oblivious PRF, such that the entity does not learn the resulting deterministic pseudonyms.
The above are illustrated in Figure 5, which is based on the detailed description in [41]; the entities that generate pseudonyms rely on special-type OPRFs.
The main underlying block in the ScrambleDB approach is a 3-party oblivious and convertible PRF (first introduced in this work), which ensures that pseudonyms are generated and transformed in a way that no single party learns both the input and the final pseudonym. In this construction, a requester (data source) can have the PRF evaluated on an identifier towards a receiver (e.g., the intermediary service) without the converter learning the identifier or the pseudonym, and without the receiver learning the original input. Moreover, with respect to recipients (i.e., those that will receive pseudonymous data from the intermediation service), since each recipient’s evaluation key is independent and only convertible within the controlled protocol, there is no shared secret that would allow different recipients to link pseudonyms generated under distinct keys—they can only obtain a common representation through a purpose-specific join operation. This blindness prevents unintended correlation across recipients who do not share a key or join context.
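The convertible-PRF idea can be sketched as follows. This is a simplified, unblinded illustration of the key-conversion mechanism only (toy parameters, hypothetical keys and table names); the actual ScrambleDB protocol additionally blinds and re-randomises values so that the converter learns neither inputs nor outputs.

```python
import hashlib

# Toy group parameters (illustrative, not secure):
q = 1019
p = 2 * q + 1      # 2039, a safe prime
g = 4              # generator of the order-q subgroup

def hash_to_group(identifier: str) -> int:
    """Embed an identifier as a non-trivial element of the order-q subgroup."""
    h = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(g, (h % (q - 1)) + 1, p)

# The converter holds one secret PRF key per table/domain.
table_keys = {"diagnosis": 101, "postcode": 202}
recipient_key = 303

def table_pseudonym(identifier: str, table: str) -> int:
    """Per-table pseudonym: the PRF value H(id)^{k_table}."""
    return pow(hash_to_group(identifier), table_keys[table], p)

def convert(pseudonym: int, table: str, to_key: int) -> int:
    """Convert a table pseudonym to the recipient's key by raising to k_R / k_table."""
    delta = (to_key * pow(table_keys[table], -1, q)) % q
    return pow(pseudonym, delta, p)

p1 = table_pseudonym("alice@example.com", "diagnosis")
p2 = table_pseudonym("alice@example.com", "postcode")
assert p1 != p2                                   # at rest: per-table pseudonyms unlinkable
assert convert(p1, "diagnosis", recipient_key) == convert(p2, "postcode", recipient_key)
# after a join, the recipient sees one consistent pseudonym per individual
```

The conversion step is what makes the per-table pseudonyms "joinable on demand": linkage exists only under a recipient-specific key issued within the controlled protocol, matching the non-transitive join property described above.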
In summary, due to their intrinsic properties, OPRFs, when employed as pseudonymisation mechanisms, provide irreversible, deterministic, and blind pseudonymisation. Furthermore, since OPRFs are typically realised as two-party computation protocols, the pseudonymisation process is inherently distributed, with the individual (data subject) and the pseudonymisation entity jointly participating without either party gaining complete information.
OPRFs can also be used to derive so-called anonymous tokens, which, as discussed next, fit nicely with the notion of pseudonymisation. A notable example is the so-called Privacy Pass [42], a privacy-preserving token protocol designed to let users reduce the number of CAPTCHA/anti-abuse challenges they must solve when accessing web resources served by a Content Delivery Network (CDN) or edge providers. This protocol aims to ensure user anonymity (i.e., the derived tokens are irreversible), as well as to avoid linkage of token usage across sessions (so that the edge provider cannot track a user). The main building block of this protocol is a verifiable Oblivious Pseudorandom Function (VOPRF) between the user and the edge server (with the latter learning nothing about the user's input); the function is verifiable in the sense that the output token includes a proof that it was indeed computed with a single committed secret key by the said server.
Privacy Pass is generally considered a protocol for issuing anonymous tokens rather than a pseudonymisation mechanism. The tokens it produces do not necessarily originate from an identifier, since the PRF input provided by the user can be a freshly generated random value, and the protocol is explicitly designed to achieve unlinkability and anonymity. Nevertheless, as with many cryptographic protocols, it remains theoretically possible that, if the involved parties subsequently collude and disclose the exact information used during token issuance, a particular token generation could be verified and thereby associated with an identified individual. Consequently, because additional information exists that could enable re-identification (albeit protected and not normally accessible), such constructions appear to provide a (strong) irreversible, blind, distributed, and probabilistic form of pseudonymisation, rather than anonymisation.

OPRFs in Secure Multiparty Computation (MPC) Protocols

Secure Multiparty Computation (SMPC) allows a group of parties to collaboratively compute a function without revealing their private inputs and without relying on any trusted third party. SMPC relies on several advanced cryptographic techniques, including OPRFs (other options include homomorphic encryption and zero-knowledge proofs).
In principle, when OPRFs are used in SMPC protocols, their outputs again fit the notion of pseudonymisation. In fact, they may enable joint pseudonymisation among multiple parties without revealing the original identifiers. To illustrate this, we focus on a specific protocol for Private Set Intersection (PSI), described in [43] (we rely on the simplified description from [44]). Let us assume that an entity A has a private list a_1, a_2, …, a_n corresponding to identifiers of individuals, whilst another entity B has another private list b_1, b_2, …, b_m of the same type of individuals' identifiers (e.g., A and B could be hospitals, with the individuals' identifiers being their social security numbers). If A and B want to learn the common values of their private lists and nothing more (and, hence, they shall not simply exchange their lists), they can execute a PSI protocol as follows:
  • A chooses a secret key K for a pseudorandom function F.
  • A and B execute m Oblivious Pseudorandom Function evaluations, such that in the i-th execution, 1 ≤ i ≤ m, entity A inputs K and entity B inputs the i-th identifier b_i. Due to the properties of the OPRF, at the end of these evaluations B learns the outputs F_K(b_i), 1 ≤ i ≤ m, without learning K, whilst A learns nothing about the list b_1, b_2, …, b_m.
  • A computes the outputs F_K(a_j), 1 ≤ j ≤ n, and sends these values to B.
  • B computes the intersection between the lists F_K(b_i), 1 ≤ i ≤ m, and F_K(a_j), 1 ≤ j ≤ n, and sends to A all values b_i from its private list for which there exists 1 ≤ j ≤ n such that F_K(b_i) = F_K(a_j).
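The steps above can be sketched as follows, with the interactive OPRF evaluations of step 2 simulated by direct PRF (HMAC) calls: the values computed here are exactly what B would learn in the real protocol, although in this simulation the key is of course visible in the code. All identifiers are hypothetical.

```python
import hmac
import hashlib

def prf(key: bytes, value: str) -> str:
    """The pseudorandom function F_K, instantiated with HMAC-SHA256."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# Step 1: A chooses a secret PRF key K.
key = b"entity-A-secret-PRF-key"
list_a = ["111-22-3333", "222-33-4444", "333-44-5555"]

# Step 2 (simulated): B obtains F_K(b_i) for its own list via m OPRF
# evaluations, learning nothing about K.
list_b = ["222-33-4444", "444-55-6666", "333-44-5555"]
pseudonyms_b = {prf(key, b): b for b in list_b}

# Step 3: A sends the PRF values (pseudonyms) of its own list to B.
pseudonyms_a = {prf(key, a) for a in list_a}

# Step 4: B intersects the pseudonym sets and reports the matches.
intersection = sorted(b for pb, b in pseudonyms_b.items() if pb in pseudonyms_a)
assert intersection == ["222-33-4444", "333-44-5555"]
```

Only the pseudonyms (PRF values) cross the boundary between the two parties, so each party learns the intersection and nothing else about the other's list.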
Notably, this protocol complies with the GDPR definition of pseudonymisation, since the original identifiers are hidden by the OPRF values, which in turn can be considered as pseudonyms (whilst the secret key K has the role of the additional information that may allow re-identification, provided that the pool of original identifiers is known). This interpretation of such schemes as pseudonymisation schemes has also been discussed in [10].
Accordingly, such pseudonymisation schemes can be classified as follows:
  • Irreversible: although the OPRF values are mathematically reversible in principle, an OPRF inherently ensures that no single entity holds the information needed to reverse its output.
  • Deterministic, provided that the same key K is used (which is indeed the case for a specific pseudonymisation context).
  • Blind, since the underlying idea is to hide the original identifiers.
  • Distributed, since we have a two-party protocol.

5.5. Secret Sharing

Secret sharing refers to cryptographic techniques that allow sharing a secret value amongst n entities so that no single entity can recover the initial secret, but if any t entities, for 1 < t ≤ n, exchange their information, then the original secret value can be computed by them. Such schemes, first introduced by Shamir [45], are typically used to protect information such as secret cryptographic keys—and, thus, they may also facilitate pseudonymisation by enabling secure storage of the pseudonymisation secret (see, e.g., [10]). Indeed, the main idea is the following:
Share(t, n): x ↦ (x_1, x_2, …, x_n)
satisfying the following:
∀ I ⊆ {1, …, n}, |I| ≥ t: x = Σ_{i ∈ I} λ_i x_i
where the λ_i are the Lagrange interpolation coefficients, whilst any I ⊆ {1, …, n} with |I| < t reveals no information about x.
Moreover, it has been shown that such schemes can also support pseudonymisation directly, by considering the identifier to be pseudonymised as the original secret value in this setting. The identifier is then split into n parts, each stored by a different entity, whilst only the collaboration of any t of them (and no fewer) suffices to reconstruct the original identifier. This is illustrated in Figure 6 for the case n = 4 and t = 3, in an example based on Shamir's secret sharing scheme. More precisely, in this example the polynomial used is f(x) = 790 + 123x + 456x^2 (mod 2089), where the i-th share, i = 1, 2, 3, 4, is the value f(i); knowledge of any three pairs (x_i, f(x_i)) allows, through Lagrange interpolation, the derivation of f(0), which is the initial secret. Such approaches have been proposed in [46], which divides a vehicle's identity—used as a secret—into multiple sub-identities through a secret sharing scheme in order to protect it. In a similar vein, [47] describes how this technique can be applied to pseudonymise system log files by replacing original identifiers (e.g., IP addresses) with pseudonyms, thereby preventing user profiling. In other words, the user's identity is split into n shares, with each share functioning as a separate pseudonym.
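The numerical example above can be reproduced directly. The following sketch implements Shamir's scheme over GF(2089) with the same polynomial f(x) = 790 + 123x + 456x^2, using Fermat inversion for the Lagrange coefficients.

```python
P = 2089  # prime modulus used in the example from the text

def make_shares(secret, coeffs, n):
    # f(x) = secret + coeffs[0]*x + coeffs[1]*x^2 + ...  (mod P)
    def f(x):
        return (secret + sum(c * pow(x, i + 1, P)
                             for i, c in enumerate(coeffs))) % P
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 over GF(P); pow(den, P-2, P)
    # is the modular inverse by Fermat's little theorem.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(790, [123, 456], n=4)   # t = 3 (degree-2 polynomial)
print(shares)               # [(1, 1369), (2, 771), (3, 1085), (4, 222)]
print(reconstruct(shares[:3]))   # 790
print(reconstruct(shares[1:]))   # 790: any 3 of the 4 shares suffice
```

Any two shares alone, by contrast, are consistent with every possible secret and therefore reveal nothing about f(0).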
Considering secret sharing schemes as pseudonymisation mechanisms, they generally share the following properties:
  • They are reversible, since collaboration of a well-determined number of entities allows recovering the original identifier—and such collaboration is not forbidden in principle.
  • The pseudonymisation process is probabilistic due to the randomness of share generation.
  • They are in principle non-blind, in the sense that there is typically an original entity with access to the original identifier that splits it into n shares.
  • They are inherently distributed, since n entities in fact contribute to the whole process.

5.6. User-Generated Pseudonyms

A particular class of pseudonymisation techniques arises when data subjects are actively involved in the pseudonymisation process, such that pseudonyms are generated by the individuals themselves and the data controller is technically unable to associate a given pseudonym with a specific identity. This model is especially relevant in contexts where the data controller does not require knowledge of the data subject’s identity, as retaining such knowledge would be incompatible with the data minimisation principle.
It should be emphasised that user-generated pseudonyms do not correspond to the trivial case in which an individual merely selects a nickname or alias to access a service. Rather, the term denotes the execution of a (typically, asymmetric) cryptographic protocol between the individual and the data controller that enforces this separation by design (whilst, in the context of this protocol, the pseudonym is generated in the user’s environment). Although the pseudonym generation is performed by the user, the protocol itself is specified and governed by the data controller, thereby ensuring the fulfilment of the desired data-protection properties.
As discussed in [48], the design of a user-generated pseudonym approach should primarily satisfy the following requirements: (i) usability, (ii) the inability of any party other than the user herself to link a pseudonym to its owner, unless such linking is explicitly authorised by the user, (iii) the impossibility of correlating multiple pseudonyms as belonging to the same user, (iv) injectivity, meaning that the generation process must prevent duplicate pseudonyms, and (v) flexibility.
Several pseudonymisation approaches fall within this category and are based on asymmetric cryptography (e.g., [48,49]). It should be noted, though, that there also exist approaches that rely heavily on symmetric cryptographic primitives, such as the scheme presented in [50], which is based on Merkle trees. In this approach, each pseudonym is derived simultaneously from multiple user identifiers, each corresponding to a different domain or context. This design both conceals the original identifiers and enables the generation of unlinkable pseudonyms for the same user across different domains or contexts. Furthermore, the pseudonymisation scheme in [50] allows a user to efficiently prove ownership of a pseudonym within a specific context without revealing any information about her original identifiers, leveraging intrinsic properties of Merkle trees inspired by their use in post-quantum secure hash-based signature schemes.
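As an illustration of the general idea (not the exact construction of [50]), the following sketch derives a pseudonym as the Merkle root over hashes of hypothetical per-domain identifiers; ownership within one domain is then proven by disclosing that leaf together with its authentication path, without revealing the other identifiers.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_pseudonym(identifiers):
    # Leaves: hash of each domain-specific identifier
    # (a power-of-two number of identifiers is assumed here).
    level = [H(i.encode()) for i in identifiers]
    tree = [level]
    while len(level) > 1:
        level = [H(level[k] + level[k + 1]) for k in range(0, len(level), 2)]
        tree.append(level)
    return tree[-1][0], tree  # the root acts as the pseudonym

def ownership_proof(tree, index):
    # Authentication path for leaf `index`: one sibling hash per level.
    path = []
    for level in tree[:-1]:
        sibling = index ^ 1
        path.append((sibling < index, level[sibling]))
        index //= 2
    return path

def verify(root, leaf_id, path):
    node = H(leaf_id.encode())
    for sibling_is_left, sibling in path:
        node = H(sibling + node) if sibling_is_left else H(node + sibling)
    return node == root

ids = ["tax:123", "health:abc", "edu:xyz", "bank:789"]  # illustrative domains
root, tree = merkle_pseudonym(ids)
proof = ownership_proof(tree, 1)           # prove the "health" identifier
print(verify(root, "health:abc", proof))   # True
```

The proof discloses only hashes of the sibling subtrees, so the other domain identifiers remain hidden while the verifier is convinced that the claimed identifier contributed to the pseudonym.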

Anonymous Credentials

Anonymous credentials constitute foundational building blocks of contemporary privacy-preserving identity management systems [51]. Such systems enable users to demonstrate possession of attributes or credentials without disclosing any additional personal information about themselves. Moreover, credential presentations are unlinkable across multiple executions and cannot be associated with other identifying information pertaining to the user. Anonymous credentials were first envisaged by Chaum in [52], whilst a first realisation was presented in [53] (the so-called Camenisch–Lysyanskaya scheme). Since then, other schemes have been proposed with new features (a survey is given in [54]).
The basic functions employed in such systems can be summarised (in simplified form) as follows: Let U denote a set of users (characterised as provers) and let each user u ∈ U be associated with a set of attributes attr_u. An anonymous credential system typically consists of the following algorithms:
  • A function called Issue performed by each issuer, defined as follows:
    Issue(u, attr_u, isk) → cred_u
    through which the issuer (i.e., an entity that signs and issues credentials for provers) issues a credential for the user u, using the issuer's secret key isk. The attribute set attr_u contains the attributes that the prover wants certified. Typically, the credential is bound to a secret that only the user holds, since this function is in fact a joint, interactive issuance protocol between issuer and user; moreover, credential issuance is typically blind (or partially blind), preventing the issuer from linking the issued credential to future presentations.
  • A function called Prove executed by the user (prover) as follows:
    Prove(cred_u, attr_u, usk) → token
    which takes as inputs a credential cred_u, the set of attributes to be disclosed, and the prover's private (secret) key usk, and generates a proof token which, when verified with respect to the issuer's public key ipk, attests that the disclosed attribute information attr_u is correctly embedded in cred_u; the token suffices to convince a verifier of this fact.
  • The function Verify executed by the Verifier, defined as
    Verify(token, ipk) → {0, 1}
    which checks whether the proof token is valid with respect to the issuer's public key ipk.
The aim of such systems is to ensure that, for any two users u_1, u_2, the distributions of their respective tokens token_1 and token_2 are computationally indistinguishable to verifiers.
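Full anonymous credential schemes are beyond the scope of a short example, but the blind-issuance property mentioned above can be illustrated with Chaum-style blind RSA signatures, one of their historical building blocks: the issuer signs a blinded value and therefore cannot later link the unblinded signature to the issuance session. The parameters below are toy-sized and purely illustrative (real deployments use moduli of at least 2048 bits).

```python
import hashlib
import math
import random

# Toy RSA issuer key (illustration only).
p, q = 1000003, 1000033
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))   # issuer's secret key (isk)

def h(msg: str) -> int:
    return int.from_bytes(hashlib.sha256(msg.encode()).digest(), "big") % n

def blind(msg, r):
    # User: blinds H(msg) so the issuer never sees it.
    return h(msg) * pow(r, e, n) % n

def sign_blinded(blinded):
    # Issuer: signs without learning H(msg).
    return pow(blinded, d, n)

def unblind(blinded_sig, r):
    # User: removes the blinding factor, obtaining a plain signature on H(msg).
    return blinded_sig * pow(r, -1, n) % n

def verify(msg, sig):
    return pow(sig, e, n) == h(msg)

msg = "pseudonym-of-user-42"        # illustrative pseudonym value
r = random.randrange(2, n)
while math.gcd(r, n) != 1:          # blinding factor must be invertible mod n
    r = random.randrange(2, n)

sig = unblind(sign_blinded(blind(msg, r)), r)
print(verify(msg, sig))  # True
```

Because the issuer only ever sees h(msg)·r^e mod n for a random r, any later presentation of (msg, sig) is unlinkable to the issuance session, which is precisely the separation that anonymous credential systems generalise.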
Some of the most well-known anonymous credential systems are the following:
  • Idemix [55], which implements the Camenisch–Lysyanskaya paradigm; a user generates a presentation token by deriving a zero-knowledge proof from a credential bound to a user-held secret and a random value. The token attests to possession of a valid credential and, where required, to selected certified attributes, while remaining unlinkable across multiple showings for ordinary verifiers. However, the token is cryptographically derived from persistent secret information and issuer-certified data.
  • U-Prove [56], which is based on [57]; in this system, a presentation token is generated from a credential obtained via a blind issuance protocol and is cryptographically bound to a user-held secret. Unlike Idemix, though, tokens are typically designed to be intentionally linkable, so that repeated presentations of the same token yield a stable pseudonym for the user (prover). Verification confirms the validity of the certified attributes with respect to the issuer's public key, but does not by itself reveal the user's identity.
  • PRIMA [58], which decouples credential issuance by the Identity Provider from subsequent service authentication by enabling the user to mediate the use of issued credentials. Authentication tokens are generated locally by the user and disclose only the information strictly required by the relying service, without involving the Identity Provider at authentication time. As a result, the Identity Provider is prevented from directly observing service usage and its ability to track or profile users across services is significantly reduced.
In all three cases discussed above, as well as in other anonymous credential systems, authentication tokens which may function as pseudonyms are generated locally by the user and remain unlinkable and non-attributable for ordinary verifiers in the absence of auxiliary information. However, attribution to a specific user remains possible when additional information held by the user or the issuer becomes available, for example through disclosure of the user-held secret. In other words, unlinkability does not automatically imply unconditional lack of identifiability; as long as a trusted issuer (or another authorised party) retains the ability to re-associate a token with a specific person, the data remain personal data in the sense of the GDPR (it should be noted that users themselves may have the possibility to prove that a specific credential that has been issued corresponds to them). Accordingly, such systems significantly reduce correlation risks, whilst they also hide the users' original identities, but they do not eliminate the possibility of attribution where additional information is available. Hence, these systems, although commonly described as anonymous, in fact implement pseudonymisation as defined under the GDPR. The tokens derived from such systems must therefore be regarded as personal data, notwithstanding the typically low residual data-protection risk they entail. Moreover, anonymous credential tokens should be understood as user-generated pseudonyms embedded in cryptographic protocols, although they are generally intended for scenarios in which the individual is not expected to ever need to prove pseudonym ownership through their real-world identity.
As pseudonymisation techniques, they share the following properties:
  • They are irreversible, since the derived pseudonyms (tokens) do not allow directly computing the users' original identifiers. Of course, as also discussed above, they are verifiable, meaning that having the user's identifier and the secret information allows verifying whether a given token corresponds to this user.
  • They are inherently probabilistic, since tokens are generated using random values within the underlying cryptographic computations.
  • They are blind, as are all user-generated pseudonyms (i.e., the data controller does not learn the original identifiers).
  • They are decentralised, as users actively participate in the whole pseudonymisation process. However, typically, there is a central issuer—especially in anonymous credential systems.

6. Summary and Discussion

Based on the analysis in Section 4 and Section 5, we next present Table 1, which summarises the preceding discussion. This table focuses on the main classes of cryptographic techniques that could serve as means for pseudonymisation, presenting their properties according to the notions of the taxonomy presented in Section 3.
The classification summarised in Table 1 should be read in conjunction with the detailed analysis provided in the previous sections (which also presents the properties of specific sub-families of these general categories). Indeed, several cryptographic primitives admit multiple instantiations and usage models that may exhibit different properties with respect to the taxonomy dimensions. For example, Oblivious Pseudorandom Functions (OPRFs) may be used in several different ways. The assignments in Table 1 therefore reflect typical or representative uses, rather than an exhaustive classification of the underlying primitives.
A main observation is that the suitability of a pseudonymisation mechanism is highly dependent on the context. Indeed, no single cryptographic technique constitutes a one-size-fits-all solution. Consequently, pseudonymisation mechanisms should not be selected based solely on cryptographic strength, but rather on how well their properties align with the specific context, in terms of the relevant data-protection requirements. For example, the presence of intermediaries performing pseudonymisation to support data sharing with third parties—which is expected to be common in, e.g., European data spaces—plays an important role in the selection of appropriate techniques. In scenarios where personal data are shared through intermediaries, approaches that enforce blindness or separation of knowledge are likely to be preferable. Cryptographic techniques such as Oblivious Pseudorandom Functions or Secure Multiparty Computation can ensure that intermediaries are unable to learn or reconstruct original identifiers, thereby limiting unnecessary exposure of personal data and facilitating the fulfilment of data minimisation. Conversely, in environments where data processing remains under the control of a single entity and re-identification is required to fulfil legitimate purposes, centralised constructions may provide a proper and efficient solution.
Hence, Table 1 should not be interpreted as a rigid classification scheme. Instead, it provides a structured approach for helping in design choices, while allowing flexibility for specific implementations that deviate from classical assumptions. Data controllers are therefore encouraged to interpret this Table as a high-level guide that complements, rather than replaces, the aforementioned more fine-grained discussion of individual techniques.

6.1. Trust Assumptions for Data Controllers

In this subsection, we aim to further clarify—with respect to the aforementioned techniques—the relevant trust assumptions for the entity that is competent for the pseudonymisation (i.e., the data controller). To this end, we introduce three different trust models for data controllers (adopting the relevant terminology for similar security trust models):
  • Honest controller—the data controller is assumed to properly follow the specified pseudonymisation process, protect the cryptographic keys, re-identify only when strictly required and generally fulfil the purpose limitation and data minimisation principles.
  • Semi-honest controller—the data controller is assumed to properly follow the specified pseudonymisation process, but may attempt to infer or derive additional information on data subjects from legitimately accessible data (e.g., pseudonymised data, other auxiliary information etc.).
  • Adversarial controller—the controller is assumed to possibly deviate even from the intended pseudonymisation process, intentionally attempting to re-identify or single out individuals so as to violate data minimisation and/or purpose limitation. Such a controller may also collude with other parties.
Typically, as has become evident from the previous analysis, traditional cryptographic pseudonymisation techniques (symmetric encryption, keyed hashing, asymmetric encryption), as well as IBE, require a high level of trust in the controller, whilst advanced techniques (distributed/blinded) aim to reduce this need for trust. These assumptions are reflected in Table 2.
With respect to OPRFs, it shall be stressed that, in principle, some of them provide security even under an adversarial controller—this is why the relevant entry in Table 2 states both assumptions. This is not always the case though: for example, in ScrambleDB [41], security is established under the assumption of a semi-honest data controller.

6.2. Implementation Aspects of Pseudonymisation Approaches

The practical suitability of a pseudonymisation technique is not determined solely by its privacy guarantees, but also by its computational cost, its scalability characteristics, as well as by the relevant operational governance burden. These parameters are also important for data controllers; to this end, it shall be stressed that the general notion of cost is explicitly mentioned even in the GDPR as an important factor to be considered when a risk-based approach is adopted.
This section evaluates the above cryptographic pseudonymisation techniques from Table 1 in terms of their performance, key-management overhead, as well as their scalability (which in turn is contingent on whether the implementation is centralised or distributed).

6.2.1. Symmetric Encryption

Symmetric encryption schemes typically exhibit linear time complexity in relation to the length of the input (which, in the context of pseudonymisation, is relatively small). They are also much faster than public key cryptographic schemes, enabling high throughput. From a scalability perspective, each record can be pseudonymised independently, requiring only access to a secret key (and the Initialisation Vector (IV), when needed). Hence, the storage overhead is minimal, and is typically limited to IVs or possibly authentication tags, depending on the specific implementation.
Operational complexity is mainly attributed to the key management and governance. Hence, secure generation and storage (e.g., HSM/KMS-backed) become the main organisational issues. However, these requirements scale independently of dataset size.
Therefore, symmetric encryption is generally considered as highly suitable for large-scale centralised pseudonymisation.

6.2.2. Keyed Hash Function

Keyed hash constructions also run in linear time, being substantially more efficient than public key mechanisms—in fact, they are more efficient even than symmetric encryption algorithms. They additionally produce fixed-length outputs (independently of the size of the input) and require no IV or other random seed. As a result, they scale exceptionally well. As with symmetric encryption, key management and governance are very important; in particular, they shall not be underestimated on account of the mathematical irreversibility of the underlying hash function. Indeed, any breach of the secret key renders the keyed hash function equivalent to an unkeyed one, which in turn raises security issues if the input domain has low entropy (e.g., social security numbers, phone numbers etc.).
Therefore, keyed hashes are appropriate for centralised deterministic pseudonymisation policies at very large scale, since they offer minimal infrastructural overheads.
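A minimal sketch of such a deterministic keyed-hash policy using HMAC-SHA256 follows (the key name and identifier format are illustrative). It also demonstrates the low-entropy risk noted above: once the key leaks, the input domain can simply be enumerated to invert the mapping.

```python
import hmac
import hashlib

KEY = b"pepper-held-in-hsm"   # hypothetical secret pseudonymisation key

def pseudonymise(identifier: str) -> str:
    # Deterministic, fixed-length pseudonym regardless of input size.
    return hmac.new(KEY, identifier.encode(), hashlib.sha256).hexdigest()

pseudo = pseudonymise("SSN-123-45-6789")
print(len(pseudo))                                  # 64 hex characters, always
print(pseudonymise("SSN-123-45-6789") == pseudo)    # deterministic: True

# Danger if the key leaks and the input domain has low entropy:
# an attacker can simply enumerate the domain and invert the mapping.
leaked_key = KEY
candidates = [f"SSN-123-45-{i:04d}" for i in range(10000)]
recovered = [c for c in candidates
             if hmac.new(leaked_key, c.encode(),
                         hashlib.sha256).hexdigest() == pseudo]
print(recovered)  # ['SSN-123-45-6789']
```

The exhaustive-search step is exactly why key custody matters more, not less, for mathematically irreversible constructions.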

6.2.3. Asymmetric Encryption

Traditional public key encryption schemes (e.g., RSA or elliptic-curve-based constructions) exhibit substantially higher computational costs than the symmetric primitives, whilst the resulting outputs (i.e., pseudonyms) are also large compared to the inputs (i.e., the original identifiers). Hence, this introduces some limitations with respect to their suitability for high-volume record-level pseudonymisation.
Their primary advantage lies in architectural flexibility: public keys can be widely distributed, enabling separation of roles between encrypting and decrypting entities. This is particularly relevant in cross-organisational or federated contexts. Of course, there is additional overall complexity with respect to generation, management and storage of the keys: namely, the private keys shall be securely stored, whilst the public keys shall be generated in a trustworthy manner, allowing verification of their authenticity—e.g., through digital certificates issued by trusted entities.
Therefore, asymmetric encryption is appropriate where legal or governance constraints require strict key isolation across entities, but it comes with implementation complexity due to the need for a public key infrastructure.

6.2.4. Identity-Based Encryption (IBE)

IBE replaces conventional public key infrastructure with identity-derived public keys. While simplifying certificate management, though, it introduces a central Private Key Generator (PKG), which constitutes both a scalability issue and—possibly—a trust issue.
Moreover, pairing-based cryptographic operations typically have higher computational costs than classical public key systems. In addition, key escrow properties inherent in IBE may raise governance and accountability concerns.
As a result, IBE can be suitable in tightly controlled ecosystems requiring fine-grained identity binding, but it is not optimal for high-throughput pseudonymisation at large scale.

6.2.5. Polymorphic Encryption and Pseudonymisation (PEP)

PEP supports unlinkability across different recipients, but it operates at public key cost levels, whilst it typically generates larger ciphertexts (an inherent property of the ElGamal asymmetric algorithm). Hence, we have both computational overhead and storage expansion.
It shall be stressed that PEP inherently supports separation of duties, in the sense that the pseudonymising entity does not have re-identification capability. However, the initial private key shall be strongly protected (e.g., through HSM-backed custody), since it somehow constitutes a single point of failure.
In conclusion, PEP is advantageous in scenarios requiring unlinkability guarantees across processing nodes having simultaneously an internal separation of duties, but it also introduces—as an asymmetric encryption scheme—significant computational overhead.
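The re-randomisability that PEP builds upon can be sketched with textbook ElGamal over a toy multiplicative group modulo a small prime (illustrative parameters only; real deployments use elliptic-curve groups of cryptographic size): multiplying in a fresh encryption of 1 yields a fresh-looking ciphertext for the same plaintext, so intermediaries can transform ciphertexts without decrypting them.

```python
import random

# Toy group parameters (illustration only).
p, g = 7919, 7            # small prime modulus and base
x = 1234                  # private key
h = pow(g, x, p)          # public key

def encrypt(m):
    r = random.randrange(1, p - 1)
    return (pow(g, r, p), m * pow(h, r, p) % p)

def rerandomise(ct):
    # Multiply in a fresh encryption of 1: same plaintext,
    # fresh-looking ciphertext components.
    s = random.randrange(1, p - 1)
    c1, c2 = ct
    return (c1 * pow(g, s, p) % p, c2 * pow(h, s, p) % p)

def decrypt(ct):
    c1, c2 = ct
    # c1^(p-1-x) = c1^(-x) mod p by Fermat's little theorem.
    return c2 * pow(c1, p - 1 - x, p) % p

m = 42
ct = encrypt(m)
ct2 = rerandomise(ct)
print(decrypt(ct), decrypt(ct2))  # 42 42
```

Because re-randomisation requires only the public key, an entity without the private key can produce unlinkable-looking versions of the same pseudonymous record for different recipients, which is the core mechanism underlying polymorphic pseudonymisation.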

6.2.6. Oblivious Pseudorandom Functions (OPRFs)

OPRF-based pseudonymisation techniques constitute in fact interactive protocols between parties. Their computational complexity typically ranges from moderate to high, whilst performance is additionally constrained by network latency. As stated above, the main advantage is trust minimisation: no single party learns all information (i.e., both initial identifiers and final pseudonyms). However, scalability is considered limited, due to the interactions. The main OPRF key, held by one party (the server), is also essential and needs to be protected.
Therefore, OPRFs are particularly suitable for inter-organisational record linkage under strong privacy constraints, where minimisation of trust assumptions outweighs possible performance considerations.

6.2.7. Secret Sharing

Secret-sharing schemes distribute re-identification capability across multiple entities. Share generation is linear, whilst reconstruction requires polynomial interpolation. While computationally moderate, their primary scalability limitation lies in storage multiplication and coordination overhead (i.e., the storage is multiplied by the number of shares).
The above properties actually yield that secret sharing is best understood as a resilience mechanism for key custody (and, thus, could be used for the protection of secret keys for any other pseudonymisation technique) rather than a primary pseudonymisation method.

6.2.8. User-Generated Pseudonyms

This is a quite different approach compared to all the others, since such techniques typically impose negligible computational cost on data controllers and scale trivially from an infrastructural perspective. Typically, not even a cryptographic key is managed by the controller. Of course, the whole procedure, even if part of it is implemented by the data subjects, remains under the data controller's governance. On the other hand, the computational cost is transferred to the users, who need devices performing specific cryptographic tasks (typically demanding in terms of computational cost).

6.3. How to Protect the Secret Keys for Pseudonymisation

As becomes evident from the above analysis, the security of pseudonymisation mechanisms cannot be assessed solely at the level of cryptographic primitives; the secure management of the pseudonymisation keys shall be also taken into account. As it is also implied by the GDPR’s provisions on pseudonymisation, the pseudonymised dataset and the key management infrastructure should be logically and, where feasible, physically segregated. To this end, network segmentation, distinct trust domains, or separation of administrative roles typically reduce any risks of illegitimate re-identification (i.e., implementing an architectural separation).
Other approaches could include hardware-based key protection. By these means, master keys should be generated and stored within Hardware Security Modules (HSMs) or equivalent secure enclaves, so as to ensure non-exportability; in such cases, the relevant cryptographic operations for pseudonymisation involving root keys should take place within this protected hardware. Another option could be the adoption of hierarchical key management in terms of the so-called envelope encryption—i.e., a root key (HSM-protected) encrypts data encryption keys (DEKs), which in turn are used for pseudonymisation; where feasible, per-batch or per-record derived keys may further reduce risks. Under such a structure, compromise of a single DEK does not expose the entire re-identification corpus.
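A minimal sketch of the per-batch key derivation described above follows, with an HMAC-based one-step expansion standing in for a full HKDF; the root key would in practice stay inside the HSM, and all names are illustrative.

```python
import hmac
import hashlib

ROOT_KEY = b"root-key-inside-hsm"   # hypothetical; never exported in practice

def derive_dek(batch_id: str) -> bytes:
    # HKDF-like one-step expansion: one independent DEK per batch, so a
    # compromised DEK exposes only that batch's pseudonyms.
    return hmac.new(ROOT_KEY, b"dek|" + batch_id.encode(),
                    hashlib.sha256).digest()

def pseudonymise(identifier: str, batch_id: str) -> str:
    return hmac.new(derive_dek(batch_id), identifier.encode(),
                    hashlib.sha256).hexdigest()

p1 = pseudonymise("SSN-111", "batch-2026-01")
p2 = pseudonymise("SSN-111", "batch-2026-02")
print(p1 != p2)   # True: per-batch keys break cross-batch linkage
print(p1 == pseudonymise("SSN-111", "batch-2026-01"))  # True: deterministic per batch
```

Note that per-batch derivation also bounds the blast radius of a key compromise, at the cost of losing cross-batch linkability where that linkability is actually needed.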
In any case, strict access control for the key material is essential, in conjunction with auditability so as to ensure that all accesses to keys, as well as methods for re-identification processing, are being securely logged.
Last but not least, it is important to address the single point of failure risk that is inherent in any centralised key management architecture: secret sharing techniques (as discussed above) could be considered, so as to ensure that no single actor or compromised system can independently enable re-identification.
In conclusion, the robustness of cryptography-based pseudonymisation depends not only on the strength of the underlying algorithms but on the security of the key management, in terms of the generation, storage and use of the pseudonymisation key. Hence, a pseudonymisation scheme can be considered sufficient (on a risk-based approach) only if its key management is engineered through appropriate technical and organisational measures, so as to withstand key compromise and misuse.

7. (Legally) Revisiting the Notion of Pseudonymisation

In this paper, we rely on the notion of pseudonymisation as it is commonly articulated in European data-protection practice and also reflected in the European Data Protection Board’s guidelines (2025 version, subject to public consultation [16]). However, this interpretation has been refined by a very recent judgment of the Court of Justice of the European Union (CJEU) of September 2025 (EDPS v Single Resolution Board (SRB)—Case C-413/23 [59])—which revisits the boundary between pseudonymised and anonymous data under EU data-protection law.
In that decision, the CJEU held that pseudonymised data cannot be presumed to qualify as personal data, irrespective of context. The Court instead emphasised a recipient- and context-dependent understanding of identifiability, under which the personal character of data must be assessed with reference to a particular entity, focusing on the specific entity processing the data and on the means reasonably likely to be used by that entity to re-identify individuals. Where a recipient of pseudonymised data does not have access to, and cannot reasonably be expected to obtain, additional information enabling re-identification, the data may be regarded as anonymous in relation to that recipient, even though it may remain pseudonymous and personal for the entity retaining such information. This reasoning introduces a different approach with respect to what has been commonly reflected in prior practice.
Importantly, the Court emphasized that this assessment cannot be made in abstract terms. Identifiability must be evaluated on an ad-hoc basis, taking into account the state of technology, the availability of auxiliary datasets, legal constraints, and practical incentives or capabilities for re-identification. This is the reason that the same dataset may simultaneously be considered personal data for one party (e.g., the original controller having the re-identification key) and non-personal data for another party lacking any realistic re-identification pathway.
This new development brings the legal analysis somewhat closer to the risk-based approach commonly applied from an engineering perspective. Indeed, under this approach, it becomes highly plausible that certain constructions—such as anonymous credential systems—may qualify as anonymous rather than pseudonymous under European data-protection law. More broadly, pseudonymisation mechanisms that are irreversible, blind, and decentralised are likely to generate data that can be regarded as anonymous for parties who lack any technical or legal means of accessing re-identification information. However, this conclusion cannot be elevated to a general rule: as explicitly emphasised by the CJEU, such determinations must always be made on an ad hoc, context-specific basis. Therefore, qualifying the cryptographic techniques given in Table 1 as anonymisation or pseudonymisation methods is in fact not possible for the general case, because such an assessment is highly dependent on the context. For example, even classical cryptographic techniques like symmetric encryption schemes or keyed hash functions suffice, from a mathematical point of view, to fully eliminate re-identification risks for any party that does not have access to the secret key, provided that the key cannot be computed or revealed by other means. However, we cannot exclude the possibility that identification of such pseudonymous data remains an option, even without knowledge of the secret key, through other auxiliary information/data depending on the context—and, thus, in principle, there is no privacy-enhancing technology that ensures anonymisation in every case (see the recent report from the stakeholder event on anonymisation and pseudonymisation organised by the EDPB [60]).
Last but not least, it should be pointed out that the judgment does not eliminate data-protection obligations. Indeed, the CJEU underlined that certain duties—such as transparency at the time of collection—remain unaltered in the perspective of the original data controller and are not retroactively altered by a later recipient-specific anonymity assessment.

8. Conclusions

This paper provides a survey of cryptography-based pseudonymisation techniques, examining their alignment with specific requirements stemming from European data-protection legislation. It becomes evident that pseudonymisation can be implemented in many different ways, each with different features and, thus, different application areas.
Recent regulatory developments at the EU level also illustrate a growing focus on clarifying pseudonymisation in practice. In January 2025, the European Data Protection Board (EDPB) published Guidelines 01/2025 [16] on pseudonymisation (for public consultation) to help organisations align technical measures with legal expectations under the GDPR, including interpreting pseudonymisation as a risk-mitigating safeguard. More recently, the Digital Omnibus package proposed by the European Commission (as part of a broader Digital Omnibus Regulation) proposes to amend the GDPR by empowering the Commission to adopt implementing acts specifying criteria for when data resulting from pseudonymisation no longer qualify as personal data for certain entities, in light of the recent CJEU decision [59].
In any case, regardless of regulatory developments, pseudonymisation remains an active research area and is expected to remain highly relevant even if the regulatory landscape changes. Future research on cryptography-based pseudonymisation is expected to focus on the integration of post-quantum cryptographic primitives with advanced pseudonymisation techniques (for the specific case of post-quantum oblivious functions, a recent survey is given in [61]). Indeed, pseudonymisation techniques that rely on asymmetric cryptography, such as elliptic curve algorithms or the ElGamal cipher, do not provide post-quantum security. This limitation applies, for example, to the polymorphic encryption described in [38], to most OPRF constructions such as those proposed in ScrambleDB [41], and to many anonymous credential systems. Hence, we need to consider how post-quantum secure alternatives can be realised, for instance by relying on lattice-based cryptography. However, this transition is not just a matter of substituting cryptographic primitives; a thorough re-design is needed, accompanied by rigorous security proofs. Another important direction concerns the proper application of pseudonymisation in machine learning systems, especially with respect to training data and model outputs, where memorisation, inference attacks, and data leakage may undermine traditional pseudonymisation guarantees.
These developments further highlight the need to treat pseudonymisation as an adaptive, context-dependent process, assessed on a case-by-case basis according to the specific context of processing and the roles of the parties involved. Nevertheless, a precise characterisation of pseudonymisation techniques in terms of their respective capabilities remains crucial, and this paper aims to contribute toward this goal.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The author would like to thank the anonymous reviewers for their valuable comments that helped to greatly improve the paper.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AES: Advanced Encryption Standard
CJEU: Court of Justice of the European Union
EDPB: European Data Protection Board
EU: European Union
FPE: Format-Preserving Encryption
GDPR: General Data Protection Regulation
HIPAA: Health Insurance Portability and Accountability Act
IBE: Identity-Based Encryption
ID: Identifier
IoT: Internet of Things
IP: Internet Protocol
IV: Initialisation Vector
MAC: Message Authentication Code
OPRF: Oblivious Pseudorandom Function
PEP: Polymorphic Encryption and Pseudonymisation
PKG: Private Key Generator
PSI: Private Set Intersection
SMPC: Secure Multiparty Computation
VOPRF: Verifiable Oblivious Pseudorandom Function

References

  1. Dhirani, L.L.; Mukhtiar, N.; Chowdhry, B.S.; Newe, T. Ethical Dilemmas and Privacy Issues in Emerging Technologies: A Review. Sensors 2023, 23, 1151. [Google Scholar] [CrossRef]
  2. Hintze, M.; Emam, K.E. Comparing the Benefits of Pseudonymisation and Anonymisation under the GDPR. J. Data Prot. Priv. 2018, 2, 145–158. [Google Scholar] [CrossRef]
  3. European Union Agency for Cybersecurity (ENISA). Recommendations on Shaping Technology According to GDPR Provisions: An Overview on Data Pseudonymisation; Technical Report; European Union Agency for Cybersecurity (ENISA): Athens, Greece, 2019. [Google Scholar] [CrossRef]
  4. European Union Agency for Cybersecurity (ENISA). Pseudonymisation Techniques and Best Practices; Technical Report, ENISA Report; European Union Agency for Cybersecurity (ENISA): Athens, Greece, 2019. [Google Scholar] [CrossRef]
  5. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L 119, 4 May 2016, pp. 1–88. Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (accessed on 5 March 2026).
  6. European Parliament and Council of the European Union. Regulation (EU) 2025/327 of the European Parliament and of the Council of 11 February 2025 on the European Health Data Space and Amending Directive 2011/24/EU and Regulation (EU) 2024/2847. Official Journal of the European Union, L 327, 5 March 2025. Available online: https://eur-lex.europa.eu/eli/reg/2025/327/oj/eng (accessed on 5 March 2026).
  7. European Parliament and Council of the European Union. Regulation (EU) 2022/868 of the European Parliament and of the Council of 30 May 2022 on European Data Governance and Amending Regulation (EU) 2018/1724 (Data Governance Act). Official Journal of the European Union, L 152, 30 May 2022. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32022R0868 (accessed on 5 March 2026).
  8. European Parliament and Council of the European Union. Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023 on Harmonised Rules on Fair Access to and Use of Data and Amending Regulation (EU) 2017/2394 and Directive (EU) 2020/1828 (Data Act). Official Journal of the European Union, L 327, 22 December 2023. Available online: https://eur-lex.europa.eu/eli/reg/2023/2854/oj/eng (accessed on 5 March 2026).
  9. European Union Agency for Cybersecurity (ENISA). Data Pseudonymisation: Advanced Techniques and Use Cases; Technical Report TP-01-21-024-EN-N; European Union Agency for Cybersecurity (ENISA): Athens, Greece, 2021. [Google Scholar] [CrossRef]
  10. Limniotis, K. Cryptography as the Means to Protect Fundamental Human Rights. Cryptography 2021, 5, 34. [Google Scholar] [CrossRef]
  11. Charter of Fundamental Rights of the European Union. Official Journal of the European Communities, C 364/01, 18 December 2000. Proclaimed by the European Parliament, the Council and the Commission. Available online: https://www.europarl.europa.eu/charter/pdf/text_en.pdf (accessed on 5 March 2026).
  12. Chatzistefanou, V.; Limniotis, K. On the (Non-)anonymity of Anonymous Social Networks. In Communications in Computer and Information Science, E-Democracy—Privacy-Preserving, Secure, Intelligent E-Government Services—7th International Conference, E-Democracy 2017, Athens, Greece, 14–15 December 2017; Katsikas, S.K., Zorkadis, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 792, pp. 153–168. [Google Scholar]
  13. Finck, M.; Pallas, F. They who must not be identified—Distinguishing personal from non-personal data under the GDPR. Int. Data Priv. Law 2020, 10, 11–36. [Google Scholar] [CrossRef]
  14. Pfitzmann, A.; Hansen, M. Anonymity, Unlinkability, Unobservability, Pseudonymity, and Identity Management—A Consolidated Proposal for Terminology (Version v0.28). Technical Terminology Draft, TU Dresden, Faculty of Computer Science/ULD Kiel, 2006. Available online: https://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.28.pdf (accessed on 5 March 2026).
  15. Jajodia, S.; Samarati, P.; Yung, M. (Eds.) Encyclopedia of Cryptography, Security and Privacy; Springer Nature: Berlin/Heidelberg, Germany, 2025. [Google Scholar] [CrossRef]
  16. European Data Protection Board. Guidelines 01/2025 on Pseudonymisation; Technical Report; European Data Protection Board: Brussels, Belgium, 2025. [Google Scholar]
  17. Akil, M.; Islami, L.; Fischer-Hübner, S.; Martucci, L.A.; Zuccato, A. Privacy-Preserving Identifiers for IoT: A Systematic Literature Review. IEEE Access 2020, 8, 168470–168485. [Google Scholar] [CrossRef]
  18. Abu Attieh, H.; Müller, A.; Wirth, F.N.; Prasser, F. Pseudonymization tools for medical research: A systematic review. BMC Med. Inform. Decis. Mak. 2025, 25, 128. [Google Scholar] [CrossRef] [PubMed]
  19. Dijkhuizen, N.V.; Ham, J.V.D. A Survey of Network Traffic Anonymisation Techniques and Implementations. ACM Comput. Surv. 2018, 51, 52:1–52:27. [Google Scholar] [CrossRef]
  20. Asad, M.; Shaukat, S.; Javanmardi, E.; Nakazato, J.; Tsukada, M. A Comprehensive Survey on Privacy-Preserving Techniques in Federated Recommendation Systems. Appl. Sci. 2023, 13, 6201. [Google Scholar] [CrossRef]
  21. Auñón, J.M.; Hurtado-Ramírez, D.; Porras-Díaz, L.; Irigoyen-Peña, B.; Rahmian, S.; Al-Khazraji, Y.; Soler-Garrido, J.; Kotsev, A. Evaluation and utilisation of privacy enhancing technologies—A data spaces perspective. Data Brief 2024, 55, 110560. [Google Scholar] [CrossRef] [PubMed]
  22. Garrido, G.M.; Sedlmeir, J.; Uludağ, Ö.; Alaoui, I.S.; Luckow, A.; Matthes, F. Revealing the landscape of privacy-enhancing technologies in the context of data markets for the IoT: A systematic literature review. J. Netw. Comput. Appl. 2022, 207, 103465. [Google Scholar] [CrossRef]
  23. Sweeney, L. k -Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
  24. National Institute of Standards and Technology (NIST). Federal Information Processing Standards Publication 197 (FIPS 197): Advanced Encryption Standard (AES); Technical Report FIPS 197-upd1; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2023. [CrossRef]
  25. Noumeir, R.; Lemay, A.; Lina, J.M. Pseudonymization of Radiology Data for Research Purposes. J. Digit. Imaging 2007, 20, 284–295. [Google Scholar] [CrossRef] [PubMed]
  26. Heurix, J.; Neubauer, T. Privacy-Preserving Storage and Access of Medical Data through Pseudonymization and Encryption. In Proceedings of the 8th International Conference on Trust, Privacy and Security in Digital Business (TrustBus 2011); Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6863, pp. 186–197. [Google Scholar] [CrossRef]
  27. Aamot, H.; Kohl, C.D.; Richter, D.; Knaup-Gregori, P. Pseudonymization of patient identifiers for translational research. BMC Med. Inform. Decis. Mak. 2013, 13, 75. [Google Scholar] [CrossRef]
  28. Elger, B.S.; Iavindrasana, J.; Iacono, L.L.; Müller, H.; Roduit, N.; Summers, P.E.; Wright, J. Strategies for health data exchange for secondary, cross-institutional clinical research. Comput. Methods Programs Biomed. 2010, 99, 230–251. [Google Scholar] [CrossRef] [PubMed]
  29. Dworkin, M. Recommendation for Block Cipher Modes of Operation: Methods for Format-Preserving Encryption; NIST Special Publication 800-38G; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2016.
  30. Demir, L.; Kumar, A.; Cunche, M.; Lauradoux, C. The Pitfalls of Hashing for Privacy. IEEE Commun. Surv. Tutor. 2018, 20, 551–565. [Google Scholar] [CrossRef]
  31. Schwartmann, R.; Weiß, S.; Group, D.P.F. White Paper on Pseudonymization: Guidelines for the Legally Secure Deployment of Pseudonymization Solutions in Compliance with the General Data Protection Regulation; White Paper/Technical Report; Digital Summit Data Protection Focus Group/ePrivacy.eu: Madrid, Spain, 2017. [Google Scholar]
  32. Zimmer, E.; Burkert, C.; Petersen, T.; Federrath, H. PEEPLL: Privacy-Enhanced Event Pseudonymisation with Limited Linkability. In Proceedings of the 35th ACM/SIGAPP Symposium on Applied Computing (SAC ’20), Brno, Czech Republic, 30 March–3 April 2020; pp. 1308–1311. [Google Scholar] [CrossRef]
  33. Varanda, A.; Santos, L.; Costa, R.L.d.C.; Oliveira, A.; Rabadão, C. Log pseudonymization: Privacy maintenance in practice. J. Inf. Secur. Appl. 2021, 63, 103021. [Google Scholar] [CrossRef]
  34. University of Nottingham and Julia Hippisley-Cox. OpenPseudonymiser. Open Source Pseudonymisation Software for Dataset Digest Generation. 2011. Available online: https://www.openpseudonymiser.org/ (accessed on 5 March 2026).
  35. Shamir, A. Identity-Based Cryptosystems and Signature Schemes. In Advances in Cryptology—CRYPTO 84; Chaum, D., Blakley, G.R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1985; Volume 196, pp. 47–53. [Google Scholar] [CrossRef]
  36. Boneh, D.; Franklin, M. Identity Based Encryption from the Weil Pairing. SIAM J. Comput. 2003, 32, 586–615. [Google Scholar] [CrossRef]
  37. Boussada, R.; Elhdhili, M.E.; Saidane, L.A. A Lightweight Privacy-Preserving Solution for IoT: The Case of E-Health. In Proceedings of the IEEE 20th International Conference on High Performance Computing and Communications, IEEE 16th International Conference on Smart City, IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, UK, 28–30 June 2018; pp. 555–562. [Google Scholar]
  38. Verheul, E.; Jacobs, B.; Meijer, C.; Hildebrandt, M.; de Ruiter, J. Polymorphic Encryption and Pseudonymisation for Personalised Healthcare. Cryptology ePrint Archive, Paper 2016/411. 2016. Available online: https://eprint.iacr.org/2016/411 (accessed on 5 March 2026).
  39. van Gastel, B.E.; Jacobs, B.; Popma, J. Data Protection Using Polymorphic Pseudonymisation in a Large-Scale Parkinson’s Disease Study. J. Park. Dis. 2021, 11, S19–S25. [Google Scholar] [CrossRef]
  40. Casacuberta, S.; Hesse, J.; Lehmann, A. SoK: Oblivious Pseudorandom Functions. In Proceedings of the 7th IEEE European Symposium on Security and Privacy, Genoa, Italy, 6–10 June 2022; pp. 625–646. [Google Scholar] [CrossRef]
  41. Lehmann, A. ScrambleDB: Oblivious (Chameleon) Pseudonymization-as-a-Service. Proc. Priv. Enhancing Technol. 2019, 2019, 289–309. [Google Scholar] [CrossRef]
  42. Davidson, A.; Goldberg, I.; Sullivan, N.; Tankersley, G.; Valsorda, F. Privacy Pass: Bypassing Internet Challenges Anonymously. Proc. Priv. Enhancing Technol. 2018, 2018, 164–180. [Google Scholar] [CrossRef]
  43. Kolesnikov, V.; Kumaresan, R.; Rosulek, M.; Trieu, N. Efficient Batched Oblivious PRF with Applications to Private Set Intersection. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16), Vienna, Austria, 24–28 October 2016; pp. 818–829. [Google Scholar] [CrossRef]
  44. Lindell, Y. Secure Multiparty Computation (MPC). Commun. ACM 2020, 64, 86–96. [Google Scholar] [CrossRef]
  45. Shamir, A. How to Share a Secret. Commun. ACM 1979, 22, 612–613. [Google Scholar] [CrossRef]
  46. Li, H.; Pei, L.; Liao, D.; Sun, G.; Xu, D. Blockchain Meets VANET: An Architecture for Identity and Location Privacy Protection in VANET. Peer-to-Peer Netw. Appl. 2019, 12, 1178–1193. [Google Scholar] [CrossRef]
  47. Biskup, J.; Flegel, U. On Pseudonymization of Audit Data for Intrusion Detection. In Designing Privacy Enhancing Technologies; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2009, pp. 161–180. [Google Scholar] [CrossRef]
  48. Lehnhardt, J.; Spalka, A. Decentralized Generation of Multiple, Uncorrelatable Pseudonyms without Trusted Third Parties. In Trust, Privacy and Security in Digital Business (TrustBus 2011); Furnell, S., Lambrinoudakis, C., Pernul, G., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6863, pp. 113–124. [Google Scholar] [CrossRef]
  49. Schartner, P.; Schaffer, M. Unique User-Generated Digital Pseudonyms. In Computer Network Security (MMM-ACNS 2005); Gorodetsky, V., Kotenko, I., Skormin, V., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3685, pp. 194–205. [Google Scholar] [CrossRef]
  50. Kermezis, G.; Limniotis, K.; Kolokotronis, N. User-Generated Pseudonyms Through Merkle Trees. In Privacy Technologies and Policy (APF 2021); Gruschka, N., Antunes, L.F.C., Rannenberg, K., Drogkaris, P., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12703, pp. 89–105. [Google Scholar] [CrossRef]
  51. Camenisch, J.; Hansen, M. (Eds.) Privacy and Identity Management for Life; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6794. [Google Scholar]
  52. Chaum, D. Blind Signatures for Untraceable Payments. In Advances in Cryptology: Proceedings of CRYPTO ’82; Plenum Press: New York, NY, USA, 1983; pp. 199–203. [Google Scholar]
  53. Camenisch, J.; Lysyanskaya, A. An Efficient System for Non-transferable Anonymous Credentials with Optional Anonymity Revocation. In Advances in Cryptology—EUROCRYPT 2001; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2045, pp. 93–118. [Google Scholar]
  54. Kakvi, S.A.; Martin, K.M.; Putman, C.; Quaglia, E.A. SoK: Anonymous Credentials. In Security Standardisation Research; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2023; Volume 13895, pp. 129–151. [Google Scholar] [CrossRef]
  55. Camenisch, J.; Herreweghen, E.V. Design and Implementation of the Idemix Anonymous Credential System. In Proceedings of the 9th ACM Conference on Computer and Communications Security, Washington, DC, USA, 18–22 November 2002; ACM: New York, NY, USA, 2002; pp. 21–30. [Google Scholar]
  56. Paquin, C. U-Prove Cryptographic Specification, version 1.1. Technical Report. Microsoft: Redmond, WA, USA, 2011.
  57. Brands, S. Rethinking Public Key Infrastructures and Digital Certificates: Building in Privacy; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
  58. Asghar, R.; Backes, M.; Simeonovski, M. PRIMA: Privacy-Preserving Identity and Access Management at Internet-Scale. In Proceedings of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; IEEE: Kansas City, MO, USA, 2018; pp. 1–6. [Google Scholar]
  59. Court of Justice of the European Union. Judgment of the Court (First Chamber) of 4 September 2025, Case C-413/23 P: European Data Protection Supervisor v Single Resolution Board (Concept of Personal Data/Pseudonymisation). 2025. Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:62023CJ0413 (accessed on 5 March 2026).
  60. European Data Protection Board. Report on Stakeholder Event on Anonymisation and Pseudonymisation of 12 December 2025; Technical Report; European Data Protection Board: Brussels, Belgium, 2026. [Google Scholar]
  61. Khutsaeva, A.; Leevik, A.; Bezzateev, S. A Survey of Post-Quantum Oblivious Protocols. Cryptography 2025, 9, 62. [Google Scholar] [CrossRef]
Figure 1. Use of symmetric cryptography for deriving pseudonyms.
Figure 2. Use of asymmetric cryptography for deriving pseudonyms (where K is the public key of the legitimate recipient).
Figure 3. An abstract view of the PEP functionality.
Figure 4. An abstract view of a two-party OPRF.
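The two-party OPRF flow of Figure 4 can be sketched with classic exponent blinding in a Schnorr group. This is a toy illustration under assumed parameters: the group size is far too small to be secure, and the simplified hash-to-group map is not one a real OPRF would use; it only shows the blind/evaluate/unblind algebra.

```python
import hashlib
import secrets

# Toy two-party OPRF round via exponent blinding. Illustrative parameters:
# p = 2q + 1 is a safe prime, g generates the order-q subgroup. NOT secure sizes.
p, q, g = 3023, 1511, 4
k = secrets.randbelow(q - 1) + 1   # server's secret PRF key

def hash_to_group(x: bytes) -> int:
    """Simplified hash-to-group: g^(H(x) mod q). Real OPRFs use stronger maps."""
    e = int.from_bytes(hashlib.sha256(x).digest(), "big") % q
    return pow(g, e if e else 1, p)

# 1. Client blinds its input, so the server never sees the identifier.
x = b"alice@example.org"
r = secrets.randbelow(q - 1) + 1
blinded = pow(hash_to_group(x), r, p)

# 2. Server evaluates on the blinded value with key k, learning nothing about x.
evaluated = pow(blinded, k, p)

# 3. Client unblinds: ((H(x)^r)^k)^(r^-1 mod q) = H(x)^k, the pseudonym.
pseudonym = pow(evaluated, pow(r, -1, q), p)
assert pseudonym == pow(hash_to_group(x), k, p)
```

Because the blinding factor r cancels out, the client always recovers the same pseudonym H(x)^k for the same identifier, which is what makes OPRF-derived pseudonyms linkable for the client yet blind for the key-holding server.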
Figure 5. An illustration of the ScrambleDB blind pseudonymisation.
Figure 6. Illustration of a secret sharing scheme, splitting the identifiers into n = 4 parts, from which we need at least t = 3 parts to re-construct the original identifier.
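The (t, n) threshold behaviour shown in Figure 6 corresponds to Shamir's scheme [45] and can be sketched over a prime field: an identifier becomes the constant term of a random degree-(t-1) polynomial, shares are evaluations of that polynomial, and any t shares recover the constant term by Lagrange interpolation. The prime, helper names, and sample identifier below are illustrative choices, not taken from the paper.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime, large enough for short identifiers

def split(secret: int, n: int = 4, t: int = 3) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any t of them reconstruct it."""
    assert 0 <= secret < P and 1 < t <= n
    # Random polynomial of degree t-1 with constant term = secret.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):   # Horner evaluation of the polynomial mod P
            y = (y * x + c) % P
        shares.append((x, y))
    return shares

def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

identifier = int.from_bytes(b"patient-4711", "big")  # illustrative identifier
shares = split(identifier, n=4, t=3)
assert reconstruct(shares[:3]) == identifier   # any 3 of the 4 shares suffice
assert reconstruct(shares[1:]) == identifier
```

With fewer than t shares, every candidate secret remains equally likely, which is why the scheme tolerates an adversarial party up to the threshold (cf. Table 2).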
Table 1. Classification of cryptographic pseudonymisation techniques.
| Cryptographic Technique | Reversible/Irreversible | Deterministic/Probabilistic | Blind/Non-Blind | Centralised/Distributed |
|---|---|---|---|---|
| Symmetric encryption | Reversible | Deterministic # | Non-blind | Centralised |
| Keyed hash function | Irreversible | Deterministic # | Non-blind | Centralised |
| Asymmetric encryption | Reversible | Probabilistic | Non-blind | Centralised |
| Identity-Based Encryption | Both options | Deterministic | Non-blind | Centralised |
| Polymorphic * | Reversible | Probabilistic | Non-blind | Centralised |
| Oblivious PRFs | Irreversible † | Deterministic | Blind | Distributed |
| Secret sharing | Reversible | Probabilistic | Non-blind | Distributed |
| User-generated pseudonyms | Irreversible † | Both options | Blind | Hybrid ‡ |
# Probabilistic pseudonymisation is also an option; however, these schemes are mainly used when deterministic pseudonymisation is needed. * While the scheme is reversible and non-blind in principle, the module within the pseudonymisation entity performs pseudonymisation blindly and cannot reverse it. † Although a key allowing decryption exists, the scheme is designed to ensure that the pseudonymisation entity does not have access to it. ‡ Although more than one party is in fact involved (i.e., each user contributes her own data to pseudonym derivation), there typically exists a central issuer.
Table 2. Trust assumptions for the pseudonymisation techniques.
| Technique | Required Trust Model |
|---|---|
| Symmetric encryption | Honest |
| Keyed hash function | Honest |
| Asymmetric encryption | Honest |
| Identity-Based Encryption | Honest |
| Polymorphic | Semi-honest |
| Oblivious PRFs | Semi-honest or adversarial (depending on the implementation) |
| Secret sharing | Adversarial (up to the relevant threshold) |
| User-generated pseudonyms | Semi-honest |