# Anonymization Procedures for Tabular Data: An Explanatory Technical and Legal Synthesis

## Abstract

## 1. Introduction

- Terminology and taxonomy of anonymization methods for tabular data: This review introduces a unifying terminology for anonymization methods specific to tabular data. It also presents a novel taxonomy that categorizes these methods, providing a structured framework for organizing the field of tabular data anonymization.
- Comprehensive summary of information loss, utility loss, and privacy metrics in the context of anonymizing tabular data: This paper gives an overview of the methods used to quantitatively assess the impact of anonymization on information and utility in tabular data. By surveying the so-called privacy models, with precise definitions aligned with the established terminology, it reviews and explains the trade-offs between privacy protection and data utility, with special attention to the Curse of Dimensionality. This contribution facilitates a deeper understanding of the interplay between anonymization and the quality of tabular data.
- Integration of anonymization of tabular data with legal considerations and risk assessments: This review bridges the gap between technical practices and legal considerations by analyzing how state-of-the-art anonymization methods align with case law and legislation. By connecting anonymization techniques to their legal context, the paper provides insights into the regulatory landscape surrounding tabular data anonymization that are relevant to researchers, practitioners, and policymakers alike. The paper conducts a risk assessment for privacy metrics and discusses present issues in implementing anonymization procedures for tabular data. Further, it examines possible gaps in the interplay of legislation and research from both technical and legal perspectives. Because the available literature and case law are limited, the conclusions on the evaluation of the procedures summarize these sources and are partially drawn by deduction.

## 2. Background

## 3. Related Work

## 4. Technical Perspective

#### 4.1. Eliminating Direct Identifiers

#### 4.2. Generalization

#### 4.3. Suppression

#### 4.4. Permutation

#### 4.5. Perturbation

#### 4.6. Differential Privacy

#### 4.7. Synthetic Data

## 5. Utility vs. Privacy

#### 5.1. Information Loss

#### 5.2. Utility Loss

#### 5.3. Privacy Models

#### 5.3.1. k-Anonymity

The attributes `education, education-num, capital-loss, native-country` build a QI and the attribute `age` is an SA. In Figure 8, generalization and discretization are applied to the attributes `education, education-num, native-country` in such a way that at least two records in the table always have the same QI, leading to k-anonymity with $k=2$. To be precise, the data are split into two groups: $\{R_1, R_2, R_5, R_6\}$ and $\{R_3, R_4\}$.
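The grouping argument above can be checked mechanically. The following is a minimal sketch, not the paper's implementation: the helper `k_anonymity` is a hypothetical name, and the generalized cell values are assumptions chosen only so that the records fall into the two groups named in the text.

```python
from collections import Counter


def k_anonymity(records, quasi_identifier):
    """Return the largest k for which the records are k-anonymous.

    This is the size of the smallest group of records that share
    identical values on every attribute of the quasi-identifier (QI).
    """
    if not records:
        raise ValueError("k-anonymity is undefined for an empty table")
    group_sizes = Counter(
        tuple(record[attribute] for attribute in quasi_identifier)
        for record in records
    )
    return min(group_sizes.values())


# Hypothetical, already generalized rows: the cell values are assumptions
# chosen so that the grouping matches {R1, R2, R5, R6} and {R3, R4}.
QI = ["education", "education-num", "capital-loss", "native-country"]
HIGHER = {"education": "{Bachelors, Masters}", "education-num": "[13, 14]",
          "capital-loss": 0, "native-country": "*"}
LOWER = {"education": "{HS-grad, 11th}", "education-num": "[7, 9]",
         "capital-loss": 0, "native-country": "*"}
records = [
    {"id": "R1", "age": 39, **HIGHER},
    {"id": "R2", "age": 50, **HIGHER},
    {"id": "R3", "age": 38, **LOWER},
    {"id": "R4", "age": 53, **LOWER},
    {"id": "R5", "age": 28, **HIGHER},
    {"id": "R6", "age": 37, **HIGHER},
]

print(k_anonymity(records, QI))  # prints 2
```

The smallest group, $\{R_3, R_4\}$, determines $k$. Including a unique attribute in the QI (here the illustrative `id` field) would reduce $k$ to 1.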

#### 5.3.2. l-Diversity

For the SA `age`, all values of `age` within every group are diverse, and each group consists of two records. Therefore, we have l-diversity with $l=2$. For the SA `workclass`, there would be l-diversity with $l=1$.
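A short sketch of this count, assuming the simplest ("distinct") reading of l-diversity, in which $l$ is the minimum number of distinct SA values in any group. The function name, the `group` label standing in for identical QI values, and the rows themselves are illustrative assumptions.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifier, sensitive_attribute):
    """Return the minimum number of distinct SA values over all QI groups."""
    if not records:
        raise ValueError("l-diversity is undefined for an empty table")
    values_per_group = defaultdict(set)
    for record in records:
        key = tuple(record[attribute] for attribute in quasi_identifier)
        values_per_group[key].add(record[sensitive_attribute])
    return min(len(values) for values in values_per_group.values())


# Hypothetical rows: "group" stands in for an identical (generalized) QI.
rows = [
    {"group": "G1", "age": 39, "workclass": "State-gov"},
    {"group": "G1", "age": 50, "workclass": "Self-emp-not-inc"},
    {"group": "G2", "age": 38, "workclass": "Private"},
    {"group": "G2", "age": 53, "workclass": "Private"},
    {"group": "G3", "age": 28, "workclass": "Private"},
    {"group": "G3", "age": 37, "workclass": "Private"},
]

print(distinct_l_diversity(rows, ["group"], "age"))        # prints 2
print(distinct_l_diversity(rows, ["group"], "workclass"))  # prints 1
```

A single group in which all records share the same `workclass` value is enough to pull $l$ down to 1, which is why the choice of SA matters.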

#### 5.3.3. t-Closeness

- D is the dataset;
- P is the relative frequency distribution of all attribute values in the column of the SA in dataset D;
- ${Q}_{group}$ is the relative frequency distribution of all attribute values in the column of the SA within $group$ that is an equivalence class of dataset D and is obtained by a given QI;
- $EMD(P,Q)$ is the EMD between two relative frequency distributions and depends on the attributes’ value type.

- o is the number of distinct integer attribute values in the SA column;
- P and Q are two relative frequency distributions as histograms (integers are ordered in ascending order).

- o is the number of distinct categorical attribute values in the SA column;
- P and Q are two relative frequency distributions as histograms (integers are ordered in ascending order).
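As a rough illustration of how these quantities fit together, the sketch below assumes the two EMD forms commonly used in the t-closeness literature: the ordered distance for numerical values (with unit spacing between adjacent sorted values) and the equal distance for categorical values. These forms, and all function names, are assumptions rather than a reproduction of the formulas in this section.

```python
def emd_ordered(p, q):
    """EMD with ordered distance; histograms follow ascending value order."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    o = len(p)  # number of distinct values in the SA column
    if o < 2:
        return 0.0
    running_difference = 0.0
    total = 0.0
    for p_i, q_i in zip(p, q):
        running_difference += p_i - q_i
        total += abs(running_difference)
    return total / (o - 1)


def emd_categorical(p, q):
    """EMD with equal distance: half the L1 distance between histograms."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


def relative_frequencies(values, domain):
    """Relative frequency of each domain value within `values`."""
    return [values.count(value) / len(values) for value in domain]


def t_closeness(column, groups, emd):
    """Smallest t: the maximum EMD between P and any group distribution Q."""
    domain = sorted(set(column))
    p = relative_frequencies(column, domain)
    return max(emd(p, relative_frequencies(group, domain)) for group in groups)


# Toy SA column split into two equivalence classes (illustrative values).
column = [1, 2, 3, 4]
groups = [[1, 2], [3, 4]]
print(t_closeness(column, groups, emd_ordered))
print(t_closeness(column, groups, emd_categorical))
```

Passing the distance function as a parameter keeps the group-wise maximum separate from the choice of EMD variant, which depends on the attribute's value type.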

`age`, there would be t-closeness with $t=0.2$, due to

#### 5.4. Re-Identification Risk Quantification

- $P\left(R\left(J\right)\right):=Pr[{X}_{j}=R\left(j\right),j\in J]$;
- $P\left(R\right|R\left(J\right)):=Pr[{X}_{i}=R\left(i\right),i\notin J|{X}_{j}=R\left(j\right),j\in J]$;
- $P\left(R(j+1)\right|R\left(j\right)):=Pr[{X}_{j+1}=R(j+1)|{X}_{j}=R\left(j\right)]$;
- $P\left(R\right|R(j+1)):=Pr[{X}_{i}=R\left(i\right),i\notin J|{X}_{j+1}=R(j+1)]$.
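The first two probabilities can be illustrated by estimating them with relative frequencies in the table itself. This is only a sketch under that assumption: the text does not state how the probabilities are estimated, the function names are hypothetical, and the rows are illustrative values modeled on a projection onto `education, sex, hours-per-week`.

```python
from fractions import Fraction


def prob_known(records, target, known):
    """Empirical P(R(J)): share of records matching `target` on attributes J."""
    matches = sum(
        all(record[j] == target[j] for j in known) for record in records
    )
    return Fraction(matches, len(records))


def prob_rest_given_known(records, target, known):
    """Empirical P(R | R(J)): share of J-matching records equal to `target`."""
    candidates = [
        record for record in records
        if all(record[j] == target[j] for j in known)
    ]
    if not candidates:
        raise ValueError("no record matches the known attribute values")
    full_matches = sum(record == target for record in candidates)
    return Fraction(full_matches, len(candidates))


# Illustrative rows (assumed values, one dict per record R1..R6).
rows = [
    {"education": "Bachelors", "sex": "Male", "hours-per-week": 40},
    {"education": "Bachelors", "sex": "Male", "hours-per-week": 13},
    {"education": "HS-grad", "sex": "Male", "hours-per-week": 40},
    {"education": "11th", "sex": "Male", "hours-per-week": 40},
    {"education": "Bachelors", "sex": "Female", "hours-per-week": 40},
    {"education": "Masters", "sex": "Female", "hours-per-week": 40},
]
J = ["education"]  # attributes assumed to be known to the attacker

print(prob_known(rows, rows[0], J), prob_rest_given_known(rows, rows[0], J))
print(prob_known(rows, rows[2], J), prob_rest_given_known(rows, rows[2], J))
```

In this toy table, knowing a frequent `education` value leaves several candidate records, whereas knowing a rare value pins down the record completely. The chained probabilities in the last two definitions are not covered by this sketch.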

#### 5.5. Curse of Dimensionality

## 6. Legal Perspective

#### 6.1. Synopsis of the Problem

#### 6.2. Recital 26

#### 6.3. Absolute Personal Reference/Zero-Risk Approach

#### 6.4. Relative Personal Reference/Risk-Based Approach

#### 6.5. Tightened Relative Personal Reference of the EU’s Court of Justice

#### 6.6. Evaluation Standards for the Risk Assessment of the Techniques

#### 6.7. Legal Evaluation

#### 6.7.1. Identifiers, Quasi-Identifiers, and Sensitive Attributes

#### 6.7.2. k-Anonymity

#### 6.7.3. l-Diversity

#### 6.7.4. t-Closeness

#### 6.7.5. Differential Privacy

#### 6.7.6. Synthetic Data

#### 6.7.7. Risk Assessment Overview

## 7. Discussion

## 8. Conclusions

## Abbreviations

Abbreviation | Meaning |
---|---|
DP | Differential Privacy |
DP-SGD | Differentially Private Stochastic Gradient Descent |
ECJ | European Court of Justice |
EGC | European General Court |
EU | European Union |
FMRMR | Fragmentation Minimum Redundancy Maximum Relevance |
GAN | Generative Adversarial Network |
GDPR | General Data Protection Regulation |
HIPAA | Health Insurance Portability and Accountability Act |
LDA | Linear Discriminant Analysis |
LSTM | Long Short-Term Memory |
MIMIC-III | Medical Information Mart for Intensive Care |
PCA | Principal Component Analysis |
PPDP | Privacy-Preserving Data Publishing |
PPGIS | Public Participation Geographic Information System |
QI | Quasi-Identifier |
SA | Sensitive Attribute |
SVD | Singular Value Decomposition |

## References

**Figure 1.** The considered data model. The first $r$ attributes form a QI. All attributes indexed from 1 to $r+t$ are potentially SAs. The considered data model does not contain Direct Identifiers.

**Figure 3.** Example. Visualizing both generalization and discretization by projecting the first six records of Adult on the columns `age` and `education`. In the categorical attribute column `education`, the attribute values “Bachelors” and “Masters” are summarized to a set with both values. In the numerical attribute column `age`, the values are discretized in intervals of size 10.
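The two operations described in the caption of Figure 3 can be sketched as follows. The function names are hypothetical, and aligning interval boundaries to multiples of the width is an assumption; the caption only fixes the interval size.

```python
def discretize(value, width=10):
    """Map a number to the half-open interval [lower, lower + width)."""
    lower = (value // width) * width
    return (lower, lower + width)


def generalize(value, merged_values):
    """Replace a categorical value by the merged set if it belongs to it."""
    return merged_values if value in merged_values else value


# Illustrative column values (assumed, not read from the figure).
ages = [39, 50, 38, 53, 28, 37]
education = ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters"]
degree = frozenset({"Bachelors", "Masters"})

print([discretize(age) for age in ages])
print([generalize(level, degree) for level in education])
```

Values outside the merged set are left unchanged, so the generalization only coarsens the categories that were explicitly grouped.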

**Figure 4.** Example. Visualizing suppression of the numerical attribute column `fnlwgt` (final weight: number of units in the target population that the responding record represents) by replacing every column value with the mean value of all column values. Visualizing suppression of the categorical attribute column `marital-status` by replacing the values with ∗, which denotes all possible values or the empty set.
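A minimal sketch of the two suppression variants from the caption of Figure 4, with hypothetical function names and toy input values.

```python
from statistics import mean


def suppress_numeric(column):
    """Replace every value in a numerical column with the column mean."""
    column_mean = mean(column)
    return [column_mean] * len(column)


def suppress_categorical(column, symbol="*"):
    """Replace every value in a categorical column with a placeholder."""
    return [symbol] * len(column)


# Toy values for illustration only.
weights = [100, 200, 300]
status = ["Never-married", "Divorced", "Married-civ-spouse"]

print(suppress_numeric(weights))     # prints [200, 200, 200]
print(suppress_categorical(status))  # prints ['*', '*', '*']
```

Both variants keep the number of rows intact while removing all variation within the column.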

**Figure 5.** Example. Visualizing permutation of the column `occupation` in the cutout of the first six rows in the Adult dataset. The attached indices point out the change in order by applying permutation. No attribute values are deleted, but the ordering inside the column is very likely destroyed.
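The permutation in Figure 5 can be sketched with a seeded shuffle of a single column. The function name, the seed parameter, and the column values are illustrative assumptions.

```python
import random


def permute_column(column, seed=None):
    """Return a shuffled copy of the column; the input list is not modified."""
    rng = random.Random(seed)
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled


# Illustrative values for the `occupation` column.
occupations = [
    "Adm-clerical", "Exec-managerial", "Handlers-cleaners",
    "Handlers-cleaners", "Prof-specialty", "Exec-managerial",
]
permuted = permute_column(occupations, seed=0)
print(permuted)
```

The multiset of values is preserved, so column-level statistics such as value frequencies are unchanged, while the link between a value and its original row is broken.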

**Figure 6.** Example. Entropy information loss when generalizing the column `education` of the cutout of the first six rows in the Adult dataset. In generalization (I), we obtain $\Pi_e(D, g(D)) \approx 3.25$, which means lower information loss than in generalization (II), where $\Pi_e(D, g(D)) \approx 5.25$.

**Figure 7.** Example. Numerical information loss when generalizing the column `age` of the cutout of the first six rows in the Adult dataset. In generalization (I), we obtain $\Pi(D, g(D)) \approx 0.36$ and $IL(D, g(D)) \approx 3.33$, which means higher information loss than in generalization (II), where $\Pi(D, g(D)) = 0.16$ and $IL(D, g(D)) \approx 1.17$. In this example, to apply $ID$, intervals are vectorized by calculating the mean of the minimum and maximum values.

**Figure 8.** Example. The first six rows of the Adult dataset, where the blue-background attributes `education, education-num, capital-loss, native-country` define a QI (chosen artificially as the QI for demonstration purposes). Column sorting can be applied to fit the data scheme (Figure 1). The transformed six-row database fulfills k-anonymity with $k=2$, whereas before discretization in the column `education-num` and generalizations in the columns `education` and `native-country`, the groups had a minimum group size of one. The background colors (orange and yellow) visualize group correspondence, where the attributes in the chosen QI are identical for every record in the group.

**Figure 9.** Example. Projecting the first six rows of the Adult set on the attributes `education, sex, hours-per-week`. The $PR$ score assumes that attribute values are known and subsequently calculates the risk of re-identifying a single record (in the case of unit record data). Knowing different values of the attribute `education` (yellow or orange, respectively) leads to different privacy probabilities of re-identifying a record (record $R_1$ or $R_3$, respectively).

**Figure 10.** Example. The Adult dataset can be used for the supervised training of a machine learning algorithm to classify persons with an income of at most USD 50K. The categorical attributes `education` and `education-num` share high mutual information ($I(A_{education}, A_{education-num}) \approx 2.93$) and might be part of different fragments, whereas the categorical attributes `race` and `sex` do not share high mutual information ($I(A_{race}, A_{sex}) \approx 0.01$) and can be part of the same fragment in vertical fragmentation. The calculated mutual information values are based on the training dataset (without the test data) of the Adult dataset. The matrix is symmetric because the function in (30) is symmetric. The values are rounded to two decimal places.
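For orientation, empirical mutual information between two categorical columns can be computed as sketched below. The function in (30) is not reproduced here, so the standard definition with a base-2 logarithm is an assumption, as are the function name and the toy columns.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two equally long columns."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Toy columns: `b` is a relabeling of `a`; `c` is independent of `a`.
a = ["low", "low", "high", "high"]
b = [1, 1, 2, 2]
c = ["x", "y", "x", "y"]

print(mutual_information(a, b))  # prints 1.0
print(mutual_information(a, c))  # prints 0.0
```

A column that is a one-to-one relabeling of another shares all of its information, which is the situation the caption describes for `education` and `education-num`; independent columns yield a value of zero.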

**Figure 11.** Example. Absolute values of Pearson Correlation coefficients and Cramér’s V Statistic coefficients in the Adult dataset. Both matrices are symmetric. The values are rounded to two decimal places.

No. | Direct Identifier | No. | Direct Identifier |
---|---|---|---|
1 | Names | 10 | Social security numbers |
2 | All geographic subdivisions smaller than a state | 11 | IP addresses |
3 | All elements of dates (except year) directly related to an individual | 12 | Medical record numbers |
4 | Telephone numbers | 13 | Biometric identifiers, including finger and voice prints |
5 | Vehicle identifiers and serial numbers | 14 | Health plan beneficiary numbers |
6 | Fax numbers | 15 | Full-face photographs and any comparable images |
7 | Device identifiers and serial numbers | 16 | Account numbers |
8 | Email addresses | 17 | Any other unique identifier |
9 | URLs | 18 | Certificate/license numbers |

**Table 2.** Overview of information losses, utility losses/measurements, and privacy models when applying anonymization methods to tabular data.

Measurement | Method |
---|---|
Information loss | Conditional entropy [18] |
Information loss | Monotone entropy [18] |
Information loss | Non-uniform entropy [18] |
Information loss | Information loss on a per-attribute basis [38] |
Information loss | Relative condensation loss [39] |
Information loss | Euclidean distance [40] |
Utility loss | Average group size [41] |
Utility loss | Normalized average equivalence class size metric [42] |
Utility loss | Discernibility metric [21,42,43] |
Utility loss | Proportion of suppressed records |
Utility loss | ML utility |
Utility loss | Earth Mover Distance [44] |
Utility loss | z-Test statistics [7] |
Privacy models | k-Anonymity [3] |
Privacy models | Mondrian multi-dimensional k-anonymity [42] |
Privacy models | l-Diversity [45] |
Privacy models | t-Closeness [46] |
Privacy models | Privacy probability of non-re-identification [47] |

**Table 3.** Risk assessment for anonymization methods of tabular data. $(1)$: Risk depends on chosen k. $(2)$: It does not take into account similarity attacks. $(3)$: Based on k-anonymity. $(4)$: Risk depends on value distribution of Sensitive Attributes. $(5)$: Risk depends on privacy budget. $(6)$: Might be combined with DP. +: The method can be considered a strategy to defend against the attack scenario. −: The method cannot solely be considered a defense strategy against the attack scenario.

Method | Singling Out | Linkability | Inference |
---|---|---|---|
k-Anonymity | + | $-^{(1)}$ | $-^{(2)}$ |
l-Diversity | $+^{(3)}$ | $-^{(1,3)}$ | $+^{(2,4)}$ |
t-Closeness | $+^{(3)}$ | $-^{(1,3)}$ | $+^{(2,4)}$ |
DP | + | $+^{(5)}$ | $+^{(5)}$ |
Synthetic data | + | + | $-^{(6)}$ |

