The utility value (U) of pseudonymized data is measured by averaging the information loss after pseudonymization and the similarity between the original and pseudonymized data or by calculating a weighted average of the two values. Information loss evaluates the extent to which the pseudonymized data differs from the original, while original similarity assesses how closely it retains the characteristics of the original.
The final utility score is calculated in one of two ways using the overall information loss index (L_total) and the overall original similarity index (S_total). First, the two metrics can be combined by assigning weights and calculating their weighted average as the final utility score. Alternatively, the relative ratio S_total/L_total can be used to derive the final utility score U.
3.2.1. Information Loss
This measure quantifies the information lost when comparing pseudonymized data with the original. It evaluates the extent of deviation introduced by transformation and expresses the amount of lost information quantitatively.
This index indicates the extent to which an attribute has been generalized into a broader range. It is defined as follows:

Loss = 1 − 1/|Generalization Range|

where |Generalization Range| represents the size of the generalized interval.
For example, if an original age value of 34 is generalized to the range [30–39], the size of the generalized interval is 10. In this case, the information loss is 0.9, indicating a reduction in precision. The index ranges between 0 and 1: values close to 0 indicate minimal information loss and high data utility, while values close to 1 indicate substantial information loss. In the latter case, privacy protection is strong, but data utility is significantly diminished.
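As an illustration, the index can be computed directly from the width of the generalized interval. The formula 1 − 1/|range| is inferred from the worked example above (interval size 10 yielding a loss of 0.9), and the function name is illustrative:

```python
def generalization_loss(range_size: int) -> float:
    """Information loss when a value is generalized into an interval
    covering `range_size` possible values (assumed form: 1 - 1/size)."""
    return 1 - 1 / range_size

# An age of 34 generalized to [30-39]: interval size 10
print(generalization_loss(10))  # 0.9
```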
When comparing the original and transformed data, the degree of difference can be calculated using distance measures such as Euclidean distance or Manhattan distance. These metrics quantify the level of distortion introduced during pseudonymization.
Euclidean distance measures the shortest straight-line distance between two points in a coordinate space, based on the Pythagorean theorem. It is one of the most widely used distance metrics. The Euclidean distance between two points A = (x₁, y₁) and B = (x₂, y₂) is defined as:

D_Euclidean(A, B) = √((x₂ − x₁)² + (y₂ − y₁)²)
In general, in an n-dimensional space, the Euclidean distance between points A = (x₁, x₂, …, xₙ) and B = (y₁, y₂, …, yₙ) is defined as:

D_Euclidean(A, B) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)²)
For example, given A = (1, 2) and B = (4, 6), the squared differences are (4 − 1)² = 9 and (6 − 2)² = 16. The Euclidean distance is then √(9 + 16) = √25 = 5. If the result is 0, the points are identical, indicating no difference between them. A positive value indicates the points are different; the larger the value, the greater the distance and the lower the utility, while smaller values indicate higher similarity and thus higher utility.
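The calculation above can be sketched in a few lines (a minimal helper, with an illustrative name):

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```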
Manhattan distance does not measure straight-line distance but instead sums the absolute differences along each coordinate axis. It is often compared to navigating a grid of city blocks in Manhattan, hence the name. The Manhattan distance between two points A = (x₁, y₁) and B = (x₂, y₂) is defined as:

D_Manhattan(A, B) = |x₁ − x₂| + |y₁ − y₂|
In general, in an n-dimensional space, the Manhattan distance between points A = (x₁, x₂, …, xₙ) and B = (y₁, y₂, …, yₙ) is defined as:

D_Manhattan(A, B) = |x₁ − y₁| + |x₂ − y₂| + … + |xₙ − yₙ|
For example, given A = (1, 2) and B = (4, 6), the absolute differences are |1 − 4| = 3 and |2 − 6| = 4. The Manhattan distance is then D_Manhattan(A, B) = 3 + 4 = 7. Larger values indicate greater differences between the data points and thus lower utility, while smaller values indicate closer similarity and higher utility. Unlike Euclidean distance, Manhattan distance provides an absolute measure of the differences along each dimension.
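A corresponding sketch for the Manhattan distance (helper name illustrative):

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance((1, 2), (4, 6)))  # 7
```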
Entropy represents the uncertainty in data. When evaluating information loss, the entropy of the original dataset is compared with that of the pseudonymized dataset.
Let H(X) denote the entropy of the original dataset and H(X′) that of the pseudonymized dataset. Entropy is defined mathematically as:

H(X) = −Σ P(xᵢ) log₂ P(xᵢ), summed over i = 1, …, n

where xᵢ is a possible value of the random variable X, P(xᵢ) is its probability, and n is the number of distinct values. The logarithm log₂ P(xᵢ) is taken in base 2, and the unit of entropy is bits.
[Step 1] Compute probability distribution: Determine the probability of each value by dividing its frequency by the total number of observations.
[Step 2] Calculate logarithms: Compute the log of each probability (using base 2).
[Step 3] Multiply probability and log term: For each value, calculate −P(xᵢ) log₂ P(xᵢ).
[Step 4] Summation: Add the results across all values to obtain the entropy.
For example, consider a dataset with the following values: A, A, B, B, B, C, C, D. In this case, the entropy can be calculated as follows:
- Probability of each value: P(A) = 2/8 = 0.25, P(B) = 3/8 = 0.375, P(C) = 2/8 = 0.25, P(D) = 1/8 = 0.125
- Entropy calculation: H(X) = −(0.25 log₂ 0.25 + 0.375 log₂ 0.375 + 0.25 log₂ 0.25 + 0.125 log₂ 0.125)
- Logarithmic calculation: log₂ 0.25 = −2, log₂ 0.375 ≈ −1.415, log₂ 0.125 = −3
- Final result: H(X) = −(0.25 × (−2) + 0.375 × (−1.415) + 0.25 × (−2) + 0.125 × (−3)) ≈ 1.906
Accordingly, the entropy of this dataset is approximately 1.91 bits. This means that, on average, 1.91 bits of information are required to predict a single value in the dataset. The entropy increases as values become more evenly distributed, reflecting greater diversity in the data. The theoretical range of entropy values is [0, log₂(n)], where n is the number of distinct values. A value near 0 indicates low uncertainty, meaning that most records share the same value and diversity is minimal. At the maximum (log₂(n)), the dataset exhibits maximal diversity, as all values occur with equal probability. Thus, high entropy signifies greater data variability, which enhances analytical utility but complicates privacy protection. Conversely, lower entropy indicates reduced diversity and uncertainty, which favours privacy protection but diminishes data utility.
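The four steps above can be sketched as follows (a minimal implementation; the function name is illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits: -sum P(x) * log2 P(x) over distinct values."""
    counts = Counter(values)
    n = len(values)
    probs = [c / n for c in counts.values()]       # Step 1: probabilities
    return -sum(p * math.log2(p) for p in probs)   # Steps 2-4

print(round(entropy(list("AABBBCCD")), 3))  # 1.906
```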
Entropy difference (ΔH = H(X) − H(X′)) represents the change in information content between the original dataset and the pseudonymized dataset.
When the entropy difference is zero, the uncertainty of the original data is preserved after pseudonymization. This indicates that the pseudonymization process had little to no impact on the information content of the data, meaning almost no information loss. Data utility remains high in this case. This situation typically arises when very minimal modifications are applied during pseudonymization or when the modified attributes have little influence on overall entropy.
When the entropy of the original dataset is greater than that of the pseudonymized dataset, the uncertainty of the pseudonymized data is reduced. This means some information has been lost due to pseudonymization. The larger the entropy difference, the greater the information loss. This typically occurs through generalization, aggregation, or deletion of specific attributes, resulting in a dataset that contains less information. Higher information loss implies reduced data utility, which may negatively affect predictive power and analytical insights in applications such as data analysis and machine learning.
Although generally uncommon, a negative entropy difference means that the pseudonymized dataset exhibits greater uncertainty than the original dataset. This may occur if the pseudonymization process artificially increases diversity or randomness, leading to higher entropy. In such cases, the pseudonymized data may appear more randomized than the original, suggesting both reduced re-identification risk and a potential means of preserving utility while enhancing privacy.
Consistency loss occurs when identical original values are not transformed into identical pseudonymized values. It can be defined as the proportion of inconsistent transformations among all transformed instances of the same value.
For example, suppose the value A appears 10 times in the dataset. If 7 instances are transformed into B and the remaining 3 into C, then the inconsistency rate is 3/10 = 30%. The index ranges from 0 to 1. A value of 0 indicates that identical attribute values have been transformed consistently, meaning that consistency was well maintained throughout the transformation process. A value close to 1 indicates that identical attribute values are transformed differently, reflecting a lack of consistency. Values closer to 0 thus imply that the transformation is applied uniformly, preserving dataset consistency and ensuring higher reliability of analytical results, whereas values closer to 1 signify inconsistent transformation, which can compromise dataset integrity and reduce the reliability of subsequent analyses.
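Assuming the rate is measured against the most frequent pseudonym for a given original value (consistent with the 30% example above), a sketch:

```python
from collections import Counter

def inconsistency_rate(pseudonyms):
    """Share of transformed instances of one original value that deviate
    from its most frequent pseudonym (assumed definition)."""
    counts = Counter(pseudonyms)
    majority = counts.most_common(1)[0][1]
    return 1 - majority / len(pseudonyms)

# Value A transformed 10 times: 7 instances to B, 3 to C
print(round(inconsistency_rate(["B"] * 7 + ["C"] * 3), 2))  # 0.3
```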
The mean absolute deviation quantifies the average difference between original and pseudonymized values by taking the absolute value of differences. It is calculated as:

MAD = (1/n) Σ |xᵢ − xᵢ′|, summed over i = 1, …, n

where n is the total number of records, xᵢ is the original value, and xᵢ′ is the pseudonymized value.
For an original dataset [5, 10, 15] and a pseudonymized dataset [6, 9, 14], the mean absolute deviation is (|5 − 6| + |10 − 9| + |15 − 14|)/3 = 1. This indicates that, on average, each value shifted by 1 due to pseudonymization. The result ranges from 0 to ∞. A value of 0 indicates no difference between the original and transformed data, meaning that information loss is negligible and data similarity is preserved. Conversely, larger values indicate greater differences between the original and transformed data, implying higher information loss and reduced accuracy and utility.
The overall information loss value, L_total, can be calculated either as a weighted average of the five previously described metrics or by selecting the maximum value among them. If the resulting score is not expressed on a 0–1 scale, normalization is applied to map values into the range [0, 1]. Normalization is performed using the maximum and minimum across the five metrics, as follows:

L_norm = (Lᵢ − L_min) / (L_max − L_min)

where Lᵢ is the original loss value for the metric, L_min is the minimum among all five loss values, and L_max is the maximum.
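The normalization and aggregation steps can be sketched as follows (equal weights are an illustrative default, and the guard for identical min and max values is an added assumption):

```python
def min_max_normalize(losses):
    """Map each loss value onto [0, 1] using the min and max of the set."""
    lo, hi = min(losses), max(losses)
    if hi == lo:  # all metrics equal: no spread to normalize (assumption)
        return [0.0] * len(losses)
    return [(v - lo) / (hi - lo) for v in losses]

def total_loss(losses, weights=None):
    """L_total as a weighted average of the normalized metrics;
    max(min_max_normalize(losses)) would be the alternative choice."""
    norm = min_max_normalize(losses)
    if weights is None:
        weights = [1.0] * len(norm)
    return sum(w * v for w, v in zip(weights, norm)) / sum(weights)

# Five illustrative loss scores on different scales
print(total_loss([0.9, 5.0, 1.9, 0.3, 1.0]))
```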
3.2.2. Original Similarity
This metric evaluates how similar the anonymized data is to the original data, measuring the extent to which patterns or characteristics of the original are preserved after transformation. The purpose is to determine whether the data maintains a similar structure despite anonymization.
Cosine similarity measures the similarity between two vectors based on the angle between them. The smaller the angle, the higher the similarity. This method evaluates the similarity of direction between vectors rather than their actual distance or magnitude. For two vectors A = (x₁, x₂, …, xₙ) and B = (y₁, y₂, …, yₙ), cosine similarity is defined as:

cos(θ) = (A·B) / (‖A‖ ‖B‖)
where A·B is the dot product, ‖A‖ is the Euclidean length of A, and ‖B‖ is the Euclidean length of B.
For example, if A = (1, 2, 3) and B = (4, 5, 6), the results are as follows:
Dot product: A·B = (1 × 4) + (2 × 5) + (3 × 6) = 32;
Vector magnitudes: ‖A‖ = √(1² + 2² + 3²) = √14 ≈ 3.742, ‖B‖ = √(4² + 5² + 6²) = √77 ≈ 8.775;
Cosine similarity: cos(θ) = 32 / (3.742 × 8.775) ≈ 0.975.
The cosine similarity ranges from −1 to 1. A value close to 1 indicates that the vectors point in nearly the same direction and are highly similar, meaning that the two data points are very similar. A value near 0 means the vectors are orthogonal, implying little to no similarity, indicating that the data points have little to no correlation. A value close to −1 indicates that the vectors point in opposite directions, implying strong negative similarity.
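The computation can be sketched as:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity((1, 2, 3), (4, 5, 6)), 4))  # 0.9746
```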
Jaccard similarity is a method for measuring the similarity between two sets, evaluating how many elements they share in common. It is defined as the ratio of the size of the intersection of the two sets to the size of their union, thereby quantifying the degree of overlap. The value generally ranges between 0 and 1, where a larger value indicates that the two sets share more elements. Formally, for two sets A and B, the Jaccard similarity is defined as:

J(A, B) = |A∩B| / |A∪B|
where |A∩B| is the number of elements in the intersection of A and B (the elements both sets share), and |A∪B| is the number of elements in the union of A and B (all unique elements contained in either set, with duplicates counted only once).
Suppose two sets are given as A = {1, 2, 3, 4} and B = {3, 4, 5, 6}.
Intersection A∩B: A∩B = {3, 4}, |A∩B| = 2;
Union A∪B: A∪B = {1, 2, 3, 4, 5, 6}, |A∪B| = 6;
Jaccard similarity: J(A, B) = 2/6 ≈ 0.33.
Accordingly, the Jaccard similarity between sets A and B is 0.33.
The value ranges from 0 to 1. If the two sets share no common elements, the Jaccard similarity is 0, meaning that they consist of completely different elements; if they contain exactly the same elements, it is 1, indicating that the two sets are identical. In cases of partial overlap, the similarity takes a value between 0 and 1, with higher values indicating a greater proportion of shared elements. For example, in web documents or text analysis, when two documents contain a large number of common words, their Jaccard similarity increases; conversely, a low Jaccard similarity (J ≈ 0) indicates that the sets have little overlap and the two data points are not similar. A Jaccard similarity of 0.33 between sets A and B means that the sets share approximately 33% of the elements in their union: the two sets overlap to some extent, but most of their elements are different.
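A set-based sketch of the calculation:

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(round(jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6}), 2))  # 0.33
```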
The overall original similarity S_total can be computed either by taking the weighted average of the two similarity measures described above or by selecting the maximum value among them.
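Combining the two aggregate indices into the final utility score U can then follow either option described at the start of this section. In the weighted-average sketch below, inverting the loss term (1 − L_total) so that lower loss raises utility is an added assumption; the ratio form follows the text directly:

```python
def utility_weighted(s_total, l_total, w_s=0.5, w_l=0.5):
    """Weighted combination of similarity and inverted loss; the weights
    and the (1 - l_total) inversion are illustrative assumptions."""
    return (w_s * s_total + w_l * (1 - l_total)) / (w_s + w_l)

def utility_ratio(s_total, l_total):
    """Alternative form from the text: U = S_total / L_total."""
    return s_total / l_total

print(utility_weighted(0.8, 0.2))  # 0.8
print(utility_ratio(0.8, 0.4))  # 2.0
```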
Table 2 summarizes the utility metrics (information loss and original similarity scores), their target attributes, value ranges, and interpretations.