In this section, we evaluate and compare the performance of our scheme on real-world benchmark datasets, focusing on privacy preservation, utility enhancement, reduction in lookups to generalization hierarchies, and time overhead. Concise details on the three real-world datasets, the experimental settings, evaluation metrics, baselines, and a comparative analysis with SOTA methods are given below.
5.3. Evaluation Metrics and Baselines
We compared the performance of our scheme with six recent SOTA algorithms: AKA [30], RKA [31], AFA [32], UPA [33], MCKLA [34], and SDA [35]. To the best of our knowledge, these are the most relevant and recent SOTA algorithms for data outsourcing. These algorithms have also developed practical methods to enhance utility and are therefore the best candidates for comparison. Lastly, some methods, such as SDA, employ parallel implementations to speed up the anonymization process and were therefore reliable candidates for comparison with our scheme, as we made data-related enhancements in our pipeline to accomplish a similar goal. Based on the above considerations, we contrasted our scheme against these six SOTA algorithms. Although some privacy-preserving synthetic data generation methods have emerged, they generate anonymous data by jointly combining a generative model with an anonymity/DP model; therefore, they were not ideal for direct comparison with our scheme. We created distinct anonymized variants of all datasets by varying k and compared the results of these methods fairly via four evaluation metrics: SA disclosure risk, information loss (IL), query accuracy (QA), and time overhead. The formalization of the evaluation metrics is given below.
In Equation (11), the disclosure risk is computed from the adversary's background knowledge and the tuples in the anonymized data that relate to that knowledge. It is hard to know which tuple or portion of T is exposed to an adversary; therefore, we assumed a worst-case setting in which any record could be exposed to determine the SA of a user.
In Equation (12), l is the actual generalization level applied to a QI value, relative to the total number of levels in the generalization hierarchy H. The IL is zero if the QI values in a cluster are not generalized; the two possibilities for IL are given in Equation (13).
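A plausible formalization consistent with this description, assuming the loss is the ratio of the applied generalization level to the hierarchy height (the exact form of Equations (12) and (13) may differ), is:

\[
IL(v) \;=\;
\begin{cases}
0, & \text{if the QI value } v \text{ is not generalized},\\
l / h, & \text{if } v \text{ is generalized to level } l \text{ of an } h\text{-level hierarchy } H.
\end{cases}
\]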
The QA metric is formalized in Equation (14), where the numerator denotes the maximum number of clusters yielding correct results against aggregate queries on the anonymized data, and the denominator denotes the total number of clusters. The anonymized data enclose the clusters from all three partitions. The experiments were conducted across all clusters, not just the anonymized ones, as most data underwent noise addition and generalization, except for some categorical QIs that exhibited general patterns. Only generalization was performed selectively (e.g., considering diversity and pattern information); noise addition was consistently applied to the numerical part of all clusters, so all clusters had some form of perturbation.
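A plausible form of Equation (14), consistent with the description above (the symbols below are illustrative notation), is:

\[
QA \;=\; \frac{C_{\mathrm{acc}}}{C_{\mathrm{total}}},
\]

where \(C_{\mathrm{acc}}\) is the number of clusters yielding correct answers to the aggregate queries and \(C_{\mathrm{total}}\) is the total number of clusters.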
5.4. Privacy Comparisons
In the first set of experiments, we compared the performance of our scheme with respect to privacy preservation. Specifically, we drew background knowledge of diverse types from T and evaluated the effectiveness of our scheme in preserving privacy using the disclosure risk metric. This metric evaluates disclosure risk in scenarios where attackers know the QIs, either fully or partially, and want to infer the SA, not vice versa. To evaluate the disclosure risk in the opposite case (e.g., where attackers know the SA and want to infer the QIs), we calculated the probability of disclosure from the successful matches in the anonymized data. Specifically, we encoded the data exposed to the adversary in a pandas DataFrame and matched it against the anonymized data. Subsequently, the number of successful matches was counted, and the probability of disclosure was computed. An example of the code that finds the correct matches (or linked data) is as follows: linked_data = pd.merge(df, external_knowledge, on=['age', 'workclass', 'education', 'sex', 'income'], how='left'). Once the linked data were determined, the probability of disclosure was computed from the number of matches. If the matches spanned multiple clusters, we calculated the probability for each cluster individually and took the average of the results.
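For illustration, the sketch below shows how such a linkage-based disclosure probability can be computed with pandas; the cluster_id column, the choice of QI columns, and the use of 1/(number of matching candidates) as the per-cluster probability are assumptions for this example rather than the exact formula of our scheme.

import pandas as pd

def linkage_disclosure_probability(anon_df: pd.DataFrame,
                                   external_knowledge: pd.DataFrame,
                                   qi_cols: list) -> float:
    # Link the adversary's background knowledge to the anonymized data on the QIs.
    linked = pd.merge(anon_df, external_knowledge, on=qi_cols, how='inner')
    if linked.empty:
        return 0.0  # no successful matches, hence no disclosure
    # Assumed per-cluster probability: one target among the matched rows of that cluster.
    per_cluster = linked.groupby('cluster_id').size().apply(lambda n: 1.0 / n)
    # Average over the clusters that contained matches.
    return float(per_cluster.mean())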
At the time of data anonymization, it is challenging to quantify the adversary's skills and knowledge regarding the real data. Similarly, it is difficult to gauge the adversary's background knowledge. The standard practice is to expose a set of records (assuming them to be the background knowledge of the adversary) or different parts of records (as listed in the threat model) and assess the disclosure risk accordingly. In some cases, a certain percentage of records is extracted from the real data, and a linking attack is performed to determine whether those records can be recovered from the anonymized data. We made five realistic assumptions about the data composition that an adversary might know and measured the disclosure risk accordingly. The rigorous testing of our scheme's privacy strength against these five realistic assumptions and various attacks (linking, background, and guessing) mimicked real-world settings. We tested all five scenarios listed in the threat model to prove the efficacy of our scheme regarding privacy preservation in the anonymized data. The privacy protection results obtained from the experiments are illustrated in Figure 5.
From the results, we can see that the last scenario had a higher disclosure risk in most cases owing to data linking with external sources. In each scenario, the quantity of data exposed varied, and the corresponding disclosure risk was measured. From the comparative analysis, our scheme yielded a lower disclosure risk in most cases compared to recent SOTA methods. These improvements came from using diversity information in SAs and from paying close attention to specific patterns prone to privacy leaks. Most existing methods do not consider diversity information, and thus their disclosure risk was higher in most cases. The average disclosure risk over the five attack scenarios is given in Table 4. As shown in Table 4, our scheme, on average, yielded a lower disclosure risk across the datasets.
To further enhance the validity of the results and ensure their robustness, we added a statistical validation of the disclosure risk for the five attack scenarios using 95% confidence intervals (CIs) in Table 5. From the results, it can be observed that our scheme yielded narrower confidence intervals than most baselines across the datasets.
In addition, we tested the privacy-preserving capabilities of our scheme by varying k and the related privacy parameters, achieving better results than most SOTA methods. On average, our scheme showed improvements of 12% or more in each test. Based on these results, we conclude that our scheme is an ideal candidate for preserving privacy in mixed-data outsourcing. Next, we present the privacy analysis across varying k values for three commonly encountered attacks from the threat model (attacks 2, 4, and 5). For each attack, we exposed some data and compared the privacy disclosure from anonymized data created with seven different k values, using both small and large k values to better assess privacy strength. It is worth noting that imbalance in the data can also affect the privacy analysis; therefore, we exposed data from both the balanced and imbalanced parts to analyze the privacy strength of our proposed scheme.
The rationale for choosing these threats is to provide a privacy analysis in more realistic settings. For instance, attack 2 is commonly encountered because demographics are widely available in external repositories, whereas the SA is hidden (it mainly resides with the data owners). When data owners release data, the release can encompass both QIs and the SA; therefore, adversaries may be curious to learn the SA of a target person. Similarly, in attack 4, the adversary might know half of the QIs and the SA value but cannot infer the remaining QIs or the true identity because the data undergo modifications during anonymization. Lastly, in attack 5, there are myriad external data sources, and therefore linking via correlation is the most obvious attack that can be performed on the outsourced data. The concise details of the attacks are as follows: (attack 2) all QIs are known to the adversary, who might attempt to link a user in the data to find an SA; (attack 4) the adversary might know half of the QIs and an SA and intends to determine the remaining QIs to breach privacy; and (attack 5) the adversary might have access to detailed data on target users from external sources and can then link the data to infer their identities or SAs. All three chosen attacks are practical and lead to a more realistic assessment of the risks associated with data publishing.
Figure 6 presents the privacy results and comparisons with SOTA methods as k varies.
Referring to Figure 6, the disclosure risk decreases with k as the data become more general and diversity increases in the SA column when k is high. However, imbalance in the SA column can still contribute to a higher disclosure risk, and methods that do not consider SA diversity during anonymization can expose the SA more easily. For lower k, our method shows slightly higher disclosure, as many classes cannot meet the required diversity due to imbalance in the SA column, particularly in the A and D datasets. However, in real-world settings, a higher k is typically used; therefore, the disclosure risk resulting from our scheme is acceptable. These results verify the superiority of our scheme in terms of privacy protection across varying values of k in most cases.
Figure 7 presents the privacy protection results across seven different values of k. In this scenario, the disclosure risk is not very high, as the adversary needs to identify multiple QIs to compromise privacy. Referring to Figure 7, it can be seen that the disclosure risk for all methods decreases with k as uncertainty increases. However, the proposed scheme yielded better privacy protection in this attack scenario. Even though some parts of the data were unmodified, linkage was complex owing to the three generalization types (i.e., skipped, minimal, and maximal). Lastly, the disclosure risk from the C dataset was low, owing to a greater number of QIs and a balanced SA distribution. These results verify the robustness of the proposed scheme against contemporary privacy attacks.
In the last set of experiments, we evaluated the effectiveness of our scheme in protecting privacy across seven k values against linking attacks (e.g., attack 5). The empirical results from the three datasets are shown in Figure 8.
Referring to Figure 8, it can be seen that the SA disclosure risk for all methods decreases with k due to higher uncertainty in the QI part and greater diversity in the SA part. However, the privacy methods that ignore SA diversity have a relatively higher disclosure risk. In this setting, the SA disclosure risk from the C dataset is relatively low, owing to a higher number of SA values and a balanced SA distribution. In realistic scenarios, when a lot of information about the real data is readily available to an adversary, the residual risks (e.g., correlation/linkage under strong auxiliary knowledge) are high. However, the general patterns, SA diversity, numerical noise, and generalization of the categorical part in our scheme can effectively mitigate residual risks arising from auxiliary knowledge. These results verify the robustness of the proposed scheme in terms of privacy protection against contemporary privacy attacks.
The experimental results given in Figure 6, Figure 7, and Figure 8 verify the efficacy of the proposed scheme in terms of privacy protection when either the QIs, the SA, or both are exposed to an adversary. These improvements were achieved by employing different mechanisms (noise, generalization, diversity, and pattern analysis) during anonymization. Through the experiments, we found that when the data were imbalanced, a higher k provided better privacy; in contrast, when the data were balanced, a small k could also provide reasonable privacy guarantees in most cases. In addition, a higher cardinality of SA values could also increase the uncertainty for the adversary and reduce the number of privacy disclosures. However, when the number of unique values in the SA column was too small and their distribution was skewed, privacy protection became hard. Hence, it is vital to analyze the data characteristics of both the QIs and the SAs when selecting feasible values for the privacy parameters. Our scheme provides better privacy protection across most attack scenarios. Specifically, it yields higher indistinguishability with respect to numerical QIs, adequate generalization with respect to categorical QIs, and quantifiable privacy guarantees. However, poor parameter choices (e.g., a higher ε and a lower k) may degrade its performance in terms of measurable privacy.
5.5. Utility Comparisons
In the second set of experiments, we compared the utility results of our scheme with those of existing methods using two evaluation metrics. Specifically, we measured and compared the IL and QA results with those of existing SOTA methods. Both metrics have been widely used to calculate the utility of the resulting anonymized data.
Figure 9 presents the IL results obtained for different values of k.
In Figure 9, the IL value increased along with k due to the increase in anonymization. However, our scheme maintained the IL at a much lower level than the SOTA methods in most cases. The main reason for these better results is that our scheme performs fewer generalizations: whereas all existing methods anonymize the entire T, our method anonymizes only the privacy-sensitive portions of T. It is worth noting that the IL can increase rapidly as data are generalized to higher levels of the generalization hierarchy. With five levels in the hierarchy, the IL incurred for one QI grows with the generalization level, from level 1 to level 5, scaled by the importance score of the respective QI. The lowest level incurs the least IL per QI; therefore, the overall IL is lower in our scheme than in the other methods. To enhance the reliability of the reported IL gains, we report variance analyses across five runs for two k values on the three datasets. In the Adults dataset, the variances at the two k values were 1.37 and 8.84, respectively; in the Careplans dataset, 1.16 and 3.25; and in the Diabetes dataset, 0.57 and 0.70.
In the next set of experiments, we evaluated and compared our scheme in terms of the QA results. The primary purpose of assessing our scheme with respect to QA was to demonstrate its effectiveness in data mining. We executed various aggregate queries on the anonymized data and analyzed the results using both T and its anonymized counterpart. One exemplary query executed on the Careplans dataset was as follows: SELECT COUNT(*) FROM Careplansdataset WHERE race = 'asian' GROUP BY healthcare_expenses. Specifically, we computed the query error for each query and counted the clusters that gave reliable results, as sketched below. The QA results and their comparisons for seven different values of k are shown in Figure 10.
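As an illustration, the sketch below computes the relative error of the exemplary COUNT query on the real and anonymized Careplans data; the DataFrame variables, column values, and error tolerance are assumptions for this example.

import pandas as pd

def count_query(df: pd.DataFrame) -> pd.Series:
    # Python equivalent of:
    # SELECT COUNT(*) FROM Careplansdataset WHERE race = 'asian' GROUP BY healthcare_expenses
    return df[df['race'] == 'asian'].groupby('healthcare_expenses').size()

def relative_query_error(real_df: pd.DataFrame, anon_df: pd.DataFrame) -> float:
    # Average relative error of the grouped counts (0 means identical answers).
    real_counts = count_query(real_df)
    anon_counts = count_query(anon_df).reindex(real_counts.index, fill_value=0)
    return float(((real_counts - anon_counts).abs() / real_counts).mean())

# A cluster can then be counted as reliable if its query error stays below a chosen
# tolerance, and QA is the fraction of such clusters.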
From the results, note that the QA results show no apparent increasing or decreasing trend, because different queries involving numerical QIs only, categorical QIs only, and mixed-type QIs were submitted for each k value. However, the proposed scheme yielded highly accurate results compared to the SOTA methods. The main reasons for the better results are the restricted anonymization and the absence of suppression (e.g., replacing QI values with ∗) in the anonymized data. These results verify the efficacy of our scheme for data mining and analytics.
The average QA results for each dataset are given in Table 6. The abbreviations A, C, and D refer to the Adults, Careplans, and Diabetes datasets, respectively. From the results, we can see that, across all datasets, our scheme achieved higher query accuracy than its counterparts.
The experimental results in Figure 9, Figure 10, and Table 6 validate the claims concerning utility enhancement in the anonymized data. These improvements came from preserving general patterns and minimally distorting the data. We believe our scheme can greatly contribute to preventing data-specific biases, as well as wrong conclusions, when anonymized data are employed in real services. To provide a more holistic view of utility preservation, we also compared the classification accuracy (CA) on downstream tasks (e.g., using the anonymized data for prediction). Specifically, we trained a random forest (RF) classifier on T and on its anonymized counterpart and computed the accuracy using Equation (15), given below, for seven different k values and two ε values.
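Assuming the standard accuracy definition, Equation (15) takes the form

\[
CA \;=\; \frac{TP + TN}{|T|},
\]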
where TP and TN refer to true positives and true negatives, respectively, and |T| is the size of the dataset. As the sizes of the real and anonymized datasets are identical, the denominator does not need to be changed when computing CA for the anonymized data.
Figure 11 presents the comparative analysis of CA between the real data and the anonymized data. From the results, it can be seen that CA decreases as k increases owing to more generalization. However, our scheme performs minimal generalization and adds controlled noise; therefore, the CA of the anonymized data is very close to that of the real data, underscoring the efficacy of the proposed scheme in data mining applications.
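The sketch below illustrates this comparison with scikit-learn; the label column, one-hot encoding, and train/test split are simplifying assumptions rather than the exact experimental pipeline.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def rf_accuracy(df: pd.DataFrame, label_col: str, seed: int = 0) -> float:
    # Train an RF classifier on one version of the data and return its accuracy (CA).
    X = pd.get_dummies(df.drop(columns=[label_col]))  # encode categorical QIs
    y = df[label_col]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# CA gap between the real table T and its anonymized counterpart (smaller is better):
# ca_real = rf_accuracy(real_df, label_col='income')
# ca_anon = rf_accuracy(anon_df, label_col='income')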
5.7. The Impact of the ε Parameter on Privacy and Utility
In these experiments, we comprehensively assessed the impact of the ε parameter on privacy and utility. The ε parameter significantly affects the amount of noise added to the data: when ε is small, the amount of noise is very high, leading to robust privacy guarantees at the expense of utility, whereas a higher ε yields more utility but lower privacy [44]. To resolve this trade-off effectively, it is paramount to exploit the data characteristics when determining a feasible value for ε. In this work, the Laplace mechanism of DP, which is well suited to numerical QIs, was adopted to add noise to the numerical portion of the data. In the first set of experiments, we demonstrate the impact of the ε parameter on the data distribution for one numerical QI chosen from the Adults data; the results are shown in Figure 14. Referring to Figure 14, it can be seen that the distribution becomes balanced when ε is high, owing to the reduction in noise, whereas the distributions are highly inaccurate at lower values of ε. Because this work applies the Laplace mechanism only to the numerical parts of the data, the resulting distribution loss was not very high, leading to better utility. Privacy was protected through the noisy, generalized data and SA diversity.
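A minimal sketch of Laplace noising for a numerical QI is given below; the unit sensitivity and per-column treatment are assumptions for illustration, not the exact noise calibration of our pipeline.

import numpy as np
import pandas as pd

def laplace_noise_column(values: pd.Series, epsilon: float, sensitivity: float = 1.0) -> pd.Series:
    # Add Laplace(0, sensitivity/epsilon) noise: a smaller epsilon gives a larger
    # noise scale, i.e., stronger privacy and lower utility.
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=len(values))
    return values + noise

# Example: compare the distortion for a strict and a loose privacy budget.
# noisy_strict = laplace_noise_column(df['age'], epsilon=0.1)  # heavy noise
# noisy_loose  = laplace_noise_column(df['age'], epsilon=1.0)  # light noise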
To verify the efficacy of the proposed scheme, we analyzed the impact of two ε values on privacy and utility. The main reason for using a relatively high ε is that categorical QIs dominate most datasets. In the threat model, we considered an adversarial setting in which the adversary knew some QIs and aimed to identify the correct SA of the target person. In that case, we assumed that those QIs were numerical and that the adversary knew them correctly. Subsequently, the adversary linked their data to the anonymized data produced by our scheme, thereby attempting to compromise privacy. As stated earlier, the adversary also needed to correctly figure out the categorical QIs to compromise privacy; therefore, for each ε value we considered two generalization settings, (skipped, minimal) and (minimal, maximal), and evaluated the probability of disclosure. The skipped generalization case occurred when generalization was not applied due to a general pattern and higher diversity in the SA column. In the minimal generalization case, the real values were replaced by their direct ancestors in the generalization hierarchy. In the maximal case, generalization was performed at higher levels of the generalization hierarchy to protect privacy effectively.
When ε was small, the minimal and maximal generalizations were applied; in contrast, the skipped and minimal generalizations were applied when ε was large, owing to the high diversity. To evaluate the privacy strength for the two ε values, we exposed certain records and attempted to identify the correct matches in the anonymized data. An example of an exposed record is as follows: <Name = Huang, age = 34, Gender = Female, Race = White, Country = USA, income = ? (unknown)>. In this attack scenario, we assumed that the SA was not publicly available but the QIs were available to the adversary. We partitioned the datasets into five equal parts and selected a subset of records from each partition. Subsequently, we matched those records to their respective clusters and computed the disclosure risk as a probability. The experimental results obtained from the three datasets using the two distinct ε values are given in Table 8.
Referring to Table 8, it can be observed that the Careplans dataset has a lower disclosure risk due to its larger number of unique SA values and mostly balanced distributions. In contrast, the higher disclosure risk in the other two datasets is due to greater imbalance and fewer unique SA values. In terms of ε, the lower value yields a lower disclosure risk due to more noise, and vice versa. From the results, it can be seen that the disclosure risk is not very high for either the imbalanced or the balanced datasets, indicating the efficacy of our scheme in realistic scenarios. In practice, the disclosure risk would decrease significantly when the categorical part of the data is also considered in the experiments.
In the next set of experiments, we quantified the utility loss for each dataset for the two ε values. Specifically, we determined the level of the generalization hierarchy onto which each noisy value mapped and then calculated the IL using the formula given in Equation (13). It is worth noting that every value of the numerical QIs was noisy, and therefore there was some loss of information. However, applying the two ε values and considering other data characteristics reduced the noise; therefore, the IL was not very high. Table 9 presents the experimental results obtained from the three datasets. Across all datasets, there were few numerical QIs, so the overall IL remained modest.
Referring to Table 9, a higher ε yielded higher utility (i.e., a lower IL) across the datasets, whereas lower ε values incurred a greater utility loss. However, the use of distinct ε values and the large proportion of general patterns reduced the noise, so the utility loss was not very high for any dataset. Most existing methods use a single value of ε and do not classify data based on privacy risk, which can lead to greater utility loss. These results underscore the efficacy of our method for data mining and knowledge extraction from anonymized data.
In this work, we used two values of ε, one per data partition, to balance privacy and utility. The ε value is most sensitive between zero and one, as utility dominates privacy when ε exceeds one in most cases [46]. We therefore considered two intervals, 0–1 and 0–0.5, and took the average of each interval to obtain the ε for the corresponding partition. Since the records in general patterns are not very risky, a relatively higher ε was employed for them, and vice versa. Our idea of employing distinct ε values aligns with recent trends toward non-uniform noise in DP [47]. The chosen values satisfy the properties of the privacy–utility curve and enable controlled noise to be added to the data. The experimental results given in Figure 14, Table 8, and Table 9 prove that a higher ε can produce data of higher utility at the expense of privacy, whereas a lower ε can provide robust privacy guarantees at the expense of utility. The proposed scheme exploits other characteristics of the attributes to balance this trade-off between utility and privacy effectively.
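Under our reading of this schedule (taking the midpoint of each interval), the per-partition budgets can be chosen as in the sketch below; the concrete numbers and the partition flag are illustrative assumptions.

# Two epsilon values derived from the intervals 0-1 and 0-0.5 (their averages).
EPS_GENERAL = (0.0 + 1.0) / 2   # 0.5: general-pattern (low-risk) records -> less noise
EPS_RISKY = (0.0 + 0.5) / 2     # 0.25: privacy-sensitive records -> more noise

def epsilon_for_partition(is_general_pattern: bool) -> float:
    # Non-uniform DP budget: riskier records get a smaller epsilon (more noise).
    return EPS_GENERAL if is_general_pattern else EPS_RISKY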
Ablation study: The different components of our scheme (i.e., CL-based data sorting, pattern-aware generalization, and the DP noise schedule) contributed differently to the enhanced utility, privacy, and computing time. To show the differential impact of each component, we conducted an ablation study; the corresponding results are given in Table 10. From the results in Table 10, it can be seen that CL-based data sorting contributed most to the time reduction during clustering and generalization. In contrast, the customized anonymization operations (generalization and noising) contributed more to the utility and privacy results. It is worth noting that another component (i.e., the diversity consideration in the clustering process) also resulted in an additional 3.88% gain in privacy. These results verify the complementary role of each component in balancing the privacy–utility–efficiency triad.
Table 10, it can be seen that CL-based data sorting contributed more to time reduction during clustering and generalization. In contrast, customized anonymity operations (generalization and noising) contributed more to utility and privacy results. It is worth noting that another component (e.g., diversity consideration in the clustering process) also resulted in an additional 3.88% gain in privacy. These results verify the complimentary role of each component in balancing the privacy–utility–efficiency triad.