1. Introduction
With the rapid evolution of machine learning technology and the growth of data driven by developments in information technology, it has become increasingly important for companies to determine how to utilize big data effectively and efficiently. However, big data often include personal and private information; thus, careless utilization of such sensitive information may lead to unexpected sanctions.
To overcome this problem, many privacy-preserving technologies have been proposed that allow data to be utilized while maintaining privacy. Typical privacy-preserving technologies include data anonymization (e.g., [1,2,3]) and secure computation (e.g., [4]). This paper focuses on the relationship between data anonymization and decision trees, a typical machine learning method. Historically, data anonymization research has progressed from pseudonymization to k-anonymity [1], ℓ-diversity [2,5], and t-closeness [3] and continues to develop. Currently, many researchers are focused on membership privacy and differential privacy [6].
In [7,8], the authors pointed out that the decision tree is not robust to homogeneity attacks and background knowledge attacks; they then demonstrated the application of k-anonymity and ℓ-diversity in order to amplify security. However, their proposals could not satisfy the requirements of differential privacy. In this paper, we discuss how the leakage of private information from a learned decision tree can be prevented, in the sense of differential privacy, using data anonymization techniques such as k-anonymity and ℓ-diversity.
To prevent the leakage of private information, we propose the application of k-anonymity and sampling to a random decision tree, a variant of the decision tree proposed by Fan et al. [9]. Interestingly, we show in this paper that this modification results in differential privacy. The essential idea is that, instead of adding Laplace noise as in [10,11] (please see [12] for a survey of differentially private (random) decision trees), we enhance the security of a random decision tree by sampling and then removing every leaf that contains fewer records than some threshold k, so that each remaining leaf is supported by at least k records. The basic concept is outlined in [13]. Our proposed model, in which k-anonymity is achieved after sampling, provides differential privacy, as in [13].
As mentioned above, researchers have shifted their attention from k-anonymization and ℓ-diversity to differential privacy. In fact, building upon the work outlined in [14], decision trees that satisfy differential privacy use techniques that are typical of differential privacy, such as the exponential, Gaussian, and Laplace mechanisms [10,11,15]. That is, all of these algorithms achieve differential privacy by adding some kind of noise. Our approach is very different. That said, the basic technique involves applying k-anonymity to each leaf in the random decision tree; this is similar to pruning, a widely accepted technique used to avoid overfitting.
The remainder of this paper is organized as follows. Section 2 introduces relevant preliminary information, e.g., anonymization methods and decision trees, and demonstrates how strategies for attacking data anonymization can be converted into attacks targeting decision trees. In Section 3, we demonstrate how much security and accuracy can be achieved in practice when the random decision tree is strengthened using a method that is similar to k-anonymity. In Section 4, the potential advantages of our proposal are discussed. Finally, the paper is concluded in Section 5, which includes a brief discussion of potential future research topics.
3. Proposal: Applying k-Anonymity to a (Random) Decision Tree
3.1. Construction of the Proposed Random Decision Tree
In this section, we demonstrate how one might achieve differential privacy from k-anonymity. More specifically, we present a proposal based on a random decision tree, a variant of the decision tree outlined in Section 2.2.2. The proposal is shown as Algorithm 3. It differs from the original random decision tree in the following two ways:
(Pruning): for some threshold k, if there exist a tree T, a leaf ℓ, and a label y such that the number of sampled training records with label y that reach ℓ is positive but smaller than k, then this count is set to 0 (illustrated below).
(Sampling): each tree is trained on D_i, the dataset obtained from the training dataset D by keeping each record independently with probability β.
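As a concrete illustration with hypothetical numbers: suppose k = 5 and that, after sampling, some leaf of a tree holds three records with label "yes" and eight records with label "no". Pruning resets the count 3 to 0 and keeps the count 8, so every count that survives is supported by at least k sampled records.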
3.2. Security: Strongly Safe k-Anonymization Ensures Differential Privacy
In the field of data anonymization, the authors of [13] demonstrated that performing k-anonymization after sampling achieves differential privacy; our proposal is a development of this core principle. Below, we outline the items necessary to evaluate data security.
Definition 3 (Strongly safe k-anonymization algorithm [13]).
Suppose that a function g maps records to generalized values, g : U → V, where U and V are the domain and range of g, respectively. Suppose that g does not depend on the input dataset D, i.e., g is constant with respect to D. The strongly safe k-anonymization algorithm A with input D = (x_1, ..., x_n) is defined as follows:
Compute g(x_i) for every record x_i in D.
Count how many times each value of V appears among g(x_1), ..., g(x_n).
For each element of the resulting multiset, if its value appears fewer than k times, then the element is suppressed (removed), and the result is set to the output A(D).
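To make Definition 3 concrete, the following Python sketch (our own illustration with placeholder names, not an implementation from [13]) applies a fixed, data-independent mapping g to every record and suppresses each generalized value that occurs fewer than k times.

```python
from collections import Counter

def strongly_safe_k_anonymize(records, g, k):
    """Apply a data-independent generalization g to each record and
    suppress every generalized value that occurs fewer than k times."""
    generalized = [g(r) for r in records]        # compute g(x) for each record
    counts = Counter(generalized)                # count each generalized value
    return [v for v in generalized if counts[v] >= k]

# Hypothetical generalization: keep only the decade of a person's age.
g = lambda person: person["age"] // 10
sample = [{"age": a} for a in (23, 25, 27, 41, 44, 67)]
print(strongly_safe_k_anonymize(sample, g, k=3))  # -> [2, 2, 2]; the 40s and 60s are suppressed
```

Because g is fixed in advance, the only data-dependent step is the suppression of rare values; this is the property exploited by Theorem 1 below.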
| Algorithm 3 Proposed training process |
Input: Training data D, the set of features F, number of random decision trees to be generated N
Output: Random decision trees T_1, ..., T_N
1: for i = 1, ..., N do
2:     build the structure of T_i by choosing its test features at random from F
3: end for
4: for i = 1, ..., N do
5:     D_i ← ∅
6:     regarding each x ∈ D, with probability β
7:         D_i ← D_i ∪ {x}
8: end for
9: return T_1, ..., T_N after updating each T_i with D_i by the following steps
10:
11: Updating T_i with D_i:
12: Set the count n(ℓ, y) ← 0 for all leaves ℓ and labels y.
13: for x ∈ D_i do
14:     Find the leaf ℓ corresponding to x, and set n(ℓ, label(x)) ← n(ℓ, label(x)) + 1.
15: end for
16: for all pairs of leaf ℓ and label y do
17:     if 0 < n(ℓ, y) < k then
18:         n(ℓ, y) ← 0   /* Removing leaves with fewer data */
19:     end if
20: end for
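As a complement to the pseudocode, the following Python sketch follows our reading of Algorithm 3 under simplifying assumptions (categorical features only, leaf counts stored in dictionaries); the identifiers are ours rather than the paper's.

```python
import random
from collections import defaultdict

def build_structure(features, depth):
    """A random tree structure: the tested features are chosen at random,
    independently of the training data."""
    return random.sample(features, min(depth, len(features)))

def leaf_of(tree, record):
    """The leaf reached by a record is identified by its values on the tested
    features, so it depends only on the tree structure, never on the dataset."""
    return tuple(record[f] for f in tree)

def train(data, features, n_trees, depth, beta, k):
    """data is a list of (record, label) pairs; returns the tree structures
    together with their pruned leaf counters."""
    trees, counters = [], []
    for _ in range(n_trees):
        tree = build_structure(features, depth)
        sampled = [row for row in data if random.random() < beta]   # sampling step
        counts = defaultdict(int)
        for record, label in sampled:
            counts[(leaf_of(tree, record), label)] += 1             # counter update
        for key, value in list(counts.items()):
            if 0 < value < k:                                       # pruning step:
                counts[key] = 0                                     # drop counts below the threshold k
        trees.append(tree)
        counters.append(counts)
    return trees, counters
```

With beta = 1 and k = 1, the sketch reduces to ordinary random decision tree training; the proposal is obtained by choosing beta < 1 and k > 1.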
Assume that $f(j; n, \beta)$ denotes the binomial probability mass function: the probability of succeeding j times after n attempts, where the probability of success in one attempt is $\beta$, i.e., $f(j; n, \beta) = \binom{n}{j}\beta^{j}(1-\beta)^{n-j}$. Furthermore, the cumulative distribution function is expressed as follows in Equation (2):

$F(j; n, \beta) = \sum_{i=0}^{j} f(i; n, \beta).$  (2)
Theorem 1 (Theorem 5 in [13]).
Any strongly safe k-anonymization algorithm satisfies $(\beta, \epsilon, \delta)$-DPS for any $0 < \beta < 1$, $\epsilon \ge -\ln(1-\beta)$, and

$\delta = d(k, \beta, \epsilon) = \max_{n \ge \lceil k/\gamma \rceil - 1} \sum_{j > \gamma n} f(j; n, \beta),$  (3)

where $\gamma = (e^{\epsilon} - 1 + \beta)/e^{\epsilon}$.
Equation (3) shows the relationship between $\beta$ and $\epsilon$ in determining the value of $\delta$ when k is fixed.
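For intuition about how such a $\delta$ can be evaluated numerically, the following Python sketch computes the worst-case binomial tail under our reading of Equation (3); the parameter values at the end are hypothetical, and the snippet is an illustration rather than the authors' code.

```python
import math

def binom_pmf(j, n, beta):
    """f(j; n, beta): probability of exactly j successes in n trials,
    evaluated in log space to avoid overflow for large n."""
    log_p = (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
             + j * math.log(beta) + (n - j) * math.log(1 - beta))
    return math.exp(log_p)

def delta_bound(k, beta, eps, n_max=500):
    """Worst-case tail probability over the admissible dataset sizes n,
    truncated at n_max for this illustration."""
    gamma = (math.exp(eps) - 1 + beta) / math.exp(eps)
    n_min = max(math.ceil(k / gamma) - 1, 1)
    delta = 0.0
    for n in range(n_min, n_max + 1):
        tail = sum(binom_pmf(j, n, beta) for j in range(math.floor(gamma * n) + 1, n + 1))
        delta = max(delta, tail)
    return delta

# Hypothetical parameters, purely for illustration.
print(delta_bound(k=50, beta=0.1, eps=1.0))
```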
Let us consider the case where a record x is applied to a random decision tree T, and the leaf reached is denoted by $\ell_T(x)$. If a function g is defined as

$g(x) = \ell_T(x),$  (4)

then g in Equation (4) is apparently constant; that is, because the structure of T is generated before any training record is inspected, g does not depend on D. Therefore, the pruned tree, which is generated using g, can be regarded as an example of strongly safe k-anonymization; consequently, Theorem 1 can be applied.
However, Theorem 1 above can be applied in its original form only when there is one tree, i.e., when the number of trees N = 1. Theorem 2 can be applied when N ≥ 2.
Theorem 2 (Theorem 3.16 in [17]).
Assume $M_i$ is an $(\epsilon_i, \delta_i)$-DP algorithm for $i = 1, \ldots, N$. Then, the algorithm $M(D) = (M_1(D), \ldots, M_N(D))$ satisfies $(\sum_{i=1}^{N}\epsilon_i, \sum_{i=1}^{N}\delta_i)$-DP.
In Algorithm 3, the structure of each tree is selected randomly, and sampling is performed independently for each tree.
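As a hypothetical numerical illustration of Theorem 2: if each of N = 10 trees were released by a mechanism satisfying (0.1, 10^-6)-DP, then releasing all ten trees together would satisfy (1.0, 10^-5)-DP.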
Hence, the following conclusion can be reached.
Corollary 1. The proposed algorithm satisfies $(\beta, N\epsilon, N\delta)$-DPS for any $0 < \beta < 1$ and $\epsilon \ge -\ln(1-\beta)$, where $\delta = d(k, \beta, \epsilon)$ is the bound given by Equation (3) (Equation (6)).
Table 5 shows the relationship, derived from Equation (6), between $\beta$ and $\epsilon$ in determining the value of $N\delta$ when k and N are fixed. The cells in the table represent the approximate value of $N\delta$. For k and N, we chose the values shown in Table 5.
3.3. Experiments on the k-Anonymized Random Decision Tree
The efficiency of the proposal was verified using the Nursery dataset [18], the Adult dataset [19], the Mushroom dataset [20], and the Loan dataset [21]. The characteristics of each dataset are as follows.
The Nursery dataset contains 12,960 records with eight features, with a maximum of five values for each feature;
The Adult dataset contains 48,842 records with 14 features. Here, each feature has more possible values and there are more records than in the Nursery dataset;
The Mushroom dataset contains 8124 records with 22 features. Compared to the above two datasets, there are more features, but the number of records is small. In general, applying k-anonymity to this kind of dataset is challenging;
The Loan dataset [21] contains 9578 records with 13 features. We used the “not.fully.paid” feature to label classes. There are four binary attributes and seven numerical attributes in this dataset.
Appendix A contains the evaluation of the basic decision tree on each dataset. Firstly, the parameters were fixed to baseline values.
The results from the Nursery dataset obtained via the proposed method with these parameters are as follows:
The accuracy of the original decision tree was 0.933, as shown in Table A1.
As shown in Table 6 (a), for a tree depth equal to four, the accuracy obtained was 0.84, which was inferior to that of the original decision tree.
As shown in Table 6 (a), for a tree depth equal to five, the accuracy decreased drastically as k increased.
The results from the Adult dataset obtained via the proposed method with the same parameters were as follows (there were numerical values in this dataset; to handle them, a threshold t was chosen randomly from the domain of the feature, and two children for ≤ t and > t were produced by the tree; a sketch of this splitting rule is given after the list):
The accuracy of the original decision tree was 0.855, as shown in Table A1.
As shown in Table 6 (b), the accuracy achieved was 0.817.
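The random-threshold handling of numerical attributes mentioned above can be sketched as follows in Python; this is our illustration of the idea, not the exact routine used in the experiments.

```python
import random

def random_numeric_split(values):
    """Pick a threshold t uniformly at random from the observed range of a
    numerical feature and return the two tests used as child branches."""
    t = random.uniform(min(values), max(values))
    return (lambda v: v <= t), (lambda v: v > t), t

# Hypothetical usage on an 'age' column.
ages = [17, 23, 35, 44, 52, 61]
le_branch, gt_branch, t = random_numeric_split(ages)
print(t, le_branch(30), gt_branch(30))
```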
The results from the Mushroom dataset obtained via the proposed method with the same parameters were as follows:
The accuracy of the original decision tree was 0.995, as shown in Table A1.
As shown in Table 6 (c), the accuracy achieved was 0.98.
The results from the Loan dataset obtained via the proposed method with the same parameters showed the same tendency.
In summary, the accuracy achieved by the proposed method was slightly inferior to that of the original decision tree.
Changing sampling rate β: To achieve secure differential privacy, the sampling rate should remain small. Maintaining values of ε and δ that were small enough for our practical application, Table 7 shows how the accuracy changed according to the sampling rate β. As shown, for some parameters, the accuracy of the proposed method was relatively good even when β and ε were small.
4. Discussion
In another highly relevant study [10], Jagannathan et al. proposed a variant of a random decision tree that achieved differential privacy. The accuracy of their proposal is shown in Figures 1 and 2 of [10] for the same datasets with the same class labels (in [10], instead of five class labels, three were used for the Nursery dataset, i.e., some of the similar labels were merged); their method resulted in similar precision. Because our proposal employs sampling, it is limited by the size of the dataset being utilized; the smaller the dataset (e.g., the Mushroom dataset), the greater the loss in accuracy. However, it must be noted that their approach was very different from ours: Laplace noise was added instead of pruning and sampling. Notably, within their proposal, noisy counts are retained for all trees T, all leaves ℓ, and all labels y. Even in this context, if the count is small for a certain T, ℓ, and y, it may be regarded almost as a personal record. A good general approach to handling such cases is to remove the rare records, i.e., to “remove the leaves containing fewer records”. This is a broadly accepted data anonymization technique [22] that is commonly used to avoid legal difficulties. Our proposal shows that pruning and sampling can be combined to ensure differential privacy. If rare sensitive records need to be removed, our method may therefore represent an excellent option.