A Multi-Level Privacy-Preserving Approach to Hierarchical Data Based on Fuzzy Set Theory

Abstract: Nowadays, more and more applications depend on the storage and management of semi-structured information. For scientific research and knowledge-based decision-making, such data often needs to be published; e.g., medical data is released to implement computer-assisted clinical decision support systems. Since this data contains individuals' private information, it must be appropriately anonymized before being released. However, the existing anonymization method based on l-diversity for hierarchical data is vulnerable to similarity attacks and cannot protect data privacy well. In this paper, we utilize fuzzy sets to divide sensitive numerical and categorical attribute values uniformly into levels (a categorical attribute value can be converted into a numerical attribute value according to its frequency of occurrence), and then transform the value levels into sensitivity levels. We propose the privacy model (α^h_lev, k)-anonymity for hierarchical data with multi-level sensitivity, and design a privacy-preserving approach to achieve this model. Experimental results demonstrate that our approach is superior to the existing anonymization approach for hierarchical data in terms of utility and security.


Introduction
Hospitals and other organizations often need to publish data, e.g., medical data or census data, for the purposes of scientific research and knowledge-based decision-making [1][2][3][4][5][6][7][8][9][10]. To avoid the leakage of individual privacy, explicit identifying information is removed when data is released. However, individual privacy can still be leaked by linking the released data with other public data [11]. Privacy-preserving data publishing provides methods and tools for publishing useful information while preserving individual privacy [12]. In recent years, the problem of privacy-preserving data publishing has been studied extensively. The existing privacy protection methods mainly focus on relational data, and many mature privacy models have been proposed, such as k-anonymity [11], l-diversity [13], (α, k)-anonymity [14] and t-closeness [15]. However, real-world data often has a more complicated structure. With the advent of document-oriented databases (e.g., MongoDB) and the wide use of markup languages (e.g., XML), hierarchical data has become ubiquitous [16]. To avoid the leakage of individual privacy, hierarchical data must be properly anonymized before it is released. At present, there is little research on privacy protection for hierarchical data. Ozalp et al. [16] proposed l-diversity anonymization methods for hierarchical data.
An example of hierarchical data is given in Figure 1. The schema for education data is obtained from Sabanci University [16], and the examples appearing in this paper are related to this schema. Figure 1a represents a student's record, which fits the education schema shown in Figure 1b. The student was born in 1990 and majors in Computer Science. He took two courses, CS201 and CS305. For CS201, he submitted evaluations for two instructors. For CS305, he submitted one evaluation and indicated that he bought a database book. The labels of the vertices are all quasi-identifiers (QIs) of the student, and the corresponding sensitive information is marked beside every vertex. A quasi-identifier is a set of attributes that can potentially identify an individual [11]. Assume that an attacker knows some QIs of a victim, and that his goal is to infer the sensitive information of the victim.
In [16], suppression and generalization [11] are used to make the anonymous hierarchical dataset satisfy l-diversity, which ensures that the frequency of every sensitive value for the union-compatible vertices (belonging to the same vertex in the schema) in an equivalence class is not more than 1/l. The constraint also guarantees that every equivalence class contains at least l hierarchical data records. An equivalence class in an anonymous hierarchical dataset is a set of records with the same values for the QIs. However, the method does not consider the sensitivity of different sensitive attribute values, which leads to similarity attacks [15]. For example, suppose an equivalence class contains three hierarchical data records and its class representative, shown in Figure 2, satisfies 3-diversity. The sensitive values of their cumulative GPAs are 0.31, 0.15 and 0.09, respectively, where GPA is the grade point average. An attacker can locate a victim in the equivalence class by linking some QIs of the victim. Although the attacker cannot infer the victim's specific sensitive value, he knows with 100% probability that the victim's academic performance is low, so the victim's privacy is leaked. Similarly, the attacker can confirm that the grade of the victim in the course CS201 is very low according to the value {D, D+, D−}. Also, the attacker can infer that the victim is very dissatisfied with the DB Prof. from the value {0, 1/10, 2/10}. To avoid similarity attacks, we propose a multi-level privacy-preserving approach for hierarchical data based on fuzzy sets. The contributions of this paper are summarized as follows:

• We utilize fuzzy set theory to obtain the sensitivity levels of sensitive numerical and categorical attribute values, and present the privacy model (α^h_lev, k)-anonymity for hierarchical data with multi-level sensitivity. This model resists similarity attacks and provides appropriate privacy protection for sensitive values at different sensitivity levels.

• We improve the privacy-preserving approach for hierarchical data to obtain anonymous data that satisfies (α^h_lev, k)-anonymity.

• We conduct experiments to compare our approach with the existing anonymization method ClusTree proposed in [16]. Experimental results demonstrate that our approach is superior to ClusTree in terms of utility and security.

Related Work
In this section, we review the related work about privacy preserving data publishing for relational data and hierarchical data.

Preserving Privacy for Publishing Relational Data
The first privacy model, proposed by Samarati and Sweeney [11] in 1998, is k-anonymity for relational data, which requires that every record in a table is indistinguishable from at least k-1 other records with respect to the QI. There exist many anonymization methods to implement k-anonymity, such as bottom-up generalization, top-down specialization and anonymization by clustering [17][18][19]. k-anonymity can protect against identity disclosure, but cannot prevent attribute disclosure. Therefore, l-diversity has been proposed [13]. It requires that every equivalence class contains at least l different sensitive values. There are numerous methods for achieving l-diversity [20,21]. Furthermore, Wong et al. [14] extended k-anonymity to (α, k)-anonymity, which limits the confidence of the implications from the QI to a sensitive value to within α in order to protect the sensitive information from being inferred by strong implications, and proposed a bottom-up generalization algorithm to achieve (α, k)-anonymity. Li et al. [15] pointed out that l-diversity does not prevent skewness attacks and similarity attacks, so they introduced the t-closeness model, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table. They also revised the Incognito algorithm [17], which is a top-down generalization method proposed for k-anonymity, to achieve t-closeness. However, t-closeness still does not prevent similarity attacks. Han et al. [22] considered the differences in sensitivity among sensitive values, and proposed a multi-level l-diversity model for numerical sensitive attributes. Furthermore, Jin et al. [23] presented (α_i, k)-anonymity privacy preservation based on sensitivity grading. However, the levels are assigned manually. Some studies proposed fuzzy-based methods for privacy preservation [24,25]. They used fuzzy sets to transform sensitive values into semantic values and published the data with fuzzy sensitive information, which decreases the utility of the sensitive information and still does not resist similarity attacks.
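The relational models reviewed above can be made concrete with a few lines of Python. The following is a minimal sketch, not taken from any of the cited works: the table layout (a list of dictionaries), the attribute names and the helper names are all illustrative assumptions. It checks k-anonymity, the "at least l different sensitive values" form of l-diversity, and (α, k)-anonymity for one designated sensitive value.

```python
from collections import defaultdict


def equivalence_classes(rows, qi):
    """Group rows that share the same values on the quasi-identifier attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in qi)].append(row)
    return list(groups.values())


def is_k_anonymous(rows, qi, k):
    """Every record is indistinguishable from at least k-1 others on the QI."""
    return all(len(g) >= k for g in equivalence_classes(rows, qi))


def is_l_diverse(rows, qi, sa, l):
    """Every equivalence class contains at least l different sensitive values."""
    return all(len({r[sa] for r in g}) >= l for g in equivalence_classes(rows, qi))


def is_alpha_k_anonymous(rows, qi, sa, value, alpha, k):
    """k-anonymity plus: the frequency of `value` in each class is at most alpha."""
    for g in equivalence_classes(rows, qi):
        if len(g) < k:
            return False
        if sum(r[sa] == value for r in g) / len(g) > alpha:
            return False
    return True


# Hypothetical, already generalized table (illustrative data only).
table = [
    {"age": "2*", "zip": "130**", "disease": "flu"},
    {"age": "2*", "zip": "130**", "disease": "HIV"},
    {"age": "3*", "zip": "148**", "disease": "HIV"},
    {"age": "3*", "zip": "148**", "disease": "HIV"},
]
QI = ["age", "zip"]
print(is_k_anonymous(table, QI, 2))
print(is_l_diverse(table, QI, "disease", 2))
print(is_alpha_k_anonymous(table, QI, "disease", "HIV", 0.5, 2))
```

In this toy table both equivalence classes have two records, but the second class contains only one distinct disease, which is the kind of attribute disclosure that k-anonymity alone does not prevent.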

Preserving Privacy for Publishing Hierarchical Data
There are several studies on preserving privacy when publishing hierarchical or tree-structured data. Yang and Li [26] found that the dependencies between nodes in XML data may result in privacy leakage. They formally defined these dependencies as XML constraints, and designed an algorithm to sanitize XML documents by considering these constraints such that no privacy is leaked. However, their attack model is too weak. Our adversarial model assumes that the attacker has some information about the victim. Landberg et al. [27] proposed δ-dependency and extended the anatomy method for relational data to hierarchical data. However, the dissection method damages the original semantic structure of hierarchical data, and the generalization of sensitive attributes reduces the usefulness of the hierarchical data. Nergiz et al. [28] extended k-anonymity methods to multi-relational databases, and proposed multi-relational k-anonymity. First, the hierarchical data is converted into multiple relational tables that are related to each other by primary and foreign keys; k-anonymity is then enforced separately on each table. However, converting hierarchical data into relational data is not a simple matter and produces large amounts of redundant data, which makes the algorithm extremely inefficient. It also loses a lot of structural information. Gkountouna and Terrovitis [29] proposed k^(m, n)-anonymity for tree-structured data. By using generalization and structure decomposition methods, they ensured that the number of matching records is not less than k when the attacker knows up to m nodes in a tree and up to n structural relations between these nodes. However, the method cannot resist attacks with stronger background knowledge. In addition, the structural decomposition they used destroys the structural information of the hierarchical data. Ozalp et al. [16] extended l-diversity to hierarchical data. They utilized generalization and suppression to anonymize the hierarchical data, making the hierarchical records in an equivalence class indistinguishable in terms of QIs and structure, and making the sensitive values of the union-compatible vertices in an equivalence class satisfy the requirements of l-diversity. This method is a scalable, general anonymization method for hierarchical data. However, it does not consider the different sensitivity of sensitive attribute values, so the anonymous hierarchical data still does not resist similarity attacks. In this paper, we use fuzzy set theory to partition the sensitive values of union-compatible vertices into levels, and propose a multi-level privacy-preserving approach for hierarchical data to resist similarity attacks.

Problem Descriptions
In this section, we describe the attack model, give some fundamental definitions, and introduce our privacy protection model.

Attack Model
We assume that an attacker knows a victim's QI information, which contains any combination of QI values in the same or different vertices of the victim's record. Also, the attacker can obtain some structural links. For example, the victim took two courses, and purchased only one book, for course CS201. In addition, the attacker has some negative knowledge, e.g., the victim did not take CS305. Our anonymization approach ensures that an attacker who has this background knowledge about a victim cannot infer that any sensitive value of the victim lies in a given sensitivity level with a probability greater than a given threshold.

Basic Definitions in Hierarchical Data
In this subsection, we give some basic definitions for hierarchical data [16]. Let T be a graph with n vertices. We say that T is a rooted tree if and only if (1) T is a directed acyclic graph with n-1 edges; (2) for every vertex (except the root vertex), there is a single path from the root vertex to it in T; (3) there exists an edge v → c_i if c_i ∈ children(v), where children(v) is the set of children of vertex v. Such a tree is denoted by T(V, E), where V and E are the sets of vertices and edges in the tree, respectively.
A hierarchical data record satisfies the following conditions: (1) it follows a rooted tree structure; (2) each vertex v has two j-tuples (j ≥ 0), v_QIt and v_QI, which contain the names of the QI attributes and the values of the corresponding QIs, respectively; (3) each vertex v also has two m-tuples (0 ≤ m ≤ 1), v_SAt and v_SA, which contain the name of the sensitive attribute and the value of the corresponding sensitive attribute, respectively; (4) |v_QI| + |v_SA| ≥ 1, to eliminate empty vertices. For a vertex v of a hierarchical data record, v_QI is the label of v and v_SA is written next to v. For the root vertex in Figure 1a, v_QIt = {major program, year of birth}, v_SAt = {GPA}, v_QI = {Computer Science, 1990}, and v_SA = {3.75}.
Definition 1 (Union-Compatibility) [16]. Two vertices v and v′ are union-compatible if and only if v_QIt = v′_QIt and v_SAt = v′_SAt.
Definition 2 (QI-Isomorphism) [16]. Two hierarchical data records T_1(V_1, E_1) and T_2(V_2, E_2) are QI-isomorphic if and only if there exists a bijection f: V_1 → V_2 such that: (1) for x, y ∈ V_1, there exists an edge e_i ∈ E_2 from f(x) to f(y) if and only if there exists an edge e_j ∈ E_1 from x to y; (2) f(r_1) = r_2, where r_1 ∈ V_1 and r_2 ∈ V_2 are the roots of T_1(V_1, E_1) and T_2(V_2, E_2), respectively; (3) for all pairs (x, x′), where x ∈ V_1 and x′ = f(x), x and x′ are union-compatible and x_QI = x′_QI.
Definition 3 (Equivalence Class of Hierarchical Records) [16]. Let Q = {T_1, T_2, ..., T_k} be a collection of k hierarchical data records. We say Q is an equivalence class if, for ∀i, j ∈ {1, ..., k}, T_i and T_j are QI-isomorphic.
Definition 4 (Class Representative) [16]. Let Q = {T_1, T_2, ..., T_k} be an equivalence class in hierarchical data, and f_i (1 ≤ i ≤ k-1) be a bijection that maps T_1's vertices to T_{i+1}'s vertices as in QI-isomorphism. T is the class representative for Q if T is QI-isomorphic to T_1 with a bijection function f and, ∀v ∈ T, v_SA collects the sensitive values of the vertices corresponding to v in all records of Q, i.e., v_SA = f(v)_SA ∪ f_1(f(v))_SA ∪ ... ∪ f_{k-1}(f(v))_SA.
Let X = {x_1, x_2, ..., x_o} be a multiset of values from the domain of a sensitive attribute A. X satisfies l-diversity if ∀x_i ∈ X, p(x_i) ≤ 1/l, where p(x_i) is the frequency of x_i in X. For an equivalence class Q in hierarchical data, let T be the class representative for Q. If, for ∀v ∈ T, v_SA satisfies l-diversity, then T satisfies l-diversity. Given a hierarchical data set D, an anonymous hierarchical data set D* satisfies l-diversity if the class representative of every equivalence class in D* satisfies l-diversity. l-diverse hierarchical data does not prevent similarity attacks, since l-diversity does not consider the different sensitivity of sensitive attribute values.
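To make these definitions more tangible, the following Python sketch models a vertex, union-compatibility, QI-isomorphism and the l-diversity check on a class representative. It is an illustration under stated assumptions rather than the implementation of [16]: the `Vertex` class, its field names and the attribute names "course" and "grade" are hypothetical, and QI-isomorphism is tested by comparing canonical forms of the two trees, which is equivalent to the existence of the bijection f when children are unordered.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Vertex:
    qi: dict                       # QI attribute name -> value (v_QIt with v_QI)
    sa_name: Optional[str] = None  # name of the sensitive attribute (v_SAt), if any
    sa: list = field(default_factory=list)        # sensitive value(s) (v_SA)
    children: list = field(default_factory=list)  # child vertices


def union_compatible(v, w):
    """Definition 1: same QI attribute names and same sensitive attribute name."""
    return set(v.qi) == set(w.qi) and v.sa_name == w.sa_name


def qi_key(v):
    """Canonical form of the QI labels and structure below v (SA values ignored)."""
    labels = tuple(sorted((name, repr(value)) for name, value in v.qi.items()))
    return (labels, v.sa_name or "", tuple(sorted(qi_key(c) for c in v.children)))


def qi_isomorphic(t1, t2):
    """Definition 2: the root-preserving bijection exists iff canonical forms match."""
    return qi_key(t1) == qi_key(t2)


def values_l_diverse(values, l):
    """A multiset satisfies l-diversity if no value has frequency above 1/l."""
    return all(count * l <= len(values) for count in Counter(values).values())


def representative_l_diverse(rep, l):
    """Check l-diversity at every vertex of a class representative."""
    return values_l_diverse(rep.sa, l) and all(
        representative_l_diverse(c, l) for c in rep.children
    )


# Class representative loosely modeled on the Figure 2 example from the Introduction.
rep = Vertex(
    {"major program": "Computer Science", "year of birth": "199*"},
    "GPA",
    [0.31, 0.15, 0.09],
    [Vertex({"course": "CS201"}, "grade", ["D", "D+", "D-"])],
)
print(representative_l_diverse(rep, 3))
```

The representative passes the 3-diversity check even though all three GPAs are low and all three grades are in the D range, which is exactly the similarity attack that motivates the sensitivity levels introduced next.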

Privacy Model
For every sensitive attribute, including numerical and categorical attributes, we partition the sensitive values into five levels: low, very low, middle, very high and high (for some sensitive attributes, e.g., a student's letter grade in a course, the values are already divided into levels and need no further processing), and transform these value levels into the corresponding sensitivity levels.
Let U be a universe of discourse. A mapping µ_A : U → [0, 1] is called a membership function on U, where the set A, which consists of µ_A(u) (u ∈ U), is a fuzzy set on U, and µ_A(u) is the membership degree of u to A [30][31][32]. The trapezoidal distribution [33] is used to give the membership functions for the fuzzy sets low, very low, middle, very high and high, denoted by A_1, A_2, A_3, A_4 and A_5, respectively. Let U be the domain of a numerical attribute (for a categorical attribute, a numerical attribute can be obtained according to the frequency of every value), and let min and max be the minimum and maximum values in U, respectively. The five fuzzy sets take values on the ranges [min, a_2], [a_1, a_3], [a_2, a_4], [a_3, a_5] and [a_4, max], respectively, where a_3 = (min + max)/2, a_1 = min + (a_3 − min)/3, a_2 = min + 2(a_3 − min)/3, a_4 = a_3 + (max − a_3)/3 and a_5 = a_3 + 2(max − a_3)/3. That is, a_1, a_2, a_3, a_4 and a_5 uniformly divide the interval [min, max]. The membership functions for A_i (i = 1, 2, ..., 5) are piecewise functions over these ranges, as plotted in Figure 3. For any u ∈ U, argmax{µ_{A_i}(u) | i ∈ {1, 2, 3, 4, 5}} is the level to which u belongs.
We then transform the value level into a sensitivity level. For some sensitive attributes, the higher the value level is, the larger the sensitivity level is, e.g., income; for other sensitive attributes the relation is reversed, e.g., a student's cumulative GPA. For a numerical attribute, we number the five sensitivity levels from 1 to 5. Level 5 is the highest and level 1 is the lowest. The higher the sensitivity level is, the stronger the privacy protection that is given.
For example, for an equivalence class Q in a hierarchical data set, we assume that the sensitive attribute of the root vertex in the class representative of Q is the cumulative GPA, whose value is {0.8, 1.6, 2.3, 2.7, 3.5, 3.9}, where the domain of the cumulative GPA is [0, 4]. We obtain min = 0, max = 4, a_3 = 2, a_1 = 2/3, a_2 = 4/3, a_4 = 8/3 and a_5 = 10/3. The membership degrees of u_i to A_j are shown in Table 1, where u_i ∈ {0.8, 1.6, 2.3, 2.7, 3.5, 3.9} and A_j ∈ {low, very low, middle, very high, high}. We can see that 0.8, 1.6, 2.3, 2.7, 3.5 and 3.9 belong to low, very low, middle, very high, high and high, respectively. Their sensitivity levels are 5, 4, 3, 2, 1 and 1, respectively. In fact, for every sensitive value of a numerical attribute A, we can quickly determine its value level by using the membership functions. As shown in Figure 3, [min, max] is the domain of A, and a_1, a_2, a_3, a_4 and a_5 are the division points defined above.
For example, for the cumulative GPA and the evaluation score for a teacher, the domains are [0, 4] and [0, 1], respectively. Their value levels and sensitivity levels are shown in Table 2. The letter grade of a course is already divided into five levels. For a categorical attribute, e.g., disease, we obtain an attribute Frequency according to the frequency of every value. The values of Frequency can be divided into five levels: low, very low, middle, very high and high. The disease HIV is more sensitive than flu, and the frequency of HIV is lower than that of flu. Therefore, we divide the values of disease into five sensitivity levels according to the value levels of Frequency. The lower the value level is, the larger the sensitivity level is.
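The level assignment in the GPA example can be reproduced with a short Python sketch. The exact membership functions are given by the paper's equations and Figure 3; the shapes used below are an assumption chosen only to be consistent with the stated ranges: A_1 and A_5 are shoulder-shaped (flat at 1 up to a_1 and from a_5, respectively), and A_2, A_3 and A_4 rise and fall linearly with peaks at a_2, a_3 and a_4. All function names are hypothetical, and the membership degrees this sketch produces are not claimed to match Table 1.

```python
LEVELS = ["low", "very low", "middle", "very high", "high"]


def break_points(lo, hi):
    """Division points a_1..a_5 of the domain [lo, hi] as defined in the text."""
    a3 = (lo + hi) / 2
    a1 = lo + (a3 - lo) / 3
    a2 = lo + 2 * (a3 - lo) / 3
    a4 = a3 + (hi - a3) / 3
    a5 = a3 + 2 * (hi - a3) / 3
    return a1, a2, a3, a4, a5


def falling(x, left, right):
    """1 at or below `left`, 0 at or above `right`, linear in between."""
    return min(1.0, max(0.0, (right - x) / (right - left)))


def rising(x, left, right):
    """0 at or below `left`, 1 at or above `right`, linear in between."""
    return min(1.0, max(0.0, (x - left) / (right - left)))


def memberships(u, lo, hi):
    """Assumed membership degrees of u to A_1..A_5 (piecewise linear)."""
    a1, a2, a3, a4, a5 = break_points(lo, hi)
    return [
        falling(u, a1, a2),                          # A_1 on [min, a_2]
        min(rising(u, a1, a2), falling(u, a2, a3)),  # A_2 on [a_1, a_3]
        min(rising(u, a2, a3), falling(u, a3, a4)),  # A_3 on [a_2, a_4]
        min(rising(u, a3, a4), falling(u, a4, a5)),  # A_4 on [a_3, a_5]
        rising(u, a4, a5),                           # A_5 on [a_4, max]
    ]


def value_level(u, lo, hi):
    """Index (0..4) of the fuzzy set with the largest membership degree."""
    degrees = memberships(u, lo, hi)
    return degrees.index(max(degrees))  # ties resolve to the lower level


def sensitivity_level(level_index, higher_value_more_sensitive):
    """Map a value level to a sensitivity level 1..5 (5 is the most sensitive)."""
    return level_index + 1 if higher_value_more_sensitive else 5 - level_index


gpas = [0.8, 1.6, 2.3, 2.7, 3.5, 3.9]
levels = [value_level(u, 0, 4) for u in gpas]
print([LEVELS[i] for i in levels])
# For GPA, a lower value is more sensitive, so the mapping is reversed.
print([sensitivity_level(i, higher_value_more_sensitive=False) for i in levels])
```

Under these assumed shapes the level boundaries fall halfway between consecutive division points, and the six GPAs receive the value levels and sensitivity levels listed in the example.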

Definition 5 ((α^h_lev, k)-anonymity in Hierarchical Data). Given a hierarchical data set H, a published anonymous version H* of H satisfies (α^h_lev, k)-anonymity if every equivalence class Q in H* satisfies (α^h_lev, k)-anonymity. That is, Q contains at least k hierarchical data records, and for every vertex v in the class representative of Q, the frequency of the values in v_SA that belong to sensitivity level i is less than or equal to α^h_lev[i], where α^h_lev = {0.8, 0.6, 0.4, 0.2, 0.1} gives the thresholds for sensitivity levels i = 1, ..., 5, respectively.
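Definition 5 can be checked mechanically once every sensitive value has been mapped to a sensitivity level. The sketch below assumes that the frequency of a level is its count divided by the number of values in v_SA, and that each vertex of the class representative is given as a plain list of sensitivity levels; the function names are hypothetical.

```python
from collections import Counter

# Thresholds for sensitivity levels 1..5, as given in Definition 5.
ALPHA = {1: 0.8, 2: 0.6, 3: 0.4, 4: 0.2, 5: 0.1}


def vertex_satisfies(levels, alpha=ALPHA):
    """True if no sensitivity level is more frequent in v_SA than its threshold."""
    if not levels:  # a vertex without a sensitive attribute is unconstrained
        return True
    return all(n / len(levels) <= alpha[i] for i, n in Counter(levels).items())


def class_satisfies(num_records, vertex_levels, k, alpha=ALPHA):
    """Check an equivalence class: at least k records and every vertex within bounds."""
    return num_records >= k and all(vertex_satisfies(v, alpha) for v in vertex_levels)


# The three low GPAs from the Introduction all map to sensitivity level 5.
print(class_satisfies(3, [[5, 5, 5]], k=3))
# A class of ten records whose root values are spread over the levels.
print(class_satisfies(10, [[5, 4, 4, 3, 3, 3, 2, 2, 1, 1]], k=10))
```

The first class satisfies 3-diversity but is rejected here because every value lies in the most sensitive level, whereas the second class keeps each level within its threshold.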

The Anonymization Method
In this section, we introduce our anonymization method, which is divided into two steps. The first step is to anonymize two hierarchical data records or class representatives, and the second step is to anonymize the entire hierarchical data set by using a clustering method.
The anonymization of two hierarchical data records is shown in Algorithm 1. The input is two arbitrary hierarchical data records T_1 and T_2. Without loss of generality, we assume that T_1 has fewer subtrees than T_2. The output is the information loss of anonymizing the two records.
We first check whether the root vertices of T_1 and T_2, stored in variables a and b, respectively, satisfy the anonymity constraint check_cons(a, b), which is 1 if the constraint holds and 0 otherwise. Here a_SA ∪ b_SA is required to conform to (α^h_lev, k)-anonymity, i.e., for any vertex v in the class representative, the number of values in v_SA that lie in sensitivity level i is less than or equal to k·α^h_lev[i]. If check_cons(a, b) is 0, tree(a) and tree(b) are suppressed, where tree(a_i) (a_i ∈ {a, b}) denotes the subtree rooted at a_i; otherwise, the QI values of a and b are generalized. Let subtrees(a) and subtrees(b) represent the sets of subtrees under a and b, respectively. There are three cases: (1) subtrees(a) = ∅ and subtrees(b) = ∅, which indicates that a and b are leaves of the hierarchical records, i.e., no vertex needs to be processed, and the algorithm returns the total cost in tree(a) and tree(b); (2) subtrees(a) = ∅ and subtrees(b) ≠ ∅, in which case we suppress all vertices under b to keep the structural consistency, and return the total cost; (3) subtrees(a) ≠ ∅ and subtrees(b) ≠ ∅, in which case the subtrees under a and b need to be further processed. To minimize the information loss caused by anonymization, the subtrees under a and b need to be optimally matched. Let subtrees(a) = {U_1, U_2, ..., U_m} and subtrees(b) = {V_1, V_2, ..., V_n}. For every subtree U_i of a, we find the subtree V_j of b with minimum MLevAnonytree(U_i, V_j), as shown in lines 12-23. For every pair (i, j) in pairs, we call MLevAnonytree(U_i, V_j) to generalize them. In lines 26 and 27, we suppress the unpaired subtrees of b if they exist. An anonymization example of two hierarchical data records is shown in Figure 4, where Figure 4a-c show two raw hierarchical data records, their anonymous results conforming to (α^h_lev, 2)-anonymity, and their class representative, respectively.
Now, we give the clustering algorithm for anonymizing the entire hierarchical data set, as shown in Algorithm 2.
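Before moving on to Algorithm 2, the recursion of the pairwise step can be sketched in Python. This is a simplified cost-only illustration, not the authors' Algorithm 1: the `Node` class is hypothetical, each vertex stores the sensitivity levels of its sensitive values directly, vertices being compared are assumed to share the same QI attributes, and the cost units are assumptions (one unit per QI value that differs and would need generalizing, one unit per suppressed vertex). The greedy matching follows the description above: each subtree of a is paired with the cheapest remaining subtree of b, and unpaired subtrees of b are suppressed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    qi: tuple                                      # QI values of the vertex
    sa_levels: list = field(default_factory=list)  # sensitivity levels (1..5) of its SA values
    children: list = field(default_factory=list)   # subtrees under the vertex


def size(t):
    """Number of vertices in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def check_cons(a, b, alpha, k):
    """Count of values per sensitivity level in a_SA ∪ b_SA must not exceed k * alpha[i]."""
    counts = Counter(a.sa_levels + b.sa_levels)
    return all(n <= k * alpha[level] for level, n in counts.items())


def qi_cost(a, b):
    """Assumed generalization cost: one unit per QI value that differs."""
    return sum(x != y for x, y in zip(a.qi, b.qi))


def pair_cost(a, b, alpha, k):
    """Information loss (in the assumed units) of anonymizing tree(a) with tree(b)."""
    if len(a.children) > len(b.children):
        a, b = b, a                    # ensure a has no more subtrees than b
    if not check_cons(a, b, alpha, k):
        return size(a) + size(b)       # suppress both subtrees
    cost = qi_cost(a, b)               # generalize the QIs of a and b
    unmatched = list(b.children)
    for u in a.children:               # greedy matching of subtrees
        j = min(range(len(unmatched)),
                key=lambda i: pair_cost(u, unmatched[i], alpha, k))
        cost += pair_cost(u, unmatched.pop(j), alpha, k)
    return cost + sum(size(v) for v in unmatched)  # suppress unpaired subtrees of b


alpha = {1: 0.8, 2: 0.6, 3: 0.4, 4: 0.2, 5: 0.1}
t1 = Node(("CS", 1990), [1], [Node(("CS201",), [2])])
t2 = Node(("CS", 1991), [2], [Node(("CS201",), [1]), Node(("CS305",), [1])])
print(pair_cost(t1, t2, alpha, k=2))
```

In this toy run the roots differ in one QI value, the two CS201 subtrees are matched at no cost, and the unpaired CS305 subtree is suppressed. The sketch evaluates each candidate pair twice for brevity; caching the pairwise costs would avoid the repeated recursion.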
The input of Algorithm 2 is a hierarchical data set H and the privacy parameters α^h_lev and k. The output is the anonymous data H* that satisfies (α^h_lev, k)-anonymity. In lines 2-16, while the number of records in H is greater than or equal to k, the algorithm creates an equivalence class from H. The first record of an equivalence class Q is picked randomly. For every remaining record T_i in H, we compute the information loss of adding T_i to Q, and then sort H in ascending order of this information loss. We select the other k-1 records from the first 50 records to decrease the runtime of the algorithm. In lines 17 and 18, when the number of records in H is less than k, the algorithm suppresses all records in H.
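The clustering loop described above can be sketched as follows. This is an illustrative reading of the description, not the authors' Algorithm 2: `distance(q, t)` is a hypothetical callable standing for the information loss of adding record t to the partial class q, the window of 50 candidates is exposed as a parameter, and the anonymization of each finished class is omitted.

```python
import random


def cluster(records, k, distance, window=50, seed=0):
    """Greedily build equivalence classes of size k; return (classes, suppressed)."""
    rng = random.Random(seed)
    remaining = list(records)
    classes = []
    while len(remaining) >= k:
        # The first record of a new class is picked at random.
        q = [remaining.pop(rng.randrange(len(remaining)))]
        # Sort the remaining records by the loss of adding them to q.
        remaining.sort(key=lambda t: distance(q, t))
        while len(q) < k:
            # Only the first `window` records are scanned to limit the runtime.
            candidates = remaining[:window]
            j = min(range(len(candidates)), key=lambda i: distance(q, candidates[i]))
            q.append(remaining.pop(j))
        classes.append(q)
    # Fewer than k records are left: they are suppressed.
    return classes, remaining


# Toy usage with numbers as stand-ins for records and the value spread as the loss.
def spread(q, t):
    values = q + [t]
    return max(values) - min(values)


classes, suppressed = cluster([1, 2, 3, 10, 11, 12, 20], k=3, distance=spread)
print(classes, suppressed)
```

Because the remaining records are sorted only once per class and later picks scan just the leading window, the record added at each step is approximately, rather than exactly, the closest one to the current class.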

Experimental Results
The objective of these experiments is to evaluate the performance of the proposed algorithm with respect to data utility, security and efficiency by comparing it with the existing anonymization approach for hierarchical data, ClusTree [16], which achieves l-diversity. The algorithms are implemented in Python and run on a computer with a four-core 3.4 GHz CPU and 8 GB RAM running Windows 7. We experimented on two synthetic datasets, which were obtained from the authors of [16]. They were modeled synthetically based on the real information of graduates from Sabanci University in Turkey. The synthetic dataset A has two levels (h = 2), in the order of (major program, year of birth) → courses, and contains 1000 students with nearly 20 courses per student. The synthetic dataset B has three levels (h = 3), in the order of (major program, year of birth) → courses → teachers, in which there are 1000 students, every student takes nearly 20 courses, and every course has one to two teachers.

Evaluation Metrics
We evaluate data utility, security and efficiency of our method by using LM cost [16,28], dissimilarity degree of the equivalence class [22] and the execution time, respectively.
For a hierarchical data record T, the cost of T is computed from its non-suppressed and suppressed vertices, where Ω and Ψ are the sets of vertices that are not suppressed and suppressed, respectively, |ω_QI| is the number of QI attributes in a vertex ω, and LM(q) = (|u_q| − 1)/(|u| − 1) is the information loss of generalizing q to u_q. The larger the information loss is, the lower the utility is. LM cost is an important index for evaluating the utility of an anonymization method.
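As a rough illustration of how such a cost could be evaluated, the sketch below combines the quantities named above. The way they are combined is an assumption, since the defining equation is not reproduced here: each QI value of a non-suppressed vertex contributes LM(q), and each suppressed vertex contributes one unit per QI attribute (the maximum LM). Following the usual reading of the loss metric, |u_q| is taken to be the number of leaf values covered by the generalized value and |u| the size of the attribute's domain; the function names are hypothetical.

```python
def lm(covered_leaves, domain_size):
    """LM(q) = (|u_q| - 1) / (|u| - 1); 0 for an ungeneralized value, 1 for the root."""
    if domain_size <= 1:
        return 0.0
    return (covered_leaves - 1) / (domain_size - 1)


def record_cost(kept_vertices, suppressed_vertices):
    """Assumed LM cost of one hierarchical record.

    kept_vertices: one list per non-suppressed vertex, holding a
        (covered_leaves, domain_size) pair for each of its QI values.
    suppressed_vertices: number of QI attributes of each suppressed vertex.
    """
    generalization = sum(lm(c, d) for vertex in kept_vertices for c, d in vertex)
    suppression = sum(suppressed_vertices)
    return generalization + suppression


# Hypothetical record: the root keeps its major (1 of 5 values) and has its year of
# birth generalized to a 5-year range out of a 10-year domain; one vertex with a
# single QI attribute is suppressed.
print(record_cost([[(1, 5), (5, 10)]], [1]))
```

With this convention an untouched record costs 0, and the cost grows with both the width of the generalizations and the number of suppressed QI values.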
The equivalence class dissimilarity is proposed in [22] for relational data, and we extend it to hierarchical data. Let Q be an equivalence class with class representative C_rep, and let v be a vertex in C_rep. The dissimilarity degree of a vertex in the class representative of an equivalence class increases, so the average dissimilarity degree of an equivalence class increases. From Figure 6, we can see that the average dissimilarity degree of an equivalence class for our MLevClusTree is higher than that for ClusTree, since our approach restricts the proportion of sensitive values in different sensitivity levels. Therefore, our approach enhances the ability to resist similarity attacks and improves data security.
Finally, we evaluate the efficiency of our algorithm by the execution time. The experimental results are shown in Figure 7. We can see that the execution time of the two algorithms increases as k increases. For every equivalence class Q in the hierarchical data, the first hierarchical data record is randomly selected, which requires no computation. For every other record in the equivalence class, we need to scan part of the hierarchical data to find the record whose distance to the current Q is approximately minimum. When k increases, the size of an equivalence class increases. Thus, the runtime increases. Also, we can see that the time for dataset B is longer than that for dataset A, because hierarchical data with more levels needs more time to find the record whose distance to the current Q is approximately minimum. From Figure 7, we can see that the execution time of our MLevClusTree is slightly higher than that of ClusTree as k increases, since for every equivalence class MLevClusTree needs to decide whether the number of sensitive values in every sensitivity level exceeds the given threshold. From these experimental results, we can see that our MLevClusTree provides stronger privacy protection and has lower information loss, although it takes more time. This is acceptable because the anonymization process is performed offline.

Conclusions
Hierarchical data has become ubiquitous with the advent of document-oriented databases and the wide use of markup languages. However, this data contains private information, and so must be appropriately anonymized before it is published for scientific research and decision-making. To prevent similarity attacks on hierarchical data, in this paper we use fuzzy set theory to partition the values of a sensitive numerical or categorical attribute uniformly into five levels, converting categorical attribute values into numerical attribute values, and then map the five value levels to five sensitivity levels. According to these sensitivity levels, we propose the privacy model (α^h_lev, k)-anonymity for hierarchical data with multi-level sensitivity and design a privacy-preserving approach to achieve (α^h_lev, k)-anonymity. Experimental results show that the average dissimilarity degree of the equivalence classes in the anonymized hierarchical data obtained by our approach is higher than that for the existing anonymization approach for hierarchical data. Thus, our approach can effectively resist similarity attacks. Also, our approach causes less information loss and so improves the utility of the anonymized hierarchical data.

Figure 1. An example of hierarchical data: (a) A student's record; (b) Schema for education data.

Figure 3. The membership functions for five value levels.

Figure 4. An anonymization example: (a) Two raw hierarchical data records; (b) The anonymous results; (c) Class representative of results.

Figure 6. Dissimilarity degree of equivalence class on two datasets: (a) Dataset A with h = 2; (b) Dataset B with h = 3.

Figure 7. Execution time on two synthetic datasets: (a) Dataset A with h = 2; (b) Dataset B with h = 3.

Table 1. The membership degree of u_i to A_j.

Table 2. The value levels and sensitivity levels for sensitive attributes.