Publishing Anonymized Set-Valued Data via Disassociation towards Analysis

Abstract: Data publishing is a challenging task under privacy preservation constraints. To ensure privacy, many anonymization techniques have been proposed. They differ in the mathematical properties they verify and in the functional objectives they target. Disassociation is one of the techniques that anonymize set-valued datasets (e.g., discrete locations, search and shopping items) while guaranteeing the confidentiality property known as k^m-anonymity. Disassociation separates the items of an itemset into vertical chunks to create ambiguity in the original associations. In a previous work, we defined a new ant-based clustering algorithm for the disassociation technique that preserves certain items associated together, called utility rules, throughout the anonymization process, for accurate analysis. In this paper, we examine the disassociated dataset in terms of knowledge extraction. To make data analysis easy on top of the anonymized dataset, we define neighbor datasets, i.e., datasets that result from a probabilistic re-association process. To assess this notion of neighborhood, set-valued datasets are formalized as trees and a tree edit distance (TED) is applied directly between neighbors. Finally, our experiments show that the neighbors remain faithful to the original dataset for knowledge extraction and future analysis.


Introduction
Set-valued data is one of the data formats that can be extracted from social networks; it is characterized by associating a set of values to each individual. Set-valued data is common in the logs of search engines and searches by hashtags, and can also be found in databases such as health databases and market basket records. This work is an extension of [1], which studies the privacy-utility trade-off of certain predefined associations in set-valued data anonymized through the disassociation technique, first defined in [2]. The aim of this paper is to continue the investigation and reconstruct the disassociated set-valued dataset so that it is amenable to future data analysis. Describing, diagnosing, predicting and prescribing are the main uses of data, and set-valued data provides multiple opportunities for various data mining tasks. Using data mining techniques and learning algorithms, machine learning infers models of what underlies the data in order to predict possible futures. Extracting knowledge from data, whether to increase business productivity, drive effective decision making or predict trends and behaviors, is done by discovering the patterns inherent in the data. To benefit from the large amount of set-valued data, association rule mining is deployed in many fields, from discovering links between diseases to marketing and retail. It does not refer to one single algorithm or application, but rather to a set of applications that analyze the correlation between data items in order to find relationships between items and item categories. This study is an investigation of general set-valued data, where data temporality and semantics are not examined. To illustrate the dilemma of data analysis versus privacy preservation in set-valued data, let us consider the example of a mobility dataset, which stores the GPS locations of individuals, where each record corresponds to the set of cities visited by an individual.
Finding a very strong association rule between three cities, {{Valetta, Bratislava} ⇒ {Salzburg}}, might indicate a high demand for mobility solutions between these cities. On the other hand, publishing set-valued data raises the problem of privacy preservation. If an individual, Bob, visited {Salzburg (Austria), Valetta (Malta), Bratislava (Slovakia), Bergamo (Italy)} and an attacker knows that Bob visited both {Valetta, Bratislava}, he/she can predict with high probability that Bob also visited {Salzburg}. Publishing the dataset unrefined fails to protect the privacy of Bob's mobility data.
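To make the attacker's inference concrete, the following sketch computes the confidence of the rule {Valetta, Bratislava} ⇒ {Salzburg} as a ratio of supports. The records below are an illustrative toy dataset, not real mobility data:

```python
def support(itemset, records):
    """Number of records that are supersets of `itemset`."""
    return sum(1 for r in records if itemset <= r)

# Toy set-valued dataset of visited cities (illustrative only).
records = [
    {"Salzburg", "Valetta", "Bratislava", "Bergamo"},  # Bob's record
    {"Valetta", "Bratislava", "Salzburg"},
    {"Valetta", "Bratislava", "Salzburg"},
    {"Valetta", "Bergamo"},
]

antecedent = {"Valetta", "Bratislava"}
rule = antecedent | {"Salzburg"}

# Confidence = support(antecedent ∪ consequent) / support(antecedent).
confidence = support(rule, records) / support(antecedent, records)
print(confidence)  # 1.0: every record containing the antecedent also has Salzburg
```

A confidence close to 1 is exactly what lets the attacker predict Bob's visit to Salzburg from partial knowledge.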
From the above example, we can see that it is necessary to extract knowledge from set-valued datasets, but at the same time the privacy of the individuals should be considered wisely. Publishing raw data is a public disclosure of the collected information and puts the privacy of individuals at risk. For that reason, privacy-preserving mechanisms should be applied to the data before publishing it.
Anonymization is the process that imposes a level of privacy on the data in an attempt to protect the privacy of the individuals participating in the dataset. Anonymization techniques that can be deployed to protect the confidentiality of user data form a wide area of research [3][4][5][6][7][8][9][10]. Disassociation is an anonymization technique, presented in [2], particularly designed for set-valued data. Basically, set-valued records are clustered into homogeneous groups (via horizontal partitioning), and then each cluster is split into record chunks (via vertical partitioning) where the items of a set are separated into disconnected subsets, to preserve the k^m-anonymity privacy constraint.
However, an anonymization approach has two aims: to provide a dataset respecting privacy and to make the published data valuable for direct analysis. A trade-off must be found between privacy preservation and the benefits of knowledge extraction from datasets.
Guided by the disassociation technique, our investigation started in [1] by studying and improving the probabilistic preservation of associations in a disassociated dataset. After examining the probabilistic preservation results, we presented an optimization of disassociation for a predefined set of associations, which we call utility rules. The goal is to spare the utility rules, as much as possible, from being broken up by vertical partitioning, while respecting the k^m-anonymity privacy constraint of disassociation.
A derivative of the ant-based clustering algorithm has been proposed for disassociation, to group data whilst respecting the utility rules.
The new problem addressed in this extension of the work [1] is that, in reality, we cannot define every association present in the original dataset as a utility rule in order to optimize its preservation for future analysis. Therefore, the study of association rules in a disassociated dataset cannot be limited to the items preserved together, without a break-up, after vertical partitioning. To run an analysis on a disassociated dataset, we would have to compute a huge set of pre-images of the disassociated dataset, by reconstructing all possible associations between the disconnected subsets of items. By breaking down a set of items into disconnected subsets, this data format makes analysis over it hard, or at least time-consuming, to achieve. After all, it is useless to publish a dataset that cannot be helpful for data analysis.
In this extension, the goal is to publish an anonymized set-valued dataset that falls in the neighborhood of the original one, consequently preserving its original format. In this context, the "neighbor datasets" terminology fits our intention perfectly. In an effort to make this article self-sufficient, Section 2 recalls and summarizes our previous work [1] on the probabilistic preservation of utility in a disassociated dataset and its implementation as ant clustering. In Section 3, we present our first contribution, where we define "neighbor datasets" as two similar datasets that fall under a certain radius of distance, and present a way to assess the distance between neighbor datasets by formalizing the datasets into trees. Our second contribution, presented in Section 4, is a technique that generates a neighbor of the original dataset from its disassociated result, which can be seen as a statistics-based re-association. We investigate our solution from two perspectives. First, we look at the preservation of association rules, which reflects the synthetic similarity of neighbor datasets. Second, to generalize our approach, we evaluate the distance between the original and reconstructed datasets, using the distance introduced above, to have an overview of the created neighborhood. Figure 1 summarizes the whole solution depicted in this article. All the experimental results that confirm the applicability of our approach are presented in Section 5. Finally, Section 6 presents concluding remarks.

Disassociation and Utility Awareness for Associations
The disassociation technique [2] ensures the constraint of k^m-anonymity by separating the items of a record into several chunks within the same group. It thus creates association ambiguity between the separated items, which causes a reduction of the utility of the association in question. Disassociation, as defined by Terrovitis, is based on two assumptions. The first is that no association is more significant than another. The second is that the data should not be modified, generalized or deleted. In the next section, we provide an algorithm that preserves a better utility for a set of predefined associations, called the utility rules, by reducing the number of split-ups a utility rule has to endure in order to preserve k^m-anonymity [2]. Table 1 recalls the basic notations used in this paper:
• r: a record (of T), i.e., a set of items associated with a specific individual of a population
• I: an itemset included in D
• s(I, T): the support of itemset I, i.e., the number of records in T that are supersets of I
• C: a cluster in a disassociated dataset, formed by the horizontal partitioning of T
• C*: a vertically partitioned cluster C, resulting in record chunks and a term chunk
• R_C: a record chunk from the vertically partitioned cluster C*
• T_C: the term chunk from the vertically partitioned cluster C*
• δ: the maximum number of records allowed in a cluster, also known as the maximum cluster size

Disassociation of a Set-valued Data
This section shows how disassociation works in terms of privacy and utility. We use Figure 2 to illustrate an example of disassociation, applied with k = 3, m = 2 and δ = 6.
The original dataset T of the running example (Figure 2a) contains the records: r1 = {a, b, c, d, e}, r2 = {a, b, c}, r3 = {a, d, e}, r4 = {a, d}, r5 = {a, e}, r6 = {a}, r7 = {f, b, c}, r8 = {f, g, b}, r9 = {f, g, c}, r10 = {f, g}.
Horizontal partitioning: the process of clustering the records of a dataset following a naive similarity function. Iteratively, until all records are clustered, horizontal partitioning groups at most δ of the non-clustered records containing the most frequent item of the current iteration. As such, horizontal partitioning fails to take into consideration the associations in the dataset, relying only on one common item to cluster the records. Figure 2b reflects the process of horizontal partitioning, where the records containing the most frequent item, a, with s(a, T) = 6, are grouped together within cluster C_1, and all the other records within C_2. Both clusters have a size of at most δ.
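The horizontal partitioning loop described above can be sketched as follows. This is a simplified reading of the process (naive most-frequent-item grouping), not the reference implementation of [2]:

```python
from collections import Counter

def horizontal_partition(records, delta):
    """Naive horizontal partitioning: repeatedly take the most frequent item
    among the not-yet-clustered records and group at most `delta` records
    containing it into one cluster. Assumes non-empty records."""
    remaining = list(records)
    clusters = []
    while remaining:
        counts = Counter(item for r in remaining for item in r)
        top_item, _ = counts.most_common(1)[0]
        cluster = [r for r in remaining if top_item in r][:delta]
        for r in cluster:
            remaining.remove(r)   # clustered records leave the pool
        clusters.append(cluster)
    return clusters

# The records of the running example (Figure 2a).
T = [set("abcde"), set("abc"), set("ade"), set("ad"), set("ae"), set("a"),
     set("fbc"), set("fgb"), set("fgc"), set("fg")]
clusters = horizontal_partition(T, delta=6)
print([len(c) for c in clusters])  # [6, 4]: the a-cluster C_1 and the f-cluster C_2
```

On the example, item a (support 6) seeds the first cluster and item f seeds the second, matching Figure 2b.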
Vertical partitioning: the process that enforces the k^m-anonymity privacy constraint by vertically cutting every cluster into record chunks R_C. A record chunk contains sub-itemsets of the cluster's records that verify k^m-anonymity between their items. A term chunk T_C is added to the vertical cut for the items with a support of less than k. This process is the core of privacy preservation in disassociation. In our example, vertical partitioning is applied over C_1 and C_2. Associations in the clusters are split into different record chunks when k^m-anonymity is not verified. Figure 2c represents the final result of horizontal and vertical partitioning.
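As a sketch of the privacy constraint that vertical partitioning enforces, the following check verifies k^m-anonymity directly from its definition: every itemset of at most m items that appears in the records must be supported by at least k records. The helper and the toy chunks are illustrative:

```python
from itertools import combinations

def is_km_anonymous(records, k, m):
    """k^m-anonymity check: every itemset of at most m items that appears in
    at least one record must be contained in at least k records."""
    items = sorted(set().union(*records)) if records else []
    for size in range(1, m + 1):
        for combo in combinations(items, size):
            s = sum(1 for r in records if set(combo) <= r)
            if 0 < s < k:       # present but too rare: privacy breach
                return False
    return True

# Three identical {a,d,e} records plus three {a} records: every occurring
# 2-itemset ({a,d}, {a,e}, {d,e}) has support 3, so the chunk is safe for k=3, m=2.
safe_chunk = [set("ade")] * 3 + [set("a")] * 3
risky_chunk = [set("ab"), set("ab"), set("a")]   # item b occurs only twice

print(is_km_anonymous(safe_chunk, k=3, m=2))    # True
print(is_km_anonymous(risky_chunk, k=3, m=2))   # False
```

Items failing this test are exactly those that disassociation would push into the term chunk or separate into different record chunks.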
To illustrate the effect of disassociation on the utility of associations, let us consider that the frequency of the association {b, c} is valuable for future analysis. From the result of disassociation in Figure 2c, we can note that the bond between items b and c is ambiguous. In C*_1, items b and c are dropped into the term chunk T_C, having a support of less than k = 3, therefore camouflaging their exact support and the association between them, if it exists. Similarly, the association between b and c is unclear in C*_2, with only one advantage over C*_1: the support of item b is known. Let us now suppose that we can guide the horizontal partitioning process in favor of the association {b, c} by keeping together all the records that are supersets of it, as in Figure 3a. Now, the association {b, c} verifies k^m-anonymity and is fully preserved after vertical partitioning, as shown in Figure 3b.
From this example, we deduce that preserving associations depends significantly on horizontal partitioning. When a set of associations is important for future analysis, a special treatment of horizontal partitioning must be considered. In what follows, we present a solution to preserve high utility, within the disassociated dataset, for a set of predefined associations that we call utility rules.

The Privacy-Utility Trade-off in Disassociation
Giving an exact general definition of data utility in the domain of anonymization is impractical. In this work, a utility rule is an association that is important for accurate analysis, especially for aggregate query answering accuracy.

• Let UR = {ur_1, ..., ur_u} be a set of predefined associations, which we refer to as utility rules and which are important for future analysis.

• Let s(ur, T) be the support of the utility rule ur in the original dataset.

• Let s(ur, T*) be the support of the utility rule ur in the disassociated dataset.
To evaluate the probabilistic preservation of a utility rule in a disassociated dataset, the confidence of a utility rule is analyzed.
Definition 1 (α-confidence). The α-confidence of a utility rule ur is the ratio of its support preserved by disassociation: α-confidence(ur) = s(ur, T*) / s(ur, T). The term confidence is used to determine the strength of the association between the items of a utility rule ur after disassociation. Statistical queries are based on the support of the associations in question, and the α-confidence reflects how much of a utility rule's support survives in the final output of the disassociation.
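Assuming, as the surrounding description suggests, that the α-confidence is the ratio between the rule's support preserved intact in record chunks and its original support (the exact formula is given in [1]), it can be sketched as:

```python
def support(itemset, records):
    """Number of records that are supersets of `itemset`."""
    return sum(1 for r in records if itemset <= r)

def alpha_confidence(ur, original, record_chunks):
    """Sketch of alpha-confidence as s(ur, T*) / s(ur, T): the fraction of the
    utility rule's support that survives inside record chunks (assumption)."""
    preserved = sum(support(ur, chunk) for chunk in record_chunks)
    return preserved / support(ur, original)

# Toy data: {b, c} appears in 4 original records; after disassociation, one
# record chunk still keeps b and c together in 3 records.
T = [set("abc"), set("abc"), set("abc"), set("bc")]
chunks = [[set("bc"), set("bc"), set("bc")]]
alpha = alpha_confidence({"b", "c"}, T, chunks)
print(alpha)  # 0.75
```

An α-confidence of 1 means the rule's support is fully reflected in the disassociated output; lower values quantify the ambiguity introduced.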
In [1], the utility of an association in disassociated datasets is evaluated and bounded theoretically under the k^m-anonymity privacy model. From this perspective of the privacy-utility trade-off, we are motivated to contribute a more insightful horizontal partitioning process, tolerant to the predefined utility rules, for future data analysis accuracy.
The next section provides a general description of the clustering problem and then the role of swarm intelligence algorithms for the improvement of data clustering.

Data Clustering
Clustering, by definition, is the process of grouping objects with similar characteristics together within a cluster. There exists no unified solution to all clustering problems, and the problem has been proven to be NP-hard [11,12]; thus it consists of finding an optimal or near-optimal solution. In what follows, we briefly present general techniques used for data clustering.
Classical clustering algorithms attempt to optimize a fitness function in order to minimize the dissimilarity between items within a cluster and maximize it between clusters. Clustering can be either fuzzy or partitional. In fuzzy clustering, data items may belong to multiple clusters with a fuzzy membership grade, as in the fuzzy c-means algorithm [13]. In partitional clustering, clusters are totally disjoint and data objects belong to exactly one cluster, as in the K-means algorithm [14]. In this work, and in accordance with the disassociation principle, we are only interested in partitional clustering.
Bio-inspired algorithms model processes existing in nature to solve optimization problems. Examples of bio-inspired systems used for data clustering are ant colony systems [15], particle swarm optimization [16] and artificial bee colonies [17]. These algorithms draw inspiration from the collective behavior of decentralized, self-organized social animals. Even though the particles of a swarm may have very limited individual capabilities, they can perform very complex jobs, vital for their survival, when acting as a community. Choosing the right bio-inspired algorithm to cluster data relies on how comparable the given problem's background is to the behavior of the bio-particles.
The ant clustering algorithm (ACA) is a population-based stochastic search process, modeled after the social behavior of ants searching for food, sorting larvae and cleaning corpses. When picking up and dropping items, an ant's actions are influenced by the similarity and the density of the data within its local neighborhood. From this behavior, researchers introduced many variations of clustering algorithms applicable to a wide range of problems [18][19][20][21][22].
In the next section, we define a variant of the ant-based clustering algorithm to cluster records that are supersets of the predefined utility rules, for a more utility-guided disassociation.

Framework of the Algorithm
We are motivated by the need to preserve some items associated together to increase their utility value despite disassociation and the ambiguity that it raises. We refer to those associations as utility rules. Accordingly, we transform normal horizontal partitioning for the set of records, which are supersets of at least one utility rule, into a clustering optimization problem. Before describing the algorithm, it is important to identify the challenges of the clustering problem in our context:

• A record might enclose multiple utility rules; with partitional clustering, this record should belong to exactly one cluster satisfying one utility rule.

• The items shared between records, whether belonging to a utility rule or not, affect the distance metrics.

• The maximum cluster size constant, δ, limits the number of records allowed in a cluster.
The proposed algorithm benefits from the studied behaviors of natural ants. Table 2 describes the environment of our clustering problem in ant colony system terminology.
Let T|UR be the set of records from T that are supersets of at least one utility rule ur ∈ UR: T|UR = {r ∈ T | ∃ur ∈ UR and ur ⊆ r}.
Cluster initialization: every utility rule ur has a representative cluster and an expert ant that transports records. The algorithm starts by sending the expert ants in search of records from T|UR containing their representative utility rules, repeatedly until T|UR is empty.

• Ant colony: the set of utility rules UR.
• Ant: an expert agent a_i working for the benefit of the utility rule ur_i.
• Pheromone trail: a square matrix A representing the density of each utility rule in each cluster, updated through probabilistic picking-up and dropping functions; it is the memory shared between the ants.
• Food: the data records T|UR from the dataset T relative to the utility rules, such that T|UR = {r ∈ T | ∃ur ∈ UR and ur ⊆ r}.
• The ant's load: load(a_i) contains a data record from T|UR that the ant a_i is transporting.
• Individual ant's job: picking up and dropping a load.
Pheromone trail: the square matrix A = [u][u] represents the pheromone trail of the working ants, with u = |UR| being the number of predefined utility rules. It is the collective adaptive memory of the expert ants and is initialized with the support of each utility rule ur_i in each cluster C_j, such that A[i][j] = s(ur_i, C_j). We denote by β_{ur_i} the ratio of the records representing ur_i in its cluster C_i, i.e., β_{ur_i} = s(ur_i, C_i) / |C_i|.
Individual and cooperative work of expert ants: during the clustering process, every ant a_i works for its utility rule ur_i to reach a support threshold β_predefined ∈ [0, 1] in its representative cluster C_i.
• If β_{ur_i} < β_predefined: let d(ur_i, C_j) be the density of the utility rule ur_i in C_j. Ant a_i chooses the cluster whose density of ur_i is the highest, and a record r of that cluster, with ur_i ⊆ r, is moved to cluster C_i. This process is known as the pick-up job.
To speed up the convergence of the solution and prevent the ants from moving aimlessly, once its own threshold is reached for the current iteration, ant a_i works for the benefit of another ant a_j, the one that needs the most help to increase its β_{ur_j}. This process is known as the drop job. We consider that ant a_j needs the most help when reaching β_{ur_j} for its utility rule ur_j demands the highest number of iterations. The next section describes and comments on the algorithm implementing this ant-based clustering methodology.
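The pick-up choice can be sketched as below. The density d(ur_i, C_j) is taken here to be the support of ur_i in C_j divided by the cluster size, which is an interpretation on our part since the exact formula is elided in this recap; the pheromone matrix values are hypothetical:

```python
def pickup_source(A, i, cluster_sizes):
    """Pick-up job of expert ant a_i: choose the foreign cluster C_j (j != i)
    with the highest density of ur_i, where density is assumed to be
    support over cluster size."""
    best_j, best_density = None, -1.0
    for j, size in enumerate(cluster_sizes):
        if j == i or size == 0:
            continue
        density = A[i][j] / size          # d(ur_i, C_j), assumed form
        if density > best_density:
            best_j, best_density = j, density
    return best_j

# Hypothetical pheromone matrix: A[i][j] = s(ur_i, C_j).
A = [[4, 2, 0],
     [1, 5, 3],
     [0, 2, 6]]
sizes = [6, 6, 6]
source = pickup_source(A, 0, sizes)
print(source)  # 1: among foreign clusters, ur_0 is densest in C_1
```

A record that is a superset of ur_i would then be moved from the chosen cluster to C_i, and the matrix A updated.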

Utility Guided Ant-based Clustering Algorithm (UGAC)
The utility-guided ant-based clustering (UGAC) algorithm, Algorithm 1, is presented and explained in detail in [1]. This section recaps the optimization process elaborated for the preservation of the utility rules. UGAC creates, for every utility rule ur_i ∈ UR, an expert ant a_i to pick up records that are supersets of ur_i and drop them in the representative cluster C_i. The choice of records to pick up is guided by the pheromone trail, represented by a square matrix A = [u][u] (line 3) reflecting the support of each utility rule in every cluster. In the PICKUP procedure, an ant a_i chooses to move a record from the cluster that has the highest density of ur_i (line 2) to C_i (lines 7-9). After every move, the pheromone matrix A is updated with the supports of the utility rules in the clusters (lines 10-15).

Algorithm 1 Utility guided ant-based clustering algorithm (UGAC).
Input: T, UR, k, δ, β_predefined, it
Output: T*
1: T|UR = {r | r ∈ T and ∃ur ∈ UR and ur ⊆ r}
2: u = |UR|
3: create u clusters, u ants and the square matrix A[u][u]
4: it_count = 0
5: while T|UR ≠ ∅ do
6:   for each expert ant a_i do
7:     for (j = 0; j < u; j++) do
10:     end for
12:   end for
13: end while
14: while (it_count < it or jobless_ants < u) do
15:   jobless_ants = 0
16:   for each expert ant a_i do
17:     ...
This pick-up job is executed while β_{ur_i} < β_predefined or while there are fewer than k records in C_i (lines 16-19). Yet, if β_{ur_i} ≥ β_predefined, the expert ant a_i can work for the benefit of another ant during the current iteration, to converge to the optimal solution more quickly (lines 21-22). The DROPLOAD function, Algorithm 2, finds the utility rule that still needs the most iterations to achieve β_predefined (line 2). Then, it calls the PICKUP function, Algorithm 3, to find a record that can be transported to the corresponding cluster.
At the end of the iterations, there exist u clusters, each mainly representing one utility rule. The resulting clusters may have sizes greater than the maximum cluster size δ allowed; every such cluster is split into smaller clusters of size at most δ (line 28), calling Algorithm 4 when necessary. Algorithm 1 ends by vertically partitioning the clusters resulting from UGAC (line 30) and treats all the records that are not supersets of any utility rule via the normal processes of disassociation (line 31).

Ant-Based Clustering Effect on Associations
In [1], the efficiency of the UGAC technique in terms of preservation of the associations represented in the utility rules is evaluated, alongside other experiments analyzing the privacy-utility trade-off. The results are very promising for the preservation of the utility rules. In this section, we investigate the effect of ant-based clustering for the predefined utility rules UR on the associations beyond UR.

Algorithm 4 SplitClusters Function
if |C_i| > δ then
3:   create new cluster C_new
4:   for (int l = 0; l < δ; l++) do
5:     load(a_i) = r | r ∈ C_i
6:     ...
14: Return AntClusters
15: end procedure

We chose for the experiment the BMS1 dataset, which contains click-stream e-commerce data, and a set of 70 distinct utility rules UR extracted from the dataset with different characteristics: the highest frequency is 1204, representing a frequent association, and the lowest frequency is 2, representing a very rare association. Only the records that are supersets of the utility rules (36,141 records out of the original 149,639) are clustered and evaluated.
We compare UGAC (with α = 0.6) to the normal horizontal partitioning of disassociation and investigate the overall preservation of associations for the two clustering techniques using the Relative Association Error (RAE), defined as RAE(A) = |s(A, T|UR) − s(A, T|UR*)| / s(A, T|UR), where A is any association present in T|UR and s(A, T|UR), s(A, T|UR*) represent respectively the support of the association A in T|UR and in its disassociated result T|UR*. In this experiment, two types of associations are evaluated: first, the set of all the couples present in the records related to the utility rules, ∀(x, y) ∈ T|UR, denoted by GA, and second, the set of utility rules UR itself. Table 3 shows that the results of UGAC for GA and UR are much better than those of normal horizontal partitioning. This indicates that, for the records treated through UGAC, the associations beyond UR are not deteriorated at the expense of preserving UR. Finally, we can say that UGAC is reliable for analysis of the utility rules in question and beyond them. The above work offers an optimization for disassociating interesting rules in datasets intended for analysis. However, a disassociated dataset is not the most appropriate data format for knowledge extraction, because it breaks down a set of items into disconnected subsets of items. For this reason, we present in the next section an algorithm that generates, from the disassociated dataset, a re-associated dataset that we call a neighbor dataset, by statistically re-associating the partitioned subsets to restore the original set-valued format.
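A minimal sketch of the RAE computation for a single association, assuming the usual relative-error form |s − s*| / s over the supports before and after disassociation; the toy records are illustrative:

```python
def support(itemset, records):
    """Number of records that are supersets of `itemset`."""
    return sum(1 for r in records if itemset <= r)

def relative_association_error(assoc, original, disassociated):
    """RAE of one association: |s - s*| / s (assumed relative-error form)."""
    s = support(assoc, original)
    s_star = support(assoc, disassociated)
    return abs(s - s_star) / s

# Toy example: {a, b} is supported by 10 records originally, but vertical
# partitioning broke the association in 2 of them.
original = [set("ab") for _ in range(10)]
anonymized = [set("ab") for _ in range(8)] + [set("a"), set("b")]
rae = relative_association_error({"a", "b"}, original, anonymized)
print(rae)  # 0.2
```

Averaging this quantity over all evaluated associations (GA or UR) gives the aggregate scores compared in Table 3.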

Neighbor Datasets: a Road-map for Distance Evaluation
When a dataset is anonymized, it cannot hold the exact same records as the original dataset. Yet, for the anonymized dataset to be useful, it has to lead analysis in the same direction as the original dataset. In the following, we define a dataset T and its anonymized version T′ as two neighbor datasets if they fall under a certain radius of distance and consequently lead to close hypotheses when analyzed.
Let us first define neighboring for two datasets. To be able to describe datasets as neighbors, two questions arise. First, should the definition of neighboring datasets be contextual or general? Second, how can we mathematically assess the neighborhood degree of two datasets? In the following, we address these two questions. From what we briefly described above, we see that the basic definition of neighbor datasets should discriminate neither a contextual nor a general analysis. Taking into consideration both very general branches of data analysis, we surely need a synthetic reproduction of the original dataset T. The values of the items should stay intact, without generalization, to preserve the specificity and context of outliers. At the same time, the neighbor dataset should be a synthetic representation of T, meaning that, statistically, the two datasets should give almost the same overview of the data and lead to close hypotheses. This section first introduces how set-valued datasets are translated into trees (Section 3.1) and then shows how the generated trees are used to compute distances between these datasets (Section 3.2).

Dataset Formalization
This work focuses on unordered records of sets of distinct data: for instance, a summary of the cities visited last month by a set of travelers, or the sets of unique words used in the web queries of a collection of persons. There is no interpretation of the meaning of the data, neither in terms of semantics nor in terms of values. Data in records are either equal or distinct, and are thus considered unique. For instance, if the same city is visited twice by a traveler, it is only counted once in the summary of cities visited during the period under study. Moreover, the order (between records or between the innermost data) is not taken into consideration; there is thus no notion of temporality in the data.
From a mathematical point of view, the considered data are multisets of finite sets of discrete data. A multiset allows duplicated records: two distinct users may indeed have visited the same cities and therefore have the same record.
Multisets of sets may be efficiently represented by a forest of trees. Figure 4 gives, for example, two trees representing the same multiset. Each path from the root to a leaf defines a set, and conversely. Storing a multiset of sets as a forest is thus motivated by the objective of computing the distance between two multisets as a distance between two trees.
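A path-per-set forest can be sketched with nested dictionaries standing in for tree nodes. Sorting each record before insertion yields one path per set, and a leaf counter keeps the multiset multiplicity; the `"#count"` marker is our own convention:

```python
def build_forest(records):
    """Represent a multiset of sets as a prefix tree: each record is sorted
    and inserted as a root-to-leaf path; a counter at the leaf preserves
    duplicated records (multiset multiplicity)."""
    root = {}
    for record in records:
        node = root
        for item in sorted(record):      # sorting fixes one path per set
            node = node.setdefault(item, {})
        node["#count"] = node.get("#count", 0) + 1
    return root

dataset = [{"b", "a"}, {"a", "b"}, {"a", "c"}]
forest = build_forest(dataset)
print(forest)  # {'a': {'b': {'#count': 2}, 'c': {'#count': 1}}}
```

Reading every root-to-leaf path (with its count) reconstructs the multiset exactly.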
However, as shown in Figure 4, each multiset of sets can be represented in various ways. In this case, using the tree edit distance (TED) directly would not yield a distance, because the distance between a multiset and itself would not be zero, contradicting the separation property. The next section presents how this problem is tackled.

Tree Edit Distance for Datasets
The tree edit distance is a measure that estimates the cost of transforming one tree into another. More precisely, this cost depends on the minimum number of elementary operations (such as deletion, insertion and renaming applied to a node of a tree), weighted by their costs, needed to move from one tree to the other. This notion extends to trees the edit distance (or Levenshtein distance) between character strings.
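For reference, the string edit distance that TED generalizes can be computed with the classic dynamic program over unit-cost insertions, deletions and substitutions:

```python
def levenshtein(a, b):
    """Classic edit distance: minimum number of insertions, deletions and
    substitutions turning sequence `a` into sequence `b` (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("abcde", "abde"))      # 1: delete 'c'
print(levenshtein("kitten", "sitting"))  # 3: the textbook example
```

The tree edit distance replaces positions in a string by nodes in an ordered tree, with the analogous node-level operations.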
We are then left to translate, in a uniform manner, multisets of sets into ordered trees. Each set of items (i.e., each record) is translated into an ordered list of items, thanks to a lexicographic ordering. This results in a multiset of words, where each word contains at most one occurrence of each letter. Similarly, all these words are lexicographically ordered, leading to a sequence of words.
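This uniform translation can be sketched as follows; two equal multisets always map to the same ordered sequence, which is exactly what makes the derived tree representation unique:

```python
def canonical_form(records):
    """Uniform translation of a multiset of sets: each record becomes a
    lexicographically sorted tuple of items (a 'word'), and the resulting
    words are themselves lexicographically sorted into one sequence."""
    return sorted(tuple(sorted(r)) for r in records)

# The same multiset listed in two different orders.
d1 = [{"c", "a"}, {"b"}, {"a", "c"}]
d2 = [{"b"}, {"a", "c"}, {"c", "a"}]
print(canonical_form(d1))                     # [('a', 'c'), ('a', 'c'), ('b',)]
print(canonical_form(d1) == canonical_form(d2))  # True: same multiset, same form
```

Building the prefix tree from this canonical sequence removes the representation ambiguity of Figure 4, so the distance between a multiset and itself is zero.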
Finally, the distance between the two datasets is the distance between the translated ordered trees. Since the translation from a multiset of sets based on lexicographic ordering is bijective, and since the tree edit distance is a distance, the proposed metric is a distance on datasets. With the tree edit distance, it is easy to evaluate how far apart two datasets are, and more specifically the original and the anonymized one.

Probabilistic Wheel Re-association for Disassociated Datasets
This section depicts the final motive of our study: how can we publish an anonymized set-valued dataset? First, the process of building a dataset to be published, arising from the privacy constraints of disassociation, is described. Then, the relationship between the result of the developed process and the tree edit distance is examined from a privacy point of view, together with the real usefulness of the metric.

Probabilistic Wheel Re-association Algorithm
The problem of publishing anonymized set-valued data is transformed into finding a neighbor dataset that shows true features of the original dataset while being distinct, hence faithful for data analysis. In this section, we propose an algorithm, probabilistic wheel re-association (Algorithm 5, given hereafter), to generate neighbor datasets. Roulette wheel selection is a probabilistic optimization method used for selecting potentially useful solutions for recombination. We profit from the disassociation technique to guide the re-association process. Disassociation by default creates ambiguity between associations present in different record chunks of a cluster, while preserving the accuracy of associations found in the same record chunk. This ambiguity between the record chunks is our playground for the probabilistic wheel re-association. The solution respects the result of disassociation on several levels:
• What has been disassociated into two distinct clusters should not be re-associated.

• Only associations from different record chunks of the same cluster can be re-associated.

• The associations preserved within a record chunk should not be altered or re-associated among themselves; they already passed the k^m-anonymity test for anonymization.
Algorithm 5 takes as input the disassociated dataset T* and starts by reconstructing a neighbor cluster C′_i of each cluster C_i from T* (Line 1). C′_i is the cluster containing the result of the gradual re-association between the record chunks of C_i. At first, C′_i is loaded with the records of the first record chunk R_C0 (Line 2). Furthermore, the original number of records in a cluster deeply affects the re-association, since empty records reveal the weakness of the associations between the record chunks. To keep the size of the initial cluster, we add to C′_i a number of empty sets representing what is left of the original cluster size not represented in R_C0 due to k^m-anonymity (Lines 3-5).
After initializing C~_i, the algorithm applies probabilistic wheel re-association between C~_i and, successively, every record chunk RC_j of C_i (Lines 6-17). Two records, r_1 and r_0, are chosen respectively from C~_i and from the record chunk in question, following the rules of the SelectRecord function (Lines 9-10). The SelectRecord function, Algorithm 6, takes a record chunk RC, generates the counts of the distinct itemsets in RC (Line 2), and then constructs the array of cumulative probabilities of the records, ACP (Line 3). A random number between 1 and 100 is generated (Line 4); the itemset whose cumulative probability is equal to or immediately greater than the generated random number is returned. The two selected itemsets, r_0 and r_1, are merged together (Line 11) and moved to the temporary cluster ReconstructedRecord (Line 12), until all the itemsets in RC_j have been merged with itemsets from C~_i in the same way (Lines 8-15). The generated itemsets in ReconstructedRecord are then added to the neighbor cluster C~_i for the merge with the next record chunk (Line 16).
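As an illustration, the SelectRecord logic can be sketched in Python. This is a hedged approximation of Algorithm 6: the function name, the internal variable names, and the exact layout of the cumulative-probability array are illustrative, not the paper's implementation.

```python
import random
from collections import Counter

def select_record(record_chunk):
    """Roulette-wheel selection of an itemset from a record chunk:
    count the distinct itemsets, build the array of cumulative
    probabilities (ACP) scaled to [1, 100], draw a random number, and
    return the first itemset whose cumulative probability reaches it."""
    counts = Counter(frozenset(r) for r in record_chunk)
    total = sum(counts.values())
    itemsets, acp = [], []
    cumulative = 0
    for itemset, count in counts.items():
        cumulative += count
        itemsets.append(itemset)
        acp.append(100 * cumulative / total)  # cumulative probability in %
    draw = random.uniform(1, 100)
    for itemset, threshold in zip(itemsets, acp):
        if threshold >= draw:
            return set(itemset)
    return set(itemsets[-1])  # guard against floating-point rounding
```

Itemsets that occur more often in the chunk occupy a wider slice of the wheel, so they are proportionally more likely to be selected for re-association.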
The union of all the generated clusters forms the neighbor dataset T~ of the disassociated dataset (Line 18). In Section 5, we present a set of experiments to evaluate the result in terms of neighborhood generation.

Algorithm 5 Probabilistic wheel re-association.
Input: T*
Output: T~
1: for each cluster C_i in T* do
2:    C~_i ← records of RC_0
3:    for l ← |RC_0| + 1 to |C_i| do
4:       add an empty record to C~_i
5:    end for
6:    for each record chunk RC_j of C_i such that j > 0 do
7:       ReconstructedRecord ← ∅
8:       while |RC_j| > 0 do
9:          r_1 ← SelectRecord(C~_i)
10:         r_0 ← SelectRecord(RC_j)
11:         r_0 ← r_0 ∪ r_1
12:         add r_0 to ReconstructedRecord
13:         remove r_1 from C~_i
14:         remove r_0 from RC_j
15:      end while
16:      C~_i ← C~_i ∪ ReconstructedRecord
17:   end for
18:   T~ ← T~ ∪ {C~_i}
19: end for
20: return T~
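To make the per-cluster flow concrete, here is a minimal Python sketch of Algorithm 5 for one cluster. Function and variable names are illustrative, not the paper's implementation. The sketch exploits the fact that roulette-wheel selection over distinct itemsets weighted by their counts is equivalent to drawing one record uniformly at random.

```python
import random

def select_record(records):
    # Roulette-wheel selection over distinct itemsets weighted by their
    # frequency is equivalent to a uniform draw over the records.
    return random.choice(records)

def reassociate_cluster(record_chunks, cluster_size):
    """Sketch of Algorithm 5 for one cluster: start from the first record
    chunk, pad with empty records up to the original cluster size, then
    merge every later chunk in via probabilistic selection."""
    neighbor = [set(r) for r in record_chunks[0]]
    neighbor += [set() for _ in range(cluster_size - len(neighbor))]
    for chunk in record_chunks[1:]:
        remaining = [set(r) for r in chunk]
        reconstructed = []
        while remaining:
            r1 = select_record(neighbor)
            r0 = select_record(remaining)
            neighbor.remove(r1)     # r1 leaves the pool once merged
            remaining.remove(r0)
            reconstructed.append(r0 | r1)
        neighbor.extend(reconstructed)
    return neighbor
```

Each pass keeps the cluster size constant: every record taken from RC_j consumes one record of the partial neighbor cluster and produces exactly one merged record.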

Analysis of Re-associated Datasets in Terms of Distance
How can a re-associated dataset be evaluated in terms of neighborhood generation? A suitable distance provides precise indicators of the similarity between objects. As presented in Section 3, the tree edit distance provides a solution to assess the similarity between set-valued datasets. Using this metric, it is easy to assess, first, the distance between two re-associated datasets and, second, the distance between a re-associated dataset and the original dataset. For two datasets to be neighbors, the distance between them should be small; in other words, few modifications should be needed to make them similar. A neighborhood is generated when multiple re-associated datasets are close to each other. It is interesting to investigate the process of neighborhood initiation by generating re-associated datasets using the same or distinct privacy constraints. However, using the tree edit distance to evaluate the distance between a generated dataset and the original one can cause a privacy breach in practice. Consider a very basic example where the distance is very low, almost zero: in this case, the public can be sure that they hold the original dataset. We stress, however, that with a large dataset and domain, it is probabilistically very rare to regenerate the original dataset. Therefore, the distance between a re-associated dataset and the original one should never be leaked to the public, since it would increase an attacker's background knowledge.
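For intuition, one simple way to encode a set-valued dataset as a tree that TED tools can consume is sketched below. This is an illustrative reading of the root/record/item formalization of Section 3, and the bracket notation follows the common APTED-style input format; the node labels are assumptions.

```python
def dataset_to_tree(dataset):
    """Encode a set-valued dataset as a tree in bracket notation:
    a root node 'D', one child 'r' per record, one leaf per item.
    Items are sorted so that equal records yield equal subtrees."""
    records = []
    for record in dataset:
        items = ''.join('{%s}' % item for item in sorted(record))
        records.append('{r%s}' % items)
    return '{D%s}' % ''.join(records)
```

Two such strings can then be fed to a tree edit distance implementation such as APTED [23,24].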
The next section presents experiments evaluating the neighborhood generation and the preservation of data utility for analysis. Although we are aware of the privacy threat that may occur when calculating the distance to the original dataset, we use the tree edit distance with the original dataset to prove the efficiency of the probabilistic wheel re-association algorithm in terms of neighborhood generation.

Experiments
In what follows, a set of experiments is conducted to evaluate the probabilistic wheel re-association algorithm on three scales:

•
The efficiency of the algorithm in neighborhood generation, using the tree edit distance.

•
The data privacy effect on the neighborhood generation.

•
The accuracy of knowledge discovery between the original dataset and the published dataset.
The following experiments are conducted on the same dataset, the BMS1 dataset (59,602 records), used in Section 2.6.

Evaluating the Neighborhood of Set-Valued Datasets
We start our investigation with the Probabilistic Wheel Re-association algorithm, to see how well it can generate a neighborhood from the original dataset. Despite our awareness of the privacy concerns arising from calculating the tree edit distance with the original dataset, our main objective in these experiments is to realistically evaluate the efficiency of the proposed algorithm, and this cannot be done without the original dataset. To evaluate the neighborhood of the generated datasets, disassociation is applied on BMS1 with different k, m, and δ constraints, and multiple re-associated datasets are then generated for each combination. The TED is then calculated between every re-associated dataset and the original one, using the implementation of APTED developed in [23,24].
The results of the following analysis are represented in two formats. First, Table 4 reports the average TED calculated between multiple re-associated datasets and the original dataset, as well as the standard deviation of the TED, under the same privacy constraints (same k, m, and δ). Second, Figure 5 is a visual representation of the calculated distances for five samples of each combination of k, m, and δ, to make better sense of the calculated distances. For a more concrete understanding, we plot the calculated TED relative to the average TED by δ, without using a scale. In the following subsections, we analyze the represented results to answer, respectively, the following questions:

•
Are datasets generated from the same disassociated result reflecting a neighborhood?

•
What is the effect of the privacy constraints on the distance?

The Neighborhood Generation
TED evaluates the transformations that a disassociated dataset has to undergo to transform itself into a re-associated dataset. In this analysis, we are searching for datasets that lie within the same distance range from the original dataset; this information is enough to distinguish neighborhoods of datasets. It is notable from the data in Table 4 that datasets generated with the same privacy constraints (k, m, δ), through the probabilistic wheel re-association algorithm, fall within the same distance range, where the TED is almost equivalent for the same k, m, δ combination. This means that, from the same disassociated dataset, we can generate datasets that need relatively the same amount of transformations to retrieve the starting point (i.e., the original dataset), since the standard deviation of the calculated TED is fairly small. If we look over the data represented in Figure 5, it immediately reflects the neighborhood generation, where datasets fall at close radii from the original dataset, especially when generated with the same privacy constraints. We can deduce that the probabilistic wheel re-association algorithm, through disassociation, is a reliable tool for generating neighbor datasets, as assessed using the tree edit distance.

Data Privacy and Neighborhood Distance
In this section, we investigate the effect of the level of privacy on the neighborhood generation. From the first analysis, we recognized that the same privacy constraints can generate re-associated datasets that lie at a close radius from the original dataset. What is strikingly interesting is that higher privacy (k = 4 and m = 4) generates neighbor datasets closer to the original dataset than lower privacy constraints (k = 2 and m = 2), as shown in Figure 5. This can be explained by the fact that higher privacy is harder to achieve while disassociating the original dataset. In fact, imposing an extra check on associations between items in a cluster forces dissimilar subsets to be separated into distinct record chunks, thus creating less diversity within a record chunk. Accordingly, the probabilistic wheel re-association algorithm has fewer diverse subsets to choose from when re-associating between record chunks. We can notice another behavior related to the maximum cluster size δ: a smaller δ generates datasets that are closer to the original dataset, because when fewer records are allowed in a cluster, diversity in terms of associations is minimized. Therefore, giving less choice to the probabilistic wheel re-association, by imposing higher privacy (k, m) and a lower number of records to re-associate (smaller δ), leads to generating datasets that are less distant from the original dataset.
To conclude, the above experiments prove the efficiency of the Probabilistic Wheel Re-association algorithm in generating different solutions while taking into account the privacy level imposed in the disassociation phase. The last question we investigated in our experiments is: what can we learn from those re-associated datasets compared to the original dataset? The next section tackles this problem and evaluates knowledge discovery on a neighbor dataset.

Neighbor Datasets and Knowledge Discovery
Creating a neighbor dataset is useless if it is not a synthetic representation of the original dataset. We dedicate this section to validating our approach in terms of data analysis faithfulness between the original dataset and its neighbors. To do so, we examine the widely used techniques for mining set-valued data: frequent itemsets and frequent association rules. Apriori is an algorithm for frequent itemset mining and association rule learning. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. We use the implementation of the Apriori algorithm in arules, developed in [25,26], to investigate the representation of frequent itemsets and association rules in neighbor datasets. In what follows, we take the average number of records of the above-generated neighbor datasets as the size of a representative neighbor dataset T~, with |T~| = 53,555, and we use BMS1 as the original dataset T, with a size |T| = 59,602.
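The level-wise principle just described can be sketched as follows. This is a minimal, illustrative Python version, not the arules implementation used in the experiments:

```python
def apriori(dataset, min_support):
    """Minimal level-wise Apriori sketch: start from frequent individual
    items and keep extending frequent itemsets by one item, as long as
    the extensions stay frequent. Returns {itemset: support}."""
    n = len(dataset)
    transactions = [frozenset(t) for t in dataset]

    def support(itemset):
        # Proportion of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    result = {s: support(s) for s in level}
    while level:
        # Candidate generation: extend each frequent itemset by one item.
        candidates = {s | {i} for s in level for i in items if i not in s}
        level = [c for c in candidates if support(c) >= min_support]
        result.update({c: support(c) for c in level})
    return result
```

The loop stops as soon as no extension of the current level remains frequent, which is the anti-monotonicity property that keeps Apriori tractable.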

Mining Frequent Itemsets
In order to go deeper in the analysis of the neighbor datasets, we turn our attention to the most frequent itemsets. The discovery of frequent itemsets helps decision-makers develop new strategies by gaining insights into which items are frequently found together. For this investigation, we use the support metric, defined as the proportion of transactions in the dataset D that contain an itemset X. Table 5 presents the itemsets with support greater than or equal to 0.008. We are able to extract 17 itemsets, representing the most frequent itemsets. To compare the accuracy of knowledge extraction between the original and re-associated datasets, we compared their ranks in terms of frequency. Despite the drop in the counts of those itemsets, we can notice that they reappear in the re-associated dataset in the same rank range, with mild changes in the ranking; they are still the 17 most frequent itemsets in the re-associated dataset. We further tested the most frequent three-itemsets, shown in Table 6. We could not detect their presence with the same minimum support, so we dropped it from 0.008 to 0.005, which means they are less frequent in the dataset. However, we found that even these less frequent itemsets can be retrieved in almost the same rank range. This shows that a neighbor dataset is a faithful representation of the original dataset, preserving the most frequent itemsets in the same rank range even with the drop in support.

Table 5. Frequent two-itemsets (minimum support = 0.008).

Table 6. Frequent three-itemsets (minimum support = 0.005).

           Count in T   Count in T~   Rank in T   Rank in T~
Itemset1      417           338           1            1
Itemset2      351           281           2            2
Itemset3      335           274           3            3
Itemset4      322           238           4            5
Itemset5      315           223           5            8
Itemset6      310           231           6            1
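The rank comparison used in Tables 5 and 6 can be sketched as follows (an illustrative helper, not part of the paper's tooling):

```python
def rank_itemsets(supports):
    """Rank itemsets by descending support: given {itemset: support},
    return {itemset: rank} with rank 1 for the most frequent itemset."""
    ordered = sorted(supports, key=supports.get, reverse=True)
    return {itemset: rank for rank, itemset in enumerate(ordered, start=1)}
```

Computing this mapping once on the original dataset and once on a neighbor dataset makes the "same rank range" comparison above a direct dictionary comparison.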

Mining Association Rules
The discovery of association rules draws another view on the neighbor datasets. In their default definition, association rules represent a correlation between itemsets. An association rule is defined as an implication of the form X ⇒ Y, where X and Y are two disjoint itemsets, X ∩ Y = ∅. It illustrates a concept of correlation between the two itemsets, where the presence of itemset X indicates a strong presence of itemset Y. The discovery of association rules is a more complex characterization of the data that depends fundamentally on the discovery of frequent itemsets. We searched for the most frequent association rules by their support, which is relative to their count and the size of the dataset. For this experiment, we extracted the 50 most frequent association rules in the original dataset with a support greater than 0.001. Then, we calculated the average support of those association rules in the generated re-associated datasets T~.
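The support of a rule, as used in this experiment, can be sketched as follows (a minimal illustrative version; the name is ours):

```python
def rule_support(dataset, antecedent, consequent):
    """Support of the rule X => Y: the fraction of records that contain
    X ∪ Y, where X and Y must be disjoint itemsets."""
    x, y = set(antecedent), set(consequent)
    assert x.isdisjoint(y)  # X ∩ Y = ∅ by definition
    both = x | y
    return sum(1 for r in dataset if both <= set(r)) / len(dataset)
```

Averaging this quantity over several re-associated datasets gives the per-rule values plotted in Figure 6.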
From Figure 6, we can see that the support of the most frequent association rules is well preserved through probabilistic wheel re-association. This shows that neighbor datasets are a faithful representation of the original dataset, reflecting close support values. We can notice a loss of 5 association rules (support = 0); in fact, this is not a real loss: it reflects association rules whose support falls below the minimum threshold (0.001) after re-association. However, if the need is to preserve predefined associations, the ant-driven clustering investigated thoroughly in Section 2 can be used.

Conclusion
Anonymization is challenging when data must be analyzed, studied, and explored afterwards. Unstructured data complicates the anonymization process since, in a very general way, data items may vary tremendously in structure and value without following a pattern. Set-valued data is a type of unstructured data representing multisets of sets. In this work, we are interested in publishing an anonymized set-valued dataset that is ready for future accurate analysis. In this context, set-valued data is considered as isolated data, without temporality and without semantics. This scenario is a generalization of specific data values, which can be considered for a broader investigation of set-valued data. Disassociation is an anonymization technique for set-valued datasets, presented in [2]. It guarantees k m -anonymity for associations without altering the values of the data items. It proceeds by grouping the data records into different clusters and then vertically separating the items of the records when k m -anonymity is not verified, thus creating ambiguity at the level of the associations for the items that are vertically separated in a cluster. This paper is an extension of the work done in [1], where the utility-privacy trade-off in a disassociated dataset is studied in depth. The loss of associations for aggregate analysis is considered in the theoretical study. We came to the conclusion that the loss of utility is directly linked to the clustering process of disassociation. Driven by this problem, we proposed UGAC in [1] to drive the clustering process for a set of records representing predefined utility rules.
As a continuation of the previous work, we tackle the problem of knowledge extraction in a disassociated dataset. We know from the aggregate analysis that it is hard to evaluate itemsets that are split over multiple record chunks. The problem becomes: how can we re-associate the itemsets from the disassociated result while staying faithful to both anonymization and knowledge extraction in set-valued datasets?
To solve this problem, we define a general notion of similarity between datasets: neighbor datasets. Neighbor datasets are datasets that are not copies but synthetic representations that lead to trustworthy data analysis. To standardize our notion of a neighborhood, we need a distance to assess it. To the best of our knowledge, there exists no specific metric for set-valued data, which is mathematically defined as a multiset of sets. Our first contribution in this work is the formalization of the datasets into trees and the use of the tree edit distance, which calculates the number of transformations needed to move from one tree to another. This way, we are able to calculate a distance between two multisets of sets, also known as datasets. Our second contribution is an algorithm that intuitively generates neighbor datasets. We propose a probabilistic wheel re-association algorithm to generate a re-associated dataset from the result of the disassociation of the original dataset.
Finally, we test our utility-guided ant-based clustering and probabilistic wheel re-association algorithms to evaluate their efficiency, especially for knowledge extraction and data analysis in the context of general set-valued data. From the experiments, we can see that re-associated datasets create a neighborhood around the original dataset that depends on the privacy level imposed by disassociation. Despite the perturbation and noise added to the support of itemsets, probabilistic wheel re-association is able to generate synthetic datasets respecting the overall representation of the itemsets and association rules in the data. This is extremely interesting for prediction and decision-making analysis, as statistical exploration will lead to very close hypotheses. On another side, when applying the same privacy constraints for disassociating the original dataset, the probabilistic wheel re-association algorithm generates re-associated datasets at almost the same distance from the original dataset. This reflects the fact that our approach respects the privacy imposed throughout the process. We can conclude that probabilistic wheel re-association is a faithful algorithm for knowledge extraction and data analysis over anonymized set-valued datasets. The preservation of predefined utility rules is necessary when we want to ensure their representation above a threshold. We ran a set of experiments with the utility-guided ant-based clustering algorithm, UGAC, to see how well it can preserve the utility rules. UGAC is compared with the classical clustering technique, k-means, and with normal horizontal disassociation, for various properties of utility rules. The results show that UGAC, compared to the other two solutions, is able to decrease the information loss for the utility rules without increasing the information loss of the other associations in the cluster.
Combining the two solutions, we can say that knowledge extraction and data analysis remain exceptionally valid on anonymized datasets when they are transformed into neighbors of the original dataset.
In future work, we intend to use the tree edit distance on the set of generated neighbor datasets to decide which re-association is the most representative for publication, i.e., the centroid, or in other words the dataset with the least total distance to all the other datasets. We will also investigate our approach with other anonymization techniques for publishing datasets and generalize the mathematical evaluation of the utility-privacy trade-off for future uses of the dataset in machine learning algorithms.