Iterative Group Decomposition for Refining Microaggregation Solutions

Microaggregation refers to partitioning n given records into groups of at least k records each to minimize the sum of the within-group squared error. Because microaggregation is non-deterministic polynomial-time hard (NP-hard) for multivariate data, most existing approaches are heuristic-based and derive a solution within a reasonable timeframe. We propose an algorithm for refining the solutions generated by existing microaggregation approaches. The proposed algorithm refines a solution by iteratively either decomposing or shrinking the groups in the solution. Experimental results demonstrate that the proposed algorithm effectively reduces the information loss of a solution.


Introduction
Protection of publicly released microdata from individual identification is a primary societal concern. Therefore, statistical disclosure control (SDC) is often applied to microdata before releasing the data publicly [1,2]. Microaggregation is an SDC method that partitions a dataset into groups of at least k records each and replaces the records in each group with the centroid of the group. The resulting dataset satisfies the "k-anonymity constraint," thus protecting data privacy [3]. However, replacing a record with its group centroid results in information loss, and the amount of information loss is commonly used to evaluate the effectiveness of a microaggregation method.
A constrained clustering problem underlies microaggregation, in which the objective is to minimize information loss and the constraint is that each group of records contains no fewer than k records. This problem can be solved in polynomial time for univariate data [4]; however, it has been proved non-deterministic polynomial-time hard (NP-hard) for multivariate data [5]. Therefore, most existing approaches for multivariate data are heuristic-based and derive a solution within a reasonable timeframe; consequently, no single microaggregation method outperforms the others for all datasets and k values.
Numerous microaggregation methods have been proposed, e.g., the Maximum Distance to Average Vector (MDAV) [6], Diameter-Based Fixed-Size (DBFS) [7], Centroid-Based Fixed-Size (CBFS) [7], Two Fixed Reference Points (TFRP) [8], Multivariate Hansen-Mukherjee (MHM) [9], Density-Based Algorithm [10], Successive Group Minimization Selection (GSMS) [11], and Fast Data-oriented Microaggregation [12]. For a given dataset and an integer k, each of them generates a solution that satisfies the k-anonymity constraint while trying to minimize the information loss. A few recent studies have focused on refining the solutions generated using existing microaggregation methods [13][14][15][16]. The most widely used method for refining a microaggregation solution is to check, for each group of records in the solution, whether decomposing the group by adding its records to other groups can reduce the information loss of the solution. This method, referred to as TFRP2 in this paper, was originally used in the second phase of the TFRP method [8] and has subsequently been adopted by many microaggregation approaches [10,11].
Because the above microaggregation approaches are based on simple heuristics and do not always yield satisfactory solutions, there is room to improve their results. Our aim here is to develop an algorithm for refining the results of the existing approaches. The developed algorithm should help the existing approaches to reduce the information loss further.
The remainder of this paper is organized as follows. Section 2 defines the microaggregation problem. Section 3 reviews relevant studies on microaggregation approaches. Section 4 presents the proposed algorithm for refining a microaggregation solution. The experimental results are discussed in Section 5. Finally, conclusions are presented in Section 6.

Microaggregation Problem
Consider a dataset D of n points (records), x_i, i ∈ {1, ..., n}, in the d-dimensional space. For a given positive integer k ≤ n, the microaggregation problem is to derive a partition P of D such that |p| ≥ k for each group p ∈ P and SSE(P) is minimized. Here, SSE(P) denotes the sum of the within-group squared error of all groups in P and is calculated as follows:

$$\mathrm{SSE}(P) = \sum_{p \in P} \sum_{x \in p} (x - \bar{x}_p)^{T} (x - \bar{x}_p),$$

where $\bar{x}_p$ is the centroid of group p. The information loss incurred by the partition P is denoted as IL(P) and is calculated as follows:

$$\mathrm{IL}(P) = \frac{\mathrm{SSE}(P)}{\mathrm{SST}(D)}, \qquad \mathrm{SST}(D) = \sum_{x \in D} (x - \bar{x})^{T} (x - \bar{x}),$$

where $\bar{x}$ is the centroid of D. Because SST(D) is fixed for a given dataset D, regardless of how D is partitioned, minimizing SSE(P) is equivalent to minimizing IL(P). Furthermore, if a group contains 2k or more points, it can be split into two or more groups, each with k or more points, to reduce information loss. Thus, in an optimal partition, each group contains at most 2k − 1 points [17].
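The quantities above can be checked on a toy partition. The following is a minimal sketch in plain Python, assuming that SST(D) is the squared error of the whole dataset about its centroid and that IL(P) = SSE(P)/SST(D); the function names and the four example points are illustrative and do not come from the paper.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def squared_error(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(x, c)) for x in points)


def sse(partition):
    """SSE(P): within-group squared error summed over all groups."""
    return sum(squared_error(group) for group in partition)


def information_loss(partition):
    """IL(P) = SSE(P) / SST(D), where D is the union of all groups."""
    dataset = [x for group in partition for x in group]
    return sse(partition) / squared_error(dataset)


# Hypothetical example: four 2-D records in two groups (k = 2).
partition = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
print(sse(partition))               # 4.0
print(information_loss(partition))  # 0.2
```

Here each group contributes a squared error of 2, and the whole dataset has SST(D) = 20, so 20% of the total variation is lost by replacing records with their group centroids.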
This study proposed an algorithm for refining the solutions generated using the existing microaggregation methods. The algorithm reduces the information loss of a solution by either decomposing or shrinking a group in the solution. Experimental results obtained using the standard benchmark datasets show that the proposed algorithm effectively improves the solutions generated using state-of-the-art microaggregation approaches.

Microaggregation Approaches
Many microaggregation approaches are based on fixed-size heuristics in which groups of size k are iteratively built around selected records [6][7][8]10,11]. These approaches mainly differ in two aspects: the selection of the first record for each group and the formation of a group of size k from the selected record.
Let T denote the remaining unpartitioned records in D. Initially, T = D. Common choices for the first record of a new group are as follows: the record r in T furthest from the centroid of T (e.g., MDAV [6] and CBFS [7]), the record furthest from r (e.g., MDAV [6]), the two records in T most distant from each other (e.g., DBFS [7]), and the two records furthest from the two fixed reference points of D (e.g., TFRP [8]), where the reference points are determined from the maximal and minimal values over all attributes in D, respectively.
Once the first record of a group is determined, one of the following two methods is commonly adopted to grow the group to size k. The first method (referred to as Nearest Neighbors and denoted as NN) is to form a group with the selected record and its k − 1 nearest neighbors in T [6,8,10,11]. The other (referred to as Nearest to Center and denoted as NC) is to iteratively update the centroid of the group and add the record in T that is nearest to the centroid of the group until the size of the group reaches k [7]. The first method is faster, whereas the second is inclined towards minimizing the SSE of the generated group.
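The difference between the NN and NC growth rules can be seen on a small one-dimensional example. The sketch below is an illustration under stated assumptions, not the implementation used in the cited methods: `grow_nn`, `grow_nc`, and the sample records are hypothetical, and ties are broken by list order.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def squared_error(points):
    c = centroid(points)
    return sum(sq_dist(x, c) for x in points)


def grow_nn(seed, remaining, k):
    """NN: the seed plus its k - 1 nearest neighbors among the remaining records."""
    others = [x for x in remaining if x is not seed]
    others.sort(key=lambda x: sq_dist(x, seed))
    return [seed] + others[:k - 1]


def grow_nc(seed, remaining, k):
    """NC: repeatedly add the record nearest to the current group centroid."""
    group = [seed]
    pool = [x for x in remaining if x is not seed]
    while len(group) < k and pool:
        c = centroid(group)  # centroid is recomputed after every addition
        nearest = min(pool, key=lambda x: sq_dist(x, c))
        pool.remove(nearest)
        group.append(nearest)
    return group


# Hypothetical unassigned records T and first record (seed), with k = 3.
T = [(0.0,), (3.0,), (-4.0,), (5.0,)]
seed = T[0]
nn_group = grow_nn(seed, T, 3)  # [(0.0,), (3.0,), (-4.0,)]
nc_group = grow_nc(seed, T, 3)  # [(0.0,), (3.0,), (5.0,)]
print(squared_error(nn_group) > squared_error(nc_group))  # True
```

In this example NN picks the two records closest to the seed, which lie on opposite sides of it, whereas NC follows the moving centroid and ends with a more compact group. NC recomputes a centroid at every step, which is why it is the slower of the two.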
GSMS [11], a fixed-size approach, builds groups of size k in a different way. Instead of choosing a record and then forming a group, it forms a candidate group for each record in T by using the record and its k − 1 nearest records. Subsequently, the candidate group p with the smallest SSE(p) + SSE(T/p) is selected, where T/p denotes the set difference of T and p. A priority queue is maintained for each record in T to update all candidate groups rapidly. However, such an arrangement could result in high space complexity.
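A single selection step of this kind can be sketched as follows. This is a naive illustration that recomputes every candidate from scratch and omits the per-record priority queues described above; `select_candidate` and the sample records are hypothetical names and data, and the SSE of an empty remainder is assumed to be zero.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def squared_error(points):
    """Squared error about the centroid; defined as 0 for an empty list."""
    if not points:
        return 0.0
    n = len(points)
    c = tuple(sum(col) / n for col in zip(*points))
    return sum(sq_dist(x, c) for x in points)


def select_candidate(T, k):
    """Return the candidate group p minimizing SSE(p) + SSE(T without p)."""
    best_group, best_cost = None, float("inf")
    for i, r in enumerate(T):
        # Candidate group of record r: r plus its k - 1 nearest records in T.
        others = sorted((j for j in range(len(T)) if j != i),
                        key=lambda j: sq_dist(T[j], r))
        members = {i, *others[:k - 1]}
        group = [T[j] for j in sorted(members)]
        rest = [T[j] for j in range(len(T)) if j not in members]
        cost = squared_error(group) + squared_error(rest)
        if cost < best_cost:
            best_group, best_cost = group, cost
    return best_group, best_cost


# Hypothetical 1-D records with k = 2.
T = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
group, cost = select_candidate(T, 2)
print(group, cost)  # [(10.0,), (11.0,)] 5.5
```

The criterion favors removing the isolated pair first, because doing so leaves the most compact remainder. Each step of this naive version examines every record and sorts the others, which is the cost that the priority queues in GSMS are meant to avoid.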
To speed up the fixed-size approaches, one can either use an efficient way to pick the first record of each group [18][19][20] or reduce the size of the dataset to be processed [21]. In Refs. [18][19][20], the authors first sorted all records by an auxiliary attribute. To build a new group, they used the first unassigned record in this ordering as the first record of the new group, and then grew the new group to size k by adding the first record's k − 1 nearest neighbors. In Ref. [21], the authors first selected several attributes that have a high mutual information measure with other attributes. Then, they applied MDAV on the projection of the dataset on those selected attributes. Finally, the partition results were extended to all attributes to calculate each group's centroid.
In addition to fixed-size heuristics, many microaggregation approaches were derived using methods not originally designed for the multivariate microaggregation problem. For example, MDAV-MHM [9] first adopted heuristics (e.g., MDAV) to order the multivariate records and then used the Hansen-Mukherjee method (a microaggregation approach for univariate data [4]) to partition the data according to this ordering. Other examples include Ref. [7] and Ref. [22], which extended the minimal spanning tree partitioning algorithm [23], and Ref. [17], which extended Ward's agglomerative hierarchical clustering algorithm [24] to the microaggregation problem.
Ref. [28] proposed a microaggregation method to steer the microaggregation process such that the desired privacy constraints were satisfied.

Refining Approaches
Most approaches for refining a microaggregation solution involved iteratively generating new solutions and searching for possible improvements [13][14][15][16]. Clustering algorithms, such as k-means and h-means, were adopted in [13] to modify a microaggregation solution. This approach used a pattern search algorithm to search for the appropriate value of a parameter.
Iterative MHM, proposed in Ref. [14], built groups of a microaggregation solution according to constrained clustering and linear programming relaxation and then fine-tuned the results using an integrated iterative approach. Its solution was built on the microaggregation solution generated using the MDAV method, and the number of groups in the solution was determined in the same manner as in MDAV-MHM [9].
An iterative local search approach [15] was proposed to refine a microaggregation solution. During the local search, the microaggregation solution was improved by swapping a record between two groups and shifting a record from one group to another. Furthermore, to explore the results using a different number of groups, the number of groups in a solution was updated randomly within an optimal range in each iteration. In addition, the "dissolve" operation (same as the TFRP2 method described in Section 1) and the "distill" operation (i.e., forming a new group by removing records from groups with more than k records) were used to adjust the number of groups in the solution.
Similar to Ref. [15], another iterative local search approach proposed in [16] used swapping and shifting of the records between two groups to refine a microaggregation solution. This method expanded the search space by allowing more than one swapping or shifting in each iteration of the local search.

Proposed Algorithm
Figure 1 shows the pseudocode of the proposed algorithm for refining a microaggregation solution. The input to the algorithm is a partition P of a dataset D generated using a fixed-size microaggregation approach, such as CBFS, MDAV, TFRP, or GSMS. Because a fixed-size microaggregation approach repeatedly generates groups of size k, it maximizes the number of groups in its solution P. By using this property, our proposed algorithm focuses only on reducing the number of groups rather than randomly increasing and decreasing the number of groups, as in Ref. [15].
The proposed algorithm repeats two basic operations until neither operation can yield a new, improved partition. The first operation, Decompose (line 4; Figure 1), fully decomposes a group into other groups if the resulting partition reduces the SSE (Figure 2). This operation is similar to the TFRP2 method described in Section 1. For each group p, this operation checks whether moving each record in p to its nearest group reduces the SSE of the solution.
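The Decompose check described above can be sketched in a few lines. This is a simplified illustration rather than the pseudocode of Figure 2: it assumes that the "nearest group" of a record is the group with the nearest centroid, that centroids are held fixed while one group is being dissolved, and it leaves out the subsequent splitting of over-sized groups. The function names and the example partition are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(partition):
    total = 0.0
    for g in partition:
        c = centroid(g)
        total += sum(sq_dist(x, c) for x in g)
    return total


def decompose(partition):
    """One pass: dissolve a group whenever re-assigning its records lowers the SSE."""
    partition = [list(g) for g in partition]
    i = 0
    while i < len(partition) and len(partition) > 1:
        others = [list(g) for j, g in enumerate(partition) if j != i]
        cents = [centroid(g) for g in others]  # fixed during this trial
        for x in partition[i]:
            j = min(range(len(others)), key=lambda j: sq_dist(x, cents[j]))
            others[j].append(x)
        if sse(others) < sse(partition):
            partition = others  # group i is gone; index i now holds the next group
        else:
            i += 1
    return partition


# Hypothetical 1-D partition with k = 3; the middle group straddles a gap.
P = [[(0.0,), (1.0,), (2.0,)],
     [(3.0,), (4.0,), (20.0,)],
     [(21.0,), (22.0,), (23.0,)]]
result = decompose(P)
print(sse(P), sse(result))  # 186.0 15.0
```

Because records are only ever added to the surviving groups, every group still holds at least k records after the pass, although a group may now hold 2k or more.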
Because the Decompose operation could result in groups with 2k or more records, at its completion (line 9; Figure 2), this operation calls the SplitLargeGroups function to split any group with 2k or more records into several new groups such that the number of records in each new group is between k and 2k − 1. The SplitLargeGroups function (Figure 3) follows the CBFS method [7]. For any group p with 2k or more records, this function finds the record r ∈ p most distant from the centroid of p and forms a new group p_r = {r} (lines 4-7; Figure 3). It then repeatedly adds to p_r the record in p nearest to the centroid of p_r until |p_r| = k (lines 8-12; Figure 3). This process is repeated to generate new groups until |p| ≤ k (lines 3-14; Figure 3). The remaining records in p are added to their nearest groups (lines 15-17; Figure 3).
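The splitting step can be sketched as follows for a single over-sized group. The sketch rests on two assumptions that the text does not settle: new groups are carved out for as long as at least k records remain in p, and the leftover records (fewer than k) are attached to the nearest of the newly created groups. The name `split_large_group` and the example data are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def split_large_group(p, k):
    """Split one group into new groups of between k and 2k - 1 records each."""
    p = list(p)
    new_groups = []
    while len(p) >= k:  # assumed stopping rule: carve while k records remain
        c = centroid(p)
        r = max(p, key=lambda x: sq_dist(x, c))  # record furthest from centroid of p
        p.remove(r)
        g = [r]
        while len(g) < k:
            cg = centroid(g)  # centroid of the growing group p_r
            nearest = min(p, key=lambda x: sq_dist(x, cg))
            p.remove(nearest)
            g.append(nearest)
        new_groups.append(g)
    for x in p:  # fewer than k leftover records
        target = min(new_groups, key=lambda g: sq_dist(x, centroid(g)))
        target.append(x)
    return new_groups


# Hypothetical group of five 1-D records with k = 2 (so 2k = 4 <= 5).
groups = split_large_group([(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)], 2)
print(groups)  # [[(11.0,), (10.0,)], [(0.0,), (1.0,), (2.0,)]]
```

Under these assumptions each new group starts with exactly k records and receives at most k − 1 leftovers, so its size stays between k and 2k − 1.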
The second operation, Shrink (line 5; Figure 1), shrinks any group with more than k records (Figure 4). For any group p with more than k records, this operation searches for and moves the record x_min ∈ p such that moving x_min to another group reduces the SSE the most (lines 3-14; Figure 4). This process is repeated until p has only k records remaining or the resulting partition cannot further reduce the SSE (lines 2-15; Figure 4). Similar to the Decompose operation, the Shrink operation can result in groups with 2k or more records and therefore calls the SplitLargeGroups function to split these over-sized groups (line 17; Figure 4).
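A brute-force sketch of the Shrink step is given below. It recomputes the squared error of the two affected groups for every candidate move, which is simpler but slower than the nearest-group search described above, and it omits the final splitting of over-sized groups. The function names, the numerical tolerance, and the example partition are assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def squared_error(points):
    n = len(points)
    c = tuple(sum(col) / n for col in zip(*points))
    return sum(sq_dist(x, c) for x in points)


def shrink(partition, k):
    """One pass: move records out of groups with more than k records while the SSE drops."""
    partition = [list(g) for g in partition]
    for i, p in enumerate(partition):
        while len(p) > k:
            best = None  # (gain, index of record in p, index of target group)
            for idx, x in enumerate(p):
                reduced = p[:idx] + p[idx + 1:]
                for j, q in enumerate(partition):
                    if j == i:
                        continue
                    # SSE reduction obtained by moving x from p to q.
                    gain = (squared_error(p) + squared_error(q)
                            - squared_error(reduced) - squared_error(q + [x]))
                    if gain > 1e-12 and (best is None or gain > best[0]):
                        best = (gain, idx, j)
            if best is None:
                break  # no move out of p lowers the SSE
            _, idx, j = best
            partition[j].append(p.pop(idx))
    return partition


# Hypothetical 1-D partition with k = 3; the second group holds one stray record.
P = [[(0.0,), (1.0,), (2.0,)], [(4.0,), (9.0,), (10.0,), (11.0,)]]
result = shrink(P, 3)
print(result)  # [[(0.0,), (1.0,), (2.0,), (4.0,)], [(9.0,), (10.0,), (11.0,)]]
```

In this example the SSE falls from 31 to 10.75 by moving the record at 4 into the first group, after which the second group has exactly k records and the loop stops.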
Notably, at most n/k groups are present in a solution. Because lines 1-8 in Figure 2 and lines 1-16 in Figure 4 require searching for the nearest group of each record, their time complexity is O(n^2/k). The time complexity of the SplitLargeGroups function is O(k^2 × n/k) = O(kn). Thus, summing these bounds, an iteration of the Decompose and Shrink operations (lines 3-5; Figure 1) takes O(n^2/k + kn) time.
The proposed algorithm differs from previous work in two respects. First, the Shrink operation explores more opportunities for reducing the SSE. Figure 5a gives an example. The upper part of Figure 5a shows the partition of 12 records (represented by small circles) generated by MDAV for k = 3. First, the Decompose operation decomposes group p_3 and merges its content into groups p_2 and p_4, as shown in the middle part of Figure 5a. At this moment, the Decompose operation cannot further reduce the SSE of the partition result. However, the Shrink operation can reduce the SSE by moving a record from group p_2 to group p_1, as shown in the bottom part of Figure 5a.
Second, previous work performs the Decompose operation only once and ignores the fact that, after the Decompose operation, the grouping of records may have changed and, consequently, new opportunities for reducing the SSE may appear [8,10,11]. Thus, the proposed algorithm repeatedly performs both the Decompose and Shrink operations to explore such possibilities until it cannot improve the SSE any further. Figure 5b gives an example. The upper part of Figure 5b shows the partition of 13 records generated by MDAV for k = 3. At first, the Decompose operation can only reduce the SSE by decomposing group p_3 and merging its content into groups p_2 and p_4. Because group p_2 now has 2k or more records, it is split into two groups, p_21 and p_22, as shown in the middle part of Figure 5b. The emergence of the group p_21 provides an opportunity to further reduce the SSE by decomposing group p_21 and merging its content into groups p_1 and p_22, as shown in the bottom part of Figure 5b.
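The repeat-until-no-improvement control flow can be sketched independently of the two operations. In the sketch below, `refine` takes the operations and the cost function as parameters; the two lambdas are deliberately trivial stand-ins on integers, not Decompose and Shrink, and serve only to make the loop runnable on its own.

```python
def refine(solution, operations, cost):
    """Apply each operation in turn and repeat while any pass lowers the cost."""
    best = cost(solution)
    improved = True
    while improved:
        improved = False
        for op in operations:
            candidate = op(solution)
            c = cost(candidate)
            if c < best:  # keep a candidate only if it is strictly better
                solution, best, improved = candidate, c, True
    return solution


# Stand-in operations (assumptions for illustration only): each lowers an
# integer "cost" when it can and returns its input unchanged otherwise.
big_step = lambda v: v - 3 if v >= 3 else v
small_step = lambda v: v - 1 if v >= 1 else v

print(refine(10, [big_step, small_step], cost=lambda v: v))  # 0
```

With real operations, `operations` would hold the Decompose and Shrink steps and `cost` would be SSE(P). The loop ends after the first full pass in which neither operation produces a better partition, which is the stopping rule described above.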

Datasets
Three datasets commonly used for testing microaggregation performance were adopted in this experiment [9]: Tarragona (834 records in a 13-dimensional space), Census (1080 records in a 13-dimensional space), and EIA (4092 records with 11 numerical attributes). The original EIA dataset contained 15 attributes, out of which 13 were numeric. We discarded two of the numeric attributes (YEAR and MONTH) and used the remaining 11 numeric attributes. Although there is no consensus on which attributes should be used, the aforementioned settings are the most widely adopted [7][8][9]11,13,14,29]. However, differences exist; for example, Ref. [15] used only 10 attributes from both the Census and EIA datasets. When comparing the experimental results from previous studies, one should check whether the same set of attributes was used in the experiments.
Consistent with most studies, before microaggregation, each attribute in each dataset is normalized to have zero mean and unit variance. This normalization step ensures that no single attribute has a disproportionate effect on the microaggregation results. In Ref. [30], the theoretical bounds of information loss for these three datasets were derived. Without this normalization step, the reported information loss can fall below the theoretical bound derived in Ref. [30] (e.g., as in Ref. [19]).
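The normalization step can be sketched with the Python standard library. The sketch assumes the population standard deviation (the text does not say whether the population or sample variance is used) and assumes that no attribute is constant; `standardize` and the sample records are hypothetical.

```python
from statistics import mean, pstdev


def standardize(records):
    """Rescale every attribute (column) to zero mean and unit variance."""
    columns = list(zip(*records))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # assumes no constant attribute
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))
            for row in records]


# Hypothetical records with two attributes on very different scales.
records = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
rows = standardize(records)

# After standardization both attributes contribute equally to distances.
total = sum(v * v for row in rows for v in row)
print(round(total, 9))  # 6.0
```

With the population variance, the total squared error of the standardized dataset about its centroid equals n × d (here 3 × 2 = 6), so every attribute contributes the same share to SST(D).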

Experimental Settings
As described in Section 4, the proposed algorithm refines the solution generated using a fixed-size microaggregation approach. Seven fixed-size microaggregation approaches were adopted in this study: CBFS-NN, CBFS-NC, MDAV-NN, MDAV-NC, TFRP-NN, TFRP-NC and GSMS-NN. The prefix (i.e., CBFS, TFRP, MDAV, or GSMS, described in Section 3.1) indicates the heuristic used to select the first record of each group, and the suffix (i.e., NN or NC, described in Section 3.1) indicates the method used to grow the selected record to a group of k records. NN refers to forming a group using the selected record and its k − 1 nearest neighbors, and NC refers to forming a group by iteratively updating the centroid of the group and adding the record nearest to the centroid until the size of the group reaches k. MDAV-NN, CBFS-NC, TFRP-NN, and GSMS-NN are the same as MDAV [6], CBFS [7], TFRP [8], and GSMS [11] in the literature, respectively.
Furthermore, we did not extend GSMS to GSMS-NC because maintaining the candidate groups in GSMS by using the NC method was too costly. Recall from Section 3.1 that the original GSMS [11] (referred to as GSMS-NN in this paper) needs to maintain a candidate group for each record, which contains the record and its k − 1 nearest unassigned neighbors. In each iteration, GSMS chooses one candidate group as a part of the final partition and updates the content of the other candidate groups to exclude the records in the selected candidate group. Because repeatedly updating the content of these candidate groups is time-consuming, GSMS maintains a priority queue for each record r that sorts all of the records by their distances to r. Thus, the original GSMS essentially uses the NN method to form each group. If GSMS adopts the NC method to form each group, then the priority queue technique is no longer feasible because the NC method is based on the distances to the groups' centroids, not to a fixed record r as in the NN method. That is, each time a group's centroid changes, the corresponding priority queue must be rebuilt, making GSMS-NC an inefficient alternative.

Experimental Results
Tables 2-4 show the information loss using each method in Table 1 for different values of k in the Tarragona, Census, and EIA datasets, respectively. For brevity, we refer to those methods without applying any refinement heuristic as the Unrefined methods (i.e., all method names without a suffix "2" or "3" in the first column of Table 1), and those methods applying the TFRP2 heuristic to refine a solution as the TFRP2 methods (i.e., all method names with a suffix "2"). Although the TFRP2 methods always achieved a lower information loss than their corresponding Unrefined methods did on the Census and EIA datasets, our methods (i.e., all method names with a suffix "3") could yield an even lower information loss than the TFRP2 methods did on these two datasets (Tables 3 and 4).
The italicized entries in Table 2 indicate that, for the Tarragona dataset, the TFRP2 methods failed to improve the solutions provided by the Unrefined methods in five cases (i.e., CBFS-NN, CBFS-NC and GSMS-NN at k = 3, and MDAV-NN at k = 5 and 10). In contrast, our methods failed to improve the Unrefined methods in only three cases (i.e., CBFS-NN, CBFS-NC and GSMS-NN at k = 3). Therefore, our methods are more effective than the TFRP2 methods in refining the solutions of the Unrefined methods. The best result for each k value is shown in bold in Tables 2-4. In Table 5, our best results from Tables 2-4 are compared with the best results of the almost quadratic time microaggregation algorithms from [11]. In all cases, less information loss was observed using our methods. Furthermore, for the Census dataset with k = 10 and for the EIA dataset with k = 5 or 10, the best solutions obtained by our methods in Table 5 are the best reported in the literature [11,13].
Table 6 indicates the cases in which our methods yielded a lower information loss than those in [11]. Among the seven methods, both CBFS-NC3 and GSMS-NN3 outperformed the results from [11] for six of the nine cases. Table 5 confirms that CBFS-NC3 and GSMS-NN3 also yielded the best results for two and three of the nine cases, respectively.
Table 6. The cases in which our methods yield lower information loss than the best results from Ref. [11].

Conclusions
In this paper, we proposed an algorithm to effectively refine the solution generated using a fixed-size microaggregation approach. Although the fixed-size approaches (i.e., methods without a suffix "2" or "3") do not always generate an ideal solution, the experimental results in Tables 2-4 showed that the refinement methods (i.e., methods with a suffix "2" or "3") help reduce the information loss of the results of the fixed-size approaches. Moreover, our proposed refinement methods (i.e., methods with a suffix "3") can further reduce the information loss of the TFRP2 refinement methods (i.e., methods with a suffix "2") and yield an information loss lower than those reported in the literature [11].
The TFRP2 refinement heuristic checks each group for the opportunity of reducing the information loss via decomposing the group. Our proposed algorithm (Figure 1) can discover more such opportunities than the TFRP2 refinement heuristic does because the proposed algorithm can not only decompose but also shrink a group. Moreover, the TFRP2 refinement heuristic checks each group only once, but our proposed algorithm checks each group more than once. Because one refinement step could enable another refinement step that was not available initially, our proposed algorithm is more effective in reducing the information loss than the TFRP2 refinement heuristic is.
The proposed algorithm is essentially a local search method within the feasible domain of the solution space. In other words, we refined a solution while enforcing the k-anonymity constraint (i.e., each group in a solution contains no fewer than k records). However, the local search method could still be trapped in local optima. A possible solution is to allow the local search method to temporarily step out of the feasible domain. Another possible solution is to allow the information loss to increase within a local search step but at a low probability, similar to the simulated annealing algorithms. The extension of the local search method warrants further research.

Figure 5. Two examples. (a) A Shrink operation after a Decompose operation reduces the information loss; (b) A Decompose operation after another Decompose operation reduces the information loss.


Algorithm 1. The proposed algorithm for refining a microaggregation solution.

Table 2. The information loss (%) in the Tarragona dataset.

Table 3. The information loss (%) in the Census dataset.

Table 4. The information loss (%) in the EIA dataset.