# Iterative Group Decomposition for Refining Microaggregation Solutions


## Abstract


## 1. Introduction

## 2. Microaggregation Problem

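Consider a dataset D of n records x_i, i ∈ {1,…,n}, in the d-dimensional space. For a given positive integer k ≤ n, the microaggregation problem is to derive a partition P of D such that |p| ≥ k for each group p ∈ P and SSE(P) is minimized. Here, SSE(P) denotes the sum of the within-group squared errors of all groups in P and is calculated as follows:

$$\mathrm{SSE}(P) = \sum_{p \in P} \sum_{x \in p} \lVert x - \bar{x}_p \rVert^2,$$

where $\bar{x}_p$ denotes the centroid of group p.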
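
To make the objective concrete, the following minimal sketch computes the SSE of a partition and checks the group-size constraint. It assumes the usual definition, in which a group's error is the sum of squared Euclidean distances from its records to the group centroid. The names `sse` and `is_feasible` and the toy data are illustrative and not taken from the paper.

```python
import numpy as np

def sse(partition):
    """Sum of within-group squared errors.

    `partition` is a list of groups; each group is an (m, d) array of records.
    """
    total = 0.0
    for group in partition:
        group = np.asarray(group, dtype=float)
        centroid = group.mean(axis=0)
        total += float(((group - centroid) ** 2).sum())
    return total

def is_feasible(partition, k):
    """True when every group holds at least k records."""
    return all(len(group) >= k for group in partition)

# Toy data: six 2-D records split into two groups of three (k = 3).
P = [np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]),
     np.array([[10.0, 0.0], [10.0, 3.0], [10.0, 6.0]])]
print(sse(P), is_feasible(P, 3))
```

For the toy partition, the first group contributes 2 and the second 18, so the SSE is 20.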

## 3. Related Work

#### 3.1. Microaggregation Approaches

#### 3.2. Refining Approaches

## 4. Proposed Algorithm

p_r = {r} (lines 4–7; Figure 3). It then repeatedly adds to p_r the record in p nearest to the centroid of p_r until |p_r| = k (lines 8–12; Figure 3). This process is repeated to generate new groups until |p| ≤ k (lines 3–14; Figure 3). The remaining records in p are added to their nearest groups (lines 15–17; Figure 3).
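
The splitting steps described above can be sketched as follows. This is a rough illustration, not the pseudocode of Figure 3. Two points are assumptions: the seed record r is taken to be the record farthest from the centroid of the records still in p, and "nearest group" is measured by distance to the group centroid. The loop here also stops once fewer than k records remain, a slight tightening of the stated |p| ≤ k condition so that a group of exactly 2k records yields two groups. The name `split_large_group` is hypothetical.

```python
import numpy as np

def split_large_group(records, k):
    """Split one over-sized group (m >= 2k records) into groups of size >= k."""
    remaining = [np.asarray(x, dtype=float) for x in records]
    new_groups = []
    while len(remaining) >= k:
        # Assumption: seed with the record farthest from the centroid of what is left.
        centroid = np.mean(remaining, axis=0)
        seed = max(range(len(remaining)),
                   key=lambda i: np.sum((remaining[i] - centroid) ** 2))
        group = [remaining.pop(seed)]
        # Grow the group with the record nearest to its current centroid.
        while len(group) < k:
            c = np.mean(group, axis=0)
            nearest = min(range(len(remaining)),
                          key=lambda i: np.sum((remaining[i] - c) ** 2))
            group.append(remaining.pop(nearest))
        new_groups.append(group)
    # Leftover records (fewer than k) join the group with the nearest centroid.
    for x in remaining:
        j = min(range(len(new_groups)),
                key=lambda g: np.sum((x - np.mean(new_groups[g], axis=0)) ** 2))
        new_groups[j].append(x)
    return [np.array(g) for g in new_groups]

data = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                 [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
groups = split_large_group(data, 3)
print([len(g) for g in groups])
```

With k = 3, the seven toy records end up in one group of three (x = 10, 11, 12) and one group of four (x = 0, 1, 2, 3).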

x_min ∈ p such that moving x_min to another group reduces the SSE the most (lines 3–14; Figure 4). This process is repeated until p has only k records remaining or no such move further reduces the SSE (lines 2–15; Figure 4). Similar to the Decompose operation, the Shrink operation can result in groups with 2k or more records, so it calls the SplitLargeGroups function to split these over-sized groups (line 17; Figure 4).
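
A brute-force sketch of the Shrink loop is given below. It recomputes group errors directly instead of using incremental updates, and it omits the final call that splits groups of 2k or more records. The function names `group_sse` and `shrink` and the list-of-lists data layout are assumptions made for this illustration.

```python
import numpy as np

def group_sse(g):
    """Squared error of one group around its centroid."""
    g = np.asarray(g, dtype=float)
    return float(((g - g.mean(axis=0)) ** 2).sum())

def shrink(groups, idx, k):
    """Move records out of groups[idx] while a move lowers the total SSE.

    `groups` is a list of lists of 1-D arrays and is modified in place.
    """
    p = groups[idx]
    while len(p) > k:
        best = None  # (change in total SSE, record index, target group index)
        for i, x in enumerate(p):
            rest = p[:i] + p[i + 1:]
            drop = group_sse(rest) - group_sse(p)  # effect of removing x from p
            for j, q in enumerate(groups):
                if j == idx:
                    continue
                change = drop + group_sse(q + [x]) - group_sse(q)
                if best is None or change < best[0]:
                    best = (change, i, j)
        if best is None or best[0] >= -1e-12:
            break  # no move reduces the SSE
        _, i, j = best
        groups[j].append(p.pop(i))
    return groups

groups = [[np.array([0.0]), np.array([1.0]), np.array([2.0]), np.array([9.0])],
          [np.array([10.0]), np.array([11.0]), np.array([12.0])]]
shrink(groups, 0, 3)
print([[float(x[0]) for x in g] for g in groups])
```

In the toy run, the record at 9 leaves the first group for the second, lowering the total SSE from 52 to 7, and the loop then stops because the first group has reached k = 3 records.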

O(n²/k). The time complexity of the SplitLargeGroups function is O(k² × n/k). Thus, an iteration of the Decompose and Shrink operations (lines 3–5; Figure 1) entails a time cost of O(n²/k + k² × n/k) = O(n²).
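
The simplification to O(n²) relies only on the bound 1 ≤ k ≤ n from the problem definition. Spelled out under that assumption:

```latex
\[
O\!\left(\frac{n^{2}}{k} + k^{2}\cdot\frac{n}{k}\right)
  = O\!\left(\frac{n^{2}}{k} + kn\right)
  \subseteq O\!\left(n^{2}\right),
\qquad
\text{since } 1 \le k \le n \;\Rightarrow\; \frac{n^{2}}{k} \le n^{2}
\ \text{and}\ kn \le n^{2}.
\]
```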

p_3 and merges its content into groups p_2 and p_4, as shown in the middle part of Figure 5a. At this point, the Decompose operation cannot further reduce the SSE of the partition result. However, the Shrink operation can reduce the SSE by moving a record from group p_2 to group p_1, as shown in the bottom part of Figure 5a.

p_3 and merging its content into groups p_2 and p_4. Because group p_2 now has 2k or more records, it is split into two groups, p_21 and p_22, as shown in the middle part of Figure 5b. The emergence of group p_21 provides an opportunity to further reduce the SSE by decomposing group p_21 and merging its content into groups p_1 and p_22, as shown in the bottom part of Figure 5b.
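
The examples above rely on dissolving one group into its neighbors. The full Decompose procedure is not reproduced in this section, so the sketch below is only a guess at its core step: each record of the chosen group is sent to the other group with the nearest centroid, and the change is kept only if the total SSE drops. The centroids are held fixed during reassignment as a simplification, and the subsequent splitting of groups with 2k or more records is left out. The names `total_sse` and `try_decompose` are hypothetical.

```python
import numpy as np

def total_sse(groups):
    """Total within-group squared error of a list of groups."""
    return sum(float(((np.asarray(g) - np.mean(g, axis=0)) ** 2).sum())
               for g in groups)

def try_decompose(groups, idx):
    """Dissolve groups[idx] into the other groups if that lowers the total SSE.

    Returns the new list of groups, or the input unchanged when there is no gain.
    """
    others = [list(g) for j, g in enumerate(groups) if j != idx]
    if not others:
        return groups
    # Simplification: centroids are not updated while records are reassigned.
    centroids = [np.mean(g, axis=0) for g in others]
    for x in groups[idx]:
        j = min(range(len(others)),
                key=lambda t: np.sum((x - centroids[t]) ** 2))
        others[j].append(x)
    return others if total_sse(others) < total_sse(groups) else groups

def make(vals):
    """Build a group of 1-D records from plain numbers (toy helper)."""
    return [np.array([v]) for v in vals]

groups = [make([0.0, 1.0, 2.0]), make([2.5, 5.5, 7.5]), make([8.0, 9.0, 10.0])]
result = try_decompose(groups, 1)
print(len(result), total_sse(groups), total_sse(result))
```

In this toy case the middle group is dissolved, leaving two groups of four and five records with a lower total SSE than the original three groups.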

## 5. Experiment

#### 5.1. Datasets

#### 5.2. Experimental Settings

#### 5.3. Experimental Results

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

**Figure 5.** Two examples. (**a**) A Shrink operation after a Decompose operation reduces the information loss; (**b**) a Decompose operation after another Decompose operation reduces the information loss.

**Table 1.** Tested methods.

Tested Method | Heuristic for Selecting the 1st Record of Each Group | Heuristic for Growing a Group to Size k | Method for Refining a Solution |
---|---|---|---|
CBFS-NN | CBFS | Nearest Neighbors to 1st record | None |
CBFS-NN2 | CBFS | Nearest Neighbors to 1st record | TFRP2 |
CBFS-NN3 | CBFS | Nearest Neighbors to 1st record | Our method in Figure 1 |
CBFS-NC | CBFS | Nearest to group’s Centroid | None |
CBFS-NC2 | CBFS | Nearest to group’s Centroid | TFRP2 |
CBFS-NC3 | CBFS | Nearest to group’s Centroid | Our method in Figure 1 |
MDAV-NN | MDAV | Nearest Neighbors to 1st record | None |
MDAV-NN2 | MDAV | Nearest Neighbors to 1st record | TFRP2 |
MDAV-NN3 | MDAV | Nearest Neighbors to 1st record | Our method in Figure 1 |
MDAV-NC | MDAV | Nearest to group’s Centroid | None |
MDAV-NC2 | MDAV | Nearest to group’s Centroid | TFRP2 |
MDAV-NC3 | MDAV | Nearest to group’s Centroid | Our method in Figure 1 |
TFRP-NN | TFRP | Nearest Neighbors to 1st record | None |
TFRP-NN2 | TFRP | Nearest Neighbors to 1st record | TFRP2 |
TFRP-NN3 | TFRP | Nearest Neighbors to 1st record | Our method in Figure 1 |
TFRP-NC | TFRP | Nearest to group’s Centroid | None |
TFRP-NC2 | TFRP | Nearest to group’s Centroid | TFRP2 |
TFRP-NC3 | TFRP | Nearest to group’s Centroid | Our method in Figure 1 |
GSMS-NN | GSMS | Nearest Neighbors to 1st record | None |
GSMS-NN2 | GSMS | Nearest Neighbors to 1st record | TFRP2 |
GSMS-NN3 | GSMS | Nearest Neighbors to 1st record | Our method in Figure 1 |

**Table 2.** Information loss (IL × 100) on the Tarragona dataset.

Method/k | 3 | 4 | 5 | 10 | 20 | 30 |
---|---|---|---|---|---|---|
CBFS-NN | 16.966 | 19.730 | 22.819 | 33.215 | 42.955 | 49.489 |
CBFS-NN2 | 16.966 | 19.227 | 22.588 | 33.211 | 42.944 | 49.481 |
CBFS-NN3 | 16.966 | 18.651 | 22.268 | 33.173 | 42.872 | 49.404 |
CBFS-NC | 15.617 | 19.230 | 22.609 | 37.105 | 47.685 | 56.042 |
CBFS-NC2 | 15.617 | 19.210 | 22.150 | 36.892 | 46.415 | 53.212 |
CBFS-NC3 | 15.617 | 19.172 | 21.434 | 36.290 | 41.848 | 47.231 |
MDAV-NN | 16.9326 | 19.546 | 22.4613 | 33.192 | 43.195 | 49.483 |
MDAV-NN2 | 16.9324 | 19.029 | 22.4613 | 33.192 | 43.099 | 49.460 |
MDAV-NN3 | 16.9320 | 18.434 | 22.4612 | 33.184 | 42.771 | 49.261 |
MDAV-NC | 15.631 | 19.176 | 22.712 | 36.992 | 47.705 | 56.370 |
MDAV-NC2 | 15.617 | 19.140 | 22.284 | 36.955 | 46.167 | 52.705 |
MDAV-NC3 | 15.598 | 19.068 | 21.409 | 36.389 | 41.122 | 47.297 |
TFRP-NN | 17.112 | 19.995 | 23.412 | 33.557 | 43.416 | 50.187 |
TFRP-NN2 | 17.070 | 19.715 | 23.136 | 33.405 | 43.343 | 49.965 |
TFRP-NN3 | 16.954 | 19.275 | 22.408 | 32.866 | 42.652 | 48.512 |
TFRP-NC | 17.629 | 19.511 | 23.222 | 35.645 | 47.654 | 55.604 |
TFRP-NC2 | 16.702 | 19.374 | 23.171 | 35.400 | 46.317 | 53.050 |
TFRP-NC3 | 16.021 | 19.233 | 22.839 | 34.909 | 41.358 | 47.034 |
GSMS-NN | 16.610 | 19.050 | 21.948 | 33.234 | 43.023 | 49.433 |
GSMS-NN2 | 16.610 | 19.046 | 21.723 | 33.230 | 43.008 | 49.429 |
GSMS-NN3 | 16.610 | 19.039 | 21.311 | 33.208 | 42.932 | 49.395 |

**Table 3.** Information loss (IL × 100) on the Census dataset.

Method/k | 3 | 4 | 5 | 10 | 20 | 30 |
---|---|---|---|---|---|---|
CBFS-NN | 5.654 | 7.441 | 8.884 | 14.001 | 19.469 | 23.881 |
CBFS-NN2 | 5.648 | 7.439 | 8.848 | 13.902 | 19.384 | 23.651 |
CBFS-NN3 | 5.644 | 7.406 | 8.554 | 12.809 | 17.938 | 21.509 |
CBFS-NC | 5.348 | 7.173 | 8.685 | 14.341 | 21.390 | 26.505 |
CBFS-NC2 | 5.337 | 7.165 | 8.656 | 14.117 | 20.470 | 24.848 |
CBFS-NC3 | 5.325 | 7.139 | 8.575 | 12.672 | 17.365 | 20.326 |
MDAV-NN | 5.692 | 7.495 | 9.088 | 14.156 | 19.578 | 23.407 |
MDAV-NN2 | 5.683 | 7.434 | 9.054 | 14.017 | 19.492 | 23.289 |
MDAV-NN3 | 5.660 | 7.218 | 8.950 | 12.809 | 18.129 | 21.201 |
MDAV-NC | 5.343 | 7.290 | 8.945 | 14.361 | 21.364 | 25.123 |
MDAV-NC2 | 5.335 | 7.265 | 8.898 | 14.043 | 20.091 | 23.686 |
MDAV-NC3 | 5.334 | 7.222 | 8.698 | 12.648 | 17.481 | 20.647 |
TFRP-NN | 5.864 | 7.965 | 9.252 | 14.369 | 20.167 | 23.607 |
TFRP-NN2 | 5.805 | 7.831 | 9.039 | 14.042 | 19.817 | 23.063 |
TFRP-NN3 | 5.735 | 7.428 | 8.408 | 13.024 | 18.211 | 21.112 |
TFRP-NC | 5.645 | 7.636 | 9.301 | 14.834 | 21.719 | 26.725 |
TFRP-NC2 | 5.546 | 7.496 | 9.037 | 14.265 | 20.555 | 25.031 |
TFRP-NC3 | 5.466 | 7.382 | 8.796 | 12.963 | 17.973 | 20.892 |
GSMS-NN | 5.564 | 7.254 | 8.686 | 13.549 | 18.792 | 22.432 |
GSMS-NN2 | 5.545 | 7.251 | 8.597 | 13.452 | 18.451 | 22.354 |
GSMS-NN3 | 5.535 | 7.240 | 8.367 | 13.085 | 17.230 | 21.089 |

**Table 4.** Information loss (IL × 100) on the EIA dataset.

Method/k | 3 | 4 | 5 | 10 | 20 | 30 |
---|---|---|---|---|---|---|
CBFS-NN | 0.478 | 0.671 | 1.740 | 3.512 | 7.053 | 10.919 |
CBFS-NN2 | 0.416 | 0.614 | 0.960 | 2.644 | 6.981 | 10.854 |
CBFS-NN3 | 0.402 | 0.587 | 0.803 | 2.036 | 6.823 | 10.605 |
CBFS-NC | 0.470 | 0.672 | 1.533 | 3.276 | 7.628 | 10.084 |
CBFS-NC2 | 0.426 | 0.612 | 0.891 | 2.552 | 7.410 | 10.046 |
CBFS-NC3 | 0.415 | 0.574 | 0.762 | 2.282 | 7.110 | 10.038 |
MDAV-NN | 0.483 | 0.671 | 1.667 | 3.840 | 7.095 | 10.273 |
MDAV-NN2 | 0.417 | 0.614 | 0.969 | 2.931 | 7.010 | 10.192 |
MDAV-NN3 | 0.401 | 0.587 | 0.802 | 2.022 | 6.806 | 9.873 |
MDAV-NC | 0.471 | 0.677 | 1.459 | 3.058 | 7.641 | 9.984 |
MDAV-NC2 | 0.428 | 0.612 | 0.962 | 2.744 | 7.427 | 9.946 |
MDAV-NC3 | 0.415 | 0.573 | 0.795 | 2.298 | 7.109 | 9.937 |
TFRP-NN | 0.513 | 0.680 | 1.768 | 3.543 | 7.087 | 11.116 |
TFRP-NN2 | 0.419 | 0.613 | 0.969 | 2.669 | 6.977 | 10.993 |
TFRP-NN3 | 0.405 | 0.585 | 0.8 | 2.04 | 6.771 | 10.491 |
TFRP-NC | 0.465 | 0.674 | 1.670 | 3.288 | 7.663 | 11.286 |
TFRP-NC2 | 0.420 | 0.607 | 0.887 | 2.545 | 7.443 | 10.684 |
TFRP-NC3 | 0.410 | 0.574 | 0.779 | 2.289 | 7.116 | 10.324 |
GSMS-NN | 0.469 | 0.669 | 1.713 | 3.313 | 6.958 | 11.384 |
GSMS-NN2 | 0.407 | 0.610 | 0.890 | 2.569 | 6.859 | 10.704 |
GSMS-NN3 | 0.394 | 0.59 | 0.796 | 2.101 | 6.647 | 9.314 |

**Table 5.** Best information loss (IL) from Ref. [11] and from our methods.

Dataset | k | Best from [11]: IL × 100 | Best from [11]: Method | Our Best: IL × 100 | Our Best: Method |
---|---|---|---|---|---|
Tarragona | 3 | 16.36 | GSMS-T2 | 15.598 | MDAV-NC3 |
Tarragona | 5 | 21.72 | GSMS-T2 | 21.311 | GSMS-NN3 |
Tarragona | 10 | 33.18 | MD-MHM | 32.866 | TFRP-NN3 |
Census | 3 | 5.53 | GSMS-T2 | 5.325 | CBFS-NC3 |
Census | 5 | 8.58 | GSMS-T2 | 8.367 | GSMS-NN3 |
Census | 10 | 13.42 | GSMS-T2 | 12.648 | MDAV-NC3 |
EIA | 3 | 0.401 | GSMS-T2 | 0.394 | GSMS-NN3 |
EIA | 5 | 0.87 | GSMS-T2 | 0.762 | CBFS-NC3 |
EIA | 10 | 2.17 | μ-Approx | 2.022 | MDAV-NN3 |

**Table 6.** Cases (marked V) in which our methods yield lower information loss than the best results from Ref. [11].

Method | Tarragona, k = 3 | Tarragona, k = 5 | Tarragona, k = 10 | Census, k = 3 | Census, k = 5 | Census, k = 10 | EIA, k = 3 | EIA, k = 5 | EIA, k = 10 |
---|---|---|---|---|---|---|---|---|---|
CBFS-NN3 | | | V | | V | V | | V | V |
CBFS-NC3 | V | V | | V | V | V | | V | |
MDAV-NN3 | | | | | | V | | V | V |
MDAV-NC3 | V | V | | V | | V | | V | |
TFRP-NN3 | | | V | | V | V | | V | V |
TFRP-NC3 | V | | | V | | V | | V | |
GSMS-NN3 | | V | | | V | V | V | V | V |
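
The marks in Table 6 follow from a strict comparison between a method's information loss and the best value from Ref. [11] in Table 5. The short sketch below repeats that comparison for two of the refined methods, using values copied from the result tables and Table 5. The dictionary layout and the helper name `wins` are illustrative choices, not part of the paper.

```python
# Best IL*100 reported from Ref. [11] (Table 5), keyed by (dataset, k).
best_ref11 = {
    ("Tarragona", 3): 16.36, ("Tarragona", 5): 21.72, ("Tarragona", 10): 33.18,
    ("Census", 3): 5.53, ("Census", 5): 8.58, ("Census", 10): 13.42,
    ("EIA", 3): 0.401, ("EIA", 5): 0.87, ("EIA", 10): 2.17,
}

# IL*100 of two refined methods, copied from the per-dataset result tables.
ours = {
    "GSMS-NN3": {
        ("Tarragona", 3): 16.610, ("Tarragona", 5): 21.311, ("Tarragona", 10): 33.208,
        ("Census", 3): 5.535, ("Census", 5): 8.367, ("Census", 10): 13.085,
        ("EIA", 3): 0.394, ("EIA", 5): 0.796, ("EIA", 10): 2.101,
    },
    "MDAV-NN3": {
        ("Tarragona", 3): 16.9320, ("Tarragona", 5): 22.4612, ("Tarragona", 10): 33.184,
        ("Census", 3): 5.660, ("Census", 5): 8.950, ("Census", 10): 12.809,
        ("EIA", 3): 0.401, ("EIA", 5): 0.802, ("EIA", 10): 2.022,
    },
}

def wins(method):
    """Cases where `method` has strictly lower IL than the best from Ref. [11]."""
    return [case for case, ref in best_ref11.items() if ours[method][case] < ref]

for name in ours:
    print(name, wins(name))
```

Because the comparison is strict, MDAV-NN3 is not credited for EIA with k = 3, where its rounded value equals the reference value of 0.401.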

