# Hiding Sensitive Itemsets Using Sibling Itemset Constraints

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Preliminaries

- Modification of the database is minimized, so that the sanitized database stays as close to the original as possible.
- All sensitive itemsets are hidden, that is, none of them is frequent in the sanitized database.
- Supersets of sensitive itemsets are also hidden and do not appear in the sanitized database. By the Apriori property, every superset of an infrequent itemset is itself infrequent, so this goal is met automatically once the sensitive itemsets themselves are hidden.
- All non-sensitive frequent itemsets appear in the sanitized database. A non-sensitive frequent itemset that is no longer frequent in the sanitized database is called a lost itemset.
- No new frequent itemset appears in the sanitized database; such itemsets are called ghost itemsets. Approaches that only delete items from the dataset meet this goal by construction, because a deletion can never raise a support count.
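
The goals above can be checked mechanically on a small database. The following is a minimal sketch, not the paper's implementation: it mines frequent itemsets by brute force and reports sensitive itemsets that survived, lost itemsets, and ghost itemsets. The transactions are those of the paper's ten-transaction example; the threshold `sigma_min=2`, the sensitive itemset `{C, D}`, and all function names are assumptions made for illustration.

```python
from itertools import combinations

def support(db, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, sigma_min):
    """Brute-force miner: fine for a handful of items, not for real data."""
    items = sorted(set().union(*db))
    return {
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if support(db, frozenset(c)) >= sigma_min
    }

def evaluate(db, db_s, sensitive, sigma_min):
    """Compare the original database `db` with a sanitized one `db_s`."""
    F = frequent_itemsets(db, sigma_min)
    F_s = frequent_itemsets(db_s, sigma_min)
    # Sensitive itemsets and their frequent supersets must all disappear.
    to_hide = {X for X in F if any(s <= X for s in sensitive)}
    F_n = F - to_hide  # non-sensitive frequent itemsets
    return {
        "not_hidden": to_hide & F_s,  # hiding failures
        "lost": F_n - F_s,            # lost itemsets
        "ghost": F_s - F,             # ghost itemsets
    }

# Original and sanitized transactions (sigma_min and S below are assumed).
D = [frozenset(t) for t in
     ["AC", "ACDE", "CD", "BE", "ACDE", "DE", "C", "AB", "AC", "CD"]]
D_s = [frozenset(t) for t in
       ["AC", "ACDE", "D", "BE", "ACE", "DE", "C", "AB", "AC", "D"]]
S = [frozenset("CD")]

report = evaluate(D, D_s, S, sigma_min=2)
for name, itemsets in report.items():
    print(name, sorted("".join(sorted(x)) for x in itemsets))
```

Because the sanitized database here is obtained by deletions only, the `ghost` set is necessarily empty, which matches the remark in the last goal. Under the assumed threshold the sketch also shows that hiding can cost non-sensitive itemsets: `AD` and `ADE` fall below the threshold and are reported as lost.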

#### CSP Formulation

## 4. Itemset Hiding Using Sibling Itemsets Constraints

#### Illustrative Example

## 5. Experimental Analysis

#### 5.1. Evaluation Metrics of Itemset Hiding

#### 5.1.1. Hiding Failure

#### 5.1.2. Artifactual Patterns

#### 5.1.3. Dissimilarity

#### 5.1.4. Misses Cost

#### 5.2. Comparison

#### 5.3. Discussion

## 6. Conclusions

## References

Notation | Description |
---|---|
$D$ | Dataset |
$\tilde{D}$ | Intermediate form of dataset |
${D}^{s}$ | Sanitized dataset |
${T}_{i}$ | $i$-th transaction of dataset |
$I$ | Set of items |
$\sigma \left(X\right)$ | Support count of itemset $X$ in dataset |
${\sigma}^{s}\left(X\right)$ | Support count of itemset $X$ in sanitized dataset |
${\sigma}_{min}$ | Minimum support count threshold |
$S$ | Set of sensitive itemsets |
$Ss$ | Set of supersets of sensitive itemsets |
$F$ | Set of frequent itemsets in dataset |
${F}^{n}$ | Set of non-sensitive frequent itemsets in dataset |
${F}^{s}$ | Set of frequent itemsets in sanitized dataset |
${d}_{ij}$ | Item of dataset in bitmap notation at $i$-th row, $j$-th column |
$\tilde{d}_{ij}$ | Item of intermediate form of dataset in bitmap notation at $i$-th row, $j$-th column |
${d}_{ij}^{s}$ | Item of sanitized dataset in bitmap notation at $i$-th row, $j$-th column |
${u}_{ij}$ | Binary variable of intermediate form of dataset in bitmap notation at $i$-th row, $j$-th column |
$SI$ | Set of sibling itemsets |
$SI\left(X\right)$ | Set of sibling itemsets of itemset $X$ |
${r}_{Y}$ | Binary variable of constraint defined for itemset $Y$ |

TID | Items |
---|---|
$T_{1}$ | AC |
$T_{2}$ | ACDE |
$T_{3}$ | CD |
$T_{4}$ | BE |
$T_{5}$ | ACDE |
$T_{6}$ | DE |
$T_{7}$ | C |
$T_{8}$ | AB |
$T_{9}$ | AC |
$T_{10}$ | CD |
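
The bitmap notation $d_{ij}$ is obtained from the transaction list by writing one row per transaction and one column per item. A minimal sketch of that conversion follows; the names `transactions`, `items`, and `bitmap` are illustrative and not taken from the paper.

```python
# Transactions of the example database, keyed by transaction ID.
transactions = {
    "T1": "AC", "T2": "ACDE", "T3": "CD", "T4": "BE", "T5": "ACDE",
    "T6": "DE", "T7": "C", "T8": "AB", "T9": "AC", "T10": "CD",
}

# Column order: all distinct items, sorted alphabetically.
items = sorted(set("".join(transactions.values())))

# d[i][j] is 1 if the i-th transaction contains the j-th item, else 0.
bitmap = [[1 if item in t else 0 for item in items]
          for t in transactions.values()]

print(" | ".join(items))
for row in bitmap:
    print(" | ".join(str(cell) for cell in row))
```

Dictionaries preserve insertion order in Python 3.7 and later, so the rows come out in transaction order.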

A | B | C | D | E |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 1 |
0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 1 | 1 |
0 | 0 | 0 | 1 | 1 |
0 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
0 | 0 | 1 | 1 | 0 |

A | B | C | D | E |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
1 | 0 | $u_{2,3}$ | $u_{2,4}$ | 1 |
0 | 0 | $u_{3,3}$ | $u_{3,4}$ | 0 |
0 | 1 | 0 | 0 | 1 |
1 | 0 | $u_{5,3}$ | $u_{5,4}$ | 1 |
0 | 0 | 0 | 1 | 1 |
0 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
0 | 0 | $u_{10,3}$ | $u_{10,4}$ | 0 |
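
In the table above, the binary variables $u_{ij}$ sit exactly in the rows that contain both C and D, and only in those two columns. One way to reproduce this layout is sketched below, assuming the sensitive itemset is `{C, D}`; that choice is inferred from the table rather than stated in the surviving text, and the function name `intermediate_form` is hypothetical.

```python
ITEMS = "ABCDE"
D = ["AC", "ACDE", "CD", "BE", "ACDE", "DE", "C", "AB", "AC", "CD"]

def intermediate_form(db, sensitive):
    """Replace the cells of each sensitive itemset by variables u_{i,j}
    in every transaction that supports that itemset."""
    rows = []
    for i, t in enumerate(db, start=1):
        row = [str(int(item in t)) for item in ITEMS]
        for s in sensitive:
            if set(s) <= set(t):  # transaction supports the itemset
                for item in s:
                    j = ITEMS.index(item) + 1  # 1-based column index
                    row[j - 1] = f"u_{{{i},{j}}}"
        rows.append(row)
    return rows

rows = intermediate_form(D, ["CD"])  # assumed sensitive itemset
for row in rows:
    print(" | ".join(row))
```

Every other cell stays a constant, so a solver only has to decide the marked cells, two per supporting transaction in this example.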

A | B | C | D | E |
---|---|---|---|---|
1 | 0 | 1 | 0 | 0 |
1 | 0 | 1 | 1 | 1 |
0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 1 |
0 | 0 | 0 | 1 | 1 |
0 | 0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 | 0 |
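
The effect of sanitization can be read off by comparing bitmaps cell by cell. The sketch below does this for the original bitmap and the table above, which is read here as the sanitized dataset $D^s$; that reading is an assumption, since the table carries no caption. The cell count is a plain difference count, not necessarily the paper's dissimilarity metric, and `{C, D}` as the sensitive itemset is likewise assumed.

```python
ITEMS = "ABCDE"

# Original bitmap and the bitmap assumed to be the sanitized dataset.
D = [[1, 0, 1, 0, 0], [1, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 1], [0, 0, 0, 1, 1], [0, 0, 1, 0, 0], [1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0], [0, 0, 1, 1, 0]]
D_S = [[1, 0, 1, 0, 0], [1, 0, 1, 1, 1], [0, 0, 0, 1, 0], [0, 1, 0, 0, 1],
       [1, 0, 1, 0, 1], [0, 0, 0, 1, 1], [0, 0, 1, 0, 0], [1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0], [0, 0, 0, 1, 0]]

def support(bitmap, itemset):
    """Support count of `itemset` (a string of item letters) in a bitmap."""
    cols = [ITEMS.index(item) for item in itemset]
    return sum(1 for row in bitmap if all(row[j] for j in cols))

def changed_cells(bitmap, bitmap_s):
    """List (transaction number, item) pairs whose cell value differs."""
    return [(i + 1, ITEMS[j])
            for i, (row, row_s) in enumerate(zip(bitmap, bitmap_s))
            for j in range(len(ITEMS))
            if row[j] != row_s[j]]

changes = changed_cells(D, D_S)
print("changed cells:", changes)
print("support of CD before:", support(D, "CD"), "after:", support(D_S, "CD"))
```

On these two tables the only differences are three deletions (C from the third and tenth transactions, D from the fifth), which lower the support count of `CD` from 4 to 1.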

Dataset Name | Number of Transactions | Average Transaction Length | Number of Items | Minimum Support Count | Number of Frequent Itemsets | Runtime (Seconds) |
---|---|---|---|---|---|---|
T10I4D100K | 100,000 | 10.10 | 870 | 500 (0.5%) | 1,073 | 9.15 |
T40I10D100K | 100,000 | 39.60 | 942 | 500 (0.5%) | 1,286,037 | 392.96 |
Mushroom | 8,124 | 23.00 | 119 | 406 (5%) | 3,755,704 | 9.77 |
retail | 88,162 | 10.30 | 16,470 | 440 (0.5%) | 581 | 2.00 |
BMS1 | 59,602 | 2.51 | 497 | 60 (0.1%) | 3,991 | 0.88 |
BMS2 | 77,512 | 4.62 | 3,340 | 77 (0.1%) | 24,143 | 5.22 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/0 | 5.72 | 4.04 |
HS_2.2 | 0/0 | 6.75 | 5.2 |
HS_2.3 | 0/0 | 7.62 | 6.03 |
HS_3.1 | 0/0 | 6.78 | 5.54 |
HS_3.2 | 0/0 | 9.51 | 10.31 |
HS_4.1 | 0/0 | 9.01 | 10.5 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/1 | 13.64 | 13.31 |
HS_2.2 | 0/1 | 27.59 | 14.59 |
HS_2.3 | 0/2 | 62.49 | 22.88 |
HS_3.1 | 0/0 | 95.61 | 18.92 |
HS_3.2 | 0/0 | 1,110.97 | 31.32 |
HS_4.1 | 0/1 | 408.48 | 24.02 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/0 | 8.98 | 1.56 |
HS_2.2 | 0/0 | 18.45 | 2.6 |
HS_2.3 | 0/0 | 22.45 | 4.23 |
HS_3.1 | 0/0 | 23 | 2.67 |
HS_3.2 | 0/0 | 25.78 | 4.96 |
HS_4.1 | 0/0 | 17.44 | 6.11 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/0 | 1.07 | 1.06 |
HS_2.2 | 0/0 | 1.14 | 1.17 |
HS_2.3 | 0/1 | 1.18 | 1.18 |
HS_3.1 | 0/0 | 1.2 | 1.36 |
HS_3.2 | 0/1 | 1.47 | 1.39 |
HS_4.1 | 0/2 | 2.96 | 1.57 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/0 | 1.07 | 1.06 |
HS_2.2 | 0/0 | 1.14 | 1.17 |
HS_2.3 | 0/1 | 1.18 | 1.18 |
HS_3.1 | 0/0 | 1.2 | 1.36 |
HS_3.2 | 0/1 | 1.47 | 1.39 |
HS_4.1 | 0/2 | 2.96 | 1.57 |

Hiding Scenario | Number of Lost Itemsets (IPA/HISB) | Algorithm IPA (Seconds) | Algorithm HISB (Seconds) |
---|---|---|---|
HS_2.1 | 0/0 | 36.21 | 33.21 |
HS_2.2 | 0/0 | 38.59 | 38.54 |
HS_2.3 | 0/0 | 43.36 | 52.48 |
HS_3.1 | 0/0 | 39.15 | 34.89 |
HS_3.2 | 0/0 | 43.92 | 42.98 |
HS_4.1 | 0/0 | 45.11 | 44.52 |

