# Vulnerability- and Diversity-Aware Anonymization of Personally Identifiable Information for Improving User Privacy and Utility of Publishing Data


## Abstract


## 1. Introduction

About $87\%$ of the U.S. population is likely to be uniquely identified based only on 5-digit ZIP code, gender, and date of birth. About $50\%$ is likely to be uniquely identified by only place, gender, and date of birth. Moreover, even at the county level, the combination of county, gender, and date of birth is likely to identify about $18\%$ of the U.S. population.

## 2. Background and Related Work

## 3. The Proposed Anonymization Scheme

#### 3.1. Determining the Identity Vulnerability Values of QIs

#### 3.2. Highly Similar Users Ranking and Formation of Equivalence Classes

#### 3.3. Calculate and Compare Diversity and Evenness of Equivalence Classes

- Calculate the proportion (${p}_{i}$) of each SA category in an equivalence class using Equation (3). $${p}_{i}=\frac{{n}_{i}}{k}$$
- Square the individual proportions (${p}_{1},{p}_{2},{p}_{3},\cdots ,{p}_{n}$) of the SA categories in the equivalence class and sum them (Equation (4)). $$\sum _{i=1}^{n}{p}_{i}^{2}={\left({p}_{1}\right)}^{2}+{\left({p}_{2}\right)}^{2}+{\left({p}_{3}\right)}^{2}+\cdots +{\left({p}_{n}\right)}^{2}$$
- Take the reciprocal of the value obtained from Equation (4). The result is the diversity, denoted by D. $$D=\frac{1}{\sum _{i=1}^{n}{p}_{i}^{2}}$$
- To find the evenness E, divide D by the total number of unique SA categories (n) in the equivalence class. $$E=\frac{D}{n}$$
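
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `diversity_and_evenness` and the example salary labels are hypothetical, and it assumes that ${n}_{i}$ is the number of records in SA category $i$ and that $k$ is the number of records in the equivalence class.

```python
from collections import Counter


def diversity_and_evenness(sa_values):
    """Return (D, E) for one equivalence class.

    sa_values: list of sensitive-attribute (SA) values, one per record.
    Assumes k = number of records in the class and n = number of
    distinct SA categories present in the class.
    """
    k = len(sa_values)
    if k == 0:
        raise ValueError("equivalence class must contain at least one record")

    counts = Counter(sa_values)                          # n_i per category
    proportions = [n_i / k for n_i in counts.values()]   # Equation (3)
    sum_of_squares = sum(p * p for p in proportions)     # Equation (4)
    diversity = 1.0 / sum_of_squares                     # D
    evenness = diversity / len(counts)                   # E = D / n
    return diversity, evenness


# Hypothetical salary labels, used only for illustration.
balanced = ["low", "low", "high", "high"]
skewed = ["low", "low", "low", "high"]

d_bal, e_bal = diversity_and_evenness(balanced)
d_skw, e_skw = diversity_and_evenness(skewed)
print(d_bal, e_bal)
print(d_skw, e_skw)
```

Under these assumptions, a class whose SA categories are equally represented reaches the maximum evenness of 1, while a skewed class scores lower on both D and E, which is what makes the two measures usable for comparing equivalence classes.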

#### 3.4. Adaptive Data Generalization

## 4. Finding the Identity Vulnerability of Quasi Identifiers

Algorithm 1: Finding identity vulnerability values of the QIs

## 5. Determining the Best Generalization Level

#### 5.1. Higher and Lower Level Generalization

#### 5.2. Adaptive Data Generalization Algorithm

Algorithm 2: Adaptive data generalization algorithm

## 6. Simulation Results

#### 6.1. Dataset Description

#### 6.2. Improvements in Privacy

#### 6.3. Improvements in Anonymous Data Utility

The random forest (RF) parameters were as follows: number of trees ($ntree$) $=500$; number of QIs tried when splitting a tree node ($mtry$) $=3$; RF model = classification; variable importance = $true$; keep.forest = $true$; data = users - data; predictors = (age, gender, race, country); and target = salary.

#### 6.3.1. Reduction in Information Losses

#### 6.3.2. Improvements in Classification Accuracy

#### 6.4. Decrease in Computational Overheads

## 7. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest


**Figure 6.** Comparison of privacy between the information-based anonymization for classification given k (IACk) algorithm and the proposed scheme.

Attributes | Type | Quasi-identifier | Description |
---|---|---|---|
Age | Numerical | Yes | Taxonomy tree of height 7 |
Race | Categorical | Yes | Taxonomy tree of height 3 |
Gender | Categorical | Yes | Taxonomy tree of height 2 |
Country | Categorical | Yes | Taxonomy tree of height 4 |
Salary | Categorical | No | Sensitive Attribute |

Sr. No. | Quasi-Identifiers | Actual Values | Relative Values |
---|---|---|---|
1 | Age | 0.03 | 81.72 |
2 | Gender | 0.01 | 16.92 |
3 | Race | 0.00047 | 0.83 |
4 | Country | 0.00011 | 0.53 |

Values of k | Existing Schemes | Proposed Scheme |
---|---|---|
5 | 0.024 | 0.01 |
10 | 0.025 | 0.011 |
50 | 0.033 | 0.019 |
100 | 0.033 | 0.020 |
150 | 0.035 | 0.022 |
200 | 0.037 | 0.025 |
Mean | 0.03 | 0.01 |
Standard Deviation | 0.005 | 0.006 |

Values of k | Existing Schemes | Proposed Scheme |
---|---|---|
5 | 8.3 | 2.5 |
10 | 12.3 | 3.7 |
50 | 16.3 | 4.9 |
100 | 33 | 9.9 |
150 | 68 | 52.5 |
200 | 160 | 60.5 |
Mean | 49.6 | 22.3 |
Standard Deviation | 58.32 | 26.70 |

Values of k | Proposed Algorithm | Mondrian Algorithm |
---|---|---|
10 | 15 s | 20 s |
20 | 15 s | 20 s |
100 | 16 s | 19 s |
150 | 18 s | 18 s |
200 | 18 s | 19 s |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Majeed, A.; Ullah, F.; Lee, S. Vulnerability- and Diversity-Aware Anonymization of Personally Identifiable Information for Improving User Privacy and Utility of Publishing Data. *Sensors* **2017**, *17*, 1059.
https://doi.org/10.3390/s17051059
