# An Empirical Study of Applying Statistical Disclosure Control Methods to Public Health Research

## Abstract

## 1. Introduction

## 2. Methodology

#### 2.1. General Additive Data Perturbation Method

- the mean and standard deviation of each attribute of $\left(Y,S\right)$ are the same as those of $\left(X,S\right)$ in expectation; and
- the correlation matrix of $\left(Y,S\right)$ is the same as that of $\left(X,S\right)$ in expectation.
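
The two requirements above can be checked numerically once a masked dataset is available. The following is a minimal sketch, not the authors' code: the function names `pearson` and `utility_gaps` and the column-dictionary layout are assumptions made for illustration. It reports the largest absolute gap in mean, standard deviation, and pairwise Pearson correlation between the original and masked attributes.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def utility_gaps(original, masked):
    """Largest absolute gaps in mean, std and pairwise correlation.

    `original` and `masked` map attribute names to columns (hypothetical
    layout); a masked column is stored under the name of the attribute it
    replaces. At least two attributes are required.
    """
    names = list(original)
    gap_mean = max(abs(mean(original[n]) - mean(masked[n])) for n in names)
    gap_std = max(abs(stdev(original[n]) - stdev(masked[n])) for n in names)
    gap_corr = max(
        abs(pearson(original[a], original[b]) - pearson(masked[a], masked[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    )
    return gap_mean, gap_std, gap_corr
```

Because the properties hold only in expectation, small non-zero gaps are expected for any single masked dataset.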

#### 2.1.1. Symbols

- ${\mu}_{X},{\mu}_{S},{\mu}_{Y}$ are the mean vectors of $X,S,Y,$ respectively; and
- ${\mathsf{\Sigma}}_{XX},{\mathsf{\Sigma}}_{SS},{\mathsf{\Sigma}}_{YY}$ are the variance-covariance matrices of $X,S,Y,$ respectively, while ${\mathsf{\Sigma}}_{XY}$ is the covariance matrix between $X$ and $Y$. The matrices ${\mathsf{\Sigma}}_{YX},{\mathsf{\Sigma}}_{XS},{\mathsf{\Sigma}}_{SX},{\mathsf{\Sigma}}_{SY},{\mathsf{\Sigma}}_{YS}$ are defined analogously, with ${\mathsf{\Sigma}}_{XY}={\mathsf{\Sigma}}_{YX}^{\prime},{\mathsf{\Sigma}}_{XS}={\mathsf{\Sigma}}_{SX}^{\prime},{\mathsf{\Sigma}}_{SY}={\mathsf{\Sigma}}_{YS}^{\prime}$.
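
To make the block notation concrete, the sketch below estimates ${\mathsf{\Sigma}}_{XX}$, ${\mathsf{\Sigma}}_{SS}$, ${\mathsf{\Sigma}}_{XS}$ and ${\mathsf{\Sigma}}_{SX}$ from columns of data and confirms the transpose relation. The helper names and the toy values are assumptions for illustration only; they are not taken from the study.

```python
from statistics import mean


def cross_cov(cols_a, cols_b):
    """Sample covariance block between two lists of columns (divisor n - 1)."""
    n = len(cols_a[0])
    means_a = [mean(c) for c in cols_a]
    means_b = [mean(c) for c in cols_b]
    return [
        [
            sum((a[k] - ma) * (b[k] - mb) for k in range(n)) / (n - 1)
            for b, mb in zip(cols_b, means_b)
        ]
        for a, ma in zip(cols_a, means_a)
    ]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


# Toy data (illustrative values only).
X = [[1.0, 2.0, 4.0, 3.0], [0.0, 1.0, 1.0, 2.0]]  # two confidential columns
S = [[5.0, 7.0, 6.0, 8.0]]                         # one non-confidential column

sigma_xx = cross_cov(X, X)  # 2 x 2 block
sigma_ss = cross_cov(S, S)  # 1 x 1 block
sigma_xs = cross_cov(X, S)  # 2 x 1 block
sigma_sx = cross_cov(S, X)  # 1 x 2 block, the transpose of sigma_xs
```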

#### 2.1.2. Assumptions

- ${\mu}_{X}={\mu}_{Y},{\mathsf{\Sigma}}_{YY}={\mathsf{\Sigma}}_{XX}$, and ${\mathsf{\Sigma}}_{YS}={\mathsf{\Sigma}}_{XS}$; and
- the marginal distribution of each attribute and the joint distribution of all attributes are normal.
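
Under these assumptions, a perturbed attribute can be drawn from a normal distribution conditional on the non-confidential data. The sketch below is a scalar illustration only (one confidential column, one non-confidential column), and it additionally assumes the common special case in which $Y$ is conditionally independent of $X$ given $S$. The function name `gadp_scalar` is hypothetical, and the code is not the authors' implementation.

```python
import random
from statistics import mean


def gadp_scalar(x, s, rng):
    """Perturb one confidential column x given one non-confidential column s.

    Draws Y | S ~ N(mu_X + (cov_XS / var_S) * (S - mu_S),
                    var_X - cov_XS**2 / var_S),
    so Y matches X in mean, variance and covariance with S in expectation.
    """
    n = len(x)
    mu_x, mu_s = mean(x), mean(s)
    var_x = sum((v - mu_x) ** 2 for v in x) / (n - 1)
    var_s = sum((v - mu_s) ** 2 for v in s) / (n - 1)
    cov_xs = sum((a - mu_x) * (b - mu_s) for a, b in zip(x, s)) / (n - 1)

    slope = cov_xs / var_s                    # scalar form of Sigma_XS Sigma_SS^{-1}
    resid_var = var_x - cov_xs ** 2 / var_s   # conditional variance of Y given S
    resid_sd = max(resid_var, 0.0) ** 0.5     # guard against rounding below zero
    return [mu_x + slope * (b - mu_s) + rng.gauss(0.0, resid_sd) for b in s]


# Usage with made-up data: the masked column depends on x only through
# its estimated moments, never through individual records.
demo_rng = random.Random(42)
s_demo = [demo_rng.gauss(50.0, 10.0) for _ in range(200)]
x_demo = [0.5 * b + demo_rng.gauss(0.0, 5.0) for b in s_demo]
y_demo = gadp_scalar(x_demo, s_demo, random.Random(7))
```

The multivariate case replaces the scalar slope and residual variance with the corresponding matrix expressions.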

#### 2.2. Copula General Additive Data Perturbation Method

#### Copula Function

**Definition 1.**

**Definition 2.**

- ${X}_{1}$ follows a normal distribution, denoted by ${G}_{1}$;
- ${X}_{2}$ follows an exponential distribution, denoted by ${G}_{2}$;
- ${X}_{3}$ follows an exponential distribution, denoted by ${G}_{3}$;
- ${S}_{1}$ follows an exponential distribution, denoted by ${F}_{1}$; and
- ${S}_{2}$ follows an exponential distribution, denoted by ${F}_{2}$.
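
With marginals like those listed above, a copula-based method works on a common scale: each value is passed through its own distribution function to obtain a uniform value, which can then be mapped to a standard normal score and back. The following sketch shows that round trip for one normal and one exponential marginal. The parameter values (mean 10, standard deviation 2, rate 0.5) are hypothetical, since the source does not state them.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the score scale


def exp_cdf(x, rate):
    """Distribution function of an exponential variable with the given rate."""
    return 1.0 - exp(-rate * x)


def exp_inv_cdf(u, rate):
    """Quantile function of an exponential variable with the given rate."""
    return -log(1.0 - u) / rate


def to_normal_score(u):
    """Map a uniform value in (0, 1) to a standard normal score."""
    return STD_NORMAL.inv_cdf(u)


# Hypothetical marginals: G1 = Normal(10, 2), G2 = Exponential(rate 0.5).
g1 = NormalDist(mu=10.0, sigma=2.0)
x1, x2 = 12.0, 1.3

z1 = to_normal_score(g1.cdf(x1))             # one standard deviation above the mean
z2 = to_normal_score(exp_cdf(x2, rate=0.5))  # exponential value on the normal scale
x2_back = exp_inv_cdf(STD_NORMAL.cdf(z2), rate=0.5)  # round trip recovers x2
```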

#### 2.3. Gaussian Copula General Additive Data Perturbation Method

**Theorem 1.**

#### The Gaussian CGADP Method

## 3. Empirical Study Results

#### 3.1. Applying the GADP Method

#### 3.2. Applying the Gaussian CGADP Method

- ${S}_{1}\sim {F}_{1}=N\left({\mu}_{{S}_{1}}=55.44,{\sigma}_{{S}_{1}}=8.75\right)$
- ${S}_{2}\sim {F}_{2}=N\left({\mu}_{{S}_{2}}=157.69,{\sigma}_{{S}_{2}}=5.60\right)$
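
As a rough illustration of how the fitted marginal of ${S}_{1}$ enters a Gaussian copula perturbation, the sketch below converts ${S}_{1}$ values to normal scores, draws a correlated normal score for one masked attribute, and maps it back through a marginal quantile function. It is a scalar simplification under stated assumptions, not the authors' procedure: the correlation `rho`, the exponential marginal for the confidential attribute, and the function names are all hypothetical. Only the parameters of ${F}_{1}$ and the four ${S}_{1}$ values come from the text and data tables.

```python
import random
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
F1 = NormalDist(mu=55.44, sigma=8.75)  # marginal of S1 as stated above


def cgadp_scalar(s_values, rho, inv_cdf_x, rng):
    """Draw masked values for one confidential attribute given S1 alone."""
    masked = []
    for s in s_values:
        z_s = STD_NORMAL.inv_cdf(F1.cdf(s))        # S1 -> normal score
        noise = rng.gauss(0.0, 1.0)
        z_y = rho * z_s + sqrt(1.0 - rho ** 2) * noise  # conditional normal draw
        u = STD_NORMAL.cdf(z_y)                     # normal score -> uniform
        u = min(max(u, 1e-12), 1.0 - 1e-12)         # keep u strictly inside (0, 1)
        masked.append(inv_cdf_x(u))                 # uniform -> attribute scale
    return masked


def exp_inv_cdf(u, rate=0.5):
    """Hypothetical exponential marginal for the confidential attribute."""
    return -log(1.0 - u) / rate


# S1 values of records 1, 2, 185 and 186; rho is an illustrative choice.
y = cgadp_scalar([56, 48, 72, 47], rho=-0.1, inv_cdf_x=exp_inv_cdf,
                 rng=random.Random(0))
```

Because the back-transformation uses the marginal quantile function, the masked values stay on the support of the assumed marginal, which a plain normal perturbation does not guarantee.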

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


| | ${X}_{1}$ | ${X}_{2}$ | ${X}_{3}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|
| ${X}_{1}$ | 1 | | | | |
| ${X}_{2}$ | 0.70 | 1 | | | |
| ${X}_{3}$ | 0.80 | 0.75 | 1 | | |
| ${S}_{1}$ | 0.50 | 0.40 | 0.25 | 1 | |
| ${S}_{2}$ | 0.30 | 0.20 | 0.15 | 0.60 | 1 |

| | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|
| ${Y}_{1}$ | 1 | | | | |
| ${Y}_{2}$ | 0.6322 | 1 | | | |
| ${Y}_{3}$ | 0.7571 | 0.7216 | 1 | | |
| ${S}_{1}$ | 0.5025 | 0.3576 | 0.2342 | 1 | |
| ${S}_{2}$ | 0.2152 | 0.1336 | 0.1000 | 0.4398 | 1 |

| Variable | Description |
|---|---|
| ${X}_{1}$ | Feel rested upon awakening at the end of a sleep period |
| ${X}_{2}$ | Feel satisfied with the quality of your sleep |
| ${X}_{3}$ | Get too much sleep |
| ${X}_{4}$ | Take a nap at a scheduled time |
| ${X}_{5}$ | Fall asleep at an unscheduled time |
| ${S}_{1}$ | Weight |
| ${S}_{2}$ | Height |

| No. | ${X}_{1}$ | ${X}_{2}$ | ${X}_{3}$ | ${X}_{4}$ | ${X}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 7 | 5 | 7 | 56 | 165 |
| 2 | 0 | 0 | 0 | 0 | 0 | 48 | 152 |
| $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ |
| 185 | 1 | 3 | 1 | 0 | 0 | 72 | 162 |
| 186 | 7 | 1 | 0 | 1 | 1 | 47 | 149 |

Summary statistics (Mean, Std) and Pearson’s correlation matrix:

| Variable | Mean | Std | ${X}_{1}$ | ${X}_{2}$ | ${X}_{3}$ | ${X}_{4}$ | ${X}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| ${X}_{1}$ | 2.37 | 2.18 | 1 | | | | | | |
| ${X}_{2}$ | 2.31 | 2.28 | 0.4461 | 1 | | | | | |
| ${X}_{3}$ | 1.15 | 1.71 | 0.1795 | 0.1590 | 1 | | | | |
| ${X}_{4}$ | 1.74 | 1.99 | 0.1092 | 0.0379 | 0.2091 | 1 | | | |
| ${X}_{5}$ | 1.59 | 1.90 | −0.0512 | 0.0116 | 0.1593 | 0.2489 | 1 | | |
| ${S}_{1}$ | 55.44 | 8.75 | −0.1017 | −0.0336 | −0.0977 | −0.0221 | 0.1003 | 1 | |
| ${S}_{2}$ | 157.69 | 5.60 | 0.0032 | −0.0569 | 0.0592 | 0.0807 | −0.0900 | 0.3719 | 1 |

| | ${X}_{1}$ | ${X}_{2}$ | ${X}_{3}$ | ${X}_{4}$ | ${X}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| ${X}_{1}$ | 1 | | | | | | |
| ${X}_{2}$ | 0.4680 | 1 | | | | | |
| ${X}_{3}$ | 0.2017 | 0.2254 | 1 | | | | |
| ${X}_{4}$ | 0.1285 | 0.0787 | 0.1985 | 1 | | | |
| ${X}_{5}$ | −0.0300 | 0.0621 | 0.1690 | 0.3009 | 1 | | |
| ${S}_{1}$ | −0.0863 | −0.0437 | −0.0698 | −0.0355 | 0.1090 | 1 | |
| ${S}_{2}$ | 0.0039 | −0.0531 | 0.0123 | 0.0954 | −0.0470 | 0.3482 | 1 |

| No. | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| 1 | 4.6587 | 2.7093 | −2.0323 | 3.6562 | 5.9077 | 56 | 165 |
| 2 | 4.4505 | 5.0854 | 0.7921 | −1.7925 | −2.3083 | 48 | 152 |
| $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ |
| 185 | 2.0445 | 2.0600 | 1.2364 | 0.4676 | 2.6754 | 72 | 162 |
| 186 | 4.3816 | 4.2980 | 2.8277 | 2.9120 | 2.6487 | 47 | 149 |

Summary statistics (Mean, Std) and Pearson’s correlation matrix:

| Variable | Mean | Std | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| ${Y}_{1}$ | 2.43 | 2.14 | 1 | | | | | | |
| ${Y}_{2}$ | 2.43 | 2.39 | 0.4994 ^{#} | 1 | | | | | |
| ${Y}_{3}$ | 0.93 | 1.72 | 0.1741 | 0.2190 ^{#} | 1 | | | | |
| ${Y}_{4}$ | 1.61 | 2.03 | 0.1834 ^{#} | 0.1141 ^{#} | 0.1503 ^{#} | 1 | | | |
| ${Y}_{5}$ | 1.58 | 1.88 | 0.0315 ^{^} | 0.0548 | 0.2573 ^{#} | 0.3349 ^{#} | 1 | | |
| ${S}_{1}$ | 55.44 | 8.75 | −0.1022 | −0.1272 ^{#} | −0.1063 | −0.0614 | 0.1642 ^{#} | 1 | |
| ${S}_{2}$ | 157.69 | 5.60 | 0.0375 | −0.1244 ^{#} | 0.0808 | 0.0759 | −0.0722 | 0.3719 | 1 |

| | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| ${Y}_{1}$ | 1 | | | | | | |
| ${Y}_{2}$ | 0.4892 | 1 | | | | | |
| ${Y}_{3}$ | 0.1790 | 0.2300 | 1 | | | | |
| ${Y}_{4}$ | 0.2117 ^{#} | 0.1306 ^{#} | 0.1873 | 1 | | | |
| ${Y}_{5}$ | 0.0109 ^{^} | 0.0436 | 0.2704 ^{#} | 0.3062 | 1 | | |
| ${S}_{1}$ | −0.0994 | −0.1151 ^{#} | −0.1405 ^{#} | −0.0163 | 0.2163 ^{#} | 1 | |
| ${S}_{2}$ | 0.0488 | −0.0800 | 0.1086 ^{#} | 0.0892 | −0.0788 | 0.3482 | 1 |

| Variable | Distribution | Parameters | Test Value | p-Value |
|---|---|---|---|---|
| ${S}_{1}$ | Normal | $\mu $ = 55.44, $\sigma $ = 8.73 | 0.0995 | 0.0502 |
| ${S}_{2}$ | Normal | $\mu $ = 157.69, $\sigma $ = 5.59 | 0.0605 | 0.5029 |

**Table 11.** Fitted distributions of ${X}_{1},{X}_{2},{X}_{3},{X}_{4},{X}_{5}$ and goodness-of-fit tests.

| Variable | Distribution | Parameters | Test Value | p-Value |
|---|---|---|---|---|
| ${X}_{1}$ | ZANB | $\mu $ = 3.0331, $\sigma $ = 0.1244, $\pi $ = 0.2796 | 9.04 | 0.2498 |
| ${X}_{2}$ | ZINB | $\mu $ = 3.1563, $\sigma $ = 0.1388, $\pi $ = 0.2692 | 5.2487 | 0.6296 |
| ${X}_{3}$ | ZINB | $\mu $ = 2.1153, $\sigma $ = 0.2780, $\pi $ = 0.4561 | 6.2476 | 0.5112 |
| ${X}_{4}$ | ZANB | $\mu $ = 2.7233, $\sigma $ = 0.1095, $\pi $ = 0.4194 | 6.3010 | 0.5051 |
| ${X}_{5}$ | ZINB | $\mu $ = 2.5237, $\sigma $ = 0.1322, $\pi $ = 0.3694 | 9.2857 | 0.2328 |
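
One quick sanity check on the fitted zero-inflated models is to compare the mean each fit implies with the sample mean. The sketch below does this for the three ZINB rows, assuming that $\pi $ is the probability of an extra zero and $\mu $ is the mean of the negative binomial component. The tables do not spell out the parameterization, so this reading is an assumption; the function name `zinb_mean` is likewise illustrative.

```python
# Fitted ZINB parameters (mu, pi) and sample means, copied from the tables.
fitted = {
    "X2": {"mu": 3.1563, "pi": 0.2692, "sample_mean": 2.31},
    "X3": {"mu": 2.1153, "pi": 0.4561, "sample_mean": 1.15},
    "X5": {"mu": 2.5237, "pi": 0.3694, "sample_mean": 1.59},
}


def zinb_mean(mu, pi):
    """Mean of a zero-inflated count model under the assumed parameterization.

    With probability pi the value is a structural zero; otherwise it is
    drawn from the count component, whose mean is mu.
    """
    return (1.0 - pi) * mu


implied = {name: zinb_mean(p["mu"], p["pi"]) for name, p in fitted.items()}
gaps = {name: abs(implied[name] - fitted[name]["sample_mean"]) for name in fitted}
```

Under this reading the implied means agree with the reported sample means to within rounding. The ZANB rows are left out because a zero-altered model has a different mean formula.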

| No. | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| 1 | 4 | 5 | 0 | 2 | 4 | 56 | 165 |
| 2 | 3 | 1 | 2 | 0 | 2 | 48 | 152 |
| $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ | $\vdots $ |
| 185 | 1 | 0 | 2 | 0 | 0 | 72 | 162 |
| 186 | 0 | 0 | 0 | 3 | 6 | 47 | 149 |

Summary statistics (Mean, Std) and Pearson’s correlation matrix:

| Variable | Mean | Std | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| ${Y}_{1}$ | 2.90 | 2.02 | 1 | | | | | | |
| ${Y}_{2}$ | 2.98 | 2.14 | 0.4975 ^{#} | 1 | | | | | |
| ${Y}_{3}$ | 1.84 | 1.61 | 0.1793 | 0.2930 ^{#} | 1 | | | | |
| ${Y}_{4}$ | 2.42 | 1.85 | 0.2327 ^{#} | 0.1283 ^{#} | 0.3395 ^{#} | 1 | | | |
| ${Y}_{5}$ | 2.01 | 1.80 | −0.0543 | 0.0294 | 0.1530 | 0.3484 ^{#} | 1 | | |
| ${S}_{1}$ | 55.44 | 8.75 | −0.1179 | −0.1061 ^{#} | −0.1382 | 0.0436 ^{^} | 0.0744 | 1 | |
| ${S}_{2}$ | 157.69 | 5.60 | 0.0418 | −0.0871 | 0.0140 | 0.1057 | −0.0500 | 0.3719 | 1 |

| | ${Y}_{1}$ | ${Y}_{2}$ | ${Y}_{3}$ | ${Y}_{4}$ | ${Y}_{5}$ | ${S}_{1}$ | ${S}_{2}$ |
|---|---|---|---|---|---|---|---|
| ${Y}_{1}$ | 1 | | | | | | |
| ${Y}_{2}$ | 0.5122 | 1 | | | | | |
| ${Y}_{3}$ | 0.1969 | 0.2902 ^{#} | 1 | | | | |
| ${Y}_{4}$ | 0.2359 ^{#} | 0.1636 ^{#} | 0.3282 ^{#} | 1 | | | |
| ${Y}_{5}$ | −0.0195 | 0.0588 | 0.1953 | 0.3304 | 1 | | |
| ${S}_{1}$ | −0.1211 | −0.0873 | −0.1357 ^{#} | −0.0108 | 0.0306 ^{#} | 1 | |
| ${S}_{2}$ | 0.0525 | −0.0765 | 0.0264 | 0.1168 | −0.0300 | 0.3482 | 1 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chu, A.M.Y.; Lam, B.S.Y.; Tiwari, A.; So, M.K.P.
An Empirical Study of Applying Statistical Disclosure Control Methods to Public Health Research. *Int. J. Environ. Res. Public Health* **2019**, *16*, 4519.
https://doi.org/10.3390/ijerph16224519
