# Anonymiced Shareable Data: Using mice to Create and Analyze Multiply Imputed Synthetic Datasets


## Abstract

We outline how multiply imputed synthetic datasets can be created and analyzed with `mice (Version 3.13.15)` in `R`. We demonstrate through simulations that the analysis results obtained on synthetic data yield unbiased and valid inferences and lead to synthetic records that cannot be distinguished from the true data records. The ease of use when synthesizing data with `mice`, along with the validity of inferences obtained through this procedure, opens up a wealth of possibilities for data dissemination and further research on initially private data.

## 1. Introduction

`mice` [14] implements multiple imputation of missing data in a straightforward and user-friendly manner. However, the functionality of `mice` is not restricted to the imputation of missing data, but allows imputation of any value in the data, even observed values. Consequently, `mice` can be used for the creation of multiply imputed synthetic datasets.

This paper outlines the generation and analysis of synthetic data with `mice`. First, the `mice` algorithm for the creation of synthetic data will be briefly explained. The aim is to generate synthetic sets that ensure the privacy and confidentiality of the participants. Second, a straightforward workflow for imputation of synthetic data with `mice` will be demonstrated. Third, we demonstrate the validity of the procedure through statistical simulation.

## 2. Generating Synthetic Data with `mice`

The joint modeling approach, which draws imputations from a single multivariate distribution, has been implemented in the `R`-package `jomo` [18] to deal with missing data. However, when the structure of the data increases in complexity, specifying a single multivariate distribution that fits the observed data can become challenging. A more flexible solution has been proposed in sequential regression imputation, in which the multivariate distribution is separated into univariate conditional distributions. Every single variable is then imputed conditionally on a model containing only the variables located before it in the sequence. This approach has been implemented in the `R`-packages `mdmb` [19] for imputation of missing data and `synthpop` [20] for imputation of synthetic data.

In this paper, we focus on the fully conditional specification (FCS) approach as implemented in the `mice` package [14,21] in `R` [22], which has been developed for multiple imputation to overcome problems related to nonresponse. In that context, the aim is to replace missing values with plausible values from the predictive distribution of that variable. This is achieved by breaking down the multivariate distribution of the data $\mathbf{Y}=(\mathbf{Y}_{obs},\mathbf{Y}_{mis})$ into $j=1,2,\cdots,k$ univariate conditional densities, where $k$ denotes the number of columns in the data. Using FCS, a model is constructed for every incomplete variable, and the missing values ${Y}_{j,mis}$ are then imputed with draws from the predictive distribution $P({Y}_{j,mis} \mid \mathbf{Y}_{obs}, \theta)$ on a variable-by-variable basis. Note that the predictor matrix $\mathbf{Y}_{-j}$ may contain previously imputed values, and is thus updated after every iteration. This procedure is applied $m$ times, resulting in $m$ completed datasets $\mathbf{D}=(\mathbf{D}^{(1)},\mathbf{D}^{(2)},\cdots,\mathbf{D}^{(m)})$, with $\mathbf{D}^{(l)}=(\mathbf{Y}_{obs},\mathbf{Y}_{mis}^{(l)})$.

In `mice (Version 3.13.15)`, the generation of multiply imputed datasets to solve for unobserved values is straightforward. The following pseudocode details multiple imputation of any dataset that contains missingness into the object `imp`, with `m = 10` imputed sets and `maxit = 7` iterations for the algorithm to converge, using the default imputation methods.
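A minimal sketch of such a call, using the built-in incomplete `nhanes` data as a stand-in for the user's dataset (the dataset choice and the `seed` are illustrative assumptions):

```r
# Multiple imputation of an incomplete dataset with mice.
library(mice)

imp <- mice(nhanes,          # any data.frame containing missing values
            m = 10,          # number of imputed datasets
            maxit = 7,       # number of iterations of the FCS algorithm
            seed = 123,      # for reproducibility (illustrative)
            printFlag = FALSE)

# imp is a 'mids' object holding the 10 completed datasets
head(complete(imp, 1))       # extract and inspect the first completed dataset
```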

`mice` can be easily extended to generate synthetic values. Rather than imputing missing data, observed values are then replaced by synthetic draws from the predictive distribution. For simplicity, assume that the data are completely observed (i.e., $\mathbf{Y}=\mathbf{Y}_{obs}$). Following the notation of Reiter and Raghunathan [23], let, given $n$ observations, ${Z}_{i}=1$ if any of the values of unit $i=1,2,\cdots,n$ are to be replaced by imputations, and ${Z}_{i}=0$ otherwise, with $Z=({Z}_{1},{Z}_{2},\cdots,{Z}_{n})$. Accordingly, the data consist of values that are to be replaced and values that are to be kept (i.e., $\mathbf{Y}=(\mathbf{Y}_{rep},\mathbf{Y}_{nrep})$). Now, instead of imputing $\mathbf{Y}_{mis}$ with draws from the predictive distribution $P({Y}_{j,mis} \mid \mathbf{Y}_{obs},\theta)$ as in the missing data case, $\mathbf{Y}_{rep}$ is imputed with draws from the posterior predictive distribution $P({Y}_{j,rep}^{(l)} \mid \mathbf{Y}_{-j},Z,\theta)$, where $l$ is an indicator for the synthetic dataset ($l=1,2,\cdots,m$). This process results in the synthetic data $\mathbf{D}=(\mathbf{D}^{(1)},\mathbf{D}^{(2)},\cdots,\mathbf{D}^{(m)})$.

In practice, generating synthetic data into the object `syn`, given the same imputation parameters as specified above, can be realized as follows:
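One way to realize this call is the following sketch; the `where` matrix of all `TRUE` values flags every cell for replacement. The use of `na.omit(nhanes)` as a completely observed example dataset, and the `seed`, are assumptions for illustration:

```r
library(mice)

data <- na.omit(nhanes)      # illustrative completely observed dataset

syn <- mice(data,
            m = 10,
            maxit = 7,
            # flag every cell of the n x k data for synthesis:
            where = matrix(TRUE, nrow(data), ncol(data)),
            seed = 123,      # illustrative
            printFlag = FALSE)
```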

The argument `where` requires a matrix of the same dimensions as the data (i.e., an $n\times k$ matrix) containing logicals ${z}_{ij}$ that indicate which cells are selected to have their values replaced by draws from the predictive distribution. This approach enables replacing a subset of the observed data (e.g., by specifying only those cells that are to be replaced as `TRUE` in the `where`-matrix, leaving the rest `FALSE`), or, as in the aforementioned example, the observed data as a whole, resulting in a dataset that partially or completely consists of synthetic data values. Note that because the data are completely observed, iterating over the predictive distribution is not required.

When the data contain missing values, imputation and synthesization can be combined in `mice`, conditioning the imputations on both the missingness and the values that are to be synthesized. In practice, replacing observed and missing cells in a dataset by synthetic values into the object `syn.mis`, given the same imputation parameters as before, can be realized by the following code execution.
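A sketch of this combined call; `make.where(data, "all")` builds a `where`-matrix that flags every cell, observed and missing, for replacement (the use of the incomplete `nhanes` data and the `seed` are illustrative assumptions):

```r
library(mice)

# nhanes contains missing values; every cell is flagged for replacement,
# so imputation and synthesization happen in a single run.
syn.mis <- mice(nhanes,
                m = 10,
                maxit = 7,   # iterate: imputations condition on imputed values
                where = make.where(nhanes, keyword = "all"),
                seed = 123,  # illustrative
                printFlag = FALSE)
```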

Synthesizing the data with classification and regression trees (CART) amounts to changing the default methods of `mice` to the method `mice.impute.cart()`, realized by:
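A sketch of the same synthesis call with CART as the imputation method; passing `method = "cart"` makes `mice` use `mice.impute.cart()` for every variable (dataset and `seed` are again illustrative assumptions):

```r
library(mice)

data <- na.omit(nhanes)      # illustrative completely observed dataset

syn <- mice(data,
            m = 10,
            maxit = 1,       # completely observed data: no iteration needed
            where = matrix(TRUE, nrow(data), ncol(data)),
            method = "cart", # use mice.impute.cart() for every column
            seed = 123,      # illustrative
            printFlag = FALSE)
```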

These examples show how easily synthetic data can be generated with `mice`. In practice, however, one should take some additional complicating factors into account. For example, one should account for deterministic relations in the data. Additionally, relations between variables may be described best using a different model than `CART`. Such factors are data dependent and should be considered by the imputer. In the next section, we will describe how a completely observed version of the `mice::boys` data [34] can be adequately synthesized. Additionally, we will show through simulations that this approach yields valid inferences.

## 3. Materials and Methods

We assess the performance of `mice` for synthesization using a simulation study on the `mice::boys` dataset. This dataset consists of the values of 748 Dutch boys on nine variables (Table 1).

All variables are used as predictors in the `mice` imputation model, except `bmi`, which is passively imputed using its deterministic relation with `wgt` and `hgt`. Specifically, the imputed values are used to calculate the exact `bmi` values that correspond with `hgt` and `wgt`.

#### Simulation Methods

In each of 1000 simulations, a bootstrapped sample of the `boys` data has been synthesized with $m=5$ imputations for every data cell to induce an appropriate amount of sampling variance. This approach precludes us from knowing the true data-generating model and, in this sense, provides a more stringent test of the applicability of `mice` for real-life synthesization procedures. Specifically, every bootstrapped sample is treated as an actual sample from a population. These bootstrap samples are synthesized using `mice`, with a model that is built to approximate the true data-generating mechanisms as closely as possible. As relationships in realistic data are generally more complex than relationships in parametrically simulated data, good performance on a real dataset is likely to be more indicative of practical applicability than good performance on samples from a known multivariate probability density function.

We use the `CART` imputation method for all columns except `bmi`, which is synthesized passively based on the synthetic values for `hgt` and `wgt`, to preserve the deterministic relation in the synthetic data. Additional parameters that come with the use of `mice.impute.cart()` are the complexity parameter `cp` and the minimum number of observations in any terminal node, `minbucket`, both of which constrain the flexibility of the imputation model. The values of `cp` and `minbucket` ought to adhere to the call for imputation models that are as flexible as possible. Appropriate values for these parameters, as well as the input for the `predictorMatrix`, depend on the data at hand. In the current example, the complexity parameter is specified at `cp` $=10^{-8}$ rather than the default value of $10^{-4}$, and the minimum number of observations in each terminal node is set at `minbucket` $=3$ rather than the default value of 5. By allowing for more complexity in the imputation model, bias in the estimates from the synthetic dataset is reduced. Additionally, since the synthesis pattern is monotone, the number of iterations can be set to `maxit = 1` (e.g., [7], Ch. 3).
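The synthesis model described above can be sketched as follows. The use of the complete cases `cc(boys)` as the completely observed data is an illustrative assumption (the simulations used a completed version of the full data), as is the `seed`; `cp` and `minbucket` are passed through to `mice.impute.cart()`:

```r
library(mice)

# A completely observed stand-in for the boys data (assumption: complete
# cases serve for illustration; the paper uses a completed version).
boyscomp <- cc(boys)

# CART for every column, except bmi, which is synthesized passively
# from the deterministic relation with hgt and wgt.
meth <- rep("cart", ncol(boyscomp))
names(meth) <- names(boyscomp)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"

syn <- mice(boyscomp,
            m = 5,
            maxit = 1,        # monotone synthesis pattern
            method = meth,
            where = make.where(boyscomp, keyword = "all"),
            cp = 1e-8,        # allow more complex trees than the default 1e-4
            minbucket = 3,    # smaller terminal nodes than the default 5
            seed = 123,       # illustrative
            printFlag = FALSE)
```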

To evaluate the validity of `mice` for synthesizing data, we compare the bootstrapped samples with the synthetic versions of these bootstrapped samples. Specifically, univariate descriptive statistics, the correlation matrix, and two linear regression models as well as one ordered logistic regression model will be considered. Subsequently, the bias in the parameters and the 95% confidence interval coverage of the synthetic data will be examined. Similar to multiple imputation of missing data, correct inferences from synthetic data require correct pooling over the multiply imputed datasets.

These pooling rules are implemented in `mice` as the function `pool.syn()`. Returning to our previous example, pooling the parameter estimates of a linear model in which the dependent variable `DV` is regressed on predictors `IV1` and `IV2`, fitted on each of the synthetic datasets, is as straightforward as the following code block.
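A minimal runnable sketch; the simulated data frame and the `seed` are assumptions standing in for real data with variables `DV`, `IV1`, and `IV2`:

```r
library(mice)
set.seed(123)                # illustrative

# Illustrative data with the variable names used in the text
data <- data.frame(IV1 = rnorm(100), IV2 = rnorm(100))
data$DV <- data$IV1 + 0.5 * data$IV2 + rnorm(100)

# Synthesize every cell (completely observed, so one iteration suffices)
syn <- mice(data, m = 10, maxit = 1, method = "cart",
            where = make.where(data, keyword = "all"),
            printFlag = FALSE)

# Fit the analysis model on every synthetic set and pool with pool.syn()
fit <- with(syn, lm(DV ~ IV1 + IV2))
pooled <- pool.syn(fit)
summary(pooled)
```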

## 4. Results

#### 4.1. Univariate Estimates

#### 4.2. Bivariate Estimates

For the bivariate relations in the `boys` dataset, the differences between the correlations in the synthetic and bootstrapped data are very small. These results are displayed in Table 3.

#### 4.3. Multivariate Model Inferences

In the linear regression model, `hgt` is modeled by a continuous predictor `age` and an ordered categorical predictor `phb`. The results for this simulation can be found in Table 4.

Table 4 shows slight undercoverage for some categories of `phb`. This finding is observed in both the bootstrap coverages (i.e., the fraction of 95% confidence intervals that cover the true data parameters) and the synthetic data coverages. Hence, it is likely that this undercoverage stems from the simulation setup, rather than from the imputation procedure. Besides the undercoverage, there is a small bias in the estimated coefficients of the variable `phb` that occurs in the synthetic estimates, but not in the observed estimates. Yet, since the bias is relatively small and does not result in confidence invalidity, it seems fair to assume that the introduced bias is not problematic.

In the ordered logistic regression model, `gen` is modeled by continuous predictors `age` and `hc`, and the categorical predictor `reg`. The results for this model are shown in Table 5.

#### 4.4. Data Discrimination

## 5. Discussion

Synthesizing datasets with `mice` using CART in `R` is a straightforward process that fits well in a data analysis pipeline. The approach is hardly different from using multiple imputation to solve problems related to missing data and, hence, can be expected to be familiar to applied researchers. The multiple synthetic sets yield valid inferences on the true underlying data-generating mechanism, thereby capturing the nature of the original data. This makes the multiple synthesization procedure with `mice` suitable for further dissemination of synthetic datasets. It is important to note that in our simulations we used a single iteration and relied on CART as the imputation method. A single iteration is sufficient only when the true data are completely observed, or when the missingness pattern is monotone [7]. If both observed and unobserved values are to be synthesized, then more iterations and a careful investigation into the convergence of the algorithm are in order. Synthetic data generation is therein no different from multiple imputation.

When synthesizing data with `mice`, close attention should be paid to three distinct factors: (i) the additional uncertainty that is due to synthesizing (part of) the data should be incorporated, (ii) potential disclosure risks that remain after synthesizing the data should be assessed, and (iii) the utility of the synthetic data should be examined. First, the procedure of generating multiple synthetic sets may seem overly complicated. We would like to emphasize that analyzing a single synthesized set, while perhaps unbiased, would underestimate the variance properties that are so important in drawing statistical inferences from finite datasets. After all, we are often not interested in the sample at hand, but aim to make inferences about the underlying data-generating mechanism as reflected in the population. Properly capturing the uncertainty of synthetic datasets, just like with incomplete datasets, is therefore paramount. To achieve this, we adopted a bootstrap scheme in our simulations, to represent sampling from a superpopulation. However, when a single sample should be the reference, in the sense that the sample reflects the population one wishes to make inferences about, adjusted pooling rules are required, similar to the procedure outlined in Vink and van Buuren [37]. The corresponding pooling rules have not been derived yet, and their incorporation in data analysis workflows would require proper attention from developers.

A closely related alternative is the `R`-package `synthpop` [20], whose authors pioneered the field of synthetic data generation in `R`.

Although `synthpop` is capable of synthesizing regardless of missing data, it is not developed to solve missingness problems. If the data for synthesization contain missing values, the missingness will be considered a property of the data, and the corresponding synthetic datasets will contain missing values as well. If the goal is to solve for the missingness and synthesize the dataset, a two-step approach must be used, in which the missing values are imputed using a different package, such as `mice`. Subsequently, `synthpop` can be used to synthesize the imputed data. Then, given $m$ multiple imputations and $r$ synthesizations, at least $m\times r$ synthetic datasets are in order, after which the data can be analysed and the analyses can be pooled using the rules developed by Reiter [11]. However, such an approach would be computationally inefficient and practically cumbersome, because the architectures of the two packages differ.

The benefit of `mice` therefore lies in its ability to both solve the missingness problem and create synthetic versions of the data. Eventually, the flexibility of `mice` is that unobserved and observed data values can be synthesized at once, without the need for a two-step approach. Using `mice`, $m$ synthetic sets would be sufficient. However, as of today, no pooling rules for one-step imputation of missingness and synthesization have been developed. Deriving these would be a major theoretical improvement that further reduces the burden of creating synthetic datasets.

Through the `miceadds` package [40], which links functionality between `synthpop` and `mice`, further improvements could be made in that respect. Specifically, `synthpop` contains a comprehensive suite of functions that can be used to assess the quality of synthetic data (e.g., see [39]), while `mice` contains methodology that allows for flexible synthesis of intricate data structures, such as multilevel data. Being able to use such functionality interchangeably would greatly benefit data analysts.

The ease of use when synthesizing data with `mice` in `R`, together with the validity of inferences obtained through this procedure, opens up a wealth of possibilities for data dissemination and further research of initially private data to applied `R` users.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Gewin, V. Data sharing: An open mind on open data. Nature **2016**, 529, 117–119.
2. Molloy, J.C. The open knowledge foundation: Open data means better science. PLoS Biol. **2011**, 9, e1001195.
3. Walport, M.; Brest, P. Sharing research data to improve public health. Lancet **2011**, 377, 537–539.
4. Lazer, D.; Pentland, A.S.; Adamic, L.; Aral, S.; Barabasi, A.L.; Brewer, D.; Christakis, N.; Contractor, N.; Fowler, J.; Gutmann, M.; et al. Life in the network: The coming age of computational social science. Science **2009**, 323, 721.
5. Ohm, P. Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. UCLA Law Rev. **2009**, 57, 1701–1778.
6. National Research Council. Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data; The National Academies Press: Washington, DC, USA, 2007.
7. Drechsler, J. Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation; Springer Science & Business Media: New York, NY, USA, 2011; Volume 201.
8. Rubin, D.B. Statistical disclosure limitation. J. Off. Stat. **1993**, 9, 461–468.
9. Little, R.J. Statistical analysis of masked data. J. Off. Stat. **1993**, 9, 407–426.
10. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; Wiley: Hoboken, NJ, USA, 1987.
11. Reiter, J.P. Simultaneous use of multiple imputation for missing data and disclosure limitation. Surv. Methodol. **2004**, 30, 235–242.
12. Grund, S.; Lüdtke, O.; Robitzsch, A. Using Synthetic Data to Improve the Reproducibility of Statistical Results in Psychological Research. 2021. Available online: https://twitter.com/PsyArXivBot/status/1418605901351632899 (accessed on 7 October 2021).
13. Jiang, B.; Raftery, A.E.; Steele, R.J.; Wang, N. Balancing Inferential Integrity and Disclosure Risk via Model Targeted Masking and Multiple Imputation. J. Am. Stat. Assoc. **2021**.
14. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. **2011**, 45, 1–67.
15. Neyman, J. On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. J. R. Stat. Soc. **1934**, 97, 123–150.
16. Murray, J.S. Multiple imputation: A review of practical and theoretical findings. Stat. Sci. **2018**, 33, 142–159.
17. Lüdtke, O.; Robitzsch, A.; West, S.G. Regression models involving nonlinear effects with missing data: A sequential modeling approach using Bayesian estimation. Psychol. Methods **2020**, 25, 157.
18. Quartagno, M.; Carpenter, J. jomo: A Package for Multilevel Joint Modelling Multiple Imputation; R Foundation for Statistical Computing: Vienna, Austria, 2020.
19. Robitzsch, A.; Luedtke, O. mdmb: Model Based Treatment of Missing Data; R Package Version 1.5-8; R Foundation for Statistical Computing: Vienna, Austria, 2021.
20. Nowok, B.; Raab, G.M.; Dibben, C. synthpop: Bespoke Creation of Synthetic Data in R. J. Stat. Softw. **2016**, 74, 1–26.
21. Van Buuren, S.; Brand, J.P.L.; Groothuis-Oudshoorn, C.G.M.; Rubin, D.B. Fully conditional specification in multivariate imputation. J. Stat. Comput. Simul. **2006**, 76, 1049–1064.
22. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021.
23. Reiter, J.P.; Raghunathan, T.E. The Multiple Adaptations of Multiple Imputation. J. Am. Stat. Assoc. **2007**, 102, 1462–1471.
24. Rubin, D.B. Multiple imputation after 18+ years. J. Am. Stat. Assoc. **1996**, 91, 473–489.
25. Yucel, R.M.; Zhao, E.; Schenker, N.; Raghunathan, T.E. Sequential Hierarchical Regression Imputation. J. Surv. Stat. Methodol. **2018**, 6, 1–22.
26. Murray, J.S.; Reiter, J.P. Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. J. Am. Stat. Assoc. **2016**, 111, 1466–1479.
27. Liu, D.; Oberman, H.I.; Muñoz, J.; Hoogland, J.; Debray, T.P.A. Quality control, data cleaning, imputation. In Clinical Applications of Artificial Intelligence in Real-World Data; Asselbergs, F., Moore, J., Denaxas, S., Oberski, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2021.
28. Reiter, J.P. Using CART to generate partially synthetic public use microdata. J. Off. Stat. **2005**, 21, 441.
29. Burgette, L.F.; Reiter, J.P. Multiple imputation for missing data via sequential regression trees. Am. J. Epidemiol. **2010**, 172, 1070–1076.
30. Doove, L.L.; Van Buuren, S.; Dusseldorp, E. Recursive partitioning for missing data imputation in the presence of interaction effects. Comput. Stat. Data Anal. **2014**, 72, 92–104.
31. Raab, G.M.; Nowok, B.; Dibben, C. Practical Data Synthesis for Large Samples. J. Priv. Confident. **2016**, 7, 67–97.
32. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
33. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112.
34. Fredriks, A.M.; Van Buuren, S.; Burgmeijer, R.J.; Meulmeester, J.F.; Beuker, R.J.; Brugman, E.; Roede, M.J.; Verloove-Vanhorick, S.P.; Wit, J.M. Continuing positive secular growth change in The Netherlands 1955–1997. Pediatr. Res. **2000**, 47, 316–323.
35. Reiter, J.P. Inference for partially synthetic, public use microdata sets. Surv. Methodol. **2003**, 29, 181–188.
36. Reiter, J.P.; Kinney, S.K. Inferentially valid, partially synthetic data: Generating from posterior predictive distributions not necessary. J. Off. Stat. **2012**, 28, 583.
37. Vink, G.; van Buuren, S. Pooling multiple imputations when the sample happens to be the population. arXiv **2014**, arXiv:1409.8542.
38. Drechsler, J.; Reiter, J.P. An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Comput. Stat. Data Anal. **2011**, 55, 3232–3243.
39. Snoke, J.; Raab, G.M.; Nowok, B.; Dibben, C.; Slavkovic, A. General and specific utility measures for synthetic data. J. R. Stat. Soc. Ser. A (Stat. Soc.) **2018**, 181, 663–688.
40. Robitzsch, A.; Grund, S. miceadds: Some Additional Multiple Imputation Functions, Especially for ‘mice’; R Package Version 3.11-6; R Foundation for Statistical Computing: Vienna, Austria, 2021.

**Figure 1.**Empirical sampling distribution of the bootstrapped and synthetic means of the continuous variables in the boys data.

**Figure 2.**Empirical sampling distribution of the number of observations within each category of the categorical variables in the bootstrapped and synthetic boys data.

**Table 1.** Variables in the `mice::boys` dataset.

| Column | Description |
|---|---|
| age | age in years |
| hgt | height (cm) |
| wgt | weight (kg) |
| bmi | body mass index |
| hc | head circumference (cm) |
| gen | genital Tanner stage G1–G5 |
| phb | pubic hair Tanner stage P1–P6 |
| tv | testicular volume (mL) |
| reg | region |

**Table 2.** Univariate descriptives for the true data and $m=5$ pooled univariate descriptives for the synthetic data over 1000 simulations. Variable names followed by an asterisk (*) are categorical.

| | n | Mean | sd | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| original age | 748 | 9.16 | 6.89 | 10.50 | 0.04 | 21.18 | −0.03 | −1.56 |
| synthetic age | 748 | 9.15 | 6.89 | 10.49 | 0.04 | 20.96 | −0.03 | −1.55 |
| original hgt | 748 | 131.10 | 46.52 | 145.75 | 50.00 | 198.00 | −0.30 | −1.47 |
| synthetic hgt | 748 | 131.06 | 46.50 | 145.32 | 50.69 | 197.16 | −0.30 | −1.47 |
| original wgt | 748 | 37.12 | 26.03 | 34.55 | 3.14 | 117.40 | 0.38 | −1.03 |
| synthetic wgt | 748 | 37.09 | 26.00 | 34.44 | 3.35 | 112.26 | 0.38 | −1.03 |
| original bmi | 748 | 18.04 | 3.04 | 17.45 | 11.73 | 31.74 | 1.14 | 1.79 |
| synthetic bmi | 748 | 18.05 | 3.08 | 17.48 | 11.49 | 32.37 | 1.11 | 1.85 |
| original hc | 748 | 51.62 | 5.86 | 53.10 | 33.70 | 65.00 | −0.91 | 0.12 |
| synthetic hc | 748 | 51.61 | 5.86 | 53.18 | 34.38 | 62.85 | −0.91 | 0.12 |
| original gen * | 748 | 2.53 | 1.59 | 2.00 | 1.00 | 5.00 | 0.52 | −1.36 |
| synthetic gen * | 748 | 2.53 | 1.59 | 2.00 | 1.00 | 5.00 | 0.52 | −1.35 |
| original phb * | 748 | 2.75 | 1.86 | 2.00 | 1.00 | 6.00 | 0.56 | −1.25 |
| synthetic phb * | 748 | 2.75 | 1.86 | 2.00 | 1.00 | 6.00 | 0.56 | −1.24 |
| original tv | 748 | 8.43 | 8.12 | 3.00 | 1.00 | 25.00 | 0.85 | −0.78 |
| synthetic tv | 748 | 8.42 | 8.11 | 3.19 | 1.00 | 25.00 | 0.85 | −0.77 |
| original reg * | 748 | 3.02 | 1.14 | 3.00 | 1.00 | 5.00 | −0.08 | −0.77 |
| synthetic reg * | 748 | 3.02 | 1.14 | 3.00 | 1.00 | 5.00 | −0.08 | −0.76 |

**Table 3.**Bivariate correlations of the numerical columns in the true data with in parentheses the corresponding bias of the $m=5$ pooled synthetic correlations over 1000 simulations. All estimates are rounded to 3 decimal places.

| | age | hgt | wgt | bmi | hc | tv |
|---|---|---|---|---|---|---|
| age | 1 | 0.976 (0.001) | 0.950 (0.000) | 0.627 (0.009) | 0.853 (0.000) | 0.810 (0.002) |
| hgt | 0.976 (0.001) | 1 | 0.944 (0.001) | 0.596 (0.013) | 0.907 (0.000) | 0.754 (0.000) |
| wgt | 0.950 (0.000) | 0.944 (0.001) | 1 | 0.791 (0.009) | 0.834 (0.000) | 0.817 (0.000) |
| bmi | 0.627 (0.009) | 0.596 (0.013) | 0.791 (0.009) | 1 | 0.588 (0.009) | 0.610 (0.007) |
| hc | 0.853 (0.000) | 0.907 (0.000) | 0.834 (0.000) | 0.588 (0.009) | 1 | 0.623 (0.000) |
| tv | 0.810 (0.002) | 0.754 (0.000) | 0.817 (0.000) | 0.610 (0.007) | 0.623 (0.000) | 1 |

**Table 4.** Simulation results for a linear regression model with continuous and ordered categorical predictors. The model evaluated is `hgt ∼ age + phb`. Depicted are the true data estimates, the bias from the true data estimates, and the coverage rate of the 95% confidence interval for the bootstrap and synthetic datasets.

| Term | Estimate | Bias (Bootstrap) | Cov (Bootstrap) | Bias (Synthetic) | Cov (Synthetic) |
|---|---|---|---|---|---|
| (Intercept) | 63.087 | −0.001 | 0.970 | 0.405 | 0.958 |
| age | 7.174 | 0.000 | 0.958 | −0.033 | 0.947 |
| phb.L | −12.250 | 0.008 | 0.950 | 0.582 | 0.927 |
| phb.Q | −1.376 | −0.022 | 0.926 | 0.112 | 0.934 |
| phb.C | −3.564 | 0.051 | 0.915 | 0.301 | 0.912 |
| phb^4 | −0.431 | 0.016 | 0.930 | 0.106 | 0.940 |
| phb^5 | 2.064 | 0.060 | 0.941 | 0.077 | 0.943 |

**Table 5.** Simulation results for a proportional odds logistic regression model with continuous and ordered categorical predictors. The model evaluated is `gen ∼ age + hc + reg`. Depicted are the true data estimates, the bias from the true data estimates, and the coverage rate of the 95% confidence interval for the bootstrap and synthetic datasets.

| Term | Estimate | Bias (Bootstrap) | Cov (Bootstrap) | Bias (Synthetic) | Cov (Synthetic) |
|---|---|---|---|---|---|
| age | 0.461 | 0.004 | 0.942 | 0.002 | 0.939 |
| hc | −0.188 | −0.000 | 0.929 | −0.004 | 0.945 |
| regeast | −0.339 | 0.012 | 0.960 | 0.092 | 0.957 |
| regwest | 0.486 | 0.009 | 0.952 | −0.122 | 0.944 |
| regsouth | 0.646 | 0.012 | 0.966 | −0.152 | 0.943 |
| regcity | −0.069 | 0.012 | 0.940 | 0.001 | 0.972 |
| G1\|G2 | −6.322 | 0.032 | 0.934 | −0.254 | 0.946 |
| G2\|G3 | −4.501 | 0.052 | 0.936 | −0.246 | 0.945 |
| G3\|G4 | −3.842 | 0.058 | 0.937 | −0.244 | 0.948 |
| G4\|G5 | −2.639 | 0.064 | 0.936 | −0.253 | 0.947 |

**Table 6.**Simulation results for a logistic regression model aimed at discriminating between synthetic records and true records.

| Term | Estimate | Std. Error | Statistic | df | p Value |
|---|---|---|---|---|---|
| (Intercept) | 0.22 | 1.15 | 0.19 | 521.93 | 0.60 |
| wgt | 0.00 | 0.02 | 0.25 | 420.19 | 0.60 |
| hgt | −0.00 | 0.01 | −0.18 | 359.73 | 0.60 |
| age | −0.00 | 0.05 | −0.00 | 345.73 | 0.62 |
| hc | 0.00 | 0.03 | 0.11 | 415.44 | 0.61 |
| gen.L | −0.00 | 0.42 | −0.00 | 164.76 | 0.65 |
| gen.Q | −0.01 | 0.21 | −0.04 | 203.38 | 0.63 |
| gen.C | −0.00 | 0.16 | −0.02 | 239.63 | 0.64 |
| gen^4 | 0.01 | 0.21 | 0.01 | 237.41 | 0.63 |
| phb.L | −0.02 | 0.44 | −0.04 | 156.54 | 0.64 |
| phb.Q | −0.01 | 0.22 | −0.02 | 198.20 | 0.64 |
| phb.C | 0.00 | 0.18 | 0.00 | 211.17 | 0.62 |
| phb^4 | 0.00 | 0.18 | 0.02 | 228.51 | 0.64 |
| phb^5 | 0.00 | 0.20 | 0.02 | 248.57 | 0.63 |
| tv | 0.00 | 0.02 | 0.01 | 264.26 | 0.63 |
| regeast | −0.00 | 0.23 | 0.01 | 210.90 | 0.64 |
| regwest | −0.00 | 0.22 | −0.00 | 221.14 | 0.63 |
| regsouth | −0.01 | 0.22 | −0.02 | 226.10 | 0.64 |
| regcity | 0.01 | 0.27 | 0.03 | 204.15 | 0.64 |
| bmi | −0.02 | 0.06 | −0.26 | 320.18 | 0.61 |

**Table 7.**Confusion statistics for a prediction model aimed at discriminating between synthetic records and true records.

| | Estimate |
|---|---|
| Accuracy | 0.50381 |
| Balanced Accuracy | 0.50381 |
| Kappa | 0.00762 |
| McNemar p Value | 0.63187 |
| Sensitivity | 0.50368 |
| Specificity | 0.50394 |
| Prevalence | 0.50000 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Volker, T.B.; Vink, G.
Anony*mice*d Shareable Data: Using *mice* to Create and Analyze Multiply Imputed Synthetic Datasets. *Psych* **2021**, *3*, 703-716.
https://doi.org/10.3390/psych3040045
