# Combining Probability and Nonprobability Samples by Using Multivariate Mass Imputation Approaches with Application to Biomedical Research


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Multivariate Mass Imputation Approaches

#### 2.2. Monte Carlo Simulation Study

#### 2.3. Real Data Application

## 3. Results

#### 3.1. Monte Carlo Simulation Study

#### 3.2. Real Data Application

## 4. Conclusions

## Author Contributions

## Funding

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest


**Table 1.** Comparison of population averages, probability sample weighted averages, and nonprobability sample unweighted averages.

| Variable | Population | Probability Sample | Nonprobability Sample |
|---|---|---|---|
| X1 (Value=1) | 0.200 | 0.199 | 0.041 |
| X1 (Value=2) | 0.300 | 0.299 | 0.077 |
| X2 | 2.300 | 2.301 | 2.836 |
| X3 | 5.300 | 5.298 | 5.988 |
| X4 | 0.602 | 0.602 | 0.688 |
| X5 (Value=1) | 0.159 | 0.159 | 0.049 |
| X5 (Value=2) | 0.538 | 0.538 | 0.478 |
| X6 | 0.303 | 0.304 | 0.336 |
| X7 | 1.600 | 1.602 | 2.727 |
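The contrast in Table 1 between weighted probability-sample averages and unweighted nonprobability-sample averages can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's simulation design: the population, the inclusion probabilities, and the helper names `weighted_mean` and `unweighted_mean` are hypothetical.

```python
import random


def weighted_mean(values, weights):
    """Weighted (Hajek-type) mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def unweighted_mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


rng = random.Random(2023)  # fixed seed so the run is reproducible

# Hypothetical finite population of one continuous variable.
N = 20_000
population = [rng.gauss(2.3, 1.0) for _ in range(N)]
pop_mean = unweighted_mean(population)

# Probability sample: each unit is selected independently with a KNOWN
# inclusion probability pi, so the design weight 1 / pi is available.
prob_values, prob_weights = [], []
for y in population:
    pi = 0.08 if y > 2.3 else 0.02
    if rng.random() < pi:
        prob_values.append(y)
        prob_weights.append(1.0 / pi)

# Nonprobability sample: selection also depends on y, but the mechanism is
# treated as unknown to the analyst, so no weights can be attached.
nonprob_values = [
    y for y in population if rng.random() < (0.10 if y > 2.3 else 0.02)
]

prob_weighted = weighted_mean(prob_values, prob_weights)
nonprob_unweighted = unweighted_mean(nonprob_values)

print(f"population mean:            {pop_mean:.3f}")
print(f"probability, weighted:      {prob_weighted:.3f}")
print(f"nonprobability, unweighted: {nonprob_unweighted:.3f}")
```

Under these assumptions the weighted estimate should land near the population mean, while the unweighted nonprobability estimate is pulled toward the over-represented units, which is the qualitative pattern Table 1 reports.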

| Variable | Method | Estimate | Bias |
|---|---|---|---|
| X4 | mice (pmm) | 0.598 | −0.0036 |
| | mice (cart) | 0.573 | −0.0288 |
| | mice (rf) | 0.706 | 0.1041 |
| | gerbil | 0.603 | 0.0012 |
| X5 (Value=1) | mice (pmm) | 0.159 | 0.0002 |
| | mice (cart) | 0.140 | −0.0188 |
| | mice (rf) | 0.048 | −0.1112 |
| | gerbil | 0.160 | 0.0012 |
| X5 (Value=2) | mice (pmm) | 0.537 | −0.0007 |
| | mice (cart) | 0.543 | 0.0052 |
| | mice (rf) | 0.569 | 0.0314 |
| | gerbil | 0.534 | −0.0039 |
| X6 | mice (pmm) | 0.312 | 0.0091 |
| | mice (cart) | 0.282 | −0.0213 |
| | mice (rf) | 0.269 | −0.0339 |
| | gerbil | 0.308 | 0.0049 |
| X7 | mice (pmm) | 1.603 | 0.0025 |
| | mice (cart) | 1.603 | 0.0025 |
| | mice (rf) | 1.603 | 0.0025 |
| | gerbil | 1.603 | 0.0025 |
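The Estimate and Bias columns above can be read through a small sketch of the general mass imputation idea: fit an outcome model on the nonprobability sample, impute the outcome for every unit of the probability sample, and take the design-weighted mean of the imputed values. The sketch below is a deliberate simplification and not the paper's procedure: it swaps the multivariate `mice` and `gerbil` imputations for a single-covariate linear regression, and the data, selection mechanism, and function names (`fit_simple_ols`, `mass_imputation_mean`) are all hypothetical.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def mass_imputation_mean(x_prob, w_prob, x_nonprob, y_nonprob):
    """Fit on the nonprobability sample, impute y for the probability
    sample, and return the design-weighted mean of the imputed values."""
    intercept, slope = fit_simple_ols(x_nonprob, y_nonprob)
    y_imputed = [intercept + slope * xi for xi in x_prob]
    return sum(w * yi for w, yi in zip(w_prob, y_imputed)) / sum(w_prob)


rng = random.Random(7)  # fixed seed so the run is reproducible

# Hypothetical population in which y depends linearly on x.
N = 20_000
x_pop = [rng.gauss(0.0, 1.0) for _ in range(N)]
y_pop = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x_pop]
true_mean = sum(y_pop) / N

# Probability sample: simple random sample; x and weights observed, y not.
n_a = 1_000
idx_a = rng.sample(range(N), n_a)
x_a = [x_pop[i] for i in idx_a]
w_a = [N / n_a] * n_a

# Nonprobability sample: x and y observed; selection depends on x only.
idx_b = [i for i in range(N) if rng.random() < (0.10 if x_pop[i] > 0 else 0.02)]
x_b = [x_pop[i] for i in idx_b]
y_b = [y_pop[i] for i in idx_b]

naive_estimate = sum(y_b) / len(y_b)
mi_estimate = mass_imputation_mean(x_a, w_a, x_b, y_b)

# Bias is taken here as estimate minus the population value.
bias_naive = naive_estimate - true_mean
bias_mi = mi_estimate - true_mean
print(f"naive bias: {bias_naive:.4f}, mass imputation bias: {bias_mi:.4f}")
```

In this toy setting the imputation model is correctly specified and selection depends only on the covariate, so the mass imputation estimate should sit much closer to the population mean than the naive nonprobability mean. With a misspecified model that advantage can shrink or vanish.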

**Table 3.** Comparison of distributions of covariate variables in BRFSS and TBRFSS (significant results with p-values less than 0.001 are marked with *).

| Variable | Value | BRFSS Weighted Frequency (Percent) | TBRFSS Unweighted Frequency (Percent) |
|---|---|---|---|
| age * | 18–24 | 46,597 (17.07) | 37 (5.83) |
| | 25–29 | 30,027 (11.00) | 48 (7.56) |
| | 30–34 | 32,567 (11.93) | 46 (7.24) |
| | 35–39 | 29,459 (10.79) | 49 (7.72) |
| | 40–44 | 19,838 (7.27) | 55 (8.66) |
| | 45–49 | 17,961 (6.58) | 51 (8.03) |
| | 50–54 | 21,637 (7.93) | 63 (9.92) |
| | 55–59 | 21,303 (7.81) | 94 (14.80) |
| | 60–64 | 16,142 (5.91) | 76 (11.97) |
| | 65–69 | 12,267 (4.49) | 59 (9.29) |
| | 70+ | 25,129 (9.21) | 57 (8.98) |
| gender * | Male | 133,198 (48.80) | 140 (22.05) |
| | Female | 139,728 (51.20) | 495 (77.95) |
| marital * | Married | 120,946 (44.31) | 242 (38.11) |
| | Divorced/Separated | 50,397 (18.47) | 142 (22.36) |
| | Widowed | 16,701 (6.12) | 60 (9.45) |
| | Never Married | 72,022 (26.39) | 114 (17.95) |
| | Member of Unmarried Couple | 12,861 (4.71) | 77 (12.13) |
| education * | Less than High School | 38,116 (13.97) | 63 (9.92) |
| | High School Graduate | 103,878 (38.06) | 191 (30.08) |
| | Some College/Technical School | 89,158 (32.67) | 231 (36.38) |
| | College Graduate | 41,774 (15.31) | 150 (23.62) |
| employ * | Employed/Self-employed | 157,742 (57.80) | 400 (62.99) |
| | Unemployed/Homemaker/Student | 49,507 (18.14) | 72 (11.34) |
| | Retired | 31,124 (11.40) | 104 (16.38) |
| | Unable to Work | 34,553 (12.66) | 59 (9.29) |
| income * | Less than USD 10,000 | 24,554 (9.00) | 117 (18.43) |
| | Less than USD 15,000 | 11,586 (4.25) | 60 (9.45) |
| | Less than USD 20,000 | 32,404 (11.87) | 63 (9.92) |
| | Less than USD 25,000 | 29,114 (10.67) | 76 (11.97) |
| | Less than USD 35,000 | 35,740 (13.10) | 88 (13.86) |
| | Less than USD 50,000 | 42,416 (15.54) | 89 (14.02) |
| | Less than USD 75,000 | 40,524 (14.85) | 79 (12.44) |
| | USD 75,000 or More | 56,587 (20.73) | 63 (9.92) |
| BMI Cat * | Underweight/Healthy weight | 64,439 (23.61) | 105 (16.54) |
| | Overweight | 98,507 (36.09) | 176 (27.72) |
| | Obese | 109,980 (40.30) | 354 (55.75) |
| general health * | Excellent | 37,839 (13.86) | 56 (8.82) |
| | Very Good | 78,767 (28.86) | 144 (22.68) |
| | Good | 85,727 (31.41) | 261 (41.10) |
| | Fair/Poor | 70,593 (25.87) | 174 (27.40) |

| Variable | Naïve | Mice (pmm) | Mice (cart) | Mice (rf) | Gerbil |
|---|---|---|---|---|---|
| cvd | 0.0353 | 0.0434 | 0.0290 | −0.0081 | 0.0401 |
| asth | −0.0300 | −0.0273 | −0.0405 | −0.0867 | −0.0179 |
| hlthcov | −0.1391 | −0.1548 | −0.1012 | −0.0535 | −0.1197 |
| stroke | −0.0082 | −0.0015 | 0.0033 | −0.0334 | −0.0027 |
| diabete | 0.1070 | 0.0508 | 0.0667 | 0.0252 | 0.0515 |
| smoke | −0.0732 | 0.0154 | 0.0188 | −0.1343 | 0.0400 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chen, S.; Woodruff, A.M.; Campbell, J.; Vesely, S.; Xu, Z.; Snider, C.
Combining Probability and Nonprobability Samples by Using Multivariate Mass Imputation Approaches with Application to Biomedical Research. *Stats* **2023**, *6*, 617-625.
https://doi.org/10.3390/stats6020039
