Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health
Abstract
1. Introduction
2. Materials and Methods
3. Results
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.7191 | 1.5332 | 0.0076 | 0.0162 | 0.7191 | 1.5332 |
MICE: pmm | −0.2458 | 0.5242 | 0.0131 | 0.0279 | 0.2462 | 0.5249 |
FHDI | −0.9888 | 2.1082 | 0.0072 | 0.0154 | 0.9888 | 2.1083 |
GERBIL | −0.353 | 0.7527 | 0.0536 | 0.1143 | 0.3571 | 0.7614 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.2694 | 2.7066 | 0.007 | 0.0149 | 1.2694 | 2.7066 |
MICE: pmm | 0.1534 | 0.3271 | 0.008 | 0.0171 | 0.1536 | 0.3275 |
FHDI | −1.0655 | 2.2718 | 0.0218 | 0.0465 | 1.0657 | 2.2722 |
GERBIL | 0.052 | 0.1108 | 0.0528 | 0.1126 | 0.0741 | 0.1580 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 8.2251 | 17.5373 | 0.0161 | 0.0343 | 8.2252 | 17.5374 |
MICE: pmm | 5.192 | 11.07 | 0.0121 | 0.0258 | 5.192 | 11.0701 |
FHDI | 7.0293 | 14.9875 | 0.0099 | 0.0211 | 7.0293 | 14.9875 |
GERBIL | 5.4701 | 11.6631 | 0.045 | 0.0959 | 5.4703 | 11.6635 |
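The columns in the table above appear to be linked by the usual decomposition RMSE² = B² + SE², with the percentage columns being the same quantities rescaled by a common reference value. A minimal Python sketch of that consistency check follows, using the four MAR rows copied from the table. The helper name `rmse_from_bias_se` and the tolerance are illustrative assumptions, and the exact metric definitions are not stated in this excerpt, so small gaps from rounding are expected.

```python
import math

# (method, B, SE, reported RMSE) copied from the MAR block of the table above.
rows = [
    ("Complete-case", 0.7191, 0.0076, 0.7191),
    ("MICE: pmm", -0.2458, 0.0131, 0.2462),
    ("FHDI", -0.9888, 0.0072, 0.9888),
    ("GERBIL", -0.353, 0.0536, 0.3571),
]


def rmse_from_bias_se(bias, se):
    """RMSE implied by bias and standard error, assuming RMSE^2 = B^2 + SE^2."""
    return math.hypot(bias, se)


# Track the largest gap between the implied and the reported RMSE.
max_gap = 0.0
for method, bias, se, reported in rows:
    implied = rmse_from_bias_se(bias, se)
    max_gap = max(max_gap, abs(implied - reported))
    print(f"{method:14s} implied={implied:.4f} reported={reported:.4f}")

print(f"largest gap: {max_gap:.5f}")
```

Because B dominates SE in most rows, RMSE is close to |B| throughout, which is why the RB (%) and RRMSE (%) columns are nearly identical.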
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.1366 | 5.7022 | 0.002 | 0.0835 | 0.1366 | 5.7012 |
MICE: pmm | 0.1013 | 4.2285 | 0.0003 | 0.0125 | 0.1013 | 4.2279 |
FHDI | 0.1415 | 5.9063 | 0.0076 | 0.3172 | 0.1417 | 5.9140 |
GERBIL | 0.115 | 4.7998 | 0.0208 | 0.8681 | 0.1169 | 4.8790 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.2539 | 10.5992 | 0.0016 | 0.0668 | 0.2539 | 10.5968 |
MICE: pmm | 0.2297 | 9.5869 | 0.0018 | 0.0751 | 0.2297 | 9.5868 |
MICE: logreg + pmm | 0.2175 | 9.0798 | 0.0018 | 0.0751 | 0.2175 | 9.0776 |
FHDI | 0.2101 | 8.7674 | 0.0103 | 0.4299 | 0.2103 | 8.7771 |
GERBIL: 1-step | 0.2199 | 9.1783 | 0.0219 | 0.9140 | 0.221 | 9.2237 |
GERBIL: 2-step | 0.1084 | 4.5252 | 0.0203 | 0.8472 | 0.1103 | 4.6035 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.9304 | 38.8318 | 0.0004 | 0.0167 | 0.9304 | 38.8314 |
MICE: pmm | 0.7432 | 31.0186 | 0.0002 | 0.0083 | 0.7432 | 31.0184 |
MICE: logreg + pmm | 0.728 | 30.3866 | 0.0002 | 0.0083 | 0.728 | 30.3840 |
FHDI | 0.5941 | 24.7963 | 0.0036 | 0.1503 | 0.5941 | 24.7955 |
GERBIL: 1-step | 0.8023 | 33.487 | 0.0212 | 0.8848 | 0.8026 | 33.4975 |
GERBIL: 2-step | 0.6896 | 28.7811 | 0.0202 | 0.8431 | 0.6899 | 28.7938 |
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.5803 | 15.1983 | 0.0039 | 0.1021 | 0.5803 | 15.1979 |
MICE: pmm | 0.1707 | 4.4694 | 0.0008 | 0.0210 | 0.1707 | 4.4706 |
FHDI | −0.1661 | 4.3493 | 0.0028 | 0.0733 | 0.1661 | 4.3501 |
GERBIL | −0.0723 | 1.8937 | 0.0222 | 0.5814 | 0.0756 | 1.9799 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.6638 | 17.3859 | 0.0041 | 0.1074 | 0.6639 | 17.3873 |
MICE: pmm | 0.1942 | 5.087 | 0.0031 | 0.0812 | 0.1943 | 5.0887 |
FHDI | −0.1376 | 3.6025 | 0.0085 | 0.2226 | 0.1378 | 3.6089 |
GERBIL | 0.0009 | 0.0246 | 0.0238 | 0.6233 | 0.0239 | 0.6259 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.655 | 43.3448 | 0 | 0.0000 | 1.655 | 43.3439 |
MICE: pmm | 1.382 | 36.1944 | 0.0026 | 0.0681 | 1.382 | 36.1941 |
FHDI | 1.0198 | 26.7078 | 0.0094 | 0.2462 | 1.0198 | 26.7082 |
GERBIL | 1.099 | 28.7826 | 0.0328 | 0.8590 | 1.0995 | 28.7955 |
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0699 | 17.3139 | 0.0002 | 0.0495 | 0.0699 | 17.3114 |
MICE: logreg | 0.0083 | 2.0617 | 0.0004 | 0.0991 | 0.0083 | 2.0556 |
FHDI | 0.0623 | 15.4393 | 0.0021 | 0.5201 | 0.0624 | 15.4539 |
GERBIL | 0.0051 | 1.2713 | 0.0032 | 0.7925 | 0.006 | 1.4860 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0753 | 18.6511 | 0.0002 | 0.0495 | 0.0753 | 18.6487 |
MICE: logreg | 0.002 | 0.5003 | 0.0003 | 0.0743 | 0.002 | 0.4953 |
FHDI | 0.0112 | 2.7679 | 0.0004 | 0.0991 | 0.0112 | 2.7738 |
GERBIL | −0.0054 | 1.3342 | 0.0033 | 0.8173 | 0.0063 | 1.5603 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.2603 | 64.474 | 0.0004 | 0.0991 | 0.2603 | 64.4657 |
MICE: logreg | −0.2229 | 55.2122 | 0.0005 | 0.1238 | 0.2229 | 55.2032 |
FHDI | −0.2499 | 61.8784 | 0.0003 | 0.0743 | 0.2499 | 61.8900 |
GERBIL | −0.2354 | 58.2945 | 0.003 | 0.7430 | 0.2354 | 58.2989 |
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.4639 | 59.406 | 0.0046 | 0.5890 | 0.464 | 59.4110 |
MICE | −0.2066 | 26.4567 | 0.0063 | 0.8067 | 0.2067 | 26.4661 |
FHDI | −0.3933 | 50.3674 | 0.0001 | 0.0128 | 0.3933 | 50.3585 |
GERBIL | −0.1727 | 22.1189 | 0.0303 | 3.8796 | 0.1754 | 22.4584 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0901 | 11.5387 | 0.0097 | 1.2420 | 0.0906 | 11.6005 |
MICE | 0.1092 | 13.9853 | 0.0006 | 0.0768 | 0.1092 | 13.9821 |
FHDI | −0.2222 | 28.4482 | 0.0048 | 0.6146 | 0.2222 | 28.4507 |
GERBIL | 0.0255 | 3.2647 | 0.0342 | 4.3790 | 0.0426 | 5.4545 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.7235 | 92.6478 | 0.0158 | 2.0230 | 0.7237 | 92.6633 |
MICE | 0.3396 | 43.4898 | 0.0112 | 1.4341 | 0.3398 | 43.5083 |
FHDI | −0.0956 | 12.2474 | 0.0021 | 0.2689 | 0.0957 | 12.2535 |
GERBIL | 0.0399 | 5.1105 | 0.0669 | 8.5659 | 0.0779 | 9.9744 |
Missing Mechanism: MAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.8683 | 347.4789 | 0.1076 | 20.0118 | 1.8714 | 348.0487 |
MICE | 0.0738 | 13.7336 | 0.0002 | 0.0372 | 0.0738 | 13.7255 |
FHDI | 0.2546 | 47.3445 | 0.0227 | 4.2218 | 0.2556 | 47.5373 |
GERBIL | 0.0799 | 14.8515 | 0.1021 | 18.9889 | 0.1296 | 24.1034 |
Missing Mechanism: Small MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.2265 | 228.1141 | 0.037 | 6.8814 | 1.2271 | 228.2198 |
MICE | 0.4352 | 80.9333 | 0.0107 | 1.9900 | 0.4353 | 80.9584 |
FHDI | −0.2359 | 43.8748 | 0.0031 | 0.5765 | 0.2359 | 43.8734 |
GERBIL | 0.2013 | 37.4425 | 0.1089 | 20.2536 | 0.2289 | 42.5715 |
Missing Mechanism: Large MNAR |
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −1.3693 | 254.6583 | 0.0366 | 6.8070 | 1.3697 | 254.7410 |
MICE | −0.4281 | 79.617 | 0.0182 | 3.3849 | 0.4285 | 79.6937 |
FHDI | −0.5571 | 103.6038 | 0.0017 | 0.3162 | 0.5571 | 103.6112 |
GERBIL | −0.8072 | 150.1331 | 0.1505 | 27.9904 | 0.8212 | 152.7293 |
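The six columns reported in the tables above could be produced from repeated estimates of a quantity whose reference value is known. The Python sketch below shows one way to do that. It is an illustration under stated assumptions, not the authors' procedure: the function `summarize`, the use of the population standard deviation across replicates as SE, and the example numbers are all hypothetical. The relative bias is taken as an absolute value because the tables show positive RB (%) for rows with negative B.

```python
import math
import statistics


def summarize(estimates, truth):
    """Summarise replicate estimates against a known reference value.

    Assumed definitions (not confirmed by this excerpt):
      B    = mean(estimates) - truth
      SE   = population standard deviation of the estimates
      RMSE = sqrt(mean((estimate - truth)^2))
    Relative versions divide by |truth| and are expressed in percent.
    """
    if truth == 0:
        raise ValueError("relative metrics are undefined when truth is 0")
    bias = statistics.fmean(estimates) - truth
    se = statistics.pstdev(estimates)
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    scale = 100.0 / abs(truth)
    return {
        "B": bias,
        "RB (%)": abs(bias) * scale,
        "SE": se,
        "RSE (%)": se * scale,
        "RMSE": rmse,
        "RRMSE (%)": rmse * scale,
    }


# Hypothetical replicate estimates and reference value, for illustration only.
result = summarize([9.0, 11.0, 10.0, 12.0], truth=10.0)
for name, value in result.items():
    print(f"{name:10s} {value:.4f}")
```

With the population standard deviation, RMSE² equals B² + SE² exactly; a sample standard deviation (dividing by n − 1) would make the identity only approximate.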
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pan, S.; Chen, S. Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health. Int. J. Environ. Res. Public Health 2023, 20, 1524. https://doi.org/10.3390/ijerph20021524