Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health
Abstract
1. Introduction
2. Materials and Methods
3. Results
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.7191 | 1.5332 | 0.0076 | 0.0162 | 0.7191 | 1.5332 |
MICE: pmm | −0.2458 | 0.5242 | 0.0131 | 0.0279 | 0.2462 | 0.5249 |
FHDI | −0.9888 | 2.1082 | 0.0072 | 0.0154 | 0.9888 | 2.1083 |
GERBIL | −0.353 | 0.7527 | 0.0536 | 0.1143 | 0.3571 | 0.7614 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.2694 | 2.7066 | 0.007 | 0.0149 | 1.2694 | 2.7066 |
MICE: pmm | 0.1534 | 0.3271 | 0.008 | 0.0171 | 0.1536 | 0.3275 |
FHDI | −1.0655 | 2.2718 | 0.0218 | 0.0465 | 1.0657 | 2.2722 |
GERBIL | 0.052 | 0.1108 | 0.0528 | 0.1126 | 0.0741 | 0.1580 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 8.2251 | 17.5373 | 0.0161 | 0.0343 | 8.2252 | 17.5374 |
MICE: pmm | 5.192 | 11.07 | 0.0121 | 0.0258 | 5.192 | 11.0701 |
FHDI | 7.0293 | 14.9875 | 0.0099 | 0.0211 | 7.0293 | 14.9875 |
GERBIL | 5.4701 | 11.6631 | 0.045 | 0.0959 | 5.4703 | 11.6635 |
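The metric definitions are not restated alongside these tables. As a rough reading aid, the sketch below checks one reported row against the conventional decomposition RMSE = sqrt(B² + SE²) and against the idea that the relative columns divide by a common reference value. Both are assumptions made here, not definitions taken from the source, and the helper names `rmse_from_bias_and_se` and `relative_percent` are hypothetical.

```python
import math


def rmse_from_bias_and_se(bias: float, se: float) -> float:
    """Assumed decomposition: RMSE = sqrt(B^2 + SE^2)."""
    return math.sqrt(bias ** 2 + se ** 2)


def relative_percent(value: float, reference: float) -> float:
    """Express |value| as a percentage of |reference|."""
    return 100.0 * abs(value) / abs(reference)


# GERBIL row, Small MNAR block of the table above, as reported.
bias, rb_percent, se = 0.052, 0.1108, 0.0528
reported_rmse, reported_rrmse = 0.0741, 0.1580

rmse = rmse_from_bias_and_se(bias, se)

# Reference value implied by B and RB (%); inferred here, not stated in the source.
implied_reference = 100.0 * abs(bias) / rb_percent
rrmse = relative_percent(rmse, implied_reference)

# Reported values are rounded, so compare with a small tolerance.
print(math.isclose(rmse, reported_rmse, abs_tol=5e-4))
print(math.isclose(rrmse, reported_rrmse, abs_tol=5e-4))
```

Because the tabulated figures are rounded to four decimals, agreement is only expected up to a small tolerance. The same check can be repeated on any other row by swapping in its B, RB (%), SE, RMSE, and RRMSE (%) values.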
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.1366 | 5.7022 | 0.002 | 0.0835 | 0.1366 | 5.7012 |
MICE: pmm | 0.1013 | 4.2285 | 0.0003 | 0.0125 | 0.1013 | 4.2279 |
FHDI | 0.1415 | 5.9063 | 0.0076 | 0.3172 | 0.1417 | 5.9140 |
GERBIL | 0.115 | 4.7998 | 0.0208 | 0.8681 | 0.1169 | 4.8790 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.2539 | 10.5992 | 0.0016 | 0.0668 | 0.2539 | 10.5968 |
MICE: pmm | 0.2297 | 9.5869 | 0.0018 | 0.0751 | 0.2297 | 9.5868 |
MICE: logreg + pmm | 0.2175 | 9.0798 | 0.0018 | 0.0751 | 0.2175 | 9.0776 |
FHDI | 0.2101 | 8.7674 | 0.0103 | 0.4299 | 0.2103 | 8.7771 |
GERBIL: 1-step | 0.2199 | 9.1783 | 0.0219 | 0.9140 | 0.221 | 9.2237 |
GERBIL: 2-step | 0.1084 | 4.5252 | 0.0203 | 0.8472 | 0.1103 | 4.6035 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.9304 | 38.8318 | 0.0004 | 0.0167 | 0.9304 | 38.8314 |
MICE: pmm | 0.7432 | 31.0186 | 0.0002 | 0.0083 | 0.7432 | 31.0184 |
MICE: logreg + pmm | 0.728 | 30.3866 | 0.0002 | 0.0083 | 0.728 | 30.3840 |
FHDI | 0.5941 | 24.7963 | 0.0036 | 0.1503 | 0.5941 | 24.7955 |
GERBIL: 1-step | 0.8023 | 33.487 | 0.0212 | 0.8848 | 0.8026 | 33.4975 |
GERBIL: 2-step | 0.6896 | 28.7811 | 0.0202 | 0.8431 | 0.6899 | 28.7938 |
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.5803 | 15.1983 | 0.0039 | 0.1021 | 0.5803 | 15.1979 |
MICE: pmm | 0.1707 | 4.4694 | 0.0008 | 0.0210 | 0.1707 | 4.4706 |
FHDI | −0.1661 | 4.3493 | 0.0028 | 0.0733 | 0.1661 | 4.3501 |
GERBIL | −0.0723 | 1.8937 | 0.0222 | 0.5814 | 0.0756 | 1.9799 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.6638 | 17.3859 | 0.0041 | 0.1074 | 0.6639 | 17.3873 |
MICE: pmm | 0.1942 | 5.087 | 0.0031 | 0.0812 | 0.1943 | 5.0887 |
FHDI | −0.1376 | 3.6025 | 0.0085 | 0.2226 | 0.1378 | 3.6089 |
GERBIL | 0.0009 | 0.0246 | 0.0238 | 0.6233 | 0.0239 | 0.6259 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.655 | 43.3448 | 0 | 0.0000 | 1.655 | 43.3439 |
MICE: pmm | 1.382 | 36.1944 | 0.0026 | 0.0681 | 1.382 | 36.1941 |
FHDI | 1.0198 | 26.7078 | 0.0094 | 0.2462 | 1.0198 | 26.7082 |
GERBIL | 1.099 | 28.7826 | 0.0328 | 0.8590 | 1.0995 | 28.7955 |
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0699 | 17.3139 | 0.0002 | 0.0495 | 0.0699 | 17.3114 |
MICE: logreg | 0.0083 | 2.0617 | 0.0004 | 0.0991 | 0.0083 | 2.0556 |
FHDI | 0.0623 | 15.4393 | 0.0021 | 0.5201 | 0.0624 | 15.4539 |
GERBIL | 0.0051 | 1.2713 | 0.0032 | 0.7925 | 0.006 | 1.4860 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0753 | 18.6511 | 0.0002 | 0.0495 | 0.0753 | 18.6487 |
MICE: logreg | 0.002 | 0.5003 | 0.0003 | 0.0743 | 0.002 | 0.4953 |
FHDI | 0.0112 | 2.7679 | 0.0004 | 0.0991 | 0.0112 | 2.7738 |
GERBIL | −0.0054 | 1.3342 | 0.0033 | 0.8173 | 0.0063 | 1.5603 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.2603 | 64.474 | 0.0004 | 0.0991 | 0.2603 | 64.4657 |
MICE: logreg | −0.2229 | 55.2122 | 0.0005 | 0.1238 | 0.2229 | 55.2032 |
FHDI | −0.2499 | 61.8784 | 0.0003 | 0.0743 | 0.2499 | 61.8900 |
GERBIL | −0.2354 | 58.2945 | 0.003 | 0.7430 | 0.2354 | 58.2989 |
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.4639 | 59.406 | 0.0046 | 0.5890 | 0.464 | 59.4110 |
MICE | −0.2066 | 26.4567 | 0.0063 | 0.8067 | 0.2067 | 26.4661 |
FHDI | −0.3933 | 50.3674 | 0.0001 | 0.0128 | 0.3933 | 50.3585 |
GERBIL | −0.1727 | 22.1189 | 0.0303 | 3.8796 | 0.1754 | 22.4584 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −0.0901 | 11.5387 | 0.0097 | 1.2420 | 0.0906 | 11.6005 |
MICE | 0.1092 | 13.9853 | 0.0006 | 0.0768 | 0.1092 | 13.9821 |
FHDI | −0.2222 | 28.4482 | 0.0048 | 0.6146 | 0.2222 | 28.4507 |
GERBIL | 0.0255 | 3.2647 | 0.0342 | 4.3790 | 0.0426 | 5.4545 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 0.7235 | 92.6478 | 0.0158 | 2.0230 | 0.7237 | 92.6633 |
MICE | 0.3396 | 43.4898 | 0.0112 | 1.4341 | 0.3398 | 43.5083 |
FHDI | −0.0956 | 12.2474 | 0.0021 | 0.2689 | 0.0957 | 12.2535 |
GERBIL | 0.0399 | 5.1105 | 0.0669 | 8.5659 | 0.0779 | 9.9744 |
Missing Mechanism: MAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.8683 | 347.4789 | 0.1076 | 20.0118 | 1.8714 | 348.0487 |
MICE | 0.0738 | 13.7336 | 0.0002 | 0.0372 | 0.0738 | 13.7255 |
FHDI | 0.2546 | 47.3445 | 0.0227 | 4.2218 | 0.2556 | 47.5373 |
GERBIL | 0.0799 | 14.8515 | 0.1021 | 18.9889 | 0.1296 | 24.1034 |
Missing Mechanism: Small MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | 1.2265 | 228.1141 | 0.037 | 6.8814 | 1.2271 | 228.2198 |
MICE | 0.4352 | 80.9333 | 0.0107 | 1.9900 | 0.4353 | 80.9584 |
FHDI | −0.2359 | 43.8748 | 0.0031 | 0.5765 | 0.2359 | 43.8734 |
GERBIL | 0.2013 | 37.4425 | 0.1089 | 20.2536 | 0.2289 | 42.5715 |
Missing Mechanism: Large MNAR
Imputation Method | B | RB (%) | SE | RSE (%) | RMSE | RRMSE (%) |
Complete-case | −1.3693 | 254.6583 | 0.0366 | 6.8070 | 1.3697 | 254.7410 |
MICE | −0.4281 | 79.617 | 0.0182 | 3.3849 | 0.4285 | 79.6937 |
FHDI | −0.5571 | 103.6038 | 0.0017 | 0.3162 | 0.5571 | 103.6112 |
GERBIL | −0.8072 | 150.1331 | 0.1505 | 27.9904 | 0.8212 | 152.7293 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pan, S.; Chen, S. Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health. Int. J. Environ. Res. Public Health 2023, 20, 1524. https://doi.org/10.3390/ijerph20021524