Open Access Article
Int. J. Environ. Res. Public Health 2014, 11(9), 9776-9789

Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets

School of Nursing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Received: 20 June 2014 / Revised: 4 September 2014 / Accepted: 12 September 2014 / Published: 18 September 2014
(This article belongs to the Special Issue Methodological Innovations and Reflections-1)

In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle this problem, resampling methods, including oversampling and undersampling, can be used. This paper illustrates the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) 2009–2010 wave. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model for undiagnosed diabetes, defined as evidence of diabetes according to the WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled (AUC = 0.74) training data yielded better classification power than the CART fitted on the original training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees.
To conclude, applying resampling methods to a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
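The resampling step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: the function names (`undersample`, `oversample`, `auc`) are hypothetical, the data are synthetic rather than NHANES, and no CART model is fitted. It shows only how a training set might be resampled to a 1:1, 1:2, or 1:4 case-to-control ratio after a 70/30 split, and how AUC can be computed from scores on the untouched testing set.

```python
import random


def undersample(cases, controls, ratio, rng):
    """Keep every case; draw controls without replacement to reach 1:ratio."""
    n_controls = min(len(controls), len(cases) * ratio)
    return cases + rng.sample(controls, n_controls)


def oversample(cases, controls, ratio, rng):
    """Keep every control; draw extra cases with replacement to reach 1:ratio."""
    n_cases = max(len(cases), len(controls) // ratio)
    extra = [rng.choice(cases) for _ in range(n_cases - len(cases))]
    return cases + extra + controls


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, class-imbalanced data (assumption): 50 cases among 1000 rows.
# Each row is a (feature, label) tuple; cases have a shifted feature mean.
rng = random.Random(0)
data = [(rng.gauss(1.0 if i < 50 else 0.0, 1.0), 1 if i < 50 else 0)
        for i in range(1000)]
rng.shuffle(data)

# 70% training, 30% testing; only the training part is ever resampled.
cut = int(0.7 * len(data))
train, test = data[:cut], data[cut:]
train_cases = [row for row in train if row[1] == 1]
train_controls = [row for row in train if row[1] == 0]

for ratio in (1, 2, 4):
    under = undersample(train_cases, train_controls, ratio, rng)
    over = oversample(train_cases, train_controls, ratio, rng)
    print(f"1:{ratio}  undersampled n={len(under)}  oversampled n={len(over)}")
```

In a fuller pipeline, a classifier would be fitted on each resampled training set and `auc` would be applied to its predicted scores on `test`, which keeps the evaluation on the original class distribution.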
Keywords: automated classifier; data mining; decision tree; oversampling; predictive power; rare events

Figure 1

This is an open access article distributed under the Creative Commons Attribution License (CC BY 3.0).

MDPI and ACS Style

Lee, P.H. Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets. Int. J. Environ. Res. Public Health 2014, 11, 9776-9789.

Int. J. Environ. Res. Public Health, EISSN 1660-4601, published by MDPI AG, Basel, Switzerland