Resampling Methods Improve the Predictive Power of Modeling in Class-Imbalanced Datasets

In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle this problem, resampling methods, including oversampling and undersampling, can be used. This paper aims to illustrate the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model for undiagnosed diabetes. Evidence of diabetes was determined according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded better classification power than the CART fitted on the original training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees.
To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.


Introduction
In the medical field, many outcome variables are dichotomized (or binary), for example survival status, an indicator of a particular disease, or adequacy of a particular nutrient. The two possible values of a dichotomized variable are referred to as classes. A dataset with a dichotomized outcome is class-imbalanced if it consists mostly of one class. Class imbalance is a common phenomenon in the medical context; analyses of rare diseases or of mortality within a short follow-up period usually suffer from class-imbalance problems. Class-imbalanced datasets are challenging to analyze, as the performance of common classification models (for example decision trees) on such datasets tends to be suboptimal [1]. These models aim to improve the overall accuracy, so they focus on correctly classifying the large class at the cost of ignoring misclassification of the small class. However, the actual cost of misclassifying the small class (false negative) may be much higher than that of misclassifying the large class (false positive).
To tackle such a problem arising from class-imbalance, a classic solution is to use a case-control study design [2,3]. In adopting a case-control study design, researchers first draw samples of the cases (supposing that the case is a rare event), then samples of the controls are drawn according to the collected samples of the cases. The advantage of a case-control study over a cohort study is that the case-control study does not require a large sample size and long follow-up period to accumulate a reasonable number of rare disease patients.
However, a dataset obtained using a case-control study design is only suitable for estimating the relative risk or odds ratio of several exposures on the particular disease. If the study objective is to estimate the risk factors of more than one disease, cohort or cross-sectional study designs appear to be more appropriate. In studies using cohort and cross-sectional designs, problems arising from class imbalance can be tackled at the stage of statistical analysis using resampling methods [4]. Resampling methods include oversampling, i.e., oversampling the small class to a sample size comparable to that of the large class, and undersampling, i.e., randomly drawing samples from the large class with a sample size comparable to that of the small class. A lot of work has been done in the data mining literature on developing resampling methods [5], yet these techniques are rarely applied in the medical literature.
This paper aims to illustrate the effect of resampling methods in medical research, using the publicly available National Health and Nutrition Examination Survey (NHANES) wave 2009-2010 data. Using these data, we built several decision tree models to predict undiagnosed diabetes among adult participants. According to the Centers for Disease Control and Prevention, the prevalences of diagnosed and undiagnosed diabetes are 6.0% and 2.3%, respectively [6]. Given the large burden of diabetes to society [7], a huge effort has been dedicated to identifying undiagnosed diabetes to support better decision-making by health care providers. A recent systematic review showed that from 1997 to 2010 there were 15 published papers on developing prediction models to identify undiagnosed diabetes [8], but none of these addressed the problem of class imbalance. In the NHANES 2009-2010 data, the prevalence of undiagnosed diabetes among adults aged ≥20 without self-reported diabetes was 3.0% (shown below). Here, using the NHANES 2009-2010 data, we compare the predictive power of the decision tree models fitted on the full dataset and on the resampled datasets; we hypothesized that the predictive power would be improved when using the resampled datasets. We also consider unbalanced resampling, that is, resampling to a pre-determined ratio of the two classes. Resampling to different ratios has been studied, and these sometimes yield better predictive power than balanced resampling [4].

Ethics Statement
The NHANES study was approved by the Centers for Disease Control and Prevention ethics review board (Continuation of Protocol #2005-06). The NHANES also obtained consent from all participants.

Participants
This study utilized data collected from participants in the National Health and Nutrition Examination Survey (NHANES) wave 2009-2010. The NHANES, conducted by the National Center for Health Statistics, Centers for Disease Control and Prevention, was designed to assess the health and nutrition status in the United States [9]. The sample was representative of the United States population and was selected using a multi-stage probability cluster design. Participants were invited to complete a survey and a health examination; the details can be obtained from the NHANES website [10]. A total of 10,537 participants completed the survey. Those aged 19 and below, those with self-reported diabetes (that is, having a positive response to the question "Have you ever been told by a doctor or health professional that you have diabetes or sugar diabetes?"), and/or those without blood test results were excluded from this study, leaving a final sample of 4677.

Measurement
A blood test was conducted in a morning examination after a 9-hour fast to obtain the fasting glucose and hemoglobin A1c levels of the participants. In addition, a two-hour oral glucose tolerance test was conducted to obtain the non-fasting glucose level. A participant demonstrated evidence of diabetes if any of the following criteria was met: (a) fasting glucose ≥ 126 mg/dL, (b) non-fasting glucose ≥ 200 mg/dL, (c) hemoglobin A1c ≥ 6.5% (or 47.5 mmol/mol). BMI was calculated as weight (kg) divided by the square of height (m²). Family Poverty Index, determined by the eligibility of certain federal financial assistance programs, was computed according to the Department of Health and Human Services guidelines. The Family Poverty Income Ratio was computed by dividing the family income by the Family Poverty Index. Other exposure variables included age, race, marital status, and education level. To facilitate the use of decision tree models in non-clinical settings, relevant biomarkers, e.g., blood pressure or high-density lipoprotein, were not included as exposure variables.
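The outcome definition and the BMI formula above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the code used in the study; the function and argument names are hypothetical, and treating a missing measurement (None) as "criterion not met" is an assumption made here for the sketch.

```python
from typing import Optional


def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) divided by the square of height (m)."""
    return weight_kg / height_m ** 2


def has_diabetes_evidence(
    fasting_glucose_mg_dl: Optional[float],
    ogtt_glucose_mg_dl: Optional[float],
    hba1c_percent: Optional[float],
) -> bool:
    """Return True if any of criteria (a)-(c) in the text is met.

    Assumption of this sketch: a missing value (None) is skipped
    rather than treated as meeting or failing a criterion.
    """
    checks = [
        (fasting_glucose_mg_dl, 126.0),  # (a) fasting glucose >= 126 mg/dL
        (ogtt_glucose_mg_dl, 200.0),     # (b) two-hour glucose >= 200 mg/dL
        (hba1c_percent, 6.5),            # (c) hemoglobin A1c >= 6.5%
    ]
    return any(value is not None and value >= cutoff for value, cutoff in checks)


if __name__ == "__main__":
    # Hypothetical participant: only the A1c criterion is met.
    print(has_diabetes_evidence(110.0, 150.0, 6.7))
    print(round(body_mass_index(70.0, 1.75), 1))
```

Because the three criteria are combined with "any", a participant needs to exceed only one threshold to be counted as a case.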

Statistical Analysis
The Classification and Regression Tree (CART) procedure [11] was used to build a classification model for undiagnosed diabetes. CART is a recursive partitioning procedure that aims to split the data into distinct partitions based on the most important exposure variables determined by the procedure. A split on a partition is carried out to maximize the purity, that is, the dominance of one class, of its descendant partitions. In this study, the purity of a partition is measured by the Gini impurity, which equals $1 - p_1^2 - p_2^2$, where $p_1$ and $p_2$ are the proportions of classes 1 and 2 respectively. The model is called a tree model because the partitions can be arranged in a tree-like structure, as shown in Figures 1-8. The CART was fitted using package rpart of R, with a complexity parameter of 0.01 and a minimum partition size of 20. We also fitted a CART model with the complexity parameter determined by the 1-SE rule (a standard, accepted method for complexity parameter determination) [9], a random forest model using package randomForest of R with 500 trees, and generalized boosted trees using package gbm of R with 100 trees. Random forests and generalized boosted trees are extensions of CART models that construct multiple decision trees to improve prediction accuracy.
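For two classes, the Gini impurity of a partition can be written as 1 - p1^2 - p2^2, and a candidate split is scored by how much it lowers the size-weighted impurity of the descendant partitions. The following Python sketch illustrates that calculation only; it is not the rpart implementation, the function names are hypothetical, and the split counts in the usage example are invented for illustration.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (number in class 1, number in class 2)


def gini_impurity(counts: Counts) -> float:
    """Gini impurity 1 - p1^2 - p2^2 of a partition with the given class counts."""
    total = counts[0] + counts[1]
    if total == 0:
        return 0.0
    p1 = counts[0] / total
    p2 = counts[1] / total
    return 1.0 - p1 ** 2 - p2 ** 2


def impurity_decrease(parent: Counts, left: Counts, right: Counts) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n_parent = parent[0] + parent[1]
    n_left = left[0] + left[1]
    n_right = right[0] + right[1]
    weighted_children = (
        n_left / n_parent * gini_impurity(left)
        + n_right / n_parent * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


if __name__ == "__main__":
    # Class counts of the training dataset reported in this paper: 99 vs 3165.
    print(round(gini_impurity((99, 3165)), 4))
    # A balanced partition has the maximum two-class impurity.
    print(gini_impurity((50, 50)))
    # Hypothetical split of a balanced partition (invented counts).
    print(round(impurity_decrease((50, 50), (40, 10), (10, 40)), 4))
```

With a small class of about 3%, the impurity of the unsplit data is already close to zero, so any single split can lower it only slightly; this is one way to see why a tree fitted on imbalanced data may stop splitting early.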
Decision tree models were built with demographic characteristics as predictors, including age, sex, education, race, BMI, and Family Poverty Income Ratio. Demographic characteristics have been found to be predictive of undiagnosed diabetes [8]. To assess the classification power of the decision tree models, the models were fitted using a randomly selected 70% of the data (n = 3264, the training dataset), and the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and classification rate were computed using the remaining 30% of the sample (n = 1413, of whom 40 had undiagnosed diabetes) for evaluation (the testing dataset). AUC is the most commonly used indicator for model comparison [12], and a value of 0.70 or above was considered a good fit [13].
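The evaluation measures listed above can be computed from the true labels and the model output on the testing dataset. The sketch below is a plain-Python illustration with hypothetical function names; it assumes cases are coded 1 and controls 0, and it computes the AUC as the proportion of case-control pairs in which the case receives the higher predicted score (ties counted as one half), which is a standard equivalent of the area under the ROC curve rather than the specific routine used in the study.

```python
from typing import Dict, Sequence


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division; returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV, and classification rate (1 = case, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "classification_rate": _ratio(tp + tn, len(pairs)),
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Proportion of case-control pairs ranked correctly; ties count one half."""
    case_scores = [s for t, s in zip(y_true, scores) if t == 1]
    control_scores = [s for t, s in zip(y_true, scores) if t == 0]
    if not case_scores or not control_scores:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Tiny invented example, not data from the study.
    labels = [1, 1, 0, 0, 0]
    predicted = [1, 0, 1, 0, 0]
    predicted_scores = [0.9, 0.4, 0.5, 0.2, 0.4]
    print(classification_metrics(labels, predicted))
    print(auc(labels, predicted_scores))
```

The pairwise AUC loop is quadratic in the sample size, which is acceptable for a testing dataset of this scale but would be replaced by a rank-based calculation for much larger data.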
Among the 3264 participants in the training dataset, 3165 did not have any evidence of diabetes and the remaining 99 had diabetes. The CART was fitted using eight datasets: (a1) the training dataset (n = 3264), (a2) the weighted training dataset with a diabetes to non-diabetes participant ratio of 1:1 (n = 3264), (b1) the oversampled training dataset with case-to-control ratio of 1:1, which combined 3165 diabetes participants randomly oversampled from the original 99 diabetes participants with the 3165 participants without diabetes (n = 6330), (b2) the oversampled training dataset with case-to-control ratio of 1:2 (n = 4748), (b3) the oversampled training dataset with case-to-control ratio of 1:4 (n = 3957), (c1) the undersampled training dataset with case-to-control ratio of 1:1, which combined 99 participants without diabetes randomly selected out of the 3165 with the 99 diabetes participants (n = 198), (c2) the undersampled training dataset with case-to-control ratio of 1:2 (n = 297), and (c3) the undersampled training dataset with case-to-control ratio of 1:4 (n = 495). The alternative method of modifying the loss matrix of CART, which adjusts the weightings on the false positive rate and false negative rate, was not adopted here as it had no effect on the tree built.
Table 1 shows the descriptive statistics of the training and testing datasets. There were no differences between the training and testing datasets for any exposure variable (all p > 0.05). There was no difference (χ² = 0.14, p = 0.71) in the prevalence of diabetes between the training dataset (n = 99, 3.0%) and the testing dataset (n = 40, 2.8%). Among participants in the training dataset, those with undiagnosed diabetes were more likely to be aged 50 or above, to be Mexican American, to have less than nine years of education, to have a lower Family Poverty Income Ratio, and to have a higher BMI.
Note to Table 1: */**/*** χ² test between no diabetes and diabetes in the training dataset significant at the 5%/1%/0.1% level. All χ² tests between the training dataset and the testing dataset were non-significant at the 5% level.
Figures 1 to 8 show the decision tree models fitted using the CART algorithm on the full, weighted, oversampled (case-to-control ratios 1:1, 1:2, and 1:4), and undersampled (case-to-control ratios 1:1, 1:2, and 1:4) training datasets respectively. These trees have 3, 13, 12, 14, 15, 15, 9, and 12 partitions respectively.
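The random oversampling and undersampling used to construct the resampled training datasets can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the `controls_per_case` parameter, and the fixed seed are hypothetical, and rounding the number of oversampled cases up to the next integer is an assumption chosen because it reproduces the dataset sizes reported in this section.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def oversample(cases: Sequence[T], controls: Sequence[T],
               controls_per_case: int = 1, seed: int = 0) -> List[T]:
    """Keep every control; draw cases with replacement up to a 1:k case-to-control ratio."""
    rng = random.Random(seed)
    # Assumption: round up so that, e.g., 3165 controls at 1:2 give 1583 cases.
    n_cases = math.ceil(len(controls) / controls_per_case)
    return rng.choices(list(cases), k=n_cases) + list(controls)


def undersample(cases: Sequence[T], controls: Sequence[T],
                controls_per_case: int = 1, seed: int = 0) -> List[T]:
    """Keep every case; draw k controls per case without replacement."""
    rng = random.Random(seed)
    n_controls = min(len(controls), controls_per_case * len(cases))
    return list(cases) + rng.sample(list(controls), n_controls)


if __name__ == "__main__":
    # Placeholder records with the class sizes of the training dataset (99 vs 3165).
    cases = [("case", i) for i in range(99)]
    controls = [("control", i) for i in range(3165)]
    for k in (1, 2, 4):
        print(k, len(oversample(cases, controls, k)), len(undersample(cases, controls, k)))
```

Resampling is applied to the training dataset only; the testing dataset keeps its original class distribution so that the evaluation reflects the real prevalence.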

Results
The tree fitted on the full training data included two exposure variables (BMI and age), whereas the other trees included four to six exposure variables. The tree model fitted on the full training dataset was clearly an underfit. In fact, the tree model fitted on the oversampled training dataset was an extension of that fitted on the full training dataset, with the partitions "BMI < 30.96" and "BMI ≥ 30.96 and Age < 50" further split into nine and five partitions respectively. The decision trees across different case-to-control ratios were similar. In the decision trees fitted on the oversampled and undersampled training datasets with case-to-control ratio of 1:1, the partitions having the highest proportion of diabetes had similar characteristics (oversampled: BMI ≥ 35.47 and Age < 50 and Race = other; undersampled: 30.96 > BMI ≥ 27.29 and Age < 60 and Race = other and Family Poverty Income Ratio < 0.5). The decision tree fitted on the weighted training dataset was nearly the same as that fitted on the oversampled training dataset with case-to-control ratio of 1:1.
Table 2 shows the classification performance of all decision trees. While the NPVs were similar across all decision trees, the classification rates of the trees fitted on the full training data and the weighted dataset were substantially smaller than those of the trees fitted on the resampled training datasets. The trees fitted on the oversampled and undersampled training data with case-to-control ratio of 1:1 both yielded a good fit with an AUC above 0.70; however, the trees fitted on the full training data, the weighted training data, and the resampled training data with case-to-control ratios of 1:2 and 1:4 did not yield a good fit, with AUCs of 0.63 to 0.69. There was a clear trend that the AUC decreased as the case-to-control ratio moved from 1:1 to 1:4, for both the oversampled and the undersampled training datasets. Figure 9 shows the receiver operating characteristic curves of all tree models.
The classification performance of the decision trees with complexity parameter determined by the 1-SE rule, the random forest models, and the generalized boosted trees can be found in Tables 3-5 respectively. Under the 1-SE rule, the decision tree fitted on the full training dataset showed a poor AUC as it had no splits at all; only the trees fitted on the weighted training dataset and the undersampled training datasets demonstrated a good fit. Beyond CART models, resampling methods could also improve the classification power of random forest models and generalized boosted trees. For random forest models, only those fitted on the undersampled training datasets demonstrated a good fit (Table 4). For generalized boosted trees, those fitted on the full training dataset, the weighted training dataset, and the undersampled training datasets demonstrated a good fit, and the undersampled dataset with case-to-control ratio 1:1 yielded the best classification power, although the specificity was zero (Table 5).
Comparing the classification performance of the different types of models (Tables 2-5), we can see that undersampling could improve the classification power of all models. Most of the models fitted using the undersampled datasets achieved an AUC above 0.70, and in general the case-to-control ratio of 1:1 performed the best.

Discussion
Illustrated with a dataset in which only 3.0% of the participants were classified as having undiagnosed diabetes, our results showed that applying resampling methods to a class-imbalanced dataset clearly improved the explanatory power of the decision tree models, random forests, and generalized boosted trees. With an illustration from a public health perspective, a systematic comparison between the standard method of analysis and those based on resampled data showed that resampling could improve the overall classification rate and positive predictive value. Besides CART, the performance of other extended tree models, including random forests and generalized boosted trees, could also be improved using resampling. The decision tree fitted on the full training dataset clearly underfitted the data, and this could be explained as follows. By comparing the trees fitted on the full training dataset and the oversampled dataset with case-to-control ratio of 1:1, we can see that the partition "BMI < 30.96" was not split further in the former model but continued to be split in the latter model. This is because the reduction in Gini impurity for the split "Age < 60 vs. Age ≥ 60" is minimal (from $1 - (31/2259)^2$ …).
As the use of automated classifiers such as decision trees, support vector machines (see [14] for example), and artificial neural networks (see [15] for example) is becoming much more popular in the medical literature, the use of resampling methods should be promoted, as it applies to all such statistical models that aim to maximize accuracy [16]. (Re)analysis of previously published data using resampling methods is warranted given the potentially suboptimal results of existing analyses.
The most commonly applied statistical model for predicting undiagnosed diabetes is logistic regression [8]. Although there is no evidence that resampling methods improve the predictive power of logistic regression or of any class of generalized linear models, applying logistic regression to a class-imbalanced dataset may sometimes be inappropriate, especially when there are a large number of exposure variables. Generalized linear models require as many as 10 to 20 cases of both classes per exposure variable [17,18]; if such models were applied to our example, with 99 diabetes cases in the training dataset, only 5 to 10 exposure variables would be allowed. This is obviously too strict a criterion for predictive models of undiagnosed diabetes given its multi-factorial nature [8]. Therefore, given such an imbalanced dataset, modeling using automated classifiers appears to be the only appropriate choice, and the dataset should be resampled.
In this study, we considered both balanced and unbalanced resampling. However, resampling methods that combine oversampling and undersampling [4] were not examined. Besides these extensions of resampling methods, researchers have also developed non-random resampling methods to reduce the potential bias and overfitting caused by random resampling [19,20]. However, these advanced resampling methods have rarely been applied in the medical literature. Given the effectiveness of the balanced resampling methods shown in this study, the effectiveness of these advanced resampling methods is worth exploring.
Our study was not without limitations. First, biomarkers that have been found to be associated with undiagnosed diabetes, for example blood pressure, high-density lipoprotein, C-reactive protein, triglyceride, and white blood cell count [8], were not included in the decision tree models, in order to facilitate the use of the models in non-clinical settings. Whether resampling methods improve models that include these biomarkers is unknown; however, we believe that resampling should be effective as well. Second, only decision tree models and their extensions were employed. Unlike other automatic classifiers, decision tree models are feasible to administer in a clinical setting to predict undiagnosed diabetes, despite their underperformance compared with other classifiers such as ensemble methods [21]. Further research on the effectiveness of resampling methods with support vector machines and artificial neural networks is warranted. Note that oversampling introduces dependence into the data; therefore, using traditional regression models, which assume data independence, on oversampled data may not be appropriate.

Conclusions
Our results showed that applying resampling methods to a class-imbalanced dataset clearly improved the classification power of the CART model, random forests, and generalized boosted trees. Data analyses that aim to maximize accuracy should apply resampling methods.

Conflicts of Interest
The author declares no conflict of interest.