The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data

Abstract: Electronic Health Records (EHR) provide a vast amount of patient data relevant to predicting clinical outcomes, but the inherent presence of missing values poses challenges to building performant machine learning models. This paper investigates the effect of various imputation methods on the National Institutes of Health's All of Us dataset, a dataset containing a high degree of data missingness. We apply several imputation techniques, such as mean substitution, constant filling, and multiple imputation, to the same dataset for the task of diabetes prediction. We find that imputing values causes heteroskedastic performance in machine learning models as data missingness increases. That is, the more missing values a patient has among their tests, the higher the variance in a diabetes model's AUROC, F1, precision, recall, and accuracy scores. This poses a critical challenge to using EHR data for predictive modeling and highlights the need for future research on methodologies that mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models.


Introduction
Diabetes is a health condition characterized by chronic hyperglycemia resulting from issues with insulin secretion and action [1]. The onset of diabetes increases the risk of a number of health complications such as cardiovascular disease, kidney disease, retinopathy, and neuropathy [2,3]. The longer one has diabetes, the more complications are likely to occur [4]. Diabetes affects 464 million people in the world as of 2021, and this number is predicted to increase to 638 million by 2045 [5]. Diabetes disproportionately affects minority populations [4,6].
Diabetes has also been studied using machine learning [7-9]. Oikonomou et al. [10] provide a comprehensive overview of how machine learning has been applied to precision diabetes care, particularly in cardiovascular risk prediction among diabetic patients. Their work underscores the significant potential of machine learning to transform diabetes care by leveraging large datasets to identify risk factors and predict outcomes with high accuracy.
In recent years, the application of machine learning to electronic health records (EHR) has emerged as a promising tool for enhancing our understanding of diabetes and improving prediction models for its management. The integration of machine learning with EHR data offers a new frontier in diabetes research. Prior studies have shown that machine learning models can effectively predict the progression to pre-diabetes and type 2 diabetes using EHR data, emphasizing the role of established risk factors and identifying novel factors for further research [11]. Cahn et al. highlighted the use of machine learning models to improve the prediction of incident diabetes utilizing patient data from EHR, underscoring the potential for targeted interventions [12]. Additionally, leveraging large health records datasets has enabled significant progress in diabetes forecasting using machine learning, as demonstrated by research conducted using the health records of patients in Ontario, Canada [13]. This approach not only offers predictive insights, but also helps identify critical features contributing to diabetes onset.
Building upon this line of research, we study the prediction of diabetes using EHR data from the National Institutes of Health (NIH)'s All of Us (AoU) dataset. The program is a result of the Precision Medicine Initiative Cohort Program [14]. The cohort consists of over 1 million volunteers who contributed their biospecimen samples (such as blood and urine), physical measurements, and extensive surveys on health and lifestyle [15]. The overarching goal of All of Us is to advance precision medicine: a personalized approach to disease prevention and treatment that considers individual differences in lifestyle, environment, and biology. This approach is intended to overcome the limitations of a one-size-fits-all model in health care by factoring in individual variation. The All of Us Research Program stands out for its commitment to diversity, striving to include participants from various racial and ethnic backgrounds, age groups, geographic regions, and health statuses to ensure the dataset reflects the broad diversity of the U.S. population [16]. By harnessing the power of big data and emphasizing inclusivity and participant engagement, the All of Us Research Program aspires to revolutionize our understanding of health and pave the way for more effective, personalized healthcare solutions.
We focus, in particular, on measuring the effect of data missingness on the prediction of health outcomes such as diabetes using All of Us data. We apply several data imputation techniques and measure their effect on various model performance metrics. This characterization of data missingness on large EHR datasets can inform future efforts that apply imputation strategies to such data.

Dataset
We used the National Institutes of Health (NIH) All of Us dataset. We selected 47 features from Abegaz et al.'s work [17]. We list them in Table 1 alongside the proportion of missing values per feature. For each measurement type, we created two features: one for the average reading and another for the number of times the feature is read.
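As an illustration of this feature construction, the following sketch collapses a patient's repeated readings into an average and a count per measurement type. The measurement names are hypothetical placeholders, not the actual All of Us concept names (those are listed in Table 1).

```python
from collections import defaultdict

def build_measurement_features(readings):
    """Collapse repeated readings into two features per measurement
    type: the average value and the number of times it was read."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for name, value in readings:
        sums[name] += value
        counts[name] += 1
    features = {}
    for name in sums:
        features[f"{name}_avg"] = sums[name] / counts[name]
        features[f"{name}_count"] = counts[name]
    return features

# Hypothetical patient with two glucose readings and one HDL reading.
patient = [("glucose", 90.0), ("glucose", 110.0), ("hdl", 55.0)]
feats = build_measurement_features(patient)
# feats == {"glucose_avg": 100.0, "glucose_count": 2,
#           "hdl_avg": 55.0, "hdl_count": 1}
```

A measurement type with no readings simply produces no features, which is where the missing values discussed below come from.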

Modeling
To increase uniformity for ease of comparison while maintaining a robust search for well-performing models, we employed Autosklearn 2.0 [18,19]. This meta-model has a search space consisting of every model within Scikit-Learn and subsequently searches over the hyperparameter space of each model. The training is conducted with four CPUs, 26 GB of RAM, 3 h of training time, 6572 MB of memory per job, log loss as the objective function, and no limit on the number of models on disk.
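For reference, a configuration matching the resources above might look as follows in Autosklearn's API. This is a sketch based on our reading of the Autosklearn 2.0 interface, not an excerpt of our training script; it requires the auto-sklearn package to run.

```python
# Configuration sketch only: parameter names follow the auto-sklearn 2.0
# API; the numeric values mirror the resources stated in the text.
import autosklearn.metrics
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

automl = AutoSklearn2Classifier(
    time_left_for_this_task=3 * 60 * 60,  # 3 h of total search time
    memory_limit=6572,                    # MB of memory per job
    n_jobs=4,                             # four CPUs
    metric=autosklearn.metrics.log_loss,  # objective function
)
# automl.fit(X_train, y_train) then automl.predict(X_test)
```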
We compare the following six imputation methods alongside an oversampling preprocessing step.
No Imputation: This method involves not performing any imputation on the dataset, leaving the missing values as they are. In this approach, the model chosen must inherently be capable of handling missing data. Techniques such as decision trees or certain ensemble methods can often process datasets with missing values directly. This method is based on the assumption that the model can interpret and manage the missingness in the data without any explicit intervention.
Automatic Imputation (via Autosklearn): This approach employs Autosklearn, an automated machine learning tool, to determine the best imputation method for the dataset. Autosklearn explores various imputation strategies as part of its preprocessing pipeline and selects the one that optimizes model performance. This method leverages the power of automated machine learning to identify the most effective imputation technique, which could range from simple strategies like mean or median substitution to more complex ones, based on the characteristics of the data.
Constant Fill: In this approach, missing values are filled with a constant value. This constant could be a number outside the normal range of values (such as −1) to differentiate imputed values from real ones. The advantage of this method is its simplicity and the clear demarcation it provides, which can be helpful in certain analytical contexts.
Mean Substitution: Mean substitution involves replacing missing values in a dataset with the mean value of the respective column. This method assumes that the missing values are randomly distributed and that the mean is a representative statistic for the missing data. It is a straightforward approach but may not always be suitable, particularly in cases where the data distribution is skewed or the mean is not a good representation of the central tendency.
Median Substitution: Similar to mean substitution, median substitution replaces missing values with the median of the respective column. This method is particularly useful in datasets where the distribution is skewed or there are outliers, as the median is less affected by extreme values than the mean. It is a robust approach that can provide a better central tendency estimate in certain types of data distributions.
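The three single-value strategies above (constant fill, mean, and median substitution) can be sketched in a few lines of NumPy. This is an illustration of the techniques, not the exact preprocessing code used in our pipeline.

```python
import numpy as np

def impute(X, strategy="mean", fill_value=-1.0):
    """Column-wise imputation of NaNs: 'constant', 'mean', or 'median'."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        mask = np.isnan(col)
        if not mask.any():
            continue
        if strategy == "constant":
            col[mask] = fill_value            # out-of-range marker
        elif strategy == "mean":
            col[mask] = np.nanmean(col)       # mean of observed values
        elif strategy == "median":
            col[mask] = np.nanmedian(col)     # robust to outliers
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_mean = impute(X, "mean")        # NaNs become the column means, 2.0 and 6.0
X_const = impute(X, "constant")   # NaNs become -1.0
```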
Multiple Imputation with Bayesian Ridge: This is a more sophisticated approach in which multiple imputation is performed using Bayesian Ridge regression. Missing values are estimated based on observed data, with a Bayesian Ridge regression model used to predict them. Specifically, one begins by denoting one column of the training input f and the other columns X_f. A Bayesian Ridge regression model is then fitted on (X_f, f). This is conducted for every feature and can be repeated, so that in the next round the previous rounds' predictions can be used to make better predictions of the missing values. In this paper, we use 15 imputation rounds; this number is chosen arbitrarily. The higher the number, the more accurate the imputation should be, but for a dataset as large as All of Us, we chose to keep it low. This technique considers the uncertainty in the imputation process by creating several imputed datasets and combining the results, leading to more accurate and reliable imputation compared to single imputation methods.
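This round-based scheme corresponds to scikit-learn's IterativeImputer with a BayesianRidge estimator. The snippet below is an illustrative sketch (with toy data of our own), where max_iter=15 mirrors the 15 imputation rounds used in this paper.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

imputer = IterativeImputer(estimator=BayesianRidge(),
                           max_iter=15,        # 15 imputation rounds
                           random_state=0)

# Toy data: the second column is twice the first, with one missing entry.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, np.nan],
              [5.0, 10.0]])
X_filled = imputer.fit_transform(X)
# The missing entry is estimated near 8.0 from the learned linear relation.
```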
Each of these imputation methods has its strengths and weaknesses and is suitable for different types of datasets and missing data patterns. The choice of imputation method can significantly impact the performance of the subsequent analysis or machine learning models.
Random oversampling is a technique used to address class imbalance in a dataset, particularly in situations where the dataset has a disproportionate number of instances in different classes. This imbalance can lead to biased or inaccurate model performance, as the model may tend to favor the majority class.
In random oversampling, the idea is to balance the dataset by increasing the size of the underrepresented (minority) class. This is accomplished by randomly duplicating instances from the minority class until the number of instances in the minority and majority classes is approximately equal. This method creates additional samples from the minority class not by generating new samples but by resampling from the existing ones.
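A minimal sketch of random oversampling is shown below. Libraries such as imbalanced-learn provide an equivalent RandomOverSampler, but the logic is simple enough to show directly; the example data are ours.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until classes balance."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        cls_idx = np.flatnonzero(y == c)
        idx.append(cls_idx)                     # keep all original rows
        if n < n_max:                           # top up the minority class
            idx.append(rng.choice(cls_idx, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 0, 0, 1])      # 4-to-1 class imbalance
X_bal, y_bal = random_oversample(X, y)
# y_bal now contains four 0s and four 1s
```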
In total, there are 12 different models to test with the same underlying classifier.

Model Evaluation
Model performance is a catch-all term for the plethora of different metrics used to compare a model's predictions to the actual outcomes. We can summarize the comparison between a classification model's predictions and the actual classes in a confusion matrix.
We use the following abbreviations in the definitions of our performance metrics: TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives), the four counts of the confusion matrix. The corresponding normalized quantities are the true positive rate TPR = TP/(TP + FN), the true negative rate TNR = TN/(TN + FP), and the false positive rate FPR = FP/(FP + TN). We may now define four of the five metrics: precision = TP/(TP + FP), recall = TPR, F1 (the harmonic mean of precision and recall), and balanced accuracy = (TPR + TNR)/2. The final of the five metrics is computed from the probability output of a model. Given an input, a model produces a probability for the class, and a threshold is set such that inputs with a probability larger than the threshold are predicted to be members of the class. There are certain points on this curve that we know the values for.
If the threshold is set to 0, then the model predicts all inputs as positive; thus, the true positive rate is 1 and the false positive rate is 1. If the threshold is set to 1, then the model predicts all inputs as negative; thus, the true positive rate is 0 and the false positive rate is 0. This defines a curve in the space with coordinates (FPR, TPR), parameterized by the probability threshold, with endpoints (0, 0) and (1, 1). This curve is called the Receiver Operating Characteristic (ROC) curve, and its integral is called the Area Under the ROC (AUROC).
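The five metrics can be computed directly from labels and scores. The following self-contained sketch (not our evaluation code) implements the four count-based metrics and the threshold-sweep construction of the AUROC described above.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Count-based metrics from 0/1 labels and 0/1 predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tpr = tp / (tp + fn)        # recall (sensitivity)
    tnr = tn / (tn + fp)        # specificity
    precision = tp / (tp + fp)
    return {
        "precision": precision,
        "recall": tpr,
        "f1": 2 * precision * tpr / (precision + tpr),
        "balanced_accuracy": (tpr + tnr) / 2,
    }

def auroc(y_true, scores):
    """Sweep the threshold over every score, tracing (FPR, TPR) from
    (0, 0) to (1, 1), and integrate with the trapezoidal rule."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
    pos = (y_true == 1).sum()
    neg = (y_true == 0).sum()
    tpr = np.array([((scores >= t) & (y_true == 1)).sum() / pos for t in thresholds])
    fpr = np.array([((scores >= t) & (y_true == 0)).sum() / neg for t in thresholds])
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))

m = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])  # all four counts equal 1
a = auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])           # perfectly separated classes
```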

Model Fairness Evaluation
Given the standard metrics above, we can consider some fairness metrics that are measured as discrepancies of some performance metric between members of a privileged group and the remaining groups. In this dataset, there are two primary sensitive attributes that fall into this regime: gender and race. In order to define these differences, we must introduce new notation. The exact notation will differ based on the source [20-23]. The quantities below will be numerically equivalent to those in the previous literature while remaining consistent with the notation used in this paper. Let µ_S denote the metric µ on the subset S within the data. For example, FPR_P will denote the False Positive Rate on the privileged group, whereas FPR_U will denote the False Positive Rate on the unprivileged group. We let y_i represent the test result for patient i and ŷ_i represent the model's prediction for patient i. The final fairness metric shown below is described in detail by Speicher et al. [24].
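As an illustration of these difference-style metrics, the sketch below computes three of them from a boolean privileged-group mask. The definitions follow the common conventions in the cited fairness literature; the function name and toy data are ours.

```python
import numpy as np

def group_fairness(y_true, y_pred, privileged):
    """Difference-style fairness metrics between the privileged group
    (boolean mask) and the remaining, unprivileged group."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    priv = np.asarray(privileged, dtype=bool)

    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        tpr = ((yt == 1) & (yp == 1)).sum() / max((yt == 1).sum(), 1)
        fpr = ((yt == 0) & (yp == 1)).sum() / max((yt == 0).sum(), 1)
        return tpr, fpr, (yp == 1).mean()   # TPR, FPR, selection rate

    tpr_p, fpr_p, sel_p = rates(priv)
    tpr_u, fpr_u, sel_u = rates(~priv)
    return {
        "statistical_parity_difference": sel_u - sel_p,
        "equal_opportunity_difference": tpr_u - tpr_p,  # TPR_U - TPR_P
        "average_odds_difference": ((fpr_u - fpr_p) + (tpr_u - tpr_p)) / 2,
    }

# Toy example: the model recovers the privileged positives but
# misses the unprivileged positive.
fm = group_fairness([1, 0, 1, 0], [1, 0, 0, 0], [True, True, False, False])
```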

Measuring the Effect of Data Missingness
We are interested in measuring the effect on the model's performance as the number of missing features varies. One expects that a higher number of missing features would lead to lower overall performance. Since the number of missing features spans a large range, we can study the trend by fitting an ordinary least squares line to performance versus the number of missing features. Our procedure is as follows:
1. Fit a model on the training data.
2. Select a subset of the testing data with a specified number of missing features.
3. Evaluate the model's performance on that subset.
4. Plot the performance versus the number of missing features.
5. Evaluate the F-test for the slope of the line and the Breusch-Pagan test for the heteroskedasticity of the residuals around the line.
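Steps 4 and 5 can be sketched as follows. This illustrative implementation (with synthetic data of our own) computes the OLS slope and the Breusch-Pagan LM statistic directly; in practice one would typically use a statistics package such as statsmodels.

```python
import numpy as np

def slope_and_breusch_pagan(x, y):
    """Fit y = a + b*x by OLS, then compute the Breusch-Pagan LM
    statistic by regressing the squared residuals on x (LM = n * R^2
    of that auxiliary regression). With one regressor, LM > 3.841
    rejects homoskedasticity at the 5% level (chi-squared, 1 d.o.f.)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    r2 = resid ** 2                               # auxiliary regression target
    gamma, *_ = np.linalg.lstsq(X, r2, rcond=None)
    ss_res = np.sum((r2 - X @ gamma) ** 2)
    ss_tot = np.sum((r2 - r2.mean()) ** 2)
    lm_stat = len(x) * (1.0 - ss_res / ss_tot)
    return beta[1], lm_stat

# Synthetic check: a negative trend whose noise grows with the number of
# missing features, as in our plots, should give a negative slope and a
# large LM statistic.
rng = np.random.default_rng(0)
x = np.linspace(0, 40, 400)                       # number of missing features
y = 0.9 - 0.01 * x + rng.normal(0.0, 0.001 + 0.01 * x)
slope, lm_stat = slope_and_breusch_pagan(x, y)    # slope < 0, LM well above 3.841
```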

Data Missingness
We constructed a simple (but interpretable) linear regression model that predicts the number of missing features given race and gender. The coefficients are shown in Table 2. We observe that race and gender are predictors of missingness.

Model Performance
Figures 1 and 2 outline each imputation method's overall performance on the dataset when stratified by different sensitive attributes. For each imputation method, we measured the AUROC, balanced accuracy, F1, precision, and recall on the total population, each gender category, each racial category, and across the different missing-feature bucketed groups. We then reran the analysis with an extra step of oversampling to balance the dataset by the number of people with diabetes.
Figures 3 and 4 compare the fairness metrics: average odds difference, average odds error, between-group generalized entropy error, class imbalance, equal opportunity difference, mean difference, and statistical parity difference. These are fairness metrics, which means that for a sensitive attribute, we denote one group to be privileged and one to be unprivileged. We evaluated the imputation methods on the model discrepancy across groups. Since there is no obvious privileged group for the missing-feature sub-populations, we only compared gender (with male being the privileged group) and race (with white being the privileged group).

Effect of Data Missingness
We next seek to understand the effect of data missingness. In the previous section, the 0.2-quantile missing-feature sub-populations had their AUROC, balanced accuracy, F1, precision, and recall tabulated. We may visualize how the models perform more easily by plotting the models' performance as a grouped bar chart, both without (Figure 5) and with (Figure 6) oversampling. We plotted the line of best fit for each machine learning metric as a function of the number of missing features and across imputation strategies, both without oversampling (Figure 7, Table 3) and with oversampling (Figure 8, Table 4). We observed a statistically significant negative slope for all of the performance metrics and models, except for balanced accuracy under the following imputation methods: impute mean, impute naive, impute median, and impute ridge. Furthermore, every model apart from "No Imputation" and "Auto Impute" demonstrated statistically significant heteroskedasticity.
Table 3. Tabular representation of Figure 7. We display the Y-intercept and slope of the lines of best fit for the estimator performance on a given metric. The F-test p-value gives the probability of the null hypothesis that the line of best fit has a slope of zero. The Breusch-Pagan p-value, which gives the probability that the error of the line has constant variance, is also given.

Table 4. Tabular representation of Figure 8, displaying the Y-intercept and slope of the lines of best fit for the estimator performance with oversampling on a given metric. The F-test p-value, which gives the probability of the null hypothesis that the line of best fit has a slope of zero, is given. The Breusch-Pagan p-value, which gives the probability that the error of the line has constant variance, is also given.

Discussion
We observe that imputation methods homogenize the amount of information per patient. That is, without imputation, the models show a sharp performance loss as missingness increases, whereas imputation makes the slope less steep at the cost of increased heteroskedasticity. We also note that every statistical test agrees between the oversampled and non-oversampled models. This trend underscores the sensitivity of predictive models to the method of handling missing data in electronic health records (EHR). The negative slope indicates that as more data are estimated rather than observed, the accuracy, precision, and recall of the models tend to decrease. This phenomenon can be attributed to the fact that imputation, despite being a necessary process to address missing data, introduces a level of uncertainty or noise. This noise can distort the underlying patterns within the data, leading to less reliable predictions from the models.
Ours is not the first paper to study diabetes prediction using the All of Us dataset. Abegaz et al. studied the application of machine learning algorithms to predict diabetes in the All of Us dataset [17]. Their work presents the AUROC, recall, precision, and F1 scores, stratified by gender, of random forest, XGBoost, logistic regression, and weighted ensemble models. Our work builds upon those foundations in three ways. First, we note that all of the models in Abegaz et al.'s work can be found in Scikit-Learn; hence, we performed a deep search over all Scikit-Learn models to find the best performing ones. Second, we presented our results for further substrata of the dataset. One of the most important features of AoU is the diversity of people within the dataset. We highlighted the five performance metrics on the total testing dataset, on each gender, on each race, and on groups bucketed by the number of missing features. We also presented the models' performance on a number of fairness measurements where the sub-populations have a clear privileged group. Third, our largest deviation from the previous work was to show how the performance of a model changes as the number of missing features varies.
The models in Figures 1 and 2 were trained for only three hours (as opposed to the multiday or multiweek training that some deep neural network solutions require) and yield modest results. Our best performing model is the "Auto Impute" model. We may compare the performance of that model to Abegaz et al.'s work: "Auto Impute" has a higher AUROC, comparable precision, and worse recall and F1. We note, however, that these models are not clinically ready. Further improvements need to be made before they could be preferred to an HbA1c test for diabetes testing. Since the multiple imputer used only 15 iterations, the algorithm likely did not stabilize, which caused its performance to drop. We emphasize that the primary objective of our research was not to maximize the performance of machine learning models applied to AoU data, but instead to study the effects of data missingness and imputation strategies on model performance.
Our analysis also highlights the presence of statistically significant heteroskedastic variance in model performance across imputation methods. Heteroskedasticity, in this context, refers to the irregular variability in the performance of predictive models, dependent on the amount and pattern of missing data being imputed. This irregular variance poses a significant challenge in predictive modeling, as it implies that the error terms (the differences between predicted and actual values) are not uniformly distributed. Models thus exhibit different levels of accuracy and reliability depending on the specific characteristics of the missing data in each patient record.
The presence of heteroskedastic variance can be particularly problematic in clinical settings. It implies that for some patients, especially those with more extensive or particular patterns of missing data, the predictions made by the models could be less reliable. This inconsistency could lead to disparities in clinical decision-making, potentially affecting the quality of care provided to certain patient groups. Since the "Auto Impute" model has the largest Y-intercept and one of the most negative slopes, it might be most beneficial to use the "Auto Impute" method for patients with few missing values in a clinical setting. For patients with many missing values, one may use another imputation method with a less steep slope or perform a cost-benefit analysis of ordering more tests to make the model more performant.
These findings highlight the critical need for developing more robust imputation techniques that minimize the introduction of noise and ensure uniform model performance across varying degrees of missing data. They also underscore the importance of considering the nature and pattern of missing data when applying machine learning models in healthcare settings. Future research should focus on exploring advanced imputation methods, possibly incorporating domain knowledge or utilizing more sophisticated algorithms, to mitigate the effects of data missingness on predictive model performance. In conclusion, while imputation is a necessary step in dealing with incomplete datasets for some models, our study indicates that current methods have significant limitations.
Addressing these limitations is crucial for the development of reliable and consistent machine learning models for clinical predictions, ultimately enhancing the quality of patient care and health outcomes. Our analysis of data missingness revealed that individuals who are male and persons of color would be disproportionately affected by a loss in performance with respect to data missingness. This is because the number of missing features is more highly correlated with male and non-white participants.
Future work can be conducted to ensure the robustness of these findings. A number of unanswered questions remain, such as: (1) Does heteroskedasticity depend on which features are included in the model? (2) Do these findings pertain to more modern and complex deep learning models? (3) What other forms of data augmentation can be performed to reduce heteroskedasticity? Another comparison of interest is exploring whether the testing dataset holds more missing values than the training dataset, and how the performance differs compared to the case of having roughly similar missing values between training and testing. If the testing dataset does not require many measurements, then hospitals could save time and money by not measuring every missing value.

Figure 1 .
Figure 1. Performance of the models, with the columns denoting the specific metric, across the evaluated sub-population (left label) and the imputation method (right label). The color denotes the magnitude of the metric, with warmer colors indicating higher performance. The text color is adjusted to be readable given the background color.

Figure 2 .
Figure 2. Performance of the models when oversampling, with the columns denoting the specific metric, across the evaluated sub-population (left label) and the imputation method (right label). The color denotes the magnitude of the metric, with warmer colors indicating higher performance.

Figure 3 .
Figure 3. Performance of the models, with the columns denoting the specific metric, across the evaluated sub-population (left label) and the imputation method (right label).

Figure 4 .
Figure 4. Performance of the models when oversampling, with the columns denoting the specific metric, across the evaluated sub-population (left label) and the imputation method (right label). The color denotes the magnitude of the metric, with warmer colors indicating better performance. The text color is adjusted to be readable given the background color.

Figure 5 .
Figure 5. Machine learning performance exhibited by different imputation methods grouped by 0.2 quantiles.

Figure 6 .
Figure 6. Machine learning performance exhibited by different imputation methods using an oversampling preprocessing step, grouped by 0.2 quantiles.

Figure 7 .
Figure 7. Best fit lines of machine learning metrics as a function of the number of missing features. The shading is the residual of the best fit line. The best fit line is colored green if we reject the null hypothesis that the line has a slope of zero. The shading is colored green if we reject the null hypothesis that the residuals have constant variance.

Figure 8 .
Figure 8. Best fit lines of machine learning metrics as a function of the number of missing features. The shading is the residual of the best fit line. All models contain an oversampling step. The best fit line is colored green if we reject the null hypothesis that the line has a slope of zero. The shading is colored green if we reject the null hypothesis that the residuals have constant variance.

Funding:
This research received no external funding.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: Restrictions apply to the availability of these data. Data were obtained from the National Institutes of Health's All of Us program and are available at https://www.researchallofus.org/ (accessed on 12 January 2024) with the permission of the National Institutes of Health's All of Us program. Obtaining access to the data involves an institutional agreement, verification of identity, and mandatory training.

Table 1 .
Model input features and missingness proportion for the total dataset for training and testing subsets.