Article
Peer-Review Record

The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data

BioMedInformatics 2024, 4(1), 780-795; https://doi.org/10.3390/biomedinformatics4010043
by Zain Jabbar 1,2 and Peter Washington 1,2,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 21 January 2024 / Revised: 7 February 2024 / Accepted: 26 February 2024 / Published: 6 March 2024
(This article belongs to the Special Issue Feature Papers in Medical Statistics and Data Science Section)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

  • Dear authors, I am delighted to provide my sincere suggestions with the hope that they contribute positively to your publication.

  1. Please provide more information about the All of Us dataset in a few sentences, including its purpose, the entities involved, the methods employed, and the timeline of its implementation.

  2. Regarding the primary precedent work, please provide additional insights into Abegaz et al.'s research to help readers swiftly comprehend their contributions and methodology. Additionally, it would be meaningful to elucidate the reasons behind the authors' decision to build upon that work.

  3. In the method section, elaborate on the Multiple Imputation with Bayesian Ridge method. Also, providing more details about the machine learning algorithms, i.e., how the parameters are set, will assist future researchers in replicating this work.

  4. In section 3.1, consider demonstrating the effectiveness of the Linear Regression model incorporating race and gender by comparing it with the same regression model involving fewer or more features. If only race and gender are utilized as the controlling pair in the subsequent study, please clarify this at the beginning and end of the paper as part of the experiment setting.

  5. In comparison to the majority of accepted papers, this manuscript should include additional content to meet the standards of this journal. I recommend the incorporation of at least two more pages dedicated to expanding on the background, detailing the dataset, and thoroughly describing all the methods employed.

Some minor issues:

  1. Regarding Figures 1 and 2, there is a mix of white and dark-colored text within one column. If unintended, consider using one color per column. Alternatively, if intentional, consider marking the comparative magnitude of unbalanced and processed datasets with different colors.

  2. Additionally, some minor adjustments can be made to enhance the visual appearance of the paper, such as increasing the font size of Figures 5 to 8, centrally aligning the titles of Figures 5 and 6, improving the readability of the bar plot, and adjusting the positions of text and non-text elements.

Author Response

Aloha,

Thank you for your feedback. Every comment was useful; however, due to time constraints I found it hard to fully incorporate all of the feedback. I fully intend to polish this work to meet the standards of this journal.

  1. Please provide more information about the All of Us dataset in a few sentences, including its purpose, the entities involved, the methods employed, and the timeline of its implementation.

Done. Added the following paragraph:

In this paper, the dataset of concern is the National Institutes of Health (NIH) All of Us (AoU) dataset. The program grew out of the Precision Medicine Initiative Cohort Program in March 2015 \cite{hudson2015precision}. The cohort consists of over 1 million volunteers who contributed biospecimen samples (such as blood and urine), physical measurements, and extensive surveys on health and lifestyle \cite{sankar2017precision}. The overarching goal of All of Us is to advance precision medicine: a personalized approach to disease prevention and treatment that considers individual differences in lifestyle, environment, and biology. This approach is intended to overcome the limitations of a one-size-fits-all model of health care by factoring in individual variation. The All of Us Research Program stands out for its commitment to diversity, striving to include participants from various racial and ethnic backgrounds, age groups, geographic regions, and health statuses to ensure the dataset reflects the broad diversity of the U.S. population \cite{mapes2020diversity}. By harnessing the power of big data and emphasizing inclusivity and participant engagement, the All of Us Research Program aspires to revolutionize our understanding of health and pave the way for more effective, personalized healthcare solutions.

  2. Regarding the primary precedent work, please provide additional insights into Abegaz et al.'s research to help readers swiftly comprehend their contributions and methodology. Additionally, it would be meaningful to elucidate the reasons behind the authors' decision to build upon that work.

Added the following:

A paper by Abegaz et al. studies the application of machine learning algorithms to predict diabetes in the All of Us dataset \cite{abegaz2023application}. Their work presents the AUROC, recall, precision, and F1 scores, stratified by gender, of random forest, XGBoost, logistic regression, and weighted ensemble models. This work builds upon those foundations in three ways. First, we note that all of the models in Abegaz et al.'s work can be found in Scikit-learn. Hence, we perform a deep search over all Scikit-learn models to find the best-performing ones. Second, we present our results for further substrata of the dataset. One of the most important features of AoU is the diversity of people within the dataset. We highlight the five performance metrics on the total testing dataset, on each gender, on each race, and on groups bucketed by the number of missing features. We also present the models' performance on a number of fairness measurements when the subpopulations have a clear privileged group. Third, our largest deviation from the previous work is to show how the performance of a model changes as the number of missing features changes.
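For concreteness, a minimal sketch of the kind of model search described above, using synthetic data and a deliberately small candidate list (both assumptions for illustration; the full search would iterate over many more Scikit-learn estimators, e.g. via `sklearn.utils.all_estimators`):

```python
# Illustrative sketch of searching Scikit-learn classifiers by AUROC.
# The synthetic data and this two-model candidate list are assumptions;
# the paper's search covers far more estimators, e.g. by iterating
# sklearn.utils.all_estimators(type_filter="classifier").
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Score each candidate with 3-fold cross-validated AUROC and keep the best.
scores = {name: cross_val_score(est, X, y, cv=3, scoring="roc_auc").mean()
          for name, est in candidates.items()}
best_model = max(scores, key=scores.get)
print(best_model, round(scores[best_model], 3))
```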

  3. In the method section, elaborate on the Multiple Imputation with Bayesian Ridge method. Also, providing more details about the machine learning algorithms, i.e., how the parameters are set, will assist future researchers in replicating this work.

Appended some extra information.

Specifically, one begins by denoting one column of the training input \(f\) and the other columns \(X_f\). A Bayesian Ridge regression model is then fitted on \( (X_f, f) \). This is done for every feature and can be repeated so that, in the next round, the previous round's predictions can be used to make better predictions of the missing values. In this paper we use 15 imputation rounds; the number 15 is chosen arbitrarily. The higher the number of rounds, the more accurate the imputation should be, but for a dataset as large as All of Us, we chose to keep it low.
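The procedure above corresponds to Scikit-learn's iterative imputer with a Bayesian Ridge estimator; a minimal sketch on a toy matrix (the data below stand in for the actual All of Us features):

```python
# Sketch of the Multiple Imputation with Bayesian Ridge setup described
# above; the toy matrix is an assumption standing in for All of Us data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

# Each feature f is regressed on the remaining columns X_f; max_iter=15
# gives the 15 rounds, each round reusing the previous round's predictions.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=15,
                           random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```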

  4. In section 3.1, consider demonstrating the effectiveness of the Linear Regression model incorporating race and gender by comparing it with the same regression model involving fewer or more features. If only race and gender are utilized as the controlling pair in the subsequent study, please clarify this at the beginning and end of the paper as part of the experiment setting.

This should definitely be expanded upon. The takeaway from that section was to see whether data missingness (a proxy for healthcare availability) had some relationship with sensitive attributes within the dataset. We see that males have a positive correlation coefficient whereas females have a negative one. This means that if a patient is male, we expect a greater number of missing features (leading to drops in performance and heteroskedasticity).
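A toy version of this regression, for clarity; the column names and synthetic data are assumptions, not the actual All of Us schema:

```python
# Toy illustration of regressing per-patient missing-feature counts on a
# sensitive attribute. Column names and data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "is_male": rng.integers(0, 2, 200),        # hypothetical indicator
    "glucose": rng.normal(100.0, 15.0, 200),
    "bmi": rng.normal(27.0, 4.0, 200),
})
df.loc[rng.random(200) < 0.3, "glucose"] = np.nan  # simulate missingness

n_missing = df[["glucose", "bmi"]].isna().sum(axis=1)
reg = LinearRegression().fit(df[["is_male"]], n_missing)
# A positive coefficient here would mirror the observation that male
# patients tend to have more missing features.
print(reg.coef_[0])
```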

  5. In comparison to the majority of accepted papers, this manuscript should include additional content to meet the standards of this journal. I recommend the incorporation of at least two more pages dedicated to expanding on the background, detailing the dataset, and thoroughly describing all the methods employed.

Added more background on the standard metrics involved.

Some minor issues:

  1. Regarding Figures 1 and 2, there is a mix of white and dark-colored text within one column. If unintended, consider using one color per column. Alternatively, if intentional, consider marking the comparative magnitude of unbalanced and processed datasets with different colors.

This is an artifact of the heatmap package. I believe it checks if the background color is dark enough and changes the foreground text color. I am happy to hear more feedback on visualizations. It could be the case that using color to denote magnitude is more distracting than helpful. 

  2. Additionally, some minor adjustments can be made to enhance the visual appearance of the paper, such as increasing the font size of Figures 5 to 8, centrally aligning the titles of Figures 5 and 6, improving the readability of the bar plot, and adjusting the positions of text and non-text elements.

I would like some more feedback on this part too. An important quality of this paper is the number of different imputation models, metrics, and subpopulations compared. However, this leads to either (1) a lot of clutter and the need to zoom in, or (2) a separation into multiple plots that makes related quantities slightly harder to compare. The current setup, with oversampling methods in one set of plots and the rest in another, seemed like an acceptable balance at the time.

Reviewer 2 Report

Comments and Suggestions for Authors

It is interesting to test different methods of fixing the missing values in the diabetes data.

There are some points that need to be fixed:

In section 3.1, please explain why you chose to use simple linear regression instead of other models, like a general linear regression model. Can you compare your results with different methods to test the goodness of fixing the missing values?

In section 3.4, the graph is not easy to read. Can you provide a table of F test and Breusch-Pagan test results?

In the discussion part, please give a clear result on whether it is a good idea to use these methods to fix the missing values, and how to explain the results to the patients.

Author Response

Aloha,

Thank you for your response.

In section 3.1, please explain why you choose to use simple linear regression instead of others, like general linear regression model. Can you compare your result with different methods to test the goodness of fixing the missing values?

The purpose of the linear regression was to determine some relationship between sensitive attributes and missingness. In the table we see that males have a higher linear coefficient than females. Hence, if we see a male patient, we expect them to have a greater number of missing features than a female patient. This is a (very rough) proxy for healthcare availability. I would like more feedback along these lines; I do believe there are better ways of presenting this fact. If I used a GLM, I would likely use the same input variables in order to have the same ease of explanation for what each coefficient means. Another method could be to use a more high-powered model and compare permutation importance.

In section 3.4, the graph is not easy to read. Can you provide a table of F test and Breusch-Pagan test results?

Done: Figures 9 and 10 (in the updated manuscript) contain each model's y-intercept, slope, F-test p-value, and Breusch-Pagan test p-value.

In the discussion part, please give a clear result on whether it is a good idea to use these methods to fix the missing values, and how to explain the results to the patients.

Added:

Because the "Auto Imputation" model has the largest y-intercept and one of the most negative slopes, in a clinical setting it might be most beneficial to use the "Auto Impute" method for patients with few missing values. For patients with many missing values, one may use another imputation method with a less steep slope or do a cost-benefit analysis of ordering more tests to make the model more performant.

Reviewer 3 Report

Comments and Suggestions for Authors

Jabbar et al. evaluated the impact of different imputation algorithms on a machine learning prediction task (diabetes) using EHR data from All of Us. Missingness is a critical issue when modeling EHR data, and this study conducted analytical research on the issue which may add value to the field. Several concerns need to be resolved before the manuscript can be considered for publication.

1. Authors need to add more statements about the results in their Figures 1-4. What are the important patterns for readers to focus on in these heatmaps?

From Figures 1-2, it seems like the multiple imputation algorithm consistently has reduced performance compared to the other methods. Please analyze and discuss the potential reason behind this observation.

Author Response

Aloha,

Thank you for your response.

  1. Authors need to add more statements about the results in their Figures 1-4. What are the important patterns for readers to focus on in these heatmaps?

Added:

The models in Figures \ref{model_performance} and \ref{model_performance_with_oversampling} were trained for only three hours (as opposed to the multi-day or multi-week training that some deep neural network solutions require) and yield modest results. Our best-performing model is the "Auto Impute" model. We may compare the performance of that model to Abegaz et al.'s work: "Auto Impute" has a higher AUROC, comparable precision, and worse recall and F1. We note, however, that these models are not clinically ready. Further improvements need to be made before this approach could be preferred to an HbA1c test for diabetes testing.

From Figures 1-2, it seems like the multiple imputation algorithm consistently has reduced performance compared to the other methods. Please analyze and discuss the potential reason behind this observation.

Added:

Because the multiple imputer used only 15 iterations, the algorithm likely has not stabilized, causing the drop in performance.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Great, thank you for addressing the comments. I wish you good luck in the future!

Reviewer 3 Report

Comments and Suggestions for Authors

Resolved my previous comments
