Fundamentals of Analysis of Health Data for Non-Physicians
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript needs profound English editing and proofreading. It is tough to read as it stands.
The authors' main claim is that any data scientist working in health data science should learn how the business works. However, that is true for every field data scientists may work in. So, I don't understand how the authors support their point by calculating standardised rates, which are trivial when comparing populations with different gender and age structures. It is a principle that the training of any data scientist focused on biostatistics or epidemiology should cover, as it belongs to the roots of the statistical foundations.
Comments on the Quality of English Language
The manuscript needs profound English editing and proofreading. It is tough to read as it stands.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The purpose of this manuscript is to provide guidance to research data managers and administrators in the management of medical data on a fundamental issue: the management of confounding factors. My comments to the authors are as follows:
1. Your article has a good educational idea. It can help members of medical research teams from outside the medical and health sciences who are involved in the storage, management and processing of research data. It is very important for them to learn the language, terms and basic analyses used in the analysis of medical data. Consider explaining the benefits of mastering these skills.
2. The analyses presented in the article and the results of adjusting the AHRDM rates are not new. Similar analyses are performed as exercises in epidemiology and medical statistics courses. This could be used as an educational example, provided that the terminology is clarified.
3. Title: The title of the article does not reflect the content of the manuscript. The article does not compare health science and data science methods. The manuscript uses an example to illustrate one of the basic methods of analysis in medicine, health sciences and related fields, namely the adjustment for the effect of background variables on the response variable. Please consider changing the title.
4. Pages 1-2, lines 32-37: In clinical medicine, epidemiology and health care research, the main goal is to understand, not to predict. For epidemiologists, knowing how to reduce the prevalence or incidence of a disease in a population is more important than predicting who will develop a disease. For clinicians and subject-specific researchers, the point of looking at health data is often to intervene to change the expected outcomes. Methods such as machine learning (ML), deep learning (DL) and random forests have shown promise in the field of medicine. These methods can develop effective diagnostic and predictive tools to identify various diseases. However, prediction methods suffer from the “black box” problem: inputs are fed to the algorithm and an output emerges, but it is not exactly clear which variables were identified, or how they contributed to the final output. The introduction would benefit from highlighting this difference.
5. Page 2, line 97: Please replace “design study” with “study design”. In epidemiology, your study is defined as a cross-sectional study.
6. Page 3, lines 315-321: In biostatistics, medical statistics and epidemiology, the term “standardisation” is mainly used for purposes other than adjusting for confounding factors. In statistics, standardisation is the process of putting different variables on the same scale. This process allows you to compare values between different types of variables. Typically, to standardise variables, statisticians calculate the mean and standard deviation for a variable. Then, for each observed value of the variable, they subtract the mean and divide by the standard deviation. In your study you are dealing with a phenomenon called "confounding" and with one method to control or adjust for it, called "stratification". In the results, you report adjusted rates for the AHRDM. Please reformulate.
7. Figure 1: The comparison is confusing because there are no such differences. Medical statisticians, biostatisticians and public health researchers have been using the methods described in the figure as "data science characteristics" for decades. The terms below the figure are useful to describe. Consider adding "biological and clinical focus" to line 333.
8. Figure 2: Please help the readers and clarify that the final adjusted rate is obtained by calculating the weighted average of the rates obtained in the subgroups. The weights are the relative sizes of the subgroups.
9. Discussion section: Please note that in cross-sectional and longitudinal studies, confounders are usually controlled using multivariable regression models, and results are reported with estimated adjusted effect sizes and their confidence intervals.
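The distinction drawn in comments 6 and 8 can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the three age strata and all counts are hypothetical and do not come from the manuscript, and the weights are taken from an assumed reference population (one common choice for the "relative sizes of the subgroups"). The first function computes an adjusted rate as a weighted average of stratum-specific rates; the last one computes z-scores, the other meaning of "standardisation" mentioned in comment 6.

```python
# Hypothetical illustration: all counts below are invented, not taken from the manuscript.

def crude_rate(cases, population):
    """Total cases divided by total population, ignoring the age structure."""
    return sum(cases) / sum(population)


def adjusted_rate(cases, population, reference_population):
    """Weighted average of stratum-specific rates.

    The weight of each stratum is its share of the reference population
    (an assumption of this sketch).
    """
    total_reference = sum(reference_population)
    rate = 0.0
    for c, n, ref in zip(cases, population, reference_population):
        stratum_rate = c / n              # rate within one stratum
        weight = ref / total_reference    # relative size of the stratum
        rate += weight * stratum_rate
    return rate


def z_scores(values):
    """Rescale values by subtracting the mean and dividing by the sample SD."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    sd = variance ** 0.5
    return [(v - mean) / sd for v in values]


# Two hypothetical regions with identical age-specific rates
# (0.001, 0.005, 0.020) but different age structures.
reference = [5000, 3000, 2000]
region_a_pop, region_a_cases = [6000, 3000, 1000], [6, 15, 20]
region_b_pop, region_b_cases = [2000, 3000, 5000], [2, 15, 100]

crude_a = crude_rate(region_a_cases, region_a_pop)
crude_b = crude_rate(region_b_cases, region_b_pop)
adj_a = adjusted_rate(region_a_cases, region_a_pop, reference)
adj_b = adjusted_rate(region_b_cases, region_b_pop, reference)

print(f"crude:    A={crude_a:.4f}  B={crude_b:.4f}")
print(f"adjusted: A={adj_a:.4f}  B={adj_b:.4f}")
```

In this made-up example the crude rates differ only because region B is older, while the adjusted rates coincide because the stratum-specific rates are the same in both regions.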
Comments on the Quality of English Language
English language requires editing.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The quality of the English language has improved. Still, many improvements must be made.
The authors should have written the article under the supervision of a senior researcher. Even as the authors claim the article seeks to be "a guide to deal with health data", the manuscript is childishly written. The authors must know they are writing a scientific piece, not a class tutorial.
Specific comments:
1. The introduction would improve if the first and second paragraphs' order switched places.
2. The paragraph in line 40 is crucial. It promises a lot, but the promises are not fulfilled.
3. Paragraphs in lines 53 and 58 should come earlier in the introduction as they are about the big picture.
4. The paragraph in line 65 should be rewritten for clarity.
5. The first sentence of the paragraph in line 80 does not make any sense.
6. The sentence in line 92 starting with "In the epidemiology…" does not make sense; the authors should rewrite it. The article is not a manual on epidemiological designs!
7. The paragraph in line 99 starting with "A secondary objective…". What is the first objective? It comes after!
8. The last paragraph of the introduction is about the main objective. Do you think it is the right place to have it?
9. The paragraph describing the organisation of the article is missing. It should be the last paragraph of the introduction.
10. The sentence in line 117, "The generally known data science methodology involves the stages: data collection, preprocessing, modeling, interpretation, and presentation." should be rewritten. Something like: "Data Science methodology involves five different stages, data collection, data preprocessing, modelling, interpretation of the results and results communication." It is an example of the drawbacks still found in English after the revision.
11. Figure captions should be self-explained. Some of the text in the body of the article should be moved to the Figure caption.
12. The second sentence of the paragraph in line 117 is useless and makes no sense. Written differently may be helpful.
13. Figure 1 is misleading! Data Scientists contribute a lot more than just in data collection, data cleaning (there is a typo in the figure!) and preprocessing.
14. The phases (before being named stages!) could be explained rigorously, accurately and formally. But this paragraph is useless! Authors' points of view can be expressed in a single paragraph.
15. In line 158, the sentence "From the third column, each percentage corresponds to the value of the previous column." should be replaced by "In Table 1, the percentages in each column in brackets are relative to the universe reported in the previous column." The "Data sources" section should add value to the table's caption. Table captions should be self-explained. Therefore, some of the text in this section should be moved to the table caption.
16. The section "Data cleansing and preprocessing" is written childishly. It does not add any value to the article. If authors think it is essential, move it to an appendix.
17. The authors should adopt a scientific, rigorous, accurate and formal way of writing. Here is an example. In the Section "Outliers Analysis" the paragraph
An outliers analysis is implemented through a hypothesis testing method. Some records with patients 999 years old and others with values different from 1 or 2 in the variable sex are allowed: 1 for men and 2 for women. This analysis is crucial because databases usually have missing, redundant, or spurious values that could generate wrong results.
The cleaning process includes a filtering process using regular expressions of the correct ICD-10-CM codes. Taking only the first four alphanumeric values with the regular expression in the sentence grepl("[A-Z][0-9][0-9][A-Z0-9]", AFECPRIN).
can be transformed into something like this:
The outlier analysis unveiled missing, redundant and spurious values. Records with age equal to "999" and sex coded outside the allowed coding (1 for men and 2 for women) were removed.
18. The sentence "The cleaning process includes a filtering process using regular expressions of the correct ICD-10-CM codes. Taking only the first four alphanumeric values with the regular expression in the sentence grepl("[A-Z][0-9][0-9][A-Z0-9]", AFECPRIN)." should be moved to the previous section and replaced by "The preprocessing process filtered records with ICD-10-CM codes of interest for the study."
19. The purpose of standardised rates is to allow comparison between regions exhibiting different population structures. The text of Sections "Crude Rate", and "Direct Standardization", and "Specific rates" should be shortened and rewritten formally and rigorously. The authors must know they are writing a scientific piece, not a class tutorial (less childish).
20. Figure 2 caption should be extended to explain the picture.
21. The first sentence of the Section "Results" does not belong here!
22. What is the purpose of computing crude rates if they are not used?
23. Any reference to the software used should be in a dedicated section (line 257).
24. The paragraph starting in line 261 is about good practices or required skills. There are parts of the articles where the authors point out the same. Why aren't they gathered together in a section introducing the skills a data scientist must have and the good practices that must be followed?
25. The section "Results" is poor. It is supposed to present the main findings. I find none!
26. The section "Discussion" should discuss the results presented in the Section "Results" and its implications. This section contains results that should be in the Section "Results" and does not discuss at all.
27. When we reach the Section "Conclusions", we should find some wrap-up of the results and its implications (discussion). However, we find results from the presented Figures that should have been presented before and recalled here.
28. Here, we understand that one of the authors' aims is to show that crude rates cannot be used to compare populations of different sex and age structures. It must be made clear earlier. The dataset on the Mexico case is instrumental in showing it. From this point of view, I believe that the last paragraph of the Section "Conclusion" in line 310 is indeed reporting a limitation. The dataset is a case study. The authors never intended to generalise the results.
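For readers following the cleaning steps quoted in the comments above (dropping records with age 999 or sex coded outside 1 and 2, and keeping records whose code matches the pattern [A-Z][0-9][0-9][A-Z0-9]), a minimal sketch is given below. It is a Python analogue of the quoted R `grepl` call, not the authors' code: the record layout, the field names `age` and `sex`, and the sample values are hypothetical, and only the pattern and the column name AFECPRIN come from the quoted text. The pattern checks only that a code is well formed; it does not by itself select the codes of interest for the study.

```python
import re

# Pattern quoted from the manuscript. Like R's grepl, re.search matches
# anywhere in the string (it is not anchored).
ICD_PATTERN = re.compile(r"[A-Z][0-9][0-9][A-Z0-9]")

ALLOWED_SEX = {1, 2}   # 1 for men, 2 for women (coding quoted in the review)
AGE_SENTINEL = 999     # age value treated as spurious in the quoted text


def is_valid(record):
    """Return True if the record passes the three quoted cleaning rules."""
    return (
        record["age"] != AGE_SENTINEL
        and record["sex"] in ALLOWED_SEX
        and ICD_PATTERN.search(record["AFECPRIN"]) is not None
    )


# Hypothetical records, invented for this sketch.
records = [
    {"age": 54, "sex": 1, "AFECPRIN": "E119"},   # kept
    {"age": 999, "sex": 2, "AFECPRIN": "E112"},  # dropped: sentinel age
    {"age": 61, "sex": 3, "AFECPRIN": "E119"},   # dropped: sex outside coding
    {"age": 47, "sex": 2, "AFECPRIN": "1234"},   # dropped: malformed code
]

clean = [r for r in records if is_valid(r)]
print(len(clean), "of", len(records), "records kept")
```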
Comments on the Quality of English Language
See previous comments.
Author Response
Please see the attachment
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have satisfactorily responded to all my questions and made the necessary changes to the manuscript.
Author Response
Thanks for your comments. We modified the manuscript to respond to another reviewer's comments, but we maintained all your suggestions in the first revision round.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The authors put a lot of effort into improving the manuscript according to the suggestions. However, the English language used is not common in scientific research writing. The manuscript seems to have been written as a class tutorial. For future articles, I strongly suggest that the authors write their manuscripts from scratch under the supervision of an experienced researcher.
Author Response
The manuscript is intended as a resource for non-physicians. As you suggested, your feedback has led us to recognize the importance of planning the manuscript's tone for scientific research from scratch. We acknowledge that this should have been a priority from the beginning. Nonetheless, we believe the manuscript can make a meaningful contribution to interdisciplinary research at the intersection of public health and data science. Additionally, we have made some adjustments to the manuscript to enhance its tone.
Thank you for all your comments