Article
Peer-Review Record

Application of Feature Selection and Deep Learning for Cancer Prediction Using DNA Methylation Markers

Genes 2022, 13(9), 1557; https://doi.org/10.3390/genes13091557
by Rahul Gomes 1,*, Nijhum Paul 2, Nichol He 1, Aaron Francis Huber 1 and Rick J. Jansen 2,3,4,5,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 10 July 2022 / Revised: 24 August 2022 / Accepted: 25 August 2022 / Published: 29 August 2022
(This article belongs to the Section Bioinformatics)

Round 1

Reviewer 1 Report

Title: Application of feature selection and deep learning for cancer prediction using DNA methylation markers


Comments to the Author


The work presented in the manuscript uses machine learning feature selection methods and deep learning methods, particularly Artificial Neural Network (ANN) methods, to identify biomarkers for cancer prediction, and demonstrates the approach on breast cancer classification. The authors' methods are standard, and the authors show good accuracy. The manuscript is well written. However, the authors need to clarify a few parts, which will also improve the quality of the manuscript.

 

Major comments

 

 

  1. The abstract is well written but seems very long. Is it possible to shorten it? I am not sure whether the journal sets a length limit, but I suggest shortening it a bit if possible.
  2. The introduction section is good. There is a minor concern related to the citation on page 2, line 82. The authors can write “In a study by Angermueller et al.,” instead of “In [22], DNA and CpG modules from Single Cell Bisulfite Sequencing (scBS-seq) data….”
  3. Regarding the datasets, the authors did not focus on subtypes or grade information; this must be mentioned.
  4. Right now, the evaluation metrics are provided as bar graphs only. Neither a confusion matrix nor bar graph is provided. Many important evaluation metrics are missing, and it is difficult to assess the model.
  5. The authors should provide more details in the figure legends. For example, what do the colors mean in Figure 6a? Provide a network interpretation in Figure 6b. The same applies to Figure 7.
  6. The authors mentioned the strengths of the work in the discussion section. However, the authors should provide weaknesses of the study as well.
  7. Furthermore, the authors could provide the workflow for reproducibility. For example, a GitHub repository. Currently, there is no way to reproduce the work. 
  8. Further, the parameters, package version, etc., must be explained in detail for the same.
  9. The authors should state the novelty of the work more clearly. Also, have the authors made any attempt to check whether these precision markers have any effect on patient survival? Since TCGA provides survival data, this should be possible.
  10. Overall, the authors should carry out a thorough spelling and grammar check.
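
As an illustration of comment 4, the following is a minimal sketch of how a confusion matrix and the metrics derived from it could be reported alongside the bar graphs. The labels are invented for illustration (1 = cancer, 0 = normal) and are not data from the manuscript; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summary_metrics(tp, fp, tn, fn):
    """Derive common evaluation metrics from the four confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Invented example labels, not taken from the study.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("            pred=1  pred=0")
print(f"true=1      {tp:6d}  {fn:6d}")
print(f"true=0      {fp:6d}  {tn:6d}")
for name, value in summary_metrics(tp, fp, tn, fn).items():
    print(f"{name}: {value:.3f}")
```

Reporting the four raw counts lets a reader recompute any metric the bar graphs omit, which matters most when the classes are imbalanced.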

 

 

Author Response

Thank you for your feedback; the manuscript is much improved.

Author Response File: Author Response.docx

Reviewer 2 Report

Main comments

1.       Abstract should be condensed by about 50%; the first 10 lines can be omitted completely. Please avoid terms that cannot be understood without having read the full paper, such as SMOTE.

2.       The introduction is more suited for a review on this topic, please condense and focus on the question that is being addressed in this paper.

3.       Line 253-256: “It was observed that the original 450k dataset performed poorly. The reason is probably the excessive number of features that makes it difficult for the simple deep learning architecture to detect meaningful changes. The filtered dataset, despite having a similarly large number of features, performs much more reasonably.” This cannot be gleaned from Table 3. What are “poorly” and “more reasonably” referring to?

4.       It was totally unclear to this reviewer how figures 4 and 5 were derived. What is the difference between accuracy, precision, and recall? I suppose it reflects how good the model predicted the sample as normal or as cancer, but it is not explained anywhere in the paper.

5.       Par. 3.3 is very hard to follow. There are four sets of CpG markers obtained, but there is no description of these four sets. Then there are six different gene sets; were they derived from the four CpG marker sets? The relevance of the data presentation in figures 6-9 is completely elusive to this reviewer. These kinds of analyses always lead to cancer-relevant gene sets; what’s the novelty? It’s comparable to a weather report in this way.

6.       The real critical issue with this paper is the application of SMOTE to pump up the proportion of normal samples in the datasets. The authors show that this leads to higher performance of their models, but how robust are these predictions in new datasets? Without such external replication, the effect of oversampling by SMOTE is hard to gauge and this paper isn’t much more than an advertisement for SMOTE.
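
To make the concern in comment 6 concrete, here is a simplified sketch of the interpolation idea behind SMOTE. It is not the authors' implementation or a library API: the function name, parameters, and sample points are assumptions chosen for illustration. Each synthetic sample is a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Invented two-feature "normal" samples, for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between existing samples, oversampling adds no information that the original minority samples did not already contain. This is why it is usually applied to the training split only, and why performance on untouched or external data is the relevant check.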

Please explain:

1.       Line 112: GDC Data Portal (or provide reference)

2.       Line 118: “these markers were removed or imputed markers before proceeding further.” It is not clear what exactly was done here: removing, imputing, or both?

Typos:

1.       Line 65: prostrate

 

 

Author Response

Thank you for your feedback; the manuscript is much improved.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Within a couple of days, the authors have addressed all my comments adequately and revised the paper extensively, for which I would like to commend them. However, my main criticism, i.e., lack of external validation of the model with 7 overlapping genes, can obviously not be addressed within a couple of days. I think this significantly reduces the impact of this manuscript.

One could argue that the 27K and 450K datasets are independent, but the Methods section does not provide any details on the cases included besides their numbers. E.g., were the 309 tumor samples in the 27K dataset also analyzed in the 450K dataset?

In addition, because both sets were analyzed in parallel (as far as I could tell), they could serve as each other’s validation, in a way, although the methylation platforms are strongly different. But the analyses aren’t presented in this way, and the fact is that the overlap between the two results (figure 10) is almost non-existent. The possibility that this is a chance observation looms large.
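
One way to quantify whether an overlap between two gene lists could arise by chance is a hypergeometric tail probability. The sketch below is a generic illustration, not an analysis from the manuscript: the function name is hypothetical, and the universe size and list sizes are placeholder values, since the review record does not state them. Only the overlap of 7 genes comes from the report.

```python
from math import comb


def overlap_p_value(universe, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn at random from a universe
    of which size_a belong to the first list (hypergeometric upper tail)."""
    total = comb(universe, size_b)
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Placeholder sizes (assumptions); only overlap=7 is taken from the report.
p = overlap_p_value(universe=20000, size_a=150, size_b=150, overlap=7)
print(f"P(overlap >= 7 by chance) = {p:.3g}")
```

The result depends strongly on the choice of gene universe, which for this comparison would have to be restricted to genes represented on both the 27K and 450K platforms.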

Hence the paper should discuss the merits of the datasets in more detail, as well as say something about the robustness of the result. What is the difference between the 27K and 450K in terms of genome representation or informativeness per gene locus? Can this explain the poor overlap? Or is it the sample set itself? The seven genes in the overlap are not major players in breast cancer given what we know today about the molecular genetics of this cancer, and despite the few references listed that claim the opposite: is this result robust?

Finally, the authors have now added a survival analysis, which is commendable. In a way, this adds credibility to the 7-gene set, although I’ve seen similar results in many bioinformatics papers before, of which nothing was ever heard since. In any case, nothing is said about this analysis in the Methods section (how were cases selected for the analysis, what statistic was used, etc.).

The log-rank p-value is significant but, as stated by the authors, probably caused mostly by group 5+, which represents only 12% of the patients in the analysis. Since there is hardly a difference between the other groups, these could be lumped, or perhaps there is some other way to dichotomize the data into two groups with distinct methylation features and survival?
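
For reference, a two-group log-rank comparison of the kind suggested here can be sketched in a few lines. This is a textbook formulation written for illustration, not the authors' analysis: the function name is hypothetical and the survival times are invented. An event flag of 1 marks an observed death and 0 marks a censored observation.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)              # at risk overall
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in A
        d = sum(1 for u, e, _ in data if u == t and e)        # events at t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented follow-up times in months; the last patient in each group is censored.
high = ([5, 8, 12, 20, 30], [1, 1, 1, 1, 0])
rest = ([15, 25, 40, 55, 60], [1, 1, 1, 0, 0])
chi2, p_value = logrank_test(*high, *rest)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

With the other groups lumped, the comparison reduces to this two-sample form, and the Methods section could then state the grouping rule and the test statistic explicitly.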

Minor details:

Lines 283-285: “These genes were RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, IGF2R.” That’s very strange after a sentence mentioning 136 genes. This sentence should be moved to the end of the paragraph at line 308.

Line 345: the sentence is prematurely broken off.

 

Author Response

Thank you for the helpful comments!

Author Response File: Author Response.docx
