Decision Trees for the Analysis of Gene Expression Levels of COVID-19: An Association with Alzheimer’s Disease
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear authors,
The study aims to identify genes associated with COVID-19 that may be related to Alzheimer’s disease using a machine learning approach, i.e., decision trees. However, the manuscript would benefit from a clearer statement of novelty. Some studies have already examined gene expression overlaps between COVID-19 and Alzheimer’s disease. The authors should explicitly highlight what is novel about their approach and findings.
- Terms like “RMA algorithm”, “CfsSubsetEval”, “BestFirst”, and “J48” are used without explanation. Please explain them briefly.
- The authors worked with only a single dataset. It has only 47 samples, which is very few for any machine learning method. The manuscript needs more discussion of how this might affect the robustness and generalizability of the results.
- The methodology states that the authors applied 10-fold cross-validation. With 10 folds, the data are split 90:10 in each round, so each test fold holds only 4 or 5 of the 47 samples, which is too few to represent all 3 classes reliably. Please state clearly which k was used. Because the number of samples is so low, leave-one-out cross-validation (LOOCV) could be a good alternative.
- Please also explain the confusion matrix in more detail, along with the precision, recall, and F1 score metrics. It would be helpful if the authors provided the predicted probabilities and the actual classes, along with ROC curves (as supplementary files). These values are a better indicator of model performance.
- The biological interpretation of DNAJC16, TREM1, and UCP2 is appreciated, but these sections read like a literature dump. The authors should better connect their decision tree findings to the biological pathways that link COVID-19 and Alzheimer's, ideally with a figure or diagram.
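The fold-size arithmetic behind the cross-validation comment, and the per-class metrics requested above, can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the function names `fold_sizes` and `per_class_metrics`, the class labels "A", "B", "C", and the example predictions are all hypothetical. Only the sample count of 47 comes from this report.

```python
from collections import Counter


def fold_sizes(n_samples, k):
    """Sizes of k near-equal test folds; the first n_samples % k folds get one extra sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


def per_class_metrics(actual, predicted):
    """Per-class precision, recall, and F1 computed from two equal-length label lists."""
    labels = sorted(set(actual) | set(predicted))
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) -> count, i.e. the confusion matrix
    report = {}
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(pairs[(a, c)] for a in labels if a != c)  # predicted c, actually another class
        fn = sum(pairs[(c, p)] for p in labels if p != c)  # actually c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# 47 is the sample count cited in this report.
ten_fold = fold_sizes(47, 10)  # seven test folds of 5 samples, three of 4
loocv = fold_sizes(47, 47)     # leave-one-out: 47 test folds of 1 sample each

# Hypothetical labels and predictions, purely for illustration.
actual = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
metrics = per_class_metrics(actual, predicted)
```

With test folds of 4 or 5 samples, a stratified split can place at most one or two samples of each class in a fold, which is why the per-fold estimates are unstable and why pooling the out-of-fold predictions into a single confusion matrix is more informative.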
Discretionary revisions:
- Why are the authors still working with array-based datasets when sequencing datasets, and perhaps even single-cell datasets, are available in the public domain?
- Why did the authors choose only decision trees when several other ML models are available?
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This study effectively leverages machine learning on microarray data to explore potential genetic links between COVID-19 and Alzheimer’s disease. The identification of key genes like DNAJC16, TREM1, and UCP2 provides valuable insights into shared neuroinflammatory pathways and metabolic dysfunctions. Although the study is interesting, I have the following concerns, which need to be addressed:
- While the study utilizes microarray data to identify differentially expressed genes related to COVID-19 and Alzheimer’s Disease, it would be helpful to understand why RNA-seq was not chosen for this analysis. Given that RNA-seq provides higher sensitivity, dynamic range, and the ability to detect novel transcripts compared to microarrays, was the choice of microarray (GSE177477) based on dataset availability, computational constraints, or other considerations? A brief justification in the manuscript would enhance transparency and methodological clarity.
- The study employs decision tree analysis for classification. Could the authors elaborate on why this method was chosen over other commonly used machine learning models (e.g., Random Forest, SVM, or ensemble methods)? I suggest performing comparisons with such models to further validate its effectiveness.
- Were the identified genes (TREM1, UCP2, DNAJC16) validated against any independent datasets or biological evidence to support their association with Alzheimer’s disease? Since TREM1 appears consistently across all decision trees, do the authors propose it as a potential biomarker for COVID-19-associated AD risk? Could its role be further explored experimentally or through pathway enrichment analysis?
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors:
Thanks to the authors for their revisions. While I appreciate the acknowledgment of the small dataset issue, I don’t think the response fully addresses the concern. Simply stating that SMOTE will be used in future work doesn’t resolve how the current results might be affected by having only 47 samples. There is little discussion of potential overfitting or of how reliable the reported results are, given the limited data. I do appreciate the clarification of the cross-validation method and the expanded metrics, which are solid improvements. That said, the core issue of data scarcity still feels underexplored. I will leave the final call to the editor but, to be honest, I am not entirely satisfied with how this particular concern was handled.
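One way to make the reliability concern concrete is to attach a confidence interval to an accuracy measured on 47 samples. The sketch below uses the Wilson score interval; the count of 40 correct predictions is a hypothetical placeholder, not a result from the manuscript, and the function name `wilson_interval` is chosen here for illustration. Cross-validated predictions are not fully independent, so the interval is only a rough guide.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 corresponds to roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical: 40 of 47 samples classified correctly (47 is the sample count cited above).
low, high = wilson_interval(40, 47)
```

For this hypothetical count, the interval spans roughly 20 percentage points, which illustrates how wide the uncertainty around any single accuracy figure is at this sample size.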
Thanks
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the authors’ effort to modify the manuscript based on the reviewers’ comments. While the clarification regarding the choice of the microarray dataset and the updates to the Methods section enhance the study’s transparency, incorporating RNA-seq data in future research could provide deeper insights and strengthen the current findings.
The authors’ elaboration on the choice of the decision tree method, along with the discussion of its advantages and the mention of future exploration of additional machine learning algorithms, is well noted. The proposed title change appropriately reflects the focus of your work and enhances clarity.
Additionally, I commend your efforts to perform Gene Ontology enrichment analysis on the key genes (TREM1, UCP2, DNAJC16) and to include these findings in the revised manuscript. Your acknowledgment of the need for future experimental validation is important and adds depth to the conclusions.
Overall, the revisions have substantially improved the manuscript. Thank you again for your responsiveness and for the updates that improved the manuscript.