Article
Peer-Review Record

A Machine Learning-Based Model to Predict In-Hospital Mortality of Lung Cancer Patients: A Population-Based Study of 523,959 Cases

Adv. Respir. Med. 2023, 91(4), 310-323; https://doi.org/10.3390/arm91040025
by Que N. N. Tran 1,*, Minh-Khang Le 2, Tetsuo Kondo 2 and Takeshi Moriguchi 1
Submission received: 9 July 2023 / Revised: 1 August 2023 / Accepted: 4 August 2023 / Published: 9 August 2023

Round 1

Reviewer 1 Report

[Comment 1] Proposed method and numerical experiments

[Subcomment 1a] (lines 119-120) It is not acceptable to set the OS time threshold at 1 month based on the authors' opinion alone; for example, cancer treatment often lasts much longer than 1 month. The authors should have used the standards set by the hospital or checked the assumption with doctors.

[Subcomment 1b] The authors should show in the numerical experiments how the validation results differ with and without the missing data. If including the missing data is better, then the authors must describe the structure of the missing data, e.g., how incomplete it was, the percentage of missing data for each data field, etc.

[Subcomment 1c] The authors need to make another table that is like Table 2 with the combined N0-N1 category, as they addressed in lines 173-174.

[Subcomment 1d] (Section 3.4) The authors must mention which input data were considered for the K-means clustering algorithm.

[Subcomment 1e] The authors must connect all result explanations with the figures, e.g., which figure confirms that the in-hospital mortality is less than 8% (line 231)?

[Subcomment 1f] (lines 308-315) The authors must also explain the results found by Shen et al. [17], and emphasize quantitatively how the authors' results are better than theirs.

[Subcomment 1g] There should be an analysis from the practical point of view, including the doctor's or hospital's opinions that agree with the result of this study. Without this statement, the authors should have conducted a more complete literature review and state how the results confirm or contradict the results found here.

[Subcomment 1h] The authors should explain why they use the regression method for the analysis, given that many more machine learning models are available. The authors must show that the regression method used is the state-of-the-art method for dealing with such cancer data; otherwise, the authors need to compare several state-of-the-art models and conclude which one is the best.

 

[Comment 2] Writing quality and clarity

The citations should be written in [] signs, instead of ().

 

[Comment 3] Conclusions

The conclusion section should also explain the problem background, the purpose of the study, and the method used, considering that readers might look only at this part when skimming the paper for the first time.

 

Author Response

ISSUE 1

 

1a. OS time

We understand the limitations of this definition. However, the SEER database does not record in-hospital mortality, so we relied on the literature on this problem. Note that we did not define the OS time of lung cancer patients in general, but the OS time of those who satisfied the selection criteria, with in-hospital mortality defined as death within 28 days of admission. Thus, patients whose OS time fell within the 1-month threshold were considered to have in-hospital mortality. This definition is consistent with the standard used in healthcare settings and with findings in the literature.

 

References

  1. Skaug K, Eide GE, Gulsvik A. Hospitalisation days in patients with lung cancer in a general population. Respiratory Medicine, 2009. 103(22):1941-1948
  2. Xia Y, Ma H, Buckeridge DL, Brisson M, Sander B, Chan A, Verma A, Ganser I, Kronfli N, Mishra S, et al. Mortality trends and length of stays among hospitalized patients with COVID-19 in Ontario and Québec (Canada): a population-based cohort study of the first three epidemic waves. International Journal of Infectious Diseases, 2022. 121:1-10

 

1b. Missing data

Thank you for the comment. We used both the complete and the incomplete data, in different parts of the analysis. Records without missing values make up the validation cohort, in which we examined the performance of our model separately. Records with missing values were imputed using the HI-VAE model and then made up the external validation cohort, which we examined in another section of the analysis. Therefore, the structure of missingness did not affect the training of our model (Supplementary Figure 1).
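As an illustration of the bookkeeping involved, the per-field missingness percentages requested by the reviewer and the separation of complete records from records needing imputation could be computed roughly as follows. This is a minimal sketch, not the authors' code: the field names (`age`, `stage`, `grade`), the toy records, and both helper functions are hypothetical, and the HI-VAE imputation step itself is not shown.

```python
def missing_rate_by_field(records, fields):
    """Percentage of records with a missing (None) value, per field."""
    n = len(records)
    return {f: 100.0 * sum(r.get(f) is None for r in records) / n for f in fields}


def split_by_completeness(records, fields):
    """Separate complete cases from records that would need imputation."""
    complete = [r for r in records if all(r.get(f) is not None for f in fields)]
    incomplete = [r for r in records if any(r.get(f) is None for f in fields)]
    return complete, incomplete


# Hypothetical toy records; None marks a missing value.
FIELDS = ["age", "stage", "grade"]
records = [
    {"age": 67, "stage": "III", "grade": 2},
    {"age": 72, "stage": None, "grade": 3},
    {"age": None, "stage": "IV", "grade": None},
    {"age": 58, "stage": "I", "grade": 1},
    {"age": 80, "stage": "II", "grade": None},
]

rates = missing_rate_by_field(records, FIELDS)
complete, incomplete = split_by_completeness(records, FIELDS)
print(rates)
print(len(complete), "complete;", len(incomplete), "to impute")
```

Reporting `rates` alongside the cohort sizes would make it possible to judge how much of the incomplete data rests on imputed values.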

 

1c. Table 2

1d. Input data for K-means clustering algorithm

We use only the linear predictor of the model as the input to the clustering algorithm.

For example, consider the regression formula: y_pred ~ X1 + X2 + X3.

Here y_pred is the linear predictor. Therefore, the input is a one-dimensional continuous feature.
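To make the idea concrete, clustering on a single linear-predictor value per patient might look like the sketch below. Everything specific in it is assumed for illustration: the coefficients, the intercept, the toy covariate rows, the choice of two clusters, and the deterministic initialization are not taken from the study, and the plain Lloyd's-algorithm implementation stands in for whatever K-means routine was actually used.

```python
def linear_predictor(row, coefs, intercept):
    """Return intercept + sum(coef_j * x_j), the logit-scale score of a logistic model."""
    return intercept + sum(c * x for c, x in zip(coefs, row))


def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on one continuous feature.

    Centers start at evenly spaced order statistics, so the result is deterministic.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * n
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical model y_pred ~ X1 + X2 + X3 with made-up coefficients.
COEFS = [0.8, -0.5, 1.2]
INTERCEPT = -3.0
X = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]]

y_pred = [linear_predictor(row, COEFS, INTERCEPT) for row in X]
labels, centers = kmeans_1d(y_pred, k=2)
print(labels)
print(centers)
```

Because the feature is one-dimensional, the clusters are contiguous intervals of the linear predictor, so they can be read directly as ordered risk groups.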

1e. Solved

1f. Solved

1g. Solved

1h. The use of regression method

We also fitted random forest and XGBoost classifiers, which are among the state-of-the-art classification models. Their AUCs in the validation cohort were 0.708 (95% CI = 0.695-0.721) and 0.783 (95% CI = 0.772-0.793), respectively, while the corresponding AUC of the logistic regression model was 0.781 (95% CI = 0.770-0.791). Although the AUC of the XGBoost classifier is slightly higher than that of the logistic regression model, we believe the difference is not meaningful; moreover, logistic regression provides better model interpretability and more user-friendly interfaces.
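For readers who want to reproduce this kind of comparison, an AUC with a 95% interval can be obtained along the lines of the sketch below. It is an assumption-laden illustration: the response does not say how the reported intervals were computed, so the percentile bootstrap here is a stand-in, and the labels, scores, and function names are invented. The pairwise AUC is quadratic in cohort size and is meant for small examples only.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two classes
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical outcomes (1 = died in hospital) and predicted risks.
y = [0, 0, 1, 0, 1, 1, 0, 1]
risk = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.8]

point = auc(y, risk)
lower, upper = bootstrap_ci(y, risk)
print(point, lower, upper)
```

Running the same routine on each model's predictions for the same validation cohort gives intervals that can be compared directly, which is how overlapping intervals such as those of XGBoost and logistic regression above would show up.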

 

ISSUE 2 Solved

 

ISSUE 3 Solved

Reviewer 2 Report

In this paper, the authors propose a machine learning-based model to predict the in-hospital mortality rate of lung cancer patients from SEER. My comments are as follows. The first concerns the data. The authors used data from SEER; however, the origin and summary statistics of the data are not very clear, especially for the TCGA data. Second, the authors used only logistic regression as the classifier. It would be good to employ other methods and compare them to test whether better results are possible. The authors could refer to some highly related papers (e.g., PMID: 34930307, PMID: 35932551) in this field. Third, the paper claims high predictive value for the mortality of lung cancer patients. However, the predictor variables are very important for reaching these conclusions. They need to be clearly described and investigated.

The language needs to be polished.

Author Response

  1. The Cancer Genome Atlas (TCGA) database is available at https://www.cancer.gov/ccg/research/genome-sequencing/tcga. In external validation, we combined the TCGA-LUAD and TCGA-LUSC cohorts. The necessary variables in the external cohort were used, and missing values were imputed with a heterogeneous incomplete variational auto-encoder (HI-VAE) (Supplementary Figure 1).
  2. The regression method. We also fitted random forest and XGBoost classifiers, which are among the state-of-the-art classification models. Their AUCs in the validation cohort were 0.708 (95% CI = 0.695-0.721) and 0.783 (95% CI = 0.772-0.793), respectively, while the corresponding AUC of the logistic regression model was 0.781 (95% CI = 0.770-0.791). Although the AUC of the XGBoost classifier is slightly higher than that of the logistic regression model, we believe the difference is not meaningful; moreover, logistic regression provides better model interpretability and more user-friendly interfaces.
  3. Predictor variables & Conclusions. We modified the Discussion & Conclusions sections as below.

[…]

From the statistical standpoint, our study shows the predictive value of the training model as the probability varies from 0 to 50% of in-hospital mortality since the initial diagnosis timepoint. Having said that, Shen et al. [17] had a similar approach to ours in terms of the SEER database use to predict death within three months after being diagnosed with lung-cancer-with-brain-metastasis. Nevertheless, instead of using ten variables including metastasis that is likely limited in advanced-stage cases, we only selected six general criteria that can be available not only in wide-range cases with respect to staging, but also accessible in both advanced and low-resource healthcare settings. Despite the fact that our model has less accurate performance with a higher probability of in-hospital mortality as shown in calibration plots, the sensitivity analysis with the external validation added validates the robustness of our model by ζZ and ζT values in the 3 contour plots. From these analyses, we confirmed that the model is fairly robust under a different data distribution and an unobserved confounder.

[…]

To conclude, our machine learning-based model trained and validated shows a high predictive value for the in-hospital mortality rate of new lung cancer patients, especially in the probability of less than 50%. Three applications including the static nomogram, the web app, and the risk table are helpful and accurate for risk stratification. Indeed, as more and more novel therapies for lung cancer treatment are accessible, more and more new prognosis-related questions need to be addressed. Our applications, especially the web app, are available on the internet and applicable to a wide range of healthcare settings in the world. Additionally, they can aid physicians to consult not only patients with “low-risk” in-hospital mortality rates for strategic planning but also those with “high-risk” ones for end-of-life care. In combination with the molecular investigation, the model with its robustness can assist clinicians beyond that for further intervention.

Round 2

Reviewer 1 Report

Thank you for your revisions.

Author Response

Thank you for your consideration.
