AI-Driven Predictions of Readmission and Mortality for Improved Discharge Decisions in Critical Care: A Retrospective Study
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
ICU readmission is also influenced by system factors: ICU strain/bed pressure, discharge timing (night/weekend), step-down availability, outreach teams and ward monitoring. None of these appear in the variable set, yet they can affect readmission risk independently of physiology. Please acknowledge this more strongly and, if available, add at least basic variables such as discharge time/day, ICU type, admission source or bed-occupancy proxies. If these are not available, emphasize that model predictions may partly reflect local practice rather than only patient status/stability.
Fig. 4 caption mentions “mortality prediction” but the model outcome is readmission or death. This is confusing and should be corrected.
SWIFT requires variables like PaO2/FiO2 and PaCO2 (entire acid-base status?). Your variable list includes SpO2 and labs but some labs are missing and the article does not clearly show the exact SWIFT components (or how missing values were treated). More generally, SWIFT as an instrument requires much more attention since it’s the main comparator.
The primary endpoint is a composite: ICU readmission or death within 7 days after ICU discharge (Methods, page 3–4). These are clinically different: readmission is often “rescuable” by different discharge timing/monitoring whereas ward death may reflect goals-of-care decisions or frailty or error. Please report separate model performance for (a) readmission and (b) post-ICU death – only then report the composite. If not possible, this should be strongly addressed in the discussion.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
- The use of MIMIC-IV for training/internal validation and KNUH for external validation is appropriate. However, please describe the comparability of the two ICU populations. Were they similar in terms of ICU type, primary admission diagnoses, or case-mix? Discuss potential selection bias.
- The initial selection of 48 variables based on prior literature is noted. However, the final selection of 26 variables based on a missingness threshold lacks a statistical or clinical feature selection rationale. Please clarify the feature selection process.
- The AUROC difference between internal (MIMIC-IV: 0.802) and external (KNUH: 0.756) validation warrants discussion. Please explore potential reasons.
- Please consider supplementing the literature review or discussion section with a citation to recent literature on data compensation and signal processing methods in structural health monitoring, specifically the study by Liu et al. (2025), "A new method for long-term temperature compensation of structural health monitoring by ultrasonic guided wave" (Measurement). The temperature compensation methodology proposed in that study could serve as a methodological reference for handling environmental or systematic noise in time-series data within your research. It offers parallels, particularly for addressing challenges related to missing values and signal attenuation in multivariate time-series analysis.
- Several figures appear to have low resolution or blurry text. Please provide high-resolution, vector-based figures for publication. Ensure that in the final version, axis labels, key frequency markers, and annotations are clearly legible.
- Some sentences are not expressed clearly enough and need to be further polished.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
1. TITLE AND ABSTRACT
1.1 The title is appropriate and clearly reflects the study content.
1.2 Abstract - Line 27: "KNUH database" should include the timeframe for data collection upon first mention for consistency with MIMIC-IV description.
1.3 Abstract - Lines 35-37: The reported AUROC values differ slightly between the abstract (0.802 and 0.744) and Table 2 (0.802 and 0.756). Please verify and ensure consistency throughout the manuscript.
2. INTRODUCTION
2.1 Lines 45-53: The introduction effectively establishes the clinical context. However, consider adding recent statistics on ICU readmission rates globally to strengthen the rationale.
2.2 Lines 66-71: The criticism of the SWIFT score is mentioned but would benefit from specific examples of its inconsistent performance with citations from the references provided [8, 9].
2.3 Lines 87-93: The study objectives are clearly stated. However, please clarify why the 7-day window was selected for outcome prediction. Was this based on previous literature, clinical significance, or data-driven considerations?
3. MATERIALS AND METHODS
3.1 Study Design and Population
3.1.1 Lines 97-101: Please provide the exact dates for KNUH data collection (currently stated as "between January 1, 2016, and February 28, 2023"). This 7-year period is substantial; were there any changes in ICU protocols or treatment standards during this time that might affect the results?
3.1.2 Figure 1 - Exclusion criteria: Please provide detailed reasons for excluding 22,920 ICU stays from MIMIC-IV and 4,960 from KNUH. The specific exclusion criteria should be explicitly listed (e.g., age <18, ICU stay <24 hours, death during ICU stay, etc.).
3.1.3 Lines 131-135: The definition of "planned surgical readmissions" needs clarification. What specific criteria were used to distinguish planned from unplanned readmissions?
3.2 Data Collection
3.2.1 Lines 140-145: You mention selecting 26 variables from an initial 48 based on <50% missing values. Please provide a supplementary table listing all 48 initially considered variables and their missing value rates to improve transparency.
3.2.2 Lines 141-144: The data collection methodology states variables were "collected at 1-day intervals and averaged when multiple measurements were taken within a day." This approach may lose important information about variability. Please justify this decision and discuss its potential impact on model performance.
3.2.3 Table 1: Several variables show substantial missing value rates in KNUH data (e.g., HCO3: 38.27%, WBC: 16.98%). How does GRU-D++ handle such high missingness compared to traditional imputation methods? This should be addressed in the discussion.
3.3 Statistical Analysis
3.3.1 Lines 150-159: The statistical approach is appropriate. However, for binary variables with very low prevalence (e.g., HFNC: 0.42% in MIMIC-IV), was Fisher's exact test considered instead of chi-square test?
3.3.2 Lines 154-156: You mention that normality tests indicated variables did not follow normal distributions. Consider providing these test results in a supplementary table.
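To illustrate the point in 3.3.1, the following is a minimal sketch of how a sparse 2x2 table can be checked against the usual chi-square rule of thumb (all expected counts at least 5) and tested with Fisher's exact test instead. The function names and the counts are hypothetical and are not taken from the manuscript; in practice an established statistics library would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table whose probability does not exceed the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def min_expected_count(a, b, c, d):
    """Smallest expected cell count under independence."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return min(r * col / n for r in rows for col in cols)


# Hypothetical counts: rows = failure / success, columns = HFNC yes / no.
table = (1, 67, 3, 929)
print("smallest expected count:", min_expected_count(*table))
print("Fisher exact p-value:", fisher_exact_two_sided(*table))
```

When the smallest expected count falls well below 5, as in this illustrative table, the chi-square approximation is generally considered unreliable and the exact test is the safer choice.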
3.4 Deep Learning Model
3.4.1 Lines 170-184: The description of GRU-D++ is quite technical. While the detailed mathematical formulation is in the Appendix, the main text should briefly explain the key advantages in more accessible terms for clinicians.
3.4.2 Lines 213-215: The 10-fold cross-validation methodology is mentioned, but were these folds stratified to maintain the outcome ratio? Given the class imbalance (6.76% failure rate in MIMIC-IV), stratification is crucial.
3.4.3 Lines 217-223: The fine-tuning experiment using varying percentages of KNUH data is interesting. However, please clarify:
- Were the subsets corresponding to these percentages selected randomly? Was the selection repeated multiple times?
- Why was 40% reserved for testing and only up to 50% used for training?
3.4.4 The manuscript lacks information about:
- Hyperparameter optimization procedures
- Computing infrastructure used
- Criteria for model convergence
- Strategies for preventing overfitting (e.g., early stopping, dropout rates)
4. RESULTS
4.1 Baseline Characteristics
4.1.1 Table 1: Several findings warrant discussion:
- The age difference between success and failure groups is statistically significant in both datasets. Was age included as a feature in the models?
- KNUH data shows notably different characteristics (e.g., lower dialysis use, different physiological parameters). Please discuss potential reasons.
- Some p-values are very small (p<0.001) - consider reporting exact p-values where possible.
4.1.2 Lines 161-168: The table footnote mentions statistics are presented as mean ± SD or n (%). However, for highly skewed variables (as indicated by your normality tests), would median and IQR be more appropriate?
4.2 Predictive Performance
4.2.1 Table 2:
- Standard deviations are provided for AUROC but not explained. Are these from 10-fold cross-validation? Please clarify in the table footnote.
- The AUROC for external validation shows a notable drop (0.802 → 0.756). This 5.7% relative decrease (0.046 in absolute terms) deserves more discussion regarding model generalizability.
- AUPR values are substantially lower than AUROC across all models. Given the class imbalance, AUPR may be more clinically relevant. Why was AUROC chosen as the primary metric?
4.2.2 Lines 217-224: The fine-tuning results show improvement with more data, but the gains appear modest (0.752 to 0.765 AUROC). Is this improvement statistically significant? Were confidence intervals calculated?
4.2.3 Missing information:
- Sensitivity, specificity, PPV, NPV at optimal thresholds
- Calibration metrics (e.g., Brier score, calibration plots)
- Performance stratified by important subgroups (e.g., age groups, ICU types)
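The threshold-dependent and calibration metrics requested in 4.2.3 can be sketched as follows. This is a minimal illustration under assumed inputs: the function names, the 0.5 threshold, and the eight label/score pairs are hypothetical and are not taken from the manuscript.

```python
def threshold_metrics(labels, probs, threshold):
    """Sensitivity, specificity, PPV and NPV when scores >= threshold count as positive."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as None.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical outcomes (1 = readmission or death) and predicted risks.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.8]
metrics = threshold_metrics(y_true, y_prob, threshold=0.5)
print(metrics, brier_score(y_true, y_prob))
```

At a low outcome prevalence, PPV is typically much lower than sensitivity or specificity, which is why reporting all four quantities alongside a calibration measure is informative.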
4.3 Comparison with SWIFT
4.3.1 Figure 3: The ROC curves are well-presented. However:
- Confidence intervals for the AUC estimates should be shown
- Consider adding calibration plots to assess if predicted probabilities match actual outcomes
- The SWIFT score performance (0.69 and 0.68) is lower than reported in some validation studies. Please discuss potential reasons.
4.3.2 Lines 232-238: The DeLong test showing statistical significance is appropriate. However, also consider reporting clinical significance - what would be the practical impact of this improved discrimination in terms of patients correctly classified?
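One way to obtain the confidence intervals requested in 4.3.1 is a percentile bootstrap, sketched here as an alternative to analytic (DeLong-type) intervals rather than as the authors' method. The function names, the number of resamples, the seed, and the example data are assumptions.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUROC."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            estimates.append(auroc(ys, [scores[i] for i in idx]))
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


# Hypothetical outcomes and predicted risks, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.8]
point = auroc(y_true, y_prob)
ci_low, ci_high = bootstrap_auroc_ci(y_true, y_prob)
print(point, ci_low, ci_high)
```

The pairwise AUROC here is quadratic in sample size and is meant only to make the definition explicit; a rank-based implementation would be preferable for cohorts of the size used in the study.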
4.4 SHAP Analysis
4.4.1 Figure 4: The SHAP value analysis is valuable for interpretability. However:
- The figure caption should explain how to interpret the color coding (red vs. blue)
- Why do some features (e.g., LOS, GCS, SAS) show such strong impacts? This deserves discussion.
- Consider showing SHAP interaction plots for the top features
4.4.2 Lines 246-251: The SHAP analysis section is very brief. Please expand to discuss:
- Which features were most important for prediction?
- Do the feature importance rankings align with clinical knowledge?
- Were there surprising findings?
4.5 Discharge Prediction Scores
4.5.1 Figure 5: This figure effectively shows the divergence in prediction scores. However:
- What is the "optimal threshold" indicated by the dashed gray line, and how was it determined?
- The confidence intervals overlap substantially in the days before discharge. At what point do they become significantly different?
- Consider adding the number of patients at each time point
4.5.2 Figure 6: The individual patient examples are illustrative but:
- Why were these specific 5 patients selected from each group? Were they representative or cherry-picked?
- Cases f-k in MIMIC-IV show varying patterns. Some show early warning signs while others don't. This heterogeneity should be discussed.
- Patient identifiers should be anonymized more carefully (e.g., use "Patient A1, A2" instead of "a, b, c")
5. DISCUSSION
5.1 Lines 290-305: The interpretation of results is generally sound, but several points need strengthening:
5.1.1 The claim that "trainable imputation methods...proved to be effective" needs more support. Please provide specific examples of how GRU-D++ handled missing values differently than conventional imputation.
5.1.2 The external validation AUROC drop from 0.802 to 0.756 is substantial. While you mention this "demonstrated good predictive performance," a 5.7% relative decrease suggests potential generalizability issues that need more thorough discussion.
5.2 Comparison with Previous Studies
5.2.1 Line 325: Loreto et al. achieved AUROC of 0.91, much higher than your model. While you mention it "focused solely on categorical data," please discuss more substantively why their performance was superior and whether their approach has limitations your method addresses.
5.2.2 Lines 332-341: The claims about GRU-D++ advantages need more empirical support. You state it "eliminates" the need for imputation and delivers "higher predictive performance," but Table 2 shows only marginal improvements over GRU-D. Please revise these claims to be more measured.
5.3 Limitations
5.3.1 Lines 342-368: The limitations section is thorough, but several additional limitations should be acknowledged:
- Class imbalance: The outcome occurs in only 6.76% (MIMIC-IV) and 7.07% (KNUH) of cases. How might this affect model performance and clinical utility?
- Temporal validation: Both datasets include historical data. Was there any temporal drift in patient characteristics or outcomes over time?
- Missing data patterns: Were data missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? This affects the validity of any imputation approach.
- Lead time: How early before discharge can the model provide reliable predictions? This is crucial for clinical implementation.
- Threshold selection: The manuscript doesn't discuss how the optimal decision threshold would be selected for clinical use.
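On the threshold-selection point, one common data-driven option is to maximize Youden's J (sensitivity + specificity - 1) over candidate cut-offs. The sketch below illustrates that single option under assumed inputs; it is not the method used in the manuscript, and the function name and data are hypothetical. A clinically motivated threshold that weighs missed deteriorations against unnecessary ICU days may be more appropriate.

```python
def youden_threshold(labels, probs):
    """Return (threshold, J) maximizing Youden's J; scores >= threshold count as positive."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_threshold, best_j = None, float("-inf")
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # keep the lowest threshold among ties
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical outcomes and predicted risks, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.8]
threshold, j_stat = youden_threshold(y_true, y_prob)
print(threshold, j_stat)
```

Any threshold chosen this way should be derived on training or validation data and then held fixed when reporting test-set or external performance.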
5.3.2 Lines 349-352: Regarding DNR patients, you state "the performance was relatively well-maintained" - this needs quantification. Consider a sensitivity analysis excluding these patients or showing their impact on model performance.
5.3.3 Lines 354-361: While you argue that including all disease categories may be advantageous, this also introduces heterogeneity. Consider subgroup analyses for major ICU admission categories (e.g., medical vs. surgical, sepsis vs. cardiac).
5.3.4 Lines 369-372: The call for prospective validation is appropriate, but please be more specific about what such a study would entail and how the model would be integrated into clinical workflow.
5.4 Clinical Implications
5.4.1 The discussion lacks a dedicated section on clinical implementation. Please address:
- At what point in the ICU stay would this model be applied?
- How would the predictions be communicated to clinicians?
- What actions would be triggered by high-risk predictions?
- What is the potential cost-benefit of implementing this system?
5.4.2 Lines 362-368: You mention the model "could lead to decreased ICU stay length and lower healthcare costs" but provide no evidence or estimation. Either remove these claims or provide supporting calculations/citations.
6. CONCLUSIONS
6.1 Lines 374-379: The conclusion appropriately summarizes the key findings. However, consider adding:
- Specific recommendations for future research
- Limitations that should be addressed before clinical implementation
- The next steps toward clinical translation
AS A RESULT:
- Address inconsistencies in reported AUROC values
- Provide more detailed methodology (exclusion criteria, hyperparameter optimization, missing data handling)
- Expand discussion of external validation performance drop
- Add clinical utility assessment (decision curve analysis, threshold selection)
- Provide more comprehensive supplementary materials
- Enhance discussion of clinical implementation considerations
- Conduct and report subgroup analyses
- Improve figure quality and interpretability
- Correct language and technical issues identified above
- Improve table formatting and add effect sizes
- Clarify data availability statement
- Expand limitations section
- Add summary comparison table with previous studies
- Provide more accessible explanation of GRU-D++ for clinicians
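For the clinical utility assessment listed above, decision curve analysis rests on the net benefit at a threshold probability p_t: TP/n - (FP/n) * p_t / (1 - p_t). The following minimal sketch computes it for a model and for the "treat all" reference strategy; the function names and data are hypothetical and are not from the manuscript.

```python
def net_benefit(labels, probs, p_t):
    """Net benefit of flagging patients whose predicted risk is >= p_t."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= p_t)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= p_t)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * (p_t / (1 - p_t))


def net_benefit_treat_all(labels, p_t):
    """Net benefit of flagging every patient, used as a reference curve."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * (p_t / (1 - p_t))


# Hypothetical outcomes and predicted risks, for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.8]
for p_t in (0.2, 0.5):
    print(p_t, net_benefit(y_true, y_prob, p_t), net_benefit_treat_all(y_true, p_t))
```

Plotting these quantities over a clinically plausible range of p_t, together with the "treat none" line at zero, gives the decision curve; a model is useful at a given threshold only where its curve lies above both references.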
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I am satisfied with the paper in this form.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have thoroughly addressed all concerns raised in my previous review. After carefully examining the revised manuscript, I confirm that the issues identified in the initial evaluation have been satisfactorily and comprehensively resolved. The authors have also incorporated relevant and up-to-date literature, which improves the scientific context and strengthens the overall quality of the manuscript. Based on the current version, the manuscript now meets the standards expected for publication.

