Supervised Machine Learning-Based Prediction of In-Hospital Mortality Following Hip Fracture in Older Adults
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
First of all, I would like to thank all the authors for their efforts.
In this paper, the authors evaluate the effectiveness of machine learning for predicting hip fracture mortality in the elderly.
The title of the study should be revised to reflect the main finding of the study. This will be more engaging for readers.
The introduction section is of sufficient length and contains the necessary information.
The authors should clearly state their hypothesis at the end of the introduction section.
There is a significant amount of data that could affect mortality and is not excluded from the study. The authors should expand the exclusion criteria and increase the generalizability of the data obtained.
The statement, "These eight supervised machine learning algorithms were developed and evaluated using the independent test set," should be moved to the methods section.
When identifying predictors, the authors should specify the studies they referenced.
The discussion section should begin by highlighting the most important finding of the study. The information mentioned in the introduction section is not necessary.
Create a paragraph about the clinical applications and examples of these findings before the limitation paragraph.
Simplify the conclusion and use clear statements.
Authors should especially use the recently developed "mortality score" to revise the yellow text.
Author Response
Comment 1: The title of the study should be revised to reflect the main finding of the study. This will be more engaging for readers.
Response 1: We thank the reviewer for this valuable suggestion. We agree that highlighting the main finding improves clarity and reader engagement. Accordingly, we have revised the title to explicitly emphasize the best-performing model and its interpretability.
Change in manuscript:
The title has been revised to:
“Interpretable Gradient Boosting Models for Predicting In-Hospital Mortality after Hip Fracture in Older Adults: A Nationwide Machine Learning Study.”
Comment 2: The authors should clearly state their hypothesis at the end of the introduction section.
Response 2: We appreciate this comment and agree that explicitly stating the hypothesis strengthens the conceptual framing of the study. We have now added a clear hypothesis at the end of the Introduction.
Change in manuscript (end of Introduction): “We hypothesized that supervised machine learning models—particularly tree-based ensemble methods—would provide accurate and clinically interpretable predictions of in-hospital mortality after hip fracture using routinely collected hospital data, outperforming traditional linear approaches.”
Comment 3: There is a significant amount of data that could affect mortality and is not excluded from the study. The authors should expand the exclusion criteria and increase the generalizability of the data obtained.
Response 3: We thank the reviewer for raising this important point. Our intention was to maximize generalizability by including all eligible hospitalizations recorded in the national FONASA database and by limiting exclusions to data quality–related criteria (e.g., missing outcome, missing key demographic variables, duplicates). Variables such as physiological parameters, laboratory values, or perioperative timing were not excluded but were unavailable in the administrative dataset. We have clarified this rationale and explicitly acknowledged the trade-off between data completeness and external generalizability.
Change in manuscript:
- Clarified the Exclusion Criteria subsection to emphasize that exclusions were minimal and data-driven.
- Added a clarification sentence in the Limitations section noting that unmeasured clinical variables may influence mortality but are not captured in nationwide administrative datasets.
Comment 4: The statement, “These eight supervised machine learning algorithms were developed and evaluated using the independent test set,” should be moved to the methods section.
Response 4: We agree with the reviewer. This statement has been relocated to the Model Development subsection of the Methods section to improve structural consistency.
Comment 5: When identifying predictors, the authors should specify the studies they referenced.
Response 5: We appreciate this suggestion. We have now added explicit citations supporting the selection of each predictor, particularly age, comorbidity burden, surgical treatment, and length of stay, which are well-established determinants of mortality after hip fracture.
Comment 6: The discussion section should begin by highlighting the most important finding of the study. The information mentioned in the introduction section is not necessary.
Response 6: We agree and have revised the opening paragraph of the Discussion to focus directly on the principal finding—namely, the superior performance and interpretability of the Gradient Boosting model—removing background material already presented in the Introduction.
Comment 7: Create a paragraph about the clinical applications and examples of these findings before the limitation paragraph.
Response 7: A new paragraph explicitly describing potential clinical applications—such as early risk stratification, prioritization of perioperative care, and integration into orthogeriatric workflows—has been added before the Limitations subsection.
Comment 8: Simplify the conclusion and use clear statements.
Response 8: We agree and have revised the Conclusions section to be more concise and focused on the key take-home messages, avoiding repetition and emphasizing clinical relevance.
Comment 9: Authors should especially use the recently developed “mortality score” to revise the yellow text.
Response 9: We thank the reviewer for this insightful comment. We have now explicitly discussed established hip fracture mortality scores (e.g., Nottingham Hip Fracture Score and Charlson-based indices) and clarified how our machine learning approach complements rather than replaces these tools, offering improved discrimination and individualized risk decomposition through SHAP-based explanations.
Reviewer 2 Report
Comments and Suggestions for Authors
- The manuscript does not include a study flow diagram to illustrate the process of data screening, case inclusion, and exclusion.
- The ethical approval number is missing.
- The manuscript does not sufficiently explain the rationale for selecting the included predictor variables. It is recommended to further clarify the basis for variable selection.
- The study does not provide a description of the participants’ demographic and baseline clinical characteristics.
- The study only performed internal validation using an 80/20 stratified split and lacks external validation. It is recommended to conduct external validation; if this is not feasible at present, please provide a detailed explanation of the reasons and potential impact in the “Limitations” section.
- Although the SHAP method was used to explain variable importance, the results merely repeat common findings (i.e., higher age, greater comorbidity burden, and absence of surgery are associated with higher mortality risk) and fail to provide new clinical insights. It is recommended to conduct a deeper analysis to explore potential nonlinear relationships or feature interactions to enhance interpretability.
- The introduction provides an insufficient review of related studies and lacks a systematic discussion of recent advances in this field. It is recommended to supplement the background information, for example by referring to the following paper to enrich the literature review: DOI: 10.3389/fpubh.2025.1544894
Author Response
Comment 1: The manuscript does not include a study flow diagram to illustrate the process of data screening, case inclusion, and exclusion.
Response 1: We thank the reviewer for this suggestion. We acknowledge that flow diagrams can be useful in studies involving complex, multi-stage screening procedures. However, in the present study, the cohort selection process was straightforward and based on predefined ICD-10 diagnostic codes applied to a nationwide administrative database, with minimal exclusions restricted to data quality–related issues (e.g., missing key variables or duplicate records).
Given the simplicity and transparency of the inclusion and exclusion process, we considered that a flow diagram would provide limited additional value beyond the detailed description already provided in the Methods section. To improve clarity without adding a flow diagram, we have further clarified the inclusion and exclusion criteria in the text to ensure full transparency of the cohort selection process.
Comment 2: The ethical approval number is missing.
Response 2: We thank the reviewer for highlighting this point. This study analyzed publicly available, fully anonymized administrative hospital discharge data. According to national regulations and institutional policies, the use of non-identifiable secondary data does not require ethical approval or an approval number. This has now been explicitly clarified in the manuscript.
Comment 3: The manuscript does not sufficiently explain the rationale for selecting the included predictor variables.
Response 3: We appreciate this observation. The rationale for predictor selection has now been expanded. Predictor variables were selected a priori based on their consistent availability in nationwide administrative data, early availability during hospitalization, and prior evidence supporting their association with mortality after hip fracture.
Comment 4: The study does not provide a description of the participants’ demographic and baseline clinical characteristics.
Response 4: We thank the reviewer for this comment. A detailed narrative description of the demographic characteristics, disease burden, treatment, and care process indicators corresponding to the predictor variables used in the model has now been added to the Results section. This description focuses exclusively on the variables included in the machine learning models to maintain methodological consistency.
Comment 5: The study only performed internal validation using an 80/20 stratified split and lacks external validation.
Response 5: We acknowledge this limitation. External validation was not feasible at this stage because the analysis relied on a single nationwide administrative database. However, this database captures a large, population-based cohort representative of real-world inpatient care. The absence of external validation and its potential implications have now been explicitly discussed.
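For readers unfamiliar with the validation scheme under discussion, the following is a minimal sketch of an 80/20 stratified split: each outcome class is split separately so that the share of deaths is the same in the training and test sets. The function name, the seed, and the toy labels are assumptions made for illustration; this is not the authors' code, and the manuscript does not state which library they used.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test while preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)  # random order within each outcome class
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# Toy labels (hypothetical): 10 deaths (1) among 100 hospitalizations.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
test_deaths = sum(labels[i] for i in test_idx)
```

Because both subsets come from the same database, a split of this kind measures internal validity only, which is the limitation the reviewer raises.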
Comment 6: The SHAP results merely repeat common findings and fail to provide new clinical insights.
Response 6: We appreciate this comment and have clarified the added value of SHAP analysis in the Discussion. While the identified predictors are clinically well known, the primary contribution of SHAP lies in quantifying nonlinear effects and providing patient-level risk decomposition. We now emphasize how SHAP reveals nonlinear age effects, threshold effects of comorbidity burden, and the consistent protective contribution of surgery across age strata, thereby enhancing interpretability beyond traditional summary statistics.
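To illustrate what "patient-level risk decomposition" means here, the sketch below computes exact Shapley values for one patient against a single reference patient by averaging marginal contributions over all feature orderings. It is a toy under stated assumptions: `toy_risk`, its coefficients, and both patient profiles are hypothetical and are not the authors' model or results, and the SHAP library approximates these values with more efficient algorithms rather than full enumeration.

```python
from itertools import permutations

def toy_risk(f):
    # Hypothetical risk score, NOT the authors' model: a step at age 85
    # (a threshold effect) and a surgery effect that depends on age.
    risk = 0.02 + 0.01 * f["comorbidities"]
    if f["age"] >= 85:
        risk += 0.05
    if f["surgery"]:
        risk -= 0.03 if f["age"] >= 85 else 0.01
    return risk

def shapley_values(predict, x, baseline):
    """Exact Shapley values of x relative to one baseline profile."""
    names = list(x)
    phi = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for n in order:
            current[n] = x[n]          # switch feature n to the patient's value
            updated = predict(current)
            phi[n] += updated - previous
            previous = updated
    return {n: total / len(orderings) for n, total in phi.items()}

patient = {"age": 90, "comorbidities": 3, "surgery": 0}    # hypothetical
reference = {"age": 70, "comorbidities": 0, "surgery": 1}  # hypothetical
phi = shapley_values(toy_risk, patient, reference)
```

The contributions sum to the difference between the patient's predicted risk and the reference prediction, which is the property that lets a single prediction be decomposed feature by feature.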
Comment 7: The introduction provides an insufficient review of related studies and lacks a discussion of recent advances. It is recommended to cite DOI: 10.3389/fpubh.2025.1544894.
Response 7: We thank the reviewer for this helpful suggestion. The Introduction has been expanded to include recent advances in machine learning–based mortality prediction after hip fracture, including the recommended reference.
Reviewer 3 Report
Comments and Suggestions for Authors
I am pleased to see that this manuscript has been revised and improved. There is still more work to be done.
The cohort should be better defined by including qualifying dates.
I am concerned about multiple episodes per patient. Is this possible? How is clustering of episodes within patient handled?
Was a hospital effect considered? If not why not? If included how was this handled?
The authors state that records with incomplete data were removed. Then they state that median/mode imputation was used. I assume the latter was the case. Such imputation leads to lower variability in predictors and can impact findings. Would not hot-deck imputation be better?
The first paragraph of the introduction needs to be rewritten. It is an important paragraph and unfortunately is not written as carefully as the rest of the manuscript. This could put off readers. For example, more acceptable English would be "Hip fracture among older adults remains ..." rather than "The Hip fracture among older adults remains ...". The rest of the paragraph is dense.
Once SHAP analysis has revealed those predictors most relevant and important, did the authors consider rerunning their analyses with only the most important ones? The idea would be to reduce the impact of nuisance variables and improve precision of remaining parameter estimates which in turn influences interpretation.
The discussions around surgery and length of stay are important and well written.
I hope that the manuscript can be revised and resubmitted.
Comments on the Quality of English Language
Noted above. Minor concerns but mainly just the first paragraph.
Author Response
Comment 1: The cohort should be better defined by including qualifying dates.
Response 1: Thank you for this comment. The qualifying dates defining the study cohort were already specified in the Methods section. To improve clarity and ensure that this information is clearly visible to readers, the cohort definition has been slightly reworded to explicitly emphasize the inclusion period (January 1, 2019, to December 31, 2024).
Comment 2: I am concerned about multiple episodes per patient. Is this possible? How is clustering of episodes within patient handled?
Response 2: Thank you for this important comment. The database includes an encrypted unique patient identifier, which allows identification of multiple hospitalizations belonging to the same individual. To address potential within-patient clustering, the analysis was restricted to the first hospitalization for hip fracture per patient during the study period. This approach ensured independence of observations and has now been explicitly clarified in the Methods section.
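The restriction described in this response can be sketched as follows, assuming each record carries a patient identifier and an admission date. The field names `patient_id` and `admission_date` and the toy records are hypothetical; the actual column names in the FONASA database are not given in this record.

```python
from datetime import date

def first_episode_per_patient(episodes):
    """Keep only the earliest admission for each patient identifier."""
    first = {}
    for ep in episodes:
        pid = ep["patient_id"]
        if pid not in first or ep["admission_date"] < first[pid]["admission_date"]:
            first[pid] = ep
    return sorted(first.values(), key=lambda ep: ep["admission_date"])

# Toy records (hypothetical): patient "a1" has two hip fracture admissions.
episodes = [
    {"patient_id": "a1", "admission_date": date(2021, 5, 2)},
    {"patient_id": "b2", "admission_date": date(2020, 1, 15)},
    {"patient_id": "a1", "admission_date": date(2019, 3, 10)},
]
index_episodes = first_episode_per_patient(episodes)
```

Keeping one row per patient removes within-patient clustering, at the cost of discarding information from later fractures.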
Comment 3: Was a hospital effect considered? If not why not? If included how was this handled?
Response 3: Thank you for raising this issue. Hospital-level effects were not explicitly modeled in the present analysis. The primary objective of the study was to develop and compare patient-level machine learning models based on routinely available administrative data, with an emphasis on generalizability across the national health system. Incorporating hospital-level clustering or hierarchical modeling was not feasible due to the absence of detailed hospital-level covariates and the focus on predictive performance rather than causal inference. This decision has now been clarified in the Methods section.
Comment 4: The authors state that records with incomplete data were removed. Then they state that median/mode imputation was used… Would not hot-deck imputation be better?
Response 4: We thank the reviewer for this careful observation and the opportunity to clarify our data handling strategy. Records were excluded only when essential variables required for cohort definition or outcome ascertainment (e.g., age, sex, discharge status) were missing. For predictor variables, missing values were addressed using median imputation for continuous variables and mode imputation for categorical variables. We acknowledge that simple imputation methods may reduce variability; however, they were chosen to ensure transparency, reproducibility, and computational stability across multiple machine learning algorithms. More complex approaches such as hot-deck imputation could be considered in future work but were beyond the scope of the current study. The imputation strategy and its potential impact have now been clearly described in the Methods.
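A minimal sketch of median/mode imputation is shown below, with missing values represented as `None`. The helper names and toy rows are hypothetical. Learning the fill values from the training rows only and then reusing them on the test rows is an assumption made here to avoid leaking test information; the response does not say how the authors ordered these steps.

```python
from statistics import median, mode

def fit_imputer(rows, continuous, categorical):
    """Learn fill values (median or mode) from the observed entries."""
    fill = {}
    for col in continuous:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return fill

def apply_imputer(rows, fill):
    """Return copies of the rows with missing predictors filled in."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

# Toy training rows (hypothetical values).
train_rows = [
    {"age": 70, "sex": "F"},
    {"age": 80, "sex": "F"},
    {"age": None, "sex": "M"},
    {"age": 90, "sex": None},
]
fill = fit_imputer(train_rows, continuous=["age"], categorical=["sex"])
completed = apply_imputer(train_rows, fill)
```

Every missing entry receives the same value, which is why the reviewer notes that this kind of imputation shrinks the variability of the predictors.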
Comment 5: Once SHAP analysis has revealed those predictors most relevant and important, did the authors consider rerunning their analyses with only the most important ones?
Response 5: We thank the reviewer for this thoughtful suggestion. The purpose of the SHAP analysis in this study was to enhance interpretability of the final predictive models rather than to perform post hoc feature selection. Rerunning models using SHAP-ranked predictors alone may introduce information leakage and optimistic performance estimates, particularly in supervised learning settings. Moreover, retaining the full set of clinically plausible predictors allows the models to capture complex interactions that may not be apparent from marginal importance rankings. This rationale has now been clarified in the Discussion section.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have appropriately revised the manuscript in response to my comments. The revised manuscript meets the review requirements, and I have no further comments. I recommend acceptance.
Reviewer 3 Report
Comments and Suggestions for Authors
There remain some minor points of presentation of the statistical methods and results, but overall this revision is sufficient.
