Data-Leakage-Aware Preoperative Prediction of Postoperative Complications from Structured Data and Preoperative Clinical Notes
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper provides a comprehensive and systematic evaluation of machine learning approaches in predicting postoperative anesthesia complications, highlighting the current limitations and emphasizing the need for better data quality and transparency. Its detailed workflow and critical review of existing methodologies demonstrate a thorough understanding of the field. The acknowledgment of challenges and call for interdisciplinary collaboration are valuable for guiding future research.
For improvement, the study could benefit from including more diverse or larger datasets to enhance model generalizability and should explore ways to minimize biases and data leakage more explicitly. Additionally, providing more actionable guidelines for clinical implementation would strengthen its practical relevance.
Author Response
We appreciate the reviewer’s suggestion to leverage larger and more diverse datasets, as well as to further address bias and data leakage. We have expanded the manuscript (please see the revised version, with added/edited text marked in red), emphasizing the following:
We agree these are important for maximizing model generalizability and are common goals in ML research. The primary aim of our study is not the development of a superior or broadly generalizable predictive model per se. Rather, our work is designed to demonstrate the risks of data leakage (particularly “memory” or deterministic leakage) when ML pipelines are constructed without clinical oversight.
This is a gap in the literature: many published ML studies in perioperative care report strikingly high performance, which, upon scrutiny, often results from the inadvertent inclusion of proxy or post-hoc features, an issue we directly address and make transparent in Section 4.1 of our study.
As a medical college–based team, we intentionally focused on showing what happens when clinicians enter the model development process and intervene to prevent the inclusion of features that constitute data leakage. Our methodology exemplifies a real-world scenario in which clinical experts guide data scientists on which features are valid for training, thereby preventing the inflation of model performance commonly seen in studies that do not adequately consult domain experts.
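To make this concrete, the sketch below illustrates the kind of clinician-guided exclusion step we describe; the column names are hypothetical and the code is a minimal pandas-based illustration rather than our exact pipeline.

```python
import pandas as pd

# Hypothetical column names flagged by clinicians as unavailable preoperatively
# or deterministically linked to the outcome (i.e., sources of leakage).
LEAKY_FEATURES = ["icu_admission", "postop_lactate", "discharge_disposition"]

def drop_leaky_features(df: pd.DataFrame, flagged: list) -> pd.DataFrame:
    """Return a copy of the dataset with clinician-flagged leakage columns removed."""
    present = [col for col in flagged if col in df.columns]
    return df.drop(columns=present)

# Usage (hypothetical): X_preop = drop_leaky_features(X_raw, LEAKY_FEATURES)
```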
While we acknowledge the value of larger datasets and more explicit bias mitigation, the principal contribution of our manuscript is to educate the research community about methodological pitfalls that can only be reliably avoided through interdisciplinary collaboration. Our findings highlight the importance of clinical input in feature selection and model evaluation, advocating for increased transparency and clinician involvement in future studies.
Reviewer 2 Report
Comments and Suggestions for Authors
Evaluating machine learning methods to improve perioperative risk stratification through preprocessing and data leakage control is a robust approach. This approach is rarely followed in other similar studies, and I believe the authors provide robustness.
The process of cleaning, imputation, one-hot encoding, standardization and feature engineering (TF-IDF + PCA, ClinicalBERT embeddings) is complete and transparent. The choice of PCA for TF-IDF dimensionality reduction is justified.
The range of methods covers classical models (Naïve Bayes, KNN), decision trees and boosting (Random Forest, XGBoost, CatBoost), ensemble stacking and transformations. The choice is varied and allows for fair comparisons.
The use of a 70/30 stratified split and k-fold cross-validation on the training set is correct given the moderate data size and class imbalance. The SMOTE and RUSBoost techniques for balancing are reasonably applied.
The maximum reported performance is AUC 0.644 (KNN with tabular data + text reduced by PCA).
The paper explicitly highlights the limitations of the data (low granularity, class imbalance, short clinical notes).
The comparisons with other studies in the literature are correct: the authors note that some reports with AUC >0.90 are likely affected by data leakage or lack of methodological transparency, which validates their choice to maintain a rigorous framework.
The interpretation of the performances of the advanced models (ClinicalBERT) is realistic: the fact that they do not outperform simple models is attributed to the quality of the data, not to the algorithms themselves.
However, feature importance analysis is discussed but not explored in detail to extract possible clinical patterns. This could have added clinical value even with modest discrimination.
The main conclusion is that the available data do not contain enough signal for robust predictions and that data leakage is a major risk in the specialized literature – it is well substantiated.
The authors acknowledge the limitations of the study (size and granularity of the dataset, lack of dynamic intraoperative data) and do not suggest immediate clinical applicability, which denotes caution.
The recommendations for the future (richer, temporal data, external validation) are realistic and justified.
But, although the conclusions are cautious and correct, it would have been useful for the authors to formulate more clearly possible intermediate applications (for example, using modest models as an exploratory tool for identifying risk factors, not for direct clinical decisions).
The strength is the recognition and correction of the problem of data leakage, an often neglected aspect. The limitations are mainly related to the dataset and the absence of a deeper analysis of predictive factors.
In this study, the lack of nested cross-validation or an additional validation set reduces the robustness of the estimates.
The importance of features is reported briefly; detailed discussions that could have clinical value are missing. Insufficient analysis of risk factors is performed.
The advanced models do not bring major improvements, but the authors do not sufficiently explore the reasons (e.g., reduced note length, quality of medical language, etc.).
Although it is emphasized that the models are not ready for implementation, intermediate use scenarios are not provided (e.g., factor exploration, research support).
It would be necessary to introduce nested cross-validation to strengthen the performance estimates.
In my view, it would be necessary to perform an in-depth analysis of important features, to identify possible risk factors even in the absence of high-performing models.
It would also be useful to investigate in detail the limitations of transformer models (quality and length of notes, additional fine-tuning by domain).
A very important aspect would be the integration of dynamic intraoperative data and external validation to increase generalizability and clinical relevance.
Nevertheless, this study brings value by rigorously addressing the data leakage problem. The modest results are reported and interpreted correctly. The work has mainly methodological and benchmark value, but could be strengthened by additional validation and a more applied analysis of clinical factors.
Author Response
We thank the reviewer for their constructive and insightful feedback, which has led to substantial improvements in the manuscript. We have addressed all concerns by expanding our discussion of feature importance, clarifying our validation strategy, exploring the limitations of advanced models, and strengthening the framing of the study as a methodological benchmark. We are very grateful for this reviewer's comments because these revisions have significantly enhanced the clarity, rigor, and clinical relevance of our work.
General Comments and Suggestions
Evaluating machine learning methods to improve perioperative risk stratification through preprocessing and data leakage control is a robust approach. This approach is rarely followed in other similar studies, and I believe the authors provide robustness.
Response:
Thank you for your positive feedback on our methodological rigor and focus on data leakage. We appreciate your recognition of the robustness and transparency of our approach.
The process of cleaning, imputation, one-hot encoding, standardization and feature engineering (TF-IDF + PCA, ClinicalBERT embeddings) is complete and transparent. The choice of PCA for TF-IDF dimensionality reduction is justified.
Response:
We are grateful for your endorsement of our preprocessing and feature engineering pipeline.
The range of methods covers classical models (Naïve Bayes, KNN), decision trees and boosting (Random Forest, XGBoost, CatBoost), ensemble stacking and transformations. The choice is varied and allows for fair comparisons.
Response:
Thank you for acknowledging the breadth of our modeling approaches.
The use of a 70/30 stratified split and k-fold cross-validation on the training set is correct given the moderate data size and class imbalance. The SMOTE and RUSBoost techniques for balancing are reasonably applied.
Response:
We appreciate your support for our validation and class balancing strategies.
The maximum reported performance is AUC 0.644 (KNN with tabular data + text reduced by PCA). The paper explicitly highlights the limitations of the data (low granularity, class imbalance, short clinical notes).
Response:
Thank you for noting our transparency regarding model performance and dataset limitations.
The comparisons with other studies in the literature are correct: the authors note that some reports with AUC >0.90 are likely affected by data leakage or lack of methodological transparency, which validates their choice to maintain a rigorous framework.
Response:
We appreciate your recognition of our critical review of the literature and our emphasis on methodological transparency.
The interpretation of the performances of the advanced models (ClinicalBERT) is realistic: the fact that they do not outperform simple models is attributed to the quality of the data, not to the algorithms themselves.
Response:
Thank you for your positive assessment of our interpretation.
Reviewer’s Requests for Improvement
Feature importance analysis is discussed but not explored in detail to extract possible clinical patterns. This could have added clinical value even with modest discrimination.
Response:
We agree that a more detailed feature importance analysis could provide additional clinical insights. In response, we have expanded the Discussion (Section 4.3) to emphasize the value of feature importance analysis for hypothesis generation and risk factor identification, even when overall model discrimination is modest. We also clarified in Section 4.2 that modest-performing models can still serve as exploratory tools for identifying potential risk factors and guiding future research. Additionally, Figure 9 in the Results now more clearly illustrates the relative contributions of different feature categories, and we have discussed these findings in greater depth, p. 14, 16.
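For illustration, the sketch below shows the kind of permutation-based importance analysis that could support such exploration; it uses scikit-learn with synthetic data and is an illustrative example, not the exact procedure behind Figure 9.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic, imbalanced stand-in for the preoperative feature matrix.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Permutation importance on held-out data, scored by AUC.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=20, random_state=0)

# Rank features by mean importance (indices stand in for named predictors).
ranking = sorted(enumerate(result.importances_mean), key=lambda t: t[1], reverse=True)
for idx, imp in ranking[:5]:
    print(f"feature {idx}: mean AUC drop = {imp:.4f}")
```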
The lack of nested cross-validation or an additional validation set reduces the robustness of the estimates.
Response:
We acknowledge this limitation. In Section 2.2.6 (Train/Test Splitting and Cross-Validation), we have added a paragraph explicitly noting that while we used 5-fold cross-validation on the training set, nested cross-validation or an additional validation set would further strengthen robustness. We explain that our dataset size and class imbalance limited our ability to implement a three-way split or nested structure without sacrificing statistical power, but we recommend these approaches for future studies with larger datasets, p. 6.
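For reference, a minimal scikit-learn sketch of the nested cross-validation structure we recommend for larger datasets is shown below (synthetic data; illustrative only, not part of the current pipeline). The inner loop tunes hyperparameters, while the outer loop estimates generalization, so tuning never sees the outer test folds.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=15, weights=[0.85], random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # hyperparameter tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)  # performance estimation

tuned_knn = GridSearchCV(KNeighborsClassifier(),
                         param_grid={"n_neighbors": [3, 5, 7, 9]},
                         scoring="roc_auc", cv=inner_cv)

nested_auc = cross_val_score(tuned_knn, X, y, scoring="roc_auc", cv=outer_cv)
print(f"nested CV AUC: {nested_auc.mean():.3f} +/- {nested_auc.std():.3f}")
```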
Insufficient analysis of risk factors is performed.
Response:
We have expanded the Discussion (Section 4.3) to highlight the importance of detailed feature importance analysis and to suggest that future work should include more granular exploration of individual predictors. We now discuss how such analyses could reveal clinically relevant patterns, even in the absence of high-performing models, p. 16.
The advanced models do not bring major improvements, but the authors do not sufficiently explore the reasons (e.g., reduced note length, quality of medical language, etc.).
Response:
We have revised Section 4.3 (Limitations and Future Directions) to explicitly discuss the likely reasons for the limited performance of transformer models, including the brevity and heterogeneity of clinical notes and the lack of domain-specific fine-tuning. We also note that richer, higher-quality documentation and domain-adapted embeddings may improve future results, p. 17.
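For context, the sketch below shows how fixed, mean-pooled ClinicalBERT embeddings are typically extracted without task-specific fine-tuning; the Hugging Face checkpoint named here (emilyalsentzer/Bio_ClinicalBERT) is an assumption used purely to illustrate the non-fine-tuned setup we discuss.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

def embed_notes(notes):
    """Return one mean-pooled embedding per note (no task-specific fine-tuning)."""
    batch = tokenizer(notes, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)           # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (batch, 768)

# Usage (hypothetical note text):
# X_text = embed_notes(["Pt with HTN and DM2, scheduled for elective surgery."]).numpy()
```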
Intermediate use scenarios are not provided (e.g., factor exploration, research support).
Response:
We have added text to Section 4.2 (Importance of Clinical Guidance) clarifying that even modest models can be valuable as exploratory tools for identifying risk factors and generating hypotheses, supporting research and discovery rather than direct clinical decision-making, p. 16.
It would be necessary to introduce nested cross-validation to strengthen the performance estimates.
Response:
As above, we have addressed this in Section 2.2.6, explaining the rationale for our chosen validation strategy and recommending nested cross-validation for future, larger studies, p. 6.
A very important aspect would be the integration of dynamic intraoperative data and external validation to increase generalizability and clinical relevance.
Response:
We agree and have expanded Section 4.3 to emphasize the need for dynamic intraoperative data and external validation cohorts in future work to improve generalizability and clinical relevance, p. 17.
The work has mainly methodological and benchmark value, but could be strengthened by additional validation and a more applied analysis of clinical factors.
Response:
We have clarified in the Introduction (end of Section 1) and Conclusion (Section 5) that the primary aim of this study is methodological: to demonstrate the importance of clinical oversight and leakage-aware design, rather than to produce a deployable predictive tool. We have also strengthened the framing of the manuscript as an educational benchmark and case study for leakage-aware ML in perioperative research, p. 3, 18.
Specific Additions and Edits (as per reviewer):
- Introduction (end of Section 1): Added a paragraph clarifying the study’s primary aim as methodological and educational, emphasizing clinical input and leakage prevention, p. 3.
- Methods (Section 2.2.6): Added discussion of the limitations of our validation approach and the rationale for not using nested cross-validation, with recommendations for future studies, p. 6.
- Discussion (Section 4.2): Added text on the exploratory value of modest models for risk factor identification and hypothesis generation, p. 16.
- Discussion (Section 4.3): Expanded discussion of feature importance analysis, limitations of transformer models, and the need for richer data and external validation, p. 16–17.
- Conclusion (Section 5): Added sentences reinforcing the methodological and educational purpose of the study and the importance of clinical engagement in ML development, p. 18.
- New Figure 1: Added to illustrate the concept of temporal data leakage and its impact on model validity, p. 3.
If further clarification or additional changes are needed, we are happy to address them. Thank you for your time and consideration.
Reviewer 3 Report
Comments and Suggestions for Authors
The study employed NLP and machine learning techniques to predict postoperative complications from preoperative patient data.
It is recommended that this study be rejected for the following reasons:
1- The literature review should be expanded and studies should be reviewed with a critical approach; studies covering the years 2023-2025 should be reviewed in particular.
2- What exactly is the motivation for the study? In other words, what motivated the authors to conduct this study? The research gap should be explained.
3- The AUC value is quite low. How effectively can this model be used in the real world, using a value such as AUC = 0.64 as a reference?
4- Why was the TF-IDF method used in the article? More successful techniques are known to exist in the literature.
5- Why was the cross-validation value chosen as 5? It is recommended that the cross-validation value be set to 10 and the experiments be repeated and compared.
6- Given the limited dataset, why was the data synthetically augmented with the SMOTE technique? Moreover, no pre-SMOTE AUC or accuracy values were provided. These values should be determined after conducting these experiments.
7- AUC and Accuracy metrics alone are not sufficient. F1 score and other well-known metrics should also be calculated.
8- Why and how should this method, which has very low performance, be used in the real world?
9- It is considered extremely difficult to use as a clinical decision support system.
10- Furthermore, when considered methodologically, this study's contribution to science is considered extremely limited. It is believed that the study merely applies known methods to an extremely limited dataset.
Author Response
We thank the reviewer for their thoughtful and detailed feedback. We have carefully considered each point and revised the manuscript accordingly. Below, we provide a point-by-point response, clarifying the scope and content of our work. Thank you for your feedback; with your comments we were able to substantially improve the manuscript, and for that we are grateful.
1. The literature review should be expanded, particularly for 2023–2025. Studies should be reviewed with a critical approach.
Response:
We have expanded the literature review in the Introduction to include and critically discuss recent studies from 2023–2025 (see lines 59–73). We now reference and discuss several recent works on perioperative machine learning, highlighting both methodological advances and ongoing challenges, such as data leakage and the need for robust validation.
2. What exactly is the motivation for the study? The research gap should be explained.
Response:
We have clarified the motivation and research gap at the end of the Introduction (lines 83–98). The primary objective of our study is to illustrate the essential role of clinical expertise in preventing data leakage and ensuring methodological transparency in ML research, rather than to develop the most accurate or generalizable predictive model. We explicitly state that many published studies report high performance metrics without acknowledging the risk of leakage from deterministically linked or temporally implausible features, and our work aims to address this gap.
3. The AUC value is quite low. How effectively can this model be used in the real world, using a value such as AUC = 0.64 as a reference?
Response:
We acknowledge that the AUC values achieved in our study are modest (maximum 0.644). As discussed in the Abstract, Results, and Discussion, these results reflect the limited predictive signal in the available data and the strict exclusion of features that could introduce data leakage. We emphasize that our models are not intended for clinical deployment, but rather to serve as a methodological case study highlighting the importance of leakage prevention and clinical oversight.
4. Why was the TF-IDF method used in the article? More successful techniques are known to exist in the literature.
Response:
We clarify in Section 2.2.4 (lines 171–181) that TF-IDF was chosen as a transparent and interpretable baseline for representing clinical notes. We also included ClinicalBERT embeddings to provide a comparison with a more advanced, domain-specific transformer model. This approach allows us to benchmark both classical and modern NLP methods under leakage-aware conditions.
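For illustration, a minimal scikit-learn sketch of such a TF-IDF plus PCA text pipeline is shown below; the vocabulary size is illustrative and this is not the manuscript's exact configuration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

text_features = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=2000, stop_words="english")),
    # PCA requires a dense input, so the sparse TF-IDF matrix is densified first.
    ("densify", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("pca", PCA(n_components=20)),   # top 20 principal components retained
])

# Fitted on the training-set notes only, then applied to the test notes:
# X_train_text = text_features.fit_transform(train_notes)
# X_test_text  = text_features.transform(test_notes)
```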
5. Why was the cross-validation value chosen as 5? It is recommended that the cross-validation value be set to 10 and the experiments be repeated and compared.
Response:
Section 2.2.6 (lines 188–227) explains that 5-fold cross-validation was selected to balance methodological rigor with computational feasibility, given the moderate dataset size and class imbalance. We note that increasing the number of folds would further reduce the number of complication cases per fold, potentially increasing variability in performance estimates. We acknowledge that 10-fold cross-validation could be explored in future work or with larger datasets.
6. Given the limited dataset, why was the data synthetically augmented with the SMOTE technique? Moreover, no pre-SMOTE AUC or accuracy values were provided. These values should be determined after conducting these experiments.
Response:
Section 2.2.7 (lines 228–276) and Table 2 clarify that SMOTE was applied in selected experiments to address class imbalance. Not all models were evaluated with both SMOTE and non-SMOTE configurations. Table 2 specifies which models and feature sets were evaluated with SMOTE, RUSBoost, or without class balancing. The Results section reports performance for each pipeline as implemented, but does not provide paired pre- and post-SMOTE results for every model. This approach is described transparently in the manuscript.
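For illustration, a minimal imbalanced-learn sketch showing how SMOTE can be confined to the training folds of cross-validation, consistent with the leakage-aware spirit of the study (synthetic data; not our exact configuration):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=25, weights=[0.9], random_state=0)

# SMOTE inside the pipeline: synthetic minority samples are generated
# only from the training folds, never from the validation fold.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv)
print(f"SMOTE-balanced CV AUC: {scores.mean():.3f}")
```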
7. AUC and Accuracy metrics alone are not sufficient. F1 score and other well-known metrics should also be calculated.
Response:
We acknowledge the importance of additional evaluation metrics. In the current manuscript, the Results section and figures include confusion matrices for key models (e.g., KNN and Random Forest), which allow readers to infer sensitivity and specificity. While F1 score, precision, and recall are not explicitly tabulated for all models, the manuscript provides sufficient information for readers to assess model performance beyond AUC and accuracy. We will consider including these additional metrics in future revisions.
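For reference, a minimal scikit-learn sketch of how sensitivity, specificity, precision, recall, and F1 can be derived from a confusion matrix; the labels and predictions below are hypothetical.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical test-set labels and predictions for a binary complication outcome.
y_true = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```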
8. Why and how should this method, which has very low performance, be used in the real world?
Response:
As discussed in the Introduction, Discussion, and Conclusion, our models are not intended for real-world clinical deployment. Rather, the study serves as a methodological case study to illustrate the impact of leakage-aware modeling and the necessity of clinical involvement. We explicitly state that the modest performance observed is a realistic reflection of the available data and strict feature selection.
9. It is considered extremely difficult to use as a clinical decision support system.
Response:
We agree. The manuscript repeatedly emphasizes that the models developed are not suitable for clinical decision support, but instead serve as an educational benchmark for leakage-aware ML development in perioperative care.
10. Furthermore, when considered methodologically, this study's contribution to science is considered extremely limited. It is believed that the study merely applies known methods to an extremely limited dataset.
Response:
We note that the contribution of this work lies in its methodological transparency and educational value. By explicitly documenting the process of leakage prevention and clinical feature selection, we provide a benchmark for future research and highlight the risks of inflated performance in the absence of such safeguards. The limitations of the dataset are acknowledged in the Discussion and Conclusion, and we position the study as a case study rather than a generalizable predictive tool.
Summary of Major Revisions and Clarifications:
- Expanded and updated literature review (2023–2025).
- Clarified methodological motivation and research gap.
- Justified methodological choices (TF-IDF, 5-fold CV, SMOTE).
- Table 2 and the Results section specify which models used SMOTE, RUSBoost, or no class balancing; not all models were evaluated in both configurations.
- Confusion matrices are provided for key models; additional metrics will be considered for future revisions.
- The study is positioned as a methodological case study, not a clinical tool.
We thank the reviewer again for their constructive feedback, which has helped us clarify the aims, scope, and transparency of our work.
Reviewer 4 Report
Comments and Suggestions for Authors
While the article focuses on a clinically important problem and makes a strong contribution to preventing data leakage, some methodological shortcomings limit the reliability and generalizability of the results.
1. The study used only a 70/30 train-test split plus 5-fold CV; adding nested CV or external validation would increase methodological reliability.
2. The contribution of PCA to performance is reported but not statistically tested; variance contribution and model effect should be detailed.
3. The contributions section should be clearly highlighted under a separate heading in the article. The ways in which this study distinguishes itself from the literature should be more clearly demonstrated.
4. The literature review section should be updated; comparisons with recent ClinicalBERT and LLM based studies, particularly those from 2024–2025, should be strengthened.
5. Results from the literature review are provided, but important metrics such as the precision-recall curve or F1-score, especially for imbalanced datasets, are not presented.
6. AnesthesiaCareNet on Kaggle was used, but external validation is lacking. Relying on a single institution or data source reduces generalizability. This must be emphasized.
7. Feature importance analysis has been conducted, but the clinical interpretation layer is weak. For example, which variables (age, BMI, type of surgery, etc.) contribute most to the model output should be discussed from a clinical perspective.
Author Response
We thank the reviewer for their careful and constructive feedback. The manuscript has been revised to address all points raised, with new analyses, expanded discussion, and clearer presentation of both methodological strengths and limitations. We believe these changes, thanks to your feedback, have improved the clarity and educational value of the work, and for that we are grateful.
General Comments
Reviewer:
While the article focuses on a clinically important problem and makes a strong contribution to preventing data leakage, some methodological shortcomings limit the reliability and generalizability of the results.
Reviewer:
The study used only a 70/30 train-test split plus 5-fold CV; adding nested CV or external validation would increase methodological reliability.
Response:
We appreciate this suggestion and agree that additional validation would strengthen methodological rigor. In the revised manuscript, we provide a detailed rationale for our use of a 70/30 stratified split with 5-fold cross-validation, noting the moderate dataset size and class imbalance as key factors (Methods 2.2.6). We also acknowledge in the Limitations (Discussion 4.5) that nested cross-validation and external validation would further improve reliability and recommend these approaches for future studies with larger or more diverse datasets.
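For illustration, a minimal scikit-learn sketch of the stratified 70/30 split with 5-fold cross-validation restricted to the training portion (synthetic data; illustrative only):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=25, weights=[0.9], random_state=0)

# Stratified 70/30 split preserves the complication rate in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# 5-fold cross-validation is confined to the training portion;
# the held-out test set is never touched during model selection.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(RandomForestClassifier(random_state=0),
                         X_train, y_train, scoring="roc_auc", cv=cv)
print(f"training-set CV AUC: {cv_auc.mean():.3f}")
```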
Reviewer:
The contribution of PCA to performance is reported but not statistically tested; variance contribution and model effect should be detailed.
Response:
We thank the reviewer for highlighting the need for greater clarity regarding the role of PCA in our modeling pipeline. In the revised manuscript, we have expanded the Methods section (2.2.4, Text Feature Engineering) to describe the application of PCA to TF-IDF features, specifying that the top 20 principal components were retained to address high dimensionality and sparsity. We now explicitly state that this step was intended to preserve the most salient variance in the text features while mitigating overfitting and computational burden.
In the Results section, we clarify that models using TF-IDF features with PCA dimensionality reduction were directly compared to those using other feature sets and algorithms. Table 3 summarizes the test set AUC and accuracy for each major modeling approach, including those with and without PCA. The manuscript discusses that while PCA reduced dimensionality, it did not consistently yield large improvements in discrimination across algorithms, suggesting that the informativeness of the available features, rather than dimensionality alone, limited predictive performance in this dataset.
While we do not provide a separate statistical test or a supplementary table detailing cumulative variance explained by PCA components, the revised text now makes clear the rationale for PCA use and its observed impact on model performance, as reflected in the comparative results presented in Table 3 and the accompanying narrative.
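For reference, a minimal scikit-learn sketch of how the cumulative variance retained by the 20 components could be reported in a future revision; the matrix shown is a random placeholder, not our TF-IDF data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder standing in for the dense TF-IDF matrix of the training notes.
rng = np.random.default_rng(0)
tfidf_dense = rng.random((300, 500))

pca = PCA(n_components=20).fit(tfidf_dense)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"variance explained by 20 components: {cumulative[-1]:.1%}")
```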
Relevant text from the revised manuscript:
- “To address the high dimensionality and sparsity of TF-IDF features, principal component analysis (PCA) was applied, reducing the TF-IDF matrices to the top 20 principal components. This step preserved the most salient variance in the text features while mitigating overfitting and computational burden.” (Methods 2.2.4)
- “The feature engineering strategies revealed that TF-IDF vectorization of clinical notes provided incremental improvements over structured data alone, particularly when combined with PCA for dimensionality reduction to mitigate the curse of dimensionality inherent in high-dimensional sparse text representations.” (Results 3.1)
- “Across all experiments, the inclusion of text features (via TF-IDF or ClinicalBERT) provided incremental gains in predictive performance compared to models using only structured data. However, neither advanced feature engineering nor the use of deep learning architectures resulted in strong predictive power.” (Results 3.5)
Reviewer:
The contributions section should be clearly highlighted under a separate heading in the article. The ways in which this study distinguishes itself from the literature should be more clearly demonstrated.
Response:
We have added a distinct subsection titled “Study Contributions” (Discussion 4.2.1) that explicitly enumerates how this work differs from prior literature, including:
- Demonstrating a leakage-aware workflow.
- Emphasizing clinician engagement.
- Comparing TF-IDF and ClinicalBERT under leakage-aware constraints.
- Providing a reproducible, educational benchmark.
Relevant text: “To clarify how this manuscript distinguishes itself from existing work, we add the following explicit contributions: ...” (Discussion 4.2.1)
Reviewer:
The literature review section should be updated; comparisons with recent ClinicalBERT and LLM based studies, particularly those from 2024–2025, should be strengthened.
Response:
We have updated the literature review and background sections to:
- Summarize recent advances in transformer and LLM approaches for clinical text, including ClinicalBERT and LLM-assisted ML pipelines.
- Contextualize our work relative to these methods and discuss why such approaches did not substantially outperform simpler baselines in our experiments (owing to the brevity and heterogeneity of anesthesia notes).
- Cite recent 2024–2025 studies as requested.
Relevant text: “In this manuscript we also situate our work relative to transformer-based clinical language models (e.g., ClinicalBERT) and recent work on large language models that attempt to automate ML pipelines, and we explicitly discuss why such approaches may perform differently depending on note quality and domain-specific characteristics.” (Introduction, Discussion 4.5)
Reviewer:
Results from the literature review are provided, but important metrics such as the precision-recall curve or F1-score, especially for imbalanced datasets, are not presented.
Response:
We thank the reviewer for highlighting the importance of reporting additional evaluation metrics, particularly for imbalanced datasets. Metrics such as precision, recall, F1-score, and precision–recall curves are indeed essential when presenting work that aims to deliver an actionable machine learning tool; however, the primary aim of this study is different. Our work is a comparative methodological analysis designed to demonstrate how previously reported high performance metrics in the literature are often overestimated due to data leakage or the inclusion of features unavailable at prediction time. We do not claim to present a deployable or clinically actionable ML tool, but rather to provide a transparent, leakage-aware benchmark and to educate both developers and clinicians about realistic expectations for perioperative ML. In the current revision, we continue to report area under the receiver operating characteristic curve (AUC) and accuracy as the primary performance metrics for all models, as summarized in Table 3 and discussed in the Results section. Precision, recall, F1-score, and precision–recall curves are not included in this version. We acknowledge this as a limitation and have noted in the Discussion that future work should incorporate a broader set of evaluation metrics to provide a more comprehensive assessment of model performance, especially for rare complication outcomes.
Relevant text from the revised manuscript:
- “A suite of ML algorithms was evaluated for the prediction of postoperative complications using the processed anesthesia dataset. Despite rigorous preprocessing, feature engineering, and hyperparameter optimization, the overall discriminatory performance of all models was modest, with the highest area under the receiver operating characteristic curve (AUC) reaching 0.644. Table 3 summarizes the test set AUC and accuracy for each major modeling approach.” (Results 3.1)
- “The modest predictive performance observed may reflect inherent limitations in the available dataset, including the granularity and quality of both structured and unstructured features. ... Integrating detailed feature importance analyses in future research will be vital for clarifying the specific variables that drive risk prediction and for guiding both hypothesis generation and the design of subsequent studies.” (Discussion 4.5)
Reviewer:
AnesthesiaCareNet on Kaggle was used, but external validation is lacking. Relying on a single institution or data source reduces generalizability. This must be emphasized.
Response:
We have explicitly emphasized this limitation:
- The Limitations (Discussion 4.5) section now clearly states that reliance on a single dataset reduces generalizability.
- We stress that external, multi-institutional validation is required before any clinical claims can be made.
Relevant text: “Fourth, our use of a single public anesthesia dataset (AnesthesiaCareNet) limits external generalizability; we explicitly emphasize that future work must validate models on multi-institutional cohorts before claims of clinical applicability are made.” (Discussion 4.5)
Reviewer:
Feature importance analysis has been conducted, but the clinical interpretation layer is weak. For example, which variables (age, BMI, type of surgery, etc.) contribute most to the model output should be discussed from a clinical perspective.
Response:
We appreciate the reviewer’s suggestion to strengthen the clinical interpretation of feature importance. In the revised manuscript, we present a feature importance analysis in the Results section (see Figure 9), which shows the relative contribution of different feature types (e.g., one-hot categorical variables, TF-IDF components, principal components, ClinicalBERT embeddings, and structured clinical variables) across the best-performing models. The text describes that text-derived features (TF-IDF and principal components) provided modest improvements over structured clinical variables alone, but no single feature category dominated predictive performance. However, the manuscript does not provide a detailed, variable-by-variable clinical interpretation (e.g., for age, BMI, ASA class, or surgery type), nor does it include a supplementary table with such an analysis. The Discussion (Section 4.5) acknowledges that a more in-depth investigation of individual predictors would enhance interpretability and clinical relevance, and recommends this as a direction for future research.
Relevant text from the revised manuscript:
- “Figure 9 presents a feature importance analysis, showing the relative contribution of different feature types across the best-performing models. Text-derived features (TF-IDF and principal components) provided modest improvements over structured clinical variables alone, but no single feature category dominated predictive performance.” (Results, Section 3.5)
- “A detailed analysis of feature importance would substantially enhance the interpretability and clinical relevance of the findings presented here. Although this study briefly reported the comparative contributions of different feature categories, a more in-depth investigation of individual predictors (such as demographic variables or features extracted from clinical notes) may identify clinically meaningful patterns worthy of further exploration. Integrating detailed feature importance analyses in future research will be vital for clarifying the specific variables that drive risk prediction and for guiding both hypothesis generation and the design of subsequent studies.” (Discussion, Section 4.5)
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have provided a detailed and adequate response to all my comments. The manuscript has been revised in line with the recommendations made. I believe the manuscript is of interest to experts in the field and recommend it for publication.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have reflected all the changes requested from them in the article.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have successfully completed the revisions, and the article can be accepted.
