Article

Data-Leakage-Aware Preoperative Prediction of Postoperative Complications from Structured Data and Preoperative Clinical Notes

1 Department of Osteopathic Manipulative Medicine, College of Osteopathic Medicine, New York Institute of Technology, Old Westbury, NY 11568, USA
2 Department of Surgery, Icahn School of Medicine at Mount Sinai, 1428 Madison Avenue, Atran Berg Building, 8th Floor, New York, NY 10029, USA
* Author to whom correspondence should be addressed.
Surgeries 2025, 6(4), 87; https://doi.org/10.3390/surgeries6040087
Submission received: 29 August 2025 / Revised: 4 October 2025 / Accepted: 8 October 2025 / Published: 9 October 2025

Abstract

Background/Objectives: Machine learning has been suggested as a way to improve how we predict anesthesia-related complications after surgery. However, many studies report overly optimistic results due to issues like data leakage and not fully using information from clinical notes. This study provides a transparent comparison of different machine learning models using both structured data and preoperative notes, with a focus on avoiding data leakage and involving clinicians throughout. We show how high reported metrics in the literature can result from methodological pitfalls and may not be clinically meaningful. Methods: We used a dataset containing both structured patient and surgery information and preoperative clinical notes. To avoid data leakage, we excluded any variables that could directly reveal the outcome. The data was cleaned and processed, and information from clinical notes was summarized into features suitable for modeling. We tested a range of machine learning methods, including simple, tree-based, and modern language-based models. Models were evaluated using a standard split of the data and cross-validation, and we addressed class imbalance with sampling techniques. Results: All models showed only modest ability to distinguish between patients with and without complications. The best performance was achieved by a simple model using both structured and summarized text features, with an area under the curve of 0.644 and accuracy of 60%. Other models, including those using advanced language techniques, performed similarly or slightly worse. Adding information from clinical notes gave small improvements, but no single type of data dominated. Overall, the results did not reach the high levels reported in some previous studies. Conclusions: In this analysis, machine learning models using both structured and unstructured preoperative data achieved only modest predictive performance for postoperative complications. These findings highlight the importance of transparent methodology and clinical oversight to avoid data leakage and inflated results. Future progress will require better control of data leakage, richer data sources, and external validation to develop clinically useful prediction tools.

1. Introduction

Recent advances in artificial intelligence (AI) have opened new opportunities for personalized healthcare, particularly in the perioperative setting where precise anesthesia management and early prediction of postoperative complications are critical. The integration of structured clinical data and unstructured clinical notes, combined with machine learning (ML) algorithms, enables the development of predictive models that can assist clinicians in identifying patients at elevated risk for adverse outcomes. However, building accurate and clinically actionable models for anesthesia complication prediction presents unique challenges, including the need for robust data preprocessing, careful feature engineering, and the mitigation of potential sources of data leakage.
Machine learning is increasingly being used to help doctors predict which patients might have complications after surgery, especially in the field of anesthesiology [1,2]. Traditionally, these predictions have relied on structured information like a patient’s age, health status, and type of surgery. More recently, doctors’ notes and other unstructured text from the medical record have been recognized as valuable sources of information. New tools in natural language processing (NLP), such as TF-IDF and transformer models like ClinicalBERT, can help turn these notes into useful data for prediction [3,4].
Different ML methods have their own strengths. For example, tree-based models like XGBoost and CatBoost are popular for handling structured data, while transformer models are especially good at understanding language and context in text [5,6]. Transformers have become widely used in healthcare because they can learn from many types of data, including text, images, and signals, and have shown promise in predicting outcomes like pain, complications, and even helping with diagnosis and drug discovery [7].
Despite these advances, real-world predictive performance remains modest. Several challenges hinder the development and deployment of effective ML models in healthcare, including label leakage, i.e., when outcome information is inadvertently included in model inputs [8,9]. Additionally, biases inherent in electronic health records (EHRs) may exacerbate health disparities when carried over into medical AI systems. These biases often originate from under-documented sources and can be magnified through data-driven modeling [10].
In anesthesiology, closed-loop systems have demonstrated advantages over manual control in regulating single variables, reducing provider workload, and safely delivering therapies over the past two decades [11]. Platforms like AnesthesiaGUIDE exemplify how automation can optimize intraoperative drug delivery through real-time physiologic monitoring [12]. However, such systems typically do not rely on preoperative data for forecasting complications. Predicting adverse outcomes preoperatively based on static or near-static features thus remains a complex but highly valuable clinical objective.
There is now a substantial and growing body of published research on using ML to predict complications from anesthesia. These studies span a wide range of complications, use diverse ML algorithms, and are increasingly being validated with robust statistical methods. For example, recent work has demonstrated the use of automated ML frameworks to predict postoperative pulmonary complications in non-small cell lung cancer patients undergoing thoracoscopic surgery [13], as well as the development and validation of models for predicting complications following radical gastrectomy for gastric cancer [14]. Other studies have focused on predicting specific adverse events such as postoperative nausea and vomiting using ensemble ML approaches [15], and on leveraging swarm intelligence-based interpretable models to assess postoperative recovery in general anesthesia patients [16]. While clinical implementation is still limited, the field is moving toward real-world integration, with some tools already in use for specific applications such as hypotension prediction and ultrasound guidance. Ongoing research and validation are expected to further bridge the gap between ML research and routine anesthesia practice [17,18,19,20,21].
Building on this foundation, our study adopts a distinct methodology by integrating both structured clinical variables and unstructured clinical notes within a single predictive framework. Leveraging recent advances in natural language processing alongside traditional feature engineering, we aimed to capture a broader spectrum of perioperative information than is typically considered in prior work. Importantly, our approach directly addresses key methodological challenges that can inflate reported model performance in the literature, such as data leakage from proxy variables. By systematically combining diverse data modalities and rigorously controlling for sources of bias, this study establishes a transparent and comprehensive benchmark for future research in perioperative ML.
The primary objective of this study is not to develop the most accurate or generalizable predictive model, but rather to illustrate the essential role of clinical expertise in preventing data leakage and ensuring methodological transparency in ML research. Many published studies in this field (often conducted and reported by computational or engineering teams without direct clinical involvement) report high performance metrics without acknowledging the risk of leakage from deterministically linked or temporally implausible features. Inadvertently, such studies may include post-diagnostic or future information as model inputs, rather than restricting features to those available prior to diagnosis or clinical decision-making. This practice can result in artificially inflated validation metrics that do not translate to reliable performance in real-world deployment. The concept of temporal data leakage and its impact on model validity is illustrated in Figure 1.
By systematically engaging clinicians throughout the model development process, the present work demonstrates how medical expertise guides appropriate feature selection, clarifies what information is available at the time of prediction, and ultimately mitigates the risk of misleading results. In this way, this study serves as an educational benchmark for the implementation of leakage-aware practices in medical ML research.

2. Materials and Methods

2.1. Dataset Description

The AnesthesiaCareNet Dataset is designed to support research in personalized anesthesia management and postoperative outcome prediction. It comprises both structured and unstructured clinical data collected from patient records, encompassing a diverse set of features relevant to perioperative care. The dataset includes patient demographics, surgical details, anesthesia administration logs, vital signs, and recovery information. Additionally, it incorporates clinical notes processed using advanced natural language processing (NLP) techniques, as well as predictive labels for complications, pain management needs, and recovery trajectories. The data is formatted in a tabular structure (CSV) to facilitate integration into ML workflows, with a target column indicating predicted postoperative outcomes or complications. A summary of the dataset fields is provided in Table 1.
The structured data fields (e.g., demographics, surgery details, vital signs) are complemented by unstructured clinical notes, which are processed using NLP methods such as TF-IDF vectorization and ClinicalBERT embeddings to extract informative features. The dataset is suitable for developing and evaluating AI-driven frameworks for personalized healthcare and predictive modeling in anesthesia management.

2.2. Data Preprocessing and Feature Engineering

A systematic data preprocessing pipeline was implemented to ensure data quality, maximize the utility of both structured and unstructured features, and facilitate fair comparison across ML algorithms. The following subsubsections detail each stage of the preprocessing workflow, from raw data ingestion to the preparation of final feature matrices for model training.

2.2.1. Deterministic Link Between ‘Complications’ and ‘Outcome’

Analysis of the dataset revealed a deterministic relationship between the ‘Complications’ and ‘Outcome’ columns. Specifically, whenever ‘Complications’ contains any string other than ‘none’, ‘Outcome’ is 1 (indicating complications present); if ‘Complications’ is ‘none’, ‘Outcome’ is 0 (no complications). Thus, ‘Complications’ serves as a perfect proxy for the target label.
Inclusion of ‘Complications’ as a feature in ML model training introduces data leakage, as the model can directly infer the outcome from this column. This results in artificially high performance metrics that do not reflect the model’s ability to generalize to real-world clinical scenarios, where the presence or absence of complications is not known at the time of prediction. Consequently, models trained with access to ‘Complications’ are not clinically actionable (see Figure 2).
To ensure the validity and clinical relevance of the ML experiments, the ‘Complications’ column was excluded from all predictive modeling workflows. It is important to note that studies using this or similar datasets must omit such deterministically linked columns from model training. Failure to do so will yield inflated performance estimates and models that are not suitable for deployment in clinical practice.
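For illustration, a minimal sketch of this leakage audit is shown below (in Python with pandas); the file name is hypothetical, and the column names follow Table 1.

    # Minimal leakage audit: does 'Complications' deterministically reproduce 'Outcome'?
    # File name is an assumption; column names follow Table 1.
    import pandas as pd

    df = pd.read_csv("anesthesia_care_net.csv")

    derived = (df["Complications"].fillna("none").str.lower() != "none").astype(int)
    if (derived == df["Outcome"]).all():
        # Perfect proxy for the label: exclude it from every modeling workflow.
        df = df.drop(columns=["Complications"])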

2.2.2. Data Cleaning and Initial Processing

The raw anesthesia dataset was first loaded and subjected to initial cleaning. Rows with missing outcome labels were removed to ensure a well-defined prediction target. Non-predictive columns, such as patient identifiers and free-text complication summaries, were dropped. The binary outcome variable (Outcome) was encoded as a categorical variable with classes 0 (no complication) and 1 (complication).
Numeric fields, including age, BMI, surgery duration, and pain level, were converted to numeric types as needed. Surgery duration, for example, was parsed from string representations by stripping non-numeric characters. Missing values in numeric columns were imputed using the median of the available values for each feature, a robust approach to mitigate the influence of outliers and preserve the distribution of the data.
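A condensed sketch of these cleaning steps, continuing from the audit above, is given below; column names such as 'Surgery_Duration' and 'Patient_ID' are illustrative assumptions rather than the exact field names of the dataset.

    # Basic cleaning: require a defined label, drop identifiers, parse and impute numerics.
    import pandas as pd

    df = df.dropna(subset=["Outcome"])                     # remove rows lacking an outcome label
    df = df.drop(columns=["Patient_ID"], errors="ignore")  # drop non-predictive identifiers

    # Parse surgery duration from strings such as "120 min" by stripping non-numeric characters.
    df["Surgery_Duration"] = pd.to_numeric(
        df["Surgery_Duration"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
        errors="coerce",
    )

    # Median imputation for numeric fields, robust to outliers.
    numeric_cols = ["Age", "BMI", "Surgery_Duration", "Pain_Level"]
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())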

2.2.3. Categorical Variable Encoding

Categorical variables, such as gender, surgery type, and anesthesia type, were transformed using one-hot encoding. Any missing or undefined categories were assigned an “Unknown” label prior to encoding. This process resulted in the expansion of each categorical variable into a set of binary indicator columns, ensuring compatibility with downstream ML algorithms that require numeric input.
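The corresponding one-hot encoding step can be sketched as follows, again with assumed column names.

    # Assign "Unknown" to missing categories, then expand into binary indicator columns.
    import pandas as pd

    categorical_cols = ["Gender", "Surgery_Type", "Anesthesia_Type"]
    df[categorical_cols] = df[categorical_cols].fillna("Unknown")
    df = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols)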

2.2.4. Text Feature Engineering

Unstructured clinical notes, specifically preoperative and postoperative notes, were processed using natural language processing techniques to extract informative features. TF-IDF vectorization was applied after tokenizing, lowercasing, and removing stop words from the notes. A bag-of-words representation was constructed, and term frequency-inverse document frequency (TF-IDF) scores were computed. The top 100 or 500 unigrams with the highest IDF values were selected to form the final TF-IDF feature set for each note type, and these features were appended to the main feature table. To address the high dimensionality and sparsity of TF-IDF features, principal component analysis (PCA) was applied, reducing the TF-IDF matrices to the top 20 principal components. This step preserved the most salient variance in the text features while mitigating overfitting and computational burden. For deep learning experiments, ClinicalBERT embeddings were extracted from the clinical notes. These dense vector representations captured contextual and semantic information beyond what was possible with TF-IDF. In multimodal models, ClinicalBERT embeddings were concatenated with tabular features for joint modeling.
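As a concrete illustration of the TF-IDF and PCA steps, the sketch below vectorizes one note field and reduces it to 20 components; 'Preop_Notes' is an assumed column name, and in a fully leakage-aware pipeline these transforms would be fit on the training split only.

    # TF-IDF vectorization of preoperative notes followed by PCA reduction to 20 components.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import PCA

    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", max_features=500)
    tfidf = vectorizer.fit_transform(df["Preop_Notes"].fillna(""))

    pca = PCA(n_components=20)
    note_components = pca.fit_transform(tfidf.toarray())  # dense 20-dimensional text features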
TF-IDF was chosen as a baseline representation of clinical notes because of its transparency and interpretability. While more advanced techniques exist, such as contextual embeddings from transformer models, TF-IDF remains widely used in clinical NLP and allows clear benchmarking against both classical and modern approaches. To provide such a comparison, we also included ClinicalBERT embeddings in our deep neural network models. This allowed us to evaluate both a traditional, interpretable method and a domain-specific transformer model under leakage-aware conditions. This choice underscores the study’s primary aim: to provide a transparent comparison of feature representations under strict, clinically realistic constraints, thereby highlighting how attention to proper methods (particularly the exclusion of features unavailable at prediction time) directly impacts model performance and interpretability.

2.2.5. Feature Scaling and Final Matrix Construction

For algorithms sensitive to feature scaling, such as K-Nearest Neighbors and Naïve Bayes, numeric features were standardized to zero mean and unit variance after imputation. The final feature matrix for each experiment consisted of the standardized numeric variables, one-hot encoded categorical variables, and either the reduced TF-IDF principal components or ClinicalBERT embeddings, depending on the modeling approach.
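Under these assumptions, the final feature matrix can be assembled roughly as follows, combining the standardized numerics, one-hot indicators, and reduced text components from the sketches above.

    # Standardize numeric features and concatenate all feature blocks into one matrix.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_numeric = scaler.fit_transform(df[numeric_cols])
    X_categorical = df.filter(regex="^(Gender|Surgery_Type|Anesthesia_Type)_").to_numpy()
    X = np.hstack([X_numeric, X_categorical, note_components])
    y = df["Outcome"].to_numpy()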

2.2.6. Train/Test Splitting and Cross-Validation

The processed dataset was split into training and test sets using a 70/30 holdout strategy, stratified by the outcome variable to preserve class proportions. For model selection and hyperparameter tuning, k-fold cross-validation (typically 5-fold) was performed on the training set. This ensured a proper estimation of model performance and mitigated the risk of overfitting to a particular data split.
A two-way split (train/test) was used instead of a three-way split (train/validation/test) for several practical and methodological reasons. The anesthesia dataset is of moderate size and exhibits class imbalance, with relatively few complication cases compared to non-complications. Introducing a third split would further reduce the number of samples available for both training and reliable evaluation, especially for the minority class, thereby compromising statistical power. Instead, the model selection and hyperparameter tuning were achieved using k-fold cross-validation within the training set, maximizing the use of available data for both model fitting and tuning, while the held-out test set remained untouched for unbiased final evaluation. This approach is particularly advantageous when data is limited, as it avoids the inefficiency of setting aside a permanent validation set that is never used for training or final testing. Thus, the two-way split combined with cross-validation was chosen to maximize data efficiency, maintain statistical power, and ensure robust model selection and evaluation, given the dataset size and class distribution.
Five-fold cross-validation was employed on the training set to support model selection and hyperparameter tuning. In studies with moderate dataset sizes, k-fold cross-validation is a widely accepted and practical approach for estimating model performance, as it maximizes the use of available data and maintains statistical power, particularly in the presence of class imbalance. Importantly, this study is fundamentally a comparative analysis of model performance under conditions of data leakage versus strict leakage prevention. As the same methodology was applied in both scenarios, the validity of the comparison is preserved, even if the absolute performance estimates may be less robust than those obtained with larger datasets and more complex validation frameworks.
Although both 5-fold and 10-fold cross-validation are widely used, the optimal number of folds depends on the characteristics of the dataset and the modeling objectives. In this study, the moderate sample size and the relative rarity of complication events meant that increasing the number of folds would further reduce the number of positive cases in each validation split, potentially increasing the variability of performance estimates and reducing their stability. While 10-fold cross-validation may offer marginally lower bias in some settings, the benefit is often outweighed by the risk of unstable estimates when each fold contains fewer events of interest. Thus, 5-fold cross-validation was chosen to ensure that each fold retained enough complication cases for meaningful evaluation and tuning. In larger datasets with more balanced outcomes, exploring 10-fold cross-validation frameworks may be warranted to assess whether they yield more reliable or generalizable results.
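The splitting and cross-validation scheme described above can be sketched as follows; the classifier shown is a placeholder, and any of the models in Section 2.3 could be substituted.

    # Stratified 70/30 holdout plus 5-fold cross-validation on the training portion.
    from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42
    )

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_auc = cross_val_score(
        KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=cv, scoring="roc_auc"
    )
    print(f"Cross-validated AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")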

2.2.7. Handling Class Imbalance

The outcome variable in our dataset (presence or absence of postoperative complications) exhibits moderate class imbalance, with non-complication cases outnumbering complication cases. This imbalance can adversely affect model training, as many ML algorithms are biased towards the majority class, potentially leading to poor sensitivity (recall) for the minority class, i.e., complications.
To address this, we incorporated class balancing strategies in selected experiments, particularly for algorithms known to be sensitive to class distribution, such as tree-based ensembles and certain classical classifiers. Two principal approaches were used: random undersampling and synthetic minority oversampling.
Random undersampling, as implemented in RUSBoost, works by reducing the number of majority class samples during each boosting iteration, thus forcing the model to focus on learning from the minority class. While this method is simple and effective, it can risk discarding potentially informative data from the majority class.
Synthetic Minority Oversampling Technique (SMOTE) was also employed in some experiments. SMOTE generates synthetic samples for the minority class by interpolating between existing minority class examples. By artificially increasing the representation of complication cases, SMOTE can help the model better recognize patterns associated with rare events, potentially improving recall for complications.
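A minimal sketch of the oversampling step, using the SMOTE implementation from the imbalanced-learn library, is shown below; it is applied to the training data only so that synthetic samples never enter the held-out test set.

    # Oversample the minority (complication) class in the training set with SMOTE.
    from imblearn.over_sampling import SMOTE

    smote = SMOTE(random_state=42)
    X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)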
Not all models were evaluated with every class balancing strategy. The choice of whether to apply SMOTE, RUSBoost, or neither was guided by both methodological considerations and practical constraints. For example, some algorithms (such as RUSBoost) inherently incorporate random undersampling as part of their design, making additional oversampling unnecessary or incompatible. In other cases, the combination of certain feature sets and algorithms (e.g., deep learning models using ClinicalBERT embeddings) was less likely to benefit from synthetic oversampling, either due to the model’s inherent robustness to imbalance or the risk of introducing noise into high-dimensional feature spaces. Additionally, computational feasibility and the desire to avoid excessive model proliferation influenced the selection of which models to balance.
Table 2 summarizes which models and feature sets were evaluated with each class balancing strategy. A checkmark (✓) indicates that the configuration was tested and reported, while a cross (×) indicates it was not. As shown, some models were evaluated both with and without balancing, while others were only tested in one configuration. This approach allowed us to focus our analysis on the most relevant and informative comparisons, while maintaining transparency about the scope of our experiments.
The selective application of class balancing strategies reflects both the strengths and limitations of each algorithm and the practical realities of clinical ML research. For instance, tree-based models and ensembles (such as Random Forest, XGBoost, and CatBoost) are known to be sensitive to class imbalance and often benefit from oversampling or undersampling. In contrast, deep learning models using ClinicalBERT embeddings may be less affected by imbalance due to their capacity to learn complex feature representations, and oversampling in high-dimensional embedding spaces can sometimes introduce artifacts rather than improve performance.
Finally, the decision to use balancing techniques was guided by a primary commitment to clinical realism: in actual deployment, the prevalence of complications will remain low, and models must be evaluated under conditions that reflect this reality. By reporting both balanced and unbalanced outcomes, this study provides a nuanced view of model sensitivity, specificity, and real-world applicability, supporting informed interpretation by both clinical and technical audiences.

2.2.8. Summary of Preprocessing Pipeline

Hence, the preprocessing pipeline comprised data cleaning and imputation, one-hot encoding of categorical variables, extraction and dimensionality reduction of text features, feature scaling, stratified train/test splitting, and optional class balancing. This approach ensured that all ML models were trained and evaluated on consistent, high-quality feature sets, enabling fair and reproducible comparison across algorithms.

2.3. Summary of ML Algorithms for Anesthesia Complication Prediction

A variety of ML algorithms were employed to predict the risk of complications under anesthesia, each with distinct methodological foundations and typical application domains. These algorithms can be grouped into tree-based models, deep neural network (DNN) approaches, and classical algorithms. The following subsections provide a detailed overview of each group, including mathematical descriptions, traditional uses, and an assessment of their suitability for the anesthesia dataset.

2.3.1. Tree-Based Algorithms

Tree-based algorithms are particularly well-suited for structured, tabular data and are capable of modeling non-linear relationships and complex feature interactions. In this study, several variants were implemented, including Random Forest, Gradient-Boosted Trees, LSBoost, RUSBoost, XGBoost, CatBoost, and stacked ensembles.
Random Forest constructs an ensemble of decision trees, each trained on a bootstrap sample of the data. The final prediction is determined by aggregating the predictions of individual trees, typically through majority voting for classification. The mathematical formulation for the ensemble prediction is given by
$\hat{y} = \operatorname{mode}\{ h_t(x) \}_{t=1}^{T}$
where $h_t$ denotes the $t$-th decision tree.
Gradient-boosted tree algorithms, such as LSBoost, RUSBoost, XGBoost, and CatBoost, build an ensemble of decision trees in a sequential manner. Each new tree is trained to correct the errors of the previous ensemble by fitting to the residuals. The general update rule for the ensemble is
$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
where $h_m$ is the $m$-th weak learner and $\gamma_m$ is a step size parameter.
Stacked ensemble methods combine the predictions of multiple base models, such as Random Forest, Gradient-Boosted Trees, and Support Vector Machines, using a meta-learner. The meta-learner is trained to optimize the final prediction based on the outputs of the base models:
$\hat{y} = g\left( h_1(x), h_2(x), \ldots, h_K(x) \right)$
where $g$ denotes the meta-learner and $h_k$ are the base models.
Tree-based models are traditionally used for classification and regression tasks involving tabular data. They are robust to overfitting, especially when the number of trees is large, and provide measures of feature importance. In the context of the anesthesia dataset, these models were applied to engineered features such as TF-IDF representations of clinical notes and one-hot encoded categorical variables. Their suitability depends on the ability to capture relevant patterns in both structured and engineered features, as well as the handling of class imbalance and feature correlations.
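For illustration, the sketch below fits two representative tree-based models on the features assembled in Section 2.2; the hyperparameters shown are illustrative defaults, not the tuned values used in this study.

    # Random Forest and XGBoost on the combined structured + text feature matrix.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    import xgboost as xgb

    rf = RandomForestClassifier(n_estimators=500, random_state=42)
    rf.fit(X_train, y_train)
    print("Random Forest AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

    xgb_clf = xgb.XGBClassifier(
        n_estimators=300, max_depth=4, learning_rate=0.1,
        eval_metric="logloss", random_state=42,
    )
    xgb_clf.fit(X_train_bal, y_train_bal)  # SMOTE-balanced training data from Section 2.2.7
    print("XGBoost AUC:", roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1]))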

2.3.2. Deep Neural Network Algorithms

Deep neural network algorithms, particularly transformer-based models, were employed to leverage the unstructured clinical notes in the anesthesia dataset. These models are capable of capturing complex semantic relationships in text data and can be extended to multimodal settings by incorporating structured features.
ClinicalBERT is a transformer-based language model pre-trained on clinical text. It utilizes self-attention mechanisms to model contextual relationships between words:
$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimensionality of the key vectors.
ClinicalBERT can be fine-tuned for classification tasks using clinical notes as input. In the joint model, ClinicalBERT embeddings are concatenated with tabular features and passed through additional neural network layers. These models are traditionally used for natural language processing tasks in healthcare, such as clinical note classification and information extraction. A hybrid approach was also implemented, where ClinicalBERT was used to generate dense vector representations of the clinical notes, which were then combined with tabular features and classified using XGBoost.
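A sketch of this hybrid embedding pipeline is given below. The publicly available Bio_ClinicalBERT checkpoint is used here as a stand-in for the ClinicalBERT model, and the note column name is an assumption; the resulting matrix would be split and evaluated as in Section 2.2.6.

    # Extract [CLS] embeddings from ClinicalBERT and classify the concatenation of
    # embeddings and tabular features with XGBoost.
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer
    import xgboost as xgb

    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    bert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT").eval()

    def embed_notes(notes, batch_size=16):
        """Return the [CLS] embedding for each clinical note string."""
        chunks = []
        with torch.no_grad():
            for i in range(0, len(notes), batch_size):
                enc = tokenizer(notes[i:i + batch_size], padding=True, truncation=True,
                                max_length=512, return_tensors="pt")
                out = bert(**enc)
                chunks.append(out.last_hidden_state[:, 0, :].cpu().numpy())
        return np.vstack(chunks)

    note_embeddings = embed_notes(df["Preop_Notes"].fillna("").tolist())
    X_multimodal = np.hstack([X_numeric, X_categorical, note_embeddings])
    hybrid_clf = xgb.XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)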
Transformer-based models are powerful for extracting information from unstructured text, but their effectiveness in this context depends on the richness and informativeness of the clinical notes. The brevity and sparsity of the notes in the anesthesia dataset may limit the ability of these models to learn meaningful representations.

2.3.3. Other Algorithms

Several additional algorithms were evaluated as baselines or for comparison, including Naïve Bayes and K-Nearest Neighbors (KNN). These classical ML methods are often used for their simplicity and interpretability.
Naïve Bayes is a probabilistic classifier that assumes conditional independence among features given the class label. The posterior probability is computed as
$P(y \mid x) \propto P(y) \prod_{j=1}^{d} P(x_j \mid y)$
where $x_j$ denotes the $j$-th feature.
Naïve Bayes is traditionally used for high-dimensional, sparse data such as text classification with bag-of-words or TF-IDF features. It is computationally efficient and often effective despite its strong independence assumption. In this study, Naïve Bayes was applied to principal components derived from TF-IDF features of clinical notes.
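A brief sketch of this baseline is shown below; a Gaussian variant is assumed because the PCA components are continuous.

    # Gaussian Naïve Bayes on standardized numerics and PCA-reduced TF-IDF components.
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import roc_auc_score

    nb = GaussianNB()
    nb.fit(X_train, y_train)
    print("Naive Bayes AUC:", roc_auc_score(y_test, nb.predict_proba(X_test)[:, 1]))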
K-Nearest Neighbors is a non-parametric method that predicts the class of a sample based on the majority class among its k nearest neighbors in feature space:
$\hat{y} = \operatorname{mode}\{ y_{(i)} \}_{i=1}^{k}$
where $y_{(i)}$ denotes the class label of the $i$-th nearest neighbor.
KNN is traditionally used for small to moderate-sized datasets with well-scaled features. It is sensitive to the choice of distance metric and the presence of irrelevant or redundant features. In this study, KNN was applied after dimensionality reduction of TF-IDF features and standardization of numeric variables.
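The corresponding KNN baseline can be sketched as follows; k = 5 and the Euclidean metric are illustrative choices, with the final settings selected by cross-validation.

    # K-Nearest Neighbors on standardized, dimensionality-reduced features.
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X_train, y_train)
    print("KNN AUC:", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))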

2.3.4. Modeling Pipeline Overview

The anesthesia complication prediction task leveraged a diverse suite of ML algorithms, each tailored to different aspects of the dataset and problem structure. Tree-based models and ensemble methods, such as Random Forest, Gradient Boosted Trees, and XGBoost, were particularly effective for structured tabular data and engineered features, offering robustness to feature interactions and missing values. Deep neural networks, including ClinicalBERT and multimodal architectures, were employed to extract nuanced information from unstructured clinical notes, capturing semantic patterns that classical models might overlook. Classical algorithms like Naïve Bayes and K-Nearest Neighbors (KNN) served as valuable baselines and, in some cases, delivered competitive performance, especially when combined with dimensionality reduction techniques such as PCA on TF-IDF features.
The choice of algorithm was influenced by several factors: the nature and quality of available features (structured versus unstructured), the balance of outcome classes (complication versus no complication), and the complexity of relationships to be modeled. For instance, ensemble methods and KNN performed well on the tabular and engineered features, while deep learning models were better suited for leveraging the rich information in clinical text. Ultimately, the workflow required careful preprocessing, feature engineering, and model selection to address the challenges posed by the dataset, such as class imbalance and heterogeneous data types.
Figure 3 illustrates the end-to-end experimental workflow adopted in this study. The process begins with loading the raw anesthesia dataset, followed by data cleaning and imputation to handle missing values and ensure data quality. Categorical variables are transformed using one-hot encoding, while unstructured text fields undergo preprocessing steps such as TF-IDF vectorization, dimensionality reduction via PCA, or embedding extraction using ClinicalBERT. The dataset is then split into training and test sets, with the training set further used for k-fold cross-validation to select and tune models. The final model is trained on the full training set and evaluated on the held-out test set using metrics such as AUC, accuracy, ROC curves, and confusion matrices. This modular pipeline ensures reproducibility and allows for systematic comparison of different algorithms and feature sets.
A generalized pseudocode (Algorithm A1, Appendix A) is presented to summarize the core steps of the ML pipeline used for anesthesia complication prediction. This abstraction encompasses data cleaning, feature engineering, dimensionality reduction, train/test splitting, model selection, and evaluation. The pseudocode is designed to be adaptable to various algorithms and feature sets, ensuring reproducibility and systematic comparison across experiments. Implementation-specific details are omitted in favor of a clear, logical sequence of operations.
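To complement Algorithm A1, the compact sketch below strings the main stages together for the structured features in a single scikit-learn pipeline; the file name, column names, and hyperparameter grid are assumptions, and text features would be appended as described in Section 2.2.4.

    # End-to-end sketch: cleaning, preprocessing, stratified split, cross-validated
    # tuning, and evaluation on the held-out test set.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    raw = pd.read_csv("anesthesia_care_net.csv")                           # hypothetical file
    raw = raw.dropna(subset=["Outcome"]).drop(columns=["Complications"])   # leakage guard

    numeric = ["Age", "BMI", "Pain_Level"]
    categorical = ["Gender", "Surgery_Type", "Anesthesia_Type"]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ])
    model = Pipeline([("pre", preprocess), ("clf", RandomForestClassifier(random_state=42))])

    X_tr, X_te, y_tr, y_te = train_test_split(raw[numeric + categorical], raw["Outcome"],
                                              test_size=0.30, stratify=raw["Outcome"],
                                              random_state=42)
    search = GridSearchCV(model, {"clf__n_estimators": [200, 500]}, cv=5, scoring="roc_auc")
    search.fit(X_tr, y_tr)
    print("Held-out test AUC:", roc_auc_score(y_te, search.predict_proba(X_te)[:, 1]))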

3. Results

3.1. Overview of Model Performance

A suite of ML algorithms was evaluated for the prediction of postoperative complications using the processed anesthesia dataset. Despite rigorous preprocessing, feature engineering, and hyperparameter optimization, the overall discriminatory performance of all models was modest, with the highest area under the receiver operating characteristic curve (AUC) reaching 0.644. Table 3 summarizes the test set AUC and accuracy for each major modeling approach.
Table 3 presents a detailed comparison of fourteen distinct ML pipelines, each employing different combinations of feature engineering techniques and algorithmic approaches. The feature engineering methodologies include Term Frequency-Inverse Document Frequency (TF-IDF) vectorization with varying vocabulary sizes (100 and 500 terms), Principal Component Analysis (PCA) for dimensionality reduction, one-hot encoding for categorical variables, and Synthetic Minority Oversampling Technique (SMOTE) for addressing class imbalance. Figure 4 provides a visual representation of the AUC performance across all evaluated algorithms, clearly illustrating the modest discriminatory performance achieved despite the diversity of approaches employed.
Among the classical ML approaches, K-Nearest Neighbors (KNN) achieved the highest predictive performance with an AUC of 0.644 and accuracy of 60.0% when applied to numeric features, one-hot encoded categorical variables, and TF-IDF features reduced via PCA. The Naïve Bayes classifier demonstrated the second-best performance with an AUC of 0.625 and accuracy of 54.4% using the same feature combination.
Tree-based ensemble methods showed moderate performance across various configurations. Random Forest achieved an AUC of 0.589 (56.7% accuracy) with TF-IDF and PCA features, while Extreme Gradient Boosting (XGBoost) reached 0.575 AUC (56.5% accuracy) when combined with TF-IDF (500 terms), one-hot encoding, and SMOTE. Categorical Boosting (CatBoost) performed similarly with an AUC of 0.581 and accuracy of 56.7%. Notably, Random Under Sampling Boosting (RUSBoost) and Least Squares Boosting (LSBoost) both achieved identical modest performance metrics of 0.563 AUC and 50.0% accuracy.
Stacked ensemble approaches, which combine multiple base learners through a meta-learning framework, consistently underperformed with AUCs of 0.456 and accuracies of 44.4%, suggesting that the modest individual model performances could not be effectively leveraged through ensemble aggregation.
The transformer-based approaches utilizing Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT) showed mixed results. The hybrid approach combining ClinicalBERT embeddings with tabular features and XGBoost classification achieved the best performance among deep learning models with an AUC of 0.600 and accuracy of 56.7%. However, fine-tuning ClinicalBERT on clinical notes alone yielded a more modest AUC of 0.539 (52.2% accuracy), while joint fine-tuning of ClinicalBERT with tabular features resulted in the poorest performance with an AUC of 0.450 and accuracy of 44.0%.
The feature engineering strategies revealed that TF-IDF vectorization of clinical notes provided incremental improvements over structured data alone, particularly when combined with PCA for dimensionality reduction to mitigate the curse of dimensionality inherent in high-dimensional sparse text representations. The application of SMOTE for addressing class imbalance showed mixed results, improving some algorithms while having negligible or detrimental effects on others.
These results collectively demonstrate the challenging nature of postoperative complication prediction in the anesthesia domain, where even sophisticated feature engineering and advanced ML techniques yielded only modest discriminatory performance, with all AUC values falling between 0.450 and 0.644.

3.2. Classical ML Algorithms

Among classical algorithms, the K-Nearest Neighbors (KNN) classifier achieved the highest test set AUC of 0.644 and an accuracy of 60.0%. The Naïve Bayes classifier, using principal components derived from TF-IDF features, yielded an AUC of 0.625 and an accuracy of 54.4%. Other classical approaches, such as Random Forest and LSBoost, produced AUCs in the range of 0.56–0.59, with test accuracies between 50.0% and 56.7%.
The receiver operating characteristic (ROC) curve for the best-performing KNN model is shown in Figure 5. This curve demonstrates modest discriminatory ability, with performance only slightly better than random classification. The KNN model used standardized features including numeric variables, one-hot encoded categoricals, and principal components derived from TF-IDF text features.
The confusion matrix for the KNN model, shown in Figure 6, illustrates the classification performance on the test set. The model achieved 60.0% accuracy with the highest AUC (0.644) among all evaluated algorithms. The matrix shows that 64.4% of non-complication cases were correctly classified, while 55.6% of complication cases were correctly identified, demonstrating modest but superior discriminatory ability compared to other approaches.

3.3. Tree-Based Ensemble Methods

Tree-based ensemble models, including LSBoost, RUSBoost, and Random Forest, demonstrated moderate performance. Both LSBoost and RUSBoost achieved an AUC of 0.563 and an accuracy of 50.0%. The Random Forest ensemble slightly outperformed these methods with an AUC of 0.589 and an accuracy of 56.7%. Stacked ensemble models, which combined multiple base learners, did not improve performance and resulted in lower AUCs (0.456) and accuracies (44.4%).
The confusion matrix for the Random Forest ensemble model is shown in Figure 7. This figure demonstrates the model’s moderate sensitivity and specificity for detecting postoperative complications, as well as the challenges in distinguishing between complication and non-complication cases.

3.4. Deep Learning and Transformer-Based Models

Transformer-based models leveraging ClinicalBERT embeddings were evaluated in both unimodal (notes only) and multimodal (notes plus tabular features) configurations. The best transformer-based pipeline, which combined ClinicalBERT embeddings with tabular features and used XGBoost as the classifier, achieved an AUC of 0.600 and an accuracy of 56.7%. Fine-tuning ClinicalBERT on notes alone resulted in an AUC of 0.539 and an accuracy of 52.2%. Joint fine-tuning of ClinicalBERT with tabular heads did not yield improved results (AUC 0.450).
Figure 8 compares the performance of these transformer-based approaches. The hybrid approach combining ClinicalBERT embeddings with tabular features and XGBoost classification achieved the best performance among deep learning models.

3.5. Comparison of Feature Sets and Algorithms

Across all experiments, the inclusion of text features (via TF-IDF or ClinicalBERT) provided incremental gains in predictive performance compared to models using only structured data. However, neither advanced feature engineering nor the use of deep learning architectures resulted in strong predictive power. Ensemble models and KNN classifiers performed comparably or slightly better than more complex approaches, but the overall AUCs remained in the moderate range (0.56–0.64).
Figure 9 presents a feature importance analysis, showing the relative contribution of different feature types across the best-performing models. Text-derived features (TF-IDF and principal components) provided modest improvements over structured clinical variables alone, but no single feature category dominated predictive performance.

3.6. Summary of Findings

Despite extensive experimentation with diverse algorithms and feature sets, the highest test set AUC achieved was 0.644 (KNN), indicating only modest discriminatory ability for predicting postoperative complications in this dataset. These results suggest that the available features, including both structured clinical variables and unstructured notes, may lack sufficient signal for robust risk prediction. Additional data modalities or richer documentation may be required to improve model performance in real-world anesthesia risk prediction tasks.
The distribution of AUC scores across all evaluated models and feature combinations is shown in Figure 10. The histogram demonstrates that most models achieved AUC values between 0.45 and 0.65, with the majority clustering around 0.55–0.60, indicating consistently modest predictive performance across diverse approaches.

4. Discussion

In this study, we evaluated the ability of a wide range of ML algorithms (including tree-based ensembles, classical classifiers, and ClinicalBERT-based transformer models) to predict postoperative complications in anesthesia using a multimodal dataset comprising both structured clinical data and unstructured clinical notes. Despite leveraging advanced feature engineering, careful preprocessing, and explicit prevention of data leakage, the best AUC achieved was 0.64. This modest discriminatory performance suggests that, within the constraints of our dataset, neither model architecture nor feature engineering strategies were able to extract sufficient predictive signal to reliably forecast postoperative complications. The incremental gains observed from incorporating unstructured text features further underscore the limitations of the available data in capturing the complex, multifactorial nature of perioperative risk.
Tree-based ensemble models such as XGBoost and CatBoost slightly outperformed classical algorithms, and ClinicalBERT embeddings provided modest improvements when incorporated into multimodal models. Nonetheless, even with various architectures, including ensemble learners, classical classifiers, and transformer-based models, predictive performance remained modest, consistent with findings in other domains [22].
Several factors contributed to the restricted AUC. First, standard structured variables (e.g., age, BMI, ASA class, procedure type) may lack the granularity to capture subtle risk patterns. Second, the quality and length variability of unstructured notes limited their predictive utility [23,24]. Although pretrained ClinicalBERT models showed potential, meaningful gains emerged only after domain-specific fine-tuning, suggesting that generic embeddings may lack sufficient specificity for perioperative tasks [25].
To mitigate class imbalance, we applied techniques such as SMOTE and RUSBoost. However, the low prevalence of complications remained a barrier, a common issue in clinical ML studies involving rare events [26,27,28]. These findings underscore that high-fidelity risk prediction may require temporally dynamic intraoperative signals, data that are typically missing from static perioperative datasets. Recent advances in large language models (LLMs) have shown promise in automating and streamlining the ML pipeline for clinical studies, reducing the need for extensive manual intervention and domain expertise. These approaches can facilitate more transparent, reproducible, and accessible ML workflows in healthcare [29].
To prevent label leakage, we deliberately excluded the deterministically linked ‘Complications’ feature, prioritizing true generalization over inflated performance [27]. While this decision may have reduced model accuracy, it enhances the clinical relevance and trustworthiness of our findings.
Simple models like Naïve Bayes and K-Nearest Neighbors, though computationally efficient and interpretable, performed poorly, consistent with other domains such as EEG-based seizure detection, where they fail to capture nonlinear relationships [30]. In our case, their limited capacity to model complex perioperative data further supports the need for more expressive architectures. Nevertheless, such models may retain value in resource-limited environments.

4.1. Leakage-Aware ML

The objective of this study was not the maximization of predictive performance, but rather the demonstration of the essential role of clinical expertise in the development of reliable ML models for healthcare. In particular, close collaboration with clinicians is vital for identifying and removing features that act as deterministic proxies for the outcome (such as the ‘Complications’ variable in the present dataset) which, if included, would result in data leakage and artificially inflated performance metrics [31].
By transparently documenting the process of detecting and excluding such sources of leakage, this study emphasizes the necessity of interdisciplinary collaboration to maintain methodological rigor and clinical relevance. The methodology presented here serves as a case study exemplifying how issues such as label leakage and artificially elevated performance can be mitigated through deliberate clinical oversight. Adoption of this approach, which encourages broader clinician engagement in the design, evaluation, and implementation of ML tools in healthcare, may promote the development of models that are both robust and applicable in real-world clinical settings.

4.2. Comparison with Existing Studies

Our findings are consistent with a subset of the published literature in this domain, where several studies have reported similarly modest predictive performance for ML models tasked with forecasting anesthesia-related complications. For example, some recent investigations using structured perioperative data and classical ML approaches have reported AUCs in the range of 0.60–0.75 [17,18,22], aligning closely with our results.
However, it is important to note that other studies in the field have reported strikingly high, and in some cases nearly perfect, performance metrics, with AUCs exceeding 0.90 [32,33,34]. These claims, while impressive on the surface, warrant careful scrutiny. A critical review of the literature reveals that many of these high-performing models may have been trained on all available features in their datasets, including variables that would not be accessible in real-time clinical settings. This practice introduces the risk of data leakage, where the model inadvertently learns from information that is a proxy for the outcome or is only available after the fact, thereby inflating performance metrics in a way that would not generalize to actual clinical practice [35,36].
Unfortunately, a pervasive issue in the field is the lack of sufficient methodological transparency in many published studies. Key details, such as which features were included in the model, how data was split for training, validation and testing, what steps were taken to prevent data leakage, and how the model’s learning dynamics were evaluated, are often omitted or insufficiently described [32]. As a result, it is frequently impossible for readers, reviewers, or other researchers to determine whether the reported performance is genuinely achievable in a real-world clinical context or is an artifact of methodological shortcuts. This lack of transparency not only undermines trust in the reported results but also impedes scientific progress and the safe, effective translation of ML models into clinical care [19].
It is also important to recognize that models with only modest predictive performance can still provide substantial value as exploratory or hypothesis-generating tools. Even when such models are not immediately suitable for direct clinical decision-making at the bedside, they may help to identify potential risk factors, uncover patterns in perioperative data, and inform the design of future studies. The utility of a predictive model is closely tied to its intended application: for instance, a model with high specificity but lower sensitivity may be more appropriate for population-level risk stratification or resource allocation, where minimizing false positives is critical. Conversely, in patient-level applications where early intervention is prioritized, sensitivity may be of greater importance, even at the expense of specificity. Thus, the clinical context and the consequences of false positives and false negatives must guide the interpretation of model performance metrics. In this light, predictive modeling serves an important intermediate role; not only as a means of supporting discovery and understanding of perioperative risk factors, but also as a foundation for iterative improvement and eventual clinical translation. The findings of the present study underscore the need for careful alignment between model evaluation metrics and the specific medical objectives at hand, as well as the importance of transparency in reporting to facilitate meaningful comparison and application across studies.

4.3. Study Contributions

This study makes several explicit contributions to the field of perioperative ML and anesthesia complication prediction:
  • Leakage-Aware Experimental Design: We present a transparent, leakage-aware workflow for preoperative risk prediction that explicitly identifies and excludes deterministically linked features (such as the ‘Complications’ column) and proxies unavailable at prediction time. This approach directly addresses a pervasive methodological flaw in the literature, where inclusion of such features can lead to artificially inflated performance metrics and non-generalizable models.
  • Clinician-Guided Feature Selection and Oversight: The study demonstrates the essential role of clinician engagement throughout the ML pipeline, from feature selection to model interpretation. By systematically involving clinical experts, we ensure that only information available at the time of prediction is used, and we illustrate how clinical oversight alters both model inputs and outcomes compared to purely engineering-driven approaches.
  • Comparative Evaluation of Text Representations: We systematically compare classical text representations (TF-IDF with PCA) and domain-adapted transformer embeddings (ClinicalBERT) under strict leakage control. This head-to-head evaluation clarifies the relative benefits and limitations of each approach for extracting predictive signal from preoperative anesthesia notes.
  • Reproducible, Educational Benchmark: Rather than aiming for maximal predictive performance, we provide an explicit, reproducible baseline (including models with and without SMOTE, models with and without PCA-based dimensionality reduction, standard performance metrics (AUC, accuracy, ROC, confusion matrix), and detailed workflow diagrams) intended as an educational benchmark for future perioperative ML studies. Our results highlight the challenges of real-world prediction and the necessity of rigorous methodology over superficial accuracy gains.
  • Promotion of Methodological Transparency: By thoroughly documenting our preprocessing, feature engineering, model selection, and evaluation steps, we set a standard for methodological transparency. This enables meaningful comparison with future studies and supports the development of clinically actionable, trustworthy ML tools in perioperative care.
Collectively, these contributions serve as a methodological case study for the broader medical ML community, emphasizing that robust, clinically relevant prediction requires not only advanced algorithms and rich data, but also early and sustained clinical collaboration and close attention to data leakage and feature provenance.

4.4. The Importance of Clinical Guidance in ML Model Development

The challenges highlighted above underscore the critical importance of involving clinicians throughout the development lifecycle of ML models for healthcare applications. While technical expertise in data science and algorithm development is essential, it is equally vital that model development is guided by medical professionals who understand the nuances of clinical workflows, the realities of data availability, and the practical needs of end-users [19,37]. Without meaningful clinician involvement, there is a substantial risk that ML models will be optimized solely for technical performance metrics (e.g., accuracy or AUC) without regard for their clinical relevance or utility. This phenomenon has been observed in multiple domains of healthcare, where models that achieve high accuracy in retrospective or in silico evaluations fail to deliver meaningful improvements in patient outcomes or decision-making when deployed in practice [32,33,38]. In some cases, models have been trained on features that are not available at the point of care, or on data that does not reflect real-world clinical scenarios, rendering them unusable or even potentially harmful in actual clinical settings [39].
Clinician involvement is vital not only for ensuring that models are trained on appropriate, accessible features, but also for defining clinically meaningful outcomes, interpreting model predictions, and integrating decision support tools into existing workflows [39]. Furthermore, clinician engagement fosters trust, adoption, and iterative refinement of ML tools, increasing the likelihood that these technologies will ultimately benefit patients and providers [39]. For example, incorporating clinical utility metrics into ML model evaluation marks a critical shift from purely statistical validation to a more clinically grounded assessment. While traditional performance measures such as accuracy and AUC offer valuable information about model discrimination, they often fail to reflect the nuanced trade-offs involved in real-world medical decision-making [40].

4.5. Limitations and Future Directions

The present study is not without limitations. The modest predictive performance observed may reflect inherent limitations in the available dataset, including the granularity and quality of both structured and unstructured features. It is possible that richer data sources, more granular temporal information, or additional modalities (such as imaging or real-time physiological waveforms) could yield stronger predictive signal. Additionally, while we took explicit steps to prevent data leakage and ensure methodological rigor, the generalizability of our findings to other institutions or patient populations remains to be established.
A detailed analysis of feature importance would substantially enhance the interpretability and clinical relevance of the findings presented here. Although this study briefly reported the comparative contributions of different feature categories, a more in-depth investigation of individual predictors (such as demographic variables or features extracted from clinical notes) may identify clinically meaningful patterns worthy of further exploration. Integrating detailed feature importance analyses in future research will be vital for clarifying the specific variables that drive risk prediction and for guiding both hypothesis generation and the design of subsequent studies.
Transformer-based models demonstrated limited predictive utility in this context, an outcome attributed primarily to the brevity and heterogeneity of the available clinical notes. Improvements in model performance may be achievable with more detailed and consistently documented clinical narratives, as well as through fine-tuning on domain-specific text datasets. Furthermore, the incorporation of dynamic intraoperative data streams (such as real-time physiologic monitoring) and the use of external validation cohorts are anticipated to significantly enhance the robustness, generalizability, and clinical applicability of future models. It should be emphasized, however, that the principal aim of the present study was to illustrate the critical impact of data leakage on model development and evaluation, underscoring the necessity of rigorous methodological safeguards to ensure clinically meaningful results.
Future research should incorporate granular intraoperative data, such as continuous physiologic measurements, waveform trends, and laboratory results, to improve real-time responsiveness. For example, inaccuracies in pulse wave analysis due to underdamped waveforms can distort cardiac output estimates, potentially affecting fluid or vasopressor management, especially in vulnerable populations [41]. Integrating smart filters or correction algorithms into monitoring systems could mitigate these risks. Similarly, adaptive control systems such as closed-loop PI and fractional PI controllers show promise in maintaining anesthetic depth despite patient variability, suggesting that ML-informed control theory may enhance agent titration [12].
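To make the control-theoretic idea concrete, the toy simulation below runs a discrete-time PI loop that titrates an infusion toward a target depth-of-anesthesia index. The plant model, gains, and scale are invented purely for exposition and have no clinical validity.

```python
# Toy illustration only, not a clinical controller: a discrete-time PI loop nudging a
# simplified first-order "depth of anesthesia" index toward a target of 50.
def simulate_pi(target=50.0, steps=120, kp=0.3, ki=0.05, dt=1.0):
    index, integral = 95.0, 0.0          # awake baseline, empty error integral
    trace = []
    for _ in range(steps):
        error = index - target           # index too high -> deliver more agent
        integral += error * dt
        infusion = max(0.0, kp * error + ki * integral)
        # First-order toy dynamics: agent lowers the index, natural recovery raises it.
        index += dt * (-0.5 * infusion + 0.05 * (95.0 - index))
        trace.append((infusion, index))
    return trace

if __name__ == "__main__":
    for t, (u, x) in enumerate(simulate_pi()):
        if t % 20 == 0:
            print(f"t={t:3d}  infusion={u:5.2f}  index={x:5.1f}")
```

The integral term removes the steady-state offset that a proportional-only controller would leave; an ML layer could, in principle, adapt the gains to patient-specific response, which is the direction suggested in [12].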
Lastly, to support external generalizability and mitigate overfitting, future studies should adopt rigorous validation frameworks. Methods such as registered models and adaptive sample splitting promote reproducibility and credibility of predictive tools across healthcare settings [42].
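Nested cross-validation is one widely used safeguard of this kind: hyperparameters are tuned in an inner loop while performance is reported only on outer folds that the search never saw. The sketch below is a generic illustration under assumed synthetic data and is not the registered-model or adaptive sample-splitting procedure described in [42].

```python
# Minimal sketch of nested cross-validation to curb optimistic tuning bias.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, weights=[0.6, 0.4], random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # hyperparameter search
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [5, 15, 25]},
    scoring="roc_auc", cv=inner,
)
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
print(f"nested-CV AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```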

5. Conclusions

Our comprehensive evaluation of ML models for predicting anesthesia-related postoperative complications reveals a distinct gap between the potential of modern ML methods and their current real-world utility in this domain. Unlike fields such as medical image recognition, where ML has achieved outstanding accuracy and robust clinical integration, perioperative complication prediction remains an unsolved challenge. Even with a multimodal approach that leverages both structured clinical data and advanced natural language processing of clinical notes, our best model achieved only modest performance. This result is not unique to our work: while some published studies in anesthesiology and perioperative care report similarly modest discrimination, others describe near-perfect performance metrics. It is vital to recognize that such impressive results often stem from methodological flaws, most notably inadvertent or unreported data leakage. Many models are trained on features unavailable at the time of prediction, or on features that are perfect proxies for the outcome (as we explicitly identified and excluded in our own workflow), artificially inflating their performance. Worryingly, these methodological details are frequently omitted from publications, making it impossible for readers to discern whether such models would offer any real clinical value.
Hence, unlike in radiology or pathology, where ML has become a transformative companion to human expertise, the field of anesthesia complication prediction is still in its infancy. The path forward requires deeper interdisciplinary collaboration, the development and sharing of richer, higher-quality datasets, and a commitment to clinical validity over mere mathematical performance. Ultimately, this study should be interpreted as a methodological case study rather than an attempt to produce a deployable predictive tool. By demonstrating how clinical oversight can prevent data leakage and reshape the design of ML experiments, we aim to educate both developers and readers about the risks of inflated performance and the importance of clinical applicability in feature selection. The findings reinforce that advancing perioperative prediction requires not only stronger models and richer data, but also early and sustained engagement of clinicians in the development process.

Author Contributions

Conceptualization, A.A. and M.T.; methodology, A.A.; software, A.A.; validation, A.A., K.E. and K.N.; formal analysis, A.A.; investigation, A.A.; resources, A.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A.; visualization, A.A.; supervision, M.T.; project administration, M.T.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study were derived from the following resources available in the public domain: the Personalized Anesthesia Management Dataset repository at https://www.kaggle.com/datasets/programmer3/personalized-anesthesia-management-dataset, accessed on 4 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
AUC: Area under the receiver operating characteristic curve
BMI: Body Mass Index
CatBoost: Categorical Boosting
CSV: Comma-Separated Values
EHR: Electronic Health Record
KNN: K-Nearest Neighbors
LLM: Large Language Model
LSBoost: Least Squares Boosting
ML: Machine Learning
NLP: Natural Language Processing
PCA: Principal Component Analysis
RF: Random Forest
ROC: Receiver Operating Characteristic
RUSBoost: Random Under Sampling Boosting
SMOTE: Synthetic Minority Oversampling Technique
TF-IDF: Term Frequency–Inverse Document Frequency
XGBoost: Extreme Gradient Boosting

Appendix A

Algorithm A1 Generalized ML Pipeline for Anesthesia Complication Prediction
Require: Raw dataset with structured and unstructured features
Ensure: Trained model and evaluation metrics
 1: Data Cleaning
 2: Remove rows with missing outcome labels
 3: Drop non-predictive columns (e.g., patient identifiers, free-text complication summaries)
 4: Encode outcome variable as categorical
 5: Numeric Feature Processing
 6: for each numeric feature do
 7:     Convert to numeric type if necessary
 8:     Impute missing values with median
 9:     (Optional) Standardize to zero mean and unit variance
10: end for
11: Categorical Feature Encoding
12: for each categorical feature do
13:     Assign “Unknown” to missing categories
14:     Apply one-hot encoding
15: end for
16: Text Feature Engineering
17: Tokenize and clean clinical notes
18: Remove stop words
19: Compute TF-IDF matrix
20: Select top N terms by IDF
21: (Optional) Apply PCA to reduce dimensionality
22: (Alternative) Extract embeddings (e.g., ClinicalBERT) for deep learning models
23: Feature Matrix Construction
24: Concatenate numeric, one-hot encoded, and text-derived features
25: Train/Test Split
26: Split data into training and test sets (e.g., 70/30 stratified by outcome)
27: Model Selection and Training
28: Perform k-fold cross-validation on training set for hyperparameter tuning
29: Train final model on full training set
30: Evaluation
31: Apply trained model to test set
32: Compute evaluation metrics (AUC, accuracy, ROC curve, confusion matrix)
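A minimal executable rendering of Algorithm A1 in Python with scikit-learn is sketched below. Column names follow Table 1; the file name, the number of TF-IDF terms and components, and the KNN grid are assumptions chosen for illustration, and TruncatedSVD stands in for PCA because it accepts sparse TF-IDF matrices.

```python
# Illustrative sketch of Algorithm A1 (assumed settings, not the study's exact code).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Steps 1-4: drop unlabeled rows, identifiers, and leakage-prone post-hoc fields.
df = pd.read_csv("anesthesia.csv")                           # hypothetical file name
df = df.dropna(subset=["Outcome"])
df = df.drop(columns=["PatientID", "Complications", "PostoperativeNotes"])
df["PreoperativeNotes"] = df["PreoperativeNotes"].fillna("")
y = df.pop("Outcome").astype(int)

numeric = ["Age", "BMI"]
categorical = ["Gender", "SurgeryType", "AnesthesiaType"]

# Steps 5-24: numeric imputation/scaling, "Unknown" + one-hot for categoricals,
# TF-IDF with dimensionality reduction for notes, then concatenation.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
    ("cat", make_pipeline(SimpleImputer(strategy="constant", fill_value="Unknown"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
    ("txt", make_pipeline(TfidfVectorizer(stop_words="english", max_features=500),
                          TruncatedSVD(n_components=20, random_state=0)),
     "PreoperativeNotes"),
])
model = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

# Steps 25-29: stratified 70/30 split, cross-validated choice of k, refit on training set.
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.3, stratify=y, random_state=0)
search = GridSearchCV(model, {"knn__n_neighbors": [5, 15, 25]}, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

# Steps 30-32: evaluation on the untouched test set.
prob = search.predict_proba(X_te)[:, 1]
print(f"AUC = {roc_auc_score(y_te, prob):.3f}, "
      f"accuracy = {accuracy_score(y_te, (prob >= 0.5).astype(int)):.3f}")
```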

References

  1. Hassan, A.M.; Rajesh, A.; Asaad, M.; Nelson, J.A.; Coert, J.H.; Mehrara, B.J.; Butler, C.E. Artificial Intelligence and Machine Learning in Prediction of Surgical Complications: Current State, Applications, and Implications. Am. Surg. 2022, 89, 25–30. [Google Scholar] [CrossRef] [PubMed]
  2. Fritz, B.A.; King, C.R.; Abdelhack, M.; Chen, Y.; Kronzer, A.; Abraham, J.; Tripathi, S.; Ben Abdallah, A.; Kannampallil, T.; Budelier, T.P.; et al. Effect of machine learning models on clinician prediction of postoperative complications: The Perioperative ORACLE randomised clinical trial. Br. J. Anaesth. 2024, 133, 1042–1050. [Google Scholar] [CrossRef]
  3. Si, Y.; Wang, J.; Xu, H.; Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 2019, 26, 1297–1304. [Google Scholar] [CrossRef] [PubMed]
  4. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar] [CrossRef]
  5. Ahn, J.M.; Kim, J.; Kim, K. Ensemble Machine Learning of Gradient Boosting (XGBoost, LightGBM, CatBoost) and Attention-Based CNN-LSTM for Harmful Algal Blooms Forecasting. Toxins 2023, 15, 608. [Google Scholar] [CrossRef] [PubMed]
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  7. Nerella, S.; Bandyopadhyay, S.; Zhang, J.; Contreras, M.; Siegel, S.; Bumin, A.; Silva, B.; Sena, J.; Shickel, B.; Bihorac, A.; et al. Transformers and large language models in healthcare: A review. Artif. Intell. Med. 2024, 154, 102900. [Google Scholar] [CrossRef]
  8. Sendak, M.; Gao, M.; Nichols, M.; Lin, A.; Balu, S. Machine Learning in Health Care: A Critical Appraisal of Challenges and Opportunities. eGEMs 2019, 7, 1. [Google Scholar] [CrossRef]
  9. Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in data mining: Formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 2012, 6, 1–21. [Google Scholar] [CrossRef]
  10. Perets, O.; Stagno, E.; Yehuda, E.B.; McNichol, M.; Celi, L.A.; Rappoport, N.; Dorotic, M. Inherent Bias in Electronic Health Records: A Scoping Review of Sources of Bias. medRxiv 2024. [Google Scholar] [CrossRef]
  11. Coeckelenbergh, S.; Boelefahr, S.; Alexander, B.; Perrin, L.; Rinehart, J.; Joosten, A.; Barvais, L. Closed-loop anesthesia: Foundations and applications in contemporary perioperative medicine. J. Clin. Monit. Comput. 2024, 38, 487–504. [Google Scholar] [CrossRef]
  12. Coman, S.; Iosif, D. AnesthesiaGUIDE: A MATLAB tool to control the anesthesia. Appl. Sci. 2021, 4, 3. [Google Scholar] [CrossRef]
  13. Qiu, X.; Hu, S.; Dong, S.; Sun, H. Construction of an automated machine learning-based predictive model for postoperative pulmonary complications risk in non-small cell lung cancer patients undergoing thoracoscopic surgery. PLoS ONE 2025, 20, e0333413. [Google Scholar] [CrossRef]
  14. Lin, Z.; Yan, M.; Chen, H.; Wei, S.; Li, Y.; Jian, J. Development and validation of a machine learning model to predict postoperative complications following radical gastrectomy for gastric cancer. Front. Oncol. 2025, 15, 1606938. [Google Scholar] [CrossRef] [PubMed]
  15. Glebov, M.; Lazebnik, T.; Katsin, M.; Orkin, B.; Berkenstadt, H.; Bunimovich-Mendrazitsky, S. Predicting postoperative nausea and vomiting using machine learning: A model development and validation study. BMC Anesthesiol. 2025, 25, 135. [Google Scholar] [CrossRef]
  16. Hua, C.; Chu, Y.; Zhou, M.; Ye, J.; Xu, X. Predictive effect of postoperative recovery in general anesthesia patients using interpretable models based on swarm intelligence machine learning. Front. Physiol. 2025, 16, 1565548. [Google Scholar] [CrossRef]
  17. Chen, M.; Zhang, D. Machine learning-based prediction of post-induction hypotension: Identifying risk factors and enhancing anesthesia management. BMC Med. Inform. Decis. Mak. 2025, 25, 96. [Google Scholar] [CrossRef]
  18. Tsai, F.F.; Chang, Y.C.; Chiu, Y.W.; Sheu, B.C.; Hsu, M.H.; Yeh, H.M. Machine Learning Model for Anesthetic Risk Stratification for Gynecologic and Obstetric Patients: Cross-Sectional Study Outlining a Novel Approach for Early Detection. JMIR Form. Res. 2024, 8, e54097. [Google Scholar] [CrossRef] [PubMed]
  19. Arina, P.; Kaczorek, M.R.; Hofmaenner, D.A.; Pisciotta, W.; Refinetti, P.; Singer, M.; Mazomenos, E.B.; Whittle, J. Prediction of Complications and Prognostication in Perioperative Medicine: A Systematic Review and PROBAST Assessment of Machine Learning Tools. Anesthesiology 2023, 140, 85–101. [Google Scholar] [CrossRef]
  20. Zaki, H.A.; Elmelliti, H.; Shaban, E.E.; Shaban, A.; Shaban, A.; Elgassim, M.; Shallik, N. Comprehensive systematic review and meta-analysis: Evaluating artificial intelligence (AI) effectiveness and integration obstacles within anesthesiology. J. Emerg. Med. Trauma Acute Care 2025, 2025, 22. [Google Scholar] [CrossRef]
  21. Mehta, D.; Gonzalez, X.T.; Huang, G.; Abraham, J. Machine learning-augmented interventions in perioperative care: A systematic review and meta-analysis. Br. J. Anaesth. 2024, 133, 1159–1172. [Google Scholar] [CrossRef] [PubMed]
  22. Sevakula, R.K.; Au-Yeung, W.M.; Singh, J.P.; Heist, E.K.; Isselbacher, E.M.; Armoundas, A.A. State-of-the-Art Machine Learning Techniques Aiming to Improve Patient Outcomes Pertaining to the Cardiovascular System. J. Am. Heart Assoc. 2020, 9, e013924. [Google Scholar] [CrossRef] [PubMed]
  23. Melton, G.B.; Hripcsak, G. Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries. J. Am. Med. Inform. Assoc. 2005, 12, 448–457. [Google Scholar] [CrossRef]
  24. Voss, R.W.; Schmidt, T.D.; Weiskopf, N.; Marino, M.; Dorr, D.A.; Huguet, N.; Warren, N.; Valenzuela, S.; O’Malley, J.; Quiñones, A.R. Comparing ascertainment of chronic condition status with problem lists versus encounter diagnoses from electronic health records. J. Am. Med. Inform. Assoc. 2022, 29, 770–778. [Google Scholar] [CrossRef]
  25. Alba, C.; Xue, B.; Abraham, J.; Kannampallil, T.; Lu, C. The foundational capabilities of large language models in predicting postoperative risks using clinical notes. Npj Digit. Med. 2025, 8, 95. [Google Scholar] [CrossRef]
  26. Mendez, J.A.; Leon, A.; Marrero, A.; Gonzalez-Cava, J.M.; Reboso, J.A.; Estevez, J.I.; Gomez-Gonzalez, J.F. Improving the anesthetic process by a fuzzy rule based medical decision system. Artif. Intell. Med. 2018, 84, 159–170. [Google Scholar] [CrossRef] [PubMed]
  27. Hashimoto, D.A.; Witkowski, E.; Gao, L.; Meireles, O.; Rosman, G. Artificial Intelligence in Anesthesiology: Current Techniques, Clinical Applications, and Limitations. Anesthesiology 2020, 132, 379–394. [Google Scholar] [CrossRef]
  28. Xu, Y.; Foryciarz, A.; Steinberg, E.; Shah, N.H. Clinical utility gains from incorporating comorbidity and geographic location information into risk estimation equations for atherosclerotic cardiovascular disease. J. Am. Med. Inform. Assoc. 2023, 30, 878–887. [Google Scholar] [CrossRef]
  29. Tayebi Arasteh, S.; Han, T.; Lotfinia, M.; Kuhl, C.; Kather, J.N.; Truhn, D.; Nebelung, S. Large language models streamline automated machine learning for clinical studies. Nat. Commun. 2024, 15, 1603. [Google Scholar] [CrossRef] [PubMed]
  30. Usman, S.M.; Usman, M.; Fong, S. Epileptic Seizures Prediction Using Machine Learning Methods. Comput. Math. Methods Med. 2017, 2017, 9074759. [Google Scholar] [CrossRef]
  31. Toma, M. AI-Assisted Medical Diagnostics: A Clinical Guide to Next-Generation Diagnostics; Dawning Research Press: Old Westbury, NY, USA, 2025; Available online: https://openlibrary.org/works/OL44048041W/ (accessed on 4 October 2025).
  32. Bellini, V.; Valente, M.; Bertorelli, G.; Pifferi, B.; Craca, M.; Mordonini, M.; Lombardo, G.; Bottani, E.; Del Rio, P.; Bignami, E. Machine learning in perioperative medicine: A systematic review. J. Anesth. Analg. Crit. Care 2022, 2, 2. [Google Scholar] [CrossRef]
  33. Zhang, Z.; Duan, Y.; Lin, J.; Luo, W.; Lin, L.; Gao, Z. Artificial intelligence in anesthesia: Insights from the 2024 Nobel Prize in Physics. Anesthesiol. Perioper. Sci. 2025, 3, 5. [Google Scholar] [CrossRef]
  34. Xu, P. Multi-layered data framework for enhancing postoperative outcomes and anaesthesia management through natural language processing. SLAS Technol. 2025, 32, 100294. [Google Scholar] [CrossRef]
  35. Mahajan, A.; Esper, S.; Oo, T.H.; McKibben, J.; Garver, M.; Artman, J.; Klahre, C.; Ryan, J.; Sadhasivam, S.; Holder-Murray, J.; et al. Development and Validation of a Machine Learning Model to Identify Patients Before Surgery at High Risk for Postoperative Adverse Events. JAMA Netw. Open 2023, 6, e2322285. [Google Scholar] [CrossRef]
  36. Starcke, J.; Spadafora, J.; Spadafora, J.; Spadafora, P.; Toma, M. The Effect of Data Leakage and Feature Selection on Machine Learning Performance for Early Parkinson’s Disease Detection. Bioengineering 2025, 12, 845. [Google Scholar] [CrossRef]
  37. Ng, F.Y.C.; Thirunavukarasu, A.J.; Cheng, H.; Tan, T.F.; Gutierrez, L.; Lan, Y.; Ong, J.C.L.; Chong, Y.S.; Ngiam, K.Y.; Ho, D.; et al. Artificial intelligence education: An evidence-based medicine approach for consumers, translators, and developers. Cell Rep. Med. 2023, 4, 101230. [Google Scholar] [CrossRef]
  38. Nasef, D.; Nasef, D.; Sher, M.; Toma, M. A Standardized Validation Framework for Clinically Actionable Healthcare Machine Learning with Knee Osteoarthritis Grading as a Case Study. Algorithms 2025, 18, 343. [Google Scholar] [CrossRef]
  39. Kelly, C.J.; Karthikesalingam, A.; Suleyman, M.; Corrado, G.; King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019, 17, 195. [Google Scholar] [CrossRef]
  40. Sher, M.; Sharma, R.; Remyes, D.; Nasef, D.; Nasef, D.; Toma, M. Stratified Multisource Optical Coherence Tomography Integration and Cross-Pathology Validation Framework for Automated Retinal Diagnostics. Appl. Sci. 2025, 15, 4985. [Google Scholar] [CrossRef]
  41. Foti, L.; Michard, F.; Villa, G.; Ricci, Z.; Romagnoli, S. The impact of arterial pressure waveform underdamping and resonance filters on cardiac output measurements with pulse wave analysis. Br. J. Anaesth. 2022, 129, e6–e8. [Google Scholar] [CrossRef] [PubMed]
  42. Gallitto, G.; Englert, R.; Kincses, B.; Kotikalapudi, R.; Li, J.; Hoffschlag, K.; Bingel, U.; Spisak, T. External validation of machine learning models—registered models and adaptive sample splitting. GigaScience 2025, 14, giaf036. [Google Scholar] [CrossRef]
Figure 1. Temporal leakage in diagnosis prediction models. Valid training features (green, left) are available before diagnosis, while post-diagnosis data (red, right) represents temporal leakage when used in ML training, artificially inflating model performance that cannot be replicated in real-world deployment where such future information is unavailable.
Figure 2. Illustration of the deterministic relationship between ‘Complications’ and ‘Outcome’. Including ‘Complications’ as a feature in model training (orange path) leads to data leakage and artificially high accuracy, as the model can trivially infer the outcome. Excluding ‘Complications’ (green path) ensures valid, clinically meaningful prediction.
Figure 3. Overview of the experimental workflow for anesthesia complication prediction. The pipeline begins with raw data ingestion, followed by data cleaning, imputation, and feature engineering steps such as one-hot encoding for categorical variables and advanced text processing (TF-IDF, PCA, ClinicalBERT embeddings) for clinical notes. The dataset is split into training and test sets, with k-fold cross-validation performed on the training set for robust model selection and hyperparameter tuning. The final model is trained on the entire training set and evaluated on the test set using standard metrics (AUC, accuracy, ROC, confusion matrix). This workflow ensures systematic preprocessing, fair model comparison, and reproducible evaluation across diverse ML algorithms and feature sets.
Figure 4. Comparison of test set AUC performance across different ML algorithms and feature combinations. The K-Nearest Neighbors (KNN) algorithm achieved the highest AUC of 0.644 (highlighted in green), followed by Naïve Bayes with 0.625. Despite diverse approaches including ensemble methods and deep learning, all models showed modest discriminatory performance, with most achieving AUCs between 0.45 and 0.64.
Figure 5. ROC curve for the best-performing K-Nearest Neighbors (KNN) model (AUC = 0.644). The curve demonstrates modest discriminatory ability, with performance only slightly better than random classification. The KNN model used standardized features including numeric variables, one-hot encoded categoricals, and principal components derived from TF-IDF text features.
Figure 6. Confusion matrix for the best-performing K-Nearest Neighbors (KNN) model showing classification performance on the test set. The model achieved 60.0% accuracy with the highest AUC (0.644) among all evaluated algorithms. The matrix shows that 64.4% of non-complication cases were correctly classified, while 55.6% of complication cases were correctly identified, demonstrating modest but superior discriminatory ability compared to other approaches.
Figure 7. Confusion matrix for the Random Forest ensemble model showing classification performance on the test set. The model achieved 56.7% accuracy with moderate sensitivity and specificity for detecting postoperative complications. The confusion matrix reveals challenges in distinguishing between complication and non-complication cases.
Figure 8. Performance comparison of transformer-based approaches including unimodal ClinicalBERT (notes only), multimodal ClinicalBERT with XGBoost, and joint fine-tuning architectures. The hybrid approach combining ClinicalBERT embeddings with tabular features and XGBoost classification achieved the best performance among deep learning models (AUC = 0.600).
Figure 9. Feature importance analysis showing the relative contribution of different feature types across the best-performing models. Text-derived features (TF-IDF and principal components) provided modest improvements over structured clinical variables alone, but no single feature category dominated predictive performance.
Figure 10. Distribution of AUC scores across all evaluated models and feature combinations. The histogram shows that most models achieved AUC values between 0.45 and 0.65, with the majority clustering around 0.55–0.60, indicating consistently modest predictive performance across diverse approaches.
Table 1. Summary of dataset fields and attributes.
Field | Description | Data Type | Possible Values/Notes
PatientID | Unique patient identifier | Integer |
Age | Age of patient (years) | Integer |
Gender | Gender of patient | String | Male, Female
BMI | Body Mass Index | Integer |
SurgeryType | Type of surgery | String | Cardiovascular, Orthopedic, Neurological, Cosmetic
SurgeryDuration | Duration of surgery | String | e.g., “120 min”, “180 min”
AnesthesiaType | Type of anesthesia | String | General, Local
PreoperativeNotes | Pre-surgery clinical notes | String | Unstructured text
PostoperativeNotes | Post-surgery clinical notes | String | Unstructured text
PainLevel | Postoperative pain level (1–10) | Integer | 1 to 10
Complications | Postoperative complications | String | None, Nausea, mild bleeding, Respiratory distress, Delayed recovery
Outcome | Complication outcome label | Integer | 0 (No complications), 1 (Complications present)
Table 2. Overview of class balancing configurations for each model/feature set.
Model/Feature Set | With SMOTE | Without SMOTE | With RUSBoost | Without RUSBoost
XGBoost (TF-IDF+One-hot) | ××
Random Forest (TF-IDF+PCA) | ××
Random Forest (TF-IDF+One-hot) | ××
CatBoost (TF-IDF+One-hot) | ××
RUSBoost (TF-IDF+PCA) | ××
KNN (Numeric+One-hot+TF-IDF+PCA) | ××
ClinicalBERT+Tabular+SMOTE XGBoost | ××
Table 3. Summary of test set performance for major ML pipelines.
Features | Algorithm | Test AUC/Accuracy
TF-IDF (500) + tabular | LogitBoost | 0.563/50.0%
TF-IDF + PCA | RUSBoost | 0.563/50.0%
Numeric + One-hot + TF-IDF + PCA | Naïve Bayes | 0.625/54.4%
TF-IDF + PCA | Random Forest | 0.589/56.7%
TF-IDF (100) + PCA | LSBoost | 0.563/50.0%
TF-IDF + PCA + tabular | Stacked Ensemble | 0.456/44.4%
Numeric + One-hot + TF-IDF + PCA | KNN | 0.644/60.0%
TF-IDF (500) + One-hot + SMOTE | XGBoost | 0.575/56.5%
TF-IDF + One-hot + SMOTE | Random Forest | 0.556/56.7%
TF-IDF + One-hot + SMOTE | CatBoost | 0.581/56.7%
TF-IDF + One-hot + SMOTE | Stacked Ensemble | 0.456/44.4%
ClinicalBERT + tabular + SMOTE | XGBoost | 0.600/56.7%
ClinicalBERT (notes only) | Transformers | 0.539/52.2%
Joint ClinicalBERT + tabular | Transformers | 0.450/44.0%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
