Prediction of Postoperative Pulmonary Edema Risk Using Machine Learning

Postoperative pulmonary edema (PPE) is a well-known postoperative complication. We hypothesized that a machine learning model could predict PPE risk using pre- and intraoperative data, thereby improving postoperative management. This retrospective study analyzed the medical records of patients aged > 18 years who underwent surgery between January 2011 and November 2021 at five South Korean hospitals. Data from four hospitals (n = 221,908) were used as the training dataset, whereas data from the remaining hospital (n = 34,991) were used as the test dataset. The machine learning algorithms used were extreme gradient boosting, light-gradient boosting machine, multilayer perceptron, logistic regression, and balanced random forest (BRF). The prediction abilities of the machine learning models were assessed using the area under the receiver operating characteristic curve, the average precision of the precision-recall curve, precision, recall, f1 score, and accuracy; feature importance was also evaluated. PPE occurred in 3584 (1.6%) and 1896 (5.4%) patients in the training and test sets, respectively. The BRF model exhibited the best performance (area under the receiver operating characteristic curve: 0.91, 95% confidence interval: 0.84–0.98). However, its precision and f1 score were low. The five major features were arterial line monitoring, American Society of Anesthesiologists physical status, urine output, age, and Foley catheter status. Machine learning models (e.g., BRF) could predict PPE risk and improve clinical decision-making, thereby enhancing postoperative management.


Introduction
Postoperative pulmonary edema (PPE) is a well-known complication with multiple possible causes [1]. Preexisting cardiac disease, including heart failure, is the most common cause of PPE. In such patients, fluid overload results in increased hydrostatic pressure and worsening left ventricular function [2]. Regardless of preexisting heart disease, fluid overload itself can cause PPE. In particular, excessive postoperative fluid administration and transfusions increase the risk of PPE [1,3]. Neurogenic pulmonary edema is another potential cause of PPE [4]. Although neurogenic pulmonary edema is sometimes regarded as a form of acute respiratory distress syndrome, its pathophysiology and prognosis differ from those of acute respiratory distress syndrome [4,5]. PPE can also be caused by anaphylaxis, negative pressure, and acute lung injury [1,6].
It is often difficult to determine the cause of PPE during its early stages, particularly in patients with overlapping etiologies [1,6,7]. Patients at high risk of PPE therefore need to be identified to allow prevention and early treatment. Several studies have reported the causes and risk factors for PPE, but early diagnosis and management remain difficult, especially in patients with overlapping etiologies or uncertain causes [1–10].
Advances in computing have enhanced several key areas of clinical research, and artificial intelligence-based methods may have additional applications. Machine learning (ML) systems are widely used in clinical research to analyze big data. ML models have been reported to perform better than traditional scoring systems when predicting various clinical conditions [11–13]. They have also been successfully used to predict postoperative complications [14–19]. However, no ML model for the prediction of PPE has been reported. In the present study, we hypothesized that ML could predict PPE risk with good performance, and we developed ML models to test this hypothesis.

Data Collection
This retrospective cohort study protocol was approved by the Clinical Research Ethics Committee of Chuncheon Sacred Heart Hospital, Hallym University. The need for informed consent was waived because of the retrospective study design. The medical records of patients treated between 1 January 2011 and 15 November 2021 were obtained from the clinical data warehouses of five hospitals affiliated with Hallym University Medical Center. The hospitals were located in Seoul (Kangnam Sacred Heart Hospital and Hangang Sacred Heart Hospital), Gyeonggi Province (Hallym University Sacred Heart Hospital and Dongtan Sacred Heart Hospital), and Gangwon Province (Chuncheon Sacred Heart Hospital).
A clinical data warehouse is a database of medical records, prescriptions, and test results, which can be used to identify patients based on prescriptions, examinations, and diagnostic data. The timing and results of investigations, drug administration, transfusions, and other information were extracted in an unstructured text format. The requested data were provided in a de-identified format, but the data of specific patients could be extracted using a key.

Patients and Pulmonary Edema
The study included adult patients aged > 18 years who did not exhibit preoperative pulmonary edema. The exclusion criteria were missing data and outlier data. Pulmonary edema was diagnosed by radiologists on the basis of chest radiographs. Patients were presumed not to have PPE if they lacked perioperative respiratory symptoms and did not undergo chest radiography.

Dataset
The dataset involved the following 98 perioperative variables: age, male sex, and order of surgery; the statuses of preoperative atelectasis, preoperative effusion, preoperative pneumothorax, preoperative pneumonia, preoperative pulmonary thromboembolism, and preoperative acute respiratory distress; body mass index; the statuses of congestive heart failure, cardiac arrhythmia, valvular diseases, pulmonary circulation disorders, peripheral vascular disorders, hypertension (uncomplicated vs. complicated), paralysis, other neurological disorders, chronic pulmonary diseases, diabetes (uncomplicated vs. complicated), hypothyroidism, renal failure, liver diseases, peptic ulcer diseases (excluding bleeding), acquired immune deficiency syndrome/human immunodeficiency virus, lymphoma, metastatic cancer, solid tumors (without metastasis), and rheumatoid arthritis/collagen vascular diseases; alcohol consumption, current smoking status, smoking frequency (packs), smoking duration (years), emergency status, American Society of Anesthesiologists physical status of >2, use of general anesthesia, maintenance anesthetics administered, N2O use, anesthesia time (min), surgery time (min), intraoperative blood and fluid administration, intraoperative urine output, and estimated blood loss; the statuses of arterial line monitoring, central venous pressure monitoring, Foley catheter, Levin tube, and patient-controlled analgesia; the administration of intraoperative packed red blood cells, fresh frozen plasma, platelets (concentration and cryoprecipitate), rocuronium, vecuronium, atracurium, cisatracurium, succinylcholine, pyridostigmine, neostigmine, sugammadex, fentanyl, alfentanil, sufentanil, remifentanil, and pethidine; blood urea nitrogen level, creatinine level, glomerular filtration rate, prothrombin time, activated partial thromboplastin time, and platelet count; the levels of sodium, potassium, uric acid, protein, and albumin; and the statuses of robotic surgery, laparoscopic surgery, heart surgery, abdominal surgery, breast surgery, ear surgery, endocrine surgery, eye surgery, head and neck surgery, musculoskeletal surgery, neurosurgery, obstetric and gynecological surgery, spine surgery, thoracic surgery, transplant surgery, urogenital surgery, vascular surgery, and skin and soft tissue surgery.
The dataset was divided into training and test sets. The training set included data from Kangnam Sacred Heart Hospital, Hangang Sacred Heart Hospital, Hallym University Sacred Heart Hospital, and Dongtan Sacred Heart Hospital. The test set included data from Chuncheon Sacred Heart Hospital. The training set was used for model learning, whereas the test set was used to evaluate model performance. Both datasets were normalized using min-max scaling, with the scaling parameters derived from the training set.
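The scaling step can be illustrated with a minimal sketch. It assumes that "based on the training set" means the per-variable minimum and maximum are taken from the training data and then reused unchanged for the test data; the function names and the toy values below are hypothetical and are not taken from the study.

```python
def fit_min_max(rows):
    """Return per-column (min, max) computed from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, stats):
    """Scale each value with the training-set min and max of its column."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, stats):
            span = hi - lo
            # A constant training column carries no information; map it to 0.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical toy rows: [age (years), anesthesia time (min)]
train_rows = [[20, 60], [40, 120], [80, 300]]
test_rows = [[50, 180], [90, 30]]

stats = fit_min_max(train_rows)            # learned from training data only
train_scaled = apply_min_max(train_rows, stats)
test_scaled = apply_min_max(test_rows, stats)
print(train_scaled)
print(test_scaled)
```

With this convention, test-set values outside the training range fall outside [0, 1] (the second toy test row does), which is expected when the test hospital is not used to fit the scaler.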

Machine Learning
The study used supervised learning, an ML paradigm for data consisting of labeled examples (i.e., each data point contains variables and an associated label). Five ML algorithms were used: random forest, light-gradient boosting machine, extreme-gradient boosting machine, multilayer perceptron, and logistic regression [20–24]. Random forest is a tree-based ensemble technique that uses bootstrap aggregation and predictor randomization to achieve high predictive accuracy. Various random forest input parameters were explored [25]. A light-gradient boosting machine grows trees leaf-wise: it repeatedly splits the leaf node with the maximum loss reduction without considering tree balance, resulting in a deep and asymmetric tree [26]. Extreme-gradient boosting machine is an optimized gradient boosting algorithm that involves parallel processing, tree pruning, missing value management, and regularization to avoid overfitting and bias [27]. Multilayer perceptron is a neural network with ≥1 intermediate (hidden) layer between the input and output layers. The network is connected in the direction of the input, hidden, and output layers; it is a feedforward network, with no connections within a layer and no direct connection from the output layer back to the input layer [28]. Logistic regression is a linear model for binary classification problems.
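The feedforward structure of a multilayer perceptron can be sketched as follows. This is only an illustration of the layer-to-layer connectivity described above, not the network used in the study: the layer sizes, the sigmoid activation, and all weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: each unit sees only the previous layer's outputs."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Single pass input -> hidden -> output; nothing flows backwards or sideways."""
    hidden = dense(features, hidden_w, hidden_b)
    output = dense(hidden, out_w, out_b)
    return output[0]


# Hypothetical weights for 2 inputs, 2 hidden units, and 1 output unit.
hidden_w = [[0.5, -0.25], [-0.75, 1.0]]
hidden_b = [0.0, 0.1]
out_w = [[1.5, -2.0]]
out_b = [0.2]

risk = forward([0.4, 0.9], hidden_w, hidden_b, out_w, out_b)
print(risk)  # a value between 0 and 1, read as a predicted probability
```

Removing the hidden layer and keeping a single sigmoid output unit reduces this sketch to logistic regression, which is why the two models are often compared.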
The dataset was imbalanced, which can reduce model performance. Therefore, we used the synthetic minority oversampling technique for all algorithms except random forest [29]. After the ratio of pulmonary edema had been balanced, we trained the models with a training set that included synthetic samples. The random forest algorithm has a variant known as balanced random forest (BRF) that handles class imbalance internally; therefore, the synthetic minority oversampling technique was not used for the random forest algorithm. Data processing and the ML process are summarized in Figure 1. Feature importance for the best model was calculated using the built-in function of the algorithm package.
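The idea behind the synthetic minority oversampling technique can be sketched in a few lines. This is a simplified illustration, assuming Euclidean distance, a small hypothetical neighbourhood size k, and toy data; the study does not state which implementation or parameters were used.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating between a sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical scaled features for 3 minority (PPE) cases and 10 majority cases.
minority = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
n_majority = 10

new_samples = smote(minority, n_majority - len(minority))
balanced_minority = minority + new_samples
print(len(balanced_minority), "minority samples vs", n_majority, "majority samples")
```

Synthetic samples are added to the training set only; the test set keeps its original class ratio so that the reported metrics reflect real prevalence.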

Modified Dataset
Because the original dataset was large and complex, additional derived datasets were used to refine the prediction model. These datasets were used for training and validation with the algorithm that performed best on the original data. First, the test dataset was reduced by under-sampling using the Tomek links method to validate our best model [30]. Second, a simplified prediction model was built using the 20 most important features of the best model, and was validated using a test dataset that included these features.
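Under-sampling with Tomek links can be sketched as follows. A Tomek link is a pair of samples from different classes that are each other's nearest neighbours; the sketch removes the majority-class member of each link. The brute-force neighbour search, the function names, and the one-dimensional toy data are assumptions made for illustration only.

```python
def nearest(index, points):
    """Index of the nearest other point (squared Euclidean distance)."""
    best, best_dist = None, float("inf")
    for j, p in enumerate(points):
        if j == index:
            continue
        dist = sum((a - b) ** 2 for a, b in zip(points[index], p))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def tomek_undersample(points, labels, majority_label=0):
    """Drop majority-class points that form a Tomek link with a minority point."""
    nn = [nearest(i, points) for i in range(len(points))]
    drop = set()
    for i, j in enumerate(nn):
        # A Tomek link: mutual nearest neighbours with different labels.
        if nn[j] == i and labels[i] != labels[j]:
            drop.add(i if labels[i] == majority_label else j)
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Hypothetical one-feature toy data: label 1 = PPE, label 0 = no PPE.
points = [[0.0], [0.1], [1.0], [1.05], [2.0]]
labels = [0, 0, 0, 1, 0]

kept_points, kept_labels = tomek_undersample(points, labels)
print(kept_points, kept_labels)
```

Only majority-class samples are removed, so the number of minority cases is unchanged. The brute-force search is quadratic in the number of samples, so a real dataset of this size would need an indexed nearest-neighbour search.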

Metrics and Statistics
Six metrics were calculated to assess model performance. The primary metric was the area under the receiver operating characteristic curve. The average precision of the precision-recall curve, precision, recall, f1 score, and accuracy were also calculated, together with the best classification threshold. Google Colab (Python version 3.7; Google, Mountain View, CA, USA) was used to calculate model metrics.
Descriptive analysis was performed to compare the characteristics of patients with and without PPE. Categorical variables were presented as numbers (%) and compared using the chi-squared test. Continuous variables were presented as medians (interquartile ranges) and compared using the Mann-Whitney U test. p-values of < 0.05 were considered statistically significant.
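The six performance metrics can be written out explicitly. The sketch below uses plain Python rather than the library calls the authors may have used. It assumes binary labels (1 = PPE), a pairwise definition of the area under the receiver operating characteristic curve, and an average precision computed as the mean precision at the rank of each positive case; the labels and scores are hypothetical.

```python
def threshold_metrics(y_true, y_score, threshold):
    """Precision, recall, f1 score, and accuracy at a given decision threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def auroc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive (ties not handled)."""
    ranked = sorted(zip(y_score, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.1, 0.2, 0.9, 0.6, 0.7, 0.3, 0.8, 0.4]

print(auroc(y_true, y_score))
print(average_precision(y_true, y_score))
print(threshold_metrics(y_true, y_score, threshold=0.5))
```

The area under the curve and the average precision do not depend on a threshold, whereas precision, recall, f1 score, and accuracy do, which is why a best threshold has to be chosen before the latter four can be reported.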

Patient Characteristics
A total of 287,976 patients aged > 18 years were identified. After the exclusion of 26,597 patients with missing (n = 26,593) and outlier (n = 4) data, and 4480 patients with preoperative pulmonary edema, a total of 256,899 patients were included in the analysis. PPE occurred in 5480 (2.1%) patients. The training and test sets included 221,908 and 34,991 patients, respectively. PPE occurred in 3584 (1.6%) and 1896 (5.4%) patients in the training and test sets, respectively (Tables 1 and 2).
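The cohort flow and incidence figures can be checked with simple arithmetic on the counts given in the text. The variable names are illustrative; the percentages are recomputed from the counts rather than copied, and the training and test rates agree with the 1.6% and 5.4% stated in the abstract.

```python
# Counts taken from the text.
identified = 287_976
excluded_missing = 26_593
excluded_outlier = 4
excluded_preop_edema = 4_480

analysed = identified - excluded_missing - excluded_outlier - excluded_preop_edema

train_n, test_n = 221_908, 34_991
train_ppe, test_ppe = 3_584, 1_896


def pct(events, total):
    """Incidence as a percentage, rounded to one decimal place."""
    return round(100 * events / total, 1)


print(analysed)                            # 256899
print(train_n + test_n == analysed)        # True
print(pct(train_ppe + test_ppe, analysed)) # 2.1
print(pct(train_ppe, train_n))             # 1.6
print(pct(test_ppe, test_n))               # 5.4
```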

Model Performance
BRF exhibited the best performance for the prediction of PPE risk. As the primary metric, the area under the receiver operating characteristic curve for BRF was 0.91 (95% confidence interval: 0.84–0.98). The performances of the remaining models are summarized in Figure 2. BRF also exhibited the best performance based on the average precision of the precision-recall curve (0.44). The average precisions of the precision-recall curve for the remaining models are summarized in Figure 3. BRF had the best recall (0.832) and f1 score (0.372), whereas the light-gradient boosting machine model had the best precision (0.531) and accuracy (0.946). The remaining metrics are summarized in Table 3.
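The relation between the reported recall and f1 score can be made explicit. Because f1 = 2PR / (P + R), the precision P follows from the recall R and the f1 score. The value below is only an algebraic consequence of the two BRF figures quoted in the text, assuming both were measured at the same threshold; the reported precision itself is in Table 3.

```python
def precision_from_f1(recall, f1):
    """Invert f1 = 2PR / (P + R) to obtain the precision P."""
    return f1 * recall / (2 * recall - f1)


# BRF recall and f1 score as quoted in the text.
brf_precision = precision_from_f1(recall=0.832, f1=0.372)
print(round(brf_precision, 2))  # 0.24

# False positives per true positive implied by that precision.
false_alarms_per_hit = (1 - brf_precision) / brf_precision
print(round(false_alarms_per_hit, 1))  # 3.2
```

Under that assumption, roughly three patients are flagged incorrectly for every patient correctly flagged, which illustrates how a high recall can coexist with a low f1 score when the outcome is rare.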

Feature Importance
The evaluation of feature importance in the BRF model revealed that arterial line monitoring was the most important feature. Ten major features in the BRF model are shown in Figure 4.

Validation of Under-Sampled Test Dataset and Simplified Model
After under-sampling, the test dataset included 1896 patients with PPE and 32,621 patients without PPE. The simplified prediction model included the following 20 features: arterial line monitoring, American Society of Anesthesiologists physical status, age, urine output, intraoperative fluid, estimated blood loss, Foley catheter, anesthesia time, albumin, glomerular filtration rate, central venous pressure monitoring, operation time, prothrombin time, blood urea nitrogen, protein, creatinine, prothrombin time-international normalized ratio, platelet count, body mass index, and intraoperative packed red blood cells. Validation results are summarized in Table 4.

Discussion
We used ML to develop models for the prediction of PPE. Model training using data from 221,908 patients was followed by model testing using data from 34,991 patients. Five algorithms were used to develop the models, and six metrics were used to evaluate their performances. BRF exhibited the best performance in terms of the area under the receiver operating characteristic curve, the average precision of the precision-recall curve, and recall. However, no model had a good precision or f1 score.
Numerous studies have developed ML models to predict postoperative pulmonary complications. Peng et al. developed and validated a deep-neural-network model based on combined natural language data and structured data to predict pulmonary complications in geriatric patients [15]. Xue et al. developed an ML model to predict postoperative pulmonary complications after emergency gastrointestinal surgery in patients with acute diffuse peritonitis [18]. Chen and colleagues developed an ML model to predict postoperative pneumonia in orthotopic liver transplant patients [14]. Although the outcomes of the above studies included PPE, their findings differed from ours because they also assessed other complications. An ML model to predict PPE risk after any type of surgery has not been developed.
PPE has various causes, several of which can occur simultaneously. PPE may be cardiogenic or noncardiogenic, but it is difficult to distinguish between these etiologies because of their similar clinical features. In patients with acute myocardial infarction, cardiogenic pulmonary edema may be complicated by noncardiogenic edema related to the aspiration of gastric contents, syncope, or cardiac arrest. Conversely, in patients with severe trauma or infections accompanied by noncardiogenic pulmonary edema, fluid resuscitation may cause pulmonary edema through volume overload and increased pulmonary vascular hydrostatic pressure [1,6,31]. Therefore, PPE prediction and the preemptive management of risk factors are important.
The present study investigated the important features of the best model for the prediction of PPE risk. Ten major PPE risk factors were included, primarily those related to fluid and hydrostatic pressure rather than the other causes of PPE. This means that the PPE prediction model could mainly predict cardiogenic and hydrostatic pulmonary edema. However, the evidence is not conclusive because the etiologies of PPE in this study were not known.
The most important feature was arterial line monitoring, which is required in patients who need continuous blood pressure monitoring or multiple blood sampling during surgery [32]. Arterial line monitoring is the standard of care for patients at risk of rapid hemodynamic changes. Patients with a poor preoperative status and those who undergo major surgeries can develop rapid hemodynamic changes and often need multiple sampling [33]. American Society of Anesthesiologists physical status and age also indicate preoperative patient condition. Patients with high American Society of Anesthesiologists physical status grades may develop heart, lung, kidney, and brain problems [34,35]. Old age is generally associated with compromised organ function, resulting in a greater risk of PPE [36]. Urine output, fluid volume, estimated blood loss, albumin, and glomerular filtration rate directly and indirectly affect body fluid status, which is associated with hydrostatic pressure [37–41].
To the best of our knowledge, our model is the first to predict PPE risk, and its performance was better than that of previously reported models for postoperative pulmonary complications. However, the present study had some limitations. First, the overall performance of the model was good, but its precision and f1 score were low, even at the best threshold. Because recall (sensitivity) was good while precision was low, the number of false positives was high relative to the number of true positives, presumably because of the low proportion of patients with pulmonary edema in the overall dataset. Thus, our model interpreted the normal state as PPE in many cases. Similar results were obtained in the validation with the under-sampled dataset. Second, our model requires many features to predict PPE, which reduces its practicality. Although the performance was not significantly worse in the model with twenty features, this limitation could not be fully resolved. A prediction model that uses fewer features while maintaining performance may be needed in the future. To resolve the first two limitations, additional datasets should be acquired and used for training, or features with better predictive value should be selected. Third, our model could not distinguish between cardiogenic and noncardiogenic PPE. Additional studies are needed to develop models that can distinguish between the two PPE types and predict the risk of each type.
In conclusion, we developed an ML model that could predict PPE risk in patients undergoing surgery. The model was superior to previously reported prediction models for postoperative pulmonary complications. Our ML model may improve clinical decision-making, thereby enhancing postoperative management. However, further improvements are needed to reduce the false positive rate and enhance its practical usefulness.