Postoperative Nausea and Vomiting Prediction: Machine Learning Insights from a Comprehensive Analysis of Perioperative Data

Postoperative nausea and vomiting (PONV) is a common complication after surgery. This study aimed to present the use of machine learning for predicting PONV and to provide insights based on a large amount of data. This retrospective study included perioperative data, such as patient characteristics and perioperative factors, from two hospitals. Logistic regression, random forest, light gradient boosting machine, and multilayer perceptron algorithms were used to develop the models. The dataset included 106,860 adult patients, with an overall PONV incidence of 14.4%. The area under the receiver operating characteristic curve (AUROC) of the models was 0.60–0.67. In the prediction models that included only the known risk and mitigating factors of PONV, the AUROC was 0.54–0.69. Features associated with patient-controlled analgesia and opioids were the most important in almost all models. In conclusion, machine learning provides valuable insights into PONV prediction, the selection of significant features for prediction, and feature engineering.


Introduction
Postoperative nausea and vomiting (PONV) is a common and distressing complication experienced by patients after surgery, particularly under general anesthesia [1–4]. This can lead to discomfort, delayed recovery, and even extended hospital stays, negatively affecting the overall patient experience and increasing healthcare costs [3,5,6]. Therefore, effective management of PONV is crucial for improving patient outcomes and satisfaction during the postoperative period [7].
Traditional approaches to managing PONV involve the administration of prophylactic antiemetic medications to high-risk patients based on clinical risk factors [7–9]. However, these approaches are often suboptimal as they may not accurately predict individual patient risks and can result in unnecessary medication use [10]. Consequently, there is a growing interest in developing more precise and personalized predictive models for PONV, leveraging machine learning algorithms to consider patient-specific data and risk factors.
In recent years, advancements in machine learning have revolutionized various fields, including healthcare [11,12]. In particular, machine learning holds great promise in the prediction and prevention of postoperative complications [13,14], such as PONV. The ability to accurately predict which patients are at higher risk for PONV would allow clinicians

Data Preprocessing
Features were divided into continuous and categorical types. Continuous data were standardized by removing the mean and scaling to unit variance [15]. The target variable was imbalanced: there were more patients without PONV than with PONV. In classification problems, imbalanced datasets negatively affect the accuracy of class predictions [16]. To address this problem, we applied the synthetic minority oversampling technique (SMOTE) [17]. SMOTE generates new samples of the minority class using the k-nearest neighbors (k-NN) algorithm. Subsequently, we divided the entire dataset into training and test datasets in an 8:2 ratio. The split was random but kept similar rates of PONV in the training and test sets.
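The three preprocessing steps described above can be illustrated with a dependency-free sketch. It is not the authors' code: the function names, the toy values, and the choice of k are assumptions made for illustration, and a library implementation of SMOTE may differ in detail.

```python
import math
import random


def standardize(column):
    """Rescale a continuous feature to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen row (excluding itself)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: sq_dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Random 8:2 split that keeps the class ratio similar in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * test_fraction)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return train, test


# Toy data (hypothetical values, not from the study)
ages = standardize([30.0, 40.0, 50.0, 60.0])
ponv_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote(ponv_rows, n_new=6, k=2)
labels = [1] * 10 + [0] * 40
train_idx, test_idx = stratified_split(labels)
```

Each synthetic row lies on the line segment between a real minority-class row and one of its nearest minority-class neighbours, which is why SMOTE adds new points rather than duplicating existing ones.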

Machine Learning
We used five algorithms to develop the PONV prediction models. The five algorithms were as follows: logistic regression, random forest, light-gradient boosting machine, multilayer perceptron, and extreme boosting machine [18–22]. For the random forest, we used a built-in balanced random forest package without SMOTE. A balanced random forest randomly under-samples each bootstrap sample to balance it [23]. Prediction models were developed by applying the training dataset to each algorithm.
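The under-sampling step of a balanced random forest can be sketched as follows. This is a minimal illustration of the idea stated above, not the internals of any particular package; the function name and the toy label vector are assumptions.

```python
import random


def balanced_bootstrap(labels, seed=0):
    """Row indices for one tree: a bootstrap sample under-sampled to equal class sizes."""
    rng = random.Random(seed)
    n = len(labels)
    # 1) ordinary bootstrap: draw n row indices with replacement
    boot = [rng.randrange(n) for _ in range(n)]
    by_class = {}
    for i in boot:
        by_class.setdefault(labels[i], []).append(i)
    # 2) under-sample every class down to the size of the rarest class in the draw
    target = min(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rng.sample(rows, target))
    return balanced


# Toy labels with roughly the imbalance reported in the study (about 15% positive)
toy_labels = [1] * 15 + [0] * 85
sample = balanced_bootstrap(toy_labels)
n_pos = sum(toy_labels[i] == 1 for i in sample)
n_neg = sum(toy_labels[i] == 0 for i in sample)
```

Because every tree sees a different balanced subset, the forest as a whole still uses most of the majority-class rows, which is the usual argument for this approach over a single global under-sampling.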
Hyperparameter tuning and cross-validation using RandomizedSearchCV were conducted to obtain the models with the best performance. RandomizedSearchCV trains the model on randomly sampled combinations of the candidate hyperparameters [24]. The hyperparameters used in RandomizedSearchCV are summarized in Listing A1 in Appendix A. During tuning, model performance was evaluated by five-fold cross-validation on the training set, with the area under the receiver operating characteristic curve (AUROC) as the scoring metric. Subsequently, the best models for each algorithm were evaluated using the test set.
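The logic of a randomized hyperparameter search with five-fold cross-validation can be sketched without any library. The search space, the `toy_score` stand-in, and all names below are hypothetical; the actual search spaces are those in the paper's Appendix A, and a real scoring function would fit a model on the training folds and return the AUROC on the held-out fold.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, non-overlapping validation folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def random_search(param_space, cv_score, n_rows, n_iter=10, k=5, seed=0):
    """Try n_iter random hyperparameter combinations; keep the best mean CV score."""
    rng = random.Random(seed)
    folds = k_fold_indices(n_rows, k, seed)
    best_params, best_score, history = None, float("-inf"), []
    for _ in range(n_iter):
        # sample one value per hyperparameter, independently and at random
        params = {name: rng.choice(values) for name, values in param_space.items()}
        scores = []
        for fold in folds:
            held_out = set(fold)
            train_idx = [i for i in range(n_rows) if i not in held_out]
            scores.append(cv_score(params, train_idx, fold))
        mean_score = sum(scores) / k
        history.append((params, mean_score))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score, history


def toy_score(params, train_idx, val_idx):
    # Hypothetical stand-in for "fit on train_idx, return AUROC on val_idx".
    return 0.7 - 0.01 * abs(params["max_depth"] - 6) - 0.05 * abs(params["learning_rate"] - 0.1)


space = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score, history = random_search(space, toy_score, n_rows=100, n_iter=20)
```

Unlike an exhaustive grid search, the number of fitted models is fixed by `n_iter` rather than by the size of the search space.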
Additionally, we developed simplified models that included features known to be associated with PONV in adults: female sex, smoking status, age, volatile anesthetics, duration of anesthesia, postoperative opioid use, risky surgery (laparoscopic surgery and obstetric and gynecologic surgery), and preventive antiemetics. Although most known risk and mitigating factors follow the Fourth Consensus Guidelines for the Management of Postoperative Nausea and Vomiting [9], some features were missing or insufficient. Postoperative opioid use was defined as opioid use within 24 h after surgery. Preventive antiemetics were defined as antiemetics used intraoperatively or in the PACU before the occurrence of PONV. As we did not have data on a history of PONV or motion sickness, we added data regarding preoperative nausea and vomiting. For risky surgeries, we included only laparoscopic surgery and obstetric and gynecologic surgery because we did not have data on cholecystectomy and bariatric surgery.
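The two time-based definitions above could be derived from timestamped records roughly as follows. This is only a sketch: the record layout, function names, and example times are hypothetical, and the location criterion for antiemetics (intraoperative or PACU) is not modeled here.

```python
from datetime import datetime, timedelta


def postoperative_opioid_use(surgery_end, opioid_times, window_hours=24):
    """True if any opioid dose was given within window_hours after surgery ended."""
    limit = surgery_end + timedelta(hours=window_hours)
    return any(surgery_end <= t <= limit for t in opioid_times)


def preventive_antiemetic(antiemetic_times, ponv_onset):
    """True if an antiemetic was given before PONV occurred.

    When PONV never occurred (ponv_onset is None), any antiemetic counts as preventive.
    """
    if ponv_onset is None:
        return len(antiemetic_times) > 0
    return any(t < ponv_onset for t in antiemetic_times)


# Hypothetical example times
end = datetime(2023, 1, 1, 12, 0)
opioid_flag = postoperative_opioid_use(end, [end + timedelta(hours=3)])
late_opioid_flag = postoperative_opioid_use(end, [end + timedelta(hours=30)])
prevent_flag = preventive_antiemetic([end - timedelta(hours=1)], end + timedelta(hours=2))
rescue_flag = preventive_antiemetic([end + timedelta(hours=3)], end + timedelta(hours=2))
```

An antiemetic given after PONV onset is a rescue treatment rather than prophylaxis, which is why the second function compares each dose time with the onset time.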
To obtain the feature importance, we used mutual information, which quantifies the dependency or association between two random variables. In the context of feature importance, mutual information is used to measure the amount of information gained regarding a target variable by knowing the value of a particular feature. This is a method to assess the relevance of a feature in predicting a target variable [25].
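For two discrete variables, mutual information is I(X;Y) = Σ p(x,y) log[p(x,y) / (p(x) p(y))]. The sketch below computes this plug-in estimate from observed frequencies. It is an illustration with made-up toy vectors, and the estimator the authors used (particularly for continuous features) may differ.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    count_x = Counter(feature)
    count_y = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


ponv = [1, 1, 0, 0]
informative = [1, 1, 0, 0]    # toy feature that determines the target exactly
uninformative = [1, 0, 1, 0]  # toy feature that is independent of the target

mi_informative = mutual_information(informative, ponv)
mi_uninformative = mutual_information(uninformative, ponv)
```

A feature that fully determines a balanced binary target reaches the maximum of ln 2 nats, whereas an independent feature scores zero. Scores in between rank features by how much knowing them reduces uncertainty about the target.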

Statistics
Descriptive analyses were performed to compare the characteristics and perioperative data of the training and test sets. Categorical features were presented as numbers and percentages, and continuous features were presented as medians and interquartile ranges. The differences were evaluated as absolute standardized differences. Five metrics were calculated to assess model performance: the AUROC as the primary metric, together with recall, precision, F1 score, and accuracy. Bootstrapping (n = 1000) was performed to calculate 95% confidence intervals (CI). Python (version 3.7; PSF, Beaverton, OR, USA) was used to calculate the model metrics.
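A bootstrap confidence interval for the AUROC can be sketched as below. The percentile method and the pairwise AUROC computation are assumptions chosen for clarity, since the paper does not state which bootstrap variant was used; the pairwise loop is quadratic and only suitable for small examples.

```python
import random


def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, y_score, metric=auroc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric on the test set."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        # resample test-set rows with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with one class has no defined AUROC; redraw
        stats.append(metric(ys, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy predictions (hypothetical)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
point_estimate = auroc(y_true, y_score)
ci_low, ci_high = bootstrap_ci(y_true, y_score, n_boot=200)
```

In the toy example, three of the four positive-negative pairs are ranked correctly, so the point estimate is 0.75. With only four rows the interval is very wide, which illustrates why a large test set yields the narrow intervals reported for the models.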

Results
A total of 149,802 patients underwent surgery under general anesthesia from 1 January 2013 to 30 April 2023. After 42,942 patients were excluded, data of 106,860 patients were divided into training (n = 84,888) and test (n = 21,372) sets. Details are summarized in Figure 1. The numbers of PONV cases were 12,287 (14.5%) and 3072 (14.4%) in the training and test sets, respectively. Patient characteristics and perioperative data are summarized in Tables 1 and 2, respectively. The absolute standardized difference between the training and test sets was below 0.1 for all features.
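As a worked example of the balance check, the absolute standardized difference for a binary variable can be computed from the two PONV rates reported above. The formula below is a commonly used definition and is assumed here, since the paper does not spell out its exact formula.

```python
import math


def std_diff_proportion(p1, p2):
    """Absolute standardized difference between two proportions."""
    pooled_var = (p1 * (1 - p1) + p2 * (1 - p2)) / 2
    return abs(p1 - p2) / math.sqrt(pooled_var)


train_rate = 12287 / 84888  # PONV rate in the training set, as reported
test_rate = 3072 / 21372    # PONV rate in the test set, as reported
ponv_asd = std_diff_proportion(train_rate, test_rate)
```

For the PONV rate, the result is far below the 0.1 threshold, consistent with a split that preserved the outcome rate in both sets.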

Performance of Models, including 10 Known Risks and Mitigating Factors
Figure 3 shows the AUROC of the models that included the known risk and mitigating factors, by algorithm. Balanced random forest (AUROC [95% CI] = 0.69 [0.68-0.70]) had the highest AUROC. Table 4 shows the precision, recall, accuracy, and F1 score of each model according to the algorithm. Precision was highest for light GBM (0.46, 95% CI: 0.42-0.49). Recall was highest for balanced random forest (0.71, 95% CI: 0.69-0.72). Accuracy was highest for light GBM (0.85, 95% CI: 0.85-0.86). The F1 score was highest for balanced random forest (0.39, 95% CI: 0.38-0.40).
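The four threshold-based metrics follow directly from the confusion matrix, and the F1 definition can be inverted to see what a reported F1 and recall imply about precision. The sketch below uses hypothetical confusion counts. The implied precision for the balanced random forest is derived from the rounded values above, so it is approximate and is not a figure reported in the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)


# Hypothetical counts: 10% positives, half of them detected
toy = classification_metrics(tp=50, fp=50, fn=50, tn=850)

# Balanced random forest: reported F1 = 0.39 and recall = 0.71 (rounded values)
implied_precision = precision_from_f1(0.39, 0.71)
```

The toy counts give an accuracy of 0.90 with precision and recall of only 0.50, which shows how accuracy can look high on an imbalanced outcome. The reported F1 and recall of the balanced random forest imply a precision of roughly 0.27, which suggests that its high recall comes at the cost of many false positives.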

Feature Importance
Table 5 lists the top 20 most important features in the models. Female sex, smoking status, obstetric and gynecologic surgery, and factors associated with postoperative opioid use were among the top features in all five models. The importance of all features is summarized in Table A1 in Appendix B.
Table 6 shows the feature importance and score in the models that include the 10 known risk and mitigating features. Female sex had the highest score in three models (logistic regression, light gradient boosting machine, and balanced random forest), whereas postoperative opioids had the highest score in two models (random forest and multilayer perceptron).

Discussion
In this study, we developed PONV prediction models with machine learning using the characteristics and perioperative data of 84,888 patients. In the evaluation of the models using data from 21,372 patients, the AUROC ranged from 0.60 to 0.67 when all features were included. When only the known risk and mitigating factors were included, the AUROC ranged from 0.54 to 0.69. Shim et al. recently reported the prediction of PONV using machine learning in patients undergoing intravenous PCA [26]. Their study included 2149 patients and used seven algorithms and 13 features. Despite the small size of their data compared with ours, their AUROC ranged from 0.576 to 0.686 and was 0.643 when only Apfel risk factors were used. Their AUROC values were similar to those obtained in our study. On the other hand, Xie et al. also reported the probability of PONV for PCA using machine learning. Although they included 2222 patients and 21 features, their best AUROC value was 0.947. However, because their study included only patients who received PCA and the PCA regimen was limited, their models could not be applied to all patients undergoing general anesthesia. Zhou et al. reported the prediction of early postoperative PONV using multiple machine-learning and deep-learning algorithms [27]. Their study included 2149 patients and used seven algorithms and 15 features. They also had a small amount of data, but the AUROC values of the models ranged from 0.611 to 0.732. Some models showed better performance than ours. However, their data were obtained 10-15 years ago and included no recent data. Therefore, their models do not reflect current anesthesia and surgical practice.
To develop models that can be applied to as many patients under general anesthesia as possible, the training of the models included data from over 80,000 patients from two hospitals and 102 features. Additionally, we developed simplified models that included only 10 known risk and mitigating factors. These factors are general categories that medical staff investigate or apply to general anesthesia. However, no model with excellent performance included only the 10 known risk and mitigating factors. In addition, the performance of some metrics was worse than that of models that included all features. If the removed features contain crucial information related to the target variable, their removal can result in poor performance. In this case, the model may lack the information necessary to make accurate predictions [28].
In models that included all features, the most important features were associated with opioids or PCA. In our study, if simplified models were developed with the most important features, models would have no choice but to include only the biased types of data, such as opioids and PCA, and other risk factors for PONV would have been excluded from the models. Incorporating or transforming some features may be needed to improve performance and ease of use, such as incorporating variable factors associated with postoperative opioid use. Although feature elimination sometimes helps in understanding the data, reducing computational requirements, reducing the effect of the curse of dimensionality, and improving predictor performance [29], a larger and more representative dataset can lead to better generalization [30]. The selection and transformation of features should be performed carefully, considering the specific characteristics of the data and the problem at hand.
Upon analyzing the results of feature importance, certain features consistently emerge as influential across multiple models. For instance, female sex was the variable that consistently held a substantial influence in all models, suggesting that sex might play a significant role in PONV prediction. Similarly, smoking status was another significant factor across all models, indicating its relevance in predicting the risk of PONV. Interestingly, the variables associated with opioid use demonstrated significant importance across all models, suggesting a robust association between opioid administration and the likelihood of PONV, similar to the conventional prediction of PONV. Predictions using machine learning also underscore the need for cautious opioid management strategies to mitigate the risk of PONV.
The strengths of this study include its meticulous approach to model development by utilizing a substantial dataset of over 80,000 patients and incorporating a rich set of features. This emphasis on data quantity and feature diversity provides a robust foundation for predictive modeling. In addition, the development of comprehensive models incorporating a wide range of features and simplified models based on known risk and mitigating factors acknowledges the practical need for predictive tools that can be applied to most patients undergoing general anesthesia. The integration of artificial intelligence into such medical information creates a new opportunity to design and improve new systems beyond existing systems [31].
This study also has several limitations.

1. We acknowledge that the models including only known risk and mitigating factors did not exhibit strong performance and, in some cases, showed worse metrics than the models with all features. This limitation suggests that there may be unaccounted factors contributing to PONV that are not captured solely by known risk and mitigating factors.
2. Although our study included a substantial number of patients, data were obtained from only two hospitals. This may raise questions regarding the diversity of patient populations and medical practices, potentially affecting the generalizability of the models to other healthcare settings.
3. Some studies referenced for comparison had outdated data, which might not accurately reflect the current landscape of anesthesia and surgery. This emphasizes the importance of continuously updating the models based on recent data.
4. This study highlighted the challenges of feature selection and the potential impacts of excluding informative features. However, further insight into the specific criteria and methods used for feature selection would enhance the transparency of the model development process.

Conclusions
Our study offers a valuable contribution to the realm of predictive modeling for PONV in patients undergoing general anesthesia. However, the performance of models based solely on known risk and mitigating factors highlights the complexity of PONV prediction and the need to consider additional contributing variables. Furthermore, the origin of the dataset from two hospitals warrants cautious interpretation when considering its generalizability across diverse healthcare settings. Prediction of PONV can lead to a significant reduction in PONV incidence by personalizing anesthesia and medication plans, efficiently allocating resources, improving patient experience, and strengthening recovery protocols. This can benefit patients by minimizing discomfort as well as making healthcare delivery and resource utilization more efficient. However, improved usability and performance of the model are needed to make this a reality.

Table 1. Characteristics of patients.

Table 2. Perioperative data of patients.

Table 3. Precision, recall, accuracy, and F1 score of each model according to the algorithm.

Table 4. Precision, recall, accuracy, and F1 score of models including known risk and mitigating factors, according to the algorithm.

Table 5. Top 20 most important features using mutual information according to model. Bold-expressed features were included as the top 20 features in all models. ASA PS, American Society of Anesthesiologists physical status; BRF, balanced random forest; GERD, gastroesophageal reflux disease; MLP, multilayer perceptron; LGBM, light gradient boosting machine; LR, logistic regression; PACU, postanesthesia care unit; PCA, patient-controlled analgesia; RF, random forest; TDFP, transdermal fentanyl patch.

Table 6. Feature importance and score in models that include 10 known risk and mitigating features.