To Establish an Early Prediction Model for Acute Respiratory Distress Syndrome in Severe Acute Pancreatitis Using Machine Learning Algorithm

Objective: To develop binary and quaternary classification prediction models for patients with severe acute pancreatitis (SAP) using machine learning methods, so that doctors can evaluate the risk of acute respiratory distress syndrome (ARDS) and severe ARDS at an early stage. Methods: A retrospective study was conducted on SAP patients hospitalized in our hospital from August 2017 to August 2022. Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), and eXtreme Gradient Boosting (XGB) were used to build the binary classification prediction model of ARDS. SHapley Additive exPlanations (SHAP) values were used to interpret the machine learning models, and the models were optimized according to the SHAP interpretability results. Using the optimized characteristic variables, quaternary classification models, including RF, SVM, DT, XGB, and Artificial Neural Network (ANN), were constructed to predict mild, moderate, and severe ARDS, and the prediction effects of the models were compared. Results: The XGB model showed the best effect (AUC = 0.84) in binary classification (ARDS or non-ARDS). According to SHAP values, the prediction model of ARDS severity was constructed with four characteristic variables (PaO2/FiO2, APACHE II, SOFA, AMY). Among the severity models, ANN had the best overall prediction accuracy (86%). Conclusions: Machine learning performs well in predicting the occurrence and severity of ARDS in SAP patients and can provide a valuable tool for doctors making clinical decisions.


Introduction
Acute pancreatitis (AP) is a common disease in gastroenterology. It is estimated that about 20% of AP patients develop severe acute pancreatitis (SAP) with persistent organ failure and become critically ill [1,2]. Among these patients, acute respiratory distress syndrome (ARDS) is the leading cause of lung failure, with an incidence as high as one third. Respiratory death caused by ARDS accounts for 60% of all deaths in SAP patients within one week of onset [3,4]. However, treatment options for ARDS are still limited, and improvement of prognosis depends mainly on early identification and the use of mechanical ventilation [5]. Currently, the clinical diagnosis and severity evaluation of ARDS still rely on doctors identifying symptoms such as dyspnea and monitoring blood gas analysis. Because this judgment is highly subjective, its accuracy is unsatisfactory. To solve this problem, some scholars are trying to find new biomarkers that can predict ARDS [6], and others have developed predictive scoring systems for the occurrence of ARDS [7]. However, these prediction models are based mainly on traditional statistical methods. Because such methods cannot account for multicollinearity between variables, their predictive ability is inconsistent and does not meet clinical needs.
Machine learning, a branch of artificial intelligence, has made significant progress in medicine in recent years. Studies have shown that machine learning has a better predictive effect than traditional statistical analysis [8], which provides an unprecedented opportunity to establish new practical ARDS prediction tools for SAP patients.
Several previous publications have demonstrated the advantages of machine learning algorithms in ARDS prediction [9][10][11]. However, few models have used characteristic variables associated with AP, such as amylase and inflammatory markers, and this omission may seriously affect the prediction of ARDS in AP patients. Yang et al. [12] used the ANN algorithm to predict ARDS in AP patients but did not explain the rationale behind the model, which limits its clinical application. Therefore, this work combined machine learning methods to establish binary and quaternary classification ARDS prediction models in SAP patients and to find the model with the best prediction performance. By analyzing patients' clinical data, the model can quickly and accurately identify patients at high risk of ARDS and indicate the severity of their respiratory distress, providing a valuable tool for clinical decision-making. To the best of our knowledge, this is the first validated model that applies interpretable machine learning to predict the occurrence and severity of ARDS in SAP patients.

Data Resources and Study Cohorts
A retrospective study of patients diagnosed with SAP between August 2017 and August 2022 at Zhongda Hospital was conducted using data from the electronic database. These patients were admitted within 72 h of onset, and their diagnosis of SAP met the Atlanta Classification (2012) [13] criteria: the presence of persistent single or multiple organ failure for ≥48 h. Patients were excluded if they had a previous history of chronic pulmonary disease; pancreatitis after ERCP, acute exacerbation of chronic pancreatitis, or acute pancreatitis in pregnancy; or comorbid malignancies, autoimmune diseases, or hematological disorders.
The primary outcome was acute respiratory distress syndrome, diagnosed and classified according to the Berlin definition [14]. The research dataset comprised 440 SAP patients, 230 without ARDS and 210 with ARDS, so the two classes are roughly balanced. Of the 210 patients with ARDS, 40 had mild, 110 had moderate, and 60 had severe ARDS.
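The Berlin definition grades ARDS severity by the PaO2/FiO2 ratio (measured with PEEP or CPAP of at least 5 cmH2O): mild for 200 < PaO2/FiO2 ≤ 300 mmHg, moderate for 100 < PaO2/FiO2 ≤ 200 mmHg, and severe for PaO2/FiO2 ≤ 100 mmHg. The following is a minimal sketch of how a quaternary label could be derived from that grading. The function name, the label strings, and the `has_ards` flag are illustrative assumptions; the paper does not describe its labeling code.

```python
def berlin_severity(pf_ratio: float, has_ards: bool) -> str:
    """Map a PaO2/FiO2 ratio (mmHg) to a four-class severity label.

    Assumption: the remaining Berlin criteria (timing, bilateral
    opacities, edema not fully explained by cardiac failure,
    PEEP/CPAP >= 5 cmH2O) were already checked and are summarized
    in `has_ards`.
    """
    if pf_ratio <= 0:
        raise ValueError("PaO2/FiO2 must be positive")
    if not has_ards or pf_ratio > 300:
        return "non-ARDS"
    if pf_ratio > 200:
        return "mild"
    if pf_ratio > 100:
        return "moderate"
    return "severe"
```

For example, `berlin_severity(150, True)` falls in the moderate band, while any patient who does not meet the other diagnostic criteria is labeled non-ARDS regardless of the ratio.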
This study was performed at Zhongda Hospital, affiliated with Southeast University. The protocol conformed to the Declaration of Helsinki and its later amendments.

Data Collection and Preprocessing
In contrast to existing risk-scoring models that include only a limited number of clinical features, this work considers the broadest possible range of factors based on clinically available data. Demographic data, etiology, medical history, hemodynamic parameters, laboratory indices, and disease severity scores were collected for each patient from the electronic health record. Etiologies were categorized as cholestatic, alcoholic, hyperlipidemic, and other. Medical history included chronic diseases, smoking, and alcohol use. Hemodynamic parameters, including systolic, diastolic, and mean arterial pressure, were measured on admission. Laboratory test data were taken from the results of the patient's first peripheral venous blood sample within 24 h of admission.
The most clinically accepted scoring systems for disease severity were chosen for this work: the Acute Physiology and Chronic Health Evaluation II (APACHE II) score [15] and the Sequential Organ Failure Assessment (SOFA) score [16]. Both quantify the patient's condition from objective physiological parameters, and the final score corresponds to the severity of the disease. APACHE II and SOFA were calculated from the worst values of the physiological parameters within 24 h after admission.
The above data were then processed for missing values. Outliers that did not pass a logical check were treated as missing values and handled in the same way as other missing values. Features with missing values in more than 10% of the sample were discarded. For features with missing values in less than 10% of the sample, the missing values were filled with stratified means. The valid data set after processing is 440 × 45: 440 cases, each with 43 feature variables and two prediction labels. The binary and quaternary classification labels were used to predict the occurrence and severity of ARDS, respectively. Table 1 presents the distribution of the studied characteristics. Measurement data are expressed as median (interquartile range). Normality and homogeneity of variance were tested before the two groups were compared: the independent-samples t-test was used for data that were normally distributed with homogeneous variance, and the Wilcoxon test was used for non-normally distributed data. p < 0.05 was considered statistically significant.
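The missing-value rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: the paper does not say which variable defines the strata (the sketch takes it as a parameter, `stratum_key`), nor how a feature with exactly 10% missing values is treated (the sketch keeps it). The function name and data layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def clean_missing(rows, features, stratum_key, max_missing=0.10):
    """Drop sparse features, then fill remaining gaps with stratum means.

    rows: list of dicts; a missing value is stored as None.
    Returns (kept_features, cleaned_rows).
    Assumes every stratum has at least one observed value per kept feature.
    """
    n = len(rows)
    # Keep only features whose missing fraction does not exceed the limit.
    kept = [f for f in features
            if sum(r[f] is None for r in rows) / n <= max_missing]

    # Collect observed values per (stratum, feature) and average them.
    observed = defaultdict(list)
    for r in rows:
        for f in kept:
            if r[f] is not None:
                observed[(r[stratum_key], f)].append(r[f])
    means = {key: mean(values) for key, values in observed.items()}

    # Rebuild each row, replacing None with the mean of its own stratum.
    cleaned = []
    for r in rows:
        new_row = {stratum_key: r[stratum_key]}
        for f in kept:
            value = r[f]
            new_row[f] = value if value is not None else means[(r[stratum_key], f)]
        cleaned.append(new_row)
    return kept, cleaned
```

If the stratum is the outcome label, imputation should be fitted on the training portion only, since filling test-set gaps with label-specific means would leak the label into the features.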

Model Development and Evaluation
The entire study cohort was randomly divided into a training set (70%) and a test set (30%). To limit overfitting during training, models were trained with five-fold cross-validation: five accuracies were obtained, and their average was taken as the model accuracy.
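The splitting scheme can be sketched with the standard library alone. The helper names and the fixed seed are assumptions made for illustration; in practice the same result would usually be obtained with the splitting utilities of scikit-learn, which the authors report using.

```python
import random


def train_test_split_indices(n, test_fraction=0.30, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


# 440 cases as in the study: 70% for training, 30% held out for testing.
train_idx, test_idx = train_test_split_indices(440)

# Five-fold cross-validation is run inside the training portion only.
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold(train_idx)]
```

The model accuracy reported by cross-validation is then the mean of the five per-fold accuracies, while the 30% test set stays untouched until the final evaluation.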
Machine learning predictive models and statistical analyses were built using the scikit-learn (sklearn) package version 1.0.2 and Python version 3.9. This work develops predictive models with the following machine learning methods, which are among the most advanced and powerful currently used in biomedicine: Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), eXtreme Gradient Boosting (XGB), and Artificial Neural Network (ANN). The LR model uses the maximum likelihood method and gradient descent to solve for its parameters and thereby classify data into two classes; it is a classical method for data classification [17]. The RF model is an ensemble of decision trees built on a bagging framework, which can handle high-dimensional input features and has excellent accuracy [18]. The SVM model is a supervised learning algorithm often used to solve classification problems [19]. The DT represents decisions as a tree structure and is often used to solve problems in the biological sciences because it is easy to understand [20]. XGB is an optimized distributed gradient-boosting library that combines weak learners into a strong learner, significantly increasing classification accuracy [21]. The ANN is a mathematical model that simulates how neurons process information. It has strong generalization ability and robustness and processes data efficiently [22].
Although the LR algorithm performs well in binary classification, its classification accuracy is low for quaternary labels [17]. Therefore, in this work, the LR algorithm was applied only to the binary classification labels, that is, to judging the risk of ARDS. Given the powerful classification ability of the ANN algorithm, this work added the ANN model to the quaternary classification for ARDS severity prediction.
The training set (train_x, train_y) is fed into a machine learning model, which is trained to predict the classification label based on the case feature variables. Then, the test set data test_x is input into the trained model, and the predicted label obtained by the model is compared with the actual label test_y to judge the prediction effect of the model. The Area under the Curve (AUC) was used to compare the overall discriminant ability of the models, and the optimal model was selected according to the AUC value. Combined with the evaluation indicators such as accuracy, precision, recall rate, and F1 value, the classification prediction effect of each model is analyzed and compared.
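To make the evaluation metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from predicted labels, and AUC from predicted scores using its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one). The function names are illustrative assumptions; they mirror, but are not, the scikit-learn metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Quadratic in the sample size, which is fine for a few hundred cases.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy, precision, recall, and F1 depend on the decision threshold that turns scores into labels, whereas AUC summarizes ranking quality across all thresholds, which is why it is used here to select the optimal model.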

Model Interpretation and Optimization
In this work, SHapley Additive exPlanations (SHAP), a standard method for interpreting machine learning models [23], is used to interpret model results. SHAP expresses the model's predicted value as the sum of the values attributed to each input feature, and each feature has a corresponding SHAP value that represents its contribution to the predicted risk of the complication. Many of the dozens of features are noise and reduce the model's classification accuracy if they are not filtered out [24]. Therefore, to further optimize the model, this work screens the significant features according to their SHAP values and retrains the model on the selected features to improve its effectiveness and reduce the influence of noise.
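The additive property described above can be demonstrated with an exact, brute-force Shapley computation on a toy model. This sketch is not the procedure used in the study: SHAP libraries rely on much faster model-specific algorithms, and the toy risk function, its coefficients, and the baseline values below are invented purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2^n, so this only suits a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical risk score; the coefficients are made up."""
    pf_ratio, apache = z
    return 0.9 - 0.002 * pf_ratio + 0.01 * apache


patient = [150.0, 20.0]      # PaO2/FiO2, APACHE II (illustrative values)
reference = [350.0, 8.0]     # illustrative baseline case
phi = shapley_values(toy_risk, patient, reference)

# Additivity: the attributions sum to prediction minus baseline prediction.
gap = toy_risk(patient) - toy_risk(reference)
```

Here `sum(phi)` equals `gap` up to floating-point error, which is the sense in which SHAP decomposes a prediction into per-feature contributions.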

Model Construction and Optimization
In this work, five machine learning algorithms (LR, RF, SVM, DT, XGB) were trained to predict the risk of ARDS in SAP patients (binary classification model). The training set data were randomly split into five groups. In each round, four groups were used for training and the remaining one was held out for testing; the roles were then rotated until every group had served once as the held-out set, and the results of the five rounds were averaged. All of the variables described above were used as inputs to train the five models. The receiver operating characteristic (ROC) curves showing the final prediction performance of the five models are shown in Figure 1. XGB showed the largest AUC (0.839) for the binary outcome (ARDS or non-ARDS), suggesting that it was the best predictor of ARDS, followed by RF with the second highest AUC (0.807).
To further elaborate on the model performance, accuracy, precision, recall, and F1 score (Table 2) were used to compare the prediction effects of the five models. XGB showed the best predictive performance for ARDS, with the highest accuracy (0.84), precision (0.86), recall (0.81), and F1 score (0.83), while RF was the next best. However, ensemble learning methods such as RF and XGB, while effective, do not address the issue of interpretability. SHAP values, on the other hand, are an effective way of interpreting a model post hoc, revealing what the model relied on when it made a correct judgment. Typically, a larger absolute SHAP value for a feature indicates that it has a greater impact on the model. This study therefore uses SHAP values to rank the importance of the features in the best-performing RF and XGB models and to reveal the essential feature variables influencing both models (Figure 2). According to the SHAP value distribution of each feature, the crucial features that accounted for more than 80% of the importance in both models were intersected to obtain seven main features: PaO2/FiO2, APACHE II, SOFA, ALB, AMY, K+, and WBC, which formed the new dataset.
Useless features do not help the model and can reduce its prediction effect. Eliminating irrelevant features reduces the model's complexity while retaining the features that improve its accuracy. Therefore, this work used the data set of the seven selected features to train the five models again. This noise reduction process yields clearer sample characteristics and improved model performance. Using several evaluation metrics (AUC, accuracy, precision, recall, and F1 score), this work compared model performance in predicting ARDS before and after optimization (Figure 3). The prediction performance of all five models improved to varying degrees: the accuracy of the LR, RF, SVM, DT, and XGB models increased by 7%, 6%, 9%, 20%, and 9%, respectively, and the AUC rose by 9%, 6%, 6%, 20%, and 9%, respectively. The remaining metrics (precision, recall, and F1 score) also increased significantly, as detailed in Table 3.
The SHAP summary plot combines feature importance with feature effects to explain the relationship between features and models comprehensively. Each point in the plot represents a sample, with the position on the y-axis determined by the feature and the position on the x-axis determined by the SHAP value of the feature. The colors represent the feature values from small to large, with redder colors indicating larger values and bluer colors indicating smaller values of the feature itself. Figure 4A,B shows that PaO2/FiO2 is very important to the RF model and that a lower PaO2/FiO2 indicates a greater likelihood of ARDS. After excluding the interference of other irrelevant features, and again using the SHAP summary plot, this work determined the order of importance of the seven features with the most significant impact on the RF and XGB models, respectively. The order of importance of the features in the RF model was: PaO2/FiO2, APACHE II, SOFA, ALB, AMY, WBC, K+; in the XGB model, the order was: PaO2/FiO2, AMY, APACHE II, WBC, SOFA, K+, ALB. To further reduce noise and optimize the model effect, the four most important features for predicting the risk of ARDS in SAP patients were selected by combining the ranking of feature importance with expert consensus: PaO2/FiO2, AMY, APACHE II, and SOFA.
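One possible reading of the "80% of importance, intersected across models" rule is sketched below: for each model, features are ranked by importance (for example mean absolute SHAP value), the top-ranked features are kept until their cumulative share reaches the threshold, and the two resulting sets are intersected. The exact procedure is not specified in the text, so this interpretation, the function names, and the importance numbers in the example are all assumptions.

```python
def top_features_by_share(importance, share=0.80):
    """Smallest top-ranked set whose importance sums to at least `share`."""
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, value in ranked:
        selected.append(name)
        running += value
        if running / total >= share:
            break
    return selected


def shared_features(importance_a, importance_b, share=0.80):
    """Features selected for both models, in model A's ranking order."""
    top_a = top_features_by_share(importance_a, share)
    top_b = set(top_features_by_share(importance_b, share))
    return [name for name in top_a if name in top_b]


# Made-up importances for two models; not values from the study.
rf_importance = {"PaO2/FiO2": 50, "APACHE II": 20, "SOFA": 15, "GLU": 10, "Na": 5}
xgb_importance = {"PaO2/FiO2": 40, "AMY": 30, "APACHE II": 15, "SOFA": 10, "Na": 5}
common = shared_features(rf_importance, xgb_importance)
```

Intersecting the two selections keeps only features that both models rely on, which is a conservative choice; the final reduction from seven to four features in this work additionally used expert consensus, which a rule like this cannot reproduce.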
In addition, to help clinicians assess the severity of ARDS early, this work presents a quaternary classification study for the first time. The four characteristic variables obtained by screening were used to construct the models, which include RF, SVM, DT, XGB, and ANN. The newly added ANN model is a complex network formed by interconnecting a large number of processing units, and it has been shown to estimate the probability of risk occurrence more accurately than general statistical methods. The predictive performance metrics of the five models on the test set are shown in Table 4. The ANN model predicted ARDS severity with 86% accuracy; for mild, moderate, and severe ARDS, precision was 82%, 78%, and 86%, recall was 93%, 72%, and 95%, and F1 was 87%, 75%, and 90%, respectively, the best performance among the five models. XGB and RF were the next best models, with prediction accuracies of 81% and 77%, respectively.
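Because F1 is the harmonic mean of precision and recall, the per-class values reported for the ANN model can be checked for internal consistency. The short sketch below recomputes F1 from the rounded precision and recall quoted in the text; the dictionary layout is only an illustrative convenience.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the ANN model, as quoted in the text.
reported = {
    "mild": (0.82, 0.93, 0.87),
    "moderate": (0.78, 0.72, 0.75),
    "severe": (0.86, 0.95, 0.90),
}

for grade, (precision, recall, f1_reported) in reported.items():
    recomputed = round(f1_score(precision, recall), 2)
    print(grade, recomputed, f1_reported)
```

The recomputed values agree with the reported F1 scores to two decimal places. The same check applied to the binary XGB model (precision 0.86, recall 0.81) gives 0.83, matching the reported F1.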

Discussion
ARDS is one of the most common complications in patients with SAP, affecting on average one in every three patients [25,26]. In patients with acute onset of bilateral pulmonary edema accompanied by refractory hypoxemia, in-hospital mortality can exceed 30% [27]. Therefore, early identification of patients at high risk of ARDS and prediction of the severity of the disease is essential. More than 20 scoring systems, such as the RANSON and BISAP scores, have been developed clinically to predict the deterioration of SAP patients. However, the clinical utility of these scores still needs to be determined [28]. Moreover, prediction systems based on rule learning methods have limitations when dealing with complex data and complex data relationships [29]. The Lung Injury Prediction Score (LIPS) is one of the most commonly used scores to predict ARDS in high-risk patients [30], but prospective follow-up shows that its positive predictive value is limited [31].
In recent years, machine learning has been increasingly applied to the medical field because of its accuracy, objectivity, and rapidity in improving the reliability and accuracy of diagnostic systems for specific diseases [32]. It is crucial in the prediction of disease. Therefore, this work attempts to establish an early prediction model for the occurrence and severity of ARDS in SAP patients by using available clinical objective data in combination with machine learning methods to help clinicians make rapid and accurate judgments on high-risk patients.
In this work, five machine learning algorithms (LR, RF, XGB, SVM, and DT) were used to predict the occurrence of ARDS in SAP, and all performed well on the evaluation metrics. Among them, the XGB model had the best overall prediction performance: its AUC reached 0.93 after feature optimization, and its other metrics, such as accuracy and precision, were better than those of the other four models. XGB is an optimized distributed gradient boosting library with the advantages of high efficiency, flexibility, and portability. Le et al. [10] independently verified the value of the XGB model in the early prediction of ARDS, reporting AUC values of 0.827, 0.810, and 0.790 for the detection of ARDS at 12 h, 24 h, and 48 h prior to onset, respectively. These AUC values are slightly lower than ours, possibly because the investigators did not optimize the characteristic variables. Noise in the data may make a model and its predictions unstable, thereby reducing generalization [24]. Another study, which predicted the duration of mechanical ventilation in patients with ARDS, also found that the XGB model stood out among many models and exhibited the most stable and accurate performance; its authors called it the "optimal model" [33].
The ANN algorithm is known for its powerful classification ability [22]. A retrospective study conducted by Yang et al. [34] also confirmed that the ANN model is a powerful tool for predicting ARDS after SAP. However, that study included only a small number of characteristic variables that the authors subjectively considered relevant, which may have overlooked some critical potential indicators and thus reduced the accuracy of the model. Therefore, in this work, five models (RF, XGB, SVM, DT, and ANN) were trained to predict ARDS severity using the critical features obtained by feature optimization. The systematic comparison shows that the ANN model reflects the severity of ARDS most accurately among the five models, with an accuracy of 86%. The XGB model, which performed best in predicting the occurrence of ARDS, ranks second with an accuracy of 81%.
In addition, this work explains the ranking of essential characteristic variables in the SHAP summary plot in combination with expert opinions. PaO2/FiO2 has the highest contribution value in the model. It is one of the most commonly used indicators for the diagnosis and severity grading of ARDS in clinical practice, and studies have shown that dynamic monitoring of its changes can detect mild forms of ARDS in a timely manner, which helps to start lung-protective ventilation strategies immediately and avoid treatment delay [27]. A study of patients at high risk of ARDS in the intensive care unit showed that plasma amylase was markedly increased even in the absence of pancreatitis [35]. The relationship between plasma amylase and other lung diseases has also been confirmed; its elevation was even related to adverse clinical outcomes in patients with coronavirus disease 2019 [36]. This may be because a large amount of pancreatic amylase can cause pulmonary vasoconstriction by activating the relevant complement and promoting the release of bioactive substances such as histamine, which leads to pulmonary dysfunction [37]. APACHE II and SOFA scores have been shown to reflect the dynamic changes in the clinical condition of critically ill patients effectively and have high predictive value [38]. They are widely used and well regarded for predicting secondary multiple organ failure in SAP patients [39]. In addition, ALB and WBC levels also play a role in the prediction model. Ocskay et al. [40] found that about one fifth of AP patients had hypoproteinemia at admission and that severe hypoproteinemia was independently related to subsequent organ failure. In systemic inflammation, the production of albumin may be reduced to make room for the production of proinflammatory cytokines [41]. Elevated proinflammatory cytokines such as IL-6 and IL-8 are associated with lung injury [27]. The elevation of leukocytes due to severe inflammation during SAP can increase inflammatory mediators through a variety of signal transduction mechanisms and induce the occurrence and development of ARDS [42].
One strength of this work is that the characteristic variables are all taken from routine clinical tests, so no additional cost is incurred in collecting data. Secondly, noise reduction is repeated during the optimization of the characteristic variables, which makes the sample characteristics clearer and improves the prediction effect of the models. As confirmed by the AUC value and other evaluation indicators, the five models improved markedly after optimization. More importantly, our study is the first to predict the severity of ARDS in SAP patients. The prediction accuracy of the trained ANN model reaches 86%, which helps doctors implement the appropriate degree of respiratory support for patients at high risk of ARDS as early as possible. This can prevent further deterioration of the patient's condition and has significant clinical application value.
It must be acknowledged that this research has certain limitations. First, this is a single-center retrospective study, which may have some selection bias. Multicenter prospective studies are needed to further verify the prediction effect of the models. Secondly, some new biomarkers (such as Ang-2, sRAGE, and cytokines) [43] are not included in our prediction indicators because they are costly and not widely measured in clinical practice, which may limit the prediction effect of the model. Even so, the prediction models proposed in this study can help clinicians predict ARDS in SAP early and judge its severity.

Conclusions
In summary, this work screened four important indicators by machine learning methods and developed predictive models for the risk of developing ARDS in SAP patients; the models performed well. This work is also the first quaternary classification study to establish five models for early assessment of ARDS severity, and the accuracy of the ANN model reached 86%, which suggests the potential value of machine learning models in predicting complications in SAP patients. This work combines machine learning with SHAP, which helps optimize the model while explaining it in depth. This approach may also be applied to the risk prediction of other diseases and provide better explanations. Future work will consider expanding the amount of data through multicenter research to further adjust and optimize the model.

Institutional Review Board Statement: The protocol is in accordance with the ethical guidelines of the 1975 Declaration of Helsinki, as reflected by prior approval by one Chinese tertiary academic hospital (2018ZDSYLL070-P01 from Zhongda Hospital, Southeast University). All participants will be informed that their anonymous data will be used for research and publication. Relevant information about this project will be provided to potentially eligible participants in oral and written form so that they can decide whether to participate. Each participant will be informed that he/she can withdraw from the study at any time and that withdrawal will not affect his or her subsequent medical treatment.

Informed Consent Statement: Not applicable.
Data Availability Statement: All data generated or analysed during this study are included in this published article.

Conflicts of Interest: The authors declare there is no conflict of interest regarding the publication of this paper.