Prediction of Perforated and Nonperforated Acute Appendicitis Using Machine Learning-Based Explainable Artificial Intelligence

Background: The primary aim of this study was to create a machine learning (ML) model that can predict perforated and nonperforated acute appendicitis (AAp) with high accuracy and to demonstrate the clinical interpretability of the model with explainable artificial intelligence (XAI). Method: A total of 1797 patients who underwent appendectomy with a preliminary diagnosis of AAp between May 2009 and March 2022 were included in the study. Based on the histopathological examination, the patients were divided into two groups: AAp (n = 1465) and non-AAp (NA; n = 332); the non-AAp group is also referred to as negative appendectomy. Subsequently, patients confirmed to have AAp were divided into two subgroups: nonperforated (n = 1161) and perforated AAp (n = 304). Missing values in the data set were imputed using the Random Forest method. The Boruta variable selection method was used to identify the most important variables associated with AAp and perforated AAp. The class imbalance problem in the data set was addressed with the SMOTE method. The CatBoost model was used to classify AAp versus non-AAp patients and perforated versus nonperforated AAp patients. The performance of the model in the holdout test set was evaluated with accuracy, F1-score, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). The SHAP method, one of the XAI methods, was used to interpret the model results. Results: The CatBoost model could distinguish AAp patients from non-AAp individuals with an accuracy of 88.2% (85.6–90.8%), while distinguishing perforated AAp patients from nonperforated AAp individuals with an accuracy of 92% (89.6–94.5%). According to the results of the SHAP method applied to the CatBoost model, high total bilirubin, WBC, Neutrophil, WLR, NLR, CRP, and WNR values, and low PNR, PDW, and MCV values shifted the prediction toward AAp. On the other hand, high CRP, Age, Total Bilirubin, PLT, RDW, WBC, MCV, WLR, NLR, and Neutrophil values, and low Lymphocyte, PDW, MPV, and PNR values were observed to shift the prediction toward perforated AAp. Conclusion: For the first time in the literature, a new approach combining ML and XAI methods was used to predict AAp and perforated AAp, and both clinical conditions were predicted with high accuracy. This new approach proved successful in showing which demographic and biochemical parameters explain the clinical situation, and how strongly, when predicting AAp and perforated AAp.


Introduction
Acute appendicitis (AAp) is one of the most common causes of admission to emergency departments due to abdominal pain [1][2][3][4]. Obstruction of the lumen of the appendix vary depending on the existence and severity of inflammation, and their clinical use is efficient [1][2][3][7][11].
All the reasons above show how important biochemical parameters are for predicting AAp and perforated AAp. The studies conducted to date have been performed using standard biostatistical analysis methods, in which clinical, radiological, and biochemical parameters can be used to predict AAp and perforated AAp. However, the results' usability differs from center to center, and therefore the generalizability of the results to the population has been the topic of serious debate. All these factors have paved the way for the use of artificial intelligence (AI) models that will minimize the effect of the human factor in predicting AAp and perforated AAp.
The machine learning (ML) method, one of the AI methods that can be used in estimating AAp, has been demonstrated recently [11]. Unlike traditional statistical techniques, ML is a sub-field of AI that aims to make predictions about new observations by learning based on existing data. However, a significant problem in many state-of-the-art ML models is the lack of transparency, interpretability, and explainability. To overcome these shortcomings, explainable artificial intelligence (XAI) has recently started to attract more attention in clinical research. In this context, XAI deals with methods that aim to make ML models more understandable/interpretable by clinicians [20]. The Shapley Additive Explanations (SHAP) method, which is one of the XAI methods, determines the numerical values that show the direction and magnitude of the variable contributions to the estimations of the ML models and provides the visualization of the variable contributions [21]. This study aims to predict AAp and perforated AAp with ML models using patients' clinical and biochemical blood parameters and interpret the results with SHAP, which is an XAI approach. From this point of view, it is thought that this study represents an important step forward for the use of XAI models for AAp, which is one of the most common reasons for admission to emergency services.
The main findings and contributions of this article are listed below:
• An ML model was created to accurately predict patients with AAp and perforated AAp. The importance of the SHAP-based methodology was examined to explain the model, which can assist clinicians in diagnosing AAp and perforated AAp.
• ML and SHAP are useful in diagnosing and treating AAp and perforated AAp, in setting future treatment goals, and in personalizing medication administration.

Study Design and the Related Dataset
Between May 2009 and March 2022, 1797 patients underwent appendectomy with a preliminary diagnosis of AAp at the Department of Surgery of Inonu University Faculty of Medicine. Based on the histopathological findings, these patients were divided into two main groups: AAp (n = 1465; 81.5%) and non-AAp (n = 332; 18.5%). The 1465 patients with confirmed AAp were then divided into two subgroups: nonperforated (n = 1161; 79.2%) and perforated AAp (n = 304; 20.8%). The presence of inflammatory cell infiltration in the appendectomy specimen without perforation was referred to as nonperforated AAp, perforation with inflammatory cell infiltration was referred to as perforated AAp, and the absence of inflammatory cell infiltration was referred to as non-AAp.

Data Preprocessing and Modeling
Missing values in the data set were imputed with the random forest method. The Boruta feature selection method was used to determine the most important variables (predictive factors) for predicting AAp and its subgroup (perforated AAp). The class imbalance problem in the data set was addressed with the SMOTE method. The data were split 80:20 into training and test sets. To obtain a more robust prediction model, avoid biased results, and limit overfitting, the holdout procedure was repeated 50 times with different random seeds, and the average performance was calculated across these 50 repetitions (Figure 1). The CatBoost model was used to predict patients with AAp and perforated AAp. The CatBoost model's hyperparameters, which affect the performance of the prediction model, were optimized using grid search with 10-fold cross-validation and 5 replicates. The model's performance was evaluated with respect to accuracy, F1-score, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC). The SHAP method, one of the XAI approaches, was used to interpret the model results. The methods used in the study are explained in the following subsections. Figure 1 provides an overview of the methodology.
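The performance measures named above can be illustrated with a minimal Python sketch. It assumes binary labels coded as 1 (positive class) and 0 (negative class); the function names and the toy labels and scores are hypothetical and are not taken from the study data. The AUC is computed here through its rank interpretation (the share of positive-negative pairs that are ordered correctly), which is one standard way to obtain it.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / denom if denom else 0.0,
    }


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not study data).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
scores = [0.9, 0.4, 0.8, 0.3, 0.6, 0.7]
metrics = classification_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

In a repeated holdout design such as the one described, these quantities would be computed on each of the 50 test sets and then averaged.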

Random Forest Missing Value Imputation
RF calculates an n × n proximity matrix to evaluate the similarity of observations during missing value imputation. The off-diagonal elements of the matrix indicate how similar two different observations are. Based on these proximity values, RF imputes iteratively through the following steps: after an initial median imputation, a forest is built and proximities are computed; a proximity-based weighted mean is then used to determine new imputed values; a new forest is constructed using this updated data set, yielding new proximities and imputed values [22].
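The proximity-weighted update at the core of this procedure can be sketched as follows. This is a minimal illustration of a single update for one continuous variable, assuming that a proximity matrix has already been obtained from a fitted forest; the function name and the toy numbers are hypothetical, and growing the forest itself is omitted.

```python
def proximity_weighted_impute(values, proximity):
    """Replace missing entries (None) with a proximity-weighted mean of observed entries.

    values:    list of floats, with None marking a missing value
    proximity: n x n list of lists; proximity[i][j] is the similarity of rows i and j
    """
    observed = [j for j, v in enumerate(values) if v is not None]
    imputed = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        weights = [proximity[i][j] for j in observed]
        total = sum(weights)
        if total == 0:
            # No similar observed row: fall back to the plain mean.
            imputed[i] = sum(values[j] for j in observed) / len(observed)
        else:
            imputed[i] = sum(w * values[j] for w, j in zip(weights, observed)) / total
    return imputed


# Toy example: row 1 is missing and is more similar to row 0 than to row 2.
values = [10.0, None, 20.0]
proximity = [
    [1.0, 0.75, 0.1],
    [0.75, 1.0, 0.25],
    [0.1, 0.25, 1.0],
]
result = proximity_weighted_impute(values, proximity)
```

In the full iterative scheme, a new forest would be grown on the updated data and this step repeated with the new proximities.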

Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE is an oversampling approach proposed by Chawla et al. [23]. The SMOTE algorithm creates synthetic data based on feature space similarities between existing minority observations. To generate a new synthetic minority class observation, SMOTE randomly chooses a minority class observation (a) and locates its k nearest minority class neighbors. Then, one of these k nearest neighbors (b) is randomly selected, and the synthetic observation is placed on the line segment connecting a to b in the feature space. In other words, each synthetic observation is a convex combination of the two chosen observations a and b [24].
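The interpolation step described above can be written compactly. The following Python sketch generates one synthetic observation as a + λ(b − a), with λ drawn uniformly from [0, 1); it assumes numeric features and Euclidean distance, and the function names and toy points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def smote_sample(minority, k, rng):
    """Create one synthetic minority observation on the segment between a and b."""
    a = rng.choice(minority)                      # random minority observation
    others = [p for p in minority if p is not a]  # candidates for neighbors
    neighbors = sorted(others, key=lambda p: squared_distance(a, p))[:k]
    b = rng.choice(neighbors)                     # one of the k nearest neighbors
    lam = rng.random()                            # interpolation weight in [0, 1)
    return [ai + lam * (bi - ai) for ai, bi in zip(a, b)]


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(20)]
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region spanned by the minority class.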

Boruta Feature Selection
The Boruta algorithm is a feature selection algorithm built around the RF classification method. Boruta employs shadow features, which are copies of the original features whose values are randomly reassigned across observations; decision trees are therefore grown on the shadow features as well as on the original ones, and the importance of each original feature is compared with the importance attained by the shadow features. In addition, this algorithm considers multivariable relationships and can investigate interactions between variables [25].
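The shadow-feature comparison can be illustrated with a simplified Python sketch. Note the substitution: Boruta itself uses RF importance scores, whereas this sketch uses the absolute Pearson correlation with the target as a stand-in importance measure so that it runs with the standard library alone. The function names and the toy data are hypothetical.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / math.sqrt(vx * vy)


def shadow_hits(columns, target, n_rounds, seed):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in columns.values():
            shuffled = list(values)
            rng.shuffle(shuffled)  # shadow feature: same values, random order
            shadow_scores.append(abs_corr(shuffled, target))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return hits


# Toy data: one feature linearly related to the target, one uncorrelated with it.
target = [float(i) for i in range(10)]
columns = {
    "informative": [2.0 * t + 1.0 for t in target],
    "noise": [(t - 4.5) ** 2 for t in target],
}
hits = shadow_hits(columns, target, n_rounds=20, seed=0)
```

In the actual algorithm, the hit counts are assessed with a statistical test before a feature is confirmed or rejected; that step is omitted here.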

CatBoost
CatBoost is a gradient boosting technique presented by Prokhorenkova et al. [26] and Dorogush et al. [27] that works with categorical features with the least information loss [28]. First, it employs ordered boosting, a highly efficient variation of gradient boosting methods, to address the issue of target leakage. Second, this approach works well with small datasets. Third, CatBoost is capable of handling categorical features. Such handling is usually carried out during the preprocessing phase and consists primarily of replacing the original categorical variables with one or more numerical values. Furthermore, Bakhareva et al. [29] found that CatBoost can be successfully applied to various data types and formats. Another feature of the approach, as mentioned by Dorogush et al. [27], is that it uses random permutations to estimate leaf values while selecting the tree structure, hence avoiding the overfitting produced by typical gradient boosting algorithms.
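The way a categorical value can be replaced by a number without leaking the target can be sketched with ordered target statistics, the idea described by Prokhorenkova et al. [26]. In the Python sketch below, each row is encoded using only the targets of rows that precede it in a random permutation, smoothed toward a prior. This is a simplified illustration and not the library's internal implementation; the function name, the parameters prior and weight, and the toy data are hypothetical.

```python
import random


def ordered_target_stats(categories, targets, prior, weight, seed):
    """Encode a categorical column using only 'earlier' rows of a random permutation."""
    rng = random.Random(seed)
    order = list(range(len(categories)))
    rng.shuffle(order)                    # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[idx] = (s + weight * prior) / (n + weight)
        # Only now does the row's own target become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


# Toy example (hypothetical values, not study data).
categories = ["a", "a", "b", "a"]
targets = [1, 0, 1, 1]
encoded = ordered_target_stats(categories, targets, prior=0.5, weight=1.0, seed=0)
```

Because a row's own target never enters its encoding, the first row of each category in the permutation receives exactly the prior.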

Explainable Artificial Intelligence (XAI)
Computational learning theory and the study of pattern recognition led to the development of ML, a sub-branch of AI. ML is a collection of techniques and algorithms that can predict future events or classify data by learning patterns from previously collected data. Today, the complexity and volume of data far exceed human beings' capacity to interpret them quickly. This is where ML comes into play, enabling accurate forward-looking analysis of complex data [30]. In various industries, including the medical sciences, ML approaches have had significant success with predictive models in analyzing structured datasets. Most models developed by data scientists focus on the model's accuracy in predicting the disease of interest, but models rarely explain these predictions. This is the black box feature of ML [31]. Traditional ML metrics such as AUC, accuracy, and recall may not be sufficient in many applications where the user must rely on ML system predictions. Understanding, explaining, and interpreting ML approaches is essential. While ML techniques have been in use for decades, their spread to areas such as healthcare has led to a greater emphasis on explanations in ML. The interpretability of model predictions is a priority for clinical practitioners regarding application and use. ML models that can explain why certain predictions are produced are called explainable AI models [32].
XAI methods offer two types of interpretability: global and local. Global interpretability is the ability to examine the structure and parameters of a complex model and understand how the model works as a whole. Local interpretability, on the other hand, examines an individual prediction of a model and attempts to understand why the model made that particular decision. In this study, SHAP, one of the methods that supports global interpretation, was used.

Shapley Additive Explanations (SHAP)
Difficulties in interpreting ML models and their predictions limit ML's practical applicability and confidence. Model interpretability often depends on estimating the contribution of individual characteristics (independent variables) to the model's results. Explainable approaches are needed to assist in the interpretation of ML models. To this end, the SHAP methodology was recently introduced [33].
SHAP is a method used in ML to explain the individual and global predictions of the model. The technique is theoretically based on optimal Shapley values. The technical definition of the Shapley value is the average marginal contribution of the value of a variable over all possible coalitions. In other words, Shapley values consider all potential estimates for an observation (sample) using all possible combinations of variables. Therefore, SHAP is a unified approach that provides global and local consistency and interpretability. In this context, it can be stated that the purpose of SHAP is to explain the estimation of any observation by calculating the contribution of each variable to the estimation [34]. The flow chart of all the methods used in the study is given in Figure 1.
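The definition of the Shapley value as an average marginal contribution can be made concrete with a small example. The Python sketch below computes exact Shapley values by averaging each variable's marginal contribution over all orderings; this brute-force approach is only feasible for a handful of variables, and SHAP implementations rely on faster algorithms. The value function and its numbers are a hypothetical toy model, not results from this study.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, players):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    contrib = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        previous = value_fn(coalition)
        for p in order:
            coalition = coalition | {p}
            current = value_fn(coalition)
            contrib[p] += current - previous  # marginal contribution of p
            previous = current
    n_orders = factorial(len(players))
    return {p: c / n_orders for p, c in contrib.items()}


def toy_value(coalition):
    """Hypothetical model output when only the variables in `coalition` are known."""
    value = 0.0
    if "CRP" in coalition:
        value += 2.0
    if "WBC" in coalition:
        value += 1.0
    if "CRP" in coalition and "WBC" in coalition:
        value += 1.0  # interaction term, shared equally between the two
    return value


players = ["CRP", "WBC", "Age"]
phi = shapley_values(toy_value, players)
```

In this toy model, the contributions sum to the difference between the output with all variables known and the output with none, and a variable that never changes the output ("Age" here) receives a contribution of zero.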

Study Protocol and Ethics Committee Approval
This retrospective case-control study involving human participants was performed following the ethical standards of the institutional and national research committee and in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. First, the required permissions were obtained from the Directorate of Surgery. Then, ethical approval was obtained from the Inonu University Institutional Review Board (IRB) for non-interventional studies (2022/3481). STROBE (strengthening the reporting of observational studies in epidemiology) guidelines were utilized to assess the likelihood of bias and overall quality for this study [35].


Acute Appendicitis versus Negative Acute Appendicitis
The total of 1797 patients in this retrospective study comprised 1465 (81.5%) patients with AAp and 332 (18.5%) patients with non-AAp. Of the patients, 993 (55.3%) were male (median age: 33 years; IQR: 23) and 804 (44.7%) were female (median age: 34 years; IQR: 26). The median age of patients with AAp was 33.1 years (IQR: 25), and the median age of patients with non-AAp was 33 years (IQR: 24). The SHAP method was used to visually explain how the variables in the model affect the prediction of AAp. Figure 2 shows the candidate markers ranked by normalized SHAP value, together with their importance levels for AAp. These findings showed that TBil, PNR, and PDW were the three most important predictive markers for AAp. Figure 3 was created by considering positive and negative SHAP values. A positive SHAP value indicates that the contribution to the target variable (AAp) is positive, and a negative SHAP value indicates that the contribution is negative. In addition, points closer to blue represent lower values of the variable, and points closer to pink represent higher values. Therefore, higher TBil, WBC, Neutrophil, WLR, NLR, CRP, and WNR values and lower PNR, PDW, and MCV values indicate an increased risk of AAp. When the normalized SHAP values in Table 2 were examined, the five most predictive factors for AAp were TBil, PNR, PDW, MCV, and WBC. The explanatory powers of these five biochemical markers for AAp were 16.6%, 16%, 13.3%, 11.1%, and 9.2%, respectively.
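One common way to obtain percentage importances of this kind is to take each variable's mean absolute SHAP value and divide it by the sum over all variables. The Python sketch below assumes this normalization; the exact formula used for the reported percentages is not stated here, and the function name and toy SHAP values are hypothetical.

```python
def normalized_importance(shap_rows, feature_names):
    """Share (%) of the total mean absolute SHAP value attributed to each feature.

    shap_rows: one list of SHAP values per observation, ordered as feature_names
    """
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: 100.0 * m / total for name, m in zip(feature_names, mean_abs)}


# Toy SHAP values for two observations and two variables (not study data).
shap_rows = [
    [0.2, -0.1],
    [-0.4, 0.1],
]
importance = normalized_importance(shap_rows, ["TBil", "PNR"])
```

Under this normalization the percentages of all variables sum to 100, so a value such as 16.6% reads as that variable's share of the model's total attribution.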

Nonperforated AAp versus Perforated AAp
In this section, the 1465 patients with AAp were divided into two subgroups according to their perforation status: perforated (n = 304; 20.8%) and nonperforated AAp (n = 1161; 79.2%). Of the patients, 847 (57.8%) were male (median age: 33 years; IQR: 22), and 618 (42.2%) were female (median age: 34 years; IQR: 26). The median age of patients with perforated AAp was 43 years (IQR: 32.75), and the median age of patients with nonperforated AAp was 32 years (IQR: 22). The SHAP method was used to visually explain how the variables in the model affect the prediction of perforated AAp. Figure 4 shows the demographic and biochemical markers evaluated by the normalized SHAP value and the order of importance of these factors. These variable importances are given in ascending order. The three most determinative factors for perforated AAp were CRP, PDW, and Age. Figure 5 was created by considering positive and negative SHAP values. Accordingly, higher CRP, Age, TBil, PLT, RDW, WBC, MCV, WLR, NLR, and Neutrophil values, and medium and low Lymphocyte, PDW, MPV, and PNR values indicate an increased risk of perforated AAp. A CRP value higher than 12.80 was the most critical biochemical marker for predicting perforated AAp. When the normalized SHAP values in Table 4 were examined, the five most determinative factors for perforated AAp were CRP, PDW, Age, MPV, and TBil. The explanatory powers of these five factors for perforated AAp were 26.5%, 11.3%, 10.2%, 5.5%, and 5.2%, respectively.
Diagnostics 2023, 13, 1173

Discussion
Accurately classifying patients admitted to emergency services with a preliminary diagnosis of AAp using appropriate diagnostic algorithms protects patients both from unnecessary surgery due to misdiagnosis and from complications (perforation, abscess, etc.) that may develop when true cases are missed. Furthermore, correct estimation minimizes the patient's treatment cost and workforce loss.
ML is a subset of AI that uses statistical approaches to give computer systems the capacity to learn and improve over time. ML, in particular, refers to AI tools that may update their models to improve predictions, resulting in a gradual performance improvement at the defined task. In theory, ML approaches may be used on a dataset of any size; nonetheless, more data gives the model more experience to learn from. In accordance with the ML working principle, the features of the data are fed into computer models that can provide insights into the data, such as grouping similar observations or forecasting certain events [36]. ML has attracted increasing attention in medical research in recent years, with a wide range of applications being investigated. Many studies have analyzed different parts of the healthcare system, reporting improvements from ML in illness prevention, screening, treatment, and prognosis prediction [37].
In the last decade, with the availability of large datasets and greater computing power, ML methods have achieved high performance in various situations. However, the main problem with many of the models used is the lack of transparency, explainability, and interpretability. In light of these problems, XAI has recently started attracting more attention. Briefly, XAI is the collection of methods or techniques that aim to make AI applications understandable to users. The aim of XAI is to make the computational inferences behind the decisions of AI, whose processes are generally difficult to grasp, understandable to users and researchers. Because ML often does not provide direct explanations for why or how predictions and results are obtained, it is difficult to show why a model makes certain decisions. For this reason, explainable AI methods have been developed and applied to different models [20].
In this study, we aimed to predict AAp and its complications by combining ML and XAI models, which have been used in many areas of health care. In other words, from an epidemiological point of view, we aimed to minimize Type I (false positive) and Type II (false negative) error rates by using ML and XAI models.
To summarize the study presented here: first, the AAp and perforated AAp statuses of the patients were determined with the CatBoost model based on decision trees, which is one of the complex models, to increase prediction accuracy. Second, the global explanation method SHAP was used to reduce the opacity of the complex CatBoost model. The CatBoost model could distinguish AAp patients from NA with an accuracy of 88.2% (85.6-90.8%) while discriminating perforated AAp patients from nonperforated AAp patients with an accuracy of 92% (89.6-94.5%). The main reason for the higher distinguishing accuracy in perforated AAp patients is the higher elevation of inflammation-related biochemical blood parameters during perforation compared to nonperforated AAp.
In addition, through the proposed XAI approach, it was possible to list the most important biochemical blood parameters that can be used to predict AAp and perforated AAp. According to this evaluation, the most important biochemical blood parameters for AAp prediction were, in order, TBil, PNR, PDW, MCV, WBC, CRP, Neutrophil, WNR, WLR, and NLR. The SHAP results also showed whether higher or lower values of these parameters shifted the prediction toward AAp. Accordingly, higher TBil, WBC, Neutrophil, WLR, NLR, CRP, and WNR values and lower PNR, PDW, and MCV values were associated with AAp. Similarly, the most important parameters for perforated AAp estimation were found to be CRP, PDW, Age, MPV, and TBil.
Some studies on the prediction of AAp by AI methods have been published in the literature. In one study, the support vector machine method was used to differentiate complicated AAp from non-complicated AAp, and the accuracy, sensitivity, specificity, and Matthews correlation coefficient were 83.56%, 81.71%, 85.33%, and 67.32%, respectively [38]. In another study, Logistic Regression, Naive Bayes, Generalized Linear, Decision Tree, Support Vector Machine, Gradient Augmented Tree, and Random Forest methods were used to predict whether appendicitis is acute or subacute. Among the methods, the random forest method gave the best results, with 83.75% accuracy, 84.11% precision, 81.08% sensitivity, and 81.01% specificity [39]. Akmese et al. [11] compared the prediction success of various ML algorithms for the early diagnosis of AAp and reported that the gradient boosted tree algorithm performed best, with an accuracy of 95.31%.
In a study conducted with children and adolescents between the ages of 0 and 17 at a hospital in Germany, the complete blood counts of 590 patients (473 with appendicitis and 117 with negative histopathological findings) were analyzed. In the study, AAp patients were predicted using ML methods. The model was trained using the data of 35% of the patients, and 65% of the data were used for validation. In the study, 90% accuracy (with 93% sensitivity and 67% specificity) was obtained for the diagnosis of AAp [40]. Compared to both studies mentioned above, the accuracy of the current research in predicting AAp appears to be relatively lower (88.2% (85.6-90.8%)). This is because the biochemical parameters associated with AAp tend to increase more than normal due to the nature of the pediatric patients included in those studies. It is also important to evaluate the AUC along with accuracy. In our study, our model differentiated AAp patients from non-AAp patients with a very good AUC value of 94.7% (91.3-96.2%).
Most studies in the literature have used complex ML models for AAp prediction, but to the best of our knowledge, there are no studies on using XAI in predicting AAp and its complications. The primary contribution of the present study to the literature is its combination of ML and XAI. In addition, although most studies in the literature have examined AAp, there is limited research on perforated AAp. The secondary contribution of the present study to the literature is the interpretable estimation of perforated AAp using XAI.
Most studies conducted with conventional statistical methods reveal which parameters predict AAp and perforated AAp and show how changes in these parameters (such as a fall or rise) relate to AAp. However, conventional analyses fall short of quantifying the importance of demographic and biochemical parameters and their ability to explain the clinical situation. ML/XAI approaches, on the other hand, reveal not only what conventional statistical methods show but also the extent (%) to which the significant parameters explain the clinical situation at hand.
Another study reporting on the current state of the art in postoperative risk estimation tackled the limitations of previous techniques and how they were used in practical settings. Additionally, the possibility of systematically incorporating machine learning models into health care in a broader sense and the future prospects beyond passive risk prediction were discussed [41]. Similarly, the current study investigated the prediction of perforated and nonperforated acute appendicitis using machine learning-based XAI, and evaluated potential implementations of the proposed algorithm integrated with XAI methods. Additionally, XAI techniques incorporated into AI/ML algorithms were of great importance for interpretable outcomes of the response variable associated with the explanatory factors. More explainable estimates could be obtained if different factors related to the disease and other AI/ML methods are used. This may limit the outputs of this study achieved from these models. The proposed approach with novel XAI methods may better highlight the results achieved from AI/ML methods.

Limitations
As in other retrospective studies, this study has some limitations. First, most clinical data were excluded from the study, since most of the clinical characteristics of the patients (location of pain, duration, nausea, vomiting, anorexia) were not recorded in the hospital's data processing system. Second, radiological data (US or CT) of approximately 11% of the patients included in this study could not be accessed. Excluding the patients whose radiological examinations could not be accessed would decrease the sample size available for the ML models and worsen the class imbalance problem in the data set. For this reason, the radiological data of the patients were not included in the modeling. These limitations could be addressed with prospective multi-center studies.

Conclusions
In conclusion, although studies using ML methods for AAp and perforated AAp estimation have been reported in the literature, no studies have combined ML and XAI. Therefore, the present study is the first to combine ML and XAI models to determine the biochemical blood parameters that predict AAp and perforated AAp. The results will help clinicians identify individuals at risk by indicating which biochemical blood parameters to pay attention to in patients with AAp.
Informed Consent Statement: Written and verbal informed consent were obtained from all subjects involved in the study before any intervention.