Feature Importance Analysis of a Deep Learning Model for Predicting Late Bladder Toxicity Occurrence in Uterine Cervical Cancer Patients

Simple Summary This study developed a prediction model for late bladder toxicity in patients with uterine cervical cancer undergoing radiation therapy. A deep learning (DL) model was trained on data from 281 patients and compared its performance with a multivariable logistic regression model. The DL model outperformed the regression model, achieving higher accuracy, recall, F1-score, and area under the receiver operating characteristic curve. Specifically, based on the feature importance analysis, the DL model identified the doses for the most exposed 2 cc volume of the bladder (BD2cc), BD5cc, and ICRU bladder point as high-priority features. Finally, the lightweight DL model, which was designed to focus on the top five important features, demonstrated superior predictive capabilities, highlighting its potential in improving patient outcomes and minimizing treatment-related complications with secured reliability. Abstract (1) In this study, we developed a deep learning (DL) model that can be used to predict late bladder toxicity. (2) We collected data obtained from 281 uterine cervical cancer patients who underwent definitive radiation therapy. The DL model was trained using 16 features, including patient, tumor, treatment, and dose parameters, and its performance was compared with that of a multivariable logistic regression model using the following metrics: accuracy, prediction, recall, F1-score, and area under the receiver operating characteristic curve (AUROC). In addition, permutation feature importance was calculated to interpret the DL model for each feature, and the lightweight DL model was designed to focus on the top five important features. (3) The DL model outperformed the multivariable logistic regression model on our dataset. It achieved an F1-score of 0.76 and an AUROC of 0.81, while the corresponding values for the multivariable logistic regression were 0.14 and 0.43, respectively. The DL model identified the doses for the most exposed 2 cc volume of the bladder (BD2cc) as the most important feature, followed by BD5cc and the ICRU bladder point. In the case of the lightweight DL model, the F-score and AUROC were 0.90 and 0.91, respectively. (4) The DL models exhibited superior performance in predicting late bladder toxicity compared with the statistical method. Through the interpretation of the model, it further emphasized its potential for improving patient outcomes and minimizing treatment-related complications with a high level of reliability.


Introduction
Cervical cancer is the third most commonly diagnosed malignancy in women worldwide, as reported in 2022 [1]. With early detection and advances in treatment, the number of cervical cancer survivors has increased over the past 40 years. Nevertheless, most patients with cervical cancer suffer from various morbidities resulting from the disease itself or from the treatment. Depending on the treatment strategy (surgery, radiotherapy (RT), or chemotherapy), the morbidities may include symptoms associated with the gastrointestinal or urinary tract, lymphedema, and sexual dysfunctions.
External beam radiation therapy (EBRT) with concurrent chemotherapy followed by intracavitary brachytherapy represents the standard treatment for locally advanced cervical cancers [2,3]. Owing to the very high dose of radiation delivered to the pelvic area using EBRT and brachytherapy, radiation-induced toxicity in these patients is not negligible [4]. Among the irradiated pelvic organs, radiation-induced bladder toxicity is the most commonly observed morbidity in cervical cancer patients who have received curative radiotherapy, with an incidence of approximately 20% [4]. Although severe bladder toxicities have become rare with image-guided brachytherapy, they still pose a considerable risk [5].
Predicting the occurrence of bladder toxicity before treatment is crucial for improving patients' long-term health outcomes and quality of life, as it reduces the probability of treatment-related complications or interruptions. Previous studies have focused on investigating prognostic factors associated with radiation-induced bladder toxicity. These studies have examined various dose volumetric parameters, such as the doses for the most exposed 2 cc volume of the bladder (BD 2cc ) [5] and the volume receiving a certain biologically weighted equivalent dose (EQD2) relative to the gross tumor volume (GTV) or clinical target volume (CTV), such as V 51.43Gy [6] and V 8.5Gy/w [7].
Related studies that have used predictive models for toxicity in various cancer types and radiation therapy techniques: • A machine-learning-based prediction model of fistula formation after interstitial brachytherapy for locally advanced gynecological malignancies achieved an accuracy of 0.901 using Support Vector Machine (SVM) [24]. • A feasibility study utilized a deep convolutional neural network (CNN) with transfer learning to predict rectum toxicity in cervical cancer radiotherapy, achieving an AUC of 0.89 [14]. • An observational study predicting radiotherapy impact on late bladder toxicity in prostate cancer patients used univariate logistic regression, achieving an AUC of 0.626 [6].

•
Various studies have focused on predicting urinary toxicity in prostate cancer radiotherapy using different models, such as the international prostate symptoms score model, logistic and Cox regression, the edited nearest neighbor algorithm together with the regularized discriminant analysis classifier, and others [16,18,19]. • Predicting late organ-at-risk toxicity after prostate radiation therapy has been explored using statistical analysis, cox regression, and random forest models [11,22,23].

•
In radiotherapy for cervical cancer, radiomics analysis of 3D dose distributions has been employed to predict toxicity rates, achieving AUCs ranging from 0.57 to 0.89 [13].
The differences between the statistical and DL methods can be described based on three factors: (i) model complexity, (ii) feature importance, and (iii) model transparency. Statistical methods tend to exhibit relatively low complexity, and, compared with DL models, they operate based on clear governing principles. Statistical methods often demonstrate relatively low accuracy; however, they offer relatively high model transparency, which simplifies the interpretation of feature importance. DL models tend to outperform statistical methods. However, they are more complex [25][26][27] and considered "black boxes", which makes the interpretation of their results very challenging [28]. However, both accuracy and interpretability must be considered when choosing a method for predicting clinical outcomes [29][30][31][32].
Therefore, in the present study, we propose an interpretable DL model for predicting late bladder toxicity in patients with cervical cancer who have received definitive radiotherapy. We compared the performance of the statistical method and the DL model. In addition to achieving a high level of reliability, we conducted a feature-importance analysis and validated the performance of a lightweight DL model.

Patient Selection
We identified 545 patients with primary uterine cervical cancer who underwent definitive RT with curative intent at our institution between February 2006 and December 2017. Follow-up evaluations were performed every three months in the first two years, every four months in the third year, every six months in the fourth and fifth years, and annually thereafter. In this study, we included patients with more than three years of follow-up after treatment completion. The radiation-induced bladder toxicity was evaluated during the regular follow-up visits based on the European Organization for Research and Treatment of Cancer late radiation toxicity criteria, which represent a structured scoring schema developed by the Radiation Therapy Oncology Group (RTOG) [33].
In total, 281 patients with cervical cancer were included in this study ( Figure 1). Clinical information regarding the status of the disease and treatment-related complications was collected retrospectively from patients' medical records. This study was approved by the Institutional Review Board (IRB) of the National Cancer Center, Korea: NCC2019-0166. This study adheres to the tenets of the Helsinki Declaration of 1975. The requirement for informed consent was waived owing to the retrospective nature of the study. Statistical methods tend to exhibit relatively low complexity, and, compared with DL models, they operate based on clear governing principles. Statistical methods often demonstrate relatively low accuracy; however, they offer relatively high model transparency, which simplifies the interpretation of feature importance. DL models tend to outperform statistical methods. However, they are more complex [25][26][27] and considered "black boxes", which makes the interpretation of their results very challenging [28]. However, both accuracy and interpretability must be considered when choosing a method for predicting clinical outcomes [29][30][31][32]. Therefore, in the present study, we propose an interpretable DL model for predicting late bladder toxicity in patients with cervical cancer who have received definitive radiotherapy. We compared the performance of the statistical method and the DL model. In addition to achieving a high level of reliability, we conducted a feature-importance analysis and validated the performance of a lightweight DL model.

Patient Selection
We identified 545 patients with primary uterine cervical cancer who underwent definitive RT with curative intent at our institution between February 2006 and December 2017. Follow-up evaluations were performed every three months in the first two years, every four months in the third year, every six months in the fourth and fifth years, and annually thereafter. In this study, we included patients with more than three years of follow-up after treatment completion. The radiation-induced bladder toxicity was evaluated during the regular follow-up visits based on the European Organization for Research and Treatment of Cancer late radiation toxicity criteria, which represent a structured scoring schema developed by the Radiation Therapy Oncology Group (RTOG) [33].
In total, 281 patients with cervical cancer were included in this study ( Figure 1). Clinical information regarding the status of the disease and treatment-related complications was collected retrospectively from patients' medical records. This study was approved by the Institutional Review Board (IRB) of the National Cancer Center, Korea: NCC2019-0166. This study adheres to the tenets of the Helsinki Declaration of 1975. The requirement for informed consent was waived owing to the retrospective nature of the study.

Treatment
The patients received concurrent chemoradiotherapy using either three-dimensional (3D) conformal EBRT or an intensity-modulated RT technique and high-dose-rate brachytherapy. The clinical target volume of EBRT included gross disease, the entire uterus, a margin In addition, regional lymphatics, including the common, internal, and external iliac nodal regions and the presacral and para-aortic lymph nodes, were considered. Prior to 2008, brachytherapy planning was CT-based; afterwards, MRI-based planning has been employed [4]. Brachytherapy planning generally followed the recommendations of the Gynecologic Group European de Curie Therapie, European Society for Therapeutic Radiology, and Oncology Working Group [34,35].
The patients received a combination of 45-50 Gy of pelvic EBRT and 30 Gy of high-doserate brachytherapy with daily fractions of 5 Gy over 3 weeks. The biologically equivalent dose to point A in 2 Gy fractions was previously calculated as approximately 82 Gy for cervical tumors (α/β = 10) and 91 Gy for normal tissue (α/β = 3) [36]. Concomitant chemotherapy was administered using a weekly schedule of intravenous cisplatin at 40 mg per square meter of body surface area.

Data
We used four groups of the feature set as input data for the statistical analysis and DL method: (i) "patient" feature group (n = 2), including age and pathology; (ii) "tumor" feature group (n = 3), including the federation of gynecology and obstetrics (FIGO) stage, tumor-node-metastasis (TNM) category, and maximum tumor length on axial T2-weighted magnetic resonance image (MRI); (iii) "treatment" feature group (n = 4), including concurrent chemoradiotherapy (CCRT), CCRT regimen, number of CCRT cycles, and adjuvant chemotherapy; (iv) "dose" feature group with the equieffective dose (EQD2) at 3.0 Gy (n = 7), including the total dose of external beam radiation therapy (EBRT) , the dose delivered to 100% of the primary GTV (GTV-D100), International Commission on Radiation Units and Measurements (ICRU) bladder point (BP ICRU ), BD 0.1cc , BD 1.0cc , BD 2.0cc , and BD 5.0cc . Features that were not continuous were dummy-coded. All features were deliberately collected in accordance with IRB guidelines and based on clinical experience and relevant literature [5,[36][37][38][39]. These carefully selected features are known to play a significant role in influencing local control and survival outcomes in cervical cancer. With these selections, we aimed to facilitate meaningful comparisons with other studies that have similar diseases.
Late bladder toxicity was graded according to the scoring schema of the RTOG, with scores in the range of 0-5. The median durations from the initial start of EBRT to the occurrence of late toxicity are 37.7 months in the current study population. Owing to the imbalance of grades, the grade of late bladder toxicity was binarized by considering the occurrence of late bladder toxicity only. When late bladder toxicity is absent, the occurrence status is set to 0; otherwise, the occurrence status is set to 1.

Statistical Method
Patient characteristics were indicated in terms of mean ± standard deviation or median (min-max) for continuous variables; frequencies and percentage values were used for categorical variables. Continuous variables were compared using the t-test or Wilcoxon rank-sum test, whereas categorical variables were compared using the Chi-square or Fisher's exact tests. The dataset was randomly divided into two sets containing 199 and 82 cases for training and testing the statistical model, respectively. The ratios for the training and test sets were 70% and 30%, respectively. A statistical model was built using a training set with univariable and multivariable logistic regression. A univariable logistic regression was performed on the input data, considering patient, tumor, treatment, and dose features. The results of the univariable logistic regression were used as the input for the multivariable logistic regression. The target was the occurrence of binarized late toxicity. The developed multivariable logistic regression model was tested using the test dataset. Statistical analyses were performed using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA) and R version 4.2.1 (R Foundation for Statistical Computing, Vienna, Austria). We assumed that statistical significance is achieved for p-values below 0.05.

Deep Learning Model for Permutation Feature Analysis
To implement the DL model by using Pytorch version 1.12.1 (Warsaw, Mazowieckie, Poland), we used an MLP. MLP is a type of artificial neural network that consists of multiple layers of nodes with batch normalization and nonlinearity functions such as rectified linear unit (RELU), leaky RELU, and sigmoid. Each node in one layer is connected to all nodes in the next layer, forming a fully connected network. The input layer takes the input data, and the output layer produces the final predictions. The layers between the input and output layers are called hidden layers. The DL model was built using six hidden layers. The architecture can be represented mathematically as follows: and where x is the input data, y is the output data, W i and b i are the weight matrix and bias vector for the i-th layer, respectively, f i is the activation function for the i-th layer, and h i is the output of the i-th hidden layer. As a preprocessing step, 16 features were used as input data, and binarized late bladder toxicity values were considered the output data. Z-score normalization was applied to the input data so that the DL model could rapidly converge to the optimal solution. The normalization parameters (mean and standard deviation) were determined from the training set.
The standardized input data were passed through an input layer, and output data were obtained on the output layer. In each hidden layer, except for the last one, batch normalization and dropout techniques were applied to avoid overfitting the training data. The probability of a node being zeroed for dropout was set to 0.2. In addition, a leaky ReLU was adopted as the activation function in the hidden layers to improve model complexity and performance. A sigmoid function was used as the activation function of the last layer only. These functions introduce nonlinearity into the model, allowing it to capture complex relationships within the data. The detailed architecture of the proposed model is presented in Table 1.
The loss between the ground truth and predicted toxicity occurrence was calculated using binary cross-entropy as follows: where N represents the total number of samples in the batch, y i represents the true label of the i-th sample (either 0 or 1), and p i represents the predicted probability (between 0 and 1) of the i-th sample belonging to the positive class. The DL models were trained to minimize the loss. To avoid overfitting, the synthetic minority oversampling technique (SMOTE) was adopted during the training procedure as oversampling technique and augmentation strategy, which adds random noise to the input data in every training epoch [18]. Moreover, we employed an adaptive momentum estimation optimizer with a learning rate of 0.005 and a weight decay of 0.0001. To compute the running average of the gradient, β1 and β2 were set to 0.9 and 0.999, respectively [40]. K-fold cross-validation (CV) was employed to ensure the generalizability of the DL model. The value of k was set to 5, resulting in the training set being divided into five folds. Furthermore, an early stopping strategy was implemented, where the model achieving the lowest validation loss during the 500 epochs was saved as the best-performing model. Each fold yielded independent results as separate models were trained. The final output data were determined using the voting method, with a threshold of 3 sets to obtain the ultimate decision.

Permutation Feature Importance Analysis
The permutation analysis is used to calculate the feature importance of a DL model. It involves randomly permuting the value of a single feature and evaluating the model's performance on the test dataset [41]. The analysis was performed using Scikit-learn version 1.0.2.
To measure the importance of each feature, the mean squared error (MSE) is calculated between the reference dataset and a corrupted dataset, where the value of a single feature has been randomly permuted. This is compared with the reference MSE, which is the MSE calculated on the uncorrupted reference dataset. The variation in model performance resulting from the permutation of a feature is used to compute the importance of each feature.
where i is used to index the features, j represents the total number of features, i j is the importance of j-th feature, K is the number of repetitions or permutations, S is the reference MSE, s k,j is the MSE calculated using a corrupted dataset for the k-th repetition, and 1 K ∑ K k=1 s k,j the average MSE over K repetitions for j-th feature. The value of K was set to 500, indicating that the permutation analysis was repeated 500 times. Finally, the permutation feature importance was calculated by averaging the feature importance values obtained from DL models trained using different training folds.

Lightweight Deep Learning Model
In order to develop a lightweight deep learning model, we utilized the importance values derived from permutation feature analysis and selected the top 5 features as input. This approach enabled us to focus on the most influential features while reducing the computational complexity of the model.
When designing the hidden layers, we adhered to the commonly employed standards of multi-layer perceptron (MLP) models. Specifically, our model was designed with a decreasing number of neurons in each hidden layer, starting from the number of input features and gradually reducing to 5, 4, 3, 2, and finally 1 neuron.
The activation functions, preprocessing steps, loss function, optimization technique, cross-validation, and final decision-making process of the lightweight deep learning model remain the same as those of the deep learning model described previously.

Performance Comparison
To compare the performance of the DL model and multivariable logistic regression, four metrics were adopted: accuracy (Equation (10)), precision (Equation (11)), recall (Equation (12)), F1-score (Equation (13)), and the area under the receiver operating characteristic curve (AUROC) (Equation (14)). To compute these metrics, we counted the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in the test set. Precision is defined as the number of positive predictions (i.e., those that were correctly identified by the model) expressed as a fraction of the total number of predictions. Recall indicates the fraction of positive instances that the model could identify. The F1-score is the harmonic mean of precision and recall. Therefore, a high F1-score indicates that the model exhibits a good balance between precision and recall. If a model performs well, it has a high AUROC value (close to 1.0).

General Study Characteristics
Patients' characteristics are summarized in Table 2. No significant differences were observed between the training and test sets in terms of baseline characteristics, such as age, pathology, FIGO stage, TNM category, maximum tumor length, CCRT, CCRT regimen, number of CCRT cycles, adjuvant chemotherapy, and the total dose of EBRT, GTV-D100, BP ICRU , BD 0.1cc , BD 1.0cc , BD 2.0cc , and BD 5.0cc .

Logistic Regression Results
Univariable and multivariable logistic regression analyses were performed to identify risk predictors for adverse events in the training set (Table 3). Based on the univariable analysis, no significant features were observed in the features set. The multivariable logistic regression model returned four features with p < 0.2: CCRT cycle, GTV-D100, BD 2.0cc , and BD 5.0cc . The prediction metrics were calculated using a multivariable logistic regression model on the test data. Multivariable logistic regression achieved accuracy, precision, recall, F1-score, and AUROC of 0.85, 0.08, 0.5, 0.14, and 0.43, respectively. Although multivariable logistic regression yielded an accuracy of 0.85, the regression model did not exhibit acceptable performance on the test dataset.

Permutation Feature Importance: Permutation Analysis
Permutation feature analysis determines the importance of features by quantifying the extent to which the performance of a trained model changes when the dataset is permuted (Equation (7)).
An independent permutation feature importance analysis was conducted on the DL model, separately for each training fold. To determine the overall feature importance, the mean and standard deviation of the calculated permutation feature importance were computed by averaging the importance values across folds. Figure 2 illustrates the mean and standard deviation of the permutation feature importance for all features.
Our findings reveal that the feature with the highest importance was BD 2cc , followed by BD 5cc and BP ICRU . Additionally, the features with the highest importance within each group were age, TNM category, number of CCRT cycles, and BD 2cc for the patient, tumor, treatment, and dose feature groups, respectively.
We identified the top five features with high importance as BD 2cc , BD 5cc , BP ICRU , TNM category, and tumor size. We made the decision to select tumor size instead of FIGO stage as a feature, considering the possibility of redundancy between FIGO stage and TNM category. This decision was based on the scientific rationale that tumor size provides valuable and distinct information for the prediction model [37].

Deep Learning Models
The prediction performance of the DL model for late bladder toxicity was evaluated using the voting method. Table 4 summarizes the prediction performances of the DL models and the lightweight DL models trained using the five different training folds. The means and standard deviations of prediction performance for accuracy, precision, recall, and F1-score differed between the deep learning model and the lightweight deep learning model. For the deep learning model, the values were as follows: accuracy (0.77 ± 0.06), precision (0.68 ± 0.10), recall (0.41 ± 0.12), and F1-score (0.49 ± 0.05). On the other hand, the lightweight deep learning model had different values: accuracy (0.92 ± 0.03), precision (0.97 ± 0.03), recall (0.86 ± 0.07), and F1-score (0.90 ± 0.04). In the case of the AUROC, the deep learning model achieved a value of 0.81 ± 0.04, while the lightweight deep learning model achieved a value of 0.94 ± 0.03 (Figure 3). the extent to which the performance of a trained model changes when the dataset is permuted (Equation (7)).
An independent permutation feature importance analysis was conducted on the DL model, separately for each training fold. To determine the overall feature importance, the mean and standard deviation of the calculated permutation feature importance were computed by averaging the importance values across folds. Figure 2 illustrates the mean and standard deviation of the permutation feature importance for all features. Our findings reveal that the feature with the highest importance was BD2cc, followed by BD5cc and BPICRU. Additionally, the features with the highest importance within each group were age, TNM category, number of CCRT cycles, and BD2cc for the patient, tumor, treatment, and dose feature groups, respectively.
We identified the top five features with high importance as BD2cc, BD5cc, BPICRU, TNM category, and tumor size. We made the decision to select tumor size instead of FIGO stage as a feature, considering the possibility of redundancy between FIGO stage and TNM category. This decision was based on the scientific rationale that tumor size provides valuable and distinct information for the prediction model [37].

Deep Learning Models
The prediction performance of the DL model for late bladder toxicity was evaluated using the voting method. Table 4 summarizes the prediction performances of the DL models and the lightweight DL models trained using the five different training folds. The means and standard deviations of prediction performance for accuracy, precision, recall, and F1-score differed between the deep learning model and the lightweight deep learning model. For the deep learning model, the values were as follows: accuracy (0.77 ± 0.06), precision (0.68 ± 0.10), recall (0.41 ± 0.12), and F1-score (0.49 ± 0.05). On the other hand, the lightweight deep learning model had different values: accuracy (0.92 ± 0.03), precision (0.97 ± 0.03), recall (0.86 ± 0.07), and F1-score (0.90 ± 0.04). In the case of the AUROC, the deep learning model achieved a value of 0.81 ± 0.04, while the lightweight deep learning model achieved a value of 0.94 ± 0.03 (Figure 3).     However, because the AUROC could not be calculated from the labels predicted by each of the five models, the average of the AUROCs of all the models was considered as the overall AUROC value.

Discussion
This study aimed to (1) compare the ability of multivariable logistic regression and DL models to identify those patients who, having received radiation therapy, are at risk of bladder radiation toxicity, and (2) interpret the results of DL models to understand the significance of input features. To the best of our knowledge, no previous study has attempted an interpretation to ensure the reliability of a DL model in predicting the occurrence of late bladder toxicity.
Several statistical methods and DL models are available for predicting clinical outcomes, including radiation toxicity. In many instances, DL models can effectively find near-optimal solutions for nonconvex optimization problems using gradient methods and nonlinear activation functions. However, statistical methods can also achieve high accuracy and reliability in specific cases. Despite the rapid development of DL models and their relatively high performance, their clinical utility is still a topic of controversy due to concerns about their reliability, particularly in terms of how clinicians interpret the results and features.
In this study, we utilized an MLP with relatively low model complexity to enhance the interpretability of deep learning models. The performance of the MLP relies on the configuration of the hidden layer, and Muhammad et al. [42] suggested that the optimal configuration for the hidden layer is three. Furthermore, experimental findings demonstrated that having fewer than three hidden layers directly impacts the network's accuracy, while having more than three hidden layers increases the time complexity without a proportional improvement in accuracy. Determining the appropriate number of neurons in the hidden layer is still a topic of debate, and the design of the model structure should be customized to the specific problem and available computational resources while also considering the general MLP model design standard: a structure in which the number of neurons in the hidden layer continuously decreases. Therefore, we selected the top five important features based on the results of the permutation feature importance analysis and designed the lightweight DL model, taking into account the general design standard. This approach not only improves interpretability but also offers advantages associated with the general design standard of MLP models. For instance, the continuous reduction of neurons in the hidden layer helps prevent overfitting and allows for efficient learning of complex patterns while avoiding excessive computational complexity.
Considering the experiment on our dataset, the DL model proved superior compared with the statistical model. To address the class-imbalance problem in the DL method, we adopted the SMOTE oversampling technique. Mylona et al. [18] suggested that variations in oversampling techniques, including SMOTE, increase the prediction performance of classifiers. Accordingly, our results showed that when SMOTE was not applied to the training process, the mean and standard deviation of the AUROC value on the test set were reduced to 0.52 ± 0.13.
The precision and recall of the DL model exhibited a tradeoff relationship depending on the value of the threshold used for voting, as shown in Figure 4. For the prediction of radiation-induced bladder toxicity, both precision and recall are crucial metrics for assessing the performance of a prediction model because they reflect the costs and benefits associated with false positives and false negatives. High precision is essential to minimize unnecessary interventions and potential harm for patients who are not at risk. However, a high recall is necessary to ensure that all patients at risk of developing bladder toxicity are detected and appropriate measures are taken to mitigate the associated risks. Therefore, achieving a balance between precision and recall is necessary to optimize the performance of the prediction model for clinical applications. At a threshold of 3, the precision and recall of the voting methods are indicated by markers placed inside circles denoted by asterisks. training process, the mean and standard deviation of the AUROC value on the test set were reduced to 0.52 ± 0.13. The precision and recall of the DL model exhibited a tradeoff relationship depending on the value of the threshold used for voting, as shown in Figure 4. For the prediction of radiation-induced bladder toxicity, both precision and recall are crucial metrics for assessing the performance of a prediction model because they reflect the costs and benefits associated with false positives and false negatives. High precision is essential to minimize unnecessary interventions and potential harm for patients who are not at risk. However, a high recall is necessary to ensure that all patients at risk of developing bladder toxicity are detected and appropriate measures are taken to mitigate the associated risks. Therefore, achieving a balance between precision and recall is necessary to optimize the performance of the prediction model for clinical applications. At a threshold of 3, the precision and recall of the voting methods are indicated by markers placed inside circles denoted by asterisks. Permutation feature importance is a model-agnostic technique that provides a comprehensive and computationally efficient assessment of feature importance by considering feature interactions and avoiding the bias caused by collinear or redundant features. In this study, permutation feature importance is utilized to calculate the relative importance of features while also employing a feature reduction technique like principal component analysis [42][43][44]. The importance of permutation features can be sensitive to the choice of the metric used to evaluate the model's performance (see Equation (7)), leading to different feature rankings. Our findings emphasize the importance of specific features, such as BD2cc, BD5cc, and BPICRU for the accurate prediction of late bladder toxicity Permutation feature importance is a model-agnostic technique that provides a comprehensive and computationally efficient assessment of feature importance by considering feature interactions and avoiding the bias caused by collinear or redundant features. In this study, permutation feature importance is utilized to calculate the relative importance of features while also employing a feature reduction technique like principal component analysis [42][43][44]. The importance of permutation features can be sensitive to the choice of the metric used to evaluate the model's performance (see Equation (7)), leading to different feature rankings. Our findings emphasize the importance of specific features, such as BD 2cc , BD 5cc , and BP ICRU for the accurate prediction of late bladder toxicity occurrence, which is also consistent with the prognostic factor for urinary toxicity suggested in previous studies, such as BD 2cc in [5]. Our results provide insights that could facilitate the development of more effective and personalized treatment strategies for patients undergoing radiation therapy.
Our study has certain limitations that need to be considered. Firstly, our analysis was based solely on structured data (patient and treatment dose features), which may not fully capture the complexity of bladder toxicity. Recent studies have demonstrated the potential of more sophisticated DL models that incorporate 3D dose distributions, medical images, contours, etc. to predict clinical outcomes [45,46]. Radiomics, which involves extracting quantitative features from medical images, can provide additional information about tumor characteristics and treatment response. Integrating radiomics data alongside patient data and dose volumetric parameters has shown improvements in the prediction of various types of toxicities, with potential enhancements in AUC values ranging from 0.11 to 0.16 [13]. Therefore, in future research, we aim to explore the integration of radiomics data into our DL models to enhance the prediction of late bladder toxicity. By incorporating this additional information, we anticipate improved performance and a better understanding of the underlying factors contributing to toxicity with advanced interpretation techniques.
Secondly, although the DL model outperformed the statistical model in our dataset, it is important to note that the performance of DL models can vary depending on the specific dataset and problem domain. Further validation using larger and more diverse datasets through multi-institutional studies is required to confirm the generalizability of our findings.
Thirdly, interpreting DL models remains a challenging task, especially concerning feature importance. Permutation feature analysis is a useful technique for assessing the importance of features in DL models. However, it should be noted that this method has limitations. It tends to assign higher importance to continuous variables and can produce different feature rankings depending on the choice of evaluation metric. Additionally, applying this technique to 3D input is challenging. While permutation feature analysis provides valuable insights, it needs to be supplemented with other interpretive techniques to gain a more comprehensive understanding of the behavior and functional importance of DL models. Therefore, as a further study, it is necessary to apply various analysis methods, such as LIME and its variants, input gradient-based methods, CAM and its variants, etc., to the DL model to ensure the reliability of models with relatively higher performance [47,48].
Furthermore, in order to address the limitation of a small sample size, we established an extra-validation set consisting of 17 individuals. Utilizing a lightweight DL model, we conducted predictions for late bladder toxicity. The results revealed an accuracy of 0.81%, a precision of 0.99%, a recall of 0.61%, an F1-score of 0.92, and an AUROC of 0.93. These performance metrics indicate that the lightweight DL model had limited accuracy in forecasting late bladder toxicity. While the precision was high, indicating few false positives, the recall was relatively low, meaning it missed many true positive cases.
It is crucial to exercise caution when interpreting these findings due to the relatively small size of the extra-validation set. The restricted sample may not adequately encompass the full range of late bladder toxicity in radiation therapy. For instance, the longest observed duration of late bladder toxicity in our institution was 107.7 months from the initiation of external beam radiation therapy (EBRT) to the occurrence of late toxicity. As a result, the performance metrics on the extra-validation set obtained may not truly reflect the predictive capabilities of the DL model. Thus, future research should strive to incorporate a larger patient cohort from multiple institutions to validate and enhance the predictive capabilities of DL models in accurately anticipating late bladder toxicity.
Overall, this study contributes to the ongoing discussion on the clinical utility of DL models in predicting radiation toxicity and emphasizes the importance of interpretability to enhance the reliability and practical applicability of these models. By addressing the limitations and conducting further research, we can advance the field and ultimately improve patient outcomes in radiation therapy.

Conclusions
In this study, we compared the performance of logistic regression and DL models for the prediction of late bladder toxicity in patients with cervical cancer. Logistic regression did not show acceptable performances, whereas the lightweight DL model achieved an accuracy of 0.92 ± 0.03 and an AUROC of 0.91 ± 0.03. Moreover, the permutation feature importance analysis identified BD 2cc as the most important feature for risk prediction.