Article

Comparing Performances of Predictive Models of Toxicity after Radiotherapy for Breast Cancer Using Different Machine Learning Approaches

by Maria Giulia Ubeira-Gabellini 1, Martina Mori 1, Gabriele Palazzo 1, Alessandro Cicchetti 2, Paola Mangili 1, Maddalena Pavarini 1, Tiziana Rancati 2, Andrei Fodor 3, Antonella del Vecchio 1, Nadia Gisella Di Muzio 3,4 and Claudio Fiorino 1,*

1 Medical Physics, IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
2 Data Science Unit, Fondazione IRCCS Istituto Nazionale dei Tumori, 20133 Milan, Italy
3 Radiotherapy, IRCCS San Raffaele Scientific Institute, 20132 Milan, Italy
4 Department of Radiotherapy, Vita-Salute San Raffaele University, 20132 Milan, Italy
* Author to whom correspondence should be addressed.
Cancers 2024, 16(5), 934; https://doi.org/10.3390/cancers16050934
Submission received: 17 January 2024 / Revised: 20 February 2024 / Accepted: 20 February 2024 / Published: 25 February 2024

Simple Summary

Studies comparing the performances of machine learning (ML) methods in building predictive models of toxicity in radiotherapy (RT) are rare. Thanks to the availability of a large cohort (n = 1314) of breast cancer patients homogeneously treated with tangential fields, different ML approaches could be compared. This work shows how more complex models typically achieve higher performances. At the same time, for this test case, most of the predictive importance is carried by a few variables, and toxicity can be predicted by simpler models with similar performances. The availability of more individually characterizing features (here partially missing) is likely to have a much higher impact than the choice of the best-performing ML/DL approach.

Abstract

Purpose. Different ML models were compared to predict toxicity in RT on a large cohort (n = 1314). Methods. The endpoint was RTOG G2/G3 acute toxicity, resulting in 204/1314 patients with the event. The dataset, including 25 clinical, anatomical, and dosimetric features, was split into 984 patients for training and 330 for internal testing. The dataset was standardized; features with a high p-value at univariate LR and with Spearman ρ > 0.8 were excluded; synthetic data of the minority class were generated to compensate for class imbalance. Twelve ML methods were considered. Model optimization and sequential backward selection were run to choose the best models with a parsimonious feature number. Finally, feature importance was derived for every model. Results. The models’ performances were compared on the training–test datasets over different metrics: the best-performing model was LightGBM. Logistic regression with three variables (LR3) selected via bootstrapping showed performances similar to those of the best-performing models. The AUC on test data is slightly above 0.65 for the best models (highest value: 0.662 with LightGBM). Conclusions. No model performed best for all metrics: more complex ML models had better performances; however, models with just three features showed performances comparable to the best models using many (n = 13–19) features.

1. Introduction

To date, the use of artificial intelligence (AI)/machine learning (ML) models in the medical field is increasing thanks to their ability to learn from training data without being explicitly programmed. For this reason, these techniques are increasingly being considered in the field of cancer therapy, moving towards an era of “precision oncology” [1,2]. The topics span from medical imaging [3], personalized drug discovery [4], target auto-contouring, and dose distribution calculation [5], to outcome and toxicity prediction, which is the focus of the current work.
In particular, in the field of breast cancer, adjuvant radiotherapy, delivered to the majority of patients [6,7], is proven to reduce local recurrence [8] and overall mortality [6,7,9]. However, breast radiotherapy has several side effects, both acute and late [10,11,12,13]. Early toxicity includes breast erythema and desquamation occurring within 90 days after treatment and generally recovers with time. Late toxicity (e.g., telangiectasia, fibrosis, and hyperpigmentation of the skin) may, on the contrary, often become irreversible and progressive over time.
Developing accurate prediction models for radiotherapy-related side effects (i.e., toxicity) is crucial to minimize their impact [14]. This would allow clinicians to deliver more personalized treatments by identifying the main factors influencing toxicity and incorporating them into therapy decision-making and optimization. Although severe late toxicities following adjuvant radiotherapy for breast cancer are relatively rare [6,7,11], the number of patients treated in high-income countries is huge, with an annual increase of about 4% [15], pushing researchers to spend much energy in predicting and, possibly, preventing them. In addition, an association between the intensity of transient acute effects and the onset of late toxicity has also been reported [16], keeping interest in predictive models of early toxicities high as well.
ML techniques are widely used for making predictions on new data, but they require a sufficiently large and diverse dataset to be trained on. This is one of the reasons, together with the lack of model interpretability, why clinical application is still limited [17,18]. Despite this, several attempts to use ML in developing predictive models of toxicity in radiotherapy have appeared in recent years [17], including the breast cancer case [19,20,21]. ML holds the potential to enhance prediction capabilities given sufficient patient data. However, studies comparing the performances of ML methods in predicting radiation-induced toxicity, trying to assess the best approach, are largely lacking [17]. In particular, the interpretability issue remains relevant when trying to find the “best” approach: advanced ML/deep learning (DL) methods often result in “black-box” models and/or may tend to include a large set of clinically low-relevance variables, making the picture quite obscure to clinicians [18]. Metrics typically used to evaluate classifier performance for high-risk toxicity patients, such as the area under the receiver operating characteristic (ROC) curve (AUC), can be significantly misleading for strongly unbalanced data. Thus, metrics like F1 and balanced accuracy should also be considered. Establishing a common preprocessing strategy involving scaling, encoding, and feature cleaning is crucial for comparing model performances.
To our knowledge, few studies comparing different modeling approaches have been conducted on breast toxicity, and they are always based on limited-size populations and/or consider only a few ML approaches [22,23,24,25]. In particular, these studies were performed on different outcomes (acute toxicity [22,25], acute desquamation [23], radiation-induced dermatitis [24]). The input variables used are diverse, ranging from spectrophotometric variables [25], baseline characteristics, comorbidities [23], and clinical variables [22], to radiomics features [24]. Some considered small [24,25] or large [22,23] cohorts, and their performances differ depending on the features used. In this regard, our work is most directly comparable with that of Rattay et al. [22]. The availability of a large, single-institute cohort of patients, homogeneously treated with whole-breast irradiation using tangential fields with the same fractionation schedule and following similar contouring and planning procedures, made it possible to explore the issue. The rate of events pushed us to focus on early toxicity, considering the number of patients/events as large enough to successfully apply many ML methods. Ref. [25], in particular, shows a large performance improvement thanks to the additional variables considered.

2. Materials and Methods

This study made use of a large, single-institute cohort (n = 1325) of breast cancer patients homogeneously treated with tangential fields. Patient data were analyzed within the retrospective study approved by the Institutional Ethical Committee (registered to ClinicalTrials.gov, Identifier No.: NCT03077191). Details on the sample can be found in previously published papers [26,27]. A total of 11 patients were excluded, either because they received a boost or because they were treated with the VMAT technique. This study involved a single-institute cohort, with the advantages of higher homogeneity, similar patient typology, and large numbers, but with the disadvantage of lacking an external validation, as is available for multi-institutional studies. All procedures described were implemented in Python version 3.7.9 together with Microsoft SQL Server v15.0 via SQL Server Management Studio version 18.11.1. Preprocessing and modeling analysis were performed using in-house code (medicalAI in mAItre (Medical Artificial Intelligence Toolkit for REsearch, https://github.com/pymaitre, accessed on 16 January 2024)).

2.1. Patient Characteristics, Endpoint Definition, and Available Variables

All considered patients underwent breast conservative surgery for pTis-pT3 pNx-pN1a M0 disease in the period from 02/2009 to 05/2017. Radiotherapy was delivered with whole-breast (WB)-IMRT to a total dose of 40 Gy in 15 fractions, without boost. Most patients showed invasive ductal carcinoma (n = 961, 72.5%), while n = 116 (8.8%) patients showed lobular invasive carcinoma. There were 194 patients with pT stage greater than one (14.6%). Patients also received chemotherapy (n = 371, 28%) and monoclonal antibodies (n = 109, 8.2%). The details of the characteristics of the patients and of the treatment are reported in Table 1, Table 2 and Table 3 and in the papers by Fodor et al. [26,27].
The endpoint was RTOG (Radiation Therapy Oncology Group) G2/G3 acute toxicity, resulting in 204 and 1110 patients with or without the event, respectively. The dataset, including 25 clinical, anatomical, and dosimetric features, was split into 984 for training and 330 for independent internal test. Variables considered for this study were “Age”, “Axillary Dissection”, “Quadrant Position”, “T Size”, “Chemo”, “Type Chemo”, “AB Monoclonal”, “Hormonal Therapy”, “Bilateral RT”, “Right”, “Number of Fractions (fr) with Bolus”, “Bolus”, “PTV Volume”, “PTV V105%” (i.e., volume of PTV receiving more than 105% of prescribed dose), “Body D1%” (i.e., dose received by 1% of the body volume), “Obesity”, “Diabetes”, “Hypertension”, “Thyroid Disorders”, “Smoke”, “Alcohol”, “Asymmetry”, “Overall Cosmesis”, “No Nipple/Retraction”, and “Hormonal Type”. In Table 1, Table 2 and Table 3, the characteristics of the cohort, according to the considered features, are shown, except for a few parameters with low variance and a high LR univariate p-value, as explained later.

2.2. Data Preprocessing

First, the training dataset was standardized using the robust scaling method [28] (see Table A2 for details), limiting the impact of outliers; the test dataset was then scaled accordingly. A one-hot encoder was applied to categorical features (i.e., “quadrant position”; see Table 1, Table 2 and Table 3), and only the most predictive categories for this specific outcome, out of nine, were selected based on a low p-value at univariate LR (global < 0.2): “QSE/QSI” and “QSE” (QSE and QSI correspond to the external and internal superior quadrant, respectively). This choice was approved by the referring clinician. Moreover, variables with low intrafeature variance and with a high p-value at univariate LR were excluded [29]. Features with too-low intrafeature variance (<0.02), indeed, cannot be predictive and may just confuse a multivariate model. A high p-value threshold (global > 0.8) was preliminarily applied to discard features clearly not associated with the endpoint; “Bilateral RT”, “Alcohol”, “Hormonal Therapy”, “Overall Cosmesis”, “Chemo”, and “AB monoclonal” were thus dropped. Then, the Spearman correlation was computed on the remaining 20 features (see Figure 1), and features with Spearman |ρ| > 0.8 were dropped, namely “Bolus”, as can be seen in Figure 1. Between correlated features, the one with the lower LR univariate p-value was kept [30]. In Appendix A (Table A4), a summary of how features were selected is reported.
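For illustration, a minimal sketch of this preprocessing chain is given below (scikit-learn and statsmodels; the function name, argument names, and in-code thresholds mirror the steps described above but are assumptions, the one-hot encoding step is omitted, and the actual implementation belongs to the in-house mAItre code):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import RobustScaler

def preprocess(X_train, X_test, y_train, var_thr=0.02, p_drop=0.8, rho_thr=0.8):
    # 1) Robust scaling: fit on the training set only, then apply to the test set
    scaler = RobustScaler().fit(X_train)
    X_tr = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
    X_te = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

    # 2) Drop features with too-low intrafeature variance (<0.02)
    variances = X_tr.var()
    keep = variances[variances >= var_thr].index.tolist()
    X_tr, X_te = X_tr[keep], X_te[keep]

    # 3) Univariate logistic regression: drop features with p-value > 0.8
    pvals = {}
    for col in X_tr.columns:
        fit = sm.Logit(y_train, sm.add_constant(X_tr[[col]])).fit(disp=0)
        pvals[col] = fit.pvalues[col]
    keep = [c for c in X_tr.columns if pvals[c] <= p_drop]
    X_tr, X_te = X_tr[keep], X_te[keep]

    # 4) Spearman correlation filter: for pairs with |rho| > 0.8,
    #    keep the feature with the lower univariate p-value
    corr = X_tr.corr(method="spearman").abs()
    drop = set()
    cols = list(X_tr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > rho_thr:
                drop.add(a if pvals[a] > pvals[b] else b)
    keep = [c for c in cols if c not in drop]
    return X_tr[keep], X_te[keep]
```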
The synthetic minority oversampling technique (SMOTE, see Table A2 for details) was applied to generate synthetic data of the minority class to compensate for class imbalance. The ratio between minority and majority sample numbers was set to 0.5, resulting in class percentages with/without the event of 32% and 68%, respectively. For completeness, models with a ratio of 1:1 were also run, finding better results for some models and worse for others. It was thus decided to keep the more conservative approach, adding fewer synthetic data. Note that SMOTE was not used for the LightGBM and AutoGluon models due to their intrinsic balanced approach.
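This oversampling step can be sketched with imbalanced-learn as follows (a sketch; X_train/y_train are assumed to be the preprocessed training data, and oversampling is applied to the training set only):

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class (toxicity events) on the training set only;
# sampling_strategy=0.5 asks for a minority/majority ratio of 0.5 (~32%/68% class split)
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# LightGBM and AutoGluon are trained on the original (non-augmented) data,
# since they handle class imbalance internally.
```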

2.3. Statistical and ML Methods

Several ML methods (listed in Table 4 with their acronyms) were considered: LR, Lasso, ElasticNet, KNN, SVM, GNB, MLP, RF, LightGBM, and AutoGluon. Lasso and ElasticNet belong to the same typology of models (i.e., logistic regression); they differ only in the chosen penalty, a kind of regularization used to reduce overfitting and to help interpretability, with ElasticNet, in particular, being a combination of the other two [31]. KNN is a nonparametric classifier, which uses proximity methods to distinguish between classes [32]. SVM is an algorithm that classifies through the choice of a hyperplane in N-dimensional space [33,34]. GNB belongs to the so-called probabilistic classifiers and is based on Bayes’ theorem with the assumption of strong independence between input features and of a normal distribution for each class [35]. MLP is a feedforward artificial neural network (ANN) with multiple layers where the mapping between input and output layers is a nonlinear activation function [36]. RF [37] and LightGBM [38] are both algorithms that combine multiple decision trees to reach one final result (decision trees are constructed by a series of nodes that split the data), with the second typically being more accurate than the first. RF constructs the trees independently, while LightGBM, which is a gradient-boosting-based algorithm, builds them one after another; data instances with large gradients are kept, while those with small gradients are randomly dropped. Finally, AutoGluon is an AutoML code operating in supervised machine learning, typically used for tabular predictions. It is focused on automated stack ensembling, which consists of combining a set of individually trained classifiers to reduce their intrinsic error; here, it serves as a comparison with the other codes considered [39]. The models reported above were chosen due to their typical use in ML classification problems (e.g., [17,40]). AutoGluon and LightGBM are less frequently cited in the previous literature, being more recent developments of earlier models.
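As an illustrative sketch, most of these classifiers can be instantiated in scikit-learn (plus the LightGBM package) along the following lines; the specific settings shown here are placeholders, the tuned hyperparameters are those reported in Table A1, and AutoGluon (TabularPredictor) is handled separately:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

# LR, Lasso, and ElasticNet are the same model type (logistic regression)
# differing only in the penalty term; hyperparameters are tuned later (Table A1)
models = {
    "LR": LogisticRegression(penalty=None, solver="newton-cg", max_iter=1000),
    "Lasso": LogisticRegression(penalty="l1", solver="saga", max_iter=1000),
    "ElasticNet": LogisticRegression(penalty="elasticnet", solver="saga",
                                     l1_ratio=0.5, max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True, class_weight="balanced"),
    "GNB": GaussianNB(),
    "MLP": MLPClassifier(max_iter=5000),
    "RF": RandomForestClassifier(class_weight="balanced"),
    "LightGBM": lgb.LGBMClassifier(objective="binary"),
}
```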
All models, except for AutoGluon and LightGBM, were run by applying a Bayesian search (see Table A2 for details) to maximize the chosen metric (the best one between balanced accuracy, f1 weighted, f1 macro, and AUC; see metric_sfs in Table 5) on a stratified k-fold cross-validation sample (see Table A2 for details). This means that the training set is split into k = 5 folds with the ratio between the patient classes fixed; the model is trained using k−1 folds and validated on the one left out. The results obtained on the k validation sets while varying the parameters were then combined by averaging the defined score, in order to choose the hyperparameters that maximize the score for each model, as described in Appendix A (Table A1). Moreover, a sequential backward floating selection (SBFS, see Table A2 for details) was applied, from which a parsimonious feature number was chosen. This means that the chosen metric (see metric_mod_opt in Table 5) must fall within the minimum between 1% of the maximum metric value and its SD error. The feature number is, indeed, considered one of the free model parameters to be optimized, and thus, models may end up with different feature numbers. For each model, the feature importance was derived together with complementary metric scoring for both training and test datasets. A more complete approach to address explainability would be to use an agnostic global explainer (e.g., SHAP [41] or LIME [42]); however, this was not the main goal of the current work, and it will be addressed in a follow-up paper. The whole process was run for each metric, and the metric with the least discrepancy in AUC between training and internal test was chosen. The metric chosen as metric_sfs is, indeed, itself a free parameter of this study. This is better explained in Appendix A (Table A3).
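A condensed sketch of this tuning plus feature-selection loop for one of the scikit-learn models is shown below (skopt's BayesSearchCV and mlxtend's sequential selector; the search space and scoring string are illustrative, and mlxtend's "parsimonious" option approximates, but does not exactly reproduce, the parsimony criterion described above):

```python
from skopt import BayesSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Bayesian hyperparameter search maximizing the chosen metric on stratified 5-fold CV
search = BayesSearchCV(
    estimator=SVC(probability=True, class_weight="balanced"),
    search_spaces={"C": (1e-2, 1e2, "log-uniform"), "kernel": ["rbf", "poly"]},
    scoring="balanced_accuracy",  # one of: balanced accuracy, f1 weighted, f1 macro, AUC
    cv=cv, n_iter=30, random_state=42,
)
search.fit(X_train_res, y_train_res)

# Sequential backward floating selection on the tuned estimator;
# "parsimonious" keeps the smallest subset whose score stays close to the maximum
sbfs = SFS(search.best_estimator_, k_features="parsimonious",
           forward=False, floating=True,
           scoring="balanced_accuracy", cv=cv)
sbfs = sbfs.fit(X_train_res, y_train_res)
selected_idx = list(sbfs.k_feature_idx_)
```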
In addition, LR was applied using two different approaches: the first follows the same process described above, while the second follows an approach similar to the one described in [43,44,45]. While both procedures used the same preprocessing described in Section 2.2, a different feature-selection strategy was used in the second procedure. In particular, a multivariate LR was computed with a fixed number of three or four variables chosen among all the feature combinations remaining after the correlation-based selection. This was performed for each combination by applying bootstrapping (1000 random samplings with replacement) in parallel to select the combination with the better performance.
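A rough sketch of this bootstrap-based selection is given below (statsmodels Logit; the helper name and return format are hypothetical, and the subsequent ranking by the average of training and test AUC described in Section 2.4 is not shown):

```python
import numpy as np
import statsmodels.api as sm
from itertools import combinations

def bootstrap_lr_selection(X, y, n_vars=3, n_boot=1000, p_thr=0.05, seed=42):
    """For every n_vars-feature combination, count how often a multivariate LR fitted
    on a bootstrap resample is 'valid' (global and per-variable p-values < p_thr)."""
    rng = np.random.default_rng(seed)
    freq = {combo: 0 for combo in combinations(X.columns, n_vars)}
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)
        Xb, yb = X.iloc[idx], y.iloc[idx]
        for combo in freq:
            try:
                fit = sm.Logit(yb, sm.add_constant(Xb[list(combo)])).fit(disp=0)
                ok = fit.llr_pvalue < p_thr and (fit.pvalues.drop("const") < p_thr).all()
                freq[combo] += int(ok)
            except Exception:  # e.g., perfect separation or singular design matrix
                continue
    # Combinations with the highest bootstrap frequency are retained for further ranking
    return sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
```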
A different approach was followed for AutoGluon and LightGBM. Considering that they are provided with data augmentation and feature selection, no SMOTE-augmented data were given to these models. While AutoGluon also provides hyperparameter optimization in its own code, this is not possible with LightGBM, for which a dedicated tool, called “optuna” [46] (see Table A2 for details), was used. The best parameters chosen for all these models are described in Appendix A (Table A1).
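The optuna-based tuning of LightGBM can be sketched as follows (the search ranges shown are illustrative placeholders, not the ones actually explored; the tuned values are those in Table A1):

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    # Sample a LightGBM configuration and score it by cross-validated AUC
    params = {
        "objective": "binary",
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-3, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-3, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 64),
        "max_depth": trial.suggest_int("max_depth", 2, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 10, 100),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    clf = lgb.LGBMClassifier(**params, is_unbalance=True)  # internal class balancing
    return cross_val_score(clf, X_train, y_train, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
best_params = study.best_params
```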

2.4. Metrics to Compare ML Methods

Different metrics were used to compare the performance of the ML models considered. For the first type of procedure previously described, in addition to the well-known AUC, the metrics used were F1, separately on patients with/without the event; the Brier score; and the slope/R² of the calibration plot. Balanced accuracy was also used for model optimization and SBFS.
F1 score can be considered as a harmonic mean of the precision and recall metrics. The best value is one, and the worst is zero (Equation (1)).
$F1 = 2 \cdot \dfrac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Precision (or positive predictive value) and recall (or sensitivity) are defined as follows (Equation (2)):
$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$
In particular, F1 may be computed for each class separately and then combined in different ways, which give different results if the dataset is unbalanced: micro F1, macro F1, and weighted F1. The first is traditionally called accuracy, the second is the arithmetic mean of F1 classes, and the third is the weighted average of F1 classes considering each class’s numerosity.
The known metric, AUC, is calculated as the area under the curve of sensitivity (Equation (2)) versus 1-specificity, defined as (Equation (3)):
$\mathrm{Specificity} = \dfrac{TN}{FP + TN}$
From the same two metrics, balanced accuracy can be defined as the average of recall obtained on each class as in Equation (4):
$\mathrm{Balanced\ Accuracy} = (\mathrm{specificity} + \mathrm{recall}) / 2$
Both F1 and balanced accuracy are typically used in cases with unbalanced classification like this one, with the difference that F1 performs better when the attention has to be focused on positives, while for balanced accuracy, negatives and positives have the same importance.
The Brier score (Equation (5)) is the mean squared error computed between predicted probabilities ($f_t$) and observed values ($o_t$), and it is always between 0 and 1. The best model must minimize the Brier score value.
$BS = \dfrac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2$
For the second procedure, applied to LR with bootstrap, a model was instead considered good when the global and per-variable multivariate LR p-values were below 0.05. Only models with the best bootstrap frequency were kept. These were then ordered by the average AUC between training and test data, and the top-ranked one was taken as the best model.
Finally, in both procedures, the threshold used to discriminate between the two classes was defined on the ROC curve by maximizing the Youden J metric (Equation (6)).
$J = \mathrm{sensitivity} + \mathrm{specificity} - 1$
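These metrics and the Youden threshold can be computed with scikit-learn along the following lines (a sketch; a fitted classifier `model` and the train/test splits are assumed to be available):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, roc_curve, f1_score,
                             balanced_accuracy_score, brier_score_loss,
                             confusion_matrix)

# Predicted probabilities for the positive (toxicity) class
p_test = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, p_test)
brier = brier_score_loss(y_test, p_test)        # Equation (5)

# Youden J: threshold maximizing sensitivity + specificity - 1, derived on training data
fpr, tpr, thr = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
youden_thr = thr[np.argmax(tpr - fpr)]

y_pred = (p_test >= youden_thr).astype(int)
f1_pos = f1_score(y_test, y_pred, pos_label=1)  # F1 on patients with the event
f1_neg = f1_score(y_test, y_pred, pos_label=0)  # F1 on patients without the event
bal_acc = balanced_accuracy_score(y_test, y_pred)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
```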

3. Results

3.1. Models’ Performances

A summary of overall model performances is shown in Table 5 and Table 6 and in Figure 2, where different metrics are compared on the training–test datasets along the x-axis. The best settings chosen for the models are described in Table A1. The models are plotted in ascending order, starting from the one with the smallest training–testing discrepancy. LightGBM was the best model in terms of absolute metric value and discrepancy between training and test sets. Good performances were also given by AutoGluon, RF, SVC, and MLP. ElasticNet and LR, more often used in the literature, did not show largely different performances (as also shown in the ROC plots of Figure 3). KNN, instead, was strongly overfitted. Possibly, this is related to it being a nonparametric classifier that uses proximity methods to distinguish between classes: the sample is unbalanced, and it is not trivial to define correct hyperparameter values (e.g., n neighbors). Moreover, this model works on groups of data and thus, if it does not find strongly discriminating groups, more easily tends to fit the training data closely at the expense of the test data.
The AUC of the test data is slightly above 0.6 overall. The f1 score on patients with toxicity is around 0.3, while for patients without toxicity it depends on the model (mostly above 0.7). A calibration plot was derived for every model, from which the slope, R², and Brier score were extracted. AutoGluon and LightGBM showed the best Brier scores, while MLP, RF, and GNB showed the worst values on the test set. ElasticNet and LightGBM training and test slopes have similar values, even if ElasticNet slope values are lower than expected, while LightGBM values are higher; this can be due to the derived probability distributions (see Appendix A, Figure A2 for calibration plots). However, considering the similar slopes between training and test sets, these models could easily be recalibrated. LR, instead, even though its training slope is near one, is more difficult to analyze considering the large discrepancy with the test set slope, which could create difficulties in recalibration. RF slopes were very discrepant when the training data are considered with augmentation; however, the training slope becomes similar to that of the test set (see Appendix A, Figure A1) if the same model is applied to the original training data. MLP showed the worst difference between training and test slopes, both considering the original training set and the SMOTE-augmented one. Finally, regarding R², RF and LightGBM showed the least discrepant values, closer to one, while LR was the most discrepant, with a test value closer to zero. The Youden thresholds shown in Table 5 were separately derived for training and test sets from the ROC plots. The values were typically found to be similar between training and test sets independently of the model used. The absolute threshold values, instead, lie sometimes close to the average outcome percentage (∼0.2) and sometimes close to the value typically used for classification models (∼0.5). In Table 6, a summary of the precision, specificity, sensitivity, f1 score, and AUC of all the mentioned models for both training and test sets is shown. Confusion matrices were extracted by choosing as the discriminant threshold the one derived with the Youden metric from the training data, and the other metrics were derived accordingly. For almost all the models, the training data are presented with SMOTE-augmented data, while the test set is considered without it. For AutoGluon and LightGBM, instead, balancing is handled inside the code, and thus the augmentation is not visible here.
Figure 3 shows the ROC for the best models (i.e., ElasticNet, LR, SVC, and LightGBM). Error intervals were derived through a bootstrap of the population considered, and the lines are the resulting mean values. The optimal thresholds shown in Figure 3 were taken from the mean curves, and thus they are slightly different from the ones in Table 6. For the first three models, training data are shown with SMOTE-augmented data, while LightGBM performs augmentation inside the code, which is not visible here. Even when the training data are considered without augmentation, LightGBM remains the model with the best test AUC score and the smallest discrepancy between training and test data, although this is less evident than in Figure 3.

3.2. Importance and Number of Selected Features: Measuring Redundancy

In Figure 4, the feature importances obtained for the best models are shown: for ElasticNet and LR, the importance is given by the model coefficient values, while for SVC and LightGBM it was derived via feature permutation. ElasticNet and LR have similar derived feature importances, which, however, are not shared with the more performant models, such as SVC and LightGBM, where one of the most prominent features is PTV Volume; this feature instead has a low coefficient value for ElasticNet and is not present in LR. Obesity is always among the first four most important variables for all models.
Figure 5 shows two examples (ElasticNet and SVC) of the method used in SBFS, illustrating the effect of limiting feature redundancy. The same plot derived for all the other models is shown in Appendix A (Figure A4). LightGBM and AutoGluon perform a different kind of feature selection, embedded in their code, so their plots could not be explicitly derived. The figures show how the performance metric chosen for each model (see metric_sfs, Table 5) varies with the number of features used. The thick line is computed on a cross-validation sample, and the light error interval is the standard deviation (SD). The maximum in this plot is shown through the vertical line, and it represents the number of features associated with the best performance. To be more conservative, a parsimonious feature number was chosen, such that the chosen metric falls within the minimum between 1% of the maximum metric value and its SD error (gray vertical line). This criterion was shared between models to make the performance comparison fair. In the plots, the scores obtained in both cases and their 95% confidence intervals are shown. The parsimonious number of features derived is also summarized in Table 5. Importantly, depending on the model, the number of features building the model could be substantially reduced in most cases (up to nine variables) without any relevant impact on the model’s performance. Another approach could be to fix the number of variables a priori, but this would not allow quantifying how differently the ML models converge with respect to the number of features chosen.

3.3. Bootstrap-Based Logistic Regression with Small Number of Features

Finally, as described in Section 2.3, LR models were developed through a previously suggested bootstrap-based method, fixing the number of variables to three or four. ROC results are presented in Figure 6 for both training and test sets. Training data are presented with SMOTE-augmented data. “Axillary Dissection” and “PTV Volume” were shared by the two models. “Quadrant Position_QSE” completed the three-feature model, while “Age” and “Type Chemo” were added to the four-feature model. In the ranking of ML model performances, the three-variable model gained the third and fourth positions thanks to its high AUC and f1 score, respectively. However, the four-variable model did not confirm all three variables and showed a worse performance in terms of test-set AUC compared with most of the models considered. This means that, even if this method can give very good performances, it is important to be aware of its large instability with respect to the number of features fixed.

4. Discussion

The issues concerning the application of advanced ML/DL methodologies in training predictive models of RT-induced toxicity (and, more in general, of outcome after RT) are multiple and complex. As a matter of fact, the growing availability of AI tools and skills makes their use more accessible and affordable, even in this field. Extending the field to include imaging biomarkers, the large number of image-based quantitative features that may be extracted from medical imaging has largely contributed to the interest in advanced ML/DL approaches for handling information and training/validating prediction models [47]. A similar process, with likely less emphasis, is happening even in the case of complex dose metrics extracted from the 3D dose distributions of individual patients, i.e., dosiomics [48,49]. A large variety of AI-based methods may be considered and applied together, with different attitudes toward the issues related to the clinical meaning of the resulting models. This may generate controversial reactions in the radiation oncology community, ranging from “easy” enthusiasm to deep skepticism. Despite the many open issues, very few studies have rigorously compared the performances of models trained with different ML/DL approaches on the same patient cohort [17,50]. This is likely due to several reasons, such as the usually limited number of patients/events, the difficulty of setting up different methodologies using common criteria for data processing, and, highly relevant, the possibility of successfully applying “easier” statistical methods as well as “a priori” knowledge (for instance, NTCP models based on DVH reduction). The availability of a reasonably large cohort of patients treated at a single institute in a quite homogeneous way and followed by the consistent collection of toxicity information pushed us to investigate this issue in a real-life, clinically relevant scenario such as acute toxicity after WBI. We note that logistic regression can be considered as the traditionally used benchmark model and that the AUC metric on the test set can be used for comparison between models.
With this aim, first of all, a robust preprocessing pipeline shared between all the routines (scaling, encoding, feature cleaning, data augmentation) was created, and 12 different approaches were consistently implemented in the Python environment, aiming to compare the resulting models to highlight the strengths and weaknesses of each of them. Setting up the whole machinery required substantial effort for proper model tuning to make model performances quantifiable and comparable. As expected, more complex models (e.g., LightGBM and AutoGluon) achieved higher test scores compared to the other models. However, these models typically ended up having more features and thus being more difficult to interpret clinically. It was possible to study their feature importance thanks to feature permutation algorithms, as shown in Figure 4. In any case, the first three or four features clearly gave the principal contribution. Thanks to methods such as SBFS, it was also possible to see how performances vary with feature number, confirming that the main gain is obtained with the first few features, except for KNN, where, in any case, the model is heavily overfitted. More familiar models in the field of predictive models for toxicity, such as LR, gave better test-score performances when methods with a fixed number of variables and bootstrapping were used: LR run through the first procedure (SBFS starting from all the features) did not find features as predictive as those found with the LR bootstrapping method. However, these methods seem to be unstable, considering the large discrepancy in test-set performance between the models with three versus four variables, even though their training scores were similar. Moreover, these models were automatically chosen considering the average between training and test AUC scores, thus introducing a “data leakage” (i.e., information from outside the training dataset is somehow used to choose the best model). This may have produced a too-optimistic result compared to the case where the cross-validation model was not influenced at all by the internal test, as for the more complex models. From the study of a learning curve (Appendix A, Figure A3) for both training and test sets, it was also possible to see that the metric performance is still increasing, suggesting an improvement with more data. Possibly, this may have an impact on the ML models, which were not yet able to fully exploit their maximum potential with the available numbers.
Concerning the interpretability of the models, there is no doubt that the ones with few features have a much larger chance of being preferred for clinical use. In the current case, MLP and LR3 would be the best ones, representing a good compromise between accuracy and interpretability. LR3 is particularly attractive, giving the possibility to easily represent the risk of toxicity in a single plot, taking the “continuous” variable (i.e., the PTV volume) on the x-axis and modifying the logistic curve in case of the presence of the two other adverse factors (axillary dissection and the superior external quadrant, QSE). Of note, the two clinical factors selected by LR3 were considered clinically motivated by the reference clinician after discussion. On the other hand, LR with a fixed number of variables and bootstrapping, as mentioned, also seems to be quite “unstable”; thus, the current result should not be generalized to other situations. Importantly, future work could address pushing ML models to stay with the same smaller set of features in order to study how much (or whether) their performance drops.
The issue of using models that are too complex with respect to clinical explainability and usability has already been widely addressed (see, for instance, [18]). Given all the previously discussed issues, a reasonable approach would thus be to consider more models at the same time to exploit their different strengths, possibly “forcing” models to limit the number of features, and finally choosing the “best” models based on the compromise between accuracy and interpretability. To this end, it is highly important to include clinicians in the final model evaluation to better assess clinical interpretations and the model’s usability.
The performance of few-feature LR models was similar to that of more complex ML models, which is not what one would expect. The relatively low performances are in line with what was previously found with a comparable sample size and selected features in the field of breast toxicity after radiotherapy [22]: our results achieve an AUC on the test set of 0.66, while their results report an AUC of 0.65. A reason could be found in the unbalanced dataset, with an outcome with relatively few positives. A more heterogeneously treated dataset [23] shows slightly higher performance, possibly simply reflecting major differences between sub-groups of patients, a phenomenon not present in our homogeneously treated population. Moreover, ML models tend to work better when a higher number of variables contributes to the result, and thus tend to improve their performance as more variables are used. Furthermore, some important variables are missing here. In order to obtain better-performing models (say, AUC > 0.80), the availability of other potential predictors would be highly relevant: in the case of toxicity after breast WBI, more accurate DVH information, information on the genotype and/or the phenotype, and densitometry characteristics of the breast may all importantly contribute to improving the accuracy of the resulting models, as occasionally reported [13,23,24,25]. The availability of more individually characterizing features (here in part missing) is expected to have a much higher impact than the choice of the best-performing ML/DL approach. For example, Cilla et al. [25] (with the same outcome: acute toxicity) reached performances of AUC = 0.87 for an LR model using a small dataset (n = 129) and spectrophotometric features, which are not easy to extract for each patient and are not present in our database. On the other hand, the availability of highly detailed individual information is harder to obtain in large cohorts like ours. For this reason, ongoing and future studies on the current breast toxicity cohort will focus on the incorporation of skin DVH information and of densitometry/radiomics features into the models, both being individual features that may be recovered. In addition, the study will be extended to late toxicities.

5. Conclusions

The comparison between 12 ML models predicting acute toxicity was performed on a large cohort of patients (n = 1314). No model performed best for all metrics: more complex ML models have better performances, and ElasticNet performs better than the typically used LR when run with no fixed number of variables. For all of them, the feature importance was studied to address the explainability issue, finding that typically the main gain is given by the first few features. LR with three fixed features and bootstrap showed performances similar to those of more complex models. Thanks to its simplicity and performance, this model could feasibly be tested and possibly implemented into clinical practice. It would be reasonable to suggest, ideally, considering more models at the same time to exploit their different strengths, possibly “limiting” the number of features. The “best” models should be chosen based on the compromise between accuracy and interpretability, including clinicians in the final model evaluation to better assess clinical interpretations and model usability.

Author Contributions

Conceptualization, C.F., M.G.U.-G. and M.M.; Data curation, A.F., N.G.D.M., P.M. and A.C.; Data organization, M.G.U.-G. and M.P.; Formal analysis, methodology and software, M.G.U.-G. and G.P.; Visualization, M.G.U.-G.; writing—original draft, M.G.U.-G. and C.F.; writing—review and editing, A.F., T.R. and A.d.V.; Supervision, C.F.; Funding acquisition, C.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by Fondazione Regionale per la Ricerca Biomedica, Project No.: 110-JTC PerPlanRT ERA PerMed, GA 779282.

Institutional Review Board Statement

Patient data were analyzed within the retrospective study approved by the Institutional Ethical Committee (registered to ClinicalTrials.gov, Identifier No.: NCT03077191, approval date: 14 July 2016).

Informed Consent Statement

All patients involved in the study signed informed consent.

Data Availability Statement

No data are publicly available due to Ethical Committee constraints.

Acknowledgments

The authors want to thank S. Carrazza for useful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this section, a description of the hyperparameters and metrics used for each model, the calibration plot, the learning curve for ElasticNet, the backward feature selector, and the same as Figure 2 and Table 6 (but with non-SMOTE-augmented data and with the threshold at 0.5) is presented. Moreover, the computational resources used and running time spent are described.

Appendix A.1

In Table A1, the hyperparameters chosen by the best models are shown and in Table A2, links to the procedures used with their versions are described.
Table A1. Best values chosen for the hyperparameters explored for each model described. Bold parameters were found through a cross-validated optimization procedure, while the others were kept fixed. RF, SVC, MLP, ElasticNet, LR, Lasso, GNB, and KNN were all implemented through sklearn. Note that for LR with bootstrap, statsmodels was used instead.
Model | Hyperparameters
LightGBM | objective: binary, metric: binary error, boosting type: gbdt, lambda l1: 4.17, lambda l2: 7.20, num leaves: 2, min child samples: 2, max depth: 10, bagging freq: 6, min data in leaf: 84, bagging fraction: 0.57, learning rate: 0.20, feature fraction: 0.65
AutoGluon | eval metric: roc auc, sample weight: auto weight, num bag folds: 2, num bag sets: 3, num stack levels: 1, presets: best quality
RF | criterion: gini, max leaf nodes: 17, bootstrap: True, ccp alpha: 0.02, class weight: balanced, max depth: 7, min samples split: 12, min samples leaf: 7, n estimators: 75, min impurity decrease: 0.0
SVC | class weight: balanced, probability: True, C: 0.8, kernel: poly, degree: 3
MLP | solver: Adam, max iter: 5000, learning rate: adaptive, epsilon: 6 × 10⁻⁸, activation: relu, alpha: 1.0, hidden layer sizes: 5, learning rate init: 0.001
ElasticNet | fit intercept: True, max iter: 1000, penalty: ElasticNet, solver: Saga, C: 4.83, l1 ratio: 0.70
LR | C: 1, solver: newton-cg, fit intercept: True, max iter: 1000, penalty: none
Lasso | fit intercept: True, intercept scaling: 1, max iter: 1000, penalty: l1, C: 4.83, solver: saga
GNB | priors: None, var smoothing: 0.001
KNN | algorithm: auto, p: 1, leaf size: 30, metric: minkowski, n neighbors: 17, weights: distance
Table A2. Links to the models and procedures used in the paper.
Models/Methods | Link to the Code | Version Used
LightGBM | https://lightgbm.readthedocs.io/en/v3.3.2 | 4.1.0
AutoGluon | https://auto.gluon.ai/stable/index.html | 0.8.2
RF | https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html | 0.0
SVC | https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC | 0.0
MLP | https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html | 0.0
ElasticNet | https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html | 0.0
LR Multivariate | https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html | 0.0
LR bootstrap | https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.Logit.html | 0.13.2
Lasso | https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html | 0.0
GNB | https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html | 0.0
KNN | https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html | 0.0
Robust scaling | https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html | 0.0
SMOTE | https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html | 0.8.0
Bayesian search | https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html | 0.9.0
K-Fold cross-validator | https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html | 0.0
SBFS | https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector | 0.21.0
Optuna | https://optuna.org/ | 2.10
It is important to note that all LR models run without bootstrap were implemented in sklearn and with hyperparameters described in Table A1, while LR3/LR4 with bootstrap were implemented through statsmodels. Bold parameters were tuned through a cross-validated optimization procedure, while the others were kept fixed. In particular, for RF, SVC, MLP, ElasticNet, LR, Lasso, GNB, and KNN, a Bayesian search was used, aimed at optimizing the metrics (balanced accuracy, f1 weighted, f1 macro, AUC). Moreover, all models, except for AutoGluon and LightGBM, were implemented using the sklearn Python library.
In order to select the best model, a combination of different metrics was run both to find the best hyperparameters of the model and the best features. For all of them, the AUC of training and internal tests was obtained. In Table A3, the AUC of the internal test is shown, with the best value found in bold (same as the one quoted in Table 6). The best metrics chosen are the same as quoted in Table 5. Typically, this corresponds to a model with less discrepancy in AUC between training and internal tests. This was carried out to avoid penalizing one method against another by considering the “best” metric for each method. In this regard, the test AUC was chosen only as the final criterion. Despite the limits of AUC, we chose it because it is the metric mostly used and most familiar to clinicians.
Table A3. Model comparison for different modalities of cross-validation, varying the metrics used for SBFS (metric_sfs) and model hyperparameter optimization (metric_mod_opt). Here, the final AUCs of the best models for the outcome of acute toxicity, obtained on the internal test, are shown for comparison. Models are in descending order of the AUC on the test dataset. In bold, the best metric combination, AUC, is shown (same as in Table 6).
Metric/Model | AUC Internal Test
metric_sfs | Balacc | f1 Weight | f1 Macro | AUC | AUC | AUC | AUC
metric_mod_opt | Balacc | f1 Weight | f1 Macro | AUC | Balacc | f1 Weight | f1 Macro
Random Forest | 0.579 | 0.599 | 0.573 | 0.590 | 0.612 | 0.616 | 0.616
SVC | 0.541 | 0.524 | 0.537 | 0.614 | 0.562 | 0.547 | 0.562
MLP sklearn | 0.471 | 0.494 | 0.602 | 0.424 | 0.577 | 0.577 | 0.577
ElasticNet | 0.602 | 0.555 | 0.602 | 0.514 | 0.513 | 0.513 | 0.513
LR sklearn | 0.578 | 0.578 | 0.531 | 0.513 | 0.513 | 0.513 | 0.513
Lasso | 0.554 | 0.554 | 0.554 | 0.554 | 0.513 | 0.513 | 0.513
Naïve Bayes | 0.548 | 0.554 | 0.495 | 0.521 | 0.547 | 0.548 | 0.547
KNN | 0.523 | 0.459 | 0.467 | 0.533 | 0.486 | 0.486 | 0.486
In Figure A1, the same plot presented in the main text (Figure 2) is shown with two differences. First of all, the training data are shown with no data augmentation. Models are always the ones generated with SMOTE, but the data plotted here are a subset of the ones used for training (i.e., the original training data). Moreover, results were derived with the standard threshold value of 0.5 for all models, with the exception of the most advanced models (AutoGluon and LightGBM), for which the values derived through the Youden criterion were considered.
In Figure A2, the calibration plot is shown for two models as an example: ElasticNet on the top and LightGBM on the bottom. The red line shows the expected profile (y = x). The blue points show the fraction of positive samples versus the mean predicted probability, and the blue line is the linear regression on these points with its error band in light blue. The gray histogram represents the distribution of the predicted probability. It can be seen that, for the two models shown, this distribution differs: in the case of LightGBM, it is more peaked than for ElasticNet and ranges between 0 and 0.5. Training and test calibration plots have similar slopes for both ElasticNet and LightGBM, with the former being about half the expected value and the latter more than twice it; this is due to the different probability distributions obtained. The fact that the slope is similar between training and test gives the possibility to recalibrate the models a posteriori. These plots were made with non-SMOTE-augmented data for both training and test sets, even when augmentation was used to build the models. If the training plot were shown with augmented data, the distribution would almost match the red line, but the test plot would remain as shown here. For this reason, it is preferable to consider data from both training and test without augmentation, in order to retain the possibility of recalibration.
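The calibration slope and R² discussed here can be obtained, for instance, with scikit-learn's calibration_curve followed by a simple linear regression (a sketch; a fitted `model` and the non-augmented data splits are assumed):

```python
from scipy.stats import linregress
from sklearn.calibration import calibration_curve

# Bin the predicted probabilities and compare with the observed event fractions
prob_pred = model.predict_proba(X_test)[:, 1]
frac_pos, mean_pred = calibration_curve(y_test, prob_pred, n_bins=10)

# Slope and R^2 of observed vs. predicted fractions (ideal calibration: slope = 1)
fit = linregress(mean_pred, frac_pos)
slope, r_squared = fit.slope, fit.rvalue ** 2
```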
Figure A1. Model comparison on training (red) and internal test (blue) dataset for acute toxicity. The following metrics were compared: AUC, f1 on patients with/without the event, Brier score, slope, and R 2 of the calibration plot. Threshold used for all models is 0.5 with the exception of AutoGluon and LightGBM where the value was derived through the Youden criterion.
Figure A2. ElasticNet (top) and LightGBM (bottom) calibration plot, respectively, for training (left) and test (right) data for the outcome acute toxicity. Blue line is the linear regression on blue points with its error band. The gray histogram represents the predicted probability. Note that the training data do not show SMOTE-augmented data.
In Figure A3, the learning curve for ElasticNet is shown for two metrics (balanced accuracy (top) and AUC (bottom)), varying the number of training instances. It can be seen that, while the value of AUC is almost stable, the balanced accuracy score is still largely increasing, suggesting a plausible strong improvement of this metric with more data; this effect is visible for most models. The cross-validation curve, instead, is shown for one of the parameters explored (see Table A1), where it can be seen that a plateau is reached for both metrics.
Figure A3. ElasticNet learning curve (left) and cross-validation curve (right) for both balanced accuracy (top) and AUC (bottom) metrics.
Finally, in Figure A4, the SBFS is presented for models not shown in the main text (see Figure 5). It is possible to see how typically the main contribution is given by the first 3–6 features. In Table A4, the feature selection in preprocessing is summarized.
Figure A4. SBFS, respectively, for logistic regression (top left), Lasso (top right), Gaussian naïve Bayes (center left), MLP (center right), random forest (bottom left), and KNN (bottom right). Performances are shown on cross-validation set after training was performed. Light-blue box is the standard deviation of the performance over the feature number.
Table A4. Variables given as input to the pipeline are listed on the left. The number “1” marks variables retained at that step, while “-” marks variables dropped by that specific criterion. The first step was to delete variables with intrafeature variance < 0.02, then to drop the ones with a high p-value at univariate LR (global > 0.8) and, finally, the ones with Spearman correlation |ρ| > 0.8.
Variables | High Variance | High LR p-Value | High Correlation
Age | 1 | 1 | 1
Axillary Dissection | 1 | 1 | 1
Quadrant position | 1 | 1 | 1
T Size | 1 | 1 | 1
Chemo | 1 | - |
Type Chemo | 1 | 1 | 1
AB monoclonal | 1 | - |
Hormonal Therapy | 1 | - |
Bilateral RT | - | |
Right | 1 | 1 | -
No fr with bolus | 1 | 1 | 1
Bolus | 1 | 1 | -
PTV Volume | 1 | 1 | 1
PTV V105% | 1 | 1 | 1
Body D1% | 1 | 1 | 1
Obesity | 1 | 1 | 1
Diabetes | 1 | 1 | 1
Hypertension | 1 | 1 | 1
Thyroid Disorders | 1 | 1 | 1
Smoke | 1 | 1 | 1
Alcohol | 1 | - |
Asymmetry | 1 | 1 | 1
Overall Cosmesis | 1 | - |
No Nipple/Retraction | 1 | 1 | 1
Hormonal Type | 1 | 1 | 1

Appendix A.2. Computational Resources and Running Time Spent

The computational resources used were a CPU Intel(R) Xeon(R) Silver 4110 @ 2.10 GHz with 16 cores, 16 logical processors, and 32 GB of RAM. For AutoGluon and LightGBM, Colab with a CPU runtime was used. The average time spent to run each of these models is about 4.5 min, of which about 3 min are spent on SBFS feature extraction (minimum, 1 min; maximum, 6 min). For the 8 models run through this procedure (i.e., excluding AutoGluon and LightGBM), the total computational time was about one hour. The computational time was not fine-tuned, as this was not the aim of this work; CPU usage was indeed only around 15%. AutoGluon and LightGBM were run on Colab, and the time spent was about 7 min and less than one minute, respectively.

References

  1. Fiorino, C.; Jeraj, R.; Clark, C.H.; Garibaldi, C.; Georg, D.; Muren, L.; van Elmpt, W.; Bortfeld, T.; Jornet, N. Grand challenges for medical physics in radiation oncology. Radiother. Oncol. 2020, 153, 7–14. [Google Scholar] [CrossRef]
  2. Fiorino, C.; Guckenberger, M.; Schwarz, M.; van der Heide, U.A.; Heijmen, B. Technology-driven research for radiotherapy innovation. Mol. Oncol. 2020, 14, 1500–1513. [Google Scholar] [CrossRef]
  3. Siddique, S.; Chow, J.C. Artificial intelligence in radiotherapy. Rep. Pract. Oncol. Radiother. 2020, 25, 656–666. [Google Scholar] [CrossRef]
  4. Ho, D. Artificial intelligence in cancer therapy. Science 2020, 367, 982–983. [Google Scholar] [CrossRef] [PubMed]
  5. Chow, J.C.L. Artificial Intelligence in Radiotherapy and Patient Care. In Artificial Intelligence in Medicine; Springer International Publishing: Cham, Switzerland, 2021; pp. 1–13. [Google Scholar] [CrossRef]
  6. Darby, S.; McGale, P.; Correa, C.; Taylor, C.; Arriagada, R.; Clarke, M.; Cutter, D.; Davies, C.; Ewertz, M.; Godwin, J.; et al. Effect of radiotherapy after breast-conserving surgery on 10-year recurrence and 15-year breast cancer death: Meta-analysis of individual patient data for 10801 women in 17 randomised trials. Lancet 2011, 378, 1707–1716. [Google Scholar] [CrossRef] [PubMed]
  7. Shah, C.; Al-Hilli, Z.; Vicini, F. Advances in Breast Cancer Radiotherapy: Implications for Current and Future Practice. JCO Oncol. Pract. 2021, 17, 697–706. [Google Scholar] [CrossRef]
  8. Seibold, P.; Webb, A.; Aguado-Barrera, M.E.; Azria, D.; Bourgier, C.; Brengues, M.; Briers, E.; Bultijnck, R.; Calvo-Crespo, P.; Carballo, A.; et al. REQUITE: A prospective multicentre cohort study of patients undergoing radiotherapy for breast, lung or prostate cancer. Radiother. Oncol. 2019, 138, 59–67. [Google Scholar] [CrossRef]
  9. McGale, P.; Taylor, C.; Correa, C.; Cutter, D.; Duane, F.; Ewertz, M.; Gray, R.; Mannu, G.; Peto, R.; Whelan, T.; et al. Effect of radiotherapy after mastectomy and axillary surgery on 10-year recurrence and 20-year breast cancer mortality: Meta-analysis of individual patient data for 8135 women in 22 randomised trials. Lancet 2014, 383, 2127–2135. [Google Scholar] [CrossRef]
  10. Cox, J.D.; Stetz, J.; Pajak, T.F. Toxicity criteria of the Radiation Therapy Oncology Group (RTOG) and the European organization for research and treatment of cancer (EORTC). Int. J. Radiat. Oncol. Biol. Phys. 1995, 31, 1341–1346. [Google Scholar] [CrossRef] [PubMed]
  11. Chan, R.J.; Larsen, E.; Chan, P. Re-examining the Evidence in Radiation Dermatitis Management Literature: An Overview and a Critical Appraisal of Systematic Reviews. Int. J. Radiat. Oncol. Biol. Phys. 2012, 84, e357–e362. [Google Scholar] [CrossRef]
  12. Tesselaar, E.; Flejmer, A.M.; Farnebo, S.; Dasu, A. Changes in skin microcirculation during radiation therapy for breast cancer. Acta Oncol. 2017, 56, 1072–1080. [Google Scholar] [CrossRef]
  13. Avanzo, M.; Pirrone, G.; Vinante, L.; Caroli, A.; Stancanello, J.; Drigo, A.; Massarut, S.; Mileto, M.; Urbani, M.; Trovo, M.; et al. Electron Density and Biologically Effective Dose (BED) Radiomics-Based Machine Learning Models to Predict Late Radiation-Induced Subcutaneous Fibrosis. Front. Oncol. 2020, 10, 490. [Google Scholar] [CrossRef] [PubMed]
  14. Rancati, T.; Fiorino, C. Modelling Radiotherapy Side Effects: Practical Applications for Planning Optimisation; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar] [CrossRef]
  15. Harbeck, N. Breast cancer is a systemic disease optimally treated by a multidisciplinary team. Nat. Rev. Dis. Prim. 2020, 6, 30. [Google Scholar] [CrossRef] [PubMed]
  16. Denham, J.W.; Hauer-Jensen, M. The radiotherapeutic injury—A complex ‘wound’. Radiother. Oncol. 2002, 63, 129–145. [CrossRef]
  17. Isaksson, L.J.; Pepa, M.; Zaffaroni, M.; Marvaso, G.; Alterio, D.; Volpe, S.; Corrao, G.; Augugliaro, M.; Starzyńska, A.; Leonardi, M.C.; et al. Machine Learning-Based Models for Prediction of Toxicity Outcomes in Radiotherapy. Front. Oncol. 2020, 10, 790. [Google Scholar] [CrossRef]
  18. Fiorino, C.; Rancati, T. Artificial intelligence applied to medicine: There is an “elephant in the room”. Phys. Med. 2022, 98, 8–10. [Google Scholar] [CrossRef]
  19. Mbah, C.; Thierens, H.; Thas, O.; Neve, J.D.; Chang-Claude, J.; Seibold, P.; Botma, A.; West, C.; Ruyck, K.D. Pitfalls in Prediction Modeling for Normal Tissue Toxicity in Radiation Therapy: An Illustration With the Individual Radiation Sensitivity and Mammary Carcinoma Risk Factor Investigation Cohorts. Int. J. Radiat. Oncol. Biol. Phys. 2016, 95, 1466–1476. [Google Scholar] [CrossRef] [PubMed]
  20. Reddy, J.; Lindsay, W.; Berlind, C.; Ahern, C.; Smith, B. Applying a Machine Learning Approach to Predict Acute Toxicities During Radiation for Breast Cancer Patients. Int. J. Radiat. Oncol. Biol. Phys. 2018, 102, S59. [Google Scholar] [CrossRef]
  21. Saednia, K.; Tabbarah, S.; Lagree, A.; Wu, T.; Klein, J.; Garcia, E.; Hall, M.; Chow, E.; Rakovitch, E.; Childs, C.; et al. Quantitative Thermal Imaging Biomarkers to Detect Acute Skin Toxicity from Breast Radiotherapy Using Supervised Machine Learning. Int. J. Radiat. Oncol. Biol. Phys. 2020, 106, 1071–1083. [Google Scholar] [CrossRef]
  22. Rattay, T.; Seibold, P.; Aguado-Barrera, M.E.; Altabas, M.; Azria, D.; Barnett, G.C.; Bultijnck, R.; Chang-Claude, J.; Choudhury, A.; Coles, C.E.; et al. External Validation of a Predictive Model for Acute Skin Radiation Toxicity in the REQUITE Breast Cohort. Front. Oncol. 2020, 10, 575909. [Google Scholar] [CrossRef]
  23. Aldraimli, M.; Osman, S.O.S.; Grishchuck, D.; Ingram, S.P.; Lyon, R.; Mistry, A.; Oliveira, J.; Samuel, R.; Shelley, L.E.A.; Soria, D.; et al. Development and Optimization of a Machine-Learning Prediction Model for Acute Desquamation After Breast Radiation Therapy in the Multicenter REQUITE Cohort. Adv. Radiat. Oncol. 2022, 7, 100890. [Google Scholar] [CrossRef] [PubMed]
  24. Li, X.; Wang, H.; Xu, L.Y.; Ren, Y.; Deng, W.; Feng, H.; Yang, Z.; Ma, S.; Ni, Q.; Kuang, Y. A Machine Learning Framework for Early Prediction of Radiation Dermatitis in Patients with Breast Cancer Receiving Radiation Treatment: A Multicenter Retrospective Analysis Study. Int. J. Radiat. Oncol. Biol. Phys. 2022, 114, e119. [Google Scholar] [CrossRef]
  25. Cilla, S.; Romano, C.; Macchia, G.; Boccardi, M.; Pezzulla, D.; Buwenge, M.; Castelnuovo, A.D.; Bracone, F.; Curtis, A.D.; Cerletti, C.; et al. Machine-learning prediction model for acute skin toxicity after breast radiation therapy using spectrophotometry. Front. Oncol. 2023, 12, 1044358. [Google Scholar] [CrossRef] [PubMed]
  26. Fodor, A.; Brombin, C.; Mangili, P.; Borroni, F.; Pasetti, M.; Tummineri, R.; Zerbetto, F.; Longobardi, B.; Perna, L.; Dell’Oca, I.; et al. Impact of molecular subtype on 1325 early-stage breast cancer patients homogeneously treated with hypofractionated radiotherapy without boost: Should the indications for radiotherapy be more personalized? Breast 2021, 55, 45–54. [Google Scholar] [CrossRef] [PubMed]
  27. Fodor, A.; Brombin, C.; Mangili, P.; Tummineri, R.; Pasetti, M.; Zerbetto, F.; Longobardi, B.; Galvan, A.S.; Deantoni, C.L.; Dell’Oca, I.; et al. Toxicity of Hypofractionated Whole Breast Radiotherapy Without Boost and Timescale of Late Skin Responses in a Large Cohort of Early-Stage Breast Cancer Patients. Clin. Breast Cancer 2022, 22, e480–e487. [Google Scholar] [CrossRef] [PubMed]
  28. Ahsan, M.; Mahmud, M.; Saha, P.; Gupta, K.; Siddique, Z. Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance. Technologies 2021, 9, 52. [Google Scholar] [CrossRef]
  29. Steyerberg, E. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating; Springer: New York, NY, USA, 2009; Volume 19, Chapter 11. Selection of Main Effects. [Google Scholar] [CrossRef]
  30. Palumbo, D.; Mori, M.; Prato, F.; Crippa, S.; Belfiori, G.; Reni, M.; Mushtaq, J.; Aleotti, F.; Guazzarotti, G.; Cao, R.; et al. Prediction of Early Distant Recurrence in Upfront Resectable Pancreatic Adenocarcinoma: A Multidisciplinary, Machine Learning-Based Approach. Cancers 2021, 13, 4938. [Google Scholar] [CrossRef]
  31. Friedman, J.H.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef]
  32. Goldberger, J.; Hinton, G.E.; Roweis, S.; Salakhutdinov, R.R. Neighbourhood Components Analysis. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 1 December 2004; Advances in Neural Information Processing Systems. Saul, L., Weiss, Y., Bottou, L., Eds.; MIT Press: Cambridge, MA, USA, 2004; Volume 17. [Google Scholar]
  33. Schölkopf, B.; Smola, A.J.; Williamson, R.C.; Bartlett, P.L. New Support Vector Algorithms. Neural Comput. 2000, 12, 1207–1245. [Google Scholar] [CrossRef]
  34. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  35. Zhang, H. Exploring conditions for the optimality of naive Bayes. Int. J. Pattern Recognit. Artif. Intell. 2005, 19, 183–198. [Google Scholar] [CrossRef]
  36. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; Teh, Y.W., Titterington, M., Eds.; Chia Laguna Resort: Chia, Italy, 2010; Volume 9, pp. 249–256. [Google Scholar]
  37. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  38. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  39. Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. arXiv 2020, arXiv:2003.06505. [Google Scholar] [CrossRef]
  40. Uddin, S.; Khan, A.; Hossain, M.E.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 281. [Google Scholar] [CrossRef] [PubMed]
  41. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  42. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv 2016, arXiv:1602.04938. [Google Scholar] [CrossRef]
  43. Mori, M.; Passoni, P.; Incerti, E.; Bettinardi, V.; Broggi, S.; Reni, M.; Whybra, P.; Spezi, E.; Vanoli, E.; Gianolli, L.; et al. Training and validation of a robust PET radiomic-based index to predict distant-relapse-free-survival after radio-chemotherapy for locally advanced pancreatic cancer. Radiother. Oncol. 2020, 153. [Google Scholar] [CrossRef]
  44. Naqa, I.E.; Bradley, J.; Blanco, A.I.; Lindsay, P.E.; Vicic, M.; Hope, A.; Deasy, J.O. Multivariable modeling of radiotherapy outcomes, including dose–volume and clinical factors. Int. J. Radiat. Oncol. Biol. Phys. 2006, 64, 1275–1286. [Google Scholar] [CrossRef]
  45. Palorini, F.; Rancati, T.; Cozzarini, C.; Improta, I.; Carillo, V.; Avuzzi, B.; Borca, V.C.; Botti, A.; Esposti, C.D.; Franco, P.; et al. Multi-variable models of large International Prostate Symptom Score worsening at the end of therapy in prostate cancer radiotherapy. Radiother. Oncol. 2016, 118, 92–98. [Google Scholar] [CrossRef]
  46. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar] [CrossRef]
  47. Avanzo, M.; Stancanello, J.; Naqa, I.E. Beyond imaging: The promise of radiomics. Phys. Med. 2017, 38, 122–139. [Google Scholar] [CrossRef]
  48. Rossi, L.; Bijman, R.; Schillemans, W.; Aluwini, S.; Cavedon, C.; Witte, M.; Incrocci, L.; Heijmen, B. Texture analysis of 3D dose distributions for predictive modelling of toxicity rates in radiotherapy. Radiother. Oncol. 2018, 129, 548–553. [Google Scholar] [CrossRef]
  49. Placidi, L.; Gioscio, E.; Garibaldi, C.; Rancati, T.; Fanizzi, A.; Maestri, D.; Massafra, R.; Menghi, E.; Mirandola, A.; Reggiori, G.; et al. A Multicentre Evaluation of Dosiomics Features Reproducibility, Stability and Sensitivity. Cancers 2021, 13, 3835. [Google Scholar] [CrossRef] [PubMed]
  50. Yahya, N.; Ebert, M.A.; Bulsara, M.; House, M.J.; Kennedy, A.; Joseph, D.J.; Denham, J.W. Statistical-learning strategies generate only modestly performing predictive models for urinary symptoms following external beam radiotherapy of the prostate: A comparison of conventional and machine-learning methods. Med. Phys. 2016, 43, 2040–2052. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Spearman correlation matrix of the 20 features obtained after preprocessing. Features are ordered by decreasing univariate LR p-value. The color bar shows the Spearman ρ correlation coefficient.
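As a companion to Figure 1, the following is a minimal sketch, assuming pandas, of how a Spearman correlation matrix can be computed and how one member of each highly correlated pair (ρ > 0.8) can be pruned; the feature names and data are illustrative, not the study's actual feature set.

```python
import numpy as np
import pandas as pd

# Illustrative feature matrix; in the study this would be the 20 preprocessed features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(984, 5)),
                 columns=["age", "ptv_volume", "ptv_v105", "body_d1", "asymmetry"])
X["ptv_v105"] = X["ptv_volume"] + 0.2 * rng.normal(size=984)  # induce one strongly correlated pair

rho = X.corr(method="spearman")  # Spearman correlation matrix (as in Figure 1)

# Drop the second feature of every pair with |rho| > 0.8 (upper triangle only).
upper = rho.where(np.triu(np.ones(rho.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.8).any()]
X_reduced = X.drop(columns=to_drop)
print("Dropped:", to_drop)
```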
Figure 2. Model comparison on the training (red) and test (blue) datasets for acute toxicity. The following metrics were compared: AUC, F1 computed separately for patients with and without the event, Brier score, and the slope and R² of the calibration plot. F1 scores were derived using a threshold chosen by the Youden criterion (see Table 5). For all models except AutoGluon and LightGBM, training data are shown with SMOTE-augmented data, which explains the large training–test discrepancy.
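The quantities plotted in Figure 2 can be computed with standard scikit-learn/NumPy calls; the snippet below is a sketch under the assumption that the observed labels and predicted probabilities are available, with the calibration slope and R² obtained from a linear fit of the calibration curve (all data here are placeholders).

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, f1_score, roc_auc_score

# Placeholder labels and predicted probabilities standing in for a model's test output.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=330)
p = np.clip(y_true * 0.3 + rng.uniform(0.0, 0.7, size=330), 0.0, 1.0)

auc = roc_auc_score(y_true, p)
brier = brier_score_loss(y_true, p)

# Per-class F1 at a fixed threshold (in the paper, the Youden threshold of Table 5).
threshold = 0.3
y_pred = (p >= threshold).astype(int)
f1_class0 = f1_score(y_true, y_pred, pos_label=0)
f1_class1 = f1_score(y_true, y_pred, pos_label=1)

# Calibration slope and R^2: linear fit of observed vs. predicted event fractions.
frac_obs, frac_pred = calibration_curve(y_true, p, n_bins=10)
slope, _ = np.polyfit(frac_pred, frac_obs, deg=1)
r2 = np.corrcoef(frac_pred, frac_obs)[0, 1] ** 2

print(f"AUC={auc:.3f} Brier={brier:.3f} f1_0={f1_class0:.2f} f1_1={f1_class1:.2f} slope={slope:.2f} R2={r2:.2f}")
```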
Figure 3. ROC curves for the acute toxicity outcome on the training (red) and test (blue) data. For all models except LightGBM, training data are shown with SMOTE-augmented data. Error bands (gray) and the average curves (lines) were obtained by bootstrapping the training and test sets. The optimal points are the thresholds computed with the Youden method on the mean curves.
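The bootstrap bands and Youden optimal points of Figure 3 can be reproduced along the following lines; this is a sketch assuming scikit-learn, with the number of resamples and the data purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=330)                                  # placeholder labels
p = np.clip(y * 0.3 + rng.uniform(0.0, 0.7, size=330), 0.0, 1.0)  # placeholder probabilities

grid = np.linspace(0, 1, 101)
tprs = []
for _ in range(200):                           # bootstrap resamples
    idx = rng.integers(0, len(y), size=len(y))
    if len(np.unique(y[idx])) < 2:             # skip degenerate resamples with a single class
        continue
    fpr, tpr, _ = roc_curve(y[idx], p[idx])
    tprs.append(np.interp(grid, fpr, tpr))     # align each curve on a common FPR grid

mean_tpr, sd_tpr = np.mean(tprs, axis=0), np.std(tprs, axis=0)  # average curve and error band

# Youden criterion on the mean curve: maximize J = TPR - FPR.
j = mean_tpr - grid
optimal_fpr, optimal_tpr = grid[np.argmax(j)], mean_tpr[np.argmax(j)]
print(f"Optimal point (FPR, TPR) = ({optimal_fpr:.2f}, {optimal_tpr:.2f})")
```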
Figure 4. Feature importance of ElasticNet (top left), LR (bottom left), SVC (top right), and LightGBM (bottom right) for acute toxicity. For ElasticNet and LR, the values are the model coefficients derived on scaled data and are thus clinically relevant; for SVC and LightGBM, the values are derived through feature permutation using different intrinsic methods.
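Figure 4 mixes two notions of importance: coefficients for the linear models and permutation-based importance for the others. A sketch of both, assuming scikit-learn and illustrative data, could look like this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)               # coefficients are comparable only on scaled data

lr = LogisticRegression(max_iter=1000).fit(X, y)
print("LR coefficients:", np.round(lr.coef_[0], 2))  # signed importance for linear models

svc = SVC(kernel="rbf", probability=True).fit(X, y)
perm = permutation_importance(svc, X, y, scoring="roc_auc", n_repeats=10, random_state=0)
print("SVC permutation importance:", np.round(perm.importances_mean, 3))
```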
Figure 5. Sequential backward floating selection for ElasticNet (left) and SVC (right). The metric performance is computed on the cross-validation dataset. The standard deviation and the best and parsimonious feature numbers are represented, respectively, by a light-blue error band and by vertical black and gray lines.
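The sequential backward floating selection of Figure 5 can be run, for instance, with mlxtend's SequentialFeatureSelector; the snippet below is a sketch under that assumption (the study's exact implementation and settings may differ), using a synthetic dataset.

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)

sbfs = SFS(LogisticRegression(max_iter=1000),
           k_features="parsimonious",  # smallest subset within one SD of the best CV score
           forward=False,              # backward ...
           floating=True,              # ... floating selection (SBFS)
           scoring="roc_auc",
           cv=5)
sbfs = sbfs.fit(X, y)
print("Selected feature indices:", sbfs.k_feature_idx_)
print("Cross-validated AUC:", round(sbfs.k_score_, 3))
```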
Figure 6. ROC curves obtained with bootstrapping for the prediction of acute toxicity. Training data are shown with SMOTE-augmented data. Models with three (left) and four (right) fixed features are shown.
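The SMOTE augmentation mentioned in Figures 2, 3, and 6 can be reproduced with the imbalanced-learn package; the following is a minimal sketch under that assumption, with class sizes chosen only to mimic the cohort's imbalance.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy training set: roughly 15% of patients with the toxicity event.
X, y = make_classification(n_samples=984, n_features=20, weights=[0.85, 0.15], random_state=0)
print("Before SMOTE:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # synthesize minority-class samples
print("After SMOTE: ", Counter(y_res))
```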
Table 1. Eleven dichotomic variables selected with intrafeature variance > 0.02 and univariate LR p-value < 0.8. NA stands for not recorded. A code sketch of this screening step is given after the table.
Column Names | No | Yes | NA | %No | %Yes | %NA
Axillary Dissection | 1073 | 241 | 0 | 82 | 17 | 0
Type Chemo | 1022 | 292 | 0 | 78 | 21 | 0
Right | 677 | 637 | 0 | 52 | 47 | 0
Bolus | 1007 | 307 | 0 | 77 | 22 | 0
Obesity | 959 | 313 | 42 | 73 | 23 | 3
Diabetes | 1119 | 79 | 116 | 85 | 6 | 8
Hypertension | 751 | 450 | 113 | 57 | 33 | 8
Thyroid Disorders | 1018 | 178 | 118 | 78 | 13 | 8
Smoke | 1054 | 260 | 0 | 80 | 19 | 0
No Nipple/Retraction | 124 | 1190 | 0 | 9 | 90 | 0
Hormonal Type | 582 | 732 | 0 | 44 | 54 | 0
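The screening criteria quoted in the captions of Tables 1–3 (intrafeature variance > 0.02 and univariate LR p-value < 0.8) can be sketched as follows, assuming statsmodels for the univariate logistic fits; the thresholds are those stated in the captions, while the data and feature names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame({"feat_a": rng.normal(size=984),
                  "feat_b": rng.binomial(1, 0.01, size=984),  # near-constant: fails the variance filter
                  "feat_c": rng.binomial(1, 0.3, size=984)})
y = rng.binomial(1, 0.15, size=984)

kept = []
for col in X.columns:
    if X[col].var() <= 0.02:                                  # intrafeature variance filter
        continue
    fit = sm.Logit(y, sm.add_constant(X[col])).fit(disp=0)    # univariate logistic regression
    if fit.pvalues[col] < 0.8:                                # loose univariate p-value filter
        kept.append(col)
print("Retained features:", kept)
```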
Table 2. The categorical variable selected with intrafeature variance > 0.02 and univariate LR p-value < 0.8. NA stands for not recorded (equivalent to right breast for this table). The most predictive categories (univariate LR p-value < 0.2) are marked with *.
Quadrant Position | Count | %Count
QSE * | 597 | 45.4
QSI | 223 | 17.0
QIE | 142 | 10.8
QSE/QSI * | 100 | 7.6
QII | 96 | 7.3
Q retroareolar | 54 | 4.1
QIE/QII | 47 | 3.6
QSE/QIE | 42 | 3.2
QSI/QII | 12 | 0.9
NA | 1 | 0.1
Table 3. Seven continuous variables selected with intrafeature variance > 0.02 and univariate LR p-value < 0.8. NA stands for not recorded; NA values were excluded when deriving the other columns.
Column Names | Median | IQR | Mean | Std | Min | Max | Count NA
Age | 62.16 | [51.57, 70.65] | 61.32 | 11.92 | 27.98 | 91.20 | 0
T Size | 1.30 | [0.9, 1.8] | 1.43 | 0.80 | 0.00 | 6.70 | 14
No fr with bolus | 0.00 | [0.0, 0.0] | 1.64 | 3.07 | 0.00 | 15.00 | 41
PTV Volume | 642.25 | [445.62, 914.47] | 709.51 | 359.56 | 114.80 | 2649.00 | 0
PTV V105% | 2.46 | [1.07, 4.39] | 3.32 | 3.45 | 0.00 | 33.63 | 0
Body D1% | 41.02 | [40.78, 41.27] | 41.03 | 0.42 | 38.37 | 46.23 | 0
Asymmetry | 1.00 | [1.0, 2.0] | 1.22 | 0.83 | 0.00 | 3.00 | 0
Table 4. Description of the models with acronyms and reference papers. Links to the model implementations used are given in Table A2. * Model typically used.
Model | Acronym | Reference Paper | Description
Logistic Regression * | LR | [31] | The event probability is a function of a linear combination of independent variables.
Lasso | Lasso | [31] | It belongs to the same family of models as LR, differing in the chosen penalty.
ElasticNet | ElasticNet | [31] | It combines the penalties of LR and Lasso.
k-Nearest Neighbors | KNN | [32] | A nonparametric classifier that uses proximity to distinguish between classes.
Support Vector Machines | SVM | [33,34] | An algorithm that classifies through the choice of an N-dimensional hyperplane.
Gaussian Naïve Bayes | GNB | [35] | A probabilistic classifier based on Bayes' theorem, assuming strong independence between input features and a normal distribution for each class.
Multi-Layer Perceptron | MLP 1 | [36] | A feedforward artificial neural network (ANN) with multiple layers, in which the mapping between input and output layers is given by nonlinear activation functions.
Random Forest | RF | [37] | It combines multiple decision trees, constructed independently, to reach one final result.
Light Gradient Boosting Machine | LightGBM | [38] | A development of RF: a gradient-boosting algorithm that builds multiple decision trees one after another.
AutoGluon | AutoGluon | [39] | An AutoML framework focused on automated stack ensembling of individually trained classifiers to reduce their intrinsic error, used here as a comparison with the other codes.
1 Note that the version of MLP used here is the simplified implementation provided by sklearn and thus does not offer all the capabilities of a more complex one (e.g., using TensorFlow), which will be investigated in future work.
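For orientation, the model families listed in Table 4 map onto widely used Python implementations roughly as follows; this is a sketch of plausible default instantiations (assuming scikit-learn and the lightgbm package are installed), not the tuned configurations used in the study.

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = {
    "LR": LogisticRegression(max_iter=1000),
    "Lasso": LogisticRegression(penalty="l1", solver="liblinear"),  # L1-penalized logistic regression
    "ElasticNet": LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "GNB": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000),
    "RF": RandomForestClassifier(),
    "LightGBM": LGBMClassifier(),
    # AutoGluon (autogluon.tabular.TabularPredictor) is trained through its own fit() interface.
}
```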
Table 5. Youden thresholds derived for the training/test data, number of parsimonious features selected, and metrics used for SBFS (metric SFS) and for model hyperparameter optimization (metric mod. opt.). The training Youden threshold reported here was used for the confusion matrix, while the one in the ROC plots was derived as the average over bootstrap computations (for KNN, the training threshold was derived by adding ϵ = 0.01). A sketch of the hyperparameter search is given after the table.
Model | thr Youden (Train) | thr Youden (Test) | n. Features | metric SFS | metric mod. opt.
LightGBM | 0.162 | 0.130 | 14 | - | -
AutoGluon | 0.177 | 0.090 | 19 | - | -
LR 3 Variables | 0.294 | 0.283 | 3 | - | -
Random Forest | 0.511 | 0.509 | 5 | AUC | f1 macro
SVC | 0.250 | 0.294 | 10 | AUC | AUC
MLP | 0.500 | 0.489 | 3 | f1 macro | f1 macro
ElasticNet | 0.322 | 0.406 | 13 | balanced acc. | balanced acc.
LR Multivariable | 0.303 | 0.415 | 11 | balanced acc. | balanced acc.
LR 4 Variables | 0.264 | 0.327 | 4 | - | -
Lasso | 0.356 | 0.475 | 9 | balanced acc. | balanced acc.
Naïve Bayes | 0.448 | 0.469 | 9 | f1 weight | f1 weight
KNN | 0.010 | 0.517 | 10 | AUC | AUC
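The optimization metrics listed in Table 5 (e.g., f1 macro for Random Forest) can be plugged into any hyperparameter search framework; the sketch below assumes Optuna, which is cited among the references, with a deliberately small search space and a synthetic dataset.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, weights=[0.85, 0.15], random_state=0)

def objective(trial):
    # Illustrative two-parameter search space for a Random Forest.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("Best f1 macro:", round(study.best_value, 3), "with", study.best_params)
```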
Table 6. Model comparison on the training and test datasets for the outcome of acute toxicity. Models are listed in descending order of AUC on the test dataset. The confusion matrix was extracted by choosing as discriminant threshold the value derived from the Youden training metric (see Table 5), and the other metrics were derived from it. f1 class 0 and f1 class 1 are, respectively, the F1 scores on classes 0 and 1. A sketch of how these metrics follow from the confusion matrix is given after the table.
Model | Train Precision | Train Specificity | Train Sensitivity | Train f1 class 0 | Train f1 class 1 | Train AUC | Test Precision | Test Specificity | Test Sensitivity | Test f1 class 0 | Test f1 class 1 | Test AUC
LightGBM | 0.249 | 0.662 | 0.608 | 0.760 | 0.350 | 0.680 | 0.248 | 0.673 | 0.588 | 0.770 | 0.350 | 0.662
AutoGluon | 0.249 | 0.481 | 0.935 | 0.640 | 0.390 | 0.701 | 0.216 | 0.712 | 0.431 | 0.780 | 0.290 | 0.633
LR 3 variables | 0.438 | 0.583 | 0.693 | 0.680 | 0.540 | 0.681 | 0.205 | 0.568 | 0.608 | 0.690 | 0.310 | 0.627
Random Forest | 0.511 | 0.714 | 0.599 | 0.746 | 0.551 | 0.709 | 0.245 | 0.709 | 0.329 | 0.787 | 0.329 | 0.616
SVC | 0.472 | 0.522 | 0.854 | 0.655 | 0.607 | 0.759 | 0.206 | 0.514 | 0.583 | 0.640 | 0.304 | 0.614
MLP sklearn | 0.474 | 0.678 | 0.582 | 0.719 | 0.522 | 0.637 | 0.225 | 0.637 | 0.510 | 0.733 | 0.312 | 0.602
ElasticNet | 0.497 | 0.651 | 0.691 | 0.721 | 0.578 | 0.725 | 0.219 | 0.658 | 0.437 | 0.738 | 0.292 | 0.602
LR sklearn | 0.487 | 0.618 | 0.728 | 0.705 | 0.584 | 0.717 | 0.204 | 0.647 | 0.417 | 0.730 | 0.274 | 0.578
LR 4 variables | 0.417 | 0.487 | 0.783 | 0.610 | 0.540 | 0.675 | 0.169 | 0.471 | 0.588 | 0.610 | 0.260 | 0.559
Lasso | 0.542 | 0.720 | 0.665 | 0.763 | 0.597 | 0.713 | 0.224 | 0.733 | 0.354 | 0.783 | 0.274 | 0.554
Naïve Bayes | 0.542 | 0.735 | 0.627 | 0.765 | 0.582 | 0.711 | 0.253 | 0.747 | 0.396 | 0.795 | 0.309 | 0.554
KNN | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.173 | 0.050 | 0.917 | 0.290 | 0.094 | 0.533
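All the train/test metrics in Table 6 follow from the confusion matrix at the chosen Youden threshold; a sketch of their derivation with scikit-learn, using placeholder labels and probabilities, is given below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=330)                                  # placeholder labels
p = np.clip(y_true * 0.3 + rng.uniform(0.0, 0.7, size=330), 0.0, 1.0)  # placeholder probabilities

threshold = 0.13  # e.g., the LightGBM test threshold reported in Table 5
y_pred = (p >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)
f1_class0 = f1_score(y_true, y_pred, pos_label=0)
f1_class1 = f1_score(y_true, y_pred, pos_label=1)
auc = roc_auc_score(y_true, p)

print(f"precision={precision:.3f} specificity={specificity:.3f} sensitivity={sensitivity:.3f} "
      f"f1_0={f1_class0:.3f} f1_1={f1_class1:.3f} AUC={auc:.3f}")
```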
