Machine-Learning-Based Radiomics MRI Model for Survival Prediction of Recurrent Glioblastomas Treated with Bevacizumab

Anti-angiogenic therapy with bevacizumab is a widely used therapeutic option for recurrent glioblastoma (GBM). Nevertheless, the therapeutic response remains highly heterogeneous among GBM patients with discordant outcomes. Recent data have shown that radiomics, an advanced recent imaging analysis method, can help to predict both prognosis and therapy in a multitude of solid tumours. The objective of this study was to identify novel biomarkers, extracted from MRI and clinical data, which could predict overall survival (OS) and progression-free survival (PFS) in GBM patients treated with bevacizumab using machine-learning algorithms. In a cohort of 194 recurrent GBM patients (age range 18–80), radiomics data from pre-treatment T2 FLAIR and gadolinium-injected MRI images along with clinical features were analysed. Binary classification models for OS at 9, 12, and 15 months were evaluated. Our classification models successfully stratified the OS. The AUCs were equal to 0.78, 0.85, and 0.76 on the test sets (0.79, 0.82, and 0.87 on the training sets) for the 9-, 12-, and 15-month endpoints, respectively. Regressions yielded a C-index of 0.64 (0.74) for OS and 0.57 (0.69) for PFS. These results suggest that radiomics could assist in the elaboration of a predictive model for treatment selection in recurrent GBM patients.


Introduction
Glioblastoma (GBM), the most common malignant primary intracranial tumour in adults, is associated with a dismal prognosis and a median survival less than 12 months [1,2]. The current standard of care for newly diagnosed GBM is based on maximal safe resection followed by adjuvant radiation therapy and chemotherapy (oral temozolomide) [3][4][5]. Unfortunately, despite this multimodal strategy, almost all patients will eventually relapse; the 5-year survival is merely equal to 10% [6][7][8][9]. GBM, known to be a highly vascularized tumour, generates excessive levels of vascular endothelial growth factor (VEGF), a crucial component of tumour angiogenesis. Bevacizumab, an antiangiogenic drug that targets VEGF ligands [10][11][12], has shown significant activity in GBM and is considered as a valid option by the Food Drug Administration (FDA) for the treatment of recurrent glioblastoma [12,13]. Bevacizumab has significantly improved progression-free survival (PFS) when added to the standard protocol. However, both phase II and III clinical trials failed to demonstrate any significant impact on overall survival (OS) [13][14][15][16][17]. Nevertheless, the survival outcomes remain very heterogeneous, and a subgroup of patients, with a potential clinical benefit from bevacizumab, has yet to be defined [18,19].
With the paucity of options in the recurrent setting, bevacizumab remains largely used in GBM patients due to the improvement in their quality of life. There is an unmet need to identify those patients who might benefit from this drug to prevent unnecessary outcomes and side effects among the entire population. Recently, advances in neuro-imaging with the complex data acquirement (functional and morphological data) from GBM patients have led to a tremendous evolution in the field. The use of artificial intelligence (AI) has largely facilitated these analyses and the evaluation of imaging to better predict outcomes [20]. With the clinical and survival data of GBM as well as the molecular and genetic alterations, generating mathematical methods, with the help of radiomics, for the proper prediction of response to therapy can become a reality in the personalized medicine era.
Radiomics, an emerging field in radio-diagnostics, consists of extracting a high number of quantitative data from radiographic medical images using specific data algorithms. The features include the tumours' shape, texture, and voxels' intensity, which have the potential to uncover patterns that the naked eye could not detect on standard images. Previous experiences have successfully used the radiomics' features in the prognostic prediction of GBM patients treated with bevacizumab [21][22][23][24], but still further analysis on different datasets is needed to draw a complete conclusion on the potential use of radiomics in GBM survival analysis. Moreover, recent work has successfully used a radiomics model in the prediction of the response to bevacizumab in patients with brain necrosis after radiotherapy [25]. In this paper, we evaluated the potential of machinelearning (ML) algorithms based on radiomics in combination with clinical features for generating predictive mathematical models capable of predicting the survival outcomes of recurrent GBM patients treated with bevacizumab.

Patient Selection
Clinical and radiological data were retrospectively collected from magnetic resonance imaging (MRI) tests performed on 200 patients with recurrent GBM (confirmed histologically) at Gustave Roussy Cancer Campus (Villejuif, France) between 2006 and 2016 (based on the World Health Organization (WHO) classification of central nervous system tumours, Grade IV [26]). All enrolled patients must have received bevacizumab for the treatment of recurrent GBM after a first-line treatment failure, which consisted most often in surgery (72% patients) followed by post-operative chemoradiotherapy or chemoradiotherapy alone. Patients under 18 years or above 80 years of age were excluded from the analysis. Further cohort characteristics are detailed in Table 1.
The study was approved by the institutional review board per RGPD provisions. The study was declared on the Health Data Hub site and the CNIL per RGPD recommendations. Additionally, all patients were informed of their enrolment in the study.

MRI Protocol
MR acquisitions were all performed on 2 imaging machines (MRI) from the same manufacturer (General Electric ® , Milwaukee, USA): Optima MR450w 1.5T and Discovery MR750w 3T. MRI data included at least: a post-contrast (gadoterate meglumine Dotarem, Guerbet, Villepinte, France), a three-dimensional T1-weighted fast spoiled gradient recalled (FSPGR) acquisition (post-contrast 3DT1), post-contrast 3DT1, and fat-suppressed FLAIR images. Only MR images were used as inputs of the radiomics classifier. To ensure image quality, neuroradiologists analysed all the available imaging sequences. Table 2 details the MRI parameters for both machines. MR image pre-processing included: Z-score normalization and spatial resampling (with a target voxel size of 1 × 1 × 1 mm 3 ) that was performed due to data inhomogeneity.

Tumour Segmentation
Segmentation of two volumes of interest (2D) were performed semi-automatically using Olea Sphere ® (Olea Medical). The first volume was segmented on post-contrast T1 images and included the enhancing area and the necrotic regions, while the second segmentation was executed on the FLAIR images and included the FLAIR hypersignal. Within a region of interest defined by a trained radiologist (AS with 10 years of experience), the threshold-based grey level contouring and the manual correction were used for the segmentations so that the volumes of interest would be carefully drawn along both tumour enhancement and FLAIR hypersignal.

Radiomics and Features Extraction Technique
The whole radiomic pipeline is summarized in Figure 1. Radiomic features were extracted from the segmented T2 FLAIR and gadolinium-enhanced images using the Olea Sphere ® (Olea Medical La Ciotat, France) radiomics package. An absolute discretization with a bin size of 37 was chosen. This was equivalent to a 32 fixed bin number based on the mean of the intensity intervals computed for all patient volumes of interest within the dataset. Several feature classes were considered: 19 first-order statistics, 17 shape-based features, 23 grey level co-occurrence matrix features (GLCM, texture), 16 grey level run length matrix features (GLRLM, texture), 15 grey level size zone matrix features (GLSZM, texture), 5 original neighbouring grey tone difference matrix, and 14 grey level dependence matrix features (GLDM, texture). To remove potential non-biological variations related to different scanning machines, the Combat normalization from the NeuroCombat Python [27,28] package was applied to the radiomic features. In addition to the radiomics, 9 baseline clinical characteristics (age, sex, delay between the diagnosis and the start of treatment with bevacizumab treatment (Delay R), surgery (yes/no), presence of symptoms (nausea/vomiting, fatigue...), neurological deficit, epilepsy, intracranial hypertension (ICH), or haematological abnormality) before bevacizumab treatment initiation were also extracted.

Machine-Learning (ML) Algorithms
Survival analysis was performed to predict the OS and PFS; then, binary classification models were created at different endpoints. For the classification models, 7 different classifiers were trained: random forest (RFT), gradient boosting, Adaboost, logistic regression, K-neighbours, naïve Bayes, and SVM. All the algorithms were taken from the Scikit-learn Python library. Survival analysis differed from classical regression by the existence of censored data. Survival random forest [29,30] is a modified random forest algorithm that can perform such an analysis by calculating the survival function. This function gives the probability that one patient survives longer than a specific time. It can also calculate the risk score, which is a value computed on an arbitrary scale. Then, if samples are ordered according to their predicted risk score (in ascending order), one obtains the sequence in time of events, as predicted by the model. We used the RandomSurvivalForest from the Scikit-survival Python library to perform such a regression.

Model Building
The pipeline for the model building is summarized in Figure 2. Within the cohort, 6 patients were excluded because of missing clinical data. Hence, the data from 194 patients were used for the regressions. Since 44 patients had a censored date of death, 8 of them were censored after 15 months. Therefore, 158 patients were included for the 15-month survival classification. The same logic was applied for the other classification models (Table 3). For each model, patients were divided into a training (80%) and test cohort (20%) ( Table 3). The proportions of each class over the entire cohort were kept in the training and test sets. Only the training set was used for the development of the statistical models. First, cross-validation (CV = 5) was applied to the training set to select the best hyperparameters for each ML classifier and the selection of the best one to be tested on the test set. Then, the chosen classifiers were trained on the overall training dataset and validated on the test set.  Before the training, feature processing was applied using only the training set. First, the variable was standardized using the StandardScaler from the Scikit-learn Python library. Then, because of the high number of features compared to the number of patients and to overcome the well-known "curse of dimensionality" in ML, we performed a feature selection to reduce dimensionality and decrease the risk of overfitting: only the features with a concordance index (C-index) with the OS strictly superior to a certain threshold value were kept. Then, if two of the remaining variables had a covariance superior to 0.5, the one with the lowest C-index was eliminated. Afterward, the standardization model and the feature selection were applied to the test set. The C-index threshold was chosen according to the best result within the cross-validation. The same feature processing was implemented on the training set and applied to the validation set for every fold of the CV. Hence, several lists of features were obtained, depending on the train-fold utilized. Performing the feature selection on the overall dataset would introduce bias in the results, as we would use the samples within the test sets to select them.

Resampling
Due to the occurrence of imbalanced classes in the classification models (Table 4), a resampling technique was used to avoid overfitting of the majority class. This family of techniques has already been widely studied to deal with imbalanced datasets [31,32]. It consists of generating new data in the training set to have an equivalent amount of data within each class. The RandomOverSampler resampling method from the imblearn Python package was performed. This function creates new samples by reproducing already existing data in the under-represented class. We set the shrinkage parameter to 1.3 so that the new sample can differ only a little from the original one. The SMOTE resampling method has also been tried but did not lead to better results. After resampling, the same number of samples were obtained in each class of the training sets.

Statistical Analysis
OS was calculated as the number of days between the initiation of bevacizumab treatment and death. PFS was calculated as the number of days between the start of the treatment and progression or death. PFS was evaluated on all images by a neuroradiologist with 10 years of experience using FLAIR and 3D T1 sequences with gadolinium injection according to the RANO recommendations in association with a clinical evaluation (by a neuro-oncologist). We used the concordance index (C-index) as an evaluation metric for the regression. The C-index was defined as the ratio between the number of concordant pairs of predicted values divided by the total number of pairs of patients (this value does not take into account the pairs including censored data or an event taking place after the censoring). The area under the curve (AUC) of the receiving operating characteristic (ROC) curve was used to evaluate the classification performance. The classifiers' selection was based on the best mean metric score performed on the 5-fold of the cross-validation. Kaplan-Meier curves and log-rank statistic test were used to assert the building of two significantly different populations by the binary classification in terms of survival.

OS Outcomes
Several ML algorithms were tested to determine the method best suited to the crossvalidation. For the 9-month classification, a logistic regression insured the best performance with an AUC on the test sets equal to 0.78 (0.79 on the train). An SVM with a Gaussian kernel had an AUC of 0.85 (0.82) in the 12-month classification, while an RFT had an AUC of 0.76 (0.87) in the 15-month model. The SFT obtained a C-index of 0.64 on the test set (0.74 on the train). It performed a mean score of 0.63 ± 0.06 (±1 standard deviation) on the cross-validation. The logistic regression accomplished a mean AUC of 0.62 ± 0.12 for the 9-month models, while the SVM had a score of 0.71 ± 0.20 for 12 months and the RFT a score of 0.60 ± 0.10 on the cross-validations. Precision and recall for each class of each classification model are provided in Table 5. Cross validation results for every classifier and classification model are provided in Supplementary Materials (Table S1).
The Kaplan-Meier curve showing the survival probabilities (Figure 3) of the two classes predicted by the classifier on the test depicts two well-separated curves (log-rank p < 0.005) on the 9 months model. The class "beyond 9 months" obtained a survival probability at 6 months of 0.85, while the class "before 9 months" obtained a probability of 0.45. These probabilities went down to 0.71 versus 0.25 at 10 months and 0.43 versus 0.15 at 20 months. Without the resampling method, the two other models tended to classify all patients in the majority classes. Hence, well-balanced classes or a more adapted resampling algorithm for this study could improve these outcomes. Nevertheless, these two models also succeeded in the identification of two different populations with a significant difference in the survival probability distributions (log-rank p = 0.01 for both models). For instance, the 12-month classification identified all the patients who did not survive after the endpoint except two (15 patients out of 17). Moreover, the class "after 12 months" had more than a 0.5 probability of surviving beyond 1 year. Such a model could offer a great perspective into the prediction of prognostic value. The Kaplan-Meier curve showing the survival probabilities (Figure 3) of the two classes predicted by the classifier on the test depicts two well-separated curves (log-rank p < 0.005) on the 9 months model. The class "beyond 9 months" obtained a survival probability at 6 months of 0.85, while the class "before 9 months" obtained a probability of 0.45. These probabilities went down to 0.71 versus 0.25 at 10 months and 0.43 versus 0.15 at 20 months. Without the resampling method, the two other models tended to classify all patients in the majority classes. Hence, well-balanced classes or a more adapted resampling algorithm for this study could improve these outcomes. Nevertheless, these two models also succeeded in the identification of two different populations with a significant difference in the survival probability distributions (log-rank p = 0.01 for both models). For instance, the 12-month classification identified all the patients who did not survive after the endpoint except two (15 patients out of 17). Moreover, the class "after 12 months" had more than a 0.5 probability of surviving beyond 1 year. Such a model could offer a great perspective into the prediction of prognostic value.

PFS Outcomes
The regression model obtained a C-index value of 0.57 on the test set and 0.69 on the training set. The two classification models for progression at 6 and 9 months failed to stratify the cohorts in two significantly different populations according to the log-rank test (p-value > 0.05). These models obtained an AUC, respectively, equal to 0.56 (0.71) and 0.69 (0.82). Recall, precision (Table S2), and Kaplan-Meier curves ( Figure S1) are available in the Supplementary Materials.

Clinical and Radiomic Signature
On the four different models for OS (regression and classifications), on average, nine features were selected on the training sets each time (eight for 9 and 12 months, nine for 15 months, and twelve for the regression). Four of these features were constantly present in all selections: one histogram intensity feature from the FLAIR images (original first-order 10th percentile), one shape feature from the gadolinium images (original shape sphericity), and two clinical variables that were the presence of constitutional symptoms or the presence of a neurological deficit. The other selected features included delay R, gender, the original shape-surface-area-to-volume ratio (shape), and the original neighbouring grey tone difference in matrix strength (texture) from FLAIR images, which appeared in every model except one. Texture maps of original neighbouring grey tone difference in matrix strength radiomic of one patient with good prognosis and one with bad prognosis are provided in Figure 4.
According to the C-index of these variables, none of them seemed essential for the prognostic value ( Table 6). All the C-index ranged between 0.55 and 0.61. Only the shape sphericity had a value more than 0.60 for the 12-month classification. For the PFS regression, 11 features were selected including the presence of constitutional symptoms, the presence of a neurological deficit, FLAIR's area-to-volume ratio and difference matrix strength, and Gadolinium's sphericity. The eight features tended to show higher values for patients with good prognosis. The higher the features, the higher the chances for a patient to have a longer survival.

Discussion
In this paper, we evaluated the ability of several ML algorithms, based on radiomics from multimodal imaging in combination with clinical characteristics, to predict survival in recurrent GBM patients treated with bevacizumab. Nine different clinical variables, 109 radiomics from T2 FLAIR, and 109 radiomics from gadolinium-enhanced MRI images from a cohort of 194 patients were included in the analysis. Our results suggest that these models can provide valuable insights into the OS of recurrent GBM patients. The novel classification models successfully stratified patients into two significantly different populations, thus suggesting a possible role in the selection of patients who might benefit from bevacizumab in the recurrent setting.
The survival analysis was performed at three different endpoints at 9, 12, and 15 months after the beginning of the bevacizumab treatment for the OS analysis and at 6 and 9 months for the PFS. These time limits were pre-specified after being considered to be relevant according to the neuro-oncologists for their utility in clinical practice. Grossman et al. [21] already used these 6-and 9-month endpoints for PFS and the 12-month endpoint for OS in their analysis. Additionally, in comparison to other survival analyses that stratified the patients into low-and high-risk cohorts using the median survival [22] or the median predicted risk function [23], the selected endpoints in this paper should provide fewer cohort-specific and more generalizable models. Kickingereder et al. [23], who used radiomics from pre-and post-contrast T1-weighted images (referred to as T1 and cT1) and FLAIR sequences at baseline, have successfully stratified the patients into low and high risk (groups assessed by predicted risk function and cox regression analysis) for PFS. Nevertheless, their model was more accurate in terms of OS prediction rather than PFS. These differences between PFS and OS could be explained by the existence of a pseudo-response, a pseudo-progression [33][34][35], or the fact that the PFS endpoint could not be well defined.
The redundancy in the selection of features within every model could lead to the identification of a radiomic and clinical signature for the prediction of survival. Previous studies have already shown the ability of radiomics, extracted from T1-weighted and FLAIR imaging, of pre-therapy prediction of prognosis [21,23], but only Grossman et al. also identified a specific feature signature that was composed of 10 texture radiomics from T1 images and 10 texture and shape features from the FLAIR images. Moreover, it appeared in their model that T1-weighted images (and, more precisely, texture information) provided more information on survival prediction than FLAIR images where only shape features seemed to be significantly correlated with PFS and OS. The difference with the signature in this paper could be explained by their feature selection method. Indeed, they used a principal component analysis to select their best features. This method is based on the hypothesis that the best features are the ones that explain the variations within the data. The C-index method can select the features that allow the correct order of patients according to their date of death. This last method seemed more appropriate for survival analysis. In addition, they selected independently on each MRI modality (FLAIR and T1) the top 10 features, while we have carried out the selection on both modalities at once. The selection of the optimal features is a crucial step in the adaptation of radiomics that have drastically impacted our models with the support of the C-index threshold. For instance, Chang et al. [22] who used the top predictive 128 features among 2293 pre-therapy radiomics features, failed to stratify their cohorts on the test sets using a random forest. Grossman et al., who predicted PFS and OS on similar endpoints as us, used a similar number of features in their random forest. Their model stratified two populations for PFS at 3, 6, and 9 months and OS at 12 months.
The clinical variables such as the delay between the diagnosis and the first relapse, the presence of symptoms, or the presence of neurological deficits at the initiation of the therapy seem to be quite correlated with survival outcomes, as suggested by their C-index with OS. To our knowledge, no other studies ever used these features. Moreover, the correlation threshold in this analysis suggests that these clinical variables offer complementary data to the radiomics, which may lead to the elaboration of other advanced models using different sources of data. Various studies have shown that the apparent diffusion coefficient (ADC) from diffusion MR could represent a good predictor of survival in GBM patients treated with bevacizumab [36][37][38][39]. In addition, perfusion-derived biomarkers were explored [40][41][42], but it should be mentioned that these methods depend largely on the type of the MRI machine, the sequence processing software, and the calculation algorithms and, therefore, are more difficult to apply to different centres in routine clinical practice. Non-MRI-based biomarkers from transcriptomic [43][44][45] or the neutrophil-to-lymphocyte ratio [46] have also been studied for the prognostic of patients treated with bevacizumab. Although, they have been able to stratify two significantly different populations in term of survival outcomes, to our knowledge, none of them used as many patients as in our study and succeeded to accurately predict the OS or PFS. Using different sources of data could improve the accuracy and robustness of the predictive models in comparison to radiomics-only models.
Nevertheless, onemajor limitation of radiomics are robustness and reproducibility. During the extraction from the manual or semi-automatic tumour segmentation, it is impossible to reproduce the same contours twice. In particular, the robustness of the shape feature sphericity must be verified over different segmentations to be able to assert the generalization of the model. On the other hand, our most significant features according to their C-index with OS on training sets are logical and simple parameters that can be easily adapted in comparison to the other signatures described in the literature on texture and automotive prognosis. Another important finding was the signal intensity of the tumour in the FLAIR images that reflected the cellularity contour and volume, known to have a poor prognosis in large lesions, while the delay between the diagnosis and the start of treatment with bevacizumab treatment (Delay R), the presence of symptoms, or the presence of neurological deficit largely reflected the aggressiveness of the tumour.
Although previous data have demonstrated that OS is not significantly improved with bevacizumab, further studies are needed to confirm the reproducibility of our model with the possibility of stratifying a cohort of GBM not previously treated with bevacizumab into two distinct populations. That way, a sub-cohort of patients that would benefit from bevacizumab can be accurately identified. Indeed, a connection between patients with a good prognosis and the response to bevacizumab treatment cannot be confirmed, since they could have a good prognosis regardless of the therapy. In this paper, the crossvalidation failed to reach high accuracy, and huge differences were observed, which could be related to the relatively small sample size. Nevertheless, these results could lead the way to further analysis that could aid in the development of a prediction model based on radiomics in daily clinical practice. Predictive biomarkers have become a major component of cancer management to guide the therapeutic strategy where treatment selection for each patient has become personalized [47]. After further re-evaluation of our models, with the adaptation of ML models based on radiomics and clinical data, a decision tool could be developed to select those GBM patients who are most likely to benefit from bevacizumab. On the other hand, this tool may also identify those patients who are less likely to respond to bevacizumab and might require an alternative or a more aggressive approach. These advances in the radiomics field in GBM and the implementation of predictive models in daily practice will eventually encourage researchers to invest their efforts in the adaptation of similar models in other malignant tumours.
In conclusion, radiomics in combination with clinical data have successfully predicted the survival of recurrent GBM patients treated with bevacizumab. Three different binary models for survival prediction at 9, 12, and 15 months with high performances were built, which could lead to the creation of a convenient tool for decision making and the orientation to a more patient-specific treatment in the era of personalized medicine. Larger trials are needed for better identification and adaptation of these models in GBM patients.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/diagnostics11071263/s1, Figure S1: Kaplan-Meier curves on the test sets of the binary classification models on PFS, Table S1: Cross-validation results (±1 std deviation) for the different machine-learning algorithms, Table S2: Results of the binary classification models on PFS. Informed Consent Statement: For the involved patients, an informed letter was sent to each patient informing them about the use of the data for the scientific research, according to the laws of France and Europe. None of these patients responded to our letter to deny the use of the data.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical concerns.