Prognostic and Predictive Value of Integrated Qualitative and Quantitative Magnetic Resonance Imaging Analysis in Glioblastoma

Simple Summary Glioblastoma (GBM) is the most malignant primary brain tumor, for which improving patient outcome is limited by a substantial amount of tumor heterogeneity. Magnetic resonance imaging (MRI) in combination with machine learning offers the possibility to collect qualitative and quantitative imaging features which can be used to predict patient prognosis and relevant tumor markers which can aid in selecting the right treatment. This study showed that combining these MRI features with clinical features has the highest prognostic value for GBM patients; this model performed similarly in an independent GBM cohort, showing its reproducibility. The prediction of tumor markers showed promising results in the training set but not could be validated in the independent dataset. This study shows the potential of using MRI to predict prognosis and tumor markers, but further optimization and prospective studies are warranted. Abstract Glioblastoma (GBM) is the most malignant primary brain tumor for which no curative treatment options exist. Non-invasive qualitative (Visually Accessible Rembrandt Images (VASARI)) and quantitative (radiomics) imaging features to predict prognosis and clinically relevant markers for GBM patients are needed to guide clinicians. A retrospective analysis of GBM patients in two neuro-oncology centers was conducted. The multimodal Cox-regression model to predict overall survival (OS) was developed using clinical features with VASARI and radiomics features in isocitrate dehydrogenase (IDH)-wild type GBM. Predictive models for IDH-mutation, 06-methylguanine-DNA-methyltransferase (MGMT)-methylation and epidermal growth factor receptor (EGFR) amplification using imaging features were developed using machine learning. The performance of the prognostic model improved upon addition of clinical, VASARI and radiomics features, for which the combined model performed best. This could be reproduced after external validation (C-index 0.711 95% CI 0.64–0.78) and used to stratify Kaplan–Meijer curves in two survival groups (p-value < 0.001). The predictive models performed significantly in the external validation for EGFR amplification (area-under-the-curve (AUC) 0.707, 95% CI 0.582–8.25) and MGMT-methylation (AUC 0.667, 95% CI 0.522–0.82) but not for IDH-mutation (AUC 0.695, 95% CI 0.436–0.927). The integrated clinical and imaging prognostic model was shown to be robust and of potential clinical relevance. The prediction of molecular markers showed promising results in the training set but could not be validated after external validation in a clinically relevant manner. Overall, these results show the potential of combining clinical features with imaging features for prognostic and predictive models in GBM, but further optimization and larger prospective studies are warranted.


Introduction
Glioblastoma (GBM) is the most malignant type of primary brain cancer with an incidence of 2-3 cases per 100,000 [1]. Currently, a median survival of fifteen months is achieved with multimodal treatment [2] with a five-year overall relative survival of only 6.8% [3]. However, despite this intensive treatment by neurosurgical intervention, concurrent chemoradiation and adjuvant temozolomide (TMZ) [2], GBM is still considered incurable and recurrence is inevitable. Although major improvements in the treatment of cancer have been made, the current standard-of-care for GBM has largely remained unchanged over the past decade.
GBM is diagnosed using gadolinium contrast-enhanced magnetic resonance imaging (MRI) followed by histopathological examination of tumor tissue specimen obtained after either biopsy or resection. Further characterization of GBM has led to the introduction of the 2016 updated world health organization (WHO) classification of central nervous system tumors [4]. This classification integrates histopathological and morphological examination of the tumor with molecular markers [5]. Thus far, the only predictive marker that has been established into clinical practice is the 06-methylguanine-DNA-methyltransferase (MGMT) methylation status, which is predictive of an improved response to alkylating chemotherapy such as TMZ [6]. However, a substantial "grey zone" between MGMT methylated and unmethylated patients still exists for which the efficacy of TMZ is still to be determined [7]. Additionally, the presence of a mutation in the isocitrate dehydrogenase (IDH) genes-which has been identified as a positive prognostic marker-is linked to dedifferentiated low-grade gliomas which have a distinctly different clinical behavior compared to IDH wild-type (WT) GBM [8]. Epidermal growth factor receptor (EGFR) amplification is one of the most common genetic alterations (±50%) in GBM [9]. This oncogenic molecular alteration poses a potential therapeutic target but also identifies a biological different subtype of GBM which responds differently to established treatments [10,11]. However, the role of EGFR amplification as a prognostic factor still remains controversial [11][12][13] and studies using targeted agents for EGFR have so far been unsuccessful but are still ongoing [3]. Additionally, multiple other molecular targets (genetic mutations, amplifications and protein fusion products) have been identified which have either failed in previous clinical trials to improve patient survival or are currently still under investigation [3]. All in all, the integration of molecular markers has led to an improvement in prediction of prognosis and treatment response but a substantial variety remains and no improvement in treatment outcome has been made, which is thought to be due to extensive inter-and intratumor heterogeneity [14].
Intratumor heterogeneity complicates treatment efficacy as different regions within the same tumor may contain cells having distinct genetic compositions, transcriptional subtypes and/or proliferation kinetics [3]. Furthermore, temporal heterogeneity has been observed in which changes in the expression of molecular targets occur over time which limits efficacy of targeted approaches [15,16]. In clinical practice and currently used diagnostic techniques and available prognostic models intratumor heterogeneity is not accounted for, since single-cell sequencing is not routinely used. Additionally, it is not clear if molecular GBM heterogeneity can be captured by qualitative and/or quantitative analysis of imaging features.
Imaging techniques have the advantage over standard pathological examination to also analyze the invasive, non-resected, components of GBM and thus capture and analyze the tumor as a whole. Especially temporal heterogeneity of expression of molecular targets cannot be evaluated using routine clinical diagnostics, as re-resection of tumors is not always feasible, making non-invasive imaging an interesting alternative. In order to make a standardized analysis of qualitative MR imaging features, the Visually Accessible Rembrandt Images (VASARI) features were previously developed [17]. VASARI features include tumor size, location and morphology and have previously been shown to be reproducible and of prognostic value [17]. Quantitative imaging analysis using radiomics is an approach to extract imaging features by high-throughput data mining on textures, shapes and intensities [18]. Radiomics has shown prognostic and predictive potential in multiple solid tumors [19,20] including GBM [21]. Furthermore, radiomics features have the potential to analyze the entire tumor and to identify intratumor molecular heterogeneity and underlying biological processes [22,23]. In glioma, radiomics models have been developed to predict tumor grade [24], overall survival (OS) [25] and in GBM trying to predict molecular subtypes [26]. Although IDH-mutation status is established as the best prognostic marker in GBM [27], defining different IDH wild-type GBM prognostic subgroups is still warranted due to their heterogeneous prognosis and clinical behavior.
The main challenge in developing prognostic and predictive imaging-based models is their generalizability towards all GBM patients treated at different centers. Differences in diagnostic techniques (i.e., scanner vendors and protocols) and treatment and population variety can greatly influence model performances [28]. Due to these challenges, this study utilizes two multi-center datasets to train and validate the developed models.
The objective of this study was to investigate the additive value of qualitative and quantitative imaging heterogeneity analysis to established prognostic clinical features. These data were used to develop a prognostic model for OS in a real-world multi-center GBM population for IDH1/2 wild-type (IDH-WT) GBM. Furthermore, the value of imaging features as predictor for clinically relevant molecular markers for GBM was explored.

Patient and Tumor Characteristics
In total, 142 patients were included in the training cohort and 46 patients in the validation cohort. Median OS was 12.0 months (range, 0-142 months) in the training cohort and 7.3 months (range, 0-30 months) in the validation cohort (log rank p-value, 0.001). Patients in the validation cohort more frequently received no adjuvant treatment, but these data were not available for all patients. Patient demographics, received treatment schedules and tumor characteristics are listed in Table 1. Molecular data for a subset of patients are reported as missing due to insufficient formalin-fixed paraffin-embedded (FFPE) material or poor quality or quantity of extracted DNA. VASARI features were available for all patients in both cohorts. For radiomics analysis, T1+Gadolinium and T2-weighted images were available for 105 patients in the training cohort and 44 patients in the validation cohort. MRI characteristics such as types and manufacturers of scanners and imaging protocols are reported in Figures S1 and S2. The numbers of patients that were eligible in the two cohorts for the different models are reported in Table S1.  (Table S2). The multivariable Cox-regression model consisted of five VASARI features (Model 1). For radiomics, five radiomics features were selected to predict OS (Model 2) ( Table 2). In this study, none of the radiomics features showed evidence of a significant correlation with tumor volume ( Figure S3). Additionally, no significant correlation were found between VASARI, radiomics and clinical features ( Figure S4). An elaborate explanation of these radiomics features can be found on the Pyradiomics website [29] and in a previous study [30].
Clinical features that were selected in the clinical model were chosen based on previous studies [31] and clinical expertise (Model 3). Next, VASARI features, radiomics features and clinical features multivariable Cox-regression models were combined in different combinations. Model 4 was developed by combining VASARI prognostic index (PI) and Radiomics PI, Model 5 by combining VASARI PI and Clinical PI and Model 6 by combining Radiomics PI and Clinical PI (Model 4-6). Finally, clinical features were combined with the integrated VASARI and radiomics prognostic score to develop an integrated clinical and imaging prognostic model (Model 7) ( Table 2). The calibration slope of the PI of Model 7 on the validation set was 0.79 (log-rank test p-value 0.27), indicating there is no certainty for the slope in the validation set being different from 1. The joint test of all predictors with the offsetting of the predicted PI results in the p-value of 0.23, indicating that there is no evidence of a lack of fit on the validation. Table 2. Multivariate Cox-regression model using Visually Accessible Rembrandt Images (VASARI), radiomics and/or clinical features for overall survival (OS) prediction in isocitrate dehydrogenase wild type (IDH-WT) glioblastoma (GBM) patients in different prognostic models based on the training cohort (n = numbers of patients used for model development).

Prognostic Model Variables
Hazard Ratio (95% CI) p-Value To assess the reproducibility performance of the prognostic models, all models were tested on the external validation set (n = 38) and the discriminative prognostic value in both cohorts was analyzed using Harrell's C-index ( Figure 1A). Model 1 achieved a C-index of 0.61 (95% CI 0.55-0.68) when tested on the whole training cohort (n = 129). In order to make a comparison between the different models, the C-index for the VASARI-only model was also calculated using only the patients available in all other models (n = 95). In order to visualize the prognostic potential of the integrated imaging and clinical model (Model 7), the data-set was split in a low-and high-risk group at a set cut-off value (75th percentile) of the prognostic index in the training cohort. This same cut-off value was applied to the external validation cohort. Two survival groups could be identified (p-value < 0.0001) in both the training and validation cohort ( Figure 1B,C). order to make a comparison between the different models, the C-index for the VASARIonly model was also calculated using only the patients available in all other models (n = 95). In order to visualize the prognostic potential of the integrated imaging and clinical model (Model 7), the data-set was split in a low-and high-risk group at a set cut-off value (75th percentile) of the prognostic index in the training cohort. This same cut-off value was applied to the external validation cohort. Two survival groups could be identified (pvalue < 0.0001) in both the training and validation cohort ( Figure 1B,C).

Predictive Value of Integrative Imaging Analysis
In order to develop the predictive models for molecular markers (EGFR amplification, MGMT-methylation and IDH1 mutation), the Maastricht University Medical Center+ (MUMC+) cohort was split into a training (70%) and test (30%) cohort. For the prediction of EGFR amplification, in total eleven VASARI features and four radiomics features were selected in the predictive models using the XGBoost machine learning algorithm (Table  3). Both VASARI and radiomics models alone were able to significantly predict EGFR amplification in the test dataset ( Figure 2A). In the external validation set, both VASARI and radiomics features reached similar results to each other; however, an increased predictive value was observed when both models were combined (area-under-the-curve (AUC) 0.707 (95% CI 0.582-0.825); Figure 2B,C).

Predictive Value of Integrative Imaging Analysis
In order to develop the predictive models for molecular markers (EGFR amplification, MGMT-methylation and IDH1 mutation), the Maastricht University Medical Center+ (MUMC+) cohort was split into a training (70%) and test (30%) cohort. For the prediction of EGFR amplification, in total eleven VASARI features and four radiomics features were selected in the predictive models using the XGBoost machine learning algorithm (Table 3). Both VASARI and radiomics models alone were able to significantly predict EGFR amplification in the test dataset ( Figure 2A). In the external validation set, both VASARI and radiomics features reached similar results to each other; however, an increased predictive value was observed when both models were combined (area-under-the-curve (AUC) 0.707 (95% CI 0.582-0.825); Figure 2B,C).
The predictive models developed for MGMT-methylation status consisted of seven VASARI features (logistic regression analysis) and three radiomics features (XGBoost algorithm) ( Table 3). VASARI features alone reached similar predictive values in the test and validation dataset with an AUC of 0.668 (95% CI 0.513-0.850) and 0.622 (95% CI 0.475-0.761) respectively. Radiomics features alone could not predict MGMT-methylation in both datasets. An increased predictive value was observed when VASARI features and radiomics features were combined in one predictive model, with an AUC of 0.843 (95% CI 0.696-0.948) in the test dataset but did not perform as well in the external validation dataset (AUC 0.667 (95% CI 0.522-0.820); Figure 2B,C).
For the prediction of the IDH1 mutation ten VASARI features were included in the multivariate VASARI model and nine radiomics features in the radiomics prediction model developed using the XGBoost machine learning algorithm (Table 3). In the test dataset, only radiomics features reached statistical significance with an ROC AUC of 0.816 (95% CI 0.650-0.950), which improved upon combining with VASARI features (Figure 2A). In the external validation set, neither VASARI nor radiomics features or the combination were able to predict the IDH1 status ( Figure 2B,C). ROC curves for all predictive models in the training and validation cohort are reported in Figures S5 and S6, respectively.
Next, histogram heterogeneity was assessed to identify whether radiomics features demonstrate significant differences between the outcome groups in a univariate manner. Only for IDH1 mutation was a significant difference found for two features that could explain the heterogeneity in the outcome. The histograms of heterogeneity for each predictive model and significance values for IDH1-mutation are reported in Figures S7 and S8, respectively. Table 3. Selected VASARI and radiomics features in predictive models for epidermal growth factor receptor (EGFR) amplification, methylguanine methyltransferase(MGMT)-methylation and isocitrate dehydrogenase 1(IDH1) mutation in GBM patients in the training cohort.

Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) Statement and Radiomics Quality Score
The TRIPOD statement adherences were calculated at 77% for this study. The radiomics quality score (RQS) score calculated for this study was 47%. An overview of point allocation towards the TRIPOD statement and RQS score can be found in Tables S3 and S4, respectively.

Discussion
Increasing curation rates by optimizing treatment strategies is being hampered by the highly invasive nature and GBM specific inter-and intratumoral molecular heterogeneity. MR imaging is currently the preferred diagnostic imaging technique for GBM. However, integrated standardized qualitative and quantitative analysis of different MR sequences has not yet been introduced into prognostic and predictive GBM models. This study retrospectively analyzed two multi-center GBM patient cohorts to develop integrated clinical and imaging prognostic models and predictive models for clinically relevant molecular markers.
Combining clinical features with quantitative and qualitative imaging features resulted in the most optimal prognostic model which could be reproduced in the external validation cohort (C-index 0.72 in training cohort and 0.73 in validation cohort). Despite promising results for predicting EGFR amplification and IDH1-mutation in the test cohort, none of the predictive models for molecular markers were able to predict these markers in a clinically relevant manner in the external validation set.
The prognostic model described in this study is developed for IDH-WT GBM patients as this patient group makes up the majority of GBM and exhibits large variation in prognosis and treatment response. This variance is also reflected in statistically significant differences in baseline characteristics for OS and MGMT-methylation. However, these differences are also known to exist between centers, in which different treatment decisions

Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) Statement and Radiomics Quality Score
The TRIPOD statement adherences were calculated at 77% for this study. The radiomics quality score (RQS) score calculated for this study was 47%. An overview of point allocation towards the TRIPOD statement and RQS score can be found in Tables S3 and S4, respectively.

Discussion
Increasing curation rates by optimizing treatment strategies is being hampered by the highly invasive nature and GBM specific inter-and intratumoral molecular heterogeneity. MR imaging is currently the preferred diagnostic imaging technique for GBM. However, integrated standardized qualitative and quantitative analysis of different MR sequences has not yet been introduced into prognostic and predictive GBM models. This study retrospectively analyzed two multi-center GBM patient cohorts to develop integrated clinical and imaging prognostic models and predictive models for clinically relevant molecular markers.
Combining clinical features with quantitative and qualitative imaging features resulted in the most optimal prognostic model which could be reproduced in the external validation cohort (C-index 0.72 in training cohort and 0.73 in validation cohort). Despite promising results for predicting EGFR amplification and IDH1-mutation in the test cohort, none of the predictive models for molecular markers were able to predict these markers in a clinically relevant manner in the external validation set.
The prognostic model described in this study is developed for IDH-WT GBM patients as this patient group makes up the majority of GBM and exhibits large variation in prognosis and treatment response. This variance is also reflected in statistically significant differences in baseline characteristics for OS and MGMT-methylation. However, these differences are also known to exist between centers, in which different treatment decisions and strategies are being implemented. The aim of this study was to investigate the performance of prognostic models in such heterogeneous GBM cohorts. To predict OS, five VASARI features were identified to be of most prognostic relevance. Three of these features are well known prognostic factors and were also previously identified to be negatively associated with OS (involvement of eloquent cortex, multifocality and subependymal extension) and can be attributed to a more invasive growth of the tumor [32,33]. The other selected features, proportion of edema and T1-FLAIR-ratio showed opposite prognostic value in this study when compared to previous studies [33][34][35][36][37]. However, other studies reported no prognostic value for these features and therefore this still remains controversial [32,38].
Radiomics features that were identified to have prognostic value were mainly derived from T2-weighted imaging. This is in line with the hypothesis that the T2-weighted signal corresponds with intratumor heterogeneity and infiltrative tumor growth [39] and this area is accountable for the majority of local recurrences [40]. Therefore, radiomics features from this area are expected to be of importance for survival prediction as was also shown in previous studies [41,42]. The radiomics signature for OS consists of five features, from which two features are the first order Mean (T2-weighted) and Median (T1-weighted) describing the mean and median intensity values after the LLH and HHH wavelet decomposition of the original MR images. The remaining three features quantify gray level zones in an T2-weighted image, more precisely measuring the proportion in the image of the joint distribution of larger size zones with lower gray-level values after image transformation (Laplacian of Gaussian) which is useful for edge detection. These gray level zone features can potentially be associated with the measure of intratumor heterogeneity [43].
In this study, VASARI features alone or radiomics features alone were not able to predict OS in the external validation dataset in a clinically relevant manner. Interestingly, the performance of the prognostic model improved upon combining VASARI, radiomics and clinical features (C-index 0.723 in training cohort and 0.730 in validation cohort) and became clinically relevant. The robustness of this combined model also improved as the model performed similarly in the training-and validation cohort and the uncertainty decreased as represented by a smaller confidence interval of the C-index. Model 5 and 6 report similar performances when compared to the model combining all features. However, the final combined model seems to remain mostly stable between both cohorts, though the actual additive value should be further validated in larger patient cohorts. The combined model was also able to accurately split the two cohorts in a high-and low-risk group (p-value < 0.001) ( Figure 1B,C). Previous studies also observed that combining clinical features with imaging features improves the prognostic value of the model [42,[44][45][46][47]. The model developed in this study performed similar or better compared to previous findings, even after external validation in a heterogeneous patient cohort. This highlights the clinical relevant potential of combining these features into a multimodal prognostic model which can potentially be applied in clinical practice.
As a proof-of-concept study, this study investigated the capability of VASARI and radiomics features to link phenotype to genotype and predict clinically relevant molecular markers, IDH1-mutation, MGMT-methylation and EGFR amplification, by machine learning approaches. Overall, the predictive models had promising performance on the test set, especially when VASARI and radiomics features were combined (Figure 2A). Unfortunately, none of the developed models were able to predict in the external validation set in a clinically relevant manner with a wide spread in confidence intervals of the AUC values ( Figure 2B,C). In order for a model predicting molecular markers to be clinically relevant, much higher AUC values are desired. Since the presence of the molecular markers has biological consequences on tumor growth and development, specific imaging techniques that reflect biological processes have shown more promising results in the prediction of these markers and should therefore be used for further research. Perfusion-weighted and/or diffusion-weighted MRI features have been used to predict EGFR amplification [48][49][50] and MGMT-methylation [51], whereas MR spectroscopy [52] and amino acid tracer PET imaging (FET-PET) [53] can predict IDH1 mutation status due to its effects on tumor metabolism.
In addition, by analyzing the heterogeneity histogram for EGFR amplification based on the validation cohort, we can notice that none of the radiomics features has demonstrated significant difference between the outcome groups in the univariate manner. Heterogeneity histogram for MGMT-methylation also did not demonstrate the significant difference between the outcome groups. For IDH1 mutation, however, we can point out a significant difference (p < 0.05) for T2_original_firstorder_10Percentile, T1_wavelet_HLL_glcm_ DifferenceAverage features, which indicates the ability of these features to reflect the heterogeneity in the outcome ( Figure S8). These findings also highlight the value of multivariate predictive analysis.
The overall RQS of 47% achieved in this study is higher than generally reported in neuro-oncology radiomics studies [54].
The main strength of this study includes the usage of two independent multicenter datasets. Though the performance of previous prognostic models based on VASARI or radiomics features is generally better, most of these studies only use internal validation methods and lack validation in an independent external dataset [34,55]. The same applies to the performance of predictive models for molecular markers. However, the fact that the promising results for the predictive models in this study in the testing cohort could not be replicated in the external validation cohort stresses the importance of external validation. Additionally, most studies use a more homogeneous patient cohort, for example, with regards to treatment characteristics, whereas this present study comprises two heterogeneous cohorts which more reflects daily clinical practice. For example, corticosteroid usage is known to decrease the amount of edema, therefore altering the T2-weighted signal, which can influence both VASARI and radiomics features. Previous studies either do not mention corticosteroid usage or exclude patients using corticosteroids [36,37] even though a significant amount of GBM patients are known to use corticosteroids. Furthermore, multiple studies only use single-institute data in which real-life heterogeneity between MRI acquisition is not represented [56] which is important for the generalizability of radiomics models.
Several limitations should be taken into account when considering the results of this study. The main limitation of this study is the number of patients that were included. Though for the OS models the number of patients is in accordance with the majority of previous studies, especially the limited available molecular data in the external validation set limits the validation capacity of the predictive models. Especially IDH1/2 mutations rarely occur in both cohorts, which is to be expected in GBM, leading to wide confidence intervals and complications in the validation of the model. Future studies using a larger IDH-mutated cohort are needed to accurately test the models developed in this study. Next, the fact that this study is a retrospective study poses a potential selection bias. Additionally, the Karnofsky Performance Score (KPS) is an established prognostic feature which could not be included in this study due to lack of reporting of the KPS in patient files during the time period used for this study. Furthermore, it could be stated that a limitation of this study was the lack of advanced MRI sequences such as diffusion-and perfusion-weighted imaging and PET-MRI. However, this study specifically chose to focus on the relevance of conventional MRI images as these are widely available in clinical centers. Furthermore, MRI radiomics features are known to be dependent on differences in MRI scanners and scanning protocols. The images used in this study were collected from more than ten different hospitals over a ten-year time-period resulting in large differences in technical MRI characteristics. Again, even though this limits the performance of radiomics, an ideal prognostic and predictive model should not be dependent on homogeneous data. These differences in MRI acquisition methods are present in the real-life multicenter setting and should be accounted for in order to provide a relevant, clinical applicable model.
In order to further improve the prognostic and predictive potential of non-invasive imaging models, several steps need to be taken. First of all, larger (big data) datasets and preferably prospective studies are warranted to develop more accurate and generalizable models. This could pose a challenge, especially in less common types of cancer such as GBM.
Next, the first studies on radiomics have been conducted on computed tomography (CT) imaging, which can be quantified using standardized Hounsfield units. For MRI radiomics, such a unit does not exist which poses problems due to inter-and intra-scanner variability. Multiple pre-processing methods have been developed, though not all radiomics features were shown to be robust between different pre-processing approaches [57][58][59]. This calls for a generalized pre-processing pipeline and focus on features that are shown to be robust. Robust features and normalization methods can be achieved by applying phantom studies to account for differences between MRI acquisition protocols [60]. Tumor delineation poses another important aspect of radiomics feature extraction. Manual delineation is still generally seen as the golden standard, though a substantial inter-observer variability exists, despite international guidelines on tumor delineation [61] and it is a time consuming process. It has been shown that this inter-observer variation influences the radiomics analysis in multiple tumors [62]. Automatic segmentation methods using a deep learning neural network approach are widely developed and can be beneficial in future radiomics studies and its clinical applicability by decreasing workload on clinicians and inter-observer variability [63,64]. This is expected to lead to more robust radiomics features due to standardization of the delineation method.
Parallel to the establishment of MR signatures that are able to predict clinically significant expression of specific biomarkers, there is a need for imaging signatures that capture the level of intratumoral heterogeneity. However, it needs to be emphasized that is not yet clarified how to quantify GBM MR imaging heterogeneity and moreover how to non-invasively analyze the level of intratumoral heterogeneous expression of predictive markers, since the golden standard, single cell RNA sequencing, is missing in standard of care. By extracting radiomics features from the whole tumor and the surrounding area of edema we identified several features that are associated with intratumor heterogeneity. However, different steps could be taken to include more aspects of tumor heterogeneity. Improved performance of radiomics has been reported when features are extracted from distinct tumor areas (active tumor, necrosis and edema) separately [65,66], though this is a more labor-intensive approach which might limit its clinical applicability. In this aspect, automatic segmentation algorithms have shown to be useful for prognostic radiomics modelling [47]. Additionally, more biologically relevant MRI sequences such as diffusionor perfusion-weighted MRI have been shown to outperform radiomics models based on conventional MRI [25]. These approaches should be taken into account in future studies as they will be able to encompass more features concerning intratumor heterogeneity [67] and have shown improved performance with regards to predicting prognosis and molecular markers. Ultimately, studies correlation pathological and genetic examination of multiregional biopsies towards imaging features are needed to study the value of imaging features for tumor heterogeneity.

Patient Population
All patients treated by the neuro-oncology team of the Maastricht University Medical Centre (Maastricht UMC+, Maastricht, the Netherlands) between January 2004 and August 2014 for a glioblastoma (WHO grade IV) were considered for inclusion in the retrospective training cohort. Patients were excluded if no diagnostic, pre-operative MRI-images were available (minimum T1+Gadolinium and T2-weighed imaging), if survival data were unknown or no histological diagnosis was available. All patient records were reviewed considering patient and tumor characteristics, received treatments and survival data. The external validation cohort was constructed using the same criteria on an independent dataset of patients treated in Radboud University Medical Center (Radboudumc, Nijmegen, The Netherlands) in the same time period. Both Maastricht UMC+ and Radboudumc are academic reference centers for GBM patients in the Netherlands, implying MRI-images were also obtained in hospitals that refer their patients to these academic centers. Numbers of patients used for each analysis are reported in Table S1. The requirement for informed consent for this retrospective study was waived by the medical ethics committee of the MUMC+ (METC 16-4-022).

Image Acquisition and Qualitative Imaging Feature Assessment
Pre-operative MRI images were collected, pseudonymized and pooled in a database combining MRI images from different types and manufacturers of scanners using different imaging protocols to reflect the real-life inter-center heterogeneity (Figures S1 and S2). A quantitative and qualitative imaging analysis pipeline was set-up ( Figure 3). All diagnostic MRI-scans were analyzed by dedicated neuro-radiologists (SP, AJ, AP), blinded for outcome and scored using the VASARI Imaging Features. A previous study conducted by the VASARI research project group showed a strong overall inter-observer agreement among six readers for the VASARI features [29]. When needed, multi-categorical and continuous VASARI features were recoded into different groups based on their clinical relevance prior to the start of analysis (Table S5). Netherlands) in the same time period. Both Maastricht UMC+ and Radboudumc are academic reference centers for GBM patients in the Netherlands, implying MRI-images were also obtained in hospitals that refer their patients to these academic centers. Numbers of patients used for each analysis are reported in Table S1. The requirement for informed consent for this retrospective study was waived by the medical ethics committee of the MUMC+ (METC 16-4-022).

Image Acquisition and Qualitative Imaging Feature Assessment
Pre-operative MRI images were collected, pseudonymized and pooled in a database combining MRI images from different types and manufacturers of scanners using different imaging protocols to reflect the real-life inter-center heterogeneity (Figures S1 and S2). A quantitative and qualitative imaging analysis pipeline was set-up ( Figure 3). All diagnostic MRI-scans were analyzed by dedicated neuro-radiologists (SP, AJ, AP), blinded for outcome and scored using the VASARI Imaging Features. A previous study conducted by the VASARI research project group showed a strong overall inter-observer agreement among six readers for the VASARI features [29]. When needed, multi-categorical and continuous VASARI features were recoded into different groups based on their clinical relevance prior to the start of analysis (Table S5).

Tumor Delineation, Image Pre-Processing and Extraction of Radiomics Features
Using Osirix Lite (Pixmeo SARL, Bernex, Switzerland) and MiM software (version 7.0.4, MIM Software Inc., Cleveland, OH, USA), regions of interests (enhancing tumor on T1+Gadolinium images and combined tumor/edema portion on T2-weighted images) were manually delineated on all diagnostic MRI-images of the training and validation cohort, supervised by two experienced neuro-radiation oncologists (DE, IC).
At first, spatial resolution of the images was normalized with respect to the image sequence (final pixels are: 0.449 mm 2 and slice thickness of: 5.5 mm). The mode of the pixel spacing and slice thickness distributions from the Maastricht UMC+ cohort were used as reference values for the resampling procedure to minimize the number of resampled images. A bicubic interpolation over 4 × 4 pixel neighborhood was used for both upsampling and downsampling. In order to correct the low frequency intensity non-uniformity, which is intrinsic for MRI images, the N4 bias field correction algorithm was used [68]. Furthermore, the histogram equalization method implemented in the scikit-image 0.15.0 package

Tumor Delineation, Image Pre-Processing and Extraction of Radiomics Features
Using Osirix Lite (Pixmeo SARL, Bernex, Switzerland) and MiM software (version 7.0.4, MIM Software Inc., Cleveland, OH, USA), regions of interests (enhancing tumor on T1+Gadolinium images and combined tumor/edema portion on T2-weighted images) were manually delineated on all diagnostic MRI-images of the training and validation cohort, supervised by two experienced neuro-radiation oncologists (DE, IC).
At first, spatial resolution of the images was normalized with respect to the image sequence (final pixels are: 0.449 mm 2 and slice thickness of: 5.5 mm). The mode of the pixel spacing and slice thickness distributions from the Maastricht UMC+ cohort were used as reference values for the resampling procedure to minimize the number of resampled images. A bicubic interpolation over 4 × 4 pixel neighborhood was used for both upsampling and downsampling. In order to correct the low frequency intensity non-uniformity, which is intrinsic for MRI images, the N4 bias field correction algorithm was used [68]. Furthermore, the histogram equalization method implemented in the scikit-image 0.15.0 package [69] was used to enhance the contrast of MRI images [70]. As the last step of the pre-processing routine, image intensities were normalized using Z-score standardization method [71]. A pre-processing routine was applied to both cohorts, where parameters (mu, sigma) for the Z-score transformation were evaluated on the training cohort and transferred to the

Molecular Markers
Archival formalin-fixed paraffin-embedded (FFPE) tissue samples were analyzed for tumor percentage by an experienced neuro-pathologist (JB). DNA was extracted from FFPE tissue using the Cobas method (Roche, Bazel, Switzerland) and DNA concentration was quantified using Qubit Fluorometer (Life Technologies, Waltham, MA, USA). Nextgeneration sequencing (NGS) was performed using the Ion AmpliSeq Cancer Hotspot Panel v2 (Life Technologies) as previously described [73]. For the purpose of this study, the data were analyzed for the presence of an IDH1 (R132H) mutation (minimum coverage 500×) which was manually checked using Integrative Genomics Viewer (IGV). EGFR amplification was assessed using SNPitty, an open-source web application for interactive B-allele frequency and copy number visualization of NGS data, by comparing the number of reads in the EGFR locus to the surrounding regions [74]. MGMT methylation status was assessed using methylation-specific multiplex ligation-dependent probe amplification (MS-MLPA) as previously described [75]. In case NGS data was not available for a sample, MLPA was also used to assess IDH1 mutation status and EGFR amplification.

Statistical Analysis
Statistical analysis for differences between baseline characteristics was performed using double-sided T-test for "age at diagnosis". Fisher's exact test was used for all other binary variables (sex, type of surgery, adjuvant treatment and molecular markers).
Overall survival (OS) was defined as the time between the initial surgical intervention after diagnosis and the date of death (confirmed by the Municipal Personal Records Database). Patients that survived were censored at the moment of the last follow-up measurement. To develop a prognostic model, analysis was focused on the IDH-WT GBM samples.
OS analysis was performed using R (version 4.0.2., R Studio, Boston, MA, USA), employing the packages stats, survival, survminer, rms, pec and survcomp. VASARI features were tested in univariate Cox-regression analysis to determine the hazard ratio (HR) of each feature individually on the training cohort. Each feature with a p-value of ≤ 0.2 was considered for inclusion in the multivariable analysis. Resulting VASARI features were used for multivariable Cox-regression analysis with fast backward elimination (removal alpha < 0.2) on the training set. Radiomics features from T1-and T2-weighted images were combined and normalized with the Z-score transformation, where coefficients evaluated on the train set were transferred to the validation set. Highly correlated features exceeding the Spearman's rank correlation of rs = 0.85 were eliminated. Resulting radiomics features, were used for multivariable Cox-regression analysis with fast backward elimination for the training set [76] (Model 1-3) All clinical features were entered into the Cox-regression model to develop the Clinical model on the training set.. A prognostic Index (PI) for all models developed on the training set was calculated for training and validation datasets, where the PI was defined as ∑ i β i x i for each individual model. For the combined models, the PI of the individual models was used as a feature along with the PI for the individual model it was combined with in Cox-regression analysis [77]. Similarly, for a combined clinical/VASARI/radiomics model (Model 7), VASARI PI was used as a feature along the radiomics PI and clinical PI. Next, the models were validated using multiple-step approach [78]. Calibration slope was assessed using the Log-rank (LR) test. Model's misspecification was evaluated by performing the Cox regression on the individual features of the signature in the validation dataset with offsetting the validation PI [78].
Overall model performance for discriminating survival groups was evaluated with Harrell's C-index. To display the potential discrimination between survival groups Kaplan-Meier (KM) curves were used with the threshold value based on 75th percentile of training PI's in order to identify a high-risk group using our model. Significance of the split was estimated using the LR test. In addition, predicted survival curves for each risk group were plotted. The PI is used to estimate the survival curve, which is then averaged over the entire risk-group. These curves are plotted alongside the observed KM-curves. The correlation between radiomics features and tumor volume was assessed using Spearman's rank correlation. This was investigated since previous studies have shown some radiomics features to be surrogate markers for tumor volume and not independent prognostic features [79]. Correlation between VASARI features, radiomics features and clinical features were assessed using the point-biserial correlation coefficient. Python 3.7 was used to develop and validate the predictive models. Patients with unknown outcomes (molecular markers) were excluded from the analysis. At first, highly correlated features (rs > 0.85) were eliminated, in which the feature with the lower AUC value in univariate ROC analysis was removed and resulting features were normalized using Z score on the MUMC+ cohort. Shift/scale parameters of individual features are available upon request. As the second step, the MUMC+ cohort was split randomly into train and test sets with a 70/30 ratio and label stratification. In the third step, to obtain the feature importance scores, a random forest model with the random-sampled initialization of hyper parameters (each iteration parameter was randomly sampled from the hyper parameter ranges: number of estimators (20,300), max depth (2,6)) was fitted 1000 times resulting in the cumulative feature importance histogram. Based on the feature importance rank, the 20 most important features were selected for the further evaluation. In order to find the best performing model in the fourth step, Xgboost, Random Forest and Logistic regression algorithms were initialized with the random-sampling of hyper parameters (Table S6), trained and tested 1000 times. In order to overcome a "lucky split bias", step 2 (the random splitting of the cohorts) followed by model testing was repeated 10 times for the top 5 performing models from step 4, representing the cross validation technique.
Combined model was achieved by ensembling VASARI and radiomics models using averaging of VASARI and radiomics predicted probabilities. To evaluate performance of the predictive models, the area under the receiver operating characteristic (ROC) curve, or AUC, was calculated. Bootstrapping technique with 100 iterations was utilized to estimate ROC AUC 95% confidence intervals on test and validation datasets.
Additionally, to visualize the ability of radiomics features of capturing the outcome heterogeneity in a univariate manner and contribute to concept of explainable radiomics, we visualized the outcome heterogeneity through selected radiomics features by plotting the distribution of feature values for each particular feature of each binary outcome. The significance of the difference in the mean values was evaluated by performing the Mann-Whitney test with Bonferroni correction.

TRIPOD Statement and Radiomics Quality Score (RQS)
To assess the quality of the conducted study, a radiomics quality score (RQS) was calculated. The RQS is a checklist consisting of 16 components to assess the validity of the radiomics workflow and (external) validation of the models [19,80]. Furthermore, the checklist recommended in transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) was assessed [81].

Conclusions
In the present study, the potential of non-invasive quantitative and qualitative imaging features to predict prognosis and clinically relevant molecular markers was investigated in a real-life heterogeneous GBM patient cohort. The integrated prognostic model, including clinical and imaging features, showed the most promising performance which was reproducible and most robust between both datasets. However, further improvements and larger prospective studies are needed before this model can be used in daily clinical practice. Using imaging features to predict molecular markers showed promising results in the testing set but could not be validated on the external validation set and warrants additional validation in larger GBM cohorts.
Supplementary Materials: The following are available online at https://www.mdpi.com/2072 -6694/13/4/722/s1, Figure S1: Imaging heterogeneity in MUMC+ cohort, Figure S2: Imaging heterogeneity in Radboudumc cohort, Figure S3: Correlation matrix between radiomics features and tumor volume, Figure S4: Correlation matrix between VASARI features, clinical features and radiomics features. Figure S5: ROC curves for all predictive models in the test dataset, Figure S6: ROC curves for all predictive models in the validation dataset, Figure S7: Heterogeneity histograms from selected radiomics features in predictive models, Figure S8: Mann-Whitney test for significance of histogram heterogeneity of radiomics features in predicting IDH1 mutation status, Table S1: Number of patients included for development and validation of each model, Table S2: Univariate Cox-regression analysis of VASARI features for OS in IDH-WT GBM population, Table S3: TRIPOD  statement, Table S4: Radiomics Quality Score (RQS), Table S5: Definitions of VASARI features as used in this study, Table S6: Tuned hyperparameters for predictive models. Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the fact that, besides a few exceptions, most patients were deceased at the moment of this study and that all data collected are not traceable to individual patients.
Informed Consent Statement: Patient consent was waived due to the fact that, besides a few exceptions, most patients were deceased at the moment of this study and that all data collected are not traceable to individual patients.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.
Conflicts of Interest: Philippe Lambin reports, within and outside the submitted work, grants/sponsored research agreements from Varian medical, Oncoradiomics, ptTheragnostic, Health Innovation Ventures and DualTpharma. He received an advisor/presenter fee and/or reimbursement of travel costs/external grant writing fee and/or in-kind manpower contribution from Oncoradiomics, BHV, Merck and Convert pharmaceuticals. Lambin has shares in the company Oncoradiomics, Convert pharmaceuticals, MedC2 and LivingMed Biotech and is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248, PCT/NL2014/050728) licensed to Oncoradiomics and one issue patent on mtDNA (PCT/EP2014/059089) licensed to ptTheragnostic/DNAmito, three non-patentable invention (software) licensed to ptTheragnostic/ DNAmito, Oncoradiomics and Health Innovation Ventures. Henry C. Woodruff has (minority) shares in the company Oncoradiomics. Alinda Jacobi-Postma received institutional grants received from Siemens Healthcare and Bayer Healthcare. The funders had no role in the design of the study; in the collection, analyses or interpretation of data, in the writing of the manuscript or in the decision to publish the results. All other authors declare no potential conflict of interests.