Prediction of Prostate Cancer Disease Aggressiveness Using Bi-Parametric MRI Radiomics

Simple Summary
The use of radiomics has been studied to predict the Gleason Score from bi-parametric prostate MRI examinations. However, different combinations of input data type (whole prostate gland vs. lesion features), sampling strategy, feature selection method, and machine learning algorithm can be used. The impact of these choices was investigated, and it was found that features extracted from the whole prostate gland were more stable to segmentation differences and produced better models (higher performance and less overfitting). This result suggests that the areas surrounding the tumour lesions offer relevant information regarding the Gleason Score that is ultimately attributed to that lesion.

Abstract
Prostate cancer is one of the most prevalent cancers in the male population. Its diagnosis and classification rely on unspecific measures such as PSA levels and DRE, followed by biopsy, where an aggressiveness level is assigned in the form of a Gleason Score. Efforts have been made in the past to use radiomics coupled with machine learning to predict prostate cancer aggressiveness from clinical images, with promising results. Thus, the main goal of this work was to develop supervised machine learning models that exploit radiomic features extracted from bpMRI examinations to predict biological aggressiveness; 288 classifiers were developed, corresponding to different combinations of pipeline aspects, namely, type of input data, sampling strategy, feature selection method, and machine learning algorithm. On a cohort of 281 lesions from 183 patients, it was found that (1) radiomic features extracted from the lesion volume of interest were less stable to segmentation than the equivalent features extracted from the whole gland volume of interest; and (2) radiomic features extracted from the whole gland volume of interest produced higher-performing and less overfitted classifiers than radiomic features extracted from the lesion volumes of interest. This result suggests that the areas surrounding the tumour lesions offer relevant information regarding the Gleason Score that is ultimately attributed to that lesion.


Introduction
Prostate cancer is the second most prevalent cancer in the world, according to the World Health Organization; 1,414,259 new cases were reported in 2020, with an ASR (world) mortality rate of 7.7 per 100,000 [1].
Prostate cancer in its early stages does not cause any specific symptoms, so a suspicion of PCa can arise from an abnormality on digital rectal examination (DRE) [2][3][4] or an elevated PSA level.

Dataset
Our dataset consisted of T2W, DW, and ADC data from the SPIE-AAPM-NCI PROSTATEx challenge (the data can be downloaded from https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=23691656 (accessed on 28 September 2020)) [20][21][22]. The MRI exams were acquired at the Prostate MR Reference Center, Radboud University Medical Centre (Radboudumc), in the Netherlands. Due to the public nature of the data, ethics committee approval was waived for this study. The following description of the dataset was provided by the challenge's organizers: "The images were acquired on two different types of Siemens 3T MR scanners, the MAGNETOM Trio, and Skyra. T2-weighted images were acquired using a turbo spin echo sequence and had a resolution of around 0.5 mm in plane and a slice thickness of 3.6 mm. The DWI series were acquired with a single-shot echo planar imaging sequence with a resolution of 2-mm in-plane and 3.6-mm slice thickness and with diffusion-encoding gradients in three directions. Three b-values were acquired (50, 400, and 800), and subsequently, the apparent diffusion coefficient (ADC) map was calculated by the scanner software. All images were acquired without an endorectal coil". The dataset is composed of 281 lesions from 183 patients. The approximate location of the centroid of each lesion was provided in DICOM coordinates. Cancer was considered significant when the biopsy Gleason score was 7 or higher. The lesions were labelled "TRUE" or "FALSE" for the presence of clinically significant cancer, with a distribution of 67 clinically significant lesions (TRUE) and 214 clinically non-significant lesions (FALSE).

Feature Extraction
Manual segmentations of the whole prostate gland and of each lesion were performed independently by two radiologists (M.L., 10 years of experience, and A.U., radiology resident) on the T2W images and DW maps separately. The lesion segmentation on DWI was performed on the high b-value image and the whole gland segmentation was performed on the low b-value image. An example segmentation can be found in Figure 1. For each sample, one radiologist's volume of interest (VOI) was randomly chosen to be included in the final dataset. Radiomic features were extracted using the package Pyradiomics (version 3.0) [23] in Python (v. 3.7.9; https://www.python.org/ (accessed on 28 September 2020)). In total, 14 shape features, 18 first-order features, and 75 texture features were extracted from the VOIs of the three MRI modalities, T2W, DWI, and ADC, resulting in a total of 321 features. For the feature extraction from the ADC map, the mask drawn on the DWI was used. The mathematical expressions and semantic meanings of the extracted features can be found at https://pyradiomics.readthedocs.io/en/latest/ (accessed on 28 September 2020).

Dataset Construction
The features extracted from a lesion mask VOI constituted the Lesion Dataset. The features extracted from a whole gland mask VOI constituted the Gland Dataset. A gland was considered to have clinically significant PCa if at least one of its lesions was clinically significant. From the previous datasets, two additional datasets were constructed:
• Lesion Features with Anatomical Zone dataset: a dataset composed of lesion features plus features describing the anatomical location of the lesion.
• Single-Lesion Whole Gland Features dataset: a truncated dataset composed of patients from the Gland Dataset that had only one lesion.
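The gland-level labelling rule described above (a gland is significant if any of its lesions is) and the single-lesion subset can be sketched with a toy table; the column names `patient_id` and `significant` are illustrative, not taken from the original code:

```python
import pandas as pd

# Hypothetical per-lesion table: one row per lesion, keyed by patient.
lesions = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2", "p3"],
    "significant": [False, True, False, False],
})

# A gland is labelled clinically significant if at least one
# of its lesions is clinically significant.
gland_labels = lesions.groupby("patient_id")["significant"].any()

# Patients with exactly one lesion form the single-lesion subset.
lesion_counts = lesions["patient_id"].value_counts()
single_lesion_ids = lesion_counts[lesion_counts == 1].index
```

Here patient `p1` has two lesions (one significant), so its gland is labelled significant, while only `p2` and `p3` enter the single-lesion subset.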
A description of the datasets can be found in Table 1.
The train/test split was performed with the train_test_split() function of the Python scikit-learn package (version 0.23.2; https://scikit-learn.org/ (accessed on 28 September 2020)) [24][25][26]. The hold-out test sets consisted of 25% randomly selected samples from the original datasets, and the split was stratified so that both train and test sets had the same proportion of TRUE labels. The lesion train sets consisted of 210 lesions (51 clinically significant and 159 clinically non-significant) and the test sets of 71 lesions (16 clinically significant and 55 clinically non-significant). The gland train set consisted of 137 glands (48 clinically significant and 89 clinically non-significant) and the test set of 46 glands (15 clinically significant and 31 clinically non-significant). The Single-Lesion Whole Gland Features dataset was not split into train and test sets, due to its already reduced number of samples, and was only validated internally.
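The stratified hold-out split can be reproduced directly with scikit-learn; the feature matrix below is a random stand-in with the same sample count and class proportions as the lesion dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the lesion dataset: 281 samples, 67 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(281, 5))
y = np.array([1] * 67 + [0] * 214)

# 25% stratified hold-out split, as described above: both sets
# keep (approximately) the same proportion of TRUE labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

With 281 samples this yields 210 training and 71 test samples; the exact per-class rounding in the test set (16 vs. 17 positives) depends on the scikit-learn version.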

Feature Stability to Segmentation
Features that are highly dependent on segmentation margins will not be stable predictors, since they change easily depending on the radiologist who performed the segmentation. Features extracted from the VOIs created by both radiologists were compared with the intraclass correlation coefficient (ICC). The ICC used was a two-way, single-rater, absolute agreement ICC model (ICC(2,1)) [27]. Features whose ICC 95% confidence interval lower limit was over 0.8 were considered robust to segmentation and were kept for further analysis. This analysis was performed in Python (v. 3.7.9; https://www.python.org/ (accessed on 28 September 2020)) with the package icc (https://pypi.org/project/icc/ (accessed on 28 September 2020)).
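The paper used the icc package; purely as an illustration of what ICC(2,1) measures, the coefficient can be computed from the two-way ANOVA mean squares, with subjects as rows (one feature's values across samples) and the two radiologists as columns:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_subjects, k_raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    # Mean squares of the two-way ANOVA decomposition.
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    sse = np.sum((ratings - row_means[:, None]
                  - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Perfect agreement between raters gives ICC = 1; a constant
# offset between them lowers absolute agreement:
icc_offset = icc_2_1([[1, 2], [2, 3], [3, 4]])  # ≈ 0.667
```

Because ICC(2,1) penalizes systematic rater offsets (through the MSC term), it is a stricter criterion than a consistency-type ICC, which suits the stability requirement here.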

Outlier Detection
In order to identify outliers, the local outlier factor (LOF) was used. Since scale affects the distance function, the data were normalized before applying the LOF algorithm. Samples with an LOF over 2 were removed from the original, non-normalized dataset. Outlier detection and removal were performed inside cross-validation with the software RapidMiner Studio (version 9.9; https://rapidminer.com/ (accessed on 28 September 2020)) [29].
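The paper ran this step in RapidMiner; a minimal Python sketch of the same logic (normalize, score with LOF, then drop samples scoring over 2 from the un-normalized data) using scikit-learn, with a deliberately planted outlier for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[0] = 25.0  # plant one obvious outlier

# LOF is distance-based, so normalize before scoring.
X_scaled = StandardScaler().fit_transform(X)
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_scaled)
scores = -lof.negative_outlier_factor_  # LOF score; ~1 for inliers

# Remove samples with LOF over 2 from the original,
# non-normalized dataset.
X_clean = X[scores <= 2]
```

Note that scikit-learn reports the negated LOF, hence the sign flip before thresholding.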

Feature Correlation
The feature correlation analysis was performed inside cross-validation in RapidMiner Studio (version 9.9; https://rapidminer.com/ (accessed on 28 September 2020)) [29] with the operator "Remove Correlated Attributes". This operator uses the Pearson correlation coefficient to compute the correlation between each pair of features. If a pair of features is found to have a correlation higher than the threshold, one of the two is randomly eliminated. The correlation threshold was a hyperparameter optimized during model training.
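A greedy Python analogue of this operator might look as follows; note that, unlike the RapidMiner operator, this sketch deterministically keeps the first feature of each correlated pair rather than choosing at random:

```python
import numpy as np
import pandas as pd

def remove_correlated(df, threshold):
    """Drop one feature of every pair whose absolute Pearson
    correlation exceeds `threshold` (keeps the earlier column)."""
    corr = df.corr().abs()
    keep = list(df.columns)
    for i, a in enumerate(df.columns):
        if a not in keep:
            continue
        for b in df.columns[i + 1:]:
            if b in keep and corr.loc[a, b] > threshold:
                keep.remove(b)
    return df[keep]

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "f1": x,
    "f2": x * 2 + 0.01 * rng.normal(size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                  # independent
})
reduced = remove_correlated(df, threshold=0.95)  # drops f2
```

In the pipeline, `threshold` would be the hyperparameter tuned during model training.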

Feature Selection
Four feature selection algorithms were applied separately and their performances compared. These algorithms were recursive feature elimination with support vector machine weighting (SVM-RFE), the Boruta algorithm [30], the minimum redundancy maximum relevance algorithm (mRMR), and LASSO regularization.
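Two of these selectors, SVM-RFE and LASSO-style L1 selection, are available directly in scikit-learn; Boruta and mRMR require third-party packages and are omitted from this sketch. The dataset and hyperparameters below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# SVM-RFE: recursively drop features with the smallest
# linear-SVM weights until 5 remain.
svm_rfe = RFE(SVC(kernel="linear"), n_features_to_select=5).fit(X, y)
selected_rfe = svm_rfe.support_

# LASSO-style selection: keep features with non-zero
# L1-regularized logistic regression coefficients.
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)
selected_lasso = lasso.get_support()
```

The number of features kept by RFE is fixed in advance, whereas the L1 approach lets the regularization strength (`C`) decide it, which is one practical difference between the two selectors.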

Model Development
In this work, different aspects of model development were assessed and compared (Figure 2). The machine learning algorithms were chosen so as to cover a wide range of machine learning algorithm types [31]. In total, 288 pipelines were produced, corresponding to the different combinations. Each was trained and validated according to the diagram in Figure 3. Hyperparameter tuning was done in a nested cross-validation fashion with an exhaustive grid search. This was performed in RapidMiner Studio (version 9.9; https://rapidminer.com/ (accessed on 28 September 2020)) with the operator "Optimize Parameters (Grid)" [29]. In this work, we chose to optimize the F2-score, so as to take into account the higher cost of a false negative misclassification compared to a false positive. Additionally, we report Cohen's Kappa.
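Nested cross-validation with an F2 objective can be sketched in scikit-learn terms; the paper performed this in RapidMiner, so the random forest, its grid, and the fold counts below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, cross_val_score

# Imbalanced toy problem (roughly 3:1), echoing the class imbalance here.
X, y = make_classification(n_samples=200, weights=[0.75], random_state=0)

# F2 weighs recall four times as heavily as precision,
# penalizing false negatives more than false positives.
f2 = make_scorer(fbeta_score, beta=2)

# Inner loop: exhaustive grid search optimizing F2.
inner = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    scoring=f2, cv=4,
)

# Outer loop: unbiased performance estimate of the tuned pipeline.
outer_scores = cross_val_score(inner, X, y, scoring=f2, cv=4)
```

The key point of the nesting is that the outer folds never see the data used to pick hyperparameters, so `outer_scores` estimates generalization rather than tuning optimism.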
A subset of best classifiers was selected according to the 4-fold cross-validation F2 and Kappa performances, following the rule: CV F2 > 0.8 ∩ CV Kappa > 0.5. These were applied to the holdout test set for validation.

Metric Volatility Analysis
The Gland, Lesion, and Lesion with Anatomical Zone Datasets were each randomly split into training and testing sets in 50 different ways, according to 50 different random seeds. Each of the highest-ranking classifiers was then trained on each of the 50 training sets and validated through both cross-validation and each of the 50 hold-out testing sets. The distributions of cross-validation and test set performance results were recorded for further analysis (Figure 4). This analysis was based on the metric volatility analysis performed by the Probatus package (https://ing-bank.github.io/probatus/ (accessed on 28 September 2020)).

Distribution Comparison Tests
All performance distributions were tested for normality using the Shapiro-Wilk test [32] and the D'Agostino K2 test [33]. For each classifier, the distribution of cross-validation performances was compared to the distribution of test set performances, to assess whether they belonged to the same distribution. Two statistical tests were used: the paired t-test and the Kolmogorov-Smirnov test [34]. Both behave like common hypothesis tests, and the hypotheses were as follows: Hypothesis 1 (H1). The distributions of cross-validation and test set performances are identical.

Hypothesis 2 (H2). The distributions of cross-validation and test set performances are different.
The significance level, α, was chosen to be 0.05.

Feature Stability to Segmentation
In the Lesion Dataset, 154 of the total 321 features were found to be unstable, while in the Gland Dataset, 64 of the 321 features were found to be unstable. Among the unstable features, texture features were the most unstable feature type (72.73% of the unstable lesion features and 84.38% of the unstable gland features), and the MRI modalities that showed the lowest stability were DWI (45.45% of the unstable lesion features) and ADC (82.81% of the unstable gland features).

Zero or Near-Zero Variance
In the Lesion Dataset, out of the 169 stable features, 2 were found to have near-zero variance: DWI_original_glszm_GrayLevelNonUniformity and ADC_original_glszm_GrayLevelNonUniformity. In the Gland Dataset, no features were excluded.

Classifier Development
In Figure 5a-d, we can see the cross-validation performance of the same 288 pipelines grouped by the different pipeline aspects assessed in this work.
Overall, the Boruta algorithm (Figure 5a) did not perform as well as expected. Despite having a high cross-validation F2, most Kappa values were extremely low, especially for pipelines trained on whole gland features. Pipelines trained with data that underwent SVM-RFE achieved an average cross-validation F2 of 0.7226 and Kappa of 0.3781, while the feature sets that underwent mRMR achieved average performances of 0.7071 on F2 and 0.4095 on Kappa. Overall, at this stage, SVM-RFE and mRMR pipelines showed a similar average performance. Pipelines trained with data that underwent LASSO feature selection achieved an average cross-validation F2 of 0.643 and Kappa of 0.347, not performing, on average, as well as SVM-RFE and mRMR. Figure 6 shows the 26 models that satisfied the condition CV F2 > 0.8 ∩ CV Kappa > 0.5, as well as their performance in the cross-validation setting and on the hold-out test set. Of these, 65% are models trained on whole gland features. All of the best models were trained on data that underwent some kind of sampling: 42% downsampled data and 58% SMOTE data. Regarding feature selection, 31% of the pipelines included SVM-RFE, 50% included mRMR, 15% included LASSO, and 4% included Boruta. As for the machine learning algorithm, the large majority of the best models are tree-based (73%); the remaining models are logistic regressions with or without elastic net regularization and one Naïve Bayes pipeline. Table 2 shows the performance of the best models in the cross-validation setting and on the hold-out test set in terms of F2, Kappa, ROC-AUC, and AUPRC. In addition, it shows the difference between cross-validation and test set performance, ∆. The models where this difference is closest to zero are the least overfitted. There seems to be a cluster of overfitted models at the bottom of the table (in darker red); these correspond to the pipelines trained with Lesion data.

Metric Volatility Analysis
The mean and standard deviation values were calculated for each performance metric and each classifier, as well as the ∆ values (Table 3). The latter are shown in Table 4, where each column is individually colour-coded from the lowest value, in green, to the highest value, in red. As before, there seems to be a cluster of overfitted models at the bottom of the table (in darker red); these correspond to the pipelines trained with Lesion data. Three clusters of lower ∆ can be found in green; these correspond to the pipelines where downsampling of the majority class was performed.

Distribution Comparison Tests
Out of 26 classifiers, 19 displayed a significant difference between the F2 test set performance distribution and the F2 cross-validation performance distribution, 5 displayed no significant difference between the two distributions, 1 displayed a significant difference on the Kolmogorov-Smirnov test but no difference on the paired t-test, and 1 displayed a significant difference on the Kolmogorov-Smirnov test but inconclusive results on the paired t-test.
Out of 26 classifiers, 15 displayed a significant difference between the Kappa test set performance distribution and the Kappa cross-validation performance distribution, 8 displayed no significant difference between the two distributions, 1 displayed a significant difference on the Kolmogorov-Smirnov test but no difference on the paired t-test, 1 displayed a significant difference on the paired t-test but no difference on the Kolmogorov-Smirnov test, and 1 displayed a significant difference on the Kolmogorov-Smirnov test but inconclusive results on the paired t-test.
Five classifiers displayed no significant difference between the cross-validation performance and the test set performance on both metrics; these were G_D_SVM-RFE_XGB, G_D_mRMR_XGB, G_D_Lasso_DT, G_D_Lasso_RF, and G_D_Lasso_XGB. These were also among the classifiers previously found to be least overfitted, supporting those results. The performance distributions of these 5 classifiers can be found in Figure 7.

Discussion
In this work, different dimensions of a machine learning pipeline were extensively assessed and their performances compared.
We started by assessing the radiomic features' stability to segmentation, where we found that the whole-gland features seem to be considerably more robust than lesion features (approximately 50% of lesion features were found to be unstable, compared to approximately 20% of gland features being unstable). This was expected since there is a lot more inter-reader variability in determining lesion borders when compared to whole gland borders.
Regarding feature selection, a low performance was unexpectedly observed from the pipelines that applied Boruta feature selection. These showed a high F2, because the model would classify the large majority of samples as the minority class, leading to a high recall. However, the low Kappa score makes it clear that these were not useful models. It was observed that the Boruta algorithm found very few features that were better predictors than the random versions of themselves. Hence, it is hypothesised that the number of features selected by the Boruta algorithm (around three features) was not enough to build a meaningful radiomics signature, which led to the poor results.
The pipelines where sampling was applied, whether downsampling of the majority class or upsampling of the minority class with SMOTE, performed better than the ones where no sampling was done. This was expected, since training a model with balanced data gives it equal opportunities to learn from both classes. While the pipelines where SMOTE upsampling was performed seem to slightly outperform those with downsampling of the majority class, the volatility analysis showed that the latter produces classifiers that are consistently less overfitted and more reliable. This can be explained by the fact that SMOTE generates synthetic samples from the existing samples in the dataset. Thus, we are forcing the model to learn more from the same data, increasing the model's confidence in random variability, or noise, present in the data, which results in the overfitted behaviour.
Among the most interesting findings are the higher performance of models trained with radiomic features extracted from the whole gland VOI, as well as their higher reliability and lower overfitting. This suggests that the areas surrounding tumorous lesions might offer relevant information regarding their overall aggressiveness in the form of the Gleason score, and that the monotonous lesion segmentation work performed by radiologists may not be necessary, or may even harm the radiomics signature. However, it is of note that a few patients had more than one lesion. If these multiple lesions have the same clinical significance (same target label), then it seems reasonable that the model performs better with gland features, since it has more information pointing to the correct label. In order to make a fair comparison between the performance of both types of input data, the single-lesion whole gland dataset was created, including only patients with a single lesion. The performance results obtained with this smaller dataset confirm the suspicions above: whole gland features produce more reliable machine learning models than lesion features. Additionally, the volatility analysis showed that the Lesion-based models seem to be the most overfitted, which supports the previous findings. This is not the first paper of its kind to report the performance of whole-gland radiomic features in predicting PCa clinical significance [35,36].
As a final note, it is important to point out that given so many pipeline combinations we have to assume that it is possible to find one that performs well by chance. Statistically speaking, we could remedy this by doing something similar to a multiple comparisons p-value correction. However, at this point, we are not aware of such a correction for machine learning performance metrics.
Regarding the MR sequences used, DW images and ADC maps, although related, do not offer the same information. DW images quantify the diffusion of water molecules in the tissue. Their sensitivity to diffusion is regulated by the b-value, with a high b-value allowing the distinction between healthy tissue and tumorous tissue (where there is a much higher restriction to diffusion). While ADC maps show the rate of variation of the DWI signal intensity with respect to a change in b-values. Here, it is known that PCa signal intensity on DWI decreases slower than healthy tissue's signal intensity. Thus, the rate of variation will be lower and PCa will appear hypointense on the ADC map. The information provided by DWI and ADC is different and, so, not redundant, which is why we include both in our analysis.
In terms of literature comparison, the dataset used in this study has been widely used, both within and outside of the SPIE-AAPM-NCI PROSTATEx challenge. In that setting, the highest performing classifier achieved an AUC of 0.87 [37]. Even though we felt that the AUC was not the most appropriate metric to optimize, we still achieved AUCs of up to 0.90 while optimizing the F2-score. Recent literature has shown growing interest in PSMA PET radiomics for the classification of PCa's clinical significance [38,39], so it would be interesting to assess in the future how PET radiomics would perform in the context of this study.
This study has some limitations. First, this was a retrospective study, so a multicentre prospective analysis should be carried out to validate these results and investigate the impact these predictive models have on patient outcome. Second, only T2W, DWI, and ADC sequences were used. Other sequences, such as dynamic contrast enhanced MRI, could be worth exploring. Third, only one set of MRI sequences was evaluated per patient, so we were unable to evaluate the temporal stability of the radiomic features. Fourth, although the overall class imbalance was addressed through downsampling of the majority class or SMOTE upsampling of the minority class, we did not address the imbalanced nature of the anatomical location of lesions, with the large majority of lesions belonging to the PZ. It would be interesting to investigate the models' performance on the different anatomical zones independently. Fifth, the use of a publicly available dataset increased transparency but limited our access to clinical data, such as PSA levels, patient age, or PI-RADS score, which are a fundamental component of a clinician's assessment but could not be included in our model. Sixth, dataset quality issues were not addressed, such as the sometimes imperfect location of the centroid of each lesion [40]. Seventh, despite the effort of performing a metric volatility analysis, proper assessment of real-world clinical performance is only possible through external validation. This important validation step will be addressed in future work. Finally, inherent to the Gleason system is the subjectivity of cancer grading, so we must keep in mind that the gold standard used in this study is subject to human error and inter- or intra-observer variability. In addition, the definition of clinical significance might be based on more than the Gleason score alone, and variables such as tumour volume or tumour category might be of relevance.

Conclusions
In conclusion, our results further confirm the validity of MRI-based radiomic features in the identification of clinically significant prostate cancer. Additionally, we highlight the higher performance of models trained with whole gland radiomic features, as well as their higher stability and lower overfitting, when compared to lesion VOI radiomic features.

Institutional Review Board Statement: Ethical review and approval were waived for this study, due to the public nature of the data.

Informed Consent Statement: Patient consent was waived due to the public nature of the data.

Data Availability Statement: Data will be available upon reasonable request.

Conflicts of Interest:
All authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: