Development of a Radiomic-Based Model Predicting Lymph Node Involvement in Prostate Cancer Patients

Simple Summary In patients with prostate cancer, lymph node involvement is a risk factor of relapse. Current guidelines recommend extended lymph node dissection to better stage the disease. However, such a surgical procedure is associated with a higher morbidity than limited lymph node dissection. A better selection of patients is thus essential. Radiomics features are quantitative features automatically extracted from medical imaging. Combining clinical and radiomics features, a machine learning-based model seemed to provide added predictive performance compared to state of the art models regarding the risk prediction of lymph-node involvement in prostate cancer patients. Abstract Significant advances in lymph node involvement (LNI) risk modeling in prostate cancer (PCa) have been achieved with the addition of visual interpretation of magnetic resonance imaging (MRI) data, but it is likely that quantitative analysis could further improve prediction models. In this study, we aimed to develop and internally validate a novel LNI risk prediction model based on radiomic features extracted from preoperative multimodal MRI. All patients who underwent a preoperative MRI and radical prostatectomy with extensive lymph node dissection were retrospectively included in a single institution. Patients were randomly divided into the training (60%) and testing (40%) sets. Radiomic features were extracted from the index tumor volumes, delineated on the apparent diffusion coefficient corrected map and the T2 sequences. A ComBat harmonization method was applied to account for inter-site heterogeneity. A prediction model was trained using a neural network approach (Multilayer Perceptron Network, SPSS v24.0©) combining clinical, radiomic and all features. It was then evaluated on the testing set and compared to the current available models using the Receiver Operative Characteristics and the C-Index. Two hundred and eighty patients were included, with a median age of 65.2 y (45.3–79.6), a mean PSA level of 9.5 ng/mL (1.04–63.0) and 79.6% of ISUP ≥ 2 tumors. LNI occurred in 51 patients (18.2%), with a median number of extracted nodes of 15 (10–19). In the testing set, with their respective cutoffs applied, the Partin, Roach, Yale, MSKCC, Briganti 2012 and 2017 models resulted in a C-Index of 0.71, 0.66, 0.55, 0.67, 0.65 and 0.73, respectively, while our proposed combined model resulted in a C-Index of 0.89 in the testing set. Radiomic features extracted from the preoperative MRI scans and combined with clinical features through a neural network seem to provide added predictive performance compared to state of the art models regarding LNI risk prediction in PCa.


Introduction
With 335,230 new estimated cases/year and 69,945 estimated deaths in 2020 in Europe alone [1], prostate cancer (PCa) is the most common cancer among men in western countries. Lymph Node Involvement (LNI) is one of the main prognostic factors and extensive lymph node dissection (eLND) is now the gold standard in patients with high-risk disease operated on for patients as they are at high risk of LNI [2]. However, eLND is associated with substantial morbidity [3][4][5][6][7]. The decision to perform LND in clinically LNI-free patients (cN0) is guided by the probability of nodal metastases, with several models developed for the prediction of LNI risk: the National Comprehensive Cancer Network (NCCN) chooses 2% as a cut-off according to a nomogram developed by the Memorial Sloan Kettering Cancer Centre, whereas the European Association of Urology (EAU) guidelines [8] recommends it for patients with a probability of LNI over 5% based on 2012-Briganti nomogram. Several clinical models were presented for LNI risk prediction [9][10][11][12][13][14]. Magnetic Resonance Imaging (MRI) and its implementation in the daily routine considerably changed both staging and diagnostic management [15]. Lately, Gandaglia et al. developed a new promising algorithm using qualitative data from MRIs in patients with targeted biopsies [16]. Even if targeted combined with systematic biopsies provide a better assessment of the Gleason score [17], software-based fusion targeted biopsies require special logistics and a certain cost [18], which explains its under-utilization compared to cognitive fusion, leaving place for better LNI risk assessment tools.
Such a better patients' selection could benefit the patients undergoing surgery with a reduction in unnecessary eLND for selected patients.
Quantitative analysis of medical images such as MRIs has shown to reflect macroscopic tumoral heterogeneity and has been implemented in several prediction and prognostic models, especially in PCa [19,20]. However, the added value of radiomic features on LNI prediction has never been studied.
Machine learning methods, especially artificial neural networks (NN) [21], have the potential to efficiently model the synergistic interaction between variables using a flexible nonlinear relationship.
In this report, we aimed to develop and internally validate a NN radiomic-based LNI prediction model.

Population
Patients who underwent a radical prostatectomy and eLND for a PCa in a single institution (University Hospital of Brest) between 2010 and 2018 were retrospectively considered using the pathology database.
Radical prostatectomy and eLND were performed following international guidelines 8 by a single trained surgeon (GF) [6,22], with~35 years of experience. Patients with a delay between the preoperative MRI and the surgery superior to 6 months or an MRI unfit for analysis (i.e., missing sequence or slice, hip replacement) were excluded. Patients with no index lesion according to PIRADS V2.1 [23] score after central review by a single specialized radiologist (V.T) with~15 years of experience.

Prostate MRI
Two different MRI scanners were used: a 3T Phillips (Philips Healthcare, Eindhoven, The Netherlands) and a 1.5T Siemens (Siemens Healthcare, Malvern, PA, USA). Acquisition was performed in accordance with the European Society of Urogenital Radiology (ESUR) guidelines [23]. Full protocol was previously published and summarized in Supplementary Table S1 [19].

Clinical and Radiomics Features
Regarding clinical features, we retrospectively collected the preoperative PSA level, the age at the time of surgery, the overall Gleason score along with the primary and secondary biopsy Gleason grades, the number of performed and positive biopsy cores (and rate of positive biopsy cores), the percentage of positive biopsy cores with highest-grade PCa, the percentage of positive biopsy cores with lower grade PCa, the clinical and MRI tumor stages and the NCCN risk classification. When targeted biopsies were performed, we merged the two biopsy techniques, the maximum number of positive cores being set to two by sextant. The PIRADS V2.1 [23] and MRI tumor stage were also collected.
The index lesion was semi-automatically delineated by a single expert (V.B.) using the Fast GrowCut tool imbedded in 3D Slicer©. Full detail regarding the extraction of radiomic features is detailed in Supplementary Protocol 1. As a result, 8651 features (11 clinical and 120 × 4 (fixed bin value) × 9 (original + 8 wavelet filters) × 2 (MRI sequences) = 8640 radiomic features) were available for each patient.

Model Building
The model was built on the training set alone consisting of approximately 60% of the overall cohort. It was then applied on the rest of the patients, defining the testing set (40%). All preselected clinical and radiomic features were then considered for the New-Combined model building, via a Neural Network approach. It was further compared to available nomograms (Partin [9], Yale [11], Roach [10], Briganti 2012 [12], Briganti 2017 [13] and MSKCC [14]). Evaluation of the trained model was carried out using the Area Under the Curve (AUC) along with the C-Index, Se, Sp and Balanced Accuracy (Bacc) after setting a threshold on the training cohort, number of false negative (FN), positive and negative predictive values (PPV and NPV). Importance of each feature in the final model was also reported. Decision curve analysis (DCA) was also used for inter-model comparison. Full description of the feature set reduction workflow and of the model building is available as Supplementary Protocol 2.
With the same methodology, and given the number of MRI scans used for the acquisition of prostate MRI, an a posteriori statistical harmonization method was used. The ComBat harmonization method [24,25] is a statistical method used to remove intersite technical variability while preserving intra-site variability. It was performed before feature set selection and model building creating a ComBat Combined model. For completeness, a model based on clinical/biological features only was also developed.
As a result, three separate models were thus developed and validated: the New-Clinical, the New Combined and the ComBat Combined models.

Inter-Reader Variability
Semi-automatic segmentation was performed by two other experts (U.S. and F.L.) blinded to the results of the previous delineation by V.B in a randomly selected subset (n = 18) of the testing set. Protocol regarding analysis of inter-reader variability is described in Supplementary Protocol 3.
We autoevaluated our study based on the radiomics quality score developed by Lambin et al. [26].

Ethics Committee
The study was approved by the hospital ethical committee (NCT04909957). All patients were given 15 days to formulate their opposition.

Population
From 2010 to 2018, 552 patients underwent radical prostatectomy with eLND, among which 280 patients were finally included with 272 excluded patients due to unavailable or incomplete MRI (n = 266) and unanalyzable MRI (n = six). The overall population was then randomly divided into a training cohort (n = 168) and a testing cohort (n = 112). LNI was found in 51 patients (18.2%), equally sampled between the training set (32 patients-19.0%) and the testing set (19 patients-17.0%). The training and testing set were globally comparable, except for the PIRADS classification. Main patients' characteristics are summarized in Table 1. A flow-chart is proposed as Supplementary Figure S1.

Feature Set Reduction
Regarding clinical features, out of the 11 initial features, 10 were further considered for the model building. For the radiomic features and before statistical harmonization, the feature set reduction allowed a preselection of only four radiomic features: at the first step, out of the initial 8640 features, only five radiomic features with an AUC ≥ 0.70 remained (Supplementary Table S2). They were all from the T2 sequence, four of them from waveletfiltered images and one shape feature. At the second step (Spearman rank correlation coefficient), all these features showed intra-correlation levels below 0.5 (Supplementary  Table S3) except for one, making a total number of four selected radiomic features for model building: Surface Area, Inverse Difference Normalized extracted from the Gray Level Co-Occurrence matrix, Run Entropy extracted from the Gray Level Run Length Matrix, and Contrast extracted from the neighbouring Grey Tone Difference Matrix. Finally, combining all features, the model achieving the highest mean accuracy (New-Combined: 0,97) was based on six features, among which the three most important features were the Inverse Difference Normalized, the percentage of positive biopsy cores and the percentage of positive biopsy cores with highest-grade PCa. With a decremental approach on the mean Bacc, the best combined (New-Combined) model was finally composed of these six features (Supplementary Table S4). With the same methodology but after ComBat harmonization, the ComBat-Combined model consisted in the association of seven features, among which four were radiomic features and the most important feature remaining the percentage of positive biopsy cores with highest-grade PCa. Apart from the Inverse Difference Normalized extracted from the Gray Level Co-Occurence matrix, the three other radiomic features were first order features (Kurtosis, Skewness and Mean). Finally, the clinical model combined five features among which the percentage of positive biopsy cores with highest-grade PCa accounted for 71.0% of the model's prediction. Composition of each model is available in Table 2.

Training Set
In the training set (n = 168), the New-Combined model achieved a high correlation with the risk of LNI (AUC of 1.00, p < 0.0001). With a cut-off of 7%, the New-Combined model resulted in a C-Index of 0.98 while with 10%, the C-Index raised to 1.00 (Table  3). These cut-offs allowed almost perfect classification with only six and one patients classified as false positives, for the 7 and 10% cut-offs, respectively (Supplementary Table  S5). The New-Clinical and ComBat-Combined models achieved similar high results on the training cohort with an AUC of 1.00 and a balanced accuracy of 100% with their respective thresholds.   Table S5).
The newly developed models had a higher AUC than all other nomograms with an increase of 0.19-0.28, the second highest AUC being achieved by the Briganti 2017 nomogram (Supplementary Table S6).

Testing Set
In the testing set (n = 112), the Combat-Combined model was ranked as the most efficient model with a C-Index of 0.89 and a Bacc of 89.4% with the 19% threshold. The New-Combined model achieved the second highest C-indexes: 0.87 for the 7% cut-off and 0.86 for the 10% cut-off; followed by the New-  Table S7).
The ComBat-Combined model reduced the risk of FN to 3.3% while correctly avoiding the risk of eLND for 88 (96.6%) patients, with the 19% cut-off. Among the 21 patients classified above the 19% cut-off, 76.2%% had a proven LNI resulting in a specificity (Sp) of 89.4%. Detailed results of each model are available in Table S7. Patients classified above the 19% cut-off using the ComBat-Combined model were 23.1 times more likely to present a LNI, respectively (Supplementary Table S8). For comparison, the second best relative-risk ratio, apart from the New-Clinical and New-Combined models, was 8.2 (MSKCC). Finally, DCA revealed that the newly developed models improved LNI risk prediction when compared to other models (Figure 1). Similarly, calibration plots showed a calibration as satisfactory as the Briganti 2017 and the MSKCC nomograms (Supplementary Figure S2).

Inter-Reader Variability
The three independent segmentations achieved a mean Dice overlap coefficient of 0.81 and a mean Hausdorff distance of 0.67 mm (Supplementary Table S9). The ICC between each feature's value (feature one -> six, respectively) corresponding to the different delineations ranged from 0.85 to 0.98 and from 0.94 to 0.99 for the average measures (Supplementary Table S10). In this subset of patients, three patients had LNI (16.7%). No change in LNI prediction occurred with the changes in radiomic features' values across the experts' delineation and whatever the chosen cut-off (Supplementary Table S11). An example of the variations of the delineations for two patients is available as Figure 2.

Inter-Reader Variability
The three independent segmentations achieved a mean Dice overlap coefficient of 0.81 and a mean Hausdorff distance of 0.67 mm (Supplementary Table S9). The ICC between each feature's value (feature one -> six, respectively) corresponding to the different delineations ranged from 0.85 to 0.98 and from 0.94 to 0.99 for the average measures (Supplementary Table S10). In this subset of patients, three patients had LNI (16.7%). No change in LNI prediction occurred with the changes in radiomic features' values across the experts' delineation and whatever the chosen cut-off (Supplementary Table S11). An example of the variations of the delineations for two patients is available as Figure 2.

Inter-Reader Variability
The three independent segmentations achieved a mean Dice overlap coefficient of 0.81 and a mean Hausdorff distance of 0.67 mm (Supplementary Table S9). The ICC between each feature's value (feature one -> six, respectively) corresponding to the different delineations ranged from 0.85 to 0.98 and from 0.94 to 0.99 for the average measures (Supplementary Table S10). In this subset of patients, three patients had LNI (16.7%). No change in LNI prediction occurred with the changes in radiomic features' values across the experts' delineation and whatever the chosen cut-off (Supplementary Table S11). An example of the variations of the delineations for two patients is available as Figure 2.

NCCN Risk Classification
The ComBat-Combined model performed robustly in the intermediate and high-risk patients according to NCCN risk classification with BAccs in the testing set of 86.7 and 94.8%, respectively (Supplementary Table S12).

Radiomics Quality Score
Our study scores moderately on the radiomics quality score [26] with 17 points (out of 36, Supplementary Table S13).

Discussion
According to EAU guidelines [8], eLND completion should be based on the use of predictive tools using disease and patients' characteristics, such as the models we tested (Briganti 2012 [12], Briganti 2017 [13] and MSKCC [14] nomograms). While these models have shown satisfying results, confirmed with external validation, calibration remained perfectible, especially in intermediate-risk patients. Indeed, in a population of 175 patients with intermediate PCa, the Briganti 2017 nomogram achieved a low AUC of 0.53 [27]. The Briganti 2018, while further enhancing the prediction quality, focuses specifically on patients with software-based fusion targeted biopsies [16]. However, despite being more precise, software-based fusion remains scarce due to logistics consideration, limiting the Briganti's nomogram's generalizability, most centers still using cognitive fusion biopsies [28]. Therefore, LNI risk prediction tools and their generalizability need to be improved.
Our Combat-combined model achieved a higher performance when compared to the available models and a higher net benefit according to DCA, even in the testing set. For predicted probabilities~85-95%, the newly developed models had negative net benefit reflecting the few false negatives that would still be operated on with eLND despite not showing LNI. Application of the ComBat harmonization method enhanced the model's performance with the Combat-model reaching an Bacc of 89.4% in the testing set, with a threshold set to 19%. We also thoroughly assessed the robustness of our model to delineation's variations. The LNI risk prediction was not modified by switching the experts' delineations, despite the slight changes between each segmentation in a subset of 18 randomly selected patients.
Recent models were systematically enhanced by the addition of MRI data but none incorporated radiomics analysis. Radiomic analysis in the PCa setting has been previously studied for both diagnosis and treatment outcome prediction [19,20,29]. LNI risk is known to be correlated to the Gleason score, PSA level and T stage [2]. The better assessment of tumor heterogeneity via radiomic analysis could explain the superiority of such features over simple and independent clinical, biological or radiological data. Among the seven features in the Combat-Combined model, four were radiomics features, accounting for 26.5% of the overall model's prediction. For instance, the Inverse Difference Normalized (Idn) accounted for 4.6% of the prediction. Idn is a measure of the local homogeneity of an image and normalizes the difference between the neighboring intensity values by dividing over the total number of discrete intensity values: by definition, the higher the value is, the more uniform the volume is and vice versa [30].
The radiomics approach applied to routinely acquired images for diagnosis has the great advantage of being cost-effective and noninvasive. Genomic tests, such as the Decipher Prostate Cancer test ® , have been used to stratify PCa patients on metastasis-free survival and cancer-specific mortality [31]. Even if costs have been reduced, such genomic tests have never been evaluated for LNI risk stratification.
Our workflow aimed to reduce common statistical biases through a conservative feature set reduction procedure and by relying on separate training and testing sets. However, some limitations persist. First, the use of training and testing sets does not replace a full external validation (which would also enhance the overall radiomics quality score of the study). Previous models were first internally developed using a logistic regression approach and secondly externally validated. To overcome the lack of external validation in our study, we tested our models in an internal set, well separated from the training set. This method showed a high stability of the model between the training and testing sets. We partly addressed that limitation by analyzing a subset of 18 patients with three different experts' delineations, with no resulting LNI classification changes. The ComBat harmonization method was performed in order to account for the MRI scans' heterogeneity. It resulted in a higher performing model, the Combat-Combined model. Neural networks are often criticized as being "black boxes" [32]. Our approach offers classification by normalized importance of the features, thus partly addressing this issue and providing models with partial clinical interpretability for the users. Finally, comparison with the latest Briganti 2018 nomogram was only offered in a subset of patients because of the targeted biopsies being nonmandatory in our study. Software-based fusion biopsies are considered the standard of care since 2018 [17] but our set predated the recommendation. However, the benefit of the latest Briganti 2018 over Briganti 2017 and Briganti 2012 appears to be small [33,34].
In patients treated with radiotherapy, pelvic irradiation remains controversial. While some studies suggest a possible benefit for low [35], and high-risk PCa patients [36,37], guidelines for pelvic lymph node irradiation remain weak because of the lack of level 1 data. The lack of consensus underpins the need of better biomarkers.

Conclusions
Radiomic features extracted from the preoperative MRI seem to provide added predictive value to usual models regarding LNI risk prediction in PCa, in an external testing set. Adoption of this model could avoid up to 80% of eLND with a risk of missing only 1.1% patients with LNI. External and prospective validations are currently under investigation.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/cancers13225672/s1: Protocol S1: Radiomics features extraction; Protocol S2: Features set reduction and model building; Protocol S3: Inter-reader variability; Figure S1: Flowchart of the patients' selection; Figure S2: Calibration plots for each available model, the New-Combined model and the Combat-Combined model, in the testing set; Figure S3: Comparison between the Briganti 2018 and the New-Clinical, the New-Combined and the Combat-Combined models ROC (Receiver Operative Characteristics) curves in the training set (3A) and in the testing set (3B); Figure S4: Decision curve analysis (A) and calibration plots for the Briganti 2018 (B), the New-Combined model (C) and the Combat-Combined models (D) in the training set; Figure S5: Decision curve analysis (A) and calibration plots for the Briganti 2018 (B), the New-Combined model (C) and the Combat-Combined models (D) in the testing set; Table S1: Summary of MRI scan acquisition parameters; Table S2: Selected radiomic features after feature selection; Table S3: Spearman's correlation coefficient (training set); Table S4: Analysis of each built model and respective mean accuracy based on the Boostrap analysis (n = 1000 replications) in the training set; Table S5: Analysis of each model's discrimination between patients with or without LNI in the training set; Table S6: Comparison of ROC curves in the training set; Table S7: Analysis of each model's discrimination between patients with or without LNI (testing set); Table S8: Relative Risk derived from each model (testing set); Table S9: Inter-reader variability assessment-segmentation; Table S10: Intra-class correlation for each radiomic feature in the New-Combined model, depending on the delineation variation; Table S11: Inter-reader variability assessment-LNI risk prediction; Table S12: Model performance according to the NCCN risk classification; Table S13: Radiomics Quality Score.