1. Introduction
The prognosis of patients with unresectable metastatic melanoma has improved significantly thanks to the development of immune checkpoint inhibitors and BRAF/MEK targeted therapies [1,2,3,4]. Antibodies that inhibit the programmed cell death protein 1 (PD-1) receptor (e.g., pembrolizumab and nivolumab) have become an approved standard treatment option [5,6,7,8,9]. Durable responses can be achieved, resulting in unprecedented survival rates at five years and beyond (e.g., in the phase III CheckMate 067 trial, a 6.5-year overall survival (OS) rate of 43% was observed [10]). Notably, durable clinical benefit and remission can be obtained even following elective discontinuation of immune-checkpoint blockade [2]. While treatment with an anti-PD-1 monoclonal antibody is sufficient in a smaller proportion of patients, most patients will need more active treatment options to obtain similar benefits. The combination of nivolumab with the cytotoxic T-lymphocyte-associated antigen 4 (CTLA-4) immune checkpoint blocking monoclonal antibody ipilimumab can improve treatment efficacy at the cost of increased toxicity [10]. Likewise, in patients with BRAF V600-mutant melanoma, combining the anti-PD-L1 monoclonal antibody atezolizumab with the BRAF/MEK inhibitors vemurafenib and cobimetinib (or spartalizumab with dabrafenib and trametinib) can provide a small incremental benefit, again at the cost of increased toxicity [11,12].
Reliable baseline identification of patients who derive the greatest benefit from anti-PD-1 monotherapy and identification of those patients in need of more active combination therapy would be helpful to guide personalised treatment decisions. However, predicting if a specific patient will respond to anti-PD-1 therapy remains challenging and current approaches are imperfect. Some of these approaches are time-consuming and costly and have therefore not been implemented in clinical routine [13].
Translational research performed by our group and others has indicated that several parameters at baseline can predict the outcome of patients with advanced melanoma when treated with pembrolizumab [14,15,16,17]. We recently identified total metabolic tumour volume (TMTV) as determined on fluorine-18-fluorodeoxyglucose ([18F]FDG) positron emission tomography (PET)/computed tomography (CT) as the strongest predictor for futility of treatment with pembrolizumab [16]. In addition, in univariate analysis, a higher number of metastatic sites, the presence of brain metastases, an increase in lactate dehydrogenase (LDH) or C-reactive protein (CRP) levels, lower albumin and absolute lymphocyte count, a higher neutrophil-to-lymphocyte ratio and increased circulating tumour deoxyribonucleic acid (DNA) levels are of interest. Tissue biomarkers could not be validated, but have been shown to be of interest by others [18,19,20]. None of these biomarkers are currently widely accepted for decision making in clinical routine. With respect to determining TMTV on PET images, the extraction of the image-derived variables is too labour intensive and time-consuming and further validation is required.
Automated medical image analysis, and lesion segmentation in particular, could automate the extraction of image biomarkers, lighten the workflow and enable their use in clinical practice. It may also allow the exploration of a wider range of image-derived parameters during clinical research studies. Recently, a large and growing body of literature has investigated lesion segmentation for PET/CT [21,22,23,24,25,26,27]. However, to the best of our knowledge, the effect of adopting such an automated approach in the pipeline for melanoma prognosis prediction remains to be investigated.
1.1. Related Work
A number of publications describe machine-learning methods for prognosis prediction of melanoma. Goussault et al. [28] compared a linear model, random forest, XGBoost and LightGBM to predict the response to immunotherapy and targeted therapy in stage IIIc or IV melanoma patients. A total of 935 patients from 10 different centres, taken from the French Clinical Database of Melanoma Patients (RIC-Mel), were included, of whom 80% were used for training. The response was classified as Class 1 in case of complete response, partial response or stable disease, or Class 2 in case of progressive disease. For immunotherapy, LightGBM was the best model with an accuracy of 66%, while for targeted therapy, this was the random forest with an accuracy of 65%. The most predictive parameters proved to be the following: stage (IIIc or IV), response to previous treatment lines, age, number of metastatic sites and time between first melanoma excision and metastatic relapse.
Flaus et al. [29] developed a method to predict a patient’s one-year OS and progression-free survival (PFS) based on the pre-treatment [18F]FDG PET. The population included 56 patients treated for metastatic melanoma with anti-PD-1 immunotherapy. Lesions were segmented semi-automatically using a threshold set at 40% of the maximum standardised uptake value (SUVmax). Per patient, the lesion with the highest FDG uptake was used to extract 45 radiomic features. After a number of feature selection steps, the five best-ranked ones were used to build survival prediction models. Data were balanced through the synthetic minority oversampling technique. A neural network, logistic regression, support vector machine, random forest and naive Bayes approach were compared by averaging the results of 50 random splits stratified on outcome. Each time, the training set comprised 75% of the data. For both OS and PFS, the random forest obtained superior results with an area under the curve (AUC) of 0.87, a sensitivity of 0.79 and a specificity of 0.95 for OS; and an AUC of 0.90, a sensitivity of 0.88 and a specificity of 0.91 for PFS.
Küstner et al. [30] developed a convolutional neural network (CNN) for outcome prediction and performed a range of survival analyses based on whole-body [18F]FDG PET/magnetic resonance imaging (MRI) and PET/CT acquired on the same day before treatment. Data from 37 patients who received checkpoint inhibitors and/or BRAF/MEK inhibitors were collected in a prospective study. A CNN was trained on automatically segmented lesions in the dataset to classify the patient as high-risk or low-risk. The latter was assigned in case of response to treatment and OS of more than 548 days. Inference was performed based on one manually selected target lesion per patient. Results were computed in a leave-one-out cross-validation. In addition to this classification network, Kaplan–Meier analyses were performed, where patients were split into two groups and the difference in OS was assessed through a Wilcoxon test. Treatment response was evaluated with a t-test for equal means and unequal variances. The CNN for risk classification achieved a specificity of 96%, sensitivity of 92%, positive predictive value of 92% and accuracy of 95%. This model, combining all modalities, achieved superior overall results. A model based on only PET/MRI proved less sensitive and less accurate, but more specific than a model considering only PET/CT. Longer OS was seen in patients with a TMTV under 50 mL, no metastases in the brain, bone, liver, spleen or pleura, less than five affected organ regions, no metastases, a longest lesion diameter of less than 37 mm, a peak standardised uptake value of less than 1.3, or a range of mean apparent diffusion coefficient of less than 600 mm
. However, none of these correlated significantly with the split of patients into high- or low-risk groups.
1.2. Goal and Contributions of This Study
Though treatment with checkpoint inhibitors has become the new standard for advanced melanoma, a considerable part of the population still progresses on such therapy. Identification of patients that will not respond to anti-PD-(L)1 treatment at an early stage is of utmost importance to offer them the highest possible chances of survival. However, it is currently impossible to predict prior to treatment initiation if it will be effective for a specific patient. The whole-body FDG PET/CT scans that are taken before and during treatment are valuable sources of information. However, the use of quantitative image-derived parameters in clinical routine is not feasible with the available tools. There is a need for a fast, reproducible workflow to analyse the images that can be applied in clinical practice. Automation can provide the basis for this and can additionally offer tools to extract more specific imaging features that can be investigated in clinical research.
The contribution of this TRIPOD 2a [31] study is threefold. First, in addition to known imaging features, like TMTV, more specific volumetric features were extracted and assessed for their predictive power. The potential of the features was determined through univariate Kaplan–Meier and Cox regression analyses. Second, promising features were exploited and combined with clinical parameters in multivariate Cox regressions to develop a fully automated prognosis prediction model. The proposed method starts from the whole-body PET/CT image in Digital Imaging and Communications in Medicine (DICOM) format, derives all needed parameters from the metadata and preprocesses the imaging data, completes a lesion segmentation, combines extracted features with clinical parameters that are given as input and performs a prognosis prediction for patients with advanced melanoma that will be treated with pembrolizumab. The final prognosis prediction model was validated on an internal dataset from the same institute. Third, a comparative analysis was performed between manually and automatically derived imaging parameters. The impact of using the latter was evaluated in each step of the survival analysis, feature selection and prognosis prediction. In case of similar results, the automation may enable the use of quantitative image-derived parameters for therapy selection in clinical routine. Moreover, it could provide new tools and features to explore in further clinical research.
2. Materials and Methods
This section describes the steps taken for the development and validation of the proposed prognosis prediction model. These are illustrated in Figure 1. In brief, features are derived from automatically segmented lesions and anatomical regions. Next, a feature selection is performed via univariate Kaplan–Meier and Cox regression analyses. Promising features are then evaluated in multivariate Cox regressions and the model obtaining the best results within the development set is verified on the validation set.
2.1. Data
This retrospective study was performed with data collected from patients treated at Universitair Ziekenhuis Brussel (UZ Brussel, Brussels, Belgium) for malignant melanoma between February 2014 and August 2018. Patients with histologically confirmed, non-resectable stage III or IV malignant melanoma according to the American Joint Committee on Cancer (AJCC) 8th edition were included. They received pembrolizumab immunotherapy every 3 weeks as 1st- or up to 5th-line treatment. Alternative prior-line therapy could consist of anti-CTLA-4 (ipilimumab), anti-PD-1 (nivolumab), a combination of anti-CTLA-4 and anti-PD-1 treatments, BRAF inhibitors (dabrafenib or encorafenib) and/or BRAF/MEK inhibitors (dabrafenib/trametinib or encorafenib/binimetinib).
A total of 100 patients were included in this study. For 69 of them, manual lesion delineations were available, created as described in [21]. This set was used for development while the remaining 31 patients were kept aside as an internal validation set.
For each patient, a whole-body [18F]FDG PET/CT scan was acquired at baseline, a maximum of 7 weeks before the start of pembrolizumab treatment, and at defined follow-up visits. The intervals of [18F]FDG PET/CT exams corresponded to roughly 3–4 immunotherapy administrations. For most patients, lesions were annotated manually by the physician as described in [21].
CRP and LDH values of a baseline blood test were recorded and categorised based on the upper limit of normal (ULN) with 5 groups for CRP (<ULN, 1–2 × ULN, 2–5 × ULN, 5–10 × ULN, >10 × ULN) and 3 groups for LDH (<ULN, 1–2 × ULN, >2 × ULN). Additionally, the presence of brain metastases at baseline was retained as a binary, categorical variable. To this end, patients suspected of having brain lesions underwent an MRI exam a few days before or after the acquisition of the [18F]FDG PET/CT.
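The ULN-based grouping can be illustrated with a minimal Python sketch. The function name `uln_category`, the integer group indices and the handling of values that fall exactly on a boundary (assigned to the lower group here) are assumptions made for illustration; the text does not specify them.

```python
def uln_category(value, uln, boundaries):
    """Return the index of the ULN-based group that `value` falls into.

    `boundaries` lists the group limits as multiples of the ULN, e.g.
    (1, 2, 5, 10) for the five CRP groups and (1, 2) for the three LDH groups.
    Assumption: a value exactly on a boundary goes to the lower group.
    """
    ratio = value / uln
    # Count how many boundaries the ratio strictly exceeds.
    return sum(ratio > b for b in boundaries)


CRP_BOUNDARIES = (1, 2, 5, 10)  # groups 0..4
LDH_BOUNDARIES = (1, 2)         # groups 0..2
```

For example, with a hypothetical ULN of 250 U/L, an LDH value of 600 U/L corresponds to 2.4 × ULN and would land in the highest LDH group (index 2).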
Treatment response was determined according to the immune-related response criteria (irRC) [34]. Progressive disease was defined as an increase in tumour volume of at least 25% perceived on CT. Survival was considered progression-free in case of stable disease, complete response or partial response, characterised by a reduction in tumour volume of at least 50%.
2.2. Automated Lesion Segmentation
The lesion segmentation model developed in a previous work [21] was adapted to overcome some of its limitations. In brief, the latter method consists of 2 steps. First, a PET threshold is automatically determined by identifying a region of interest of homogeneous intensity in the liver on both PET and CT. Application of the threshold segments all candidate regions with increased FDG uptake. In the second step, a deep learning classification is applied to separate the lesions from the healthy tissue with physiological uptake.
Here, the second step was replaced by a segmentation network using the MONAI [35] implementation of the nnU-Net [36] with 3 input channels: the binary mask from the PET thresholding, the PET and the CT image. This way, the lesion segmentation is not limited by the extent determined through thresholding, and 1 connected candidate region can be further divided into malignant and healthy tissue by the segmentation model. In accordance with the nnU-Net guidelines, a U-Net architecture [37] was trained for 1000 epochs with deep supervision, a combination of Dice and cross-entropy loss and an initial learning rate of 0.01, which was decayed following the schedule in [38]. An Adam optimizer was used instead of stochastic gradient descent as this had yielded better results in previous experiments [21].
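The decay formula itself is not reproduced in this text. As a minimal sketch, assuming the polynomial schedule that nnU-Net applies by default, the learning rate per epoch could be computed as below. The exponent of 0.9 and the function name `poly_lr` are assumptions, not values confirmed by this study.

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """Polynomial learning-rate decay (assumed schedule, assumed exponent)."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent


# Learning rate at the start, halfway through and at the end of training.
schedule = [poly_lr(e) for e in (0, 500, 1000)]
```

Under these assumptions the rate starts at 0.01 and decreases monotonically to 0 at epoch 1000.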
The PET intensities were converted to body-weight-corrected standardised uptake values (SUV) and clipped at 0 and 15 SUV while CT images were clipped at −1000 and 500 Hounsfield units (HU). The intensities of both modalities were scaled to the range [0, 1] and all images were resampled to an isotropic spacing of 4 mm. Per modality, corresponding patches of 128 voxels in three dimensions were extracted while ensuring a balance between the amount of positive and negative patches. Data augmentation was performed on the fly via a number of random transformations including affine transformation, Gaussian smoothing, intensity scaling, Gaussian noise and flips, each with a probability of 0.15.
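The intensity clipping and scaling can be sketched as follows. This is an illustration only: it assumes that scaling to [0, 1] is a min-max rescaling with the clipping bounds as limits, which the text does not state explicitly, and the function name `normalise_pet_ct` is hypothetical. Resampling and patch extraction are not shown.

```python
import numpy as np


def normalise_pet_ct(suv, hu):
    """Clip PET (SUV) and CT (HU) intensities and rescale both to [0, 1].

    Assumption: min-max scaling using the clipping bounds as limits.
    """
    suv = np.clip(np.asarray(suv, dtype=np.float32), 0.0, 15.0) / 15.0
    hu = (np.clip(np.asarray(hu, dtype=np.float32), -1000.0, 500.0) + 1000.0) / 1500.0
    return suv, hu


pet, ct = normalise_pet_ct([-1.0, 7.5, 20.0], [-2000.0, -250.0, 1000.0])
```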
The model was first pretrained on 700 patients of the publicly available data from the autoPET challenge [33]. Since the ground truth segmentations were constructed differently, the trained model would produce inferior segmentation results for the data used in this study. Therefore, the CNN, with its weights initialised from the pretraining, was retrained on the dataset from UZ Brussel in a four-fold cross-validation. Final segmentations for the test set were constructed by averaging the output of the four models.
2.3. Automated Organ Segmentation
The use of automated methods for medical image analysis enables the exploration of additional, more fine-grained features for survival analysis. As an initial feasibility study, different anatomical structures were segmented using the publicly available TotalSegmentator [32]. All bones were merged into one skeleton mask. For the gastrointestinal (GI) tract, the esophagus, stomach, duodenum, small bowel, colon and urinary bladder were merged. Furthermore, the lungs, liver, spleen, adrenal glands and pancreas were located. Survival analyses were performed for the tumour load per region to investigate if there are any critical levels that could indicate treatment with pembrolizumab to be futile.
2.4. Feature Extraction and Analysis
Within the dataset, total metabolic tumour volume, total lesion glycolysis (TLG), baseline LDH, CRP and the presence of brain metastases were available for survival analyses corresponding to clinical research. Furthermore, tumour load in terms of TMTV per anatomical area was assessed, using the organ segmentation described in Section 2.3. All analyses were performed using Python 3.7 and packages scikit-survival and lifelines.
For the development set, each lesion-based, image-derived parameter was extracted once from the manual segmentations and once from the automated segmentations to perform univariate Kaplan–Meier analyses. For each feature, a threshold was applied to divide the population into two groups. For each group, the Kaplan–Meier survival curve was drawn and the statistical difference between both was determined through a logrank test. Different thresholds were assessed depending on the value range of the feature. The step size was set to respectively 1, 5, 10 and 100 for ranges between 1–10, 10–100, 100–1000 and more than 1000. The lowest considered threshold was equal to the step size and the maximum one was equal to the highest value under the maximum feature value. The threshold with the lowest associated p-value was retained for the considered feature.
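The construction of the candidate thresholds can be sketched in a few lines of Python. The sketch assumes that the step size is chosen from the largest observed feature value and that range boundaries belong to the lower range; both points, as well as the function name `candidate_thresholds`, are assumptions made for illustration.

```python
def candidate_thresholds(max_value):
    """List the thresholds to test for a feature with maximum `max_value`.

    Assumption: the step size follows from the largest observed value.
    """
    if max_value <= 10:
        step = 1
    elif max_value <= 100:
        step = 5
    elif max_value <= 1000:
        step = 10
    else:
        step = 100
    thresholds = []
    t = step  # the lowest threshold equals the step size
    while t < max_value:  # stay strictly below the maximum feature value
        thresholds.append(t)
        t += step
    return thresholds
```

Each candidate would then split the population into two groups, and the threshold with the lowest logrank p-value would be retained; in lifelines, such a comparison could for instance be made with `lifelines.statistics.logrank_test`.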
The most significant of the obtained thresholds were compared across manual delineations and automated lesion segmentations to evaluate the effect of slightly different lesion tracings on the survival analysis. Hazard ratios were determined via a univariate Cox regression for overall and progression-free survival for each feature surpassing the critical threshold determined by the manual delineations and are reported with their 95% confidence interval (CI).
2.5. Prognosis Prediction
In order to be able to handle the right-censored data and get a prediction on a patient’s individual survival curves, the Cox proportional hazard regression was used to develop models for prognosis prediction through leave-one-out cross-validation. Hence, per experiment, each patient was used once as a test subject while training the regression on the remaining patients. The reported results are the mean values over all patients. The goal was to predict the OS and PFS chances at one year and at two years after the baseline PET/CT scan.
Volumetric features were selected based on their hazard ratio. Only features for which the lower bound of the 95% confidence interval of the hazard ratio exceeded 2 were considered in the regression modelling. The hazard ratios were calculated in a univariate, dichotomous analysis using a threshold optimised for this dataset and are therefore expected to be over-optimistic. Parameters with a hazard ratio below 2 were considered unlikely to hold information that would improve the multivariate regression models. Continuous variables were not categorised in order not to lose valuable information by thresholding.
Patients for whom TMTV rapidly drops to 0 mL are easily identified as responders through inspection of the first follow-up imaging. For the remaining patients, the prediction of response will be of interest to determine who will benefit from a continuation of the treatment. An additional Cox regression was tested to make a new survival prediction after the first follow-up PET/CT exam. For this, the rate of change in TMTV was added as a feature, and defined as
$$\Delta \mathrm{TMTV} = \frac{\mathrm{TMTV}_{1} - \mathrm{TMTV}_{0}}{\Delta t}$$
with $\mathrm{TMTV}_{0}$ and $\mathrm{TMTV}_{1}$ the total metabolic tumour volume derived from the baseline and first follow-up scan, respectively, and $\Delta t$ the number of days between those acquisitions.
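As a minimal sketch, the rate of change can be computed as follows, assuming it is the absolute difference in TMTV divided by the number of days between the scans (so that an increase gives a positive value and two lesion-free scans give zero). The function name `tmtv_rate_of_change` is hypothetical.

```python
def tmtv_rate_of_change(tmtv_baseline, tmtv_followup, days_between):
    """Change in TMTV per day (mL/day) between baseline and first follow-up.

    Assumption: absolute (not relative) change, positive when TMTV increases.
    """
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    return (tmtv_followup - tmtv_baseline) / days_between
```

An absolute rather than a relative change keeps the feature defined for patients whose baseline TMTV is 0 mL.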
2.6. Evaluation
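Lesion segmentations were evaluated using the Dice similarity coefficient (DSC) and absolute volume difference (AVD), defined as
$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{AVD} = \left| (TP + FP) - (TP + FN) \right| \cdot V_{\mathrm{voxel}} = \left| FP - FN \right| \cdot V_{\mathrm{voxel}}$$
with $TP$ the number of true positives, $FP$ the number of false positives and $FN$ the number of false negatives at voxel level, and $V_{\mathrm{voxel}}$ the volume of a single voxel.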
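Both metrics can be computed from voxel-level counts, as in the sketch below. It assumes binary masks on the same grid and a voxel volume of 0.064 mL (4 mm isotropic spacing); returning a DSC of 1 when both masks are empty is an assumption, and the function name `dice_and_avd` is hypothetical.

```python
import numpy as np


def dice_and_avd(pred, ref, voxel_volume_ml=0.064):
    """Return (DSC, AVD in mL) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.count_nonzero(pred & ref)    # voxels in both masks
    fp = np.count_nonzero(pred & ~ref)   # predicted only
    fn = np.count_nonzero(~pred & ref)   # reference only
    denom = 2 * tp + fp + fn
    # Assumption: two empty masks count as perfect agreement.
    dsc = 2 * tp / denom if denom > 0 else 1.0
    avd = abs(fp - fn) * voxel_volume_ml
    return dsc, avd
```

Note that the AVD only compares total volumes, so a prediction with equal numbers of false positive and false negative voxels has an AVD of 0 mL even though its DSC is below 1.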
Additionally, two metrics defined in the autoPET challenge [39] were assessed, namely the volume of false positive connected components in the predicted segmentation mask that do not overlap with tumours in the ground truth segmentation map (false positive volume) and the volume of connected components in the ground truth segmentation that do not overlap with the estimated segmentation mask (false negative volume).
The overall predictive performance of regression models for survival was quantified via the AUC of the receiver operating characteristic (ROC) curve with 95% CI, determined through bootstrapping the predictions with replacement in 1000 iterations. Thus, 1000 variations of the prediction set were created by sampling these predicted probabilities and allowing each one to be sampled multiple times. The CI was then constructed based on the sampling distribution estimated from the various prediction sets. AUCs for the regression model based on manual lesion delineation versus the ones based on automated segmentations were compared with the DeLong test [40,41]. In case of multiple comparisons with the same AUC value, the Bonferroni correction was applied. We also report sensitivity or true positive rate (TPR) and specificity or true negative rate (TNR) for a threshold favouring high specificity, defined as
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}$$
with TP the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives at patient level. The 95% CI around the sensitivity and specificity was calculated by bootstrapping the predictions with replacement in 1000 iterations.
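The bootstrapped confidence interval for the AUC can be sketched with the standard library alone. The sketch assumes a percentile interval and redraws resamples that contain a single class; neither choice is stated in the text, and the names `roc_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed interval type).

    Requires both classes to be present in `labels`.
    """
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUC undefined without both classes: redraw
            continue
        aucs.append(roc_auc(y, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_iter)]
    upper = aucs[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper
```

The same resampling loop could be reused for sensitivity and specificity by replacing `roc_auc` with the corresponding ratio at a fixed decision threshold.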
For evaluation, patients lost to follow-up were excluded from the test set as their survival status is unknown. The performance of the model was evaluated on the internal test set by retraining on all patients from the development set.
3. Results
3.1. Data
An overview of the characteristics of the development and validation subsets is provided in Table 1. The development set was made up of 69 patients (29 male, 40 female) with a median age of 60 years (26–93). The set of whole-body [18F]FDG PET/CT images included at least the baseline exam, acquired a median of 7 days (0–44) before the start of the treatment, and between zero and nine follow-ups with a median follow-up time of 576 days (40–1242). A total of 16 patients had a prior history of brain metastases. The median time between PET/CT exams was 10.6 weeks (5.86–26.0). After three scans, this was increased to a median of 14.6 weeks (7.43–71.0).
One year after their baseline exam, two patients were lost to follow-up and 42 patients were still alive, of whom, 22 were progression-free. After two years, 16 patients were lost to follow-up and 22 patients survived, of whom, eight were without progressive disease. For automated lesion segmentation, the set of patients was randomly split into four groups, stratified on the number of lesions and all exams belonging to the same patient were included in the same group.
We collected an additional validation set that was never used during model development. This set comprised 31 patients (14 male, 17 female) with a median age of 65 years old (34–82) and a median follow-up time of 612 days (1–1874). Seven patients suffered from brain metastases at baseline. After one year, 18 patients were still alive while eight were lost to follow-up. A total of 13 patients survived for at least two years after their baseline exam while 10 were lost to follow-up. Manual lesion delineations and PFS status were not available for this validation set.
3.2. Automated Lesion Segmentation
The median segmentation results are summarised in Table 2. On average, the median Dice coefficient per fold is 0.842 ± 0.343 with an absolute volume difference of 1.16 ± 239 mL. The connected components in the prediction that do not overlap with the manual delineations constitute a median volume of 0 ± 26.4 mL while the false negative ones make up 1.06 ± 35.6 mL.
3.3. Feature Analysis
The thresholds that led to the lowest p-value in a logrank test comparing Kaplan–Meier survival curves are summarised in Table 3. For each feature, the number of patients with a value higher than zero is listed as well. Univariate hazard ratios for OS and PFS determined via the manual delineations and automated lesion segmentations are tabulated in Table 4.
For OS, the hazard ratios indicated a potential predictive value in TMTV, TLG and the volume of liver, spleen and GI tract metastases at baseline. For PFS, this list was reduced to TMTV and the volume of liver metastases. Results are very similar when using manual delineations versus automatically derived lesion segmentations for TMTV, TLG and the volume of liver metastases. Hazard ratios of the volume of spleen and GI tract metastases show more variation, which can be attributed to the smaller number of patients with lesions in these areas and the more challenging nature of automated lesion segmentation in the abdomen due to the several regions of physiological uptake.
Irrespective of how the lesions were segmented, a TMTV of more than 90 mL has the highest hazard ratio for OS, followed by a tumour load in the liver greater than 30 mL and a total lesion glycolysis surpassing 400 SUV·mL. For PFS, the order of liver metastases and TMTV is reversed. The most significant threshold in liver tumour load for PFS deviates by 10 mL, but there is still a significant difference (p < 0.001) in survival curves when applying the threshold of 30 mL instead of 40 mL in the set of automated lesion segmentations. The same can be observed for TLG. The threshold with the lowest p-value differs by 100–200 units (for PFS and OS, respectively), but the p-value is still smaller than 0.001 when applying the other threshold to construct the survival curves. The hazard ratio for OS based on a baseline TMTV above 90 mL is 12.2 or 14.3 when using manual delineations or automated lesion segmentations, respectively. For PFS, these values drop to 3.85 or 4.23.
A volume of liver metastases surpassing 30 mL leads to a hazard ratio of 11.0 (manual) or 8.21 (automated) for OS and 4.70 (manual) or 5.12 (automated) for PFS. A baseline TLG of more than 400 SUV·mL is an unfavourable indicator for OS with a hazard ratio of 7.77 using either method.
The Kaplan–Meier curves for OS with a TMTV at baseline smaller or greater than 90 mL are drawn in Figure 2a. The plots using the manual delineations are drawn in solid lines while the ones for the automated segmentations are drawn in dotted lines. They overlap almost completely with only minor deviations at certain time points. There are two patients for whom the decision of a TMTV larger or smaller than 90 mL differs, slightly modifying the plots depending on the segmentation approach. At 122 days, a patient died for whom the manual segmentation encompasses a total volume smaller than 90 mL (75.7 mL) while for the automated segmentation, this is larger than 90 mL (306 mL). The segmentation network misclassified a relatively large volume in the intestines as lesion. At 283 days, a patient died for whom the manually derived TMTV is greater than 90 mL (319 mL) while the automatically extracted value is less than 90 mL (66.2 mL), because for the latter, a large lesion in the abdomen was missed. Still, the OS curves are very similar.
The Kaplan–Meier plots for PFS based on TMTV are shown in Figure 2b and the graphs for OS and PFS based on TLG and the volume of liver metastases are added in Appendix A. The decision of baseline TLG surpassing 400 SUV·mL is the same for all patients no matter the segmentation approach. Hence, their plots for OS overlap completely. The critical tumour load in the liver was found to be 30 mL, splitting the population into two groups with significantly different survival chances for both OS and PFS. The graphs only deviate slightly between segmentation methods.
In addition to the image-derived variables, CRP and LDH values and the presence of brain metastases were collected, which do not depend on the lesion segmentation method. For the blood values, the most significant threshold was determined in a similar way, examining the different categories with respect to the ULN. In line with previous research, the thresholds of 2 × ULN for LDH and 10 × ULN for CRP proved most significant [16].
A CRP level greater than 10 × ULN (p < 0.01) has a hazard ratio of 13.1 (95% CI: 1.53–112) for OS and 8.25 (95% CI: 0.92–73.8) for PFS.
For LDH, a value surpassing 2 × ULN (p < 0.001) has a hazard ratio of 13.9 (95% CI: 3.67–52.4) for OS and 7.85 (95% CI: 2.07–29.7) for PFS.
The logrank p-value between the groups with and without brain metastases at baseline is smaller than 0.05 with a hazard ratio of 2.33 (95% CI: 1.13–4.80) for OS while it is greater than 0.05 for PFS.
Considering this feature analysis, TMTV, TLG, the volume of liver, spleen and GI tract metastases are potential parameters to be used in a prognosis prediction model for OS. For PFS, the options are reduced to TMTV and the volume of liver metastases. For both prediction tasks, LDH, CRP and the presence of brain metastases are tested in the leave-one-out prognosis prediction as they hold information that can be complementary to the volumetric features.
3.4. Prognosis Prediction at Baseline
TLG is determined by multiplying TMTV with the PET intensity in standardised uptake values, so these are highly correlated (Pearson’s correlation of 0.95). Moreover, all patients with more than 30 mL of liver metastases at baseline that do not survive one or two years have a TMTV of more than 90 mL. Therefore, an initial prognosis prediction model was tested considering the remaining parameters.
Table 5 outlines the results through leave-one-out cross-validation for the Cox proportional hazard regressor considering TMTV, the LDH category and the presence of brain metastases.
Table 6 summarises the estimated baseline survival and multivariate hazard ratios. For OS, the proposed automated pipeline achieved an AUC of 0.78 for prediction at one year and 0.70 at two years. For the former, the model estimated a baseline survival probability of 0.82 with multivariate hazard ratios of 1.004 for TMTV, 2.59 for LDH and 2.52 for brain metastases. At two years, the estimated baseline survival probability decreased to 0.73 with respective hazard ratios of 1.004, 1.93 and 2.58.
In case of PFS, the one-year prediction achieved an AUC of 0.61 and the baseline survival probability was estimated at 0.74 with corresponding hazard ratios of 1.004 (TMTV baseline), 1.72 (LDH baseline) and 1.69 (brain metastases at baseline). The two-year prediction reached an AUC of 0.42 with an estimated baseline survival of 0.51 and multivariate hazard ratios of 1.005 (TMTV baseline), 1.41 (LDH baseline) and 1.69 (brain metastases at baseline). Similar predicted survival probabilities and hazard ratios were found when using the manually segmented lesions. The addition of the lesion volume in the spleen or GI tract or of the CRP category did not offer any performance improvement. For all predictions, the DeLong test indicated no statistical difference (p > 0.05) in AUC when using the manual delineation or automated lesion segmentation approach.
TLG is highly correlated with TMTV, but might contain more information as it gives an indication of the tumour activity. However, the AUC values are similar for each of the respective survival types and time points for the prognosis model considering TMTV, LDH and brain metastases and the one taking into account TLG, LDH and brain metastases. The DeLong p-values are larger than 0.05 except for one-year OS with the manual segmentation where p is exactly 0.05 between the two models.
The volume of liver metastases proved to be predictive for survival, but the subset of patients with a TMTV greater than 90 mL completely encompassed the group of patients for which this value was higher than 30 mL. Moreover, for all prediction tasks, a drop in performance was observed when replacing the TMTV with the volume of liver lesions, which was statistically significant for OS and for PFS at one year. After Bonferroni correction, the statistical significance level is set to 0.02 due to the comparison of the AUC of the automated method with the values for the manual approach, the model including TLG instead of TMTV and the regression with the volume of liver lesions instead of TMTV. For the latter, only OS at two years and PFS at one year still give significantly different results.
3.5. Prognosis Prediction after First Follow-Up
The median time to the first follow-up was 62 days (28–154). The rate of change was added to the list of features and the 15 patients without a follow-up exam were excluded. Results of the leave-one-out cross-validation are shown in
Table 7. One- and two-year OS were predicted with AUC scores of 0.68 and 0.66, respectively. Though this is a decrease compared to the model at baseline, the values cannot be compared directly as some patients were excluded here. When running the baseline model with the same patients excluded, the AUC values become 0.78 and 0.67. These values are still higher than for the follow-up model, but without a statistically significant difference (DeLong p-values > 0.05).
For PFS, the model achieved an AUC of 0.53 at one year and 0.48 at two years. For the baseline model applied to the same set of patients, the scores were 0.50 and 0.35, respectively (DeLong p-values > 0.05). When deriving the volumetric covariates from manually segmented lesions, similar values were obtained. The AUCs at one and two years were 0.78 and 0.65 for OS and 0.44 and 0.32 for PFS, respectively. These are not statistically different from the scores obtained when using the automated segmentation approach (all DeLong p-values > 0.05).
Three patients had an initial increase in TMTV (rate of change: 0.77 (0.55–1.78)) but survived at least as long as their total follow-up time (median follow-up time: 1140 days (805–1242)). A total of 10 patients had no lesions on either their baseline or their follow-up scan, leading to a rate of change of zero, and also survived at least as long as their total follow-up time (median follow-up time: 988 days (774–1183)). For these 13 patients, all OS predictions were correct, irrespective of the applied segmentation method.
3.6. Internal Validation Set
For the internal validation set, no PFS status was available. The multivariate Cox regression model trained on all 69 patients of the development set considering TMTV, LDH strata and brain lesions at baseline was applied to the internal test set. The ROC curves for each prediction task are drawn in
Figure 3. For one-year OS, the AUC reached 0.76 (95% CI: 0.52–0.95) with a sensitivity of 0.61 (95% CI: 0.38–0.83) and specificity of 0.81 (95% CI: 0.33–1.00). At two years, the AUC became 0.74 (95% CI: 0.46–0.93) with a sensitivity of 0.61 (95% CI: 0.32–0.87) and specificity of 0.74 (95% CI: 0.40–1.00).
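The wide intervals reflect the small validation set. This section does not state how the confidence intervals were obtained; as a hedged illustration only, the sketch below computes an AUC from its rank-based definition and a percentile bootstrap interval, which is one common choice. The scores are hypothetical placeholders, not study data, and all function names are our own.

```python
import random


def auc(scores_event, scores_no_event):
    """AUC as the probability that a case with the event receives a higher
    score than a case without it (ties count as one half)."""
    wins = 0.0
    for e in scores_event:
        for n in scores_no_event:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(scores_event) * len(scores_no_event))


def bootstrap_auc_ci(scores_event, scores_no_event, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling within each class."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        ev = [rng.choice(scores_event) for _ in scores_event]
        ne = [rng.choice(scores_no_event) for _ in scores_no_event]
        stats.append(auc(ev, ne))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical predicted risk scores (illustrative only).
event = [0.81, 0.64, 0.58, 0.77, 0.35, 0.69]
no_event = [0.22, 0.41, 0.60, 0.18, 0.30, 0.52, 0.27]

point = auc(event, no_event)
lo, hi = bootstrap_auc_ci(event, no_event)
print(f"AUC = {point:.2f} (95% CI: {lo:.2f}-{hi:.2f})")
```

With only a handful of cases per class, the bootstrap distribution is coarse and the interval is wide, which mirrors the behaviour seen in the internal validation set.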
To illustrate the use and output of the system, the proposed clinical decision support system (CDS) is depicted in
Figure 4.
4. Discussion
A predictive model based on fully automated analysis of whole-body [18F]FDG PET/CT was developed and validated on a separate dataset acquired at the same institute. Special attention was given to the impact of using automatically estimated lesion segmentations versus manual delineations.
Univariate analysis selected highly similar features whether automated lesion segmentation or manual delineation was used. The prognosis prediction, including TMTV, LDH strata and the presence of brain metastases at baseline, gave results with no statistically significant difference with respect to the manual lesion delineation. This important result indicates that, while the automated approach leads to slightly different lesion segmentations than those obtained through manual delineation, the deviations are small and do not affect the final prognosis prediction.
Lesions in the abdomen were found to be the most challenging, showing the most discrepancies between the automated and manual segmentations. This region is characterised by higher physiological uptake, hampering the performance of the segmentation model. Segmentation models specifically trained for the abdominal region, as proposed by Jemaa et al. [
42], may improve this behaviour.
Despite the good overall agreement of the automated prognostic model with respect to the manual approach, a clinical implementation of the decision support system could still allow for user interaction. A possible implementation is illustrated in
Figure 4. In the automated PET thresholding step, the clinician can alter the selected threshold to change the segmentation of PET-positive regions. Next, after the lesion segmentation step, the output shows both the lesions and areas that are classified as physiological tracer uptake. If needed, the user can select delineated components to be included or excluded from the tumour load. Such an implementation would still improve the speed and reproducibility of the process, while improving the interpretability of the results.
Considering the predictive value of image-derived features, TMTV was found to lead to the best overall performance. It should be noted that the baseline [18F]FDG PET/CT acquisition did not always coincide with the start of the treatment. As a result, there might be an underestimation in TMTV at the start of the treatment. Future research in which all PET/CT exams are performed close to the start of treatment may yield a further improvement in results.
TLG includes information on the tumour size and intensity in SUV. Therefore, theoretically, it could be preferred over TMTV as a feature. Moreover, slight oversegmentations will be less pronounced after multiplication with PET intensity. This can be observed in the perfect agreement of the Kaplan–Meier curves for TLG between manual and automated segmentation (
Figure A1a). That being said, no significant difference in performance could be observed in the Cox proportional hazard prediction when including TLG instead of TMTV, and we preferred to retain TMTV for simplicity.
Both total metabolic tumour load and the liver metabolic tumour load proved highly predictive for survival. In fact, all patients with a volume of liver metastases higher than 30 mL also had a TMTV greater than 90 mL, but the opposite was not true. If both features led to similar performance, this would greatly reduce the workload for manual lesion delineation, as only liver tumours would have to be delineated. However, for all prediction tasks, a drop in performance was observed, which was significant for several models. The volume of liver lesions therefore cannot simply replace TMTV.
Considering the performance of the automated prognostic models, the one-year OS prediction performs well with an AUC of 0.78. One possible use of the prognosis prediction models could be to identify patients who will not respond to the standard anti-PD-1 treatment, such that an alternative treatment plan can be considered. Prioritising sensitivity for non-responders comes at the cost of including some patients who would have responded and may be overtreated. However, such an approach could be favoured over the alternative, where patients with a poor prognosis are overlooked and do not receive the alternative therapy that might increase their chances. For automatically predicting overall survival at one year, a specificity of 0.88 was achieved with a sensitivity of 0.65, indicating that 88% of patients who would not survive one year with the standard pembrolizumab treatment can be identified correctly, at the cost of overtreating 35% of the patients who would have survived. It is important to note that overtreatment by administering ineffective yet potentially toxic anti-PD-1 monotherapy negatively impacts the overall outcome of the treated population. This highlights the importance of developing a more detailed cost–benefit analysis for deriving an optimised decision rule to determine how this prognostic model is best employed to support therapeutic decisions when starting anti-PD-1 monotherapy. In addition, patients predicted to fail anti-PD-1 monotherapy may benefit from treatment with BRAF-/MEK-inhibitors (if BRAF mutant) or from participating in clinical trials exploring novel combination therapies.
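The interpretation of these two figures depends on which outcome is treated as the positive class. The sketch below assumes that survival at one year is the positive class, which is the reading under which a specificity of 0.88 corresponds to correctly identified non-survivors. The confusion-matrix counts are hypothetical, chosen only to reproduce the quoted rates, and are not taken from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of positives correctly identified
    specificity = tn / (tn + fp)  # fraction of negatives correctly identified
    return sensitivity, specificity


# Hypothetical counts; positive class = alive at one year (assumption).
tp, fn = 13, 7   # survivors predicted to survive / predicted not to survive
tn, fp = 22, 3   # non-survivors correctly flagged / predicted to survive

sens, spec = sensitivity_specificity(tp, fn, tn, fp)

# Survivors wrongly flagged as non-survivors would be offered the
# alternative therapy although they would have survived on monotherapy.
flagged_survivor_fraction = 1 - sens

print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, "
      f"survivors flagged={flagged_survivor_fraction:.2f}")
```

Under this coding, the share of survivors who would be redirected to another treatment is the complement of the sensitivity, and the share of non-survivors who are missed is the complement of the specificity.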
Another possible use would be to identify the subset of patients with a baseline TMTV below 90 mL who do not survive the first year. Both manual and automated models classify 10 out of 13 correctly. With this in mind, the proposed model can be a useful addition to the available data to support the treatment decision made at baseline in clinical practice.
At two years, the AUC score decreases to 0.70. Prognosis at a later point in time becomes more difficult as the uncertainty on the predictors increases. In addition, the more time passes by, the more patients get lost to follow-up. Out of 69 patients in the development set, only two were lost at one year, while this number increased to 11 for two years.
Predicting PFS achieved considerably lower results, with AUCs of 0.61 at one year and 0.42 at two years. These predictions are generally harder due to the more specific nature of the task. The performance obtained here was not found to hold clinical value. The addition of other predictors that were not available in this dataset might improve the estimation.
Patients whose TMTV quickly goes to 0 mL are likely to respond well to the treatment and are easy to identify at an early stage. For other cases (stable or increasing TMTV), it is harder to predict whether continuation of the treatment would be beneficial. Therefore, prognosis models were tested for prediction after the first follow-up, where the rate of change in TMTV was added as a feature. The overall performance of the latter models in terms of AUC was not significantly different from that of the models predicting prognosis at baseline. Considering only patients with stable or increasing TMTV at first follow-up, the model including rate of change correctly classified all those who survived at least two years.
When applying the model with baseline features to the internal validation set, never used during model development, comparable model performance was obtained, with an AUC of 0.76 for the one-year prediction and 0.74 for the two-year prediction. The results suggest our model generalises well to unseen data. We should however note that in this case the confidence intervals of the ROC curves are large due to the limited number of subjects.
Automated image analysis offers further opportunities for deriving predictive markers of response. Several authors have evaluated more advanced FDG PET/CT image-derived features in prognostic models. Küstner et al. [
30] evaluated organ-specific tumour load, and found some to be predictive, though patient numbers were low. In our study, tumour loads were extracted for seven different anatomical regions. For most of them, only a small number of patients suffered from lesions in the considered area, not justifying inclusion in the prognostic model. For liver, spleen and GI tract, inclusion in the prognostic model did not improve performance. In this work, information regarding brain lesions could be included solely through a binary indicator of their presence at baseline. However, the segmentation and quantification of such metastases can enable the exploration of more specific predictors.
Despite the sexual dimorphism observed in melanoma [
43], sex was not found to be a reliable predictor for prognostic outcome. In the development dataset, the difference in overall survival between the male and female groups was negligible. Within the male patients, 52% died during follow-up, while this was 48% for female patients.
The addition of radiomic features was tested, but this was deemed unfeasible due to the small dataset. Similar to the work of Flaus et al. [
29], radiomic features were extracted for the lesion with the highest FDG uptake per patient, excluding patients with no lesions greater than 10 mL [
44] detected on the PET/CT scan. However, this subset included only 25 patients, of whom eight survived more than one year (five progression-free) and six were still alive at two years (four progression-free). To be able to draw reliable conclusions, preference was given to omitting radiomic features and performing such experiments with a larger dataset, including patients with small or no baseline lesions.
The predominant limitation of this study is the relatively small size of the datasets, both for development and validation. As a result, several strata of features were under-represented, not justifying further analysis. This included CRP and several organ-specific tumour loads. Moreover, the development set showed an overrepresentation of female patients (58%). Future work should include a more extensive validation, using an external dataset.
Furthermore, we did not have access to several parameters that have been reported to affect survival chances [
16,
18,
19,
20]. Tumour-intrinsic characteristics and the immunological status of the patient are important factors influencing patient outcome. Features such as albumin, absolute lymphocyte count, neutrophil-to-lymphocyte ratio, circulating tumour DNA and protein expression on tumour cells are of interest for investigation as covariates that may allow even more precise prognostic prediction.