Prospective Evaluation of Different Methods for Volumetric Analysis on [18F]FDG PET/CT in Pediatric Hodgkin Lymphoma

Rationale: Therapy response evaluation by 18F-fluorodeoxyglucose PET/CT (FDG PET) has become a powerful tool for the discrimination of responders from non-responders in pediatric Hodgkin lymphoma (HL). Recently, volumetric analyses have been regarded as a valuable tool for disease prognostication and biological characterization in cancer. Given the multitude of methods available for volumetric analysis in HL, the AIEOP Hodgkin Lymphoma Study Group has designed a prospective analysis of the Italian cohort enrolled in the EuroNet-PHL-C2 trial. Methods: Primarily, the study aimed to compare the different segmentation techniques used for volumetric assessment in HL patients at baseline (PET1) and during therapy: early (PET2) and late assessment (PET3). Overall, 50 patients and 150 scans were investigated for the current analysis. Dedicated software was used to semi-automatically delineate contours of the lesions by using different threshold methods. More specifically, four methods were applied: (1) fixed 41% threshold of the maximum standardized uptake value (SUVmax) within the respective lymphoma site (V41%); (2) fixed absolute SUV threshold of 2.5 (V2.5); (3) SUVmax(lesion)/SUVmean liver > 1.5 (Vliver); (4) adaptive method (AM). All parameters obtained from the different methods were analyzed with respect to response. Results: Among the different methods investigated, the strongest correlation was observed between AM and Vliver (rho > 0.9; p < 0.001 for SUVmean, MTV and TLG at all scan time points), along with V2.5 and AM or Vliver (rho 0.98, p < 0.001 for TLG at baseline; rho > 0.9; p < 0.001 for SUVmean, MTV and TLG at PET2 and PET3, respectively). To determine the best segmentation method, we applied logistic regression and correlated different results with Deauville scores at late evaluation. Logistic regression demonstrated that MTV (metabolic tumor volume) and TLG (total lesion glycolysis) computed according to V2.5 and Vliver significantly correlated with response to treatment (p = 0.01 and 0.04 for MTV and 0.03 and 0.04 for TLG, respectively). SUVmean was also significantly correlated with response, both as an absolute value and as a variation. Conclusions: The best correlation for volumetric analysis was documented for AM and Vliver, followed by V2.5. The volumetric analyses obtained from V2.5 and Vliver significantly correlated with response to therapy, proving to be the preferred thresholds in our pediatric HL cohort.


Introduction
Hodgkin lymphoma (HL) is one of the most frequent yet curable hematological malignancies in children [1]. It is therefore of paramount importance to optimize therapeutic effect and minimize long-term risks while maintaining high cure rates. To this end, therapy needs to be tailored to each patient based on pre-treatment prognostic factors and intermediate assessments of disease response. Over the past few decades, 18F-fluorodeoxyglucose PET/CT (FDG PET) has played an increasingly central role in the staging, management and follow-up of various pediatric malignancies [2][3][4][5][6][7]. In pediatric HL, therapy response evaluation by means of [18F]FDG PET has become a powerful tool for the discrimination of adequate responders from inadequate responders [8][9][10]. According to international standards, the Deauville five-point scale is considered the visual method of choice for discriminating responses in patients with lymphoma [11][12][13][14]. In order to extend the Deauville score to a continuous scale and limit visual misinterpretation, the German pediatric HL group proposed the use of qPET in 2014 [15][16][17]. This quantitative method was applied in the current EuroNet-PHL-C2 clinical study, in which therapy was adapted based on the FDG PET result after two cycles of chemotherapy [18,19].
Nevertheless, most of the studies were conducted in adolescent patients, and although PET appears to have similar diagnostic performance in the pediatric population, studies in the pediatric cohort are limited. Therefore, the criteria for the definition of adequate or inadequate response in children with HL are still being discussed [16].
Given the multitude of methods available for volumetric analysis in HL, the AIEOP Hodgkin Lymphoma Study Group has designed a prospective analysis of the Italian cohort enrolled in the EuroNet-PHL-C2 trial. The aim of this study is to investigate different segmentation techniques used for volumetric assessment in HL patients at baseline and during the course of therapy and to compare the parameters obtained by the different methods, relating them to the response.

Study Population
The population analyzed in the current study was obtained from the Italian cohort of patients treated according to the EuroNet-PHL-C2 trial and enrolled in the prospective parallel study promoted by the AIEOP Hodgkin Lymphoma Study Group following Amendment Nr. 04, dated July 31, 2017 [34]. The study was approved by AIFA (Agenzia Italiana del Farmaco) on March 9, 2018. Written informed consent was obtained from all subjects or their legal representatives before inclusion. In accordance with the EuroNet-PHL-C2 trial, the study population comprised pediatric patients with histologically confirmed classical HL at intermediate and advanced treatment levels, evaluated with FDG PET at baseline (PET1), after two cycles of induction therapy (PET2) and, in case of PET2 positivity, after the end of chemotherapy (PET3) [34]. Primarily, the present study aimed to compare the different segmentation techniques used for volumetric assessment in HL patients at baseline and during the course of therapy. Therefore, an overall population of 50 patients (150 scans) was investigated across 17 Italian AIEOP Centers.
Principal characteristics of the study population are shown in Table 1.

Volumetric Assessment
The imaging protocol for FDG PET scans was compliant with the requirements of the EuroNet-PHL-C2 trial, which were in accordance with the EANM guidelines for patient preparation, data acquisition and image reconstruction [35], to avoid discrepancies between the different PET tomographs used in the various AIEOP Centers involved in the study. For the semi-quantitative and volumetric analyses of the FDG PET scans, the Local Image Features Extraction (LIFEx) freeware (http://www.lifexsoft.org) was used to semi-automatically delineate contours of the lesions by using different threshold methods (Figure 1). More specifically, we utilized four methods for volumetric assessment of pediatric HL [36]: (1) fixed 41% threshold of SUVmax within the respective lymphoma site (V41%), (2) fixed absolute SUV threshold of 2.5 (V2.5), (3) SUVmax(lesion)/SUVmean liver > 1.5 (Vliver) and (4) adaptive method (AM) [35]. The AM threshold was computed as [0.15 × I(mean)] + I(background), where I(mean) is the mean intensity of all pixels enclosed by the 70% Imax isocontour within the tumor and I(background) is defined as the SUVmean of the liver [36]. The semi-quantitative parameters retrieved from the different analyses comprised metabolic tumor volume (MTV) and total lesion glycolysis (TLG = MTV × SUVmean) at baseline and during the course of therapy, as well as SUVmax, determined as the pixel with the highest uptake value; SUVmean, as the mean value of uptake; and SUVpeak, corresponding to the average value of uptake in a VOI (volume of interest) of 1 mL that surrounds the pixel with the highest activity.
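The four cut-offs and the derived parameters can be illustrated with a minimal Python sketch. It is not the LIFEx implementation: the function names, the flat list of voxel SUVs, the voxel volume and the numerical values are hypothetical, spatial connectivity of the contour is ignored, and Vliver is interpreted here as a voxel cut-off of 1.5 × liver SUVmean, which is an assumption about how the ratio criterion is applied.

```python
from statistics import fmean


def adaptive_threshold(lesion_suv, liver_suvmean):
    """AM cut-off: 0.15 * I(mean) + I(background)."""
    suv_max = max(lesion_suv)
    # I(mean): mean of the voxels enclosed by the 70% Imax isocontour,
    # approximated here as all voxels >= 70% of SUVmax (assumption).
    core = [v for v in lesion_suv if v >= 0.70 * suv_max]
    # I(background) is the SUVmean of the liver.
    return 0.15 * fmean(core) + liver_suvmean


def thresholds(lesion_suv, liver_suvmean):
    """Return the SUV cut-off of each of the four methods."""
    return {
        "V41%": 0.41 * max(lesion_suv),
        "V2.5": 2.5,
        # Assumption: the ratio criterion is applied as a voxel cut-off.
        "Vliver": 1.5 * liver_suvmean,
        "AM": adaptive_threshold(lesion_suv, liver_suvmean),
    }


def volumetrics(lesion_suv, threshold, voxel_volume_ml):
    """MTV (mL), SUVmean and TLG of the voxels at or above the cut-off."""
    kept = [v for v in lesion_suv if v >= threshold]
    if not kept:
        return {"MTV": 0.0, "SUVmean": 0.0, "TLG": 0.0}
    mtv = len(kept) * voxel_volume_ml
    suv_mean = fmean(kept)
    return {"MTV": mtv, "SUVmean": suv_mean, "TLG": mtv * suv_mean}


# Hypothetical voxel SUVs of one lesion and a hypothetical liver SUVmean.
lesion = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0, 10.0]
cuts = thresholds(lesion, liver_suvmean=2.0)
results = {
    name: volumetrics(lesion, cut, voxel_volume_ml=0.5)
    for name, cut in cuts.items()
}
```

With these illustrative values, the relative 41% cut-off is the highest of the four and retains the fewest voxels, while the lower absolute cut-offs retain more. SUVpeak is omitted because it requires the spatial neighborhood of the hottest voxel.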

Response Classification
According to the EuroNet-PHL-C2 trial, patients are classified as having an Adequate Response (AR) or an Inadequate Response (IR). In the current study, we subdivided treatment response based on the Deauville five-point scale (DS) into the following: DS 1 = no uptake; DS 2 = uptake ≤ mediastinum; DS 3 = uptake > mediastinum but ≤ liver; DS 4 = uptake moderately more than liver uptake, at any site; DS 5 = markedly increased uptake at any site and new sites of disease [37]. The percentage variation of all semi-quantitative parameters (i.e., SUVmax, SUVmean, SUVpeak, MTV and TLG) from baseline (PET1) to early (PET2) and late assessment (PET3) was computed. Since the aim of the present study was not to provide treatment outcomes from the EuroNet-PHL-C2 trial but rather to select the best threshold method for volumetric assessment, we did not display the responses obtained from the Italian cohort and used the dichotomization according to DS only to validate the robustness of each method under investigation.
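The percentage variation can be sketched as follows. The formula itself is not stated in the text, so this assumes the usual relative change with respect to baseline; the function name and the parameter values are hypothetical.

```python
def percent_variation(baseline, follow_up):
    """Relative change from baseline, in percent (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (follow_up - baseline) / baseline


# Hypothetical parameters of one patient at PET1 and PET2.
pet1 = {"SUVmax": 12.0, "MTV": 120.0, "TLG": 600.0}
pet2 = {"SUVmax": 3.0, "MTV": 30.0, "TLG": 60.0}

# Variation of every parameter from baseline to early assessment.
delta_pet2 = {name: percent_variation(pet1[name], pet2[name]) for name in pet1}
```

A negative value indicates a decrease from baseline; the same call with PET3 values gives the variation at late assessment.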

Statistical Analysis
Descriptive statistics comprised conventional metrics (mean, median, range). The different threshold methods used to outline lymphoma lesions were compared by the Pearson correlation coefficient, linear regression, Bland-Altman analysis and logistic regression. Linear regression was applied to determine the relationship between response to treatment at early (PET2) and late evaluation (PET3), defined according to Deauville score (classified into DS 2, DS 3, DS 4 and DS 5), and all other variables classified with the different volumetric thresholds (i.e., V41%, V2.5, Vliver and AM). Statistical significance was set at p < 0.05.
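For readers less familiar with the agreement statistics, the following is a minimal sketch of a Pearson correlation and of Bland-Altman limits of agreement between two segmentation methods. It assumes the conventional limits of mean difference ± 1.96 SD; the function names and the paired MTV values are hypothetical and do not come from the study data.

```python
from math import sqrt
from statistics import fmean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two paired samples."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def bland_altman(method_a, method_b):
    """Bias and 95% limits of agreement (bias +/- 1.96 SD of differences)."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = fmean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical baseline MTV (mL) of four patients under two methods.
mtv_a = [10.0, 20.0, 30.0, 40.0]
mtv_b = [12.0, 18.0, 33.0, 37.0]

r = pearson_r(mtv_a, mtv_b)
bias, lower, upper = bland_altman(mtv_a, mtv_b)
```

A high correlation does not by itself imply agreement: two methods can correlate strongly while their differences span wide limits, which is why both analyses are reported together.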

Comparison of the Parameters according to the Different Methods
Linear regression allowed us to compare the semi-quantitative and volumetric parameters for the different threshold methods at the different time points as absolute values (Table 2) as well as variations from PET1 to PET2 and PET3 (Table 3). The scatter plots for the linear regression analyses and the Bland-Altman plots for baseline SUVmean, MTV and TLG values related to the different segmentation techniques used are also displayed (Figures 2-4). Among the different methods investigated, the strongest correlation was observed between AM and Vliver (rho > 0.9; p < 0.001 for SUVmean, MTV and TLG at all scan time points), along with V2.5 and AM or Vliver (rho 0.98, p < 0.001 for TLG at baseline; rho > 0.9; p < 0.001 for SUVmean, MTV and TLG at PET2 and PET3). The standard deviations (SD) according to Bland-Altman plots for baseline SUVmean favored the AM and Vliver methods (−1.6 and 1.4), followed by V2.5 and AM or Vliver (Figure 2). The lower the variation in SD, the higher the comparability of the methods; here, the calculated limits of agreement were substantial, especially for the V41% method. The widest range in SD was observed for V41%, with respect to all other methods used.
Similar results were obtained for baseline MTV and TLG (Figures 3 and 4) as well as for all semi-quantitative and volumetric analyses considered at the different time points (Supplementary Tables S1 and S2).
To determine the best segmentation method, we applied logistic regression and correlated different results with various Deauville scores obtained at late evaluation (PET3). The results are illustrated as absolute values (Table 4) as well as variations from baseline (Table 5). Logistic regression demonstrated that MTV and TLG computation according to V2.5 and Vliver significantly correlated with PET3 response (p = 0.01 and 0.04 for MTV and 0.03 and 0.04 for TLG, respectively), especially when used as absolute values for both DS 2 and DS 3 responses.
SUVmean absolute values were also associated with the responses for Vliver and V41% in the case of DS 2, while as variations, ∆SUVmean correlated to DS 3 for all methods, with the most statistically significant correlation for DS 2 in the case of the V2.5 threshold method (Table 5).

Discussion
In recent years, several strategies for semi-automatic tumor contouring have been proposed, including fixed (or relative), adaptive or gradient-based (growth of the adaptive region) thresholds. A common unified segmentation method is difficult to develop but necessary in order to improve interinstitutional comparison, find the best reproducibility between semi-quantitative and volumetric parameters and ensure optimal patient management within medical centers [38]. However, a consensus on the choice of thresholds has not been reached and the optimal method for tumor volume segmentation is still debated.
Therefore, different methodologies have been studied. Song et al. [21] enrolled patients with early-stage HL (I-II) and performed their analyses using a threshold of 2.5 as the optimal cut-off, demonstrating a correlation between disease prognosis and MTV status. Kanoun et al. [39] investigated the influence of software tools and the total metabolic tumor volume (TMTV) calculation method on prognostic stratification of baseline [18F]FDG PET in newly diagnosed HL. They used 2.5, 41%, 125% liver SUVmax and 140% liver SUVmax, finding no significant difference between the respective ROC curves, with the optimal cut-offs used being predictive of treatment failure. Martín-Saladich et al. [40] showed an optimal reproducibility in MTV computation for the SUV > 2.5 threshold using contouring methodology or software tools. Although the authors found an overestimation of MTV when using a threshold of 2.5, it seemed preferable to the underestimation obtained with the cut-offs of 41% and 50%, respectively. Parvez et al. [41] reported that the use of a fixed threshold of SUV3 or SUV6 was the best predictor of response to first-line therapy and overall survival.
Eude et al. [42] compared the reproducibility of MTV measurement as well as the thresholds obtained for each method and their prognostic values in this regard. Eight methods were compared: three absolute thresholds (SUV ≥ 2.5; SUV ≥ liver SUVmax; SUV ≥ PERCIST SUV), one percentage SUV threshold method (SUV ≥ 41% SUVmax) and four adaptive methods (Daisne, Nestle, Fitting, Black). There was a strong correlation between MTV and patient prognosis regardless of the segmentation method used (p = 0.001 for PFS and OS). The largest inter-observer cut-off variability was observed in the 41% SUVmax method, which resulted in more inter-observer disagreements. MTV measurements based on absolute SUV criteria were found to be significantly more reproducible than those based on 41% SUVmax criteria. Recently, Driessen et al. [43] analyzed 105 PET/CT scans from patients with newly diagnosed and relapsed/refractory cHL with six segmentation methods: two fixed thresholds (SUV4.0 and SUV2.5), two relative methods (41% of SUVmax and a contrast-corrected 50% of SUVpeak) and two combination "majority vote" methods (MV2 and MV3). They observed that SUV4.0 tended to underestimate MTV and often missed small lesions, whereas SUV2.5 most frequently included all lesions and generally overestimated MTV. In contrast, few lesions were missed with use of relative or combined thresholds, but these segmentation methods required extensive manual adaptation and overestimated MTV in most cases. There were no significant differences in prognostic performance for all features among the methods.
In our study, we analyzed four segmentation methods: fixed 41% threshold of the SUVmax within the respective lymphoma site (V41%), fixed absolute SUV threshold of 2.5 (V2.5), SUVmax(lesion)/SUVmean liver > 1.5 (Vliver) and adaptive method (AM). With the exception of some outliers, Bland-Altman plots revealed no systematic errors between the different measurement approaches. The calculated limits of agreement were substantial, especially for the V41% compared to the other methods. Consequently, the best correlation for volumetric analysis was documented for AM and Vliver, followed by V2.5. Moreover, the volumetric analysis obtained from V2.5 and Vliver significantly correlated to response to therapy, proving to be preferred thresholds in our pediatric HL cohort.
The results of our study are in line with the cited studies currently in the literature: even though there were no significant differences between the main segmentation methods studied, the absolute SUV method appeared statistically to be the most robust. In fact, in many studies [21,39,40,42,43], MTV for the SUV > 2.5 threshold has shown optimal reproducibility and a good correlation with prognosis.
One of the main limitations of our study is the use of images obtained from different scanners and subjected to different reconstruction algorithms. The extracted parameters could be affected by this bias, although harmonization based on EANM Research Ltd. (EARL, Vienna, Austria) accreditation is recommended for experimentation [35]. However, this is a preliminary and parallel study to the prospective study sponsored by the AIEOP Hodgkin Lymphoma Study Group, aimed at investigating the role of volumetric and texture (radiomic) characteristics in better fulfilling the need for predictive and prognostic factors in pediatric HL.

Conclusions
Volumetric analyses with [18F]FDG PET/CT are known to help predict the outcome in adult patients with lymphoma. This suggests a similar implication for the pediatric population with HL. To better define the optimal method for tumor volume segmentation, we performed a direct comparison of four computations based on either fixed thresholds (i.e., V41%, V2.5, Vliver) or adaptive methods (AM). Based on our findings, the best correlation for volumetric analysis was documented for AM and Vliver, followed by V2.5. The volumetric analyses obtained from V2.5 and Vliver significantly correlated to response to therapy, proving to be the preferred thresholds for volumetric analyses in our pediatric HL cohort.