Reproducibility of Baseline Tumour Metabolic Volume Measurements in Diffuse Large B-Cell Lymphoma: Is There a Superior Method?

The metabolic tumour volume (MTV) is an independent prognostic indicator in diffuse large B-cell lymphoma (DLBCL). However, its measurement is not standardised and is subject to wide variations depending on the method used. This study aimed to compare the reproducibility of MTV measurement as well as the thresholds obtained for each method and their prognostic values. The baseline MTV was measured in 239 consecutive patients treated at Henri Becquerel Centre by two blinded evaluators. Eight methods were compared: 3 absolute (SUV (standardised uptake value) ≥ 2.5; SUV ≥ liver SUVmax; SUV ≥ PERCIST SUV), 1 percentage SUV threshold method (SUV ≥ 41% SUVmax) and 4 adaptive methods (Daisne, Nestle, Fitting, Black). The intraclass correlation coefficients were excellent, from 0.91 to 0.96, for the absolute SUV methods, Black and Nestle methods, and good for 41% SUVmax, Fitting and Daisne methods (0.82 to 0.88), with a significantly lower variability with absolute methods compared to 41% SUVmax (p < 0.04). Thresholds were found to be specific to each segmentation method and ranged from 295 to 552 cm3. There was a strong correlation between the MTV and patient prognosis regardless of the segmentation method used (p = 0.001 for PFS and OS). The largest inter-observer cut-off variability was observed in the 41% SUVmax method, which resulted in more inter-observer disagreements in the classification of patients between high and low MTV groups. MTV measurements based on absolute SUV criteria were found to be significantly more reproducible than those based on 41% SUVmax criteria. The threshold was specific for each of the eight segmentation methods, but all predicted prognosis.


Descriptive Statistics for the MTV Values
For each segmentation method, the mean and its standard deviation, median, first quartile and third quartile were measured for both evaluators (Table 2). Bland-Altman plots showed that the largest mean difference between the first and second evaluator in metabolic volumes was with the 41% SUVmax method (71.7 cm3) and the Fitting method (76.5 cm3) in comparison to <7 cm3 for absolute SUV methods and <43 cm3 for the other adaptive methods (Figure 1).
Figure 1. Differences: difference in the metabolic tumour volume of the two evaluators in cm3. Dotted line: limits of agreement (mean difference ± 1.96 SD); LL: lower limit 95% confidence interval; UL: upper limit 95% confidence interval.
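As an illustration of how the Bland-Altman quantities reported above are obtained, the following minimal sketch computes the mean inter-evaluator difference and the limits of agreement (mean difference ± 1.96 SD). The function name and the MTV values are hypothetical and are not taken from the study data.

```python
from statistics import mean, stdev


def bland_altman(reader1, reader2):
    """Return (mean difference, lower limit, upper limit) for two readers.

    The limits of agreement are mean difference +/- 1.96 * SD of the
    paired differences, as in the figure legend.
    """
    if len(reader1) != len(reader2) or len(reader1) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    diffs = [a - b for a, b in zip(reader1, reader2)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical MTV values in cm3 for five patients (illustration only)
mtv_reader1 = [120.0, 480.0, 950.0, 60.0, 310.0]
mtv_reader2 = [110.0, 500.0, 900.0, 65.0, 300.0]

bias, lower, upper = bland_altman(mtv_reader1, mtv_reader2)
print(f"mean difference = {bias:.1f} cm3")
print(f"limits of agreement = [{lower:.1f}, {upper:.1f}] cm3")
```

Patients whose paired difference falls outside the two limits are the ones that would appear beyond the dotted lines of such a plot.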

Prognostic Value and Survival Analysis
The optimal cut-offs found with the ROC analysis were 552, 295, 487, 486, 340, 396, 352 and 460 cm3 for SUV ≥ 2.5, 41% SUVmax, Liver SUVmax, PERCIST, Daisne, Nestle, Fitting and Black, respectively (Table 5). The respective area under the curve for PFS varied from 0.638 to 0.672, suggesting similar performances in terms of sensitivity and specificity among the methods (Figure 2). The largest difference between the evaluator-specific cut-offs (ΔCut-off) was observed for the 41% SUVmax method (Cut-off evaluator 1 = 324 cm3 vs. Cut-off evaluator 2 = 252 cm3). The difference was ten times smaller for the 2.5 SUVmax method (548 cm3 vs. 555 cm3) (Table 5). This inter-reader variability in the cut-off results in a difference in the classification of patients into the low or high MTV groups ranging from 1 to 8 patients depending on the segmentation method used. The 41% method classified 8 patients differently according to evaluator 1 or 2, whereas the 2.5 SUVmax method or Liver SUVmax classified only one patient differently (Table 5 and Figure 5).
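The link between evaluator-specific cut-offs and discordant classifications can be sketched as follows. This is only an illustration under stated assumptions: the optimal cut-off is taken as the value maximising the Youden index (sensitivity + specificity − 1), which is one common criterion and is not confirmed as the one used here; censoring is ignored; and all volumes and outcomes are hypothetical.

```python
def youden_cutoff(volumes, events):
    """Return (cut-off, Youden index) maximising sensitivity + specificity - 1.

    A patient is classified 'high MTV' when volume >= cut-off.
    events[i] is 1 if the patient progressed or died, else 0.
    """
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")
    best_cut, best_j = None, -1.0
    for cut in sorted(set(volumes)):
        tp = sum(1 for v, e in zip(volumes, events) if v >= cut and e)
        fp = sum(1 for v, e in zip(volumes, events) if v >= cut and not e)
        j = tp / n_pos - fp / n_neg  # sensitivity - (1 - specificity)
        if j > best_j:  # ties keep the smallest cut-off
            best_cut, best_j = cut, j
    return best_cut, best_j


def count_discordant(vol1, cut1, vol2, cut2):
    """Count patients placed in different MTV groups by the two evaluators."""
    return sum((a >= cut1) != (b >= cut2) for a, b in zip(vol1, vol2))


# Hypothetical volumes (cm3) from two evaluators and hypothetical outcomes
events = [0, 0, 0, 1, 1, 1]
mtv_eval1 = [100, 200, 300, 400, 500, 600]
mtv_eval2 = [110, 190, 420, 380, 520, 590]

cut1, j1 = youden_cutoff(mtv_eval1, events)
cut2, j2 = youden_cutoff(mtv_eval2, events)
n_discordant = count_discordant(mtv_eval1, cut1, mtv_eval2, cut2)
print(cut1, cut2, n_discordant)
```

Each evaluator's volumes are dichotomised with that evaluator's own cut-off, so a patient counts as discordant when the two evaluators place him or her in different MTV groups.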
Log-rank tests and multivariate analyses were performed using Cox models, including only the 41% SUVmax method for the MTV. The IPI score, the type of chemotherapy and MTV were significantly correlated with PFS and OS (Table 6). These results are in agreement with our previous paper [17].

Discussion
The powerful independent prognostic value of the MTV is accepted regardless of the segmentation method used [14][15][16][17]. This was also true in the current study for all methods, with a continuous increase in risk with MTV for PFS and OS in the Cox model.
There was high interobserver agreement for measuring the MTV in all methods, with significantly better reproducibility for the absolute and Daisne methods versus the 41% method.
In our study, Kendall's Tau was 0.93 for the PERCIST method, 0.92 for SUVmax > 2.5 method and 0.85 for the 41% method compared to 0.96, 0.98, 0.90 respectively in the study by Ilyas et al. [24]. Inter-observer reproducibility was significantly higher for absolute methods compared to the 41% SUVmax method with the same trend in both studies. Kendall's Tau appeared to be higher on average in the study by Ilyas et al. This is probably due to software differences that allowed for a semi-automatic approach as opposed to a completely manual approach in our case.
Currently, there is no accepted gold standard for assessing the MTV in FDG PET [18]. We will focus more specifically on the two methods most documented in the literature: the 2.5 and 41% methods. In our study, the poorer reproducibility of the 41% method compared to the 2.5 method resulted in a higher variability of the cut-off between the two evaluators. Indeed, the absolute value of the cut-off difference between evaluators in the 41% SUVmax method reached 72 cm3 compared to only 7 cm3 for the 2.5 SUVmax cut-off method. This difference caused interobserver disagreement for patients with low or high MTV, and therefore, good or poor prognosis concerning 8 patients in the 41% SUVmax method versus only 1 patient in the 2.5 SUVmax method. This could have consequences in patient management when the MTV is used in routine clinical practice or in clinical trials.
At the Henri Becquerel Centre, we have carried out 2 studies partly concerning the same population of patients treated consecutively for DLBCL: Cottereau et al. in 2016 [25] found a threshold at 300 cm3 using the 41% method in a population of 81 patients and Toledano et al. in 2018 [17] found a cut-off at 261 cm3 using the 41% method in a population of 139 patients. These two values bracket the cut-off of 295 cm3 found in the present study. It is accepted that the cut-off is specific to the segmentation method used but also to the study population [18,24]. Indeed, the distribution characteristics of the MTV, in particular the median volume, but also age and general state (performance status) are important elements. The younger the population and the better preserved its general state, the higher the cut-off.
One of the most comparable studies in terms of patient characteristics is by Song et al. in 2016 [26], which had a population with exclusively advanced stages (100% vs. 78% in our study), equivalent in age (63% > 60 years vs. 64%), and a median MTV2.5 slightly lower (527 cm3 vs. 600 cm3). They found a threshold at 600 cm3 using the 2.5 SUVmax method, relatively close to our cut-off of around 550 cm3.
The study by Mikhaeel et al. [11] presented a population with an equivalent median MTV2.5 (595 cm3 vs. 600 cm3), a lower proportion of advanced stages and a younger population (69% advanced stages vs. 78% in our study and 52% < 60 years vs. 36%, respectively). In this study, the cut-off for the 2.5 SUVmax method was around 400 cm3 (−28% compared to our cut-off of 552 cm3). They also calculated the cut-off of the 41% method at 166 cm3 (−44% from our 295 cm3 cut-off).
In our study, reproducibility was challenging. Indeed, each lesion was manually contoured using a 3D brush, without automation, blindly. This was probably unfavourable to the 41% method and probably explains, in part, its poorer reproducibility. This method is sensitive to the size of each "box" within which the threshold is applied (via the SUVmax of the hot spot within the box). In our daily experience, it happens quite frequently that the DLBCL diffusely infiltrates an anatomical region, especially in advanced stages, making it difficult to individualise each involved node. There is not always a single way to segment a pathological lesion (Figure 5). This leads to significant inter-observer variation. This has been observed in the cases of patients who were outside the limits of agreement (mean difference ± 1.96 SD) on the Bland-Altman plots. This variability does not exist in the case of absolute thresholding, which applies to the whole organism for each voxel without the influence of the SUVmax within each initialization box. This probably explains the excellent inter-observer reproducibility (Kendall's Tau ≥ 0.92) of the absolute methods. However, the high interobserver reproducibility in the PERCIST and hepatic SUVmax methods is dependent on the determination of the hepatic SUVmean and SUVmax values, which is a prerequisite step in determining a fixed threshold to be applied. In our study, this was measured automatically, so the same threshold was used for both evaluators, eliminating the variability induced by this measurement.
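The box dependence described above can be illustrated with a toy example. The sketch below is not the Oncoplanet implementation: the function names and SUV values are hypothetical, and the voxel volume is assumed to follow from the 5.3 mm pixel size and 2 mm plane spacing given in the acquisition parameters.

```python
# Assumed voxel volume in cm3: 5.3 mm x 5.3 mm x 2.0 mm (hypothetical geometry)
VOXEL_CM3 = 5.3 * 5.3 * 2.0 / 1000.0


def volume_absolute(boxes, threshold=2.5):
    """Fixed SUV threshold: the same cut applies to every voxel of every box."""
    n_voxels = sum(1 for box in boxes for suv in box if suv >= threshold)
    return n_voxels * VOXEL_CM3


def volume_percent_max(boxes, fraction=0.41):
    """Relative threshold: each box is cut at a fraction of its own SUVmax."""
    n_voxels = 0
    for box in boxes:
        cut = fraction * max(box)
        n_voxels += sum(1 for suv in box if suv >= cut)
    return n_voxels * VOXEL_CM3


# Hypothetical SUV values of two adjacent lesions (illustration only)
hot_lesion = [20.0, 18.0, 12.0, 9.0]
mild_lesion = [6.0, 5.0, 4.0, 3.0]

one_box = [hot_lesion + mild_lesion]   # one evaluator draws a single box
two_boxes = [hot_lesion, mild_lesion]  # another evaluator draws two boxes

print(volume_absolute(one_box), volume_absolute(two_boxes))
print(volume_percent_max(one_box), volume_percent_max(two_boxes))
```

With a single box, the 41% threshold is driven by the hottest lesion (0.41 × 20 = 8.2) and the milder lesion is excluded entirely. With two boxes, the milder lesion gets its own lower threshold and is fully included. The fixed threshold gives the same volume in both cases.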
Finally, the Daisne method appears to be an adaptive method that is significantly more reproducible than the 41% SUVmax method. However, it is less accessible in current practice because it is based on a more complex segmentation algorithm. Nevertheless, it is an interesting option to allow better inter-observer reproducibility of the MTV measurement while avoiding the use of a fixed threshold [27].
To our knowledge, this is the largest population of patients with the MTV measured by two readers. This is a consecutive enrolment at the Henri Becquerel Centre. This population is unselected and included all stages of the disease and a range of ages. As there was a median follow-up of 6.6 years (95% CI: 6.1 to 7.0), the PFS and OS data were mature.
Our study had some limitations. This is a retrospective monocentric study on a population that underwent pre-therapeutic PET/CT on an older generation machine (PET/CT Biograph Sensation 16 HiRes). These results should be confirmed on different PET/CT devices, including new-generation devices.
More recently, Tout et al. [28] showed that rituximab exposure decreased with an increasing baseline MTV and was found to be predictive of response after induction treatment, OS, and PFS. This also opens perspectives in terms of therapeutic monitoring and personalised medicine. The MTV is a powerful prognostic marker. Its interest is growing, and it could be used in the future as a therapeutic decision tool with a possible intensification in patients with high metabolic volume, and therefore, poor prognosis. However, it is necessary to standardise its measurement through semi-automation so that the least possible intervention by the evaluator is required to make it as reproducible as possible, as defined by Barrington and Meignan [18]. One current area of research concerns repeatable segmentation algorithms that are independent of the initial input, as reported by Comelli in head and neck and brain tumours [29]. Then, once the measurement method has been defined and an international consensus has been reached, it will be necessary to carry out large multicentre prospective studies to validate the different uses of the MTV in clinical practice. This will also require the training and evaluation of nuclear medicine physicians in its measurement with a benchmark dataset to test their ability to measure the MTV consistently against the expected values.

Patients and Methods
This monocentric study was approved by the Henri Becquerel Centre review board (no. 2001B). Patients were informed about the use of anonymised data for the research and their right to oppose this use. The study enrolled consecutive patients between November 2004 and September 2014, who were retrospectively evaluated.
The inclusion criteria were as follows: DLBCL confirmed in all patients by a histopathologic review of the baseline biopsy; treatment using an anthracycline-containing regimen with rituximab (R-CHOP chemotherapy or R-CHOP-like, including R-mini CHOP, R-COPADEM and R-ACVBP); and staging with FDG-PET/CT at baseline. Clinical data obtained from all patients included: sex, age at disease onset, ECOG performance status, extranodal disease, Ann Arbor staging system and the LDH level. This allowed us to calculate the IPI score.

FDG-PET/CT Acquisitions
Images were acquired on a PET/CT Biograph Sensation 16 HiRes (Siemens®, Erlangen, Germany) accredited by EARL, and acquisitions were performed according to the EANM procedural guidelines [30].
Patients fasted for 6 h and the blood glucose level was <1.7 g/L before injection of the radiotracer. 4.5 MBq/kg of FDG was injected after 30 min of rest. Sixty minutes later (±5 min), acquisitions began with a CT scan in the craniocaudal direction. CT scan parameters were set to 120 kV and 100-150 mAs (based on the patient's weight) using the dose reduction software (Care-Dose, Siemens Medical Solutions, Hoffman Estates, Knoxville, TN, USA). This yielded a mean effective mAs of 89.1 ± 6.7. The patient's arms were positioned over his or her head, and the acquisition was performed with free breathing and a 16 × 0.75 mm primary collimation. The duration of the CT scan was 20 s. No contrast medium was injected.
PET image acquisitions immediately followed in the caudocranial direction, and the scan time was based on 3 min per bed position. Six to eight positions were acquired (whole-body); the axial field of view for the 1-bed position was 162 mm with a bed overlap of 25% (plane spacing: 2 mm). The transverse spatial resolution reached 4.4 mm (centred point source in the air). The image matrix was 168 × 168 pixels with 5.3 mm/pixel.

MTV Measurement
All scans were displayed using a fixed SUV scale and colour table. To analyse the interobserver variability, a second nuclear medicine physician (FE), who was blinded, measured the MTV independently from the first observer (MT) using the 8 methods available in the Oncoplanet application (version 3.1; DOSISoft, Cachan, France).
The MTV was computed using the following steps. First, the volumetric regions of interest were placed around each lesion, avoiding physiologic uptake (urinary elimination, heart).
The total metabolic tumour volume was obtained by summing the metabolic volumes of all nodal and extranodal lesions. The bone marrow involvement was included in the volume measurement only if there was focal uptake. The spleen was considered as involved if there was focal uptake or diffuse uptake higher than 150% of the liver background as recommended [18].
The time required to compute the volumes with each method was not measured, because of the preliminary step of placing boxes on each pathological uptake.
Liver SUVmax and SUVmean were measured automatically in the right upper lobe of the liver.

Statistical Analysis
The statistical analysis was performed using R software, version 3.6.1 [34]. Agreement between the two observers was evaluated by the intraclass correlation coefficient (ICC) to measure the consistency between the MTV evaluations and by Kendall's Tau to measure the rank correlation of the MTV evaluations [35,36]. The 95% confidence intervals (CI) of the ICC and Kendall's Tau were estimated using 10,000 bootstrap replications with the BCa (adjusted bootstrap percentile) method [37]. Bland-Altman plots evaluated the means and the differences between the two evaluators, with a 95% CI [38]. The median follow-up was calculated using the reverse Kaplan-Meier method. Overall survival (OS) and progression-free survival (PFS) were estimated from the diagnosis date to death, and to progression or death, respectively. Since the statistical analyses were carried out after 5 years, data were censored at this time. Survival probabilities were calculated using the Kaplan-Meier method. Log-rank tests and multivariate analyses were performed using Cox models with variable selection prior to analysis according to literature and clinical pertinence. Mean receiver operating characteristic (ROC) curves from the two evaluators were used to predict the PFS at 5 years for each segmentation method by identifying optimal cut-offs [39]. A two-tailed p-value below 0.05 was considered statistically significant. For secondary analyses, a Hochberg correction was applied to control the risk of Family-Wise type I error at 5% [40].
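The agreement statistics described above can be sketched in a few lines. The analysis itself was run in R; the Python sketch below is only an illustration under stated assumptions. The ICC form is assumed to be the two-way consistency, single-measurement coefficient; Kendall's Tau is computed in its tau-b form; and a plain percentile bootstrap with fewer replications replaces the BCa bootstrap for brevity. All MTV values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def icc_consistency(x, y):
    """Two-way ICC for consistency, single measurement, two raters (assumed form)."""
    n, k = len(x), 2
    grand = mean(x + y)
    row_means = [(a + b) / 2 for a, b in zip(x, y)]
    ss_rows = k * sum((r - grand) ** 2 for r in row_means)
    ss_cols = n * ((mean(x) - grand) ** 2 + (mean(y) - grand) ** 2)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation (O(n^2) pair enumeration)."""
    conc = disc = tie_x = tie_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # pairs tied on both variables are ignored
            if dx == 0:
                tie_x += 1
            elif dy == 0:
                tie_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    return (conc - disc) / sqrt((conc + disc + tie_x) * (conc + disc + tie_y))


def bootstrap_ci(stat, x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI over patients (simpler than the BCa variant)."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            values.append(stat([x[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # degenerate resample (all pairs identical)
    values.sort()
    lo = values[int((alpha / 2) * len(values))]
    hi = values[int((1 - alpha / 2) * len(values)) - 1]
    return lo, hi


# Hypothetical MTV values in cm3 for eight patients (illustration only)
mtv1 = [120.0, 480.0, 950.0, 60.0, 310.0, 700.0, 15.0, 1500.0]
mtv2 = [110.0, 500.0, 900.0, 65.0, 300.0, 760.0, 20.0, 1400.0]

print("ICC:", icc_consistency(mtv1, mtv2))
print("Kendall tau-b:", kendall_tau_b(mtv1, mtv2))
print("95% CI for tau:", bootstrap_ci(kendall_tau_b, mtv1, mtv2))
```

Resampling is done over patients, keeping each patient's pair of measurements together, so that the dependence between the two evaluators' readings is preserved in every bootstrap replicate.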

Conclusions
In conclusion, we found that MTV measurements based on absolute SUV criteria were significantly more reproducible than those based on 41% SUVmax criteria. The threshold was specific for each of the eight segmentation methods, but all predicted prognosis. These results can contribute to setting benchmarks for the measurement of the MTV.