Tumor Size Measurements for Predicting Hodgkin’s and Non-Hodgkin’s Lymphoma Response to Treatment

The purpose of this study was to investigate the value of tumor size measurements as prognostic indicators of treatment outcome of Hodgkin’s and Non-Hodgkin’s lymphomas. 18F-FDG PET/CT exams before and after treatment were analyzed and metabolic and anatomic parameters—tumor maximum diameter, tumor maximum area, tumor volume, and maximum standardized uptake value (SUVmax)—were determined manually by an expert and automatically by a computer algorithm on PET and CT images. Results showed that the computer algorithm measurements did not correlate well with the expert’s standard maximum tumor diameter measurements but yielded better three dimensional metrics that could have clinical value. SUVmax was the strongest prognostic indicator of the clinical outcome after treatment, followed by the automated metabolic tumor volume measurements and the expert’s metabolic maximum diameter measurements. Anatomic tumor measurements had poor prognostic value. Metabolic volume measurements, although promising, did not significantly surpass current standard of practice, but automated measurements offered a significant advantage in terms of time and effort and minimized biases and variances in the PET measurements. Overall, considering the limited value of tumor size in predicting response to treatment, a paradigm shift seems necessary in order to identify robust prognostic markers in PET/CT; radiomics, namely combinations of anatomy, metabolism, and imaging, may be an option.


Introduction
Identifying the right treatment for the oncology patient is paramount to a successful outcome. Determining early on whether a tumor responds to a given treatment is equally important for selecting the right therapy and achieving a good prognosis. There is significant effort in identifying metrics from imaging or other diagnostic studies that correlate with outcome and allow us to predict response based on a tumor's anatomical or metabolic/functional characteristics. A treatment is effective if clinical symptoms and survival improve and tumor size is reduced. PET/CT hybrid imaging offers both anatomic and functional information and attractive metrics for monitoring tumor response to treatment in one, two, and three dimensions [1].
Currently, one-dimensional (1D) measurements are the clinical standard. There is significant research effort, however, to demonstrate that measuring a single dimension of a tumor is an oversimplification: it does not adequately represent the tumor's irregular shape or its often asymmetric changes in response to a specific treatment or over time. There are numerous reports on the possibly higher value of three-dimensional (3D) tumor size metrics as prognostic indicators of an oncology patient's response to treatment. However, studies evaluating 3D metrics have so far fallen short of spectacular results, and the standard of practice remains unchanged and includes the metabolic tumor maximum diameter (MTDmax), which is a 1D metric, and SUVmax [2].
We can support the thesis that one type of measurement does not fit all cases. However, we can see evidence that complicated approaches do not necessarily mean better outcomes or better health care. The issue is timely intervention, and if this can be achieved via relatively simple, standardized means, then it is preferable. One should also consider the bias and variances identified in PET/CT tumor measurements in the process of developing a robust and reproducible treatment outcome metric [1].
Lymphomas have unique characteristics, different from solid tumors, which have been studied extensively in the last 20 years, including the value of metabolic tumor volume (MTV) as a prognostic marker of Hodgkin's Lymphoma (HL) and Non-Hodgkin's Lymphoma (NHL) response to treatment [3,4]. Limitations of CT for these patients have been clearly demonstrated [4,5], but PET results are also contradictory; no consensus has been achieved on the value of metabolic parameters, while the initial method of measuring and reporting MTDmax and SUVmax remains a universal practice [5]. It is possible that the large number of variables interferes with the standardization of the measurement and its wider acceptance. So, the question remains: Should we strive for metabolic tumor volume measurements; be content with maximum diameter measurements and try to automate and standardize them as much as possible on PET data; or look for completely new metrics altogether and re-evaluate the role of CT?
This study aimed at addressing some of the previous questions for HL and NHL patients, who underwent PET/CT before and after treatment and were followed up clinically after the end of their treatment. In a pilot study with HL only, the MTV showed significant correlation with treatment outcome or prognostic value [6]. We expanded our work, however, to include more NHL cases and complete clinical follow up. The new data showed that MTDmax is a reliable, universal, fast, and easy to apply measurement with good prognostic value. In agreement with studies in other pathologies, MTV added some but not significant prognostic value in HL and NHL cases [7][8][9][10]. Manual MTV estimates were time consuming to perform, while an automated approach was more promising and yielded significantly better results than an expert.

Patient Characteristics and Measurements
In this study, we expanded our pilot work, results of which were reported in 2015 [6], to include 24 NHL and 21 HL patients with one mass of interest each. The large majority of the NHL patients (about 80%) had diffuse large B-cell lymphoma, while the large majority of HL was nodular sclerosis (about 90%). 18F-FDG PET/CT scans of each patient were taken before and after chemotherapy and radiation therapy with the same imaging protocol. The following measurements were performed on the PET images for all masses: Average values of the metabolic measurements from the PET images and their standard deviations performed by the expert and the CAD pre and post treatment are listed in Tables 1 and 2 for the HL and NHL patients, respectively. Table 3 summarizes the CAD CT measurements.

RECIL Classification of Changes in Measured Variables
Changes in the PET variables between pre and post-treatment were calculated by subtracting the baseline PET measurement from the post treatment measurement and dividing by the baseline measurement. The %change was classified in four categories based on the response evaluation criteria in lymphoma (RECIL) [11,12]. Specifically, patients' response to treatment was distinguished as i.
It should be noted that the numbers 3, 2, 1, and 0 were assigned to the above four classes, respectively, for further analysis. The classification of the patients based on the expert's or CAD's measurements was compared to the clinical outcome, i.e., the classification of each patient at the first follow-up after the end of the treatment. The distributions of the frequencies of the treatment response classes are shown in Figures 1 and 2.
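The percent-change calculation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the cutoff values in `PLACEHOLDER_CUTOFFS`, and the convention that the score 3 goes to the largest decrease are hypothetical and are not taken from the study or from RECIL; the actual class boundaries would have to be substituted.

```python
from bisect import bisect_left

# Illustrative percent-change cutoffs only; these are NOT the RECIL boundaries.
PLACEHOLDER_CUTOFFS = (-75.0, -25.0, 25.0)


def percent_change(baseline, post):
    """Change of a measurement relative to baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline measurement must be non-zero")
    return 100.0 * (post - baseline) / baseline


def response_score(pct, cutoffs=PLACEHOLDER_CUTOFFS):
    """Map a percent change to an ordinal score 3, 2, 1, or 0.

    Assumption: three ascending cutoffs split the axis into four bins,
    and the bin with the largest decrease receives the score 3.
    """
    return 3 - bisect_left(cutoffs, pct)


# Hypothetical diameter of 40 mm at baseline and 10 mm after treatment.
change = percent_change(baseline=40.0, post=10.0)
score = response_score(change)
```

Keeping the cutoffs as a parameter makes it straightforward to rerun the classification if a different set of response criteria is adopted.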

Differences between Expert and CAD
The differences between expert's and CAD's measurements of MTV and MTDmax were analyzed by Bland-Altman plots shown in Figures 3 and 4, respectively [13]. The graphs also list the bias, i.e., the gap between the horizontal line at mean difference and the zero differences, and the lower and upper limits of the 95% confidence interval for the mean difference. Note that (a) the further the bias is from zero, the larger the mean difference and (b) the wider the limits of agreement, or disagreement, the more ambiguous the measurements are.

Weighted Kappa Measurements of Agreement
Linearly weighted kappa was used to determine the agreement between the expert's and CAD's PET measurements and the clinical follow-up, i.e., the classes shown in Figures 1 and 2 [14,15]. Results are summarized in Table 4.
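For readers unfamiliar with the statistic, a minimal sketch of linearly weighted Cohen's kappa for the four ordinal classes (0 to 3) is given here. It uses the standard textbook definition with disagreement weights |i - j| / (k - 1); the function name and the example ratings are hypothetical, and this is not the software used in the study.

```python
def linear_weighted_kappa(ratings_a, ratings_b, n_classes=4):
    """Linearly weighted Cohen's kappa for ordinal classes 0..n_classes-1."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Observed joint proportions of (class by rater A, class by rater B).
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions give the agreement expected by chance.
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = abs(i - j) / (n_classes - 1)  # linear penalty
            obs_disagreement += weight * observed[i][j]
            exp_disagreement += weight * row[i] * col[j]

    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single class")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical class assignments for six patients.
cad_classes = [3, 3, 2, 1, 0, 2]
clinical_classes = [3, 2, 2, 1, 0, 3]
kappa = linear_weighted_kappa(cad_classes, clinical_classes)
```

The linear weights penalize a disagreement of two classes twice as much as a disagreement of one class, which suits ranked response categories.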

Discussion
This study aimed at addressing the following questions regarding HL and NHL imaged by 18F-FDG PET/CT:

1.
How well do the standard PET metabolic measurements of MTDmax and SUVmax predict the response of lymphomas to therapy? To answer this, the MTDmax and SUVmax of 24 NHL and 21 HL patients were compared to the clinical outcome 6 months post treatment. Results showed that SUVmax had the highest agreement with the clinical outcome post treatment while the expert's MTDmax measurements had moderate agreement. The CIs for both metrics were relatively wide due to the small sample size of our study, but the relative significance is not affected, even at the lower limit [15]. The CAD MTDmax measurements differed from the expert's measurements and had poor correlation with the clinical follow-up. Differences for the larger masses were often more than 100%, and this was puzzling considering that an expert also evaluated the CAD algorithm's segmentation performance and deemed it acceptable. It should be noted, however, that the expert's MTDmax and SUVmax values used in our analysis were recorded from the clinical diagnostic report, and the expert who did the formal clinical interpretation was different from the expert who participated in our segmentation process. It is well documented in several studies that a large margin is applied during standard clinical measurements while interobserver variability is high [1]. Finally, SUVmax was also measured by our algorithm, but these values were not reported here because they did not differ from the expert's as they were both based on similar mathematical definitions [16].

2.
How good are MTV measurements for the prognosis of the disease, and how do they compare to the standard measurements of MTDmax and SUVmax? Results showed that the expert's MTV manual measurements have a fair agreement with the clinical follow-up. CAD MTV measurements showed a moderate agreement as is also indicated in other similar reports [17][18][19]. CAD's better performance may be explained by the fact that CAD MTV values were based on more consistent ROI contours while the expert's MTV values were based on rough elliptical contours around the tumor area in the various slices. CAD measurements were reproducible and faster compared to the expert and can be highly accurate, particularly when semi-automated, i.e., when initiated by an expert.

3.
Is the MTAmax of any value? This parameter is rarely used or measured in studies of metabolic tumor size measurements. It showed fair agreement with the clinical outcome and its value was not considered significant.

4.
Are there differences between HL and NHL cases? It seems that both the expert and the CAD performed better on the NHL than the HL masses. The NHL cases had masses with larger diameters and volumes than the HL but there was no indication, given our relatively small sample size, that the accuracy of measurements depended on the size.

5.
How do the PET segmentation algorithm's parameters affect measurements? The adaptive thresholding segmentation is a key element in our CAD approach, and the selection of a threshold impacts the final result. The 50% threshold was considered the optimum threshold for our algorithm. The selection was determined by a receiver operating characteristic (ROC) study, which was performed with five threshold values (30%, 40%, 50%, 60%, and 70%) on a subset of images where the masses were outlined by an expert and these outlines were considered "ground truth" [20]. The ROC analysis showed that a threshold of 50% yielded the best agreement with the ground truth, followed by the thresholds of 40% and 70%. To test it further, all three thresholds were used for the metabolic size measurements. Comparisons with the clinical outcome were conducted for all three sets of measurements. The 50% threshold yielded the best results and these are reported here.

6.
How does metabolic tumor size compare to anatomic size, and is there any prognostic value in the latter? To address this question, we performed, as indicated above, similar 1D, 2D, and 3D size measurements of the HL and NHL masses in the corresponding CT images. A different segmentation algorithm was applied to CT than the one described in the following section for the PET images. The algorithm involved an initialization step based on wavelets and fuzzy C-means unsupervised clustering and a Markov Random Field step for final tumor segmentations [21,22]. CT measurements were significantly different from the PET measurements by either the expert or the CAD, as can be seen from Tables 1-3. They correlated poorly with the clinical outcome showing little, if any, prognostic value. CT images of lymphomas have poor contrast, making pre-processing a critical step in the segmentation process. Our CT results seem to agree with previous reports on the limited value of CT in assessing tumor response to treatment, and particularly lymphomas [2,12]. However, there may be some value in using CT as a guide to CAD methods for more accurate tumor segmentation on PET images. Considering also the rapidly advancing field of radiomics and its recent promising results on both solid and non-solid tumors, one could improve decision support for both HL and NHL by combining various CT and PET quantitative features, possibly including patient and clinical data [23,24].

Our conclusions are based on the selected segmentation methods and the expert. It is apparent from the literature that there is significant variability among observers and among processing methodologies. So, it is possible that an average measurement from multiple observers with different levels of expertise may alter the results and reduce potential biases and variances. Similarly, more advanced artificial intelligence algorithms may yield better and more accurate metrics with better correlation to clinical outcome.
In addition, combinations of metrics from metabolic and anatomical data may lead to more powerful markers. One has to weigh, however, the computational load, the time to process, and the cost:benefit ratio of various automated or semi-automated approaches relative to the current clinical standard.
Finally, the omission of total lesion glycolysis (TLG) measurements of these masses may be considered a weakness of the study. TLG is generally considered a useful metabolic marker [25]. However, the estimation of TLG requires an accurate ROI outline in order to estimate a mean SUV that enters the TLG calculation. Considering the observed differences between expert and CAD on the ROI outlines and the small sample size of our study, we decided to exclude the TLG estimates from this work, as they strongly depend on the selected regions. A pilot work suggested that TLG may be of value in optimizing automated ROI segmentation, and this aspect is currently under investigation. In addition, the Deauville five point scale was not used as classification guide and it is possible that it may also impact the prognostic value of the metabolic metrics [26].

Patients and Data Coding
The demographics of the patients are listed in Table 5. The database of the Nuclear Medicine Department of the Biomedical Research Foundation of the Academy of Athens was reviewed and serial PET/CT examinations were selected for the study that satisfied the following criteria: (a) Patients should have one mass, non-operable, that underwent similar clinical treatment that included chemotherapy and radiation therapy. (b) All patients should have at least two PET/CT examinations, one before (baseline) and one after treatment. (c) All patients should have a clinical follow-up 6 months after the end of their treatment and be classified according to the RECIL as in remission (positive response to treatment) with either complete or partial response with the tumor reduced in size or in relapse (negative response to treatment) with either no change in the tumor size or increase in size or appearance of new lesions. Each mass was assessed by an expert nuclear medicine physician on the PET/CT images of each patient, and ground truth files were generated for each mass that included the maximum diameter pre and post treatment on the CT and PET images, and the SUVmax of each lesion pre and post treatment from the PET images following criteria used in the clinical practice. In addition, there was clinical follow-up of the patients one year after the end of treatment and cases were classified as remission or relapse according to RECIL.

18F-FDG PET/CT Imaging
A hybrid Biograph 5 PET/CT system (Siemens Healthcare GmbH, Erlangen, Germany) was used for imaging, pre and post treatment. The same whole body imaging protocol was used pre and post treatment. All patients fasted for at least 6 h before the PET/CT study. The radiopharmaceutical was injected intravenously (370-555 MBq or 10-15 mCi) without contrast. Image acquisition started 1 h after intravenous administration, at which time no patient had a glucose level higher than 160 mg/dL. Patients were imaged in the supine position with their arms placed above their heads when possible. The acquisition time was 2-4 min per bed position. CT scans began at the orbitomeatal line and progressed to the upper thighs. CT images were acquired with 30 mA, 130 kV, axial slice thickness of 5 mm and table feed of 27 mm per rotation. PET imaging followed immediately over the same body region. The CT data were used for attenuation correction and images were reconstructed using a standard ordered-subset expectation maximization algorithm. PET image reconstruction matrix size was 168 × 168 pixels with a voxel size of 4.06 mm × 4.06 mm × 2.5 mm [6].

Metabolic Parameter Measurements
The metabolic size of the masses in 1D (maximum tumor diameter), 2D (maximum tumor area), and 3D (tumor volume) was determined by an in-house developed algorithm and user interface in MATLAB R2013b. Tumors were first segmented on multiple slices using a semi-automated approach, a representative example of which is shown in Figure 5. The segmentation procedure included the following steps: (1) The expert selected the PET and corresponding CT slice of a scan where the tumor appeared at maximum diameter; we will refer to this as the "central" slice. (2) An ellipse was drawn by the expert around the region of interest (ROI) by selecting "Design ROI" on the user interface (Figure 5a). The expert was given the option to repeat this step by selecting "Clear ROI" if the result was not satisfactory, as presented in Figure 5b. (3) A background ring was defined automatically on the border of the elliptical region drawn by the expert. The ring was 3 pixels wide and its mean pixel value was used for the estimation of the threshold of the segmentation (Figure 5c). Pixels with values greater than a selected percentage of the mean background value were considered part of the tumor; otherwise they were rejected. A non-uniform region was finally defined for a given threshold as the tumor ROI and was used for the estimation of the tumor maximum diameter (mm) and tumor maximum area (mm²) (Figure 5d). (4) The "central" slice elliptical contour of the expert was automatically projected by the algorithm to the slices above and below where the tumor appeared. The number of slices ranged from 5 to 20 per case depending on the tumor size. (5) The same ROI segmentation process of step (3) was applied to all slices and for three different thresholds, 40%, 50%, and 70%. (6) The MTV was estimated by adding all 2D ROIs and using the voxel dimensions.
Measurements of MTDmax, SUVmax, and MTV were also performed manually by a nuclear medicine expert using Biograph's user interface standard tools. For the volume measurements, the expert outlined a tight ellipse around the tumor region in all PET slices where the tumor was observed and the volume was estimated in mm³ by summing all the pixels in the outlined areas and multiplying with the voxel size.
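Once the tumor ROI has been defined on each slice, the three size metrics follow from simple pixel counting. The Python sketch below illustrates this under stated assumptions: ROIs are given as sets of (row, column) pixels, the maximum diameter is taken as the largest distance between two ROI pixel centers within a slice, and all function names are hypothetical. It is not the authors' MATLAB implementation, and it does not reproduce the ring-based thresholding itself. The voxel dimensions are those reported for the PET reconstruction.

```python
from itertools import combinations
from math import hypot

# Voxel dimensions reported for the PET reconstruction (mm).
PIXEL_MM = 4.06
SLICE_MM = 2.5


def max_area_mm2(rois):
    """Largest in-slice ROI area: pixel count times pixel area."""
    return max(len(roi) for roi in rois) * PIXEL_MM ** 2


def max_diameter_mm(rois):
    """Largest in-slice distance between two ROI pixel centers (assumption)."""
    best = 0.0
    for roi in rois:
        for (r1, c1), (r2, c2) in combinations(roi, 2):
            best = max(best, hypot(r1 - r2, c1 - c2))
    return best * PIXEL_MM


def volume_mm3(rois):
    """Sum of ROI pixels over all slices times the voxel volume."""
    return sum(len(roi) for roi in rois) * PIXEL_MM ** 2 * SLICE_MM


# Toy example: two slices, each ROI given as a set of (row, col) pixels.
rois = [{(0, 0), (0, 1), (0, 2), (1, 1)}, {(0, 0), (0, 1)}]
mtv = volume_mm3(rois)
mta_max = max_area_mm2(rois)
mtd_max = max_diameter_mm(rois)
```

The same counting applies to the expert's elliptical outlines, which is why differences between expert and CAD volumes trace back to differences in the ROI contours rather than in the volume arithmetic.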

Statistical Analysis
The analysis of the data used descriptive statistics, Bland-Altman plots for testing the agreement of the sets of measurements, and weighted Cohen's kappa to test the agreement of the various parameters with the clinical outcome, which may be considered as their prognostic value.
The Bland-Altman plots graphically demonstrate the difference between the expert and CAD estimates of the measured parameters as a function of their mean values [13]. The three horizontal lines are drawn at the mean difference, and at the limits of the agreement, which are defined as mean difference ±1.96 SD of the differences. The weighted Cohen's kappa was used to measure the agreement between the various metrics and the clinical outcome, because differences were ranked and were not considered to be equally important [14].
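As a minimal sketch of the Bland-Altman quantities defined above, the following Python function computes the bias and the limits of agreement from paired measurements. The function name, the sign convention (expert minus CAD), the use of the sample standard deviation, and the example values are assumptions made for illustration.

```python
from statistics import mean, stdev


def bland_altman(expert, cad):
    """Return (bias, lower limit, upper limit) of agreement.

    Assumption: differences are taken as expert minus CAD and the
    sample standard deviation of the differences is used.
    """
    if len(expert) != len(cad) or len(expert) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [e - c for e, c in zip(expert, cad)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Hypothetical paired diameters in mm.
expert_mm = [10.0, 20.0, 30.0]
cad_mm = [12.0, 18.0, 33.0]
bias, lower, upper = bland_altman(expert_mm, cad_mm)
```

In the plot, each difference would be drawn against the mean of the corresponding expert and CAD values, with horizontal lines at the three returned levels.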

Conclusions
The hypothesis of our study was that 3D metabolic parameters are key predictors of response to treatment and would significantly outperform 1D metrics. We tested our hypothesis on HL and NHL cases and concluded that computer-assisted MTV measurements have the potential to be a useful marker for treatment response, but they do not differ significantly from MTDmax and fall short of SUVmax. It is more likely that combinations of various metabolic, anatomic, and imaging parameters will yield better prognostic markers. Given the additional load and variability of radiomics measurements, the use of fast, standardized, and reproducible CAD tools in clinical PET/CT practice seems inevitable.