Deep Learning Approach to Automatize TMTV Calculations Regardless of Segmentation Methodology for Major FDG-Avid Lymphomas

The total metabolic tumor volume (TMTV) is a new prognostic factor in lymphomas that could benefit from automation with deep learning convolutional neural networks (CNNs). Manual TMTV segmentations of 1218 baseline 18FDG-PET/CT scans were used for training. A 3D V-NET model was trained with a soft dice loss to generate segmentations. Ground truth segmentations were generated by combining different thresholds applied to the manual regions of interest (Otsu, relative 41% SUVmax, and SUV 2.5 and 4 cutoffs) into a probability map (TMTVprob). In total, 407 and 405 PET/CT scans were used for the test and validation datasets, respectively. The training was completed in 93 h. In comparison with the TMTVprob, the mean dice score reached 0.84 in the training set, 0.84 in the validation set and 0.76 in the test set. The median dice scores for each TMTV methodology were 0.77, 0.70 and 0.90 for the 41%, 2.5 and 4 cutoffs, respectively. Differences in the median TMTV between manual and predicted TMTV were 32, 147 and 5 mL, and Spearman's correlations between manual and predicted TMTV were 0.92, 0.95 and 0.98, respectively. This generic deep learning model for computing TMTV in lymphomas can drastically reduce the computation time of TMTV.


Introduction
The total metabolic tumor volume (TMTV) has recently been proposed as a tumor burden quantification method in various lymphoma subtypes, especially in Hodgkin lymphoma (HL) [1,2], diffuse large B cell lymphoma (DLBCL) [3] and follicular lymphoma (FL) [4]. This evaluation requires whole-body segmentation of the tumor mass on the baseline 18FDG PET/CT imaging. To make TMTV acceptable in routine clinical practice, a high level of automation is needed to reduce the calculation time and to enhance the interobserver reproducibility.
Such automatization approaches have been proposed using a large range of algorithms, and more recently, using deep learning segmentation algorithms [5][6][7]. Segmentation algorithms using convolutional neural networks (CNNs) have shown very high accuracy for medical image segmentation [8,9]. These CNN approaches can exploit numerous imaging features to distinguish tumoral uptakes from physiological uptakes such as brain, kidney or brown fat uptakes, which are common pitfalls in segmentation automation.
In the scope of TMTV calculations in lymphoma, several studies have demonstrated very high accuracy in predicting TMTV values and could be a major breakthrough for automating TMTV calculations [6,7,10]. However, published data are still limited in some respects. First, published papers usually rely on a single lymphoma subtype dataset, although it could be more interesting to use data from the most common FDG-avid lymphoma subtypes to train a more generic model with a larger training dataset (HL/DLBCL/FL having overlapping imaging patterns). Moreover, all those papers relied on a single ground truth TMTV calculation methodology, which seems to be a major limitation because TMTV methodology harmonization is still pending [11]. Training CNN models against a single segmentation methodology (e.g., 41% SUVmax or 2.5 threshold) forces algorithms to try to reproduce one specific segmentation threshold, with a loss of control of the segmentation rule, which cannot be transferred to other segmentation methodologies. Finally, the reported performance is still very dependent on the representativity of the training dataset and the quality of tumor labelling, making reported performances not always generalizable for routine applications.
One other approach for training CNN models could be to try to reproduce an equivalent of the manual physician delineation of uptakes, regardless of the final segmentation threshold algorithm. The final segmentation methodology would then be applied in a post-processing step to compute the final TMTV. With this approach, a more generalizable model could be built for TMTV calculations of FDG-avid lymphoma.
The aim of our study was to train and validate a generic segmentation CNN model on a large training dataset to provide TMTV automation on the main lymphoma subtypes (DLBCL, FL and HL), regardless of the PET thresholding methodology.
Each of these studies was approved by an ethics committee (see related publications); ancillary studies of the images from these studies were planned from the initial study protocols. The Ann Arbor stages covered by each study were: AHL2011, IIB, III and IV; GAINED, I-IV; RELEVANCE, I-IV; REMARC, II-IV; FLIP, I-IV; LNH2007-3B, I-IV; PVAB, II-IV.

Image Preprocessing
PET/CT quality was checked for slice interval regularity and axial slice completeness, and attenuation-corrected PET/CT images were converted into standardized uptake value (SUV) units.
All available PET, CT and ground truth masks followed a preprocessing pipeline. First, from the DICOM format, a 3D image in nifti (.nii) format was generated. Images were resized to a 128 × 128 × 256 matrix with a 4.0 × 4.0 × 4.0 mm voxel size using linear interpolation. PET and CT were aligned at the same origin to build 4D PET/CT input data. Input data for training were scaled from 0 to 1 (corresponding to an original input range of (−1000, 1000) for CT HU values and (0, 25) for PET SUV values).
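The resampling, fixed-shape fitting and intensity rescaling steps above can be sketched as follows. This is a minimal illustration, assuming SciPy for interpolation; the function and parameter names are ours, not the authors' pipeline code.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(pet_suv, ct_hu, voxel_size_mm):
    """Resample to 4 mm isotropic voxels, fit to a 128x128x256 grid,
    and rescale intensities to [0, 1] as described in the paper."""
    # Resample each volume to the 4.0 mm isotropic grid (linear interpolation)
    factors = [s / 4.0 for s in voxel_size_mm]
    pet = zoom(pet_suv, factors, order=1)
    ct = zoom(ct_hu, factors, order=1)

    # Pad or crop to the fixed CNN input shape
    target = (128, 128, 256)
    def fit(vol, fill):
        out = np.full(target, fill, dtype=np.float32)
        sl = tuple(slice(0, min(a, b)) for a, b in zip(vol.shape, target))
        out[sl] = vol[sl]
        return out
    pet, ct = fit(pet, 0.0), fit(ct, -1000.0)

    # Rescale: CT HU in [-1000, 1000] -> [0, 1]; PET SUV in [0, 25] -> [0, 1]
    ct = np.clip((ct + 1000.0) / 2000.0, 0.0, 1.0)
    pet = np.clip(pet / 25.0, 0.0, 1.0)

    # Stack PET and CT into a two-channel 4D input (channels last)
    return np.stack([pet, ct], axis=-1)
```

The two-channel stack corresponds to the 4D PET/CT input mentioned above; clipping enforces the stated input ranges before scaling.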

Ground Truth Generation
For each baseline PET/CT, ground truths were generated by the manual delineation of uptakes randomly assigned to 13 expert physicians trained in TMTV calculations from the LYSA group.
Manual regions of interest (ROIs) were drawn using the open-source PET/CT viewer for Fiji [18]. Four different thresholds were then applied to calculate the probability of the voxel being included in the TMTV segmentation: 41% SUVmax, SUV > 2.5 and SUV > 4.0 cutoffs and Otsu method (histogram-based threshold) [19][20][21].
A voxel-by-voxel average of these masks was computed to generate one single probability mask (probability from 0 to 1 with 0.25 steps), called the TMTV probability map (TMTVprob).
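The voxel-by-voxel averaging of the four threshold masks into the TMTVprob can be sketched as below; a minimal illustration with an assumed function name.

```python
import numpy as np

def tmtv_probability_map(masks):
    """Average four binary masks (41% SUVmax, SUV > 2.5, SUV > 4.0, Otsu)
    into one probability map with values in {0, 0.25, 0.5, 0.75, 1.0}."""
    stacked = np.stack([m.astype(np.float32) for m in masks], axis=0)
    return stacked.mean(axis=0)
```

With four input masks, a voxel included by all thresholds gets probability 1.0, while a voxel included by only one gets 0.25, reproducing the 0.25 steps stated above.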
The rescaled PET/CT and the TMTVprob served as input data and ground truth for the CNN.

Model Architecture and Training
The dataset was split into 60% for training (n = 1218), 20% for the test dataset (n = 407) and 20% for the validation dataset (n = 405).
A fully convolutional neural network VNET [22], with 4 levels and 8 channels in the first level, was trained on baseline PET-CT and TMTVprob, using Tensorflow 2.4.1 on a dual NVIDIA 1080 TI GPU.
For each patient, a data augmentation strategy with random translation, rotation and scaling values was used to generate one additional augmented image. Model weights were updated using stochastic gradient descent with a learning rate of 0.001 and a momentum of 0.9, over 100 epochs. The soft dice loss, defined as 1 − dice_coefficient [22], was used as the loss function. When optimizing the network through this function, we did not need to account for class imbalance between regions (background and tumors). The last layer had one channel and used the sigmoid activation function.
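The soft dice loss described above can be written compactly; this is an illustrative NumPy sketch (the paper's implementation uses TensorFlow 2.4.1), with an assumed epsilon for numerical stability.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, eps=1e-6):
    """Soft dice loss = 1 - dice coefficient, computed on probabilities.

    Because dice is a ratio of overlap to total foreground mass, it is
    largely insensitive to the heavy background/tumor class imbalance."""
    intersection = np.sum(y_true * y_pred)
    denom = np.sum(y_true) + np.sum(y_pred)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)
```

A perfect prediction yields a loss near 0, and a fully disjoint prediction a loss near 1, regardless of how small the tumor region is relative to the background.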

Post-Processing
The predicted segmentation was converted to a raw TMTV prediction using a voxel probability threshold > 0.5. A post-processing step, consisting of applying a selected thresholding method, was then performed to reproduce the three main TMTV thresholding algorithms: 41% SUVmax, and 2.5 and 4 SUV cutoffs.
For the 41% methodology, a clustering method was implemented to identify the SUVmax of individual lesions and compute the thresholded mask. This clustering was based on the isolation of connected components of the tumor mask prediction. Then, for remaining areas over 30 mL, a watershed segmentation was used to isolate tumor subparts using the SUV values of the PET image. After this clustering process, the relative threshold (41% SUVmax of each sub-component) was applied on each ROI.
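The per-lesion 41% step can be sketched as follows, assuming SciPy and scikit-image for connected components and the watershed; function names, the marker-seeding strategy (local SUV maxima) and parameter values other than the 30 mL split limit and the 41% threshold are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def threshold_41pct(pet_suv, pred_mask, voxel_ml=0.064, split_over_ml=30.0):
    """Apply a per-lesion 41% SUVmax threshold to a binary predicted mask.

    Connected components are isolated first; components larger than
    split_over_ml (30 mL in the paper) are sub-segmented with a
    watershed on the inverted PET intensities before thresholding."""
    final = np.zeros(pred_mask.shape, dtype=bool)
    labels, n = ndimage.label(pred_mask)
    for i in range(1, n + 1):
        comp = labels == i
        if comp.sum() * voxel_ml > split_over_ml:
            # Watershed sub-segmentation seeded at local SUV maxima
            coords = peak_local_max(pet_suv, min_distance=2,
                                    labels=comp.astype(int))
            markers = np.zeros(pred_mask.shape, dtype=int)
            markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)
            regions = watershed(-pet_suv, markers, mask=comp)
        else:
            regions = comp.astype(int)
        # Relative 41% SUVmax threshold on each sub-component
        for r in range(1, regions.max() + 1):
            roi = regions == r
            if roi.any():
                final |= roi & (pet_suv >= 0.41 * pet_suv[roi].max())
    return final
```

Each 4 mm isotropic voxel is 0.064 mL, which is where the default `voxel_ml` comes from.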

Predicted TMTV Validations
The raw TMTV predictions, before the post-processing step, were compared with TMTVprob and with the three TMTV manual delineations (41% SUVmax, SUV2.5 and SUV4) using the dice scores and Jaccard coefficients.
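The two overlap metrics used for this comparison can be computed as below; a minimal sketch with an assumed function name.

```python
import numpy as np

def dice_and_jaccard(a, b):
    """Dice and Jaccard overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    dice = 2 * inter / (a.sum() + b.sum()) if (a.sum() + b.sum()) else 1.0
    jaccard = inter / union if union else 1.0
    return dice, jaccard
```

The two metrics are monotonically related (jaccard = dice / (2 − dice)), so they rank segmentations identically but on different scales.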

Training of the Convolutional Neural Network
The training was completed in 93 h on 1218 patients. In comparison to the TMTVprob, the mean dice score reached 0.836 in the training set, 0.835 in the validation set (405 patients) and 0.76 in the test set (407 patients; median of 0.81).

Raw TMTV Prediction without Post-Processing
Of the 407 patients of the test dataset, median dice scores of the raw predicted segmentation were 0.71 for 41%, 0.70 for 2.5 and 0.81 for 4 SUV cutoffs.

Clustering of Predicted Segmentation for 41% TMTV Calculation
The mean number of ROIs drawn by physicians during manual segmentation was 21 ± 20, with a median of 15.
The number of ROIs generated by watershed sub-segmentation of the whole-body predicted mask was 33 ± 30, with a median of 25.

Final TMTV Predicted Values Per Methodology
Of the 407 patients of the test dataset, median dice scores of the predicted segmentation were 0.77, 0.70 and 0.90 for the 41%, 2.5 and 4 SUV cutoffs, respectively (Table 2).

Discussion
TMTV calculations in various subtypes of lymphoma have recently become one of the most promising prognostic factors and may help to develop new risk-adapted treatment strategies [23]. Implementation of this new prognostic factor will require a high level of automatization to be routinely calculated.
Different approaches attempting to automatize TMTV calculation have been proposed to reduce the time-consuming delineation task required to calculate tumor burden.
A first level of automation has been implemented in various medical views, such as region growing delineation [24], allowing single-click delineation, but still requires the manual identification of each target. Higher levels of automation have been proposed, such as component trees and connected operators [5], which are based on grey-level image intensity information and node hierarchies. This model enables whole-body segmentation without any input from the physician, but the result still lacks specificity, especially regarding physiological uptakes.
In recent studies, CNN architectures have used cascaded networks, dividing the body into three different regions (head, chest and abdomen) and then segmenting each using 2D or 3D CNNs, as proposed by Jemaa et al. [10], or have used a direct 3D U-NET CNN on patched PET/CT, as proposed by Blanc-Durand et al. [7]. These architectures are similar to ours (V/U-NET); the difference is that the whole-body image is split into subparts to fit the memory constraints of the GPU. A different approach has also been proposed by identifying FDG uptake with multi foci segmentation (MFS) over the image and then classifying the foci using CNNs to predict tumor vs. physiological uptake and localization, as proposed by Capobianco et al. [6]. In this case, the segmentation is defined by the MFS algorithm, which is an additional non-consensual delineation methodology; this explains the reported correlation of 0.76 with manual TMTV 41%, which is fair but lower than that of the current study (r = 0.92 for TMTV 41%).
All these CNN approaches achieved accurate segmentation with reported dice scores from 0.73 to 0.88, but relied on a single methodology for TMTV calculation; thus, they tried to reproduce a particular threshold although no consensual segmentation methodology has yet been defined.

In contrast to cascaded networks, we chose to use a 3D V-NET on the whole-body image, resampled to 4 mm isotropic cubic voxels within a fixed physical CNN input space. This kept the whole-body image (acquisition field up to 102 cm) without altering the image ratio, harmonized the wide range of image resolutions in the dataset, and allowed the CNN to be trained with inputs fitting the GPU memory constraint.
In the present study, the CNN was trained against a probability map of segmentation that reflected the differences between segmentation methodologies. These methodological differences are particularly significant at the edges of the uptake; thus, combining these different thresholds into a probability map forces the CNN to generalize a generic segmentation regardless of the thresholding cutoff. This choice also enables dissociation of lesion detection from the segmentation algorithm, ensuring that the first level of uptake segmentation is generic enough to then be thresholded in a post-processing step to reproduce clinically validated TMTV segmentation algorithms. This two-step strategy is particularly important because it can merge automated TMTV delineation with manual TMTV delineation, as both techniques share the same final segmentation rule.
The achieved median dice score of the TMTVprob on the validation set was 0.81, which is similar to the dice scores reported in the literature, despite the methodological choice to reproduce a generic TMTV segmentation rather than a specific one. This segmentation accuracy is probably due to the large training dataset, which is, to the best of our knowledge, the largest manually labelled dataset of baseline lymphoma PET/CT.
Among the segmentation methodologies, the 41% SUVmax was the hardest to reproduce; in this methodology, the threshold is set per lesion, which requires identifying subparts of the whole-body segmentation inference. Although identifying non-connected segmented tumors is straightforward, lymphoma segmentation is also challenging because of large coalescent and heterogeneous uptakes. To solve this issue, we chose to sub-segment tumor masses over 30 mL, which are likely to be coalescent tumors, to isolate the subparts. For this lesion sub-segmentation, we chose the watershed algorithm, which allows separation of regions of different uptake while maintaining a topographic surface approach: uptake zones of similar intensity are isolated in a continuous tumor unit in which the SUVmax-based threshold can be calculated. This algorithm enables the rough reproduction of manual segmentation from the binary predicted whole-body segmentation and generates a number of ROIs comparable with manual segmentation.
In this study, we have shown a very strong correlation for each segmentation methodology (>0.9) and a relatively acceptable difference in median TMTV (CNN vs. manual). The demonstrated difference in median TMTV (5 to 147 mL) has to be interpreted regarding the high SD of the TMTV value (>400 mL). From these methodologies, the 2.5 SUV threshold appeared harder to reproduce with lower concordance values, probably because this threshold is too low and may include several background voxels which are discordant from other methodologies, and thus more difficult to include in a generic segmentation.
All of these results have been generated without any manual correction, highlighting the very good accuracy of CNN segmentations regardless of the methodology or lymphoma subtypes, because the three main FDG-avid lymphomas have shown similar performances.
In the perspective of using TMTV in decisional trials or in routine clinical practice, manual reviews and corrections will still be needed because visual validation of the segmentation still enables the removal of some obvious false-positives (e.g., tracer extravasation, ectopic kidney) (Figure 4) and the addition of false-negative lesions. These segmentation corrections are limited and do not seem to introduce bias from a statistical point of view in this large cohort, but are still required at the individual patient level. The combination of a well-trained CNN model with a medical image viewer with segmentation features would bring TMTV calculation the ease, reproducibility and accuracy needed to be implementable in routine clinical practice. For this purpose, we developed the free and open-source Dicom-To-CNN [25] library (MIT License), which allows DICOM preprocessing and exporting of CNN outputs using interoperable DICOM formats (DICOM RTSTRUCT and/or DICOM SEG). Collaborations with major open-source medical image processing projects are currently in progress to develop the final integration of this deep learning segmentation algorithm, including inference, visualization and editing of the segmentation output. Once complete, it will be possible to validate the feasibility of routine use of TMTV in a real AI-guided medical workflow. This final validation will evaluate the global TMTV determination time and its inter-reader reproducibility for integration in decisional therapeutic strategies.
In conclusion, TMTV automation with deep learning algorithms has demonstrated very high performance, regardless of the segmentation threshold methodology, in the main FDG-avid lymphoma subtypes, providing a generic segmentation model to be implemented in clinical trials evaluating TMTV-based treatment strategies.

Institutional Review Board Statement:
The clinical studies used in this ancillary analysis were conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board defined for each study. This ancillary study was approved by the scientific council of the LYSA.