
Head and Neck Cancer Segmentation in FDG PET Images: Performance Comparison of Convolutional Neural Networks and Vision Transformers

Department of Biomedical Engineering, The University of Iowa, Iowa City, IA 52242, USA
Department of Biostatistics, The University of Iowa, Iowa City, IA 52242, USA
Department of Radiology, The University of Iowa, Iowa City, IA 52242, USA
Department of Radiation Oncology, University of Iowa Hospitals and Clinics, Iowa City, IA 52242, USA
Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, IA 52242, USA
Author to whom correspondence should be addressed.
Tomography 2023, 9(5), 1933–1948
Original submission received: 31 July 2023 / Revised: 11 October 2023 / Accepted: 13 October 2023 / Published: 18 October 2023


Convolutional neural networks (CNNs) have a proven track record in medical image segmentation. Recently, Vision Transformers were introduced and are gaining popularity for many computer vision applications, including object detection, classification, and segmentation. Machine learning algorithms such as CNNs or Transformers are subject to an inductive bias, which can have a significant impact on the performance of machine learning models. This is especially relevant for medical image segmentation applications where limited training data are available, and a model’s inductive bias should help it to generalize well. In this work, we quantitatively assess the performance of two CNN-based networks (U-Net and U-Net-CBAM) and three popular Transformer-based segmentation network architectures (UNETR, TransBTS, and VT-UNet) in the context of head and neck cancer (HNC) lesion segmentation in volumetric [F-18] fluorodeoxyglucose (FDG) PET scans. For performance assessment, 272 FDG PET-CT scans from a clinical trial (ACRIN 6685) were utilized, which include a total of 650 lesions (primary: 272 and secondary: 378). The image data used are highly diverse and representative of clinical use. For performance analysis, several error metrics were utilized. The achieved Dice coefficient ranged from 0.809 to 0.833, with the best performance being achieved by the CNN-based approaches. U-Net-CBAM, which utilizes spatial and channel attention, showed several advantages for smaller lesions compared to the standard U-Net. Furthermore, our results provide some insight regarding the image features relevant for this specific segmentation application. In addition, the results highlight the need to utilize primary as well as secondary lesions to derive clinically relevant, unbiased segmentation performance estimates.

1. Introduction

Head and neck cancer is most commonly a squamous cell carcinoma of the upper aerodigestive tract and includes the oral cavity, pharynx and larynx as well as borders between these anatomic structures. It is associated with viral infection by human papilloma virus as well as smoking and heavy alcohol use. The cancer frequently spreads to local regional lymph nodes. In head and neck cancer (HNC), F-18 fluorodeoxyglucose (FDG) PET scans are frequently used for treatment planning and quantitative assessment of disease by calculating quantitative features like the standardized uptake value (SUV), metabolic tumor volume (MTV) and total lesion glycolysis (TLG). The elevated glucose uptake is a cardinal imaging feature on PET/CT, and the degree of elevation in uptake as well as the pattern of the uptake may provide important information for both treatment planning and prognosis. Such analyses require the segmentation of lesions in FDG PET scans. Quantitative imaging features can also be used for pre-treatment outcome prediction [1]. While the current standard in clinical practice is still manual segmentation by a radiation oncologist or a trained expert, a number of segmentation methods have been developed with the goal of simplifying this process and increasing segmentation consistency. These methods can be roughly classified into threshold-based methods and more advanced algorithm-based methods [2]. Specifically for HNC, there are many advanced algorithm-based methods such as graph-cut [3], k-nearest neighbor (KNN) [4], Markov random fields [5], and decision trees [6]. Recently, a growing number of approaches are utilizing deep learning methods with the promise of greatly improved performance [7], and a number of deep learning-based methods have been proposed [8,9]. A direct comparison of deep learning methods against the classical machine learning methods can be found in the work by Groendahl et al. [10], which shows that a 2D U-Net [11], a popular and powerful variant of a convolutional neural network (CNN), outperforms all classical methods. A recent summary of state-of-the-art deep learning-based methods is provided by the “HECKTOR” challenge [12], where most participating methods were directly or partially based on the U-Net architecture.
Aside from the success of CNNs, Transformers [13] are gaining attention from the computer vision and medical image analysis community, especially after the proposal of Vision Transformers [14] (ViT), which match or even beat CNNs in various benchmarks. In comparison, CNNs utilize convolutional kernels to derive image features, which capture local information. Thus, representing long-range feature dependencies can be an issue, especially if one wants to keep kernel sizes small to avoid increased computing times. By contrast, the ViT allows representing long-range feature dependencies by using the self-attention module, enabling pairwise interaction between patch embeddings and resulting in more effective global contextual representations. Despite its potential, the application of ViT-based models to HNC segmentation in PET scans has currently not been adequately studied, and only scant literature exists about this topic [15,16]. Sobirov et al. [15] compared a UNETR variant to U-Net-based CNNs. Li et al. [16] recently proposed a cross-modal Swin Transformer and compared it to other networks. Both works rely on utilizing a HECKTOR challenge PET-CT data set, which is exclusively focused on primary lesions.
The contribution of our work is as follows. We provide a comparison of two CNNs, including the highly successful and widely adopted U-Net [11] and a U-Net variant with an integrated attention module, as well as three Transformer-based approaches that are frequently utilized for volumetric image segmentation. In our study design, special focus has been placed on:
  • The use of a clinically relevant HNC PET data set with 650 lesions;
  • Inclusion of primary and secondary lesions;
  • The use of a high-quality expert-generated ground truth;
  • Assessment of the statistical significance of performance differences by means of statistical tests.
The presented study addresses the following existing gaps.
  • Machine-learning-based models like CNNs and Transformers have different inductive biases, which influence a model’s ability to learn from a given data set with certain image characteristics, data set size, etc., and also affect its performance on new data. Thus, to inform method development and optimization efforts, it is worthwhile to investigate the performance of ViT networks in the context of HNC segmentation in PET scans on a reasonably sized image data set.
  • A current shortcoming of existing studies is that secondary lesions are ignored in algorithm performance assessment but are of clinical relevance. Consequently, existing performance estimates could be biased. Secondary lesions are typically harder to segment due to potentially smaller size and lower contrast, but they are used for radiation treatment planning. Furthermore, primary and secondary lesions combined are utilized to calculate indices like MTV and TLG, which are often used as quantitative image features for outcome prediction.
  • There is a need to assess performance differences with regard to their statistical significance to enable meaningful conclusions; such an assessment is often omitted (e.g., [15,16]).
  • To assess systematic over- or under-segmentation, adequate error metrics are required, because this knowledge is relevant for selecting segmentation methods in the context of radiation treatment.

2. Materials and Methods

This section is structured as follows. In Section 2.1, image data, ground truth generation, and preprocessing steps common to all networks are described. The two utilized CNN approaches are introduced in Section 2.2, and the Transformer-based methods are described in Section 2.3. Section 2.4 provides details regarding the utilized post-processing steps. Finally, in Section 2.5, the experimental setup is described.

2.1. Image Data and Pre-Processing

The availability of clinically relevant image data sets in combination with a high-quality ground truth is a major challenge when utilizing learning-based approaches such as neural networks for medical applications like lesion segmentation. To address this issue, our experiments are based on 272 HNC patient PET-CT scans from the national ACRIN-HNSCC-FDG-PET/CT trial (ACRIN 6685, secondary data analysis, data available on The Cancer Imaging Archive [17]). The scans include a total of 272 primary and 378 secondary lesions (i.e., each scan contains one primary lesion and can additionally contain several secondary lesions). An experienced radiation oncologist generated a ground truth for all 650 lesions by utilizing a freely available semi-automated segmentation tool [3], which was implemented in 3D Slicer [18]. All PET scans and corresponding ground truth label maps were first resampled to the median voxel spacing of the entire data set (2.67 × 2.67 × 3 mm) and then cropped around the center of mass of the ground truth label map of each lesion to create training, validation and test volumes of size 48 × 48 × 48 voxels, which is large enough to cover the largest lesion in the data set. In this context, a tradeoff between volume size and training time as well as number of network parameters to be optimized needs to be made. If the lesion center is too close to an image boundary (<24 voxels), the cropped volume is padded using the mean of the image volume so that a consistent 48 × 48 × 48 voxel volume size is maintained. Examples of cropped volumes containing primary and secondary lesions are shown in Figure 1. For intensity normalization of PET scans, a Z-score normalization using the mean and standard deviation of each individual scan was implemented.
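The cropping, padding and normalization steps described above can be illustrated with a short NumPy sketch. It is not the authors' code: the function names are hypothetical, resampling to the common voxel spacing is omitted, and applying the Z-score normalization before cropping is an assumption, since the text does not state the order of these two steps.

```python
import numpy as np

def zscore(volume):
    """Z-score normalize a scan using its own mean and standard deviation."""
    return (volume - volume.mean()) / volume.std()

def center_of_mass(mask):
    """Center of mass of a binary label map, rounded to the nearest voxel index."""
    coords = np.argwhere(mask > 0)
    return np.rint(coords.mean(axis=0)).astype(int)

def crop_around(volume, center, size=48, pad_value=0.0):
    """Crop a size^3 cube centered at `center`.

    Voxels of the cube that fall outside the volume are filled with pad_value,
    so the output always has shape (size, size, size).
    """
    half = size // 2
    out = np.full((size, size, size), pad_value, dtype=volume.dtype)
    src, dst = [], []
    for c, n in zip(center, volume.shape):
        lo, hi = c - half, c + half            # requested range along this axis
        s_lo, s_hi = max(lo, 0), min(hi, n)    # part of the range that exists
        src.append(slice(s_lo, s_hi))
        dst.append(slice(s_lo - lo, s_hi - lo))
    out[tuple(dst)] = volume[tuple(src)]
    return out

def make_training_volume(pet, label, size=48):
    """Normalize the scan, then crop scan and label map around the lesion."""
    pet_n = zscore(pet)
    center = center_of_mass(label)
    patch = crop_around(pet_n, center, size, pad_value=pet_n.mean())  # pad with image mean
    patch_label = crop_around(label, center, size, pad_value=0)       # pad labels with background
    return patch, patch_label
```

Padding the label map with background (0) is a choice made for this sketch; the text specifies only the padding value for the image volume.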

2.2. Convolutional Neural Networks (CNNs)

CNNs are currently the de-facto standard for many computer vision tasks like image classification [19]. The U-Net, introduced by Ronneberger et al. [11], is a type of CNN architecture developed for image segmentation tasks. Due to its excellent segmentation performance in a variety of applications, the U-Net architecture has been widely adopted and is frequently utilized in medical applications. The basic U-Net is described below. Transformers (Section 2.3) utilize an attention mechanism. Therefore, we also studied a U-Net variant with an attention mechanism, the U-Net with the CBAM structure (Section 2.2.2).

2.2.1. U-Net

The U-Net’s architecture consists of a contracting path and an expanding path to enable the network to capture high-level as well as low-level image features. The contracting path makes use of several convolutional and pooling layers that downsample the image, while the expanding path consists of a series of convolutional and upsampling layers that upsample the image back to its original size. In addition, it makes use of skip connections to propagate information from earlier layers to deeper ones and facilitate recovering fine-grained details.

2.2.2. U-Net with CBAM

The convolutional block attention module (CBAM) was proposed by Woo et al. [20]. It was designed to be integrated into feed-forward CNNs and operates on an intermediate feature map. It infers channel and spatial attention maps, which are subsequently multiplied with the input feature map to enable adaptive feature refinement. An illustration of the CBAM is shown in Figure 2. Several approaches have been proposed to integrate CBAM or parts of it (e.g., spatial attention module) into U-Net for more robust segmentation performance [21,22,23,24,25]. These approaches are all focused on 2D segmentation problems, and the majority choose to integrate the attention module into the skip connections in the decoder part of the U-Net. One exception is the work of Guo et al. [23], who proposed a Spatial Attention U-Net (SA-UNet). At the bottleneck of a 2D U-Net, a spatial attention module is inserted to enable focusing on important features and ignoring unnecessary ones. The SA-UNet was designed for retinal vessel segmentation in 2D fundus images.
Research on U-Net-based combined localization and segmentation has demonstrated that the bottleneck of the U-Net contains useful information for image interpretation [26]. Motivated by this insight, Xiong [27] proposed to integrate a CBAM module, consisting of a spatial and channel attention module, into a 3D U-Net architecture for the purpose of head and neck lesion segmentation in volumetric PET images. This network is depicted in Figure 3 and is referred to as U-Net-CBAM.

2.3. Vision Transformer-Based Models

Transformers were introduced by Vaswani et al. [13] for natural language processing. Due to their success, they were adapted to solve computer vision problems, and the Vision Transformer (ViT) proposed by Dosovitskiy et al. [14] achieved state-of-the-art performance in many applications. In medical imaging, Transformers are utilized in applications like image segmentation, classification, and detection [28,29]. The Transformer utilizes a self-attention (SA) mechanism, enabling it to learn the relative importance of a single token (patch embedding) with respect to all other tokens by capturing the interaction among all tokens. Mathematically, this is carried out by transforming the input $X$ ($X \in \mathbb{R}^{N \times D}$) into queries ($Q$), keys ($K$) and values ($V$) using three separate learnable weight matrices $W_Q$ ($W_Q \in \mathbb{R}^{D \times D_q}$), $W_K$ ($W_K \in \mathbb{R}^{D \times D_k}$) and $W_V$ ($W_V \in \mathbb{R}^{D \times D_v}$), where $D_q = D_k$ so that the product $Q K^T$ is defined. The input $X$ is multiplied by the three weight matrices to obtain $Q = X W_Q$, $K = X W_K$, $V = X W_V$. Then, the SA layer output $Z$ is calculated as: $Z = \mathrm{SA}(X) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{D_q}}\right) V$. As an extension of the Transformer, the original ViT [14] integrates the SA mechanism into computer vision for image classification by splitting images into smaller patches and flattening the patches to low-dimensional linear embeddings. The sequences of embeddings are then fed to Transformer encoders.
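As an illustration of the self-attention computation described above, the following NumPy sketch implements a single attention head with scaled dot-product attention. The function names and the random weights are assumptions made for the example. The sketch also assumes that queries and keys share the same dimension, which is needed for the product of Q and the transpose of K to be defined.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention.

    X:   (N, D) token embeddings
    W_q: (D, D_q), W_k: (D, D_k) with D_k == D_q, W_v: (D, D_v)
    Returns the output Z of shape (N, D_v) and the (N, N) attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # pairwise token interactions
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

# Example with N = 6 tokens of dimension D = 8 (all values are random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
Z, A = self_attention(X,
                      rng.normal(size=(8, 4)),
                      rng.normal(size=(8, 4)),
                      rng.normal(size=(8, 5)))
```

Because the weight matrix has one row and one column per token, the cost of this operation grows quadratically with the number of tokens, which is why volumetric models reduce the token count by patching or downsampling before applying it.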
For a systematic comparison of segmentation performance between CNN models (U-Net and U-Net-CBAM) and three representative 3D state-of-the-art Transformer-based models (TransBTS, UNETR and VT-UNet), which were proposed for medical image analysis, all networks were trained and tested using the same framework on the same data set. Note that VT-UNet represents a purer implementation of the Transformer for volumetric image data, while TransBTS and UNETR represent hybrids of a U-Net and a Transformer. Each utilized Transformer-based model is briefly described below.
TransBTS. The core idea of TransBTS [30] (Figure 4) is to replace the bottleneck part of the 3D U-Net with a set of Transformer encoders to model the long-distance dependency in a global space [30]. The contraction–expansion structure from the U-Net is mainly utilized because splitting the data into 3D patches following the ViT makes the model unable to capture the local context information across the whole spatial and depth dimensions for volumetric segmentation. Using convolution blocks with downsampling before the Transformer encoder allows it to learn long-range correlations with a global receptive field with relatively small computational complexity.
UNETR. UNETR (Figure 5), proposed by Hatamizadeh et al. [31], is another example of the combination of CNN and ViT. In contrast to TransBTS, UNETR does not use convolution blocks and downsampling to reduce the feature size of the whole data; instead, it splits the data into 3D volume patches and then employs the Transformer as the main encoder and connects it directly to the decoder via skip connections. More specifically, as shown by Hatamizadeh et al. [31], feature representations are extracted from several different layers of the Transformer encoder and reshaped and projected from the embedding space into the input space via deconvolutional and convolutional layers. At the last Transformer layer, a deconvolutional layer is applied to upsample the feature map size by a factor of 2, and the result is then concatenated with the projected Transformer output from the upper tier. The concatenated feature map is fed to consecutive convolutional layers and subsequently upsampled with a deconvolutional layer. The process is repeated until the original input resolution is reached.
VT-UNet. The VT-UNet (Figure 6) proposed by Peiris et al. [32] also builds on the encoder–decoder-based U-Net architecture. However, instead of trying to combine the CNN with Transformers like TransBTS and UNETR, VT-UNet is purely based on Transformers. This is achieved by two specially designed Transformer blocks. In the encoder, a hierarchical Transformer block is used to capture both local and global information. It is similar to Swin Transformer blocks [33]. In the decoder, a parallel cross-attention and self-attention module is utilized. It enables creating a bridge between queries from the decoder and keys/values from the encoder. This architecture enables preserving global information during the decoding process [32]. More specifically, the Swin Transformer-based encoder expands the original 2D Swin Transformer to create a hierarchical representation of the 3D input by starting from small volume patches, which are gradually merged with neighboring patches in deeper Transformer layers. Computational complexity that is linear in the input size is achieved by computing SA locally within non-overlapping windows that partition the input. In addition, a shift of the window partition between consecutive SA layers provides connections among windows and significantly enhances the modeling power [33]. The idea of parallelization in the decoder is to mimic the skip connections in the U-Net and serve the same purpose, enabling a connection between lower-level spatial features from lower layers and higher-level semantic features from upper layers.
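The window partitioning idea behind the Swin-style encoder can be sketched in a few lines of NumPy. This is only an illustration of the partition and shift operations on a token grid, under the assumption that each spatial dimension is divisible by the window size; the function names are hypothetical, and the attention masking that a real shifted-window implementation applies after the cyclic shift is omitted.

```python
import numpy as np

def window_partition(x, w):
    """Split a (D, H, W, C) token grid into non-overlapping w*w*w windows.

    Returns an array of shape (num_windows, w**3, C). Self-attention is then
    computed independently inside each window.
    """
    D, H, W, C = x.shape
    x = x.reshape(D // w, w, H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three window indices first
    return x.reshape(-1, w ** 3, C)

def shifted_window_partition(x, w):
    """Cyclically shift the grid by w // 2 along each axis, then partition.

    Alternating between the regular and the shifted partition lets tokens
    that sat in different windows interact in the next layer.
    """
    s = w // 2
    shifted = np.roll(x, shift=(-s, -s, -s), axis=(0, 1, 2))
    return window_partition(shifted, w)

# Example: an 8 x 8 x 8 grid of 2-channel tokens, window size 4.
grid = np.arange(8 * 8 * 8 * 2).reshape(8, 8, 8, 2)
windows = window_partition(grid, 4)
shifted = shifted_window_partition(grid, 4)
```

With N tokens in total and M tokens per window, windowed attention evaluates on the order of N·M token pairs instead of N², which is the source of the linear scaling with input size.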

2.4. Post-Processing

For all segmentation results, the same post-processing algorithm is applied to remove potential islands and holes. First, a connected component labeling algorithm (e.g., see [34]) is applied to the segmentation output. Then, all components except the largest are removed. Then, the resulting label map is inverted, and the two previous steps are repeated on this inverted label map (background). Finally, the resulting label map is inverted again to produce the final segmentation label map.
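The post-processing steps above can be sketched as follows. The sketch uses a simple breadth-first connected component search so that it depends only on NumPy and the standard library. The choice of 6-connectivity is an assumption, as the text does not specify the connectivity, and the function names are hypothetical.

```python
from collections import deque
import numpy as np

# Face-connected (6-connectivity) neighbor offsets; an assumption of this sketch.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def largest_component(mask):
    """Return a boolean mask containing only the largest connected component."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros(mask.shape, dtype=bool)
    best = []
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, component = deque([seed]), [seed]
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in OFFSETS:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    component.append(n)
        if len(component) > len(best):
            best = component
    out = np.zeros(mask.shape, dtype=bool)
    if best:
        out[tuple(np.array(best).T)] = True
    return out

def postprocess(segmentation):
    """Remove islands, then fill holes by repeating the step on the background."""
    foreground = largest_component(segmentation)   # drop all but the largest object
    background = largest_component(~foreground)    # drop all but the largest background region
    return ~background                             # invert back; enclosed holes are now filled
```

Running the same routine on the inverted label map treats enclosed holes as small background islands, so a single helper function covers both clean-up steps.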

2.5. Experimental Setup

2.5.1. Network Training

The hyperparameters of each network were manually optimized. For cross-validation, the 650 cropped lesion PET images and the corresponding ground truth label maps were split into three folds with roughly the same number of lesion scans so that three independent experiments could be performed. In each of the three experiments, two folds were used for training and the remaining fold was used as the test data. Per experiment, a five-fold cross-validation approach was adopted, resulting in five trained models. Furthermore, all cross-validation folds were stratified per patient. Network weights were initialized using the Kaiming initialization [35] with a normal distribution. The initial learning rate was set to 0.0001. It was reduced during network training as outlined by the “polyLR” approach [36]. The sum of the binary cross-entropy (BCE) loss and the soft Dice loss was used as the loss function. As there can be significant differences in terms of convergence speed between CNN and Transformer models, all networks were over-trained for 3000 epochs, and for each network the model with the best validation loss during training was selected as the final model. In addition, data augmentation was applied, including random rotation (range: ±45 degrees per axis), random flipping and random scaling (range: ±25%).
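To make the learning rate schedule and loss function more concrete, the sketch below shows one common form of a polynomial learning rate decay and a combined BCE plus soft Dice loss in NumPy. The exponent of 0.9 and the smoothing constant are assumptions taken from common practice and are not stated in the text. The function names are hypothetical, and a real training loop would use a deep learning framework with automatic differentiation.

```python
import numpy as np

def poly_lr(epoch, max_epochs=3000, initial_lr=1e-4, exponent=0.9):
    """Polynomial learning rate decay; exponent=0.9 is an assumed default."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent

def bce_dice_loss(prob, target, eps=1e-6):
    """Sum of binary cross-entropy and soft Dice loss.

    prob:   predicted foreground probabilities in [0, 1]
    target: binary ground truth of the same shape
    eps:    small constant (assumed) that avoids log(0) and division by zero
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    intersection = np.sum(prob * target)
    soft_dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return bce + (1.0 - soft_dice)
```

The BCE term penalizes every voxel independently, while the soft Dice term measures overlap and is therefore less sensitive to the imbalance between lesion and background voxels.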

2.5.2. Network Application

To fully utilize the five-fold cross-validation setup (Section 2.5.1) in each of the three experiments, we used an ensemble approach by taking the average of the five segmentation results from five trained models.
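A minimal sketch of such an ensemble is given below. It assumes that each of the five models outputs a voxel-wise foreground probability map and that the averaged map is binarized at 0.5. Both points are assumptions, since the text only states that the five results are averaged, and the function name is hypothetical.

```python
import numpy as np

def ensemble_segmentation(prob_maps, threshold=0.5):
    """Average the per-model probability maps and threshold the mean.

    prob_maps: sequence of arrays with identical shape, one per trained model
    threshold: assumed cut-off for converting the mean map to a binary label map
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob >= threshold
```

If the individual model outputs are already binary, the same function acts as a majority vote among the five models.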

2.5.3. Performance Metrics

For segmentation performance assessment, the following error metrics were utilized: the Dice coefficient (Dice) [37], signed distance error ($d_s$) [38], and unsigned distance error ($d_u$) [38]. The selection was motivated by their relevance for clinical application. For example, the signed distance error enables assessing if there is a systematic bias, which is relevant for radiation treatment, because it might result in lesions receiving too low a dose or normal tissue receiving too high a dose. Note that for calculating $d_s$, segmentation surface points inside the reference are assigned a negative distance. Furthermore, a lesion segmentation is considered as failed if its Dice coefficient is at or below the 0.6 level, as manual correction of errors becomes increasingly inefficient and impractical. Successful and failed segmentation cases are reported by providing percentages for each category. In addition, lesion sizes typically vary (Figure 1). Thus, we utilized Bland–Altman plots to assess the impact of lesion size on segmentation performance. In this context, we note that smaller lesions remain a clinical challenge since coverage is ultimately important for tumor control.
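The Dice coefficient and the failure criterion described above can be expressed compactly. The sketch below is an illustration with hypothetical function names. Returning 1.0 when both masks are empty is a convention chosen for this example and is not taken from the text.

```python
import numpy as np

def dice_coefficient(segmentation, reference):
    """Dice coefficient between two binary masks of identical shape."""
    seg = np.asarray(segmentation, dtype=bool)
    ref = np.asarray(reference, dtype=bool)
    denominator = seg.sum() + ref.sum()
    if denominator == 0:
        return 1.0  # both masks empty; convention assumed for this sketch
    return 2.0 * np.logical_and(seg, ref).sum() / denominator

def is_failed(dice, threshold=0.6):
    """A segmentation counts as failed if its Dice is at or below the threshold."""
    return dice <= threshold
```

As a worked example, a 4 × 4 × 4 voxel reference cube and a prediction of the same size shifted by one voxel along one axis share 48 voxels, giving a Dice coefficient of 2 · 48 / (64 + 64) = 0.75, which counts as a successful segmentation under the 0.6 criterion.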
Linear mixed-effects regression models were applied to segmentation performance measures from cross-validation folds in order to statistically test the significance of mean differences between networks. Random effects were included in the models to account for correlation in means between cross-validation folds.

3. Results

Table 1 provides the percentage of segmentations that were deemed as successful and failed. For all successful cases, Table 2 summarizes the performance metrics for all five segmentation approaches. In addition, corresponding box plots of the Dice coefficient (DSC), $d_u$ and $d_s$ are depicted in Figure 7, Figure 8 and Figure 9, respectively. As can be seen in Figure 7, medians as well as first- and third-quartile levels are lower for Transformer-based approaches when compared to U-Net variants. Based on the results for the Dice coefficient, the networks were compared to assess the statistical significance of differences. The U-Net-CBAM approach (good trade-off between DSC and percentage of successful segmentations) performed significantly better ($p < 0.05$) than UNETR ($p = 0.0070$) and VT-UNet ($p = 0.0176$). By contrast, the comparisons with U-Net ($p = 0.9135$) and TransBTS ($p = 0.0763$) were not found to be significant. Differences in the unsigned and signed distances were not statistically significant ($p > 0.05$). All five approaches showed a segmentation bias with the tendency to produce larger segmentations compared to the reference standard.
Furthermore, we assessed the impact of lesion volume on the segmentation behavior of the top two performing networks using all segmentation results (i.e., failed and successful cases). For this purpose, the volumes of lesion segmentations generated by the U-Net and U-Net-CBAM approaches were compared to the independent reference standard by using Bland–Altman plots (Figure 10). Specifically, for each reference and generated segmentation pair, the mean of the two measurements was assigned to the x-axis, and the difference between reference and prediction was assigned to the y-axis. Both methods show a bias, which is mainly driven by outliers (failed segmentations). The confidence intervals are tighter for the U-Net-CBAM approach compared to the standard U-Net. Also, R-squared ($R^2$) values are provided to assess the goodness-of-fit between segmentation and reference volumes (Figure 10), indicating a better fit for the U-Net-CBAM.
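The quantities shown in a Bland–Altman plot can be computed as sketched below. The x and y values follow the description above (mean of the two volumes, and reference minus prediction). The interval is computed as the bias ± 1.96 standard deviations of the differences, which is the conventional definition of the limits of agreement and an assumption about how the intervals in Figure 10 were obtained. The function name is hypothetical.

```python
import numpy as np

def bland_altman(reference_volumes, predicted_volumes):
    """Return x values, y values, bias and limits of agreement.

    x: mean of reference and predicted volume per lesion
    y: reference minus predicted volume per lesion
    """
    reference = np.asarray(reference_volumes, dtype=float)
    predicted = np.asarray(predicted_volumes, dtype=float)
    means = (reference + predicted) / 2.0
    diffs = reference - predicted
    bias = diffs.mean()
    sd = diffs.std(ddof=1)                      # sample standard deviation
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits
```

With this sign convention, a negative bias indicates that the predicted volumes are on average larger than the reference volumes.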
For U-Net and U-Net-CBAM, approximately 90% of failures occurred for secondary lesions. By looking at all segmentation results of secondary lesions (i.e., failed and successful cases), we can observe performance differences between the mean error values of U-Net-CBAM and U-Net (Dice: 0.689 vs. 0.686; $d_u$: 1.273 mm vs. 1.313 mm; and $d_s$: 1.074 mm vs. 1.084 mm). Furthermore, for CNNs and Transformers, a substantial drop of the Dice coefficient on secondary lesions (i.e., hot lymph nodes) can be observed when compared to the performance on primary lesions. Table 3 summarizes the mean relative performance difference in percentage. While TransBTS shows the smallest performance difference, we note that the overall mean Dice coefficient of TransBTS is lower compared to CNNs. Among the better performing networks, CNNs have the smallest performance difference, and U-Net-CBAM outperforms the standard U-Net. Considering these substantial differences, we conclude that it is imperative for a “real-world”, clinically relevant performance assessment of HNC segmentation approaches to consider both primary and secondary lesions, so as to avoid any potential biases in performance estimation.
Figure 11 and Figure 12 depict typical cases of a primary and secondary lesion segmentation, respectively. In addition, Figure 13 and Figure 14 show poorly performing cases of primary and secondary lesion segmentation, respectively. Note that the lesions in well-performing cases tend to be larger and to have better contrast.

4. Discussion

4.1. Segmentation Performance

On our data set, the two CNNs outperformed the three Transformer-based models investigated. The U-Net-CBAM approach represents a promising alternative to the standard U-Net. The addition of CBAM into the U-Net bottleneck resulted in better average segmentation error values across a number of error metrics investigated. By analyzing individual segmentation results, we observed that the improvements mostly resulted from better segmentation of secondary lesions (e.g., lymph nodes), which are more difficult to segment, because of the typically smaller size and lower contrast. Thus, we speculate that the refined features from the CBAM are more beneficial to the network when differentiating between foreground and background, which is more challenging yet equally important for long-term cancer control. Overall, when considering multiple clinically relevant factors like outlier percentage, Dice coefficient, distance errors, and performance on secondary lesions, we conclude that U-Net-CBAM provides the most clinically applicable and promising approach. However, in our experiments, these differences between U-Net-CBAM and U-Net were not found to be statistically significant. Thus, in future work, we plan on comparing these two networks on an enlarged data set to better assess the achievable performance differences and potential of clinical impact in correctly identifying areas at risk of tumor recurrence and hence targets for radiation therapy, and simultaneously minimizing the normal tissue identified to reduce acute and long-term sequelae of treatment.
Among the three Transformer-based models, TransBTS and VT-UNet are the most promising, and TransBTS showed the best performance. While TransBTS is a Transformer and 3D U-Net hybrid, VT-UNet is purely based on Transformers. This demonstrates the potential of the hierarchical Transformer structure and the attention bridging between encoders and decoders in semantic segmentation.
It is well-known that different network architectures have different inductive biases. For example, CNNs tend to classify images by texture and make limited use of the (global) shape [39]. By contrast, Tuli et al. [40] argue that Transformer-based approaches are more flexible, because they are not bound to utilizing convolution-type features, and have a higher shape bias, similar to human vision. Thus, we speculate that for HNC lesion segmentation in PET scans, shape seems to be less relevant, perhaps since most lesions (especially secondary lesions that represent lymph nodes) are roughly spherical or elliptical. In addition, Transformers tend to converge more slowly and require more data during training. While Transformer-based models have the potential to achieve a better segmentation performance, it seems likely that to achieve such improvements, a substantially larger training data set would be required. This is often an issue in medical applications, where available training data are typically limited and more effort is required to generate such data.
Our results show a clear segmentation performance difference for all networks when comparing results on primary and secondary head and neck lesions. The lower segmentation performance in the case of secondary lesions is expected, because they can be more difficult to differentiate from background FDG uptake yet remain clinically important areas for treatment and hence for achieving tumor control. Thus, we argue for the need to also include secondary lesions in image data sets utilized for training and testing CNNs and Transformers to avoid biases in segmentation performance estimates.

4.2. Current Limitations and Future Work

The goal of our work was to assess the base performance of Transformers in comparison to CNNs. Consequently, we have refrained from tweaking the investigated network structures so as to enable a fair comparison. We believe that the comparison of optimized variants is best performed in the form of challenges like HECKTOR [12]. However, the insights gained in this study are relevant for such challenges to enable clinically meaningful conclusions. Furthermore, the available training data set size also plays a role in model selection and needs to be considered. For example, for our data set size, an optimization of CNN-based methods seems to be more promising. To provide more guidance, future studies of segmentation performance’s dependence on data set size are needed.
In this work we have focused on FDG PET-driven lesion segmentation, because of its proven ability to indicate cancer regions critical for radiation treatment planning and subsequent treatment delivery. Future work will focus on assessing the impact of including CT images for lesion segmentation. Furthermore, studies are needed to assess the preference of the end users (i.e., radiation oncologists) as well as the clinical impact of using algorithms for lesion segmentation. While the preferences of individuals might vary, it will be interesting to find out whether commonalities exist or not.
One goal of our study was to assess deep-learning-based methods for their suitability to replace a graph-based segmentation algorithm of a previously published semiautomated tool designed specifically for HNC segmentation in FDG PET scans [3]. In this scenario, a trained expert provides guidance and identifies the center location of a lesion. To simulate this situation, the image volumes were cropped around the center of the target lesion. Thus, the localization of target lesions needs to be provided. Consequently, if automated segmentation of lesions is desired, a target localization approach will be necessary.

5. Conclusions

We have compared CNN-based networks (U-Net-CBAM and U-Net) and Transformer-based segmentation network architectures (VT-UNet, UNETR and TransBTS) in the context of HNC lesion segmentation in 3D FDG PET scans. Our results show that CNN-based architectures have an advantage in segmentation performance over the current Transformer-based network architectures in the case of the available HNC lesion FDG PET image data collected by the ACRIN 6685 clinical trial. As such, the utilized image data are quite diverse and highly relevant for clinical use, as they include primary as well as secondary lesions. Furthermore, our results provide some insight regarding segmentation-relevant image features.

Author Contributions

Conceptualization, X.X., B.J.S., S.A.G., M.M.G., J.M.B. and R.R.B.; methodology, X.X., B.J.S., J.M.B. and R.R.B.; software, X.X.; validation, X.X., B.J.S., M.M.G., J.M.B. and R.R.B.; formal analysis, X.X., B.J.S. and R.R.B.; investigation, X.X., B.J.S., J.M.B. and R.R.B.; resources, J.M.B. and R.R.B.; data curation, X.X., S.A.G., M.M.G., J.M.B. and R.R.B.; writing—original draft preparation, X.X., B.J.S., J.M.B. and R.R.B.; writing—review and editing, X.X., B.J.S., S.A.G., M.M.G., J.M.B. and R.R.B.; visualization, X.X. and R.R.B.; supervision, J.M.B. and R.R.B.; project administration, J.M.B. and R.R.B.; funding acquisition, M.M.G., J.M.B. and R.R.B. All authors have read and agreed to the published version of the manuscript.


Funding

This research was funded by NIH/NCI grant number U01CA140206 and the Burke Family Foundation.

Institutional Review Board Statement

Not applicable as this work represents secondary data analysis and data were not identifiable by the authors.

Informed Consent Statement

Not applicable.

Data Availability Statement

PET-CT images are available at The Cancer Imaging Archive (accessed on 19 June 2020).

Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Castelli, J.; De Bari, B.; Depeursinge, A.; Simon, A.; Devillers, A.; Roman Jimenez, G.; Prior, J.; Ozsahin, M.; de Crevoisier, R.; Bourhis, J. Overview of the predictive value of quantitative 18 FDG PET in head and neck cancer treated with chemoradiotherapy. Crit. Rev. Oncol. Hematol. 2016, 108, 40–51.
  2. Im, H.J.; Bradshaw, T.; Solaiyappan, M.; Cho, S.Y. Current Methods to Define Metabolic Tumor Volume in Positron Emission Tomography: Which One is Better? Nucl. Med. Mol. Imaging 2018, 52, 5–15.
  3. Beichel, R.R.; Van Tol, M.; Ulrich, E.J.; Bauer, C.; Chang, T.; Plichta, K.A.; Smith, B.J.; Sunderland, J.J.; Graham, M.M.; Sonka, M.; et al. Semiautomated segmentation of head and neck cancers in 18F-FDG PET scans: A just-enough-interaction approach. Med. Phys. 2016, 43, 2948–2964.
  4. Yu, H.; Caldwell, C.; Mah, K.; Mozeg, D. Coregistered FDG PET/CT-based textural characterization of head and neck cancer for radiation treatment planning. IEEE Trans. Med. Imaging 2009, 28, 374–383.
  5. Yang, J.; Beadle, B.M.; Garden, A.S.; Schwartz, D.L.; Aristophanous, M. A multimodality segmentation framework for automatic target delineation in head and neck radiotherapy. Med. Phys. 2015, 42, 5310–5320.
  6. Berthon, B.; Evans, M.; Marshall, C.; Palaniappan, N.; Cole, N.; Jayaprakasam, V.; Rackley, T.; Spezi, E. Head and neck target delineation using a novel PET automatic segmentation algorithm. Radiother. Oncol. 2017, 122, 242–247.
  7. Visvikis, D.; Cheze Le Rest, C.; Jaouen, V.; Hatt, M. Artificial intelligence, machine (deep) learning and radio(geno)mics: Definitions and nuclear medicine imaging applications. Eur. J. Nucl. Med. Mol. Imaging 2019, 46, 2630–2637.
  8. Huang, B.; Chen, Z.; Wu, P.M.; Ye, Y.; Feng, S.T.; Wong, C.O.; Zheng, L.; Liu, Y.; Wang, T.; Li, Q.; et al. Fully Automated Delineation of Gross Tumor Volume for Head and Neck Cancer on PET-CT Using Deep Learning: A Dual-Center Study. Contrast Media Mol. Imaging 2018, 2018, 8923028.
  9. Guo, Z.; Guo, N.; Gong, K.; Zhong, S.; Li, Q. Gross tumor volume segmentation for head and neck cancer radiotherapy using deep dense multi-modality network. Phys. Med. Biol. 2019, 64, 205015.
  10. Groendahl, A.R.; Skjei Knudtsen, I.; Huynh, B.N.; Mulstad, M.; Moe, Y.M.M.; Knuth, F.; Tomic, O.; Indahl, U.G.; Torheim, T.; Dale, E.; et al. A comparison of fully automatic segmentation of tumors and involved nodes in PET/CT of head and neck cancers. Phys. Med. Biol. 2021, 66, 065012.
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
  12. Oreiller, V.; Andrearczyk, V.; Jreige, M.; Boughdad, S.; Elhalawani, H.; Castelli, J.; Vallières, M.; Zhu, S.; Xie, J.; Peng, Y.; et al. Head and neck tumor segmentation in PET/CT: The HECKTOR challenge. Med. Image Anal. 2021, 77, 102336.
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  14. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  15. Sobirov, I.; Nazarov, O.; Alasmawi, H.; Yaqub, M. Automatic Segmentation of Head and Neck Tumor: How Powerful Transformers Are? In Proceedings of the 5th International Conference on Medical Imaging with Deep Learning, Zurich, Switzerland, 6–8 July 2022; Volume 172, pp. 1149–1161.
  16. Li, G.Y.; Chen, J.; Jang, S.I.; Gong, K.; Li, Q. SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images. arXiv 2023, arXiv:2302.03861.
  17. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057.
  18. Fedorov, A.; Beichel, R.; Kalpathy-Cramer, J.; Finet, J.; Fillion-Robin, J.C.; Pujol, S.; Bauer, C.; Jennings, D.; Fennessy, F.; Sonka, M.; et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn. Reson. Imaging 2012, 30, 1323–1341.
  19. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25.
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 3–19.
  21. Khanh, T.L.B.; Dao, D.P.; Ho, N.H.; Yang, H.J.; Baek, E.T.; Lee, G.; Kim, S.H.; Yoo, S.B. Enhancing U-Net with Spatial-Channel Attention Gate for Abnormal Tissue Segmentation in Medical Imaging. Appl. Sci. 2020, 10, 5729.
  22. Tong, X.; Wei, J.; Sun, B.; Su, S.; Zuo, Z.; Wu, P. ASCU-Net: Attention Gate, Spatial and Channel Attention U-Net for Skin Lesion Segmentation. Diagnostics 2021, 11, 501.
  23. Guo, C.; Szemenyei, M.; Yi, Y.; Wang, W.; Chen, B.; Fan, C. SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1236–1242.
  24. Kazaj, P.M.; Koosheshi, M.; Shahedi, A.; Sadr, A.V. U-Net-based Models for Skin Lesion Segmentation: More Attention and Augmentation. arXiv 2022, arXiv:2210.16399.
  25. Xu, Y.; Hou, S.K.; Wang, X.Y.; Li, D.; Lu, L. C+ref-UNet: A novel approach for medical image segmentation based on multi-scale connected UNet and CBAM. SSRN Electron. J. 2022.
  26. Xiong, X.; Smith, B.J.; Graves, S.A.; Sunderland, J.J.; Graham, M.M.; Gross, B.A.; Buatti, J.M.; Beichel, R.R. Quantification of uptake in pelvis F-18 FLT PET-CT images using a 3D localization and segmentation CNN. Med. Phys. 2022, 49, 1585–1598.
  27. Xiong, X. Deep Convolutional Neural Network Based Analysis Methods for Radiation Therapy Applications. Ph.D. Thesis, University of Iowa, Iowa City, IA, USA, 2022.
  28. Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in Medical Imaging: A Survey. arXiv 2022, arXiv:2201.09873.
  29. He, K.; Gan, C.; Li, Z.; Rekik, I.; Yin, Z.; Ji, W.; Gao, Y.; Wang, Q.; Zhang, J.; Shen, D. Transformers in Medical Image Analysis: A Review. Intell. Med. 2022, 3, 59–78.
  30. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. TransBTS: Multimodal Brain Tumor Segmentation Using Transformer. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021, Strasbourg, France, 27 September–1 October 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 109–119.
  31. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; IEEE Computer Society: Washington, DC, USA, 2022; pp. 1748–1758.
  32. Peiris, H.; Hayat, M.; Chen, Z.; Egan, G.; Harandi, M. A Volumetric Transformer for Accurate 3D Tumor Segmentation. arXiv 2021, arXiv:2111.13300.
  33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021.
  34. Zhao, X.; He, L.; Wang, Y.; Chao, Y.; Yao, B.; Hideto, K.; Atsushi, O. An Efficient Method for Connected-Component Labeling in 3D Binary Images. In Proceedings of the 2018 International Conference on Robots and Intelligent System (ICRIS), Changsha, China, 26–27 May 2018; pp. 131–133.
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
  36. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  37. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302.
  38. Sonka, M.; Hlavac, V.; Boyle, R. Image Processing: Analysis and Machine Vision; CL Engineering: New York, NY, USA, 2007.
  39. Baker, N.; Lu, H.; Erlikhman, G.; Kellman, P.J. Deep convolutional networks do not classify based on global object shape. PLoS Comput. Biol. 2018, 14, e1006613.
  40. Tuli, S.; Dasgupta, I.; Grant, E.; Griffiths, T.L. Are Convolutional Neural Networks or Transformers More Like Human Vision? arXiv 2021, arXiv:2105.07197.
Figure 1. Cross-sections of a typical primary lesion (left) and a secondary lesion (right).
Figure 2. Illustration of CBAM. The module has two sequential sub-modules: channel and spatial attention. The channel attention utilizes both max-pooling and average-pooling outputs along the feature axis with a shared MLP layer, while the spatial attention utilizes the two similar outputs along the channel axis and forwards them to a convolution layer.
Figure 3. The U-Net-CBAM network integrates a CBAM into the bottleneck of a 3D U-Net.
Figure 4. Overview of TransBTS network architecture.
Figure 5. Overview of UNETR network architecture.
Figure 6. Overview of VT-UNet network architecture.
Figure 7. Box plot of the Dice coefficient.
Figure 8. Box plot of the unsigned distance error (d_u).
Figure 9. Box plot of the signed distance error (d_s).
Figure 10. Bland–Altman plots with the representation of the limits of agreement from −1.96*SD to +1.96*SD, comparing the volume of segmentations of U-Net (top) and U-Net-CBAM (bottom) networks to the reference standard. In addition, R² values are shown. Note that the plots and R² values are based on all (successful and failed) cases.
Figure 11. Comparison of segmentation results of a typical primary lesion from all five networks.
Figure 12. Comparison of segmentation results of a typical secondary lesion from all five networks.
Figure 13. Comparison of segmentation results of a difficult primary lesion from all five networks.
Figure 14. Comparison of segmentation results of a difficult secondary lesion from all five networks.
Table 1. Percentage of successful and failed segmentations per approach.
Successful segmentations (%)
Failed segmentations (%): 18.0, 26.8, 17.2, 16.5, 17.8
Table 2. Comparison of segmentation performance for successful lesion segmentations. Values are given in mean ± standard deviation format.
DSC (-): 0.833 ± 0.091, 0.809 ± 0.101, 0.833 ± 0.092, 0.819 ± 0.098, 0.813 ± 0.090
d_u (mm): 0.684 ± 0.489, 0.806 ± 0.530, 0.682 ± 0.464, 0.740 ± 0.556, 0.785 ± 0.510
d_s (mm): 0.538 ± 0.623, 0.663 ± 0.734, 0.504 ± 0.550, 0.559 ± 0.688, 0.574 ± 0.636
Table 3. Mean relative difference in Dice coefficient when segmenting secondary lesions compared to primary lesions considering all (successful and failed) segmentation results.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xiong, X.; Smith, B.J.; Graves, S.A.; Graham, M.M.; Buatti, J.M.; Beichel, R.R. Head and Neck Cancer Segmentation in FDG PET Images: Performance Comparison of Convolutional Neural Networks and Vision Transformers. Tomography 2023, 9, 1933-1948.
