GETNet: Group Normalization Shuffle and Enhanced Channel Self-Attention Network Based on VT-UNet for Brain Tumor Segmentation

Currently, brain tumors are extremely harmful and prevalent. Deep learning technologies, including CNNs, UNet, and Transformer, have been applied in brain tumor segmentation for many years and have achieved some success. However, traditional CNNs and UNet capture insufficient global information, and Transformer cannot provide sufficient local information. Fusing the global information from Transformer with the local information of convolutions is an important step toward improving brain tumor segmentation. We propose the Group Normalization Shuffle and Enhanced Channel Self-Attention Network (GETNet), a network combining the pure Transformer structure with convolution operations based on VT-UNet, which considers both global and local information. The network includes the proposed group normalization shuffle block (GNS) and enhanced channel self-attention block (ECSA). The GNS is used after the VT Encoder Block and before the downsampling block to improve information extraction. An ECSA module is added to the bottleneck layer to utilize the characteristics of the detailed features in the bottom layer effectively. We also conducted experiments on the BraTS2021 dataset to demonstrate the performance of our network. The Dice coefficient (Dice) score results show that the values for the regions of the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) were 91.77, 86.03, and 83.64, respectively. The results show that the proposed model achieves state-of-the-art performance compared with eleven benchmark methods.


Introduction
A brain tumor may cause symptoms such as headache, dizziness, nausea and vomiting, lethargy, weakness, and optic disc edema [1]. If the tumor is large and compresses the optic nerve, decreased and blurred vision may occur [2]. Generally, tumors in the brain are more serious than those in other parts of the body. Both benign and malignant tumors can continue to grow, increasing intracranial pressure [3]. Additionally, they can compress brain tissue and affect brain function, having a greater impact. Treatment plans for brain tumors are generally developed through medical imaging. Common medical imaging methods include X-ray imaging, computed tomography (CT), and magnetic resonance imaging (MRI) [4]. Compared with CT, MRI can obtain tomographic images in any direction, which helps display the anatomical relationships between tissue structures, clarify the origin and scope of lesions, and accurately diagnose the disease. It is also safer, avoiding the radiation damage that can be caused by traditional imaging methods such as CT and X-ray [5]. MRI is widely used in clinical practice because it does not use radiation and produces high-resolution soft tissue and multisequence imaging. Although MRI is very helpful in the treatment of brain tumors, segmentation by hand is subject to human error, and evaluations vary among radiologists, leading to inconsistent results.
Deep learning can automatically learn useful features from many medical images [6], such as brain tumor shape, size, and boundary information, which are important in brain tumor segmentation and disease analysis. Convolutional neural network (CNN) models based on deep learning have advantages in image processing [7]. Convolution and pooling layers are used in CNNs: the convolution layer is responsible for extracting image features, and the pooling layer greatly reduces the number of dimensions. Unlike CNNs, which are mainly used for feature extraction, classification, and the regression of input images, a fully convolutional network (FCN) adopts a fully convolutional structure and can adapt to variable-size input images. Thus, FCNs are widely used in image segmentation tasks. UNet is an excellent model that performs well in medical image segmentation. However, MRI images generally contain depth information, which is not fully utilized by the traditional 2D UNet; thus, 3D UNet was developed and has been popular for many years. Many researchers have developed excellent methods based on 3D UNet to complete segmentation tasks in the case of brain tumors. One of the weaknesses of 3D UNet is its limited ability to extract long-distance information, despite many attempts, such as atrous spatial pyramid pooling (ASPP), to expand the receptive field. Transformer-based deep learning methods have achieved good results in natural language processing (NLP) and have been introduced into image processing to address this challenge. The Pyramid Vision Transformer (PVT), Swin Transformer, and Volumetric Transformer (VT-UNet), all based on the Transformer, can capture the dependencies between different features through a self-attention mechanism, especially long-distance dependencies, which is an obvious advantage in image processing [8]. One of the challenges with the Transformer is its limited focus on local details. We propose GETNet to improve the segmentation of brain tumor images and address the challenge of fusing global features and local details.
In this paper, we focus on integrating modules that extract local features with blocks that capture long-distance relationships. The main contributions of our work are as follows:
• We propose GETNet, a new network for brain tumor segmentation that combines 3D convolution with VT-UNet to comprehensively capture delicate local information and global semantic information and improve brain tumor segmentation performance.
• We develop a GNS block between the VT Encoder Block and the downsampling module to enable the Transformer architecture to obtain local information effectively.
• We design an ECSA block in the bottleneck layer to enhance the model's detailed feature extraction.
This paper is organized as follows: related work is described in Section 2. The materials and methods are presented in Section 3. In Section 4, comparison results and ablation experiments are presented and analyzed. Finally, Section 5 provides the discussion and conclusion.

Related Work

Deep-Learning-Based Methods for Medical Image Segmentation
In recent years, research on image analysis and segmentation has made breakthroughs with the proposal of deep learning methods, as represented by CNNs [9]. CNNs can learn representative image features by continuously iterating model parameters and then constructing a model for subsequent segmentation tasks. The block-based CNN method uses image blocks as the network input and adopts block classification to replace pixel classification in the image. Because traditional CNNs use an image-based sliding window for segmentation, the overlap between adjacent image blocks leads to repeated convolution calculations during training, increasing calculation time and reducing efficiency. Long et al. [10] proposed using an FCN to classify images at the pixel level to solve the problem of image segmentation at the semantic level and achieved good results. An FCN can classify images at the pixel level and has no limit in terms of input image size; it can reduce the number of computations and improve the efficiency of segmentation compared with traditional CNNs. However, this approach lacks spatial consistency and does not fully use contextual information. In 2015, Ronneberger et al. [11] proposed UNet, a U-shaped CNN for medical image segmentation, to address this challenge. Özgün Çiçek et al. [12] proposed 3D UNet, which extends the previous UNet by replacing all 2D operations with 3D operations. Applying three-dimensional depth information is helpful for improving the performance of brain tumor segmentation. Recently, many network variants have been proposed due to the success of 3D UNet. Many of these networks attempt to expand the receptive field to extract global features. DeepLabv1 was proposed by Chen et al. [13] to ensure that feature resolution is not reduced and that the network has a larger receptive field. DeepLabv2, DeepLabv3, and DeepLabv3+ were subsequently developed by Chen et al. [14–16]. Chen et al.
[17] designed DMFNet to construct multiscale feature representations via 3D dilated convolutions. Xu et al. [18] proposed a network to capture multiscale information using 3D atrous spatial pyramid pooling (ASPP). Jiang et al. [19] developed AIU-Net with the ASPP module to expand the receptive field and increase the width and depth of the network. Parvez Ahmad et al. [20] designed RD2A 3D UNet to preserve more contextual information at small sizes. A multiscale feature extraction module was developed by Wang et al. [21] to cover more receptive fields and improve the ability to capture features at different scales. The E1D3 network was introduced by Syed Talha Bukhari et al. [22] to perform effective multiclass segmentation; it is a one-encoder, three-decoder fully convolutional architecture in which each decoder segments one of the hierarchical regions of interest (WT, TC, and ET). Parvez Ahmad et al. [23] suggested that multiscale features are very important in MS UNet. Wu et al. [24] proposed SDS-Net to enhance segmentation performance. A local space with detailed feature information was designed by Chen et al. [25] to increase the detailed feature awareness of voxels between adjacent dimensions. Mona Kharaji et al.
[26] incorporated residual blocks and attention gates to capture emphasized informative regions. Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). They have an inclusive relationship, meaning that WT encompasses both TC and ET, with ET being included within TC. Using multiscale receptive fields to extract features for the three regions has advantages over using only a single receptive field. Although considerable research has addressed capturing contextual information and expanding the receptive field, the effective fusion of local feature information with the long-distance relationships between features is also crucial for the multi-subregion segmentation of brain tumors (WT, TC, and ET).

Attention-Based Module for Medical Image Segmentation
An attention mechanism is a weighted change in target data. It is widely used in clustering learning, reinforcement learning, image processing, and speech recognition. An attention mechanism based on deep learning imitates the human visual system by automatically selecting the visual areas that need to be focused on, which can improve the effectiveness of related learning tasks. Both spatial attention [27] and channel attention mechanisms can be used to recalibrate the characteristic information of the input data. They generally utilize a global pooling operation to obtain richer global information. One of the differences is that the channel attention mechanism performs global pooling layer by layer along the channel direction, whereas the spatial attention mechanism focuses on the feature information at different locations. Recently, many researchers have focused on multiscale and contextual information. Zhou et al. [28] designed attention mechanisms for learning contextual and attentive information. Zhang et al. [29] constructed SMTFNet to aggregate global feature information. Zhao et al. [30] developed MSEF-Net to adopt a multiscale fusion module. Liu et al. [31] proposed MSMV-Net while considering the strengths of multiscale feature extraction. Wang et al. [32] proposed a multiscale contextual block to focus on spatial information at different scales. Self-attention [33] can establish a global dependency and expand the receptive field of an image, which is the foundation of Transformer methods. The above are all convolution-based approaches. Local convolutional operations are limited by the size of the convolutional kernel, which results in a weaker perception of global features. Transformers, through self-attention, can capture dependencies at various positions; their receptive field for global features is relatively large, increasing the richness of global information and allowing for the capture of more information from medium to large targets.
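The channel attention pattern described above (global pooling along the channel direction, then recalibration) can be sketched in a few lines of numpy. This is an illustrative squeeze-and-excitation-style sketch rather than any specific published module; the two weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def channel_attention(x, reduction=2):
    """Channel attention on a (C, D, H, W) volume: global average pooling
    squeezes each channel to a scalar, two small linear maps produce
    per-channel weights in (0, 1), and the input is rescaled by them."""
    c = x.shape[0]
    squeezed = x.reshape(c, -1).mean(axis=1)        # (C,) global average pool
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c // reduction, c))   # stand-in learned weights
    w2 = rng.standard_normal((c, c // reduction))
    hidden = np.maximum(w1 @ squeezed, 0.0)         # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid -> (0, 1)
    return x * weights[:, None, None, None]         # recalibrate channels

x = np.ones((4, 8, 8, 8))
y = channel_attention(x)
```

Spatial attention follows the same recipe with the roles swapped: pooling is performed across channels to produce one weight per spatial location.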

The Transformer-Based Module for Medical Image Segmentation
The excellent performance of the Transformer in natural language processing tasks fully demonstrates its effectiveness. The breakthrough of Transformer networks in NLP has stimulated interest in applying them to computer vision tasks. Alexey Dosovitskiy et al. [34] proposed the Vision Transformer (ViT) to capture long-range dependencies in images through a global attention mechanism, a milestone in the application of Transformers to computer vision. Wang et al. [35] developed the Pyramid Vision Transformer (PVT) to generate multiscale feature maps for intensive prediction tasks. Liu et al. [36] presented the Swin Transformer, which captures global feature information via self-attention. Many researchers have investigated pure Transformers, such as the Volumetric Transformer Net (VT-UNet) [37]. Their advantage is that the encoder benefits from the self-attention mechanism by encoding local and global features simultaneously, while the decoder uses parallel self-attention and cross-attention to capture fine details for boundary refinement. Ali Hatamizadeh et al. [38] utilized a Transformer as an encoder to learn sequence representations of the input volume and effectively capture global multiscale information. However, one of the disadvantages of pure Transformers is that they focus only on global contextual information and address local details less. As stated earlier, in the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). In the task of brain tumor segmentation, the three indicators, as well as the boundary information, global information, and local information, need to be used together to enhance the final segmentation results. Recently, considerable research on brain tumor segmentation based on the fusion of Transformers and CNNs has been conducted. Jia et al.
[39] proposed BiTr-UNet, a combined CNN-Transformer network, which achieved good performance on the BraTS2021 validation dataset. TransBTS was developed by Wang et al. [40] to capture local 3D contextual information. Cai et al. [41] reported that Swin UNet can adequately learn both global and local dependency information in all layers of an image. Fu et al. [42] proposed HmsU-Net, a hybrid multiscale UNet based on the combination of a CNN and a Transformer for medical image segmentation. Ao et al. [43] developed an effective combined Transformer-CNN network using multiscale feature learning. Ilyasse Aboussaleh et al. [44] designed 3DUV-NetR+ to capture more contextual information. Recently, hybrid architectures of CNNs and Transformers have been research hotspots. The further development of this research is very beneficial for improving performance in brain tumor segmentation.

Datasets and Preprocessing
The brain tumor segmentation challenge (BraTS) dataset [45,46] is a public medical image dataset used to research and develop brain tumor segmentation algorithms. The BraTS dataset integrates four MRI modalities: T1-weighted (T1), T2-weighted (T2), T1-enhanced contrast (T1ce), and fluid-attenuated inversion recovery (FLAIR). The BraTS2021 [47] dataset, consisting of data from 1251 patients for training and 219 patients for validation, is popular among researchers. All 1251 training cases contain ground truths labeled by board-certified neuroradiologists, while the ground truths of the 219 validation cases are hidden from the public; the results can be obtained only via online validation. Our training strategy used 80% and 20% of the BraTS2021 training data for training and validation, respectively. In addition, we uploaded our prediction results to the official BraTS platform (https://www.synapse.org/#) (accessed on 12 June 2024) for model evaluation.
To enable our network to segment brain tumor images normally, we first read the BraTS2021 dataset into our program in the preprocessing stage. After processing with SimpleITK and MONAI, we used the Z-score method to standardize each image. Subsequently, we reduced the background as much as possible while ensuring that all of the brain was included and randomly re-cropped the image to a fixed patch size of 128 × 128 × 128. All of the intensity values were clipped to the 1st and 99th percentiles of the non-zero voxel distribution of the volume. In this research, we used rotation between −30 and 30 degrees, additive Gaussian noise from a centered normal distribution with a standard deviation of 0.1, blurring between 0.5 and 1, and a gamma transformation value between 0.7 and 1.5 as data augmentation techniques. The procedural flowchart of the proposed GETNet is depicted in Figure 1.
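The intensity preprocessing described above (clipping to the 1st and 99th percentiles of the non-zero voxels, followed by Z-score standardization) can be sketched as follows. The function name and defaults are illustrative; real BraTS pipelines differ in detail (e.g., computing statistics only within the brain mask).

```python
import numpy as np

def preprocess(volume, lo_pct=1, hi_pct=99):
    """Clip intensities to percentiles of the non-zero voxel distribution,
    then standardize the volume to zero mean and unit variance (Z-score)."""
    nonzero = volume[volume > 0]
    lo, hi = np.percentile(nonzero, [lo_pct, hi_pct])
    clipped = np.clip(volume, lo, hi)
    mean, std = clipped.mean(), clipped.std()
    return (clipped - mean) / (std + 1e-8)

vol = np.random.default_rng(1).uniform(0, 1000, size=(32, 32, 32))
norm = preprocess(vol)
```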

Implementation Details
Our network was constructed using Python 3.8.10 and PyTorch 1.11.0. A single NVIDIA RTX A5000 with 24 GB of memory and an AMD EPYC 7551P CPU were used during training. Table 1 shows that the initial learning rate was 1.00 × 10⁻⁴, with a batch size of 1. CUDA version cu113 was used. During training, Adam [48] was used to optimize our network. Unlike approaches with a hybrid loss, only the ordinary soft Dice loss [49] was used to train our network. The input and output sizes were both 128 × 128 × 128.
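The soft Dice loss follows directly from the Dice definition, with probabilities in place of hard labels so the loss stays differentiable. A minimal single-class numpy sketch (frameworks add per-class averaging and batching):

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|), where `pred` holds
    per-voxel probabilities and `target` is a binary mask. `eps`
    guards against division by zero on empty masks."""
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((4, 4, 4))
target[:2] = 1.0
perfect = soft_dice_loss(target, target)   # near 0 for a perfect prediction
```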
Dice is a measure of the similarity between two sets. In image segmentation, it measures the similarity between the results predicted through network segmentation and the manual masks, and it can be represented as follows:

Dice = 2TP / (2TP + FP + FN),

where TP, FP, and FN represent true positive, false positive, and false negative cases, respectively. HD represents the maximum distance between the predicted and real region boundaries; the smaller the value, the smaller the predicted boundary segmentation error and the better the quality. HD95 is similar to the maximum HD, but it is calculated from the 95th percentile of the distances between the boundary points in t and p, which mitigates the impact of a very small subset of outliers. HD can be represented as follows:

HD(t, p) = max{ sup a∈t inf b∈p d(a, b), sup b∈p inf a∈t d(a, b) },

where t and p represent the real region boundary and the predicted segmentation region boundary, respectively, d(•) represents the distance between points of t and p, and sup and inf denote the supremum and infimum. Sensitivity, also referred to as the true positive rate, quantifies the probability of complete positive detection. The sensitivity can be represented as follows:

Sensitivity = TP / (TP + FN),

where TP and FN represent true positive and false negative cases, respectively. A higher sensitivity corresponds to a smaller discrepancy between the glioma segmentation and the ground truth.
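The Dice and HD95 measures above can be computed directly from binary masks and boundary point sets. The brute-force pairwise-distance HD95 below is only practical for small point sets and is meant to illustrate the definition; production pipelines use dedicated implementations.

```python
import numpy as np

def dice(pred, target):
    """Dice = 2*TP / (2*TP + FP + FN) for boolean masks."""
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return 2 * tp / (2 * tp + fp + fn)

def hd95(t_pts, p_pts):
    """Symmetric 95th-percentile Hausdorff distance between two point sets,
    each given as an (N, k) array of boundary coordinates."""
    d = np.linalg.norm(t_pts[:, None, :] - p_pts[None, :, :], axis=-1)
    d_tp = d.min(axis=1)   # each point of t to its nearest point of p
    d_pt = d.min(axis=0)   # each point of p to its nearest point of t
    return max(np.percentile(d_tp, 95), np.percentile(d_pt, 95))

a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[2:6, 2:5] = True
```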
The specificity represents the true negative rate, which reflects the probability of complete negative detection. The specificity can be represented as follows:

Specificity = TN / (TN + FP),

where TN and FP represent true negative and false positive cases, respectively. The higher the specificity, the smaller the difference between the segmentation and the ground truth for normal tissue.
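Sensitivity and specificity follow the same confusion-count pattern as Dice; a minimal numpy sketch on boolean masks:

```python
import numpy as np

def sensitivity(pred, target):
    """Sensitivity (true positive rate) = TP / (TP + FN)."""
    tp = np.logical_and(pred, target).sum()
    fn = np.logical_and(~pred, target).sum()
    return tp / (tp + fn)

def specificity(pred, target):
    """Specificity (true negative rate) = TN / (TN + FP)."""
    tn = np.logical_and(~pred, ~target).sum()
    fp = np.logical_and(pred, ~target).sum()
    return tn / (tn + fp)

target = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
pred   = np.array([1, 1, 0, 0, 0, 1], dtype=bool)
```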

Network Architecture
The effective integration of local features and global relationships is very helpful for improving the performance of brain tumor segmentation tasks. As shown in Figure 2, our network is a U-shaped architecture based on a Transformer with convolution operations. The encoder branch is on the left, the bottleneck layer is located at the bottom, and the decoder is on the right of the architecture. The encoder incorporates a 3D Patch Partition Block, Linear Embedding Block, VT Encoder Block, GNS Block, and 3D Patch-Merging Block. The 3D Patch Partition Block cuts the brain tumor images into nonoverlapping patches, and the Linear Embedding Block maps the tokens to a vector dimension equal to the number of channels. The 3D Patch-Merging Block reduces the size of the image by half and doubles the number of channels, similarly to pooling or convolution with a stride of 2 in a CNN. This operation is akin to downsampling while increasing the feature depth, contributing to the overall efficiency and effectiveness of the network architecture. After the 3D Patch-Merging operation, VT-UNet only changes the height and width, while the depth remains unchanged. To reduce the image dimensions and floating-point operations per second (FLOPs) and to prevent overfitting, changes were made to the height, width, and depth in our model. The VT Encoder Block and VT Decoder Block employ windowed attention layers to attend to important feature information when capturing long-distance dependencies between tokens. The W-MSA and SW-MSA attention in the VT Encoder Block and VT Decoder Block utilize tokens within each window to help with representation learning. In W-MSA, we uniformly divide the volume into smaller nonoverlapping windows. The tokens in adjacent windows of W-MSA cannot see each other. In contrast, they can see each other via the shifting window in SW-MSA, which facilitates the interaction of information between different windows, thereby guiding effective feature
extraction. The VT Decoder Block can be divided into two parts: the left part is cross-attention (CA), and the right part is self-attention (SA). The fusion subblock, shown in Figure 3, merges the results of CA and SA and delivers them to the later layer. The fusion subblock comprises a convex combination, Fourier feature positional encoding (FPE), layer normalization (LN) [53], and a multi-layer perceptron (MLP) [54]. The dimensions of the input image are 4 × 128 × 128 × 128, and the classifier layer includes a 3D convolutional layer to map deep dimensional features to 3 × 128 × 128 × 128. The GNS Block addresses the issue of insufficient local features in feature extraction. A hierarchical representation is constructed by the VT Encoder Block from small patches, which are gradually merged with neighboring patches as the Transformer layers deepen to capture better features. The two modules in one decoder layer are the 3D Patch-Expanding Block and the VT Decoder Block. Here, 3D Patch Expanding reshapes the image along the spatial axes, doubling the image size and reducing the number of channels by half. The VT Decoder Block integrates high-resolution information from the encoder and low-resolution information from the decoder to recover features lost during downsampling and improve segmentation accuracy.
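The nonoverlapping window partition behind W-MSA reduces to a reshape/transpose. A numpy sketch for a (D, H, W, C) volume, assuming all spatial dimensions are divisible by the window size; SW-MSA would additionally roll the volume by half a window before partitioning (e.g., with np.roll):

```python
import numpy as np

def window_partition_3d(x, ws):
    """Split a (D, H, W, C) volume into nonoverlapping (ws, ws, ws) windows,
    returning (num_windows, ws, ws, ws, C). Attention is then computed
    independently inside each window."""
    d, h, w, c = x.shape
    x = x.reshape(d // ws, ws, h // ws, ws, w // ws, ws, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # gather window-index axes first
    return x.reshape(-1, ws, ws, ws, c)

vol = np.arange(4 * 4 * 4 * 2, dtype=float).reshape(4, 4, 4, 2)
wins = window_partition_3d(vol, 2)
```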
Notably, the VT Encoder Block, VT Decoder Block, and ECSA Block are each used twice. The VT Decoder Block combines a self-attention block and cross-attention to improve the prediction quality. The SC Block is similar to the UNet skip connection, which establishes a bridge for information transmission between the encoding layer and the corresponding decoding layer. Specifically, the values of both K and V generated by the window-based multi-head self-attention (W-MSA) of the VT Encoder Block are passed to the W-MSA of the VT Decoder Block. Similarly, the shifted window-based multi-head self-attention (SW-MSA) of the VT Encoder Block delivers K′ and V′ to the SW-MSA of the VT Decoder Block in the same way. The bottleneck layer has two modules: the 3D Patch-Expanding Block, whose function is the same as that of the 3D Patch Expanding of the decoder, and the ECSA Block, which can capture detailed features with long-distance relationships in the bottom layer.

Enhanced Channel Self-Attention Block (ECSA)
A diagram of the Enhanced Transformer and ECSA Block is shown in Figure 4. The bottom layer is the lowest in the network and has the smallest image size. However, it contains the richest semantic information. It is helpful to extract detailed features effectively from the bottleneck layer, which is important for the brain tumor segmentation results. The Enhanced Transformer combines the advantages of global and local features, which is beneficial for extracting the details of image features, and can be represented as follows:

x′ = ECSA(LN(x)) + x,
Z = MLP(LN(x′)) + x′,

where x denotes the input features, LN represents layer normalization, MLP is a multilayer perceptron, and Z is the result of the equation. The ECSA Block is an enhanced channel self-attention block. The ECSA Block first extracts the channel weights, Kw, Qw, and Vw, of the image features. A weighted self-attention mechanism was developed to capture more effective global features. Depth-wise separable convolution [55] with large convolution kernels of 7 × 7 × 7 is used to ensure larger receptive fields; it is performed on each channel to obtain local features while minimizing information loss. Finally, all channels are aggregated using a 1 × 1 × 1 convolution before being output. The ECSA can be divided into three steps: calculating the weights, capturing the weighted global features, and fusing the local features.
First step: with Q, K, and V obtained from the input features through separately parameterized linear operations, the calculation formula for the three weights can be described as follows:

Qw = FSW(Q), Kw = FSW(K), Vw = FSW(V),

where FL denotes a linear operation. FSW can be calculated as follows:

FSW(y) = FL(AP(y)),

where AP denotes average pooling and FL represents a linear operation.
In the second step, the capture of the weighted global features is calculated as follows:

Z = Softmax(W(Q) W(K)ᵀ / √d) W(V),

where W(•), which is a multiplication operation using the input data, can be represented as follows:

W(Q) = Qw ⊙ Q, W(K) = Kw ⊙ K, W(V) = Vw ⊙ V.

In the third step, the fusion of the local features is calculated as follows:

Yout = Conv1×1×1(DWC7×7×7(FL(Z)) ⊙ Z),

where Yout denotes the final result. Conv1×1×1 represents convolution with a 1 × 1 × 1 kernel. DWC7×7×7 is a depth-wise separable convolution with a 7 × 7 × 7 kernel. FL denotes a linear operation. ⊙ denotes the Hadamard product [56], which converts second-order mappings into third-order mappings.
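A rough numpy sketch of the weighted self-attention step on a token matrix may help. The weight matrices here are random stand-ins for learned parameters, the exact placement of the FSW pooling-plus-linear step is our reading of the description above, and the 7 × 7 × 7 depth-wise convolution branch is omitted; treat this as an illustration of channel-weighted attention, not the exact ECSA implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_channel_attention(x):
    """Sketch of weighted self-attention on (N, C) tokens: per-channel
    weights come from average pooling over tokens followed by a linear
    map and sigmoid; Q, K, V are rescaled by their weights before
    standard scaled dot-product attention."""
    rng = np.random.default_rng(0)
    n, c = x.shape
    wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv              # linear projections

    def f_sw(y, w):                                # average pool + linear + sigmoid
        return 1.0 / (1.0 + np.exp(-(y.mean(axis=0) @ w)))

    qw, kw, vw = f_sw(q, wq), f_sw(k, wk), f_sw(v, wv)   # channel weights
    attn = softmax((q * qw) @ (k * kw).T / np.sqrt(c))
    return attn @ (v * vw)

tokens = np.random.default_rng(1).standard_normal((6, 8))
out = weighted_channel_attention(tokens)
```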

Group Normalization Shuffle (GNS) Block
A diagram of the GNS Block is shown in Figure 5. Ma et al. [57] proposed ShuffleNetv2, which divides the input feature map into multiple subblocks and performs a shuffling operation on these subblocks. Shuffling operations typically involve rearranging the features between different subblocks to introduce more variation and diversity. This process helps the model better capture details and structures in images and improves its generalization ability.
Batch normalization (BN) has become an important component of many advanced deep learning models, especially in computer vision. BN normalizes layer inputs by calculating the average and variance within a batch. The batch size must be sufficiently large for BN to perform well. However, only small batches are available in some cases. Group normalization (GN) [58] is suitable for tasks that require a large amount of memory, such as image segmentation. GN calculates the mean and variance within each group of channels and is not related to or constrained by the batch size. As the batch size decreases, GN performance is basically unaffected.
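Group normalization is easy to state precisely. A numpy sketch (without the learnable scale and shift) that shows why its statistics are independent of the batch size: everything is computed within one sample, per group of channels.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Group normalization over a single (C, D, H, W) volume: channels are
    split into `groups`, and mean/variance are computed per group over all
    of its channels and spatial positions, with no batch dimension involved."""
    c = x.shape[0]
    g = x.reshape(groups, c // groups, *x.shape[1:])
    mean = g.mean(axis=(1, 2, 3, 4), keepdims=True)
    var = g.var(axis=(1, 2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(x.shape)

x = np.random.default_rng(0).standard_normal((8, 4, 4, 4)) * 3 + 5
y = group_norm(x, groups=2)
```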
The rectified linear unit (ReLU) and Gaussian error linear unit (GeLU) [59] are the most common activation functions. ReLU is a very simple function that returns 0 when the input is negative and returns the input value when the input is positive. Thus, it contains only one piecewise linear transformation. However, the ReLU output remains constant at 0 when the input is negative. This problem may lead to neuronal death, reducing the expressiveness of the model. The GeLU function is a continuous S-shaped curve with a smoother shape than that of ReLU, and it can alleviate neuronal death to a certain extent. Inspired by the above, we utilized GN instead of BN and replaced ReLU with GeLU in our GNS Block to enable the communication of information between different channel groups and improve accuracy, which can be represented as follows:

[X1, X2] = FS(X),
X2′ = Conv1×1×1(DWC3×3×3(Conv1×1×1(X2))),
Y = Shuffle(Concat(X1, X2′)),

where FS(•) denotes the split operation, which divides the features along the channel dimension into two halves on average. Conv1×1×1 represents a convolution with a 1 × 1 × 1 kernel. DWC3×3×3 is a depth-wise separable convolution with a 3 × 3 × 3 kernel. The Concat operation represents the concatenation of two sets of features. The Shuffle operation is a channel shuffle operation.
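The channel shuffle at the end of the GNS Block is itself just a reshape/transpose over the channel axis; a numpy sketch of the ShuffleNet-style operation (the convolutions and normalization around it are omitted):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet-style channel shuffle on a (C, ...) tensor: reshape the
    channel axis to (groups, C/groups), transpose, and flatten, so channels
    from different groups become interleaved and information crosses groups."""
    c = x.shape[0]
    g = x.reshape(groups, c // groups, *x.shape[1:])
    return g.swapaxes(0, 1).reshape(x.shape)

# With 6 channels and 2 groups, the channel order 0..5 becomes 0,3,1,4,2,5.
x = np.arange(6, dtype=float).reshape(6, 1)
shuffled = channel_shuffle(x, groups=2)
```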

Comparison with Other Methods
We compared eleven advanced models to evaluate the advantages of the proposed model: two networks from 2024, two from 2023, and two from 2022, in addition to five classic networks. The five classic networks were 3D UNet, Att-UNet, UNETR, TransBTS, and VT-UNet. Six of the compared architectures are variants of the basic UNet, and five are Transformer-based structures. In order to accurately validate the effectiveness of the proposed model, we sequentially compared it with different methods offline and online on BraTS2021. The offline results were obtained by running experiments on our servers, while the online results were obtained after uploading the predictions to the official BraTS platform and receiving the official results. We utilized five-fold cross-validation in our offline experiments. In Table 2, the offline results are presented for comparison with those of other methods. It can be observed that, except for a slightly lower F1-score value, all other values are the highest. As shown in Table 3, we separately conducted a statistical significance analysis to compare different methods on the BraTS2021 dataset; the results were determined with a one-sided Wilcoxon signed-rank test, and bold numbers indicate statistical significance (p < 0.05). We used VT-UNet as the baseline, and our WT, TC, ET, and average Dice results increased by 0.11, 1.62, 2.89, and 1.55, respectively, while the HD95 values remained close to each other. From the results, it can be seen that incorporating convolution (which contributes local characteristics) into the pure Transformer improved its operation. Inspired by the studies of Michael Rebsamen [61] and Snehal Prabhudesai [62], the GETNet method was validated separately on the HGG dataset (293 cases), the LGG dataset (76 cases), and a combination of the two. In Table 4, it can be seen that the Dice values of the HGG cases are relatively higher, followed by the mixed HGG
and LGG values, and the LGG cases are slightly lower.In the BRATS dataset, due to the lack of representation of LGG samples, there was an inevitable performance decrease for the LGG data.There are typically no necrotic areas; hence, they exhibit significant differences in the appearance of the tumor core region compared to the HGG.Additionally, the appearance and size of the enhancing tumor region are also distinct.This impacts the network's performance, and these differences can lead to suboptimal model performance in HGG segmentation.The results show that our network slightly improved in terms of TC and ET; that is, our network performs better than the baseline in small target segmentation.Compared to other networks, ours may not be the best on a single indicator, but our average results and ET values are the highest.Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET).They have an inclusive relationship, meaning that WT encompasses both TC and ET, with ET being included within TC.If there is an emphasis on enhancing the focus on local detail features, the ET results are likely to improve.This indicates that our segmentation performance for small targets is the best among the compared networks, mainly due to the incorporation of local detail features.Figure 8   Swin Unet3D (2023) [41] 90   The results show that our network slightly improved in terms of TC and ET; that is our network performs better than the baseline in small target segmentation.Compared to other networks, ours may not be the best on a single indicator, but our average results and ET values are the highest.Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET) They have an inclusive relationship, meaning that WT encompasses both TC and ET, with ET being included within TC.If there is an emphasis on enhancing 
the focus on local detai features, the ET results are likely to improve.This indicates that our segmentation performance for small targets is the best among the compared networks, mainly due to the incorporation of local detail features.Figure 8   The results show that our network slightly improved in terms of TC and ET; that is, our network performs better than the baseline in small target segmentation.Compared to other networks, ours may not be the best on a single indicator, but our average results and ET values are the highest.Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET).They have an inclusive relationship, meaning that WT encompasses both TC and ET, with ET being included within TC.If there is an emphasis on enhancing the focus on local detail features, the ET results are likely to improve.This indicates that our segmentation performance for small targets is the best among the compared networks, mainly due to the incorporation of local detail features.Figure 8 shows the visualization results of the GETNet model on the BraTS2021 dataset, in which five cases were randomly chosen.The medical cases, as shown in sequences A, B, C, D, and E of Figure 8, were segmented by GETNet.
The figures from left to right are, respectively, FLAIR, 3D UNet, Att-UNet, UNETR, TransBTS, VT-UNet, SwinUNet3D, the results segmented by GETNet, and the ground truth. Green, yellow, and red represent WT, TC, and ET, respectively. In general, the results of GETNet are close to the labeled ground truth. Compared to the networks with only convolutions, the results of our model are the best. Our network also performs better than the Transformer-based networks. Overall, our architecture and modules achieved better results on BraTS2021, providing a good basis for subsequent research.
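For reference, the Dice coefficient reported throughout measures the overlap between a predicted mask and the ground-truth mask for each tumor subregion. A minimal sketch of the standard definition (illustrative only, not the authors' evaluation code):

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect overlap
    return 2.0 * intersection / total

# Toy example: one subregion mask (e.g., WT) flattened to 1D.
pred   = [1, 1, 0, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 0, 1, 1, 0]
print(dice_coefficient(pred, target))  # → 0.75
```

In practice, the score is computed per case and per subregion (WT, TC, ET) on the 3D volumes and then averaged.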

Ablation Experiments

Ablation Study of Each Module in GETNet
We conducted ablation experiments to verify the effects of the different modules in this architecture. Table 6 and Figure 9 show the results of utilizing GNS and ECSA in GETNet, which improved the average Dice coefficient by 0.64 and 0.91, respectively. When we added both the GNS and the ECSA, all indicators improved. The results are 91.77, 86.03, 83.64, and 87.15 for WT, TC, ET, and the average Dice coefficient, respectively. The Hausdorff 95% (HD95) values are 4.36, 11.35, 14.58, and 10.10 for WT, TC, ET, and the average HD95, respectively. The original architecture was a pure Transformer, which does not include the local connectivity of convolution (capturing local features), shared weights (reducing the number of parameters), or sparse interactions (reducing the number of parameters and the computational overhead). The modules that we designed use convolution within the pure Transformer architecture and thus capture local features. Table 6 shows that whether we add the GNS module alone, the ECSA module alone, or both, there is an improvement. These results show that our network and all of its modules can be effectively applied to brain tumor segmentation tasks.
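As context for the GNS block, which builds on the shuffle block of ShuffleNetV2, the channel shuffle operation that lets information flow between channel groups can be sketched in a few lines (an illustrative sketch over a list of channel labels, not the authors' implementation):

```python
def channel_shuffle(channels, groups):
    """Reshape (groups, n) -> transpose -> flatten, interleaving group members."""
    n = len(channels) // groups
    assert n * groups == len(channels), "channel count must be divisible by groups"
    # channels laid out group-by-group: [g0c0, g0c1, ..., g1c0, g1c1, ...]
    grouped = [channels[g * n:(g + 1) * n] for g in range(groups)]
    # transpose so adjacent output channels come from different groups
    return [grouped[g][i] for i in range(n) for g in range(groups)]

# 6 channels in 2 groups: [A0, A1, A2, B0, B1, B2] -> [A0, B0, A1, B1, A2, B2]
print(channel_shuffle(["A0", "A1", "A2", "B0", "B1", "B2"], groups=2))
```

After the shuffle, each subsequent group convolution sees channels originating from every group, which is what enables the cross-group communication described above.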

Ablation Study of GN and GeLU in the GNS Module
To verify the effectiveness of replacing BN and ReLU with GN and GeLU, we conducted five sets of experiments. The results and the experimental plan are shown in Table 7. Experiment A uses the original shuffle block of ShuffleNet V2. Unit 1, Unit 2, and Unit 3 are shown in Figure 10, which illustrates BN, GN, BN + ReLU, and GN + GeLU placed at different positions within the units.
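For context on the layers being swapped, the following sketch contrasts GN and GeLU on a single sample's channel vector. It is a minimal illustration using the common tanh approximation of GeLU; real layers operate on full 3D feature maps and include learnable scale and shift parameters:

```python
import math

def group_norm(x, groups, eps=1e-5):
    """Normalize a per-sample channel vector within each channel group (batch-size independent, unlike BN)."""
    n = len(x) // groups
    out = []
    for g in range(groups):
        chunk = x[g * n:(g + 1) * n]
        mean = sum(chunk) / n
        var = sum((v - mean) ** 2 for v in chunk) / n
        out.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
    return out

def gelu(v):
    """tanh approximation of GeLU, as commonly used in Transformer blocks."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

x = [1.0, 2.0, 3.0, 4.0]                      # 4 channels, 2 groups
y = [gelu(v) for v in group_norm(x, groups=2)]  # GN followed by GeLU, as in the GNS units
```

Because GN statistics are computed per sample within each group, it remains stable for the small batch sizes typical of 3D medical segmentation, which is one motivation for the replacement studied here.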

In Experiment E, the combination in Unit 1 was GN, that in Unit 2 was GN + GeLU, and that in Unit 3 was GN, and the effect improved; Experiment E achieved the best result. Next, we tested the case where Unit 1, Unit 2, and Unit 3 were all GN + GeLU (Experiment B); the results improved compared to those of Experiment A but were worse than those of Experiment E. If all units were replaced with GN (Experiment C), the results again improved compared to those of Experiment A but were similar to those of Experiment B. We also replaced Unit 1, Unit 2, and Unit 3 of Experiment E with GN, GN + GeLU, and GN (Experiment D), and the results worsened.

Ablation Study of the Convex Combination in the ECSA Module

Table 8 and Figure 12 compare the coefficients of the convex combination in the ECSA module; 1 − λ and λ represent the proportions of information processed from cross-attention and self-attention in the VT Decoder Block, respectively. The different proportions of 1 − λ and λ determine which part plays a decisive role, which has a certain impact on the processing of the later layers. To find the optimal λ value in the convex combination, values of λ from 0.1 to 0.9 were tested. The results for λ = 0.5 and 1 − λ = 0.5 were the best: 91.77, 86.03, 83.64, and 87.15 for WT, TC, ET, and the average Dice coefficient, respectively. The average Dice coefficient shows that there is indeed a certain improvement when the proportion of cross-attention increases, but this has little impact on the results overall. The best effect is achieved when cross-attention and self-attention reach a balance. We also considered cases where the convex combination is not used, in which η and θ are either adaptive learning parameters ω or fixed to η = 1 and θ = 1; neither case exceeded the results of λ = 0.5 and 1 − λ = 0.5, as illustrated in Table 9 and Figure 13. The results are 91.56, 85.54, 82.83, and 86.57 for WT, TC, ET, and the average Dice coefficient, respectively, when η and θ are the adaptive learning parameters ω; 91.50, 85.76, 82.47, and 86.58, respectively, when η = 1 and θ = 1; and 91.77, 86.03, 83.64, and 87.15, respectively, when η = 0.5 and θ = 0.5. These results indicate that further feature processing and extraction are needed to achieve better results when both cross-attention and self-attention contain more information. Comparing the results of Tables 8 and 9 shows that the adaptive learning parameters played a certain role, but the features still need further processing.

Ablation Study of the Frequency Coefficient of FEP in the ECSA Module

Table 10 and Figure 14 show a comparison of the frequency coefficients of FEP in the ECSA module. The frequency coefficient of FEP is 10,000 in the case of VT-UNet; that is, the wavelengths form a geometric progression from 2π to 10,000 · 2π. In Table 10, Test A represents a frequency coefficient of 5000, and Test B represents a frequency coefficient of 20,000. We attempted to change this coefficient to achieve better results, conducting experiments in which the default coefficient was halved and doubled. The results indicate that 10,000 is still optimal. In this part of the experiment, we only performed simple scaling by half or double, and a large range of values remained untested; however, the results in Table 10 show that the frequency coefficient does not have a particularly significant impact on the final results. Better results might be obtained with further tuning, but this would require much experimentation.

Comparative Experiment on the Depth-Wise Size of the 3D Patch-Merging Operation

In the original VT-UNet, the depth-wise size does not change as the network deepens, but in the proposed GETNet, it changes with depth. Table 11 shows that when the depth-wise size changes with the layers, performance is not affected, but the floating-point operations (FLOPs) are reduced by 48.99 G. This indicates that changing the depth-wise size with the layers can reduce the FLOPs and improve segmentation efficiency.
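The convex combination of self- and cross-attention studied in the ablation above can be sketched minimally. This is an illustrative sketch, not the authors' implementation; `self_attn` and `cross_attn` stand in for the flattened feature vectors produced in the VT Decoder Block:

```python
def convex_combine(self_attn, cross_attn, lam=0.5):
    """Blend self- and cross-attention features: lam * SA + (1 - lam) * CA."""
    assert 0.0 <= lam <= 1.0, "a convex combination requires lam in [0, 1]"
    return [lam * s + (1.0 - lam) * c for s, c in zip(self_attn, cross_attn)]

sa = [0.0, 1.0]  # toy self-attention output
ca = [1.0, 0.0]  # toy cross-attention output
print(convex_combine(sa, ca, lam=0.5))  # → [0.5, 0.5], the balanced setting found best
```

Setting lam = 1 recovers pure self-attention and lam = 0 pure cross-attention, which is why the balanced point lam = 0.5 corresponds to the η = θ = 0.5 configuration reported in Table 9.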

Conclusions
In this paper, we propose GETNet based on VT-UNet, which integrates a GNS block and an ECSA block. It enhances the performance of brain tumor segmentation by effectively fusing local features with long-distance relationships. The GNS module is used between the VT Encoder Block and the 3D Patch-Merging Block; it improves upon the shuffle block in ShuffleNetV2, enabling communication between different groups of channels and improving accuracy. The proposed ECSA Block works in the bottleneck and combines the advantages of global and local features, which is beneficial for extracting image feature details. In addition to comparing our results with those of the classic VT-UNet, we compared them with those of networks based on UNet or Transformer. Our results yield Dice coefficients of 91.77, 86.03, and 83.64 for the three tumor subregions (WT, TC, and ET), respectively. Our advantage over architectures based on UNet and Transformer lies in the more effective fusion of local and global features using the GNS and ECSA modules. We also conducted ablation experiments on the GNS module, the ECSA module, the convex combination, and the FEP, which proved the effectiveness of our modules. Table 8 shows that, for the average Dice coefficient, there is a certain improvement when the proportion of cross-attention increases, but it has little impact on the results overall. The best effect is achieved when cross-attention and self-attention reach a balance. The results in Table 10 show that the frequency coefficient does not have a particularly significant impact on the final results. Furthermore, quantitative and qualitative experiments demonstrated the accuracy of GETNet. Our architecture and the proposed modules can provide effective ideas for subsequent research.

Figure 2 .
Figure 2.An illustration of the proposed GETNet for brain tumor image segmentation.

Figure 3 .
Figure 3. An illustration of the fusion sub-block.

Figure 4 .
Figure 4. An illustration of the building blocks of the ECSA Block.

Figure 11 .
Figure 11.The results of the ablation study of GN and GeLU in the ECSA module.

Figure 5 .
Figure 5.An illustration of the building blocks from a channel-wise perspective.(a) The shuffle block in ShuffleNetV2; (b) the GNS Block presented in this paper.

Figure 6 .
Figure 6.Comparison of the Dice results of different segmentation methods.

Figure 7 .
Figure 7.Comparison of the HD results of different segmentation methods.

Figure 8 .
Figure 8. Visualization results for medical cases. From left to right: FLAIR, 3D UNet, Att-UNet, UNETR, TransBTS, VT-UNet, SwinUNet3D, the results segmented by GETNet, and the ground truth. (A–E) are five cases randomly chosen from the BraTS2021 dataset. Green, yellow, and red represent WT, TC, and ET, respectively.

Figure 9 .
Figure 9.The results of the ablation study of each module in GETNet.

Figure 10 .
Figure 10.An illustration of building blocks in the channel-wise perspective of GNS.

Figure 12 .
Figure 12.The results of the ablation study of the convex combination in the ECSA module when λ < 1 and λ ≠ 1 − λ.

Figure 13 .
Figure 13.The results of the ablation study of the convex combination in the ECSA module when λ = 1 or λ = 1 − λ.

Figure 14 .
Figure 14.The results of the FEP frequency coefficient in the ECSA module.
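The FEP frequency coefficient compared in Figure 14 plays the same role as the base in a standard sinusoidal positional encoding, whose wavelengths form a geometric progression from 2π to base · 2π. A minimal sketch under that standard formulation (the exact FEP used in the ECSA block may differ):

```python
import math

def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one position; `base` is the frequency coefficient."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (base ** (i / d_model))  # wavelength grows geometrically with i
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe[:d_model]

# Halving or doubling `base` (Tests A and B, 5000 and 20,000) rescales the
# longest wavelengths while leaving the shortest (2*pi) unchanged.
enc = positional_encoding(pos=3, d_model=8, base=10000.0)
```

Because only the slowest-varying dimensions are affected by the base, moderate changes to it perturb the encoding relatively little, which is consistent with the small effect observed in Table 10.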

Table 2 .
The offline validation results for the comparison of different methods in relation to BraTS2021, with the best performance highlighted in bold.

Table 5 and Figures 6 and 7 show that the Dice coefficient values of GETNet in the online validation for the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) are 91.77, 86.03, and 83.64, respectively. The HD95 values, shown in Table 5, are 4.36, 11.35, and 14.58.

Table 3 .
Ratio (in %) of the improvement in the performance of GETNet compared to different methods.Bold numbers indicate statistical significance (p < 0.05).

Table 4 .
The results of GETNet when using the LGG/HGG dataset of BraTS2020, with the best performance highlighted in bold.

Table 5 .
The online validation results for the comparison of different methods in relation to BraTS2021, with the best performance highlighted in bold.

Table 6 .
The results of the ablation study of each module in GETNet, with the best performance highlighted in bold.

Table 7 .
The results of the ablation study of GN and GeLU in the GNS module, with the best performance highlighted in bold.

Table 8 .
The results of the ablation study of the convex combination in the ECSA module when λ < 1 and λ ≠ 1 − λ, with the best performance highlighted in bold.

Table 9 .
The results of the ablation study of the convex combination in the ECSA module when λ = 1 or λ = 1 − λ, with the best performance highlighted in bold.

Table 10 .
The results of the FEP frequency coefficient in the ECSA module, with the best performance highlighted in bold.

Table 11 .
The results of a comparative experiment on the depth-wise size of the 3D Patch-Merging operation, with the best performance highlighted in bold.
