Semantic Segmentation of Brain Tumors Using a Local–Global Attention Model

Xing, Shuli; Lai, Zhenwei; Zhu, Junxiong; He, Wenwu; Mao, Guojun

doi:10.3390/app15115981

Open AccessArticle

Semantic Segmentation of Brain Tumors Using a Local–Global Attention Model

by

Shuli Xing

^1,2,

Zhenwei Lai

¹,

Junxiong Zhu

¹,

Wenwu He

^1,2

and

Guojun Mao

^1,2,*

¹

College of Computer Science and Mathematics, Fujian University of Technology, Fuzhou 350118, China

²

Fujian Provincial Key Laboratory of Big Data Mining and Applications, Fujian University of Technology, Fuzhou 350118, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 5981; https://doi.org/10.3390/app15115981

Submission received: 28 March 2025 / Revised: 23 May 2025 / Accepted: 23 May 2025 / Published: 26 May 2025

(This article belongs to the Special Issue Deep Learning in Medical Image Processing and Analysis)

Download

Browse Figures

Versions Notes

Abstract

The distinctions between tumor areas and surrounding tissues are often subtle. Additionally, the morphology and size of tumors can vary significantly among different patients. These factors pose considerable challenges for the precise segmentation of tumors and subsequent diagnosis. Therefore, researchers are trying to develop an automated and accurate segmentation model. Currently, many segmentation models in deep learning rely on Convolutional Neural Network or Vision Transformer. However, Convolution-based models often fail to deliver precise segmentation results, while Transformer-based models often require more computational resources. To address these challenges, we propose a novel hybrid model named Local–Global UNet Transformer. In our model, we introduce: (1) a semantic-oriented masked attention to enhance the feature extraction capability of the decoder; and (2) network-in-network blocks to increase channel modeling complexity in the encoder while reducing the parameter consumption associated with residual blocks. We evaluate our model on two public brain tumor segmentation datasets, and the experimental results demonstrate that our model achieves the highest average Dice score on the BraTS2024-GLI dataset and ranks second on the BraTS2023-GLI dataset. In terms of

H D_{95}

, our model attains the lowest values on both datasets. Furthermore, the ablation study proves the effectiveness of our model design.

Keywords:

braintumor; semantic segmentation; hybrid model; channel modeling

1. Introduction

Tumors are abnormal cell growths that pose serious health risks due to their invasive and metastatic capabilities [1]. Among all tumors, gliomas are the most prevalent malignant brain tumors in adults, leading to a very short life expectancy at their highest grade. However, segmentation of gliomas is often complicated due to the subtle differences between tumor regions and adjacent tissues, leading to inconsistencies in labeling by experienced specialists [2,3]. At the same time, large variations in glioma morphology and size among patients bring challenges to the applicability of traditional segmentation models [4]. Therefore, it is essential to develop a high-performance segmentation model to help doctors make a more accurate preoperative assessment of gliomas.

Many traditional machine learning methods, such as K-Nearest Neighbors [5], Support Vector Machine [6], and Artificial Neural Network [7], have been applied to glioma segmentation. However, these methods achieve very low segmentation accuracy.

Most early deep learning models for glioma segmentation utilized Convolutional Neural Network (CNN). Havaei et al. [8] presented a glioma segmentation model based on a large-kernel CNN. They proposed a novel two-pathway architecture to learn about the local details as well as the larger context of the brain simultaneously. In addition, they developed a convolutional implementation of a fully connected layer to speed up model training and inference. However, the use of large kernels consumed huge computational resources, which limited their model depth. In contrast, Pereira et al. [9] constructed a CNN model with

3 \times 3

small kernels. They followed the design method of VGG models [10] and used the intensity normalization method [11] to address data heterogeneity in magnetic resonance imaging (MRI) images. Despite the remarkable performance achieved by previous methods, they neglected the complex relationship among different modalities of MRI images (such as T1, T1Gd, T2, and T2-FLAIR). Syazwany et al. [12] proposed a model to separately extract features of different modalities and use a bi-directional feature pyramid network to fuse these features. Their method achieved higher segmentation accuracy than conventional UNet and its variants.

CNN-based models rely on convolutions to extract features, causing the insufficient capture of global features. In order to overcome these problems, Transformer-based models have recently demonstrated competitive performance in glioma segmentation. Sagar [13] presented a Vision Transformer model for biomedical image segmentation. They applied the classical Vision Transformer (ViT) module [14] in a U-shaped encoder–decoder structure to establish long-range dependencies between features. Inspired by dilated convolution, Wu et al. [15] constructed a Dilated Transformer (D-former) for glioma segmentation. They designed Local Scope Modules (LSMs) and Global Scope Modules (GSMs) to capture local and global features, respectively. The LSMs utilized window-based local attention, while the GSMs employed dilated global attention, which significantly reduced the computational complexity of the ViT modules. Furthermore, they applied a dynamic position encoding to enhance the model’s translation invariance. Wei et al. [16] proposed a High-Resolution Swin Transformer Network (HRSTNet) for glioma segmentation. Different from the classical U-shaped models, HRSTNet used Swin Transformer blocks to extract local features and followed the design of a High-Resolution Network [17] to fuse features of different resolutions at each stage. Their model achieved better performance than most medical image segmentation methods at that time.

Nevertheless, Transformer-based models typically require more computational resources than CNN-based models, making them difficult to deploy in resource-constrained environments. According to Wang et al. [18], ViT inference on mobile devices has a latency and energy consumption up to 40 times higher than CNN models. Furthermore, Transformer-based models demonstrate poor inductive bias when fitting small-scale image datasets. For ease of use and to save computational resources, many researchers are trying to design hybrid models that combine the advantages of both local convolution and global attention. Liang et al. [19] proposed a parallel module that includes convolution blocks and Transformer blocks for extracting local and global features, and designed a cross-attention mechanism to fuse these generated features. Wang et al. [20] inserted a Transformer block into the last layer of a 3D UNet encoder, significantly reducing the computational complexity of self-attention and enhancing the model’s global feature extraction capability.

The models mentioned above do not fundamentally address the problem of high computational consumption in calculating global attention, and the features extracted by the intermediate layers of both the encoder and decoder lack semantic supervision. In this paper, we develop a novel hybrid segmentation model called Local–Global UNet Transformer (LG UNETR). Our model is built upon Swin UNETR [21], which exhibits low computational complexity by utilizing local attention and residual blocks for feature extraction. We design semantic-oriented masked attention and network-in-network blocks to further improve the segmentation performance of Swin UNETR while reducing the number of model parameters. LG UNETR achieves the highest average Dice score on the BraTS2024-GLI dataset [22,23] and ranks second on the BraTS2023-GLI dataset [22,24,25,26]. In terms of

H D_{95}

, our model attains the lowest values. The contributions of this paper can be summarized as follows:

(1): We propose semantic-oriented masked attention, a novel mechanism that applies global attention while integrating semantic supervision to enhance the precision of feature extraction in the decoder.
(2): We propose network-in-network blocks to replace the residual blocks in the feature fusion component of the original Swin UNETR architecture, with the goal of capturing inter-dependencies between feature channels.
(3): Experimental results show that our model achieves higher segmentation accuracy than several recent leading models on two public brain tumor datasets.
The source code with our proposed model are available at: https://github.com/laizhui/LG-UNETR (accessed on 26 March 2025)

2. Related Works

In this section, we first introduce several popular CNN-based models in the field of medical image segmentation, followed by a discussion of recent Transformer-based models. Finally, we present some approaches for building hybrid models.

2.1. CNN-Based Segmentation Models

UNet [27] was originally developed for 2D image segmentation. Due to its widespread use and promising results, several CNN-based models [28,29,30] have extended the standard UNet architecture for various medical image segmentation tasks. For 3D medical image segmentation [31,32,33], the full volumetric image is often converted into a sequence of 2D slices to facilitate subsequent processing. Çiçek et al. [34] introduced an end-to-end network structure capable of generating segmentation results directly from original 3D medical images. They replaced the 2D operations in the original UNet with 3D counterparts and achieved high performance on an in-house 3D dataset. Isensee et al. [35] proposed a deep learning-based segmentation framework that can automatically configure itself to adapt to different types of 3D medical image datasets. Without designing a new network architecture, their method outperformed most existing models across 23 public datasets. Huang et al. [36] proposed an improved version of UNet called UNet 3+. In their model, they used full-scale skip connections to combine low-level details with high-level semantics to obtain multi-scale feature representation. Their model achieved higher segmentation accuracy than UNet and its variants on an ISBI public dataset and a private dataset.

2.2. Transformer-Based Segmentation Models

ViT [14] marked a significant milestone by introducing the Transformer model, originally developed for natural language processing, into the realm of computer vision. The self-attention mechanism of ViT can model long-range dependencies among a sequence of image patches, capturing global relationships that are crucial for dense prediction tasks. Several recent works [37,38,39] have explored the effectiveness of pure Transformer-based models in medical image segmentation. Unlike classical ViT models, Swin Transformer [40] introduced local attention to reduce the computational complexity of global attention and designed a shifted window approach to enhance information interaction between local windows. Cao et al. [41] applied the basic building block of Swin Transformer into a U-shaped network architecture for 2D medical image segmentation. Their model exhibited outstanding performance in multi-organ and cardiac segmentation tasks. Liang et al. [42] developed a 3D Swin Transformer block based on the 2D version and applied it in a 3D U-shaped encoder–decoder network for brain tumor segmentation. They also introduced a self-supervised learning scheme to further improve the encoder’s feature extraction ability. Experimental results showed that their model achieved the best performance on the BraTS2018 and BraTS2019 [25,26,43] datasets.

2.3. Hybrid Models

To overcome the intrinsic locality of convolution operations and the insufficient low-level details in global self-attention, Chen et al. [44] proposed a hybrid model named TransUNet. They replaced the last few layers of the encoder in UNet with ViT blocks to capture global features. Their model achieved higher segmentation accuracy than both pure Transformer-based and CNN-based models on the Synapse multi-organ segmentation dataset [45]. Hatamizadeh et al. [46] constructed UNETR in a similar manner and later extended it with Swin Transformer blocks [40] to create Swin UNETR [21]. Compared to UNETR, Swin UNETR demonstrated better performance across various 3D medical image segmentation datasets [47]. Shaker et al. [48] developed a more efficient 3D medical image segmentation model upon the UNETR, where the ViT modules were replaced by efficient paired attention blocks to enhance feature extraction capabilities and parameter efficiency.

To address the challenges of multimodal feature fusion and computational efficiency in brain tumor segmentation, Xie et al. proposed SSCFormer [49]. This model revisited the ConvNet–Transformer hybrid framework from scale-wise and spatial-channel-aware perspectives, enhancing the ability to capture hierarchical features in volumetric medical images. Lin et al. proposed MM-UNet [50], a novel cross-attention mechanism between modules and scales for brain tumor segmentation, to overcome the obstacles of limited receptive fields and insufficient global feature integration in traditional U-Net architectures. MM-UNet achieved superior performance and outperformed existing state-of-the-art methods. In response to the computational inefficiency of existing multimodal brain tumor segmentation models, Yu et al. introduced SuperLightNet [51], a lightweight parameter aggregation network designed to minimize computational load while maintaining high accuracy. By optimizing the encoder–decoder structure, SuperLightNet reduced parameters by 95.59%, improved computational efficiency by 96.78%, and enhanced memory access performance by 96.86%, while still achieving a 0.21% increase in segmentation accuracy compared to leading methods. These studies highlight the ongoing evolution of brain tumor segmentation models, emphasizing cross-scale feature fusion and computational efficiency as key directions for future research.

3. Methods

Our LG UNETR is built upon Swin UNETR, which utilizes the 3D version of Swin Transformer and CNN as the encoder and decoder, respectively. To reduce computational costs and further improve Swin UNETR performance, we design Semantic-oriented Masked Attention (SMA) blocks to enhance the feature extraction capability of the decoder. Additionally, we employ Network-in-Network (NiN) blocks to address the limitations in channel modeling within the encoder. Figure 1 illustrates the overall architecture of our model.

3.1. The Overall Architecture of Swin UNETR

Swin UNETR [21] is also a U-shaped model. To better understand its design method, we divide its overall architecture into three parts: an encoder, decoder, and feature fusion component.

The encoder is composed of 3D versions of Swin Transformer blocks. Each block contains two computational units: Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window-based Multi-Head Self-Attention (SW-MSA). Figure 2 displays this structure in detail. W-MSA can be regarded as a large-kernel convolution with fewer parameters, while SW-MSA enhances information interaction between local windows. In order to progressively reduce the spatial dimensions of feature maps while increasing their channel dimensions, a patch merging module has been designed. Different from conventional pooling operations or convolutions with a stride of 2, the patch merging module performs uniform sampling along each spatial dimension and uses a linear layer to fuse the sampled features.

The decoder is composed of 3D versions of residual blocks. Each block contains two

3 \times 3 \times 3

convolutional layers, with each convolution followed by an instance normalization layer and a Leaky ReLU layer. Figure 3 illustrates this structure. A

2 \times 2 \times 2

deconvolution layer with a stride of 2 is used to increase the resolution of feature maps after each residual block. The feature fusion component also consists of a residual block and a concatenation operation. The residual block aims to convert attention features into convolutional features to ensure consistency in feature types between the encoder and decoder. The concatenation operation is responsible for the feature fusion.

3.2. Semantic-Oriented Masked Attention

In the original Swin UNETR, both the encoder and decoder utilize local feature extraction methods, which are inefficient in capturing the global features of brain tumors. In addition, the layers close to the bottom of this U-shaped model are distant from the loss function, resulting in low precision in feature extraction within these layers. To address these issues, we propose SMA as a replacement for the residual block in the decoder. The SMA comprises two core components: a non-homogeneous embedding module and a masked attention module. Figure 4 illustrates its structure in detail.

Most Multi-Head Self-Attentions (MHSAs) use the same embedding method to generate the query, key, and value. Different from them, we design a non-homogeneous embedding module, which contains a residual block and a fully connected layer with a Dropout layer to produce these. The fully connected layer is responsible for spatial reduction. It is a global operation and offers great flexibility in setting the output dimension. The Dropout layer is used to alleviate the overfitting problem. We employ the residual block to extract local features. It preserves the details of brain tumors and is crucial for distinguishing different classes of edge information.

Deep supervision has been widely used in segmentation tasks to guide the feature extraction of intermediate layers. However, it is computationally expensive, as it requires calculating pixel-wise classification loss at various resolutions [52]. We develop a masked attention module that incorporates semantic information in the calculation of attention. Specifically, we set the number of channels of

Y_{Q}

and

Y_{K}

to T using two

1 \times 1 \times 1

convolution layers, where T is equal to the number of classes. The

1 \times 1 \times 1

convolutions serve as a shared classifier across all positions in the embedded features. Another

1 \times 1 \times 1

convolution is applied to

Y_{V}

, with the output channel dimension equal to the input channel dimension. This convolution is used to recover the channel dimension of the final output. For more details about the masked attention module, please refer to Figure 5.

All of the SMAs share the same structure, except for the bottom one, which only receives features from the last encoder block and the features have a relatively small spatial dimension. Therefore, we only retain the mask attention module in this bottom SMA, as shown in Figure 6. The process of SMA can be summarized as follows:

Y_{I} = F_{1} (C o n c a t (Y_{E}, Y_{D})),

(1)

Y_{Q} = F_{2} (R e s (Y_{I})),

(2)

Y_{K} = F_{3} (F C (D r o p o u t (Y_{I}))),

(3)

Y_{V} = F_{4} (F C (D r o p o u t (Y_{I}))),

(4)

Z = R e s (Y_{I}) + S o f t m a x (Y_{Q} Y_{K}^{⊤}) Y_{V},

(5)

where

Y_{E} \in R^{N \times C}

and

Y_{D} \in R^{N \times C}

represent the input features from the encoder and decoder, respectively. N is the length of the image token sequence and C is the channel dimension.

F_{1}

,

F_{2}

,

F_{3}

, and

F_{4}

denote four different

1 \times 1 \times 1

convolution layers.

C o n c a t

refers to the concatenate operation.

D r o p o u t

represents the Dropout layer.

F C

denotes the fully connected layer, and

R e s

represents the residual block. Z is the final output. The dimensions of

Y_{Q}

,

Y_{K}

, and

Y_{V}

are

R^{N \times T}

,

R^{T \times P}

, and

R^{P \times C}

, respectively, where P is the output dimension of the fully connected layer.

3.3. Network-in-Network Blocks

The local attention in the original Swin UNETR is still a form of spatial attention, which is inherently weak in channel modeling. To address this issue, an intuitive approach is to perform attention along the channel dimension. However, channel attention becomes complex as the network depth increases, which may adversely affect a model’s inductive bias. We employ the NiN block, which replaces the previously used feature fusion component, to solve this problem.

The NiN block has two benefits: (1) it is lightweight compared to the residual block and also serves to transform attention features into convolutional features; and (2) the two successive

1 \times 1 \times 1

convolutions function as a multi-layer perceptron, which explicitly establishes interactions among channels. Figure 7 illustrates its structure. We find that the NiN blocks have been commonly used in modern CNN models, such as MedNeXt [47] and 3D UX-NET [53], reflecting the higher parameter efficiency of this structure.

3.4. Loss Function

We employ the soft Dice loss function [31], which has been widely used in various 3D medical image segmentation tasks. It is defined as:

L (G, R) = 1 - \frac{2}{J} \sum_{j = 1}^{J} \frac{\sum_{i = 1}^{I} G_{i, j} R_{i, j}}{\sum_{i = 1}^{I} G_{i, j}^{2} + \sum_{i = 1}^{I} R_{i, j}^{2}},

(6)

where I and J denote the number of voxels and classes, respectively;

G_{i, j}

and

R_{i, j}

denote the probabilities of ground truth and predicted result for class j at voxel i, respectively.

4. Experiments and Results

In this section, we first introduce the datasets used for training and testing our model performance, followed by evaluation metrics and experimental settings. Then, we present comparisons with several state-of-the-art methods. Finally, we conduct ablation studies to highlight the effectiveness of each core component in our model.

4.1. Datasets

We carried out experiments on two public brain tumor datasets: the BraTS2023-GLI dataset and BraTS2024-GLI dataset. These datasets contained MRI images from multiple medical centers, covering patients of various ages and tumor types. They provided good benchmarks for developing glioma segmentation algorithms, as all the images had been precisely annotated and approved by experienced neuroradiologists.

4.1.1. BraTS2023-GLI Dataset

The BraTS2023-GLI dataset [22,24,25,26] contains a total of 1251 brain MRI images from 1134 patients. Each image has four modalities: T1, T1Gd, T2, and T2-FLAIR, along with three segmentation targets: WT (Whole Tumor), ET (Enhancing Tumor), and TC (Tumor Core). The image resolution is

240 \times 240 \times 155

voxels in nii.gz format. The ground truth data are created after pre-processing, which included being co-registered to the same anatomical template, interpolated to the same resolution (1 mm³), and skull stripped.

4.1.2. BraTS2024-GLI Dataset

The BraTS2024-GLI dataset [22,23] contains a total of 1350 brain MRI images from 613 patients. Each image also includes four modalities but with four segmentation targets: WT, ET, TC, and RC (Resection Cavity). To maintain consistency with the BraTS2023-GLI dataset, we still use the first three segmentation targets. The image resolution is

182 \times 218 \times 182

voxels in nii.gz format. The ground truth data are created using the same pre-processing methods as in the BraTS2023-GLI dataset.

4.1.3. Comparison Between the Two Datasets

We conduct a comparative analysis of the labeling distribution between the two datasets, as shown in Table 1. It can be seen that the distribution of the three classes of labels to be segmented is relatively uniform in the BraTS2023-GLI dataset, with these classes co-occurring in the images at a high probability of 94.3%. In contrast, the labeling distribution in the BraTS2024-GLI dataset shows significant variation, with 58.6% of the images exhibiting partial label absence. These factors increase the difficulty of training and testing models on the BraTS2024-GLI dataset. Compared to the BraTS2023-GLI dataset, the BraTS2024-GLI dataset more closely reflects real-world data collection scenarios, making it more generalizable.

4.2. Evaluation Metrics

We evaluate the performance of each model with the Dice score, which measures the overlap between the segmentation predictions of voxels and their corresponding ground truths. It can be defined as:

D i c e (A, B) = \frac{2 |A \cap B|}{|A| + |B|} = \frac{2 A B}{A^{2} + B^{2}} .

(7)

where A denotes the segmentation of model prediction and B denotes the ground truth from manual annotation.

In addition, we use the 95% Hausdorff Distance (

H D_{95}

) as a boundary-based metric to measure the similarity between two sets of points. It is defined as:

H D_{95} (A, B) = m a x \{d_{95} (A, B), d_{95} (B, A)\} .

(8)

where A denotes the set of points from the ground truth, and B denotes the set of points from the predicted labels.

d_{95} (A, B)

denotes the 95th percentile of the distances from each point in A to its nearest neighbor in B, and

d_{95} (B, A)

denotes the 95th percentile of the distances from each point in B to its nearest neighbor in A.

4.3. Implementation Details

We train our model on a server that is equipped with an Intel Xeon Gold 6325 CPU (up to 2.90 GHz) and an NVIDIA A40 GPU with 48 GB of memory. For implementation, we use the PyTorch framework (v2.2.1) [54] with MONAI library (v0.9.0) [55]. We employ the AdamW [56] with its default configurations to optimize the LG UNETR. The batch size, initial learning rate, total number of epochs, dropout rate, and hyperparameter P are set to 1, 1 × 10⁻⁴, 500, 0.25, and 128, respectively. For model training and testing, we randomly divide each dataset into three parts: the training set, validation set, and test set, with a ratio of 8:1:1. For each model architecture, we train separate optimal models on both datasets. The input image is cropped to a fixed size of

128 \times 128 \times 128

. A data augmentation approach, including random scaling (with a scaling factor of 0.1 and a probability of 0.1), flipping (along the x, y, and z axes with a probability of 0.5), and shifting (with an offset of 0.1 and a probability of 0.1), is applied to the training set to further improve the model’s generalization ability.

4.4. Comparison with State-of-the-Art Methods

To demonstrate the effectiveness of our model, we compare it with four popular and recent deep learning-based models: 3D UX-NET [53], Swin UNETR [21], UNETR [46], and MedNeXt [47]. For a fair comparison, we use the publicly available codes for each method in our experiments and keep their default settings. The model 3D UX-NET is a lightweight 3D convolutional neural network that incorporates several innovative features such as depthwise and pointwise convolution to simplify the computation, and expands independent channels to enrich features. MedNeXt is a 3D medical image segmentation model inspired by Transformers, designed for efficiency and scalability in data-scarce medical imaging contexts. It employs a fully ConvNeXt [57] 3D encoder–decoder architecture, along with residual upsampling and downsampling blocks, and compound scaling for depth, width, and kernel size.

4.4.1. Comparison on the BraTS2024-GLI Dataset

The segmentation results of each model for the BraTS2024-GLI dataset are presented in Table 2. Among these models, our LG UNETR achieves Dice scores of 88.50%, 78.25%, and 80.80% for WT, TC, and ET, respectively, significantly outperforming all other methods for the given segmentation targets. The average Dice score of LG UNETR is 82.51%, representing a 1.39% increase over the original Swin UNETR and a 1.06% increase over the second-best 3D UX-NET. In terms of the

H D_{95}

, LG UNETR achieves the lowest value of 8.02 mm, followed by 3D UX-NET with 8.40 mm. Regarding computational consumption, MedNeXt exhibits the lowest, closely followed by our model, with only a marginal disparity between them. Compared with the original Swin UNETR, LG UNETR not only achieves a significant improvement in performance but also reduces the number of parameters and FLOPs. Figure 8 presents the prediction comparisons between them. It can be seen that our model can provide more accurate predictions.

4.4.2. Comparison on the BraTS2023-GLI Dataset

The segmentation performance of each model on the BraTS2023-GLI dataset is shown in Table 3. It can be seen that 3D UX-NET achieves the highest average Dice score of 89.17%, followed by our LG UNETR with an average Dice score of 89.12%. What is more, our LG UNETR achieves better Dice scores on ET and WT than 3D UX-NET. In terms of the

H D_{95}

, LG UNETR attains the lowest value of 4.81 mm, while 3D UX-NET ranks second with 4.90 mm. Notably, our model demonstrates significantly lower consumption in FLOPs compared to 3D UX-NET, with FLOPs being nearly half of those of 3D UX-NET, highlighting the computational efficiency of our model. Additionally, LG UNETR demonstrates superior performance on all segmentation targets while requiring fewer computational resources compared to the original Swin UNETR.

Compared to the BraTS2024-GLI dataset, the segmentation difficulty of the BraTS2023-GLI dataset is much lower, resulting in better segmentation performance for pure CNN-based models than hybrid models. In contrast, most hybrid models achieve higher Dice scores than pure CNN-based models on the BraTS2024-GLI dataset, as the attention mechanisms enhance the model’s ability to handle more complex problems.

4.5. Ablation Studies

In this section, we evaluate the effectiveness of each core component in our model, which includes as follows:

(1): Evaluating the contribution of each proposed block to the overall performance of our model.
(2): Examining the influence of the number and location of SMA blocks on segmentation accuracy.
(3): Investigating the effect of model width on model performance.
(4): Exploring the impact of hyperparameter P on model performance.

All ablation experiments are conducted on the BraTS2024-GLI dataset.

4.5.1. The Effectiveness of Proposed Blocks

Table 4 provides a comparison of performance across model variations. Model 1 serves as the baseline model: Swin UNETR. In Model 2, we integrate the SMA blocks, resulting in a Dice score improvement of 1.25% over Model 1. Subsequently, Model 3 incorporates our NiN blocks, yielding a Dice score of 81.86%, which marks a 0.90% improvement relative to Model 1. Finally, the LG UNETR model, which combined both SMA and NiN blocks, achieves the highest performance, with a Dice score of 82.51%. Moreover, our model maintains the lowest FLOPs among these variants, highlighting the efficiency of our proposed configurations.

4.5.2. The Effects of the Number and Location of SMA Blocks on Performance

In this section, we explore how the number and location of SMA blocks influence the performance of model segmentation. Table 5 presents a comparative analysis of different settings across various stages of model development. The results shows that Setting 1 and Setting 2 achieve higher segmentation accuracy compared to Setting 3, indicating the importance of incorporating SMA blocks in the bottom layer. Furthermore, Setting 4 exhibits the best segmentation performance, demonstrating that increasing the number of SMA blocks can yield greater benefits.

4.5.3. The Effects of Model Width on Performance

In this section, we discuss the impact of model width on segmentation performance. As shown in Table 6, the Dice score increases with the number of feature channels. This phenomenon indicates that our model can leverage its inherent scalability to achieve higher segmentation precision without requiring additional data. However, it is important to note that this performance gain incurred significant computational costs, which may not be suitable for resource-constrained equipment.

4.5.4. The Effects of Hyperparameter P on Performance

In this section, we discuss the impact of hyperparameter P on segmentation performance. As shown in Table 7, the optimal value of P is found to be 128, which demonstrates a 0.73% and 0.51% improvement in performance compared to the values of 64 and 256, respectively. Furthermore, the number of parameters with this particular setting is positioned in the mid-range, and there is only a marginal disparity in FLOPs.

5. Conclusions

Brain tumors pose significant challenges for segmentation due to their substantial variations in morphology and size among patients. To address these challenges, we propose a novel hybrid model named LG UNETR. Our model is developed based on Swin UNETR, which utilizes the local attention of 3D Swin Transformer and residual blocks to extract features. To further enhance the performance of Swin UNETR, we propose SMA and NiN blocks to replace the residual blocks in the decoder and the feature fusion component, respectively. Experimental results show that our model achieves the highest segmentation accuracy on the BraTS 2024-GLI dataset and ranks second on the BraTS 2023-GLI dataset, demonstrating consistently superior performance with lower computational consumption compared to the original Swin UNETR. The ablation studies also confirm the effectiveness and high efficiency of each proposed block.

Author Contributions

S.X.: methodology, project administration, writing—original draft, and writing—review and editing. Z.L.: methodology, software, writing—original draft, and writing—review and editing. J.Z.: writing—review and editing. W.H.: validation and writing—review and editing. G.M.: supervision, validation, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fujian Provincial Natural Science Foundation of China (Grant number: 2024J01158).

Institutional Review Board Statement

Not appliable.

Informed Consent Statement

Not applicable.

Data Availability Statement

BraTS2024-GLI dataset: https://www.synapse.org/Synapse:syn59059776; BraTS2023-GLI dataset: https://www.synapse.org/Synapse:syn51514105 (accessed on 26 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Hanahan, D.; Weinberg, R.A. Hallmarks of cancer: The next generation. Cell 2011, 144, 646–674. [Google Scholar] [CrossRef] [PubMed]
Wang, S.; Li, C.; Wang, R.; Liu, Z.; Wang, M.; Tan, H.; Wu, Y.; Liu, X.; Sun, H.; Yang, R.; et al. Annotation-efficient deep learning for automatic medical image segmentation. Nat. Commun. 2020, 12, 5915. [Google Scholar] [CrossRef] [PubMed]
Greenwald, N.F.; Miller, G.; Moen, E.; Kong, A.; Kagel, A.; Dougherty, T.; Fullaway, C.C.; McIntosh, B.J.; Leow, K.S.; Schwartz, M.; et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat. Biotechnol. 2021, 40, 555–565. [Google Scholar] [CrossRef]
Ma, J.; He, Y.; Li, F.; Han, L.J.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2023, 15, 654. [Google Scholar] [CrossRef]
Cinarer, G.; Emiroglu, B.G. Classificatin of Brain Tumors by Machine Learning Algorithms. In Proceedings of the 2019 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 11–13 October 2019; pp. 1–4. [Google Scholar] [CrossRef]
Padlia, M.; Sharma, J. Fractional Sobel Filter Based Brain Tumor Detection and Segmentation Using Statistical Features and SVM. In Nanoelectronics, Circuits and Communication Systems; Springer: Singapore, 2018. [Google Scholar] [CrossRef]
Virupakshappa; Amarapur, B. An Automated Approach for Brain Tumor Identification using ANN Classifier. In Proceedings of the 2017 International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC), Mysore, India, 8–9 September 2017; pp. 1011–1016. [Google Scholar] [CrossRef]
Havaei, M.; Davy, A.; Warde-Farley, D.; Biard, A.; Courville, A.C.; Bengio, Y.; Pal, C.J.; Jodoin, P.M.; Larochelle, H. Brain tumor segmentation with Deep Neural Networks. Med Image Anal. 2016, 35, 18–31. [Google Scholar] [CrossRef]
Pereira, S.; Pinto, A.; Alves, V.; Silva, C.A. Brain Tumor Segmentation Using Convolutional Neural Networks in MRI Images. IEEE Trans. Med. Imaging 2016, 35, 1240–1251. [Google Scholar] [CrossRef] [PubMed]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Nyúl, L.G.; Udupa, J.K.; Zhang, X. New variants of a method of MRI scale standardization. IEEE Trans. Med. Imaging 2000, 19, 143–150. [Google Scholar] [CrossRef]
Syazwany, N.S.; Nam, J.H.; Chul Lee, S. MM-BiFPN: Multi-Modality Fusion Network With Bi-FPN for MRI Brain Tumor Segmentation. IEEE Access 2021, 9, 160708–160720. [Google Scholar] [CrossRef]
Sagar, A. ViTBIS: Vision Transformer for Biomedical Image Segmentation. arXiv 2022, arXiv:2201.05920. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Wu, Y.; Liao, K.Y.; Chen, J.; Wang, J.; Chen, D.Z.; Gao, H.; Wu, J. D-former: A U-shaped Dilated Transformer for 3D medical image segmentation. Neural Comput. Appl. 2022, 35, 1931–1944. [Google Scholar] [CrossRef]
Wei, C.; Ren, S.; Guo, K.; Hu, H.; Liang, J. High-Resolution Swin Transformer for Automatic Medical Image Segmentation. Sensors 2023, 23, 3420. [Google Scholar] [CrossRef] [PubMed]
Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar] [CrossRef]
Wang, L.; Dong, X.; Wang, Y.; Liu, L.; An, W.; Guo, Y.K. Learnable Lookup Table for Neural Network Quantization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12413–12423. [Google Scholar] [CrossRef]
Liang, J.; Yang, C.; Zeng, M.; Wang, X. TransConver: Transformer and convolution parallel network for developing automatic brain tumor segmentation in MRI images. Quant. Imaging Med. Surg. 2021, 12, 2397–2415. [Google Scholar] [CrossRef]
Wang, W.; Chen, C.; Ding, M.; Li, J.; Yu, H.; Zha, S. TransBTS: Multimodal Brain Tumor Segmentation Using Transformer. arXiv 2021, arXiv:2103.04430. [Google Scholar]
Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. arXiv 2022, arXiv:2201.01266. [Google Scholar]
Karargyris, A.; Umeton, R.; Sheller, M.J.; Aristizabal, A.; George, J.; Bala, S.; Beutel, D.J.; Bittorf, V.; Chaudhari, A.; Chowdhury, A.; et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nat. Mach. Intell. 2021, 5, 799–810. [Google Scholar] [CrossRef]
de Verdier, M.C.; Saluja, R.; Gagnon, L.; Labella, D.; Baid, U.; Tahon, N.E.H.M.; Foltyn-Dumitru, M.; Zhang, J.; Alafif, M.M.; Baig, S.; et al. The 2024 Brain Tumor Segmentation (BraTS) Challenge: Glioma Segmentation on Post-treatment MRI. arXiv 2024, arXiv:2405.18368. [Google Scholar]
Baid, U.; Ghodasara, S.; Bilello, M.; Mohan, S.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F.C.; Pati, S.; et al. The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.S.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging 2015, 34, 1993–2024. [Google Scholar] [CrossRef]
Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef] [PubMed]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for High-Quality Retina Vessel Segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar] [CrossRef]
Guan, S.; Khan, A.A.; Sikdar, S.; Chitnis, P.V. Fully Dense UNet for 2D Sparse Photoacoustic Tomography Artifact Removal. IEEE J. Biomed. Health Inform. 2018, 24, 568–576. [Google Scholar] [CrossRef] [PubMed]
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar]
Milletarì, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
Kamnitsas, K.; Ledig, C.; Newcombe, V.F.J.; Simpson, J.P.; Kane, A.D.; Menon, D.K.; Rueckert, D.; Glocker, B. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med. Image Anal. 2016, 36, 61–78. [Google Scholar] [CrossRef]
Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D Dilated Multi-Fiber Network for Real-time Brain Tumor Segmentation in MRI. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019. [Google Scholar] [CrossRef]
Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece, 17–21 October 2016. [Google Scholar] [CrossRef]
Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2020, 18, 203–211. [Google Scholar] [CrossRef]
Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; pp. 1055–1059. [Google Scholar] [CrossRef]
Peiris, H.; Hayat, M.; Chen, Z.; Egan, G.F.; Harandi, M. A Robust Volumetric Transformer for Accurate 3D Tumor Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021. [Google Scholar] [CrossRef]
Xing, Z.; Yu, L.; Wan, L.; Han, T.; Zhu, L. NestedFormer: Nested Modality-Aware Transformer for Brain Tumor Segmentation. arXiv 2022, arXiv:2208.14876. [Google Scholar]
Pinaya, W.H.L.; Tudosiu, P.D.; Gray, R.J.; Rees, G.; Nachev, P.; Ourselin, S.; Cardoso, M.J. Unsupervised brain imaging 3D anomaly detection and segmentation with transformers. Med. Image Anal. 2022, 79, 102475. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537. [Google Scholar]
Liang, J.; Yang, C.; Zhong, J.; Ye, X. BTSwin-Unet: 3D U-shaped Symmetrical Swin Transformer-based Network for Brain Tumor Segmentation with Self-supervised Pre-training. Neural Process. Lett. 2022, 55, 3695–3713. [Google Scholar] [CrossRef]
Bakas, S.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M.; et al. Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. arXiv 2018, arXiv:1807.10165. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Multi-Atlas Labeling Beyond the Cranial Vault. 2015. Available online: https://www.synapse.org/Synapse:syn3193805/wiki/ (accessed on 25 March 2025).
Hatamizadeh, A.; Yang, D.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1748–1758. [Google Scholar] [CrossRef]
Roy, S.; Koehler, G.; Ulrich, C.; Baumgartner, M.; Petersen, J.; Isensee, F.; Jaeger, P.F.; Maier-Hein, K.H. MedNeXt: Transformer-driven Scaling of ConvNets for Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar] [CrossRef]
Shaker, A.M.; Maaz, M.; Rasheed, H.A.; Khan, S.H.; Yang, M.; Khan, F.S. UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation. IEEE Trans. Med. Imaging 2024, 43, 3377–3390. [Google Scholar] [CrossRef]
Xie, Q.; Chen, Y.; Liu, S.; Lu, X. SSCFormer: Revisiting ConvNet-Transformer Hybrid Framework From Scale-Wise and Spatial-Channel-Aware Perspectives for Volumetric Medical Image Segmentation. IEEE J. Biomed. Health Inform. 2024, 28, 4830–4841. [Google Scholar] [CrossRef]
Lin, C.W.; Chen, Z. MM-UNet: A novel cross-attention mechanism between modules and scales for brain tumor segmentation. Eng. Appl. Artif. Intell. 2024, 133, 108591. [Google Scholar] [CrossRef]
Yu, F.; Cao, J.; Liu, L.; Jiang, M. SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 19–25 October 2025. [Google Scholar]
Yang, Y.; Wang, Y.; Qin, C. Pancreas segmentation with multi-channel convolution and combined deep supervision. J. Biomed. Eng. 2025, 42, 140–147. [Google Scholar]
Lee, H.H.; Bao, S.; Huo, Y.; Landman, B.A. 3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. arXiv 2022, arXiv:2209.15076. [Google Scholar]
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
Cardoso, M.J.; Li, W.; Brown, R.; Ma, N.; Kerfoot, E.; Wang, Y.; Murrey, B.; Myronenko, A.; Zhao, C.; Yang, D.; et al. MONAI: An open-source framework for deep learning in healthcare. arXiv 2022, arXiv:2211.02701. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar] [CrossRef]

Figure 1. Overview of LG UNETR architecture.

Figure 2. The structure of Swin Transformer blocks.

Figure 3. The structure of original residual blocks.

Figure 4. The structure of SMA blocks in the middle layers.

Figure 5. The structure of Mask Attention.

Figure 6. The structure of SMA block in the bottom layer.

Figure 7. The structure of NiN blocks.

Figure 8. Prediction visualization of different models on the test set of the BraTS2024-GLI dataset. Existing methods face challenges in accurately segmenting small tissue regions (marked in the dashed box). Best viewed in zoom.

Table 1. Labeling distribution between the two datasets.

Dataset	Total No. of Images	No. of Images with Red Labels	No. of Images with Green Labels	No. of Images with Blue Labels	No. of Images with At Least One Missing Label
BraTS2023-GLI	1251	1208	1250	1218	71
BraTS2024-GLI	1350	565	1350	990	791

Table 2. Comparison with state-of-the-art methods on the BraTS2024-GLI dataset. The best results are in bold and the second best underlined. The up arrow indicates that the higher the better and the dowm arrow indicates that the lower the better.

Model	Params (M) (↓)	GFLOPs (↓)	Dice (%) (↑)				${HD}_{95}$ (mm) (↓)
Model	Params (M) (↓)	GFLOPs (↓)	ET	TC	WT	Avg.	${HD}_{95}$ (mm) (↓)
MedNeXt [47]	61.7	1079.9	74.61	71.49	85.45	77.19	12.24
UNETR [46]	173.6	2095.4	78.92	75.44	86.16	80.17	8.42
UNETR++ [48]	280.7	1912.6	79.22	75.77	87.19	80.72	8.45
Swin UNETR [21]	97.1	1234.5	79.14	76.36	87.39	80.96	8.42
3D UX-NET [53]	82.9	2362.4	80.18	76.96	87.22	81.45	8.40
LG UNETR (ours)	83.9	1150.9	80.80	78.25	88.50	82.51	8.02