LGMSU-Net: Local Features, Global Features, and Multi-Scale Features Fused the U-Shaped Network for Brain Tumor Segmentation

Abstract: Brain tumors are among the deadliest cancers in the world. Owing to the rapid development of deep learning, researchers have conducted extensive work on brain tumor segmentation to assist doctors in diagnosis and treatment, with good performance. However, most of these methods cannot fully combine multiple kinds of feature information, and their performance needs to be improved. This study developed a novel network fusing local features representing detailed information, global features representing global information, and multi-scale features enhancing the model's robustness to fully extract the features of brain tumors, and proposed a novel axial-deformable attention module for modeling global information to improve the performance of brain tumor segmentation and assist clinicians in automatic segmentation. Moreover, positional embeddings were used to make the network train faster and improve the method's performance. Six metrics were used to evaluate the proposed method on the BraTS2018 dataset. Outstanding performance was obtained, with a Dice score, mean Intersection over Union, precision, recall, params, and inference time of 0.8735, 0.7756, 0.9477, 0.8769, 69.02 M, and 15.66 ms, respectively, for the whole tumor. Extensive experiments demonstrated that the proposed network obtained excellent performance and was helpful in providing supplementary advice to clinicians.


Introduction
The incidence of brain tumors has increased in recent years [1]. Moreover, the median survival time of glioblastoma, the brain tumor with the highest malignancy, is only 14.6 months [2]. Brain tumor segmentation (BTS) technology is an essential basic step for many applications, such as quantitative analysis and surgical planning. However, the manual segmentation of brain tumors is difficult and time-consuming [3]. Accurate and automatic BTS technology is therefore urgently needed to minimize human error.
Automatic BTS methods can be roughly divided into three categories [4], namely, atlas registration-based methods [5,6], machine learning-based methods with hand-crafted features [7][8][9], and deep learning-based methods with automatically learned end-to-end features [10][11][12]. Compared with the former two categories, which require prior knowledge or complex hand-crafted features, the third category can directly learn knowledge from the given data and usually obtains better segmentation performance. Thus, deep learning-based methods have become the research hotspot in the field of BTS.
In recent years, deep learning-based methods have greatly promoted the development of various fields of image analysis. Researchers have actively innovated network structures to capture richer features; however, purely convolutional structures have limitations, such as causing optimization difficulties and making multi-hop dependency modeling difficult. Adopting attention mechanisms is another popular way to extract global features, which has given significant boosts to various fields of vision. Thus, we applied an attention mechanism to extract global features in this study.
Wang et al. [18] introduced a nonlocal operation, computing the response at a position as a weighted sum of the features at all positions, which was similar to the self-attention-based method called Transformer [19]. An increasing number of studies [20,21] adopted the Transformer architecture in convolutional neural networks (CNNs) to extract global features and fuse local and global features to improve the performance of BTS. In many cases, self-attention had to be restricted to a small local area due to the high computation and large memory consumption of the Transformer, which limited the learning ability of Transformer-based models. Recently, many efforts have been made to address this problem. Ho et al. [22] proposed the axial attention module (AAM), decomposing 2D self-attention into two 1D self-attentions to reduce the computation from O(H²W²) to O(HW√(HW)), where H and W represent the height and width of the input, respectively. Inspired by deformable convolution, Zhu et al. [23] introduced the deformable attention module (DAM), which focused only on a small set of key sampling points around a reference and reduced the computation to O(HW). Liu et al. [24] proposed the Swin Transformer, which calculated self-attention in non-overlapping local windows to reduce memory consumption and improve computational efficiency. The UTNet model built by Gao et al. [25] projected keys and values into low-dimensional features, which reduced the computation of the traditional Transformer. Applying such Transformer-based modules to improve the performance of BTS while reducing computational complexity is useful and popular. Cao et al. [26] proposed Swin-Unet, a UNet-like pure Transformer for medical image segmentation. Valanarasu et al. [27] introduced a gated AAM, adding a control mechanism to the self-attention module to segment medical images, including brain images.
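The complexity reduction behind axial attention can be illustrated with a minimal NumPy sketch, assuming identity query/key/value projections and a single head (the real AAM learns these projections): full 2D self-attention over an H × W map costs O(H²W²), while factorizing it into a height-axis pass followed by a width-axis pass costs O(HW(H + W)).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x):
    """Self-attention along the first axis of x with shape (L, C).
    Identity projections are used for brevity; a real module learns them."""
    scores = x @ x.T / np.sqrt(x.shape[1])  # (L, L) similarity matrix
    return softmax(scores, axis=-1) @ x     # (L, C) weighted sum of values

def axial_attention(x):
    """Factorized 2D self-attention on x of shape (H, W, C):
    attend along each column (height axis), then along each row (width axis)."""
    H, W, C = x.shape
    out = np.stack([attention_1d(x[:, w, :]) for w in range(W)], axis=1)
    out = np.stack([attention_1d(out[h, :, :]) for h in range(H)], axis=0)
    return out

x = np.random.rand(8, 8, 4)
y = axial_attention(x)
print(y.shape)  # (8, 8, 4)
```

Each 1D pass builds only L × L attention matrices (L = H or W) instead of a single HW × HW matrix, which is where the savings come from.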
Inspired by the AAM and DAM, we assumed that selecting a small set of key sampling points on a certain axis of the feature map, rather than all points on that axis, to compute the response at a position might make better use of effective information and discard useless information. Meanwhile, finding relevant positions on a certain axis made it easier to obtain the optimal solution than searching for relevant locations over the entire feature map. Thus, we proposed the axial-deformable attention (ADA) module, which sampled a small set of key points in a certain column or row to extract global features, instead of using all points in a column or row as in the AAM, or points over all regions as in the DAM. Compared with other self-attention modules, the ADA module had lower computational complexity and more accurate performance.
However, many researchers employed convolution operations and self-attention mechanisms to fuse local and global features while ignoring multi-scale features, which could effectively improve the robustness of the model [28,29] and were also essential for BTS. Therefore, we proposed a new network that fused local feature information extracted by convolution operations, global feature information extracted by the ADA module, and multi-scale feature information obtained by multi-scale input (MSI) and multi-scale output (MSO) structures to enhance the robustness of the network, effectively improving the performance of the BTS task.
This study introduced a novel U-shaped network fusing local features, global features, and multi-scale features (LGMSU-Net) to segment brain tumors. The main contributions of this study were as follows:
(1) A new encoder-decoder network was proposed, which reasonably and effectively fused local features representing detailed information, global features representing global information, and multi-scale features enhancing the robustness of the network to significantly improve the performance of BTS with low computational complexity.
(2) A novel self-attention module, named the ADA module, was designed to extract global features. It sampled a small set of key points in a certain column or row, which achieved more accurate BTS performance with acceptable computational complexity.
(3) Positional embeddings in the self-attention mechanism were proved to accelerate convergence during training and slightly improve the BTS performance.
(4) Extensive experiments proved that the proposed network achieved excellent performance on the BTS task, which meant that it could help clinicians reduce the time spent on BTS tasks and improve segmentation accuracy. Moreover, it could also be used as an auxiliary tool to provide suggestions on BTS tasks to help junior doctors improve their skills as soon as possible.
We introduce the LGMSU-Net architecture in Section 2. The experimental results of ablation experiments and contrast experiments are presented and discussed in Section 3. Finally, we conclude the work in Section 4.

Materials and Methods
In this section, we first introduced the LGMSU-Net architecture (as shown in Figure 1) and then detailed the proposed ADA module, which could extract global features reasonably and efficiently to generate better predictions.

Network Architecture
As shown in Figure 1, compared with UNet [11], the proposed LGMSU-Net added an MSI structure on the left, replaced some convolution operations in the middle with ADA modules (detailed in Section 2.2), and added an MSO structure on the right. The network architecture was introduced from left to right as follows. The multi-scale features provided by the MSI structure were fused with the local features from convolution operations, as shown on the left side of Figure 1. In the middle of the model, the global features extracted by the ADA module were integrated with the local features extracted by convolution operations. Finally, the MSO structure applied local features from convolution operations to output the final result. The core idea of the proposed model was to reasonably and effectively fuse local, global, and multi-scale features so as to improve the performance of BTS with acceptable computational complexity.
Inspired by a previous study [30], in the MSI structure, we applied average pooling to generate the multi-scale inputs, used a 3 × 3 convolution to increase the number of input channels to match the number of channels of the local features, and then concatenated the multi-scale features and local features along the channel axis. We also used MSOs to accelerate the convergence of training and fuse local and multi-scale features on the decoding path. Specifically, we first applied upsampling to obtain a feature map with the same size as the original images and a 1 × 1 convolution kernel to reduce the dimension to one channel, as represented in Figure 1. Then, we connected the lower feature to the upper one and adopted a 1 × 1 convolution kernel and sigmoid activation to obtain MSOs of the same size as the original images, namely, four probability maps. Finally, we averaged the four probability maps and thresholded the result at 0.5: pixels with an average greater than 0.5 were set to 1 and considered tumor, and the remaining pixels were set to 0 and considered non-tumor, yielding the segmentation result.
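The final fusion step of the MSO structure, averaging the four probability maps and thresholding at 0.5, can be sketched as follows (a minimal illustration; `fuse_mso` is our hypothetical helper name, not from the paper):

```python
import numpy as np

def fuse_mso(prob_maps, threshold=0.5):
    """Fuse multi-scale output probability maps into a binary mask.
    prob_maps: list of arrays in [0, 1], all with the same (H, W) shape."""
    avg = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return (avg > threshold).astype(np.uint8)  # 1 = tumor, 0 = non-tumor

# Four constant maps whose per-pixel mean is 0.65, i.e. above the threshold.
maps = [np.full((4, 4), p) for p in (0.9, 0.8, 0.3, 0.6)]
mask = fuse_mso(maps)
print(mask.sum())  # 16 (every pixel classified as tumor)
```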

ADA Module for Extracting Global Features
The ADA module contained two parts similar to the AAM [22], namely, the Vertical-DAM (VDAM) and the Horizontal-DAM (HDAM). The VDAM mixed the information of some positions in a certain column to compute the response at a position in the same column while ignoring other columns, as shown in Figure 2a. Similarly, the HDAM mixed the information of some positions in a certain row to compute the response at a position in the same row while ignoring other rows. The response value at a position P in the output feature maps of the VDAM or HDAM could be obtained using Formula (1):

y(P) = Concat_{m=1}^{M} ( Σ_{k=1}^{K} A_mk · f_mh(W_v x)(P + ΔP_mk) )    (1)

Figure 2c shows the overall structure of the ADA module. Firstly, the input feature map was weighted by the VDAM. Then, the original feature map and the VDAM-weighted feature map were added as the input of the HDAM for horizontal attention weighting, and the final weighted feature map was obtained.

In Formula (1), m is the attention head and M is the total head number, which was set to 8 in this study. In a certain column or row, k is a sampled key and K is the total sampled key number, which was 4 in this study. ΔP_mk ∈ ℝ^(M×K×W×H×1) and A_mk ∈ ℝ^(M×K×W×H) were both built via linear projection over the query feature maps, which were in turn obtained via linear projection over the input feature maps. A tensor T ∈ ℝ^(M×K×W×H×1) with all elements equal to 0 was designed and concatenated with ΔP_mk in the last dimension in two different ways to build the tensor ΔP_mk ∈ ℝ^(M×K×W×H×2), so that sampling occurred only in the vertical or horizontal direction, rather than over all regions as in the DAM. ΔP_mk and A_mk denote the sampling offset and attention weight of the sampling point in the attention head, respectively. A_mk is normalized by a softmax function to the range [0, 1]. As in the DAM [23], bilinear interpolation was used to handle the fractional positions P + ΔP_mk. W_v x obtains the value feature maps f_v ∈ ℝ^(C×W×H), f_mh(·) represents dividing f_v into M parts along the channel axis, and Concat(·) represents concatenating the outputs of all attention heads.

The proposed ADA module reduced the computation to roughly O(HW), compared with the O(H²W²) of standard self-attention. The segmentation performance and analysis of the ADA module and other self-attention modules are presented in Section 3.2.
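As an illustration of Formula (1), the following NumPy sketch implements a simplified, single-head VDAM: each output position attends to K fractionally offset key points sampled only within its own column, with 1D linear interpolation standing in for bilinear interpolation. The learned linear projections and the zero-tensor concatenation trick are omitted, and all function names are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sample_column(col, pos):
    """Linearly interpolate a (H, C) column at a fractional row position."""
    H = col.shape[0]
    pos = float(np.clip(pos, 0, H - 1))
    lo = int(np.floor(pos))
    hi = min(lo + 1, H - 1)
    w = pos - lo
    return (1 - w) * col[lo] + w * col[hi]

def vertical_deformable_attention(x, offsets, logits):
    """Simplified single-head VDAM: each position (h, w) attends to K
    fractionally offset points sampled only within column w.
    x: (H, W, C); offsets, logits: (H, W, K)."""
    H, W, C = x.shape
    K = offsets.shape[-1]
    a = softmax(logits, axis=-1)  # attention weights sum to 1 over the K keys
    out = np.zeros_like(x)
    for h in range(H):
        for w in range(W):
            for k in range(K):
                v = sample_column(x[:, w, :], h + offsets[h, w, k])
                out[h, w] += a[h, w, k] * v
    return out

x = np.random.rand(6, 5, 3)
offsets = np.random.randn(6, 5, 4)  # K = 4 sampled keys per position
logits = np.random.randn(6, 5, 4)
y = vertical_deformable_attention(x, offsets, logits)
print(y.shape)  # (6, 5, 3)
```

With K fixed (4 here, as in the paper), the cost per axial pass is O(HWK) = O(HW), matching the complexity claim above.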

Dataset and Implementation Details
The dataset [31][32][33][34][35] used in this study is from the BraTS2018 challenge and includes 285 3D brain magnetic resonance imaging (MRI) cases. Each case contains four MRI modalities (T1, T1c, T2, and FLAIR), which have been adjusted to the same size of 240 × 240 × 155. We extracted 44,175 axial slices of 240 × 240 pixels from every modality to train, validate, and test the LGMSU-Net, with a training/validation/test split ratio of 7:1:2. In this work, we segmented the whole tumor region.

Preprocessing and Augmentation
The intensities across different modalities and patients vary, which adversely affects the performance of BTS technology. We achieved intensity normalization by applying a z-score transformation, i.e., by subtracting the mean and dividing by the standard deviation within the brain region of each slice. Moreover, during training, we randomly employed one of three data augmentation methods on the slices, namely, random rotation (rotated 90 degrees clockwise or counterclockwise, or rotated along any of the three axes, at random), elastic deformation (the standard deviation of the Gaussian kernel was 3, and the scale factor controlling the deformation strength was 15), and gamma transform (the gamma factor was set to 2).
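The z-score normalization step can be sketched as follows, assuming (as is common for BraTS data, though not stated explicitly here) that background voxels are exactly zero and are excluded from the statistics; the helper name is ours:

```python
import numpy as np

def zscore_brain(slice_img):
    """Z-score normalize intensities within the brain region of a 2D slice.
    Background (zero-valued) pixels are excluded from the mean/std and
    left at zero."""
    brain = slice_img > 0
    if not brain.any():
        return slice_img.astype(np.float64)
    mean = slice_img[brain].mean()
    std = slice_img[brain].std()
    out = np.zeros_like(slice_img, dtype=np.float64)
    out[brain] = (slice_img[brain] - mean) / (std + 1e-8)
    return out

img = np.zeros((240, 240))
img[60:180, 60:180] = np.random.rand(120, 120) * 100  # synthetic "brain"
norm = zscore_brain(img)
# Brain region now has mean ~0 and std ~1; background stays at 0.
```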

Implementation Details
The input to all the networks studied in this work is a four-channel input consisting of one slice from each of the four MRI modalities. We utilized the RMSprop optimizer with a weight decay of 10⁻⁸, a momentum of 0.9, and a learning rate of 0.00001 to train all networks. The batch size was 4, the number of epochs was 40, and the learning rate was reduced by a factor of 0.5 if the Dice score on the validation set had not increased for four epochs. All hyperparameters used in the network were the optimal parameters obtained through extensive experiments. The Dice loss and binary cross-entropy loss functions were directly summed to form the loss function used for backpropagation. Furthermore, to alleviate overfitting, in addition to using data augmentation to increase the amount of data, we also applied dropout operations in the network. We adopted PyTorch and an NVIDIA RTX 2080 Ti for all experiments.
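The loss used for backpropagation, the direct sum of Dice loss and binary cross-entropy, can be sketched in NumPy as follows (a minimal per-slice version; the actual training code would operate on batched PyTorch tensors with autograd):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on a probability map and a binary target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy on a probability map and a binary target."""
    p = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def combined_loss(pred, target):
    """Direct sum of Dice loss and BCE loss, as used for backpropagation."""
    return dice_loss(pred, target) + bce_loss(pred, target)

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
perfect = target.copy()
print(round(combined_loss(perfect, target), 4))  # 0.0 (perfect prediction)
```

Summing the two terms combines the region-overlap signal of the Dice loss with the per-pixel gradient signal of the BCE loss.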

Ablation Experiments
We used six metrics, namely, the Dice score, mean Intersection over Union (mIoU), precision, recall, the number of parameters of a model (Params), and inference time (the time, in milliseconds, taken to segment one slice), to evaluate the BTS performance of the networks studied on the test set, using the checkpoint that obtained the highest Dice score on the validation set. In addition, we performed Wilcoxon signed-rank tests on the Dice score metric following the method of [36] to confirm that our improvement was effective in a statistical sense. A p-value less than 0.05 was considered statistically significant.
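The first four metrics can be computed from binary masks as in the following sketch (a minimal single-foreground-class version with a hypothetical helper name; the paper's mIoU may additionally average over the background class):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Compute Dice, IoU, precision, and recall for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, iou, precision, recall

gt = np.zeros((10, 10)); gt[2:8, 2:8] = 1      # 36 tumor pixels
pred = np.zeros((10, 10)); pred[3:8, 3:8] = 1  # 25 predicted, all inside gt
dice, iou, prec, rec = seg_metrics(pred, gt)
print(round(dice, 4), round(iou, 4), round(prec, 4), round(rec, 4))
# 0.8197 0.6944 1.0 0.6944
```

An under-segmenting prediction like this one has perfect precision but low recall, the pattern the contrast experiments later attribute to large background proportions.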
We first examined the impact of the number of ADA modules on the BTS performance of the proposed model without the multi-scale input and output. The experimental results were shown in Table 1. The table showed that the Dice score of the model increased with the number of ADA modules until the number reached three; therefore, we applied three ADA modules in our proposed model. The number of model parameters decreased and the inference time increased as the number of ADA modules grew. This was because we did not add ADA modules on top of the model but replaced some convolution operations with them, which revealed that an ADA module had fewer parameters than the replaced convolution operations but required more computation.

All p-Values in Table 1 were obtained by comparing with line 3. ADA modules were placed starting from the lowest layer of the model; for example, "one" meant an ADA module was applied on the last layer, and "two" meant ADA modules were placed on the last and penultimate layers. The best performance was shown in bold.

Then, the effect of each component in LGMSU-Net was explored, and the experimental results were listed in Table 2. As seen in line 2 of Table 2, replacing some convolution operations in "BN" with "ADA" significantly promoted BTS performance, revealing that fusing global and local features could effectively improve segmentation performance. Meanwhile, the number of parameters of the model in line 2 decreased but the inference time increased, which was consistent with the analysis of Table 1 that an ADA module had fewer parameters and more computation than the replaced convolution operations. Moreover, the use of positional embeddings as in the study [27] slightly improved the BTS performance but greatly sped up the training process, as shown in Figure 3.
The MSI (line 4 of Table 2) and MSO (line 5 of Table 2) both improved the network performance, revealing that multi-scale features were important for the BTS task. Further, the simultaneous application of the MSI and MSO, as shown in line 6 of Table 2, further improved the BTS performance. Finally, the increase in parameters and inference time from all the components used in LGMSU-Net was acceptable compared with the improved performance.

Table 2. Evaluation results of ablation experiments on the BraTS2018 dataset using six evaluation metrics. "BN" was the base network, namely, UNet; "ADA" was the axial-deformable attention module; "P" was positional embeddings; "MSI" was multi-scale input, and "MSO" was multi-scale output. The "+" of "BN + ADA" indicated that some convolution operations in the "BN" were replaced with "ADA", and other plus signs indicated that the corresponding component was directly added to the original model. The "*" of "p-Value (*)" denoted the line of the model used for comparison; for example, "0.035(1)" in line 2 meant we performed a Wilcoxon signed-rank test between "BN" in line 1 and "BN + ADA" in line 2. The best performance was shown in bold.

The results of the contrast experiments between the ADA module and other self-attention modules were summarized in Table 3, which showed that the proposed ADA module outperformed other popular self-attention modules for the BTS task, proving the assumption mentioned in the Introduction. The visualizations of the input and output of different self-attention modules in the third layer of the model were plotted in Figure 4.

Table 3. Ablation analysis on the BraTS2018 dataset for different self-attention modules using six evaluation metrics. "Transformer" was the classical self-attention module [19], "AAM" was the axial attention module, "DAM" was the deformable attention module, "SwinAM" was the Swin Transformer attention module, and "ADA" was the axial-deformable attention module proposed in this study. All p-Values in this table were obtained by comparing with the ADA (line 5). The best performance was shown in bold.

As shown in Figure 4, the heatmap of the output of the proposed ADA module showed the strongest contrast between the brain tumor and the background region, and the contour of the brain tumor was the clearest, which proved that the proposed ADA module was effective on the BTS task. To be specific, the outline, location, and size of brain tumors were not clear in the output heatmaps of the Transformer and AAM modules, and the output heatmap of SwinAM was composed of many windows after calculating attention within sliding windows. Although the DAM module's output heatmap was similar to that of the ADA, the tumor region in the ADA module's output heatmap was cleaner and more strongly contrasted with the background, which explained visually why the proposed ADA module worked better.

Contrast Experiments
To further evaluate the performance of the proposed model, we conducted contrast experiments between the proposed LGMSU-Net and seven other state-of-the-art methods, namely, UNet [11], DeepLab v3+ [12], AttentionUnet [37], UNet3+ [16], TransBTS [20], Swin-Unet [26], and Unetr [38], to verify the feasibility and efficiency of LGMSU-Net; the experimental results are shown in Table 4. As shown in the table, the UNet model achieved acceptable segmentation performance for the BTS task. Although it had only slightly more parameters and inference time, AttentionUnet significantly improved the BTS performance compared with UNet, which indicated that the attention mechanism was beneficial to the BTS task. UNet3+ and DeepLab v3+ both greatly increased the parameters and used multi-scale features. However, their segmentation results were worse than those of UNet, which proved that fusing multi-scale and local features without careful design might damage the performance of models on the BTS task. TransBTS added four Transformer layers at the bottom compared with UNet, which greatly increased the parameters but only improved the performance a little. This showed the effectiveness of global features and also indicated that TransBTS did not use global features effectively. As shown in line 7 of Table 4, Unetr, whose encoding path contained no convolution operations, had the maximum number of parameters but the second-worst result, which revealed that local features used to obtain detailed information played an essential role in the BTS task. Meanwhile, the worst segmentation performance for the BTS task was obtained using Swin-Unet, a pure Transformer with no convolution operations, which again implied that only extracting global features while ignoring local features was not feasible for the BTS task.
Finally, the proposed LGMSU-Net, whose inference time was fast enough, achieved the highest Dice score, mIoU, and recall and had the third-smallest number of parameters among all eight advanced models. The comparison of the experimental data in columns 6 and 7 of Table 4 showed that the recall indexes of all models were generally low. This may be due to the high false-negative rate caused by the large background area in the data. The performance improvement of the proposed LGMSU-Net was mainly due to the improvement of the recall index, which indicated that the model effectively alleviated the problem of difficult segmentation due to large background proportions. We believed that the key to the excellent performance of the proposed LGMSU-Net on the BTS task was to reasonably and effectively integrate local features representing detailed information, global features representing global information, and multi-scale features enhancing the robustness of the model. Furthermore, we visualized the results of the contrast experiments in Figure 5. It can be seen that, compared with the ground-truth segmentations of brain tumors in the second row, our proposed model performed the best for different input images, in terms of both brain tumor location and contour (the third row), which was consistent with the above analysis.
Figure 5. BTS results of seven state-of-the-art methods versus the proposed LGMSU-Net. The first row represented the input images, the second row represented the ground-truth images of the brain tumors, the third row represented the brain tumor images segmented by our proposed model, and the remaining seven rows were the segmented images of the other seven advanced brain tumor segmentation models in Table 4.

Moreover, considering that noise in real-world biomedical images was a well-known problem that may reduce the accuracy of diagnosis, we added Gaussian noise to the input images to explore its impact on the models' performance.
As can be seen from Table 5, the performance of all models declined to varying degrees after noise was added. It can be seen from the drop value that the proposed model had the minimum drop value and optimal performance. We believed that this was because the proposed model combined various rich feature information, which enhanced the robustness of the model and enabled it to show better performance for different inputs.

Limitations
Although the proposed method achieved excellent performance on the BTS task, it also had some limitations. First, we conducted all experiments on 2D slices extracted from 3D data, thus discarding the inter-slice information, which may influence the performance of BTS. Second, the M and K values in Formula (1) needed to be set manually, which increased the manual workload and limited the learning ability of the model. Third, the method had not yet been applied in clinical practice, and it was not known what problems would arise there.

Conclusions and Future Work
Brain tumors are currently among the most lethal diseases, and different brain tumors may have the same or different shapes, textures, locations, or contours. Therefore, BTS is a task with low fault tolerance and high difficulty. In this paper, we proposed an accurate automatic brain tumor segmentation model that fused multiple kinds of feature information. The designed model had the following advantages.
(1) Our network extracted local detailed feature information through convolution operations, multi-scale feature information enhancing the robustness of the network through the MSI and MSO, and global feature information through the proposed ADA module, and reasonably fused these features to segment brain tumors automatically and accurately on the BraTS2018 dataset, outperforming the seven state-of-the-art methods, as shown in Table 4 and Figure 5. Furthermore, our model was found to be the most robust among the eight advanced BTS models by studying the effect of noise on model performance, as shown in Table 5.
(2) The proposed ADA module was a novel and useful self-attention module for the BTS task. It sampled a small set of key points in a certain column or row, which achieved more accurate BTS performance with acceptable computational complexity, as shown in Table 3 and Figure 4.
(3) Positional embeddings in the self-attention mechanism were proved to accelerate convergence during training and slightly improve the BTS performance, as shown in Table 2 and Figure 3.
(4) Extensive experiments proved that the proposed network achieved excellent performance on the BTS task, which meant that it could help clinicians reduce the time spent on BTS tasks and improve segmentation accuracy. Moreover, it could also be used as an auxiliary tool to provide suggestions on BTS tasks to help junior clinicians improve their skills as soon as possible.
In future work, we plan to design a 3D BTS model with a small number of parameters and memory footprint. In addition, finding ways to let the model learn the appropriate M and K values automatically to further improve the performance and usability of the model is also important. Moreover, inspired by [39], we will explore the image fusion algorithm to achieve a more accurate segmentation of brain tumors. Last but not least, we still need to further explore the practicality of our proposed method in clinical research.

Conflicts of Interest:
The authors declare no conflict of interest.