AResU-Net: Attention Residual U-Net for Brain Tumor Segmentation

Abstract: Automatic segmentation of brain tumors from magnetic resonance imaging (MRI) is a challenging task due to the uneven, irregular and unstructured size and shape of tumors. Recently, brain tumor segmentation methods based on the symmetric U-Net architecture have achieved favorable performance. Meanwhile, recent works have also shown the effectiveness of enhancing local responses during feature extraction and restoration, which may further benefit brain tumor segmentation. Inspired by this, we introduce the attention mechanism into the existing U-Net architecture to explore the effects of locally important responses on this task. More specifically, we propose an end-to-end 2D brain tumor segmentation network, i.e., attention residual U-Net (AResU-Net), which simultaneously embeds an attention mechanism and residual units into U-Net to further improve brain tumor segmentation. AResU-Net adds a series of attention units between corresponding down-sampling and up-sampling processes, and it adaptively rescales features to effectively enhance the local responses of down-sampled residual features used for feature recovery in the following up-sampling process. We extensively evaluate AResU-Net on two MRI brain tumor segmentation benchmarks, the BraTS 2017 and BraTS 2018 datasets. Experiment results illustrate that the proposed AResU-Net outperforms its baselines and achieves comparable performance with typical brain tumor segmentation methods.


Introduction
Brain tumors are abnormal cells growing in human brains, regarded as a type of common neurological disease that is extremely harmful to human health [1]. As an important way to assist in the diagnosis and treatment of brain tumors, automatic brain tumor segmentation performed on brain magnetic resonance images is of great significance in clinical medicine [2]. The most common malignant brain tumor is glioma, which can be further divided into high-grade glioma (HGG) and low-grade glioma (LGG) [3]. Magnetic resonance imaging (MRI) is a typical non-invasive imaging technology, which can produce high-quality brain images without damage or skull artifacts, and is regarded as the main technical means for the diagnosis and treatment of brain tumors. With multimodal brain images, doctors can perform quantitative analysis of brain tumors to develop the best diagnosis and treatment plans.

Related Work
In this section, we review literature in three domains related to our AResU-Net model, including patch-wise brain tumor segmentation networks, brain tumor semantic segmentation networks and attention mechanisms.
Patch-wise Segmentation. Recently, a number of deep networks have been developed in the field of brain tumor segmentation, achieving significant performance improvement over traditional methods. Among them, patch-wise brain tumor segmentation networks, as representative early works, are trained on small labeled patches to correctly distinguish brain tumors from normal tissues. To achieve favorable performance, researchers have designed various modules to introduce richer contextual information among different slices into networks. Havaei et al. [24] embedded multi-scale and multi-path modules into a 2D network structure to capture richer contextual information. Instead of utilizing 2D convolutional neural networks (CNNs) as the backbone, Urban et al. presented a patch-wise brain tumor segmentation network using a 3D CNN [25] architecture. Pereira et al. [26] explored small 3 × 3 convolutional kernels to design a deeper network, achieving more non-linearities and effectively reducing the over-fitting problem. To further boost segmentation performance, Kamnitsas et al. [27] adopted a dual-pathway 3D CNN model with a dense structure for the brain tumor segmentation task, which also performed multi-scale processing on input images and post-processing on result images using a conditional random field (CRF). This work won first place in the BraTS 2015 competition. In addition, Zhao et al. [28] integrated a fully convolutional neural network with a CRF, training three 2D patch-wise models from axial, sagittal and coronal views with a voting-based fusion strategy to complete the brain tumor segmentation.
Semantic-wise Segmentation. The semantic segmentation model classifies each pixel of the whole brain image into an assigned label to complete brain tumor segmentation. Most of the semantic segmentation models for the brain tumor segmentation task are based on U-Net architecture proposed by Ronneberger et al. [14], which has also been widely applied to other medical image segmentation tasks. U-Net contains a contracting path to capture context information and an expanding path that ensures accurate location, largely improving the performance of medical image segmentation task. Dong et al. [29] developed a 2D U-Net based brain tumor segmentation network and employed real-time data augmentation to refine its segmentation performance. Kong et al. [30] embedded a feature pyramid module into U-Net architecture to integrate multi-scale semantic and location information, which effectively improved the segmentation accuracy. Additionally, cascade strategy, dense block, dilated convolution and up skip connection have also been introduced into U-Net architecture [31][32][33][34][35], continuously optimizing network structures to pursue more accurate tumor segmentation results.
Attention Mechanism. Attention mechanisms have been increasingly applied to a variety of computer vision tasks, and can be roughly divided into two categories in terms of purpose. The first purpose is to capture long-range dependencies. NL-Net, a representative work presented by Wang et al. [17], generates new spatial feature responses through a weighted sum of all responses to capture long-range dependencies along the spatial dimension. Based on the NL-Net model, Zhao et al. [21] designed a location-sensitive NL to learn long-range context, which achieved impressive segmentation results. Fu et al. [36] proposed dual attention modules consisting of spatial and channel attention for semantic segmentation, in which the spatial attention was similar to the non-local (NL) operation in NL-Net and the channel attention followed the same idea. Moreover, Zhang et al. [37] extended NL with a prior distribution and built a weighted ensemble of NLs to further improve segmentation performance. The second purpose of the attention mechanism is to learn per-channel scaling factors for feature maps. A typical work is SENet [18], which focuses on channel relationships and performs dynamic channel-wise feature recalibration to enhance feature expression. Instead of using simple global average pooling to summarize feature statistics, EncNet [19] employed a VLAD [38] encoder to collect them, and the encoder output was also passed through fully connected layers to obtain channel-wise factors. In the up-sampling stage of image segmentation, DFN [20] and PAN [39] fed the features of deep layers, which carry stronger semantics, into an SE-like attention block to provide high-level category information used to precisely recover details. With such a block, the features from deep and shallow layers were well combined to enhance the learning of features with larger resolution but weaker semantics, helping to restore the original image resolution.
Inspired by the success of attention, we explore it to learn channel-wise factors to augment channel responses selectively.

Method
In this section, we firstly give a brief introduction to the data preprocessing utilized in this work. Then, the details of AResU-Net, including the feature learning module, the contextual fusion module and feature recovery module, are further described. Finally, we introduce the loss function adopted in AResU-Net.

Data Preprocessing
MRI brain tumor segmentation is a challenging task in the field of medical image analysis due to the complexity of brain structure and biological tissue, as well as the influence of imaging quality. Though deep learning-based models are robust to noise, data preprocessing is still an important step to improve brain tumor segmentation performance. In this work, we mainly utilize two multi-modality MRI brain scan datasets, i.e., the BraTS 2017 and BraTS 2018 datasets. To better fit the network, we also perform data processing on the original MRI brain tumor images, whose overall steps are shown in Figure 2. As demonstrated in this figure, most invalid pixels are first removed from the original 3D brain image data, and then each refined 3D image is sliced into a number of 2D images. Next, several patches of size 128 × 128 are extracted from each 2D slice, followed by a z-score normalization, i.e., zero-mean normalization, performed on each 128 × 128 patch. The z-score normalization is calculated as follows:

$$\hat{z} = \frac{z - \mu}{\delta}$$

where $z$ and $\hat{z}$ are the input image and the normalized output image, respectively, $\mu$ is the mean value of the input image, and $\delta$ is its standard deviation. After these steps, images with normalized multi-center and inhomogeneous intensity are obtained. Finally, to mitigate the effects of overfitting, Gaussian noise reduction [40] is further performed on each normalized patch, and the result is taken as the input of the segmentation network.
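As a concrete illustration, the z-score step above can be sketched in a few lines of NumPy; the function name and the `eps` guard against constant patches are our own additions, not taken from the paper's code:

```python
import numpy as np

def zscore_normalize(patch, eps=1e-8):
    """Zero-mean, unit-variance normalization of a single 2D patch.

    Mirrors the paper's z-score step: subtract the patch mean (mu) and
    divide by the patch standard deviation (delta). `eps` is an
    illustrative guard against division by zero on constant patches.
    """
    mu = patch.mean()
    delta = patch.std()
    return (patch - mu) / (delta + eps)

# Example on a synthetic 128 x 128 patch
patch = np.random.rand(128, 128) * 255.0
normalized = zscore_normalize(patch)
```

After this step each patch has approximately zero mean and unit standard deviation, which standardizes intensities across scanners before the patches enter the network.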

AResU-Net
The AResU-Net model is an end-to-end brain tumor segmentation network that simultaneously embeds an attention mechanism and residual blocks into the existing U-Net architecture, making it well suited to medical image segmentation with limited training data. AResU-Net follows the training strategy of a 2D convolutional neural network to achieve multi-modal pixel-level classification. The input to the network has size 128 × 128 × 4, i.e., a spatial size of 128 × 128 with 4 channels.
As shown in Figure 1, our brain tumor segmentation network can be seen as a classical encoder-decoder structure that preserves high-level information in deep layers by combining shallow and deep features. The feature learning (encoder) module includes three residual (Res) blocks and a bottom convolutional layer with dropout to obtain high-level features carrying contextual semantic information. The feature recovery (decoder) module adopts three similar up-sampling residual (SRes) blocks for precise positioning and feature recovery. To gain richer low-level and high-level information, the attention and squeeze excitation (ASE) block is added as the horizontal connection, effectively enhancing the feature representations passed between down-sampling features and up-sampling information. Finally, we integrate a softmax layer to obtain the final multi-class segmentation results. By integrating these blocks into a unified architecture, AResU-Net captures richer information and achieves stable segmentation results.

Feature Learning Module
The feature learning module implements the down-sampling process through a series of residual (Res) blocks, whose basic architecture is illustrated in Figure 3. As given in Figure 3, the residual block is realized by a shortcut connection with an element-wise addition operation, which greatly improves training speed and accuracy without any extra parameters. Each encoder block contains two convolution layers and one pooling layer. In our work, we mainly adopt ResNet-34 [41] for the feature learning module, which includes two identical units, each composed of a regularization, an activation function, and a 3 × 3 convolutional layer. The residual operation [41] can be denoted as

$$y = F(x, \{W_i\}) + x$$

where $x$ and $y$ are the input and output vectors of the related layers, and $F(x, \{W_i\})$ is the mapping function of the residual path. The result of $F(x, \{W_i\})$ must have the same dimension as $x$. The residual block adds a shortcut mechanism to avoid gradient vanishing and accelerate network convergence, integrating coarse local and global features. Considering the size of brain tumors, we adopt three down-sampling units for feature map learning, and the final feature map size is 16 × 16 pixels after the whole down-sampling process.
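The identity-shortcut idea can be illustrated with a toy NumPy sketch. Here the residual path F is simplified to two weight multiplications with a ReLU in between, whereas the paper's actual F uses regularization, PReLU and 3 × 3 convolutions; all names in this sketch are illustrative:

```python
import numpy as np

def residual_unit(x, w1, w2):
    """Toy residual unit: y = F(x, {W_i}) + x.

    F is sketched as two matrix multiplications with a ReLU in between;
    the identity shortcut adds the input back element-wise, so when F
    outputs zero the unit reduces to the identity mapping.
    """
    relu = lambda t: np.maximum(t, 0.0)
    f = relu(x @ w1) @ w2   # residual path F(x, {W1, W2})
    return f + x            # shortcut: element-wise addition

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))        # a batch of 16 feature vectors
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_unit(x, w1, w2)
```

The sketch makes the key property explicit: with all-zero weights the block passes its input through unchanged, which is why stacking such blocks does not hinder gradient flow.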

Contextual Fusion Module
Here, we combine the attention mechanism with the shortcut idea to construct a novel contextual fusion module, replacing the direct connection of the original U-Net architecture. The contextual fusion module preserves detailed information passed from the encoder to the decoder, which helps to recover feature information lost during down-sampling. As shown in Figure 1, the contextual fusion module is mainly composed of a series of attention and squeeze-excitation (ASE) blocks. The ASE block encodes the dependencies of each channel through a fully connected operation, embedding global spatial information into each channel vector. The details of the ASE block are illustrated in Figure 4. The ASE block allows the network to attend differently to the various channels based on the importance of the feature maps. To achieve more robust feature information, we first apply a regularization, a PReLU activation, a 1 × 1 convolution layer and a sum operation to the corresponding input maps, which can be computed as

$$F_{tr} = \mathrm{Conv}_{1\times 1}(\theta(\beta(x))) + x$$

where $x$ and $F_{tr}$ are the input and output of the related layers, respectively, $\beta$ is the normalization operation, and $\theta$ is the PReLU activation function. Then, the output feature information $F_{tr}$ passes through a squeeze operation, implemented by a global average pooling layer, to aggregate global information for each channel of the whole image. To fully capture channel-wise dependencies, the squeezed feature maps are fed into an excitation operation consisting of two fully connected layers around non-linear operations that model the interaction between channels, i.e., the ReLU and sigmoid activation functions. Finally, the resulting weight vector is reshaped to size (1, 1, 1, C), where C is the number of feature maps, and applied to each feature map by channel-wise multiplication, yielding the transformation output $F_{scale}(F_{tr}, F_{eq})$.
The corresponding process [18] can be denoted as follows:

$$s_c = F_{sq}(u_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j)$$

$$F_{eq} = \sigma(W_2\,\rho(W_1 s))$$

$$F_{scale}(F_{tr}, F_{eq}) = F_{eq} \cdot F_{tr}$$

In the above equations, $F_{sq}$, $F_{eq}$ and $F_{scale}$ denote the squeezed global spatial information, the channel-wise dependency information and the output vectors of the related layers, respectively. $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the dimensionality-reduction layer and the dimensionality-increasing layer, respectively, $r$ represents the reduction ratio, and $\rho$ and $\sigma$ are the non-linear activations (ReLU and sigmoid). In addition, $H$ and $W$ are the spatial dimensions.
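The squeeze, excitation and scale steps can be sketched in NumPy on an (H, W, C) feature map. The weight shapes follow the reduction ratio r, and all names here are illustrative rather than taken from the paper's code:

```python
import numpy as np

def se_recalibrate(feature_maps, w1, w2):
    """Squeeze-and-excitation channel recalibration in the style of [18].

    feature_maps: array of shape (H, W, C).
    Squeeze:    global average pooling over H x W -> per-channel statistic s.
    Excitation: sigma(W2 @ rho(W1 @ s)) with rho = ReLU, sigma = sigmoid,
                producing one weight in (0, 1) per channel.
    Scale:      each channel of the input is multiplied by its weight.
    w1 has shape (C // r, C); w2 has shape (C, C // r).
    """
    rho = lambda t: np.maximum(t, 0.0)           # ReLU
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # sigmoid
    s = feature_maps.mean(axis=(0, 1))           # F_sq: shape (C,)
    e = sigma(w2 @ rho(w1 @ s))                  # F_eq: shape (C,), in (0, 1)
    return feature_maps * e.reshape(1, 1, -1)    # F_scale: broadcast over H, W

C, r = 8, 2
rng = np.random.default_rng(1)
x = rng.standard_normal((16, 16, C))
w1 = rng.standard_normal((C // r, C)) * 0.5
w2 = rng.standard_normal((C, C // r)) * 0.5
out = se_recalibrate(x, w1, w2)
```

Because every excitation weight lies strictly between 0 and 1, each output channel is a damped copy of the corresponding input channel, which is exactly the "emphasize useful, suppress redundant" behavior the ASE block relies on.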
Therefore, the ASE block emphasizes useful feature information and suppresses redundant feature information through the variable weights obtained by the attention mechanism, focusing on local details to achieve better results. By integrating several ASE blocks, the contextual fusion module combines the attention mechanism with the shortcut idea, directing the network's attention toward the most important feature maps.

Feature Recovery Module
The feature recovery module is designed to restore the high-level image features extracted by the feature learning module and the contextual fusion module. Bilinear interpolation and deconvolution are two common operations for decoder structures: bilinear interpolation increases the image size by linear interpolation, while deconvolution employs a learned convolution to enlarge the feature maps. We adopt an efficient spatial residual (SRes) block, similar to the residual structure [41], to enhance decoding performance; its basic structure is shown in Figure 5. The SRes block mainly includes two convolutional units as in the Res block, a 1 × 1 convolution on the input feature maps as the shortcut connection, and an element-wise addition operation. In the SRes block, the feature decoder module outputs a mask whose size matches that of the original input. As shown in Figure 1, we achieve richer feature information recovery through deconvolution combined with several SRes blocks.

Loss Function
The MRI brain tumor segmentation task usually exhibits a severe class imbalance problem. Table 1 illustrates the distribution of sub-classes in the training data of the BraTS dataset: approximately 98.46% of voxels belong to either healthy tissue or the black surrounding area, labeled as background. However, the edema and the enhancing tumor only cover 1.02% and 0.29% of the voxels of the whole data, respectively. Moreover, the necrotic and non-enhancing tumors occupy the lowest volume among all categories, with a rate of only 0.23%. Although the data pre-processing alleviates this problem to some extent, it still severely affects the segmentation accuracy. Here, we employ a combined loss function [15] that integrates the weighted cross-entropy (WCE) and the generalized Dice loss (GDL) to address this class imbalance problem, as below:

$$L = L_{GDL} + L_{WCE}$$

where $L_{GDL}$ and $L_{WCE}$ respectively represent the generalized Dice loss and the weighted cross-entropy loss, which are correspondingly defined as Equations (8) and (9):

$$L_{GDL} = 1 - 2\,\frac{\sum_{i=1}^{L} w_i \sum_{j} p_{ij}\, g_{ij}}{\sum_{i=1}^{L} w_i \sum_{j} \left(p_{ij} + g_{ij}\right)} \qquad (8)$$

$$L_{WCE} = -\sum_{i=1}^{L} w_i \sum_{j} g_{ij} \log p_{ij} \qquad (9)$$

where $L$ is the total number of labels and $w_i$ denotes the weight assigned to the $i$-th label. For the generalized Dice loss, $p_{ij}$ and $g_{ij}$ represent the $j$-th pixel value of the segmented binary image and the binary ground truth image for label $i$, respectively.
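A minimal NumPy sketch of a combined generalized Dice + weighted cross-entropy loss, following the standard definitions of the two terms; the inverse-square-frequency weighting and the `eps` guards are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def generalized_dice_loss(p, g, w, eps=1e-8):
    """GDL = 1 - 2 * (sum_i w_i sum_j p_ij g_ij) / (sum_i w_i sum_j (p_ij + g_ij)).

    p, g: arrays of shape (num_labels, num_pixels) holding predicted
    probabilities and one-hot ground truth; w: per-label weights.
    """
    num = (w * (p * g).sum(axis=1)).sum()
    den = (w * (p + g).sum(axis=1)).sum()
    return 1.0 - 2.0 * num / (den + eps)

def weighted_cross_entropy(p, g, w, eps=1e-8):
    """WCE = -sum_i w_i sum_j g_ij log p_ij, averaged over pixels."""
    return -(w[:, None] * g * np.log(p + eps)).sum() / g.shape[1]

# Synthetic 4-label problem with a heavily imbalanced label distribution
num_labels, num_pixels = 4, 100
g = np.zeros((num_labels, num_pixels))
g[0, :70] = 1; g[1, 70:90] = 1; g[2, 90:97] = 1; g[3, 97:] = 1
w = 1.0 / (g.sum(axis=1) ** 2 + 1e-8)  # inverse-square-frequency weights

p_perfect = g.copy()
loss = generalized_dice_loss(p_perfect, g, w) + weighted_cross_entropy(p_perfect, g, w)
```

The weighting down-scales the contribution of the dominant background label, so rare classes such as the enhancing tumor are not drowned out during training.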

Experiments and Results
In this section, we conduct experiments on two brain tumor benchmarks to evaluate the effectiveness of AResU-Net. We first describe the details of the datasets employed for model evaluation. Then, the experimental settings are introduced, followed by a brief description of the evaluation metric. Finally, comparative experiment results on the two brain tumor benchmarks are presented and analyzed.

Datasets
In this work, we mainly adopt the public BraTS 2017 and BraTS 2018 [4,42] brain tumor segmentation datasets for performance evaluation. These two datasets were released by the Multimodal Brain Tumor Segmentation Challenge (BraTS), held in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2017 and 2018. The BraTS 2017 dataset consists of a training dataset, a validation dataset and a test dataset, and each sample has four different modalities, i.e., fluid-attenuated inversion recovery (FLAIR), T1 weighting (T1), T1 weighting with contrast enhancement (T1ce), and T2 weighting (T2). However, only the training dataset is publicly available among the three. The BraTS 2017 training dataset includes 285 patient samples, of which 210 are from high-grade glioma (HGG) patients and the remaining 75 belong to low-grade glioma (LGG) patients. All the images are skull-stripped and re-sampled to an isotropic 1 × 1 × 1 mm³ resolution with an image size of 240 × 240 × 155, and the four sequences from the same patient have been co-registered. The ground truth of each image is labeled based on manual segmentation results given by experts. The basic labels comprise four types: the GD-enhancing tumor (labeled as 4), the peritumoral edema (labeled as 2), the necrotic and non-enhancing tumor (labeled as 1), and healthy pixels (labeled as 0). From these labels, the whole tumor (Whole, the combined areas of labels 1, 2, 4), core tumor (Core, the combined areas of labels 1, 4), and enhancing tumor (Enhancing, the area of label 4) are further constructed. Two typical samples from this dataset are illustrated in Figure 6. Besides, the BraTS 2018 dataset shares the same training data as the BraTS 2017 dataset but has different validation and testing datasets.
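The three nested evaluation regions described above can be derived from a BraTS label map with a small helper; `tumor_regions` is our own hypothetical name, not part of any BraTS tooling:

```python
import numpy as np

def tumor_regions(label_map):
    """Map a BraTS label map (values 0, 1, 2, 4) to the three nested
    evaluation regions used by the benchmark:
      whole     = labels {1, 2, 4}
      core      = labels {1, 4}
      enhancing = label  {4}
    Returns three boolean masks of the same shape as the input.
    """
    whole = np.isin(label_map, (1, 2, 4))
    core = np.isin(label_map, (1, 4))
    enhancing = label_map == 4
    return whole, core, enhancing

labels = np.array([[0, 1, 2],
                   [4, 0, 2]])
whole, core, enh = tumor_regions(labels)
```

By construction the regions are nested (enhancing is a subset of core, which is a subset of whole), which is why a method can score differently on the three Dice metrics for the same prediction.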

Experimental Settings
AResU-Net is implemented in Keras 2.2.4 with the TensorFlow [43] 1.5.0 backend, running on a PC equipped with 64 GB RAM and a single NVIDIA GTX 1080 GPU. The stochastic gradient descent (SGD) algorithm is employed as the optimizer with an initial learning rate of 0.085, a momentum of 0.95, and a weight decay of 5 × 10⁻⁶. Besides, we utilize the patch-wise training strategy and the weighted Dice loss function to obtain a superior network model, and the size of each input patch is 128 × 128 × 4 pixels. All networks are trained from scratch with a batch size of 10 for 5 epochs. For fair comparison, we report results of methods marked with * through the publicly released codes of their authors and try our best to fine-tune their parameters.

Evaluation Metric
We adopt the commonly used Dice score as the evaluation metric to assess the proposed AResU-Net model. The Dice score measures the overlap between the manual segmentation result and the automatic segmentation result, and can be computed by the following equation:

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN}$$

In the above equation, TP, FP and FN represent the numbers of true positive, false positive and false negative predictions, respectively.
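The Dice computation follows directly from the TP/FP/FN counts over a pair of binary masks; in this sketch, the `eps` guard against two empty masks is our own addition:

```python
import numpy as np

def dice_score(pred, truth, eps=1e-8):
    """Dice = 2*TP / (2*TP + FP + FN), i.e. 2|A ∩ B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # predicted and present
    fp = np.logical_and(pred, ~truth).sum()  # predicted but absent
    fn = np.logical_and(~pred, truth).sum()  # missed by the prediction
    return 2.0 * tp / (2.0 * tp + fp + fn + eps)

a = np.array([[1, 1, 0],
              [0, 1, 0]])
b = np.array([[1, 0, 0],
              [0, 1, 1]])
score = dice_score(a, b)  # TP=2, FP=1, FN=1 -> 2*2 / (4 + 1 + 1)
```

A score of 1 indicates perfect overlap with the manual segmentation, while 0 indicates no overlap at all.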

Experiment Results on the BraTS 2017 Dataset
The first experiment is conducted on the BraTS 2017 HGG dataset. In this experiment, 80% of the BraTS 2017 HGG training cases (168 HGG cases) are used to train the brain tumor segmentation networks, and the remaining 42 HGG cases are used for testing. We compare AResU-Net with its U-Net baseline [29], as well as four recently proposed networks: FCNN, ResU-Net, Densely CNN and the CNN [15,26,28,44]. The comparative experiment results are listed in Table 2. As shown in Table 2, the CNN and FCNN models employ 2D CNNs with 33 × 33 patches as inputs to predict the center voxel, and the FCNN model additionally applies a conditional random field as a post-processing step to improve prediction performance. Without utilizing any post-processing strategy, AResU-Net achieves mean Dice scores of 0.892, 0.853 and 0.825 on the whole tumor, core tumor and enhancing tumor, respectively. It obtains 6.1%, 5.2% and 7.5% gains over its U-Net baseline on the segmentation of these three areas, indicating a significant accuracy improvement. Meanwhile, it outperforms ResU-Net by 1.2%, 0.3% and 0.5% on the whole tumor, core tumor and enhancing tumor, respectively. Additionally, compared with the FCNN, the CNN and Densely CNN, AResU-Net also obtains the highest Dice scores on the whole tumor and enhancing tumor segmentation. These results demonstrate the effectiveness of our model for the brain tumor segmentation task.

The second experiment is conducted on the whole BraTS 2017 training data. In this experiment, we choose 80% of the images for training and use the rest for testing, meaning that brain tumor images from 228 patients constitute the training set and images from the remaining 57 cases constitute the testing set. We compare AResU-Net with its U-Net baseline [29], as well as four recently proposed networks: SegNet, PSP-Net, NovelNet and ResU-Net [15,34,45]. The comparative experiment results are reported in Table 3. As given in Table 3, AResU-Net achieves mean Dice coefficients of 0.881, 0.780 and 0.719 for the whole tumor, core tumor and enhancing tumor segmentation, respectively. Compared with the basic U-Net, AResU-Net outperforms it by 1.10%, 1.80% and 1.60% on the whole tumor, core tumor and enhancing tumor, respectively. AResU-Net is also superior to the other networks in all three areas. Moreover, some examples of visual comparison are given in Figure 7. Overall, these results demonstrate that the attention mechanism helps improve the accuracy of the brain tumor segmentation task.

Experiment Results on the BraTS 2018 Dataset
To further evaluate AResU-Net, we also perform an experiment on the BraTS 2018 dataset. In this experiment, the 285 samples from the BraTS 2018 training dataset are adopted for training and 66 subjects from the validation dataset are utilized for testing. We compare AResU-Net with the U-Net baseline and other networks, including ResU-Net, Ensemble Net, 3DU-Net, S3DU-Net, TTA and MMC [15,16,46-49]. Table 4 shows the comparative experiment results. From Table 4 we can see that AResU-Net achieves average Dice scores of 0.876, 0.810 and 0.773 on the whole tumor, core tumor and enhancing tumor segmentation, respectively. Compared with the U-Net baseline, AResU-Net gains 1.60%, 2.00% and 0.60% performance improvement on the whole tumor, core tumor and enhancing tumor segmentation, respectively. In comparison with some recent methods [16,46-49], AResU-Net also achieves the best performance on the enhancing tumor segmentation. Meanwhile, it takes second place among the compared methods on the core tumor segmentation, being only slightly inferior to S3DU-Net [47]. However, its Dice score is less favorable on the whole tumor segmentation; the reason is that these competing methods utilize either 3D networks or more complicated 2D network structures. Nevertheless, these comparative results still confirm the effectiveness of the proposed AResU-Net method for brain tumor segmentation.

Conclusions
In this paper, we presented a novel AResU-Net model for the MRI brain tumor segmentation task, which simultaneously embeds an attention mechanism and residual units into U-Net to improve the segmentation performance of brain tumors. By adding a series of attention units between corresponding down-sampling and up-sampling processes, AResU-Net adaptively rescales features to enhance the local responses of down-sampled residual features, as well as the recovery quality of the up-sampling process. Experiment results on two brain tumor segmentation benchmarks demonstrated that AResU-Net outperforms U-Net by a large margin and achieves comparable performance with other typical brain tumor segmentation methods. However, due to the limitation of computational resources, our model takes 2D slices as inputs, which to a certain extent loses contextual information across different slices of brain tumors. Therefore, in the future, we will extend AResU-Net to a 3D network to pursue better segmentation results and apply it to other medical image segmentation tasks for further evaluation. In addition, more powerful feature extraction modules will also be explored to gain further performance improvement.