Multi-Task Learning for Small Brain Tumor Segmentation from MRI

Segmenting brain tumors accurately and reliably is an essential part of cancer diagnosis and treatment planning. Brain tumor segmentation in glioma patients is challenging because of the wide variety of tumor sizes, shapes, and positions, scanning modalities, and scanner acquisition protocols. Many convolutional neural network (CNN) based methods have been proposed for brain tumor segmentation and have achieved great success. However, most previous studies do not fully account for multiscale tumors and often fail to segment small tumors, which can have a significant impact on finding early-stage cancers. This paper deals with brain tumor segmentation at all sizes but focuses especially on accurately identifying small tumors, thereby also improving segmentation performance overall. Instead of using heavyweight networks with multi-resolution inputs or multiple kernel sizes, we propose a novel approach to better segment small tumors using dilated convolution and multi-task learning. Dilated convolution is used for multiscale feature extraction; however, it does not work well for very small tumors. To deal with small tumors, we apply multi-task learning, where an auxiliary task of feature reconstruction is used to retain the features of small tumors. The experiments show the effectiveness of the proposed method in segmenting small tumors. This paper contributes to the detection and segmentation of small tumors, which have seldom been considered before, and to the development of hierarchical analysis using multi-task learning.


Introduction
Among cancers originating in the brain, glioma is one of the most frequent [1]. Glioma arises from glial cells and can be separated into low-grade and high-grade glioma. High-grade gliomas are malignant, with a mean survival time of about 15 months [2]. Low-grade gliomas may be malignant or benign; they develop slowly and can progress to high-grade gliomas. Currently, the most common treatments for glioma are radiotherapy, chemotherapy, and surgery. For all of these treatments, accurate segmentation of the tumors and their surrounding tissues is important because it helps doctors evaluate disease progression, treatment response, and therapy planning for glioma patients. Computed tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI) are the most common ways of detecting and monitoring tumors. Among them, MRI is usually the first choice because it offers high resolution and superior contrast without harming the patient's health. In conventional clinical practice, the doctor needs to navigate 3D images and segment tumor areas manually, which is tedious, time-consuming, and requires high expertise; automatic segmentation methods are therefore highly desirable. Our approach relies on an auxiliary task of feature reconstruction, which helps to better initialize the parameters of the model and, at the same time, retains detailed features as much as possible. In this paper, we use multi-task learning with this auxiliary feature reconstruction task to preserve detailed features of small tumors, which are often lost in down-sampled layers. We define small tumors as those whose size is less than about 10% of the median. For enhancing tumors, the main target of our experiments, the median size is about 20,000 voxels, so small tumors are defined as those smaller than 2000 voxels. Experiments are performed separately on tumors smaller than 1000 voxels and smaller than 2000 voxels.
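As a concrete reading of this definition, the cutoff can be computed directly from per-patient voxel counts of the target region. The helper below is an illustrative sketch; the function name and the sample sizes are our own, not from the paper:

```python
import numpy as np

def small_tumor_threshold(label_volumes, fraction=0.1):
    """Return the voxel-count cutoff below which a tumor is 'small'.

    label_volumes: per-patient voxel counts of the target region
    (e.g. enhancing tumor). The cutoff is `fraction` of the median,
    matching the definition of ~10% of the median size above.
    """
    sizes = np.asarray(label_volumes, dtype=np.int64)
    return fraction * np.median(sizes)

# Illustrative counts only; the text reports a median near 20,000
# voxels for enhancing tumor, giving a cutoff of about 2000 voxels.
sizes = [500, 1800, 19000, 20000, 21000, 45000, 90000]
cutoff = small_tumor_threshold(sizes)
small = [s for s in sizes if s < cutoff]
```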
Specifically, we devised a new multi-task learning network based on U-Net, in which a U-module is embedded in the encoding stage. Our network is simple, runs on medium hardware specifications, and successfully handles small tumors. To the best of our knowledge, this is the first time multi-task learning has been used to detect and segment small objects. Even though there have been approaches [13,14] using multi-task learning for medical image segmentation, they were not targeted at hierarchical analysis. The contributions of the paper can be summarized as follows:

1. Most previous studies seldom considered small tumors, which have a significant impact on finding early-stage cancers. This paper focuses on accurately identifying multiscale tumors, especially very small tumors in the brain.

2. Instead of using heavyweight networks with multi-resolution inputs or multiple kernel sizes, we propose a novel approach to better segment small tumors using dilated convolution and multi-task learning. Using multi-task learning for hierarchical analysis is relatively new and has never before been tried for small-tumor detection or segmentation.

Related Work
Medical image analysis is the process of analyzing images to solve practical clinical problems. It has helped radiologists and doctors make diagnosis and treatment planning more accurate [15]. Computer-aided diagnosis (CADx) systems rely on medical image analysis as a vital factor, and it can directly affect the course of clinical examination and treatment [16,17]. The purpose of medical image analysis is to extract useful features that help improve clinical diagnosis. Within CADx, machine learning has played a very important role in popular applications such as tumor segmentation, classification, nodule detection, image-guided therapy, medical image retrieval, and annotation [18,19,20]. In the past, machine learning relied on traditional handcrafted features, but selecting and estimating such features is complicated. Among machine learning techniques, deep learning is now commonly used for medical image processing and analysis because useful information and features are learned automatically by the neural network.
Segmentation of medical images is a requirement for applications such as brain segmentation [21,22], body organ segmentation [23,24,25], and cardiac ventricle segmentation [26]. Accurate segmentation results are useful for diagnosis and treatment planning. The relevant information extracted from medical images by segmentation methods includes shape, length, volume, relative organ location, and abnormalities [27]. An iterative 3D multiscale Otsu thresholding method was introduced in [28] for medical image segmentation; the impact of noise and weak edges is reduced by a multilevel image representation. A hybrid methodology for the automatic segmentation of ultrasound images was proposed in [29], combining spatially constrained kernel-based fuzzy clustering with the edge-based features of a regularized level set. The method was tested on both real and synthetically generated images.
Brain tumor segmentation has attracted much attention, and many methods have been proposed. Early brain tumor segmentation methods were often based on thresholds, edges, regions, atlases, classification, and clustering [30]. These methods usually do not consume many computational resources, but they do not achieve high performance and require user interaction. The multimodal brain tumor segmentation challenge (BraTS) was introduced for the first time in 2012, and many methods have since been proposed for the task. Prastawa et al. [31] proposed a method that compares an atlas of the normal healthy brain with the patient scan for brain tumor segmentation. One drawback of this method is that it excludes the tumor-induced mass effect caused by the deformation of adjacent healthy structures, which may cause incorrect registration. Other researchers use tumor growth models to modify the healthy atlas and perform registration and segmentation to a new brain atlas jointly [32,33]. These approaches benefit from taking tumor characteristics into account; however, tumor growth models bring further complexity. Cordier et al. [34] proposed a multi-atlas approach that relies on an algorithm searching for similar image patches. Numerous active-contour approaches were also proposed [35,36]. All of these methods rely on the symmetry and/or alignment features of the left and right brain. Since aligning a brain with a large tumor onto a template is challenging, some methods perform the registration task and the tumor segmentation task at the same time. Voxel-wise classifiers based on models such as random forests [37,38] or SVMs [39,40] obtained promising results. For example, Geremia et al. [41] suggested a method that uses information from neighboring voxels and far-off regions, including the symmetrical part of the brain, to classify each voxel in magnetic resonance (MR) images.
Recently, many researchers have applied convolutional neural networks (CNNs) to segmenting brain tumors in MRI scans and have obtained very good results. The studies can be divided into three categories. The first group uses 2D convolutions to make training faster but fails to learn 3D structures, which are among the most important features of medical images. The second group uses 3D convolutions to exploit 3D features and often achieves higher segmentation performance than 2D approaches, but at a high computational cost. The third group combines 2D with 3D to take advantage of the compactness of 2D models and the volumetric information of 3D models. Ronneberger et al. [5] proposed U-Net, a fully convolutional network with an encoder-decoder architecture, designed to be trained end-to-end to produce a full-resolution segmented mask. Although such 2D approaches ignore important 3D features of medical images, they have achieved remarkable performance in brain tumor segmentation. Cicek et al. [42] upgraded the original U-Net to 3D U-Net by leveraging 3D operations such as 3D convolution and 3D max-pooling layers. Chen et al. [43] proposed a network exploring group convolution and dilated convolution, which helps both to reduce computational cost and to achieve good brain tumor segmentation performance. Brain tumor substructures are hierarchically nested: the enhancing tumor lies inside the tumor core, and the tumor core is included in the whole tumor. Many methods have used this property to improve brain tumor segmentation. Malmi et al. [44] and Pereira et al. [45] introduced two-stage cascaded approaches in which the whole tumor is segmented in the first stage and the tumor core and enhancing tumor in the second.
Wang et al. [46] presented a cascade of three CNNs to segment the whole tumor, tumor core, and enhancing tumor, respectively. Although this method achieved high accuracy, it requires building three almost identical models, which causes model redundancy, complicates training, and ignores the correlations among the models. In addition, training several models separately means the quality of the later models is significantly affected by the earlier ones. Liu et al. [47] proposed an end-to-end network, called CU-Net, that exploits the hierarchical structure of brain tumors. However, the network is built on 2D CNNs and processes input MR images slice by slice, which loses three-dimensional spatial information that is very important in medical imaging. Chen et al. [43] used an existing parcellation to bring the brain's location information into patch-based neural networks, improving their brain tumor segmentation performance. Brugger et al. [48] proposed a partially reversible U-Net architecture; a reversible architecture can recover each layer's outputs from those of the subsequent layer, eliminating the need to store activations for backpropagation. Fabian et al. [49] introduced an instantiation of the 3D U-Net [42] with minor modifications: they trained on patches of size 128 × 128 × 128 with a batch size of two, reduced the number of filters right before upsampling, and used instance normalization instead of batch normalization for more consistent results. Nuechterlein et al. proposed 3D-ESPNet [50], which extends ESPNet, a fast and efficient network based on point-wise convolution for 2D semantic segmentation, to 3D medical image data. Kao et al. [51] introduced a methodology that integrates human brain connectomics and parcellation for brain tumor segmentation and survival prediction. Wei et al. [52] designed a new separable 3D convolution by dividing each 3D convolution into three parallel branches, each with a different orthogonal view, namely axial, sagittal, and coronal; they also proposed a separable 3D block that takes advantage of the state-of-the-art residual inception architecture. However, these networks are neither designed for hierarchical analysis nor targeted at small tumors.
Multi-task learning has been successful in many deep learning applications, including brain tumor segmentation. Mlynarski et al. [53] exploited multi-task learning for brain tumor segmentation with an auxiliary classification task, demonstrating the effectiveness of multi-task learning for this problem. Myronenko [54] proposed a U-Net based model with an additional variational autoencoder branch for reconstructing the input images, trained jointly with the segmentation task. Although this architecture works well for segmentation in general, the model requires large memory and a heavy computational cost.
U-Net is a very successful network for biomedical image segmentation and is often the first choice for medical image segmentation tasks. However, U-Net gradually reduces the feature resolution through its down-sampling layers, which leads to poor segmentation of small objects. Several methods have been proposed to address small-object problems. Kampffmeyer et al. [55] applied patch-based and pixel-to-pixel methods to improve segmentation of small objects in urban remote sensing images. Shareef et al. [8] proposed a small tumor-aware network with multiple kernel sizes for accurate segmentation of small tumors in breast ultrasound images. These methods work well on 2D images but may not operate efficiently on MR images, where the data is three-dimensional.
In this paper, we propose a network that focuses on small tumors in the brain tumor segmentation problem. Our network uses multi-task learning to retain important features of small tumors, which are often lost during the downsampling process of the original U-Net. The experimental results show that the auxiliary task in our model is helpful for small tumor segmentation.

Materials and Method
In this section, we first present the concepts of the baseline model and the U-module [9]. Second, we describe the architecture of our network. Finally, we explain the strategy used to train it.

Baseline Network
Our model is based on a variant of U-Net called the dilated multifiber network (DMFNet) [43]. We used DMFNet because it consumes little computation and memory, making it suitable for most GPUs, while maintaining good performance. DMFNet follows the U-Net encoder-decoder architecture. A residual unit with two regular convolution layers is called a fiber [43], and the multifiber design consists of multiple such separated residual units, or fibers. The multifiber (MF) unit uses a multiplexer for information routing, and the dilated multifiber (DMF) unit extends it with an adaptive weighting scheme over different dilation rates.
The network explores 3D multifiber units and 3D dilated convolutions, which help reduce computational cost while still segmenting brain tumors well. DMFNet uses MF and DMF units as its basic blocks. In the encoder, six DMF units extract multiscale features by using different dilation rates in the dilated convolution layers. In the decoder, the extracted features are upsampled and concatenated with the high-resolution feature maps from the encoder. Batch normalization and Rectified Linear Unit (ReLU) activations are applied before every convolution of the DMF and MF units. Channel grouping separates the convolution channels into many groups of convolutions, significantly reducing the number of parameters between feature maps.
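The DMF idea can be sketched as parallel grouped 3D convolutions at several dilation rates, combined with learnable weights. The block below is a simplified PyTorch illustration, not the authors' exact DMFNet code; the channel counts, the softmax weighting, and the layer ordering details are our own assumptions:

```python
import torch
import torch.nn as nn

class DilatedGroupBlock(nn.Module):
    """Sketch of a DMF-style unit: grouped 3D convolutions run at
    several dilation rates in parallel and are summed with learnable
    weights. Sizes and weighting scheme are illustrative only."""

    def __init__(self, channels, groups=4, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),  # BN + ReLU before conv, as in DMFNet
                nn.Conv3d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, groups=groups,
                          bias=False),
            )
            for d in dilations
        )
        # adaptive weights over the dilation rates
        self.weights = nn.Parameter(torch.ones(len(dilations)))

    def forward(self, x):
        w = torch.softmax(self.weights, dim=0)
        out = sum(wi * branch(x) for wi, branch in zip(w, self.branches))
        return out + x  # residual connection

x = torch.randn(1, 16, 8, 8, 8)
y = DilatedGroupBlock(16)(x)
```

With `kernel_size=3` and `padding` equal to the dilation rate, each branch preserves the spatial size, so the branches can be summed directly.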

U-Module as an Auxiliary Task
Our model is inspired by the autoencoder architecture and the U-module [9]. An autoencoder is a type of neural network whose output has the same dimensionality as its input. It is often used to compress data into a latent-space representation, or for dimensionality reduction, using unsupervised learning. In an autoencoder, the encoder learns efficient data codings, preserving as much of the relevant information as possible, while the decoder takes the encoding and reconstructs the full input. The U-module is built on an autoencoder with the purpose of better parameter initialization of CNNs for medical image classification, and it has been shown to retain features in the subsequent layers of a CNN. In our work, besides the main segmentation task, we add the U-module to our model as an auxiliary task, which forces the model to preserve as much of the relevant and important information as possible.
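A minimal autoencoder in PyTorch illustrates the encoder-decoder pairing and the reconstruction objective described above; the layer sizes are arbitrary and purely illustrative:

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder: the encoder compresses the input to a
    latent code, the decoder reconstructs the input from that code;
    training minimizes the reconstruction error (unsupervised)."""

    def __init__(self, dim=64, latent=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, dim)

    def forward(self, x):
        z = self.encoder(x)          # compressed latent representation
        return self.decoder(z), z    # reconstruction and code

model = TinyAutoencoder()
x = torch.randn(4, 64)
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # unsupervised objective
```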

Proposed Network Architecture
Our proposed method is based on two observations. First, brain tumor sizes vary widely. Second, the U-Net often fails to detect small tumors, especially enhancing tumors, which are the smallest. Many previous methods focus only on improving overall performance, leaving unsolved issues such as missed tumors and poor detection/segmentation of small tumors. For small object segmentation, context is one of the most important elements [6]. In a typical U-Net, the down-sampling layers are useful for feature extraction and for extending receptive fields. However, they sacrifice an important element: resolution. Through the downsampling layers, the feature resolution is gradually reduced, so the features of small objects can be lost; once lost, they are difficult to recover despite the skip connections [5,6]. Therefore, we need a way to retain the features of small objects even when resolution is reduced. The architecture of our model is shown in Figure 1. Our model employs multi-task learning with two tasks: the main task is brain tumor segmentation, and feature reconstruction is used as an auxiliary task. The model takes four MRI modalities as input and outputs a segmented mask of the whole tumor, tumor core, and enhancing tumor regions. In particular, we add an upsampling layer and an MF unit to reconstruct features after the encoder layers. With this technique, important features are retained in the following layers, reducing the side effects of the downsampling layers on small tumors. As shown in Figure 1, the U-module is formed by three DMF units, one MF unit, and an upsampling layer. The first DMF unit reduces the size of the feature map; the next two DMF units keep the feature-map resolution unchanged, and the upsampling layer and MF unit then upsample the result to create a reconstructed feature map of the same size as the input feature map. In detail, the operation of the U-module can be written as

Z = ϕ(F),  F̂ = ρ(Z),

where ϕ and ρ denote the encoder and decoder of the U-module. The encoder compresses the original feature map F into a smaller feature map Z. The decoder uses the upsampling layer and MF unit to recover the original feature map as F̂. The difference between the reconstructed map F̂ and the original feature map F is minimized, so the small feature map can represent the large one. This way, the most important and relevant features of the previous layer are retained in the next layer. In our model, we use two U-modules, at the first and second encoder layers, because those layers have the largest feature resolutions to preserve. Besides helping our model retain the properties of small tumors, the U-module can also improve parameter initialization, its original purpose.
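The ϕ/ρ operation can be sketched as follows. A strided convolution stands in for the DMF encoder and a single convolution after upsampling stands in for the MF decoder; both are simplifications of the actual units:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UModuleSketch(nn.Module):
    """Hypothetical stand-in for the U-module: phi compresses the
    feature map, rho reconstructs it at the original resolution, and
    the reconstruction is compared with the input by MSE."""

    def __init__(self, channels):
        super().__init__()
        self.phi = nn.Conv3d(channels, channels, 3, stride=2, padding=1)
        self.rho = nn.Conv3d(channels, channels, 3, padding=1)

    def forward(self, feat):
        z = self.phi(feat)  # compress: Z = phi(F)
        up = F.interpolate(z, size=feat.shape[2:], mode="trilinear",
                           align_corners=False)
        recon = self.rho(up)  # reconstruct: F_hat = rho(Z)
        recon_loss = F.mse_loss(recon, feat)  # minimized during training
        return z, recon_loss

feat = torch.randn(1, 8, 16, 16, 16)
z, recon_loss = UModuleSketch(8)(feat)
```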
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Figure 1. Our proposed multi-task learning network with two tasks. One task is brain tumor segmentation. We add the additional feature reconstruction task to our model as an auxiliary task to help retain important features when downsampling. The model takes a sequence of MRI modalities as input and generates the brain tumor mask.


Training Strategy
In this paper, we used two loss functions, one for the segmentation task and one for the feature reconstruction task. The first is the generalized dice loss (GDL) [56]. Most previous methods use the dice loss to measure the difference between the predicted mask and the ground truth. However, the number of voxels varies greatly across the regions of the brain tumor structure, so we use the GDL to address the class imbalance among the subtumor labels. The GDL is calculated as

GDL = 1 − (2 Σ_c w_c Σ_n y_nc p_nc) / (Σ_c w_c Σ_n (y_nc + p_nc)),  (4)

where y_nc and p_nc are the ground truth and predicted tumor masks of class c, respectively. The class weight w_c provides invariance to the different label set properties.
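A NumPy sketch of the GDL follows, assuming the squared-inverse-volume class weights of the original GDL formulation [56]; the paper does not spell out its exact choice of w_c:

```python
import numpy as np

def generalized_dice_loss(y_true, y_pred, eps=1e-6):
    """Generalized dice loss over one-hot masks of shape
    (classes, voxels). w_c = 1 / (volume of class c)^2 down-weights
    abundant classes, as in Sudre et al.'s GDL formulation."""
    w = 1.0 / (y_true.sum(axis=1) ** 2 + eps)
    intersect = (w * (y_true * y_pred).sum(axis=1)).sum()
    union = (w * (y_true + y_pred).sum(axis=1)).sum()
    return 1.0 - 2.0 * intersect / (union + eps)

# A perfect prediction drives the loss toward zero,
# a fully wrong one toward one.
y = np.array([[1, 0, 1, 0], [0, 1, 0, 1]], dtype=float)
loss_perfect = generalized_dice_loss(y, y)
loss_wrong = generalized_dice_loss(y, 1 - y)
```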
We used the mean squared error to measure the difference between the reconstructed feature map and the input feature map of each U-module. The loss for the feature reconstruction task was calculated as

L_reconstruction = L_recon1 + L_recon2,  with each term MSE(I_orig, I_recon),

where L_recon1 and L_recon2 are the reconstruction losses of the first and second U-module blocks, I_orig is the input feature map of a U-module, and I_recon is its reconstructed feature map.
There are two tasks in our model: tumor segmentation and feature reconstruction. We therefore have two loss functions to optimize and need to combine them. The combined loss is the weighted sum of the losses of the main and auxiliary tasks:

L_total = λ1 L_segmentation + λ2 L_reconstruction,

where L_segmentation is the loss for our main task (tumor segmentation) and L_reconstruction is the loss for our auxiliary task (feature reconstruction); λ1 and λ2 are weight parameters. Since each loss function has its own optimization direction and value range, we need a training strategy to obtain good optimization. For the first 20 epochs, we set λ1 = λ2 = 0.5 for better parameter initialization. Afterwards, we set the main task weight λ1 = 1 and the auxiliary task weight λ2 = 0.05 so the network can focus on the main task. The reconstruction task is used to retain important features and to help the network converge better during training. During inference, the reconstruction task is removed and only the main task is used to generate segmented tumor masks.
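The weighting schedule described above can be written as a small helper; the function names are ours:

```python
def loss_weights(epoch, warmup=20):
    """Schedule from the text: equal weights for the first 20 epochs
    (parameter initialization), then emphasize segmentation."""
    if epoch < warmup:
        return 0.5, 0.5
    return 1.0, 0.05

def total_loss(l_seg, l_recon, epoch):
    lam1, lam2 = loss_weights(epoch)
    return lam1 * l_seg + lam2 * l_recon

early = total_loss(1.0, 2.0, epoch=5)    # 0.5*1.0 + 0.5*2.0
late = total_loss(1.0, 2.0, epoch=30)    # 1.0*1.0 + 0.05*2.0
```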


Experimental Results
In this section, we present the experimental details of our work. First, we describe the dataset used in our experiments. Second, we cover data preprocessing, implementation details, and evaluation metrics. Finally, we present our results.

Dataset
Our experiments were based on the imaging data provided by the Brain Tumor Segmentation Challenge (BraTS2018) [57,58,59]. All of the medical images were provided as 3D volumes with four kinds of MR sequences: native T1-weighted (T1), post-contrast T1-weighted (T1CE), T2-weighted (T2), and fluid-attenuated inversion recovery (FLAIR). The multimodal scans were acquired at 19 institutions with various clinical protocols and different scanners. Figure 2 shows a visualization of the BraTS2018 dataset with tumor masks on each MRI modality. In this dataset, all modalities were coregistered to the same anatomical template and resampled to a resolution of 1 mm × 1 mm × 1 mm. The MR images of all patients were manually segmented by four experts, then verified and approved by experienced neuroradiologists. The labels were annotated based on the intratumoral structures of gliomas. Segmentation results were evaluated on the tumor subregions, namely the whole tumor, the tumor core, and the enhancing tumor, as shown in Figure 3. The evaluation was done by uploading the segmented files to the online CBICA Image Processing Portal (https://ipp.cbica.upenn.edu). The dataset includes 285 patients for training and 66 patients for validation. The training data can be downloaded from the challenge's official website (http://braintumorsegmentation.org).


Data Preprocessing
Preprocessing is very important in computer vision. In this paper, we used several preprocessing methods, including cropping, normalization, and augmentation. The data processing techniques are described as follows: Cropping. All brain MR images are randomly cropped to 128 × 128 × 128 to fit in memory. This keeps most of the image content within the crop region while reducing both the image size and the computational complexity.
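The random-cropping step can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function name `random_crop_3d` is hypothetical:

```python
import numpy as np

def random_crop_3d(volume, crop_size=(128, 128, 128)):
    """Randomly crop a 3D MR volume to a fixed patch size."""
    # Pick a random corner so the crop stays fully inside the volume.
    starts = [np.random.randint(0, dim - c + 1)
              for dim, c in zip(volume.shape, crop_size)]
    slices = tuple(slice(s, s + c) for s, c in zip(starts, crop_size))
    return volume[slices]

# A BraTS-sized scan (240 x 240 x 155) cropped to 128^3:
scan = np.zeros((240, 240, 155))
patch = random_crop_3d(scan, (128, 128, 128))
print(patch.shape)  # (128, 128, 128)
```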
Data normalization. In medical imaging, MRI scans are usually obtained from different scanners with different acquisition protocols. Therefore, MRI intensity normalization is crucial for compensating for heterogeneity between images. All input images are normalized to zero mean and unit variance. We also clip voxel intensities to the 5–95% intensity range of each MR image.
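A sketch of this normalization step is shown below. It assumes the 5–95% threshold refers to percentile clipping over brain (nonzero) voxels, which is a common convention but an interpretation on our part; the function name `normalize_mri` is hypothetical:

```python
import numpy as np

def normalize_mri(volume, low=5, high=95):
    """Clip to the 5th-95th intensity percentiles, then z-score normalize."""
    # Percentiles are computed over nonzero (brain) voxels only, since the
    # background of a skull-stripped scan is zero.
    brain_mask = volume > 0
    lo, hi = np.percentile(volume[brain_mask], [low, high])
    clipped = np.clip(volume, lo, hi)
    mean = clipped[brain_mask].mean()
    std = clipped[brain_mask].std()
    return (clipped - mean) / (std + 1e-8)
```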
Data augmentation. In deep learning, data augmentation techniques are commonly used to increase the number of training samples without collecting new data, helping deep neural networks generalize better. First, they generate more data from a limited set, which is useful when collecting and labeling new data is hard and time-consuming, as in medical imaging. Second, they help avoid overfitting: a deep network can memorize large amounts of data, so instead of learning the key concepts of the data, the model memorizes the inputs, which decreases its efficiency. In this paper, we used data augmentation methods including rotation, flipping, and scaling, with ranges of 0–10 degrees, 0–0.5, and 0.9–1.1, respectively.
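One of these augmentations, random flipping, can be sketched as below; this is an illustrative sketch, not the authors' code, and the name `random_flip` is hypothetical. Rotation and scaling of 3D volumes are typically done with library routines such as `scipy.ndimage.rotate` and `scipy.ndimage.zoom`:

```python
import numpy as np

def random_flip(volume, p=0.5):
    """Randomly flip a 3D volume along each spatial axis with probability p."""
    for axis in range(volume.ndim):
        if np.random.rand() < p:
            volume = np.flip(volume, axis=axis)
    # Copy so the result owns its memory (np.flip returns a view).
    return volume.copy()
```

Flipping changes only the orientation, so the shape and the set of voxel values are preserved.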

Implementation Details
In our proposed network, the training input was a 128 × 128 × 128 volume, while the original image size was 240 × 240 × 155. During testing, however, randomly sampling only a portion of the original data would discard information, so we process the full volume instead. We zero-pad the original 240 × 240 × 155 MRI scans to 240 × 240 × 160, because the depth must be divisible by 16. We developed the proposed model on the PyTorch [60] platform with the Python programming language, using an NVIDIA GeForce RTX 2080Ti GPU with 11 GB of memory.
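The depth-padding step can be sketched as follows; the helper name `pad_depth_to_multiple` is hypothetical:

```python
import numpy as np

def pad_depth_to_multiple(volume, multiple=16):
    """Zero-pad the depth (third) axis so it is divisible by `multiple`."""
    depth = volume.shape[2]
    pad = (-depth) % multiple            # 155 -> 5 extra slices
    return np.pad(volume, ((0, 0), (0, 0), (0, pad)))

# A 240 x 240 x 155 scan becomes 240 x 240 x 160:
scan = np.zeros((240, 240, 155))
print(pad_depth_to_multiple(scan).shape)  # (240, 240, 160)
```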

Evaluation Metrics
The model segmentation results were evaluated by the dice similarity coefficient (DSC), one of the most commonly used evaluation metrics for segmentation models. The DSC takes a value between 0 and 1 and indicates the degree of overlap between two objects: a value of 0 means there is no overlap, and a value of 1 means the two objects overlap completely. The DSC is calculated as

DSC = 2|P ∩ G| / (|P| + |G|) = 2TP / (2TP + FP + FN),

in which P and G are the predicted segmentation and the ground truth, respectively, TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives.
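The DSC formula translates directly into code; the following is a minimal sketch for binary masks, with the function name `dice_coefficient` chosen for illustration:

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-8):
    """DSC = 2*TP / (2*TP + FP + FN) for binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()   # predicted and present
    fp = np.logical_and(pred, ~truth).sum()  # predicted but absent
    fn = np.logical_and(~pred, truth).sum()  # missed by the prediction
    return 2 * tp / (2 * tp + fp + fn + eps)

# Two half-overlapping masks give DSC = 0.5:
a = np.array([1, 1, 0, 0])
b = np.array([0, 1, 1, 0])
print(dice_coefficient(a, b))  # ~0.5
```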


Small Tumor Segmentation Performance
We evaluated the effectiveness of the proposed model against other models on small-sized tumors. We randomly split the training data into 80% for training and 20% for evaluation, trained all models on the training portion, and took the best result of each model on the evaluation portion. As shown in Figure 4, the enhancing tumor is the tumor type with the smallest size in the brain. Therefore, we demonstrated our effectiveness on small-sized tumors by comparing the segmentation results of our model with those of other models on small enhancing tumors. Small-sized tumors were selected by tumor volume in voxels, with thresholds of 2000 voxels and 1000 voxels, respectively. As shown in Tables 1 and 2, our model performed better on enhancing tumors smaller than 2000 voxels. Other models such as DMFNet [43], ReversibleUnet [48], and No New-Net [49] often overlook tumors of very small size, whereas our method can detect and segment them. Figure 5 shows the input images, ground truth, and segmentation results of the enhancing tumor predicted by four models: ours, ReversibleUnet [48], No New-Net [49], and DMFNet [43]. As shown in the first row of the figure, ReversibleUnet and No New-Net produced many false negatives on the enhancing tumor (yellow), and DMFNet failed to segment the enhancing tumor area, while our model segmented it accurately. In the second and third rows of Figure 5, our model segmented the small enhancing tumor areas that the other models failed to segment. Figure 6 shows the dice similarity coefficient (DSC) stratified by tumor size.
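Selecting the small-tumor subsets by voxel count can be sketched as follows; this is an illustrative sketch of the thresholding described above, and the helper name `stratify_by_size` is hypothetical:

```python
import numpy as np

def stratify_by_size(ground_truths, thresholds=(1000, 2000)):
    """Group case indices by tumor volume (number of foreground voxels)."""
    sizes = [int(gt.sum()) for gt in ground_truths]  # voxels per case
    groups = {t: [i for i, s in enumerate(sizes) if s < t]
              for t in thresholds}
    return sizes, groups

# Three toy binary masks with 500, 1500, and 5000 tumor voxels:
cases = [np.ones(500), np.ones(1500), np.ones(5000)]
sizes, groups = stratify_by_size(cases)
print(groups[1000], groups[2000])  # [0] [0, 1]
```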

Overall Segmentation Performance
We performed the training stage on the BraTS2018 training set and then evaluated the segmentation results on the validation set. The segmentation performance of our proposed model was 81.82%, 89.75%, and 84.05% for the enhancing tumor, tumor core, and whole tumor, respectively. Table 3 compares our performance with other segmentation methods such as 3D U-Net [42], 3D-ESPNet [50], Kao et al. [51], DMFNet [43], and Reversible U-Net [48], as well as with competition winners such as No New-Net [49] and NVDLMED [54] (single model). As shown in Table 3, our method achieved the best performance on enhancing tumor segmentation compared to the other models. Enhancing tumors are directly related to the early development of brain cancer and are the most important target in the early diagnosis of cancer. However, in most cases, the number of small tumors in the dataset is relatively small, so the performance improvement on small tumors may not contribute much to the overall performance. For the whole tumor and the tumor core, our model achieved comparable results. Note that No New-Net ranked second in the BraTS2018 challenge and its performance on the whole tumor is better than the others'; however, it is worth noting that they trained their models with external datasets from other institutes. The NVDLMED [54] method uses a very complex model, which takes up almost 32 GB of memory.

Conclusions
In this paper, we built a 3D model for multi-task brain tumor segmentation in MR images. In our model, we leveraged multi-task learning with a U-module to improve the segmentation of small tumors. The U-module helps retain features of small-sized tumors, which are easily overlooked because feature resolution decreases after each encoder layer in U-Net-based models. We demonstrated the efficiency of our proposed brain tumor segmentation algorithm on the BraTS2018 dataset without using any external dataset during training. Our model outperformed other models on small-sized tumor segmentation, and it achieved results comparable with other state-of-the-art models on overall performance.