3D U-Net for Skull Stripping in Brain MRI

Abstract: Skull stripping in brain magnetic resonance imaging (MRI) is an essential step in analyzing images of the brain. Although manual segmentation has the highest accuracy, it is a time-consuming task. Therefore, various automatic brain segmentation algorithms for MRI have been devised and proposed previously. However, there is still no method that solves the entire brain extraction problem satisfactorily for diverse datasets in a generic and robust way. To address these shortcomings of existing methods, we propose the use of a 3D-UNet for skull stripping in brain MRI. The 3D-UNet was recently proposed and has been widely used for volumetric segmentation in medical images due to its outstanding performance. It is an extended version of the previously proposed 2D-UNet, which is based on a deep learning network, specifically, the convolutional neural network (CNN). We evaluated 3D-UNet skull stripping using a publicly available brain MRI dataset and compared the results with three existing methods (BSE, ROBEX, and Kleesiek's method; BSE and ROBEX are conventional methods, and Kleesiek's method is based on deep learning). The 3D-UNet outperforms the two conventional methods and shows results comparable with the specialized deep learning-based algorithm, exhibiting a mean Dice coefficient of 0.9903, a sensitivity of 0.9853, and a specificity of 0.9953.


Introduction
Brain extraction from a volumetric dataset, generally T1-weighted (T1W) magnetic resonance images (MRIs), is called skull stripping [1]. It is an important preprocessing step and typically the initial step of most brain MRI studies, e.g., cortical surface reconstruction [2], brain volumetric measurement [3], tissue identification [4], brain parts identification [5], multiple sclerosis analysis [6], assessment of schizophrenia [7], and Alzheimer's disease [8]. Brain MRIs exhibit superior soft-tissue contrast that is not normally found in other imaging protocols, such as computed tomography (CT) or X-rays. Automatic skull stripping is a difficult task because of obscure brain boundaries, low-contrast MRIs, and the absence of intensity standardization [9]. Moreover, whole-brain extraction becomes more challenging when MRI datasets with pathological disorders are used [10].
After the pioneering results of Krizhevsky et al. [11], deep learning, particularly convolutional neural network (CNN)-based algorithms, has become a commonly used approach to solve medical imaging problems and challenges [12]. CNN-based algorithms are trained with labeled data to learn the underlying mathematical description required for object or region detection, classification, and segmentation [13]. Generally, these algorithms require a vast amount of properly labeled data to train from scratch. However, biomedical image data is usually not sufficient for this challenge. The problem often worsens because labeling the data requires a substantial manual effort from a brain anatomy expert to accomplish this tedious task [14]. Moreover, manual delineation involves extensive training to learn the peculiar brain features needed to accurately skull-strip brain MRIs. Therefore, brain extraction remains a rate-limiting and tedious step in the brain MRI analysis pipeline.
Recently, CNNs and deep learning-based techniques have achieved excellent performance in biomedical 2D image segmentation, with accuracy close to human performance [37][38][39]. After the successful application of CNNs to 2D biomedical images, efforts have been made on 3D biomedical volumetric data, specifically brain MRI [10,40]. Auto-Context CNN (Auto-Net) [41], Active Shape Model and CNN (ASM-CNN) [42], and complementary segmentation networks (CompNet) [43] are the most recently published works on brain segmentation. F. Milletari et al. [40] presented a CNN with a Hough voting approach that enables fully automatic localization and segmentation of a region of interest for 3D segmentation of the deep brain region. J. Kleesiek et al. [10] first proposed an end-to-end approach for brain extraction and skull stripping based on a 3D CNN and achieved good performance. Their architecture can handle many forms of input, including contrast-enhanced scans. Although Kleesiek's method was the first CNN for the problem and showed state-of-the-art performance, the network is not deep. The depth of the network is not a problem for this specific task, but it can become a problem when generalizing the network to other tasks. Generally, a shallow network has limited learning capability because shallow layers cannot integrate features from various levels of abstraction. Despite the immense variety of algorithms and techniques published on skull stripping, there is no consensus among the research community regarding the best method, due to obscure brain boundaries, low-contrast brain MRIs, and the absence of intensity standardization.
In this paper, we propose the use of 3D U-Net [44], a deep network architecture designed for semantic segmentation, for skull stripping in brain MRIs. 3D-UNet was recently proposed and is widely used for volumetric segmentation. Because 2D-UNet cannot fully utilize 3D spatial information, its results along the third axis are poor when it is applied to a 3D problem [44]. Therefore, 3D-UNet is more appropriate for skull stripping than 2D-UNet. The architecture of 3D-UNet comprises a contracting path and an expanding path, and enables precise localization by making good use of features at multiple scales; it thus integrates localization and context information. 3D-UNet also uses concatenation to combine high-resolution features with the up-sampled output to recover local characteristics. We describe how to implement the 3D-UNet for skull stripping in brain MRI and how to obtain sufficient data for training. We then applied it to a publicly available brain MRI dataset and compared the brain extraction results with three widely used state-of-the-art skull-stripping methods: two classical approaches (BSE and ROBEX) and one CNN-based method (Kleesiek's method).

Datasets
The performance of 3D-UNet and the other methods (BSE and ROBEX) was evaluated on a publicly available brain MRI dataset, the "Neurofeedback Skull-stripped (NFBS) repository" [45]. It contains 125 scans of 77 females and 48 males in the 21-45 age range (average: 31 years) with a variety of clinical and subclinical psychiatric symptoms. Each subject's data comprises T1-weighted images along with a manually corrected gold-standard brain mask. Each scan is 256 × 256 × 192 voxels with a voxel size of 1 × 1 × 1 mm³; the first two dimensions give the size of a 2D slice, and the third dimension is the number of slices in the scan. Initial skull-stripped brain masks were obtained from the T1W MRIs using the semi-automated skull-stripping software BEaST [34]. The results were checked by visual inspection, and the worst results were corrected manually using the FreeView visualization tool [46]. These corrected brain masks were incorporated into the BEaST library, and the process was iterated until satisfactory brain masks were available for all scans. Experts and the research community have differing opinions on what to include in or exclude from the brain [9]. The brain ground-truth standard used for this dataset follows the paper of Eskildsen et al. [34]: brain tissue consists of the cerebrum, cerebellum, brainstem, internal vessels, and arteries, along with the cerebrospinal fluid (CSF) in the ventricles, internal cisterns, and deep sulci; skin, skull, eyes, dura mater, external blood vessels, and nerves are non-brain tissue.

Volume Sampling
Volume sampling is an essential process for training the CNN, because more training data, in the form of subvolumes, is required than the number of images available. This is a significant problem in biomedical imaging, where limited image data is available. To avoid this problem, we followed the subvolume selection procedure described in [47] for the training data. Each scan in the NFBS dataset is 256 × 256 × 192 voxels, and subvolumes of 64 × 64 × 64 voxels were chosen. Subvolume centers were drawn from a Gaussian distribution with its mean at the center of the volume (the center of the brain) and a diagonal covariance with σᵢ = 40, i ∈ {1, 2, 3}. After several iterations and experiments with different numbers of subvolumes in training, we found that 1000 subvolumes are sufficient for precise and accurate segmentation of the brain.
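The sampling procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation; the function name and the clipping of centers so subvolumes stay inside the scan are our own choices:

```python
import numpy as np

def sample_subvolumes(volume, n=1000, sub=64, sigma=40.0, seed=0):
    """Draw training subvolumes whose centers follow a Gaussian centered
    on the volume center, with diagonal covariance sigma_i = 40.
    Centers are clipped so each subvolume fits inside the scan."""
    rng = np.random.default_rng(seed)
    shape = np.asarray(volume.shape)            # e.g. (256, 256, 192)
    center = shape / 2.0                        # mean of the Gaussian
    half = sub // 2
    subvolumes = []
    for _ in range(n):
        c = rng.normal(loc=center, scale=sigma)           # Gaussian center
        c = np.clip(np.round(c).astype(int), half, shape - half)
        x, y, z = c
        subvolumes.append(volume[x - half:x + half,
                                 y - half:y + half,
                                 z - half:z + half])
    return subvolumes
```

In practice the subvolumes would be paired with the corresponding crops of the ground-truth masks before training.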

Network Architecture
The complete network architecture is illustrated in Figure 1. The network has two main parts. The first part is the contracting encoder (left side in Figure 1), which extracts global features. It consists of convolution blocks and max pooling. A convolution block consists of a convolution, batch normalization [48], and a rectified linear unit (ReLU), repeated twice. Batch normalization (BN) is used for faster convergence. BN normalizes the activation values, i.e., the output values of the convolution. With BN, training is not influenced by the parameter scale during weight propagation; thus, the learning rate, which controls how much we adjust the weights, can be increased, enabling rapid learning. In our training, the decay value of BN was set to 0.9. The convolution filter size was 3 × 3 × 3 voxels. Max pooling, employed for downsampling, has a size of 3 × 3 × 3 voxels with a stride of 2 in each dimension. Although max pooling reduces the computational volume and adds robustness to noise, it has the disadvantage of losing important information. After many experiments, we found that three max pooling operations (in each dimension) are the most appropriate.
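As a concrete illustration of the batch-normalization step inside each convolution block, here is a minimal NumPy sketch of training-mode BN (the running-average update with decay 0.9, and the 3D convolutions themselves, are omitted; the function name and parameters are illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization over a mini-batch of activations:
    normalize to zero mean / unit variance per feature, then apply the
    learnable scale (gamma) and shift (beta). eps avoids division by zero."""
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta
```

Because the normalized output no longer depends on the scale of the incoming weights, a larger learning rate can be used safely, which is the "rapid learning" effect described above.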
The number of feature channels is doubled after each downsampling. A dropout layer follows the fourth convolutional block. Dropout is a regularization technique to avoid overfitting [49]. Overfitting refers to the case where the model fits the training dataset very well but does not perform well on the test dataset, because the model is excessively optimized for the training dataset. The dropout technique intentionally drops some units of the network during training to reduce overfitting. Dropout improves performance by virtually creating many models and averaging their predictions: learning with a single model may inevitably lead to overfitting, but if multiple models are trained and each makes predictions, the risk of overfitting is reduced. Dropout is good for training but generally increases training time. Therefore, we applied dropout once, at the fourth layer, because the most concentrated features appear after the fourth layer.
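The dropout mechanism described above can be sketched as inverted dropout in NumPy (a common formulation in which survivors are rescaled during training so inference needs no change; the exact variant used in the paper's implementation is not specified):

```python
import numpy as np

def dropout(x, rate=0.5, training=True, seed=0):
    """Inverted dropout: during training, zero a fraction `rate` of units
    at random and rescale the survivors by 1/(1-rate) so the expected
    activation is unchanged. At test time the input passes through as-is."""
    if not training or rate == 0.0:
        return x
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= rate        # keep ~(1-rate) of the units
    return x * mask / (1.0 - rate)
```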
The second part is the decoder (right side in Figure 1), called the expanding path, in which each step consists of an up-convolution, concatenation with the corresponding cropped feature map from the encoder part, and two convolution blocks. In CNNs, a convolution layer reduces the size of the feature map; an up-convolution, in contrast, increases it. It works by zero-padding around each pixel and then convolving the padded image. The final convolution filter size is 1 × 1 × 1; it is designed to produce the desired number of classes from 64 features.
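The "zero-padding around each pixel" view of up-convolution can be illustrated by the interleaving step alone; the subsequent learned convolution that fills in the gaps is omitted. A NumPy sketch (function name is ours), shown for a 2D array but written to work in any dimension:

```python
import numpy as np

def zero_interleave(x, stride=2):
    """Insert (stride-1) zeros between neighboring elements along each
    axis. Applying a regular convolution to the result then completes the
    up-convolution, enlarging the feature map (roughly doubling it for
    stride=2)."""
    out_shape = tuple((s - 1) * stride + 1 for s in x.shape)
    out = np.zeros(out_shape, dtype=x.dtype)
    # place the original values on a stride-spaced grid, zeros elsewhere
    out[tuple(slice(None, None, stride) for _ in x.shape)] = x
    return out
```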

Training
The Adam optimizer of the TensorFlow framework was employed for network training [50]. A variant of the popular nonparametric non-uniform intensity normalization (N3) algorithm [51] was used for bias-field correction; the algorithm works quite well when a brain mask is available, so we applied it to the input samples (see Table 1 for details). The network output was generated using the softmax function and the cross-entropy loss, both standard in machine learning: the output of the softmax function is compared with the ground-truth labels using a weighted cross-entropy loss. First, we denote the k pairs of training images (X) and labels (Y) as {X_j, Y_j}, j = 1, 2, ..., k, and flatten them into 1D vectors, where i and j index the voxel and the image, respectively. For each image, the posterior probability of voxel i having label l, computed by the softmax classifier, can be written as

p(y_i = l | X(M_i)) = exp(f_l(M_i)) / Σ_{c=1}^{k} exp(f_c(M_i)),

where f_l(·) is the CNN computation function, M_i is the patch of voxel i, and k is the number of classes. The output of the softmax function is a real number between 0 and 1, and the values sum to 1. The weighted cross-entropy loss function is

L = −Σ_i w(y_i) log(p(y_i | X(M_i))),

since the cross-entropy between the actual distribution q and the estimated distribution p is −Σ_i q(y_i) log(p(y_i | X(M_i))), and the actual distribution q(y_i) is 1 for the ground-truth label and 0 otherwise.
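The softmax posterior and the weighted cross-entropy loss can be sketched in NumPy as follows. This is a minimal reference implementation under our own conventions (logits of shape (N, k), integer labels, a per-class weight vector, and a small constant for numerical stability), not the paper's TensorFlow code:

```python
import numpy as np

def softmax(logits):
    """Softmax over the class axis: outputs lie in (0, 1) and sum to 1
    for every voxel."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted cross-entropy averaged over voxels:
    L = -(1/N) * sum_i w(y_i) * log p(y_i | x_i),
    with integer ground-truth labels y_i and per-class weights w(l)."""
    p = softmax(logits)                            # (N, k) posteriors
    picked = p[np.arange(len(labels)), labels]     # p(y_i | x_i)
    w = np.asarray(class_weights)[labels]          # w(y_i)
    return float(-(w * np.log(picked + 1e-12)).mean())
```

Because q(y_i) is 1 only at the ground-truth label, the sum over classes collapses to the single `picked` probability per voxel, exactly as in the loss above.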
Weights are initialized using truncated normal distribution. The standard deviation of the random values was set to 0.1. Out of 125 scans, 105 scans were randomly selected for training the network and the remaining 20 scans for testing. We ran about 420,000 training iterations on two NVIDIA 1080Ti GPUs (NVIDIA Corp., Santa Clara, CA, USA, 2018), which took approximately two days.
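The truncated-normal initialization can be sketched as follows (a NumPy sketch mirroring the common TensorFlow convention of redrawing values that fall more than two standard deviations from the mean; the function name is ours):

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    """Truncated normal weight initialization: draw from N(0, stddev^2)
    and redraw any value farther than two standard deviations from zero,
    so no initial weight has an extreme magnitude."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, stddev, size=shape)
    bad = np.abs(w) > 2 * stddev
    while bad.any():                      # redraw out-of-range values
        w[bad] = rng.normal(0.0, stddev, size=int(bad.sum()))
        bad = np.abs(w) > 2 * stddev
    return w
```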

Experimental Results and Discussion
The implementations of several popular skull-stripping methods are publicly available. We considered two classical, non-deep-learning methods, BSE [31] and ROBEX [9], and one deep learning-based method, Kleesiek's method [10], for comparison with the presented algorithm. BSE uses anisotropic diffusion filtering, edge detection, and a series of morphological operations to recognize the brain; it is extremely fast and generates highly explicit whole-brain segmentations. BSE from the BrainSuite16a1 package [52] was used with the default parameters (diffusion iterations = 3, diffusion constant = 25, edge constant = 0.64, erosion size = 1). ROBEX is a learning-based brain extraction system in which a discriminative model (random forest) and a generative model (point distribution) are combined to achieve skull stripping; the contour of the brain is further refined by graph cuts. It shows good performance without any parameter tuning. The ROBEX 1.2 package available at [53] was used for this comparison. Kleesiek's method uses a deep learning network specifically designed for skull stripping [10]. Note that we also considered BET [32] for comparison, but its results were the worst on this dataset, so we have not included it.

Evaluation Metrics
For quantitative evaluation of the automatically extracted brain against the ground truth, three overlap metrics were computed: the Dice coefficient, sensitivity, and specificity. These metrics evaluate various performance characteristics between the brain extracted by the algorithms and the manually annotated (ground-truth) brain segmentation. Sensitivity and specificity measure the rate of classification (or misclassification) of the brain; the Dice coefficient is a compromise between them, assessing the tradeoff between incorrect and correct voxel segmentations. The metrics are defined using the notions of true positive (TP), false positive (FP), true negative (TN), and false negative (FN): TP denotes voxels classified as brain both in the ground truth and by the algorithm; FP denotes voxels not classified as brain in the ground truth but classified as brain by the algorithm; TN denotes voxels classified as non-brain both in the ground truth and by the algorithm; and FN denotes voxels classified as brain in the ground truth but not by the algorithm. The Dice coefficient measures the overlap between the ground truth and the segmentation result obtained by the algorithm:

Dice = 2TP / (2TP + FP + FN).

Its value varies from 0 (disjoint, complete disagreement) to 1 (complete overlap, full agreement). Sensitivity, also known as the TP rate, measures the proportion of true positives correctly classified as brain:

Sensitivity = TP / (TP + FN).

Specificity, also known as the TN rate, measures the proportion of true negatives correctly classified as non-brain:

Specificity = TN / (TN + FP).

Tables 2 and 3 show the Dice scores for all 20 test scans for the proposed method (3D-UNet), BSE, ROBEX, and Kleesiek's method, respectively. Table 2 also summarizes the overall analysis for each evaluated metric, i.e., Dice, sensitivity, and specificity.
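All three metrics follow directly from the confusion-matrix counts defined above; a NumPy sketch for binary brain masks (the function name is ours):

```python
import numpy as np

def overlap_metrics(pred, truth):
    """Dice coefficient, sensitivity, and specificity for binary masks."""
    pred = np.asarray(pred).astype(bool)
    truth = np.asarray(truth).astype(bool)
    tp = np.sum(pred & truth)         # brain in both
    fp = np.sum(pred & ~truth)        # brain only in the prediction
    tn = np.sum(~pred & ~truth)       # non-brain in both
    fn = np.sum(~pred & truth)        # brain only in the ground truth
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)      # true-positive rate
    specificity = tn / (tn + fp)      # true-negative rate
    return dice, sensitivity, specificity
```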
The plots of these measurements for the individual test scans are illustrated in Figure 2. The performance of 3D-UNet was excellent with respect to Dice, sensitivity, and specificity across all test datasets; the conventional methods (BSE and ROBEX) generally performed well (i.e., > 0.9 on Dice coefficient, sensitivity, and specificity). BSE shows a better Dice score than ROBEX; in particular, the Dice score for scan 5 differs substantially (0.8793), as depicted in Figure 2a. The overall sensitivity values of ROBEX for the test scans are relatively greater than those of BSE, indicating that ROBEX classified the brain tissue more accurately and segmented most of the brain, since sensitivity assesses how much brain tissue is included in the segmentation.

Quantitative Results
On the other hand, the specificity scores of BSE are much better than those of ROBEX, which indicates that BSE segmented the non-brain tissue more correctly and excluded most of the non-brain region. Moreover, BSE and ROBEX failed to correctly segment the brain in the initial slices of the brain MRI volume: the Dice score in these slices (40-45% of the total slices contain brain tissue) is less than 0.90 on average for both competitor algorithms, as displayed in Figure 3. In contrast, 3D-UNet shows consistent performance in all slices, from the start to the end of the brain MRI volume. The proposed algorithm achieved the highest average Dice, sensitivity, and specificity scores, with a significant difference compared with BSE and ROBEX. The sensitivity value of BSE is smaller than that of the other methods, indicating that it excluded more brain tissue. The specificity value of ROBEX is substantially small for one test dataset (scan 5), implying that it included more non-brain tissue than the other methods.
Table 3 shows the comparison results of Kleesiek's method and 3D-UNet. Both deep learning-based methods show excellent performance compared to classical methods.
Kleesiek's method seems to be slightly superior in terms of Dice coefficient and specificity, whereas the sensitivity value of 3D-UNet was slightly better. In conclusion, the results of the two CNN-based methods are comparable, and both outperformed the best classical skull-stripping algorithms.


Figure 3. Dice comparison of 3D-UNet with BSE, ROBEX, and Kleesiek's method on a single test brain MRI scan, slice by slice. BSE and ROBEX failed to correctly segment the brain in the initial slices of the brain MRI volume; the Dice score in slices 80-140 is less than 0.90 on average for both competitor algorithms.
The overall quantitative performance of 3D-UNet, ROBEX, and BSE is shown in Table 2, with the best values in each category emboldened. Figure 2 demonstrates the corresponding plots for each algorithm across all test brain MRI scans.

Qualitative Results
For a qualitative assessment of the proposed method and comparison with the other algorithms, some orthogonal slices from the test data are depicted in Figures 4-7. The first and second columns of each figure display the input and the ground-truth orthogonal slices from different scans, respectively; the last three columns show the brain extraction results of 3D-UNet, BSE, and ROBEX, respectively. One important observation about both ROBEX and BSE is that their segmentations of the inferior (eye-level) and superior slices are poor, showing mediocre performance, as shown in Figures 4 and 5. The failure regions (under- or over-segmentation of the brain) are indicated with rectangles and displayed in zoomed views. 3D-UNet produces more accurate and smoother segmentation of the brain than BSE and ROBEX.

Figure: Skull-stripping result on a brain image at eye level. 3D-UNet produced more accurate and smoother segmentation of the brain than BSE and ROBEX. Segmentation failure regions (under- or over-segmentation of the brain) are indicated with rectangles, along with their zoomed views.
ROBEX shows good performance without any parameter tuning, but one disadvantage is that it smooths the gyri and sulci (the contour of the brain) excessively, leading to the inclusion of non-brain dura and/or the loss of gray matter, as represented in Figure 4. BSE also failed when no significant difference was present between the intensities of the brain and non-brain edges. Furthermore, the skull-stripping results of both algorithms are not consistent: ROBEX has high sensitivity but a poor Dice coefficient (overlap) and specificity, while BSE has greater specificity but low sensitivity and Dice coefficient. On the other hand, the brain extraction results of both CNN-based algorithms are excellent and consistent in either case, i.e., for brain slices with or without the eyes, as shown in Figure 6. BSE is the fastest among the compared algorithms, taking only 3.5 (±0.4) s for whole-brain extraction; 3D-UNet and ROBEX took 228 (±0.5) s and 73.8 (±1.6) s, respectively. 3D-UNet takes more time for skull stripping due to its deep CNN structure, but its good performance makes it worth considering and using. Figure 7 shows one example of a qualitative comparison of the 3D U-Net and Kleesiek's methods; both deep learning-based methods perform outstandingly well, showing little difference from the ground truth.


Conclusions
We proposed the use of 3D-UNet, an end-to-end deep learning-based segmentation algorithm, for skull stripping. The method is fully automatic and has shown successful skull stripping on a real brain MRI dataset. The presented method was compared with two of the most popular conventional methods and one deep learning-based method. 3D-UNet outperformed the classical methods in terms of Dice coefficient, sensitivity, and specificity, and shows results comparable with a deep network designed specifically for the problem. In the future, we will address brain MRIs with pathological disorders. Although the presented deep learning-based technique is somewhat slow compared to existing non-deep-learning algorithms, its excellent performance makes it worth considering and using for skull stripping. Optimizing the network to make it faster also remains future work.