Loss Weightings for Improving Imbalanced Brain Structure Segmentation Using Fully Convolutional Networks

Brain structure segmentation on magnetic resonance (MR) images is important for various clinical applications. It has been automatically performed by using fully convolutional networks. However, it suffers from the class imbalance problem. To address this problem, we investigated how loss weighting strategies work for brain structure segmentation tasks with different class imbalance situations on MR images. In this study, we adopted segmentation tasks of the cerebrum, cerebellum, brainstem, and blood vessels from MR cisternography and angiography images as the target segmentation tasks. We used a U-net architecture with cross-entropy and Dice loss functions as a baseline and evaluated the effect of the following loss weighting strategies: inverse frequency weighting, inverse median frequency weighting, focal weighting, distance map-based weighting, and distance penalty term-based weighting. In the experiments, the Dice loss function with focal weighting showed the best performance, with a high average Dice score of 92.8% in the binary-class segmentation tasks, while the cross-entropy loss function with distance map-based weighting achieved a Dice score of up to 93.1% in the multi-class segmentation tasks. The results suggest that the distance map-based and focal weightings can boost the performance of the cross-entropy and Dice loss functions, respectively, in class-imbalanced segmentation tasks.


Introduction
Brain structure segmentation on magnetic resonance (MR) images is an essential technique for measuring, visualizing, and evaluating brain morphology. It is used for diagnosis support of psychiatric and neurodegenerative diseases, brain development analysis, and surgical planning and navigation [1,2]. It is performed manually in practice, but manual segmentation is a very laborious task and is subject to intra- and inter-operator variability [1]. Thus, it is desirable to provide accurate automatic segmentation of brain structures. The most successful state-of-the-art approach for automated segmentation is the fully convolutional network (FCN) [3]. It enables pixel-wise segmentation in an end-to-end manner. Since it was proposed by Long et al. [3] in 2015, it has been improved for medical image segmentation [4,5] and applied to brain structure segmentation tasks [6]. However, it is often biased towards the majority (large-size) classes and suffers from low segmentation performance on the minority (small-size) classes due to a high imbalance between background and foreground classes in medical images. To address this problem, which is commonly known as class imbalance, there are two types of approaches: data-level approaches and algorithm-level approaches [7,8].
Data-level approaches mainly alleviate the class imbalance by undersampling the majority classes [9] and oversampling the minority classes [10]. However, majority undersampling limits the information of available data for training, and minority oversampling can lead to overfitting. On the other hand, algorithm-level approaches address the class imbalance by improving algorithms for training. The most common approach is improving loss functions. The improvement of loss functions can be carried out by using new evaluation metrics for the loss function or by weighting loss functions to enhance the importance of minority classes in the training process. Thus far, various types of loss functions [11][12][13][14][15][16][17] and loss weighting strategies [4,[18][19][20][21][22][23][24][25] have been proposed to alleviate the class imbalance problem. They can be applied to any medical image segmentation task in a plug-and-play fashion [26]. However, it is unclear which loss function and weighting strategy should be used in which situation. Thus, it is important to identify weighted loss functions that can enhance the capability of FCNs in brain structure segmentation tasks.
In related works, Ma et al. [26] performed a systematic study of the utility of 20 loss functions on typical segmentation tasks using public datasets and evaluated the performance of these loss functions in imbalanced segmentation tasks. Moreover, Ma et al. [27] compared and evaluated the boundary-based loss functions, which minimize the distance between boundaries of ground-truth and predicted segmentation labels, in an empirical study. Yeung et al. [28] focused on compound loss functions, combining Dice and cross-entropy-based losses with a modulating factor of the focal loss function [19], and evaluated which compound loss functions were effective in handling class imbalance. As shown in these related works, the effect of loss functions varies according to the situation of segmentation tasks (e.g., the medical images used for segmentation, the number and size of segmentation target objects, and the degree of class imbalance). However, how the loss functions work for different segmentation targets remains undiscussed, although their accuracies were evaluated in the related works.
We test the effect of weighted loss functions in different situations of imbalanced brain structure segmentation, including binary- and multi-class segmentation tasks. In particular, we focus on weighting strategies of loss functions defined based on class frequency, predictive probability, and distance maps, and aim to investigate and discuss how the loss weightings affect the performance of FCNs in brain structure segmentation tasks with different class imbalances.

Segmentation Target
In this study, we adopted a segmentation task of brain structures, including the cerebrum, cerebellum, brainstem, and blood vessels, on MR images. As for MR images, we used MR cisternography (MRC) and MR angiography (MRA) images (Figure 1). MRC images, i.e., heavily T2-weighted images, can clearly represent the brain surface and cerebral sulci due to the high intensity of cerebrospinal fluid, whereas MRA images can highlight blood vessels. In our group, we used MRC and MRA as clinical routine MR sequences because of the ease of segmentation processing, and segmented brain parenchyma on MRC images and blood vessels on MRA images for the planning and navigation of neurosurgeries. The brain structures have different features in the MR images. The cerebrum is the largest part of the brain and has a low-level foreground-background imbalance in the MRC images. Its surface, i.e., the cerebral sulci, has a relatively complex shape. The cerebellum is the second largest part of the brain and is located under the cerebrum. It can be considered a middle-level imbalanced target. The brainstem is a small part of the brain and is located between the cerebrum and the spinal cord. It has a high foreground-background imbalance. The brain parenchyma, i.e., the cerebrum, cerebellum, and brainstem, appears in much the same location in every MRC image volume, although its size and shape have individual differences. Its surface can be clearly visualized in MRC images due to the high signal intensity of the cerebrospinal fluid around it. On the other hand, blood vessels have varying locations and shapes and appear as small white spots in MRA images. Thus, they are considered a hard-to-segment target with a high foreground-background imbalance, although they are clearly visualized in MRA images. We used these segmentation targets to fundamentally evaluate the effect of loss weightings on the FCN-based segmentation of different brain structures.

Network Architecture
As an FCN architecture, we adopted a 2D U-net [4], which is one of the most popular FCN architectures for medical image segmentation. Figure 2 shows the network architecture used in this study. The U-net architecture, which consists of a symmetrical encoder-decoder architecture with skip connections, has often been adopted as a baseline FCN architecture for various medical image segmentation tasks. Many different variants of the U-net architecture have been proposed according to different medical image segmentation tasks, and moreover, a 3D U-net architecture [5] has been introduced for volumetric medical image segmentation. However, training the 3D U-net on full input MR image volumes is usually impractical due to memory limitations of the graphical processing unit (GPU). In the case of the MR image volumes used in this study, it would require more than 150 GB of GPU memory, which far exceeds the memory of prevalent GPUs. To overcome the memory limitation, approaches to train 3D FCNs on resized or cropped MR image volumes have been proposed. However, resizing MR image volumes to a smaller size may cause the loss of information on segmentation targets, whereas a patch-based approach [5,29] that crops MR image volumes requires the tuning of more hyperparameters (i.e., patch size), which may affect segmentation performance. Thus, in this study, we decided to use the simple 2D U-net architecture to reduce other factors affecting the results as much as possible.


Loss Functions
As shown in the related works [26][27][28], loss functions are an important factor for handling the class imbalance. Existing loss functions for FCN-based segmentation can be divided into four categories: distribution-based loss, region-based loss, boundary-based loss, and compound loss [26]. Distribution-based loss functions measure the dissimilarity between two distributions based on cross-entropy. Region-based loss functions quantify the mismatch or the overlap between two regions. Dice loss function [11,12] is the most common loss function in this category. Boundary-based loss functions measure the distance between two boundaries. Euclidean distance [16] or Hausdorff distance [17] metrics can be used for loss functions in this category. Compound loss functions are defined as the combinations among the distribution-, region-, and boundary-based loss functions [15,28,[30][31][32].

As described in [26], most of the distribution-based and region-based loss functions can be considered as variants of the cross-entropy and Dice loss functions, respectively. Moreover, boundary-based loss functions, which are formally defined in a region-based way, have similarities to the Dice loss function. Therefore, as most of the loss functions are based on the cross-entropy and Dice loss functions, we decided to use these two loss functions in this study. The cross-entropy loss L_CE and the Dice loss L_Dice are defined as

$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_{i,c} \log p_{i,c},$$

$$L_{Dice} = 1 - \frac{2 \sum_{c=1}^{C} \sum_{i=1}^{N} g_{i,c}\, p_{i,c}}{\sum_{c=1}^{C} \sum_{i=1}^{N} \left( g_{i,c} + p_{i,c} \right)},$$

where g_{i,c} and p_{i,c} are the ground-truth label and the predicted segmentation probability of class c at pixel i, respectively. N and C are the numbers of pixels and classes in images for a training dataset, respectively.
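As a concrete reference, the two base losses can be sketched in NumPy. This is a simplified illustration operating on flattened (N, C) one-hot labels and softmax probabilities, not the authors' implementation; the function names are ours.

```python
import numpy as np

def cross_entropy_loss(g, p, eps=1e-7):
    # g: (N, C) one-hot ground-truth labels; p: (N, C) predicted probabilities.
    # Mean over pixels of the per-pixel cross-entropy.
    return float(-np.mean(np.sum(g * np.log(p + eps), axis=1)))

def dice_loss(g, p, eps=1e-7):
    # Soft Dice loss: one minus the mean per-class Dice overlap.
    intersection = np.sum(g * p, axis=0)            # per-class overlap
    denom = np.sum(g, axis=0) + np.sum(p, axis=0)   # per-class totals
    return float(1.0 - np.mean((2.0 * intersection + eps) / (denom + eps)))
```

With a perfect prediction both losses approach zero; as the prediction degrades, both grow, but the Dice loss aggregates per class while the cross-entropy averages per pixel, which is why they react differently to class imbalance.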


Loss Weighting Strategies
In highly imbalanced segmentation tasks, FCNs are likely to ignore small-size foreground classes in the training process, which results in the low segmentation accuracy of the foreground classes. This is what is called the class imbalance problem and can be alleviated by weighting the loss of small-size foreground classes. In this study, we adopted five loss weighting strategies defined based on different factors of class frequency, predictive probability, and distance map. Table 1 indicates the overview of weighted loss functions used in this study. The details of loss weightings are described below.
Table 1. Overview of the weighted loss functions used in this study.

Cross-entropy loss function L_CE: class frequency-based weighting (inverse frequency weighting, inverse median frequency weighting); predictive probability-based weighting (focal weighting); distance map-based weighting (distance transform map-based weighting, distance penalty term-based weighting).

Dice loss function L_Dice: class frequency-based weighting (inverse frequency weighting, inverse median frequency weighting); predictive probability-based weighting (focal weighting); distance map-based weighting (distance transform map-based weighting, distance penalty term-based weighting).

Inverse Frequency Weighting
Inverse frequency weighting [24], which is one of the most common weighting strategies, is a method for weighting each class based on the class frequency. The weight is inversely proportional to the number of pixels: the smaller the target objects are, the higher their weight becomes. The inverse frequency weight W_c^Inverse in class c is defined by

$$W_c^{Inverse} = \left( \frac{1}{F_c} \right)^{\alpha},$$

where F_c is the normalized frequency of class c and α is a power parameter. In this study, we used α = 1 for the cross-entropy loss function and α = 2 for the Dice loss function. The Dice loss function weighted by the inverse of the squared frequency is known as the generalized Dice loss function [24].
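The weight computation might look as follows in NumPy, assuming the weight is the inverse normalized class frequency raised to the power α (`inverse_frequency_weights` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes, alpha=1.0):
    # labels: integer label map; returns one weight per class.
    counts = np.bincount(labels.ravel(), minlength=num_classes)
    freq = counts / labels.size                      # normalized frequency F_c
    return (1.0 / np.maximum(freq, 1e-12)) ** alpha  # guard empty classes
```

Setting alpha=2 corresponds to the squared-inverse weighting of the generalized Dice loss [24].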

Inverse Median Frequency Weighting
Inverse median frequency weighting [18] is a frequency-based weighting, as with the inverse frequency weighting. The inverse median frequency weight W_c^Median is computed as

$$W_c^{Median} = \frac{\mathrm{median}(F_1, \ldots, F_C)}{F_c},$$

where F_c is the normalized frequency of class c and median(·) denotes a function returning the median value of the input data.
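A sketch of the same idea, assuming the weight is the median class frequency divided by each class frequency (the helper name is ours):

```python
import numpy as np

def median_frequency_weights(labels, num_classes):
    # Weight each class by median(F) / F_c: the median-frequency class gets
    # weight 1, rarer classes get weights above 1, frequent ones below 1.
    counts = np.bincount(labels.ravel(), minlength=num_classes)
    freq = counts / labels.size
    return np.median(freq) / np.maximum(freq, 1e-12)
```

Compared with plain inverse frequency weighting, this caps the spread of the weights around 1, which can make training less sensitive to extremely rare classes.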

Focal Weighting
Focal weighting [19] is a method for putting more focus on hard-to-classify class pixels based on predictive probability. It gives a higher weight to class pixels with lower prediction confidence and reduces the loss assigned to well-classified pixels during the training process. The focal weighting W_{i,c}^Focal is defined by

$$W_{i,c}^{Focal} = \left( 1 - p_{i,c} \right)^{\gamma},$$

where γ is called a focusing parameter. In this study, we used γ = 2 for the cross-entropy loss function as in [19] and γ = 1 for the Dice loss function as in [25]. Note that for simplification, here, we did not consider the balancing factor α used in [19].
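The pixel-wise focal weight and its combination with the cross-entropy term can be sketched as follows (a simplified illustration; the function names are ours):

```python
import numpy as np

def focal_weights(p_true, gamma=2.0):
    # p_true: per-pixel predicted probability of the ground-truth class.
    # Well-classified pixels (p_true near 1) get weights near 0.
    return (1.0 - p_true) ** gamma

def focal_cross_entropy(p_true, gamma=2.0, eps=1e-7):
    # Focal-weighted cross-entropy per pixel: -(1 - p)^gamma * log(p).
    return -focal_weights(p_true, gamma) * np.log(p_true + eps)
```

Unlike the class frequency-based weights, which are uniform within a class, this weight differs from pixel to pixel according to the network's own confidence.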

Distance Transform Map-Based Weighting
Distance transform map (DTM), which is computed as the Euclidean distance from the boundary of target objects, is used in the distance-based loss functions [16,17]. Figure 3b shows an example of DTM. DTM-based weighting can be performed by multiplying prediction errors by the DTM. This weighting assigns higher weights to the pixels which are more distant from the boundary of ground-truth labels. Here, we defined the DTM-based weight W_{i,c}^DTM as

$$W_{i,c}^{DTM} = DTM_c(x_i), \quad DTM_c(x) = \min_{y \in \partial G_c} \|x - y\|_2,$$

where DTM_c is the distance transform map in class c, and ∂G_c denotes the boundary of the ground-truth label in class c. ‖x − y‖_2 denotes the Euclidean distance between pixels x and y in images.
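A DTM for a binary mask can be computed brute-force as below; this is an illustrative sketch (in practice one would use a library routine such as `scipy.ndimage.distance_transform_edt`), and the boundary definition via 4-neighbours is our own choice:

```python
import numpy as np

def distance_transform_map(mask):
    # Euclidean distance from every pixel to the nearest boundary pixel
    # of a binary ground-truth mask, computed by brute force for clarity.
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # Foreground pixels whose 4-neighbours are all foreground are interior.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = m & ~interior
    by, bx = np.nonzero(boundary)
    ys, xs = np.indices(m.shape)
    # Distance of every pixel to every boundary pixel; keep the minimum.
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2)
    return d.min(axis=-1)
```

The resulting map is zero on the ground-truth boundary and grows with distance from it, which is exactly the property the DTM-based weighting exploits.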

Distance Penalty Term-Based Weighting
Distance penalty term (DPT) is a distance map for weighting hard-to-segment boundary regions [20], in contrast to the DTM. Let DPT_c be the distance penalty term in class c. Then, DPT_c is defined as the inverse of the DTM_c, and thus, it puts higher weights on the pixels closer to the boundary of ground-truth labels, in contrast with the DTM-based weighting. Figure 3c shows an example of DPT. As with the DTM-based weighting, DPT-based weighting penalizes prediction errors with the DPT. The DPT-based weight W_c^DPT is defined by

$$W_c^{DPT} = DPT_c = \frac{1}{DTM_c + \varepsilon},$$

where ε is a small constant to avoid division by zero on the boundary itself. We used the cross-entropy and Dice loss functions weighted by the above five weighting strategies. Table 1 summarizes the weighted loss functions used in this study. As for the weighted Dice loss functions, L^Inverse_Dice, L^Median_Dice, and L^Focal_Dice put their weights on both the numerator and denominator terms as in [24], while L^DTM_Dice and L^DPT_Dice assign their weights to the false positive and false negative terms in the denominator.
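One plausible realization of the inverse-distance penalty is the following sketch; the ε parameter (and its default of 1) is our own choice to keep the weight finite on the boundary, not necessarily the authors' exact formulation:

```python
import numpy as np

def distance_penalty_weights(dtm, eps=1.0):
    # Inverse of the distance transform map: the weight is largest (1/eps)
    # on the ground-truth boundary itself and decays with distance from it.
    return 1.0 / (dtm + eps)
```

Applied to the DTM of a ground-truth mask, this concentrates the loss on the boundary region, which is where the segmentation errors of the brain parenchyma were observed to concentrate.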

Dataset
We used the MR images of 84 patients with unruptured cerebral aneurysms, which were imaged with MRC and time-of-flight MRA sequences on a 3.0 T scanner (Signa HDxt 3.0 T, GE Healthcare, WI, USA) at the University of Tokyo Hospital, Tokyo, Japan. The MR image volumes had 144-190 slices of 512 × 512 pixels with an in-plane resolution of 0.47 × 0.47 mm² and a slice thickness of 1.00 mm. As a preprocessing step, the MR images were normalized to have a mean of 0 and a standard deviation of 1. The dataset consisting of 84 cases was divided into the following three subsets: training (60 cases), validation (4 cases), and test (20 cases) subsets.
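The per-volume intensity normalization described above amounts to a z-score transform, sketched below (a straightforward illustration, not the authors' preprocessing code):

```python
import numpy as np

def zscore_normalize(volume):
    # Normalize an MR volume to zero mean and unit standard deviation.
    v = volume.astype(np.float64)
    return (v - v.mean()) / v.std()
```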
The ground-truth-labeled images for training and testing were manually created by using an open-source software for medical image processing (3D Slicer, Brigham and Women's Hospital, MA, USA); the cerebrum, cerebellum, and brainstem were annotated on MRC images, while blood vessels were annotated on MRA images. The manual annotation was performed by a biomedical engineer and a neurosurgeon. Table 2 indicates the frequency $F_c = \frac{1}{N} \sum_{i=1}^{N} g_{i,c}$ of the foreground classes (the cerebrum, cerebellum, brainstem, and blood vessels) in the training subsets. The cerebrum was the most frequent of the foreground classes, followed by the cerebellum, brainstem, and blood vessels. The goal of this work was to study the effect of loss weightings in different class imbalance situations. Thus, we evaluated the effect of loss weightings on both binary- and multi-class segmentation tasks. Table 3 indicates the overview of the training datasets in the binary- and multi-class segmentation tasks.

Table 3. Training datasets in binary- and multi-class segmentation tasks. BG, CR, CL, BS, and BV stand for background, cerebrum, cerebellum, brainstem, and blood vessels, respectively.

Binary-class segmentation tasks: To test how the effect of loss weightings varies according to the size of a foreground class in binary-class segmentation tasks, we evaluated the segmentation performance on the binary-class segmentation task for each of the foreground classes. Note that the binary-class segmentation tasks for the cerebrum, cerebellum, and brainstem were performed using MRC images, whereas the binary-class segmentation for blood vessels was performed using MRA images.
Multi-class segmentation tasks: To test how the effect of loss weightings varies according to the imbalance of foreground classes in multi-class segmentation tasks, we evaluated the segmentation performance on the three-, four-, and five-class segmentation tasks; the three, four, and five classes include the foreground classes of (cerebrum, blood vessels), (cerebrum, cerebellum, blood vessels), and (cerebrum, cerebellum, brainstem, blood vessels), respectively. Note that the multi-class segmentation tasks were performed using multi-modal MR images which included MRC and MRA images.

Network Training Procedure
In the binary- and multi-class segmentation tasks, we trained the FCN model on each training dataset using the cross-entropy and Dice loss functions with or without the loss weightings. The FCN model was trained from scratch for 30 epochs with the Adam optimization algorithm [33] (α (learning rate) ∈ {1e−3, 1e−4, 1e−5}, β₁ = 0.9, β₂ = 0.999, and epsilon = 1e−7) and a batch size of 5 in each training process. For testing, we used the best trained model over the combinations of learning rate and number of epochs, because the conditions for good training convergence, especially the learning rate and number of epochs, differed according to the loss weightings.
The FCN model with the weighted loss functions was implemented by using Keras with a TensorFlow backend, and the training and prediction were performed on an Ubuntu 16.04 PC (CPU: Intel Xeon Gold 5222 3.80 GHz, RAM: 384 GB) with NVIDIA Quadro RTX8000 GPU cards for deep learning.

Evaluation Metrics
To quantitatively evaluate the segmentation performance, we adopted the Dice similarity coefficient (DSC), surface DSC (SDSC) [34], average symmetric surface distance (ASD), and Hausdorff distance (HD). The DSC and SDSC, overlap-based metrics, can be used for evaluating the region overlaps; the DSC measures the overlap of whole regions between ground-truth and predicted labels, whereas the SDSC measures the overlap of the two surface regions. The DSC was calculated by

$$DSC = \frac{2 |G \cap P|}{|G| + |P|},$$

where G and P denote the regions of ground-truth and predicted labels, respectively. The SDSC was calculated by

$$SDSC = \frac{|\partial G \cap B_{\tau}(\partial P)| + |\partial P \cap B_{\tau}(\partial G)|}{|\partial G| + |\partial P|},$$

where ∂G and ∂P denote the boundaries of ground-truth and predicted labels, respectively, and B_τ(∂G) and B_τ(∂P) denote the border regions within a tolerance τ of the ground-truth and predicted boundaries [26,34]. We here used τ = 1 mm as in [26].
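For reference, the DSC between two binary masks can be sketched as follows (a hypothetical helper for illustration; the paper's evaluation used the open-source code of [35]):

```python
import numpy as np

def dice_coefficient(g, p):
    # Volumetric Dice between two binary masks.
    g = g.astype(bool)
    p = p.astype(bool)
    inter = np.logical_and(g, p).sum()
    return 2.0 * inter / (g.sum() + p.sum())
```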
The ASD and HD, boundary distance-based metrics, can be used for evaluating the surface errors; the ASD measures the average surface distance between ground-truth and predicted labels, whereas the HD measures the maximum surface distance between them. The ASD was calculated by

$$ASD = \frac{1}{|\partial G| + |\partial P|} \left( \sum_{g \in \partial G} D(g, \partial P) + \sum_{p \in \partial P} D(p, \partial G) \right),$$

where D(a, A) denotes the minimum Euclidean distance from a voxel a to a set of voxels A.
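A brute-force sketch of the symmetric average surface distance over two boundary point sets (illustrative only; the naming is ours):

```python
import numpy as np

def average_surface_distance(gb, pb):
    # Symmetric average surface distance between two boundary point sets,
    # given as (K, 2) arrays of pixel coordinates.
    d = np.sqrt(((gb[:, None, :] - pb[None, :, :]) ** 2).sum(axis=-1))
    # For each point, take the distance to the nearest point of the other set.
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(gb) + len(pb))
```

The 95th-percentile HD used in the paper replaces the mean of these minimum distances with their 95th percentile, which makes the metric robust to a few outlier boundary points.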
As for the HD, in this study, the 95th-percentile HD (95HD) was used, as in [27]. When the segmentation accuracy increases, the overlap-based metrics approach 1 and the boundary distance-based metrics approach 0. The evaluation metrics were implemented using open-source code, which is available at [35].
Furthermore, we used a rank score, which was defined based on [36], to comprehensively evaluate which loss weightings worked well on the above metrics, as in [26]. The rank score was computed according to the following steps:

Step 1. Performance assessment per case: compute the metrics m_i(loss_j, class_k, case_l) (i = 1, …, N_m) of all loss functions loss_j (j = 1, …, 12) for all classes class_k (k = 1, …, N_c) in all test cases case_l (l = 1, …, 20), where N_m and N_c are the numbers of metrics and classes, respectively. Note that in this case, we used four metrics m_i ∈ {DSC, SDSC, ASD, 95HD} and a total of twelve loss functions: the cross-entropy and Dice loss functions with no weighting and with the Inverse, Median, Focal, DTM, and DPT weightings.
Step 2. Statistical tests: perform Wilcoxon signed-rank pairwise statistical tests between all loss functions on the paired differences m_i(loss_j, class_k, case_l) − m_i(loss_j′, class_k, case_l).
Step 3. Significance scoring: compute a significance score s_ik(loss_j) for each loss function loss_j, class class_k, and metric m_i. s_ik(loss_j) equals the number of loss functions performing significantly worse than loss_j according to the statistical tests (p < 0.05, not adjusted for multiplicity).
Step 4. Rank score computing: compute the final rank score R(loss_j) of each loss function as the mean significance score over all classes and metrics in each of the binary- and multi-class segmentation tasks by the following equation:

$$R(loss_j) = \frac{1}{N_m N_c} \sum_{i=1}^{N_m} \sum_{k=1}^{N_c} s_{ik}(loss_j).$$
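Steps 2 and 3 can be sketched with SciPy's Wilcoxon signed-rank test for a single metric and class; this is a simplified, one-sided version of the procedure, and the function and variable names are ours:

```python
import numpy as np
from itertools import permutations
from scipy.stats import wilcoxon

def significance_scores(metric_values, alpha=0.05):
    # metric_values: dict mapping loss name -> array of per-case metric
    # values (higher = better). A loss scores one point for every other
    # loss it outperforms with one-sided significance p < alpha.
    scores = {name: 0 for name in metric_values}
    for a, b in permutations(metric_values, 2):
        _, p = wilcoxon(metric_values[a], metric_values[b],
                        alternative="greater")
        if p < alpha:
            scores[a] += 1
    return scores
```

Averaging such scores over all metrics and classes then yields the final rank score R(loss_j).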

Results
We compared the results of the loss weightings (inverse frequency weighting (Inverse), inverse median frequency weighting (Median), focal weighting (Focal), distance transform map-based weighting (DTM), and distance penalty term-based weighting (DPT)) with those of no weighting (N/A). The statistical difference between N/A and each loss weighting was evaluated by the Wilcoxon signed-rank test. A p-value less than 0.05 was considered significant. Subsequently, we comprehensively evaluated the effect of loss weightings by using the rank scores.

Binary-Class Segmentation Tasks

Table 4 summarizes all the results in the binary-class segmentation tasks. Figure 4 shows the violin plots of the Dice scores. As for the cross-entropy loss function, Inverse and Median provided worse results than N/A in all segmentation tasks. Focal, DTM, and DPT tended to improve the surface accuracy in the highly imbalanced segmentation tasks (i.e., segmentation of the brainstem and blood vessels), although the improvement was not statistically significant. As for the Dice loss function, Inverse and Median significantly improved the segmentation accuracy in the highly imbalanced segmentation tasks, compared with N/A. Focal tended to provide better results than N/A in all the binary-class segmentation tasks. The distance map-based weightings (i.e., DTM and DPT) worked well in the segmentation of brain parenchyma, but they were ineffective in the segmentation of blood vessels. In Table 4, compared with the results of N/A, the significantly worse and better results are shown in black and red, respectively (Wilcoxon signed-rank test, * p < 0.05, ** p < 0.01, and *** p < 0.001, not adjusted for multiplicity).

Figure 5 visualizes an example of the segmentation results of blood vessels, which are the highly imbalanced class, in the binary-class segmentation task. As for the cross-entropy loss function, N/A had difficulty in segmenting the upper blood vessels.
Both Inverse and Median allowed the FCN to extract most of the upper blood vessels which N/A failed to segment, but they obviously increased the overextraction. Focal provided almost the same result as N/A. Both DTM and DPT extracted a wider region of blood vessels than N/A. As for the Dice loss function, N/A had false negatives in the upper blood vessels, as with the cross-entropy loss function. It also provided a few more false positives. The class frequency-based weightings, especially Inverse, improved the false positives as well as the false negatives. Focal provided better results than N/A, although not as much as Inverse. The results of the distance map-based weightings, especially DPT, were worse than that of N/A.

Multi-Class Segmentation Tasks

Table 5 summarizes all the results in the multi-class segmentation tasks. Figure 6 shows the violin plots of the Dice scores. As for the cross-entropy loss function, Inverse and Median, as in the binary-class segmentation tasks, worsened the results in all multi-class segmentation tasks. The results of Focal, especially the surface accuracies, were equivalent to or better than those of N/A in almost all the tasks. Among the distance map-based weightings, DPT worked well for improving the segmentation accuracy. As for the Dice loss function, Inverse and Median significantly improved the segmentation accuracy of blood vessels, which were a very highly imbalanced class, in all multi-class segmentation tasks. However, Inverse also significantly worsened the segmentation accuracy of the cerebrum and cerebellum, which were relatively large-size targets. Focal provided better results than N/A for almost all the segmentation targets. The distance map-based weightings showed inconsistent results between the multi-class segmentation tasks. In Table 5, compared with the results of N/A, the significantly worse and better results are shown in black and red, respectively (Wilcoxon signed-rank test, * p < 0.05, ** p < 0.01, and *** p < 0.001, not adjusted for multiplicity).

Figure 7 visualizes an example of the segmentation results in the five-class segmentation task. It shows the false positive and false negative labels as well as the predicted labels. False positives were likely to appear around the surface of the cerebrum, cerebellum, and brainstem, while false negatives tended to appear in the upper part of blood vessels. As for the cross-entropy loss function, Inverse and Median reduced the false negatives, but more than that, they greatly increased the false positives. Focal worked well for reducing the false positives, although it did not reduce the false negatives.
The results of the distance map-based weightings showed that DPT was somewhat effective in reducing the false positives and false negatives. As for the Dice loss function, Inverse reduced the false negatives in blood vessels, although it failed to segment the whole cerebrum. Median worked to reduce the false negatives in blood vessels, as with Inverse. Focal slightly reduced the false positives. DTM and DPT seemed to provide almost the same results as N/A.

Table 6 indicates the ranking results of the loss weightings in the binary- and multi-class segmentation tasks. The distance map-based weightings for the cross-entropy loss function and the predictive probability-based weighting for the Dice loss function tended to have high rank scores in both the binary- and multi-class segmentation tasks. In the binary-class segmentation tasks, the Dice loss function with Focal showed the best ranking result. It obtained a high average DSC and SDSC of 92.8% and 93.3%, respectively. Compared with no weighting, it improved the DSC and SDSC values of all tasks by 0.2-8.1% and 0.5-12.5%, respectively. In the multi-class segmentation tasks, the cross-entropy loss function with DPT had the highest rank score, followed by the Dice loss function with Focal. In the five-class segmentation task, DPT achieved the highest average DSC and SDSC values of 93.1% and 94.6%, respectively.

Table 6. Ranking results of no weighting (N/A), inverse frequency weighting (Inverse), inverse median frequency weighting (Median), focal weighting (Focal), distance transform map-based weighting (DTM), and distance penalty term-based weighting (DPT) in (a) binary-class segmentation tasks and (b) multi-class segmentation tasks. The best results are shown in bold. The rank is determined based on the rank scores of segmentation results on all datasets.


Discussion
We evaluated the effect of the loss weightings on the segmentation of the cerebrum, cerebellum, brainstem, and blood vessels from the MR images. From the segmentation results with the non-weighted loss functions, we found that the segmentation errors of the cerebrum, cerebellum, and brainstem, including false positives and false negatives, were concentrated at their edges, whereas the segmentation errors of blood vessels, especially false negatives, appeared in their upper part. This is probably because the edges of the brain parenchyma and the upper blood vessels varied across cases and the FCN was biased toward training image features in easier-to-segment majority regions. Thus, to improve brain structure segmentation, it would be important to make the FCN focus on training image features around the edge of the brain parenchyma and in the upper part of the blood vessels by loss weightings. We discuss the effect of the loss weightings based on the results in the binary- and multi-class segmentation tasks below. Subsequently, we also discuss the limitations of this study.

Binary-Class Segmentation Tasks
As for the cross-entropy loss function, the class frequency-based weightings (Inverse and Median) greatly increased false positives. They assign a lower uniform weight to the loss of larger-size classes, i.e., the background class in the binary-class segmentation tasks. This gave a low uniform weight to low-confidence background pixels near the edge of the foreground, which resulted in a large increase in false positives on those pixels, although it could also help reduce false negatives. On the other hand, the predictive probability- and the distance map-based weightings tended to improve the surface accuracy of the highly imbalanced classes, i.e., the brainstem and blood vessels. Unlike the class frequency-based weightings, they assign a different weight to each pixel. Using such pixel-wise weights instead of uniform weights may be more appropriate for imbalanced segmentation because FCNs do not focus equally on all the pixels of the same class during training. The predictive probability-based weighting (Focal) gives higher weights to pixels with lower prediction confidence based on the predictive probability and helps correct pixels misclassified with low confidence, whereas the distance map-based weightings (DTM and DPT) define pixel-wise weights based on the distance from the edge of the ground-truth labels and help correct surface segmentation errors. Thus, these loss weightings could correct the surface errors because pixels around the edge of the foreground class were likely to be misclassified with low prediction confidence in the highly imbalanced segmentation tasks.
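The pixel-wise behavior of the predictive probability-based weighting can be illustrated with a minimal sketch of focal-weighted cross-entropy for a binary task. The function name and the list-based interface are illustrative, not the implementation used in this study; the focusing exponent follows the common default of gamma = 2.

```python
import math

def focal_weighted_ce(probs, targets, gamma=2.0, eps=1e-7):
    """Focal-weighted binary cross-entropy (illustrative sketch).

    probs:   predicted foreground probabilities, one per pixel
    targets: binary ground-truth labels, one per pixel
    gamma:   focusing parameter; gamma = 0 recovers plain cross-entropy
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p       # probability assigned to the true class
        w = (1.0 - p_t) ** gamma             # low-confidence pixels receive high weight
        total += -w * math.log(max(p_t, eps))
    return total / len(probs)
```

A confidently correct pixel (p_t = 0.9) is down-weighted far more than an uncertain one (p_t = 0.6), which is why Focal concentrates training on hard pixels near object edges rather than on the easy background majority.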
As for the Dice loss function, the class frequency-based weightings significantly improved the accuracy in the highly imbalanced segmentation tasks, although they did not work well for the cross-entropy loss function. They assign the weight to both the numerator and denominator of the Dice loss function, which would allow the FCN to reduce false negatives without increasing false positives. The predictive probability-based weighting, which showed the best performance in Table 6, worked well for the low- and middle-level imbalanced segmentation tasks as well as the highly imbalanced ones. This can be explained by the fact that the FCN with the Dice loss function had more pixels misclassified with low prediction confidence in the low- and middle-level imbalanced segmentation tasks than that with the cross-entropy loss function. Additionally, the distance map-based weightings tended to improve the surface accuracy in the brain parenchyma segmentation. However, they were ineffective in the segmentation task of blood vessels. As shown in [16], for the segmentation of objects with variable locations and shapes, they might work more stably with a scheduling strategy, i.e., gradually increasing the weight of the mismatched region with the training epochs.
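Why placing the class weight in both the numerator and denominator avoids the false-positive explosion seen with weighted cross-entropy can be sketched as follows. This is a generalized-Dice-style formulation with squared inverse-frequency weights; the exact exponent and smoothing used in this study may differ, and the dictionary-based interface is purely illustrative.

```python
def weighted_dice_loss(probs, targets, eps=1e-7):
    """Class frequency-weighted Dice loss (illustrative sketch).

    probs:   {class_name: predicted probabilities per pixel}
    targets: {class_name: binary ground-truth labels per pixel}
    Each class weight appears in BOTH the overlap (numerator) and the
    size (denominator) terms, so up-weighting a rare class rescales its
    whole Dice ratio instead of only penalizing its missed pixels.
    """
    num, den = 0.0, 0.0
    for c in probs:
        freq = sum(targets[c])
        w = 1.0 / (freq * freq + eps)        # squared inverse-frequency weight
        num += w * sum(p * t for p, t in zip(probs[c], targets[c]))
        den += w * (sum(probs[c]) + sum(targets[c]))
    return 1.0 - 2.0 * num / (den + eps)
```

With a perfect prediction the loss is near zero regardless of how imbalanced the classes are, whereas missing a rare class raises the loss sharply because its heavily weighted overlap term vanishes.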

Multi-Class Segmentation Tasks
The binary-class segmentation tasks included the class imbalance problem between background and foreground classes, whereas the multi-class segmentation tasks, which deal with two or more foreground classes, included the class imbalance problems not only between background and foreground classes but also among foreground classes.
However, the results in the multi-class segmentation tasks showed similar tendencies to those in the binary-class segmentation tasks, although some of them were affected by the foreground-foreground class imbalance.
The class frequency-based weightings failed to improve the segmentation performance of the FCN with the cross-entropy loss function in any of the multi-class segmentation tasks because they greatly increased false positives by assigning an extremely low weight to the background pixels. For the Dice loss function, they also worked negatively for the low- and middle-level imbalanced classes. Especially in the five-class segmentation task, Inverse could not segment the cerebrum at all due to the foreground-foreground class imbalance. However, it also provided the best DSC value for blood vessels. Thus, because of their extreme weighting, the class frequency-based weightings could work well only for very highly imbalanced objects. The predictive probability-based weighting worked well overall for both the cross-entropy and Dice loss functions. These results suggested that, despite the foreground-foreground class imbalance, it could enable FCNs to focus on the pixels misclassified with low prediction confidence, i.e., hard-to-segment pixels, by considering the predictive probability. Likewise, the distance map-based weightings tended to provide good segmentation results for the cross-entropy loss function. In particular, the cross-entropy loss function with DPT achieved the best performance, as indicated in Table 6b. However, the distance map-based weightings provided unstable segmentation results for the Dice loss function. Although we designed the Dice loss function with the distance map-based weightings by multiplying the false positive and false negative terms in the denominator by the weights, a scheduling strategy might make their effect more stable, as mentioned above.
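The boundary emphasis behind the distance penalty term can be sketched as a cross-entropy whose per-pixel weight grows as the pixel approaches the ground-truth boundary. This is a simplified one-dimensional illustration, assuming a precomputed distance map (in practice obtained with a distance transform of the label mask); the normalization to the range [1, 2] is one common choice, not necessarily the one used in this study.

```python
import math

def dpt_weighted_ce(probs, targets, dist, eps=1e-7):
    """Cross-entropy with a distance penalty term (illustrative sketch).

    probs:   predicted foreground probabilities per pixel
    targets: binary ground-truth labels per pixel
    dist:    distance of each pixel to the nearest ground-truth boundary
    """
    d_max = max(dist) + eps
    total = 0.0
    for p, t, d in zip(probs, targets, dist):
        w = 1.0 + (1.0 - d / d_max)          # boundary pixels (d = 0) get weight ~2
        p_t = p if t == 1 else 1.0 - p       # probability of the true class
        total += -w * math.log(max(p_t, eps))
    return total / len(probs)
```

A misclassified pixel on the boundary thus contributes roughly twice the loss of an equally misclassified pixel deep inside a region, steering the FCN toward the surface errors that dominated the non-weighted results.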
Overall, the cross-entropy loss function with DPT and the Dice loss function with Focal achieved relatively high accuracy across all the segmentation targets and tasks, but some other weightings outperformed them for specific targets. For example, the Dice loss function with Inverse provided better DSC and SDSC results for blood vessels than that with Focal. In this study, we focused on singly weighted loss functions instead of compound loss functions, but considering the different characteristics of the loss weightings, combining differently weighted loss functions might further improve the segmentation performance.

Limitations
Regarding the limitations of this work, first, we adopted the segmentation of brain parenchyma and blood vessels on MRC and MRA images, which is performed as routine work in our group. However, the effect of loss weightings might depend on the segmentation targets and tasks, although the results in this study reflected the characteristics of the loss weightings. Considering a wider range of applications, we should test the loss weightings in other brain structure segmentation tasks (e.g., the segmentation of white matter, gray matter, and cerebrospinal fluid on T1-weighted MR images). Second, we used the 2D U-Net architecture to investigate the effect of the loss weightings with fewer hyperparameters. However, we would need to test 3D FCNs with the weighted loss functions because they have been applied to volumetric brain structure segmentation. Moreover, we set the default parameters for the loss weightings (e.g., the focusing parameter for focal weighting) based on previous studies, but tuning such parameters could further improve the performance of the FCNs. Furthermore, in this study, we focused on segmenting brain structures, including blood vessels, from the MR images of patients with cerebral aneurysms, but considering clinical practice, it would be desirable to automatically detect the location of aneurysms, as in [37], in addition to the segmentation.

Conclusions
This paper investigated how loss weightings work for FCN-based brain structure segmentation on MR images in different class imbalance situations. Using the 2D U-Net with cross-entropy or Dice loss functions as a baseline network, we tested five loss weightings, defined based on class frequency, predictive probability, and distance maps, in binary- and multi-class brain structure segmentation on MRC and MRA images. From the experimental results, we found that the cross-entropy loss function with the distance map-based weightings, especially the distance penalty term-based weighting, and the Dice loss function with the predictive probability-based weighting could stably provide good segmentation results. In the binary-class segmentation tasks, the Dice loss function with focal weighting showed the best performance and achieved a high average DSC of 92.8%, whereas in the multi-class segmentation tasks, the cross-entropy loss function with distance penalty term-based weighting provided the best performance, achieving the highest average DSC of 93.1% in the five-class segmentation task. We also found that these weighted loss functions were relatively robust to the foreground-foreground class imbalance as well as the background-foreground class imbalance. In other words, the experimental results suggested that they could work well in both binary- and multi-class segmentation. Therefore, it may be effective to use the distance penalty term-based weighting with the cross-entropy loss function and the focal weighting with the Dice loss function. We believe that these findings will help in selecting weighting strategies for loss functions and in designing advanced loss weighting strategies.
In future work, for clinical application, we will address the detection and segmentation of more highly imbalanced diseased areas, such as cerebral aneurysms, as well as their surrounding structures, using the loss weighting strategies. Moreover, we will design compound loss functions (i.e., combinations of the loss weightings) and further investigate their effect on different brain structure segmentation tasks.