BUSIS: A Benchmark for Breast Ultrasound Image Segmentation

Breast ultrasound (BUS) image segmentation is challenging and critical for BUS computer-aided diagnosis (CAD) systems. Many BUS segmentation approaches have been studied in the last two decades, but the performance of most approaches has been assessed using relatively small private datasets with different quantitative metrics, which results in discrepancies in performance comparison. Therefore, there is a pressing need to build a benchmark that compares existing methods objectively on a public dataset, determines the performance of the best breast tumor segmentation algorithms available today, and investigates which segmentation strategies are valuable in clinical practice and theoretical study. In this work, a benchmark for B-mode breast ultrasound image segmentation is presented. In the benchmark, (1) we collected 562 breast ultrasound images and proposed standardized procedures involving four radiologists to obtain accurate annotations; (2) we extensively compared the performance of 16 state-of-the-art segmentation methods and demonstrated that most deep learning-based approaches achieved high Dice similarity coefficient values (DSC ≥ 0.90) and outperformed conventional approaches; (3) we proposed a looseness-based approach to evaluate the sensitivity of semi-automatic segmentation to user interactions; and (4) successful segmentation strategies and possible future improvements are discussed in detail.


Introduction
Breast cancer is the most frequently occurring cancer in women and one of the leading causes of cancer death worldwide [1]. The key to reducing mortality is to find the signs and symptoms of breast cancer at its early stage. In current clinical practice, breast ultrasound (BUS) imaging with a computer-aided diagnosis (CAD) system has become one of the most important and effective approaches for breast cancer detection due to its noninvasive, non-radioactive, and cost-effective nature. In addition, it is the most suitable approach for large-scale breast cancer screening and diagnosis in low-resource countries and regions.
CAD systems based on B-mode BUS images have been developed to overcome the inter- and intra-observer variability of radiologists' diagnoses, and have demonstrated the ability to improve the diagnostic performance for breast cancer [2]. Automatic BUS segmentation, which extracts the tumor region from the normal tissue regions of a BUS image, is a crucial component of a BUS CAD system. It can change traditional subjective tumor assessment into operator-independent, reproducible, and accurate tumor region measurement.
Driven by clinical demand, automatic BUS image segmentation has attracted great attention in the last two decades, and many automatic segmentation algorithms have been proposed. The existing approaches can be classified as semi-automatic or fully automatic according to whether user interactions are needed in the segmentation process. In most semi-automatic methods, the user needs to specify a region of interest (ROI) containing the lesion, a seed in the lesion, or an initial boundary.
Fully automatic segmentation is usually considered a top-down framework that models knowledge of breast ultrasound and oncology as prior constraints and needs no user intervention at all. However, it is quite challenging to develop automatic tumor segmentation approaches for BUS images due to poor image quality caused by speckle noise, low contrast, weak boundaries, and artifacts. Furthermore, tumor size, shape, and echo strength vary considerably across patients, which prevents the application of strong priors on object features that are important for conventional segmentation methods.
In previous works, most approaches were evaluated using private datasets and different quantitative metrics (see Table 1), which makes objective and effective comparison among the methods quite challenging. As a consequence, it remains difficult to determine the best performance of the algorithms available today, which segmentation strategies are valuable in clinical practice and study, and which image features are helpful in improving segmentation accuracy and robustness.
In this paper, we present a BUS image segmentation benchmark including 562 B-mode BUS images with ground truths, and compare sixteen state-of-the-art BUS segmentation methods using seven popular quantitative metrics. Besides the BUS dataset in this study, three other BUS datasets [66-68] were published recently. [66] and [67] have many challenging images with small tumors and could be valuable for testing algorithm performance on segmenting small tumors, but [66] has only 163 images and [67] does not have ground truths for most images. [68] has 763 images including 133 normal images (without tumors), which is valuable for testing algorithms' robustness in dealing with normal images. However, the three datasets did not use the same standardized process for ground truth generation; therefore, we do not report the performance of the algorithms on them.
We also make the BUS dataset and the performance of the sixteen approaches available at http://cvprip.cs.usu.edu/busbench. To the authors' best knowledge, this is the first attempt to benchmark BUS image segmentation methods. With the help of this benchmark, researchers can compare their methods with other algorithms, and find the primary and essential factors for improving the segmentation performance.
The paper is organized as follows: Section 2 gives a brief review of BUS image segmentation; Section 3 illustrates the set-up of the benchmark; in Section 4, the experimental results are presented; and the discussions and conclusion are in Sections 5 and 6, respectively.

Related Works
Many BUS segmentation approaches have been studied in the last two decades and have proven effective on their own datasets. In this section, a brief review of automatic BUS image segmentation approaches is presented; for more details, refer to the survey paper [19]. The BUS image segmentation approaches are classified into five categories: (1) deformable models, (2) graph-based approaches, (3) machine learning-based approaches, (4) classical approaches, and (5) other kinds.
Deformable models (DMs). According to the way of representing curves and surfaces, DMs are generally classified into two subcategories: (1) parametric DMs (PDMs) and (2) geometric DMs (GDMs). In PDM-based segmentation approaches, the main work focused on generating good initial tumor boundaries: [20-24] investigated PDMs by utilizing different preprocessing methods such as balloon forces, the sticks filter, the gradient vector flow (GVF) model, and the watershed approach [19].
Graph-based approaches. [62] was a common choice for defining the prior energy [31,32]. [32-34] utilized Gaussian distributions to model both intensity and texture distributions, and the Gaussian parameters were obtained either from manual selection or from user interaction.
Graph cuts is a special case of MRF-MAP modeling that focuses on binary segmentation. [3] proposed a novel fully automatic BUS image segmentation framework in which the graph cuts energy modeled information from both the frequency and space domains. [35] built the graph using image regions, initialized by specifying a group of tumor regions (F) and a group of background regions (B). The method in [36] was applied to define the weight function of the smoothness term (prior energy).
[35] proposed a discriminative graph cut approach in which the data term was determined online by a pre-trained Probabilistic Boosting Tree (PBT) classifier [37]. In [38], a hierarchical multiscale superpixel classification framework was proposed to define the data term.
Machine learning-based approaches. Both supervised and unsupervised learning approaches have been applied to BUS image segmentation. Unsupervised approaches are simple and fast, and are commonly utilized as preprocessing to generate candidate image regions. Supervised approaches are good at integrating features at different levels and producing accurate results.
Clustering: [39] proposed a BUS image segmentation method by applying spatial fuzzy c-means (sFCM) [40] to local texture and intensity features. In [41], FCM was applied to intensities to generate image regions in four clusters. [9] applied FCM to image regions produced by the mean shift method. [13] extended FCM and proposed neutrosophic l-means (NLM) clustering to deal with the weak boundary problem in BUS image segmentation by considering the indeterminacy membership.
SVM and NN: [42] trained a support vector machine (SVM) using local image features to classify small image lattices (16 × 16) into tumor or non-tumor classes. [10] trained an Adaboost classifier using 24 Haar-like features [43] to generate a set of candidate tumor regions. [44] proposed an NN-based method to segment 3D BUS images by processing 2D image slices using local image features. [45] trained two artificial neural networks (ANNs) to determine the best possible threshold. [11] trained an ANN to conduct pixel-level classification using the joint probability of intensity and texture [20] together with two new features: the phase in the max-energy orientation (PMO) and the radial distance (RD). The ANN had 6 hidden nodes and 1 output node.
Deep learning: deep learning-based approaches have been reported to achieve state-of-the-art performance in many medical tasks such as prostate segmentation [46], cell tracking [47], muscle perimysium segmentation [48], brain tissue segmentation [49], and breast tumor diagnosis [50]. Deep learning models have great potential to achieve good performance due to their ability to characterize large image variations and to learn compact image representations automatically from a sufficiently large image dataset, and architectures based on convolutional neural networks (CNNs) have been employed in medical image segmentation [46-52]. [51] combined fuzzy logic with a fully convolutional network (FCN), and the 5-layer structure of the breast was utilized to refine the final segmentation results. [52] applied fuzzy logic to five convolutional blocks; it can handle breast images having no tumor or more than one tumor, which could not be processed well before. [60] compared the performance of LeNet [62], UNet [47], and FCN-AlexNet [63] for detecting tumors in BUS images, and the results showed that the patch-based LeNet achieved the best performance on the first dataset (306 images), while the transfer learning-based FCN-AlexNet outperformed the other approaches on the second dataset (163 images). [61] utilized fully convolutional CNNs to identify the tissue layers of the breast, and integrated the layer information into a fully connected CRF model to generate the final segmentation results. [65] proposed the STAN architecture to improve small tumor segmentation.
Two encoders were employed in STAN to extract the multi-scale contextual information from different levels of the contracting part.
Classical approaches: the three most popular classical approaches applied to BUS image segmentation are thresholding, region growing, and watershed. [11] proposed an automatic seed generation approach. [54] defined the cost of growing a region by modeling common contour smoothness and region similarity (mean intensity and size).
Watershed could produce more stable results than thresholding and region growing approaches.
[40] selected the markers based on grey level and connectivity. [56] applied watershed to determine the boundaries of the binary image, with the markers set as the connected dark regions. [58] applied watershed and a post-refinement based on grey level and location to generate candidate tumor regions.
Other kinds: two interesting approaches fall into this category: cell competition [35,36] and cellular automata [14]. Cell competition: the cells are small image regions, and adjacent cells compete with each other to split or merge. [36] defined two types of competition: Type I and Type II. In Type I competition, two adjacent cells from different regions compete to split one cell from a region and merge it into another region. In Type II competition, one cell splits from a multi-cell region and generates a single-cell region. This approach is simple and fast, but it needs user interaction to select the tumor region. Cellular automata (CA): each cell in a CA has three components: a state, neighbors, and a transition function. A cell's state updates by using its transition function and the states of its neighboring cells. [14] constructed the transition function by using local texture correlation. It could generate accurate tumor boundaries and did not have the shrinking problem of graph cuts; however, the computational cost for the CA to reach a stable state was quite high.
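As a rough illustration of the CA mechanism, the sketch below uses a GrowCut-style intensity rule: a labeled neighbor captures the current cell when its strength, attenuated by the intensity difference, exceeds the current cell's strength. This is a simplified stand-in under our own assumptions; [14] builds the transition function from local texture correlation, not raw intensities, and the function name and seed encoding here are hypothetical.

```python
import numpy as np

def growcut(image, seeds, n_iters=50):
    """GrowCut-style cellular automaton on a 2D grey-level image.
    seeds: 0 = unlabeled, 1 = tumor seed, 2 = background seed."""
    h, w = image.shape
    label = seeds.copy()
    strength = (seeds > 0).astype(float)          # seed cells start fully confident
    max_diff = image.max() - image.min() + 1e-9   # normalizer for intensity gaps
    for _ in range(n_iters):
        changed = False
        for y in range(h):
            for x in range(w):
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and label[ny, nx] > 0:
                        # attack force decays with the intensity difference
                        g = 1.0 - abs(float(image[ny, nx]) - float(image[y, x])) / max_diff
                        if g * strength[ny, nx] > strength[y, x]:
                            label[y, x] = label[ny, nx]
                            strength[y, x] = g * strength[ny, nx]
                            changed = True
        if not changed:   # stable state reached
            break
    return label
```

The loop until a stable state mirrors the high computational cost noted above: every pass visits every cell and its neighborhood.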
In Table 1, we list 20 BUS image segmentation approaches published recently.
[16] is a level set-based segmentation approach and sets the initial tumor boundary by a user-specified ROI; the maximum number of iterations is set to 450 as the stopping criterion. [14] is based on cellular automata and uses the pixels on the boundary of the user-specified ROI as the background seeds and the pixels on an adaptive cross at the ROI center as the tumor seeds. [11] utilizes a predefined reference point (the center of the upper part of the image) for seed generation and a pre-trained tumor grey-level distribution for texture feature extraction; we use the same reference point defined in [11] and the predefined grey-level distribution provided by the authors, and 10-fold cross-validation is employed to evaluate the overall segmentation performance. [3] and [18] are two graph-based fully automatic approaches; in our experiments, we adopt all the parameters from the corresponding original papers. [47,63,65,71-74] are deep learning approaches, and 5-fold cross-validation was applied to test their performance. An example of the ground truth generation is in Figure 2. It is not meaningful to compare semi-automatic methods with fully automatic methods; therefore, we compare the methods in the two categories separately. In the evaluation of the semi-automatic approaches, we compare the segmentation performances of the two methods using the same set of ROIs and evaluate the sensitivity of the methods to ROIs with different looseness ratios (LR), defined by

LR = BD / BD0

where BD0 is the size of the bounding box of the ground truth and is used as the baseline, and BD is the size of an ROI containing BD0. We produce 10 groups of ROIs with different LRs automatically using the approach described in [55]: move the four sides of an ROI toward the image borders to increase the looseness ratio, with the amount of each move proportional to the margin between the side and the image border. The LR of the first group is 1.1, and the LR of each subsequent group is 0.2 larger than that of its previous group.
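The ROI generation above can be sketched as follows. This is a minimal illustration, not the exact implementation of [55]: the helper names and the bisection search for the expansion fraction `t` are our own assumptions; only the proportional-to-margin side moves and the LR definition are taken from the text.

```python
def expand_roi(bbox, img_w, img_h, t):
    """Move each side of the box toward the image border by a fraction t of its margin."""
    x0, y0, x1, y1 = bbox
    return (x0 - t * x0,            # left margin is x0
            y0 - t * y0,            # top margin is y0
            x1 + t * (img_w - x1),  # right margin is img_w - x1
            y1 + t * (img_h - y1))  # bottom margin is img_h - y1

def roi_with_lr(bbox, img_w, img_h, target_lr, tol=1e-3):
    """Bisection on t so that area(ROI) / area(ground-truth box) hits target_lr."""
    x0, y0, x1, y1 = bbox
    base = (x1 - x0) * (y1 - y0)
    lo, hi = 0.0, 1.0               # t = 0 keeps the box; t = 1 reaches the image borders
    t = 0.0
    while hi - lo > 1e-6:
        t = (lo + hi) / 2
        ex0, ey0, ex1, ey1 = expand_roi(bbox, img_w, img_h, t)
        lr = (ex1 - ex0) * (ey1 - ey0) / base
        if abs(lr - target_lr) < tol:
            break
        if lr < target_lr:
            lo = t
        else:
            hi = t
    return expand_roi(bbox, img_w, img_h, t)

# the ten LR groups used in the benchmark: 1.1, 1.3, ..., 2.9
lrs = [round(1.1 + 0.2 * i, 1) for i in range(10)]
```

Each ground-truth bounding box then yields ten ROIs, one per LR, provided the image is large enough to reach the target looseness.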
The method in [11] is fully automatic; it involves neural network training and testing, and a 10-fold cross-validation strategy is utilized to evaluate its performance. The methods in [3,18] need no training and no operator interaction. All experiments are performed on a Windows-based PC equipped with a dual-core (2.6 GHz) processor and 8 GB memory. The performances of these methods are validated by comparing the results with the ground truths. Both area and boundary metrics are employed to assess the performances of the approaches. The area error metrics include the true positive ratio (TPR), false positive ratio (FPR), Jaccard index (JI), Dice similarity coefficient (DSC), and area error ratio (AER):

TPR = |Am ∩ Ar| / |Am|,  FPR = |Ar − Am| / |Am|,  JI = |Am ∩ Ar| / |Am ∪ Ar|,  DSC = 2|Am ∩ Ar| / (|Am| + |Ar|),  AER = (|Am ∪ Ar| − |Am ∩ Ar|) / |Am|,

where Am is the pixel set of the tumor region in the ground truth and Ar is the pixel set of the tumor region generated by a segmentation method.

MAE is defined by

MAE = (1/2) [ (1/nm) Σ x∈Cm d(x, Cr) + (1/nr) Σ y∈Cr d(y, Cm) ]    (7)

In Eq. (7), nm and nr are the numbers of points on boundaries Cm and Cr, respectively, and d(·, C) is the distance from a point to boundary C.
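Under these definitions, HE and MAE can be computed directly from boundary point lists. Below is a sketch assuming each boundary is given as an array of (x, y) points; the function names are our own.

```python
import numpy as np

def point_to_curve(p, curve):
    """d(p, C) = min over points k in C of ||p - k|| (Euclidean distance)."""
    return np.sqrt(((curve - p) ** 2).sum(axis=1)).min()

def hausdorff_error(cm, cr):
    """HE: worst possible disagreement between boundaries Cm and Cr."""
    cm, cr = np.asarray(cm, float), np.asarray(cr, float)
    return max(max(point_to_curve(p, cr) for p in cm),
               max(point_to_curve(q, cm) for q in cr))

def mean_absolute_error(cm, cr):
    """MAE: average of the mean point-to-boundary distances in both directions."""
    cm, cr = np.asarray(cm, float), np.asarray(cr, float)
    d1 = np.mean([point_to_curve(p, cr) for p in cm])   # (1/nm) sum over Cm
    d2 = np.mean([point_to_curve(q, cm) for q in cr])   # (1/nr) sum over Cr
    return 0.5 * (d1 + d2)
```

In practice the boundary points would be extracted as the contours of the binary ground-truth and result masks.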
The seven metrics above were discussed in [19]. The first two metrics (TPR and FPR) each measure only a certain aspect of the segmentation result and are not suitable for describing the overall performance; e.g., a high TPR value indicates that most of the tumor region is in the segmentation result, but it cannot guarantee an accurate segmentation because it does not measure the ratio of correctly segmented non-tumor regions. The other five metrics (JI, DSC, AER, HE and MAE) are more comprehensive and effective for measuring the overall performance of segmentation approaches and are commonly applied to tune the parameters of segmentation models [3]; e.g., large JI and DSC values and small AER, HE and MAE values indicate high overall segmentation performance.
Although JI, DSC, AER, HE, and MAE are comprehensive metrics, we still recommend using both TPR and FPR for evaluating BUS image segmentation, since these two metrics can reveal hidden characteristics that the comprehensive metrics cannot. Suppose an algorithm has low overall performance (small JI and DSC, and large AER, HE and MAE): if both TPR and FPR are large, we can conclude that the algorithm has overestimated the tumor region; if both TPR and FPR are small, the algorithm has underestimated the tumor region. The findings from TPR and FPR can guide the improvement of the algorithms.
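The area metrics and the TPR/FPR reading described above can be sketched on binary masks as follows. The dictionary layout, the AER normalization by |Am|, and the illustrative 0.5 threshold in the diagnosis helper are our own assumptions; [19] gives the exact definitions.

```python
import numpy as np

def area_metrics(gt, seg):
    """Compute TPR, FPR, JI, DSC, and AER from binary masks (1 = tumor)."""
    gt, seg = gt.astype(bool), seg.astype(bool)
    inter = np.logical_and(gt, seg).sum()
    union = np.logical_or(gt, seg).sum()
    am, ar = gt.sum(), seg.sum()            # |Am|, |Ar|
    return {
        "TPR": inter / am,                  # fraction of the true tumor recovered
        "FPR": (ar - inter) / am,           # false positives relative to tumor size; can exceed 1
        "JI":  inter / union,
        "DSC": 2 * inter / (am + ar),
        "AER": (union - inter) / am,        # assumed normalization by |Am|
    }

def diagnose(m, thr=0.5):
    """Rough over/under-estimation reading of TPR and FPR, as in the text."""
    if m["TPR"] > thr and m["FPR"] > thr:
        return "overestimated"
    if m["TPR"] < thr and m["FPR"] < thr:
        return "underestimated"
    return "mixed"
```

For example, a segmentation shifted off the true tumor yields moderate TPR and FPR together with low JI and DSC, which the comprehensive metrics alone would not explain.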

Semi-automatic segmentation approaches
Ten ROIs were generated automatically for each BUS image, with LRs ranging from 1.1 to 2.9 (step size 0.2). In total, 5620 ROIs were generated for the entire BUS dataset, and we run each semi-automatic segmentation approach 5620 times to produce the results. All segmentation results on the ROIs with the same LR are utilized to calculate the average TPR, FPR, JI, DSC, AER, HE, and MAE, respectively; the results of [14] and [16] are shown in Figures 3 and 4, respectively.
The segmentation results of [14] are demonstrated in Figure 3. All average JI values are between 0.7 and 0.8, and all average DSC values are between 0.8 and 0.9. The average TPR values are above 0.7 and increase with the LRs of the ROIs; the average JI and DSC values first increase and then decrease; the average FPR values increase with the increasing looseness of the ROIs; and the average AER, HE and MAE values first decrease and then increase. Five metrics (average JI, DSC, AER, HE and MAE) reach their optimal values at the LR of 1.9 (Table 2).
Figure 3. Average segmentation results of [14] using ROIs with different looseness ratio (LR).

As shown in Figure 4, all the average TPR and DSC values of the method in [16] are above 0.7, and the average JI values vary in the range [0.65, 0.75]. The average TPR values increase with the LRs of the ROIs, and both the average JI and DSC values tend to increase first and then decrease. The average FPR, AER and HE values are small when the LRs are small, which indicates that the high performance of [16] can be achieved by using tight ROIs; however, the values of these three metrics increase almost linearly with the LRs when the looseness is greater than 1.3, showing that the overall performance of [16] drops rapidly once the ROIs exceed a certain level of LR. The average MAE values decrease first, then increase, and vary with the LRs within a small range. Four metrics (average JI, DSC, AER and MAE) reach their optimal values at the LR of 1.5 (Table 2). Beyond 1.5, larger ROIs make [16] segment more non-tumor regions into the tumor region (refer to the average FPR curve in Figure 4); the increasing false positives result in decreasing average JI and DSC values and increases in all the other metrics.
As shown in Figures 3 and 4, and in Table 2, the two approaches achieve their best performances with different LRs (1.9 for [14] and 1.5 for [16]). We can observe the following facts:
 [14] and [16] are quite sensitive to the sizes of the ROIs.
 The performances of both approaches drop once the looseness level exceeds a certain value, and the performance of [14] drops much more slowly than that of [16].
 With the optimal LRs (1.9 for [14] and 1.5 for [16]), [14] achieves better average performance than [16].
 The running time of [16] is proportional to the size of the specified ROI, while no such relationship holds for [14].
 The running time of [14] is longer than that of [16] by one order of magnitude.

Fully automatic segmentation approaches
The performance of the 14 fully automatic approaches is reported in Table 3. Except for the methods in [3], [11], and [18], all other approaches are deep convolutional neural networks. In general, all deep learning approaches outperform [3], [11], and [18] on the benchmark dataset. [3] achieves better performance than the methods in [11] and [18] on all five comprehensive metrics. [14] and [52] achieve the lowest average FPR. The method in [11] has the same average TPR value as the method in [3]; however, its average FPR value is much higher (1.06), almost six times larger than that of the method in [3]; the high average FPR and AER values of the method in [11] indicate that large portions of non-tumor regions are misclassified as tumor regions. The average JI values of all deep learning approaches are above 0.8 except FCN-AlexNet, and [51] achieves the best average JI. Table 3 also shows the average optimal performances of [16] and [14] at the LRs of 1.5 and 1.9, respectively.

Discussions
Many semi-automatic segmentation approaches are utilized for BUS image segmentation [19].
User interactions (setting seeds and/or ROIs) are required by these approaches and could be useful for segmenting BUS images with extremely low quality. As shown in Table 3, the two interactive approaches could achieve very good performance if the ROI is set properly.
Figures 3 and 4 also demonstrate that the two semi-automatic approaches achieve varying performances with different sizes of ROIs. Therefore, the major issue in semi-automatic approaches is to determine the best ROIs/seeds. However, this issue has been largely neglected: most semi-automatic approaches focused only on improving segmentation performance by designing complex features and segmentation models, and failed to consider user interaction as an important factor that could affect the segmentation performance. Hence, we recommend that researchers consider this issue when developing semi-automatic approaches. Two possible solutions could be employed. First, for a given approach, we could choose the best LR by running experiments on a BUS image training set (as in Section 4.1) and apply that LR to the test set. Second, like the interactive segmentation approach in [57], we could bypass the issue by designing segmentation models that are less sensitive to user interactions.
Fully automatic segmentation approaches have many good properties such as operator independence and reproducibility. The key strategy shared by many successful fully automatic approaches is to localize the tumor ROI accurately by modeling domain knowledge. [11] localizes the tumor ROI by formalizing the empirical tumor location, appearance, and size; [3] generates the tumor ROI by finding an adaptive reference position; and in [18], the ROI is generated by detecting the mammary layer of the BUS image, and the segmentation algorithm only detects the tumor in this layer. However, in many fully automatic approaches, the performance depends heavily on hand-crafted features and inflexible constraints; e.g., [11] utilizes a fixed reference position to rank the candidate regions in the ROI localization process. Table 3 demonstrates that deep learning approaches outperform all traditional approaches. It is worth noting that deep learning approaches still have limitations in segmenting small breast tumors [65].
As shown in Table 3, on the benchmark dataset, the approaches [3,11,14,16,18] cannot achieve the performances reported in their original papers. The average JI of [3] is 14% less than the original average JI; the average FPR of [11] is 87% higher than the original value; the average TPR of [18] is 17% less than its reported value; and the average JI values of [14] and [16] are 17% and 10% lower than the reported values, respectively. The major reasons are: (1) the large image variety of the benchmark dataset; and (2) the lack of robustness of the approaches when dealing with images from different sources.
As shown in Table 1, many quantitative metrics exist for evaluating the performances of BUS image segmentation approaches. In this paper, we have applied seven metrics [19] to evaluate BUS image segmentation approaches. As shown in Figures 3 and 4, the average JI, DSC, and AER follow the same trend, and each of them is sufficient to evaluate the area error comprehensively.

Conclusion
In this paper, we establish a BUS image segmentation benchmark and present the comparison results of 16 state-of-the-art segmentation approaches; two of them are semi-automatic and the others are fully automatic. The BUS dataset contains 562 BUS images collected from three hospitals using multiple ultrasound machines; therefore, the images have a large variance in terms of image contrast, brightness and level of noise, and are valuable for testing the robustness of the algorithms as well. Among the approaches, two [3,18] are graph-based, [16] is a level set-based segmentation approach, [11] is an ANN-based approach, [14] is based on cell competition, and [47,51,52,65,67,71-76] are deep convolutional neural networks.
The quantitative analysis of the considered approaches highlights the following findings.
 As shown in Table 3, by using the benchmark, no approach in this study achieves the same performance reported in its original paper.
 The two semi-automatic approaches are quite sensitive to user interaction (LR); see Figures 3 and 4.
 Deep learning approaches outperform all conventional approaches on our benchmark dataset.
 Quantitative metrics such as JI, DSC, AER, HE, and MAE are more comprehensive and effective for measuring the overall segmentation performance than TPR and FPR; however, TPR and FPR are also useful for developing and improving algorithms.
In addition, the benchmark should be and will be expanded continuously.

Acknowledgement
This work was supported, in part, by the Institute for Modeling Collaboration (IMCI) at the Uni-

3.2 Datasets and Ground Truth Generation
Our BUS image dataset has 562 images from women aged between 26 and 78 years. The images were collected by the Second Affiliated Hospital of Harbin Medical University, the Affiliated Hospital of Qingdao University, and the Second Hospital of Hebei Medical University using multiple ultrasound devices: GE VIVID 7 and LOGIQ E9, Hitachi EUB-6500, Philips iU22, and Siemens ACUSON S2000. Images from different sources are valuable for testing the robustness of the algorithms. Example images from different devices are shown in Figure 1. Informed consent to the protocol was obtained from all patients, and the privacy of the patients is well protected. Four experienced radiologists were involved in the ground truth generation: three radiologists read each image and delineated each tumor boundary individually, and the fourth (a senior expert) judged whether the majority-voting results needed adjustment. The ground truth generation has four steps: 1) each of the three experienced radiologists delineates each tumor boundary manually, producing three delineation results for each BUS image; 2) all pixels inside or on the boundary are viewed as the tumor region and outside pixels as background, and majority voting is used to generate the preliminary result for each BUS image; 3) the senior expert reads each BUS image and refers to its corresponding preliminary result to decide whether any adjustment is needed; 4) tumor pixels are labeled 1 and background pixels 0, and a binary, uncompressed image is generated as the ground truth for each BUS image.
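Steps 1) and 2) above amount to a pixel-wise majority vote over the three delineations. A minimal sketch (the function name is ours; the senior expert's adjustment in step 3) is manual and not modeled):

```python
import numpy as np

def ground_truth_from_delineations(masks):
    """Majority vote over three radiologists' binary masks (1 = tumor, 0 = background)."""
    stack = np.stack([m.astype(np.uint8) for m in masks])  # shape (3, H, W)
    votes = stack.sum(axis=0)                              # per-pixel tumor vote count
    return (votes >= 2).astype(np.uint8)                   # tumor if at least 2 of 3 agree
```

The result is the binary, uncompressed preliminary ground truth described in step 4).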
Here, Am is the pixel set of the tumor region in the ground truth, Ar is the pixel set of the tumor region generated by a segmentation method, and |•| indicates the number of elements of a set. TPR, JI, and DSC take values in [0, 1]; FPR and AER can be greater than 1 and take values in [0, +∞). Furthermore, the Hausdorff error (HE) and mean absolute error (MAE) are used to measure the worst possible disagreement and the average agreement between two boundaries, respectively. Let Cm and Cr be the boundaries of the tumors in the ground truth and the segmentation result, respectively; x and y denote points on the boundaries Cm and Cr, respectively; and d(z, C) = min k∈C ||z − k|| is the distance between a point z and a boundary C, where ||z − k|| is the Euclidean distance between points z and k; i.e., d(z, C) is the minimum distance between point z and all points on C.


Table 3.
Overall performance of all approaches. The values before the slashes are the approaches' performances on the proposed dataset, and those after the slashes are the performances reported in the original publications. The notation '--' indicates that the corresponding metric was not reported in the original paper.