WDTISeg: One-Stage Interactive Segmentation for Breast Ultrasound Image Using Weighted Distance Transform and Shape-Aware Compound Loss

Accurate tumor segmentation is important for aided diagnosis using breast ultrasound. Interactive segmentation methods can obtain highly accurate results by iteratively refining the segmentation through user interactions. However, traditional interactive segmentation methods usually require a large number of interactions before the result meets the requirements, owing to the performance limitations of the underlying model. With a greater ability to extract image information, convolutional neural network (CNN)-based interactive segmentation methods have been shown to effectively reduce the number of user interactions. In this paper, we propose a one-stage interactive segmentation framework (interactive segmentation using weighted distance transform, WDTISeg) for breast ultrasound images using a weighted distance transform and a shape-aware compound loss. First, a pre-trained CNN produces an initial automatic segmentation, on which the user marks interaction points in mis-segmented areas. Then, we combine the Euclidean distance transform and the geodesic distance transform to convert the interaction points into weighted distance maps that pass segmentation guidance to the model. The same CNN takes the input image, the initial segmentation, and the weighted distance maps as a concatenated input and produces a refined result, without an additional segmentation network. In addition, a shape-aware compound loss function using prior knowledge is designed to reduce the number of user interactions. In the testing phase on 200 cases, our method achieved a dice of 82.86 ± 16.22 (%) for the automatic segmentation task and a dice of 94.45 ± 3.26 (%) for the interactive segmentation task after 8 interactions. Comparative experiments showed that our method could obtain higher accuracy with fewer, simpler interactions than other interactive segmentation methods.


Introduction
Breast cancer is one of the leading causes of death in women around the world, and diagnosing it in its early stages remains crucial [1,2]. Breast ultrasound is widely used in clinical diagnosis because it is safe and inexpensive. Accurate tumor segmentation is generally necessary for precise diagnosis using breast ultrasound. However, fully automatic segmentation methods struggle to produce results accurate enough to meet clinical analysis standards [3]. This is mainly due to the poor quality of ultrasound images, but also to the limitations of the segmentation model. Compared with automatic segmentation methods that give their result in a single pass, interactive segmentation has the advantage that the user provides prior knowledge about the object through interactions to guide the refinement of the segmentation result [4]. In a real clinical situation, each patient may have multiple ultrasound images, and it is unrealistic to manually annotate tumor boundaries for all of them. Therefore, interactive segmentation tools that quickly deliver high-accuracy segmentation are of significant value for clinical use.
Good interactive segmentation of medical images depends on three key points: a simple type of interaction, efficient transfer of interaction information, and the use of prior knowledge. Existing interactive segmentation methods use different types of interactions, which can be divided into points [5][6][7][8][9], scribbles [10][11][12][13][14][15][16], a bounding box (BB), or a polygon box (PB) [11,17]. Among these, scribbles, BB, and PB require the user to drag the mouse pointer over the image for a long time, while clicking on points is intuitively the easiest interaction type. Interaction information transfer refers to the way user interactions are used to guide the segmentation. Most existing interactive segmentation methods use Gaussian probability maps, distance transforms, etc., to transfer user interaction information to the segmentation model. However, they cannot utilize both the location information of the interaction points and the contextual information of the image. Since human interaction essentially provides prior information to the network, building prior information into the model can reduce the number of user interactions, but few methods take advantage of this.
Some conventional interactive segmentation methods are based on graph theory. The graph cuts method [10] uses the Gaussian mixture model (GMM) as the underlying model and needs the user's scribbles for refinement. In this method, a large number of scribbles are needed before a satisfactory accuracy is reached. GrabCut [11] requires the user to provide a bounding box that limits the region of interest (ROI) so that fewer scribbles are needed, but, like graph cuts [10], it performs poorly on medical images because it relies on the same GMM. The random walker segmentation method [12] uses a random walker as the basic model to attain a refined result. These three methods all require many user interactions because of the poor performance of the underlying model. In 2007, Bai et al. proposed an interactive framework [13] that uses geodesic distances to convert user-provided scribbles so that the target can be segmented automatically. This was the first method to use the geodesic distance transform for interactive segmentation, and some subsequent methods [14][15][16] have improved on it. However, all of them only perform well on images with large differences between foreground and background, because the geodesic distance depends on the gradient information of the original image.
To overcome the limitations of traditional methods, interactive segmentation methods based on CNNs have been proposed. Xu et al. [5] converted the user's interaction points into Euclidean distance maps based on foreground and background points. The five-channel image (original RGB channels and two distance transform maps) was used as the input of a fully convolutional network (FCN) to obtain the segmentation result. The Euclidean distance transform captures the location information of the interaction points but cannot utilize image context information. BIFSeg [6] uses an image segmentation procedure similar to GrabCut [11]. The user first draws a bounding box as the input for the CNN to obtain an initial result. Then, image-specific fine-tuning adapts the CNN to improve the segmentation result. DeepIGeoS [7] was the first to propose using geodesic distance maps as part of the input for CNNs. Geodesic distance maps can reflect the grayscale texture information of the original image by calculating the shortest distance from every pixel to a specific point, so that CNNs can identify mis-segmentations of foreground and background from the input data and refine the segmentation result. However, the geodesic distance is sensitive to the contrast and spatial information of the image, and it does not clearly indicate the location of the interaction point. For example, when the tumor boundary is blurred, the geodesic distance near the boundary does not change significantly because the gray-level gradient is small, while the Euclidean distance depends only on the locations of the interactions and is not influenced by the quality of the original image. This means that, in such cases, the Euclidean distance is more effective than the geodesic distance in pointing out the mis-segmented area. Therefore, a method that combines the advantages of both distance transforms is needed.
In this paper, we propose a one-stage interactive segmentation framework for breast ultrasound images based on the above three key points. Compared with existing two-stage interactive segmentation networks [6,7], which use a second network to refine the result of an automatic segmentation network, our method has several advantages. First, our method uses the same CNN (I-net) to obtain the automatic segmentation and the refined results in turn. We trained I-net on the automatic segmentation task to ensure that it could provide an initial segmentation when only the original image is input. Second, our method transfers interaction information more effectively. We propose a weighted distance transform that combines the geodesic and Euclidean distance transforms, so that the distance map reflects both the texture information near the object area and the location information of the interaction points in the whole image. Third, our method reduces the number of interactions by using prior information in the training phase. We refer to the proposed framework as interactive segmentation using weighted distance transform (WDTISeg).
The main contributions of the proposed method are as follows: (1) We proposed a one-stage interactive segmentation framework for breast ultrasound image segmentation, which is the first method to use a single network for both automatic and interactive segmentation. The training process was greatly simplified because no additional automatic segmentation network was required to provide the initial results; (2) We proposed to convert user interactions into maps with a weighted distance transform that combines the geodesic and Euclidean distance transforms. This combination conveys the location information of the interaction points and also exploits image context knowledge; (3) We proposed a shape-aware compound loss function that uses prior knowledge of breast tumors in the training phase to reduce the number of interactions. The compound loss function improved the segmentation accuracy of the model while avoiding oscillation and overfitting in the training process.

Figure 1 shows the pipeline of the proposed framework WDTISeg for interactive segmentation. I-net is the backbone segmentation network of the framework and takes four-channel input data. It was pre-trained for the automatic segmentation task. First, the input image was fed into I-net to obtain an initial automatic segmentation, while the other three channels were all zero. Second, the user provided foreground and background interaction points on mis-segmented regions according to the initial segmentation. These interaction points were then converted into the foreground distance map (cyan) and the background distance map (red) by the weighted distance transform, as shown in Figure 1. Finally, the input image, the initial segmentation, and the two distance maps were concatenated and fed into I-net to attain a refined segmentation. This interactive process was repeated until a segmentation with satisfactory accuracy was attained.

The Structure of I-Net
It should be noted that whether I-net performs automatic or interactive segmentation depends on the form of the input data. When the image is first fed into the network, neither the initial segmentation nor the distance maps exist, so I-net outputs an automatic segmentation. Compared with two-stage interactive segmentation methods or refinement-based segmentation methods, such as DeepIGeoS [7], our method does not need an additional CNN to obtain an initial segmentation, which makes our model lighter and easier to train. Figure 1 also shows details of I-net in our method. The network receives four-channel data as input to predict the segmentation result. As shown in Figure 1, I-net is designed based on U-net [18] with an encoder-decoder architecture.
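The way one network serves both tasks can be illustrated by how its four-channel input is assembled. The following is a minimal sketch, assuming images are plain nested lists and that missing channels are filled with zeros as described above; the function name `build_inet_input` and the channel order are illustrative assumptions, not taken from the original implementation.

```python
def build_inet_input(image, init_seg=None, fg_map=None, bg_map=None):
    """Stack the four input channels: image, initial segmentation,
    foreground distance map, background distance map.

    Any channel that does not exist yet is replaced by an all-zero map,
    which corresponds to the automatic segmentation pass.
    (Hypothetical helper; channel order is an assumption.)
    """
    height, width = len(image), len(image[0])

    def zero_map():
        return [[0.0] * width for _ in range(height)]

    channels = [image]
    for extra in (init_seg, fg_map, bg_map):
        channels.append(extra if extra is not None else zero_map())
    return channels  # 4 x H x W


image = [[0.1, 0.2], [0.3, 0.4]]
first_pass = build_inet_input(image)  # automatic pass: three zero channels
seg = [[0.0, 1.0], [0.0, 1.0]]
fg = [[1.0, 0.0], [2.0, 1.0]]
bg = [[0.0, 1.0], [1.0, 2.0]]
refine_pass = build_inet_input(image, seg, fg, bg)  # interactive pass
```

In the first call only the image channel carries information, so the network behaves as an automatic segmenter; in the second call the same network receives the guidance channels.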
I-net modifies U-net [18] in several places to fit our segmentation tasks. We used group normalization [19] instead of batch normalization [20] so that normalization does not depend on the batch size. In addition, we used the leaky ReLU [21] layer instead of the ReLU [22] layer to solve the dead-neuron problem while retaining the advantages of the ReLU function. Leaky ReLU is defined as follows:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{1}$$

where α was set to 0.2 in this work.
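As a small illustration of the activation described above, the sketch below evaluates leaky ReLU on scalars with α = 0.2. It is a plain-Python stand-in for the layer provided by a deep learning library and is not the original implementation.

```python
def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: identity for positive inputs, slope `alpha` otherwise."""
    return x if x > 0 else alpha * x


# Negative inputs keep a small, non-zero response (and gradient),
# which is what prevents "dead" neurons.
activations = [leaky_relu(v) for v in (-1.0, 0.0, 3.0)]
```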
To prevent noise in the input image from being passed through the shallow layers' skip connections and affecting the segmentation results, I-net retained only the two middle skip connections of U-net. These two skip connections combine low-level features with high-level information to achieve better segmentation of tumor margins. At the last stage of the decoder, a convolution layer with one filter was used to attain the final segmentation.

Weighted Distance Transform
We proposed a weighted distance transform to convert user interactions into distance maps. An image can be regarded as a weighted undirected graph in which each pixel is a node and the grayscale difference between neighboring pixels is the weight of the edge connecting them. In graph theory, the geodesic distance is the length of the shortest path between two nodes in a graph, so the geodesic distance map reflects the grayscale texture information of the original image. The Euclidean distance is the shortest distance between two points in geometric space.
Let i and j be two different pixels in an image I. The unsigned geodesic distance between i and j is:

$$D_{geo}(i, j, I) = \min_{p \in P_{i,j}} \int_0^1 \left\| \nabla I(p(s)) \cdot u(s) \right\| \, ds \tag{2}$$

where $P_{i,j}$ is the set of all paths between pixels i and j, p(s) is one feasible path parameterized by s ∈ [0, 1], and u(s) is a unit vector tangent to the direction of p(s) [7,13,14]. The unsigned Euclidean distance between i and j is:

$$D_{euc}(i, j) = \left\| i - j \right\|_2 \tag{3}$$

Combining the two distances, let the weighted distance between pixels i and j be:

$$D_{w}(i, j, I) = \lambda \, D_{geo}(i, j, I) + (1 - \lambda) \, D_{euc}(i, j) \tag{4}$$

where λ is a hyperparameter that requires experimental verification. The weighted distance reduces to the Euclidean distance when λ = 0 and to the geodesic distance when λ = 1. Suppose $S_f$ and $S_b$ represent the set of foreground points and the set of background points, respectively. Then the unsigned weighted distance from i to a point set S (S ∈ {$S_f$, $S_b$}) is:

$$D_{w}(i, S, I) = \min_{j \in S} D_{w}(i, j, I) \tag{5}$$

Many algorithms exist for solving the shortest-path optimization in Equation (2), such as Floyd's algorithm, Dijkstra's algorithm [23], and the fast marching algorithm [24]. Here we use fast marching for its speed. The distance map was set to all zeros if no points were clicked in the foreground region or the background region. Figure 2 shows an example of weighted distance maps when λ in (4) takes different values.
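The weighted distance to a set of clicks can be sketched on a small pixel grid. The example below is a simplified illustration rather than the original implementation: it swaps fast marching for Dijkstra's algorithm on a 4-connected grid, uses the absolute gray-level difference between neighbors as the edge weight, and applies no rescaling between the geodesic and Euclidean terms. All function names are hypothetical.

```python
import heapq
import math


def geodesic_from_seed(image, seed):
    """Shortest-path (Dijkstra) distance from one seed pixel.

    Edge weight = absolute gray-level difference of 4-connected neighbours.
    This is a discrete stand-in for the geodesic distance, not fast marching.
    """
    h, w = len(image), len(image[0])
    dist = [[math.inf] * w for _ in range(h)]
    dist[seed[0]][seed[1]] = 0.0
    heap = [(0.0, seed[0], seed[1])]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                nd = d + abs(image[nr][nc] - image[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist


def weighted_distance_map(image, seeds, lam=0.5):
    """Per-pixel minimum over seeds of lam * geodesic + (1 - lam) * Euclidean."""
    h, w = len(image), len(image[0])
    if not seeds:
        return [[0.0] * w for _ in range(h)]  # no clicks: all-zero map
    out = [[math.inf] * w for _ in range(h)]
    for sr, sc in seeds:
        geo = geodesic_from_seed(image, (sr, sc))
        for r in range(h):
            for c in range(w):
                euc = math.hypot(r - sr, c - sc)
                combined = lam * geo[r][c] + (1.0 - lam) * euc
                out[r][c] = min(out[r][c], combined)
    return out


# Toy image: a flat region (0) next to a bright column (10).
toy = [[0, 0, 10],
       [0, 0, 10],
       [0, 0, 10]]
euclid_only = weighted_distance_map(toy, [(0, 0)], lam=0.0)
geodesic_only = weighted_distance_map(toy, [(0, 0)], lam=1.0)
mixed = weighted_distance_map(toy, [(0, 0)], lam=0.5)
```

In this toy case the pixel at row 0, column 2 is two pixels from the click (Euclidean distance 2) but separated from it by an intensity step of 10 (geodesic distance 10), so the mixed map assigns it the value 6 when λ = 0.5.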

Training and Testing of I-Net
For fast and efficient model training, we used automatically generated simulated interaction points in the model training phase, while interaction points were obtained by user clicks on images in the testing phase.

Training
In the training phase, in order to build the interactive segmentation model quickly and automatically, we generated interaction points that simulated the user's clicks by comparing the ground truth ( f y ) with the initial segmentation ( f x ). Subtracting the two images gives a mis-segmented foreground region (tumor pixels that the initial segmentation missed) and a mis-segmented background region (background pixels that the initial segmentation labeled as tumor), as shown in (6).
User interaction points can then be generated automatically from each mis-segmented region by randomly sampling n pixels in that region. Let N be the number of pixels in the region. In this work, n was determined empirically as a function of N, given in (7),
where ceil(x) returns the smallest integer value greater than or equal to x. Figure 3 shows some examples of simulated interaction points in the training phase.
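The simulated-click procedure can be sketched as follows. This is a minimal illustration under stated assumptions: masks are binary nested lists, and because the rule (7) that maps the region size N to the number of clicks n is not reproduced here, the sketch takes it as a caller-supplied function. The `placeholder_rule` below is purely hypothetical and is not the paper's rule.

```python
import math
import random


def missegmented_regions(gt, seg):
    """Split disagreements between ground truth and segmentation into
    missed-foreground pixels and false-foreground (background) pixels."""
    fg_region, bg_region = [], []
    for r, (gt_row, seg_row) in enumerate(zip(gt, seg)):
        for c, (y, x) in enumerate(zip(gt_row, seg_row)):
            if y - x == 1:
                fg_region.append((r, c))   # tumor pixel that was missed
            elif y - x == -1:
                bg_region.append((r, c))   # background labeled as tumor
    return fg_region, bg_region


def sample_clicks(region, n_of_N, rng):
    """Randomly pick n = n_of_N(N) pixels from a region of N pixels."""
    if not region:
        return []
    n = min(len(region), n_of_N(len(region)))
    return rng.sample(region, n)


def placeholder_rule(N):
    """HYPOTHETICAL stand-in for rule (7); chosen only so the demo runs."""
    return math.ceil(N / 100)


gt = [[1, 1], [0, 0]]
seg = [[1, 0], [1, 0]]
fg_region, bg_region = missegmented_regions(gt, seg)
rng = random.Random(0)
fg_clicks = sample_clicks(fg_region, placeholder_rule, rng)
bg_clicks = sample_clicks(bg_region, placeholder_rule, rng)
```

The sampled foreground and background clicks would then be converted into the two distance maps in the same way as real user clicks.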

Testing
Interaction points in the testing phase are obtained from the operator, who clicks on mis-segmented regions as shown in Figure 4. Because no ground truth is available as in the training phase, the user relies on prior knowledge to place the points. In each interaction round, the user gives one foreground point and one background point with reference to the initial segmentation from I-net or the segmentation result of the previous interaction. As shown in Figure 1, the interaction process continues until the user is satisfied with the segmentation or the maximum number of interactions is reached, which was set to 8 in our study.
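The testing-phase loop can be summarized in a short sketch. Everything passed into `interactive_refine` is a hypothetical callable standing in for a real component (the network, the click interface, the distance transform, and the user's stopping decision). Accumulating clicks across rounds is an assumption of this sketch, since the text does not state how earlier clicks are handled.

```python
MAX_INTERACTIONS = 8  # upper limit used in the study


def interactive_refine(image, predict, get_clicks, to_map, is_satisfied):
    """Run automatic segmentation, then refine with user clicks.

    predict(image, seg, fg_map, bg_map) -> segmentation (the single network)
    get_clicks(seg) -> (foreground_point, background_point) from the user
    to_map(image, clicks) -> distance map for a list of clicks
    is_satisfied(seg) -> True when the user accepts the result
    """
    seg = predict(image, None, None, None)  # first pass: no guidance channels
    fg_clicks, bg_clicks = [], []
    rounds = 0
    while rounds < MAX_INTERACTIONS and not is_satisfied(seg):
        fg_point, bg_point = get_clicks(seg)  # one point of each kind per round
        fg_clicks.append(fg_point)
        bg_clicks.append(bg_point)
        seg = predict(image, seg,
                      to_map(image, fg_clicks), to_map(image, bg_clicks))
        rounds += 1
    return seg, rounds


# Demo with stand-in callables (all hypothetical).
calls = []


def fake_predict(image, seg, fg_map, bg_map):
    calls.append(seg)
    return len(calls)  # dummy "segmentation": the call counter


result, rounds = interactive_refine(
    image=[[0.0]],
    predict=fake_predict,
    get_clicks=lambda seg: ((0, 0), (0, 0)),
    to_map=lambda image, clicks: list(clicks),
    is_satisfied=lambda seg: False,  # a user who is never satisfied
)
```

With a user who never accepts the result, the loop stops after the eighth interaction, so the network is called nine times in total: once automatically and eight times for refinement.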

Loss Function
Incorporating prior knowledge into the loss function is important to improve segmentation accuracy and reduce the number of user interactions [25]. By observing breast tumors in ultrasound images, we found some prior knowledge that is useful for tumor segmentation. First, most tumors are actually compact contiguous domains. Some benign tumors are even closer to round or ellipse shapes. Second, the physician usually ensures that only one tumor remains on the image when saving the breast ultrasound. Even when there are two or more tumors on the image, they can be separated by cropping the image.
Based on the above findings, we proposed a shape-aware compound loss function $L_{total}$ to incorporate prior knowledge into the CNN. As defined in (8), $L_{total}$ is composed of the binary cross entropy loss ($L_{BCE}$), the dice loss ($L_{Dice}$), and the shape constraint loss ($L_{SC}$):

$$L_{total} = L_{BCE} + L_{Dice} + \omega L_{SC} \tag{8}$$

where ω is the weight of $L_{SC}$.
Here $L_{SC}$ is the loss function we used for the shape constraint:

$$L_{SC} = \frac{P^2}{4 \pi S} \tag{9}$$

where S and P are the area and the perimeter of the predicted segmentation. When the predicted tumor shape is circular, $L_{SC}$ takes its minimum value of 1. Since the tumor area is compact and connected, $L_{SC}$ is greater than 1 when multiple areas are segmented or the tumor shape is dispersed. Because the shape constraint itself converges to 1 when the shape is a circle, ω was set to 0.05 here so that the loss acts only as a compactness constraint without affecting the overall segmentation.
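The behaviour of the shape constraint can be checked numerically. The sketch below assumes the compactness form P²/(4πS), which is the quantity that equals 1 for a circle as stated above, and measures area and perimeter on a binary mask by counting pixels and exposed pixel edges. This counting scheme is an illustrative assumption: the differentiable area and perimeter estimates used during training are not specified in the text, and edge counting overestimates the perimeter of curved boundaries.

```python
import math


def area_and_perimeter(mask):
    """Area = number of foreground pixels; perimeter = number of pixel
    edges that border background or the image frame (4-connectivity)."""
    h, w = len(mask), len(mask[0])
    area = perimeter = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            area += 1
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < h and 0 <= nc < w) or not mask[nr][nc]:
                    perimeter += 1
    return area, perimeter


def shape_constraint(area, perimeter):
    """Compactness P^2 / (4 * pi * S); equals 1 for an ideal circle."""
    if area <= 0:
        raise ValueError("empty segmentation has no defined compactness")
    return perimeter ** 2 / (4.0 * math.pi * area)


# Ideal circle of radius 5 (analytic area and perimeter).
circle = shape_constraint(math.pi * 5 ** 2, 2 * math.pi * 5)

# One 4x4 square versus two separate 4x4 squares on a 6x12 grid.
one_square = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0
               for c in range(12)] for r in range(6)]
two_squares = [[1 if 1 <= r <= 4 and (1 <= c <= 4 or 7 <= c <= 10) else 0
                for c in range(12)] for r in range(6)]
single = shape_constraint(*area_and_perimeter(one_square))
split = shape_constraint(*area_and_perimeter(two_squares))
```

The single square gives 4/π (about 1.27), while splitting the same shape into two disconnected squares doubles the value to 8/π (about 2.55), which illustrates why the loss penalizes fragmented predictions.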
Cross entropy (CE) is commonly used as a loss function in deep learning, and binary cross entropy (BCE) can be used as a loss function in binary classification tasks. The formula for BCE is as follows:

$$L_{BCE} = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log x_i + (1 - y_i) \log (1 - x_i) \right] \tag{10}$$

where X and Y represent the segmentation produced by the method and the ground truth, respectively, $x_i$ and $y_i$ are their values at pixel i, and M is the number of pixels. Dice loss is a dice-based loss function. The reason why dice loss is sometimes used directly as the loss function is that the real goal of segmentation is to maximize the dice coefficient. In general, however, the use of dice loss has a negative impact on back propagation and tends to make the training unstable.

$$L_{Dice} = 1 - \frac{2 \left| X \cap Y \right|}{\left| X \right| + \left| Y \right|} \tag{11}$$

where X and Y have the same meaning as in (10).
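The three loss terms can be combined in a few lines. The following is a sketch under stated assumptions, not the original training code: predictions are flattened lists of probabilities, BCE is averaged over pixels, the dice term uses the common soft formulation with a small smoothing constant, the BCE and dice terms are summed with unit weights, and the shape constraint value is passed in as a precomputed number `l_sc`.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross entropy over pixels (probabilities are clamped)."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(pred)


def dice_loss(pred, target, eps=1e-7):
    """Soft dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|), smoothed by eps."""
    intersection = sum(p * y for p, y in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def total_loss(pred, target, l_sc, omega=0.05):
    """Compound loss: BCE + dice + omega * shape constraint (assumed form)."""
    return bce_loss(pred, target) + dice_loss(pred, target) + omega * l_sc


target = [1, 0]
perfect = [1.0, 0.0]
uncertain = [0.5, 0.5]
loss_perfect = total_loss(perfect, target, l_sc=1.0)      # only the shape term remains
loss_uncertain = total_loss(uncertain, target, l_sc=1.0)
```

For a perfect prediction of a circular tumor, the BCE and dice terms vanish and the compound loss reduces to ω × 1 = 0.05, which shows how small the contribution of the shape term is relative to the segmentation terms.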

Setting
A dataset of 2200 breast ultrasound images was acquired in Fudan University Shanghai Cancer Center, Shanghai, China from January 2019 to December 2019. The equipment used to obtain ultrasound images included the Aixplorer ultrasound system (SuperSonic Imagine S.A., Aix-en-Provence, France) at 7-15 MHz and the Resona 5S ultrasound system (Shenzhen Mindray Bio-Medical Electronics Co. Ltd., Shenzhen, China) at 5-14 MHz. All images were stored in DICOM format. Each ultrasound image has a tumor segmentation that has been precisely outlined by an experienced radiologist as the ground truth. The image size range is from 721 × 496 to 931 × 606. All images are resized to 256 × 256 before being fed into the network.
All images are arranged in chronological order of the patients' diagnosis. We used the first 2000 cases for training and the remaining 200 cases for testing, which ensured the independence of the patients in our training dataset and testing dataset.
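For the quantitative evaluation, our work employed the dice value (dice) (%):

$$\mathrm{dice} = \frac{2 \left| X \cap Y \right|}{\left| X \right| + \left| Y \right|} \times 100\% \tag{12}$$

where X and Y represent the same quantities as in (10).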
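The evaluation metric and the mean ± standard deviation summary used throughout the results can be computed as in the sketch below. It assumes binary masks stored as nested lists; treating two empty masks as perfect agreement and using the sample standard deviation are conventions chosen for this illustration, since the text does not state them.

```python
import statistics


def dice_percent(pred, gt):
    """Dice coefficient between two binary masks, in percent."""
    intersection = size_pred = size_gt = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            size_pred += 1 if p else 0
            size_gt += 1 if g else 0
    if size_pred + size_gt == 0:
        return 100.0  # both masks empty: counted as agreement (assumption)
    return 100.0 * 2.0 * intersection / (size_pred + size_gt)


def summarize(scores):
    """Mean and sample standard deviation of per-case dice scores."""
    return statistics.mean(scores), statistics.stdev(scores)


gt = [[1, 0], [0, 0]]
exact = dice_percent([[1, 0], [0, 0]], gt)
over_segmented = dice_percent([[1, 1], [0, 0]], gt)
mean_dice, std_dice = summarize([90.0, 100.0])
```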

Implementation Details
Adam [26] with a learning rate of 3 × 10⁻⁴ was used as the optimizer in the training stage. The batch size was 32 and the ratio of validation was set to 20% (200 cases). The model was trained for 50 epochs, and only the checkpoint with the best validation loss was saved. We trained and tested our interactive network using an Intel(R) Xeon(R) Gold 6130 CPU at 2.10 GHz and an NVIDIA TESLA V100 (32G).
WDTISeg had a low computational cost during the training and testing phases. In the training phase, WDTISeg was trained with different values of λ and different loss functions, and the average training time was 624.6 s. The model size was 385 MB. In the testing phase, we recorded the time from feeding the input image into the network to attaining the final segmentation after 8 interactions; the average was 17.6 s.

Performance on Automatic Segmentation Task
Our proposed framework WDTISeg can both produce an automatic segmentation and refine results based on interactions. To demonstrate that our method does not require training an additional automatic segmentation network to obtain the initial segmentation, we compared the automatic segmentation results of U-net and WDTISeg, which are shown in Table 1. The dice of the automatic segmentation results of WDTISeg was 82.86 ± 16.22 (%), better than that of U-net. In the automatic segmentation examples in Figure 5, the results of WDTISeg were similar to those of U-net, and the segmentation results were even slightly more compact.

Table 1 (partial). Dice (%) on the testing data.
Method                              Dice (%)
[method label not recoverable]      92.87 ± 6.09
WDTISeg (λ = 0.5, L Dice)           92.17 ± 7.29
WDTISeg (λ = 0.5, L BCE)            93.01 ± 6.46
WDTISeg (λ = 0.5, L Dice+BCE)       93.54 ± 3.63

These results indicate that WDTISeg retains automatic segmentation performance comparable to U-net after interactive segmentation training.
Our study focused on improving automatic segmentation-based refinement, and for the first time, we proposed that the interactive segmentation network can generate the initial segmentation results by itself without the need to train additional automatic segmentation networks.

Impact of the Factor λ in Weighted Distance Transform
To verify the effectiveness of combining the two distance transforms, we compared the interaction results when λ took different values. Different values represent different weights of the two distance transforms. The weighted distance becomes purely the Euclidean distance when λ is 0 and purely the geodesic distance when λ is 1. The same user interactions were used for all values of λ during the experiment.
As can be seen from Table 1, the interactive segmentation method performed much better than the conventional automatic segmentation method U-net, by as much as 10% in dice. Our method with λ = 0.5 and L total achieved a dice score of 94.45 ± 3.26% and performed better than the other four values of λ. With the loss function fixed to L total, the results when λ was between 0 and 1 were better than those at both 0 and 1. This indicates that combining the two distance transforms can perform better than using either one alone on an interactive segmentation task.

Figure 6 shows a comparison of the segmentation results of our method with different values of λ, given the same user interaction points. The upper case 1 is a tumor with an obscure border, where the location information of the interaction points is more important than the texture information. In this case, the Euclidean distance transform should perform better than the geodesic distance transform, which is consistent with Figure 6. In the lower case 2, the tumor boundary is obvious, but there is a mis-segmentation outside the tumor. This requires both texture information to ensure correct segmentation of the tumor region and interaction point location information to instruct the network to remove mis-segmented regions outside the tumor. Therefore, λ of 0.5 is better than any other value in case 2. This suggests that combining the two distance transforms is beneficial in dealing with tumors in different cases. The combination of the Euclidean distance transform and the geodesic distance transform can both convey the location information of the interaction points and make use of image context information. The experimental results demonstrate that this combination improves the stability of the segmentation model on images that are difficult to segment.

Effect of Proposed Loss Function
We explored the effect of the proposed loss function by observing the dice rate on the training and validation datasets, as shown in Figure 7. On the training dataset, L Dice (Dice loss) achieved the best dice rate, while L total (BCE + Dice + SC loss) came second. L Dice performed well on the training set because dice was also the evaluation metric, so training with L Dice directly optimizes the network parameters toward a higher dice. However, the dice rate of L Dice on the validation dataset oscillated sharply. This is mainly because L Dice is a region-dependent loss: if some pixels of a small target are incorrectly predicted, the loss value changes significantly, which results in a drastic change in the gradient. Compared with the other three loss functions, L total, which incorporates prior knowledge, achieved the best result on the validation set and did not show strong oscillations after epoch 30. The BCE loss used for binary classification is insensitive to category imbalance, so it can dampen the oscillation caused by L Dice to some extent. On the other hand, the loss function L SC based on the compact shape constraint utilizes prior information about tumor shape and thus can improve the accuracy of segmentation. Note that L SC reaches its minimum of 1 when the tumor is circular, so it cannot be used alone as a segmentation loss function.
The purpose of introducing a human into the segmentation process is to use human prior knowledge as a supplement to improve the segmentation accuracy. In the interactive segmentation task, the human is at once a participant in the interactive segmentation process, the provider of prior information, and the evaluator of segmentation results when no ground truth is available. Interactive segmentation may also learn from few-shot learning, which has been widely used in machine learning classification tasks: a human guides segmentation on a few simple images so that the network can master the segmentation skills, further reducing the training time and human interaction time.

Quantitative Comparison of Different Methods
We compared WDTISeg with graph cuts, random walker, and DeepIGeoS (R-net). Table 1 presents a quantitative comparison of these methods on the testing data. For the interactive segmentation methods, all results were recorded after 8 interactions. Compared with the other three methods, the dice of WDTISeg (λ = 0.5, L total ) reached 94.45 ± 3.26 (%) after 8 interactions, which shows that our method can achieve a high segmentation accuracy with fewer interactions.
Visual comparison results are shown in Figure 8. All interactive segmentation methods can attain a highly accurate segmentation after enough interactions. However, the results of graph cuts and random walker showed rougher edges. In contrast, our method was able to obtain a segmentation that fit more closely to the tumor margin. Moreover, WDTISeg only required simple point clicks, while graph cuts and GrabCut require more scribbles or a bounding box. The results of DeepIGeoS are more similar to those of our method, because both use distance transforms to pass interaction information. However, the segmentation result of our method was smoother at the tumor edge, especially at the lower right corner of case 4. This may be because we used a shape constraint loss to impose prior constraints on tumor shape.

Conclusions
In this paper, we proposed a one-stage interactive segmentation framework (WDTISeg) for breast ultrasound image segmentation. The ultrasound image was put into the network first to attain an initial segmentation, on which user interaction points were provided to indicate mis-segmentations. Interaction points were converted into distance maps by weighted distance transform to be part of input of the interactive network. The one-stage network of point interaction made the interaction simpler. The loss function designed for the clinical prior knowledge of breast cancer further improved the segmentation accuracy. Comparison with other methods on the test dataset demonstrated the advantages of the proposed method.
However, our method had limitations in combining the two distance transforms. In this paper, in order to verify the usefulness of combining two distance conversion methods, different ratios were tried to conduct experiments, and the experimental results proved in general that combining two methods helps to improve the segmentation accuracy. Considering the differences of different ultrasound images, the most suitable combination ratio should be different for each image. If an optimal ratio value can be obtained adaptively according to the characteristics of the ultrasound image itself, thus attaining the best segmentation result, it will further improve the segmentation accuracy and enhance the segmentation robustness.