ESTAN: Enhanced Small Tumor-Aware Network for Breast Ultrasound Image Segmentation

Breast tumor segmentation is a critical task in computer-aided diagnosis (CAD) systems for breast cancer detection because accurate tumor size, shape, and location are important for further tumor quantification and classification. However, segmenting small tumors in ultrasound images is challenging due to the speckle noise, varying tumor shapes and sizes among patients, and the existence of tumor-like image regions. Recently, deep learning-based approaches have achieved great success in biomedical image analysis, but current state-of-the-art approaches achieve poor performance for segmenting small breast tumors. In this paper, we propose a novel deep neural network architecture, namely the Enhanced Small Tumor-Aware Network (ESTAN), to accurately and robustly segment breast tumors. The Enhanced Small Tumor-Aware Network introduces two encoders to extract and fuse image context information at different scales, and utilizes row-column-wise kernels to adapt to the breast anatomy. We compare ESTAN and nine state-of-the-art approaches using seven quantitative metrics on three public breast ultrasound datasets, i.e., BUSIS, Dataset B, and BUSI. The results demonstrate that the proposed approach achieves the best overall performance and outperforms all other approaches on small tumor segmentation. Specifically, the Dice similarity coefficient (DSC) of ESTAN on the three datasets is 0.92, 0.82, and 0.78, respectively; and the DSC of ESTAN on the three datasets of small tumors is 0.89, 0.80, and 0.81, respectively.

Breast ultrasound (BUS) imaging has become an effective screening method due to its painless, noninvasive, nonradioactive, and cost-effective nature. BUS image segmentation aims to extract tumor region(s) from normal breast tissues in images. It is an essential step in BUS computer-aided diagnosis (CAD) systems. However, because of speckle noise, poor image quality, and variable tumor shapes and sizes, accurate BUS image segmentation is challenging.
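According to the National Cancer Institute in the United States, the relative survival rate is 99% if breast cancer is detected and treated at the early stages, and only 27% if the cancer has spread to other organs of the body [1]. Early detection of breast tumors is the key to reducing the mortality rate. However, at the early stages, most tumors are small and occupy a relatively small region in BUS images. It is challenging to distinguish them from normal breast tissues. Therefore, accurate detection of small tumors is critical for breast cancer early detection, and can improve clinical decisions, treatment planning, and recovery.
The approaches of BUS image segmentation can be classified into traditional approaches and deep learning-based approaches. Numerous traditional approaches have been applied to BUS image segmentation, such as thresholding [2][3][4][5][6][7], region growing [8][9], and watershed [10]. Despite their simplicity, these methods require knowledge and expertise in extracting features, and they are not robust due to poor scalability and high sensitivity to noise. Refer to [11] for a comprehensive review of BUS image segmentation.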
Recently, several deep learning approaches [12]-[21] have been developed for BUS image segmentation; TABLE I lists the most recent deep learning approaches for BUS image segmentation. Huang et al. [12] proposed a fuzzy fully convolutional network to perform BUS image segmentation. Contrast enhancement and wavelet features were applied as a preprocessing step to augment the training data. The augmented training image set and the features from convolutional layers were transformed to a fuzzy domain by a fuzzy membership function. The context information and the human breast structure were integrated into Conditional Random Fields (CRFs) to enhance the segmentation results. Yap et al. [13] evaluated the performance of three different deep learning approaches for segmenting BUS images: a patch-based LeNet, a U-Net, and transfer learning with a pretrained AlexNet. These three methods achieved remarkable overall performance in segmenting BUS images on two different datasets. Zhuang et al. [14] proposed an RDAU-Net model, based on the U-Net architecture, to perform the tumor segmentation task on BUS images, where dilated residual blocks and attention gates were used to replace the basic blocks and original skip connections in U-Net, respectively. Similarly, Hu et al. [15] proposed a method that combined a dilated fully convolutional network with a phase-based active contour model. Moreover, to exclude tumor-like regions, the method in [16] integrated radiologists' visual attention for BUS segmentation. Byra et al. [17] proposed a deep learning segmentation approach based on entropy parametric maps, in which an attention gate block is employed to improve the segmentation performance. Furthermore, Moon et al. [18] proposed an ensemble CNN architecture for a CAD system to diagnose BUS images. The ensemble approach comprises multiple models, each trained on original images, segmented tumor images, tumor masks, and fused images. The fused images were prepared by combining an original image, the segmented tumor, and the tumor shape information (TSI). Lee et al. [19] proposed a channel attention module with multi-scale grid average pooling for segmenting BUS images. The approach utilizes both local and global information to improve the segmentation performance. These methods achieved good overall performance. However, as shown in Fig. 1, they failed to achieve good performance for segmenting small tumors. First, these methods are designed to improve overall performance using general-purpose square kernels, which were developed to learn features from natural images. Second, all currently available BUS datasets are small, and most deep learning-based approaches require a large and high-quality training set.
Small object detection and/or segmentation is challenging in computer vision. It forms the foundation of many image-related tasks, such as remote sensing, scene understanding, object tracking, instance and panoptic segmentation, aerospace detection, and image captioning. Chen et al. [20] proposed an augmented technique for the R-CNN algorithm with a context model and a small region proposal generator, and established the first benchmark dataset for small object detection. Krishna et al. [21] designed a Faster R-CNN with a modified upsampling technique to improve the performance of small object detection. Guan et al. [22] proposed a semantic context aware network (SCAN), which integrates a location fusion module and a context fusion module to detect semantic and contextual features. The DenseU-Net architecture was proposed by Dong [23] to perform semantic segmentation of small objects in urban remote sensing images. It uses residual connections and a weighted focal loss function with median frequency balancing to improve the performance of small object detection.
To the best of our knowledge, STAN [24] was the first deep learning architecture to improve small tumor segmentation. Three skip connections and two encoders were employed to extract multi-scale contextual information from different layers of the contracting part. STAN outperformed other deep learning approaches for segmenting small tumors in BUS images. However, its average false positive rate on small tumors is much larger than that on large tumors. In this paper, we extend STAN and propose a new architecture, namely the Enhanced Small Tumor-Aware Network (ESTAN), to achieve robust segmentation for tumors of different sizes. The new architecture has two encoder branches. The basic encoder has five blocks and learns features at different scales. The ESTAN encoder applies row-column-wise kernels to adapt to the breast anatomy during feature learning. In the decoder, each block has three skip connections that fuse rich contextual features from the two encoders. The contextual features are robust to different tumor sizes and help distinguish tumor regions from normal regions.
The rest of the paper is organized as follows: Section II presents the proposed architecture and implementation details; Section III demonstrates experimental results; and Section IV provides the conclusion and discusses future work.

II. ENHANCED SMALL TUMOR-AWARE NETWORK
In this section, we introduce the proposed Enhanced Small Tumor-Aware Network (ESTAN) for solving the issue of small tumor segmentation in BUS images. ESTAN builds upon two observations: 1) BUS images contain tumors of a broad range of sizes, and current state-of-the-art approaches perform poorly on segmenting small tumors; and 2) current deep learning-based approaches use square kernels and have difficulty utilizing the context information of BUS images, e.g., breast tissue anatomy. To alleviate these challenges, we propose ESTAN to extract and fuse image context information at different scales. ESTAN constructs feature maps using both square and large row-column-wise kernels. These feature maps transmit multi-scale context information and preserve fine-grained tumor location information. Therefore, the new design enables ESTAN to accurately segment breast tumors of different sizes, and it is especially effective for small tumors. ESTAN consists of two encoders and one decoder with three skip connections. The overall architecture of the proposed approach is shown in Fig. 2.

A. Basic Encoder
The basic encoder down-samples the input feature maps to extract low-level spatial and contextual information. Both convolution and pooling operations with strides greater than 1 are employed for downsampling the feature maps in the encoder blocks. The basic encoder comprises five blocks, where each block contains two convolutional layers and a max pooling layer, except the fifth block, which has no pooling layer. The basic blocks in the encoder differ from the original U-Net encoder blocks, since the new architecture uses two skip connections to copy feature maps from the encoder blocks to the corresponding upsampling layers in the decoder module. Fig. 2(c) illustrates the architecture of the basic encoder. Let the input images be denoted as $X \in \mathbb{R}^{h \times w \times c}$, where h, w, and c are the height, width, and number of channels, respectively. Let f be the convolution function for square kernels, and let $n_j$ and $k_i$ be the number of kernels in the jth block and the kernel size in the ith convolution layer of a block, respectively. The output of the jth block of the basic encoder is defined by
$$B_j = P_j\big(f(f(B_{j-1}, n_j, k_1), n_j, k_2)\big), \quad B_0 = X,$$
where $B_j$ is the output of a given block, and $P_j$ is the pooling operation in the jth block. Additionally, $n_1$, $n_2$, $n_3$, $n_4$, and $n_5$ have values 32, 64, 128, 256, and 512, respectively.
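The block structure above can be traced with a short shape calculation. This is a minimal sketch, assuming that the convolutions preserve spatial size and that each pooling layer halves it (2 × 2 pooling with stride 2); neither detail is stated in the text, and the function name is hypothetical. The channel counts and the 256 × 256 input size are taken from the paper.

```python
# Sketch: output shape of each basic-encoder block.
# Assumptions (not stated in the text): convolutions keep the spatial size,
# and each max pooling layer halves the height and width.
CHANNELS = [32, 64, 128, 256, 512]  # kernels per block, from the text


def basic_encoder_shapes(height=256, width=256):
    """Return the (height, width, channels) of every block's output."""
    shapes = []
    for block, channels in enumerate(CHANNELS, start=1):
        # The two convolutions change only the channel count.
        if block < len(CHANNELS):  # the fifth block has no pooling layer
            height, width = height // 2, width // 2
        shapes.append((height, width, channels))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(basic_encoder_shapes(), start=1):
        print(f"block {index}: {shape}")
```

Under these assumptions a 256 × 256 input reaches the fifth block at 16 × 16 resolution with 512 channels.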

B. ESTAN Encoder
The receptive field in CNNs plays an important role in building effective feature maps. It defines the input image region that produces an output feature, and image regions outside the receptive field of a feature do not contribute to the computation of that feature. To ensure the coverage of all relevant image regions and achieve enhanced performance, many dense prediction tasks use large receptive fields [25] [26]. There are several techniques for increasing the size of the receptive field, such as stacking more layers, sub-sampling, and dilated convolutions [27]. However, in BUS images, a large receptive field can result in poor performance for small tumor segmentation [24]. The goal of the ESTAN encoder is to produce effective feature maps while avoiding a large receptive field.
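How stacking layers and sub-sampling enlarge the receptive field can be made concrete with the standard recurrence: each layer adds (kernel − 1) times the current sampling step to the receptive field, and each stride multiplies the sampling step. The sketch below is illustrative only; the layer lists are hypothetical examples and not the ESTAN configuration.

```python
def receptive_field(layers):
    """Receptive field size along one axis for a stack of conv/pool layers.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size, step = 1, 1  # one input pixel, sampled at every pixel
    for kernel, stride in layers:
        size += (kernel - 1) * step  # a wider kernel sees more input pixels
        step *= stride               # striding spreads later kernels apart
    return size


if __name__ == "__main__":
    two_convs = [(3, 1), (3, 1)]
    with_pooling = two_convs + [(2, 2)] + two_convs
    print(receptive_field(two_convs))     # 5
    print(receptive_field(with_pooling))  # 14
```

The second example shows why sub-sampling is so effective at growing the receptive field: the two convolutions after the pooling layer each add 4 pixels instead of 2.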
STAN [24] proposed a two-encoder architecture and only applied small kernels, e.g., 1 × 1, 3 × 3, and 5 × 5. The small kernels avoid a large receptive field. The two encoders fused contextual information at different scales by producing features using receptive fields of different sizes. This design improved the overall performance for small breast tumor segmentation. However, STAN produced a high false positive rate for BUS images with some small tumors. To overcome this problem, we redesign the encoder by applying row-column-wise kernels. The small square kernels in STAN constructed feature maps using only square image regions. The motivation of the new design is that BUS images are composed of vertically stacked breast layers (Fig. 3). Applying row-column-wise kernels in CNNs can avoid calculating features using image regions from multiple anatomical layers and produce more accurate and meaningful feature maps.
The ESTAN encoder comprises five blocks, named ESTAN blocks, which run in parallel with the basic encoder blocks. Each block has four square kernels and two row-column-wise kernels in two parallel branches. Such kernels can efficiently extract contextual and fine-grained details of small tumors in BUS images. Furthermore, ESTAN blocks add one extra nonlinearity to each encoder block. Fig. 2(b) illustrates the design of the ESTAN block. Let $n_j$ be the number of kernels in the jth ESTAN block and $k$ be the kernel size. The output of the jth ESTAN block, $E_j$, is produced by the square convolutions, the row-column-wise convolution function $h$ with kernels of size $k \times 1$ and $1 \times k$, and the pooling operation $P_j$. The row-column-wise kernel size $k$ in ESTAN blocks 1 to 5 is 15, 13, 11, 9, and 7, respectively. The size of a second kernel is 5 in two of the blocks and 1 in the rest. Furthermore, Block 5 has no pooling operation in either encoder. Moreover, $n_1$, $n_2$, $n_3$, $n_4$, and $n_5$ have values 32, 64, 128, 256, and 512, respectively.
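To illustrate what a row-column-wise kernel does, the sketch below applies a 1 × k kernel and a k × 1 kernel to a toy image with one horizontal "layer", and compares the weight count of a k × 1 plus 1 × k pair with that of a k × k square kernel. It is a simplified single-channel illustration with zero padding and hypothetical function names, not the ESTAN block itself; the kernel sizes 15 to 7 are the ones quoted in the text.

```python
def conv_row(image, kernel):
    """Apply a 1 x k kernel (odd k, zero padding, stride 1) to a 2-D list."""
    pad = len(kernel) // 2
    output = []
    for row in image:
        padded = [0.0] * pad + list(row) + [0.0] * pad
        output.append([
            sum(weight * padded[col + i] for i, weight in enumerate(kernel))
            for col in range(len(row))
        ])
    return output


def conv_column(image, kernel):
    """Apply a k x 1 kernel by transposing, reusing conv_row, and transposing back."""
    transposed = [list(column) for column in zip(*image)]
    return [list(column) for column in zip(*conv_row(transposed, kernel))]


def weight_counts(k):
    """Weights per input/output channel pair: (square k x k, k x 1 plus 1 x k)."""
    return k * k, 2 * k


if __name__ == "__main__":
    # One bright horizontal layer between two dark ones.
    layers = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
    print(conv_row(layers, [1, 1, 1]))     # mixes pixels within a row only
    print(conv_column(layers, [1, 1, 1]))  # mixes pixels within a column only
    for k in (15, 13, 11, 9, 7):
        print(k, weight_counts(k))
```

The 1 × k kernel leaves the dark rows untouched because it never reads across rows, which is the property that lets such kernels respect horizontally layered anatomy; a square kernel of the same width would mix all three layers.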
In addition, STAN has 22 million parameters while ESTAN has 30 million, because ESTAN uses more convolution layers in both the encoder and the decoder. Training is fast for both STAN and ESTAN; the exact training time depends on the dataset size, the batch size, and the hardware specification of the machine.

C. Decoder and Skip Connections
The decoder module comprises four upsampling blocks, each of which has one upsampling layer followed by three convolution layers. Unlike the U-Net architecture, where each decoder block has two convolution layers, ESTAN adds an additional kernel after the first convolution kernel to control the number of channels after concatenation. Let f be the convolution function, $n_j$ be the number of kernels in the jth block, and $k$ be the kernel size. The output of the jth block of the decoder, $D_j$, is obtained by applying the upsampling layer $U$ followed by the three convolution layers. Two of the three kernel sizes are 3 in all blocks; the remaining kernel size is 1 in blocks 1, 2, and 3, and 5 in block 4. To overcome singularity issues during training, we introduce three skip connections that copy feature maps at different scales from both encoders to the decoder. The possible singularities are overlap, elimination, and linear dependence singularities. The first two skip connections take the result of a convolution layer in the basic encoder and the result of a convolution layer in the ESTAN encoder and concatenate them with the output of the upsampling layer. The third skip connection takes the result of another convolution layer and combines it with the result of a convolution layer in the decoder. In addition, $n_1$, $n_2$, $n_3$, and $n_4$ are 256, 128, 64, and 32, respectively. Fig. 2(d) illustrates the decoder module.

D. Implementation and Training
In this work, we use three public datasets [28][29][13] to train and test all the approaches. The input images and their ground truths are resized to 256 × 256 pixels. We applied image width and height shift augmentation to the training set of Dataset B, which has only 163 BUS images. During training, the batch size is set to 4 and the maximum number of epochs is set to 50. To train the model, we applied adaptive moment estimation (Adam) [30] with an initial learning rate of 0.0001. In most BUS images, the number of tumor pixels is much smaller than that of background pixels, which might cause overclassification of the background pixels. To alleviate this issue, we employed the Dice loss [31], which measures the relative overlap between the ground truth and the predicted labels. The Dice loss is computed from $P = \{p_i \in [0, 1]\}$ and $G = \{g_i\}$, which are the output of the final pixel-wise sigmoid layer and the ground truth, respectively.
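As an illustration of why an overlap-based loss helps with class imbalance, the sketch below implements one common form of the Dice loss, one minus twice the soft intersection divided by the sum of the predictions and the labels. This exact form, the smoothing constant `eps`, and the function name are assumptions; the variant used for ESTAN may differ, for example by squaring the terms in the denominator.

```python
def dice_loss(pred, truth, eps=1e-7):
    """Soft Dice loss for flat sequences of equal length.

    pred:  per-pixel probabilities in [0, 1] (sigmoid outputs).
    truth: per-pixel ground-truth labels (0 or 1).
    eps:   small constant (an assumption) that avoids division by zero.
    """
    intersection = sum(p * g for p, g in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


if __name__ == "__main__":
    # A 100-pixel image with a 2-pixel tumor.
    truth = [1, 1] + [0] * 98
    all_background = [0.0] * 100
    accuracy = sum(1 for p, g in zip(all_background, truth) if p == g) / 100
    print(accuracy)                          # 0.98 despite missing the tumor
    print(dice_loss(all_background, truth))  # close to 1, the worst value
    print(dice_loss([1.0, 1.0] + [0.0] * 98, truth))  # 0.0 for a perfect mask
```

Predicting only background scores 98% pixel accuracy on this toy image, yet its Dice loss is at the maximum, so the loss cannot be minimized by ignoring a small tumor.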
The BUSIS dataset contains 562 images collected from three hospitals using GE VIVID 7, LOGIQ E9, Hitachi EUB-6500, Philips iU22, and Siemens ACUSON S2000 devices. The BUSI dataset was collected at Baheya Hospital for Early Detection & Treatment of Women's Cancer in Egypt using the LOGIQ E9 ultrasound system and the LOGIQ E9 Agile ultrasound system with ML6-15-D Matrix linear probe transducers. The BUSI dataset has 780 images, of which 133 are normal, 437 are benign, and 210 are malignant. Dataset B has only 163 breast ultrasound images, which the UDIAT Diagnostic Centre of the Parc Taulí Corporation, Sabadell (Spain) collected using a Siemens ACUSON Sequoia C512 system with a 17L5 linear array transducer. Tumor size is an important variable, and Fig. 4 illustrates the histograms of the tumor size distributions of the three datasets based on the original BUS images. The physical sizes of most tumors in the three datasets are unavailable; therefore, we define the tumor size as the length (in pixels) of the longest axis of a tumor region in the original BUS image. The distributions of BUSI and Dataset B are skewed to the right, with many tumors smaller than 150 pixels. The BUSIS dataset has more large tumors than the other datasets, and the sizes of most of its tumors are between 150 and 250 pixels. In addition, the BUSIS images come from five different BUS devices, and the image quality varies widely.
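The tumor size definition above can be sketched as a measurement on a binary ground-truth mask. The text does not say how the longest axis is computed, so this example assumes it is the largest distance between any two tumor pixels, plus one so that a run of five pixels has length 5. The function name is hypothetical and the brute-force search is only suitable for small masks.

```python
from itertools import combinations
from math import hypot


def longest_axis(mask):
    """Longest axis (in pixels) of the foreground region of a binary mask.

    Assumption: the axis is the largest centre-to-centre distance between
    two foreground pixels, plus one pixel to count both end pixels.
    """
    pixels = [
        (row, col)
        for row, values in enumerate(mask)
        for col, value in enumerate(values)
        if value
    ]
    if not pixels:
        return 0.0  # a normal image without a tumor region
    farthest = max(
        (hypot(r1 - r2, c1 - c2) for (r1, c1), (r2, c2) in combinations(pixels, 2)),
        default=0.0,  # a single-pixel region has no pixel pairs
    )
    return farthest + 1.0


if __name__ == "__main__":
    print(longest_axis([[0, 1, 1, 1, 1, 1, 0]]))  # 5.0
    print(longest_axis([[1, 1, 1, 1]] * 3))       # diagonal of a 3 x 4 block
```

For full-resolution masks, restricting the search to boundary pixels or using a convex hull would be faster, since the farthest pair always lies on the boundary.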
All approaches are tested on a workstation with a 3.50 GHz Intel(R) Xeon(R) CPU, 32 GB of RAM, and an Nvidia Titan Xp GPU.

B. Overall Performance
In this section, we compare the proposed approach with AlexNet, SegNet, U-Net, CE-Net, MultiResUNet, RDAU-Net, SCAN, DenseU-Net, and STAN. The results are shown in Fig. 5 and TABLE II.
Fig. 5 shows the segmentation results of four sample BUS images. In the first row, the tumor in the BUS image is small, and AlexNet, U-Net, MultiResUNet, SCAN, and DenseU-Net have poor segmentation performance. In the second and third samples (2nd and 3rd rows), all approaches except the proposed ESTAN produce high false positives, which demonstrates that they have difficulty distinguishing tumor regions from tumor-like regions. In Fig. 5(k), STAN segments small tumors accurately, but still produces false tumor regions. Fig. 5(l) shows that ESTAN segments the four images accurately without any false tumor regions.
TABLE II presents the overall quantitative results of all approaches on the three datasets. The proposed ESTAN achieves the best overall performance on all three datasets. AlexNet and SegNet obtain high TPRs, but at the cost of high FPRs.
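For readers who want to reproduce area-based scores such as TPR, FPR, JI, and DSC from a predicted mask and a ground-truth mask, the following sketch shows one way to compute them. The definitions are assumptions, since this section does not spell them out: in particular, FPR is taken here as the falsely predicted area divided by the ground-truth tumor area, a convention used in BUS segmentation benchmarks, which allows values above what a pixel-wise false positive rate would give. The function name is hypothetical.

```python
def segmentation_metrics(pred, truth):
    """Area-based metrics for two binary masks given as 2-D lists.

    Assumed definitions (P: predicted tumor pixels, G: ground-truth pixels):
      TPR = |P & G| / |G|          FPR = |P - G| / |G|
      JI  = |P & G| / |P | G|      DSC = 2 |P & G| / (|P| + |G|)
    """
    def foreground(mask):
        return {
            (row, col)
            for row, values in enumerate(mask)
            for col, value in enumerate(values)
            if value
        }

    p, g = foreground(pred), foreground(truth)
    if not g:
        raise ValueError("the ground truth contains no tumor pixels")
    overlap = len(p & g)
    return {
        "TPR": overlap / len(g),
        "FPR": len(p - g) / len(g),
        "JI": overlap / len(p | g),
        "DSC": 2 * overlap / (len(p) + len(g)),
    }


if __name__ == "__main__":
    truth = [[1, 1, 0, 0]]
    pred = [[0, 1, 1, 0]]
    print(segmentation_metrics(pred, truth))
```

With this normalization, a method can raise its TPR simply by predicting larger regions, but the FPR rises with it, which is the trade-off noted for AlexNet and SegNet.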

C. Small Tumor Segmentation
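The physical sizes of the tumors in the three datasets are not available. Therefore, the length of the longest axis of a tumor region in the original (non-resized) BUS image is chosen as the criterion to select small tumors, and the length threshold is set to 120 pixels. BUSIS, BUSI, and Dataset B contain 49, 151, and 76 small tumors, respectively. Fig. 6 illustrates the false positive rate comparison between the overall and small tumor segmentation. All ten approaches have higher false positive rates for small tumors. The false positive rate of AlexNet increases dramatically for small tumor segmentation. ESTAN is superior to the other nine approaches and achieves the lowest false positive rate for both overall and small tumor segmentation. TABLE III shows all-inclusive results of all approaches on the three datasets using seven quantitative metrics. ESTAN outperforms all other nine approaches for small tumor segmentation on the three datasets. AlexNet and SegNet obtain high TPRs, but at the cost of high FPRs.
Fig. 6. False positive rates of overall and small tumor segmentation on the three datasets.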

D. Segmenting Tumors with Different Sizes
To demonstrate the effectiveness of the proposed ESTAN model, we split the BUSIS [28] dataset into four tumor size groups. We chose the BUSIS dataset for the following reasons: 1) the BUSIS dataset was collected from three hospitals using five ultrasound devices operated by different radiologists; 2) the ground truth of the BUSIS dataset has less bias because it was prepared by four experienced radiologists, where three radiologists generated tumor boundaries for each BUS image separately, and the fourth radiologist, a senior expert, judged and adjusted the majority voting results; and 3) all ten approaches achieved their best results on the BUSIS dataset compared to BUSI and Dataset B. We use the length of the longest axis of a tumor in the original BUS image as the criterion to select tumor groups. The first group contains 19 images with tumor sizes from 0 to 100 pixels, the second group has 30 images from 100 to 120 pixels, the third group consists of 81 images from 120 to 160 pixels, and the fourth group has 432 images from 160 to 533 pixels. TABLE IV lists the JIs and FPRs of the four tumor groups. AlexNet shows poor performance on the smallest tumor group, with a JI of 0.57 and an FPR of 0.97, while its FPR and JI improve dramatically in the other three groups. The results for the (100-120) and (120-160) groups are very close to each other, e.g., CE-Net and SCAN achieve the same JIs of 0.81 and 0.80 in both groups, respectively. The results show that tumors with sizes between 0 and 100 pixels are the most difficult cases, and none of the ten approaches performs as well on them as on large tumors. On the other hand, the fourth group contains the largest tumors, and all approaches achieve better results on it than on the other tumor size groups. The proposed ESTAN achieves the highest JIs and lowest FPRs on all tumor size groups.
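The grouping by longest-axis length can be expressed as a simple binning step. In the sketch below, the group boundaries and image counts come from the text; treating each group as including its lower bound and excluding its upper bound is an assumption, because the stated ranges share their end points. The function and constant names are hypothetical.

```python
from bisect import bisect_right

EDGES = [100, 120, 160]  # group boundaries in pixels, from the text
LABELS = ["0-100", "100-120", "120-160", "160-533"]
REPORTED_COUNTS = {"0-100": 19, "100-120": 30, "120-160": 81, "160-533": 432}
SMALL_TUMOR_THRESHOLD = 120  # longest-axis threshold for small tumors


def size_group(length_px):
    """Return the size-group label for a longest-axis length in pixels.

    Assumption: a length equal to a boundary falls into the larger group.
    """
    return LABELS[bisect_right(EDGES, length_px)]


if __name__ == "__main__":
    for length in (64, 100, 135, 400):
        print(length, size_group(length))
    # The four reported groups cover all 562 BUSIS images.
    print(sum(REPORTED_COUNTS.values()))
    # The two groups below the threshold match the 49 small BUSIS tumors.
    print(REPORTED_COUNTS["0-100"] + REPORTED_COUNTS["100-120"])
```

The two sanity checks at the end confirm that the reported group sizes are consistent with the 562 BUSIS images and with the 49 small tumors counted under the 120-pixel threshold.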

IV. CONCLUSION
In this paper, we propose the Enhanced Small Tumor-Aware Network (ESTAN) for tumor segmentation in BUS images. ESTAN comprises two encoder branches that extract and fuse image context information at different scales. The ESTAN blocks apply row-column-wise kernels to adapt to the breast anatomy. The decoder has three skip connections from the two encoders to fuse features. The proposed architecture is sensitive to small breast tumors and segments them accurately with a low false positive rate. In addition, the approach achieves state-of-the-art performance in segmenting tumors of different sizes. We validate the proposed approach extensively using three datasets and compare it with nine other breast tumor segmentation approaches. The results demonstrate that ESTAN achieves state-of-the-art performance on all datasets.
In the future, we plan to test the proposed approach using large datasets and focus on developing domain-enriched deep architectures for small object detection.
Bryar Shareef, Alex Vakanski, Member, IEEE, Min Xian, Member, IEEE, Phoebe E. Freer
Index Terms: breast ultrasound, tumor segmentation, deep learning, small tumor-aware network
I. INTRODUCTION
breast cancer early detection, and can improve clinical decision, treatment planning, and recovery. The approaches of BUS image segmentation can be classified into traditional approaches and deep learning-based approaches. Numerous traditional approaches have been used to BUS image segmentation, such as thresholding [2][3][4][5][6][7], region growing [8][9], and watershed [10].
Fig. 1. Performance of state-of-the-art approaches for segmenting breast tumors with different sizes. GT: Ground truth. Panels shown include DenseU-Net, (d) CE-Net, and (e) RDAU-Net.

Fig. 2(b) illustrates the design of the ESTAN block.
(Corresponding author: Min Xian.) B. Shareef, A. Vakanski, and M. Xian are with the Department of Computer Science, University of Idaho at Idaho Falls, Idaho Falls, ID 83401 USA (e-mails: shar0416@vandals.uidaho.edu, vakanski@uidaho.edu, mxian@uidaho.edu). P. Freer is with the Department of Radiology and Imaging Sciences, University of Utah School of Medicine, Salt Lake City, UT 84132, USA (e-mail: phoebe.freer@hsc.utah.edu).
to 120 pixels. BUSIS, BUSI, and Dataset B contain 49, 151, and 76 small tumors, respectively. Fig. 6 illustrates the false positive rate comparison between the overall and small tumor segmentation. All ten approaches have higher false positive rate for small tumors. The false positive rate of AlexNet has increased dramatically for small tumor segmentation. The ESTAN approach is superior in comparison to all nine approaches and achieves the lowest false positive for both overall and small tumor segmentation. TABLE III shows all-