Gradually Applying Weakly Supervised and Active Learning for Mass Detection in Breast Ultrasound Images

We propose a method for effectively utilizing weakly annotated image data in an object detection task on breast ultrasound images. Given a problem setting where a small, strongly annotated dataset and a large, weakly annotated dataset with no bounding box information are available, training an object detection model becomes a non-trivial problem. We suggest a controlled weight for handling the effect of weakly annotated images in a two-stage object detection model. We also present a subsequent active learning scheme for safely assigning weakly annotated images a strong annotation using the trained model. Experimental results showed a 24% point increase in the correct localization (CorLoc) measure, which is the ratio of correctly localized and classified images, when the properly controlled weight was assigned. Performing active learning after the model was trained showed an additional increase in CorLoc. We also tested the proposed method on the Stanford Dog dataset to show that it can be applied to general cases where strong annotations are insufficient, obtaining similar results. The presented method shows that higher performance is achievable with less annotation effort.


Introduction
Breast cancer is the second leading cause of death for women all over the world, while its cause still remains unknown [Cheng et al.(2010)Cheng, Shan, Ju, Guo, and Zhang]. As with most cancers, early detection plays an important role in reducing the death rate [Cheng et al.(2006)Cheng, Shi, Min, Hu, Cai, and Du]. While digital mammography is the most commonly used technique for detecting breast cancer, its sensitivity is limited, particularly for dense breast tissue, which makes breast ultrasound (BUS) a widely used complementary modality.

Several approaches have been proposed for training with weak annotations. In weakly supervised segmentation, images with no mask annotations are given a pseudo mask ground truth generated by an initially trained model, and a second model is trained to perform both segmentation and image-level classification with these pseudo annotations. Generative adversarial networks (GANs) have also been tuned to perform semantic segmentation using both image-level annotations and generated mask annotations. Shin et al. [Shin et al.(2018)Shin, Lee, Yun, Kim, and Lee] use both bounding box annotations and image-level labels to localize and classify objects using multiple-instance learning (MIL). Images without bounding box annotations are given a bounding box chosen from a bag of bounding boxes presented during the localization stage, and various methods for choosing an object among the candidates are tested.
Active learning is a mechanism for expanding a given dataset by labeling unlabeled data with the trained model. User intervention for labeling is encouraged throughout the training process. Active learning can be applied to different types of datasets and to fields where data are scarce. For example, mask predictions for lung CT images generated by unsupervised segmentation have been used as ground truth annotations for training a supervised segmentation network [Zhang et al.(2018)Zhang, Gopalakrishnan, Lu, Summers, Moss, and Yao]. The segmentation network is trained multiple times while using the mask prediction from the previous model as the ground truth, progressively improving after each training session.
We propose an appropriate method for controlling the influence of weakly labeled data in a Faster-RCNN based object detection model. The presented method shows an increase in the correct localization (CorLoc) measure, which is preferred over mean average precision (mAP) in medical imaging, and in the fraction of lesions detected, which measures localization performance. The presented method assumes a relatively small strongly annotated dataset, insufficient for achieving high classification capability, and a larger dataset of weakly labeled images, which is a typical setting in medical imaging, where making strong annotations is costly.
The main contributions of this work are, first, a reasonable method of controlling the effect of weakly labeled data in an end-to-end object detection model and, second, an acceptable approach for actively assigning annotations to weakly labeled data, supplementing the insufficient annotations for object detection. The strongly annotated data, Dstrong, contain a single bounding box coordinate and the box classification label per image, while the weakly labeled data, Dweak, only contain an image-level label per image. An actively annotated dataset, Dactive, is newly constructed after a training session and is concatenated to Dstrong in the next training session. Separate data streams are maintained during training for the strongly annotated dataset and the weakly labeled dataset. Dataflow in the network is shown in Figure 1. The loss for Dstrong is calculated in the same manner as proposed in [Ren et al.(2015)Ren, He, Girshick, and Sun], where the losses for the region proposal network (RPN) and the RCNN-top layer are propagated separately. Images in the Dweak dataset can contribute to the classification loss in RCNN-top only when the RPN has proposed a correct region, so the loss for Dweak is given less influence until this condition is believed to be satisfied. After the first training session is finished, the Dactive dataset is crafted from Dweak by giving a prediction that is likely to contain a mass a single ground truth annotation. Images in Dactive are concatenated to Dstrong, reducing the data scarcity issue that the task originally conveyed. The experiments show that using Dweak images in a conservative manner helps the classifier detect more lesions, and training with Dactive shows an additional increase in overall performance. We believe that the proposed method can be adapted to general cases where strong annotations are insufficient to train the model classifier and weak labels are more readily available.
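To make the two-stream training concrete, the following is a minimal sketch of one training step, assuming a Faster-RCNN-style `model` that returns its standard losses for strongly annotated batches and exposes a classification loss against image-level labels; all names here (`alpha_schedule`, `weak_classification_loss`, the batch fields) are illustrative assumptions, not the authors' implementation.

```python
def train_step(model, strong_batch, weak_batch, optimizer, step, total_steps):
    optimizer.zero_grad()

    # D_strong: full Faster-RCNN losses (RPN objectness and regression,
    # RCNN-top classification and box regression), as in Ren et al. (2015).
    strong_losses = model(strong_batch["images"],
                          strong_batch["boxes"],
                          strong_batch["labels"])
    loss = sum(strong_losses.values())

    # D_weak: only the RCNN-top classification loss against the image-level
    # label, scaled by the controlled weight alpha described later.
    alpha = alpha_schedule(step, total_steps)  # grows from ~0.01 toward 1
    loss = loss + alpha * model.weak_classification_loss(
        weak_batch["images"], weak_batch["image_labels"])

    loss.backward()
    optimizer.step()
    return loss.detach()
```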

Datasets
The proposed method is evaluated on the Seoul National University Bundang Hospital Breast Ultrasound (SNUBH BUS) dataset for BUS images and further tested on the Stanford Dog dataset for general images. While the SNUBH BUS dataset has both Dstrong and Dweak images, the Stanford Dog dataset only contains Dstrong images. Thus, the Stanford Dog data are manually divided into Dstrong and Dweak, where only image labels are used for images selected as Dweak.
The SNUBH dataset, collected from the Seoul National University Bundang Hospital, was obtained from different ultrasound systems described in [Shin et al.(2018)Shin, Lee, Yun, Kim, and Lee], including Philips (ATL HDI 5000, iU22), SuperSonic Imagine (Aixplorer), and Samsung Medison (RS80A). The dataset contains a total of 5624 images from 2578 patients. The Dstrong subset comprises 1200 images, 600 of which are benign and the other 600 malignant. We use 400 images from each class as the training set and 200 as the test set. The Dweak subset comprises 4224 images, 3291 of which are benign and the remaining 933 malignant. All of the image labels are proven with biopsy results, which also means that these are cases where a biopsy was needed to diagnose the patient, making classification with BUS images an even more difficult task.
The Stanford Dog dataset is a collection of color images of 120 breeds of dogs, with a total of 20,580 images, all including class labels and bounding box coordinates. In order to mimic the situation in BUS images, we select two similar-looking, medium-sized breeds to classify, the Bloodhound and the English Foxhound, and convert the images to grayscale. The number of images in each class is 187 and 157, respectively. Each class is subdivided into 20 Dstrong training images and 60 test images, with the remaining 107 Bloodhound and 77 English Foxhound images assigned to the Dweak dataset. This setting enforces a situation where only a limited amount of strong annotations is available. The Stanford Dog dataset is tested to demonstrate the validity of the presented method on general images; the task is not straightforward, since the images are grayscale, leaving room for improvement. The dataset is available online (http://vision.stanford.edu/aditya86/ImageNetDogs/). A summary of the number of images in both datasets is provided in Table 1, where Mal., Ben., Blk., and Eng. denote malignant, benign, Bloodhound, and English Foxhound, respectively.

Training Procedure Using Dstrong Subset
The Faster-RCNN model is used for the object detection task, which is detecting lesions in BUS images. Faster-RCNN is a two-stage object detector, where an RPN is trained specifically to perform region proposals on feature maps. Regions of interest (RoIs) obtained from the RPN are then fed to the RCNN-top layer for classification and additional bounding box regression. Bounding box information is only given by images in the Dstrong subset; it is used for bounding box regression in both the RPN and RCNN-top, and for foreground-background classification in the RPN. The overall dataflow is shown in Figure 1.

Training Procedure Using Dweak Subset
Without bounding box annotations, neither bounding box regression nor foreground-background classification can be performed. Thus, images in the Dweak dataset can only aid the classification procedure in the RCNN-top section. We must have a strategy for assigning labels to the RoIs proposed by the RPN in order to use Dweak images. Although there is no complete way of figuring out the label of each RoI, it is known that, given an image label, there is at least one mass that should be labeled as the image label. We are able to infer the most probable RoI that should be labeled by rewriting the model with random variables. Let $X_g$ be the random variable that maps RoI $g$ to its ground truth label (background, benign, malignant), and let $G$ be the set of all RoIs in an image. The set $G$ is obtained as the output of the RPN. RoIs in $G$ are considered to contain distinct objects after non-maximum suppression (NMS) post-processing: NMS eliminates RoIs that overlap with an IoU over 0.5, keeping the RoI with the higher foreground score of the two. The ordering of the label values is defined as malignant > benign > background.
Thus, $Y = \max_{g \in G} X_g$ represents the label of an image, since a single malignant lesion would make the image label malignant, and a single benign lesion would make the image label benign if there are no other malignant lesions. Subsequently, the most probable RoI to be labeled given the image label $y$ can be written as follows:
$$g^* = \arg\max_{g \in G} P(X_g = y \mid Y = y).$$
Because $Y$ is the max of all RoI labels, conditioning the probability on $Y = y$ gives no information when the probability in question is that of $X_g$ having the same label. Thus, it is optimal to choose the RoI with the highest probability of containing the labeled object. Let $\hat{X}$ denote the mapping between a proposed RoI and the label predicted by the RCNN-top layer. Since $\hat{X}$ is trained directly by the cross entropy loss against $X$ when using the Dstrong dataset, $\hat{X}$ can be used as a surrogate for $X$ if suitably trained. Therefore, we take the RoI with the highest image-label score after running through RCNN-top as the training target in the RCNN-top section, and calculate the loss for a single Dweak image as follows:
$$L^{RCNN\text{-}top}_{cls} = \alpha \cdot CE\!\left(\hat{X}_{g^*},\, y\right), \qquad g^* = \arg\max_{g \in G} \hat{X}_g(y).$$
However, $\hat{X}$ would not be able to replace $X$ in the early stages of training. Hence, we introduce a controlled weight $\alpha$ for $L^{RCNN\text{-}top}_{cls}$. We increase $\alpha$ from 0.01 as training progresses, and the manner of this increase can vary. The weight $\alpha$ was selected among the following candidates: the black plot shows a log-like, inverse exponential increase in $\alpha$, which converges to 1 quickly, and the blue plot is a linear increase of $\alpha$.
Candidates (1), (2), and (3) are conservative, polynomial increases of $\alpha$ during the training phase, corresponding to the orange, red, and green plots, respectively; a sketch of such schedules is given below.
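The following sketch illustrates the candidate schedules and the Dweak loss described above. The exact polynomial orders of (1), (2), and (3) are not recoverable from the text, so the exponents here (2, 4, 8) are assumptions for illustration only, as is the label encoding.

```python
import math
import torch
import torch.nn.functional as F

def alpha_inverse_exponential(step, total_steps, rate=10.0):
    # Log-like increase that converges to 1 quickly (black plot).
    return 1.0 - (1.0 - 0.01) * math.exp(-rate * step / total_steps)

def alpha_linear(step, total_steps):
    # Linear increase from 0.01 to 1 (blue plot).
    return 0.01 + (1.0 - 0.01) * step / total_steps

def alpha_polynomial(step, total_steps, order=4):
    # Conservative polynomial increase (orange/red/green plots): alpha stays
    # low for most of training; higher orders rise sharply only near the end.
    return 0.01 + (1.0 - 0.01) * (step / total_steps) ** order

def weak_classification_loss(roi_class_logits, image_label):
    # roi_class_logits: (num_rois, 3) RCNN-top scores, assumed ordered as
    # (background, benign, malignant); image_label: 1 (benign) or 2 (malignant).
    # Choose g* = argmax_g X_hat_g(y) and train that RoI toward the image label.
    scores = F.softmax(roi_class_logits, dim=1)
    g_star = torch.argmax(scores[:, image_label])
    target = torch.tensor([image_label], device=roi_class_logits.device)
    return F.cross_entropy(roi_class_logits[g_star].unsqueeze(0), target)
```

The returned loss is then scaled by the current $\alpha$ before being added to the Dstrong losses, as in the training-step sketch given earlier.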

Dactive Construction with Dweak Test Results
Dactive is a dataset that we create from the Dweak dataset by adding annotations generated by the initial model after training is finished. The Dactive dataset can aid the Dstrong dataset, since images in Dstrong are assumed to be insufficient in this problem setting. Predicted bounding boxes and labels are not reliable in themselves, which requires cautious selection of the images to include. Verifying whether a predicted bounding box contains an object or not is the main issue. The double prediction problem can be turned into a benefit for solving this problem. A double prediction is the case when two different predictions are made for a single object, as seen in Figure 3. Double-predicted boxes are more likely to contain an object than other predicted boxes, since the region was predicted to contain a lesion twice. We can generate a strong annotation by selecting the correctly labeled box of the two predicted boxes; the image-level label is used to pick the correct bounding box among the two uncertain predictions.

Figure 3: Example of a double prediction case in breast ultrasound (BUS) images. The bounding box in blue represents the ground truth for a benign mass. Predicted boxes are colored orange and cyan for malignant and benign predictions, respectively.
All of the images in Dweak are tested with the trained model, generating multiple bounding boxes with labels for each image. We iterate through the boxes in an image to check whether there is a double prediction based on the PASCAL VOC criterion, which defines boxes as overlapping when their IoU is higher than 0.5. If multiple double-prediction pairs exist for an image, we choose the pair with the highest IoU. Once a pair is selected for an image, we annotate the image with the bounding box that holds the original image label; a sketch of this selection procedure is given below. Newly annotated images would carry a bias towards benign, since Dweak is biased. Thus, we only add malignant images to the Dactive dataset, both to compensate for this bias and because, in a medical imaging setting, a failure to detect a malignant lesion is critical. The newly generated Dactive dataset is used in the same manner as the Dstrong dataset, since its images can now produce the same types of losses.
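A minimal sketch of the double-prediction selection follows, assuming detections are given as (box, label) pairs per image; in the paper's setting, the images kept for Dactive would then be filtered to malignant ones only.

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def select_active_annotation(predictions, image_label):
    # predictions: list of (box, label) detections for one D_weak image.
    # Find the double-prediction pair with the highest IoU above the PASCAL
    # VOC threshold of 0.5, then keep the box whose label matches the image.
    best_pair, best_iou = None, 0.5
    for i in range(len(predictions)):
        for j in range(i + 1, len(predictions)):
            overlap = iou(predictions[i][0], predictions[j][0])
            if overlap > best_iou:
                best_pair, best_iou = (predictions[i], predictions[j]), overlap
    if best_pair is None:
        return None  # no double prediction; the image is not annotated
    for box, label in best_pair:
        if label == image_label:
            return box  # strong annotation: this box with the image label
    return None  # neither box holds the image label
```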

Faster-RCNN Hyperparameters and Model Details
We use the PASCAL VOC pre-trained VGG-16 [Simonyan and Zisserman(2014)] as the backbone for generating feature maps, only fine-tuning the layers higher than conv3_1, which is the method used by the original Faster-RCNN [Ren et al.(2015)Ren, He, Girshick, and Sun]. The RPN's regression and classification networks were modified to use 3×3 convolutions instead of 1×1 for better detection of objects. We reduced the size of the fully connected layers in the RCNN-top to 2048 units to prevent overfitting. The Dstrong dataset was augmented by horizontal flipping, which increases the number of images, and by random brightness and contrast adjustments, which preserve the number of images. Steps are used to track training progress, since epochs cannot be calculated when using two datasets of different sizes; one step corresponds to using a single batch from each dataset. The Adam optimizer was used for optimization, with a batch size of 1 for each dataset. Negative sampling for background RoIs was performed when training with Dweak images, since choosing only the RoI matching the image label would make the distribution of RoIs unbalanced; the lowest-scoring box was labeled as background for the RCNN-top loss calculation. Class weights were also applied to the Dweak losses, since the dataset is biased towards benign. All of the details and code of the model will be available online (https://github.com/YeolJ00/faster-rcnn-pytorch) for research purposes.
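For illustration, the two modified heads might look as follows. This is a sketch under assumptions (512-channel VGG-16 conv5 features, 9 anchors, 7×7 RoI pooling, three classes), with illustrative layer names; it is not the authors' code.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # 3x3 convolutions instead of the original 1x1 score/regression layers.
        self.cls_score = nn.Conv2d(512, num_anchors * 2, kernel_size=3, padding=1)
        self.bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, feat):
        feat = self.conv(feat).relu()
        return self.cls_score(feat), self.bbox_pred(feat)

class RCNNTop(nn.Module):
    def __init__(self, roi_feat_dim=512 * 7 * 7, num_classes=3):
        super().__init__()
        # Fully connected layers reduced to 2048 units to prevent overfitting.
        self.fc = nn.Sequential(
            nn.Linear(roi_feat_dim, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
        )
        self.cls_score = nn.Linear(2048, num_classes)
        self.bbox_pred = nn.Linear(2048, num_classes * 4)

    def forward(self, roi_feats):
        x = self.fc(roi_feats.flatten(1))
        return self.cls_score(x), self.bbox_pred(x)
```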

Evaluation Specifications
In this study, the model generates multiple bounding boxes for an image. Each detection is considered a true positive (TP) if the classified label of the detection matches the target ground truth (GT) class and the IoU between the predicted bounding box and the target GT box is higher than 0.5; otherwise, it is regarded as a false positive (FP). We evaluate the performance of the model on the test images of the SNUBH and Stanford Dog datasets using measures such as correct localization (CorLoc) and the fraction of lesions detected.
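The TP criterion and CorLoc computation can be sketched as follows, reusing the same IoU helper as before; the data layout (one labeled GT box per image, per the text) is an assumption of this sketch.

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def corloc(per_image_predictions, ground_truths):
    # CorLoc: fraction of images that contain at least one true positive,
    # i.e., a detection matching the GT class with IoU > 0.5.
    # per_image_predictions: list (per image) of (box, label) detections.
    # ground_truths: list of (box, label), one annotated mass per image.
    correct = 0
    for preds, (gt_box, gt_label) in zip(per_image_predictions, ground_truths):
        if any(label == gt_label and iou(box, gt_box) > 0.5
               for box, label in preds):
            correct += 1
    return correct / len(ground_truths)
```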
CorLoc is defined as the ratio of correctly classified and localized images; a correctly classified image is one that contains a TP detection among its predicted boxes. Although mean average precision is widely used for general deep learning models, CorLoc is more applicable to the BUS case, since detecting a positive mass is critical in medical imaging. Additionally, only a single mass in an image is labeled as GT, while there could be other possible unlabeled masses, so FP detections might actually contain masses. The fraction of lesions detected measures localization performance and is obtained as the ratio of images that have a bounding box overlapping their GT box.

Table 2 presents the quantitative results of the experiments. The experiments are conducted for a total of 160,000 training steps, and all of the hyperparameters except α are applied equally. We found that the model does not perform well when α is a constant value or is increased with an inverse exponential function. We believe that the value was too high in the early stages of training: the loss from Dweak was not penalized enough before the RPN was trained well enough to provide valid RoI proposals, which gives an incorrect loss to the classifier. Based on this idea, we compared more conservative functions for increasing α. All of the subsequent methods demonstrate an improvement in both CorLoc and the fraction of lesions detected. Performance tends to increase as α is kept low during most of the training phase, and the model exhibited the best result when α followed (2): a 24% point CorLoc increase and a 20% point increase in the fraction of lesions detected compared to the model without the controlled weight. A slight loss of performance was shown when α follows (3). We believe this is due to the drastically increasing α near the end of the 160,000-step schedule, making the loss increase faster than the optimization can compensate; additionally, the weakly annotated data were fully used for only a small number of steps in (3). Qualitative results for controlling α are shown in Figure 4. The proposed schedule for α shows both solid localization of objects and classification of bounding box proposals. Figure 4 also shows a false positive detection for the proposed method, yet this false positive has a relatively low malignancy score compared to the method following (3).

Experiments for Active Learning on SNUBH Dataset
Quantitative results for the active learning experiment are shown in Table 3; CorLoc and the fraction of lesions detected before and after active learning are presented. Dactive, constructed from the model trained with the proposed α weight (2), consists of 238 malignant images. Active learning aims to extend the Dstrong dataset, which is the primary dataset that trains the model. Performing active learning gives a 2.75% increase in the CorLoc measure and a 3.75% increase in the fraction of lesions detected; both classification and localization performance increased.

Figure 5 presents the qualitative results. Some masses that were difficult to detect or classify were given correct predictions after training with Dactive, enhancing both localization and classification performance.

Figure 5: Qualitative results before and after active learning. Bounding boxes colored red/blue are ground truth boxes for malignant/benign masses. Bounding boxes colored orange/cyan are predictions for malignant/benign masses. Boxes on the left are the results before active learning, and the right side shows the same predictions made for the images after active learning.

The Faster-RCNN based model in [Shin et al.(2018)Shin, Lee, Yun, Kim, and Lee] uses weakly annotated images jointly with strong bounding box annotations. Thus, we were able to reconstruct the model and train it with the SNUBH dataset. Implementations of the models are provided online (https://github.com/YeolJ00/faster-rcnn-pytorch). Table 4 shows the results.

Experiments on the Stanford Dog Dataset
Experiments for the controlled weight and active learning were performed with the Stanford Dog dataset.
The results for controlling α and active learning are summarized in Tables 5 and 6, respectively. Little increase in CorLoc was shown for the proposed α control method. We believe that the negligible performance increase is due to the large proportion of each image occupied by the bounding box, which enables the RPN to propose correct bounding boxes at an earlier stage of training, meaning that the loss is less likely to lead the model to a local minimum. Active learning added 23 images to the strongly annotated dataset: 10 Bloodhound boxes and 13 English Foxhound boxes. We included images from both classes, since this is not a medical imaging task where detecting a certain class is preferred. Performing active learning on the trained model shows a slight decrease in the CorLoc measure, which is a measure that ignores FP predictions. However, the widely used measure of performance for object detection tasks is mAP, which increased by 17.46% points after active learning. The increase in strong annotations reduced false positive predictions, significantly increasing the precision of the model. Model performance otherwise does not vary much, due to the generally high baseline performance. Prediction result samples can be viewed in Figure 6. CorLoc, the fraction of lesions detected, and mean average precision (mAP) before and after active learning are presented.

Conclusions and Discussion
We propose an applicable mechanism for utilizing weakly annotated images in object detection models, in a setting where bounding box information is insufficient for achieving high classification performance. The proposed method enables a successful increase in the amount of strong annotations by safely assigning bounding box predictions as ground truth. The method is applied to the primary task of detecting masses in BUS images and tested on the Stanford Dog dataset to verify its generality. A comparison with different variants of the method supports the reasoning behind the manner of controlling the influence of weakly annotated images. We observe that maintaining the loss from weakly annotated images at a low level, until the RPN proposes bounding boxes containing objects, guides the model to a higher classification capability. Additionally, we set specific configurations for the active learning scheme, which can be risky, since there is no way to confirm that GT bounding boxes are assigned correctly. The results show that it can enhance classification performance where that was an issue. For future work, we plan to extend the proposed method to autonomously detect whether the RPN is proposing bounding boxes containing objects and to control the weight accordingly, rather than increasing it on a fixed schedule. This will increase the generality of the method, since the point of RPN convergence may vary depending on the size and detection difficulty of a dataset. We believe that the proposed method can be applied to typical medical imaging tasks where strong annotations are costly and weakly labeled data are relatively easy to obtain from the diagnosis procedure.