Using Sparse Patch Annotation for Tumor Segmentation in Histopathological Images

Tumor segmentation is a fundamental task in histopathological image analysis. Creating accurate pixel-wise annotations for such segmentation tasks in a fully-supervised training framework requires significant effort. To reduce the burden of manual annotation, we propose a novel weakly supervised segmentation framework based on sparse patch annotation, i.e., only a small portion of the patches in an image is labeled as 'tumor' or 'normal'. The framework consists of a patch-wise segmentation model called PSeger and an innovative semi-supervised algorithm. PSeger has two branches for patch classification and image classification, respectively. This two-branch structure enables the model to learn more general features and thus reduces the risk of overfitting when learning from sparsely annotated data. We incorporate the ideas of consistency learning and self-training into the semi-supervised training strategy to take advantage of the unlabeled images. Trained on the BCSS dataset with only 25% of the images labeled (five patches for each labeled image), our proposed method achieved competitive performance compared to fully supervised pixel-wise segmentation models. Experiments demonstrate that the proposed solution has the potential to reduce the burden of labeling histopathological images.


Introduction
Deep learning has developed rapidly and made remarkable progress in pathological image analysis in recent years [1][2][3][4][5][6][7]. Its application to pathological diagnosis and prognosis is unimaginable without high-quality annotations. However, acquiring precise annotations is difficult, since it requires knowledge of pathology and is time-consuming and labor-intensive, particularly for segmentation tasks that involve manually outlining specific structures.
Unfortunately, experts with extensive pathological knowledge, who are the source of high-quality, clean annotations of key clinical data, are scarce and have limited time to spend on data labeling. Therefore, deep-learning methods based on sparsely annotated labels are critical to reducing their labeling workload and advancing the application of deep learning in the field of pathology. Tumor segmentation has been one of the most fundamental tasks in digital pathology for accurate diagnosis.
Image-wise segmentation models are trained by assigning a binary label ('tumor' or 'normal') to each image in the training set. However, the performance of an image-wise segmentation model is limited by the insufficiency of the labeling information. Since a mere binary label 'tumor' cannot reflect the location and proportion of the tumor, assigning the same label 'tumor' to different images as long as they contain any tumor may confuse the network training and lead to inaccurate segmentation results, which is unacceptable, particularly for small tumors.
In contrast, a pixel-wise segmentation model can produce more accurate segmentation results. However, pathologists must annotate the tumor regions as masks to train the model, which takes much more time and energy. More importantly, unlike other medical images, such as MRI and CT images, pathology images usually lack a clear distinction between the normal and tumor areas [19], which imposes additional difficulties for labeling.
To compensate for the shortcomings of the above two methods, we propose the concept of the patch-level label. Note that, in our proposed method, a patch refers to a grid cell of an image, which is different from the definition in other articles [11,18]. Suppose we divide an image of 224 × 224 pixels into a 14 × 14 grid; then the patch size is 16 × 16 pixels. For each image in the training set, pathologists only need to annotate several (usually 5-10) patches as the label, significantly saving the annotation cost. The left of Figure 1 shows different types of labels. We designed a patch-wise segmentation model called PSeger to accommodate this new label. It has two branches for image classification and patch classification, respectively. The image classification is an auxiliary task that helps improve the performance of the patch classification branch. Due to the superior performance Transformer-based networks [20] have achieved in recent years, we select Swin Transformer [21], a representative of them, as the backbone of the model. Moreover, this method can be easily extended to other backbones.
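As an illustration of this labeling scheme, the following Python sketch (our own example, not part of the released annotation tool) shows how a sparse patch annotation for a 224 × 224 image could be encoded as a 14 × 14 patch-label map, with unlabeled patches marked as ignore:

```python
import numpy as np

# Minimal sketch: representing a sparse patch annotation for a 224 x 224 image
# divided into a 14 x 14 grid of 16 x 16 patches.
IMAGE_SIZE, PATCH_SIZE = 224, 16
GRID = IMAGE_SIZE // PATCH_SIZE  # 14

# Hypothetical annotation: the pathologist labels only a handful of patches.
# Keys are (row, col) patch coordinates; values are class names.
sparse_annotation = {(2, 3): "tumor", (2, 4): "tumor", (5, 8): "tumor",
                     (10, 11): "normal", (11, 11): "normal"}

# Build a patch-label map: -1 = unlabeled (ignored by the loss), 0 = normal, 1 = tumor.
CLASS_TO_ID = {"normal": 0, "tumor": 1}
patch_labels = np.full((GRID, GRID), -1, dtype=np.int64)
for (r, c), name in sparse_annotation.items():
    patch_labels[r, c] = CLASS_TO_ID[name]

# The image-level label is 'tumor' if any labeled patch is tumorous.
image_label = int((patch_labels == CLASS_TO_ID["tumor"]).any())
```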
To take advantage of the unlabeled data, we trained our PSeger with an innovative semi-supervised algorithm. The algorithm is developed based on the characteristics of the patch-level label, integrating the ideas of consistency learning [22] and self-training [23]. The contributions of this paper are summarized as follows:
• We proposed the concept of sparse patch annotation for tumor segmentation, which can significantly reduce the annotation burden. To achieve this new way of labeling, we developed an annotation tool (Figure 1, right).
• In order to handle this new label, we created a patch-wise segmentation model called PSeger, equipped with an innovative semi-supervised algorithm to make full use of the unlabeled data.
• We comprehensively evaluated our proposed method on two datasets. The experimental results showed that, when trained with only 25% labeled data (five patches for each labeled image), our approach can yield a competitive result compared to pixel-wise segmentation models trained using 100% labeled data. The ablation study showed the effectiveness of the semi-supervised algorithm.

Weakly-Supervised Learning
Pixel-level labels require a considerable amount of time and effort, and the frequently occurring manual errors may give the network the wrong guidance. Weakly-supervised learning (WSL) has recently emerged as a paradigm to relieve the burden of dense pixel-wise annotations [24]. Many WSL techniques have been proposed, including global image-level labels [25,26], scribbles [19,27], points [28,29], bounding boxes [30,31], and global image statistics, such as the target-region size [32,33].
Although these weakly supervised methods have achieved good performance in natural and medical image segmentation, most existing weak annotations are not necessarily well suited to tumor segmentation. As mentioned above, the image-level label cannot reflect the location and proportion of the tumor, which may result in inaccurate segmentation results. Other label types are more suitable for segmentation tasks in which the instances have clear boundaries, such as glands and nuclei. Nevertheless, the boundary between the normal and the tumor area in pathology images is usually fuzzy and ambiguous. Unlike existing weak annotations, we propose patch-level annotation for patch-wise tumor segmentation.

Multi-Task Learning
Multi-task learning is an emerging field in machine learning that seeks to improve the performance of multiple related tasks by leveraging useful information among them [34]. A deep-learning model for multi-task learning usually consists of a feature extractor shared by all the tasks and multiple branches for each task. In recent years, multi-task learning has been widely exploited in the field of pathological image analysis [18,35,36]. For example, Wang et al. [18] proposed a hybrid model for pixel-wise HCC segmentation of H&E-stained WSIs.
The model had three subnetworks sharing the same encoder, corresponding to three associated tasks. Guo et al. [37] employed a classification model to filter images containing tumorous regions and subsequently refined the segmentation results by a pixel-wise segmentation model. Inspired by these seminal works, we adopted a two-branch model, one branch for image classification and another for patch segmentation, to learn more general features and thus reduce the risk of overfitting.

Semi-Supervised Learning
Semi-supervised learning (SSL) is a combination of both supervised and unsupervised learning methods, in which the network is trained with a small amount of labeled data and a large amount of unlabeled data. SSL methods can make full use of the information provided by unlabeled data, thereby improving the model performance. In recent years, SSL methods have been widely used in the computer vision field [38][39][40][41][42][43].
There are two common SSL strategies: consistency learning [22] and self-training [23]. The general idea of consistency learning is that the model prediction should remain constant under different perturbations of the input. This method allows various perturbations to be designed depending on the characteristics of the data and the network. For instance, Xu et al. [40] proposed two novel data augmentation mechanisms and incorporated them into the consistency learning framework for prostate ultrasound segmentation.
Another strategy, self-training, can be broadly divided into four steps. First, train a teacher model using labeled data. Second, use a trained teacher model to generate pseudo labels for unlabeled images. Third, learn an equal-or-larger student model on labeled and unlabeled images. Finally, use the student as a teacher and repeat the above procedures several times. Wang et al. [41] proposed a few-shot learning framework by combining ideas of semi-supervised learning and self-training. They first adopted a teacher-student model in the initial semi-supervised learning stage and obtained pseudo labels for unlabeled data. Then, they designed a self-training method to update pseudo labels and the segmentation model by alternating downsampling and cropping strategies.
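As a minimal, generic illustration of these four steps (using scikit-learn on synthetic data for brevity; this is not the training procedure used in this paper), a self-training loop can be sketched as follows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic toy data: 1000 samples, of which only 50 are labeled.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
labeled = rng.choice(1000, size=50, replace=False)
unlabeled = np.setdiff1d(np.arange(1000), labeled)

X_l, y_l = X[labeled], y[labeled]
X_u = X[unlabeled]

for round_idx in range(3):                              # Step 4: repeat several times
    teacher = LogisticRegression().fit(X_l, y_l)        # Step 1: train a teacher on labeled data
    probs = teacher.predict_proba(X_u)                   # Step 2: pseudo-label unlabeled data
    confident = probs.max(axis=1) > 0.9                  # keep only confident predictions
    pseudo_y = probs.argmax(axis=1)[confident]
    # Step 3: the student (here simply the next-round model) learns from labeled + pseudo-labeled data
    X_l = np.concatenate([X_l, X_u[confident]])
    y_l = np.concatenate([y_l, pseudo_y])
    X_u = X_u[~confident]
```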

Materials and Methods
Here, we propose a novel patch-wise segmentation model called PSeger. Equipped with an innovative semi-supervised algorithm, it can learn from the patch-level label and take advantage of the unlabeled data. Figure 2 gives an overview of the training procedure, which involves three steps: (1) basic training; (2) pseudo label generation; and (3) consistency learning. They are described in detail in the following, along with the two datasets we used.

Basic Training
Since the idea of the patch-level label is inspired by Vision Transformer (ViT) [20], we take it as the backbone of PSeger to illustrate the process of basic training. An overview of the model is depicted in Figure 3; it consists of an embedding projection module, a sequence of transformer encoder blocks, and two classifiers for image classification and patch classification, respectively. In the process of forward propagation, an input image x ∈ R^{H×W×N_C} (H, W, and N_C represent the height, width, and number of channels of x, respectively) is first split into M = HW/P² non-overlapping patches of size P × P pixels. Then, a 2-D convolution operation is employed to obtain the patch embeddings, supplemented with position encoding:

z_0 = [P_E(x^1); P_E(x^2); …; P_E(x^M)] + P_E^{pos},

where z_0 ∈ R^{M×L} (L represents the embedding length) is the input of the first transformer encoder block, x^k ∈ R^{P×P×N_C} is the kth patch, P_E is the embedding projection, and P_E^{pos} is the position encoding. Then, the embeddings are processed by the transformer encoder blocks. Each block includes a multi-head self-attention (MSA) [44] module and a multilayer perceptron (MLP) module, both operating as residual operators with layer normalization (LN) [45]. The output of the lth transformer encoder block can be described as follows:

z'_l = MSA(LN(z_{l−1})) + z_{l−1},
z_l = MLP(LN(z'_l)) + z'_l,

where z_L is the final output of the transformer encoder. Each element z_L^k ∈ z_L of the output contains contextual features due to the attention mechanism, which makes it possible to classify a patch based on the information of the related patches. We adopt an MLP head H_patch for patch classification. To this end, z_L processed by an LN is sent to H_patch before applying a softmax function to obtain predictions for each patch:

ŷ = softmax(H_patch(LN(z_L))),

where ŷ ∈ R^{M×C} are the patch predictions, and C is the number of categories.
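To make the two-branch structure concrete, the following PyTorch sketch (our own simplification, not the released implementation) uses a generic transformer encoder in place of the actual ViT/Swin backbone and attaches a patch head and an image head to the encoder output:

```python
import torch
import torch.nn as nn

class PSegerSketch(nn.Module):
    """Minimal sketch of a two-branch patch-wise segmentation model (not the authors' code)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=384, depth=6, num_heads=6, num_classes=2):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # M = HW / P^2
        # Patch embedding: a 2-D convolution with stride = patch size.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable position encoding P_E^pos.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        # Stand-in transformer encoder (the paper uses ViT or Swin Transformer backbones).
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.patch_head = nn.Linear(embed_dim, num_classes)         # H_patch
        self.image_head = nn.Linear(embed_dim, num_classes)         # H_image (auxiliary branch)

    def forward(self, x):
        z = self.proj(x).flatten(2).transpose(1, 2)                 # (B, M, L) patch embeddings
        z = z + self.pos_embed
        z = self.encoder(z)                                         # z_L
        z = self.norm(z)
        patch_logits = self.patch_head(z)                           # (B, M, C): one prediction per patch
        image_logits = self.image_head(z.mean(dim=1))               # (B, C): from averaged tokens
        return patch_logits, image_logits

# Example: a 224x224 RGB image yields 14x14 = 196 patch predictions plus one image prediction.
model = PSegerSketch()
patch_logits, image_logits = model(torch.randn(1, 3, 224, 224))
```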
In addition to the patch classifier, we introduce an auxiliary image classifier H_image to the network, which determines whether an input image contains a tumor or not. The main motivation for using the image classifier is to help the patch classifier achieve better performance, since in multi-task learning the network tends to find more representative features shared by different tasks [18]. Similar to the patch classifier, the image classifier receives the average of the Lth transformer encoder output z_L ∈ R^{M×L} with an LN, and produces the classification result ŷ_img ∈ R^C through a softmax function:

ŷ_img = softmax(H_image(avg(LN(z_L)))).

The loss function for the basic training is defined as:

L_sup = α L_img + (1 − α) L_patch,

where L_img and L_patch are the losses for the image classification task and the patch classification task, respectively, and α is a weighting factor balancing the two losses. Both L_img and L_patch are cross-entropy loss functions; however, L_patch only considers the annotated patches. Specifically, L_patch is defined as:

L_patch = − (1/K) Σ_{k=1}^{K} Σ_{c=1}^{C} y_{(k,c)} log ŷ_{(k,c)},

where K is the number of labeled patches in the sample x, C is the number of classes, y_{(k,c)} is the binary indicator (0 or 1) of whether class c is the correct classification for the kth labeled patch, and ŷ_{(k,c)} is the prediction of the kth labeled patch for the cth class.
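A sketch of this supervised loss is given below (our own illustration; the exact weighting between the two terms in the paper may differ, so alpha here is illustrative). Unlabeled patches are marked with -1 and excluded from the patch loss:

```python
import torch
import torch.nn.functional as F

def supervised_loss(patch_logits, image_logits, patch_labels, image_labels, alpha=0.5):
    """Sketch of the basic-training loss: image CE plus patch CE over labeled patches only.

    patch_labels uses -1 for unlabeled patches, which are excluded from the loss.
    """
    loss_img = F.cross_entropy(image_logits, image_labels)
    # Flatten (B, M, C) -> (B*M, C) and mask out unlabeled patches via ignore_index.
    B, M, C = patch_logits.shape
    flat_logits = patch_logits.reshape(B * M, C)
    flat_labels = patch_labels.reshape(B * M)
    loss_patch = F.cross_entropy(flat_logits, flat_labels, ignore_index=-1)
    return alpha * loss_img + (1 - alpha) * loss_patch
```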

Pseudo Label Generation
After the basic training process, the model with the best patch classification accuracy on the validation set is used to generate pseudo labels for the samples in the unlabeled data X_U, as depicted in Figure 4. The trained model receives as input an image x_i ∈ X_U and infers the image prediction ŷ_{i,img} and patch predictions ŷ_i, which are subsequently transformed into the image probability p_{i,img} and patch probabilities p_i by the softmax function. The latter are then ranked by their dominant values. We move x_i from X_U to X_L along with its pseudo label if p_{i,img} and the ranked p_i (denoted as r(p_i)) exceed the confidence thresholds τ_1 and τ_2, respectively, and the dominant class of the patch predictions agrees with the image prediction, which means the patch predictions should remain consistent with the image prediction.
In preliminary experiments on small-scale data, we found that the image prediction confidence scores were high (usually above 0.9), whereas the patch prediction confidence scores were relatively low (usually below 0.7). Therefore, we empirically set τ_1 to 0.8 and τ_2 to 0.6.
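A pseudo-label selection step in this spirit could be sketched as follows (the exact acceptance rule in the paper may differ; the consistency check shown here is one plausible interpretation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, tau1=0.8, tau2=0.6):
    """Sketch: accept an unlabeled image if the image-level confidence exceeds tau1,
    some patch-level confidences exceed tau2, and the patch predictions agree with
    the image prediction."""
    accepted = []
    model.eval()
    for images in unlabeled_loader:
        patch_logits, image_logits = model(images)
        p_img = F.softmax(image_logits, dim=-1)                  # (B, C)
        p_patch = F.softmax(patch_logits, dim=-1)                # (B, M, C)
        img_conf, img_cls = p_img.max(dim=-1)                    # image confidence and class
        patch_conf, patch_cls = p_patch.max(dim=-1)              # per-patch confidence and class
        for i in range(images.size(0)):
            confident_patches = patch_conf[i] > tau2
            # One plausible consistency check: the image-predicted class must appear
            # among the confident patch predictions (tumor image -> at least one tumor patch).
            consistent = (patch_cls[i][confident_patches] == img_cls[i]).any()
            if img_conf[i] > tau1 and confident_patches.any() and consistent:
                accepted.append((images[i], patch_cls[i], img_cls[i]))
    return accepted
```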

Consistency Learning
When the step of pseudo label generation is finished, the model is retrained on the updated training set X_L. The details are as follows. First, an input image x ∈ X_L is transformed into aug_x and aug_x' by two independent data augmentation operations. Then, the student model and the teacher model take them as input and output two sets of patch predictions ŷ and ŷ', respectively. These two sets should remain consistent based on the smoothness assumption in semi-supervised learning [46]. Therefore, we apply the KL divergence consistency loss between ŷ and ŷ':

L_con = (1/M) Σ_{m=1}^{M} Σ_{c=1}^{C} ŷ'_{(m,c)} log( ŷ'_{(m,c)} / ŷ_{(m,c)} ),

where M is the number of patches in the sample x, C is the number of categories, and ŷ_{(m,c)} and ŷ'_{(m,c)} are the predictions of the mth patch for the cth category by the student and the teacher, respectively. Thus, the total loss function can be written as

L_total = L_sup + λ(E) L_con,

where L_sup is previously defined in Equation (6), and λ(E) is a function of the training epoch index E, which helps control the balance between the supervised loss and the consistency loss.
As is the case with other consistency learning methods [40,47], we use a Gaussian ramp-up function as λ(E):

λ(E) = λ_max · exp(−5 (1 − E/E_max)²),  E ≤ E_max,

where E is the epoch index. When E = E_max, λ reaches the maximum weight λ_max for the consistency loss. We empirically set λ_max to 1 and E_max to 20 epochs. For the student model, the parameters θ are updated through the back-propagation algorithm by minimizing L_total. For the teacher model, the parameters θ' are initially set to θ_0 and updated by computing the exponential moving average of θ:

θ'_t = α θ'_{t−1} + (1 − α) θ_t,

where t represents the index of the global training step. α helps control the speed at which the teacher model parameters θ' are updated, and we empirically set it to 0.99.
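The consistency-learning components described above can be sketched as follows (our own simplification; the Gaussian ramp-up is written in its standard form, which we assume here):

```python
import copy
import math
import torch
import torch.nn.functional as F

def rampup_weight(epoch, lambda_max=1.0, e_max=20):
    """Gaussian ramp-up for the consistency-loss weight lambda(E)."""
    t = min(epoch, e_max) / e_max
    return lambda_max * math.exp(-5.0 * (1.0 - t) ** 2)

def consistency_loss(student_patch_logits, teacher_patch_logits):
    """KL divergence between teacher (target) and student patch predictions, averaged over patches."""
    B, M, C = student_patch_logits.shape
    student_log_p = F.log_softmax(student_patch_logits, dim=-1).reshape(B * M, C)
    teacher_p = F.softmax(teacher_patch_logits, dim=-1).detach().reshape(B * M, C)
    return F.kl_div(student_log_p, teacher_p, reduction="batchmean")

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)

# Usage sketch: the teacher starts as a copy of the student and is never back-propagated.
# student = PSegerSketch(); teacher = copy.deepcopy(student)
# loss = supervised_loss(...) + rampup_weight(epoch) * consistency_loss(stu_logits, tea_logits)
# loss.backward(); optimizer.step(); ema_update(teacher, student)
```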

Datasets
We evaluated our proposed method on the public BCSS dataset [48] and an in-house dataset. The BCSS dataset includes 151 hematoxylin and eosin-stained images corresponding to 151 histologically confirmed breast cancer cases. The mean image size is 1.18 mm² (SD = 0.80 mm²). We followed the train-test splitting rule (https://bcsegmentation.grand-challenge.org/Baseline/ (accessed on 1 June 2022)) that the images from the following tissue source sites were used as an unseen testing set to report accuracy: OL, LL, E2, EW, GM, and S3 (for details on these abbreviations, see https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/ (accessed on 1 June 2022)). The remaining 108 images were cropped into 27,207 smaller images (with the size of 224 × 224 pixels). We used 1018 of these smaller images for validation, and the remainder were used for training.
The in-house dataset came from the Department of Pathology, the First Affiliated Hospital of Sun Yat-sen University, China. This study was approved by the Ethics Committee of the First Affiliated Hospital of Sun Yat-sen University, and data collection was performed in accordance with relevant guidelines and regulations. The dataset contains 28,187 images from 111 cases (WSIs). We used the images of 84 cases for training and validation, and the images from the remaining cases for testing. For the training set, 292 images were from non-tumor regions and labeled as 'normal'.
A total of 24,971 images were from tumor regions, but many of them did not contain any tumor cells. We selected 407 of these images and labeled 10 patches for each image using our self-developed annotation tool. Among these labeled images, if an image contains any tumor cells, then at least one patch is labeled as 'tumor', and the image label is 'tumor' as well. Details about the BCSS dataset and the in-house dataset are shown in Tables 1 and 2, respectively.

In the training step, we employed the AdamW optimizer [49] with a base learning rate of 5 × 10⁻⁴. For the learning rate schedule, we adopted a linear warmup for five epochs (the warmup learning rate was 5 × 10⁻⁷), followed by cosine annealing for 20 epochs. The batch size was 16, and the backbones used for PSeger were pre-trained on ImageNet. All experiments were conducted on an RTX 3090; a configuration sketch of this setup is given after the strategy list below. There are five training strategies for PSeger:
- Baseline: train the model on the labeled data only.
- Baseline+CL: train the model on the labeled data with consistency learning.
- Baseline+CL with X_u: train the model with consistency learning on both the labeled and the unlabeled data.
- Baseline+ST with X_u: first train the model on the labeled data, then use the trained model to infer the pseudo labels of the unlabeled data, and finally retrain the model on both the labeled data and pseudo-labeled data.
- Baseline+ST+CL with X_u: first train the model on the labeled data, then use the trained model to infer the pseudo labels of the unlabeled data, and finally retrain the model on both the labeled data and pseudo-labeled data with consistency learning.
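The optimization setup described above could be configured as in the following sketch (the scheduler classes are our choice, not necessarily those used by the authors; PSegerSketch is the hypothetical model from the earlier sketch):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = PSegerSketch()
optimizer = AdamW(model.parameters(), lr=5e-4)

warmup_epochs, cosine_epochs = 5, 20
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linear warmup from 5e-7 up to the base learning rate of 5e-4.
        LinearLR(optimizer, start_factor=5e-7 / 5e-4, end_factor=1.0, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=cosine_epochs),
    ],
    milestones=[warmup_epochs],
)

# for epoch in range(warmup_epochs + cosine_epochs):
#     train_one_epoch(model, optimizer)   # hypothetical training loop
#     scheduler.step()
```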

Evaluation Metrics
In the experiment comparing with segmentation models, we choose Intersection over Union (IoU) as the evaluation indicator, which is calculated as follows:

IoU = |A ∩ B| / |A ∪ B|,

where A and B are the predicted tumor area and the ground truth, respectively. The final IoU score is obtained by averaging the IoU over all RoIs in the BCSS test set.
In the ablation study, since our in-house dataset has no pixel-wise annotations, we select patch-level and image-level Acc, AUC, and F1 as evaluation indicators. The AUC (Area Under the Curve) score is the area under the Receiver Operating Characteristic (ROC) curve. Acc and F1 are calculated as follows:

Acc = (TP + TN) / (TP + TN + FP + FN),
F1 = 2TP / (2TP + FP + FN),

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. The final score of each evaluation indicator is calculated by averaging the score over all images in the BCSS or in-house test set.
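For reference, these metrics correspond to the following straightforward computations (a minimal NumPy sketch of the standard formulas, not the authors' evaluation code):

```python
import numpy as np

def iou_score(pred_mask, gt_mask):
    """IoU between a predicted tumor mask and the ground truth (both boolean arrays)."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 1.0

def acc_f1(pred, target):
    """Accuracy and F1 from binary predictions and targets (1 = tumor, 0 = normal)."""
    tp = np.sum((pred == 1) & (target == 1))
    tn = np.sum((pred == 0) & (target == 0))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    return acc, f1
```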

Comparison with Segmentation Models
We compared our proposed method to a variety of segmentation models on the BCSS dataset (Figure 5). We trained PSeger with two strategies: Baseline and Baseline+ST+CL with X_u. Five patches were labeled for each image in the labeled training set, and the ratios of labeled training data ranged from 1% to 25%. For comparison, we chose two segmentation architectures, DeepLabv3+ [50] and Unet++ [51], and equipped each with six backbones: ResNet18, ResNet34, ResNet50 [52], EfficientNet-B1, EfficientNet-B3 [53], and RegNetX-1.6GF [54]. Therefore, 12 segmentation models were trained and tested on the BCSS dataset. These segmentation models and the training and test steps were implemented based on SegmentationModels [55]. By comparing the two graphs in Figure 5, we can see that when the proportion of labeled training data reaches 25%, our proposed method can achieve 80.31 ± 0.23% IoU on the test set, comparable with the third-best model (DeepLabv3plus+EfficientNet-b1: IoU = 80.31 ± 0.95%) out of the 12 segmentation models.
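The comparison models could be instantiated with the segmentation_models_pytorch library roughly as follows (a sketch; the encoder names follow smp's naming conventions and the exact training configuration is not shown):

```python
import segmentation_models_pytorch as smp

architectures = {"deeplabv3plus": smp.DeepLabV3Plus, "unetplusplus": smp.UnetPlusPlus}
# Encoder names follow smp's naming; "timm-regnetx_016" corresponds to RegNetX-1.6GF.
encoders = ["resnet18", "resnet34", "resnet50",
            "efficientnet-b1", "efficientnet-b3", "timm-regnetx_016"]

# Build the 2 x 6 = 12 ImageNet-pretrained segmentation models used in the comparison.
models = {
    f"{arch_name}_{encoder}": arch(encoder_name=encoder, encoder_weights="imagenet", classes=2)
    for arch_name, arch in architectures.items()
    for encoder in encoders
}
```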

Visualization of Segmentation Results
To further compare our proposed method with the pixel-wise segmentation method, we selected one of the best-performing PSeger models (trained by Baseline+ST+CL with X_u with 25% of the training images labeled, IoU = 80.65%) and compared it with the best-performing segmentation model (Unetplusplus+EfficientNet-b3, IoU = 81.74%), as shown in Figure 6.
In general, the performance of PSeger is comparable to that of Unetplusplus+EfficientNet-b3. The largest prediction differences arose in case 1 and case 4. In case 1, PSeger performed worse because of more false detections in non-tumorous areas; in case 4, Unetplusplus+EfficientNet-b3 performed poorly because of more false-positive regions and many more missed detections in tumorous areas.
In addition, Figures 7 and 8 display some segmentation results on our in-house dataset. Red and green overlays indicate tumor regions and non-tumor regions predicted by PSeger, respectively, while regions not covered by any overlay are background areas.

The Effect of the Amount of Labeling
As an important factor affecting model performance, the amount of labeling is reflected in two aspects: the ratio of annotated training samples to all training samples (denoted as X_l%) and the number of labeled patches in each sample (denoted as K). We conducted experiments on the BCSS dataset to examine the effect of X_l% and K on the model performance. Figure 9 shows the patch-level and image-level AUC values of Baseline and Baseline+ST+CL with X_u under different X_l% and K, and the results are reported as the mean of three repeated experiments.
Overall, the two AUC values increase with increasing X_l% and K. However, the increase slows down at higher X_l% and K. More importantly, Baseline+ST+CL with X_u always outperforms Baseline on image-level AUC, whereas it has better patch-level AUC than Baseline only when X_l% = 1% or K = 3.

Training with Different Strategies
To assess the contributions of self-training and consistency learning separately, we performed experiments on the BCSS dataset and the in-house dataset with the five training strategies mentioned before. Each experiment was repeated five times independently, and the results are summarized in Tables 3 and 4, where bold and underlined values represent the best and second-best results on a metric, respectively. From Table 3, the strategy of Baseline+ST+CL with X_u helps PSeger achieve the best performance on four of the six indicators (AUC = 92.04%, Acc = 85.72%, F1 = 80.4%, AUC_img = 94.31%), significantly higher than the values achieved by the Baseline strategy (AUC = 88.62%, Acc = 84.28%, F1 = 78.63%, AUC_img = 93.25%). The strategy of Baseline+ST with X_u achieves the second-best performance (AUC = 91.98%, Acc = 85.58%, F1 = 80.05%, AUC_img = 94.05%), which is roughly similar to that of Baseline+ST+CL with X_u. Additionally, the performance of Baseline+CL is inferior to that of Baseline. Furthermore, when X_u is involved in the training procedure, the model (Baseline+CL with X_u) performs better than Baseline and achieves the highest values on the two indicators Acc_img (86.17%) and F1_img (87.55%).
From Table 4, while the performance of PSeger trained by Baseline+ST+CL with X_u on the in-house dataset is still better than that trained by Baseline, combining the two semi-supervised strategies (consistency learning and self-training) does not achieve better performance than either strategy alone.
Nevertheless, the CNN-based models still achieve decent outcomes. It is somewhat surprising that the model using ViT-base as the backbone is not as good as the models using the CNN architecture in the patch-level evaluation indexes; however, it can surpass most CNN architecture models in the image-level evaluation indexes (second only to ResNeXt-101 (32 × 8d)).

Discussion
In the ablation study, we first investigated the effect of the amount of labeling on model performance ( Figure 9). On the image-level AUC, the model trained by Baseline+ST+CL with X u was always better than that trained by Baseline under otherwise equal conditions. However, on the patch-level AUC, that was not always true, particularly when K > 3 and X l % > 1%. This meant that the proposed semi-supervised method can effectively improve the image classification performance; however, it enhanced the patch classification performance only when the amount of annotation was small. When the annotation amount increased, the semi-supervised learning method was not as good as the fully-supervised learning method. Further study is therefore needed to optimize semi-supervised training.
Next, we performed experiments on different training strategies (Tables 3 and 4). Both consistency learning and self-training benefited the model, and self-training improved the model performance more significantly. Additionally, combining the consistency learning strategy with the self-training strategy has the potential to fully utilize the pseudoannotated data and further improve model performance. However, it depends on the dataset and requires appropriate parameter settings to achieve the expected result.
Finally, the experiment of training with different backbones (Table 5) demonstrates that our proposed method is suitable for both transformer-based and CNN-based models. By comparing the performance of different models, we found that Swin Transformer was better than the CNN models on both image-level and patch-level metrics.
In comparison, Vision Transformer was only better than most CNNs on image-level metrics and inferior to many CNNs on patch-level metrics. This may be because the patch classification accuracy depends both on the ability to capture localized features and on the sensitivity to context-driven features. Although Vision Transformer is more sensitive to contextual features than CNN models, its local feature extraction ability is poorer, which affects the final patch classification accuracy.
Our proposed method can be improved in several ways:
- Hierarchical patch-level label. Here, we only considered annotation at a single scale, which does not take advantage of the information available at different magnifications of the pathological images. Therefore, the annotation can be extended to multiple scales, allowing the model to learn from hierarchical information.
- Automatic patch selection for labeling. Choosing which patches to label is subjective and affects the learning effect of the model. Hence, an active learning mechanism [59] can be introduced to automatically find the most informative patches to label, improving learning efficiency.
- Hybrid CNN-transformer architecture. In terms of local feature extraction and global feature capture, CNNs and transformers have respective advantages, as analyzed before. Therefore, a hybrid CNN-transformer architecture, as in [60,61], might combine the benefits of the two to achieve greater performance.
- More advanced semi-supervised algorithms. Our semi-supervised algorithm still has problems, such as being sensitive to hyperparameters. In the future, ideas from recent advanced semi-supervised algorithms, such as MixMatch [62], can be introduced into the training algorithm. At the same time, constraints can be added to prevent the model from overfitting, such as enforcing consistency between the predictions of the patch classification branch and the image classification branch.

Conclusions
In this work, we proposed a novel form of annotation, sparse patch annotation, and developed an annotation tool to achieve this new way of labeling. We created a patch-wise segmentation model called PSeger to handle this new label, which was equipped with an innovative semi-supervised algorithm to fully utilize the unlabeled data. We compared the proposed method to various pixel-wise segmentation models (Figure 5). It was shown that, when trained with only 25% labeled data (five patches for each labeled image), our model achieved segmentation results comparable to those of semantic segmentation models trained on fully pixel-level labeled data.
Our proposed method enables pathologists to focus their time and energy on labeling the representative parts of the image rather than carefully delineating complex boundaries, significantly reducing the annotation burden.

Data Availability Statement:
Our annotation tool is available at: https://github.com/FHDD/PSeger-LabelMe (accessed on 1 June 2022). The public dataset used in this study can be accessed at the following link: https://bcsegmentation.grand-challenge.org/ (accessed on 1 June 2022). The private dataset is available upon reasonable request to the corresponding authors.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: