1. Introduction
Semantic image segmentation is the task of assigning to each pixel the class of its enclosing object or region as its label, thereby creating a segmentation mask. Due to its wide applicability, this task has received extensive attention from experts in several areas, such as autonomous driving, robot navigation, scene understanding, and medical imaging. Owing to its huge success, deep learning has become the de facto choice for semantic image segmentation. Recent approaches have used convolutional neural networks (CNNs) [1,2] and fully convolutional networks (FCNs) [3,4,5] for this task and achieved promising results. Several recent surveys [6,7,8,9,10,11] describe the successes of semantic image segmentation and directions for future research.
Typically, large volumes of labeled data are needed to train deep CNNs for image analysis tasks, such as classification, object detection, and semantic image segmentation. This is especially so for semantic image segmentation, where each pixel in each training image has to be labeled or annotated in order to infer the labels of the individual pixels of a given test image. The availability of densely annotated images in sufficient numbers is problematic, particularly in domains such as material science, engineering, and medicine, where annotating images is time-consuming and requires significant user expertise. For instance, while reading retinal images to identify unhealthy areas, it is common for graders (with ophthalmology training) to discuss each image at length to carefully resolve several confounding and subtle image attributes [12,13,14]. Labeling cells, cell clusters, and microbial byproducts in biofilms takes up to two days per image on average [15,16,17]. Therefore, it is highly beneficial to develop high-performance deep segmentation networks that can train with scantly annotated training data.
In this paper, we propose a novel approach for semantic segmentation of images that can work with datasets with scant expert annotations. Our approach, segmentation with scant pixel annotations (SSPA), combines active learning and semi-supervised learning to build segmentation models in which segmentation masks are generated both by automatic pseudo-labeling and by expert manual annotation of a small, selectively chosen set of images. The proposed SSPA approach employs a marker-based watershed algorithm based on image morphology to automatically generate pseudo-segmentation masks for the full training dataset. The performance of a segmentation model generated using these pseudo-masks is analyzed, and a sample of images to be annotated by experts is selected. These images with expert-generated masks, along with images with pseudo-masks, are used to train the next model. The process is iterated to successively generate a sequence of models until either the performance improvement plateaus or no further refinements are possible. At each iteration, the sample images are the top-k (or bottom-k) images ranked by image entropy computed over pixel prediction confidence probabilities.
Despite the careful selection of image samples in each iteration to limit the overall manual annotation effort, annotating each image in its entirety can be tedious for experts. This is especially so given the high information density of the segmentation task, where every pixel needs to be classified. The proposed SSPA approach reduces annotation effort by using pixels as the unit of annotation instead of annotating entire images or image patches. For each image that is selected for manual annotation, only pixels within a specified uncertainty range are marked for expert annotation. For the rest of the image, a set of replacement rules that leverage useful patterns learned by the segmentation model is used to automatically assign labels. The segmentation mask for an image includes these labels along with expert annotations, and is used to generate the next model. Using such directed annotations enables the SSPA approach to develop high-performance models with minimal annotation effort. The SSPA approach is validated on bio-medical and biofilm datasets and achieves high segmentation accuracy with less than 1% annotation effort. We also evaluated our method on a benchmark dataset for melanoma segmentation and achieved state-of-the-art performance with less than 1% annotation effort. The approach is general purpose and is equally applicable to the segmentation of other image datasets.
The rest of this paper is organized as follows. Section 2 discusses related work on semantic segmentation with scantly annotated data. The SSPA approach is detailed in Section 3. In Section 4, the datasets, network architecture, setup and evaluation metrics are presented. Experimental results and conclusions are discussed in Section 5 and Section 6, respectively.
  2. Related Work
Semantic segmentation [4] is one of the challenging image analysis tasks that has been studied earlier using image processing algorithms and more recently using deep learning networks; see [6,10,11,18] for detailed surveys. Several image processing algorithms based on methods including clustering, texture and color filtering, normalized cuts, superpixels, and graph- and edge-based region merging have been developed to perform segmentation by grouping similar pixels and partitioning a given image into visually distinguishable regions [6]. More recent supervised segmentation approaches based on [4] use fully convolutional networks (FCNs) to output spatial maps instead of classification scores by replacing the fully connected layers with convolutional layers. These spatial maps are then up-sampled using deconvolutions to generate pixel-level label outputs. Other decoder variants to transform a classification network to a segmentation network include the SegNet [19] and the U-Net [20].
Currently, deep learning-based approaches are perhaps the de facto choice for semantic segmentation. Recently, Sehar and Naseem [11] reviewed most of the popular learning algorithms (∼120) for semantic segmentation tasks and concluded that deep learning has been overwhelmingly successful compared to the classical learning algorithms. However, as pointed out by the authors, the need for large volumes of training data is a well-known problem in developing segmentation models using deep networks. Two main directions that were explored earlier for addressing this problem are the use of limited dense annotations (scant annotations) and the use of noisy image-level annotations (weakly supervised annotations). The approach proposed in this paper is based on the use of scant annotations to address manual labeling at scale. Active learning and semi-supervised learning are two popular methods in developing segmentation models using scant annotations and are described below.
  2.1. Active Learning for Segmentation
In the iterative active learning approach, a limited number of unlabeled images are selected in each iteration for annotation by experts. The annotated images are merged with training data and used to develop the next segmentation model, and the process continues until the model performance plateaus on a given validation set. Active learning approaches can be broadly categorized based on the criteria used to select images for annotation and the unit (images, patches, and pixels) of annotation. For instance, in [21], FCNs are used to identify uncertain images as candidates, and similar candidates are pruned leaving the rest for annotation. In [22], the drop-out method from [23] is used to identify candidates and then discriminatory features of the latent space of the segmentation network are used to obtain a diverse sample. In [24], active learning is modeled as an optimization problem maximizing Fisher information (a sample has higher Fisher information if it generates larger gradients with respect to the model parameters) over samples. In [25], sample selection is modeled as a Boolean knapsack problem, where the objective is to select a sample that maximizes uncertainty while keeping annotation costs below a threshold. The approach in [21] uses 50% of the training data from the MICCAI Gland challenge (85 training, 80 test) and lymph node (37 training, 37 test) datasets; [22] uses 27% of the training data from an MR image dataset (25 training, 11 test); [24] uses around 1% of the training data from an MR dataset with 51 images; and [25] uses 50% of the training data from 1,247 CT scans (934 training, 313 test) and 20% annotation cost. Each of these works produces a model with the same performance as that obtained by using the entire training data.
The unit of annotation for most active learning approaches used for segmentation is the whole image. Though the approach in [25] chooses samples with least annotation cost, it requires experts to annotate the whole image. Exceptions are [24,26,27], where 2D patches are used as the unit of annotation. While active learning using pixel-level annotations (as used by the SSPA approach) is rare, some recent works show how pixel-level annotations can be cost-effective and produce high-performing segmentation models [28]. Pixel-level annotations require experts to be directed to the target pixels along with the surrounding context, and such support is provided by software prototypes such as PIXELPICK, described in [28]. Several domain-specific auto-annotators exist for medical images, and the authors have also developed a domain-specific auto-annotator for biofilms that will be released soon to that community.
  2.2. Semi-Supervised Segmentation with Pseudo-Labels
Semi-supervised segmentation approaches usually augment manually labeled training data by generating pseudo-labels for the unlabeled data and using these to generate segmentation models. As an exception, the approach in [29] uses K-means along with graph cuts to generate pseudo-labels and uses these to train a segmentation model, which is then used to produce refined pseudo-labels, and the process is repeated until the model performance converges. Such approaches do not use any labeled data for training. A more typical approach in [30] first generates a segmentation model by training on a set of scant expert annotations, and the model is then used to assign pseudo-labels to unlabeled training data. The final model is obtained by training it on the expert-labeled data along with pseudo-labeled data until the performance converges. For a more comprehensive discussion on semi-supervised approaches, please see [10,18].
  2.3. Proposed SSPA Approach
The SSPA approach seamlessly integrates active learning and semi-supervised learning approaches with pseudo-labels to produce high-performing segmentation models with cost-effective expert annotations. Similar to the semi-supervised approach in [29], the SSPA does not require any expert annotation to produce the base model. It uses an image processing algorithm based on the watershed transform [31] to generate pseudo-labels. The base model generated using these pseudo-labels is then successively refined using active learning. However, unlike the prior active learning approaches used for segmentation, we employ image entropy instead of image similarity to select the top-k high-entropy or low-entropy images for expert annotation. Further, unlike most of the earlier active learning approaches for segmentation (with the exception of [28]), our unit of annotation is a pixel, targeting uncertain pixels only while other pixels are labeled based on the behavior learned by the models (please see Section 3 for more details).
Our preliminary work, reported as a short paper in [32], explored the viability of using pseudo-labels in place of expert annotations for semantic segmentation. In that paper, we considered datasets where expert annotations are available for the entire dataset and built a benchmark segmentation model using fully supervised learning. We then compared models built using a mixture of pseudo-labels and expert-annotated labels with the benchmark model to show the viability of pseudo-labels for building segmentation models. Requiring the experts to annotate all of the training data a priori and building a fully supervised segmentation model makes our prior work very different from the proposed approach. Further, having the experts densely annotate each pixel in each image in all of the training data makes our prior approach impractical for several domains, where significant expertise is needed to annotate each image.
In contrast, in the SSPA approach, expert annotations are obtained on demand only for the training samples identified in each active learning step. Further, the unit of annotation is a pixel, and the process is terminated when the model performance plateaus or no further refinements are possible, similar to [29]. The SSPA approach outperforms state-of-the-art results in multiple datasets including those used in [32].
The SSPA uses the watershed algorithm to generate pseudo-segmentation masks. This algorithm [31,33,34,35] treats an image as a topographic surface with its pixel intensities capturing the height of the surface at each point in the image. The image is partitioned into basins and watershed lines by flooding the surface from minima. The watershed lines are drawn to prevent the merging of water from different sources. The variant of the watershed algorithm used in this paper, the marker-controlled watershed algorithm (MC-WS) [36], automatically determines the regional minima and achieves better performance than the regular one. MC-WS uses morphological operations [37] and distance transforms [38] of binarized images to identify object markers that are used as regional minima.
In Petit et al. [39], the authors proposed a ConvNets-based strategy to perform segmentation on medical images. They attempted to reduce the annotation effort by using a partial set of noisy labels such as scribbles and bounding boxes. Their approach extracts and eliminates ambiguous pixel labels to avoid the error propagation due to these incorrect and noisy labels. Their architecture consists of two stages. In the first stage, ambiguity maps are produced by using K FCNs that perform binary classification for each of the K classes. Each classifier receives as input only the pixels that are true positives or true negatives for its class; the rest are ignored. In the second stage, the model trained at the first stage is used to predict labels for missing classes, using a curriculum strategy [40]. The authors stated that, with only 30% of the training data, their approach surpassed the baseline trained with complete ground-truth annotations. Even though this approach allows recovering the scores obtained without incorrect/incomplete labels, it relies on the use of a perfectly labeled sub-dataset (100% clean labels). This approach was further extended to an approach called INERRANT [41] to achieve better confidence estimation for the initial pseudo-label generation, by assigning a dedicated confidence network to maximize the number of correct labels collected during the pseudo-labeling stage.
Pan et al. [42] proposed a label-efficient hybrid supervised framework for medical image segmentation, where the annotation effort is reduced by mixing a large quantity of weakly annotated labels with a handful of strongly annotated data. Two main techniques, dynamic instance indicator (DII) and dynamic co-regularization (DCR), are used to extract the semantic clues while reducing the error propagation due to strongly annotated labels. Specifically, DII adjusts the weights for weakly annotated instances based on the gradient directions available in strongly annotated instances, and DCR handles the collaborative training and consistency regularization. The authors stated that the proposed framework shows competitive performance with only 10% of strongly annotated labels, compared to the 100% strongly supervised baseline model. Unlike SSPA, their approach assumes the existence of strongly annotated data to begin with. Without directed expert annotation as done in SSPA, it is highly unlikely that a handful of strongly annotated samples chosen initially will cover all the variations of the data. Hence, we argue that the involvement of experts in a directed manner guided by model predictions is important, especially in sensitive segmentation application domains, such as medicine and material science.
Zhou et al. [43] recently proposed a watershed transform-based iterative weakly supervised approach for segmentation. This approach first generates weak segmentation annotations through image-level class activation maps, which are then refined by watershed segmentation. Using these weak annotations, a fully supervised model is trained iteratively. However, this approach has several downsides: it offers no control over the propagation of initial segmentation errors during iterative training, it requires extensive manual parameterization during weak annotation generation, and it fails to capture fuzzy, low-contrast, and complex object boundaries [44,45]. Segmentation error propagation through iterations can adversely impact model performance, especially in areas requiring sophisticated domain expertise. In such cases, it may be best to seek expert help in generating segmentation ground truth to manage boundary complexities of the objects and mitigate the error propagation of weak supervision. Our experiments show that the SSPA models outperform the watershed-based iterative weakly supervised approach.
  3. The SSPA Algorithm
The inputs to the SSPA algorithm (Figure 1) are a set of images, U, and an optional set of corresponding ground-truth binary labels. The SSPA employs an iterative algorithm that uses a sequence of training sets to build a sequence of models, with the model at the i-th iteration generated using the i-th training set. Each training set is associated with a set of pixel-level binary labels, one for each training sample, and comprises image–label pairs in which the image is a training sample from U and the label is the corresponding pseudo-label, which is distinct from the ground-truth label. We apply the current model to each training sample to obtain a set of confidence predictions. For each image in the training set, this set contains an element with a pair of values for each pixel in the image: the prediction confidence value that the pixel belongs to class 1 and the pixel label assigned by the model. Min-max normalization is applied to the raw confidence values to map them to the [0, 1] interval and to construct normalized confidence and label pairs for each pixel. We refer to the resulting set, which contains the normalized value and label pairs for each image in the training set, as the normalized prediction confidence set. The entropy of a sample image is found using its corresponding normalized prediction confidence values. The mean entropy of a model (and of its training set) is calculated as the mean of the entropy of all images in the training set.
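The normalization and mean-entropy computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact entropy definition is given in Section 4.4 and is not reproduced here, so the sketch assumes the Shannon entropy (in bits) of a histogram of normalized confidence values. The function names and the `bins` parameter are hypothetical.

```python
import math


def min_max_normalize(conf):
    """Rescale a 2D list of raw confidence values to the [0, 1] interval."""
    flat = [v for row in conf for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: avoid division by zero
        return [[0.0 for _ in row] for row in conf]
    return [[(v - lo) / (hi - lo) for v in row] for row in conf]


def image_entropy(norm_conf, bins=10):
    """ASSUMPTION: Shannon entropy (bits) of the histogram of normalized values."""
    counts = [0] * bins
    total = 0
    for row in norm_conf:
        for v in row:
            idx = min(int(v * bins), bins - 1)  # put v == 1.0 in the last bin
            counts[idx] += 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def mean_entropy(norm_conf_maps, bins=10):
    """Mean of the per-image entropies over a set of confidence maps."""
    return sum(image_entropy(m, bins) for m in norm_conf_maps) / len(norm_conf_maps)


raw = [[2.0, 4.0], [6.0, 10.0]]
norm = min_max_normalize(raw)  # [[0.0, 0.25], [0.5, 1.0]]
print(image_entropy(norm))     # four equally filled bins give 2.0 bits
```

A different entropy estimator (for example, per-pixel binary entropy) would change the absolute values but not the way the mean is aggregated over images.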
The SSPA algorithm can be divided into three main steps as described below.
Initial Model and Initial Pseudo-label Set Generation: The marker-controlled watershed (MC-WS) algorithm was employed to avoid the over-segmentation caused by noise and irregularities that typically occurs with the watershed transform. The MC-WS floods the topographic image surface from a predefined set of markers, thereby preventing over-segmentation. To apply MC-WS on each image, an approximate estimate of the foreground objects in the image was first found using binarization. White noise and small holes in the image were removed using morphological opening and closing, respectively. To extract the sure foreground region of the image, a threshold was then applied to the distance transform. Then, to extract the sure background region of the image, dilation was applied on the image. The boundaries of the foreground objects were computed as the difference between the sure foreground and sure background regions. Marker labeling was implemented by labeling all sure regions with positive integers and labeling all unknown (or boundary) regions with a 0. Finally, watershed was applied on the marker image to modify the boundary region to obtain the watershed segmentation mask or binary label of the image.
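The final flooding step can be illustrated with a small pure-Python sketch. It is a simplified priority-flood from markers, written only to show how unknown (label 0) pixels inherit the label of the marker that reaches them first. It does not draw watershed lines and omits the binarization, morphology, and distance-transform steps, for which an image processing library would normally be used. The function name `marker_watershed` and the toy data are hypothetical.

```python
import heapq


def marker_watershed(height, markers):
    """Flood a 2D height map from positive-integer markers (0 = unknown).

    Pixels are visited in order of increasing height, so each unknown pixel
    is claimed by the marker whose flood reaches it first.
    """
    rows, cols = len(height), len(height[0])
    labels = [row[:] for row in markers]
    heap = []
    order = 0  # tie-breaker so equal heights are processed first-in, first-out
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] > 0:
                heapq.heappush(heap, (height[r][c], order, r, c))
                order += 1
    while heap:
        _, _, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] == 0:
                labels[nr][nc] = labels[r][c]
                heapq.heappush(heap, (height[nr][nc], order, nr, nc))
                order += 1
    return labels


# Two basins separated by a ridge (height 9), with one marker in each basin.
height = [[0, 1, 2, 9, 2, 1, 0]]
markers = [[1, 0, 0, 0, 0, 0, 2]]
print(marker_watershed(height, markers))  # [[1, 1, 1, 1, 2, 2, 2]]
```

In this toy example the ridge pixel is assigned to the first marker only because of the tie-breaking order. An implementation that draws watershed lines would instead mark it as a boundary.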
Using MC-WS, we created an ensemble of three watershed segmentation modules and applied it to the set U to generate three sets of labels. Majority voting is used to determine the initial set of pseudo binary labels. A segmentation network is trained on U paired with these initial pseudo-labels to obtain the initial model (refer to lines 23–27 in Algorithm 1). We use the initial model to generate the prediction confidence values and the normalized prediction confidence set for U, and we record the mean entropy of the initial model.
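The majority-voting step over the three watershed outputs can be sketched as below. The sketch assumes each module returns a binary mask of the same shape as a nested list; the function name `majority_vote` is hypothetical.

```python
def majority_vote(masks):
    """Pixel-wise majority vote over an odd number of equally sized binary masks."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(m[r][c] for m in masks) > n else 0 for c in range(cols)]
        for r in range(rows)
    ]


mask_a = [[1, 0], [1, 1]]
mask_b = [[1, 1], [0, 1]]
mask_c = [[0, 0], [0, 1]]
print(majority_vote([mask_a, mask_b, mask_c]))  # [[1, 0], [0, 1]]
```

With three voters, a pixel is labeled 1 only when at least two of the masks agree on it.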
      
Algorithm 1 Segmentation with scant pixel annotations.
Input: ▹ Optional
 1: procedure SSPA(, , , )
 2:     Choose J images with highest entropy (HE)/lowest entropy (LE) from
 3:     for each  do
 4:         Let
 5:
 6:
 7:         Let each  be a pixel outside  in
 8:         if  then
 9:
10:         else
11:
12:
13:
14:         end if
15:     end for
16:     Replace J in
17:     return
18: end procedure
19: procedure Watershed(U)
20:     ensemble
21:
22:     return
23: end procedure
24: repeat
25:     SSPA()
26: until
Segmentation with Scant Pixel Annotations: At each subsequent iteration, the current pseudo-label set, the corresponding model, and the normalized prediction confidence values are used to generate the next pseudo binary label set as follows. First, choose J images from the training set with the highest entropy (HE) or the lowest entropy (LE) values. Consider one such image and its label in the current pseudo-label set; its new training label is constructed as follows. All pixels in the image whose normalized prediction confidence values fall within the uncertainty range around 0.5 (that is, pixels for which the predictions of the current model are uncertain) are marked for expert annotation (lines 8–9). The width of this range is a parameter whose value is assigned empirically for each dataset. One way to obtain expert annotations is to manually label each marked pixel. If the ground truth is available, we copy the label of each marked pixel from the ground-truth label of the image into its new training label.
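A minimal sketch of the image selection and uncertain-pixel annotation described above is given below. It assumes the uncertainty range is the closed interval [0.5 − delta, 0.5 + delta], with delta = 0.05 matching the [0.45, 0.55] range used in Section 5, and it simulates the expert by copying ground-truth labels. Pixels outside the range simply keep their previous label here; the replacement rules for those pixels are not reproduced. All function and parameter names are hypothetical.

```python
def select_images(entropies, j, strategy="HE"):
    """Indices of the j images with the highest (HE) or lowest (LE) entropy."""
    order = sorted(
        range(len(entropies)),
        key=lambda i: entropies[i],
        reverse=(strategy == "HE"),
    )
    return order[:j]


def annotate_uncertain_pixels(norm_conf, prev_label, expert_label, delta=0.05):
    """Replace labels only where the normalized confidence is near 0.5.

    ASSUMPTION: pixels outside the range keep prev_label in this sketch.
    Returns the new label map and the list of (row, col) pixels sent to the expert.
    """
    lo, hi = 0.5 - delta, 0.5 + delta
    new_label = [row[:] for row in prev_label]
    marked = []
    for r, row in enumerate(norm_conf):
        for c, p in enumerate(row):
            if lo <= p <= hi:
                new_label[r][c] = expert_label[r][c]
                marked.append((r, c))
    return new_label, marked


print(select_images([2.1, 0.7, 3.4, 1.5], j=2, strategy="HE"))  # [2, 0]
conf = [[0.10, 0.48], [0.52, 0.90]]
prev = [[0, 0], [1, 1]]
truth = [[0, 1], [0, 1]]
print(annotate_uncertain_pixels(conf, prev, truth))
# ([[0, 1], [0, 1]], [(0, 1), (1, 0)])
```

Only two of the four pixels are sent for annotation in this toy case, which is the source of the annotation savings the approach relies on.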
Now consider the pixels that are not in . The pixel-level labels for these pixels in  can be decided using either the previous model  or the current model  (lines 11–17). Let  be a pixel in  row and  column of . If , then the label for  in  is the same as that in . Else, label for  in  is the same as assigning a class 0 or 1 to  based on a .  is calculated as the mean prediction confidence value of . Generate the next set of training labels,  by replacing J labels in  (line 19). Train a segmentation network on pair  to obtain next model .
Termination condition: At each iteration 
i, record the mean entropy of 
, 
. The algorithm terminates when the mean entropy of 
, 
 is higher than the mean entropy of 
 (lines 29–31). The decrease in model performance indicates that the model is unlearning useful patterns or features during training at the 
th iteration. In the presence of 
 labels, we also record evaluation metrics such as intersection over union (IoU) and Dice score. However, mean entropy as an evaluation metric takes precedence over IoU and Dice score, even in the presence of 
 labels. Refer to 
Section 4.4 for a detailed discussion of evaluation metrics. Select 
 with the best evaluation metrics as the best model to obtain binary labels, 
 using the least expert intervention.
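The entropy-based stopping rule can be sketched as follows. The sketch assumes that the per-iteration mean entropies have already been recorded and, as a simplification, treats the model just before the first increase as the best one. The function name is hypothetical; the example values are the biofilm HE mean entropies reported in Section 5.3.

```python
def stop_on_entropy_rise(mean_entropies):
    """Return (best_iteration, stop_iteration) for recorded mean entropies.

    The run stops at the first iteration whose mean entropy exceeds that of
    the previous iteration; stop_iteration is None if that never happens.
    """
    for i in range(1, len(mean_entropies)):
        if mean_entropies[i] > mean_entropies[i - 1]:
            return i - 1, i
    return len(mean_entropies) - 1, None


# Mean entropy falls from 2.587 to 1.778, then rises to 1.991.
print(stop_on_entropy_rise([2.587, 1.778, 1.991]))  # (1, 2)
```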
 Note that the SSPA uses two parameters—the uncertainty range threshold  and the number of images J selected for expert annotation in each iteration. Model prediction values around 0.5 lead to most uncertainty and a range around this value determined by  is well suited for many datasets. The value of J can be set based on the improvement of model performance across iterations. The approach can be adapted to other datasets by setting these parameters appropriately.
  5. Experimental Results and Discussion
For each dataset, the initial model and initial pseudo-label set generation step was applied to obtain the first set of pseudo-labels for each image in the dataset. The initial model was constructed using a training set and this label set. Subsequent models were constructed iteratively by following the segmentation with scant pixel annotation step using two different values of J and both high-entropy and low-entropy pixel label replacement strategies. The SSPA approach was terminated when the mean entropy of a model constructed in an iterative step increased from the previous step. Since we have access to ground-truth labels for all datasets, we used them to construct a ground-truth model and to benchmark the evaluation results from the models built using the SSPA approach. In addition to studying the prediction accuracy of the models constructed from the SSPA approach, we also observed the behavior of the SSPA approach’s pixel annotation strategies using heatmaps and confidence values of the pixel labels assigned by the models.
We now discuss the results of applying the SSPA approach to each of the three datasets. Below, we use  () to denote the strategy of choosing J images from the training set  with the HE (LE) values and then selectively replace the uncertain pixel labels identified by the SSPA approach in the training set . For each data set, we considered , the minimum value as well as J values corresponding to 10% of the training data. Models obtained using  values under-performed in all cases and are discussed in the paper. We also calculated the percentage of pixels replaced as the ratio of the total number of pixels labels replaced over all the J images to the total number of the pixels in the training set.
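The percentage of pixels replaced, as defined above, is a simple ratio. The sketch below assumes that all training images share the same height and width; the function name and the example numbers are hypothetical and are not taken from the experiments.

```python
def percent_pixels_replaced(replaced_per_image, image_shape, n_training_images):
    """Replaced pixel labels over the J images, as a percentage of all training pixels."""
    height, width = image_shape
    total_pixels = height * width * n_training_images
    return 100.0 * sum(replaced_per_image) / total_pixels


# Hypothetical: 3 selected images, 10 training images of 100 x 100 pixels each.
pct = percent_pixels_replaced([100, 200, 300], (100, 100), 10)
print(round(pct, 2))  # 0.6
```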
  5.1. EM Dataset
For this dataset, experiments were conducted using J = 3 (10% of the training data) in each iteration. Models obtained using LE pixel label replacements outperformed others.
Pixel label replacements in HE images. Our results in 
Figure 3 indicate that the performance of models with HE pixel label replacements did not improve, despite increased expert annotation efforts. In 
Figure 3, for each model and 
J value combination on the x-axis, three types of information are depicted—the IoU (black bar), Dice scores (blue bar), and the percentage of pixels replaced (the red trend-line). Models 
 and 
 had the same Dice (0.874) and IoU (0.776) scores. Model 
 was obtained by replacing 1.18% of the pixel labels from the 3 images in the output of model 
. Model 
 was obtained by replacing 2.46% of the pixel labels from the 3 images in the output of model 
. Since the mean entropy of 
 (2.976) was higher than that of 
 (2.845), the algorithm terminated. The models obtained using HE pixel label replacements achieved similar IoU values but had lower Dice scores in comparison to the benchmark model 
, which had mean entropy of 1.986, IoU of 0.823, and Dice score of 0.903.
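For readers who want to reproduce the two scores reported here, the following sketch computes IoU and Dice for a single pair of binary masks using their standard definitions. The exact evaluation protocol is described in Section 4.4 and may aggregate over images differently. The function name is hypothetical.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice score for two equally sized binary masks (nested lists of 0/1)."""
    inter = union = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += p & t
            union += p | t
            pred_sum += p
            truth_sum += t
    iou = inter / union if union else 1.0  # two empty masks count as a perfect match
    dice = 2 * inter / (pred_sum + truth_sum) if (pred_sum + truth_sum) else 1.0
    return iou, dice


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
iou, dice = iou_and_dice(pred, truth)
print(iou, dice)  # IoU is 1/3 and Dice is 0.5
```

For a single pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU). The EM values reported above (IoU 0.776 with Dice 0.874, and IoU 0.823 with Dice 0.903) are consistent with this identity.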
Pixel label replacements in LE images. Figure 4 shows the results obtained using LE pixel label replacements. For this experiment, we randomly chose one image as a test image and trained 
 using the remaining 29 images. Next, we generated models 
 and 
 using 
. Model 
 was obtained by replacing 0.86% of the pixel labels from the 3 LE images in the output of model 
. Model 
 was obtained by replacing 1.69% of the pixel labels from the 3 LE images in the output of model 
. Since the mean entropy of 
 (2.447) was higher than that of 
 (2.441), the algorithm terminated. Model 
 with IoU value 0.818 and Dice score 0.9 performs comparably with 
, having mean entropy 1.953, IoU 0.82, and Dice score of 0.9. The 
 values for LE slightly differ from those for HE since they are computed using 29 instead of 30 images. We also studied the entropy distribution of models generated using the LE pixel label replacement strategy. The entropy distribution of 
 had high variability, while 
 had the least variability and the best performance.
 We studied the model prediction entropy distribution of models generated using LE pixel label replacement strategy. The results are displayed in 
Figure 5. Here, the x-axis plots the model and the y-axis plots the entropy values. Each box plot in the figure shows the entropy value distribution of images for each model. As can be seen from the figure, the entropy distribution of 
 had high variability, while 
 had the least variability and the best performance. Although the median entropy values across models 
, 
, and 
 were higher than the upper quartile of 
, the SSPA approach seemed to have reduced the variability in the entropy values to obtain better performance. Further, we studied pixel oscillation to understand the effectiveness of the pixel replacement using the SSPA approach. We define oscillating pixels as the pixels with normalized prediction confidence values in the target range [0.45,0.55] (
), which result in inverse prediction confidence values after being replaced by 
 labels in the model input. Oscillating pixels can be problematic since they represent the unlearning of useful patterns in the input. We observed that 50.32% of the 0.86% replaced pixels in 
 oscillated, whereas 58.07% of the 1.69% replaced pixels in 
 oscillated. We conjecture that a smaller number of oscillating pixels correlates with better model performance.
  5.2. Melanoma Dataset
For this dataset, experiments were conducted using J = 5 (10% of the training data) in each iteration. Two test images were randomly chosen for evaluation from the dataset, and the rest of the data were used to generate three successive models using the SSPA approach. Models obtained using HE pixel label replacements outperformed others.
Pixel label replacements for HE images. Figure 6 shows the results obtained using the HE pixel label replacements. As depicted in the figure, model 
 was obtained by replacing only 0.8% of the pixel labels from the 5 HE images in the output of model 
. Model 
 was obtained by replacing 4.05% of the pixel labels from the 5 HE images in the output of model 
. Since the mean entropy of 
 (1.689) was higher than that of 
 (1.432), the algorithm terminated. Model 
 with IoU value 0.824 and Dice Score 0.973 outperformed the benchmark model 
, having IoU value 0.764, and Dice score of 0.962. The mean entropy of 
, 2.232, was higher than that of 
.
 To visually track and assess the change in entropy induced by pixel label replacements, we constructed a heatmap for each image in the training set using the normalized prediction confidence values. The heatmap of a sample image from 
Figure 7A demonstrated that the confidence predictions of 
 were consistent with the 
 labels shown in 
Figure 2. A more detailed view of the pixels (in red) to be annotated by experts can be seen in 
Figure 7B,C, at two different scales. The target regions in the range [0.45,0.55] (
) occurring mostly around the boundaries of the lesion in 
Figure 7B,C, showed the highest uncertainty.
Pixel label replacements for LE images. On the other hand, 
 with LE pixel label replacements (also depicted in 
Figure 6) resulted in a mean entropy 2.064, Dice score 0.958, and IoU 0.718, comparable to those of 
. However, 
 was generated using 11.04% pixel label replacements. Recall that model 
 was generated using only 0.8% pixel label replacements and had a much lower mean entropy value of 1.432 in the HE case. Since the mean entropy of 
, 2.669, was higher than that of 
, the SSPA approach terminated. The performance of 
 with LE pixel label replacements was comparable to that of 
 with HE pixel label replacements but required 15.51% pixels to be replaced in comparison to 4.05%.
   5.3. Biofilm Dataset
For this dataset, experiments were conducted using J = 8 (10% of the training data) in each iteration. Three test images were randomly chosen for evaluation from the dataset, and the rest of the data were used to generate three successive models using the SSPA approach. Models obtained using HE pixel label replacements outperformed others.
Pixel label replacements for HE images. Figure 8 shows the results obtained using HE pixel label replacements. As shown in the figure, model 
 was obtained by replacing only 0.85% of the pixel labels from the 8 HE images in the output of model 
. Model 
 was obtained by replacing 3.73% of the pixel labels from the 8 HE images in the output of model 
. The performance of 
 and 
 was similar with IoU values around 0.691 and Dice scores around 0.815. The mean entropy of 
 decreased to 1.778 from 2.587 in 
. The mean entropy of 
 increased to 1.991, and the algorithm terminated. The benchmark model 
 had mean entropy 2.822, IoU 0.609, and Dice score 0.754.
 The heatmap in 
Figure 9A shows that the target regions for pixel label replacements of 
 were found within the bacterial cells, in contrast to the melanoma dataset, where the boundaries of objects showed the highest uncertainty. The uncertainty of the model within the bacterial cells is likely due to the nature of the segmentation task for the biofilm dataset. Similar to the EM dataset, the goal of segmenting the biofilm dataset was to determine the boundary map of the bacterial cells. 
Figure 9B shows a more explicit view of the pixels (in red) to be annotated by the experts.
Figure 10 illustrates the entropy distribution of the HE pixel label replacement models, with the y-axis showing the entropy values and the x-axis showing the models. All models display a normal distribution with a few outliers. The best model, 
 had the lowest median, lower than the lower quartile of 
 and 
. The entropy distribution of both 
 and 
 showed decreasing variability, which illustrates the positive effect of pixel label replacements on model variability. We also observed that 
 had a higher rate of oscillation than 
, further validating the correlation between improved performance and lower oscillation.
 Pixel label replacements for LE images. On the other hand, model 
 with LE pixel label replacements (also depicted in 
Figure 8) recorded a mean entropy of 1.802, an IoU of 0.469, and a Dice score of 0.634. 
 was generated using 0.33% pixel label replacements. The pixel label replacements required by 
 amounted to 0.35%. Since the mean entropy of 
, 2.541, was higher than that of 
, the algorithm terminated here.
5.4. Comparing SSPA with Other Methods
We also investigated the effectiveness of the SSPA approach by comparing its performance with state-of-the-art fully supervised and weakly supervised segmentation methods. We trained two fully supervised encoder–decoder architectures, one using U-Net and another using DeepLabV3+ with ResNet101 [
53,
54]. The DeepLabV3+ model has an encoding phase that uses atrous spatial pyramid pooling (ASPP) and a decoding phase that gives better segmentation results along object boundaries. To compare our method with weakly supervised segmentation methods, we trained two U-Net models using grab-cut [
55] and MC-WS methods, respectively, both of which generate initial pseudo-labels.
The segmentation results of SSPA and other fully supervised (FS) and weakly supervised (WS) methods are summarized in 
Table 1. FS models were trained using full pixel-level (
P) expert labels, whereas WS models were trained using complete image-level (
I) labels. The SSPA approach performs comparably to or better than these fully supervised and weakly supervised methods on all datasets, despite requiring minimal annotation effort in comparison.
For the EM dataset, SSPA+LE performs on par with the two fully supervised U-Net methods (Dice scores for both are around 90.0%, and IoU scores are around 82%). For this dataset, the performance of SSPA+HE is lower than that of these two FS methods. Both SSPA+HE and SSPA+LE perform better than both weakly supervised methods in terms of Dice and IoU. For the melanoma dataset, SSPA+HE outperforms all of the FS and WS methods with respect to both IoU (82.4%) and Dice score (97.3%). For the biofilm dataset, SSPA+HE outperforms both FS methods as well as the WS: grab-cut+U-Net method, while the performance of SSPA+HE and WS: MC-WS+U-Net is approximately the same. Further, we observed that SSPA+LE yields better IoU and Dice scores than SSPA+HE on the EM dataset, whereas SSPA+HE performs better on the melanoma and biofilm datasets. Finally, the performance improvement of the SSPA method over the supervised approaches on the biofilm dataset is approximately 9%.
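For readers unfamiliar with the two measures used in this comparison, the following sketch computes IoU and Dice for a single pair of binary masks. It is an illustrative implementation under stated assumptions: the function name `iou_and_dice` and the toy masks are hypothetical, and the way scores are averaged over classes and images in Table 1 may differ from this per-mask computation.

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU and Dice for one pair of binary masks (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Two empty masks are treated as a perfect match by convention here.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)

# Toy masks: 2 overlapping pixels, 4 pixels in the union, 3 pixels in each mask.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
iou, dice = iou_and_dice(pred, target)  # IoU = 2/4, Dice = 4/6
```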
The best performing models were obtained by replacing less than 5% of the pixels summed across all iterations for all datasets (2.55% for EM with LE, 4.85% for melanoma with HE, and 4.58% for biofilms with HE). These percentages include the pixels to be replaced to generate one more model after the best performing model in order for the active learning process to terminate.
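The cumulative totals for the melanoma and biofilm datasets can be checked against the per-iteration percentages reported in Sections 5.2 and 5.3 for the HE runs. The snippet below is a small arithmetic check only; the dictionary layout and variable names are illustrative.

```python
# Per-iteration pixel label replacement percentages reported for the HE runs.
replacements = {
    "melanoma (HE)": [0.80, 4.05],
    "biofilm (HE)": [0.85, 3.73],
}

# Total expert labeling effort per dataset, summed across iterations.
totals = {name: round(sum(steps), 2) for name, steps in replacements.items()}
all_below_five_percent = all(total < 5.0 for total in totals.values())
```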
5.5. Discussion
The SSPA is a novel approach for generating high-performing deep learning models for semantic image segmentation using scant expert image annotations. Automated methods for generating pseudo-labels are integrated with an iterative active learning approach to selectively perform manual annotation and improve model performance. We used an ensemble of MC-WS segmentation modules to generate pseudo-labels. We also considered other popular choices, such as grab-cut [
55], to generate pseudo-labels and chose MC-WS based on its relatively superior performance. Pseudo-labeling approaches other than MC-WS may perform better for other applications, and these can be easily incorporated into the SSPA approach. Note that using a method that generates high-quality pseudo-labels is beneficial to the SSPA, but it is not essential to its success. In the SSPA approach, the pixel replacement effort required from the expert is inversely related to the initial pseudo-label quality. In the worst-case scenario, a low-quality initial pseudo-label set has to be compensated for by extra labeling effort from the experts. In the SSPA, images that need expert attention are chosen based on their model prediction entropy values. We employ entropy as the uncertainty sampling measure for the active learning process, in preference to margin, ratio, and least-confidence sampling techniques. Entropy-based sampling is well known and has been shown to be well suited for selecting candidates for classification and segmentation tasks [
56]. In the SSPA, we compute an entropy value for each image and use these values to identify the top-k images in which certain pixels have to be manually annotated by experts. A high entropy (HE) value for an image indicates that most pixel predictions in that image are uncertain (probability in the range [0.5 − δ, 0.5 + δ], where δ is the uncertainty range threshold). If an image with an HE value is selected as one of the top-k images for annotation by experts, then pixels with prediction values around 0.5 are labeled by the experts in order to reduce the uncertainty of predictions.
Alternatively, a low entropy (LE) value for an image indicates that most of the pixel predictions are made with high confidence. If an image with an LE value is selected as one of the top-k images for annotation by experts, then this means that there are still sufficient pixels with uncertain predictions (probability in the range [0.5 − δ, 0.5 + δ]) in that image, and these need to be labeled by experts to improve the performance of the model. 
Table 1 reports the experiments conducted with both high-entropy and low-entropy selection, along with the best-performing strategy (HE or LE) for each dataset. The uncertainty range threshold δ is one of the two parameters of the SSPA and was empirically set to 0.05 for our experiments. The parameter value may be varied across datasets based on expert assessments of model predictions.
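The top-k selection by image entropy can be sketched as follows. This section does not spell out how the per-image entropy is computed, so the sketch assumes one plausible definition: the Shannon entropy (in bits) of a 10-bin histogram of the pixel prediction probabilities. The function names `image_entropy` and `select_images`, the bin count, and the toy inputs are all assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

def image_entropy(prob_map, bins=10):
    """Shannon entropy (bits) of the histogram of pixel probabilities.
    Assumed definition, used here for illustration only."""
    counts, _ = np.histogram(np.asarray(prob_map, dtype=float),
                             bins=bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def select_images(prob_maps, k, high_entropy=True):
    """Indices of the k images with the highest (HE) or lowest (LE) entropy."""
    entropies = np.array([image_entropy(m) for m in prob_maps])
    order = np.argsort(entropies)          # ascending entropy
    if high_entropy:
        order = order[::-1]                # descending entropy for HE
    return order[:k].tolist(), entropies

# Toy prediction maps (illustrative values only).
a = np.full((4, 4), 0.95)                      # every pixel in one bin
b = np.array([[0.05, 0.95], [0.05, 0.95]])     # two equally filled bins
c = np.array([[0.05, 0.35], [0.65, 0.95]])     # four equally filled bins
he_indices, entropies = select_images([a, b, c], k=1, high_entropy=True)
le_indices, _ = select_images([a, b, c], k=1, high_entropy=False)
```

In this toy example the most spread-out map `c` is picked under HE selection and the uniform map `a` under LE selection; in either case, only the pixels inside the uncertainty range of the selected images would be passed to the experts.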
From the above experimental results, we can conclude that the best model constructed using the SSPA approach achieved high prediction accuracy with a mix of over 94% pseudo pixel labels generated from the MC-WS ensemble that were iteratively improved using select expert annotations. The terminating condition we employed also worked well in practice by stopping the construction of new models when the mean entropy increases with increased expert annotations. The additional methods (heatmaps and oscillating pixels) were valuable in understanding the behavior of the SSPA approach. They provided insights into which pixels are hard for a model to learn and how the scant annotations provided in each iteration contributed to the mean entropy of model outputs and the accuracy of the models. From these methods, we observed that the SSPA method may not always assign the same label as the expert to a pixel consistently. Therefore, in the final model, certain pixels may be assigned incorrect labels even though they were assigned correct labels in earlier models.
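The terminating condition can be expressed as a short loop over the mean entropies of successive models. The sketch below is a simplified reading of that rule (stop at the first increase in mean entropy and keep the preceding model); the function name `best_model_index` is hypothetical, and the example uses the mean entropies reported for the biofilm HE run in Section 5.3.

```python
def best_model_index(mean_entropies):
    """Index of the model retained when iteration stops at the first
    increase in mean entropy (hypothetical helper for illustration)."""
    for i in range(1, len(mean_entropies)):
        if mean_entropies[i] > mean_entropies[i - 1]:
            return i - 1   # entropy rose, so the previous model is kept
    return len(mean_entropies) - 1  # no increase observed: keep the last model

# Mean entropies of the three successive biofilm HE models.
biofilm_he = [2.587, 1.778, 1.991]
best = best_model_index(biofilm_he)  # the second model (index 1) is retained
```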
The SSPA is a general-purpose segmentation method that should be applicable to several datasets. The segmentation performance of the SSPA method, evaluated through IoU and Dice scores, does not depend on the percentage of pixels to be relabeled; that percentage is related to the manual labeling effort. No specific threshold values are used to identify images with HE and LE values; instead, the top-k HE (or LE) images are chosen for annotation. Similarly, pixels with the most uncertain predictions (probability value within delta of 0.5) are examined by the experts and labeled. Two parameters need to be chosen in order to apply the SSPA: (1) the number of images to be analyzed in each iteration (the value k), and (2) the uncertainty range delta for pixels. For each dataset, experiments can be run based on both LE and HE values, and the resulting models can be compared and the better one chosen.