Semi-Supervised Contrastive Learning for Few-Shot Segmentation of Remote Sensing Images

Abstract: Deep learning has been widely used in remote sensing image segmentation, but a lack of training data remains a significant issue. Few-shot segmentation of remote sensing images refers to segmenting novel classes with only a few annotated samples. Although meta-learning-based few-shot segmentation methods for remote sensing images remove the dependence on large-scale training data, the generalization ability of such models is still low. To handle this challenge, this work presents a few-shot segmentation method for remote sensing images with a self-supervised background learner that boosts the generalization capacity for unseen categories. The methodology is divided into two main modules: a meta-learner and a background learner. The background learner supervises the feature extractor to learn latent categories in the image background. The meta-learner expands on the classic metric learning framework by optimizing the feature representation through contrastive learning between target classes and latent classes acquired from the background learner. Experiments on the Vaihingen dataset and the Zurich Summer dataset show that our model has satisfactory in-domain and cross-domain transfer abilities. In addition, broad experimental evaluations on PASCAL-5i and COCO-20i demonstrate that our model outperforms prior few-shot segmentation works. Our approach surpassed previous methods by 1.1% with ResNet-101 in a 1-way 5-shot setting.


Introduction
Remote sensing image segmentation is mainly used to identify and segment ground objects in images. Semantic segmentation, as a basic vision task, therefore has essential applications in remote sensing. Deep learning has yielded great results in fully supervised semantic segmentation [1][2][3]. However, training a fully supervised semantic segmentation model requires many densely labeled images; segmentation of remote sensing images requires even more labeled samples, and the labeling process is laborious.
To alleviate the need for dense pixel annotation, few-shot segmentation of remote sensing images has been proposed. Few-shot segmentation aims to segment novel classes according to common features among base classes learned during the training phase. A well-performing few-shot segmentation model for remote sensing images should be able to adapt to novel classes by learning from only a few annotated samples. Currently, the majority of few-shot learning approaches adhere to a meta-learning paradigm. Owing to their flexibility and accuracy, metric-based meta-learning models are widely used in few-shot segmentation tasks.
A typical few-shot segmentation model treats non-target classes in support images as background, which undermines their specific features. The features of latent classes in the background can be taken as a reference to discriminate between the foreground and background. To this end, mining latent novel classes can widen the gap between the background and foreground prototypes. Furthermore, similar and dissimilar categories might be misclassified in the category representation space. Our method uses contrastive learning to decrease the distance between similar categories and increase the distance between different categories in the embedding space, hence improving the feature extractor's accuracy for category representation.
We propose a few-shot segmentation model for remote sensing images that uses a self-supervised background learner to overcome the problems of feature undermining and discriminator bias. Considering that the background of the support set also contains rich feature information, the background learner learns from unlabeled data to assist the meta-learner with further feature enhancement. In addition, to learn richer general features between categories, our model uses contrastive learning to make the category embedding space more uniformly distributed and to retain more feature information.
In summary, our contributions are as follows:
• We propose a background learner that, when segmenting novel classes, learns the latent classes and assists the feature extractor in obtaining information about other classes in the background. The background learner eliminates confusion between the target class and non-target class objects with different semantics in the query image, and improves segmentation accuracy.
• We use contrastive learning to refine the segmented edges by explicitly classifying query and background features in the embedding space.
• We provide a more accurate model for the few-shot segmentation of remote sensing images, which addresses the costly problem of annotating images of new classes in remote sensing.

Related Work

Few-Shot Learning
In many application scenarios, models must generalize well from a limited number of annotated samples. Few-shot learning was proposed to recognize novel classes from a few samples by learning meta-knowledge across different categories. Generally, few-shot learning methods are classified as model-based [4], metric-based [5], or optimization-based [6][7][8]. In particular, the Siamese network [9] laid the foundations for the later growth of metric-based models. Additionally, the matching network [6] provides another idea for the development of non-parametric metric learning. In the metric learning framework, the prototype network [7] leads the model to focus on the similarity between the support and query pairs while ignoring semantic feature learning.

Few-Shot Segmentation of Remote Sensing Images
In semantic segmentation, approaches based on few-shot learning can segment novel classes with only a few annotated images. Recent works can be divided into two categories with different focal points, i.e., parameter matching-based methods and prototype-based methods. In a breakthrough for few-shot segmentation, PANet [10] introduced a prototype alignment method that provides highly representative prototypes for each semantic class and that segments query objects based on feature matching. Furthermore, segmentation based on deep learning relies on training with large amounts of data [11,12], but dense annotations for remote sensing images are laborious to acquire. The studies in [13][14][15][16][17] explore how to reduce the need for dense annotation through self-/semi-supervised learning and weakly supervised learning. Jiang et al. [18] introduced a few-shot learning method for remote sensing image segmentation. However, few-shot learning has not yet been maturely applied to remote sensing image segmentation.

Self-Supervised Learning
The effectiveness of a fully supervised learning model is considerably constrained for specific tasks, owing to a lack of data and labels. Self-supervised learning, however, can improve the feature extraction ability of the model when it faces a new field or task that lacks abundant labeled data. The appearance of MoCo [19] triggered a surge in visual self-supervised learning. Then, one by one, SimCLR [20], BYOL [21], SwAV [22], and other self-supervised learning algorithms were proposed. With self-supervised learning, the model can learn features based on a pretext task [23]. A pretext task, such as generating a pseudo-label using a superpixel method, can provide additional local information that is used in few-shot learning to compensate for the lack of annotated data. Self-supervised learning has been widely used for few-shot segmentation. For example, SSL-ALPNet [24] employs superpixel-based pseudo-labels rather than manual annotation. MLC [25] uses an offline annotation module to generate pseudo masks of unlabeled data as a pretext task. SSNet [26] obtains supervised information from the background of the query set via superpixel segmentation.

Problem Setting
Our model is trained on base classes $C_{base}$ with abundant annotated images, and then uses its generalization ability to segment novel classes $C_{novel}$ with few annotated images ($C_{base} \cap C_{novel} = \emptyset$). Images containing base classes form the training set $D_{base}$, and images containing the novel classes compose the test set $D_{novel}$. The training set $D_{base} = \{(I_i, M_i)\}_{i=1}^{N_{base}}$ is constructed from $N_{base}$ image-mask pairs that contain objects in $C_{base}$, where $I_i$ denotes the i-th image and $M_i$ represents the corresponding mask. The test set has the same construction as the training set. A few-shot segmentation training episode typically consists of a set of query images Q and a set of support images S with ground-truth masks. Our few-shot segmentation setting additionally requires a set of extra images E. In detail, a training episode is composed of n ways, each with k support samples (shots), together with q query images and e extra images.
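The episode construction described above can be sketched as follows. This is a minimal illustration, not the authors' data loader: the function name `sample_episode`, the dictionary layout, and the choice to draw the query images per class are all assumptions made for the example.

```python
import random


def sample_episode(images_by_class, unlabeled_pool, n_way=1, k_shot=1,
                   n_query=1, n_extra=1, rng=None):
    """Draw one episode: support set S, query set Q, and extra set E.

    images_by_class: dict mapping a class name to a list of image ids
                     (hypothetical layout, assumed for this sketch).
    unlabeled_pool:  list of image ids without annotations.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = {}, {}
    for c in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(images_by_class[c], k_shot + n_query)
        support[c] = picked[:k_shot]
        query[c] = picked[k_shot:]
    extra = rng.sample(unlabeled_pool, n_extra)
    return support, query, extra
```

Only the support and query images carry masks; the extra images are used without labels by the background learner.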

Overview
As previously mentioned, non-target classes in the images are simply treated as background. To alleviate this issue, we take the features of non-target classes into account, which leads the feature extractor to learn features in the background. In addition, we apply contrastive learning to obtain a more accurate discriminator. Notably, the meta-learner uses fully supervised learning and the background learner uses self-supervised learning. These two branches jointly supervise the feature extractor; hence, we describe our model as semi-supervised.
Our framework. We designed a few-shot segmentation framework that learns meta-knowledge via training on support-query pairs, and that mines the latent novel classes in the context via self-supervised learning. With the auxiliary supervision of the background learner, our method can learn semantic features in the background that help discriminate the support classes in complicated scenes. The model (Figure 1) conducts semantic segmentation by first sending support and query images to an encoder that extracts features. Then, the support prototypes are computed by the prototype generation module, which contains masked average pooling. The background learner trains the model with extra images to mine latent novel classes, as introduced in Section 4.2. To learn richer general features between categories, our model employs contrastive learning via the infoNCE loss, as described in Section 4.3. Support and query prototypes from the meta-learner are used as positive samples, whereas latent category prototypes from the background learner are kept as negative samples in the memory bank. In this work, we added self-supervised learning from additional unlabeled images to learn the image background features. In the background learner, we fed extra images into an encoder-decoder to obtain prediction maps, where the encoder is shared with the feature extractor of the support and query images. Training with the additional branch, we calculated the segmentation loss $L_{seg}$ as follows:

$$L_{seg} = L_{meta} + \mu L_{background} + \lambda L_{contrastive}$$

where $L_{meta}$ represents the ground-truth segmentation loss of the query image, $L_{background}$ denotes the segmentation loss of the extra images, and $L_{contrastive}$ is the contrastive loss. The settings of the hyperparameters µ and λ are analyzed in detail in Section 6.2.5.
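A small sketch may help make the prototype computation and the combined objective concrete. It assumes that a prototype is the mean of the feature vectors under the class mask (the usual meaning of masked average pooling) and that the three losses are combined as a weighted sum with coefficients µ and λ; the function names and the nested-list data layout are chosen for illustration only.

```python
def masked_average_pooling(features, mask):
    """Average the feature vectors at positions where mask == 1.

    features: H x W grid of C-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 values for the target class.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # empty mask: fall back to the zero vector
    return [t / count for t in total]


def total_loss(l_meta, l_background, l_contrastive, mu=0.4, lam=0.6):
    """Assumed weighted-sum form: L_meta + mu * L_background + lam * L_contrastive."""
    return l_meta + mu * l_background + lam * l_contrastive
```

The defaults µ = 0.4 and λ = 0.6 are the best-performing values reported in the hyperparameter analysis; a practical implementation would operate on tensors rather than nested lists.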

Background Learner

Pseudo-Label Generation
A self-supervised pretext task attempts to make greater use of background pixel information. Pseudo-label generation utilizes superpixel segmentation to segment unlabeled data (Figure 2). We chose SLIC as the method for superpixel generation. The SLIC method generates superpixels via iterative clustering. In detail, we set the compactness to 10 and n_segments to 100, and from the generated superpixels selected the k superpixels with the highest class activation values as potential classes in the background. We set the number of selected potential classes k to 5. In Section 6.2.5, we present a comparison experiment for different values of the hyperparameter k.
The pseudo-label generation denotes each superpixel as a pseudo-class $c_p$ and generates the corresponding binary mask $M_p$. The class activation score $S(c_p)$ of each pseudo-class $c_p$ is calculated as the average of the extracted extra feature $F_e$ over the superpixel region:

$$S(c_p) = \frac{\sum_{x,y} M_p(x,y)\, F_e(x,y)}{\sum_{x,y} M_p(x,y)}$$

The five classes $\tilde{c}_p$ with the highest activation scores among the pseudo-classes were selected as the most likely latent novel classes in the background. The corresponding binary mask is denoted as $\tilde{M}_p$.
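The selection step can be illustrated with a short sketch. It assumes that a superpixel label map is already available (for instance from scikit-image's `slic` with `n_segments=100` and `compactness=10`) and that the activation of each pixel is a single scalar, such as the extra feature map averaged over channels; both the scalar activation and the function name `select_latent_classes` are assumptions for this example.

```python
from collections import defaultdict


def select_latent_classes(superpixels, activation, k=5):
    """Pick the k superpixels with the highest mean activation.

    superpixels: H x W grid of integer superpixel ids.
    activation:  H x W grid of scalar activations (assumed layout).
    Returns a list of (superpixel_id, score, binary_mask), best first.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sp_row, act_row in zip(superpixels, activation):
        for sp, a in zip(sp_row, act_row):
            sums[sp] += a
            counts[sp] += 1
    scores = {sp: sums[sp] / counts[sp] for sp in sums}
    # Sort by descending score; ties are broken by the smaller id.
    top = sorted(scores, key=lambda sp: (-scores[sp], sp))[:k]
    return [
        (sp, scores[sp],
         [[1 if v == sp else 0 for v in row] for row in superpixels])
        for sp in top
    ]
```

Each returned binary mask plays the role of a pseudo-class label for the background learner.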

Loss Function
The model optimizes its parameters by reducing the value of the loss function. A cross-entropy loss is applied in the background learner to supervise the training of our model. Since our model defines the number of pseudo-classes to be five, the background learner branch uses a multi-category cross-entropy loss. We define $L_{background}$ as the multi-category cross-entropy loss between the pseudo-class mask $\tilde{M}_p$ and the predicted extra mask $\hat{M}_p$:

$$L_{background} = -\frac{1}{HW} \sum_{x,y} \sum_{c} \tilde{M}_p^{c}(x,y) \log \hat{M}_p^{c}(x,y)$$

where H and W are the height and width of the feature maps, $\tilde{M}_p$ represents the generated pseudo-class label, and $\hat{M}_p$ represents the predicted background feature mask.
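As a hedged illustration of the multi-category cross-entropy used by the background learner, the sketch below averages the negative log-probability of the pseudo-class at every spatial position. The normalization by the number of positions and the nested-list layout are assumptions; the name `pixel_cross_entropy` is illustrative.

```python
import math


def pixel_cross_entropy(pred_probs, target):
    """Mean multi-class cross-entropy over all H*W positions.

    pred_probs: H x W grid of per-class probability lists (each sums to 1).
    target:     H x W grid of integer pseudo-class indices.
    """
    eps = 1e-12  # guards against log(0)
    total, n = 0.0, 0
    for prob_row, tgt_row in zip(pred_probs, target):
        for probs, t in zip(prob_row, tgt_row):
            total -= math.log(probs[t] + eps)
            n += 1
    return total / n
```

With five pseudo-classes, each probability list would have five entries (or six if a residual background class is kept, which the text does not specify).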

Contrastive Representation Learning
Contrastive learning allows the model to learn similarities and differences between feature points, and thereby general features between categories. In the vector representation space, contrastive learning enables the model to bring positive samples closer to the anchor samples and to push negative samples further away.
Contrastive learning is more effective when there are enough negative samples; however, standard few-shot learning frameworks provide few negative samples. To address this, we introduce extra unlabeled images as negative samples (Figure 3). On the one hand, these extra images supervise the feature extractor in mining background information; on the other hand, they can be utilized as negative examples in contrastive learning to supervise the model in learning general features between images.
To enlarge the set of negative samples, we employ a memory bank to store them, as inspired by SimCLR [20]. The negative samples stored in the memory bank are represented as $x_k^- = \{x_{k1}, x_{k2}, \ldots, x_{kn}\}$. The positive sample generated by query image encoding is denoted as $x_k^+ = x_{k0}$, while the sample generated by support image encoding is denoted as $x_q$; together, these form the space of all samples for contrastive learning.
Positive samples are the support and query prototypes encoded from the support and query images, while negative samples are the potential background classes obtained from the background learner branch. Clustering between similar prototypes is strengthened by increasing the distance between the positive and negative samples using the infoNCE loss [19]:

$$L_{contrastive} = -\log \frac{\exp\left(s(x_q, x_k^+)/\tau\right)}{\exp\left(s(x_q, x_k^+)/\tau\right) + \sum_{i=1}^{n} \exp\left(s(x_q, x_{ki})/\tau\right)}$$

where $L_{contrastive}$ denotes the loss between the positive and negative samples in contrastive learning. τ represents the temperature coefficient; we set τ = 0.03, based on broad experiments. $s(\cdot, \cdot)$ is the similarity measure between samples; we chose the cosine similarity as the measure function in this paper:

$$s(u, v) = \frac{u^{\top} v}{\|u\| \, \|v\|}$$

Figure 3. Detailed diagram of contrastive representation learning, which learns general features between categories by computing the distance between target classes and latent classes.
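The following sketch shows one common form of the infoNCE loss with cosine similarity and the temperature τ = 0.03 stated above. It treats one prototype as the anchor, one as the positive, and a list of latent-class prototypes as negatives; which prototype serves as the anchor, and the function names, are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchor, positive, negatives, tau=0.03):
    """infoNCE loss for one anchor, one positive, and a list of negatives."""
    pos = cosine_similarity(anchor, positive) / tau
    logits = [pos] + [cosine_similarity(anchor, n) / tau for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos
```

The loss is close to zero when the anchor aligns with the positive and is far from every negative, and it grows when a negative is more similar to the anchor than the positive is. A small τ sharpens this contrast, which is why the result is sensitive to its value.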

Dataset and Metrics
PASCAL-5i and COCO-20i. The performance of our model is evaluated on two datasets, i.e., PASCAL-5i and COCO-20i. The PASCAL-5i dataset has 20 classes and consists of PASCAL VOC 2012 [27] and the augmented SBD [28]. The COCO-20i dataset contains 80 categories. In the few-shot segmentation task, the classes of both datasets are divided into four folds; three folds are used for training, and the fourth fold is used for assessment.
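The fold protocol can be sketched as below. The sketch assumes that classes are numbered from 1 and assigned to folds in contiguous blocks, which is a common convention for PASCAL-5i; the exact class-to-fold assignment is not given in the text, and COCO-20i splits in the literature do not always follow this pattern.

```python
def fold_classes(num_classes, num_folds, fold):
    """Return (train_classes, test_classes) for one cross-validation fold.

    Classes are numbered 1..num_classes and split into contiguous blocks
    (an assumption of this sketch).
    """
    per_fold = num_classes // num_folds
    test = list(range(fold * per_fold + 1, (fold + 1) * per_fold + 1))
    train = [c for c in range(1, num_classes + 1) if c not in test]
    return train, test
```

For 20 classes and four folds, each fold holds out 5 novel classes and trains on the remaining 15; for 80 classes, each fold holds out 20.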
Remote sensing dataset. We ran trials using the Vaihingen dataset and the Zurich Summer dataset to assess the effectiveness of our model for remote sensing. The Vaihingen dataset was provided by the International Society for Photogrammetry and Remote Sensing (ISPRS), which collected data from high-resolution aerial images of Vaihingen, with each image labeled with six classes. The Zurich Summer dataset consists of 20 images covering eight classes. The Zurich dataset reflects real-world conditions well, which helps with evaluating the performance of segmentation models in remote sensing. Details of these two remote sensing datasets are shown in Table 1.
Baseline and metrics. PANet was used as the baseline model, because our model is based on metric learning. Following previous methods, the mean Intersection-over-Union (mIoU) is adopted for evaluating the model performance. For in-domain and cross-domain transfer, the remote sensing image segmentation performance is assessed using the F1 score of each class and the overall accuracy.
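For reference, the three metrics named above can be computed from flattened prediction and ground-truth label sequences as in the sketch below. These are the standard definitions; the function names and the flat-list representation are choices made for the example, and the averaging over episodes and folds used in few-shot benchmarks is not shown.

```python
def class_iou(pred, target, cls):
    """Intersection-over-Union of one class."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Mean IoU over the classes that occur in pred or target."""
    ious = [class_iou(pred, target, c) for c in classes]
    valid = [v for v in ious if v == v]  # NaN != NaN, so this drops NaN
    return sum(valid) / len(valid)


def class_f1(pred, target, cls):
    """F1 score of one class: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, target) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, target) if p != cls and t == cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else float("nan")


def overall_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return sum(1 for p, t in zip(pred, target) if p == t) / len(target)
```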

Implementation and Training Details
Network structure. To demonstrate the effectiveness of our approach, ResNet-50 and ResNet-101 were each applied as the feature extractor. The last level of the ResNet is removed for improved generalization, and the last ReLU is replaced by cosine similarity [10]. As for the auxiliary semantic branch for learning background features, a lightweight decoder is introduced behind the encoder shared with the meta-learner. The decoder consists of three convolution layers, and all except the final convolution are followed by batch normalization and ReLU. ImageNet pre-trained ResNet parameters are used for initialization, as in previous approaches.
Implementation details. On PASCAL-5i and COCO-20i, every episode is built from a support-query pair and an additional image that supervises the model in learning background features during the training phase. To train the model, we used the SGD optimizer with a learning rate of 5 × 10^−4 that decays by 0.1 every 10,000 iterations, and a momentum of 0.9. The model parameters are updated by SGD backpropagation over 3000 iterations. The training image and mask pairs were cropped to (417, 417) and augmented through random horizontal flipping. Our model stores the negative samples in a memory bank, where a dictionary is maintained to store and update the embeddings of negative samples. The experiments were run on Ubuntu 16.04 with an Intel Xeon Silver 4210R processor. The graphics processing unit (GPU) was a GeForce RTX 3090 with 1 TB memory.
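The stated optimizer settings can be written out as a short sketch. It assumes a step schedule that multiplies the rate by 0.1 at every 10,000th iteration and the momentum formulation in which the velocity accumulates raw gradients (the one used by common deep learning frameworks); the function names are illustrative.

```python
def step_lr(iteration, base_lr=5e-4, gamma=0.1, step_size=10_000):
    """Learning rate after step decay: base_lr * gamma^(iteration // step_size)."""
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a single scalar parameter."""
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```

With the values as stated, the first decay at iteration 10,000 falls after the 3000 training iterations mentioned, so the rate would stay at 5 × 10^−4 throughout such a run.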

Comparison with the State-of-the-Art
Extensive experiments were conducted to evaluate the model performance on PASCAL-5i and COCO-20i. In particular, we chose ResNet-50 and ResNet-101 as the encoding networks. Among convolutional neural networks, ResNet is a widely used feature extractor with strong segmentation performance, and its larger number of layers yields more detailed feature information. Following most peer papers, we chose the two most commonly used ResNets, ResNet-50 and ResNet-101, for the comparison experiments. More experimental results are shown in Figure A1. To assess the performance of the model in remote sensing image segmentation, we compared the results of fully supervised deep learning models and different few-shot segmentation models on the Vaihingen dataset and the Zurich Summer dataset.

PASCAL-5i
On ResNet-50 and ResNet-101, our approach outperformed previous methods (Table 2). In particular, our approach outperformed previous methods by 1.1% with ResNet-101 in a 1-way 5-shot setting, and by 0.9% with ResNet-50 in a 1-way 1-shot setting. Our approach is on par with the state of the art in the other settings. The segmentation results on the PASCAL-5i dataset reflect the generalization ability of our model and illustrate the value of the proposed improvements.

COCO-20i
The COCO-20i dataset includes more categories than the PASCAL-5i dataset, and has more realistic scenes. We recorded the results of our approach in Table 3. On this dataset, our approach outperformed previous methods by 0.6% in the 1-shot setting with ResNet-50. The results recorded in Tables 2 and 3 demonstrate the superiority of our approach.

Vaihingen
To show our model's generalization capability, we trained it on the PASCAL-5i dataset and tested it on the Vaihingen dataset. To segment the remote sensing images, we directly transferred the parameters trained on the PASCAL-5i dataset. The performance of all compared methods is listed in Table 4; our model shows a significant performance improvement over other few-shot segmentation models, outperforming PANet [10] by 7.5%. Segmentation performance is prominent in the 'building' and 'tree' classes, and it surpasses PANet [10] by 19.9% in the 'car' class. The qualitative outcomes of our method on the Vaihingen dataset are shown in Figure 4.

Zurich Summer
We evaluated the generalization ability of our model on the Zurich Summer dataset. The model performance was particularly good in the 'tree' and 'water' classes. Table 5 shows the results of all approaches, and illustrates a considerable performance improvement compared with other few-shot segmentation models, exceeding PANet [10] by 13.2% overall. The qualitative outcomes of our method on the Zurich dataset are shown in Figure 5.

Ablation Study
Our model contains two main parts, the background learner and the contrastive representation learning. The effectiveness of each component is evaluated on the PASCAL-5i dataset (Table 6). The background learner contributes the most to performance improvement, achieving a 0.9% increase in accuracy. The contrastive representation learning is indispensable and provides a 0.5% accuracy improvement. Our method achieves its best improvement with these two components combined.

Effect of the Background Learner
We used unlabeled images to supervise the latent background features. In previous methods, the unlabeled objects in the background were over-smoothed, which undermined their features. Extra images served as negative samples and allowed comparisons to be made. In the background learner, we tested batch sizes of 1, 2, 4, and 8 for the extra images; the model obtained the highest accuracy with a batch size of 2. The segmentation accuracy of the model with different batch sizes is shown in Table 7. Semantic segmentation is a dense classification task that needs positive and negative samples to balance the feature distribution. The background learner supervised our model by recognizing the semantic features in the background, which helped the matching network discriminate between background and foreground. To this end, mining the latent classes broadened the gap between the background and foreground prototypes in the embedding space. This broadening helps the matching network segment objects within complex backgrounds.
We visualized the ability to recognize semantic features (Figure 6), which shows how the features cluster in the embedding space. Few-shot segmentation models mostly use parameters pre-trained on the ImageNet dataset for initialization. In Figure 6, (a) is the distribution of category objects clustered by the pre-trained parameters in the embedding space, (b) is the object distribution generated after training the baseline model, and (c) is the object distribution generated by our model with background learner-assisted training. As shown in Figure 6, our model with a background learner significantly increased the intra-class similarity, meaning that points representing objects in each category were more tightly clustered. We recorded the performance of our model with a variable number of classes in the query image to analyze the effect of the background learner (Figure 7). When the query image contained numerous categories, our model performed better.

Effect of Contrastive Representation Learning
The contrastive representation learning is indispensable and provides a 0.5% accuracy improvement. By using a contrastive loss, our model keeps the representations of positive samples close to one another and the representations of negative and positive samples far apart. With contrastive learning, the learned representation can disregard changes caused by background alterations, so that it can capture higher-dimensional and more important feature information. Unlike other contrastive learning approaches, our model stores the negative samples required for comparison in a memory bank, rather than relying on the batch size. In practice, a dictionary is maintained to store and update the embeddings of negative samples.
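One simple way to realize such a store is a fixed-capacity first-in, first-out queue, sketched below. The FIFO replacement policy, the capacity parameter, and the class name `MemoryBank` are assumptions for illustration; the text only states that a dictionary of negative embeddings is stored and updated.

```python
from collections import deque


class MemoryBank:
    """Fixed-size FIFO store of negative-sample embeddings."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        # A deque with maxlen drops its oldest entries once it is full.
        self._items.extend(embeddings)

    def negatives(self):
        """Return the stored embeddings, oldest first."""
        return list(self._items)

    def __len__(self):
        return len(self._items)
```

Because the bank persists across iterations, the number of negatives available to the contrastive loss is decoupled from the batch size.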

In-Domain and Cross-Domain Transfer
To demonstrate the generalization ability of our model, its performance is shown in terms of cross-domain segmentation and in-domain category transfer.
Cross-domain segmentation. We transfer the parameters trained on the PASCAL-5i dataset directly to segment remote sensing images, which evaluates the generalization ability of our model over different domain categories.
In-domain category transfer. Our model was trained on the Vaihingen dataset, allowing it to learn more targeted parameters. The performance of our model was tested with a new category from the same dataset. Following the setup for training the few-shot segmentation model, our model takes four categories of samples from the Vaihingen dataset as the training set, and the remaining one category as the test set.
A comparison between cross-domain segmentation and in-domain category transfer on the 'impervious surface' and 'building' classes is shown in Table 8, which demonstrates how training the model on in-domain categories can improve the accuracy. The F1 scores of the building and impervious surface classes under in-domain category transfer are 1.8% and 1.6% higher, respectively, than those under cross-domain segmentation.
In the pseudo-label generation, we set the number of clusters to 5 after the ablation experiments. The ablations on the hyperparameter k in the pseudo-label generation are presented in Table 9. We compare the performance of the model when k = 1, 3, 5, 7, and find that the model performs best when k = 5.
Another three hyperparameters are set in our model, namely the coefficients µ and λ before the two auxiliary loss functions, and the temperature coefficient τ in the contrastive loss. As shown in Figure 8, the hyperparameters are verified with ResNet-50 in a 1-way 1-shot setting, and the best segmentation accuracy is attained when the coefficients µ and λ are set to 0.4 and 0.6, respectively. The effect of contrastive learning is significantly influenced by the temperature coefficient τ; depending on its value, the result may be tens of percentage points worse. From experience, the temperature coefficient τ should take a relatively small value, ranging from 0.01 to 0.1. The temperature coefficient τ in the contrastive loss was set to 0.03 according to broad experiments.

Conclusions
This work presents a novel background learner to learn latent background features. With the self-supervised background learner, the feature extractor can mine the latent novel classes in the background. By using the self-supervised method, our model improves segmentation accuracy without additional annotation costs. Another novelty is the application of contrastive representation learning, which yields a more accurate discriminator through the use of a contrastive loss. With all these components, our method improves substantially over the baseline and is on par with the state of the art on PASCAL-5i and COCO-20i. In addition, our model combines few-shot learning and remote sensing image segmentation, and obtains good results on remote sensing image datasets. Few-shot segmentation with a self-supervised background learner achieves good results, which suggests that background knowledge can be learned. Furthermore, our model demonstrates that few-shot learning can obtain good results in remote sensing image segmentation.

Limitation
It can be seen from Figure 9 that the segmentation results obtained by our model for 'road', 'tree', and 'grass' are relatively rough, and that the segmentation of small objects in the image is not accurate enough. During the experiments, we also found that some segmentation results exhibit semantic confusion. Semantic confusion often appears in images containing semantically similar objects, as shown in Figure 10. Categories with similar semantics, such as dogs and sheep, or chairs and tables, are confused within the same image. For instance, 'sheep' is erroneously segmented as 'dog' in the middle group of images.

Figure 1. The main architecture of the presented framework. The dotted box is a self-supervised background learner (in Section 4.2), which mines latent classes in the background. The contrastive representation learning branch is described in Section 4.3.

Figure 2. Detailed diagram of the pseudo-label generation module.

Figure 4. Qualitative outcomes of our method on the Vaihingen dataset.

Figure 5. Qualitative outcomes of our method on the Zurich dataset.

Figure 6. The t-SNE visualization of the features learned with and without the background learner. Different colors represent features of different categories.

Figure 7. The performance of our model with different numbers of classes in the query image.

Figure 8. Comparative experimental results for different values of the hyperparameters. (a) µ is the coefficient before the loss function of the background learner. (b) λ is the coefficient before the contrastive loss. (c) τ is the temperature coefficient in the contrastive loss.

Table 1. Detailed information of remote sensing datasets.

Table 2. Mean-IoU of 1-way segmentation on PASCAL-5i. Bold numbers represent the best results in the comparison experiment.

Table 3. Mean-IoU of 1-way segmentation on COCO-20i. Bold numbers represent the best results in the comparison experiment.

Table 4. F1 score of each class and the overall accuracy on the Vaihingen dataset (comparison between deep learning and few-shot learning segmentation models).

Table 5. F1 score of each class and the overall accuracy on the Zurich dataset (comparison between deep learning and few-shot learning segmentation models).

Table 6. Ablation studies on the effects of different components. Contrastive Representation Learning (CRL): contrast target classes and latent classes. Background Learner (BL): mine latent classes in the background using the background learner. A checkmark indicates the modules used in the comparison experiment, and bold numbers represent the best results in the comparison experiment.

Table 7. Ablation studies on the batch size of extra images in the background learner.

Table 8. F1 score of building and impervious surface on cross-domain and in-domain transfer.

Table 9. Ablation studies on the hyperparameter k in pseudo-label generation. Bold numbers represent the best results in the comparison experiment.