Pixel-Wise and Class-Wise Semantic Cues for Few-Shot Segmentation in Astronaut Working Scenes

Abstract: Few-shot segmentation (FSS) is a cutting-edge technology that can meet segmentation requirements with a small annotation workload. With the development of China Aerospace Engineering, FSS plays a fundamental role in astronaut working scene (AWS) intelligent parsing. Although mainstream FSS methods have made considerable breakthroughs on natural data, they are not suitable for AWSs. AWSs are characterized by a similar foreground (FG) and background (BG), indistinguishable categories


Introduction
Significant achievements have been made in the China Space Station (CSS) [1,2], and many in-orbit experiments have been performed smoothly [3][4][5]. It is a fundamental yet challenging scientific problem to intelligently analyze the semantic information of AWSs, which can be further applied to augmented reality, in-orbit robots, and other equipment that assists astronauts.
Semantic segmentation (SS) [6] is an advanced computer vision task that achieves pixel-level classification and describes the contours of objects accurately. However, training such a network requires a large-scale dataset, which demands labor- and time-consuming annotation by experts. According to statistics, it takes about 18 min to annotate an image [7]. Even worse, SS is unable to handle categories that do not exist during training, thus limiting its further application.
Astronauts usually encounter new tasks, and quick yet expert support is desired. Therefore, two basic requirements need to be met: the first is to give solutions with a low workload, and the second is to deal with unseen categories.
Fortunately, FSS [8-13] is an effective solution. Only a handful of annotated samples (usually one or five in experiments) are required to segment new classes. FSS has made breakthroughs on natural data such as COCO-20^i and PASCAL-5^i [8,12]. However, the models designed for these data are not suitable for AWSs. There are considerable differences between AWSs and natural data.

Semantic Segmentation
Since the proposal of fully convolutional networks (FCNs), semantic segmentation based on deep learning has developed rapidly. The subsequent work can be divided into several categories. The first improves accuracy using techniques such as atrous convolution [15] and contextual information [16] or structures such as encoder-decoder [15,17] and U-shape [18]. The second improves speed using schemes like depthwise separable convolution [19] and two-branch network structures [20-22]. The third focuses on important regions, using modules like channel attention [23], spatial attention [24], and self-attention [25] to weight the computation across different network details. However, semantic segmentation relies on supervised learning and requires a large amount of labeled data. When faced with novel categories, it cannot handle them, which limits its flexible application.

Few-Shot Learning
Few-shot learning (FSL) [10,26-29] aims to transfer knowledge with only several labeled samples. Unlike supervised learning, which learns from image-label pairs, FSL constructs a special unit, the episode [10], to simulate few-shot scenarios. All training and testing are conducted in episodes. FSL can be classified into metric-based, optimization-based, and reinforcement-based methods [8]. The first refers to feature-level methods [30], the second to training strategies [31], and the third to data processing [32]. Among them, metric-based methods have achieved SOTA results with low computational complexity [33]. Our model is also based on similarity metrics between support and query data and extends them to semantic segmentation.

Few-Shot Segmentation
The work in [9] pioneered FSS by establishing a complete framework, standard datasets, and metrics. The key to FSS is to determine the similarity between support and query data. Mainstream approaches can be categorized into two forms: prototype-based [8,34-40] and pixel-based [10,12,13,41-45]. The former is a class-wise semantic cue that compresses the support foreground into a high-dimensional vector; numerical measures such as cosine similarity and Euclidean distance [8] are used to gauge similarity. The latter uses 4D convolution [42,45] or transformers [46,47] to establish a dense correspondence between support and query data through pixel-wise cues. In general, pixel-wise methods are more accurate but weak in handling scale variations and small objects. The class-wise approach is faster and can be used as a complement to the pixel-wise approach. We combine the two methods in a novel way and demonstrate their effectiveness in the experiments.

Problem Definition
FSS aims to segment unseen categories in new images with the help of several labeled support images. As with canonical SS methods, the dataset is divided into a training set termed D_train and a test set termed D_test, containing the categories C_train and C_test, respectively. Unlike SS, the property of FSS is C_train ∩ C_test = ∅. In other words, the segmentation capability learned on the training set needs to generalize to unknown categories. Episodes are created in both the training and testing phases to simulate the few-shot setting [10]. Specifically, given K support image-mask pairs S = {(I_k^s, M_k^s)}_{k=1}^K and only one query image-mask pair Q = (I^q, M^q), the inference of I^q is performed by semantic cues from S, i.e., learning the mapping f : (I^q, S) → M^q. Similar to traditional SS, M^q is used only in the training phase.
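The episodic protocol above can be sketched in a few lines. The dataset layout and all names here (`sample_episode`, the toy class names) are illustrative, not the paper's actual data interface:

```python
import random

def sample_episode(dataset, classes, k_shot=1, rng=None):
    """Sample one few-shot episode: K support pairs and one query pair.

    `dataset` maps a class name to a list of (image, mask) pairs; the
    query pair is drawn from the same class as the support pairs, and
    its mask is used only for training-time supervision.
    """
    rng = rng or random.Random()
    cls = rng.choice(classes)
    pairs = rng.sample(dataset[cls], k_shot + 1)  # distinct samples
    support, query = pairs[:k_shot], pairs[k_shot]
    return {"class": cls, "support": support, "query": query}

# Toy dataset: two hypothetical unseen test classes with dummy pairs.
toy = {c: [(f"img_{c}_{i}", f"mask_{c}_{i}") for i in range(6)]
       for c in ["glovebox", "handrail"]}
ep = sample_episode(toy, ["glovebox", "handrail"], k_shot=5,
                    rng=random.Random(0))
```

Because `random.sample` draws distinct elements, the query pair never appears among the support pairs, matching the episode definition.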

Method Overview
The overall architecture of our PCNet is illustrated in Figure 1. The input of PCNet is a query image and several support images with finely annotated masks; the segmentation of the query image is the desired output. Firstly, a pre-trained ResNet50 [48] with frozen parameters is selected as the feature extractor shared between the support and query paths. Secondly, in the last three blocks of the backbone, we design novel pixel-wise and class-wise correlation blocks (PCC blocks) for mining dense semantic relationships between support and query features. Specifically, we construct prototypes from the support features and correspondingly resized masks for each layer in each block. Abstract deep prototypes pass global semantic cues back to shallow layers through distillation (reverse prototype distillation, RPD), forcing shallow prototypes to learn more explicit semantic information. Thirdly, all features of each block are fused using a simple fusion module, and features from multiple blocks are combined by hierarchical fusion. Finally, the results are obtained through a decoder consisting mainly of convolutional and upsampling layers. We explain our model in detail in the following sections, focusing on the PCC blocks and RPD under the 1-shot setting.

Feature Preparation
To ensure the generalization of FSS, we follow the standard practice of freezing the backbone. Most published FSS works use only the features of the last layer of the backbone for post-processing. Inspired by [42,44], we fully utilize the dense features within every block. The support and query features are F^s_{i,j} and F^q_{i,j}, respectively, where i is the index of blocks. For ResNet50, i ∈ {1, 2, 3, 4, 5}; the larger i is, the deeper the block. And j is the index of layers in each block.
We actually perform the research with i ∈ {3, 4, 5}. Correspondingly, the numbers of layers in these blocks are {4, 6, 3}. The layers within a block share the same scale, but scales differ between blocks: the features of the three blocks are 1/8, 1/16, and 1/32 of the input image, respectively. For the support mask, we resize it to the same scale as the features in each block, denoted as M^s_i.
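As a small sketch of this feature preparation step, the per-block settings and mask resizing can be written as follows. The 400 × 400 input matches the experiments later in the paper; `resize_mask` is a toy nearest-neighbour stand-in for the interpolation a real implementation would use:

```python
import numpy as np

# Per-block settings used in the paper: blocks 3-5 of ResNet50 with
# {4, 6, 3} layers, at 1/8, 1/16, and 1/32 of the input resolution.
BLOCKS = {3: {"layers": 4, "stride": 8},
          4: {"layers": 6, "stride": 16},
          5: {"layers": 3, "stride": 32}}

def resize_mask(mask, stride):
    """Nearest-neighbour downsampling of a binary support mask so it
    matches the feature scale of a block (a stand-in for the
    interpolation a real implementation would apply)."""
    return mask[::stride, ::stride]

h = w = 400  # input resolution used in the experiments
mask = np.zeros((h, w), dtype=np.float32)
mask[100:300, 100:300] = 1.0  # toy foreground square

resized = {i: resize_mask(mask, cfg["stride"]) for i, cfg in BLOCKS.items()}
```

The resized masks stay binary, so they can be applied directly to the feature maps of the matching block.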



Rational Application of the Cross-Attention
Transformers capture contextual relationships between pixels and are widely used for feature extraction [46,47,49]. The core formula of the transformer is as follows:

Attention(Q, K, V) = softmax(QK^T/√d)V, (1)

where Q, K, and V are tokens constructed from the feature map, and d is the dimension of Q and K. A = QK^T/√d is the affinity map [13], measuring the correlation between Q and K. When Q, K, and V come from the same feature map, Equation (1) denotes self-attention, which is widely used for feature extraction within an image.
One of the critical techniques in FSS is to find the association between the support feature and the query feature. Cross-attention is an effective pixel-wise association method that has been widely used in many studies [12,13,43,44]. The key to cross-attention is that Q, K, and V in Equation (1) come from different feature maps. Methods in [12,13] use query features as Q and background-removed support features as K and V, while the support mask is used for background removal only. Such an approach is suitable for natural datasets such as COCO [14] and PASCAL VOC [50], which contain images with significant inter-class differences and large foreground-background differences. Thus, removing the background before the transformer indeed reduces the effect of irrelevant factors.
However, for AWS data, the similarity between different categories is significant, and the FG and BG of images are easy to confuse. Cross-attention is achieved by flattening the feature map, where each pixel represents a token, and its purpose is to find matching relationships between pixels. As shown in Figure 2, for a certain pixel in the support FG, there are similar pixels in both the FG and BG of the query. The cross-attention is calculated by point-wise multiplication: similar query pixels multiplied by support FG pixels yield similar values, while multiplication with the zeroed-out support BG yields the value 0. Therefore, it is difficult to distinguish similar query pixels. This characteristic of AWSs leads to false segmentation.
Inspired by DCAMA [44], we adopt another form of cross-attention. As shown in Figure 3, Q, K, and V are query features, support features, and support masks, respectively. We first compute the affinity matrix, A = QK^T/√d, followed by softmax, i.e., Ã = softmax(QK^T/√d), which represents the pixel-wise dense correlation between K and Q. Unlike [12,13], the BG of the support features is also involved in the computation. Therefore, Ã combines the different feature values of FG and BG and their corresponding positional encodings to contain more information. A higher similarity between the support and query features leads to a stronger correlation; the thickness of the lines in Figure 3 indicates the degree of correlation. A weaker correlation between dissimilar features helps to distinguish features that are semantically different but similar in appearance. In addition, we multiply Ã with V, i.e., F = ÃV. The learned correlation is thus used to weight the support mask to obtain query masks.
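A minimal single-head sketch of this mask-as-value cross-attention, using NumPy in place of the paper's multi-head PyTorch implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mask_attention(q_feat, s_feat, s_mask):
    """Single-head sketch of the adopted cross-attention: query tokens
    attend over ALL support tokens (FG and BG alike), and the support
    mask is the value, so correlation weights vote directly for query
    foreground.

    q_feat, s_feat: (hw, c) flattened feature tokens
    s_mask:         (hw, 1) flattened binary support mask
    """
    d = q_feat.shape[-1]
    affinity = softmax(q_feat @ s_feat.T / np.sqrt(d))  # (hw_q, hw_s)
    return affinity @ s_mask                            # (hw_q, 1)

rng = np.random.default_rng(0)
c, hw = 8, 16
s_feat = rng.normal(size=(hw, c))
s_mask = (rng.random((hw, 1)) > 0.5).astype(float)
# Query tokens proportional to support tokens: each query pixel should
# attend mostly to its own support counterpart.
pred = mask_attention(s_feat * 4.0, s_feat, s_mask)
```

Because each affinity row is a softmax distribution over support tokens and the mask is binary, the output is a soft foreground score in [0, 1] per query pixel.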

PCC Blocks
PCC blocks are implemented in the last three blocks of the backbone. As shown in Figure 4, the 5th block is used as an example to illustrate the proposed module. Support features and query features are denoted as F^s_{5,j} and F^q_{5,j}, respectively. There are three layers inside the block, i.e., j = 1, 2, 3, and all the layers are scaled to 1/32 of the input image. We resize the support masks to the same scale by linear interpolation to obtain M^s_5. We take the query feature F^q_{5,j}, the support feature F^s_{5,j}, and the support mask M^s_5 of the corresponding layers as Q, K, and V, respectively. The corresponding tensor dimensions are F^q_{5,j}, F^s_{5,j} ∈ R^{h×w×c} and M^s_5 ∈ R^{h×w×1}. The multi-head form is used for computation according to Equation (1), i.e., learning the mapping f : (F^q_{5,j}, F^s_{5,j}, M^s_5) → F_j, where F_j ∈ R^{h×w×1}. Then, we stack the F_j of different layers along the channel dimension, i.e., F_block5 = cat(F_j), j = 1, 2, 3. These features are then fused using a convolution operation. With such dense cross-attention, each layer in each block is involved in the computation, providing considerable pixel-wise semantic information.

In addition, we construct prototypes for each support layer according to the standard practice [51], which is formulated as follows:

p_{5,j} = F_ave(F^s_{5,j} ⊗ M^s_5), (2)

where ⊗ is the Hadamard product and F_ave is global average pooling. As shown in Figure 4, in the 5th block, three prototypes are constructed to represent the class-wise semantic cues of particular layers, i.e., the semantic information of the whole category is compressed into a high-dimensional vector. We distill deeper prototypes to shallower layers in reverse (explained in the next section). Then, we expand the shallowest prototype p_shallow = p_1 and concatenate it with F_block5 to obtain the final feature of this block. The whole process is summarized as follows:

F_PCC,5 = F_cat(F_block5, F_expand(p_shallow)),

where F_expand : R^{1×1×c} → R^{h×w×c} and F_cat means concatenation in the channel dimension. F_PCC,i includes dense pixel-wise and class-wise semantic cues in the ith block (i is 5 in this example), allowing for detail-to-global feature extraction. Similar operations are performed for the features in each block.
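Equation (2) can be sketched as masked average pooling. Note that dividing by the foreground area is one common convention; the paper's F_ave may instead average over all h × w positions:

```python
import numpy as np

def masked_avg_pool(feat, mask, eps=1e-6):
    """Class-wise prototype of one support layer: Hadamard-mask the
    support feature, then average over the foreground pixels.

    feat: (h, w, c) support feature; mask: (h, w) binary support mask
    returns: (c,) prototype vector
    """
    masked = feat * mask[..., None]                 # Hadamard product
    return masked.sum(axis=(0, 1)) / (mask.sum() + eps)

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0  # toy foreground patch
proto = masked_avg_pool(feat, mask)
```

The prototype equals the mean feature vector of the foreground pixels, i.e., the whole category compressed into one c-dimensional vector.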

Reverse Prototype Distillation (RPD)
Equation (2) computes the prototype of each layer, compressing the category cues at different stages. Prototypes are characterized by global abstraction and have been used in many studies [8,52,53]. However, some works [13,44] argue that prototypes suffer from information loss.
In this paper, such a class-wise feature proves to be an effective complement to pixel-wise features, and the two can realize complementary advantages. Unlike previous work, where prototypes are extracted only once, PCNet extracts dense prototypes. Several studies [20,22] point out that deeper features are more abstract and characterize global semantic information. Similarly, the deeper the prototype, the more precise its semantic information. The natural idea is to propagate the prototypes of deeper features to shallower ones, thus enhancing the semantic representation of shallow prototypes.
Specifically, for all prototypes p_{i,j} ∈ R^{1×1×c}, we first project them to the same dimension using a fully connected layer, i.e., p'_{i,j} = F_MLP(p_{i,j}) ∈ R^{1×1×c'}, where c' is a fixed value. Next, a softmax layer normalizes the prototypes as follows:

p̃_{i,j} = softmax(p'_{i,j}/T),

where T is the distillation temperature, a hyperparameter that we set to 0.5. The effects of different values of T are given in the ablation studies.
The direction of distillation is opposite to the forward pass of the backbone, i.e., reverse distillation. The deeper the feature layer, the stronger its prototype. The prototype of the last layer in the last block is the initial teacher and distills forward sequentially. The KL divergence loss [12] is utilized to supervise the process of distillation, which is formulated as follows:

L_KL1 = Σ_i Σ_{j=1}^{n-1} KL(p̃_{i,j+1} ‖ p̃_{i,j}),
L_KL2 = Σ_{i=3}^{4} KL(p̃_{i+1,1} ‖ p̃_{i,n}),

where n denotes the number of layers in each block, and n = 4, 6, 3 when i = 3, 4, 5, respectively. L_KL1 denotes the distillation loss within each block, as shown by the short dashed arcs in Figure 1 and the dashed arcs in Figure 4. L_KL2 denotes the distillation loss from the first layer of the deeper block to the last layer of the shallower block, as shown by the long dashed arcs in Figure 1 and the straight dashed line in Figure 4. It should be noted that distillation is only performed during training.
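A toy sketch of the reverse distillation loss within one block, assuming the deeper prototype acts as the teacher of its shallower neighbour; the projection MLP and the exact pairing across blocks are omitted here:

```python
import numpy as np

def softmax(x, T=1.0):
    e = np.exp((x - x.max()) / T)  # temperature-scaled softmax
    return e / e.sum()

def kl(p, q, eps=1e-8):
    """KL(p || q) for two normalized prototype distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rpd_loss(protos, T=0.5):
    """Reverse prototype distillation inside one block: `protos` is
    ordered shallow -> deep, and each deeper prototype is taken as the
    teacher of its shallower neighbour.
    """
    soft = [softmax(p, T) for p in protos]
    return sum(kl(soft[j + 1], soft[j]) for j in range(len(soft) - 1))

rng = np.random.default_rng(0)
protos = [rng.normal(size=16) for _ in range(3)]  # e.g. block 5: 3 layers
loss = rpd_loss(protos, T=0.5)
```

Identical prototypes give zero loss, while mismatched shallow prototypes are penalized, which is the pressure that pushes them toward the deeper, more semantic ones.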

Feature Fusion and Decoder
We use a simple module to fuse the pixel-wise and class-wise semantic cues from the PCC block. As shown in Figure 1, the fusion block consists of stacked convolution, group normalization, and ReLU. Then, the feature map is resized to the same scale as the previous block. Inspired by [44], we merge the features of the last three blocks using the progressive union module, which is formalized as follows:

F_4 = F_fusion(F_PCC,4 ⊕ up(F_PCC,5)), F_3 = F_fusion(F_PCC,3 ⊕ up(F_4)),

where F_fusion denotes the fusion block, up denotes upsampling to the scale of the shallower block, and ⊕ is pixel-wise addition. Finally, F_3 is processed by a simple decoder containing mainly convolutional and upsampling layers to obtain the query mask. The overall loss of PCNet is formulated as follows:

L = L_BCE + λ(L_KL1 + L_KL2),

where L_BCE is the binary cross-entropy loss commonly used in many studies [8,12,44]. λ is set to 1 since we do not find a large improvement in performance with other values.
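The progressive fusion can be sketched as follows, with toy square sizes chosen so plain 2x upsampling aligns; a real implementation interpolates to the exact target scale, and the fusion block is conv + GN + ReLU rather than this addition-plus-ReLU stand-in:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, standing in for the resize step."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(a, b):
    """Toy stand-in for the fusion block: pixel-wise addition + ReLU."""
    return np.maximum(a + b, 0.0)

# PCC outputs of blocks 5, 4, 3 (deep to shallow), channel dim c = 4.
rng = np.random.default_rng(0)
f5 = rng.normal(size=(13, 13, 4))
f4 = rng.normal(size=(26, 26, 4))
f3 = rng.normal(size=(52, 52, 4))

# Progressive union: deepest features are upsampled and merged into the
# next shallower block, ending at the shallowest scale for the decoder.
out = fuse(f3, upsample2x(fuse(f4, upsample2x(f5))))
```

The final map stays at the shallowest (1/8-style) scale, which is what the decoder then upsamples to the full query mask.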

Extension to K-Shot Setting
PCNet can be extended to a K-shot (K > 1) model without retraining. As shown in Figure 5a, we flatten the feature map before the cross-attention to obtain h × w tokens. As shown in Figure 5b, under the K-shot setting, we concatenate the tokens of the K support feature maps to construct h × w × K tokens. The same operation is applied to the K support masks. Equation (1) is computed using the dot product, which requires no learning and does not change the dimension of the final result. Therefore, this extension can be performed naturally. For the K pairs of support data, we use Equation (2) to compute the prototypes at each level and then average the K prototypes:

p = (1/K) Σ_{n=1}^{K} p_n,

where p_n is a single prototype created from one support image and its corresponding mask.
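A sketch of the two K-shot extensions described above: token concatenation for the attention path and prototype averaging for the class-wise path (all names here are illustrative):

```python
import numpy as np

def kshot_tokens(support_feats, support_masks):
    """K-shot extension without retraining: concatenate the K support
    feature maps (and masks) along the token dimension before the
    cross-attention; dot-product attention needs no new weights.

    support_feats: list of (hw, c) arrays; support_masks: list of (hw, 1)
    """
    return (np.concatenate(support_feats, axis=0),
            np.concatenate(support_masks, axis=0))

def kshot_prototype(protos):
    """Average the K per-image prototypes: p = (1/K) * sum_n p_n."""
    return np.mean(np.stack(protos), axis=0)

rng = np.random.default_rng(0)
K, hw, c = 5, 16, 8
feats = [rng.normal(size=(hw, c)) for _ in range(K)]
masks = [(rng.random((hw, 1)) > 0.5).astype(float) for _ in range(K)]

tok_f, tok_m = kshot_tokens(feats, masks)          # (hw*K, c), (hw*K, 1)
proto = kshot_prototype([f.mean(axis=0) for f in feats])  # toy prototypes
```

The query tokens are unchanged; only the key/value token count grows from h × w to h × w × K, so the attention output keeps its original dimension.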



Dataset
To meet the experimental needs, we create a semantic segmentation dataset of the CSS. Specifically, we rely on the simulator developed by the China Astronaut Research and Training Center (ACC) to acquire rich images. For objects inside the cabin, a camera captures images of each object at different angles, distances, and scales. We aim to simulate astronauts working inside the CSS from various perspectives and to improve the richness of the dataset. For important objects outside the cabin, we utilize the virtual reality system developed by the ACC to simulate various perspectives outside the cabin in space.
We select 1000 images containing 28 real objects inside the cabin and 8 virtual objects outside the cabin as the final dataset for annotation. To further increase the richness and adaptability of the data, the dataset is augmented with random rotation, random flip, random brightness, random exposure, and random saturation. Finally, we obtain 7255 pairs, named Space Station 36 (SS-36). Among them, the training set contains 4836 images with corresponding masks, and the test set contains 2419 images with corresponding masks. To the best of our knowledge, this is the first AWS dataset applied to SS. Some samples are shown in Figure 6. As shown in Figure 6a, the distinction between categories in AWS images is unclear. The FG and BG are easily confused. The effects of light are intense. The FG has few textures and occupies a small proportion of pixels. Figure 6c shows the results obtained using the state-of-the-art (SOTA) method [12]. Issues like incomplete segmentation, FG/BG confusion, and missing segmentation are likely to happen. Mainstream FSS methods cannot handle such complex images effectively.
Furthermore, we organize SS-36 into the cross-validation format required by FSS. The training and test sets are divided into four splits. To balance diversity and fairness among categories, each split contains seven in-cabin real objects and two out-of-cabin virtual objects, totaling nine categories. There is no category overlap between the four splits. We name this dataset SS-9^i.
The training is performed on three of the splits, and the testing is performed on the remaining split. Thus, four experiments need to be performed, which is known as cross-validation in FSS.


Metrics
Two metrics [8,12] are used in our experiments:

mIoU = (1/C) Σ_{c=1}^{C} IoU_c, (11)

FB-IoU = (IoU_FG + IoU_BG)/2, (12)

Equation (11) is the mean intersection over union (mIoU), where C is the number of categories. Equation (12) is the average of the IoU of the FG and the IoU of the BG.
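The two metrics can be sketched directly from their definitions, assuming binary masks with foreground = 1, as in FSS episodes:

```python
import numpy as np

def iou(pred, gt, cls=1):
    """Intersection over union for one class label."""
    p, g = (pred == cls), (gt == cls)
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0

def miou(preds, gts, classes):
    """Mean IoU over C categories: per-class foreground IoU averaged
    over that class's episodes, then averaged over classes."""
    return float(np.mean([
        np.mean([iou(p, g) for p, g in zip(preds[c], gts[c])])
        for c in classes]))

def fb_iou(pred, gt):
    """Average of foreground IoU and background IoU."""
    return 0.5 * (iou(pred, gt, cls=1) + iou(pred, gt, cls=0))

# Toy check: GT foreground is the top half, prediction covers half of it.
gt = np.zeros((4, 4), dtype=int); gt[:2, :] = 1
pred = np.zeros((4, 4), dtype=int); pred[:2, :2] = 1
```

FB-IoU rewards correct background as well, which is why the text notes that mIoU reflects model performance more accurately on foreground-scarce AWS images.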

Implementation Details
We normalize the input images to uniform 400 × 400 pixels and do not perform any further data augmentation, similar to previous works [42,44]. To ensure the generalization of the model, we fix the parameters of the backbone. It should be noted that different backbones result in different performances [13,43,52]. However, we mainly verify the performance of the proposed core modules; therefore, only ResNet50 is chosen for our experiments, and the same backbone is used for all comparison models. We conduct experiments on 2080Ti GPUs using the PyTorch framework. For training, the batch size, initial learning rate, momentum, and weight decay are set to 8, 0.001, 0.9, and 0.0001, respectively. The SGD optimizer is used until the model converges. During testing, we randomly select 500 samples, and the average is reported as the final result.

Comparison with the SOTA Methods
In this work, eight typical models from recent years are adopted for comparison. Two of them are based on prototypes [8,52], and six on pixel-wise semantic matching [12,13,42-45]. All models are based on ResNet50 and are retrained on SS-9^i with their original settings until convergence. It is worth noting that the leading models, BAM [8] and HDMNet [12], use additional branches for base classes. Specifically, the base path is trained first, and then the backbone of the base branch is used for the second stage. However, using a backbone trained with the base branch is equivalent to fine-tuning it on the target dataset, unlike most models, which fix the backbone. Therefore, it is not fair to use this approach for comparison. In the experiments, results for both this two-stage training and one-stage training with a fixed backbone are reported.
Table 1 shows the results of PCNet compared with mainstream methods. The best results are highlighted in bold, while suboptimal results are underlined. PCNet achieves optimal results in almost all settings; only on Split3 under the 1-shot setting is it suboptimal. The closest competitor is DCAMA [44], which is suboptimal in most settings. Our method outperforms DCAMA by 4.34% and 5.15% under the 1-shot and 5-shot settings, respectively. For the two special methods, BAM and HDMNet, the results obtained with the pre-trained backbone are significantly worse than those obtained by fine-tuning the backbone on SS-9^i: BAM shows reductions of 11.21% and 12.40% under the 1-shot and 5-shot settings, respectively, and HDMNet decreases by 10.95% and 8.79%. The reason for such significant differences is that AWSs are very different from natural data, so a backbone fine-tuned on SS-9^i is more suitable for AWSs; it is not fair to compare with other models using such a trick. Even with the pre-trained backbone, our proposed method greatly outperforms BAM and HDMNet. Compared with HDMNet, the SOTA method on natural data, the mIoU of PCNet is 24.50% and 31.34% higher under the 1-shot and 5-shot settings, respectively.
From another perspective, class-wise methods [8,52] are inferior to pixel-wise methods [12,13,42,44,45,54]. Pixel-wise methods can be further divided into three categories: methods based on 4D convolutions [42,45], methods using only cross-attention between the query image and support images [12,13,43], and methods taking masks into the cross-attention [44]. PCNet belongs to the third category. The second category gives the worst results, while the third gives the best, which proves the effectiveness of the proposed approach.
In addition, we calculate the FB-IoU and inference speed of the various models. As shown in Table 2, our model still shows the highest accuracy, outperforming DCAMA by 1.03% and 2.01% under the 1-shot and 5-shot settings, respectively. The inference speed of pixel-wise methods is much lower than that of class-wise methods due to the intensive computation required by the pixel-wise form. PCNet ranks fourth among the pixel-wise methods, lower than BAM, DCAMA, and HSNet.

Qualitative Results
Figure 7 shows a visual comparison between HDMNet and PCNet under the 1-shot setting. Although HDMNet is a SOTA method on natural datasets, it does not work well on AWS images, providing few correct and complete segmentations. In contrast, PCNet works much better because of its design choices specific to AWS data. Figure 8 shows the visualization of PCNet under the 1-shot and 5-shot settings. As with most FSS methods, segmentation improves as the amount of support cues increases.

Ablation Studies
In order to demonstrate the effect of the proposed modules and the hyperparameters used, we perform ablation studies. For brevity, the experiments are performed only under the 1-shot setting. Since mIoU reflects model performance more accurately than FB-IoU, we adopt mIoU as the metric in most of the experiments.
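The difference between the two metrics can be made concrete with a short sketch (a minimal NumPy illustration, not the evaluation code used in this work): mIoU averages the per-class IoU so every category counts equally, while FB-IoU first collapses all classes into a single foreground and therefore ignores class confusions.

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class intersection-over-union for integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return ious

def mean_iou(pred, gt, num_classes):
    # mIoU averages IoU over all present classes, so each class counts equally.
    return float(np.mean(iou_per_class(pred, gt, num_classes)))

def fb_iou(pred, gt):
    # FB-IoU collapses labels to foreground (>0) vs. background,
    # so it cannot see which class a foreground pixel was assigned to.
    fg_pred = (pred > 0).astype(int)
    fg_gt = (gt > 0).astype(int)
    return float(np.mean(iou_per_class(fg_pred, fg_gt, 2)))
```

A prediction that confuses two foreground classes can still score a perfect FB-IoU while its mIoU drops, which is why mIoU is the stricter metric here.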

Effects of the Proposed Modules
As shown in Table 3, we validate the effect of the two main modules, PCC and RPD. The baseline is the architecture without dense prototypes and reverse prototype distillation, where the PCC module uses only dense cross-attention. "No RPD" means no prototype distillation. It can be seen that the average mIoU over all four splits increases by 4.34% with the usage of dense prototypes, and performance improves by a further 7.77% with the addition of RPD. These results demonstrate the effectiveness of using such class-wise semantic cues. Some FSS methods [12,13,43] use cross-attention for pixel-wise matching between support features and query features. These methods first multiply support features with support masks; as explained in Section 2.3, removing the background yields the masked support features, which are used in cross-attention as K and V. It is worth noting that, instead of using dense cross-attention inside the block, these methods [12,13,43] apply cross-attention only at the last layer of each block. For fairness, we modify their architectures to the same dense form as ours. As shown in Figure 9, our method outperforms the other methods significantly, improving the average accuracy by 21.93%. These results prove that the method proposed in our work is more applicable to AWS data.
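The masked cross-attention idea described above can be sketched in a few lines (a simplified NumPy illustration with hypothetical names, not the exact PCC implementation): the support mask zeroes the background of the support features, the masked features serve as K, and the resulting affinities transfer the support mask onto the query pixels.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(query_feat, support_feat, support_mask):
    """Sketch of mask-guided cross-attention for FSS.

    query_feat:   (Nq, C) flattened query features
    support_feat: (Ns, C) flattened support features
    support_mask: (Ns,)   binary FG mask of the support image
    Returns a (Nq,) soft foreground score for the query pixels.
    """
    # Multiplying support features by the mask removes the background;
    # the masked features act as K.
    k = support_feat * support_mask[:, None]
    # Pixel-wise affinity between query pixels (Q) and masked support pixels.
    attn = softmax(query_feat @ k.T / np.sqrt(query_feat.shape[1]), axis=-1)
    # The learned correlation weights the support mask to predict the query mask.
    return attn @ support_mask
```

In this toy setup, a query pixel that resembles a foreground support pixel receives a score close to 1, while a pixel with no clear match stays ambiguous.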

Effects of the Distillation Temperature
The distillation temperature T in Equation (4) is a hyperparameter that affects the results. We choose T = {0.5, 1, 2, 3} for the experiments. As shown in Table 4, the average accuracy over all four splits is highest when T equals 0.5, so PCNet adopts the value 0.5. However, the variation in T has only a weak influence on the results: the best result (T = 0.5) is only 2.60% higher than the worst (T = 3).
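The role of T can be illustrated with a generic temperature-scaled KL distillation loss (a NumPy sketch of the standard formulation; Equation (4) in this work may differ in detail): a lower temperature sharpens the target distribution, while a higher one softens it.

```python
import numpy as np

def softmax_t(logits, t=1.0):
    """Softmax with temperature t: probabilities of logits / t."""
    z = np.asarray(logits, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_kl(student_logits, teacher_logits, t=0.5):
    """Illustrative temperature-scaled KL distillation loss."""
    p = softmax_t(teacher_logits, t)  # soft target (e.g., from a deep prototype)
    q = softmax_t(student_logits, t)  # prediction (e.g., from a shallow prototype)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

At T = 0.5 the target distribution is noticeably peakier than at T = 3, which is consistent with the observation that a lower temperature works slightly better in our setting.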

Comparison between SS and FSS
We compare the gap between FSS and SS. PSPNet, a widely used model that serves as the base branch in [8,12], is chosen as the SS method. As the amount of support data increases, memory usage grows gradually; to lower computation consumption, we resize the input images to 350 × 350. For PSPNet, we use standard supervised learning to train and test each split. For FSS, we conduct cross-validation by selecting shot = {1, 5, 8, 10, 12, 13, 14, 15}.
As shown in Table 5, PSPNet has the highest accuracy in Split1 and Split2 but does not perform as well in the other two splits. In Split0, the mIoU of PSPNet is higher than that of PCNet only under the 1-shot setting; with more support data, PCNet's mIoU exceeds PSPNet's. In Split3, the accuracy of PSPNet is comparable to that of PCNet under the 5-shot setting. Compared to FSS, PSPNet shows unevenly distributed accuracy across the splits because the categories in Split0 and Split3 contain smaller objects, and traditional semantic segmentation has a weaker ability to deal with small objects. It is important to note that PSPNet [55] achieves an mIoU of 85.40 on PASCAL VOC and 80.20 on Cityscapes but only 65.91 on AWS data, proving that AWSs are challenging scenes. Figure 10 shows the average accuracy across all four splits with different amounts of support data; the red dotted line is the result of PSPNet. For FSS, the mIoU increases significantly, by 17.61%, from the 1-shot to the 5-shot setting. However, the curve from the 5-shot to the 15-shot setting is relatively flat, with a growth rate of 5.14%. In particular, when the number of support samples exceeds 10, the improvement in accuracy slows: from the 10-shot to the 15-shot setting, the accuracy increases by only 2.21%. While PCNet's accuracy surpasses PSPNet's after the 13-shot setting, the increment is small; the 15-shot setting is only 0.47% more accurate than PSPNet. Going beyond 15 shots makes little sense for FSS, so PCNet's accuracy can be considered stable at the same level as PSPNet's.
SS-9i contains 100 to 137 samples per class. The accuracy achieved by PSPNet using more than 100 samples per class can be achieved by PCNet with only 13 samples. What is more, with PCNet, there is no need to train on unseen categories. This proves that the proposed method can reduce the amount of annotation required and can be applied to untrained classes.

Conclusions
The special layout of AWSs poses a great challenge to segmentation supported by a small number of annotations. We propose a dedicated and efficient model, PCNet, to solve this issue. PCNet is mainly composed of two modules: PCC and RPD. The former uses unique cross-attention and dense prototypes to extract complementary semantic associations between support and query features, pixel-wise and class-wise, respectively. The latter reversely distills deep prototypes to shallow layers to improve the quality of the corresponding shallow prototypes. To verify the effectiveness of the proposed method and facilitate engineering application, we customize a dataset for AWSs. Experiments show that PCNet exceeds current leading FSS methods in accuracy, surpassing the suboptimal model by 4.34% and 5.15% under the 1-shot and 5-shot settings, respectively. Further experiments show that PCNet matches a traditional semantic segmentation method in accuracy under the 13-shot setting. Notably, FSS methods such as PCNet can handle untrained classes, breaking the limitation of traditional semantic segmentation. In summary, PCNet shows both academic and engineering significance, accelerating the development of AWS intelligent parsing.

Figure 1. The overall architecture of the proposed PCNet, including the shared backbone, PCC block, fusion block, RPD module, and a simple decoder. Different colors in the backbone indicate different blocks. Dashed arcs indicate inter-layer RPDs (short arcs) or inter-block RPDs (long arcs). ⨁ is the pixel-wise addition.


Ã is multiplied with V, i.e., Attn = ÃV. The learned correlation is used to weight the support mask to obtain query masks.
Aerospace 2024, 11, x FOR PEER REVIEW

Figure 2. False segmentation due to the similarity of FG-BG. The blue line indicates that the query BG is correlated with the support FG, which is the cause of incorrect segmentation. The red line indicates the correct correlation.

Figure 3. Cross-attention in our method. The blue line indicates the false correlation. The red lines indicate correct correlations. The thicker the line, the stronger the correlation. Correlations calculated between support and query features are transferred to the support mask to predict the query mask.


Figure 4. PCC block modified from the 5th block of ResNet50. Darker colors indicate deeper layers. MAP denotes the masked average pooling. ⨂ means the Hadamard product. M is the support mask. P is the prototype. RPD stands for reverse prototype distillation. Conv includes convolution, group normalization, and ReLU.
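The MAP operation in the caption can be sketched in a few lines (a NumPy illustration under assumed tensor shapes, not the code of this work): the Hadamard product with the mask M zeroes background positions, and averaging over the remaining foreground pixels yields the prototype P.

```python
import numpy as np

def masked_average_pooling(feat, mask, eps=1e-6):
    """Masked average pooling (MAP) sketch.

    feat: (C, H, W) support feature map
    mask: (H, W) binary support mask M
    Returns a (C,) class prototype P.
    """
    # Hadamard product removes background responses before pooling.
    fg = feat * mask[None, :, :]
    # Average only over foreground pixels; eps guards against an empty mask.
    return fg.sum(axis=(1, 2)) / (mask.sum() + eps)
```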


Figure 6. Characteristics of AWSs. (a) Samples of the dataset. (b) Labels corresponding to the samples. (c) Some predictions using the method from HDMNet [12].


Figure 7. Comparison of qualitative results between HDMNet and PCNet.


Figure 8. Comparison of qualitative results with PCNet under different settings. (a) Results under the 1-shot setting. (b) Results under the 5-shot setting.


Figure 9. Ablation studies of different forms of cross-attention.


Figure 10. Average mIoU over four splits under different settings. The red dotted line is the result of PSPNet. The blue line shows the results for PCNet with different amounts of support data.


Table 2. Quantitative results using FB-IoU and FPS as metrics. Best results are bolded, and sub-optimal results are underlined. "*" indicates that the backbone of the model is pre-trained on ImageNet.

Table 3. Ablation studies of the proposed modules. "Prototype" means using dense prototypes in cross-attention. "RPD" denotes the reverse prototype distillation. Best results are bolded, and sub-optimal results are underlined.

Table 4. Ablation studies of different distillation temperatures. Best results are bolded, and sub-optimal results are underlined.


Table 5. Comparison between SS and FSS. Best results are bolded, and sub-optimal results are underlined.