Center-Guided Transformer for Panoptic Segmentation

This work proposes a panoptic segmentation network that predicts masks and classes for things and stuff in images. Recently, panoptic segmentation has been advanced by combining query-based learning with end-to-end learning. Current research focuses on learning queries without distinguishing between thing and stuff classes. We present decoupled query learning to generate effective thing and stuff queries for panoptic segmentation. For this purpose, we adopt different workflows for thing and stuff queries. We design center-guided query selection for thing queries, which focuses on the center regions of individual instances in images, while we set stuff queries as randomly initialized embeddings. Also, we apply a decoupling mask to the self-attention of query features to prevent interactions between things and stuff. In the query selection process, we generate a center heatmap that guides thing query selection. Experimental results demonstrate that the proposed panoptic segmentation network outperforms the state of the art on two panoptic segmentation datasets.


Introduction
Panoptic segmentation [1] is a computer vision task that involves segmenting both things and stuff. Things are distinguishable, individual instances such as people, cars, and animals, where each instance has a unique ID and a class. In contrast, stuff refers to amorphous, homogeneous regions such as the sky, meadows, and grass. Starting from ConvNet-based panoptic segmentation models [2][3][4][5], recent panoptic segmentation methods [6][7][8][9] employ transformers to learn thing and stuff queries in various ways. DETR [7] and its variants [8,9] set queries as randomly initialized embeddings and train the queries with a transformer decoder, as shown in Figure 1a. The learned queries are then transformed into class and mask predictions for things and stuff. Alternatively, as in Figure 1b, object detection methods [10,11] adopt a query selection approach, which chooses effective features from image features based on the class probability. However, things and stuff have different properties. Things are countable and contain small segments, while stuff is uncountable and includes large segments; thus, thing and stuff queries need to be learned differently. Also, the aforementioned approaches mix query features using self-attention in the transformer decoder, which causes interactions between thing and stuff queries.
In this work, we propose an architecture that integrates effective query selection for things and a decoupling mask that prevents things and stuff from interfering with each other, as illustrated in Figure 1c. First, we develop center-guided query selection for things, which exploits the center regions of instances in image features. To analyze center regions, we estimate a center heatmap, which has high values at the center of individual instances, and use it to generate a center-guided feature. Based on the center-guided feature, we select effective thing queries. After setting stuff queries as randomly initialized embeddings, we train thing and stuff queries separately using a transformer decoder with the decoupling mask. Using the trained thing and stuff queries, we obtain panoptic segmentation results from mask and class predictions. Experimental results demonstrate that the proposed panoptic segmentation network outperforms the state of the art on the COCO panoptic dataset [12] and the ADE20K panoptic dataset [13]. Specifically, on the COCO panoptic dataset, the proposed network yields the best performance for things, while it provides results for stuff that are comparable to the state of the art. The proposed network achieves 52.2 PQ and 44.1 $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ on the COCO panoptic dataset, and 41.5 PQ and 28.9 $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ on the ADE20K panoptic dataset, where both PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ range from 0 to 100. The rest of this paper is organized as follows: Section 2 surveys panoptic segmentation methods and center-based learning techniques. Section 3 describes the proposed network. Section 4 discusses the experimental results. Finally, Section 5 concludes this work and outlines future work directions.

Related Works
Panoptic segmentation [1] is a joint task that combines semantic segmentation and instance segmentation, requiring the prediction of distinct masks to represent both things and stuff. Early panoptic segmentation methods attempt to combine existing semantic segmentation and instance segmentation networks effectively. For example, UPSNet [2] utilizes two separate branches to produce semantic and instance segmentation and then integrates both results using an additional panoptic segmentation head. PanopticFCN [3] jointly models things and stuff by designing a unified convolution pipeline to simplify panoptic segmentation.
Recently, transformer-based models [6][7][8][9] have achieved promising performance in panoptic segmentation. DETR [7] is an end-to-end solution that addresses both object detection and panoptic segmentation. However, it is still inferior to classical segmentation models, since DETR produces panoptic segmentation by adding a simple mask head on top of an object detection network. MaskFormer [8] and Mask2Former [9] have architectures similar to that of DETR but differ in using a global segmentation decoder and some specialized designs for mask prediction. MaskFormer [8] builds a pixel decoder to generate mask predictions through simple matrix multiplication between enhanced queries and the pixel decoder output. Mask2Former [9] proposes masked attention, which uses mask predictions to restrict cross-attention, to reduce training time significantly. Panoptic SegFormer [6] adopts an auxiliary location decoder to help instance queries learn location clues and to ease model training. These transformer-based methods rely on learnable queries initialized with random values to estimate things and stuff. In contrast, we propose a query selection algorithm that extracts individual queries from image features using object center information.
Fast panoptic segmentation networks [5,14,15] have also been presented. YOSO [15] develops a feature pyramid aggregator to reduce GPU latency and a separable dynamic decoder to generate panoptic kernels. IDNet [14] decomposes panoptic segmentation into category and location information, which simplifies the network architecture.
The center of an object provides rich context for solving various computer vision tasks, such as object detection [16][17][18][19] and segmentation [5,20]. CenterNet [16] detects each object as a center keypoint. CenterNet2 [17] further enhances the center representation using a heatmap approach. FCOS [18] introduces a centerness branch to predict the deviation of a pixel from the center of its corresponding box. ExtremeNet [19] predicts geometric centers and aligns them into a bounding box. In the segmentation task, CenterMask [20] leverages a center heatmap for anchor-free instance segmentation. Panoptic-DeepLab [5] first estimates all foreground masks from an image and then extracts thing classes based on instance centers. In contrast, we take into account the centers of object instances to extract instance queries from the corresponding feature space. In the experimental results, the proposed algorithm exhibits superior performance in comparison to the other panoptic segmentation models.

Architecture
Backbone and transformer encoder: The backbone extracts image features from an input image $X \in \mathbb{R}^{H_0 \times W_0 \times 3}$, and the transformer encoder generates a new feature map $F_{\text{encoder}} \in \mathbb{R}^{H_1 \times W_1 \times C}$ from the image features, where $H_1 = H_0/32$, $W_1 = W_0/32$, and $C = 2048$. We employ ResNet50 [21] for the backbone and the transformer encoder in [9]. The transformer encoder consists of deformable attention [10], layer normalization, and a feed-forward network (FFN). The feature map $F_{\text{encoder}}$ is gradually upsampled to a center embedding $F_{\text{center}} \in \mathbb{R}^{H \times W \times C}$ and a mask embedding $F_{\text{mask}} \in \mathbb{R}^{H \times W \times C}$ through two separate sets of convolution layers and bilinear interpolation operations, where $H = 4H_1$ and $W = 4W_1$. Also, $F_{\text{encoder}}$ is fed into the transformer decoder for attention with the queries.
Center-guided query selection: Traditional transformer-based panoptic segmentation models [6,8,9] typically use randomly initialized embeddings to learn queries without distinguishing between things and stuff. The proposed network learns things and stuff separately to prevent thing queries and stuff queries from interfering with each other. Inspired by center-based learning for object detection [16][17][18][19], we develop a center-guided query selection mechanism for thing queries. The center regions of individual instances in input images contain cues that distinguish different instances. Thus, we estimate a center heatmap to guide effective thing query selection.
Figure 3 shows a diagram of the proposed center-guided query selection. The center embedding $F_{\text{center}}$ passes through an FFN to estimate the center heatmap $\mathbf{H} \in \mathbb{R}^{H \times W}$, which contains the location information of the instances. Then, we obtain the center-guided feature $\tilde{F}_{\text{center}} \in \mathbb{R}^{H \times W \times C}$ by element-wise multiplication between the estimated center heatmap $\mathbf{H}$ and each channel of $F_{\text{center}}$. Next, we employ the feature selection process in [10,11] to determine the top $K$ query features from the center-guided feature $\tilde{F}_{\text{center}}$. $\tilde{F}_{\text{center}}$ passes through a linear layer and softmax to obtain the class probability map $P \in \mathbb{R}^{H \times W \times C_{\text{thing}}}$ for things, where $C_{\text{thing}}$ is the number of thing classes. Then, we pick the highest probability in $P$ for each pixel and construct the thing queries $Q_{\text{thing}} \in \mathbb{R}^{N_{\text{thing}} \times C}$, where $N_{\text{thing}} = K$, by selecting the $K$ features of $\tilde{F}_{\text{center}}$ whose highest probabilities are the largest. Since the center heatmap $\mathbf{H}$ has high values at the central parts of the instances, it emphasizes visual patterns related to the instances, which yields effective thing queries $Q_{\text{thing}}$. Note that we perform center-guided query selection only for the thing queries $Q_{\text{thing}}$, while we simply set the stuff queries $Q_{\text{stuff}} \in \mathbb{R}^{N_{\text{stuff}} \times C}$ as $N_{\text{stuff}}$ randomly initialized embeddings.
Transformer decoder with decoupling mask: The queries must be trained to carry enough information to derive classes and masks. For this purpose, $Q_{\text{thing}}$ and $Q_{\text{stuff}}$ are combined into a query set $Q$ and fed to the transformer decoder, which includes self-attention, deformable attention, and an FFN, as in Figure 4. Considering the different properties of things and stuff, we apply a decoupling mask $D \in \mathbb{R}^{(N_{\text{thing}}+N_{\text{stuff}}) \times (N_{\text{thing}}+N_{\text{stuff}})}$ to self-attention, where each element $D(i, j)$ is defined by Equation (1) according to whether queries $i$ and $j$ belong to the same group (both thing or both stuff) or to different groups. Self-attention in the transformer decoder is then computed with $D$ applied to the attention between $Q_l$ and $K_l$, where $Q_l$, $K_l$, and $V_l$ are the query, key, and value obtained from $Q$ through a linear layer, respectively. The decoupling mask $D$ thus prevents interference between thing and stuff queries.
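The center-guided query selection described in this section can be illustrated with a small sketch. It is not the authors' implementation: the feature map is flattened to a list of pixels, the linear classifier is a hypothetical bias-free weight matrix (`class_weight`), and the FFN that predicts the heatmap is replaced by a given list of center scores. Only the three steps stated in the text are reproduced: weighting by the heatmap, taking the highest class probability per pixel, and keeping the top K features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def center_guided_query_selection(f_center, heatmap, class_weight, k):
    """Pick the k pixel features with the highest thing-class probability.

    f_center:     H*W pixel features (flattened), each a list of C floats
    heatmap:      H*W center scores, high near instance centers
    class_weight: C_thing rows of C floats; a hypothetical bias-free linear layer
    k:            number of thing queries to keep
    """
    # Center-guided feature: scale every channel of a pixel by its center score.
    guided = [[h * v for v in feat] for feat, h in zip(f_center, heatmap)]

    # Score each pixel by its highest class probability.
    scores = []
    for feat in guided:
        logits = [sum(w * v for w, v in zip(row, feat)) for row in class_weight]
        scores.append(max(softmax(logits)))

    # Keep the k best-scoring pixels, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return top, [guided[i] for i in top]


# Toy input: four pixels, two channels, two thing classes.
features = [[1.0, 0.0], [4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
centers = [0.0, 1.0, 0.5, 0.25]
weights = [[1.0, 0.0], [0.0, 1.0]]
indices, thing_queries = center_guided_query_selection(features, centers, weights, k=2)
print(indices)        # [1, 2]
print(thing_queries)  # [[4.0, 0.0], [0.0, 2.0]]
```

In the toy input, pixel 0 has a strong feature but a center score of zero, so it is never selected; the heatmap suppresses features that lie away from instance centers before the class-probability ranking takes place.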
For stable training, we add a residual connection with $Q$ and perform layer normalization after the residual connection. After self-attention, we use deformable attention to inject $F_{\text{encoder}}$ into $Q$, resulting in the enhanced query set $\tilde{Q}$.
Estimation: Masks and classes are estimated from the enhanced query set $\tilde{Q}$. First, masks are computed using the dot product between $\tilde{Q}$ and the mask embedding $F_{\text{mask}}$. Second, $\tilde{Q}$ passes through a fully connected layer to predict the class probability. Finally, we obtain panoptic segmentation results from the mask and class predictions.
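The decoupled self-attention step and the dot-product mask estimation can be sketched as follows. Several points are assumptions rather than details confirmed by the text: the mask is taken to be additive (0 within a group, negative infinity between thing and stuff), the attention logits are scaled by the square root of the channel dimension, the queries are used directly as query, key, and value in place of the linear projections, and layer normalization has no learnable parameters. Deformable attention is omitted.

```python
import math


def decoupling_mask(n_thing, n_stuff):
    """Assumed additive mask: 0 within a group, -inf between thing and stuff."""
    n = n_thing + n_stuff
    return [[0.0 if (i < n_thing) == (j < n_thing) else -math.inf
             for j in range(n)] for i in range(n)]


def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable scale and shift."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]


def decoupled_self_attention(queries, mask):
    """Masked self-attention, then residual connection and layer normalization.

    Simplification: queries serve directly as query, key, and value,
    standing in for the three linear projections of the decoder.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for i, q_i in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q_i, k_j)) + mask[i][j]
                  for j, k_j in enumerate(queries)]
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]  # exp(-inf) is 0.0
        total = sum(weights)
        attended = [sum(w / total * v_j[c] for w, v_j in zip(weights, queries))
                    for c in range(len(q_i))]
        out.append(layer_norm([a + b for a, b in zip(attended, q_i)]))
    return out


def mask_logits(query, f_mask):
    """Per-pixel mask logit: dot product of one query with each pixel embedding."""
    return [sum(a * b for a, b in zip(query, pixel)) for pixel in f_mask]


# Two thing queries followed by one stuff query.
mask = decoupling_mask(n_thing=2, n_stuff=1)
base = decoupled_self_attention([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], mask)
changed = decoupled_self_attention([[1.0, 0.0], [0.0, 1.0], [-5.0, 9.0]], mask)
print(base[:2] == changed[:2])  # True: thing outputs ignore the stuff query
```

Changing the stuff query leaves the two thing outputs untouched, which is the behavior the decoupling mask is meant to enforce in self-attention.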

Loss
The proposed network outputs $N_{\text{thing}} + N_{\text{stuff}}$ predictions, including masks and classes. We then apply the Hungarian algorithm [22] to match predictions and ground truths, following [6][7][8][9]. For each match, we compute the focal loss [23] between the class probability prediction $c_k$ and the ground truth $\hat{c}_k$, weighted by $\lambda_{\text{class}}$ and parameterized by $\alpha$ and $\gamma$, where $\lambda_{\text{class}}$, $\alpha$, and $\gamma$ were experimentally set to 4, 0.25, and 2, respectively. Also, to compare the estimated mask $M_k \in \mathbb{R}^{H_0 \times W_0}$ with the ground truth $\hat{M}_k$, we employ the mask loss $L_m(M_k, \hat{M}_k)$ in [8], which is composed of the per-pixel cross-entropy loss $L_{\text{pixel}}(M_k, \hat{M}_k)$ and the dice loss [24] $L_{\text{dice}}(M_k, \hat{M}_k)$:
$$L_m(M_k, \hat{M}_k) = \lambda_{\text{pixel}} L_{\text{pixel}}(M_k, \hat{M}_k) + \lambda_{\text{dice}} L_{\text{dice}}(M_k, \hat{M}_k),$$
where $\lambda_{\text{pixel}}$ and $\lambda_{\text{dice}}$ were both set to 5, according to [9]. Additionally, to train the center-guided query selection module, we generate the ground-truth heatmap $\hat{\mathbf{H}}$ by applying Gaussian distributions to all instance center points of each image. Then, we compute the focal loss between the predicted center heatmap $\mathbf{H}$ and $\hat{\mathbf{H}}$.
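The loss terms named above can be sketched in a few lines. The weights and focal parameters are the values reported in the text; everything else is an assumption made for illustration: the focal loss is written in its standard binary form from [23], the dice loss uses a smoothing term of 1, the Gaussian width `sigma` is a hypothetical value, and overlapping Gaussians are merged by taking the maximum.

```python
import math

LAMBDA_CLASS, ALPHA, GAMMA = 4.0, 0.25, 2.0   # values reported in the text
LAMBDA_PIXEL, LAMBDA_DICE = 5.0, 5.0          # values reported in the text


def focal_loss(p, target, alpha=ALPHA, gamma=GAMMA):
    """Binary focal loss [23] for one predicted probability and a 0/1 label."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def pixel_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over the pixels of a flattened mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Dice loss with an assumed smoothing term of 1."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def mask_loss(pred, target):
    """Weighted sum of per-pixel cross-entropy and dice loss."""
    return (LAMBDA_PIXEL * pixel_cross_entropy(pred, target)
            + LAMBDA_DICE * dice_loss(pred, target))


def gaussian_heatmap(height, width, centers, sigma=1.0):
    """Ground-truth heatmap: a Gaussian peak of value 1 at each (row, col) center."""
    heat = [[0.0] * width for _ in range(height)]
    for cy, cx in centers:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
                heat[y][x] = max(heat[y][x], g)  # overlapping peaks keep the max
    return heat


class_loss = LAMBDA_CLASS * focal_loss(0.9, 1)
print(round(class_loss, 6))
good = mask_loss([0.9, 0.8, 0.1], [1, 1, 0])
poor = mask_loss([0.5, 0.5, 0.5], [1, 1, 0])
print(good < poor)  # True: the sharper prediction has the lower mask loss
```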

Experiments

Setting
Dataset: We evaluated the proposed network on two panoptic segmentation datasets: the COCO panoptic [12] and ADE20K panoptic [13] datasets. The COCO panoptic dataset consists of images annotated with mask and class labels for 80 thing and 53 stuff classes. It is divided into a training set, a validation set, and a test set, which contain 118,785, 5000, and 5000 images, respectively. The ADE20K panoptic dataset provides object- and semantic-level information for object detection and segmentation. It consists of 100 thing classes and 50 stuff classes. The dataset contains 20,210 images for the training set and 2000 images for the validation set. Figures 5 and 6 show examples of the COCO panoptic and ADE20K panoptic datasets.
Implementation details and training settings: We implemented the model using the detectron2 [25] platform, based on PyTorch. For training, the size of the input images was set to 1280 × 1280 on COCO and to 640 × 640 on ADE20K. We employed the convolution-based ResNet50 [21] and ResNet101 [21] and the Swin-Transformer [26] as backbone networks. The transformer encoder and the transformer decoder were repeated six times and nine times, respectively. The numbers of thing and stuff queries were 300 and 53, respectively. During training, we used four NVIDIA RTX A6000 GPUs, with a batch size of 4 per GPU. We trained the proposed network for 50 epochs on the COCO panoptic dataset and for 120 epochs on the ADE20K panoptic dataset. We optimized the proposed network using the AdamW optimizer. The initial learning rate was set to 1 × 10^−4, and multi-step learning rate scheduling was applied to decay the learning rate at specific epochs. The reduction rate was set to 1/10, and the learning rate was decayed at epochs 36 and 48 for the COCO panoptic dataset and at epochs 75 and 105 for the ADE20K panoptic dataset. The weight decay was set to 0.05.
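As a small illustration of the multi-step schedule described above, the following sketch computes the learning rate at a given epoch. The function name and the convention that the decay takes effect from the milestone epoch onward (with epochs counted from 0) are assumptions; the base rate, reduction rate, and milestones are the values stated in the text.

```python
def learning_rate(epoch, milestones, base_lr=1e-4, decay=0.1):
    """Step schedule: multiply base_lr by decay once per milestone reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** reached


COCO_MILESTONES = (36, 48)     # 50 training epochs in total
ADE20K_MILESTONES = (75, 105)  # 120 training epochs in total
TOTAL_BATCH_SIZE = 4 * 4       # four GPUs with a batch size of 4 each

for epoch in (0, 36, 48):
    print(epoch, learning_rate(epoch, COCO_MILESTONES))
```

With these settings the rate stays at 1 × 10^−4 for most of training and drops by a factor of ten at each of the two milestones.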
Evaluation metrics: For evaluation, we used Panoptic Quality (PQ) [1] to measure the performance in both classification and segmentation. We also used the additional metric $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$, which measures the average precision (AP) of segmentation for thing categories, to demonstrate the effectiveness of the proposed method for thing classes. Both PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ range from 0 to 100.
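For reference, PQ as defined in [1] counts a predicted segment and a ground-truth segment of the same class as a true positive when their IoU exceeds 0.5, and divides the summed IoU of the true positives by the number of true positives plus half the false positives and half the false negatives. The sketch below computes this for a single class and scales it to the 0 to 100 range used here; the function name and the return value for a class without any segments are assumptions for illustration. The reported PQ is the average of the per-class values.

```python
def panoptic_quality(matched_ious, num_false_pos, num_false_neg):
    """PQ for one class, scaled to [0, 100].

    matched_ious: IoU of every true-positive pair (a prediction and a ground
                  truth of the same class whose IoU is above 0.5)
    """
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_false_pos + 0.5 * num_false_neg
    if denom == 0:
        return 0.0  # assumed convention for a class with no segments at all
    return 100.0 * sum(matched_ious) / denom


# Two matched segments, one spurious prediction, one missed ground truth.
print(round(panoptic_quality([0.8, 0.6], 1, 1), 2))  # 46.67
```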

Comparison with Other Methods
COCO panoptic dataset: In Table 1, we compare the proposed method with existing panoptic segmentation methods [3,5-9,14,27] on the COCO panoptic [12] dataset. Table 1 shows the PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ scores of the existing methods, which were obtained from the respective papers. The proposed method outperforms both non-transformer-based methods [3,5] and transformer-based ones [6][7][8][9]. Specifically, the proposed method surpasses the prior state of the art (Mask2Former [9]) by margins of 0.3 and 2.4 in terms of PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$, respectively. The proposed method achieves strong performance for thing classes (58.4 $\mathrm{PQ}^{\mathrm{thing}}$), which indicates that the proposed center-guided query selection is effective in exploiting features for different instances in each image. The highest $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ score shows that the proposed network is effective in the segmentation of thing classes. Figure 7 shows a qualitative comparison of the proposed method with MaskFormer and Mask2Former on the COCO panoptic dataset.
As shown in Figure 7, the proposed network provides more accurate segmentation results and distinguishes different instances better than MaskFormer and Mask2Former. For example, in the third row of Figure 7, the proposed method significantly enhances the detection of things, resulting in more thing masks than both MaskFormer and Mask2Former. Specifically, while MaskFormer merges the masks of several individual cakes into a single mask, Mask2Former completely fails to detect and segment the cake instances. In contrast, the proposed method faithfully detects individual cake instances and provides accurate segmentation masks. Moreover, as illustrated in the stuff region in the fourth row, MaskFormer incorrectly classifies the fence class as the tree class and yields merged masks. Also, Mask2Former fails to obtain segmentation masks for tree regions. In contrast, the proposed method provides accurate masks for stuff classes, including tree, fence, window, and wall-brick.
ADE20K panoptic dataset: Table 2 compares the proposed method with IDNet [14], MaskFormer [8], Panoptic SegFormer [6], YOSO [15], and Mask2Former [9] in terms of PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ on ADE20K. The proposed method achieves the best performance in all metrics. Our method surpasses Mask2Former by over 1.8 and 2.7 in terms of PQ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$, respectively. Figure 8 shows a qualitative comparison of the proposed method with MaskFormer and Mask2Former. We observe that the proposed method effectively distinguishes thing and stuff classes and yields more accurate mask and class results than MaskFormer and Mask2Former. For instance, in the fourth row of Figure 8, MaskFormer produces inaccurate mask results for the chair class, leading to incorrect or incomplete masks for some chair instances. Mask2Former completely misses the chair and table instances. However, the proposed method yields accurate chair and table segmentation results. Furthermore, in the last row, MaskFormer fails to find wall regions and thus misclassifies them as the building or water class. Mask2Former accurately predicts the stuff area but fails to segment instances such as houses and stairs accurately. In contrast, the proposed method not only accurately predicts segmentation results for stuff but also precisely extracts thing segmentation results from the image.

Ablation Study
In Tables 3 and 4, we conduct ablation studies to validate the effectiveness of center-guided query selection and the decoupling mask. Tables 3 and 4 list the performance of the proposed network without center-guided query selection and without the decoupling mask on COCO and ADE20K, respectively. As shown in Tables 3 and 4, both components improve the performance in all metrics. Specifically, center-guided query selection increases the $\mathrm{PQ}^{\mathrm{thing}}$ and $\mathrm{AP}^{\mathrm{th}}_{\mathrm{pan}}$ scores by 1.2 and 2.1 on COCO and by 1.7 and 2.8 on ADE20K. This indicates that the center-guided feature $\tilde{F}_{\text{center}}$ is effective in segmenting objects and distinguishing different instances. Also, without the decoupling mask, $\mathrm{PQ}^{\mathrm{thing}}$ and $\mathrm{PQ}^{\mathrm{stuff}}$ are degraded on both COCO and ADE20K. When we remove both components, the PQ scores are reduced by 1.8 and 2.1 on COCO and ADE20K, respectively. These results indicate that the proposed modules are essential for accurate panoptic segmentation.
There are three approaches to learning thing and stuff queries: (1) center-guided query selection, (2) feature selection [10,11], and (3) random initialization. Table 5 shows an ablation study over combinations of thing and stuff query learning. We observe that the combination of center-guided query selection for things and random initialization for stuff, i.e., the proposed method, yields the best performance. When feature selection is adopted for stuff instead of random initialization, the accuracy degrades. Also, the proposed combination surpasses the traditional feature selection in [10,11].
Table 6 lists the panoptic segmentation performance for various backbones: (1) ResNet50, (2) ResNet101, and (3) Swin-T [26]. Comparing ResNet50 and ResNet101 shows that the performance improves as the number of parameters increases. Also, the transformer-based backbone [26] yields the best performance, even though it uses fewer parameters than ResNet101.

Conclusions
We proposed a panoptic segmentation network that predicts masks and classes for things and stuff. The key insight of the proposed network is to generate effective thing and stuff queries for panoptic segmentation. First, we developed center-guided query selection, which exploits center information to detect and segment individual instances. Second, we applied a decoupling mask to the transformer decoder, which prevents interaction between thing and stuff queries. Experiments on COCO and ADE20K validated that the proposed panoptic segmentation network outperforms the existing methods, especially with respect to things. Despite its effectiveness, the proposed panoptic segmentation network has a limitation with respect to stuff, as reported in Table 1. Therefore, generating effective queries for stuff classes remains a direction for future work.

Figure 1. Approaches for query learning: (a) randomly initialized embeddings, (b) query selection, and (c) proposed center-guided query selection and decoupling mask.

Figure 2 illustrates an overview of the proposed panoptic segmentation network. In this section, we introduce the proposed center-guided query selection module and the transformer decoder with the decoupling mask.

Figure 2. Overview of the proposed network. We use the backbone and transformer encoder to extract the image feature $F_{\text{encoder}}$ from an input image $X$. $F_{\text{encoder}}$ is gradually upsampled to obtain the center embedding $F_{\text{center}}$ (orange block) and the mask embedding $F_{\text{mask}}$ (purple block). Thing queries are selected from $F_{\text{center}}$ through the center-guided query selection process, while stuff queries are randomly initialized. The transformer decoder generates enhanced queries based on the attention mechanism between $F_{\text{encoder}}$ and the queries with the decoupling mask $D$. Then, the enhanced queries are transformed into mask and class predictions for panoptic segmentation.

Figure 3. Diagram of center-guided query selection. It generates the center-guided feature $\tilde{F}_{\text{center}}$ using the estimated heatmap $\mathbf{H}$, which contains the center information of the instances. Feature selection extracts $K$ queries from the center-guided feature $\tilde{F}_{\text{center}}$ based on thing class probabilities.

Figure 4. Diagram of the transformer decoder. Given the query set $Q$, it generates the enhanced query set $\tilde{Q}$ through self-attention and deformable attention with the image feature $F_{\text{encoder}}$.

Figure 5. Examples of COCO panoptic images [12]. The first row and the second row show images and their annotations, respectively.

Figure 6. Examples of ADE20K panoptic images [13]. The first row and the second row show images and their annotations, respectively.

Table 1. Comparison of the proposed method with existing panoptic segmentation networks on the COCO panoptic [12] val2017 dataset. The best results are boldfaced.

Table 2. Comparison of the proposed method with existing panoptic segmentation networks on the ADE20K panoptic [13] validation dataset. The best results are boldfaced.

Table 3. Ablation study on the COCO panoptic val2017 dataset according to different settings. The best results are boldfaced.

Table 4. Ablation study on the ADE20K panoptic validation set according to different settings. The best results are boldfaced.

Table 5. Ablation study on the COCO panoptic val2017 dataset according to query learning settings. The best results are boldfaced.