Article

EGSDK-Net: Edge-Guided Stepwise Dual Kernel Update Network for Panoptic Segmentation

Pengyu Mu, Hongwei Zhao and Ke Ma
1 College of Computer Science and Technology, Jilin University, Changchun 130012, China
2 School of Mechanical and Aerospace Engineering, Jilin University, Changchun 130025, China
* Authors to whom correspondence should be addressed.
Algorithms 2025, 18(2), 71; https://doi.org/10.3390/a18020071
Submission received: 16 November 2024 / Revised: 12 January 2025 / Accepted: 20 January 2025 / Published: 1 February 2025
(This article belongs to the Special Issue Machine Learning Algorithms for Image Understanding and Analysis)

Abstract

In recent years, panoptic segmentation has garnered increasing attention from researchers aiming to better understand scenes in images. Although many excellent studies have been proposed, they share some common unresolved issues. First, panoptic segmentation, as a novel task, is still confined within inherent frameworks. Second, the prevalent kernel update strategies do not adequately utilize the information from each stage. To address these two issues, we propose an edge-guided stepwise dual kernel update network (EGSDK-Net) for panoptic segmentation; its core components are the real-time edge guidance module and the stepwise dual kernel update module. The first component extracts and localizes edge features through an extra branch and applies them to the feature maps normally transmitted within the network to highlight edges. The input image is initially processed with the Canny edge detector to generate and store a predicted edge map, which acts as the ground truth for supervising the extracted edge feature map. The stepwise dual kernel update module improves information utilization by allowing each stage to update both its own kernel and that of the subsequent stage, thereby strengthening the discriminative capability of the kernels. EGSDK-Net achieves a PQ of 60.6, a 2.19% improvement over RT-K-Net.

1. Introduction

Panoptic segmentation is an important task in computer vision, with many real-world applications, such as autonomous driving, relying on the detailed information it provides. Researchers introduced panoptic segmentation [1] based on semantic segmentation and instance segmentation to achieve a more comprehensive understanding of scenes. The latter two categorize the pixels in the image into “stuff” and “thing” classes, while panoptic segmentation combines the characteristics of both. It requires classifying countable objects in the “thing” classes and classifying amorphous regions in the “stuff” classes. To better accomplish this task, recent state-of-the-art approaches [2,3,4,5,6,7] have employed a unified mask classification method for panoptic segmentation. Among them, RT-K-Net [5] has improved the structure, training, and inference process of K-Net [2], achieving real-time high-performance panoptic segmentation. Nonetheless, we must point out that there are still some clear defects in the existing methods that remain unresolved.
First, panoptic segmentation, as a novel task, is still confined within the framework of traditional segmentation tasks. It can be approximately understood as the classification of all relevant objects in the image, but viewing this task exclusively through the perspectives of “stuff” and “thing” categories has notable limitations. For instance, YOSO [6] introduces a real-time panoptic segmentation framework. By employing dynamic convolutions and feature pyramid aggregation, YOSO achieves competitive performance while reducing computational overhead, making it suitable for real-time applications. Panoptic SwiftNet [7] utilizes scale-equivariant feature extraction, cross-scale upsampling through pyramidal fusion, and boundary-aware learning of pixel-to-instance assignment to achieve fast inference over large input resolutions. Although YOSO [6] and Panoptic SwiftNet [7] demonstrate excellent efficiency and results, they are still limited to the traditional segmentation framework that classifies solely based on the “stuff” and “things” categories. Moreover, some recent segmentation algorithms [8,9] have attempted to incorporate boundary information to improve segmentation performance. Ref. [8] introduced an improved Mask R-CNN algorithm for segmenting photovoltaic hot spots in thermal infrared images. The proposed method enhances the edge features of hot spots using residual neural networks and integrates a feature pyramid structure with an edge-guided approach, improving segmentation accuracy significantly. On the other hand, ref. [9] focused on the fusion of multimodal information in high-resolution satellite and airborne remote sensing images. The authors proposed a multimodal fusion network with edge detection guidance to bridge the semantic gap and improve feature extraction. The method enhances segmentation performance by incorporating spatial boundary information and designing an adaptive fusion block to optimize multi-level feature integration. These two works underscore the importance of boundary information in improving segmentation accuracy. However, existing panoptic segmentation methods [2,3,4,5,6,7,10,11,12,13] have limited utilization of boundary information. Hence, we believe that integrating insights from external domains to address the limitations of the traditional framework could be a promising direction for future research. In addition, regarding the kernel initialization of K-Net [2], it employs two branches for mask and kernel initialization. However, these two branches are not essential, as they introduce unnecessary complexity to the model and deviate from the original design concept of a unified structure. Existing real-time panoptic segmentation methods [10,11,12,13] rely on a dual-branch structure for semantic segmentation and instance segmentation, which results in shortcomings in both accuracy and inference speed. Therefore, RT-K-Net [5] takes it a step further by using a single unified branch for initialization. Yet, we find that while the initialization process has been optimized, the kernel update still maintains the conventional “progressive” approach. When the kernel update moves to the next update step, the information from the previous step does not contribute to the process, which undoubtedly leads to a waste of information and ultimately results in suboptimal mask prediction outcomes. 
As shown in Figure 1, our method achieves the best PQ on the benchmark Cityscapes dataset, surpassing RT-K-Net with only a small increase in time cost and significantly outperforming other real-time panoptic segmentation algorithms [10,11,12,13]. The information from the earlier step can not only prevent information from heading in the wrong direction but also help stabilize the inference and training processes. Making effective use of this information is therefore crucial for enhancing the performance of panoptic segmentation.
Based on the considerations mentioned above, we propose a real-time edge guidance module and a stepwise dual kernel update module. The former is designed to better align with the requirements of the panoptic segmentation task. Panoptic segmentation classifies almost all target pixels, while the traditional concept of edges also refers to the boundaries between any objects. The two possess a fundamental similarity in their task properties. Distinguishing different objects through edges will enhance the segmentation accuracy in target boundary areas and promote stronger contrast in target regions. To achieve this, we employ a simple structure to extract edge cues. We begin by taking feature maps from the lower stage, followed by extracting edge features using convolutional structures. We apply supervision to these edge features to ensure ongoing optimization. Finally, we utilize both the max pooling layer and the average pooling layer along the channel dimension to improve attention to spatial edge positions and incorporate the edge information into the backbone’s final output to achieve edge-enhanced feature maps. The stepwise dual kernel update module optimizes the kernel update process based on RT-K-Net [5]. Specifically, previous kernel update modules only had sequential connections between the structural stages, which did not fully utilize information from the earlier stage. To this end, we optimize the kernel generation process in the kernel update step. We first generate the kernels for the current and the next stages. Then, the kernel from the current stage is fused with the kernel from the previous stage (if available) and is convolved with the feature maps to generate the mask for the current stage. Meanwhile, the kernel for the next stage is preserved and will be used in the subsequent stage. The combination of the two with RT-K-Net forms our proposed edge-guided stepwise dual kernel update network (EGSDK-Net). Our contributions can be summarized as follows:
(1)
We propose a novel architecture for the field of panoptic segmentation, called EGSDK-Net. It liberates panoptic segmentation from the constraints of semantic and instance segmentation for the first time and offers valuable structural contributions to the field.
(2)
A real-time edge guidance module (RTEGM) is designed. This module not only introduces a novel theory regarding the relationship between edge detection and panoptic segmentation but also enhances segmentation performance through a lightweight structure.
(3)
A stepwise dual kernel update module (SDKUM) is proposed. Considering the limitations of past methods in utilizing system information, SDKUM addresses this issue through a clever design. By using information more effectively, it successfully promotes advancements in model capabilities.

2. Related Work

Panoptic segmentation methods: One of the primary aims of introducing the panoptic segmentation task is to integrate semantic segmentation and instance segmentation tasks. According to the development progress of this task, it can be divided into two main categories: early panoptic segmentation methods [15,16,17,18] and unified architectures for panoptic segmentation [2,3,4,19]. The early methods either generated instance predictions using bounding box proposals [15,16] or created panoptic segmentation through alternative design strategies that did not require proposals [17,18]. In contrast, recent unified architectures for panoptic segmentation directly predict segmentation masks corresponding to their respective classes. The K-Net [2] framework employed a set of learnable kernels for consistent segmentation of instances and semantic categories, proposing a kernel update strategy to resolve the challenges of distinguishing various instances. Mask2Former [3] constructed a universal architecture for image segmentation, which includes a backbone feature extractor, a pixel decoder, and a Transformer decoder, applying masked attention along with multiple optimization improvements. In addition, some methods focus on real-time panoptic segmentation [5,10,11,12,13]. Petrovai et al. [12] tackled panoptic segmentation as a dense classification problem and generated masks for “stuff” classes as well as for each instance of “things” classes. RT-K-Net [5] is the first unified architecture for real-time panoptic segmentation. It is based on K-Net and comprehensively optimizes its structure, training, and inference processes. YOSO [6] also proposes a real-time panoptic segmentation framework. By utilizing dynamic convolutions and feature pyramid aggregation, YOSO achieves competitive performance while reducing computational overhead, making it suitable for real-time applications. Panoptic SwiftNet [7] focuses on efficient multi-scale feature extraction and boundary-aware learning, making it particularly well-suited for large-scale remote sensing applications. Considering the various shortcomings of the aforementioned methods, we propose the edge-guided stepwise dual kernel update network (EGSDK-Net).
Edge-guided methods: Due to edges being considered an additional target attribute, earlier deep learning methods generally did not integrate edge information into the network. Recently, with the growing emphasis on model performance, researchers have begun to recognize that edges can accurately represent the contours of objects in images. Consequently, numerous edge-guided solutions have emerged across various visual tasks [20,21,22,23,24,25]. To address the common oversight of edge information in salient object detection for optical remote sensing images, ERPNet [20] introduced an edge-guided recurrent positioning network, which features edge-aware position attention units as its core module to emphasize prominent objects in these images. MLEFGN [22] introduced a framework that integrates edge detection, edge guidance, and image denoising into an end-to-end CNN model to address the issues of complex scenes and information loss. Lin et al. [24] proposed an edge-guided generative network model to produce semantically consistent outputs from small images, which provide limited information. EGCNet [25] treated the image stitching synthesis phase as an image blending problem and utilized perceptual edges to guide the network with additional geometric priors, enhancing the preservation of structural consistency. Ref. [8] proposed an improved Mask R-CNN algorithm for photovoltaic hot spot segmentation in thermal infrared images. By introducing an edge-guided feature pyramid structure and a spatial attention module, this method significantly improves segmentation accuracy. Ref. [9] addressed the challenge of multimodal fusion for high-resolution remote sensing images. The proposed method incorporates an edge detection guidance module to fuse multi-level features and spatial boundary information, thus improving the segmentation performance for complex scenes. Among the methods mentioned above, refs. [21,23,24] utilized the Canny method [26] to generate edges or treated its edge predictions as ground truth. Although the Canny method [26] is somewhat old-fashioned, it still provides a favorable balance between efficiency and accuracy, making it a technique worth considering.

3. Method

In this section, we will provide a detailed introduction to the proposed EGSDK-Net from three aspects: the overall architecture, the real-time edge guidance module, and the stepwise dual kernel update module. Since our method is built upon RT-K-Net [5] and K-Net [2], in the section focusing on the stepwise dual kernel update module, we will review the workflows of K-Net and RT-K-Net, while also presenting our improvements for a better understanding.

3.1. Overall Structure

The overall structure of EGSDK-Net is shown in Figure 2. First, we utilize the backbone composed of convolutional layers and RTFormer blocks proposed in RTFormer [27] to obtain the basic feature maps. Since RTFormer [27] employs a dual-branch module starting from stage three, the top layer of the backbone outputs low-resolution and high-resolution feature maps, denoted as $R_5^{low}$ and $R_5^{high}$, respectively. We denote the output of the second stage as $R_2$. Since $R_2$ has a high resolution and is rich in detailed information, we use it as the input for the real-time edge guidance module (RTEGM). The edge feature map extracted and supervised by the RTEGM subsequently affects $R_5^{low}$ through weighted application. The reason for choosing $R_5^{low}$ instead of $R_5^{high}$ is that $R_5^{low}$ retains a significant amount of semantic information but loses many detailed cues; enhancing its discrimination of target areas using the edge feature map is therefore more effective. The edge-guided $R_5^{low}$ is then enhanced through DAPPM [28] to improve its ability to perceive multi-scale objects. After upsampling, the output of this process is combined with $R_5^{high}$ and processed through an additional convolutional module to mitigate aliasing effects, yielding the feature map $F$. The stage that follows the application of the edge feature map constitutes the neck of the entire model. The output $F$ from the neck is then fed into the head, which consists of the stepwise dual kernel update module (SDKUM), to obtain the model's mask predictions and class probability predictions. Considering that the kernels are convolved with the features to create mask predictions, additional supervision of the feature map $F$ is required to ensure accurate mask predictions. Following [5], we use the auxiliary loss function as follows:
$$L_{aux} = \omega_{rank} L_{rank} + \omega_{seg} L_{seg} + \omega_{disc} L_{disc},$$
where $\omega_{rank} = 0.1$, $\omega_{seg} = 1.0$, and $\omega_{disc} = 1.0$ are the balancing factors used in the auxiliary loss function to balance $L_{rank}$, $L_{seg}$, and $L_{disc}$, which is consistent with the design in RT-K-Net. $L_{rank}$ represents the mask-ID cross-entropy loss, $L_{seg}$ denotes the cross-entropy loss, and $L_{disc}$ is the contrastive loss function introduced by RT-K-Net [5]. The overall loss function of EGSDK-Net can be formulated as follows:
$$L_{total} = \omega_{mask} L_{mask} + \omega_{dice} L_{dice} + \omega_{cls} L_{cls} + \omega_{edge} L_{edge} + L_{aux},$$
where $\omega_{mask} = 1.0$, $\omega_{dice} = 4.0$, $\omega_{cls} = 2.0$, and $\omega_{edge} = 1.0$ are the balancing factors for the overall loss function. $\omega_{mask}$, $\omega_{dice}$, and $\omega_{cls}$ are consistent with the design in RT-K-Net, while $\omega_{edge}$ is set by us. $L_{mask}$ refers to the binary cross-entropy loss, $L_{dice}$ represents the dice loss, $L_{cls}$ denotes the focal loss, and $L_{edge}$ uses the balanced cross-entropy loss. The applications of $L_{mask}$, $L_{dice}$, and $L_{cls}$ are consistent with those in K-Net [2] and RT-K-Net [5], improving the guidance of stuff masks and thing masks during training. Moreover, we adopt the same training and inference optimizations, post-processing methods, and instance-based cropping augmentation as RT-K-Net [5]; please refer to [5] for more information on these steps.
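To make the weighting above concrete, the following minimal PyTorch sketch combines the loss terms exactly as in the two equations; the individual losses are represented by placeholder tensors (an assumption for illustration), since each would come from its own criterion in the real model.

```python
import torch

# Placeholder loss terms (assumed scalars); in the real model each comes from its
# own criterion (binary cross-entropy, dice, focal, balanced BCE, rank, contrastive).
losses = {name: torch.zeros((), requires_grad=True)
          for name in ("mask", "dice", "cls", "edge", "rank", "seg", "disc")}

# Balancing factors as reported above.
weights = {"mask": 1.0, "dice": 4.0, "cls": 2.0, "edge": 1.0,
           "rank": 0.1, "seg": 1.0, "disc": 1.0}

# Auxiliary loss supervising the fused feature map F.
loss_aux = sum(weights[k] * losses[k] for k in ("rank", "seg", "disc"))

# Overall training objective of EGSDK-Net.
loss_total = sum(weights[k] * losses[k] for k in ("mask", "dice", "cls", "edge")) + loss_aux
loss_total.backward()  # a single backward pass through the combined objective
```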

3.2. Real-Time Edge Guidance Module

Edge detection and panoptic segmentation are closely related. Edges outline the contours of the objects to be detected, while panoptic segmentation not only divides the areas of each object but also categorizes the segmented regions. As a result, they naturally exhibit similar attributes. The current architecture for panoptic segmentation typically trains masks and categories separately. This decoupling is more advantageous for the integration of edge detection and panoptic segmentation tasks. A critical question to ponder is how to achieve reliable edges and appropriately incorporate them into the information flow. There is no intention to use complex structures to obtain edges, as their primary role is to guide panoptic segmentation in focusing on the contours of different targets. Meanwhile, the edges obtained must be accurate, which poses challenges regarding the attention to spatial locations and the training optimization process. Based on this consideration, we designed and proposed a real-time edge guidance module (RTEGM).
The structure of the RTEGM is illustrated in Figure 3. The feature map $R_2$ from the lower stages of the backbone serves as the input for the RTEGM. The process of extracting $R_2$ does not involve cross-token interaction, so it is primarily composed of local information with high resolution, which greatly benefits the extraction of edge features. $R_2$ is first passed through a convolutional layer for additional information extraction and transmission. Subsequently, it is fed into the dual-branch structure of the RTEGM. The first branch is designed for the application of edge features. Inspired by the spatial attention module in CBAM [29], we perform max pooling and average pooling operations along the channel dimension on $R_2$ to generate feature maps $R_{max}$ and $R_{avg}$ at different contextual scales. This part of the design makes the module parameters focus more on the positions of edge features after backpropagation. $R_{max}$ and $R_{avg}$ are then aggregated to produce the edge feature descriptor $R_{mix}$, which possesses characteristics from both max pooling and average pooling. This process can be formulated as follows:
$$R_{mix} = \mathrm{AvgPool}(R_2) + \mathrm{MaxPool}(R_2),$$
where $R_{mix} \in \mathbb{R}^{B \times C \times H \times W}$, $B$ is the batch size, $C$ indicates the number of feature channels, and $H$ and $W$ are the height and width of $R_5$, respectively. Unlike CBAM [29], which concatenates the pooled results and uses a large 7 × 7 convolution kernel to compress the channels and obtain a 2D spatial attention map, our module does not use convolution to filter the edge feature maps. Instead, we directly use additive aggregation to preserve all edge position information, avoiding the loss or insufficiency of contour information that could limit the improvement of panoptic segmentation performance. Subsequently, $R_{mix}$ is applied as weights to $R_5^{low}$; this process can be formulated as follows:
$$R_5^{low} = R_5^{low} \times (1 + R_{mix}).$$
Since all edge positions are retained in the previous step, to mitigate potential noise interference we add 1 to $R_{mix}$, so that the original target-region contour information is also referenced. It is worth mentioning that we do not normalize $R_{mix}$ with the sigmoid function as most methods do; instead, we use $R_{mix}$ itself as the weight. Since $R_{mix}$ contains the essential information from all channels of each pixel in two forms, its value accurately reflects the significance of each point with respect to edge features. Normalizing with the sigmoid function inevitably reduces the differences in response intensity concerning this significance; we believe this wastes the extracted edge information and ultimately results in suboptimal performance in edge-guided panoptic segmentation. After edge enhancement, $R_5^{low}$ participates in the subsequent processes described in Section 3.1. The other branch of the RTEGM supervises and continuously optimizes the edge features through backpropagation. This branch consists solely of a 1 × 1 convolution, functioning as an edge prediction head, which produces a feature map of size $\mathbb{R}^{1 \times H \times W}$. After activation with the sigmoid function, this output is sent to $L_{edge}$ for supervision and guidance, which helps eliminate potential noise and improve the quality of the edge features. To provide accurate edge guidance without spending too much time generating the edge ground truth, we use the widely used and validated Canny detector [26] to generate edge predictions for the training images. Although Canny is a traditional edge detection method, it effectively classifies edge pixels based on the physical properties of the image and significantly reduces noise through filtering, non-maximum suppression, and double thresholding, yielding stable and reliable edge detection results. Given its robustness, we consider using the output of the Canny algorithm as the ground truth for $L_{edge}$ a valid and practical approach. Backpropagation directly affects the 1 × 1 convolution and the convolution that processes $R_2$ and is connected to it. The benefit of this approach is that backpropagation bypasses $R_{mix}$, allowing the optimization to focus directly on the edge feature map instead of $R_{mix}$, which is more similar to an edge attention map, thus improving edge prediction performance.
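The following is a minimal PyTorch sketch of the operations described above: channel-wise max and average pooling, additive aggregation, weighting of the low-resolution feature map by $(1 + R_{mix})$, a 1 × 1 edge prediction head, and Canny-based edge targets. The module name, layer shapes, the single-channel broadcast of the pooled descriptor, the bilinear resizing of $R_{mix}$ to match $R_5^{low}$, and the Canny thresholds are all assumptions for illustration, not the authors' exact implementation.

```python
import cv2
import torch
import torch.nn as nn
import torch.nn.functional as F


class RTEGMSketch(nn.Module):
    """Illustrative real-time edge guidance module (not the official implementation)."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.refine = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)  # extra extraction on R2
        self.edge_head = nn.Conv2d(in_ch, 1, kernel_size=1)              # 1x1 edge prediction head

    def forward(self, r2: torch.Tensor, r5_low: torch.Tensor):
        r2 = self.refine(r2)

        # Channel-wise max and average pooling, aggregated additively (no 7x7 conv as in CBAM).
        r_max, _ = r2.max(dim=1, keepdim=True)
        r_avg = r2.mean(dim=1, keepdim=True)
        r_mix = r_max + r_avg                                   # edge feature descriptor

        # Resize to the low-resolution map and apply as (1 + weight), keeping the original response.
        r_mix_low = F.interpolate(r_mix, size=r5_low.shape[-2:], mode="bilinear", align_corners=False)
        r5_low = r5_low * (1.0 + r_mix_low)

        # Edge prediction branch, supervised against a Canny-generated target during training.
        edge_logits = self.edge_head(r2)
        return r5_low, edge_logits


def canny_edge_target(image_bgr, low=100, high=200):
    """Binary edge map from OpenCV's Canny detector (threshold values are assumptions)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    return torch.from_numpy(edges / 255.0).float().unsqueeze(0)  # shape (1, H, W)
```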

3.3. Stepwise Dual Kernel Update Module

The core idea of K-Net [2] is to divide feature pixels into $N$ meaningful groups using learnable convolutional kernels. Because there are no restrictions on the content of the groups, K-Net [2] can perform semantic segmentation, instance segmentation, and panoptic segmentation in the same manner. Initially, K-Net [2] uses $N$ randomly initialized kernels $K_0$ to generate the initial mask prediction from the image features. As depicted in Figure 4a, K-Net [2] initializes this process with two branches: semantic segmentation and instance segmentation. It generates mask predictions $M_{inst}$ and $M_{seg}$ using the corresponding convolutional kernels and image features. However, RT-K-Net [5] argues that this dual-branch structure not only contradicts the idea of a unified architecture but also adds extra computational burden to the model. To address this, its authors propose merging the two branches into a single branch to enhance the role of the auxiliary semantic loss. The initialization process of RT-K-Net [5] is shown in Figure 4b. First, the RTFormer segmentation head produces a single feature map $F$. Then, $F$ is convolved with the directly initialized panoptic kernel $K_0$ to generate the initial mask prediction $M_0$. Since the position for adding the auxiliary semantic loss has been removed, RT-K-Net [5] introduces an auxiliary semantic segmentation head that is used only during training and has the same structure as the main initialization head. We believe that RT-K-Net [5] makes impressive contributions to mask and kernel initialization, so we use an initialization method that is essentially consistent with it, as shown in Figure 4c. However, in the subsequent kernel update process, RT-K-Net [5] uses a procedure identical to that of K-Net [2]. The primary reason for the kernel update is that the initial $K_0$ does not have enough discriminative ability, so the generated $M_0$ performs poorly. Typically, the kernel update process consists of several stages. However, the current process still applies a sequential method for kernel transmission, where the kernel information updated in the previous stage serves only as input for the next stage, which undoubtedly leads to a waste of information. To address this, we adjust the kernel update based on RT-K-Net [5], as depicted in Figure 5. Our basic steps are generally consistent with those of K-Net [2] and RT-K-Net [5] and primarily include the following three steps:
(1) Group Feature Assembling: First, the feature map $F$ of the $i$-th stage is element-wise multiplied by the binarized mask prediction $B_{i-1}$ from the previous stage to form the group feature, as follows:
$$F^k = \sum_{u}^{H} \sum_{v}^{W} B_{i-1}(u, v) \cdot F(u, v),$$
where $F^k \in \mathbb{R}^{B \times N \times C}$, $B$ is the batch size, $N$ represents the number of groups discussed earlier, which also equals the number of convolution kernels, and $C$ indicates the number of feature channels.
(2) Adaptive Feature Update: Since the mask $M_{i-1}$ from the $(i-1)$-th stage may be inaccurate, and there may be mutual interference from noise among the different groups, element-wise multiplication is first performed between $F^k$ and $K_{i-1}$ to derive $F^E$. Two gates, $G^F$ and $G^K$, are learned to evaluate the effects of the kernel and the group features on the kernel update process, respectively. Following this, $G^F$ and $G^K$ are applied to weigh the group features and the kernel, completing the adaptive kernel update, as follows:
$$\tilde{K} = G^F \odot \phi_1(F^k) + G^K \odot \phi_2(K_{i-1}),$$
where $\phi_1$ and $\phi_2$ are fully connected layers with non-shared weights, followed by layer normalization (LN). Unlike in K-Net [2] and RT-K-Net [5], $\tilde{K} \in \mathbb{R}^{B \times N \times 2C}$. The purpose of doubling the channel count in the fully connected layers is to output two updated kernels in the subsequent steps: one is the current-stage kernel $K_i$, and the other is an early prediction of the next-stage kernel $\hat{K}_{i+1}$, which is only temporarily retained.
(3) Kernel Interaction: The kernel interaction process facilitates the sharing of contextual information among the different groups of each kernel. Considering both convenience and effectiveness, it is common to use a combination of multi-head self-attention and a feedforward neural network for kernel interaction, resulting in the updated kernel $K_i$ at the current stage, which can be formulated as follows:
$$K_i,\ \hat{K}_{i+1} = \mathrm{Split}\big(\mathrm{FFN}(\mathrm{MHSA}(\tilde{K}))\big).$$
Subsequently, we utilize different feedforward neural networks (FFNs) to generate predictions for both the mask and class probabilities. This can be formulated as follows:
$$M_i = \mathrm{FFN}_M(K_i) \ast F;$$
$$Cls_i = \mathrm{FFN}_C(K_i).$$
Before entering the next update stage, we concatenate $\hat{K}_i$, the kernel additionally predicted by the previous stage for the current stage, with $K_i$, formulated as follows:
$$K_i = \mathrm{FFN}([K_i, \hat{K}_i]),$$
where $[\cdot, \cdot]$ represents the channel concatenation operation. At this point, the kernel update is complete. Our SDKUM is obtained by combining the kernel and mask initialization proposed in RT-K-Net [5] with this kernel updating operation. Our innovation primarily lies in doubling the number of channels in the previous stage and additionally predicting the kernel for the next stage, which provides two advantages compared to conventional methods: (1) the information extracted from the previous stage is directly incorporated into the update step of the next stage, leading to better information utilization within the module; (2) since the process of generating $K_i$ and $\hat{K}_{i+1}$ involves extensive information exchange, $\hat{K}_{i+1}$ is rich in content from the current stage, and, through the effects of backpropagation across iterations, it also exhibits characteristics of the next-stage kernel, thereby reducing the semantic gap between stages. In conclusion, the stepwise dual kernel update module promotes the effective use of stage information with only a slight increase in model complexity, improving the consistency between the stage kernels. Detailed experimental validation is presented in Section 4.4.
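A condensed PyTorch sketch of one SDKUM update stage, following the three steps above, is given below. The gate construction, the exact FFN shapes, and the classification head are written out only to make the data flow explicit; they are assumptions rather than the authors' exact layers.

```python
import torch
import torch.nn as nn


class SDKUMStageSketch(nn.Module):
    """One illustrative stepwise dual kernel update stage (assumed layer shapes)."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.phi1 = nn.Sequential(nn.Linear(channels, 2 * channels), nn.LayerNorm(2 * channels))
        self.phi2 = nn.Sequential(nn.Linear(channels, 2 * channels), nn.LayerNorm(2 * channels))
        self.gate_f = nn.Sequential(nn.Linear(channels, 2 * channels), nn.Sigmoid())
        self.gate_k = nn.Sequential(nn.Linear(channels, 2 * channels), nn.Sigmoid())
        self.attn = nn.MultiheadAttention(2 * channels, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(2 * channels, 2 * channels), nn.ReLU(),
                                 nn.Linear(2 * channels, 2 * channels))
        self.fuse = nn.Linear(2 * channels, channels)   # fuses [K_i, K_hat_i] back to C channels
        self.mask_ffn = nn.Linear(channels, channels)
        self.cls_ffn = nn.Linear(channels, channels)    # a real class head would map to num_classes

    def forward(self, feats, kernels_prev, masks_prev, kernels_hat):
        # feats: (B, C, H, W); kernels_prev, kernels_hat: (B, N, C); masks_prev: (B, N, H, W)
        B, C, H, W = feats.shape

        # (1) Group feature assembling with the binarized previous-stage masks.
        binary = (masks_prev.sigmoid() > 0.5).float()
        group_feats = torch.einsum("bnhw,bchw->bnc", binary, feats)

        # (2) Adaptive feature update: gates from F^E = F^k * K_{i-1}; output has 2C channels.
        f_e = group_feats * kernels_prev
        k_tilde = self.gate_f(f_e) * self.phi1(group_feats) + self.gate_k(f_e) * self.phi2(kernels_prev)

        # (3) Kernel interaction, then split into the current kernel and the next-stage prediction.
        k_tilde, _ = self.attn(k_tilde, k_tilde, k_tilde)
        k_tilde = self.ffn(k_tilde)
        kernels_i, kernels_hat_next = k_tilde.split(C, dim=-1)

        # Fuse with the extra kernel predicted *for* this stage by the previous stage (concat + FFN).
        kernels_i = self.fuse(torch.cat([kernels_i, kernels_hat], dim=-1))

        masks_i = torch.einsum("bnc,bchw->bnhw", self.mask_ffn(kernels_i), feats)
        cls_i = self.cls_ffn(kernels_i)
        return kernels_i, masks_i, kernels_hat_next, cls_i


# Example shapes: 2 images, 100 kernels, 128 channels, a 64x128 feature map.
stage = SDKUMStageSketch(channels=128)
feats = torch.randn(2, 128, 64, 128)
k_prev, k_hat, m_prev = torch.randn(2, 100, 128), torch.randn(2, 100, 128), torch.randn(2, 100, 64, 128)
k_i, m_i, k_hat_next, cls_i = stage(feats, k_prev, m_prev, k_hat)
```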

4. Experiments

In this section, we conduct a comprehensive evaluation of the proposed EGSDK-Net, comparing it with state-of-the-art panoptic segmentation methods and performing ablation studies to assess the effectiveness of our algorithm.

4.1. The Dataset and Metrics

We utilize the widely used Cityscapes dataset [14] to study EGSDK-Net. This dataset contains a total of 5000 images at a resolution of 1024 × 2048, split into 2975 training images, 500 evaluation images, and 1525 test images. It offers fine-grained panoptic labels across 30 categories, with 19 (8 “things” and 11 “stuff” categories) allocated for evaluation. Similar to RT-K-Net [5], we apply data augmentation techniques such as random scaling, random cropping, random adjustments to color and brightness, and random horizontal flipping, where the random cropping size is fixed at 512 × 1024. Following earlier methods [2,5], we assess panoptic segmentation results using the standard metric, Panoptic Quality (PQ) [1]. PQ measures the similarity between panoptic segmentation predictions and the ground truth and unifies the evaluation criteria for semantic segmentation and instance segmentation. By computing PQ independently for each category and averaging over all categories, PQ becomes insensitive to class imbalance, allowing a fair evaluation of panoptic segmentation performance. For each category, the unique matching of predicted and ground-truth segments is divided into three groups: true positives (TPs), false positives (FPs), and false negatives (FNs), representing matched segment pairs, unmatched predicted segments, and unmatched ground-truth segments, respectively. Ref. [1] uses the intersection over union (IoU) to determine whether a ground-truth segment and a predicted segment match: if $IoU > 0.5$, the pair is counted as a TP. For a given class $c$, IoU is defined as follows:
$$IoU_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|},$$
where $P_c$ and $G_c$ represent the pixels labeled as class $c$ in the predicted and ground-truth segments, respectively. Given these three groups, PQ is defined as follows:
$$PQ = \frac{\sum_{(p,g) \in TP} IoU(p, g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}.$$
Furthermore, to provide a complete evaluation of the model's performance, we report metrics for the “thing” and “stuff” classes, referred to as $PQ^{th}$ and $PQ^{st}$, respectively.
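As a worked example of the metric, the small sketch below computes PQ from per-category matching results. The segment matching itself (pairs with $IoU > 0.5$) is assumed to have been done already, and the numbers in the toy example are made up purely for illustration.

```python
from typing import Dict, List, Tuple


def panoptic_quality(per_class_matches: Dict[str, Tuple[List[float], int, int]]) -> float:
    """Average PQ over categories.

    per_class_matches maps each class to (tp_ious, num_fp, num_fn), where tp_ious are
    the IoUs of matched segment pairs (each > 0.5 by definition of a TP).
    """
    per_class_pq = []
    for _, (tp_ious, num_fp, num_fn) in per_class_matches.items():
        denom = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
        if denom == 0:          # class absent from both prediction and ground truth
            continue
        per_class_pq.append(sum(tp_ious) / denom)
    return sum(per_class_pq) / max(len(per_class_pq), 1)


# Toy example with one "thing" and one "stuff" class (values are made up).
matches = {
    "car":  ([0.9, 0.8, 0.7], 1, 2),   # 3 TPs, 1 FP, 2 FNs
    "road": ([0.95], 0, 0),            # a single well-matched segment
}
print(f"PQ = {panoptic_quality(matches):.3f}")
```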

4.2. Implementation Details

EGSDK-Net is implemented in the Detectron2 framework using PyTorch [30]. The base variant of RTFormer is utilized, initialized with a model pre-trained on ImageNet [31]. The training hyperparameters are as follows: batch size (32), initial learning rate (0.0002), weight decay (0.05), maximum iterations (90 k), optimizer (AdamW), scheduler (‘poly’), $\omega_{mask}$ (1.0), $\omega_{dice}$ (4.0), $\omega_{rank}$ (0.1), $\omega_{cls}$ (2.0), and $\omega_{seg}$ (1.0). We incorporate warm-up in the training process, raising the learning rate from 0.000002 to 0.0002 over the first 1000 iterations. The number of prediction masks (convolutional kernels) $N$ is set to 100. Our kernel update frequency matches that of RT-K-Net [5]. All experiments are conducted on a single V100 GPU.
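For illustration, the sketch below collects these hyperparameters and sets up AdamW with a linear warm-up followed by a ‘poly’ schedule using standard PyTorch components; the decay power (0.9) and the scheduler composition are assumptions, and the actual Detectron2 configuration may differ.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)      # stand-in; the real model is EGSDK-Net built in Detectron2

max_iters, warmup_iters = 90_000, 1_000
base_lr, warmup_start_lr, weight_decay = 2e-4, 2e-6, 0.05

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)


def lr_lambda(step: int) -> float:
    """Linear warm-up to the base learning rate, then 'poly' decay (power 0.9 is an assumption)."""
    if step < warmup_iters:
        start = warmup_start_lr / base_lr
        return start + (1.0 - start) * step / warmup_iters
    progress = (step - warmup_iters) / max(max_iters - warmup_iters, 1)
    return (1.0 - progress) ** 0.9


scheduler = LambdaLR(optimizer, lr_lambda)
# Training loop skeleton: for each iteration, optimizer.step() then scheduler.step().
```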

4.3. Comparison Experiment

Table 1 presents the quantitative comparison results of EGSDK-Net with SOTA methods [5,10,11,12,13] on the Cityscapes dataset [14]. For a fair comparison, we retrain the main reference method of EGSDK-Net in the current experimental environment, while using the official data for the other methods. According to the experimental results, EGSDK-Net demonstrates a clear leading advantage in the comprehensive evaluation of panoptic segmentation. It achieves a PQ of 60.6, leading RT-K-Net [5], the second-ranked method, by 2.19%, while other methods do not even reach a PQ of 59. In the evaluation of “thing” classes, EGSDK-Net ranks second overall, only slightly behind Hou et al. [13] (51.9 vs. 52.1). Notably, the retrained RT-K-Net [5] performs poorly in the “thing” classes, surpassing only MGNet [10]. Although the improvement over RT-K-Net [5] is not very significant, we must point out that the results of RT-K-Net [5] in the “stuff” classes already reach a relatively high value. EGSDK-Net surpassing it numerically is sufficient to confirm its outstanding performance. This result is attributed not only to the strong RTFormer backbone and the optimizations provided by RT-K-Net [5], but also to the enhancements in the kernel update strategy within EGSDK-Net.

4.4. Ablation Study

We first conduct ablation experiments on the core components of EGSDK-Net. To save on training costs, the maximum number of training iterations for the ablation experiments is set to 60 k. The experimental results and visual comparisons are shown in Table 2 and Figure 6. It is evident that inserting SDKUM into the baseline leads to significant improvements in all three metrics, particularly in the “thing” classes, which rise from 48.9 to 50.0. After inserting RTEGM, we obtain the final structure of the proposed EGSDK-Net. The scores for PQ and $PQ^{th}$ keep increasing, while $PQ^{st}$ declines slightly. We believe this may be related to the Canny operator [26]. This operator extracts edges based on the intrinsic properties of the image, and the contours of the “stuff” classes are quite challenging to predict. When the Canny operator [26] predicts these contours inaccurately, it introduces errors into the model, which in turn lowers the evaluation results for the “stuff” classes. Given the significant improvement in $PQ^{th}$ after adding RTEGM, we believe that the benefits of this module outweigh its drawbacks, making it worthwhile to retain. Additionally, the results in Table 2 support our analysis of performance sources discussed in Section 4.3.
Next, we investigate the internal structure of RTEGM. We set up a baseline by removing the max pooling and average pooling layers, and the results are shown in the first row of Table 3. Next, we conduct experiments that restore either max pooling or average pooling, as shown in the second and third rows of Table 3. We observe that the enhancement from the max pooling layer is more significant. This may be because the max pooling layer functions similarly to non-maximum suppression, allowing it to eliminate false positive points with relatively lower probability values, thus producing more accurate edge feature maps. In contrast, the improvement from the average pooling layer is less pronounced, and there is a marked decline in the “stuff” classes. We think this remains tied to the difficult-to-predict edges of the “stuff” classes, as the average pooling layer does not filter out unreliable points as effectively as the max pooling layer. On the contrary, the averaging process across the channel dimension leads to more confusion and complexity in predicting the contours of the “stuff” classes, ultimately causing a drop in results. When both pooling methods are applied simultaneously, the results for the three metrics show improvement compared to when they are used separately. This might be because the combined use of max pooling and average pooling layers provides a more thorough evaluation of the edges from two perspectives, thereby leading to better performance in PQ, $PQ^{th}$, and $PQ^{st}$. In addition, we conduct experiments on RTEGM as shown in Table 4. We remove the existing structure of RTEGM and its corresponding supervision, instead relying entirely on the non-trainable Canny [26] for edge prediction. The results indicate that the current strategy significantly outperforms the direct use of Canny [26]. While the results of Canny [26] act as the ground truth for the edge module during training, why is the direct application of Canny [26] less effective? We attribute this to the backpropagation process: the directly integrated Canny [26] remains separate from the rest of the system, which prevents timely feedback on the conditions of individual images to the network, thereby limiting parameter adjustments and leading to subpar results.
Finally, we study the internal structure of SDKUM. The focus of our experimental exploration is on how to effectively merge the additional updated kernel $\hat{K}_{i+1}$ from the previous stage with the kernel update results of the current stage, as presented in Table 4. The data comparisons across these four rows indicate that concatenation is still the most effective method. We believe that the combination of concatenation with the subsequent fusion significantly enhances the interaction and knowledge integration between $\hat{K}_{i+1}$ and $K_i$. In contrast, both addition and multiplication have their respective shortcomings. The former tends to retain all the information from the two kernel update outputs, which inevitably causes the previous stage to overly influence the current stage, slowing down the update and leading to poor results. Element-wise multiplication retains only the information shared by the two updated kernels; as the stage index increases, the overall volume of information within the module decreases, which naturally leads to unsatisfactory results. In summary, the current combination method within SDKUM is the most effective.

5. Conclusions

Panoptic segmentation is a critical task in the field of computer vision, with widespread applications in areas such as autonomous driving, robotics, and medical imaging. However, despite its significance, panoptic segmentation faces challenges such as the limitations imposed by traditional segmentation frameworks and the suboptimal utilization of information in existing kernel update strategies. This paper presents a novel panoptic segmentation network, EGSDK-Net. It primarily addresses two issues: the constraints of panoptic segmentation within inherent frameworks and the low information utilization of existing kernel update strategies. To address the first issue, we take into account the relationship between edge detection and the panoptic segmentation task and propose a real-time edge guidance module. This module supervises the extracted edge features, which receive additional attention to their spatial positions, using the edges predicted by Canny as ground truth. For the other issue, we introduce a stepwise dual kernel update module. This module facilitates the rational use of information from the earlier stage by performing additional updates and combinations of the kernels from the previous stage to the next. Numerous comparison experiments and ablation studies confirm the validity and effectiveness of the proposed model. In the future, using more advanced edge detection algorithms instead of the existing Canny to generate edge ground truth appears to be a promising direction.

Author Contributions

Conceptualization, P.M. and H.Z.; methodology, P.M.; software, P.M.; validation, P.M., H.Z. and K.M.; formal analysis, H.Z.; investigation, P.M.; resources, H.Z.; data curation, P.M.; writing—original draft preparation, P.M.; writing—review and editing, P.M.; visualization, P.M.; supervision, H.Z. and K.M.; project administration, P.M. and H.Z.; funding acquisition, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Provincial Science and Technology Innovation Special Fund Project of Jilin Province, grant number 20190302026GX, Natural Science Foundation of Jilin Province, grant number 20200201037JC, and the Fundamental Research Funds for the Central Universities for JLU.

Data Availability Statement

The Cityscapes dataset used in this paper is openly available at https://www.cityscapes-dataset.com/ (accessed on 1 March 2024).

Acknowledgments

The authors are grateful to the anonymous reviewers for their insightful comments, which have certainly improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9404–9413.
  2. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-Net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338.
  3. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
  4. Yu, Q.; Wang, H.; Qiao, S.; Collins, M.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. k-means Mask Transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 288–307.
  5. Schön, M.; Buchholz, M.; Dietmayer, K. RT-K-Net: Revisiting K-Net for real-time panoptic segmentation. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–7.
  6. Hu, J.; Huang, L.; Ren, T.; Zhang, S.; Ji, R.; Cao, L. You only segment once: Towards real-time panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17819–17829.
  7. Šarić, J.; Oršić, M.; Šegvić, S. Panoptic SwiftNet: Pyramidal fusion for real-time panoptic segmentation. Remote Sens. 2023, 15, 1968.
  8. Wang, F.; Wang, Z.; Chen, Z.; Zhu, D.; Gong, X.; Cong, W. An edge-guided deep learning solar panel hotspot thermal image segmentation algorithm. Appl. Sci. 2023, 13, 11031.
  9. Jin, J.; Zhou, W.; Yang, R.; Ye, L.; Yu, L. Edge detection guide network for semantic segmentation of remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  10. Schön, M.; Buchholz, M.; Dietmayer, K. MGNet: Monocular geometric scene understanding for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15804–15815.
  11. Chen, L.C.; Wang, H.; Qiao, S. Scaling wide residual networks for panoptic segmentation. arXiv 2020, arXiv:2011.11675.
  12. Petrovai, A.; Nedevschi, S. Real-time panoptic segmentation with prototype masks for automated driving. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1400–1406.
  13. Hou, R.; Li, J.; Bhargava, A.; Raventos, A.; Guizilini, V.; Fang, C.; Lynch, J.; Gaidon, A. Real-time panoptic segmentation from dense detections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 8523–8532.
  14. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
  15. Mohan, R.; Valada, A. EfficientPS: Efficient panoptic segmentation. Int. J. Comput. Vis. 2021, 129, 1551–1579.
  16. Porzi, L.; Bulo, S.R.; Colovic, A.; Kontschieder, P. Seamless scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8277–8286.
  17. Gao, N.; Shan, Y.; Wang, Y.; Zhao, X.; Yu, Y.; Yang, M.; Huang, K. SSAP: Single-shot instance segmentation with affinity pyramid. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 642–651.
  18. Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12475–12485.
  19. Yu, Q.; Wang, H.; Kim, D.; Qiao, S.; Collins, M.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. CMT-DeepLab: Clustering mask transformers for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2560–2570.
  20. Zhou, X.; Shen, K.; Weng, L.; Cong, R.; Zheng, B.; Zhang, J.; Yan, C. Edge-guided recurrent positioning network for salient object detection in optical remote sensing images. IEEE Trans. Cybern. 2023, 53, 539–552.
  21. Zheng, X.; Wang, B.; Ai, L.; Tang, P.; Liu, D. EDGE-Net: An edge-guided enhanced network for RGB-T salient object detection. J. Electron. Imaging 2023, 32, 063032.
  22. Fang, F.; Li, J.; Yuan, Y.; Zeng, T.; Zhang, G. Multilevel edge features guided network for image denoising. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 3956–3970.
  23. Wang, D.; Xie, C.; Liu, S.; Niu, Z.; Zuo, W. Image inpainting with edge-guided learnable bidirectional attention maps. arXiv 2021, arXiv:2104.12087.
  24. Lin, H.; Pagnucco, M.; Song, Y. Edge guided progressively generative image outpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 806–815.
  25. Dai, Q.; Fang, F.; Li, J.; Zhang, G.; Zhou, A. Edge-guided composition network for image stitching. Pattern Recognit. 2021, 118, 108019.
  26. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 8, 679–698.
  27. Wang, J.; Gou, C.; Wu, Q.; Feng, H.; Han, J.; Ding, E.; Wang, J. RTFormer: Efficient design for real-time semantic segmentation with transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 7423–7436.
  28. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085.
  29. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
  31. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–26 June 2009; pp. 248–255.
Figure 1. Comparison between EGSDK-Net and state-of-the-art methods [2,5,10,11,12,13] on the Cityscapes dataset [14]. Our method outperforms K-Net [2], Panoptic-DeepLab [11], Petrovai and Nedevschi [12], and Hou et al. [13] in terms of Panoptic Quality and runtime. Additionally, with a comparable runtime, our method shows a significant improvement in the Panoptic Quality compared to MGNet [10] and RT-K-Net [5].
Figure 2. The overall structure of EGSDK-Net. Our main contributions are focused on RTEGM and SDKUM.
Figure 3. The structure of the RTEGM.
Figure 4. Kernel and mask initialization utilized in K-Net [2] (a), RT-K-Net [5] (b), and EGSDK-Net (c). We initialize $\hat{K}_1$ for the first stage of kernel updates while initializing $K_0$.
Figure 5. The structure of the kernel update in SDKUM. The entire update process comprises $n$ iterative stages, with each stage receiving the kernel $K_{i-1}$, mask $M_{i-1}$, and pre-predicted kernel $\hat{K}_i$ from the preceding stage as inputs (the initial inputs for the first stage are the initialized kernel $K_0$, mask $M_0$, and kernel $\hat{K}_1$). Upon completion of the update process, the predicted class $Cls_n$ and mask $M_n$ are output.
Figure 6. Visual comparison results of ablation experiments. There are three sets of images, labeled (a–c). Each set contains four parts, displayed from left to right and top to bottom: input image, baseline, baseline w/SDKUM, and ours. In each part, the left side shows the original image, while the right side displays a zoomed-in version of the area inside the white box on the left image. It shows that after inserting SDKUM, the segmentation results become better, especially after utilizing RTEGM, and the boundaries of the segmentation regions become more accurate.
Table 1. Comparison results between our method and the state-of-the-art panoptic segmentation methods on the Cityscapes validation set. The best results are shown in boldface and the second-best results are underlined.
Method | Backbone | PQ | PQ^th | PQ^st | GPU
MGNet [10] | ResNet-18 | 55.7 | 45.3 | 63.1 | TitanRTX
K-Net [2] | ResNet50-FPN | 56.9 | 46.0 | 64.8 | -
Petrovai and Nedevschi [12] | VoVNet2-39 | 57.3 | 50.4 | 62.4 | V100
Panoptic-DeepLab [11] | SWideRNet-(0.25,0.25,0.75) | 58.4 | - | - | V100
Hou et al. [13] | ResNet50-FPN | 58.8 | 52.1 | 63.7 | V100
Panoptic SwiftNet [7] | ResNet-18 | 55.9 | - | - | RTX3090
YOSO [6] | ResNet50 | 59.7 | 51.0 | 66.1 | V100
RT-K-Net [5] | RTFormer | 59.3 | 48.9 | 66.7 | V100
Ours | RTFormer | 60.6 | 51.9 | 66.8 | V100
Table 2. The results of the ablation experiments on the core components of EGSDK-Net.
Method | PQ | PQ^th | PQ^st
Baseline | 59.3 | 48.9 | 66.7
Baseline w/SDKUM | 60.0 | 50.0 | 67.3
Ours | 60.6 | 51.9 | 66.8
Table 3. The results of the ablation experiments on the internal structure of RTEGM.
Method | PQ | PQ^th | PQ^st
Convs | 59.3 | 48.9 | 66.7
only MaxPool | 59.6 | 50.0 | 66.7
only AvgPool | 59.4 | 49.3 | 66.7
Ours | 60.6 | 51.9 | 66.8
Table 4. The results of the ablation experiments on the internal structure of SDKUM.
Method | PQ | PQ^th | PQ^st
Baseline w/RTEGM | 59.7 | 49.4 | 67.1
Using Addition for Fusion | 59.5 | 50.0 | 66.7
Using Multiplication for Fusion | 59.7 | 50.3 | 66.5
Ours | 60.6 | 51.9 | 66.8