Article

Panoptic Segmentation Method Based on Feature Fusion and Edge Guidance

Lanshi Yang, Shiguo Wang and Shuhua Teng
1 School of Engineering Science, Shandong Xiehe University, Jinan 250107, China
2 School of Computer Science and Technology, Changsha University of Science and Technology, Changsha 410076, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5152; https://doi.org/10.3390/app15095152
Submission received: 21 February 2025 / Revised: 2 April 2025 / Accepted: 8 April 2025 / Published: 6 May 2025

Abstract

Panoptic segmentation, a pivotal research direction in computer vision, unifies pixel-level recognition of objects and background within a scene and is crucial for applications such as autonomous driving. However, existing methods, including State-of-the-Art models like Mask2Former, often exhibit limitations such as inadequate adaptation in multi-scale feature fusion and ambiguous boundary segmentation, particularly for small objects in complex scenes. To address these specific challenges, we propose a novel network: PSM-FFEG (Panoptic Segmentation Model with Feature Fusion and Edge Guidance). PSM-FFEG introduces three key components: (1) a dynamic multi-scale feature fusion module that enhances contextual modeling via cascaded convolutions and adaptive attention; (2) an explicit edge guidance module that refines boundary features with dedicated supervision; and (3) a dual-path Transformer decoder that optimizes cross-path feature interaction between pixels and queries. Extensive experiments on the Cityscapes and MS COCO datasets demonstrate that, using a ResNet50 backbone, PSM-FFEG achieves 2.6% and 2.4% absolute improvements in panoptic quality (PQ) over the Mask2Former baseline, respectively. Notably, PSM-FFEG shows significant gains for small objects on Cityscapes, with PQ increasing by 4.3% for traffic lights and 6.8% for motorcycles. These results validate the effectiveness of the proposed modules. To foster further research, our implementation code will be made publicly available.

1. Introduction

Panoptic segmentation, as one of the core tasks in computer vision, has witnessed rapid advancements from methodological exploration to performance breakthroughs since Kirillov et al. [1] formally introduced the concept of panoptic segmentation in 2018, gradually emerging as a prominent research direction in image understanding. Early research in panoptic segmentation often involved integrating separate semantic [2] and instance segmentation techniques. Instance segmentation itself evolved from classical computer vision methods (e.g., thresholding, region growing) towards deep learning approaches. Object detection techniques, progressing from models like Faster R-CNN [3] and YOLO [4] to instance segmentation models like Mask R-CNN [5] by adding mask prediction heads, formed the basis for many early panoptic methods. To unify the tasks, three mainstream panoptic approaches emerged: (1) Top-down methods, typically extending object detectors like Mask R-CNN, such as UPSNet [6], which perform detection then masking but can suffer from inter-task conflicts. (2) Bottom-up methods, like Panoptic-DeepLab [7], which group pixels into instances but can struggle with boundary accuracy in dense scenes. (3) Unified framework methods [8], often leveraging Transformers for end-to-end prediction, such as Mask2Former [9], which treat ‘thing’ and ‘stuff’ uniformly and have achieved State-of-the-Art results.
Despite the success of unified methods like Mask2Former [9], specific limitations remain that hinder performance in complex scenarios. These include: (1) reliance on static feature pyramid fusion strategies (e.g., simple addition), limiting adaptability to diverse object scales; (2) suboptimal segmentation of small objects and blurred boundaries for intricate structures, potentially due to insufficient high-frequency detail processing in the decoder; and (3) limited interaction pathways in standard decoders, which often process pixel and query information sequentially, hindering optimal pixel-to-instance assignment.
To address these specific limitations observed in State-of-the-Art models like Mask2-Former, we propose PSM-FFEG, which introduces targeted innovations: (1) a dynamic multi-scale feature fusion mechanism incorporating attention to improve adaptive context modeling, moving beyond static fusion; (2) an explicit edge guidance module with dedicated supervision to enhance boundary definition, particularly for small objects and detailed structures; and (3) a dual-path Transformer decoder enabling richer, iterative interaction between pixel features and object queries to improve segmentation in complex layouts.
The main contributions of this paper are as follows:
(1) A dynamic multi-scale feature fusion module is proposed, which adaptively captures contextual information through cascaded deformable convolutional kernels and dual-dimensional attention in both channel and spatial domains.
(2) An explicit edge feature guidance module is introduced, establishing a bidirectional enhancement mechanism between pixel features and edge gradients.
(3) A dual-path Transformer decoder architecture is designed, enabling collaborative optimization of global semantics and local features through cross-attention mechanisms.
The remainder of this paper is organized as follows: Section 2 reviews related work in panoptic segmentation, feature fusion techniques, and edge-guided approaches. Section 3 details the proposed PSM-FFEG network architecture, including its dynamic fusion, edge guidance, and dual-path decoder modules. Section 4 presents the experimental setup, datasets, evaluation metrics, main results, ablation studies, qualitative analysis, and a discussion of computational cost. Finally, Section 5 concludes the paper, summarizing contributions and limitations, and outlining directions for future work.

2. Related Work

Panoptic segmentation, introduced by Kirillov et al., unifies semantic segmentation (classifying pixels by category) and instance segmentation (detecting and segmenting individual objects). This section reviews prior work in panoptic segmentation methods, feature fusion techniques relevant to this task, and edge-guided approaches.

2.1. Panoptic Segmentation Methods

Early research often combined separate semantic and instance segmentation networks. Methodologies have evolved into three main streams:
Top-Down Methods: These typically extend existing object detectors. An initial detection stage identifies bounding boxes for ‘thing’ instances, followed by mask prediction within each box. UPSNet [6], for example, integrated a semantic segmentation head into Mask R-CNN [5]. While intuitive, these methods can suffer from conflicts between the separate prediction heads and challenges in resolving overlapping instances.
Bottom-Up Methods: These methods first perform pixel-level grouping or embedding, followed by clustering or voting mechanisms to form instances. Panoptic-DeepLab [7] predicts semantic maps, instance centers, and pixel offsets to group pixels into instances. While potentially faster, bottom-up methods often struggle with accurately delineating boundaries in dense scenes, leading to instance adhesion.
Unified Framework Methods: More recent approaches aim for end-to-end prediction, often leveraging Transformer architectures. MaskFormer [9] reframed segmentation as a mask classification problem using learnable queries. Mask2Former built upon this with masked attention, achieving State-of-the-Art results by treating both semantic and instance segmentation uniformly via queries. K-Query [10] further explored query-based methods. While powerful, even these unified methods can face challenges, such as the Mask2Former’s noted lower accuracy on small objects like traffic lights [9], motivating our work on targeted improvements. Our proposed PSM-FFEG builds on the unified approach but introduces specific modules for dynamic fusion and edge refinement to address these remaining gaps.

2.2. Feature Fusion Techniques

Effective fusion of multi-scale features is crucial for recognizing objects and context at varying resolutions. Feature Pyramid Networks (FPN) [11] are a common baseline, typically using simple addition or concatenation for combining features from different backbone stages. However, these static fusion strategies, often using fixed weights or simple operations (as seen in PanopticFPN [11]), may lack adaptability to diverse scene content and object scales [12]. This rigidity can lead to suboptimal feature representation, especially when dealing with significant scale variations within a single scene. To overcome this, our work introduces dynamic multi-scale feature fusion, incorporating attention mechanisms (DLK and DFF modules) that allow the network to adaptively weigh and combine features based on the input context, enhancing representation power compared to fixed fusion rules.

2.3. Edge-Guided Segmentation

Accurate boundary delineation remains a challenge, particularly for complex textures or slender objects. While deep networks excel at semantics, the hierarchical pooling operations can progressively dilute fine-grained edge information. Some semantic segmentation works explore edge preservation or refinement, often using edge detection as an auxiliary task or incorporating graphical models like CRFs as post-processing. However, explicitly integrating robust edge guidance within the panoptic segmentation pipeline, especially for enhancing instance boundaries, is less explored. Existing decoders often lack sufficient capability to process high-frequency edge information effectively. Our proposed Edge Guidance Module (EGF) directly addresses this by explicitly encoding multi-scale edge information, fusing it with backbone features via attention and gating, and employing an auxiliary edge supervision loss to force the network to learn precise boundary features, thereby improving segmentation accuracy, especially around object borders.

3. Method

3.1. Network Infrastructure Overview

The proposed Panoptic Segmentation Model based on Feature Fusion and Edge Guidance (PSM-FFEG) network architecture, illustrated schematically in Figure 1, follows a pipeline designed to enhance multi-scale feature representation and boundary accuracy. It leverages initial features from a standard backbone and processes them through dedicated encoding, guidance, and decoding stages.
Backbone Network: We employ ResNet50 as the feature extractor. It provides multi-level feature maps C1, C2, C3, C4 from its stages 1 through 4, with spatial resolutions progressively reduced by factors of 4, 8, 16, 32 relative to the input image size. These features serve as the foundational input for subsequent modules. Notably, the high-resolution C1 feature map is crucial input for the Edge Guidance Layer due to its retention of fine spatial details.
Feature Encoding Layer: This layer aims to generate powerful multi-scale semantic representations. It consists of:
  • Dynamic Large Kernel (DLK) Modules: Applied sequentially to the backbone features C1, C2, C3, C4 with intermediate downsampling, producing initial multi-scale features E1, E2, E3, E4 with enhanced receptive fields.
  • Dynamic Feature Fusion (DFF) Modules: These modules perform adaptive cross-scale fusion between adjacent features from the DLK modules (e.g., fusing E1 and E2, E2 and E3, etc.) to produce refined features E1’, E2’, E3’. The deepest feature E4 (renamed E4’ for notational consistency) does not undergo DFF fusion at this stage but provides deep semantic context. The outputs E1’, E2’, E3’, E4’ capture richer context than the original backbone features.
Edge Guidance Layer: This layer focuses specifically on refining boundary information.
  • Edge Guidance Fusion (EGF) Module: It takes the shallow, high-resolution C1 feature from the backbone and the deep E4’ feature from the encoder and fuses them to generate an edge-enhanced feature map, E_edge, guided by an auxiliary edge loss during training.
  • Multi-scale Edge Injection: The learned edge feature E_edge is then downsampled to match the resolutions of {E1, E2, E3} and integrated with them using a gating mechanism. This process yields the final multi-scale feature pyramid {F1, F2, F3, F4}. These features, spanning resolutions 1/4 to 1/32, serve as the input to the decoder.
Dual-Path Decoding Layer: A parallel interactive architecture is utilized to process the feature pyramid. The query path updates instance-aware features through a Transformer decoder, while the pixel path extracts fine-grained details via a convolutional network. The dual-path features interact through a cross-attention mechanism, ultimately generating the semantic categories and instance masks necessary for panoptic segmentation tasks. The decoder consists of multiple layers, each refining the features through self-attention and cross-attention mechanisms. The final output includes both semantic segmentation maps and instance masks.

3.2. Dynamic Multi-Scale Feature Fusion Module

In panoptic segmentation tasks, effective feature fusion plays a pivotal role in enhancing model performance. However, traditional feature fusion methods, such as simple channel concatenation or weighted averaging, often rely on fixed strategies, making them inadequate for handling the diversity and complexity of targets across different scenes [12]. Particularly when processing multi-scale targets, the absence of an adaptive feature selection mechanism frequently results in insufficient feature representation, compromising segmentation accuracy. To overcome the limitations of traditional feature fusion strategies in adapting to scene diversity, this section introduces a cascaded architecture incorporating Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF), with its innovation manifested in two key aspects: dynamic receptive field expansion and cross-scale adaptive fusion.

3.2.1. Dynamic Large Kernel Module

As illustrated in Figure 2, the DLK module facilitates multi-scale feature extraction through a combination of deformable kernel cascading and attention selection mechanisms. The processing pipeline comprises three distinct stages:
Cascaded Deformable Convolution: A dual-path architecture is implemented utilizing depthwise separable convolution, specifically configured as:
Path 1: 5 × 5 convolution, dilation 1
Path 2: 7 × 7 convolution, dilation 3
This specific configuration of cascaded 5 × 5 (dilation 1) and 7 × 7 (dilation 3) kernels was chosen empirically. It effectively achieves a large 23 × 23 receptive field (calculated via Equation (2)) necessary for capturing broad context, while using depthwise separable convolutions and controlled dilation helps manage computational cost compared to a single, very large dense kernel.
Leveraging the receptive field accumulation effect, an equivalent 23 × 23 large kernel operation is achieved. The computational formulation is expressed as follows:
R_i = R_{i−1} + (k_i − 1) × j_i
where R_{i−1} denotes the effective receptive field of the (i−1)-th layer, k_i represents the kernel size of the i-th layer, and j_i corresponds to the stride or dilation rate of the i-th layer. Given that the first layer employs a 5 × 5 convolution kernel with dilation = 1 and the initial R_0 = 1, we obtain R_1 = 1 + (5 − 1) × 1 = 5. The subsequent layer utilizes a 7 × 7 convolution kernel with dilation = 3, which is equivalent to a dense kernel of size 7 + (7 − 1) × (3 − 1) = 19, yielding R_2 = 5 + (19 − 1) × 1 = 23. Through the cascade of these two convolutional layers, an effective receptive field of 23 × 23 is achieved, which circumvents the substantial computational burden associated with directly implementing such a large convolution kernel.
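As a quick check of this receptive-field arithmetic, the accumulation rule can be evaluated in a few lines of Python. The helper below is purely illustrative and assumes both layers use stride 1, as in the DLK configuration described above.

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective dense size of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

def cascaded_receptive_field(layers) -> int:
    """Accumulate R_i = R_{i-1} + (k_eff - 1) * jump over (kernel, dilation, stride) layers."""
    r, jump = 1, 1
    for k, d, s in layers:
        r += (effective_kernel(k, d) - 1) * jump
        jump *= s
    return r

# DLK cascade: 5x5 with dilation 1, then 7x7 with dilation 3, both stride 1.
print(cascaded_receptive_field([(5, 1, 1), (7, 3, 1)]))  # -> 23
```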
Dual-dimensional dynamic selection: A novel channel-spatial cooperative attention mechanism is introduced, with its computational process described as follows:
A_s = σ(Conv_{7×7}([MaxPool(F); AvgPool(F)]))
A_c = σ(MLP(GAP(F)) + MLP(GMP(F)))
F′ = (F ⊗ A_s) ⊙ A_c
where A_s and A_c correspond to the spatial and channel attention maps, respectively, while ⊗ and ⊙ signify element-wise multiplication and channel-wise multiplication operations.
This dual-dimensional attention mechanism allows the DLK module to adaptively modulate input features based on channel importance and spatial relevance for the given context. The term ‘dynamic’ here refers to this adaptive feature weighting learned through attention, rather than a process of explicitly selecting or discarding entire feature channels or spatial locations.
Residual Learning: Incorporating cross-layer connections to facilitate enhanced gradient propagation.
F_out = F_in + Conv_{1×1}(F′)
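To make the DLK data flow concrete, the following PyTorch sketch assembles the cascaded depthwise convolutions, the spatial and channel attention of the equations above, and the residual 1 × 1 projection. It is a minimal illustration under our own assumptions (reduction ratio, padding choices, and a shared MLP for the two pooled descriptors), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DLK(nn.Module):
    """Sketch of a Dynamic Large Kernel block: cascaded depthwise convolutions
    (5x5 dilation 1, then 7x7 dilation 3, ~23x23 receptive field), dual-dimensional
    attention, and a residual 1x1 projection."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.dw7 = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)       # pointwise part of the separable conv
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)     # A_s: conv over [max; avg] channel maps
        self.mlp = nn.Sequential(                        # A_c: shared MLP on pooled descriptors
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.pw(self.dw7(self.dw5(x)))               # large-kernel feature extraction
        a_s = torch.sigmoid(self.spatial(torch.cat(
            [f.max(dim=1, keepdim=True).values, f.mean(dim=1, keepdim=True)], dim=1)))
        a_c = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(f, 1)) +
                            self.mlp(F.adaptive_max_pool2d(f, 1)))
        f = f * a_s * a_c                                # F' = (F (x) A_s) (.) A_c
        return x + self.proj(f)                          # residual connection

# x = torch.randn(1, 256, 64, 128); print(DLK(256)(x).shape)  # torch.Size([1, 256, 64, 128])
```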

3.2.2. Dynamic Feature Fusion Module

The DFF module facilitates adaptive aggregation of cross-level features (Figure 3), with its core technical components comprising:
Geometric Alignment: Implementing deformable upsampling on the high-level feature E_high to ensure precise spatial alignment with the low-level feature E_low.
E_up = DeformConv(E_high, Δp)
where Δp represents the output generated by the offset prediction network.
Dynamic Weight Generation: Producing fusion weights through comprehensive global context modeling.
W = Softmax(Conv_{1×1}([E_up; E_low]))
Multi-modal Fusion: Employing a gating mechanism for effective feature integration using dynamically generated weights:
E_fusion = W · E_up + (1 − W) · E_low
where W is the fusion weight matrix, and · denotes element-wise multiplication. The output E_fusion is the fused feature map.
Crucially, the fusion weights W (Equation (8)) are computed from the content of both feature levels (E_up and E_low) through convolution and softmax normalization applied to their combination. This allows the DFF module to dynamically adjust the contribution of high-level context versus low-level detail for each spatial location based on the input features, rather than using fixed fusion rules. This adaptive weighting enhances the representation of features across varying scales.
The dynamic characteristics of both DLK and DFF modules are derived from the incorporation of attention mechanisms [13]. Through comprehensive utilization of global information (spatial and channel dimensions) from input feature maps, the model adaptively modulates feature weights, effectively circumventing the constraints inherent in conventional fixed fusion approaches. This adaptive mechanism enables the model to dynamically adjust its processing strategy based on the specific characteristics of input data, thereby facilitating more efficient extraction of salient features.
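The DFF computation can be sketched just as compactly. In the sketch below, bilinear interpolation stands in for the deformable, offset-predicting upsampling of Equation (7), and a two-way softmax realizes the complementary weights W and 1 − W; both simplifications are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFF(nn.Module):
    """Sketch of Dynamic Feature Fusion: E_fusion = W * E_up + (1 - W) * E_low,
    with W predicted from the content of both feature levels."""
    def __init__(self, channels: int):
        super().__init__()
        self.weight = nn.Conv2d(2 * channels, 2, 1)  # two logits per pixel -> W and 1 - W

    def forward(self, e_low: torch.Tensor, e_high: torch.Tensor) -> torch.Tensor:
        # Simplification: bilinear upsampling in place of deformable (offset-based) upsampling.
        e_up = F.interpolate(e_high, size=e_low.shape[-2:], mode="bilinear", align_corners=False)
        w = torch.softmax(self.weight(torch.cat([e_up, e_low], dim=1)), dim=1)
        return w[:, :1] * e_up + w[:, 1:] * e_low    # content-dependent gated fusion

# low, high = torch.randn(1, 256, 64, 128), torch.randn(1, 256, 32, 64)
# print(DFF(256)(low, high).shape)  # torch.Size([1, 256, 64, 128])
```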

3.3. Edge Guidance Module

In deep learning architectures, the acquisition of high-level semantic features predominantly depends on the hierarchical stacking of convolutional and pooling operations. Nevertheless, this sequential feature extraction paradigm exhibits inherent limitations, particularly in preserving crucial image details such as edge information. This deficiency becomes especially pronounced during pixel-level classification at object boundaries, where the degradation or loss of edge information significantly compromises segmentation accuracy. To mitigate the issue of edge information attenuation in deep feature extraction, we propose an Edge-Guided Fusion (EGF) module that enhances boundary segmentation precision through explicit edge modeling and multi-scale feature augmentation strategies. As illustrated in Figure 4, the proposed architecture comprises three fundamental technical components:
Edge Feature Encoding: Integrating the shallow feature C1 ∈ R^{H/4×W/4×256} from the backbone network with the deep feature E4 ∈ R^{H/32×W/32×256} from the encoder to establish a comprehensive multi-scale edge representation.
E_edge = Γ(Conv_{3×3}([Up_8(E4); C1]))
where Up_8(·) signifies 8× bilinear upsampling, while Γ(·) represents the enhanced channel attention mechanism:
w_c = σ(Conv_{1×1}(GAP(E_edge)))
E_edge ← E_edge ⊙ w_c + E_edge
A simple yet effective channel attention mechanism, using global average pooling (GAP) followed by a 1 × 1 convolution (Equation (11)), was chosen here for its low computational overhead while still effectively highlighting the channels that carry significant edge information within the fused C1 and E4 features.
Edge Supervised Learning: Incorporating auxiliary supervision signals to strengthen edge perception capabilities, with the ground truth edge map being generated through an enhanced Canny operator.
L_edge = (1/N) Σ_{i=1}^{N} BCE(Up_4(E_edge), G_edge)
Multi-scale Feature Injection: Constructing an edge feature pyramid to facilitate cross-level guidance.
{E_edge^k} = {Down_{2^{k−1}}(E_edge)}, k = 1, …, 4
Multi-scale features are integrated with backbone features via a gating mechanism.
F_k = Conv_{1×1}([E_k; E_edge^k ⊗ σ(Conv_{3×3}(E_k))])
While our focus is on integrating edge guidance directly within the panoptic framework via the EGF module and auxiliary loss, we note that alternative approaches exist for edge-aware segmentation. These include post-processing segmentation masks with Conditional Random Fields (CRFs) or employing multi-task learning frameworks that explicitly predict edge maps alongside semantic or instance segmentation. A direct quantitative comparison to these diverse methods is beyond the scope of this study, which concentrates on enhancing a unified panoptic model. However, our ablation results demonstrate the effectiveness of the proposed EGF module in significantly improving metrics (PQ) within our PSM-FFEG model.
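To illustrate how these pieces fit together, the sketch below implements the edge feature encoding with residual channel attention and the gated multi-scale injection for a single pyramid level. Channel widths, the one-channel edge head, and the bilinear resizing are illustrative assumptions rather than confirmed details of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EGF(nn.Module):
    """Sketch of Edge Guidance Fusion: fuse shallow C1 with the upsampled deep feature,
    apply residual channel attention, and expose edge logits for auxiliary BCE supervision."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.attn = nn.Conv2d(channels, channels, 1)   # channel weights from the GAP descriptor
        self.edge_head = nn.Conv2d(channels, 1, 1)     # logits supervised against Canny edges

    def forward(self, c1: torch.Tensor, e4: torch.Tensor):
        e4_up = F.interpolate(e4, size=c1.shape[-2:], mode="bilinear", align_corners=False)
        e_edge = self.fuse(torch.cat([e4_up, c1], dim=1))
        w_c = torch.sigmoid(self.attn(F.adaptive_avg_pool2d(e_edge, 1)))
        e_edge = e_edge * w_c + e_edge                  # residual channel re-weighting
        return e_edge, self.edge_head(e_edge)           # upsample logits 4x before the BCE loss

class EdgeInjection(nn.Module):
    """Gated injection for one pyramid level: F_k = Conv1x1([E_k; E_edge_k * sigmoid(Conv3x3(E_k))])."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, e_k: torch.Tensor, e_edge: torch.Tensor) -> torch.Tensor:
        edge_k = F.interpolate(e_edge, size=e_k.shape[-2:], mode="bilinear", align_corners=False)
        return self.out(torch.cat([e_k, edge_k * torch.sigmoid(self.gate(e_k))], dim=1))
```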

3.4. Dual-Path Decoder

Conventional Transformer decoders predominantly employ single-path update strategies (e.g., Panoptic-DeepLab [7] exclusively updates query embeddings, while MaskFormer [9] emphasizes pixel feature optimization), implementing layer-wise alternating updates of either pixel embeddings or query embeddings. However, this unidirectional information flow may result in inadequate feature interaction, particularly when segmenting dense instances of identical categories, often leading to mask overlap issues. Drawing inspiration from the EM algorithm [14], we propose a dual-path update strategy that models pixel-instance membership relationships as latent variables, approximating the maximum likelihood solution through alternating optimization of pixel features (E-step) and queries (M-step).
The theoretical framework can be formally expressed as:
argmax_θ E_{z∼p(z|x)}[log p(x, z | θ)]
where the alternating optimization of pixel feature x and query parameter θ establishes a bidirectional information pathway.
Pixel Feature Update: Given the current query, the similarity between each pixel feature and individual queries is computed through the cross-attention mechanism. This similarity metric effectively represents the posterior probability of pixel membership to each target instance. Subsequently, the pixel features are updated as the weighted average of the queries, based on these posterior probabilities.
α_{ij} = exp(q_i^T k_j) / Σ_{k=1}^{N} exp(q_i^T k_k)
X_new = Σ_{i=1}^{M} α_{ij} v_i
where q, k, and v represent the query, key, and value vectors, respectively, which are derived through linear projection operations.
Query Update: Given the current pixel features and their corresponding posterior probabilities, each query is updated through masked attention and self-attention mechanisms. The masked attention mechanism confines the query’s attention scope to the foreground region of the mask predicted by the preceding layer, effectively weighting the query based on pixel posterior probabilities. The self-attention mechanism facilitates information exchange among queries, further refining their representations. Through this alternating update paradigm, the dual-path strategy progressively optimizes both pixel features and queries, enabling mutual adaptation and ultimately converging to a local optimal solution. This process bears similarity to the EM algorithm’s iterative optimization between latent variables and model parameters, ultimately approximating the maximum likelihood estimation.
MaskAttn(Q, X) = Softmax((Q W_Q (X W_K ⊙ M)^T) / √d) X W_V
where M ∈ {0, 1}^{H×W} represents the foreground mask, while ⊙ signifies element-wise multiplication.
Compared to conventional single-path update strategies, the dual-path approach offers several distinct advantages: (1) Enhanced information interaction: While single-path updates typically focus solely on either query-pixel interactions or pixel-level self-attention, the dual-path strategy facilitates more comprehensive information exchange between pixel features and queries through alternating updates, enabling better capture of both global and local relationships within the image. (2) Improved optimization efficiency: The bidirectional optimization inherent in the dual-path strategy accelerates convergence to optimal solutions, thereby reducing the required number of decoder layers. (3) Superior feature representation: The synergistic interaction between pixel features and queries in the dual-path framework generates more refined feature representations, consequently enhancing segmentation accuracy.
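The core of the query-path update is the masked cross-attention described above. The sketch below applies the foreground restriction additively on the attention logits (−∞ outside the mask), which is the usual realization of this operation in Mask2Former-style decoders; it is meant only to illustrate the mechanism, and the shapes and projection setup are our assumptions.

```python
import torch

def masked_cross_attention(queries, pixels, w_q, w_k, w_v, fg_mask):
    """Update queries by attending only to their predicted foreground pixels.
    queries: (B, Nq, d); pixels: (B, Np, d); fg_mask: (B, Nq, Np) in {0, 1};
    w_q, w_k, w_v: (d, d) linear projection matrices."""
    d = queries.shape[-1]
    q, k, v = queries @ w_q, pixels @ w_k, pixels @ w_v
    logits = q @ k.transpose(1, 2) / d ** 0.5             # (B, Nq, Np) similarity scores
    logits = logits.masked_fill(fg_mask == 0, float("-inf"))
    attn = torch.softmax(logits, dim=-1).nan_to_num(0.0)  # guard queries with empty masks
    return attn @ v                                       # updated query embeddings (B, Nq, d)

# B, Nq, Np, d = 2, 100, 2048, 256
# out = masked_cross_attention(torch.randn(B, Nq, d), torch.randn(B, Np, d),
#                              torch.randn(d, d), torch.randn(d, d), torch.randn(d, d),
#                              torch.randint(0, 2, (B, Nq, Np)))
```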

3.5. Loss Function

In the training of deep learning networks, the design of the loss function is crucial for model performance. This section’s loss function consists of two parts, mainly used for edge-guided supervision and dual-path decoder prediction generation. The former learns edge information from feature maps, while the latter continuously updates the generated predictions. The overall loss function of this network is:
L = L_E + L_pre
L_E is the edge supervision loss in the edge guidance module, and L_pre is the prediction loss of the dual-path decoder, including the loss on the instance-activation (IA) guided queries. L_E is defined as:
L E = λ edge L edge
where λ_edge is a balancing weight and L_edge is the edge cross-entropy loss defined in Section 3.3. The prediction loss L_pre of the dual-path decoder is defined as:
L_pre = Σ_{i=0}^{D} (λ_ce L_ce^i + λ_dice L_dice^i + λ_cls L_cls^i)
In the context of the Transformer decoder, D represents the number of layers, and i = 0 denotes the prediction loss of the IA-guided queries prior to their integration into the Transformer decoder. L_ce^i and L_dice^i refer to the binary cross-entropy loss and Dice loss, respectively, associated with the segmentation mask, while L_cls^i represents the cross-entropy loss for object categorization, with a “no object” weight of 0.1. λ_ce, λ_dice, and λ_cls are hyperparameters that balance the three losses. The Hungarian algorithm is employed to find the optimal bipartite matching for target assignment, and an additional location cost, λ_loc L_loc, is incorporated for each query to account for its spatial position.
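As a concrete illustration, the sketch below assembles this objective, assuming Hungarian matching has already paired predictions with targets; the location cost and the 0.1 "no object" class weight are omitted for brevity, and the dictionary-based interface is purely hypothetical.

```python
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, mask_gt, eps=1.0):
    """Soft Dice loss on sigmoid mask logits; both tensors have shape (N, H*W)."""
    p = mask_logits.sigmoid()
    num = 2 * (p * mask_gt).sum(-1) + eps
    den = p.sum(-1) + mask_gt.sum(-1) + eps
    return (1 - num / den).mean()

def total_loss(edge_logits, edge_gt, layer_outputs, w):
    """L = w['edge'] * L_edge + sum_i (w['ce'] * L_ce^i + w['dice'] * L_dice^i + w['cls'] * L_cls^i).
    layer_outputs: one dict per decoder layer (i = 0 .. D) with matched targets:
    'mask_logits'/'mask_gt' of shape (N, H*W), 'class_logits' (N, C), 'class_gt' (N,)."""
    loss = w["edge"] * F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
    for out in layer_outputs:
        loss = loss + w["ce"] * F.binary_cross_entropy_with_logits(out["mask_logits"], out["mask_gt"])
        loss = loss + w["dice"] * dice_loss(out["mask_logits"], out["mask_gt"])
        loss = loss + w["cls"] * F.cross_entropy(out["class_logits"], out["class_gt"])
    return loss

# Loss weights from Section 4.1.2: w = {"edge": 0.8, "ce": 2.0, "dice": 5.0, "cls": 5.0}
```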

4. Experiments

4.1. Experimental Setup

4.1.1. Datasets and Evaluation Metrics

This paper selects the widely recognized Cityscapes and MS COCO Panoptic 2017 datasets in the field of panoptic segmentation as the basis for model training and evaluation. The Cityscapes dataset primarily focuses on urban street scenes and contains 5000 high-quality annotated images (2975 for training, 500 for validation, and 1525 for testing), with an image resolution of 1024 × 2048 pixels. This dataset contains 19 semantic categories, with instance-level annotations provided for 8 of these categories. The MS COCO Panoptic 2017 dataset contains images from common scenes, with a maximum image resolution of 640 × 640 pixels. This dataset contains 133 semantic categories (80 of which are instance categories), and the training, validation, and test sets comprise 118 k, 5 k, and 20 k images, respectively. Both datasets are representative benchmarks in the field of panoptic segmentation and enable effective evaluation of the model’s performance across diverse scenes.
The panoptic quality (PQ) metric was introduced together with the panoptic segmentation task to better measure performance on it. PQ provides a comprehensive and accurate measurement of model performance by combining segmentation quality (SQ) and recognition quality (RQ). SQ reflects the model’s segmentation accuracy at the pixel level: it is the average IoU of all object masks that are correctly matched, i.e., with an intersection over union (IoU) greater than 0.5. RQ measures the accuracy of the model in recognizing target semantic categories, focusing on its ability to classify objects correctly.
The PQ, SQ, and RQ expressions are given below, where TP denotes the set of correctly matched instances, FP denotes predicted instances that match no ground-truth segment, FN denotes ground-truth instances missed by the prediction, and IoU(p, g) is the intersection over union between the predicted segment p and the ground truth g. PQ ranges from 0 to 1 and is typically expressed as a percentage.
SQ = Σ_{(p,g)∈TP} IoU(p, g) / |TP|
RQ = |TP| / (|TP| + ½|FP| + ½|FN|)
PQ = SQ × RQ
PQ not only comprehensively reflects the performance of the model in instance segmentation and semantic segmentation, but is also further refined into two sub-indicators: PQ^st and PQ^th. PQ^st evaluates the segmentation quality of non-instance (“stuff”) classes, usually background, while PQ^th measures the segmentation of instance (“thing”) classes, i.e., foreground objects. This refined evaluation system enables researchers to analyze the performance of the model in different scenarios more comprehensively and accurately.
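Once segments have been matched (IoU > 0.5), the metric definitions above reduce to a few lines of code. The helper below is an illustrative sketch, not an official evaluation script.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Compute SQ, RQ, and PQ from the IoUs of matched (TP) segment pairs and the
    counts of unmatched predictions (FP) and unmatched ground-truth segments (FN)."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)
    return sq, rq, sq * rq

# Example: three matches with IoUs 0.9, 0.8, 0.7, plus 2 false positives and 2 misses.
# sq, rq, pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=2, num_fn=2)  # -> 0.8, 0.6, 0.48
```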

4.1.2. Implementation Details

The model in this experiment was implemented using Detectron2. The model was trained using the AdamW optimizer [15] with an initial learning rate of 0.0001. Because the Cityscapes dataset has a relatively small number of training images, weights pre-trained on the ImageNet dataset [16] were used to initialize the model. During training, input images underwent multi-scale augmentation, with the shorter side randomly scaled between 512 and 1024 pixels and the longer side constrained to a maximum of 2048 pixels. Data augmentation techniques, including random horizontal flipping and random cropping, were also employed. During testing, the shorter side of the input images was uniformly resized to 1024 pixels while preserving the original aspect ratio. The ResNet-50 architecture served as the backbone network. Using the same standard backbone allows the performance improvements to be more clearly attributed to our proposed architectural innovations rather than differences in backbone capacity. While more advanced backbones like Swin Transformers might yield higher absolute scores, exploring such combinations was left for future work. Training was conducted for 200 epochs with a batch size of 16, utilizing a polynomial learning rate decay schedule with a warm-up period. The loss function described in Section 3.5 was used, with loss weights λ_edge, λ_ce, λ_dice, and λ_cls set to 0.8, 2.0, 5.0, and 5.0, respectively. The experiments were performed on a single NVIDIA Tesla A100 GPU with 40 GB of VRAM, using PyTorch 2.1.0 and CUDA version 12.1.
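For readers reproducing the training schedule, the snippet below sketches the AdamW optimizer together with a polynomial decay and linear warm-up in plain PyTorch. The warm-up length, decay power, and weight decay value are illustrative assumptions; the paper only specifies the initial learning rate, epoch count, and batch size.

```python
import torch

def poly_lr_with_warmup(optimizer, total_iters, warmup_iters=1500, power=0.9):
    """Polynomial learning-rate decay with linear warm-up, expressed as a LambdaLR factor.
    warmup_iters and power are illustrative choices, not values stated in the paper."""
    def factor(it):
        if it < warmup_iters:
            return (it + 1) / warmup_iters
        progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return (1.0 - progress) ** power
    return torch.optim.lr_scheduler.LambdaLR(optimizer, factor)

# model = build_psm_ffeg()  # hypothetical constructor for the PSM-FFEG network
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)  # weight decay assumed
# scheduler = poly_lr_with_warmup(optimizer, total_iters=200 * iters_per_epoch)
# Call scheduler.step() once per training iteration after optimizer.step().
```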
Regarding statistical validation across multiple training runs, we acknowledge the importance of such analyses for ensuring result reliability. However, due to the significant computational resources and time required to train these large models, the reported results are based on carefully conducted single training runs using fixed random seeds and standardized protocols to ensure the reproducibility of these specific runs. Consequently, we were unable to perform multiple runs to calculate margins of error or conduct statistical significance tests (e.g., t-tests) for this submission. Nevertheless, we followed standardized initialization procedures and training protocols, including fixed random seeds and hyperparameter settings established in the relevant literature, to ensure that our results are representative. We recognize this limitation (standard deviations are not reported and statistical significance cannot be claimed) and plan to conduct more comprehensive statistical validation with multiple runs in future work, resources permitting, to further strengthen the reliability of our findings.

4.2. Experimental Results and Analysis

To validate the efficacy of the proposed approach, comprehensive experiments were conducted under identical environmental and configuration settings, comparing the PSM-FFEG model against various panoptic segmentation models utilizing Res50 as the backbone. We employed widely recognized and representative metrics in panoptic segmentation, namely PQ, PQ^th, and PQ^st. The experimental results on both the Cityscapes val and MS COCO panoptic 2017 val datasets are presented in Table 1 and Table 2, respectively. The findings demonstrate that our proposed panoptic segmentation method, incorporating feature fusion and edge guidance, surpasses all comparative methods under the same backbone network configuration.
On the Cityscapes val dataset, PSM-FFEG achieved performance metrics of 64.7%, 59.4%, and 69.1% for PQ, PQ^th, and PQ^st, respectively. These results represent significant improvements of 2.6, 4.6, and 1.8 percentage points over the baseline Mask2Former model.
On the MS COCO dataset, PSM-FFEG demonstrated notable performance, with PQ, PQ^th, and PQ^st reaching 54.3%, 61.4%, and 44.3%, respectively. These results signify substantial improvements of 2.4, 3.7, and 1.3 percentage points over the baseline Mask2Former model.
To comprehensively assess the segmentation efficacy of our proposed method across diverse object categories, we conducted an in-depth analysis of 19 distinct categories within the Cityscapes dataset (as detailed in Table 3).
Table 3 presents the PQ enhancements of our method relative to Mask2Former across the categories of the Cityscapes dataset. The model demonstrates consistent improvements across all 19 categories, exhibiting robust performance across objects of varying scales. Particularly noteworthy are the substantial improvements observed for small-scale objects (including motorcycles, traffic lights, and traffic signs), with enhancements of 6.8, 4.3, and 3.4 percentage points, respectively. The ablation study results presented in Table 4 suggest that both the Dynamic Multi-Scale Feature Fusion (DLK+DFF) and the Edge Guidance (EGF) modules contribute significantly to these gains, as indicated by the step-wise improvements in PQ^th upon their introduction. Furthermore, the model achieves stable performance gains for medium-scale objects (such as pedestrians and riders) and large-scale objects (including trucks, fences, and buses). This comprehensive performance enhancement validates the effectiveness of our model’s innovative design in feature extraction, multi-scale fusion, and edge guidance, demonstrating its adaptability to diverse segmentation requirements across various scenarios.
To fully evaluate the practicality of the PSM-FFEG model, we perform a detailed analysis of its computational cost and compare it with the baseline Mask2Former model. Table 5 shows the performance and computational overhead comparison of the two models on the Cityscapes dataset.
As can be seen from Table 5, although the PSM-FFEG model improves the PQ index by 2.6 percentage points compared with Mask2Former, its computational complexity also increases accordingly. Specifically, the floating point operations (FLOPs) of PSM-FFEG increased by about 20.6%, the number of parameters increased by about 6.2%, and the frame rate (FPS) decreased by about 18.8%. In practical applications, this means that the average processing time per frame of an image increased by about 6.4 milliseconds.
This increase in computational cost mainly comes from three aspects:
  • Dynamic multi-scale feature fusion module: the cascaded convolutions and multi-dimensional attention mechanisms in the DLK and DFF modules contribute about 60% of the additional computational overhead.
  • Edge guidance module: the EGF module and its multi-scale edge information injection account for about 20% of the additional computation.
  • Dual-path decoder: the interactive pixel–query optimization strategy increases the computational workload by about 20% compared with the standard decoder.
It is worth noting that although the inference speed of PSM-FFEG is reduced, it still achieves nearly 30 FPS in actual application scenarios, which is sufficient to meet the real-time requirements of most autonomous driving and robotic vision tasks. In particular, the significant performance improvement for small objects and complex scene boundaries (as shown in Table 3: traffic lights +4.3%, motorcycles +6.8% PQ) may be enough to offset this moderate increase in computational overhead in many application scenarios.
To provide a more intuitive demonstration of our method’s segmentation capabilities, we randomly selected three complex environments from the Cityscapes and COCO panoptic segmentation datasets for panoptic segmentation visualization. The results are compared with those generated by the Mask2Former method, as illustrated in Figure 5 and Figure 6 (displaying, from left to right: the input image, ground truth annotation, Mask2Former output, and our method’s output). Critical areas are highlighted with red bounding boxes.
By comprehensively observing the segmentation results in Figure 5 (Cityscapes) and Figure 6, it can be clearly seen that the proposed PSM-FFEG method outperforms the Mask2Former baseline model in several key aspects:
1. Enhanced recognition of small objects and details: Whether it is a long-distance traffic light, pedestrian, or traffic sign on a city street, or an animal, distant object, or background in the COCO dataset, PSM-FFEG shows stronger recognition and accurate segmentation capabilities. The red box area highlights the obvious omission or ambiguity of Mask2Former on these small targets or details, while PSM-FFEG can capture and outline them more effectively. This is due to the better use of multi-scale information by the dynamic feature fusion module.
2. Improved accuracy in boundary and edge processing: Across the two datasets, PSM-FFEG shows significant advantages in processing object edges. Whether it is the boundary between vehicles and roads/backgrounds, the outlines of buildings against the sky, or the boundaries between animals and grass or between athletes, the edges generated by PSM-FFEG are clearer, smoother, and closer to the real contours. In contrast, the edges of Mask2Former sometimes appear fuzzy or jagged, or bleed slightly into adjacent regions. This directly confirms the effectiveness of the edge guidance module (EGF) in optimizing boundary representation. While these qualitative results focus on comparing the final segmentation masks, visualizing confidence scores or uncertainty maps could offer further insights into model behavior. Generating and analyzing such maps was considered outside the scope of the current comparative analysis but represents a potential direction for future work.
3. Robustness and consistency in complex scenes: In the crowded baseball scene or animal scene as shown in Figure 6, as well as the complex urban intersection (Figure 5), PSM-FFEG appears to perform better in dealing with object occlusion, dense instances, and complex backgrounds. It seems better able to distinguish adjacent instances (e.g., players in Figure 6), reducing mis-segmentation and category confusion compared to the baseline. We hypothesize that the richer features from the dynamic fusion modules and the enhanced pixel-query interaction in the dual-path decoder contribute to this improved robustness. The interactive optimization mechanism of the dual-path decoder likely plays a key role. However, we note that this assessment is primarily qualitative, as we did not employ specific datasets or metrics designed to quantitatively measure performance under varying levels of occlusion.

4.3. Ablation Study

To verify the contribution of each module of the network model in this paper, ablation experiments were conducted on the Cityscapes dataset to better analyze the PSM-FFEG model and validate the effectiveness of its key components. The experiments focus on four core components: the Dynamic Large Kernel (DLK) for multi-scale feature extraction, Dynamic Feature Fusion (DFF) for multi-scale feature fusion, Edge-Guided Fusion (EGF), and the dual-path update strategy. Each component was evaluated by incrementally adding it to the baseline. In terms of experimental settings, ResNet-50 was used as the backbone network, and DLK, DFF, EGF, and the dual-path Transformer decoder were added to the original network in turn. The results are shown in Table 4, where “-” indicates that the component was not used in the network structure, and “✔” indicates that it was used.
According to the experimental data in Table 4, the baseline Mask2Former achieves a PQ of 62.1%. As the innovative modules are added step by step, performance improves gradually. First, after adding the Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF) modules, the PQ value increases from 62.1% to 63.4%, an improvement of 1.3 percentage points. The gain is particularly pronounced for object categories, with PQ^th increasing from 54.8% to 56.8%, which verifies the effectiveness of the feature fusion modules in multi-scale feature extraction and fusion. Further introducing the edge-guided module, the model performance continues to improve, with PQ reaching 64.1%, an improvement of 0.7 percentage points over the previous stage. Notably, PQ^th increases significantly to 58.6%, indicating that the introduction of edge information is particularly helpful for improving the segmentation of object categories. While this internal ablation quantifies the benefit of the EGF module within our panoptic framework, a direct quantitative comparison to traditional semantic segmentation methods focused solely on edge preservation (e.g., using boundary F-measure metrics) is complex due to differing tasks and evaluation protocols. Our focus here is demonstrating improvement over a strong panoptic baseline.
At the same time, the AP value also increases to 39.6%, indicating an improvement in overall detection performance. Finally, isolating the effect of the dual-path strategy by comparing the model with it (last row of Table 4) versus without it (second-to-last row of Table 4), the full model achieves the best performance: PQ reaches 64.7% (+0.6), with PQ^th and PQ^st reaching 59.4% (+0.8) and 69.1% (+0.3), respectively, and IoU reaching 80.3% (+0.5). This demonstrates that the dual-path strategy effectively enhances feature interaction and further optimizes the segmentation of different categories. Overall, by gradually stacking the innovative modules, the model achieves continuous improvement in all evaluation metrics. These experimental results fully prove the effectiveness of each module; the modules complement each other in different aspects: DLK+DFF provides strong feature extraction and fusion capabilities, the edge-guided module enhances boundary details, and the dual-path strategy further optimizes overall performance.

5. Conclusions

This research introduces the PSM-FFEG panoptic segmentation framework, presenting three significant innovations to address the challenges of multi-scale feature fusion and boundary segmentation in complex scenarios: (1) A dynamic multi-scale feature fusion architecture that facilitates adaptive contextual information capture through deformable convolution cascading and dual-dimensional channel-spatial attention mechanisms. (2) An edge-guided fusion module that substantially enhances object boundary segmentation accuracy, particularly for small objects, by explicitly extracting and augmenting edge features while integrating them with multi-scale representations. (3) A dual-path decoder that elevates overall segmentation performance through an alternating pixel-query optimization strategy.
Despite the promising results demonstrated on Cityscapes and MS COCO, our proposed PSM-FFEG approach has several limitations that warrant further investigation and offer avenues for future research:
  • Computational Complexity: The introduction of the cascaded convolutions and attention mechanisms within the DLK modules, the cross-scale fusion in DFF, and particularly the iterative nature of the Dual-Path Transformer Decoder contribute to increased computational complexity (FLOPs and parameters) compared to simpler baseline models like Mask2Former with a standard decoder. This might pose challenges for deployment in real-time applications or on resource-constrained hardware without further optimization or model compression.
  • Hyperparameter Sensitivity: The performance of PSM-FFEG can be sensitive to the setting of certain hyperparameters. This includes the loss weights balancing the main panoptic loss terms (λ_ce, λ_dice, λ_cls) and the auxiliary edge loss (λ_edge), as well as potential internal parameters within the attention or fusion modules. Achieving optimal performance currently requires careful empirical tuning for each dataset, which can be time-consuming.
  • Long-range Dependency Modeling: While the DLK module increases the effective receptive field and attention mechanisms capture context, the model’s ability to explicitly model very long-range spatial dependencies across the entire image might still be limited compared to architectures employing global self-attention across all pixels, which are often computationally prohibitive at high resolutions. The attention mechanisms used are primarily focused within local windows or feature channels.
  • Potential Failure Cases: While demonstrating overall improvement, the model might still struggle in certain scenarios. These could include: objects belonging to extremely rare categories not well represented in the training data; images with severe weather, lighting, or blur exceeding training conditions; objects with highly atypical appearances; or instances where the auxiliary edge supervision was noisy or inaccurate, potentially providing misleading guidance. A systematic analysis of such failure modes was not conducted but is important for understanding robustness.
These limitations motivate future research directions. We plan to explore: (1) Techniques for model compression and efficiency optimization to reduce the computational overhead of the dynamic modules and decoder; (2) Developing adaptive or meta-learning approaches to automate hyperparameter tuning (especially loss weights); and (3) Investigating the integration of lightweight global context mechanisms or graph neural networks to enhance long-range dependency modeling without excessive cost.

Author Contributions

Main writer and literature review, L.Y.; supervision, ideas and knowledge contribution, and review, S.T. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Hunan Province (No. 2023JJ30185) and the Scientific Research Fund of Hunan Provincial Education Department (No. 22A0640), and 166 Engineering Project (232CXCYT-105010501).

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We sincerely thank all anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 9404–9413. [Google Scholar]
  2. Guo, Y.; Liu, Y.; Georgiou, T.; Lew, M.S. A review of semantic segmentation using deep neural networks. Int. J. Multimed. Inf. Retr. 2018, 7, 87–93. [Google Scholar] [CrossRef]
  3. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
  4. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  6. Xiong, Y.; Liao, R.; Zhao, H.; Hu, R.; Bai, M.; Yumer, E.; Urtasun, R. UPSNet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 8818–8826. [Google Scholar]
  7. Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12475–12485. [Google Scholar]
  8. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Kaiser, L.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064. [Google Scholar]
  9. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  10. Yao, Z.; Wang, X.; Bao, Y. K-Query: Panoptic segmentation method based on keypoint query. J. Comput. 2023, 46, 1693–1708. [Google Scholar] [CrossRef]
  11. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6399–6408. [Google Scholar]
  12. Azad, R.; Niggemeier, L.; Hüttemann, M.; Antoine, E.; Baumgartner, M. Beyond self-attention: Deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2024; pp. 1287–1297. [Google Scholar]
  13. Wang, Z.; Lin, X.; Wu, N.; Zhuang, Y.; Wang, J. DTMFormer: Dynamic Token Merging for Boosting Transformer-Based Medical Image Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5814–5822. [Google Scholar]
  14. Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60. [Google Scholar] [CrossRef]
  15. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  16. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
  17. Hou, R.; Li, J.; Bhargava, A.; Raventos, A.; Guizilini, V.; Fang, C.; Lynch, J.; Gaidon, A. Real-time panoptic segmentation from dense detections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 8523–8532. [Google Scholar]
  18. Chang, S.E.; Chen, Y.; Yang, Y.C.; Lin, E.T.; Hsiao, P.Y.; Fu, L.C. SE-PSNet: Silhouette-based enhancement feature for panoptic segmentation network. J. Vis. Commun. Image Represent. 2023, 90, 103736. [Google Scholar] [CrossRef]
  19. Mohan, R.; Valada, A. EfficientPS: Efficient panoptic segmentation. Int. J. Comput. Vis. 2021, 129, 1551–1579. [Google Scholar] [CrossRef]
  20. Xu, Y.; Liu, R.; Zhu, D.; Wang, W.; Wu, J.; Liu, L.; Wang, S.; Zhao, B. Cascade contour-enhanced panoptic segmentation for robotic vision perception. Front. Neurorobotics 2024, 18, 1489021. [Google Scholar] [CrossRef] [PubMed]
  21. Li, Y.; Zhao, H.; Qi, X.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Fully convolutional networks for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 214–223. [Google Scholar]
  22. Jain, J.; Li, J.; Chiu, M.; Hassani, A.; Tulyakov, S.; Minaee, S. OneFormer: One transformer to rule universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998. [Google Scholar]
  23. Li, Q.; Qi, X.; Torr, P.H. Unifying training and inference for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 14–19 June 2020; pp. 13320–13328. [Google Scholar]
  24. Lin, G.; Li, S.; Chen, Y.; Li, X. IDNet: Information decomposition network for fast panoptic segmentation. IEEE Trans. Image Process. 2023, 33, 1487–1496. [Google Scholar] [CrossRef] [PubMed]
  25. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-Net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338. [Google Scholar]
  26. Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 5463–5474. [Google Scholar]
Figure 1. The overall architecture of the proposed PSM-FFEG network. Input images are processed by a ResNet-50 backbone. The Feature Encoding Layer uses Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF) modules for multi-scale context. The Edge Guidance Layer fuses shallow and deep (E4’) features, applies edge supervision (via EGF module), and injects edge information into the feature pyramid F1–F4. The Dual-Path Decoder (Right side of dotted line) interactively refines pixel embeddings and object queries using F1–F4 to produce final panoptic segmentation outputs (class and mask predictions).
Figure 2. Detailed structure of the Dynamic Large Kernel (DLK) module. It employs parallel depthwise separable convolutions (5 × 5 and 7 × 7 with dilation) cascaded to achieve a large effective receptive field (23 × 23). Dual-dimensional attention (spatial ‘As’ via Conv7 × 7 on pooled features, channel ‘Ac’ via MLPs on GAP/GMP) adaptively modulates features before residual addition (‘Z’).
Figure 3. Structure of the Dynamic Feature Fusion (DFF) module. It adaptively fuses features from two adjacent levels (F1, F2 shown as example). High-level features (F2) undergo deformable upsampling (see Equation (7)) before concatenation with low-level features (F1). A channel attention mechanism (Conv 1 × 1 -> Sigmoid) generates dynamic fusion weights (‘W’) based on the combined features to produce the fused output (F3).
Figure 4. Structure of the Edge Guidance Fusion (EGF) component within the Edge Guidance Layer. It fuses shallow backbone features (C1, denoted L1) with deep encoder features (E4’, denoted L4 after upsampling) using convolutions. Channel attention enhances edge-relevant features (‘Edge Feature’). This module facilitates auxiliary edge supervision during training.
Figure 5. Qualitative comparison of panoptic segmentation results on Cityscapes.
Figure 6. Qualitative comparison of panoptic segmentation results on COCO.
Table 1. Comparison of panoptic segmentation results on the Cityscapes validation dataset.

Method | Backbone | PQ | PQ^th | PQ^st
RT Panoptics [17] | Res50 | 58.8 | 52.1 | 63.7
UPSNet [6] | Res50 | 59.3 | 54.6 | 62.7
SE-PSNet [18] | Res50 | 60.0 | 55.9 | 62.9
EfficientPS [19] | Res50 | 60.3 | 55.3 | 53.9
CCPSNet [20] | Res50 | 60.5 | 56.9 | 63.1
PanopticFCN [21] | Res50 | 61.4 | 54.8 | 66.6
Mask2Former [9] | Res50 | 62.1 | 54.8 | 67.3
OneFormer [22] | Res50 | 62.7 | 55.8 | 68.4
K-Query [10] | Res50 | 63.2 | 56.2 | 68.3
Ours | Res50 | 64.7 | 59.4 | 69.1
Table 2. Comparison of panoptic segmentation results on the COCO dataset.

Method | Backbone | PQ | PQ^th | PQ^st
UPSNet [6] | Res50 | 42.5 | 48.6 | 33.4
Unifying [23] | Res50 | 43.4 | 48.6 | 35.5
IDNet [24] | Res50 | 42.1 | 47.5 | 33.9
CCPSNet [20] | Res50 | 43.0 | 49.2 | 33.6
PanopticFCN [21] | Res50 | 44.3 | 50.0 | 35.6
K-Net [25] | Res50 | 47.1 | 51.7 | 40.3
Max-DeepLab [26] | Max-X | 48.4 | 53.0 | 41.5
Mask2Former [9] | Res50 | 51.9 | 57.7 | 43.0
OneFormer [22] | Res50 | 52.2 | 57.9 | 43.5
K-Query [10] | Res50 | 52.9 | 58.9 | 43.8
Ours | Res50 | 54.3 | 61.4 | 44.3
Table 3. Per-category PQ (%) on the Cityscapes dataset.

Category | Mask2Former | PSM-FFEG | Improvement
Road | 98.5 | 99.1 | 0.6
Sidewalk | 86.7 | 88.9 | 2.2
Building | 92.3 | 94.2 | 1.9
Wall | 58.4 | 63.8 | 5.4
Fence | 62.1 | 65.7 | 3.6
Poles | 65.3 | 68.9 | 3.6
Traffic Light | 70.2 | 74.5 | 4.3
Traffic Sign | 79.8 | 83.2 | 3.4
Vegetation | 92.4 | 93.8 | 1.4
Terrain | 62.8 | 65.9 | 3.1
Sky | 94.9 | 95.7 | 0.8
Person | 82.6 | 85.4 | 2.8
Rider | 65.1 | 68.8 | 3.7
Car | 94.3 | 95.9 | 1.6
Truck | 68.7 | 74.3 | 5.6
Bus | 78.9 | 83.6 | 4.7
Train | 71.2 | 75.8 | 4.6
Motorcycle | 65.4 | 72.2 | 6.8
Bicycle | 77.2 | 80.5 | 3.3
Table 4. Results of the ablation experiments on the Cityscapes val dataset.

Network | DLK+DFF | Edge-Guided | Dual-Path | PQ | PQ^th | PQ^st | AP | IoU
Mask2Former | - | - | - | 62.1 | 54.8 | 67.3 | 37.3 | 77.5
PSM-FFEG | ✔ | - | - | 63.4 | 56.8 | 68.2 | 38.5 | 79.3
PSM-FFEG | ✔ | ✔ | - | 64.1 | 58.6 | 68.8 | 39.6 | 79.8
PSM-FFEG | ✔ | ✔ | ✔ | 64.7 | 59.4 | 69.1 | 40.7 | 80.3
Table 5. Comparison of computational cost and performance between PSM-FFEG and Mask2Former.

Model | PQ (%) | FLOPs (G) | Parameters (M) | FPS | Inference (ms)
Mask2Former | 62.1 | 74.3 | 43.8 | 36.1 | 27.7
PSM-FFEG | 64.7 | 89.6 | 46.5 | 29.3 | 34.1
Difference | +2.6 | +15.3 | +2.7 | −6.8 | +6.4


