Frequency Adaptive PEM: Marine Ship Panoptic Segmentation

Yuan, Ming; Meng, Hao; Wu, Junbao; Cao, Yiqian

doi:10.3390/jmse14050419


¹ College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin 150001, China

² Key Laboratory of Intelligent Technology and Application of Marine Equipment, Harbin Engineering University, Harbin 150001, China

³ Tianjin Navigation Instruments Research Institute, Tianjin 300131, China

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(5), 419; https://doi.org/10.3390/jmse14050419

Submission received: 16 January 2026 / Revised: 12 February 2026 / Accepted: 18 February 2026 / Published: 25 February 2026

(This article belongs to the Section Ocean Engineering)


Abstract

Panoptic segmentation of ships plays a crucial role in intelligent navigation and maritime safety, providing essential references for route planning and collision avoidance. However, the complexity of the maritime environment, including issues such as water surface reflections, weather disturbances, and the challenge of detecting small ship targets, significantly increases the difficulty of the segmentation task. To address these challenges, this paper proposes a novel panoptic ship segmentation framework, FA PEM, based on the PEM algorithm. First, we propose the Dynamic Correlation-Aware Upsampling (DCAU) module, which adopts a content-adaptive sampling point selection and grouped upsampling strategy, significantly improving boundary alignment and fine-grained feature extraction. Second, we propose the Spatial-Frequency Attention Module (SFAM). By modeling both spatial and frequency domain features, this module integrates multi-scale depthwise convolutions and Fourier transforms, enhancing the model’s ability to perceive both global structures and local textures. Furthermore, to address the lack of an appropriate dataset for ship panoptic segmentation, we construct and annotate a new dataset, the Ship Panoptic Segmentation Dataset (SPSD), consisting of 4360 ship images. Experimental results demonstrate that FA PEM significantly outperforms the baseline PEM on both the Cityscapes and SPSD datasets, achieving advanced performance and exhibiting strong generalization ability.

Keywords:

ship panoptic segmentation; dynamic upsampling; spatial-frequency; adaptive attention mechanism

1. Introduction

In recent years, the rapid progress of deep learning technologies has significantly advanced the field of image segmentation. Within this context, panoptic segmentation combines the advantages of both semantic segmentation and instance segmentation, requiring the precise delineation of background regions and every individual object in an image. As a crucial pixel-level task in computer vision, panoptic segmentation has wide applications in domains such as healthcare [1,2,3], autonomous driving [4,5], and industrial inspection [6], and has gradually become a major research focus in the computer vision community. In particular, within the domains of intelligent ship navigation [7,8], marine detection [9,10], and port management [11,12], panoptic segmentation enables the distinction between ships and various obstacles in maritime scenes. This capability provides critical support for route planning and collision avoidance, thereby reducing the risk of maritime accidents. Consequently, research on ship panoptic segmentation in marine environments has considerable value from both theoretical and practical perspectives. However, ship panoptic segmentation faces several critical challenges. First, the complex and dynamic ocean environment, characterized by surface reflections, wave motion, and adverse weather conditions, greatly increases the difficulty of segmentation and makes it challenging for models to distinguish ships from the background. Second, there exists significant scale variation among ships, with small vessels frequently missed or misclassified by segmentation networks. Third, the lack of diverse and richly annotated datasets represents a major challenge. Obtaining high-quality labeled data is both resource-intensive and time-consuming, and such data are particularly scarce in diverse maritime environments. This limitation hinders the generalization capacity and practical performance of segmentation models.

Furthermore, the scale and quality of the training dataset significantly affect the model’s generalization ability and accuracy. In particular, in panoptic segmentation tasks, the number of training samples directly impacts the performance of deep convolutional neural networks (CNNs). Recent studies [13,14] show that when the number of training samples is limited, the model is prone to overfitting, resulting in poor performance on unseen data. Conversely, increasing the number of training samples typically enhances the model’s accuracy and reduces the risk of overfitting. However, increasing the dataset size does not always result in improved performance. Overly large datasets can lead to underfitting, especially when the model is of low complexity. This prevents the model from learning the complex features in the data effectively, thereby wasting the potential of the dataset. Therefore, selecting the appropriate number of training samples, combined with quality-aware dataset optimization strategies, is critical for improving the performance of panoptic segmentation models. In addition to sample size, the quality of the dataset is equally important. Data augmentation techniques [15,16], such as flipping, rotation, and scaling, can effectively enhance the diversity of the training data, improving the model’s generalization ability and reducing overfitting. These quality-aware optimization strategies not only increase the diversity of the dataset but also improve the model’s performance in complex scenarios.

Currently, most panoptic segmentation models depend on remote sensing images [17,18], infrared thermal images [19], and visible light images [20]. Remote sensing images provide broad coverage, but their low spatial resolution limits the detection of small objects. Infrared images can capture the reflective characteristics of objects and are suitable for continuous monitoring under all weather conditions and at any time of day; however, they are susceptible to environmental and atmospheric interference. In comparison, visible light images provide higher spatial resolution and richer texture information, and they are also more easily acquired. To address the challenges of ship panoptic segmentation, we construct a new visible light ship dataset, the Ship Panoptic Segmentation Dataset (SPSD). To enhance model performance, we optimize both the scale and quality of the SPSD dataset during its construction. The dataset covers diverse maritime and port scenarios and includes multiple ship categories. We apply data augmentation techniques, such as image scaling, flipping, and cropping, to increase the diversity of training samples, thereby improving the model’s generalization capability and mitigating overfitting. Additionally, we refine the annotation process to ensure high accuracy and consistency of labels, providing high-quality training data for the model. Through these optimization strategies, the SPSD dataset effectively supports the training of ship panoptic segmentation models and improves segmentation accuracy in complex maritime environments.

In the field of deep learning, scene-aware panoptic segmentation algorithms have become a prominent research focus. In the context of maritime ship segmentation, challenges arise from complex variations in environmental illumination, weather disturbances, and the similarity among different types of ships, which result in difficulties in fine-grained feature extraction and multi-scale representation. To address these challenges, researchers have proposed various methods for extracting fine-grained information and performing multi-scale panoptic segmentation. In maritime-related studies, PanSR [21] introduces an object-centric proposal module and a new proposal-aware matching scheme, which significantly improves panoptic segmentation performance for small objects and dense scenes. Nevertheless, its reliance on mask prediction constrained by bounding boxes restricts the accurate modeling of complex boundaries and fine-grained structures. PanopticUAV [9] incorporates deformable convolution, CBAM attention, and a Laplacian boundary enhancement module to improve segmentation accuracy, but it still experiences missed detections and blurred boundaries when processing small objects and complex environments. Beyond models specifically designed for maritime scenarios, numerous open-source panoptic segmentation frameworks have explored multi-scale modeling and fine-grained feature representation in recent years. For example, Panoptic FPN [22] unifies instance and semantic segmentation by sharing a feature pyramid network (FPN) structure, thereby improving overall segmentation accuracy. However, it lacks the ability to dynamically adapt features across different scales, which limits its effectiveness for objects of various sizes. 
YOSO [23] employs a lightweight feature pyramid aggregator and a separable dynamic decoder to reduce computational complexity, but this also reduces the representation of high resolution features and deep semantic information, resulting in limited ability to segment small objects and extract complex boundaries. BiSeNetFormer [24] combines a spatial branch and a semantic branch with a mask classification mechanism, achieving unified modeling for multiple tasks and a favorable balance between inference speed and segmentation accuracy. However, its segmentation accuracy for fine-grained targets in complex scenes still requires improvement. RT-YOSO [25] utilizes an efficient STDC backbone and instance-aware cropping to ensure that training samples effectively contain the centers of target instances, thereby improving segmentation of small objects and diverse categories. However, its modeling of detailed features near object boundaries remains limited, and in scenarios with ambiguous boundaries, the model remains prone to inaccurate boundary segmentation. The PEM [26] algorithm introduces a prototype-based cross-attention mechanism and a multi-scale feature pyramid network to enhance efficiency. Nonetheless, when dealing with complex scenes and multi-scale panoptic segmentation, PEM [26] does not fully consider the influence of fine-grained information and global contextual information on the segmentation of objects at different scales.

Although existing panoptic segmentation methods achieve notable progress in segmentation accuracy, fine-grained feature modeling, and multi-scale feature extraction, they still exhibit limitations in complex maritime environments. In particular, they remain insufficient in global context modeling and multi-scale information representation. Factors such as illumination variations, water-surface reflections, and wave interference further challenge accurate boundary alignment and foreground–background separation. To address these challenges, we propose the FA PEM network, which is developed based on the PEM [26] baseline. This network incorporates a DCAU module and an SFAM module. The DCAU module employs a content-adaptive sampling point selection mechanism, enabling the model to flexibly determine upsampling regions based on local features and structures. This approach facilitates better alignment with object boundaries and enhances the model’s ability to capture fine-grained details. The DCAU module effectively alleviates boundary ambiguity caused by water-surface reflections and wave-induced disturbances, thereby improving boundary alignment accuracy between vessels and background regions. The SFAM module performs feature fusion in both the spatial and frequency domains, fully leveraging low-frequency structural features while preserving high-frequency edge and texture information. This dual-domain feature fusion enhances the model’s ability to distinguish targets in cluttered maritime scenes and improves robustness in complex backgrounds. In particular, it strengthens multi-scale object representation and global context modeling. These two modules operate synergistically to enhance both fine-grained feature representation and global context modeling, improve object localization accuracy and mask prediction quality, and make FA PEM more suitable for panoptic segmentation in complex maritime environments.

The main contributions of this paper are as follows:

We propose a novel panoptic segmentation framework, FA PEM, designed to achieve efficient multi-scale and fine-grained panoptic segmentation for ships in challenging environments with complex backgrounds.
We propose the DCAU module, which employs a grouped upsampling strategy and a content-adaptive sampling point selection mechanism. This enables the model to better align with object boundaries and further enhances consistency within segmented regions. As a result, it effectively alleviates boundary confusion during the upsampling process and improves the model’s ability to extract fine-grained information.
We propose the SFAM module, which utilizes a dual-branch strategy in both the spatial and frequency domains. The spatial branch applies multi-scale depthwise separable convolutions to extract multi-scale information, while the frequency branch leverages techniques such as the Fourier transform to extract and enhance frequency features from feature maps. This design enables effective modeling of global contextual information.
To address the lack of existing ship panoptic segmentation datasets, we manually annotate a comprehensive dataset, SPSD, that encompasses a wide range of scenarios and rich category diversity. This dataset contains 4360 sample images and covers various real-world scenes, including oceans, docks, and rivers, thus providing a solid foundation for subsequent experiments and practical applications.
We conduct extensive experiments and analyses on both the publicly available Cityscapes dataset [27] and our self-constructed SPSD dataset. Experimental results demonstrate that FA PEM achieves strong performance and shows robust results for panoptic segmentation of ships in complex scenarios.

The remainder of this paper is organized as follows: Section 2 reviews related work on ship datasets and panoptic segmentation. Section 3 provides a detailed description of our new dataset and the proposed FA PEM method, including the overall architecture and implementation details. Section 4 presents the experiments and evaluates the outcomes. Section 5 concludes with a summary of our findings and discusses future directions for development.

2. Related Work

This section reviews related work utilized in our proposed algorithm, including ship datasets and panoptic segmentation algorithms.

2.1. Ship Datasets

In recent years, deep learning techniques have been widely applied to ship detection and panoptic segmentation tasks. However, publicly available ship datasets remain limited, especially those specifically designed for panoptic segmentation. In the following, we introduce several representative public ship datasets.

HRSID Dataset [28]: The HRSID dataset is specifically designed for ship detection, semantic segmentation, and instance segmentation in high-resolution SAR imagery. It comprises 5604 high-resolution SAR images and 16,951 ship instances, encompassing various resolutions, sea states, sea areas, and coastal ports. However, this dataset is not suitable for panoptic segmentation research.

SeaShips Dataset [29]: The SeaShips dataset is a large-scale collection consisting of 31,455 images covering six common types of ships. All images are extracted from approximately 10,080 real-world video clips, providing a diverse range of scales, ship types, and backgrounds. Although this dataset is notable for its size, it is primarily designed for object detection tasks and is not suitable for research in panoptic segmentation.

MariShipInsSeg Dataset [30]: The MariShipInsSeg dataset consists of 4001 visible light marine ship images and 8413 ship instances. It covers seven ship types and various sea state scenarios. This dataset is primarily intended for instance segmentation and is not designed for panoptic segmentation research.

Most existing ship datasets are developed for object detection and instance segmentation tasks, with a notable lack of datasets specifically tailored to panoptic segmentation. To address this gap, we construct SPSD, a dataset for ship panoptic segmentation that contains 4360 images. The SPSD dataset includes eight representative ship categories across a variety of oceanic scenes.

2.2. Panoptic Segmentation

Panoptic segmentation is the fusion of semantic segmentation and instance segmentation, focusing on the simultaneous segmentation of both background regions and object instances. In segmentation tasks, “stuff” refers to amorphous, uncountable regions, such as land, sky, and ocean. In contrast, “things” refers to countable instances, such as different ship categories. Since panoptic segmentation achieves both semantic and instance segmentation, it provides a comprehensive understanding of the scene as well as instance-level category segmentation, thereby opening up possibilities for real-world applications, such as environmental monitoring [9,31], agricultural protection [6,32], and intelligent video surveillance [33,34]. Currently, deep learning-based panoptic segmentation methods are categorized into bottom-up approaches, top-down approaches, and unified panoptic segmentation frameworks based on their network structures.

Bottom-up approaches include Panoptic FPN, which extends the Mask R-CNN [35] framework by adding a semantic segmentation branch to achieve efficient parallel learning. Axial-DeepLab [36] introduces an independent axial attention mechanism that decomposes the two-dimensional attention into two separate one-dimensional attentions along the height and width axes. This design enables the model to capture a larger receptive field using independent attention mechanisms. Max-DeepLab [37] predicts a set of non-overlapping masks and their corresponding class labels directly, enabling truly end-to-end panoptic segmentation. YOSO [23] leverages dynamic convolution to perform segmentation between panoptic kernels and image feature maps in a single step, simultaneously completing both instance and semantic segmentation. In general, bottom-up methods offer faster inference speed, but their accuracy in complex scenes may be limited.

Top-down approaches include UPSNet [38], which proposes a unified panoptic segmentation head and a class-agnostic prediction mechanism to resolve conflicts between instance and semantic predictions. Auto-Panoptic [39] utilizes neural architecture search (NAS) to automatically design efficient panoptic segmentation network architectures, simultaneously addressing instance segmentation for foreground objects and semantic segmentation for background regions. EfficientPS [40] introduces a parameter-free adaptive fusion mechanism that dynamically adjusts the fusion strategy based on the prediction confidence from the semantic and instance heads, thereby preserving more fine-grained details. While top-down methods tend to achieve higher segmentation accuracy, they generally require greater computational resources.

Unified panoptic segmentation frameworks include models such as LPSNet [41], which decomposes the panoptic segmentation task into parallel object detection and semantic segmentation subtasks, subsequently merging their outputs through a parameter-free panoptic head. Panoptic SegFormer [42] separates the set of queries into independent foreground and background subsets, thereby minimizing category interference and improving segmentation quality. PanopticDepth [43] simultaneously addresses panoptic segmentation and depth estimation by employing instance-specific convolutional kernels, resulting in high-quality segmentation outputs and depth maps. OneFormer [44] achieves multi-task unification within a single model, significantly reducing computational resource requirements. By sharing features and enabling information exchange, OneFormer improves both efficiency and accuracy. However, in striving for structural simplicity and unified task modeling, these frameworks inevitably sacrifice the ability to capture fine-grained spatial details and precise object boundaries. Mask2Former [45] incorporates a masked attention mechanism into the Transformer decoder, restricting attention to the predicted segmentation regions. This design enables more effective extraction of local features, thereby enhancing model performance and accelerating convergence.

Considering both the accuracy and maturity of existing algorithms, and aiming to improve panoptic segmentation performance for ships in marine environments, we propose the FA PEM panoptic segmentation algorithm based on the PEM [26] baseline. The proposed method introduces the DCAU module, which leverages a grouped upsampling strategy to enhance computational efficiency and reduce complexity. Simultaneously, it adopts an adaptive upsampling mechanism and a content-aware attention mechanism to dynamically adjust sampling positions during upsampling, effectively enhancing the model’s ability to represent features along object boundaries and in fine-grained regions. In addition, we propose the SFAM module, which integrates multi-scale spatial feature modeling with frequency information enhancement. This integration substantially improves the model’s capacity to extract global features as well as capture local details. The collaboration between these two modules significantly enhances the model’s panoptic segmentation performance and generalization capability for ship targets in complex marine environments.

3. Methodology

In this section, we present a brief overview of the proposed FA PEM method for panoptic segmentation of ships in complex backgrounds. The detailed components of our approach, including the DCAU module, the SFAM module, and the loss function, are described in the following subsections.

3.1. Overall Architecture

To enhance panoptic perception capabilities for intelligent ship navigation, we propose the FA PEM architecture for ship segmentation. The overall framework of FA PEM is illustrated in Figure 1. FA PEM is developed based on the PEM [26] model and incorporates several improvements to boost panoptic segmentation performance in complex scenarios. The model employs ResNet-50 as the backbone network, which first performs multi-scale feature extraction on the input images, establishing a solid foundation for subsequent segmentation tasks. In the decoder stage, the model introduces the DCAU module, which leverages content-adaptive dynamic sampling point selection and a grouped upsampling strategy. This enables the upsampling process to flexibly adjust sampling according to feature content, effectively aligning object boundaries and significantly enhancing the extraction of fine-grained features as well as the accuracy of segmentation details. Furthermore, in the deep feature fusion stage, we propose the SFAM module to achieve collaborative modeling of spatial and frequency domain features. In the spatial domain, the module applies depthwise separable convolutions at multiple scales and a linear attention mechanism [46] to comprehensively capture multi-scale spatial features. In the frequency domain, the module utilizes the Fourier transform to extract and enhance low-frequency structural features and high-frequency edge details, thereby enabling complementary fusion of fine-grained features and global structural information. The multi-scale features processed by the above modules are then fed into a prototype mask segmentation head for pixel-wise prediction, ultimately achieving accurate panoptic segmentation of different ship categories in complex scenes.

3.2. Dynamic Correlation-Aware Upsampling (DCAU) Module

In dense prediction tasks such as panoptic segmentation, models typically rely on upsampling operations to transform low-resolution feature maps into high-resolution ones in order to meet output resolution requirements. Although performance metrics in panoptic segmentation continue to improve, boundary prediction remains suboptimal. Traditional upsampling methods, such as bilinear interpolation and nearest neighbor interpolation, assign sampling points based on a fixed rule determined by relative distance. This often results in multiple upsampled points being incorrectly assigned to the same semantic cluster in regions requiring detailed representation, thereby failing to capture local variations. When processing points belonging to different semantic clusters, these methods cannot clearly distinguish point assignments, leading to blurred boundaries between clusters and reduced boundary sharpness in segmentation results. Although transposed convolution is a learnable approach, its upsampling kernel remains fixed after training. This restricts its ability to flexibly adapt to changes in the input data, as the assignment rule for upsampled points does not change, making it difficult to meet the requirements of complex and dynamic real-world tasks.
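The blurring effect of a fixed, distance-based rule can be seen in a minimal one-dimensional sketch. The function name `linear_upsample_2x` and the toy step signal are illustrative assumptions and do not come from the paper; pixel-centre alignment is also an assumed convention. Interpolating across a step edge produces intermediate values that belong to neither side of the boundary:

```python
def linear_upsample_2x(signal):
    """Upsample a 1-D signal by 2x with fixed linear-interpolation weights.

    Output sample j sits at input coordinate (j + 0.5) / 2 - 0.5
    (pixel-centre alignment, an assumed convention). The weights depend
    only on that distance, never on the signal content.
    """
    n = len(signal)
    out = []
    for j in range(2 * n):
        pos = (j + 0.5) / 2 - 0.5
        pos = min(max(pos, 0.0), n - 1.0)  # clamp at the borders
        i0 = int(pos)
        i1 = min(i0 + 1, n - 1)
        t = pos - i0
        out.append((1 - t) * signal[i0] + t * signal[i1])
    return out


# A sharp edge between two "semantic clusters" (values 0 and 1).
edge = [0.0, 0.0, 1.0, 1.0]
up = linear_upsample_2x(edge)
print(up)
```

The two samples that straddle the edge receive mixed values, which is the boundary ambiguity that a content-adaptive rule is meant to avoid.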

To address these challenges, we propose the DCAU module, which is illustrated in Figure 2. It is specifically designed for panoptic segmentation tasks, aimed at addressing boundary blur and segmentation inaccuracies in traditional upsampling methods. The DCAU module is inspired by upsampling optimization techniques in semantic segmentation. However, unlike upsampling methods in semantic segmentation, which primarily focus on pixel-level classification and often rely on fixed rules or global learning to restore image resolution, the DCAU module explicitly addresses boundary alignment between object instances. In panoptic segmentation tasks, precise boundary alignment of object instances is crucial. The DCAU module dynamically selects sampling points based on local image features, using content-adaptive sampling, thereby effectively mitigating the boundary blur issues caused by fixed sampling rules in traditional methods. Through this dynamic adjustment, the DCAU module accurately aligns the boundaries of object instances, ensuring more precise segmentation details. Additionally, the DCAU module employs a grouped upsampling strategy, performing feature map grouping during the resolution restoration process. This not only improves computational efficiency but also maintains segmentation accuracy.

The DCAU module consists of three main steps: sampling point selection, weight generation, and feature fusion. In the sampling point selection stage, a set of relevant sampling points is selected from the decoder features for each encoder feature point. During the weight generation stage, kernel weights are generated by computing inner product similarity and applying softmax normalization to obtain similarity-aware weights. In the feature fusion stage, the selected sampling points are aggregated through a weighted summation using the generated kernel weights, resulting in upsampled feature representations.

In the sampling point selection phase, we first apply deformable convolution (DCN) to the decoder features $F \in \mathbb{R}^{H \times W \times C}$. DCN adapts the receptive field according to the input features, allowing for more flexible perception of target areas with varying deformations and complex structures, thereby enhancing spatial awareness and fine-grained detail modeling. From the DCN output, denoted $F_{1} \in \mathbb{R}^{H \times W \times C}$, relevant points are then selected dynamically based on feature content: a linear layer projects the features to offset values, improving the model’s ability to capture fine-grained details. We aim to upsample the low-resolution feature map $F_{1}$ to a high-resolution feature map $F' \in \mathbb{R}^{2H \times 2W \times C}$. For each point $l'$ in $F'$, we first calculate its corresponding base position in $F_{1}$, denoted $l = \lfloor l'/2 \rfloor$. Then, from the neighborhood of $l$ in $F_{1}$, we dynamically select $s$ sampling points, denoted $L_{l} = \{p_{i} \mid p_{i} \in \mathbb{R}^{2}, i = 1, \dots, s\}$, to serve as candidate semantic regions. The features corresponding to these sampling points are represented as $S_{l} = \{x_{i} \mid x_{i} \in \mathbb{R}^{C}, i = 1, \dots, s\}$. The formulas for sampling point selection are as follows:

$$F_{1} = \mathrm{DCN}(F) \tag{1}$$

$$L_{l} = \phi(F_{1}) \tag{2}$$

$$S_{l} = \mathrm{Sam}(F_{1}, L_{l}) \tag{3}$$

Here, $\phi$ denotes the offset prediction network constructed from linear layers, which generates two-dimensional coordinate offsets for each sampling point. $\mathrm{Sam}(F_{1}, L_{l})$ refers to bilinear interpolation of $F_{1}$ at the locations specified by $L_{l}$, which yields point features at continuous spatial coordinates.
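The selection and sampling steps of Eqs. (2) and (3) can be sketched for a single high-resolution position as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the offsets that the network ϕ would predict are passed in by hand, the base position is taken as the element-wise floor of l'/2, and the helper names `bilinear_sample` and `select_and_sample` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate an H x W x C feature map (nested lists) at (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid coordinate range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    channels = len(feat[0][0])
    return [
        (1 - dy) * (1 - dx) * feat[y0][x0][c]
        + (1 - dy) * dx * feat[y0][x1][c]
        + dy * (1 - dx) * feat[y1][x0][c]
        + dy * dx * feat[y1][x1][c]
        for c in range(channels)
    ]


def select_and_sample(feat, hi_pos, offsets):
    """Return the sampling points L_l and their features S_l for one position l'.

    offsets: s pairs (dy, dx) standing in for the output of the offset
             network (assumption: supplied by hand in this sketch).
    """
    base = (hi_pos[0] // 2, hi_pos[1] // 2)  # base position l
    points = [(base[0] + dy, base[1] + dx) for dy, dx in offsets]
    feats = [bilinear_sample(feat, py, px) for py, px in points]
    return points, feats


# Toy 2 x 2 low-resolution map with a single channel.
F1 = [[[0.0], [1.0]],
      [[2.0], [3.0]]]
points, feats = select_and_sample(F1, hi_pos=(1, 1), offsets=[(0.0, 0.0), (0.5, 0.5)])
print(points, feats)
```

Because the sampling coordinates are continuous, the second point falls between four grid cells and its feature is their bilinear blend.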

To further reduce computational cost and improve efficiency, we introduce a grouped upsampling strategy during the sampling point selection stage to optimize the sampling process. Specifically, the feature map is first divided into several groups along the channel dimension, and independent coordinate offsets are subsequently generated for each group. All channels within the same group share the same sampling point offset pattern. This approach effectively reduces computational complexity while preserving the feature representation capacity of the model.
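The saving from sharing offsets within a group can be made concrete with a small counting sketch. The channel count (256), group count (4), and number of sampling points (4) are illustrative assumptions, since the text does not state the values used, and both helper functions are hypothetical.

```python
def group_channels(num_channels, num_groups):
    """Split channel indices into equal contiguous groups."""
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def offset_values_per_position(num_groups, num_points):
    """Each group predicts num_points 2-D offsets shared by all of its channels."""
    return num_groups * num_points * 2


# Assumed sizes, for illustration only.
groups = group_channels(num_channels=256, num_groups=4)
shared = offset_values_per_position(num_groups=4, num_points=4)
per_channel = offset_values_per_position(num_groups=256, num_points=4)
print(len(groups), len(groups[0]), shared, per_channel)
```

With these assumed sizes, grouping reduces the number of predicted offset values per output position by a factor equal to the group size, while every channel still receives content-dependent sampling locations.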

After obtaining the sampling points, it is necessary to assign appropriate weights to them in order to quantify their similarity to the current high-resolution location. Specifically, encoder features are derived from the high-resolution layers of the feature extraction backbone, ResNet-50, which preserve rich spatial details. For each high-resolution position $l'$, the corresponding encoder feature point $q \in Q$ and decoder candidate points $p \in L_{l}$ are used to compute similarity, followed by a normalization step to generate the sampling point weight map. The weight computation process is defined as follows:

$$W = \mathrm{norm}(\psi(Q, L_{l})) \tag{4}$$

Here, $\psi$ denotes the similarity computation, and norm represents the normalization operation. Specifically, the similarity is computed as an inner product, and a softmax operation converts the similarities into weights for the sampling points. To ensure that the decoder candidate feature $p$ and the encoder feature $q$ are compared in a space of the same dimensionality, we introduce two learnable linear mappings, $M_{p}$ and $M_{q}$, which act on $p$ and $q$, respectively. The calculation formula is as follows:

$$w_{i} = \mathrm{softmax}(p^{T} M_{p}^{T} M_{q} q) \tag{5}$$

Here, $M_{p} \in \mathbb{R}^{d \times C}$ and $M_{q} \in \mathbb{R}^{d \times C'}$ are learnable parameters, and $d$ denotes the embedding dimension. The inner product measures the semantic similarity between the low-resolution candidate feature $p$ and the target high-resolution feature $q$. A higher similarity indicates that the candidate contributes more to the reconstruction at this high-resolution position. Through the softmax operation, all weights are constrained to be non-negative and sum to one, generating a set of normalized weights. This ensures that features with higher semantic relevance are assigned higher weights, while features with lower relevance have their weights reduced.
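The weight computation of Eq. (5) can be sketched in a few lines, using the identity that the bilinear form in Eq. (5) equals the inner product of the two projected features. This is a minimal illustration under stated assumptions: the projection matrices are fixed to the identity rather than learned, the toy features are invented, and the helper names `matvec` and `similarity_weights` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def similarity_weights(candidates, q, M_p, M_q):
    """Softmax over the inner products of projected candidates and projected q."""
    q_emb = matvec(M_q, q)
    scores = [
        sum(a * b for a, b in zip(matvec(M_p, p), q_emb))
        for p in candidates
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: identity projections stand in for the learned M_p and M_q.
identity = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # decoder candidate features
q = [2.0, 0.0]                                     # encoder feature at l'
weights = similarity_weights(candidates, q, identity, identity)
print(weights)
```

The two candidates aligned with the encoder feature receive equal, larger weights, and the orthogonal candidate is suppressed, while the weights remain non-negative and sum to one.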

During the feature fusion stage, the generated weights are used to perform a weighted summation of the selected sampling points, resulting in the upsampled feature point

x_{l^{'}}^{'}

:

x_{l^{'}}^{'} = \sum_{i = 1}^{s} w_{i} x_{i}

(6)

Here,

x_{i}

refers to the feature in the low-resolution feature map corresponding to the sampling point

L_{l}

.

w_{i}

is the weight computed by calculating the inner product similarity between the low-resolution features and the target high-resolution features, followed by the softmax operation. This weight reflects the similarity between the sampling point

L_{l}

and the target position l’. This process encourages the upsampled points to align with the structural patterns of the encoder features while suppressing noise. For instance, in boundary regions, the encoder features exhibit higher similarity to decoder points belonging to the correct semantic cluster, resulting in larger assigned weights and thus enhancing segmentation performance at object boundaries.
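As an illustration of Equations (5) and (6), the following minimal sketch computes the softmax weights and the fused feature for a single high-resolution position. It is not the authors' implementation: the function name upsample_point, the toy features, and the identity projections standing in for the learned mappings are all assumptions made for this example.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def upsample_point(q, candidates, m_p, m_q):
    """Return (weights, fused feature) for one high-resolution position.

    q          : encoder feature at the target position, length C'
    candidates : s decoder features x_i at the sampled points, each of length C
    m_p, m_q   : projection matrices of shape d x C and d x C' (hypothetical values)
    """
    q_emb = matvec(m_q, q)
    # Inner product (M_p p)^T (M_q q) for every candidate p.
    scores = [sum(a * b for a, b in zip(matvec(m_p, p), q_emb)) for p in candidates]
    weights = softmax(scores)  # Equation (5)
    channels = len(candidates[0])
    # Weighted sum of the candidate features, Equation (6).
    fused = [sum(w * p[c] for w, p in zip(weights, candidates)) for c in range(channels)]
    return weights, fused

# Toy example: C = C' = d = 2, identity projections, s = 3 candidates.
identity = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
candidates = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
weights, fused = upsample_point(q, candidates, identity, identity)
```

In this toy case the first candidate points in the same direction as q, so it receives the largest weight and dominates the fused feature.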

Meanwhile, to enhance the capacity for spatial feature extraction, we adopt an efficient feature fusion strategy. Specifically, we first employ a bottleneck structure, utilizing convolution operations to reduce the number of channels in the fused feature map to C/n, which effectively decreases computational complexity and improves network training efficiency. In addition, to further strengthen spatial feature extraction, we perform global average pooling independently along the height and width dimensions, producing feature maps of size

C / n \times 2 H \times 1

and

C / n \times 1 \times 2 W

, respectively. These maps serve as encoded representations for each spatial direction. The resulting feature maps are then transformed and concatenated to integrate information from both the height and width dimensions, which lets the network capture global context along both directions and improves its attention to spatial structures. We then apply batch normalization and a nonlinear activation function to increase the representational power of the model. After feature fusion, the feature map is split back along the height and width dimensions, and a sigmoid activation function adaptively generates weights for each dimension, allowing the network to assign appropriate importance to features in the vertical and horizontal directions while suppressing irrelevant regions. The weighted features are then passed through a convolutional layer to restore the original channel dimensionality. Finally, we introduce a residual connection that adds the module’s input to the fused and weighted features via element-wise addition. This residual structure mitigates the vanishing gradient problem in deep networks, facilitates feature propagation, and improves the stability of the network. The formulas are as follows:

X_{1} = {Conv}_{1 \times 1} (X)

(7)

z_{h} = \frac{1}{2 W} \sum_{j = 1}^{2 W} X_{1} (c, i, j), z_{h} \in R^{C / n \times 2 H \times 1}

(8)

z_{w} = \frac{1}{2 H} \sum_{i = 1}^{2 H} X_{1} (c, i, j), z_{w} \in R^{C / n \times 1 \times 2 W}

(9)

f = C B H (concat [z_{h}, z_{w}])

(10)

g_{h} = σ ({Conv}_{1 \times 1}^{2 h} (f_{h}))

(11)

g_{w} = σ ({Conv}_{1 \times 1}^{2 w} (f_{w}))

(12)

F_{o} = {Conv}_{1 \times 1} (X_{1} \cdot g_{h} \cdot g_{w}) + X

(13)

Here, X denotes the upsampled feature map, while

X_{1}

represents the feature map after channel dimension reduction.

z_{h}

and

z_{w}

correspond to the features obtained by global average pooling along the height and width dimensions, respectively. CBH refers to the combination of convolution, batch normalization, and the h-swish activation function.

g_{h}

and

g_{w}

denote the attention weights along the height and width directions, respectively.

σ

denotes the Sigmoid function.

F_{o}

represents the output feature of the DCAU module.
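The directional pooling and gating in Equations (8) to (13) can be sketched for a single channel as follows. This is a simplified illustration under strong assumptions: the 1 × 1 convolutions, the CBH block, and the channel reduction are all replaced by the identity, and the function name directional_gate is hypothetical, so the sketch only shows how the row and column statistics turn into multiplicative gates with a residual connection.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def directional_gate(x):
    """Apply height/width gating to one channel given as a list of rows (H x W).

    All learned layers are omitted (treated as identity) in this sketch.
    """
    height, width = len(x), len(x[0])
    # Equation (8): average over the width, one value per row.
    z_h = [sum(row) / width for row in x]
    # Equation (9): average over the height, one value per column.
    z_w = [sum(x[i][j] for i in range(height)) / height for j in range(width)]
    # Equations (11) and (12) with the convolutions omitted.
    g_h = [sigmoid(v) for v in z_h]
    g_w = [sigmoid(v) for v in z_w]
    # Equation (13): gate every position, then add the residual input.
    out = [[x[i][j] * g_h[i] * g_w[j] + x[i][j] for j in range(width)]
           for i in range(height)]
    return z_h, z_w, out

x = [[1.0, 3.0], [5.0, 7.0]]
z_h, z_w, out = directional_gate(x)
```

Because both gates lie in (0, 1), each positive input value is increased by less than its own magnitude, and rows or columns with larger mean activation are attenuated less.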

3.3. Spatial-Frequency Attention Module

During the information transmission process in deep neural networks, feature information loss is inevitable. This issue becomes particularly prominent as network depth increases, where low-level detail features are progressively weakened across layers, resulting in reduced capacity for representing fine-grained structures and textures. To address this problem, we propose the SFAM module, which is illustrated in Figure 3. This module jointly models spatial and frequency domain features, enabling the model to capture global information and texture details from multiple dimensions, significantly enhancing its ability to extract features from multi-scale objects. While spatial-frequency techniques have been applied in image denoising and classification, their role and objectives in panoptic segmentation differ notably from those in these two tasks. In image denoising, the frequency domain is primarily used to suppress high-frequency noise and remove interference signals, aiming to restore the image’s clear structure. In object classification, the focus is on extracting global features in the spatial domain to determine the object’s class, typically emphasizing overall shape recognition without considering the boundary information of object instances. In contrast, SFAM’s core task in panoptic segmentation is pixel-level object segmentation, especially in complex backgrounds. By combining spatial and frequency domain features, the SFAM module not only captures global structural information but also enhances the representation of boundaries and details, achieving high-precision multi-scale ship segmentation and object boundary extraction.

The SFAM module first employs a bottleneck structure, using a 1 × 1 convolution to reduce the dimensionality of the input features, thereby decreasing computational cost and improving efficiency. Next, the SFAM module adopts a parallel-branch design to perform feature fusion in both the spatial and frequency dimensions. In the spatial dimension, we design three depthwise separable convolutions with different kernel sizes (3 × 3, 5 × 5, and 7 × 7) to extract feature information at multiple spatial scales. This approach facilitates the capture of fine-grained details across multiple scales and enhances the model’s ability to perceive targets. After processing with a nonlinear activation function, the model’s feature representation capacity is further enhanced. Each scale-specific path incorporates a linear attention mechanism [46] to replace the computationally expensive conventional self-attention mechanism. The linear attention mechanism [46] significantly reduces computational complexity while maintaining the ability to model long-range spatial dependencies. Specifically, queries (Q) and keys (K) are generated from the input features via linear projection and are subsequently activated by the ELU function to ensure non-negativity, thereby improving numerical stability and discriminability and serving as an alternative to the traditional softmax weighting. Unlike standard self-attention, we apply rotary position encoding (RoPE) [47] to Q and K to introduce relative positional information, which enhances the network’s capacity to model spatial structures without additional learnable parameters. Furthermore, to avoid the exhaustive pairwise matching of all Q-K pairs in standard dot-product attention, the linear attention mechanism [46] adopts a scaling strategy based on the mean of the keys, constructing a kernel function-based attention factor. 
This approach requires only a single average computation of K and a weighted multiplication to complete the entire attention process, which greatly reduces computational complexity and improves efficiency. The features from the three branches, each processed by the linear attention mechanism, are then concatenated and passed through a channel adjustment for weighted fusion to produce feature

F_{1}

, further enhancing the spatial feature representation. Their formulas are as follows:

ϕ (x) = ELU (x) + 1

(14)

Q_{i}^{rope} = RoPE (ϕ (W_{Q} X_{i}))

(15)

K_{i}^{rope} = RoPE (ϕ (W_{K} X_{i}))

(16)

V_{i} = X_{i}

(17)

z_{i} = \frac{1}{Q_{i} \cdot mean {(K_{i})}^{⊤} + ε}

(18)

L_{a} = Q_{i}^{rope} \cdot {(K_{i}^{rope})}^{⊤} V_{i} \cdot z_{i}

(19)

A_{i} = L_{a} (R (C_{3} D_{n} S (L N (C_{1} (F_{i}))))), n = 3, 5, 7

(20)

A = cat (A_{3}, A_{5}, A_{7})

(21)

F_{1} = C_{1} (A) * σ (C_{1} B (C_{1} (F_{i}))) + C_{1} (F_{i})

(22)

Here, the function

ϕ (x)

represents a non-negative mapping through the ELU activation function.

W_{Q}

and

W_{K}

are the linear transformation matrices used to generate the Query and Key, respectively.

X_{i}

denotes the input feature at the i-th position. RoPE [47] refers to the Rotary Position Embedding, which introduces relative positional information into the features.

Q_{i}^{rope}

and

K_{i}^{rope}

are the Query and Key after RoPE [47] encoding, respectively. “mean” (

K_{i}

) refers to the mean of all Keys across positions, and

z_{i}

is the scaling factor for the attention output.

L_{a}

represents the linear attention [46] operation, and R refers to the reshape operation.

C_{1}

represents a 1 × 1 convolution, while

D_{n}

represents an n × n depthwise separable convolution, where n is 3, 5, or 7. S represents the SiLU activation function, and “LN” represents layer normalization. “cat” refers to feature concatenation, and

A_{i}

represents the attention outputs from different scale branches. “BS” represents the batch normalization and activation function, and the symbol * denotes element-wise multiplication.
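To make the cost argument behind Equations (14), (18), and (19) concrete, the sketch below evaluates the attention output in two orders: the linear order Q (K^T V), which never forms an N × N matrix, and the quadratic order (Q K^T) V. It is a simplified illustration rather than the authors' implementation: RoPE and the learned projections are omitted, the inputs are toy values, and the function names are assumptions introduced here.

```python
import math

def phi(x):
    """Equation (14): ELU(x) + 1, which is strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scaling(q, k, eps):
    """Equation (18): one scalar per query from the mean of the keys."""
    k_mean = [sum(col) / len(k) for col in zip(*k)]
    return [1.0 / (sum(a * b for a, b in zip(row, k_mean)) + eps) for row in q]

def linear_attention(q_raw, k_raw, v, eps=1e-6):
    """Compute Q (K^T V) * z; the d x c matrix K^T V is built only once."""
    q = [[phi(x) for x in row] for row in q_raw]
    k = [[phi(x) for x in row] for row in k_raw]
    z = scaling(q, k, eps)
    kv = matmul(transpose(k), v)
    out = matmul(q, kv)
    return [[val * z[i] for val in row] for i, row in enumerate(out)]

def quadratic_attention(q_raw, k_raw, v, eps=1e-6):
    """Reference order (Q K^T) V * z, which forms the full N x N matrix."""
    q = [[phi(x) for x in row] for row in q_raw]
    k = [[phi(x) for x in row] for row in k_raw]
    z = scaling(q, k, eps)
    out = matmul(matmul(q, transpose(k)), v)
    return [[val * z[i] for val in row] for i, row in enumerate(out)]

# Toy inputs: N = 3 positions, d = 2 query/key channels, c = 2 value channels.
q_raw = [[0.5, -1.0], [1.0, 2.0], [-0.3, 0.1]]
k_raw = [[0.2, 0.4], [-1.5, 0.3], [0.7, -0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fast = linear_attention(q_raw, k_raw, v)
slow = quadratic_attention(q_raw, k_raw, v)
```

Both orders give the same result by associativity of matrix multiplication; the linear order scales with N · d · c instead of N² · d, which is the source of the efficiency gain described above.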

In the frequency domain, we first apply a two-dimensional Fast Fourier Transform (FFT) [48] to the input features

F_{i}

, converting them from the spatial domain into a frequency domain representation, denoted as Tepx. Tepx is a complex tensor that contains both real and imaginary components. We perform convolution, normalization, and ReLU activation on the real part to enhance its nonlinear representational capacity. Subsequently, a sigmoid function is used to generate frequency domain weights. These weights are then multiplied by the original complex spectrum to extract the most salient frequency domain features. This process not only increases the model’s sensitivity to frequency information but also effectively enhances its ability to capture fine-grained feature details. Next, we perform an inverse Fast Fourier Transform (IFFT) [49] on the modulated frequency domain features to convert them back to the spatial domain. Finally, we compute the magnitude, normalize, and apply a nonlinear activation function to obtain the enhanced frequency domain feature

F_{2}

. Its formula is as follows:

F_{2} = RB (ifft (σ (C_{1} (CBR (fft_{real} (C_{1} (F_{i}))))) * fft (C_{1} (F_{i}))))

(23)

Here,

F_{i}

represents the input feature, and

fft (\cdot)

and

ifft (\cdot)

represent the Fast Fourier Transform and the Inverse Fast Fourier Transform, respectively. “CBR” refers to the combination of convolution, batch normalization, and the ReLU activation function.
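The frequency branch in Equation (23) can be illustrated on a tiny single-channel map. The sketch below is only a schematic under stated assumptions: a naive 2D discrete Fourier transform stands in for the FFT, the convolution, normalization, and activation layers (C_1, CBR, RB) are omitted, and the gate is produced directly from the real part of each spectrum coefficient. The names dft2 and frequency_branch are hypothetical.

```python
import cmath
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of an H x W grid (for tiny inputs only)."""
    h, w = len(x), len(x[0])
    sign = 1.0 if inverse else -1.0
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for i in range(h):
                for j in range(w):
                    angle = sign * 2.0 * math.pi * (u * i / h + v * j / w)
                    total += x[i][j] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out

def frequency_branch(x, gate_fn):
    """Gate the complex spectrum with real weights and return the magnitude."""
    spectrum = dft2(x)
    # Real-valued weights computed from the real part of the spectrum.
    gate = [[gate_fn(c.real) for c in row] for row in spectrum]
    # Multiply the weights with the original complex spectrum.
    modulated = [[g * c for g, c in zip(g_row, s_row)]
                 for g_row, s_row in zip(gate, spectrum)]
    restored = dft2(modulated, inverse=True)
    return [[abs(c) for c in row] for row in restored]

feature = [[1.0, 2.0], [3.0, 4.0]]
passthrough = frequency_branch(feature, lambda r: 1.0)  # gate of ones
gated = frequency_branch(feature, sigmoid)              # sigmoid gate
```

With a gate of ones the branch reproduces the input, which checks the transform pair; with the sigmoid gate every frequency component is scaled by a factor in (0, 1), so the output carries less energy than the input.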

The spatial and frequency domain features are fused via element-wise addition, which effectively enhances the model’s ability to capture multi-scale feature information and texture details. Furthermore, to reinforce information integrity during feature propagation, we introduce a residual structure that directly adds the input features

F_{i}

to the fused output, resulting in the feature

F_{o}

. This strategy not only alleviates potential gradient vanishing and semantic dilution issues in deep networks but also improves the stability and continuity of feature representation. Through the dual-path fusion and residual design described above, our approach significantly strengthens both global context modeling and local perceptual capability of the feature representations. Its formula is as follows:

F_{o} = C_{1} (C_{3} (F_{1}) + F_{2}) + F_{i}

(24)

3.4. Ship Panoptic Segmentation Dataset (SPSD)

Currently, in the field of maritime panoptic segmentation, there is no publicly available dataset that offers both sufficient data volume and a diverse range of ship categories. To address this limitation, we construct a dedicated panoptic segmentation dataset that includes multiple scenes and multiple ship categories. The dataset covers diverse aquatic environments such as offshore waters, inland rivers, and ports, and contains a variety of fine-grained ship classes, making it more suitable for real-world ship applications than existing public datasets. In terms of data collection, 60% of the images are obtained from online sources, while the remaining 40% are captured from real-world scenes. To apply the ship dataset to panoptic segmentation tasks, we manually annotate the samples using the LabelMe tool. Panoptic segmentation requires labeling every pixel in an image, including both background regions and precise segmentation of each object instance by determining the number, category, and mask of every target. To avoid overlap or omission of pixels at the boundaries between object instances and background regions, we adopt a layered annotation strategy. Specifically, background regions such as sea, sky, and land are annotated first, during which unannotated foreground objects may be included. After completing the background annotation, each ship instance is then precisely labeled. This annotation process is highly labor-intensive and required approximately 1000 h to complete. Finally, all annotated images are converted into the COCO panoptic standard format.

In this study, all ship images are carefully annotated and categorized. Based on the characteristics of the primary ship types present in each image, the dataset is ultimately divided into eight representative categories. These categories include common civilian vessels such as sail boat and speed boat, as well as specialized ship types such as warship and rescue boat. This diverse categorization scheme effectively meets the application requirements of various maritime scenarios. In addition, to support comprehensive panoptic segmentation, the background environment is broadly classified into three typical categories: sea, sky, and land. For better visualization and understanding, representative examples of the dataset are illustrated in Figure 4. Specifically, the first through eighth rows correspond to the eight ship categories: cargo ship, warship, passenger ship, cruise, speed boat, sail boat, small boat, and rescue boat.

3.5. Loss Function

Our algorithm employs the same loss functions as PEM [26], including the weighted binary cross-entropy (WBCE) loss and the Intersection over Union (IoU) loss.

The binary cross-entropy loss is widely used in binary classification tasks. For tasks with imbalanced class distributions, the weighted binary cross-entropy (WBCE) loss function can effectively enhance the model’s ability to identify minority class samples. Unlike the standard binary cross-entropy loss, the weighted BCE incorporates class weights into the loss calculation, thereby adjusting the contribution of different classes to the overall loss. The computation is as follows:

L_{WBCE} = - \frac{1}{N} \sum_{i = 1}^{N} [w_{1} y_{i} log (p_{i}) + w_{0} (1 - y_{i}) log (1 - p_{i})]

(25)

Here,

y_{i}

denotes the ground truth label,

p_{i}

represents the predicted probability, and N is the total number of samples.

w_{0}

represents the weight for the background class (stuff), and

w_{1}

represents the weight for the foreground class (things).

w_{0}

and

w_{1}

adjust the contributions of the background and foreground classes to the loss function, respectively. By appropriately setting the class weights, the impact of class imbalance can be effectively mitigated, improving the overall performance of the panoptic segmentation model. Since panoptic segmentation is a pixel-level segmentation task, the background class typically occupies a large portion of the image, resulting in a significantly larger number of sample pixels than the foreground class. To ensure the model focuses more on the foreground class during training, we use the inverse of the pixel count for both the background (stuff) and foreground (things) classes, as shown in the formula below:

{\hat{ω}}_{0} = \frac{1}{N_{0}}

(26)

{\hat{ω}}_{1} = \frac{1}{N_{1}}

(27)

Here,

N_{0}

represents the total number of background class pixels in the entire training set, and

N_{1}

represents the total number of foreground class pixels in the entire training set.

Additionally, to prevent the weight values from becoming excessively small, we normalize these weights. The normalized weights ω_{0} and ω_{1} take the roles of w_{0} and w_{1} in Equation (25). The formula for calculating the normalized weights is as follows:

ω_{0} = \frac{{\hat{ω}}_{0}}{{\hat{ω}}_{0} + {\hat{ω}}_{1}}

(28)

ω_{1} = \frac{{\hat{ω}}_{1}}{{\hat{ω}}_{0} + {\hat{ω}}_{1}}

(29)

This normalization method ensures that the contributions of the foreground and background classes to the loss function are more balanced.
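A small numerical sketch of Equations (25) to (29) is given below. The pixel counts (900 background, 100 foreground) and the function names are hypothetical values chosen for illustration, and the clipping constant eps is an assumption added to avoid taking the logarithm of zero.

```python
import math

def class_weights(n_background, n_foreground):
    """Inverse pixel counts, Equations (26)-(27), normalized as in (28)-(29)."""
    raw0 = 1.0 / n_background
    raw1 = 1.0 / n_foreground
    total = raw0 + raw1
    return raw0 / total, raw1 / total

def weighted_bce(labels, probs, w0, w1, eps=1e-12):
    """Weighted binary cross-entropy, Equation (25), over a list of pixels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += w1 * y * math.log(p) + w0 * (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical training set with nine times more background than foreground pixels.
w0, w1 = class_weights(n_background=900, n_foreground=100)
loss = weighted_bce(labels=[1, 0], probs=[0.5, 0.5], w0=w0, w1=w1)
```

With these counts the normalized weights are 0.1 for the background and 0.9 for the foreground, so an equally confident error on a foreground pixel contributes nine times more to the loss than one on a background pixel.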

The IoU loss function is commonly used in image segmentation tasks to measure the degree of overlap between the predicted region and the ground truth region. The IoU metric is defined as the ratio of the intersection to the union of the predicted and ground truth regions. The calculation is as follows:

IoU = \frac{| P \cap G |}{| P \cup G |}

(30)

Here, P and G represent the predicted region and the ground truth region, respectively. The corresponding IoU loss function is defined as follows:

L_{IoU} = 1 - IoU

(31)

We combine these two loss functions as the total loss for the model:

L_{all} = L_{WBCE} + L_{IoU}

(32)
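Equations (30) to (32) can be illustrated on hard binary masks as follows. This is a sketch only: training would normally require a differentiable formulation on predicted probabilities, which the text does not specify, and returning an IoU of 1 when both masks are empty is a convention assumed here. The function names are hypothetical.

```python
def iou(pred, target):
    """Equation (30) on flat binary masks (sequences of 0/1 values)."""
    inter = sum(1 for p, g in zip(pred, target) if p and g)
    union = sum(1 for p, g in zip(pred, target) if p or g)
    return inter / union if union else 1.0  # assumed convention for empty masks

def iou_loss(pred, target):
    """Equation (31)."""
    return 1.0 - iou(pred, target)

def total_loss(wbce_value, pred, target):
    """Equation (32): unweighted sum of the two loss terms."""
    return wbce_value + iou_loss(pred, target)

pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
overlap = iou(pred, target)             # one shared pixel out of three covered
combined = total_loss(0.25, pred, target)
```

Here the masks share one pixel and cover three in total, so the IoU is 1/3 and the IoU loss is 2/3; the value 0.25 stands in for an already computed WBCE term.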

4. Experiments

In this section, we first present the implementation details, datasets, and evaluation metrics of our experiments. We then compare the performance of our method with advanced models in terms of accuracy and other key evaluation metrics. The experimental results demonstrate the superiority of the FA PEM model. Furthermore, to gain deeper insights into the effectiveness of each component within FA PEM, we conduct comprehensive ablation studies on the aforementioned dataset. These experiments provide a more comprehensive understanding of the key factors that contribute to the enhanced performance of FA PEM.

4.1. Implementation Details and Datasets

Training: All experiments are conducted using the PyTorch 1.12 deep learning framework with CUDA version 11.6 and Python 3.10. The operating system is Ubuntu 20.04 LTS, running on a 12th Gen Intel(R) Core™ i9-12900K ×24 processor (Intel Corporation, Santa Clara, CA, USA) and equipped with two NVIDIA GeForce RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). Due to the high resolution of Cityscapes [27] images and to accurately evaluate model performance, we resize the input images from the Cityscapes dataset [27] to [1024, 2048]. For our custom ship panoptic segmentation dataset SPSD, the input image size is set to [800, 1333]. The number of training iterations is set to 90k. We use the AdamW optimizer, with an initial learning rate of 0.0007, a batch size of 32, and a weight decay of 0.05. Other parameters are kept consistent with the official PEM [26] configuration.

Datasets: Our experiments are conducted primarily on the Cityscapes dataset [27] and our self-constructed Ship Panoptic Segmentation Dataset (SPSD). Cityscapes [27] is a large-scale dataset specifically designed for visual understanding tasks in urban street scenes. It contains a total of 3475 images, with 2975 images for training and 500 for validation. The SPSD ship dataset comprises 4360 images, including 3484 images for training and 876 for validation.

4.2. Evaluation Metrics

In panoptic segmentation tasks, traditional semantic segmentation and instance segmentation evaluation metrics exhibit clear limitations. The mean Intersection over Union (mIoU) metric for semantic segmentation is suitable only for evaluating pixel-wise classification regions and does not account for instance detection and segmentation. Conversely, the Average Precision (AP) metric for instance segmentation applies only to object instances. Neither metric can comprehensively measure the prediction performance of panoptic segmentation. To address this issue, Kirillov et al. [22] introduced a dedicated evaluation metric for panoptic segmentation, namely Panoptic Quality (PQ), when they proposed the panoptic segmentation task.

The Panoptic Quality (PQ) metric is designed to quantify the degree of correspondence between predicted results and ground truth annotations. Specifically, Panoptic Quality for Things (PQ_th) and Panoptic Quality for Stuff (PQ_st) represent the PQ scores computed exclusively for the “thing” and “stuff” categories, respectively. The formulas are defined as follows:

PQ = \frac{\sum_{(p, g) \in TP} IoU (p, g)}{| TP | + \frac{1}{2} | FP | + \frac{1}{2} | FN |}

(33)

PQ_{th} = \frac{\sum_{(p, g) \in TP_{thing}} IoU (p, g)}{| TP_{thing} | + \frac{1}{2} | FP_{thing} | + \frac{1}{2} | FN_{thing} |}

(34)

PQ_{st} = \frac{\sum_{(p, g) \in TP_{stuff}} IoU (p, g)}{| TP_{stuff} | + \frac{1}{2} | FP_{stuff} | + \frac{1}{2} | FN_{stuff} |}

(35)

In these formulas, p denotes the predicted segmentation results, and g represents the ground truth. The

\sum_{(p, g) \in TP} IoU (p, g)

term is the sum of the Intersection over Union values over all matched pairs of predictions and their corresponding ground truth segments. In the denominator, false positives (FP) and false negatives (FN) penalize predicted regions that do not correspond to any ground truth instance, as well as ground-truth regions that are missed by the predictions.
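A worked example of Equation (33) is sketched below. The matched IoU values and the error counts are hypothetical, and the matching step itself (pairing predictions with ground truth segments) is assumed to have been done beforehand. The sketch also shows the well-known factorization of PQ into segmentation quality (the mean IoU of matched pairs) and recognition quality (an F1-style detection score).

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Equation (33) for one set of matched IoUs and unmatched segment counts."""
    num_tp = len(matched_ious)
    denom = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # assumed convention when there are no segments at all
    return sum(matched_ious) / denom

# Hypothetical image: three matched segments, one spurious prediction,
# and one missed ground truth segment.
ious = [0.9, 0.8, 0.7]
pq = panoptic_quality(ious, num_fp=1, num_fn=1)

# Equivalent factorization: PQ = SQ * RQ.
sq = sum(ious) / len(ious)                    # mean IoU of matched pairs
rq = len(ious) / (len(ious) + 0.5 * 1 + 0.5 * 1)
```

In this example the summed IoU is 2.4 and the denominator is 4, giving PQ = 0.6, which equals SQ × RQ = 0.8 × 0.75. Restricting the inputs to thing or stuff classes yields PQ_th and PQ_st in the same way.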

4.3. Performance Comparison

In this section, we compare our method with contemporary advanced panoptic segmentation algorithms on the Cityscapes dataset [27].

As shown in Table 1, we compare several recent panoptic segmentation algorithms on the Cityscapes dataset [27]. The results demonstrate that the proposed FA PEM algorithm achieves strong performance on Cityscapes [27], obtaining the highest scores for PQ, PQ_th, and PQ_st, which are 63.1%, 56.4%, and 68.0%, respectively. Compared to the baseline PEM [26] model, FA PEM improves PQ by 2.0%, PQ_th by 2.1%, and PQ_st by 1.9%. This indicates that our proposed improvements not only enhance the segmentation performance for instance targets (things classes), particularly at boundaries, but also yield more accurate segmentation in challenging background (stuff classes) regions. FA PEM achieves a balance between accuracy and robustness in panoptic segmentation, enabling effective segmentation across different object categories and improving the overall segmentation accuracy of the model.

Furthermore, compared with other recent advanced algorithms, our model demonstrates a significant performance advantage. For the PQ metric, FA PEM surpasses PanopticFPN [22] (+5.4%), PanopticFCN [50] (+1.7%), UPSNet [38] (+3.8%), EfficientPS [40] (+2.8%), YOSO [23] (+3.4%), and FPSNet [51] (+8.0%). Similarly, for PQ_th, FA PEM surpasses PanopticFPN [22] (+4.8%), BiSeNetFormer [24] (+4.2%), RealTimePan [52] (+4.3%), ChaInNet [53] (+1.4%), and MFNet [55] (+0.5%). For PQ_st, FA PEM surpasses UPSNet [38] (+5.3%), YOSO [23] (+1.9%), SE-PSNet [56] (+5.1%), CCPSNet [57] (+4.9%), RT-K-Net [58] (+1.5%), and RT-YOSO [25] (+0.7%). Notably, FA PEM also surpasses panoptic segmentation models specifically designed for maritime scenarios. As shown in Table 1, FA PEM improves PQ by 0.7% compared with PanSR [21]. This improvement further confirms the robustness and effectiveness of FA PEM in jointly segmenting foreground instance objects and background regions under varying scene conditions.

As shown in Table 2, we present a comprehensive comparison of recent panoptic segmentation methods on the SPSD ship dataset. The proposed FA PEM model improves PQ by 2.1% over the baseline PEM, with corresponding gains of 2.0% in PQ_th and 2.3% in PQ_st. These improvements demonstrate a clear enhancement in overall panoptic segmentation performance. Moreover, FA PEM consistently outperforms other advanced panoptic segmentation methods. Specifically, in terms of PQ, FA PEM outperforms PanopticFPN [22] (+2.5%), PanopticFCN [50] (+3.1%), Panoptic-DeepLab [64] (+2.0%), and YOSO [23] (+2.2%). For PQ_th, FA PEM achieves higher scores than Panoptic-DeepLab [64] (+1.8%), PanopticFCN [50] (+3.1%), Mask2Former [45] (+1.6%), MaskFormer [65] (+4.1%), and YOSO [23] (+3.4%). For PQ_st, FA PEM outperforms Panoptic-DeepLab [64] (+2.2%), PanopticFPN [22] (+4.3%), and PanopticFCN [50] (+3.1%). Notably, we also reproduced the PanopticUAV model, which is specifically designed for marine scenarios. Compared to PanopticUAV, FA PEM improves PQ, PQ_th, and PQ_st by 1.7%, 1.7%, and 1.5%, respectively. These results demonstrate that FA PEM achieves superior panoptic segmentation performance in marine environments. Overall, the experimental results validate the effectiveness of the proposed FA PEM framework. Compared with recent panoptic segmentation approaches, FA PEM shows stronger capability in handling the complex challenges of maritime scenarios, highlighting its potential for intelligent vessel navigation and maritime monitoring applications. Meanwhile, we also observe that there is still room for improvement for FA PEM on the SPSD dataset as a whole, indicating that our multi-class, multi-scenario ship panoptic segmentation dataset is both challenging and highly effective for evaluating panoptic segmentation algorithms.

Meanwhile, as shown in Table 3, we conduct a comparative analysis of the computational complexity of recent panoptic segmentation algorithms on the Cityscapes dataset. The experimental results demonstrate that the computational complexity of our FA PEM model is 226 GFLOPs, which is 11 GFLOPs lower than the baseline PEM model. This indicates that the proposed FA PEM model not only avoids increasing computational overhead but also achieves effective complexity optimization. Furthermore, compared to other recent representative methods, FA PEM also shows a significant advantage in computational complexity. For instance, the computational complexity of FA PEM is lower than that of MP-Former [61] (−293 GFLOPs), Mask DINO [60] (−417 GFLOPs), Re-Maskformer [62] (−348 GFLOPs), UISE [63] (−63 GFLOPs), MR-Mask2Former (−296 GFLOPs), UPSNet [38] (−261 GFLOPs), Seamless [66] (−288 GFLOPs), LEPSNet [67] (−225 GFLOPs), Panoptic-DeepLab [64] (−321 GFLOPs), Mask2Former [45] (−293 GFLOPs), and YOSO [23] (−39 GFLOPs). In summary, FA PEM not only achieves optimal performance in panoptic segmentation accuracy but also delivers significant improvements in computational complexity, further confirming its good balance between performance and efficiency.

4.4. Ablation Experiment

In this subsection, we conduct ablation studies on the DCAU and SFAM modules to verify their effectiveness and evaluate their contributions to our panoptic segmentation algorithm. All ablation experiments are conducted on the Cityscapes [27] and SPSD datasets, using parameter settings consistent with those used in the aforementioned comparative experiments. The results are presented in detail in the tables and subsequently discussed comprehensively.

4.4.1. Ablation Experiment on Cityscapes [27]

Our ablation experiments on the Cityscapes dataset [27], as shown in Table 4, demonstrate the positive impact of the proposed modules on panoptic segmentation performance. Using ResNet-50 as the backbone network, we incrementally integrate the DCAU and SFAM modules into the model. When the DCAU module is added, PQ increases from 61.1% to 61.9%, while PQ_th and PQ_st improve by 0.9% and 0.7%, respectively. This result shows that the DCAU module, through dynamic sampling point selection and adaptive weighting, effectively enhances the model’s ability to preserve semantic information at boundaries and alleviates boundary blurring in traditional upsampling approaches. In addition, the grouped upsampling and feature fusion strategies further reduce computational complexity and strengthen spatial structure understanding, resulting in improved segmentation accuracy. When the SFAM module is added, PQ improves from 61.1% to 62.2%, with increases of 1.2% in foreground panoptic quality and 1.0% in background panoptic quality. This indicates that SFAM effectively extracts multi-scale features via depthwise separable convolutions and efficiently models spatial dependencies using a linear attention mechanism [46]. In the frequency domain, SFAM captures both low and high-frequency information, thereby increasing the model’s sensitivity to image details. The fusion of spatial and frequency features provides notable gains in both boundary and complex background segmentation. When both DCAU and SFAM are integrated, PQ rises from 61.1% to 63.1%, with corresponding increases of 2.1% and 1.9% for PQ_th and PQ_st, indicating a substantial improvement in overall panoptic segmentation accuracy. These results strongly confirm the effectiveness of the DCAU and SFAM modules and show that their combination significantly improves the overall performance of the model for multi-scale target panoptic segmentation.

4.4.2. Ablation Experiment on SPSD

Our ablation experiments on the SPSD dataset, as shown in Table 5, demonstrate that each proposed module has a positive effect on panoptic segmentation performance for ship targets. Using ResNet-50 as the backbone network, we progressively integrate the DCAU and SFAM modules into the model. The results show that the PQ metric increases by 0.7% and 1.3%, respectively. Specifically, with the addition of the DCAU module, PQ increases from 60.2% to 60.9%, while PQ_th and PQ_st improve by 0.8% and 0.3%, respectively. The DCAU module employs a content-adaptive dynamic sampling point selection mechanism, allowing the model to flexibly determine upsampling regions based on local features and structures, thereby achieving better alignment with object boundaries and improving boundary discrimination accuracy. Additionally, the introduction of grouped upsampling and feature fusion strategies significantly reduces computational cost, improves overall inference efficiency, and enhances the model’s ability to capture spatial relationships. When the SFAM module is added, PQ increases from 60.2% to 61.5%, with improvements of 1.5% in PQ_th and 0.5% in PQ_st, leading to a substantial gain in panoptic segmentation accuracy. By jointly modeling spatial and frequency domain features, the SFAM module effectively enhances the model’s ability to recognize multi-scale object structures and fine-grained details. In the spatial domain, the use of multiple receptive fields enables the model to capture fine-grained information at different scales, and the efficient linear attention mechanism [46] allows for modeling long-range pixel dependencies. In the frequency domain, the SFAM module leverages Fourier transforms to integrate low and high frequency feature information, preserving the micro-level details of complex edges and textures in ship targets. 
The fusion of spatial and frequency features significantly improves segmentation accuracy in boundary delineation and under challenging background conditions. When both DCAU and SFAM are incorporated, PQ rises from 60.2% to 62.3%, with PQ_th and PQ_st increasing by 2.0% and 2.3%, respectively, resulting in a notable improvement in overall panoptic segmentation accuracy. These findings fully demonstrate the complementary and collaborative effects of the DCAU and SFAM modules, which together substantially enhance the panoptic segmentation capability of FA PEM for ship targets.

Table 6 reports an ablation study of the computational complexity of the proposed FA PEM model on the SPSD dataset. When only the DCAU module is added to the baseline PEM model, the computational complexity decreases by 3.8 GFLOPs (from 103.1 to 99.3 GFLOPs), showing that DCAU reduces the overall computational cost. This reduction stems primarily from two aspects. First, during the sampling point selection stage, DCAU adopts a grouped upsampling mechanism in which each group shares a common set of sampling offset patterns. This intra-group sharing removes channel-wise redundant computation in offset prediction and sampling, lowering computational overhead while preserving representational capacity. Second, during the feature fusion stage, DCAU introduces a bottleneck structure that compresses feature dimensions, which further decreases the overall FLOPs. When the DCAU and SFAM modules are introduced jointly to construct the FA PEM model, the computational complexity remains nearly unchanged relative to the DCAU-only variant (99.4 versus 99.3 GFLOPs). This efficiency stems from the lightweight design of the SFAM module: it replaces standard convolution with depthwise separable convolution, employs a bottleneck structure to compress feature dimensions, and adopts a linear attention mechanism instead of conventional self-attention. Overall, compared with the baseline PEM model, FA PEM improves panoptic segmentation performance while reducing computational complexity by 3.7 GFLOPs. These results demonstrate that combining the DCAU and SFAM modules enhances panoptic segmentation performance while also lowering computational cost.
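As a rough illustration of why depthwise separable convolution keeps the added cost small, the sketch below counts multiply-accumulate operations (MACs) for a single layer in both forms and computes the relative FLOPs reduction from the Table 6 figures. The layer size (feature-map height and width, 256 channels, 3 × 3 kernel) and the function names are hypothetical and are not taken from the FA PEM configuration.

```python
def standard_conv_macs(height, width, c_in, c_out, kernel):
    # Every output channel mixes all input channels over a kernel x kernel window.
    return height * width * c_in * c_out * kernel * kernel


def depthwise_separable_macs(height, width, c_in, c_out, kernel):
    # Depthwise step: one kernel x kernel filter per input channel.
    depthwise = height * width * c_in * kernel * kernel
    # Pointwise step: 1 x 1 convolution that mixes channels.
    pointwise = height * width * c_in * c_out
    return depthwise + pointwise


def relative_reduction(baseline, variant):
    # Fraction of the baseline cost that is saved by the variant.
    return (baseline - variant) / baseline


# Hypothetical layer size, chosen only for illustration.
h, w, channels, k = 100, 167, 256, 3
std = standard_conv_macs(h, w, channels, channels, k)
sep = depthwise_separable_macs(h, w, channels, channels, k)
print(f"separable / standard MAC ratio: {sep / std:.4f}")

# FLOPs values reported in Table 6 (PEM versus FA PEM on SPSD).
print(f"relative FLOPs reduction: {relative_reduction(103.1, 99.4):.2%}")
```

For equal input and output channel counts C and kernel size k, the ratio reduces to 1/C + 1/k², so the saving of a 3 × 3 depthwise separable layer approaches a factor of nine as C grows.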

To analyze the contributions of each component in the SFAM module, we perform detailed ablation experiments on the SFAM components using both the Cityscapes and SPSD datasets. The backbone network, image resolution, and other experimental parameters are consistent across all experiments. As shown in Table 7, when only the spatial component is added, the PQ metric improves by 0.6% and 0.4% on the SPSD and Cityscapes datasets, respectively, demonstrating an effective enhancement in model performance. The spatial domain component utilizes a multi-receptive field depthwise separable convolution strategy, which effectively establishes long-range dependencies in the spatial dimension, enabling the model to capture global information more efficiently and improve its spatial context understanding. When only the frequency domain component is added, the PQ metric increases by 0.8% and 0.7% on the SPSD and Cityscapes datasets, respectively, resulting in a significant performance improvement. The frequency domain component leverages Fourier transforms to convert feature maps into the frequency domain, effectively extracting low-frequency global structural information and high-frequency detailed features. By fusing low-frequency and high-frequency features, it enhances the model’s ability to capture complex edge and texture details, significantly improving accuracy. When both the spatial and frequency domain components are added simultaneously, the PQ metric improves by 1.3% on the SPSD dataset and 1.1% on the Cityscapes dataset, further enhancing model performance. This demonstrates that when the spatial and frequency domain modules are combined, the complementary fusion of spatial and frequency features not only captures global structural information but also extracts detailed local features and multi-scale information. This significantly improves model accuracy in challenging scenarios, such as complex backgrounds and multi-scale objects.
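The low-frequency and high-frequency decomposition described above can be illustrated with a short NumPy sketch. This is a generic band split with a circular mask in the centred spectrum, not the SFAM frequency branch itself; the mask shape, the `radius` parameter, and the function name are assumptions made for illustration.

```python
import numpy as np


def split_frequency_bands(feature_map, radius):
    # Split a 2D map into low- and high-frequency parts using a circular
    # mask around the zero-frequency bin of the centred spectrum.
    h, w = feature_map.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
low, high = split_frequency_bands(x, radius=4)
# The two bands are complementary, so they sum back to the input.
print(np.allclose(low + high, x))
```

The low band retains smooth, global structure, while the high band retains edges and fine texture. A learned module would typically reweight the two bands before fusing them rather than simply adding them back together.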

Additionally, we analyze the effect of input image resolution, as shown in Table 8. As the resolution increases, the panoptic segmentation metrics (PQ, PQ_th, and PQ_st) improve markedly. Specifically, as the resolution increases from 480 × 640 to 800 × 1333, the model’s PQ increases from 57.8 to 62.3, indicating that a higher input resolution effectively improves panoptic segmentation accuracy. PQ_th and PQ_st also improve notably, reaching 57.3 and 75.6, respectively, at 800 × 1333, which reflects the ability of higher-resolution inputs to preserve finer details. However, when the resolution is further increased to 1000 × 1600, PQ decreases relative to the 800 × 1333 setting. This suggests that while a higher resolution provides more detailed information, it also makes the model’s receptive field smaller relative to the image. The model’s ability to capture global information weakens and it focuses more on local features, which impairs its understanding of global structures and lowers panoptic segmentation performance. In addition, the backgrounds in the SPSD dataset are complex, with challenging factors such as the sea surface, waves, and weather conditions, so a higher resolution may introduce more irrelevant detail and make it harder to separate ship targets from the background. Inference time also grows with resolution: it increases by 14.3 ms from 480 × 640 to 600 × 1000, by 1.4 ms from 600 × 1000 to 800 × 1333, and by 4.4 ms from 800 × 1333 to 1000 × 1600. This indicates that the computational cost added by higher resolutions reduces the model’s inference efficiency. 
Considering both panoptic segmentation quality metrics and inference time, the 800 × 1333 resolution achieves a good balance between model performance and inference time, with only a relatively small increase in inference time. Therefore, based on the above analysis, we select the 800 × 1333 resolution for the FA PEM ship panoptic segmentation model.
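The cost side of this trade-off can be summarized with a small calculation. The sketch below uses only the four resolutions and the three inference-time increments quoted in the text; the helper names are hypothetical, and absolute inference times are not reproduced here because only the increments are stated.

```python
# Resolutions and inference-time increments (ms) as quoted in the text.
resolutions = [(480, 640), (600, 1000), (800, 1333), (1000, 1600)]
latency_steps_ms = [14.3, 1.4, 4.4]


def pixel_counts(sizes):
    # Number of input pixels for each resolution.
    return [a * b for a, b in sizes]


def cumulative(steps):
    # Running total of the increments, relative to the smallest resolution.
    total, out = 0.0, []
    for step in steps:
        total += step
        out.append(round(total, 1))
    return out


pixels = pixel_counts(resolutions)
extra_latency = cumulative(latency_steps_ms)

for (a, b), p in zip(resolutions, pixels):
    print(f"{a} x {b}: {p} pixels, {p / pixels[0]:.2f}x the smallest input")
print("Added latency relative to 480 x 640 (ms):", extra_latency)
```

Under these figures, the step from 600 × 1000 to 800 × 1333 adds the least latency of the three steps, which is consistent with the choice of 800 × 1333 as the working resolution.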

To more effectively illustrate our model’s performance on the SPSD ship panoptic segmentation dataset, we present qualitative visualizations of the segmentation results. As shown in Figure 5, we compare the segmentation results of the FA PEM algorithm and the PEM algorithm [26]: (a) shows the original image, (b) shows the panoptic segmentation results of the PEM algorithm, and (c) shows the panoptic segmentation results of the FA PEM algorithm. Each row corresponds to a different type of ship sample from the SPSD dataset. To clearly highlight the details, we zoom in on specific regions and boundaries. The zoomed-in regions in Figure 5b show that the panoptic segmentation results of the PEM algorithm [26] have several limitations. In the first row, the PEM algorithm fails to detect some ships in a multi-target scene, and parts of the background are misclassified as ship targets, leading to inaccurate background boundary segmentation. In the second row, the PEM algorithm [26] misidentifies a warship as a passenger ship, resulting in a false positive segmentation, and the mask boundary of the warship is incomplete. In the third row, the mask boundary of the cargo ship is also incomplete, leading to partial loss of the ship’s mask information and inaccurate boundary extraction. In the fourth and sixth rows, under complex conditions such as varying illumination and wave interference, the model struggles to capture segmentation boundary details accurately. In the zoomed-in region of the fourth row, the boundary is imprecise, with discontinuities and rough edges, and fails to follow the true contour of the ship. In the sixth row, wave interference results in imprecise and overly rough ship boundaries. In the fifth and eighth rows, the PEM algorithm misses small ships, leading to false negatives. 
In the seventh row, under foggy conditions, the PEM algorithm [26] misclassifies the background land as sky, resulting in incorrect background segmentation. These issues demonstrate the limitations of the PEM algorithm [26] in handling complex scenarios, boundary delineation, and small target segmentation, which negatively impact the accuracy and completeness of the segmentation results. In contrast, the zoomed-in regions in Figure 5c clearly show that the FA PEM algorithm significantly improves boundary segmentation in the same scenes. First, in complex backgrounds and multi-target scenarios, FA PEM effectively avoids both false positives and false negatives, significantly improving small ship detection and the accurate identification of background regions. Second, FA PEM provides more precise boundary extraction for ship segmentation masks, with fine-grained performance that clearly outperforms the PEM algorithm [26]. This enhancement significantly improves the completeness and distinctiveness of ship contour extraction. Overall, FA PEM not only improves the segmentation accuracy of ships compared to the PEM algorithm, but also enhances the model’s robustness and practical applicability in multi-class and multi-scale scenarios.
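The effect of the missed ships (false negatives) and spurious segments (false positives) discussed above on the PQ metric can be made explicit with the standard panoptic quality definition, in which matched segments (IoU above 0.5) contribute their IoU and each unmatched segment adds 0.5 to the denominator. The sketch below is a minimal per-class illustration; the IoU values and counts are invented for the example and do not come from the SPSD experiments.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    # PQ = sum of IoUs over true positives / (TP + 0.5 * FP + 0.5 * FN).
    if any(iou <= 0.5 or iou > 1.0 for iou in matched_ious):
        raise ValueError("matched segments must have IoU in (0.5, 1.0]")
    denominator = (
        len(matched_ious)
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        return 0.0
    return sum(matched_ious) / denominator


# Hypothetical scene with three ships, all matched with good masks.
clean = panoptic_quality([0.90, 0.80, 0.85], 0, 0)

# Same scene, but one small ship is missed (FN) and one background
# region is wrongly segmented as a ship (FP).
degraded = panoptic_quality([0.90, 0.80], 1, 1)

print(f"PQ without errors: {clean:.3f}")
print(f"PQ with one FP and one FN: {degraded:.3f}")
```

Because unmatched segments enter only the denominator, a single missed small ship lowers PQ even when the masks of the remaining ships are accurate, which is why small-target recall matters for the reported PQ_th.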

5. Conclusions

In this paper, we propose a novel network, FA PEM, to address the challenges of multi-scale ship panoptic segmentation in complex maritime environments. Specifically, the algorithm introduces the DCAU module, which employs content-adaptive dynamic sampling point selection and a grouped upsampling mechanism to effectively alleviate the boundary blurring problem commonly encountered in traditional upsampling operations. This design significantly enhances the model’s ability to capture and represent fine-grained boundaries between ships and background regions. Meanwhile, the SFAM module integrates multi-scale spatial feature extraction with frequency domain feature analysis, enabling collaborative modeling in both the spatial and frequency domains. This approach substantially improves the model’s capacity to represent both global structures and local details of target objects, achieving complementary and efficient feature fusion. Given the scarcity of panoptic segmentation datasets for ships in maritime scenarios, we have constructed a ship panoptic segmentation dataset that includes real-world scenes, providing a valuable benchmark for evaluating panoptic segmentation algorithms. Experimental results on the Cityscapes [27] and SPSD ship datasets demonstrate that the proposed FA PEM model achieves superior performance and exhibits strong generalization capability.

In future work, we aim to explore lightweight model architectures while maintaining ship panoptic segmentation performance. In addition, to reduce reliance on large-scale, high-quality labeled data, we plan to introduce self-supervised [68], weakly supervised [6], or semi-supervised [69,70] learning strategies. These strategies would allow the model to maintain strong performance and improve data utilization and generalization even in scenarios with limited annotations.

Author Contributions

Conceptualization, M.Y. and H.M.; methodology, M.Y.; software, M.Y.; validation, M.Y.; formal analysis, M.Y. and H.M.; investigation, H.M. and Y.C.; resources, M.Y. and H.M.; data curation, M.Y. and J.W.; writing—original draft preparation, M.Y.; writing—review and editing, M.Y. and H.M.; visualization, M.Y.; supervision, H.M. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 2019YFE0105400) and the Key Technologies for the Development of Intelligent Technology Test Ships (Grant No. CJ01N20).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Access to the data will be considered upon request.

Acknowledgments

We are deeply grateful to our colleagues for their exceptional support in developing the dataset and assisting with the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhao, K.; Prokop, J.; Montalt-Tordera, J.; Mohammadi, S. Panoptic Segmentation of Mammograms with Text-to-Image Diffusion Model. In Proceedings of the MICCAI Workshop on Deep Generative Models; Springer: Cham, Switzerland, 2024; pp. 98–108.
2. Shaheema, B.; Muppalaneni, N.B.; Devi, K.S. An explainable deep learning-based panoptic segmentation for brain tumor diagnosis. Neural Comput. Appl. 2025, 37, 20639–20662.
3. Lv, J.; Zhu, Y.; Tenorio, C.G.C.; Chohan, B.S.; Eastwood, M.; Raza, S.E.A. Leveraging Pathology Foundation Models for Panoptic Segmentation of Melanoma in H&E Images. In Proceedings of the Annual Conference on Medical Image Understanding and Analysis; Springer: Cham, Switzerland, 2025; pp. 58–72.
4. Dusi, A.; Helou, B. LiDAR Panoptic Segmentation for Autonomous Driving: A Survey. Electron. Imaging 2025, 37, AVM-115.
5. Kinzig, C.; Miller, H.; Lauer, M.; Stiller, C. Panoptic segmentation from stitched panoramic view for automated driving. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024; pp. 3342–3347.
6. Knott, M.; Odion, D.; Sontakke, S.; Karwa, A.; Defraeye, T. Weakly Supervised Panoptic Segmentation for Defect-Based Grading of Fresh Produce. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5462–5471.
7. Kiefer, B.; Zust, L.; Kristan, M.; Pers, J.; Tersek, M.; Mudenagudi, U.; Desai, C.; Wiliem, A.; Kreis, M.; Akalwadi, N.; et al. 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 1542–1569.
8. Wang, Y.; Chen, X.; Wu, Y.; Zhao, J.; Postolache, O.; Liu, S. Visual navigation systems for maritime smart ships: A survey. J. Mar. Sci. Eng. 2024, 12, 1781.
9. Dou, Y.; Yao, F.; Wang, X.; Qu, L.; Long, C.; Xu, Z.; Ding, L.; Bullock, L.; Zhong, G.; Wang, S. PanopticUAV: Panoptic segmentation of UAV images for marine environment monitoring. Comput. Model. Eng. Sci. 2024, 138, 1001.
10. Fu, J.; Li, F.; Zhao, J.; Wang, Y.; Zhang, H. Maritime Infrared Ship Detection in UAV Imagery Based on Two-Stage Region-Segmentation-Guided Learning Network. IEEE Trans. Instrum. Meas. 2025, 74, 5028516.
11. Liu, Z.; Li, Z.; Liang, Y.; Persello, C.; Sun, B.; He, G.; Ma, L. Rsps-sam: A remote sensing image panoptic segmentation method based on sam. Remote Sens. 2024, 16, 4002.
12. Yoon, H.; Kim, H.K.; Kim, S. PPDD: Egocentric Crack Segmentation in the Port Pavement with Deep Learning-Based Methods. Appl. Sci. 2025, 15, 5446.
13. Ng, W.; Minasny, B.; Mendes, W.d.S.; Demattê, J.A.M. The influence of training sample size on the accuracy of deep learning models for the prediction of soil properties with near-infrared spectroscopy data. Soil 2020, 6, 565–578.
14. Lin, Y.S.; Huang, P.H.; Chen, Y.Y. Deep learning-based hepatocellular carcinoma histopathology image classification: Accuracy versus training dataset size. IEEE Access 2021, 9, 33144–33157.
15. Chu, H.C.; Zhang, Y.L.; Chiang, H.C. A CNN sound classification mechanism using data augmentation. Sensors 2023, 23, 6972.
16. Abriha, D.; Szabó, S. Strategies in training deep learning models to extract building from multisource images with small training sample sizes. Int. J. Digit. Earth 2023, 16, 1707–1724.
17. Sun, B.; Yan, Y.; Fu, H.; He, G.; He, Y.; Zhang, Z.; Gao, F. MFGS2Net: A Morphological Features Guided Scale Separation Network for Remote Sensing Panoptic Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5621717.
18. Sun, Z.; Liu, J.; Zhang, W.; Liu, F.; Yang, J.; Xiao, L. Multi-scale Feature Interaction and Adaptive Experts for Panoptic Segmentation in Remote Sensing Images. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
19. Ibrahim, I.A.; Namoun, A.; Ullah, S.; Alasmary, H.; Waqas, M.; Ahmad, I. Infrared ship segmentation based on weakly-supervised and semi-supervised learning. IEEE Access 2024, 12, 117908–117920.
20. Hahn, O.; Reich, C.; Araslanov, N.; Cremers, D.; Rupprecht, C.; Roth, S. Scene-Centric Unsupervised Panoptic Segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 24485–24495.
21. Žust, L.; Kristan, M. PanSR: An object-centric mask transformer for panoptic segmentation. arXiv 2024, arXiv:2412.10589.
22. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408.
23. Hu, J.; Huang, L.; Ren, T.; Zhang, S.; Ji, R.; Cao, L. You only segment once: Towards real-time panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17819–17829.
24. Rosi, G.; Cuttano, C.; Cavagnero, N.; Averta, G.; Cermelli, F. The revenge of BiSeNet: Efficient multi-task image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 8066–8074.
25. Ammar, A.; Khalil, M.I.; Salama, C. Rt-yoso: Revisiting yoso for real-time panoptic segmentation. In Proceedings of the 2023 5th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 21–23 October 2023; pp. 306–311.
26. Cavagnero, N.; Rosi, G.; Cuttano, C.; Pistilli, F.; Ciccone, M.; Averta, G.; Cermelli, F. Pem: Prototype-based efficient maskformer for image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15804–15813.
27. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
28. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254.
29. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604.
30. Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Li, W.; Zhang, Z.; Wang, P.; Zhang, W. Global Mask R-CNN for marine ship instance segmentation. Neurocomputing 2022, 480, 257–270.
31. Pushkala, K.P.; Subbulakshmi, P. Synergistic integration of vision transformers and advanced segmentation algorithms for panoptic mapping of marine litter. Front. Mar. Sci. 2025, 12, 1726472.
32. Darbyshire, M.; Sklar, E.; Parsons, S. Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 335–349.
33. Stolle, K.H. Balancing Shared and Task-Specific Representations: A Hybrid Approach to Depth-Aware Video Panoptic Segmentation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 3301–3309.
34. Shin, I.; Kim, D.; Yu, Q.; Xie, J.; Kim, H.S.; Green, B.; Kweon, I.S.; Yoon, K.J.; Chen, L.C. Video-kmax: A simple unified approach for online and near-online video panoptic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 229–239.
35. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
36. Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.; Chen, L.C. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 108–126.
37. Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5463–5474.
38. Xiong, Y.; Liao, R.; Zhao, H.; Hu, R.; Bai, M.; Yumer, E.; Urtasun, R. Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8818–8826.
39. Wu, Y.; Zhang, G.; Xu, H.; Liang, X.; Lin, L. Auto-panoptic: Cooperative multi-component architecture search for panoptic segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 20508–20519.
40. Mohan, R.; Valada, A. Efficientps: Efficient panoptic segmentation. Int. J. Comput. Vis. 2021, 129, 1551–1579.
41. Hong, W.; Guo, Q.; Zhang, W.; Chen, J.; Chu, W. Lpsnet: A lightweight solution for fast panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16746–16754.
42. Li, Z.; Wang, W.; Xie, E.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P.; Lu, T. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289.
43. Gao, N.; He, F.; Jia, J.; Shan, Y.; Zhang, H.; Zhao, X.; Huang, K. Panopticdepth: A unified framework for depth-aware panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1632–1642.
44. Jain, J.; Li, J.; Chiu, M.T.; Hassani, A.; Orlov, N.; Shi, H. Oneformer: One transformer to rule universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2989–2998.
45. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
46. Han, D.; Pan, X.; Han, Y.; Song, S.; Huang, G. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5961–5971.
47. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
48. Feng, J.; Liu, X. Frequency-Quantized Variational Autoencoder Based on 2D-FFT for Enhanced Image Reconstruction and Generation. Comput. Mater. Contin. 2025, 83, 2087–2107.
49. Zhang, H.; Li, Z.; Chen, Y.; Lu, C.; Yan, P. Fast image reconstruction method using radial harmonic Fourier moments and its application in digital watermarking. J. Frankl. Inst. 2025, 362, 107391.
50. Li, Y.; Zhao, H.; Qi, X.; Wang, L.; Li, Z.; Sun, J.; Jia, J. Fully convolutional networks for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 214–223.
51. De Geus, D.; Meletis, P.; Dubbelman, G. Fast panoptic segmentation network. IEEE Robot. Autom. Lett. 2020, 5, 1742–1749.
52. Hou, R.; Li, J.; Bhargava, A.; Raventos, A.; Guizilini, V.; Fang, C.; Lynch, J.; Gaidon, A. Real-time panoptic segmentation from dense detections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8523–8532.
53. Mao, L.; Ren, F.; Yang, D.; Zhang, R. ChaInNet: Deep chain instance segmentation network for panoptic segmentation. Neural Process. Lett. 2023, 55, 615–630.
54. Tian, Z.; Zhang, B.; Chen, H.; Shen, C. Instance and panoptic segmentation using conditional convolutions. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 669–680.
55. Lei, H.; He, F.; Jia, B.; Wu, Q. MFNet: Panoptic segmentation network based on multiscale feature weighted fusion and frequency domain attention mechanism. IET Comput. Vis. 2023, 17, 88–97.
56. Chang, S.E.; Chen, Y.; Yang, Y.C.; Lin, E.T.; Hsiao, P.Y.; Fu, L.C. Se-psnet: Silhouette-based enhancement feature for panoptic segmentation network. J. Vis. Commun. Image Represent. 2023, 90, 103736.
57. Xu, Y.; Liu, R.; Zhu, D.; Chen, L.; Zhang, X.; Li, J. Cascade contour-enhanced panoptic segmentation for robotic vision perception. Front. Neurorobot. 2024, 18, 1489021.
58. Schön, M.; Buchholz, M.; Dietmayer, K. Rt-k-net: Revisiting k-net for real-time panoptic segmentation. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–7.
59. Kerola, T.; Li, J.; Kanehira, A.; Kudo, Y.; Vallet, A.; Gaidon, A. Hierarchical lovász embeddings for proposal-free panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14413–14423.
60. Li, F.; Zhang, H.; Xu, H.; Liu, S.; Zhang, L.; Ni, L.M.; Shum, H.Y. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3041–3050.
61. Zhang, H.; Li, F.; Xu, H.; Huang, S.; Liu, S.; Ni, L.M.; Zhang, L. Mp-former: Mask-piloted transformer for image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18074–18083.
62. Zhu, X.; Dong, X.; Yu, W.; Liang, H.; Kong, B. Refactored Maskformer: Refactor localization and classification for improved universal image segmentation. Displays 2025, 87, 102981.
63. Hu, J.; Cao, L.; Jin, X.; Zhang, S.; Ji, R. Universal Image Segmentation with Efficiency. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 8550–8562.
64. Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12475–12485.
65. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875.
66. Porzi, L.; Bulo, S.R.; Colovic, A.; Kontschieder, P. Seamless scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8277–8286.
67. Wang, P.; Zhou, D. Image panoptic segmentation network based on Efficientnet. In Proceedings of the 2024 International Conference on Image Processing, Multimedia Technology and Maching Learning, Dali, China, 27–29 December 2024; pp. 97–102.
68. Wang, Z.; Chen, H.; Qin, H.; Chen, Q. Self-supervised pre-training joint framework: Assisting lightweight detection network for underwater object detection. J. Mar. Sci. Eng. 2023, 11, 604.
69. Chen, H.; Li, M.; Liu, Y.; Zhou, J.; Fu, X.; Liu, S.; Yu, F.R. Dynamic Mutual Adversarial Learning for Semi-Supervised Semantic Segmentation of Underwater Images with Limited and Noisy Annotations. J. Mar. Sci. Eng. 2025, 13, 2334.
70. Ding, M.; Li, G.; Hu, Y.; Liu, H.; Hu, Q.; Huang, X. Semi-Supervised Underwater Image Enhancement Method Using Multimodal Features and Dynamic Quality Repository. J. Mar. Sci. Eng. 2025, 13, 1195.

Figure 1. The overall architecture of Frequency Adaptive PEM (FA PEM). DCAU represents the dynamic correlation-aware upsampling module. SFAM represents the Spatial-Frequency Attention Module.

Figure 2. Dynamic Correlation-Aware Upsampling (DCAU) Module.

Figure 3. Spatial-Frequency Attention Module (SFAM).

Figure 4. The detailed information of the SPSD dataset. Different colors indicate different panoptic segments, including ship instances and background (stuff) classes.

Figure 5. Comparison of panoptic segmentation results between the PEM algorithm [26] and the FA PEM algorithm on the SPSD dataset. (a) denotes the original image, (b) denotes the PEM panoptic segmentation result, and (c) denotes the FA PEM panoptic segmentation result.

Table 1. Comparison of different panoptic segmentation methods on the Cityscapes dataset [27]. Dashes indicate that the corresponding values are not available.

Method	Backbone	Scale	PQ	PQ_th	PQ_st
PanopticFPN [22]	ResNet-50	1024, 2048	57.7	51.6	62.2
PanopticFCN [50]	ResNet-50	1024, 2048	61.4	54.8	66.6
UPSNet [38]	ResNet-50	1024, 2048	59.3	54.6	62.7
LPSNet [41]	ResNet-50	1024, 2048	59.7	54.0	63.9
FPSNet [51]	ResNet-50	1024, 2048	55.1	48.3	60.1
Mask2Former [45]	ResNet-50	1024, 2048	62.1	54.9	67.3
RealTimePan [52]	ResNet-50	1024, 2048	58.8	52.1	63.7
YOSO [23]	ResNet-50	1024, 2048	59.7	51.0	66.1
BiSeNetFormer [24]	ResNet-50	1024, 2048	57.5	52.2	62.4
ChaInNet [53]	ResNet-50	1024, 2048	59.5	55.0	62.8
CondInst [54]	ResNet-50	1024, 2048	61.7	59.0	63.7
MFNet [55]	ResNet-50	1024, 2048	60.0	55.9	63.5
SE-PSNet [56]	ResNet-50	1024, 2048	60.0	55.9	62.9
EfficientPS [40]	ResNet-50	1024, 2048	60.3	55.3	63.9
CCPSNet [57]	ResNet-50	1024, 2048	60.5	56.9	63.1
RT-K-Net [58]	RTFormer	1024, 2048	60.2	51.5	66.5
HLE [59]	ResNet-50	1024, 2048	59.8	51.1	66.1
RT-YOSO [25]	STDC2	1024, 2048	59.2	48.1	67.3
Mask DINO [60]	ResNet-50	1024, 2048	62.5	55.1	67.2
MP-Former [61]	ResNet-50	1024, 2048	62.7	—	—
Re-Maskformer [62]	ResNet-50	1024, 2048	62.9	55.2	68.4
UISE [63]	ResNet-50	1024, 2048	59.9	51.1	66.2
PanSR [21]	ResNet-50	1024, 2048	62.4	55.7	67.3
PEM [26]	ResNet-50	1024, 2048	61.1	54.3	66.1
FA PEM (ours)	ResNet-50	1024, 2048	63.1	56.4	68.0

Table 2. Comparison of different panoptic segmentation methods on the SPSD.

Method	Backbone	PQ	PQ_th	PQ_st
PanopticFPN [22]	ResNet-50	59.8	55.4	71.3
PanopticFCN [50]	ResNet-50	59.2	54.2	72.5
Panoptic-DeepLab [64]	ResNet-50	60.3	55.5	73.4
MaskFormer [65]	ResNet-50	59.3	53.2	75.6
Mask2Former [45]	ResNet-50	61.4	55.7	76.8
YOSO [23]	ResNet-50	60.1	53.9	76.6
PanopticUAV [9]	ResNet-50	60.6	55.6	74.1
PEM [26]	ResNet-50	60.2	55.3	73.3
FA PEM (ours)	ResNet-50	62.3	57.3	75.6

Table 3. Comparison of model FLOPs on Cityscapes.

Method	Backbone	Scale	FLOPs
MP-Former [61]	ResNet-50	1024, 2048	519 G
Mask DINO [60]	ResNet-50	1024, 2048	643 G
Re-Maskformer [62]	ResNet-50	1024, 2048	574 G
UISE [63]	ResNet-50	1024, 2048	289 G
UPSNet [38]	ResNet-50	1024, 2048	487 G
Seamless [66]	ResNet-50	1024, 2048	514 G
LEPSNet [67]	ResNet-50	1024, 2048	451 G
Panoptic-DeepLab [64]	ResNet-50	1024, 2048	547 G
Mask2Former [45]	ResNet-50	1024, 2048	519 G
MR-Mask2Former	ResNet-50	1024, 2048	522 G
YOSO [23]	ResNet-50	1024, 2048	265 G
PEM [26]	ResNet-50	1024, 2048	237 G
FA PEM (ours)	ResNet-50	1024, 2048	226 G

Table 4. Ablation of the effectiveness of the modules on Cityscapes [27]. ✓ and × indicate the presence and absence of the corresponding module, respectively.

DCAU	SFAM	Backbone	Scale	PQ	PQ_th	PQ_st
×	×	ResNet-50	1024, 2048	61.1	54.3	66.1
✓	×	ResNet-50	1024, 2048	61.9	55.2	66.8
×	✓	ResNet-50	1024, 2048	62.2	55.5	67.1
✓	✓	ResNet-50	1024, 2048	63.1	56.4	68.0
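
As a reading aid for Table 4, the following minimal Python sketch recomputes the gain of each configuration over the PEM baseline from the values listed in the table. The names `ablation` and `gains_over_baseline` are illustrative and are not taken from the authors' code.

```python
# Values copied from Table 4 (Cityscapes, ResNet-50); keys are (DCAU, SFAM).
ablation = {
    (False, False): {"PQ": 61.1, "PQ_th": 54.3, "PQ_st": 66.1},  # PEM baseline
    (True, False): {"PQ": 61.9, "PQ_th": 55.2, "PQ_st": 66.8},
    (False, True): {"PQ": 62.2, "PQ_th": 55.5, "PQ_st": 67.1},
    (True, True): {"PQ": 63.1, "PQ_th": 56.4, "PQ_st": 68.0},
}


def gains_over_baseline(table, baseline=(False, False)):
    """Return each non-baseline row's metric differences relative to the baseline row."""
    base = table[baseline]
    return {
        config: {m: round(v - base[m], 1) for m, v in metrics.items()}
        for config, metrics in table.items()
        if config != baseline
    }


gains = gains_over_baseline(ablation)
for (dcau, sfam), delta in gains.items():
    print(f"DCAU={dcau!s:5} SFAM={sfam!s:5} {delta}")
```

With these numbers, the combined gain (+2.0 PQ) is close to the sum of the individual gains (+0.8 for DCAU and +1.1 for SFAM), which is consistent with the two modules being largely complementary.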

Table 5. Ablation of the effectiveness of the modules on the SPSD. ✓ and × indicate the presence and absence of the corresponding module, respectively.

DCAU	SFAM	Backbone	Scale	PQ	PQ_th	PQ_st
×	×	ResNet-50	800, 1333	60.2	55.3	73.3
✓	×	ResNet-50	800, 1333	60.9	56.1	73.6
×	✓	ResNet-50	800, 1333	61.5	56.8	73.8
✓	✓	ResNet-50	800, 1333	62.3	57.3	75.6

Table 6. Comparison of the computational complexity of the modules on the SPSD.

Method	Backbone	Scale	FLOPs
PEM	ResNet-50	800, 1333	103.1 G
PEM + DCAU	ResNet-50	800, 1333	99.3 G
FA PEM (ours)	ResNet-50	800, 1333	99.4 G

Table 7. Comparison of different sub-components of the SFAM module on the SPSD and Cityscapes.

Sub-Components	SPSD PQ	SPSD PQ_th	SPSD PQ_st	Cityscapes PQ	Cityscapes PQ_th	Cityscapes PQ_st
PEM	60.2	55.3	73.3	61.1	54.3	66.1
spatial-only	60.8	56.1	73.4	61.5	54.9	66.3
frequency-only	61.0	56.4	73.5	61.8	55.3	66.6
combined	61.5	56.8	73.8	62.2	55.5	67.1

Table 8. Comparison of different image resolutions on the SPSD.

Backbone	Scale	PQ	PQ_th	PQ_st	Time (ms)
ResNet-50	480, 640	57.8	52.8	71.2	26.1
ResNet-50	600, 1000	61.1	56.2	74.0	40.4
ResNet-50	800, 1333	62.3	57.3	75.6	41.8
ResNet-50	1000, 1600	60.8	55.1	74.8	46.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

