In this section, we present a brief overview of the proposed FA PEM method for panoptic segmentation of ships in complex backgrounds. The detailed components of our approach, including the DCAU module, the SFAM module, and the loss function, are described in the following subsections.
3.2. Dynamic Context-Aware Upsampling (DCAU) Module
In dense prediction tasks such as panoptic segmentation, models typically rely on upsampling operations to transform low-resolution feature maps into high-resolution ones in order to meet output resolution requirements. Although performance metrics in panoptic segmentation continue to improve, boundary prediction remains suboptimal. Traditional upsampling methods, such as bilinear interpolation and nearest neighbor interpolation, assign sampling points based on a fixed rule determined by relative distance. This often results in multiple upsampled points being incorrectly assigned to the same semantic cluster in regions requiring detailed representation, thereby failing to capture local variations. When processing points belonging to different semantic clusters, these methods cannot clearly distinguish point assignments, leading to blurred boundaries between clusters and reduced boundary sharpness in segmentation results. Although transposed convolution is a learnable approach, its upsampling kernel remains fixed after training. This restricts its ability to flexibly adapt to changes in the input data, as the assignment rule for upsampled points does not change, making it difficult to meet the requirements of complex and dynamic real-world tasks.
To address these challenges, we propose the DCAU module, which is illustrated in Figure 2. It is designed specifically for panoptic segmentation and targets the boundary blur and segmentation inaccuracies of traditional upsampling methods. The DCAU module is inspired by upsampling optimization techniques in semantic segmentation. However, those methods focus primarily on pixel-level classification and often rely on fixed rules or global learning to restore image resolution, whereas the DCAU module explicitly addresses boundary alignment between object instances, which is crucial in panoptic segmentation. The DCAU module selects sampling points dynamically from local image features (content-adaptive sampling), thereby mitigating the boundary blur caused by the fixed sampling rules of traditional methods. Through this dynamic adjustment, the DCAU module aligns the boundaries of object instances more accurately and preserves finer segmentation details. Additionally, the DCAU module employs a grouped upsampling strategy that partitions the feature map into groups during resolution restoration, which improves computational efficiency while maintaining segmentation accuracy.
The DCAU module consists of three main steps: sampling point selection, weight generation, and feature fusion. In the sampling point selection stage, a set of relevant sampling points is selected from the decoder features for each encoder feature point. During the weight generation stage, kernel weights are generated by computing inner product similarity and applying softmax normalization to obtain similarity-aware weights. In the feature fusion stage, the selected sampling points are aggregated through a weighted summation using the generated kernel weights, resulting in upsampled feature representations.
In the sampling point selection phase, we first apply deformable convolution (DCN) to the decoded features. DCN adapts the receptive field according to the input features, allowing more flexible perception of target areas with varying deformations and complex structures, and thereby enhancing spatial awareness and fine-grained detail modeling. The features processed by DCN are further refined by dynamically selecting relevant points based on feature information: a linear layer projects the features to offset values, improving the model’s ability to capture fine-grained details. Our aim is to upsample the low-resolution feature map F to a high-resolution feature map. For each point l’ in the high-resolution map, we first calculate its corresponding base position l in the low-resolution map. Then, from the neighborhood of l in the low-resolution feature map, we dynamically select S sampling points to serve as candidate semantic regions, and we collect the features corresponding to these sampling points. Sampling point selection proceeds as follows. An offset prediction network constructed from linear layers generates a two-dimensional coordinate offset for each sampling point. The point features are then obtained by bilinear interpolation of F at the locations of the sampling points, which yields point features in a continuous spatial domain.
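The sampling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (C, H, W) layout, and the convention that each sampling location equals the base position plus a predicted offset are assumptions made for illustration, and the offsets are passed in directly rather than predicted by a linear layer.

```python
import numpy as np

def bilinear_sample(F, ys, xs):
    """Sample a feature map F of shape (C, H, W) at continuous positions.

    ys, xs: arrays of shape (S,) holding row and column coordinates.
    Returns an array of shape (C, S).
    """
    _, H, W = F.shape
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = ys - y0  # fractional distance to the upper row
    wx = xs - x0  # fractional distance to the left column
    top = F[:, y0, x0] * (1 - wx) + F[:, y0, x1] * wx
    bottom = F[:, y1, x0] * (1 - wx) + F[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def select_sampling_points(F, base, offsets):
    """Gather S candidate features around a base position l.

    base: (2,) array (row, col) in low-resolution coordinates.
    offsets: (S, 2) array standing in for the predicted offsets (assumption).
    """
    positions = base[None, :] + offsets
    return bilinear_sample(F, positions[:, 0], positions[:, 1])
```

Because bilinear interpolation is exact for linear functions, sampling a map whose value is 4y + x at position (1.5, 2.25) returns 8.25, which gives a quick sanity check of the interpolation.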
To further reduce computational cost and improve efficiency, we introduce a grouped upsampling strategy during the sampling point selection stage to optimize the sampling process. Specifically, the feature map is first divided into several groups along the channel dimension, and independent coordinate offsets are subsequently generated for each group. All channels within the same group share the same sampling point offset pattern. This approach effectively reduces computational complexity while preserving the feature representation capacity of the model.
After obtaining the sampling points, it is necessary to assign appropriate weights to them in order to quantify their similarity to the current high-resolution location. Specifically, encoder features are derived from the high-resolution layers of the feature extraction backbone, ResNet-50, which preserve rich spatial details. For each high-resolution position l’, the corresponding encoder feature point and the decoder candidate points are used to compute similarity, followed by a normalization step to generate the sampling point weight map. In this weight computation, the similarity is computed using the inner product, and the normalization is the softmax operation, which assigns a weight to each sampling point. To ensure that the candidate feature p from the decoder and the encoded feature q are compared in spaces of the same dimensionality, we introduce two learnable linear mappings, which act on the decoder’s candidate feature p and the encoder feature q, respectively, before the similarity is computed. The parameters of both mappings are learned, and d denotes the embedding dimension. The inner product measures the semantic similarity between the low-resolution feature p and the target high-resolution feature q. A higher similarity indicates that the sampling point is more important for reconstructing this high-resolution position. Through the softmax operation, all weights are constrained to be non-negative and to sum to one. This ensures that features with higher semantic relevance are assigned higher weights, while features with lower relevance are assigned lower weights.
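As a rough illustration of the weight generation step, the following NumPy sketch projects the encoder feature q and the decoder candidates p into a shared d-dimensional space, takes inner products, and normalizes with softmax. The matrices W_q and W_p are hypothetical stand-ins for the two learnable linear mappings, and the division by the square root of d is an assumption (the common scaled dot-product convention), not something the text confirms.

```python
import numpy as np

def similarity_weights(q, P, W_q, W_p):
    """Softmax-normalized similarity between one encoder feature and S candidates.

    q:   (Cq,) encoder feature at the target high-resolution position.
    P:   (S, Cp) decoder candidate features.
    W_q: (Cq, d) and W_p: (Cp, d), stand-ins for the learnable mappings.
    """
    d = W_q.shape[1]
    q_emb = q @ W_q                      # (d,)
    p_emb = P @ W_p                      # (S, d)
    scores = p_emb @ q_emb / np.sqrt(d)  # scaling by sqrt(d) is an assumption
    scores = scores - scores.max()       # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()       # non-negative, sums to one
```

With identity mappings, the candidate most aligned with q receives the largest weight, and the weights always form a valid probability distribution over the S sampling points.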
During the feature fusion stage, the generated weights are used to perform a weighted summation of the selected sampling points, resulting in the upsampled feature point at the target position l’. Each term in the sum is the feature in the low-resolution feature map corresponding to one sampling point, multiplied by its weight. The weight is computed from the inner product similarity between the low-resolution feature and the target high-resolution feature, followed by the softmax operation, and therefore reflects the similarity between the sampling point and the target position l’. This process encourages the upsampled points to align with the structural patterns of the encoder features while suppressing noise. For instance, in boundary regions, the encoder features exhibit higher similarity to decoder points belonging to the correct semantic cluster, so these points receive larger weights, which improves segmentation performance at object boundaries.
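The effect of the weighted summation at a boundary can be seen in a small numerical example. The sketch below takes the weights as given and compares similarity-aware weights with uniform weights, which behave like a fixed averaging rule. The function name, the toy features, and the weight values are illustrative assumptions.

```python
import numpy as np

def fuse(candidates, weights):
    """Weighted sum of S candidate features.

    candidates: (S, C) sampled low-resolution features.
    weights:    (S,) normalized weights for the target position.
    """
    return weights @ candidates

# Two candidates from cluster A ([1, 0]) and two from cluster B ([0, 1]).
candidates = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Uniform weights mix both clusters equally and blur the boundary.
blurred = fuse(candidates, np.full(4, 0.25))

# Similarity-aware weights that favor cluster A keep the feature sharp.
sharp = fuse(candidates, np.array([0.49, 0.49, 0.01, 0.01]))
```

Here the uniform rule yields the ambiguous feature [0.5, 0.5], whereas the similarity-aware weights yield [0.98, 0.02], which stays close to the cluster that the target position belongs to.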
Meanwhile, to enhance the capacity for spatial feature extraction, we adopt an efficient feature fusion strategy. Specifically, we first employ a bottleneck structure that uses a convolution to reduce the number of channels in the fused feature map to C/n, which decreases computational complexity and improves training efficiency. We then perform global average pooling independently along the height and width dimensions, producing two direction-specific feature maps that serve as encoded representations for each spatial direction. The resulting feature maps are transformed and concatenated to integrate information from both the height and width dimensions, which enables the network to capture global contextual information along both directions simultaneously. We then apply batch normalization and a nonlinear activation function to increase the representational power of the model. After this fusion, the feature map is split back along the height and width dimensions, and a sigmoid activation function generates adaptive weights for each dimension, allowing the network to assign appropriate importance to features in the vertical and horizontal directions while suppressing irrelevant regions. The weighted features are passed through a convolutional layer to restore the original channel dimensionality. Finally, we introduce a residual connection that adds the module’s input to the fused and weighted features via element-wise addition. This residual structure mitigates the vanishing gradient problem in deep networks and facilitates feature propagation, improving both the stability and representational capacity of the network.
In this computation, X denotes the upsampled feature map, which is first reduced along the channel dimension and then pooled by global average pooling along the height and width dimensions. CBH refers to the combination of convolution, batch normalization, and the h-swish activation function. The attention weights along the height and width directions are produced by the sigmoid function, and the result of the residual addition is the output feature of the DCAU module.
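The directional fusion described above can be sketched in NumPy as follows. This is a simplified reading of the text, not the authors' code: 1 × 1 convolutions are modeled as channel-mixing matrices (W_reduce, W_mix, and W_restore are hypothetical names), batch normalization is omitted, and the order of operations follows the prose description.

```python
import numpy as np

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def directional_fusion(X, W_reduce, W_mix, W_restore):
    """X: (C, H, W). W_reduce: (C/n, C). W_mix: (C/n, C/n). W_restore: (C, C/n)."""
    _, H, _ = X.shape
    Xr = np.einsum('oc,chw->ohw', W_reduce, X)  # bottleneck: C -> C/n channels
    z_h = Xr.mean(axis=2)                       # pool over width  -> (C/n, H)
    z_w = Xr.mean(axis=1)                       # pool over height -> (C/n, W)
    z = np.concatenate([z_h, z_w], axis=1)      # (C/n, H + W)
    z = h_swish(W_mix @ z)                      # conv + h-swish (BN omitted here)
    a_h = sigmoid(z[:, :H])[:, :, None]         # height weights, (C/n, H, 1)
    a_w = sigmoid(z[:, H:])[:, None, :]         # width weights,  (C/n, 1, W)
    Y = np.einsum('oc,chw->ohw', W_restore, Xr * a_h * a_w)  # restore C channels
    return X + Y                                # residual connection
```

The output has the same shape as the input, and if the restoring convolution is zero the module reduces to the identity, which reflects the role of the residual connection.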
3.3. Spatial-Frequency Attention Module (SFAM)
During the information transmission process in deep neural networks, feature information loss is inevitable. This issue becomes particularly prominent as network depth increases, where low-level detail features are progressively weakened across layers, resulting in reduced capacity for representing fine-grained structures and textures. To address this problem, we propose the SFAM module, which is illustrated in Figure 3. This module jointly models spatial and frequency domain features, enabling the model to capture global information and texture details from multiple dimensions and enhancing its ability to extract features from multi-scale objects. While spatial-frequency techniques have been applied in image denoising and classification, their role and objectives in panoptic segmentation differ notably from those in these two tasks. In image denoising, the frequency domain is primarily used to suppress high-frequency noise and remove interference signals, aiming to restore the image’s clear structure. In image classification, the focus is on extracting global features in the spatial domain to determine the object’s class, typically emphasizing overall shape recognition without considering the boundary information of object instances. In contrast, SFAM’s core task in panoptic segmentation is pixel-level object segmentation, especially in complex backgrounds. By combining spatial and frequency domain features, the SFAM module not only captures global structural information but also enhances the representation of boundaries and details, supporting high-precision multi-scale ship segmentation and object boundary extraction.
The SFAM module first employs a bottleneck structure, using a 1 × 1 convolution to reduce the dimensionality of the input features, thereby decreasing computational cost. Next, the SFAM module adopts a parallel-branch design to perform feature fusion in both the spatial and frequency dimensions. In the spatial dimension, we design three depthwise separable convolutions with different kernel sizes (3 × 3, 5 × 5, and 7 × 7) to extract feature information at multiple spatial scales, which facilitates the capture of fine-grained details and enhances the model’s ability to perceive targets. A nonlinear activation function is then applied to further enhance the feature representation capacity. Each scale-specific path incorporates a linear attention mechanism [46] in place of the computationally expensive conventional self-attention mechanism. The linear attention mechanism [46] significantly reduces computational complexity while maintaining the ability to model long-range spatial dependencies. Specifically, queries (Q) and keys (K) are generated from the input features via linear projection and are subsequently activated by the ELU function to ensure non-negativity, which improves numerical stability and discriminability and serves as an alternative to the traditional softmax weighting. In addition, we apply rotary position encoding (RoPE) [47] to Q and K to introduce relative positional information, which enhances the network’s capacity to model spatial structures without additional learnable parameters. Furthermore, to avoid the exhaustive pairwise matching of all Q-K pairs in standard dot-product attention, the linear attention mechanism [46] adopts a scaling strategy based on the mean of the keys, constructing a kernel function-based attention factor. This approach requires only a single average computation over K and a weighted multiplication to complete the entire attention process, thus greatly reducing computational complexity. Finally, the outputs of the three branches are concatenated and then passed through a channel adjustment for weighted fusion to produce the spatial feature, further enhancing the spatial feature representation.
Here, the non-negative mapping is implemented through the ELU activation function. Two linear transformation matrices generate the Query and Key, respectively, from the input feature at the i-th position. RoPE [47] refers to the Rotary Position Embedding, which introduces relative positional information into the features and is applied to the Query and Key. “mean” refers to the mean of all Keys across positions, from which the scaling factor for the attention output is computed. The attention in each branch is the linear attention [46] operation, and R refers to the reshape operation. The bottleneck uses a 1 × 1 convolution, and each branch uses an n × n depthwise separable convolution, where n is 3, 5, or 7. S represents the SiLU activation function, and “LN” represents layer normalization. “cat” refers to feature concatenation of the attention outputs from the different scale branches. “BS” represents batch normalization followed by the activation function, and the symbol * denotes element-wise multiplication.
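To make the mean-of-keys scaling concrete, the following NumPy sketch implements a generic normalized linear attention. It rests on several assumptions: the non-negative map is taken to be ELU(x) + 1 (plain ELU can be negative), RoPE is omitted for brevity, and the names W_q, W_k, and V are hypothetical. It is meant to show why no N × N attention matrix is needed, not to reproduce the exact formulation of [46].

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(X, W_q, W_k, V, eps=1e-6):
    """X: (N, C) tokens, W_q and W_k: (C, d), V: (N, Cv). RoPE is omitted."""
    Q = elu_plus_one(X @ W_q)       # (N, d)
    K = elu_plus_one(X @ W_k)       # (N, d)
    k_mean = K.mean(axis=0)         # single average over all keys, (d,)
    z = 1.0 / (Q @ k_mean + eps)    # per-query scaling factor, (N,)
    kv = K.T @ V / K.shape[0]       # (d, Cv), built once and shared by all queries
    return (Q @ kv) * z[:, None]    # (N, Cv), cost linear in N
```

With eps set to zero, this output matches the quadratic form in which the matrix of Q-K inner products is row-normalized and multiplied by V, so the linear version trades no accuracy for its lower cost under these assumptions.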
In the frequency domain, we first apply a two-dimensional Fast Fourier Transform (FFT) [48] to the input features, converting them from the spatial domain into a frequency domain representation, denoted as Tepx. Tepx is a complex tensor that contains both real and imaginary components. We perform convolution, normalization, and ReLU activation on the real part to enhance its nonlinear representational capacity. Subsequently, a sigmoid function is used to generate frequency domain weights. These weights are multiplied by the original complex spectrum to emphasize the most salient frequency components. This process increases the model’s sensitivity to frequency information and enhances its ability to capture fine-grained feature details. Next, we perform an inverse Fast Fourier Transform (IFFT) [49] on the modulated frequency domain features to convert them back to the spatial domain. Finally, we compute the magnitude, normalize it, and apply a nonlinear activation function to obtain the enhanced frequency domain feature. Here, “CBR” refers to the combination of convolution, batch normalization, and the ReLU activation function applied to the real part of the spectrum.
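A minimal NumPy sketch of the frequency branch is given below. It is a simplification under stated assumptions: the convolution in CBR is modeled as a 1 × 1 channel-mixing matrix (w_conv is a hypothetical name), batch normalization is replaced by a global standardization, and the final normalization and activation after the magnitude are omitted.

```python
import numpy as np

def frequency_branch(X, w_conv):
    """X: (C, H, W) real-valued features. w_conv: (C, C) stand-in for the conv."""
    spec = np.fft.fft2(X, axes=(-2, -1))             # complex spectrum (Tepx)
    r = np.einsum('oc,chw->ohw', w_conv, spec.real)  # conv on the real part
    r = (r - r.mean()) / (r.std() + 1e-6)            # crude stand-in for BN
    r = np.maximum(r, 0.0)                           # ReLU
    gate = 1.0 / (1.0 + np.exp(-r))                  # sigmoid frequency weights
    modulated = spec * gate                          # reweight the complex spectrum
    out = np.fft.ifft2(modulated, axes=(-2, -1))     # back to the spatial domain
    return np.abs(out)                               # magnitude
```

In this sketch the gate lies in [0.5, 1) because it follows a ReLU, so every frequency is retained to some degree while salient components are emphasized. With a zero convolution the gate is uniformly 0.5 and the output is simply half the magnitude of the input.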
The spatial and frequency domain features are fused via element-wise addition, which enhances the model’s ability to capture multi-scale feature information and texture details. Furthermore, to reinforce information integrity during feature propagation, we introduce a residual structure that directly adds the input features to the fused output, yielding the output feature of the SFAM module. In other words, the output is the sum of the spatial feature, the frequency feature, and the input feature. This strategy alleviates potential gradient vanishing and semantic dilution in deep networks and improves the stability and continuity of the feature representation. Through the dual-path fusion and residual design described above, our approach strengthens both the global context modeling and the local perceptual capability of the feature representations.
3.4. Ship Panoptic Segmentation Dataset (SPSD)
Currently, in the field of maritime panoptic segmentation, there is no publicly available dataset that offers both sufficient data volume and a diverse range of ship categories. To address this limitation, we construct a dedicated panoptic segmentation dataset that includes multiple scenes and multiple ship categories. The dataset covers diverse aquatic environments such as offshore waters, inland rivers, and ports, and contains a variety of fine-grained ship classes, making it more suitable for real-world ship applications than existing public datasets. In terms of data collection, 60% of the images are obtained from online sources, while the remaining 40% are captured from real-world scenes. To apply the ship dataset to panoptic segmentation tasks, we manually annotate the samples using the LabelMe tool. Panoptic segmentation requires labeling every pixel in an image, covering both the background regions and a precise segmentation of each object instance, which involves determining the number, category, and mask of every target. To avoid overlap or omission of pixels at the boundaries between object instances and background regions, we adopt a layered annotation strategy. Specifically, background regions such as sea, sky, and land are annotated first; at this stage, the background masks may cover foreground objects that have not yet been annotated. After completing the background annotation, each ship instance is then precisely labeled. This annotation process is highly labor-intensive and took approximately 1000 h to complete. Finally, all annotated images are converted into the COCO panoptic standard format.
In this study, all ship images are carefully annotated and categorized. Based on the characteristics of the primary ship types present in each image, the dataset is ultimately divided into eight representative categories. These categories include common civilian vessels such as sail boat and speed boat, as well as specialized ship types such as warship and rescue boat. This diverse categorization scheme meets the application requirements of various maritime scenarios. In addition, to support comprehensive panoptic segmentation, the background environment is broadly classified into three typical categories: sea, sky, and land. For better visualization and understanding, representative examples of the dataset are illustrated in Figure 4. Specifically, the first through eighth rows correspond to the eight ship categories: cargo ship, warship, passenger ship, cruise, speed boat, sail boat, small boat, and rescue boat.
3.5. Loss Function
Our algorithm employs the same loss functions as PEM [26], including the weighted binary cross-entropy (WBCE) loss and the Intersection over Union (IoU) loss.
The binary cross-entropy loss is widely used in binary classification tasks. For tasks with imbalanced class distributions, the weighted binary cross-entropy (WBCE) loss can enhance the model’s ability to identify minority class samples. Unlike the standard binary cross-entropy loss, the WBCE loss incorporates class weights into the loss calculation, thereby adjusting the contribution of each class to the overall loss. The loss is computed over N samples from the ground truth label and the predicted probability of each sample. One weight is assigned to the background class (stuff) and another to the foreground class (things); these two weights adjust the contributions of the background and foreground classes to the loss, respectively. By setting the class weights appropriately, the impact of class imbalance can be mitigated, improving the overall performance of the panoptic segmentation model. Since panoptic segmentation is a pixel-level task, the background class typically occupies a large portion of the image and therefore contributes significantly more sample pixels than the foreground class. To ensure the model focuses more on the foreground class during training, we set the weight of each class to the inverse of its pixel count, that is, the inverse of the total number of background pixels in the entire training set for the background class (stuff) and the inverse of the total number of foreground pixels in the entire training set for the foreground class (things).
Additionally, to prevent the weight values from becoming excessively small, we normalize these weights. This normalization ensures that the contributions of the foreground and background classes to the loss function are more balanced.
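The weighting scheme can be illustrated with a short NumPy sketch. The exact normalization is not recoverable from the text, so dividing each inverse pixel count by the sum of the two (so that the weights sum to one) is an assumption, as are the function names and the use of a mean over samples.

```python
import numpy as np

def class_weights(n_bg, n_fg):
    """Inverse-pixel-count weights, normalized to sum to one (assumed scheme)."""
    w_bg, w_fg = 1.0 / n_bg, 1.0 / n_fg
    total = w_bg + w_fg
    return w_bg / total, w_fg / total

def weighted_bce(y, p, w_bg, w_fg, eps=1e-7):
    """Weighted binary cross-entropy averaged over all samples.

    y: ground truth labels in {0, 1} (1 = foreground, 0 = background).
    p: predicted foreground probabilities.
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    loss = -(w_fg * y * np.log(p) + w_bg * (1.0 - y) * np.log(1.0 - p))
    return loss.mean()
```

For example, with 900 background pixels and 100 foreground pixels, the normalized weights are 0.1 for the background and 0.9 for the foreground, so an error on a ship pixel costs nine times as much as the same error on a background pixel.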
The IoU loss function is commonly used in image segmentation tasks to measure the degree of overlap between the predicted region and the ground truth region. The IoU metric is defined as the ratio of the intersection to the union of the predicted region P and the ground truth region G, and the corresponding IoU loss is derived from this metric. We combine the WBCE loss and the IoU loss to form the total loss for the model.
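The IoU term can be sketched as follows, using the common soft formulation that reduces to the set-based IoU for binary masks. Defining the loss as one minus the IoU and combining the two losses by a sum with a hypothetical balancing factor lam are assumptions, since the original formulas are not available here.

```python
import numpy as np

def soft_iou(P, G, eps=1e-7):
    """IoU between prediction P and ground truth G (values in [0, 1])."""
    intersection = (P * G).sum()
    union = P.sum() + G.sum() - intersection
    return intersection / (union + eps)

def iou_loss(P, G):
    # Assumed form: perfect overlap gives a loss of zero.
    return 1.0 - soft_iou(P, G)

def total_loss(l_wbce, l_iou, lam=1.0):
    # lam is a hypothetical balancing factor; lam = 1 gives a plain sum.
    return l_wbce + lam * l_iou
```

For the masks P = [1, 1, 0, 0] and G = [0, 1, 1, 0], the intersection contains one pixel and the union three, so the IoU is 1/3 and the IoU loss is 2/3.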