Article

SwinInsSeg: An Improved SOLOv2 Model Using the Swin Transformer and a Multi-Kernel Attention Module for Ship Instance Segmentation

Rabi Sharma, Muhammad Saqib, Chin-Teng Lin and Michael Blumenstein
School of Computer Science, University of Technology Sydney, Broadway, Ultimo, NSW 2007, Australia
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(1), 165; https://doi.org/10.3390/math13010165
Submission received: 29 September 2024 / Revised: 31 December 2024 / Accepted: 3 January 2025 / Published: 5 January 2025
(This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision)

Abstract

Maritime surveillance is essential for ensuring security in complex marine environments. Existing instance segmentation models struggle to segment multiscale ships and to produce accurate segmentation boundaries. This study presents SwinInsSeg, an instance segmentation model that combines the Swin transformer with a lightweight multi-kernel attention (MKA) module to segment ships accurately and efficiently for maritime surveillance. SwinInsSeg addresses these limitations by identifying ships of various sizes and capturing finer details for both small and large vessels through the MKA module, which emphasizes important information at different processing stages. Performance evaluations on the MariBoats and ShipInsSeg datasets show that SwinInsSeg outperforms YOLACT, SOLO, and SOLOv2, achieving mask average precision scores of 50.6% and 52.0%, respectively. These results demonstrate SwinInsSeg’s superior capability in segmenting ship instances with improved accuracy.

1. Introduction

In the 21st century, the sea has taken on a crucial role in national sovereignty, security, and development. As a result, to effectively manage ocean traffic and safeguard coastal and maritime areas, the deployment of offshore and onshore surveillance systems has become increasingly important for the armed forces and civilian sectors. The traditional approach to marine surveillance relied heavily on real-time human monitoring, a method that was often inefficient, expensive, and prone to missed inspections [1]. However, the development of marine surveillance systems using computer vision has largely replaced manual monitoring and is now widely used for managing maritime traffic and conducting surveillance [2,3,4,5]. Ship detection algorithms, as explored by Nalamati et al. (2020, 2022), Park et al. (2022), and Xing et al. (2023) [6,7,8,9], utilize bounding boxes to identify various features. Still, they often struggle to provide accurate details of ship edges due to the inclusion of unnecessary background pixels. In contrast, instance segmentation methods excel at distinguishing ships of different sizes, which is essential for the three-dimensional reconstruction of marine scenes. This reconstruction is vital for understanding marine traffic and aiding in visual navigation. Therefore, instance segmentation significantly improves the safety and precision of marine systems, enhancing their navigation and surveillance capabilities. From the literature, we have identified that published articles on marine ships mainly fall into four categories: satellite remote sensing images [10,11,12,13], synthetic aperture radar images [14,15,16], infrared images [17,18], and visible-light images [3,4,5,19,20,21]. Each system has unique benefits and drawbacks, making it suitable for specific situations:
  • Satellite Remote Sensing Images: These images cover large areas at a macroscopic level, but their quality and clarity can be poor, leading to misjudgment.
  • Synthetic Aperture Radar (SAR) Images: Although precise, SAR-based object identification algorithms are not always reliable, and their low-frequency spectral information does not align well with the human visual system.
  • Infrared Images: Infrared imagery is useful in certain scenarios, but its low resolution, significant noise, and lack of color and texture information make it difficult to interpret.
  • Visible-Light Images: Compared to the other categories, visible-light images are clearer, more detailed, more precise, and better at capturing texture, making them easier to interpret and applicable to a wider range of tasks.
Each visual perception modality therefore has benefits as well as limitations, and selecting a suitable one depends on the specific task and conditions; combining them may further improve accuracy and reliability. We use visible-light images in this research because of the advantages listed above. Due to security concerns, the availability of such datasets is limited to research and development purposes. Nonetheless, a few marine instance segmentation datasets exist, such as MariBoats and ShipInsSeg.
Most instance segmentation techniques for segmenting individual ships are built on Convolutional Neural Networks (CNNs) rather than transformer-based methods. Instance segmentation is more complex than object detection because it both localizes objects and generates semantic masks for them. CNN-based techniques rely on either two-stage or single-stage frameworks. A conventional two-stage method like Mask R-CNN [22] extends Faster R-CNN [23] with a mask branch to segment instances. Although this approach has been widely used in prior research and provides a solid foundation for accurate instance segmentation, it has certain drawbacks: its slower inference speed, longer training times, and higher memory usage make it unsuitable for marine applications. In contrast, single-stage approaches directly predict object categories and build object masks without a region proposal stage, offering a lightweight model, reduced training time, and real-time performance. Our research builds on the SOLOv2 [24] single-stage model rather than a two-stage technique to improve segmentation efficiency and accuracy.
This study uses a transformer-based technique to accurately segment ships of different sizes that are captured in distant marine images. We present the SwinInsSeg model, a single-stage ship instance segmentation framework that leverages the Swin transformer’s ability to efficiently learn complicated features in challenging and dynamic maritime environments. Additionally, feature pyramid networks (FPNs) struggle to effectively extract features across diverse marine environments due to limitations in handling significant variations in object sizes and scales. To enhance performance and feature extraction, the model incorporates a multi-kernel attention (MKA) module, which captures multiscale features using attention mechanisms with different kernel sizes. This study shows that SwinInsSeg outperforms existing methods in ship instance segmentation.
Our main contributions are as follows:
  • SwinInsSeg significantly improves the accuracy of segmentation, providing precise segmentation of ships of various sizes.
  • Our novel MKA module uses attention techniques with different kernel sizes to dynamically extract multiscale features, resulting in feature enhancement. This method enables the model to refine and capture different objects more accurately.
The sections of this paper are arranged as follows. Section 2 reviews instance segmentation, transformer-based approaches, maritime surveillance, and their applications. The proposed method is described in Section 3. Section 4 covers the comparison and ablation experiments that demonstrate the effectiveness of our method. Section 5 discusses the model’s strengths and limitations, and Section 6 presents the conclusions.

2. Related Works

2.1. Instance Segmentation Techniques

One of the most difficult computer vision tasks is instance segmentation. This task involves detecting object coordinates and generating exact instance masks, which goes beyond basic object detection techniques [23,25]. Instance segmentation methods are typically split into two types: two-stage and single-stage methods. Mask R-CNN [22] is a well-known anchor-based instance segmentation algorithm that operates in two stages: it first identifies regions of interest (ROIs) and then classifies and segments them. Two-stage approaches limit real-time performance, particularly with smaller input images, because features are re-pooled for every ROI and then processed by subsequent computations. While two-stage methods produce accurate bounding boxes, they can fail to distinguish overlapping objects and to achieve precise segmentation. To further improve instance segmentation accuracy, several other methods [26,27,28] group pixels into an arbitrary number of object instances.
One-stage/single-stage anchor-free object detectors [29,30,31] underpin instance segmentation methods that quickly obtain an object’s location, category, and instance mask. Current real-time techniques, like YOLACT [32], accomplish instance segmentation with two concurrent tasks: they first generate prototype masks across the entire image without being restricted to a specific location and then predict coefficients that linearly combine the prototypes for each instance. Fast NMS replaces standard non-maximum suppression to enhance real-time segmentation performance. SOLO [33] is another real-time method that uses an object’s center and size to predict object instances and categories instead of extracting bounding boxes. It divides the image into a grid and, for each grid cell containing an object center, predicts the object’s semantic category and instance mask. SOLOv2 [24] introduced several improvements over SOLO: it decouples mask prediction into convolutional kernel learning and feature learning, dynamically segmenting objects by position. Moreover, its matrix NMS technique enhances the AP and reduces redundant predictions; a sketch of matrix NMS is given below. Because of its dynamic capabilities and single-stage design, we chose SOLOv2 [24] for ship instance segmentation.
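Since matrix NMS is central to SOLOv2’s efficiency, we illustrate it with a short sketch adapted from the pseudocode in the SOLOv2 paper [24] (Gaussian decay variant); the tensor names and the toy example are ours, not part of the original implementation.

```python
# Matrix NMS (SOLOv2 [24]): instead of sequentially suppressing overlapping masks,
# every confidence score is decayed in parallel according to the mask's IoU with
# higher-scoring masks. Gaussian decay variant.
import torch

def matrix_nms(scores: torch.Tensor, masks: torch.Tensor, sigma: float = 0.5) -> torch.Tensor:
    """scores: (N,) sorted in descending order; masks: (N, H, W) binary masks."""
    n = scores.numel()
    flat = masks.reshape(n, -1).float()
    inter = flat @ flat.T                                        # pairwise intersections
    areas = flat.sum(dim=1).expand(n, n)
    iou = (inter / (areas + areas.T - inter)).triu(diagonal=1)   # IoU with higher-scored masks only
    iou_cmax = iou.max(dim=0).values.expand(n, n).T              # largest overlap each mask already has
    decay = torch.exp(-(iou ** 2 - iou_cmax ** 2) / sigma)       # Gaussian decay factor
    return scores * decay.min(dim=0).values                      # decayed confidence scores

# toy example: masks 0 and 1 overlap heavily, so mask 1's score is strongly decayed
scores = torch.tensor([0.9, 0.8, 0.6])
masks = torch.zeros(3, 64, 64)
masks[0, :32], masks[1, :40], masks[2, 40:] = 1, 1, 1
print(matrix_nms(scores, masks))   # approximately [0.90, 0.22, 0.60]
```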

2.2. Transformer-Based Techniques

Transformers, originally developed for natural language processing (NLP), have gained widespread use in computer vision tasks. The vision transformer (ViT) [34] adopted the notion of separating images into patches, linearly embedding them, and processing them as sequences. This method achieved state-of-the-art (SOTA) performance on image classification benchmarks. Wang et al. [35] introduced the pyramid vision transformer (PVT), which outperformed ResNet [36] in object detection and semantic segmentation because of its hierarchical pyramid structure. Cheng et al. [37] used masked attention to enhance transformer capabilities for panoptic, instance, and semantic segmentation. The Swin transformer [38], a hierarchical ViT variant, uses non-overlapping 4 × 4 patches and a shifted-window technique for efficient global and local feature extraction. Its architecture improves computational efficiency and feature extraction by confining self-attention computations to local windows and allowing cross-window interactions via a hierarchical design. Inspired by the Swin transformer’s hierarchical shifted-window mechanism, we adopted it as the backbone of our SwinInsSeg model. Its ability to effectively capture both local and global contextual features makes it a strong choice for ship instance segmentation in complex maritime environments. This approach ensures that the extracted features are both computationally efficient and contextually rich, thereby improving segmentation accuracy while handling challenges like dynamic backgrounds.

2.3. Maritime Surveillance and Applications

Marine surveillance is an active focus of current computer vision research. The need to combat smuggling, illegal immigration, and attacks on marine vessels has grown in recent years. Traditional remote sensing [39,40] requires observatory and communications equipment to be installed, and this costly equipment may be deliberately damaged or switched off. Rather than relying on vision technology, such systems depend significantly on human involvement and monitoring. Employing vision technology in smart surveillance systems can improve safety and cost-efficiency, and it can help safeguard a nation’s interests and vital military data.
Object detection technology has been developed to address the challenges of the marine environment, with applications such as shark identification [41], population estimation [42], and defense against drone threats [43]. Combining MobileNetV2 with SSD for detection improved accuracy to 92% at a frame rate of 15 fps [44]. Deep learning methods have shown remarkable achievements for object detection in terms of speed and accuracy. Additionally, transformer-based intruder detection methods have been applied to complex maritime scenes [7].
Recently, there has been limited research on marine surveillance using instance segmentation techniques. In [4], a global mask head is used to create instance masks for maritime ships based on global and semantic information. Combining global and local attention preserves semantic and global information, improving segmentation [5]. The authors of [45] employed a two-stage process to segment ships by distinguishing between fog-affected and fog-free sea images. In [46], interference reduction and dynamic contour learning were used to segment ship instances in foggy scenes, together with the generation of synthetic foggy images.

3. Proposed Method

We describe the SwinInsSeg model in this section. The model loss function and our MKA module are also explained.

3.1. Overall Architecture

The proposed architecture, which builds upon the SOLOv2 model [24], is shown in Figure 1 and comprises three main components: the backbone, the feature pyramid network (FPN) known as the neck, and the mask head. Our proposed architecture uses Swin-T as the backbone for feature extraction, which results in higher inference speeds and lower computational costs compared to other backbones like Swin-B [38] and Swin-L [38]. The Swin transformer backbone, shown in Figure 1 highlighted with a red-dashed rectangular box, includes a patch partition block and comprises four stages. Initially, the input is an RGB image (i.e., three channels) split into 4 × 4 patches using the patch partition block and flattened along the channel dimensions. Patch partitioning reduces the input image dimensions from H × W × 3 to H/4 × W/4 × 48, significantly minimizing computational costs and memory usage. After patch partitioning, in stage 1, the linear embedding module linearly transforms the data from each pixel channel into H/4 × W/4 × C and feeds them into the Swin transformer block as its input.
The Swin transformer block, as illustrated in Figure 2, comprises two sequential blocks that form the core of the Swin transformer backbone. These blocks include two important modules: W-MSA and SW-MSA. The W-MSA module, or window-based multi-head self-attention, partitions the image into non-overlapping windows and calculates self-attention within each window. The SW-MSA module, known as shifted-window multi-head self-attention, shifts the window partition of layer l by half the window size at layer l + 1, allowing information from different windows to interact. In addition, the Swin transformer block features two Multi-Layer Perceptron (MLP) modules and two LayerNorm (LN) modules. A LayerNorm (LN) layer precedes each MSA module and each MLP, with a residual connection following each module. The incorporation of the W-MSA and SW-MSA modules facilitates the flow of information across windows. The transformer block computations are given in Equation (1):
$$\begin{aligned}
\hat{X}^{l} &= \text{W-MSA}(\text{LN}(X^{l-1})) + X^{l-1},\\
X^{l} &= \text{MLP}(\text{LN}(\hat{X}^{l})) + \hat{X}^{l},\\
\hat{X}^{l+1} &= \text{SW-MSA}(\text{LN}(X^{l})) + X^{l},\\
X^{l+1} &= \text{MLP}(\text{LN}(\hat{X}^{l+1})) + \hat{X}^{l+1}
\end{aligned} \tag{1}$$
where $\hat{X}^{l}$ and $X^{l}$ represent the output features of the W-MSA and MLP modules for the $l$th block, respectively. Similarly, $\hat{X}^{l+1}$ and $X^{l+1}$ refer to the output features of the SW-MSA and MLP modules for the $(l+1)$th block. The structure is consistent from stages 2 to 4, each consisting of a patch merging layer and Swin transformer blocks. The patch merging layer combines four neighboring patches at a time, reducing the feature map’s height and width by half. As illustrated in Figure 1, each stage operates at a different resolution; for example, stage 1 operates at a resolution of H/4 × W/4. After the patch merging layer in stage 2, the resolution becomes H/8 × W/8, and this pattern continues, with stage 3 at H/16 × W/16 and the final stage at H/32 × W/32. Although the resolution decreases at each stage, the number of channels, denoted by C, increases to capture deeper and more complex relationships across stages. In the top-down pathway, the outputs of the four stages, with channel dimensions $C_i$ of 96, 192, 384, and 768, are processed through the proposed multi-kernel attention (MKA) module. This module is designed to enhance the model’s feature representation capabilities, particularly for instance segmentation. The refined features are then forwarded to the neck, or feature pyramid network (FPN), to improve the detection of objects of different sizes. The aggregated feature maps, carrying information from the various FPN levels, are directed to the mask head. This architecture can create instance masks without relying on bounding boxes, facilitating precise segmentation.
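As a concrete illustration of Equation (1), the following PyTorch sketch implements one W-MSA/SW-MSA block pair. It is a simplified reading of the Swin design rather than the official implementation: the relative position bias and the attention mask normally applied to shifted windows are omitted, and the class and variable names are our own.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split (B, H, W, C) feature maps into (num_windows*B, ws*ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlockPair(nn.Module):
    """One W-MSA block followed by one SW-MSA block, as in Equation (1)."""
    def __init__(self, dim, heads=3, ws=7):
        super().__init__()
        self.ws = ws
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _attend(self, x, attn, shifted):
        B, H, W, C = x.shape
        s = self.ws // 2
        if shifted:                      # cyclic shift by half a window before SW-MSA
            x = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        win = window_partition(x, self.ws)
        out, _ = attn(win, win, win)     # self-attention inside each window
        x = window_reverse(out, self.ws, H, W)
        if shifted:                      # undo the cyclic shift
            x = torch.roll(x, shifts=(s, s), dims=(1, 2))
        return x

    def forward(self, x):                # x: (B, H, W, C); H and W divisible by ws
        x = x + self._attend(self.norm1(x), self.wmsa, shifted=False)   # X_hat^l
        x = x + self.mlp1(self.norm2(x))                                # X^l
        x = x + self._attend(self.norm3(x), self.swmsa, shifted=True)   # X_hat^(l+1)
        x = x + self.mlp2(self.norm4(x))                                # X^(l+1)
        return x

feats = torch.randn(1, 56, 56, 96)             # stage-1 resolution H/4 x W/4 with C = 96
print(SwinBlockPair(96)(feats).shape)          # torch.Size([1, 56, 56, 96])
```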
The mask head is divided into an object category branch and a kernel branch. These branches use the input feature maps to make predictions. Specifically, the object category branch predicts the categories of objects, while the kernel branch generates dynamic kernels. Meanwhile, the feature branch processes mask features, using multilevel feature maps for representation. In the final step, the kernel branch and feature branch are convolved together to produce the instance mask, effectively combining their outputs.
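The data flow described above can be summarized with the following schematic sketch. The backbone, the MKA modules, and the head convolutions are stand-ins with assumed interfaces (their names, the grid size S, and the kernel depth D are our choices following the SOLOv2 design), so this shows the wiring of the architecture rather than the authors’ implementation.

```python
from collections import OrderedDict
import torch
import torch.nn as nn
from torchvision.ops import FeaturePyramidNetwork

class SwinInsSegSketch(nn.Module):
    """Backbone -> MKA per stage -> FPN -> category/kernel/mask-feature branches -> dynamic conv."""
    def __init__(self, backbone, mka_modules, num_classes=1, S=40, E=256, D=256):
        super().__init__()
        self.backbone, self.mka = backbone, nn.ModuleList(mka_modules)
        self.fpn = FeaturePyramidNetwork([256, 256, 256, 256], 256)
        self.cate_head = nn.Conv2d(256, num_classes, 3, padding=1)   # object category branch
        self.kernel_head = nn.Conv2d(256, D, 3, padding=1)           # dynamic-kernel branch
        self.mask_feat_head = nn.Conv2d(256, E, 3, padding=1)        # unified mask-feature branch
        self.S = S

    def forward(self, images):
        c1, c2, c3, c4 = self.backbone(images)                       # stage outputs: 96/192/384/768 channels
        refined = [m(c) for m, c in zip(self.mka, (c1, c2, c3, c4))] # MKA refinement after every stage
        pyramid = self.fpn(OrderedDict(zip("abcd", refined)))        # multiscale features, 256 channels each
        levels = list(pyramid.values())
        grid = nn.functional.interpolate(levels[-1], size=(self.S, self.S))
        cate = self.cate_head(grid)                                  # S x S grid of category scores
        kernels = self.kernel_head(grid)                             # one D-dimensional kernel per grid cell
        mask_feats = self.mask_feat_head(levels[0])                  # high-resolution mask features
        # dynamic convolution: each predicted kernel acts as a 1x1 conv over the mask features
        masks = torch.einsum("bdhw,bdxy->bhwxy", kernels, mask_feats)
        return cate, masks

# shape check with dummy stand-ins for the Swin-T backbone and the MKA modules
class DummyBackbone(nn.Module):
    def forward(self, x):
        B, _, H, W = x.shape
        return [torch.randn(B, c, H // s, W // s) for c, s in zip((96, 192, 384, 768), (4, 8, 16, 32))]

dummy_mka = [nn.Conv2d(c, 256, 1) for c in (96, 192, 384, 768)]      # placeholder for the MKA modules
cate, masks = SwinInsSegSketch(DummyBackbone(), dummy_mka)(torch.randn(1, 3, 224, 224))
print(cate.shape, masks.shape)   # (1, 1, 40, 40) and (1, 40, 40, 56, 56)
```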

3.2. Multi-Kernel Attention Module

To address the challenges posed by object scale variability in maritime scenarios, we propose an effective and novel multi-kernel attention (MKA) module for segmenting objects of varying sizes within such environments. The stages of the feature pyramid network (FPN) are characterized by their distinct receptive fields and semantic depths. However, the original FPN has a limitation in that it only considers the global features of adjacent levels, which may not be adequate due to object size variations at different levels. Our MKA technique overcomes this limitation by using convolutional layers with diverse kernel sizes to effectively capture changes in object size across multiple scales. The MKA module consists of vertical branches, with each branch using a specific kernel size to capture features at various scales. Incorporating the attention mechanism, specifically CBAM [47], enhances the focus on important channels and spatial regions after applying different kernel sizes, thus refining both local and global features. The fusion of output features from all branches allows our model to recognize the diverse sizes of objects found in marine settings, thereby enhancing instance segmentation precision.
The MKA module, as illustrated in Figure 3, incorporates a residual structure and a multi-kernel design to enhance feature learning. Initially, the backbone’s output features are processed through 1 × 1 convolutional layers to reduce the channel size to 256, preparing them for the MKA module. This module has four convolutional layers with kernel sizes of 1, 3, 5, and 7 for multiscale feature extraction. Each kernel size is designed to capture various spatial patterns and levels of information in the feature maps. For instance, the 1 × 1 and 3 × 3 kernels are effective at extracting fine-grained features from smaller objects, while the larger kernels, 5 × 5 and 7 × 7, are better at capturing more global patterns and larger structures. Skip connections are used to avoid the vanishing gradient problem. The feature maps generated by each kernel size are then processed through CBAM [47], allowing the network to focus on the most informative features, thereby improving segmentation accuracy, especially in cluttered marine environments. Within the CBAM block, highlighted by a red rectangular box, are two main branches: channel attention and spatial attention. The channel attention branch generates an attention map that captures inter-channel relationships, focusing on the important channels for each kernel size. This process involves deriving global features (of dimensions 1 × 256) via global average-pooling and max-pooling layers, denoted as $\text{Feature}^{c}_{\max}$ and $\text{Feature}^{c}_{\text{avg}}$. These are then fed into a shared network with a single hidden layer that produces a channel attention map $N_c \in \mathbb{R}^{c \times 1 \times 1}$, where c represents the number of channels. A reduction ratio r is applied to reduce the parameter count, giving hidden features in $\mathbb{R}^{c/r \times 1 \times 1}$. Element-wise addition is used to merge the resulting feature maps. The computation of channel attention is given in Equation (2):
$$N_c(F) = \sigma\big(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F))\big) = \sigma\big(W_{E1}(W_{E0}(\text{Feature}^{c}_{\max})) + W_{E1}(W_{E0}(\text{Feature}^{c}_{\text{avg}}))\big) \tag{2}$$
where $\sigma$ represents the sigmoid function, and $W_{E1} \in \mathbb{R}^{c/r \times 1 \times 1}$ and $W_{E0} \in \mathbb{R}^{c/r \times 1 \times 1}$ are the MLP weights.
Spatial attention is similar to channel attention in that a spatial attention map is generated, here using the spatial relationships of features. These features are computed using global max-pooling and average-pooling layers, represented by $\text{Feature}^{s}_{\max} \in \mathbb{R}^{1 \times H \times W}$ and $\text{Feature}^{s}_{\text{avg}} \in \mathbb{R}^{1 \times H \times W}$, along the channel dimension and concatenated. We use a 7 × 7 convolutional layer to reduce the dimension and generate the spatial attention map $N_s(\text{Feature}) \in \mathbb{R}^{H \times W}$. Equation (3) is used to calculate spatial attention:
$$N_s(F) = \sigma\big(\text{Kernel}_{7\times 7}([\text{AvgPool}(F); \text{MaxPool}(F)])\big) = \sigma\big(\text{Kernel}_{7\times 7}([\text{Feature}^{s}_{\max}; \text{Feature}^{s}_{\text{avg}}])\big) \tag{3}$$
where $\sigma$ denotes the sigmoid function and $\text{Kernel}_{7\times 7}$ denotes a convolutional layer with a 7 × 7 kernel.
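The following sketch pulls Equations (2) and (3) together into a CBAM block and wires it into the MKA module as we read Figure 3: a 1 × 1 channel reduction, four parallel branches with 1 × 1, 3 × 3, 5 × 5, and 7 × 7 kernels, CBAM refinement of each branch, concatenation with a 1 × 1 fusion convolution, and a skip connection. The fusion layer and the reduction ratio r = 16 are assumptions, as this is our illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Equation (3): channel-wise avg/max pooling, 7x7 conv, sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)                 # Feature_avg^s: (B, 1, H, W)
        mx, _ = torch.max(x, dim=1, keepdim=True)                # Feature_max^s: (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention (Equation (2)) followed by spatial attention (Equation (3))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                                # shared MLP: W_E0 then W_E1
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sa = SpatialAttention()

    def forward(self, x):
        ca = torch.sigmoid(self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
                           + self.mlp(torch.amax(x, dim=(2, 3), keepdim=True)))
        x = x * ca                                               # reweight channels
        return x * self.sa(x)                                    # reweight spatial positions

class MKA(nn.Module):
    """Multi-kernel attention: per-stage feature refinement before the FPN."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, 1)    # 1x1 reduction to 256 channels
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(out_channels, out_channels, k, padding=k // 2), CBAM(out_channels))
            for k in (1, 3, 5, 7)                                # multiscale kernel sizes
        ])
        self.fuse = nn.Conv2d(4 * out_channels, out_channels, 1) # concatenate branches and fuse

    def forward(self, x):
        x = self.reduce(x)
        fused = self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))
        return fused + x                                         # skip connection

feat = torch.randn(1, 96, 56, 56)        # stage-1 output of the Swin-T backbone
print(MKA(96)(feat).shape)               # torch.Size([1, 256, 56, 56])
```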

3.3. Loss Function

We employ the loss function from the original SOLOv2 model for object categorization and instance segmentation tasks. We utilize the focal loss function in backpropagation for category loss to address the imbalance between background and foreground classes. Furthermore, in the case of instance segmentation, we employ the dice loss to enhance the accuracy of mask prediction. The focal loss is given in Equation (4):
$$\text{Focal}_{\text{loss}}(C_p) = -\alpha_t (1 - C_p)^{\gamma} \log(C_p) \tag{4}$$
where $C_p$ represents the probability of the class. The balancing and focusing factors are denoted by $\alpha_t$ and $\gamma$, respectively, with $\alpha_t$ set to 0.25 and $\gamma$ set to 2.0.
The dice loss is used to measure the similarity between the segmented pixels and the ground-truth mask, thereby improving the accuracy of the segmented foreground area. The dice loss is given in Equation (5):
$$\text{DiceLoss} = 1 - \frac{2 \sum_{x=1}^{W} \sum_{y=1}^{H} (i_{x,y} \times j_{x,y})}{\sum_{x=1}^{W} \sum_{y=1}^{H} i_{x,y}^{2} + \sum_{x=1}^{W} \sum_{y=1}^{H} j_{x,y}^{2}} \tag{5}$$
where $i_{x,y} \in [0, 1]$ represents the predicted segmentation probability at pixel coordinate $(x, y)$ and $j_{x,y} \in [0, 1]$ represents the binary ground-truth mask value at pixel coordinate $(x, y)$. W and H denote the width and height of the image, respectively, and $\times$ denotes element-wise multiplication. The summations are performed over all pixel coordinates in the image.
The complete training loss function for our network is given in Equation (6):
$$\text{TrainingLoss} = \text{Loss}_{\text{cate}} + \lambda \, \text{Loss}_{\text{mask}} \tag{6}$$
where $\text{Loss}_{\text{cate}}$ and $\text{Loss}_{\text{mask}}$ refer to the object classification loss and mask prediction loss, respectively, with $\lambda$ being a coefficient set to 3. The category score at each grid location is calculated by processing the input image through the backbone network and FPN inference, and the matrix NMS technique is then used to eliminate redundant predictions.
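A compact sketch of Equations (4)–(6) is given below; it follows the formulas directly rather than the MMDetection implementation, and the function and variable names are ours.

```python
import torch

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Equation (4): p holds predicted probabilities in (0, 1), target holds binary labels."""
    p_t = p * target + (1 - p) * (1 - target)                   # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)       # balancing factor per element
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-6))).mean()

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Equation (5): pred_mask holds probabilities, gt_mask the binary ground-truth mask."""
    inter = (pred_mask * gt_mask).sum()
    denom = (pred_mask ** 2).sum() + (gt_mask ** 2).sum()
    return 1 - 2 * inter / (denom + eps)

def training_loss(cate_prob, cate_gt, mask_prob, mask_gt, lam=3.0):
    """Equation (6): category loss plus lambda-weighted mask loss."""
    return focal_loss(cate_prob, cate_gt) + lam * dice_loss(mask_prob, mask_gt)

cate_prob, cate_gt = torch.rand(40, 40), torch.randint(0, 2, (40, 40)).float()
mask_prob, mask_gt = torch.rand(128, 128), torch.randint(0, 2, (128, 128)).float()
print(training_loss(cate_prob, cate_gt, mask_prob, mask_gt))
```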

4. Experiments

This section presents a comprehensive performance evaluation of our proposed model for instance segmentation. First, we discuss the datasets used for our experiments and outline the experimental configurations, including the setup and hyperparameters. Then, we discuss the evaluation criteria used to measure the model’s performance. Furthermore, we conducted ablation studies to isolate and understand the impact of various components and configurations on overall performance. Finally, we present the visual results for the qualitative assessment of our model.

4.1. Datasets and Metrics

Two datasets of marine ships, MariBoats [5] and ShipInsSeg [48], were used. A brief description of each dataset can be found below.
MariBoats: The MariBoats dataset [5] is a collection of 6271 images of boats with 15,777 annotations. This collection comes from 13,717 images found through Google searches using specific keywords, focusing exclusively on the boat category. This dataset is publicly available and contains one class, i.e., boat. The dataset is randomly split into two sets: 80% for training and 20% for testing, with no low-quality, blurred, or redundant images. The annotations for this dataset are available in COCO format for instance segmentation.
ShipInsSeg: The ShipInsSeg dataset, which we annotated, consists of 5116 boat images curated from YouTube videos and contains a single category, i.e., boats. The dataset is complex due to environmental factors like wave activity, water reflections, crowded ships, and occlusions. The videos were collected from YouTube using a Creative Commons license, and the LabelMe tool [49] was used to label the images in polygon format for instance segmentation. The images have a resolution of 1280 × 720 and are split into 80% training and 20% testing.
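For readers who want to reproduce the 80/20 split on the COCO-format annotations, a minimal sketch is shown below; the annotation path is a placeholder, and pycocotools is assumed to be installed.

```python
import random
from pycocotools.coco import COCO

coco = COCO("annotations/ship_instances.json")     # placeholder path to the COCO-format annotations
img_ids = sorted(coco.getImgIds())
random.seed(0)
random.shuffle(img_ids)

split = int(0.8 * len(img_ids))                    # 80% training, 20% testing
train_ids, test_ids = img_ids[:split], img_ids[split:]
print(len(train_ids), len(test_ids))

anns = coco.loadAnns(coco.getAnnIds(imgIds=train_ids[:1]))
print(anns[0]["segmentation"])                     # polygon coordinates of one instance
```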

4.2. Experimental Details

We used the MMDetection [50] toolkit to implement our proposed model with the experimental settings shown in Table 1. We based our model on SOLOv2 [24] for benchmarking and comparison purposes. In our experiments, we employed a learning rate of 0.001, a weight decay of 0.0001, and a momentum of 0.9. Each model was trained for 12 epochs, with the learning rate multiplied by 0.1 at the 8th and 11th epochs to improve learning efficiency. We utilized pre-trained weights for the backbone network to accelerate training. Stochastic gradient descent (SGD) was used as the optimizer. No other MMDetection hyperparameters were adjusted in these experiments.
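The schedule above corresponds to the following MMDetection 2.x-style configuration fragment (our reconstruction of the standard 1x schedule matching the stated hyperparameters, not the authors’ released config):

```python
# schedule fragment matching Table 1 and the settings above
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[8, 11])          # multiply the learning rate by 0.1 at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
```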

4.3. Main Results

This section discusses the quantitative analysis conducted to evaluate the learning capabilities of our model on the two datasets. Its performance was compared with other single-stage models using the COCO mean average precision evaluation metrics. The COCO evaluation metrics, commonly used to evaluate instance segmentation models, include the average precision at different IoU thresholds (AP50 and AP75) and for different object sizes (APS, APM, and APL). Moreover, the average precision (AP) and average recall (AR) were calculated based on the Intersection over Union (IoU). The IoU measures how well a predicted object matches the ground-truth object by dividing the overlap area by the union area, as shown in Equation (7); precision and recall are given in Equations (8) and (9):
$$\text{IoU} = \frac{\text{intersect}(x_t, y_t)}{\text{union}(x_t, y_t)} \tag{7}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$
where $x_t$ and $y_t$ represent the predicted object and the ground-truth object, respectively. Here, TP stands for true positives, i.e., cases where the model correctly predicted the presence of an object; FP denotes false positives, i.e., cases where the model incorrectly predicted an object’s presence; and FN stands for false negatives, i.e., cases where the model failed to detect an object that was actually present. To measure segmentation performance, we computed the average precision (AP) of all categories at each IoU threshold, as shown in Equation (8), and the average recall (AR), as shown in Equation (9). Precision is calculated by dividing the number of correctly detected positive cases (true positives) by the total number of predicted positives, which includes both true and false positives. Recall is measured by dividing the number of true positives by the sum of true positives and false negatives.
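As a worked illustration of Equations (7)–(9) for binary masks, the sketch below computes the mask IoU and derives precision and recall at a single IoU threshold using a greedy matching; the matching rule and threshold are our simplifying assumptions, whereas the COCO toolkit averages over many thresholds.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Equation (7) for binary masks: intersection over union."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def precision_recall(pred_masks, gt_masks, iou_thr=0.5):
    """Equations (8) and (9) with greedy one-to-one matching at a fixed IoU threshold."""
    matched, tp = set(), 0
    for pred in pred_masks:
        ious = [mask_iou(pred, gt) for gt in gt_masks]
        best = int(np.argmax(ious)) if ious else None
        if best is not None and best not in matched and ious[best] >= iou_thr:
            matched.add(best)
            tp += 1                      # true positive: prediction matches an unclaimed ground truth
    fp = len(pred_masks) - tp            # unmatched predictions
    fn = len(gt_masks) - tp              # missed ground truths
    return tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)

gt = np.zeros((64, 64), bool)
gt[10:40, 10:40] = True
pred = np.roll(gt, 3, axis=0)            # a slightly shifted prediction
print(mask_iou(pred, gt), precision_recall([pred], [gt]))
```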
In Table 2, we present the definitions of the average precision (AP) metrics across various IoU thresholds, ranging from 0.5 to 0.95 in increments of 0.05. The AP is also divided into three categories according to object size: APS for small objects, APM for medium objects, and APL for large objects. Our experimental results report the mask AP across these categories, from APS to APL.

4.3.1. Performance Evaluation

Results on MariBoats Dataset

Our proposed method was compared with three single-stage instance segmentation models, including YOLACT, SOLO, and SOLOv2, using different backbones (ResNet-50, ResNet-101, and Swin-T) on the MariBoats dataset. For our experiments, we selected the best single-stage models to compare with our model.
Our proposed method significantly improved segmentation performance, as shown in Table 3. In particular, by using Swin-T as the backbone network together with our proposed MKA module, our method improved the mask AP by 2.2%. Furthermore, when comparing AP50 and AP75, our model demonstrated improvements of 1.2% and 3%, respectively. We observed significant enhancements in APS and APL, although APM experienced a slight decrease of 0.6% compared to the SOLO-Decoupled model. Moreover, our model achieved the highest average recall (AR) of 57.5% among the compared models, with an AR of 11.5% for small objects (ARS), 35.3% for medium objects (ARM), and 77.6% for large objects (ARL).
In Figure 4, we present the accuracy vs. epoch curves for various evaluation metrics. The graphs show an increasing trend in accuracy with the number of epochs. There is a slight drop at the fourth epoch for the mask AP, as well as for other evaluation metrics like AP75, APM, and APL, followed by a significant increase toward the 12th epoch. Under the same training setup, the learning ability of our model is better than that of the other state-of-the-art models.

Results on ShipInsSeg Dataset

The same analysis structure and experimental setup were used for the ShipInsSeg dataset, as shown in Table 4. We assessed three single-stage models with different backbones on the ShipInsSeg dataset. Our model performed significantly better than the other single-stage instance segmentation techniques, improving the mask AP to 52.0%, a gain of 2.2% over the second-best model, SOLOv2 [24] with a ResNet-101 backbone. The segmentation results obtained using our method were superior, with 77.7% for AP50 and 54.7% for AP75. Based on different object sizes, our approach performed well on medium objects, achieving an APM of 70.7%, but dropped by 1% for APS and 0.8% for APL compared with the second-best approaches. Compared to the other models, our model achieved the highest average recall (AR) of 56.2% and performed well on small, medium, and large objects, achieving 33.0% for ARS, 74.9% for ARM, and 90.3% for ARL, as seen in Table 4. The accuracy versus epoch curves for every model on the ShipInsSeg dataset are displayed in Figure 5.

4.4. Ablation Experiments

This section examines three types of ablation studies on the two datasets. First, to measure the impact of the multi-kernel attention (MKA) module on instance segmentation performance, we compared five different configurations: without the MKA module; with the MKA module after stage 1 only; after stages 1 and 2; after stages 1 through 3; and after all stages, 1 to 4. Table 5 shows the results of these ablation experiments on the MariBoats dataset [5]. We observed that adding the MKA module enhanced instance segmentation performance compared to the configuration without it. Furthermore, adding the MKA module at all stages from 1 to 4 improved the mask AP by 3.3%, alongside enhancements in the other evaluation metrics. Therefore, we added the four MKA modules at all stages for the best results. Similarly, for the ShipInsSeg dataset, as shown in Table 6, the configuration without the MKA module performed worst, and adding the MKA module at each stage enhanced segmentation performance. Integrating MKA modules across all stages resulted in a significant increase in the mask AP of 4.4%, as well as improvements in the other COCO evaluation metrics, showcasing the module’s efficacy in improving instance segmentation.
Second, we conducted an ablation study of the MKA module using different backbones, as shown in Table 7 and Table 8. For the MariBoats and ShipInsSeg datasets, we observed that using the Swin-T backbone with the MKA module improved the AP by 1.5% and 1.2%, respectively. Additionally, our method showed a significant improvement in segmenting objects of various sizes for APS, APM, and APL.
Finally, we conducted ablation studies to evaluate the performance of each component of the full MKA module on the MariBoats and ShipInsSeg datasets, as shown in Table 9 and Table 10, respectively. We observed that the MKA module with skip connections improved the AP by 3.3% and 4.4% compared to the baseline model on the MariBoats and ShipInsSeg datasets. The performance increased significantly as CBAM was gradually integrated with greater kernel sizes (1 × 1, 3 × 3, 5 × 5, and 7 × 7) because larger kernels were able to capture more contextual information. Additionally, the skip connections enabled better feature fusion and improved segmentation performance, as they are essential for feature refinement and maintaining spatial and multiscale information.

4.5. Qualitative Results

This section discusses the visual results of our proposed model on the two datasets. The experiments demonstrate that our method can efficiently identify and segment ships.
MariBoats Dataset: The segmentation results on the MariBoats dataset using different models and backbones are displayed in Figure 6, organized by subfigure labels (a)–(l). According to the visual results, all models can detect and segment ships. Subfigure (a) shows the ground-truth images along with their annotations (in polygon format). Subfigures (b) to (k) show the results of the compared methods, while the visual results of the proposed SwinInsSeg model in subfigure (l) demonstrate superior mask quality and segmentation detail for the ships. We also observed that our model could detect and segment additional ships, circled in yellow in subfigure (l).
ShipInsSeg Dataset: The visual results on the ShipInsSeg dataset are presented in Figure 7, using different models and backbones in subfigures (b) to (l). Subfigure (a) shows the ground-truth images. We observed that certain models had difficulties detecting and segmenting ships due to the scale variation in the ships. The best visual results for ship segmentation obtained by our model are shown in Figure 7.
Our model, SwinInsSeg, reduced the problem of scale variation in ships, as shown in subfigure (l), and could accurately detect and segment ships. Furthermore, our model efficiently recognized and segmented multiple ships. Our model also detected and segmented new ships, which are represented by a yellow circle.

5. Discussion

The SwinInsSeg model is a single-stage segmentation method for precise ship instance segmentation in marine environments. It uses the Swin transformer to capture long-range dependencies, making it suitable for modeling complex visual patterns. The model incorporates a multi-kernel attention module to process features at multiple scales through different kernel sizes to enhance feature extraction and discard irrelevant features using attention. The results show improved segmentation accuracy and visual performance; however, the proposed model increases computational demands, leading to a lower frame-per-second rate that does not meet real-time performance requirements. Future work could improve SwinInsSeg’s computational efficiency by customizing the backbone with lightweight techniques like Depth-wise Separable Convolutions or the GhostNet module. This would reduce computational costs during feature extraction while maintaining the model’s ability to capture critical features, aiming to improve the frame-per-second rate and segmentation accuracy.
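To make the suggested direction concrete, the snippet below compares the parameter count of a standard 3 × 3 convolution with a depth-wise separable replacement; it illustrates the general technique only and is not part of SwinInsSeg.

```python
import torch.nn as nn

def depthwise_separable(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cin, k, padding=k // 2, groups=cin, bias=False),  # depth-wise: one k x k filter per channel
        nn.Conv2d(cin, cout, 1, bias=False),                             # point-wise: 1x1 channel mixing
    )

def n_params(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv2d(256, 256, 3, padding=1, bias=False)
print(n_params(standard), n_params(depthwise_separable(256, 256)))       # 589824 vs 67840
```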

6. Conclusions

In this paper, we developed an effective architecture for ship instance segmentation. We proposed a single-stage instance segmentation model, SwinInsSeg, that incorporates the Swin transformer and a multi-kernel attention module, making it effective for maritime surveillance. We chose the Swin transformer over conventional CNN-based backbones because it captures long-range dependencies more effectively, and its hierarchical self-attention mechanism with shifted windows generates effective feature maps. Our model’s learning capability was significantly improved by enhancing key features via our multi-kernel attention (MKA) module and subsequently aggregating these refined features. To validate the effectiveness of our method, we conducted extensive experiments on two datasets. Future work will further refine our model to address the complex challenges of multi-class ship datasets. Moreover, using video sequences and adding multi-object trackers will further improve segmentation, making it a more versatile and powerful tool for maritime surveillance.

Author Contributions

Conceptualization, R.S. and M.S.; methodology, R.S.; software, R.S.; validation, M.S., C.-T.L. and M.B.; investigation, M.S.; writing—original draft preparation, R.S. and M.S.; writing—review and editing, R.S., M.S., and M.B.; supervision, M.S. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MariBoats dataset is available at https://github.com/s2120200252/Visible-ship-dataset; the ShipInsSeg dataset is available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604. [Google Scholar] [CrossRef]
  2. Zhang, Y.; Li, Q.Z.; Zang, F.N. Ship detection for visual maritime surveillance from non-stationary platforms. Ocean. Eng. 2017, 141, 53–63. [Google Scholar] [CrossRef]
  3. Zhang, W.; He, X.; Li, W.; Zhang, Z.; Luo, Y.; Su, L.; Wang, P. An integrated ship segmentation method based on discriminator and extractor. Image Vis. Comput. 2020, 93, 103824. [Google Scholar] [CrossRef]
  4. Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Li, W.; Zhang, Z.; Wang, P.; Zhang, W. Global Mask R-CNN for marine ship instance segmentation. Neurocomputing 2022, 480, 257–270. [Google Scholar] [CrossRef]
  5. Sun, Z.; Meng, C.; Huang, T.; Zhang, Z.; Chang, S. Marine ship instance segmentation by deep neural networks using a global and local attention (GALA) mechanism. PLoS ONE 2023, 18, e0279248. [Google Scholar] [CrossRef]
  6. Nalamati, M.; Sharma, N.; Saqib, M.; Blumenstein, M. Automated monitoring in maritime video surveillance system. In Proceedings of the 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), Wellington, New Zealand, 25–27 November 2020; pp. 1–6. [Google Scholar]
  7. Nalamati, M.; Saqib, M.; Sharma, N.; Blumenstein, M. Exploring Transformers for Intruder Detection in Complex Maritime Environment. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Sydney, NSW, Australia, 2–4 February 2022; pp. 428–439. [Google Scholar]
  8. Park, H.; Ham, S.H.; Kim, T.; An, D. Object recognition and tracking in moving videos for maritime autonomous surface ships. J. Mar. Sci. Eng. 2022, 10, 841. [Google Scholar] [CrossRef]
  9. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023, 11, 696. [Google Scholar] [CrossRef]
  10. Xu, J.; Sun, X.; Zhang, D.; Fu, K. Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized Hough transform. IEEE Geosci. Remote Sens. Lett. 2014, 11, 2070–2074. [Google Scholar]
  11. Zhang, S.; Wu, R.; Xu, K.; Wang, J.; Sun, W. R-CNN-based ship detection from high resolution remote sensing imagery. Remote Sens. 2019, 11, 631. [Google Scholar] [CrossRef]
  12. Yao, Y.; Jiang, Z.; Zhang, H.; Zhao, D.; Cai, B. Ship detection in optical remote sensing images based on deep convolutional neural networks. J. Appl. Remote Sens. 2017, 11, 042611. [Google Scholar] [CrossRef]
  13. Huang, G.; Wan, Z.; Liu, X.; Hui, J.; Wang, Z.; Zhang, Z. Ship detection based on squeeze excitation skip-connection path networks for optical remote sensing images. Neurocomputing 2019, 332, 215–223. [Google Scholar] [CrossRef]
  14. Huang, L.; Liu, B.; Li, B.; Guo, W.; Yu, W.; Zhang, Z.; Yu, W. OpenSARShip: A dataset dedicated to Sentinel-1 ship interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 11, 195–208. [Google Scholar] [CrossRef]
  15. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; Volume 2, pp. 324–331. [Google Scholar]
  16. Ouchi, K.; Tamaki, S.; Yaguchi, H.; Iehara, M. Ship detection based on coherence images derived from cross correlation of multilook SAR images. IEEE Geosci. Remote Sens. Lett. 2004, 1, 184–187. [Google Scholar] [CrossRef]
  17. Han, J.; Liang, K.; Zhou, B.; Zhu, X.; Zhao, J.; Zhao, L. Infrared small target detection utilizing the multiscale relative local contrast measure. IEEE Geosci. Remote Sens. Lett. 2018, 15, 612–616. [Google Scholar] [CrossRef]
  18. Bai, X.; Liu, M.; Wang, T.; Chen, Z.; Wang, P.; Zhang, Y. Feature based fuzzy inference system for segmentation of low-contrast infrared ship images. Appl. Soft Comput. 2016, 46, 128–142. [Google Scholar] [CrossRef]
  19. Shaodan, L.; Chen, F.; Zhide, C. A ship target location and mask generation algorithms base on Mask RCNN. Int. J. Comput. Intell. Syst. 2019, 12, 1134–1143. [Google Scholar] [CrossRef]
  20. Chen, X.; Chen, H.; Wu, H.; Huang, Y.; Yang, Y.; Zhang, W.; Xiong, P. Robust visual ship tracking with an ensemble framework via multi-view learning and wavelet filter. Sensors 2020, 20, 932. [Google Scholar] [CrossRef]
  21. Sharma, R.; Saqib, M.; Lin, C.; Blumenstein, M. MASSNet: Multiscale Attention for Single-Stage Ship Instance Segmentation. Neurocomputing 2024, 594, 127830. [Google Scholar] [CrossRef]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
  25. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  26. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. Adv. Neural Inf. Process. Syst. 2017. [Google Scholar] [CrossRef]
  27. De Brabandere, B.; Neven, D.; Van Gool, L. Semantic instance segmentation with a discriminative loss function. arXiv 2017, arXiv:1708.02551. [Google Scholar]
  28. Liu, S.; Jia, J.; Fidler, S.; Urtasun, R. Sgn: Sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3496–3504. [Google Scholar]
  29. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  30. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  31. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  32. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
  33. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 649–665. [Google Scholar]
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  35. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision; pp. 568–578.
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  38. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; pp. 10012–10022.
  39. Schwehr, K. Vessel tracking using the automatic identification system (AIS) during emergency response: Lessons from the Deepwater Horizon incident. Cent. Coast. Ocean. Mapping/Joint Hydrogr. Cent. 2011. [Google Scholar]
  40. Nikolió, D.; Popovic, Z.; Borenovió, M.; Stojkovió, N.; Orlić, V.; Dzvonkovskaya, A.; Todorovic, B.M. Multi-radar multi-target tracking algorithm for maritime surveillance at OTH distances. In Proceedings of the 2016 17th International Radar Symposium (IRS); pp. 1–6.
  41. Sharma, N.; Scully-Power, P.; Blumenstein, M. Shark detection from aerial imagery using region-based CNN, a study. In Proceedings of the AI 2018: Advances in Artificial Intelligence: 31st Australasian Joint Conference, Wellington, New Zealand, 11–14 December 2018; Proceedings 31. Springer: Berlin/Heidelberg, Germany, 2018; pp. 224–236. [Google Scholar]
  42. Saqib, M.; Khan, S.D.; Sharma, N.; Scully-Power, P.; Butcher, P.; Colefax, A.; Blumenstein, M. Real-time drone surveillance and population estimation of marine animals from aerial imagery. In Proceedings of the 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ); pp. 1–6.
  43. Nalamati, M.; Kapoor, A.; Saqib, M.; Sharma, N.; Blumenstein, M. Drone detection in long-range surveillance videos. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–6. [Google Scholar]
  44. Zou, Y.; Zhao, L.; Qin, S.; Pan, M.; Li, Z. Ship target detection and identification based on SSD_MobilenetV2. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1676–1680. [Google Scholar]
  45. Sun, Y.; Su, L.; Cui, H.; Chen, Y.; Yuan, S. Ship instance segmentation in foggy scene. In Proceedings of the 2021 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021; pp. 8340–8345. [Google Scholar]
  46. Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Zhang, Z.; Zhang, W.; Yuan, S. Irdclnet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6029–6043. [Google Scholar] [CrossRef]
  47. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV); pp. 3–19.
  48. Sharma, R.; Saqib, M.; Lin, C.; Blumenstein, M. Maritime Surveillance Using Instance Segmentation Techniques. In Proceedings of the International Conference on Data Science and Communication; Springer: Berlin/Heidelberg, Germany, 2023; pp. 31–47. [Google Scholar]
  49. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173. [Google Scholar] [CrossRef]
  50. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
Figure 1. Overall structure of the SwinInsSeg framework, with the backbone consisting of the Swin-T transformer and comprising four stages (red-dashed rectangular box). The output of each stage is passed through the multi-kernel attention (MKA) module to enhance the feature maps. The neck (green-dashed rectangular box) comprises the feature pyramid network (FPN), which captures objects across different feature maps. Finally, the mask head (yellow-dashed rectangular box) generates the instance masks.
Figure 2. A pair of Swin transformer blocks. The initial block contains the W-MSA module, whereas the subsequent block contains the SW-MSA module.
Figure 3. The proposed multi-kernel attention module. The output of each stage in the Swin transformer backbone, as shown in Figure 1, is fed as input to our MKA module, where feature maps are enhanced using different kernels for different object sizes. To focus on important feature maps, we use the CBAM attention mechanism for further refinement and concatenation. The red rectangular box highlights the CBAM mechanism.
Figure 4. Accuracy vs. epoch curves for the MariBoats dataset, evaluated using the COCO metrics. (a) The AP curve, which highlights the overall performance of the models over the training epochs. (b) The AP50 curve. (c) The AP75 curve, which indicates performance at an IoU threshold of 75%. (d) The APS curve. (e) The APM curve, which evaluates medium-sized ships. (f) The APL curve. Our SwinInsSeg curves are shown in indigo blue. (Best viewed in color).
Figure 5. Accuracy vs. epoch curves for the ShipInsSeg dataset, evaluated using the COCO metrics. (a) The AP curve, which highlights the overall performance of the models over the training epochs. (b) The AP50 curve. (c) The AP75 curve, which indicates performance at an IoU threshold of 75%. (d) The APS curve. (e) The APM curve, which evaluates medium-sized ships. (f) The APL curve. Our SwinInsSeg curves are shown in indigo blue. (Best viewed in color).
Figure 6. Visual results of different methods on the MariBoats dataset. (a) Ground-truth images. (bl) Visual results of the SOLO-ResNet-50, SOLO-ResNet-101, SOLO-Decoupled, SOLO-Swin-T, SOLOv2-ResNet-50, SOLOv2-ResNet-101, SOLOv2-Swin-T, YOLACT-ResNet-50, YOLACT-ResNet-101, YOLACT-Swin-T, and SwinInsSeg (ours) models, respectively. Newly detected and segmented objects are marked in yellow. (Best viewed in color).
Figure 7. Visual results of different methods on the ShipInsSeg dataset. (a) Ground-truth images. (bl) Visual results of the SOLO-ResNet-50, SOLO-ResNet-101, SOLO-Decoupled, SOLO-Swin-T, SOLOv2-ResNet-50, SOLOv2-ResNet-101, SOLOv2-Swin-T, YOLACT-ResNet-50, YOLACT-ResNet-101, YOLACT-Swin-T, and SwinInsSeg (ours) models, respectively. Newly detected and segmented objects are marked in yellow. (Best viewed in color).
Table 1. Configuration list.
Configuration Settings | Version
Operating System | Ubuntu 18.04.6 LTS
GPU | NVIDIA Quadro P6000
Memory | 24 GB
Deep Learning Framework | PyTorch 1.7
MMCV | 1.7.1
CUDA | 10.2
MMDetection | 2.27.0
Table 2. The AP values across various IOU thresholds and object sizes.
Symbol | Description | Object Size
AP | IoU from 0.5 to 0.95 in steps of 0.05 | -
AP50 | IoU at 0.5 | -
AP75 | IoU at 0.75 | -
APS | Small objects | area < 32²
APM | Medium objects | 32² < area < 96²
APL | Large objects | area > 96²
Table 3. Comparison results of the mask AP on the MariBoats dataset.
Method | Backbone | AP | AP50 | AP75 | APS | APM | APL | AR | ARS | ARM | ARL | Params (M) | FPS
YOLACT [32] | ResNet-50 | 36.5 | 66.4 | 35.7 | 2.8 | 15.0 | 51.5 | 46.0 | 11.0 | 27.0 | 61.6 | 34.73 | 19.429
YOLACT [32] | ResNet-101 | 38.5 | 67.4 | 38.3 | 3.0 | 14.9 | 54.5 | 47.3 | 10.4 | 27.6 | 63.8 | 53.72 | 16.287
YOLACT [32] | Swin-T | 41.5 | 68.1 | 43.7 | 3.0 | 13.0 | 59.3 | 49.3 | 11.0 | 27.8 | 66.9 | 70.32 | 9.96
SOLO [33] | ResNet-50 | 40.6 | 61.5 | 46.0 | 1.2 | 7.8 | 61.1 | 45.3 | 2.3 | 15.8 | 66.9 | 35.89 | 14.563
SOLO [33] | ResNet-101 | 40.8 | 61.8 | 46.4 | 0.8 | 8.0 | 61.6 | 46.0 | 2.0 | 15.7 | 68.0 | 54.89 | 12.699
SOLO [33] | Swin-T | 39.7 | 60.5 | 45.1 | 0.9 | 7.4 | 60.4 | 44.7 | 1.8 | 13.7 | 66.7 | 71.45 | 10.67
SOLO-Decoupled [33] | ResNet-50 | 48.4 | 74.2 | 54.5 | 3.4 | 20.6 | 68.1 | 55.0 | 7.4 | 33.3 | 74.9 | 39.62 | 14.057
SOLOv2 [24] | ResNet-50 | 45.9 | 71.7 | 49.6 | 3.2 | 17.1 | 65.5 | 54.1 | 8.0 | 31.6 | 76.0 | 46.0 | 17.722
SOLOv2 [24] | ResNet-101 | 48.3 | 72.8 | 51.7 | 4.0 | 18.7 | 68.3 | 55.7 | 9.4 | 31.6 | 76.0 | 65.59 | 13.457
SOLOv2 [24] | Swin-T | 47.3 | 71.2 | 50.7 | 3.5 | 15.3 | 68.0 | 53.9 | 9.2 | 28.5 | 74.3 | 80.5 | 12.486
SwinInsSeg (ours) | Swin-T | 50.6 | 75.4 | 54.7 | 4.2 | 20.0 | 71.1 | 57.5 | 11.5 | 35.3 | 77.6 | 90.5 | 11.55
Table 4. Comparison results of the mask AP on the ShipInsSeg dataset.
Method | Backbone | AP | AP50 | AP75 | APS | APM | APL | AR | ARS | ARM | ARL | Params (M) | FPS
YOLACT [32] | ResNet-50 | 36.5 | 62.9 | 36.1 | 12.4 | 54.9 | 58.5 | 44.7 | 20.8 | 61.5 | 72.7 | 34.73 | 13.617
YOLACT [32] | ResNet-101 | 41.1 | 68.9 | 42.1 | 13.5 | 60.4 | 67.6 | 48.3 | 23.9 | 65.8 | 75.8 | 53.72 | 10.76
YOLACT [32] | Swin-T | 48.6 | 77.0 | 50.4 | 22.7 | 65.8 | 74.9 | 55.2 | 32.9 | 70.9 | 80.9 | 70.40 | 8.17
SOLO [33] | ResNet-50 | 29.2 | 40.3 | 32.2 | 3.1 | 46.2 | 63.0 | 33.0 | 4.4 | 51.6 | 70.6 | 35.89 | 8.217
SOLO [33] | ResNet-101 | 28.5 | 39.5 | 31.5 | 2.5 | 45.2 | 62.6 | 32.9 | 4.1 | 51.3 | 71.6 | 54.89 | 7.061
SOLO [33] | Swin-T | 26.6 | 36.8 | 29.2 | 2.5 | 41.7 | 59.2 | 30.4 | 3.5 | 47.0 | 68.1 | 71.52 | 6.80
SOLO-Decoupled [33] | ResNet-50 | 49.3 | 73.2 | 53.3 | 16.8 | 70.1 | 85.1 | 53.5 | 23.7 | 74.4 | 88.5 | 39.62 | 8.194
SOLOv2 [24] | ResNet-50 | 48.3 | 71.7 | 50.5 | 16.5 | 67.8 | 85.6 | 52.3 | 23.1 | 71.9 | 88.5 | 46.0 | 11.510
SOLOv2 [24] | ResNet-101 | 49.8 | 73.4 | 52.7 | 17.4 | 69.6 | 87.6 | 53.4 | 23.8 | 73.9 | 90.1 | 65.59 | 8.59
SOLOv2 [24] | Swin-T | 47.6 | 70.8 | 49.9 | 16.2 | 66.8 | 85.8 | 51.4 | 22.3 | 70.9 | 88.2 | 82.34 | 7.34
SwinInsSeg (ours) | Swin-T | 52.0 | 77.7 | 54.7 | 21.7 | 70.7 | 86.8 | 56.2 | 33.0 | 74.9 | 90.3 | 91.55 | 7.05
Table 5. Effectiveness of the MKA module on the MariBoats dataset.
Method | AP | AP50 | AP75 | APS | APM | APL
Swin-T | 47.3 | 71.2 | 50.7 | 3.5 | 15.3 | 68.0
Swin-T + MKA at stage 1 | 47.9 | 72.0 | 51.5 | 3.65 | 16.09 | 68.55
Swin-T + MKA at stages 1 and 2 | 48.5 | 72.9 | 53.1 | 3.78 | 18.55 | 70.05
Swin-T + MKA at stages 1, 2, and 3 | 49.7 | 73.5 | 53.6 | 3.99 | 19.35 | 70.6
Swin-T + MKA at all stages | 50.6 | 75.4 | 54.7 | 4.2 | 20.0 | 71.1
Table 6. Effectiveness of the MKA module on the ShipInsSeg dataset.
Method | AP | AP50 | AP75 | APS | APM | APL
Swin-T | 47.6 | 70.8 | 49.9 | 16.2 | 66.8 | 85.8
Swin-T + MKA at stage 1 | 49.15 | 72.77 | 50.8 | 17.55 | 67.9 | 85.96
Swin-T + MKA at stages 1 and 2 | 49.87 | 74.03 | 52.5 | 18.90 | 68.5 | 86.09
Swin-T + MKA at stages 1, 2, and 3 | 51.35 | 75.95 | 53.4 | 20.10 | 68.98 | 86.6
Swin-T + MKA at all stages | 52.0 | 77.7 | 54.7 | 21.7 | 70.7 | 86.8
Table 7. Different backbones using the MKA module on the MariBoats dataset.
Backbone | AP | AP50 | AP75 | APS | APM | APL
ResNet-50 + MKA at all stages | 46.7 | 72.5 | 50.9 | 3.8 | 17.5 | 68.9
ResNet-101 + MKA at all stages | 49.1 | 73.4 | 52.4 | 4.1 | 19.2 | 69.9
Swin-T + MKA at all stages | 50.6 | 75.4 | 54.7 | 4.2 | 20.0 | 71.1
Table 8. Different backbones using the MKA module on the ShipInsSeg dataset.
Backbone | AP | AP50 | AP75 | APS | APM | APL
ResNet-50 + MKA at all stages | 49.6 | 72.3 | 52.2 | 18.1 | 68.5 | 85.9
ResNet-101 + MKA at all stages | 50.8 | 75.8 | 53.4 | 19.9 | 70.1 | 87.8
Swin-T + MKA at all stages | 52.0 | 77.7 | 54.7 | 21.7 | 70.7 | 86.8
Table 9. Evaluation of the MKA module on the MariBoats dataset.
Method | AP | AP50 | AP75 | APS | APM | APL
Baseline Model | 47.3 | 71.2 | 50.7 | 3.5 | 15.3 | 68.0
1 × 1 with CBAM | 47.8 | 71.8 | 51.2 | 3.65 | 16.5 | 68.5
3 × 3 with CBAM | 48.4 | 72.5 | 52.0 | 3.83 | 17.7 | 69.2
5 × 5 with CBAM | 48.1 | 73.6 | 52.9 | 3.98 | 18.5 | 70.1
7 × 7 with CBAM | 49.2 | 74.5 | 53.8 | 4.05 | 19.4 | 70.95
Full MKA with skip connections | 50.6 | 75.4 | 54.7 | 4.2 | 20.0 | 71.1
Table 10. Evaluation of the MKA module on the ShipInsSeg dataset.
Method | AP | AP50 | AP75 | APS | APM | APL
Baseline Model | 47.6 | 70.8 | 49.9 | 16.2 | 66.8 | 85.8
1 × 1 with CBAM | 48.5 | 72.3 | 50.7 | 17.7 | 68.1 | 85.4
3 × 3 with CBAM | 49.2 | 73.9 | 51.6 | 19.1 | 68.95 | 85.9
5 × 5 with CBAM | 50.4 | 75.1 | 52.88 | 19.9 | 69.6 | 86.2
7 × 7 with CBAM | 51.2 | 76.5 | 54.05 | 20.9 | 70.1 | 86.5
Full MKA with skip connections | 52.0 | 77.7 | 54.7 | 21.7 | 70.7 | 86.8