ShipMS-BSNet: A Multi-Scale Semantic Segmentation Method for Remote Sensing Ships in Complex Marine Environments

Liu, Dezhi; Hua, Liangchun; Wang, Zhipan; Wang, Le; Chu, Bin; Zeng, Haibo; Chen, Zegang; Long, Zhong; Zhang, Yunfei; Zhang, Hua

doi:10.3390/rs18111789


by Dezhi Liu ¹, Liangchun Hua ², Zhipan Wang ¹,*, Le Wang ², Bin Chu ³, Haibo Zeng ⁴,⁵, Zegang Chen ⁶, Zhong Long ⁶, Yunfei Zhang ¹ and Hua Zhang ¹

¹ Changsha University of Science and Technology, Changsha 410114, China
² Hunan Institute of Geomatics Sciences and Technology, Changsha 410007, China
³ School of Electronic Information, Wuhan University, Wuhan 430072, China
⁴ The Second Surveying and Mapping Institute of Hunan Province, Changsha 410029, China
⁵ Key Laboratory of Natural Resources Monitoring and Supervision in Southern Hilly Region, Ministry of Natural Resources, Changsha 410029, China
⁶ Ecological Geological Survey and Monitoring Institute of Hunan, Changsha 410119, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1789; https://doi.org/10.3390/rs18111789

Submission received: 15 April 2026 / Revised: 22 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Advances in Satellite Image Analysis and Applications for Earth Observation)


Highlights

What are the main findings?

A large-scale, high-quality multi-scale ship remote sensing dataset containing 69,407 professionally annotated samples is constructed, which significantly outperforms existing public datasets in sample diversity, scale coverage, and small target representation.
A novel nnU-Net-based ShipMS-BSNet is proposed for ship segmentation, which achieves simultaneous multi-scale perception and background suppression via encoder-end MSRF-BSCA synergy and optimizes segmentation boundaries and small targets through an output-end MSR module. Extensive experiments show it outperforms mainstream methods, achieving a precision of 0.879, a recall of 0.875, an F1-score of 0.868, and an IoU of 0.761 on the self-built dataset.

What is the implication of the main finding?

The constructed large-scale ship segmentation dataset provides a unified benchmark platform for performance evaluation in the field of remote sensing ship segmentation, effectively alleviating the current data bottleneck and laying a solid foundation for further algorithmic advances.
The proposed synergistic encoding strategy and output-end Multi-Scale Refinement paradigm offer a new solution to extreme scale variation and complex background interference, which can be extended to other remote sensing segmentation tasks to support maritime monitoring and traffic management applications.

Abstract

Accurate segmentation of ship targets in high-resolution remote sensing images is crucial for maritime monitoring, traffic management and naval security. However, existing methods struggle to simultaneously address extreme scale variations in ships and severe interference from complex backgrounds, leading to unsatisfactory accuracy and generalization in scenarios with shoreline occlusion and ocean wave noise. To tackle this challenge, we first construct a large-scale, high-quality multi-scale ship dataset containing 69,407 professionally annotated samples. Then, we propose ShipMS-BSNet, a multi-scale feature fusion network based on nnU-Net. At the encoder, the Multi-Scale Receptive Field Enhancement (MSRF) module captures multi-scale contextual information, while the Background Suppression Channel Attention (BSCA) module suppresses invalid background responses via a learnable bias initialized to a negative value. At the decoder, dynamic upsampling restores spatial details, and a final Multi-Scale Refinement (MSR) module optimizes target boundaries. Extensive experiments on our self-built dataset and the public HRSC2016 dataset show that our method outperforms mainstream approaches. On the self-built dataset, it achieves a precision of 0.879, a recall of 0.875, an F1-score of 0.868, and an IoU of 0.761, validating its strong robustness for multi-scale ship segmentation in complex marine environments.

Keywords:

ship extraction; channel attention; dilated convolution; semantic segmentation

1. Introduction

Maritime vessel monitoring is an indispensable component of modern shipping management, maritime safety, environmental protection, and the global economy. The rapid growth in global trade has significantly increased the density and complexity of ship activities, making efficient and accurate ship extraction from remote sensing imagery a critical technology. Remote sensing, with its advantages of wide coverage, strong timeliness, and freedom from geographical constraints, has become the primary means of maritime surveillance [1,2].

Early research on ship detection predominantly relied on traditional image processing and machine learning techniques. These methods are built upon manually designed features, encompassing strategies like thresholding, visual saliency analysis, shape and texture characterization, and transform-domain processing [3]. A fundamental limitation of these techniques is their reliance on handcrafted features, which often fail to adequately represent the diverse and variable shapes of ships, especially under differing resolutions and imaging conditions. Consequently, they exhibit poor generalization capability in complex scenarios, struggling to reliably detect small ships against cluttered backgrounds [4].

The advent of deep learning, particularly convolutional neural networks, has revolutionized this field by enabling end-to-end feature learning. Deep learning-based methods have substantially outperformed traditional techniques, demonstrating superior representational power and strong adaptability to complex scenes, thus becoming the dominant paradigm for object detection and segmentation in optical remote sensing imagery [5,6,7,8,9].

For the task of maritime ship extraction, semantic segmentation offers a distinct advantage over bounding-box object detection by providing precise, pixel-level ship contours, which are essential for high-precision applications [10,11,12,13]. However, current semantic segmentation methods still face significant, unresolved challenges in the maritime domain. The complex marine environment—characterized by wave textures, sun glint, and adverse weather—often degrades image quality, leading to blurred target boundaries and mis-segmentation [14,15]. Furthermore, the substantial scale variations among different types of ships, the difficulty in detecting small vessels, and the complexities of handling densely docked ships in harbors impose stringent demands on a model’s robustness and generalization capability [16,17].

A critical bottleneck that exacerbates these challenges is the scarcity of high-quality, large-scale datasets specifically designed for ship semantic segmentation. Most existing public datasets are limited in volume and scene diversity, often lacking comprehensive coverage of the very scenarios that pose the greatest difficulties: small targets, multi-scale vessels, and nearshore ships with complex backgrounds [18,19]. This data deficiency directly restricts the generalization ability and the performance ceiling of advanced deep learning models. To bridge this gap, this study aims to construct a large-scale, diverse, and pixel-level annotated ship segmentation dataset that systematically addresses the limitations of existing data resources. Furthermore, on this basis, we propose a novel deep learning model tailored to overcome the identified challenges of multi-scale feature extraction and background interference, thereby achieving robust and high-accuracy ship semantic segmentation in complex maritime environments. The primary contributions of this article are outlined as follows.

Existing public ship segmentation datasets suffer from limited sample diversity, insufficient coverage of small vessels, and restricted scene variability, which collectively constrain the generalization capability of deep learning models. To bridge this gap, we construct a large-scale, pixel-level annotated dataset comprising 69,407 image–label pairs encompassing diverse maritime environments, including open seas, nearshore waters, and ports. The dataset spans a wide range of ship scales, from small fishing boats to large cargo vessels, and undergoes rigorous multi-round quality control to ensure annotation consistency. This resource provides a standardized benchmark that directly addresses the data bottleneck impeding progress in ship segmentation research.

Standard encoder architectures face two inherent challenges in maritime scenes: single-scale convolutions cannot adequately capture ships spanning from tens to hundreds of pixels, and conventional channel attention [20] treats all channels with neutral initial weighting, making it inefficient at distinguishing weak ship signals from dominant background textures, such as waves and glint. We address these issues through two interlocking modules embedded at each encoder stage. First, the Multi-Scale Receptive Field Enhancement (MSRF) module employs parallel $1 \times 1$, $3 \times 3$ standard, and $3 \times 3$ dilated convolutions, deliberately excluding any pooling operation, to preserve fine spatial details of small vessels while simultaneously capturing broader context for large ships. Second, the Background Suppression Channel Attention (BSCA) module introduces a learnable bias initialized to a negative value before the Sigmoid activation, forcing the network to start training with strong suppression on all channels and progressively up-weight only ship-discriminative ones. This negative bias mechanism, absent in standard SE variants, provides an implicit data-driven background subtraction prior specifically tailored to scenes where background pixels overwhelmingly dominate. The synergy of these modules achieves simultaneous multi-scale perception and background suppression, yielding significant improvements in complex maritime environments.

In most segmentation networks, including recent maritime architectures, such as MSCF-Net [21] and MASSNet [22], the decoder output is directly projected to class predictions via a single $1 \times 1$ convolution, leaving residual boundary blurring and weak responses in small target regions unaddressed. We propose the Multi-Scale Refinement (MSR) module, placed at the network output as a lightweight post-processing unit. It extracts multi-scale edge information by paralleling $3 \times 3$ standard and dilated convolutions, then fuses the enhanced features with the original input through a residual connection. This design performs fine-grained adjustments on segmentation boundaries and small target regions with negligible computational overhead, further improving boundary sharpness and target integrity in the final predictions.

The remainder of this paper is organized as follows. Section 2 reviews related work, Section 3 details the data and the proposed method, Section 4 presents the experimental procedures and their outcomes, Section 5 discusses the results, and Section 6 concludes the paper.

2. Related Works

2.1. Optical Satellite Remote Sensing Ship Dataset

The first dataset specifically dedicated to ship detection in optical remote sensing (ORS) imagery, HRSC2016, was introduced in 2016 by Liu et al. [23], who collected 1070 ship images from Google Earth. Although relatively small in scale, it was a pioneering ORS ship detection dataset and played a significant role in the development of later datasets. In 2018, the Airbus Ship Detection Challenge hosted on Kaggle released a dataset named Kaggle-Ship [24]. Although large in volume, with 208,164 images, it suffers from severe imbalance: the ship targets are small relative to the image size, a large number of images contain no ships at all (pure background), and the number of ships per image is highly imbalanced. In 2020, Chen et al. [25] systematically extended the HRSC dataset to create FGSD, compiling 2612 Google Earth images from 17 strategically important ports located in four maritime nations: China, Japan, the United States, and Spain. The resulting dataset covers 43 different ship types and contains notably more images than HRSC2016.

In the same year, Deng et al. [26] collected a large number of ship images from Kaggle-Ship, DOTA, and Google Earth, and subsequently constructed and released the ASRSS dataset, which focuses on small-scale ship detection. In 2021, Wu et al. [27] utilized GaoFen-1 imagery with a resolution of 16 m to build the GF1-LRSD dataset, specifically designed for detecting extremely small ships. The GF1-LRSD dataset is dominated by small objects under 16 pixels, which constitute about 94% of all instances. This stands in sharp contrast to other datasets, where more than half of the objects are larger than 16 pixels. Also in 2021, Zhang et al. [28] constructed the ShipRSImageNet dataset, which incorporates images from multiple sources, including the xView dataset, HRSC2016, FGSD, Kaggle-Ship, GaoFen-2 satellite, and JiLin-1 satellite. ShipRSImageNet contains 50 ship categories and 17,573 ship instances, but it consists of only 3435 images, making its scale still relatively limited. To better evaluate ship detection algorithms in large-scale scenes, Su et al. [29] developed the Large-Area Remote Sensing Ship Detection Dataset (LARS), which comprises 20 images, each exceeding 10,000 × 10,000 pixels in size. In 2023, Guo et al. [30] constructed the MCSD dataset, covering 38 major ports worldwide. Most recently, in 2024, Hu et al. [6] introduced the SCCOS dataset, the latest benchmark for ORS ship detection, with imagery sourced from multiple platforms. However, these datasets primarily employ bounding-box annotations and lack precise pixel-level segmentation labels, which limits the application of deep learning models in fine-grained ship recognition and semantic segmentation tasks.

Indeed, datasets dedicated to ship semantic segmentation remain scarce. Ciocarlan et al. [31] collected 16 Sentinel-2 L2A images covering coastlines, ports, and the Suez Canal. Lee et al. [32] gathered 1984 images from Google Earth with variations in acquisition time, ship location, and climatic conditions, including scenes of ports, coastlines, shallow waters, and open oceans. Both rotated bounding boxes and their corresponding instance mask labels were annotated for these images. However, existing ship segmentation datasets are generally limited in scale, lack scene diversity, and do not provide comprehensive coverage of small, multi-scale, and nearshore vessels. There remains a significant shortage of large-scale, high-quality, pixel-level annotated ORS ship datasets. Table 1 summarizes the detailed characteristics of some publicly available satellite remote sensing ship segmentation datasets.

2.2. Ship Segmentation Network

Before the widespread adoption of deep learning techniques, traditional image processing algorithms were commonly used in ship segmentation tasks. These methods typically relied on handcrafted features and mathematical models. However, traditional approaches often depended on specific assumptions and prior knowledge. When confronted with remote sensing imagery characterized by complex backgrounds, diverse target scales, and substantial noise, these models suffer from limited generalization and robustness, thus failing to meet practical application requirements.

Deep learning technology has rapidly become the mainstream approach for ship segmentation due to its powerful feature learning capabilities. Zheng et al. [33] proposed the RepUnet framework, which combines the re-parameterization technique of RepVGG with the U-Net [34] architecture. Employing a multi-branch design during training and a single-path structure during inference significantly improves inference speed while maintaining segmentation accuracy, achieving a balance between precision and efficiency. To address ship instance segmentation in foggy conditions, Sun et al. [35] developed IRDCLNet, which leverages interference reduction and dynamic contour learning. It demonstrates exceptional robustness in extracting ship contours under adverse weather conditions. Zhang et al. [36] proposed SwinSeg, an integration of the Swin Transformer and a lightweight MLP within a hybrid network. This architecture enables efficient long-range dependency modeling while maintaining linear computational complexity, enhancing global contextual awareness in semantic segmentation. Rabi Sharma et al. [22] developed the MASSNet framework. It employs multi-dimensional attention mechanisms to bolster multi-scale feature extraction, which yields more refined and context-aware representations and significantly boosts ship segmentation accuracy. Jiang et al. [21] proposed the MSCF-Net framework, which leverages a set of multi-scale convolutional kernels to capture semantically rich features. The encoder integrates an enhanced spatial pyramid pooling (ESPP) module. Its purpose is to endow the model with both an enlarged receptive field and access to richer contextual information from multiple scales. Additionally, a multi-scale attention module is incorporated, aimed at enhancing cross-layer feature interaction within the network.

Beyond these dedicated ship segmentation networks, Transformer-based architectures have driven significant progress in general semantic segmentation [37]. TransUNet [38] combines a hybrid CNN-Transformer encoder with a U-Net-style convolutional decoder, employing self-attention to model long-range dependencies while preserving fine spatial details through skip connections. SegFormer [39] introduces a hierarchical Transformer encoder combined with a lightweight all-MLP decoder, achieving competitive performance with high computational efficiency.

More recently, foundation models have begun to reshape the landscape of computer vision. The Segment Anything Model (SAM) [40] introduced a promptable segmentation framework trained on more than one billion masks, demonstrating impressive zero-shot generalization across natural images. However, applying SAM directly to remote sensing ship segmentation often yields suboptimal results because of the domain gap. Small vessels frequently present ambiguous boundaries against complex sea surfaces, and SAM tends to produce coarse masks that lack the pixel-level precision required for accurate ship extraction. In parallel, vision-language models (VLMs), such as CLIP, have been explored for remote sensing tasks, leveraging textual descriptions to guide visual recognition [41]. While these models offer semantic flexibility, their use in dense pixel-level ship segmentation remains nascent, and bridging the gap between image-level alignment and precise boundary delineation is still an open challenge.

In the task of remote sensing ship extraction, ship targets generally exhibit extreme scale variations, irregular contours and highly complex backgrounds, which impose stringent requirements on the network’s multi-scale feature perception capability and background anti-interference capability. At present, many existing segmentation methods either suffer from insufficient robustness in complex marine environments or perform poorly in handling multi-scale targets. Therefore, it is of crucial importance to specifically design a ship segmentation network that integrates strong multi-scale feature perception and background interference suppression capabilities.

3. Data and Methods

3.1. Data

In this study, to meet the requirements for training and evaluating semantic segmentation models for ship remote sensing images, we have constructed a large-scale, high-quality, and diverse dedicated dataset. The construction process primarily consists of three stages: data acquisition, fine-grained annotation, and data processing. The regions of data acquisition and representative instance scenes are illustrated in Figure 1.

Data acquisition and annotation: The original remote sensing image data were sourced from the Google Earth platform (earth.google.com/web/, accessed on 27 May 2026), ensuring diversity and broad coverage. The annotation process was carried out using the professional geographic information system software ArcGIS 10.4 (Esri, Redlands, CA, USA). Specialists meticulously delineated ship targets in the form of polygon vectors. This annotation method captures the contours and shapes of ships with high precision, providing accurate ground-truth information to support model learning.

Data processing and generation: To generate training samples ready for model use, we developed an automated data processing pipeline:

First, a spatial buffer analysis was performed on the annotated ship polygons to group vessels that are in close proximity to one another. Based on these grouped ship clusters, cropping windows were generated such that each cropped image contains one or more spatially adjacent ships while ensuring that no single ship instance is split across multiple crops. Subsequently, all cropped images were standardized by resizing them to a uniform resolution of $512 \times 512$ pixels, ensuring consistency with the model’s input specifications.
The polygon vector annotations generated in ArcGIS were converted into binary masks using a custom-developed algorithm. For the binary masks, pixels corresponding to ships form the positive class, encoded as 1, and all other pixels form the negative class, encoded as 0. This process establishes a precise and consistent benchmark for model training.
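The two processing steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the authors' pipeline: ships are reduced to axis-aligned bounding boxes, the buffer test is applied to those boxes rather than to the polygons, masks are rasterized with an even-odd test at pixel centres, and the names `group_by_buffer`, `crop_window`, and `rasterize` are hypothetical.

```python
from itertools import combinations


def group_by_buffer(boxes, buffer):
    """Group boxes (xmin, ymin, xmax, ymax) whose buffered extents overlap."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def near(a, b):
        # Overlap test after expanding both boxes by `buffer` on every side.
        return (a[0] - buffer <= b[2] + buffer and b[0] - buffer <= a[2] + buffer
                and a[1] - buffer <= b[3] + buffer and b[1] - buffer <= a[3] + buffer)

    for i, j in combinations(range(len(boxes)), 2):
        if near(boxes[i], boxes[j]):
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def crop_window(boxes, members, margin):
    """Smallest window (plus margin) containing every box of one group."""
    xs0, ys0, xs1, ys1 = zip(*(boxes[i] for i in members))
    return (min(xs0) - margin, min(ys0) - margin,
            max(xs1) + margin, max(ys1) + margin)


def rasterize(polygon, height, width):
    """Binary mask: 1 where the pixel centre lies inside the polygon, else 0."""
    mask = [[0] * width for _ in range(height)]
    n = len(polygon)
    for row in range(height):
        y = row + 0.5
        for col in range(width):
            x = col + 0.5
            inside = False
            for k in range(n):
                (x0, y0), (x1, y1) = polygon[k], polygon[(k + 1) % n]
                # Even-odd rule: toggle for each edge crossed by a ray going right.
                if (y0 > y) != (y1 > y) and x < x0 + (y - y0) * (x1 - x0) / (y1 - y0):
                    inside = not inside
            mask[row][col] = int(inside)
    return mask


boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
groups = group_by_buffer(boxes, buffer=2)          # the first two ships are adjacent
window = crop_window(boxes, groups[0], margin=5)   # one crop holds both of them
mask = rasterize([(1, 1), (3, 1), (3, 3), (1, 3)], height=4, width=4)
```

Because a window is built from the full extent of every ship in its group, no ship in that group is cut by the crop boundary. Resizing the crop and its mask to $512 \times 512$ would follow as a separate step.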

The final dataset consists of 69,407 image–label pairs, including 1000 negative samples (i.e., images without any ship targets) to reduce false detection rates during inference and enhance model robustness. The dataset maintains high diversity and balance across multiple dimensions, such as the scale, environment, and target count:

In terms of scale diversity, it covers vessel targets ranging from small boats to large ships.
Regarding environmental variety, backgrounds include open oceans, nearshore ports, and complex weather scenarios with disturbances such as clouds and waves.
In target quantity distribution, the number of ships per image ranges from single to multiple instances, following a reasonable distribution that effectively avoids bias in model training.

All samples were meticulously annotated manually and underwent strict quality control to ensure label accuracy. The construction of this dataset establishes a solid data foundation for training and evaluating ship segmentation algorithms, while its diversity and high quality significantly enhance the model’s generalization capability in real-world complex scenarios. The overall dataset production workflow is shown in Figure 2. The probability density statistics of the aspect ratio and actual area of all ship targets in the dataset are shown in Figure 3 and Figure 4.

3.2. ShipMS-BSNet

The ship extraction network proposed in this paper, named ShipMS-BSNet, adopts an encoder–decoder architecture, taking three-channel optical remote sensing images of size $512 \times 512$ as input and outputting a binary segmentation map consisting of background and ship classes.

nnU-Net [42], originally developed for biomedical image segmentation, is fundamentally a U-Net variant that excels at generating precise, pixel-level predictions with sharp boundaries—an inherent strength attributable to its deep encoder–decoder structure with dense skip connections and progressive resolution recovery stage by stage.

In ship semantic segmentation, boundary accuracy is critical, particularly for small vessels and ships docked in complex nearshore environments. Architectures such as DeepLabV3+ [43] rely on an Atrous Spatial Pyramid Pooling (ASPP) [44] module to capture multi-scale context, but the subsequent single-step upsampling limits its ability to gradually refine object boundaries, often resulting in coarse edges around small targets. SegFormer, a Transformer-based architecture, provides excellent global context modeling but tends to produce overly smooth segmentation maps and can struggle with fine boundary details when targets are small. In contrast, nnU-Net’s multi-stage decoder with skip connections at every resolution level offers a natural advantage for preserving and gradually recovering the spatial precision demanded by ship extraction.

The overall architecture is illustrated in Figure 5. While the high-level framework inherits the well-established encoder–decoder scheme of nnU-Net, the core components have been fundamentally redesigned to address the unique challenges inherent in maritime remote sensing: extreme scale variations, complex background interference, and the need for accurate boundary delineation.

Specifically, the network comprises four main components: the encoder, the decoder, the Multi-Scale Refinement (MSR) module, and skip connections. The encoder integrates two novel modules—the Multi-Scale Receptive Field Enhancement (MSRF) block and the Background Suppression Channel Attention (BSCA) module—to capture robust multi-scale features while actively suppressing irrelevant background responses. The decoder abandons traditional transposed convolutions in favor of a dynamic upsampling module (DySample) to achieve content-adaptive resolution restoration and is further augmented by the MSR module placed at the output end for edge-aware post-processing. These designs, detailed below, collectively distinguish ShipMS-BSNet from generic U-Net variants and recent maritime segmentation models.

3.3. Encoder

The encoder progressively extracts multi-level feature representations from the input image. To meet the challenges posed by ships of drastically varying sizes and complex marine backgrounds, we have substantially modified the standard convolutional downsampling stages by embedding the MSRF and BSCA modules in every stage. The encoder consists of 8 stacked encoding stages, with the number of feature channels evolving as: 3 → 32, 32 → 64, 64 → 128, 128 → 256, 256 → 512, 512 → 512, 512 → 512, 512 → 512.

Each stage contains two successive Convolution-Normalization-ReLU (ConvNormReLU) blocks. The first block controls spatial downsampling: stage 1 retains the input resolution (stride 1), while stages 2–8 halve the feature map size through stride 2 convolutions. The second ConvNormReLU block always uses stride 1 for further nonlinear transformation without changing dimensions. Every ConvNormReLU applies a $3 \times 3$ convolution, followed by Instance Normalization and LeakyReLU activation. Instance Normalization is chosen for its robustness to contrast variations common in remote sensing images and its compatibility with small-batch training.
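The channel and stride schedule described above fixes the feature map size at every encoder stage. The short sketch below traces it for a $512 \times 512$ input; the function name `encoder_schedule` is hypothetical, and a padding of 1 for the $3 \times 3$ convolutions is assumed, since the text does not state it.

```python
def encoder_schedule(input_size=512,
                     channels=(32, 64, 128, 256, 512, 512, 512, 512)):
    """Return (stage, out_channels, feature_map_size) for each encoder stage."""
    schedule = []
    size = input_size
    for stage, out_ch in enumerate(channels, start=1):
        stride = 1 if stage == 1 else 2  # only stage 1 keeps the input resolution
        # A 3x3 convolution with padding 1 (assumed) maps size -> ceil(size / stride).
        size = (size + stride - 1) // stride
        schedule.append((stage, out_ch, size))
    return schedule


for stage, out_ch, size in encoder_schedule():
    print(f"stage {stage}: {out_ch:3d} channels, {size} x {size}")
```

Seven stride-2 stages divide the side length by $2^7 = 128$, so the deepest feature map is $4 \times 4$ with 512 channels.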

A key innovation of our encoder is the placement of the MSRF module and the BSCA module after the two ConvNormReLU blocks within each stage, as depicted in Figure 6. This sequential arrangement ensures that, after initial feature extraction and resolution change, the feature maps are enriched with multi-scale contextual information and dynamically recalibrated on a per-channel basis before being passed to the next stage.

Multi-Scale Receptive Field Enhancement (MSRF). Ships in remote sensing images span from a few pixels (small fishing boats) to hundreds of pixels (large cargo vessels), rendering single-kernel convolutions insufficient. To address this, we design the MSRF block, illustrated in Figure 6b, which captures multi-scale information without any spatial pooling, a crucial requirement for preserving the details of small targets.

MSRF consists of three parallel convolutional branches: Branch 1 uses a $1 \times 1$ convolution to retain the current-scale representation; Branch 2 employs a standard $3 \times 3$ convolution to capture local neighborhood context; and Branch 3 applies a $3 \times 3$ dilated convolution with a dilation rate of 2, which expands the receptive field without introducing additional parameters.

The outputs of the three branches are concatenated along the channel dimension and fused by a $1 \times 1$ convolution, which compresses the channels back to the original number. The fused output then passes through Instance Normalization and LeakyReLU and is finally added to the original input via a residual connection. This design contrasts with the Atrous Spatial Pyramid Pooling (ASPP) commonly used in segmentation, which often includes a global average-pooling branch that can smear fine spatial details. By deliberately excluding any pooling operation, MSRF ensures that small-ship features remain intact while simultaneously capturing broader context for large vessels.
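The claim that the dilated branch widens the receptive field at no parameter cost can be checked with simple arithmetic: a kernel of size $k$ with dilation $d$ covers $k + (k - 1)(d - 1)$ pixels per side, while its weight count depends only on $k$. The sketch below tabulates this for the three branches. It assumes that every branch maps $C$ channels to $C$ channels, which the text does not state, and the helper names are hypothetical.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_weights(c_in, c_out, kernel):
    """Weight count of a dense 2D convolution, biases ignored."""
    return c_in * c_out * kernel * kernel


def msrf_summary(channels):
    """Per-branch (effective extent, weight count) plus the 1x1 fusion layer.

    Assumes each branch maps `channels` -> `channels` (not stated in the text).
    """
    branches = {
        "1x1": (effective_kernel(1, 1), conv_weights(channels, channels, 1)),
        "3x3": (effective_kernel(3, 1), conv_weights(channels, channels, 3)),
        "3x3, dilation 2": (effective_kernel(3, 2), conv_weights(channels, channels, 3)),
    }
    fusion = conv_weights(3 * channels, channels, 1)  # concat of 3 branches -> C
    return branches, fusion


branches, fusion = msrf_summary(channels=32)
for name, (extent, weights) in branches.items():
    print(f"{name:16s} covers {extent} x {extent}, {weights} weights")
print(f"1x1 fusion: {fusion} weights")
```

The dilated branch covers a $5 \times 5$ neighborhood with the same weight count as the standard $3 \times 3$ branch.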

Background Suppression Channel Attention (BSCA). In maritime images, sea surface, waves, glint, and coastal land often dominate the pixel count, and their textures can trigger strong responses in certain feature channels, leading to false positives or blurred boundaries. To actively suppress these background-induced features, we propose the BSCA module, which goes beyond standard channel attention through a carefully designed bias mechanism.

As shown in Figure 6c, BSCA first applies global average pooling to obtain channel-wise statistics $z \in \mathbb{R}^{B \times C}$. These statistics are then passed through a bottleneck of two fully connected layers with a reduction ratio $r = 16$ (ReLU in between). Crucially, unlike the conventional Squeeze-and-Excitation (SE) block, we introduce a learnable bias term $b$ before the Sigmoid activation, initialized to $-2.0$. The channel attention weights are computed as:

$$s = \sigma\left(W_{2} \cdot \delta\left(W_{1} \cdot z\right) + b\right), \tag{1}$$

where $\sigma$ denotes the Sigmoid function, $\delta$ is the ReLU activation, $W_{1}$ and $W_{2}$ are the FC layer weights, and $b$ is the learnable bias. The final output is:

$$Y = X \odot s. \tag{2}$$

The negative initialization of $b$ biases the Sigmoid input toward negative values, yielding initial channel weights of approximately $0.12$ (since $\sigma(-2) \approx 0.12$). This means that, at the start of training, the network is encouraged to suppress all channels. As training proceeds, the network learns to increase the weights of channels that encode ship-related features while keeping background-associated channels suppressed. This data-driven, soft-gating mechanism implements an implicit “background subtraction” without requiring any explicit background labels. Our design is fundamentally different from the standard SE block, where the implicit initial weight is around $0.5$, which provides no such background-suppression prior.
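A plain-Python sketch of Equations (1) and (2) for a single sample makes the effect of the bias concrete. It is an illustration under assumptions rather than the authors' implementation: tensors are nested lists, the toy example uses 4 channels with a reduction ratio of 2 instead of $r = 16$, and the second FC layer is set to zero so that the gate value depends on the bias alone. The names `bsca`, `matvec`, and `sigmoid` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def bsca(x, w1, w2, bias):
    """Apply Eqs. (1)-(2) to one sample.

    x    : list of C channels, each an H x W nested list
    w1   : (C // r) x C weights of the first FC layer
    w2   : C x (C // r) weights of the second FC layer
    bias : list of C biases added before the Sigmoid
    """
    # Squeeze: global average pooling gives one statistic per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    hidden = [max(0.0, h) for h in matvec(w1, z)]  # ReLU
    s = [sigmoid(a + b) for a, b in zip(matvec(w2, hidden), bias)]
    # Excite: scale every pixel of channel c by its gate s[c].
    y = [[[v * s_c for v in row] for row in ch] for ch, s_c in zip(x, s)]
    return y, s


x = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]  # C = 4, H = W = 2
w1 = [[0.1] * 4, [0.2] * 4]                       # toy reduction ratio r = 2
w2_zero = [[0.0, 0.0] for _ in range(4)]          # assumed: isolates the bias term
y_bsca, gates_bsca = bsca(x, w1, w2_zero, bias=[-2.0] * 4)
_, gates_se = bsca(x, w1, w2_zero, bias=[0.0] * 4)
print(gates_bsca[0], gates_se[0])  # about 0.119 versus 0.5
```

With a non-zero second layer the initial gates deviate from $\sigma(-2)$ by the size of the learned term, so the value of about 0.12 should be read as the starting point that the bias sets, not as an exact guarantee.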

3.4. Decoder

The decoder restores the deep semantic features to the original resolution for pixel-level prediction. Unlike the standard U-Net decoder, which relies on transposed convolutions or bilinear interpolation, our decoder adopts two consecutive improvements: dynamic upsampling and an output-stage refinement module (MSR), described in Section 3.5.

The decoder consists of 7 upsampling stages, detailed in Figure 7. We denote the low-resolution input to the current stage as $F_{low} \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $C_{in}$ denotes the number of input channels. First, $F_{low}$ is upsampled using the dynamic upsampler DySample [45]. Its core idea is to adaptively determine the sampling position of each target pixel by learning the offsets of the sampling points, thereby achieving more flexible spatial detail restoration than fixed grid sampling. DySample first uses a $1 \times 1$ convolution to predict the sampling offsets $O \in \mathbb{R}^{B \times 2G \times H \times W}$, where $G$ denotes the number of groups. It then generates sampling coordinates from the initial sampling grid and the offsets, and finally obtains the upsampled feature $F_{up} \in \mathbb{R}^{B \times C_{in} \times 2H \times 2W}$ through grid sampling. This process can be expressed as:

$$F_{up} = \mathrm{DySample}(F_{low}, O). \tag{3}$$

Compared to transposed convolutions, DySample does not introduce large learnable kernels that may cause checkerboard artifacts, nor does it rely on fixed interpolation schemes that are blind to image content. Instead, it flexibly adjusts the resampling positions based on the input features, which is especially beneficial for recovering sharp ship boundaries.

Next, $F_{up}$ is concatenated with the skip connection feature $F_{skip}$ from the corresponding encoder stage along the channel dimension, yielding $F_{cat} \in \mathbb{R}^{B \times (C_{in} + C_{skip}) \times 2H \times 2W}$. The concatenated features are then processed by two ConvNormReLU blocks: the first compresses the channel number from $C_{in} + C_{skip}$ to $C_{skip}$, and the second further refines the fused features without changing the channel count.

This process repeats for 7 stages, progressively restoring the resolution from $4 \times 4$ back to $512 \times 512$, with the final channel depth becoming 32. The skip connections remain crucial, but the decoder’s distinguishing factor lies in the combination of DySample and the subsequent MSR module, which together achieve sharper and more accurate results than conventional U-Net decoders.
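The shape bookkeeping of one decoder stage, and of the full 7-stage path, can be traced with a short Python sketch. It only tracks tensor shapes (batch dimension omitted) and performs no actual upsampling. The channel widths passed in the example are assumptions for illustration; the text only fixes the final depth of 32.

```python
def decoder_resolutions(start: int = 4, stages: int = 7) -> list:
    """Spatial size after each 2x upsampling stage, starting from `start`."""
    sizes = [start]
    for _ in range(stages):
        sizes.append(sizes[-1] * 2)
    return sizes


def decoder_stage_shapes(c_in: int, c_skip: int, h: int, w: int) -> dict:
    """(channels, height, width) of the intermediate features in one stage."""
    return {
        "up": (c_in, 2 * h, 2 * w),            # after DySample upsampling
        "cat": (c_in + c_skip, 2 * h, 2 * w),  # after concatenation with the skip feature
        "out": (c_skip, 2 * h, 2 * w),         # after the two ConvNormReLU blocks
    }


print(decoder_resolutions())
# Hypothetical last stage: 64 input channels fused with a 32-channel skip feature.
print(decoder_stage_shapes(c_in=64, c_skip=32, h=256, w=256))
```

Seven doublings take the $4 \times 4$ bottleneck to $512 \times 512$, which matches the stage count stated above.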

3.5. MSR Module

In standard segmentation networks, the decoder output is typically passed directly through a $1 \times 1$ convolution to produce the final prediction. However, we observe that the output features may still exhibit blurred boundaries and weak responses in small target regions. To address these residual artifacts, we propose the Multi-Scale Refinement (MSR) module, which acts as a lightweight post-processing enhancement between the decoder’s final feature map and the classification layer. Its structure is shown in Figure 8.

Let the decoder output be $F \in \mathbb{R}^{B \times C \times H \times W}$. MSR applies two parallel branches to $F$: Branch 1 uses a $3 \times 3$ standard convolution to extract local boundary details, while Branch 2 employs a $3 \times 3$ dilated convolution with a dilation rate of 2 to capture wider spatial context.

Both branch outputs are activated by LeakyReLU and then concatenated, resulting in $F_{cat} \in \mathbb{R}^{B \times 2C \times H \times W}$. A $3 \times 3$ convolution compresses the channels back to 32, followed by InstanceNorm and LeakyReLU. Finally, a residual connection adds the processed features to the original $F$, yielding the refined features.

The design is extremely lightweight and introduces negligible computational overhead. By operating directly on the features that are about to be projected to the output logits, MSR implicitly performs multi-scale edge refinement: the standard branch emphasizes fine discontinuities, while the dilated branch considers larger spatial context to suppress false alarms from isolated noisy pixels. The residual design ensures that the refinement does not disturb the already well-learned features, especially at early training stages. This module is one of the key differentiators from many existing segmentation networks that lack dedicated output refinement.
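To make the "lightweight" claim and the role of the dilated branch more concrete, the sketch below computes the effective kernel size of each MSR branch and a rough convolution parameter count. It is an estimate under stated assumptions: each branch keeps $C = 32$ channels, every convolution has a bias term, and normalization parameters are ignored. The function names are hypothetical.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel: int, dilation: int) -> int:
    """Padding that keeps the spatial size unchanged at stride 1 (odd kernels)."""
    return dilation * (kernel - 1) // 2


def conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights of a 2D convolution; dilation does not add parameters."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def msr_conv_params(channels: int = 32) -> int:
    standard_branch = conv_params(channels, channels, 3)
    dilated_branch = conv_params(channels, channels, 3)
    fusion = conv_params(2 * channels, channels, 3)  # compresses 2C back to C
    return standard_branch + dilated_branch + fusion


print(effective_kernel(3, 1), effective_kernel(3, 2))  # extent of each branch
print(same_padding(3, 1), same_padding(3, 2))          # padding per branch
print(msr_conv_params(32))
```

Under these assumptions the dilated branch covers a $5 \times 5$ neighborhood at the parameter cost of a $3 \times 3$ kernel, and the three convolutions together stay in the tens of thousands of parameters.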

3.6. Comparison with Related Architectures

MSRF shares the high-level goal of multi-scale context capture with Inception [46] and ASPP, but diverges in details tailored to ship segmentation. Inception modules employ multiple kernel sizes and typically include a pooling branch; however, pooling reduces spatial resolution and degrades the fine boundary information critical for small vessels. ASPP captures multi-scale context via parallel dilated convolutions and a global average-pooling branch. While effective for large objects, the global pooling collapses spatial details into a single vector per channel, risking suppression of tiny vessel signatures. In contrast, our MSRF deliberately avoids any pooling and uses only three branches: a $1 \times 1$ convolution, a $3 \times 3$ standard convolution, and a $3 \times 3$ dilated convolution (rate 2). This design preserves full resolution in every branch, retaining small-ship features while aggregating broader context. Moreover, the residual connection makes MSRF pluggable into each encoder stage with far fewer parameters than deep Inception architectures.

The key innovation of BSCA is its learnable negative bias, which fundamentally distinguishes it from standard Squeeze-and-Excitation (SE) modules. In SE, channel weights generated by two fully connected layers initially center around 0.5, treating all channels equally. This agnostic start forces the network to learn relevant channels solely through gradient descent, which is inefficient when background-associated channels dominate early training. BSCA introduces a trainable bias $b$ initialized to $-2.0$, making the initial channel weights approximately 0.12. This imposes strong suppression on all channels at the start of training. Subsequently, only channels carrying discriminative ship features are up-weighted, while background-dominated channels remain suppressed by the negative prior. This provides implicit, data-driven background subtraction from the earliest stages. Though simple, BSCA explicitly addresses the maritime reality where most pixels belong to the background. Compared with more complex attention variants, it adds negligible cost and significantly improves segmentation in ports and nearshore scenes.

The decoder of ShipMS-BSNet departs from U-Net, MSCF-Net, and MASSNet in two respects. First, it replaces transposed convolution or bilinear interpolation with the DySample dynamic upsampler. The former methods can cause checkerboard artifacts or over-smoothed boundaries, whereas DySample learns per-pixel sampling offsets from the feature content, recovering sharper ship contours. Second, we append an MSR module to the decoder output. Most networks directly project features to predictions via a single $1 \times 1$ convolution. MSR instead applies parallel standard and dilated convolutions with a residual connection for edge-aware refinement, sharpening boundaries and boosting small targets with negligible overhead.

MSCF-Net and MASSNet represent recent advances in remote sensing segmentation but follow different strategies. MSCF-Net employs pyramid-like multi-scale fusion and standard channel attention without background suppression bias. MASSNet uses self-attention yet relies on conventional upsampling and lacks output refinement. In contrast, ShipMS-BSNet jointly addresses three challenges: extreme scale variation, dominant background interference, and boundary recovery. These components are interconnected: MSRF enriches features, BSCA suppresses background channels, DySample recovers resolution, and MSR performs final refinement. This integrated design consistently outperforms generic architectures on small and multi-scale ship extraction from complex maritime backgrounds.

3.7. Loss Function

Binary Cross-Entropy loss [47] calculates the discrepancy between predicted probabilities and ground-truth labels pixel by pixel. Let the probability map predicted by the model be $P \in [0, 1]^{H \times W \times 2}$, where $P_{1}$ denotes the probability of the ship class and $P_{0}$ denotes the probability of the background class. The ground-truth label is denoted as $Y \in \{0, 1\}^{H \times W}$, where 1 represents the ship and 0 represents the background. $L_{BCE}$ is calculated as:

$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \cdot \log(p_{i}) + (1 - y_{i}) \cdot \log(1 - p_{i}) \right], \tag{4}$$

where $p_{i}$ is the predicted probability that the $i$th pixel belongs to the ship class, $y_{i} \in \{0, 1\}$ denotes the ground-truth label of the $i$th pixel (0 for background and 1 for ship), and $N$ is the total number of image pixels. BCE loss optimizes each pixel independently with stable gradients, which can provide clear pixel-wise learning signals for the network, but is susceptible to class imbalance.

Dice loss [48] is derived from the Dice coefficient, which measures the overlap between the predicted segmentation map and the ground-truth label. Its calculation formula is given by:

$$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} p_{i} y_{i} + \epsilon}{\sum_{i=1}^{N} p_{i} + \sum_{i=1}^{N} y_{i} + \epsilon}, \tag{5}$$

where $\epsilon$ is a smoothing factor to avoid division by zero.

Dice loss directly maximizes the Dice coefficient of the segmentation result and is insensitive to the pixel ratio between the foreground and background, which can effectively alleviate the class imbalance problem. However, in the early stage of training, when the predicted region and the ground-truth region are completely disjoint, the gradient of Dice loss may be zero or oscillate, leading to unstable convergence.
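The weak early-training signal described above can be made explicit by differentiating Equation (5). The derivation below is a sketch under one assumption not stated in the text: each probability is produced by a Sigmoid over a logit, $p_{j} = \sigma(z_{j})$. The shorthand symbols $I$, $S$ and $z_{j}$ are introduced here only for illustration.

```latex
I = \sum_{i=1}^{N} p_{i} y_{i}, \qquad
S = \sum_{i=1}^{N} p_{i} + \sum_{i=1}^{N} y_{i} + \epsilon, \qquad
L_{Dice} = 1 - \frac{2I + \epsilon}{S}

\frac{\partial L_{Dice}}{\partial p_{j}} = \frac{(2I + \epsilon) - 2 y_{j} S}{S^{2}}, \qquad
\frac{\partial L_{Dice}}{\partial z_{j}} = \frac{\partial L_{Dice}}{\partial p_{j}} \, p_{j} (1 - p_{j})
```

When the prediction and the ground truth do not overlap, $I \approx 0$. Background pixels ($y_{j} = 0$) then receive a gradient of only $\epsilon / S^{2}$, and ship pixels ($y_{j} = 1$) receive a gradient that is scaled by $p_{j}(1 - p_{j})$, which is close to zero wherever the network confidently predicts background.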

In ship extraction tasks, the pixel proportion of ship targets in high-resolution remote sensing images is usually extremely small, and background pixels occupy an absolutely dominant position. The statistics of the target proportion in the dataset are given in Table 2. The ship dataset adopted in this paper contains nearly 70,000 images, among which 58.0% have a target proportion lower than 1% and 90.3% lower than 5%, leading to an extremely severe class imbalance problem.

In view of this characteristic, if only the standard Binary Cross-Entropy (BCE) loss is used for optimization, the model tends to predict all pixels as the background, resulting in serious missed detection of small targets [49]. If only Dice loss is employed, although it can alleviate class imbalance, the gradient is unstable in the early training stage, and the convergence speed is slow [50]. Therefore, we adopt a weighted combination of BCE loss and Dice loss to balance pixel-level classification accuracy and regional overlap, thereby improving the segmentation and detection capability of the model for small targets. The total loss function is defined as:

$$L = \alpha \cdot L_{Dice} + \beta \cdot L_{BCE}, \tag{6}$$

where $\alpha$ and $\beta$ are weight coefficients.

According to the dataset statistics, ship targets occupy less than 5% of the pixels in most images, leading to a severe class imbalance issue. The weighting strategy is inspired by the balanced cross-entropy formulation introduced in Focal Loss for dense object detection [51], where a higher weight is assigned to the foreground class to prevent the loss from being dominated by easy negatives. Accordingly, we set $\alpha = 0.7$ for the Dice loss and $\beta = 0.3$ for the BCE loss. This assignment allows the Dice loss, which directly optimizes the spatial overlap of ship regions, to drive the optimization with greater force, while the BCE loss provides stable, pixel-wise gradient signals that help maintain training stability. The specific values of 0.7 and 0.3 are chosen to reflect the approximate inverse frequency ratio between the foreground and background, following the principle that the minority class should receive a proportionally higher loss weight, as commonly practiced in dense prediction tasks.
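Equations (4) to (6) can be exercised on a toy example to see why the two terms complement each other under class imbalance. The following plain-Python sketch is illustrative only: the smoothing value `eps`, the clipping constant, and the 100-pixel toy image with a single ship pixel are assumptions, not values taken from the experiments.

```python
import math


def bce_loss(probs, labels, clip=1e-7):
    """Equation (4), with probabilities clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, clip), 1.0 - clip)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def dice_loss(probs, labels, eps=1e-6):
    """Equation (5); eps is the smoothing factor (value assumed here)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(labels) + eps)


def combined_loss(probs, labels, alpha=0.7, beta=0.3):
    """Equation (6) with the weights used in this paper."""
    return alpha * dice_loss(probs, labels) + beta * bce_loss(probs, labels)


# Toy image: 1 ship pixel among 100 (a 1% target proportion).
labels = [1] + [0] * 99
all_background = [0.01] * 100            # model ignores the ship
good_prediction = [0.9] + [0.01] * 99    # model finds the ship

for name, probs in [("all background", all_background), ("good", good_prediction)]:
    print(name,
          round(bce_loss(probs, labels), 4),
          round(dice_loss(probs, labels), 4),
          round(combined_loss(probs, labels), 4))
```

In this toy case the all-background prediction is barely penalized by BCE but strongly penalized by Dice, so the weighted sum separates the two predictions far more clearly than BCE alone.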

4. Experiments

4.1. Implementation Details

ShipMS-BSNet was implemented in PyTorch v2.6.0. To verify the effectiveness of the proposed method, we perform a systematic comparative analysis between ShipMS-BSNet and current mainstream semantic segmentation algorithms on the self-constructed remote sensing ship dataset and the public HRSC2016 dataset. The competitors are U-Net, nnU-Net, Swin-unet [52], SegFormer, TransUNet and MSCF-Net. To ensure a fair comparison, all models were trained from scratch without pre-trained weights, and an identical data augmentation pipeline and the same set of hyperparameters were applied across all architectures.

All experiments were conducted on a workstation with an NVIDIA GeForce RTX 4090 GPU and an Intel Core i9-12900KF CPU running Windows 10 Professional. The software environment comprised Python 3.10.18, PyTorch 2.6.0, and CUDA 12.4.

Both datasets were split into training, validation and test sets with a ratio of 7:2:1. The training and validation sets supported model development and tuning, while the test set was reserved exclusively for the final assessment of model performance. All models were trained with the AdamW optimizer under a consistent setup: an initial learning rate of $1 \times 10^{-5}$, a batch size of 32, and a fixed random seed of 42. Each architecture was trained for 50 epochs, and the checkpoint that performed best on the validation data was saved for the subsequent test phase.
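A reproducible 7:2:1 split of this kind can be sketched in a few lines of standard-library Python. The text does not specify how the split was drawn or how remainders were rounded, so the shuffling procedure and the floor-based rounding below are assumptions, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples: int, ratios=(7, 2, 1), seed: int = 42):
    """Shuffle sample indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_indices(69407)
print(len(train), len(val), len(test))
```

Fixing the seed makes the partition repeatable across runs, which is what allows all competing models to be compared on identical test samples.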

4.2. Evaluation Metrics

The performance of the ShipMS-BSNet model was assessed using a comprehensive set of four standard segmentation metrics. These metrics, employed to quantify segmentation accuracy, are: precision, Recall, F1-score, and Intersection over Union (IoU).

Precision measures the correctness of the predicted ship pixels. It shows the reliability of the segmentation results. Its calculation formula is:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{7}$$

where True Positive (TP) represents the ship pixels that are accurately identified by the model. Conversely, false positive (FP) corresponds to background pixels that are erroneously assigned to the ship category.

Recall measures the model’s ability to capture all genuine ship pixels present in the image. It tells how well the model detects ships. Its calculation formula is:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{8}$$

where FN (false negative) is the count of ship pixels misclassified as the background.

The F1-score balances the trade-off between precision and Recall, making it suitable for handling datasets with skewed sample ratios. The calculation formula is:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{9}$$

Intersection over Union (IoU) is defined as the size of the overlapping region divided by the combined area of the predicted segmentation and the ground-truth annotation. It serves as a core benchmark for evaluating pixel-level alignment in image segmentation. The calculation formula is:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}. \tag{10}$$
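Equations (7) to (10) translate directly into code. The sketch below counts TP, FP and FN on a pair of flattened binary masks and derives the four metrics; the example masks are made up, and returning 0.0 for an empty denominator is a convention assumed here rather than one stated in the text.

```python
def confusion_counts(pred, truth):
    """Count TP, FP and FN over two flattened binary masks (1 = ship)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def segmentation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, Recall, F1-score and IoU from pixel counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Hypothetical 8-pixel masks.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(pred, truth)
metrics = segmentation_metrics(*counts)
print(counts, metrics)
```

When all four metrics are computed from the same pixel counts, IoU and F1 are tied by $\mathrm{IoU} = F1 / (2 - F1)$, so IoU is always the stricter of the two.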

4.3. Results on the Public HRSC2016 Dataset

To verify the effectiveness of the proposed ship extraction method, comparative experiments are conducted in this section on the public HRSC2016 ship dataset. Table 3 summarizes the quantitative performance results of all compared methods on the HRSC2016 test set. The proposed method achieves the best values in three metrics: precision, F1-score and IoU. Its Recall is second only to that of MSCF-Net, so it delivers the best overall performance.

ShipMS-BSNet achieves a precision of 0.846, 1.6% higher than the second-best MSCF-Net and 3.5% higher than nnU-Net, indicating it maintains a relatively low false positive rate on the HRSC2016 dataset. However, as shown in the second column of Figure 9, all models exhibit misrecognition in the upper right region, incorrectly segmenting sea surface ripples as ships. This phenomenon reflects a common challenge for current models in handling complex sea backgrounds, where the texture and brightness of ripples under light reflection may resemble parts of ships, leading to confusion. Notably, although ShipMS-BSNet also shows this misrecognition, its misdetected area is relatively smaller compared to methods like U-Net and nnU-Net, consistent with its higher precision.

In terms of Recall, MSCF-Net ranks first with 0.837, followed closely by ShipMS-BSNet with 0.835. Their Recall values are comparable, and both significantly outperform Swin-unet (0.819) and nnU-Net (0.811), demonstrating good small target detection capability for both methods. Although ShipMS-BSNet has a slightly lower Recall, its F1-score still surpasses that of MSCF-Net when combined with precision, achieving a better balance between false positives and false negatives.

Figure 9 shows typical segmentation results of all methods on the HRSC2016 test set, selecting two scenarios: moored ships in ports and single offshore ships. In complex port scenarios with building interference, compared methods often misclassify docks and water ripples as ships (e.g., the upper right region in the second column). In simple offshore scenarios, all methods can fully detect ships, but compared methods have blurred edges. The proposed method achieves more accurate hull contour restoration, further validating its effectiveness in handling both complex backgrounds and fine-grained segmentation.

4.4. Results on the Self-Constructed Remote Sensing Ship Dataset

After verifying the effectiveness of the proposed ShipMS-BSNet on the public HRSC2016 dataset, we further conduct experiments on our self-constructed remote sensing ship dataset to evaluate its generalization ability in real-world scenarios.

Table 4 summarizes the quantitative performance results of all compared methods on the test set of our self-constructed ship dataset. ShipMS-BSNet outperforms all other methods in the four core metrics, demonstrating the effectiveness of the proposed approach.

Figure 10 presents representative segmentation examples of all compared methods on the test set. In the case of extremely small targets, U-Net suffers from severe missed detections. Swin-unet and nnU-Net achieve partial detection but generate coarse boundaries. In contrast, both MSCF-Net and ShipMS-BSNet can accurately locate these targets, with ShipMS-BSNet yielding more intact contours. For medium-sized vessels, all methods detect the targets successfully, while ShipMS-BSNet demonstrates superior edge precision and higher overlap with the ground truth.

4.5. Feature Map Visualization

To verify the encoder’s semantic focusing and feature extraction capabilities, we average the first eight channels of features from encoder stage 4. Figure 11 shows the average feature map and heatmap overlaid on the original image.
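The channel-averaging step can be sketched with nested Python lists standing in for a feature tensor of shape (channels, height, width). The min-max normalization used to turn the averaged map into heatmap intensities is an assumption, since the text does not state how the heatmap was scaled, and the function names and toy feature values are hypothetical.

```python
def mean_feature_map(features, num_channels: int = 8):
    """Average the first `num_channels` channels of a (C, H, W) nested list."""
    chosen = features[:num_channels]
    height, width = len(chosen[0]), len(chosen[0][0])
    return [[sum(ch[r][c] for ch in chosen) / len(chosen) for c in range(width)]
            for r in range(height)]


def min_max_normalize(fmap):
    """Rescale a 2D map to [0, 1] so it can be rendered as a heatmap."""
    flat = [v for row in fmap for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in fmap]
    return [[(v - low) / (high - low) for v in row] for row in fmap]


# Toy features: 10 channels of size 2x2, active only in the top-left pixel.
features = [[[float(c), 0.0], [0.0, 0.0]] for c in range(10)]

average = mean_feature_map(features)
heatmap = min_max_normalize(average)
print(average, heatmap)
```

In practice the normalized map would be resized to the input resolution and blended with the original image to produce an overlay like the one in Figure 11.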

The average feature map shows that deep high-activation regions coincide with ship bodies, with significant activations for small ships and no false background activations. Background regions like the sea and docks have extremely low responses. The heatmap further confirms that high-activation regions fully cover ships of different scales, with stable responses to contours and details, and no feature diffusion or foreground–background confusion.

4.6. Segmentation Result Visualization

To intuitively verify the practical segmentation performance of the proposed model, we visualize the full pipeline of segmentation results on test samples, as shown in Figure 12, including the original image, predicted mask, color segmentation, and overlay display.

The test sample is a coastal port remote sensing image containing two ship targets of different scales and shapes, with typical interference, such as adjacent docks and sea surface texture noise. The results show that the model can generate complete binary segmentation masks for both ship targets without obvious holes or fractures. The predicted contours are highly consistent with the actual target boundaries. Notably, the model achieves complete segmentation of large-scale ships and accurate detection of small-scale ships simultaneously, with no missed or false detections and no mis-segmentation of docks or sea backgrounds. The overlay display further confirms that the segmentation results perfectly match the position, contour and scale of ship targets in the original image, with no obvious boundary offset or over-segmentation.

These results, together with the previous feature visualization, form a complete logical verification. The encoder’s ability to accurately focus on target features and resist background interference is the core premise of high-precision segmentation. By effectively filtering noise and anchoring the semantic regions of ships in the encoder stage, the model can restore accurate target contours in the decoder stage, achieving end-to-end pixel-level segmentation of multi-scale ships and verifying its effectiveness and robustness in remote sensing ship segmentation tasks.

4.7. Ablation Study

To verify the individual contributions and effectiveness of each core module in ShipMS-BSNet, we design a series of ablation experiments. By gradually adding key modules on the self-constructed dataset and comparing performance changes, we quantitatively analyze the roles of MSRF, BSCA and MSR in ship segmentation.

Using standard nnU-Net as the baseline, we sequentially add BSCA, MSRF, and their combination, then integrate DySample and MSR to get the full model. All models use identical training configurations and datasets. Table 5 shows the ablation results.

The ablation studies validate the effectiveness of each proposed module. The BSCA module slightly improves segmentation performance by suppressing background interference and reducing false positives via a learnable negative bias mechanism.

The Multi-Scale Receptive Field Enhancement (MSRF) module delivers more significant gains by expanding receptive fields through multi-branch dilated convolutions, which is particularly beneficial for multi-scale ship feature extraction.

Combining MSRF and BSCA yields performance exceeding the additive effect of individual modules, demonstrating clear synergistic benefits: MSRF provides contextual information for accurate ship–background distinction, while BSCA purifies inputs for MSRF feature fusion.

Further integrating dynamic upsampling and Multi-Scale Refinement modules optimizes boundary details and small target segmentation. Overall, the complete ShipMS-BSNet achieves 6.2% F1-score and 5.4% IoU improvements over the baseline, enabling robust ship segmentation in complex remote sensing backgrounds.

5. Discussion

5.1. Scale Advantages and Research Value of Datasets

The most significant core advantage of the self-constructed ship segmentation datasets in this study lies in their data scale and fine-grained annotation quality. The datasets contain a collection of 69,407 high-resolution remote sensing scenes. Each image was subjected to professional ArcGIS polygon annotation and rigorous quality control, ensuring precise ship boundary delineation and producing pixel-level binary masks. This substantial volume of data provides ample and diverse learning samples for deep learning models, serving as a fundamental guarantee for mitigating overfitting and enhancing model generalization.

Compared to existing mainstream ship datasets (such as HRSC2016 and MCSD), which typically contain thousands to tens of thousands of images, this dataset achieves an order-of-magnitude increase in scale. This expansion not only represents quantitative growth but also enables more comprehensive coverage of real-world complex scenarios. The dataset includes diverse environments ranging from nearshore to open sea, fair weather to complex meteorological interference, and sparse to densely clustered scenes. Moreover, the ship targets maintain high diversity and balance in scale, morphology, and quantity distribution, with 1000 negative samples included to enhance model robustness. Such a large-scale, accurately annotated dataset effectively addresses the long-standing “data scarcity” issue in this field, laying a solid foundation for training deeper and more advanced segmentation models, and is expected to become a new benchmark in ship segmentation research.

Benefiting from its large scale and high quality, this dataset offers broad applicability and significant potential for future expansion. It not only furnishes dependable training and evaluation support for the present study but also establishes an elevated baseline that enables the global research community to benchmark algorithms equitably, refine models for complex maritime scenes, advance small target detection, and propel semantic segmentation research.

5.2. Performance of ShipMS-BSNet on Real Large Remote Sensing Images

Although the experiments in Section 4 have comprehensively evaluated the segmentation performance of ShipMS-BSNet on standard datasets, there remains a critical gap between academic research and real-world engineering applications: all previous experiments were conducted on pre-cropped 512 × 512 image patches, while actual remote sensing images are usually gigapixel-scale and cannot be directly input into the model for end-to-end inference.

To bridge this gap, we collected ultra-high-resolution satellite images covering large ocean areas from Google Earth and adopted a sliding window cropping strategy to generate 512 × 512 image patches. The proposed model was applied for parallel inference, and the prediction results were seamlessly stitched to generate a complete segmentation map. This approach allows us to comprehensively evaluate the model’s performance in real-world complex environments.
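The sliding-window cropping step can be sketched as follows. This is a simplified illustration rather than the pipeline used in the experiments: the stride (and hence any overlap between neighboring windows) is not specified in the text, so it is left as a parameter, and the last window in each direction is shifted inward so that the image border is always covered.

```python
def tile_origins(length: int, tile: int = 512, stride: int = 512) -> list:
    """Start offsets of windows along one axis, covering the full length."""
    if length <= tile:
        return [0]  # a smaller image would need padding up to the tile size
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # extra window aligned with the border
    return origins


def tile_boxes(height: int, width: int, tile: int = 512, stride: int = 512) -> list:
    """(top, left, bottom, right) boxes of all windows over an image."""
    return [(y, x, y + tile, x + tile)
            for y in tile_origins(height, tile, stride)
            for x in tile_origins(width, tile, stride)]


# Hypothetical 1200 x 1024 scene cut into 512 x 512 patches.
boxes = tile_boxes(1200, 1024)
print(len(boxes), boxes[-1])
```

Each box is cropped, segmented independently, and written back to the same coordinates of the output mask. Where windows overlap, the predictions have to be merged, for example by averaging probabilities, and that merging rule is a design choice this sketch leaves open.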

As shown in Figure 13, ShipMS-BSNet successfully processes ultra-large remote sensing images within memory constraints and maintains consistent high-precision segmentation across the entire image. Notably, no obvious stitching artifacts, missed detections or false positives are observed at the window boundaries. This is attributed to the synergistic effect of the proposed modules: the BSCA module suppresses background noise at window edges and reduces false detections, while the MSRF module with large receptive fields ensures that targets partially located at window edges can still be completely captured.

Furthermore, the Google Earth images used in this experiment differ from the training dataset in terms of imaging satellites, illumination conditions and background complexity. The stable performance of the model further verifies its cross-domain generalization ability.

5.3. Performance of ShipMS-BSNet in Different Marine Backgrounds

Figure 14 presents the segmentation performance of ShipMS-BSNet under various ship scales, target densities, marine backgrounds and weather conditions. The results demonstrate that the proposed model exhibits excellent robustness in all the above complex scenarios, fully validating its capability to accurately extract cross-scale ship targets from cluttered backgrounds.

The principal advantage of ShipMS-BSNet over conventional methods is its proven capacity to tackle the dual difficulties posed by multi-scale targets and complex environments. The elaborately designed MSRF module employs parallel dilated convolutions with varying dilation rates to simultaneously extract and integrate fine-grained local features and multi-scale contextual information, which guarantees precise segmentation of cross-scale ship targets from large inshore vessels to small offshore vessels. Meanwhile, the proposed Background Suppression Channel Attention mechanism can effectively suppress background noise from coastal docks, sea surface textures and wave clutter, significantly improving the model’s robustness against marine environmental interferences, such as clouds and sea fog, and enabling it to maintain stable segmentation performance under complex illumination and low visibility conditions.

The multi-scale feature extraction capability of MSRF and the background suppression capability of BSCA form a favorable synergistic effect: the former provides the model with cross-scale target discriminative features, while the latter accurately filters out irrelevant background noise. Their combination enables the model to achieve high-precision and high-robustness pixel-level ship segmentation even in extremely complex remote sensing marine environments.

5.4. Limitations

Although the proposed ShipMS-BSNet model demonstrates excellent performance in ship segmentation tasks, it still has several inherent limitations. Firstly, the model has high computational complexity, and both training and inference processes consume a large amount of computing resources, resulting in significant inference latency. This makes it difficult to deploy on edge computing devices with limited computing power, and it is unable to meet the stringent requirements of real-time ship monitoring in marine environments.

Secondly, the current model only uses single-modal visible light remote sensing images as input and fails to fully exploit the imaging characteristics of other remote sensing modalities, such as infrared and Synthetic Aperture Radar. Under extreme weather conditions, such as insufficient nighttime illumination and cloud cover, the detection accuracy and robustness of the model will decrease significantly. Furthermore, the current evaluation is limited to our self-constructed dataset and HRSC2016, both of which primarily consist of images from specific satellite sensors and geographical regions. The generalization capability of ShipMS-BSNet to other widely used satellite platforms (e.g., Sentinel-2, WorldView) and to diverse geographical areas remains to be validated.

Thirdly, the model adopts a fully supervised learning paradigm, and the training process relies on large-scale pixel-level fine-annotated data. However, pixel-level annotation of remote sensing images not only requires professional knowledge but is also time-consuming and labor-intensive, which greatly limits the expansion of the dataset and the popularization and application of the model in practical scenarios.

Subsequent optimization will be carried out in five directions: realize a lightweight model through knowledge distillation and lightweight architecture design to adapt to edge devices and real-time monitoring; fuse multi-source remote sensing data, such as infrared and SAR, to improve robustness in complex environments; construct a multi-sensor, multi-region benchmark to systematically evaluate and improve cross-platform generalization; study weakly supervised and semi-supervised learning to reduce annotation dependence; and deploy the model to an actual marine monitoring platform to further optimize performance in real scenarios.

6. Conclusions

To address the challenges of multi-scale target detection, severe background interference, and blurred edge details in ship segmentation from remote sensing images in complex marine environments, this paper proposes a high-performance ship semantic segmentation network, ShipMS-BSNet. The network constructs an encoder architecture integrating Multi-Scale Receptive Field Enhancement (MSRF) and Background Suppression Channel Attention (BSCA). The former captures cross-scale target features through multi-branch dilated convolutions, while the latter accurately suppresses background noise via a learnable negative bias mechanism. Their synergistic effect improves the model’s discriminative ability for ship targets in complex backgrounds. In the decoder part, traditional transposed convolution is replaced by Dynamic Sampling, and the Multi-Scale Refinement (MSR) module is introduced at the output end, which effectively solves the edge blurring problem during upsampling and enhances the segmentation integrity of small targets.

To support rigorous evaluation across diverse maritime conditions, we constructed a large-scale ship segmentation dataset comprising 69,407 image–label pairs. This dataset is characterized by its extensive coverage of multiple vessel scales—ranging from small fishing boats to large cargo ships—and its inclusion of diverse scene types, including open seas, nearshore waters, and busy ports. These attributes enable a fine-grained assessment of algorithm performance under varying degrees of background complexity and target sizes. Extensive comparative experiments and ablation studies on both this self-constructed dataset and the public HRSC2016 dataset verify the effectiveness of each proposed module. ShipMS-BSNet achieves state-of-the-art performance across multiple core metrics, with particularly pronounced advantages in complex port backgrounds and small-ship segmentation tasks, where the scale diversity and scene variety of our dataset provide a demanding and representative benchmark. Its comprehensive performance is superior to existing mainstream methods.

Future research will focus on multi-modal remote sensing data fusion to exploit the complementary information of infrared, SAR, and other modalities and to improve the model’s adaptability under extreme weather conditions. In addition, we will pursue model lightweighting through knowledge distillation, model quantization, and related techniques to support real-time marine ship monitoring systems.

Author Contributions

Conceptualization, D.L. and Z.W.; methodology, D.L. and Z.W.; software, D.L.; validation, D.L.; formal analysis, D.L., Z.W. and B.C.; investigation, D.L.; resources, H.Z. (Haibo Zeng), Z.C., Z.L. and Y.Z.; data curation, D.L.; writing—original draft preparation, D.L., Z.W. and B.C.; writing—review and editing, D.L., L.H., Z.W., L.W., H.Z. (Haibo Zeng), Z.C., Z.L., Y.Z. and H.Z. (Hua Zhang); visualization, D.L.; supervision, Z.W.; project administration, Z.W.; funding acquisition, L.H., L.W. and H.Z. (Hua Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Research Fund of the Science and Technology Innovation Platform of Changsha University of Science and Technology (No. 2025ZKPT057), the Hunan Provincial Natural Science Foundation of China (No. 2024JJ8367, No. 2025JJ80009, No. 2026JJ60618), and the Hunan Engineering Technology Research Center of Natural Resources Survey and Monitoring (No. 2018TP2040).

Data Availability Statement

The datasets in this study are available at: https://github.com/wzp8023391/Large-Scale-Ship-Segmentation-Dataset (accessed on 27 May 2026).

Acknowledgments

The authors thank the reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The regions of data acquisition and representative instance scenes.

Figure 2. Process flowchart for creating the proposed ship segmentation dataset. Red boxes represent the cropping windows generated after buffer analysis.

Figure 3. Probability density histogram of aspect ratios for all ship targets in the dataset.

Figure 4. Probability density histogram of actual areas for all ship targets in the dataset.

Figure 5. A structural overview of the ShipMS-BSNet model.

Figure 6. The structure of each encoding module, where (a) represents the structure of the downsampling module, (b) represents the structure of the MSRF module, and (c) represents the structure of the BSCA module.

Figure 7. The structure of each decoding module.

Figure 8. The structure of the MSR module.

Figure 9. Pixel-level segmentation result visualization of all compared methods on the test set of the public HRSC2016 dataset. Colors: white = TP, black = TN, red = FP, and green = FN.

Figure 10. Pixel-level segmentation result visualization of all compared methods on the test set of our self-constructed remote sensing ship dataset. Colors: white = TP, black = TN, red = FP, and green = FN.
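The colour convention used in Figures 9 and 10 can be reproduced with a few lines of code. The sketch below is a minimal pure-Python version that assumes binary masks stored as nested lists (1 = ship, 0 = background); the function names and the RGB values chosen for each colour are illustrative assumptions, not the authors' plotting code.

```python
# Colour legend from the figure captions; the RGB tuples are assumed values.
COLOURS = {
    "TP": (255, 255, 255),  # white: ship pixel predicted as ship
    "TN": (0, 0, 0),        # black: background predicted as background
    "FP": (255, 0, 0),      # red: background predicted as ship
    "FN": (0, 255, 0),      # green: ship pixel that was missed
}


def classify_pixel(pred: int, truth: int) -> str:
    """Return the confusion category of a single pixel."""
    if pred and truth:
        return "TP"
    if pred and not truth:
        return "FP"
    if not pred and truth:
        return "FN"
    return "TN"


def error_map(pred_mask, truth_mask):
    """Build an RGB error map from two equally sized binary masks."""
    if len(pred_mask) != len(truth_mask):
        raise ValueError("masks must have the same number of rows")
    rows = []
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        if len(pred_row) != len(truth_row):
            raise ValueError("mask rows must have the same length")
        rows.append([COLOURS[classify_pixel(p, t)]
                     for p, t in zip(pred_row, truth_row)])
    return rows


# Tiny 2x2 example covering all four categories.
prediction = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
result = error_map(prediction, ground_truth)
print(result)
```

Red regions in such a map indicate background mistaken for ships, and green regions indicate missed ship pixels, which is how the qualitative comparisons in the two figures are read.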

Figure 11. Average feature map and heatmap overlaid on original image.

Figure 12. Overlay comparison of segmentation results.

Figure 13. Segmentation results of ultra-large remote sensing images. (a–e) Enlarged views of the corresponding regions in the original images.

Figure 14. Segmentation results under different marine backgrounds, where (a) compares results for targets at different scales, (b) compares results in different marine environments, and (c,d) compare results under different weather conditions.

Table 1. Statistical summary of the ship datasets.

Dataset	Year	Number of Images	Label	Image Size (Pixels)	Data Source	Resolution (m)
HRSC2016	2016	1061	OBB/Mask	300 × 300–1500 × 900	Google Earth	0.4–2
Kaggle-Ship	2018	208,164	HBB	768 × 768	/	/
FGSD	2020	2612	HBB	930 × 930	Google Earth	0.12–1.93
ASRSS	2020	>40,000	HBB	768 × 768	Google Earth, DOTA, Kaggle	0.5–2
GF1-LRSD	2021	4406	HBB	512 × 512	GaoFen-1	16
ShipRSImageNet	2021	3435	HBB	930 × 930–1500 × 900	xView, HRSC2016, FGSD, Kaggle-Ship, GaoFen-2, JiLin-1	0.12–6
LARS	2022	20	OBB	10,000 × 10,000	Google Earth	0.8
MCSD	2023	12,600	OBB	1024 × 1024	Google Earth	0.25–2
SCCOS	2024	4639	OBB	1024 × 1024	Google Earth, Microsoft map, WorldView-3, Pleiades, Orbview-3, JiLin-1	0.3–1
S2-SHIPS	2021	16	Mask	1738 × 938	Sentinel-2	<10
SDS	2022	1984	Mask	3000 × 3000	Google Earth	/
Ours	2025	69,407	Mask	512 × 512	Google Earth	0.12

Table 2. Statistics on the proportion of target pixels in the dataset.

Target Pixel Proportion	Number of Images	Percentage of Images
No target (0%)	1000	1.4%
Extremely low (0–1%)	39,654	57.1%
Low (1–5%)	22,096	31.8%
Medium (5–10%)	3748	5.4%
Relatively high (10–20%)	2008	2.9%
High (>20%)	901	1.3%
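To make the foreground imbalance in Table 2 easier to read, the image counts can be accumulated across bins. The sketch below uses only the counts from the table; treating each bin's upper bound as inclusive and the helper name `cumulative_share` are assumptions made for illustration.

```python
# Image counts per target-pixel-proportion bin, copied from Table 2.
bins = [
    ("No target (0%)", 1000),
    ("Extremely low (0–1%)", 39654),
    ("Low (1–5%)", 22096),
    ("Medium (5–10%)", 3748),
    ("Relatively high (10–20%)", 2008),
    ("High (>20%)", 901),
]

total_images = sum(count for _, count in bins)


def cumulative_share(bin_counts, last_index: int) -> float:
    """Fraction of images in bins 0..last_index (inclusive)."""
    covered = sum(count for _, count in bin_counts[: last_index + 1])
    return covered / sum(count for _, count in bin_counts)


share_up_to_1_percent = cumulative_share(bins, 1)
share_up_to_5_percent = cumulative_share(bins, 2)

print(total_images)                             # 69407
print(round(100 * share_up_to_1_percent, 1))    # 58.6
print(round(100 * share_up_to_5_percent, 1))    # 90.4
```

Roughly 58.6% of the images contain at most 1% ship pixels and about 90.4% contain at most 5%, which is the kind of class imbalance that makes small-target segmentation and loss design matter for this dataset.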

Table 3. Comparative experimental results of different methods on the public HRSC2016 dataset.

Methods	Precision	Recall	F1	IoU
U-Net	0.794	0.799	0.792	0.684
nnU-Net	0.817	0.811	0.801	0.697
SegFormer	0.822	0.822	0.815	0.703
Swin-unet	0.825	0.819	0.813	0.705
TransUNet	0.829	0.826	0.821	0.711
MSCF-Net	0.832	0.837	0.829	0.721
ShipMS-BSNet	0.846	0.835	0.833	0.732

Table 4. Comparative experimental results of different methods on the self-constructed remote sensing ship dataset.

Methods	Precision	Recall	F1	IoU
U-Net	0.809	0.811	0.805	0.707
nnU-Net	0.828	0.826	0.817	0.722
SegFormer	0.833	0.829	0.819	0.726
Swin-unet	0.836	0.831	0.825	0.729
TransUNet	0.842	0.848	0.834	0.737
MSCF-Net	0.856	0.864	0.846	0.752
ShipMS-BSNet	0.879	0.875	0.868	0.761
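For readers who want to compute the four metrics reported in Tables 3 and 4 on their own predictions, the sketch below uses the standard pixel-level definitions from true-positive, false-positive, and false-negative counts. The function name and the zero-division handling are assumptions; how the paper aggregates scores over images is specified in its evaluation section and is not reproduced here, so values computed this way need not match the tables digit for digit.

```python
def segmentation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, F1-score, and IoU for the foreground (ship) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Made-up pixel counts for illustration only.
scores = segmentation_metrics(tp=80, fp=20, fn=20)
print(scores)
```

With these made-up counts, precision, recall, and F1 are all 0.8 while IoU is 80/120, about 0.667, which illustrates why IoU values are systematically lower than F1 values for the same prediction.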

Table 5. Ablation results for different module configurations. G: giga FLOPs; M: million parameters; FPS: frames per second.

Config	Precision	Recall	F1	IoU	FLOPs	Parameters	FPS
Baseline	0.828	0.826	0.817	0.722	67.51 G	46.33 M	200
Baseline + BSCA	0.838	0.834	0.822	0.727	67.52 G	46.47 M	199
Baseline + MSRF	0.852	0.855	0.836	0.735	99.04 G	71.32 M	136
Baseline + MSRF + BSCA	0.865	0.864	0.854	0.756	99.05 G	71.47 M	127
ShipMS-BSNet	0.879	0.875	0.868	0.761	99.00 G	68.49 M	118
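The accuracy and cost trade-off in Table 5 can be summarized by comparing the full model with the baseline. The sketch below uses only the first and last rows of the table; the dictionary keys and the helper `relative_change` are illustrative names.

```python
# First and last rows of Table 5 (FLOPs in G, parameters in M).
baseline = {"iou": 0.722, "flops_g": 67.51, "params_m": 46.33, "fps": 200}
full_model = {"iou": 0.761, "flops_g": 99.00, "params_m": 68.49, "fps": 118}


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` with respect to `old`."""
    return (new - old) / old * 100.0


iou_gain = round(full_model["iou"] - baseline["iou"], 3)
flops_change = round(
    relative_change(full_model["flops_g"], baseline["flops_g"]), 1)
params_change = round(
    relative_change(full_model["params_m"], baseline["params_m"]), 1)
fps_change = round(relative_change(full_model["fps"], baseline["fps"]), 1)

print(iou_gain)       # 0.039
print(flops_change)   # 46.6
print(params_change)  # 47.8
print(fps_change)     # -41.0
```

Read this way, the full model gains 0.039 IoU over the baseline at the cost of about 46.6% more FLOPs, 47.8% more parameters, and a 41% drop in frame rate, which is consistent with the paper listing model lightweighting as future work.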

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
