1. Introduction
Maritime vessel monitoring is an indispensable component of modern shipping management, maritime safety, environmental protection, and the global economy. The rapid growth in global trade has significantly increased the density and complexity of ship activities, making efficient and accurate ship extraction from remote sensing imagery a critical technology. Remote sensing, with its advantages of wide coverage, strong timeliness, and freedom from geographical constraints, has become the primary means of maritime surveillance [1, 2].
Early research on ship detection predominantly relied on traditional image processing and machine learning techniques. These methods are built upon manually designed features, encompassing strategies like thresholding, visual saliency analysis, shape and texture characterization, and transform-domain processing [3]. A fundamental limitation of these techniques is their reliance on handcrafted features, which often fail to adequately represent the diverse and variable shapes of ships, especially under differing resolutions and imaging conditions. Consequently, they exhibit poor generalization capability in complex scenarios, struggling to reliably detect small ships against cluttered backgrounds [4].
The advent of deep learning, particularly convolutional neural networks, has revolutionized this field by enabling end-to-end feature learning. Deep learning-based methods have substantially outperformed traditional techniques, demonstrating superior representational power and strong adaptability to complex scenes, thus becoming the dominant paradigm for object detection and segmentation in optical remote sensing imagery [5, 6, 7, 8, 9].
For the task of maritime ship extraction, semantic segmentation offers a distinct advantage over bounding-box object detection by providing precise, pixel-level ship contours, which are essential for high-precision applications [10, 11, 12, 13]. However, current semantic segmentation methods still face significant, unresolved challenges in the maritime domain. The complex marine environment, characterized by wave textures, sun glint, and adverse weather, often degrades image quality, leading to blurred target boundaries and mis-segmentation [14, 15]. Furthermore, the substantial scale variations among different types of ships, the difficulty in detecting small vessels, and the complexities of handling densely docked ships in harbors impose stringent demands on a model’s robustness and generalization capability [16, 17].
A critical bottleneck that exacerbates these challenges is the scarcity of high-quality, large-scale datasets specifically designed for ship semantic segmentation. Most existing public datasets are limited in volume and scene diversity, often lacking comprehensive coverage of the very scenarios that pose the greatest difficulties: small targets, multi-scale vessels, and nearshore ships with complex backgrounds [18, 19]. This data deficiency directly restricts the generalization ability and the performance ceiling of advanced deep learning models. To bridge this gap, this study aims to construct a large-scale, diverse, and pixel-level annotated ship segmentation dataset that systematically addresses the limitations of existing data resources. Furthermore, on this basis, we propose a novel deep learning model tailored to overcome the identified challenges of multi-scale feature extraction and background interference, thereby achieving robust and high-accuracy ship semantic segmentation in complex maritime environments. The primary contributions of this article are outlined as follows.
Existing public ship segmentation datasets suffer from limited sample diversity, insufficient coverage of small vessels, and restricted scene variability, which collectively constrain the generalization capability of deep learning models. To bridge this gap, we construct a large-scale, pixel-level annotated dataset comprising 69,407 image–label pairs encompassing diverse maritime environments, including open seas, nearshore waters, and ports. The dataset spans a wide range of ship scales, from small fishing boats to large cargo vessels, and undergoes rigorous multi-round quality control to ensure annotation consistency. This resource provides a standardized benchmark that directly addresses the data bottleneck impeding progress in ship segmentation research.
Standard encoder architectures face two inherent challenges in maritime scenes: single-scale convolutions cannot adequately capture ships spanning from tens to hundreds of pixels, and conventional channel attention [20] treats all channels with neutral initial weighting, making it inefficient at distinguishing weak ship signals from dominant background textures, such as waves and glint. We address these issues through two interlocking modules embedded at each encoder stage. First, the Multi-Scale Receptive Field Enhancement (MSRF) module employs parallel standard and dilated convolutions, deliberately excluding any pooling operation, to preserve fine spatial details of small vessels while simultaneously capturing broader context for large ships. Second, the Background Suppression Channel Attention (BSCA) module introduces a learnable bias initialized to a negative value before the Sigmoid activation, forcing the network to start training with strong suppression on all channels and progressively up-weight only ship-discriminative ones. This negative bias mechanism, absent in standard SE variants, provides an implicit data-driven background subtraction prior specifically tailored to scenes where background pixels overwhelmingly dominate. The synergy of these modules achieves simultaneous multi-scale perception and background suppression, yielding significant improvements in complex maritime environments.
In most segmentation networks, including recent maritime architectures, such as MSCF-Net [21] and MASSNet [22], the decoder output is directly projected to class predictions via a single convolution, leaving residual boundary blurring and weak responses in small target regions unaddressed. We propose the Multi-Scale Refinement (MSR) module, placed at the network output as a lightweight post-processing unit. It extracts multi-scale edge information with parallel standard and dilated convolutions, then fuses the enhanced features with the original input through a residual connection. This design performs fine-grained adjustments on segmentation boundaries and small target regions with negligible computational overhead, further improving boundary sharpness and target integrity in the final predictions.
The remainder of this paper is organized as follows. Section 2 reviews related work, Section 3 details the proposed methodology, Section 4 presents the experimental procedures and their outcomes, Section 5 discusses the findings, and Section 6 gives concluding remarks.
3. Data and Methods
3.1. Data
To meet the requirements for training and evaluating semantic segmentation models on ship remote sensing images, we constructed a large-scale, high-quality, and diverse dedicated dataset. The construction process consists of three stages: data acquisition, fine-grained annotation, and data processing. The data acquisition regions and representative scenes are illustrated in Figure 1.
Data acquisition and annotation: The original remote sensing image data were sourced from the Google Earth platform (earth.google.com/web/, accessed on 27 May 2026), ensuring diversity and broad coverage. The annotation process was carried out using the professional geographic information system software ArcGIS 10.4 (Esri, Redlands, CA, USA). Specialists meticulously delineated ship targets in the form of polygon vectors. This annotation method captures the contours and shapes of ships with high precision, providing accurate ground-truth information to support model learning.
Data processing and generation: To generate training samples ready for model use, we developed an automated data processing pipeline:
First, a spatial buffer analysis was performed on the annotated ship polygons to group vessels that are in close proximity to one another. Based on these grouped ship clusters, cropping windows were generated such that each cropped image contains one or more spatially adjacent ships while ensuring that no single ship instance is split across multiple crops. Subsequently, all cropped images were resized to a uniform resolution, ensuring consistency with the model’s input specifications.
The polygon vector annotations generated in ArcGIS were converted into binary masks using a custom-developed algorithm. For the binary masks, pixels corresponding to ships form the positive class, encoded as 1, and all other pixels form the negative class, encoded as 0. This process establishes a precise and consistent benchmark for model training.
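The two processing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: axis-aligned bounding boxes stand in for true polygon buffers, the buffer and margin values are arbitrary, and the function names `group_ships`, `crop_window`, and `rasterize` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox(poly: List[Point]) -> Box:
    """Axis-aligned bounding box of a polygon."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def group_ships(polys: List[List[Point]], buffer: float) -> List[List[int]]:
    """Group polygons whose boxes, each grown by `buffer`, overlap (union-find)."""
    boxes = [bbox(p) for p in polys]
    parent = list(range(len(polys)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            # Gap between the raw boxes on each axis (negative if they overlap).
            gap_x = max(a[0], b[0]) - min(a[2], b[2])
            gap_y = max(a[1], b[1]) - min(a[3], b[3])
            if gap_x <= 2 * buffer and gap_y <= 2 * buffer:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(polys)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def crop_window(polys: List[List[Point]], group: List[int], margin: float) -> Box:
    """Window covering every ship of a group, so no instance is split."""
    boxes = [bbox(polys[i]) for i in group]
    return (min(b[0] for b in boxes) - margin, min(b[1] for b in boxes) - margin,
            max(b[2] for b in boxes) + margin, max(b[3] for b in boxes) + margin)


def rasterize(poly: List[Point], width: int, height: int) -> List[List[int]]:
    """Binary mask (1 = ship, 0 = background) via the even-odd rule at pixel centers."""
    mask = [[0] * width for _ in range(height)]
    n = len(poly)
    for row in range(height):
        y = row + 0.5
        for col in range(width):
            x = col + 0.5
            inside = False
            for k in range(n):
                (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % n]
                if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
                    x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                    if x < x_cross:
                        inside = not inside
            mask[row][col] = 1 if inside else 0
    return mask
```

Taking the union of the bounding boxes in a group is what guarantees that a crop never cuts through a ship belonging to that group. A production pipeline would more likely rely on a GIS library for buffering and rasterization.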
The final dataset consists of 69,407 image–label pairs, including 1000 negative samples (i.e., images without any ship targets) to reduce false detection rates during inference and enhance model robustness. The dataset maintains high diversity and balance across multiple dimensions, such as the scale, environment, and target count:
In terms of scale diversity, it covers vessel targets ranging from small boats to large ships.
Regarding environmental variety, backgrounds include open oceans, nearshore ports, and complex weather scenarios with disturbances such as clouds and waves.
In target quantity distribution, the number of ships per image ranges from single to multiple instances, following a reasonable distribution that effectively avoids bias in model training.
All samples were meticulously annotated manually and underwent strict quality control to ensure label accuracy. The construction of this dataset establishes a solid data foundation for training and evaluating ship segmentation algorithms, while its diversity and high quality significantly enhance the model’s generalization capability in real-world complex scenarios. The overall dataset production workflow is shown in Figure 2. The probability density statistics of the aspect ratio and actual area of all ship targets in the dataset are shown in Figure 3 and Figure 4.
3.2. ShipMS-BSNet
The ship extraction network proposed in this paper, named ShipMS-BSNet, adopts an encoder–decoder architecture, taking three-channel optical remote sensing images as input and outputting a binary segmentation map consisting of background and ship classes.
nnU-Net [42], originally developed for biomedical image segmentation, is fundamentally a U-Net variant that excels at generating precise, pixel-level predictions with sharp boundaries, an inherent strength attributable to its deep encoder–decoder structure with dense skip connections and stage-by-stage resolution recovery.
In ship semantic segmentation, boundary accuracy is critical, particularly for small vessels and ships docked in complex nearshore environments. Architectures such as DeepLabV3+ [43] rely on an Atrous Spatial Pyramid Pooling (ASPP) [44] module to capture multi-scale context, but the subsequent single-step upsampling limits their ability to gradually refine object boundaries, often resulting in coarse edges around small targets. SegFormer, a Transformer-based architecture, provides excellent global context modeling but tends to produce overly smooth segmentation maps and can struggle with fine boundary details when targets are small. In contrast, nnU-Net’s multi-stage decoder with skip connections at every resolution level offers a natural advantage for preserving and gradually recovering the spatial precision demanded by ship extraction.
The overall architecture is illustrated in Figure 5. While the high-level framework inherits the well-established encoder–decoder scheme of nnU-Net, the core components have been fundamentally redesigned to address the unique challenges inherent in maritime remote sensing: extreme scale variations, complex background interference, and the need for accurate boundary delineation.
Specifically, the network comprises four main components: the encoder, the decoder, the Multi-Scale Refinement (MSR) module, and skip connections. The encoder integrates two novel modules—the Multi-Scale Receptive Field Enhancement (MSRF) block and the Background Suppression Channel Attention (BSCA) module—to capture robust multi-scale features while actively suppressing irrelevant background responses. The decoder abandons traditional transposed convolutions in favor of a dynamic upsampling module (DySample) to achieve content-adaptive resolution restoration and is further augmented by the MSR module placed at the output end for edge-aware post-processing. These designs, detailed below, collectively distinguish ShipMS-BSNet from generic U-Net variants and recent maritime segmentation models.
3.3. Encoder
The encoder progressively extracts multi-level feature representations from the input image. To meet the challenges posed by ships of drastically varying sizes and complex marine backgrounds, we have substantially modified the standard convolutional downsampling stages by embedding the MSRF and BSCA modules in every stage. The encoder consists of 8 stacked encoding stages, with the number of feature channels evolving as: 3 → 32, 32 → 64, 64 → 128, 128 → 256, 256 → 512, 512 → 512, 512 → 512, 512 → 512.
Each stage contains two successive Convolution-Normalization-ReLU (ConvNormReLU) blocks. The first block controls spatial downsampling: stage 1 retains the input resolution (stride 1), while stages 2–8 halve the feature map size through stride 2 convolutions. The second ConvNormReLU block always uses stride 1 for further nonlinear transformation without changing dimensions. Every ConvNormReLU applies a convolution, followed by Instance Normalization and LeakyReLU activation. Instance Normalization is chosen for its robustness to contrast variations common in remote sensing images and its compatibility with small-batch training.
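The stage layout described above can be traced with a few lines of Python. The channel widths and strides are taken from the text. The input size of 512 used in the example call is purely an assumption for illustration, and `encoder_shapes` is a hypothetical helper name.

```python
from typing import List, Tuple


def encoder_shapes(input_size: int) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each of the 8 encoder stages."""
    channels = [32, 64, 128, 256, 512, 512, 512, 512]  # output width per stage
    strides = [1] + [2] * 7  # stage 1 keeps the resolution, stages 2-8 halve it
    size = input_size
    shapes = []
    for c, s in zip(channels, strides):
        size //= s
        shapes.append((c, size, size))
    return shapes


# Assumed 512 x 512 input, for illustration only.
for stage, shape in enumerate(encoder_shapes(512), start=1):
    print(f"stage {stage}: {shape}")
```

Seven halvings shrink each spatial side by a factor of 128, which is why the decoder needs seven upsampling stages to return to the input resolution.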
A key innovation of our encoder is the placement of the MSRF module and the BSCA module after the two ConvNormReLU blocks within each stage, as depicted in Figure 6. This sequential arrangement ensures that, after initial feature extraction and resolution change, the feature maps are enriched with multi-scale contextual information and dynamically recalibrated on a per-channel basis before being passed to the next stage.
Multi-Scale Receptive Field Enhancement (MSRF). Ships in remote sensing images span from a few pixels (small fishing boats) to hundreds of pixels (large cargo vessels), rendering single-kernel convolutions insufficient. To address this, we design the MSRF block, illustrated in Figure 6b, which captures multi-scale information without any spatial pooling, a crucial requirement for preserving the details of small targets.
MSRF consists of three parallel convolutional branches: Branch 1 uses a convolution to retain the current-scale representation; Branch 2 employs a standard convolution to capture local neighborhood context; and Branch 3 applies a dilated convolution with a dilation rate of 2, which expands the receptive field without introducing additional parameters.
The outputs of the three branches are concatenated along the channel dimension and fused by a convolution, which compresses the channels back to the original number. The fused output then passes through Instance Normalization and LeakyReLU and is finally added to the original input via a residual connection. This design contrasts with the Atrous Spatial Pyramid Pooling (ASPP) commonly used in segmentation, which often includes a global average-pooling branch that can smear fine spatial details. By deliberately excluding any pooling operation, MSRF ensures that small-ship features remain intact while simultaneously capturing broader context for large vessels.
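The claim that dilation widens the receptive field at no parameter cost can be checked with a one-dimensional toy example. The sketch below is illustrative only: the kernel length of 3 is an assumption (the kernel sizes are not reproduced in this text), and `effective_kernel` and `dilated_conv1d` are hypothetical helper names.

```python
from typing import List


def effective_kernel(k: int, dilation: int) -> int:
    """Span covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def dilated_conv1d(signal: List[float], weights: List[float], dilation: int = 1) -> List[float]:
    """Zero-padded 'same' 1-D convolution (cross-correlation) with dilation.

    The output has the same length as the input, mirroring the fact that
    no branch of the block reduces spatial resolution.
    """
    k = len(weights)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
standard = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=1)
dilated = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
```

Both calls use the same three weights, but the dilated response to the impulse spreads over five positions instead of three, which matches `effective_kernel(3, 2) == 5`.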
Background Suppression Channel Attention (BSCA). In maritime images, sea surface, waves, glint, and coastal land often dominate the pixel count, and their textures can trigger strong responses in certain feature channels, leading to false positives or blurred boundaries. To actively suppress these background-induced features, we propose the BSCA module, which goes beyond standard channel attention through a carefully designed bias mechanism.
As shown in Figure 6c, BSCA first applies global average pooling to obtain channel-wise statistics $z$. These statistics are then passed through a bottleneck of two fully connected layers with a reduction ratio $r$ (ReLU in between). Crucially, unlike the conventional Squeeze-and-Excitation (SE) block, we introduce a learnable bias term $b$ before the Sigmoid activation, initialized to $-2$. The channel attention weights are computed as:
$$w = \sigma\left(W_{2}\,\delta\left(W_{1} z\right) + b\right),$$
where $\sigma$ denotes the Sigmoid function, $\delta$ is the ReLU activation, $W_{1}$ and $W_{2}$ are the FC layer weights, and $b$ is the learnable bias. The final output is:
$$\tilde{X} = w \odot X,$$
where $X$ is the input feature map and $\odot$ denotes channel-wise multiplication.
The negative initialization of $b$ biases the Sigmoid input toward negative values, yielding initial channel weights of approximately 0.12 (since $\sigma(-2) \approx 0.12$). This means that, at the start of training, the network is encouraged to suppress all channels. As training proceeds, the network learns to increase the weights of channels that encode ship-related features while keeping background-associated channels suppressed. This data-driven, soft-gating mechanism implements an implicit “background subtraction” without requiring any explicit background labels. Our design is fundamentally different from the standard SE block, where the implicit initial weight is around 0.5, which provides no such background-suppression prior.
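The effect of the negative bias can be reproduced with a small pure-Python version of the attention computation. This is a sketch, not the authors' implementation: the bias value of -2 is an assumption chosen so that the Sigmoid output matches the initial weight of about 0.12 quoted in Section 3.6, the fully connected weights are toy values, and the function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_weights(feature: List[List[float]], w1: List[List[float]],
                    w2: List[List[float]], bias: float) -> List[float]:
    """Squeeze-and-excitation style weights with an extra pre-Sigmoid bias.

    feature: C channels, each given as a flat list of pixel values.
    w1: (C // r) x C matrix, w2: C x (C // r) matrix.
    """
    z = [sum(ch) / len(ch) for ch in feature]  # global average pooling
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]  # FC + ReLU
    logits = [sum(w * h for w, h in zip(row, hidden)) + bias for row in w2]  # FC + bias
    return [sigmoid(t) for t in logits]


def apply_attention(feature: List[List[float]], weights: List[float]) -> List[List[float]]:
    """Scale every channel by its attention weight."""
    return [[wt * v for v in ch] for wt, ch in zip(weights, feature)]


# Toy input: 4 channels of 3 pixels, reduction ratio 2, zero-valued second FC layer
# so that the Sigmoid input equals the bias exactly.
feature = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [4.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
w1 = [[0.1] * 4, [0.2] * 4]
w2 = [[0.0, 0.0] for _ in range(4)]

suppressed = channel_weights(feature, w1, w2, bias=-2.0)  # assumed negative init
neutral = channel_weights(feature, w1, w2, bias=0.0)      # no bias term
```

With the second layer contributing nothing, every channel starts at `sigmoid(-2)`, roughly 0.12, instead of the neutral 0.5. In a trained network the second layer's output shifts individual channels up or down from that starting point.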
3.4. Decoder
The decoder restores the deep semantic features to the original resolution for pixel-level prediction. Unlike the standard U-Net decoder, which relies on transposed convolutions or bilinear interpolation, our decoder adopts two consecutive improvements: dynamic upsampling and an output-stage refinement module (MSR), described in Section 3.5.
The decoder consists of 7 upsampling stages, detailed in Figure 7. Let the low-resolution input feature of the current stage be $X$, with $C$ input channels. $X$ is first upsampled with the dynamic upsampling module DySample [45]. Its core idea is to adaptively determine the sampling position of each target pixel by learning the offsets of the sampling points, thereby achieving more flexible spatial detail restoration than fixed grid sampling. DySample first uses a convolution to predict the sampling offsets $O$, which are organized into $G$ groups. It then generates the sampling coordinates by adding the offsets to the initial sampling grid $\mathcal{G}$ and obtains the upsampled feature $X_{up}$ through grid sampling:
$$X_{up} = \mathrm{GridSample}\left(X, \mathcal{G} + O\right).$$
Compared to transposed convolutions, DySample does not introduce large learnable kernels that may cause checkerboard artifacts, nor does it rely on fixed interpolation schemes that are blind to image content. Instead, it flexibly adjusts the resampling positions based on the input features, which is especially beneficial for recovering sharp ship boundaries.
Next, the upsampled feature is concatenated with the skip connection feature from the corresponding encoder stage along the channel dimension. The concatenated features are then processed by two ConvNormReLU blocks: the first reduces the channel number, and the second further refines the fused features without changing the channel count.
This process repeats for 7 stages, progressively restoring the resolution from that of the deepest encoder stage back to the input resolution, with the final channel depth becoming 32. The skip connections remain crucial, but the decoder’s distinguishing factor lies in the combination of DySample and the subsequent MSR module, which together achieve sharper and more accurate results than conventional U-Net decoders.
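The idea behind offset-based dynamic upsampling can be illustrated in one dimension. The sketch below is a simplified analogue and not the DySample implementation: it works on a 1-D signal, takes the offsets as given instead of predicting them with a convolution, and uses hypothetical function names.

```python
import math
from typing import List


def sample_linear(x: List[float], pos: float) -> float:
    """Linearly interpolate x at a fractional position, clamping at the borders."""
    pos = min(max(pos, 0.0), len(x) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(x) - 1)
    frac = pos - lo
    return (1.0 - frac) * x[lo] + frac * x[hi]


def dynamic_upsample_1d(x: List[float], offsets: List[float], scale: int = 2) -> List[float]:
    """Upsample x by `scale`, shifting each sampling point by its offset.

    The initial grid places output sample j at input coordinate
    (j + 0.5) / scale - 0.5; the offset is added before interpolation.
    """
    out = []
    for j in range(len(x) * scale):
        base = (j + 0.5) / scale - 0.5
        out.append(sample_linear(x, base + offsets[j]))
    return out


edge = [0.0, 10.0]  # a step edge, e.g. water next to a ship hull
fixed = dynamic_upsample_1d(edge, [0.0, 0.0, 0.0, 0.0])
adaptive = dynamic_upsample_1d(edge, [0.0, -0.25, 0.25, 0.0])
```

With zero offsets the result is ordinary linear interpolation, which smears the edge across the two middle samples. Offsets that pull those samples toward the nearest input position keep the edge sharp, which is the behavior the text attributes to content-adaptive sampling.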
3.5. MSR Module
In standard segmentation networks, the decoder output is typically passed directly through a convolution to produce the final prediction. However, we observe that the output features may still exhibit blurred boundaries and weak responses in small target regions. To address these residual artifacts, we propose the Multi-Scale Refinement (MSR) module, which acts as a lightweight post-processing enhancement between the decoder’s final feature map and the classification layer. Its structure is shown in Figure 8.
Let the decoder output be $F$. MSR applies two parallel branches to $F$: Branch 1 uses a standard convolution to extract local boundary details, while Branch 2 employs a dilated convolution with a dilation rate of 2 to capture wider spatial context.
Both branch outputs are activated by LeakyReLU and then concatenated along the channel dimension. A convolution compresses the channels back to 32, followed by InstanceNorm and LeakyReLU. Finally, a residual connection adds the processed features to the original $F$, yielding the refined features.
The design is extremely lightweight and introduces negligible computational overhead. By operating directly on the features that are about to be projected to the output logits, MSR implicitly performs multi-scale edge refinement: the standard branch emphasizes fine discontinuities, while the dilated branch considers larger spatial context to suppress false alarms from isolated noisy pixels. The residual design ensures that the refinement does not disturb the already well-learned features, especially at early training stages. This module is one of the key differentiators from many existing segmentation networks that lack dedicated output refinement.
3.6. Comparison with Related Architectures
MSRF shares the high-level goal of multi-scale context capture with Inception [46] and ASPP, but diverges in details tailored to ship segmentation. Inception modules employ multiple kernel sizes and typically include a pooling branch; however, pooling reduces spatial resolution and degrades the fine boundary information critical for small vessels. ASPP captures multi-scale context via parallel dilated convolutions and a global average-pooling branch. While effective for large objects, the global pooling collapses spatial details into a single vector per channel, risking suppression of tiny vessel signatures. In contrast, our MSRF deliberately avoids any pooling and uses only three convolutional branches, one of which is dilated (rate 2). This design preserves full resolution in every branch, retaining small-ship features while aggregating broader context. Moreover, the residual connection makes MSRF pluggable into each encoder stage with far fewer parameters than deep Inception architectures.
The key innovation of BSCA is its learnable negative bias, which fundamentally distinguishes it from standard Squeeze-and-Excitation (SE) modules. In SE, channel weights generated by two fully connected layers initially center around 0.5, treating all channels equally. This agnostic start forces the network to learn relevant channels solely through gradient descent, which is inefficient when background-associated channels dominate early training. BSCA introduces a trainable bias $b$ initialized to $-2$, making the initial channel weights approximately 0.12. This imposes strong suppression on all channels at the start of training. Subsequently, only channels carrying discriminative ship features are up-weighted, while background-dominated channels remain suppressed by the negative prior. This provides implicit, data-driven background subtraction from the earliest stages. Though simple, BSCA explicitly addresses the maritime reality where most pixels belong to the background. Compared with more complex attention variants, it adds negligible cost and significantly improves segmentation in ports and nearshore scenes.
The decoder of ShipMS-BSNet departs from U-Net, MSCF-Net, and MASSNet in two respects. First, it replaces transposed convolution or bilinear interpolation with the DySample dynamic upsampler. The former methods can cause checkerboard artifacts or over-smoothed boundaries, whereas DySample learns per-pixel sampling offsets from the feature content, recovering sharper ship contours. Second, we append an MSR module to the decoder output. Most networks directly project features to predictions via a single convolution. MSR instead applies parallel standard and dilated convolutions with a residual connection for edge-aware refinement, sharpening boundaries and strengthening responses on small targets with negligible overhead.
MSCF-Net and MASSNet represent recent advances in remote sensing segmentation but follow different strategies. MSCF-Net employs pyramid-like multi-scale fusion and standard channel attention without background suppression bias. MASSNet uses self-attention yet relies on conventional upsampling and lacks output refinement. In contrast, ShipMS-BSNet jointly addresses three challenges: extreme scale variation, dominant background interference, and boundary recovery. These components are interconnected: MSRF enriches features, BSCA suppresses background channels, DySample recovers resolution, and MSR performs final refinement. This integrated design consistently outperforms generic architectures on small and multi-scale ship extraction from complex maritime backgrounds.
3.7. Loss Function
Binary Cross-Entropy (BCE) loss [47] calculates the discrepancy between predicted probabilities and ground-truth labels pixel by pixel. Let the probability map predicted by the model be $P$, where $p_{i}$ denotes the probability of the ship class and $1 - p_{i}$ denotes the probability of the background class. The ground-truth label is denoted as $Y$, where 1 represents the ship, and 0 represents the background. The calculation formula for $L_{BCE}$ is:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i} + \left(1 - y_{i}\right)\log\left(1 - p_{i}\right)\right],$$
where $p_{i}$ is the probability that the $i$th pixel belongs to the ship class predicted by the model, $y_{i}$ denotes the ground-truth label of the $i$th pixel (0 for background and 1 for ship), and $N$ is the total number of image pixels. BCE loss optimizes each pixel independently with stable gradients, which provides clear pixel-wise learning signals for the network, but it is susceptible to class imbalance.
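Dice loss [48] is derived from the Dice coefficient, which measures the overlap between the predicted segmentation map and the ground-truth label. Its calculation formula is given by:
$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} p_{i} y_{i} + \epsilon}{\sum_{i=1}^{N} p_{i} + \sum_{i=1}^{N} y_{i} + \epsilon},$$
where $\epsilon$ is a smoothing factor to avoid division by zero.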
Dice loss directly maximizes the Dice coefficient of the segmentation result and is insensitive to the pixel ratio between the foreground and background, which can effectively alleviate the class imbalance problem. However, in the early stage of training, when the predicted region and the ground-truth region are completely disjoint, the gradient of Dice loss may be zero or oscillate, leading to unstable convergence.
In ship extraction tasks, the pixel proportion of ship targets in high-resolution remote sensing images is usually extremely small, and background pixels dominate overwhelmingly. The statistics of the target proportion in the dataset are given in Table 2. The ship dataset adopted in this paper contains nearly 70,000 images, among which 58.0% have a target proportion lower than 1% and 90.3% lower than 5%, leading to an extremely severe class imbalance problem.
In view of this characteristic, if only the standard Binary Cross-Entropy (BCE) loss is used for optimization, the model tends to predict all pixels as the background, resulting in serious missed detection of small targets [49]. If only Dice loss is employed, although it can alleviate class imbalance, the gradient is unstable in the early training stage, and the convergence speed is slow [50]. Therefore, we adopt a weighted combination of BCE loss and Dice loss to balance pixel-level classification accuracy and regional overlap, thereby improving the segmentation and detection capability of the model for small targets. The total loss function is defined as:
$$L_{total} = \lambda_{1} L_{Dice} + \lambda_{2} L_{BCE},$$
where $\lambda_{1}$ and $\lambda_{2}$ are weight coefficients.
According to the dataset statistics, ship targets occupy less than 5% of the pixels in most images, leading to a severe class imbalance issue. The weighting strategy is inspired by the balanced cross-entropy formulation introduced in Focal Loss for dense object detection [51], where a higher weight is assigned to the foreground class to prevent the loss from being dominated by easy negatives. Accordingly, we set $\lambda_{1} = 0.7$ for the Dice loss and $\lambda_{2} = 0.3$ for the BCE loss. This assignment allows the Dice loss, which directly optimizes the spatial overlap of ship regions, to drive the optimization with greater force, while the BCE loss provides stable, pixel-wise gradient signals that help maintain training stability. The specific values of 0.7 and 0.3 are chosen to reflect the approximate inverse frequency ratio between the foreground and background, following the principle that the minority class should receive a proportionally higher loss weight, as commonly practiced in dense prediction tasks.
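The combined loss can be sketched on flat lists of per-pixel values. This is a minimal illustration under stated assumptions: the Dice variant (sums of probabilities in the denominator), the smoothing value of 1.0, and the clamping constant are common choices and not details confirmed by the text. The weights 0.7 and 0.3 follow the values given above.

```python
import math
from typing import List


def bce_loss(probs: List[float], labels: List[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def dice_loss(probs: List[float], labels: List[int], smooth: float = 1.0) -> float:
    """One minus the smoothed Dice coefficient (smooth=1.0 is an assumed value)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(labels) + smooth)


def total_loss(probs: List[float], labels: List[int],
               w_dice: float = 0.7, w_bce: float = 0.3) -> float:
    """Weighted sum of Dice and BCE losses."""
    return w_dice * dice_loss(probs, labels) + w_bce * bce_loss(probs, labels)


# A 100-pixel image with one ship pixel, and a model that predicts background everywhere.
labels = [1] + [0] * 99
all_background = [0.0] * 100
```

On this toy image the mean BCE for the all-background prediction stays modest because 99 of 100 pixels are classified correctly, while the Dice term is 0.5. This is the imbalance effect that motivates giving the Dice term the larger weight.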
4. Experiments
4.1. Implementation Details
ShipMS-BSNet was implemented in PyTorch v2.6.0. To verify the effectiveness of the proposed method, we performed a systematic comparative analysis between ShipMS-BSNet and current mainstream semantic segmentation algorithms on the self-constructed remote sensing ship dataset and the public HRSC2016 dataset. The competitors included U-Net, nnU-Net, Swin-unet [52], SegFormer, TransUNet, and MSCF-Net. To ensure a fair comparison, all models were trained from scratch without pre-trained weights, and an identical data augmentation pipeline and the same set of hyperparameters were applied across all architectures. All experiments were conducted on a workstation powered by an NVIDIA GeForce RTX 4090 GPU and an Intel Core i9-12900KF CPU, running Windows 10 Professional with Python 3.10.18 and CUDA 12.4. Both datasets were split into training, validation, and test sets with a ratio of 7:2:1. The training and validation sets supported the model development and tuning process, while the test set was exclusively reserved for the final assessment of model performance. All models were trained with the AdamW optimizer. A consistent training setup was employed for all models: a 1
initial learning rate and a batch size of 32, alongside a fixed random seed of 42. Each architecture was trained over 50 epochs, with the top-performing checkpoint on the validation data being saved for the subsequent test phase.
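A 7:2:1 split can be produced with the standard library as sketched below. How the authors actually partitioned the data (for example, randomly or by scene) is not stated, so the random shuffle, the reuse of seed 42, and the helper name `split_indices` are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_indices(n: int, ratios: Sequence[int] = (7, 2, 1),
                  seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices reproducibly and cut them into train/val/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(69407)
print(len(train), len(val), len(test))
```

Using integer arithmetic for the first two parts and assigning the remainder to the last one guarantees that every sample lands in exactly one subset, even when the dataset size is not divisible by ten.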
4.2. Evaluation Metrics
The performance of the ShipMS-BSNet model was assessed using a comprehensive set of four standard segmentation metrics. These metrics, employed to quantify segmentation accuracy, are: precision, Recall, F1-score, and Intersection over Union (IoU).
Precision measures the correctness of the predicted ship pixels. It shows the reliability of the segmentation results. Its calculation formula is:
where True Positive (TP) represents the ship pixels that are accurately identified by the model. Conversely, false positive (FP) corresponds to background pixels that are erroneously assigned to the ship category.
Recall measures the model’s ability to capture all genuine ship pixels present in the image. It tells how well the model detects ships. Its calculation formula is:
where FN (false negative) is the count of ship pixels misclassified as the background.
The F1-score balances the trade-off between precision and Recall, making it suitable for handling datasets with skewed sample ratios. The calculation formula is:
Intersection over Union (IoU) is defined as the size of the overlapping region divided by the combined area of the predicted segmentation and the ground-truth annotation. It serves as a core benchmark for evaluating pixel-level alignment in image segmentation. The calculation formula is:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
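The four metrics can all be derived from the TP, FP, and FN counts of the ship class. The sketch below shows one way to compute them for a single pair of binary masks. The function name `segmentation_metrics`, the nested-list mask representation, and the convention of returning 0.0 for an empty denominator are assumptions made for this example. Whether the reported scores pool counts over the whole test set or average per-image values is not specified here.

```python
def segmentation_metrics(pred, target):
    """Precision, recall, F1 and IoU of the ship (positive) class.

    pred and target are equally sized 2D lists of 0/1 values.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # ship pixel predicted as ship
            elif p:
                fp += 1  # background pixel predicted as ship
            elif t:
                fn += 1  # ship pixel predicted as background

    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty cases

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
# TP = 2, FP = 1, FN = 1, so precision = recall = F1 = 2/3 and IoU = 0.5.
print(segmentation_metrics(pred, target))
```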
4.3. Results on the Public HRSC2016 Dataset
To verify the effectiveness of the proposed ship extraction method, comparative experiments are conducted in this section on the public HRSC2016 ship dataset.
Table 3 summarizes the quantitative results of all compared methods on the HRSC2016 test set. The proposed method achieves the best precision, F1-score, and IoU, and its recall is second only to that of MSCF-Net, giving it the best overall performance.
ShipMS-BSNet achieves a precision of 0.846, 1.6% higher than the second-best MSCF-Net and 3.5% higher than nnU-Net, indicating that it maintains a relatively low false positive rate on the HRSC2016 dataset. However, as shown in the second column of Figure 9, all models exhibit misrecognition in the upper right region, incorrectly segmenting sea surface ripples as ships. This reflects a common challenge for current models in handling complex sea backgrounds: under light reflection, the texture and brightness of ripples may resemble parts of ships and cause confusion. Notably, although ShipMS-BSNet also shows this misrecognition, its misdetected area is smaller than that of methods such as U-Net and nnU-Net, consistent with its higher precision.
In terms of recall, MSCF-Net ranks first with 0.837, followed closely by ShipMS-BSNet with 0.835. The two values are comparable, and both clearly exceed those of Swin-unet (0.819) and nnU-Net (0.811), which suggests that both methods miss fewer ship pixels, including those of small targets. Although ShipMS-BSNet has a slightly lower recall, its higher precision gives it a higher F1-score than MSCF-Net, reflecting a better balance between false positives and false negatives.
Figure 9 shows typical segmentation results of all methods on the HRSC2016 test set for two scenarios: moored ships in ports and single offshore ships. In complex port scenarios with building interference, the compared methods often misclassify docks and water ripples as ships (e.g., the upper right region in the second column). In simple offshore scenarios, all methods detect the ships completely, but the compared methods produce blurred edges. The proposed method restores hull contours more accurately, further validating its effectiveness in handling both complex backgrounds and fine-grained segmentation.
4.4. Results on the Self-Constructed Remote Sensing Ship Dataset
After verifying the effectiveness of the proposed ShipMS-BSNet on the public HRSC2016 dataset, we further conduct experiments on our self-constructed remote sensing ship dataset to evaluate its generalization ability in real-world scenarios.
Table 4 summarizes the quantitative results of all compared methods on the test set of our self-constructed ship dataset. ShipMS-BSNet outperforms all other methods on all four metrics, demonstrating the effectiveness of the proposed approach.
Figure 10 presents representative segmentation examples of all compared methods on the test set. For extremely small targets, U-Net suffers from severe missed detections, while Swin-unet and nnU-Net achieve partial detection but generate coarse boundaries. In contrast, both MSCF-Net and ShipMS-BSNet accurately locate these targets, with ShipMS-BSNet yielding more intact contours. For medium-sized vessels, all methods detect the targets successfully, while ShipMS-BSNet shows superior edge precision and greater overlap with the ground truth.
4.5. Feature Map Visualization
To verify the encoder’s semantic focusing and feature extraction capabilities, we average the first eight channels of features from encoder stage 4.
Figure 11 shows the average feature map and heatmap overlaid on the original image.
The average feature map shows that deep high-activation regions coincide with ship bodies, with significant activations for small ships and no false background activations. Background regions like the sea and docks have extremely low responses. The heatmap further confirms that high-activation regions fully cover ships of different scales, with stable responses to contours and details, and no feature diffusion or foreground–background confusion.
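The channel averaging behind this visualization can be sketched as follows. The example averages the first eight channel maps of a feature tensor and rescales the result to [0, 1] so that it can be rendered as a heatmap. The min-max normalization, the function name `mean_activation_map`, and the nested-list representation are assumptions made for this illustration. In practice the same steps would be applied to a framework tensor.

```python
def mean_activation_map(features, num_channels=8):
    """Average the first `num_channels` channel maps and min-max scale to [0, 1].

    features: list of channels, each a 2D list (H x W) of floats.
    """
    selected = features[:num_channels]
    height, width = len(selected[0]), len(selected[0][0])
    mean_map = [
        [sum(ch[y][x] for ch in selected) / len(selected) for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in mean_map)
    hi = max(max(row) for row in mean_map)
    span = (hi - lo) or 1.0  # a constant map would otherwise divide by zero
    return [[(value - lo) / span for value in row] for row in mean_map]


# Two toy 2 x 2 channels; their mean is [[1, 3], [5, 7]].
toy_features = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
print(mean_activation_map(toy_features))
```

The normalized map would then be resized to the input resolution and blended with the original image to obtain the overlay.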
4.6. Segmentation Result Visualization
To intuitively verify the practical segmentation performance of the proposed model, we visualize the full pipeline of segmentation results on test samples, as shown in Figure 12, including the original image, predicted mask, color segmentation, and overlay display.
The test sample is a coastal port remote sensing image containing two ship targets of different scales and shapes, with typical interference, such as adjacent docks and sea surface texture noise. The results show that the model generates complete binary segmentation masks for both ship targets without obvious holes or fractures, and the predicted contours are highly consistent with the actual target boundaries. Notably, the model segments the large-scale ship completely and detects the small-scale ship accurately at the same time, with no missed or false detections and no mis-segmentation of docks or sea backgrounds. The overlay display further confirms that the segmentation results closely match the position, contour, and scale of the ship targets in the original image, with no obvious boundary offset or over-segmentation.
These results, together with the previous feature visualization, form a complete logical verification. The encoder’s ability to accurately focus on target features and resist background interference is the core premise of high-precision segmentation. By effectively filtering noise and anchoring the semantic regions of ships in the encoder stage, the model can restore accurate target contours in the decoder stage, achieving end-to-end pixel-level segmentation of multi-scale ships and verifying its effectiveness and robustness in remote sensing ship segmentation tasks.
4.7. Ablation Study
To verify the individual contributions and effectiveness of each core module in ShipMS-BSNet, we design a series of ablation experiments. By gradually adding key modules on the self-constructed dataset and comparing performance changes, we quantitatively analyze the roles of MSRF, BSCA and MSR in ship segmentation.
Using the standard nnU-Net as the baseline, we add BSCA and MSRF individually and then in combination, and finally integrate DySample and MSR to obtain the full model. All models use identical training configurations and datasets.
Table 5 shows the ablation results.
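For readers who wish to reproduce the ablation, the variants described above can be written as a small configuration list. This is a sketch under stated assumptions: the class name `AblationVariant`, its field names, and the variant labels are hypothetical, and the list mirrors only the order given in the text, so it may not match the row layout of Table 5 exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationVariant:
    """Switches for the optional modules added on top of the nnU-Net baseline."""
    name: str
    bsca: bool = False      # Background Suppression Channel Attention
    msrf: bool = False      # Multi-Scale Receptive Field Enhancement
    dysample: bool = False  # dynamic upsampling in the decoder
    msr: bool = False       # Multi-Scale Refinement at the output


VARIANTS = [
    AblationVariant("baseline"),
    AblationVariant("+BSCA", bsca=True),
    AblationVariant("+MSRF", msrf=True),
    AblationVariant("+BSCA+MSRF", bsca=True, msrf=True),
    AblationVariant("full", bsca=True, msrf=True, dysample=True, msr=True),
]

for variant in VARIANTS:
    enabled = [k for k, v in vars(variant).items() if v is True]
    print(variant.name, enabled)
```

Keeping the switches in one place makes it easier to confirm that every variant is trained with the same settings and differs only in the enabled modules.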
The ablation studies validate the effectiveness of each proposed module. The BSCA module slightly improves segmentation performance by suppressing background interference and reducing false positives via a learnable negative bias mechanism.
The Multi-Scale Receptive Field Enhancement (MSRF) module delivers more significant gains by expanding receptive fields through multi-branch dilated convolutions, which is particularly beneficial for multi-scale ship feature extraction.
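To make the effect of dilation concrete, the sketch below evaluates the standard relation for the span of a dilated kernel, k_eff = k + (k - 1)(d - 1), together with the padding that preserves spatial size at stride 1. The dilation rates 1, 2, and 4 are hypothetical values chosen for illustration and are not the rates reported for MSRF. The function names are likewise assumptions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by one dilated convolution kernel along a single axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that keeps the spatial size unchanged at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


# Hypothetical dilation rates for three parallel 3 x 3 branches.
for rate in (1, 2, 4):
    print(rate, effective_kernel_size(3, rate), same_padding(3, rate))
# 1 3 1
# 2 5 2
# 4 9 4
```

Each branch keeps the nine weights of a 3 × 3 kernel while covering a wider area, which is why parallel branches with different rates can respond to ships of different sizes at the same parameter cost per branch.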
Combining MSRF and BSCA yields performance exceeding the additive effect of individual modules, demonstrating clear synergistic benefits: MSRF provides contextual information for accurate ship–background distinction, while BSCA purifies inputs for MSRF feature fusion.
Further integrating the dynamic upsampling (DySample) and Multi-Scale Refinement (MSR) modules improves boundary details and small-target segmentation. Overall, the complete ShipMS-BSNet improves the F1-score by 6.2% and the IoU by 5.4% over the baseline, enabling robust ship segmentation in complex remote sensing backgrounds.
5. Discussion
5.1. Scale Advantages and Research Value of Datasets
The main advantage of the self-constructed ship segmentation dataset in this study lies in its scale and fine-grained annotation quality. The dataset contains 69,407 high-resolution remote sensing scenes. Each image was annotated with polygons in ArcGIS and subjected to rigorous quality control, ensuring precise ship boundary delineation and producing pixel-level binary masks. This volume of data provides ample and diverse learning samples for deep learning models, which helps to mitigate overfitting and improve generalization.
Existing mainstream ship datasets (such as HRSC2016 and MCSD) typically contain thousands to tens of thousands of images. With 69,407 images, this dataset is substantially larger than most of them, in some cases by an order of magnitude. This expansion is not only quantitative but also enables more comprehensive coverage of complex real-world scenarios. The dataset includes diverse environments ranging from nearshore to open sea, from fair weather to complex meteorological interference, and from sparse to densely clustered scenes. Moreover, the ship targets are diverse and balanced in scale, morphology, and quantity distribution, and 1000 negative samples are included to enhance model robustness. Such a large-scale, accurately annotated dataset helps to address the long-standing “data scarcity” issue in this field, lays a solid foundation for training deeper and more advanced segmentation models, and could serve as a new benchmark in ship segmentation research.
Benefiting from its large scale and high quality, this dataset offers broad applicability and significant potential for future expansion. It not only furnishes dependable training and evaluation support for the present study but also establishes an elevated baseline that enables the global research community to benchmark algorithms equitably, refine models for complex maritime scenes, advance small target detection, and propel semantic segmentation research.
5.2. Performance of ShipMS-BSNet on Real Large Remote Sensing Images
Although the experiments in Section 4 have comprehensively evaluated the segmentation performance of ShipMS-BSNet on standard datasets, there remains a critical gap between academic research and real-world engineering applications: all previous experiments were conducted on pre-cropped 512 × 512 image patches, while actual remote sensing images are usually gigapixel-scale and cannot be directly input into the model for end-to-end inference.
To bridge this gap, we collected ultra-high-resolution satellite images covering large ocean areas from Google Earth and adopted a sliding window cropping strategy to generate 512 × 512 image patches. The proposed model was applied for parallel inference, and the prediction results were seamlessly stitched to generate a complete segmentation map. This approach allows us to comprehensively evaluate the model’s performance in real-world complex environments.
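The cropping and stitching procedure can be sketched as follows. The example generates 512 × 512 window positions along each axis and merges per-window predictions by averaging where windows overlap. The stride of 384 pixels, the overlap averaging, and all function names are assumptions made for this illustration. The overlap and merging rule actually used are not specified here.

```python
def window_starts(length, tile=512, stride=384):
    """Start offsets along one axis; the last window is shifted to end at the border."""
    if length <= tile:
        return [0]  # a smaller image would need padding, which is not handled here
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def tile_origins(height, width, tile=512, stride=384):
    """Top-left (y, x) corner of every window covering the image."""
    return [(y, x)
            for y in window_starts(height, tile, stride)
            for x in window_starts(width, tile, stride)]


def stitch(height, width, tiles):
    """Average overlapping window predictions into one full-size map.

    tiles: iterable of (y0, x0, prob), where prob is a 2D list of probabilities.
    Pixels covered by no window are set to 0.0.
    """
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for y0, x0, prob in tiles:
        for dy, row in enumerate(prob):
            for dx, value in enumerate(row):
                total[y0 + dy][x0 + dx] += value
                count[y0 + dy][x0 + dx] += 1
    return [[t / c if c else 0.0 for t, c in zip(total_row, count_row)]
            for total_row, count_row in zip(total, count)]


print(window_starts(1200))  # [0, 384, 688]
```

Shifting the final window back to the image border avoids padding, and averaging the overlapping regions is one common way to reduce visible seams. The pure-Python loops are for clarity only, and a practical implementation would use array operations.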
As shown in Figure 13, ShipMS-BSNet successfully processes ultra-large remote sensing images within memory constraints and maintains consistent high-precision segmentation across the entire image. Notably, no obvious stitching artifacts, missed detections or false positives are observed at the window boundaries. This is attributed to the synergistic effect of the proposed modules: the BSCA module suppresses background noise at window edges and reduces false detections, while the MSRF module with large receptive fields ensures that targets partially located at window edges can still be completely captured.
Furthermore, the Google Earth images used in this experiment differ from the training dataset in terms of imaging satellites, illumination conditions and background complexity. The stable performance of the model further verifies its cross-domain generalization ability.
5.3. Performance of ShipMS-BSNet in Different Marine Backgrounds
Figure 14 presents the segmentation performance of ShipMS-BSNet under various ship scales, target densities, marine backgrounds and weather conditions. The results demonstrate that the proposed model exhibits excellent robustness in all the above complex scenarios, fully validating its capability to accurately extract cross-scale ship targets from cluttered backgrounds.
The principal advantage of ShipMS-BSNet over conventional methods is its capacity to handle both multi-scale targets and complex environments. The MSRF module employs parallel dilated convolutions with different dilation rates to extract and integrate fine-grained local features and multi-scale contextual information, which supports precise segmentation of ship targets across scales, from large inshore vessels to small offshore vessels. Meanwhile, the Background Suppression Channel Attention (BSCA) mechanism suppresses background noise from coastal docks, sea surface textures, and wave clutter. This improves the model’s robustness against marine environmental interference, such as clouds and sea fog, and helps it maintain stable segmentation performance under complex illumination and low visibility conditions.
The multi-scale feature extraction capability of MSRF and the background suppression capability of BSCA form a favorable synergistic effect: the former provides the model with cross-scale target discriminative features, while the latter accurately filters out irrelevant background noise. Their combination enables the model to achieve high-precision and high-robustness pixel-level ship segmentation even in extremely complex remote sensing marine environments.
5.4. Limitations
Although the proposed ShipMS-BSNet model demonstrates excellent performance in ship segmentation tasks, it still has several inherent limitations. Firstly, the model has high computational complexity, and both training and inference processes consume a large amount of computing resources, resulting in significant inference latency. This makes it difficult to deploy on edge computing devices with limited computing power, and it is unable to meet the stringent requirements of real-time ship monitoring in marine environments.
Secondly, the current model only uses single-modal visible light remote sensing images as input and does not exploit the imaging characteristics of other remote sensing modalities, such as infrared and Synthetic Aperture Radar (SAR). Under adverse imaging conditions, such as insufficient nighttime illumination and cloud cover, the detection accuracy and robustness of the model are expected to decrease significantly. Furthermore, the current evaluation is limited to our self-constructed dataset and HRSC2016, both of which primarily consist of images from specific satellite sensors and geographical regions. The generalization capability of ShipMS-BSNet to other widely used satellite platforms (e.g., Sentinel-2, WorldView) and to diverse geographical areas remains to be validated.
Thirdly, the model adopts a fully supervised learning paradigm, and the training process relies on large-scale pixel-level fine-annotated data. However, pixel-level annotation of remote sensing images not only requires professional knowledge but is also time-consuming and labor-intensive, which greatly limits the expansion of the dataset and the popularization and application of the model in practical scenarios.
Subsequent optimization will be carried out in five directions: realize a lightweight model through knowledge distillation and lightweight architecture design to adapt to edge devices and real-time monitoring; fuse multi-source remote sensing data, such as infrared and SAR, to improve robustness in complex environments; construct a multi-sensor, multi-region benchmark to systematically evaluate and improve cross-platform generalization; study weakly supervised and semi-supervised learning to reduce annotation dependence; and deploy the model to an actual marine monitoring platform to further optimize performance in real scenarios.
6. Conclusions
To address the challenges of multi-scale target detection, severe background interference, and blurred edge details in ship segmentation from remote sensing images in complex marine environments, this paper proposes a high-performance ship semantic segmentation network, ShipMS-BSNet. The network constructs an encoder architecture integrating Multi-Scale Receptive Field Enhancement (MSRF) and Background Suppression Channel Attention (BSCA). The former captures cross-scale target features through multi-branch dilated convolutions, while the latter accurately suppresses background noise via a learnable negative bias mechanism. Their synergistic effect improves the model’s discriminative ability for ship targets in complex backgrounds. In the decoder part, traditional transposed convolution is replaced by Dynamic Sampling, and the Multi-Scale Refinement (MSR) module is introduced at the output end, which effectively solves the edge blurring problem during upsampling and enhances the segmentation integrity of small targets.
To support rigorous evaluation across diverse maritime conditions, we constructed a large-scale ship segmentation dataset comprising 69,407 image–label pairs. This dataset covers multiple vessel scales, ranging from small fishing boats to large cargo ships, and diverse scene types, including open seas, nearshore waters, and busy ports. These attributes enable a fine-grained assessment of algorithm performance under varying degrees of background complexity and target sizes. Extensive comparative experiments and ablation studies on both this self-constructed dataset and the public HRSC2016 dataset verify the effectiveness of each proposed module. ShipMS-BSNet achieves state-of-the-art performance across multiple core metrics, with particularly pronounced advantages in complex port backgrounds and small-ship segmentation tasks, where the scale diversity and scene variety of our dataset provide a demanding and representative benchmark. Its overall performance is superior to that of existing mainstream methods.
Future research will focus on multi-modal remote sensing data fusion to fully exploit the complementary information of infrared, SAR, and other modalities, and to improve the model’s environmental adaptability under extreme weather conditions. Meanwhile, model lightweighting will be realized through knowledge distillation, model quantization, and other techniques to provide technical support for real-time marine ship monitoring systems.