Article

MC-ASFF-ShipYOLO: Improved Algorithm for Small-Target and Multi-Scale Ship Detection for Synthetic Aperture Radar (SAR) Images

School of Information Science, Shanghai Ocean University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(9), 2940; https://doi.org/10.3390/s25092940
Submission received: 31 March 2025 / Revised: 28 April 2025 / Accepted: 4 May 2025 / Published: 7 May 2025
(This article belongs to the Special Issue Recent Advances in Synthetic Aperture Radar (SAR) Remote Sensing)

Abstract

Synthetic aperture radar (SAR) ship detection holds significant application value in maritime monitoring, marine traffic management, and safety maintenance. Despite remarkable advances in deep-learning-based detection methods, performance remains constrained by the vast size differences between ships, limited feature information of small targets, and complex environmental interference in SAR imagery. Although many studies have separately tackled small target identification and multi-scale detection in SAR imagery, integrated approaches that jointly address both challenges within a unified framework for SAR ship detection are still relatively scarce. This study presents MC-ASFF-ShipYOLO (Monte Carlo Attention—Adaptively Spatial Feature Fusion—ShipYOLO), a novel framework addressing both small target recognition and multi-scale ship detection challenges. Two key innovations distinguish our approach: (1) We introduce a Monte Carlo Attention (MCAttn) module into the backbone network that employs random sampling pooling operations to generate attention maps for feature map weighting, enhancing focus on small targets and improving their detection performance. (2) We add Adaptively Spatial Feature Fusion (ASFF) modules to the detection head that adaptively learn spatial fusion weights across feature layers and perform dynamic feature fusion, ensuring consistent ship representations across scales and mitigating feature conflicts, thereby enhancing multi-scale detection capability. Experiments are conducted on a newly constructed dataset combining HRSID and SSDD. Ablation experiment results demonstrate that, compared to the baseline, MC-ASFF-ShipYOLO achieves improvements of 1.39% in precision, 2.63% in recall, 2.28% in AP50, and 3.04% in AP, indicating a significant enhancement in overall detection performance. Furthermore, comparative experiments show that our method outperforms mainstream models. Even under high-confidence thresholds, MC-ASFF-ShipYOLO is capable of predicting more high-quality detection boxes, offering a valuable solution for advancing SAR ship detection technology.

1. Introduction

As an important task of maritime surveillance, ship detection plays an irreplaceable role in various fields such as marine economic development, search and rescue operations, maritime transportation, military reconnaissance, and security law enforcement [1,2,3,4]. Marine ship detection frequently relies on remote sensing imagery due to its comprehensive spatial coverage, economic efficiency, and rapid data availability [5]. This includes synthetic aperture radar (SAR), which is the preferred data source for ship detection, largely owing to its all-weather and all-day imaging capabilities [6,7,8,9].
Traditional SAR ship detection methods (such as CFAR) rely on manual parameter tuning and exhibit significant performance degradation in complex sea conditions or dense ship scenarios [10,11,12,13], which has gradually shifted the research focus towards more robust detection algorithms. In recent years, artificial intelligence technologies, especially CNNs, have brought breakthrough progress to ship detection due to their powerful nonlinear feature extraction and efficient feature representation capabilities [14]. Using the SSDD dataset as an example, the mAP50 for ship detection has improved from 78.8% to 97.8% [15]. Current deep-learning solutions for object detection adopt either two-stage or single-stage detectors. Two-stage algorithms primarily comprise two fundamental processes: (1) generation of region proposals and (2) classification of features and bounding box regression within the proposed regions [16]. Representative algorithms include the R-CNN series [17,18,19]. Ke et al. [20] introduced deformable convolution kernels to improve Faster R-CNN, enhancing the model’s adaptability to geometric deformations of ships, which increased the mAP by 2.02%. Zhao et al. [21] proposed a two-stage detector named the Attention Receptive Field Pyramid Network (ARPN), which enhances non-local feature associations and refines multi-level representations, thereby boosting the performance of multi-scale ship detection in SAR imagery. While two-stage object detection algorithms achieve higher detection accuracy, they entail greater computational complexity and resource consumption. Single-stage algorithms (such as SSD, RetinaNet, FCOS, and the YOLO series) eliminate the need for region proposal generation and can directly predict bounding boxes and class probabilities through anchor boxes. Wang et al. [22] optimized SAR ship detection by combining SSD with data augmentation and transfer learning, effectively reducing false detection rates and achieving higher target localization precision. Miao et al. [23] proposed an improved lightweight RetinaNet that reduces parameter count and computational overhead without compromising detection accuracy. Zhu et al. [24] presented an enhanced FCOS + ATSS network that achieved an 8.3% improvement in AP compared to the baseline model. The YOLO series dominates ship detection as the most popular single-stage approach [16]. Simultaneously, Transformer models based on self-attention mechanisms effectively capture global relationships between image pixels, demonstrating tremendous potential in ship detection. For instance, the SMEP-DETR model proposed by Yu et al. [25] effectively suppresses speckle noise in SAR images while enhancing edge features, enabling small target ships to maintain high localization accuracy even in complex background environments. The RDB-DINO approach introduced by Qin et al. [26] mitigates confusion between small ships and complex backgrounds while enhancing the feature representation of small ships, significantly improving detection performance for small ships. Transformers possess superior global modeling capabilities [27] and outperform certain single-stage detection models in object detection tasks.
Despite significant advances in deep-learning-based SAR ship detection methods, multiple challenges persist: ship types are increasingly diverse with substantial size variations, and ship pixel areas within the same dataset can differ by nearly 1000-fold [28]. CNNs with fixed receptive fields struggle to adapt to these multi-scale characteristics. Additionally, ships may appear in complex scenarios including near-shore ports, open ocean, coastlines, and inland waterways, making them prone to background confusion [2,29]. Furthermore, densely distributed fleets further increase detection difficulty [30]. Moreover, SAR ship datasets contain a high proportion of small targets [31], which provide limited feature information [32,33] and are susceptible to background interference, severely constraining detection performance.
To address these challenges, researchers have explored various improvement strategies. Generally, these methods primarily include multi-scale feature fusion optimization, attention mechanisms, enhanced feature extraction networks, and loss function improvements. Regarding multi-scale feature fusion, Liu et al. [34] combined a Feature Pyramid Network (FPN) with Scale-Equalizing Pyramid Convolution (SEPC) to enhance YOLOv4’s multi-scale feature processing capability. Chen et al. [35] innovatively applied a k-means clustering algorithm based on Shape Similarity Distance (SSD) metrics to optimize FPN, effectively resolving small ship detection problems in complex environments. Li et al. [36] proposed a Balanced Shifting Multi-scale Fusion (BSMF) module based on YOLOv8, significantly improving detection performance for targets of different scales. In terms of attention mechanism applications, Cui et al. [37] embedded a Convolutional Block Attention Module (CBAM) into the feature map cascade process of the pyramid network, connecting layers sequentially from top to bottom to extract rich multi-resolution semantic features and highlighting salient features. Yang et al. [38] introduced a Coordinate Attention Module (CoAM) in single-stage detection networks to mitigate complex background interference and enhance semantic feature representation. Zhou et al. [39] proposed an attention aggregation network WEF-Net that coordinates semantic information across feature layers of different resolutions, simultaneously enhancing multi-scale detection capabilities and background suppression effects. Tang et al. [40] integrated the BiFormer attention mechanism into YOLOv7-tiny, significantly reducing false positives and false negatives in near-shore scenarios. Wang et al. [41] proposed a lightweight multi-scale SAR ship detection model called MSSD-Net, which introduces the Multi-Scale Coordinate Attention Module (MSCA) to effectively capture global information from input feature maps and enhance the capability to process features across different scales. Regarding feature extraction network enhancements, Fan et al. [42] proposed the CSDP module, which employs deep large-kernel convolutions to enlarge the receptive field of shallow layers, thereby enhancing the representation of small target features. Yang et al. [38] proposed a Receptive Field Improvement Module (RFIM) to enhance detection capabilities for ships of various scales. In terms of loss function improvements, Li et al. [36] introduced Gaussian Wasserstein distance loss, while Fan et al. [42] proposed the MPDIOU loss function, effectively alleviating the class imbalance problem between small SAR ships and backgrounds.
Despite these improvements enhancing SAR ship detection performance, prior research has predominantly focused on detection models that address either small target detection or multi-scale ship detection in isolation. Research that simultaneously addresses both challenges within a unified detection framework is still relatively limited. This study aims to propose an improved ship detection model that, on the one hand, enhances small target feature representation and distinctiveness by increasing attention allocation to small targets, reducing information loss in high-level feature representations, and strengthening the model’s perception of small targets and their contextual background relationships. On the other hand, it resolves the cross-scale feature conflict problems inherent in traditional FPN structures, thereby enhancing feature consistency across targets of different scales. This dual improvement strategy significantly enhances model performance in small target recognition and multi-scale target detection, providing a more effective solution for ship detection in SAR images.
To overcome these limitations, we propose MC-ASFF-ShipYOLO (Monte Carlo Attention—Adaptively Spatial Feature Fusion—ShipYOLO), an improved single-stage detector based on YOLO11 [43]. Our main contributions are as follows.
  • We introduce the MCAttn module to enhance backbone network performance. This module randomly selects one attention map from three different scales through stochastic sampling pooling operations for the weighting of feature maps. By capturing multi-scale information, the MCAttn module improves the backbone network’s ability to discriminate small ship morphology and position, enhances focus on small targets, prevents information loss during network deepening, and strengthens contextual relationship learning. The module effectively enhances the feature representation of small ships, thereby improving the model’s detection capability for small ship targets.
  • We incorporated the ASFF module into the detection head, which adaptively learns spatial fusion weights across multi-scale feature layers and performs dynamic feature integration. This approach ensures the consistency of ship feature representations across different scales, effectively mitigating the problem of feature conflicts between varying feature layers. Consequently, the model’s multi-scale ship detection capability is enhanced.
  • We propose MC-ASFF-ShipYOLO, an improved deep-learning ship detection model that demonstrates superior precision compared to other baseline models when evaluated on the mixed HRSID and SSDD datasets. Our detection framework effectively improves the critical challenges arising from small target recognition and multi-scale feature representation in SAR ship detection tasks. The experimental results validate the efficacy of our approach and provide valuable methodological guidance for future research in maritime target detection.
The subsequent sections are arranged as follows: Section 2 introduces the newly constructed hybrid SAR ship detection dataset and elaborates on the proposed framework and improvement modules. Section 3 presents the comparative experiments and ship detection results. Section 4 discusses the experimental results and analyzes the contributions of the improvement modules. Finally, Section 5 concludes the paper.

2. Materials and Methods

This section presents the experimental datasets, data preprocessing methods, implementation details, improved methods, and evaluation metrics.

2.1. Dataset Introduction

This study employs a hybrid dataset, constructed from HRSID and SSDD, as the basis for model training and performance evaluation. Released in 2020, HRSID is a high-resolution SAR image dataset designed for ship detection and segmentation tasks. It comprises 5604 images of 800 × 800 pixels with spatial resolutions ranging from 0.5 m to 3 m, and includes diverse scenes such as open seas and coastal ports. For SSDD, we use the improved version officially released by Zhang et al. [13] in 2021, comprising 1160 images with resolutions varying from 1 m to 15 m and inconsistent image dimensions (height 190–526 pixels, width 214–668 pixels). This dataset provides three annotation methods: (1) BBox; (2) RBox; (3) PSeg. HRSID is annotated with axis-aligned horizontal bounding boxes (BBoxes); to unify the model input, we also adopt the BBox annotation format for SSDD. Additionally, we apply zero-padding to the SSDD images to achieve uniform 800 × 800 dimensions. This approach preserves the original image information without distortion while unifying the input dimensions, improving model compatibility and computational efficiency.
After processing, we constructed a hybrid experimental dataset containing 6764 SAR images. An 8:2 ratio was used to divide the training and validation sets (5411 images for training, 1353 images for validation), ensuring experimental reliability and model generalization capability. Following the target size definition standards of the MS COCO dataset, in 800 × 800 SAR images we categorize ship targets into three classes: large targets (area > 1% of the image, pixel range (80 × 80, 800 × 800]), medium targets (0.5% < area ≤ 1%, pixel range (40 × 40, 80 × 80]), and small targets (0 < area ≤ 0.5%, pixel range (0, 40 × 40]).
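For readers who wish to reproduce the preprocessing, the following minimal sketch (not the authors' released code) zero-pads an SSDD image to 800 × 800 and assigns the size class defined above; the function names and the use of box area in pixels are illustrative assumptions.

```python
import numpy as np

TARGET_SIZE = 800  # unified input size for the hybrid HRSID + SSDD dataset

def pad_to_square(img: np.ndarray, target: int = TARGET_SIZE) -> np.ndarray:
    """Zero-pad an H x W (or H x W x C) SAR image to target x target without resizing."""
    h, w = img.shape[:2]
    padded = np.zeros((target, target) + img.shape[2:], dtype=img.dtype)
    padded[:h, :w, ...] = img  # original pixels kept in the top-left corner, no distortion
    return padded

def ship_size_class(box_area_px: float) -> str:
    """Assign the size class used in Section 2.1, based on box area in pixels."""
    if box_area_px <= 40 * 40:
        return "small"   # 0 < area <= 40 x 40 pixels
    if box_area_px <= 80 * 80:
        return "medium"  # 40 x 40 < area <= 80 x 80 pixels
    return "large"       # 80 x 80 < area <= 800 x 800 pixels
```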
Figure 1 displays complex background environments, densely distributed small ships, and example scenes of multi-scale ship targets. Figure 2 presents the quantity distribution and proportion of ship targets at various scales in both training and validation sets, with data showing that small ships account for as much as 91.3% and 91.5% in the training and validation sets, respectively. This highlights the primary challenges faced in SAR ship detection. Therefore, improving models to suppress noisy background interference, enhance multi-scale detection capability, and increase small target feature representation and attention is key to improving overall model detection performance.

2.2. Implementation Details

In this study, all experiments were conducted using PyTorch on a PC with Intel(R) Xeon(R) Gold 6430 CPU @ 2.10 GHz and NVIDIA GeForce RTX 4090 (24,564 MiB) GPU. The PC operating system was Windows 11, with PyTorch framework version 2.6.0 and CUDA architecture 12.4. Table 1 summarizes and lists the specific experimental environment.
To guarantee the fairness and robustness of the experiments, consistent parameter settings were implemented across all models. Input images were uniformly sized at 800 × 800 pixels, with training conducted for 100 epochs using a batch size of 16. The training was conducted using the SGD optimizer, configured with an initial learning rate (lr0) of 0.01, a learning rate decay factor (lrf) of 0.01 (final learning rate is lr0 × lrf = 0.0001), a momentum of 0.937, and a weight decay of 0.0005. Notably, the default mosaic data augmentation was disabled when training YOLO series models.
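As a concrete illustration of these settings, a training call of the following form would reproduce the listed hyperparameters, assuming the Ultralytics YOLO training API and a hypothetical dataset definition file named sar_ships.yaml; this is a sketch of the configuration, not the authors' exact script.

```python
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # baseline weights; the improved model would load a custom architecture
model.train(
    data="sar_ships.yaml",   # hybrid HRSID + SSDD dataset definition (assumed filename)
    imgsz=800,               # unified 800 x 800 input size
    epochs=100,
    batch=16,
    optimizer="SGD",
    lr0=0.01,                # initial learning rate
    lrf=0.01,                # final learning rate = lr0 * lrf = 0.0001
    momentum=0.937,
    weight_decay=0.0005,
    mosaic=0.0,              # mosaic augmentation disabled, as described above
)
```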
However, it should be emphasized that due to fundamental differences in algorithmic principles and network structures, certain models could not directly adapt to the aforementioned unified parameter settings. For these models, multiple sets of hyperparameter comparison experiments were conducted, and the optimal performance results were selected as the final comparison benchmark to ensure objectivity and scientific validity throughout the evaluation process. The specific details can be found in Table 2.

2.3. Evaluation Metrics

This research employs the precision, recall, AP50, and AP metrics to systematically evaluate model performance in ship recognition.
Intersection over Union (IoU) is a widely used metric for object detection, measuring the overlap between predicted and ground truth boxes. It is defined as the ratio of their intersection to their union, as shown below:
$$IoU = \frac{\text{Intersection Area}}{\text{Union Area}} \tag{1}$$
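As a small illustration of Equation (1), the snippet below computes IoU for two axis-aligned boxes in (x1, y1, x2, y2) format; the helper name is ours, not from the paper.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```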
In ship detection tasks, models may misclassify background regions and ship targets. The detection results can be categorized into four types: TP, TN, FP, and FN. Precision and recall are then defined based on these metrics, as shown in Equations (2) and (3):
$$P = \frac{TP}{TP + FP} \times 100\%, \tag{2}$$
$$R = \frac{TP}{TP + FN} \times 100\% \tag{3}$$
Average Precision (AP) is defined as the mean precision across different recall levels, computed as the area under the PR curve, where recall and precision are represented on the x- and y-axes, respectively (see Equation (4)):
$$AP = \int_{0}^{1} P(R)\, dR \tag{4}$$
MS COCO provides comprehensive evaluation metrics [44]. In this paper, AP50 and AP are employed to assess ship detection performance. AP50 represents the area under the PR curve at an IoU threshold of 0.5. AP (Average Precision) is a robust and comprehensive evaluation metric that measures precision across multiple IoU thresholds (0.5 to 0.95 in 0.05 increments), averaging over these 10 distinct levels.
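The sketch below illustrates how these metrics follow from the definitions above: AP is the area under a precision-recall curve (Equation (4)), and the COCO-style AP averages that area over the ten IoU thresholds. It is a simplified reference implementation, not the evaluation code used in the experiments.

```python
import numpy as np

def ap_from_pr(precision: np.ndarray, recall: np.ndarray) -> float:
    """Area under the PR curve (Equation (4)), using the monotone precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing in recall
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def coco_ap(pr_curves) -> float:
    """Mean AP over PR curves computed at IoU thresholds 0.50, 0.55, ..., 0.95."""
    return float(np.mean([ap_from_pr(p, r) for p, r in pr_curves]))
```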

2.4. The MC-ASFF-ShipYOLO Model

MC-ASFF-ShipYOLO is an improved model based on the YOLO11 object detection network. As a YOLO series model released on 30 September 2024, YOLO11 achieves significant improvements in feature extraction capability and small object detection performance compared to previous versions, with the advantages of higher precision, fewer parameters, and faster inference speed.
This research proposes a dual optimization strategy for both backbone and head sections through an in-depth analysis of the YOLO11 network architecture. In the backbone section, we incorporate a Monte Carlo Attention mechanism to strengthen the network’s ability to capture the morphology and precise locations of small ship targets while also enhancing its understanding of contextual relationships between small objects and their surrounding environment, thereby effectively improving small object detection performance. In the head section, we add an Adaptive Spatial Feature Fusion (ASFF) module, enabling the model to more effectively consolidate multi-scale ship feature information, mitigating conflicts and inconsistencies between features at different scales, and significantly improving detection accuracy and robustness in complex multi-scale target scenarios. Figure 3 illustrates the overall architecture of the MC-ASFF-ShipYOLO model, while Figure 4 presents the detailed internal structure of each functional module.

2.4.1. Monte Carlo Attention (MCAttn)

In CNN architectures, successive convolutional and pooling operations cause a progressive reduction in feature map resolution. For small ship target detection, these objects occupy extremely few pixels in images, limiting the available feature information for extraction. As network depth increases, these sparse features are easily compressed or even lost, leading to significant missed detections of small targets and consequently degrading the model’s overall detection accuracy. Therefore, enhancing the network’s learning capability for small target features and improving the model’s perception of contextual relationships between small objects and their surrounding environment are critical for detection performance improvement.
The proposed incorporation of Monte Carlo Attention into the YOLO network backbone effectively aids in small-ship target detection. MCAttn (Monte Carlo Attention) is a channel attention mechanism [45] that differs from the SE (squeeze–excitation) attention strategy [46], which obtains attention maps for each channel through global average pooling. Instead, MCAttn generates scale-invariant attention maps for each channel using pooling operations based on random sampling. Figure 5 illustrates the architecture of the MCAttn module.
MCAttn performs three average pooling operations on the input feature map to generate pooled tensors of different scales (3 × 3, 2 × 2, and 1 × 1). It then randomly selects one of these pooled tensors to generate the final 1 × 1 attention map. MCAttn captures information at different scales, and this cross-scale characteristic deepens the model’s understanding of the relationship between small targets and their surrounding environment, enabling the model to better focus on the location of small ship targets, thereby improving recognition accuracy. The calculation steps for the MCAttn output attention map are as follows:
$$A_m(x) = \sum_{i=1}^{n} P(x, i)\, f(x, i), \tag{5}$$
$$\sum_{i=1}^{n} P(x, i) = 1, \tag{6}$$
$$\prod_{i=1}^{n} P(x, i) = 0 \tag{7}$$
The MCAttn output attention map is defined as $A_m(x)$, where $x$ is the input tensor, $i$ indexes the size of the output attention map, $f(x, i)$ denotes the average pooling function, and the associated sampling probability $P(x, i)$ satisfies the constraints in Equations (6) and (7) above; $n$ is the number of pooled tensors.
Attention mechanisms commonly used in deep learning, like squeeze-and-excitation (SE), generate fixed-dimension attention maps by producing 1 × 1 output tensors through global average pooling (GAP), which often fails to function effectively in ship recognition tasks. Furthermore, this approach limits the model’s ability to capture cross-scale correlations, particularly when establishing non-linear relationships between small targets and their surrounding environments. MCAttn overcomes the limitations of conventional attention mechanisms by randomly sampling pooling tensors at various scales, significantly enhancing the model’s sensitivity to feature variations of small targets across multi-scale feature maps, thereby improving ship recognition accuracy.
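To make the sampling step concrete, the following simplified PyTorch sketch pools the input at three scales, randomly keeps one scale per forward pass, reduces it to a 1 × 1 per-channel descriptor, and re-weights the input features. The projection layer and activation are our assumptions; the published MCAttn module [45] may differ in its internal layers.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCAttnSketch(nn.Module):
    """Illustrative Monte Carlo channel attention: one pooling scale is sampled at random."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # assumed channel projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = [F.adaptive_avg_pool2d(x, s) for s in (3, 2, 1)]  # 3x3, 2x2, 1x1 pooled tensors
        chosen = random.choice(pooled)                             # P(x, i) = 1 for one scale, 0 otherwise
        attn = torch.sigmoid(self.proj(F.adaptive_avg_pool2d(chosen, 1)))  # 1x1 per-channel map
        return x * attn                                            # channel-wise re-weighting

# usage: y = MCAttnSketch(256)(torch.randn(1, 256, 40, 40))
```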

2.4.2. Adaptively Spatial Feature Fusion (ASFF)

As network depth increases, target detail features diminish while semantic information strengthens. Semantic features of small-scale targets can be fully extracted in shallow network layers, whereas large-scale targets require deeper network processing. This disparity may cause small target information to be lost during deep processing, making it difficult for models to achieve optimal detection balance. Furthermore, traditional multi-scale feature fusion networks (such as FPN) possess inherent deficiencies: when a target is classified as a positive sample in one feature layer, the corresponding region in other feature layers may be treated as background, resulting in feature conflicts and gradient computation interference. This inconsistency also exists in YOLO series networks. Therefore, we incorporated ASFF [47] into the detection head. ASFF adaptively learns spatial fusion weights for feature maps of different scales, effectively filtering conflicting information during training, enabling the model to utilize multi-scale features more efficiently, and enhancing feature scale invariance. The Detect + ASFF structure is illustrated in Figure 6, with the specific implementation requiring two steps: Feature Resizing and Adaptive Fusion.
  1. Feature Resizing
The Neck component of YOLO11 generates three feature maps at different scales, namely P3, P4, and P5, each with distinct resolution and channel dimensions. Consequently, features from other levels require resizing to align with the features of the current level. For notational convenience, P3, P4, and P5 correspond to level $l$ ($l = 1, 2, 3$). $x^l$ is defined as the feature representation at resolution level $l$. For any given level $l$, the feature maps $x^n$ from the other levels $n$ ($n \neq l$) must be resized to match the dimensions of $x^l$. We denote by $x^{n \rightarrow l}$ the process of adjusting the feature map of level $n$ ($n = 1, 2, 3$) to conform to the dimensions of level $l$. This dimensional transformation is accomplished through appropriate up-sampling and down-sampling strategies.
  2. Adaptive Fusion
After Feature Resizing, Adaptive Fusion is performed on the features of the three different scales. The calculation formula is as follows, where $x_{ij}^{n \rightarrow l}$ denotes the feature vector at position $(i, j)$ in the feature map from level $n$ after being resized to level $l$, and $y_{ij}^{l}$ denotes the $(i, j)$-th vector along the channel dimension in the output feature map $y^l$.
$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \rightarrow l} \tag{8}$$
$\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ represent the spatial importance weights of the feature maps from level $n$ ($n = 1, 2, 3$) to level $l$, which are adaptively learned by the network. These weights can be simple scalars shared across all channels. They satisfy the constraints $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$ and $\alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l} \in [0, 1]$. Taking $\alpha_{ij}^{l}$ as an example, the calculation formula is as follows:
$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}} \tag{9}$$
The weight scalar maps $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$, and $\lambda_{\gamma}^{l}$ are computed from $x^{1 \rightarrow l}$, $x^{2 \rightarrow l}$, and $x^{3 \rightarrow l}$, respectively, using 1 × 1 convolution layers. Subsequently, $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ are obtained through the Softmax function, with $\lambda_{\alpha_{ij}}^{l}$, $\lambda_{\beta_{ij}}^{l}$, and $\lambda_{\gamma_{ij}}^{l}$ serving as control parameters. Through the ASFF module, MC-ASFF-ShipYOLO adaptively fuses spatial features across the three scales and feeds the fusion results into the detection heads to accomplish ship target detection.
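A simplified PyTorch sketch of the fusion step in Equations (8) and (9) is given below: 1 × 1 convolutions produce the weight maps, Softmax normalizes them so that the three weights sum to one at every position, and the resized features are blended accordingly. Feature resizing to the target level is assumed to have been done already, and the layer names are illustrative rather than taken from the original ASFF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFFusionSketch(nn.Module):
    """Adaptive fusion of three feature maps already resized to a common shape."""
    def __init__(self, channels: int):
        super().__init__()
        # one 1x1 convolution per level produces the weight scalar maps (lambda in Eq. (9))
        self.weight_convs = nn.ModuleList(nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3))

    def forward(self, x1: torch.Tensor, x2: torch.Tensor, x3: torch.Tensor) -> torch.Tensor:
        feats = [x1, x2, x3]                                          # x^{1->l}, x^{2->l}, x^{3->l}
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = F.softmax(logits, dim=1)                            # alpha, beta, gamma sum to 1
        return sum(weights[:, i:i + 1] * feats[i] for i in range(3))  # Equation (8)

# usage: fused = ASFFFusionSketch(256)(f1, f2, f3)  # f1, f2, f3 share shape [B, 256, H, W]
```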

2.4.3. Loss Function

Confidence estimation, bounding box regression, and distribution focal loss jointly constitute the model’s loss function.
Considering the complex marine environment and the significant background noise affecting ship detection in SAR imagery, we adopt Binary Cross-Entropy (BCE) loss for confidence estimation, as it contributes to enhancing the model’s ability to distinguish foreground ships from complex backgrounds, thereby improving the accuracy of object existence prediction. For bounding box localization, the CIoU (Complete Intersection over Union) loss function is employed, demonstrating superior performance in high-precision box regression tasks. Moreover, compared to IoU, DIoU, and GIoU, CIoU is better suited for object detection scenarios with significant scale variations, making it an ideal choice for addressing the challenges of multi-scale ship detection.
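For reference, the CIoU loss for a single predicted/ground-truth box pair in (x1, y1, x2, y2) format can be written as below; this is our own illustrative implementation of the standard formulation, not code extracted from the training framework.

```python
import math

def ciou_loss(pred, gt, eps: float = 1e-7) -> float:
    """CIoU loss = 1 - (IoU - centre-distance term - aspect-ratio term)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # overlap (IoU) term
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # normalised squared distance between box centres (enclosing-box diagonal as scale)
    cw, ch = max(px2, gx2) - min(px1, gx1), max(py2, gy2) - min(py1, gy1)
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```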
Small objects occupy limited pixel regions in images, often accounting for less than 0.5% of the total image area (Section 2.1). This small size makes even minor deviations in bounding box predictions significantly impact localization accuracy. Distribution Focal Loss alleviates this challenge. The core concept of Distribution Focal Loss (DFL) is to predict bounding box coordinates using discrete probability distributions and optimize them through focal loss, concentrating predictions near the actual boundary values. Compared to traditional methods that directly predict bounding box coordinates, DFL offers greater flexibility and accuracy, enabling the model to capture subtle deviations and enhance adaptability for small-object detection [48]. Moreover, the focal mechanism also balances learning effectiveness across multi-scale objects. The calculation formula for DFL is presented below:
$$\mathrm{DFL}(S_i, S_{i+1}) = -\big( (y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1}) \big) \tag{10}$$
In the formula, $y_i$ and $y_{i+1}$ are the discrete neighboring points closest to the ground-truth value $y$, with $y_i = \lfloor y \rfloor$ and $y_{i+1} = y_i + 1$. $S(\cdot)$ denotes the Softmax function, which generates probability distributions over the discrete points, and $S_i$ is the probability corresponding to $y_i$ in the predicted distribution.
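A simplified single-target version of Equation (10) is sketched below, assuming the network outputs unnormalized scores over the discrete coordinate bins; bin-range handling and reduction over multiple targets are omitted for clarity.

```python
import torch
import torch.nn.functional as F

def dfl(logits: torch.Tensor, y: float) -> torch.Tensor:
    """Distribution Focal Loss for one target y, with logits over discrete bins 0..num_bins-1."""
    yi = int(y)                       # left neighbour y_i = floor(y)
    yi1 = yi + 1                      # right neighbour y_{i+1}
    probs = F.softmax(logits, dim=0)  # S(.) over the discrete bins
    wl = yi1 - y                      # weight on the left bin, (y_{i+1} - y)
    wr = y - yi                       # weight on the right bin, (y - y_i)
    return -(wl * torch.log(probs[yi]) + wr * torch.log(probs[yi1]))

# usage: loss = dfl(torch.randn(16), y=7.3)
```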

3. Results

This section provides a detailed evaluation of the MC-ASFF-ShipYOLO model’s performance on the constructed hybrid SAR ship dataset and systematically compares it with classical object detection networks. We conduct a comprehensive assessment of the improved model’s effectiveness in SAR ship detection tasks through both quantitative metrics and qualitative analysis. YOLO11 is an improved version of YOLOv8. Compared to the original model, YOLO11 introduces the Cross-Stage Partial with Self-Attention (C2PSA) module, which boosts the ability to capture contextual information, thereby improving detection performance for small-scale and occluded objects. Furthermore, the innovative C3k2 module replaces the traditional C2f module, optimizing both efficiency and inference speed while maintaining detection accuracy [43]. These advancements render YOLO11 particularly suitable for SAR ship detection. The YOLO11 architecture encompasses five model variants of varying scales (n, s, m, l, x), designed to address the diverse requirements of different dataset sizes and application scenarios. Based on these considerations, we selected YOLO11s as the improvement baseline.

3.1. Comparison Experiment

In this section, we selected two-stage detectors (Faster R-CNN and Cascade R-CNN), the efficiency-oriented EfficientNet, and various versions of the YOLO series (YOLOv8, YOLOv9, YOLOv10, YOLO11, and YOLO12) for comparative experiments. Table 3 provides a detailed comparison of the experimental outcomes.
The analysis of the experimental data in Table 3 indicates that our selected baseline model, YOLO11s, demonstrates excellent performance on the hybrid SAR ship dataset, achieving 92.48% precision, 84.21% recall, 92.78% AP50, and 67.40% AP. Notably, YOLOv9s also performs well, with recall and AP values 0.73% and 0.13% higher than YOLO11s, respectively; however, its precision and AP50 are 1.58% and 0.81% lower. Comprehensively evaluating detection performance and computational efficiency, YOLO11s emerges as the ideal base architecture for this research, offering faster inference speed (increased by approximately 37.04 FPS) compared to YOLOv9s.
The MC-ASFF-ShipYOLO proposed in this study significantly outperforms traditional algorithms including the YOLO series and Faster R-CNN in ship detection tasks under complex marine environments. The predefined anchor mechanism of Faster R-CNN exhibits reduced matching efficiency when facing multi-scale targets, resulting in decreased detection performance for small ships. Although Cascade R-CNN improves multi-scale adaptability through progressive anchor optimization, its general accuracy still requires improvement. In contrast, YOLO series models demonstrate superior general performance.
Our MC-ASFF-ShipYOLO model achieves significant performance improvements, reaching optimal levels in precision (93.87%), AP50 (94.56%), and AP (70.44%), with recall at 86.84%, second only to EfficientNet. In-depth analysis reveals that while EfficientNet demonstrates the highest recall (89.29%), its precision is merely 74.91%, the lowest among all comparison models, indicating strong generalization ability and sensitivity to small object detection, but weak bounding box localization accuracy and complex background suppression capabilities, resulting in high recall but low precision with numerous false positives. Compared to the baseline model, MC-ASFF-ShipYOLO improves recall by 2.63%, exceeding YOLOv9s by 1.90%. The performance improvements are primarily attributed to the significant contributions of the MCAttn and ASFF modules in tackling the challenges of small object detection in complex environments and multi-scale ship recognition. The improved model exhibits greater stability during training without signs of overfitting.

3.2. Qualitative Analysis of Ship Detection Results

In this section, we perform ship target inference using the trained models and visualize the results in Figure 7. We selected the best-performing backbone from each object detection algorithm, or the optimal size from the YOLO series, to highlight the advanced nature of our improved model, MC-ASFF-ShipYOLO. For the parameter configuration, we used a confidence threshold of 0.75 and applied NMS with an IoU threshold of 0.6. This choice reflects our confidence in MC-ASFF-ShipYOLO: even under stricter confidence requirements, the model can detect more ship targets while effectively reducing interference from low-quality detection results.
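For illustration, an inference call with these settings could look as follows, again assuming the Ultralytics prediction API, a hypothetical weights file, and an example image path.

```python
from ultralytics import YOLO

model = YOLO("mc_asff_shipyolo.pt")           # hypothetical path to the trained weights
results = model.predict("sar_scene.png",       # example SAR image (assumed filename)
                        imgsz=800, conf=0.75, iou=0.6)
for box in results[0].boxes:
    print(box.xyxy, float(box.conf))           # predicted box corners and confidence score
```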
The visualization analysis of inference results in Figure 7 demonstrates that the improved MC-ASFF-ShipYOLO model can detect more ship targets, performing excellently even in complex environments or dense areas. This indicates that while enhancing attention to small ship targets, the model effectively suppresses noise interference from surrounding complex environments, exhibiting stronger generalization capability and robustness. Furthermore, these results confirm that MC-ASFF-ShipYOLO appropriately utilizes cross-scale information, enhancing its learning capacity for ship target features. In EfficientNet’s inference results, we observed fewer detected ships. This is due to the high confidence threshold (confidence threshold = 0.75), which filters out low-confidence targets, consistent with EfficientNet’s characteristics of high recall but low precision on the ship detection dataset.

4. Discussion

This section discusses the optimal position for introducing MCAttn into the backbone network, assessing the impact of the improved module through visualization-based qualitative analysis. Additionally, this section presents the comparison results of our improved module with similar existing methods, as well as the ablation experiment results.

4.1. Effectiveness of the MCAttn and Its Optimal Placement

Different features are extracted from various parts of the YOLO11 network backbone, necessitating a decision regarding which feature sections should utilize the MCAttn mechanism to maximize ship detection performance. We conducted a comparative experiment by introducing MCAttn after each C3k2 module in the backbone, with position illustrations shown in Figure 8. Table 4 lists the experimental results.
The experimental results revealed the critical impact of MCAttn module placement selection. Not all positions in the backbone where MCAttn was inserted could improve detection performance. When MCAttn was inserted after the P2 level C3k2 module (Posi-1), the model’s precision decreased by 1.95% and AP decreased by 0.74%, despite slight improvements in recall and AP50 (by 0.23% and 0.16%, respectively). These minimal gains could hardly offset the precision loss. The appropriate placement of MCAttn is crucial for fully leveraging its cross-scale information capture capabilities. When MCAttn was positioned after the P5 level C3k2 module (Posi-4), all metrics achieved significant improvements: precision reached 92.62% (+0.14%), recall reached 85.56% (+1.35%), AP50 reached 93.46% (+1.18%), and AP reached 69.34% (+1.94%). In particular, the substantial improvement in AP indicates significantly enhanced comprehensive detection capability, improved adaptability to multi-scale objects, and higher quality of bounding box predictions. Considering that AP is the most challenging evaluation metric, and given that all other metrics also improved, Posi-4 was determined to be the optimal insertion position for MCAttn. Notably, although the Posi-2 experiment achieved the highest recall (86.42%), its precision was only 90.55%, decreasing by 1.93%, indicating that the model may be overly sensitive to background noise and prone to false detections. This imbalance between precision and recall reflects diminished discriminative capability between background and target regions, which is disadvantageous for practical applications.

4.2. Comparative Evaluation of Similar Approaches

This section presents a comparative analysis with analogous methodologies to quantitatively assess the advantages of incorporating MCAttn and ASFF modules into the baseline model. For attention mechanisms, we selected CBAM [49] and SimAM [50] as comparative approaches and integrated them at identical positions as MCAttn. Regarding feature fusion, BiFPN [51] and HS-FPN [52] were implemented as comparative frameworks. Table 5 illustrates the experimental outcomes.
The experimental results indicate that various modules exert differential effects on YOLO11 performance. Regarding attention mechanisms, the 3D attention mechanism SimAM caused significant deterioration in model detection accuracy, with recall, AP50, and AP decreasing by 10.01%, 10.02%, and 10.28%, respectively. Channel attention mechanisms CBAM and MCAttn both enhanced model precision, with MCAttn demonstrating superior performance. Specifically, MCAttn outperformed CBAM by 0.07% and 0.49% in the precision and recall metrics, respectively, while exhibiting more substantial advantages in the AP50 and AP metrics, surpassing CBAM by 0.14% and 1.07%, respectively. In terms of feature fusion modules, although the Multi-level Feature Fusion Pyramid (HS-FPN) and Bidirectional Feature Pyramid Network (BiFPN) significantly enhanced inference speed (increasing FPS by 129.63 and 155.95, respectively), both led to a decline in overall detection performance, with the more challenging AP metric decreasing by 5.46% and 0.90%, respectively. In comparison, the ASFF module achieved significant improvements in detection performance, increasing precision and recall by 0.75% and 0.41% and enhancing AP50 and AP by 0.82% and 0.99%, respectively. Consequently, integrating the MCAttn and ASFF modules into the YOLO11 architecture demonstrates greater advantages in improving ship detection performance.

4.3. Ablation Experiment

To systematically assess the impact of the proposed enhancement strategies on ship detection performance, a comprehensive set of ablation experiments was performed. Using YOLO11s as the baseline model, experiments were performed on the constructed hybrid SAR ship detection dataset, with detailed results presented in Table 6. By incrementally adding each functional module and analyzing the resulting performance changes, we were able to explicitly measure the individual contribution of each improvement as well as their combined effects, thereby confirming the effectiveness of the proposed approach.
The ablation experiment results in Table 6 demonstrate that the baseline YOLO11s model (Experiment 1) achieved 92.48% precision, 84.21% recall, 92.28% AP50, and 67.40% AP on the hybrid dataset, serving as the comparative benchmark for subsequent improvements. Upon incorporating MCAttn after the P5 level C3k2 module in the backbone (Experiment 2), all performance metrics improved. This comprehensive enhancement confirmed the effectiveness of the MCAttn module in strengthening feature learning, capturing cross-scale information, and increasing attention to small-ship targets. Adding the ASFF module to the detection head (Experiment 3) also yielded notable improvements: compared to the baseline model, precision increased by 0.75%, recall by 0.41%, AP50 by 0.82%, and AP by 0.99%. This indicates that ASFF effectively enhanced feature scale invariance through the adaptive learning of spatial fusion weights for feature maps at different scales, thereby improving ship detection accuracy.
In conclusion, the MC-ASFF-ShipYOLO model significantly enhanced detection accuracy for small ship targets in complex maritime scenarios while effectively addressing challenges in multi-scale feature extraction. Although detection performance has been enhanced, it is important to note that computational complexity increased correspondingly. Table 3 shows that the improved model has increased in parameter count, and the FPS has decreased to 232.56 images per second, presenting a trade-off that warrants further optimization in future work.

4.4. Analysis of the Contributions of the Improvement Modules

Ablation experiment results indicate that introducing the MCAttn module alone provided more significant overall accuracy improvements compared to the ASFF module. Relative to the ASFF experimental group, the MCAttn experimental group showed a 0.61% decrease in precision but achieved increases of 0.94% in recall, 0.36% in AP50, and 0.95% in AP. To further analyze the contribution of the MCAttn module to ship detection, we visualized network output feature maps using Grad-CAM++ [53]. To ensure fairness in comparative experiments, all tests were conducted under identical configuration environments with the same parameters (e.g., Confidence = 0.75). Figure 9 displays the feature maps following the SPPF layer output in the backbone.
By comparing the feature heat maps, we can visually observe that after introducing the MCAttn module, the network’s attention to critical features of small ship targets significantly increased, with the fifth comparison group also demonstrating enhanced multi-scale detection capability. These changes are reflected in the heat maps, where ship regions exhibit higher activation intensity and more precise target localization [53]. The comparative results conclusively validate the positive contribution of the improved module in enhancing ship detection precision.

5. Conclusions

This study presented MC-ASFF-ShipYOLO, an improved framework for SAR ship detection, by introducing MCAttn and ASFF to enhance the YOLO model, which effectively mitigates the dual challenges of small target recognition and multi-scale feature integration. Through comprehensive experimentation on our constructed benchmark combining HRSID and SSDD datasets, we demonstrated the effectiveness of our two key innovations. Ablation experiment results demonstrate that both improvements enhanced model performance, with more significant effects when working synergistically. Comparative experiments with other widely used object detection algorithms confirm that MC-ASFF-ShipYOLO exhibits distinct advantages in small target recognition and multi-scale ship detection. Even at high confidence thresholds, the model generates high-quality detection results in dense and complex marine environments. We conducted an in-depth investigation of the optimal positioning for the MCAttn module and analyzed its positive impact on network feature activation using Grad-CAM++ visualization techniques. This research not only provides an effective implementation scheme for the YOLO11 model in SAR ship detection tasks but also establishes a novel technical approach for addressing the challenges of small ship identification and multi-scale ship detection in SAR imagery. The findings have practical significance for advancing the field of marine target monitoring.
Despite the progress achieved in this study, where our improved model demonstrated advantages in small ship detection and multi-scale challenges, certain limitations remain. The improved network architecture has increased computational overhead. Future research will focus on the following directions: we will explore more lightweight modules and optimization strategies. Additionally, we plan to adopt rotated bounding box (RBox) annotations to better characterize ship geometrical features and reduce interference between adjacent targets in dense scenarios, further enhancing detection performance. These improvements are expected to enhance the model’s practical utility and applicability in complex marine environments.

Author Contributions

Conceptualization, Y.X., H.P. and L.W.; methodology, Y.X., H.P. and L.W.; software, Y.X., H.P. and R.Z.; validation, Y.X., H.P. and L.W.; formal analysis, H.P.; investigation, Y.X. and H.P.; resources, Y.X., H.P. and L.W.; data curation, Y.X., H.P., L.W. and R.Z.; writing—original draft preparation, Y.X. and H.P.; writing—review and editing, Y.X. and H.P.; visualization, Y.X. and R.Z.; supervision, L.W. and R.Z.; project administration, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Paolo, F.S.; Kroodsma, D.; Raynor, J.; Hochberg, T.; Davis, P.; Cleary, J.; Marsaglia, L.; Orofino, S.; Thomas, C.; Halpin, P. Satellite mapping reveals extensive industrial activity at sea. Nature 2024, 625, 85–91. [Google Scholar] [CrossRef] [PubMed]
  2. Min, L.; Dou, F.; Zhang, Y.; Shao, D.; Li, L.; Wang, B. CM-YOLO: Context Modulated Representation Learning for Ship Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4202414. [Google Scholar] [CrossRef]
  3. Alexandre, C.; Devillers, R.; Mouillot, D.; Seguin, R.; Catry, T. Ship detection with SAR C-Band satellite images: A systematic review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14353–14367. [Google Scholar] [CrossRef]
  4. Er, M.J.; Zhang, Y.; Chen, J.; Gao, W. Ship detection with deep learning: A survey. Artif. Intell. Rev. 2023, 56, 11825–11865. [Google Scholar] [CrossRef]
  5. Sharma, R.; Saqib, M.; Lin, C.; Blumenstein, M. MASSNet: Multiscale attention for single-stage ship instance segmentation. Neurocomputing 2024, 594, 127830. [Google Scholar] [CrossRef]
  6. Owda, A.; Dall, J.; Badger, M.; Cavar, D. Improving SAR wind retrieval through automatic anomalous pixel detection. Int. J. Appl. Earth Obs. Geoinf. 2023, 122, 103444. [Google Scholar] [CrossRef]
  7. Tsokas, A.; Rysz, M.; Pardalos, P.M.; Dipple, K. SAR data applications in earth observation: An overview. Expert Syst. Appl. 2022, 205, 117342. [Google Scholar] [CrossRef]
  8. Chen, X.; Tao, H.; Zhou, H.; Zhou, P.; Deng, Y. Hierarchical and progressive learning with key point sensitive loss for sonar image classification. Multimed. Syst. 2024, 30, 380. [Google Scholar] [CrossRef]
  9. Muhammad, Y.; Liu, S.; Xu, M.; Wan, J.; Sheng, H.; Shah, N.; Xin, Z.; Arife Tugsan, I.C. YOLOv8-BYTE: Ship tracking algorithm using short-time sequence SAR images for disaster response leveraging GeoAI. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103771. [Google Scholar]
  10. Pappas, O.; Achim, A.; Bull, D. Superpixel-level CFAR detectors for ship detection in SAR imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1397–1401. [Google Scholar] [CrossRef]
  11. Rihan, M.Y.; Nossair, Z.B.; Mubarak, R.I. An improved CFAR algorithm for multiple environmental conditions. Signal Image Video Process. 2024, 18, 3383–3393. [Google Scholar] [CrossRef]
  12. Ponsford, A.; McKerracher, R.; Ding, Z.; Moo, P.; Yee, D. Towards a Cognitive Radar: Canada’s Third-Generation High Frequency Surface Wave Radar (HFSWR) for Surveillance of the 200 Nautical Mile Exclusive Economic Zone. Sensors 2017, 17, 1588. [Google Scholar] [CrossRef] [PubMed]
  13. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR ship detection dataset (SSDD): Official release and comprehensive data analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  14. Yu, C.; Shin, Y. SAR ship detection based on improved YOLOv5 and BiFPN. ICT Express 2024, 10, 28–33. [Google Scholar] [CrossRef]
  15. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep learning for SAR ship detection: Past, present and future. Remote Sens. 2022, 14, 2712. [Google Scholar] [CrossRef]
  16. Li, J.; Chen, J.; Cheng, P.; Yu, Z.; Yu, L.; Chi, C. A survey on deep-learning-based real-time SAR ship detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3218–3247. [Google Scholar] [CrossRef]
  17. Girshick, R. Fast r-cnn. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  19. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  20. Ke, X.; Zhang, X.; Zhang, T.; Shi, J.; Wei, S. SAR ship detection based on an improved faster R-CNN using deformable convolution. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 3565–3568. [Google Scholar]
  21. Zhao, Y.; Zhao, L.; Xiong, B.; Kuang, G. Attention receptive pyramid network for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2738–2756. [Google Scholar] [CrossRef]
  22. Wang, Z.; Du, L.; Mao, J.; Liu, B.; Yang, D. SAR target detection based on SSD with data augmentation and transfer learning. IEEE Geosci. Remote Sens. Lett. 2018, 16, 150–154. [Google Scholar] [CrossRef]
  23. Miao, T.; Zeng, H.; Yang, W.; Chu, B.; Zou, F.; Ren, W.; Chen, J. An improved lightweight RetinaNet for ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4667–4679. [Google Scholar] [CrossRef]
  24. Zhu, M.; Hu, G.; Li, S.; Zhou, H.; Wang, S.; Feng, Z. A novel anchor-free method based on FCOS + ATSS for ship detection in SAR images. Remote Sens. 2022, 14, 2034. [Google Scholar] [CrossRef]
  25. Yu, C.; Shin, Y. SMEP-DETR: Transformer-Based Ship Detection for SAR Imagery with Multi-Edge Enhancement and Parallel Dilated Convolutions. Remote Sens. 2025, 17, 953. [Google Scholar] [CrossRef]
  26. Qin, C.; Zhang, L.; Wang, X.; Li, G.; He, Y.; Liu, Y. RDB-DINO: An Improved End-to-End Transformer with Refined De-Noising and Boxes for Small-Scale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5200517. [Google Scholar] [CrossRef]
  27. Khan, A.; Rauf, Z.; Sohail, A.; Khan, A.R.; Asif, H.; Asif, A.; Farooq, U. A survey of the vision transformers and their CNN-transformer based variants. Artif. Intell. Rev. 2023, 56, 2917–2970. [Google Scholar] [CrossRef]
  28. Zhao, C.; Fu, X.; Dong, J.; Cao, S.; Zhang, C. Enhancing, Refining, and Fusing: Towards Robust Multi-Scale and Dense Ship Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9919–9933. [Google Scholar] [CrossRef]
  29. Yasir, M.; Shanwei, L.; Mingming, X.; Jianhua, W.; Nazir, S.; Islam, Q.U.; Dang, K.B. SwinYOLOv7: Robust ship detection in complex synthetic aperture radar images. Appl. Soft Comput. 2024, 160, 111704. [Google Scholar] [CrossRef]
  30. Zhou, R.; Gu, M.; Hong, Z.; Pan, H.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S. SIDE-YOLO: A Highly Adaptable Deep Learning Model for Ship Detection and Recognition in Multisource Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1501405. [Google Scholar] [CrossRef]
  31. Liu, Y.; Ma, Y.; Chen, F.; Shang, E.; Yao, W.; Zhang, S.; Yang, J. YOLOv7oSAR: A Lightweight High-Precision Ship Detection Model for SAR Images Based on the YOLOv7 Algorithm. Remote Sens. 2024, 16, 913. [Google Scholar] [CrossRef]
Figure 1. (a) Ship targets in complex inland river environments; (b) scene showing densely distributed small ship targets; (c) examples of multi-scale ship targets after zero-padding processing of the SSDD dataset. Red bounding boxes highlight real ships.
Figure 2. Number of ships of different scales and the proportion of each scale in the training and validation sets.
Figure 3. Overall framework of the MC-ASFF-ShipYOLO model built on YOLO11, with the improved modules highlighted in red boxes.
Figure 4. Detailed composition of modules in the improved MC-ASFF-ShipYOLO model.
Figure 5. Detailed implementation structure of Monte Carlo Attention. The letters H, W, and C denote height, width, and number of channels, respectively.
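For readers who prefer code to diagrams, the following is a minimal PyTorch sketch of the random-sampling pooling idea behind MCAttn: a pooling resolution is drawn at random, the pooled map is projected and squashed into an attention map, and the input features are re-weighted by it. The class name, candidate pool sizes, and 1 × 1 projection are illustrative assumptions, not the authors' exact implementation.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonteCarloAttention(nn.Module):
    """Sketch of a Monte Carlo attention block: an attention map is built from a
    randomly sampled pooling resolution and used to re-weight the input features."""

    def __init__(self, channels: int, pool_sizes=(1, 2, 3)):
        super().__init__()
        self.pool_sizes = pool_sizes                              # candidate pooled-output sizes (illustrative)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1 projection of the pooled map

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Randomly sample one pooling resolution per forward pass (the "Monte Carlo" step);
        # fall back to a fixed size at inference for determinism.
        s = random.choice(self.pool_sizes) if self.training else self.pool_sizes[0]
        pooled = F.adaptive_avg_pool2d(x, output_size=s)          # (b, c, s, s)
        attn = torch.sigmoid(self.proj(pooled))                   # attention weights in (0, 1)
        attn = F.interpolate(attn, size=(h, w), mode="nearest")   # broadcast back to input resolution
        return x * attn                                           # re-weight the feature map

if __name__ == "__main__":
    feat = torch.randn(2, 64, 40, 40)
    out = MonteCarloAttention(64)(feat)
    print(out.shape)  # torch.Size([2, 64, 40, 40])
```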
Figure 6. Detailed implementation diagram of the ASFF modules added to the head section. The P3, P4, and P5 feature layers are output by the Neck component of YOLO. YOLO incorporates three detection heads, each dedicated to detecting objects at a different scale. The Adaptively Spatial Feature Fusion (ASFF) module is integrated immediately before each detection head.
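The snippet below is a minimal PyTorch sketch of adaptively spatial feature fusion for a single detection head: per-pixel fusion weights are predicted for each input level, normalized with a softmax, and used to blend the features. It assumes the three inputs have already been resized to a common resolution and channel width, which the full ASFF module handles itself; class and layer choices are illustrative, not the exact implementation used in the model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFHeadInput(nn.Module):
    """Sketch of adaptively spatial feature fusion feeding one detection head.
    Assumes the three pyramid features already share resolution and channel width."""

    def __init__(self, channels: int):
        super().__init__()
        # One 1x1 conv per input level produces a single-channel weight-logit map.
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)   # post-fusion smoothing conv

    def forward(self, p3: torch.Tensor, p4: torch.Tensor, p5: torch.Tensor) -> torch.Tensor:
        feats = [p3, p4, p5]
        # Per-pixel logits for each level, normalized with softmax so the weights sum to 1.
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = F.softmax(logits, dim=1)                        # (b, 3, h, w)
        fused = sum(weights[:, i:i + 1] * feats[i] for i in range(3))
        return self.fuse(fused)

if __name__ == "__main__":
    b, c, h, w = 2, 128, 80, 80
    x3, x4, x5 = (torch.randn(b, c, h, w) for _ in range(3))
    print(ASFFHeadInput(c)(x3, x4, x5).shape)  # torch.Size([2, 128, 80, 80])
```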
Figure 7. Comparison of the MC-ASFF-ShipYOLO model with YOLO-series models and traditional object detection models. Both Faster R-CNN and Cascade R-CNN use ResNet50 as the backbone.
Figure 8. Alternative positions for MCAttn module insertion in the YOLO11 backbone. PX/Y denotes the feature map at pyramid level X (X = 2, 3, 4, 5), where Y is the down-sampling factor. Position i (i = 1, 2, 3, 4) in the figure corresponds to the four experimental groups (Posi-i) presented in Table 4.
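As a quick illustration of the PX/Y notation, the snippet below prints the feature-map sizes at pyramid levels 2–5, assuming a 640 × 640 input (a common YOLO default; the input size is not stated in the caption).

```python
# For pyramid level X the down-sampling factor is 2**X, so PX/Y has Y = 2**X.
# Input size of 640 is an assumption for illustration only.
input_size = 640
for level in (2, 3, 4, 5):
    stride = 2 ** level
    side = input_size // stride
    print(f"P{level}/{stride}: {side}x{side} feature map")
# P2/4: 160x160, P3/8: 80x80, P4/16: 40x40, P5/32: 20x20
```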
Figure 9. Visualization results of the Grad-CAM++ feature heat map for ship target recognition (Confidence = 0.75). The feature heat maps of YOLO11-s and YOLO11-s + MCAttn are both outputs after the backbone SPPF layer.
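The heat maps in Figure 9 are produced with Grad-CAM++; the sketch below shows the simpler plain Grad-CAM recipe (global-average-pooled gradients weighting the feature channels), which conveys the idea without the extra gradient weighting that Grad-CAM++ adds. The function, its arguments, and the scoring callback are hypothetical and are not the visualization code used in the paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_score_fn):
    """Plain Grad-CAM sketch (Grad-CAM++ additionally re-weights the gradients).
    `class_score_fn` must reduce the model output to a scalar score for the target."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = class_score_fn(model(image))    # scalar score for the ship class/box of interest
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    weights = grads[0].mean(dim=(2, 3), keepdim=True)             # GAP over the gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))   # weighted sum of channels
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
```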
Table 1. Experimental environment.

Configuration | Model/Parameter
Operating system | Windows 11
CPU | Intel(R) Xeon(R) Gold 6430 @ 2.10 GHz
GPU | NVIDIA GeForce RTX 4090 (24,564 MiB)
RAM | 120 GB
Compiler | Python 3.11.11
Framework | CUDA 12.4 / cuDNN 9.1.0 / torch 2.6.0
Table 2. Hyperparameter settings for other models. All models used the SGD optimizer (momentum = 0.937, weight decay = 0.0005). “-” in the NMS column indicates that non-maximum suppression was disabled during training. In the Backbone column, ‘R’ denotes ResNet and ‘Eff-b3’ denotes EfficientNet-B3.

Models | Backbone | Batch Size | lr0 | lrf | NMS
Faster R-CNN | R50 | 16 | 0.01 | 0.0001 | 0.6
Faster R-CNN | R101 | 12 | 0.01 | 0.0001 | 0.7
Cascade R-CNN | R50 | 16 | 0.01 | 0.0001 | 0.6
Cascade R-CNN | R101 | 16 | 0.01 | 0.0001 | 0.7
EfficientNet | Eff-b3 | 6 | 0.01 | 0.0001 | -
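To make the shared training setup concrete, below is a minimal PyTorch sketch of how the SGD settings from Table 2 could be wired up, reading lr0 as the initial and lrf as the final learning rate with a linear decay between them (one common convention; schedules differ across codebases). The epoch count and the model stand-in are illustrative, not the exact training code used in the paper.

```python
import torch

# Hypothetical model stand-in; the actual detectors come from their own codebases.
model = torch.nn.Conv2d(3, 16, 3)

lr0, lrf, epochs = 0.01, 0.0001, 100   # lr0/lrf from Table 2; epoch count is illustrative

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=lr0,
    momentum=0.937,        # momentum shared by all models in Table 2
    weight_decay=0.0005,   # weight decay shared by all models in Table 2
)

# Linear decay of the learning rate from lr0 at epoch 0 down to lrf at the last epoch.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda e: 1.0 - (1.0 - lrf / lr0) * e / max(epochs - 1, 1),
)
```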
Table 3. Comparison of MC-ASFF-ShipYOLO with mainstream object detection models. In the Backbone or Size column, ‘R’ denotes ResNet and ‘Eff-b3’ denotes EfficientNet-B3. P (precision), R (recall), AP50 (average precision at an IoU threshold of 0.5), AP (AP averaged over IoU thresholds in [0.5:0.95], step = 0.05), and FPS (images per second, measuring inference speed).

Models | Backbone or Size | P/% | R/% | AP50/% | AP/% | Params (M) | FPS (img/s)
YOLOv8 | n | 90.85 | 82.83 | 90.67 | 64.05 | 3.01 | 666.67
YOLOv8 | s | 90.74 | 84.75 | 91.81 | 65.24 | 11.13 | 370.37
YOLOv9 | t | 91.11 | 83.82 | 91.55 | 65.76 | 1.97 | 555.57
YOLOv9 | s | 90.90 | 84.94 | 91.97 | 67.53 | 7.17 | 333.33
YOLOv10 | n | 88.90 | 81.55 | 90.11 | 65.20 | 2.70 | 666.67
YOLOv10 | s | 91.09 | 83.40 | 91.76 | 66.77 | 8.04 | 384.62
YOLO11 | n | 90.41 | 81.56 | 89.20 | 63.92 | 2.58 | 555.57
YOLO11 | s | 92.48 | 84.21 | 92.78 | 67.40 | 9.41 | 370.37
YOLO12 | n | 90.93 | 81.08 | 90.43 | 66.04 | 2.56 | 416.67
YOLO12 | s | 91.39 | 82.21 | 91.59 | 66.24 | 9.23 | 243.90
EfficientNet | Eff-b3 | 74.91 | 89.29 | 88.00 | 61.30 | 18.34 | 306.30
Faster R-CNN | R50 | 84.18 | 82.87 | 83.40 | 60.80 | 41.35 | 308.10
Faster R-CNN | R101 | 84.18 | 83.08 | 83.20 | 60.50 | 60.34 | 324.00
Cascade R-CNN | R50 | 84.51 | 83.98 | 84.50 | 63.10 | 69.15 | 317.50
Cascade R-CNN | R101 | 84.97 | 84.13 | 83.80 | 62.80 | 88.14 | 317.10
MC-ASFF-ShipYOLO (Ours) | - | 93.87 | 86.84 | 94.56 | 70.44 | 60.28 | 232.56
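As a reminder of how the AP column relates to AP50, the snippet below averages per-threshold APs over IoU thresholds 0.50–0.95 in steps of 0.05; the per-threshold values are placeholders for illustration, not results from the paper.

```python
import numpy as np

# COCO-style AP: the mean of per-threshold APs over IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 0.96, 0.05)                        # 10 thresholds
ap_per_threshold = np.linspace(0.95, 0.40, num=len(iou_thresholds))  # placeholder AP values

ap = ap_per_threshold.mean()    # AP averaged over all 10 IoU thresholds
ap50 = ap_per_threshold[0]      # AP at IoU = 0.50
print(len(iou_thresholds), round(float(ap), 4), round(float(ap50), 4))
```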
Table 4. Effect of MCAttn insertion at different positions.

Models | P/% | R/% | AP50/% | AP/%
Baseline | 92.48 | 84.21 | 92.28 | 67.40
Posi-1 | 90.53 | 84.44 | 92.44 | 66.66
Posi-2 | 90.55 | 86.42 | 93.20 | 68.28
Posi-3 | 92.37 | 85.28 | 93.45 | 69.17
Posi-4 | 92.62 | 85.56 | 93.46 | 69.34
Table 5. Comparison experiments of different modules (baseline: YOLO11s).

Model | P/% | R/% | AP50/% | AP/% | FPS (img/s)
Baseline | 92.48 | 84.21 | 92.28 | 67.40 | 370.37
+SimAM | 92.11 | 74.20 | 82.26 | 57.12 | 434.78
+CBAM | 92.55 | 85.07 | 93.32 | 68.27 | 263.16
+MCAttn | 92.62 | 85.56 | 93.46 | 69.34 | 270.27
+HS-FPN | 91.31 | 82.20 | 89.56 | 61.94 | 500.00
+BiFPN | 91.45 | 83.37 | 91.94 | 66.50 | 526.32
+ASFF | 93.23 | 84.62 | 93.10 | 68.39 | 303.03
Table 6. Ablation experiment. “🗸” indicates the module is added, and “🗶” indicates it is not.

Different YOLO11s Models | MCAttn | ASFF | P/% | R/% | AP50/% | AP/% | FPS (img/s)
Experiment 1 (Baseline) | 🗶 | 🗶 | 92.48 | 84.21 | 92.28 | 67.40 | 370.37
Experiment 2 | 🗸 | 🗶 | 92.62 | 85.56 | 93.46 | 69.34 | 270.27
Experiment 3 | 🗶 | 🗸 | 93.23 | 84.62 | 93.10 | 68.39 | 303.03
Experiment 4 | 🗸 | 🗸 | 93.87 | 86.84 | 94.56 | 70.44 | 232.56
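For reference, the percentage-point gains of Experiment 4 (both modules enabled) over the baseline in Table 6 can be checked with a few lines of arithmetic:

```python
baseline = {"P": 92.48, "R": 84.21, "AP50": 92.28, "AP": 67.40}
full     = {"P": 93.87, "R": 86.84, "AP50": 94.56, "AP": 70.44}

# Percentage-point gains of the full MC-ASFF-ShipYOLO model over the YOLO11s baseline.
gains = {k: round(full[k] - baseline[k], 2) for k in baseline}
print(gains)  # {'P': 1.39, 'R': 2.63, 'AP50': 2.28, 'AP': 3.04}
```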