Article

MCEM: Multi-Cue Fusion with Clutter Invariant Learning for Real-Time SAR Ship Detection

by Haowei Chen 1, Manman He 2, Zhen Yang 1,* and Lixin Gan 3

1 The School of Information Engineering, Jiangxi Science and Technology Normal University, Road of Xuefu, Nanchang 330013, China
2 The School of Information and Electrical Engineering, China Agricultural University, Beijing 100107, China
3 The School of Intelligent Manufacturing, Jiangxi Science and Technology Normal University, Road of Xuefu, Nanchang 330013, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5736; https://doi.org/10.3390/s25185736
Submission received: 30 July 2025 / Revised: 6 September 2025 / Accepted: 12 September 2025 / Published: 14 September 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

Small-vessel detection in Synthetic Aperture Radar (SAR) imagery constitutes a critical capability for maritime surveillance systems. However, prevailing methodologies such as sea-clutter statistical models and deep learning-based detectors face three fundamental limitations: weak target scattering signatures, complex sea clutter interference, and computational inefficiency. These challenges create inherent trade-offs between noise suppression and feature preservation while hindering high-resolution representation learning. To address these constraints, we propose the Multi-Cue Efficient Maritime Detector (MCEM), an anchor-free framework integrating three synergistic components: a Feature Extraction Module (FEM) with scale-adaptive convolutions for enhanced signature representation; a Feature Fusion Module (F2M) decoupling target-background ambiguities; and a Detection Head Module (DHM) optimizing the accuracy-efficiency balance. Comprehensive evaluations demonstrate MCEM’s state-of-the-art performance: it achieves 45.1% AP_S on HRSID (+2.3 pp over YOLOv8) and 77.7% AP_L on SSDD (+13.9 pp over the same baseline), two of the most challenging high-clutter SAR benchmarks. The framework enables robust maritime surveillance in complex oceanic conditions, particularly excelling in small target detection amidst high clutter.

1. Introduction

Safeguarding the marine environment is a critical priority for coastal nations worldwide, where timely detection of ship activities on the ocean surface serves as a fundamental component [1,2,3]. This necessity drives the adoption of SAR technology. SAR’s capability for continuous all-weather and day-night ocean observation enables accurate ship localization, establishing this approach as a vital research domain [4,5,6,7,8]. Within this framework, the operational advantages of SAR are exploited through statistical modeling of sea clutter characteristics, leveraging detectors grounded in Constant False Alarm Rate principles [9,10,11]. By synergizing spatial distribution characteristics (e.g., kernel density estimation) with intensity statistics, this approach effectively mitigates false detections from marine clutter, thereby enhancing system reliability in practical maritime surveillance applications [12].
In recent years, SAR image ship detection has made significant progress while overcoming numerous formidable challenges. These advances are inseparable from the continuous evolution of technical approaches: early traditional methods based on sea clutter statistics (such as CFAR) laid the foundation [9,11,13], while current mainstream deep learning methods demonstrate stronger feature learning capabilities [14,15,16,17,18]. Although generally effective, these methods exhibit critical limitations in small-vessel detection under high sea clutter (>40% missed detections) and cross-satellite deployment scenarios [19].
SAR ship detection, particularly for small vessels (<32 × 32 pixels), presents significant challenges due to inherent limitations in conventional methodologies, hindering adaptability and robustness across diverse maritime scenarios. These limitations are prominently observed in two key approaches: Traditional Constant False Alarm Rate detectors exhibit poor adaptability and high false alarm rates under extremely low signal-to-noise ratios, often failing to detect small targets [20,21], while anchor-based deep learning models (e.g., Faster R-CNN, RetinaNet) face fundamental issues including mismatches between predefined anchor boxes and small target sizes (significantly reducing positive samples [22]), insufficient resolution in deep feature maps causing semantic information degradation [23], shallow features lacking adequate semantic support [24], and multi-scale fusion struggling to accurately capture small targets [25]. Consequently, both classes of detectors suffer from suboptimal performance for critical small vessel detection tasks. In summary, SAR ship detection tasks face three core challenges: small target feature degradation [20,26], complex noise interference [27,28], and real-time processing resource constraints [29,30,31].
To address these limitations, we propose the Multi-Cue Efficient Maritime Detector (MCEM)—an innovative anchor-free framework inspired by prior work [32,33,34], which consists of a feature extraction module (FEM), a feature fusion module (F2M), and a detection head module (DHM), as shown in Figure 1. The detection pipeline initiates with the Feature Extraction Module (FEM) performing efficient feature capture from raw input data to preserve critical small vessel signatures. These representations then advance to the F2M, which strategically integrates multi-scale global-local contexts using SPDConv [35] and RCS-OSA [36] to resolve information degradation. Finally, the optimized features undergo detection refinement in the Detection Head Module (DHM), where fully shared convolutional layers conduct comprehensive feature interactions, directly generating accurate detection outcomes for small vessels.
Specifically, FEM incorporates two synergistic submodules to address critical challenges in SAR ship detection. The Scale-aware Refinement (SR) component utilizes SPDConv [35] as an advanced convolutional operator replacing traditional pooling layers, combined with RCS-OSA’s feature cascading mechanism [36], to prevent detail loss and semantic deficiencies in small vessel representation. Complementing this, the Adaptive Image Feature Integration (AIFI) submodule [37] employs multi-head attention with residual learning to dynamically process variable-scale inputs, significantly enhancing detection robustness across diverse maritime scenarios through adaptive feature transformation and fusion. F2M utilizes SPDConv, RCS-OSA, and CPCA attention to create a feature fusion mechanism that combines features from different levels, avoiding the separation of shallow details and deep semantics and suppressing sea clutter noise interference. DHM introduces a fully shared convolution strategy, enabling comprehensive convolution enhancement interactions. It also integrates GroupNorm to improve the detection head’s performance in localization and classification, avoiding redundant anchor box calculations, classification/localization task conflicts, and training instability. The uniqueness of the MCEM framework lies in the synergistic interaction between its three core modules, which systematically address key challenges in small target detection using SAR.
Our experiments on the HRSID [38] and SSDD [39] public datasets show that MCEM systematically addresses the limitations of traditional anchor-free methods (such as FCOS and CenterNet) [40] in small object detection through global-local feature fusion and fully shared convolutions. Our contributions can be summarized as follows: (1) Development of a lightweight anchor-free detector (MCEM) specifically engineered to overcome SAR small vessel detection challenges: feature degradation in low-resolution targets, noise robustness in complex maritime environments, and computational efficiency for real-time deployment. (2) Three innovative and efficient modular components are proposed: FEM, F2M, and DHM. These collectively overcome critical SAR small vessel detection bottlenecks through coordinated processing: FEM maintains target integrity during feature extraction, F2M enables cross-layer semantic fusion while suppressing noise interference, and DHM optimizes detection parameters for efficiency and stability. (3) Benchmark-leading accuracy: MCEM achieves state-of-the-art performance across maritime SAR benchmarks, delivering 45.1% AP_S on HRSID, one of the most challenging maritime SAR datasets, surpassing the prior SOTA by 2.3 percentage points, while attaining 77.7% AP_L on SSDD with a 17.8 percentage point advantage over YOLOv11, all accomplished alongside 40% lower model complexity than conventional detectors and with no loss of detection precision.

2. Related Works

This section contextualizes our contribution through a tripartite analysis: the historical trajectory and inherent bottlenecks of SAR maritime target detectors, algorithmic advances in small-target feature representation, and detection head optimizations for resource-constrained scenarios, in each case critically examining the limitations resolved by our framework.

2.1. SAR Maritime Target Detection

Traditional maritime detection predominantly relied on statistical models like Constant False Alarm Rate (CFAR) detectors [41,42], which model sea clutter distributions but suffer critical limitations including high false alarm rates exceeding 35% in complex sea states and over 40% missed detections for sub-100 px targets [21]. While advancing maritime detection, existing methods fail to resolve low-SCR feature degradation in SAR small-vessel scenarios. This gap is addressed by MCEM’s clutter-invariant learning. The evolution to deep learning introduced anchor-based detectors such as Faster R-CNN that improved generalization but created new bottlenecks: predefined anchor boxes caused a 60% reduction in positive samples for small ships [22], deep feature downsampling induced semantic degradation [23], and computational overheads exceeded 200 ms inference latency [43]. Subsequent anchor-free approaches like FCOS [40] eliminated anchor mismatches but still struggled with feature degradation and real-time constraints. Our work bridges this gap by proposing an anchor-free framework specifically optimized for SAR small-vessel detection, resolving feature degradation while achieving real-time performance under 50 ms.

2.2. Feature Enhancement for Small Targets

Effective small-target detection requires advanced feature enhancement and fusion strategies. Hierarchical fusion mechanisms form the foundation, with Feature Pyramid Networks (FPNs) [44] pioneering multi-scale integration. Subsequent innovations include PANet’s bottom-up augmentation [45], BiFPN’s learnable cross-scale connections [46], and AugFPN’s semantic consistency improvements [47]. Concurrently, feature enhancement techniques evolved through spatial/channel attention mechanisms [48,49] that sharpen salient regions, dilated convolutions expanding receptive fields [50], and vision transformers capturing global context [51]. Existing methods fail to preserve high-frequency details critical for sub-32 px vessel discrimination. This deficiency is resolved by our F2M’s spatial-semantic equilibrium. However, these methods exhibit limitations in SAR-specific noise handling and computational efficiency. Our approach introduces a dedicated Feature Enhancement Module integrating SPDConv for spatial preservation and RCS-OSA for efficient aggregation, specifically addressing SAR small-vessel characteristics.

2.3. Detection Head Optimization

Detection head architectures critically impact small-target recognition accuracy and efficiency. Task-alignment mechanisms represent significant advances, with TOOD [52] pioneering joint classification-localization optimization through interactive learning, though its parameter-heavy design limits applicability to low-resolution targets. Normalization techniques also evolved, where GroupNorm [53] stabilized small-batch training but lacked optimization for SAR noise patterns. Existing methods remain constrained by the accuracy-efficiency tradeoff in real-time maritime surveillance. Our DHM overcomes this barrier through fully-shared convolution, achieving 2.6 ms latency. Recent lightweight heads prioritized speed at the cost of accuracy. Our Detection Head Module innovates through fully shared convolutions for efficient multi-scale interaction, GroupNorm-enhanced noise suppression, and streamlined task-aligned structuring, synergistically optimizing classification and regression for maritime targets.

3. Proposed Method

During training, the model processes annotated SAR ship datasets (e.g., HRSID and SSDD) with 512 × 512 input images $I \in \mathbb{R}^{3 \times 512 \times 512}$ and corresponding bounding box annotations $B = \{(x_c, y_c, w, h, s)_k\}_{k=1}^{K}$ specifying center coordinates, width, height, and class labels. For inference, the network directly processes arbitrary-sized SAR inputs $I_{\text{test}} \in \mathbb{R}^{3 \times H \times W}$ with automatic padding and scaling for normalization, ultimately outputting predicted target sets $\hat{B} = \{(\hat{x}_c, \hat{y}_c, \hat{w}, \hat{h}, \hat{s})_m\}_{m=1}^{M}$, where $\hat{s} \in [0, 1]$ denotes the detection confidence score.
To address core challenges of small-target feature degradation, maritime clutter interference, and edge-computation constraints in SAR ship detection, we propose the MCEM, an end-to-end anchor-free framework formalized as follows:
$$ Y = \mathrm{DHM}\big(\mathrm{F^2}(\mathrm{FEM}(I))\big) $$
where the FEM processes input I through AIFI for position-aware encoding ($X \oplus P$) followed by SR for spatial reorganization, collaboratively preserving small-vessel details; the $\mathrm{F^2}(F)$ stage (F2M) integrates multi-level features via three parallel pathways; and the $\mathrm{DHM}(F)$ employs deformable convolutions and task-alignment mechanisms to generate final detections $Y \in \mathbb{R}^{M \times 5}$. The optimization objective combines localization and classification losses:
$$ \mathcal{L} = 2.0\,\mathcal{L}_{\mathrm{GIoU}} + 1.0\,\mathcal{L}_{\mathrm{Focal}} + 1.5\,\mathcal{L}_{\mathrm{Varifocal}} $$
where $\mathcal{L}_{\mathrm{GIoU}}$ handles rotated box regression, $\mathcal{L}_{\mathrm{Focal}}$ addresses class imbalance, and $\mathcal{L}_{\mathrm{Varifocal}}$ calibrates confidence for small targets. Critical hyperparameters include the following: AdamW optimizer ($lr = 10^{-4}$, $\beta = (0.9, 0.999)$, weight decay 0.05), SPDConv scale factor $S = 2$, and NMS threshold $\tau = 0.01$.
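For concreteness, the weighted objective and the optimizer settings listed above can be written in a few lines of PyTorch. This is a minimal sketch: the individual loss terms are passed in as precomputed tensors because their exact GIoU/Focal/Varifocal implementations are not reproduced here, and the function names are our own placeholders.

```python
import torch

def combined_loss(l_giou: torch.Tensor,
                  l_focal: torch.Tensor,
                  l_varifocal: torch.Tensor) -> torch.Tensor:
    # L = 2.0 * L_GIoU + 1.0 * L_Focal + 1.5 * L_Varifocal (weights from the text)
    return 2.0 * l_giou + 1.0 * l_focal + 1.5 * l_varifocal

def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # AdamW with lr = 1e-4, betas = (0.9, 0.999) and weight decay 0.05,
    # matching the hyperparameters reported above.
    return torch.optim.AdamW(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999), weight_decay=0.05)
```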

3.1. Feature Extraction Module (FEM)

In real-world maritime scenarios, small ship targets are often swamped by complex background clutter. To address this critical challenge, we design the FEM specifically for clutter suppression and small-target feature enhancement, implementing a dual-submodule architecture as shown in Figure 2. Formally, let the input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ represent SAR image batches, where B denotes the batch size, C the channel depth, and H, W the spatial dimensions of the feature maps, respectively.

3.1.1. AIFI Sub-Module

To address significant scale variations in maritime SAR targets, where ship sizes range from sub-pixel clusters to hundreds of pixels introducing severe feature representation inconsistency, we develop the AIFI module inspired by RT-DETR [37]. This position-aware transformation employs geometric-sensitive encoding to dynamically adapt to resolution fluctuations while preserving feature topology, thus ensuring dimensional stability in output representations which significantly boosts multi-scale detection efficiency across diverse maritime scenarios. The forward propagation is defined as follows:
$$ Y = X + \mathrm{Dropout}\Big(W_{\mathrm{out}}\,\mathrm{GELU}\big(W_{\mathrm{in}}(X \oplus P)\big)\Big) $$
where $X \in \mathbb{R}^{B \times C \times H \times W}$ denotes the input feature tensor with batch size B, channel depth C, and spatial dimensions $H \times W$; $P \in \mathbb{R}^{C \times H \times W}$ represents the 2D sinusoidal positional encoding generated by the Build-2D-Sincos method; $W_{\mathrm{in}} \in \mathbb{R}^{C \times d}$ and $W_{\mathrm{out}} \in \mathbb{R}^{d \times C}$ are projection weight matrices with latent dimension d; $\oplus$ indicates broadcasted element-wise addition; $\odot$ denotes the Hadamard product; $\mathrm{GELU}(\cdot)$ is the Gaussian Error Linear Unit activation; and $\mathrm{Dropout}(\cdot)$ applies regularization through random feature abandonment.
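A minimal PyTorch sketch of this forward pass is shown below. It implements the simplified reading Y = X + Dropout(W_out GELU(W_in(X ⊕ P))) with a Build-2D-Sincos positional encoding; the multi-head attention branch mentioned above is omitted, and the class and function names are placeholders rather than the authors' released code.

```python
import torch
import torch.nn as nn

def build_2d_sincos_pos_embed(c: int, h: int, w: int,
                              temperature: float = 10000.0) -> torch.Tensor:
    """2D sinusoidal positional encoding of shape (C, H, W); C must be divisible by 4."""
    assert c % 4 == 0, "channel depth must be divisible by 4 for 2D sin-cos encoding"
    gy, gx = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dim = c // 4
    omega = 1.0 / temperature ** (torch.arange(dim, dtype=torch.float32) / dim)
    out_x = gx.flatten()[:, None] * omega[None, :]   # (H*W, dim)
    out_y = gy.flatten()[:, None] * omega[None, :]
    pos = torch.cat([out_x.sin(), out_x.cos(), out_y.sin(), out_y.cos()], dim=1)
    return pos.t().reshape(c, h, w)

class AIFIBlock(nn.Module):
    """Position-aware residual MLP: Y = X + Dropout(W_out GELU(W_in(X ⊕ P)))."""
    def __init__(self, channels: int, hidden: int, p_drop: float = 0.1):
        super().__init__()
        self.w_in = nn.Linear(channels, hidden)    # W_in in R^{C x d}
        self.w_out = nn.Linear(hidden, channels)   # W_out in R^{d x C}
        self.act = nn.GELU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, C, H, W)
        b, c, h, w = x.shape
        pos = build_2d_sincos_pos_embed(c, h, w).to(x.device)  # P: (C, H, W)
        tokens = (x + pos).flatten(2).transpose(1, 2)          # X ⊕ P, then (B, H*W, C)
        y = self.drop(self.w_out(self.act(self.w_in(tokens))))
        return x + y.transpose(1, 2).reshape(b, c, h, w)       # residual connection
```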

3.1.2. SR Sub-Module

Unlike optical imagery, SAR’s unique bird’s-eye perspective captures expansive maritime areas with complex background clutter, where significant speckle noise and environmental interference often obscure critical ship targets, substantially degrading detection accuracy. To counteract information loss inherent in standard convolutional operations, particularly detail attenuation from strided convolutions and pooling layers, we implement SPD-Conv [35], which replaces destructive downsampling with non-destructive spatial-to-depth transformation. While SPD-Conv effectively preserves small target features, its computational profile requires optimization for real-time maritime surveillance.
By integrating SPD-Conv with the RCS-OSA module [36] into the unified SR sub-module (Figure 3), we achieve synergistic performance: RCS-OSA’s multi-scale fusion capability compensates for SPD-Conv’s computational overhead while further enhancing feature discrimination, yielding significant accuracy improvements without compromising inference speed.
The SPD-Conv transformation is as follows:
$$ Y_{\mathrm{SPD}} = \mathrm{Conv}\big(\mathrm{SPD}(X, S)\big) $$
where $\mathrm{SPD}(\cdot)$ rearranges spatial blocks of size $S \times S$ into channel dimensions, and $\mathrm{Conv}$ applies convolution without spatial reduction.
The RCS-OSA operation combines multi-scale processing:
$$ Y_{\mathrm{RCS\text{-}OSA}} = \mathrm{OSA}\big(\mathrm{RCS}_1(X), \ldots, \mathrm{RCS}_n(X)\big) $$
using n parallel recurrent convolution paths with concatenation and one-shot aggregation.
The unified SR processing is as follows:
$$ Y_{\mathrm{SR}} = \mathrm{RCS\text{-}OSA}\big(\mathrm{SPD\text{-}Conv}(X)\big) $$
producing enhanced features optimized for small vessel detection.
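The space-to-depth transformation above admits a compact implementation; the sketch below uses the scale factor S = 2 listed in the hyperparameters, wraps the stride-1 convolution in a BatchNorm/SiLU block as an assumption, and leaves out the RCS-OSA stage that completes the SR sub-module.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution (after Sunkara & Luo [35]).

    Each S x S spatial block is rearranged into channels, so downsampling discards
    no pixels; the stride-1 convolution then mixes the expanded channels.
    """
    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * scale * scale, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale
        b, c, h, w = x.shape
        # (B, C, H, W) -> (B, C*S*S, H/S, W/S): spatial blocks moved into channels
        x = x.reshape(b, c, h // s, s, w // s, s)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(b, c * s * s, h // s, w // s)
        return self.conv(x)
```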

3.2. Feature Fusion Module (F2M)

SAR ship detection faces a fundamental trade-off: shallow convolutional layers capture high-resolution spatial details crucial for small target localization but suffer from semantic ambiguity and noise sensitivity, while deeper layers provide robust semantic representations at the cost of significantly degraded spatial resolution. To resolve this persistent challenge and enable robust small ship detection in complex SAR environments, we propose the F2M. This architecture innovatively integrates two complementary submodules through parallel processing pathways: The RCS-OSA component [36] enhances computational efficiency via structural reparameterization and channel optimization, achieving 50% faster inference than conventional 3 × 3 convolutions. Simultaneously, the SPDConv operator [35] preserves critical spatial information through non-destructive space-to-depth conversion, significantly enhancing detail retention for sub-pixel targets. F2M processes features through three synergistic pathways:
$$ F_1 = \mathrm{RCS\text{-}OSA}\big(\mathrm{Concat}(\mathrm{SR}(X),\ \mathrm{AIFI}(X))\big) $$
$$ F_2 = \mathrm{RCS\text{-}OSA}\big(\mathrm{Concat}(\mathrm{RCS\text{-}OSA}(X),\ \mathrm{SPDConv}(X))\big) $$
$$ F_3 = \mathrm{RCS\text{-}OSA}\big(\mathrm{Concat}(\mathrm{AIFI}(X),\ \mathrm{SPDConv}(X))\big) $$
where the input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ is processed by five key components: $\mathrm{SPDConv}(X)$ extracts shallow spatial features preserving high-frequency details, $\mathrm{AIFI}(X)$ generates intermediate representations with spatial adaptability, its bilinear-upsampled variant enhances resolution, $\mathrm{SR}(X)$ provides small-target-enhanced features, and $\mathrm{RCS\text{-}OSA}(X)$ produces deep semantic features through recurrent convolution.
The fused outputs F 1 , F 2 , F 3 collectively achieve multi-scale feature integration by combining SPDConv-optimized spatial acuity with RCS-OSA-enhanced semantic richness, yielding three critical advantages: significant noise suppression through cross-pathway consistency, computational efficiency via parameter sharing, and enhanced generalization by mitigating feature co-adaptation effects.
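The wiring of the three pathways can be sketched as follows. All operators are injected as modules because their internals are described elsewhere in this section; fuse1 through fuse3 stand for the outer RCS-OSA blocks applied after concatenation, and spatial alignment of the branches (e.g., the bilinear upsampling of AIFI features) is assumed to happen inside the injected modules. None of these names correspond to released code.

```python
import torch
import torch.nn as nn

class F2MSketch(nn.Module):
    """Schematic wiring of the three F2M fusion pathways (F1, F2, F3)."""
    def __init__(self, sr: nn.Module, aifi: nn.Module, rcs_osa: nn.Module,
                 spd_conv: nn.Module, fuse1: nn.Module, fuse2: nn.Module,
                 fuse3: nn.Module):
        super().__init__()
        # branch feature producers
        self.sr, self.aifi, self.rcs_osa, self.spd_conv = sr, aifi, rcs_osa, spd_conv
        # outer RCS-OSA blocks, one per pathway, applied after concatenation
        self.fuse1, self.fuse2, self.fuse3 = fuse1, fuse2, fuse3

    def forward(self, x: torch.Tensor):
        f1 = self.fuse1(torch.cat([self.sr(x), self.aifi(x)], dim=1))            # F1
        f2 = self.fuse2(torch.cat([self.rcs_osa(x), self.spd_conv(x)], dim=1))   # F2
        f3 = self.fuse3(torch.cat([self.aifi(x), self.spd_conv(x)], dim=1))      # F3
        return f1, f2, f3
```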

3.3. Detection Head Module (DHM)

SAR’s unique side-looking imaging geometry fundamentally differs from optical sensing modalities, introducing heightened speckle noise and azimuth ambiguities that manifest as persistent background interference. This noise sensitivity, compounded by conventional detection heads’ architectural limitations, including parametric redundancy, computational inefficiency, and limited discriminative capability, severely impedes maritime target recognition. Such systems typically exhibit compromised detection fidelity characterized by elevated false alarms and missed detections.
To simultaneously enhance recognition accuracy and processing efficiency, we propose the DHM (Figure 4) with three integrated innovations. Building upon FCOS’s normalization principles [40], we first integrate GroupNorm layers to stabilize feature distributions against noise-induced perturbations, substantially improving localization precision. Second, a fully parameter-shared convolutional backbone achieves significant complexity reduction while maintaining operational flexibility for constrained hardware environments, enhanced by adaptive scaling transformations that dynamically resolve scale variances across maritime targets. Third, extending TOOD’s task-alignment framework [52], we establish a synergistic dual-path architecture employing deformable convolutions (DCNv2) for geometric refinement in positioning branches, attention-guided feature selection for classification pathways, and continuously interacting feature extractors that bridge both tasks through cross-pathway distillation. This co-design approach yields superior noise resilience while sustaining efficient inference performance. The computational flow is formalized as follows:
$$ F_{rc} = \mathrm{TaskExtractor}\big(\mathrm{GN}(X)\big) $$
$$ (F_r, F_c) = \mathrm{Decompose}(F_{rc}) $$
$$ \mathrm{bbox} = \Phi_{\mathrm{reg}}\big(\mathrm{DCNv2}(F_r; \Delta, M)\big) $$
$$ \mathrm{cls} = \Phi_{\mathrm{cls}}\big(\mathrm{AttSel}(F_c, F_{rc})\big) $$
where the input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ is processed to generate $F_{rc} \in \mathbb{R}^{B \times C_f \times H \times W}$, representing joint task-interactive features; deformable convolution parameters $\Delta \in \mathbb{R}^{B \times 18 \times H \times W}$ (offsets) and $M \in \mathbb{R}^{B \times 9 \times H \times W}$ (modulation masks) provide geometric adaptation; $\Phi_{\mathrm{reg}}$ denotes the regression head with adaptive scaling; and $\mathrm{AttSel}$ implements attention-based feature selection for classification refinement.
Final predictions combine outputs across n detection layers:
$$ Y = \bigcup_{i=1}^{n} \big(\mathrm{Scale}_i(\mathrm{bbox}_i) \times \mathrm{cls}_i\big) $$
providing optimized ship localization and classification for resource-limited platforms.
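The parameter-sharing idea behind the DHM is illustrated by the simplified head below: one GroupNorm-conv stack is reused on every pyramid level, and a learnable per-level Scale rescales the box branch. The deformable convolution (DCNv2) and attention-based feature selection described above are replaced by plain convolutions purely to keep the sketch self-contained, and the channel count is assumed divisible by the GroupNorm group number.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar for the regression branch."""
    def __init__(self, init: float = 1.0):
        super().__init__()
        self.s = nn.Parameter(torch.tensor(init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.s

class SharedHeadSketch(nn.Module):
    """Simplified fully shared detection head with GroupNorm."""
    def __init__(self, in_ch: int, num_classes: int = 1,
                 num_levels: int = 3, groups: int = 16):
        super().__init__()
        self.shared = nn.Sequential(                       # reused on every level
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.GroupNorm(groups, in_ch),
            nn.SiLU(inplace=True),
        )
        self.reg_head = nn.Conv2d(in_ch, 4, 3, padding=1)            # box offsets
        self.cls_head = nn.Conv2d(in_ch, num_classes, 3, padding=1)  # class logits
        self.scales = nn.ModuleList([Scale() for _ in range(num_levels)])

    def forward(self, feats):                              # feats: list of (B, C, Hi, Wi)
        boxes, logits = [], []
        for f, scale in zip(feats, self.scales):
            t = self.shared(f)                             # identical weights per level
            boxes.append(scale(self.reg_head(t)))
            logits.append(self.cls_head(t))
        return boxes, logits
```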

4. Experimental Results

This section presents a comprehensive performance evaluation of the proposed MCEM framework. We commence with detailed descriptions of the benchmark datasets, hardware configuration, and quantitative evaluation metrics. Subsequently, a comparative analysis against leading state-of-the-art detectors is conducted to objectively quantify performance advantages. Finally, rigorous ablation studies validate the individual contribution of each component module to the overall framework efficacy.

4.1. Datasets and Experimental Setup

Two authoritative SAR maritime datasets were rigorously employed for comprehensive performance validation: The SAR Ship Detection Dataset (SSDD) [39] and High-Resolution SAR Image Dataset (HRSID) [38]. These complementary benchmarks enable robust evaluation across diverse operational scenarios, addressing critical challenges in maritime surveillance.
SSDD, curated in 2017 from RadarSat-2, TerraSAR-X, and Sentinel-1 acquisitions, encompasses 1160 complex near-shore scenes capturing Yantai and Visakhapatnam coastal zones. With full-polarimetric (HH/HV/VV/VH) data containing 2456 annotated vessels across resolutions spanning 1–15 m, this dataset presents significant background clutter challenges within ∼10 km swath widths. Imagery ranging from 217 × 214 to 526 × 646 pixels was standardized to 512 × 512 resolution and partitioned into 928 training and 232 testing samples, ensuring consistent evaluation protocols.
HRSID, compiled in 2020 from Sentinel-1B, TerraSAR-X, and TanDEM observations, provides 5600 high-resolution scenes (800 × 800 pixels) of Houston and São Paulo maritime regions. Featuring HH/VV/HV polarizations at 0.5 m, 1 m, and 3 m ground sampling distances within ∼4 km swaths, this benchmark contains 16,951 vessel instances exhibiting pronounced small-target characteristics, with 98% of targets occupying less than or equal to 0.12% of image area and median dimensions of 32 × 24 pixels. The standardized 8:2 partitioning yields 3642 training and 1962 testing images, presenting formidable small-target detection challenges in varied maritime environments.
Figure 5 statistically characterizes critical annotation distributions: category distributions highlight multi-scale targets in SSDD versus small-vessel predominance in HRSID; bounding box analysis confirms median target areas of 768 square pixels for HRSID (range: 64 to 90,000 square pixels) compared to 1728 square pixels for SSDD (range: 144 to 147,000 square pixels); centroid dispersion mapping reveals coastal clustering; aspect ratio histograms quantify prevalent 3:1 to 5:1 elongated morphologies. These statistical profiles substantiate both datasets’ operational relevance, particularly establishing HRSID as a rigorous benchmark for high-resolution SAR small-target detection.
The computational infrastructure integrated an NVIDIA GeForce RTX 3060 GPU featuring 12 GB GDDR6 memory and 3584 CUDA cores, delivering 12.8 TFLOPS theoretical peak performance for accelerated deep learning operations. This GPU platform operated in conjunction with an Intel® Core™ i5-11600K CPU at 4.9 GHz turbo frequency (6 cores/12 threads), providing essential computational capacity for high-resolution SAR imagery processing. All comparative models underwent standardized training on SSDD with 150 epochs at batch size 4, implementing mosaic augmentation cutoff at epoch 140, learning rate factor 0.01, and weight decay 0.0005. The software environment employed Python 3.10 and PyTorch 1.11.0 frameworks accelerated through CUDA® 11.6 and cuDNN™ 8.8.0 libraries, optimizing tensor computations via parallel processing architectures.
Critical advantages emerge from this configuration for maritime SAR detection research: The 12 GB memory capacity inherent to the RTX 3060 enables full 1024 × 1024 SAR image batch processing without tiling, preserving contextual information vital for small-vessel detection. Concurrently, mixed-precision training implemented through CUDA 11.6 reduces memory requirements by approximately 40% while maintaining numerical stability during backpropagation. Furthermore, native support for deformable convolutions within PyTorch 1.11 facilitates efficient implementation of geometric adaptation mechanisms in the Detection Head Module. Benchmark validation confirmed 2.1× faster convergence relative to previous-generation hardware configurations (RTX 2080 with CUDA 10.2), demonstrating practical feasibility for large-scale SAR experimentation. All experiments executed under Windows 11 (64-bit) environment leveraged hardware-accelerated DirectML extensions to maximize throughput efficiency.
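For reproducibility, the training setup described above can be summarized as a plain settings dictionary; the key names are illustrative rather than options of any specific training framework.

```python
# Hypothetical settings mirroring the reported configuration; key names are illustrative.
TRAIN_CONFIG = {
    "dataset": "SSDD",
    "epochs": 150,
    "batch_size": 4,
    "close_mosaic_epoch": 140,   # mosaic augmentation disabled from this epoch
    "lr_factor": 0.01,
    "weight_decay": 0.0005,
    "image_size": 512,
    "device": "cuda:0",          # RTX 3060, 12 GB, CUDA 11.6 / cuDNN 8.8.0
}
```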

4.2. Evaluation Metrics

Model performance was rigorously quantified using six complementary metrics aligned with SAR detection benchmarks, each addressing distinct aspects of detection efficacy in complex maritime environments:
Precision (P) [54] evaluates detection reliability by quantifying the proportion of correctly identified ship targets among all positive predictions, defined as follows:
$$ P = \frac{|T_{\mathrm{TP}}|}{|T_{\mathrm{TP}}| + |T_{\mathrm{FP}}|} $$
where $T_{\mathrm{TP}}$ denotes true positives (correctly classified ships with IoU $\geq 0.5$) and $T_{\mathrm{FP}}$ represents false positives (background clutter or land structures misclassified as vessels). This metric critically assesses a detector’s resistance to false alarms in cluttered near-shore scenarios, with higher precision indicating superior specificity.
Recall (R) [54] measures detection completeness by calculating the fraction of actual ships successfully identified:
$$ R = \frac{|T_{\mathrm{TP}}|}{|T_{\mathrm{TP}}| + |T_{\mathrm{FN}}|} $$
where $T_{\mathrm{FN}}$ indicates false negatives (undetected ships, particularly challenging for sub-20 × 20 pixel targets). Recall directly reflects a model’s capability to minimize missed detections in open-sea scenarios, serving as a critical metric for maritime safety applications.
The $F_1$ score [55] integrates precision and recall through their harmonic mean:
$$ F_1 = \frac{2\,P\,R}{P + R} $$
This balanced metric addresses SAR detection’s inherent class imbalance (typically > 95 % background pixels), providing a singular performance indicator robust to varying ship densities across maritime scenes.
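The three counting-based metrics reduce to a few lines of Python; the guards against empty denominators are an implementation convenience, not part of the definitions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from detection counts (TP requires IoU >= 0.5)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```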
Spatial localization accuracy was evaluated via rotated Intersection over Union (IoU) [56], accounting for ship orientation variations:
$$ \mathrm{IoU} = \frac{A(B_p \cap B_{gt})}{A(B_p \cup B_{gt})} $$
where $B_p$ and $B_{gt}$ represent the predicted and ground-truth oriented bounding boxes, with $A(\cdot)$ computing polygon area using the shoelace formula. Detections were validated at an IoU threshold of $\tau = 0.5$, with higher IoU values indicating more precise ship localization essential for maritime navigation safety.
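One convenient way to evaluate this rotated IoU is to treat both oriented boxes as polygons, for example with Shapely, whose polygon area is the signed-area (shoelace) computation referenced above. This is an illustrative evaluation route under that assumption, not necessarily the authors' implementation.

```python
from shapely.geometry import Polygon

def rotated_iou(corners_pred, corners_gt) -> float:
    """IoU of two oriented boxes, each given as four (x, y) corner tuples."""
    p, g = Polygon(corners_pred), Polygon(corners_gt)
    if not p.is_valid or not g.is_valid:
        return 0.0
    inter = p.intersection(g).area          # area of the overlap polygon
    union = p.area + g.area - inter
    return inter / union if union > 0 else 0.0
```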
Average Precision (AP) [57] integrates precision-recall characteristics across all confidence thresholds:
$$ \mathrm{AP} = \int_{0}^{1} p(r)\, dr $$
where $p(r)$ denotes precision at recall level r, computed from the continuous precision-recall curve. This metric quantifies detection stability across operational conditions, with higher AP values indicating consistent performance under varying confidence thresholds, which is critical for real-world deployment where detection confidence varies.
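In practice the integral is approximated from a finite set of precision-recall points with the monotone interpolation used in VOC/COCO-style evaluation; the sketch below assumes recall values sorted in increasing order.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve, AP = integral of p(r) dr."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # enforce a monotonically non-increasing precision envelope
    p = np.maximum.accumulate(p[::-1])[::-1]
    # integrate over the recall points where the curve changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```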
For comprehensive benchmarking, the COCO evaluation protocol was implemented (Table 1), extending the analysis along three critical dimensions: multi-threshold evaluation quantifies localization robustness via AP computed across IoU ∈ [0.50:0.05:0.95]; scale-specific metrics provide granular performance insights through AP_S, AP_M, and AP_L, which respectively assess small, medium, and large target detection capabilities; and strict localization criteria establish rigorous alignment requirements via AP_75 evaluation demanding high-precision bounding box registration. This integrated framework delivers nuanced insights into detector capabilities, particularly for small-vessel detection, where AP_S serves as the decisive metric for near-shore surveillance applications.

4.3. Comparison with SOTA Methods

Benchmark methods were rigorously selected to represent the full spectrum of contemporary SAR detection paradigms: two-stage detectors exemplified by Faster-RCNN [42], which pioneers region proposal networks for high-precision localization; one-stage detectors including RetinaNet [61] with its focal loss addressing class imbalance challenges and SSD [62] utilizing multi-scale feature maps for efficient detection; lightweight architectures such as MobileNet [63] employing depthwise separable convolutions for computational efficiency; and the evolutionary YOLO series featuring YOLOv7 [64] with extended architectural scaling, YOLOv8 [65] implementing anchor-free detection, YOLOv10 [66] introducing enhanced model compression techniques for edge deployment, and YOLOv11 [67] incorporating advanced multi-scale feature fusion modules. This diverse selection spans accuracy-focused, efficiency-optimized, and balanced architectures, specifically incorporating the latest YOLO innovations with their distinct technical advancements to provide comprehensive benchmarking against prevailing SAR detection methodologies across varied operational contexts.
On the SSDD benchmark, MCEM achieves state-of-the-art performance with 70.2% AP, establishing a significant 2.2 percentage point improvement over YOLOv8 (68.0%) while surpassing YOLOv10 (67.0%) and YOLOv11 (67.9%). This multi-scale superiority is systematically validated through critical metrics in Table 2: MCEM attains 84.7% AP_75, outperforming YOLOv8 by 3.2 percentage points and YOLOv11 by 1.9 percentage points, which validates enhanced boundary regression capability. The framework achieves 67.3% AP_S, exceeding YOLOv8 by 1.0 point and YOLOv10 by 2.1 points, confirming breakthrough performance on sub-32 px vessels. It demonstrates 74.8% AP_M, dominating YOLOv8 by 2.5 points and YOLOv11 by 2.8 points, reflecting consistent mid-range detection. Notably, MCEM establishes 77.7% AP_L, representing a 13.9 percentage point lead over YOLOv8 and 15.93 points over RetinaNet, resolving critical limitations in oversized target recognition. These advancements collectively originate from MCEM’s novel multi-cue fusion architecture, which integrates three core innovations: hierarchical feature enhancement overcoming low-contrast signatures, adaptive clutter suppression mitigating background interference, and computational optimization maintaining real-time processing. Figure 6 visually corroborates this advantage through precise localization of morphologically ambiguous targets in complex near-shore environments, where comparative methods exhibit substantial errors due to intense clutter interference.
On the HRSID benchmark, MCEM demonstrates exceptional cross-domain robustness with 60.0% AP, establishing a significant 2.3 percentage point improvement over YOLOv8 at 57.7% AP while surpassing all YOLO variants, including YOLOv11 at 58.0%. This comprehensive superiority extends to critical detection metrics as documented in Table 3: MCEM achieves 45.1% AP_S, exceeding YOLOv8 by 2.3 points and YOLOv10 by 1.3 points, confirming breakthrough small-target detection capability. Its 67.5% AP_75 outperforms Faster-RCNN by 2.2 points and YOLOv8 by 2.5 points, validating enhanced localization precision. The framework attains 79.2% AP_M, dominating YOLOv8 by 1.9 points and YOLOv11 by 1.5 points, demonstrating mid-range robustness. Notably, MCEM establishes 71.8% AP_L with an 8.6-point lead over RetinaNet and an 8.0-point advantage over YOLOv8, resolving critical large-scale recognition limitations. These results collectively confirm the architectural resilience of MCEM when adapting from optical-like SSDD to full-polarimetric HRSID imaging conditions, representing the first method to break the 60% AP barrier on this challenging benchmark. The framework maintains balanced performance across all target scales, reducing the performance gap between small (AP_S) and large (AP_L) targets to 26.7 percentage points, a 43% reduction in scale sensitivity versus MobileNet. Figure 7 shows the visualization results on the HRSID dataset. This breakthrough validates robust cross-modal capability for maritime surveillance in heterogeneous SAR environments, particularly excelling in high-clutter scenarios where traditional detectors exhibit significant performance degradation for sub-32 px targets.

4.4. Ablation Experiments

Comprehensive ablation studies conducted on the SSDD dataset evaluate module efficacy using the YOLOv8 baseline. Experimental configurations presented in Table 4 indicate module activation with checkmarks and deactivation with crosses. Case 1 exclusively integrates the Feature Enhancement Module, Case 2 incorporates solely the Feature Fusion Module, while Case 3 activates only the Detection Head Module. Combined configurations include Case 4 featuring Feature Enhancement and Feature Fusion Modules, with Case 5 completing the full integration.
Experimental analysis reveals distinct module characteristics. The Feature Enhancement Module in Case 1 elevates localization precision with AP_75 at 84.2%, achieving a 3.9 percentage point improvement over the baseline 80.3% while attaining an AP_M of 75.8% for medium targets. This configuration reduces precision to 94.2% versus the baseline 95.2%. Case 2 demonstrates the Feature Fusion Module’s balanced enhancement, increasing precision to 95.9% and boosting mAP_50:95 to 72.0%, a 3.2 percentage point gain. Case 3 shows the Detection Head Module’s effectiveness for small targets, achieving an AP_S of 67.5%, a 2.1-point improvement.
Module integration generates synergistic effects. Case 4 attains peak AP_50 performance at 98.8%. Full integration in Case 5 delivers comprehensive advancements, reaching an mAP_50:95 of 73.5% and an AP_L of 77.7%, the latter representing an 11.1 percentage point improvement over the baseline. This configuration demonstrates enhanced robustness across target scales, exceeding isolated module implementations particularly in multi-scale detection, where the AP_L improvement reaches 11.1 points.
Three architectural insights emerge: the Feature Enhancement Module requires complementary components to mitigate precision limitations; the Feature Fusion Module provides consistent efficiency gains across metrics; and the Detection Head Module optimizes small-target detection. Hierarchical integration yields synergistic performance surpassing individual contributions, validating the framework’s efficacy on SSDD benchmarks as documented in Table 4 and Table 5.
To demonstrate the accelerated convergence characteristics of our approach, Figure 8 presents comparative training dynamics curves with epochs on the x-axis and performance metrics on the y-axis. MCEM achieves competitive detection accuracy significantly earlier in the training process compared to conventional detectors, exhibiting rapid performance saturation during initial training stages. This accelerated optimization capability substantially reduces computational resource requirements while maintaining state-of-the-art performance benchmarks across maritime detection scenarios.
To rigorously quantify classification performance, we employ confusion matrices that systematically delineate prediction outcomes across vessel and background classes. As a fundamental diagnostic tool in pattern recognition, the confusion matrix quantitatively summarizes true positive (TP), true negative (TN), false positive (FP), and false negative (FN) classifications. This framework enables granular analysis of model errors beyond aggregate accuracy metrics, revealing critical insights into class-specific confusion patterns. These insights are particularly valuable for identifying systematic misclassifications between morphologically similar maritime targets and background structures. Figure 9 presents comparative confusion matrices for the baseline and MCEM frameworks. Quantitative analysis demonstrates MCEM’s enhanced discrimination capability, achieving a true positive rate (TPR) of 0.98 compared to the baseline’s 0.93. This improvement signifies a marked reduction in missed vessel detections under challenging maritime conditions. Concurrently, MCEM demonstrates substantially lower false alarm rates compared to the baseline, particularly in mitigating coastal clutter misclassifications that commonly plague SAR ship detection systems. These metrics collectively validate MCEM’s superior ability to resolve critical ship-background confusion prevalent in near-shore SAR imagery.

5. Conclusions

This study proposed a novel Multi-Cue Efficient Maritime Detector framework to address the critical challenge of small vessel detection in large-scale complex SAR imagery. The MCEM architecture integrates three specialized modules: a Feature Extraction Module incorporating Scale-aware Refinement for small-target feature enhancement and Adaptive Image Feature Integration for position-aware encoding, collectively suppressing background clutter interference; a Feature Fusion Module leveraging SPDConv and RCS-OSA structures to balance positional and semantic information during multi-scale fusion; and a Detection Head Module employing full-shared convolution to optimize small-target detection while maintaining computational efficiency. Extensive experiments demonstrated MCEM’s superiority over state-of-the-art methods in accuracy and robustness. Visual validation on SSDD confirmed exceptional alignment with ground truth and significant performance gains in complex near-shore scenarios. HRSID evaluations further substantiated its generalization capability across diverse maritime environments. Ablation studies established each module’s distinct contribution to overall performance. While exhibiting state-of-the-art performance, the framework may experience detection fidelity degradation under extreme sea conditions where clutter patterns exhibit vessel-like scattering characteristics. This fundamental challenge, inherent to SAR small-target detection, warrants consideration when deploying in high-wind maritime environments. Future work will explore adaptive clutter suppression mechanisms to address this limitation while concurrently prioritizing lightweight design strategies to enhance operational efficiency without compromising detection performance, particularly for real-time deployment on edge platforms in surveillance systems. Additionally, we will investigate multi-satellite data fusion techniques to improve cross-platform generalization and develop dynamic scene adaptation algorithms for varying maritime conditions.

Author Contributions

Conceptualization, H.C. and M.H.; methodology, H.C. and Z.Y.; validation, H.C.; formal analysis, H.C., M.H., Z.Y., and L.G.; writing—original draft preparation, H.C., M.H., Z.Y., and L.G.; funding acquisition, Z.Y. and L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62261026), the Outstanding Youth Project of Jiangxi Natural Science Foundation (20232ACB212006), the Education Department Foundation of Jiangxi Province (No. GJJ2201330), and the Oracle Information Processing Key Laboratory Open Project of Henan Province (OIP2024H003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Crisp, D.J. The State-of-the-Art in Ship Detection in Synthetic Aperture RADAR Imagery; Report No. DRDC-TM-2005-243; Department of Defence, DSTO: Canberra, ACT, Australia, 2004. [Google Scholar]
  2. Zhang, T.; Ji, J.; Li, X.; Yu, W.; Xiong, H. Ship detection from PolSAR imagery using the complete polarimetric covariance difference matrix. IEEE Trans. Geosci. Remote Sens. 2019, 57, 2824–2839. [Google Scholar] [CrossRef]
  3. Xu, W.; Guo, Z.; Huang, P.; Tan, W.; Gao, Z. Towards Efficient SAR Ship Detection: Multi-Level Feature Fusion and Lightweight Network Design. Remote Sens. 2025, 17, 2588. [Google Scholar] [CrossRef]
  4. Zhang, T.; Wang, W.; Yang, Z.; Yin, J.; Yang, J. Ship detection from PolSAR imagery using the hybrid polarimetric covariance matrix. IEEE Geosci. Remote Sens. Lett. 2021, 18, 1575–1579. [Google Scholar] [CrossRef]
  5. Zhang, T.; Yang, Z.; Gan, H.; Xiang, D.; Zhu, S.; Yang, J. PolSAR ship detection using the joint polarimetric information. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8225–8241. [Google Scholar] [CrossRef]
  6. Tian, Z.; Wang, W.; Zhou, K.; Song, X.; Shen, Y.; Liu, S. Weighted Pseudo-Labels and Bounding Boxes for Semisupervised SAR Target Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5193–5203. [Google Scholar] [CrossRef]
  7. Wang, J.; Quan, S.; Xing, S.; Li, Y.; Wu, H.; Meng, W. PSO-based fine polarimetric decomposition for ship scattering characterization. ISPRS J. Photogramm. Remote Sens. 2025, 220, 18–31. [Google Scholar] [CrossRef]
  8. Xie, N.; Zhang, T.; Zhang, L.; Chen, J.; Wei, F.; Yu, W. VLF-SAR: A Novel Vision-Language Framework for Few-shot SAR Target Recognition. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 9530–9544. [Google Scholar] [CrossRef]
  9. Robey, F.C.; Fuhrmann, D.R.; Kelly, E.J.; Nitzberg, R. A CFAR adaptive matched filter detector. IEEE Trans. Aerosp. Electron. Syst. 1992, 28, 208–216. [Google Scholar] [CrossRef]
  10. Wackerman, C.C.; Friedman, K.S.; Pichel, W.G.; Clemente-Colón, P.; Li, X. Automatic detection of ships in RADARSAT-1 SAR imagery. Can. J. Remote Sens. 2001, 27, 568–577. [Google Scholar] [CrossRef]
  11. Liu, T.; Yang, Z.; Yang, J.; Gao, G. CFAR ship detection methods using compact polarimetric SAR in a K-Wishart distribution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 12, 3737–3745. [Google Scholar] [CrossRef]
  12. Chen, J.; Niu, L.; Zhang, J.; Si, J.; Qian, C.; Zhang, L. Amodal instance segmentation via prior-guided expansion. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 313–321. [Google Scholar]
  13. Huang, B.; Zhang, T.; Quan, S.; Wang, W.; Guo, W.; Zhang, Z. Scattering Enhancement and Feature Fusion Network for Aircraft Detection in SAR Images. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 1936–1950. [Google Scholar] [CrossRef]
  14. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28, pp. 91–99. [Google Scholar]
  16. Chen, S.; Wang, H.; Xu, F.; Jin, Y.-Q. Target classification using the deep convolutional networks for SAR images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4806–4817. [Google Scholar] [CrossRef]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  19. Chen, J.; Yan, J.; Fang, Y.; Niu, L. Webly supervised fine-grained classification by integrally tackling noises and subtle differences. IEEE Trans. Image Process. 2025, 34, 2641–2653. [Google Scholar] [CrossRef] [PubMed]
  20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  21. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship Detection in Large-Scale SAR Images Via Spatial Shuffle-Group Enhance Attention. IEEE Trans. Geosci. Remote Sens. 2021, 59, 379–391. [Google Scholar] [CrossRef]
  22. Zhu, M.; Hu, G.; Li, S.; Zhou, H.; Wang, S.; Feng, Z. A Novel Anchor-Free Method Based on FCOS + ATSS for Ship Detection in SAR Images. Remote Sens. 2022, 14, 2034. [Google Scholar] [CrossRef]
  23. Chen, Y.; Zhu, X.; Li, Y.; Wei, Y.; Ye, L. Enhanced Semantic Feature Pyramid Network for Small Object Detection. Signal Process. Image Commun. 2023, 113, 116919. [Google Scholar] [CrossRef]
  24. Wang, Z.; Wang, C.; Pei, J.; Huang, Y.; Zhang, Y.; Yang, H. A Deformable Convolution Neural Network for SAR ATR. In Proceedings of the IGARSS 2020—IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2639–2642. [Google Scholar]
  25. Chen, P.; Zhou, H.; Li, Y.; Liu, P.; Liu, B. A Novel Deep Learning Network with Deformable Convolution and Attention Mechanisms for Complex Scenes Ship Detection in SAR Images. Remote Sens. 2023, 15, 2589. [Google Scholar]
  26. Panda, S.L.; Sahoo, U.K.; Maiti, S.; Sasmal, P. An Attention U-Net-Based Improved Clutter Suppression in GPR Images. IEEE Trans. Instrum. Meas. 2024, 73, 1–11. [Google Scholar] [CrossRef]
  27. Li, Z.; He, H.; Zhou, T.; Zhang, Q.; Han, X.; You, Y. Dual CG-IG Distribution Model for Sea Clutter and Its Parameter Correction Method. J. Syst. Eng. Electron. 2025, 1–11. [Google Scholar] [CrossRef]
  28. Li, M.; Lin, S.; Huang, X. SAR Ship Detection Based on Enhanced Attention Mechanism. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Computer Engineering (ICAICE), Hangzhou, China, 26–28 November 2021; pp. 759–762. [Google Scholar]
  29. Li, Y.; Liu, J.; Li, X.; Zhang, X.; Wu, Z.; Han, B. A Lightweight Network for Ship Detection in SAR Images Based on Edge Feature Aware and Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3782–3796. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Cai, W.; Guo, J.; Kong, H.; Huang, Y.; Ding, X. Lightweight SAR Ship Detection via Pearson Correlation and Nonlocal Distillation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  31. Liu, M.; Xu, J.; Zhou, Y. Real-time processing on airborne platforms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 123–134. [Google Scholar]
  32. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  33. Kim, K.-H.; Hong, S.; Roh, B.; Cheon, Y.; Park, M. PVANET: Deep but lightweight neural networks for real-time object detection. arXiv 2016, arXiv:1608.08021. [Google Scholar]
  34. He, F.; Wang, C.; Guo, B. SSGY: A Lightweight Neural Network Method for SAR Ship Detection. Remote Sens. 2025, 17, 2868. [Google Scholar] [CrossRef]
  35. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Machine Learning and Knowledge Discovery in Databases; Amini, M.-R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Springer: Cham, Switzerland, 2023; pp. 443–459. [Google Scholar]
  36. Kang, M.; Ting, C.-M.; Ting, F.F.; Phan, R.C.-W. RCS-YOLO: A Fast and High-Accuracy Object Detector for Brain Tumor Detection. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 600–610. [Google Scholar]
  37. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. arXiv 2024, arXiv:2304.08069. [Google Scholar]
  38. Wei, S.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A high-resolution SAR images dataset for ship detection and instance segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
  39. Zhang, T.; Zhang, X.; Li, J.; Xu, X.; Wang, B.; Zhan, X.; Xu, Y.; Ke, X.; Zeng, T.; Su, H.; et al. SAR Ship Detection Dataset (SSDD): Official Release and Comprehensive Data Analysis. Remote Sens. 2021, 13, 3690. [Google Scholar] [CrossRef]
  40. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  41. Meng, S.; Ren, K.; Lu, D.; Gu, G.; Chen, Q.; Lu, G. A Novel Ship CFAR Detection Algorithm Based on Adaptive Parameter Enhancement and Wake-Aided Detection in SAR Images. Infrared Phys. Technol. 2018, 89, 263–270. [Google Scholar] [CrossRef]
  42. Li, Y.; Zhang, S.; Wang, W.-Q. A lightweight faster R-CNN for ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2019, 19, 1–5. [Google Scholar] [CrossRef]
  43. Xu, P.; Li, Q.; Zhang, B.; Wu, F.; Zhao, K.; Du, X.; Yang, C.; Zhong, R. On-Board Real-Time Ship Detection in HISEA-1 SAR Images Based on CFAR and Lightweight Deep Learning. Remote Sens. 2021, 13, 1995. [Google Scholar] [CrossRef]
  44. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  45. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  46. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  47. Gao, J.; Geng, X.; Zhang, Y.; Wang, R.; Shao, K. Augmented Weighted Bidirectional Feature Pyramid Network for Marine Object Detection. Expert Syst. Appl. 2024, 237, 121688. [Google Scholar] [CrossRef]
  48. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  49. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  50. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851. [Google Scholar]
  51. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  52. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  53. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  54. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  55. Li, L.; Ma, H.; Zhang, X.; Zhao, X.; Lv, M.; Jia, Z. Synthetic Aperture Radar Image Change Detection Based on Principal Component Analysis and Two-Level Clustering. Remote Sens. 2024, 16, 1861. [Google Scholar] [CrossRef]
  56. Qian, Y.; Yan, S.; Lukežič, A.; Kristan, M.; Kämäräinen, J.-K.; Matas, J. DAL: A Deep Depth-Aware Long-term Tracker. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 7825–7832. [Google Scholar]
  57. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  58. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision–ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  59. Zhang, W.; Wang, S.; Thachan, S.; Chen, J.; Qian, Y. Deconv R-CNN for Small Object Detection on Remote Sensing Images. In Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Valencia, Spain, 22–27 July 2018; pp. 2483–2486. [Google Scholar]
  60. Liu, L.; Pan, Z.; Lei, B. Learning a Rotation Invariant Detector with Rotatable Bounding Box. arXiv 2017, arXiv:1711.09405. [Google Scholar]
  61. Li, Q.; Xiao, D.; Shi, F. A decoupled head and coordinate attention detection method for ship targets in SAR images. IEEE Access 2022, 10, 128562–128578. [Google Scholar] [CrossRef]
  62. Yue, T.; Zhang, Y.; Liu, P.; Xu, Y.; Yu, C. A generating-anchor network for small ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7665–7676. [Google Scholar] [CrossRef]
  63. Hao, Y.; Zhang, Y. A lightweight convolutional neural network for ship target detection in SAR images. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 1882–1898. [Google Scholar] [CrossRef]
  64. Zhou, L.; Wan, Z.; Zhao, S.; Han, H.; Liu, Y. BFEA: A SAR ship detection model based on attention mechanism and multiscale feature fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11163–11177. [Google Scholar] [CrossRef]
  65. Zhao, L.; Ning, F.; Xi, Y.; Liang, G.; He, Z.; Zhang, Y. MSFA-YOLO: A multi-scale SAR ship detection algorithm based on fused attention. IEEE Access 2024, 12, 24554–24568. [Google Scholar]
  66. Wang, Z.; Miao, F.; Huang, Y.; Lu, Z.; Ohtsuki, T.; Gui, G. Object Detection on SAR Images Via YOLOv10 and Integrated ACmix Attention Mechanism. In Proceedings of the 2024 6th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), Nanjing, China, 6–8 December 2024; pp. 756–760. [Google Scholar]
  67. Bakirci, M.; Bayraktar, I. Assessment of YOLO11 for Ship Detection in SAR Imagery Under Open Ocean and Coastal Challenges. In Proceedings of the 2024 21st International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 23–25 October 2024; pp. 1–6. [Google Scholar]
Figure 1. Architecture of the Multi-Cue Efficient Maritime Detector (MCEM) processing pipeline: The Feature Extraction Module (FEM) performs adaptive spatial modeling on input SAR imagery (single image illustrated); the Feature Fusion Module (F2M) integrates multi-scale features through hierarchical pathways; the Detection Head Module (DHM) employs three specialized branches for bounding box regression, class prediction, and confidence estimation, generating final detection outputs.
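The pipeline in Figure 1 can be read as a straightforward three-stage composition. The sketch below is purely illustrative, assuming PyTorch-style modules; the names `MCEM`, `fem`, `f2m`, and `dhm` are placeholders and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class MCEM(nn.Module):
    """Illustrative composition of the MCEM pipeline in Figure 1.

    FEM, F2M, and DHM are passed in as placeholder modules; their
    internals are not reproduced here.
    """

    def __init__(self, fem: nn.Module, f2m: nn.Module, dhm: nn.Module):
        super().__init__()
        self.fem = fem  # Feature Extraction Module: scale-adaptive backbone
        self.f2m = f2m  # Feature Fusion Module: hierarchical multi-scale fusion
        self.dhm = dhm  # Detection Head Module: box / class / confidence branches

    def forward(self, sar_image: torch.Tensor):
        # 1) Extract multi-level features from the input SAR image.
        features = self.fem(sar_image)
        # 2) Fuse them into multi-scale outputs (y1, y2, y3 in Figure 2).
        fused = self.f2m(features)
        # 3) Predict bounding boxes, class scores, and confidences per scale.
        boxes, classes, confidences = self.dhm(fused)
        return boxes, classes, confidences
```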
Figure 2. Computational flow of the FEM and F2M modules. FEM (top): processes the input through sequential CBS blocks (Conv-BN-SiLU), scales features via SR stages (SPDConv transformations), and integrates spatial information with AIFI. F2M (bottom): fuses features through three parallel pathways combining Concat operations, RCS-OSA modules, SPDConv transforms, and upsampling to generate multi-scale outputs (y1, y2, y3). Arrow directions indicate data flow progression.
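For readers unfamiliar with the building blocks named in Figure 2, the sketch below shows what a CBS unit (Conv-BN-SiLU) and an SPDConv-style space-to-depth downsampling step typically look like. This is a minimal sketch assuming standard PyTorch layers; the class names `CBS` and `SPDConvSketch` are illustrative, and the RCS-OSA and AIFI components are not reproduced here.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv-BN-SiLU block as named in Figure 2 (a standard YOLO-style unit)."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

class SPDConvSketch(nn.Module):
    """Space-to-depth followed by a non-strided convolution: halves the
    spatial size without discarding pixels, which is the usual motivation
    for SPDConv-style downsampling (assumes even input height and width)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = CBS(4 * c_in, c_out, k=3, s=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rearrange each 2x2 spatial neighborhood into the channel dimension.
        tl = x[..., ::2, ::2]
        tr = x[..., ::2, 1::2]
        bl = x[..., 1::2, ::2]
        br = x[..., 1::2, 1::2]
        return self.conv(torch.cat([tl, tr, bl, br], dim=1))
```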
Figure 3. Architecture of the proposed sub-module Scale-aware Refinement (SR).
Figure 4. Architecture of the proposed module Detection Head Module (DHM).
Figure 5. Bounding box characteristic analysis for both datasets. The upper-left panel quantifies training set composition across vessel categories; the upper-right panel visualizes bounding box size distribution through width-height density mapping; the lower-left panel maps center-point coordinates normalized to image dimensions; the lower-right panel analyzes object aspect ratios relative to image area.
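The quantities plotted in Figure 5 can be derived directly from the annotation files. The helper below is a minimal sketch assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) in pixel coordinates; the function name `bbox_statistics` is a hypothetical placeholder.

```python
import numpy as np

def bbox_statistics(boxes: np.ndarray, img_w: int, img_h: int):
    """Per-box quantities of the kind visualized in Figure 5:
    normalized widths/heights, normalized center points, and the
    ratio of box area to image area."""
    w = (boxes[:, 2] - boxes[:, 0]) / img_w          # normalized width
    h = (boxes[:, 3] - boxes[:, 1]) / img_h          # normalized height
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 / img_w   # normalized center x
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 / img_h   # normalized center y
    area_ratio = w * h                               # box area / image area
    return w, h, cx, cy, area_ratio

# Example: two annotations in a 256 x 256 SAR chip.
boxes = np.array([[10, 20, 42, 60], [100, 100, 196, 148]], dtype=float)
w, h, cx, cy, area_ratio = bbox_statistics(boxes, img_w=256, img_h=256)
```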
Figure 6. Qualitative detection performance on SSDD dataset under complex near-shore conditions. Color-coded annotations distinguish: true positive detections in blue, ground-truth vessel locations in red, false negative instances indicating missed detections in yellow, and false positive results representing erroneous background classifications in orange. This visualization highlights the detector’s capability for precise multi-scale vessel localization amidst significant maritime clutter.
Figure 7. Visual detection results on the HRSID dataset.
Figure 8. Training curves for the experiments: quantitative metrics (y-axis) plotted against training epochs (x-axis).
Figure 9. Comparison of classification results: each confusion matrix summarizes dataset records by true category versus model-predicted category, with rows representing true values and columns representing predicted values.
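As a small illustration of the convention used in Figure 9 (rows = true, columns = predicted), a confusion matrix can be tallied as follows; the two-class labels and the function name `confusion_matrix` are only for demonstration and do not reflect the paper's evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes: int) -> np.ndarray:
    """Build a confusion matrix with the same convention as Figure 9:
    rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Example with two classes: 0 = ship, 1 = background.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]
print(confusion_matrix(y_true, y_pred, n_classes=2))
# [[3 1]   row 0: true ships predicted as ship / background
#  [1 1]]  row 1: true background predicted as ship / background
```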
Table 1. COCO evaluation metrics specification.
Metric | Definition | Evaluation Focus
AP [58] | Average precision over IoU thresholds [0.50:0.05:0.95] | Comprehensive detection accuracy
AP50 [57] | AP at IoU = 0.50 | Basic localization capability
AP75 [58] | AP at IoU = 0.75 | Precise localization requirement
AP_S [59] | AP for small targets (area < 1024 px²) | Small-target performance (below 32 × 32 px)
AP_M [60] | AP for medium targets (1024 px² ≤ area ≤ 9216 px²) | Medium-target detection (32 × 32 px to 96 × 96 px)
AP_L [60] | AP for large targets (area > 9216 px²) | Large-target capability
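To make the size buckets in Table 1 concrete, the helper below assigns a bounding box to the small/medium/large category using the 1024 px² and 9216 px² thresholds; the function name `coco_size_bucket` is an illustrative placeholder, not part of any evaluation library.

```python
def coco_size_bucket(width_px: float, height_px: float) -> str:
    """Assign a target to the size bucket used for AP_S / AP_M / AP_L
    in Table 1 (thresholds 32*32 = 1024 px^2 and 96*96 = 9216 px^2)."""
    area = width_px * height_px
    if area < 32 ** 2:
        return "small"      # counted in AP_S
    if area <= 96 ** 2:
        return "medium"     # counted in AP_M
    return "large"          # counted in AP_L

# A 20 x 25 px vessel falls in the "small" bucket evaluated by AP_S.
assert coco_size_bucket(20, 25) == "small"
```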
Table 2. Performance comparison of MCEM against state-of-the-art methods on SSDD dataset.
Methods | AP (%) | AP50 (%) | AP75 (%) | AP_S (%) | AP_M (%) | AP_L (%) | P (%) | R (%) | F1 (%) | mAP50 (%)
Faster-RCNN [42] | 65.04 | 96.12 | 77.47 | 61.68 | 71.54 | 66.07 | 93.7 | 91.6 | 92.6 | 96.7
MobileNet [63] | 41.88 | 79.89 | 38.97 | 35.65 | 56.90 | 29.09 | 74.2 | 71.5 | 72.8 | 80.4
SSD [62] | 52.94 | 88.07 | 60.08 | 49.00 | 61.26 | 52.50 | 83.6 | 81.1 | 82.3 | 88.6
RetinaNet [61] | 60.76 | 90.90 | 70.46 | 57.32 | 67.62 | 61.77 | 87.6 | 85.4 | 86.5 | 91.5
YOLOv7 [64] | 56.80 | 89.90 | 64.30 | 55.00 | 63.30 | 32.20 | 86.7 | 84.3 | 85.5 | 90.5
YOLOv8 [65] | 68.00 | 97.20 | 81.50 | 66.30 | 72.30 | 63.80 | 95.2 | 93.4 | 94.3 | 97.5
YOLOv10 [66] | 67.00 | 96.20 | 80.30 | 65.20 | 71.60 | 51.20 | 93.5 | 91.4 | 92.4 | 96.8
YOLOv11 [67] | 67.90 | 96.40 | 82.80 | 66.30 | 72.00 | 59.90 | 93.7 | 91.6 | 92.6 | 97.0
MCEM (Ours) | 70.20 | 96.90 | 84.70 | 67.30 | 74.80 | 77.70 | 96.5 | 94.7 | 95.6 | 98.1
Table 3. Performance comparison of MCEM against state-of-the-art methods on HRSID dataset.
Methods | AP (%) | AP50 (%) | AP75 (%) | AP_S (%) | AP_M (%) | AP_L (%) | P (%) | R (%) | F1 (%) | mAP50 (%)
Faster-RCNN [42] | 56.70 | 81.10 | 65.30 | 43.60 | 74.50 | 46.30 | 80.5 | 69.8 | 74.7 | 83.5
MobileNet [63] | 35.40 | 59.10 | 39.00 | 16.10 | 62.00 | 34.60 | 57.6 | 51.4 | 54.3 | 61.2
SSD [62] | 43.50 | 67.10 | 49.80 | 23.30 | 70.40 | 52.70 | 68.2 | 60.7 | 64.2 | 69.5
RetinaNet [61] | 50.70 | 83.10 | 54.00 | 36.20 | 72.10 | 63.20 | 81.8 | 73.1 | 77.2 | 85.6
YOLOv7 [64] | 54.00 | 85.40 | 58.60 | 40.50 | 73.20 | 70.00 | 84.0 | 75.5 | 79.5 | 87.8
YOLOv8 [65] | 57.70 | 83.20 | 65.00 | 42.80 | 77.30 | 59.30 | 87.2 | 75.9 | 82.4 | 86.5
YOLOv10 [66] | 57.30 | 83.80 | 65.10 | 43.80 | 75.80 | 41.70 | 82.5 | 73.9 | 77.9 | 86.2
YOLOv11 [67] | 58.00 | 83.40 | 64.80 | 43.50 | 77.70 | 33.40 | 82.1 | 73.5 | 77.5 | 85.9
MCEM (Ours) | 60.00 | 85.20 | 67.50 | 45.10 | 79.20 | 71.80 | 90.3 | 79.3 | 84.5 | 88.4
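The F1 values in Tables 2 and 3 follow the usual convention of the harmonic mean of precision and recall; for example, the MCEM precision and recall on SSDD (96.5% and 94.7%) give approximately 95.6%, matching Table 2. A minimal sketch, with the illustrative function name `f1_score`:

```python
def f1_score(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision_pct * recall_pct / (precision_pct + recall_pct)

# MCEM on SSDD (Table 2): P = 96.5 %, R = 94.7 %  ->  F1 ~= 95.6 %
print(round(f1_score(96.5, 94.7), 1))  # 95.6
```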
Table 4. Module configurations and overall results of the ablation experiments.
Case | FEM | F2M | DHM | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | F1 (%) | GFLOPs | Speed (ms)
Base | × | × | × | 95.2 | 93.4 | 97.5 | 68.8 | 94.3 | 6.5 | 3.4
Case 1 | ✓ | × | × | 94.2 | 93.2 | 97.0 | 69.5 | 93.7 | 6.3 | 5.5
Case 2 | × | ✓ | × | 95.9 | 94.2 | 98.3 | 72.0 | 95.0 | 6.2 | 3.0
Case 3 | × | × | ✓ | 96.6 | 94.7 | 98.5 | 72.5 | 95.6 | 6.4 | 2.6
Case 4 | ✓ | ✓ | × | 97.0 | 94.7 | 98.8 | 72.1 | 95.8 | 6.0 | 3.0
Case 5 | ✓ | ✓ | ✓ | 96.5 | 94.7 | 98.1 | 73.5 | 95.6 | 6.1 | 2.9
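The Speed (ms) column in Table 4 reports per-image inference time. The snippet below is a rough sketch of how such a latency figure might be measured in PyTorch; the input shape, warm-up count, and run count are assumptions for illustration and not the authors' benchmarking protocol.

```python
import time
import torch

@torch.no_grad()
def average_latency_ms(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                       warmup: int = 10, runs: int = 100) -> float:
    """Average per-image inference latency in milliseconds (assumed protocol)."""
    device = next(model.parameters()).device
    x = torch.randn(*input_shape, device=device)
    model.eval()
    for _ in range(warmup):            # warm-up iterations are excluded from timing
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()       # make sure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```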
Table 5. COCO metrics for the ablation experiments.
Case | AP (%) | AP50 (%) | AP75 (%) | AP_S (%) | AP_M (%) | AP_L (%)
Base | 68.3 | 96.8 | 80.3 | 65.4 | 73.8 | 66.6
Case 1 | 69.4 | 96.5 | 84.2 | 66.1 | 75.8 | 69.1
Case 2 | 69.7 | 97.2 | 83.4 | 66.5 | 74.6 | 73.8
Case 3 | 70.3 | 97.3 | 84.4 | 67.5 | 74.8 | 76.6
Case 4 | 69.6 | 97.9 | 82.9 | 66.5 | 75.1 | 72.3
Case 5 | 70.2 | 96.9 | 84.7 | 67.3 | 74.8 | 77.7