1. Introduction
Synthetic Aperture Radar (SAR), as an important remote sensing technology, plays a key role in ocean monitoring due to its unique all-weather and all-time surveillance capabilities [
1]. Compared to traditional optical remote sensing technologies, SAR offers significant advantages, such as the ability to penetrate clouds, fog, and other adverse weather conditions [
2], and it is not limited by day–night cycles. This ensures that SAR can continuously provide high-quality ground imaging even in complex environments [
3]. As a result, SAR holds considerable potential for applications in various fields, including ocean monitoring, disaster response, military reconnaissance, and environmental protection.
In maritime surveillance, ship detection in SAR images is of great significance, as it plays a crucial role in understanding maritime traffic and ensuring maritime safety [
4]. However, due to the inherent complexity of SAR images, ship detection remains a challenging task. Traditional SAR ship detection methods often rely on extracting specific features from the images, such as texture, shape, and contextual information, to distinguish ship targets from surrounding clutter, with the Constant False Alarm Rate (CFAR) algorithm being a typical example [
5]. The CFAR algorithm dynamically adjusts the detection threshold based on local clutter statistics to maintain a constant false alarm rate [
6]. However, these traditional detection methods heavily depend on manually designed features, and when faced with challenges such as ship occlusion, overlap, or dense target distributions, their detection performance often falls short and fails to meet practical requirements [
7].
In recent years, the rapid advancement of deep learning technology has introduced new approaches to address these challenges [
8]. Convolutional neural networks, with their robust feature learning and representation capabilities, can autonomously extract complex high-level features from large-scale data [
9], overcoming the limitations of manually designed features. This enables them to outperform traditional methods in handling the complexity of SAR image backgrounds, interference noise, and target ambiguity. As a result, deep learning-based algorithms have become the dominant approach for SAR ship detection, demonstrating exceptional performance in practical applications [
10]. For example, Geng et al. [
11] introduced a CNN framework combining candidate detection and embedded active learning, with a tri-phase mechanism (target localization, bounding box refinement, and selective sample training) to improve accuracy in nearshore scenarios. Ai et al. [
12] developed a Multi-scale Kernel Size Feature Fusion CNN (MKSFF-CNN), which used heterogeneous convolutional kernels (3 × 3 to 7 × 7) for parallel feature extraction and cross-scale channel attention for adaptive weighting. Sun et al. [
13] proposed the NSD-SSD method, which combines dilated convolution with multi-scale feature fusion and utilizes the K-means clustering algorithm for prior box reconstruction, thereby improving the detection accuracy of small targets. Zhu et al. [
14] proposed Dual-Branch YOLO (DB-YOLO), incorporating Cross-Stage Partial (CSP) networks and a Dual-Path Feature Pyramid (DB-FPN) to strengthen spatial-semantic interaction. Yang et al. [
15] proposed a single-stage ship detector incorporating a Coordinate Attention Module (CoAM) for target localization and a Receptive Field Expansion Module (RFIM) for multi-scale context modeling.
Although deep learning-based algorithms have improved ship detection performance through high-precision models, deploying them on resource-limited platforms, such as airborne and spaceborne platforms [
16], remains challenging due to constraints in computation, storage, and real-time processing [
17]. For example, YOLOv8x [
18] achieves a mean average precision (mAP) of 98.49% on the SSDD dataset [
19], 0.12% higher than YOLOv8n. However, its parameter size is 68.23 M, 21.6 times that of YOLOv8n, with an inference cost of approximately 258.55 GFLOPs. Therefore, reducing computational complexity while maintaining detection accuracy remains an important challenge. To address this, researchers have explored various methods, including model pruning, quantization, knowledge distillation, and lightweight network design [
20], to enhance efficiency without compromising performance. Among these methods, lightweight network design has attracted considerable attention due to its ability to optimize network structures for specific detection tasks while maintaining detection performance, making it a key focus in SAR ship detection research. For example, Zhao et al. [
21] proposed Morphological Feature Pyramid Yolo v4-tiny, which employs morphological preprocessing for denoising and edge enhancement, combined with a lightweight feature pyramid to optimize multi-scale detection. Yan et al. [
22] proposed a lightweight SAR ship detection model, LssDet, which integrated the Cross-Sidelobe Attention (CSAT) module for interference suppression, the Lightweight Path Aggregation FPN (L-PAFPN) for efficient feature fusion, and the Focus module for enhanced feature extraction. Zheng et al. [
23] proposed HRLE-SARDet, a lightweight SAR target detection algorithm that reduces computational cost with LSFEBackbone and enhances small target detection using the HRLE-C3 module combining CNN and self-attention. Yang et al. [
24] proposed a lightweight backbone network, IMNet (based on MobileNetv3), combined with a Slim Bidirectional Feature Pyramid Network (Slim-BiFPN) and embedded a Coordinate Attention (CA) mechanism to suppress background noise and enhance multi-level feature fusion. Feng et al. [
25] reconstructed the feature extraction module using a lightweight ViT-based network and incorporated the Faster-WF2 module into the neck to enhance multi-scale feature fusion while balancing detection accuracy and computational cost. Yu et al. [
26] proposed the multi-scale ship detector VS-LSDet, which integrates a Visual Saliency Enhancement Module (VSEM) and a lightweight backbone (GSNet) to highlight targets and reduce computational complexity. Luo et al. [
27] proposed the lightweight model SHIP-YOLO, which reduces the model’s parameter count and computational burden by applying GhostConv instead of standard convolution in the neck of YOLOv8n and incorporating the reparameterized RepGhost bottleneck structure in the C2f module. Zhang et al. [
28] redesigned the feature extraction network DEMNet using the CSMSConv and CSMSC2F modules, effectively optimizing the multi-scale issues of ship targets in SAR images. Meanwhile, the introduction of the DEPAFPN feature fusion module and the EMA attention mechanism further alleviated the computational burden. Hao et al. [
29] improved YOLOX by integrating a reparameterized MobileNetV3 with the CSP structure to construct a compact backbone, aiming to reduce computational cost while maintaining acceptable detection accuracy. Cao et al. [
30] reduced computational complexity by reconstructing the feature selection module while applying dilated convolutions in the multi-scale feature focusing (MFF) module to optimize multi-scale processing. Huo et al. [
31] improved feature representation and reduced computational complexity by designing the lightweight attention-enhanced C3 (LAEC3) module and the attention-guided fusion module (AGFM).
Although existing methods have achieved lightweight improvements to models such as YOLO through structural modifications and attention mechanisms, their deployment on resource-constrained platforms remains limited by high computational complexity and large parameter counts. To address this limitation, we developed an SAR ship detection model based on a three-stage collaborative design, aiming to reduce computational complexity and parameter count without compromising detection accuracy, thereby supporting efficient deployment on edge devices. The main contributions of this paper are as follows:
To enhance the model’s capability of capturing key features and achieving real-time detection, a feature extraction network was constructed using depthwise separable convolution blocks and CBAM modules, with the optimal configuration determined through ablation experiments.
To optimize feature aggregation, the Multi-Scale Coordinated Fusion (MSCF) and Bi-EMA Enhanced Fusion (BiEF) modules were introduced to construct a joint spatial-channel perception framework based on cross-layer feature interactions. This framework enabled the integration of multi-level features from the backbone while maintaining scale consistency and minimizing information loss.
To address the computational redundancy of the C2f module in the feature fusion process, the Efficient Feature Learning (EFL) module was proposed. It reorganized features using a simplified hierarchical structure and incorporated the Efficient Channel Attention (ECA) [
32] mechanism to adaptively adjust feature weights, reducing computational costs and enhancing the detection of small targets.
3. Results
3.1. Dataset
In this experiment, we selected the SAR Ship Detection Dataset (SSDD), the first open-source dataset in the field of SAR ship detection, which holds considerable research value.
Regarding data collection, the dataset includes scenes from nearshore and offshore regions, such as Yantai, China, and Visakhapatnam, India, offering a realistic representation of the complex and varied monitoring environments and making the model training more applicable to real-world scenarios [
19]. SSDD integrates images from various sensors, including RadarSat-2, TerraSAR-X, and Sentinel-1, with resolutions ranging from 1 m to 15 m. The dataset includes multiple polarization modes, such as HH, VV, VH, and HV.
As shown in
Figure 11, we selected a subset of images from the SSDD dataset for visual analysis of the experimental results.
3.2. Experimental Environment and Details
In this study, all experiments were conducted on a Windows 11 operating system, with the computing platform built around an AMD Ryzen 7 7840H processor (3.8 GHz base frequency), 16 GB of memory, and an NVIDIA GeForce RTX 4060 graphics card. The development environment included Python 3.9.19 and Torch 2.4.1, with GPU acceleration provided by CUDA 11.8.
During training, the Adam optimizer was employed with a batch size of 16 and 300 epochs. The initial learning rate was set to 0.001 and was reduced using a cosine annealing schedule to minimize the loss function.
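For reproducibility, the optimization setup described above can be expressed as a minimal PyTorch sketch; the stand-in network and dummy batch below are placeholders for the actual detector and SSDD data loader, which are not shown here.

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in network; the actual detector is the model described in Section 2.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1)

optimizer = Adam(model.parameters(), lr=1e-3)        # initial learning rate 0.001
scheduler = CosineAnnealingLR(optimizer, T_max=300)  # cosine annealing over 300 epochs

for epoch in range(300):
    # One dummy batch per epoch here; in practice, iterate over the SSDD
    # training split with batch_size = 16.
    images = torch.randn(16, 3, 640, 640)
    loss = model(images).mean()   # placeholder for the detection loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()              # update the learning rate once per epoch
```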
To ensure fairness and consistency in the comparative analysis, all experimental results reported in this paper were obtained under the unified experimental conditions described in this section. These conditions include consistent training settings, dataset partitioning, and evaluation metrics across all models.
3.3. Evaluation Criteria
To accurately evaluate the performance of each component in the experiment, we employ four key metrics: recall, precision, mAP50, and F1 score.
Recall measures the model’s ability to detect all the actual targets present, while precision evaluates the proportion of true positive samples among all samples predicted as targets. The F1 score is the harmonic mean of precision and recall. The formulas for each metric are as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
TP: True positives refer to the targets that the model correctly detects.
FN: False negatives are the actual targets that the model fails to detect.
FP: False positives are non-targets that the model incorrectly identifies as targets [42].
mAP: The mean average precision is used to evaluate the average accuracy of detection boxes across different categories, with an IoU threshold of 0.5.
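As an illustration, the three count-based metrics can be computed directly from the TP, FP, and FN totals; the numbers in the example call are hypothetical and serve only to show the calculation.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts at an IoU threshold of 0.5."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 940 correct detections, 12 false alarms, 60 missed ships
p, r, f1 = detection_metrics(tp=940, fp=12, fn=60)
print(f"precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```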
3.4. Ablation Experiment
To comprehensively validate the effectiveness of the proposed method, four ablation experiments were carefully designed. In each experiment, the principle of single-variable control was strictly followed, and targeted adjustments to different application methods were carried out to systematically analyze their impact on model performance.
3.4.1. The Impact of Different Depths on the Backbone Network
This section investigates the impact of different backbone depths on model performance. The backbone consists of five layers, with the depth of the first layer fixed at 1. The depths of the remaining four layers are varied for comparison across the following configurations: [1,2,2,1], [2,2,2,2], [2,3,3,2], [3,3,3,2], [3,3,3,3], [3,4,4,3], and [4,4,4,4].
Table 1 presents the precision (P), recall (R), mean average precision (mAP), and F1 score of each configuration. The model performed optimally when the backbone depth was set to [3,3,3,2], achieving the highest values for all evaluation metrics.
The selection of these configurations was driven by the design of the backbone network (
Section 2.2), which aimed to optimize performance. The configuration [3,3,3,2] achieved the highest performance across multiple metrics, indicating its effectiveness in feature extraction. Variations from this optimal configuration resulted in a decrease in performance, suggesting that changes in depth reduced the model’s ability to effectively extract features. These findings guided the choice of depthwise separable convolution blocks in the backbone network, which were selected based on the best-performing configuration: [3,3,3,2].
3.4.2. Comparison of Different Attention Mechanisms in Backbone Networks
This section investigates the performance of various attention mechanisms integrated with a feature extraction network composed of depthwise separable convolution blocks, as described in
Section 2.2. The evaluated attention modules include ECA [
32], SE [
43], CA [
44], PSA [
45], and CBAM. The comparison results are presented in
Table 2.
Among all variants, the model incorporating CBAM achieved the highest precision, recall, mAP, and F1 score. The SE module attained a similar mAP, with only a 0.05% difference, but it showed a 3.15% lower recall. The ECA module had a 0.25% lower mAP than CBAM, with a precision gap of 0.92% and a recall gap of 5.72%. CA and PSA resulted in more noticeable decreases across all metrics. These results indicate that CBAM was more effective in enhancing the representation capability of the backbone network when combined with depthwise separable convolution blocks.
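For reference, a minimal PyTorch sketch of the standard CBAM formulation (channel attention followed by spatial attention) is shown below; the reduction ratio and spatial kernel size follow the commonly used defaults and may differ from the settings used in our backbone.

```python
import torch
from torch import nn

class CBAM(nn.Module):
    """Standard CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: 7x7 conv over channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

y = CBAM(64)(torch.randn(1, 64, 40, 40))   # attention-refined feature map
```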
3.4.3. Experiment of Fusion Module at Different Backbone Output Points
To examine the role of the proposed fusion modules (MSCF and BiEF), ablation experiments were conducted at three predefined positions, as illustrated in
Figure 1. The baseline (Number 1), with all fusion modules disabled, achieved a precision of 98.99%, recall of 88.91%, mAP of 97.28%, and F1 of 0.94. As shown in
Table 3, in partial configurations (Numbers 2 to 5), the results showed varying performance. For instance, enabling MSCF at Positions 1 and 2 (Number 2) led to a slight decrease in performance across all metrics. This outcome suggests that shallow fusion, when applied alone, may introduce redundant features or fail to provide sufficient semantic reinforcement. In configurations Number 3 to 5, where BiEF was placed at Position 3, either alone or with MSCF at one additional position, recall improved while precision declined, and mAP remained close to the baseline. These observations indicate that the performance of individual modules or their combination without proper structural coordination does not consistently enhance detection accuracy.
In contrast, the full configuration (Number 6), which applies MSCF at Positions 1 and 2 and BiEF at Position 3, resulted in the highest performance across all metrics. This pattern indicates that the two modules contribute differently: MSCF is more effective at capturing spatial-scale information at earlier stages, while BiEF aids in semantic refinement at deeper layers. Their combined use provides complementary effects that are not observed when the modules are used independently.
To further assess whether MSCF and BiEF modules are functionally interchangeable, two additional experiments were conducted: one applying BiEF at all three positions and another applying MSCF at all positions. The configuration using BiEF at all positions achieved a precision of 98.22%, recall of 92.67%, and mAP of 98.14%. The configuration using MSCF at all positions resulted in a precision of 98.13%, recall of 92.33%, and mAP of 98.06%. Compared to the heterogeneous combination in Number 5, both homogeneous configurations showed lower mAP and F1 scores. These results confirm that MSCF and BiEF are not fully interchangeable, and their combination offers complementary benefits.
The findings support the design choice outlined in
Section 2.4, where MSCF is placed at earlier stages to capture spatial and scale information, and BiEF is utilized at the final stage to enhance deep semantic features. The heterogeneous placement consistently outperforms homogeneous alternatives. Furthermore, the ablation results suggest that performance improvements do not stem from the mere addition of fusion modules but from their complementary and coordinated integration across multiple stages, which collectively enhance feature representation and detection accuracy.
3.4.4. Comparison of ECA, EFL, and C2f Modules
The original model uses the C2f module in the neck structure, which improves the information flow and feature utilization between different layers through large residual connections. However, in SAR ship detection, the C2f module shows limited improvements in both detection accuracy and recall. To address this, we propose replacing the C2f module with the EFL module, which combines convolution operations with a channel attention mechanism to improve feature fusion. The ECA module, which employs 3 × 3 and 1 × 1 convolutions, is also included for comparison.
As shown in
Table 4, the EFL module outperforms the C2f module across all evaluation metrics. Specifically, the mAP increases by 0.23%, recall improves by 0.83%, and accuracy rises by 0.45%, indicating that the EFL module is more effective in enhancing detection performance. While the ECA module also shows improvements over the C2f module, its performance is slightly lower than the EFL module, particularly in mAP.
These results show that the EFL module provides better improvements in recall and mAP, making it a more effective choice for SAR ship detection tasks. The ECA module is beneficial, but the EFL module offers more substantial performance gains, demonstrating its more effective role in improving model performance.
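For context, the ECA mechanism referenced above, in its standard form [32], reweights channels through global average pooling followed by a local 1D convolution across channels. The sketch below illustrates only this attention step and does not reproduce the full EFL structure described in Section 2.

```python
import math
import torch
from torch import nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP + 1D conv across channels, no dimensionality reduction."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Kernel size adapted to the channel dimension, following the original ECA-Net rule
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                      # (B, C) channel descriptor
        w = self.conv(w.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
        return x * torch.sigmoid(w)[:, :, None, None]

y = ECA(128)(torch.randn(1, 128, 20, 20))           # channel-reweighted features
```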
3.4.5. Overall Ablation Study
In
Section 3.4.1,
Section 3.4.2,
Section 3.4.3 and
Section 3.4.4, a series of ablation experiments were conducted using a controlled variable approach. Each experiment examined the effect of introducing a single structural module (CBAM, BiEF, MSCF, or EFL) while keeping the other components unchanged. This method allowed an independent evaluation of each module under consistent experimental settings. It was observed that in some of these experiments, the introduction of a single module resulted in a slight decrease in one or more evaluation metrics. Such results suggest that individual components may not always align optimally when applied in isolation and that their integration may introduce structural redundancy or inconsistencies in feature interaction.
To further analyze the overall behavior of the proposed architecture, a comprehensive ablation experiment was conducted. Based on the DSC-Backbone, additional modules were integrated incrementally to form five variants, as listed in
Table 5. Each configuration was evaluated under identical training conditions to examine the effect of progressive module fusion on detection performance. All experimental settings, including dataset, input size, training schedule, and evaluation metrics, were kept consistent across variants. This ensures comparability between different structural combinations and avoids the influence of external variables.
Since the ablation study shown in
Table 5 (Steps 1–5) focused on individual modules and their stepwise integration, a supplementary set of experiments was conducted to examine key combinations of the proposed modules. This additional design extends the analysis by capturing interactions between multiple modules simultaneously, offering a more nuanced understanding of their collective influence within the overall architecture. The combinations tested were DSC + MSCF + BiEF, CBAM + MSCF + BiEF, and MSCF + BiEF + EFL, along with several other relevant groupings not covered in the previous studies. The selected configurations aim to cover the fundamental aspects of module integration while minimizing redundancy. The results, as summarized in
Table 5 (Steps 7–12), provide additional information that complements the earlier findings and supports a more comprehensive evaluation of module fusion strategies.
The introduction of the epoch-based precision, AP, and recall curves (
Figure 12) serves to further illuminate the training dynamics and performance evolution as the modules were incrementally added. These curves offer a clear visualization of how each model variant’s performance improved over time, particularly with respect to precision, average precision (AP), and recall metrics. By tracking these metrics across different epochs, deeper insights can be gained into how the gradual integration of modules enhances the overall model’s detection capabilities and stability during training.
3.5. Comparison with Other Ship Detection Models
Table 6 summarizes the performance of the proposed model and several representative lightweight detectors on the SSDD dataset. Among these, the proposed model achieves the highest accuracy (98.76%) and the second-highest recall (93.70%), with an mAP of 98.35% and an F1 score of 0.96. Compared to YOLOv11n and YOLOv12n, the proposed model shows slightly higher accuracy and mAP while maintaining a comparable F1 score. Specifically, YOLOv11n achieves an accuracy of 98.32%, a recall of 93.15%, and an mAP of 98.22%, while YOLOv12n yields a slightly higher recall (94.01%) but lower accuracy and mAP.
From the results in
Table 6, despite improvements in computational efficiency, the detection performance of PC-Net [
46] and LCFANet-n [
47] is generally lower than that of YOLOv8n, particularly in accuracy and mAP. This suggests that reduced computational complexity does not always correlate with enhanced detection effectiveness.
In terms of computational efficiency, the proposed model has the lowest resource demand among all compared models, with 5.04 G FLOPs and 1.64 M parameters. This is 17.8% and 29.3% lower than PC-Net and 16.4% and 36.4% lower than YOLOv12n in terms of FLOPs and parameter count, respectively. Compared to YOLOv8n, which also shows competitive performance, the proposed model reduces FLOPs and parameters by 47.5% and 48.1%, respectively. Other models such as Faster R-CNN and SSD show higher computational costs or reduced detection capability. For instance, Faster R-CNN has high recall but low accuracy and very large computational demands (370.21 G FLOPs and 139.10 M parameters), making it unsuitable for real-time deployment on resource-constrained platforms. SSD has moderate computational complexity but suffers from low recall, which limits detection coverage. Considering both detection performance and computational cost, the proposed model achieves a favorable balance. Its high accuracy and recall ensure reliable detection, while the low FLOPs and parameter count support deployment in constrained environments such as satellites. This balance enables the precise identification of maritime targets while adhering to the stringent computational, storage, and power constraints imposed by satellite onboard systems. Among the evaluated models (as shown in
Figure 13), only our model and YOLOv8n successfully detected all targets. The proposed model achieves marginally higher accuracy (98.76% compared to 97.15%) and exhibits higher confidence scores among the correctly detected targets. Compared to the latest PC-Net and LCFANet-n, the proposed model achieves higher detection accuracy along with reduced computational complexity and fewer parameters, reflecting an improved balance between performance and efficiency.
3.6. More Visual Detection Results
More inshore and offshore scenarios were detected using the proposed model and YOLOv8n, with the results presented in
Figure 14 and
Figure 15. In these figures, the true targets are highlighted with yellow bounding boxes in the ground truth for comparison.
These images were selected from a subset of the SAR Ship Detection Dataset (SSDD), which contains real-world data with varying noise levels that naturally occur in maritime environments. The data in the SSDD dataset are derived from operational SAR systems and represent a wide range of imaging conditions, including different sea states and sensor configurations. For the analysis in this study, the signal-to-noise ratio (SNR) was computed for each target in the selected images based on the power of the signal and the power of the background noise. The signal power was calculated as the sum of the squared pixel values within the target region, while the background noise power was estimated using the variance of the pixel values in a surrounding background region, which was assumed to contain no targets. The SNR for each target in the image was calculated using the following formula:
SNR = 10 log10(P_signal / P_noise)
where P_signal is the signal power (sum of the squared pixel values in the target region), and P_noise is the noise power (variance of the pixel values in the background region). This approach provides a clear measure of how the model performs in various noise conditions that are typical in SAR imagery [48]. Additionally, for each image, the average SNR across all targets was computed to provide an overall understanding of the image’s noise characteristics. The average SNR was calculated using the following formula:
SNR_avg = (1/n) Σ_{i=1}^{n} SNR_i
where n is the number of targets in the image, and SNR_i is the SNR for each individual target. This average SNR provides a comprehensive measure of the image’s overall noise level.
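A minimal NumPy sketch of this per-target and per-image SNR computation is given below, assuming boolean masks for the target and background regions; the synthetic chip in the usage example is purely illustrative.

```python
import numpy as np

def target_snr_db(image, target_mask, background_mask):
    """Per-target SNR in dB: signal power over background noise power."""
    signal_power = np.sum(image[target_mask].astype(np.float64) ** 2)
    noise_power = np.var(image[background_mask].astype(np.float64))
    return 10.0 * np.log10(signal_power / noise_power)

def average_snr_db(image, target_masks, background_mask):
    """Average SNR over the n targets in one image."""
    snrs = [target_snr_db(image, m, background_mask) for m in target_masks]
    return float(np.mean(snrs))

# Illustrative usage with a synthetic 100x100 chip containing one bright target
img = np.random.rayleigh(scale=1.0, size=(100, 100))
tgt = np.zeros_like(img, dtype=bool)
tgt[45:55, 45:55] = True
img[tgt] += 8.0                      # brighten the target region
print(f"target SNR: {target_snr_db(img, tgt, ~tgt):.1f} dB")
```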
The images selected for this section exhibit inherent variations in the signal-to-noise ratio (SNR), with values spanning from 10 dB to 22 dB. These differences reflect the natural diversity found within the SSDD dataset [
19], which contains a variety of imaging conditions influenced by factors such as sensor type, resolution, and environmental settings.
Figure 14 and
Figure 15 present examples of these variations, where some images exhibit higher background noise while others show lower noise levels, illustrating the dataset’s broad range of noise characteristics.
3.7. Theoretical Analysis of Computational Complexity
To quantitatively evaluate the computational cost of the proposed model, we conduct a theoretical analysis based on two commonly used metrics: the number of parameters (Params), which reflects the storage requirements, and floating-point operations (FLOPs), which estimate the computational complexity during inference. In convolutional neural networks, the primary computational load arises from convolutional layers. The number of parameters and FLOPs for a standard convolution layer are calculated as follows:
Params = K² × Cin × Cout
FLOPs = 2 × K² × Cin × Cout × H × W
where Cin and Cout represent the number of input and output channels, K is the kernel size, and H × W is the resolution of the output feature map. The multiplier of 2 accounts for both multiplication and addition operations, following conventions adopted in YOLO and related works.
While convolutional layers dominate the computational cost, other operations also contribute to the overall complexity. For example, attention modules such as CBAM and ECA introduce additional convolution, pooling, and activation operations. The fully connected layers used in some modules add matrix multiplications and bias terms. Pooling, interpolation, and concatenation operations, though parameter-free, also involve memory and arithmetic operations. Therefore, convolution-based formulas serve as the foundation for complexity estimation, supplemented by empirical analysis using the thop profiling tool.
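As a sanity check of these formulas, the analytic count for a single convolution layer can be compared against the thop profiler mentioned above; the layer sizes below are illustrative, and thop is assumed to report multiply-accumulate operations (roughly half the FLOPs counted with the factor of 2).

```python
import torch
from torch import nn
from thop import profile   # the thop profiling tool referenced in the text

def conv_params_flops(c_in, c_out, k, h_out, w_out):
    """Analytic count for a standard (bias-free) convolution layer."""
    params = k * k * c_in * c_out
    flops = 2 * params * h_out * w_out   # factor 2: one multiplication plus one addition
    return params, flops

# Illustrative layer: 64 -> 128 channels, 3x3 kernel, 80x80 output map
params, flops = conv_params_flops(64, 128, 3, 80, 80)
print(f"analytic: {params / 1e6:.3f} M params, {flops / 1e9:.3f} GFLOPs")

# Empirical cross-check with thop (MACs * 2 gives an approximate FLOP count)
layer = nn.Conv2d(64, 128, 3, padding=1, bias=False)
macs, p = profile(layer, inputs=(torch.randn(1, 64, 80, 80),), verbose=False)
print(f"thop: {p / 1e6:.3f} M params, {2 * macs / 1e9:.3f} GFLOPs")
```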
Under an input resolution of 640 × 640, we evaluated the computational complexity of each component of our network. The results are listed in
Table 7:
Compared to the YOLOv8n model, which contains 3.16 million parameters and 8.86 GFLOPs, the proposed model shows a 48.1% reduction in parameters and a 47.5% reduction in FLOPs. These reductions are associated with the application of lightweight components such as depthwise separable convolutions and simplified attention mechanisms.