Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement

Zhou, Yi; Zhu, Kun; Guo, Haitao; Lu, Jun; Gong, Zhihui; Liu, Xiangyun

doi:10.3390/electronics14234728

Open AccessArticle

Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement

by

Yi Zhou

,

Kun Zhu

^*

,

Haitao Guo

,

Jun Lu

,

Zhihui Gong

and

Xiangyun Liu

Institute of Geospatial Information, Information Engineering University, Zhengzhou 450001, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(23), 4728; https://doi.org/10.3390/electronics14234728

Submission received: 2 November 2025 / Revised: 24 November 2025 / Accepted: 25 November 2025 / Published: 30 November 2025

(This article belongs to the Special Issue Recent Advances in Applications of Machine Learning and Computer Vision)

Download

Browse Figures

Versions Notes

Abstract

Synthetic aperture radar images have all-weather and all-time capabilities and are widely used in the field of ship target surveillance at sea. However, its detection accuracy is often limited by factors such as complex sea conditions, diverse ship scales, and image noise. Aiming at the problems such as inconsistent scale of ship target detection in SAR images, difficulty in detecting small targets, and interference from complex backgrounds, this paper proposes a ship detection method for SAR images based on band recombination and multi-scale feature enhancement. Firstly, aiming at the problem that the single-channel replication mode adopted by the deep neural network cannot fully extract the ship target information in SAR images, a band recombination method was designed to enhance the ship information in the images. Furthermore, the coordinate channel attention and bottleneck Transformer attention mechanisms are introduced in the backbone part of the network to enhance the network’s representation ability of the target spatial distribution and maintain the global feature modeling ability. Finally, a multi-scale feature enhancement and multi-scale effective feature aggregation module was designed to improve the detection accuracy of multi-scale ships in wide-format images. The experimental results on the LS-SSDD and HRSID datasets show that the average accuracies of the method proposed in this paper reach 78.1% and 94.5% respectively, which are improved by 6.9% and 0.8% compared with the baseline model, and are superior to other advanced algorithms, verifying the effectiveness of the method proposed in this paper. Meanwhile, the algorithm proposed in this paper has also demonstrated good performance in wide-format SAR images of actual large scenes. The method proposed in this paper effectively improves the problems of missed detection and false detection of small-target ships in SAR images of large scenes. At the same time, it enhances the efficiency of rapid and accurate detection in large scenes and can provide reliable technical support for the field of maritime target surveillance.

Keywords:

synthetic aperture radar; ship detection; multi-scale feature enhancement; multi-scale effective feature aggregation; band reorganization

1. Introduction

Synthetic Aperture Radar (SAR) has become increasingly vital in the field of intelligent interpretation of remote sensing images due to its all-weather and day-and-night monitoring capabilities [1,2]. Its sensitivity to metallic objects makes ships easily distinguishable from the background in SAR images. Furthermore, the ability to scan large swathes—often hundreds of kilometers wide—provides significant advantages for efficient wide-area maritime monitoring and dynamic tracking. Consequently, ship target detection based on SAR images is widely applied in maritime traffic management, maritime target surveillance, and disaster emergency response [3]. Traditional SAR ship detection methods often rely on statistical models, such as Constant False Alarm Rate (CFAR) algorithms. These methods determine a threshold to distinguish ship targets from the background in SAR images by performing statistical analysis and modeling of the background clutter [4]. To address the characteristics of different SAR images and meet diverse application requirements, researchers have proposed various statistical models, including the Gaussian, Gamma, Normal, and K-distributions [5,6]. These methods form the foundation of model-based target detection. Advancements in statistical theory have further led to sophisticated multi-channel adaptive detectors, such as Generalized Likelihood Ratio Test (GLRT)-based approaches, which are designed to enhance selectivity and performance in multi-channel data [7]. While these algorithms benefit from simple principles and computational efficiency, their detection performance is highly dependent on the accuracy of the clutter statistical model. While these model-based methods are mathematically rigorous and offer well-defined performance guarantees, they face challenges posed by the complex and non-stationary characteristics of real-world SAR backgrounds and the requirement for manual parameter tuning. In scenarios involving complex sea states, varying imaging conditions, or small, blurred targets, they often suffer from insufficient robustness and declining detection accuracy, making it difficult to meet the demands of high-precision marine monitoring.

Recent research has focused on addressing specific challenges, such as complex maritime conditions and multi-scale targets. For example, Bakirci [8] provided a comprehensive overview of advanced ship detection and ocean monitoring using satellite imagery and deep learning, highlighting the integration of multi-source data for marine science applications. Furthermore, the role of polarimetric information is critical; Adil et al. [9] analyzed the polarimetric scattering behavior of vessels at different incidence angles, underscoring the importance of polarization features for improving target-clutter discrimination. To tackle the issue of scale variation and complex backgrounds, models incorporating dynamic convolution and adaptive fusion mechanisms have been proposed, Li [10] demonstrating enhanced robustness. Our work builds upon these advancements by proposing a unified framework that systematically addresses the interrelated challenges of input data representation, multi-scale feature extraction, and global context modeling for SAR ship detection.

In recent years, advances in deep learning have enabled algorithms to autonomously learn key features from massive image datasets without complex modeling processes, demonstrating superior accuracy and robustness in object detection tasks. Two major categories of mature algorithms have emerged: the first is two-stage detection algorithms, which initially extract region proposals and then apply image classification to these candidate regions. Representative models include R-CNN [11], Faster R-CNN [12], and Cascade R-CNN [13]. In contrast, single-stage detection algorithms eliminate the region proposal step, directly predicting target locations from input images based on localization tasks. Typical examples include YOLO [14], SSD [15], and RetinaNet [16]. Furthermore, with the rise of Transformer models, numerous approaches have begun incorporating them into object detection. The Facebook team pioneered DETR (Detection Transformer) [17], an end-to-end object detection framework based on Transformer architecture. Although these classical detection algorithms have achieved remarkable success in natural image processing, their direct application to SAR ship target detection remains challenging due to the unique noise characteristics inherent to SAR images.

To address challenges in SAR ship target detection—including varying target scales, poor small-target detectability, and complex background interference—numerous researchers have refined the aforementioned classical algorithms. From a technical perspective, existing SAR ship detection methods primarily focus on improving the following model categories: CNN-based single-stage detection frameworks are widely adopted for their high efficiency. He et al. [18] reduced missed and false detections of multi-scale targets in complex scenes by optimizing feature extraction and detection head design. Tang et al. [19] enhanced YOLOv7’s ability to learn target scale diversity through their AMMRF module, improving detection accuracy while maintaining a lightweight structure. Liu et al. [20] enabled the FESAR method to better handle multi-scale variations by integrating convolutional enhancement and spatial relationship analysis modules. To address poor adaptability to slender ship shapes and sensitivity in small-target localization, Tang et al. [21] designed adaptive coordinate generation and multi-frequency perception guidance mechanisms, boosting detection accuracy for specially shaped targets. STSAR-YOLOv5 [22], LWM-YOLO [23], and Gao [24] were specifically optimized for small-target detection, complex backgrounds, and large-scale scenarios, respectively. While these methods balance detection efficiency and accuracy, they face limitations such as the loss of small-target features during repeated downsampling in deep networks. Transformer-based detection frameworks have emerged as another major direction due to their powerful global modeling capabilities. Xia et al. [25] proposed CRTransSar, which combines the global context awareness of Transformers with the local feature representation of CNNs, effectively addressing SAR targets’ scattering properties and background interference. Qu et al. [26] enhanced inter-target correlations through mask-guided Transformer encoders. RDB-DINO [27], SFRT-DETR [28], and small-target-enhanced RT-DETR [29] all built upon the Transformer architecture with improvements targeting small object sensitivity and multi-scale focus. These approaches mitigate insufficient global context modeling in complex backgrounds but face challenges such as high computational complexity and limited local detail perception. For wide-swath, low-resolution SAR images, Kang et al. [30] modified the region proposal strategy based on Faster R-CNN to achieve rapid ship detection. Li et al. [31] leveraged multi-polarization information and an improved VGG16 network to enhance classification performance for low-resolution targets. He et al. [32] improved recognition in medium-to-low resolution images through densely connected CNNs. SSD-YOLO [33] effectively suppressed background noise via small-target feature enhancement and self-attention path aggregation. These methods address specific issues of sparse features and low resolution in wide-swath images, but still have room for improvement in balancing small-target sensitivity and robustness to complex backgrounds.

Although the aforementioned methods have optimized SAR ship detection performance to some extent, inherent characteristics of SAR data—such as large-scale coverage, complex background interference, and sparse target distribution—continue to result in relatively low detection accuracy and frequent missed detections of small targets. Specifically, two technical issues persist: First, when processing single-channel SAR images, existing deep learning-based object detection models commonly employ a channel replication strategy to convert inputs into three channels. While this approach formally adapts the data to network architectures, it fails to effectively capture the transform-domain characteristics of SAR images, leading to significant deficiencies in target representation learning due to suboptimal information utilization. This highlights a gap between simple data adaptation and principled, theoretically-grounded input representation learning for SAR data. Second, ship detection in SAR images often suffers from small-target sizes and high missed detection rates. During target extraction, multiple downsampling operations in the network inevitably cause partial pixel information loss, subsequently giving rise to both missed detections and false alarms.

In summary, to address the challenges of single-channel information limitation, background clutter interference, and difficult feature extraction for small targets in SAR images, this study proposes a ship detection method integrating band reorganization and multi-scale enhancement. The proposed approach aims to improve the accuracy of ship target detection in SAR images, thereby providing more effective technical support for dynamic maritime vessel surveillance.

2. Methodology

The overall network architecture, illustrated in Figure 1, is a cohesive framework designed to systematically overcome key limitations of existing deep learning models in SAR ship detection. Rather than a simple aggregation of techniques, each component is purposefully integrated to address a specific shortcoming: (1) The Band Reorganization (BR) module is proposed to overcome the inefficiency of simple channel replication, which fails to extract discriminative ship information and amplifies noise. (2) The Coordinate Attention (CA) mechanism is introduced to enhance the network’s perceptual capability for locating ship targets amidst vast and cluttered backgrounds, a task where standard convolutions are deficient. (3) The Bottleneck Transformer with Triple Attention (BOT3) is incorporated to address the limited global feature modeling capability of traditional CNNs, which is crucial for understanding context in wide-swath SAR imagery. (4) To tackle the fundamental challenge of significant scale variation, two dedicated modules are designed: the Multi-Scale Feature Enhancement (MSFE) module focuses on extracting robust edge and detail features for small targets, while the Multi-Scale Effective Feature Aggregation (MEFA) module ensures effective integration of features across different levels of the network. This interconnected design ensures that each part complements the others, leading to a robust and efficient detection system.

2.1. SAR Image Band Reorganization

Deep learning networks were originally developed for natural images, which are typically three-channel RGB data, whereas SAR image consists of single-channel data [34]. Conventional methods replicate the single channel to construct pseudo-three-channel inputs. While this meets network input requirements, it introduces three issues: repeated noise superposition, feature homogenization, and loss of critical information. To address these limitations, this paper proposes a BR method based on multi-scale decomposition and dynamic fusion, which optimizes input representation through Laplacian pyramid reconstruction (as shown in Figure 2). The mathematical foundation of our band reorganization approach is built upon multi-scale pyramid decomposition with three hierarchical levels (m = 3), where each level captures distinct frequency components of the SAR image. The core procedure comprises the following four steps:

2.1.1. Gaussian Pyramid Hierarchical Noise Suppression

First, iterative Gaussian convolution and downsampling are performed on the original SAR image to generate multi-level low-resolution images (Gaussian pyramid). Higher pyramid levels exhibit progressively reduced resolution, thereby suppressing the influence of background noise. We employ a 3-level pyramid structure (k = 0, 1, 2) with a fixed downsampling ratio of 2:1 between consecutive levels.

G_{k} = downsample (G_{k - 1} \otimes g_{5 \times 5}), g_{5 \times 5} = \frac{1}{256} (\begin{matrix} 1 & 4 & 6 & 4 & 1 \\ 4 & 16 & 24 & 16 & 4 \\ 6 & 24 & 36 & 24 & 6 \\ 4 & 16 & 24 & 16 & 4 \\ 1 & 4 & 6 & 4 & 1 \end{matrix})

(1)

where

G_{k}

denotes the Gaussian pyramid image at the k-th level.

g_{5 \times 5}

is a 5 × 5 Gaussian convolution kernel matrix with standard deviation, and ⊗ denotes the 2D convolution operation between the image matrix and the kernel matrix.

G_{k - 1}

is the image matrix at the

(k - 1)

-th level (higher resolution). The higher the layer, the lower the resolution. High-frequency noise decreases progressively with convolution and downsampling, while the bottom layer image retains global radiation features. The frequency domain response of the Gaussian kernel matches the high-frequency distribution characteristics of the noise in SAR images. By utilizing its low-pass filtering properties, it can effectively attenuate speckle noise, and the downsampling operation further reduces the spatial correlation of the noise.

2.1.2. High-Frequency Detail Extraction via Laplacian Pyramid

Multi-scale high-frequency components are extracted from the Gaussian pyramid via inverse difference reconstruction: The Laplacian pyramid is constructed across three levels (k = 0, 1, 2) using bilinear interpolation for upsampling operations.

L_{k} = G_{k} - (upsample (G_{k + 1}) \otimes g_{5 \times 5})

(2)

In the above formula,

L_{k}

is a matrix representing the high-frequency detail information of the k-th layer.

G_{k + 1}

are the Gaussian pyramid image matrices at the

(k + 1)

-th levels, respectively.

upsample (\cdot)

is the upsampling function that enlarges the image via interpolation. Since the Gaussian pyramid images have already smoothed the original data, low-frequency background noise is effectively suppressed in the high-frequency detail components. The Laplacian pyramid essentially captures multi-scale gradient information of the image. Through the “difference of Gaussians” operation applied across pyramid levels, it precisely isolates salient high-frequency details strongly associated with ship targets while filtering out the low-frequency background interference remaining after Gaussian smoothing.

2.1.3. Goal-Oriented Dynamic Weight Fusion

An adaptive weight allocation mechanism is designed to perform weighted fusion of multi-level high-frequency features: The fusion weights are dynamically computed based on the energy concentration within ship regions, where the ship binary mask

M_{s h i p}

is generated using Otsu’s thresholding method.

D = \sum_{k = 0}^{m} ω_{k} L_{k}, ω_{k} = \frac{{|L_{k} • M_{s h i p}|}_{F}}{\sum_{j = 0}^{m} {|L_{j} • M_{s h i p}|}_{F}}

(3)

where

D

is the fused detail feature map matrix obtained through weighted summation.

ω_{k}

is a scalar weight for the k-th layer Laplacian detail

L_{k}

.

M_{s h i p}

is the ship binary mask matrix generated based on scattering intensity thresholding, with ship regions as 1 and background as 0. • represents element-wise multiplication (Hadamard product).

{|\cdot|}_{F}

denotes the Frobenius norm, which calculates the square root of the sum of squared matrix elements, measuring the “energy” of detail information within ship regions. This weighting mechanism assigns higher weights to high-level details (e.g.,

L_{0}

) in target-dense regions, while enhancing the contribution of low-level details (e.g.,

L_{2}

) in areas with complex backgrounds.

The original image

I_{r a w}

contains rich low-frequency information, providing the overall structure and background context of the scene, which is essential for accurate scene understanding. In contrast, the fused detail features

D

primarily consist of high-frequency information, emphasizing target edges, textures, and other fine details that aid in distinguishing different targets.

A three-channel input is constructed by selecting the original image

G_{0}

, the first-level details

L_{1}

, and the second-level details

L_{2}

: This specific channel selection (

G_{0}

,

L_{2}

,

D

) was empirically determined to achieve an optimal balance among global context, intermediate details, and enhanced high-frequency information.

I_{R G B} = [G_{0}, L_{2}, D]

(4)

where

I_{R G B}

is the constructed three-channel input image tensor.

[\cdot, \cdot, \cdot]

denotes the channel concatenation operation that stacks three single-channel matrices along the channel dimension.

G_{0}

is the original image matrix preserving global information.

L_{2}

is the second-level detail matrix.

2.2. Coordinate Attention Mechanism

Due to the absence of spectral information in SAR images and interference from coherent speckle and cross-shaped artifacts, the geometric representation of targets is highly susceptible to background noise. To enhance the network’s capability to characterize the spatial distribution features of targets, this paper introduces a CA mechanism [35], as illustrated in Figure 3. The CA mechanism is configured with a reduction ratio of r = 16 for channel reduction in the intermediate transformation.

Ship targets often exhibit significant directional elongation (e.g., along the major axis). The Coordinate Attention mechanism inherently captures long-range dependencies by performing 1D global pooling on the feature maps during the information aggregation stage. This operation separately captures long-range dependencies along the vertical and horizontal directions, effectively addressing the limitation of traditional attention mechanisms in modeling features along the elongated structure of ships. By computing features for the c-th channel separately along the vertical and horizontal directions, it locates targets while preserving precise positional information in the other direction, thereby effectively enhancing the accuracy of ship detection in the SAR image.

z_{c}^{h} = \frac{1}{W} P a v g (x)

(5)

z_{c}^{w} = \frac{1}{H} P a v g (x)

(6)

where

z_{c}^{h}

and

z_{c}^{w}

represent the outputs of the c-th channel feature map with vertical dimension H and horizontal span W, respectively, and

P a v g ()

denotes the average pooling layer. The separate pooling operations along the vertical and horizontal directions decouple the spatial encoding of targets. This approach preserves precise positional information in one dimension while compensating for the positional ambiguity caused by standard 2D pooling through cross-directional feature fusion. Consequently, the resulting attention weights can focus more precisely on ship target regions. This design aligns well with the pronounced directional specificity of ships in SAR imagery, enabling more efficient capture of spatial correlations along their major axis.

During the CA computation phase, the feature maps generated along the vertical and horizontal directions are concatenated to fuse multi-dimensional spatial information. This is followed by h-swish activation, convolutional operations, and batch normalization to enhance spatial perception capability for ships, resulting in a new intermediate feature map

X_{H W}

. The intermediate feature transformation uses a 1 × 1 convolution that reduces channels to C/r, followed by batch normalization and h-swish activation.

X_{H W} = C B h_{1 \times 1} (c o n c a t (z_{c}^{h}, z_{c}^{w}))

(7)

where

C B h_{1 \times 1}

denotes a module comprising a

1 \times 1

convolutional layer, batch normalization, and h-swish activation, while

C o n c a t

represents the concatenation of feature maps along the channel dimension.

In the attention weight generation phase, the intermediate feature map is decomposed into vertical and horizontal vectors. These vectors then pass through convolutional layers for channel adjustment and are activated by Sigmoid, yielding attention vectors

X_{h}

and

X_{w}

for the two respective directions. The final attention feature map

Y

, which highlights ship target characteristics, is generated through weighted computation. The final attention application uses element-wise multiplication between the input feature map and the generated attention maps.

Y = C B S_{1 \times 1} (X_{h}) \otimes C B S_{1 \times 1} (X_{w}) \otimes x

(8)

where

C B S_{1 \times 1}

denotes a module consisting of a

1 \times 1

convolutional layer, batch normalization, and Sigmoid activation; ⊗ represents the outer product operator; and

x

refers to the input feature map.

2.3. Bottleneck Transformer

SAR image is characterized by complex textures and non-uniform noise distribution. When processing such high-dimensional nonlinear data, traditional convolutional neural networks often suffer from low computational efficiency and inadequate global feature modeling capabilities. To address these limitations, this paper introduces a Transformer-based BOT3 module (Bottleneck Transformer with Triple Attention) [36], whose schematic is shown in Figure 4. Designed with SAR imaging characteristics in mind, BOT3 incorporates lightweight mechanisms that maintain powerful global feature modeling capabilities while significantly reducing computational costs. The BOT3 module employs 8 attention heads with head dimension d = 64, and uses relative positional encoding with learnable parameters for both horizontal and vertical directions.

The BOT3 module adopts a hybrid CSP-Transformer architecture, whose core integrates a Multi-Head Self-Attention (MHSA) mechanism to enhance feature representation for multi-scale targets. This enables the capture of long-range dependencies between ships and marine backgrounds from multiple subspaces. The workflow is designed to model long-range dependencies between ships and the background, thereby overcoming the limited receptive field of traditional CNNs. It proceeds as follows: input features pass through three

1 \times 1

convolutional layers to generate the Query (

Q

), Key (

K

), and Value (

V

). Simultaneously, a two-dimensional positional embedding, which adapts to the spatial characteristics of SAR imagery and prevents the self-attention mechanism from losing positional information, producing vertical (

R_{h}

) and horizontal (

R_{w}

) encoding vectors. These vectors are then summed to form a position encoding matrix

R

adapted to the 2D features. Matrix operations generate a fused feature map and a content attention map, which are summed and normalized, then multiplied by weights to obtain the output feature map

Z

. This process can be expressed as follows:

Q = C o n {v_{l \times l}}^{Q} (X)

(9)

K = C o n {v_{1 \times 1}}^{K} (X)

(10)

V = C o n {v_{1 \times 1}}^{V} (X)

(11)

R = R_{h} + R_{w}

(12)

where

R_{h} \in R^{H \times d_{m o d e l}}

and

R_{w} \in R^{W \times d_{m o d e l}}

are learnable positional encoding matrices for height and width dimensions, respectively.

E = Q R^{T} + Q K^{T}

(13)

A = S o f t m a x (E)

(14)

Z = V A^{T}

(15)

where

X \in R^{N \times C \times W \times H}

denotes the input feature map, with

N

representing the batch size,

C

the number of channels, and W and

H

the width and height of the feature map, respectively.

E

is the logit matrix that combines attention to position and content.

A

is the attention weight matrix.

Given the relatively limited data volume and computational resources, the Bottleneck Transformer module reduces parameter count by optimizing convolutional and self-attention operations. It also incorporates residual connections to mitigate gradient vanishing in deep networks, thereby enhancing training stability and efficiency. The module adds only one convolutional layer before and after the MHSA for dimension adjustment, enabling efficient processing of SAR images. We employ n = 3 consecutive Bottleneck Transformer blocks in the BOT3 module for an optimal performance–complexity trade-off.

X_{out} = C o n ν_{l \times 1} (x) + M H S A (C o n ν_{l \times 1} (X))

(16)

where

M H S A (•)

denotes the multi-head self-attention operation.

X_{o u t}

is the output tensor of the Bottleneck Transformer block, incorporating residual connection (+).

The BOT3 module follows the CSP architecture, enhancing expressive capability through feature branching and fusion. The input features are processed by two

1 \times 1

convolutional layers to generate feature maps

X_{1}

and

X_{2}

. After extracting enhanced features by stacking n Bottleneck Transformers, the results are concatenated along the channel dimension, followed by a

1 \times 1

convolutional layer to map the features to the output channels

X_{BOT 3}

. The CSP split ratio is set to 0.5, meaning the feature channels are equally divided between the transformer path and identity path.

C o n v_{1 \times 1} (X) \to [X_{1}, X_{2}]

(17)

X_{m} = B o t t l e n e c k T r a n s f o r m e r_{n} (X_{1})

(18)

X_{B O T 3} = C o n v_{1 \times 1} (c o n c a t (X_{m}, X_{2}))

(19)

The BOT3 module effectively integrates low-level spatial features

X_{2}

with high-level global contextual representations

X_{m}

. The resulting output feature map concurrently captures both spatial details and contextual information, thereby providing more robust feature inputs for the detection head.

2.4. Multi-Scale Feature Enhancement Module

An SAR image inherently contains complexities such as coherent speckle noise and complex marine backgrounds. Coupled with the common characteristics of remote sensing images, these factors often lead to misjudgments due to feature similarity in small ship target detection. Furthermore, traditional backbone networks exhibit limited capability when processing SAR images, often yielding features with insufficient semantic information and restricted receptive fields, making it difficult to effectively distinguish small ship targets from the background. To address these issues, this paper draws inspiration from the Feature Enhancement Module (FEM) [37] and constructs an MSFE module, as illustrated in Figure 5. This module comprises several key components—a unique multi-branch dilated convolution structure, an edge enhancer, and feature fusion mechanisms—that work collaboratively to enhance the feature representation of small ship targets. The MSFE module processes input features of dimension C × H × W and maintains the same output dimension through careful dimension preservation in all operations.

The multi-branch dilated convolution structure divides the input feature map into four branches, each performing a 1 × 1 convolutional operation on the input features. The first branch employs a residual structure, effectively preserving critical feature information of small targets by constructing an equivalent mapping. This design is particularly important for detecting small ship targets in SAR images: their features are typically weak and prone to loss during convolutional operations, but the residual structure ensures these key characteristics are retained, providing a solid foundation for subsequent detection and recognition. The other three branches perform cascaded standard convolution operations with kernel sizes of 1 × 3, 3 × 1, and 3 × 3, respectively. This diverse kernel configuration helps process the input feature map from different perspectives and scales, thereby extracting richer feature information. Notably, the two middle branches incorporate additional dilated convolution layers, enabling the extracted feature maps to capture broader contextual information. In complex marine backgrounds of SAR images, environmental cues around small ship targets are crucial for accurate identification. The dilated convolution layers expand the receptive field, effectively preserving these contextual details and thereby improving the model’s detection accuracy for small targets and its adaptability to complex backgrounds. All convolutional layers use padding=’same’ to maintain spatial dimensions, and ReLU activation follows each convolution except the final output layer.

The output feature maps of the latter three branches can be expressed as follows:

C_{1} = C B R_{3 \times 3} [C B R_{1 \times 1} (x)]

(20)

C_{2} = D C B R_{3 \times 3}^{5} \{C B R_{3 \times 1} \{C B R_{1 \times 3} [C B R_{1 \times 1} (x)]\}\}

(21)

C_{3} = D C B R_{3 \times 3}^{5} \{C B R_{1 \times 3} \{C B R_{3 \times 1} [C B R_{1 \times 1} (x)]\}\}

(22)

where

C B R_{m \times n} (•)

denotes a module comprising a convolutional layer (kernel size m × n), batch normalization, and a ReLU activation layer. Here,

D C B R_{3 \times 3}^{5}

specifically represents a convolutional layer with a 3 × 3 kernel and a dilation rate of 5. x is the input original feature map, while

C_{1}

,

C_{2}

, and

C_{3}

are the output feature maps of the three branches after standard and dilated convolutions, respectively. The dilation rate of 5 was empirically determined to provide optimal receptive field expansion without grid artifacts, and all 1 × 1 convolutions use 256 output channels.

Subsequently, the multi-scale feature maps from the multiple branches are fused along the channel dimension, followed by channel dimensionality reduction using a 1 × 1 convolutional kernel, yielding the fused feature map

C_{4}

. This process can be expressed as follows:

C_{4} = C B_{1 \times 1} (C o n c a t (C_{1}, C_{2}, C_{3})) + C B_{1 \times 1} (x)

(23)

where

C o n c a t (•)

denotes the concatenation of feature maps along the channel dimension, and

C B_{1 \times 1} (•)

represents a module consisting of a convolutional layer with kernel size 1 × 1 and a batch normalization layer. The final 1 × 1 convolution reduces the channel dimension from 768 (3 × 256) back to 256 to maintain dimension consistency.

The Edge Enhancer (EE) emphasizes crucial boundaries of ships in the image through detailed processing and information extraction of the input feature map. Specifically, the EE first applies an average pooling layer

P a v g (•)

to suppress high-frequency noise and detailed information in the feature map, resulting in a preliminarily smoothed feature map

A v g P o o l

. This operation can be mathematically expressed as

A v g P o o l = P a v g (C_{5})

(24)

where the average pooling uses a kernel size of 3 × 3 and a stride of 1 to preserve spatial resolution while providing sufficient smoothing.

Subsequently, the edge information feature map

C_{5}

is obtained by subtracting the average-pooled feature map

A v g P o o l

from the edge information

E

, expressed as:

E = C_{5} - A v g P o o l

(25)

This process successfully separates potential edge information by leveraging the differential characteristics between edge regions and smooth areas in the image. To further enhance the extracted edge information and make it more prominent in subsequent feature representations, the obtained edge information E is sequentially processed through a convolutional layer and an activation function. The convolutional layer performs feature extraction and transformation on the edge information, while the activation function introduces nonlinearity to strengthen its representational capacity. The enhanced edge information

C_{6}

is obtained through this series of operations, expressed as follows:

C_{6} = C B S (E)

(26)

where

C B S (•)

denotes a module comprising a convolutional layer with kernel size 1 × 1, a batch normalization layer, and a ReLU activation layer.

Subsequently, the enhanced edge information

C_{6}

is added to the original edge information feature map

C_{5}

to obtain the final enhanced feature map

X

, expressed as

X = C_{6} + C_{5}

(27)

Through this approach, the Edge Enhancer effectively highlights key object boundaries, enabling the enhanced feature map to strengthen the representation of object edges while preserving original feature information. This provides richer and more accurate feature representations for subsequent target detection and recognition tasks.

Finally, the fused feature map

C_{4}

from the four branches and the edge-enhanced feature map obtained through multiple Edge Enhancers are concatenated along the channel dimension. The concatenated result is then passed through a 1 × 1 convolutional layer for dimensionality reduction, yielding the final feature map

X_{end}

. The MSFE module employs 3 Edge Enhancer blocks in series, with each enhancer processing the output of the previous one. This process can be expressed as

X_{end} = C B R_{1 \times 1} (C o n c a t (C_{4}, X_{1} + X_{2} + X_{3}))

(28)

where

X_{1}

,

X_{2}

,

X_{3}

represent the outputs from the three sequential Edge Enhancer blocks.

2.5. Multi-Scale Effective Feature Aggregation Module

In large-scene wide-swath SAR ship target detection, significant variations in ship scales—from small fishing boats to large cargo vessels—result in considerable differences in their pixel sizes within images. Additionally, complex and variable marine backgrounds interfere with detection, while substantial image noise further exacerbates the difficulty, posing severe challenges to ship target detection. Specifically, small fishing boats occupy few pixels and exhibit weak features in SAR images, making them highly susceptible to being flooded by background noise. Although large cargo ships have relatively larger sizes, they still occupy limited regions in wide-swath images, and the structural and contour differences among various ship types further increase detection difficulty. To address these issues, this paper designs a Multi-scale Effective Feature Aggregation (MEFA) module to significantly enhance the feature representation capability of convolutional neural networks when processing a large-scene wide-swath SAR image. The MEFA module takes two input feature maps,

X_{1}

and

X_{2}

, from different network levels and fuses them through concatenation before multi-scale processing in Figure 6.

The MEFA module integrates feature map

X_{1}

from different levels of the backbone network with intermediate feature map

X_{2}

from the detection head, combining multi-scale information to enable the model to simultaneously capture local details from low-level features and global semantics from high-level features. This allows the model to better adapt to the significant scale variations of ship targets in large-scene wide-swath SAR images and improve detection capability for ships of different sizes. The fused feature map is split into four branches:

X_{1}

,

X_{2}

,

X_{3}

,

X_{4}

. Three of these branches undergo convolutional operations with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively, extracting multi-scale feature information from the image. For small ship targets, smaller kernels (e.g., 3 × 3) capture fine-grained details, while for large ship targets, larger kernels (e.g., 7 × 7) capture overall structural information. These three branches are then concatenated with the original fused feature map along the channel dimension, preserving both multi-scale target information and original image content. The kernel size progression (3, 5, 7) was designed to capture increasingly larger contextual information, with corresponding padding values (1, 2, 3) selected to maintain spatial dimensions. This process can be mathematically expressed as follows:

X_{end} = C B R_{1 \times 1} (C o n c a t (C_{4}, X_{1} + X_{2} + X_{3}))

(29)

x_{1} = C o n v_{1 \times 1} (\frac{1}{4} X_{c o n c a t}) \in R^{C \times H \times W}

(30)

x_{2} = C o n v_{3 \times 3}^{P = 1} (\frac{1}{4} X_{c o n c a t}) \in R^{C \times H \times W}

(31)

x_{3} = C o n v_{5 \times 5}^{P = 2} (\frac{1}{4} X_{c o n c a t}) \in R^{C \times H \times W}

(32)

x_{4} = C o n v_{7 \times 7}^{P = 3} (\frac{1}{4} X_{c o n c a t}) \in R^{C \times H \times W}

(33)

X_{f u s e d} = C o n v_{1 \times 1} (c o n c a t ([x_{1}, x_{2}, x_{3}, x_{4}], dim = 1)) \in R^{C \times H \times W}

(34)

where

C o n v_{n \times n}^{P = m}

denotes a grouped convolution with kernel size n×n and padding m. All convolutional layers in the multi-scale branches use 64 output channels, resulting in 256 total channels after concatenation, which are then reduced back to C channels by the final 1 × 1 convolution.

Incorporating the CBAM attention mechanism, dual visual focus is achieved through channel attention and spatial attention, dynamically suppressing interference from regions such as waves and islands by weighting the input features. The CBAM component uses a reduction ratio r = 16 in the channel attention MLP, and the spatial attention employs 7 × 7 convolution for spatial feature aggregation.

W_{c} = S i g m o i d (M L P (P a v g (X)) + M L P (P m a x (X)))

(35)

W_{s} = S i g m o i d (C o n v_{1 \times 1} ([M e a n (X), M a x (X)]))

(36)

Y_{1} = X_{1} \otimes W_{c} \otimes W_{s}, Y_{2} = X_{2} \otimes W_{c} \otimes W_{s}

(37)

Y_{M E F A} = S i g m o i d (X_{f u s e d}) \otimes Y_{1} + S i g m o i d (X_{f u s e d}) \otimes Y_{2} + Y_{1} + Y_{2}

(38)

where

P a v g (•)

denotes the average pooling layer,

P m a x (•)

denotes the max pooling layer,

M e a n (•)

represents average pooling along the channel dimension, and

M a x (•)

represents max pooling along the channel dimension.

W_{c}

and

W_{s}

are channel and spatial attention weight tensors.

Y_{1}

and

Y_{2}

are attention-weighted feature map tensors.

Y_{M E F A}

is the final output tensor of the MEFA module. The final fusion strategy employs element-wise multiplication between the multi-scale fused features and attention-weighted inputs, followed by residual connections to preserve original information.

3. Experimental Results and Analysis

3.1. Dataset and Experimental Configuration

To evaluate and compare the performance of different algorithms in SAR image target detection tasks, experiments were conducted using commonly adopted SAR ship target detection datasets, including the LS-SSDD dataset [38] and the HRSID dataset [39]. Key parameters of these datasets are summarized in Table 1. Additionally, large-scene wide-swath SAR images from real-world scenarios, all sourced from Sentinel-1, were also utilized.

LS-SSDD, introduced in 2020, is the first public benchmark dataset designed for large-scene small-target SAR ship detection. It is constructed from Sentinel-1 Interferometric Wide Swath (IW) mode data, with a ground range resolution of 5 m and an azimuth resolution of 20 m. The dataset covers multiple typical maritime regions worldwide and comprises 15 original SAR images, each with dimensions of 24,000 × 16,000 pixels. The training set and test set were split in an 8:2 ratio.

HRSID consists of 5604 SAR images captured by Sentinel-1 and TerraSAR-X, each with a size of 800 × 800 pixels and resolutions of 0.5 m, 1 m, and 3 m. The dataset contains 16,965 ship instances, including 15,510 small ships, 1344 medium ships, and 111 large ships. The data were divided into training, test, and validation sets in a 6:2:2 ratio.

Prior to conducting experiments, each dataset was divided into training and test sets. The training set was used to train the model, and when the training loss and accuracy stabilized, predictions were made on the test set images to validate the model’s generalization capability. The model was built using the PyTorch1.10 deep learning framework with the following configuration: a batch size of 16, an initial learning rate of 0.00001, and a maximum of 150 training epochs. The model was optimized using the Stochastic Gradient Descent (SGD) algorithm and trained on an NVIDIA 4090D GPU with 24 GB of memory provided by the AutoDL platform.

3.2. Evaluation Metrics

This paper employs precision (P), recall (R), and average precision (AP) to evaluate the performance of the detection model. Precision is defined as the proportion of correctly detected ships among all detected ships:

P = \frac{T P}{T P + F P}

(39)

Recall represents the proportion of successfully detected targets among all actual targets:

R = \frac{T P}{T P + F N}

(40)

where

F N

denotes false negatives (missed detections),

T P

represents true positives (correct detections satisfying two conditions: correct category—predicted class matches the ground truth class, and correct localization—IoU between predicted and ground truth boxes is greater than or equal to a preset threshold).

F P

indicates false positives (false detections).

A P

is determined by the precision–recall curve and is defined as follows:

A P = \int_{0}^{1} P (R) d R

(41)

P (R)

is represented by the area under the precision–recall curve. It reflects the accuracy of the target detection model. Following the COCO dataset

A P

calculation method,

m A P_{50}

and

m A P_{50 : 95}

are computed separately.

m A P_{50}

is the average precision when the Intersection over Union (IoU) between the detection box and the ground truth box is greater than 0.5, while

m A P_{50 : 95}

denotes the average precision with IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. Computational Load (GFLOPs) and Parameter Count (Para, M) represent the model’s complexity and computational cost. FPS (Frames Per Second) indicates the model’s inference speed.

3.3. Ablation Study on Network Architecture Effectiveness

To validate the effectiveness of the proposed method and provide theoretical motivation for each module design, a progressive ablation study was conducted to investigate the contribution of each improved module to the SAR ship detection task.

Using the LS-SSDD dataset and YOLOv8 as the baseline model, the experiments sequentially incorporated BR, CA, BOT3, MSFE, and MEFA modules to evaluate their individual improvements. As shown in Table 2, the BR module improved

m A P_{50}

from 0.712 to 0.743 and increased recall by 4.3% without increasing computational cost (maintaining 3.0 GFLOPs), confirming its effectiveness as an efficient preprocessing strategy that suppresses background noise in SAR images via channel restructuring. This confirms that frequency-domain processing provides a computationally efficient foundation for subsequent modules.

Further incorporation of the CA mechanism enhanced the model’s feature extraction capability for ships, raising

m A P_{50}

to 0.760 and P to 0.864, validating its role in capturing directional spatial dependencies crucial for ship target localization. The addition of the BOT3 module improved

m A P_{50 : 95}

to 0.311, highlighting the importance of global context modeling for robust ship detection in complex marine environments.

Introducing the MEFA module further increased

m A P_{50}

to 0.78 through dynamic weighted fusion of multi-scale features. Finally, integrating the MSFE branch expanded the receptive field via dilated convolutions and balanced detection performance across targets of different scales. Although precision slightly decreased to 0.867,

m A P_{50 : 95}

reached a peak of 0.316. This precision–recall trade-off illustrates the complementary nature of these modules: while MEFA enhances feature aggregation, MSFE improves multi-scale representation at the cost of some false positives.

A critical finding emerges from analyzing module combinations: omitting the BR and CA using only BOT3, MEFA, and MSFE caused a sharp drop in

m A P_{50 : 95}

to 0.744. This strongly confirms that the modules exhibit synergistic effects: the foundational preprocessing (BR) and spatial attention (CA) modules are essential for the effective operation of the more complex multi-scale modules.

When all modules were combined, the proposed method achieved the best performance across all metrics except P, where it was slightly lower than the configuration without MSFE. The MSFE module’s multi-scale receptive fields and feature enhancement capabilities improve detection sensitivity, especially for small targets, but simultaneously increase susceptibility to background artifacts in complex SAR scenes. This is quantitatively evidenced by the 1.5% increase in recall accompanied by a 1.5% decrease in precision.

The relationship between the precision decrease and multi-scale feature fusion can be visualized in Figure 7, which presents representative false detection cases. We attribute the MSFE module’s performance characteristics to its ability to capture multi-scale target information through feature fusion and enhancement, improving feature complementarity and expressive power, strengthening key features while suppressing noise, and adapting to complex scenes by learning multi-scenario feature patterns—all of which benefit target detection. P, defined as the ratio of correctly detected targets to all detections, decreased slightly after multi-scale feature fusion. Although the detection of small targets increased, some irrelevant false detections were also introduced, as illustrated in Figure 7. R, defined as the ratio of correctly detected targets to all annotated targets, improved, indicating that the model successfully detected more true targets. This suggests that multi-scale feature fusion enhances small-target detection capability but also introduces a certain degree of false detections.

In the LS-SSDD dataset, the tendency to over-learn background pseudo-features and feature-level conflicts, combined with the complexity of SAR scenes and sparse target distribution, led to increased false detections and a noticeable drop in P. However, since some true positives were still correctly identified, R and mAP were less affected. This demonstrates that our module combination achieves an optimal balance for SAR ship detection, where the gains in recall and overall mAP outweigh the slight precision reduction.

As the proposed MSFE module is an improved version of the FEM module, an ablation experiment was also conducted by replacing MSFE with FEM. The results show that the FEM module performs worse than the MSFE module in all indicators, providing concrete evidence that our architectural modifications in MSFE are theoretically justified and empirically superior for small ship target feature representation. Meanwhile, the more complex structure of MSFE also led to a slight increase in Para and GFLOPs.

To further validate the module complementarity, we conducted replacement experiments with alternative approaches. The proposed BR module surpassed wavelet transform (WT) and principal component analysis (PCA) methods, confirming the superiority of our multi-scale decomposition and dynamic fusion approach for SAR-specific channel reorganization.

These results collectively validate that each module addresses distinct challenges in SAR ship detection while exhibiting complementary synergistic effects: BR provides foundational feature optimization through frequency-domain processing, CA enhances directional spatial perception, BOT3 improves global contextual representation, while MEFA and MSFE strengthen multi-scale feature fusion—together forming an efficient and robust framework that delivers measurable improvements beyond mere architectural complexity.

To further investigate the generalizability and theoretical robustness of the proposed method, ablation experiments were also conducted on the HRSID dataset. The results demonstrate consistent performance patterns that reinforce our theoretical motivations. The BR module effectively enhances model performance across both datasets, with improvements observed in P, R, and

m A P_{50}

, confirming its universal effectiveness in SAR image preprocessing. The CA module significantly increases inference speed while maintaining accuracy, validating its efficiency in spatial attention modeling. The BOT3 and MSFE modules achieve a favorable balance between accuracy and computational efficiency across both datasets, demonstrating the generalizability of our global context and multi-scale feature learning approaches.

A particularly revealing finding emerges from the HRSID experiments: when both the BR and CA modules were omitted, R dropped significantly to 0.848. In contrast, when all modules were integrated, the model achieved relatively superior results across P, R, and mAP, providing cross-dataset evidence for the necessity of the complete architecture and the synergistic effects between modules.

The computational complexity analysis across both datasets reveals important theoretical insights into our architectural design. As shown in Table 2 and Table 3, the gradual incorporation of modules leads to increases in both parameter count and computational load. The introduction of the CA module causes a noticeable jump in these metrics, which can be attributed to its use of global pooling, fully connected matrix operations, and activation functions—all of which require substantial parameters and serial operations. These components provide theoretical benefits by enabling the model to focus on ship-related channel features, suppress background noise, and improve the detection accuracy of small targets.

Subsequent modules, such as BOT3 and MEFA, also increase parameters and computations but are designed with multi-branch parallel structures (e.g., multi-scale branches in BOT3, cross-layer feature fusion in MEFA). Theoretical analysis indicates that these operations—primarily based on parallel convolution and feature concatenation—are better aligned with GPU parallel acceleration capabilities, leading to higher hardware utilization. In addition, the newly added parameters are mostly weights of branch convolutions, which are gradually stacked and do not exceed the original complexity threshold of the model. Therefore, the impact on computation and FPS is more gradual.

This computational pattern analysis provides theoretical justification for our architectural choices: while individual modules increase complexity, their careful integration maintains practical deployability while delivering measurable performance improvements that justify the added computational cost.

The SAR images in the LS-SSDD and HRSID datasets differ in noise level, contrast, and target scale, resulting in variations in computational resource consumption and the model’s parallel adaptability after module stacking. Moreover, the magnitude and rate of increase and decrease in indicators in the two datasets are different, but the overall trend remains consistent. This consistency across diverse SAR datasets strongly supports the theoretical robustness and generalizability of our proposed architecture.

3.4. Comparative Experiments with Other Methods

To validate the effectiveness of the proposed method in SAR image ship target detection, comparative experiments were conducted on both the LS-SSDD and HRSID datasets. The models included in the comparison were categorized into three groups to ensure thorough and insightful analysis: Classical detectors (Faster R-CNN [12], Cascade R-CNN [13], EfficientDet [40], SSD [15], FCOS [41], and GL-DETR [42]), Specialized SAR detection models (HR-SDNet [43], SARNet [44], YOLO-SARSI [19], MSIF [45], MSFA [46], FESAR [20], SARNAS [47], SSD-YOLO [33], FDI-YOLO [48], ROSD [49], SIF-Net [50], DRGD-YOLO [51], and PSG [52]), YOLO series baseline models (YOLOv5s [53], YOLOv8 [54], YOLOv11 [55], and YOLOv12 [56]).

On the LS-SSDD dataset, which focuses on small-target detection, the proposed method achieved an

m A P_{50}

of 0.781, outperforming all compared classical and newer models (as shown in Table 4). In particular, our method achieves 0.316 on

m A P_{50 : 95}

, which is significantly higher than other methods, indicating that the generated prediction boxes have higher boundary accuracy. The performance superiority stems from our holistic approach: the Band Reorganization module provides cleaner inputs by suppressing background noise, while the multi-scale enhancement and attention mechanisms significantly improve target-clutter discrimination. The most notable improvements are observed in

m A P_{50 : 95}

(13.3% higher than YOLOv8) and recall, demonstrating enhanced boundary precision and reduced missed detections.

Compared to YOLO series baseline models, the introduction of BR, CA, BOT3, and MSFE structures led to a significant improvement in detection performance while maintaining the high efficiency inherent to single-stage detectors. In terms of model complexity, the proposed method achieves a balance with moderate parameter counts and computational load, while reaching an inference speed of 175.44 FPS. This frame rate substantially exceeds that of most two-stage models and Transformer-based detectors, demonstrating strong potential for real-time processing.

To further validate the generalization capability of the proposed method in complex scenarios, experimental results on the HRSID dataset demonstrate that our approach achieves an

m A P_{50}

of 0.945—the best among all compared methods. Compared with specialized SAR ship detection models, e.g., FESAR and SSD-YOLO, the proposed method achieves a significant breakthrough in detection accuracy. Notably, while maintaining high precision (0.923), the proposed method also achieves the highest recall (0.889) among all models. This indicates its ability to more comprehensively identify targets in complex marine environments and dense scenes while effectively controlling false detections. The balanced improvement across both precision and recall metrics underscores our method’s capability to handle the precision–recall trade-off that often challenges SAR ship detection. These results confirm the effectiveness of the proposed multi-scale enhancement and attention mechanisms in improving model generalization.

As summarized in Table 4 and Table 5, the proposed method achieves optimal performance on both the LS-SSDD and HRSID datasets in terms of core detection metrics (

m A P_{50}

and

m A P_{50 : 95}

), demonstrating its high accuracy and strong generalization capability for ship detection in wide-swath SAR images.

To more intuitively demonstrate the superiority of the proposed algorithm in small-target detection, representative scenes from both the LS-SSDD and HRSID datasets were selected for comparative analysis against other algorithms, as illustrated in Figure 8. The experimental results show that the proposed algorithm achieves optimal performance in small-target detection tasks (highlighted by the blue bounding boxes in the figure), significantly reducing both missed detections and false alarms while generally achieving higher confidence scores. These visual results corroborate the quantitative analysis, particularly demonstrating our method’s effectiveness in detecting challenging small targets that are frequently missed by other approaches. These outcomes fully demonstrate that the proposed algorithm exhibits stronger robustness in wide, large-scene, small-target detection and holds broader application potential.

3.5. Generalization Performance Evaluation

Building upon the benchmark validation using public datasets (LS-SSDD and HRSID), this section further evaluates the adaptability and generalization capability of the proposed method for real-world large-scene wide-swath SAR images by incorporating the SAHI slicing inference strategy. The experimental data were selected from three geographically and oceanographically distinct regions: the U.S. West Coast, East Coast, and the northwestern Gulf of Mexico. These areas were chosen to validate the method’s robustness across different marine environments—featuring steep coastlines on the West Coast, and gentle continental shelves with busy shipping lanes on the East Coast and in the Gulf of Mexico.

The data were sourced from Sentinel-1 Interferometric Wide Swath (IW) mode Single Look Complex (SLC) products provided by the Alaska Satellite Facility (ASF), with VV + VH polarization and a standard coverage range of approximately 250 km. The raw data downloaded from these websites were processed through a series of operations—including orbit correction, radiometric calibration, multi-looking, filtering, and geocoding—to generate images consistent with the format of the LS-SSDD dataset. For incidence angles ranging from 30° to 45°, which are characteristic of the Sentinel-1 Interferometric Wide (IW) swath mode, ships typically exhibit higher backscattering values in the co-polarization channel (VV), while lower backscattering is observed in the cross-polarization channel (VH) [57,58]. In cross-polarization, ships present a smaller radar cross-section (RCS), and the clutter often falls below the noise floor, making detection noise-limited [58]. The use of cross-polarization offers notable advantages, particularly for acquisitions with small incidence angle [58]. Moreover, since the noise floor is largely independent of incidence angle, cross-polarization enables more uniform ship detection performance across the image swath. Therefore, for the purpose of this study, which focuses on developing a robust detection network, VH-polarized SAR imagery was selected. Large sub-regions were extracted from the full-scene data for testing. During processing, the SAHI framework was employed for sliding window inference. Given the image resolution of approximately 10 meters per pixel, the base slice size was set to 800 × 800 pixels with an overlap ratio of 20% to effectively mitigate missed detections caused by ship targets being truncated at slice boundaries.

Finally, a weighted non-maximum suppression algorithm was applied to merge the initial detection results from all slices. An IoU threshold of 0.4 was set, and confidence-weighted averaging was performed on bounding boxes in overlapping regions to generate seamless, large-scene detection results.

To intuitively demonstrate the detection performance of the proposed method in complex large-scene real-world data, the detection results for the three scenarios are visualized in Figure 9, Figure 10, Figure 11 and Figure 12. For each image, a representative sub-region on the right side is selected and displayed as a zoomed-in view to clearly illustrate the actual detection effects. The annotation rules for the detection results are as follows: red rectangular boxes indicate ship targets correctly detected by the proposed algorithm, blue elliptical boxes mark missed detections, and green boxes indicate false detections. Analysis of the zoomed-in regions shows that the proposed method performs well in detecting small-scale targets. However, under interference from complex background clutter, instances remain where island or reef features are misclassified as ship targets.

Given the current lack of publicly available large-scale annotated datasets for SAR ship detection, this study employs expert visual interpretation as the quasi-ground truth for quantitative evaluation. The researchers performed systematic visual analysis of the full-scene image without reference to any algorithmic outputs, identifying targets based on morphological characteristics, scattering intensity, and land-sea contextual relationships. The detection results of the proposed algorithm were subsequently compared against this ground truth set, as summarized in Table 6. The proposed method demonstrates excellent generalization performance across wide-swath SAR images from different geographical regions. In the West Coast scenario, where the background is relatively simple and sea clutter interference is weak (providing an optimal testing environment), the method achieved the highest detection rate and precision. Although precision decreases in the more complex backgrounds of the East Coast and Gulf of Mexico scenarios, the detection rate remains consistently above 80%. This confirms the method’s stable and reliable ship detection capability across diverse wide-area scenes. The East Coast and Gulf of Mexico scenarios present greater complexity, with dense shipping traffic, gentle coastlines, and offshore facilities generating numerous false targets. These conditions place higher demands on the algorithm’s ability to distinguish foreground targets from background clutter, leading to a reduction in precision. Nevertheless, even under these challenging conditions, the proposed method maintains a detection rate above 80%, highlighting its strong robustness in capturing true targets within complex backgrounds.

4. Conclusions

SAR image-based methods for ship detection were often challenged by intense background clutter and the difficulty of extracting discriminative features from small targets in wide-swath images, resulting in limited detection performance. To mitigate these issues, this paper proposed a novel ship detection algorithm that integrated multi-channel feature extraction with a multi-scale enhancement scheme for small targets. The proposed algorithm enhanced input feature representation through channel recombination. It further incorporated a CA mechanism and a BOT3 module to bolster spatial feature discrimination and global context modeling. Additionally, a dedicated multi-scale module was constructed to facilitate feature refinement and information aggregation, collectively improving the detection accuracy for multi-scale ship targets. Extensive experiments on the LS-SSDD and HRSID datasets demonstrated that our algorithm significantly enhanced the detection of small ship targets in wide-swath SAR images, offering reliable technical support for maritime surveillance in complex, large-scale scenarios. Nevertheless, the proposed approach still exhibited certain limitations, including false positives for large ships, high computational complexity, and insufficient robustness to physical distortions inherent in wide-swath images. Future research will prioritize developing false positive suppression strategies, pursuing lightweight model design, and incorporating AIS data association alongside physical distortion modeling to further enhance the algorithm’s robustness and practical utility.

Author Contributions

Conceptualization, Y.Z., K.Z., H.G., J.L., Z.G. and X.L.; methodology, Y.Z.; validation, Y.Z. and K.Z.; formal analysis, Y.Z. and K.Z.; investigation, Y.Z. and K.Z.; resources, Y.Z. and K.Z.; data curation, Y.Z. and K.Z.; writing—original draft preparation, Y.Z., K.Z., H.G., J.L., Z.G. and X.L.; writing—review and editing, Y.Z. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, No. 42301464; China Postdoctoral Science Foundation, No. 2023M744302; The Key Project of Science and Technology of Henan Province, No. 252102211061; and Natural Science Foundation of Henan Province, No. 252300420301; Key Laboratory Project, No. KX25203.

Data Availability Statement

The experimental data for the generalization test were obtained from https://search.asf.alaska.edu/(accessed on 10 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Wu, L.B. SAR Image Processing and Target Recognition; Aviation Industry Press: Beijing, China, 2013. [Google Scholar]
Wei, S.J.; Jiang, P.F.; Yuan, Q.Z. Detection and Recognition of SAR Small Ship Objects Using Deep Neural Network. J. Northwestern Polytech. Univ. 2019, 37, 546–552. [Google Scholar] [CrossRef]
Zheng, Z.Y.; Zhang, T.; Liu, Z.Y.; Li, Y.J.; Sun, C.M. Ship target detection in SAR images based on RCIoU loss function. J. Shandong Univ. (Eng. Sci.) 2022, 52, 15–22. [Google Scholar]
Leng, X.G. Study on Ship Detection Technology Based on Complex-Valued Information in Single-Channel SAR Imagery. Master’s Thesis, National University of Defense Technology, Changsha, China, 2019. [Google Scholar]
El-Darymli, K.; Gill, E.W.; Mcguire, P.; Power, D.; Moloney, C. Automatic target recognition in synthetic aperture radar imagery: A state-of-the-art review. IEEE Access 2016, 4, 6014–6058. [Google Scholar] [CrossRef]
Gao, G. Statistical modeling of SAR images: A survey. Sensors 2010, 10, 775–795. [Google Scholar] [CrossRef]
Liu, W.; Cao, H.; Xiong, G.; Liu, J.; Qi, C. GLRT-based detectors with enhanced selectivity for mismatched signals through a random-signal approach. Signal Process. 2026, 240, 110375. [Google Scholar] [CrossRef]
Bakirci, M. Advanced ship detection and ocean monitoring with satellite imagery and deep learning for marine science applications. Reg. Stud. Mar. Sci. 2025, 81, 103975. [Google Scholar] [CrossRef]
Adil, M.; Nunziata, F.; Buono, A.; Velotto, D.; Migliaccio, M. Polarimetric Scattering by a Vessel at Different Incidence Angles. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4008605. [Google Scholar] [CrossRef]
Li, Z.; Deng, Z.; Hao, K.; Zhao, X.; Jin, Z. A Ship Detection Model Based on Dynamic Convolution and an Adaptive Fusion Network for Complex Maritime Conditions. Sensors 2024, 24, 859. [Google Scholar] [CrossRef] [PubMed]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
Zhao, Y.; Lv, W.Y.; Xu, S.L.; Wei, J.M.; Wang, G.Z.; Dang, Q.Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
He, S.; Wang, Y.Z.; Yang, Z.W. Multi-scale ship detection algorithm in SAR images in complex scenes. Appl. Electron. Tech. 2025, 51, 59–64. [Google Scholar]
Tang, H.; Gao, S.; Li, S.; Wang, P.; Liu, J.; Wang, S.; Qian, J. A Lightweight SAR Image Ship Detection Method Based on Improved Convolution and YOLOv7. Remote Sens. 2024, 16, 486. [Google Scholar] [CrossRef]
Liu, C.; Yan, C. FESAR: SAR ship detection model based on local spatial relationship capture and fused convolutional enhancement. Mach. Vis. Appl. 2024, 35, 34. [Google Scholar] [CrossRef]
Tang, F.L.; Li, W.S.; Qiao, G.J. Ship Detection in SAR Images Based on Adaptive Coordinate Generation and Multi-Frequency Perception. Laser Optoelectron. Prog. 2025, 62, 2028006. [Google Scholar]
Xu, J.T.; Wang, Z.W.; Zhao, G.P. Small target detection algorithm based on SAR images. Comput. Eng. Des. 2025, 46, 570–577. [Google Scholar]
Zhang, Y.; Zhang, Y.; Wang, T.; Wang, B.Y. Lightweight research on ship target detection in large-scale SAR images. J. Geo-Inf. Sci. 2025, 27, 256–270. [Google Scholar]
Gao, D.; Li, M.; Fan, D.Z.; Dong, Y.; Li, Z.X.; Wang, R. A ship detection method from lightweight SAR images under complex backgrounds. J. Geo-Inf. Sci. 2024, 26, 2612–2625. [Google Scholar]
Xia, R.; Chen, J.; Huang, Z.; Wan, H.; Wu, B.; Sun, L.; Yao, B.; Xiang, H.; Xing, M. CRTransSar: A Visual Transformer Based on Contextual Joint Representation Learning for SAR Ship Detection. Remote Sens. 2022, 14, 1488. [Google Scholar] [CrossRef]
Qu, H.C.; Shen, L.; Guo, W.; Wang, J.K. Ships Detection in SAR Images Based on Anchor-Free Model With Mask Guidance Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 666–675. [Google Scholar] [CrossRef]
Qin, C.; Zhang, L.P.; Wang, X.Q.; Li, G.; He, Y.; Liu, Y.H. RDB-DINO: An Improved End-to-End Transformer With Refined De-Noising and Boxes for Small-Scale Ship Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
Cao, J.; Han, P.H.; Liang, H.P.; Niu, Y. SFRT-DETR: A SAR ship detection algorithm based on feature selection and multi-scale feature focus. Signal Image Video Process. 2025, 19, 115. [Google Scholar]
Zhang, H.S.; Wu, W.; Xu, J.; Wu, F.; Ji, Y.M. A Ship Detection Method for SAR Images Based on Small Target Feature Enhanced RT-DETR. Comput. Sci. 2025, 1–11. [Google Scholar] [CrossRef]
Kang, M. Study on the Methods of SAR Ship Detection and Recognition Based on Deep Learning. Master’s Thesis, National University of Defense Technology, Changsha, China, 2017. [Google Scholar]
Li, J.Q.; Jiang, Z.J.; Yao, L.B.; Jian, T. A Ship Target Fusion Recognition Algorithm Based on Deep Learning. Ship Electron. Eng. 2020, 40, 31–35. [Google Scholar]
He, J.L. Research on Detection and Classification of Ship Targets in SAR Images. Master’s Thesis, Xidian University, Xi’an, China, 2019. [Google Scholar]
Fu, X.Y.; Zhou, Z.C.; Meng, H.; Li, S.T. A synthetic aperture radar small ship detector based on transformers and multi-dimensional parallel feature extraction. Eng. Appl. Artif. Intell. 2024, 137, 109049. [Google Scholar] [CrossRef]
Jiao, L.C.; Zhao, J. A Survey on the New Generation of Deep Learning in Image Processing. IEEE Access 2019, 7, 172231–172263. [Google Scholar] [CrossRef]
Wang, Y.; Wang, W.; Li, Y.; Jia, Y.D.; Xu, Y.; Ling, Y.; Ma, J.Q. An attention mechanism module with spatial perception and channel information interaction. Complex Intell. Syst. 2024, 10, 5427–5444. [Google Scholar] [CrossRef]
Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
Zhang, Y.; Ye, M.; Zhu, G.Y.; Liu, Y.; Guo, P.Y.; Yan, J.H. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
Zhang, T.W.; Zhang, X.L.; Ke, X.; Zhan, X.; Shi, J.; Wei, S.J.; Pan, D.C.; Li, J.W.; Su, H.; Zhou, Y.; et al. LS-SSDD-v1.0: A Deep Learning Dataset Dedicated to Small Ship Detection from Large-Scale Sentinel-1 SAR Images. Remote Sens. 2020, 12, 2997. [Google Scholar] [CrossRef]
Wei, S.J.; Zeng, X.; Qu, Q.; Wang, M.; Su, H.; Shi, J. HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation. IEEE Access 2020, 8, 120234–120254. [Google Scholar] [CrossRef]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 15–20 June 2019; pp. 9627–9636. [Google Scholar]
Li, C.; Hei, Y.Q.; Xi, L.H.; Li, W.T.; Xiao, Z. GL-DETR: Global-to-Local Transformers for Small Ship Detection in SAR Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Wei, S.J.; Su, H.; Ming, J.; Wang, C.; Yan, M.; Kumar, D.; Shi, J.; Zhang, X.L. Precise and robust ship detection for high-resolution SAR imagery based on HR-SDNet. Remote Sens. 2020, 12, 167. [Google Scholar] [CrossRef]
Gao, S.; Liu, J.M.; Miao, Y.H.; He, Z.J. A high-effective implementation of ship detector for SAR images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Zhang, S.F.; Chi, C.; Yao, Y.Q.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv 2020, arXiv:1912.02424. [Google Scholar]
Li, Y.X.; Li, X.; Li, W.J.; Hou, Q.B.; Liu, L.; Cheng, M.M.; Yang, J. SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection. arXiv 2025, arXiv:2403.06534. [Google Scholar]
Du, W.T.; Chen, J.; Zhang, C.C.; Zhao, P.; Wan, H.Y.; Zhou, Z.; Cao, Y.C.; Huang, Z.X.; Li, Y.S.; Wu, B.C. SARNas: A hardware-aware SAR target detection algorithm via multi-objective neural architecture search. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–23. [Google Scholar]
Wang, P.; Luo, Y.; Zhu, Z.L. FDI-YOLO: Feature disentanglement and interaction network based on YOLO for SAR object detection. Expert Syst. Appl. 2025, 260, 125442. [Google Scholar] [CrossRef]
Yang, X.; Zhang, X.; Wang, N.N.; Gao, X.B. A robust one-stage detector for multiscale ship detection with complex background in massive SAR images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
Wang, H.; Liu, S.L.; Lv, Y.X.; Li, S.Y. Scattering information fusion network for oriented ship detection in SAR images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
Cao, R.; Sui, J. A Dynamic Multi-Scale Feature Fusion Network for Enhanced SAR Ship Detection. Sensors 2025, 25, 5194. [Google Scholar] [CrossRef]
Zhu, H.; Li, D.; Wang, H.; Yang, R.; Liang, J.; Liu, S.; Wan, J. A Progressive Saliency-Guided Small Ship Detection Method for Large Scene SAR Images. Remote Sens. 2025, 17, 3085. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Torres, R.; Snoeij, P.; Geudtner, D.; Bibby, D.; Davidson, M.; Attema, E.; Potin, P.; Rommen, B.; Floury, N.; Brown, M.; et al. GMES Sentinel-1 mission. Remote Sens. Environ. 2012, 120, 9–24. [Google Scholar] [CrossRef]
Pelich, R.; Chini, M.; Hostache, R.; Matgen, P.; Lopez-Martinez, C.; Nuevo, M.; Ries, P.; Eiden, G. Large-scale automatic vessel monitoring based on dual-polarization Sentinel-1 and AIS data. Remote Sens. 2019, 11, 1078. [Google Scholar] [CrossRef]

Figure 1. The overall frame structure of this paper.

Figure 2. SAR image band reorganization. R, G, and B are the three channels of an image.

Figure 3. Coordinate channel attention mechanism network structure.

Figure 4. BOT3 network structure.

Figure 5. Multi-scale feature enhancement module.

Figure 6. Multi-Scale Effective Feature Aggregation module.

Figure 7. Visualization of False Detection Cases in the Detection Model with the MSFE Module. The red box indicates the actual detected target, while the yellow box indicates the ship target falsely detected by the method in this paper.

Figure 8. The detection results of the comparative experiments on the HRSID and LS-SSDD datasets. The red boxes indicate the detection boxes of each algorithm, while the blue boxes mark small-scale ship targets that are easily missed by most comparison algorithms.

Figure 9. Accurate detection results of vessel targets in the Gulf of Mexico, USA.

Figure 10. Detection results of ship targets in the Gulf of Mexico region of the United States. The blue circles represent missed detections, and the green ones represent false detections.

Figure 11. Detection results of ship targets in the West Coast area of the United States. The blue circles represent missed detections.

Figure 12. Detection results of ship targets in the East Coast area of the United States.

Table 1. Some parameters of the LS-SSDD dataset and the HRSID dataset.

Parameter	LS-SSDD	HRSID
Satellite	Sentinel-1	Sentinel-1, TerraSAR-X
Sensor Mode	IW	SM, ST, HS
Resolution	$10 \times 10$ m	0.5 m, 1 m, 3 m
Polarization	VV, VH	HH, HV, VV
Image Size	$800 \times 800$ pixels	$800 \times 800$ pixels

Table 2. Ablation experiment results on the LS-SSDD dataset. ✓ indicates that the component is included in the model variant. Bold indicates the optimal result.

Yolov8	BR	CA	BOT3	MEFA	MSFE	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs	Para	FPS
✓						0.828	0.626	0.712	0.279	3.0	8.1	625.00
✓	✓					0.852	0.653	0.743	0.299	3.0	8.1	669.23
✓		✓				0.863	0.664	0.766	0.309	28.5	11.1	370.37
✓			✓			0.862	0.665	0.765	0.308	29.1	12.0	639.23
✓				✓		0.864	0.665	0.765	0.306	36.2	15.3	556.33
✓					✓	0.867	0.664	0.765	0.306	37.2	14.3	400.00
✓	✓	✓				0.864	0.669	0.76	0.307	11.1	28.5	294.12
✓	✓	✓	✓			0.871	0.674	0.772	0.311	12.0	29.1	270.27
✓	✓			✓	✓	0.868	0.673	0.758	0.297	44.9	18.5	200.00
✓	✓	✓		✓	✓	0.872	0.676	0.750	0.296	45.0	18.5	200.00
✓	✓	✓	✓	✓		0.880	0.681	0.780	0.313	15.2	37.8	208.33
✓	✓	✓	✓	✓	✓	0.867	0.682	0.781	0.316	19.4	45.6	175.44
✓		✓	✓	✓	✓	0.851	0.644	0.744	0.293	45.6	19.4	333.33
✓	✓	✓	✓		+FEM	0.877	0.679	0.772	0.311	13.6	32.7	250.00
✓	✓	✓	✓	✓	+FEM	0.860	0.680	0.779	0.313	17.8	40.5	204.08
✓	+WT	✓	✓	✓	✓	0.860	0.668	0.755	0.296	3.0	8.1	183.56
✓	+PCA	✓	✓	✓	✓	0.857	0.627	0.731	0.291	3.0	8.1	196.45

Table 3. Ablation experiment results on the HRSID dataset. ✓ indicates that the component is included in the model variant. Bold indicates the optimal result.

Yolov8	BR	CA	BOT3	MEFA	MSFE	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs	Para	FPS
✓						0.925	0.862	0.937	0.693	3.0	8.1	190.31
✓	✓					0.913	0.881	0.941	0.721	3.0	8.1	192.31
✓		✓				0.929	0.864	0.940	0.706	28.5	11.1	357.14
✓			✓			0.922	0.880	0.942	0.722	29.1	11.9	238.10
✓				✓		0.922	0.865	0.937	0.695	36.2	15.3	153.85
✓					✓	0.922	0.884	0.943	0.721	37.2	14.3	243.90
✓	✓	✓				0.923	0.882	0.942	0.724	11.1	28.5	181.82
✓	✓	✓	✓			0.908	0.879	0.943	0.730	12.0	29.1	208.33
✓	✓			✓	✓	0.913	0.873	0.943	0.727	44.9	18.5	203.34
✓	✓	✓		✓	✓	0.915	0.862	0.943	0.729	45.0	18.5	192.31
✓	✓	✓	✓	✓		0.920	0.890	0.944	0.730	15.2	37.8	196.08
✓	✓	✓	✓	✓	✓	0.935	0.889	0.945	0.724	19.4	45.6	125.00
✓		✓	✓	✓	✓	0.934	0.848	0.942	0.722	45.6	19.4	270.27
✓	✓	✓	✓		+FEM	0.917	0.879	0.943	0.727	13.6	32.7	322.58
✓	✓	✓	✓	✓	+FEM	0.916	0.889	0.944	0.724	17.8	40.5	132.36
✓	+WT	✓	✓	✓	✓	0.931	0.869	0.939	0.698	19.4	45.6	134.45
✓	+PCA	✓	✓	✓	✓	0.922	0.871	0.939	0.708	19.4	45.6	145.55

Table 4. Performance comparison of the proposed method with other methods on the LS-SSDD dataset. Bold indicates the optimal result.

Method	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs	Para	FPS
Faster R-CNN [12]	0.828	0.626	0.712	0.279	134.38	41.35	–
Cascade R-CNN [13]	0.852	0.653	0.743	0.299	162.18	69.15	–
EfficientDet [40]	0.864	0.669	0.76	0.307	107.52	39.40	–
HR-SDNet [43]	0.871	0.674	0.772	0.311	260.39	90.92	–
SARNet [44]	0.583	0.819	0.762	–	104.20	42.60	–
GL-DETR [42]	0.781	0.792	0.751	0.149	–	–	18.5
MSIF [45]	0.865	0.781	0.751	–	–	–	–
MSFA [46]	0.858	0.712	0.754	0.286	–	–	–
DRGD-YOLO [51]	–	–	0.756	0.301	10.4	2.95	–
PSG [52]	–	–	0.733	–	–	–	21.5
YOLO-SARSI [19]	–	–	0.737	0.285	–	18.43	–
YOLOv5s [53]	0.806	0.723	0.761	–	7.1	2.5	454.55
YOLOv8s [54]	0.828	0.626	0.712	0.279	3.0	8.1	625.00
YOLOv11 [55]	0.821	0.622	0.703	0.256	6.3	25.8	476.19
YOLOv12 [56]	0.805	0.636	0.722	0.270	6.3	25.3	434.60
Our Model	0.867	0.682	0.781	0.316	19.4	45.6	175.44

Table 5. Performance comparison of the proposed method with other methods on the HRSID dataset.

Method	P	R	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs	Para	FPS
Faster R-CNN [12]	0.888	0.725	0.779	0.589	134.38	41.35	8.9
Cascade R-CNN [13]	0.861	0.823	0.811	0.551	162.18	69.15	8.9
SARNas [47]	0.892	0.883	0.908	–	1.28	5.46	–
GL-DETR [42]	0.914	0.875	0.921	–	–	16.1	–
FESAR [20]	0.935	0.858	0.940	–	3.5	46.8	–
FCOS [41]	0.776	0.902	0.877	0.373	170.60	50.80	7.8
SSD [15]	0.941	0.584	0.820	0.582	87.80	24.40	13.1
ROSD [49]	0.927	0.881	0.927	–	123.50	65.80	–
SIF-Net [50]	0.799	0.827	0.797	–	–	–	–
DRGD-YOLO [51]	0.924	0.856	0.931	0.698	10.4	2.95	–
PSG [52]	–	0.870	0.831	–	–	–	18.1
SSD-YOLO [33]	0.928	0.872	0.930	–	21.30	7.37	–
FDI-YOLO [48]	0.962	0.819	0.909	0.710	7.9	2.27	200
YOLOv5s [53]	0.863	0.821	0.904	0.602	7.1	2.5	184.45
YOLOv8s [54]	0.925	0.862	0.937	0.621	3.0	8.1	190.31
YOLOv11 [55]	0.921	0.839	0.935	0.618	6.3	25.8	202.12
YOLOv12 [56]	0.910	0.821	0.911	0.602	6.3	25.3	178.34
Our Model	0.923	0.889	0.945	0.631	19.4	45.6	125.00

Table 6. Quantitative index table of image detection in different regions.

Image Location	False Alarm Rate	Detection Rate	Precision	TP	FN	FD
West Coast	9.6%	90.4%	90.4%	75	10	8
Mexico Gulf	12.5%	87.5%	87.5%	28	13	4
East Coast	20%	80%	80%	12	9	3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, Y.; Zhu, K.; Guo, H.; Lu, J.; Gong, Z.; Liu, X. Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement. Electronics 2025, 14, 4728. https://doi.org/10.3390/electronics14234728

AMA Style

Zhou Y, Zhu K, Guo H, Lu J, Gong Z, Liu X. Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement. Electronics. 2025; 14(23):4728. https://doi.org/10.3390/electronics14234728

Chicago/Turabian Style

Zhou, Yi, Kun Zhu, Haitao Guo, Jun Lu, Zhihui Gong, and Xiangyun Liu. 2025. "Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement" Electronics 14, no. 23: 4728. https://doi.org/10.3390/electronics14234728

APA Style

Zhou, Y., Zhu, K., Guo, H., Lu, J., Gong, Z., & Liu, X. (2025). Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement. Electronics, 14(23), 4728. https://doi.org/10.3390/electronics14234728

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Ship Target Detection in SAR Imagery Based on Band Recombination and Multi-Scale Feature Enhancement

Abstract

1. Introduction

2. Methodology

2.1. SAR Image Band Reorganization

2.1.1. Gaussian Pyramid Hierarchical Noise Suppression

2.1.2. High-Frequency Detail Extraction via Laplacian Pyramid

2.1.3. Goal-Oriented Dynamic Weight Fusion

2.2. Coordinate Attention Mechanism

2.3. Bottleneck Transformer

2.4. Multi-Scale Feature Enhancement Module

2.5. Multi-Scale Effective Feature Aggregation Module

3. Experimental Results and Analysis

3.1. Dataset and Experimental Configuration

3.2. Evaluation Metrics

3.3. Ablation Study on Network Architecture Effectiveness

3.4. Comparative Experiments with Other Methods

3.5. Generalization Performance Evaluation

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI