1. Introduction
As an active microwave imaging sensor, Synthetic Aperture Radar (SAR) possesses unique all-weather, all-time, and long-range imaging advantages, enabling its significant role in marine monitoring [
1], maritime traffic management [
2] and ship salvage [
3]. The automatic identification and localization of ship targets from complex maritime backgrounds, known as ship target detection [
4], is a central technology in SAR image interpretation with substantial research value and wide applicability.
Conventional methods for detecting ship targets in SAR imagery mainly depend on hand-crafted feature extraction algorithms and classifiers. These include Constant False Alarm Rate (CFAR) detection [
5], threshold segmentation [
6], adaptive detection [
7], and wavelet-based detection techniques [
8]. Such approaches typically identify ships by extracting their geometric, textural, and scattering characteristics. However, the fundamental limitation of these approaches lies in their reliance on the ‘manual feature design-shallow classification’ paradigm, which is inherently at odds with the highly complex physical imaging characteristics of SAR imagery. This contradiction manifests particularly acutely in three typical challenges. Firstly, at the feature level, manually designed features are highly sensitive to the coherent speckle noise prevalent in SAR imagery and struggle to adaptively capture the substantial appearance variations that targets exhibit as their attitude and size change. Secondly, at the model level, traditional classifiers possess limited learning capability, rendering them incapable of modelling the high-dimensional, non-linear feature relationships that emerge when ships and port facilities intertwine within complex coastal backgrounds; consequently, false alarm rates increase significantly under severe background clutter. Finally, at the methodological level, most conventional approaches employ isolated ‘detection-discrimination’ workflows and fail to achieve end-to-end optimisation from image to target, leading to cumulative errors across processing stages when handling multi-scale, densely clustered, or partially occluded targets and resulting in insufficient overall robustness. Therefore, although a series of improved algorithms represented by CFAR [
9,
10,
11] have enhanced performance in specific scenarios through dynamic thresholds and multi-scale modelling, they have not overcome the inherent limitations of the ‘manual feature design’ paradigm. With the growing demand for processing large-scale, highly complex SAR data, the shortcomings of traditional methods in detection accuracy, adaptability, and computational efficiency have become increasingly apparent. The focus of current research has therefore shifted towards deep learning approaches that construct high-level feature representations directly from data through end-to-end learning; such representations are robust to noise and strongly discriminative against background clutter, offering a systematic way to overcome these long-standing challenges.
In 2012, AlexNet [
12], a deep convolutional neural network (CNN)-based architecture, achieved a decisive victory in the ImageNet image recognition competition, catalyzing broad interest in deep learning methodologies. Deep learning-based target detection algorithms are generally divided into two categories depending on the use of explicit region proposals: two-stage target detection algorithms and one-stage target detection algorithms. Two-stage algorithms convert the detection task into a classification problem by first generating region proposals and then classifying the localized image regions within them. Representative models in this category include R-CNN [
13], Faster R-CNN [
14], Cascade R-CNN [
15] and Mask R-CNN [
16]. In the context of SAR ship detection, Xiao et al. [
17] introduced a multi-resolution detection approach based on an improved regional convolutional neural network (R-CNN), which enhanced input image sizing, region proposal optimization, database categorization and weight balancing to boost detection accuracy in complex multi-resolution SAR scenarios. In another study, Ke et al. [
18] incorporated deformable convolutions into Faster R-CNN, enabling the model to adaptively learn two-dimensional offsets and better represent geometric variations in ship shapes, which raised average precision. Jian et al. [
19] developed an SS R-CNN framework using self-supervised learning, where feature representation networks were pre-trained on ship-free ocean imagery, leading to improved Mask R-CNN performance in remote sensing ship detection, particularly for small ships. Two-stage object detection algorithms first generate region proposals, followed by classification and refinement. This approach tends to produce numerous false alarm proposals in terrestrial areas, consuming significant computational resources in subsequent processing stages. Consequently, it struggles to meet the practical demands for rapid processing of large-scale SAR data. In contrast, one-stage target detection algorithms, also referred to as regression-based detectors, bypass explicit region proposal generation and treat detection as a unified regression problem over the entire image. Notable one-stage models include YOLO [
20], SSD [
21], RetinaNet [
22] and FCOS [
23]. For SAR ship detection, Yu et al. [
24] augmented YOLOv5 with a coordinate attention mechanism and a bidirectional feature pyramid, improving both feature fusion and detection accuracy. Miao et al. [
25] combined wavelet decomposition with an enhanced SSD model to strengthen the detection of coastal ships in complex SAR environments. Yang et al. [
26] proposed an improved fully convolutional one-stage detector (Improved-FCOS) incorporating multi-level feature attention, feature refinement reuse, and enhanced detection heads, addressing issues such as misclassification, small object detection and anchor-related limitations in SAR imagery. One-stage object detection algorithms unify detection as a dense prediction problem, significantly enhancing efficiency. However, this approach leads to conflicting optimisation objectives in complex terrestrial and open-sea environments, making it challenging to simultaneously maintain high detection rates and low false alarm rates. Although both categories of deep learning approaches have propelled progress in the field, existing methods have failed to systematically model and resolve core challenges inherent in SAR imagery, such as speckle noise, multi-scale targets, and the extreme heterogeneity of sea–land backgrounds. Consequently, developing specialised detection architectures capable of deeply integrating SAR imaging mechanisms to achieve a superior balance between accuracy, speed, and scene adaptability has become an urgent research priority requiring breakthroughs.
To address the core challenges encountered in SAR ship detection, this paper proposes a novel detection model, CCAI-YOLO. Building upon YOLOv8n as its baseline, this model avoids a simple stacking of components. Instead, through a series of targeted designs, it constructs a systematic solution that collaboratively tackles the unique difficulties inherent in SAR imagery. The principal contributions of this study are summarised as follows:
A novel ship detection model named CCAI-YOLO is proposed, demonstrating superior multi-scale ship detection capabilities in complex environments.
At the model architecture level, the synergistic optimisation of C2f-ODConv, C2f-ACmix and ASFF-Head enhances the CCAI-YOLO model’s feature extraction, global context modelling and multi-scale feature fusion capabilities, thereby strengthening its overall detection performance.
At the training optimisation level, the Inner-SIoU loss function is employed, integrating directional awareness with internal scaling. This enables prediction boxes to align with target boxes rapidly and stably, thereby improving training efficiency.
Experimental results on the SSDD and SAR-Ship-Dataset datasets indicate that CCAI-YOLO exhibits enhanced accuracy and robustness compared to multiple alternative methods, thereby contributing to the advancement of maritime defence infrastructure.
The subsequent structure of this paper is as follows:
Section 2 reviews relevant prior research;
Section 3 elaborates on the proposed CCAI-YOLO model architecture and its design principles;
Section 4 details the experimental setup and specific implementation methods;
Section 5 analyses and discusses the experimental results;
Section 6 systematically summarises the entire paper and draws conclusions.
2. Related Works
The YOLO series represents a cornerstone in object detection, having pioneered the single-stage paradigm that successfully balances speed with accuracy. As a mature and benchmark-setting iteration within this series, YOLOv8 remains a preferred choice in research due to its robust all-around performance in SAR image ship detection tasks.
To enhance the extraction and utilisation of multi-scale vessel features in SAR imagery, a series of improved models based on the YOLOv8 framework have been successively proposed. Regarding feature enhancement and multi-scale fusion, the MSFA-YOLO proposed by Zhao et al. [
27] enhances multi-scale representations by integrating C2fSE and DenseASPP modules, though its robustness and accuracy remain subject to improvement. The DGSP-YOLO constructed by Zhu et al. [
28] significantly enhances small object detection capability and noise resistance by embedding SPDConv, C2fMHSA, and DySample samplers. Regarding attention mechanisms and context modelling, Li et al. [
29] proposed MAEE-Net, which integrates a Multi-Attention Feature Fusion Module (MAFM) and Edge Feature Enhancement Module (EFEM) at the neck to reinforce shallow target features while suppressing background interference. Luo et al. [
30] designed SHIP-YOLO, incorporating a stochastic attention mechanism and Wise-IoU loss to address challenges posed by small targets and complex backgrounds. In terms of lightweight design and efficiency optimisation, Wang et al. [
31] proposed YOLOSAR-Lite, which employs knowledge distillation and lightweight component replacement to reduce model complexity while maintaining accuracy. Regarding specialised detector heads and loss function design, the YOLOV8-FDF framework proposed by Jiang et al. [
32] integrates the FADC module with a deformable feature adaptation mechanism, employing a dedicated detector head to enhance recognition accuracy for minute objects.
In summary, while existing studies have made notable advances in SAR ship detection by incorporating novel network modules, attention mechanisms, and refined loss functions, most efforts remain confined to localized architectural enhancements. These approaches typically focus on single performance metrics, lacking systematic design that addresses the synergistic relationship between feature characterisation capability, multi-scale contextual integration capability, and positioning accuracy. Particularly under typical SAR imaging conditions characterised by high noise, multi-scale targets, and complex sea–land backgrounds, existing architectures often fail to achieve lightweight designs while maintaining high detection precision. Consequently, models face severe challenges in real-world deployment scenarios, where balancing efficiency and effectiveness proves difficult.
To systematically address the aforementioned challenges, this study proposes CCAI-YOLO, a lightweight SAR ship detection framework based on a collaborative optimisation strategy. Rather than making localised refinements, this approach reworks YOLOv8’s backbone network, neck structure, detection head, and loss function as a whole, with each design choice directly targeting a core difficulty of SAR image detection.
3. Materials and Methods
The overall architecture of the proposed CCAI-YOLO model is depicted in
Figure 1. Building upon the YOLOv8 framework, this model employs a systematic design centred on architecture co-optimisation and supplemented by fine-tuned training strategies, aiming to collaboratively address the specific challenges of ship detection in SAR imagery. At the model architecture level, we implemented three targeted enhancements that constitute the core contributions of the algorithm: (1) within the backbone network, the original C2f module on the key path is replaced with C2f-ODConv, leveraging dynamic convolutions to improve adaptability to, and representational capacity for, the variable scales and orientations of ship targets; (2) within the neck network, we introduce the C2f-ACmix module, whose parallel design of convolutional and self-attention operations strengthens global context modelling and suppresses complex background noise; (3) we adopt the Adaptive Spatial Feature Fusion detection head (ASFF-Head), enabling adaptive weighted fusion of multi-scale features to improve positioning accuracy for ships of varying dimensions. These three enhancements form the structural pillars underpinning the model’s performance improvement. At the training optimisation level, we replace the bounding box regression loss function with Inner-SIoU. This serves as a crucial auxiliary strategy: by introducing a direction-aware loss, it guides the optimised architecture to learn precise box regression parameters more efficiently, further improving model performance. The following subsections elaborate on the design principles of each module; systematic comparison and ablation experiments then validate the respective contributions of the architectural co-optimisation and the loss function fine-tuning, and finally the broader impact of this work on SAR ship detection is discussed.
3.1. C2f-ODConv Module
While prior dynamic convolution methods like CondConv [
33] and DyConv [
34] have been widely explored, they often share the same filter coefficients across each kernel, constraining their representational flexibility. These methods typically rely on linearly combining multiple static convolutions, which substantially increases parameter counts. In contrast, the Omni-Dimensional Dynamic Convolution (ODConv) [
35] module adopted in this work evolves this concept by introducing a multi-dimensional attention mechanism and a parallel strategy. ODConv performs linear weighting along four dimensions: the number of kernels, spatial size, input channel count and output channel count, enabling improved adaptation to complex SAR backgrounds and diverse ship shapes. Compared to earlier techniques, ODConv uses only a single convolutional kernel, significantly lowering parameters while maintaining high accuracy. By capturing inter-dimensional correlations, it enhances feature extraction capability and ensures efficient convolutional computation. The structure of ODConv is shown in
Figure 2.
ODConv is expressed by Equation (1):

$$y = \left(\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \cdots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n\right) * x \tag{1}$$

where $\alpha_{wi}$ represents the attention parameter of the convolution kernel $W_i$, $\alpha_{fi}$ represents the attention parameter of the output channel dimension, $\alpha_{ci}$ represents the attention parameter of the input channel dimension, $\alpha_{si}$ represents the attention parameter of the convolution kernel spatial dimension, and $\odot$ represents the weighting operation applied to the convolutional filters along these different dimensions.
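To make the four-dimensional weighting in Equation (1) concrete, the following PyTorch sketch shows one way such a layer could be organised. It is a minimal illustration under our own assumptions (a single squeeze-and-excitation style attention head with sigmoid/softmax gating); the class name `ODConv2dSketch` and its internals are ours and do not reproduce the original ODConv implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2dSketch(nn.Module):
    """Minimal sketch of omni-dimensional dynamic convolution (Equation (1)).

    A squeeze-and-excitation style head produces four attentions per input:
    alpha_s (spatial, k*k), alpha_c (input channels), alpha_f (output channels)
    and alpha_w (per candidate kernel). The n candidate kernels are weighted
    along all four dimensions, summed into one kernel, and applied as an
    ordinary convolution.
    """
    def __init__(self, in_ch, out_ch, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.n = k, n_kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        self.to_spatial = nn.Linear(hidden, k * k)
        self.to_in = nn.Linear(hidden, in_ch)
        self.to_out = nn.Linear(hidden, out_ch)
        self.to_kernel = nn.Linear(hidden, n_kernels)

    def forward(self, x):
        b, c, h, w = x.shape
        z = self.fc(x)                                                   # (b, hidden)
        a_s = torch.sigmoid(self.to_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.to_in(z)).view(b, 1, 1, c, 1, 1)
        a_f = torch.sigmoid(self.to_out(z)).view(b, 1, -1, 1, 1, 1)
        a_w = torch.softmax(self.to_kernel(z), dim=1).view(b, self.n, 1, 1, 1, 1)
        # Equation (1): weight the candidate kernels along all four dimensions, then sum.
        kernels = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        kernels = kernels.reshape(-1, c, self.k, self.k)                 # fold batch into groups
        out = F.conv2d(x.reshape(1, -1, h, w), kernels, padding=self.k // 2, groups=b)
        return out.view(b, -1, h, w)
```

For instance, `ODConv2dSketch(64, 128)(torch.randn(2, 64, 40, 40))` yields a tensor of shape `(2, 128, 40, 40)`; only one set of candidate kernels is stored, while the effective kernel changes with each input.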
Despite its advantages in lightweight design and gradient flow, the C2f module in YOLOv8 remains limited by its local receptive field when applied to SAR ship detection. Complex maritime backgrounds and sea clutter interference often lead to insufficient feature representation, so the C2f module struggles to fully capture the highly variable characteristics of ship targets. To address this issue, we propose the C2f-ODConv module, which replaces a CBS component in the C2f bottleneck with an ODConv layer. This targeted design directly addresses the core feature constraints in SAR ship detection. The bottleneck layer of C2f adopts a ‘compression-processing-expansion’ architecture, serving as the core node for information filtering and enhancement within the feature flow. ODConv is introduced at the critical feature refinement stage, the second CBS component within the C2f bottleneck, replacing the traditional static convolution with input-dependent dynamic convolutional kernels. This enables the extraction of ship target features whose appearance in SAR imagery varies strongly with noise, scale, and background complexity. As illustrated in
Figure 3, our enhancement mechanism strategically integrates ODConv’s four-dimensional dynamic properties at this critical juncture. The spatial dynamic weighting mechanism adapts to local deformations and azimuth variations in ship targets caused by imaging geometry. Channel dynamic attention autonomously enhances feature channels associated with strong scatter points on ships while suppressing channels contaminated by sea clutter or coherent speckle noise. The combined effects of filter dynamic combination and kernel size dynamic selection mechanisms collectively enhance the ability of the model to characterise the diverse scales and complex structures of ship targets. Therefore, C2f-ODConv is not merely an operator upgrade, but a purposeful design that embeds a dynamic perception mechanism at critical feature bottlenecks. It fundamentally enhances the feature extraction stage’s ability to address specific challenges in SAR ship detection, improving detection accuracy while maintaining a lightweight module.
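As a rough illustration of where the dynamic convolution sits inside the bottleneck described above, the sketch below (reusing `ODConv2dSketch` from the previous listing) replaces the second CBS component with the dynamic layer while keeping the residual connection. Module names and channel handling are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class BottleneckODConv(nn.Module):
    """Sketch of a C2f bottleneck whose second conv is made dynamic.

    First conv: ordinary CBS (compression); second conv: ODConv-style dynamic
    layer (processing/expansion); the residual connection of the original
    C2f bottleneck is preserved.
    """
    def __init__(self, ch, shortcut=True):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(ch, ch, 1, bias=False),
                                 nn.BatchNorm2d(ch), nn.SiLU())
        self.cv2 = nn.Sequential(ODConv2dSketch(ch, ch, k=3),
                                 nn.BatchNorm2d(ch), nn.SiLU())
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y
```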
3.2. C2f-ACmix Module
Self-attention mechanisms play a crucial role in computer vision, particularly within the Transformer architecture [
36], where they are extensively employed to model global dependencies within images. However, they often incur high computational costs. In contrast, convolutional operations efficiently extract local features such as edges and textures but lack global receptive fields. To bridge this gap, we adopt the ACmix module [
37], a hybrid design that effectively integrates both paradigms. As shown in
Figure 4, ACmix begins by projecting the input features into Query (Q), Key (K) and Value (V) tensors via three 1 × 1 convolutions. Q, K, V tensors are then processed along two parallel paths: a self-attention path, which reorganizes the tensors and computes attention-based aggregation, and a convolutional path, where the same tensors are reshaped and processed through fixed convolutional kernels. The outputs of both paths are summed, scaled by a learnable parameter, and combined with the input via a residual connection.
As shown in
Figure 5, the ACmix module is seamlessly integrated into the C2f structure by placing it at the end of the Bottleneck, forming the C2f-ACmix module. This position is pivotal for achieving synergistic effects between local feature refinement and global semantic injection. Here, the local features extracted through prior convolutional layers possess foundational discriminative power, which the ACmix parallel dual-path mechanism strategically enhances. In addressing the specific challenges of SAR scenarios, this module enhances detection of local strong scatterers and edge structures on ships through its convolutional path. Concurrently, its self-attention path is dedicated to modelling long-range semantic dependencies, effectively distinguishing systematic echoes from man-made regular structures such as harbours from the discrete strong scatter distribution of ship targets. By computing global correlations, it suppresses false alarms triggered by structured backgrounds. Positioning ACmix at the end of the C2f bottleneck layer, rather than at the beginning of the architecture or within the backbone network, represents a deliberate design trade-off. This ensures features undergo effective local abstraction and dimensionality reduction via front-end convolutions before incurring higher computational costs for global relationship modelling. By endowing features with critical contextual understanding just before output to the next stage at relatively economical computational expense, the module enhances robustness in complex scenarios while maintaining overall network efficiency.
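The following simplified PyTorch sketch captures the ACmix data flow just described: shared 1 × 1 projections produce Q, K and V once, a self-attention path and a convolution path consume the same projections in parallel, and learnable scalars mix the two outputs before the residual connection. The global multi-head attention and the plain 3 × 3 convolution for the local path are simplifying assumptions of ours (the original ACmix aggregates the projections through shift-based kernels), and the channel count is assumed divisible by the number of heads.

```python
import torch
import torch.nn as nn

class ACmixSketch(nn.Module):
    """Simplified ACmix: shared 1x1 projections feed a self-attention path and a
    convolution path; the two outputs are mixed by learnable scalars."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)
        self.k = nn.Conv2d(ch, ch, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.heads = heads
        # convolution path: reuse the projected tensors and aggregate locally
        self.local = nn.Conv2d(3 * ch, ch, 3, padding=1)
        self.rate_att = nn.Parameter(torch.tensor(1.0))
        self.rate_conv = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # --- self-attention path (global, multi-head) ---
        d = c // self.heads
        def split(t):  # -> (b, heads, h*w, d)
            return t.view(b, self.heads, d, h * w).transpose(2, 3)
        attn = torch.softmax(split(q) @ split(k).transpose(2, 3) / d ** 0.5, dim=-1)
        att_out = (attn @ split(v)).transpose(2, 3).reshape(b, c, h, w)
        # --- convolution path (local aggregation of the same projections) ---
        conv_out = self.local(torch.cat([q, k, v], dim=1))
        # learnable mixing plus residual connection
        return x + self.rate_att * att_out + self.rate_conv * conv_out
```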
3.3. ASFF Detection Head
Ships in SAR imagery exhibit significant scale variations and often appear in dense clusters, leading to overlapping objects that complicate detection. The FPN-PAN structure in YOLOv8 also shows limited effectiveness in multi-scale feature fusion. To mitigate these limitations, we introduce the Adaptive Spatial Feature Fusion (ASFF) [
38] detection head to replace the original YOLOv8 detection head, as shown in
Figure 6.
The neck network of CCAI-YOLO outputs three multi-scale feature maps, denoted as level 1 to level 3. Taking ASFF-1 as an example, its output is formed by adaptively fusing these features using learnable weights $\alpha$, $\beta$ and $\gamma$, as defined in Equation (2):

$$\mathrm{ASFF\text{-}1} = \alpha \cdot X_{1\rightarrow1} + \beta \cdot X_{2\rightarrow1} + \gamma \cdot X_{3\rightarrow1} \tag{2}$$

Here, $X_1$, $X_2$ and $X_3$ denote the input feature maps for level 1, level 2 and level 3, respectively; $X_{2\rightarrow1}$ represents the level 2 feature map upsampled to the same dimensions as level 1, with $X_{3\rightarrow1}$ following the same principle. During the fusion process, the adjusted feature maps are multiplied by their corresponding adaptive weight parameters $\alpha$, $\beta$ and $\gamma$, respectively, and the weighted results are summed to generate the final output feature map of the ASFF-1 module.
Let $x_{ij}^{n\rightarrow l}$ denote the feature vector at position $(i, j)$ on the feature map adjusted from layer $n$ to layer $l$. The feature fusion at layer $l$ is shown in Equation (3):

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1\rightarrow l} + \beta_{ij}^{l} \cdot x_{ij}^{2\rightarrow l} + \gamma_{ij}^{l} \cdot x_{ij}^{3\rightarrow l} \tag{3}$$

Here, $y_{ij}^{l}$ denotes the feature vector at position $(i, j)$ within the output feature map $y^{l}$. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ represent the spatial importance weights applied to the feature maps transmitted from the three hierarchical levels to layer $l$; these weights are obtained through adaptive learning by the network. It should be noted that $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ are scalars shared across all channels and satisfy the constraint $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$.
The weights are computed by applying a Softmax function to learned control parameters, as shown in Equation (4):

$$\alpha_{ij}^{l} = \frac{e^{\lambda_{\alpha_{ij}}^{l}}}{e^{\lambda_{\alpha_{ij}}^{l}} + e^{\lambda_{\beta_{ij}}^{l}} + e^{\lambda_{\gamma_{ij}}^{l}}} \tag{4}$$

Similarly, $\beta_{ij}^{l}$ and $\gamma_{ij}^{l}$ can be derived. Here, $\lambda_{\alpha}^{l}$, $\lambda_{\beta}^{l}$ and $\lambda_{\gamma}^{l}$ serve as control parameters for the Softmax function. These weight scalar maps are computed using 1 × 1 convolutional layers, whose parameters are learned through standard backpropagation. In this manner, features from each layer can be adaptively aggregated across different scales. The resulting fused features follow YOLO’s transmission path to the detection head, ultimately serving for object classification and localisation.
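A compact sketch of the fusion in Equations (2)–(4) for one output level is given below; it assumes the three input maps have already been resized (e.g., with `F.interpolate`) and channel-matched to the target level, and uses 1 × 1 convolutions plus a Softmax so that the three weights sum to one at every spatial position. Class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class ASFFSketch(nn.Module):
    """Sketch of adaptive spatial feature fusion for one output level.
    Inputs are assumed to already share the target level's resolution and width."""
    def __init__(self, ch):
        super().__init__()
        # 1x1 convs produce one weight logit per level and per position (lambda maps)
        self.w1 = nn.Conv2d(ch, 1, 1)
        self.w2 = nn.Conv2d(ch, 1, 1)
        self.w3 = nn.Conv2d(ch, 1, 1)

    def forward(self, x1, x2, x3):
        # Softmax over the three logits -> alpha + beta + gamma = 1 at each (i, j)
        logits = torch.cat([self.w1(x1), self.w2(x2), self.w3(x3)], dim=1)
        weights = torch.softmax(logits, dim=1)                     # (b, 3, H, W)
        alpha, beta, gamma = weights[:, 0:1], weights[:, 1:2], weights[:, 2:3]
        return alpha * x1 + beta * x2 + gamma * x3                 # Equation (3)
```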
In CCAI-YOLO, the original YOLOv8 detection head is replaced with an ASFF head. The core improvement of this module lies in its adaptive fusion of feature maps from different scales within the neck network, achieved through learnable weight parameters prior to final predictions. Specifically, the ASFF mechanism dynamically assesses the reliability and importance of features at different levels across each spatial location. For SAR ship detection tasks, this implies the model autonomously determines greater reliance on deep, high-semantic features when detecting large ships near shore, while flexibly combining shallow, high-resolution detail features when capturing distant small targets. It generates clearer, more consistent multi-scale feature representations by suppressing responses from cluttered backgrounds or strong noise interference, whilst simultaneously enhancing semantic information highly relevant to ship targets.
3.4. Inner-SIoU Loss
Based on the Intersection over Union (IoU) metric, several advanced loss functions have been developed in recent years, including Generalized IoU (GIoU) [
39], Distance IoU (DIoU) [
40], Complete IoU (CIoU) [
41], Enhanced IoU (EIoU) [
42], SCYLLA-IoU (SIoU) [
43] and Wise IoU (WIoU) [
44]. GIoU addresses the zero-gradient issue in non-overlapping cases by incorporating the minimum enclosing box of both predicted and ground-truth bounding boxes. DIoU accelerates center alignment by penalizing the Euclidean distance between box centers. CIoU offers a more comprehensive optimization by considering overlap area, center distance, and aspect ratio. EIoU refines CIoU by reformulating the aspect ratio loss for more effective gradient propagation. SIoU introduces angle-based and shape-aware penalties to speed up convergence and improve accuracy. WIoU focuses on generalization by adapting to data quality. While YOLOv8 utilizes CIoU, it faces limitations in SAR ship detection due to the prevalence of low-quality training images. CIoU’s aspect ratio term can over-penalize such samples, impairing generalization. Moreover, when center misalignment is large, the distance term may over-emphasize center matching at the expense of overlap optimization.
To overcome the convergence and scale sensitivity issues in bounding box regression for SAR imagery, we adopt the Inner-SIoU loss [
45]. This loss function integrates the directional awareness mechanism of SIoU with the internal scaling concept of Inner-IoU. SIoU incorporates the vector angle between bounding boxes and redefines the penalty metric, while Inner-IoU introduces an internal scaling mechanism that calculates auxiliary IoU and corresponding penalty terms by simultaneously scaling both predicted and ground-truth boxes. The use of Inner-SIoU mitigates the performance degradation observed with CIoU loss function and speeds up convergence process, thereby boosting the adaptability of the model and generalization capabilities in dynamic detection environments.
Figure 7 illustrates the Inner-IoU diagram. $x_c^{gt}$ and $y_c^{gt}$ denote the centre point coordinates of the ground truth box, $x_c$ and $y_c$ denote the centre point coordinates of the predicted box, $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth box, $w$ and $h$ denote the width and height of the predicted box, and $ratio$ denotes the scaling factor for generating the auxiliary bounding box, typically ranging between [0.5, 1.5]. This scaling factor is the core hyperparameter of the method. If it is set too large, the auxiliary box deviates too little from the original box, so centroid and size errors are insufficiently constrained; if it is set too small, the auxiliary box becomes excessively constricted, which can destabilise training. Setting the scaling factor to 0.75 imposes a moderately tightening constraint on the predicted bounding box. Moreover, ship targets in SAR imagery typically exhibit relatively distinct scattering boundaries, although their appearance is affected by speckle noise; a factor of 0.75 strengthens the localisation constraint without compressing the bounding box so far that it loses the tolerance needed to cope with noise, thereby achieving a favourable balance between improved accuracy and maintained robustness. Consequently, this value was used consistently throughout all formal experiments.
Inner-SIoU combines Inner-IoU and SIoU, and is calculated as shown in Equation (5):

$$L_{\mathrm{Inner\text{-}SIoU}} = L_{\mathrm{SIoU}} + IoU - IoU^{inner} \tag{5}$$

where $L_{\mathrm{SIoU}}$ denotes the SIoU loss function, $IoU$ denotes the intersection-over-union ratio between the predicted box and the ground truth box, and the definition of $IoU^{inner}$ is shown in Equation (6):

$$IoU^{inner} = \frac{inter}{union} \tag{6}$$

The definitions of $inter$ and $union$ are shown in Equations (7) and (8):

$$inter = \left(\min\left(b_r^{gt}, b_r\right) - \max\left(b_l^{gt}, b_l\right)\right) \cdot \left(\min\left(b_b^{gt}, b_b\right) - \max\left(b_t^{gt}, b_t\right)\right) \tag{7}$$

$$union = w^{gt} \cdot h^{gt} \cdot (ratio)^2 + w \cdot h \cdot (ratio)^2 - inter \tag{8}$$

where $b_l^{gt}$, $b_r^{gt}$, $b_t^{gt}$, $b_b^{gt}$, $b_l$, $b_r$, $b_t$ and $b_b$ are the edges of the auxiliary (scaled) ground truth and predicted boxes, defined as follows:

$$b_l^{gt} = x_c^{gt} - \frac{ratio \cdot w^{gt}}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{ratio \cdot w^{gt}}{2}, \quad b_t^{gt} = y_c^{gt} - \frac{ratio \cdot h^{gt}}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{ratio \cdot h^{gt}}{2}$$

$$b_l = x_c - \frac{ratio \cdot w}{2}, \quad b_r = x_c + \frac{ratio \cdot w}{2}, \quad b_t = y_c - \frac{ratio \cdot h}{2}, \quad b_b = y_c + \frac{ratio \cdot h}{2}$$
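The auxiliary IoU of Equations (6)–(8) can be written compactly as below; boxes are assumed to be centre-format `(xc, yc, w, h)` tensors, and the `clamp(min=0)` guard for non-overlapping auxiliary boxes and the `eps` term are numerical safeguards we add for the sketch.

```python
import torch

def inner_iou(pred, gt, ratio=0.75, eps=1e-7):
    """Auxiliary IoU of Equations (6)-(8): both boxes are rescaled about their
    centres by `ratio` before the overlap is computed.
    `pred` and `gt` are (xc, yc, w, h) tensors of shape (..., 4)."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    # scaled (inner/auxiliary) half-widths and half-heights
    pw2, ph2, gw2, gh2 = pw * ratio / 2, ph * ratio / 2, gw * ratio / 2, gh * ratio / 2
    inter_w = (torch.min(px + pw2, gx + gw2) - torch.max(px - pw2, gx - gw2)).clamp(min=0)
    inter_h = (torch.min(py + ph2, gy + gh2) - torch.max(py - ph2, gy - gh2)).clamp(min=0)
    inter = inter_w * inter_h
    union = pw * ph * ratio ** 2 + gw * gh * ratio ** 2 - inter + eps
    return inter / union
```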
Inner-IoU refines the evaluation of bounding box overlap by concentrating on their central regions. It incorporates a scale factor to adjust auxiliary box sizes, allowing flexible adaptation to different detection tasks and target types, thereby significantly boosting the model’s adaptability and generalization. Meanwhile, SIoU improves convergence and localization accuracy by explicitly modeling directional relationships between boxes, as illustrated in
Figure 8.
The SIoU loss is calculated as follows:

$$L_{\mathrm{SIoU}} = 1 - IoU + \frac{\Delta + \Omega}{2}$$

where $\Delta$ denotes the distance cost, which is modulated by the angular cost $\Lambda$, and $\Omega$ denotes the shape cost. The angle in the angular cost refers to the angle formed by the line connecting the centre point of the ground truth box and the centre point of the predicted box. The angular cost is given by:

$$\Lambda = 1 - 2\sin^2\left(\arcsin\left(\frac{c_h}{\sigma}\right) - \frac{\pi}{4}\right)$$

where $\sigma = \sqrt{\left(b_{c_x}^{gt} - b_{c_x}\right)^2 + \left(b_{c_y}^{gt} - b_{c_y}\right)^2}$ denotes the distance between the centres of the ground truth and predicted bounding boxes, $c_h = \max\left(b_{c_y}^{gt}, b_{c_y}\right) - \min\left(b_{c_y}^{gt}, b_{c_y}\right)$ represents the height difference between the two centres, $b_{c_x}^{gt}$ and $b_{c_y}^{gt}$ denote the centre coordinates of the ground truth bounding box, and $b_{c_x}$ and $b_{c_y}$ denote the centre coordinates of the predicted bounding box.

$\Omega$ denotes the shape cost, defined by the following formula:

$$\Omega = \sum_{t \in \{w, h\}}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)}, \qquad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$$

where $w^{gt}$ and $h^{gt}$ denote the width and height of the ground truth bounding box, $w$ and $h$ denote the width and height of the predicted bounding box, and $\theta$ represents the degree of concern for the shape cost.
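Putting the pieces together, the sketch below combines the SIoU terms with the auxiliary inner IoU (reusing `inner_iou` from the previous listing) to form the Inner-SIoU loss of Equation (5). It follows the standard SIoU formulation (angular cost modulating a distance cost computed over the smallest enclosing box, plus the shape cost, here with θ = 4 by default); the exact constants and implementation details of the original code are not reproduced.

```python
import torch

def siou_loss(pred, gt, theta=4, eps=1e-7):
    """Sketch of the SIoU loss: 1 - IoU plus the averaged distance and shape
    costs, with the distance cost modulated by the angular cost Lambda."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    iou = inner_iou(pred, gt, ratio=1.0)                 # plain IoU (unscaled boxes)
    # angular cost Lambda
    sigma = torch.sqrt((gx - px) ** 2 + (gy - py) ** 2) + eps
    c_h = torch.abs(gy - py)
    lam = 1 - 2 * torch.sin(torch.arcsin((c_h / sigma).clamp(-1, 1)) - torch.pi / 4) ** 2
    # distance cost Delta over the smallest enclosing box (cw, ce)
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ce = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    gamma = 2 - lam
    delta = (1 - torch.exp(-gamma * ((gx - px) / (cw + eps)) ** 2)
             + 1 - torch.exp(-gamma * ((gy - py) / (ce + eps)) ** 2))
    # shape cost Omega
    omega_w = torch.abs(pw - gw) / torch.max(pw, gw)
    omega_h = torch.abs(ph - gh) / torch.max(ph, gh)
    omega = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta
    return 1 - iou + (delta + omega) / 2

def inner_siou_loss(pred, gt, ratio=0.75):
    """Inner-SIoU (Equation (5)): add the gap between plain IoU and the
    auxiliary inner IoU to the SIoU loss."""
    iou = inner_iou(pred, gt, ratio=1.0)
    return siou_loss(pred, gt) + iou - inner_iou(pred, gt, ratio=ratio)
```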
The Inner-SIoU loss function integrates the strengths of both Inner-IoU and SIoU. This design offers a key advantage in SAR ship detection: by improving the convergence dynamics of training, it allows the proposed model to learn precise localisation parameters more efficiently. For ship targets occupying a small proportion of the image, minor positional deviations in initial prediction boxes can lead to a substantial IoU loss. The angular constraints imposed by Inner-SIoU effectively reduce directional oscillations of prediction boxes during training, promoting faster and more stable alignment with the target. Consequently, the Inner-SIoU loss function enhances the efficiency and stability of model training and yields more reliable detection boxes in complex boundary scenarios.
5. Discussion
Based on the results from ablation experiments, comparative studies and detection visualizations, CCAI-YOLO exhibits strong adaptability for ship detection across diverse scenarios. This capability originates from the coordinated design of the four proposed enhancements, which collectively address characteristic challenges in SAR imagery—including inherent speckle noise, complex near-shore backgrounds with land and infrastructure and multi-scale ship distribution.
Although the model achieves promising results under experimental conditions, its performance remains to be verified in real-world spaceborne SAR environments. Further assessment is needed to determine its stability and robustness in operational settings. Future work will emphasize evaluating the model under authentic space mission conditions and developing advanced data augmentation techniques to improve both SAR image interpretability and generalization. Moreover, the model requires further lightweight optimization to minimize computational demands, facilitating deployment across a broader range of satellite platforms and promoting practical adoption in maritime remote sensing applications.
6. Conclusions
This paper presents CCAI-YOLO, an improved high-precision model for ship detection in SAR images, built upon YOLOv8n. By integrating the C2f-ODConv module into the backbone network, the model effectively handles the high noise and complex backgrounds of SAR imagery, enabling more adaptive feature extraction. The inclusion of the C2f-ACmix module in the neck enhances the capture of global contextual information, improving ship localisation and recognition accuracy while maintaining computational efficiency. The detection head employs an ASFF architecture to mitigate the inconsistencies arising from multi-scale feature fusion. Furthermore, the Inner-SIoU loss function is incorporated to enhance the model’s convergence and localisation capability. On the SSDD and SAR-Ship-Dataset datasets, CCAI-YOLO achieved F1 scores of 0.973 and 0.958, respectively, mAP50 values of 0.988 and 0.982, and mAP50-95 values of 0.749 and 0.716, demonstrating leading overall performance.
However, with ongoing advances in remote sensing sensors, processing high-resolution SAR imagery has become an inevitable trend. In real-world deployment, the real-time performance and efficiency of detection algorithms remain critical challenges. Achieving a lightweight model without sacrificing accuracy is still a core issue to be addressed. Furthermore, conventional axis-aligned bounding boxes struggle to handle overlapping objects effectively. The introduction of rotated bounding boxes could provide azimuth information of ship targets, enabling more precise separation from the background and improving localization accuracy. Techniques such as knowledge distillation and network pruning also offer promising avenues for model optimization. In future work, we will focus on further refining the model architecture to improve its generalization and robustness in complex real-world scenarios, keeping pace with evolving application requirements.