1. Introduction
Computer-aided maritime surveillance has gained attention since the evolution of computer vision (CV) methods for classifying, detecting, and tracking ships in different contexts. Using images and video helps companies identify illegal activities and monitor the behavior of ships on coastlines [
1]. The traditional Automatic Identification System (AIS) data used to control maritime traffic and provide increased security for navigation suffers from several problems, primarily caused by human manipulation and incorrect data sent by ships to the control station [
2]. Although AIS data is mandatory for some classes of ships, vessels engaged in illegal activities may deliberately switch off their AIS transponders. To minimize these impacts, some works have proposed fusing AIS data with satellite data [
3,
4,
5] and others with radar images [
6], such as Synthetic Aperture Radar (SAR). The association between two data types enables the identification of collaborative and non-collaborative ships.
The remote sensing technologies involving satellites make SAR data an important method for maritime monitoring [
7]. The advantages of SAR are its all-day and all-weather data-collecting capabilities. In [
8,
9], surveys on SAR ship detection are presented, ranging from traditional detection methods using sea–land segmentation and constant false alarm rate (CFAR)-based models to recent deep learning (DL) CV models. Although SAR is an important source of maritime imagery, many problems must be tackled. For instance, land–ocean segmentation may be necessary in some SAR image scenarios. In scenarios with limited SAR image data, one alternative is to fine-tune models pre-trained on standard datasets (e.g., COCO or ImageNet). However, this approach may lead to suboptimal detection performance due to domain differences. Moreover, ships on the ocean or near land with similar scattering patterns can produce complex backgrounds, making detection more challenging. Since the SAR Ship Dataset (SSD) [
10] and the SAR Ship Detection Dataset (SSDD) [
11] were released, numerous researchers have been working to enhance ship detection in complex environments with intricate backgrounds. These two datasets facilitated access to SAR images in the maritime context, making a significant contribution to research, particularly in the application of Deep Learning. In [
8], 177 papers were organized according to the past, present, and future development of models applied to ship detection in SAR images. Convolutional Neural Network (CNN)-based models have been utilized in SAR ship detection, particularly the You Only Look Once (YOLO) model, in nearly all its versions, ranging from YOLOv1 to YOLOv10. This work aims to analyze the use of attention modules in various YOLO versions, with the goal of achieving improved performance on benchmark datasets and addressing challenges related to SAR imagery, including small objects (e.g., small ships), complex backgrounds, and cluttered scenes (e.g., land structures that resemble ships at sea). While focused on ship detection, we also evaluate the generalizability of our approach to other object types and multi-class SAR detection tasks.
This study employs three recent YOLO architectures (v10, v11, and v12) as baselines to investigate whether strategically integrating attention blocks across distinct network layers can enhance detection performance in complex SAR environments. Building on the YOLO family’s ongoing evolution, which prioritizes dual objectives of reduced computational complexity and heightened accuracy, we leverage these advancements to optimize SAR ship detection.
To enhance small ship detection in SAR imagery, we integrate three attention mechanisms, Convolutional Block Attention Module (CBAM), Swin Transformer, and Bi-level Routing Attention (BRA), into YOLO versions 10, 11, and 12. These modules are strategically embedded at critical network stages: during downsampling, upsampling, or as a cross-stage feature fusion mechanism positioned between the downsampling and upsampling stages. This targeted placement aims to optimize multi-scale feature extraction, specifically addressing challenges posed by small vessels and complex maritime backgrounds in SAR data.
Our systematic integration of attention mechanisms across targeted YOLO architectures advances SAR ship detection in three key dimensions. First, we establish, to our knowledge, the first YOLOv11 and YOLOv12 frameworks explicitly optimized for SAR object detection, demonstrating superior performance on ship, aircraft, and multi-class datasets. Regarding SAR ship detection, this work addresses a critical gap in the maritime surveillance literature. Second, by embedding CBAM, Swin, and BRA modules at strategically identified layers, we significantly enhance small target detection capabilities, overcoming persistent challenges posed by vessel compactness in radar imagery. Finally, rigorous validation across two benchmark datasets (SAR Ship and SSDD) demonstrates that optimizing attention allocation in layers sensitive to object scale and background complexity (
Section 4.3.2) yields state-of-the-art performance. This approach not only resolves SAR-specific detection nuances but also provides a transferable foundation for scaling deep learning applications in remote sensing.
The remainder of this work is structured as follows.
Section 2 reviews the SAR-specific object detection literature, with emphasis on YOLO architectural developments and the application of attention mechanisms.
Section 3 outlines our experimental methodology, including the benchmark datasets (SAR Ship, SSDD, SADD, and MSAR), evaluation metrics, YOLO baseline architectures, and strategies for integrating attention modules.
Section 4 presents a comprehensive evaluation: environmental setup, performance analysis, and ablation studies on the primary SAR Ship dataset, followed by rigorous cross-dataset validation on SSDD to assess generalizability. To further stress-test our approach, we extend the evaluation to include non-ship SAR object detection on the SADD and MSAR datasets. Finally,
Section 5 synthesizes key findings and proposes future research directions for maritime surveillance. All supplementary material is available at
https://github.com/rlrocha90/Enhancing-YOLO-based-SAR-Ship-Detection-with-Attention-Mechanisms (accessed on 7 February 2025).
2. Related Works
Since the development of the original YOLO model [
12], researchers across various fields have continuously proposed new iterations and enhancements to its architecture [
13,
14]. The evolution from the initial YOLO reflects a consistent effort to achieve two primary goals: reducing model complexity to enable faster image processing and improving detection performance. Remarkably, YOLOv10 has managed to achieve both objectives simultaneously, showcasing advancements in efficiency without compromising accuracy [
15].
Detecting small objects is a particularly challenging yet crucial task, and recent detection models might help address this issue [
16]. This task becomes even more demanding when using images captured by stationary cameras, drones, satellites, or other vision systems, where small objects are often indistinguishable from background noise or are represented by only a few pixels [
17]. Traditional CNNs face inherent limitations in accurately identifying such small targets despite the substantial progress enabled by the DL paradigm. Overcoming these challenges requires innovative modifications to the architecture and the development of specialized techniques to enhance detection capabilities for small objects. YOLO’s evolution underscores the importance of addressing these limitations while maintaining its hallmark speed and efficiency, solidifying its relevance in real-world applications.
SAR images are crucial in maritime object detection because they capture data under various weather and lighting conditions [
18]. However, these images often include complex backgrounds due to the unique image acquisition process employed by SAR sensors. Additionally, detecting small objects, such as ships, relies heavily on precise chip cropping after image generation to focus on regions of interest. One large SAR image from the sensor is cropped into sub-images, and then ship chips (containing the ship image) are obtained and added to the dataset. This makes ship detection in SAR data (i.e., SAR ship detection) particularly challenging.
Numerous models have been proposed to address this issue, with many leveraging variations in the YOLO architecture as a baseline. Earlier versions, such as YOLOv3 [
19] and YOLOv4 [
20], laid the groundwork for applying YOLO-based methods to SAR ship detection. Subsequent advancements, including YOLOv5 [
7,
21], YOLOv7 [
22,
23], YOLOv8 [
24,
25,
26], and YOLOX [
27], introduced innovations to handle the intricacies of SAR images and the challenges posed by small-object detection better.
In addition to the YOLO family, other DL architectures have also been employed. These include Deep Convolutional Neural Networks (DCNNs) [
28,
29], the Detection Transformer (DETR) [
30], and the CenterNet network framework [
31], among others. Each approach aims to enhance the detection of small objects in SAR imagery, leveraging unique architectural features to improve accuracy and robustness.
In almost all the cited works, authors have leveraged the layered architecture of recent DL models to seamlessly integrate various enhancement modules that improve efficiency and performance, such as dilated [
27], deformable, and depthwise separable convolutions, and ghost modules [
32]. Among these enhancements, attention mechanisms are one of the most effective, having revolutionized how models focus on relevant regions of an image while ignoring less important details [
33]. Attention mechanisms have been particularly effective in addressing challenges like small-object detection [
34], adverse weather conditions [
35], and occlusion [
36] by dynamically weighting features based on their relevance. Since the introduction of attention mechanisms for image processing in [
37], followed by advancements like Vision Transformer (ViT) [
26] and DETR [
38], attention-based methods have become integral to modern computer vision tasks.
One notable development is the Bi-Level Routing Attention (BRA) mechanism [
39], which introduces a dynamic, query-aware sparse attention mechanism. This innovation has inspired numerous applications across diverse scenarios. For instance, BRA has been utilized in power fitting detection tasks [
40], medical imaging for chest X-ray analysis [
41], and small-object detection in remote sensing images, often in combination with YOLO-based architectures (e.g., BRA+YOLO) [
42,
43].
Furthermore, BRA has been applied to enhance object detection in adverse weather scenarios using BRA+YOLOv9 [
44], improve recognition accuracy, and address the challenges of detecting occluded, long-range, and diminutive targets, such as Unmanned Aerial Vehicles (UAVs) [
45]. Additionally, this mechanism has been applied in recognizing classroom learning behavior, as demonstrated in BRA-enhanced YOLOv8 models [
46].
Another attention module with numerous applications is the Convolutional Block Attention Module (CBAM) [
47]. This module processes the input feature map along two separate dimensions, channel and spatial, to capture image details. Among works employing the CBAM, in [
48], the authors used YOLOv8 with the CBAM and Receptive Field Attention Convolution (RFAConv) combined to address erroneous detection tasks in UAV scenarios, which are challenging due to the small size of the targets and complex background. The combined modules were used to replace the original C2f and Conv modules of YOLOv8. In [
49], the authors used YOLOv5 to recognize traffic signs, also using the CBAM. After a pre-processing step, the CBAM attention mechanism was introduced into the model, allowing it to focus on the shape of the traffic signs. The CBAM helped improve perception against complex backgrounds. In a more recent work [
50], the authors used a YOLOv11 model to detect pests and diseases on maize leaves, a task made challenging primarily by complex imaging conditions. The CBAM was introduced between the SPPF and C2PSA modules to improve the model's capability to identify and select essential features in both the channel and spatial dimensions.
Attention mechanisms have been successfully employed in various object detection models, including Swin-YOLOv5 [
34,
51] and ViT-based approaches [
26]. Swin [
52] is a new vision Transformer based on Shifted Windows. This scheme achieves greater efficiency in computing self-attention across non-overlapping local windows while also allowing for cross-window connections. In [
53], the authors aimed to address the low detection accuracy and inaccurate positioning of small objects in remote sensing images by utilizing the YOLOv5 model and a modified CSPDarknet53 structure, combined with the Swin Transformer. Swin was added to retain context information and extract more differentiated features. In another recent work [
54], the authors improved the Swin Transformer, developing it into a Neural Swin Transformer (NST), and aggregated it with the YOLOv11 model. They applied the proposed model to ship detection in SAR images and compared it with ship detection in optical images. The proposed approach, incorporating neural elements, eliminates the redundant information generated by the local window self-attention module in Swin.
However, these works typically focus on general object detection tasks rather than the unique complexities of SAR imagery, such as the presence of complex backgrounds and the low resolution of small ships. To contextualize these advancements,
Table 1 synthesizes key SAR ship detection methods. While YOLO-based models have dominated recent research (e.g., YOLOv3 and YOLOv8), only a few studies have integrated attention mechanisms. Notably, YOLOv8+SimAM [
25] achieves 97.72% mAP@0.5 on SSD, and BRA has been applied in remote sensing [
42]. However, these approaches lack optimization for SAR-specific challenges, such as small vessels (relative size < 0.2) or computational efficiency. Non-YOLO frameworks (e.g., DETR, CenterNet) exhibit high accuracy but incur higher computational costs, which limits their real-time deployment.
Building on this foundation, our work bridges the identified gaps through three key innovations: (i) we pioneer optimized YOLOv11 and YOLOv12 frameworks for SAR ship detection, addressing the absence of tailored adaptations for these state-of-the-art architectures, (ii) unlike ad hoc attention integration, we implement systematic layer replacement to target SAR-specific complexities like small objects and background clutter, and (iii) we validate robustness via cross-dataset evaluation (SAR Ship and SSDD), ensuring generalizability beyond single-benchmark tuning.
Crucially, while prior SAR-focused YOLO enhancements relied on convolutional modifications (e.g., dilated or deformable convolutions [
27,
32,
55]), our attention mechanism integration, replacing original layers with BRA, Swin, or a CBAM, selectively amplifies discriminative features for small vessels while avoiding computational bloat.
Furthermore, compared to non-YOLO frameworks (e.g., DETR [
30], CenterNet [
31]), our approach maintains the real-time efficiency of the YOLO family. As
Table 1 confirms, our CBAM-enhanced YOLOv12n achieves superior accuracy (98.6% and 98.0% mAP@0.5 on SSDD and SSD, respectively) with reduced computations (5.9 GFLOPS vs. baseline 6.5 GFLOPS), enabling high-precision maritime surveillance without sacrificing deployability.
3. Materials and Methods
This section provides an overview of the datasets, metrics, architectural designs, and methodologies used to enhance SAR ship detection performance. The first dataset used in this work is central to it: the SAR Ship Dataset (SSD), a highly challenging dataset characterized by diverse ship sizes, complex backgrounds, and small object features. The second, the SAR Ship Detection Dataset (SSDD), is smaller but presents similar challenges. Additionally, two other datasets are used to investigate the models' generalization power in different contexts: the SAR Aircraft Detection Dataset (SADD) and the Large-Scale Multi-Class SAR Image (MSAR) dataset.
For all four datasets, we compute the bounding box size proportion following the COCO convention to define small, medium, and large objects: small objects have an area smaller than 32 × 32 pixels, medium objects between 32 × 32 and 96 × 96 pixels, and large objects larger than 96 × 96 pixels. Since the datasets used here have different image sizes, we defined a way to compute the bounding box size using a conversion factor to carry out a fair comparison. We will refer to this as the COCO converted metric. The conversion factor relates each dataset's image size to the COCO reference resolution and is used to rescale the pixel thresholds for small, medium, and large bounding boxes.
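For illustration, the following minimal Python sketch applies COCO-style size categories rescaled to a given chip size. Since the exact conversion-factor formula is not reproduced here, the sketch assumes a simple linear scaling relative to a nominal COCO image side of 640 pixels; the function name `coco_converted_category` and the constant `REF_SIDE` are illustrative and not part of the original implementation.

```python
# Minimal sketch (assumptions noted above): classify a bounding box as
# small/medium/large using COCO thresholds rescaled to the dataset image size.

REF_SIDE = 640                         # assumed nominal COCO image side length (pixels)
SMALL_T, LARGE_T = 32 ** 2, 96 ** 2    # standard COCO area thresholds

def coco_converted_category(box_w, box_h, img_side):
    """Return 'small', 'medium', or 'large' for a box in an img_side x img_side chip."""
    factor = img_side / REF_SIDE       # assumed linear conversion factor
    small_t = SMALL_T * factor ** 2    # rescaled area thresholds
    large_t = LARGE_T * factor ** 2
    area = box_w * box_h
    if area < small_t:
        return "small"
    if area < large_t:
        return "medium"
    return "large"

# Example: a 10 x 8 pixel ship in a 256 x 256 SSD chip
print(coco_converted_category(10, 8, 256))   # -> 'small' under these assumptions
```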
This section also details the architecture of the three most recent YOLO models (YOLOv10, YOLOv11, and YOLOv12), which integrate state-of-the-art innovations, as well as the adopted attention mechanisms (BRA, Swin, and the CBAM). By addressing specific challenges in SAR imagery, such as the detection of small objects and complex contextual interference, the methodologies aim to optimize detection accuracy and efficiency. The architectural innovations and their implications for feature representation and multi-scale object detection are explored to highlight the advancements introduced by this work.
3.1. Datasets
3.1.1. SAR Ship Dataset—SSD
The SSD [
10] was constructed using SAR images from two satellite sources: 102 images from Gaofen-3 and 108 images from Sentinel-1. The authors processed these SAR images by cropping subregions containing ships to generate a comprehensive collection of labeled samples. Each subregion was resized or standardized to 256 × 256 pixels, resulting in grayscale images suitable for SAR-specific detection tasks.
The dataset comprises 39,729 labeled ship chips (i.e., patches), where each chip is a grayscale image labeled for ship detection. These images capture several real-world challenges: ships are often very small and multiscale, occupying under 20% of a 256 × 256 patch and varying widely due to different sensors and incidence-angle distortions; complex backgrounds such as ports, islands, and calm or stirred seas produce clutter and false alarms; and inherent SAR speckle noise further obscures ship features, worsening detection. Using the COCO converted metric, the objects in this dataset are small, medium, and large. Additionally, densely packed or overlapping vessels in coastal scenes increase the likelihood of missed detections, and the limited dataset diversity (in imaging modes, resolutions, and polarizations) creates imbalance and generalization issues for models trained on it. Representative examples from the dataset are presented in
Figure 1.
The authors of the SSD identify ship size and background complexity as the most significant factors affecting performance in SAR ship detection tasks. Backgrounds in SAR images often include elements that can resemble ships, introducing challenges in distinguishing ships from non-ship objects. Additionally, ship sizes in the dataset can be very small, sometimes represented by only a few pixels (<0.2 of the chip), further complicating detection. These characteristics make the SSD a valuable benchmark for evaluating models designed for SAR-based ship detection.
One of the strengths of this dataset is the variety of ship sizes it offers. This variety arises from the diverse shapes of ships, differences in resolution due to imaging conditions, and the mechanisms used by satellites like Gaofen-3 and Sentinel-1 to capture SAR images. This diversity presents an excellent opportunity to test models for their robustness in detecting ships at multiple scales or to evaluate the performance of sub-models specialized in small, medium, or large ship sizes.
Further insight into ship size distribution is also presented by analyzing the bounding box size relative to the image size. The relative size is computed as
$$r = \frac{\sqrt{w \times h}}{\sqrt{W \times H}},$$
where $w$ and $h$ are the bounding box dimensions, and $W$ and $H$ are the image dimensions. Their analysis reveals that the relative size of bounding boxes is predominantly less than 0.2, indicating that most ships in the dataset are relatively small compared to the image dimensions. This characteristic highlights the importance of models that can effectively handle small-object detection.
Another significant challenge posed by the SSD is its complex background. SAR images are generated based on the reflection of radar signals, which means that all elements within the scene contribute to the final scatter image, not just the ships. This results in a background that can include patterns resembling ships, such as ocean waves, islands, landmasses, and man-made structures like buildings. These elements can exhibit scattering characteristics similar to those of ships, making the dataset highly complex and demanding for ship detection tasks.
This complexity, while challenging, also offers opportunities for leveraging DCNNs in detection tasks. One of the primary difficulties lies in the small size of ships, which can become even harder to discern after subsampling operations in convolutional layers. Small objects risk being lost in deeper layers of the network due to resolution reduction. However, DCNNs excel at learning intricate patterns and can adapt to complex backgrounds by extracting meaningful features across multiple scales. With the right architectural design, such as incorporating feature pyramids or attention mechanisms, these networks might effectively address small-object detection and background complexity, demonstrating strong performance on SAR-based datasets.
3.1.2. SAR Ship Detection Dataset—SSDD
The SSDD [
56] dataset is the first SAR dataset released as a benchmark for researchers. The 1160 images that comprise the dataset come from three radars, RadarSat-2, TerraSAR-X, and Sentinel-1, featuring ships of various sizes and materials. The authors collected images under diverse sea conditions, both inshore and offshore.
The dataset comprises 1160 images and 2456 ships, with small ships occupying only a few pixels in each image. The images have different sizes, ranging from 214 × 214 to 668 × 668 pixels. The authors used a three-pixel threshold to determine whether or not to annotate an object as a ship. In [
11], a comprehensive review of the SSDD is presented, providing details about formats and annotation versions. Some challenges encountered in SSDD images include small ships with inconspicuous features, densely parallel ships, ships with large scale differences, severe speckle noise, a complex sea background, and various types of sea clutter. Using the COCO converted metric, we found
small objects,
medium objects, and
large objects.
Although the SSDD provides horizontal bounding boxes (HBBs), oriented bounding boxes (OBBs), and polygon-segmented (Pseg) objects, in this work, we only use the HBB annotation. This is so that the results can be fairly compared with those from the SSD, which only has HBB annotations. The original annotation format is PASCAL VOC, in which each bounding box is described by its corner coordinates $(x_{min}, y_{min}, x_{max}, y_{max})$, and a simple conversion was made to adapt them to the YOLO annotation format, in which each bounding box is described by its normalized $(x_{center}, y_{center}, width, height)$ parameters. Some examples of how ships are shown in SSDD images can be found in
Figure 2.
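For clarity, a minimal Python sketch of the PASCAL VOC to YOLO annotation conversion described above is shown below; the function name and the example values are illustrative only.

```python
def voc_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a PASCAL VOC box to the normalized YOLO (x_c, y_c, w, h) format."""
    x_c = (x_min + x_max) / 2.0 / img_w   # box center, normalized by image width
    y_c = (y_min + y_max) / 2.0 / img_h   # box center, normalized by image height
    w = (x_max - x_min) / img_w           # box width, normalized
    h = (y_max - y_min) / img_h           # box height, normalized
    return x_c, y_c, w, h

# Example: a ship spanning pixels (100, 120) to (140, 150) in a 500 x 350 SSDD image
print(voc_to_yolo(100, 120, 140, 150, 500, 350))
```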
3.1.3. SAR Aircraft Detection Dataset—SADD
In [
57], the authors collected and constructed a public SAR dataset with aircraft images. All images were collected from the German TerraSAR-X satellite, with HH polarization and image resolutions ranging from 0.5 to 3 m. The dataset is composed of 884 images with one or more aircraft targets per image. The background is relatively complex, including airport runways, airport aprons, and civil aviation airport areas. The original dataset has an image size of
pixels, but we use a dataset version that is already
.
The authors demonstrated that the sizes of objects in the dataset are diverse. The original dataset used in [
57] was divided into positive and negative targets, with the positive targets representing aircraft and the negative ones representing background components. Here, for the detection task, we only consider the positive targets (884 images and 7662 targets). Although the authors showed that small objects are in the majority in the original dataset with positive and negative targets, when using only the positive targets for the object detection task, more than 60% of the objects are medium-sized (ranging from
pixels to
pixels). Using the COCO metric, this dataset has
small,
medium, and
large objects. Some examples can be found in
Figure 3.
3.1.4. Large-Scale Multi-Class SAR Image—MSAR
In [
58], a new SAR dataset is presented, featuring multi-class objects and various scenarios, including airports, ports, inshore areas, islands, offshore areas, and urban areas. The image targets were collected from the HISEA-1 and the Gaofen-3 satellites. This dataset includes all four polarization modes, HH, HV, VH, and VV.
The dataset consists of four classes (aircraft, oil tanks, bridges, and ships) and comprises a total of 28,449 images. Some examples can be seen in
Figure 4. Here, we also used the COCO converted metric to determine the bounding box size proportion, with
small,
medium, and
large objects.
3.2. Performance Evaluation Metrics
In object detection, the performance of models is often evaluated using the mean average precision (mAP) metric. This metric summarizes how well a model detects objects across different categories and confidence thresholds [
59,
60]. It is a standard measure in benchmarking datasets and object detection models, enabling meaningful comparisons. For instance, the YOLO series uses mAP to track the performance progression of its models across versions. At the same time, the COCO dataset uses mAP extensively in its challenges to rank and compare models. Two commonly used mAP variations in modern benchmarks are mAP@0.5 and mAP@0.5:0.95. These metrics provide insights into the accuracy and robustness of a model’s predictions.
A foundational concept in computing mAP is the intersection-over-union (IoU), which measures the overlap between the predicted bounding box, $B_p$, and the ground truth bounding box, $B_{gt}$. The IoU is given by
$$\mathrm{IoU}(B_p, B_{gt}) = \frac{\mathrm{Area}(B_p \cap B_{gt})}{\mathrm{Area}(B_p \cup B_{gt})},$$
where $\mathrm{Area}(B_p \cap B_{gt})$ is the area of intersection between the predicted and ground truth boxes, and $\mathrm{Area}(B_p \cup B_{gt})$ is the area of their union.
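A minimal Python implementation of this IoU computation for axis-aligned boxes in (x_min, y_min, x_max, y_max) format is sketched below for reference.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)         # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```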
Precision (P) and recall (R) are key metrics used in object detection.
Precision measures the proportion of correctly predicted bounding boxes (true positives, TP) among all predicted boxes (TP + FP):
$$P = \frac{TP}{TP + FP}.$$
Recall measures the proportion of correctly predicted bounding boxes (TP) among all ground truth boxes (TP + FN):
$$R = \frac{TP}{TP + FN}.$$
For each class, a precision–recall (PR) curve is generated by varying the confidence threshold of predictions. The average precision (AP) for a class is the area under the PR curve,
$$AP = \int_{0}^{1} P(R)\, dR.$$
This integration is often approximated numerically using a discrete set of points.
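A minimal sketch of this numerical approximation is given below. Detections are sorted by confidence, a precision–recall curve is accumulated, and AP is taken as the area under the curve by rectangular integration; the TP/FP matching step and the interpolation scheme of the full COCO protocol are simplified away, and all names are illustrative.

```python
import numpy as np

def average_precision(confidences, is_tp, num_gt):
    """Approximate AP from per-detection confidences and TP/FP flags."""
    order = np.argsort(-np.asarray(confidences))        # sort by descending confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):                 # rectangular integration over recall
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Three detections, two of them true positives, against two ground-truth ships
print(average_precision([0.9, 0.8, 0.3], [1, 0, 1], num_gt=2))   # ≈ 0.833 in this simplified scheme
```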
The metric mAP@0.5 computes the mean of AP values across all classes, where a prediction is considered correct if its IoU with the ground truth exceeds 0.5:
$$\mathrm{mAP@0.5} = \frac{1}{C} \sum_{i=1}^{C} AP_i,$$
where C is the total number of object classes and $AP_i$ is computed at the IoU threshold of 0.5.
The metric mAP@0.5:0.95 provides a more comprehensive evaluation by averaging AP values over multiple IoU thresholds from 0.5 to 0.95, with a step size of 0.05:
$$\mathrm{mAP@0.5{:}0.95} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{mAP}_{IoU_t},$$
where T is the number of IoU thresholds; typically, $T = 10$ for thresholds $IoU_t \in \{0.50, 0.55, \ldots, 0.95\}$.
mAP@0.5 measures the mean of the average precision (AP) over all classes using a fixed IoU threshold of 0.5, meaning that any detection must overlap with the ground truth by at least 50% to be considered correct. Precision and recall are then calculated over varying confidence thresholds. In contrast, mAP@0.5:0.95 averages the AP across 10 IoU thresholds, ranging from 0.50 to 0.95 in increments of 0.05, providing a more comprehensive evaluation of both detection capability and localization accuracy. While mAP@0.5:0.95 demands tighter box alignment (performance at higher thresholds reflects more precise location matching), mAP@0.5 is lenient in terms of localization (i.e., minor misalignments between predicted and ground truth boxes are tolerated).
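The relationship between the two metrics can be summarized in a few lines of Python: given a table of per-class AP values at each IoU threshold (assumed here to be computed elsewhere), mAP@0.5 averages a single column while mAP@0.5:0.95 averages over all ten thresholds.

```python
import numpy as np

# Hypothetical AP table: rows = classes, columns = IoU thresholds 0.50 ... 0.95
iou_thresholds = np.linspace(0.50, 0.95, 10)
ap_table = np.random.rand(1, len(iou_thresholds))   # placeholder values for one class (ships)

map_50 = ap_table[:, 0].mean()      # mAP@0.5: mean AP over classes at IoU = 0.5
map_50_95 = ap_table.mean()         # mAP@0.5:0.95: mean over classes and all thresholds
print(f"mAP@0.5 = {map_50:.3f}, mAP@0.5:0.95 = {map_50_95:.3f}")
```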
3.3. Architectures of the YOLO Models
This work proposes improving the performance of YOLO models by adding and/or modifying the attention layers. Until the end of this work, the three most recent YOLO models are YOLOv10 [
15], YOLOv11 [
61], and YOLOv12 [
62]. YOLOv10 is an evolution of YOLOv8, proposed to optimize YOLO’s parameter utilization and efficiency by reducing computational redundancy in its architecture. Beyond all the optimization for efficiency and accuracy improvement at a low cost, YOLOv10 also presents NMS-free (non-maximum suppression-free) post-processing, which reduces the problem of redundant predictions. Version 11 introduces an improved backbone and neck architecture, enhancing feature extraction. In version 12, the authors proposed a new self-attention approach to process large receptive fields more efficiently, as well as a new aggregation module based on the previous ELAN [
63]. A performance comparison is presented in [
62], where YOLOv12 outperforms both YOLOv10 and YOLOv11 in both mAP@0.5 and mAP@0.5:0.95. The architectures of the YOLO models employed in this work are found in
Figure 5.
In this work, we specifically adopt the n (nano) versions of YOLOv10, YOLOv11, and YOLOv12 due to their favorable balance between accuracy and computational efficiency. These lightweight models are particularly well suited for SAR ship detection tasks, which require real-time processing on platforms with limited hardware resources, such as drones or embedded systems. This balance makes the YOLO n models an effective and deployable solution for maritime surveillance scenarios, especially where small-object detection and low-latency inference are critical.
In the next subsections, we present all attention mechanisms used to enhance the models’ performance.
3.3.1. Bi-Level Routing Attention—BRA
BRA was introduced as an attention module in the BiFormer model [
39]. BRA is a dynamic, query-aware, sparse attention mechanism where each query attends to a small subset of semantically relevant key–value pairs. This approach filters irrelevant key–value pairs and applies fine-grained token-to-token attention only in selected regions. The BRA process consists of three steps, which are described below.
- Step 1:
Region Partition and Input Projection
The two-dimensional (2D) input feature map $X \in \mathbb{R}^{H \times W \times C}$ is divided into $S \times S$ non-overlapping regions, forming $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$. Query, key, and value tensors (Q, K, V) are derived via linear projections:
$$Q = X^r W^q, \quad K = X^r W^k, \quad V = X^r W^v,$$
where $W^q, W^k, W^v \in \mathbb{R}^{C \times C}$ are projection weights for the query, key, and value, respectively.
- Step 2:
Region-to-region routing with directed graph
Region-level queries and keys ($Q^r, K^r \in \mathbb{R}^{S^2 \times C}$) are obtained by averaging Q and K per region. An adjacency matrix $A^r \in \mathbb{R}^{S^2 \times S^2}$ is computed as
$$A^r = Q^r (K^r)^{\top},$$
representing region-to-region affinity. The affinity graph is pruned by keeping only the top-k connections per region using a routing index matrix, $I^r \in \mathbb{N}^{S^2 \times k}$:
$$I^r = \mathrm{topkIndex}(A^r),$$
where the $i$-th row of $I^r$ contains the indices of the k most relevant regions for the $i$-th region.
- Step 3:
Token-to-token attention
Key and value tensors are gathered using $I^r$, and fine-grained attention is applied:
$$K^g = \mathrm{gather}(K, I^r), \quad V^g = \mathrm{gather}(V, I^r),$$
$$O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V),$$
where $K^g, V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$, and $\mathrm{LCE}(V)$ is a local context enhancement term.
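To make the three steps concrete, a simplified single-head PyTorch sketch of BRA is given below. It follows the region partition, routing, and gathering logic described above but omits the LCE term, multi-head attention, and other details of the official BiFormer implementation; all class and variable names are illustrative.

```python
# Simplified single-head BRA sketch (not the official BiFormer implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleBRA(nn.Module):
    def __init__(self, dim, num_regions=7, topk=4):
        super().__init__()
        self.s = num_regions              # S: regions per spatial axis
        self.topk = topk                  # k: routed regions kept per query region
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                 # x: (B, H, W, C), H and W divisible by S
        B, H, W, C = x.shape
        s, k = self.s, self.topk
        hr, wr = H // s, W // s           # tokens per region along each axis
        # Step 1: region partition and input projection
        xr = x.view(B, s, hr, s, wr, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, s * s, hr * wr, C)                 # (B, S^2, n, C)
        q, kk, v = self.qkv(xr).chunk(3, dim=-1)
        # Step 2: region-to-region routing
        qr, kr = q.mean(dim=2), kk.mean(dim=2)                # region-level descriptors
        affinity = qr @ kr.transpose(-1, -2)                  # (B, S^2, S^2) adjacency matrix
        idx = affinity.topk(k, dim=-1).indices                # (B, S^2, k) routing index matrix
        # Step 3: gather routed key/value tokens and apply token-to-token attention
        gather_idx = idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
        kg = torch.gather(kk.unsqueeze(1).expand(-1, s * s, -1, -1, -1), 2, gather_idx)
        vg = torch.gather(v.unsqueeze(1).expand(-1, s * s, -1, -1, -1), 2, gather_idx)
        kg = kg.reshape(B, s * s, k * hr * wr, C)
        vg = vg.reshape(B, s * s, k * hr * wr, C)
        attn = F.softmax((q @ kg.transpose(-1, -2)) * self.scale, dim=-1)
        out = attn @ vg                                       # (B, S^2, n, C)
        out = out.view(B, s, s, hr, wr, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return self.proj(out)

# Example: a 56 x 56 feature map with 64 channels, partitioned into 7 x 7 regions
feat = torch.randn(1, 56, 56, 64)
print(SimpleBRA(dim=64)(feat).shape)   # torch.Size([1, 56, 56, 64])
```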
BRA is effective for small targets and dense occlusion [
46]. For SAR ship images, where small objects cluster in specific areas, BRA improves detection accuracy [
43]. As shown in [
39], BRA preserves fine-grained details crucial for small objects. Complex backgrounds and noise in remote sensing images (see
Figure 1 for examples) are mitigated by BRA, thereby enhancing the precision of small-object detection [
42].
BRA’s placement in YOLO (e.g., during upsampling, before downsampling, or during feature fusion) impacts small-object detection. Across different configurations, BRA enhances detection by incorporating global attention information, which captures the overall structure and global semantic details, as well as local attention details [
42]. Additionally, it improves computer vision tasks by capturing structural and detailed features [
64].
3.3.2. Swin Transformer—Swin
The Swin Transformer [
52] introduces a hierarchical vision Transformer architecture using shifted windows. Its core innovation lies in computing self-attention within non-overlapping local windows that shift between layers, enabling efficient cross-window communication while maintaining linear computational complexity relative to image size. This design is particularly advantageous for SAR ship detection, where modeling multi-scale features and complex backgrounds is essential.
The Swin Transformer block builds upon the standard Transformer block [
33] but replaces the multi-head self-attention (MSA) module with two novel components: window-based MSA (W-MSA) and shifted-window MSA (SW-MSA).
As shown in
Figure 6, each block contains consecutive W-MSA and SW-MSA modules, with LayerNorm (LN) applied before each MSA and Multi-Layer Perceptron (MLP) operation.
The processing within a Swin block follows these steps:
$$\hat{z}^{l} = \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1},$$
$$z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l},$$
$$z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1},$$
where $\hat{z}^{l}$ and $z^{l}$ denote intermediate features after the (S)W-MSA and MLP operations at block $l$, respectively.
The key innovation involves alternating between regular and shifted window partitions:
Regular partitioning: At layer $l$, the feature map is divided into non-overlapping $M \times M$ windows (e.g., an $8 \times 8$ feature map with $M = 4$ yields four partitions).
Shifted partitioning: At layer $l+1$, the windows are offset by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ pixels (e.g., 2 pixels when $M = 4$).
This shifting strategy creates new window boundaries that overlap with adjacent windows from the previous layer, enabling cross-window connections while maintaining non-overlapping computation within each layer. As demonstrated in [
52], this approach efficiently models long-range dependencies without quadratic computational complexity.
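The regular and shifted window layouts can be illustrated with a short PyTorch sketch. The cyclic shift via torch.roll mirrors how the shifted configuration is commonly realized in practice; the attention masking applied inside shifted windows is omitted, and the function names are illustrative.

```python
# Sketch of regular and shifted window partitioning (attention masking omitted).
import torch

def window_partition(x, m):
    """Split a (B, H, W, C) map into (num_windows * B, m, m, C) windows."""
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m, m, c)

def shifted_windows(x, m):
    """Cyclically shift by (m // 2, m // 2) before partitioning (SW-MSA layout)."""
    shifted = torch.roll(x, shifts=(-(m // 2), -(m // 2)), dims=(1, 2))
    return window_partition(shifted, m)

feat = torch.randn(1, 8, 8, 96)                 # toy 8 x 8 feature map, C = 96
print(window_partition(feat, 4).shape)          # regular:  torch.Size([4, 4, 4, 96])
print(shifted_windows(feat, 4).shape)           # shifted:  torch.Size([4, 4, 4, 96])
```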
The architecture processes images through four stages, with progressive downsampling (resulting in a 2× reduction in resolution between stages). Each stage contains multiple Swin Transformer blocks that collectively extract local features within windows (W-MSA), model cross-window dependencies (SW-MSA), and maintain spatial hierarchy for multi-scale ship detection. This hierarchical design makes Swin particularly effective for SAR imagery, where ships appear at vastly different scales and require context integration across complex maritime backgrounds.
3.3.3. Convolutional Block Attention Module—CBAM
The Convolutional Block Attention Module (CBAM) [
47] addresses key challenges in SAR ship detection by sequentially refining features along channel and spatial dimensions. This lightweight module enhances discriminative feature learning while suppressing irrelevant background clutter, crucial for detecting small ships in complex maritime environments.
The CBAM processes input feature maps $F \in \mathbb{R}^{C \times H \times W}$ through two complementary attention submodules: (i) Channel Attention: identifies what features are meaningful. (ii) Spatial Attention: locates where informative regions exist. The sequential refinement is expressed as
$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$
where ⊗ denotes element-wise multiplication, and $M_c$ and $M_s$ are the channel and spatial attention maps, respectively. Channel attention values are broadcast spatially, while spatial attention weights are broadcast across channels. This is represented in Figure 7.
The channel attention submodule emphasizes semantically important feature channels. As shown in
Figure 8, it generates channel-wise descriptors through parallel pooling operations.
Some key characteristics of this module are (i) shared weights: the same MLP (single hidden layer with ReLU) processes both pooled features; (ii) sigmoid activation: produces channel weights $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$; and (iii) feature refinement: emphasizes channels containing ship signatures while suppressing noise.
The spatial attention submodule focuses on spatially significant regions, which is particularly important for small ships in SAR imagery. It combines channel-pooled features as
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right),$$
where $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ pool along the channel dimension (each producing a $1 \times H \times W$ map), the concatenation stacks the two maps along the channel dimension, and $f^{7 \times 7}$ is a convolutional filter integrating a spatial context of $7 \times 7$ size. This is visually represented in Figure 9.
As detailed in the original CBAM paper [
47], the module processes input features through two complementary attention mechanisms: channel attention (focusing on “what” is salient) and spatial attention (focusing on “where” salient features reside). Through experimentation, the authors determined that arranging these modules sequentially (i.e., channel attention followed by spatial attention) yielded superior performance compared to parallel arrangements. This sequential ordering also outperformed the spatial-first configuration. Consequently, this work adopts the CBAM implementation exactly as originally proposed, with channel attention preceding spatial attention in the processing chain.
4. Results and Discussions
This section presents the results obtained from training the original YOLO versions using the SAR image datasets presented in
Section 3.1. We proposed modifications to the YOLOv11 and YOLOv12 architectures, including the addition of new attention blocks or the replacement of the original attention layers. The attention mechanisms BRA, Swin, and the CBAM, described in
Section 3.3.1,
Section 3.3.2 and
Section 3.3.3, respectively, were used to evaluate their performance. All models presented here were trained from scratch on SAR images annotated with horizontal bounding boxes, without relying on any pre-trained weights.
4.1. Environmental Setup
All models tested and presented in this work were trained and analyzed on a machine running Ubuntu 22.04.2 LTS, equipped with an AMD Ryzen Threadripper 3960X 24-core processor at 3.8 GHz, 128 GB of RAM, 6 TB of storage, and an NVIDIA GeForce RTX 4090 GPU with 24 GB of graphics memory. For training, we used 500 epochs with an early-stopping patience of 100 epochs, a batch size of 16, and 2 workers. The input image size was set to the default 640 × 640. All datasets were split into training, validation, and test sets using a 7:2:1 ratio, respectively, even when sampled. The metrics used to compare the models are precision (P), recall (R), mAP@0.5, mAP@0.5-0.95, Floating-Point Operations (FLOPs), and Frames per Second (FPS). The FPS is reported only when the full dataset is used. Python (version 3.11.11) was used as the programming language for all tasks, including dataset manipulation, random image selection, dividing the dataset into training, validation, and testing subsets, training the models, and making predictions.
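For reproducibility, an illustrative Ultralytics training call mirroring this setup is shown below. The dataset YAML file name is a placeholder, and the attention-modified model configurations evaluated later are not shown here.

```python
# Illustrative training call (file names are placeholders, not the actual project files).
from ultralytics import YOLO

model = YOLO("yolo12n.yaml")         # baseline nano model, trained from scratch
model.train(
    data="ssd.yaml",                 # hypothetical dataset definition (7:2:1 split)
    epochs=500,
    patience=100,                    # early-stopping patience
    batch=16,
    workers=2,
    imgsz=640,
)
metrics = model.val(split="test")    # precision, recall, mAP@0.5, mAP@0.5:0.95
```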
4.2. Performance Evaluation on the SAR Ship Dataset
Table 2 compares the baseline YOLOv10n, v11n, and v12n, all of which were trained on the SSD. We first conducted training on the original models to understand how they adapt and perform in complex SAR images. The results were used to determine which models would undergo the addition or replacement of layers with attention mechanisms to enhance their performance. The table illustrates the evolution in performance from YOLOv10 to YOLOv12, demonstrating consistency with the original publications and evaluations in benchmark datasets such as COCO.
The YOLOv11n model has a lower number of layers (v11: 181, v10: 385, and v12: 465) compared to the other two models. Even so, it outperforms the YOLOv10n model with fewer than half as many layers. The YOLOv12n outperforms both YOLOv10n and YOLOv11n, despite having a greater number of layers. Even so, the v12 model has a computational complexity almost equal to that of v11, at 6.5 and 6.4 GFLOPS, respectively. This indicates that the number of layers alone does not determine performance when comparing models.
Table 3 summarizes the performance of various models on the SAR Ship dataset, as reported by multiple authors since the dataset’s release in 2019. The models listed in the table, ranging from earlier works to more recent methods up to 2022, were originally grouped in [
8]. Because some works do not provide model names, only the reference number is presented in the Model column. In addition, four very recent models have been incorporated into the comparison to provide a fair assessment alongside the latest technologies and baseline methods.
The table includes key evaluation metrics such as precision, recall, mAP@0.5, and mAP@0.5:0.95, as well as details on each study’s data splits (Train/Val/Test). Note that several earlier works did not report precision and recall, and many only provided mAP@0.5. The mAP@0.5:0.95 metric is only available for the models trained in our work and by [
25]. Due to these unbalanced metrics across studies, mAP@0.5 is used as the primary standard for comparison.
Table 3 indicates that model performance has increased continuously since the SSD's release. Some models were originally trained using 70% of the dataset for training, whereas others used 80%. Models marked with an asterisk (*) in
Table 3 did not specify their data division strategy, so we assume they followed the original division of the SSD.
Only one model using a split ratio of 7:2:1 achieved more than 97% mAP@0.5 [
27]. Another model achieved more than 97% mAP@0.5 but used a split ratio of 8:0:2 (i.e., employing 80% of the dataset for training), which does not allow a fair comparison. It is interesting to note that for a split ratio of 7:2:1, all three original YOLO models surpassed the state of the art, with YOLOv12n being the best among them.
4.3. Ablation Study
This section presents in-depth ablation studies of the YOLO models, analyzing various configurations, their performance, and the corresponding trade-offs. The ablation studies provide insights into selecting the optimal model configuration and explore alternative setups that are effective for small-object detection.
The ablation tests focus on YOLOv11n and YOLOv12n because they represent the most accurate and computationally efficient baselines, align with the goal of pioneering optimizations for the latest YOLO architectures in SAR detection, and showed the most promise for improvement in initial experiments. YOLOv10n's lower performance and higher computational cost made it a less compelling candidate for the intensive ablation study phase.
4.3.1. Ablation Study on Attention Layer Positioning in YOLO Architecture—Adding Layers
In this study, we systematically investigate different insertion points for BRA, Swin, and the CBAM within the YOLOv11n and YOLOv12n architectures. We first conducted a round of training only for YOLOv12n, which exhibited the best performance among the original architectures. For this task, we employed a subset of the original SAR Ship dataset, comprising 2856 images for training, 814 images for validation, and 410 images for testing. Using this reduced dataset, it was possible to accelerate the experimentation process while still providing sufficient data to evaluate the impact of different attention module positioning strategies.
We added attention layers before selected Conv layers in the backbone and before and after the upsampling modules in the neck. We followed the rationale presented in [
46] and conducted the same exploration, aiming to determine the optimal position for new attention modules. The best layer and position cannot be defined in advance. Although attention at mid-stage feature maps helps capture the broader context for object localization, the size of objects in the image can necessitate the addition of a new attention module in different layers.
For YOLOv11n, as shown in
Figure 10, attention modules were added before the following layers: (i) BRA modules were added before layers 5, 7, 9, 12, and 15, and (ii) the CBAM/Swin modules were added before layers 5, 7, and 11. For YOLOv12n, attention modules were added as follows: (i) BRA modules were added before layers 5, 7, 9, 10, and 13, and (ii) the CBAM/Swin modules were added before layers 5, 7, and 9.
We also evaluated attention placed in the fusion path from the backbone to the neck. As discussed in [
72], feature fusion in the neck part can improve object detection performance across multiple scales. Due to the presence of redundant information in the fusion of feature maps, the authors suggest incorporating attention modules into the fusion process.
In both models, as shown in
Figure 10, the fusion was performed from layers 4 and 6 to concat layers in the neck at layers 12 and 15 in v11, and 10 and 13 in v12. Every insertion was performed individually, meaning that only one attention module was added at a time, and the resultant model was trained. When finished, the previous attention module was removed and inserted into another layer.
Table 4 presents only the best results for each attention module inserted into YOLOv12n (all results can be found on the GitHub page; see the URL in
Section 1).
Table 4 indicates that adding attention modules does not increase the performance of the model. The CBAM achieved the highest mAP@0.5. However, it retains the same mAP@0.5 value as the baseline and performs worse in other metrics.
Additionally, we conducted the same performance test using YOLOv11n.
Table 5 presents only the best results for each attention module inserted into YOLOv11n (all results can be found on the GitHub page; see the URL in
Section 1).
For YOLOv11n, the results were superior to those of the baseline model. All modified versions using BRA and the CBAM achieved the same value for mAP@0.5. However, with BRA added, the model had a better mAP@0.5-0.95 value, achieving . The model using Swin in fusion mode achieved the highest performance among all tested models in both versions, v11 and v12.
As SAR images represent complex environments, the sampled dataset cannot replicate the same level of complexity as the full dataset. Therefore, we next tested all new configurations using the entire dataset. To train these models on the full dataset, we chose only the models that achieved the best performance on the sampled dataset. Although more data tends to guide the model to a better result, the full dataset also introduces more complexity; as a result, all models performed the same as or worse than the configurations presented in
Table 4 and
Table 5.
For YOLOv12n, using the full SSD, we did not find any combination of all added layers that surpassed the original model’s performance. The results ranged from
to
for mAP@0.5. For YOLOv11n, using BRA in the fusion, we obtained
, which is an improvement over the original YOLOv11n’s
. Even with this, the original YOLOv12n still has the best performance at
. Moreover, the models' inference speed is also affected when more layers are added: in both models, the FPS decreased. For YOLOv12, the addition of BRA reduces the frame rate from 112 to 96 frames per second, the worst-case scenario. For YOLOv11, the frame rate decreases from 164 to 116, with this largest drop likewise occurring when BRA is added. These results can be found in
Table 6 and
Table 7.
The integration of multiple and distinct attention modules into a model, particularly through the sequential addition of layers, can result in a phenomenon known as the ’non-focusing effect’ [
73]. This effect occurs when different attention mechanisms do not complement each other, potentially degrading rather than improving model performance. The results demonstrate that simply adding modules such as BRA, Swin, or the CBAM to the YOLOv11n and YOLOv12n models did not yield performance gains and, in some cases, even caused a reduction.
Adding layers often fails to improve performance while inflating computational costs. Therefore, next, we conducted tests with both models, replacing some of the original attention modules with the three analyzed ones. We hypothesize that replacing layers can enhance performance by refining critical feature extraction pathways without sacrificing efficiency while avoiding the “no focus” effect caused by conflicting attention hierarchies. This approach may optimize architectures for domain-specific challenges, such as detecting small SAR ships.
4.3.2. Ablation Study on Attention Layer Positioning in YOLO Architecture—Replacing Layers
Although YOLOv11n improved its performance with the addition of one BRA layer, it still performed worse than the original YOLOv12n. We thus decided to replace some layers instead of just adding them. In this section, we conducted all training with the full dataset.
For YOLOv11, the attention modules replaced the following layers: (i) BRA modules replaced layers 4, 6, 8, and 13, and (ii) the CBAM/Swin modules replaced layers 4, 6, and 10. For YOLOv12, the BRA, CBAM, and Swin attention modules replaced layers 4, 6, and 8.
Figure 11 shows all the combinations for both models. Only one layer was replaced at a time: the original module was replaced with a new attention module (BRA, the CBAM, or Swin), and the resulting model was trained and tested. The attention module was then removed, the original module restored, and the replacement applied at another position. This process was repeated for all attention modules in all positions.
The same explanation regarding the attention module insertion, given in
Section 4.3.1, can be applied here. The difference in layer numbering arises because existing layers are replaced rather than new ones added.
Table 8 below shows the results in detail. We only show the best result for each replaced attention module (all results can be found on the GitHub page; see the URL in
Section 1).
Replacing the original attention layers yielded three significant outcomes. First, the CBAM-enhanced YOLOv12n achieved the highest performance for SAR ship detection, with mAP@0.5 increasing from 0.978 (baseline YOLOv12n) to 0.980. For YOLOv11n, mAP@0.5 improved from 0.976 to 0.978. Second, layer 4 (specifically the second C3k2 module) was identified as the optimal replacement location across attention mechanisms. While all attention modules (BRA, the CBAM, and Swin) improved performance at this layer, the CBAM delivered the highest gains. With a default input size of 640 × 640, this layer outputs an 80 × 80 feature map that feeds into the first detection head via concatenation with an UpSample module at layer 14 (forming layer 16), which is crucial for detecting small objects. Third, the CBAM proved most effective for SAR-specific challenges, outperforming both baseline models and other attention mechanisms. Replacing layer 4 with the CBAM achieved the peak mAP@0.5 (0.980 for v12n). Notably, the CBAM at layer 6 in YOLOv12n also yielded strong results (mAP@0.5: 0.979), although slightly lower than the results obtained with layer 4 replacement. Additionally, the CBAM-enhanced YOLOv12n reduced computational costs to 5.9 GFLOPS (compared to the baseline of 6.5 GFLOPS) while maintaining real-time FPS.
The FPS results in
Table 8 confirm a pattern consistent with earlier ablation studies: replacing the original attention layer with BRA substantially reduces inference speed, decreasing YOLOv12n’s FPS from 112 (baseline) to 93 (−17%). In contrast, the CBAM and Swin maintain near-baseline efficiency, with the CBAM achieving 110 FPS for YOLOv12n (compared to the baseline of 112 FPS) and 161 FPS for YOLOv11n (compared to the baseline of 164 FPS). Meanwhile, Swin matches the baseline FPS exactly at 112 FPS for YOLOv12n and 164 FPS for YOLOv11n when deployed at layer 4. This efficiency retention underscores BRA’s higher computational complexity, while the CBAM and Swin offer lightweight alternatives suitable for real-time SAR surveillance.
Figure 12 visually validates the performance gains from attention-layer replacement by comparing ground truth annotations with bounding box predictions from the top-optimized models: YOLOv11n/CBAM/L4 and YOLOv12n/CBAM/L4. The samples explicitly demonstrate YOLOv12n-CBAM’s superior detection capability in complex SAR scenarios, notably its precision in identifying small ships amidst background clutter, which aligns with its quantitative performance peak in
Table 8. This visualization underscores how targeted attention replacement (particularly CBAM at layer 4) enhances feature discrimination critical for maritime surveillance. The figure provides a critical visual demonstration of model performance differences: for a given input, detection results vary across architectures. While baseline YOLOv11n and YOLOv12n sometimes match the proposed model in detecting certain ships, only the optimized YOLOv12n with the CBAM at Layer 4 consistently detects all vessels aligned with ground truth annotations across all examples shown. This underscores the necessity of holistic evaluation using quantitative metrics, such as mAP@0.5, which objectively averages precision across the entire dataset to validate performance gains.
4.3.3. An Ablation Study on the Addition of a Small-Object Detection Head
To address the challenges posed by small-object detection, several studies have incorporated additional detection heads into YOLO architectures, targeting objects such as fish, vehicles, and fruits in diverse contexts [
74,
75,
76]. Building on this, ref. [
46] introduced a fourth detection head, the Tiny Object Detection Layer (TODL), to improve small face detection in classrooms. These efforts primarily extend YOLOv5 and YOLOv8, which natively include three detection heads; our work, in contrast, pioneers TODL integration in YOLOv11n and YOLOv12n.
Specifically, we adapted TODL to address the SSD dataset’s critical challenges: ships averaging < 0.2 relative size amidst complex backgrounds like islands and sea clutter. As shown in
Figure 13, our implementation incorporates (i) an upsampling module scaling feature maps to a 160 × 160 resolution, (ii) a dedicated TODL detection head for high-resolution feature processing, and (iii) a convolutional refinement module restoring the 80 × 80 scale. This configuration may theoretically enhance small-ship discrimination by preserving fine-grained details before the first detection head. However, the quantitative results revealed limited practical gains.
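A conceptual PyTorch sketch of such a high-resolution branch is given below. It is not the exact TODL implementation used here: channel widths, activation choices, and module names are assumptions, and the detection heads themselves are omitted.

```python
# Conceptual sketch of an extra high-resolution branch for tiny objects.
import torch
import torch.nn as nn

class TinyObjectBranch(nn.Module):
    def __init__(self, c_in=256, c_mid=128):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")                  # 80x80 -> 160x160
        self.refine = nn.Sequential(nn.Conv2d(c_in, c_mid, 3, padding=1), nn.SiLU())
        self.down = nn.Sequential(nn.Conv2d(c_mid, c_in, 3, 2, 1), nn.SiLU())  # back to 80x80

    def forward(self, p3):                      # p3: (B, c_in, 80, 80) features
        hi = self.refine(self.up(p3))           # 160x160 features feeding the TODL head
        lo = self.down(hi)                      # 80x80 features feeding the original head
        return hi, lo

hi, lo = TinyObjectBranch()(torch.randn(1, 256, 80, 80))
print(hi.shape, lo.shape)   # torch.Size([1, 128, 160, 160]) torch.Size([1, 256, 80, 80])
```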
The results in
Table 9 show that integrating the CBAM at layer 4 and a new TODL detection head operating on 160 × 160 feature maps improved performance only for YOLOv11. This aligns with the pattern observed when adding new attention modules to the baseline models. While YOLOv12 showed no improvement in mAP@0.5, it achieved gains in mAP@0.5-0.95. We hypothesize that YOLOv12's limited gains stem from information degradation in earlier convolutional layers. Repeated convolution (subsampling) operations can suppress tiny objects (sub-pixel ship features). If targets initially occupy too few pixels, subsequent upsampling cannot recover the information lost in preceding layers. Thus, while TODL benefits other domains, its efficacy for SAR ships remains constrained by fundamental resolution limits.
4.3.4. Generalization Experiment: Performance Evaluation on SSDD
To rigorously validate model generalizability for real-world maritime surveillance, we conducted cross-dataset testing using the SSDD. Characterized by 1160 images from RadarSat-2, TerraSAR-X, and Sentinel-1 sensors with horizontal bounding box annotations, the SSDD presents distinct challenges, including higher ship density (2.1 ships/image vs. SSD’s 1.3) and smaller vessel sizes [
11].
We performed inference on both original models (YOLOv11n and YOLOv12n) and their CBAM-enhanced variants (with the original attention layer at position 4 replaced by the CBAM), using models trained exclusively on the SSD dataset to predict the SSDD. This cross-dataset evaluation (conducted without retraining) assesses generalization capability and dataset-specific bias. The results shown in
Table 10 reveal that while all models exhibited suboptimal performance (mAP@0.5: 0.678–0.714), replacing the attention layer at position 4 with the CBAM consistently improved robustness. For YOLOv11n, mAP@0.5 increased from 0.693 to 0.714; for YOLOv12n, it rose from 0.678 to 0.693. This demonstrates the CBAM’s efficacy in enhancing feature discrimination for SAR-specific challenges, such as small ships and complex backgrounds, even in unseen data domains.
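An illustrative sketch of this zero-shot cross-dataset check with the Ultralytics API is shown below; the weight and dataset file paths are placeholders.

```python
# Sketch of the cross-dataset check: weights trained on the SSD are evaluated
# on the SSDD without retraining. File paths are placeholders.
from ultralytics import YOLO

model = YOLO("runs/ssd/yolo12n_cbam_l4/weights/best.pt")  # hypothetical SSD-trained weights
metrics = model.val(data="ssdd.yaml", split="test")       # zero-shot evaluation on the SSDD
print(metrics.box.map50, metrics.box.map)                 # mAP@0.5 and mAP@0.5:0.95
```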
Next, we retrained the baseline YOLO models and the optimal CBAM-enhanced variants from scratch to check for consistent performance patterns. The results in
Table 11 demonstrate that CBAM replacement at layer 4 universally improved robustness across datasets: YOLOv11n-CBAM reached 0.978 and 0.986 mAP@0.5 (vs. baseline 0.976 and 0.984) with improvements of 0.007 and 0.008 in mAP@0.5-0.95 (vs. baseline 0.703 and 0.715) on the SSD and SSDD, respectively. On the other hand, YOLOv12n-CBAM achieved 0.980 and 0.979 mAP@0.5 (vs. baseline 0.978 and 0.975) with gains of 0.012 and 0.011 in mAP@0.5-0.95 (vs. baseline 0.704 and 0.700) on the SSD and SSDD, respectively.
Crucially, no single model dominated both contexts: YOLOv12n-CBAM performed best on the SSD (0.980 mAP@0.5), whereas YOLOv11n-CBAM excelled on the SSDD (0.986 mAP@0.5), indicating that dataset-specific characteristics, such as target density and clutter complexity, favor different architectures. These findings confirm that while targeted attention-layer optimization enhances cross-dataset adaptability, optimal model selection requires consideration of operational constraints and context-specific fine-tuning to address the inherent peculiarities of each dataset.
The decrease in FPS on the SSDD stems from the inherent complexities of the dataset, which amplify computational demands. The SSDD’s higher ship density (2.1 ships/image vs. the SSD’s 1.3) increases the computational load during detection, as more objects must be processed per frame. Additionally, the SSDD’s variable image dimensions (214 × 214 to 668 × 668 pixels), versus the SSD’s uniform 256 × 256 chips, introduce resizing overhead during preprocessing. Non-maximum suppression (NMS) operations further strain post-processing on denser detections. Potential channel mismatches (SSD: one channel; SSDD: three channels) may also contribute to inefficiencies, as channel conversion could introduce latency.
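As an illustration of these two preprocessing overheads (and not the detector’s internal pipeline), the sketch below resizes a variable-size, single-channel chip and expands it to three channels using OpenCV; the target input size is an assumed placeholder.

```python
# Illustrative preprocessing sketch: variable-size SSDD images are resized to a
# fixed input size, and single-channel chips are expanded to three channels.
import numpy as np
import cv2

def prepare(img: np.ndarray, size: int = 640) -> np.ndarray:
    if img.ndim == 2:                                # 1-channel SAR chip
        img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)  # channel expansion
    return cv2.resize(img, (size, size))             # dimension normalization

chip = np.random.randint(0, 256, (487, 331), dtype=np.uint8)  # synthetic chip
print(prepare(chip).shape)  # (640, 640, 3)
```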
Additionally, we fine-tuned these models using the best YOLOv11 and YOLOv12 variants with the CBAM at layer 4, pre-trained on the SSD, as the base models. The objective was to analyze the cross-dataset adaptability of a model pre-trained on the SSD (base) dataset when fine-tuned on the SSDD dataset. In
Table 1, the DETR model achieved 0.991 mAP@0.5 using a dataset split ratio of 8:2 [
30]. Note that DETR (14.34 million parameters, 44.4 GFLOPS) is considerably larger than the nano-sized YOLO models used in this work, which have 2.6 million parameters (YOLOv11n) and 2.5 million parameters (YOLOv12n). Therefore, to enable a fairer comparison, we also tested an 8:2 split on the SSDD using the medium-sized YOLO variants (“m”): YOLOv11m (20.1 million parameters, 68.0 GFLOPS) and YOLOv12m (19.6 million parameters, 59.8 GFLOPS). Given these scenarios, we organize the results in
Table 12 as follows: for both YOLOv11 and YOLOv12, we report the original baseline, the variant with layer 4 replaced by the CBAM trained from scratch, and the variant fine-tuned on the SSDD, all using a 7:2:1 dataset split ratio. The results are then repeated for the same configurations, but with an 8:2 dataset split and model size “m”.
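The fine-tuning step can be summarized by the sketch below, which loads CBAM-enhanced weights pre-trained on the SSD and continues training on the SSDD via the Ultralytics API; the file names, epoch count, and image size are illustrative assumptions rather than the exact settings used here.

```python
# Hypothetical fine-tuning sketch: start from a CBAM-enhanced checkpoint
# pre-trained on the SSD and continue training on the SSDD.
from ultralytics import YOLO

base = YOLO("yolo12n_ssd_cbam_l4.pt")                # SSD-pre-trained base model
base.train(data="ssdd.yaml", epochs=100, imgsz=640)  # continue training on the SSDD
metrics = base.val(data="ssdd.yaml")                 # evaluate the fine-tuned model
print(f"fine-tuned mAP@0.5: {metrics.box.map50:.3f}")
```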
Table 12 shows that including the CBAM at layer 4 increased the performance of both YOLOv11 and YOLOv12 at both the “n” and “m” sizes. Training from scratch with different split ratios and model sizes preserved this consistent improvement when using the CBAM at layer 4.
Fine-tuning CBAM-enhanced models (pre-trained on SSD) for the SSDD yielded varied outcomes across architectures and sizes. As shown in
Table 12, while performance remained competitive (mAP@0.5: 0.983–0.988), trends differed by model. YOLOv12n gained +0.009 mAP@0.5 when fine-tuned (0.988 vs. 0.979 from scratch at the 7:2:1 split), outperforming its baseline. Conversely, YOLOv11n declined by 0.003 mAP@0.5 (0.983 vs. 0.986 from scratch) under the same split. When compared to non-CBAM baselines, most fine-tuned models improved (e.g., YOLOv12n: +0.013 over baseline), though none surpassed DETR’s 0.991 mAP@0.5 (
Table 1). The highest SSDD performance was 0.989 mAP@0.5 by YOLOv11m/CBAM/L4 trained from scratch on an 8:2 split.
As the last analysis, we tuned the hyperparameters of the best model, YOLOv11m/CBAM/L4 trained from scratch with an 8:2 split, which attained an mAP@0.5 of 0.989. For this, we used YOLO’s tuning mode, with the goal of exceeding the DETR model’s mAP. We defined 300 iterations with 30 epochs each, and all hyperparameters were tuned. As a result, the tuned YOLOv11m/CBAM/L4 (from scratch + 8:2) achieved an mAP@0.5 of 0.992 and an mAP@0.5-0.95 of 0.762 (+0.001 in mAP@0.5 above the DETR results), making it the best model across both the “n” and “m” sizes of YOLOv11.
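The sketch below shows what this tuning step looks like with the Ultralytics tuning interface, using the iteration and epoch counts reported above; the weight and dataset file names are hypothetical placeholders for the actual YOLOv11m/CBAM/L4 artifacts.

```python
# Hyperparameter tuning sketch using the Ultralytics tuning mode:
# 300 iterations of 30 epochs each, as reported above.
from ultralytics import YOLO

model = YOLO("yolo11m_cbam_l4_ssdd.pt")  # placeholder checkpoint name
model.tune(data="ssdd_82.yaml", epochs=30, iterations=300,
           plots=False, save=False, val=False)
```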
4.3.5. Generalization Experiment: Performance Evaluation on Non-Ship SAR Datasets
Although we have shown that replacing layer 4 of YOLOv11 with the CBAM yields significant improvements on the SSD and SSDD datasets, these two datasets share the same context and closely related imagery. Therefore, we conducted a simpler but informative analysis on the SADD and MSAR datasets to confirm the effect of CBAM integration on the YOLO models.
The analysis followed the same setup as described in
Section 4.1. For the SADD, we increased the training epochs to 1000, using the default dataset split of 7:2:1. The other parameters remained the same. As presented in
Section 3.1, these two datasets contain different object classes, but most of their bounding boxes are medium-sized. In
Table 13, the results for the SADD dataset are presented.
As shown in
Table 13, replacing YOLOv11n’s original layer 4 with the CBAM achieved the best mAP@0.5, exceeding the YOLOv11n baseline (+0.004), the YOLOv12n baseline (+0.002), and YOLOv12n+CBAM (+0.004). These results support the extensive initial analysis on the SSD dataset and reinforce the CBAM’s contribution to small- and medium-object detection in SAR images.
For a deeper analysis, the last dataset, MSAR, has four object classes, as described in
Section 3.1. This introduces additional complexity, making it a more challenging dataset. For the MSAR dataset, we trained for 500 epochs with a 7:2:1 dataset split. All other parameters were kept the same as described in
Section 4.1. In
Table 14, all results are presented and compared with baselines.
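Per-class results such as those in Table 14 can be collected as in the sketch below, assuming the Ultralytics validation API exposes per-class mAP@0.5-0.95 values through metrics.box.maps; the model and dataset file names are placeholders.

```python
# Sketch of per-class metric extraction on the multi-class MSAR dataset.
from ultralytics import YOLO

model = YOLO("yolo12n_msar_cbam_l4.pt")   # placeholder checkpoint name
metrics = model.val(data="msar.yaml")     # validate on the MSAR split
for class_id, class_map in enumerate(metrics.box.maps):
    print(metrics.names[class_id], f"mAP@0.5-0.95={class_map:.3f}")
```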
Based on the results presented in
Table 14, which evaluates model performance on the multi-class MSAR dataset, several findings emerge regarding the impact of integrating the CBAM attention mechanism. The overall performance analysis indicates that the CBAM-enhanced YOLOv12n model achieved the highest scores in both overall mAP@0.5 (0.974) and mAP@0.5-0.95 (0.788), demonstrating superior detection and localization accuracy across all object classes. This suggests that the combination of YOLOv12’s architectural advancements and the CBAM’s feature refinement capabilities is particularly effective for handling the diversity and complexity inherent in multi-class SAR imagery.
A class-specific examination reveals that the aircraft category, which posed the greatest challenge with the lowest baseline performance, benefited most significantly from the CBAM’s integration. YOLOv12n+CBAM/L4 achieved the best mAP@0.5 (0.937) and recall (0.863) for aircraft, indicating a marked improvement in detecting these difficult targets. For the ship and oil tank classes, where baseline performance was already near-perfect, the CBAM provided more modest gains, primarily enhancing recall and finer-grained localization accuracy, as reflected in the mAP@0.5-0.95 metrics. The bridge class showed interesting variability: YOLOv12n+CBAM delivered the best results (0.978 mAP@0.5), while YOLOv11n+CBAM experienced a slight decrease compared to its baseline, highlighting that the effectiveness of attention mechanisms can vary depending on both the object class and the base architecture.
These findings reinforce this work’s core thesis that strategic attention-layer replacement, particularly in YOLOv12, creates a more robust and accurate detector capable of handling the complexities of multi-class SAR object detection, with the most substantial improvements occurring in the most challenging detection scenarios.
5. Conclusions
This study advances SAR ship detection by strategically integrating attention mechanisms, specifically Bi-Level Routing Attention (BRA), Swin Transformer, and the Convolutional Block Attention Module (CBAM), into state-of-the-art YOLO architectures (v10, v11, v12). Crucially, we demonstrate that replacing the original attention layer at a critical network position (layer 4, corresponding to the second C3k2 module) yields superior performance gains compared to simply adding new attention modules. Among the mechanisms evaluated, the CBAM proved most effective when deployed at this optimal location within YOLOv12n.
The optimized CBAM-enhanced YOLOv12n achieved an mAP@0.5 of 98.0% on the challenging SAR Ship Dataset (SSD), surpassing the baseline YOLOv12n (97.8%) and prior state-of-the-art methods. Beyond improved accuracy, this model offers enhanced computational efficiency, reducing operations to 5.9 GFLOPS (compared to 6.5 GFLOPS in the baseline) while utilizing fewer layers (462 vs. 465). Rigorous cross-dataset validation on the SSDD dataset confirmed the robustness and generalizability of this approach, with CBAM replacement consistently improving performance over baselines. Furthermore, evaluation on non-ship SAR datasets (the SADD for aircraft and MSAR for multi-class detection) demonstrated that the strategic integration of the CBAM delivers performance gains across various object types and complex multi-class scenarios, solidifying its value as a general enhancement for SAR object detection architectures.
This work establishes a robust framework for efficient, high-precision maritime surveillance using SAR imagery. Future research will explore attention mechanisms in distinct contexts (e.g., drone-based imagery), the fine-tuning of their hyperparameters, the incorporation of oriented bounding boxes (OBBs) and slicing-aided hyper inference (SAHI) for enhanced precision, and further refinement of dynamic attention routing and small-object detection layers to build even more capable all-weather surveillance systems. The DETR architecture will also be considered in future work owing to its strong performance on the SSDD dataset.