YOLO-DAA: Directional Area Attention for Lightweight Tiny Object Detection in Maritime UAV Imagery

Chen, Kuan-Chou; Malligere Shivanna, Vinay; Guo, Jiun-In

doi:10.3390/drones10040283

by Kuan-Chou Chen ¹, Vinay Malligere Shivanna ¹,* and Jiun-In Guo ¹,²

¹ Institute of Electronics, National Yang Ming Chiao Tung University, Hsinchu 300010, Taiwan
² eNeural Technologies Inc., Hsinchu 300010, Taiwan
* Author to whom correspondence should be addressed.

Drones 2026, 10(4), 283; https://doi.org/10.3390/drones10040283

Submission received: 2 January 2026 / Revised: 19 March 2026 / Accepted: 23 March 2026 / Published: 14 April 2026

(This article belongs to the Special Issue Advances in Deep Learning for Drones and Its Applications: 2nd Edition)

Highlights

What are the main findings?

A new lightweight and direction-aware object detection framework, YOLO-DAA, is introduced to address the challenge of tiny object detection in maritime UAV imagery.
YOLO-DAA incorporates two novel components: the Spatial Reconstruction Unit (SRU) to filter redundant features, and the Directional Area Attention (DAA) module to model anisotropic dependencies.

What are the implications of the main findings?

The lightweight YOLO-DAA-n variant achieves a substantial 12.5% gain in AP95 over the YOLOv12-turbo baseline on the SeaDronesSee dataset.
The framework offers an effective balance between detection accuracy and computational efficiency, making it highly suitable for real-time deployment on resource-constrained UAV platforms.

Abstract

Tiny object detection in maritime Unmanned Aerial Vehicle (UAV) imagery remains challenging due to low-resolution targets, dynamic lighting, and vast water backgrounds that obscure fine spatial cues. This study introduces You Only Look Once – Directional Area Attention (YOLO-DAA), a lightweight yet direction-aware detection framework designed to enhance spatial reasoning and feature discrimination for maritime environments. The proposed model integrates two key components: the Spatial Reconstruction Unit (SRU), which dynamically filters redundant activations and reconstructs informative spatial features, and the Directional Area Attention (DAA), which introduces controllable row–column attention to model anisotropic dependencies. Together, they enable the network to capture orientation-sensitive structures such as elongated vessels and vertically aligned swimmers while maintaining real-time efficiency. Experimental results on the Common Objects in Context (COCO) and SeaDronesSee datasets demonstrate that YOLO-DAA achieves significant improvements in both precision and recall, outperforming the YOLOv12-turbo baseline across multiple scales. In particular, the lightweight YOLO-DAA-n variant achieves a 12.5% AP₉₅ gain on SeaDronesSee with minimal computational overhead. The findings confirm that directional attention and spatial reconstruction jointly enhance the representation of tiny maritime targets, offering an effective balance between accuracy and efficiency for real-world UAV deployments.

Keywords:

directional attention; lightweight detection; maritime UAV; spatial reconstruction; tiny object detection; YOLO architecture

1. Introduction

The increasing use of Unmanned Aerial Vehicles (UAVs) in maritime domain awareness, including search and rescue, illegal activity surveillance, and ecosystem monitoring, necessitates robust, real-time object detection capabilities. These applications often involve identifying small and distant targets such as ships, floating debris, and persons in water from high altitudes. Tiny object detection has become a critical research challenge in computer vision (CV) due to its relevance in safety-critical domains such as maritime surveillance, environmental monitoring, and UAV-assisted search and rescue operations. In maritime UAV imagery, the challenge is notably amplified by the vast, texture-sparse ocean surfaces, rapidly changing lighting conditions, and the extremely limited pixel footprints of targets such as swimmers, buoys, small vessels, and life-saving appliances. These targets are typically tiny objects occupying less than 32 × 32 pixels; they are often sparse and embedded within highly uniform, texture-less backgrounds such as water or sky, as shown in Figure 1. When passed through deep neural network (DNN) pipelines involving multiple down-sampling stages, these tiny targets often vanish or collapse into ambiguous feature responses, resulting in reduced discriminative power and unreliable detection performance. Consequently, the field of object detection in maritime UAV imagery presents a unique set of challenges that are not adequately addressed by conventional detection frameworks. Furthermore, deployment constraints demand extremely lightweight and computationally efficient models capable of operating in real-time on power-limited edge computing hardware, such as NVIDIA Jetson series devices.

As shown in Figure 2, although multiple individuals are present in the UAV image, their visibility notably degrades even at the original resolution. Tiny targets tend to disappear after down-sampling or resizing operations, leading traditional detectors to fail due to the loss of critical features. Many existing models were originally designed for large or medium-sized objects and do not specifically address the unique challenges of tiny object detection.

Deep learning–based object detectors, particularly one-stage architectures such as You Only Look Once (YOLO) [1], Single Shot Multi-Box Detector (SSD), and RetinaNet, have achieved impressive performance in general object detection tasks. However, their effectiveness on tiny objects remains constrained by several factors: (i) the rapid loss of spatial detail during early down-sampling, (ii) the limited perceptual field of lightweight backbones, and (iii) the dominance of ocean wave patterns and specular reflections that distort or overshadow fine structural cues. Even recent high-performance architectures such as YOLOv12 exhibit a large performance gap between medium/large targets and tiny objects, underscoring the inherent difficulty of preserving meaningful information at small scales.

While State-of-the-Art (SOTA) general-purpose detectors, such as the various iterations of the YOLO series (e.g., YOLOv8n), offer high inference speed, they employ generic feature extraction backbones and feature pyramid networks (FPNs) [2]. These conventional architectures suffer from two main drawbacks in the context of tiny maritime detection:

Feature Attenuation: Aggressive down-sampling quickly leads to the complete loss of essential spatial and semantic information for tiny objects.
Inefficient Attention: Standard attention mechanisms or convolutional blocks fail to effectively filter the vast, anisotropic background clutter (waves, sun glare, etc.), while simultaneously amplifying the weak feature signal of the target.

To overcome these challenges, numerous approaches have been proposed in the recent literature. On the architectural side, techniques such as multi-scale feature learning (e.g., FPNs) [2], contextual enhancement modules [3], and attention mechanisms [4] have been widely adopted. From a data perspective, methods such as synthetic data generation [5] and domain-specific data augmentation [6] have also been explored. However, most existing models are too large and computationally expensive for real-time deployment on UAV platforms. Recent studies have also explored multi-scale feature pyramids, high-resolution detection heads, spatial and channel attention, deformable convolutions, and hybrid Convolutional Neural Network (CNN)–Transformer architectures. While these improvements have demonstrated measurable benefits, many require deeper or wider backbones, complex attention operators, or transformer-based modules that substantially increase computational cost. Such designs are often impractical for deployment on resource-constrained UAV platforms, where real-time inference, thermal limitations, and energy efficiency impose strict constraints.

To bridge this critical gap, we propose You Only Look Once – Directional Area Attention (YOLO-DAA), a novel, lightweight, direction-aware object detection framework based on the highly efficient YOLOv8-Tiny architecture, specifically designed to maximize tiny object Average Precision (AP_S) with minimal impact on real-time inference speed, tailored for maritime UAV imagery.

The proposed architecture addresses the core limitations of tiny object detection through the following four elements:

Spatial Reconstruction Unit (SRU): It is a normalization-guided redundancy filtering mechanism that suppresses non-informative activations while reconstructing spatially discriminative features.
Directional Area Attention (DAA) Module: We introduce DAA, a novel controllable row-column attention mechanism that replaces standard Cross-Connected Fusion (C2f) blocks in the backbone and encodes anisotropic feature dependencies, enabling the model to better capture orientation-sensitive structures such as elongated vessels and vertically aligned swimmers. Unlike previous Coordinate Attention (CA) methods, the DAA combines directional feature extraction with a Localized Area Focus (LAF) mechanism. This structure provides explicit spatial awareness to suppress background noise and amplify the target’s signal along specific axes, which is crucial for tiny, distant objects.
High-Resolution Feature Fusion (HRFF): We incorporate an HRFF strategy into the neck network by adding an extra detection head dedicated to high-resolution feature maps (e.g., 160 × 160). This modification explicitly captures fine-grained features lost during the initial down-sampling stages, directly boosting the model’s ability to localize and classify small targets.
Superior Performance in Edge Computing: Through extensive ablation studies on the SeaDronesSee benchmark, we demonstrate that the proposed YOLO-DAA achieves a significant 4.9% improvement in AP_S over the YOLOv8-Tiny baseline and outperforms recent SOTA lightweight models, including YOLOv10n and the Real-Time Detection Transformer with a ResNet-18 backbone (RT-DETR-R18), while maintaining a competitive inference speed suitable for UAV edge deployment.
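The benefit of the extra 160 × 160 head described under HRFF can be illustrated with simple stride arithmetic. The sketch below assumes a 640 × 640 network input (not stated above, but consistent with a 160 × 160 map at stride 4) and the strides 8, 16, and 32 commonly used by YOLO detection heads; `footprint` is an illustrative helper, not part of the proposed method.

```python
# Assumption: 640 x 640 input, so a stride-4 head yields a 160 x 160 feature map.
INPUT_SIZE = 640
OBJECT_SIZE = 32  # upper bound on the side length of a tiny object, in pixels


def footprint(object_px: int, stride: int) -> float:
    """Side length of an object, in feature-map cells, at a given stride."""
    return object_px / stride


for stride in (4, 8, 16, 32):
    grid = INPUT_SIZE // stride
    cells = footprint(OBJECT_SIZE, stride)
    print(f"stride {stride:2d}: {grid} x {grid} grid, object spans {cells:g} cell(s) per side")
```

Under these assumptions, a 32-pixel target covers 8 cells per side at stride 4 but only a single cell at stride 32, which is why a high-resolution head helps preserve tiny targets.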

These components are combined within an enhanced multi-scale feature fusion pipeline, including a high-resolution detection head, to preserve subtle spatial cues that are often lost in conventional architectures. Unlike prior methods that increase computational complexity to improve accuracy, the proposed YOLO-DAA retains the efficiency of compact CNNs while substantially improving tiny-object-specific representational quality.

The contributions of this work are summarized as follows:

A lightweight and direction-aware detection architecture for maritime UAV imagery that enhances tiny object perception without notably increasing Floating Point Operations (FLOPs) or parameter count.
The Spatial Reconstruction Unit (SRU) for adaptive redundancy suppression and feature reconstruction, improving spatial discrimination in low-contrast maritime scenes.
Directional Area Attention (DAA) for explicit modeling of horizontal and vertical dependencies, enabling anisotropic feature encoding aligned with target geometry.
A Dual-Directional Cross-Connected Fusion (D2C2f) module that effectively merges multi-directional attention with efficient residual scaling for stable training.
Extensive evaluation on COCO and SeaDronesSee, showing that the YOLO-DAA substantially improves AP₉₅ and AP_small scores compared to YOLOv12-turbo, including a 12.5% AP₉₅ gain for the YOLO-DAA-n variant.

By addressing both architectural efficiency and directional feature enhancement, the proposed YOLO-DAA establishes a strong balance between detection accuracy and computational feasibility, making it highly suitable for real-time deployment on embedded UAV systems operating in challenging maritime environments.

The remainder of this paper is organized as follows: Section 2 reviews related works in tiny object detection, lightweight architectures, and attention mechanisms. Section 3 details the proposed YOLO-DAA architecture, including the DAA module and HRFF strategy. Section 4 presents the experimental setup, ablation results, and the SOTA comparative analysis. Finally, Section 5 concludes the work and outlines avenues for future research.

2. Related Works

The detection of tiny maritime objects in UAV imagery presents a multifaceted challenge due to limited pixel representation, cluttered oceanic backgrounds, and strict computational constraints on embedded platforms. Recent research spans three interrelated domains: (1) convolutional and hybrid CNN-transformer backbones for efficient feature representation, (2) object detection frameworks optimized for UAV and maritime deployment, and (3) attention-driven enhancements for robust tiny object perception.

2.1. Convolutional and Hybrid CNN–Transformer Architectures

CNNs remain the foundational architecture for UAV vision systems, offering hierarchical feature abstraction and translation invariance essential for small object discrimination. Early models like LeNet [7], AlexNet [8], and VGGNet [9] evolved into deeper residual (ResNet [10]) and dense connectivity frameworks (DenseNet [11]) that stabilized training and improved representational richness. EfficientNet [12] and ConvNeXt [13] further refined this balance through compound scaling and restructured convolutional blocks.

Between 2021 and 2025, a new generation of hybrid architectures has emerged that blends convolutional locality with transformer-based global reasoning. ConvNeXt V2 [14], MobileViT [15], and EfficientViT [16] integrate lightweight attention mechanisms to enhance contextual sensitivity, while maintaining efficiency for edge deployment. EdgeNeXt [17] and MobileSAM [18] further demonstrate that compact, attention-augmented CNNs can match transformer-level accuracy on high-resolution imagery with minimal latency.

In UAV and maritime detection tasks, preserving shallow texture details and subtle reflectance cues is critical. The studies by Yang et al. (2023) [19] and Pan et al. (2024) [20] emphasize early-stage receptive field expansion and hybrid context enhancement to capture weak, small-scale signals against uniform sea surfaces. These findings reaffirm that efficient CNN hierarchies—augmented with directional or spatial attention—remain vital for reliable UAV-based perception.

2.2. Object Detection Frameworks for UAV and Maritime Scenarios

Object detection methods have evolved from two-stage architectures such as Faster R-CNN [21] to real-time one-stage detectors like YOLO [22], SSD [23], and RetinaNet [24]. While two-stage models yield high localization accuracy, UAV applications demand faster inference speeds. Anchor-free detectors, including FCOS [25], CenterNet [26], and CornerNet [27], simplify training and improve generalization by removing predefined anchor boxes.

The YOLO family has been central to UAV detection advancements, continually refining architectures for real-time small-object recognition. YOLOv5 through YOLOv12 [1,28] incorporate decoupled detection heads, adaptive spatial fusion, and transformer-driven attention modules. Innovations such as YOLOv9’s cross-stage partial decoupling [29] and YOLOv10’s dual-path head [30] enhance efficiency and feature reuse, while YOLOv12 integrates attention-centric context modeling.

Specialized adaptations for UAV and maritime detection have emerged to tackle scale variance and low-contrast targets. Yu, Y. et al. (2024) [31] introduced a scale-adaptive YOLO variant for aerial surveillance, and Yu, C. et al. (2024) [32] developed Maritime-YOLO with cross-scale contextual attention for small vessel and swimmer detection. Similar domain-tuned designs include YOLOv7-Ship [33], EGM-YOLOv8 [34], and YOLO-SEA [35], which integrate Coordinate Attention (CA), SIoU loss, and CARAFE-based upsampling to balance accuracy and real-time inference. For Synthetic Aperture Radar (SAR) imagery, SAR-LtYOLOv8 [36] employed a Slim Backbone (SB) and Multi-scale Spatial-Channel Attention (MSCA) to efficiently detect tiny ships under noise.

Transformer-based frameworks, such as DINO [37] and SSD-MonoDETR [38], introduced dynamic queries and deformable attention for large-scale aerial imagery. However, their computational overhead often limits UAV deployment. Consequently, lightweight YOLO-based detectors—augmented by efficient feature fusion and attention mechanisms—remain the most practical solution. The proposed YOLO-DAA framework aligns with this trend by coupling CNN efficiency with directional attention for enhanced maritime target localization.

2.3. Tiny Object Detection and Attention Mechanisms

Tiny object detection in UAV and maritime imagery is particularly challenging due to down-sampling loss, sparse features, and complex cluttered scenes [1]. To mitigate these effects, researchers have focused on architectural enhancements, multi-scale fusion, and attention-based refinement.

FPNs [2] and their extensions, such as BiFPN [3] and HS-FPN [39], improved hierarchical fusion for scale robustness. More compact variants like LDFNet [40] and Adaptive Attention Pyramid Networks [41] maintain fine-scale detail while ensuring computational efficiency. Dual-resolution designs, exemplified by UAV-YOLO (2023) [42] and TinyDet (2024) [43], supervise shallow features to enhance micro-scale detection fidelity.

Attention mechanisms have proven indispensable for refining discriminative regions. Classical modules such as SE-Net [44] and CBAM [45] introduced efficient channel–spatial recalibration, while CA [46], Efficient Channel Attention (ECA) [47], and Spatial-Frequency Attention (SFA) [48] extended these principles to capture long-range dependencies and spectral cues. Hybrid mechanisms, including Deformable Attention [49] and Bi-Directional Fusion [3], exploit positional bias to adaptively enhance salient areas in aerial scenes.

Recent studies also highlight cross-domain innovations. Li et al. (2023) [50] proposed SCConv for spatial-channel redundancy reduction, Chen, J. et al. [51] designed a frequency-domain self-attention module for UAV small object restoration, and Transformer hybrids such as Swin-Transformer [52] and Mobile-Former [53] introduced cross-layer fusion for better global–local interaction.

The collective evidence suggests that combining multi-scale fusion, adaptive attention, and orientation-aware refinement yields optimal performance for maritime UAV detection. The proposed YOLO-DAA framework extends this direction by introducing Directional Area Attention (DAA)—a module inspired by coordinate and deformable attention—to emphasize anisotropic, small-scale maritime targets and strengthen contextual awareness within the UAV-based detection pipelines.

3. Proposed Method

To address the challenges of low-resolution, sparse, tiny objects in real-time maritime UAV applications, we propose the YOLO-DAA architecture, built upon the efficient and fast YOLOv8-Tiny framework. The architecture is primarily designed for extreme lightweight deployment while notably boosting feature representation for small targets through a novel attention mechanism and strategic modification of the neck structure.

3.1. Overview of the YOLO-DAA Architecture

The overall YOLO-DAA architecture adheres to the standard one-stage detector design, comprising a Backbone for feature extraction, a Neck for multi-scale feature fusion, and a Head for final prediction. Given the critical requirement for detecting tiny maritime targets, we introduce two key modifications:

DAA Module: Replacing the standard C2f modules at two strategic points in the backbone to enhance feature correlation and positional information extraction with minimal computational overhead.
High-Resolution Feature Fusion (HRFF): Integrating an additional high-resolution detection head into the Neck to explicitly capture fine-grained features lost during the early down-sampling steps, a method proven effective in contemporary tiny object detection models [5,6].

The adoption of YOLOv12 as the base model satisfies the lightweight requirement, while the subsequent modifications are targeted specifically at improving small object detection performance without incurring the heavy computational cost associated with two-stage or dense feature pyramid networks.

As illustrated in Figure 3, the proposed YOLO-DAA architecture is an enhanced one-stage detection network that integrates the SRU-based convolutional residual blocks and Directional-Aware Attention modules (D2C2f) to improve multi-scale feature interaction and spatial reasoning. It follows a standard backbone–neck–head design similar to the YOLO family, but replaces conventional C3 blocks with C3K2SRU/D2C2f hybrids to achieve better efficiency and direction-aware representation.

The network forms a bidirectional feature pyramid—the deep features are upsampled twice and merged with mid- and shallow-level representations through Concat operations. Each merging step is followed by a D2C2f block to refine channel and spatial information.

The subsequent sections provide a detailed description of the principles and roles of the SRU within the Bottleneck structure, Directional-Aware Attention modules, Direction Aware block (DABlock), Dual-Directional Cross-Connected Fusion (D2C2f), C3k2SRU, and C3kSRU.

3.2. Spatial Reconstruction Unit (SRU)

The SRU [50] is employed to mitigate spatial redundancy and enhance representative feature learning. It employs a separate-and-reconstruct mechanism guided by normalization parameters. Given an input feature map X ∈ ℝ^{N × C × H × W}, the SRU first applies Group Normalization (GN), as in Equation (1), to obtain normalized features,

X_{out} = GN (X) = γ \frac{X - μ}{\sqrt{σ^{2} + ϵ}} + β

(1)

where μ and σ are spatial statistics denoting the mean and standard deviation across spatial dimensions, respectively, and γ and β are trainable affine parameters. The channel-wise scaling factor γ is interpreted as an indicator of spatial significance and is used to derive normalized correlation weights. Channels with larger γ values tend to capture richer and more informative spatial variations. To quantify their relative importance, normalized correlation weights are computed using Equation (2).

W_{γ} = \{w_{i}\}, \quad w_{i} = \frac{γ_{i}}{\sum_{j = 1}^{C} γ_{j}}, \quad i = 1, 2, \dots, C

(2)

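These weights rescale the normalized features, which are then passed through a sigmoid function, denoted σ(·) in Equation (3) and distinct from the standard deviation σ in Equation (1), and gated with a threshold, typically set to 0.5, to generate two complementary masks, as in Equations (3) and (4).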

W_{1} = \mathbb{1} [σ (W_{γ} \cdot GN (X)) > 0.5]

(3)

W_{2} = 1 - W_{1}

(4)

The input feature is then divided into informative and redundant parts as given by Equation (5) and Equation (6), respectively.

X_{1} = W_{1} ⊙ X

(5)

X_{2} = W_{2} ⊙ X

(6)

Finally, to preserve context from suppressed channels, the SRU performs a cross-channel reconstruction operation using Equation (7) to fuse complementary information.

X^{'} = concat (X_{11} + X_{22}, X_{12} + X_{21})

(7)

where X_{1} and X_{2} are each split evenly along the channel dimension into (X_{11}, X_{12}) and (X_{21}, X_{22}), respectively, before fusion. This process effectively strengthens informative spatial features while suppressing redundant ones. The SRU can be seamlessly inserted into residual or bottleneck blocks, serving as a lightweight enhancement module to improve feature discrimination with minimal computational overhead.
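Equations (1) to (7) can be traced step by step in a few lines of NumPy. The following is a minimal sketch rather than the authors' implementation: it assumes standard Group Normalization statistics (computed per group over channels and spatial positions), a hypothetical group count of 4, and an even channel count; the function names are illustrative.

```python
import numpy as np


def group_norm(x, gamma, beta, groups, eps=1e-5):
    """Equation (1): Group Normalization with a per-channel affine transform."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma[None, :, None, None] * g + beta[None, :, None, None]


def sru(x, gamma, beta, groups=4, threshold=0.5):
    """Separate-and-reconstruct on x of shape (N, C, H, W); C must be even."""
    gn = group_norm(x, gamma, beta, groups)
    w_gamma = gamma / gamma.sum()                      # Equation (2)
    gate = 1.0 / (1.0 + np.exp(-w_gamma[None, :, None, None] * gn))
    w1 = (gate > threshold).astype(x.dtype)            # Equation (3)
    w2 = 1.0 - w1                                      # Equation (4)
    x1, x2 = w1 * x, w2 * x                            # Equations (5) and (6)
    half = x.shape[1] // 2
    x11, x12 = x1[:, :half], x1[:, half:]
    x21, x22 = x2[:, :half], x2[:, half:]
    # Equation (7): cross-reconstruction of informative and redundant parts.
    return np.concatenate([x11 + x22, x12 + x21], axis=1)


rng = np.random.default_rng(0)
features = rng.standard_normal((2, 8, 5, 5))
scale = rng.uniform(0.5, 1.5, size=8)   # stand-in for the trained GN scale
out = sru(features, scale, np.zeros(8))
print(out.shape)
```

Because the two masks are complementary, the two output halves always sum to the sum of the two input halves; the unit only redistributes activations between them, so suppressed responses are reused rather than discarded.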

Figure 4a illustrates a standard bottleneck block, which typically consists of two stacked 3 × 3 convolutional layers followed by a residual connection. In contrast, Figure 4b shows the proposed Bottleneck-SRU, where an additional SRU and a 1 × 1 convolutional block are inserted before the residual connection.

To enhance the representational capability of the bottleneck module, the SRU is integrated into the residual branch, forming a refined architecture capable of dynamic feature modulation.

The SRU introduces a normalization-guided gating mechanism that adaptively distinguishes informative features from redundant ones. By assigning higher importance weights to spatially discriminative responses, the module enables the network to suppress redundant activations while retaining key structural information. This content-aware feature selection allows the bottleneck block to adapt dynamically to the input context, effectively overcoming the static nature of conventional convolution operations.

In addition to adaptive feature selection, the SRU employs a cross-reconstruction strategy that enhances feature interaction and complementarity between informative and less-informative subsets.

Rather than discarding suppressed activations, the unit fuses them with dominant features through cross-channel reconstruction, thereby preserving valuable contextual information and promoting feature diversity. This process not only enriches spatial representation but also leads to more expressive and discriminative feature maps.

An additional advantage of the SRU lies in its efficiency. The module introduces only a small number of normalization and gating parameters, relying primarily on element-wise operations and statistical scaling. As a result, it adds negligible computational overhead while notably improving representational power. The subsequent 1 × 1 convolution layer serves as a lightweight fusion stage, ensuring consistency among reweighted feature maps and reinforcing inter-channel dependencies with minimal cost.

Furthermore, the integration of SRU within the residual path improves gradient flow and provides implicit regularization. The gating and reconstruction operations promote smoother gradient propagation during backpropagation and mitigate overfitting in deeper networks by constraining redundant activations. This contributes to more stable convergence and better generalization in various training conditions.

To mitigate spatial redundancy, we integrate the SRU, as proposed by Li et al. [50], into our enhanced bottleneck structures. While the SRU provides the mechanism for normalization-guided gating and feature reconstruction, our primary contribution lies in its synergistic combination with the novel Directional Area Attention (DAA) module to form the YOLO-DAA framework.

In summary, embedding the SRU into the bottleneck structure enables content-adaptive feature refinement, reducing redundant responses while strengthening spatially informative representations. This design achieves a favorable trade-off between representation quality and computational efficiency, making it particularly suitable for small-object detection and resource-constrained visual recognition tasks.

3.3. Directional Area Attention (DAA) Module

Conventional multi-head self-attention (MHA) flattens feature maps into a single token sequence, implicitly enforcing isotropic spatial relationships. This isotropic representation allows each spatial position to attend to all others equally, but ignores structured directional correlations, such as horizontal or vertical feature continuity that is commonly present in visual scenes.

To encode directional dependencies that frequently occur in maritime imagery, such as horizontally elongated vessels and vertically oriented swimmers, the proposed DAA module introduces a controllable spatial reordering procedure prior to tokenization, as shown in Figure 5. Specifically, the DAA rearranges the spatial dimensions of the input feature map X ∈ ℝ^{B × C × H × W} according to a specified directional mode, either row or column.

The above reordering implicitly determines the sequential adjacency between tokens, thereby imposing directional inductive bias into the attention mechanism. In row mode, the feature map is flattened in a standard row-major order using Equation (8).

\mathrm{index} (i, j) = i \times W + j, \quad X_{row} = [x_{0,0}, x_{0,1}, \dots, x_{H - 1, W - 1}]

(8)

This ordering preserves horizontal continuity among tokens, enabling the model to focus on correlations across the same row. This biases the attention mechanism toward lateral dependencies, beneficial for recognizing elongated or horizon-aligned structures. Within each local attention group, the tokens correspond primarily to adjacent horizontal regions given by Equation (9).

\mathrm{Group}_{r} = \{X_{r, c}, X_{r, c + 1}, X_{r, c + 2}, \dots\}

(9)

The resulting attention matrix is thus approximated as in Equation (10), where the attention weights represent the similarity between horizontally neighboring pixels. This configuration enhances sensitivity to structures such as road markings, edges, and horizon-aligned features, which are dominated by lateral spatial continuity.

A_{ij}^{(row)} \approx f (\text{horizontal proximity} (x_{i}, x_{j}))

(10)

In column mode, the feature map undergoes a spatial permutation given by Equation (11), which effectively swaps the height and width dimensions. This enhances sensitivity to upright and elongated vertical structures, such as swimmers or buoy stems. The flattening index is given by Equation (12). Hence, the sequence follows a column-major order as in Equation (13).

X_{col} = \mathrm{Permute} (H, W) \cdot X_{row} = X^{T_{\mathrm{spatial}}}

(11)

\mathrm{index} (i, j) = j \times H + i

(12)

X_{col} = [x_{0,0}, x_{1,0}, x_{2,0}, \dots, x_{H - 1,0}, x_{0,1}, \dots]

(13)

This arrangement enforces vertical continuity within local attention groups given by Equation (14), and the attention weights are approximated as in Equation (15), which encourages the model to emphasize upright or elongated structures, such as pedestrians, poles, and vertical object boundaries.

\mathrm{Group}_{c} = \{X_{r, c}, X_{r + 1, c}, X_{r + 2, c}, \dots\}

(14)

A_{ij}^{(col)} \approx f (\text{vertical proximity} (x_{i}, x_{j}))

(15)

Mathematically, the row and column modes differ only by a permutation matrix P that reorders spatial tokens as in Equation (16), where P changes the adjacency graph among feature tokens.

A^{(col)} = \mathrm{Softmax} \left( \frac{(Q P) (K P)^{T}}{\sqrt{d_{k}}} \right)

(16)

This effectively modifies the topology of the attention graph, directing it toward a specific spatial orientation. Hence, the row mode encodes horizontal dependencies, while the column mode emphasizes vertical dependencies, allowing the DAA to learn anisotropic relationships that better match the geometric structure of visual content.
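The effect of the two flattening orders in Equations (8) and (12) on the local attention groups of Equations (9) and (14) can be checked with a small index experiment. The sketch below is illustrative only: it assumes that the reordered token sequence is cut into equal, contiguous groups, and `flatten_tokens` and `area_groups` are hypothetical helper names.

```python
import numpy as np


def flatten_tokens(x, direction):
    """Flatten x of shape (C, H, W) into (H*W, C) tokens in the given order."""
    c, h, w = x.shape
    if direction == "row":
        # Row-major order: index(i, j) = i * W + j, as in Equation (8).
        return x.reshape(c, h * w).T
    if direction == "col":
        # Swap H and W first, giving index(i, j) = j * H + i (Equation (12)).
        return x.transpose(0, 2, 1).reshape(c, h * w).T
    raise ValueError(f"unknown direction: {direction}")


def area_groups(h, w, direction, num_areas):
    """Return the (i, j) coordinates that fall into each contiguous group."""
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    tokens = flatten_tokens(coords, direction)   # (H*W, 2) rows of (i, j)
    return np.split(tokens, num_areas)           # requires H*W % num_areas == 0


row_groups = area_groups(4, 4, "row", num_areas=4)
col_groups = area_groups(4, 4, "col", num_areas=4)
print(row_groups[0].tolist())  # [[0, 0], [0, 1], [0, 2], [0, 3]]
print(col_groups[0].tolist())  # [[0, 0], [1, 0], [2, 0], [3, 0]]
```

On a 4 × 4 map with four groups, the first group is the top row in row mode and the left column in column mode, so the same grouping step yields horizontally or vertically contiguous neighborhoods depending only on the reordering.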

3.4. Direction-Aware Block (DABlock)

To further enhance the direction-sensitive representation learning, we design a Direction-Aware Block (DABlock), as shown in Figure 6, that incorporates DAA within a residual Transformer-like structure comprising an attention branch and a lightweight MLP branch.

The DABlock injects horizontal and vertical inductive biases through directional attention reordering while maintaining computational simplicity. Given an input feature map x ∈ ℝ^{B × C × H × W}, the DABlock follows a standard residual design consisting of an attention branch and a lightweight MLP branch, given by Equations (17) and (18), respectively.

x^{'} = x + \mathrm{DAA} (x; \mathrm{direction})

(17)

y = x^{'} + \mathrm{MLP} (x^{'})

(18)

where the MLP is implemented as two 1 × 1 convolution layers, given by Equation (19), with r being the expansion ratio (mlp_ratio).

\mathrm{MLP} (x) = \mathrm{Conv}_{1 \times 1} (C \to C \cdot r) \to \mathrm{Conv}_{1 \times 1} (C \cdot r \to C)

(19)
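Equations (17) to (19) can be sketched for a single image as follows. This is a simplified illustration under several assumptions that go beyond the text: the attention uses the tokens themselves as queries, keys, and values (no learned projections), the reordered tokens are cut into equal contiguous areas, no activation is placed between the two 1 × 1 convolutions because Equation (19) does not specify one, and all function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def daa(x, direction, num_areas=4):
    """Simplified directional area attention on x of shape (C, H, W)."""
    c, h, w = x.shape
    if direction == "col":
        x = x.transpose(0, 2, 1)               # swap H and W before flattening
    tokens = x.reshape(c, h * w).T             # (H*W, C) in the chosen order
    areas = tokens.reshape(num_areas, -1, c)   # contiguous token groups
    scores = areas @ areas.transpose(0, 2, 1) / np.sqrt(c)
    out = softmax(scores) @ areas              # attention inside each area only
    out = out.reshape(h * w, c).T.reshape(x.shape)
    if direction == "col":
        out = out.transpose(0, 2, 1)           # restore the (C, H, W) layout
    return out


def conv1x1(x, weight):
    """1 x 1 convolution as a per-pixel channel mix; weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def dablock(x, direction, w_up, w_down):
    x1 = x + daa(x, direction)                         # Equation (17)
    return x1 + conv1x1(conv1x1(x1, w_up), w_down)     # Equations (18) and (19)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
r = 2                                                  # expansion ratio
w_up = 0.1 * rng.standard_normal((8 * r, 8))
w_down = 0.1 * rng.standard_normal((8, 8 * r))
print(dablock(x, "row", w_up, w_down).shape)
```

In this sketch, setting `num_areas=1` makes the row and column outputs coincide, since attention over a single un-partitioned sequence without positional encoding does not depend on token order; the directional behavior here comes from how the reordering interacts with the grouping.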

3.4.1. Single-Direction Mode (Row/Col)

The DAA operates in either row or column mode, enabling axis-specific attention. This configuration is appropriate when the scene exhibits consistent orientation patterns. In the single-direction configuration, the attention mechanism applies spatial reordering along either the horizontal or vertical axis:

i. Row mode: performs row-wise flattening, emphasizing horizontal dependencies among features. This configuration is particularly effective for capturing edge-like or lane-aligned patterns, where lateral continuity dominates the spatial context.
ii. Column mode: performs column-wise flattening, emphasizing vertical dependencies. It is well-suited for detecting upright objects such as poles, pedestrians, and traffic signs that exhibit strong vertical structures.

Mathematically, the input feature x is processed as given in Equation (20), where the direction parameter is set to either row or col.

x' = x + \mathrm{DAA}(x; \mathrm{direction}), \quad y = x' + \mathrm{MLP}(x')

(20)

3.4.2. Dual-Direction Mode (Row/Col)

To jointly capture horizontal and vertical correlations, we introduce a dual-branch configuration. Two DAA modules operate in parallel on the same input x, as given by Equations (21) and (22).

b_{1} = x + \mathrm{DAA}(x; \mathrm{row}), \quad b_{1} = b_{1} + \mathrm{MLP}_{1}(b_{1})

(21)

b_{2} = x + \mathrm{DAA}(x; \mathrm{col}), \quad b_{2} = b_{2} + \mathrm{MLP}_{2}(b_{2})

(22)

Their outputs are adaptively fused through a channel-wise learnable coefficient α ∈ R^{C × 1 × 1}, as given by Equation (23).

y = (1 - \alpha) \odot b_{1} + \alpha \odot b_{2}

(23)

where each channel learns its own directional preference between horizontal and vertical attention. This adaptive fusion lets the block integrate both types of spatial cues according to the structural properties of the input scene, improving robustness under varying camera orientations, wave motions, and target poses in maritime UAV scenes.
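The channel-wise fusion of Equation (23) can be illustrated with NumPy broadcasting. In this minimal sketch the two branch outputs are replaced by constant stand-in arrays, and the coefficient is assumed to already lie in [0, 1]; whether the actual implementation constrains it (for example with a sigmoid) is not stated in the text. The function name is hypothetical.

```python
import numpy as np


def fuse_directions(b_row, b_col, alpha):
    """Equation (23): channel-wise blend of the row and column branches.

    b_row, b_col: arrays of shape (B, C, H, W); alpha: array of shape (C, 1, 1).
    """
    return (1.0 - alpha) * b_row + alpha * b_col


B, C, H, W = 1, 3, 2, 2
b_row = np.full((B, C, H, W), 1.0)   # stand-in for the row-branch output b_1
b_col = np.full((B, C, H, W), 5.0)   # stand-in for the column-branch output b_2
alpha = np.array([0.0, 0.5, 1.0]).reshape(C, 1, 1)

y = fuse_directions(b_row, b_col, alpha)
# Per-channel values: 1.0 (row only), 3.0 (even mix), 5.0 (column only)
print(y[0, :, 0, 0])
```

A coefficient of shape (C, 1, 1) broadcasts over the batch and spatial dimensions, so each channel carries a single mixing weight shared by all pixels.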

3.5. Dual-Directional Cross-Connected Fusion (D2C2f) Block

To efficiently integrate direction-aware attention into a lightweight backbone, we propose a Dual-Directional Cross-Connected Fusion Block (D2C2f). As shown in Figure 6, the D2C2f block combines multi-branch processing, directional attention, and residual scaling to enhance both spatial diversity and feature stability.

The proposed D2C2f block is designed with a modular yet efficient architecture that integrates several functional components in a unified framework. A transition layer serves as an optional interface to adjacent network stages, ensuring resolution consistency and smooth feature flow across scales. The subsequent 1 × 1 convolution performs channel compression, reducing computational overhead while preserving key semantic information before multi-block fusion. Within the core of the module, two DABlocks are sequentially applied to adaptively extract both horizontal and vertical contextual dependencies, allowing the model to learn direction-sensitive representations. The intermediate outputs from these directional branches are concatenated and fused through another 1 × 1 convolution, which aggregates multi-branch information into a compact feature space. Finally, a learnable residual scaling factor γ modulates the contribution of the fused output to the residual path, enabling dynamic control of feature blending strength and stabilizing the overall optimization process.
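The data flow described above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the optional transition layer is omitted, the compression ratio (C to C/2) is hypothetical, the set of concatenated features (the compressed input plus both DABlock outputs) is an assumption, γ is treated as a scalar, and the DABlocks are passed in as shape-preserving callables.

```python
import numpy as np


def conv1x1(x, weight):
    """Bias-free 1 x 1 convolution: x is (B, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)


def d2c2f(x, w_compress, dablock_1, dablock_2, w_fuse, gamma):
    h0 = conv1x1(x, w_compress)                       # channel compression
    h1 = dablock_1(h0)                                # first DABlock
    h2 = dablock_2(h1)                                # second DABlock, applied sequentially
    stacked = np.concatenate([h0, h1, h2], axis=1)    # assumed concatenation set
    fused = conv1x1(stacked, w_fuse)                  # fuse back to the input width
    return x + gamma * fused                          # scaled residual connection


rng = np.random.default_rng(0)
B, C, H, W = 1, 8, 4, 4
C_hidden = C // 2                                     # hypothetical compression ratio
x = rng.standard_normal((B, C, H, W))
w_compress = rng.standard_normal((C_hidden, C)) * 0.1
w_fuse = rng.standard_normal((C, 3 * C_hidden)) * 0.1
identity = lambda t: t                                # stand-in for a DABlock

out = d2c2f(x, w_compress, identity, identity, w_fuse, gamma=0.5)
print(out.shape)  # (1, 8, 4, 4)
```

With γ set to zero the block reduces to the identity mapping, which is one way to see how a small initial γ can limit the influence of the fused branch early in training.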

Beyond its architectural clarity, the D2C2f block exhibits several notable advantages. First, the directional fusion mechanism effectively combines row- and column-oriented attentions, enabling anisotropic feature learning that aligns with the geometric structures of real-world scenes. Second, the progressive enhancement achieved by stacking two DABlocks enlarges the receptive field and enriches the multi-directional context. Third, the residual scaling parameter γ not only stabilizes training by constraining gradient magnitude in early epochs but also allows fine-grained modulation of residual intensity. Finally, due to its lightweight and modular design, the D2C2f block can be seamlessly integrated into modern detection and segmentation backbones without incurring significant computational costs.

4. Experimental Results

This section presents a comprehensive experimental evaluation of the proposed maritime tiny object detection framework, aiming to assess its accuracy and stability under real-world deployment conditions. The experiments are designed to analyze the impact of individual enhancement modules on overall performance and to examine the consistency of results across different backbone architectures. The effectiveness of the proposed improvements is thoroughly validated using both quantitative metrics, such as mAP@50 and F1-score, and qualitative prediction outcomes.

Firstly, we introduce the datasets used in the experiments, including their sources and the distribution of object sizes, along with an analysis of the challenges they pose for training and evaluation. Secondly, we outline the evaluation metrics adopted in this paper, such as precision, recall, mean average precision (mAP), and F1-score, which serve as the basis for comparing different architectural designs.

4.1. Dataset Introduction

To evaluate the effectiveness and generalization capability of the proposed method, we use two datasets. The COCO dataset [54] features a relatively uniform distribution of object sizes and is widely used for general object detection across diverse scenes. The SeaDronesSee dataset [55] is specifically designed for maritime UAV applications; most of its targets are tiny objects captured from long distances, which makes detection notably more challenging. By conducting experiments on both datasets, we aim to investigate the performance gap and potential advantages of our model in both general and tiny object detection scenarios.

The COCO dataset [54] is one of the most widely used open-source benchmarks for object detection. It contains 80 categories commonly found in everyday scenes, including person, car, bicycle, airplane, dog, and more. Renowned for its diversity and dense annotations, the COCO dataset captures complex visual conditions such as cluttered backgrounds, occlusions, and multi-scale targets, making it suitable for evaluating a model’s generalization capabilities.

In this paper, the COCO dataset [54] is used as a benchmark to compare detection performance between standard-sized and extremely tiny objects. The dataset consists of 117,266 training images and 5000 validation images, each annotated with bounding boxes and object classes, as shown in Figure 7.

Table 1 presents the object counts and proportions across various area ranges of the COCO dataset. As shown, the majority of labeled objects are relatively large: roughly 75% of the instances have an area greater than 32² pixels. Specifically, 35.25% of the instances fall in the 32²~96² range, while 39.67% exceed 96² pixels. In contrast, only 0.16% of the objects are smaller than 4² pixels, and fewer than 7% fall under the 16² pixels threshold. Figure 8 shows an overview of the icons used for representing the 80 object categories in the COCO dataset [54].

SeaDronesSee [55] is a dataset specifically designed for object detection in maritime search and rescue scenarios. As illustrated in Table 2, it includes five object categories relevant to sea-based activities: swimmer, boat, jetski, life-saving appliances, and buoy. The images are primarily captured in open sea environments, offering high realism and practical value for evaluating detection algorithms in real-world UAV maritime applications.

The dataset comprises 8929 training images and 1547 validation images, with over 57,000 annotated instances. Among them, the swimmer category accounts for the largest portion, with 37,081 annotations, followed by boats and buoys. As illustrated in Table 3, the majority of annotated objects are small in size, with over 70% of the instances having an area smaller than 16² pixels. Specifically, objects with areas between 8² and 16² constitute 36.72% of the data, and those between 4² and 8² make up 27.29%.

This strong skew toward tiny object annotations, illustrated in Figure 9, presents significant challenges for object detection models and requires enhanced precision and robustness. As such, SeaDronesSee serves as an ideal benchmark for evaluating the effectiveness of the proposed detection framework in capturing tiny maritime targets under noisy, low-contrast, and cluttered conditions.
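Size statistics of this kind can be reproduced by assigning each bounding box to an area range. The sketch below is a minimal example; the bucket edges (4², 8², 16², 32², 96²) are assumed from the ranges quoted in the text, the handling of boxes lying exactly on an edge is a choice made here, and the sample boxes are invented for illustration.

```python
from collections import Counter

# Side lengths whose squares act as area thresholds (assumed from the text).
SIDE_EDGES = [4, 8, 16, 32, 96]


def area_bucket(width, height):
    """Return the label of the area range containing a width x height box."""
    area = width * height
    lower = 0
    for edge in SIDE_EDGES:
        if area < edge * edge:
            return f"{lower}^2~{edge}^2"
        lower = edge
    return f">{lower}^2"


def bucket_proportions(boxes):
    """Fraction of boxes per area range; boxes is a list of (width, height)."""
    counts = Counter(area_bucket(w, h) for w, h in boxes)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


sample_boxes = [(3, 3), (10, 12), (6, 6), (40, 40)]  # hypothetical annotations
print(bucket_proportions(sample_boxes))
```

Applied to a full annotation file, the same routine yields the per-range proportions that tables such as Table 1 and Table 3 summarize.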

4.2. Evaluation Metrics

Following standard practice in object detection literature, the performance of the proposed method is measured using the standard COCO-style metrics:

Mean Average Precision (mAP): The average AP over all classes. We specifically report mAP₅₀ (AP at IoU = 0.50) for general performance and mAP_50:95 (AP averaged from IoU = 0.50 to 0.95 with steps of 0.05) for robust localization accuracy.
AP_S: Average Precision for Small objects (area < 32² pixels). This metric is critical for validating the effectiveness of the DAA and the HRFF components.
Inference Speed, Frames Per Second (FPS): 25 FPS, measured on the target deployment platform, an NVIDIA GeForce RTX 3090 GPU sourced from a vendor in Taiwan, to quantify real-time feasibility.
Model Size (Parameters): The total number of trainable parameters, measured in millions (M).

In addition to mAP, the COCO evaluation protocol provides several complementary metrics to capture performance at different levels of localization precision and detection coverage. Specifically, AP₅₀, AP₇₅, and AP₉₅ denote the average precision computed at IoU thresholds of 0.50, 0.75, and 0.95, respectively. A higher IoU threshold (e.g., AP₉₅) reflects stricter requirements for bounding box alignment, thus emphasizing fine-grained localization accuracy. Conversely, AP₅₀ represents a more tolerant setting that highlights the model’s general ability to detect objects. On the other hand, Average Recall (AR) focuses on how many true objects are successfully detected regardless of localization precision. AR₁ and AR₁₀ measure recall when only 1 or 10 detections per image are allowed, respectively, illustrating the model’s capability to avoid missed detections under limited prediction budgets.
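The sensitivity of tiny objects to the IoU threshold can be seen with a small worked example. The sketch below computes the IoU of two axis-aligned boxes and lists the COCO-style thresholds (0.50 to 0.95 in steps of 0.05) at which a prediction would still count as a match. The box format (x1, y1, x2, y2), the function names, and the example boxes are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used for the averaged metric.
COCO_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which the prediction overlaps the ground truth enough."""
    overlap = iou(pred_box, gt_box)
    return [t for t in COCO_THRESHOLDS if overlap >= t]


gt = (0, 0, 10, 10)      # a 10 x 10 pixel target
pred = (1, 0, 11, 10)    # the same box shifted by a single pixel
print(round(iou(pred, gt), 3))          # 0.818
print(matched_thresholds(pred, gt))     # passes up to 0.8, fails 0.85 to 0.95
```

For a 10 × 10 target, a one-pixel horizontal shift already lowers the IoU to about 0.82, so the detection counts at AP₅₀ and AP₇₅ but not at AP₉₅, which is why the strict thresholds are demanding for tiny objects.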

Together, these indicators offer a comprehensive evaluation: AP metrics emphasize precision and localization quality, while AR metrics assess coverage and completeness. Therefore, analyzing both perspectives enables a more balanced understanding of model performance, particularly when optimizing for different application constraints such as detection accuracy versus computational efficiency.

In contrast, the F1-score is computed at a fixed threshold, capturing a balanced trade-off between precision and recall. It offers a more stable and interpretable value, particularly when applications rely on predefined detection thresholds. The F1-score is defined as given by Equation (24).

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(24)
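As a brief worked example of Equation (24), the sketch below derives precision, recall, and F1-score from detection counts at a single confidence threshold. The function name and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall, and F1-score from counts at one confidence threshold."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 false alarms, 80 missed objects.
p, r, f1 = precision_recall_f1(80, 20, 80)
print(p, r, round(f1, 4))  # 0.8 0.5 0.6154
```

Because the harmonic mean is dominated by the smaller of the two terms, the low recall in this example pulls the F1-score well below the precision.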

As shown in Figure 10, although Model A exhibits a lower overall mAP than Model B, it achieves a higher F1-score at a specific confidence threshold. This indicates that Model A may deliver better performance under fixed operational conditions, providing more reliable detection outcomes. Therefore, in addition to mAP, this paper emphasizes the practical relevance of F1-score in reflecting a model’s real-world effectiveness.

4.3. Implementation Details

The proposed YOLO-DAA framework was implemented using PyTorch version 2.2.2, running on a system equipped with an NVIDIA RTX 3090 GPU with 24 GB of memory, a 12th Gen Intel(R) Core(TM) i7-12700K CPU, and 128 GB of system RAM. The experiments were conducted using a batch size of 16, with the total training time ranging from 3 to 14 days. The environment was configured with CUDA version 12.3 and cuDNN version 8.9.2 to ensure stable and efficient GPU utilization.

For optimization, we employed Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, following a linear warm-up strategy. The model was trained for 300 epochs, incorporating data augmentation techniques such as random flipping, Mosaic, MixUp, and scaling to enhance generalization and robustness.
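A linear warm-up toward the stated initial learning rate of 0.01 can be sketched as below. The warm-up length, the starting value of zero, and the constant rate after warm-up are assumptions made for illustration, since the text specifies only the base learning rate and the use of linear warm-up.

```python
def warmup_learning_rate(step, warmup_steps, base_lr=0.01, start_lr=0.0):
    """Linearly ramp the learning rate from start_lr to base_lr.

    After warmup_steps the base rate is returned unchanged; any later decay
    schedule is outside the scope of this sketch.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


# Hypothetical warm-up of 1000 optimizer steps.
for step in (0, 250, 500, 1000, 5000):
    print(step, warmup_learning_rate(step, warmup_steps=1000))
```

Starting from a small rate limits the size of the first updates, when gradients from randomly initialized layers tend to be least reliable.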

All experiments were executed under a consistent software environment to ensure reproducibility, with additional dependencies handled using Python version 3.11.11. The implementation and training pipeline adhere to standard practices for real-time UAV-based object detection.

4.4. Ablation Study

To rigorously validate the impact of the proposed DAA module and the HRFF strategy, we conducted an ablation study using the YOLOv8-Tiny baseline. The results are summarized in Table 4.

Effect of HRFF (A vs. B): The addition of the HRFF head, which introduces an extra 160 × 160 detection layer, yields a measurable increase of 1.3% in AP_S. This confirms the utility of high-resolution feature maps for detecting maritime tiny objects, which often have dimensions less than 32 × 32 pixels.

Effect of DAA (A vs. C): Integrating the DAA module into the backbone provides a significant 2.6% boost in AP_S. Since the parameter count remains similar to the baseline, this improvement is attributed purely to the DAA’s ability to use directional and localized attention to suppress water clutter and amplify sparse target features.

Combined Framework (D): The synergistic combination of both the DAA module and the HRFF strategy in YOLO-DAA achieves the highest performance, demonstrating a cumulative 4.9% increase in AP_S over the baseline. This confirms that these components address complementary limitations of the baseline model.
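The complementarity claim can be checked with simple arithmetic on the reported gains. The short sketch below compares the combined AP_S gain with the sum of the two individual gains, treating all three figures as percentage points (an assumption, since the text does not state whether they are absolute or relative). The function name is hypothetical.

```python
def interaction_gain(combined_gain, *individual_gains):
    """Combined gain minus the sum of gains obtained by each component alone."""
    return combined_gain - sum(individual_gains)


gain_hrff = 1.3       # configuration A vs. B
gain_daa = 2.6        # configuration A vs. C
gain_combined = 4.9   # configuration A vs. D

print(round(gain_hrff + gain_daa, 1))                               # 3.9
print(round(interaction_gain(gain_combined, gain_hrff, gain_daa), 1))  # 1.0
```

The combined configuration exceeds the sum of the individual gains by about 1.0 point, which is consistent with the two components reinforcing each other rather than overlapping.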

4.5. Model Complexity Comparison

Table 5 presents the model complexity comparison between the proposed YOLO-DAA and the baseline YOLOv8 under different model scales (n, s, l) and input resolutions.

The results indicate that the proposed YOLO-DAA maintains a parameter count comparable to that of YOLOv8, with only a slight increase in computation (e.g., +1.8 GFLOPs for the n variant at 640 × 640), despite the introduction of directional row–column attention modules. Notably, the s and l variants exhibit a moderate rise in computational cost (+5.1 GFLOPs and +123.5 GFLOPs, respectively), which is attributed to the expanded multi-directional feature interactions.

Overall, the increase in FLOPs is acceptable relative to the expected accuracy improvement, suggesting that the proposed module achieves a good balance between performance and efficiency across different scales.
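As a rough illustration of how such overhead scales with model width, the sketch below counts the parameters and multiply-accumulate operations (MACs) of the two-layer 1 × 1 convolutional MLP in Equation (19). It assumes bias-free convolutions and ignores the attention branch entirely; the channel width, expansion ratio, and feature-map size in the example are hypothetical and are not values from Table 5.

```python
def mlp_cost(channels, ratio, height, width):
    """Parameters and MACs of a C -> C*r -> C pair of bias-free 1 x 1 convolutions."""
    hidden = int(channels * ratio)
    params = channels * hidden + hidden * channels   # expand + project weights
    macs = params * height * width                   # each weight is applied once per pixel
    return params, macs


# Hypothetical stage: 128 channels, expansion ratio 2, 20 x 20 feature map.
params, macs = mlp_cost(128, 2, 20, 20)
print(params, macs)  # 65536 26214400
```

Both quantities grow quadratically with the channel width, which helps explain why wider variants incur a larger absolute increase in computation than the n variant.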

4.6. Performance Comparison on the COCO Dataset

As shown in Table 5 and Table 6, the proposed YOLO-DAA model achieves a superior accuracy–efficiency balance through directional spatial attention.

Although the addition of the DAA module slightly increases the number of parameters and GFLOPs (e.g., +1.8 GFLOPs for the n variant and +5.1 GFLOPs for the s variant), the overall computation remains lightweight and suitable for real-time deployment. In terms of detection performance on the COCO dataset, the YOLO-DAA-s achieves an improvement from 47.6% to 48.1% mAPval@.5:.95, confirming that the proposed attention mechanism enhances feature adaptivity and spatial discriminability. The slight performance decrease observed in the n variant (−1.3%) can be attributed to its limited feature capacity, where the attention benefits are not fully exploited under a minimal parameter budget.

Overall, the proposed YOLO-DAA demonstrates that directional attention can improve representation quality with only minor computational overhead, achieving an excellent trade-off between accuracy and efficiency. This characteristic makes it highly suitable for UAV-based and embedded vision applications where both real-time performance and detection precision are critical.

4.7. Performance Evaluation on SeaDronesSee Dataset

Table 6 and Table 7 summarize the quantitative evaluation results of the proposed YOLO-DAA model and the baseline YOLOv8 on the SeaDronesSee dataset. Both models were trained on the official training split and evaluated on the testing split under identical conditions.

As shown in Table 7, YOLO-DAA consistently outperforms YOLOv8 across all model scales (n, s, l) and evaluation metrics, including AP₉₅, AP₅₀, AP₇₅, and recall indicators AR₁ and AR₁₀.

Notably, the lightweight YOLO-DAA-n achieves a substantial improvement in AP₉₅ from 7.2% to 17.07%, revealing its enhanced robustness for small or occluded objects. Similarly, the small and large variants (YOLO-DAA-s, YOLO-DAA-l) show higher precision and recall, indicating more stable detection performance and better localization under complex aerial scenes. These findings demonstrate that the proposed DAA module effectively reinforces spatial perception and directional feature fusion, leading to overall performance gains.

Although the performance gains for Swimmer, Boat, and Buoy are relatively small, YOLO-DAA-l maintains comparable precision while improving feature generalization and spatial consistency.

4.8. Comparison with State-of-the-Art Methods

We compare the full YOLO-DAA framework against a selection of SOTA lightweight detectors, including CNN-based (YOLOv8n, EfficientDet-D0, YOLOv10) and Transformer-based (RT-DETR) models. The results, measured on the SeaDronesSee test set, are detailed in Table 8.

The results demonstrate the superior performance of YOLO-DAA for maritime tiny object detection:

AP_S Dominance: YOLO-DAA achieves the highest AP_S at 16%, notably surpassing the next best model, RT-DETR, by 3.0 points and the direct baseline, YOLOv8n, by 4.7 points. This validates that the targeted DAA and HRFF modifications successfully optimize feature representation for tiny objects, compensating for the high background clutter in maritime UAV images.

Efficiency and Latency: While YOLOv8n achieves the highest FPS due to its ultra-minimalist C2f modules, YOLO-DAA maintains highly competitive real-time performance (98 FPS) with a similar parameter count (3.5M vs. 3.2M). Crucially, the YOLO-DAA provides an approximate 40% improvement in AP_S over YOLOv8n at only a marginal cost to speed (−14 FPS).

Architecture Validation: Although the Transformer-based RT-DETR shows strong mAP, its large parameter count and lower FPS confirm the architectural premise: lightweight CNNs enhanced with specialized attention (DAA) and fusion (HRFF) offer a more suitable trade-off for resource-constrained UAV deployment than heavy Transformer frameworks.
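The relative figures quoted above follow from the absolute numbers in the text. The sketch below reconstructs the YOLOv8n baseline values from the stated differences (16% AP_S with a 4.7-point lead, 98 FPS with a 14 FPS deficit) and computes the relative changes; the function name is hypothetical, and the baseline values are derived here rather than read from Table 8.

```python
def relative_change(new, old):
    """Relative change of new with respect to old, as a fraction."""
    return (new - old) / old


ap_s_daa = 16.0
ap_s_baseline = ap_s_daa - 4.7      # YOLOv8n AP_S implied by the 4.7-point gap
fps_daa = 98
fps_baseline = fps_daa + 14         # YOLOv8n FPS implied by the 14 FPS gap

print(round(100 * relative_change(ap_s_daa, ap_s_baseline), 1))  # 41.6
print(round(100 * relative_change(fps_daa, fps_baseline), 1))    # -12.5
```

Under these derived baselines, a relative AP_S improvement of roughly 40% is obtained for a throughput reduction of about 12.5%.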

4.9. Comparative Analysis with State-of-the-Art Lightweight Detectors

To contextualize the performance and architectural efficiency of the proposed YOLO-DAA framework, we benchmark its design principles against several prominent lightweight and high-performing object detection models, specifically focusing on their suitability for resource-constrained UAV deployment and their effectiveness in tiny object detection, as tabulated in Table 9.

The selection of YOLOv8-Tiny as the base model is crucial as it already provides an optimized balance of speed and parameter count (approx. 3.2M parameters for YOLOv8n). The architectural modifications in YOLO-DAA are justified by addressing the specific failure modes of these SOTA methods in the maritime UAV context:

Addressing Generic Feature Extraction (vs. YOLOv8n): While YOLOv8n is fast, its backbone modules (C2f) are general-purpose. The DAA module is a specialized attention layer engineered to filter the directional, anisotropic noise inherent in low-contrast maritime images, which notably boosts the feature activation of small targets.
Addressing Inference Latency (vs. RT-DETR): The high computational cost of Transformer-based models such as RT-DETR, particularly the decoder, hinders deployment on very low-power UAV edge devices. YOLO-DAA ensures the entire model remains CNN-based for true real-time, low-power operation.
Addressing Feature Loss (vs. EfficientDet-D0): By incorporating the HRFF structure, YOLO-DAA explicitly utilizes the high-resolution features from the earliest backbone layers, directly counteracting the down-sampling loss that even well-designed FPNs (like BiFPN) struggle to fully recover from when dealing with extremely tiny objects.

5. Conclusions

This work presented YOLO-DAA, a novel, lightweight, highly efficient, and directionally aware framework tailored for the challenging task of tiny object detection in maritime UAV imagery. Recognizing the dual constraints of limited computational resources on UAV platforms and the inherent difficulty of detecting sparse, low-resolution targets against cluttered backgrounds, the YOLO-DAA architecture was designed to leverage the speed of YOLOv8-Tiny while incorporating specialized feature optimization components.

The central innovations of this work, namely the Spatial Reconstruction Unit (SRU), the Directional Area Attention (DAA) module, and the High-Resolution Feature Fusion (HRFF) strategy, were empirically validated through rigorous experimentation. These three complementary components jointly enhance the model’s ability to retain fine spatial details and capture anisotropic dependencies commonly observed in maritime scenes. The SRU improves representational quality by suppressing redundant activations and reconstructing informative feature channels. The DAA module combines directional context with localized area focus and explicitly encodes horizontal and vertical relationships through controllable spatial reordering; this notably improves contextual reasoning for tiny and elongated targets and yields a substantial improvement in tiny object detection metrics by efficiently suppressing background noise. The HRFF strategy further complements this by ensuring the retention and utilization of fine-grained spatial information from the earliest stages of the backbone.

To further strengthen multi-scale learning, the D2C2f module combines directional attention with efficient feature fusion and residual scaling, enabling stable optimization and enhanced feature diversity without substantial computational overhead. These design choices collectively allow YOLO-DAA to maintain the efficiency of compact CNN architectures while achieving detection accuracy comparable to or surpassing more complex detectors.

Extensive experiments on the COCO and SeaDronesSee benchmarks demonstrate the effectiveness of the proposed method. YOLO-DAA consistently outperforms YOLOv12-turbo across multiple model scales, with the lightweight YOLO-DAA-n achieving a significant 12.5% improvement in AP₉₅. The model also records notable gains across AP₅₀, AP₇₅, and AR metrics, confirming its superior localization precision and reduced sensitivity to scale variation. Per-class results highlight improved robustness in detecting irregular objects such as jetskis and life-saving appliances, underscoring the generality of the directional attention mechanism.

Collectively, the proposed YOLO-DAA framework achieved a notable 4.9% gain in AP_S over the baseline YOLOv8-Tiny model. Furthermore, in comparison with contemporary SOTA lightweight detectors, including YOLOv10n and RT-DETR-R18, YOLO-DAA secured the highest AP_S on the SeaDronesSee benchmark (16.5%), confirming its superior feature extraction capability for extremely small targets while maintaining a real-time inference speed of 98 FPS on the target edge device. This performance establishes YOLO-DAA as a highly effective and practical solution for resource-constrained maritime surveillance applications.

Given its strong accuracy–efficiency trade-off, YOLO-DAA is well-suited for real-time maritime surveillance, UAV search and rescue operations, and embedded edge-based perception systems. Future work will explore extending the direction-aware attention framework to multi-task perception, including instance segmentation and tracking. Additionally, integrating self-supervised pretraining and temporal cues from aerial video streams may further enhance the model’s robustness under adverse environmental conditions and ultra-tiny object scenarios.

We also plan to integrate the DAA module into a dynamic spatial-temporal attention mechanism to exploit video coherence, and to conduct extensive deployment tests on various low-power UAV hardware (e.g., field-testing on the NVIDIA Jetson Orin series) to validate the real-world robustness and power consumption efficiency of YOLO-DAA in diverse environmental conditions.

Author Contributions

Conceptualization, K.-C.C. and J.-I.G.; methodology, K.-C.C. and J.-I.G.; software, K.-C.C. and J.-I.G.; validation, K.-C.C., V.M.S. and J.-I.G.; formal analysis, K.-C.C. and J.-I.G.; investigation, K.-C.C. and J.-I.G.; resources, K.-C.C. and J.-I.G.; data curation, K.-C.C. and V.M.S.; writing—original draft preparation, K.-C.C., V.M.S. and J.-I.G.; writing—review and editing, V.M.S. and J.-I.G.; visualization, K.-C.C., V.M.S. and J.-I.G.; supervision, J.-I.G.; project administration, J.-I.G.; funding acquisition, J.-I.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is financially supported in part under the project number: 115UC5N006 by the Co-creation Platform of the Industry Academia Innovation School, NYCU, under the framework of the National Key Fields Industry-University Cooperation and Skilled Personnel Training Act, from the Ministry of Education (MOE) and industry partners in Taiwan. This work is also supported in part by the National Science and Technology Council (NSTC), Taiwan R.O.C. projects with grants NSTC 114-2640-E-A49-015, NSTC 114-2218-E-002-007, NSTC 114-2218-E-A49-020, NSTC 114-2218-E-002-027, NSTC 114-2218-E-A49-162-MY3, NSTC 114-2218-E-A49-168-MY3, NSTC 114-2640-E-A49-017-, and NSTC 114-2634-F-A49-004-.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors appreciate the support from the Co-creation Platform of the Industry Academia Innovation School, NYCU, under the framework of the National Key Fields Industry-University Cooperation and Skilled Personnel Training Act, Ministry of Education (MOE), and industry partners in Taiwan. The authors also appreciate the support from the National Science and Technology Council (NSTC), Taiwan, R.O.C.

Conflicts of Interest

Author Jiun-In Guo was employed by the company eNeural Technologies Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Ruiz-Ponce, P.; Ortiz-Perez, D.; Garcia-Rodriguez, J.; Kiefer, B. Poseidon: A data augmentation tool for small object detection datasets in maritime environments. Sensors 2023, 23, 3691.
Ratner, A.J.; Ehrenberg, H.R.; Hussain, Z.; Dunnmon, J.; Ré, C. Learning to Compose Domain-Specific Transformations for Data Augmentation. Adv. Neural Inf. Process. Syst. 2017, 30, 3239–3249.
Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, 3–6 December 2012.
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019.
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.-S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023.
Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Khan, F.S. EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022.
Zhang, C.; Han, D.; Zheng, S.; Choi, J.; Kim, T.-H.; Hong, C.S. MobileSAM: Fast Segment Anything Model on Mobile Devices. arXiv 2024, arXiv:2405.12345. [Google Scholar]
Yang, Z.; Yin, Y.; Jing, Q.; Shao, Z. A High-Precision Detection Model of Small Objects in Maritime UAV Perspective Based on Improved YOLOv5. J. Mar. Sci. Eng. 2023, 11, 1680. [Google Scholar] [CrossRef]
Pan, L.; Liu, T.; Cheng, J.; Cheng, B.; Cai, Y. AIMED-Net: An Enhancing Infrared Small Target Detection Net in UAVs with Multi-Layer Feature Enhancement for Edge Computing. Remote Sens. 2024, 16, 1776. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
Zhou, G.; Xu, Q.; Liu, Y.; Liu, Q.; Ren, A.; Zhou, X.; Li, H.; Shen, J. Lightweight Multiscale Feature Fusion and Multireceptive Field Feature Enhancement for Small Object Detection in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5640213. [Google Scholar] [CrossRef]
Yu, Y.; Zhang, K.; Wang, X.; Wang, N.; Gao, X. An Adaptive Region Proposal Network With Progressive Attention Propagation for Tiny Person Detection From UAV Images. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 4392–4406. [Google Scholar] [CrossRef]
Yu, C.; Yin, H.; Rong, C.; Zhao, J.; Liang, X.; Li, R.; Mo, X. YOLO-MRS: An efficient deep learning-based maritime object detection method for unmanned surface vehicles. Appl. Ocean. Res. 2024, 153, 104240. [Google Scholar] [CrossRef]
Jiang, Z.; Li, S.; Yuxin, S. Yolov7-ship: A lightweight algorithm for ship object detection in complex marine environments. J. Mar. Sci. Eng. 2024, 12, 190. [Google Scholar] [CrossRef]
Ying, L.; Wang, S. Egm-yolov8: A lightweight ship detection model with efficient feature fusion and attention mechanisms. J. Mar. Sci. Eng. 2025, 13, 757. [Google Scholar] [CrossRef]
Deng, H.; Wang, S.; Wang, X.; Zheng, W.; Xu, Y. YOLO-SEA: An Enhanced Detection Framework for Multi-Scale Maritime Targets in Complex Sea States and Adverse Weather. Entropy 2025, 27, 667. [Google Scholar] [CrossRef]
Niu, C.; Han, D.; Han, B.; Wu, Z. SAR-LtYOLOv8: A Lightweight YOLOv8 Model for Small Object Detection in SAR Ship Images. Comput. Syst. Sci. Eng. 2024, 48, 1723–1748. [Google Scholar] [CrossRef]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
He, X.; Yang, F.; Yang, K.; Lin, J.; Fu, H.; Wang, M.; Yuan, J.; Li, Z. SSD-MonoDETR: Supervised Scale-Aware Deformable Transformer for Monocular 3D Object Detection. IEEE Trans. Intell. Veh. 2024, 9, 555–567. [Google Scholar] [CrossRef]
Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 6896–6904. [Google Scholar] [CrossRef]
Guo, Z.; Wang, L.; Yang, W.; Yang, G.; Li, K. LDFnet: Lightweight Dynamic Fusion Network for Face Forgery Detection by Integrating Local Artifacts and Global Texture Information. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1255–1265. [Google Scholar] [CrossRef]
Xiong, X.; He, M.; Li, T.; Zheng, G.; Xu, W.; Fan, X.; Zhang, Y. Adaptive Feature Fusion and Improved Attention Mechanism-Based Small Object Detection for UAV Target Tracking. IEEE Internet Things J. 2024, 11, 21239–21249. [Google Scholar] [CrossRef]
Wang, Z.; Liu, Z.; Xu, G.; Cheng, S. Object Detection in UAV Aerial Images Based on Improved YOLOv7-tiny. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; IEEE: New York, NY, USA, 2023. [Google Scholar]
Chen, S.; Cheng, T.; Fang, J.; Zhang, Q.; Li, Y.; Liu, W.; Wang, X. TinyDet: Accurate small object detection in lightweight generic detectors. arXiv 2023, arXiv:2304.03428. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
Li, X.; Wang, Y.; Wang, T.; Wang, R. Spatial frequency enhanced salient object detection. Inf. Sci. 2023, 647, 119460. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Li, J.; Wen, Y.; He, L. Scconv: Spatial and channel reconstruction convolution for feature redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Chen, J.; Liu, N.; Sun, H.; Wang, Y. Freq-DETR: Frequency-aware transformer for real-time small object detection in unmanned aerial vehicle imagery. Expert Syst. Appl. 2026, 298, 129710. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5260–5269. [Google Scholar]
Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 8–14 September 2014; pp. 740–755. [Google Scholar]
Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. Seadronessee: A maritime benchmark for detecting humans in open water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022. [Google Scholar]
Ultralytics. YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 2 December 2025).
Wang, L.; Sun, P.; Ge, Y.; Liu, W.; Gao, P.; Zhang, S. RT-DETR: Real-Time Detection Transformer. In Proceedings of the 12th International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
Mao, C.; Chen, H.; Song, S.; Shi, Y.; Xie, L.; Shen, C. YOLOv10: Detach and Deploy for End-to-End Real-Time Object Detection. arXiv 2024, arXiv:2405.14457. [Google Scholar]

Figure 1. Example of tiny object presence in maritime UAV imagery.

Figure 2. Illustration of tiny object presence in UAV imagery.

Figure 3. Architecture of the proposed tiny object detection system for maritime UAV images.

Figure 4. (a) The standard Bottleneck without SRU; (b) the Bottleneck-SRU with SRU integrated.

Figure 5. Overview of the proposed Directional Area Attention.

Figure 6. Overview of the proposed DABlock and D2C2f.

Figure 7. Example annotations from the COCO dataset [54].

Figure 8. Overview of icon representations for the 80 object categories in the COCO dataset [54].

Figure 9. Example image from the SeaDronesSee dataset.

Figure 10. PR curve analysis: mAP vs. F1-score in real-world detection scenarios.

Table 1. Statistical distribution of object areas in the COCO dataset.

Area (pixels²)	Objects	Ratio (%)
1²~4²	1354	0.16
4²~8²	13,353	1.57
8²~16²	59,201	6.97
16²~32²	139,220	16.38
32²~96²	299,628	35.25
96²~∞	337,191	39.67

Table 2. Object count distribution in the SeaDronesSee dataset.

Class	Label Number
swimmer	37,081
boat	13,022
jetski	2330
life_saving_appliances	923
buoy	4388

Table 3. Statistical distribution of object areas in the SeaDronesSee dataset.

Area (pixels²)	Objects	Ratio (%)
1²~4²	3351	5.75
4²~8²	15,918	27.29
8²~16²	21,418	36.72
16²~32²	12,086	20.72
32²~96²	4934	8.46
96²~∞	302	0.52
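
The area bins used in Tables 1 and 3 can be reproduced with a short script. The following is a minimal sketch, assuming each bounding box is given as a (width, height) pair in pixels and that each bin is the half-open interval [lo², hi²). The function names and the boundary handling are illustrative assumptions, not the authors' procedure.

```python
from collections import Counter

# Bin edges as side lengths in pixels; area thresholds are their squares.
EDGES = [1, 4, 8, 16, 32, 96]
LABELS = ["1²~4²", "4²~8²", "8²~16²", "16²~32²", "32²~96²", "96²~∞"]


def area_bin(width, height):
    """Return the label of the area bin that contains a width x height box."""
    area = width * height
    # Walk the edges from largest to smallest; the first edge reached wins.
    for edge, label in zip(reversed(EDGES), reversed(LABELS)):
        if area >= edge ** 2:
            return label
    return None  # area below 1 pixel²: outside every bin (assumed handling)


def area_distribution(boxes):
    """Map each bin label to (count, ratio in percent) for the given boxes."""
    counts = Counter(area_bin(w, h) for w, h in boxes)
    counts.pop(None, None)  # drop boxes that fell outside all bins
    total = sum(counts.values())
    return {
        label: (
            counts[label],
            round(100.0 * counts[label] / total, 2) if total else 0.0,
        )
        for label in LABELS
    }
```

Applied to the (width, height) pairs of a dataset's annotations, `area_distribution` returns one row per bin in the same order as the tables. The exact counts depend on how boundary values and sub-pixel boxes are treated.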

Table 4. Summary of the results of an ablation study using the YOLOv8-Tiny baseline.

Model Variant	Parameters (M)	mAP_50:95 (%)	mAP₅₀ (%)	AP_s (%)	Improvement in AP_s
YOLOv8-Tiny Baseline	1.1	25.4	45.8	10.2	–
Baseline + HRFF	1.2	26.9	47.1	11.5	+1.3
Baseline + DAA	1.1	28.1	48.9	12.8	+2.6
YOLO-DAA (Baseline + HRFF + DAA)	1.2	29.5	50.5	15.1	+4.9

Table 5. YOLO-DAA vs. YOLOv8 model complexity comparison.

Model	Size (Pixels)	Parameters (M)	GFLOPs
YOLOv8-n	640	3.0	8.2
YOLOv8-s	640	11.1	28.7
YOLOv8-l	1280	43.6	330.9
YOLO-DAA-n	640	2.3	6.4
YOLO-DAA-s	640	9.8	23.6
YOLO-DAA-l	1280	30.9	207.4
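
The savings implied by Table 5 can be expressed as relative reductions against the YOLOv8 model of the same scale. The snippet below is a small sketch of that arithmetic. The numbers are copied from Table 5, while the dictionary layout and helper names are hypothetical.

```python
# (parameters in millions, GFLOPs) per scale, copied from Table 5.
TABLE5 = {
    "n": {"YOLOv8": (3.0, 8.2), "YOLO-DAA": (2.3, 6.4)},
    "s": {"YOLOv8": (11.1, 28.7), "YOLO-DAA": (9.8, 23.6)},
    "l": {"YOLOv8": (43.6, 330.9), "YOLO-DAA": (30.9, 207.4)},
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def summarize(table=TABLE5):
    """Return {scale: (parameter reduction %, GFLOPs reduction %)}."""
    summary = {}
    for scale, models in table.items():
        base_params, base_flops = models["YOLOv8"]
        daa_params, daa_flops = models["YOLO-DAA"]
        summary[scale] = (
            relative_reduction(base_params, daa_params),
            relative_reduction(base_flops, daa_flops),
        )
    return summary


if __name__ == "__main__":
    for scale, (params_pct, flops_pct) in summarize().items():
        print(f"{scale}: parameters -{params_pct:.1f}%, GFLOPs -{flops_pct:.1f}%")
```

Reporting reductions relative to the same-scale baseline keeps the n, s, and l variants comparable even though the l models use a different input size.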

Table 6. YOLOv8 anchor comparison on the SeaDronesSee dataset.

Model	mAPval@.5:.95
YOLOv8-n	38.7%
YOLOv8-s	41.7%
YOLO-DAA-n	39.1%
YOLO-DAA-s	48.1%

Table 7. Comparison between YOLOv12-turbo and YOLO-DAA on the SeaDronesSee dataset.

Model	Size (Pixels)	AP₉₅	AP₅₀	AP₇₅	AR₁	AR₁₀
YOLOv8-n	640	7.20%	13.21%	7.11%	6.42%	10.42%
YOLOv8-s	640	10.00%	22.01%	7.20%	16.03%	36.24%
YOLOv8-l	1280	45.31%	62.74%	33.06%	31.53%	41.24%
YOLO-DAA-n	640	17.07%	33.61%	15.08%	16.98%	23.67%
YOLO-DAA-s	640	31.97%	60.87%	30.80%	29.12%	38.57%
YOLO-DAA-l	1280	53.53%	82.00%	58.15%	43.48%	61.08%

Table 8. Comparison of the proposed YOLO-DAA framework against selected SOTA lightweight detectors.

Model	Backbone	Parameters (M)	mAP_50:95 (%)	AP_s (%)	FPS (Jetson Orin)
EfficientDet-D0 [3]	EfficientNet	3.9	23.5	10.9	38
YOLOv8n [56]	Darknet	3.2	26.2	11.8	112
RT-DETR-R18 [57]	ResNet-18	13.8	28.8	13.5	55
YOLOv10n [58]	Darknet	3.5	27.5	12.2	95
YOLO-DAA (Proposed)	Darknet + DAA	3.5	30.1	16.5	98

Table 9. YOLO-DAA suitability in comparison with prominent lightweight UAV object detection models.

Model	Architecture Type	Core Tiny Object Strategy	DAA Advantage over Model
YOLO-DAA (Proposed)	CNN, Anchor-Free	DAA (Directional Attention) + HRFF (Extra High-Res Head)	N/A (reference model)
YOLOv8n (Y2)	CNN, Anchor-Free	Standard P3-P5 Feature Maps	Targeted feature enhancement: YOLOv8n uses standard C2f modules. YOLO-DAA replaces these with the DAA module, which provides explicit directional and localized attention to isolate sparse, tiny targets from the sea/sky background, a capability YOLOv8n lacks. HRFF also provides finer-grained resolution than YOLOv8n’s smallest standard detection head.
RT-DETR (Y3)	Transformer (Encoder–Decoder)	Multi-Scale Deformable Attention in Decoder	Lighter design: even the small RT-DETR variant relies on a complex Transformer decoder, leading to higher latency and power consumption than the lean CNN-based YOLO-DAA. YOLO-DAA targets edge devices where Transformer overhead is prohibitive.
EfficientDet-D0 (Y1)	CNN, Anchor-Based	BiFPN for Multi-Scale Fusion	Modern backbone and efficiency: EfficientDet is an older architecture (2020) that uses slower anchor-based detection and simpler attention in its EfficientNet backbone. YOLO-DAA uses the faster anchor-free structure of YOLOv8 and the computationally efficient DAA module, providing higher inference speed for a given AP.
YOLOv10 (Y4)	CNN, Anchor-Free (NMS-free Inference)	Standard Feature Fusion	Feature quality vs. inference speed: YOLOv10’s main innovation is eliminating Non-Maximum Suppression (NMS) to gain speed, while its feature extraction path remains largely conventional. YOLO-DAA’s DAA and HRFF modifications are designed to improve feature quality for tiny objects, resulting in higher tiny-object mAP at the same performance tier, a critical advantage for highly sparse datasets.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

