Article

AI-Native Multi-Scale Attention Fusion for Ubiquitous Aerial Sensing: Small Object Detection in UAV Imagery

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
2 Aviation University of Air Force, Changchun 130012, China
3 Northwest Institute of Nuclear Technology, Xi’an 710024, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 1100; https://doi.org/10.3390/electronics15051100
Submission received: 31 January 2026 / Revised: 26 February 2026 / Accepted: 28 February 2026 / Published: 6 March 2026

Abstract

Ubiquitous aerial sensing with unmanned aerial vehicles (UAVs) is becoming an essential component of AI-native perception systems, motivated by the trend toward edge deployment and potential integration with future sixth-generation (6G)-connected aerial networks. However, small object detection in high-altitude UAV imagery remains highly challenging due to the extremely low pixel occupancy of targets and the severe multi-scale interference introduced by complex backgrounds. In this work, we therefore focus on improving the perception-side accuracy and computational efficiency of small-object detection in UAV imagery. To address these limitations, we propose a Multi-scale Attention Fusion Network (MAF-Net), an AI-native paradigm for real-time small object detection in UAV imagery. The proposed approach enhances small-target representation and robustness through three key designs. First, a density-adaptive anchor optimization strategy is developed by combining K-means++ clustering with an IoU-based distance metric, enabling anchors to better match scale variation under diverse object densities. Second, a multi-scale feature reinforcement module is introduced to strengthen fine-grained detail preservation by integrating shallow feature maps via skip connections and hierarchical aggregation. Third, a dual-path attention mechanism is employed to jointly model channel importance and spatial localization, improving discriminative feature calibration in cluttered aerial scenes. Extensive experiments on three public benchmarks (AI-TOD, DOTA, and RSOD) demonstrate that MAF-Net consistently outperforms the baseline detector, achieving mAP@0.5 gains of 14.1%, 11.28%, and 22.09%, respectively. These results confirm that MAF-Net provides an effective and deployment-friendly solution for robust small object detection, supporting real-time UAV-based inspection and AI-native ubiquitous aerial sensing applications.

1. Introduction

Unmanned aerial vehicle (UAV)-based aerial sensing has rapidly evolved into a key component of ubiquitous sensing for modern intelligent systems, enabling applications such as infrastructure inspection, public safety monitoring, precision agriculture, and emergency response [1,2,3,4,5,6]. With the emerging trend of AI-native computing, perception models are increasingly pushed to resource-constrained edge platforms, and future sixth-generation (6G) aerial networks may further enable collaborative sensing among UAVs. These trends motivate the need for accurate yet efficient on-board object detection from high-resolution drone imagery [7]. Meanwhile, recent AI-oriented system studies have also highlighted the infrastructure-side constraints that accompany AI proliferation [8,9,10,11,12,13].
Object detection aims to simultaneously localize and classify objects in images, and recent deep learning detectors have achieved impressive performance on standard benchmarks such as Microsoft Common Objects in Context (MS COCO) [14] and Pascal Visual Object Classes (Pascal VOC) [15]. However, these detectors often show a noticeable performance drop when applied to UAV remote sensing imagery, especially for small object detection. This degradation is primarily caused by the unique imaging characteristics of high-altitude aerial scenes: (i) extremely low pixel occupancy, where targets may occupy fewer than 10 pixels in width/height; (ii) severe scale variation within the same image due to viewpoint changes and altitude differences; and (iii) complex background clutter with strong appearance similarity to targets [16]. In practical UAV reconnaissance scenarios, the ratio of target area to image area can be well below 0.5%, making feature extraction and discrimination fundamentally challenging.
Existing detectors struggle in such cases, mainly because small targets are easily suppressed by deep downsampling operations and are prone to being overwhelmed during multi-scale feature fusion. Moreover, conventional anchor configurations are typically optimized for natural images and do not match the density and scale distributions of aerial targets, leading to poor localization and low recall in crowded scenes. These limitations indicate that UAV small object detection requires dedicated designs that preserve fine-grained features, enhance multi-scale perception, and adapt to density-dependent target patterns while maintaining real-time efficiency for deployment in aerial sensing systems.
To address these challenges, we propose MAF-Net (Multi-scale Attention Fusion Network), a real-time detection framework tailored for small object detection in UAV imagery. Specifically, MAF-Net is designed to resolve three key bottlenecks in high-altitude UAV scenes: (i) tiny-object features are easily diminished by deep downsampling; (ii) conventional anchor priors are mismatched with the scale–density distribution of aerial targets, especially in crowded scenes; and (iii) cluttered backgrounds lead to ambiguous localization and weak target–background separability. Accordingly, MAF-Net improves small-target sensitivity and robustness through three aspects:
  • Shallow feature integration for small targets: We strengthen fine-grained representation by introducing an additional high-resolution 160 × 160 detection head and reinforcing shallow feature maps via skip connections and hierarchical aggregation, preserving spatial details critical for tiny objects.
  • Density-adaptive anchor optimization: We introduce an anchor generation strategy based on K-means++ clustering with an IoU-based distance metric, enabling anchors to better match dataset-specific scale variation under diverse object densities and improving recall in crowded aerial scenes.
  • Dual-path attention for robust discrimination: We employ a lightweight dual-path attention mechanism that jointly models channel-wise importance and spatial localization, combining variance-aware hybrid attention enhancement (HAE) and coordinate-decoupled geometric attention (AGD) to improve feature recalibration and target–background separability in cluttered environments.
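The density-adaptive anchor strategy in the second bullet can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's released implementation: `iou_wh` and `kmeanspp_anchors` are hypothetical helper names, boxes are compared by width/height only (both aligned at the origin), and the clustering distance is 1 − IoU as described.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors compared by width/height only
    (both aligned at the origin); returns shape (n_boxes, n_anchors)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeanspp_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs with k-means++ seeding and a
    1 - IoU distance; returns k anchor shapes sorted by area."""
    rng = np.random.default_rng(seed)
    # k-means++ seeding: first center uniform, later centers sampled
    # proportionally to squared distance to the nearest chosen center.
    centers = [wh[rng.integers(len(wh))]]
    for _ in range(1, k):
        d = 1.0 - iou_wh(wh, np.array(centers)).max(axis=1)
        probs = d**2 / (d**2).sum()
        centers.append(wh[rng.choice(len(wh), p=probs)])
    centers = np.array(centers, dtype=float)
    # Lloyd iterations with assignment by highest IoU (lowest 1 - IoU).
    for _ in range(iters):
        assign = iou_wh(wh, centers).argmax(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]
```

In crowded aerial scenes this tends to allocate several small anchor shapes to the dense low-area mode of the width/height distribution, which is the behavior the density-adaptive design targets.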
Extensive experiments on public UAV remote sensing benchmarks demonstrate that MAF-Net achieves consistent improvements over the baseline detector under challenging small-object settings while maintaining real-time inference efficiency. We validate our method primarily from the perception perspective, reporting detection accuracy and on-device efficiency metrics; end-to-end communication-aware latency under bandwidth constraints is not explicitly modeled in our experiments. These results highlight the effectiveness of AI-native multi-scale attention fusion for ubiquitous aerial sensing tasks.
Additionally, although MAF-Net is built upon a YOLOv7 baseline, it is not a straightforward “small head + attention + anchor tuning” recipe. Firstly, our attention is variance-aware: we explicitly introduce variance pooling (beyond average/max pooling) to capture local intensity/texture dispersion that is prominent in UAV small objects and cluttered backgrounds, and we combine it with a coordinate-decoupled geometric attention to strengthen position-sensitive cues. Furthermore, our anchors are density-adaptive: instead of applying generic auto-anchor settings, we design a K-means++ clustering with an IoU-based distance tailored to aerial scale–density distributions, improving matching in crowded scenes. Finally, the added high-resolution small-object head is coupled with a lightweight multi-scale fusion design to preserve shallow spatial details while maintaining real-time efficiency, as verified by ablation studies.
Thus, we employ an AI-native approach, which refers to a data-driven perception pipeline in which object detection is implemented as a first-class on-board AI function and is explicitly designed for edge deployment constraints (e.g., real-time inference and limited compute/memory) rather than a hand-crafted or rule-based vision pipeline. The main contributions of this work are summarized as follows:
  • We propose MAF-Net, a real-time detection framework that enhances small object sensitivity by combining shallow feature reinforcement, density-adaptive anchor design, and dual-path attention fusion.
  • We provide a systematic design and analysis of feature aggregation strategies that better preserve small-target spatial cues in high-resolution UAV imagery.
  • We validate the proposed method on multiple benchmark datasets, demonstrating robust performance gains for small object detection under complex aerial backgrounds.

2. Related Work

Small object detection has long been recognized as a challenging problem in object detection, especially in aerial remote sensing where objects are tiny, densely distributed, and embedded in complex backgrounds. In recent years, deep neural network-based detectors have become the dominant paradigm and can generally be categorized into two-stage and one-stage frameworks. This section reviews representative approaches and discusses their limitations in UAV small object detection scenarios, integrating the latest advances in remote sensing-specific small object detection.

2.1. One-Stage Object Detectors

One-stage detectors directly predict categories and bounding boxes without explicit proposal generation, offering faster inference and making them attractive for real-time aerial sensing. The YOLO family is a representative line of one-stage detectors. YOLOv1 formulates detection as a regression problem with high speed but limited localization accuracy for small or crowded objects [17]. Subsequent versions, including YOLOv2 and YOLOv3, improve detection performance via anchor-based prediction and multi-scale heads, with YOLOv3 introducing feature pyramid-style outputs to better handle scale variation [18,19]. YOLOv4 and later versions, such as YOLOv5 [20], further incorporate advanced augmentation and feature fusion strategies (e.g., Mosaic, PAN/FPN), improving both accuracy and efficiency [21]. More recent designs such as YOLOv6, YOLOv7, and YOLOv8 continue to enhance real-time detection capability with optimized training strategies and architectural refinements [22,23]. The latest YOLOv10 further optimizes real-time end-to-end detection but still faces challenges in small object detection in remote sensing images due to background noise, information loss, and complex multi-object interactions [24].
Other one-stage detectors include SSD, which performs multi-scale prediction on different feature maps but relies heavily on manually designed anchor settings and often underperforms on small dense objects [25]. RetinaNet addresses class imbalance with focal loss but may not achieve real-time performance under high-resolution aerial inputs [26]. EfficientDet is built upon a weighted bidirectional feature pyramid (BiFPN) and employs compound scaling, offering improved feature fusion, but its computational cost can still be non-trivial for real-time UAV deployment [27]. Dynamic YOLO [28], designed for underwater small object detection, constructs a lightweight backbone based on DCN v3 and a unified feature fusion framework with channel-aware, scale-aware, and spatial-aware attention, which provides valuable insights for aerial small object detection.

2.2. Two-Stage Object Detectors

Two-stage detectors typically generate region proposals first and then refine them through classification and bounding-box regression, often achieving strong accuracy. Early works such as R-CNN introduced deep features into detection pipelines but suffered from multi-stage training and high computational overhead [29]. SPPNet improved efficiency by adopting spatial pyramid pooling to reduce redundant convolution operations, but it was not fully end-to-end trainable [30]. Fast R-CNN and Faster R-CNN further advanced two-stage detection by enabling shared feature extraction and introducing region proposal networks (RPNs), improving both accuracy and speed [31,32].
To enhance multi-scale representation, Feature Pyramid Networks (FPNs) construct pyramidal features to detect objects at different scales. While effective, downsampling in deeper layers still weakens fine-grained small-object cues, and performance may degrade under extreme tiny targets or dense distributions. Cascade R-CNN improves localization via progressive refinement with increasing IoU thresholds, but this cascaded design is typically less suitable for real-time UAV applications [33]. R-FCN introduces position-sensitive mechanisms to improve localization, whereas Mask R-CNN extends detection to instance segmentation, usually at higher computational cost [34,35]. TridentNet leverages multiple branches with different receptive fields for multi-scale detection, but its inference complexity limits real-time deployment [36].
Overall, two-stage detectors are often accurate but can be computationally heavy and may still struggle with extremely small aerial targets due to insufficient high-resolution feature preservation. Transformer-based two-stage detectors such as DETR [37] and Deformable DETR [38] use self-attention mechanisms to capture global context, but their performance on small objects is limited by insufficient local feature extraction. KANs-DETR [39] replaces fully connected layers with Kolmogorov–Arnold Networks (KANs) to enhance feature representation robustness, providing a new direction for optimizing two-stage detectors for small object detection.

2.3. Small Object Detection in Aerial and Construction Imagery: Methods, Challenges, and Research Gaps

Small object detection in UAV imagery introduces additional constraints beyond natural-image detection, including ultra-low pixel occupancy, severe background clutter, and strong scale-density variations across scenes [16]. To improve small-object sensitivity, researchers have explored feature pyramid enhancement, shallow feature reuse, attention mechanisms, and adaptive anchor design [35,36,40]. Attention-based modules can emphasize salient regions and suppress clutter, improving the detectability of small targets in large backgrounds [34,35]. Meanwhile, anchor optimization strategies aim to align prior box distributions with dataset-specific object scales and densities, improving localization and recall in crowded aerial scenes [41,42].
Recent progress in UAV-based small-object detection for construction monitoring has underscored the effectiveness of deep learning in addressing practical engineering challenges. However, these advancements also reveal unresolved issues that are central to our research. Wang [43] systematically evaluated ten traditional image augmentation methods (e.g., shearing, contrast adjustment, probabilistic sampling) for rebar counting in reinforced concrete (RC) structures, using Faster R-CNN and YOLOv10 with transformer backbones (ViT, PVT, Swin Transformer). Their work confirmed that augmentation efficacy is architecture-dependent—for example, shearing achieved the best performance on YOLOv10-PVT (AP50 = 87.71%, rebar count accuracy = 86.27%)—and emphasized that geometric distortions (e.g., translation, scaling) can degrade performance for thin, small objects like rebars. While this study advanced data augmentation for construction-specific small targets, it focused primarily on dataset expansion rather than optimizing feature extraction or attention mechanisms for ultra-small, low-pixel targets (≤32 × 32 pixels) common in high-altitude aerial sensing.
Complementary research on non-PPE identification in construction sites [44] further validated YOLOv10-based transformer architectures for small-object detection under complex on-site backgrounds but similarly relied on standard feature fusion pipelines that struggle with the extreme scale imbalance (e.g., 97.96% small targets in the AI-TOD dataset) and dense clutter in remote sensing scenarios. Collectively, these construction-focused studies [43,44] underscore the critical role of task-specific optimization (augmentation, backbone selection) for small targets, but they lack dedicated designs to address the three core challenges in high-altitude UAV imagery: extremely low feature signal-to-noise ratio (SNR), ambiguous spatial localization, and insufficient channel feature decoupling.
Motivated by these insights and the gaps in existing work, our work develops a real-time framework (MAF-Net) that combines shallow feature reinforcement, density-adaptive anchors, and a dual-path attention fusion mechanism. Unlike prior construction-related methods [43,44] that prioritize augmentation or backbone tuning, MAF-Net targets the root cause of small-object detection failure in aerial imagery—weak feature representation—by: (1) adding a shallow 160 × 160 detection layer to preserve fine-grained spatial cues; (2) optimizing anchors via K-means++ clustering with IoU-based distance metrics to match diverse object densities; (3) integrating a hybrid-attention encoder (HAE) and attention-guided decoder (AGD) to jointly model channel importance and coordinate-aware spatial localization. This design enables MAF-Net to outperform construction-specific baselines [43] and general aerial detectors [16,40] on ultra-small targets, supporting AI-native ubiquitous aerial sensing applications beyond construction scenarios.

3. The MAF-Net Model

3.1. Problem Analysis

This paper focuses on small-target detection from UAVs in high-altitude visible-light imagery. Under the formal framework of a target detection task, the input image space is defined as:
$I \in \mathbb{R}^{H \times W \times 3}$
where $H$ and $W$ denote the height and width of the image, respectively. The target detection process can be modeled as a mapping function:
$f: I \rightarrow B, \quad B = \{(b_i, l_i)\}_{i=1}^{N}$
where $B$ is the detection output set, containing bounding-box coordinates $b_i \in \mathbb{R}^4$ and corresponding category labels $l_i \in \mathbb{Z}^{+}$. Three quantitative criteria are commonly used to define tiny targets:
(1)
As shown in Equation (3), it is quantified as the ratio between the object bounding-box area and the total image area:
$\forall b_i \in B, \quad \mathrm{Area}(b_i) \le \epsilon \cdot H \cdot W \quad (\epsilon = 0.009)$
(2)
As shown in Equation (4), it is defined by the absolute value of the object bounding-box area:
$\forall b_i \in B, \quad \mathrm{Area}(b_i) \le 32 \times 32 \ \text{pixels}$
(3)
As shown in Equation (5), it is quantified by the ratios of the object bounding-box width and height relative to the image width and height, respectively:
$\min\!\left(\frac{\mathrm{width}(b_i)}{W}, \frac{\mathrm{height}(b_i)}{H}\right) \le \delta \quad (\delta = 0.1)$
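The three smallness criteria above can be expressed as a small helper function (an illustrative sketch; `is_small_object` and its argument names are not from the paper):

```python
def is_small_object(w, h, img_w, img_h,
                    eps=0.009, abs_area=32 * 32, delta=0.1):
    """Check a bounding box (w, h) in an (img_w, img_h) image against
    the three smallness criteria; returns one boolean per criterion."""
    rel_area = (w * h) <= eps * img_w * img_h          # Equation (3)
    absolute = (w * h) <= abs_area                     # Equation (4)
    rel_side = min(w / img_w, h / img_h) <= delta      # Equation (5)
    return rel_area, absolute, rel_side
```

For example, a 10 × 10 box in an 800 × 800 image satisfies all three criteria, while a 200 × 200 box satisfies none.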
The feature map $F \in \mathbb{R}^{C \times H \times W}$ of the remote sensing image $I$ is derived from the input image after feature extraction. As noted above, the small-object region is defined as $\Omega_s \subseteq \{1, 2, \dots, H\} \times \{1, 2, \dots, W\}$, and the background region is $\Omega_b = \Omega \setminus \Omega_s$ (where $\Omega$ is the full spatial domain), which satisfies $|\Omega_s| \ll |\Omega_b|$ (the spatial proportion of small objects is extremely low).
Features corresponding to the small-object region are formally decomposed as:
$F(c, i, j) = F_s(c, i, j) + F_b(c, i, j) + \varepsilon(c, i, j)$
where $F_s$ represents the true features of small-object regions, whose amplitude is significantly lower in magnitude than the background features $F_b$ ($\|F_s\| \ll \|F_b\|$), and $\varepsilon$ denotes the feature-extraction noise (information loss caused by convolution/pooling).
As discussed, small object detection is fundamentally challenged by three mathematical issues:
(1)
Extremely low feature signal-to-noise ratio (SNR):
$\mathrm{SNR} = \frac{\|F_s\|}{\|F_b\| + \|\varepsilon\|} \ll 1$
i.e., the object features are overwhelmed by the background and noise.
(2)
Ambiguous spatial localization: The spatial support set Ω s of small objects is excessively small. Traditional 2D global pooling leads to spatial information aliasing, failing to accurately distinguish the spatial positions of Ω s and Ω b .
(3)
Insufficient channel feature decoupling: Different channels contribute differently to the representation of small objects (e.g., edge/texture channels are the core features of small objects). However, without adaptive weights, effective channels are diluted by invalid ones.
The optimization objective of the detection task is to find an attention map $A \in \mathbb{R}^{C \times H \times W}$ with $0 \le A(c, i, j) \le 1$ such that the weighted feature $Y = A \odot F$ satisfies:
$A(c, i, j) \rightarrow 1 \ \text{for} \ (i, j) \in \Omega_s, \quad A(c, i, j) \rightarrow 0 \ \text{for} \ (i, j) \in \Omega_b, \quad \text{and} \quad \mathrm{SNR}(Y) \gg \mathrm{SNR}(F)$
where $\odot$ denotes element-wise multiplication (the Hadamard product).
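As a numerical illustration of this objective, the following toy example (hypothetical sizes and amplitudes, NumPy only) builds a feature map $F = F_s + F_b + \varepsilon$ with a tiny object region, applies an idealized attention map, and checks that the weighted feature has a much higher SNR:

```python
import numpy as np

def snr(F_s, F_b, eps):
    """Feature signal-to-noise ratio: ||F_s|| / (||F_b|| + ||eps||)."""
    return np.linalg.norm(F_s) / (np.linalg.norm(F_b) + np.linalg.norm(eps))

rng = np.random.default_rng(0)
C, H, W = 8, 32, 32
omega_s = np.zeros((H, W), dtype=bool)
omega_s[14:18, 14:18] = True          # tiny 4x4 object region

# Weak object signal, strong background, small extraction noise.
F_s = np.where(omega_s, 0.5, 0.0) * np.ones((C, H, W))
F_b = np.where(omega_s, 0.0, 2.0) * np.ones((C, H, W))
eps = 0.05 * rng.standard_normal((C, H, W))
F = F_s + F_b + eps

# Idealized attention map: close to 1 on the object region, close to 0
# elsewhere; Y = A (Hadamard) F decomposes into attended components.
A = np.where(omega_s, 0.95, 0.05)[None, :, :]
Y = A * F

print(snr(F_s, F_b, eps), "->", snr(A * F_s, A * F_b, A * eps))
```

With these (arbitrary) amplitudes, the attended SNR is roughly an order of magnitude larger than the raw one, matching the qualitative goal $\mathrm{SNR}(Y) \gg \mathrm{SNR}(F)$.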

3.2. Base Model Selection and Justification

3.2.1. Rationale for Selecting YOLOv7: Core Advantages Aligned with Our Problem Requirements

Accuracy–Efficiency Trade-off: Compared to its predecessors (YOLOv5–v6) and newer variants (YOLOv8–v10), YOLOv7 strikes a more favorable balance between detection accuracy and inference speed—critical for real-time small object detection in resource-constrained environments. For example, on the VisDrone2018 dataset (dominated by small UAV-captured objects), YOLOv7-tiny achieves 38.7% mAP@0.5 with 110 FPS on an NVIDIA Tesla V100, outperforming YOLOv5s (35.9% mAP@0.5, 105 FPS) and YOLOv6s (36.8% mAP@0.5, 98 FPS) [23].
Small-Object Detection Enhancement: YOLOv7 optimizes the neck structure with multi-scale feature fusion (PANet + FPN) and adaptive anchor matching, which mitigates feature loss for small objects during downsampling. Unlike YOLOv5 (which relies on simple Concat operations for feature fusion) and YOLOv8 (which simplifies the neck to focus on speed), YOLOv7 retains a more comprehensive feature pyramid that preserves fine-grained details of small targets. This design aligns with the primary challenge of small object detection in remote sensing—preserving spatial details while fusing multi-scale semantic information.
Deployment Flexibility: YOLOv7 can be deployed on edge devices with only minimal performance degradation, rendering it well-suited for airborne drone detection or industrial real-time detection scenarios—scenarios that our approach is designed to address. This flexibility is particularly valuable when compared to heavier detectors, such as DETR variants, which are challenging to deploy at the edge [28,39].

3.2.2. Comparative Analysis Against Representative Alternatives

To further validate the choice of YOLOv7, we compare it against three categories of state-of-the-art detectors, drawing key results from the literature:
(1)
Newer YOLO Variants (YOLOv8–v10)
Newer YOLO versions (v8–v10) introduce architectural refinements but often sacrifice small-object performance or efficiency for overall accuracy:
  • YOLOv8 [45]: Adopts a C2f backbone and simplified neck, achieving 41.2% mAP@0.5 on COCO but only 26.9% mAP@0.5 on the AI-TOD dataset (average object size = 12.8 pixels).
  • YOLOv9 [46]: Integrates transformer modules for global feature capture but increases computational complexity, making it less suitable for edge deployment. Its small-object performance (AP_S = 28.1% on COCO) is only marginally better than YOLOv7 (AP_S = 27.6%), at the cost of 30% higher GFLOPs.
  • YOLOv10 [24]: Introduces end-to-end NMS-free detection and multimodal fusion, achieving 50.02% mAP@0.5 on AI-TOD. However, its reliance on transformer backbones (e.g., ViT) increases latency, and it struggles with ultra-small objects (<32 × 32 pixels) due to excessive downsampling.
(2)
Two-stage detectors, typified by Faster R-CNN and related variants
Two-stage detectors are known for high accuracy but fail to meet real-time and deployment requirements:
  • Faster R-CNN [32]: Achieves 42.0% mAP@0.5 on COCO but only 26.6% AP_S (small objects) due to feature loss in deep downsampling. Its two-stage pipeline (region proposal + refinement) results in low inference speed (26 FPS), making it unsuitable for real-time scenarios such as UAV patrols [37].
  • Cascade R-CNN [33]: Improves localization accuracy (44.0% mAP@0.5 on COCO) but further increases computational overhead (180G FLOPs), with FPS dropping to 18. It also struggles with dense small objects in remote sensing images, as region proposals are prone to overlapping and missing tiny targets [47].
In contrast, YOLOv7’s one-stage design achieves comparable small-object accuracy (AP_S = 27.6% on COCO) while running 3× faster (85 FPS for YOLOv7s) and requiring 70% fewer parameters (7.1 M vs. 24 M for Faster R-CNN), making it more suitable for our real-time deployment needs.
(3)
DETR-Based Architectures
DETR and its variants leverage transformers for global feature capture but face limitations in small-object detection and efficiency:
  • DETR [37]: The original DETR achieves 42.0% mAP@0.5 on COCO but only 20.5% AP_S, as its global attention mechanism struggles with low-resolution small-object features. It requires 500 training epochs to converge (10× longer than YOLOv7) and runs at 28 FPS, too slow for real-time applications [38].
  • Deformable DETR [38]: Introduces deformable attention to focus on key sampling points, improving AP_S to 26.4% on COCO. However, it still lags behind YOLOv7 (AP_S = 27.6%) and has higher computational complexity (173G FLOPs vs. 45G for YOLOv7s), with inference speed limited to 19 FPS. Its multi-scale deformable attention module enhances small-object detection but increases memory access overhead, making it less efficient than YOLOv7’s E-ELAN.
  • KANs-DETR [39]: Replaces fully connected layers with Kolmogorov–Arnold Networks (KANs) to enhance small-object robustness, achieving 30.6% AP_S on COCO. However, its complex transformer encoder increases parameters to 12.8 M (80% more than YOLOv7s) and reduces inference speed to 22 FPS, making it less suitable for edge deployment.
  • ISO-DETR [48]: A DETR variant for industrial small object detection, replacing one-to-one matching with IoU-based many-to-one assignment. It achieves 77.3% mAP@0.5 on a custom industrial dataset but runs at 22 FPS (the same as KANs-DETR) and requires 39 M parameters, 5× more than YOLOv7-tiny.

3.3. Modeling

Building on the YOLOv7 baseline, as illustrated in Figure 1, we redesign the network to better handle small objects in UAV aerial images. First, we add a dedicated detection head tailored to small-scale targets, improving localization and classification under limited pixel support. Second, anchor generation is performed with K-means++ rather than conventional K-means, reducing sensitivity to initialization and avoiding unstable clustering results when the preset cluster configuration is sub-optimal. Third, we embed a coordinate-aware attention module to reinforce informative spatial–channel cues and suppress background interference, leading to stronger feature representations for tiny objects. Together, these modifications substantially boost performance on small-object detection.
The improved complete network model is shown in Figure 2. In the following three sections, the above three parts will be introduced in sequence, namely the improvement of anchor box size, the addition of a small object detection layer, and the coordinate attention mechanism.

3.4. Incorporating an Additional Layer for Small Object Detection

Directly applying standard object detection pipelines to small-object scenarios typically results in higher miss rates and frequent misclassification. A key reason is the feature extraction strategy adopted by most modern detectors: deep, multi-stage convolutional backbones progressively downsample the input to obtain more abstract semantic representations. While deeper layers provide stronger semantics, repeated down-sampling inevitably reduces the spatial granularity of feature maps. For small targets that occupy only a few pixels, this loss of spatial detail can erase or severely weaken the discriminative cues needed for reliable localization and recognition, thereby degrading detection performance. For an input image of size 640 × 640 , Table 1 reports the corresponding feature-map resolutions and effective receptive-field sizes under different down-sampling ratios.
In UAV high-altitude remote sensing images of size 800 × 800, most (over 90%) target bounding boxes occupy an area of less than 320 pixels, only about 0.5% of the total image area; the length or width of a target box can be under 10 pixels. As Table 1 shows, for the 20 × 20, 40 × 40, and 80 × 80 feature maps the receptive field is larger than or comparable to the small-target size, so the features of such targets are largely lost regardless of how feature extraction is performed. We therefore need a shallow 160 × 160 feature map to extract the features of the detected target and improve the detection rate for small targets.
In standard CNN backbones, hierarchical representations are formed through successive convolutional transformations applied layer by layer. Specifically, the feature tensor at depth $l$ is computed from the previous layer as
$F^{(l)} = \sigma\!\left(W^{(l)} * F^{(l-1)} + b^{(l)}\right),$
where $W^{(l)}$ and $b^{(l)}$ denote the convolutional weights and bias at the $l$-th layer, and $\sigma(\cdot)$ is a nonlinear activation function. The resulting feature map is $F^{(l)} \in \mathbb{R}^{h_l \times w_l \times d_l}$, while the input to this layer is $F^{(l-1)} \in \mathbb{R}^{h_{l-1} \times w_{l-1} \times d_{l-1}}$, with $h_l$, $w_l$, and $d_l$ indicating the spatial dimensions and channel depth, respectively. In typical detection backbones, spatial resolution halves at each stage due to repeated stride-2 downsampling; after $l$ stages, the cumulative reduction factor can be written as
$s_l = \prod_{k=1}^{l} 2 = 2^{l},$
which yields the following relationships:
$h_l = \frac{H}{s_l}, \qquad w_l = \frac{W}{s_l}$
When the input size is $H = W = 640$, the resolution of deep feature maps drops drastically (see Table 1), resulting in a mismatch between deep-layer resolution and tiny-object scale. As illustrated in Figure 3, the newly added small object detection layer yields a feature map of size 160 × 160 after 4-fold downsampling of the 640 × 640 input. All subsequent detection layers employ 2-fold downsampling, thereby sequentially generating feature maps with dimensions of 80 × 80, 40 × 40, and 20 × 20.
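These resolution relationships can be verified with a few lines of Python (a simple sketch; `feature_map_sizes` is an illustrative helper, with each stride equal to the cumulative downsampling factor $s_l = 2^l$):

```python
def feature_map_sizes(H, W, strides=(4, 8, 16, 32)):
    """Spatial size of each detection-head feature map, where each
    stride is the cumulative downsampling factor s_l = 2**l."""
    return [(H // s, W // s) for s in strides]

# For a 640 x 640 input, the four heads (including the added shallow
# one at stride 4) see 160x160, 80x80, 40x40, and 20x20 feature maps.
print(feature_map_sizes(640, 640))
```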

3.5. Hybrid-Attention Encoder

As convolutional neural networks continue to evolve, attention mechanisms have become a key technique for enhancing feature representation in visual tasks. Among them, the Convolutional Block Attention Module (CBAM), which sequentially combines channel and spatial attention, has been widely adopted in general object detection frameworks. However, in the specific context of high-altitude remote sensing imagery—characterized by small targets, complex backgrounds, and significant scale variations—CBAM exhibits notable limitations. Primarily, its spatial attention mechanism relies on globally aggregated features, which often fail to capture the localized activations of small objects. Since small targets occupy only a sparse set of pixels in the feature map, their representational strength tends to be diluted by dominant background regions during global pooling operations. Consequently, the generated attention maps may inadequately highlight regions containing small objects, thereby diminishing detection sensitivity. Furthermore, the fixed sequential processing of channel and spatial attention may lead to mutual interference; for instance, channel-refined features could be excessively smoothed during subsequent spatial reweighting, especially when target-related activations are weak.
To address these shortcomings, we introduce variance pooling as a complementary operator to enhance CBAM’s capacity for small target detection. Unlike global pooling, variance pooling computes statistical variance within local regions, thereby quantifying the intensity of activation fluctuations at a finer spatial granularity. In remote sensing imagery, small targets such as vehicles or building edges typically correspond to high-frequency patterns with strong local contrast, resulting in higher variance values compared to homogeneous background areas. By integrating variance pooling into the channel attention branch—either as a replacement for or in parallel with classical global pooling—we can better preserve and amplify the feature responses of small targets. This adaptation allows the attention mechanism to more effectively distinguish between informative high-variance regions and redundant low-variance regions, thereby enhancing the discriminability of subtle yet critical details.
The advantages of incorporating variance pooling are threefold. First, it strengthens local discriminability by emphasizing regions with high internal variation, which often coincide with small object boundaries and textures. Second, it maintains spatial fine-grained information through localized statistical computations, mitigating the information loss caused by successive downsampling operations. Third, variance pooling exhibits greater robustness to scale variations, as it responds to relative contrast within a region rather than absolute activation magnitude, thus rendering it highly suitable for multi-scale small object detection scenarios. In our modified attention module, variance-aware features are fused with channel-wise and spatial-wise attention pathways, enabling more balanced and target-sensitive feature recalibration. Experimental results on aerial remote sensing datasets demonstrate that the proposed variance-augmented attention mechanism yields consistent improvements in recall and precision for small object detection, validating its efficacy in addressing the limitations of conventional CBAM under challenging remote sensing conditions.

3.5.1. Hybrid-Attention Encoder Channel Attention Module

As illustrated in Figure 4, the Hybrid-Attention Encoder Channel Attention Module is designed to enhance channel-wise feature representation by jointly exploiting multiple global statistical cues. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, three parallel pooling operations—global average pooling, global max pooling, and global variance pooling—are first applied along the spatial dimensions. These operations compress the feature map into three channel descriptors of size $1 \times 1 \times C$, each capturing complementary information: average pooling encodes global contextual responses, max pooling emphasizes the most salient activations, and variance pooling reflects the distribution and contrast of feature responses. The resulting descriptors are then fed into a shared multi-layer perceptron (MLP) to model inter-channel dependencies in a parameter-efficient manner. The MLP outputs corresponding channel-wise responses, which are combined through element-wise addition, and a sigmoid activation function then produces normalized channel attention weights. These weights adaptively recalibrate the input feature map by emphasizing informative channels and suppressing those with lower relevance.
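A minimal PyTorch sketch of this channel attention branch, with variance pooling used alongside average and max pooling (illustrative only; the class name and the reduction ratio of 16 are our assumptions):

```python
import torch
import torch.nn as nn

class VarianceAwareChannelAttention(nn.Module):
    """Channel attention sketch: avg/max/variance global pooling feed a
    shared MLP; the three responses are summed and passed through a sigmoid.
    Illustrative, not the authors' exact implementation."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        flat = x.flatten(2)                      # B x C x (H*W)
        p_avg = flat.mean(dim=2)                 # global average pooling
        p_max = flat.amax(dim=2)                 # global max pooling
        p_var = flat.var(dim=2, unbiased=False)  # global variance pooling
        attn = torch.sigmoid(self.mlp(p_avg) + self.mlp(p_max) + self.mlp(p_var))
        return x * attn.view(b, c, 1, 1)         # recalibrate channels

x = torch.randn(2, 64, 32, 32)
out = VarianceAwareChannelAttention(64)(x)       # same shape as input
```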

3.5.2. Hybrid-Attention Encoder Spatial Attention Module

Figure 5 depicts the structure of the Hybrid-Attention Encoder Spatial Attention Module, which is designed to enhance spatial feature representation by emphasizing informative regions. Given an input feature map F, three parallel pooling operations—max pooling, average pooling, and variance pooling—are executed along the channel dimension to capture complementary spatial information. Max pooling highlights the most salient activations, average pooling encodes overall contextual information, and variance pooling reflects the spatial distribution and contrast of features. The resulting pooled feature maps are concatenated and fed into a convolutional layer to fuse spatial information and model local dependencies. Finally, a sigmoid activation function is employed to generate a spatial attention map, which adaptively reweights the input feature map to enhance target regions and suppress irrelevant background responses.
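The spatial branch can be sketched analogously, concatenating channel-wise max/average/variance maps before a fusing convolution (illustrative; the $7 \times 7$ kernel is a common choice we assume, not stated in the text):

```python
import torch
import torch.nn as nn

class VarianceAwareSpatialAttention(nn.Module):
    """Spatial attention sketch: channel-wise max/avg/variance maps are
    concatenated and fused by a k x k convolution; a sigmoid yields the
    spatial attention map. Illustrative, not the authors' exact code."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s_max = x.amax(dim=1, keepdim=True)                  # B x 1 x H x W
        s_avg = x.mean(dim=1, keepdim=True)
        s_var = x.var(dim=1, keepdim=True, unbiased=False)
        attn = torch.sigmoid(self.conv(torch.cat([s_max, s_avg, s_var], dim=1)))
        return x * attn                                       # reweight spatially

x = torch.randn(2, 64, 32, 32)
out = VarianceAwareSpatialAttention()(x)
```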

3.5.3. Feature Modeling with HAE Module

Figure 6 illustrates the overall architecture of the proposed Hybrid-Attention Encoder. Given an input feature map, the encoder sequentially applies channel attention and spatial attention to enhance feature representation. First, the input feature map is processed by the Hybrid-Attention Encoder Channel Attention Module, which learns channel-wise importance weights and recalibrates feature responses accordingly through element-wise multiplication. The refined feature map is then fed into the Hybrid-Attention Encoder Spatial Attention Module, where spatial dependencies are modeled to emphasize informative regions while suppressing irrelevant background responses. Another element-wise multiplication is performed to generate the final output feature map. By cascading channel and spatial attention mechanisms, the Hybrid-Attention Encoder effectively captures both inter-channel relationships and spatial context, resulting in more discriminative and robust feature representations for downstream detection tasks.
The overall output of the Hybrid-Attention Encoder is computed as

$$F_{HAE} = A_s \otimes (A_c \otimes F), \qquad A_c \in \mathbb{R}^{C \times 1 \times 1}, \; A_s \in \mathbb{R}^{1 \times H \times W}$$

where $\otimes$ denotes element-wise multiplication with broadcasting.

3.6. Attention-Guided Decoder

3.6.1. Feature Modeling with AGD Module

According to Equation (12), $A_s$ in the hybrid-attention encoder is a global spatial attention: it models only the global spatial correlation over $H \times W$, without decoupling the positional information of the abscissa $x$ and the ordinate $y$, and it lacks scale-adaptive local modeling. However, the core task of the object detection head is accurate regression of the candidate box $(x_1, y_1, x_2, y_2)$ and position-sensitive classification of the object center/edge, and global spatial modeling cannot satisfy the detection head's need for fine-grained coordinate information.
The core of the attention-guided decoder lies in decoupling channel attention from coordinate position information. It performs one-dimensional pooling (along the row/column direction) on features to retain coordinate information and ultimately outputs joint channel-coordinate attention features. These features perfectly match the requirements of the detection head for location-sensitive features. The attention-guided decoder structure is shown in Figure 7.
For the feature map F R C × H × W generated by the backbone and neck networks, the attention-guided decoder performs processing in three sequential steps: coordinate information embedding, coordinate attention generation, and adaptive feature weighting, with the final output denoted as F A G D R C × H × W :
Step 1: Coordinate Information Embedding (1D row/column pooling, preserving x / y positional information)
Different from the 2D global pooling in HAE, AGD performs row-wise global average pooling $G_c^y$ and column-wise global average pooling $G_c^x$ for each channel $c$, which preserves the positional information of the ordinate $y$ and abscissa $x$, respectively:

$$G_c^y(i) = \frac{1}{W} \sum_{j=1}^{W} F_{c,i,j}, \qquad G^y \in \mathbb{R}^{C \times H \times 1}$$

$$G_c^x(j) = \frac{1}{H} \sum_{i=1}^{H} F_{c,i,j}, \qquad G^x \in \mathbb{R}^{C \times 1 \times W}$$

where $i \in [1, H]$ denotes the row index (ordinate $y$) and $j \in [1, W]$ denotes the column index (abscissa $x$).
Step 2: Coordinate Attention Generation (decoupling attention weights for x / y )
The concatenated $G^y$ and $G^x$ are dimensionally reduced via a multilayer perceptron (MLP) and then decoupled into row attention $M^y$ and column attention $M^x$, which model the positional correlation of the ordinate and abscissa, respectively:

$$Z = \delta\left(\mathrm{Conv}_1\left([G^y; G^x]\right)\right), \qquad Z \in \mathbb{R}^{(C/r) \times (H+W) \times 1}$$

$$Z^y = \sigma\left(\mathrm{Conv}_2^y(Z)\right), \qquad Z^x = \sigma\left(\mathrm{Conv}_2^x(Z)\right)$$

$$M^y = Z^y \in \mathbb{R}^{C \times H \times 1}, \qquad M^x = Z^x \in \mathbb{R}^{C \times 1 \times W}$$

where $r$ is the dimensional reduction ratio (typically set to 16), $\delta$ denotes the ReLU activation function, $\mathrm{Conv}_1$ is a $1 \times 1$ convolutional layer used for dimensionality reduction, $\mathrm{Conv}_2^y/\mathrm{Conv}_2^x$ are $1 \times 1$ convolutions for dimensionality recovery to $C$ channels, $M^y$ is the row-wise ($y$-axis) attention weight, and $M^x$ is the column-wise ($x$-axis) attention weight.
Step 3: Feature Weighting (output of channel–coordinate joint attention feature)
Element-wise multiplication is applied between the row/column attention weights and the original feature map to derive the final output feature map of the AGD:

$$F_{AGD} = F \otimes M^y \otimes M^x$$
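The three steps above can be sketched in PyTorch as follows (an illustrative implementation; the hidden-width floor of 8 and the layer names are our assumptions):

```python
import torch
import torch.nn as nn

class AttentionGuidedDecoder(nn.Module):
    """Coordinate-decoupled attention sketch following Steps 1-3:
    1D row/column average pooling, a shared 1x1 reduction convolution,
    then separate 1x1 convolutions recover row (M_y) and column (M_x)
    attention. Illustrative, not the authors' exact implementation."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, 1)   # Conv_1: reduce to C/r
        self.act = nn.ReLU(inplace=True)
        self.conv_y = nn.Conv2d(hidden, channels, 1)  # Conv_2^y: recover C
        self.conv_x = nn.Conv2d(hidden, channels, 1)  # Conv_2^x: recover C

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        g_y = f.mean(dim=3, keepdim=True)                 # B x C x H x 1 (row pooling)
        g_x = f.mean(dim=2, keepdim=True)                 # B x C x 1 x W (column pooling)
        z = torch.cat([g_y, g_x.transpose(2, 3)], dim=2)  # B x C x (H+W) x 1
        z = self.act(self.conv1(z))
        z_y, z_x = z.split([h, w], dim=2)                 # decouple y / x parts
        m_y = torch.sigmoid(self.conv_y(z_y))                  # B x C x H x 1
        m_x = torch.sigmoid(self.conv_x(z_x.transpose(2, 3)))  # B x C x 1 x W
        return f * m_y * m_x                              # F_AGD = F (x) M_y (x) M_x

f = torch.randn(2, 64, 40, 40)
out = AttentionGuidedDecoder(64)(f)
```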
By comparing the feature outputs of variance pooling, HAE, and AGD, the core mathematical reason for the detection head to adopt AGD is as follows:

$$F_{AGD} = F \otimes M^y(C, y) \otimes M^x(C, x)$$

The AGD produces position-sensitive features with channel-coordinate decoupling, where $M^y(C, y)$ denotes the attention related to channel $C$ and the ordinate $y$, and $M^x(C, x)$ denotes the attention related to channel $C$ and the abscissa $x$. Specifically, the attention weights of AGD are bivariate functions of "channel + single coordinate", which directly encode the positional information of the target.

$$F_{HAE} = F \otimes M_c(C) \otimes M_s(H, W)$$

The HAE, in contrast, produces position-agnostic features with channel-global spatial modeling, where $M_s(H, W)$ denotes attention related only to the global spatial dimensions $H \times W$; it performs no coordinate decoupling and cannot encode the fine-grained $x/y$ positional information of the target.
The bounding box regression of the object detection head is a numerical prediction of $(x_1, y_1, x_2, y_2)$. The output $F_{AGD}$ of AGD directly contains the feature encoding of the $x/y$ coordinates, which enables the regression branch of the detection head to directly learn the mapping between coordinates and features. In contrast, $F_{HAE}$ requires the detection head to additionally extract positional information from global features, which increases the learning difficulty. Therefore, AGD better matches the demands of the detection head.

3.6.2. Loss Function

The loss function employed in this paper is the sum of a classification loss $L_{cls}$ and a regression loss $L_{reg}$. For a batch of $N$ samples, each with $K$ candidate boxes (anchor-based or anchor-free), the total loss of the detection head is defined as:

$$L = \frac{1}{N_{pos}} \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ \alpha\, L_{cls}(p_{ij}, t_{ij}) + \beta\, L_{reg}(b_{ij}, \hat{b}_{ij}, t_{ij}) \right]$$

where $p_{ij}$ denotes the predicted foreground probability of candidate box $(i, j)$, and $t_{ij} \in \{0, 1\}$ is the ground-truth label (1 for foreground, 0 for background); $b_{ij} = (x_1, y_1, x_2, y_2)$ is the predicted box and $\hat{b}_{ij}$ the ground-truth box; $N_{pos}$ is the number of positive samples, and $\alpha/\beta$ are loss weights; $L_{cls}$ is the Focal Loss (addressing class imbalance), and $L_{reg}$ is the GIoU Loss (addressing scale/ratio invariance of box regression).
The Focal Loss $L_{cls}$ is given by:

$$L_{cls}(p, t) = -t\,\alpha (1 - p)^{\gamma} \log p - (1 - t)(1 - \alpha)\, p^{\gamma} \log(1 - p)$$
where γ is the focusing parameter (typically set to 2), and the core idea is to reduce the weight of easy samples and focus on hard samples at the foreground/background boundary.
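This loss can be sketched directly from the formula; a minimal implementation, assuming the common default $\alpha = 0.25$ (the text specifies only the focusing parameter $\gamma$):

```python
import torch

def focal_loss(p: torch.Tensor, t: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Per-element focal loss matching the formula above; p is the
    predicted foreground probability, t the {0, 1} label.
    alpha = 0.25 is an assumed default, not stated in the text."""
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)  # numerical stability for log
    return (-t * alpha * (1 - p) ** gamma * torch.log(p)
            - (1 - t) * (1 - alpha) * p ** gamma * torch.log(1 - p))

# A confident (easy) positive is down-weighted relative to a hard one.
easy = focal_loss(torch.tensor(0.9), torch.tensor(1.0))
hard = focal_loss(torch.tensor(0.3), torch.tensor(1.0))
```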
$M^y/M^x$ of AGD assign higher attention weights to the edge/center positions of the target, making the detection head more sensitive to the positional features of foreground targets. This drives the predicted foreground probability $p_{ij}$ closer to 1 for positive samples and closer to 0 for negative samples, ultimately reducing the Focal Loss:

$$L_{cls}^{AGD}(p, t) < L_{cls}^{HAE}(p, t)$$

Let the predicted foreground probability with AGD features be $p_{AGD}$ and that with HAE features be $p_{HAE}$. Then $|p_{AGD} - 1| < |p_{HAE} - 1|$ for positive samples, and substituting this into the Focal Loss formula yields the above inequality (for $\gamma > 0$, $(1 - p)^{\gamma}$ decreases rapidly as $p$ increases). Therefore, AGD reduces the classification loss.
The GIoU Loss $L_{reg}$ is defined as:

$$L_{reg}(b, \hat{b}) = 1 - \mathrm{GIoU}(b, \hat{b})$$

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|A_c \setminus U|}{|A_c|}$$

where IoU denotes the Intersection over Union metric, $A_c$ denotes the smallest axis-aligned bounding box enclosing both $b$ and $\hat{b}$, and $U$ denotes the union region of $b$ and $\hat{b}$.
The coordinate-decoupled features of AGD make the detection head's predictions $b_{ij}$ for $(x_1, y_1, x_2, y_2)$ closer to the ground-truth box $\hat{b}_{ij}$, i.e., $\mathrm{IoU}(b_{AGD}, \hat{b}) > \mathrm{IoU}(b_{HAE}, \hat{b})$ and $|A_c \setminus U|_{AGD} < |A_c \setminus U|_{HAE}$. This ultimately reduces the GIoU Loss:

$$L_{reg}^{AGD}(b, \hat{b}) < L_{reg}^{HAE}(b, \hat{b})$$
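The GIoU loss above can be sketched as follows for axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ form (an illustrative implementation):

```python
import torch

def giou_loss(b: torch.Tensor, b_hat: torch.Tensor) -> torch.Tensor:
    """GIoU loss per the equations above: L = 1 - (IoU - |A_c minus U| / |A_c|).
    Boxes are (x1, y1, x2, y2); broadcasting over leading dims."""
    # Intersection rectangle
    x1 = torch.maximum(b[..., 0], b_hat[..., 0])
    y1 = torch.maximum(b[..., 1], b_hat[..., 1])
    x2 = torch.minimum(b[..., 2], b_hat[..., 2])
    y2 = torch.minimum(b[..., 3], b_hat[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    area_h = (b_hat[..., 2] - b_hat[..., 0]) * (b_hat[..., 3] - b_hat[..., 1])
    union = area_b + area_h - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box A_c
    ex1 = torch.minimum(b[..., 0], b_hat[..., 0])
    ey1 = torch.minimum(b[..., 1], b_hat[..., 1])
    ex2 = torch.maximum(b[..., 2], b_hat[..., 2])
    ey2 = torch.maximum(b[..., 3], b_hat[..., 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose.clamp(min=1e-7)
    return 1.0 - giou

# Identical boxes give zero loss; disjoint boxes are penalized beyond 1.
loss_same = giou_loss(torch.tensor([0., 0., 2., 2.]), torch.tensor([0., 0., 2., 2.]))
loss_far = giou_loss(torch.tensor([0., 0., 1., 1.]), torch.tensor([3., 3., 4., 4.]))
```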

3.7. Implementation of the Attention Fusion Module

Let the input feature map be $F \in \mathbb{R}^{C \times H \times W}$. Our attention fusion module contains two lightweight paths, namely HAE and AGD, and is applied to each pyramid feature (e.g., $P_3/P_4/P_5$) independently (with the same architecture but without weight sharing across scales).
HAE (variance-aware hybrid attention). For channel attention, we compute three channel descriptors using global average pooling (GAP), global max pooling (GMP), and global variance pooling (GVP): $p_{avg}, p_{max}, p_{var} \in \mathbb{R}^{C}$. They are fed into a shared MLP with reduction ratio $r$ (two FC layers: $C \to C/r \to C$) to obtain

$$M_c = \sigma\left(\mathrm{MLP}(p_{avg}) + \mathrm{MLP}(p_{max}) + \mathrm{MLP}(p_{var})\right) \in \mathbb{R}^{C},$$

and $F_1 = M_c \otimes F$. For spatial attention, we compute channel-wise average/max/variance maps $s_{avg}, s_{max}, s_{var} \in \mathbb{R}^{1 \times H \times W}$ from $F_1$, concatenate them, and apply a $k \times k$ convolution followed by a sigmoid:

$$M_s = \sigma\left(\mathrm{Conv}_{k \times k}([s_{avg}; s_{max}; s_{var}])\right) \in \mathbb{R}^{1 \times H \times W},$$

yielding $F_2 = M_s \otimes F_1$. Note that variance pooling introduces no additional learnable parameters.
AGD (coordinate-decoupled geometric attention). We adopt decoupled 1D aggregation along height/width to obtain $g_h \in \mathbb{R}^{C \times H \times 1}$ and $g_w \in \mathbb{R}^{C \times 1 \times W}$ via average pooling, then compress them with a shared $1 \times 1$ convolution to $C'$ channels, where $C' = \max(8, C/r)$:

$$u = \delta\left(\mathrm{Conv}_{1 \times 1}([g_h; g_w])\right) \in \mathbb{R}^{C' \times (H+W) \times 1},$$

split $u$ into $u_h \in \mathbb{R}^{C' \times H \times 1}$ and $u_w \in \mathbb{R}^{C' \times 1 \times W}$, and expand back to $C$ channels to form two attention maps: $A_h = \sigma(\mathrm{Conv}_{1 \times 1}(u_h)) \in \mathbb{R}^{C \times H \times 1}$ and $A_w = \sigma(\mathrm{Conv}_{1 \times 1}(u_w)) \in \mathbb{R}^{C \times 1 \times W}$. The AGD-refined feature is $F_3 = F_2 \otimes A_h \otimes A_w$.
Fusion strategy. We fuse the two paths via element-wise residual addition:

$$F_{out} = F_2 + F_3,$$

which keeps the module lightweight and stable during training.

4. Experiments

4.1. Experiment Dataset and Evaluation Metrics

To better validate small target detection in aerial remote sensing images, we used the AI-TOD [40], DOTA [41], and RSOD [42,49] datasets, as well as our own dataset of scaled physical models captured by UAVs across five different scenarios. The AI-TOD dataset consists of 14,018 images, of which 11,214 were used for training and 2804 for testing. The 2804 test images contain 70,437 object instances; the per-category counts for the eight categories are given in Table 2. Compared with existing aerial image object detection datasets, AI-TOD has a markedly smaller average object size of approximately 14 pixels, distinguishing it from datasets with relatively larger targets.
The AI-TOD dataset exhibits a pronounced size imbalance in target distribution. As shown in Figure 8a, small targets (<32 × 32 pixels) dominate the dataset, with a representation rate of 97.96%, while medium targets ( 32 × 32 96 × 96 pixels) account for merely 2.04%. Notably, the dataset contains no instances of large targets (>96 × 96 pixels). This extreme distribution presents significant challenges for target detection algorithms, particularly in handling scale variations across different target sizes.
The target categories of the AI-TOD dataset are depicted in Figure 8b. Notably, the most prominent category is vehicles, constituting 87.78% of the total.
The DOTA dataset (A Large-scale Dataset for Object Detection in Aerial Images) is a widely used benchmark for object detection in aerial imagery, designed to support both algorithm development and performance evaluation. It contains 2806 aerial images collected from diverse sensors and platforms. Image resolutions range from approximately 800 × 800 to 4000 × 4000 pixels, and the dataset includes objects exhibiting substantial variation in scale, orientation, and shape. Every image is meticulously labeled by professional interpreters, covering 15 common object categories. In total, the DOTA dataset comprises 188,282 annotated instances, and the per-category instance counts are reported in Table 3.
The DOTA dataset demonstrates a distinct multi-scale target distribution pattern. As illustrated in Figure 9a, medium-sized targets ( 32 × 32 96 × 96 pixels) constitute the majority (53.49%), followed by small targets (< 32 × 32 pixels) at 35.75%. Notably, large targets (> 96 × 96 pixels) represent a non-negligible proportion (10.77%), indicating the dataset’s capability to support multi-scale object detection research. This balanced yet diverse size distribution poses unique challenges for algorithms in handling scale variations across different target categories.
As illustrated in Figure 9b, the target categories of the DOTA dataset are presented. Notably, the top categories include ships, accounting for 31.05% of the total, followed by small-vehicles at 18.85%, large-vehicles at 15.20%, and storage tanks at 10.01%.
The RSOD dataset [42,49] is an open dataset for object detection in remote sensing images, including four types of targets: airplanes, playgrounds, overpasses, and oil drums. The RSOD dataset contains 976 images and 6950 entities, with the number of entities in each category shown in Table 4. This dataset was released by Wuhan University in 2015.
The RSOD dataset exhibits a distinctive multi-scale target distribution characteristic, as depicted in Figure 10a. Statistical analysis reveals that medium-sized targets ( 32   × 32–96 × 96 pixels) dominate the dataset, accounting for 61.30% of the total instances. Small targets (<32 × 32 pixels) represent a minority proportion at 13.30%, while large targets (>96 × 96 pixels) maintain a significant presence, with 25.40% representation. This tri-modal size distribution demonstrates the dataset’s comprehensive coverage of scale variations, making it particularly valuable for multi-scale object detection research.
The balanced representation across different target scales presents both opportunities and challenges for detection algorithms. The coexistence of small, medium, and large targets within the same dataset requires robust scale-invariant feature learning capabilities, which is crucial for developing generalized object detection frameworks in remote sensing applications. Notably, the substantial proportion of large targets (25.40%) distinguishes RSOD from other aerial image datasets that typically focus on small objects, providing unique test conditions for evaluating scale adaptation performance.

4.2. Experimental Results and Comparative Analysis

4.2.1. Analysis of Small Target Layer Results

In small target detection, the target region is typically smaller than 32 × 32 pixels. Figure 11 shows attention heatmaps for small target detection under various detector configurations. The results indicate that detection heads designed for large targets struggle to detect small targets; it is therefore necessary to add a detection layer dedicated to small targets to improve their detection accuracy.
According to Figure 12, for small targets the expected detection effect is achieved only in the shallow 160 × 160 feature maps, while the higher-level feature maps fail to achieve it. Each row of Figure 12 contains four images: Figure 12a–d, Figure 12e–h, and Figure 12i–l show attention heatmaps of a detected target under the four detection layers. Figure 12a,e,i; Figure 12b,f,j; Figure 12c,g,k; and Figure 12d,h,l correspond to the 20 × 20, 40 × 40, 80 × 80, and 160 × 160 feature maps, respectively.

4.2.2. Analysis of Results for Adjusting Anchor Box Size and Loss Function

For small target detection in UAV high-altitude remote sensing images, the anchor box sizes were adjusted and a detection layer was added, because the scale of small targets differs substantially from that of ordinary detection targets. The detection results on feature maps under different downsampling factors were described in the previous section; Figure 13 compares the two anchor configurations. The left image (Figure 13a) shows the detection results with the original anchor box configuration, where no targets were detected. In contrast, the right image (Figure 13b) shows the results after adjusting the anchor box sizes. Evidently, optimizing the anchor box sizes significantly improves detection accuracy.

4.2.3. Analysis of Object Detection Results for Dual-Path Attention

As shown in Figure 14, Figure 14a reports the detection results obtained without the Dual-path Attention module, while Figure 14b shows the results after integrating the Dual-path Attention. Without Dual-path Attention, the detector identifies 205 vehicles; with Dual-path Attention, the number of detected vehicles increases to 300. This substantial improvement indicates that the Dual-path Attention effectively strengthens small-object detection, particularly in scenes where targets are densely clustered and easily missed.
Figure 15 visualizes the feature responses before and after introducing the Dual-path Attention. The left heatmap in Figure 15a shows the activation distribution without Dual-path Attention, whereas the right heatmap in Figure 15b corresponds to the output after Dual-path Attention is applied. As observed, Dual-path Attention suppresses responses to irrelevant regions while strengthening activations on vehicle targets, leading to more discriminative and concentrated attention over the objects of interest. This improvement stems from Dual-path Attention’s ability to embed positional information into channel attention, producing direction-aware and position-sensitive feature maps. By capturing long-range dependencies while preserving precise spatial cues, Dual-path Attention enhances target representation—particularly beneficial for small and densely distributed objects. Moreover, Dual-path Attention is lightweight and easy to integrate into compact network backbones, offering accuracy gains with negligible computational overhead. Empirically, Dual-path Attention has been shown to improve not only image classification but also dense prediction tasks such as object detection and instance segmentation.
As illustrated in Figure 16, Figure 16a shows the original aerial image. Figure 16b presents the attention heatmap produced by the baseline YOLOv7 model for the same scene, where strong activations are dispersed across the image and substantial background responses are observed, indicating notable attention noise. In contrast, Figure 16c shows the heatmap generated by the proposed MAF-Net. Here, the model concentrates its responses primarily on the target regions, while activations over irrelevant background areas are markedly suppressed, demonstrating more discriminative and focused attention.

4.2.4. Analysis of Experimental Results for MAF-Net Model

The experiments were conducted under Windows 10 with Python 3.8, PyTorch 1.7.1, and CUDA 12.3, using an NVIDIA RTX 4090 GPU and an Intel Core i9-9900K CPU. The input image size was 800 × 800, the initial learning rate was set to 0.01, and training ran for 300 epochs.
Table 5 summarizes the class-wise object detection results of the proposed MAF-Net on the challenging AI-TOD dataset. It can be observed that the proposed method achieves overall precision, recall, mAP@0.5, and mAP@0.5:0.95 values of 0.667, 0.553, 0.558, and 0.245, respectively. Among all object categories, the storage-tank class obtains the highest performance, while the wind-mill and pool classes show relatively lower results due to their limited label numbers and complex backgrounds in real-world scenarios.
The quantitative results of class-wise object detection are presented in Table 6, which reports the performance metrics, including Precision (P), Recall (R), mAP@0.5, and AP@0.5:0.95 for each object category, as well as the overall dataset performance across all classes.
As shown in the table, the overall detection performance on the dataset achieves a Precision of 0.718, a Recall of 0.474, an mAP@0.5 of 0.490, and an mAP@0.5:0.95 of 0.280. For individual categories, tennis-court demonstrates the most outstanding performance, with an mAP@0.5 of 0.921 and an mAP@0.5:0.95 of 0.773, which is attributed to its distinct geometric features and clear boundaries in aerial images. In contrast, roundabout and helicopter exhibit relatively low detection accuracy, with mAP@0.5 values of 0.103 and 0.182, respectively. This is mainly due to their small size, complex background interference, and limited number of labeled samples (179 and 73 instances, respectively), leading to insufficient model generalization. Notably, large-scale objects such as large-vehicle (mAP@0.5 = 0.764) and harbor (mAP@0.5 = 0.713) achieve high detection precision and recall, indicating that the model effectively captures their prominent visual characteristics.
Table 7 presents the category-level detection performance of MAF-Net on the RSOD dataset. The overall detection results are 0.347 for precision, 0.955 for recall, 0.349 for mAP@0.5, and 0.238 for mAP@0.5:0.95. Specifically, the aircraft and oiltank classes achieve relatively high detection accuracy, while the overpass and playground classes are affected by small sample sizes.
Ablation experiments were conducted to compare the impact of the different modules, as shown in Table 8, Table 9 and Table 10. According to the results, adding the Dual-path Attention increases YOLOv7's mAP by 24.6%; adding the small object detection layer increases mAP by 15.7%; and using both modules simultaneously increases mAP by 30.2%. This verifies the effectiveness of the two modules.
As shown in Table 11, Table 12 and Table 13, MAF-Net consistently outperforms the baseline, achieving a 6.87% gain in mAP.

4.2.5. Computational Complexity and Inference Efficiency

To comprehensively evaluate the practical applicability of MAF-Net, especially its suitability for real-time UAV deployment on edge devices, its computational complexity (parameter count and FLOPs), inference speed, and edge-deployment adaptability are systematically analyzed and compared with baseline models, as reported in Table 14.
In terms of parameter count and computational overhead, MAF-Net maintains a moderate parameter size of 39.123 M and a computational cost of 59.773 GFLOPs. This is comparable to most YOLOv7 variants (e.g., YOLOv7 Improved: 37.829 M parameters, 59.942 GFLOPs; YOLOv7-MH: 37.532 M parameters, 52.608 GFLOPs) and significantly lower than YOLOv7-UWSC (42.437 M parameters, 112.139 GFLOPs). Importantly, this moderate parameter count is critical for edge device deployment, as edge platforms (e.g., embedded GPUs, FPGAs commonly used in UAVs) typically have limited memory bandwidth and on-chip storage. MAF-Net’s 39.123 M parameters can be efficiently loaded and run on such devices without excessive memory consumption, whereas models with larger parameter sizes (e.g., YOLOv7-UWSC) may suffer from memory overflow or prolonged loading times, which are prohibitive for real-time UAV operations.
Regarding inference speed, MAF-Net delivers a frame rate of 21.277 FPS, which is significantly faster than YOLOv7-SCD (9.346 FPS), YOLOv7-UWSC (14.286 FPS), and YOLOv7 Improved (11.96 FPS). For real-time UAV remote sensing detection, a frame rate of at least 15 FPS is generally required to ensure that the system can process continuous aerial imagery and provide timely feedback for UAV navigation or target tracking. MAF-Net’s 21.277 FPS fully meets this real-time requirement, whereas the aforementioned baselines fail to reach the minimum frame rate threshold, making them unsuitable for dynamic UAV deployment. Although YOLOv7-tiny exhibits the highest inference speed (38.46 FPS) due to its lightweight structure (6.270 M parameters, 6.697 GFLOPs), it suffers from substantial performance degradation (e.g., mAP@0.5 of only 0.2892 on AI-TOD, 0.193 on DOTA, and 0.243 on RSOD, as shown in Table 11, Table 12 and Table 13), which renders it impractical for high-precision UAV detection tasks where accurate target identification is critical.
YOLOv7-MH achieves a slightly higher FPS (23.81) than MAF-Net but at the cost of lower detection accuracy (e.g., mAP@0.5 of 0.4417 on AI-TOD, 0.257 on DOTA, and 0.126 on RSOD), and its parameter count (37.532 M) is only marginally lower than MAF-Net’s. This trade-off is unfavorable for UAV edge deployment, where both real-time performance and detection accuracy are essential to support reliable decision-making.
Collectively, MAF-Net’s moderate parameter count (39.123 M) and computational cost (59.773 GFLOPs) ensure compatibility with resource-constrained UAV edge devices, while its 21.277 FPS inference speed meets the real-time requirement for UAV aerial imagery processing. Compared with baseline models, MAF-Net is the only variant that simultaneously achieves superior detection accuracy, moderate computational overhead, and real-time inference speed—key prerequisites for practical real-time UAV deployment on edge devices.
To comprehensively evaluate the performance of different algorithms for small object detection, five representative models are compared on the target dataset, with the detailed comparative results summarized in Table 15.

4.2.6. Hardware Experimental Verification

In the experiments, we employed the DJI Mini 2 UAV as the experimental platform; its physical prototype is shown in Figure 17, and detailed parameters are listed in Table 16.
The mobile control, data reception, and image processing terminal is illustrated in Figure 18, which mainly consists of the UAV controller, the data transmission and reception program, and the Core-3588SG core board.
The proposed approach is deployed and verified on the above hardware and software platform to validate its effectiveness and practicability in real scenarios.
Experimental detection results achieved by the UAV platform under real-world conditions are shown in Figure 19.

4.3. Robustness Evaluation Under Complex Conditions

4.3.1. Construction of an Experimental Dataset for Complex Conditions

To further validate the generalization capability and robustness of our proposed method, we performed cross-dataset evaluations and stability tests under varying flight altitudes, illumination conditions, and motion blur—all of which are representative challenges encountered in practical UAV aerial detection scenarios.
By utilizing the integrated software and hardware platform established in the preceding section, a DJI Mini 2 unmanned aerial vehicle (UAV) was deployed to acquire high-resolution imagery of ten distinct target categories, represented by scaled-down physical proxies including aircraft, fighter jets, helicopters, hummers, missiles, tanks, trucks, warships, and yachts. These data collection experiments were conducted across five representative environmental scenarios: indoor settings, structured outdoor grid environments, paved surfaces, water surfaces, and grassland areas. As illustrated in Figure 20, all acquired imagery was manually annotated with pixel-level precision, and the annotated data were systematically curated to form the dedicated test dataset for this study (referred to as the Indoor-Grid-Pavement-Water-Grass (IGPWG) dataset), which encompasses the five aforementioned environmental subsets.
As shown in Figure 21a–c, the target was captured at different preset altitudes in the Grass dataset; likewise, Figure 21d–f shows the target captured at different preset altitudes in the Grid dataset.
As shown in Figure 22, the target was imaged under various illumination conditions in the Indoor dataset.
As shown in Figure 23, the target was imaged under various blur conditions in the Water dataset.
As shown in Figure 24, the target was imaged under various blur conditions in the Indoor dataset.
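For controlled stress tests, blur levels similar to those illustrated in Figures 23 and 24 can also be synthesized offline. The following pure-Python sketch applies a uniform horizontal motion-blur kernel to a grayscale image stored as a list of rows; the kernel length is an illustrative free parameter, not a value taken from the experiments, and real UAV jitter produces more complex, spatially varying kernels:

```python
def motion_blur_row(row, length=5):
    """1-D horizontal motion blur: average each pixel with its neighbors.

    Border pixels use an implicit zero padding (the sum over the valid
    window is still divided by the full kernel length).
    """
    half = length // 2
    n = len(row)
    out = []
    for i in range(n):
        window = [row[j] for j in range(i - half, i + half + 1) if 0 <= j < n]
        out.append(sum(window) / length)  # zero-padded uniform average
    return out

def motion_blur_horizontal(image, length=5):
    """Apply the 1-D blur to every row of a 2-D image (list of row lists)."""
    return [motion_blur_row(row, length) for row in image]
```

Because the kernel is a plain average, interior pixels of a constant image are unchanged, which gives a quick sanity check on an implementation.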

4.3.2. Experimental Results and Analysis

Table 17 details the test dataset distribution across the five environments. The Pavement scenario contains the largest number of images (848) and the most diverse category coverage, while the Water scenario is specialized for maritime targets (warship, yacht). The Indoor and Grid scenarios have similar data scales and shared categories, and the Grass scenario features hummer and tank as the dominant entities. This design enables rigorous evaluation of model generalization across diverse conditions.
Training was performed on the Indoor dataset, and the corresponding results are presented in Table 18.
Training was performed on the Grid dataset, and the corresponding results are presented in Table 19.
Training was performed on the Pavement dataset, and the corresponding results are presented in Table 20.
Training was performed on the Water dataset, and the corresponding results are presented in Table 21.
Training was performed on the Grass dataset, and the corresponding results are presented in Table 22.
Training was performed on the IGPWG dataset, and the corresponding results are presented in Table 23.
In the aforementioned experiments, 10% of the imagery from each of the five distinct datasets was randomly partitioned into the test set, with the remaining 90% allocated to the training set. As presented in Table 18, Table 19, Table 20, Table 21, Table 22 and Table 23, the experimental findings demonstrate the robustness of the proposed model across a diverse range of environmental conditions, including varying flight altitudes, illumination intensities, and motion blur levels.
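The 90%/10% partition described above can be sketched as follows; the file names are hypothetical, and the fixed seed simply makes the split reproducible across runs:

```python
import random

def split_dataset(image_paths, test_fraction=0.1, seed=42):
    """Randomly partition image paths into (train, test) lists.

    Mirrors the 90%/10% protocol used in these experiments: shuffle once,
    then slice off the test fraction.
    """
    paths = list(image_paths)
    rng = random.Random(seed)  # seeded RNG for a reproducible partition
    rng.shuffle(paths)
    n_test = max(1, round(len(paths) * test_fraction))
    return paths[n_test:], paths[:n_test]

# Hypothetical file names; any per-environment image list works the same way.
train, test = split_dataset([f"img_{i:04d}.jpg" for i in range(500)])
```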

5. Conclusions

To address the persistent challenges of missed and false detections of small targets in high-altitude visible-light imagery, this paper proposes an enhanced MAF-Net model. By exploiting shallow feature maps, a dedicated small-object detection layer is introduced to better preserve fine-grained spatial information, thereby improving the detectability of small targets. In addition, the K-means++ algorithm is employed for anchor box clustering to better adapt anchor sizes to the characteristics of aerial datasets, improving detection efficiency. Furthermore, a coordinate attention mechanism is integrated into the detection head, significantly enhancing performance in densely populated small-target scenarios where missed detections are common.
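The anchor-clustering step summarized above can be illustrated with a simplified K-means++ routine that clusters (width, height) pairs under a 1 − IoU distance; this is a sketch of the general technique on toy data, not the exact implementation used in MAF-Net:

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes given as (w, h) pairs, both anchored at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k, n_iter=100, seed=0):
    """Cluster (w, h) boxes into k anchors using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]
    # K-means++ seeding: new centers are drawn with probability
    # proportional to the squared distance to the nearest chosen center.
    while len(centers) < k:
        d2 = [min((1.0 - iou_wh(b, c)) ** 2 for c in centers) for b in boxes]
        centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    # Lloyd iterations: assign each box to its highest-IoU anchor,
    # then move each anchor to the mean (w, h) of its cluster.
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            clusters[best].append(b)
        centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return sorted(centers, key=lambda c: c[0] * c[1])
```

Using IoU rather than Euclidean distance keeps small boxes from being absorbed into clusters dominated by large boxes, which is the rationale for the density-adaptive anchor design.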
Extensive experiments, including ablation studies, attention mechanism analysis, and comparisons with state-of-the-art methods on multiple aerial benchmarks (AI-TOD, DOTA, and RSOD), demonstrate the effectiveness of the proposed approach. The MAF-Net model achieves notable performance gains, with improvements in mAP@0.5 of 14.1%, 11.28%, and 22.09% on the respective datasets. These results confirm the superior accuracy, robustness, and generalization capability of MAF-Net for small-target detection in high-altitude imagery.
Although MAF-Net achieves promising performance on both public benchmarks and self-constructed datasets (e.g., IGPWG, Water, Grass), its generalization capability under extreme aerial imaging conditions—such as dense fog, rain, snow, and severe motion blur induced by UAV jitter—still needs to be further verified. In these harsh scenarios, feature degradation of small objects becomes more prominent, which may result in an obvious decline in detection performance. Future research will concentrate on constructing and annotating datasets captured under extreme environmental conditions, as well as optimizing the model architecture to strengthen its anti-interference ability. Furthermore, future work will also focus on model compression and lightweight design for efficient deployment on mobile and embedded platforms to achieve a desirable trade-off between detection accuracy and inference efficiency.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15051100/s1.

Author Contributions

Conceptualization, K.M.; methodology, K.M. and Z.Z.; software, K.M.; validation, K.M.; formal analysis, K.M. and Z.Z.; investigation, K.M.; resources, J.H.; data curation, K.M.; writing—original draft preparation, K.M.; writing—review and editing, Z.Z. and J.Z.; visualization, K.M.; supervision, J.H.; project administration, Z.Z.; funding acquisition, Z.Z. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Povlsen, P.; Bruhn, D.; Durdevic, P.; Arroyo, D.O.; Pertoldi, C. Using YOLO Object Detection to Identify Hare and Roe Deer in Thermal Aerial Video Footage—Possible Future Applications in Real-Time Automatic Drone Surveillance and Wildlife Monitoring. Drones 2024, 8, 2. [Google Scholar] [CrossRef]
  2. Khan, A.H.; Rizvi, S.T.R.; Dengel, A. Real-time Traffic Object Detection for Autonomous Driving. arXiv 2024, arXiv:2402.00128. [Google Scholar] [CrossRef]
  3. Aldahmani, A.; Ouni, B.; Lestable, T.; Debbah, M. Cyber-Security of Embedded IoTs in Smart Homes: Challenges, Requirements, Countermeasures, and Trends. IEEE Open J. Veh. Technol. 2023, 4, 281–292. [Google Scholar] [CrossRef]
  4. Guo, X.; Chen, Y.; Wang, Y. Learning-Based Robust and Secure Transmission for Reconfigurable Intelligent Surface Aided Millimeter Wave UAV Communications. IEEE Wireless Commun. Lett. 2021, 10, 1795–1799. [Google Scholar] [CrossRef]
  5. Wu, X.; Dong, J.; Bao, W.; Zou, B.; Wang, L.; Wang, H. Augmented Intelligence of Things for Emergency Vehicle Secure Trajectory Prediction and Task Offloading. IEEE Internet Things J. 2024, 11, 36030–36043. [Google Scholar] [CrossRef]
  6. He, Q.; Qu, C. Modular Landfill Remediation for AI Grid Resilience. arXiv 2025, arXiv:2512.19202. [Google Scholar] [CrossRef]
  7. Chen, Y.; Guo, X.; Zhou, G.; Jin, S.; Ng, D.W.K.; Wang, Z. Unified Far-Field and Near-Field in Holographic MIMO: A Wavenumber-Domain Perspective. IEEE Commun. Mag. 2025, 63, 30–36. [Google Scholar] [CrossRef]
  8. He, Q.; Qu, C. Waste-to-Energy-Coupled AI Data Centers: Cooling Efficiency and Grid Resilience. arXiv 2025, arXiv:2512.24683. [Google Scholar] [CrossRef]
  9. Liu, D.; Shen, Q.; Liu, J. The Health-Wealth Gradient in Labor Markets: Integrating Health, Insurance, and Social Metrics to Predict Employment Density. Computation 2026, 14, 22. [Google Scholar] [CrossRef]
  10. Wu, X.; Zhang, Y.-T.; Lai, K.-W.; Yang, M.-Z.; Yang, G.-L.; Wang, H.-H. A Novel Centralized Federated Deep Fuzzy Neural Network with Multi-Objectives Neural Architecture Search for Epistatic Detection. IEEE Trans. Fuzzy Syst. 2025, 33, 94–107. [Google Scholar] [CrossRef]
  11. Ke, Z.; Cao, Y.; Chen, Z.; Yin, Y.; He, S.; Cheng, Y. Early Warning of Cryptocurrency Reversal Risks via Multi-Source Data. Fin. Res. Lett. 2025, 85, 107890. [Google Scholar] [CrossRef]
  12. Liu, W.; Huang, T.; Zhang, P.; Ke, Z.; Min, M.; Zhao, P. High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers. arXiv 2023, arXiv:2307.13352. [Google Scholar] [CrossRef]
  13. Shen, Q.; Zhang, J. AI-Enhanced Disaster Risk Prediction with Explainable SHAP Analysis: A Multi-Class Classification Approach Using XGBoost. Res. Sq. 2025. [Google Scholar] [CrossRef]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  15. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  16. Liao, L.; Luo, L.; Su, J.; Xiao, Z.; Zou, F.; Lin, Y. Eagle-YOLO: An Eagle-Inspired YOLO for Object Detection in Unmanned Aerial Vehicles Scenarios. Mathematics 2023, 11, 2093. [Google Scholar] [CrossRef]
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2016. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 6517–6525. [Google Scholar]
  19. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  20. Sun, G.; Wang, S.; Xie, J. An Image Object Detection Model Based on Mixed Attention Mechanism Optimized YOLOv5. Electronics 2023, 12, 1515. [Google Scholar] [CrossRef]
  21. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  22. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
  23. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  24. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  25. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  26. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. Available online: https://ieeexplore.ieee.org/document/8417976 (accessed on 27 February 2026).
  27. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  28. Chen, J.; Er, M.J. Dynamic YOLO for small underwater object detection. Artif. Intell. Rev. 2024, 57, 165. [Google Scholar] [CrossRef]
  29. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  30. Han, J.; Zhang, D.; Hu, X.; Guo, L.; Ren, J.; Wu, F. Background Prior-Based Salient Object Detection via Deep Reconstruction Residual. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1309–1321. [Google Scholar]
  31. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  33. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection (Conference Paper). In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2018; pp. 6154–6162. [Google Scholar]
  34. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks; Curran Associates Inc.: New York, NY, USA, 2016. [Google Scholar]
  35. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In IEEE Transactions on Pattern Analysis Machine Intelligence; IEEE Computer Society: Piscataway, NJ, USA, 2017. [Google Scholar]
  36. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-Aware Trident Networks for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE Computer Society: Piscataway, NJ, USA, 2019. [Google Scholar]
  37. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  38. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
  39. Zhang, J.; Peng, W.; Xiao, A.; Liu, T.; Fu, J.; Chen, J.; Yan, Z. KANs-DETR: Enhancing Detection Transformer with Kolmogorov–Arnold Networks for small object. High-Confid. Comput. 2026, 6, 100336. [Google Scholar] [CrossRef]
  40. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny Object Detection in Aerial Images; Wuhan University: Wuhan, China, 2021. [Google Scholar]
  41. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images; IEEE Computer Society: Piscataway, NJ, USA, 2018. [Google Scholar]
  42. Xiao, Z.; Liu, Q.; Tang, G.; Zhai, X. Elliptic Fourier transformation-based histograms of oriented gradients for rotationally invariant object detection in remote-sensing images. Int. J. Remote Sens. 2015, 36, 618–644. [Google Scholar] [CrossRef]
  43. Wang, S. Effectiveness of traditional augmentation methods for rebar counting using UAV imagery with Faster R-CNN and YOLOv10-based transformer architectures. Sci. Rep. 2025, 1, 33702. [Google Scholar] [CrossRef]
  44. Wang, S. Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body worn cameras with benchmark datasets. Sci. Rep. 2025, 15, 27043. [Google Scholar] [CrossRef]
  45. Ryu, J.; Kwak, D.; Choi, S. YOLOv8 with Post-Processing for Small Object Detection Enhancement. Appl. Sci. 2025, 13, 7275. [Google Scholar] [CrossRef]
  46. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision–ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar] [CrossRef]
  47. Liu, Y.F.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive Balanced Network for Multiscale Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5614914. [Google Scholar] [CrossRef]
  48. Saeed, F.; Paul, A. ISO-DeTr: A novel detection transformer for industrial small object detection. Mach. Learn. Appl. 2026, 23, 100809. [Google Scholar] [CrossRef]
  49. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  50. Yi, W.G.; Wang, B. Research on Underwater Small Target Detection Algorithm Based on Improved YOLOv7. IEEE Access 2023, 11, 66818–66827. [Google Scholar] [CrossRef]
  51. Zhang, X.; Huang, D.Q. Research on UAV Ground Target Detection Based on Improved YOLOv7. In Proceedings of the 2023 3rd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 24–26 March 2023; pp. 28–32. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the proposed MAF-Net. A dedicated small-object detection layer is appended to the YOLOv7 head at the low-level feature map stage, and a coordinate attention mechanism is integrated into the detection head. The detection anchors are clustered using the KMeans++ algorithm to optimize small-object detection performance.
Figure 2. Block diagram of the MAF-Net architecture.
Figure 3. Schematic diagram of the receptive field.
Figure 4. Structure diagram of the hybrid-attention encoder channel attention module.
Figure 5. Structure diagram of the hybrid-attention encoder spatial attention module.
Figure 6. Model structure of Hybrid-Attention Encoder.
Figure 7. Model structure of attention-guided decoder.
Figure 8. (a) Target size distribution in the AI-TOD dataset: small targets (<32 × 32 pixels) account for 97.96% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 2.04%, and large targets (>96 × 96 pixels) are absent. (b) Quantitative distribution of object categories in the AI-TOD dataset.
Figure 9. (a) Target size distribution in the DOTA dataset: small targets (<32 × 32 pixels) account for 35.75% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 53.49%, and large targets (>96 × 96 pixels) for 10.77%. (b) Quantitative distribution of object categories in the DOTA dataset.
Figure 10. (a) Target size distribution in the RSOD dataset: small targets (<32 × 32 pixels) account for 13.30% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 61.30%, and large targets (>96 × 96 pixels) for 25.40%. (b) Quantitative distribution of object categories in the RSOD dataset.
Figure 11. When the feature maps of the three detection heads in YOLOv7 are (a) 80 × 80, (b) 40 × 40, and (c) 20 × 20, respectively, only the detection head with the highest resolution can effectively detect small targets, while the other two detection heads struggle to detect them.
Figure 12. Attention heatmaps of a detected target under the four detection layers, with four images per row: (a–d), (e–h), and (i–l). Panels (a,e,i), (b,f,j), (c,g,k), and (d,h,l) show the attention heatmaps on the 20 × 20, 40 × 40, 80 × 80, and 160 × 160 feature maps, respectively.
Figure 13. The left image (a) illustrates the results of object detection using the previous anchor box configuration, where no targets were detected. In contrast, the right image (b) shows the detection results after adjusting the size of the anchor boxes. It is evident that by optimizing the size of the anchor boxes, the accuracy of object detection can be significantly improved.
Figure 14. The left image (a) shows the object detection results before applying the Dual-path Attention, while the right image (b) presents the results after applying the Dual-path Attention. In (a), 205 vehicles were detected, whereas in (b), 300 vehicles were detected. This demonstrates that the Dual-path Attention can enhance the detection accuracy of small targets in areas with a high density of such targets.
Figure 15. The left figure (a) shows the thermal map before processing through the Dual-path Attention, while the right figure (b) displays the thermal map after processing through the Dual-path Attention. It is evident that the Dual-path Attention effectively reduces the attention on non-car parts and enhances the attention on the target car.
Figure 16. Visualization of attention heatmaps. (a) Original image. (b) Heatmap produced by YOLOv7, exhibiting widespread background activation (attention noise). (c) Heatmap produced by the proposed model, where responses are concentrated on the target regions and background activations are suppressed.
Figure 17. The DJI Mini 2 unmanned aerial vehicle (UAV) employed as the physical experimental platform for the field tests.
Figure 18. The mobile control, data reception, and image processing terminal.
Figure 19. Experimental results in real-world environments.
Figure 20. Sample images of the IGPWG dataset under diverse environmental conditions.
Figure 21. Sample images of the dataset under different altitude conditions. (a) High-altitude aircraft UAV image; (b) Medium-altitude aircraft UAV image; (c) Low-altitude aircraft UAV image; (d) High-altitude helicopter UAV image; (e) Medium-altitude helicopter UAV image; (f) Low-altitude helicopter UAV image.
Figure 22. Sample images under different illumination conditions.
Figure 23. Sample images of the Water dataset under different blur conditions.
Figure 24. Sample images of the Indoor dataset under different blur conditions.
Table 1. Parameters of the different detection heads for an input image size of 640 × 640.
Feature Map Size | Downsampling Multiple | Receptive Field
20 × 20 | 32 times downsampling | 32 × 32
40 × 40 | 16 times downsampling | 16 × 16
80 × 80 | 8 times downsampling | 8 × 8
160 × 160 | 4 times downsampling | 4 × 4
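The mapping in Table 1 follows directly from the downsampling multiples: a stride-s head produces a (640/s) × (640/s) feature map, and each cell of that map corresponds to an s × s patch of the input image. A minimal sketch:

```python
def head_geometry(input_size=640, strides=(4, 8, 16, 32)):
    """Feature-map side length for each detection-head stride.

    With a 640 x 640 input, strides 4/8/16/32 yield 160/80/40/20 maps
    (matching Table 1); each cell of a stride-s map covers an s x s patch.
    """
    return {s: input_size // s for s in strides}
```

This is why the added 160 × 160 layer matters for small objects: a target smaller than 32 × 32 pixels occupies at most a single cell of the 20 × 20 head but sixty-four cells of the 160 × 160 head.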
Table 2. Number of different target types in the AI-TOD dataset.
Object Class | Train Number | Test Number | Proportion (%)
person | 14127 | 3841 | 5.00
vehicle | 248077 | 59915 | 87.78
ship | 13541 | 3791 | 4.79
airplane | 623 | 170 | 0.22
storage-tank | 5278 | 2479 | 1.87
bridge | 512 | 140 | 0.18
wind-mill | 176 | 67 | 0.06
pool | 293 | 34 | 0.10
Table 3. Number of different target types in the DOTA dataset.
Object Class | Train Number | Test Number | Proportion (%)
plane | 8055 | 2531 | 8.77
large-vehicle | 16969 | 4387 | 15.20
small-vehicle | 26126 | 5438 | 18.85
ship | 28068 | 8960 | 31.05
harbor | 5983 | 2090 | 7.24
ground-track-field | 325 | 144 | 0.50
soccer-ball-field | 326 | 153 | 0.53
tennis-court | 2367 | 760 | 2.63
baseball-diamond | 415 | 214 | 0.74
swimming-pool | 1736 | 440 | 1.52
roundabout | 399 | 179 | 0.62
basketball-court | 515 | 132 | 0.46
storage-tank | 5029 | 2888 | 10.01
bridge | 2047 | 464 | 1.61
helicopter | 630 | 73 | 0.25
Table 4. Number of different target types in the RSOD dataset.
Object Class | Image Number | Entity Number | Proportion (%)
aircraft | 446 | 4993 | 71.84
oiltank | 189 | 191 | 2.75
overpass | 176 | 180 | 2.59
playground | 165 | 1586 | 22.82
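The Proportion column in Tables 2–4 is each class's percentage share of all annotated instances in that dataset. As a quick check against Table 4 (4993 of the 6950 RSOD entities are aircraft):

```python
def class_proportion(count, total):
    """Percentage share of one class among all annotated instances."""
    return round(100.0 * count / total, 2)

# Entity counts as reported for RSOD (Table 4); the total is 6950 instances.
rsod = {"aircraft": 4993, "oiltank": 191, "overpass": 180, "playground": 1586}
total = sum(rsod.values())
```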
Table 5. Class-wise object detection results of MAF-Net on the AI-TOD dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 2804 | 70437 | 0.667 | 0.553 | 0.558 | 0.245
person | 2804 | 3841 | 0.743 | 0.264 | 0.356 | 0.118
vehicle | 2804 | 59915 | 0.766 | 0.745 | 0.754 | 0.315
ship | 2804 | 3791 | 0.779 | 0.667 | 0.722 | 0.348
airplane | 2804 | 170 | 0.802 | 0.794 | 0.826 | 0.396
storage-tank | 2804 | 2479 | 0.831 | 0.83 | 0.864 | 0.475
bridge | 2804 | 140 | 0.683 | 0.462 | 0.52 | 0.204
wind-mill | 2804 | 67 | 0.3 | 0.284 | 0.183 | 0.0394
pool | 2804 | 34 | 0.432 | 0.382 | 0.242 | 0.064
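The "all" row in these class-wise tables is consistent with an unweighted mean over the per-class values, which is the usual mAP convention. Using the mAP@0.5 column of Table 5 as an example:

```python
# Per-class mAP@0.5 values reported in Table 5 (AI-TOD).
per_class_map50 = {
    "person": 0.356, "vehicle": 0.754, "ship": 0.722, "airplane": 0.826,
    "storage-tank": 0.864, "bridge": 0.52, "wind-mill": 0.183, "pool": 0.242,
}

def overall_metric(per_class):
    """Unweighted mean over classes, matching the 'all' row convention."""
    return sum(per_class.values()) / len(per_class)
```

The mean of these eight values is 0.5584, which rounds to the reported overall mAP@0.5 of 0.558; note that rare classes (wind-mill, pool) pull this average down despite the strong vehicle performance.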
Table 6. Class-wise object detection results on the DOTA dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 458 | 28853 | 0.718 | 0.474 | 0.49 | 0.28
plane | 458 | 2531 | 0.812 | 0.727 | 0.739 | 0.461
large-vehicle | 458 | 4387 | 0.718 | 0.775 | 0.764 | 0.499
small-vehicle | 458 | 5438 | 0.579 | 0.6 | 0.569 | 0.309
ship | 458 | 8960 | 0.799 | 0.584 | 0.583 | 0.312
harbor | 458 | 2090 | 0.683 | 0.755 | 0.713 | 0.319
ground-track-field | 458 | 144 | 0.774 | 0.309 | 0.396 | 0.169
soccer-ball-field | 458 | 153 | 0.601 | 0.333 | 0.298 | 0.171
tennis-court | 458 | 760 | 0.839 | 0.909 | 0.921 | 0.773
baseball-diamond | 458 | 214 | 0.763 | 0.491 | 0.578 | 0.308
swimming-pool | 458 | 440 | 0.646 | 0.55 | 0.487 | 0.188
roundabout | 458 | 179 | 0.685 | 0.0615 | 0.103 | 0.0366
basketball-court | 458 | 132 | 0.586 | 0.402 | 0.408 | 0.289
storage-tank | 458 | 2888 | 0.68 | 0.346 | 0.357 | 0.174
bridge | 458 | 464 | 0.602 | 0.235 | 0.257 | 0.0748
helicopter | 458 | 73 | 1 | 0.0323 | 0.182 | 0.123
Table 7. Class-wise object detection results of MAF-Net on the RSOD dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 253 | 777 | 0.347 | 0.955 | 0.349 | 0.238
aircraft | 253 | 546 | 0.33 | 0.973 | 0.337 | 0.224
oiltank | 253 | 197 | 0.372 | 0.954 | 0.373 | 0.306
overpass | 253 | 19 | 0.347 | 0.895 | 0.321 | 0.14
playground | 253 | 15 | 0.338 | 1 | 0.364 | 0.28
Table 8. Fine-grained ablation experiment results on AI-TOD dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.664 | 0.243 | 0.256 | 0.104
YOLOv7 + HAE (Average/Max Pooling Only) | 0.612 | 0.357 | 0.328 | 0.142
YOLOv7 + HAE (With Variance Pooling) | 0.635 | 0.402 | 0.360 | 0.165
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.587 | 0.396 | 0.341 | 0.151
YOLOv7 + AGD (With Coordinate Decoupling) | 0.603 | 0.448 | 0.385 | 0.179
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.648 | 0.501 | 0.433 | 0.208
YOLOv7 + 160 × 160 Detection Layer | 0.537 | 0.401 | 0.420 | 0.181
YOLOv7 + Density-adaptive Anchor | 0.628 | 0.327 | 0.301 | 0.132
YOLOv7 + Hierarchical Feature Aggregation | 0.641 | 0.315 | 0.304 | 0.135
YOLOv7 + Joint Optimization of Three Components | 0.605 | 0.473 | 0.489 | 0.215
MAF-Net (Complete Model) | 0.667 | 0.553 | 0.558 | 0.245
Table 9. Fine-grained ablation experiment results on DOTA dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.413 | 0.465 | 0.260 | 0.156
YOLOv7 + HAE (Average/Max Pooling Only) | 0.389 | 0.521 | 0.315 | 0.173
YOLOv7 + HAE (With Variance Pooling) | 0.407 | 0.558 | 0.343 | 0.189
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.376 | 0.513 | 0.302 | 0.168
YOLOv7 + AGD (With Coordinate Decoupling) | 0.398 | 0.572 | 0.338 | 0.185
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.425 | 0.614 | 0.379 | 0.207
YOLOv7 + 160 × 160 Detection Layer | 0.412 | 0.513 | 0.279 | 0.166
YOLOv7 + Density-adaptive Anchor | 0.435 | 0.498 | 0.297 | 0.169
YOLOv7 + Hierarchical Feature Aggregation | 0.429 | 0.487 | 0.291 | 0.167
YOLOv7 + Joint Optimization of Three Components | 0.468 | 0.572 | 0.336 | 0.189
MAF-Net (Complete Model) | 0.718 | 0.474 | 0.490 | 0.280
Table 10. Fine-grained ablation experiment results on RSOD dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.056 | 0.142 | 0.027 | 0.017
YOLOv7 + HAE (Average/Max Pooling Only) | 0.189 | 0.423 | 0.156 | 0.098
YOLOv7 + HAE (With Variance Pooling) | 0.214 | 0.487 | 0.182 | 0.117
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.197 | 0.456 | 0.163 | 0.102
YOLOv7 + AGD (With Coordinate Decoupling) | 0.226 | 0.532 | 0.195 | 0.124
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.253 | 0.601 | 0.227 | 0.143
YOLOv7 + 160 × 160 Detection Layer | 0.223 | 0.612 | 0.208 | 0.131
YOLOv7 + Density-adaptive Anchor | 0.164 | 0.385 | 0.134 | 0.089
YOLOv7 + Hierarchical Feature Aggregation | 0.172 | 0.398 | 0.141 | 0.093
YOLOv7 + Joint Optimization of Three Components | 0.276 | 0.712 | 0.264 | 0.162
MAF-Net (Complete Model) | 0.347 | 0.955 | 0.349 | 0.238
Table 11. Comparison of object detection results among different models on the AI-TOD dataset.
Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
MAF-Net | 0.667 | 0.553 | 0.558 | 0.245
YOLOv7-SCD | 0.5789 | 0.4543 | 0.4677 | 0.2045
YOLOv7-UWSC | 0.741 | 0.3491 | 0.372 | 0.162
YOLOv7 Improved | 0.588 | 0.4874 | 0.4893 | 0.211
YOLOv7-tiny | 0.7602 | 0.2715 | 0.2892 | 0.1161
YOLOv7-MH | 0.6304 | 0.4299 | 0.4417 | 0.1917
Table 12. Comparison of object detection results among different models on the DOTA dataset.
Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
MAF-Net | 0.718 | 0.474 | 0.49 | 0.28
YOLOv7-SCD | 0.6633 | 0.433 | 0.4234 | 0.2209
YOLOv7-UWSC | 0.405 | 0.51 | 0.281 | 0.166
YOLOv7 Improved | 0.472 | 0.625 | 0.32 | 0.264
YOLOv7-tiny | 0.381 | 0.35 | 0.193 | 0.094
YOLOv7-MH | 0.406 | 0.463 | 0.257 | 0.146
Table 13. Comparison of object detection results among different models in RSOD dataset.

| Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| MAF-Net | 0.347 | 0.955 | 0.349 | 0.238 |
| YOLOv7-SCD | 0.305 | 0.829 | 0.311 | 0.21 |
| YOLOv7-UWSC | 0.0396 | 0.0962 | 0.0205 | 0.0129 |
| YOLOv7 Improved | 0.582 | 0.224 | 0.0734 | 0.0121 |
| YOLOv7-tiny | 0.248 | 0.695 | 0.243 | 0.147 |
| YOLOv7-MH | 0.149 | 0.5 | 0.126 | 0.0779 |
Table 14. Comparison of model parameters, computational complexity and inference speed.

| Model Name | Parameters (M) | FLOPs (G) | FPS |
|---|---|---|---|
| MAF-Net | 39.123 | 59.773 | 21.277 |
| YOLOv7-SCD | 41.564 | 49.453 | 9.346 |
| YOLOv7-UWSC | 42.437 | 112.139 | 14.286 |
| YOLOv7 Improved | 37.829 | 59.942 | 11.96 |
| YOLOv7-tiny | 6.270 | 6.697 | 38.46 |
| YOLOv7-MH | 37.532 | 52.608 | 23.81 |
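FPS figures such as those in Table 14 are typically obtained by timing repeated forward passes after a warm-up phase. A generic timing harness along those lines is sketched below; the `infer` callable, warm-up count, and repeat count are illustrative assumptions rather than the authors' benchmarking code.

```python
import time

def measure_fps(infer, frames, warmup=5, repeats=3):
    """Estimate end-to-end frames per second for a detector forward pass.

    `infer` is any callable taking one frame. Warm-up iterations are
    excluded so one-off initialization cost does not skew the estimate.
    """
    for frame in frames[:warmup]:
        infer(frame)
    processed = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for frame in frames:
            infer(frame)
            processed += 1
    elapsed = time.perf_counter() - start
    return processed / elapsed
```

On GPU backends an explicit synchronization call would be needed after each pass before reading the clock, otherwise asynchronous kernel launches make the measured time optimistic.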
Table 15. Detection accuracy comparison of various models for small object detection on AI-TOD dataset.

| Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| MAF-Net | 0.667 | 0.553 | 0.558 | 0.245 |
| YOLOv7-UWSC [50] | 0.741 | 0.3491 | 0.372 | 0.162 |
| YOLOv7-tiny [51] | 0.7602 | 0.2715 | 0.2892 | 0.1161 |
| KANs-DETR [39] | 0.638 | 0.412 | 0.427 | 0.186 |
| YOLO-CC [45] | 0.605 | 0.437 | 0.443 | 0.197 |
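The mAP@0.5 values compared throughout these tables average a per-class AP, i.e., the area under each class's precision-recall curve with detections matched to ground truth at an IoU threshold of 0.5. Since the exact evaluation code is not shown, the sketch below assumes the common monotone-interpolation convention for computing one class's AP from its PR curve.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with precision made non-increasing via a
    right-to-left running max, the usual convention behind mAP@0.5."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # interpolate precision
    idx = np.flatnonzero(r[1:] != r[:-1])     # points where recall increases
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP@0.5 is then the mean of this quantity over all classes; mAP@0.5:0.95 additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.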
Table 16. Detailed specifications of the DJI Mavic Mini 2 UAV.

| Item | Specification |
|---|---|
| Folded dimensions (without propellers) | 138 × 81 × 58 mm |
| Unfolded dimensions (without propellers) | 159 × 203 × 56 mm |
| Diagonal wheelbase | 213 mm |
| Maximum horizontal flight speed (near sea level, no wind) | 16 m/s (Sport mode), 10 m/s (Normal mode), 6 m/s (Cine mode) |
| Maximum ascent speed | 6 m/s (Normal mode), 8 m/s (Sport mode) |
| Maximum descent speed | 6 m/s |
| Maximum hover time | 38 min |
| Maximum flight time | 45 min |
| Battery capacity | 5000 mAh |
| Gimbal pitch range | −135° to 45° |
| Gimbal roll range | −45° to 45° |
| Gimbal yaw range | −27° to 27° |
Table 17. Detailed distribution of the test dataset across different groups.

| Category | Indoor | Grid | Pavement | Water | Grass |
|---|---|---|---|---|---|
| Images | 208 | 203 | 848 | 367 | 154 |
| AEW Aircraft | 14 | 49 | 155 | 0 | 17 |
| Aircraft | 27 | 15 | 128 | 0 | 27 |
| Fighter | 18 | 31 | 173 | 0 | 17 |
| Helicopter | 54 | 58 | 184 | 0 | 22 |
| Hummer | 19 | 59 | 226 | 0 | 50 |
| Missile | 40 | 46 | 127 | 0 | 0 |
| Tank | 42 | 36 | 199 | 0 | 49 |
| Truck | 10 | 5 | 394 | 0 | 5 |
| Warship | 0 | 0 | 0 | 339 | 0 |
| Yacht | 0 | 0 | 0 | 227 | 0 |
Table 18. Class-wise object detection performance on the Indoor dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 26 | 28 | 0.679 | 0.9 | 0.859 | 0.742 |
| aew | 26 | 6 | 0.888 | 1 | 0.995 | 0.85 |
| aircraft | 26 | 2 | 0.179 | 1 | 0.995 | 0.995 |
| fighter | 26 | 6 | 0.885 | 1 | 0.995 | 0.826 |
| helicopter | 26 | 4 | 0.992 | 0.25 | 0.579 | 0.457 |
| hummer | 26 | 1 | 0.547 | 1 | 0.995 | 0.896 |
| missile | 26 | 3 | 0.362 | 0.95 | 0.456 | 0.328 |
| tank | 26 | 1 | 0.841 | 1 | 0.995 | 0.896 |
| truck | 26 | 5 | 0.734 | 1 | 0.862 | 0.686 |
Table 19. Class-wise object detection performance on the Grid dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 20 | 35 | 0.745 | 0.648 | 0.711 | 0.542 |
| aew | 20 | 3 | 0.767 | 1 | 0.995 | 0.807 |
| aircraft | 20 | 1 | 1 | 0 | 0.199 | 0.139 |
| fighter | 20 | 4 | 0.931 | 0.25 | 0.459 | 0.34 |
| helicopter | 20 | 5 | 0.866 | 1 | 0.995 | 0.699 |
| hummer | 20 | 7 | 0.649 | 1 | 0.889 | 0.619 |
| missile | 20 | 5 | 0.558 | 0.6 | 0.662 | 0.526 |
| tank | 20 | 6 | 0.498 | 0.332 | 0.491 | 0.407 |
| truck | 20 | 4 | 0.693 | 1 | 0.995 | 0.796 |
Table 20. Class-wise object detection performance on the Pavement dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 84 | 123 | 0.974 | 0.989 | 0.994 | 0.791 |
| aew | 84 | 12 | 0.916 | 1 | 0.995 | 0.769 |
| aircraft | 84 | 13 | 0.953 | 1 | 0.995 | 0.782 |
| fighter | 84 | 14 | 0.983 | 1 | 0.995 | 0.816 |
| helicopter | 84 | 20 | 0.983 | 1 | 0.995 | 0.822 |
| hummer | 84 | 20 | 0.985 | 1 | 0.995 | 0.809 |
| missile | 84 | 14 | 0.99 | 1 | 0.995 | 0.792 |
| tank | 84 | 19 | 0.982 | 1 | 0.995 | 0.873 |
| truck | 84 | 11 | 1 | 0.909 | 0.988 | 0.664 |
Table 21. Class-wise object detection performance on the Water dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 36 | 60 | 0.992 | 0.985 | 0.995 | 0.632 |
| warship | 36 | 33 | 1 | 0.97 | 0.995 | 0.73 |
| yacht | 36 | 27 | 0.984 | 1 | 0.995 | 0.533 |
Table 22. Class-wise object detection performance on the Grass dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 15 | 19 | 0.857 | 0.75 | 0.829 | 0.603 |
| aew | 15 | 2 | 0.993 | 0.5 | 0.828 | 0.679 |
| aircraft | 15 | 2 | 0.695 | 1 | 0.995 | 0.721 |
| helicopter | 15 | 4 | 0.719 | 1 | 0.995 | 0.821 |
| hummer | 15 | 5 | 0.845 | 1 | 0.995 | 0.706 |
| tank | 15 | 5 | 0.888 | 1 | 0.995 | 0.577 |
| truck | 15 | 1 | 1 | 0 | 0.166 | 0.116 |
Table 23. Class-wise object detection performance on the IGPWG dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 178 | 262 | 0.963 | 0.987 | 0.99 | 0.766 |
| aew | 178 | 20 | 0.854 | 1 | 0.995 | 0.8 |
| aircraft | 178 | 14 | 0.973 | 1 | 0.995 | 0.828 |
| fighter | 178 | 27 | 0.963 | 0.969 | 0.994 | 0.806 |
| helicopter | 178 | 39 | 0.998 | 1 | 0.995 | 0.787 |
| hummer | 178 | 35 | 0.99 | 1 | 0.995 | 0.792 |
| missile | 178 | 24 | 0.953 | 1 | 0.995 | 0.814 |
| tank | 178 | 33 | 0.97 | 1 | 0.995 | 0.829 |
| truck | 178 | 18 | 0.976 | 0.944 | 0.99 | 0.741 |
| warship | 178 | 31 | 0.996 | 1 | 0.996 | 0.739 |
| yacht | 178 | 21 | 0.952 | 0.953 | 0.952 | 0.525 |

Share and Cite

Ma, K.; Zhang, Z.; Zhang, J.; Huang, J. AI-Native Multi-Scale Attention Fusion for Ubiquitous Aerial Sensing: Small Object Detection in UAV Imagery. Electronics 2026, 15, 1100. https://doi.org/10.3390/electronics15051100
