Article

AI-Native Multi-Scale Attention Fusion for Ubiquitous Aerial Sensing: Small Object Detection in UAV Imagery

1 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
2 Aviation University of Air Force, Changchun 130012, China
3 Northwest Institute of Nuclear Technology, Xi’an 710024, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(5), 1100; https://doi.org/10.3390/electronics15051100
Submission received: 31 January 2026 / Revised: 26 February 2026 / Accepted: 28 February 2026 / Published: 6 March 2026

Abstract

Ubiquitous aerial sensing with unmanned aerial vehicles (UAVs) is becoming an essential component of AI-native perception systems, motivated by the trend toward edge deployment and potential integration with future sixth-generation (6G)-connected aerial networks. However, small object detection in high-altitude UAV imagery remains highly challenging due to the extremely low pixel occupancy of targets and the severe multi-scale interference introduced by complex backgrounds. In this work, we therefore focus on improving the perception-side accuracy and computational efficiency of small-object detection in UAV imagery. To address these limitations, we propose a Multi-scale Attention Fusion Network (MAF-Net), an AI-native paradigm for real-time small object detection in UAV imagery. The proposed approach enhances small-target representation and robustness through three key designs. First, a density-adaptive anchor optimization strategy is developed by combining K-means++ clustering with an IoU-based distance metric, enabling anchors to better match scale variation under diverse object densities. Second, a multi-scale feature reinforcement module is introduced to strengthen fine-grained detail preservation by integrating shallow feature maps via skip connections and hierarchical aggregation. Third, a dual-path attention mechanism is employed to jointly model channel importance and spatial localization, improving discriminative feature calibration in cluttered aerial scenes. Extensive experiments on three public benchmarks (AI-TOD, DOTA, and RSOD) demonstrate that MAF-Net consistently outperforms the baseline detector, achieving mAP@0.5 gains of 14.1%, 11.28%, and 22.09%, respectively. These results confirm that MAF-Net provides an effective and deployment-friendly solution for robust small object detection, supporting real-time UAV-based inspection and AI-native ubiquitous aerial sensing applications.

1. Introduction

Unmanned aerial vehicle (UAV)-based aerial sensing has rapidly evolved into a key component of ubiquitous sensing for modern intelligent systems, enabling applications such as infrastructure inspection, public safety monitoring, precision agriculture, and emergency response [1,2,3,4,5,6]. With the emerging trend of AI-native computing, perception models are increasingly pushed to resource-constrained edge platforms, and future sixth-generation (6G) aerial networks may further enable collaborative sensing among UAVs. These trends motivate the need for accurate yet efficient on-board object detection from high-resolution drone imagery [7]. Meanwhile, recent AI-oriented system studies have also highlighted the infrastructure-side constraints that accompany AI proliferation [8,9,10,11,12,13].
Object detection aims to simultaneously localize and classify objects in images, and recent deep learning detectors have achieved impressive performance on standard benchmarks such as Microsoft Common Objects in Context (MS COCO) [14] and Pascal Visual Object Classes (Pascal VOC) [15]. However, these detectors often show a noticeable performance drop when applied to UAV remote sensing imagery, especially for small object detection. This degradation is primarily caused by the unique imaging characteristics of high-altitude aerial scenes: (i) extremely low pixel occupancy, where targets may occupy fewer than 10 pixels in width/height; (ii) severe scale variation within the same image due to viewpoint changes and altitude differences; and (iii) complex background clutter with strong appearance similarity to targets [16]. In practical UAV reconnaissance scenarios, the ratio of target area to image area can be well below 0.5%, making feature extraction and discrimination fundamentally challenging.
Existing detectors struggle in such cases, mainly because small targets are easily suppressed by deep downsampling operations and are prone to being overwhelmed during multi-scale feature fusion. Moreover, conventional anchor configurations are typically optimized for natural images and do not match the density and scale distributions of aerial targets, leading to poor localization and low recall in crowded scenes. These limitations indicate that UAV small object detection requires dedicated designs that preserve fine-grained features, enhance multi-scale perception, and adapt to density-dependent target patterns while maintaining real-time efficiency for deployment in aerial sensing systems.
To address these challenges, we propose MAF-Net (Multi-scale Attention Fusion Network), a real-time detection framework tailored for small object detection in UAV imagery. Specifically, MAF-Net is designed to resolve three key bottlenecks in high-altitude UAV scenes: (i) tiny-object features are easily diminished by deep downsampling; (ii) conventional anchor priors are mismatched with the scale–density distribution of aerial targets, especially in crowded scenes; and (iii) cluttered backgrounds lead to ambiguous localization and weak target–background separability. Accordingly, MAF-Net improves small-target sensitivity and robustness through three aspects:
  • Shallow feature integration for small targets: We strengthen fine-grained representation by introducing an additional high-resolution 160 × 160 detection head and reinforcing shallow feature maps via skip connections and hierarchical aggregation, preserving spatial details critical for tiny objects.
  • Density-adaptive anchor optimization: We introduce an anchor generation strategy based on K-means++ clustering with an IoU-based distance metric, enabling anchors to better match dataset-specific scale variation under diverse object densities and improving recall in crowded aerial scenes.
  • Dual-path attention for robust discrimination: We employ a lightweight dual-path attention mechanism that jointly models channel-wise importance and spatial localization, combining variance-aware hybrid attention enhancement (HAE) and coordinate-decoupled geometric attention (AGD) to improve feature recalibration and target–background separability in cluttered environments.
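The density-adaptive anchor strategy in the second bullet can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's released implementation: `iou_wh` and `kmeanspp_anchors` are hypothetical helper names, boxes are compared by width/height only (both aligned at the origin), and the clustering distance is 1 − IoU as described.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors compared by width/height only
    (both aligned at the origin); returns shape (n_boxes, n_anchors)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeanspp_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs with k-means++ seeding and a
    1 - IoU distance; returns k anchor shapes sorted by area."""
    rng = np.random.default_rng(seed)
    # k-means++ seeding: first center uniform, later centers sampled
    # proportionally to squared distance to the nearest chosen center.
    centers = [wh[rng.integers(len(wh))]]
    for _ in range(1, k):
        d = 1.0 - iou_wh(wh, np.array(centers)).max(axis=1)
        probs = d**2 / (d**2).sum()
        centers.append(wh[rng.choice(len(wh), p=probs)])
    centers = np.array(centers, dtype=float)
    # Lloyd iterations with assignment by highest IoU (lowest 1 - IoU).
    for _ in range(iters):
        assign = iou_wh(wh, centers).argmax(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]
```

In crowded aerial scenes this tends to allocate several small anchor shapes to the dense low-area mode of the width/height distribution, which is the behavior the density-adaptive design targets.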
Extensive experiments on public UAV remote sensing benchmarks demonstrate that MAF-Net achieves consistent improvements over the baseline detector under challenging small-object settings while maintaining real-time inference efficiency. We validate our method primarily from the perception perspective, reporting detection accuracy and on-device efficiency metrics; end-to-end communication-aware latency under bandwidth constraints is not explicitly modeled in our experiments. These results highlight the effectiveness of AI-native multi-scale attention fusion for ubiquitous aerial sensing tasks.
Additionally, although MAF-Net is built upon a YOLOv7 baseline, it is not a straightforward “small head + attention + anchor tuning” recipe. Firstly, our attention is variance-aware: we explicitly introduce variance pooling (beyond average/max pooling) to capture local intensity/texture dispersion that is prominent in UAV small objects and cluttered backgrounds, and we combine it with a coordinate-decoupled geometric attention to strengthen position-sensitive cues. Furthermore, our anchors are density-adaptive: instead of applying generic auto-anchor settings, we design a K-means++ clustering with an IoU-based distance tailored to aerial scale–density distributions, improving matching in crowded scenes. Finally, the added high-resolution small-object head is coupled with a lightweight multi-scale fusion design to preserve shallow spatial details while maintaining real-time efficiency, as verified by ablation studies.
Thus, we employ an AI-native approach, which refers to a data-driven perception pipeline in which object detection is implemented as a first-class on-board AI function and is explicitly designed for edge deployment constraints (e.g., real-time inference and limited compute/memory) rather than a hand-crafted or rule-based vision pipeline. The main contributions of this work are summarized as follows:
  • We propose MAF-Net, a real-time detection framework that enhances small object sensitivity by combining shallow feature reinforcement, density-adaptive anchor design, and dual-path attention fusion.
  • We provide a systematic design and analysis of feature aggregation strategies that better preserve small-target spatial cues in high-resolution UAV imagery.
  • We validate the proposed method on multiple benchmark datasets, demonstrating robust performance gains for small object detection under complex aerial backgrounds.

2. Related Work

Small object detection has long been recognized as a challenging problem in object detection, especially in aerial remote sensing where objects are tiny, densely distributed, and embedded in complex backgrounds. In recent years, deep neural network-based detectors have become the dominant paradigm and can generally be categorized into two-stage and one-stage frameworks. This section reviews representative approaches and discusses their limitations in UAV small object detection scenarios, integrating the latest advances in remote sensing-specific small object detection.

2.1. One-Stage Object Detectors

One-stage detectors directly predict categories and bounding boxes without explicit proposal generation, offering faster inference and making them attractive for real-time aerial sensing. The YOLO family is a representative line of one-stage detectors. YOLOv1 formulates detection as a regression problem with high speed but limited localization accuracy for small or crowded objects [17]. Subsequent versions, including YOLOv2 and YOLOv3, improve detection performance via anchor-based prediction and multi-scale heads, with YOLOv3 introducing feature pyramid-style outputs to better handle scale variation [18,19]. YOLOv4 and later versions, such as YOLOv5 [20], further incorporate advanced augmentation and feature fusion strategies (e.g., Mosaic, PAN/FPN), improving both accuracy and efficiency [21]. More recent designs such as YOLOv6, YOLOv7, and YOLOv8 continue to enhance real-time detection capability with optimized training strategies and architectural refinements [22,23]. The latest YOLOv10 further optimizes real-time end-to-end detection but still faces challenges in small object detection in remote sensing images due to background noise, information loss, and complex multi-object interactions [24].
Other one-stage detectors include SSD, which performs multi-scale prediction on different feature maps but relies heavily on manually designed anchor settings and often underperforms on small dense objects [25]. RetinaNet addresses class imbalance with focal loss but may not achieve real-time performance under high-resolution aerial inputs [26]. EfficientDet is built upon a weighted bidirectional feature pyramid (BiFPN) and employs compound scaling, offering improved feature fusion, but its computational cost can still be non-trivial for real-time UAV deployment [27]. Dynamic YOLO [28], designed for underwater small object detection, constructs a lightweight backbone based on DCN v3 and a unified feature fusion framework with channel-aware, scale-aware, and spatial-aware attention, which provides valuable insights for aerial small object detection.

2.2. Two-Stage Object Detectors

Two-stage detectors typically generate region proposals first and then refine them through classification and bounding-box regression, often achieving strong accuracy. Early works such as R-CNN introduced deep features into detection pipelines but suffered from multi-stage training and high computational overhead [29]. SPPNet improved efficiency by adopting spatial pyramid pooling to reduce redundant convolution operations, but it was not fully end-to-end trainable [30]. Fast R-CNN and Faster R-CNN further advanced two-stage detection by enabling shared feature extraction and introducing region proposal networks (RPNs), improving both accuracy and speed [31,32].
To enhance multi-scale representation, Feature Pyramid Networks (FPNs) construct pyramidal features to detect objects at different scales. While effective, downsampling in deeper layers still weakens fine-grained small-object cues, and performance may degrade under extreme tiny targets or dense distributions. Cascade R-CNN improves localization via progressive refinement with increasing IoU thresholds, but this cascaded design is typically less suitable for real-time UAV applications [33]. R-FCN introduces position-sensitive mechanisms to improve localization, whereas Mask R-CNN extends detection to instance segmentation, usually at higher computational cost [34,35]. TridentNet leverages multiple branches with different receptive fields for multi-scale detection, but its inference complexity limits real-time deployment [36].
Overall, two-stage detectors are often accurate but can be computationally heavy and may still struggle with extremely small aerial targets due to insufficient high-resolution feature preservation. Transformer-based two-stage detectors such as DETR [37] and Deformable DETR [38] use self-attention mechanisms to capture global context, but their performance on small objects is limited by insufficient local feature extraction. KANs-DETR [39] replaces fully connected layers with Kolmogorov–Arnold Networks (KANs) to enhance feature representation robustness, providing a new direction for optimizing two-stage detectors for small object detection.

2.3. Small Object Detection in Aerial and Construction Imagery: Methods, Challenges, and Research Gaps

Small object detection in UAV imagery introduces additional constraints beyond natural-image detection, including ultra-low pixel occupancy, severe background clutter, and strong scale-density variations across scenes [16]. To improve small-object sensitivity, researchers have explored feature pyramid enhancement, shallow feature reuse, attention mechanisms, and adaptive anchor design [35,36,40]. Attention-based modules can emphasize salient regions and suppress clutter, improving the detectability of small targets in large backgrounds [34,35]. Meanwhile, anchor optimization strategies aim to align prior box distributions with dataset-specific object scales and densities, improving localization and recall in crowded aerial scenes [41,42].
Recent progress in UAV-based small-object detection for construction monitoring has underscored the effectiveness of deep learning in addressing practical engineering challenges. However, these advancements also reveal unresolved issues that are central to our research. Wang [43] systematically evaluated ten traditional image augmentation methods (e.g., shearing, contrast adjustment, probabilistic sampling) for rebar counting in reinforced concrete (RC) structures, using Faster R-CNN and YOLOv10 with transformer backbones (ViT, PVT, Swin Transformer). Their work confirmed that augmentation efficacy is architecture-dependent—for example, shearing achieved the best performance on YOLOv10-PVT (AP50 = 87.71%, rebar count accuracy = 86.27%)—and emphasized that geometric distortions (e.g., translation, scaling) can degrade performance for thin, small objects like rebars. While this study advanced data augmentation for construction-specific small targets, it focused primarily on dataset expansion rather than optimizing feature extraction or attention mechanisms for ultra-small, low-pixel targets (≤32 × 32 pixels) common in high-altitude aerial sensing.
Complementary research on non-PPE identification in construction sites [44] further validated YOLOv10-based transformer architectures for small-object detection under complex on-site backgrounds but similarly relied on standard feature fusion pipelines that struggle with the extreme scale imbalance (e.g., 97.96% small targets in the AI-TOD dataset) and dense clutter in remote sensing scenarios. Collectively, these construction-focused studies [43,44] underscore the critical role of task-specific optimization (augmentation, backbone selection) for small targets, but they lack dedicated designs to address the three core challenges in high-altitude UAV imagery: extremely low feature signal-to-noise ratio (SNR), ambiguous spatial localization, and insufficient channel feature decoupling.
Motivated by these insights and the gaps in existing work, our work develops a real-time framework (MAF-Net) that combines shallow feature reinforcement, density-adaptive anchors, and a dual-path attention fusion mechanism. Unlike prior construction-related methods [43,44] that prioritize augmentation or backbone tuning, MAF-Net targets the root cause of small-object detection failure in aerial imagery—weak feature representation—by: (1) adding a shallow 160 × 160 detection layer to preserve fine-grained spatial cues; (2) optimizing anchors via K-means++ clustering with IoU-based distance metrics to match diverse object densities; (3) integrating a hybrid-attention encoder (HAE) and attention-guided decoder (AGD) to jointly model channel importance and coordinate-aware spatial localization. This design enables MAF-Net to outperform construction-specific baselines [43] and general aerial detectors [16,40] on ultra-small targets, supporting AI-native ubiquitous aerial sensing applications beyond construction scenarios.

3. The MAF-Net Model

3.1. Problem Analysis

This paper focuses on small-target detection from UAVs in high-altitude visible-light imagery. Under the formal framework of a target detection task, the input image space is defined as:
$I \in \mathbb{R}^{H \times W \times 3}$
where $H$ and $W$ denote the height and width of the image, respectively. The target detection process can be modeled as a mapping function:
$f: I \rightarrow B, \quad B = \{(b_i, l_i)\}_{i=1}^{N}$
where $B$ is the detection output set, containing bounding-box coordinates $b_i \in \mathbb{R}^4$ and corresponding category labels $l_i \in \mathbb{Z}^{+}$. Three quantitative criteria are commonly used to define tiny targets:
(1)
As shown in Equation (3), it is quantified as the ratio between the object bounding-box area and the total image area:
$\forall b_i \in B, \quad \mathrm{Area}(b_i) \le \epsilon \cdot H \cdot W \quad (\epsilon = 0.009)$
(2)
As shown in Equation (4), it is defined by the absolute value of the object bounding-box area:
$\forall b_i \in B, \quad \mathrm{Area}(b_i) \le 32 \times 32 \ \text{pixels}$
(3)
As shown in Equation (5), it is quantified by the ratios of the object bounding-box width and height relative to the image width and height, respectively:
$\min\!\left(\frac{\mathrm{width}(b_i)}{W}, \frac{\mathrm{height}(b_i)}{H}\right) \le \delta \quad (\delta = 0.1)$
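The three smallness criteria above can be expressed as a small helper function (an illustrative sketch; `is_small_object` and its argument names are not from the paper):

```python
def is_small_object(w, h, img_w, img_h,
                    eps=0.009, abs_area=32 * 32, delta=0.1):
    """Check a bounding box (w, h) in an (img_w, img_h) image against
    the three smallness criteria; returns one boolean per criterion."""
    rel_area = (w * h) <= eps * img_w * img_h          # Equation (3)
    absolute = (w * h) <= abs_area                     # Equation (4)
    rel_side = min(w / img_w, h / img_h) <= delta      # Equation (5)
    return rel_area, absolute, rel_side
```

For example, a 10 × 10 box in an 800 × 800 image satisfies all three criteria, while a 200 × 200 box satisfies none.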
The feature map $F \in \mathbb{R}^{C \times H \times W}$ of the remote sensing image $I$ is derived from the input image after feature extraction. As noted above, the small-object region is defined as $\Omega_s \subseteq \{1, 2, \dots, H\} \times \{1, 2, \dots, W\}$, and the background region is $\Omega_b = \Omega \setminus \Omega_s$ (where $\Omega$ is the full spatial domain), which satisfies $|\Omega_s| \ll |\Omega_b|$ (the spatial proportion of small objects is extremely low).
Features corresponding to the small-object region are formally decomposed as:
$F(c, i, j) = F_s(c, i, j) + F_b(c, i, j) + \varepsilon(c, i, j)$
where $F_s$ represents the true features of small-object regions, whose amplitude is significantly lower in magnitude than the background features $F_b$ ($\|F_s\| \ll \|F_b\|$), and $\varepsilon$ denotes the feature-extraction noise (information loss caused by convolution/pooling).
As discussed, small object detection is fundamentally challenged by three mathematical issues:
(1)
Extremely low feature signal-to-noise ratio (SNR):
$\mathrm{SNR} = \frac{\|F_s\|}{\|F_b\| + \|\varepsilon\|} \ll 1$
i.e., the object features are overwhelmed by the background and noise.
(2)
Ambiguous spatial localization: The spatial support set Ω s of small objects is excessively small. Traditional 2D global pooling leads to spatial information aliasing, failing to accurately distinguish the spatial positions of Ω s and Ω b .
(3)
Insufficient channel feature decoupling: Different channels contribute differently to the representation of small objects (e.g., edge/texture channels are the core features of small objects). However, without adaptive weights, effective channels are diluted by invalid ones.
The optimization objective of the detection task is to find an attention map $A \in \mathbb{R}^{C \times H \times W}$ with $0 \le A(c, i, j) \le 1$ such that the weighted feature $Y = A \odot F$ satisfies:
$A(c, i, j) \rightarrow 1 \ \text{for} \ (i, j) \in \Omega_s, \quad A(c, i, j) \rightarrow 0 \ \text{for} \ (i, j) \in \Omega_b, \quad \text{and} \quad \mathrm{SNR}(Y) \gg \mathrm{SNR}(F)$
where $\odot$ denotes element-wise multiplication (the Hadamard product).
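As a numerical illustration of this objective, the following toy example (hypothetical sizes and amplitudes, NumPy only) builds a feature map $F = F_s + F_b + \varepsilon$ with a tiny object region, applies an idealized attention map, and checks that the weighted feature has a much higher SNR:

```python
import numpy as np

def snr(F_s, F_b, eps):
    """Feature signal-to-noise ratio: ||F_s|| / (||F_b|| + ||eps||)."""
    return np.linalg.norm(F_s) / (np.linalg.norm(F_b) + np.linalg.norm(eps))

rng = np.random.default_rng(0)
C, H, W = 8, 32, 32
omega_s = np.zeros((H, W), dtype=bool)
omega_s[14:18, 14:18] = True          # tiny 4x4 object region

# Weak object signal, strong background, small extraction noise.
F_s = np.where(omega_s, 0.5, 0.0) * np.ones((C, H, W))
F_b = np.where(omega_s, 0.0, 2.0) * np.ones((C, H, W))
eps = 0.05 * rng.standard_normal((C, H, W))
F = F_s + F_b + eps

# Idealized attention map: close to 1 on the object region, close to 0
# elsewhere; Y = A (Hadamard) F decomposes into attended components.
A = np.where(omega_s, 0.95, 0.05)[None, :, :]
Y = A * F

print(snr(F_s, F_b, eps), "->", snr(A * F_s, A * F_b, A * eps))
```

With these (arbitrary) amplitudes, the attended SNR is roughly an order of magnitude larger than the raw one, matching the qualitative goal $\mathrm{SNR}(Y) \gg \mathrm{SNR}(F)$.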

3.2. Base Model Selection and Justification

3.2.1. Rationale for Selecting YOLOv7: Core Advantages Aligned with Our Problem Requirements

Accuracy–Efficiency Trade-off: Compared to its predecessors (YOLOv5–v6) and newer variants (YOLOv8–v10), YOLOv7 strikes a more favorable balance between detection accuracy and inference speed—critical for real-time small object detection in resource-constrained environments. For example, on the VisDrone2018 dataset (dominated by small UAV-captured objects), YOLOv7-tiny achieves 38.7% mAP@0.5 with 110 FPS on an NVIDIA Tesla V100, outperforming YOLOv5s (35.9% mAP@0.5, 105 FPS) and YOLOv6s (36.8% mAP@0.5, 98 FPS) [23].
Small-Object Detection Enhancement: YOLOv7 optimizes the neck structure with multi-scale feature fusion (PANet + FPN) and adaptive anchor matching, which mitigates feature loss for small objects during downsampling. Unlike YOLOv5 (which relies on simple Concat operations for feature fusion) and YOLOv8 (which simplifies the neck to focus on speed), YOLOv7 retains a more comprehensive feature pyramid that preserves fine-grained details of small targets. This design aligns with the primary challenge of small object detection in remote sensing—preserving spatial details while fusing multi-scale semantic information.
Deployment Flexibility: YOLOv7 can be deployed on edge devices with only minimal performance degradation, rendering it well-suited for airborne drone detection or industrial real-time detection scenarios—scenarios that our approach is designed to address. This flexibility is particularly valuable when compared to heavier detectors, such as DETR variants, which are challenging to deploy at the edge [28,39].

3.2.2. Comparative Analysis Against Representative Alternatives

To further validate the choice of YOLOv7, we compare it against three categories of state-of-the-art detectors, drawing key results from the literature:
(1)
Newer YOLO Variants (YOLOv8–v10)
Newer YOLO versions (v8–v10) introduce architectural refinements but often sacrifice small-object performance or efficiency for overall accuracy:
  • YOLOv8 [45]: Adopts a C2f backbone and simplified neck, achieving 41.2% mAP@0.5 on COCO but only 26.9% mAP@0.5 on the AI-TOD dataset (average object size = 12.8 pixels).
  • YOLOv9 [46]: Integrates transformer modules for global feature capture but increases computational complexity, making it less suitable for edge deployment. Its small-object performance (AP_S = 28.1% on COCO) is only marginally better than YOLOv7 (AP_S = 27.6%), at the cost of 30% higher GFLOPs.
  • YOLOv10 [24]: Introduces end-to-end NMS-free detection and multimodal fusion, achieving 50.02% mAP@0.5 on AI-TOD. However, its reliance on transformer backbones (e.g., ViT) increases latency, and it struggles with ultra-small objects (<32 × 32 pixels) due to excessive downsampling.
(2)
Two-stage detectors, typified by Faster R-CNN and related variants
Two-stage detectors are known for high accuracy but fail to meet real-time and deployment requirements:
  • Faster R-CNN [32]: Achieves 42.0% mAP@0.5 on COCO but only 26.6% AP_S (small objects) due to feature loss in deep downsampling. Its two-stage pipeline (region proposal + refinement) results in low inference speed (26 FPS), making it unsuitable for real-time scenarios such as UAV patrols [37].
  • Cascade R-CNN [33]: Improves localization accuracy (44.0% mAP@0.5 on COCO) but further increases computational overhead (180G FLOPs), with FPS dropping to 18. It also struggles with dense small objects in remote sensing images, as region proposals are prone to overlapping and missing tiny targets [47].
In contrast, YOLOv7’s one-stage design achieves comparable small-object accuracy (AP_S = 27.6% on COCO) while running 3× faster (85 FPS for YOLOv7s) and requiring 70% fewer parameters (7.1 M vs. 24 M for Faster R-CNN), making it more suitable for our real-time deployment needs.
(3)
DETR-Based Architectures
DETR and its variants leverage transformers for global feature capture but face limitations in small-object detection and efficiency:
  • DETR [37]: The original DETR achieves 42.0% mAP@0.5 on COCO but only 20.5% AP_S, as its global attention mechanism struggles with low-resolution small-object features. It requires 500 training epochs to converge (10× longer than YOLOv7) and runs at 28 FPS, too slow for real-time applications [38].
  • Deformable DETR [38]: Introduces deformable attention to focus on key sampling points, improving AP_S to 26.4% on COCO. However, it still lags behind YOLOv7 (AP_S = 27.6%) and has higher computational complexity (173G FLOPs vs. 45G for YOLOv7s), with inference speed limited to 19 FPS. Its multi-scale deformable attention module enhances small-object detection but increases memory access overhead, making it less efficient than YOLOv7’s E-ELAN.
  • KANs-DETR [39]: Replaces fully connected layers with Kolmogorov–Arnold Networks (KANs) to enhance small-object robustness, achieving 30.6% AP_S on COCO. However, its complex transformer encoder increases parameters to 12.8 M (80% more than YOLOv7s) and reduces inference speed to 22 FPS, making it less suitable for edge deployment.
  • ISO-DETR [48]: A DETR variant for industrial small object detection, replacing one-to-one matching with IoU-based many-to-one assignment. It achieves 77.3% mAP@0.5 on a custom industrial dataset but runs at 22 FPS (the same as KANs-DETR) and requires 39 M parameters, 5× more than YOLOv7-tiny.

3.3. Modeling

Building on the YOLOv7 baseline, as illustrated in Figure 1, we redesign the network to better handle small objects in UAV aerial images. First, we add a dedicated detection head tailored to small-scale targets, improving localization and classification under limited pixel support. Second, anchor generation is performed with K-means++ rather than conventional K-means, reducing sensitivity to initialization and avoiding unstable clustering results when the preset cluster configuration is sub-optimal. Third, we embed a coordinate-aware attention module to reinforce informative spatial–channel cues and suppress background interference, leading to stronger feature representations for tiny objects. Together, these modifications substantially boost performance on small-object detection.
The improved complete network model is shown in Figure 2. In the following three sections, the above three parts will be introduced in sequence, namely the improvement of anchor box size, the addition of a small object detection layer, and the coordinate attention mechanism.

3.4. Incorporating an Additional Layer for Small Object Detection

Directly applying standard object detection pipelines to small-object scenarios typically results in higher miss rates and frequent misclassification. A key reason is the feature extraction strategy adopted by most modern detectors: deep, multi-stage convolutional backbones progressively downsample the input to obtain more abstract semantic representations. While deeper layers provide stronger semantics, repeated down-sampling inevitably reduces the spatial granularity of feature maps. For small targets that occupy only a few pixels, this loss of spatial detail can erase or severely weaken the discriminative cues needed for reliable localization and recognition, thereby degrading detection performance. For an input image of size 640 × 640 , Table 1 reports the corresponding feature-map resolutions and effective receptive-field sizes under different down-sampling ratios.
In UAV high-altitude remote sensing images of size 800 × 800, most (over 90%) target bounding boxes occupy an area of less than 320 pixels, only about 0.5% of the total image area; the length or width of a target box can be under 10 pixels. As Table 1 shows, for the 20 × 20, 40 × 40, and 80 × 80 feature maps the receptive field is larger than or comparable to the small-target size, so the features of such targets are largely lost regardless of how feature extraction is performed. We therefore need a shallow 160 × 160 feature map to extract the features of the detected target and improve the detection rate for small targets.
In standard CNN backbones, hierarchical representations are formed through successive convolutional transformations applied layer by layer. Specifically, the feature tensor at depth $l$ is computed from the previous layer as
$F^{(l)} = \sigma\!\left(W^{(l)} * F^{(l-1)} + b^{(l)}\right),$
where $W^{(l)}$ and $b^{(l)}$ denote the convolutional weights and bias at the $l$-th layer, and $\sigma(\cdot)$ is a nonlinear activation function. The resulting feature map is $F^{(l)} \in \mathbb{R}^{h_l \times w_l \times d_l}$, while the input to this layer is $F^{(l-1)} \in \mathbb{R}^{h_{l-1} \times w_{l-1} \times d_{l-1}}$, with $h_l$, $w_l$, and $d_l$ indicating the spatial dimensions and channel depth, respectively. In typical detection backbones, spatial resolution halves at each stage due to repeated stride-2 downsampling; after $l$ stages, the cumulative reduction factor can be written as
$s_l = \prod_{k=1}^{l} 2 = 2^{l},$
which yields the following relationships:
$h_l = \frac{H}{s_l}, \qquad w_l = \frac{W}{s_l}$
When the input size is $H = W = 640$, the resolution of deep feature maps drops drastically (see Table 1), resulting in a mismatch between deep-layer resolution and tiny-object scale. As illustrated in Figure 3, the newly added small object detection layer yields a feature map of size 160 × 160 after 4-fold downsampling of the 640 × 640 input. All subsequent detection layers employ 2-fold downsampling, thereby sequentially generating feature maps with dimensions of 80 × 80, 40 × 40, and 20 × 20.
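These resolution relationships can be verified with a few lines of Python (a simple sketch; `feature_map_sizes` is an illustrative helper, with each stride equal to the cumulative downsampling factor $s_l = 2^l$):

```python
def feature_map_sizes(H, W, strides=(4, 8, 16, 32)):
    """Spatial size of each detection-head feature map, where each
    stride is the cumulative downsampling factor s_l = 2**l."""
    return [(H // s, W // s) for s in strides]

# For a 640 x 640 input, the four heads (including the added shallow
# one at stride 4) see 160x160, 80x80, 40x40, and 20x20 feature maps.
print(feature_map_sizes(640, 640))
```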

3.5. Hybrid-Attention Encoder

As convolutional neural networks continue to evolve, attention mechanisms have become a key technique for enhancing feature representation in visual tasks. Among them, the Convolutional Block Attention Module (CBAM), which sequentially combines channel and spatial attention, has been widely adopted in general object detection frameworks. However, in the specific context of high-altitude remote sensing imagery—characterized by small targets, complex backgrounds, and significant scale variations—CBAM exhibits notable limitations. Primarily, its spatial attention mechanism relies on globally aggregated features, which often fail to capture the localized activations of small objects. Since small targets occupy only a sparse set of pixels in the feature map, their representational strength tends to be diluted by dominant background regions during global pooling operations. Consequently, the generated attention maps may inadequately highlight regions containing small objects, thereby diminishing detection sensitivity. Furthermore, the fixed sequential processing of channel and spatial attention may lead to mutual interference; for instance, channel-refined features could be excessively smoothed during subsequent spatial reweighting, especially when target-related activations are weak.
To address these shortcomings, we introduce variance pooling as a complementary operator to enhance CBAM’s capacity for small target detection. Unlike global pooling, variance pooling computes statistical variance within local regions, thereby quantifying the intensity of activation fluctuations at a finer spatial granularity. In remote sensing imagery, small targets such as vehicles or building edges typically correspond to high-frequency patterns with strong local contrast, resulting in higher variance values compared to homogeneous background areas. By integrating variance pooling into the channel attention branch—either as a replacement for or in parallel with classical global pooling—we can better preserve and amplify the feature responses of small targets. This adaptation allows the attention mechanism to more effectively distinguish between informative high-variance regions and redundant low-variance regions, thereby enhancing the discriminability of subtle yet critical details.
The advantages of incorporating variance pooling are threefold. First, it strengthens local discriminability by emphasizing regions with high internal variation, which often coincide with small object boundaries and textures. Second, it maintains spatial fine-grained information through localized statistical computations, mitigating the information loss caused by successive downsampling operations. Third, variance pooling exhibits greater robustness to scale variations, as it responds to relative contrast within a region rather than absolute activation magnitude, thus rendering it highly suitable for multi-scale small object detection scenarios. In our modified attention module, variance-aware features are fused with channel-wise and spatial-wise attention pathways, enabling more balanced and target-sensitive feature recalibration. Experimental results on aerial remote sensing datasets demonstrate that the proposed variance-augmented attention mechanism yields consistent improvements in recall and precision for small object detection, validating its efficacy in addressing the limitations of conventional CBAM under challenging remote sensing conditions.

3.5.1. Hybrid-Attention Encoder Channel Attention Module

As illustrated in Figure 4, the Hybrid-Attention Encoder Channel Attention Module is designed to enhance channel-wise feature representation by jointly exploiting multiple global statistical cues. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, three parallel pooling operations—global average pooling, global max pooling, and global variance pooling—are first applied along the spatial dimensions. These operations compress the feature map into three channel descriptors of size $1 \times 1 \times C$, each capturing complementary information: average pooling encodes global contextual responses, max pooling emphasizes the most salient activations, and variance pooling reflects the distribution and contrast of feature responses. The resulting descriptors are then fed into a shared multi-layer perceptron (MLP) to model inter-channel dependencies in a parameter-efficient manner. The MLP outputs corresponding channel-wise responses, which are combined through element-wise addition, and a sigmoid activation function then produces normalized channel attention weights. These weights adaptively recalibrate the input feature map by emphasizing informative channels and suppressing those with lower relevance.
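A minimal PyTorch sketch of this channel attention branch, with variance pooling used alongside average and max pooling (illustrative only; the class name and the reduction ratio of 16 are our assumptions):

```python
import torch
import torch.nn as nn

class VarianceAwareChannelAttention(nn.Module):
    """Channel attention sketch: avg/max/variance global pooling feed a
    shared MLP; the three responses are summed and passed through a sigmoid.
    Illustrative, not the authors' exact implementation."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        flat = x.flatten(2)                      # B x C x (H*W)
        p_avg = flat.mean(dim=2)                 # global average pooling
        p_max = flat.amax(dim=2)                 # global max pooling
        p_var = flat.var(dim=2, unbiased=False)  # global variance pooling
        attn = torch.sigmoid(self.mlp(p_avg) + self.mlp(p_max) + self.mlp(p_var))
        return x * attn.view(b, c, 1, 1)         # recalibrate channels

x = torch.randn(2, 64, 32, 32)
out = VarianceAwareChannelAttention(64)(x)       # same shape as input
```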

3.5.2. Hybrid-Attention Encoder Spatial Attention Module

Figure 5 depicts the structure of the Hybrid-Attention Encoder Spatial Attention Module, which is designed to enhance spatial feature representation by emphasizing informative regions. Given an input feature map F, three parallel pooling operations—max pooling, average pooling, and variance pooling—are executed along the channel dimension to capture complementary spatial information. Max pooling highlights the most salient activations, average pooling encodes overall contextual information, and variance pooling reflects the spatial distribution and contrast of features. The resulting pooled feature maps are concatenated and fed into a convolutional layer to fuse spatial information and model local dependencies. Finally, a sigmoid activation function is employed to generate a spatial attention map, which adaptively reweights the input feature map to enhance target regions and suppress irrelevant background responses.
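The spatial branch can be sketched analogously, concatenating channel-wise max/average/variance maps before a fusing convolution (illustrative; the $7 \times 7$ kernel is a common choice we assume, not stated in the text):

```python
import torch
import torch.nn as nn

class VarianceAwareSpatialAttention(nn.Module):
    """Spatial attention sketch: channel-wise max/avg/variance maps are
    concatenated and fused by a k x k convolution; a sigmoid yields the
    spatial attention map. Illustrative, not the authors' exact code."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s_max = x.amax(dim=1, keepdim=True)                  # B x 1 x H x W
        s_avg = x.mean(dim=1, keepdim=True)
        s_var = x.var(dim=1, keepdim=True, unbiased=False)
        attn = torch.sigmoid(self.conv(torch.cat([s_max, s_avg, s_var], dim=1)))
        return x * attn                                       # reweight spatially

x = torch.randn(2, 64, 32, 32)
out = VarianceAwareSpatialAttention()(x)
```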

3.5.3. Feature Modeling with HAE Module

Figure 6 illustrates the overall architecture of the proposed Hybrid-Attention Encoder. Given an input feature map, the encoder sequentially applies channel attention and spatial attention to enhance feature representation. First, the input feature map is processed by the Hybrid-Attention Encoder Channel Attention Module, which learns channel-wise importance weights and recalibrates feature responses accordingly through element-wise multiplication. The refined feature map is then fed into the Hybrid-Attention Encoder Spatial Attention Module, where spatial dependencies are modeled to emphasize informative regions while suppressing irrelevant background responses. Another element-wise multiplication is performed to generate the final output feature map. By cascading channel and spatial attention mechanisms, the Hybrid-Attention Encoder effectively captures both inter-channel relationships and spatial context, resulting in more discriminative and robust feature representations for downstream detection tasks.
The overall output of the Hybrid-Attention Encoder is computed as

$$F_{HAE} = A_s \otimes (A_c \otimes F), \qquad A_c \in \mathbb{R}^{C \times 1 \times 1}, \; A_s \in \mathbb{R}^{1 \times H \times W}$$

where $\otimes$ denotes element-wise multiplication with broadcasting.

3.6. Attention-Guided Decoder

3.6.1. Feature Modeling with AGD Module

According to Equation (12), $A_s$ in the hybrid-attention encoder is a global spatial attention: it models only the global spatial correlation over $H \times W$, without decoupling the positional information of the abscissa $x$ and the ordinate $y$, and it lacks scale-adaptive local modeling. However, the core task of the object detection head is accurate regression of the candidate box $(x_1, y_1, x_2, y_2)$ and position-sensitive classification of the object center/edge, and global spatial modeling cannot satisfy the detection head's need for fine-grained coordinate information.
The core of the attention-guided decoder lies in decoupling channel attention from coordinate position information. It performs one-dimensional pooling (along the row/column direction) on features to retain coordinate information and ultimately outputs joint channel-coordinate attention features. These features perfectly match the requirements of the detection head for location-sensitive features. The attention-guided decoder structure is shown in Figure 7.
For the feature map F R C × H × W generated by the backbone and neck networks, the attention-guided decoder performs processing in three sequential steps: coordinate information embedding, coordinate attention generation, and adaptive feature weighting, with the final output denoted as F A G D R C × H × W :
Step 1: Coordinate Information Embedding (1D row/column pooling, preserving x / y positional information)
Different from the 2D global pooling in HAE, AGD performs row-wise global average pooling $G_c^y$ and column-wise global average pooling $G_c^x$ for each channel $c$, which preserves the positional information of the ordinate $y$ and abscissa $x$, respectively:

$$G_c^y(i) = \frac{1}{W} \sum_{j=1}^{W} F_{c,i,j}, \qquad G^y \in \mathbb{R}^{C \times H \times 1}$$

$$G_c^x(j) = \frac{1}{H} \sum_{i=1}^{H} F_{c,i,j}, \qquad G^x \in \mathbb{R}^{C \times 1 \times W}$$

where $i \in [1, H]$ denotes the row index (ordinate $y$) and $j \in [1, W]$ denotes the column index (abscissa $x$).
Step 2: Coordinate Attention Generation (decoupling attention weights for x / y )
The concatenated $G^y$ and $G^x$ are dimensionally reduced via a multilayer perceptron (MLP) and then decoupled into row attention $M^y$ and column attention $M^x$, which model the positional correlation of the ordinate and abscissa, respectively:

$$Z = \delta\left(\mathrm{Conv}_1\left([G^y; G^x]\right)\right), \qquad Z \in \mathbb{R}^{(C/r) \times (H+W) \times 1}$$

$$Z^y = \sigma\left(\mathrm{Conv}_2^y(Z)\right), \qquad Z^x = \sigma\left(\mathrm{Conv}_2^x(Z)\right)$$

$$M^y = Z^y \in \mathbb{R}^{C \times H \times 1}, \qquad M^x = Z^x \in \mathbb{R}^{C \times 1 \times W}$$

where $r$ is the dimensional reduction ratio (typically set to 16), $\delta$ denotes the ReLU activation function, $\mathrm{Conv}_1$ is a $1 \times 1$ convolutional layer used for dimensionality reduction, $\mathrm{Conv}_2^y/\mathrm{Conv}_2^x$ are $1 \times 1$ convolutions for dimensionality recovery to $C$ channels, $M^y$ is the row-wise ($y$-axis) attention weight, and $M^x$ is the column-wise ($x$-axis) attention weight.
Step 3: Feature Weighting (output of channel–coordinate joint attention feature)
Element-wise multiplication is applied between the row/column attention weights and the original feature map to derive the final output feature map of the AGD:

$$F_{AGD} = F \otimes M^y \otimes M^x$$
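The three steps above can be sketched in PyTorch as follows (an illustrative implementation; the hidden-width floor of 8 and the layer names are our assumptions):

```python
import torch
import torch.nn as nn

class AttentionGuidedDecoder(nn.Module):
    """Coordinate-decoupled attention sketch following Steps 1-3:
    1D row/column average pooling, a shared 1x1 reduction convolution,
    then separate 1x1 convolutions recover row (M_y) and column (M_x)
    attention. Illustrative, not the authors' exact implementation."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, 1)   # Conv_1: reduce to C/r
        self.act = nn.ReLU(inplace=True)
        self.conv_y = nn.Conv2d(hidden, channels, 1)  # Conv_2^y: recover C
        self.conv_x = nn.Conv2d(hidden, channels, 1)  # Conv_2^x: recover C

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        g_y = f.mean(dim=3, keepdim=True)                 # B x C x H x 1 (row pooling)
        g_x = f.mean(dim=2, keepdim=True)                 # B x C x 1 x W (column pooling)
        z = torch.cat([g_y, g_x.transpose(2, 3)], dim=2)  # B x C x (H+W) x 1
        z = self.act(self.conv1(z))
        z_y, z_x = z.split([h, w], dim=2)                 # decouple y / x parts
        m_y = torch.sigmoid(self.conv_y(z_y))                  # B x C x H x 1
        m_x = torch.sigmoid(self.conv_x(z_x.transpose(2, 3)))  # B x C x 1 x W
        return f * m_y * m_x                              # F_AGD = F (x) M_y (x) M_x

f = torch.randn(2, 64, 40, 40)
out = AttentionGuidedDecoder(64)(f)
```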
By comparing the feature outputs of variance pooling, HAE, and AGD, the core mathematical reason for the detection head to adopt AGD is as follows:

$$F_{AGD} = F \otimes M^y(C, y) \otimes M^x(C, x)$$

The AGD produces position-sensitive features with channel-coordinate decoupling, where $M^y(C, y)$ denotes the attention related to channel $C$ and the ordinate $y$, and $M^x(C, x)$ denotes the attention related to channel $C$ and the abscissa $x$. Specifically, the attention weights of AGD are bivariate functions of "channel + single coordinate", which directly encode the positional information of the target.

$$F_{HAE} = F \otimes M_c(C) \otimes M_s(H, W)$$

The HAE, in contrast, produces position-agnostic features with channel-global spatial modeling, where $M_s(H, W)$ denotes attention related only to the global spatial dimensions $H \times W$; it performs no coordinate decoupling and cannot encode the fine-grained $x/y$ positional information of the target.
The bounding box regression of the object detection head is a numerical prediction of $(x_1, y_1, x_2, y_2)$. The output $F_{AGD}$ of AGD directly contains the feature encoding of the $x/y$ coordinates, which enables the regression branch of the detection head to directly learn the mapping between coordinates and features. In contrast, $F_{HAE}$ requires the detection head to additionally extract positional information from global features, which increases the learning difficulty. Therefore, AGD better matches the demands of the detection head.

3.6.2. Loss Function

The loss function employed in this paper is the sum of a classification loss $L_{cls}$ and a regression loss $L_{reg}$. For a batch of $N$ samples, each with $K$ candidate boxes (anchor-based or anchor-free), the total loss of the detection head is defined as:

$$L = \frac{1}{N_{pos}} \sum_{i=1}^{N} \sum_{j=1}^{K} \left[ \alpha\, L_{cls}(p_{ij}, t_{ij}) + \beta\, L_{reg}(b_{ij}, \hat{b}_{ij}, t_{ij}) \right]$$

where $p_{ij}$ denotes the predicted foreground probability of candidate box $(i, j)$, and $t_{ij} \in \{0, 1\}$ is the ground-truth label (1 for foreground, 0 for background); $b_{ij} = (x_1, y_1, x_2, y_2)$ is the predicted box and $\hat{b}_{ij}$ the ground-truth box; $N_{pos}$ is the number of positive samples, and $\alpha/\beta$ are loss weights; $L_{cls}$ is the Focal Loss (addressing class imbalance), and $L_{reg}$ is the GIoU Loss (addressing scale/ratio invariance of box regression).
The Focal Loss $L_{cls}$ is given by:

$$L_{cls}(p, t) = -t\,\alpha (1 - p)^{\gamma} \log p - (1 - t)(1 - \alpha)\, p^{\gamma} \log(1 - p)$$
where γ is the focusing parameter (typically set to 2), and the core idea is to reduce the weight of easy samples and focus on hard samples at the foreground/background boundary.
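This loss can be sketched directly from the formula; a minimal implementation, assuming the common default $\alpha = 0.25$ (the text specifies only the focusing parameter $\gamma$):

```python
import torch

def focal_loss(p: torch.Tensor, t: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Per-element focal loss matching the formula above; p is the
    predicted foreground probability, t the {0, 1} label.
    alpha = 0.25 is an assumed default, not stated in the text."""
    eps = 1e-7
    p = p.clamp(eps, 1.0 - eps)  # numerical stability for log
    return (-t * alpha * (1 - p) ** gamma * torch.log(p)
            - (1 - t) * (1 - alpha) * p ** gamma * torch.log(1 - p))

# A confident (easy) positive is down-weighted relative to a hard one.
easy = focal_loss(torch.tensor(0.9), torch.tensor(1.0))
hard = focal_loss(torch.tensor(0.3), torch.tensor(1.0))
```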
$M^y/M^x$ of AGD assign higher attention weights to the edge/center positions of the target, making the detection head more sensitive to the positional features of foreground targets. This drives the predicted foreground probability $p_{ij}$ closer to 1 for positive samples and closer to 0 for negative samples, ultimately reducing the Focal Loss:

$$L_{cls}^{AGD}(p, t) < L_{cls}^{HAE}(p, t)$$

Let the predicted foreground probability with AGD features be $p_{AGD}$ and that with HAE features be $p_{HAE}$. Then $|p_{AGD} - 1| < |p_{HAE} - 1|$ for positive samples, and substituting this into the Focal Loss formula yields the above inequality (for $\gamma > 0$, $(1 - p)^{\gamma}$ decreases rapidly as $p$ increases). Therefore, AGD reduces the classification loss.
The GIoU Loss $L_{reg}$ is defined as:

$$L_{reg}(b, \hat{b}) = 1 - \mathrm{GIoU}(b, \hat{b})$$

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|A_c \setminus U|}{|A_c|}$$

where IoU denotes the Intersection over Union metric, $A_c$ denotes the smallest axis-aligned bounding box enclosing both $b$ and $\hat{b}$, and $U$ denotes the union region of $b$ and $\hat{b}$.
The coordinate-decoupled features of AGD make the detection head's predictions $b_{ij}$ for $(x_1, y_1, x_2, y_2)$ closer to the ground-truth box $\hat{b}_{ij}$, i.e., $\mathrm{IoU}(b_{AGD}, \hat{b}) > \mathrm{IoU}(b_{HAE}, \hat{b})$ and $|A_c \setminus U|_{AGD} < |A_c \setminus U|_{HAE}$. This ultimately reduces the GIoU Loss:

$$L_{reg}^{AGD}(b, \hat{b}) < L_{reg}^{HAE}(b, \hat{b})$$
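The GIoU loss above can be sketched as follows for axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ form (an illustrative implementation):

```python
import torch

def giou_loss(b: torch.Tensor, b_hat: torch.Tensor) -> torch.Tensor:
    """GIoU loss per the equations above: L = 1 - (IoU - |A_c minus U| / |A_c|).
    Boxes are (x1, y1, x2, y2); broadcasting over leading dims."""
    # Intersection rectangle
    x1 = torch.maximum(b[..., 0], b_hat[..., 0])
    y1 = torch.maximum(b[..., 1], b_hat[..., 1])
    x2 = torch.minimum(b[..., 2], b_hat[..., 2])
    y2 = torch.minimum(b[..., 3], b_hat[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    area_h = (b_hat[..., 2] - b_hat[..., 0]) * (b_hat[..., 3] - b_hat[..., 1])
    union = area_b + area_h - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box A_c
    ex1 = torch.minimum(b[..., 0], b_hat[..., 0])
    ey1 = torch.minimum(b[..., 1], b_hat[..., 1])
    ex2 = torch.maximum(b[..., 2], b_hat[..., 2])
    ey2 = torch.maximum(b[..., 3], b_hat[..., 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose.clamp(min=1e-7)
    return 1.0 - giou

# Identical boxes give zero loss; disjoint boxes are penalized beyond 1.
loss_same = giou_loss(torch.tensor([0., 0., 2., 2.]), torch.tensor([0., 0., 2., 2.]))
loss_far = giou_loss(torch.tensor([0., 0., 1., 1.]), torch.tensor([3., 3., 4., 4.]))
```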

3.7. Implementation of the Attention Fusion Module

Let the input feature map be $F \in \mathbb{R}^{C \times H \times W}$. Our attention fusion module contains two lightweight paths, namely HAE and AGD, and is applied to each pyramid feature (e.g., $P_3/P_4/P_5$) independently (with the same architecture but without weight sharing across scales).
HAE (variance-aware hybrid attention). For channel attention, we compute three channel descriptors using global average pooling (GAP), global max pooling (GMP), and global variance pooling (GVP): $p_{avg}, p_{max}, p_{var} \in \mathbb{R}^{C}$. They are fed into a shared MLP with reduction ratio $r$ (two FC layers: $C \to C/r \to C$) to obtain

$$M_c = \sigma\left(\mathrm{MLP}(p_{avg}) + \mathrm{MLP}(p_{max}) + \mathrm{MLP}(p_{var})\right) \in \mathbb{R}^{C},$$

and $F_1 = M_c \otimes F$. For spatial attention, we compute channel-wise average/max/variance maps $s_{avg}, s_{max}, s_{var} \in \mathbb{R}^{1 \times H \times W}$ from $F_1$, concatenate them, and apply a $k \times k$ convolution followed by a sigmoid:

$$M_s = \sigma\left(\mathrm{Conv}_{k \times k}([s_{avg}; s_{max}; s_{var}])\right) \in \mathbb{R}^{1 \times H \times W},$$

yielding $F_2 = M_s \otimes F_1$. Note that variance pooling introduces no additional learnable parameters.
AGD (coordinate-decoupled geometric attention). We adopt decoupled 1D aggregation along height/width to obtain $g_h \in \mathbb{R}^{C \times H \times 1}$ and $g_w \in \mathbb{R}^{C \times 1 \times W}$ via average pooling, then compress them with a shared $1 \times 1$ convolution to $C'$ channels, where $C' = \max(8, C/r)$:

$$u = \delta\left(\mathrm{Conv}_{1 \times 1}([g_h; g_w])\right) \in \mathbb{R}^{C' \times (H+W) \times 1},$$

split $u$ into $u_h \in \mathbb{R}^{C' \times H \times 1}$ and $u_w \in \mathbb{R}^{C' \times 1 \times W}$, and expand back to $C$ channels to form two attention maps: $A_h = \sigma(\mathrm{Conv}_{1 \times 1}(u_h)) \in \mathbb{R}^{C \times H \times 1}$ and $A_w = \sigma(\mathrm{Conv}_{1 \times 1}(u_w)) \in \mathbb{R}^{C \times 1 \times W}$. The AGD-refined feature is $F_3 = F_2 \otimes A_h \otimes A_w$.
Fusion strategy. We fuse the two paths via element-wise residual addition:

$$F_{out} = F_2 + F_3,$$

which keeps the module lightweight and stable during training.

4. Experiments

4.1. Experiment Dataset and Evaluation Metrics

To better validate small target detection in aerial remote sensing images, we used the AI-TOD [40], DOTA [41], and RSOD [42,49] datasets, as well as our own dataset of scaled physical models captured by UAVs across five different scenarios. The AI-TOD dataset consists of 14,018 images, of which 11,214 were used for training and 2804 for testing. The 2804 test images contain 70,437 object instances; the per-category counts for the eight categories are given in Table 2. Compared with existing aerial image object detection datasets, AI-TOD has a markedly smaller average object size of approximately 14 pixels, distinguishing it from datasets with relatively larger targets.
The AI-TOD dataset exhibits a pronounced size imbalance in target distribution. As shown in Figure 8a, small targets (<32 × 32 pixels) dominate the dataset, with a representation rate of 97.96%, while medium targets ( 32 × 32 96 × 96 pixels) account for merely 2.04%. Notably, the dataset contains no instances of large targets (>96 × 96 pixels). This extreme distribution presents significant challenges for target detection algorithms, particularly in handling scale variations across different target sizes.
The target categories of the AI-TOD dataset are depicted in Figure 8b. Notably, the most prominent category is vehicles, constituting 87.78% of the total.
The DOTA dataset (A Large-scale Dataset for Object Detection in Aerial Images) is a widely used benchmark for object detection in aerial imagery, designed to support both algorithm development and performance evaluation. It contains 2806 aerial images collected from diverse sensors and platforms. Image resolutions range from approximately 800 × 800 to 4000 × 4000 pixels, and the dataset includes objects exhibiting substantial variation in scale, orientation, and shape. Every image is meticulously labeled by professional interpreters, covering 15 common object categories. In total, the DOTA dataset comprises 188,282 annotated instances, and the per-category instance counts are reported in Table 3.
The DOTA dataset demonstrates a distinct multi-scale target distribution pattern. As illustrated in Figure 9a, medium-sized targets ( 32 × 32 96 × 96 pixels) constitute the majority (53.49%), followed by small targets (< 32 × 32 pixels) at 35.75%. Notably, large targets (> 96 × 96 pixels) represent a non-negligible proportion (10.77%), indicating the dataset’s capability to support multi-scale object detection research. This balanced yet diverse size distribution poses unique challenges for algorithms in handling scale variations across different target categories.
As illustrated in Figure 9b, the target categories of the DOTA dataset are presented. Notably, the top categories include ships, accounting for 31.05% of the total, followed by small-vehicles at 18.85%, large-vehicles at 15.20%, and storage tanks at 10.01%.
The RSOD dataset [42,49] is an open dataset for object detection in remote sensing images, including four types of targets: airplanes, playgrounds, overpasses, and oil drums. The RSOD dataset contains 976 images and 6950 entities, with the number of entities in each category shown in Table 4. This dataset was released by Wuhan University in 2015.
The RSOD dataset exhibits a distinctive multi-scale target distribution characteristic, as depicted in Figure 10a. Statistical analysis reveals that medium-sized targets ( 32   × 32–96 × 96 pixels) dominate the dataset, accounting for 61.30% of the total instances. Small targets (<32 × 32 pixels) represent a minority proportion at 13.30%, while large targets (>96 × 96 pixels) maintain a significant presence, with 25.40% representation. This tri-modal size distribution demonstrates the dataset’s comprehensive coverage of scale variations, making it particularly valuable for multi-scale object detection research.
The balanced representation across different target scales presents both opportunities and challenges for detection algorithms. The coexistence of small, medium, and large targets within the same dataset requires robust scale-invariant feature learning capabilities, which is crucial for developing generalized object detection frameworks in remote sensing applications. Notably, the substantial proportion of large targets (25.40%) distinguishes RSOD from other aerial image datasets that typically focus on small objects, providing unique test conditions for evaluating scale adaptation performance.

4.2. Experimental Results and Comparative Analysis

4.2.1. Analysis of Small Target Layer Results

In small target detection, the target region is typically smaller than 32 × 32 pixels. Figure 11 shows attention heatmaps for small target detection under various detector configurations. The results indicate that detection heads designed for large targets struggle to detect small targets; it is therefore necessary to add a detection layer dedicated to small targets to improve their detection accuracy.
According to Figure 12, for small targets the expected detection effect is achieved only in the shallow 160 × 160 feature maps, while the higher-level feature maps fail to achieve it. Each row of Figure 12 contains four images: Figure 12a–d, Figure 12e–h, and Figure 12i–l show attention heatmaps of a detected target under the four detection layers. Figure 12a,e,i; Figure 12b,f,j; Figure 12c,g,k; and Figure 12d,h,l correspond to the 20 × 20, 40 × 40, 80 × 80, and 160 × 160 feature maps, respectively.

4.2.2. Analysis of Results for Adjusting Anchor Box Size and Loss Function

For small target detection in UAV high-altitude remote sensing images, the anchor box sizes were adjusted and a detection layer was added, because the scale of small targets differs substantially from that of ordinary detection targets. The detection results on feature maps under different downsampling factors were described in the previous section; Figure 13 compares the two anchor configurations. The left image (Figure 13a) shows the detection results with the original anchor box configuration, where no targets were detected. In contrast, the right image (Figure 13b) shows the results after adjusting the anchor box sizes. Evidently, optimizing the anchor box sizes significantly improves detection accuracy.

4.2.3. Analysis of Object Detection Results for Dual-Path Attention

As shown in Figure 14, Figure 14a reports the detection results obtained without the Dual-path Attention module, while Figure 14b shows the results after integrating the Dual-path Attention. Without Dual-path Attention, the detector identifies 205 vehicles; with Dual-path Attention, the number of detected vehicles increases to 300. This substantial improvement indicates that the Dual-path Attention effectively strengthens small-object detection, particularly in scenes where targets are densely clustered and easily missed.
Figure 15 visualizes the feature responses before and after introducing the Dual-path Attention. The left heatmap in Figure 15a shows the activation distribution without Dual-path Attention, whereas the right heatmap in Figure 15b corresponds to the output after Dual-path Attention is applied. As observed, Dual-path Attention suppresses responses to irrelevant regions while strengthening activations on vehicle targets, leading to more discriminative and concentrated attention over the objects of interest. This improvement stems from Dual-path Attention’s ability to embed positional information into channel attention, producing direction-aware and position-sensitive feature maps. By capturing long-range dependencies while preserving precise spatial cues, Dual-path Attention enhances target representation—particularly beneficial for small and densely distributed objects. Moreover, Dual-path Attention is lightweight and easy to integrate into compact network backbones, offering accuracy gains with negligible computational overhead. Empirically, Dual-path Attention has been shown to improve not only image classification but also dense prediction tasks such as object detection and instance segmentation.
As illustrated in Figure 16, Figure 16a shows the original aerial image. Figure 16b presents the attention heatmap produced by the baseline YOLOv7 model for the same scene, where strong activations are dispersed across the image and substantial background responses are observed, indicating notable attention noise. In contrast, Figure 16c shows the heatmap generated by the proposed MAF-Net. Here, the model concentrates its responses primarily on the target regions, while activations over irrelevant background areas are markedly suppressed, demonstrating more discriminative and focused attention.

4.2.4. Analysis of Experimental Results for MAF-Net Model

The experiments were conducted under Windows 10 with Python 3.8, PyTorch 1.7.1, and CUDA 12.3, using an NVIDIA RTX 4090 GPU and an Intel Core i9-9900K CPU. The input image size was 800 × 800, the initial learning rate was set to 0.01, and training ran for 300 epochs.
Table 5 summarizes the class-wise object detection results of the proposed MAF-Net on the challenging AI-TOD dataset. It can be observed that the proposed method achieves overall precision, recall, mAP@0.5, and mAP@0.5:0.95 values of 0.667, 0.553, 0.558, and 0.245, respectively. Among all object categories, the storage-tank class obtains the highest performance, while the wind-mill and pool classes show relatively lower results due to their limited label numbers and complex backgrounds in real-world scenarios.
The quantitative results of class-wise object detection are presented in Table 6, which reports the performance metrics, including Precision (P), Recall (R), mAP@0.5, and AP@0.5:0.95 for each object category, as well as the overall dataset performance across all classes.
As shown in the table, the overall detection performance on the dataset achieves a Precision of 0.718, a Recall of 0.474, an mAP@0.5 of 0.490, and an mAP@0.5:0.95 of 0.280. For individual categories, tennis-court demonstrates the most outstanding performance, with an mAP@0.5 of 0.921 and an mAP@0.5:0.95 of 0.773, which is attributed to its distinct geometric features and clear boundaries in aerial images. In contrast, roundabout and helicopter exhibit relatively low detection accuracy, with mAP@0.5 values of 0.103 and 0.182, respectively. This is mainly due to their small size, complex background interference, and limited number of labeled samples (179 and 73 instances, respectively), leading to insufficient model generalization. Notably, large-scale objects such as large-vehicle (mAP@0.5 = 0.764) and harbor (mAP@0.5 = 0.713) achieve high detection precision and recall, indicating that the model effectively captures their prominent visual characteristics.
Table 7 presents the category-level detection performance of MAF-Net on the RSOD dataset. The overall detection results are 0.347 for precision, 0.955 for recall, 0.349 for mAP@0.5, and 0.238 for mAP@0.5:0.95. Specifically, the aircraft and oiltank classes achieve relatively high detection accuracy, while the overpass and playground classes are affected by small sample sizes.
Ablation experiments were conducted to compare the impact of the different modules, as shown in Table 8, Table 9 and Table 10. According to the results, adding the Dual-path Attention increases YOLOv7's mAP by 24.6%; adding the small object detection layer increases mAP by 15.7%; and using both modules simultaneously increases mAP by 30.2%. This verifies the effectiveness of the two modules.
As shown in Table 11, Table 12 and Table 13, MAF-Net consistently outperforms the baseline, achieving a 6.87% gain in mAP.

4.2.5. Computational Complexity and Inference Efficiency

To comprehensively evaluate the practical applicability of MAF-Net, especially its suitability for real-time UAV deployment on edge devices, its computational complexity (parameter count and FLOPs), inference speed, and edge-deployment adaptability are systematically analyzed and compared with baseline models, as reported in Table 14.
In terms of parameter count and computational overhead, MAF-Net maintains a moderate parameter size of 39.123 M and a computational cost of 59.773 GFLOPs. This is comparable to most YOLOv7 variants (e.g., YOLOv7 Improved: 37.829 M parameters, 59.942 GFLOPs; YOLOv7-MH: 37.532 M parameters, 52.608 GFLOPs) and significantly lower than YOLOv7-UWSC (42.437 M parameters, 112.139 GFLOPs). Importantly, this moderate parameter count is critical for edge device deployment, as edge platforms (e.g., embedded GPUs, FPGAs commonly used in UAVs) typically have limited memory bandwidth and on-chip storage. MAF-Net’s 39.123 M parameters can be efficiently loaded and run on such devices without excessive memory consumption, whereas models with larger parameter sizes (e.g., YOLOv7-UWSC) may suffer from memory overflow or prolonged loading times, which are prohibitive for real-time UAV operations.
Regarding inference speed, MAF-Net delivers a frame rate of 21.277 FPS, which is significantly faster than YOLOv7-SCD (9.346 FPS), YOLOv7-UWSC (14.286 FPS), and YOLOv7 Improved (11.96 FPS). For real-time UAV remote sensing detection, a frame rate of at least 15 FPS is generally required to ensure that the system can process continuous aerial imagery and provide timely feedback for UAV navigation or target tracking. MAF-Net’s 21.277 FPS fully meets this real-time requirement, whereas the aforementioned baselines fail to reach the minimum frame rate threshold, making them unsuitable for dynamic UAV deployment. Although YOLOv7-tiny exhibits the highest inference speed (38.46 FPS) due to its lightweight structure (6.270 M parameters, 6.697 GFLOPs), it suffers from substantial performance degradation (e.g., mAP@0.5 of only 0.2892 on AI-TOD, 0.193 on DOTA, and 0.243 on RSOD, as shown in Table 11, Table 12 and Table 13), which renders it impractical for high-precision UAV detection tasks where accurate target identification is critical.
YOLOv7-MH achieves a slightly higher FPS (23.81) than MAF-Net but at the cost of lower detection accuracy (e.g., mAP@0.5 of 0.4417 on AI-TOD, 0.257 on DOTA, and 0.126 on RSOD), and its parameter count (37.532 M) is only marginally lower than MAF-Net’s. This trade-off is unfavorable for UAV edge deployment, where both real-time performance and detection accuracy are essential to support reliable decision-making.
Collectively, MAF-Net’s moderate parameter count (39.123 M) and computational cost (59.773 GFLOPs) ensure compatibility with resource-constrained UAV edge devices, while its 21.277 FPS inference speed meets the real-time requirement for UAV aerial imagery processing. Compared with baseline models, MAF-Net is the only variant that simultaneously achieves superior detection accuracy, moderate computational overhead, and real-time inference speed—key prerequisites for practical real-time UAV deployment on edge devices.
To comprehensively evaluate the performance of different algorithms for small object detection, five representative models are compared on the target dataset, with the detailed comparative results summarized in Table 15.

4.2.6. Hardware Experimental Verification

In the experiments, we employed the DJI Mini 2 UAV as the experimental platform; its physical prototype is shown in Figure 17, and detailed parameters are listed in Table 16.
The mobile control, data reception, and image processing terminal is illustrated in Figure 18, which mainly consists of the UAV controller, the data transmission and reception program, and the Core-3588SG core board.
The proposed approach is deployed and verified on the above hardware and software platform to validate its effectiveness and practicability in real scenarios.
Experimental detection results achieved by the UAV platform under real-world conditions are shown in Figure 19.

4.3. Robustness Evaluation Under Complex Conditions

4.3.1. Construction of an Experimental Dataset for Complex Conditions

To further validate the generalization capability and robustness of our proposed method, we performed cross-dataset evaluations and stability tests under varying flight altitudes, illumination conditions, and motion blur—all of which are representative challenges encountered in practical UAV aerial detection scenarios.
By utilizing the integrated software and hardware platform established in the preceding section, a DJI Mini 2 unmanned aerial vehicle (UAV) was deployed to acquire high-resolution imagery of ten distinct target categories, represented by scaled-down physical proxies including aircraft, fighter jets, helicopters, hummers, missiles, tanks, trucks, warships, and yachts. These data collection experiments were conducted across five representative environmental scenarios: indoor settings, structured outdoor grid environments, paved surfaces, water surfaces, and grassland areas. As illustrated in Figure 20, all acquired imagery was manually annotated with pixel-level precision, and the annotated data were systematically curated to form the dedicated test dataset for this study (referred to as the Indoor-Grid-Pavement-Water-Grass (IGPWG) dataset), which encompasses the five aforementioned environmental subsets.
As shown in Figure 21a–c, the target was captured at different preset altitudes in the Grass dataset; likewise, Figure 21d–f shows the target captured at different preset altitudes in the Grid dataset.
As shown in Figure 22, the target was imaged under various illumination conditions in the Indoor dataset.
As shown in Figure 23, the target was imaged under various blur conditions in the Water dataset.
As shown in Figure 24, the target was imaged under various blur conditions in the Indoor dataset.
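For controlled stress tests, blur levels similar to those illustrated in Figures 23 and 24 can also be synthesized offline. The following pure-Python sketch applies a uniform horizontal motion-blur kernel to a grayscale image stored as a list of rows; the kernel length is an illustrative free parameter, not a value taken from the experiments, and real UAV jitter produces more complex, spatially varying kernels:

```python
def motion_blur_row(row, length=5):
    """1-D horizontal motion blur: average each pixel with its neighbors.

    Border pixels use an implicit zero padding (the sum over the valid
    window is still divided by the full kernel length).
    """
    half = length // 2
    n = len(row)
    out = []
    for i in range(n):
        window = [row[j] for j in range(i - half, i + half + 1) if 0 <= j < n]
        out.append(sum(window) / length)  # zero-padded uniform average
    return out

def motion_blur_horizontal(image, length=5):
    """Apply the 1-D blur to every row of a 2-D image (list of row lists)."""
    return [motion_blur_row(row, length) for row in image]
```

Because the kernel is a plain average, interior pixels of a constant image are unchanged, which gives a quick sanity check on an implementation.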

4.3.2. Experimental Results and Analysis

Table 17 details the test dataset distribution across the five environments. The Pavement scenario contains the largest number of images (848) and the most diverse category coverage, while the Water scenario is specialized for maritime targets (warship, yacht). The Indoor and Grid scenarios have similar data scales and shared categories, and the Grass scenario features hummer and tank as the dominant entities. This design enables rigorous evaluation of model generalization across diverse conditions.
Training was performed on the Indoor dataset, and the corresponding results are presented in Table 18.
Training was performed on the Grid dataset, and the corresponding results are presented in Table 19.
Training was performed on the Pavement dataset, and the corresponding results are presented in Table 20.
Training was performed on the Water dataset, and the corresponding results are presented in Table 21.
Training was performed on the Grass dataset, and the corresponding results are presented in Table 22.
Training was performed on the IGPWG dataset, and the corresponding results are presented in Table 23.
In the aforementioned experiments, 10% of the imagery from each of the five distinct datasets was randomly partitioned into the test set, with the remaining 90% allocated to the training set. As presented in Table 18, Table 19, Table 20, Table 21, Table 22 and Table 23, the experimental findings demonstrate the robustness of the proposed model across a diverse range of environmental conditions, including varying flight altitudes, illumination intensities, and motion blur levels.
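The 90%/10% partition described above can be sketched as follows; the file names are hypothetical, and the fixed seed simply makes the split reproducible across runs:

```python
import random

def split_dataset(image_paths, test_fraction=0.1, seed=42):
    """Randomly partition image paths into (train, test) lists.

    Mirrors the 90%/10% protocol used in these experiments: shuffle once,
    then slice off the test fraction.
    """
    paths = list(image_paths)
    rng = random.Random(seed)  # seeded RNG for a reproducible partition
    rng.shuffle(paths)
    n_test = max(1, round(len(paths) * test_fraction))
    return paths[n_test:], paths[:n_test]

# Hypothetical file names; any per-environment image list works the same way.
train, test = split_dataset([f"img_{i:04d}.jpg" for i in range(500)])
```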

5. Conclusions

To address the persistent challenges of missed and false detections of small targets in high-altitude visible-light imagery, this paper proposes an enhanced MAF-Net model. By exploiting shallow feature maps, a dedicated small-object detection layer is introduced to better preserve fine-grained spatial information, thereby improving the detectability of small targets. In addition, the K-means++ algorithm is employed for anchor box clustering to better adapt anchor sizes to the characteristics of aerial datasets, improving detection efficiency. Furthermore, a coordinate attention mechanism is integrated into the detection head, significantly enhancing performance in densely populated small-target scenarios where missed detections are common.
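The anchor-clustering step summarized above can be illustrated with a simplified K-means++ routine that clusters (width, height) pairs under a 1 − IoU distance; this is a sketch of the general technique on toy data, not the exact implementation used in MAF-Net:

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes given as (w, h) pairs, both anchored at the origin."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k, n_iter=100, seed=0):
    """Cluster (w, h) boxes into k anchors using 1 - IoU as the distance."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]
    # K-means++ seeding: new centers are drawn with probability
    # proportional to the squared distance to the nearest chosen center.
    while len(centers) < k:
        d2 = [min((1.0 - iou_wh(b, c)) ** 2 for c in centers) for b in boxes]
        centers.append(rng.choices(boxes, weights=d2, k=1)[0])
    # Lloyd iterations: assign each box to its highest-IoU anchor,
    # then move each anchor to the mean (w, h) of its cluster.
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            clusters[best].append(b)
        centers = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return sorted(centers, key=lambda c: c[0] * c[1])
```

Using IoU rather than Euclidean distance keeps small boxes from being absorbed into clusters dominated by large boxes, which is the rationale for the density-adaptive anchor design.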
Extensive experiments, including ablation studies, attention mechanism analysis, and comparisons with state-of-the-art methods on multiple aerial benchmarks (AI-TOD, DOTA, and RSOD), demonstrate the effectiveness of the proposed approach. The MAF-Net model achieves notable performance gains, with improvements in mAP@0.5 of 14.1%, 11.28%, and 22.09% on the respective datasets. These results confirm the superior accuracy, robustness, and generalization capability of MAF-Net for small-target detection in high-altitude imagery.
Although MAF-Net achieves promising performance on both public benchmarks and self-constructed datasets (e.g., IGPWG, Water, Grass), its generalization capability under extreme aerial imaging conditions—such as dense fog, rain, snow, and severe motion blur induced by UAV jitter—still needs to be further verified. In these harsh scenarios, feature degradation of small objects becomes more prominent, which may result in an obvious decline in detection performance. Future research will concentrate on constructing and annotating datasets captured under extreme environmental conditions, as well as optimizing the model architecture to strengthen its anti-interference ability. Furthermore, future work will also focus on model compression and lightweight design for efficient deployment on mobile and embedded platforms to achieve a desirable trade-off between detection accuracy and inference efficiency.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics15051100/s1.

Author Contributions

Conceptualization, K.M.; methodology, K.M. and Z.Z.; software, K.M.; validation, K.M.; formal analysis, K.M. and Z.Z.; investigation, K.M.; resources, J.H.; data curation, K.M.; writing—original draft preparation, K.M.; writing—review and editing, Z.Z. and J.Z.; visualization, K.M.; supervision, J.H.; project administration, Z.Z.; funding acquisition, Z.Z. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Povlsen, P.; Bruhn, D.; Durdevic, P.; Arroyo, D.O.; Pertoldi, C. Using YOLO Object Detection to Identify Hare and Roe Deer in Thermal Aerial Video Footage—Possible Future Applications in Real-Time Automatic Drone Surveillance and Wildlife Monitoring. Drones 2024, 8, 2. [Google Scholar] [CrossRef]
  2. Khan, A.H.; Rizvi, S.T.R.; Dengel, A. Real-time Traffic Object Detection for Autonomous Driving. arXiv 2024, arXiv:2402.00128. [Google Scholar] [CrossRef]
  3. Aldahmani, A.; Ouni, B.; Lestable, T.; Debbah, M. Cyber-Security of Embedded IoTs in Smart Homes: Challenges, Requirements, Countermeasures, and Trends. IEEE Open J. Veh. Technol. 2023, 4, 281–292. [Google Scholar] [CrossRef]
  4. Guo, X.; Chen, Y.; Wang, Y. Learning-Based Robust and Secure Transmission for Reconfigurable Intelligent Surface Aided Millimeter Wave UAV Communications. IEEE Wireless Commun. Lett. 2021, 10, 1795–1799. [Google Scholar] [CrossRef]
  5. Wu, X.; Dong, J.; Bao, W.; Zou, B.; Wang, L.; Wang, H. Augmented Intelligence of Things for Emergency Vehicle Secure Trajectory Prediction and Task Offloading. IEEE Internet Things J. 2024, 11, 36030–36043. [Google Scholar] [CrossRef]
  6. He, Q.; Qu, C. Modular Landfill Remediation for AI Grid Resilience. arXiv 2025, arXiv:2512.19202. [Google Scholar] [CrossRef]
  7. Chen, Y.; Guo, X.; Zhou, G.; Jin, S.; Ng, D.W.K.; Wang, Z. Unified Far-Field and Near-Field in Holographic MIMO: A Wavenumber-Domain Perspective. IEEE Commun. Mag. 2025, 63, 30–36. [Google Scholar] [CrossRef]
  8. He, Q.; Qu, C. Waste-to-Energy-Coupled AI Data Centers: Cooling Efficiency and Grid Resilience. arXiv 2025, arXiv:2512.24683. [Google Scholar] [CrossRef]
  9. Liu, D.; Shen, Q.; Liu, J. The Health-Wealth Gradient in Labor Markets: Integrating Health, Insurance, and Social Metrics to Predict Employment Density. Computation 2026, 14, 22. [Google Scholar] [CrossRef]
  10. Wu, X.; Zhang, Y.-T.; Lai, K.-W.; Yang, M.-Z.; Yang, G.-L.; Wang, H.-H. A Novel Centralized Federated Deep Fuzzy Neural Network with Multi-Objectives Neural Architecture Search for Epistatic Detection. IEEE Trans. Fuzzy Syst. 2025, 33, 94–107. [Google Scholar] [CrossRef]
  11. Ke, Z.; Cao, Y.; Chen, Z.; Yin, Y.; He, S.; Cheng, Y. Early Warning of Cryptocurrency Reversal Risks via Multi-Source Data. Fin. Res. Lett. 2025, 85, 107890. [Google Scholar] [CrossRef]
  12. Liu, W.; Huang, T.; Zhang, P.; Ke, Z.; Min, M.; Zhao, P. High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers. arXiv 2023, arXiv:2307.13352. [Google Scholar] [CrossRef]
  13. Shen, Q.; Zhang, J. AI-Enhanced Disaster Risk Prediction with Explainable SHAP Analysis: A Multi-Class Classification Approach Using XGBoost. Res. Sq. 2025. [Google Scholar] [CrossRef]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  15. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  16. Liao, L.; Luo, L.; Su, J.; Xiao, Z.; Zou, F.; Lin, Y. Eagle-YOLO: An Eagle-Inspired YOLO for Object Detection in Unmanned Aerial Vehicles Scenarios. Mathematics 2023, 11, 2093. [Google Scholar] [CrossRef]
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2016. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2017; pp. 6517–6525. [Google Scholar]
  19. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  20. Sun, G.; Wang, S.; Xie, J. An Image Object Detection Model Based on Mixed Attention Mechanism Optimized YOLOv5. Electronics 2023, 12, 1515. [Google Scholar] [CrossRef]
  21. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  22. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
  23. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  24. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  25. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector; Springer International Publishing: Cham, Switzerland, 2015. [Google Scholar]
  26. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. Available online: https://ieeexplore.ieee.org/document/8417976 (accessed on 27 February 2026).
  27. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  28. Chen, J.; Er, M.J. Dynamic YOLO for small underwater object detection. Artif. Intell. Rev. 2024, 57, 165. [Google Scholar] [CrossRef]
  29. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  30. Han, J.; Zhang, D.; Hu, X.; Guo, L.; Ren, J.; Wu, F. Background Prior-Based Salient Object Detection via Deep Reconstruction Residual. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1309–1321. [Google Scholar]
  31. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  33. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection (Conference Paper). In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE Computer Society: Piscataway, NJ, USA, 2018; pp. 6154–6162. [Google Scholar]
  34. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks; Curran Associates Inc.: New York, NY, USA, 2016. [Google Scholar]
  35. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In IEEE Transactions on Pattern Analysis Machine Intelligence; IEEE Computer Society: Piscataway, NJ, USA, 2017. [Google Scholar]
  36. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-Aware Trident Networks for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE Computer Society: Piscataway, NJ, USA, 2019. [Google Scholar]
  37. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  38. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
  39. Zhang, J.; Peng, W.; Xiao, A.; Liu, T.; Fu, J.; Chen, J.; Yan, Z. KANs-DETR: Enhancing Detection Transformer with Kolmogorov–Arnold Networks for small object. High-Confid. Comput. 2026, 6, 100336. [Google Scholar] [CrossRef]
  40. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny Object Detection in Aerial Images; Wuhan University: Wuhan, China, 2021. [Google Scholar]
  41. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images; IEEE Computer Society: Piscataway, NJ, USA, 2018. [Google Scholar]
  42. Xiao, Z.; Liu, Q.; Tang, G.; Zhai, X. Elliptic Fourier transformation-based histograms of oriented gradients for rotationally invariant object detection in remote-sensing images. Int. J. Remote Sens. 2015, 36, 618–644. [Google Scholar] [CrossRef]
  43. Wang, S. Effectiveness of traditional augmentation methods for rebar counting using UAV imagery with Faster R-CNN and YOLOv10-based transformer architectures. Sci. Rep. 2025, 1, 33702. [Google Scholar] [CrossRef]
  44. Wang, S. Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body worn cameras with benchmark datasets. Sci. Rep. 2025, 15, 27043. [Google Scholar] [CrossRef]
  45. Ryu, J.; Kwak, D.; Choi, S. YOLOv8 with Post-Processing for Small Object Detection Enhancement. Appl. Sci. 2025, 13, 7275. [Google Scholar] [CrossRef]
  46. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision–ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar] [CrossRef]
  47. Liu, Y.F.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive Balanced Network for Multiscale Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5614914. [Google Scholar] [CrossRef]
  48. Saeed, F.; Paul, A. ISO-DeTr: A novel detection transformer for industrial small object detection. Mach. Learn. Appl. 2026, 23, 100809. [Google Scholar] [CrossRef]
  49. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  50. Yi, W.G.; Wang, B. Research on Underwater Small Target Detection Algorithm Based on Improved YOLOv7. IEEE Access 2023, 11, 66818–66827. [Google Scholar] [CrossRef]
  51. Zhang, X.; Huang, D.Q. Research on UAV Ground Target Detection Based on Improved YOLOv7. In Proceedings of the 2023 3rd International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 24–26 March 2023; pp. 28–32. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the proposed MAF-Net. A dedicated small-object detection layer is appended to the YOLOv7 head at the low-level feature map stage, and a coordinate attention mechanism is integrated into the detection head. The detection anchors are clustered using the KMeans++ algorithm to optimize small-object detection performance.
Figure 2. Block diagram of the MAF-Net architecture.
Figure 3. Schematic diagram of the receptive field.
Figure 4. Structure diagram of the hybrid-attention encoder channel attention module.
Figure 5. Structure diagram of the hybrid-attention encoder spatial attention module.
Figure 6. Model structure of Hybrid-Attention Encoder.
Figure 7. Model structure of attention-guided decoder.
Figure 8. (a) Target size distribution in the AI-TOD dataset: small targets (<32 × 32 pixels) account for 97.96% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 2.04%, and large targets (>96 × 96 pixels) are absent. (b) Quantitative distribution of object categories in the AI-TOD dataset.
Figure 9. (a) Target size distribution in the DOTA dataset: small targets (<32 × 32 pixels) account for 35.75% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 53.49%, and large targets (>96 × 96 pixels) for 10.77%. (b) Quantitative distribution of object categories in the DOTA dataset.
Figure 10. (a) Target size distribution in the RSOD dataset: small targets (<32 × 32 pixels) account for 13.30% of instances, medium targets (32 × 32 to 96 × 96 pixels) for 61.30%, and large targets (>96 × 96 pixels) for 25.40%. (b) Quantitative distribution of object categories in the RSOD dataset.
Figure 11. When the feature maps of the three detection heads in YOLOv7 are (a) 80 × 80, (b) 40 × 40, and (c) 20 × 20, respectively, only the detection head with the highest resolution can effectively detect small targets, while the other two detection heads struggle to detect them.
Figure 12. Attention heatmaps of a detected target under the four detection layers, with four images per row: (a–d), (e–h), and (i–l). Panels (a,e,i), (b,f,j), (c,g,k), and (d,h,l) show the attention heatmaps on the 20 × 20, 40 × 40, 80 × 80, and 160 × 160 feature maps, respectively.
Figure 13. The left image (a) illustrates the results of object detection using the previous anchor box configuration, where no targets were detected. In contrast, the right image (b) shows the detection results after adjusting the size of the anchor boxes. It is evident that by optimizing the size of the anchor boxes, the accuracy of object detection can be significantly improved.
Figure 14. The left image (a) shows the object detection results before applying the Dual-path Attention, while the right image (b) presents the results after applying the Dual-path Attention. In (a), 205 vehicles were detected, whereas in (b), 300 vehicles were detected. This demonstrates that the Dual-path Attention can enhance the detection accuracy of small targets in areas with a high density of such targets.
Figure 15. The left figure (a) shows the thermal map before processing through the Dual-path Attention, while the right figure (b) displays the thermal map after processing through the Dual-path Attention. It is evident that the Dual-path Attention effectively reduces the attention on non-car parts and enhances the attention on the target car.
Figure 16. Visualization of attention heatmaps. (a) Original image. (b) Heatmap produced by YOLOv7, exhibiting widespread background activation (attention noise). (c) Heatmap produced by the proposed model, where responses are concentrated on the target regions and background activations are suppressed.
Figure 17. The DJI Mini 2 unmanned aerial vehicle (UAV) employed as the physical experimental platform for the field tests.
Figure 18. The mobile control, data reception, and image processing terminal.
Figure 19. Experimental results in real-world environments.
Figure 20. Sample images of the IGPWG dataset under diverse environmental conditions.
Figure 21. Sample images of the dataset under different altitude conditions. (a) High-altitude aircraft UAV image; (b) Medium-altitude aircraft UAV image; (c) Low-altitude aircraft UAV image; (d) High-altitude helicopter UAV image; (e) Medium-altitude helicopter UAV image; (f) Low-altitude helicopter UAV image.
Figure 22. Sample images under different illumination conditions.
Figure 23. Sample images of the Water dataset under different blur conditions.
Figure 24. Sample images of the Indoor dataset under different blur conditions.
Table 1. Parameters of the different detection heads for an input image size of 640 × 640.
Feature Map Size | Downsampling Multiple | Receptive Field
20 × 20 | 32 times downsampling | 32 × 32
40 × 40 | 16 times downsampling | 16 × 16
80 × 80 | 8 times downsampling | 8 × 8
160 × 160 | 4 times downsampling | 4 × 4
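The mapping in Table 1 follows directly from the downsampling multiples: a stride-s head produces a (640/s) × (640/s) feature map, and each cell of that map corresponds to an s × s patch of the input image. A minimal sketch:

```python
def head_geometry(input_size=640, strides=(4, 8, 16, 32)):
    """Feature-map side length for each detection-head stride.

    With a 640 x 640 input, strides 4/8/16/32 yield 160/80/40/20 maps
    (matching Table 1); each cell of a stride-s map covers an s x s patch.
    """
    return {s: input_size // s for s in strides}
```

This is why the added 160 × 160 layer matters for small objects: a target smaller than 32 × 32 pixels occupies at most a single cell of the 20 × 20 head but sixty-four cells of the 160 × 160 head.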
Table 2. Number of different target types in the AI-TOD dataset.
Object Class | Train Number | Test Number | Proportion (%)
person | 14127 | 3841 | 5.00
vehicle | 248077 | 59915 | 87.78
ship | 13541 | 3791 | 4.79
airplane | 623 | 170 | 0.22
storage-tank | 5278 | 2479 | 1.87
bridge | 512 | 140 | 0.18
wind-mill | 176 | 67 | 0.06
pool | 293 | 34 | 0.10
Table 3. Number of different target types in the DOTA dataset.
Object Class | Train Number | Test Number | Proportion (%)
plane | 8055 | 2531 | 8.77
large-vehicle | 16969 | 4387 | 15.20
small-vehicle | 26126 | 5438 | 18.85
ship | 28068 | 8960 | 31.05
harbor | 5983 | 2090 | 7.24
ground-track-field | 325 | 144 | 0.50
soccer-ball-field | 326 | 153 | 0.53
tennis-court | 2367 | 760 | 2.63
baseball-diamond | 415 | 214 | 0.74
swimming-pool | 1736 | 440 | 1.52
roundabout | 399 | 179 | 0.62
basketball-court | 515 | 132 | 0.46
storage-tank | 5029 | 2888 | 10.01
bridge | 2047 | 464 | 1.61
helicopter | 630 | 73 | 0.25
Table 4. Number of different target types in the RSOD dataset.
Object Class | Image Number | Entity Number | Proportion (%)
aircraft | 446 | 4993 | 71.84
oiltank | 189 | 191 | 2.75
overpass | 176 | 180 | 2.59
playground | 165 | 1586 | 22.82
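The Proportion column in Tables 2–4 is each class's percentage share of all annotated instances in that dataset. As a quick check against Table 4 (4993 of the 6950 RSOD entities are aircraft):

```python
def class_proportion(count, total):
    """Percentage share of one class among all annotated instances."""
    return round(100.0 * count / total, 2)

# Entity counts as reported for RSOD (Table 4); the total is 6950 instances.
rsod = {"aircraft": 4993, "oiltank": 191, "overpass": 180, "playground": 1586}
total = sum(rsod.values())
```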
Table 5. Class-wise object detection results of MAF-Net on the AI-TOD dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 2804 | 70437 | 0.667 | 0.553 | 0.558 | 0.245
person | 2804 | 3841 | 0.743 | 0.264 | 0.356 | 0.118
vehicle | 2804 | 59915 | 0.766 | 0.745 | 0.754 | 0.315
ship | 2804 | 3791 | 0.779 | 0.667 | 0.722 | 0.348
airplane | 2804 | 170 | 0.802 | 0.794 | 0.826 | 0.396
storage-tank | 2804 | 2479 | 0.831 | 0.83 | 0.864 | 0.475
bridge | 2804 | 140 | 0.683 | 0.462 | 0.52 | 0.204
wind-mill | 2804 | 67 | 0.3 | 0.284 | 0.183 | 0.0394
pool | 2804 | 34 | 0.432 | 0.382 | 0.242 | 0.064
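The "all" row in these class-wise tables is consistent with an unweighted mean over the per-class values, which is the usual mAP convention. Using the mAP@0.5 column of Table 5 as an example:

```python
# Per-class mAP@0.5 values reported in Table 5 (AI-TOD).
per_class_map50 = {
    "person": 0.356, "vehicle": 0.754, "ship": 0.722, "airplane": 0.826,
    "storage-tank": 0.864, "bridge": 0.52, "wind-mill": 0.183, "pool": 0.242,
}

def overall_metric(per_class):
    """Unweighted mean over classes, matching the 'all' row convention."""
    return sum(per_class.values()) / len(per_class)
```

The mean of these eight values is 0.5584, which rounds to the reported overall mAP@0.5 of 0.558; note that rare classes (wind-mill, pool) pull this average down despite the strong vehicle performance.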
Table 6. Class-wise object detection results on the DOTA dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 458 | 28853 | 0.718 | 0.474 | 0.49 | 0.28
plane | 458 | 2531 | 0.812 | 0.727 | 0.739 | 0.461
large-vehicle | 458 | 4387 | 0.718 | 0.775 | 0.764 | 0.499
small-vehicle | 458 | 5438 | 0.579 | 0.6 | 0.569 | 0.309
ship | 458 | 8960 | 0.799 | 0.584 | 0.583 | 0.312
harbor | 458 | 2090 | 0.683 | 0.755 | 0.713 | 0.319
ground-track-field | 458 | 144 | 0.774 | 0.309 | 0.396 | 0.169
soccer-ball-field | 458 | 153 | 0.601 | 0.333 | 0.298 | 0.171
tennis-court | 458 | 760 | 0.839 | 0.909 | 0.921 | 0.773
baseball-diamond | 458 | 214 | 0.763 | 0.491 | 0.578 | 0.308
swimming-pool | 458 | 440 | 0.646 | 0.55 | 0.487 | 0.188
roundabout | 458 | 179 | 0.685 | 0.0615 | 0.103 | 0.0366
basketball-court | 458 | 132 | 0.586 | 0.402 | 0.408 | 0.289
storage-tank | 458 | 2888 | 0.68 | 0.346 | 0.357 | 0.174
bridge | 458 | 464 | 0.602 | 0.235 | 0.257 | 0.0748
helicopter | 458 | 73 | 1 | 0.0323 | 0.182 | 0.123
Table 7. Class-wise object detection results of MAF-Net on the RSOD dataset.
Class | Images | Labels | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
all | 253 | 777 | 0.347 | 0.955 | 0.349 | 0.238
aircraft | 253 | 546 | 0.33 | 0.973 | 0.337 | 0.224
oiltank | 253 | 197 | 0.372 | 0.954 | 0.373 | 0.306
overpass | 253 | 19 | 0.347 | 0.895 | 0.321 | 0.14
playground | 253 | 15 | 0.338 | 1 | 0.364 | 0.28
Table 8. Fine-grained ablation experiment results on AI-TOD dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.664 | 0.243 | 0.256 | 0.104
YOLOv7 + HAE (Average/Max Pooling Only) | 0.612 | 0.357 | 0.328 | 0.142
YOLOv7 + HAE (With Variance Pooling) | 0.635 | 0.402 | 0.360 | 0.165
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.587 | 0.396 | 0.341 | 0.151
YOLOv7 + AGD (With Coordinate Decoupling) | 0.603 | 0.448 | 0.385 | 0.179
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.648 | 0.501 | 0.433 | 0.208
YOLOv7 + 160 × 160 Detection Layer | 0.537 | 0.401 | 0.420 | 0.181
YOLOv7 + Density-adaptive Anchor | 0.628 | 0.327 | 0.301 | 0.132
YOLOv7 + Hierarchical Feature Aggregation | 0.641 | 0.315 | 0.304 | 0.135
YOLOv7 + Joint Optimization of Three Components | 0.605 | 0.473 | 0.489 | 0.215
MAF-Net (Complete Model) | 0.667 | 0.553 | 0.558 | 0.245
Table 9. Fine-grained ablation experiment results on DOTA dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.413 | 0.465 | 0.260 | 0.156
YOLOv7 + HAE (Average/Max Pooling Only) | 0.389 | 0.521 | 0.315 | 0.173
YOLOv7 + HAE (With Variance Pooling) | 0.407 | 0.558 | 0.343 | 0.189
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.376 | 0.513 | 0.302 | 0.168
YOLOv7 + AGD (With Coordinate Decoupling) | 0.398 | 0.572 | 0.338 | 0.185
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.425 | 0.614 | 0.379 | 0.207
YOLOv7 + 160 × 160 Detection Layer | 0.412 | 0.513 | 0.279 | 0.166
YOLOv7 + Density-adaptive Anchor | 0.435 | 0.498 | 0.297 | 0.169
YOLOv7 + Hierarchical Feature Aggregation | 0.429 | 0.487 | 0.291 | 0.167
YOLOv7 + Joint Optimization of Three Components | 0.468 | 0.572 | 0.336 | 0.189
MAF-Net (Complete Model) | 0.718 | 0.474 | 0.490 | 0.280
Table 10. Fine-grained ablation experiment results on RSOD dataset.
Algorithm | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
YOLOv7 (Baseline) | 0.056 | 0.142 | 0.027 | 0.017
YOLOv7 + HAE (Average/Max Pooling Only) | 0.189 | 0.423 | 0.156 | 0.098
YOLOv7 + HAE (With Variance Pooling) | 0.214 | 0.487 | 0.182 | 0.117
YOLOv7 + AGD (Without Coordinate Decoupling) | 0.197 | 0.456 | 0.163 | 0.102
YOLOv7 + AGD (With Coordinate Decoupling) | 0.226 | 0.532 | 0.195 | 0.124
YOLOv7 + HAE + AGD (Dual-path Attention) | 0.253 | 0.601 | 0.227 | 0.143
YOLOv7 + 160 × 160 Detection Layer | 0.223 | 0.612 | 0.208 | 0.131
YOLOv7 + Density-adaptive Anchor | 0.164 | 0.385 | 0.134 | 0.089
YOLOv7 + Hierarchical Feature Aggregation | 0.172 | 0.398 | 0.141 | 0.093
YOLOv7 + Joint Optimization of Three Components | 0.276 | 0.712 | 0.264 | 0.162
MAF-Net (Complete Model) | 0.347 | 0.955 | 0.349 | 0.238
Table 11. Comparison of object detection results among different models on the AI-TOD dataset.
Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
MAF-Net | 0.667 | 0.553 | 0.558 | 0.245
YOLOv7-SCD | 0.5789 | 0.4543 | 0.4677 | 0.2045
YOLOv7-UWSC | 0.741 | 0.3491 | 0.372 | 0.162
YOLOv7 Improved | 0.588 | 0.4874 | 0.4893 | 0.211
YOLOv7-tiny | 0.7602 | 0.2715 | 0.2892 | 0.1161
YOLOv7-MH | 0.6304 | 0.4299 | 0.4417 | 0.1917
Table 12. Comparison of object detection results among different models on the DOTA dataset.
Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95
MAF-Net | 0.718 | 0.474 | 0.49 | 0.28
YOLOv7-SCD | 0.6633 | 0.433 | 0.4234 | 0.2209
YOLOv7-UWSC | 0.405 | 0.51 | 0.281 | 0.166
YOLOv7 Improved | 0.472 | 0.625 | 0.32 | 0.264
YOLOv7-tiny | 0.381 | 0.35 | 0.193 | 0.094
YOLOv7-MH | 0.406 | 0.463 | 0.257 | 0.146
Table 13. Comparison of object detection results among different models in RSOD dataset.

| Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| MAF-Net | 0.347 | 0.955 | 0.349 | 0.238 |
| YOLOv7-SCD | 0.305 | 0.829 | 0.311 | 0.21 |
| YOLOv7-UWSC | 0.0396 | 0.0962 | 0.0205 | 0.0129 |
| YOLOv7 Improved | 0.582 | 0.224 | 0.0734 | 0.0121 |
| YOLOv7-tiny | 0.248 | 0.695 | 0.243 | 0.147 |
| YOLOv7-MH | 0.149 | 0.5 | 0.126 | 0.0779 |
Table 14. Comparison of model parameters, computational complexity and inference speed.

| Model Name | Parameters (M) | FLOPs (G) | FPS |
|---|---|---|---|
| MAF-Net | 39.123 | 59.773 | 21.277 |
| YOLOv7-SCD | 41.564 | 49.453 | 9.346 |
| YOLOv7-UWSC | 42.437 | 112.139 | 14.286 |
| YOLOv7 Improved | 37.829 | 59.942 | 11.96 |
| YOLOv7-tiny | 6.270 | 6.697 | 38.46 |
| YOLOv7-MH | 37.532 | 52.608 | 23.81 |
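FPS figures such as those in Table 14 are typically obtained by timing repeated forward passes after a warm-up phase. A generic timing harness along those lines is sketched below; the `infer` callable, warm-up count, and repeat count are illustrative assumptions rather than the authors' benchmarking code.

```python
import time

def measure_fps(infer, frames, warmup=5, repeats=3):
    """Estimate end-to-end frames per second for a detector forward pass.

    `infer` is any callable taking one frame. Warm-up iterations are
    excluded so one-off initialization cost does not skew the estimate.
    """
    for frame in frames[:warmup]:
        infer(frame)
    processed = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for frame in frames:
            infer(frame)
            processed += 1
    elapsed = time.perf_counter() - start
    return processed / elapsed
```

On GPU backends an explicit synchronization call would be needed after each pass before reading the clock, otherwise asynchronous kernel launches make the measured time optimistic.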
Table 15. Detection accuracy comparison of various models for small object detection on AI-TOD dataset.

| Model Name | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| MAF-Net | 0.667 | 0.553 | 0.558 | 0.245 |
| YOLOv7-UWSC [50] | 0.741 | 0.3491 | 0.372 | 0.162 |
| YOLOv7-tiny [51] | 0.7602 | 0.2715 | 0.2892 | 0.1161 |
| KANs-DETR [39] | 0.638 | 0.412 | 0.427 | 0.186 |
| YOLO-CC [45] | 0.605 | 0.437 | 0.443 | 0.197 |
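The mAP@0.5 values compared throughout these tables average a per-class AP, i.e., the area under each class's precision-recall curve with detections matched to ground truth at an IoU threshold of 0.5. Since the exact evaluation code is not shown, the sketch below assumes the common monotone-interpolation convention for computing one class's AP from its PR curve.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve with precision made non-increasing via a
    right-to-left running max, the usual convention behind mAP@0.5."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # interpolate precision
    idx = np.flatnonzero(r[1:] != r[:-1])     # points where recall increases
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP@0.5 is then the mean of this quantity over all classes; mAP@0.5:0.95 additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05.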
Table 16. Detailed specifications of the DJI Mavic Mini 2 UAV.

| Item | Specification |
|---|---|
| Folded dimensions (without propellers) | 138 × 81 × 58 mm |
| Unfolded dimensions (without propellers) | 159 × 203 × 56 mm |
| Diagonal wheelbase | 213 mm |
| Maximum horizontal flight speed (near sea level, no wind) | 16 m/s (Sport mode), 10 m/s (Normal mode), 6 m/s (Cine mode) |
| Maximum ascent speed | 6 m/s (Normal mode), 8 m/s (Sport mode) |
| Maximum descent speed | 6 m/s |
| Maximum hover time | 38 min |
| Maximum flight time | 45 min |
| Battery capacity | 5000 mAh |
| Gimbal pitch range | −135° to 45° |
| Gimbal roll range | −45° to 45° |
| Gimbal yaw range | −27° to 27° |
Table 17. Detailed distribution of the test dataset across different groups.

| Category | Indoor | Grid | Pavement | Water | Grass |
|---|---|---|---|---|---|
| Images | 208 | 203 | 848 | 367 | 154 |
| AEW Aircraft | 14 | 49 | 155 | 0 | 17 |
| Aircraft | 27 | 15 | 128 | 0 | 27 |
| Fighter | 18 | 31 | 173 | 0 | 17 |
| Helicopter | 54 | 58 | 184 | 0 | 22 |
| Hummer | 19 | 59 | 226 | 0 | 50 |
| Missile | 40 | 46 | 127 | 0 | 0 |
| Tank | 42 | 36 | 199 | 0 | 49 |
| Truck | 10 | 5 | 394 | 0 | 5 |
| Warship | 0 | 0 | 0 | 339 | 0 |
| Yacht | 0 | 0 | 0 | 227 | 0 |
Table 18. Class-wise object detection performance on the Indoor dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 26 | 28 | 0.679 | 0.9 | 0.859 | 0.742 |
| aew | 26 | 6 | 0.888 | 1 | 0.995 | 0.85 |
| aircraft | 26 | 2 | 0.179 | 1 | 0.995 | 0.995 |
| fighter | 26 | 6 | 0.885 | 1 | 0.995 | 0.826 |
| helicopter | 26 | 4 | 0.992 | 0.25 | 0.579 | 0.457 |
| hummer | 26 | 1 | 0.547 | 1 | 0.995 | 0.896 |
| missile | 26 | 3 | 0.362 | 0.95 | 0.456 | 0.328 |
| tank | 26 | 1 | 0.841 | 1 | 0.995 | 0.896 |
| truck | 26 | 5 | 0.734 | 1 | 0.862 | 0.686 |
Table 19. Class-wise object detection performance on the Grid dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 20 | 35 | 0.745 | 0.648 | 0.711 | 0.542 |
| aew | 20 | 3 | 0.767 | 1 | 0.995 | 0.807 |
| aircraft | 20 | 1 | 1 | 0 | 0.199 | 0.139 |
| fighter | 20 | 4 | 0.931 | 0.25 | 0.459 | 0.34 |
| helicopter | 20 | 5 | 0.866 | 1 | 0.995 | 0.699 |
| hummer | 20 | 7 | 0.649 | 1 | 0.889 | 0.619 |
| missile | 20 | 5 | 0.558 | 0.6 | 0.662 | 0.526 |
| tank | 20 | 6 | 0.498 | 0.332 | 0.491 | 0.407 |
| truck | 20 | 4 | 0.693 | 1 | 0.995 | 0.796 |
Table 20. Class-wise object detection performance on the Pavement dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 84 | 123 | 0.974 | 0.989 | 0.994 | 0.791 |
| aew | 84 | 12 | 0.916 | 1 | 0.995 | 0.769 |
| aircraft | 84 | 13 | 0.953 | 1 | 0.995 | 0.782 |
| fighter | 84 | 14 | 0.983 | 1 | 0.995 | 0.816 |
| helicopter | 84 | 20 | 0.983 | 1 | 0.995 | 0.822 |
| hummer | 84 | 20 | 0.985 | 1 | 0.995 | 0.809 |
| missile | 84 | 14 | 0.99 | 1 | 0.995 | 0.792 |
| tank | 84 | 19 | 0.982 | 1 | 0.995 | 0.873 |
| truck | 84 | 11 | 1 | 0.909 | 0.988 | 0.664 |
Table 21. Class-wise object detection performance on the Water dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 36 | 60 | 0.992 | 0.985 | 0.995 | 0.632 |
| warship | 36 | 33 | 1 | 0.97 | 0.995 | 0.73 |
| yacht | 36 | 27 | 0.984 | 1 | 0.995 | 0.533 |
Table 22. Class-wise object detection performance on the Grass dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 15 | 19 | 0.857 | 0.75 | 0.829 | 0.603 |
| aew | 15 | 2 | 0.993 | 0.5 | 0.828 | 0.679 |
| aircraft | 15 | 2 | 0.695 | 1 | 0.995 | 0.721 |
| helicopter | 15 | 4 | 0.719 | 1 | 0.995 | 0.821 |
| hummer | 15 | 5 | 0.845 | 1 | 0.995 | 0.706 |
| tank | 15 | 5 | 0.888 | 1 | 0.995 | 0.577 |
| truck | 15 | 1 | 1 | 0 | 0.166 | 0.116 |
Table 23. Class-wise object detection performance on the IGPWG dataset.

| Class | Images | Labels | P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|
| all | 178 | 262 | 0.963 | 0.987 | 0.99 | 0.766 |
| aew | 178 | 20 | 0.854 | 1 | 0.995 | 0.8 |
| aircraft | 178 | 14 | 0.973 | 1 | 0.995 | 0.828 |
| fighter | 178 | 27 | 0.963 | 0.969 | 0.994 | 0.806 |
| helicopter | 178 | 39 | 0.998 | 1 | 0.995 | 0.787 |
| hummer | 178 | 35 | 0.99 | 1 | 0.995 | 0.792 |
| missile | 178 | 24 | 0.953 | 1 | 0.995 | 0.814 |
| tank | 178 | 33 | 0.97 | 1 | 0.995 | 0.829 |
| truck | 178 | 18 | 0.976 | 0.944 | 0.99 | 0.741 |
| warship | 178 | 31 | 0.996 | 1 | 0.996 | 0.739 |
| yacht | 178 | 21 | 0.952 | 0.953 | 0.952 | 0.525 |

Share and Cite

Ma, K.; Zhang, Z.; Zhang, J.; Huang, J. AI-Native Multi-Scale Attention Fusion for Ubiquitous Aerial Sensing: Small Object Detection in UAV Imagery. Electronics 2026, 15, 1100. https://doi.org/10.3390/electronics15051100
