DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images

Liu, Xiangyu; Zhou, Shenbo; Ma, Jianbo; Sun, Yumei; Zhang, Jianlin; Zuo, Haorui

doi:10.3390/rs17203476

Open AccessArticle

DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images

by

Xiangyu Liu

^1,2

,

Shenbo Zhou

^1,2

,

Jianbo Ma

^1,2,

Yumei Sun

^1,2,

Jianlin Zhang

^1,2

and

Haorui Zuo

^1,2,*

¹

State Key Laboratory of Optical Field Manipulation Science and Technology, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China

²

University of Chinese Academy of Sciences, Beijing 100049, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(20), 3476; https://doi.org/10.3390/rs17203476

Submission received: 21 August 2025 / Revised: 3 October 2025 / Accepted: 13 October 2025 / Published: 18 October 2025

Download

Browse Figures

Versions Notes

Abstract

Highlights

What are the main findings?

Improved upsampling strategy to alleviate feature misalignment in multi-scale feature fusion.
Refined downsampling strategy to preserve fine-grained semantic details for small-object localization.

What are the implications of the main findings?

Enhanced accuracy and robustness of small-object detection in remote sensing imagery.
Lightweight design with fewer parameters suitable for UAV-based monitoring under complex backgrounds.

Abstract

In remote sensing imagery, detecting small objects is challenging due to the limited representational ability of feature maps when resolution changes. This issue is mainly reflected in two aspects: (1) upsampling causes feature shifts, making feature fusion difficult to align; (2) downsampling leads to the loss of details. Although recent advances in object detection have been remarkable, small-object detection remains unresolved. In this paper, we propose Dual Feature-Aware Sampling YOLO (DFAS-YOLO) to address these issues. First, the Soft-Aware Adaptive Fusion (SAAF) module corrects upsampling by applying adaptive weighting and spatial attention, thereby reducing errors caused by feature shifts. Second, the Global Dense Local Aggregation (GDLA) module employs parallel convolution, max pooling, and average pooling with channel aggregation, combining their strengths to preserve details after downsampling. Furthermore, the detection head is redesigned based on object characteristics in remote sensing imagery. Extensive experiments on the VisDrone2019 and HIT-UAV datasets demonstrate that DFAS-YOLO achieves competitive detection accuracy compared with recent models.

Keywords:

small-object detection; feature fusion; remote sensing image; feature enhancement; convolutional neural network

1. Introduction

With the rapid advancement of remote sensing technology, object detection has become an important approach for extracting valuable information from aerial and satellite imagery. In particular, Unmanned Aerial Vehicle (UAV) imagery plays a crucial role due to its flexibility, low cost, and fine spatial resolution. Remote sensing object detection refers to the task of identifying and localizing specific targets within images, and UAV-based data have become an essential source for this task. Among these tasks, small-object detection plays a critical role in applications such as urban monitoring, traffic surveillance, and environmental assessment [1,2,3]. Due to the limited resolution and quality of remote sensing images, target objects often appear as small instances (typically smaller than 32 × 32 pixels) [4,5,6], with dim features, low contrast, and insufficient information. These characteristics pose significant challenges for accurate object detection. Moreover, remote sensing systems often operate under uncontrollable observation conditions. Various factors, such as platform motion, atmospheric disturbances, and complex scene layouts, can interfere with the imaging process [7,8,9,10]. This often results in aliasing between the target and the background, making small objects difficult to distinguish.

In general, the challenges associated with small-object detection in remote sensing can be categorized into two main aspects: insufficient feature representation, and background confusion [11,12,13,14,15,16]. The key to alleviating the problems of insufficient feature representation and background confusion lies in feature enhancement and fusion [17]. For example, Feature Pyramid Network (FPN) [18,19,20] improves multi-scale representation by integrating low-level spatial details with high-level semantic context. Path Aggregation Network (PANet) [21] further strengthens this process through bottom–up path augmentation, enhancing the propagation of localization cues. Neural Architecture Search Feature Pyramid Network (NAS-FPN) [22] and Bidirectional Feature Pyramid Network (BiFPN) [23] explore automated and weighted fusion mechanisms to adaptively balance contributions from different feature levels.

Although existing methods have achieved significant progress, most approaches still rely on conventional upsampling and downsampling operations, which lead to feature misalignment and loss of critical details. Both feature misalignment and detail loss are essential factors for the precise localization of small objects in remote sensing imagery. Feature misalignment and detail loss directly reduce the accuracy of small-object detection. Standard upsampling techniques, such as nearest-neighbor or bilinear interpolation, often cause spatial and semantic misalignment [24,25,26,27]. At the same time, repeated downsampling gradually removes fine-grained details that are critical for accurately locating small objects [24,28]. These limitations are particularly detrimental in feature enhancement and fusion, where precise spatial alignment and preservation of semantic cues are essential. The challenges of spatial and semantic misalignment during upsampling and multi-scale fusion, along with the loss of fine-grained semantic information during downsampling, remain insufficiently addressed.

To address the above challenges, a novel lightweight detector called DFAS-YOLO is designed here, which enhances both feature quality and fusion reliability with minimal computational overhead. Two modules are incorporated: the Soft-Aware Adaptive Fusion (SAAF) module and the Global Dense Local Aggregation (GDLA) module. SAAF improves the precision of multi-scale feature fusion by addressing the spatial misalignment problem introduced by traditional upsampling. SAAF introduces a learnable soft scaling factor and a spatial attention. These two mechanisms jointly allow the network to adaptively calibrate upsampled features and focus on informative regions, thus preserving semantic consistency and spatial integrity during feature fusion. GDLA is designed to preserve semantic completeness during downsampling. GDLA consists of three parallel branches: global average pooling to capture global context, max pooling to enhance local saliency, and a standard convolution to retain edge-level structural details. As the deep network performs downsampling, the regions corresponding to small objects become smaller, but a larger receptive field is obtained and information is compressed, which is crucial for enriching the semantic information needed for accurate localization and classification. The aggregated features are further refined via an attention-based fusion mechanism. This joint modeling not only recovers lost semantic information but also suppresses background interference, particularly benefiting small-object detection in cluttered remote sensing scenes. To further optimize model performance, a shallow detection head tailored for small-object localization is introduced, and the deep branches are simplified to reduce the computational load. Additionally, a scale-aware localization loss function, WIOU [29], is adopted to enhance sensitivity to small targets and dense object distributions in complex remote sensing environments [30]. The main contributions of this work are summarized as follows:

A Soft-Aware Adaptive Fusion (SAAF) module is proposed, which alleviates spatial and semantic misalignment during upsampling by using attention-guided soft scaling.
A Global Dense Local Aggregation (GDLA) module is proposed, which compensates for information loss during downsampling by aggregating global context, local saliency, and edge structures.
The detection head is redesigned and WIOU loss is adopted to improve sensitivity to small targets and dense object distributions in complex remote sensing environments.
Extensive experiments on UAV datasets show that the proposed DFAS-YOLO achieves competitive detection accuracy and inference efficiency compared to other lightweight detectors under challenging remote sensing conditions.

2. Related Works

This section reviews related works on object detection, focusing on the development of detection methods, feature enhancement and fusion, and YOLO-based approaches in remote sensing.

2.1. Development of Object Detection

Object detection aims to identify and localize target objects within an image. Early approaches relied on handcrafted features and traditional classifiers, such as HOG [31], SIFT [32], and Haar-like features combined with SVM [33] or AdaBoost [34], typically using sliding-window searches and multi-scale pyramids [1]. Although these methods achieved certain success, they suffered from limited robustness and high computational cost. With the advent of deep learning, convolutional neural networks became the dominant paradigm. Two-stage detectors, starting from R-CNN [35] and evolving into Fast R-CNN [36], Faster R-CNN [37], and Mask R-CNN [38], achieved high accuracy through region proposal and refinement. One-stage detectors, represented by YOLO [39], SSD [40], and RetinaNet [41], improved inference speed by directly predicting object categories and bounding boxes from dense feature maps without explicit proposals.

In recent years, object detection has been further advanced by transformer-based architectures, such as DETR [42], Deformable DETR [43], UAV-DETR [44], and DINO-DETR [45], which introduce global attention mechanisms for end-to-end detection. Hybrid frameworks, such as RT-DETR [46], leverage the strengths of both CNNs and attention mechanisms to improve speed and accuracy. For remote sensing and UAV imagery, many works adapt general-purpose detectors—such as YOLOv5 [47], YOLOv8 [48], and Faster R-CNN—with enhanced feature fusion strategies to address small-object challenges. Meanwhile, lightweight models like PP-PicoDet [49] and EfficientDet are increasingly adopted for onboard deployment under strict computational and energy constraints.

2.2. Feature Enhancement and Fusion of Object Detection

Small-object detection often suffers from insufficient feature representation, severe scale imbalance, and background interference. Feature enhancement and fusion strategies aim to address these problems by improving multi-scale feature utilization and refining spatial details. Early approaches such as the Feature Pyramid Network (FPN), Path Aggregation Network (PANet), and Bidirectional Feature Pyramid Network (BiFPN) enhance detection performance by constructing multi-scale hierarchies and improving top–down and bottom–up information flow. Building on these foundations, advanced approaches have been proposed to further optimize feature representation and fusion. The Asymptotic Feature Pyramid Network (AFPN) [50] enables direct interactions between non-adjacent layers and employs adaptive spatial fusion to mitigate semantic gaps across scales. The upgraded CARAFE++ [51] introduces a more efficient content-aware reassembly mechanism, which adaptively generates upsampled features with lower computational cost and better structural alignment. Adaptive Spatial Feature Fusion (ASFF) [52] learns optimal spatial-level weights to combine features from different scales, thereby reducing redundancy and improving localization precision. The Receptive Field Block (RFB) [53] enhances context modeling by simulating the multi-receptive-field behavior of the human visual system, which is particularly beneficial for detecting small or elongated objects. In addition, an adaptive multi-level feature fusion module with an attention-augmented high-resolution head (AMFFM+AAHRH) has been proposed to suppress false alarms and enhance small-object perception, achieving state-of-the-art results on multiple remote sensing datasets [54]. An anchor-free detector, FE-CenterNet, integrates contextual aggregation and coordinate attention to improve small-object detection in complex scenes, achieving superior performance over mainstream detectors [55]. Despite these advancements, most methods still rely on conventional upsampling and downsampling operations, which inevitably introduce feature misalignment and cause the loss of critical details. Both issues hinder precise localization and significantly degrade the accuracy of small-object detection in complex remote sensing imagery.

2.3. YOLO-Based Detection in Remote Sensing

Due to its balance between accuracy and efficiency, the YOLO family has become one of the most widely used one-stage frameworks in remote sensing object detection, especially under real-time and resource-constrained conditions. However, the vanilla YOLO models still face challenges when dealing with small targets and complex backgrounds, which have motivated a series of customized improvements.

Several variants have been proposed, such as FFCA-YOLO [7], DPH-YOLOv8 [56], and YoloOW [57]. FFCA-YOLO introduces three lightweight modules—the Feature Enhancement Module (FEM), the Feature Fusion Module (FFM), and the Spatial Context-Aware Module (SCAM)—to strengthen both local details and global semantics while maintaining efficiency. DPH-YOLOv8 extends YOLOv8 with a high-resolution prediction head for tiny objects, together with TP-Fusion and coordinate attention modules, thereby enhancing small-object detection and reducing background interference with fewer parameters. YoloOW adopts a scale-adaptive design and leverages multi-resolution feature representations, optimized for open-water rescue tasks to mitigate missed detections in cluttered environments. CM-YOLO [58] focuses on cloud and mist scenarios, introducing a component-decoupling-based background suppression (CDBS) module and a local–global semantic joint mining (LGSJM) module to enhance target–background contrast and semantic perception, thereby achieving accurate detection in adverse weather conditions. MSTA-YOLO [59] targets landslide detection, enhancing multi-scale feature extraction with receptive field attention and refining small-scale localization with normalized Wasserstein distance. YOLO-MS [60] enhances multi-scale feature representation via multi-branch convolution, achieving superior real-time detection performance.

Motivated by these advances, the YOLO architecture is adopted as the foundation of the proposed method, and specialized modules are incorporated to further enhance small-object feature representation and suppress background interference.

3. Methods

3.1. Overview

YOLOv8 is adopted as the baseline framework due to its lower computational cost and fewer parameters compared to the latest versions, such as YOLOv12 and YOLOv13. Despite its lightweight nature, YOLOv8 maintains high detection accuracy in remote sensing tasks, making it well suited for practical engineering deployment. As illustrated in Figure 1, the overall architecture of YOLOv8 consists of three main components: a backbone for hierarchical feature extraction, a neck for feature fusion, and a detection head for multi-scale object prediction.

In DFAS-YOLO, the original YOLOv8 neck is redesigned to better preserve spatial details and enhance multi-scale feature interaction. Specifically, the standard upsampling and downsampling operators are replaced with two tailored modules: Soft-Aware Adaptive Fusion (SAAF) and Global Dense Local Aggregation (GDLA). SAAF is employed in the upsampling path to combine learnable scaling factors with spatial attention, reducing misalignment and resolution inconsistency during feature fusion. GDLA is integrated into the downsampling path, where multi-branch pooling captures global, dense, and local context, and attention weighting strengthens semantic retention under compression. These modifications aim to deliver more accurate and robust feature representations for small-object detection in complex remote sensing scenes.

In addition, considering the high proportion of small objects in remote sensing imagery, we redesign the detection head structure of YOLOv8. Specifically, a shallow P2 detection head is introduced to perform predictions on higher-resolution feature maps, thereby improving the detection of small targets. At the same time, the deep P5 detection head is removed to reduce computational overhead associated with large-object detection, further enhancing the overall inference efficiency. The structure of our proposed DFAS-YOLO is also shown in Figure 2.

Moreover, to further improve detection accuracy, especially in challenging UAV scenarios, DFAS-YOLO adopts a refined loss design. The classification and objectness branches follow the standard Binary Cross-Entropy (BCE) loss used in YOLOv8, while the localization branch introduces the WIOU (Wise-IOU) loss to replace the original CIOU. WIOU dynamically adjusts gradient contributions based on box quality, thus facilitating more accurate regression of small and overlapping targets.

In summary, the above structural modifications not only improve the network’s feature representation capability but also significantly enhance its performance in detecting objects in complex remote sensing scenarios. Meanwhile, the model maintains a low computational burden, demonstrating strong potential for real-world deployment.

3.2. Soft-Aware Adaptive Fusion (SAAF)

In multi-scale feature fusion, upsampling operations are commonly used to align high-level semantic features with low-level spatial features. However, traditional methods such as nearest-neighbor and bilinear interpolation, though computationally efficient, lack content-awareness. This often leads to feature misalignment and semantic inconsistency, which significantly degrade detection performance, especially for small objects in complex remote sensing scenes.

SAAF maintains the efficiency of nearest-neighbor interpolation while introducing a learnable scaling factor and a spatial attention mechanism. These two mechanisms enable the network to adaptively recalibrate upsampled features and enhance the precision of multi-scale fusion. The architecture of SAAF is illustrated in Figure 3.

Specifically, given an input feature map

F \in R^{C \times H \times W}

, we first apply nearest-neighbor interpolation to obtain an upsampled feature map. A learnable scalar factor

α \in (0, 1)

is then applied to modulate the magnitude of the upsampled features, allowing the network to dynamically balance the contribution of high-resolution semantic features during fusion. This soft scaling mechanism serves as a cost-free yet effective strategy to mitigate representation conflicts across scales.

Following scaling, the feature map is further enhanced by a lightweight spatial attention subnetwork consisting of two successive

1 \times 1

convolutional layers with ReLU and sigmoid activations. The final attention map acts as a spatial mask to emphasize informative regions and suppress redundant responses.

The computation of SAAF can be formulated as follows:

F^{'} = α \cdot Upsample (F) \otimes σ (ϕ (F)),

(1)

where

Upsample (\cdot)

denotes nearest-neighbor interpolation,

ϕ (\cdot)

is the spatial attention subnetwork,

σ (\cdot)

represents the sigmoid activation function, and ⊗ indicates element-wise multiplication.

The scalar

α

is initialized as

\frac{1}{s^{2}}

, where s is the upsampling factor (e.g.,

s = 2

), and is implemented as a learnable parameter during training. This means that

α

is optimized together with other network weights via backpropagation, enabling the model to automatically adjust the intensity of upsampled features in response to task-specific requirements.

The detailed computation process of the SAAF module is outlined in Algorithm 1.

Algorithm 1: Computation pipeline of Soft-Aware Adaptive Fusion (SAAF)

Input:

•: Input feature map: $F \in R^{C \times H \times W}$ ;
•: Upsample factor: s (e.g., $s = 2$ ).

Output:

•: Output feature map: $F^{'}$ .

₁: Initialization:;
₂: Define learnable scalar $α = \frac{1}{s^{2}}$ , where $α$ is a trainable parameter if
learnable_alpha=True;
₃: Define spatial attention network $ϕ (\cdot)$ as two $1 \times 1$ conv layers with ReLU and sigmoid;
// Step 1: Nearest-neighbor upsampling
₄: $F_{u p} \leftarrow Upsample (F)$ using scale factor s;
// Step 2: Apply learnable soft scaling
₅: $F_{s c a l e d} \leftarrow α \cdot F_{u p}$ ;
// Step 3: Spatial attention computation
₆: $M \leftarrow σ (ϕ (F_{s c a l e d}))$ ; // sigmoid after two $1 \times 1$ convs with ReLU
// Step 4: Attention-based modulation
₇: $F^{'} \leftarrow F_{s c a l e d} \otimes M$ ; // element-wise multiplication
₈: Return: $F^{'}$ .

SAAF improves the semantic quality of upsampled features with minimal cost. It also enhances multi-scale alignment and, importantly, increases the accuracy of small-object detection in remote sensing images.

3.3. Global Dense Local Aggregation (GDLA) Module

In the feature fusion stage, feature downsampling is a crucial step that progressively reduces the spatial resolution of feature maps, allowing models to extract high-level semantic information. However, excessive downsampling often results in the loss of fine spatial details, which is particularly detrimental for small-object detection in remote sensing imagery. To address this issue, the Global Dense Local Aggregation (GDLA) module is proposed to enhance downsampled feature representation by integrating global context, local saliency, and structural cues. The structure of the GDLA module is illustrated in Figure 4.

The GDLA module consists of three parallel branches. The first branch captures global context via global average pooling, enabling long-range dependency modeling. The second branch emphasizes local discriminative features through max pooling, which enhances salient local cues and benefits small-object perception. The third branch applies a standard convolution to preserve fine-grained textures and local structures. Given the input feature map

F \in R^{C \times H \times W}

, the outputs of the three branches are defined as follows:

F_{global} = AvgPool (F)

(2)

F_{local} = MaxPool (F)

(3)

F_{dense} = {Conv}_{3 \times 3} (F)

(4)

These features are then concatenated along the channel dimension and passed through an attention-based refinement module to reweight their importance adaptively. To fuse the outputs of the three GDLA branches effectively, we adopt the Efficient Multi-Scale Attention (EMA) mechanism, as illustrated in Figure 5.

F_{fused} = EMA (Concat (F_{global}, F_{local}, F_{dense}))

(5)

EMA splits the input feature map into several groups along the channel dimension, enabling more fine-grained attention computation. For each group, two adaptive pooling operations are applied: XAvgPool (pool along the width) and YAvgPool (pool along the height) to capture long-range horizontal and vertical dependencies, respectively. The pooled features are concatenated and passed through a 1 × 1 convolution, producing a fused attention map for each spatial position. A subsequent GroupNorm and 3 × 3 convolution further refine the attention weights, which are combined via channel-wise softmax operations to generate final multiplicative weights for each feature group. This process allows EMA to adaptively emphasize informative regions while suppressing redundant responses across both spatial dimensions and channel groups.

The rationale for selecting EMA over conventional attention mechanisms such as SE or CBAM is that it explicitly models multi-scale dependencies along both spatial axes and across channel groups. This is particularly beneficial for remote sensing images, where objects can vary in shape and orientation and fine spatial details are critical for accurate detection. Ablation studies demonstrate that EMA outperforms SE and CBAM within the GDLA module, validating its effectiveness in enhancing feature representation for precise object localization.

To integrate these features efficiently and reduce dimensionality, we apply a

1 \times 1

convolution after attention to obtain the final output:

F^{'} = {Conv}_{1 \times 1} (F_{fused})

(6)

where

{Conv}_{1 \times 1}

denotes a pointwise convolution that not only compresses channel dimensions but also facilitates nonlinear feature interaction and fusion. This final step plays a critical role in enabling the network to reuse complementary information from different branches while preserving discriminative cues necessary for precise object localization and classification.

Overall, GDLA enhances feature expressiveness under resolution reduction by adaptively fusing multi-scale contextual information, which mitigates the limitations of conventional downsampling and contributes to more robust representation learning.

3.4. Detection Head Reconfiguration for Small-Object Emphasis

In remote sensing imagery, small objects often dominate the scene, exhibiting low resolution and weak semantic cues. The default detection head configuration in YOLOv8, which includes three prediction layers (P3, P4, and P5), is suboptimal for such scenarios. In particular, the deepest head P5, operating on

20 \times 20

resolution feature maps, primarily targets large-scale objects, which are less prevalent in remote sensing datasets. To adapt the detection framework to this domain, the detection head is reconfigured to emphasize small-object detection.

Specifically, the P5 head is removed and a new shallow detection head, P2, is introduced to utilize

160 \times 160

high-resolution feature maps extracted from early backbone stages. The added P2 head enables the model to detect fine-grained details by directly leveraging high-frequency spatial information that is typically lost in deeper layers. This reconfiguration allows for more effective localization and classification of small targets such as vehicles, ships, or rooftops in aerial or satellite images.

The overall impact of this modification is twofold: first, by discarding the computationally intensive P5 head, the model achieves a reduction in inference latency and parameter overhead; second, the inclusion of the P2 head significantly improves detection sensitivity for small-scale targets. This architectural adjustment, as visualized in Figure 6, ensures that the detection framework is better aligned with the scale distribution of objects commonly found in remote sensing datasets, thereby enhancing both efficiency and accuracy.

3.5. Loss Function

The loss function of DFAS-YOLO follows the general structure of the YOLOv8 framework, which consists of three main components: box regression loss, objectness loss, and classification loss. Specifically, the total loss is defined as follows:

L_{t o t a l} = λ_{b o x} L_{b o x} + λ_{o b j} L_{o b j} + λ_{c l s} L_{c l s}

(7)

where

L_{b o x}

is the bounding-box regression loss,

L_{o b j}

is the objectness confidence loss, and

L_{c l s}

is the classification loss. The hyperparameters

λ_{b o x}

,

λ_{o b j}

, and

λ_{c l s}

are weighting coefficients that balance each component.

To improve localization performance, especially in dense and small-object scenes, WIOU (Wise-IoU) is adopted as the regression loss in place of traditional CIOU. Unlike CIOU, which treats all samples equally, WIOU dynamically adjusts the gradient contribution based on the quality of each predicted box. It penalizes high-quality boxes less and low-quality boxes more, accelerating convergence while improving robustness to ambiguous or overlapping objects. The WIOU loss can be formulated as follows:

L_{b o x} = 1 - I o U + α \cdot ρ^{2} (b, b^{g t}) + β \cdot v - γ \cdot S (I o U)

(8)

where

ρ

measures the center distance between the predicted and ground-truth box, v penalizes aspect ratio mismatch, and

S (I o U)

is a sample weighting term determined by IoU quality.

In the classification and objectness branches, DFAS-YOLO retains the BCE loss, ensuring stable gradient flow and high training efficiency in one-stage detectors. Coupled with the introduced WIOU localization loss, the composite loss formulation enables the model to achieve more accurate and robust detection, especially under challenging UAV scenarios characterized by small targets, dense distributions, and frequent occlusions.

4. Experimental Section

To validate the effectiveness of the proposed DFAS-YOLO, experiments were conducted on two challenging aerial object detection datasets: VisDrone2019 and HIT-UAV. These datasets feature dense small objects, complex scenes, and diverse backgrounds, making them suitable for evaluating detection performance in remote sensing scenarios.

YOLOv8s was adopted as the baseline, balancing accuracy and parameter size for lightweight deployment. All proposed improvements—including the SAAF module, GDLA module, and detection head reconfiguration—were implemented on top of YOLOv8s. Experiments were conducted under consistent training settings to ensure fair comparison with the original model.

4.1. Experimental Dataset Description

(1): VisDrone2019: The VisDrone2019 dataset was developed by the AISKYEYE team at Tianjin University. It consists of a large number of video frames and static images captured by UAVs equipped with cameras across various urban and rural regions in China. Each video sequence was collected at different altitudes under diverse weather and lighting conditions. The dataset contains a total of 288 video clips, 261,908 image frames, and 10,209 static images, covering a wide range of object densities and spatial distributions.

Over 2.6 million object instances are manually annotated, including pedestrians, cars, bicycles, and tricycles, with rich attributes such as category, truncation, occlusion, and visibility. Images were collected from different UAV platforms, enhancing the dataset’s diversity. In our experiments, we followed the official split: 6471 images for training, 548 for validation, and 1610 for testing, focusing on UAV-based object detection. The object category distribution is summarized in Table 1.

(2): HIT-UAV: The HIT-UAV dataset, proposed by the Harbin Institute of Technology, is a high-resolution UAV dataset designed for small-object detection. It contains 326 video sequences captured at altitudes of 45–120 m across diverse environments, including highways, urban intersections, playgrounds, parking lots, and residential areas, with both sparse and dense scenes.

The dataset provides over 285,000 annotated object instances across seven categories, such as pedestrians, cars, trucks, buses, motorcycles, and bicycles. Due to high-altitude capture, objects are small and often affected by occlusion, motion blur, and low contrast. Each frame includes class labels, bounding boxes, and visibility states. For our experiments, we followed the split: 2029 training images, 290 validation images, and 579 testing images. The detailed category distribution is listed in Table 2.

4.2. Model Training and Evaluation Metrics

The proposed model was implemented based on the PyTorch 2.0.0 deep learning framework and trained on a workstation. The detailed hardware and software environment is shown in Table 3.

The model was trained for 300 epochs using a stochastic gradient descent (SGD) optimizer. The initial learning rate was set to 0.01, with a momentum of 0.937 and weight decay of 0.0005. The batch size was set to 16, and the input image resolution was

640 \times 640

. Automatic mixed precision (AMP) training was enabled to accelerate convergence and reduce memory usage. Early stopping with a patience of 100 was applied to prevent overfitting. Data loading utilized 8 parallel workers for improved training throughput. During the first 3 epochs, a warm-up strategy was adopted: the momentum linearly increased from 0.8, and the learning rate for bias parameters started from 0.1. The classification loss and Distribution Focal Loss (DFL) weights were set to 0.5 and 1.5, respectively.

In the comparative experiments, all YOLOv5–V11 variants were retrained under the same training settings as described above, while results for other comparison models were directly cited from the corresponding references. The ablation studies were also conducted using identical training parameters.

To comprehensively evaluate the performance of the proposed model, precision, recall, average precision (AP), and mean average precision (mAP) were adopted as primary evaluation metrics. The computation of each metric is described as follows:

Precision measures the proportion of correctly predicted positive samples among all predicted positives. It is defined as follows:

Precision = \frac{T P}{T P + F P}

(9)

where

T P

denotes the number of true positives and

F P

the number of false positives.

Recall evaluates the ability of the model to detect all relevant objects in the ground truth. It is calculated as follows:

Recall = \frac{T P}{T P + F N}

(10)

where

F N

represents the number of false negatives.

By varying the confidence threshold, multiple

(Precision, Recall)

pairs are obtained, forming a precision–recall (P-R) curve. The area under this curve is defined as the average precision:

AP = \int_{0}^{1} p (r) d r

(11)

where

p (r)

denotes the precision as a function of recall r. In practice, the integral is approximated using a finite set of recall levels and the corresponding precision values.

The mean average precision at an IoU threshold of 0.50 is computed by averaging the AP over all object classes:

{mAP}_{50} = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i}^{IoU = 0.50}

(12)

where N is the number of object categories.

Following the COCO evaluation protocol,

{mAP}_{50 : 95}

is computed by averaging the AP over ten IoU thresholds ranging from 0.50 to 0.95, with a step size of 0.05:

{mAP}_{50 : 95} = \frac{1}{10 N} \sum_{j = 1}^{10} \sum_{i = 1}^{N} {AP}_{i}^{IoU = 0.45 + 0.05 j}

(13)

This metric provides a more rigorous and comprehensive evaluation of the model’s detection performance under various overlap conditions.

4.3. Comparisons with Previous Methods

To evaluate the effectiveness of the proposed DFAS-YOLO model, a comprehensive comparison was conducted with mainstream detection algorithms on two representative UAV datasets: VisDrone2019 and HIT-UAV. These two datasets reflect different sensing modalities and detection challenges. Various classical and recent detectors were compared, with metrics including precision, recall, mAP50, mAP50:95, and model size (in millions of parameters) evaluated on the validation sets. Visual qualitative comparisons are presented based on results from the testing sets.

(1): VisDrone2019: Table 4 presents the detection results of DFAS-YOLO alongside various classical and recent methods, including YOLOv5, YOLOv8, YOLOv10, and transformer-based approaches.

Compared to existing methods, DFAS-YOLO achieves superior detection performance on the VisDrone2019 dataset. As shown in Table 4, our method outperforms mainstream one-stage detectors such as YOLOv5, YOLOv8, and YOLOv10, as well as more complex two-stage frameworks like Faster R-CNN and RetinaNet. Specifically, DFAS-YOLO obtains the highest values in both

{mAP}_{50}

(0.448) and

{mAP}_{50 : 95}

(0.273), indicating stronger localization precision and robustness across different IoU thresholds.

Notably, DFAS-YOLO surpasses its baseline YOLOv8s by a considerable margin of +6.4% in

{mAP}_{50}

and +4.3% in

{mAP}_{50 : 95}

, demonstrating the effectiveness of our design in enhancing small-object detection under crowded and cluttered UAV scenarios. Moreover, the proposed DFAS-YOLO significantly reduces the model size from 11.1M to 7.52M parameters, suggesting that the improvements are not only effective but also lightweight and efficient.

When compared to other lightweight and specialized UAV detection models such as Drone-YOLO and FFCA-YOLO, DFAS-YOLO still maintains a performance advantage. For example, while Edge-YOLO reaches a comparable

{mAP}_{50}

(0.448), its parameter count (40.5M) is more than five times that of DFAS-YOLO. This reflects the strong balance that our model strikes between accuracy and computational cost, making it better suited for resource-constrained UAV platforms.

Qualitative results in Figure 7 further support our findings. DFAS-YOLO demonstrates superior detection performance on weak and small targets, effectively addressing missed detections that are commonly observed in the baseline model. It accurately detects small, overlapping, or occluded objects in complex environments, validating the effectiveness of the proposed SAAF and GDLA modules in enhancing multi-scale feature fusion and semantic sensitivity. The visual improvements highlight DFAS-YOLO’s enhanced ability to capture subtle object cues that the baseline often fails to detect.

(2): HIT-UAV: As shown in Table 5, DFAS-YOLO achieves the highest ${mAP}_{50}$ (0.785) and ${mAP}_{50 : 95}$ (0.541) among all competing methods, outperforming conventional detectors like YOLOv5 and YOLOv8, as well as transformer-based models such as RTDETR. Notably, our model maintains a compact size of only 7.52M parameters, while achieving better performance than significantly larger models like YOLOv5m (25.1M) and RTDETR-L (32.8M).

The visual results in Figure 8, obtained on the testing set, further demonstrate the robustness of DFAS-YOLO in detecting low-contrast targets and preserving fine-grained structures in thermal imagery. Notably, DFAS-YOLO successfully avoids false positives that occur in the baseline YOLOv8, showcasing its superior discrimination ability. These results highlight the effectiveness of our designed modules, particularly under adverse sensing conditions, and validate the enhanced reliability of DFAS-YOLO in challenging environments.

To demonstrate the practical effectiveness of DFAS-YOLO in complex real-world scenarios, qualitative detection results are presented for both the VisDrone2019 and HIT-UAV datasets. As illustrated in Figure 9, DFAS-YOLO achieves accurate and reliable detection across representative scenes in the VisDrone2019 dataset, including congested urban streets, densely populated basketball courts, and nighttime road environments.

Figure 10 further showcases the model’s robustness in detecting small and low-contrast objects across various infrared scenes from the HIT-UAV dataset, such as forested areas and parking lots. These visualizations underscore the model’s strong generalization capability under diverse sensing conditions.

Qualitative detection results on the VisDrone2019 and HIT-UAV datasets provide an initial assessment of DFAS-YOLO’s detection ability across standard scenes. These results illustrate that the model can accurately detect objects under typical conditions.

To further evaluate transferability, the model trained on the VisDrone dataset was directly tested on the UAVDT and HIT-UAV datasets. The visualizations in Figure 11 indicate that DFAS-YOLO maintains effective detection performance despite the domain shift, highlighting its cross-dataset generalization capability.

Additionally, robustness was assessed by presenting detection results under challenging imaging conditions, including blur, low light, and occlusion. As shown in Figure 12, DFAS-YOLO consistently detects objects even when the visual quality is degraded, demonstrating stability under non-ideal conditions.

Besides the aforementioned successful detections, failure cases were observed under dark and blurred conditions. As shown in Figure 13, DFAS-YOLO occasionally misses objects when the scene is both poorly lit and blurred, reflecting the model’s limitations under extreme image degradation. Missed detections usually occur for small objects or regions with low contrast against the background, suggesting that future improvements could involve data augmentation, more robust attention mechanisms, or exploration of multimodal information.

4.4. Ablation Experiments

To evaluate the contribution of each component in DFAS-YOLO, extensive ablation experiments were conducted on the VisDrone2019 validation set. The experiments were categorized into four aspects: overall module effectiveness, upsampling methods, loss functions, and the GDLA window size.

(1): Overall Module Effectiveness: As shown in Table 6, a progressive ablation study was conducted to evaluate the individual and combined contributions of SAAF, GDLA, the P2–P5 adjustment, and WIOU loss. Starting from the YOLOv8s baseline, which achieved 0.384 ${mAP}_{50}$ and 0.230 ${mAP}_{50 : 95}$ , each module yielded noticeable gains. Introducing the SAAF module boosts ${mAP}_{50}$ by 1.7%, thanks to its soft-aware feature fusion that mitigates misalignment during upsampling. The GDLA module provides further enhancement in precision and recall by reinforcing low-level semantics in dim or cluttered regions. Meanwhile, replacing the original detection heads with a finer-scale P2 and removing the coarse-scale P5 further improves performance, highlighting the importance of small-object detection in UAV scenes. Finally, switching from CIOU to WIOU brings consistent gains in both recall and ${mAP}_{50 : 95}$ . When all four components are integrated, DFAS-YOLO reaches 0.448 ${mAP}_{50}$ and 0.273 ${mAP}_{50 : 95}$ , clearly surpassing the baseline and demonstrating the effectiveness and complementarity of each proposed module.

FPS results were measured on the VisDrone2019 validation set with single-image inference using an RTX 3090 GPU. Although adding modules inevitably introduces extra computation and slightly reduces speed, the complete model still achieves 132 FPS. This indicates that our method improves detection accuracy while maintaining a favorable inference speed.

(2): Comparison of Upsampling Methods: To validate the superiority of the proposed SAAF upsampling strategy, it was compared with several commonly used methods, including bilinear, bicubic, and transposed convolution, as summarized in Table 7. The default upsampling in YOLOv8 yields only 0.384 ${mAP}_{50}$ , while bilinear and bicubic interpolation produce marginal improvements. Transposed convolution offers higher recall and mAP, but at the cost of significantly increased computational overhead (35.4GFLOPs). In contrast, the SAAF module achieves the highest overall performance, with 0.401 ${mAP}_{50}$ and 0.242 ${mAP}_{50 : 95}$ , while maintaining moderate computational complexity (29.4 GFLOPs). This confirms that SAAF effectively balances accuracy and efficiency by introducing soft-aware interpolation, which refines feature alignment without excessive computation, making it better suited for resource-constrained UAV detection tasks. Compared with CARAFE and CARAFE++, the SAAF module achieves higher mAP values while keeping computational complexity lower (29.4 GFLOPs vs. 30.3 GFLOPs). This indicates that SAAF provides a more effective balance between detection performance and efficiency.

(3): Comparison of Loss Functions: Table 8 also presents the performance of different IoU-based loss functions used for bounding-box regression. While EIOU yields slightly higher precision (0.512), it suffers from a trade-off in recall and fails to significantly improve the overall localization quality. In contrast, WIOU demonstrates a more balanced performance across all metrics, achieving the highest recall (0.386), ${mAP}_{50}$ (0.390), and ${mAP}_{50 : 95}$ (0.236). The improved ${mAP}_{50 : 95}$ , which accounts for stricter IoU thresholds, indicates that WIOU provides more precise box alignment, which is especially beneficial for small or overlapping objects in crowded UAV scenarios. Furthermore, its adaptive weighting strategy helps suppress outliers during optimization, enhancing model robustness. Based on these observations, WIOU was selected as the final localization loss function in DFAS-YOLO.

(4): Comparison of Downsampling Methods: As shown in Table 9, we conducted a comprehensive comparison between GDLA and several representative downsampling modules, including GSConvE, ADown, GhostConv, and SCDown, under the same experimental settings. The results demonstrate that GDLA achieves the highest overall performance, yielding the best ${mAP}_{50}$ (0.398) and ${mAP}_{50 : 95}$ (0.235), as well as the best recall (0.389). Although SCDown slightly outperforms GDLA in terms of precision, GDLA provides a more balanced improvement across all metrics, confirming its superiority in enhancing detection accuracy. The computational complexity of GDLA (29.3 GFLOPs) is marginally higher than that of the other methods, but the performance gain justifies this trade-off, validating the effectiveness of the proposed design.
(5): GDLA Window Size Study: As shown in Table 10, the impact of different local window sizes in the GDLA module was investigated to explore how regional context aggregation affects performance. Among the tested configurations, a $4 \times 4$ window yielded the best balance between recall and accuracy, achieving the highest ${mAP}_{50}$ (0.398) and ${mAP}_{50 : 95}$ (0.235). This suggests that an intermediate window size provides sufficient spatial context to enhance weak semantic responses without diluting local detail. This proves particularly effective for improving detection in infrared and low-contrast scenes, where small objects often suffer from blurred or ambiguous features. Therefore, the $4 \times 4$ setting was adopted in our final design.
(6): Comparison of Attention in GDLA: As shown in Table 11, we compared different attention mechanisms applied within the GDLA module. While SE and CBAM slightly improve precision or recall, EMA achieves the best overall performance, yielding 0.398 ${mAP}_{50}$ and 0.235 ${mAP}_{50 : 95}$ . This indicates that EMA more effectively integrates global, local, and dense features, enabling better feature selection across the concatenated branches. Therefore, EMA is adopted as the attention mechanism in GDLA to enhance discriminative feature representation while maintaining efficiency.
(7): Performance of SAAF and GDLA on Larger Backbones: Table 12 demonstrates that the proposed SAAF and GDLA modules maintain consistent performance improvements when integrated into larger backbones (YOLOv8m/l). Both modules enhance recall and detection accuracy, confirming their scalability and effectiveness across different network capacities. The stable improvements across different model sizes indicate that the proposed designs are robust and are not limited to lightweight architectures.
(8): Performance of Detection Head on Small, Medium, and Large Objects: To investigate the impact of modifying the detection heads, we removed the original large-object P5 detection head in YOLOv8 and added a new small-object P2 detection head. The target scales are defined as follows: small objects with an area of 0 to $32 \times 32$ pixels, medium objects with an area of $32 \times 32$ to $96 \times 96$ pixels, and large objects with an area greater than $96 \times 96$ pixels.

As shown in Table 13, after adding the P2 detection head, small-object AP increased from 0.114 to 0.158, medium-object AP increased from 0.313 to 0.343, and large-object AP slightly increased from 0.394 to 0.407. This improvement can be attributed to the P2 detection head leveraging low-level high-resolution features to capture fine-grained information. Through multi-scale feature fusion, these enhanced features are propagated to mid- and high-level layers, strengthening the representation of large objects. As a result, the adjusted detection head structure not only significantly improves the small-object detection performance but also provides a slight gain for medium and large objects, achieving balanced performance across multiple scales.

Overall, the ablation results validate the effectiveness and necessity of each design component in DFAS-YOLO, and their synergy contributes significantly to performance gains on challenging UAV datasets.

5. Conclusions

In this work, DFAS-YOLO is proposed as a lightweight and efficient detector specifically designed for small-object detection in UAV scenarios. The core contributions of DFAS-YOLO are twofold: the SAAF module, which introduces a soft-aware upsampling mechanism to mitigate the feature misalignment typically caused by standard nearest-neighbor interpolation; and the GDLA module, which reduces information loss during downsampling and enhances feature details by integrating global context, local saliency, and fine structural cues. In addition, the detection head is adjusted with an extra P2 layer and removal of the P5 layer, and WIOU is adopted as a localization loss, further improving small-object prediction and balancing precision and recall. Collectively, these design choices enable DFAS-YOLO to achieve high sensitivity to tiny objects while maintaining a small parameter footprint.

Specifically, DFAS-YOLO achieves an

{mAP}_{50}

of 0.448 and

{mAP}_{50 : 95}

of 0.273 on the VisDrone2019 dataset, and an

{mAP}_{50}

of 0.785 and

{mAP}_{50 : 95}

of 0.541 on the HIT-UAV dataset, demonstrating significant performance while maintaining a small parameter footprint suitable for edge deployment.

Nevertheless, DFAS-YOLO has certain limitations: (1) Although the model remains lightweight, further validation and optimization of inference speed and memory usage are required for real-time deployment on embedded hardware. (2) The current evaluation is limited to UAV-based datasets, and its generalization to other scenarios—such as satellite remote sensing or maritime surveillance—remains to be verified.

In future work, we plan to explore cross-modal fusion strategies to address the inherent limitations of single-sensor perception. We believe that leveraging multi-platform or multi-spectral information will become an essential direction in advancing the accuracy and robustness of small-object detection under diverse and degraded conditions.

Author Contributions

Conceptualization, X.L. and S.Z.; software, X.L.; validation, X.L. and H.Z.; formal analysis, S.Z. and X.L.; resources, X.L., Y.S. and J.Z.; original draft preparation, X.L., S.Z. and J.M.; funding acquisition, H.Z. and J.Z.; review and editing, S.Z., X.L., J.M., Y.S., J.Z. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Frontier Research Fund of the Institute of Optics and Electronics, China Academy of Sciences (Grant No. C24K003).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Xu, K.; Song, C.; Xie, Y.; Pan, L.; Gan, X.; Huang, G. RMT-YOLOv9s: An Infrared Small Target Detection Method Based on UAV Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 7002205. [Google Scholar] [CrossRef]
Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597. [Google Scholar]
Ma, J.; Wu, F.; Li, C.; Tang, C.; Zhang, J.; Xu, Z.; Li, M.; Liu, D. G2EMOT: Guided Embedding Enhancement for Multiple Object Tracking in Complex Scenes. IEEE Trans. Instrum. Meas. 2024, 73, 2527214. [Google Scholar] [CrossRef]
Ma, J.; Tang, C.; Wu, F.; Zhao, C.; Zhang, J.; Xu, Z. STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagra Falls, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
Zhou, S.; Liu, Z.; Luo, H.; Qi, G.; Liu, Y.; Zuo, H.; Zhang, J.; Wei, Y. GCA2Net: Global-Consolidation and Angle-Adaptive Network for Oriented Object Detection in Aerial Imagery. Remote Sens. 2025, 17, 1077. [Google Scholar] [CrossRef]
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
Liu, Z.; Chen, C.; Huang, Z.; Chang, Y.C.; Liu, L.; Pei, Q. A low-cost and lightweight real-time object-detection method based on uav remote sensing in transportation systems. Remote Sens. 2024, 16, 3712. [Google Scholar] [CrossRef]
Debnath, D.; Vanegas, F.; Sandino, J.; Hawary, A.F.; Gonzalez, F. A review of UAV path-planning algorithms and obstacle avoidance methods for remote sensing applications. Remote Sens. 2024, 16, 4019. [Google Scholar] [CrossRef]
Luo, H.; Zhuang, Z.; Li, Y.; Tan, M.; Chen, C.; Zhang, J. Toward compact and robust model learning under dynamically perturbed environments. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4857–4873. [Google Scholar] [CrossRef]
Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-object detection in remote sensing images with end-to-end edge-enhanced GAN and object detector network. Remote Sens. 2020, 12, 1432. [Google Scholar] [CrossRef]
Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214. [Google Scholar] [CrossRef]
Zhao, C.; Liu, R.W.; Qu, J.; Gao, R. Deep learning-based object detection in maritime unmanned aerial vehicle imagery: Review and experimental comparisons. Eng. Appl. Artif. Intell. 2024, 128, 107513. [Google Scholar] [CrossRef]
Li, C.; Zhao, R.; Wang, Z.; Xu, H.; Zhu, X. Remdet: Rethinking efficient model design for uav object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4643–4651. [Google Scholar]
Shi, Y.; Wang, C.; Xu, S.; Yuan, M.D.; Liu, F.; Zhang, L. Deformable convolution-guided multiscale feature learning and fusion for UAV object detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6004105. [Google Scholar] [CrossRef]
Zhuang, Z.; Wang, Z.; Chen, S.; Liu, L.; Luo, H.; Tan, M. Robust 3d semantic occupancy prediction with calibration-free spatial transformation. arXiv 2024, arXiv:2411.12177. [Google Scholar] [CrossRef]
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Kim, S.W.; Kook, H.K.; Sun, J.Y.; Kang, M.C.; Ko, S.J. Parallel feature pyramid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 234–250. [Google Scholar]
Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, B.; Zhu, Y. OWRT-DETR: A Novel Real-Time Transformer Network for Small-Object Detection in Open-Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313. [Google Scholar] [CrossRef]
Yang, J.; Fu, X.; Hu, Y.; Huang, Y.; Ding, X.; Paisley, J. PanNet: A deep network architecture for pan-sharpening. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5449–5457. [Google Scholar]
Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
Li, H. Rethinking Features-Fused-Pyramid-Neck for Object Detection. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 74–90. [Google Scholar]
Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2988–2997. [Google Scholar]
Zhang, X.; Izquierdo, E.; Chandramouli, K. Dense and small object detection in UAV vision based on cascade network. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Wang, X.; He, N.; Hong, C.; Wang, Q.; Chen, M. Improved YOLOX-X based UAV aerial photography object detection algorithm. Image Vis. Comput. 2023, 135, 104697. [Google Scholar] [CrossRef]
Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Kim, K.; Lee, H.S. Probabilistic anchor assignment with iou prediction for object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 355–371. [Google Scholar]
Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 1998, 2, 121–167. [Google Scholar] [CrossRef]
Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WA, USA, 24–26 July 1998; pp. 80–91. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [Google Scholar] [CrossRef]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Cheng, X.; Yu, J. RetinaNet with difference channel attention and adaptively spatial feature fusion for steel surface defect detection. IEEE Trans. Instrum. Meas. 2020, 70, 2503911. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]
Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069. [Google Scholar] [CrossRef]
Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Montes, D.; et al. ultralytics/yolov5: v6. 2-yolov5 classification models, apple m1, reproducibility, clearml and deci.ai integrations. Zenodo 2022, 7002879. [Google Scholar]
Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar]
Yu, G.; Chang, Q.; Lv, W.; Xu, C.; Cui, C.; Ji, W.; Dang, Q.; Deng, K.; Wang, G.; Du, Y.; et al. PP-PicoDet: A better real-time object detector on mobile devices. arXiv 2021, arXiv:2111.00902. [Google Scholar]
Yang, G.; Lei, J.; Tian, H.; Feng, Z.; Liang, R. Asymptotic Feature Pyramid Network for Labeling Pixels and Regions. IEEE Trans. Cir. and Sys. for Video Technol. 2024, 34, 7820–7829. [Google Scholar] [CrossRef]
Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE++: Unified Content-Aware ReAssembly of FEatures. arXiv 2020, arXiv:2012.04733. [Google Scholar] [CrossRef]
Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. arXiv 2018, arXiv:1711.07767. [Google Scholar] [CrossRef]
Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhu, G.; Yuan, B.; Sun, Y.; Zhang, W. Adaptive Feature Fusion With Attention-Guided Small Target Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5623116. [Google Scholar] [CrossRef]
Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 5488. [Google Scholar] [CrossRef]
Wang, J.; Li, X.; Chen, J.; Zhou, L.; Guo, L.; He, Z.; Zhou, H.; Zhang, Z. DPH-YOLOv8: Improved YOLOv8 Based on Double Prediction Heads for the UAV Image Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5647715. [Google Scholar] [CrossRef]
Xu, J.; Fan, X.; Jian, H.; Xu, C.; Bei, W.; Ge, Q.; Zhao, T. Yoloow: A spatial scale adaptive real-time object detection neural network for open water search and rescue from uav aerial imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623115. [Google Scholar] [CrossRef]
Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125. [Google Scholar] [CrossRef]
Wang, B.; Su, J.; Xi, J.; Chen, Y.; Cheng, H.; Li, H.; Chen, C.; Shang, H.; Yang, Y. Landslide Detection with MSTA-YOLO in Remote Sensing Images. Remote Sens. 2025, 17, 2795. [Google Scholar] [CrossRef]
Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Chen, L.; Liu, C.; Li, W.; Xu, Q.; Deng, H. DTSSNet: Dynamic training sample selection network for UAV object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5902516. [Google Scholar] [CrossRef]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2778–2788. [Google Scholar]
Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7507–7512. [Google Scholar]

Figure 1. Overall framework of YOLOv8.

Figure 2. Overall framework of DFAS-YOLO.The red boxes highlight the improved components.

Figure 3. Structure of SAAF.

Figure 4. Structure of GDLA.

Figure 5. Structure of EMA module.

Figure 6. Architecture adjustment with added small-object detection layer.

Figure 7. Qualitative comparison between DFAS-YOLO and baseline methods on VisDrone2019. (a–c) Detection results on three representative images from the VisDrone2019 dataset.

Figure 8. Qualitative comparison between DFAS-YOLO and baseline methods on HIT-UAV infrared scenes. (a–c) Detection results on three representative images from the HIT-UAV dataset.

Figure 9. Detection results under normal conditions on VisDrone2019.

Figure 10. Detection examples on HIT-UAV.

Figure 11. Detection results on UAVDT and HIT-UAV for transferability evaluation.

Figure 12. Detection results under blur, low light, and occlusion.

Figure 13. Failure cases under dark and blurred conditions.

Table 1. Category distribution of VisDrone2019 dataset.

Pedestrian	People	Bicycle	Car	Van
23.1%	7.9%	3.1%	42.2%	7.3%
Truck	Tricycle	Awning-tricycle	Bus	Motor
3.8%	1.4%	0.9%	1.7%	8.6%

Table 2. Category distribution of HIT-UAV dataset.

Person	Car	Bicycle	Other Vehicle	Dont Care
48.4%	29.8%	20.6%	0.6%	0.6%

Table 3. Training environment configuration.

Component	Configuration
Operating System	Ubuntu 20.04
GPU	NVIDIA RTX 3090 (Santa Clara, CA, USA )
CPU	Intel Xeon E5-2620 v3 @ 2.40 GHz (Santa Clara, CA, USA)
RAM	128 GB
Python	3.10.0
CUDA Version	12.1

Table 4. Comparison of detection performance with different methods in VisDrone2019.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$	Para (M)
YOLOv5s [47]	0.448	0.332	0.329	0.181	7.05
YOLOv5m [47]	0.470	0.343	0.341	0.190	20.9
YOLOv8s(Baseline) [48]	0.495	0.376	0.384	0.230	11.1
YOLOv10s [61]	0.502	0.377	0.391	0.235	8.07
YOLOv11s [62]	0.482	0.383	0.384	0.231	9.43
Faster R-CNN [37]	—	—	0.414	0.219	41.3
RetinaNet [41]	—	—	0.302	0.177	36.4
DTSSNet [63]	—	—	0.399	0.242	10.1
FFCA-YOLO [7]	0.484	0.403	0.411	0.225	7.1
TPH-YOLOv5 [64]	0.729	0.154	0.363	0.194	45.4
Drone-YOLO(t) [65]	—	—	0.428	0.256	5.35
Drone-YOLO(s) [65]	—	—	0.443	0.270	10.9
Edge-YOLO [66]	—	—	0.448	0.264	40.5
RTDETR-L [46]	0.413	0.263	0.247	0.131	32.8
DFAS-YOLO (Ours)	0.548	0.428	0.448	0.273	7.52

Note:

{mAP}_{50}

and

{mAP}_{50 : 95}

denote average precision at IoU thresholds of 0.5 and 0.5:0.95, respectively. Bold values indicate the best performance. Para indicates model size in millions.

Table 5. Comparison of detection performance with different methods in HIT-UAV.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$	Para (M)
YOLOv5s [47]	0.828	0.671	0.765	0.510	7.05
YOLOv5m [47]	0.785	0.727	0.771	0.521	25.1
YOLOv8s(Baseline) [48]	0.847	0.688	0.760	0.501	11.1
YOLOv10s [61]	0.877	0.679	0.778	0.517	8.07
YOLOv11s [62]	0.881	0.675	0.780	0.518	9.43
Faster R-CNN [37]	—	—	0.562	—	41.3
RTDETR-X [46]	0.710	0.587	0.644	0.391	67.3
RTDETR-L [46]	0.830	0.707	0.769	0.499	32.8
UAV-DETR [44]	—	—	0.749	—	21.5
DFAS-YOLO (Ours)	0.849	0.704	0.785	0.541	7.52

Note: Bold values indicate the best results. Para represents the number of parameters in millions.

Table 6. Ablation study of overall module effectiveness.

SAAF	GDLA	+P2–P5	WIOU	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$	FPS
				0.495	0.376	0.384	0.230	149
			✔	0.504	0.386	0.390	0.236	149
✔				0.510	0.392	0.401	0.242	147
	✔			0.508	0.389	0.398	0.235	144
		✔		0.520	0.404	0.411	0.242	140
✔	✔			0.517	0.393	0.406	0.237	142
✔		✔		0.520	0.406	0.418	0.238	139
	✔	✔		0.539	0.415	0.433	0.265	138
✔	✔	✔	✔	0.548	0.428	0.448	0.273	132

Note: ✔ means that the module is activated. “+P2–P5” indicates adding a small-object detection head and removing the large-object head. Bold values indicate the best results. FPS refers to inference speed (frames per second).

Table 7. Comparison of SAAF with other upsampling methods.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs
*	0.495	0.376	0.384	0.230	28.8
Bilinear	0.498	0.374	0.386	0.230	28.7
Bicubic	0.514	0.380	0.390	0.234	29.9
ConvTranspose	0.497	0.385	0.394	0.236	35.4
CARAFE	0.505	0.383	0.394	0.235	30.3
CARAFE++	0.505	0.384	0.395	0.237	30.3
SAAF	0.510	0.392	0.401	0.242	29.4

Note: ‘*’ indicates the default upsampling in YOLOv8. GFLOPs measures the computational load during inference. Bold values indicate the best results.

Table 8. Comparison of different loss functions.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$
*	0.495	0.376	0.384	0.230
EIOU	0.512	0.375	0.388	0.236
DIOU	0.509	0.374	0.385	0.232
SIOU	0.505	0.379	0.389	0.232
GIOU	0.504	0.376	0.388	0.233
WIOU	0.504	0.386	0.390	0.236

Note: ‘*’ indicates YOLOv8s. Bold values indicate the best results.

Table 9. Comparison of downsampling methods.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$	GFLOPs
*	0.495	0.376	0.384	0.230	28.8
GSConvE	0.490	0.386	0.387	0.231	28.4
ADown	0.505	0.379	0.391	0.234	28.0
GhostConv	0.498	0.377	0.387	0.230	28.2
SCDown	0.511	0.376	0.390	0.234	28.2
GDLA	0.508	0.389	0.398	0.235	29.3

Note: ‘*’ indicates the default downsampling in YOLOv8s. GFLOPs measures the computational load during inference. Bold values indicate the best results.

Table 10. Ablation study on window size in GDLA module.

Window Size	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$
$2 \times 2$	0.509	0.375	0.388	0.231
$3 \times 3$	0.500	0.375	0.381	0.227
$4 \times 4$	0.508	0.389	0.398	0.235
$5 \times 5$	0.513	0.374	0.389	0.230

Note: The

4 \times 4

window is used in our GDLA design due to its superior mAP performance. Bold values indicate the best results.

Table 11. Comparison of attention in GDLA module.

Attention	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$
*	0.505	0.379	0.388	0.232
SE	0.509	0.373	0.386	0.230
CBAM	0.515	0.381	0.388	0.231
EMA	0.508	0.389	0.398	0.235

Note: ‘*’ denotes that no attention mechanism is added within GDLA. Bold values indicate the best results.

Table 12. Scalability analysis of SAAF and GDLA on larger YOLOv8 backbones.

Methods	Precision	Recall	${mAP}_{50}$	${mAP}_{50 : 95}$
YOLOv8m	0.547	0.403	0.423	0.257
YOLOv8m+SAAF	0.555	0.410	0.433	0.263
YOLOv8m+GDLA	0.538	0.411	0.430	0.261
YOLOv8l	0.560	0.421	0.440	0.268
YOLOv8l+SAAF	0.570	0.428	0.449	0.275
YOLOv8l+GDLA	0.554	0.434	0.444	0.273

Note: Bold values indicate improvements over the corresponding baseline (YOLOv8m or YOLOv8l).

Table 13. Detection performance on small, medium, and large objects after adding the P2 detection head and removing the P5 head.

Methods	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$	MaxDets
YOLOv8s	0.114	0.313	0.394	100
YOLOv8s+P2–P5	0.158	0.343	0.407	100

Note:

{AP}_{s}

,

{AP}_{m}

, and

{AP}_{l}

are computed using IoU thresholds 0.50:0.95. MaxDets = 100 means that the top 100 highest-confidence predictions per image are considered. Bold values indicate improved performance compared to the baseline.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, X.; Zhou, S.; Ma, J.; Sun, Y.; Zhang, J.; Zuo, H. DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images. Remote Sens. 2025, 17, 3476. https://doi.org/10.3390/rs17203476

AMA Style

Liu X, Zhou S, Ma J, Sun Y, Zhang J, Zuo H. DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images. Remote Sensing. 2025; 17(20):3476. https://doi.org/10.3390/rs17203476

Chicago/Turabian Style

Liu, Xiangyu, Shenbo Zhou, Jianbo Ma, Yumei Sun, Jianlin Zhang, and Haorui Zuo. 2025. "DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images" Remote Sensing 17, no. 20: 3476. https://doi.org/10.3390/rs17203476

APA Style

Liu, X., Zhou, S., Ma, J., Sun, Y., Zhang, J., & Zuo, H. (2025). DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images. Remote Sensing, 17(20), 3476. https://doi.org/10.3390/rs17203476

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

DFAS-YOLO: Dual Feature-Aware Sampling for Small-Object Detection in Remote Sensing Images

Abstract

Highlights

Abstract

1. Introduction

2. Related Works

2.1. Development of Object Detection

2.2. Feature Enhancement and Fusion of Object Detection

2.3. YOLO-Based Detection in Remote Sensing

3. Methods

3.1. Overview

3.2. Soft-Aware Adaptive Fusion (SAAF)

3.3. Global Dense Local Aggregation (GDLA) Module

3.4. Detection Head Reconfiguration for Small-Object Emphasis

3.5. Loss Function

4. Experimental Section

4.1. Experimental Dataset Description

4.2. Model Training and Evaluation Metrics

4.3. Comparisons with Previous Methods

4.4. Ablation Experiments

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI