Article

HSFANet: Hierarchical Scale-Sensitive Feature Aggregation Network for Small Object Detection in UAV Aerial Images

1 School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Information Engineering College, Capital Normal University, Beijing 100048, China
3 Artificial Intelligence and Intelligent Operation Center, China Mobile Research Institute, Beijing 100052, China
4 Department of Computer Science, School of Petroleum, China University of Petroleum-Beijing at Karamay, Karamay 834000, China
* Author to whom correspondence should be addressed.
Current address: Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, School of Medicine and Engineering, Beihang University, Beijing 100191, China.
Drones 2025, 9(9), 659; https://doi.org/10.3390/drones9090659
Submission received: 24 July 2025 / Revised: 7 September 2025 / Accepted: 15 September 2025 / Published: 19 September 2025
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones, 2nd Edition)

Abstract

Small object detection in aerial images, particularly from Unmanned Aerial Vehicle (UAV) platforms, remains a significant challenge due to limited object resolution, dense scenes, and background interference. However, existing small object detectors often fail to make full use of hierarchical features and inevitably introduce noise interference through hierarchical upsampling operations, and commonly used loss metrics lack sensitivity to scale information; these two issues jointly lead to performance deterioration. To address these issues, we propose the Hierarchical Scale-Sensitive Feature Aggregation Network (HSFANet), a novel framework that conducts robust cross-layer feature interaction to perceive small objects' position information in hierarchical feature pyramids and forces the model to balance the multi-scale prediction heads for accurate instance localization. HSFANet introduces a Dynamic Position Aggregation (DPA) module to explicitly enhance the object area in both shallow and deep layers, which is capable of exploiting the complementary salient representations of small objects. Additionally, an efficient Scale-Sensitive Loss (SSL) is proposed to balance the small object detection outputs across hierarchical prediction heads, thereby effectively improving the performance of small object detection. Extensive experiments on two challenging UAV benchmarks, VisDrone and UAVDT, demonstrate that HSFANet achieves state-of-the-art (SOTA) results, with a 1.3% gain in overall average precision (AP) and a notable 2.2% improvement in AP for small objects on VisDrone. On UAVDT, HSFANet outperforms previous methods by 0.3% in overall AP and 16.7% in small object AP. These results highlight the effectiveness of HSFANet in enhancing small object detection performance in complex aerial imagery, making it well suited for practical UAV-based applications.

1. Introduction

Object detection is a fundamental task in computer vision with numerous real-world applications [1,2,3,4,5,6,7,8,9,10,11,12,13], such as intelligent video surveillance, autonomous driving, and medical image analysis. This task focuses on identifying and localizing instances of semantic objects within an image, which plays a pivotal role in enabling machines to perceive and interact with their surroundings. Among the various detection scenarios, object detection in Unmanned Aerial Vehicle (UAV) aerial images is garnering growing attention in recent years due to the swift advancement of drone technology and the increasing demand for aerial scene understanding [14,15,16]. However, this task is inherently more demanding than traditional object detection tasks [17,18]. As shown in Figure 1, this is primarily because UAV-based aerial images are often captured from high altitudes with wide fields of view, resulting in a multitude of small-scale objects that are densely distributed and easily conflated with background textures [19,20]. Moreover, during the process of convolution and pooling in standard convolutional neural networks (CNNs), the features of these small objects are frequently weakened, blurred, or even lost due to resolution reduction, making it exceedingly difficult for detectors to accurately recognize and localize them. Consequently, how to improve the detection performance for small objects in aerial images remains a foundational yet unresolved problem in the computer vision community, and it has become an active research focus in recent years [21,22,23].
To tackle these unique challenges in aerial image object detection, researchers have made considerable efforts in the fusion of small object features. A series of advanced detection frameworks, such as UFPMP-Det [24], QueryDet [25], and HRDNet [26], have achieved marked improvements by introducing multi-level feature enhancements or adopting transformer-based query mechanisms. These models have shown encouraging results on public benchmarks and have extended the limits of detection accuracy. Nevertheless, many of them still are deficient in fully capturing the complex characteristics of small targets, particularly in the context of aerial scenes with complex backgrounds and scale variations. A prevalent limitation is that these detectors often fail to conduct effective hierarchical feature interactions across different semantic layers, resulting in inadequate fine-grained representation of small-scale instances. Additionally, the widespread use of traditional Feature Pyramid Networks (FPNs) in these detectors typically involves multiple stages of downsampling and upsampling to align feature resolutions. Although this hierarchical fusion strategy enables coarse-to-fine detection, it also introduces considerable noise and information redundancy, especially when dealing with low-resolution small objects. As discussed in [27], such noise interference has a disproportionately adverse effect on small object detection, making precise localization more difficult and diminishing overall performance.
In addition, researchers have proposed a series of loss functions specifically designed to improve small object detection performance. However, traditional small object detection methods typically employ IoU-based [28,29,30] metrics or Gaussian distribution distance measures [31,32] as regression losses. These metrics inherently exhibit limitations regarding the relevant target scale range: IoU-based approaches are generally more suitable for medium to large objects, whereas Gaussian distribution distance measures are better suited for extremely small objects. Traditional small object detectors lack scale-awareness and thus cannot apply regression losses tailored to objects of varying scales. Given that the target scale distribution in UAV aerial imagery is highly imbalanced, this shortcoming results in suboptimal detection performance in UAV aerial scenarios.
In this paper, we introduce a novel small object detection framework tailored for aerial images termed HSFANet. The proposed HSFANet seeks to address the limitations of existing methods by promoting stronger cross-layer feature interactions and explicitly improving the model’s sensitivity to small-scale targets through a new scale-aware supervision strategy. Specifically, we first devise a DPA module that is integrated into a standard CSPDarknet backbone to facilitate effective spatial-aware fusion between shallow and deep features. The DPA module adaptively upsamples and aligns the spatial structures of low-level features and combines them with high-level semantic features, thereby enhancing the localization cues of small targets across multiple resolutions. This hierarchical aggregation not only retains fine-grained spatial details from early layers but also utilizes semantic context from deeper layers, mitigating the adverse impact of feature noise and enabling the detector to more accurately detect and identify small objects. Furthermore, we introduce an effective SSL that supervises predictions across multiple output heads. The SSL dynamically readjusts the learning process to highlight difficult and small-scale samples, prompting the model to focus more on these hard examples. The integration of DPA and SSL leads to a more resilient and fine-grained representation of small objects, ultimately resulting in better detection performance.
Extensive experiments are conducted on two widely used and challenging aerial object detection benchmarks, namely VisDrone and UAVDT, to verify the effectiveness of our proposed approach. The results demonstrate that HSFANet regularly surpasses existing state-of-the-art detectors in terms of small object detection accuracy and localization precision, exhibiting its robustness and generalization capability under varied aerial imaging scenarios. Our proposed method not only advances the performance frontier but also offers a practical and efficient solution for real-world UAV-based applications.
In summary, the main contributions of this paper are presented as follows:
  • We introduce a novel DPA module. This module enables resilient and flexible cross-layer feature interaction by fusing spatial and semantic features from different levels and mitigating the effect of noise interference introduced during feature compression, allowing the detector to effectively discern small object position information within hierarchical feature pyramids for precise localization.
  • We devise an effective SSL function that supervises predictions across multiple output scales. This loss function prompts the model to dedicate more learning resources to small and difficult objects by adaptively reweighting the losses, thus improving detection robustness.
  • We propose HSFANet and conduct comprehensive experiments on two public aerial object detection datasets, VisDrone and UAVDT, to verify the effectiveness and generalization capability of our method. The results clearly demonstrate that HSFANet considerably enhances small object detection performance and attains new state-of-the-art results in this challenging domain.
The organization of this paper is as follows: Section 2 reviews related research work in the field of small object detection. Section 3 provides a detailed explanation of the proposed HSFANet. Section 4 presents experimental validations demonstrating the effectiveness of HSFANet. Finally, Section 5 summarizes the findings of this study.

2. Related Work

2.1. Small Object Detection in Aerial Images

Small object detection in aerial imagery [33] has become a major research focus due to its broad range of real-world applications, including traffic monitoring, urban planning, and disaster response. Dissimilar to natural scene images, aerial images captured by Unmanned Aerial Vehicles (UAVs) or satellites often exhibit ultra-high resolutions, complex backgrounds, and densely packed small targets [34]. These characteristics introduce distinct challenges [19], such as considerable scale variation, cluttered scenes, and extreme class imbalance. In particular, small targets usually occupy only a few pixels and are susceptible to feature loss during convolution and pooling operations, making accurate localization and classification challenging [35].
To address these issues, multiple strategies have been explored. Some approaches introduce scale-specific branches or resolution-preserving structures to better preserve small object features [36]. Others incorporate contextual modeling or global semantic reasoning to enrich the representations of low-resolution instances [37,38]. Despite these efforts, small object detection in aerial images remains impeded by the low signal-to-noise ratio, especially under occlusion, illumination variation, or atmospheric interference. Moreover, traditional loss functions and post-processing strategies such as IoU-based regression often are unable to offer sufficient supervision or suppression behavior for small instances [39]. Therefore, formulating effective architectures and learning objectives that are robust to noise and scale variation is critical for improving performance in this domain.
Although the aforementioned works have improved small object detection to some extent, they often disregard cross-level feature structures. The hierarchical upsampling operations unavoidably introduce noise interference, and frequently used loss metrics lack scale-awareness. As a result, detection performance in UAV aerial scenarios remains suboptimal. In this work, we fully leverage the hierarchical feature structure and adopt dynamic upsampling to mitigate sampling noise. Moreover, we introduce a scale-aware function into the loss to balance the regression loss across objects of different scales.

2.2. Hierarchical Feature Representation for Small Objects

Small object detection is intrinsically difficult due to the loss of detailed spatial information and the prevalence of background noise in high-resolution images [40]. Feature Pyramid Networks (FPNs) and their variants have demonstrated considerable success in multi-scale object detection by constructing hierarchical feature maps [41]. However, the intrinsic trade-off between spatial resolution and semantic richness still presents difficulties for accurately detecting small objects [42]. High-level features offer stronger semantics but are deficient in spatial precision, while low-level features preserve spatial details but are semantically weak [43]. This semantic gap often leads to imprecise or missed detections of small objects.
To address this disparity, a range of enhanced FPN structures have been introduced. Shi et al. [44] proposed the High-frequency and Spatial-aware FPN (HS-FPN), which utilizes high-pass filters to extract high-frequency signals and incorporates spatial dependency modules to capture spatial cues for small object enhancement. Yang et al. [45] introduced the Asymptotic FPN (AFPN), which merges non-adjacent feature levels to lessen semantic inconsistency and employs adaptive spatial fusion to decrease multi-object interference. Guo et al. [46] created AugFPN with consistency supervision, residual enhancement, and soft RoI selection to improve multi-scale representations. Additionally, Deng et al. [47] proposed EFPN, which adds a super-high-resolution pyramid to retain finer detail and enhances feature propagation across scales.
Although the aforementioned methods have eased the difficulty of small object feature representation to a degree, most of them still depend on sequential feature fusion, which often introduces new problems such as gradual degradation of feature quality across layers. In contrast, our approach further enhances the representation of small objects by utilizing cross-level feature aggregation and a feature extraction module specifically designed to better capture small object characteristics.

2.3. Loss Function Design for Small Object Detection

The optimization process of object detection models is heavily dependent on the choice of loss function. For small objects, traditional IoU-based losses such as GIoU [28], DIoU, and CIoU [29] often are unable to offer robust gradients due to the minimal overlap between predictions and ground truth boxes. This results in less than optimal localization performance, particularly for minuscule instances.
To tackle this shortcoming, a number of scale-aware and distribution-based loss functions have been introduced. α -IoU loss [30] includes an exponential term to boost sensitivity in low-IoU regions, which is especially advantageous for small object alignment. Meanwhile, distribution-based losses such as the Normalized Wasserstein Distance (NWD) [31] and Gaussian-based Receptive Field Label Assignment (RFLA) [32] quantify the spatial distribution mismatch between predicted and ground truth boxes, offering improved localization signals for small targets. These methods represent spatial uncertainty and geometric variance more efficiently than hard-threshold IoU metrics. Additionally, loss designs that jointly optimize classification and regression, such as Generalized Focal Loss (GFL) [48] and Focal Loss [49], help alleviate sample imbalance and enhance detection robustness.
While the preceding methods have improved the effectiveness of small object loss metrics to some extent, the highly imbalanced target scale distribution in UAV aerial scenarios often surpasses the relevant scale range of these single, static metrics. To resolve this problem, we design a loss function that effectively discerns target scales, thereby alleviating the problems caused by scale imbalance.

3. Methods

In this section, we provide a comprehensive and detailed description of the proposed HSFANet architecture. The overall structure and design of HSFANet are illustrated clearly in Figure 2, which helps to visualize the network’s key components and their interconnections. Initially, the network employs a well-established and widely used backbone, CSPDarknet, as the foundational feature extractor. This backbone is responsible for generating multi-scale feature pyramids that capture hierarchical information from the input images at different resolutions and semantic levels. Following the backbone, we introduce a novel DPA module. This module is specifically designed to effectively fuse and integrate high-level semantic information, which carries rich contextual cues, with low-level detailed feature information that contains fine-grained spatial details.
The purpose of the DPA module is to enable a more robust and adaptive fusion of features across multiple scales and layers, enhancing the representation capability of the network for objects of varying sizes. To further address the common problem of feature information loss that often occurs during upsampling operations in HSFANet, a dynamic upsampling operator is carefully incorporated. This operator adaptively adjusts the upsampling process, reducing the distortion and degradation of feature details that can negatively impact detection accuracy, especially for small objects. Moreover, in order to boost detection performance on small-scale targets, an SSL function is integrated into the overall loss computation. This specialized loss formulation enhances the network's sensitivity to small objects by dynamically emphasizing their importance during training, thereby significantly improving detection precision and robustness in challenging scenarios.

3.1. Dynamic Position Aggregation

In UAV target detection tasks, small targets constitute a significantly high proportion of the overall objects present in aerial scenes. However, due to their inherently limited pixel count and insufficient distinctive feature information, small targets are much more vulnerable to severe feature degradation and information loss during the downsampling stages of convolutional neural networks. Compared to medium and large targets, which inherently contain richer and more detailed visual semantics, small targets tend to be compressed excessively after multiple downsampling operations, sometimes shrinking to pixel-level or even sub-pixel-level representations. This extreme compression causes a drastic reduction in useful discriminative information, rendering small target detection particularly challenging when relying on high-level feature maps that tend to focus on coarse semantic cues.
To effectively address this critical challenge, this paper proposes a novel DPA module, specifically designed to explicitly enhance and emphasize the object-related regions across both shallow and deep network layers. The DPA module leverages complementary salient representations by dynamically aggregating position-aware feature information that is crucial for the accurate localization and identification of small targets. In particular, DPA extracts and preserves effective feature details from lower-level feature maps, which usually contain richer spatial and texture information. This extraction helps mitigate the otherwise inevitable negative effects of feature loss during the feature extraction and downsampling processes, thus substantially improving the network’s ability to detect small-scale objects.
Moreover, the DPA module employs a cross-level feature fusion architecture, which plays a pivotal role in alleviating the semantic gap that often arises from continuous hierarchical fusion of features at multiple levels. By effectively integrating cross-level information, the module introduces valuable low-level detailed features into the fusion process, providing crucial fine-grained guidance that benefits small object detection. This integration enables the network to better combine spatial and positional information, which are key to accurately delineating small objects that are otherwise prone to being overlooked. The DPA module thereby organically fuses high-level global contextual information with low-level local detail cues, further strengthening the overall representational capacity of the network for small target detection.
In addition to the DPA module, we design the Convolution to Feature with Mish (C2FM) module and the Convolution to Feature with Mish Compact Inverted Block (C2FMCIB) to further enhance the sensitivity of the DPA towards subtle variations in feature maps. The C2FM and C2FMCIB modules adaptively modulate channel-wise features by focusing on informative feature channels and suppressing irrelevant or noisy ones, which significantly improves the discrimination power of the aggregated features. This complementary design allows the DPA module to better capture minute changes and subtle characteristics in feature maps that are often crucial for distinguishing small targets from complex backgrounds or clutter.
The feature fusion process at the $P_2$ level of the DPA module is formulated as follows:
$F_3' = \mathrm{Cat}\left(F_3, \mathrm{Dysample}(F_5)\right)$  (1)
$P_2 = \mathrm{Cat}\left(F_2, \mathrm{Dysample}(F_3')\right)$  (2)
where $\mathrm{Cat}(\cdot)$ denotes the concatenation operation, $\mathrm{Dysample}(\cdot)$ denotes the dynamic upsampling operation [50], and $P_2$ denotes the output feature map corresponding to the second layer in the DPA module.
The feature fusion process at the $P_3$ level of the DPA module is formulated as follows:
$\hat{F}_3 = \mathrm{Downsample}(P_2)$  (3)
$P_3 = \mathrm{Cat}\left(F_3', \hat{F}_3\right)$  (4)
where $\mathrm{Downsample}(\cdot)$ denotes the downsampling operation, which is implemented using CBS, and $P_3$ denotes the output feature map corresponding to the third layer in the DPA module.
The feature fusion process at the $P_5$ level of the DPA module is formulated as follows:
$\hat{F}_5 = \mathrm{Downsample}\left(\mathrm{Cat}\left(F_3', \hat{F}_3\right)\right)$  (5)
$P_5 = \mathrm{Cat}\left(F_5, \hat{F}_5\right)$  (6)
where $\mathrm{Downsample}(\cdot)$ denotes the downsampling operation, which is implemented using SCDown [51], and $P_5$ denotes the output feature map corresponding to the fifth layer in the DPA module.
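For concreteness, the following PyTorch sketch illustrates the cross-level fusion of Equations (1)–(6) under simplifying assumptions: bilinear interpolation stands in for Dysample, plain strided 3 × 3 convolutions stand in for the CBS and SCDown downsampling blocks, and the channel widths, strides, and class name are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPAFusionSketch(nn.Module):
    """Illustrative sketch of the cross-level fusion in Eqs. (1)-(6).

    Assumptions (not the paper's exact modules): bilinear interpolation stands in
    for Dysample, generic strided 3x3 convolutions stand in for CBS and SCDown,
    and F2/F3/F5 are backbone features at strides 4/8/32 with c2/c3/c5 channels.
    """

    def __init__(self, c2: int, c3: int, c5: int):
        super().__init__()
        # Stand-in for the CBS downsampling that produces F^3 from P2 (Eq. (3)).
        self.down_p2 = nn.Conv2d(c2 + c3 + c5, c3, kernel_size=3, stride=2, padding=1)
        # Stand-in for the SCDown downsampling that produces F^5 (Eq. (5)):
        # P3 sits at stride 8 and F5 at stride 32, so downsample by a factor of 4.
        self.down_p3 = nn.Sequential(
            nn.Conv2d(2 * c3 + c5, c5, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(c5, c5, kernel_size=3, stride=2, padding=1),
        )

    @staticmethod
    def up(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # Placeholder for Dysample(.): resize x to the spatial size of ref.
        return F.interpolate(x, size=ref.shape[-2:], mode="bilinear", align_corners=False)

    def forward(self, f2, f3, f5):
        f3_enriched = torch.cat([f3, self.up(f5, f3)], dim=1)   # Eq. (1): F3' = Cat(F3, Dysample(F5))
        p2 = torch.cat([f2, self.up(f3_enriched, f2)], dim=1)   # Eq. (2): P2 = Cat(F2, Dysample(F3'))
        f3_hat = self.down_p2(p2)                               # Eq. (3): F^3 = Downsample(P2)
        p3 = torch.cat([f3_enriched, f3_hat], dim=1)            # Eq. (4): P3 = Cat(F3', F^3)
        f5_hat = self.down_p3(p3)                               # Eq. (5): F^5 = Downsample(Cat(F3', F^3))
        p5 = torch.cat([f5, f5_hat], dim=1)                     # Eq. (6): P5 = Cat(F5, F^5)
        return p2, p3, p5

# Example with a 640x640 input: F2 (160x160), F3 (80x80), F5 (20x20).
# dpa = DPAFusionSketch(c2=64, c3=128, c5=512)
# p2, p3, p5 = dpa(torch.randn(1, 64, 160, 160),
#                  torch.randn(1, 128, 80, 80),
#                  torch.randn(1, 512, 20, 20))
```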
In the DPA module, high-magnification upsampling is necessary to align and match the spatial dimensions of low-level feature maps, ensuring effective cross-level feature fusion. However, commonly used traditional upsampling methods such as nearest-neighbor interpolation and bilinear interpolation tend to introduce undesirable sampling noise during this process. This noise can lead to distortions in the reconstructed feature maps, particularly impacting the delicate and limited feature representations of small targets. Such perturbations may degrade the network’s ability to accurately identify and localize small objects, as the added noise can obscure subtle spatial details crucial for small target detection.
To effectively alleviate the negative impact caused by high-magnification upsampling noise, this paper proposes the use of a dynamic upsampling operator within the DPA module. Unlike static interpolation methods that rely on fixed kernels and uniform sampling patterns, dynamic upsampling adaptively learns sampling offsets and weights based on the local feature context. This learned approach enables more precise and content-aware feature reconstruction, significantly reducing the introduction of sampling artifacts and preserving the integrity of small target features. By dynamically adjusting the upsampling behavior to the underlying feature structure, the network achieves a more faithful and noise-robust spatial alignment, thereby enhancing the overall detection accuracy for small-scale objects.
In Figure 3, the weights are computed using 1 × 1 convolutional layers ('linear1'/'linear2'), with input channels equal to those of the corresponding FPN feature map, output channels set to $2 \times g \times \mathrm{scale}^2$, and groups = 1. Here, $g$ denotes the number of groups for dynamic computation (set to 4 in our experiments), while $s_H$ and $s_W$ represent the height and width of the local feature patch, corresponding to the spatial size of the input FPN feature map. The initial positional offsets are generated using the specified 'scale' factor. These settings are consistent across all datasets and are chosen to balance computational cost and representational power.
The process of dynamic upsampling is shown in Figure 3. The use of cross-hierarchical dynamic upsampling effectively enhances the overall detection performance of HSFANet. The computation of dynamic upsampling [50] is given in Equation (7):
$X' = \mathrm{gridsample}\left(X, S\right),$  (7)
where $X$ and $X'$ represent the input feature map and the upsampled feature map, respectively, and $S$ is the sampling set, which can be expressed as Equation (8):
$S = G + O,$  (8)
i.e., the sampling set $S$ is the sum of the offset $O$ and the original sampling grid $G$. The offset $O$ is defined in Equation (9):
$O = \mathrm{sigmoid}\left(\mathrm{linear}_1(X)\right) \cdot \mathrm{linear}_2(X).$  (9)
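A minimal PyTorch sketch of Equations (7)–(9), in the spirit of DySample [50], is given below. It assumes a single offset group (g = 1, whereas the paper uses g = 4), interprets the predicted offsets as fractions of the input feature map, and makes an assumption about the channel ordering of the offsets, so it is illustrative rather than the exact operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Minimal content-aware upsampler following Eqs. (7)-(9) (simplified)."""

    def __init__(self, in_channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # 1x1 convolutions 'linear1' and 'linear2' produce 2 * scale^2 offset channels
        # (an x and a y offset for each upsampled sub-position); g = 1 here for clarity.
        self.linear1 = nn.Conv2d(in_channels, 2 * scale ** 2, kernel_size=1)
        self.linear2 = nn.Conv2d(in_channels, 2 * scale ** 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        # Eq. (9): O = sigmoid(linear1(X)) * linear2(X)
        offset = torch.sigmoid(self.linear1(x)) * self.linear2(x)    # (B, 2*s*s, H, W)
        offset = F.pixel_shuffle(offset, s)                          # (B, 2, s*H, s*W)
        offset = offset.permute(0, 2, 3, 1)                          # (B, s*H, s*W, 2)
        offset = offset / torch.tensor([w, h], device=x.device)      # to normalized units
        # Original sampling grid G in normalized [-1, 1] coordinates for grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, s * h, device=x.device),
            torch.linspace(-1.0, 1.0, s * w, device=x.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        sampling = grid + offset                                     # Eq. (8): S = G + O
        # Eq. (7): X' = gridsample(X, S)
        return F.grid_sample(x, sampling, mode="bilinear", align_corners=True)
```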

3.2. Scale-Sensitive Loss

The CIoU loss demonstrates strong performance when applied to medium and large objects, effectively guiding the regression of bounding boxes in these cases. However, small objects exhibit much higher sensitivity to slight perturbations in their bounding box coordinates due to their limited pixel size and fewer distinctive features. Consequently, the CIoU loss, while suitable for larger objects, tends to be less effective for accurately regressing the positions of small objects. This limitation poses minimal impact in conventional detection scenarios dominated by medium and large targets, where minor localization errors are often negligible relative to object size.
In contrast, in UAV aerial imagery scenarios, small objects constitute a substantially higher proportion of the detected targets. In such contexts, even slight perturbations or inaccuracies in bounding box regression can lead to significant performance degradation, as these small deviations represent a larger fraction of the object's spatial extent. This sensitivity to bounding box fluctuations makes accurate localization of small objects a critical challenge, one that the traditional CIoU loss struggles to overcome.
To address this issue, we introduce the Normalized Wasserstein Distance (NWD) metric [31] as a complementary measure for bounding box regression loss. NWD assesses the similarity between predicted and ground truth bounding boxes by considering their spatial distributions as probability measures, making it inherently more robust to minor positional shifts and scale variations typical in small object detection. We then design a combined regression loss function that integrates NWD with the CIoU loss [52], leveraging the strengths of both metrics.
Furthermore, a scale-aware adaptive weighting function λ ( s ) is introduced to dynamically balance the relative contribution of the NWD and CIoU components within the combined loss. This adaptive weighting mechanism allows the model to place more emphasis on NWD when learning from small objects, effectively increasing the proportion of positive samples influenced by the SSL during training. As a result, the model can more effectively learn the subtle features and precise localization cues of small targets, leading to improved regression accuracy and overall detection performance in challenging UAV aerial imagery environments.
According to the definition of the Wasserstein distance in optimal transport theory [53], for Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ modeled from the bounding boxes $\mathbf{a} = \left(c_{x_a}, c_{y_a}, \frac{w_a}{2}, \frac{h_a}{2}\right)$ and $\mathbf{b} = \left(c_{x_b}, c_{y_b}, \frac{w_b}{2}, \frac{h_b}{2}\right)$, the Wasserstein distance between the two distributions can be expressed as
$W_2^2\left(\mathcal{N}_a, \mathcal{N}_b\right) = \left\| \mathbf{a}^{\mathrm{T}} - \mathbf{b}^{\mathrm{T}} \right\|_2^2,$  (10)
where $\|\cdot\|_2$ denotes the Frobenius (Euclidean) norm.
Since the range of $W_2^2\left(\mathcal{N}_a, \mathcal{N}_b\right)$ is $[0, +\infty)$, it cannot be directly used as a similarity measure for label assignment. To align its range with that of IoU, which lies between 0 and 1, we apply an exponential nonlinear transformation to map the Gaussian Wasserstein distance to another space, normalizing its range to $(0, 1]$. This yields the NWD, as shown in Equation (11):
$\mathrm{NWD}\left(\mathcal{N}_a, \mathcal{N}_b\right) = \exp\left(-\frac{\sqrt{W_2^2\left(\mathcal{N}_a, \mathcal{N}_b\right)}}{C}\right),$  (11)
where C is a constant.
Therefore, the loss function based on NWD can be defined as shown in Equation (12):
$L_{nwd} = 1 - \mathrm{NWD}\left(\mathcal{N}_p, \mathcal{N}_g\right).$  (12)
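To make Equations (10)–(12) concrete, the following small PyTorch sketch computes the NWD similarity and the corresponding loss for matched prediction/ground-truth boxes in (cx, cy, w, h) format; the constant C = 1.6 follows the value reported in Section 4.3, and the function names are ours.

```python
import torch

def nwd(pred: torch.Tensor, target: torch.Tensor, C: float = 1.6) -> torch.Tensor:
    """NWD similarity between paired boxes in (cx, cy, w, h) format, Eqs. (10)-(11)."""
    # Each box is modeled as a 2-D Gaussian summarized by the vector (cx, cy, w/2, h/2).
    a = torch.cat([pred[:, :2], pred[:, 2:4] / 2.0], dim=1)
    b = torch.cat([target[:, :2], target[:, 2:4] / 2.0], dim=1)
    w2 = ((a - b) ** 2).sum(dim=1)           # squared 2nd-order Wasserstein distance, Eq. (10)
    return torch.exp(-torch.sqrt(w2) / C)    # Eq. (11)

def nwd_loss(pred: torch.Tensor, target: torch.Tensor, C: float = 1.6) -> torch.Tensor:
    """NWD regression loss, Eq. (12)."""
    return 1.0 - nwd(pred, target, C)
```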
The definition of CIoU [52] is shown in Equation (13):
$L_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b_g\right)}{c^2} + \alpha v,$  (13)
where $\alpha$ is defined as Equation (14):
$\alpha = \frac{v}{\left(1 - IoU\right) + v},$  (14)
and $v$ is denoted as Equation (15):
$v = \frac{4}{\pi^2}\left(\arctan\frac{w_g}{h_g} - \arctan\frac{w_p}{h_p}\right)^2.$  (15)
The regression loss function in this paper is defined as shown in Equation (16):
$L = \lambda(s)\, L_{CIoU} + \left(1 - \lambda(s)\right) L_{nwd},$  (16)
where $\lambda(s)$ is a scale-aware adaptive weighting function that adjusts the weight between the CIoU and NWD losses. To adapt the loss function to object size, we design a smooth task-adaptive weighting function defined as
$\lambda(s) = \frac{1}{1 + e^{-\alpha\left(s - s_0\right)}},$  (17)
where $s = \sqrt{wh}$ denotes the object size computed from the width $w$ and height $h$ of the bounding box. The threshold $s_0$ serves as the boundary between small and large objects, and $\alpha$ controls the sharpness of the transition. A larger $\alpha$ approximates a step function while maintaining gradient continuity, which facilitates stable training.
In Equation (17), $s = \sqrt{wh}$ is the absolute pixel-based size of the object, where $w$ and $h$ are the width and height of the bounding box in image pixels; it is not normalized relative to the input resolution or FPN level. While our method leverages multi-scale features from the FPN, the Scale-Sensitive Loss (SSL) automatically handles scale variations across pyramid levels, allowing small objects to receive higher relative weighting without manually adjusting for different feature scales.
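The combined loss of Equations (16) and (17) can be sketched as follows. The sketch assumes matched (cx, cy, w, h) box pairs in absolute pixels, uses torchvision's complete_box_iou_loss as a stand-in for the CIoU term of Equation (13), and adopts the hyperparameters of Section 4.3 (α = 13, s₀ = 32, C = 1.6); it is illustrative rather than the exact training code.

```python
import torch
from torchvision.ops import box_convert, complete_box_iou_loss

def scale_sensitive_loss(pred, target, alpha: float = 13.0, s0: float = 32.0, C: float = 1.6):
    """Sketch of the combined regression loss in Eqs. (16)-(17)."""
    # NWD term, Eqs. (10)-(12): boxes summarized as Gaussian vectors (cx, cy, w/2, h/2).
    a = torch.cat([pred[:, :2], pred[:, 2:4] / 2.0], dim=1)
    b = torch.cat([target[:, :2], target[:, 2:4] / 2.0], dim=1)
    w2 = ((a - b) ** 2).sum(dim=1)
    l_nwd = 1.0 - torch.exp(-torch.sqrt(w2 + 1e-12) / C)
    # CIoU term via torchvision (expects corner xyxy format).
    l_ciou = complete_box_iou_loss(
        box_convert(pred, in_fmt="cxcywh", out_fmt="xyxy"),
        box_convert(target, in_fmt="cxcywh", out_fmt="xyxy"),
        reduction="none",
    )
    # Eq. (17): lambda(s) with s = sqrt(w * h) of the ground-truth box;
    # small objects (s < s0) drive the weight toward the NWD term.
    s = torch.sqrt(target[:, 2] * target[:, 3])
    lam = torch.sigmoid(alpha * (s - s0))
    # Eq. (16): L = lambda(s) * L_CIoU + (1 - lambda(s)) * L_nwd
    return (lam * l_ciou + (1.0 - lam) * l_nwd).mean()
```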

4. Experiments

This section first introduces the benchmark datasets, evaluation metrics, and specific implementation details used in our experiments. Then, we conduct a series of ablation studies to thoroughly evaluate the individual contributions and effectiveness of the proposed components, including the feature aggregation mechanism, backbone architecture, and loss function design. To further verify the effectiveness of the proposed method, we perform comprehensive comparisons with state-of-the-art object detection approaches. Additionally, to assess the generalization capability of our algorithm, we conduct experiments across different datasets. The results, presented from both quantitative evaluation metrics and qualitative visualizations, comprehensively demonstrate the superior performance of our method in complex aerial object detection scenarios.

4.1. Datasets

The experiments in this paper use two aerial image datasets: VisDrone and UAVDT.
The VisDrone [54] dataset comprises videos and images captured by various drone platforms operating across fourteen different cities, covering a wide range of real-world scenarios, such as urban streets, residential areas, and public squares. Specifically, the VisDrone subset includes a total of 10,209 high-resolution static images, among which 6471 images are designated for training, 548 for validation, and 3190 for testing. The dataset spans 10 object categories, including pedestrians, vehicles, bicycles, and other common targets observed from aerial perspectives. In Figure 4, the distributions of instance counts across different categories and object sizes are presented. The image resolutions vary significantly, ranging from 960 × 540 to 2000 × 1500 pixels, reflecting the diversity in scene complexity and altitude.
The UAVDT [55] dataset comprises a total of 38,327 images with an average resolution of 1080 × 540 pixels. It primarily focuses on vehicle detection tasks and includes three object categories: car, bus, and truck.
Figure 5 presents an overview of the number of instances per category and the distribution of objects across different size ranges. The dataset is divided into 23,258 images for training and 15,069 images for testing. Captured from real-world urban traffic surveillance scenes under various weather and lighting conditions, UAVDT presents considerable challenges such as small object sizes, frequent occlusions, and dynamic backgrounds. These characteristics make it a valuable benchmark for evaluating the robustness and generalization ability of object detection algorithms, especially in aerial scenarios.

4.2. Evaluation Metrics

We use AP, AP 50 , and AP 75 as evaluation metrics. AP represents the average precision, while AP 50 and AP 75 denote the average precision calculated at IoU thresholds of 0.5 and 0.75, respectively. To assess detection performance for objects of different sizes, we use AP s , AP m , and AP l to evaluate the average precision for small, medium, and large objects, respectively. Additionally, we provide the average precision for each object category to evaluate the model’s detection performance for individual classes. In order to gain deeper insights into the model’s performance across various object categories within the dataset, we incorporate precision (P) and recall (R) to evaluate the detection effectiveness. True Positives (TPs) represent the cases where the model correctly predicts a positive instance. False Negatives (FNs) occur when the actual label is positive, but the model incorrectly predicts it as negative. False Positives (FPs) refer to instances where the actual label is negative, but the model mistakenly classifies it as positive. True Negatives (TNs) correspond to situations where both the actual label and the model’s prediction are negative. The formulas for computing precision and recall are given as follows:
$P = \frac{TP}{TP + FP}$  (18)
$R = \frac{TP}{TP + FN}$  (19)
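For completeness, a minimal helper evaluating these two formulas from raw counts is given below; the function name is illustrative, and the counting of TP/FP/FN from IoU-based matching is assumed to happen upstream.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from matched detection counts, per Eqs. (18)-(19)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```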

4.3. Implementation Details

We conducted the HSFANet experiments using 8 NVIDIA GeForce RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of memory each. The experiments were implemented in Python 3.9.19, with PyTorch 2.0.1 as the deep learning framework, and accelerated using CUDA 11.7.0. The specific configurations of the experimental environment are shown in Table 1. The input images were RGB images with dimensions of 1024 × 1024 pixels. The model was trained using the Adam optimizer with a batch size of 16 for 200 epochs. We applied a cosine learning rate decay schedule, starting from 0.001 and decreasing to 1 × 10⁻⁶, with a linear warm-up for the first 3 epochs. Common data augmentation techniques were employed, including random horizontal flipping, random scaling, random cropping, color jittering, and mosaic augmentation, which improve generalization, particularly for small objects. Our method follows an anchor-free design with SimOTA label assignment, standard NMS (IoU threshold 0.5), and a confidence score threshold of 0.2. No test-time augmentation (TTA) was used, ensuring a fair comparison across methods. These settings are fully specified to facilitate reproducibility of our experiments. The key hyperparameters of the proposed SSL were set as follows: C = 1.6 (Equation (11)), α = 13, and s₀ = 32 (Equation (17)). These values were applied consistently across both the VisDrone and UAVDT datasets, without dataset-specific tuning, as they yielded stable performance in all experiments. Reporting these hyperparameters explicitly ensures reproducibility and facilitates fair comparisons.
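As an illustration of the schedule described above, the following sketch builds a per-epoch cosine decay with linear warm-up in PyTorch; it is an assumed reimplementation (the function and variable names are ours), and the paper's actual training script may differ.

```python
import math
import torch

def build_cosine_warmup_scheduler(optimizer, total_epochs: int = 200, warmup_epochs: int = 3,
                                  lr_start: float = 1e-3, lr_end: float = 1e-6):
    """Cosine decay from lr_start to lr_end with a linear warm-up, applied per epoch."""
    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs                    # linear warm-up
        t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        cos = 0.5 * (1.0 + math.cos(math.pi * t))                 # cosine factor, 1 -> 0
        return (lr_end + (lr_start - lr_end) * cos) / lr_start    # multiplier on the base lr
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage (assumed setup): optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = build_cosine_warmup_scheduler(optimizer); call scheduler.step() once per epoch.
```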
For the other comparison methods, we directly cite the numbers reported in their original papers, since official code or complete training configurations are not always publicly available. Accordingly, the input resolution, data augmentation strategies, and training schedules follow those reported in the respective works.
We acknowledge that differences in the original experimental setups may lead to minor inconsistencies across methods. However, all cited approaches are widely recognized and evaluated on the same benchmarks (VisDrone and UAVDT), which ensures the fairness and validity of the comparison.

4.4. Main Results

VisDrone Results. The performance of the proposed algorithm is comprehensively evaluated and compared against several state-of-the-art related methods on the challenging VisDrone dataset, with results summarized in Table 2. We focus on a variety of widely accepted evaluation metrics, including AP, AP 50 , AP 75 , as well as AP s , AP m , and AP l , to provide a detailed and nuanced comparison across different object scales.
As shown in Table 2, our proposed method achieves superior detection results across multiple metrics. Specifically, it surpasses the second-best performing algorithm by 1.3% in overall AP, indicating a consistent improvement in detection accuracy. Furthermore, notable gains are observed in AP 75 , where our method exceeds the runner-up by 3.0%, reflecting its enhanced capability for precise localization. Improvements in scale-specific metrics are also significant: AP s is increased by 2.2% and AP m by 4.0%, demonstrating that HSFANet is particularly effective in detecting small and medium-sized objects, which are prevalent in UAV imagery. The visualization of HSFANet’s detection results on the VisDrone dataset is shown in Figure 6.
A key strength of HSFANet lies in its ability to substantially enhance small object detection performance, which is a challenging aspect for many existing detectors. Importantly, our approach achieves these gains without relying on excessively deep or computationally heavy backbone networks. This indicates that HSFANet effectively mitigates feature information loss and narrows the semantic gap between high-level semantic features and low-level detailed feature maps, facilitating more accurate and robust multi-scale feature fusion.
In Table 3, we present the performance of HSFANet across different categories on the VisDrone dataset. It is evident that the categories with relatively low AP are awning-tricycle, bicycle, and tricycle, which are typically composed of densely distributed small targets. This observation highlights that small object detection remains a challenging aspect in object detection tasks. Therefore, improving the detection performance for small objects is of great significance to advancing overall object detection capabilities.
Through both quantitative and qualitative evaluations of HSFANet’s performance, we find that it is highly adapted to UAV aerial scenarios. This contributes to improving the accuracy of object detection and recognition in UAV applications and holds significant importance for enhancing UAVs’ autonomous understanding of ground targets and scenes from high altitudes.
UAVDT Results. We conduct a thorough comparison of the proposed HSFANet algorithm against several recent and relevant methods on the UAVDT dataset, with detailed results presented in Table 4. To maintain a rigorous and consistent evaluation protocol, we adopt the widely used COCO standard for object size definitions: small objects have an area smaller than 32 × 32 pixels, medium objects have an area between 32 × 32 and 96 × 96 pixels, and large objects have an area of at least 96 × 96 pixels. Our algorithm consistently outperforms the second-best-performing approach by 0.3% in overall AP, 2% in AP 50, and an impressive 16.7% in AP s. The most significant improvement is observed in the detection of small objects, which highlights the effectiveness of HSFANet for enhancing small target representations and overcoming the inherent challenges posed by limited pixel information and feature loss.
HSFANet demonstrates significant advantages in object detection tasks from UAV perspectives, achieving notably superior performance on the UAVDT benchmark dataset. The network effectively addresses inherent challenges in UAV-captured imagery, including severe scale variations (from extremely small to relatively large objects), dense object distributions, and complex background clutter, through its hierarchical scale-sensitive feature aggregation mechanism. Evaluated on the UAVDT benchmark, HSFANet exhibits outstanding results across key metrics, significantly outperforming numerous mainstream object detection models. These results robustly validate the efficacy of HSFANet's design in accurately localizing and identifying diverse targets within UAV aerial images. Consequently, HSFANet provides a powerful and reliable technical foundation for practical UAV-based applications such as traffic monitoring and security surveillance. The visualization of HSFANet's detection results on the UAVDT dataset is shown in Figure 7.
Furthermore, the superior results obtained on the UAVDT dataset demonstrate not only the excellent performance of the proposed architecture but also its strong generalization ability across different aerial imagery scenarios. The consistent improvements in small object detection indicate that HSFANet can effectively handle the diverse and complex conditions typical of UAV-captured datasets, making it a promising solution for real-world aerial surveillance and monitoring applications.

4.5. Ablation Studies

To systematically verify the effectiveness of each component in our proposed HSFANet architecture, we conducted a comprehensive ablation study on the VisDrone dataset. The experimental results, detailed in Table 5, were obtained by incrementally integrating the DPA module and the SSL into the baseline network.
From the results, it is evident that the addition of the DPA module led to a substantial improvement in the detection of small objects, with the AP s metric increasing significantly by 4.8%. This demonstrates the crucial role of DPA in enhancing feature fusion and preserving fine-grained spatial details necessary for small target detection.
Furthermore, when SSL was incorporated, the AP s further increased by 1.7%, indicating that the scale-aware loss function complements the feature-level improvements by better guiding the model to learn small object representations during training. In total, these two modules collectively contributed to a remarkable 6.0% gain in AP s , alongside a 5.0% improvement in the overall AP.
As shown in Table 5, the final model exhibits a slight decrease in large-object AP ( AP l ) from 58.8 to 55.4 compared to the baseline. This reflects an inherent trade-off in scale-sensitive designs: the SSL emphasizes smaller objects by assigning them higher relative weight, and the DPA module enhances local feature aggregation for small targets. Consequently, features for large objects may receive slightly less attention, leading to a modest reduction in AP l . Potential strategies to mitigate this include adjusting the slope of λ ( s ) , balancing losses across different FPN levels, and applying multi-scale feature regularization. These considerations will be explored in future work to better balance performance across all object scales.
The visual detection results comparing the baseline and HSFANet models are presented in Figure 8. Within the zoomed-in windows, it is clearly observed that the baseline model fails to detect two small targets, whereas HSFANet accurately localizes all present objects. This qualitative comparison vividly illustrates the enhanced regression precision and robustness achieved by our method. Such improvements highlight the ability of HSFANet to effectively capture and preserve crucial spatial information during feature aggregation.
Additionally, Figure 9 displays the normalized confusion matrices of both the baseline model and HSFANet. Examining the diagonal elements, it is clear that HSFANet attains higher classification accuracy across all object categories. This improvement underscores the efficacy of the DPA module combined with the customized SSL function in mitigating feature information loss and semantic discrepancies. Consequently, the classification confidence for each category is significantly increased, contributing to the overall superior detection results.
As illustrated by the comparison of the PR curves in Figure 10, the proposed HSFANet significantly outperforms the baseline model across most object categories, particularly for typical small objects such as pedestrians, bicycles, and motorcycles. The PR curves of HSFANet are consistently higher and smoother, indicating better precision–recall trade-offs and more stable performance across different confidence thresholds. These improvements are attributed to the enhanced capability of the DPA module in detail preservation and efficient feature aggregation, as well as the SSL function’s effectiveness in mitigating the issue of target scale imbalance. Overall, HSFANet demonstrates superior detection performance and robustness in complex UAV and remote sensing scenarios, validating the effectiveness of the proposed design.
Specifically, HSFANet contains 63.9M parameters, requires 365.5 GFLOPs for an input resolution of 1024 × 1024, runs at 19.7 FPS on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), and consumes approximately 3.8 GB GPU memory during inference.
Although our laboratory does not currently have onboard UAV devices to directly measure deployment performance, the above results demonstrate that HSFANet achieves a practical balance between accuracy and efficiency on high-resolution UAV datasets. We will further explore onboard deployment in future work to provide more comprehensive real-world runtime evaluations.
In summary, HSFANet achieves leading performance in object detection by effectively integrating the DPA module and the SSL loss function. This integration enables efficient cross-level aggregation of deep and shallow feature information and mitigates the issue of target scale imbalance. As a result, the model’s ability to focus on and accurately detect small objects is significantly enhanced while maintaining effective loss measurement across targets of varying scales.

5. Conclusions

In this paper, we propose HSFANet, a novel small object detection framework specifically designed for UAV aerial imagery. To address the challenges faced by small objects in terms of limited resolution, feature degradation, and scale imbalance, HSFANet integrates two key components: the DPA module and the SSL function. The DPA module facilitates effective cross-level feature fusion by combining high-level semantic information with low-level spatial details, significantly enhancing the localization capability for small objects. Meanwhile, the SSL introduces a scale-aware regression strategy to balance the small object detection outputs across hierarchical prediction heads, thereby effectively improving the performance of small object detection. Extensive experiments conducted on the VisDrone and UAVDT benchmarks demonstrate the superior, state-of-the-art (SOTA) performance of HSFANet. These results confirm that HSFANet offers an effective and practical solution for small object detection in complex aerial scenarios. Its enhanced detection capability, efficiency, and robustness make it a promising candidate for real-world applications in UAV-based surveillance, monitoring, and remote sensing tasks.
In future work, we plan to extend HSFANet to multi-modal scenarios such as RGB-Thermal fusion and LiDAR-visual fusion, enabling more robust performance under adverse conditions like low light or occlusion. In addition, we will explore lightweight deployment on edge devices, such as UAV onboard systems, to further improve inference speed and adapt the framework for real-time applications.

Author Contributions

Conceptualization, H.Z.; methodology, H.Z.; software, S.W.; validation, H.Z.; formal analysis, H.Z.; investigation, Y.Z. (Yangfu Zhu) and S.Y.; resources, Z.O.; data curation, S.W.; writing—original draft preparation, H.Z.; writing—review and editing, S.Y., Z.O., M.S., Y.Z. (Yifan Zhu), Y.Z. (Yangfu Zhu) and H.L.; visualization, H.Z. and Y.G.; supervision, Z.O.; project administration, S.Y., Z.O., M.S. and Y.Z. (Yifan Zhu); funding acquisition, S.Y., Z.O., M.S. and Y.Z. (Yifan Zhu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid Hebei Information and Telecommunication Branch kj2023-093, the Science and Technology Project of State Grid Hebei Information and Telecommunication Branch kj2023-039, the National Key Research and Development Program of China under Grant 2024YFC3308500, the National Natural Science Foundation of China under Grant 62402055, and the National Natural Science Foundation of China under Grant 62406036.

Data Availability Statement

The original data presented in this study are openly available in the VisDrone dataset at https://github.com/VisDrone/VisDrone-Dataset (accessed on 26 January 2025) and in the UAVDT dataset at https://sites.google.com/view/grli-uavdt/ (accessed on 28 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.M. YOLO-MS: Rethinking multi-scale representation learning for real-time object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef]
  2. Zhou, G.; Qian, L.; Gamba, P. A novel iterative self-organizing pixel matrix entanglement classifier for remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5407121. [Google Scholar] [CrossRef]
  3. Li, M.; Jia, T.; Wang, H.; Ma, B.; Lu, H.; Lin, S.; Cai, D.; Chen, D. Ao-detr: Anti-overlapping detr for X-ray prohibited items detection. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 12076–12090. [Google Scholar] [CrossRef]
  4. Guo, Q.; Xie, K.; Ye, W.; Zhou, T.; Xu, S. A Sparse Bayesian Learning Method for Moving Target Detection and Reconstruction. IEEE Trans. Instrum. Meas. 2025, 74, 4505413. [Google Scholar] [CrossRef]
  5. Zhuang, J.; Chen, W.; Guo, B.; Yan, Y. Infrared weak target detection in dual images and dual areas. Remote Sens. 2024, 16, 3608. [Google Scholar] [CrossRef]
  6. Wang, Z.; Wang, C.; Li, X.; Xia, C.; Xu, J. MLP-Net: Multi-layer perceptron fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5601313. [Google Scholar] [CrossRef]
  7. Chen, X.; Cui, J.; Liu, Y.; Zhang, X.; Sun, J.; Ai, R.; Gu, W.; Xu, J.; Lu, H. Joint scene flow estimation and moving object segmentation on rotational LiDAR data. IEEE Trans. Intell. Transp. Syst. 2024, 25, 17733–17743. [Google Scholar] [CrossRef]
  8. Chen, G.; Jia, Y.; Yin, Y.; Fu, S.; Liu, D.; Wang, T. Remote sensing image dehazing using a wavelet-based generative adversarial networks. Sci. Rep. 2025, 15, 3634. [Google Scholar] [CrossRef]
  9. Wang, B.; Yang, M.; Cao, P.; Liu, Y. A novel embedded cross framework for high-resolution salient object detection. Appl. Intell. 2025, 55, 277. [Google Scholar] [CrossRef]
  10. Zhou, G.; Liu, W.; Zhu, Q.; Lu, Y.; Liu, Y. ECA-MobileNetV3 (Large)+ SegNet model for binary sugarcane classification of remotely sensed images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4414915. [Google Scholar] [CrossRef]
  11. Zhou, G.; Wang, Q.; Huang, Y.; Tian, J.; Li, H.; Wang, Y. True2 orthoimage map generation. Remote Sens. 2022, 14, 4396. [Google Scholar] [CrossRef]
  12. Liao, H.; Xia, J.; Yang, Z.; Pan, F.; Liu, Z.; Liu, Y. Meta-learning based domain prior with application to optical-ISAR image translation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 7041–7056. [Google Scholar] [CrossRef]
  13. Xu, X.; Fu, X.; Zhao, H.; Liu, M.; Xu, A.; Ma, Y. Three-dimensional reconstruction and geometric morphology analysis of lunar small craters within the patrol range of the Yutu-2 Rover. Remote Sens. 2023, 15, 4251. [Google Scholar] [CrossRef]
  14. Xiong, X.; He, M.; Li, T.; Zheng, G.; Xu, W.; Fan, X.; Zhang, Y. Adaptive feature fusion and improved attention mechanism-based small object detection for UAV target tracking. IEEE Internet Things J. 2024, 11, 21239–21249. [Google Scholar] [CrossRef]
  15. Alshehri, M.; Zahoor, L.; AlQahtani, Y.; Alshahrani, A.; AlHammadi, D.A.; Jalal, A.; Liu, H. Unmanned aerial vehicle based multi-person detection via deep neural network models. Front. Neurorobotics 2025, 19, 1582995. [Google Scholar] [CrossRef]
  16. Zhao, X.; Wang, T.; Li, Y.; Zhang, B.; Liu, K.; Liu, D.; Wang, C.; Snoussi, H. Target-driven visual navigation by using causal intervention. IEEE Trans. Intell. Veh. 2023, 9, 1294–1304. [Google Scholar] [CrossRef]
  17. Zeng, S.; Yang, W.; Jiao, Y.; Geng, L.; Chen, X. SCA-YOLO: A new small object detection model for UAV images. Vis. Comput. 2024, 40, 1787–1803. [Google Scholar] [CrossRef]
  18. Zhang, R.; Wang, Y.; Li, Z.; Ding, F.; Wei, C.; Wu, M. Online Adaptive Keypoint Extraction for Visual Odometry Across Different Scenes. IEEE Robot. Autom. Lett. 2025, 10, 7539–7546. [Google Scholar] [CrossRef]
  19. Rekavandi, A.M.; Xu, L.; Boussaid, F.; Seghouane, A.K.; Hoefs, S.; Bennamoun, M. A guide to image-and video-based small object detection using deep learning: Case study of maritime surveillance. IEEE Trans. Intell. Transp. Syst. 2025, 26, 2851–2879. [Google Scholar] [CrossRef]
  20. Wang, L.; Fu, Q.; Zhu, R.; Liu, N.; Shi, H.; Liu, Z.; Li, Y.; Jiang, H. Research on high precision localization of space target with multi-sensor association. Opt. Lasers Eng. 2025, 184, 108553. [Google Scholar] [CrossRef]
  21. Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, L.; Zhu, Y. OWRT-DETR: A Novel Real-Time Transformer Network for Small Object Detection in Open Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313. [Google Scholar] [CrossRef]
  22. Wang, T.; Li, J.; Wu, H.N.; Li, C.; Snoussi, H.; Wu, Y. ResLNet: Deep residual LSTM network with longer input for action recognition. Front. Comput. Sci. 2022, 16, 166334. [Google Scholar] [CrossRef]
  23. Li, D.; Tong, S.; Yang, H.; Hu, Q. Time-synchronized control for spacecraft reorientation with time-varying constraints. IEEE/ASME Trans. Mechatron. 2024, 30, 2073–2083. [Google Scholar] [CrossRef]
  24. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual Conference, 22 February–1 March 2022; Volume 36, pp. 1026–1033. [Google Scholar]
  25. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
  26. Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  27. You, J.; Kim, Y.K. Up-sampling method for low-resolution LiDAR point cloud to enhance 3D object detection in an autonomous driving environment. Sensors 2022, 23, 322. [Google Scholar] [CrossRef]
  28. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  29. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  30. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. α-IoU: A family of power intersection over union losses for bounding box regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242. [Google Scholar]
  31. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [Google Scholar] [CrossRef]
  32. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
  33. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale feature fusion small object detection network for UAV aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214. [Google Scholar] [CrossRef]
  34. Le Jeune, P.; Bahaduri, B.; Mokraoui, A. A comparative attention framework for better few-shot object detection on aerial images. Pattern Recognit. 2025, 161, 111243. [Google Scholar] [CrossRef]
  35. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef] [PubMed]
  36. Xia, C.; Gao, H.; Yang, W.; Yu, J. MSDT: Multiscale Diffusion Transformer for Multimodality Image Fusion. IEEE Trans. Emerg. Top. Comput. Intell. 2025, 9, 2269–2283. [Google Scholar] [CrossRef]
  37. Yan, R.; Yan, L.; Geng, G.; Cao, Y.; Zhou, P.; Meng, Y. ASNet: Adaptive semantic network based on transformer–CNN for salient object detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608716. [Google Scholar] [CrossRef]
  38. Xie, Y.; Liu, S.; Chen, H.; Cao, S.; Zhang, H.; Feng, D.; Wan, Q.; Zhu, J.; Zhu, Q. Localization, balance and affinity: A stronger multifaceted collaborative salient object detector in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 63, 4700117. [Google Scholar] [CrossRef]
  39. Ye, X.; Xu, C.; Zhu, H.; Xu, F.; Zhang, H.; Yang, W. Density-Aware DETR with Dynamic Query for End-to-End Tiny Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13554–13569. [Google Scholar] [CrossRef]
  40. Yang, Z.; Li, Q.; Yuan, Y.; Wang, Q. HCNet: Hierarchical feature aggregation and cross-modal feature alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5624711. [Google Scholar] [CrossRef]
  41. Liao, Y.; Peng, C.; Li, X.; Wang, X.; Deng, Y. HRGA-Net: Hierarchical Rotation Gaussian Attention Network for Accurate Insulator Detection from UAV Images. IEEE Trans. Power Deliv. 2025. [Google Scholar] [CrossRef]
  42. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A denoising FPN with Transformer R-CNN for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415. [Google Scholar] [CrossRef]
  43. Feng, Y.; Huang, J.; Du, S.; Ying, S.; Yong, J.H.; Li, Y.; Ding, G.; Ji, R.; Gao, Y. Hyper-YOLO: When visual object detection meets hypergraph computation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2388–2401. [Google Scholar] [CrossRef] [PubMed]
  44. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection. arXiv 2024, arXiv:2412.10116. [Google Scholar] [CrossRef]
  45. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2184–2189. [Google Scholar]
  46. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604. [Google Scholar]
  47. Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended feature pyramid network for small object detection. IEEE Trans. Multimed. 2021, 24, 1968–1979. [Google Scholar] [CrossRef]
  48. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [Google Scholar]
  49. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  50. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
  51. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  52. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  53. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  54. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar]
  55. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  56. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  57. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  58. Zhang, J.; Huang, J.; Chen, X.; Zhang, D. How to fully exploit the abilities of aerial image detectors. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  59. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density map guided object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 190–191. [Google Scholar]
  60. Duan, C.; Wei, Z.; Zhang, C.; Qu, S.; Wang, H. Coarse-grained density map guided object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2789–2798. [Google Scholar]
  61. Wei, Z.; Duan, C.; Song, X.; Tian, Y.; Wang, H. Amrnet: Chips augmentation in aerial images object detection. arXiv 2020, arXiv:2009.07168. [Google Scholar] [CrossRef]
  62. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A global-local self-adaptive network for drone-view object detection. IEEE Trans. Image Process. 2020, 30, 1556–1569. [Google Scholar] [CrossRef]
  63. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970. [Google Scholar]
  64. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13435–13444. [Google Scholar]
  65. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  66. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  67. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  68. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13863–13875. [Google Scholar] [CrossRef]
  69. Chen, Y.; Ye, Z.; Sun, H.; Gong, T.; Xiong, S.; Lu, X. Global-Local Fusion with Semantic Information-Guidance For Accurate Small Object Detection in UAV Aerial Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4701115. [Google Scholar] [CrossRef]
  70. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 379–387. [Google Scholar]
  71. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
Figure 1. Challenges of object detection in aerial images: the scenes contain many small objects that are difficult for detectors to identify.
Figure 2. The overall architecture of HSFANet. Convolutional downsampling during feature extraction loses information about small objects, whereas low-level feature maps suffer less information loss; HSFANet therefore aggregates hierarchical features to preserve small-object cues while still capturing global context. The DPA module enables robust, adaptive fusion of features across multiple scales and layers, strengthening the representation of objects of varying sizes. In addition, the SSL function rebalances the multi-scale prediction heads to reduce the negative impact of scale imbalance on small object detection. Together, these components yield strong small object detection performance.
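For readers who want to relate the diagram to code, the cross-layer interaction performed by the DPA module can be pictured as a gated fusion of a shallow, high-resolution feature map with an upsampled deep feature map. The PyTorch sketch below only illustrates that idea under our own assumptions (the module name, the 1×1 gating convolution, and the bilinear placeholder upsampling are ours); it is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    """Illustrative sketch (not the official DPA code): fuse a deep, semantically rich
    feature map with a shallow, high-resolution one so that small-object evidence
    present in either layer is preserved in the output."""

    def __init__(self, channels: int):
        super().__init__()
        # lightweight gate estimating, per pixel, how much to trust each layer
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # bring the deep feature up to the shallow resolution (the paper uses a
        # dynamic upsampler, see Figure 3; bilinear is only a placeholder here)
        deep_up = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                                align_corners=False)
        # per-pixel weight in (0, 1) deciding the mix of shallow vs. deep evidence
        w = torch.sigmoid(self.gate(torch.cat([shallow, deep_up], dim=1)))
        fused = w * shallow + (1.0 - w) * deep_up
        return self.refine(fused)

# toy usage: fuse a stride-8 map (80x80) with a stride-16 map (40x40)
shallow = torch.randn(1, 256, 80, 80)
deep = torch.randn(1, 256, 40, 40)
fused = CrossLayerFusion(256)(shallow, deep)   # -> (1, 256, 80, 80)
```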
Figure 3. The process of dynamic upsampling. The input features, upsampled features, generated offsets, and the original sampling grid are denoted by X, X′, O, and G, respectively; θ denotes the sigmoid function.
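Read as pseudocode, Figure 3 says: predict offsets O from the input X, squash them with the sigmoid θ, add them to the regular grid G, and resample X at the shifted positions to obtain the upsampled X′. The sketch below follows that reading in the spirit of the learned-sampling upsampler of [50]; the kernel size, offset scaling, and class name are assumptions of ours rather than the paper's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicUpsample(nn.Module):
    """Illustrative sketch of point-wise dynamic upsampling (not the official code)."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # predicts a 2-D offset for each of the scale*scale sub-pixel positions
        self.offset_conv = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        # O: content-aware offsets, squashed to (-0.5, 0.5) by the sigmoid (theta in Figure 3)
        offsets = torch.sigmoid(self.offset_conv(x)) - 0.5            # (B, 2*s*s, H, W)
        offsets = F.pixel_shuffle(offsets, s)                          # (B, 2, s*H, s*W)
        offsets = offsets.permute(0, 2, 3, 1)                          # (B, s*H, s*W, 2)
        offsets = offsets / torch.tensor([w, h], dtype=x.dtype, device=x.device)
        # G: the regular sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1.0, 1.0, h * s, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # X': resample the input at the offset-shifted grid positions
        return F.grid_sample(x, grid + offsets, align_corners=True)

# toy usage: upsample a stride-16 feature map (40x40) to stride-8 resolution (80x80)
x = torch.randn(1, 256, 40, 40)
x_up = DynamicUpsample(256)(x)    # -> (1, 256, 80, 80)
```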
Figure 4. Statistics of the VisDrone dataset: (a) distribution of instance counts across categories; (b) distribution of object instances with respect to their width and height normalized by the input image size.
Figure 5. Statistics of the UAVDT dataset: (a) distribution of instance counts across categories; (b) distribution of object instances with respect to their width and height normalized by the input image size.
Figure 6. Visualization results of HSFANet on the VisDrone dataset. Green boxes indicate the predicted bounding boxes of detected targets.
Figure 7. Visualization results of HSFANet on the UAVDT dataset. Green boxes indicate the predicted bounding boxes of detected targets.
Figure 8. Qualitative results of HSFANet on the VisDrone dataset. The left column shows the ground truth annotations, the middle column displays the baseline predictions, and the right column presents the predictions of HSFANet. To highlight missed detections by the baseline model, certain missed objects are marked with red bounding boxes in the middle column. Green boxes indicate the predicted bounding boxes of detected targets. Only predictions with confidence scores greater than 0.2 are shown. Best viewed in zoom-in windows.
Figure 9. Normalized confusion matrices of the baseline model and HSFANet on the VisDrone dataset. The top image shows the baseline model, and the bottom image shows HSFANet.
Figure 10. PR curves of the baseline model and HSFANet on the VisDrone dataset. The top image shows the baseline model, and the bottom image shows HSFANet.
Table 1. Specific configuration of the experimental environment.

| Environment | Configuration Information |
|---|---|
| Operating system | Ubuntu 22.04.2 |
| CPU | AMD EPYC 7773X 64-Core Processor |
| Number of CPUs | 2 |
| GPU | NVIDIA GA102 (GeForce RTX 3090) |
| GPU memory | 24 GB |
| Number of GPUs | 8 |
| GPU compute platform | CUDA 11.7.0 |
| Python version | Python 3.9.19 |
| Deep learning framework | PyTorch 2.0.1 |
Table 2. Performance comparison on VisDrone. Best results are in bold.

| Method | Year & Venue | Backbone | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|
| FRCNN [56] | 2015 NIPS | ResNeXt101 | 24.4 | 47.8 | 21.8 | 17.8 | 34.8 | 34.3 |
| ClusDet [57] | 2019 ICCV | ResNeXt101 | 32.4 | 56.2 | 31.6 | - | - | - |
| DREN [58] | 2019 ICCVW | ResNeXt152 | 30.3 | - | - | - | - | - |
| DMNet [59] | 2020 CVPRW | ResNeXt101 | 29.4 | 49.3 | 30.6 | 21.6 | 41.0 | 56.7 |
| CDMNet [60] | 2021 ICCVW | ResNeXt101 | 30.7 | 51.3 | 32.0 | 22.2 | 42.4 | 44.7 |
| AMRNet [61] | 2020 arXiv | ResNeXt101 | 32.1 | - | - | 23.2 | 43.9 | **60.5** |
| GLSAN [62] | 2021 TIP | ResNet101 | 32.5 | 55.8 | 30.0 | - | - | - |
| HRDNet [26] | 2021 ICME | ResNeXt50+101 | 35.5 | 62.0 | 35.1 | - | - | - |
| FCOS+SAHI [63] | 2022 ICIP | - | - | 38.5 | - | - | - | - |
| VFNet+SAHI [63] | 2022 ICIP | - | - | 42.2 | - | - | - | - |
| TOOD+SAHI [63] | 2022 ICIP | - | - | 43.5 | - | - | - | - |
| QueryDet [25] | 2022 CVPR | ResNet50 | 28.3 | 48.1 | 28.8 | - | - | - |
| CEASC [64] | 2023 CVPR | ResNet18 | 28.7 | 50.7 | 28.4 | - | - | - |
| UFPMP-Det [24] | 2022 AAAI | ResNeXt101 | 40.1 | **66.8** | 41.3 | - | - | - |
| CenterNet [65] | 2019 arXiv | Hourglass104 | 27.8 | 47.9 | 27.6 | 21.3 | 42.1 | 49.8 |
| YOLOX [66] | 2021 arXiv | CSPv5-M | 27.6 | 47.7 | 27.5 | 17.6 | 41.0 | 46.1 |
| YOLOv9 [67] | 2024 ECCV | GELAN | 29.5 | 49.9 | 29.4 | 20.2 | 41.7 | 47.8 |
| YOLC [68] | 2024 TITS | ResNeXt101 | 36.3 | 60.1 | 37.4 | 28.9 | 47.5 | 51.8 |
| GLSDet [69] | 2025 TGRS | CSPv5-M | 32.3 | 54.2 | 32.8 | 23.3 | 42.9 | 50.2 |
| HSFANet | Ours | CSPDarknet53 | **41.4** | 63.2 | **44.3** | **31.1** | **51.5** | 55.4 |
Note. Results of other methods are cited directly from their original papers, as official code or complete training details are not always publicly available; the reported input resolution, augmentation, and training settings therefore follow the respective works. Although slight inconsistencies may exist across methods, all results are obtained on the same benchmark (VisDrone), which keeps the comparison fair and meaningful.
Table 3. Performance of HSFANet on various categories of the VisDrone dataset.

| Class | Instances | P | R | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| pedestrian | 8844 | 73.7 | 68.7 | 40.4 | 73.7 | 39.2 |
| people | 5125 | 71.0 | 57.7 | 29.8 | 62.0 | 24.7 |
| bicycle | 1287 | 49.8 | 47.9 | 25.5 | 47.2 | 23.2 |
| car | 14,064 | 83.7 | 88.3 | 70.3 | 91.4 | 80.0 |
| van | 1975 | 65.8 | 62.6 | 48.8 | 64.5 | 56.4 |
| truck | 750 | 65.4 | 56.4 | 43.1 | 59.3 | 49.2 |
| tricycle | 1045 | 60.5 | 54.4 | 34.0 | 53.6 | 37.9 |
| awning-tricycle | 532 | 43.4 | 28.8 | 20.3 | 29.9 | 22.7 |
| bus | 251 | 81.9 | 72.0 | 61.2 | 77.8 | 70.1 |
| motor | 4886 | 67.3 | 72.5 | 40.1 | 72.5 | 39.8 |
Table 4. Performance comparison on UAVDT. Best results are in bold.

| Method | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|
| R-FCN [70] | 7.0 | 17.5 | 3.9 | 4.4 | 14.7 | 12.1 |
| SSD [71] | 9.3 | 21.4 | 6.7 | 7.1 | 17.1 | 12.0 |
| FRCNN [56] | 5.8 | 17.4 | 2.5 | 3.8 | 12.3 | 9.4 |
| FRCNN [56]+FPN | 11.0 | 23.4 | 8.4 | 8.1 | 20.2 | 26.5 |
| ClusDet [57] | 13.7 | 26.5 | 12.5 | 9.1 | 25.1 | 31.2 |
| DMNet [59] | 14.7 | 24.6 | 16.3 | 9.3 | 26.2 | 35.2 |
| CDMNet [60] | 16.8 | 29.1 | 18.5 | 11.9 | 29.0 | 15.7 |
| DREN [58] | 17.1 | - | - | - | - | - |
| AMRNet [61] | 18.2 | 30.4 | 19.8 | 10.3 | 31.3 | 33.5 |
| GLSAN [62] | 17.1 | 28.3 | 18.8 | - | - | - |
| CEASC [64] | 17.1 | 30.9 | 17.8 | - | - | - |
| CenterNet [65] | 13.2 | 26.7 | 11.8 | 7.8 | 26.6 | 13.9 |
| UFPMP-Det [24] | 24.6 | 38.7 | **28.0** | - | - | - |
| YOLOX [66] | 15.8 | 27.7 | 16.3 | 10.0 | 27.3 | 24.8 |
| YOLOv9 [67] | 16.1 | 28.1 | 16.5 | 10.6 | 27.9 | 25.3 |
| YOLC [68] | 19.3 | 30.9 | 20.1 | 10.9 | **32.2** | **35.5** |
| GLSDet [69] | 18.3 | 29.8 | 17.6 | 11.8 | 29.3 | 26.8 |
| HSFANet (Ours) | **24.9** | **40.7** | **28.0** | **28.6** | 26.7 | 20.0 |
Note. Results of other methods are cited directly from their original papers, as official code or complete training details are not always publicly available; the reported input resolution, augmentation, and training settings therefore follow the respective works. Although slight inconsistencies may exist across methods, all results are obtained on the same benchmark (UAVDT), which keeps the comparison fair and meaningful.
Table 5. Ablation study on the VisDrone dataset. "Baseline" denotes the baseline model, "DPA" stands for Dynamic Position Aggregation, and "SSL" represents Scale-Sensitive Loss. A check mark (✓) indicates that the corresponding component is used. Best results are in bold.

| Baseline | DPA | SSL | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|
| ✓ | - | - | 36.4 | 56.8 | 34.8 | 25.1 | 47.7 | 58.8 |
| ✓ | ✓ | - | 41.3 | 63.0 | 43.2 | 29.9 | 50.2 | **58.9** |
| ✓ | - | ✓ | 38.4 | 59.4 | 40.8 | 26.8 | 50.4 | 54.3 |
| ✓ | ✓ | ✓ | **41.4** | **63.2** | **44.3** | **31.1** | **51.5** | 55.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
