HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery

Li, Yihang; Yang, Wenzhong; Wang, Liejun; Tao, Xiaoming; Yin, Yabo; Chen, Danny

doi:10.3390/drones8120713

Open AccessArticle

HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery

by

Yihang Li

¹,

Wenzhong Yang

^1,2,*

,

Liejun Wang

¹

,

Xiaoming Tao

³,

Yabo Yin

¹ and

Danny Chen

¹

School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

²

Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China

³

Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

^*

Author to whom correspondence should be addressed.

Drones 2024, 8(12), 713; https://doi.org/10.3390/drones8120713

Submission received: 16 October 2024 / Revised: 17 November 2024 / Accepted: 27 November 2024 / Published: 28 November 2024

Download

Browse Figures

Versions Notes

Abstract

Current mainstream computer vision algorithms focus on designing suitable network architectures and loss functions to fit training data. However, the accuracy of small object detection remains lower than for other scales, and the design of convolution operators limits the model’s performance. For UAV small object detection, standard convolutions, due to their fixed kernel size, cannot adaptively capture small object spatial information. Many convolutional variants have scattered sampling points, leading to blurred boundaries and reduced accuracy. In response, we propose HawkEye Conv (HEConv), which utilizes stable sampling and dynamic offsets with random selection. By varying the convolution kernel design, HEConv reduces the accuracy gap between small and larger objects while offering multiple versions and plug-and-play capabilities. We also develop HawkEye Spatial Pyramid Pooling and Gradual Dynamic Feature Pyramid Network modules to validate HEConv. Experiments on the RFRB agricultural and VisDrone2019 urban datasets demonstrate that, compared to YOLOv10, our model improves AP₅₀ by 11.9% and 6.2%, AP_S by 11.5% and 5%, and F1-score by 5% and 7%. Importantly, it enhances small object detection without sacrificing large object accuracy, thereby reducing the multi-scale performance gap.

Keywords:

UAV small object detection HawkEye Conv; convolutional operator; multi-scale feature pyramid network; spatial pyramid pooling

1. Introduction

Researchers in the field of object detection have consistently sought to design efficient detectors for effective small object detection. The YOLO (You Only Look Once) series has dominated real-time single-stage object detection during its respective periods [1,2,3]. To enhance feature extraction capabilities, several modules based on standard convolution have been proposed: the DarkNet structure utilizes residual connections and stride convolutions to improve depth, efficiency, and learning ability, replacing traditional pooling layers [4]; CSPNet optimizes gradient utilization, reduces computational cost, and enhances detection performance with strong generalization in residual frameworks [5]; EfficientRep integrates 3 × 3 and 1 × 1 convolutions during training and re-parameterizes into a single efficient 3 × 3 convolution for inference, improving early-stage feature extraction [6]; and ELAN reduces excessive transition layers to optimize the shortest and longest gradient paths, reducing computational costs [7]. These modules are widely employed in convolutional models. However, due to the inherent characteristics of standard convolution, such models mainly rely on local information, leading to inadequate utilization of global spatial information. Expanding the receptive field by stacking layers may shrink spatial features, causing object boundary blurring and regression localization challenges, ultimately increasing the likelihood of multiple objects being misidentified as a single one.

In contrast, DETR (DEtection TRansformer) series models based on Transformer architecture eliminate the need for the NMS process in convolutional models, improving inference speed and overcoming limitations in global context construction [8,9,10]. This series introduces numerous innovative designs that significantly challenge the dominance of convolutional models in object detection. For example, DINO employs a hybrid query selection strategy, initializing queries with different positions and content to improve convergence [11]. Co-DETR introduces auxiliary heads and training modules, increasing the number of queries and leveraging a one-to-many matching method during training [12], addressing missed detections of dense objects. Although DETR models effectively capture the global dependencies of features from richly sampled targets, they often incur high training costs and underperform in small object detection, mainly due to the scarcity and complexity of small object samples and the lack of local attention mechanisms.

A common issue with these small object detection models is their heavy focus on improving accuracy, often neglecting the inherent relationships and challenges between small objects and those of other scales. This phenomenon can be analyzed from two perspectives:

Performance Analysis of Mainstream Detectors: In real-world scenarios, small objects often appear as blurry shapes due to low resolution, with considerable variations in pose even within the same class. For instance, pedestrians, vehicles, airplanes, and crops exhibit different aspect ratios, presenting further challenges for detection. In the MSCOCO benchmark, small objects account for approximately 40% of typical datasets [13]. Among the top ten models, accuracy for large objects typically exceeds that of small objects by about 30%. On UAV datasets like VisDrone2019, small object samples exceed 60%, while large objects account for only 5% [14]. Despite resolution impacting mean average precision (AP), the relative accuracy between scales remains consistent. For detectors from various families, such as ATSS [15], RCNN [16], YOLO [1], and DETR [17], the average precision for small objects (AP_S) is often less than half of that for medium and large objects, sometimes only one-third. This disparity stems from two primary factors: the diverse scales and shapes of targets in UAV views, which challenge standard convolution’s geometric transformation modeling capabilities, and the low pixel quality of small objects, complicating effective feature extraction, especially in densely occluded areas with background interference.

Improvement Strategies for Small Objects: Current models increasingly adopt data augmentation techniques, such as random cropping and transparency adjustments combined with strategies like mosaic and copy–paste, to enrich sample diversity and enable the model to learn more comprehensive features. Techniques like Mosaic [18], which combines random images to increase target quantity, Mixup [19], which adjusts transparency to blend two images, and Cutout [20], which randomly crops regions for pasting into other images, effectively augment training data. However, the results show that these techniques fail to bridge the significant precision gap across scales, particularly the low accuracy for small objects. Larger objects often benefit more from these augmentations due to pixel area advantages. Attention modules have become a popular optimization strategy, with researchers designing various mechanisms for specific applications. For instance, Ouyang et al. proposed Efficient Multi-channel Attention (EMA), which reshapes part of the channel dimension into the batch dimension to fuse output feature maps from two parallel sub-networks through cross-space learning [21]. Lightweight attention modules, such as MLCA [22], ELA [23], CAA [24], and Efficient Additive Attention [25], reduce model complexity while achieving effective channel and spatial attention. Additionally, Han et al. introduced Agent Attention, a novel four-way attention mechanism aimed at global information modeling while maintaining low computational costs [26]. Challenging Transformers, the Vmamba method achieves linear complexity without sacrificing the global receptive field [27]. Its SS2D structure, the core of Vmamba, processes and transforms input features using linear layers, layer normalization, scan operations, activation functions, and depthwise separable convolutions. The Mamba-YOLO model, based on SS2D, incorporates the ODSSBlock module to comprehensively process image features while ensuring feature continuity and enhancement through residual connections, facilitating effective information propagation and utilization [28]. Although these models show advancements in small object detection, performance results reveal their inability to significantly reduce the precision gap between small and large objects. In other words, these methods primarily benefit large objects, even in scenarios where small objects far outnumber large ones, failing to fully address the challenges of small object detection.

This paper introduces plug-and-play convolution operators along with accompanying methodologies to elevate the detection accuracy of small objects and narrow the performance gap with other scales. The proposed methods are evaluated through experiments conducted on the RFRB [29] and VisDrone2019 datasets [14]. The novel contributions of this study are as follows:

HawkEye Conv: A hybrid convolution operator is designed, combining fixed and dynamic sampling to enhance small object feature extraction. This module, as shown in Figure 1, with plug-and-play capabilities, balances stability and flexibility, achieving superior detection performance.
HawkEye Spatial Pyramid Pooling Method A pooling structure with guided attention is developed to effectively address feature loss during pooling, preserving small object details. A progressive fusion approach enhances multi-level target information.
Gradual Dynamic Feature Pyramid Network (GradDynFPN): A feature pyramid network is proposed, using HawkEye Conv for downsampling, progressively fusing small object features, and dynamically interacting between adjacent layers to improve detection accuracy.

2. Related Work

2.1. Fundamental Modules of Convolutional Models

In target detection models primarily based on convolutional neural networks (CNNs), the convolutional layer Conv serves as the core feature extraction unit. By performing weighted summation, Conv integrates local information into global features, demonstrating both reliability and stability. The Ghost Block employs a linear stacking strategy, capturing rich features through inexpensive linear transformations [30]. Deformable Conv enhances the modeling capability of geometric transformations [31]. The Deformable ConvNets v2 introduces a modulation mechanism to dynamically adjust module outputs, but this approach often results in slower convergence and higher computational requirements [32]. DCNv3 improves spatial aggregation performance through separable convolutions and multiple grouping mechanisms [33], while DCNv4 further optimizes dynamic capabilities [34], achieving three times the inference speed of DCNv3. Haar wavelet downsampling reduces spatial resolution via wavelet transformation, preserving information as much as possible. The CKAN convolution operator, based on the Kolmogorov–Arnold theorem, substitutes traditional linear transformations with learnable nonlinear activations, significantly enhancing small object detection at the cost of increasing the parameter count (from 40.8 M to 88 M) [35]. Although some grouped convolution methods aim to reduce computational load, they face challenges such as stitching errors and poor adaptability, making them difficult to apply in scenarios like drone inspections. In contrast, our designed HEConv, particularly the Diamond variant, simplifies the parameter count (from 40.8 M to 40 M) without altering the channel configuration, achieving stable average precision (AP) improvements across multiple scales.

2.2. Multi-Scale Fusion Module

Feature pyramid networks (FPN) provide a framework for constructing multi-scale feature pyramids, fusing high-resolution low-level features with high-level semantic features to facilitate multi-scale object detection. BiFPN enhances feature fusion efficiency by introducing additional lateral connections and nodes [36]. The Recursive-FPN maximizes information from various layers through the recursive fusion of low- and high-level features, thus improving detection accuracy [37]. SFPN addresses the scale truncation issue of FPN by creating synthetic layers between different scales and optimizing feature fusion outcomes [38]. AFPN incorporates adaptive spatial fusion operations to manage conflicts in feature fusion across different levels; however, it may produce blurring effects during large-span fusion and increase model complexity [39]. HS-FPN employs a selective feature fusion mechanism to synergistically integrate high- and low-level information, enriching feature semantics and particularly aiding in extracting fine details of small targets [40]. Nevertheless, due to the extensive use of attention modules, the HS-FPN architecture is complex and challenging to deploy on drone systems. The ContextGuideFPN module utilizes attention mechanisms to adaptively capture target significance dependencies, employing context attention modules (CxAMs) and content attention modules (CnAMs) to capture discriminative semantics and accurately localize targets [41]. This paper draws inspiration from the analysis of the FPN structure in the work “You Should Look at ALL Objects” [42] and AFPN [39], employing the designed HEConv as the downsampling unit. We innovatively adjusted the fusion interaction layers by incorporating participation from P2 and C5 levels while limiting their participation frequency. This design not only enriches the information within feature maps but also significantly reduces parameter count, decreasing training complexity compared to AFPN and enhancing computational efficiency.

2.3. Spatial Pooling Module

Spatial Pyramid Pooling (SPP) technology addresses image distortion and repetitive feature extraction issues through pooling operations, enabling the extraction of spatial features of varying sizes to enhance the model’s robustness against spatial layouts and object deformations. SPP and its variants (e.g., SPPF [43], SimSPPF [44], RFB [45]) employ different pooling strategies. The SPPCPSC method extracts multi-view features via parallel pooling and optimizes spatial feature processing further by incorporating residual connections [7]. The SPPELAN introduced in YOLOv9 enhances feature extraction efficacy through a combination of standard convolutions, normalization, activation functions, and pooling operations [1]. Our proposed HSPP module employs a three-branch structure consisting of HawkEye Conv, Batch Normalization, and the SiLU activation function. This three-branch architecture includes a channel transformation branch, a serial pooling branch, and a weight extraction branch. Through a two-stage fusion mechanism, the weights of the original features are multiplied by the pooled features, accurately guiding the model’s focus towards target areas and preventing small objects from disappearing during consecutive pooling operations. Ultimately, a convolutional layer further extracts the fused features, concatenating them with the original features to form a feature map of the same size as the original, thereby preventing gradient vanishing issues during backpropagation.

3. Materials and Methods

3.1. Experimental Materials

3.1.1. Experimental Configuration

The proposed method was validated using two small object detection datasets in dense drone scenes: RFRB and VisDrone2019. All models were trained from initialization for a total of 300 epochs, with a batch size of 4. For learning rate adjustment, a linear warm-up strategy was employed for the first three iterations, followed by a decay schedule adapted to the model complexity. Additionally, Automatic Mixed Precision (AMP) was disabled to ensure computational stability. Detailed hardware configurations are provided in Table 1.

3.1.2. Experimental Details

We developed the HawkEyeNet detector, built on the YOLOv10 [2] and RTDETR [17] architectures. The overall design follows the network structure proposed by Ultralytics, with HEConv replacing standard convolution layers. In the model’s multi-branch architecture, HEConv is applied for various tasks, such as feature extraction and downsampling. The output channel size was uniformly reduced to 256 to optimize multi-scale interaction and parameter efficiency. More detailed information about the model framework can be found in the Supplementary Materials. The formulas for accuracy and recall are shown in Equations (1) and (2):

P r e c i s i o n = \frac{T P}{T P + F P}

(1)

R e c a l l = \frac{T P}{T P + F N}

(2)

P is the number of target frames in the assay that were detected as targets and were positive samples, FP is the number of target frames that were detected as targets but were misdetected as negative samples, and FN is the number of targets that were not used to be detected, which is the number of target frames in the total samples minus TP. The P-R curve formed by plotting the accuracy and recall of each category and the area enclosed by the P and R axes is the mAP of the category. mAP is a common measure of the precision of all categories, which is the average of the APs of all the categories. mAP@5:95 indicates the average mAP over the interval of 0.05 in steps from 0.5 to 0.95. The computational formulas are specifically shown in Equations (3) and (4), where p(r) denotes the P-R curve and m denotes the number of target categories.

A P = \int_{0}^{1} p (r) d r

(3)

m A P = \frac{1}{m} \sum_{1}^{m} A P

(4)

The importance of the proposed method was validated through a series of comparative experiments. Different components of the design were isolated and compared to state-of-the-art methods under equivalent experimental conditions. In the field of small object detection, small objects are typically defined as having a pixel area less than 32 × 32 pixels, meaning the target pixel area is less than 1024. These small objects have a small area and few details in the image, making them prone to interference from background, noise, and other factors, hence making them more challenging to detect and recognize in computer vision. For example, in high-altitude images captured by drones, it is common to encounter distant targets such as vehicles, pedestrians, or crop lesions. Detection performance was evaluated using COCO metrics, including mAP, AP, AP50, AP75, APS, APM, and APL (in percentage). Additionally, the parameter count (Para, M) and computational cost (FLOPS, G) were compared across methods.

3.1.3. Dataset Description

RFRB Dataset [29]

The RFRB dataset is an RGB format dataset annotated with rectangular bounding boxes for detecting canola flower clusters. The average and median numbers of flower clusters per image are 310 and 297, respectively, significantly higher than the density observed in other UAV datasets like VisDrone. As shown in Figure 2, the RFRB dataset contains a large number of small, densely clustered objects. This dataset poses significant challenges and serves as an important benchmark for evaluating dense scene detection tasks in agricultural UAV applications.

2.: VisDrone2019 Dataset [14]

The VisDrone dataset is one of the primary datasets in the field of UAV-based small object detection, making validation on this benchmark both comparable and authoritative. This dataset consists of 10,209 static images, all captured by various drone cameras. In Figure 3, examples of different object categories in the dataset are illustrated. The dataset contains ten categories, covering multiple aspects such as location (14 different cities), environment (urban and rural), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse and crowded scenes).

3.2. Experimental Methods

This study utilizes the Ultralytics framework, with the model architecture shown in Figure 4. The backbone, based on YOLOv10, extracts features, which are then processed by the Neck module and passed to the detection head for localization and classification. The RTDETR detection head (DHead) is used to handle dense scenes with small objects, leveraging its robust global relationship modeling for handling multiple instances. The HEConv module is integrated into both the HSPP and GradDynFPN modules. HSPP optimizes the pooling process, enhancing feature generalization, while GradDynFPN manages multi-scale features, enabling cross-scale interactions. The detection task is completed through iterative updates in the decoder.

3.2.1. HawkEye Conv

The principle of conventional convolution calculation is illustrated in Equation (5), which derives the output by flipping the weights of the sampling points in the convolution kernel and summing the results.

O (i, j) = \sum_{m = 0}^{k - 1} \sum_{n = 0}^{k - 1} W (m, n) \times I (i + m, j + n)

(5)

where O(i, j) represents the value of the output feature map at position (i, j). W(m, n) is the weight matrix of the convolution kernel, and I(i + m, j + n) denotes the corresponding pixel values in the input feature map. The size of the convolution kernel is denoted as k. Its receptive field is described in Equation (6), where S is the stride and R_l is the receptive field at the l-th layer of convolution.

R_{l} = R_{l - 1} + (k - 1) \times \prod_{i = 1}^{l - 1} S_{i}

(6)

As illustrated in Figure 5, the receptive field of standard convolution is inherently limited and fixed. This constraint hampers the complete learning of target features with a finite number of sampling points. Furthermore, the rigidity of the learning region inevitably introduces interference from background and other targets during the information collection process, adversely affecting the model’s learning capability. However, existing convolutional variants are rarely designed specifically for small targets. While these variants may focus on the position and shape of objects to some extent, their divergent sampling points can lead to significant information noise during critical feature extraction. This issue is exacerbated in dense scenes with occluded small targets, where current dynamic convolutions often perform poorly.

HawkEye Conv (HEConv), shown in Figure 6, is designed to address the aforementioned challenges. The upper stable sampling region utilizes predefined fixed sampling points arranged in various shapes, with parameters ensuring stability during local feature extraction. In contrast, the lower region involves dynamic offsets and random selection, derived from standard convolution sampling points that fall outside predefined shapes. Initially, dynamic points are randomly generated, after which a dynamic offset network adjusts these points by creating adaptive offsets, enabling flexible sampling across the feature map. By combining adaptive dynamic points with fixed points from non-feature areas, the model adapts to target shapes, improving sensitivity and robustness to input variations.

Unlike traditional channel-grouping convolution strategies, which operate with separate convolutional operators for each group and reintegrate information by channel, our approach directly segments the convolution kernel from the fundamental sampling points. The Diamond and X versions serve as the primary convolution modules, while their mixed version forms the channel-grouped convolution. The effectiveness of this design is validated through detailed experimental analysis and comparisons presented in the subsequent section.

A.: Stable Sampling Area

In the stable sampling region, we aim to maximize the convolution kernel’s ability to extract information from small targets. We modify the standard convolution kernel shape to a special shape convolution (Diamond- or X-shaped) composed of k sampling points with stable sampling relationships. These shapes are utilized to extract information from fixed positions. The Diamond convolution is applicable for symmetric shapes with concentrated features, such as vehicles and buildings, while the X-shaped convolution is suited for targets like plants and pedestrians that exhibit significant aspect ratios. The corresponding sampling point operations are defined as follows in Equation (7):

O_{f i x e d (i, j)} = \sum_{(u, v) \in ω} W (u, v) \times I (i + u, j + v)

(7)

where

ω = \begin{array}{l} \{s h a p e = 0, (k, k), (0, k - 1), (0, k + 1), (k - 1,0), (k + 1,0)\} \\ \{s h a p e = 1, (k, k), (k + 1, k + 1), (k - 1, k + 1), (k + 1, k - 1), (k - 1, k - 1)\} \end{array}

(8)

As shown in Equation (8), the parameter shape governs the convolution sampling process and is contingent upon the aspect ratio α. Specifically, when shape = 0, a Diamond-shaped configuration is employed for fixed region sampling, applicable in scenarios where 0.5 < α < 2. Conversely, when shape = 1, an X-shaped configuration is utilized for fixed sampling. Furthermore, acknowledging the contemporary innovations that leverage channel grouping techniques—utilizing distinct convolution units for feature processing followed by the application of attention modules for weighted fusion—this study introduces a third testing mode designed accordingly. The three configurations investigated are (a) X-shaped convolution, (b) Diamond-shaped convolution, and (c) a dual-branch architecture that concurrently processes both shapes, ultimately resulting in a mixed convolution structure with lightweight attention weight fusion.

Considering that both shapes may be necessary in complex and diverse scenes, we design three modes: (a) X-shaped convolution, (b) Diamond convolution, and (c) a dual-branch structure that combines both, culminating in a lightweight attention-weighted fusion of the mixed convolution structure.

B.: Dynamic Offset and Random Selection Area

The dynamic sampling adjustment section is designed to accommodate the shapes of small targets while restricting excessive flexibility in dynamic offsets. Initially, the convolution kernel is allowed to move sampling points along the edges of the image, creating a combination of original and offset sampling points, which together form the range for selecting dynamic sampling points, termed the dynamic sampling point repository. It is crucial to emphasize that in the dynamic sampling section, the extent of point offsets varies according to the complexity of the convolution and the usage scenario. Random offsets occur within individual shapes (X-shaped or Diamond), while a convolutional network is added in the mixed shape to facilitate learnable dynamic offsets. Here, R represents the sampling points of the standard convolution minus the points in the stable sampling area (A), Q denotes the new sampling points generated by dynamic offsets, and

R + Q

forms the dynamic sampling point repository. The dynamic random extraction method involves selecting

K_{d y n a m i c}

sampling locations from

R + Q

, computed as follows in Equation (9):

{K_{d y n a m i c} = K}_{c o n v} - K_{f i x e d}

(9)

The computation of dynamic offset points is illustrated in Equation (10).

O_{d y n a m i c} (i, j) = {D r o p o u t}_{(P, K_{d y n a m i c})} (\sum_{1}^{P} W_{p} \times I (i + {∆ u}_{p}, j + {∆ v}_{p})) = {D r o p o u t}_{(P, K_{d y n a m i c})} (\sum_{1}^{R} W_{R} \times I (i, j) + \sum_{1}^{Q} W_{Q} \times I (i + {∆ u}_{Q}, j + {∆ v}_{Q}))

(10)

This extraction method effectively balances computational load by randomly selecting positions within a specified range, enabling both dynamic sampling points and allowing dynamic extraction from certain fixed sampling points. These points are then combined with those from Group A to create new convolution units. Consequently, this convolution exhibits the characteristics of 1/4 dynamic points, 1/4 random fixed points, and 1/2 special shape fixed points. It is essential to clarify that in the final sampling configuration, Group A fixed points must always be present and remain fixed; Group B points in R may not always exist but are fixed; whereas points in Q are neither guaranteed to exist nor fixed. This approach not only enhances the model’s adaptability but also improves its robustness in various scenarios.

The final positions and numbers of sampling points are as indicated in Equation (11):

O (i, j) = {O_{f i x e d} (i, j) + O}_{d y n a m i c} (i, j)

(11)

The receptive field of HEConv is significantly expanded compared to standard convolution, as specified in Equation (12):

R_{l} = R_{l - 1} + (k - 1) \times \prod_{i = 1}^{l -} S_{i} + {o f f s e t}_{l}

(12)

where

{o f f s e t}_{l}

denotes the convolution kernel’s displacement, which is learned and subject to dynamic variations.

The HawkEye Conv implementation is provided for clarity, as shown in the following Algorithm 1. It demonstrates the dual-branch hybrid convolution, with one branch representing the single-version convolution. The algorithm outlines how the convolution operation integrates fixed sampling points with dynamically adjusted offset points, detailing the process of generating these points. By using SimAM for the weighted fusion of fixed and dynamic components, this approach enhances the model’s ability to capture small target shapes, improving detection performance.

Algorithm 1 Pseudocode for the HEConv algorithm.

Algorithm 1: HawkEye Conv

Input:

x : i n_c h a n n e l s, o u t_c h a n n e l s, k e r n e l_s i z e, s t r i d e, s a m p l i n g

Output:

y

1 Initialize deformable convolution layer and offset prediction network

2 Define fixed sampling points (Diamond or X shape based on sampling)

3 Define B group points from 3 × 3 grid

4

5 Step 1: Calculate Dynamic Offsets

6 dynamic_offsets ← Offset_Prediction_Network(x)

7

8 Step 2: Generate Dynamic Sampling Points

9 Adjust B group points with dynamic_offsets

10 Randomly sample 4 points from dynamic repository

11

12 Step 3: Apply Deformable Convolution

13 deform_output ← Deformable_Convolution(x, dynamic_points)

14

15 Step 4: Apply Fixed Sampling Convolution

16 fixed_output ← Convolution using fixed sampling points

17

18 Step 5: Weighted Fusion with SimAM

19 attention_weights ← SimAM(fixed_output, deform_output)

20 y ↔ (fixed_output × attention_weight + deform_output × (1 − attention_weight))

21 Return y

3.2.2. HSPP Module

The SPP module effectively captures target features at different scales, enhancing the detector’s performance when handling diverse images. However, this method also introduces several noteworthy drawbacks. The impact of pooling operations on gradients during backpropagation is minimal, making it prone to the vanishing gradient problem, especially in multiple cascaded pooling layers. This phenomenon significantly affects the model’s learning efficacy. Moreover, while the multi-branch parallel pooling structure can enhance feature extraction capabilities, it also substantially increases the computational load, which can adversely affect the real-time performance and lightweight nature required for UAV dense target detection tasks. Therefore, balancing feature extraction effectiveness with computational efficiency remains a topic worthy of further research when designing UAV target detection systems.

The proposed multi-branch progressive spatial fusion module addresses information loss and low gradient values caused by multiple pooling operations during feature extraction. Figure 7 illustrates the structure of the HSPP module. The HBS (HEConv + BN + SiLU) architecture extracts scale-invariant features from the P5 layer, capturing target shape characteristics and enhancing the model’s information learning capacity. A weight branch and hierarchical fusion structure are introduced on top of the cascaded pooling and channel variation branches. An attention module first extracts the weights of the original features, guiding the pooled features through matrix multiplication to emphasize small target areas. Then, convolution operations are applied to the mixed features, which are concatenated with the channel variation branch along the channel dimension. Finally, feature integration is achieved through the convolution module.

3.2.3. Gradual Dynamic Feature Pyramid Network

FPN has demonstrated exceptional performance in various small object detection tasks, primarily due to its effective feature interaction mechanism, which addresses the challenges of multi-scale object detection. The Gradual Dynamic Feature Pyramid Network (GradDynFPN) significantly enhances detection capabilities by incorporating richer scale information compared to traditional feature pyramid networks (FPNs). Classic models such as BiFPN [36], AFPN [39], and ContextGuideFPN [41] have continuously optimized the horizontal interactions among features at the same scale and the vertical interaction paths across different scales.

Through an in-depth analysis of the principles underlying the existing mainstream FPNs, we found that introducing more scale information can effectively enhance detection performance. However, while feature interactions across non-adjacent scales can broaden the information fusion across a wider field of view, they may also introduce blurring interference issues after scale transformations.

In the process of feature fusion, we do not directly apply a single attention mechanism for fusion. Our analysis indicates that when the features from the upper layer are sampled and resized to match the size of the lower layer, although this includes a large amount of semantic information that can enrich the original lower layer features, the sampling operation itself leads to feature information blurring. Therefore, we require the accurate spatial modeling capability of the original low-level features to guide the features in targeted learning at precise regions. Similarly, when the features from the lower layer are sampled and resized to match the upper layer’s feature map size, the spatial information can effectively address the challenges of recognizing small targets in low resolution. However, the lack of semantic information in the lower layers can interfere with the features of the upper layer. As a result, we need the semantic relationship capture ability of high-level features to guide the features in strengthening their understanding of the relationships between targets.

As illustrated in Figure 8a, our proposed Gradual Dynamic Feature Pyramid Network (GradDynFPN) emphasizes the importance of interactions between adjacent scales for information fusion in the middle layer. Therefore, we employ interaction operations between adjacent features to fully leverage the complementary information of deeper and shallower features, thereby enhancing the richness and precision of feature representation.

According to the structural details in Figure 8b, during the three-layer feature interaction process, we first design the sampling operations for different layers with careful consideration. The three layers of features are processed as follows: lightweight upsampling is performed using CARAFE [46], channel transformation is achieved through a 1 × 1 convolution, and downsampling employs our designed HEConv. Subsequently, we utilize the features from the upper and lower layers along with the middle layer to complete the initial step of sampling fusion. During the fusion process, the middle layer extracts spatial and channel weights, applying spatial attention for upward interactions and channel attention for downward interactions. This guides the fusion features through matrix multiplication, effectively addressing the blurring issues encountered when aligning features of different scales. Finally, the mixed features from the upper and lower layers and the middle layer are fused using adaptive weights, completing the final concatenation.

4. Results

4.1. Comparison with State-of-the-Art Technologies

Compared to existing state-of-the-art end-to-end object detectors, this paper aims to assess the performance improvement of HawkEyeNet in small object detection. In our study, we employed several one-stage models, including YOLOv3 [4], YOLOv5-L [43], YOLOv6 [44], YOLOv7 [7], YOLOv8 [47], YOLOv10 [2], DETR-ResNet50 [17], DINO [11], RetinaNet [48], RefineDet [49], FCOS [50], FSAD [51], and ATSS [15]. For two-stage models, we utilized Libra R-CNN [52], TridentNet [53], FRCNN [16], and DBNet [54]. Furthermore, we incorporated the models DroneEye [55] and TPH-YOLOv5 [56], specifically designed for drone-related tasks. The specific results are shown in Table 2 and Table 3. First, on the RFRB dataset, the AP₅₀ metric of HawkEyeNet increased by 11.9%, while on the VisDrone2019 dataset, the AP₅₀ improved by 6.2%. Similarly, the AP_S metrics of HawkEyeNet enhanced by 11.5% and 5% on these two datasets, respectively. These results indicate that the overall improvement in detection performance primarily stems from the increased accuracy in small object detection, validating the effectiveness of the methods presented in this paper across different benchmarks.

An analysis of detection performance was conducted for each category across two datasets, comparing HawkEyeNet with the baseline YOLOv10, as presented in Table 4 and Table 5. The results indicate that HawkEyeNet demonstrates superior performance compared to the YOLO series models across COCO metrics. Significant improvements were observed in the detection of small objects, while performance on large objects also exhibited enhancements. This suggests that the HEConv module effectively enhances small object detection performance without compromising the detection capabilities of larger objects, resulting in overall performance improvements. Furthermore, HawkEyeNet achieved average precision (AP) metrics of 46.4% and 30.8%, respectively, representing the highest values among the compared detection models. Additionally, as illustrated in Table 6, the comprehensive performance of this model ranks among the best, providing further evidence for the efficacy of the methodologies proposed in this paper.

4.2. Ablation Study

The methods and experimental results based on different datasets are presented in five detailed tables. Table 7 and Table 8 display the ablation study results for the HawkEyeNet model. HEConv, as designed in this study, is applied to both the HSPP and GradDynFPN modules, with “Diamond” and “X” denoting different versions of HEConv. We use (Diamond, X) to represent the dual-branch hybrid convolution version. The experimental results demonstrate that HEConv enhances the model’s focus on small object shapes and improves overall detection accuracy without increasing the number of parameters. Compared to the baseline model, HawkEyeNet achieves a 6.2% and 4.9% improvement in AP₅₀ and AP_S, respectively.

As observed in the fifth row of the table, without altering the model architecture, merely replacing the pooling methods and the standard convolution in multi-scale feature processing led to superior accuracy compared to the mainstream models. Specifically, HawkEyeNet achieved an AP of 45.6% and 30.8%, while AP_S increased to 45.6% and 21.9%. These findings validate that the design of HEConv significantly improves small object detection performance without compromising large object detection.

Additionally, the HSPP module contains one HEConv, while the GradDynFPN module, although containing five HEConv modules, reduces the number of channels to 256 to control model complexity and parameter size. The results show that incorporating the HSPP and GradDynFPN modules significantly enhances model performance. Notably, HSPP excels at capturing target shape features, while HEConv in GradDynFPN exhibits strong downsampling capabilities. When these two modules are used together, the model’s performance is further improved, underscoring the effectiveness of HEConv in both feature extraction and downsampling.

Comparative experiments were conducted to further validate the effectiveness of the proposed method, using the YOLOv10-DHead model and mainstream approaches at the FPN and SPP modules, as shown in Table 9 and Table 10. These experiments compared the performance of the proposed method against standard convolution. In multiple tests, HSPP outperformed in COCO metrics, achieving improvements of 2.3% in AP and 2.6% in AP_S over the baseline SPPF. Similarly, GradDynFPN delivered the best performance in both AP and small object AP_S, with increases of 2.5% and 6.2%, respectively. These results demonstrate that the proposed modules enhance detection performance across multiple scales, with significant improvements in small object precision, confirming the effectiveness of the method in small object detection.

As shown in Table 11, to validate the effectiveness of HEConv, we replaced standard convolutions and some mainstream convolution variants for comparative experiments. Under the same configuration, the latest v4 version of the DCN series showed minimal improvement in model performance, while SMPConv performed even worse, exhibiting poor results in small object detection and no significant advantage in overall accuracy. For the CKAN module, although it improved overall accuracy by about 1%, the parameter count increased from 40.8 M to 87.4 M, resulting in approximately 120 G more computational cost compared to the model with standard convolutions. In previous sections, we analyzed the unsuitability of most convolution variants in small object scenarios and highlighted the multi-scale accuracy differences under standard convolution architectures. HEConv, as designed in this study, shows significant improvements in multiple aspects. Firstly, the experimental results indicate that compared to standard convolutions, HEConv demonstrates advantages in both parameter count and computational cost, reducing them to 45.7% and 56.5% of CKAN, respectively. Secondly, HEConv also improves accuracy, particularly in the AP₅₀ metric, with a 10.9% increase over standard convolutions. Compared to CKAN, HEConv improves AP₅₀ by 9.1%. In earlier sections, we also discussed that the Diamond version performs optimally when the target aspect ratio is within the range of [0.5, 2]. We compared different versions of HEConv, all of which showed improvements, with the Diamond version standing out. Although mixed convolution with channel grouping outperforms standard convolutions, the fusion of different methods introduces alignment errors, limiting the performance of the Econv-Diamond/X method and making it less effective than the individual convolution versions. On the RFRB dataset, where the target aspect ratio is more balanced, the model using Diamond version convolutions achieved the best results, further confirming the effectiveness of HEConv’s version selection based on the shape parameter.

4.3. Visualization and Analysis

4.3.1. Heatmap Visualization

This section provides a visual analysis of feature heatmaps, intuitively demonstrating the model’s performance and advantages. The heatmaps highlight the model’s ability to capture targets at various levels, offering detailed insights into its small object detection capabilities. We compared the heatmaps of the 18th layer between the standard convolution and HEConv. As shown in Figure 9, which presents the results from testing RFRB images using GradCAM (a method for generating heatmaps), the right-hand side with HEConv detected more small objects, especially along the complex background edges. Additionally, it clearly distinguished adjacent targets in the central region, showcasing HEConv’s significant improvement in capturing small object features.

Moreover, we conducted further tests on the VisDrone2019 dataset, as shown in Figure 10, revealing that when the confidence threshold (conf_threshold) is set to 0.2, standard convolution fails to detect some small objects. This is because when the confidence of the detected objects falls below 0.2, they are classified as background or low-quality targets in the heatmap and are filtered out. Although standard convolution can locate small objects, it struggles to accurately capture their semantic information, leading to suboptimal precision. In contrast, HEConv effectively captures the features of small objects under the same confidence threshold, further proving its superior performance.

4.3.2. Performance Visualization

This section presents the visual comparison between YOLOv10 and HawkEyeNet in practical detection tasks, highlighting the performance improvement brought by the HEConv module. In Figure 11, the three columns show the detection results of YOLOv10, HawkEyeNet without HEConv, and HawkEyeNet with HEConv on the VisDrone2019 dataset.

From the visualizations, it is evident that the model with HEConv on the right achieves significantly better accuracy in target localization and detection. Specifically, in Figure 11a, YOLOv10 fails to detect the truck and incorrectly identifies the tire as a bicycle; HawkEyeNet without HEConv misclassifies a chair in the background as a person. However, HawkEyeNet with HEConv accurately detects the truck and avoids the misclassification of the chair. Similarly, in Figure 11b, YOLOv10 misses several targets, while the HEConv-enhanced model detects more small targets. Figure 11c further illustrates HEConv’s advantage in small object detection, where the model not only accurately detects a previously missed car but also avoids misclassifying a wall image as a pedestrian.

In Figure 12, results on the RFRB dataset are displayed. YOLOv10 shows multiple missed detections, while HawkEyeNet without HEConv detects some small targets, albeit with low confidence scores of 0.2–0.4. This suggests improved regression but insufficient learning for accurate classification. HawkEyeNet with HEConv improves both target localization and classification confidence, raising it to around 0.5–0.6.

These results indicate that HawkEyeNet achieves a notable improvement in accuracy and small object detection compared to YOLOv10. Particularly in complex backgrounds and multi-target scenarios, the HEConv module effectively reduces missed and false detections, significantly enhancing the model’s ability to capture the characteristics of small objects.

5. Discussion

Comparative experiments were conducted on the YOLOv10-DHead model using FPN and SPP approaches, as shown in Table 9 and Table 10, comparing against standard convolution. In small object detection, models focusing too much on object shape, while neglecting intermediate features, may hinder performance. Previous strategies combining standard and dynamic convolutions with channel grouping have shown limited performance gains despite increased parameters. The third convolution version in this study outperforms standard convolution but still falls short compared to our single-shape version. HEConv was tested in a multi-branch architecture, but further research on the backbone network is needed. The inclusion of flexible, learnable sampling points could increase model complexity and training time. Future work will explore lightweight adaptive sampling and fusion methods to improve accuracy across scales in small object datasets.

This study primarily focused on dense regions, which are common in UAV tasks such as agriculture and social transportation. However, these tasks may also involve some sparse scenarios. To assess the adaptability of the proposed method, experiments were conducted on the sparser AI-TOD dataset. The stable Diamond version was used for training on AI-TOD, with the experimental results provided as Supplementary Materials. In addition, the full meaning of some of the basic acronyms in the current field can be found in Appendix A, Table A1.

6. Conclusions

Under the challenges of small object detection, traditional convolutional operators often fall short, with detection precision significantly lagging behind larger objects despite increased small object data. HEConv addresses these issues through a dual-focus design, combining fixed sampling points for key feature capture with dynamic sampling for adaptability, avoiding reliance on channel grouping strategies.

Validated by HSPP and GradDynFPN modules, HEConv achieves superior multi-scale accuracy, as shown in ablation studies and comparisons with other convolutional variants. It demonstrates significant advantages over YOLO series models, including YOLOv10, and surpasses recent convolution operators on the RFRB dataset, excelling in both accuracy and computational efficiency.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones8120713/s1.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; validation, Y.L.; writing—original draft preparation, Y.L. and Y.Y.; writing—review and editing, Y.L., W.Y. and L.W.; supervision, X.T., L.W. and D.C.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2022ZD0115802, Wenzhong Yang), the Key Research and Development Program of the Autonomous Region (Grant No. 2022B01008, Wenzhong Yang), and the Central Government Guidance Local Special Projects (Grant No. ZYYD2022C19, Wenzhong Yang).

Data Availability Statement

The data for this article are available at the same address as the supporting material above, and all datasets and codes can be found on the following website: MLL768020/HawkEyeNet (www.github.com) (accessed on 14 October 2024).

Acknowledgments

We would like to express our sincere gratitude to Wenzhong Yang for his invaluable support, particularly in securing funding. Additionally, we extend our thanks to the creators of the VisDrone, AI-TOD, and RFRB datasets for making their data publicly available, which played a crucial role in enabling our research on small object detection in densely populated UAV scenarios.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The content of this section mainly explains the occurrence of acronyms in the above paper, and the specific presentation is shown in Table A1.

Table A1. Table of the corresponding full names of common acronyms in this paper.

Abbreviation	Full Name
CNN	convolutional neural network
IoU	Intersection over union
SS2D	Single-Shot MultiBox Detector 2D
NMS	Non-max suppression
AP	average precision
Fcos	Fully Convolutional One-Stage Object Detection
BN	Batch Normalization
SiLU	Sigmoid Linear Unit
F1	Harmonic Mean of Precision and Recall

References

Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Li, Y.; Xia, Y.; Zheng, G.; Guo, X.; Li, Q. YOLO-RWY: A Novel Runway Detection Model for Vision-Based Autonomous Landing of Fixed-Wing Unmanned Aerial Vehicles. Drones 2024, 8, 571. [Google Scholar] [CrossRef]
Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arxiv 2022, arXiv:2209.02976. [Google Scholar]
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision—ECCV 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Wang, S.; Jiang, H.; Yang, J.; Ma, X.; Chen, J. AMFEF-DETR: An End-to-End Adaptive Multi-Scale Feature Extraction and Fusion Object Detection Network Based on UAV Aerial Images. Drones 2024, 8, 523. [Google Scholar] [CrossRef]
Wu, T.; Zhang, Z.; Jing, F.; Gao, M. A Dynamic Path Planning Method for UAVs Based on Improved Informed-RRT* Fused Dynamic Windows. Drones 2024, 8, 539. [Google Scholar] [CrossRef]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Zong, Z.; Song, G.; Liu, Y. Detrs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6748–6758. [Google Scholar]
Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 652663. [Google Scholar] [CrossRef]
Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Zhang, L. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Wu, J.; Chen, J.; Lu, Q.; Li, J.; Qin, N.; Chen, K.; Liu, X. U-ATSS: A lightweight and accurate one-stage underwater object detection network. Signal Process. Image Commun. 2024, 126, 117137. [Google Scholar] [CrossRef]
Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A multi-scale faster RCNN model for small target forest fire detection. Forests 2023, 14, 616. [Google Scholar] [CrossRef]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Zou, D.; Cao, Y.; Li, Y.; Gu, Q. The benefits of mixup for feature learning. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 43423–43479. [Google Scholar]
DeVries, T. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar]
Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17425–17436. [Google Scholar]
Han, D.; Ye, T.; Han, Y.; Xia, Z.; Song, S.; Huang, G. Agent attention: On the integration of softmax and linear attention. arXiv 2023, arXiv:2312.08874. [Google Scholar]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-Based YOLO For Object Detection. arXiv 2024, arXiv:2406.05835. [Google Scholar]
Li, J.; Wang, E.; Qiao, J.; Li, Y.; Li, L.; Yao, J.; Liao, G. Automatic rape flower cluster counting method based on low-cost labelling and UAV-RGB images. Plant Methods 2023, 19, 40. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Qiao, Y. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Wang, W.; Lu, T.; Dai, J. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5652–5661. [Google Scholar]
Wang, Z.; Lin, G.; Tan, H.; Chen, Q.; Liu, X. CKAN: Collaborative knowledge-aware attentive network for recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; pp. 219–228. [Google Scholar]
Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Gao, J.; Geng, X.; Zhang, Y.; Wang, R.; Shao, K. Augmented weighted bidirectional feature pyramid network for marine object detection. Expert Syst. Appl. 2024, 237, 121688. [Google Scholar] [CrossRef]
Zhang, Y.M.; Hsieh, J.W.; Lee, C.C.; Fan, K.C. SFPN: Synthetic FPN for object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1316–1320. [Google Scholar]
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Oahu, HI, USA, 1–4 October 2023; pp. 2184–2189. [Google Scholar]
Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef] [PubMed]
Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided context feature pyramid network for object detection. arXiv 2020, arXiv:2005.11475. [Google Scholar]
Jin, Z.; Yu, D.; Song, L.; Yuan, Z.; Yu, L. You should look at all objects. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 332–349. [Google Scholar]
Glenn, J. YOLOv5 Release v7.0. Available online: https://github.com/ultralytics/yolov5/releases (accessed on 15 June 2024).
Zhang, Y.; Zhang, W.; Yu, J.; He, L.; Chen, J.; He, Y. Complete and accurate holly fruits counting using YOLOX object detection. Comput. Electron. Agric. 2022, 198, 107062. [Google Scholar] [CrossRef]
Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 385–400. [Google Scholar]
Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
Glenn, J. YOLOv8 Release v8.1.0. Available online: https://github.com/ultralytics/ultralytics/releases/tag/v8.1.0 (accessed on 14 October 2024).
Cheng, X.; Yu, J. RetinaNet with difference channel attention and adaptively spatial feature fusion for steel surface defect detection. IEEE Trans. Instrum. Meas. 2020, 70, 2503911. [Google Scholar] [CrossRef]
Xie, H.; Wu, Z. A robust fabric defect detection method based on improved RefineDet. Sensors 2020, 20, 4260. [Google Scholar] [CrossRef]
Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 19221933. [Google Scholar] [CrossRef]
Kou, Y.; Wu, K.; Huang, C.; Chen, H.; Du, W.; Liu, H. FSAD: Few-Shot Object Detection via Aggregation and Disentanglement. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar]
Guo, H.; Yang, X.; Wang, N.; Song, B.; Gao, X. A rotational libra R-CNN method for ship detection. IEEE Trans. Geosci. Remote Sens. 2020, 58, 57725781. [Google Scholar] [CrossRef]
Paz, D.; Zhang, H.; Christensen, H.I. Tridentnet: A conditional generative model for dynamic trajectory generation. In International Conference on Intelligent Autonomous Systems; Springer International Publishing: Cham, Switzerland, 2021; pp. 403–416. [Google Scholar]
Dai, X.; Xie, J.; Zhang, G.; Chang, K.; Chen, F.; Wang, Z.; Tang, C. DBNet: A Curve-based Dynamic Association Framework for Lane Detection. IEEE Trans. Intell. Veh. 2024. [Google Scholar] [CrossRef]
Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Liu, M. VisDrone-DET2021: The vision meets drone object detection challenge results. In Proceedings of the IEEE/CVF International conference on computer vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854. [Google Scholar]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; p. 2788. [Google Scholar]

Figure 1. Comparison of mainstream fundamental operators based on the RFRB dataset. The circles represent the performance of HawkEyeNet with different convolution types, with circle size indicating computational requirements. The horizontal axis shows the model’s parameter count, and the vertical axis displays testing accuracy on the RFRB dataset. Models closer to the top left corner are superior. Overall, HEConv outperforms mainstream convolution operators, including the VSSBlock from Mamba and the CKAN operator, demonstrating strong competitiveness.

Figure 2. Example images from the RFRB dataset.

Figure 3. Example images of various categories from the VisDrone2019 dataset. In these images, the objects in the red boxes correspond to the category names listed in each row.

Figure 4. Overall architecture of HawkEyeNet. The model pipeline includes image input, feature extraction, multi-scale feature processing, and downstream detection. (a) Detailed structure of the module shown in (b). (b) Backbone architecture integrating the proposed HSPP module for pooling. (c) Gradual Dynamic Feature Pyramid Network (GradDynFPN) designed in this study, which implements multi-scale feature processing through restructured architecture and sampling modules. (d) Detection head (Dhead) based on a decoder structure, utilized for task prediction.

Figure 5. Schematic representation of sampling positions for standard and dynamic convolutions. The green points indicate adjusted sampling locations derived from the blue points. (a) illustrates standard convolution, a foundational module in convolutional neural networks that can form specialized structures such as CSP, ELAN, C2f, and C3, which are crucial for enhancing performance in visual tasks. (b–d) depict convolution operations with adjustable sampling positions. In the diagram, the arrows indicate the offset directions of the convolutional kernel. The blue marks represent the positions of the standard convolutional kernel, while the green marks denote the positions after applying the offsets. In conventional scenarios, these convolution variants improve adaptability to varying target shapes and poses through flexible adjustments of sampling point locations, demonstrating their effectiveness in traditional object detection tasks.

Figure 6. Structural flowchart of the HawkEye Conv operator. The simulated HawkEye structure features two attention areas: a stable sampling region A composed of special shapes and a dynamic offset and random selection area B. This design does not require channel grouping, offering plug-and-play flexibility. It is important to note that the top left corner clarifies that A and B are not defined by channel division but are designed by splitting the convolution kernel into two parts. Panels C and D illustrate the dynamic sampling methods for different versions of HawkEye Conv. In the figure, the blue circle represents the original convolution kernel, while the red and yellow circles represent its processing in parts A and B, respectively. The arrows indicate the offset direction, and the green circle marks the final offset position.

Figure 7. Schematic of the HSPP module structure.

Figure 8. Structure of GradDynFPN. (a) illustrates the overall architecture of the module; (b) details the specific design aspects of the multi-scale sampling and fusion components within the module.

Figure 9. Comparative visualization of heatmaps based on RFRB sample data. The brighter the color, the more attention the model has.

Figure 10. Comparative visualization of heatmaps based on VisDrone2019 sample images. The brighter the color, the more attention the model has. The red block image represents the extracted region, which is specially enlarged to facilitate the clear contrast.

Figure 11. Comparative visualization of test results based on the VisDrone2019 dataset. The red box extracts the contrast area and turns it into a subgraph, which aims to compare the effect gap of different models more clearly.

Figure 12. Comparative visualization of test results based on the RFRB dataset. The red box extracts the contrast area and turns it into a subgraph, which aims to compare the effect gap of different models more clearly.

Table 1. Experimental configuration details.

Devices	Configuration
OS	Ubuntu 18.04
Development language	Python 3.8.0
Development tools	Pytorch 1.12.1
CPU	Intel(R) Xeon(R) Gold 5120
GPU	8 × NVIDIA A40 (45 G)
CUDA Version	CUDA 11.3
Driver Version	NVIDIA Driver 460.32.03
Memory	512 G

Table 2. Comparative experimental results of mainstream detectors based on the RFRB dataset.

Model	AP₅₀	AP₇₅	AP_S	AP_M	AP
YOLOv3	71.2	33.3	36.8	42.5	37.0
YOLOv5	72.2	35.2	37.4	43.5	37.8
YOLOv 6	67.2	19.1	29.2	39.3	29.7
FCOS	25.8	7.3	10.7	21.7	11.5
ATSS	26.1	7.6	10.8	19.9	11.4
Tridentnet	18	5.2	7.2	18.2	8
Retinanet	8.6	1.2	2.7	7.3	3
FSAD	12.4	2.5	4.6	8.6	4.7
FRCNN	11.6	5.2	5.5	10.5	5.9
DINO	62.1	12	23.4	29.7	11.1
DETR-ResNet50	47.3	1.8	12.0	12.3	12.0
RT-DETR	72.2	45.2	41.6	43.1	41.7
YOLOv8	72.2	38.3	39.2	41.1	39.3
YOLOv10X	69.8	30.7	35.0	41.2	35.3
YOLOv10	69.6	30.5	34.8	39.8	35.0
HawkEyeNet (ours)	81.5	49.5	46.3	47.2	46.4

Table 3. Comparative experimental results of mainstream detectors based on the VisDrone2019 dataset.

Model	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	AP
Libra R-CNN	25.2	15.2	5.9	25.6	31.4	14.9
RefineDet	28.8	14.0	-	-	-	14.9
RetinaNet	26.4	15.3	6.3	25.6	34.4	15.0
YOLOv3	27.2	14.6	6.3	21.5	36.1	15.0
FCOS	30.4	18.3	9.2	22.6	35.4	17.9
ATSS	33.8	20.9	11.6	31.7	36.7	20.4
TridentNet	35.0	19.5	11.4	29.6	36.6	19.8
YOLOv5-L	39.8	23.4	15.5	36.1	49.6	23.6
YOLOX-X	43.2	26.2	15.9	38.0	52.4	25.8
DroneEye2020	30.6	18.1	10.1	27.2	35.1	17.4
DBNet	36.3	19.1	12.0	29.1	42.4	19.9
YOLOv7	49.0	28.3	18.7	40.1	52.6	28.6
TPH-YOLOv5	45.9	27.7	17.5	39.5	50.9	27.1
YOLOv8	44.6	27.9	16.4	41.6	46.7	27.3
DETR-ResNet50	48.3	29.9	20.7	40.6	49.8	29.5
RT-DETR	46.1	27.9	18.8	39.1	48.4	27.7
YOLOv10X	45.1	27.9	17.9	39.6	46.0	27.3
YOLOv10	43.8	26.7	16.9	38.6	45.2	26.4
HawkEyeNet (ours)	50.0	31.8	21.9	42.2	44.1	30.8

Table 4. Comparison of single-class detection results based on the RFRB dataset.

Category	Instances	Method
		YOLOv10				HawkEyeNet
		P	R	AP₅₀	mAP	P	R	AP₅₀	mAP
rape	5033	89.7	70.1	82	48.1	90.4	79.2	87.8	53.4
rape	5033	AP	35	F1	79	AP	45.6	F1	84

Table 5. Comparison of single-class detection results based on the VisDrone2019 dataset.

Category	Instances	Method
		YOLOv10				HawkEyeNet
		P	R	AP₅₀	mAP	P	R	AP₅₀	mAP
pedestrian	8844	67.7	42.6	50.7	24.8	71.3	56.3	61.2	30.9
person	5125	59.4	36.2	40.4	17	67.2	53	54.1	23.8
bicycle	1287	34.8	23.3	20.5	9.46	50.3	28.4	27.6	13.4
car	14,064	80.3	78.9	83.2	61.6	82.1	85.6	87.3	65
van	1975	59.9	49.5	51.8	37.4	67.8	53.4	57.3	43.8
truck	750	60.6	40.8	44.7	30.3	66.3	44	47.3	32.3
tricycle	1045	47.7	37.7	35.9	20.4	54.3	46	42.5	26.3
awning-tricycle	532	34	21.4	20.1	12.2	42.2	25.9	25.8	17.3
bus	251	77	5.4	66	48.3	81.2	63.8	70.1	55.2
motor	4886	56.8	52.4	52.7	25.6	65.4	64	64.6	33.2
All	38,759	P	R	AP₅₀	mAP	P	R	AP₅₀	mAP
		57.8	44	46.6	28.7	64.8	52	53.8	34.1
		AP	26.4	F1	50.0	AP	30.8	F1	57.0

Table 6. Performance comparison of mainstream small object detection models.

Model	Input Size (Pixels)	Parameters (M)	Flops (G)
YOLOv5M	640 × 640	21.2	49.0
YOLOv10X	640 × 640	30	160.4
YOLOv8L	640 × 640	43.61	164.9
YOLOv5L	640 × 640	46.5	109.1
YOLOX-L	640 × 640	54.2	155.6
YOLOv3	640 × 640	59.6	158.0
YOLOv7X	640 × 640	71.3	189.9
YOLOX-X	640 × 640	99.1	281.9
RTDETR-R50	640 × 640	42	130
HawkEyeNet (ours)	640 × 640	40	167

Table 7. Ablation study of the model based on the RFRB dataset.

Model	HSPP	GradDynFPN	HEConv	AP₅₀	AP₇₅	AP_S	AP_M	AP
YOLOv10				69.6	30.5	34.8	39.8	35.0
YOLOv10-DHead				71.2	38.7	38.5	43.5	38.9
YOLOv10-DHead	√		√ (Diamond)	72.3	43.7	41.1	44.1	41.2
YOLOv10-DHead		√	√ (Diamond)	72.2	44.9	41.2	44.2	41.4
YOLOv10-DHead	√	√		70.5	46.4	41.6	45.8	41.8
YOLOv10-DHead	√	√	√ (X)	72.2	48.5	42.7	45.5	42.9
YOLOv10-DHead	√	√	√ (X,Diamond)	70.9	47.9	42.5	45.7	42.6
ours	√	√	√ (Diamond)	81.4	48.3	45.6	46.6	45.6

√ denotes the method corresponding to adding this column.

Table 8. Ablation study of the model based on the VisDrone2019 dataset.

Model	HSPP	GradDynFPN	HEConv	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	AP
YOLOv10				43.8	26.7	16.9	38.6	45.2	26.4
YOLOv10-DHead				45.5	27.5	18.6	38.3	39.2	27.3
YOLOv10-DHead	√		√ (Diamond)	47.0	29.5	19.8	40.3	45.9	28.9
YOLOv10-DHead		√	√ (Diamond)	48.8	30.8	21.0	41.2	43.8	29.7
YOLOv10-DHead	√	√		47.8	29.9	20.0	40.9	44.1	29.1
YOLOv10-DHead	√	√	√ (X)	49.1	30.8	21.1	41.5	44.4	30.1
ours	√	√	√ (Diamond)	50.0	31.8	21.9	42.2	44.1	30.8

√ denotes the method corresponding to adding this column.

Table 9. Comparison between the proposed HSPP and flow SPP methods based on the RFRB dataset.

YOLOv10-DHead	AP	AP₅₀	AP₇₅	AP_S	AP_M
+SPPF (base)	38.9	71.2	38.7	38.5	43.5
+SPPCSPC	40.8	72.0	42.6	40.8	41.9
+SPPELAN	41.0	72.2	43.4	40.8	44.1
+HSPP (ours)	41.2	72.3	43.7	41.1	44.1

Table 10. Comparison between the proposed GradDynFPN and mainstream FPN methods based on the RFRB dataset.

YOLOv10-DHead	Out_Channel	AP	AP₅₀	AP₇₅	AP_S	AP_M
+PAFPN (base)	[256, 256, 256]	38.9	71.2	38.7	38.5	43.5
+ContextGuideFPN	[256, 256, 256]	40.4	71.2	43.9	40.2	43.6
+BiFPN	[256, 256, 256]	40.1	71.3	42.2	40.0	42.8
+AFPN	[256, 256, 256]	40.2	71.3	42.8	40.1	43.0
+GradDynFPN (ours)	[256, 256, 256]	41.4	72.2	44.9	41.2	44.2

Table 11. Comparison between the proposed HawkEye Conv and mainstream convolution methods based on the RFRB dataset.

HawkEyeNet	AP₅₀	AP₇₅	AP_S	AP_M	AP	Para (M)	FLOPS (G)
+Conv (base)	70.5	46.4	41.6	45.8	41.8	40.8	171.1
+DCNv4	70.5	47.5	41.8	43.8	41.8	38.6	165.8
+CKAN	72.3	48.6	42.7	47.7	43.2	87.4	295.4
+SMPConv	71.2	46.3	41.1	45.5	41.5	40.3	167
+VSSBlock	72.2	47.8	42.6	44.9	42.8	45.75	184.3
+HEConv (ours, X)	72.2	48.5	42.7	45.5	42.9	40.3	168.5
+HEConv (ours, Mix)	70.9	47.9	42.5	45.7	42.6	40.3	182.8
+HEConv (ours, Diamond)	81.4	48.3	45.6	46.6	45.6	40	167

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Y.; Yang, W.; Wang, L.; Tao, X.; Yin, Y.; Chen, D. HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery. Drones 2024, 8, 713. https://doi.org/10.3390/drones8120713

AMA Style

Li Y, Yang W, Wang L, Tao X, Yin Y, Chen D. HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery. Drones. 2024; 8(12):713. https://doi.org/10.3390/drones8120713

Chicago/Turabian Style

Li, Yihang, Wenzhong Yang, Liejun Wang, Xiaoming Tao, Yabo Yin, and Danny Chen. 2024. "HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery" Drones 8, no. 12: 713. https://doi.org/10.3390/drones8120713

APA Style

Li, Y., Yang, W., Wang, L., Tao, X., Yin, Y., & Chen, D. (2024). HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery. Drones, 8(12), 713. https://doi.org/10.3390/drones8120713

Article Menu

HawkEye Conv-Driven YOLOv10 with Advanced Feature Pyramid Networks for Small Object Detection in UAV Imagery

Abstract

1. Introduction

2. Related Work

2.1. Fundamental Modules of Convolutional Models

2.2. Multi-Scale Fusion Module

2.3. Spatial Pooling Module

3. Materials and Methods

3.1. Experimental Materials

3.1.1. Experimental Configuration

3.1.2. Experimental Details

3.1.3. Dataset Description

3.2. Experimental Methods

3.2.1. HawkEye Conv

3.2.2. HSPP Module

3.2.3. Gradual Dynamic Feature Pyramid Network

4. Results

4.1. Comparison with State-of-the-Art Technologies

4.2. Ablation Study

4.3. Visualization and Analysis

4.3.1. Heatmap Visualization

4.3.2. Performance Visualization

5. Discussion

6. Conclusions

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI