1. Introduction
In the process of urban modernization, the wide application of high technology has penetrated every aspect of our lives. The swift advancement of UAV technology, one of the representatives of high technology, has brought about revolutionary changes in several fields. UAVs not only provide an efficient means of information collection in the fields of military defense [
1], agricultural production [
2], traffic monitoring [
3], and hazardous condition monitoring [
4], but their advanced intelligent processing capabilities are also crucial for solving complex problems in ground observation. In traffic monitoring, UAVs can monitor traffic flow in real-time through object detection technology to prevent traffic congestion [
5,
6]. In hazardous condition monitoring, UAVs can quickly locate trapped people and provide timely and critical information for rescue teams [
7]. These applications demonstrate the value of the UAV as an aerial observation platform, and UAVs have accordingly been adopted across multiple fields. However, the broader and more varied observation angles of UAVs produce images characterized by small targets, complex backgrounds, and large scale variations. These characteristics differ markedly from images captured by conventional cameras and pose greater challenges to detection technology. In particular, densely arranged small targets, cluttered environments, and unclear target boundaries make it difficult for traditional object detection algorithms to deliver fast and accurate detection in UAV-related tasks. These difficulties limit the application of UAV technology and bring new challenges to computer vision research.
Object detection algorithms are primarily divided into two categories: those based on traditional manual feature extraction and those based on deep learning. Manual feature extraction methods can be custom-designed according to different application scenarios and needs, providing flexibility and adaptability. However, manually designed features have difficulty capturing sufficient information, leading to poor generalization ability. Additionally, the computational efficiency of these features often fails to satisfy the requirements of most scenarios [
1]. The emergence of deep learning has significantly propelled the advancement of object detection technology, and its superiority in processing speed and generalization ability has made deep learning-based object detection algorithms mainstream in the industry. Deep learning-based algorithms can be broadly divided into two types: two-stage methods and one-stage methods. The two-stage methods, represented by R-CNN [
8], Fast R-CNN [
9], Faster R-CNN [
10], and Cascade R-CNN [
11], operate through region proposal mechanisms. These methods first generate numerous candidate regions and subsequently perform classification and bounding-box refinement using dedicated classifiers. Although two-stage methods can achieve higher detection accuracy, their inference speed is often too slow for real-time applications [
12]. The one-stage methods, represented by the Single Shot MultiBox Detector (SSD) [
13] and You Only Look Once (YOLO) [
14] series of algorithms, are widely used in real-time object detection applications. Unlike two-stage methods, these methods eliminate the need for region proposals by directly predicting object categories and bounding box coordinates through end-to-end feature extraction. This streamlined pipeline not only achieves faster inference speeds but also maintains competitive accuracy, making one-stage methods particularly suitable for real-time monitoring applications [
15].
In real-world scenarios, the distance, angle, and potential occlusion between the UAV and the captured object can affect how the object appears in the image. Even the same object may present different sizes and shapes in the image, which poses challenges to the detection algorithms [
16]. In addition, the intertwining of objects with complex backgrounds can weaken or blur the objects in images, increasing the likelihood of false and missed detections [
17]. In real UAV scene applications, natural factors such as clouds, fog, and snow significantly impact image quality and cannot be ignored. Images captured under these extreme conditions often exhibit quality degradation and domain shifts, posing significant challenges to object detection algorithms [
18]. In response to these challenges, Hu et al. [
19] propose a component-decoupling-based background suppression method that enhances target-background contrast through prior-guided cloud and mist component extraction. Peng et al. [
20] develop a dual-structure element morphological filtering approach employing directional enhancement and dynamic scale perception for low-SNR target detection in heavy cloud conditions. These studies demonstrate that optimizing both feature extraction capabilities and fusion architecture design significantly enhances model robustness in challenging environmental conditions.
Many studies have concentrated on optimizing feature extraction mechanisms to enhance the network’s performance in object detection tasks. Wang et al. [
21] incorporate an attention module into the backbone network to enhance its ability to capture key object features and design a dedicated feature processing module to further integrate shallow and deep features, thereby significantly enhancing performance in small object detection. Xu et al. [
22] replace the backbone portion of the YOLOv8 network with a lightweight MobileNetV3 network structure to optimize feature extraction while speeding up inference. Wang et al. [
23] propose a C2f-E structure based on the Efficient Multi-Scale Attention Module (EMA), which combines the EMA into the C2f module. This approach further strengthens the network’s feature extraction capability while enhancing its performance in detecting small targets. Liu et al. [
24] add ResNet50 as an auxiliary backbone while keeping the original backbone unchanged. They extract more informative low-level features through residual connectivity to enhance the detection effect. Ma et al. [
25] propose a dual-strategy dimensionality reduction approach that employs two different strategies to reduce the dimensionality of hyperspectral data from two complementary perspectives, effectively balancing computational efficiency with information retention. Shi et al. [
26] introduce the instance-guided enhancement module (IGEM) to adaptively combine instance-level information from the auxiliary branch with features in the main branches, thereby explicitly improving the discriminative features of aircraft. These studies effectively address the problem of insufficient feature extraction. However, introducing attention mechanisms may limit a model's generalizability.
Feature fusion strategies play a decisive role in improving detection accuracy, as recent research advances have shown. Tan et al. [
27] propose a weighted bidirectional feature pyramid network (BiFPN) that enables efficient and rapid multi-scale feature fusion. The low-level features extracted by the backbone network effectively alleviate the problem of information loss during feature propagation and play an essential role in improving model accuracy. Lim et al. [
28] propose an object detection algorithm using contextual information, which effectively elevates the model’s detection accuracy by fusing multiscale features with contextual information derived from different layers. Wang et al. [
29] propose the Bidirectional Adaptive Feature Pyramid Network (BAFPN) based on the BiFPN, which effectively enhances the model’s detection performance by optimizing the fusion capability of multi-scale features. Xu et al. [
30] propose a novel Efficient RepGFPN based on GFPN for real-time object detection. Unlike previous neck designs, it adopts a heavy neck and integrates sufficient feature fusion modules, which significantly improves both real-time performance and accuracy. All of these methods perform well in object detection tasks. However, they do not mine the backbone network features deeply enough.
Based on existing research, to address the challenge of fully utilizing target features in complex environments, this study proposes a UAV image object detection method based on a backbone feature reuse detection network, named BFRDNet. Firstly, to address the issue of insufficient feature extraction capability in complex environments, this paper proposes a new feature extraction module, MKConv. This module employs a set of three depthwise-separable convolutions, each with a distinct kernel size, to capture multi-scale features across varying receptive fields. These features are then merged to enhance the feature representation, which strengthens feature extraction, allows features to propagate through the network more completely, and mitigates the significant loss of feature information as the network deepens. Secondly, considering that the baseline model follows a backbone-dominant design and that the backbone network contains abundant fine-grained and semantic features, we design a backbone feature reuse pyramid network, named BFRPN, which emphasizes backbone feature reuse. The core of this method lies in efficiently utilizing the features of the backbone network, integrating the rich feature information in the backbone with the deep features of the neck. BFRPN also adopts a new fusion strategy that directly fuses the features extracted from adjacent layers in the backbone network and then integrates the fused features with the corresponding deep features obtained from the neck. The BFRPN ensures that the output features maximally retain the backbone's shallow features while incorporating the deep semantic features fused in the feature fusion stage, thus achieving the reuse of the backbone network features. Finally, we design a new detection head preprocessing module, named PDetect, which performs weighted mapping of the features entering the detection head to enhance the information flow between channels. These enhancements significantly boost the model's detection performance while reducing its parameter count and computational complexity.
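To make the MKConv design more concrete, the following PyTorch sketch illustrates one possible reading of the module: three depthwise-separable branches with different kernel sizes whose outputs are merged and fused back to the original channel count. The specific kernel sizes (3, 5, 7), the concatenation-based merge, the residual connection, and the class name are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MKConvSketch(nn.Module):
    """Illustrative multi-kernel block: three depthwise-separable branches with
    different kernel sizes, merged by concatenation and a 1x1 fusion (assumed)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                # depthwise convolution: one filter per channel
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
                # pointwise convolution completes the depthwise-separable pair
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            ))
        # 1x1 fusion restores the original channel count after concatenation
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1, bias=False)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x  # residual keeps shallow features

x = torch.randn(1, 64, 80, 80)
print(MKConvSketch(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```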
Before delving into the details of this model, the contributions of this research are summarized as follows:
(1) In this paper, we propose a new feature extraction module, MKConv, to extract shallow detail features and deep location features of targets. This module strengthens the representational capacity of features by reinforcing the feature aggregation process and mitigates, to a certain extent, information loss during network propagation.
(2) In this paper, we design a backbone feature reuse pyramid network, named BFRPN, which is designed to optimize the utilization of feature information extracted from the backbone network. It significantly improves the efficiency of feature fusion by adaptively injecting rich shallow features from the backbone into critical neck layers. Additionally, the BFRPN further integrates a dedicated detection head optimized for small objects, enhancing the model’s accuracy in small target detection.
(3) To achieve adequate detection of multi-scale targets, we design a detection head preprocessing module, named PDetect, which applies a weighted mapping strategy to the features entering the detection head (a minimal sketch follows below). This strategy enhances the information flow between channels, mitigates gradient vanishing, and improves the model's training efficiency and overall detection performance.
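As a rough illustration of the weighted-mapping idea behind PDetect, the sketch below applies a learned per-channel weighting to a feature map before it enters a detection head. The SE-style gating, the reduction ratio, and the residual shortcut are assumptions made for illustration, not the module's actual design.

```python
import torch
import torch.nn as nn

class PDetectPreprocess(nn.Module):
    """Illustrative detection-head preprocessing: a learned channel weighting
    (an SE-style gate, which is an assumption) re-maps the feature map before
    it reaches the detection head; the identity shortcut eases gradient flow."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global channel statistics
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x) + x  # weighted mapping plus identity shortcut

feat = torch.randn(2, 128, 40, 40)
print(PDetectPreprocess(128)(feat).shape)  # torch.Size([2, 128, 40, 40])
```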
The remainder of this paper is organized as follows:
Section 2 reviews the relevant literature in this research area.
Section 3 details the proposed model and its methodology.
Section 4 details the experiments and visual demonstrations of the proposed modules and analyzes their effectiveness.
Section 5 provides a summary of the paper.
4. Experiment
This section presents an overview of the datasets, parameter configurations, and evaluation metrics employed in our experimental studies. Additionally, to rigorously assess the robustness and generalizability of the BFRDNet, we execute a comprehensive suite of experiments across two widely recognized datasets. We compare BFRDNet to the benchmark YOLOv8s and other UAV object detection models. Detailed analysis of the experimental results is provided in the subsequent sections.
4.1. Experimental Datasets
During the experimental stage, we utilize the following two datasets to conduct our research.
(1) VisDrone: This dataset was obtained from drone aerial photography in diverse environments, including a comprehensive range of scenarios such as urban and rural, bright and shadowy. The dataset covers a wide range of target distributions from sparse to dense and stands as the predominant UAV aerial image dataset, offering a rich variety of scenarios and extensive data [
43]. The dataset encompasses 10 distinct detection target categories, including pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. The dataset is meticulously organized into distinct subsets, featuring 6471 images for training, 548 images for validation, and 3190 images reserved for testing. Each image is labeled with an average of 53 objects. In the test images, the average number of objects per image is 71, with the majority of these objects exhibiting pixel dimensions smaller than 32 × 32. In addition, different categories of targets have various levels of occluded portions in the images. Compared with other computer vision datasets, it has more realistic scenes and higher complexity, requiring higher network detection capability.
Figure 5 presents the statistical distributions of the VisDrone dataset, with
Figure 5a,b illustrating the target-size distribution and spatial-position distribution, respectively. The analysis reveals that (1) most targets have width/height dimensions smaller than 10% of the image size; and (2) targets show significant spatial aggregation, with the highest density observed in the lower-central region of the image.
Table 1 presents the statistical distribution of object categories in the VisDrone dataset, revealing a predominance of pedestrians and vehicles in urban drone-captured scenes. This distribution pattern emerges naturally from the dataset’s collection environment, where these categories occur more frequently in city settings. As a benchmark for aerial vision systems, we preserve this authentic distribution to ensure the dataset accurately represents real-world urban scenarios. This approach enables the development of detection algorithms that effectively address practical urban surveillance requirements.
(2) UAVDT: This large-scale benchmark dataset, captured by UAVs in complex environments, features diverse scenarios and high complexity. It is manually annotated and poses significant challenges for UAV detection and tracking. The image resolution is uniformly 1080 × 540, and the target categories in the images are mainly focused on vehicles, including car, bus, and truck [
44]. Compared with the VisDrone dataset, UAVDT is well-suited for vehicle detection in road images, featuring diverse weather conditions, viewing angles, and scenes. It covers various scenarios and conditions, providing researchers with a rich data resource that facilitates the development and evaluation of UAV vision technologies.
Figure 5 presents the statistical distributions of the UAVDT dataset, with
Figure 5c,d illustrating the target-size distribution and spatial-position distribution, respectively. The analysis reveals that (1) most targets have width/height dimensions smaller than 10% of the image size; and (2) targets show significant spatial aggregation, with the highest density observed in central image regions. As shown in
Table 2, cars comprise the majority of samples in the UAVDT dataset. This predominance reflects the dataset’s focus on urban road scenes, where cars naturally appear more frequently than other objects. We maintain this natural sample distribution to preserve the dataset’s realistic representation of traffic environments, making it a valuable benchmark for drone-based detection research.
4.2. Experimental Setup
The environment setup during the experiments was as follows: the operating system was Ubuntu 22.04, the GPU was an NVIDIA GeForce RTX 4090 (24 GB), and the CPU was an AMD EPYC 7402. Model training was implemented in Python 3.10 using the PyTorch 2.2.0 deep learning framework. During the training phase, the initial learning rate was set to 0.01. We used a stochastic gradient descent (SGD) optimizer with momentum, configured with a batch size of 4, weight decay of 0.0005, and momentum of 0.937. The training process consisted of 300 epochs. For experiments on different datasets, we maintain consistent training settings, including the same learning rate adjustment strategy and training duration, to ensure the comparability of results.
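For reference, the snippet below restates the reported training hyperparameters as a PyTorch optimizer configuration; the model object is a placeholder, and the learning rate schedule is omitted because this section does not specify it.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder standing in for BFRDNet

# Settings reported in Section 4.2: SGD with momentum, initial lr 0.01,
# weight decay 0.0005, momentum 0.937, batch size 4, 300 epochs.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
EPOCHS, BATCH_SIZE = 300, 4
```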
4.3. Evaluation Metrics
To comprehensively evaluate the effectiveness of the BFRDNet in object detection, this study adopts a suite of widely recognized evaluation metrics, including precision (P), recall (R), average precision (AP), and mean average precision (mAP). The corresponding formulas are presented below.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP represents correctly identified positive samples, FN represents positive samples incorrectly classified as negative, and FP represents negative samples mislabeled as positive. AP is computed as the area under the precision–recall curve and reflects the model's ability to maintain high precision across different recall thresholds.
$$AP = \int_{0}^{1} P(R)\, dR$$

where P(R) represents the precision value on the P–R curve corresponding to a recall of R, and AP represents the average performance of P and R.
AP is a metric computed for each category individually, whereas mAP averages the AP values of the different categories, offering a comprehensive measure of detection performance.

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$

where $AP_i$ represents the AP value of the i-th category, and n represents the total number of categories encompassed within the detection task.
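The sketch below translates these definitions into code: precision and recall from TP/FP/FN counts, AP as the area under a precision–recall curve (approximated here by trapezoidal integration rather than the interpolated variant many benchmarks use), and mAP as the mean over per-class AP values.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (simple trapezoidal approximation;
    recalls are assumed to be sorted in ascending order)."""
    return float(np.trapz(precisions, recalls))

def mean_average_precision(ap_per_class):
    """mAP = (1 / n) * sum of AP_i over the n categories."""
    return float(np.mean(ap_per_class))

# Toy numbers for illustration only.
print(precision_recall(tp=80, fp=20, fn=40))                 # (0.8, 0.666...)
print(average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.4]))   # 0.75
print(mean_average_precision([0.62, 0.48]))                  # 0.55
```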
In addition, to rigorously validate BFRDNet’s effectiveness, this study uses the COCO evaluation metric to evaluate the model’s proficiency in detecting objects of varying sizes [
45]. In the VisDrone dataset, instances are categorized into three size-based subsets following the COCO standard: a small subset with object areas smaller than 32 × 32 pixels, a medium subset with areas between 32 × 32 and 96 × 96 pixels, and a large subset with areas larger than 96 × 96 pixels. These correspond to $AP_S$, $AP_M$, and $AP_L$ under the COCO criterion, respectively.
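A small helper, following the COCO convention described above, shows how a detection box would be assigned to one of the three size subsets by its pixel area; the 32 × 32 and 96 × 96 thresholds come from the COCO standard.

```python
def coco_size_bucket(width, height):
    """Assign a box to the COCO size subset by its area (in pixels)."""
    area = width * height
    if area < 32 ** 2:
        return "small"    # contributes to AP_S
    if area < 96 ** 2:
        return "medium"   # contributes to AP_M
    return "large"        # contributes to AP_L

print(coco_size_bucket(20, 25))   # small
print(coco_size_bucket(50, 60))   # medium
print(coco_size_bucket(120, 90))  # large
```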
4.4. Experimental Results on the VisDrone Dataset
To thoroughly assess BFRDNet’s capabilities, we adopt mAP0.5, mAP0.5:0.95, and COCO evaluation metrics. As shown in
Table 3, BFRDNet achieves state-of-the-art performance in mAP0.5 (48.2%), improving on the YOLOv8s baseline by 7.5% in mAP0.5 and 4.2% in mAP0.5:0.95. Although not leading in every metric, BFRDNet exhibits superior overall accuracy, outperforming even the largest-scale baseline variants. This confirms its advantages in multi-scale object detection. Notably, while DMNet achieves higher detection accuracy at specific scales, its mAP remains lower than BFRDNet's. More significantly, BFRDNet demonstrates superior computational efficiency with a frame rate of 45.9 FPS compared to DMNet's 3.45 FPS. This roughly 13.3× speed advantage, coupled with competitive detection accuracy, positions BFRDNet as a more viable solution for real-time UAV applications.
To comprehensively assess the capabilities of BFRDNet, we benchmark it on the challenging VisDrone dataset with per-category accuracy analysis. As shown in
Table 4, BFRDNet achieves the highest overall mAP among all competitors. While exhibiting marginal performance gaps in the Truck and Awning-Tricycle categories, a more substantial discrepancy is observed for Bicycle detection when compared to E-YOLOv8. Specifically, E-YOLOv8’s anchor-free detection head and small-object-optimized backbone architecture yield a 41.8% AP in the Bicycle category (versus BFRDNet’s 20.7%), yet this specialization incurs a drop of roughly 2.3% in overall mAP compared with BFRDNet. Notably, BFRDNet dominates the remaining seven categories, demonstrating its general-purpose detection strength.
To provide a more intuitive understanding of the enhancements achieved through our proposed methods, this study discusses each improvement in detail through ablation experiments. The findings are detailed in
Table 5, with subsequent analysis provided below.
(1) MKConv: The experimental results show that after integrating the MKConv module into the baseline model, mAP0.5 and mAP0.5:0.95 improve by 1.1% and 0.5%, respectively, and precision and recall also increase. These improvements confirm that MKConv effectively captures multi-scale target features in the backbone network through convolutions with multiple receptive fields. While the number of parameters increases slightly, the module reduces the depth of the network, which minimizes feature degradation during propagation and helps features travel through the network more completely. By mining features from multiple receptive fields, MKConv effectively improves the quality of features in the backbone network.
(2)
BFRPN: To further improve multi-scale feature fusion, we propose a backbone feature reuse pyramid network, named BFRPN, which strengthens feature integration by amplifying the contribution of backbone network features in the fusion process so that these features fit more closely into the network's feature fusion architecture. As shown in
Table 5, the standalone application of the BFRPN module significantly improves mAP0.5 (+4.9%) and recall (+3.6%), and its combination with the MKConv module and the PDetect structure yields further systematic gains. The experimental results demonstrate that BFRPN efficiently utilizes the multi-scale target features extracted from the backbone network and significantly improves the detection accuracy of the model by amplifying the proportion of backbone network features in the entire model.
(3) PDetect: Building on the MKConv module and the BFRPN network, we further integrate PDetect into the framework. The experimental results show clear improvements: precision and recall increase significantly, and mAP0.5 and mAP0.5:0.95 improve by 1.7% and 0.6%, respectively. In addition, combining PDetect with MKConv alone both enhances detection accuracy and reduces the number of parameters. These results confirm that PDetect can effectively fuse multi-level features and strengthen feature representation. Moreover, PDetect is fully compatible with our network architecture and highly effective for UAV image object detection tasks.
To more intuitively highlight the strengths of our proposed model, we perform visualization experiments on both the baseline and BFRDNet models in this paper. Through heatmaps, we visually demonstrate how the improved model effectively focuses on regions missed by the baseline model. The visualization results are shown in
Figure 6. The first column shows the input images, while the second and third columns illustrate the detection outcomes of the baseline and the BFRDNet models, respectively. In the second and third columns of images, we marked the same regions with yellow arrows and boxes. This clearly shows that BFRDNet effectively identifies target regions missed by the baseline model. The fourth and fifth columns display the detection heatmaps for the baseline and BFRDNet models, respectively. The heatmap comparisons further confirm BFRDNet’s advantages: the intensified and concentrated activation regions in Column 5 reflect the effectiveness of our BFRPN module in preserving critical shallow features typically lost in conventional architectures. Additionally, the sharper response boundaries across varying target scales highlight MKConv’s dynamic receptive field adaptation, which suppresses background interference while maintaining focus on genuine targets. These findings demonstrate that BFRDNet’s architectural innovations, which combine feature reuse and adaptive processing, fully account for both the improved heatmap responses and enhanced detection accuracy in our experiments.
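Since the section does not state which heatmap method was used, the sketch below shows one generic way to produce such maps: collapsing a feature map into per-pixel activation magnitudes and upsampling it to the image resolution. Treat it as an assumption-laden stand-in rather than the paper's actual visualization pipeline.

```python
import torch
import torch.nn.functional as F

def activation_heatmap(feature_map, image_size):
    """Collapse a (C, H, W) feature map into a heatmap by averaging channel
    magnitudes, upsample to the input resolution, and normalize to [0, 1]."""
    heat = feature_map.abs().mean(dim=0, keepdim=True)              # (1, H, W)
    heat = F.interpolate(heat[None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat

fmap = torch.randn(256, 20, 20)                     # e.g., a neck output for one image
print(activation_heatmap(fmap, (640, 640)).shape)   # torch.Size([640, 640])
```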
The four selected images showcase four representative scenarios: long-range small targets, targets in low-light conditions, targets in complex backgrounds, and densely packed target areas situated at a distance from the UAV. The results confirm that BFRDNet markedly enhances object detection across various challenging environments, particularly in identifying closely spaced and distant small targets. This superior performance is due to our effective feature extraction techniques, which ensure robust target feature capture. Additionally, by employing more judicious methods in the stages of feature fusion and propagation, we effectively reduce the loss of critical target features, especially the smaller ones, to keep their characteristic information more intact.
In real application scenarios, the trade-off between accuracy and speed cannot be ignored. As shown in
Figure 7, we plot model accuracy against inference speed. The black regression line indicates that FPS increases as accuracy decreases. Although BFRDNet does not have the highest inference speed, its FPS already meets the requirement for real-time detection (FPS > 30) while offering clear advantages in accuracy. This observation is further supported by the comprehensive metric comparison in
Table 6. DMNet approaches BFRDNet's accuracy (47.6%) but suffers from impractical latency (3.45 FPS). While achieving higher frame rates, both ATO-YOLO and DM-YOLOX exhibit substantially lower detection performance than BFRDNet at comparable model complexities. These results collectively validate BFRDNet's favorable balance between computational efficiency and detection performance.
To visually demonstrate BFRDNet's performance and compare its recognition of each category with the baseline model in detail, we employ the confusion matrix as a visualization tool. The confusion matrix offers an intuitive means of comparing predictions against ground-truth labels by organizing them in a matrix. In the confusion matrix used in this study, each row corresponds to a category predicted by the model, whereas each column represents the actual category label. The diagonal elements signify correct predictions, that is, instances where the predicted category matches the ground-truth category. Conversely, the off-diagonal elements signify misclassifications, highlighting discrepancies between predicted and actual categories. This comparison facilitates a detailed assessment of the model's accuracy in recognizing the various categories.
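To pin down the convention used here (rows for predictions, columns for ground truth), a minimal sketch of building such a matrix from matched label pairs follows; handling of the extra background row and column for missed and spurious detections is omitted for brevity.

```python
import numpy as np

def confusion_matrix(pred_labels, true_labels, num_classes):
    """Build a confusion matrix using this paper's convention:
    rows index the predicted category, columns index the ground-truth category
    (the transpose of some common library defaults)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for p, t in zip(pred_labels, true_labels):
        cm[p, t] += 1
    return cm

# Toy example with 3 classes: diagonal entries are correct predictions.
pred = [0, 1, 2, 2, 1]
true = [0, 1, 1, 2, 1]
print(confusion_matrix(pred, true, num_classes=3))
```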
Figure 8 illustrates the comparative analysis of prediction results for each category between the baseline model and BFRDNet. Compared with the baseline model, a marked increase in the values along the main diagonal is evident, indicating a significant advancement in BFRDNet's ability to accurately predict multi-scale targets. Furthermore, examining the last row of the confusion matrix shows that BFRDNet exhibits lower values there, indicating that it significantly reduces the rate at which objects are misclassified as background. These observations underscore the efficacy of BFRDNet's feature extraction capabilities, enabling it to distinguish targets from complex backgrounds more adeptly.
In this study, we propose MKConv, a novel feature extraction module designed to replace the conventional C2f module in the baseline model. MKConv is specifically engineered to enhance multi-scale target information extraction. Prior to experimentation, we conducted a thorough analysis of MKConv’s design, hypothesizing that its optimal performance would require integration across all backbone network layers, particularly in the deepest layers. To validate this hypothesis, we progressively implemented MKConv into the backbone network—starting with fewer layers (P2) and systematically extending to additional layers (P3, P4, P5)—while performing comparative experiments at each stage. As demonstrated in
Table 7, the results confirm that MKConv achieves peak performance when it fully replaces the C2f module throughout the backbone network. Furthermore, we observed that deploying MKConv solely on the P2 layer achieved the highest recall. To investigate this phenomenon, we conducted comparative experiments analyzing recall across different target scales. The results revealed that, with comparable recall for small targets, the P2-only configuration exhibited a 1.5% higher recall for large-scale targets than the configuration with the best mAP0.5, while its recall for medium-scale targets was only 0.5% lower. This explains why the P2-only configuration outperformed the others in terms of recall. However, to ensure balanced detection performance, we ultimately adopted the full-replacement configuration (the last row of Table 7) for the final model.
To comprehensively evaluate the advantageous effect of the MKConv module on the network’s receptive field, we employ a visual feature mapping technique to provide a detailed illustration.
Figure 9a–e depict the receptive field visualizations of the original backbone network and of the network after sequentially replacing the C2f module at layers P2 through P5. Without modifying the backbone network, it is evident that the coverage area of the region of interest (ROI) is comparatively small and the green color of the central region appears blurry, indicating that the network's object perception capability is limited. In contrast, after integrating the MKConv module, there is a notable expansion in the ROI's coverage, and the deepening of the color indicates an enhanced ability to capture local foreground features. This improvement effectively suppresses background noise, enabling more accurate and efficient extraction of the object's key feature areas.
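One common way to produce receptive field visualizations of this kind is the gradient-based effective receptive field: back-propagate a unit signal from the centre of an output feature map and inspect the magnitude of the input gradient. The sketch below uses a toy backbone as a placeholder; the authors' exact visualization procedure is not specified here.

```python
import torch
import torch.nn as nn

def effective_receptive_field(backbone, input_size=224):
    """Back-propagate a unit gradient from the centre of the output feature map
    and return the magnitude of the resulting input gradient, which indicates
    how strongly each input pixel influences the centre response."""
    x = torch.randn(1, 3, input_size, input_size, requires_grad=True)
    out = backbone(x)
    grad = torch.zeros_like(out)
    grad[0, :, out.shape[2] // 2, out.shape[3] // 2] = 1.0  # seed the centre location
    out.backward(grad)
    return x.grad.abs().sum(dim=1)[0]  # (input_size, input_size) influence map

# Toy stand-in for the backbone stages (P2-P5) under study.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, stride=2, padding=1),
)
print(effective_receptive_field(toy_backbone).shape)  # torch.Size([224, 224])
```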
To demonstrate the effectiveness of the proposed BFRPN in multi-scale feature fusion, we conducted a series of comparative experiments involving various feature fusion networks. In these experiments, we substituted the BFRPN architecture in our BFRDNet model with a range of existing feature fusion networks and evaluated their performance. We present the results of these comparisons in
Table 8. While PAFPN and AFPN have the fewest parameters, they exhibit lower detection accuracy. GFPN yields mediocre results and is not well suited to backbone-dominant detection models. While BiFPN demonstrates improved feature fusion capabilities, its performance gains do not justify the significant parameter overhead it introduces. In contrast, the BFRPN introduced in this paper achieves the highest mAP0.5 and mAP0.5:0.95, along with superior precision and recall. The experimental results show that BFRPN significantly improves the detection performance of the model by amplifying the proportion of backbone network features in the model and using them more comprehensively. Compared with general feature fusion networks, the BFRPN designed in this paper is better suited to backbone-dominant detection models.
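To illustrate the reuse strategy that distinguishes BFRPN from the networks in Table 8, the sketch below fuses two adjacent backbone levels and injects the result into the corresponding neck feature. The resizing choice, the concatenation plus 1×1 fusion, and the elementwise injection are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentBackboneFusion(nn.Module):
    """Illustrative BFRPN-style reuse step: fuse two adjacent backbone levels
    (the finer one is resized to match), then inject the fused result into the
    corresponding neck feature."""
    def __init__(self, c_shallow, c_deep, c_neck):
        super().__init__()
        self.align = nn.Conv2d(c_shallow + c_deep, c_neck, 1, bias=False)

    def forward(self, p_shallow, p_deep, neck_feat):
        # Match the shallow backbone feature to the deeper level's resolution.
        p_shallow = F.interpolate(p_shallow, size=p_deep.shape[2:], mode="nearest")
        fused_backbone = self.align(torch.cat([p_shallow, p_deep], dim=1))
        # Reuse the fused backbone features alongside the neck's deep features.
        return neck_feat + fused_backbone

p3, p4 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40)
neck = torch.randn(1, 256, 40, 40)
print(AdjacentBackboneFusion(128, 256, 256)(p3, p4, neck).shape)  # (1, 256, 40, 40)
```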
In this research, we conducted an in-depth exploration and carefully crafted a diverse set of experiments focusing on the detection head module. Our exploration spanned three critical areas: attention mechanisms, multi-scale feature fusion, and multi-branch connection. To assess the efficacy of the detection head module, we integrated insights from existing research with our experimental findings. The comparative experimental results, detailed in
Table 9, reveal a key insight: incorporating specific feature-introducing modules does not always enhance detection accuracy, especially in the detection head of the network. Our experiments underscore that only modules optimally aligned with the task can deliver peak detection performance. The PDetect module we propose is not only highly adaptable to the BFRDNet detection model but also demonstrates superior detection capability. This outcome not only corroborates the effectiveness of our approach but also offers novel perspectives for future research endeavors concerning detection head modules.
4.5. Experimental Results on the UAVDT Dataset
In order to validate the effectiveness and applicability of the BFRDNet model, we conduct comparative experiments on the UAVDT dataset, with the outcomes presented in
Table 10. When juxtaposed against several state-of-the-art algorithms, BFRDNet exhibits marked superiority in performance. In comparison with the baseline model, BFRDNet achieves a 4.8% improvement in detecting objects within the Truck category and a 0.1% increase in overall average accuracy. Nevertheless, it is essential to note that the performance of BFRDNet on the UAVDT dataset does not match the level of excellence achieved on the VisDrone dataset. Upon rigorous analysis, we discern that BFRDNet possesses commendable capabilities in multi-scale feature extraction, effectively managing the extensive scale and diversity inherent in the VisDrone dataset. Conversely, when confronted with the UAVDT dataset, which features dynamic shifts in UAV viewpoints and inconsistent aspect ratios, BFRDNet’s adaptability is inadequate. This shortfall in adaptability is identified as a key area for future enhancement in our ongoing research.
Furthermore, to visually illustrate the performance disparity between the baseline model and BFRDNet in practical application settings, we conduct an array of visual comparative analyses on the UAVDT dataset, with the results depicted in
Figure 10. The results of the baseline model are depicted in column b, and the detection outcomes of BFRDNet in column c. To more clearly highlight the performance differences between the two models, we magnify and annotate critical regions with yellow arrows, emphasizing areas where these differences are most evident. The four sets of images in
Figure 10 display oblique aerial photographs of vehicles captured by the UAV under varying lighting conditions. Comparing the magnified regions shows that BFRDNet is markedly more effective than the baseline model in detecting small vehicle targets in long-range views, demonstrating enhanced capability in identifying small-sized targets at diverse distances. Although BFRDNet does not detect every target, its detection performance is substantially superior to that of the baseline model. The results demonstrate that the improvements in this article boost the model's ability to discern small-sized targets from long-distance perspectives under different shooting angles, varied illumination conditions, and a spectrum of scenes.
4.6. Extended Experiments
To further validate the generalization capability of the proposed BFRDNet architecture, we transfer its design to YOLOv11. Compared with the baseline model, the resulting augmented version, BFRDNetV2, exhibits superior generalization performance, as demonstrated by our evaluation on the VisDrone and UAVDT datasets.
Table 11 summarizes the comparative evaluation results between our proposed BFRDNetV2 and the baseline YOLOv11 model. The experimental data demonstrate consistent performance improvements, with mAP increases of 5.3% on VisDrone and 0.3% on UAVDT datasets, respectively. These quantitative results substantiate the enhanced detection capability of our method for UAV image object detection tasks.
Figure 11 presents the visualization comparison between YOLOv11 and BFRDNetV2 on the VisDrone and UAVDT datasets, displayed in columns b and c, respectively. To more clearly illustrate the performance differences between the two models, we magnify and annotate critical regions with yellow arrows, highlighting significant disparities. The first two rows depict the comparison on the VisDrone dataset, where our model demonstrates a pronounced advantage in detecting small-scale targets, such as pedestrians and vehicles, within complex scenes. These findings indicate that our model can effectively capture feature information in intricate environments, thereby achieving more precise object detection. The subsequent two rows show the comparison on the UAVDT dataset. The third row emphasizes the superior performance of BFRDNetV2 in detecting partially occluded targets and targets with only partially visible features. The fourth row illustrates the robustness of BFRDNetV2 in recognizing densely packed targets, effectively reducing missed detections. Collectively, these results demonstrate the effectiveness and practicality of BFRDNetV2 for UAV image object detection.
In addition, we perform experiments to assess generalizability on the COCO dataset, and the outcomes are presented in
Table 12. We report detection results for selected categories in which BFRDNetV2 performs well, further demonstrating its strong ability to detect small targets.
To visually illustrate the performance of the enhanced BFRDNetV2 model, we conduct visual comparative analyses using the COCO dataset, with the results displayed in
Figure 12. Although the COCO dataset is not specifically composed of UAV-captured images, the comparative outcomes clearly highlight BFRDNetV2's superior capability in identifying small targets. While BFRDNetV2 does not detect every target, its accuracy is significantly improved.