1. Introduction
In the process of urban modernization, the wide application of high technology has penetrated every aspect of our lives. The swift advancement of UAV technology, one of the representatives of high technology, has brought about revolutionary changes in several fields. UAVs not only provide an efficient means of information collection in the fields of military defense [
1], agricultural production [
2], traffic monitoring [
3], and hazardous condition monitoring [
4], but their advanced intelligent processing capabilities are also crucial for solving complex problems in ground observation. In traffic monitoring, UAVs can monitor traffic flow in real-time through object detection technology to prevent traffic congestion [
5,
6]. In hazardous condition monitoring, UAVs can quickly locate trapped people and provide timely and critical information for rescue teams [
7]. These applications demonstrate the value of the UAV as an aerial observation platform, and UAVs have accordingly been adopted across multiple fields. However, the broader and more varied observation angles of UAVs produce images characterized by small targets, complex backgrounds, and large scale variations. These characteristics differ markedly from images captured by conventional cameras and pose greater challenges to detection technology. In particular, densely arranged small targets, cluttered environments, and unclear target boundaries make it difficult for traditional object detection algorithms to deliver fast and accurate detection in UAV-related tasks. These difficulties limit the application of UAV technology and bring new challenges to computer vision research.
Object detection algorithms are primarily divided into two categories: those based on traditional manual feature extraction and those based on deep learning. Manual feature extraction methods can be custom-designed according to different application scenarios and needs, providing flexibility and adaptability. However, manually designed features have difficulty capturing sufficient information, leading to poor generalization ability. Additionally, the computational efficiency of these features often fails to satisfy the requirements of most scenarios [
1]. The emergence of deep learning has significantly propelled the advancement of object detection technology, and its superiority in processing speed and generalization ability has made deep learning-based object detection algorithms mainstream in the industry. Deep learning-based algorithms can be broadly divided into two types: two-stage methods and one-stage methods. The two-stage methods, represented by R-CNN [
8], Fast R-CNN [
9], Faster R-CNN [
10], and Cascade R-CNN [
11], operate through region proposal mechanisms. These methods first generate numerous candidate regions and subsequently perform classification and bounding-box refinement using dedicated classifiers. Although two-stage methods can achieve higher detection accuracy, their inference speed is often too slow for real-time applications [
12]. The one-stage methods, represented by the Single Shot MultiBox Detector (SSD) [
13] and You Only Look Once (YOLO) [
14] series of algorithms, are widely used in real-time object detection applications. Unlike two-stage methods, these methods eliminate the need for region proposals by directly predicting object categories and bounding box coordinates through end-to-end feature extraction. This streamlined pipeline not only achieves faster inference speeds but also maintains competitive accuracy, making one-stage methods particularly suitable for real-time monitoring applications [
15].
In real-world scenarios, the distance, angle, and potential occlusion between the UAV and the captured object can affect how the object appears in the image. Even the same object may present different sizes and shapes in the image, which poses challenges to the detection algorithms [
16]. In addition, the intertwining of objects with complex backgrounds can weaken or blur the objects in images, increasing the likelihood of false and missed detections [
17]. In real UAV scene applications, natural factors such as clouds, fog, and snow significantly impact image quality and cannot be ignored. Images captured under these extreme conditions often exhibit quality degradation and domain shifts, posing significant challenges to object detection algorithms [
18]. In response to these challenges, Hu et al. [
19] propose a component-decoupling-based background suppression method that enhances target-background contrast through prior-guided cloud and mist component extraction. Peng et al. [
20] develop a dual-structure element morphological filtering approach employing directional enhancement and dynamic scale perception for low-SNR target detection in heavy cloud conditions. These studies demonstrate that optimizing both feature extraction capabilities and fusion architecture design significantly enhances model robustness in challenging environmental conditions.
Many studies have concentrated on optimizing feature extraction mechanisms to enhance the network’s performance in object detection tasks. Wang et al. [
21] incorporate an attention module into the backbone network to enhance its ability to capture key object features and design a dedicated feature processing module to further integrate shallow and deep features, thereby significantly enhancing performance in small object detection. Xu et al. [
22] replace the backbone portion of the YOLOv8 network with a lightweight MobileNetV3 network structure to optimize feature extraction while speeding up inference. Wang et al. [
23] propose a C2f-E structure based on the Efficient Multi-Scale Attention Module (EMA), which combines the EMA into the C2f module. This approach further strengthens the network’s feature extraction capability while enhancing its performance in detecting small targets. Liu et al. [
24] add ResNet50 as an auxiliary backbone while keeping the original backbone unchanged. They extract more informative low-level features through residual connectivity to enhance the detection effect. Ma et al. [
25] propose a dual-strategy dimensionality reduction approach that employs two different strategies to reduce the dimensionality of hyperspectral data from two complementary perspectives, effectively balancing computational efficiency with information retention. Shi et al. [
26] introduce the instance-guided enhancement module (IGEM) to adaptively combine instance-level information from the auxiliary branch with features in the main branches, thereby explicitly improving the discriminative features of aircraft. These studies effectively address the problem of insufficient feature extraction. However, introducing attention mechanisms may limit a model's generalizability.
Feature fusion strategies play a decisive role in improving detection accuracy, as recent research advances have shown. Tan et al. [
27] propose a weighted bidirectional feature pyramid network (BiFPN) that enables efficient and rapid multi-scale feature fusion. The low-level features extracted by the backbone network effectively alleviate the problem of information loss during feature propagation and play an essential role in improving model accuracy. Lim et al. [
28] propose an object detection algorithm using contextual information, which effectively elevates the model’s detection accuracy by fusing multiscale features with contextual information derived from different layers. Wang et al. [
29] propose the Bidirectional Adaptive Feature Pyramid Network (BAFPN) based on the BiFPN, which effectively enhances the model’s detection performance by optimizing the fusion capability of multi-scale features. Xu et al. [
30] propose a novel Efficient RepGFPN based on GFPN for real-time object detection. Unlike previous neck designs, it adopts a heavy neck and integrates sufficient feature fusion modules, which significantly improves both real-time performance and accuracy. All of these methods perform well in object detection tasks. However, they do not mine the backbone network features deeply enough.
Based on existing research, to address the challenge of fully utilizing target features in complex environments, this study proposes a UAV image object detection method based on a backbone feature reuse detection network, named BFRDNet. Firstly, to address the issue of insufficient feature extraction capability in complex environments, this paper proposes a new feature extraction module, MKConv. This module employs a set of three depthwise-separable convolutions, each with a distinct kernel size, to capture multi-scale features across varying receptive fields. These features are then merged to enhance the feature representation, which strengthens feature extraction, allows features to propagate through the network more completely, and mitigates the significant loss of feature information as the network deepens. Secondly, considering that the baseline model follows a backbone-dominant design and that the backbone network contains abundant fine-grained and semantic features, we design a backbone feature reuse pyramid network, named BFRPN, which emphasizes backbone feature reuse. The core of this method lies in efficiently utilizing the features of the backbone network, integrating the rich feature information in the backbone with the deep features of the neck. BFRPN also adopts a new fusion strategy that directly fuses the features extracted from adjacent layers in the backbone network and then integrates the fused features with the corresponding deep features obtained from the neck. The BFRPN ensures that the output features maximally retain the backbone's shallow features while incorporating the deep semantic features fused in the feature fusion stage, thus achieving the reuse of the backbone network features. Finally, we design a new detection head preprocessing module, named PDetect, which performs weighted mapping of the features entering the detection head to enhance the information flow between channels. These enhancements significantly boost the model's detection performance while reducing its parameter count and computational complexity.
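To make the MKConv design more concrete, the following PyTorch sketch illustrates one possible reading of the module: three depthwise-separable branches with different kernel sizes whose outputs are merged and fused back to the original channel count. The specific kernel sizes (3, 5, 7), the concatenation-based merge, the residual connection, and the class name are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MKConvSketch(nn.Module):
    """Illustrative multi-kernel block: three depthwise-separable branches with
    different kernel sizes, merged by concatenation and a 1x1 fusion (assumed)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in kernel_sizes:
            self.branches.append(nn.Sequential(
                # depthwise convolution: one filter per channel
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
                # pointwise convolution completes the depthwise-separable pair
                nn.Conv2d(channels, channels, 1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            ))
        # 1x1 fusion restores the original channel count after concatenation
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1, bias=False)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x  # residual keeps shallow features

x = torch.randn(1, 64, 80, 80)
print(MKConvSketch(64)(x).shape)  # torch.Size([1, 64, 80, 80])
```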
Before delving into the details of this model, the contributions of this research are summarized as follows:
(1) In this paper, we propose a new feature extraction module, MKConv, to extract shallow detail features and deep location features of targets. This module strengthens the representational capacity of features by reinforcing the feature aggregation process and mitigates, to a certain extent, information loss during network propagation.
(2) In this paper, we design a backbone feature reuse pyramid network, named BFRPN, which is designed to optimize the utilization of feature information extracted from the backbone network. It significantly improves the efficiency of feature fusion by adaptively injecting rich shallow features from the backbone into critical neck layers. Additionally, the BFRPN further integrates a dedicated detection head optimized for small objects, enhancing the model’s accuracy in small target detection.
(3) To achieve adequate detection of multi-scale targets, we design a detection head preprocessing module, named PDetect, which applies a weighted mapping strategy to the features entering the detection head (a minimal sketch follows below). This strategy enhances the information flow between channels, mitigates gradient vanishing, and improves the model's training efficiency and overall detection performance.
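As a rough illustration of the weighted-mapping idea behind PDetect, the sketch below applies a learned per-channel weighting to a feature map before it enters a detection head. The SE-style gating, the reduction ratio, and the residual shortcut are assumptions made for illustration, not the module's actual design.

```python
import torch
import torch.nn as nn

class PDetectPreprocess(nn.Module):
    """Illustrative detection-head preprocessing: a learned channel weighting
    (an SE-style gate, which is an assumption) re-maps the feature map before
    it reaches the detection head; the identity shortcut eases gradient flow."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global channel statistics
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(x) + x  # weighted mapping plus identity shortcut

feat = torch.randn(2, 128, 40, 40)
print(PDetectPreprocess(128)(feat).shape)  # torch.Size([2, 128, 40, 40])
```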
The remainder of this paper is organized as follows:
Section 2 reviews the relevant literature in this research area.
Section 3 details the proposed model and its methodology.
Section 4 details the experiments and visual demonstrations of the proposed modules and analyzes their effectiveness.
Section 5 provides a summary of the paper.
4. Experiment
This section presents an overview of the datasets, parameter configurations, and evaluation metrics employed in our experimental studies. Additionally, to rigorously assess the robustness and generalizability of the BFRDNet, we execute a comprehensive suite of experiments across two widely recognized datasets. We compare BFRDNet to the benchmark YOLOv8s and other UAV object detection models. Detailed analysis of the experimental results is provided in the subsequent sections.
4.1. Experimental Datasets
During the experimental stage, we utilize the following two datasets to conduct our research.
(1) VisDrone: This dataset was obtained from drone aerial photography in diverse environments, including a comprehensive range of scenarios such as urban and rural, bright and shadowy. The dataset covers a wide range of target distributions from sparse to dense and stands as the predominant UAV aerial image dataset, offering a rich variety of scenarios and extensive data [
43]. The dataset encompasses 10 distinct detection target categories, including pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. The dataset is meticulously organized into distinct subsets, featuring 6471 images for training, 548 images for validation, and 3190 images reserved for testing. Each image is labeled with an average of 53 objects. In the test images, the average number of objects per image is 71, with the majority of these objects exhibiting pixel dimensions smaller than 32 × 32. In addition, different categories of targets have various levels of occluded portions in the images. Compared with other computer vision datasets, it has more realistic scenes and higher complexity, requiring higher network detection capability.
Figure 5 presents the statistical distributions of the VisDrone dataset, with
Figure 5a,b illustrating the target-size distribution and spatial-position distribution, respectively. The analysis reveals that (1) most targets have width/height dimensions smaller than 10% of the image size; and (2) targets show significant spatial aggregation, with the highest density observed in the lower-central region of the image.
Table 1 presents the statistical distribution of object categories in the VisDrone dataset, revealing a predominance of pedestrians and vehicles in urban drone-captured scenes. This distribution pattern emerges naturally from the dataset’s collection environment, where these categories occur more frequently in city settings. As a benchmark for aerial vision systems, we preserve this authentic distribution to ensure the dataset accurately represents real-world urban scenarios. This approach enables the development of detection algorithms that effectively address practical urban surveillance requirements.
(2) UAVDT: This large-scale benchmark dataset, captured by UAVs in complex environments, features diverse scenarios and high complexity. It is manually annotated and poses significant challenges for UAV detection and tracking. The image resolution is uniformly 1080 × 540, and the target categories in the images are mainly focused on vehicles, including car, bus, and truck [
44]. Compared with the VisDrone dataset, UAVDT is well-suited for vehicle detection in road images, featuring diverse weather conditions, viewing angles, and scenes. It covers various scenarios and conditions, providing researchers with a rich data resource that facilitates the development and evaluation of UAV vision technologies.
Figure 5 presents the statistical distributions of the UAVDT dataset, with
Figure 5c,d illustrating the target-size distribution and spatial-position distribution, respectively. The analysis reveals that (1) most targets have width/height dimensions smaller than 10% of the image size; and (2) targets show significant spatial aggregation, with the highest density observed in central image regions. As shown in
Table 2, cars comprise the majority of samples in the UAVDT dataset. This predominance reflects the dataset’s focus on urban road scenes, where cars naturally appear more frequently than other objects. We maintain this natural sample distribution to preserve the dataset’s realistic representation of traffic environments, making it a valuable benchmark for drone-based detection research.
4.2. Experimental Setup
The environment setup during the experiments was as follows: the operating system was Ubuntu 22.04, the GPU was an NVIDIA GeForce RTX 4090 (24 GB), and the CPU was an AMD EPYC 7402. Model training was implemented in Python 3.10 using the PyTorch 2.2.0 deep learning framework. During the training phase, the initial learning rate was set to 0.01. We used a stochastic gradient descent (SGD) optimizer with momentum, configured with a batch size of 4, weight decay of 0.0005, and momentum of 0.937. The training process consisted of 300 epochs. For experiments on different datasets, we maintain consistent training settings, including the same learning rate adjustment strategy and training duration, to ensure the comparability of results.
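For reference, the snippet below restates the reported training hyperparameters as a PyTorch optimizer configuration; the model object is a placeholder, and the learning rate schedule is omitted because this section does not specify it.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder standing in for BFRDNet

# Settings reported in Section 4.2: SGD with momentum, initial lr 0.01,
# weight decay 0.0005, momentum 0.937, batch size 4, 300 epochs.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
EPOCHS, BATCH_SIZE = 300, 4
```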
4.3. Evaluation Metrics
To comprehensively evaluate the effectiveness of the BFRDNet in object detection, this study adopts a suite of widely recognized evaluation metrics, including precision (P), recall (R), average precision (AP), and mean average precision (mAP). The corresponding formulas are presented below.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP represents correctly identified positive samples, FN represents positive samples incorrectly classified as negative, and FP represents negative samples mislabeled as positive. AP is computed as the area under the precision–recall curve and reflects the model's ability to maintain high precision across different recall thresholds.
$$AP = \int_{0}^{1} P(R)\, dR$$

where P(R) represents the precision value on the P–R curve corresponding to a recall of R, and AP represents the average performance of P and R.
AP is a metric computed for each category individually, whereas mAP averages the AP values of the different categories, offering a comprehensive measure of detection performance.

$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$

where $AP_i$ represents the AP value of the i-th category, and n represents the total number of categories encompassed within the detection task.
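The sketch below translates these definitions into code: precision and recall from TP/FP/FN counts, AP as the area under a precision–recall curve (approximated here by trapezoidal integration rather than the interpolated variant many benchmarks use), and mAP as the mean over per-class AP values.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (simple trapezoidal approximation;
    recalls are assumed to be sorted in ascending order)."""
    return float(np.trapz(precisions, recalls))

def mean_average_precision(ap_per_class):
    """mAP = (1 / n) * sum of AP_i over the n categories."""
    return float(np.mean(ap_per_class))

# Toy numbers for illustration only.
print(precision_recall(tp=80, fp=20, fn=40))                 # (0.8, 0.666...)
print(average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.4]))   # 0.75
print(mean_average_precision([0.62, 0.48]))                  # 0.55
```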
In addition, to rigorously validate BFRDNet’s effectiveness, this study uses the COCO evaluation metric to evaluate the model’s proficiency in detecting objects of varying sizes [
45]. In the VisDrone dataset, instances are categorized into three size-based subsets following the COCO standard: a small subset with object areas smaller than 32 × 32 pixels, a medium subset with areas between 32 × 32 and 96 × 96 pixels, and a large subset with areas larger than 96 × 96 pixels. These correspond to $AP_S$, $AP_M$, and $AP_L$ under the COCO criterion, respectively.
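A small helper, following the COCO convention described above, shows how a detection box would be assigned to one of the three size subsets by its pixel area; the 32 × 32 and 96 × 96 thresholds come from the COCO standard.

```python
def coco_size_bucket(width, height):
    """Assign a box to the COCO size subset by its area (in pixels)."""
    area = width * height
    if area < 32 ** 2:
        return "small"    # contributes to AP_S
    if area < 96 ** 2:
        return "medium"   # contributes to AP_M
    return "large"        # contributes to AP_L

print(coco_size_bucket(20, 25))   # small
print(coco_size_bucket(50, 60))   # medium
print(coco_size_bucket(120, 90))  # large
```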
4.4. Experimental Results on the VisDrone Dataset
To thoroughly assess BFRDNet’s capabilities, we adopt mAP0.5, mAP0.5:0.95, and COCO evaluation metrics. As shown in
Table 3, BFRDNet achieves state-of-the-art performance in mAP0.5 (48.2%), improving on the YOLOv8s baseline by 7.5% in mAP0.5 and 4.2% in mAP0.5:0.95. Although not leading in every metric, BFRDNet exhibits superior overall accuracy, outperforming even the largest-scale baseline variants. This confirms its advantages in multi-scale object detection. Notably, while DMNet achieves higher detection accuracy at specific scales, its mAP remains lower than BFRDNet's. More significantly, BFRDNet demonstrates superior computational efficiency with a frame rate of 45.9 FPS compared to DMNet's 3.45 FPS. This roughly 13.3× speed advantage, coupled with competitive detection accuracy, positions BFRDNet as a more viable solution for real-time UAV applications.
To comprehensively assess the capabilities of BFRDNet, we benchmark it on the challenging VisDrone dataset with per-category accuracy analysis. As shown in
Table 4, BFRDNet achieves the highest overall mAP among all competitors. While exhibiting marginal performance gaps in the Truck and Awning-Tricycle categories, a more substantial discrepancy is observed for Bicycle detection when compared to E-YOLOv8. Specifically, E-YOLOv8’s anchor-free detection head and small-object-optimized backbone architecture yield a 41.8% AP in the Bicycle category (versus BFRDNet’s 20.7%), yet this specialization incurs a drop of roughly 2.3% in overall mAP compared with BFRDNet. Notably, BFRDNet dominates the remaining seven categories, demonstrating its general-purpose detection strength.
To provide a more intuitive understanding of the enhancements achieved through our proposed methods, this study discusses each improvement in detail through ablation experiments. The findings are detailed in
Table 5, with subsequent analysis provided below.
(1) MKConv: The experimental results show that after integrating the MKConv module into the baseline model, mAP0.5 and mAP0.5:0.95 improve by 1.1% and 0.5%, respectively, and precision and recall also increase. These improvements confirm that MKConv effectively captures multi-scale target features in the backbone network through convolutions with multiple receptive fields. While the number of parameters increases slightly, the module reduces the depth of the network, which minimizes feature degradation during propagation and helps features travel through the network more completely. By mining features from multiple receptive fields, MKConv effectively improves the quality of features in the backbone network.
(2)
BFRPN: To further improve multi-scale feature fusion, we propose a backbone feature reuse pyramid network, named BFRPN, which strengthens feature integration by amplifying the contribution of backbone network features in the fusion process so that these features fit more closely into the network's feature fusion architecture. As shown in
Table 5, the standalone application of the BFRPN module significantly improves mAP0.5 (+4.9%) and recall (+3.6%), and its combination with the MKConv module and the PDetect structure yields further systematic gains. The experimental results demonstrate that BFRPN efficiently utilizes the multi-scale target features extracted from the backbone network and significantly improves the detection accuracy of the model by amplifying the proportion of backbone network features in the entire model.
(3) PDetect: Building on the MKConv module and the BFRPN network, we further integrate PDetect into the framework. The experimental results show clear improvements: precision and recall increase significantly, and mAP0.5 and mAP0.5:0.95 improve by 1.7% and 0.6%, respectively. In addition, combining PDetect with MKConv alone both enhances detection accuracy and reduces the number of parameters. These results confirm that PDetect can effectively fuse multi-level features and strengthen feature representation. Moreover, PDetect is fully compatible with our network architecture and highly effective for UAV image object detection tasks.
To more intuitively highlight the strengths of our proposed model, we perform visualization experiments on both the baseline and BFRDNet models in this paper. Through heatmaps, we visually demonstrate how the improved model effectively focuses on regions missed by the baseline model. The visualization results are shown in
Figure 6. The first column shows the input images, while the second and third columns illustrate the detection outcomes of the baseline and the BFRDNet models, respectively. In the second and third columns of images, we marked the same regions with yellow arrows and boxes. This clearly shows that BFRDNet effectively identifies target regions missed by the baseline model. The fourth and fifth columns display the detection heatmaps for the baseline and BFRDNet models, respectively. The heatmap comparisons further confirm BFRDNet’s advantages: the intensified and concentrated activation regions in Column 5 reflect the effectiveness of our BFRPN module in preserving critical shallow features typically lost in conventional architectures. Additionally, the sharper response boundaries across varying target scales highlight MKConv’s dynamic receptive field adaptation, which suppresses background interference while maintaining focus on genuine targets. These findings demonstrate that BFRDNet’s architectural innovations, which combine feature reuse and adaptive processing, fully account for both the improved heatmap responses and enhanced detection accuracy in our experiments.
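Since the section does not state which heatmap method was used, the sketch below shows one generic way to produce such maps: collapsing a feature map into per-pixel activation magnitudes and upsampling it to the image resolution. Treat it as an assumption-laden stand-in rather than the paper's actual visualization pipeline.

```python
import torch
import torch.nn.functional as F

def activation_heatmap(feature_map, image_size):
    """Collapse a (C, H, W) feature map into a heatmap by averaging channel
    magnitudes, upsample to the input resolution, and normalize to [0, 1]."""
    heat = feature_map.abs().mean(dim=0, keepdim=True)              # (1, H, W)
    heat = F.interpolate(heat[None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return heat

fmap = torch.randn(256, 20, 20)                     # e.g., a neck output for one image
print(activation_heatmap(fmap, (640, 640)).shape)   # torch.Size([640, 640])
```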
The four selected images showcase four representative scenarios: long-range small targets, targets in low-light conditions, targets in complex backgrounds, and densely packed target areas situated at a distance from the UAV. The results confirm that BFRDNet markedly enhances object detection across various challenging environments, particularly in identifying closely spaced and distant small targets. This superior performance is due to our effective feature extraction techniques, which ensure robust target feature capture. Additionally, by employing more judicious methods in the stages of feature fusion and propagation, we effectively reduce the loss of critical target features, especially the smaller ones, to keep their characteristic information more intact.
In real application scenarios, the trade-off between accuracy and speed cannot be ignored. As shown in
Figure 7, we plot model accuracy against inference speed. The black regression line indicates that FPS increases as accuracy decreases. Although BFRDNet does not have the highest inference speed, its FPS already meets the requirement for real-time detection (FPS > 30) while offering clear advantages in accuracy. This observation is further supported by the comprehensive metric comparison in
Table 6. DMNet approaches BFRDNet's accuracy (47.6%) but suffers from impractical latency (3.45 FPS). While achieving higher frame rates, both ATO-YOLO and DM-YOLOX exhibit substantially lower detection performance than BFRDNet at comparable model complexities. These results collectively validate BFRDNet's favorable balance between computational efficiency and detection performance.
To visually demonstrate BFRDNet's performance and compare its recognition of each category with the baseline model in detail, we employ the confusion matrix as a visualization tool. The confusion matrix offers an intuitive means of comparing predictions against ground-truth labels by organizing them in a matrix. In the confusion matrix used in this study, each row corresponds to a category predicted by the model, whereas each column represents the actual category label. The diagonal elements signify correct predictions, that is, instances where the predicted category matches the ground-truth category. Conversely, the off-diagonal elements signify misclassifications, highlighting discrepancies between predicted and actual categories. This comparison facilitates a detailed assessment of the model's accuracy in recognizing the various categories.
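To pin down the convention used here (rows for predictions, columns for ground truth), a minimal sketch of building such a matrix from matched label pairs follows; handling of the extra background row and column for missed and spurious detections is omitted for brevity.

```python
import numpy as np

def confusion_matrix(pred_labels, true_labels, num_classes):
    """Build a confusion matrix using this paper's convention:
    rows index the predicted category, columns index the ground-truth category
    (the transpose of some common library defaults)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for p, t in zip(pred_labels, true_labels):
        cm[p, t] += 1
    return cm

# Toy example with 3 classes: diagonal entries are correct predictions.
pred = [0, 1, 2, 2, 1]
true = [0, 1, 1, 2, 1]
print(confusion_matrix(pred, true, num_classes=3))
```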
Figure 8 illustrates the comparative analysis of prediction results for each category between the baseline model and BFRDNet. Compared with the baseline model, a marked increase in the values along the main diagonal is evident, indicating a significant advancement in BFRDNet's ability to accurately predict multi-scale targets. Furthermore, examining the last row of the confusion matrix shows that BFRDNet exhibits lower values there, indicating that it significantly reduces the rate at which objects are misclassified as background. These observations underscore the efficacy of BFRDNet's feature extraction capabilities, enabling it to distinguish targets from complex backgrounds more adeptly.
In this study, we propose MKConv, a novel feature extraction module designed to replace the conventional C2f module in the baseline model. MKConv is specifically engineered to enhance multi-scale target information extraction. Prior to experimentation, we conducted a thorough analysis of MKConv’s design, hypothesizing that its optimal performance would require integration across all backbone network layers, particularly in the deepest layers. To validate this hypothesis, we progressively implemented MKConv into the backbone network—starting with fewer layers (P2) and systematically extending to additional layers (P3, P4, P5)—while performing comparative experiments at each stage. As demonstrated in
Table 7, the results confirm that MKConv achieves peak performance when it fully replaces the C2f module throughout the backbone network. Furthermore, we observed that deploying MKConv solely on the P2 layer achieved the highest recall. To investigate this phenomenon, we conducted comparative experiments analyzing recall across different target scales. The results revealed that, with comparable recall for small targets, the P2-only configuration exhibited a 1.5% higher recall for large-scale targets than the configuration with the best mAP0.5, while its recall for medium-scale targets was only 0.5% lower. This explains why the P2-only configuration outperformed the others in terms of recall. However, to ensure balanced detection performance, we ultimately adopted the full-replacement configuration (the last row of Table 7) for the final model.
To comprehensively evaluate the advantageous effect of the MKConv module on the network’s receptive field, we employ a visual feature mapping technique to provide a detailed illustration.
Figure 9a–e depict the receptive field visualizations of the original backbone network and of the network after sequentially replacing the C2f module at layers P2 through P5. Without modifying the backbone network, it is evident that the coverage area of the region of interest (ROI) is comparatively small and the green color of the central region appears blurry, indicating that the network's object perception capability is limited. In contrast, after integrating the MKConv module, there is a notable expansion in the ROI's coverage, and the deepening of the color indicates an enhanced ability to capture local foreground features. This improvement effectively suppresses background noise, enabling more accurate and efficient extraction of the object's key feature areas.
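One common way to produce receptive field visualizations of this kind is the gradient-based effective receptive field: back-propagate a unit signal from the centre of an output feature map and inspect the magnitude of the input gradient. The sketch below uses a toy backbone as a placeholder; the authors' exact visualization procedure is not specified here.

```python
import torch
import torch.nn as nn

def effective_receptive_field(backbone, input_size=224):
    """Back-propagate a unit gradient from the centre of the output feature map
    and return the magnitude of the resulting input gradient, which indicates
    how strongly each input pixel influences the centre response."""
    x = torch.randn(1, 3, input_size, input_size, requires_grad=True)
    out = backbone(x)
    grad = torch.zeros_like(out)
    grad[0, :, out.shape[2] // 2, out.shape[3] // 2] = 1.0  # seed the centre location
    out.backward(grad)
    return x.grad.abs().sum(dim=1)[0]  # (input_size, input_size) influence map

# Toy stand-in for the backbone stages (P2-P5) under study.
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, stride=2, padding=1),
)
print(effective_receptive_field(toy_backbone).shape)  # torch.Size([224, 224])
```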
To demonstrate the effectiveness of the proposed BFRPN in multi-scale feature fusion, we conducted a series of comparative experiments involving various feature fusion networks. In these experiments, we substituted the BFRPN architecture in our BFRDNet model with a range of existing feature fusion networks and evaluated their performance. We present the results of these comparisons in
Table 8. While PAFPN and AFPN have the fewest parameters, they exhibit lower detection accuracy. GFPN yields mediocre results and is not well suited to backbone-dominant detection models. While BiFPN demonstrates improved feature fusion capabilities, its performance gains do not justify the significant parameter overhead it introduces. In contrast, the BFRPN introduced in this paper achieves the highest mAP0.5 and mAP0.5:0.95, along with superior precision and recall. The experimental results show that BFRPN significantly improves the detection performance of the model by amplifying the proportion of backbone network features in the model and using them more comprehensively. Compared with general feature fusion networks, the BFRPN designed in this paper is better suited to backbone-dominant detection models.
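To illustrate the reuse strategy that distinguishes BFRPN from the networks in Table 8, the sketch below fuses two adjacent backbone levels and injects the result into the corresponding neck feature. The resizing choice, the concatenation plus 1×1 fusion, and the elementwise injection are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentBackboneFusion(nn.Module):
    """Illustrative BFRPN-style reuse step: fuse two adjacent backbone levels
    (the finer one is resized to match), then inject the fused result into the
    corresponding neck feature."""
    def __init__(self, c_shallow, c_deep, c_neck):
        super().__init__()
        self.align = nn.Conv2d(c_shallow + c_deep, c_neck, 1, bias=False)

    def forward(self, p_shallow, p_deep, neck_feat):
        # Match the shallow backbone feature to the deeper level's resolution.
        p_shallow = F.interpolate(p_shallow, size=p_deep.shape[2:], mode="nearest")
        fused_backbone = self.align(torch.cat([p_shallow, p_deep], dim=1))
        # Reuse the fused backbone features alongside the neck's deep features.
        return neck_feat + fused_backbone

p3, p4 = torch.randn(1, 128, 80, 80), torch.randn(1, 256, 40, 40)
neck = torch.randn(1, 256, 40, 40)
print(AdjacentBackboneFusion(128, 256, 256)(p3, p4, neck).shape)  # (1, 256, 40, 40)
```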
In this research, we conducted an in-depth exploration and carefully crafted a diverse set of experiments focusing on the detection head module. Our exploration spanned three critical areas: attention mechanisms, multi-scale feature fusion, and multi-branch connection. To assess the efficacy of the detection head module, we integrated insights from existing research with our experimental findings. The comparative experimental results, detailed in
Table 9, reveal a key insight: incorporating specific feature-introducing modules does not always enhance detection accuracy, especially in the detection head of the network. Our experiments underscore that only modules optimally aligned with the task can deliver peak detection performance. The PDetect module we propose is not only highly adaptable to the BFRDNet detection model but also demonstrates superior detection capability. This outcome not only corroborates the effectiveness of our approach but also offers novel perspectives for future research endeavors concerning detection head modules.
4.5. Experimental Results on the UAVDT Dataset
In order to validate the effectiveness and applicability of the BFRDNet model, we conduct comparative experiments on the UAVDT dataset, with the outcomes presented in
Table 10. When juxtaposed against several state-of-the-art algorithms, BFRDNet exhibits marked superiority in performance. In comparison with the baseline model, BFRDNet achieves a 4.8% improvement in detecting objects within the Truck category and a 0.1% increase in overall average accuracy. Nevertheless, it is essential to note that the performance of BFRDNet on the UAVDT dataset does not match the level of excellence achieved on the VisDrone dataset. Upon rigorous analysis, we discern that BFRDNet possesses commendable capabilities in multi-scale feature extraction, effectively managing the extensive scale and diversity inherent in the VisDrone dataset. Conversely, when confronted with the UAVDT dataset, which features dynamic shifts in UAV viewpoints and inconsistent aspect ratios, BFRDNet’s adaptability is inadequate. This shortfall in adaptability is identified as a key area for future enhancement in our ongoing research.
Furthermore, to visually illustrate the performance disparity between the baseline model and BFRDNet in practical application settings, we conduct an array of visual comparative analyses on the UAVDT dataset, with the results depicted in
Figure 10. The results of the baseline model are depicted in column b, and the detection outcomes of BFRDNet in column c. To more clearly highlight the performance differences between the two models, we magnify and annotate critical regions with yellow arrows, emphasizing areas where these differences are most evident. The four sets of images in
Figure 10 display oblique aerial photographs of vehicles captured by the UAV under varying lighting conditions. Comparing the magnified regions shows that BFRDNet is markedly more effective than the baseline model in detecting small vehicle targets in long-range views, demonstrating enhanced capability in identifying small-sized targets at diverse distances. Although BFRDNet does not detect every target, its detection performance is substantially superior to that of the baseline model. The results demonstrate that the improvements in this article boost the model's ability to discern small-sized targets from long-distance perspectives under different shooting angles, varied illumination conditions, and a spectrum of scenes.
4.6. Extended Experiments
To further validate the generalization capability of the proposed BFRDNet architecture, we transfer its design to YOLOv11. Compared with the baseline model, the resulting augmented version, BFRDNetV2, exhibits superior generalization performance, as demonstrated by our evaluation on the VisDrone and UAVDT datasets.
Table 11 summarizes the comparative evaluation results between our proposed BFRDNetV2 and the baseline YOLOv11 model. The experimental data demonstrate consistent performance improvements, with mAP increases of 5.3% on VisDrone and 0.3% on UAVDT datasets, respectively. These quantitative results substantiate the enhanced detection capability of our method for UAV image object detection tasks.
Figure 11 presents the visualization comparison between YOLOv11 and BFRDNetV2 on the VisDrone and UAVDT datasets, displayed in columns b and c, respectively. To more clearly illustrate the performance differences between the two models, we magnify and annotate critical regions with yellow arrows, highlighting significant disparities. The first two rows depict the comparison on the VisDrone dataset, where our model demonstrates a pronounced advantage in detecting small-scale targets, such as pedestrians and vehicles, within complex scenes. These findings indicate that our model can effectively capture feature information in intricate environments, thereby achieving more precise object detection. The subsequent two rows show the comparison on the UAVDT dataset. The third row emphasizes the superior performance of BFRDNetV2 in detecting partially occluded targets and targets with only partially visible features. The fourth row illustrates the robustness of BFRDNetV2 in recognizing densely packed targets, effectively reducing missed detections. Collectively, these results demonstrate the effectiveness and practicality of BFRDNetV2 for UAV image object detection.
In addition, we perform experiments to assess generalizability on the COCO dataset, and the outcomes are presented in
Table 12. We report detection results for selected categories in which BFRDNetV2 performs well, further demonstrating its strong ability to detect small targets.
To visually illustrate the performance of the enhanced BFRDNetV2 model, we conduct visual comparative analyses using the COCO dataset, with the results displayed in
Figure 12. Although the COCO dataset is not specifically composed of UAV-captured images, the comparative outcomes clearly highlight BFRDNetV2's superior capability in identifying small targets. While BFRDNetV2 does not detect every target, its accuracy is significantly improved.