As a high-performance model in the YOLO series, YOLOv11 balances detection accuracy and efficiency by integrating the novel C3k2 module and the C2PSA spatial attention mechanism. Architecturally, the network is organized into three stages: the backbone (responsible for feature extraction), the neck, and the detection head. The backbone adopts a Cross Stage Partial (CSP) configuration built around the C3k2 block to extract features efficiently. The neck employs an improved PANet (Path Aggregation Network) structure that combines top-down and bottom-up multi-scale feature fusion, enabling interaction between semantic and fine-grained features. Features from different scales are fused via concatenation operations, while the C2PSA spatial attention module enhances the complementarity between semantic information and fine-grained details. The head uses an efficient decoupled design in which each sub-head contains independent classification and regression branches, accelerating model convergence while improving detection accuracy. We therefore adopt YOLOv11n as the foundational baseline for this work. However, despite these strengths, the standard model exhibits clear limitations when applied to aerial photography, particularly in accurately identifying minute targets against cluttered or intricate backgrounds.
To enhance multi-scale object detection performance, particularly under the challenges of small objects and complex background interference, this paper proposes an improved BFRI-YOLO architecture, with its overall framework illustrated in Figure 1. While preserving the efficient inference characteristics of the YOLO series, this method designs several improved modules that target issues prevalent in small object detection within complex scenes, such as susceptibility to edge detail loss, insufficient semantics in shallow features, and overly simplistic feature fusion. In the backbone, we introduce the RFCBAMConv and C3k2-RFCBAM structures. By combining lightweight convolution with attention mechanisms, this approach strengthens the network's joint modeling of local details and global context, maintaining robustness against complex background interference. In the neck, we design the BAFPN module. Building upon the traditional pyramid network, it explicitly introduces a P2 layer and integrates the IFU and SBA mechanisms to realize dynamic interaction and selective enhancement between shallow edge features and deep semantic features, improving boundary perception and spatial discrimination for small objects. In the head, we propose the FSDED structure. By leveraging multi-scale feature sharing alongside convolutional coordination, it reinforces information alignment across detection heads and thereby improves both localization precision and classification accuracy for minute targets. Finally, to refine bounding box regression for small targets in cluttered environments, we adopt an enhanced Inner-WIoU loss. This technique integrates a spatial weighting mechanism into the standard IoU framework, compelling the predicted box to focus on the target's core area, which alleviates localization errors caused by boundary ambiguity and enhances overall regression stability.
3.1. BAFPN
In the task of small object detection within complex scenes, conventional Feature Pyramid Networks (FPN, PAN), while effective for multi-scale targets, typically commence feature aggregation from the P3 layer, neglecting the high-resolution and fine-grained information present in P2. This design leads to imprecise bounding box localization due to the loss of edge and positional information in shallow features during downsampling. Moreover, the lack of deep semantic support in these features results in frequent false positives and missed detections of small objects. Furthermore, most existing methods rely on static fusion strategies (upsampling and concatenation), which lack dynamic modeling of the complementary relationship between different feature levels, potentially introducing redundant or conflicting information and compromising detection accuracy and stability.
To counteract these shortcomings, this paper introduces a Balanced Adaptive Feature Pyramid Network (BAFPN). This structure explicitly incorporates the P2 layer into the feature pyramid and is optimized by integrating two core modules: IFU and SBA [25]. As depicted in Figure 2, the IFU module utilizes an interactive update mechanism to realize dynamic complementarity between shallow edge details and deep semantic information, which significantly enhances the localization and discriminative capabilities for small objects. Concurrently, the SBA module selectively amplifies boundary regions, effectively suppressing interference from complex backgrounds, thereby strengthening the contour perception and spatial representation of minute targets. Through the synergistic action of these two components, BAFPN maximizes the utility of both shallow and deep features while maintaining detection efficiency, providing a more robust feature representation for small object detection in complex environments.
As illustrated in Figure 2, the internal components of the IFU module are defined over two generic input features $X_1$ and $X_2$, which serve as placeholders for the positional and semantic feature paths, respectively. We apply convolution and Sigmoid activation to these two paths to obtain the attention weights:
Subsequently, the two features are each enhanced using the interactive update formulas:
where $\odot$ denotes element-wise multiplication, $\sigma$ represents the Sigmoid function, and Upsample refers to the upsampling operation. This module effectively supplements the contextual information of distant small objects while preserving semantic consistency.
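A minimal PyTorch-style sketch of this interactive update is given below. The module name, the single-channel attention maps, the nearest-neighbor upsampling, and the residual form of the update are illustrative assumptions rather than the exact BAFPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IFU(nn.Module):
    """Sketch of an interactive feature update unit as described in the text:
    each path produces an attention map (conv + Sigmoid) that gates the other path,
    with the deeper semantic path upsampled to the shallow resolution first."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions mapping each path to a single-channel attention map
        # (both inputs are assumed to carry the same number of channels)
        self.att_pos = nn.Conv2d(channels, 1, kernel_size=1)
        self.att_sem = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x_pos, x_sem):
        # x_pos: shallow positional features (high resolution)
        # x_sem: deep semantic features (lower resolution)
        x_sem_up = F.interpolate(x_sem, size=x_pos.shape[-2:], mode="nearest")
        w_pos = torch.sigmoid(self.att_pos(x_pos))     # weight from the positional path
        w_sem = torch.sigmoid(self.att_sem(x_sem_up))  # weight from the semantic path
        # Interactive update: each branch is enhanced by the other branch's weight
        x_pos_out = x_pos + x_pos * w_sem
        x_sem_out = x_sem_up + x_sem_up * w_pos
        return x_pos_out, x_sem_out
```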
To optimize the feature fusion stage, this study introduces the SBA module (Figure 3), which instantiates the IFU logic for specific feature alignment. In this context, the fine-grained positional features from the shallow layers are mapped to $X_1$ and the deep semantic features to $X_2$. By exploiting their complementary nature, this module significantly bolsters detection performance for small objects. Structurally, the SBA comprises two parallel IFU units designed to interactively augment boundary and spatial information, respectively; this process is mathematically formulated in Equation (4). Subsequently, the final fused features are synthesized through a concatenation operation followed by a convolution layer, as defined in Equation (5).
This module effectively mitigates the impact of complex background interference on the recognition of edge details for small objects while maintaining detection efficiency.
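Continuing the sketch above, the SBA step could be approximated as follows. The channel widths, the BatchNorm/SiLU fusion convolution, and the way the two IFU units share the same input pair are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class SBA(nn.Module):
    """Sketch of the selective boundary aggregation step: two parallel IFU units
    (as sketched above) interactively enhance boundary and spatial information,
    and their outputs are concatenated and fused by a convolution,
    mirroring Equations (4) and (5)."""
    def __init__(self, channels):
        super().__init__()
        self.ifu_boundary = IFU(channels)   # IFU from the sketch above
        self.ifu_spatial = IFU(channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, f_low, f_high):
        # f_low: shallow fine-grained positional features; f_high: deep semantic features
        a, b = self.ifu_boundary(f_low, f_high)
        c, d = self.ifu_spatial(f_low, f_high)
        return self.fuse(torch.cat([a, b, c, d], dim=1))
```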
3.2. FSDED
When identifying minute targets amidst cluttered backgrounds, the architectural configuration of the detection head acts as a critical determinant of the model's overall precision. Traditional detection heads in the YOLO series typically adopt a decoupled branch structure, where each feature layer independently constructs convolutional branches to predict targets. However, this approach faces two primary limitations when handling small objects. First, the lack of a shared collaboration mechanism significantly increases the model's computational and parameter overhead, since the independently constructed detection branches perform a substantial amount of redundant feature extraction. Second, the perception capability for distant, tiny objects is extremely limited: the absence of multi-scale semantic interaction makes it difficult for the model to extract effective discriminative information from blurry shallow features, resulting in insufficient classification and localization accuracy for distant targets. To address these issues, this paper proposes a Four-Scale Shared Detail-Enhanced Detection Head (FSDED). As illustrated in Figure 4, the fundamental design of the FSDED unit centers on integrating a shared convolutional framework with Detail-Enhanced Convolution (DEConv). This combination enables effective cross-scale resource sharing and the retention of fine-grained details, which in turn improves the model's precision and stability in identifying small targets.
First, for the input multi-scale features P2, P3, P4, and P5, we apply Group Normalization (GN) [26] to each scale to ensure the stability of features with varying distributions under small-batch training.
Second, following the shared convolution, this paper introduces the Detail-Enhanced Convolution (DEConv) [27]. Its core mechanism lies in modeling texture and structural information at different levels through parallel detail and structural branches.
where $F_s$ and $F_d$ denote the structural features and detail features, respectively; $W_s$ and $W_d$ are the corresponding weight parameters; and $*$ represents the convolution operation. This mechanism ensures that the detection head enhances the representation of local details while preserving multi-scale semantic information. Finally, after being processed by the shared convolution and DEConv, features from all scales are uniformly fed into the classification and regression branches to achieve end-to-end detection prediction:
By adopting this consolidated decoding approach, we minimize calculation redundancy and guarantee consistent feature representation across multiple scales. This strategy is pivotal for elevating the detection efficacy of minute targets. Empirical evidence confirms that the FSDED module substantially bolsters the model’s perceptual sensitivity and precision for small objects, all without imposing a significant burden on parameter size or computational load.
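To make the shared-head idea concrete, the following simplified PyTorch sketch applies per-scale Group Normalization, a shared convolution block with a DEConv-style parallel branch, and shared classification/regression convolutions across all four scales. The class names, channel counts, and the reduced DEConv stand-in (the real DEConv [27] uses difference convolutions) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DEConvSketch(nn.Module):
    """Stand-in for Detail-Enhanced Convolution: a structural branch and a
    depthwise detail branch whose outputs are summed. This only mirrors the
    parallel-branch description in the text, not the full DEConv [27]."""
    def __init__(self, c):
        super().__init__()
        self.structural = nn.Conv2d(c, c, 3, padding=1)
        self.detail = nn.Conv2d(c, c, 3, padding=1, groups=c)  # depthwise, detail-oriented
    def forward(self, x):
        return self.structural(x) + self.detail(x)

class FSDEDHead(nn.Module):
    """Sketch of a four-scale shared head: per-scale GroupNorm, then a shared
    conv + DEConv block, then shared classification and regression branches."""
    def __init__(self, c, num_classes, num_scales=4, reg_ch=64):
        super().__init__()
        # 16 groups assumed; requires c to be divisible by 16
        self.norms = nn.ModuleList(nn.GroupNorm(16, c) for _ in range(num_scales))
        self.shared = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(), DEConvSketch(c), nn.SiLU())
        self.cls_branch = nn.Conv2d(c, num_classes, 1)  # shared across scales
        self.reg_branch = nn.Conv2d(c, reg_ch, 1)       # shared across scales (reg_ch assumed)
    def forward(self, feats):
        # feats: [P2, P3, P4, P5], each with c channels
        outs = []
        for norm, f in zip(self.norms, feats):
            f = self.shared(norm(f))
            outs.append((self.cls_branch(f), self.reg_branch(f)))
        return outs
```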
3.3. C3K2_RFCBAM
In aerial small object detection tasks, the ability of the network to efficiently extract discriminative features is paramount, particularly given the generally small scale of targets and the frequent presence of complex background interference. While the C3k2 module in YOLO series models maintains a lightweight design and computational efficiency, its internal standard Bottleneck structure relies solely on stacking convolutions for feature modeling. Consequently, it lacks sufficient attention to multi-scale context and salient regions. This limitation renders the model susceptible to false and missed detections in complex backgrounds. To address this, this paper integrates RFCBAMConv into the C3k2 module of YOLOv11, replacing the standard convolution.
The RFConv module [28] addresses scale variation issues by utilizing convolutions with multiple receptive fields, which is crucial for stabilizing performance against complex backgrounds. CBAM [29], in turn, refines feature extraction by recalibrating channel and spatial weights, ensuring the network prioritizes salient regions over background distractions. Building on this foundation, the proposed RFCBAMConv module, as illustrated in Figure 5, integrates the strengths of both RFConv and CBAM, unifying multi-scale modeling with salient feature selection.
We replace the standard Bottleneck structure in the original C3k2 with the RFCBAM-Bottleneck, embedding the RFCBAMConv module after the convolutional layer to achieve lightweight, attention-enhanced feature learning. The overall improved structure is illustrated in Figure 6. Specifically, Figure 6a depicts the original C3k2 structure, Figure 6b shows the improved C3k2-RFCBAM structure, and Figure 6c,d compares the standard Bottleneck with the RFCBAM-Bottleneck. Compared to the original C3k2, the C3k2-RFCBAM offers advantages in the following aspects. First, leveraging the multi-receptive-field mechanism of RFConv, the improved module better models targets at varying scales, enhancing the detection of distant small objects. Second, by integrating the channel and spatial attention mechanisms of CBAM, the network is guided to focus on key target regions while suppressing irrelevant background information in complex scenes, which significantly improves the specificity of feature selection. Finally, by effectively controlling the parameter count, the lightweight RFCBAMConv maintains high computational efficiency, ensuring the model's suitability for real-time deployment.
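As an illustration of the attention-enhanced bottleneck, the sketch below recalibrates the output of a standard bottleneck with CBAM-style channel and spatial attention. It deliberately omits the receptive-field feature expansion of the actual RFCBAMConv [28,29], so it should be read as a simplified stand-in rather than the module used in C3k2-RFCBAM.

```python
import torch
import torch.nn as nn

class RFCBAMBottleneckSketch(nn.Module):
    """Illustrative bottleneck: 1x1 conv, 3x3 conv, then CBAM-style channel and
    spatial attention, with a residual connection as in a standard Bottleneck."""
    def __init__(self, c, reduction=16):
        super().__init__()
        # assumes c is divisible by `reduction`
        self.cv1 = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.SiLU())
        # channel attention: shared MLP over average- and max-pooled descriptors
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.SiLU(), nn.Conv2d(c // reduction, c, 1))
        # spatial attention over pooled channel maps
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        ca = torch.sigmoid(self.channel_mlp(y.mean((2, 3), keepdim=True))
                           + self.channel_mlp(y.amax((2, 3), keepdim=True)))
        y = y * ca
        sa = torch.sigmoid(self.spatial(torch.cat(
            [y.mean(1, keepdim=True), y.amax(1, keepdim=True)], dim=1)))
        y = y * sa
        return x + y  # residual connection
```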
3.4. Inner-WIoU
CIoU Loss is employed in the YOLOv11 framework for the bounding box regression task, optimizing both the spatial location and the size of the predicted boxes. This metric improves on legacy loss functions such as IoU by explicitly integrating penalty terms for the center-point distance and the aspect ratio discrepancy. The full CIoU loss function is formally expressed as follows:
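Written out in the standard form found in the literature, this is
\[
L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad
\alpha = \frac{v}{(1 - IoU) + v}.
\]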
In this expression, $\rho^{2}(b, b^{gt})$ quantifies the spatial displacement between the geometric centers of the prediction and the ground truth box, normalized by the squared diagonal length $c^{2}$ of the smallest enclosing box. To ensure shape alignment, the term $\alpha v$ is introduced as a constraint on aspect ratio deviations, where the coefficient $\alpha$ balances the contribution of the aspect ratio cost relative to the distance loss and $v$ evaluates the consistency of the aspect ratios.
To further optimize bounding box regression, the WIoU series of methods was proposed. WIoU-v1 introduces sample weights into the loss function to assign varying degrees of importance to high-quality and low-quality predicted boxes. WIoU-v3 [22], in contrast, employs a dynamic non-monotonic gradient allocation strategy to achieve adaptive weight adjustment via the outlier degree $\beta$ and the gain function $r$. The outlier degree is defined as the ratio of the current sample's IoU loss to the historical mean:
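In the notation of the WIoU paper [22], this reads
\[
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty).
\]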
High-quality samples are characterized by a small $\beta$, whereas large $\beta$ values indicate a marked divergence between the regression result and the ground truth. In this context, $L_{IoU}^{*}$ denotes the IoU loss of the current instance, and $\overline{L_{IoU}}$ corresponds to its historical moving mean. Based on $\beta$, a non-monotonic gain $r$ is constructed, in which $\alpha$ and $\delta$ are hyperparameters used to control the shape of the gradient curve. The form of the WIoU-v3 loss function is expressed as follows:
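In the standard WIoU-v3 formulation [22], the gain and the loss are
\[
r = \frac{\beta}{\delta\,\alpha^{\beta-\delta}}, \qquad
L_{WIoUv3} = r \cdot R_{WIoU} \cdot L_{IoU}, \qquad
R_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right),
\]
where $(x, y)$ and $(x_{gt}, y_{gt})$ are the centers of the predicted and ground truth boxes, $W_{g}$ and $H_{g}$ are the width and height of the smallest enclosing box, and the superscript $*$ indicates detachment from the gradient computation.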
WIoU-v3 incorporates the bounding box center offset, the IoU loss, and a dynamic weight allocation mechanism. It enhances detection accuracy and robustness in complex scenarios while ensuring fast convergence. Building upon these methods, and to further improve the localization precision for small and dense objects, researchers proposed Inner-IoU [30]. As illustrated in Figure 7, the central concept involves calculating the Inner-IoU score based on the geometric relationship between the regression output and the ground-truth label. By measuring the degree of overlap within the inner regions of the bounding boxes, it more precisely constrains the alignment between the prediction and the ground truth. Its loss function can be expressed as follows:
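In its original form [30], this can be written simply as
\[
L_{Inner\text{-}IoU} = 1 - IoU_{inner}.
\]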
Specifically, the inner box coordinates are derived by scaling the original bounding box with a ratio factor $ratio$. Let the center of the ground truth box be $(x_c^{gt}, y_c^{gt})$ and its size be $(w^{gt}, h^{gt})$. The width and height of the inner ground truth box are then computed as follows:
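Following the Inner-IoU definition [30], the inner ground truth box keeps the same center while its width and height are scaled, i.e.
\[
w_{inner}^{gt} = w^{gt} \cdot ratio, \qquad h_{inner}^{gt} = h^{gt} \cdot ratio,
\]
which corresponds to the corners
\[
b_{l}^{gt} = x_{c}^{gt} - \frac{w^{gt} \cdot ratio}{2}, \quad
b_{r}^{gt} = x_{c}^{gt} + \frac{w^{gt} \cdot ratio}{2}, \quad
b_{t}^{gt} = y_{c}^{gt} - \frac{h^{gt} \cdot ratio}{2}, \quad
b_{b}^{gt} = y_{c}^{gt} + \frac{h^{gt} \cdot ratio}{2}.
\]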
Similarly, the inner predicted box is scaled by the same factor. The inner intersection is then calculated based on these scaled boxes. In our experiments, we set the scaling factor ratio = 0.7, which effectively creates a stricter constraint on the central alignment of bounding boxes.
where $IoU_{inner}$ denotes the Intersection over Union of the inner regions of the predicted box and the ground truth box. Since Inner-IoU primarily emphasizes the matching of local regions, it is particularly effective for aerial small object detection and for weighting target boundary details. However, the standalone Inner-IoU suffers from inadequate weight allocation, lacking the ability to adjust dynamically based on sample quality. Therefore, to combine the advantages of both, we formulate the Inner-WIoU loss. Mathematically, this is achieved by substituting the standard IoU term in the original WIoU-v3 loss with the Inner-IoU metric: we subtract the standard IoU loss component $L_{IoU}$ and add the Inner-IoU loss component $L_{Inner\text{-}IoU}$. The final formulation is expressed as follows:
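With the substitution described above, the combined loss can be written as
\[
L_{Inner\text{-}WIoU} = L_{WIoUv3} - L_{IoU} + L_{Inner\text{-}IoU} = L_{WIoUv3} + IoU - IoU_{inner}.
\]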
Inner-WIoU not only retains the adaptive dynamic weighting capability of WIoU-v3 for high- and low-quality samples in complex scenes, but also inherits the Inner-IoU constraints on small objects and precise boundary alignment. As a result, this strategy significantly boosts recognition precision for minute aerial targets while preserving model stability. The efficacy of Inner-WIoU on tiny objects can be attributed to a gradient amplification effect. For small targets in aerial imagery, the standard IoU loss often exhibits weak gradients when the overlap is high, limiting further localization refinement. By utilizing a smaller auxiliary box (ratio < 1), the effective IoU for the same deviation decreases compared to the original scale, generating a larger gradient signal that forces the model to focus on finer alignment of the centroids. Coupled with the dynamic focusing mechanism of WIoU, this ensures that high-quality samples receive sufficient attention during the late stages of training.
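For completeness, the sketch below shows how such a loss could be assembled in PyTorch for boxes in (x1, y1, x2, y2) format. The running mean of the IoU loss (maintained outside the function), the hyperparameters alpha and delta, and the additive combination at the end follow the description above but are assumptions rather than the exact training configuration.

```python
import torch

def inner_wiou_loss(pred, target, loss_mean, ratio=0.7, alpha=1.9, delta=3.0):
    """Sketch of an Inner-WIoU-style loss. `loss_mean` is the running mean of the
    IoU loss maintained by the caller; alpha/delta are assumed WIoU-v3 settings."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # Standard IoU
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union.clamp(min=1e-7)

    # Inner-IoU on boxes shrunk around their centers by `ratio`
    def shrink(x1, y1, x2, y2):
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    ix1, iy1, ix2, iy2 = shrink(px1, py1, px2, py2)
    gx1, gy1, gx2, gy2 = shrink(tx1, ty1, tx2, ty2)
    inner_inter = (torch.min(ix2, gx2) - torch.max(ix1, gx1)).clamp(0) * \
                  (torch.min(iy2, gy2) - torch.max(iy1, gy1)).clamp(0)
    inner_union = (ix2 - ix1) * (iy2 - iy1) + (gx2 - gx1) * (gy2 - gy1) - inner_inter
    inner_iou = inner_inter / inner_union.clamp(min=1e-7)

    # WIoU-v3 dynamic, non-monotonic focusing
    wg = torch.max(px2, tx2) - torch.min(px1, tx1)   # enclosing box width
    hg = torch.max(py2, ty2) - torch.min(py1, ty1)   # enclosing box height
    cx_d = ((px1 + px2) - (tx1 + tx2)) / 2           # center offset in x
    cy_d = ((py1 + py2) - (ty1 + ty2)) / 2           # center offset in y
    r_wiou = torch.exp((cx_d ** 2 + cy_d ** 2) / (wg ** 2 + hg ** 2).detach())
    l_iou = 1.0 - iou
    beta = l_iou.detach() / loss_mean                # outlier degree
    gain = beta / (delta * alpha ** (beta - delta))  # non-monotonic gain
    l_wiou = gain * r_wiou * l_iou

    # Inner-WIoU: swap the plain IoU loss for the inner variant, per the text
    return l_wiou + iou - inner_iou
```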