As illustrated in Figure 2, the proposed EABI-DETR network comprises three main components: Backbone, Neck, and Decoder & Head. Compared with RT-DETR (Figure 1), it introduces key improvements to enhance multi-scale representation and small-object perception. The C2f-EMA module replaces the original BasicBlock in the backbone, employing an efficient multi-scale attention mechanism to strengthen shallow feature extraction. The P2-BiFPN module in the neck substitutes the original CCFM, achieving bidirectional fusion of shallow high-resolution and deep semantic features. Moreover, the Focaler-MPDIoU loss replaces GIoU to better optimize hard samples and improve localization stability. Two variants are developed: EABI-DETRv1 is designed for lightweight and real-time detection on resource-constrained platforms such as UAVs, while EABI-DETRv2 achieves higher detection accuracy while maintaining efficient inference, making it suitable for high-precision aerial applications such as disaster response and traffic monitoring.
3.1. C2f-EMA
In the RT-DETR network, the backbone typically employs a feature extraction module based on BasicBlock. Derived from ResNet-18, BasicBlock utilizes two 3 × 3 convolutional layers to perform residual feature learning, offering good computational efficiency and training stability. However, due to the limited receptive field of convolutions, its single-scale feature extraction and the absence of an explicit attention mechanism make it difficult to effectively distinguish foreground objects from complex backgrounds, thereby constraining detection performance.
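For reference, a minimal PyTorch sketch of such a BasicBlock is given below (stride-1, equal input/output channels; the downsampling variant with a projection shortcut is omitted):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet-18-style residual block: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)  # residual addition: learn a correction to the identity
```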
To enhance small-object perception and model efficiency, this study adopts the lightweight design philosophy of the YOLO family and introduces the C2f module in the feature extraction stage. Nevertheless, features extracted solely through convolution operations remain single-scale and lack sufficient saliency. To address this limitation, an Efficient Multi-scale Attention (EMA) mechanism is embedded within the C2f module, forming a new backbone structure termed C2f-EMA. This design achieves multi-scale feature enhancement and cross-spatial dependency modeling without increasing computational cost.
As illustrated in Figure 3, the C2f-EMA adopts a dual-branch topology: one branch outputs features via a shortcut connection, while the other passes through the EMA module to extract deep semantic representations. The two outputs are then concatenated to achieve complementary fusion of semantic and detailed information. RT-DETR employs two BasicBlocks per stage for feature extraction, a configuration that imposes high computational complexity in shallow layers. To balance representation capability and computational efficiency, the proposed model instead utilizes one C2f-EMA module in stages S2–S4 and three stacked modules in stage S5. Furthermore, only one EMA block is embedded within each C2f bottleneck, enhancing feature representation while avoiding redundant computation.
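A minimal sketch of this topology is shown below, assuming an even channel split and SiLU-activated convolutions on the deep branch (both are illustrative assumptions, not settings reported here); the EMA unit is kept as a placeholder and sketched after the cross-spatial learning description below:

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    """Convolution + BatchNorm + SiLU helper (naming and activation are assumptions)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class C2fEMA(nn.Module):
    """Sketch of the C2f-EMA block: the input is split into a shortcut branch and a deep
    branch; the deep branch passes through 3x3 bottleneck convs and a single EMA unit,
    and the two branches are concatenated and projected to the output width."""
    def __init__(self, c_in, c_out, n_bottlenecks=1):
        super().__init__()
        c = c_out // 2                                # assumed 50/50 channel split
        self.stem = conv_bn_act(c_in, 2 * c, k=1)
        deep = [conv_bn_act(c, c, k=3) for _ in range(n_bottlenecks)]
        deep.append(nn.Identity())                    # stand-in for the EMA unit sketched below
        self.deep = nn.Sequential(*deep)
        self.proj = conv_bn_act(2 * c, c_out, k=1)

    def forward(self, x):
        shortcut, deep_in = self.stem(x).chunk(2, dim=1)   # split into the two branches
        return self.proj(torch.cat([shortcut, self.deep(deep_in)], dim=1))
```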
The C2f-EMA module is designed to enhance the perception and representation of small objects under a lightweight architecture. To balance computational efficiency and feature modeling capacity, the EMA mechanism first divides the input feature map into G groups along the channel dimension. Each group is processed independently within its subspace, thereby improving computational efficiency and reducing memory overhead.
In the parallel attention branch, EMA performs multi-scale feature modeling through two 1 × 1 branches and one 3 × 3 branch. The two 1 × 1 branches conduct global average pooling (GAP) along the height and width directions, respectively, to capture the channel-wise statistical characteristics in both spatial dimensions. The resulting features are concatenated and passed through a 1 × 1 convolution followed by dual Sigmoid activations, generating a direction-aware channel attention map that emphasizes salient regions while suppressing irrelevant channels. Meanwhile, the 3 × 3 branch models spatial contextual dependencies, producing a fine-grained spatial attention map that complements the channel attention mechanism.
The Sigmoid activation function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

This function maps the input feature value $x$ to the range [0, 1], highlighting salient features while suppressing irrelevant ones, where $e$ denotes the natural constant.
During the cross-spatial learning stage, EMA first applies Group Normalization (GroupNorm) to mitigate inter-group distributional discrepancies and stabilize training. Then, average pooling, softmax, and matrix multiplication (MatMul) operations are employed to construct spatial dependency relationships. Afterward, Sigmoid activation and feature reweighting (Re-weight) operations are applied for pixel-level feature enhancement, enabling the model to focus on discriminative regions while suppressing background noise.
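Putting the parallel branches and the cross-spatial learning stage together, a sketch of the EMA unit in the spirit of its public reference implementation is given below; the group count (8 here) and other hyperparameters are assumptions rather than settings reported in this paper:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention: grouped channels, two 1x1 branches driven by
    directional pooling, one 3x3 branch, and cross-spatial learning via softmax + matmul."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // groups                         # channels per group
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # GAP along width -> per-row profile
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # GAP along height -> per-column profile
        self.gap = nn.AdaptiveAvgPool2d((1, 1))
        self.gn = nn.GroupNorm(c, c)                   # mitigates inter-group distribution shift
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        b, ch, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)       # split channels into G subspaces
        # Parallel 1x1 branches: direction-aware channel attention.
        x_h = self.pool_h(g)                                    # (B*G, c, H, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                # (B*G, c, W, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))         # joint encoding of both directions
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local spatial context.
        x2 = self.conv3x3(g)
        # Cross-spatial learning: pooled descriptors of one branch attend to the other.
        a1 = self.softmax(self.gap(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.gap(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        v1 = x2.reshape(b * self.groups, ch // self.groups, -1)
        v2 = x1.reshape(b * self.groups, ch // self.groups, -1)
        weights = (a1 @ v1 + a2 @ v2).reshape(b * self.groups, 1, h, w)
        return (g * weights.sigmoid()).reshape(b, ch, h, w)     # re-weight and restore shape
```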
The Softmax function is formulated as:

$$\mathrm{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

This function normalizes the input vector into a probability distribution, representing the relative importance of each element $x_i$, where $n$ denotes the dimension of the input vector.
Overall, the C2f-EMA module significantly strengthens feature representation and spatial perception through multi-branch fusion and multi-scale attention mechanisms. Compared with the BasicBlock, it captures fine-grained details and key semantic regions more effectively, achieving superior accuracy and computational efficiency in small object detection, which makes it particularly suitable for resource-constrained aerial applications.
3.2. P2-BiFPN
In object detection tasks, crucial spatial details are primarily encoded in the shallow layers of a convolutional network. However, these details are progressively weakened or even lost through successive convolutions and downsampling operations, leading to insufficient feature representation for small objects. Moreover, during feature extraction, small objects can be easily occluded by larger ones or interfered with by background noise, resulting in further degradation of fine-grained information.
As illustrated in Figure 4, the traditional Feature Pyramid Network (FPN) integrates multi-scale information through a top-down feature fusion pathway and performs detection based on the fused high-level semantic features. Although this structure improves multi-scale representation, its unidirectional information flow causes shallow details to weaken after multiple downsampling operations, limiting the preservation of fine spatial textures crucial for small object detection and thus constraining detection performance.
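For concreteness, the top-down pathway can be sketched as follows (the channel widths and nearest-neighbor upsampling are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Minimal FPN-style top-down pathway: lateral 1x1 projections plus
    upsample-and-add fusion, followed by 3x3 smoothing convolutions."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats: [C3, C4, C5], high to low resolution
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1): # top-down: propagate semantics downward only
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]   # fused P3..P5 outputs
```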
To address this issue, RT-DETR introduces the CCFM, which adds a bottom-up aggregation path to FPN, thereby achieving bidirectional interaction between high- and low-level features. CCFM adopts an “inverted pyramid” fusion strategy, in which high-resolution features are downsampled and low-resolution features are upsampled to achieve alignment and fusion at intermediate scales. This design effectively enhances the joint modeling of semantics and detailed representations. However, its detection heads are only applied to the P3–P5 scales, neglecting the shallow high-resolution features (e.g., from the S2 layer). Consequently, the fusion process still focuses mainly on mid- and high-level features, resulting in limited fine-grained feature representation, a deficiency that becomes particularly pronounced in aerial scenes with densely distributed small objects.
Bidirectional Feature Pyramid Network (BiFPN) further refines the multi-scale fusion process by introducing bidirectional pathways (top-down and bottom-up) and learnable weighting mechanisms, allowing the network to adaptively assign importance across different scales. This facilitates a more balanced integration of semantic and detailed information. In addition, BiFPN directly connects higher-level semantic features (from S4 and S5) with shallower layers, enhancing inter-level correlation and mitigating information loss during deep propagation. Nevertheless, BiFPN still underutilizes high-resolution features from the P2 layer. Directly incorporating P2 features can introduce redundancy and computational overhead, which compromises real-time performance and lightweight design, both key requirements for UAV-based or resource-constrained platforms.
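The learnable weighting at each BiFPN node can be sketched as the following "fast normalized fusion" operator; the epsilon value and ReLU clamping follow the standard BiFPN formulation, while the class name is illustrative:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion: learnable non-negative weights decide how much
    each input scale contributes, and are normalized so they sum to roughly one."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):                 # inputs: list of equally shaped feature maps
        w = torch.relu(self.w)                 # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)           # normalize without an expensive softmax
        return sum(wi * xi for wi, xi in zip(w, inputs))
```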
To overcome these limitations, this study introduces the P2-BiFPN module, which explicitly integrates shallow, high-resolution P2 features into the BiFPN framework and employs a dedicated small-object detection head to improve the perception of small targets. Structurally, P2-BiFPN streamlines the high-level branches by removing the redundant P5 layer and performing efficient fusion between shallow and deep features at intermediate levels. This design significantly reduces computational complexity while maintaining effective multi-scale feature interaction. Through multi-level bidirectional feature propagation and adaptive weighted fusion, P2-BiFPN promotes stronger collaboration between shallow spatial details and deep semantic representations, compensating for the deficiencies of RT-DETR in small object detection. The design not only enhances multi-scale feature modeling capability but also achieves a favorable balance between accuracy and efficiency, making it particularly suitable for real-time deployment on UAVs and other mobile platforms.
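As a rough illustration of the wiring described above, one bidirectional P2-BiFPN pass over the P2–P4 levels (with P5 removed) might look as follows; the exact node layout, downsampling operator, and fusion modules are assumptions based on the description rather than the paper's implementation, and all levels are assumed to have already been projected to a common channel width:

```python
import torch.nn.functional as F

def p2_bifpn_pass(p2, p3, p4, fuse_td, fuse_bu):
    """One bidirectional fusion pass over P2-P4. `fuse_td` and `fuse_bu` are lists of
    weighted-fusion nodes (e.g., the WeightedFusion sketch above); all inputs are assumed
    to share the same channel width, with each level half the resolution of the previous."""
    # Top-down: inject deeper semantics into the high-resolution P3 and P2 levels.
    p3_td = fuse_td[0]([p3, F.interpolate(p4, size=p3.shape[-2:], mode="nearest")])
    p2_out = fuse_td[1]([p2, F.interpolate(p3_td, size=p2.shape[-2:], mode="nearest")])
    # Bottom-up: push the refined shallow detail back into the deeper levels.
    p3_out = fuse_bu[0]([p3_td, F.max_pool2d(p2_out, kernel_size=2)])
    p4_out = fuse_bu[1]([p4, F.max_pool2d(p3_out, kernel_size=2)])
    # The P2-level output feeds the dedicated small-object detection head.
    return p2_out, p3_out, p4_out
```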