Applied Sciences
  • Article
  • Open Access

11 November 2025

DL-DEIM: An Efficient and Lightweight Detection Framework with Enhanced Feature Fusion for UAV Object Detection

School of Aviation, Inner Mongolia University of Technology, Hohhot 010051, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Aerospace Science and Engineering

Abstract

UAV object detection remains difficult because of large scale variation, dense small objects, complicated backgrounds, and the resource constraints of onboard computing. To address these problems, we develop a diffusion-enhanced detection network, DL-DEIM, tailored for aerial images. The proposed scheme extends the DEIM baseline along three orthogonal axes. First, we propose a lightweight backbone network, DCFNet, which uses a DRFD module and a FasterC3k2 module to preserve spatial information while reducing computational complexity. Second, we propose the LFDPN module, which performs bidirectional multi-scale fusion via frequency-spatial self-attention and deep feature refinement, substantially enhancing cross-scale contextual propagation for small objects. Third, we propose LAWDown, a content-aware adaptive downsampling scheme that captures spatially variant weights and grouped channel interactions to preserve discriminative representations at lower resolutions. On the VisDrone2019 dataset, DL-DEIM achieves a mAP@0.5 of 34.9% and a mAP@0.5:0.95 of 20.0%, outperforming the DEIM baseline by +4.6% and +2.9%, respectively. The model maintains real-time inference speed (356 FPS) with only 4.64 M parameters and 11.73 GFLOPs. Ablation studies confirm that DCFNet, LFDPN, and LAWDown contribute complementarily to both accuracy and efficiency, and visualizations show more concentrated and better-localized activations in crowded scenes. These results show that DL-DEIM achieves a favorable trade-off between detection accuracy and computational burden and can be deployed in practice on resource-limited UAV systems.

1. Introduction

Unmanned aerial vehicles (UAVs) equipped with visual sensors have been widely deployed in various application scenarios, such as disaster emergency response [], urban traffic monitoring [], aerial mapping, and military reconnaissance. The rapid expansion of UAV technology has created an increasing demand for efficient, high-performance solutions for online object detection in aerial images. Compared with ground-level images, UAV images exhibit several distinctive characteristics that pose strong challenges to traditional computer vision methods: wide scene coverage with cluttered backgrounds, a predominantly top-down view with frequent occlusion, scale variations spanning orders of magnitude, objects that occupy only a tiny fraction of the image, and diverse illumination conditions [].
UAV platforms also impose strict hardware constraints, demanding that a detection algorithm achieve millisecond-level inference within milliwatt-level power budgets. These severe resource constraints force researchers to rethink the design paradigms of deep networks []. The imaging characteristics of the bird’s-eye view perspective pose very challenging pattern recognition problems: target scales can change by a factor of a hundred, small targets are densely packed and easily obscured by complex surrounding objects, and target shapes are distorted by flight attitude, causing further difficulty during recognition []. All of these obstacles substantially increase the technical difficulty of aerial intelligent perception and urgently need to be addressed. Figure 1 shows a UAV aerial image.
Figure 1. Daytime drone-captured image.
In recent years, object detection techniques have transitioned from anchor-box-based methods to end-to-end learning. Two-stage approaches generate region proposals first (the region-based convolutional neural network (R-CNN) and its successors, including Fast R-CNN, Faster R-CNN, and Mask R-CNN [,,,,]); they achieve high accuracy but rely on a cascaded pipeline of components and are therefore not real-time. Single-stage detectors (such as the You Only Look Once (YOLO) series [,,,,]) make dense predictions and substantially improve inference efficiency, striking a balance between processing speed and accuracy in general environments. However, these convolutional neural network (CNN)-based methods have intrinsic drawbacks for aerial image processing: the local receptive field of the convolution operation restricts the modeling of global context, making it difficult to capture long-range dependencies between objects in aerial scenes. Meanwhile, the non-maximum suppression (NMS) post-processing step dramatically increases computational complexity in dense target areas and becomes the performance bottleneck of real-time detection [].
The introduction of the transformer architecture provides a new route to overcome these constraints. The detection transformer (DETR) [] was the first to use self-attention in object detection; the Hungarian algorithm matches queries and targets in a one-to-one manner, completely discarding anchor box design and NMS post-processing and achieving truly end-to-end learning. This simple yet principled design reduces the overhead of detection pipelines and shows its merits in occluded and crowded scenarios. Although the original DETR can be trained to acceptable accuracy, it converges slowly and is computationally expensive, which makes it poorly suited to real-world deployments on resource-limited platforms such as smart devices [].
The effectiveness of diffusion models as generative models has motivated new investigations in the detection literature. Diffusion model with explicit instance matching (DEIM) [] creatively introduces instance matching to diffusion denoising and iteratively performs optimization to progressively refine target estimations. Different from classical one-shot prediction, DEIM adopts a progressive optimization mechanism so that the model iteratively evolves from coarse to accurate detection. Most importantly, DEIM’s dense one-to-one (O2O) matching augments the number of targets in training images, to achieve supervision density as high as one-to-many (O2M) but with the benefits of one-to-one matching. Together with the tailored dual-endpoint matching loss (matchability-aware loss), DEIM makes a breakthrough in training efficiency: it achieves better detection performance while halving the amount of training time.
The backbone network in the DEIM model uses HGNetv2 to extract features. For real-time object detection tasks, the HGStage modules in HGNetv2 provide strong feature representation power, but the structurally complex HGStage units and the squeeze-and-excitation (SE) attention mechanism are computationally intensive, making it difficult to deploy the model on edge devices. Research by Chen et al. [] showed that overly complex networks significantly increase inference latency while adding barely any improvement in accuracy. Meanwhile, Liu et al. [] pointed out that a single feature extraction module cannot cope with the simultaneous detection of objects of various sizes, as in complex scenes containing multi-scale objects. Several recent works have attempted to improve HGNetv2-based detectors in different domains. For instance, an improved YOLOv11n segmentation model has been proposed for lightweight detection and segmentation of crayfish parts []; ELS-YOLO [] enhances UAV object detection in low-light conditions based on an improved YOLOv11 framework; LEAD-YOLO [] targets lightweight small-object detection; and HCRP-YOLO [] has been applied to agricultural inspection tasks, demonstrating the potential of such backbones for real-world UAV deployments []. However, these improvements are often tailored to specific application scenarios or environmental constraints and do not provide a general solution to the challenges of multi-scale, cluttered, and resource-constrained UAV imagery, which motivates our proposed discriminative correlation filter network (DCFNet) design.
In this paper, we propose a new DCFNet backbone network. DCFNet takes advantage of multi-scale features and background information for better performance. The network integrates the downsampling with reduced feature degradation (DRFD) module, which improves standard downsampling by retaining more spatial information during resolution reduction and thus reducing feature loss. Moreover, we propose a feature enrichment module, FasterC3k2, which fuses the C3k2Block (a cross-stage partial network (CSPNet) structure) with the FasterBlock [], effectively reducing redundant computation through its parameter settings. The whole network follows a progressive cyclic structure, which guarantees thorough feature extraction while saving substantial computational resources.
In the neck network, the fusion module of DEIM is responsible for multi-scale feature fusion. In UAV imagery, object distribution and scale vary with the shooting angle. Existing conventional feature pyramid networks adopt only simple residual concatenation and the re-parameterization block (RepBlock []), which limits the model’s ability to fully express complex features. To solve this problem, this paper proposes the lightweight feature diffusion pyramid network (LFDPN). The feature diffusion mechanism of LFDPN effectively diffuses context-rich features to different detection scales through bidirectional top-down and bottom-up propagation paths. In the first stage of the network, the output of the feature focusing module is fused with high-resolution and low-resolution feature maps by downsampling and upsampling, respectively, to generate coarse multi-scale feature representations. The second stage achieves deep feature fusion and complete context diffusion through a second feature focusing (FocusFeature) module and the deep feature enhancement submodule (re-parameterized high-level multi-scale, RepHMS). Moreover, this paper introduces lightweight adaptive weighted downsampling (LAWDown), which, together with LFDPN’s dual-stage feature diffusion mechanism, successfully mitigates the information loss caused by scale discrepancies in conventional feature pyramid networks, leading to notably more consistent and expressive multi-scale features.
In summary, the main improvements and innovations of this paper are as follows:
1.
To minimize feature loss during the downsampling process and reduce redundant computations, we design a novel backbone network DCFNet for UAV object detection. It introduces the DRFD downsampling module and designs the FasterC3k2 feature enhancement module composed of C3k2Block and FasterBlock, which enhances feature extraction capability while achieving model lightweighting;
2.
To enable the model to fully express complex features, we design a novel neck network LFDPN, which effectively alleviates the information loss problem caused by scale differences in traditional feature pyramid networks through deep integration of cross-scale features and the design of the FocusFeature focusing module;
3.
To address the difficulty of balancing information preservation and computational efficiency in traditional convolutional downsampling, we design LAWDown. It integrates an adaptive weighting mechanism with grouped convolution, achieving efficient downsampling through dynamic feature selection and channel reorganization, significantly reducing computational complexity while maintaining accuracy.

2. Materials and Methods

2.1. Overview

This section describes the full approach, in which the DEIM framework is optimized for UAV object detection. The particular challenges posed by aerial images, namely widely varying scales, dense small-target distributions, and complex backgrounds, require dedicated architectural designs that reconcile detection accuracy and computational complexity. Our solution tackles these issues comprehensively through targeted enhancements at several steps of the detection process.
Our proposed optimizations follow a system-wide design, so that solid improvements to one component yield synergistic benefits for the whole system. We replace the heavy HGNetv2 backbone with the newly designed DCFNet backbone, a shallower architecture that retains competitive feature learning ability with only a few additional parameters. The LFDPN feature fusion network adopts a novel bidirectional feature diffusion mechanism that efficiently propagates semantic information and spatial details between scales. To meet the needs of aerial object detection, we compose dedicated modules, such as the DRFD module, the LAWDown sampler, and the FocusFeature enhancer, into a systematic solution.

2.2. Base DEIM Framework

In this section, we first present the base DEIM framework, which serves as the foundation for our improved DL-DEIM model. The base framework establishes the overall detection pipeline and highlights the key modules that enable diffusion-based optimization. By understanding the architecture of the baseline system, we can better motivate the enhancements introduced later.

2.2.1. DEIM Architecture Overview

The base DEIM framework adopts a hierarchical architecture for end-to-end object detection, as illustrated in Figure 2. The framework consists of three main components: (1) HGNetv2 backbone for multi-scale feature extraction, (2) HybridEncoder for feature enhancement and fusion, and (3) DFINETransformer decoder with self-distillation for iterative detection refinement.
Figure 2. Structure of the base DEIM framework.
The pipeline is designed to perform progressive feature extraction, multi-scale fusion, and end-to-end detection of UAV images directly without any additional post-processing (e.g., NMS). The most crucial contribution is to integrate diffusion-based fine-tuning and dense supervision schemes.

2.2.2. HGNetv2 Backbone

The HGNetv2 backbone realizes hierarchical gradient aggregation through specialized HGStage modules and stacked HGBlock units responsible for dense feature extraction. The backbone progressively extracts features with continuously increasing receptive fields and semantic abstraction. Starting from an initial Stem block that downsamples the spatial resolution to H/4 × W/4, the network passes through four hierarchical stages (HGStage1–4) to produce feature maps at scales P2/4, P3/8, P4/16, and P5/32, respectively. At each stage, the pipeline carefully trades off resolution reduction against channel expansion, retaining fine-grained local details for small object detection while building sufficient capacity for global context and semantic understanding. This hierarchical structure guarantees a complete multi-scale feature representation and facilitates gradient flow through dense connections. The structures of the HGStage and HGBlock modules are shown in Figure 3.
Figure 3. Structure of the HGStage and HGBlock.

2.3. Improved DEIM Framework

Building upon the base DEIM framework, we introduce an improved version termed DL-DEIM.

DL-DEIM Architecture

To overcome the deficiencies of the baseline DEIM while retaining its main benefits, we introduce architectural improvements across all components. The enhanced architecture replaces the HGNetv2 backbone with our lightweight DCFNet, adds the LFDPN neck for better feature fusion, and keeps the powerful DFINETransformer decoder. The structure of the DL-DEIM architecture is illustrated in Figure 4.
Figure 4. Structure of the DL-DEIM architecture.
As shown in Figure 4, feature maps extracted by the DCFNet backbone are first refined by the DRFD–FasterC3k2 compound block, which jointly balances receptive field diversity and computation. These refined multi-scale features are then propagated to the LFDPN neck for bidirectional diffusion and cross-scale fusion. Finally, the fused representations are passed through the DFINETransformer decoder for detection. The LAWDown module is embedded between LFDPN stages as an adaptive downsampling operator to maintain spatial consistency. This pipeline ensures smooth information flow from shallow to deep layers while minimizing information loss for small-object detection.
The DL-DEIM follows a clear top-down flow: features are first extracted by the DCFNet backbone, where each stage combines a DRFD unit with a FasterC3k2 refinement block. The resulting multi-scale feature maps (p3–p5) are then passed into the LFDPN neck, which diffuses and fuses semantic and spatial information bidirectionally. Finally, the fused representations are decoded by the DFINETransformer head to produce bounding box and confidence predictions. This structure ensures a consistent data stream and enables interpretable interaction between modules.
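For concreteness, the overall data flow can be summarized with the following minimal PyTorch-style sketch; the module names follow the text above, while their constructor arguments and internal structure are illustrative placeholders rather than the exact implementation.

```python
# Minimal sketch of the DL-DEIM forward pass described above.
# Module internals are placeholders, not the authors' implementation.
import torch
import torch.nn as nn

class DLDEIMSketch(nn.Module):
    def __init__(self, backbone: nn.Module, neck: nn.Module, decoder: nn.Module):
        super().__init__()
        self.backbone = backbone   # DCFNet: image -> (P3, P4, P5) features
        self.neck = neck           # LFDPN: bidirectional multi-scale fusion
        self.decoder = decoder     # DFINETransformer: fused features -> boxes/scores

    def forward(self, images: torch.Tensor):
        p3, p4, p5 = self.backbone(images)   # multi-scale feature extraction
        fused = self.neck([p3, p4, p5])      # diffused and fused pyramid
        return self.decoder(fused)           # end-to-end predictions, no NMS
```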

2.4. DCFNet Backbone

As a fundamental contribution, we propose a new backbone, DCFNet. In sharp contrast to the computationally heavy HGNetv2, DCFNet is a lightweight architecture that preserves strong feature extraction ability at a dramatically lower computational cost. The backbone adopts a progressive feature extraction strategy with four hierarchical stages, producing multi-scale representations at resolutions $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$, corresponding to feature levels P2, P3, P4, and P5, respectively.

2.4.1. DRFD

Given an input feature $X \in \mathbb{R}^{B \times C \times H \times W}$, the DRFD [] block performs downsampling through three complementary branches—CutD, ConvD, and MaxD—followed by feature fusion. Here, GConv denotes grouped convolution with $g$ groups (depth-wise when $g = C$), and CutD refers to pixel-rearrangement-based downsampling with $1 \times 1$ fusion.
CutD Branch—The input is split into four spatially interleaved sub-tensors and fused by a $1 \times 1$ convolution:
$$C' = \mathrm{BN}\big(\mathrm{Conv}_{1\times1}\big(\mathrm{Concat}\big[X[:,:,0::2,0::2],\ X[:,:,1::2,0::2],\ X[:,:,0::2,1::2],\ X[:,:,1::2,1::2]\big]\big)\big),$$
yielding $C' \in \mathbb{R}^{B \times C' \times \frac{H}{2} \times \frac{W}{2}}$, where $C' = 2C$.
ConvD and MaxD Branches—For local context preservation, ConvD applies grouped convolutions with strides 1 and 2:
$$X_{\mathrm{conv}} = \mathrm{BN}\big(\mathrm{GConv}_{3\times3,\,s=2}^{\,g=C}\big(\mathrm{GConv}_{3\times3,\,s=1}^{\,g=C}(X)\big)\big),$$
while MaxD captures salient responses via pooling:
$$X_{\mathrm{max}} = \mathrm{BN}\big(\mathrm{MaxPool}_{2\times2}\big(\mathrm{GConv}_{3\times3,\,s=1}^{\,g=C}(X)\big)\big).$$
The three outputs are concatenated and fused through a $1 \times 1$ convolution:
$$Y = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}[C',\ X_{\mathrm{conv}},\ X_{\mathrm{max}}]\big) \in \mathbb{R}^{B \times C' \times \frac{H}{2} \times \frac{W}{2}}.$$
The three-branch design jointly preserves fine details (CutD), spatial coherence (ConvD), and high-response regions (MaxD), effectively mitigating information loss from conventional strided downsampling with only marginal computational overhead. The architecture of the DRFD module is illustrated in Figure 5.
Figure 5. Architecture of the DRFD module.
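A minimal PyTorch sketch of the three-branch DRFD operation is given below; it follows the equations above, assuming even input height and width and output channels $C' = 2C$, while the exact layer ordering and normalization placement remain assumptions.

```python
# Hedged sketch of the DRFD block (CutD / ConvD / MaxD), assuming even H and W.
import torch
import torch.nn as nn

class DRFDSketch(nn.Module):
    def __init__(self, c_in: int):
        super().__init__()
        c_out = 2 * c_in
        # CutD: four interleaved sub-tensors fused by a 1x1 convolution
        self.cut_fuse = nn.Sequential(nn.Conv2d(4 * c_in, c_out, 1), nn.BatchNorm2d(c_out))
        # ConvD: grouped 3x3 (stride 1) followed by grouped 3x3 (stride 2)
        self.conv_d = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, 1, 1, groups=c_in),
            nn.Conv2d(c_out, c_out, 3, 2, 1, groups=c_out),
            nn.BatchNorm2d(c_out),
        )
        # MaxD: grouped 3x3 (stride 1) followed by 2x2 max pooling
        self.max_d = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, 1, 1, groups=c_in),
            nn.MaxPool2d(2, 2),
            nn.BatchNorm2d(c_out),
        )
        self.fuse = nn.Conv2d(3 * c_out, c_out, 1)   # final 1x1 fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cut = torch.cat([x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2],
                         x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]], dim=1)
        cut = self.cut_fuse(cut)
        return self.fuse(torch.cat([cut, self.conv_d(x), self.max_d(x)], dim=1))
```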

2.4.2. FasterC3k2

As illustrated by the orange arrows in Figure 4, each DRFD module is immediately followed by a FasterC3k2 block, forming a compound downsampling–refinement unit. The DRFD stage performs multi-branch feature reduction, generating a compact yet information-rich representation $Y \in \mathbb{R}^{B \times C' \times \frac{H}{2} \times \frac{W}{2}}$, where $C'$ is typically set to $2C$. This output serves as the input to the FasterC3k2 block for efficient channel-wise refinement. Mathematically, the combined operation can be expressed as
$$Y_{\mathrm{DRFD}} = F_{\mathrm{DRFD}}(X), \qquad Z_{\mathrm{out}} = F_{\mathrm{FasterC3k2}}(Y_{\mathrm{DRFD}}) = \mathrm{Concat}\big[\mathrm{Conv}\big(Y_{\mathrm{DRFD}}[1:C'/n_{\mathrm{div}}]\big),\ Y_{\mathrm{DRFD}}[C'/n_{\mathrm{div}}:C']\big],$$
where $n_{\mathrm{div}} = 4$ denotes the channel division ratio used to perform partial convolution for efficiency.
This joint design allows DRFD to capture and preserve multi-scale contextual information while FasterC3k2 selectively refines and mixes feature channels with minimal computational overhead.
The compound structure ensures a balanced trade-off between receptive-field diversity and model compactness, leading to improved detection accuracy without increased complexity.
Following each DRFD module, the C3k2_Block incorporates Faster_Block components for efficient feature processing. The Faster_Block employs a partial convolution strategy, processing only a fraction of the channels ($n_{\mathrm{div}} = 4$) to reduce computational cost while maintaining representational capacity:
$$F_{\mathrm{out}} = \mathrm{Concat}\big[\mathrm{Conv}(F_{1:C/4}),\ F_{C/4:C}\big].$$
The architecture of the FasterC3k2 module is illustrated in Figure 6.
Figure 6. Architecture of the FasterC3k2 module and details.
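The partial convolution at the core of the Faster_Block can be sketched as follows; the surrounding point-wise layers of the full block are omitted, and the 3 × 3 kernel size is an assumption.

```python
# Sketch of the partial convolution in the Faster_Block: only the first
# C/n_div channels are convolved, the remaining channels pass through.
import torch
import torch.nn as nn

class PartialConvSketch(nn.Module):
    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.c_conv = channels // n_div
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.c_conv, x.size(1) - self.c_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)
```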

2.5. Lightweight Feature Diffusion Pyramid Network (LFDPN)

The LFDPN introduces a novel bidirectional feature propagation mechanism that efficiently diffuses both semantic and spatial information across scales.

2.5.1. Architectural Overview

The network architecture follows an encoder–decoder pattern with three main pathways: top-down semantic flow, bottom-up detail enhancement, and lateral connections for feature fusion.

2.5.2. FMetaFormerBlock

Although the transformer block effectively captures long-range dependencies through self-attention, its quadratic complexity with respect to token number makes it inefficient for high-resolution UAV images. To address this limitation, we adopt the FMetaFormer block, which retains the overall transformer-like token-mixing paradigm but replaces the attention operation with a lightweight frequency spatial mixing strategy for improved efficiency and better feature generalization.
The MetaFormer block serves as the foundational computational unit, implementing a generalized transformer-style architecture. The block follows a dual-branch design with residual connections:
$$x' = \mathrm{ResScale}_1(x) + \mathrm{LayerScale}_1\big(\mathrm{DropPath}\big(\mathrm{TokenMixer}(\mathrm{LayerNorm}(x))\big)\big), \qquad x_{\mathrm{out}} = \mathrm{ResScale}_2(x') + \mathrm{LayerScale}_2\big(\mathrm{DropPath}\big(\mathrm{MLP}(\mathrm{LayerNorm}(x'))\big)\big),$$
where the TokenMixer is the FSSA module and LayerNorm applies generalized normalization across the spatial and channel dimensions. The architecture of the FMetaFormer module is illustrated in Figure 7.
Figure 7. Architecture of the FMetaformer module.
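A simplified sketch of the block equations is shown below, assuming channel-last token tensors of shape (B, N, C); DropPath is omitted, the LayerScale/ResScale terms are plain learnable per-channel scales, and the token mixer (FSSA in our design) is passed in as a module.

```python
# Hedged sketch of the MetaFormer-style block used by FMetaFormer.
import torch
import torch.nn as nn

class MetaFormerBlockSketch(nn.Module):
    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = token_mixer
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # ResScale / LayerScale as learnable per-channel scales (assumption)
        self.res1 = nn.Parameter(torch.ones(dim))
        self.res2 = nn.Parameter(torch.ones(dim))
        self.ls1 = nn.Parameter(1e-5 * torch.ones(dim))
        self.ls2 = nn.Parameter(1e-5 * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.res1 * x + self.ls1 * self.mixer(self.norm1(x))  # token-mixing branch
        x = self.res2 * x + self.ls2 * self.mlp(self.norm2(x))    # channel-MLP branch
        return x
```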

2.5.3. FSSA

The FSSA module uniquely combines frequency-domain and spatial-domain attention mechanisms to capture both global and local dependencies []. The frequency attention branch leverages the fast Fourier transform to operate in the frequency domain:
$$Q_f,\ K_f,\ V_f = \mathcal{F}(x),$$
where $\mathcal{F}$ denotes the 2D FFT operation. The frequency attention scores are computed using complex-valued operations:
$$\mathrm{Attn}_f = \mathrm{ComplexNorm}\!\left(\frac{Q_f K_f^{H}}{\sqrt{d}}\, T\right),$$
where $K_f^{H}$ represents the Hermitian transpose, $T \in \mathbb{R}^{n_h \times 1 \times 1}$ are learnable temperature parameters, and ComplexNorm applies specialized normalization:
$$\mathrm{ComplexNorm}(z) = \mathrm{Softmax}(\mathrm{Re}(z)) + i \cdot \mathrm{Softmax}(\mathrm{Im}(z)).$$
The spatial attention branch employs multi-scale depth-wise convolutions:
$$Q_s = \big[\mathrm{DWConv}_{3\times3}(x)\ \|\ \mathrm{DWConv}_{5\times5}(x)\big], \quad K_s = \big[\mathrm{DWConv}_{3\times3}(x)\ \|\ \mathrm{DWConv}_{5\times5}(x)\big], \quad V_s = \big[\mathrm{DWConv}_{3\times3}(x)\ \|\ \mathrm{DWConv}_{5\times5}(x)\big].$$
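The ComplexNorm operation and the frequency branch can be illustrated with the following hedged sketch; how $Q_f$, $K_f$, and $V_f$ are derived from $\mathcal{F}(x)$ and the exact attention dimensions are not fully specified above, so the channel-wise (rather than token-wise) attention and the scalar temperature used here are assumptions.

```python
# Hedged sketch of ComplexNorm and a channel-wise frequency-attention branch.
import torch

def complex_norm(z: torch.Tensor) -> torch.Tensor:
    # Softmax applied separately to the real and imaginary parts, as in the equation.
    return torch.complex(torch.softmax(z.real, dim=-1), torch.softmax(z.imag, dim=-1))

def frequency_attention(x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # x: (B, C, H, W) real feature map; attention computed over channels (assumption).
    B, C, H, W = x.shape
    xf = torch.fft.fft2(x)                        # 2D FFT over spatial dims
    q = k = v = xf.flatten(2)                     # (B, C, H*W) complex tokens
    attn = complex_norm(q @ k.mH / temperature)   # K^H = Hermitian transpose
    out = attn @ v
    return torch.fft.ifft2(out.view(B, C, H, W)).real
```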

2.5.4. LAWDown

The LAWDown module realizes an attention-guided downsampling operation tailored for aerial imagery. Spatially adaptive weights are generated dynamically by a global context-aware attention mechanism (average pooling followed by a $1 \times 1$ convolution). This overcomes the limitation of traditional fixed-kernel downsampling: the downsampling strategy adapts to the semantic content of the input features, preserving features more faithfully across different scenarios. The feature maps are expanded by a channel rearrangement operation and then processed by grouped convolution, which reduces the dimension of each group and allows more groups to be used. This design improves both the feature representation ability and the preservation of cross-channel semantic correlation through efficient channel interactions. By reconstructing the downsampled features from each $s_1 \times s_2$ region with softmax weights, LAWDown realizes fine-grained, spatially weighted information fusion. The architecture of the LAWDown module is illustrated in Figure 8.
Figure 8. Architecture of the LAWDown module.
The module computes spatially adaptive weights for each 2 × 2 region:
$$A = \mathrm{Softmax}\big(\mathrm{Reshape}\big(\mathrm{Conv}_{1\times1}(\mathrm{AvgPool}_{3\times3}(x))\big)\big),$$
where $A \in \mathbb{R}^{B \times C \times \frac{H}{2} \times \frac{W}{2} \times 4}$ assigns importance weights. The downsampling process is as follows:
$$X_{ds} = \mathrm{Reshape}\big(\mathrm{GConv}_{3\times3,\,s=2,\,g=C/16}(x)\big), \qquad X_{\mathrm{out}} = \mathrm{Conv}_{1\times1}\!\left(\sum_{i=1}^{4} X_{ds}^{(i)} \odot A^{(i)}\right).$$
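A hedged sketch consistent with the two equations above is given below; the pooling stride, the group count $g = C/16$, and the channel-to-candidate reshaping order are assumptions, and the channel count is assumed divisible by 16.

```python
# Hedged sketch of LAWDown: softmax weights reweight 4 candidate
# downsampled responses produced by a grouped stride-2 convolution.
import torch
import torch.nn as nn

class LAWDownSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.weight_gen = nn.Sequential(
            nn.AvgPool2d(3, stride=2, padding=1),      # local context
            nn.Conv2d(channels, 4 * channels, 1),      # 4 weights per channel/location
        )
        self.down = nn.Conv2d(channels, 4 * channels, 3, stride=2, padding=1,
                              groups=max(channels // 16, 1))
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                            # assumes even h, w
        a = self.weight_gen(x).view(b, c, 4, h // 2, w // 2).softmax(dim=2)
        xd = self.down(x).view(b, c, 4, h // 2, w // 2)
        return self.proj((xd * a).sum(dim=2))           # weighted fusion of candidates
```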

2.5.5. FocusFeature

The FocusFeature module performs adaptive multi-scale aggregation across three pyramid levels. The architecture of the FocusFeature module is illustrated in Figure 9. The input features $\{f_1, f_2, f_3\}$ from scales P3, P4, and P5 are processed as follows:
$$f_1' = \mathrm{Conv}_{1\times1}\big(\mathrm{Upsample}_{2\times}(f_1)\big), \quad f_2' = \mathrm{Conv}_{1\times1}(f_2)\ \ (\text{if } e = 1), \quad f_3' = \mathrm{ADown}(f_3).$$
Figure 9. Architecture of the FocusFeature module.
The ADown operation combines average pooling and max pooling paths:
$$\mathrm{ADown}(x) = \big[\mathrm{Conv}_{3\times3,\,s=2}(x_{1:C/2})\ \|\ \mathrm{Conv}_{1\times1}(\mathrm{MaxPool}(x_{C/2:C}))\big].$$
Multi-scale context is captured through parallel depth-wise convolutions with kernel sizes $K = \{3, 5, 7, 9\}$:
$$F_{\mathrm{enhanced}} = F_{\mathrm{concat}} + \sum_{k \in K} \mathrm{DWConv}_{k}(F_{\mathrm{concat}}).$$
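The multi-kernel context stage of FocusFeature follows directly from the last equation and can be sketched as below; the scale-alignment convolutions and the ADown path are omitted for brevity.

```python
# Sketch of the multi-kernel context stage: parallel depth-wise convolutions
# with kernels {3, 5, 7, 9}, summed and added residually to the input.
import torch
import torch.nn as nn

class MultiKernelContextSketch(nn.Module):
    def __init__(self, channels: int, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernels
        ])

    def forward(self, f_concat: torch.Tensor) -> torch.Tensor:
        return f_concat + sum(branch(f_concat) for branch in self.branches)
```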

2.5.6. MFM

The MFM module dynamically calibrates contributions from different scales using channel attention []. The architecture of the MFM module is illustrated in Figure 10. Given features $\{f_i\}_{i=1}^{N}$ from multiple sources:
$$w = \mathrm{Softmax}\!\left(\mathrm{MLP}\!\left(\mathrm{GAP}\!\left(\sum_{i=1}^{N} f_i\right)\right)\right).$$
Figure 10. Architecture of the MFM module.
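A minimal sketch of the MFM weighting is shown below; whether the softmax weights are per source or per channel is not specified above, so per-source scalar weights and the MLP reduction ratio are assumptions.

```python
# Hedged sketch of MFM fusion: GAP of the summed inputs feeds a small MLP
# whose softmax output weights each source before the weighted sum.
import torch
import torch.nn as nn

class MFMSketch(nn.Module):
    def __init__(self, channels: int, num_sources: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, num_sources),
        )

    def forward(self, feats: list) -> torch.Tensor:
        s = torch.stack(feats, dim=1)                       # (B, N, C, H, W)
        gap = s.sum(dim=1).mean(dim=(2, 3))                 # GAP of the summed features
        w = self.mlp(gap).softmax(dim=-1)                   # (B, N) per-source weights
        return (s * w[:, :, None, None, None]).sum(dim=1)   # weighted fusion
```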

2.5.7. RepHMS

The RepHMS module enhances feature representation through hierarchical processing with UniRepLKNet blocks []. The architecture of the RepHMS module is illustrated in Figure 11. The module employs cascaded DepthBottleneck blocks for multi-scale feature extraction:
$$y_{i,j} = \mathrm{DepthBlock}_{i,j}\big(x_{i+1} + \mathrm{cascade}_{j}\big).$$
Figure 11. Architecture of the RepHMS module.
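One possible reading of the cascade equation is sketched below; the exact composition of the DepthBottleneck blocks and the initialization of the cascade term are assumptions.

```python
# Hypothetical sketch of the RepHMS cascade: each block consumes the
# upper-level feature plus the running cascade output.
import torch
import torch.nn as nn

def rephms_cascade_sketch(x_upper: torch.Tensor, blocks: nn.ModuleList) -> torch.Tensor:
    cascade = torch.zeros_like(x_upper)   # assumed zero initialization
    for block in blocks:                  # cascaded DepthBottleneck blocks
        cascade = block(x_upper + cascade)
    return cascade
```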

3. Results

3.1. Experimental Dataset and Experimental Setup

In this section, we introduce the rationale for our dataset selection and the standard evaluation metrics adopted to objectively quantify model performance under these conditions.

3.1.1. Experimental Dataset

All experiments are carried out on the publicly available VisDrone2019 dataset [], which was released by the AISKYEYE team at Tianjin University. The dataset contains 6471 training images, 548 validation images, and 1580 testing images captured by unmanned aerial vehicles (UAVs) in various urban and suburban environments.
The VisDrone2019 dataset is challenging for object detection algorithms in several respects: (1) wide scale variation: objects can range in size from only a few pixels to hundreds of pixels; (2) complex, crowded scenes with many object occlusions; and (3) varying viewpoints and angles. The images come from UAV platforms that provide viewpoints not available in common object detection datasets. Sample images illustrating the diversity and challenges of the VisDrone2019 dataset are presented in Figure 12.
Figure 12. Sample images illustrating the diversity and challenges of the VisDrone2019 dataset.

3.1.2. Evaluation Metrics

To conduct a fair comparison of our DL-DEIM model, we employ the standard COCO evaluation metrics, including the average precision (AP) and the mean average precision over all classes (mAP).
Average Precision (AP): Average precision is calculated as the area under the precision–recall curve. For a specific class $c$ and IoU threshold $\theta$:
$$AP_c^{\theta} = \int_{0}^{1} P_c(r)\, dr,$$
where $P_c(r)$ represents the precision at recall $r$ for class $c$.
AP@0.5 (AP50): AP50 uses a fixed IoU threshold of 0.5 and is calculated with 11-point interpolation as
$$AP_{50}^{c} = \frac{1}{11} \sum_{r \in \{0,\, 0.1,\, \ldots,\, 1.0\}} P_{\mathrm{interp}}(r),$$
where $P_{\mathrm{interp}}(r)$ is the interpolated precision:
$$P_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} P(\tilde{r}).$$
Mean Average Precision (mAP): The mAP is computed by averaging the AP values across all object categories:
$$mAP@0.5 = \frac{1}{C} \sum_{c=1}^{C} AP_c^{0.5},$$
where $C$ is the total number of object categories (10 for VisDrone2019).
mAP@0.5:0.95: This metric averages the mAP values computed at IoU thresholds from 0.5 to 0.95 with a step size of 0.05:
$$mAP@0.5{:}0.95 = \frac{1}{10} \sum_{\theta \in \{0.5,\, 0.55,\, \ldots,\, 0.95\}} mAP_{\theta}.$$
The choice of IoU thresholds follows the widely adopted COCO evaluation protocol. Specifically, AP@0.5 provides a relatively lenient criterion that emphasizes detection sensitivity, where predicted bounding boxes only need to overlap the ground truth by at least 50%. In contrast, AP@0.95 enforces a much stricter criterion, requiring almost perfect alignment between predicted and ground-truth boxes. By averaging mAP values across thresholds from 0.5 to 0.95 with a step size of 0.05, we obtain a balanced evaluation that simultaneously accounts for both coarse and precise localization. This combination enables a fair comparison with prior works and reflects real-world detection requirements where both recall and precise localization are important.
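For reference, the 11-point interpolated AP and the mAP@0.5:0.95 averaging described above can be computed as in the following sketch, assuming the per-class precision–recall points and per-IoU AP values have already been accumulated.

```python
# Sketch of 11-point interpolated AP and the mAP@0.5:0.95 average.
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    # At each recall level r in {0, 0.1, ..., 1.0}, take the maximum
    # precision observed at recall >= r, then average the 11 values.
    levels = np.linspace(0.0, 1.0, 11)
    interp = [precision[recall >= r].max() if np.any(recall >= r) else 0.0
              for r in levels]
    return float(np.mean(interp))

def map_50_95(ap_per_class_per_iou: np.ndarray) -> float:
    # ap_per_class_per_iou has shape (num_classes, 10), one column per IoU
    # threshold in {0.5, 0.55, ..., 0.95}; the metric is the overall mean.
    return float(ap_per_class_per_iou.mean())
```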
Evaluation Protocol: All results reported in this paper are based on the VisDrone2019 validation set (548 images). Because the test set annotations are not publicly available, the validation set is used as the main evaluation benchmark, which is consistent with common practice in the VisDrone challenges and allows a fair comparison with current state-of-the-art approaches that also report results on the validation set.

3.1.3. Experimental Setup

The developed DL-DEIM model was realized with the PyTorch deep learning library. Table 1 lists the computational environment for all experiments, which are important for reproducing our results.
Table 1. Experimental environment. The experiments use a high-performance GPU with 16 GB of memory, enabling efficient training of our method with a larger batch size and faster convergence.
The training parameters, determined through preliminary tuning on the validation set, are summarized in Table 2.
Table 2. Training parameters used in the experiments.

3.2. Results and Comparison

3.2.1. Ablation Study

We carried out extensive ablation experiments on the VisDrone2019 validation set to verify the effectiveness of each component of our proposed DL-DEIM model. All results in the tables below are obtained on the 548 validation images unless otherwise indicated. The ablation study investigates how much each module contributes to the performance improvement.
We first consider the sensitivity of individual components of the LFDPN encoder. Table 3 shows the performance of the FMetaformer block and LAWDown module.
Table 3. Ablation study on LFDPN encoder components evaluated on the VisDrone2019 validation set.
Table 4 presents a comprehensive ablation study examining the contribution of DCFNet, LFDPN, and LAWDown modules both individually and in combination.
Table 4. Detection performance ablation study on the VisDrone2019 validation set. The table demonstrates the synergistic effect of combining different modules.
Figure 13 provides a visualization of the ablation studies presented in Table 4.
Figure 13. Analysis of ablation experiments.
We derive several important insights from the ablation study:
  • Individual module contributions: all proposed modules contribute positively to detection, and LFDPN achieves the largest individual gain (+3.6% mAP@0.5).
  • Synergy effect: combining all three modules (DCFNet + LFDPN + LAWDown) yields synergistic enhancements and the best overall results.
  • Efficiency–performance trade-off: despite the increase in computational cost from 7.12 to 11.73 GFLOPs and in parameters from 3.73 M to 4.64 M, the model still runs in real time (356.0 FPS), which is appropriate for deployment on UAVs.

3.2.2. Comparative Experiment

To further validate the effectiveness of our proposed DL-DEIM model, we conducted comparative experiments against different backbone networks and state-of-the-art (SOTA) object detection methods on the VisDrone2019 validation set. The comparisons aim to demonstrate three aspects: (1) the superiority of our proposed DCFNet backbone over other lightweight backbones (Table 5); (2) the per-class detection performance of DL-DEIM across challenging object categories (Table 6); and (3) the overall competitiveness of DL-DEIM compared with the latest detection frameworks (Table 7).
Table 5. Comparison of different backbone networks on the VisDrone2019 validation set.
Table 6. Per-class AP comparison of different models on the VisDrone2019 validation set.
Table 7. Performance comparison of different detection models on the VisDrone2019 validation set.
DCFNet achieves the highest detection accuracy (mAP@0.5 = 32.0%, mAP@0.5:0.95 = 18.3%) among all tested backbones, while keeping the model relatively lightweight (7.86 GFLOPs, 4.10 M parameters). Compared with MobileNetV3-Small and ShuffleNetV2, DCFNet significantly improves accuracy with only moderate computational cost, confirming its suitability for UAV-based applications.
Our DL-DEIM model shows consistent improvements across most object categories. This confirms the effectiveness of LFDPN and LAWDown in handling scale variations and complex backgrounds, which are common challenges in UAV imagery.
Compared with the recent YOLO series (YOLOv5n–YOLOv11n), RT-DETR, and D-FINE [] (which redefines the regression task in DETRs as fine-grained distribution refinement), our DL-DEIM achieves the best overall performance (mAP@0.5 = 34.9%, mAP@0.5:0.95 = 20.0%). This indicates that our model surpasses both transformer-based and CNN-based SOTA detectors while maintaining real-time inference speed, making it highly suitable for real-world UAV applications. Figure 14 provides a visualization of the comparison experiments presented in Table 7.
Figure 14. Comparative experimental performance analysis.

3.3. Visualization Analysis

To illustrate the improvements of our proposed method more clearly, we present visualization results for the baseline and our enhanced approach, as illustrated in Figure 15, Figure 16 and Figure 17. The detection results, feature-map representations, and activation maps of the feature pyramid network (FPN) during detection are visualized in more detail in the following subsections.
Figure 15. Detection results visualization comparison for two samples. (a) Sample 1; (b) Sample 2; (c) detection map (Sample 1); (d) detection map (Sample 2).
Figure 16. Feature-map responses for the same input using the baseline (top and middle) and the improved model (bottom). (a) Sample 1; (b) Sample 2; (c) Sample 3; (d) feature map before improvement (Sample 1); (e) feature map before improvement (Sample 2); (f) feature map before improvement (Sample 3); (g) feature map after improvement (Sample 1); (h) feature map after improvement (Sample 2); (i) feature map after improvement (Sample 3).
Figure 17. Heatmaps showing salient regions before and after the model improvements. (a) Sample 1; (b) Sample 2; (c) heatmap before improvement (Sample 1); (d) heatmap result before improvement (Sample 2); (e) heatmap result after improvement (Sample 1); (f) heatmap result after improvement (Sample 2).

3.3.1. Detection Comparison

Figure 15 shows the detection results of the baseline method on two exemplar samples from the test dataset. The first row (Figure 15a,b) corresponds to the original input images, which contain target objects in diverse situations and backgrounds. The corresponding detection maps in the bottom row (Figure 15c,d) visualize the model outputs, highlighting the detected regions together with their confidence levels. These detection maps indicate that the model can detect and localize objects of interest in complex environments. The brightness of the highlighted areas represents the detection confidence: brighter regions indicate higher confidence that an object is present. The visualization shows that the baseline model can detect dominant objects, but there is room for improvement in detection precision and boundary clarity. These results serve as a baseline for comparison with our subsequent model improvements and give a first indication of the detection capabilities of the baseline setup.

3.3.2. Featuremap Comparison

We demonstrate the effectiveness of our enhancements by comparing the activation feature maps before and after the model improvements, as shown in Figure 16 for three different samples. The feature maps of the original model (Figure 16d–f) contain scattered attention patterns, with activations spread over various regions including the background. Conversely, the feature maps of the enhanced model (Figure 16g–i) show noticeably more focused and localized activation profiles. The improved model has stronger localization ability: activations concentrate on the objects while background noise is well suppressed. This sharpening of spatial attention indicates that the proposed modifications help the model learn more discriminative features and better understand object boundaries. The compact activation regions of the refined model indicate better feature selectivity and reduced uncertainty in target localization.

3.3.3. Heatmap Comparison

The last visualization compares the heatmap results of the original and improved models. Figure 17 shows two challenging examples in which the advantage of the improved model is clear. The heatmaps of the original model (Figure 17c,d) exhibit several drawbacks: missed detections, inaccurate bounding-box localization, and occasional false positives. In contrast, the results of the enhanced model (Figure 17e,f) show a visible improvement in detection quality. The bounding boxes obtained from the enhanced model fit the target objects more tightly, with higher confidence scores for true positives and fewer mis-detections. These advantages are particularly evident in complicated scenes with cluttered backgrounds or occluded objects. The visualization demonstrates that the improvements contributed by our proposed methods are not only reflected in the quantitative results but also in the quality of the heatmaps, which matters for practical applications. Both the decrease in false positives and the increase in localization accuracy verify that our approach alleviates the weaknesses of the baseline model.

4. Discussion

The proposed DL-DEIM framework demonstrates clear and consistent improvements over the baseline, achieving 34.9% mAP@0.5 and 20.0% mAP@0.5:0.95 on the VisDrone2019 dataset. These results validate the overall design philosophy that efficiency and accuracy in UAV object detection can be simultaneously enhanced through modular architectural optimization. This section discusses the individual contributions of each component and their synergistic effects on detection performance. An ablation analysis reveals the distinct roles of each major module:
  • DCFNet. By integrating the lightweight DCFNet backbone with the DRFD downsampling unit, the framework effectively mitigates information loss that typically occurs when processing small aerial targets. This component alone contributes approximately 1.8% mAP improvement while reducing computation by nearly 30%.
  • LFDPN. The bidirectional diffusion and frequency-domain attention in LFDPN enable more efficient semantic exchange across scales, strengthening object localization and recall—particularly for densely distributed small objects (+2.3% mAP).
  • LAWDown. The adaptive downsampling strategy dynamically adjusts spatial weighting based on content importance, preserving boundary information and improving precision on small targets (+2.1% mAP).
These findings indicate that DCFNet, LFDPN, and LAWDown each address a complementary aspect of the detection pipeline, and their joint operation leads to a synergistic performance boost greater than the sum of their isolated effects.
Visualization analysis further supports these quantitative findings. DL-DEIM produces sharper activation maps and more distinct object boundaries compared with FPN-based detectors, effectively suppressing background noise and emphasizing true positives in cluttered aerial scenes. This suggests that the network learns to better separate meaningful object cues from the complex backgrounds that are typical in UAV imagery.
Despite these advances, certain limitations remain. The model still depends on high-resolution inputs, imposing memory constraints that restrict batch size during training. Moreover, the generalizability of the proposed modules to other aerial benchmarks and real-time UAV applications warrants further investigation. Future work will explore model compression, knowledge distillation, and hardware-aware optimization to enhance deployment efficiency without sacrificing detection quality.

5. Conclusions

This paper presents DL-DEIM, a diffusion-enhanced object detection network tailored for aerial imagery in UAV systems. Built upon the foundation of DEIM, DL-DEIM integrates three purpose-built modules: DCFNet, LFDPN, and LAWDown. DCFNet reduces computational complexity while preserving spatial information, LFDPN improves multi-scale feature fusion, and LAWDown preserves high-quality discriminative representations at lower resolutions. The proposed model achieves a mean average precision (mAP@0.5) of 34.9% and a mAP@0.5:0.95 of 20.0%, outperforming the DEIM baseline by 4.6% and 2.9%, respectively. Additionally, DL-DEIM maintains real-time inference speed (356 FPS) with only 4.64 M parameters and 11.73 GFLOPs, making it highly efficient for deployment in resource-limited UAV systems. Ablation studies demonstrate the critical contributions of each module to both accuracy and efficiency, while visualizations highlight the model’s ability to localize objects in crowded scenes. Despite these advancements, DL-DEIM’s performance may still degrade under extreme conditions, such as highly dynamic scenes or challenging weather environments. Future work will focus on refining the model to handle these edge cases and on exploring adaptive strategies for real-time UAV deployment in diverse operational settings.

Author Contributions

Conceptualization, Y.L. and Y.B.; methodology, Y.L.; software, Y.L.; validation, Y.L. and Y.B.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.B.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and Y.B.; visualization, Y.L.; supervision, Y.B.; project administration, Y.B.; funding acquisition, Y.B. All authors have read and agreed to the published version of the manuscript.

Funding

The research leading to these results received funding from the Department of Education Project (Project Number: JY20230118), the Inner Mongolia “14th Five-Year Plan” Key Research and Development and Achievement Transformation Plan Project in the Field of Social Welfare (Inner Mongolia Science and Technology Plan Project, No.: 2023YFSH0003), and the Inner Mongolia University Scientific Research Project (NJZY22387).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All data generated and analyzed during this study are included in this article. For further details, please contact the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, L.; Zhang, W.; Liu, Y. Real-time disaster response using UAV-based object detection with lightweight neural networks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5601812.
  2. Wang, X.; Li, J.; Zhang, H. Urban traffic monitoring and analysis using drone-captured imagery with deep learning. Transp. Res. Part C 2022, 140, 103742.
  3. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Ling, H. Vision meets drones: Past, present and future. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4325–4341.
  4. Suo, J.; Zhang, X.; Shi, W.; Zhou, W. E3-UAV: An Edge-Based Energy-Efficient Object Detection System for Unmanned Aerial Vehicles. IEEE Internet Things J. 2023, 10, 3301–3313.
  5. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796.
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  10. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  12. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  13. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. Available online: https://arxiv.org/abs/1804.02767 (accessed on 8 April 2018).
  14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Available online: https://arxiv.org/abs/2004.10934 (accessed on 23 April 2020).
  15. Jocher, G. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 25 June 2020).
  16. Tripathi, R.; Patel, I.; Tushar, M.; Khandelwal, V. ASAP-NMS: Accelerating Non-Maximum Suppression Using Spatially Aware Priors. Available online: https://arxiv.org/abs/2007.09785 (accessed on 21 August 2020).
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
  18. Liu, S.; Ren, T.; Chen, J.; Zeng, Z.; Zhang, H.; Li, F.; Li, H.; Huang, J.; Su, H.; Zhu, J. Detection Transformer with Stable Matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6560–6569.
  19. Huang, S.; Yu, X.; Liu, L.; Xie, H.; Li, F.; Wang, Z. DEIM: DETR with Improved Matching for Fast Convergence. Available online: https://arxiv.org/abs/2412.04234 (accessed on 5 December 2024).
  20. Li, Z.; Paolieri, M. Inference latency prediction for CNNs on heterogeneous mobile devices and ML frameworks. Digit. Signal Process. 2024, 152, 104629.
  21. Wu, S.; Zhang, L.; Chen, H. Multi-scale feature extraction for energy-efficient object detection in remote sensing images. IET Comput. Vis. 2024, 18, 1089–1104.
  22. Shi, W.; Zhang, J.; Fu, Y.; Chen, D.; Zhu, J.; Lv, C. Lightweight detection and segmentation of crayfish parts using an improved YOLOv11n segmentation model. Sci. Rep. 2025, 15, 25634.
  23. Weng, T.; Niu, X. Enhancing UAV object detection in low-light conditions with ELS-YOLO: A lightweight model based on improved YOLOv11. Sensors 2025, 25, 4463.
  24. Yang, Y.; Yang, S.; Chan, Q. LEAD-YOLO: A Lightweight and Accurate Network for Small Object Detection in Autonomous Driving. Sensors 2025, 25, 4800.
  25. Liao, H.; Wang, G.; Jin, S.; Liu, Y.; Sun, W.; Yang, S.; Wang, L. HCRP-YOLO: A lightweight algorithm for potato defect detection. Smart Agric. Technol. 2025, 10, 100849.
  26. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
  27. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Wei, X.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  28. Lu, W.; Chen, S.-B.; Tang, J.; Ding, C.H.Q.; Luo, B. A Robust Feature Downsampling Module for Remote-Sensing Visual Tasks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4404312.
  29. Sun, Y.; Xu, C.; Yang, J. Frequency-spatial entanglement learning for camouflaged object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 343–360.
  30. Zhang, Y.; Zhou, S.; Li, H. Depth information assisted collaborative mutual promotion network for single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2846–2855.
  31. Yang, Z.; Guan, Q.; Yu, Z.; Xu, X.; Long, H.; Lian, S.; Hu, H.; Tang, Y. MHAF-YOLO: Multi-branch heterogeneous auxiliary fusion YOLO for accurate object detection. arXiv 2025, arXiv:2502.04656.
  32. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27 October–2 November 2019; pp. 1–10.
  33. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324.
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  35. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  36. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
  37. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO (Version 8.0.0). Available online: https://github.com/ultralytics/ultralytics (accessed on 10 August 2025).
  38. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
  39. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
  40. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
  41. Ultralytics. YOLO11: The Latest Advancement in State-of-the-Art Object Detection. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 10 August 2025).
  42. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine Regression Task in DETRs as Fine-Grained Distribution Refinement. arXiv 2024, arXiv:2410.13842.
