Article

FUR-DETR: A Lightweight Detection Model for Fixed-Wing UAV Recovery

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(5), 365; https://doi.org/10.3390/drones9050365
Submission received: 20 March 2025 / Revised: 1 May 2025 / Accepted: 8 May 2025 / Published: 13 May 2025

Abstract

Because traditional recovery systems lack visual perception, it is difficult to monitor a UAV's real-time status in communication-constrained or GPS-denied environments. This weakens decision-making and parameter adjustment and increases the uncertainty and risk of recovery. Visual detection technology can compensate for the limitations of GPS and communication and improve the autonomy and adaptability of the system. However, the existing RT-DETR algorithm is limited by single-path feature extraction, a simplified fusion mechanism, and high-frequency information loss, which makes it difficult to balance detection accuracy and computational efficiency. Therefore, this paper proposes a lightweight visual detection model based on the Transformer architecture to further optimize computational efficiency. Firstly, to address the performance bottleneck of existing models, the Parallel Backbone is proposed, which shares an initial feature extraction module and uses a dual-branch structure to capture local features and global semantic information, respectively, and employs a progressive fusion mechanism to achieve adaptive integration of multiscale features, thereby balancing the accuracy and lightness of target detection. Secondly, an adaptive multiscale feature pyramid network (AMFPN) is designed, which effectively integrates information at different scales through multi-level feature fusion and an information transmission mechanism, alleviates information loss in small-target detection, and improves detection accuracy in complex backgrounds. Finally, a wavelet frequency-domain-optimized reverse feature fusion mechanism (WT-FORM) is proposed. By using the wavelet transform to decompose shallow features into multiple frequency bands and combining weighted calculation with a feature compensation strategy, the computational complexity is reduced and the representation of the global context is further enhanced. The experimental results show that, on three datasets, the improved model reduces the parameter count and computational load by 43.2% and 58%, respectively, while maintaining detection accuracy comparable to that of the original RT-DETR. Even in complex environments with low light, occlusion, or small targets, it provides more accurate detection results.

1. Introduction

The fixed-wing unmanned aerial vehicle (UAV) plays an irreplaceable role in military reconnaissance, disaster relief, and logistics transportation due to its long endurance, high-speed flight, and extensive operational capability. However, the efficient and safe recovery of a UAV after completing its mission has always been a key technical problem to be solved. Traditional fixed-wing UAV recovery methods include rope recovery and arresting-net recovery, which mainly rely on airborne flight control systems and preset ground devices to ensure accurate position control in the final approach stage. The adjustment of braking parameters in these methods usually depends on manual operation, making it difficult to achieve rapid response and accurate matching for different aircraft types and flight states, thus increasing the recovery risk [1,2]. In particular, in narrow areas such as mountains and cities involved in military operations and logistics transportation, recovery tasks are often affected by sudden changes in lighting conditions, background environmental interference, unpredictable wind, and other factors. These environmental factors have a significant impact on the UAV's flight control system, recovery path planning, speed stability, and other core parameters. Ground recovery facilities have difficulty adjusting the recovery parameters in time according to the UAV's state changes, leading to the failure of recovery tasks and overload damage to onboard equipment [3,4]. What is more challenging is that the recovery of fixed-wing UAVs often faces limited communication; especially in a denied environment, a stable communication link between the ground and the aircraft may not be established. Research shows that when GPS is available, the position and attitude of a UAV can be obtained through RTK-GNSS technology, and the recovery task can be supported only by transmitting position data between the UAV and the ground recovery device [5]. However, in battlefield environments, complex terrain, or long-distance autonomous flight missions, the probability of the GNSS signal being interfered with or shielded is very high, and traditional recovery methods relying on GPS or remote communication are difficult to apply [6,7]. Because the real-time position, model, and flight status information of the UAV cannot be obtained, the decision-making ability of the ground recovery system is greatly limited. However, even under these conditions, the recovery task still needs to be executed, so it is necessary to seek an autonomous sensing method that does not rely on external signals.
In recent years, the rapid development of computer vision technology has provided new solutions for UAV status monitoring and target positioning. Figure 1 illustrates a recovery method that integrates object detection technology and vortex braking technology. This approach enables the direct acquisition of UAV position and status information through image analysis, even without GPS or interactive support, and dynamically adjusts the braking parameters of the vortex brake to adapt to different recovery conditions.
However, the core challenge to achieving this goal is that the visual detection of UAVs still faces many technical difficulties. Because a UAV usually flies in complex environments, appears small in the image, moves fast, and is easily confused with the background, it is difficult to ensure detection accuracy and robustness. Although deep learning has made significant progress in the field of target detection, existing mainstream detection methods still have performance limitations in UAV recovery scenarios. For example, classic models such as Fast R-CNN, SSD, and EfficientDet perform well in general target detection tasks, but in UAV detection tasks, factors such as small target sizes, complex backgrounds, and motion blur often prevent detection performance from meeting actual needs [8]. Therefore, it is a key issue to develop an efficient visual detection model better suited to UAV detection tasks.
In response to the above challenges, although the Transformer architecture represented by RT-DETR has made some progress in integrating global modeling capabilities and real-time performance in recent years, it still has limitations in terms of its single feature extraction path and a lack of multiscale adaptability, which restrict its practical deployment in UAV recovery systems with limited computing resources [9,10]. Therefore, this paper re-examines the design of the target detection architecture and proposes a lightweight UAV target detection model (FUR-DETR). Through innovative feature extraction and feature fusion strategies, the computational efficiency is further optimized while maintaining the Transformer's strong representation ability. The main contributions of this paper are summarized as follows:
(1) A parallel multi-branch feature extraction network (Parallel Backbone) is proposed, which includes a shared initial feature extraction module and a dual-branch structure to extract local features and global semantic information. A dynamic alignment fusion mechanism is employed to achieve adaptive multiscale feature fusion, thus balancing lightweight design and detection accuracy.
(2) An adaptive multiscale feature pyramid network (AMFPN) is designed, which optimizes feature fusion and upsampling modules to mitigate the loss of fine-grained feature information and scale mismatches. Then, a dynamic fusion strategy is applied to enhance the feature representation ability, thus improving detection accuracy while reducing computational complexity.
(3) A wavelet frequency–domain-optimized reverse feature fusion mechanism (WT-FORM) is proposed, which uses the wavelet transform (WT) to decompose shallow features into low-frequency components, representing global semantics, and high-frequency components, representing local details, and adaptively reweights frequency–domain-specific information through learnable parameters to expand the receptive field and retain more image details.
The rest of this paper is organized as follows: Section 2 reviews the related work on UAV target detection. Section 3 introduces the framework and architecture of FUR-DETR and describes the key improvement modules in detail. Section 4 describes the dataset, evaluation metrics, experimental setup, and results. Section 5 provides a comprehensive conclusion.

2. Related Work

2.1. Transformer-Based Object Detection Algorithms

In many target detection frameworks, YOLO series algorithms [11,12,13] show good real-time performance because of their efficient architecture, but they have limitations in modeling the relationship between global information and targets, especially in cases of complex backgrounds and overlapping targets, where they are prone to false and missed detections. To this end, the Facebook team [14] launched the Transformer-based end-to-end target detection algorithm DETR, which regards target detection as a set prediction problem and uses a bipartite graph matching algorithm to solve the matching problem between model predictions and actual targets. In contrast to the YOLO series, the Transformer-based DETR model can effectively handle the relationship between targets and the global context through the global self-attention mechanism and improve detection accuracy and robustness.
However, in the training process of DETR, the computational cost of bipartite graph matching is high, resulting in slow convergence speed and weak detection ability for small targets. In addition, the high computational complexity of the Transformer architecture limits its application in some scenarios. For this reason, scholars have carried out a number of studies on backbone modification, query design, attention refinement, and real-time enhancement [15]. DN-DETR [16] added noise to the real data to assist the Hungarian algorithm in completing the matching process, effectively alleviating the instability of the binary matching mechanism and accelerating the convergence speed of the model. DINO [17] significantly improved detection performance and training efficiency by introducing innovative technologies such as dynamic anchor boxes, comparative denoising training, mixed query selection, and Look Forward Twice. Group DETR [18] proposed a new label allocation strategy to decouple the one-to-many allocation problem from the multi-group one-to-one allocation problem, which not only accelerated the convergence speed but also removed redundant predictions while ensuring the support of multiple forward queries. RT-DETR [10] optimized the backbone network, object query design, multiscale feature fusion, and other key modules of the DETR model, significantly reducing the demand for computing resources and enabling real-time detection. The original intention of its design was to retain the global modeling ability of the Transformer and improve detection speed through lightweight design to achieve a balance between real-time performance and detection accuracy. However, although RT-DETR is optimized for computational efficiency, there are still some limitations. Firstly, the backbone network based on ResNet is limited by the single-path feature extraction mode and lacks adaptive processing ability for multiscale targets, and deep convolution stacking further aggravates the problem of information loss in the process of feature transmission. Secondly, although the simplified bidirectional feature fusion mechanism reduces computational complexity, it weakens the ability to capture long-distance dependencies and performs poorly in complex backgrounds. Finally, traditional convolution lacks effective use of frequency–domain information in the process of feature dimensionality reduction, resulting in the loss of high-frequency detail texture information, which limits the recognition ability of the model for targets at different scales.

2.2. Current Research of UAV Detection Algorithms

When detecting UAVs, the general target detection algorithms mentioned above usually produce suboptimal results due to the significant difference between the general scene image and the UAV image. The complex background, large change of target scale, and diversity of attitude pose new challenges to the algorithm. In order to enhance the accuracy and robustness of UAV detection, scholars have carried out a series of works in network architecture, feature fusion, and convolution operations.
Firstly, the traditional deep learning model usually involves a large amount of computation and high storage overhead, which makes it difficult to effectively deploy on edge computing devices and resource-constrained platforms. To solve this problem, lightweight network architectures and model optimization methods have become a research hotspot. The PHSI-RTDETR proposed by Wang et al. [19] was used for infrared small-target detection. It combines the Hilo attention mechanism and a cross-scale feature fusion structure and optimizes the loss function to maintain high detection accuracy and real-time performance while reducing the amount of calculation. Based on the edge computing platform, Titu et al. [20] used knowledge distillation technology to compress the model, significantly reducing computational overhead and improving deployment efficiency under the condition of controllable precision loss. Qin et al. [21] proposed MobileNetV4, which achieved a better balance between computational efficiency and accuracy on the mobile hardware platform through the Universal Inverted Bottleneck (UIB) module, the optimized mobile MQA attention block, and the improved Neural Architecture Search (NAS) method. Wang et al. [22] proposed the lightweight CNN architecture RepViT, which significantly improved performance and reasoning speed on mobile devices by integrating the efficient design of a lightweight visual converter (ViT). Li et al. [23] proposed a lightweight Large Selective Kernel Network (LSKNet) backbone to better adapt to scale changes of targets in remote sensing images and complex background modeling by dynamically adjusting its large spatial receptive field.
Secondly, multiscale feature fusion technology mainly addresses the problems of large scale changes in UAV targets and the difficulty of detecting small targets. By constructing a feature pyramid structure and combining attention mechanisms, the detection model effectively captures target features at different scales and significantly enhances the perception of small-scale UAVs. The Dogfight algorithm proposed by Ashraf et al. [24] integrates spatiotemporal information and attention mechanisms to improve the detection performance of small UAVs in complex environments. TGC-YOLOv5, proposed by Zhao et al. [25], introduced the Transformer, Global Attention Mechanism (GAM), and Coordinate Attention (CA) mechanism to optimize the expression of features, which improved detection accuracy in challenging low-visibility scenarios with abnormal light and fog. Similarly, TransVisDrone, proposed by Sangam et al. [26], achieved optimal performance on multiple public datasets by introducing a spatiotemporal Transformer and the CSPDarkNet53 network, demonstrating the adaptability of this method in complex backgrounds. Tan et al. [27] proposed a weighted Bidirectional Feature Pyramid Network (BiFPN) based on two-way cross-scale connections and weighted feature fusion, which significantly reduced computational complexity and parameters while improving accuracy. To further improve feature representation, Ni et al. [28] designed a context-guided spatial feature reconstruction network (CGRSeg) composed of pyramid context extraction, spatial feature reconstruction, and a lightweight head. Chen et al. [29] designed a high-level filtering feature fusion pyramid (HSFPN), which filters low-level feature information through a channel attention module and then fuses the filtered low-level features with high-level features, thus enhancing the feature expression ability of the model.
In addition to network structure optimization, it is also important to improve the efficiency of convolution operations. Yang et al. [30] designed Multiscale LightWeight Convolution (MLWConv), which can operate on features of different sizes at the same time, thus enhancing the acquisition of multiscale target information and reducing computational complexity. Zhang et al. [31] proposed Linear Deformable Convolution (LDConv). By designing a new coordinate generation algorithm, learning offsets, and irregular convolution operations, the sampling effect is improved while reducing the number of parameters. Lu et al. [32] proposed Robust Feature Downsampling (RFD). The module addresses the problem of information loss in the downsampling process by fusing multiple feature maps extracted by different downsampling techniques. An overview of mainstream UAV detection algorithms is shown in Table 1.

3. Methods

Figure 2 is the schematic diagram of FUR-DETR. Firstly, we propose a multi-backbone feature extraction network (Parallel Backbone), which uses a double-branch structure to extract local features and global semantic information, as well as a dynamic alignment fusion mechanism to achieve multiscale feature adaptive fusion to improve the balance between the lightness and accuracy of UAV target detection. Secondly, an Adaptive Multiscale Feature Pyramid Network (AMFPN) is proposed. By optimizing the feature fusion and the upsampling module, the network adopts a hierarchical information transmission and two-way fusion strategy, which effectively combines high-level semantic information with low-level spatial details. In addition, AMFPN dynamically adjusts the weight of each scale feature through an adaptive fusion mechanism to optimize the feature expression ability and reduce computational complexity. Finally, the Wavelet Transform-based Frequency-Optimized Reverse Fusion Mechanism (WT-FORM) is proposed. The shallow features are decomposed into low-frequency components representing global semantics and high-frequency components representing local details by wavelet transform, and the frequency–domain-specific information is weighted by learnable parameters. This not only expands the receptive field but also retains more image details and optimizes the performance of the model in the processes of feature extraction and subsampling.

3.1. Baseline Network

As the first real-time end-to-end target detection algorithm, RT-DETR focuses on improving reasoning speed and training efficiency while maintaining detection accuracy as much as possible. By optimizing the backbone network, object query design and multiscale feature fusion, and other key modules, it significantly reduces the demand for computing resources and provides a more efficient solution for target detection tasks in practical applications. As shown in Figure 3, the network structure of RT-DETR includes Backbone, Efficient Hybrid Encoder, Decoder, and other modules. The Efficient Hybrid Encoder is composed of an attention-based intra-scale feature interaction (AIFI) module and a CNN-based cross-scale feature fusion module (CCFF) [10]. The AIFI module only works on high-level features rich in semantics to capture the global relationship between conceptual entities in the image. This design effectively reduces computational complexity and improves computational speed, which is key to achieving real-time detection. At the same time, the CCFF module focuses on the cross-scale fusion of features, which enhances the detection ability of the model for different-size targets. The model fully combines the advantages of the self-attention mechanism and the Convolutional Neural Network (CNN), which not only gives it the ability of the Transformer to capture long-distance dependencies but also retains the advantage of CNN in dealing with local features. By improving the multiscale feature processing ability of DETR, RT-DETR improves detection performance while maintaining efficient calculation.

3.2. Improvement of Feature Extraction Network

To solve the problems of insufficient feature extraction and computational redundancy in target detection in complex scenes, a feature extraction network named Parallel Backbone is proposed. By constructing a dual-backbone architecture and optimizing the number of channels, the network can significantly reduce computational complexity while maintaining high accuracy. Specifically, the Parallel Backbone captures features at different scales through parallel feature extraction backbones and integrates these features level by level through the Progressive Fusion (PF) module to improve feature extraction ability while avoiding excessive computational overhead.
As shown in Figure 4, the architectural innovation of the Parallel Backbone is to build a parallel heterogeneous dual-path feature extraction system and realize multiscale information complementarity through differentiated feature processing strategies. Compared with a single-backbone network, this design can extract more comprehensive and robust feature information without increasing computing resources. In this design, first, the shared initialization module (SIM) processes the original input in a unified way, avoiding repeated calculation of the low-level feature map and ensuring the spatial consistency of the dual-path input. Secondly, the Parallel Block is constructed to handle the heterogeneous feature information of the fine-grained path and the semantic enhancement path. The fine-grained path draws on the idea of CSPNet [33] and processes the input features through a gradient path separation strategy to retain the original information while extracting high-order features. This strategy not only reduces computational redundancy but also enhances sensitivity to local texture, edge, and shape patterns. The semantic enhancement path builds a dense feature stream based on parallel convolution, uses a cascade structure to transfer the preceding features layer by layer, and, combined with depthwise separable convolution, effectively captures global context information while optimizing the receptive field, providing good modeling ability for long-distance dependencies.
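To make the dual-path structure concrete, the following is a minimal PyTorch sketch of how such a Parallel Block could be organized. The module names, channel sizes, and layer counts are illustrative assumptions for exposition, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class FineGrainedPath(nn.Module):
    """CSP-style gradient path separation: half the channels bypass,
    half pass through a conv stack, then the halves are re-merged."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.conv = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, bias=False),
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        self.merge = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        a, b = x.chunk(2, dim=1)                 # split the gradient paths
        return self.merge(torch.cat([a, self.conv(b)], dim=1))

class SemanticPath(nn.Module):
    """Cascaded depthwise-separable convolutions to enlarge the receptive field."""
    def __init__(self, channels, depth=2):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),  # depthwise
                nn.Conv2d(channels, channels, 1, bias=False),                              # pointwise
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)

class ParallelBlock(nn.Module):
    """A shared input feeds both branches; their outputs are fused later by the PF module."""
    def __init__(self, channels):
        super().__init__()
        self.fine = FineGrainedPath(channels)
        self.semantic = SemanticPath(channels)

    def forward(self, x):
        return self.fine(x), self.semantic(x)

feat = torch.randn(1, 64, 80, 80)
f1, f2 = ParallelBlock(64)(feat)
print(f1.shape, f2.shape)  # both torch.Size([1, 64, 80, 80])
```

The two branch outputs are intentionally kept separate here; their adaptive fusion is handled by the PF module described next.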
The theoretical advantage of this parallel two-branch design is that it can solve a basic trade-off problem in feature extraction, that is, it can obtain a more comprehensive semantic context while maintaining high-resolution spatial details. The fine-grained path obtains the detailed spatial information that is crucial for precise positioning and small-target detection, while the semantic enhancement path obtains the high-level abstract information required for accurate classification and context understanding. Because the UAV recovery process is carried out from far to near, represented by different scale targets in complex backgrounds, the information extracted by the dual-branch structure has a complementary effect, in theory. At the same time, from the perspective of information theory, the parallel double-branch structure can optimize the transmission of information flow, follow the principle of entropy maximization, make information extraction and representation more efficient, and alleviate the common gradient disappearance problem in deep single-path networks.
To efficiently integrate the complementary features of the two branches in the Parallel Block, we designed the PF module to align and weight the feature maps from the two backbone networks. Specifically, the module first aligns the channel numbers of the two input branches through 1 × 1 convolution layers to ensure that they have the same channel dimension and then concatenates the features along the channel dimension. The dynamic weights are then generated by a 3 × 3 convolution and the sigmoid activation function. This design allows the module to adaptively determine the importance of each branch based on the feature content rather than using fixed weights, as shown in Formulas (1) and (2), where F1 and F2 are the input features, ψi(·) represents the 1 × 1 convolution mapping function, [·,·] represents concatenation along the channel dimension, ϕ represents the 3 × 3 convolution function, σ is the sigmoid activation function, Ω represents the channel-dimension bisection operation, and Wi represents the weight.
$$[F_1, F_2] = [\psi_1(F_1), \psi_2(F_2)] \tag{1}$$
$$\{W_1, W_2\} = \Omega\big(\sigma\big(\phi([F_1, F_2])\big)\big) \tag{2}$$
The dynamic fusion mechanism adopts an operation similar to self-attention and automatically weights the feature contribution according to the information content of the feature to achieve adaptive multiscale feature fusion. The specific operation process is as follows: first, the content-dependent attention map is learned, relevant features are highlighted, and redundancy or noise features are suppressed. Secondly, automatic weighting captures the most discriminative branches at different scales to adapt to scale changes. Finally, the channel weight is optimized to give priority to the characteristic channel with the most information in each branch. Due to the obvious scale differences of UAV images at different distances and heights, this mechanism helps improve the detection effect of the UAV recovery process. For example, at a high altitude, the semantic enhancement path may provide more valuable context. At a low altitude, fine-grained paths may contribute more to accurate detection.
At the same time, the adaptive weighting mechanism further enhances the weighting effect by controlling the learnable global balance parameters α1 and α2. These parameters are optimized iteratively during the training process to provide more detailed regulation for dynamic fusion, prevent any branches from dominating the fusion process, and ensure that complementary information from two paths is retained in the network.
Finally, the dynamically computed weights are applied to their respective feature branches, followed by fusion with a learnable global balancing parameter. The operation is shown in Formula (3), where ⊙ represents the Hadamard product and αi is the constrained learnable parameter that meets ||αi|| ≤ 1. φ represents a 1 × 1 convolution function, θ is a trainable parameter initialized to 0.5, η is the learning rate, and ∇θiL is the gradient of the loss function to the parameter.
$$Y = \varphi\big((F_1 \odot W_1)\,\alpha_1 + (F_2 \odot W_2)\,\alpha_2\big) \tag{3}$$
$$\alpha_i = \underbrace{\mathrm{Proj}_{[-1,1]}}_{\text{Forward Constraint}}\big(\underbrace{\theta_i - \eta \nabla_{\theta_i} L}_{\text{Gradient-Based Update}}\big), \quad i \in \{1, 2\}, \qquad \mathrm{Proj}_{[-1,1]}(x) = \mathrm{sign}(x)\,\min(|x|, 1)$$
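The following is a minimal PyTorch sketch of the fusion logic described by Formulas (1)-(3). Channel sizes are chosen for illustration, and the projection onto [-1, 1] is approximated here by clamping the learnable balance parameters in the forward pass, which is an implementation assumption rather than the authors' exact training rule.

```python
import torch
import torch.nn as nn

class ProgressiveFusion(nn.Module):
    """Sketch of the PF module: align channels (Eq. 1), generate dynamic weights
    (Eq. 2), and fuse the two branches with clamped balance parameters (Eq. 3)."""
    def __init__(self, c1, c2, c_out):
        super().__init__()
        self.psi1 = nn.Conv2d(c1, c_out, 1, bias=False)            # channel alignment
        self.psi2 = nn.Conv2d(c2, c_out, 1, bias=False)
        self.phi = nn.Conv2d(2 * c_out, 2 * c_out, 3, padding=1)   # dynamic weight generator
        self.proj = nn.Conv2d(c_out, c_out, 1)                     # final 1x1 projection
        self.theta = nn.Parameter(torch.full((2,), 0.5))           # global balance params, init 0.5

    def forward(self, f1, f2):
        f1, f2 = self.psi1(f1), self.psi2(f2)
        w = torch.sigmoid(self.phi(torch.cat([f1, f2], dim=1)))
        w1, w2 = w.chunk(2, dim=1)                                 # Omega: channel-wise bisection
        a1, a2 = self.theta.clamp(-1.0, 1.0)                       # approximates Proj_[-1,1]
        return self.proj(f1 * w1 * a1 + f2 * w2 * a2)

pf = ProgressiveFusion(64, 64, 64)
y = pf(torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```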
In summary, based on parallel feature extraction and progressive fusion with complementary functions, the Parallel Backbone effectively overcomes the limitations of single-path networks in feature diversity and receptive field coverage. This improvement extracts multi-level features through parallel paths and enhances the perception ability of the model to multiscale targets. In the UAV target detection task, the traditional single-path network usually cannot efficiently capture fine-grained spatial information and high-level semantic information at the same time. Through parallel feature extraction and progressive fusion, the Parallel Backbone not only expands the receptive field and retains key information but also significantly reduces the number of parameters and amount of computation while maintaining the same detection performance.

3.3. Improvement of Cross-Scale Feature Fusion Network

Although the traditional feature pyramid (FPN) and improved methods have made some progress in multiscale feature fusion, there are still obvious deficiencies in the detection field of UAV recovery [34]. The main problems are as follows: on the one hand, due to the target scale change of UAV in the return phase, the traditional FPN is prone to causing the loss of fine-grained feature information in the process of upsampling or downsampling, which affects the detection accuracy. On the other hand, it is difficult for the single-path or fixed-fusion strategy to take into account both high-level semantics and low-level spatial details, which often leads to feature redundancy or information conflict. At the same time, the complex feature fusion process also makes it difficult for computational efficiency to meet the requirements of real-time detection. For this reason, a feature pyramid structure named AMFPN is proposed, which aims to further reduce the computational complexity of the model without reducing the detection accuracy by improving the feature fusion and upsampling module.
AMFPN improves the problems of feature information loss, insufficient scale matching, and the single fusion strategy in the neck part of target detection. The idea of the overall architecture is, firstly, to map the feature maps of different scales from the backbone network to the same number of channels through convolution to facilitate subsequent fusion. Then, through the hierarchical information transmission and two-way fusion strategy, the organic combination of high-level semantics and low-level detail information is realized to obtain richer multiscale feature information. This design not only improves the utilization of fine-grained features but also reduces the complexity of the model and improves detection performance.
As shown in Figure 5, the DWConvUpsample (DWUP) module is used to upsample the feature map at the current stage so that its scale matches the other feature maps. Specifically, an upsampling operation is first applied to the input feature map to increase its spatial resolution. Subsequently, the upsampled feature map is processed by DepthWise Convolution (DWConv), which not only captures local spatial relationships efficiently but also significantly reduces computational complexity. Next, the module applies Batch Normalization (BN) and the ReLU activation function to normalize and nonlinearly transform the convolution output, enhancing the expressive ability of the features. Finally, the number of channels is adjusted by a 1 × 1 convolution so that the output matches the number of channels required by the next layer.
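A minimal sketch of the DWUP pipeline (upsample, depthwise convolution, BN and ReLU, 1 × 1 channel adjustment) is shown below; the kernel size and upsampling mode are assumptions.

```python
import torch
import torch.nn as nn

class DWConvUpsample(nn.Module):
    """Sketch of DWUP: upsample -> depthwise conv -> BN + ReLU -> 1x1 channel adjust."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode='nearest')
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)  # depthwise
        self.bn = nn.BatchNorm2d(in_ch)
        self.act = nn.ReLU(inplace=True)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # match the next layer's channels

    def forward(self, x):
        return self.pw(self.act(self.bn(self.dw(self.up(x)))))

x = torch.randn(1, 256, 20, 20)
print(DWConvUpsample(256, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```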
The Multiscale Feature Extraction (MSFE) module combines the idea of CSPNet [33] with heterogeneous kernel convolution and can effectively capture multiscale features by using depthwise convolutions with different kernel sizes. Firstly, the number of channels of the input feature map is expanded by a 1 × 1 convolution to improve its expressive ability and provide more feature dimensions. Subsequently, the feature map is processed by depthwise convolution kernels of different sizes, each of which focuses on extracting local feature information at a different scale. To fully integrate the features of each scale, a channel shuffle operation rearranges and mixes the information between channels to further improve the feature representation. Finally, the mixed feature map is transformed into the final output feature map by a 1 × 1 convolution.
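The following sketch illustrates one plausible reading of the MSFE workflow: a 1 × 1 expansion, parallel depthwise convolutions with different kernel sizes applied to channel groups, a channel shuffle, and a 1 × 1 compression. The specific kernel sizes, the expansion ratio, and the group-wise split are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    """Rearrange channels so information mixes across the per-kernel groups."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class MSFE(nn.Module):
    """Sketch of MSFE: 1x1 expand -> parallel depthwise convs with different
    kernel sizes -> channel shuffle -> 1x1 compress."""
    def __init__(self, in_ch, out_ch, kernels=(3, 5, 7), expand=2):
        super().__init__()
        mid = in_ch * expand
        assert mid % len(kernels) == 0
        self.split = mid // len(kernels)
        self.expand = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.branches = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2,
                      groups=self.split, bias=False)
            for k in kernels
        )
        self.compress = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.groups = len(kernels)

    def forward(self, x):
        x = self.expand(x)
        parts = torch.split(x, self.split, dim=1)
        x = torch.cat([branch(p) for branch, p in zip(self.branches, parts)], dim=1)
        x = channel_shuffle(x, self.groups)
        return self.compress(x)

print(MSFE(96, 96)(torch.randn(1, 96, 40, 40)).shape)  # torch.Size([1, 96, 40, 40])
```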
The Fusion module accepts multi-path feature inputs, assigns a learnable weight to each feature, and dynamically adjusts the degree of fusion according to the input content. The weights wi are first processed by the ReLU activation function and then fast-normalized so that they sum to 1; the weighted sum of the inputs then yields the fused feature. This design significantly reduces the number of parameters and the computational overhead while maintaining the adaptive ability of feature fusion. The formula of the fusion process is as follows:
$$Y = \sum_i \frac{\mathrm{ReLU}(w_i)}{\sum_j \mathrm{ReLU}(w_j) + \epsilon}\, X_i$$
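A direct implementation sketch of the fast-normalized weighted fusion in the formula above, assuming one learnable scalar per input path:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Sketch of the Fusion module: one learnable scalar per input, ReLU plus
    fast normalization so the weights sum to 1, then a weighted sum."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)          # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = WeightedFusion(3)
feats = [torch.randn(1, 128, 40, 40) for _ in range(3)]
print(fuse(feats).shape)  # torch.Size([1, 128, 40, 40])
```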
In general, AMFPN solves the problems of information loss and scale mismatch in target detection by optimizing the structure of the traditional feature pyramid, demonstrating its advantages in multiscale target detection, especially for target detection tasks in complex environments such as UAVs.

3.4. Wavelet-Based Frequency-Optimized Reverse Fusion Mechanism

Traditional downsampling operations, such as pooling and ordinary strided convolution, reduce the resolution of the feature map by increasing the stride. These methods involve an inherent spectral bias: although they effectively capture low-frequency semantic information, they disproportionately lose high-frequency details such as edges and textures, which are important for small-target recognition. When the proportion of target pixels is less than 0.5%, continuous downsampling may irreversibly erase key features, leading to catastrophic information loss. Although existing methods try to mitigate this loss through dilated convolution or multiscale feature fusion, they are limited to spatial-domain optimization and fail to make full use of prior knowledge in the frequency domain. In response to this challenge, the Wavelet Transform-based Frequency-Optimized Reverse Fusion Mechanism (WT-FORM) is proposed. Its core innovation is to use the wavelet transform (WT) to decompose shallow features into low-frequency (global semantic) and high-frequency (local detail) components and to use learnable parameters to adaptively reweight frequency-domain-specific information, laying the foundation for subsequent cross-level reverse fusion.
The idea of the architecture is to use the multi-resolution analysis ability of wavelet transform [35], which reduces the spatial dimension of the feature map while retaining more high-frequency details, and to effectively improve the perception ability of the network to the target. Firstly, the input image is decomposed into low-frequency and high-frequency components by wavelet transform. This process converts the image into multiple frequency bands, each with a different spatial resolution. A small convolution kernel, such as 3 × 3 or 5 × 5, is applied to each frequency component to capture the characteristics of each frequency range. The low-frequency component (LL) is processed by multi-level recursive wavelet transform to effectively expand the receptive field, while the high-frequency component (LH, HL, HH) provides the details of the image. In the feature fusion stage, the inverse wavelet transform is used to merge the multi-level wavelet decomposition results back to the original space to restore the complete image features. This process preserves the multi-band features of each layer of information and further enhances the feature extraction ability of the model. The schematic diagram is shown in Figure 6.
In this architecture, the output of each layer of convolution comes from the synthesis of low-frequency and high-frequency components, rather than relying on the traditional convolution operation. Its core workflow includes three main steps: wavelet transform, convolution, and inverse wavelet transform.
In the wavelet transform stage, the input image X is decomposed into four frequency bands using the wavelet transform, where LL is the low-frequency band and LH, HL, and HH are the high-frequency bands. For the two-dimensional Haar wavelet transform, the following four filters are applied as depthwise convolutions to obtain the four outputs.
$$[X_{LL}, X_{LH}, X_{HL}, X_{HH}] = \mathrm{Conv}\big([f_{LL}, f_{LH}, f_{HL}, f_{HH}], X\big)$$
$$f_{LL} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad f_{LH} = \frac{1}{2}\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}, \quad f_{HL} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}, \quad f_{HH} = \frac{1}{2}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$$
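For reference, a single level of the Haar decomposition above can be realized as a stride-2 depthwise convolution with four fixed 2 × 2 filters. The sketch below assumes the standard Haar filter signs and is an illustration rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def haar_filters(channels):
    """Build the four 2x2 Haar filters (LL, LH, HL, HH) for every channel."""
    ll = torch.tensor([[1., 1.], [1., 1.]]) / 2
    lh = torch.tensor([[1., -1.], [1., -1.]]) / 2
    hl = torch.tensor([[1., 1.], [-1., -1.]]) / 2
    hh = torch.tensor([[1., -1.], [-1., 1.]]) / 2
    f = torch.stack([ll, lh, hl, hh])                  # (4, 2, 2)
    return f.repeat(channels, 1, 1).unsqueeze(1)       # (4*C, 1, 2, 2) for depthwise conv

def haar_wt(x):
    """One level of 2D Haar decomposition via a stride-2 depthwise convolution."""
    c = x.shape[1]
    out = F.conv2d(x, haar_filters(c).to(x), stride=2, groups=c)
    out = out.view(x.shape[0], c, 4, x.shape[2] // 2, x.shape[3] // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]  # LL, LH, HL, HH

x = torch.randn(1, 16, 64, 64)
ll, lh, hl, hh = haar_wt(x)
print(ll.shape)  # torch.Size([1, 16, 32, 32])
```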
In the convolution operation stage, small convolution kernels (such as 3 × 3 and 5 × 5) are applied to each frequency component (such as low-frequency LL and high-frequency LH, HL, HH) for feature extraction.
$$X_{LL}^{(i)},\, X_{LH}^{(i)},\, X_{HL}^{(i)},\, X_{HH}^{(i)} = \mathrm{WT}\big(X_{LL}^{(i-1)}\big)$$
In the inverse wavelet transform stage, the output feature maps of the different frequencies are merged back into the original space by the inverse wavelet transform, restoring the global feature information of the image. Here, W represents the convolution kernel, X is the input component, and Y is the resulting output feature map.
$$Z = \sum_{i=0}^{N} Z^{(i)} = \sum_{i=0}^{N} \mathrm{IWT}\big(X_{LL}^{(i)} + Z_{H}^{(i-1)},\, Y_{LH}^{(i)},\, Y_{HL}^{(i)},\, Y_{HH}^{(i)}\big), \qquad Y = \mathrm{IWT}\big(\mathrm{Conv}(W, \mathrm{WT}(X))\big)$$
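Building on the previous sketch (it reuses the haar_filters and haar_wt helpers defined there), the block below illustrates one decomposition level of the WT-conv-IWT cycle: each band is filtered by a small depthwise convolution, reweighted by a learnable per-band gain, and merged back by the inverse transform. The cascaded multi-level aggregation and the cross-level reverse fusion of the formula above are omitted; this is a simplified illustration, not the exact WT-FORM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# haar_filters and haar_wt are defined in the previous sketch

def haar_iwt(ll, lh, hl, hh):
    """Inverse of the Haar decomposition: with the 1/2-normalized filters the
    transform is orthonormal, so a stride-2 transposed convolution with the
    same filters reconstructs the original resolution."""
    b, c, h, w = ll.shape
    bands = torch.stack([ll, lh, hl, hh], dim=2).view(b, 4 * c, h, w)
    return F.conv_transpose2d(bands, haar_filters(c).to(ll), stride=2, groups=c)

class WTConvBlock(nn.Module):
    """Sketch of one WT-FORM style step: decompose, filter each band with a
    small depthwise conv, reweight each band, then merge with the inverse WT."""
    def __init__(self, channels, kernel=3):
        super().__init__()
        self.band_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel, padding=kernel // 2,
                      groups=channels, bias=False) for _ in range(4))
        self.band_gain = nn.Parameter(torch.ones(4))  # learnable per-band reweighting

    def forward(self, x):
        bands = haar_wt(x)                             # LL, LH, HL, HH
        bands = [g * conv(b) for g, conv, b in zip(self.band_gain, self.band_convs, bands)]
        return haar_iwt(*bands)

x = torch.randn(1, 16, 64, 64)
print(WTConvBlock(16)(x).shape)  # torch.Size([1, 16, 64, 64])
```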
In general, the WT-FORM module decomposes features into the frequency domain through wavelet transform, breaking through the limitation of traditional multiscale fusion methods (such as BiFPN and HSFPN), which are optimized only in the spatial domain. Through the fine separation of high- and low-frequency information, WT-FORM not only expands the receptive field but also effectively retains details that are easy to lose in the process of downsampling. At the same time, the introduced learnable parameters can adaptively reweight specific information in the frequency domain so that the model can dynamically adjust the importance of low-frequency (global semantics) and high-frequency (local detail) features according to the image content and improve small-target detection ability. In addition, WT-FORM complements the parallel backbone network and adaptive multiscale feature pyramid network (AMFPN) and constructs a coarse-to-fine feature processing process. The synergy of the three effectively improves the target detection performance of the model in complex backgrounds.

4. Results

4.1. Dataset Introduction

To evaluate the effectiveness and applicability of our method, three public UAV datasets, Det-Fly [36], DUT-Anti-UAV [37], and TIB [38], were used for verification. Det-Fly is an air-to-air UAV detection dataset that contains 13,271 images, covering four regions (city, mountain, sky, and ground) and three viewing angles (level, downward, and upward). The dataset simulates real UAV detection scenarios by varying the lighting conditions, dynamic blur, and target occlusion. The resolution of each image is 3840 × 2160, the relative distance of the target UAV varies from 10 m to 100 m, and the flight altitude varies from 20 m to 110 m, providing sufficient detail for UAV detection. To meet the experimental requirements, the dataset is randomly divided into three parts in a ratio of 7:2:1: 9289 images are used as the training set, 1327 as the validation set, and 2655 as the test set. DUT-Anti-UAV consists of two parts, detection and tracking. The detection dataset is composed of a training set (5200 images), a test set (2200 images), and a validation set (2600 images). It covers 35 different types of UAVs, and multiple resolutions are included to train the multiscale adaptability of the model. TIB is a ground-to-air dataset in which UAV images are captured by a fixed camera on the ground, with the UAVs about 500 m away from the camera. The dataset contains 2860 images of various types of UAVs, such as fixed-wing and multi-rotor aircraft. The resolution of each image is 1920 × 1080 pixels, and the scenes involve a variety of lighting conditions. Analysis shows that the TIB dataset is composed entirely of small-target images, while Det-Fly and DUT-Anti-UAV contain a large proportion of small-target images [39]. Example images from the three datasets are shown in Figure 7.
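For reproducibility, the 7:2:1 split described above can be produced with a simple random shuffle; the directory layout and file extension in the sketch below are hypothetical.

```python
import random
from pathlib import Path

# Hypothetical layout: all Det-Fly images in one folder; split 70% / 10% / 20%
# into train / val / test, matching the 9289 / 1327 / 2655 counts reported above.
random.seed(0)
images = sorted(Path("det_fly/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.7 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, subset in splits.items():
    print(name, len(subset))
```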

4.2. Evaluation Metrics

During training and evaluation, the COCO standard is used to comprehensively evaluate the detection performance of the model in different scenarios, including average precision (AP) and its values under different IoU thresholds and target scales. Precision (P) is the proportion of detected positive samples that are true positives. Recall (R) is the proportion of all positive samples in the dataset that are correctly detected. The relevant equations are given below, where TP represents the number of correctly identified positive samples, FP represents the number of negative samples incorrectly identified as positive, and FN represents the number of positive samples that are missed. AP represents the comprehensive performance of the model for a single target category under different recall thresholds; it is calculated as the area under the precision–recall curve and reflects the overall detection stability for that category.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\, dR$$
AP can be computed at different IoU thresholds, where AP50 represents the AP calculated when IoU ≥ 0.5, and AP50-95 represents the mean AP (mAP) computed over IoU thresholds ranging from 0.5 to 0.95. N represents the number of layers in the network; C denotes the number of categories; and S, M, and L refer to the numbers of small, medium, and large targets, respectively. APs, APm, and APl represent the average precision for small, medium, and large targets.
$$AP_{50} = \frac{1}{C} \sum_{c=1}^{C} AP_c(\mathrm{IoU} \geq 0.5)$$
$$AP_{50\text{-}95} = \frac{1}{10} \sum_{i=1}^{10} AP\big(\mathrm{IoU} = 0.5 + 0.05 \times (i-1)\big)$$
$$AP_s = \frac{1}{S} \sum_{s=1}^{S} AP_s(\text{small objects})$$
$$AP_m = \frac{1}{M} \sum_{m=1}^{M} AP_m(\text{medium objects})$$
$$AP_l = \frac{1}{L} \sum_{l=1}^{L} AP_l(\text{large objects})$$
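The metrics above can be summarized with a small numeric sketch: precision and recall from TP/FP/FN counts, AP as the area under the precision-recall curve, and AP50-95 as the mean of AP over the ten IoU thresholds. The interpolation scheme and the toy numbers below are illustrative assumptions, not the exact COCO evaluation code.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # enforce a non-increasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def ap50_95(ap_per_iou):
    """Mean AP over IoU thresholds 0.50, 0.55, ..., 0.95 (ten values expected)."""
    assert len(ap_per_iou) == 10
    return float(np.mean(ap_per_iou))

# Toy example at a single IoU threshold.
p, r = precision_recall(tp=80, fp=10, fn=20)                    # ~0.889, 0.8
recall_pts = np.array([0.2, 0.4, 0.6, 0.8])
precision_pts = np.array([1.0, 0.9, 0.75, 0.6])
print(round(average_precision(recall_pts, precision_pts), 3))   # 0.65
```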
In addition, parameters (Param) and Giga Floating Point Operations Per Second (GFLOPS) are the key indicators used to evaluate the scale and computational complexity of the model. Param represents the total number of trainable parameters in the model, and GFLOPS represents the calculation cost.

4.3. Experiment Settings

The experiments were carried out on Ubuntu 20.04 LTS with an Intel i9-14900K CPU, a GeForce RTX 4090D GPU, 64 GB of memory, and the PyTorch 2.0.1 deep learning framework. To ensure the fairness and comparability of the results, in all ablation and comparison experiments, each model was trained from scratch without any pre-trained weights. The specific parameters are shown in Table 2.

4.4. Ablation Experiment

To further verify the effectiveness and contribution of each module in the FUR-DETR model, ablation experiments were conducted on the Det-Fly dataset. Ablation experiments evaluate the role of each part of the network architecture by removing or replacing the components in the FUR-DETR model one by one. The results are shown in Table 3, where √ indicates that the module is included in the ablation experiment for that group.
First, each single module is verified. The Parallel Backbone (2-A) module uses a lightweight parallel backbone structure to optimize the feature extraction process. Its AP50-95 is 0.658, slightly lower than that of 1-Base (0.667), but the computational complexity is significantly reduced. Its AP50 is 0.965, close to that of 1-Base (0.968), indicating that the module maintains good feature expression ability while reducing the amount of calculation and the number of parameters. AMFPN (3-B) optimizes upsampling and feature fusion and effectively improves the ability to obtain multiscale information. When AMFPN is used alone, AP50-95 is 0.668, slightly higher than that of 1-Base, which demonstrates that the module reduces computational complexity and parameters without reducing accuracy, achieving a balance between computational efficiency and detection accuracy. AMFPN shows improvements in APs (0.439) and APm (0.668), indicating better overall detection and improved performance on small and medium-sized targets. The WT-FORM (4-C) module reduces the spatial dimension of the feature map while retaining more high-frequency details by exploiting the multi-resolution analysis ability of the wavelet transform, allowing the model to retain details while further reducing computational complexity. Its AP50-95 is 0.670, slightly higher than that of 1-Base, which further proves the effectiveness of this module in improving the network's perception of the target. WT-FORM improves APs (0.447) and APl (0.746), while APm (0.668) remains relatively stable, showing balanced performance across target scales.
When multiple modules are combined, the performance of the model is further improved. Specifically, the combination of Parallel Backbone and AMFPN (5-A+B) brings AP50-95 to 0.662, greatly reducing computational complexity and parameters while maintaining high detection accuracy. The AP50-95 of the combination of Parallel Backbone and WT-FORM (6-A+C) is 0.660, which is slightly lower than that of 1-Base, but the number of parameters and calculations are reduced by 35.2% and 44.1%, respectively. When AMFPN and WT-FORM are combined (7-B+C), the accuracy of multiscale targets is improved, while AP50-95 is not reduced. This further proves the synergy between modules. Multiple module combinations show a balanced improvement in the overall AP and multiscale goals.
Finally, the full FUR-DETR model (8-Ours) surpasses any single module or combination above in overall performance, indicating that the combined use of the Parallel Backbone, AMFPN, and WT-FORM modules significantly improves the model. Specifically, FUR-DETR has about 11.3 M parameters, about 43.2% fewer than 1-Base (19.9 M), and its computational complexity is 23.9 GFLOPS, 58.0% less than that of 1-Base (56.9 GFLOPS). This shows that, despite this simplification, FUR-DETR still maintains close or equivalent detection accuracy.
To sum up, the complete model of FUR-DETR (8-Ours) is superior to other variants in all ablation experiments, which verifies the key role of the combined use of Parallel Backbone, AMFPN, and WT-FORM in improving the performance of the model. The experimental results show that the model efficiency in target detection tasks can be significantly improved by reasonable module design and combination, and the computational complexity and parameters can be significantly reduced while ensuring high accuracy. These design concepts and architectural advantages enable FUR-DETR to show excellent performance in target detection.

4.5. Performance Comparison Experiments with the Deep Learning Model

4.5.1. Comparisons with Different Backbone Networks

In the target detection task, the backbone is the core component of feature extraction, and its design directly affects detection accuracy and computational efficiency. For the challenge of UAV target detection, it is particularly critical to select the appropriate backbone. In this paper, a variety of excellent lightweight backbones (MobileNetV4 [21], RepViT [22], and LSKNet [23]) are integrated into the model as feature extraction networks and compared with the lightweight Parallel Backbone network proposed in this paper to analyze their performance in terms of computational complexity and detection accuracy, demonstrating their effectiveness in UAV target detection tasks. The experimental dataset is Det-Fly, and the experimental results are shown in Table 4.
First, from the perspective of Params and GFLOPS, the Parallel Backbone has only 14 M parameters and 34.0 GFLOPS, reducing the parameter count by 29.7% and the computational load by 40.3% compared with the Base. At the same time, compared with other lightweight backbones (such as MobileNetV4-Conv-S and RepViT-M-09), it remains in the same order of magnitude, showing good computational efficiency. In terms of detection accuracy, the overall AP (AP50-95) of the Parallel Backbone is 0.658, only 1.3% lower than that of the Base (0.667), but still better than RepViT-M-09 and LSKNet-T. This shows that it maintains high computational efficiency while ensuring detection accuracy. Regarding detection performance at different IoU thresholds, the Parallel Backbone achieves 0.965 and 0.766 on AP50 and AP75, respectively, which is stable compared with other lightweight networks and surpasses MobileNetV4-Conv-S and LSKNet-T on AP75, indicating an advantage in the regression of high-quality target boxes. For targets of different sizes, the Parallel Backbone achieves 0.423 on small-target AP (APs), only 1.2% lower than the Base (0.428) but with obvious advantages over RepViT-M-09 and LSKNet-T, indicating that the parallel structure helps enhance small-target feature extraction. On medium-target AP (APm) and large-target AP (APl), the Parallel Backbone reaches 0.654 and 0.740, respectively, which is comparable to MobileNetV4-Conv-S and better than RepViT-M-09 and LSKNet-T.
All in all, the proposed Parallel Backbone can compete with the mainstream lightweight backbone while maintaining a low computational load through the multi-path parallel feature extraction strategy, and it performs well in high-quality detection and small-target detection, which verifies its effectiveness in UAV target detection tasks.

4.5.2. Comparisons with Different FPN Networks

In the target detection task, the neck structure is used to integrate the multiscale features extracted from the backbone and further enhance target representation ability, which plays an important role in final detection performance. To verify the effectiveness of AMFPN, we selected several popular multiscale feature fusion networks, including BiFPN [27], CGRFPN [28], and HSFPN [29], and compared them with the baseline model, RT-DETR, to evaluate the advantages of AMFPN in feature fusion efficiency and detection performance. The experimental dataset was Det-Fly, and the experimental results are shown in Table 5.
From the perspective of Params and GFLOPS, the proposed AMFPN has only 17.9 M parameters and 48.6 GFLOPS, which is 9.6% fewer parameters and 14.6% less computation than the Base. It also shows a scale reduction compared with other lightweight feature fusion networks (such as CGRFPN and HSFPN), indicating its advantage in computational efficiency. In terms of detection performance, the overall AP (AP50-95) of AMFPN reaches 0.668, slightly improved compared with the Base, close to the best results of BiFPN and HSFPN, and better than CGRFPN. This shows that AMFPN can reduce computational overhead while maintaining high detection accuracy. At different IoU thresholds, the AP50 and AP75 of AMFPN are 0.967 and 0.788, respectively, consistent with the Base and close to BiFPN and HSFPN. In terms of detection of targets of different sizes, AMFPN achieves 0.439 on APs, slightly improved compared with the Base (0.428) and better than CGRFPN, but slightly lower than BiFPN and HSFPN. This shows that the adaptive feature fusion mechanism has certain advantages in small-target detection, although there is still room for further optimization. On medium-target AP (APm), AMFPN is consistent with the Base and close to BiFPN and HSFPN. In terms of large-target AP, AMFPN achieves 0.742, equal to the Base and slightly better than CGRFPN.
To sum up, AMFPN can achieve similar or better detection performance with the mainstream FPN network on the premise of ensuring computational efficiency through adaptive multiscale feature fusion.

4.5.3. Comparisons with Different Convolutional Networks

Existing methods try to reduce information loss through dilated convolution or multiscale feature fusion, but these methods mostly optimize spatial-domain information and fail to effectively use prior knowledge in the frequency domain. To verify the effectiveness of WT-FORM, this paper conducts a comparative experiment with a variety of mainstream methods (LDConv [31], SRFD [32], and ContextGuidedDown [40]). The experimental dataset is Det-Fly, and the experimental results are shown in Table 6.
In terms of parameters and computational complexity, WT-FORM has only 18.8 M parameters and 54.8 GFLOPS of calculation, less than those of the Base and other models above. Compared with LDConv and SRFD, WT-FORM has greater advantages in the amount of calculation, which shows that this method not only optimizes information extraction ability but also maintains a low computational cost. WT-FORM achieved 0.670 in the AP50-95 index, 0.3% higher than Base and 1.3% higher than SRFD. This shows that the optimization work in the frequency domain enhances overall detection performance. Under different IOU thresholds, WT-FORM achieves 0.967 and 0.791 on AP50 and AP75 respectively, which is equivalent to the optimal scheme, indicating that WT-FORM can still maintain strong detection ability under high IOU thresholds. WT-FORM achieves 0.668 on APm, which is the same as Base and ContextGuidedDown, indicating that its detection ability for medium targets is stable. WT-FORM achieves 0.746 on APl, which is at the leading level, showing that it not only optimizes small-target detection but also maintains strong large-target feature expression ability. On the small-target detection task, WT-FORM achieves 0.447, higher than Base, SRFD, and ContextGuidedDown. It is further proven that WT-FORM can improve the detection of small-target features by retaining high-frequency details through wavelet frequency–domain decomposition and the reverse cross-layer compensation mechanism.
In conclusion, WT-FORM improves the disadvantage of information loss caused by traditional downsampling methods through wavelet frequency domain optimization. At the same time, the computational efficiency is also improved, and the overall detection performance is better than many mainstream methods.

4.5.4. Comparisons of Different Detection Models

To evaluate the performance of the proposed FUR-DETR model in UAV target detection, we conduct comparative experiments with several existing mainstream models. The selected comparison models are the YOLO series [11,12,13], which are highly representative in the field of object detection, aiming to validate the effectiveness of the FUR-DETR model. The experimental dataset is Det-Fly, and the experimental results are shown in Table 7.
The experimental results show that FUR-DETR has obvious advantages in parameters and computational complexity. Specifically, the parameter count of FUR-DETR is 11.3 M, which is 56.1%, 43.5%, 31.3%, and 43.5% lower than those of YOLO v8m, YOLO v9m, YOLO v10m, and YOLO v11m, respectively. Furthermore, FUR-DETR shows a parameter count comparable to MobileNetV4+Base (11.3 M) and FasterNet+Base (10.8 M), while being significantly smaller than RTMDet (27.5 M). The GFLOPS of FUR-DETR is 23.9, which reduces computational complexity by about 69.7% compared to YOLO v8m, 68.7% compared to YOLO v9m, 62.3% compared to YOLO v10m, and 64.6% compared to YOLO v11m. Moreover, FUR-DETR demonstrates considerable computational efficiency compared to RTMDet, with a 55.9% reduction in GFLOPS, while achieving a 39.5% reduction compared to MobileNetV4+Base. When compared to FasterNet+Base, FUR-DETR has slightly higher computational requirements but offers significant performance improvements.
Although the computational complexity and the number of parameters are significantly reduced, FUR-DETR maintains high accuracy in several key performance indicators. The AP50-95 of FUR-DETR is 0.662, which is significantly better than that of the other YOLO models. Notably, FUR-DETR achieves AP50-95 performance comparable to MobileNetV4+Base and FasterNet+Base while outperforming RTMDet by 19.5%. In terms of AP50, the result is 0.966, higher than YOLO v8m, YOLO v9m, YOLO v10m, and YOLO v11m by about 9.0%, 8.8%, 8.8%, and 9.4%, respectively. FUR-DETR's AP50 of 0.966 also slightly outperforms that of MobileNetV4+Base and FasterNet+Base and shows a more substantial 10.3% improvement over RTMDet. In terms of AP75, FUR-DETR reaches 0.780, about 17.0% higher than YOLO v8m, about 15.4% higher than YOLO v9m, 18.7% higher than YOLO v10m, and 14.7% higher than YOLO v11m. In addition, in terms of APs, FUR-DETR reaches 0.446, nearly two times higher than YOLO v8m and better than YOLO v11m, showing a significant advantage in small-target detection. FUR-DETR also demonstrates superior performance in medium-sized object detection compared to other lightweight models, with a 4.4% improvement over MobileNetV4+Base, 18.3% over FasterNet+Base, and a remarkable increase over RTMDet's APm.
In conclusion, FUR-DETR shows strong advantages in terms of parameters, computational complexity, and detection accuracy. Compared with lightweight detection models such as MobileNetV4+Base, FasterNet+Base, and RTMDet, FUR-DETR maintains an excellent balance between model efficiency and detection performance. Even with substantially fewer parameters and lower computational complexity, FUR-DETR still provides detection accuracy equivalent to, or better than, that of other mainstream models. This indicates that FUR-DETR has strong application potential for UAV target detection, especially under conditions of limited computing resources.

4.5.5. Performance Comparisons on Different Datasets

In addition to Det-Fly, we also conducted independent experiments on the DUT-Anti-UAV [37] and TIB [38] datasets. On each dataset, each algorithm uses the same network architecture and training strategy as described in Section 4.5.4 to complete the training and testing process independently, in order to evaluate the performance of the proposed method on different visual data.
(1) Experimental analysis on the DUT-Anti-UAV dataset
The experimental results on the DUT-Anti-UAV dataset are shown in Table 8. FUR-DETR shows good detection performance while retaining its lightweight characteristics. In terms of parameters, FUR-DETR and MobileNetV4+Base are both 11.3 M, 43.2% less than the 19.9 M of Base and 58.9% less than the 27.5 M of RTMDet. The parameters of FUR-DETR are also reduced by 56.2% compared with YOLO v8m, 43.5% compared with YOLO v9m and YOLO v11m, and 31.5% compared with YOLO v10m, reflecting the advantages of its lightweight design. In terms of computational complexity, the 23.9 GFLOPS of FUR-DETR is 58.0% less than Base, 55.9% less than RTMDet, 39.5% less than MobileNetV4+Base, and 16.1% less than FasterNet+Base, while still providing better detection accuracy than the latter. Compared with the YOLO series, the computational complexity of FUR-DETR is 69.6% lower than that of YOLO v8m, 68.8% lower than that of YOLO v9m, 62.3% lower than that of YOLO v10m, and 64.6% lower than that of YOLO v11m, significantly reducing the computational burden.
In terms of detection accuracy, the AP50-95 of FUR-DETR reaches 0.698, which is 1.1% lower than that of Base but exceeds all other comparison models, including MobileNetV4+Base (by 4.0%), FasterNet+Base (by 1.5%), RTMDet (by 12.0%), YOLO v8m (by 5.3%), YOLO v9m (by 3.7%), YOLO v10m (by 5.8%), and YOLO v11m (by 4.6%). In terms of the AP75 index, FUR-DETR is close to Base and exceeds MobileNetV4+Base (by 3.8%), FasterNet+Base (by 1.1%), and RTMDet (by 12.7%). Compared with the YOLO series, the AP75 of FUR-DETR is 4.9% higher than that of YOLO v8m, 4.9% higher than YOLO v9m, 6.7% higher than YOLO v10m, and 4.8% higher than YOLO v11m. In terms of small-target detection (APs), FUR-DETR is 3.2% higher than MobileNetV4+Base, 1.7% higher than FasterNet+Base, 34.9% higher than RTMDet, and more than 10% higher than the YOLO series.
It is worth noting that although the parameters and computational complexity of FUR-DETR are significantly lower than those of Base, the gap in most performance indicators is small, indicating that it achieves a good balance between efficiency and performance.
(2) Experimental analysis on the TIB dataset
The experimental results on the TIB dataset are shown in Table 9. The AP50-95 of FUR-DETR reaches 0.375, higher than that of all the selected YOLO series models, MobileNetV4+Base, FasterNet+Base, and RTMDet. In terms of the AP50 index, FUR-DETR's 0.915 is second only to Base but better than all YOLO series models (by 6.5–9.8%) and RTMDet (by 28.2%). In terms of the AP75 index, FUR-DETR is slightly lower than Base but performs best among the compared lightweight models, surpassing MobileNetV4+Base (by 3.3%), FasterNet+Base (by 6.9%), and RTMDet (by 56.5%). In terms of small-target detection (APs), FUR-DETR is close to Base and superior to the other lightweight models, including MobileNetV4+Base (by 0.9%), FasterNet+Base (by 5.4%), and RTMDet (by 82.9%). This shows that the detection ability of FUR-DETR is also maintained at a good level on the TIB dataset.
To sum up, the proposed FUR-DETR model performs well across different datasets: it achieves high detection accuracy while maintaining a lightweight design, and it sustains this performance level under different data distributions and visual conditions, which verifies the stability of its detection performance.

4.5.6. Comparative Analysis with Established Benchmarks

To further confirm the effectiveness of our proposed FUR-DETR model, the experimental results obtained in this paper on the three datasets are compared with those reported in the corresponding papers. To ensure comparability, we use the same evaluation indices as the original papers of the datasets, and the results are shown in Table 10, Table 11 and Table 12.
Compared with the existing results in published papers, our FUR-DETR achieves leading performance on all three datasets. This direct comparison with established benchmarks confirms the effectiveness and reliability of our method. On the Det-Fly dataset, our method significantly exceeds the best method reported in the dataset's benchmark paper, Grid R-CNN, as well as other widely used models such as Cascade R-CNN and RetinaNet. On the DUT-Anti-UAV dataset, FUR-DETR surpasses the highest-performing model reported in the original paper, Cascade-RCNN, and is significantly superior to a variety of mainstream detectors, including the Faster-RCNN series and the YOLOX series. Similarly, on the TIB dataset, our method also outperforms the Cascade RCNN and TIB-Net models reported in the original paper. These results demonstrate the effectiveness and advancement of the proposed FUR-DETR in UAV detection tasks.

4.6. Visualization of Detections

To show the overall performance of the proposed algorithm more intuitively, it is tested in four different environments, and heat maps are used for comparison with the baseline model. Figure 8 shows the target detection results in four environments: mountains, ground, city, and air. Specifically, as shown in Figure 8a,c, the detection environments are mountains and cities, which involve terrain changes, illumination changes, and background occlusion; the model nevertheless achieves accurate detection. As shown in Figure 8b, the detection environment is the ground, where small targets and background occlusion are present, and the model again detects accurately. As shown in Figure 8d, the detection environment is the sky, with illumination changes and small targets, and the model also successfully identifies the targets. This shows that FUR-DETR has strong adaptability and stability and can maintain good detection performance under different environmental conditions.
As shown in Figure 9, this paper compares the detection results of FUR-DETR and RT-DETR in the four environments. The baseline model, RT-DETR, achieves relatively stable detection in most scenes, but in some special cases, such as occlusion, low light, and small targets, FUR-DETR shows obvious advantages. For example, FUR-DETR better handles situations in which targets are partially occluded, while RT-DETR is prone to false detections. In low-light environments, especially at dusk or when the light is weak, FUR-DETR continues to detect the target accurately, while RT-DETR is more sensitive to lighting and performs worse. At the same time, FUR-DETR shows more advantages in small-target detection than RT-DETR and can identify and localize smaller targets more effectively. In general, FUR-DETR is optimized for lightweight design while its overall performance remains equivalent to that of RT-DETR, and in some complex environments, such as poor light, partial occlusion of targets, and small targets, FUR-DETR performs better than RT-DETR.
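For reference, heat-map overlays of the kind compared in this section can be produced by upsampling a feature-level activation map and blending it over the input frame. The snippet below is a generic illustration with placeholder arrays; it is not the visualization pipeline actually used for Figures 8 and 9.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder inputs: an RGB frame and a coarse activation map from the detector
# (e.g., mean channel activation of one encoder feature level). Both are made up here.
rng = np.random.default_rng(0)
image = rng.random((640, 640, 3))   # stand-in for an input frame in [0, 1]
activation = rng.random((20, 20))   # stand-in for a 20x20 feature-level response

# Upsample the activation map to image size by block repetition and normalize it.
scale = image.shape[0] // activation.shape[0]
heat = np.kron(activation, np.ones((scale, scale)))
heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Blend: draw the image, then the heat map with transparency on top.
plt.imshow(image)
plt.imshow(heat, cmap="jet", alpha=0.4)
plt.axis("off")
plt.savefig("heatmap_overlay.png", bbox_inches="tight")
```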

4.7. Discussion

The design of FUR-DETR incorporates multiple structural innovations such as Parallel Backbone, AMFPN, and WT-FORM. The ablation experiment results fully demonstrate the effectiveness of these components, namely that the Parallel Backbone structure can improve computational efficiency while maintaining performance, AMFPN can enhance the detection performance of targets at different scales, and WT-FORM further optimizes the feature fusion effect. When these components are used in combination, they can significantly reduce model complexity while maintaining high detection accuracy. Independent experiments conducted on multiple datasets further validate the robustness of our method. Compared with lightweight models such as MobileNetV4+Base and FasterNet+Base, FUR-DETR demonstrates superior detection accuracy while maintaining comparable or better computational efficiency. The comparative analysis with the YOLO series models further confirms that the design of FUR-DETR provides a better solution for the challenges of drone detection.
From a theoretical perspective, the design of FUR-DETR aligns with core theories in computer vision. The Parallel Backbone structure follows the residual learning principle of He et al. [43] and alleviates the gradient vanishing problem in deep networks through multi-path feature extraction. This multi-path architecture is similar to the multi-level information exchange mechanism proposed by Wang et al. [44] for few-shot object detection; their research shows that parallel structures can extract and fuse complementary feature information at different abstraction levels, thereby improving detection accuracy. The AMFPN structure draws inspiration from the improved feature pyramid network proposed by Zhu et al. [45], which adaptively fuses features of different scales to better handle target scale variation; this multiscale representation can theoretically improve the model's robustness to targets of different sizes, consistent with the basic principles of multiscale representation learning. The WT-FORM module is based on classical wavelet transform theory and the work of Finder et al. [35]: it preserves high-frequency details through wavelet frequency–domain decomposition and employs a reverse cross-layer compensation mechanism to improve the detection of medium and small targets.
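To illustrate the adaptive multiscale fusion principle discussed above, the sketch below implements a simple weighted fusion of two pyramid levels with learned, normalized weights, in the spirit of BiFPN-style fusion. The module name and structure are assumptions for illustration and do not reproduce the exact AMFPN design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveScaleFusion(nn.Module):
    """Illustrative adaptive fusion of two pyramid levels: the coarser level is
    upsampled to the finer resolution, both inputs are combined with learned
    non-negative weights (normalized to sum to 1), and a 3x3 conv refines the
    result. A simplified sketch of the idea behind AMFPN, not its exact design."""

    def __init__(self, channels: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        w = F.relu(self.weights)
        w = w / (w.sum() + 1e-4)          # fast normalized fusion weights
        fused = w[0] * fine + w[1] * coarse_up
        return self.refine(fused)

# Example: fuse a stride-8 level (80x80) with a stride-16 level (40x40).
p3 = torch.randn(1, 256, 80, 80)
p4 = torch.randn(1, 256, 40, 40)
print(AdaptiveScaleFusion(256)(p3, p4).shape)  # torch.Size([1, 256, 80, 80])
```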
However, this study still has certain limitations. Research has shown that deep learning models often experience performance degradation when faced with out-of-distribution samples [46]. This means that our model may face more challenges in actual deployment environments. Although it performs well on public datasets, it is necessary to focus on improving performance under extreme lighting or severe weather conditions as a key area of future research.

5. Conclusions

In this paper, a UAV detection model based on the Transformer architecture is proposed, which aims to solve the challenges faced by general-purpose target detection models in UAV detection tasks. Although conventional models perform well in general scenarios, in UAV detection tasks their performance often fails to meet practical needs due to factors such as small target scales, complex backgrounds, and motion blur. By integrating the Parallel Backbone, AMFPN, and WT-FORM modules, our model significantly reduces computational requirements while maintaining high detection accuracy. Compared with RT-DETR, the proposed FUR-DETR reduces the parameter count by 43.2% and the computational load by 58% and provides comparable detection performance in most scenarios.
The experimental results on three independent datasets show that the model is robust under different operating conditions and maintains effective detection performance under different background and illumination conditions. These characteristics make FUR-DETR especially suitable for real-time UAV recovery systems, where computing resources may be limited but reliable detection remains essential. Although the model shows adaptability to different background and lighting conditions in testing, performance optimization for extreme lighting conditions (such as strong glare or very low light) still needs further research. Additionally, occlusion in complex environments and motion blur caused by fast-moving targets may also affect detection performance; these are key problems to solve in future work.
To address these adaptability challenges, we will explore small sample and meta-learning methods to improve the model’s adaptability to new environments with minimal training data. Furthermore, we will study targeted model optimization techniques to further reduce computational requirements without sacrificing accuracy. We will also develop hardware-oriented optimization strategies to improve the algorithm’s performance on different deployment platforms. These complementary improvements will jointly enhance the practical application of our method in UAV detection and recovery systems.

Author Contributions

Conceptualization, Y.Y. and J.W.; Data curation, Y.Y.; Formal analysis, Y.Y.; Funding acquisition, J.P.; Investigation, Y.Y.; Methodology, Y.Y.; Project administration, J.X. and H.C.; Resources, Z.H. and Z.Y.; Software, Y.Y.; Supervision, J.W.; Validation, Y.Y. and Y.H.; Visualization, Y.Y.; Writing—original draft, Y.Y.; Writing—review and editing, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Farajijalal, M.; Eslamiat, H.; Avineni, V.; Hettel, E.; Lindsay, C. Safety Systems for Emergency Landing of Civilian Unmanned Aerial Vehicles—A Comprehensive Review. Drones 2025, 9, 141. [Google Scholar] [CrossRef]
  2. Yao, Y.; Wu, J.; Chen, H.; Xu, J.; Yin, Z. Braking Models for Short-Distance Recovery of UAV. In The Proceedings of 2024 International Conference of Electrical, Electronic and Networked Energy Systems; Jia, L., Yang, F., Cheng, X., Wang, Y., Li, Z., Huang, W., Eds.; Lecture Notes in Electrical Engineering; Springer Nature: Singapore, 2025; Volume 1316, pp. 37–47. ISBN 978-981-96-2079-1. [Google Scholar]
  3. Souto, A.; Alfaia, R.; Cardoso, E.; Araújo, J.; Francês, C. UAV Path Planning Optimization Strategy: Considerations of Urban Morphology, Microclimate, and Energy Efficiency Using Q-Learning Algorithm. Drones 2023, 7, 123. [Google Scholar] [CrossRef]
  4. Hu, Z.; Chen, H.; Lyons, E.; Solak, S.; Zink, M. Towards Sustainable UAV Operations: Balancing Economic Optimization with Environmental and Social Considerations in Path Planning. Transp. Res. Part E Logist. Transp. Rev. 2024, 181, 103314. [Google Scholar] [CrossRef]
  5. Hansen, J.M.; Johansen, T.A.; Sokolova, N.; Fossen, T.I. Nonlinear Observer for Tightly Coupled Integrated Inertial Navigation Aided by RTK-GNSS Measurements. IEEE Trans. Control Syst. Technol. 2019, 27, 1084–1099. [Google Scholar] [CrossRef]
  6. Cheng, C.; Li, X.; Xie, L.; Li, L. Autonomous Dynamic Docking of UAV Based on UWB-Vision in GPS-Denied Environment. J. Franklin Inst. 2022, 359, 2788–2809. [Google Scholar] [CrossRef]
  7. Skulstad, R.; Syversen, C.; Merz, M.; Sokolova, N.; Fossen, T.; Johansen, T. Autonomous Net Recovery of Fixed-Wing UAV with Single-Frequency Carrier-Phase Differential GNSS. IEEE Aerosp. Electron. Syst. Mag. 2015, 30, 18–27. [Google Scholar] [CrossRef]
  8. Kassab, M.; Zitar, R.A.; Barbaresco, F.; Seghrouchni, A.E.F. Drone Detection with Improved Precision in Traditional Machine Learning and Less Complexity in Single Shot Detectors. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 3847–3859. [Google Scholar] [CrossRef]
  9. Muzammul, M.; Li, X. Comprehensive Review of Deep Learning-Based Tiny Object Detection: Challenges, Strategies, and Future Directions. Knowl. Inf. Syst. 2025, 67, 3825–3913. [Google Scholar] [CrossRef]
  10. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs Beat Yolos on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  11. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, version 8.0.0; Ultralytics: Frederick, MD, USA, 2023. Available online: https://ultralytics.com (accessed on 12 March 2025).
  12. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  13. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  14. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. ISBN 978-3-030-58451-1. [Google Scholar]
  15. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Object Detection with Transformers: A Review. arXiv 2023, arXiv:2306.04670. [Google Scholar]
  16. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-Detr: Accelerate Detr Training by Introducing Query Denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 13619–13627. [Google Scholar]
  17. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  18. Chen, Q.; Chen, X.; Wang, J.; Zhang, S.; Yao, K.; Feng, H.; Han, J.; Ding, E.; Zeng, G.; Wang, J. Group Detr: Fast Detr Training with Group-Wise One-to-Many Assignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6633–6642. [Google Scholar]
  19. Wang, S.; Jiang, H.; Li, Z.; Yang, J.; Ma, X.; Chen, J.; Tang, X. Phsi-Rtdetr: A Lightweight Infrared Small Target Detection Algorithm Based on UAV Aerial Photography. Drones 2024, 8, 240. [Google Scholar] [CrossRef]
  20. Titu, M.F.S.; Pavel, M.A.; Michael, G.K.O.; Babar, H.; Aman, U.; Khan, R. Real-Time Fire Detection: Integrating Lightweight Deep Learning Models on Drones with Edge Computing. Drones 2024, 8, 483. [Google Scholar] [CrossRef]
  21. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; Volume 15098, pp. 78–96. ISBN 978-3-031-73660-5. [Google Scholar]
  22. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit: Revisiting Mobile Cnn from Vit Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920. [Google Scholar]
  23. Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.-M.; Yang, J. LSKNet: A Foundation Lightweight Backbone for Remote Sensing. Int. J. Comput. Vision 2025, 133, 1410–1431. [Google Scholar] [CrossRef]
  24. Ashraf, M.W.; Sultani, W.; Shah, M. Dogfight: Detecting Drones from Drones Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7067–7076. [Google Scholar]
  25. Zhao, Y.; Ju, Z.; Sun, T.; Dong, F.; Li, J.; Yang, R.; Fu, Q.; Lian, C.; Shan, P. Tgc-Yolov5: An Enhanced Yolov5 Drone Detection Model Based on Transformer, Gam & ca Attention Mechanism. Drones 2023, 7, 446. [Google Scholar] [CrossRef]
  26. Sangam, T.; Dave, I.R.; Sultani, W.; Shah, M. Transvisdrone: Spatio-Temporal Transformer for Vision-Based Drone-to-Drone Detection in Aerial Videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–02 June 2023; IEEE: London, UK, 2023; pp. 6006–6013. [Google Scholar]
  27. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  28. Ni, Z.; Chen, X.; Zhai, Y.; Tang, Y.; Wang, Y. Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; Volume 15110, pp. 239–255. ISBN 978-3-031-72942-3. [Google Scholar]
  29. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y. Accurate Leukocyte Detection Based on Deformable-DETR and Multi-Level Feature Fusion for Aiding Diagnosis of Blood Diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef] [PubMed]
  30. Yang, R.; Shan, P.; He, Y.; Xiao, H.; Zhang, L.; Zhao, Y.; Fu, Q. A Lightweight Bionic Flapping Wing Drone Recognition Network Based on Data Enhancement. Measurement 2025, 239, 115476. [Google Scholar] [CrossRef]
  31. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear Deformable Convolution for Improving Convolutional Neural Networks. Image Vis. Comput. 2024, 149, 105190. [Google Scholar] [CrossRef]
  32. Lu, W.; Chen, S.-B.; Tang, J.; Ding, C.H.; Luo, B. A Robust Feature Downsampling Module for Remote-Sensing Visual Tasks. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  33. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  34. Cai, H.; Zhang, J.; Xu, J. ALDNet: A Lightweight and Efficient Drone Detection Network. Meas. Sci. Technol. 2025, 36, 025402. [Google Scholar] [CrossRef]
  35. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; Volume 15112, pp. 363–380. ISBN 978-3-031-72948-5. [Google Scholar]
  36. Zheng, Y.; Chen, Z.; Lv, D.; Li, Z.; Lan, Z.; Zhao, S. Air-to-Air Visual Detection of Micro-Uavs: An Experimental Evaluation of Deep Learning. IEEE Robot. Autom. Lett. 2021, 6, 1020–1027. [Google Scholar] [CrossRef]
  37. Zhao, J.; Zhang, J.; Li, D.; Wang, D. Vision-Based Anti-UAV Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 25323–25334. [Google Scholar] [CrossRef]
  38. Sun, H.; Yang, J.; Shen, J.; Liang, D.; Ning-Zhong, L.; Zhou, H. TIB-Net: Drone Detection Network with Tiny Iterative Backbone. IEEE Access 2020, 8, 130697–130707. [Google Scholar] [CrossRef]
  39. Zhou, X.; Han, B.; Li, L.; Chen, J.; Chen, B.M. DRNet: A Miniature and Resource-Efficient MAV Detector. IEEE Trans. Instrum. Meas. 2025, 74, 1–14. [Google Scholar] [CrossRef]
  40. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. Cgnet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef] [PubMed]
  41. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  42. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  44. Wang, L.; Mei, S.; Wang, Y.; Lian, J.; Han, Z.; Chen, X. Few-Shot Object Detection with Multilevel Information Interaction for Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  45. Zhu, L.; Lee, F.; Cai, J.; Yu, H.; Chen, Q. An Improved Feature Pyramid Network for Object Detection. Neurocomputing 2022, 483, 127–139. [Google Scholar] [CrossRef]
  46. Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. arXiv 2019, arXiv:1903.12261. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of UAV recycling system.
Figure 2. Schematic diagram of FUR-DETR.
Figure 3. Schematic diagram of RT-DETR.
Figure 4. Schematic diagram of parallel backbone.
Figure 5. Schematic diagram of AMFPN.
Figure 6. Schematic diagram of WT-FORM.
Figure 7. Example images from the three datasets. Each column displays representative images from the corresponding dataset. (a–c) represent the Det-Fly dataset, DUT-Anti-UAV dataset, and TIB dataset, respectively.
Figure 8. Detection results under different conditions.
Figure 9. Visual comparisons of detection results from different algorithms.
Table 1. Overview of mainstream UAV detection algorithms.
Research Direction | Representative Methods | Main Advantages
Transformer-based Detection | YOLO series [11,12,13] | Single-stage detection, balanced speed and accuracy
 | DETR [14] | End-to-end architecture, simplified process
 | DN-DETR [16] | Mitigating matching instability, faster convergence
 | DINO [17] | Enhanced detection performance and training efficiency
 | Group DETR [18] | Accelerated convergence, reduced redundancy
 | RT-DETR [10] | Reduced resource requirements, real-time detection
Architecture Optimization | PHSI-RTDETR [19] | Balance of detection accuracy and real-time performance
 | MobileNetV4 [21] | Efficiency–accuracy balance on mobile platforms
 | RepViT [22] | Enhanced performance on mobile devices
 | LSKNet [23] | Adaptation to scale changes and complex backgrounds
Multiscale Feature Fusion | Dogfight [24] | Improved small UAV detection in complex environments
 | TGC-YOLOv5 [25] | Better detection in low-visibility conditions
 | TransVisDrone [26] | Good performance across datasets, complex background adaptation
 | BiFPN [27] | Lower computational complexity and parameters
 | CGRSeg [28] | Enhanced feature representation capability
 | HSFPN [29] | Enhanced feature expression capabilities
Convolution Optimization | MLWConv [30] | Efficient multiscale information acquisition
 | LDConv [31] | Improved sampling effect with fewer parameters
 | RFD [32] | Reduced information loss during downsampling
Table 2. Hardware configuration and model parameters.
Parameter | Setup
Image Size | 640 × 640
Epochs | 500
Patience | 80
Number of GPUs | 1
GPU type | RTX 4090D
Workers | 8
Batch size | 8
Optimizer | AdamW
Initial learning rate | 0.0001
Final learning rate | 0.0001
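For readers reproducing the setup, the configuration in Table 2 maps onto a standard PyTorch optimizer setup roughly as follows; the placeholder model and the constant learning-rate schedule are assumptions inferred from the equal initial and final learning rates in the table:

```python
from torch import nn, optim

# Placeholder module standing in for the detector; only the hyperparameters
# below mirror Table 2 (AdamW, lr = 0.0001, batch size 8, 500 epochs, patience 80).
model = nn.Conv2d(3, 16, kernel_size=3, padding=1)

optimizer = optim.AdamW(model.parameters(), lr=1e-4)
# Initial and final learning rates are both 0.0001, i.e., the rate is held constant.

batch_size = 8            # per Table 2
max_epochs = 500          # per Table 2
early_stop_patience = 80  # stop if validation metrics do not improve for 80 epochs
image_size = (640, 640)   # per Table 2
```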
Table 3. Results of ablation experiment.
Methods | Parallel Backbone | AMFPN | WT-FORM | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
1-Base | – | – | – | 19.9 M | 56.9 | 0.667 | 0.968 | 0.788 | 0.428 | 0.668 | 0.742
2-A | ✓ | – | – | 14.0 M | 34.0 | 0.658 | 0.965 | 0.766 | 0.423 | 0.654 | 0.740
3-B | – | ✓ | – | 17.9 M | 48.6 | 0.668 | 0.967 | 0.788 | 0.439 | 0.668 | 0.742
4-C | – | – | ✓ | 18.8 M | 54.8 | 0.670 | 0.967 | 0.791 | 0.447 | 0.668 | 0.746
5-A+B | ✓ | ✓ | – | 12.0 M | 25.4 | 0.662 | 0.965 | 0.772 | 0.441 | 0.662 | 0.731
6-A+C | ✓ | – | ✓ | 12.9 M | 31.8 | 0.660 | 0.961 | 0.775 | 0.437 | 0.662 | 0.733
7-B+C | – | ✓ | ✓ | 17.2 M | 47.2 | 0.672 | 0.965 | 0.792 | 0.451 | 0.673 | 0.744
8-Ours | ✓ | ✓ | ✓ | 11.3 M | 23.9 | 0.662 | 0.966 | 0.780 | 0.446 | 0.663 | 0.731
Table 4. Comparison results of different backbone networks.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.667 | 0.968 | 0.788 | 0.428 | 0.668 | 0.742
RT-DETR+MobileNetV4-Conv-S [21] | 11.3 M | 39.5 | 0.655 | 0.964 | 0.763 | 0.427 | 0.655 | 0.732
RT-DETR+RepViT-M-09 [22] | 13.3 M | 36.3 | 0.643 | 0.956 | 0.747 | 0.391 | 0.639 | 0.730
RT-DETR+LSKNet-T [23] | 12.6 M | 37.5 | 0.643 | 0.963 | 0.746 | 0.400 | 0.639 | 0.731
RT-DETR+Parallel Backbone | 14.0 M | 34.0 | 0.658 | 0.965 | 0.766 | 0.423 | 0.654 | 0.740
Table 5. Comparison results of different FPN networks.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.667 | 0.968 | 0.788 | 0.428 | 0.668 | 0.742
RT-DETR+BIFPN [27] | 20.3 M | 64.3 | 0.672 | 0.969 | 0.794 | 0.467 | 0.672 | 0.740
RT-DETR+CGRFPN [28] | 19.2 M | 48.2 | 0.665 | 0.967 | 0.779 | 0.429 | 0.668 | 0.739
RT-DETR+HSFPN [29] | 18.1 M | 53.3 | 0.673 | 0.968 | 0.791 | 0.442 | 0.675 | 0.749
RT-DETR+AMFPN | 17.9 M | 48.6 | 0.668 | 0.967 | 0.788 | 0.439 | 0.668 | 0.742
Table 6. Comparison results of different convolutional networks.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.667 | 0.968 | 0.788 | 0.428 | 0.668 | 0.742
RT-DETR+LDConv [31] | 19.6 M | 57.9 | 0.673 | 0.968 | 0.789 | 0.451 | 0.672 | 0.753
RT-DETR+ContextGuidedDown [40] | 22.3 M | 61.7 | 0.667 | 0.969 | 0.791 | 0.433 | 0.668 | 0.743
RT-DETR+SRFD [32] | 19.6 M | 55.0 | 0.657 | 0.965 | 0.758 | 0.422 | 0.657 | 0.734
RT-DETR+WT-FORM | 18.8 M | 54.8 | 0.670 | 0.967 | 0.791 | 0.447 | 0.668 | 0.746
Table 7. Comparison results of different detection models on Det-Fly.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.667 | 0.968 | 0.788 | 0.428 | 0.668 | 0.742
YOLO v8m [11] | 25.8 M | 78.7 | 0.578 | 0.886 | 0.667 | 0.149 | 0.571 | 0.740
YOLO v9m [12] | 20.0 M | 76.5 | 0.583 | 0.888 | 0.676 | 0.160 | 0.574 | 0.744
YOLO v10m [13] | 16.5 M | 63.4 | 0.578 | 0.888 | 0.657 | 0.157 | 0.566 | 0.738
YOLO v11m [11] | 20.0 M | 67.6 | 0.584 | 0.883 | 0.680 | 0.159 | 0.577 | 0.744
MobileNetV4+Base [21] | 11.3 M | 39.5 | 0.655 | 0.964 | 0.763 | 0.427 | 0.655 | 0.732
FasterNet+Base [41] | 10.8 M | 28.5 | 0.635 | 0.959 | 0.740 | 0.377 | 0.632 | 0.724
RTMDet [42] | 27.5 M | 54.2 | 0.554 | 0.876 | 0.622 | 0.124 | 0.532 | 0.733
FUR-DETR (Ours) | 11.3 M | 23.9 | 0.662 | 0.966 | 0.780 | 0.446 | 0.663 | 0.731
Table 8. Comparison results of different detection models on DUT-Anti-UAV.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.706 | 0.960 | 0.801 | 0.636 | 0.742 | 0.775
YOLO v8m [11] | 25.8 M | 78.7 | 0.663 | 0.932 | 0.755 | 0.521 | 0.722 | 0.788
YOLO v9m [12] | 20.0 M | 76.5 | 0.673 | 0.931 | 0.755 | 0.527 | 0.737 | 0.796
YOLO v10m [13] | 16.5 M | 63.4 | 0.660 | 0.928 | 0.742 | 0.519 | 0.721 | 0.779
YOLO v11m [11] | 20.0 M | 67.6 | 0.667 | 0.932 | 0.756 | 0.531 | 0.728 | 0.778
MobileNetV4+Base [21] | 11.3 M | 39.5 | 0.671 | 0.953 | 0.763 | 0.595 | 0.713 | 0.745
FasterNet+Base [41] | 10.8 M | 28.5 | 0.688 | 0.959 | 0.783 | 0.604 | 0.725 | 0.772
RTMDet [42] | 27.5 M | 54.2 | 0.623 | 0.900 | 0.703 | 0.455 | 0.681 | 0.766
FUR-DETR (Ours) | 11.3 M | 23.9 | 0.698 | 0.959 | 0.792 | 0.614 | 0.733 | 0.773
Table 9. Comparison results of different detection models on TIB.
Model | Param | GFLOPS | AP50-95 | AP50 | AP75 | APs | APm | APl
RT-DETR (Base) [10] | 19.9 M | 56.9 | 0.393 | 0.926 | 0.227 | 0.349 | 0.453 | 0.560
YOLO v8m [11] | 25.8 M | 78.7 | 0.342 | 0.844 | 0.171 | 0.268 | 0.438 | 0.545
YOLO v9m [12] | 20.0 M | 76.5 | 0.346 | 0.859 | 0.183 | 0.266 | 0.449 | 0.625
YOLO v10m [13] | 16.5 M | 63.4 | 0.349 | 0.843 | 0.188 | 0.265 | 0.454 | 0.325
YOLO v11m [11] | 20.0 M | 67.6 | 0.336 | 0.833 | 0.179 | 0.242 | 0.465 | 0.500
MobileNetV4+Base [21] | 11.3 M | 39.5 | 0.368 | 0.904 | 0.209 | 0.328 | 0.430 | 0.502
FasterNet+Base [41] | 10.8 M | 28.5 | 0.368 | 0.900 | 0.202 | 0.314 | 0.439 | 0.600
RTMDet [42] | 27.5 M | 54.2 | 0.277 | 0.714 | 0.138 | 0.181 | 0.419 | 0.650
FUR-DETR (Ours) | 11.3 M | 23.9 | 0.375 | 0.915 | 0.216 | 0.331 | 0.438 | 0.500
Table 10. Benchmark results on Det-Fly.
Model | Backbone | AP50
Cascade R-CNN [36] | ResNet50 | 0.794
FPN [36] | ResNet50 | 0.787
Faster R-CNN [36] | ResNet50 | 0.705
Grid R-CNN [36] | ResNet50 | 0.824
RetinaNet [36] | ResNet50 | 0.779
RefineDet [36] | ResNet50 | 0.695
SSD512 [36] | ResNet50 | 0.787
YOLOv3 [36] | DarkNet53 | 0.723
FUR-DETR | Parallel Backbone | 0.966
Table 11. Benchmark results on DUT-Anti-UAV.
Model | Backbone | AP50-95
Faster-RCNN [37] | ResNet50 | 0.653
Faster-RCNN [37] | ResNet18 | 0.605
Faster-RCNN [37] | VGG16 | 0.633
Cascade-RCNN [37] | ResNet50 | 0.683
Cascade-RCNN [37] | ResNet18 | 0.652
Cascade-RCNN [37] | VGG16 | 0.667
ATSS [37] | ResNet50 | 0.642
ATSS [37] | ResNet18 | 0.610
ATSS [37] | VGG16 | 0.641
YOLOX [37] | ResNet50 | 0.427
YOLOX [37] | ResNet18 | 0.400
YOLOX [37] | VGG16 | 0.551
YOLOX [37] | DarkNet | 0.552
SSD [37] | VGG16 | 0.632
FUR-DETR | Parallel Backbone | 0.698
Table 12. Benchmark results on TIB.
Model | Backbone | AP50
Faster RCNN [38] | ResNet50 | 0.872
Faster RCNN [38] | MobileNet | 0.675
Cascade RCNN [38] | ResNet50 | 0.901
Cascade RCNN [38] | MobileNet | 0.780
YOLOv3 [38] | DarkNet53 | 0.849
YOLOv4 [38] | CSPDarkNet53 | 0.860
YOLOv5 [38] | YOLOv5-s | 0.862
EXTD [38] | MobileFaceNet | 0.851
TIB-Net [38] | TIB-Net | 0.892
FUR-DETR | Parallel Backbone | 0.915
