
FD-RTDETR: Frequency Enhancement and Dynamic Sequence-Feature Optimization for Object Detection

1 School of Electronic and Information Engineering, Anhui Jianzhu University, Hefei 230601, China
2 School of Big Data and Artificial Intelligence, Anhui Xinhua University, Hefei 230088, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4715; https://doi.org/10.3390/electronics14234715
Submission received: 28 October 2025 / Revised: 25 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025

Abstract

The high computational complexity of transformer-based detectors leads to slow inference speeds. RT-DETR alleviates this problem, yet there remains room for improvement. To achieve more comprehensive feature learning and better coverage of objects across scales, we introduce FD-RTDETR, a refined architecture for end-to-end object detection. We design a Frequency and Attention-based Intra-scale Feature Interaction (FAIFI) module for the hybrid encoder, performing dual-path enhancement on high- and low-frequency features. We also introduce a Dynamic Fusion of Scale Sequence features (DFSS) module for cross-scale feature fusion, which significantly extends the model’s coverage of objects at different scales. Using ResNet-18 as the backbone network and evaluating on the COCO 2017 dataset, FD-RTDETR achieves 45.1 AP, surpassing RT-DETR by 0.7 AP. On the VisDrone2019 dataset, it achieves 47.9 mAP50, outperforming RT-DETR by 1.3. We also tested generalization on urinary sediment and high-altitude infrared thermal imaging datasets, where FD-RTDETR scores 0.7 mAP50 and 0.8 mAP50:95 higher than RT-DETR, respectively, and performs better in several categories.

1. Introduction

Object detection represents a core problem in computer vision, which has been extensively adopted in various fields such as autonomous driving, intelligent surveillance, and medical image analysis. With the rapid advancement of deep learning, object detection methods have transitioned from traditional approaches relying on handcrafted algorithms to deep learning-based methods, substantially enhancing detection accuracy and generalization capability.
The current object detection landscape is dominated by two architectural paradigms: convolutional neural networks (CNNs) and transformers. Following the groundbreaking success of AlexNet [1] in the 2012 ImageNet challenge, CNN-based object detection frameworks rapidly became the dominant paradigm. Among the most notable is the YOLO series of detectors, which frames object detection as an end-to-end, single-stage regression problem. Compared to two-stage detectors, which require generating region proposals followed by classification and bounding box regression, the YOLO series excels at combining high precision with rapid inference, demonstrating its powerful engineering practicality. However, there are still obvious bottlenecks in its performance. On the one hand, convolutional operations, by design, struggle with long-range dependencies owing to their localized nature. On the other hand, reliance on artificial priors such as anchor box design and NMS post-processing not only introduces additional hyperparameters but also slows down inference, hindering further improvements in real-time detector performance and wider application.
The transformer-based DETR [2] simplifies object detection by adopting a set prediction formulation, thereby eliminating two complex manual stages: anchor box design and non-maximum suppression (NMS). However, its computational burden precludes real-time deployment. Various adaptations of DETR have been introduced to mitigate these limitations. Deformable DETR [3] proposed a deformable attention mechanism that significantly accelerates convergence and improves inference efficiency by computing attention over a limited set of key sampling points in the vicinity of the reference point. Dynamic DETR [4] incorporates a dynamic attention mechanism across encoder and decoder stages to tackle slow convergence during training and poor performance on small objects. Sparse DETR [5] significantly reduces the number of features involved in complex attention calculations by dynamically selecting relevant feature tokens and target queries. RT-DETR [6] rethought DETRs and restructured key components to reduce unnecessary computational redundancy and improve detector accuracy. It designs a hybrid encoder that combines the global modeling capability of transformers with the local feature extraction strength of CNNs by decoupling the encoder into attention-based intra-scale feature interaction (AIFI) and CNN-based cross-scale feature fusion (CCFF). AIFI adopts a single-scale encoder that operates solely on the top-level features from the backbone for representation learning, while CCFF is a PANet-like structure designed to fuse low-level and high-level features complementarily. However, because AIFI performs feature learning only at the highest level, it misses lower-level feature information, and the encoder therefore learns an incomplete representation. Our improved FAIFI module enhances feature learning by incorporating frequency information, achieving a 0.5 AP improvement on the COCO dataset. Moreover, the CCFF module is only a simple serial top-down and bottom-up structure that ignores the correlations between feature scales. By adding a scale sequence feature fusion component to the CCFF module, we compensate for these missing inter-scale correlations and achieve an improvement of 0.6 in APs. This paper therefore focuses on improving these two aspects. We validate the effectiveness of our modules on the standard COCO 2017 dataset, compare against mainstream models on the VisDrone2019 [7] dataset, and further verify the generalization ability of our model on datasets from different domains, HIT-UAV [8] and Urised11. Overall, the main contributions are as follows:
1. We designed the FAIFI module, a frequency-based dual-channel feature enhancement architecture that incorporates frequency information into AIFI, improving accuracy without significantly increasing the computational burden. It yields a 0.5 AP improvement on the COCO dataset and improves AP on large objects by 0.8, which may be attributed to the enhanced frequency information.
2. We introduced a scale sequence feature fusion module based on dynamic sampling (DFSS). By integrating this structure into CCFF, the model can achieve higher AR and AP, as it alleviates the problem of missing inter-scale correlations and better represents the correlations between scales.
3. Our method covers objects more completely, which has been validated on the Urised11 dataset with significant intra-class variations, where the AP for classes with large intra-class differences such as ‘pcast’ and ‘sperm’ increased by 3.1 and 2.5, respectively.

2. Related Works

2.1. Wavelet Transform in Computer Vision

The wavelet transform is a signal-processing technique with significant advantages for non-stationary signals such as images and audio. Moreover, the transform is reversible: frequency-domain information can be analyzed efficiently while all of the original information is retained, as illustrated below.
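As a concrete illustration of this reversibility, the following snippet (a minimal sketch using the PyWavelets library; the random 8 × 8 array is an arbitrary stand-in for an image) performs a single-level 2D Haar decomposition and then reconstructs the input exactly from the four sub-bands.

```python
import numpy as np
import pywt

x = np.random.rand(8, 8).astype(np.float64)
# Single-level 2D discrete wavelet transform (Haar): one low-frequency
# approximation and three high-frequency detail sub-bands, each 4x4.
cA, (cH, cV, cD) = pywt.dwt2(x, 'haar')
# The inverse transform recovers the original signal exactly,
# i.e. the decomposition is lossless.
x_rec = pywt.idwt2((cA, (cH, cV, cD)), 'haar')
print(np.allclose(x, x_rec))  # True
```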
Recently, wavelet transforms have been applied to various architectures in deep learning, improving the performance of various computer vision tasks. For instance, DWSR [9] employs the low-frequency sub-band decomposed from the low-resolution input image to predict the high-frequency sub-band, thereby restoring missing details in the high-resolution image for the super-resolution task. MWCNN [10] obtains multi-scale high-frequency and low-frequency sub-bands by multi-level wavelet decomposition of the input, extracts multi-resolution features, enhances the receptive field without losing information, and improves the effect of image restoration tasks. In WFEN [11], wavelet transforms are used in upsampling and downsampling to reduce distortion during sampling and enhance details, significantly improving the performance of the face super-resolution task. In traditional convolution operations, larger kernels are typically required to achieve a greater receptive field. However, WTConv [12] proposes processing the high-frequency sub-bands at each level through cascaded wavelet transforms using small kernels. This achieves a large receptive field at a small cost, avoiding the parameter explosion problem associated with large-kernel convolutions. Wavelet transforms are not only used in CNNs to improve performance in various tasks, but have also recently been introduced into transformers, where they have achieved good results.
Since traditional ViTs typically reduce computational costs through aggressive pooling operations, which can cause significant and irreversible loss of detail, Wave-ViT [13] employs multi-scale wavelet decomposition on the input to generate keys/values from the decomposed multi-level sub-bands. This enables lossless downsampling and multi-scale feature extraction, thereby improving computational efficiency and model performance. Inspired by previous work [14], we note that transformer structures are more attentive to low-frequency feature information, while CNN architectures are more sensitive to high-frequency information carrying fine-grained detail. Therefore, after performing a wavelet transform on the feature maps, we designed a module suited to each architecture, fully utilizing the different frequency components of the feature maps.

2.2. Multi-Scale Feature Fusion

In computer vision, multi-scale describes a framework for sampling signals at multiple levels of detail [15]. Due to factors such as shooting distance and angle, the size and shape of objects in images may vary. By leveraging multi-scale analysis, models can learn distinct features at different granularities and integrate them, leading to greater robustness when confronted with objects of diverse sizes, shapes, and types.
From the perspective of network topology, multi-scale feature fusion can be divided into two major categories: serial skip-layer connection structures and parallel multi-branch networks. The serial skip-layer connection structure uses top-down or bottom-up hierarchical transmission to bridge the gap between low-level detailed features and high-level semantic representations. For example, FPN [16] adopts a top-down structure with upsampling to gradually restore resolution, and its lateral connections make semantic and detailed information complementary. PANet [17] adds a bottom-up path-augmentation structure to FPN, enabling location information from lower layers to reach higher layers more quickly and thereby better preserving fine-grained information. BiFPN [18] removes nodes with only a single input edge from the PANet structure, allowing features from the same layer to be passed directly to subsequent layers, which minimizes the parameter count while preserving the interplay of critical features; it also proposes weighted feature fusion, assigning learnable weights to each input feature to improve fusion quality.
The parallel multi-branch fusion network extracts multi-scale features by designing multiple branches with different receptive fields and then fuses them with various strategies. For example, Inception [19] uses four parallel branches, each applying convolution and pooling operations of different sizes to extract features of different scales at the same level. Subsequently, Chen et al. proposed the ASPP [20] structure, which extracts features of different scales by applying dilated convolutions with different dilation rates on different branches; dilated convolution can emulate larger kernels and expand the receptive field without significantly increasing cost. The feature pyramid structure is very important, yet most work uses feature pyramids for serial multi-scale feature fusion and ignores the relationships between feature maps. The recent ASF-YOLO [21] proposed a novel SSFF structure to exploit the correlations among all pyramid feature maps. Inspired by 3D convolution operating across multiple video frames, this approach treats multi-scale feature maps as sequential frames, which requires aligning their resolutions to a unified scale; it takes into account the global and high-level semantic information of multi-scale features to capture objects of different spatial scales, sizes, and shapes. We improve this structure into DFSS in our model. Specifically, the improved module is added to the CCFF part of RT-DETR to compensate for the correlations between different feature levels. It is worth noting that modeling correlations between features is not limited to the spatial scale dimension. In the field of video saliency prediction, ref. [22] proposed the SAT mechanism, which models inter-frame correlations in the temporal dimension and achieved good results, further highlighting the importance of correlations between features.

2.3. Feature Map Sampling Methods

In the process of multi-scale feature fusion, high-resolution and low-resolution feature maps need to be sampled, and the sampling quality directly affects the fusion effect.
Feature map sampling is divided into upsampling and downsampling. For upsampling, the goal is to reconstruct a higher-resolution representation from a low-resolution feature map containing high-level semantics. The interpolation algorithms commonly used for upsampling estimate the values of newly inserted pixels from existing pixels; they have no learnable parameters, ignore the semantics of the feature map, and are prone to introducing blur. Some learnable upsampling methods, such as deconvolution [23], restore feature maps to higher resolutions by inverting ordinary convolution, but at a relatively large computational cost. Advances in computer vision have produced numerous deep learning-based upsampling techniques. DUpsample [24] proposes a simple and effective data-dependent method whose core idea is to replace fixed bilinear interpolation by learning a data-dependent reconstruction matrix. CARAFE [25] dynamically generates upsampling kernels from the local content of the input features and achieves adaptive feature upsampling through content-aware weighted combination. The recent DySample [26] achieves adaptive feature upsampling by learning content-aware offsets; it can be implemented with PyTorch’s standard built-in functions, making it lightweight and efficient (see the sketch below).
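To illustrate the offset-learning idea behind DySample, the sketch below predicts a small offset field, adds it to a regular sampling grid, and resamples the feature map with PyTorch's grid_sample. It is our own simplified stand-in (the layer shapes, the 0.25 offset scale, and the example sizes are assumptions), not the released DySample implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetUpsample(nn.Module):
    """Simplified offset-learning upsampler in the spirit of DySample [26].
    Illustrative sketch only, not the official implementation."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Predict (x, y) offsets for every position of the upsampled grid.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        s = self.scale
        # Offsets at the target resolution, kept small by a constant factor.
        off = F.pixel_shuffle(self.offset(x) * 0.25, s)          # (B, 2, H*s, W*s)
        off = off.permute(0, 2, 3, 1) / off.new_tensor([w, h])   # normalized coords
        # Regular sampling grid in [-1, 1], (x, y) order expected by grid_sample.
        ys = torch.linspace(-1, 1, h * s, device=x.device)
        xs = torch.linspace(-1, 1, w * s, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0)        # (1, H*s, W*s, 2)
        # Content-aware resampling of the low-resolution feature map.
        return F.grid_sample(x, grid + off, mode="bilinear", align_corners=True)

# Example: upsample a 20x20 feature map with 256 channels to 40x40.
up = OffsetUpsample(256)
y = up(torch.randn(1, 256, 20, 20))   # -> (1, 256, 40, 40)
```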
Downsampling serves as a fundamental operation in deep learning, employed to decrease the resolution of feature maps. By compressing the spatial dimensions, it helps the network focus on higher-level semantic features. Commonly used downsampling methods include pooling and convolution. Pooling compresses information in a local area with a sliding window and has no additional learnable parameters, while convolution achieves downsampling by increasing the stride. In addition, wavelet pooling [27] reduces the feature dimension by performing two wavelet decompositions on the feature map and keeping only the second-level sub-band. Recently, Xu et al. proposed the HWD downsampling approach [28], which leverages the properties of the wavelet transform to decompose features into four frequency sub-bands while enabling effective feature learning; this approach not only reduces the spatial resolution of feature maps effectively but also preserves the information completely.

3. Materials and Methods

3.1. Datasets

This study employs four datasets for experimental analysis: COCO 2017 validates the effectiveness of our proposed method, while VisDrone2019, HIT-UAV, and Urised11 assess cross-dataset generalization capabilities, demonstrating our model’s adaptability across diverse domains. Sample images from the datasets are shown in Figure 1. The COCO2017 dataset serves as a large-scale and comprehensive benchmark for object detection, created by Microsoft. The VisDrone2019 dataset is a large-scale drone dataset that collects various sparse or dense scenes in 14 cities in China under various weather and lighting conditions from a drone’s perspective, comprising a total of 10 categories. HIT-UAV is a dataset specifically designed for high-altitude infrared thermal imaging object detection by unmanned aerial vehicles (UAVs). The dataset covers five object categories across multiple locations in urban areas. The Urised11 dataset provides 7364 carefully selected urine sediment microscopic images. The dataset contains 11 categories, hence the name Urised11. In addition, some categories in this dataset have significant intra-class differences due to object deformation and pose changes, making this dataset challenging.

3.2. FD-RTDETR

3.2.1. Architecture

This article improves the detection performance of RT-DETR in two ways: by introducing feature maps of different frequencies into AIFI, and by enhancing CCFF. Figure 2 illustrates the workflow. FD-RTDETR consists of three parts: the backbone network, a frequency-based hybrid encoder, and the decoder. We designed a new module, FAIFI, which enables the model to capture more feature information by introducing feature maps of different frequencies, thereby adapting more effectively to the diversity of object poses and shapes in images. Second, by improving the feature fusion module, we propose the DFSS module, which enhances the model’s ability to extract multi-scale spatial information from feature maps, allowing it to represent objects of various sizes and shapes more comprehensively and improving the discriminative ability of object features.

3.2.2. FAIFI

To capitalize more extensively on the multi-frequency characteristics of features, we propose a Frequency and Attention-based intra-scale feature interaction module. This module is shown in Figure 3; the core idea is to use Haar wavelet transformation for frequency decomposition and design different enhancement paths to process different frequency components.
First, the module performs frequency decomposition on the output feature map $F_5 \in \mathbb{R}^{H \times W \times C}$, mapping it to the frequency domain via decomposition operators. The single-level 2D Haar wavelet transform and its inverse can be implemented with the following filter bank:

$$L_0 = \tilde{L}_0 = \left[ \tfrac{1}{\sqrt{2}},\ \tfrac{1}{\sqrt{2}} \right], \qquad H_0 = \tilde{H}_0 = \left[ \tfrac{1}{\sqrt{2}},\ -\tfrac{1}{\sqrt{2}} \right]$$

where $L_0$ denotes the low-pass decomposition filter, $H_0$ the high-pass decomposition filter, and $\tilde{L}_0$ and $\tilde{H}_0$ the low-pass and high-pass reconstruction filters, respectively. Because the Haar wavelet transform is orthogonal, the reconstruction filter is the transpose of the decomposition filter, so their mathematical expressions are identical.
The decomposition of an input feature $I \in \mathbb{R}^{H \times W}$ can be divided into horizontal filtering and vertical filtering. First, the feature map is filtered and downsampled in the horizontal direction:

$$L_{i,k} = \frac{1}{\sqrt{2}}\left( I_{i,2k} + I_{i,2k+1} \right), \qquad H_{i,k} = \frac{1}{\sqrt{2}}\left( I_{i,2k} - I_{i,2k+1} \right)$$

where $L$ and $H$ denote the low-frequency and high-frequency components obtained by convolving with the low-pass and high-pass filters and then downsampling, $i \in [0, H-1]$ is the row index, $k \in [0, \tfrac{W}{2}-1]$ is the column index after downsampling, and the outputs satisfy $L, H \in \mathbb{R}^{H \times \frac{W}{2}}$.
Subsequently, the $L$ and $H$ produced by horizontal filtering are filtered and downsampled in the vertical direction, ultimately generating four sub-bands ($LL$, $LH$, $HL$, $HH$) with different frequency characteristics:

$$L\bar{X}_{m,j} = \frac{1}{\sqrt{2}}\left( L_{2m,j} \pm L_{2m+1,j} \right), \ \bar{X} \in \{L, H\}; \qquad H Y_{m,j} = \frac{1}{\sqrt{2}}\left( H_{2m,j} \pm H_{2m+1,j} \right), \ Y \in \{L, H\}$$

In the formula, the $+$ sign is used when $\bar{X}$ or $Y$ is $L$, and the $-$ sign when $\bar{X}$ or $Y$ is $H$. $LL$ is the low-frequency sub-band containing the global feature information, while the high-frequency sub-bands $LH$, $HL$, and $HH$ represent the horizontal, vertical, and diagonal high-frequency information of the feature map, respectively. The four frequency components satisfy $LL, LH, HL, HH \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C}$.
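The decomposition and its inverse can be expressed compactly with tensor slicing. The sketch below is our own illustrative PyTorch code for channels-first (B, C, H, W) tensors, assuming even spatial dimensions; it mirrors the filtering equations above and verifies exact reconstruction.

```python
import torch

def haar_decompose(x: torch.Tensor):
    """Single-level 2D Haar decomposition of a (B, C, H, W) tensor, following
    the horizontal-then-vertical filtering above. Assumes even H and W."""
    s = 2 ** -0.5
    # Horizontal filtering + downsampling along the width.
    L = s * (x[..., 0::2] + x[..., 1::2])            # (B, C, H, W/2)
    H = s * (x[..., 0::2] - x[..., 1::2])
    # Vertical filtering + downsampling along the height -> four sub-bands.
    LL = s * (L[..., 0::2, :] + L[..., 1::2, :])     # global / low frequency
    LH = s * (L[..., 0::2, :] - L[..., 1::2, :])     # horizontal details
    HL = s * (H[..., 0::2, :] + H[..., 1::2, :])     # vertical details
    HH = s * (H[..., 0::2, :] - H[..., 1::2, :])     # diagonal details
    return LL, LH, HL, HH                            # each (B, C, H/2, W/2)

def haar_reconstruct(LL, LH, HL, HH):
    """Inverse single-level 2D Haar transform (exact, since Haar is orthogonal)."""
    s = 2 ** -0.5
    b, c, h, w = LL.shape
    L = LL.new_zeros(b, c, 2 * h, w)
    H = LL.new_zeros(b, c, 2 * h, w)
    L[..., 0::2, :] = s * (LL + LH)
    L[..., 1::2, :] = s * (LL - LH)
    H[..., 0::2, :] = s * (HL + HH)
    H[..., 1::2, :] = s * (HL - HH)
    x = LL.new_zeros(b, c, 2 * h, 2 * w)
    x[..., 0::2] = s * (L + H)
    x[..., 1::2] = s * (L - H)
    return x

# Quick check: decomposition followed by reconstruction is lossless.
f5 = torch.randn(1, 256, 20, 20)
assert torch.allclose(haar_reconstruct(*haar_decompose(f5)), f5, atol=1e-6)
```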
We decouple the feature map into complementary frequency components and enhance them with distinct structures. Previous studies have shown that transformer structures are more sensitive to low-frequency features containing global information, while CNN-based structures respond more strongly to fine-grained high-frequency features. Therefore, we designed a dual-channel enhancement architecture that feeds the low-frequency component $LL$ into AIFI with a transformer structure and feeds the high-frequency components into the convolutional enhancement module we designed; through this allocation, our model processes both the high- and low-frequency information within feature maps more effectively. Finally, the outputs of the high-frequency and low-frequency enhancement modules are combined to reconstruct the $S_5$ feature map containing the different frequency information:
$$S_5 = F_{refactor}\left( F_{low}(LL),\ F_{high}(LH, HL, HH) \right)$$

where $F_{refactor}$ denotes the frequency reconstruction function, i.e., the inverse of the wavelet decomposition, $F_{low}$ denotes AIFI with a transformer structure, and $F_{high}$ denotes the high-frequency enhancement block.
Given that the high-frequency components carry directional details (horizontal, vertical, and diagonal edges and textures) and typically exhibit irregular structures, we introduce three high-frequency enhancement residual blocks constructed with distinct convolutional configurations. The module using ordinary convolution is named the Residual block for Conv (CNRB), the module using depthwise-separable convolution is named the Residual block for DSConv (DSRB), and the module using DCNv4 is named the Residual block for DCNv4 (D4RB). Their structures are shown in Figure 4. The tensor $H \in \mathbb{R}^{B \times C \times 3 \times \frac{W}{2} \times \frac{H}{2}}$ containing the high-frequency components is reshaped into $\bar{H} \in \mathbb{R}^{3B \times C \times \frac{W}{2} \times \frac{H}{2}}$ by unfolding the sub-band dimension into the batch dimension. This transformation facilitates parallel processing of the tri-directional high-frequency components and enhances the model’s sensitivity to orientation-specific details. Moreover, residual connections are incorporated to mitigate information loss in deep network layers. Finally, after the residual connection, the enhanced high-frequency features are reshaped back to their original dimensions to facilitate fusion with the processed low-frequency components.
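The following sketch outlines the FAIFI forward pass under simplifying assumptions of our own: a single standard transformer encoder layer stands in for AIFI, a plain residual convolution block stands in for the CNRB/DSRB/D4RB enhancement blocks, and haar_decompose/haar_reconstruct refer to the helper functions in the previous sketch.

```python
import torch
import torch.nn as nn

class FAIFISketch(nn.Module):
    """Illustrative FAIFI flow: decompose F5, enhance LL with attention and the
    three high-frequency sub-bands with convolutions, then reconstruct S5."""
    def __init__(self, channels: int, nhead: int = 8):
        super().__init__()
        # Low-frequency path: stand-in for AIFI (one transformer encoder layer).
        self.low_path = nn.TransformerEncoderLayer(
            d_model=channels, nhead=nhead, batch_first=True)
        # High-frequency path: stand-in for a CNRB-style residual block.
        self.high_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, f5: torch.Tensor) -> torch.Tensor:
        LL, LH, HL, HH = haar_decompose(f5)          # helpers from the sketch above
        b, c, h, w = LL.shape
        # Low-frequency enhancement: flatten LL into a token sequence.
        tokens = LL.flatten(2).permute(0, 2, 1)      # (B, H*W/4, C)
        low = self.low_path(tokens).permute(0, 2, 1).reshape(b, c, h, w)
        # High-frequency enhancement: fold the three sub-bands into the batch
        # dimension so one block processes all directional details in parallel.
        high = torch.stack([LH, HL, HH], dim=1)      # (B, 3, C, H/2, W/2)
        high = high.reshape(3 * b, c, h, w)
        high = high + self.high_path(high)           # residual connection
        LH, HL, HH = high.reshape(b, 3, c, h, w).unbind(dim=1)
        return haar_reconstruct(low, LH, HL, HH)     # frequency refactoring -> S5
```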

3.2.3. DFSS

The ASF-YOLO work holds that the high-resolution feature map F3 retains the bulk of the information critical for object detection and ancillary tasks. Therefore, SSFF uses feature layer F3 as a reference, samples F4 and F5 to the size of F3 using nearest-neighbor interpolation, stacks them to increase the feature dimension, and then applies a three-dimensional convolution to extract scale sequence features, which allows the correlations between different feature scales to be fully exploited. This study draws on the SSFF concept and improves upon it. Our work also uses the F3 feature layer as the benchmark, but considering that the deep F5 feature layer has undergone multiple convolutions, resulting in sparse features and weakened semantic information, we discard the F5 layer and instead use the F2 layer, which contains richer low-level features.
The DFSS module, whose pipeline is presented in Figure 5, operates in three steps. First, scale alignment: downsampling and upsampling operations are performed on feature maps F2 and F4, respectively, to match their spatial resolutions with that of the reference feature map F3. Second, dimension expansion: each feature map is expanded from H × W × C to D × H × W × C by adding a depth dimension. Finally, sequence feature fusion: the three dimension-expanded feature maps are concatenated along the depth (D) dimension to form a feature cube, and a 3D convolution layer operates on this cube to extract cross-scale sequence feature information.
During scale alignment, the quality of downsampling F2 and upsampling F4 to the F3 resolution is critical and greatly affects the subsequent fusion. The original SSFF structure uses nearest-neighbor interpolation to sample feature maps. Although this method is simple and easy to implement, it compromises reconstruction fidelity and spatial coherence because it assigns each new position the value of the single nearest pixel, ignoring the influence of other neighboring pixels. Building upon this foundation, we integrate an efficient dynamic upsampler (DySample) at the F4 level; its core advantage is the capacity to learn sampling offsets dynamically. By predicting optimal sampling locations, this approach overcomes the fixed patterns inherent to static upsampling methods (e.g., bilinear or nearest-neighbor interpolation) and significantly improves the continuity of feature reconstruction. Compared with dynamic kernel-based upsamplers such as CARAFE, DySample avoids the complex kernel generation process, requiring only a lightweight offset prediction network to achieve a superior balance between efficiency and precision. In the downsampling stage of F2, we do not use traditional pooling operations or strided convolutions; instead, we introduce the recent HWD downsampling, which, compared with traditional downsampling, significantly improves the retention of key information (such as boundaries, scale, and texture). This plays a crucial role in object detection tasks and provides richer and more reliable features for subsequent layers. A condensed sketch of the fusion step follows.
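The sketch below condenses the three DFSS steps; a strided convolution and bilinear interpolation stand in for the HWD downsampler and DySample upsampler described above, and the channel counts and feature-map sizes in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFSSSketch(nn.Module):
    """Sketch of the DFSS fusion step: align F2/F4 to F3's resolution, stack the
    three maps along a depth axis, and extract scale-sequence features with a
    3D convolution. Simplified stand-ins replace HWD and DySample."""
    def __init__(self, channels: int):
        super().__init__()
        self.down_f2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # HWD stand-in
        self.proj_f3 = nn.Conv2d(channels, channels, 1)
        self.fuse3d = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, f2, f3, f4):
        # Step 1: scale alignment to the F3 resolution.
        s2 = self.down_f2(f2)                                    # downsample F2
        s3 = self.proj_f3(f3)
        s4 = F.interpolate(f4, size=f3.shape[-2:], mode="bilinear",
                           align_corners=False)                  # DySample stand-in
        # Step 2: dimension expansion via stacking along a depth axis
        #         (channels-first equivalent of H x W x C -> D x H x W x C).
        cube = torch.stack([s2, s3, s4], dim=2)                  # (B, C, D=3, H, W)
        # Step 3: 3D convolution over the scale-sequence cube.
        return self.fuse3d(cube).squeeze(2)                      # (B, C, H, W)

# Example with RT-DETR-like strides (F2: 1/4, F3: 1/8, F4: 1/16 of a 640 input).
dfss = DFSSSketch(channels=256)
f2 = torch.randn(1, 256, 160, 160)
f3 = torch.randn(1, 256, 80, 80)
f4 = torch.randn(1, 256, 40, 40)
m = dfss(f2, f3, f4)   # -> (1, 256, 80, 80)
```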

3.2.4. Overall Architecture Flow

FD-RTDETR improves the AIFI module of RT-DETR and adds the DFSS module to CCFF. To demonstrate the overall framework of FD-RTDETR more clearly, Table 1 summarizes the process in pseudocode. The output F5 of the backbone network enters FAIFI to generate the S5 feature map, while F2, F3, and F4 enter the DFSS module for inter-scale feature interaction; the result is then fused with the other feature maps in CCFF before being passed to the decoder and, finally, the detection head for object detection.

3.3. Experimental Setup

3.3.1. Experimental Environment

Our experiments were conducted on a DELL PowerEdge 640 server (Dell Inc., Round Rock, TX, USA) featuring two GeForce RTX 3090 GPUs. The key software components were Ubuntu 20.04, CUDA 11.7, Python 3.8, and PyTorch 1.13.1. The model is based on Ultralytics’ official implementation of RT-DETR with modifications. We closely followed PaddlePaddle’s official training configurations, employing the AdamW optimizer with an initial learning rate of 0.0001 and weight decay of 0.0001. Models were trained for 300 epochs on the VisDrone2019 and HIT-UAV datasets, and for 200 epochs on the MS COCO dataset. No pretrained weights were utilized in any experiments.

3.3.2. Evaluation Metrics

Common evaluation metrics in the field of object detection include Precision, Recall, and AP, with the formulas given below. This study uses the COCO evaluation metrics and the pycocotools tool for evaluation. The primary metric, AP, is the mean average precision over IoU thresholds from 0.50 to 0.95. For fixed IoU criteria, AP50 reports the average precision at an IoU of 0.50 and AP75 at 0.75. Performance across object sizes is evaluated by APs (small), APm (medium), and APl (large), which measure the average precision for objects of the corresponding scales. Model size is measured by the number of parameters (Params), and algorithmic complexity is expressed in FLOPs.
$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$AP = \frac{1}{n} \sum_{i=1}^{n} \int_{0}^{1} P(R)\, dR$$
where TP denotes a positive sample that is correctly identified, FP denotes a negative sample that is mistakenly recognized as positive, and FN denotes a positive sample that is incorrectly rejected.
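As a usage note, these COCO metrics can be reproduced with pycocotools as sketched below; the annotation and detection file paths are placeholders, and the detections file is assumed to be in the standard COCO results format.

```python
# Minimal evaluation sketch using pycocotools; file paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_val2017.json")   # model detections (COCO format)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP50, AP75, APs/APm/APl, AR@100, etc.
```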

4. Experimental Results and Analysis

4.1. Comparison of Different Detection Models

FD-RTDETR was evaluated against other state-of-the-art object detectors on the VisDrone2019 validation set. To cover diverse architectural paradigms, we selected state-of-the-art detectors for comparative analysis. Specifically, for CNN-based architectures we use the M versions of the YOLO series detectors and RTMDet [29]; for DETR-based architectures we adopt several classic models as well as PHSI-RTDETR [30], which has recently performed well in the UAV field. The comparative results are presented in Table 2.
According to the results in Table 2, FD-RTDETR exhibits strong performance on several key metrics. FD-RTDETR achieves 29.3 mAP50:95 on the validation set, with its mAP50 reaching 47.9, surpassing numerous state-of-the-art methods. Compared with the parameter-intensive RTMDet (52.30 M), FD-RTDETR achieves a better balance between computational efficiency and detection performance. In terms of computational complexity, compared with Deformable DETR and Efficient DETR [31], although the parameter counts are similar, FD-RTDETR reduces the computation by 134.5 G and 97.5 G FLOPs, respectively, while maintaining higher accuracy. Compared with YOLOv9 [32], YOLOv10 [33], YOLOv11 [34], and YOLOv12 [35], which are similar in parameter count and computational complexity, FD-RTDETR improves average precision by 2.5, 3.4, 2.1, and 2.8 points, respectively. In terms of inference speed, FD-RTDETR reaches 64.5 FPS, about 1.4% slower than RT-DETR’s 65.4 FPS, but its accuracy is higher, so the slight sacrifice in inference speed is worthwhile.
Our approach was further validated and assessed on the HIT-UAV dataset. Table 3 presents the per-category results of other models and FD-RTDETR. Our method outperforms previous state-of-the-art models in multiple categories. FD-RTDETR achieved 53.1 mAP50:95, which is 0.8 higher than the RT-DETR-R18 model, and the mAP50 reached 82.4, which is 4.7 and 0.7 higher than YOLOv10 and RT-DETR, respectively. On the “Other Vehicle” category, FD-RTDETR achieves a 67.1 mAP50, exceeding RT-DETR by 6.3 points. For the “Don’t Care” category, we observed a decrease of 2.0 mAP50 points, which may be attributed to enhanced object coverage capability leading to increased background misclassification. Improvements were also observed in other categories and metrics. These results indicate that FD-RTDETR not only performs well in normal light scenarios but also maintains stable performance in more challenging high-altitude infrared thermal image datasets.
We also tested and analyzed our method on the standard COCO dataset. Table 4 shows the comparison between FD-RTDETR and recent models. Our model achieves an AP of 45.1, 0.7 higher than RT-DETR, and we also observe improvements of varying degrees on objects of other scales. Although our performance on large objects is relatively lower than that of YOLOv10 and YOLOv11, we achieve more significant gains on small objects.
We further validated our approach through experimental analysis on the Urised11 dataset; the corresponding results are presented in Table 5. For the categories ‘cast’, ‘yeast’, ‘sperm’, and ‘pcast’, FD-RTDETR shows significant improvements, with mAP50 increasing by 3, 1.2, 2.5, and 3.1, respectively; these categories exhibit significant intra-class variation and present substantial challenges. Our model does not perform well on ‘cryst’ and ‘epithn’ because these two cell types suffer from partial cell adhesion and aggregation, which easily lead to missed or false detections. Overall, our method improves significantly, but there are still cases of missed detections and false positives. Our model achieves notably greater improvements in categories with high intra-class variation, demonstrating enhanced robustness in detecting such heterogeneous objects.

4.2. Ablation Experiment

Table 6 presents ablation study results using ResNet-18 as the backbone on the COCO 2017 dataset. Owing to differences in experimental conditions (hardware, reproduction platform, hyperparameters, and training tricks used for the original model), and because we did not use pre-trained weights, the best AP achieved by our reproduced baseline is 44.4, below RT-DETR’s reported value. With the improved FAIFI module, AP increases to 44.9 (+0.5), with a particularly notable effect on large objects (an increase of 0.8), demonstrating the effectiveness of our frequency enhancement mechanism. Adding DFSS improves APs by 0.6 and also improves AP, with only a 0.24 M increase in parameters. Integrating both modules yields a 0.7 AP improvement overall while incurring only about 5% and 7% increases in parameters and computational cost, respectively, with performance gains observed across multiple scales.

4.2.1. Comparison of Different Modules in FAIFI

Unlike the ordinary convolutions in standard residual structures, which use fixed grid receptive fields, our high-frequency enhancement module D4RB is built on deformable convolution. D4RB leverages DCNv4’s dynamic spatial sampling capability, enabling the convolutional kernels to adjust their sampling locations adaptively based on the input high-frequency features. The experiments on the three enhancement structures are shown in Table 7 below.
To further verify and discuss the impact of the FAIFI module, we generated heat maps of FAIFI’s output and compared them with the baseline model. Referring to Figure 6, the AIFI module of the original model may rely more on the self-attention mechanism in the spatial domain, resulting in a relatively dispersed response to local high-frequency details. FAIFI’s response is more concentrated at the edges of objects (high-frequency areas), indicating that its ability to capture detailed features has been enhanced. Furthermore, the attention of the original model may be diverted to non-target areas, resulting in a slightly dispersed heat distribution. FAIFI, on the other hand, uses frequency filtering via the wavelet transform to suppress high-frequency noise interference, allowing the heat to concentrate on the target subject.

4.2.2. Comparison of Different Fusion Strategies

To comprehensively evaluate DFSS’s impact, we analyzed not only detection accuracy but also assessed performance using COCO-style metrics on the COCO 2017 validation set (5000 images). This included Average Recall (AR) and scale-specific AR metrics for small, medium, and large objects (ARs, ARm, ARl) as defined in the COCO evaluation protocol. We evaluated recall rates using 100 and 1000 detections per image, per the COCO evaluation protocol. The specific results are shown in Table 8. Incorporating the DFSS module into our baseline model yields significant improvements not only in mean Average Recall but also demonstrates consistent gains across all scale-specific recall metrics. The addition of the DFSS module significantly alleviated the problem of false negatives in the model, but slightly increased the number of false positives, so that the improvement in model accuracy was not very noticeable. We visualized these situations, as shown in Figure 7 below.
We used blue dashed boxes to indicate false positives, and red dashed boxes to indicate missed detections. As shown in the figure, our model has alleviated the problem of false negatives, but false positives also increase slightly when objects are obscured by adhesions. This may be due to our module’s enhanced coverage of the object, leading to misclassification of the background.

4.3. Visualization Experiments

A direct performance comparison of FD-RTDETR against the RT-DETR baseline is shown in Figure 8, based on the VisDrone2019 dataset. The left side of the figure compares precision-recall curves. Overall, the FD-RTDETR curve lies above the RT-DETR curve, indicating higher precision at the same recall level. The right side of the figure visualizes the mAP50 on the validation set during training. FD-RTDETR consistently outperformed RT-DETR in both convergence speed and final performance, ultimately achieving an mAP50 that was 1.3 higher than RT-DETR. Coupled with the consistent gains across the precision-recall curves, these results indicate that our approach improves detection accuracy over the baseline while also enhancing generalization capability and convergence efficiency.
As illustrated in Figure 9, the detection performance of different models is compared on the VisDrone2019 dataset. Analysis of the detection results reveals that our model significantly reduces missed detections and performs better than other models, especially for small, distant targets. However, compared with the original RT-DETR, FD-RTDETR produces slightly more false detections, which may be due to the model’s enhanced coverage of objects.

4.4. Error Analysis

To evaluate the model comprehensively, we used the TIDE tool [36] to perform a fine-grained analysis of detection errors on the COCO dataset for the baseline and the improved model, breaking overall performance down into interpretable error types. The results are shown in Figure 10 below. Our model reduces classification errors (Cls) and missed-detection errors (Miss) by 0.57 and 0.18, respectively, and reduces false negatives (FN) by a substantial 0.85, indicating improved object recall. Meanwhile, the overall false positive (FP) rate increases by 0.33, with slight increases in background false positives (Bkg) and duplicate detections (Dupe); from another perspective, this also indicates that our method covers objects more completely. Localization (Loc) and combined (Both) errors are not significantly improved and require further optimization.

5. Conclusions

We present an improved version of the RT-DETR model designed to overcome specific limitations in object detection; with only a small increase in overhead, we propose the FD-RTDETR object detector. To capture richer feature information, we propose the FAIFI module, which decouples feature maps by frequency and applies a dual-stream enhancement module to the different frequency components. We then introduce the DFSS module in the multi-scale feature fusion stage to compensate for the correlations between features of different scales during fusion. We validated the effectiveness of these modules on the COCO2017 dataset. We also evaluated on the VisDrone2019 dataset for specialized scenarios, where FD-RTDETR showed significant improvements for complex scenes and smaller objects. To verify the model’s generalization ability, we experimented on the high-altitude infrared thermal image dataset HIT-UAV; the results show that FD-RTDETR is not only effective in normal-light scenes but also stable in the more challenging HIT-UAV environment. Finally, we conducted verification on the Urised11 dataset, showing that FD-RTDETR is more robust to objects with large intra-class differences. Despite these results, we also found that in relatively complex backgrounds, the model’s enhanced coverage of the target can lead to false positives in the background. In future work, we aim to refine the regression loss function. The current regression loss may not be sensitive enough for targets with irregular shapes and heavy background interference; we plan to design a more refined and discriminative loss function to improve localization accuracy and classification consistency in complex scenarios, and we may also explore more efficient hybrid encoder architectures.

Author Contributions

Conceptualization, Q.W.; methodology, Q.W. and W.W.; writing—original draft preparation, W.W.; writing—review and editing, Q.W.; data curation, K.Z., Y.H. and Q.X.; formal analysis, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Anhui Province Quality Engineering Project (No.2014zytz035, No. 2021sqyrgx01, No. 2022jyxm641) and the Academic funding project for top talents of disciplines in colleges and universities of Anhui Province (No. gxbjZD2020096).

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 89–90. [Google Scholar] [CrossRef]
  2. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  3. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-To-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  4. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2988–2997. [Google Scholar]
  5. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  6. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  7. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  8. Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A high-altitude infrared thermal dataset for Unmanned Aerial Vehicle-based object detection. Sci. Data 2023, 10, 227. [Google Scholar] [CrossRef] [PubMed]
  9. Guo, T.; Seyed Mousavi, H.; Huu Vu, T.; Monga, V. Deep wavelet prediction for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 104–113. [Google Scholar]
  10. Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; Zuo, W. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 773–782. [Google Scholar]
  11. Li, W.; Guo, H.; Liu, X.; Liang, K.; Hu, J.; Ma, Z.; Guo, J. Efficient face super-resolution via wavelet-based feature enhancement network. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 4515–4523. [Google Scholar]
  12. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 363–380. [Google Scholar]
  13. Yao, T.; Pan, Y.; Li, Y.; Ngo, C.-W.; Mei, T. Wave-vit: Unifying wavelet and transformers for visual representation learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 25–27 October 2022; pp. 328–345. [Google Scholar]
  14. Li, A.; Zhang, L.; Liu, Y.; Zhu, C. Feature modulation transformer: Cross-refinement of global representation via high-frequency prior for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12514–12524. [Google Scholar]
  15. Zhang, Q.; Yang, Y.; Cheng, Y.; Wang, G.; Ding, W.; Wu, W.; Pelusi, D. Information fusion for multi-scale data: Survey and challenges. Inf. Fusion 2023, 100, 101954. [Google Scholar] [CrossRef]
  16. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  17. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  18. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2020; pp. 10781–10790. [Google Scholar]
  19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  20. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  21. Kang, M.; Ting, C.-M.; Ting, F.F.; Phan, R.C.-W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Xiao, Y.; Zhang, Y.; Zhang, T. Video saliency prediction via single feature enhancement and temporal recurrence. Eng. Appl. Artif. Intell. 2025, 160, 111840. [Google Scholar] [CrossRef]
  23. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528. [Google Scholar]
  24. Tian, Z.; He, T.; Shen, C.; Yan, Y. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3126–3135. [Google Scholar]
  25. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  26. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  27. Williams, T.; Li, R. Wavelet pooling for convolutional neural networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  28. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation. Pattern Recognit. 2023, 143, 109819. [Google Scholar] [CrossRef]
  29. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
  30. Wang, S.; Jiang, H.; Li, Z.; Yang, J.; Ma, X.; Chen, J.; Tang, X. PHSI-RTDETR: A lightweight infrared small target detection algorithm based on UAV aerial photography. Drones 2024, 8, 240. [Google Scholar] [CrossRef]
  31. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  32. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
  33. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  34. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  35. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  36. Bolya, D.; Foley, S.; Hays, J.; Hoffman, J. Tide: A general toolbox for identifying object detection errors. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 558–573. [Google Scholar]
Figure 1. Example images from COCO2017, Visdrone2019, HIT-UAV and Urised11 datasets.
Figure 2. Overview of our work.
Figure 3. Detailed structure of FAIFI. Frequency decomposition is Haar wavelet transform. Frequency refactoring is Inverse wavelet transform.
Figure 4. Structure of different enhancement modules. (a) CNRB, (b) DSRB, (c) D4RB.
Figure 5. Detailed Structure of DFSS.
Figure 6. Heat maps generated by AIFI and FAIFI.
Figure 7. Missed detections and false detections in different structures. The second row shows the detection results of the original model, the third row shows the results with the SSFF structure added, and the fourth row shows the results with the DFSS module added.
Figure 8. Comparison of the validation results of each model on VisDrone2019 dataset. (a) Precision–recall curve. (b) Comparison of mAP50 curves.
Figure 9. Model Evaluation on the VisDrone2019 Dataset. Blue dashed boxes indicate false positives, and red dashed boxes indicate missed detections.
Figure 10. (a) Error analysis for RT-DETR, (b) Error analysis for FD-RTDETR.
Table 1. Pseudocode for the overall framework of FD-RTDETR.

Overall framework of FD-RTDETR
Input: Input image I
Output: Detection results E
1:  Step 1: Backbone Feature Extraction
2:      F2, F3, F4, F5 ← Backbone(I)
3:  Step 2: FAIFI Processing
4:      S5 ← FAIFI(F5)
5:  Step 3: DFSS Processing
6:      S2 ← HWD Downsample(F2)
7:      S3 ← Conv1×1(F3)
8:      S4 ← DySample Upsample(F4)
9:      M ← Conv3D(Stack(S2, S3, S4))
10: Step 4: Feature Fusion
11:     F_fused ← CCFF(F3, F4, S5, M)
12: Step 5: Decoder & Head
13:     E ← Decoder&Head(F_fused)
14: return E
Table 2. Comparison of results of different methods on the VisDrone2019 dataset.

| Model | #Params (M) | GFLOPs | P (val) | R (val) | mAP50 (val) | mAP50:95 (val) | FPS (bs = 4) |
|---|---|---|---|---|---|---|---|
| CNN-Based Object Detectors | | | | | | | |
| RTMDet | 52.3 | 80 | 55.7 | 41.3 | 43.2 | 26.4 | 37.7 |
| YOLOv8-M | 25.8 | 78.7 | 54.1 | 42.1 | 42.8 | 26.1 | 65.7 |
| YOLOv9-M | 20.0 | 76.5 | 55.3 | 42.4 | 43.9 | 26.8 | 55.5 |
| YOLOv10-M | 15.3 | 58.9 | 54.3 | 40.4 | 42.4 | 25.9 | 82.6 |
| YOLOv11-M | 20.0 | 67.7 | 55.2 | 42.8 | 44.3 | 27.2 | 62.5 |
| YOLOv12-M | 20.1 | 67.2 | 54.8 | 42.0 | 43.3 | 26.5 | 65.8 |
| DETR-Based Object Detectors | | | | | | | |
| Deformable DETR | 40 | 196 | 54.6 | 40.9 | 42.8 | 26.8 | – |
| Efficient DETR | 32.1 | 159 | 58.8 | 44.0 | 46.1 | 28.5 | – |
| RT-DETR-R18 | 20 | 57.3 | 61.6 | 45.0 | 46.6 | 28.6 | 65.4 |
| RT-DETR-R34 | 31.4 | 90 | 60.8 | 44.4 | 46.2 | 28.3 | 62.5 |
| PHSI-RTDETR | 14.0 | 47.5 | 60.8 | 45.6 | 47.1 | 28.7 | 39.9 |
| FD-RTDETR | 21 | 61.5 | 62.0 | 46.5 | 47.9 | 29.3 | 64.5 |
Table 3. Per-Class Detection Results on the HIT-UAV Test Set.

| Method | Person | Car | Bicycle | Other Vehicle | Don’t Care | mAP50 (test) | mAP50:95 (test) |
|---|---|---|---|---|---|---|---|
| YOLOv8-M | 94.5 | 96.6 | 91.4 | 57.8 | 59.3 | 79.9 | 51.0 |
| YOLOv9-M | 91.0 | 98.7 | 89.6 | 74.5 | 45.5 | 79.8 | 51.9 |
| YOLOv10-M | 88.0 | 96.8 | 84.5 | 66.1 | 52.5 | 77.7 | 47.4 |
| YOLOv11-M | 91.3 | 98.0 | 91.2 | 66.8 | 66.1 | 82.7 | 54.2 |
| RT-DETR-R18 | 93.5 | 96.7 | 91.0 | 60.8 | 66.2 | 81.7 | 52.3 |
| RT-DETR-R34 | 93.3 | 96 | 91.3 | 59.4 | 59.5 | 79.9 | 51.4 |
| FD-RTDETR | 94.3 | 94.9 | 91.4 | 67.1 | 64.2 | 82.4 | 53.1 |
| vs. RT-DETR-R18 | +0.8 | −1.8 | +0.4 | +6.3 | −2.0 | +0.7 | +0.8 |
Table 4. Comparison of results of different methods on the COCO2017 dataset.

| Model | Backbone | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|
| YOLOv10 | YOLOv10s | 44.4 | 61.1 | 48.3 | 25.0 | 49.0 | 61.1 |
| YOLOv11 | YOLOv11s | 44.3 | 60.9 | 48.1 | 25.0 | 48.6 | 61.7 |
| RT-DETR | ResNet-18 | 44.4 | 61.3 | 47.9 | 26.5 | 47.7 | 57.9 |
| FD-RTDETR | ResNet-18 | 45.1 | 62.2 | 48.4 | 27.2 | 47.5 | 59.0 |
Table 5. Per-Class Detection Results on the Urised11 validation dataset.

| Method | mAP50 (val) | cast | leuko | cryst | epith | yeast |
|---|---|---|---|---|---|---|
| RT-DETR-R18 | 71.7 | 71.1 | 91.5 | 89.1 | 88.9 | 85.1 |
| FD-RTDETR | 72.4 | 74.1 | 91.1 | 87.8 | 89.2 | 86.3 |
| vs. | +0.7 | +3 | −0.4 | −1.3 | +0.3 | +1.2 |

| Method | eryth | mycete | leukoc | sperm | pcast | epithn |
|---|---|---|---|---|---|---|
| RT-DETR-R18 | 90.8 | 59.5 | 67.6 | 45.9 | 59.2 | 40.4 |
| FD-RTDETR | 90.7 | 59.7 | 67.1 | 48.4 | 62.3 | 39.6 |
| vs. | −0.1 | +0.2 | −0.5 | +2.5 | +3.1 | −0.8 |
Table 6. Ablation Study on the COCO 2017 dataset.

| Method | #Params (M) | GFLOPs | FPS | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 19.97 | 57.3 | 143 | 44.4 | 61.3 | 47.9 | 26.5 | 47.7 | 57.9 |
| +FAIFI | 20.70 | 57.4 | 125 | 44.9 | 61.8 | 48.3 | 26.9 | 47.4 | 58.7 |
| +DFSS | 20.21 | 61.3 | 133 | 44.6 | 61.7 | 48.1 | 27.1 | 47.7 | 58.5 |
| +ALL | 20.94 | 61.5 | 118 | 45.1 | 62.2 | 48.4 | 27.2 | 47.5 | 59.0 |
Table 7. Exploring Different Enhancement Modules.

| Method | #Params (M) | GFLOPs | AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|---|---|---|
| Baseline | 19.97 | 57.3 | 44.4 | 61.3 | 47.9 | 26.5 | 47.7 | 57.9 |
| CNRB | 21.15 | 57.7 | 44.1 | 61.2 | 47.7 | 26.3 | 47.4 | 58.5 |
| DSRB | 20.63 | 57.4 | 44.7 | 61.6 | 48.1 | 27.1 | 47.6 | 58.1 |
| D4RB | 20.70 | 57.4 | 44.9 | 61.8 | 48.3 | 26.9 | 47.4 | 58.7 |
Table 8. Impact of Different Fusion Strategies on AP and AR.

| Fusion Strategy | AP | AP50 | AR@100 | AR@1k | ARs@1k | ARm@1k | ARl@1k |
|---|---|---|---|---|---|---|---|
| CCFF | 44.4 | 61.3 | 63.6 | 64.0 | 43.7 | 66.9 | 79.8 |
| +SSFF | 44.0 | 61.1 | 63.5 | 64.1 | 43.5 | 67.9 | 78.7 |
| +DFSS | 44.6 | 61.7 | 64.5 | 64.8 | 44.4 | 68.1 | 80.9 |

