Review

Object Detection with Transformers: A Review

by Tahira Shehzadi 1,2,3,*, Khurram Azeem Hashmi 1,2,3, Marcus Liwicki 4, Didier Stricker 1,2,3 and Muhammad Zeshan Afzal 1,2,3
1 Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
2 Mindgarage Lab, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
3 German Research Institute for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
4 Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, 971 87 Luleå, Sweden
* Author to whom correspondence should be addressed.
Sensors 2025, 25(19), 6025; https://doi.org/10.3390/s25196025
Submission received: 16 August 2025 / Revised: 24 September 2025 / Accepted: 26 September 2025 / Published: 1 October 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

The astounding performance of transformers in natural language processing (NLP) has motivated researchers to explore their applications in computer vision tasks. The detection transformer (DETR) introduces transformers to object detection by reframing detection as a set prediction problem, thereby eliminating the need for proposal generation and post-processing steps. Despite competitive performance, DETR initially suffered from slow convergence and poor detection of small objects. Numerous improvements have since been proposed to address these issues, leading to substantial gains and enabling DETR to achieve state-of-the-art performance. To the best of our knowledge, this paper is the first to provide a comprehensive review of 25 recent DETR advancements. We examine both the foundational modules of DETR and its recent enhancements, such as modifications to the backbone structure, query design strategies, and refinements to attention mechanisms. Moreover, we conduct a comparative analysis across various detection transformers, evaluating their performance and network architectures. We hope this study encourages further research on addressing the existing challenges and exploring the application of transformers in the object detection domain.

1. Introduction

Object detection is a fundamental task in computer vision that involves locating and classifying objects within an image [1,2,3,4,5,6], with applications in autonomous driving, surveillance, robotics, and medical imaging. In autonomous driving, for example, accurately detecting pedestrians, vehicles, and traffic signs in real time is critical for safety. Traditionally, convolutional neural network (CNN)-based detectors, such as Faster R-CNN [1] and RetinaNet [4], have dominated object detection and achieved impressive performance. However, these models heavily rely on hand-crafted components like region proposal networks (RPNs) and post-processing steps such as non-maximum suppression (NMS) [7], which complicate the training pipeline and limit end-to-end optimization. The recent success of transformers in natural language processing (NLP) has motivated researchers to explore their potential in computer vision [8]. The transformer architecture [9,10] effectively captures long-range dependencies in sequential data, enabling global context modeling that is difficult for traditional CNNs. This capability makes transformers particularly attractive for object detection, where recognizing objects often depends on global context.
The transformer architecture [9,10] is characterized by its encoder–decoder structure and the use of self-attention and cross-attention mechanisms, which allow it to capture long-range dependencies across input sequences effectively. Unlike CNNs, which primarily focus on local features through convolutional kernels, transformers can model global relationships across an entire image. This capability makes transformers particularly suitable for object detection, where understanding the spatial and contextual relationships between multiple objects is crucial. Leveraging this strength, researchers have explored transformer-based approaches to develop end-to-end object detection frameworks that do not rely on hand-crafted components.
In this context, Carion et al. (2020) proposed the detection transformer (DETR) [11], a novel framework that replaces traditional region proposal-based methods with an end-to-end trainable architecture built on a transformer encoder–decoder network. The DETR network demonstrates promising performance, outperforming conventional CNN-based object detectors [12,13,14,15,16,17,18,19], while also eliminating the need for components such as region proposal networks and post-processing steps like non-maximum suppression (NMS) [7]. Despite these advantages, DETR has certain limitations, including slow training convergence and reduced performance on small objects, which have motivated numerous modifications and improvements in subsequent research.
Since DETR’s introduction, numerous variants have emerged to address limitations such as slow convergence, small object detection, and computational efficiency. Figure 1 illustrates the growth and evolution of DETR research, showing rising publications and citations, widespread architectural modifications, and a focus on key challenges like improving training stability, efficiency, and small object performance. This rapid expansion of transformer-based detection emphasizes the need for a comprehensive review. Deformable-DETR [20] modifies the attention modules that process the image feature maps, identifying the attention mechanism as the main reason for slow training convergence, while UP-DETR [21] proposes modifications to pre-train DETR similar to the pre-training of transformers in natural language processing. Efficient-DETR [22], based on the original DETR and Deformable-DETR, examines the random initialization of object containers, including reference points and object queries, which is one of the reasons for requiring many training iterations. SMCA-DETR [23] introduces a spatially modulated co-attention module that replaces the existing co-attention mechanism in DETR to overcome slow training convergence, and TSP-DETR [24] deals with cross-attention and the instability of bipartite matching. Conditional-DETR [25] presents a conditional cross-attention mechanism, while WB-DETR [26] considers a CNN backbone for feature extraction an extra component and presents a transformer encoder–decoder network without a backbone. PnP-DETR [27] proposes a PnP sampling module to reduce spatial redundancy and improve computational efficiency. Dynamic-DETR [28] introduces dynamic attention in the encoder–decoder network, YOLOS-DETR [29] demonstrates the transferability and versatility of the transformer from image recognition to detection, Anchor-DETR [30] proposes object queries as anchor points, Sparse-DETR [31] reduces computational cost via token filtering, D2ETR [32] uses cross-scale attention in the decoder, FP-DETR [33] reformulates pre-training and fine-tuning, and CF-DETR [34] refines predicted locations to improve small object detection.
Further improvements target training stability and small object performance. DN-DETR [35] uses noised object queries as an additional decoder input to reduce the instability of the bipartite-matching mechanism in DETR, which causes the slow convergence problem. AdaMixer [36] considers the encoder an extra network between the backbone and decoder that limits performance and slows training convergence because of its design complexity; it proposes a 3D sampling process and a few other modifications in the decoder. REGO-DETR [37] proposes an RoI-based method for detection refinement to improve the attention mechanism in the detection transformer. DINO [38] uses positive and negative noised object queries to make training convergence faster and to enhance performance on small objects. Building on these improvements, Co-DETR [39] introduces collaborative hybrid assignments to improve training stability and convergence speed, addressing limitations in bipartite matching and small object performance. LW-DETR [40] focuses on efficiency, using a lightweight ViT encoder, a shallow decoder, and global attention to reduce computational cost while maintaining competitive accuracy. RT-DETR [41] combines a hybrid encoder with multi-scale feature processing and IoU-aware query selection to achieve adaptable inference speed, balancing high accuracy with real-time performance. These successive innovations collectively address the limitations of the original DETR while retaining its advantages as a fully end-to-end transformer-based object detector.
The rapid pace of advancements makes it difficult to track progress systematically; thus, a review of ongoing progress is needed to guide researchers in the field. This paper provides a detailed overview of recent advancements in detection transformers. Table 1 gives an overview of detection transformer (DETR) modifications that improve performance and training convergence. Many surveys have studied deep learning approaches in object detection [42,43,44,45,46,47]; Table 2 lists existing object detection surveys. Among these, several studies comprehensively review approaches that process different 2D data types [48,49,50,51], while others focus on specific 2D applications [52,53,54,55,56,57,58,59] or related tasks such as segmentation [60,61,62], image captioning [63,64,65,66], and object tracking [67]. Furthermore, some surveys examine deep learning methods and introduce vision transformers [68,69,70,71]. Nonetheless, most of these surveys were published before recent improvements in detection transformer networks, and a comprehensive review of transformer-based object detectors is still lacking. The main contributions of this paper are summarized as follows:
  • A detailed review of transformer-based detection methods from an architectural perspective. We categorize and summarize improvements in the detection transformer (DETR) according to backbone modifications, pre-training level, attention mechanism, query design, etc. This analysis aims to help researchers develop a deeper understanding of the key components of detection transformers in terms of performance indicators.
  • A performance evaluation of detection transformers. We evaluate improvements in detection transformers using the popular benchmark MS COCO [75]. We also highlight the advantages and limitations of these approaches.
  • An analysis of accuracy and computational complexity of improved versions of detection transformers. We present an evaluative comparison of state-of-the-art transformer-based detection methods with respect to attention mechanisms, backbone modifications, and query designs.
  • An overview of the key building blocks of detection transformers, directions for further performance improvements, and open research questions. We examine various key architectural design modules that affect network performance and training convergence to provide possible suggestions for future research. Readers interested in ongoing developments in detection transformers can refer to our GitHub repository: https://github.com/mindgarage-shan/transformer_object_detection_survey (accessed on 25 September 2025).
The remainder of the paper is organized as follows. Section 2 reviews object detection and the use of transformers across vision tasks. Section 3, the main part, explains the modifications to detection transformers in detail. Section 3.24 describes the evaluation protocol, and Section 4 provides a comparative evaluation of detection transformers. Section 5 discusses open challenges and future directions. Finally, Section 6 concludes the paper.

2. Object Detection and Transformers in Vision

2.1. Object Detection

This section explains the key concepts of object detection and previously used object detectors. A more detailed analysis of object detection concepts can be found in [74,76,77]. The object detection task localizes and recognizes objects in an image by providing a bounding box around each object and its category. These detectors are usually trained on datasets like PASCAL VOC [78] or MS COCO [75]. The backbone network extracts the features of the input image as feature maps [79]. Usually, the backbone network, such as ResNet-50 [80], is pre-trained on ImageNet [81] and then fine-tuned on downstream tasks [82,83,84,85,86,87]. Moreover, many works have also used visual transformers [3,88,89] as a backbone. Single-stage object detectors [3,4,90,91,92,93,94,95,96,97,98] use a single network, offering faster inference but typically lower accuracy than two-stage networks. Two-stage object detectors [1,2,7,79,99,100,101,102,103,104] contain two networks: one generates region proposals, and the other refines them into final bounding boxes and class labels.
Lightweight Detectors: Lightweight detectors are designed to be more computationally efficient than standard object detection models. These are real-time object detectors and can be employed on small devices. Examples include [105,106,107,108,109,110,111,112,113,114].
Three-Dimensional Object Detection: The primary purpose of 3D object detection is to recognize the objects of interest using a 3D bounding box and give a class label. Three-dimensional approaches fall into three categories: image-based [115,116,117,118,119,120,121], point cloud-based [122,123,124,125,126,127,128,129,130], and multimodal fusion-based [131,132,133,134,135].

2.2. Transformer for Segmentation

The self-attention mechanism can be employed for segmentation tasks [136,137,138,139,140] that provide pixel-level [141] prediction results. Panoptic segmentation [142] jointly solves semantic and instance segmentation tasks by providing per-pixel class and instance labels. Wang et al. [143] propose location-sensitive axial attention for the panoptic segmentation task on three benchmarks [75,144,145]. The above segmentation approaches use self-attention within CNN-based networks. Recently, segmentation transformers [137,139] containing encoder–decoder modules have provided new directions for employing transformers in segmentation tasks.

2.3. Transformers for Scene and Image Generation

Previously, text-to-image generation methods [146,147,148,149] were based on GANs [150]. Ramesh et al. [151] introduced a transformer-based model for generating high-quality images from provided text details. Transformer networks are also applied for image synthesis [152,153,154,155,156], which is important for learning unsupervised and generative models for downstream tasks. Feature learning with an unsupervised training procedure [153] achieves state-of-the-art performance on two datasets [157,158], while SimCLR [159] provides comparable performance on [160]. The iGPT image generation network [153] does not include pre-training procedures similar to language modeling tasks. However, unsupervised CNN-based networks [161,162,163] consider prior knowledge as the architectural layout, attention mechanism, and regularization. Generative adversarial networks (GAN) [150] with CNN-based backbones are appealing for image synthesis [164,165,166]. TransGAN [155] is a strong GAN network where the generator and discriminator contain transformer modules. These transformer-based networks boost performance for scene and image generation tasks.

2.4. Transformers for Low-Level Vision

Low-level vision analyzes images to identify their basic components and create an intermediate representation for further processing and higher-level tasks. After observing the remarkable performance of attention networks in high-level vision tasks [11,137], many attention-based approaches have been introduced for low-level vision problems, such as [167,168,169,170,171].

2.5. Transformers for Multi-Modal Tasks

Multi-modal tasks involve processing and combining information from multiple sources or modalities, such as text, images, audio, or video. The application of transformer networks in vision language tasks has also been widespread, including visual question-answering [172], visual commonsense-reasoning [173], cross-modal retrieval [174], and image captioning [175]. These transformer designs can be classified into single-stream [176,177,178,179,180,181] and dual-stream networks [182,183,184]. The primary distinction between these networks lies in the choice of loss functions.

3. Detection Transformers

This section briefly explains the detection transformer (DETR) and its improvements, as shown in Figure 2.

3.1. DETR

The detection transformer (DETR) [11] architecture is much simpler than CNN-based detectors like Faster R-CNN [185]: it removes the need for an anchor generation process and post-processing steps, such as non-maximum suppression (NMS), and provides an end-to-end detection framework. The DETR network has three main modules: a backbone network with positional encodings, an encoder, and a decoder network with an attention mechanism. The features extracted by the backbone network are flattened into a sequence of vectors and, together with positional encodings [186,187], fed to the encoder network. Here, self-attention over query, key, and value matrices, followed by multi-head attention and a feed-forward network, computes attention weights over the input sequence. The DETR decoder takes object queries in parallel along with the encoder output. It computes predictions by decoding N object queries in parallel. It uses a bipartite-matching algorithm to match the predicted objects to the ground-truth objects, as given in the following equation:
$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{k}^{N} \mathcal{L}_{m}\left(y_k, \hat{y}_{\sigma(k)}\right).$$
Here, $y_k$ denotes the set of ground-truth (GT) objects, padded with “no object” entries, where N is the total number of objects to be detected. $\mathcal{L}_{m}(y_k, \hat{y}_{\sigma(k)})$ represents the pairwise (duplicate-free) matching cost between the prediction with index $\sigma(k)$ and ground-truth $y_k$, as defined below:
$$\mathcal{L}_{m}\left(y_k, \hat{y}_{\sigma(k)}\right) = -\mathbb{1}_{\{c_k \neq \varnothing\}}\, \hat{p}_{\sigma(k)}(c_k) + \mathbb{1}_{\{c_k \neq \varnothing\}}\, \mathcal{L}_{bbox}\left(b_k, \hat{b}_{\sigma(k)}\right).$$
The next step is to compute the Hungarian loss by determining the optimal matching between ground-truth (GT) and detected boxes regarding the bounding-box region and label. The loss is reduced by stochastic gradient descent (SGD).
$$\mathcal{L}_{H}(y, \hat{y}) = \sum_{k=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(k)}(c_k) + \mathbb{1}_{\{c_k \neq \varnothing\}}\, \mathcal{L}_{box}\left(b_k, \hat{b}_{\hat{\sigma}(k)}\right) \right],$$
where $\hat{p}_{\hat{\sigma}(k)}(c_k)$ and $c_k$ are the predicted class probability and target label, respectively. The term $\hat{\sigma}$ is the optimal assignment; $b_k$ and $\hat{b}_{\hat{\sigma}(k)}$ are the ground-truth and predicted bounding boxes. The terms $\hat{y}$ and $y = \{(c_k, b_k)\}$ are the predictions and ground-truth objects, respectively. Specifically, the bounding box loss is a linear combination of the generalized IoU (GIoU) loss [188] and the L1 loss, as in the following equation:
$$\mathcal{L}_{bbox} = \lambda_{iou}\, \mathcal{L}_{iou}\left(b_k, \hat{b}_{\sigma(k)}\right) + \lambda_{L1}\, \left\| b_k - \hat{b}_{\sigma(k)} \right\|_1,$$
where $\lambda_{iou}$ and $\lambda_{L1}$ are hyperparameters. DETR can only predict a fixed number N of objects in a single pass. For the COCO dataset [75], the value of N is set to 100, as this dataset has 80 classes. The network does not need NMS to remove redundant predictions because it uses a bipartite matching loss with parallel decoding [189,190,191], whereas previous studies used RNN-based autoregressive decoding [192,193,194,195,196]. Despite its end-to-end design, DETR suffers from slow training convergence and lower accuracy on small objects. The uniform attention initialization and lack of multi-scale features make learning precise object locations difficult. These limitations motivated the development of several modifications aimed at improving convergence, computational efficiency, and small object detection.
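To make the matching step concrete, the following is a minimal sketch (not the authors' implementation) of DETR-style bipartite matching with the Hungarian algorithm from SciPy; the simplified cost uses only the class probability and L1 box terms, omits the GIoU term, and all tensor shapes are illustrative assumptions.

```python
# Minimal sketch of DETR-style bipartite matching (assumption: simplified cost
# with class probability and L1 box terms only; the GIoU term is omitted).
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=1.0, w_l1=5.0):
    """pred_logits: (N, C), pred_boxes: (N, 4), gt_labels: (M,), gt_boxes: (M, 4)."""
    prob = pred_logits.softmax(-1)                      # (N, C)
    cost_cls = -prob[:, gt_labels]                      # (N, M): -p(c_k) term
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # (N, M): L1 box distance
    cost = w_cls * cost_cls + w_l1 * cost_l1            # total matching cost L_m
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)

# Example: 100 object queries, 80 classes (+ "no object"), 3 ground-truth objects.
logits, boxes = torch.randn(100, 81), torch.rand(100, 4)
labels, gts = torch.tensor([3, 17, 42]), torch.rand(3, 4)
pi, gi = hungarian_match(logits, boxes, labels, gts)    # one prediction per GT
```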

3.2. Deformable-DETR

The attention module of DETR assigns a uniform weight to all pixels of the input feature map at initialization. These weights need many training epochs to converge to informative pixel locations, which requires high computation and extensive memory. The encoder’s self-attention has complexity $O(w_i^2 h_i^2 c_i)$, whereas the decoder’s cross-attention has complexity $O(h_i w_i c_i^2 + N h_i w_i c_i)$. Formally, $h_i$ and $w_i$ denote the height and width of the input feature map, respectively, and N represents the object queries fed as input to the decoder. Let $q \in \Omega_q$ denote a query element with feature $z_q \in \mathbb{R}^{c_i}$, and let $k \in \Omega_k$ denote a key element with feature $x_k \in \mathbb{R}^{c_i}$, where $c_i$ is the input feature dimension and $\Omega_k$ and $\Omega_q$ indicate the sets of key and query elements, respectively. Then, the multi-head attention (MHAttn) feature is computed as follows:
$$\mathrm{MHAttn}(z_q, x) = \sum_{j=1}^{J} W_j \left[ \sum_{k \in \Omega_k} A_{jqk} \cdot W_j' x_k \right],$$
where $j$ indexes the attention head, and $W_j \in \mathbb{R}^{c_i \times c_v}$ and $W_j' \in \mathbb{R}^{c_v \times c_i}$ are learnable weights ($c_v = c_i / J$ by default). The attention weights $A_{jqk} \propto \exp\left( \frac{z_q^{T} U_j^{T} V_j x_k}{\sqrt{c_v}} \right)$ are normalized such that $\sum_{k \in \Omega_k} A_{jqk} = 1$, in which $U_j, V_j \in \mathbb{R}^{c_v \times c_i}$ are also learnable weights. Deformable-DETR [20] modifies the attention modules, inspired by [197,198], to process the image feature map, considering the attention network the main reason for slow training convergence and confined feature spatial resolution. This module samples a small set of features near each reference point. Given an input feature map $x \in \mathbb{R}^{c_i \times h_i \times w_i}$, let query $q$ have content feature $z_q$ and a 2D reference point $r_q$; the deformable attention feature is computed as follows:
$$\mathrm{DeformAttn}(z_q, r_q, x) = \sum_{j=1}^{J} W_j \left[ \sum_{k=1}^{K} A_{jqk} \cdot W_j' x(r_q + \Delta r_{jqk}) \right],$$
where $\Delta r_{jqk}$ is the sampling offset of the $k$-th sampling point in the $j$-th attention head. Deformable-DETR takes roughly ten times fewer training epochs than the original DETR network. The complexity of self-attention becomes $O(h_i w_i c_i^2)$, which is linear in the spatial size $h_i w_i$, and the complexity of the cross-attention in the decoder becomes $O(N K c_i^2)$, which is independent of the spatial size. In Figure 3, the dark pink block indicates the deformable attention module in Deformable-DETR.
Multi-Scale Feature Maps: High-resolution input features improve detection performance, especially for small objects, but are computationally expensive. Deformable-DETR provides high-resolution features without a corresponding increase in computation. It uses a feature pyramid containing high- and low-resolution features rather than the original high-resolution input feature map. This feature pyramid covers 1/8, 1/16, and 1/32 of the input image resolution and contains the corresponding positional embeddings. Furthermore, Deformable-DETR replaces the attention module in DETR with the multi-scale deformable attention module to reduce computational complexity and improve performance. While Deformable-DETR accelerates training and improves small object detection, designing effective sampling offsets and managing multi-scale feature interactions remain critical to achieving optimal performance. Algorithm 1 illustrates the step-by-step implementation of the multi-scale deformable attention mechanism, complementing the mathematical formulation presented above.
Algorithm 1: Multi-scale deformable attention in Deformable-DETR.
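As an illustrative complement to the formulation above (and not a reproduction of Algorithm 1), the following is a minimal single-scale sketch of the deformable attention idea: each query attends to K sampled points around its reference point instead of all $h_i w_i$ locations. Module names, tensor shapes, and the clamping of sampling locations are assumptions made for brevity.

```python
# Minimal single-scale sketch of deformable attention: per query, predict K
# offsets and K attention weights, sample values at the offset locations, and
# take a weighted sum. Shapes and the clamp to [0, 1] are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformAttn(nn.Module):
    def __init__(self, dim=256, heads=8, points=4):
        super().__init__()
        self.heads, self.points = heads, points
        self.offsets = nn.Linear(dim, heads * points * 2)   # Delta r_{jqk}
        self.weights = nn.Linear(dim, heads * points)       # A_{jqk}
        self.value_proj = nn.Linear(dim, dim)                # W'_j
        self.out_proj = nn.Linear(dim, dim)                  # W_j

    def forward(self, z_q, ref, feat):
        """z_q: (B, Q, C) queries, ref: (B, Q, 2) as (x, y) in [0, 1], feat: (B, C, H, W)."""
        B, Q, C = z_q.shape
        v = self.value_proj(feat.flatten(2).transpose(1, 2))                  # (B, HW, C)
        v = v.transpose(1, 2).reshape(B * self.heads, C // self.heads, *feat.shape[-2:])
        off = self.offsets(z_q).view(B, Q, self.heads, self.points, 2)
        attn = self.weights(z_q).view(B, Q, self.heads, self.points).softmax(-1)
        loc = (ref[:, :, None, None, :] + off).clamp(0, 1) * 2 - 1            # to [-1, 1]
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.heads, Q, self.points, 2)
        sampled = F.grid_sample(v, loc, align_corners=False)                  # (B*h, C/h, Q, K)
        attn = attn.permute(0, 2, 1, 3).reshape(B * self.heads, 1, Q, self.points)
        out = (sampled * attn).sum(-1)                                        # weighted sum over K
        out = out.reshape(B, self.heads, C // self.heads, Q).permute(0, 3, 1, 2).reshape(B, Q, C)
        return self.out_proj(out)

out = SimpleDeformAttn()(torch.randn(2, 100, 256), torch.rand(2, 100, 2), torch.randn(2, 256, 32, 32))
```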

3.3. UP-DETR

Dai et al. [21] proposed a few modifications to pre-train DETR in a manner similar to pre-training transformers in NLP. Randomly sized patches cropped from the input image are used as object queries and fed to the decoder. The pre-training proposed by UP-DETR teaches the network to detect these randomly sized query patches. Algorithm 2 summarizes the pre-training procedure of UP-DETR, illustrating how random patches, query grouping, and attention masking are applied to improve convergence and feature learning. In Figure 3, the bright cyan block denotes UP-DETR. Two issues are addressed during pre-training: multi-task learning and multi-query localization.
Multi-Task Learning: The object detection task combines object localization and classification, although these tasks favor distinct features [199,200,201]. Patch detection pre-training can damage the classification features. To protect the classification features of the transformer, multi-task learning using patch feature reconstruction and a frozen pre-training backbone is proposed. The feature reconstruction loss is given as follows:
$$\mathcal{L}_{rec}\left(f_k, \hat{f}_{\hat{\sigma}(k)}\right) = \left\| \frac{f_k}{\left\| f_k \right\|_2} - \frac{\hat{f}_{\hat{\sigma}(k)}}{\left\| \hat{f}_{\hat{\sigma}(k)} \right\|_2} \right\|_2^2.$$
Here, $\mathcal{L}_{rec}$ is the feature reconstruction term: the mean-squared error between the $\ell_2$-normalized patch features obtained from the CNN backbone.
Multi-query Localization: The decoder of DETR takes object queries as input to focus on different positions and box sizes. When the number of object queries N (typically $N = 100$) is large, a single query group is unsuitable because of convergence issues. To solve the multi-query localization problem between object queries and patches, UP-DETR proposes an attention mask and query shuffle mechanism. The object queries are divided into X different groups, where each patch is assigned to $N/X$ object queries. The Softmax layer of the self-attention module in the decoder is modified by adding an attention mask inspired by [202] as follows:
$$P(q_i, k_i) = \mathrm{Softmax}\left(\frac{q_i k_i^{T}}{\sqrt{d}} + M\right) v_i,$$
$$M_{k,l} = \begin{cases} 0 & \text{if } q_k, q_l \text{ are in the same group} \\ -\infty & \text{otherwise}, \end{cases}$$
where $M_{k,l}$ controls the interaction between object queries $q_k$ and $q_l$. Although object queries are divided into groups during pre-training, they have no explicit groups in downstream training tasks. Therefore, the queries are randomly shuffled during pre-training, and 10% of the query patches are masked to zero, similar to dropout [203]. Although UP-DETR improves convergence and query learning, the pre-training may not always transfer perfectly to downstream detection tasks, and its grouping and masking mechanisms require careful tuning to avoid convergence or performance issues. Algorithm 2 shows the patch detection pre-training procedure, where random patches are cropped, assigned to query subsets with attention masking, and the model is trained to predict patch locations while reconstructing features, improving robustness and convergence.
Algorithm 2: Patch detection pre-training in UP-DETR.
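As a small illustration of the query-group masking used in this pre-training (a sketch, not Algorithm 2 itself), the snippet below builds the attention mask M so that entries across groups are set to negative infinity and are therefore zeroed by the Softmax; the query and group counts are placeholder assumptions.

```python
# Sketch of the UP-DETR query-group attention mask (the M defined above):
# queries assigned to the same patch group may attend to each other, while
# cross-group entries are set to -inf so the Softmax suppresses them.
import torch

def group_attention_mask(num_queries=100, num_groups=10):
    group_size = num_queries // num_groups
    group_id = torch.arange(num_queries) // group_size           # (N,)
    same_group = group_id[:, None] == group_id[None, :]          # (N, N) bool
    mask = torch.zeros(num_queries, num_queries)
    mask[~same_group] = float('-inf')                            # block cross-group attention
    return mask   # added to q.k^T / sqrt(d) before the Softmax

mask = group_attention_mask()            # 100 queries in 10 groups of 10
print(mask[0, 9], mask[0, 10])           # 0.0 within the group, -inf across groups
```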

3.4. Efficient-DETR

The performance of DETR also depends on the object queries, since the detection head obtains the final predictions from them; however, these object queries are randomly initialized at the start of training. Efficient-DETR [22], based on DETR and Deformable-DETR, examines the random initialization of object containers, including reference points and object queries, which is one of the reasons for requiring many training iterations. In Figure 3, the dull green box shows Efficient-DETR.
Efficient-DETR has two main modules: a dense module and a sparse module, which share the same final detection head. The dense module includes the backbone network, the encoder network, and the detection head. Following [204], it generates proposals by class-specific dense prediction using a sliding window and selects the top-k features as object queries and reference points. Efficient-DETR uses 4D boxes as reference points rather than 2D centers. The sparse module does the same work as the dense module, except for the output size. The features from the dense module are taken as the initial state of the sparse module, which is considered a good initialization of object queries. Both dense and sparse modules use a one-to-one assignment rule, as in [205,206,207]. However, Efficient-DETR adds architectural complexity, and the final performance heavily depends on the quality of the dense module’s proposals, making the approach sensitive to the selection of initial object queries and hyperparameters.
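The following is a minimal sketch of this kind of top-k query initialization (an assumption-laden simplification, not the paper's code): objectness scores from a dense head select the encoder features used as initial object queries and their 4D boxes as reference points.

```python
# Sketch of Efficient-DETR-style query initialization: pick the Top-k features
# by objectness score as object queries and their predicted 4D boxes as
# reference points. All tensor shapes are illustrative assumptions.
import torch

def select_topk_queries(enc_feats, obj_scores, pred_boxes, k=100):
    """enc_feats: (B, L, C), obj_scores: (B, L), pred_boxes: (B, L, 4)."""
    topk = obj_scores.topk(k, dim=1).indices                                      # (B, k)
    queries = torch.gather(enc_feats, 1, topk[..., None].expand(-1, -1, enc_feats.size(-1)))
    refs = torch.gather(pred_boxes, 1, topk[..., None].expand(-1, -1, 4))
    return queries, refs   # initial object queries and 4D reference boxes

q, r = select_topk_queries(torch.randn(2, 1000, 256), torch.rand(2, 1000), torch.rand(2, 1000, 4))
```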

3.5. SMCA-DETR

The decoder of DETR takes object queries as input, and these queries are responsible for detecting objects at various spatial locations. The object queries are combined with spatial features from the encoder. The co-attention mechanism in DETR computes a set of attention maps between the object queries and the image features to provide class labels and bounding box locations. However, the visual regions attended by an object query in the DETR decoder might be irrelevant to the predicted bounding boxes, which is one of the reasons DETR needs many training epochs to find suitable visual locations for identifying the corresponding objects correctly. Gao et al. [23] introduced a spatially modulated co-attention (SMCA) module that replaces the existing co-attention mechanism in DETR to overcome its slow training convergence. In Figure 4, the purple block represents SMCA-DETR. Each object query estimates the scale and center of its corresponding object, which are then used to set up a 2D spatial weight map. The initial estimates of the scale $(l_{h_i}, l_{w_i})$ and center $(e_{h_i}, e_{w_i})$ of a Gaussian-like distribution for object query $q$ are obtained as follows:
$$e_{h_i}^{nrm},\, e_{w_i}^{nrm} = \mathrm{sigmoid}\left(\mathrm{MLP}(q)\right),$$
$$l_{h_i},\, l_{w_i} = \mathrm{FC}(q),$$
where the object query $q$ predicts the center in normalized form through a sigmoid activation after a two-layer MLP. These predicted centers are un-normalized to obtain the center coordinates $e_{h_i}$ and $e_{w_i}$ in the input image. The object query also estimates the object scales $l_{h_i}$ and $l_{w_i}$. After predicting the object scale and center, SMCA generates a Gaussian-like weight map as follows:
$$W(x, y) = \exp\left( -\frac{(x - e_{w_i})^2}{\beta\, l_{w_i}^2} - \frac{(y - e_{h_i})^2}{\beta\, l_{h_i}^2} \right),$$
where β is the hyperparameter to regulate the bandwidth, and ( x , y ) is the spatial parameter of weight map W. It provides high attention to spatial locations closer to the center and low attention to spatial locations away from the center.
$$A_i = \mathrm{Softmax}\left(\frac{q_i k_i^{T}}{\sqrt{d}} + \log W\right) v_i.$$
Here, $A_i$ is the co-attention map. The only difference from the co-attention module in DETR is the addition of the logarithm of the spatial weight map $W$. The decoder attention thus concentrates near the predicted box regions, which limits the search locations and makes the network converge faster. SMCA-DETR improves training efficiency and small object detection. However, its success depends on accurate initial predictions of object centers and scales, making it sensitive to initialization and hyperparameters.
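For illustration, here is a minimal sketch of the Gaussian-like spatial weight map $W(x, y)$ defined above; it assumes the centers and scales have already been un-normalized to pixel units, and the specific values are placeholders.

```python
# Sketch of the SMCA Gaussian-like spatial weight map W(x, y): high weight near
# the predicted object center, decaying with distance scaled by the predicted size.
import torch

def smca_weight_map(center_w, center_h, scale_w, scale_h, W, H, beta=1.0):
    xs = torch.arange(W).float()[None, :].expand(H, W)       # x coordinates
    ys = torch.arange(H).float()[:, None].expand(H, W)       # y coordinates
    w_map = torch.exp(-(xs - center_w) ** 2 / (beta * scale_w ** 2)
                      - (ys - center_h) ** 2 / (beta * scale_h ** 2))
    return w_map   # its logarithm is added to the co-attention logits

wmap = smca_weight_map(center_w=40.0, center_h=25.0, scale_w=8.0, scale_h=6.0, W=80, H=50)
```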

3.6. TSP-DETR

TSP-DETR [24] deals with cross-attention and the instability of bipartite matching to overcome the slow training convergence of DETR. It proposes two modules based on an encoder network with a feature pyramid network (FPN) [79] to accelerate the training convergence of DETR. In Figure 4, the orange block indicates TSP-DETR. These modules are TSP-FCOS and TSP-RCNN, which build on the classical one-stage detector FCOS [208] and the classical two-stage detector Faster R-CNN [1], respectively. TSP-FCOS uses a new Feature of Interest (FoI) module to handle multi-level features in the transformer encoder. Both modules use the bipartite matching mechanism to accelerate training convergence.
TSP-FCOS: The TSP-FCOS module follows FCOS [208] in the design of the backbone and FPN [79]. First, the features extracted by the CNN backbone from the input image are fed to the FPN component to produce multi-level features. Two feature extraction heads, the classification head and the auxiliary head, use four convolutional layers and group normalization [209] and are shared across the feature pyramid stages. Then, the FoI classifier filters the concatenated output of these heads to select the top-scored features. Finally, the transformer encoder network takes these FoIs and their positional encodings as input and provides class labels and bounding boxes as output.
TSP-RCNN: Like TSP-FCOS, this module extracts features with the CNN backbone and produces multi-level features using the FPN component. In place of the two feature extraction heads used in TSP-FCOS, the TSP-RCNN module follows the design of Faster R-CNN [1]. It uses a region proposal network (RPN) to find regions of interest (RoIs) to refine further. Each RoI in this module has an objectness score as well as a predicted bounding box. RoIAlign [101] is applied on the multi-level feature maps to extract the RoI information. After passing through a fully connected network, these extracted features are fed to the transformer encoder as input. The positional information of these RoI proposals is given by the four values $(c_{nx}, c_{ny}, w_n, h_n)$, where $(c_{nx}, c_{ny}) \in [0, 1]^2$ is the normalized center and $(w_n, h_n) \in [0, 1]^2$ are the normalized height and width. Finally, the transformer encoder network takes these RoIs and their positional encodings as input for accurate predictions. The FCOS and RCNN modules in TSP-DETR accelerate training convergence and improve the performance of the DETR network.

3.7. Conditional-DETR

The cross-attention module in the DETR network needs high-quality content embeddings to predict accurate bounding boxes and class labels, and this reliance increases the training difficulty. Conditional-DETR [25] presents a conditional cross-attention mechanism to solve the training convergence issue of DETR. It differs from the original DETR in how the input keys $k_i$ and queries $q_i$ for cross-attention are formed. In Figure 4, the yellow box represents Conditional-DETR. The conditional queries are obtained from 2D coordinates along with the embedding output of the previous decoder layer. The candidate box predicted from a decoder embedding is as follows:
$$box = \mathrm{sig}\left(\mathrm{FFN}(e) + [\, r^{T}\ 0\ 0\,]^{T}\right).$$
Here, $e$ is the input embedding fed to the decoder. The $box$ is a 4D vector $[\, box_{cx}\ box_{cy}\ box_w\ box_h\,]$, with box center $(box_{cx}, box_{cy})$, width $box_w$, and height $box_h$. The $\mathrm{sig}()$ function normalizes the predictions to the range 0 to 1, and $\mathrm{FFN}()$ predicts the un-normalized box. The term $r$ is the un-normalized 2D coordinate of the reference point, which is $(0, 0)$ in the original DETR. This work either learns the reference point $r$ for each box or generates it from the corresponding object query. It learns queries for multi-head cross-attention from the input embeddings of the decoder. This spatial query makes each cross-attention head consider an explicit region, which helps localize the distinct regions for class labels and bounding boxes by narrowing down the spatial range.
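A minimal sketch of this box prediction step (under assumed layer sizes, not the authors' implementation) is shown below: the FFN output is shifted by the reference point on the center coordinates before the sigmoid.

```python
# Sketch of the box head formulation above: FFN(e) predicts an un-normalized box
# and the 2D reference point r is added to the center terms before the sigmoid.
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, emb, ref):
        """emb: (B, Q, C) decoder embeddings, ref: (B, Q, 2) un-normalized reference points."""
        delta = self.ffn(emb)                                      # (B, Q, 4): cx, cy, w, h
        offset = torch.cat([ref, torch.zeros_like(ref)], dim=-1)   # [r^T, 0, 0]^T
        return (delta + offset).sigmoid()                          # normalized boxes in [0, 1]

boxes = BoxHead()(torch.randn(2, 100, 256), torch.randn(2, 100, 2))
```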

3.8. WB-DETR

DETR extracts local features using a CNN backbone and obtains global context through the transformer encoder–decoder network. WB-DETR [26] argues that a CNN backbone for feature extraction in detection transformers is not compulsory. It is a transformer-only network without a backbone: the input image is serialized, and the local features in each independent token are fed directly to the encoder as input. The transformer self-attention network provides global information and can accurately capture the context between input image tokens. However, the local features within each token and the information between adjacent tokens still need to be modeled, as the transformer lacks local feature modeling ability. The LIE-T2T (Local Information Enhancement-T2T) module solves this issue by reorganizing and unfolding adjacent patches and attending to each patch’s channel dimension after unfolding. In Figure 5, the top-right block denotes the LIE-T2T module of WB-DETR. The iterative process of the LIE-T2T module is as follows:
$$P = \mathrm{stretch}\left(\mathrm{reshape}(P_i)\right),$$
$$Q = \mathrm{sig}\left(e_2 \cdot \mathrm{ReLU}(e_1 \cdot P)\right),$$
$$P_{i+1} = e_3 \cdot (P \cdot Q),$$
where the $\mathrm{reshape}$ function reorganizes $(l_1 \times c_1)$ patches into $(h_i \times w_i \times c_i)$ feature maps, and $\mathrm{stretch}$ denotes unfolding the $(h_i \times w_i \times c_i)$ feature maps into $(l_2 \times c_2)$ patches. Here, $e_1$, $e_2$, and $e_3$ are the parameters of fully connected layers, $\mathrm{ReLU}$ is the non-linear activation, and $\mathrm{sig}$ generates the final attention. The channel attention in this module provides local information because the relationship between the channels of the patches corresponds to the spatial relationship between the pixels of the feature maps.
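The following is a rough sketch of one LIE-T2T step under assumed kernel size, stride, and channel dimensions (not the authors' implementation): tokens are reshaped to a feature map, neighbouring patches are unfolded ("stretched"), and channel attention is applied before the final linear projection.

```python
# Sketch of one LIE-T2T step: reshape -> unfold (stretch) -> channel attention
# (sig(e2 * ReLU(e1 * P))) -> linear projection e3. Dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LIET2TStep(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=2, reduction=4):
        super().__init__()
        c_unf = c_in * k * k                        # channels after unfolding k x k patches
        self.e1 = nn.Linear(c_unf, c_unf // reduction)
        self.e2 = nn.Linear(c_unf // reduction, c_unf)
        self.e3 = nn.Linear(c_unf, c_out)
        self.k, self.stride = k, stride

    def forward(self, tokens, h, w):
        """tokens: (B, h*w, c_in) -> (B, l2, c_out) with l2 < h*w."""
        fmap = tokens.transpose(1, 2).reshape(-1, tokens.size(-1), h, w)     # reshape
        p = F.unfold(fmap, self.k, stride=self.stride, padding=1)            # stretch
        p = p.transpose(1, 2)                                                # (B, l2, c_in*k*k)
        q = torch.sigmoid(self.e2(F.relu(self.e1(p))))                       # channel attention Q
        return self.e3(p * q)                                                # P_{i+1}

out = LIET2TStep(64, 128)(torch.randn(2, 32 * 32, 64), 32, 32)
```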

3.9. PnP-DETR

The transformer processes the image feature maps that are transformed into a one-dimensional feature vector to produce the final results. Although effective, using the full feature map is expensive because of useless computation on background regions. PnP-DETR [27] proposes a poll and pool (PnP) sampling module to reduce spatial redundancy and make the transformer network computationally more efficient. This module divides the image feature map into contextual background features and fine foreground object features. Then, the transformer network uses these updated feature maps and translates them into the final detection results. In Figure 5, the bottom-left block indicates PnP-DETR. This PnP Sampling module includes two types of samplers: a pool sampler and a poll sampler, as explained below.
Poll Sampler: The poll sampler provides fine feature vectors V f . A meta-scoring module is used to find the informational value for every spatial location (x, y):
$$a_{xy} = \mathrm{ScoreNet}(v_{xy}, \theta_s).$$
The score value is directly related to the information content of the feature vector $v_{xy}$. These score values are sorted as follows:
$$[\, a_z, \ell \mid z = 1, \ldots, Z\,] = \mathrm{Sort}(a_{xy}),$$
where $Z = h_i w_i$ and $\ell$ is the sorting order. The top $N_s$-scoring vectors are selected to obtain the fine features:
$$V_f = [\, v_z \mid z = 1, \ldots, N_s\,].$$
Here, the predicted informative value is considered a modulating factor to sample the fine feature vectors:
$$V_f = [\, v_z \times a_z \mid z = 1, \ldots, N_s\,].$$
To make the learning stable, the feature vectors are normalized:
$$V_f = [\, \mathrm{L_{norm}}(v_z) \times a_z \mid z = 1, \ldots, N_s\,].$$
Here, $\mathrm{L_{norm}}$ is layer normalization, and $N_s = \alpha Z$, where $\alpha$ is the poll ratio factor. This sampling module reduces the training computation.
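The poll sampler can be sketched as follows (a simplified illustration under assumed shapes, not the paper's code): a small scoring network ranks the flattened spatial locations, the top $\alpha Z$ vectors are kept, and each kept vector is layer-normalized and modulated by its score.

```python
# Sketch of the PnP-DETR poll sampler: score every location, keep the top
# alpha fraction as fine foreground features, and modulate them by their scores.
import torch
import torch.nn as nn

class PollSampler(nn.Module):
    def __init__(self, dim=256, alpha=0.33):
        super().__init__()
        self.score_net = nn.Linear(dim, 1)    # ScoreNet(v_xy; theta_s)
        self.norm = nn.LayerNorm(dim)
        self.alpha = alpha

    def forward(self, feats):
        """feats: (B, H*W, C) flattened feature map -> (B, N_s, C) fine features."""
        scores = self.score_net(feats).squeeze(-1)                   # (B, H*W)
        n_keep = max(1, int(self.alpha * feats.size(1)))             # N_s = alpha * Z
        top = scores.topk(n_keep, dim=1)
        idx = top.indices[..., None].expand(-1, -1, feats.size(-1))
        fine = torch.gather(feats, 1, idx)                           # top-scored vectors
        return self.norm(fine) * top.values[..., None]               # score-modulated features

fine = PollSampler()(torch.randn(2, 1600, 256))    # keep roughly 33% as foreground features
```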
Pool Sampler: The poll sampler obtains the fine features of foreground objects, while the pool sampler compresses the remaining feature vectors of the background region into contextual information. It performs weighted pooling to obtain a small number of background features $M_b$, motivated by the double attention operation [210] and bilinear pooling [211]. The remaining feature vectors of the background region are as follows:
$$V_b = V \setminus V_f = \{\, v_b \mid b = 1, \ldots, Z - N_s\,\}.$$
The aggregation weights $a_b \in \mathbb{R}^{M_b}$ are obtained by projecting the features with the weight matrix $w_s \in \mathbb{R}^{c_i \times M_b}$ as follows:
$$a_b = v_b w_s.$$
The projected features with learnable weight $w_p \in \mathbb{R}^{c_i \times c_i}$ are obtained as follows:
$$v'_b = v_b w_p.$$
The aggregated weights are normalized over the non-sampled regions with Softmax as follows:
$$\tilde{a}_b^{\,m} = \frac{\exp\left(a_b^{\,m}\right)}{\sum_{b'=1}^{Z - N_s} \exp\left(a_{b'}^{\,m}\right)}.$$
By using the normalized aggregation weight, the new feature vector is obtained to provide information for non-sampled regions:
$$v_m = \sum_{b=1}^{Z - N_s} v'_b \times \tilde{a}_b^{\,m}.$$
By considering all Z aggregation weights, the coarse background contextual feature vector is as follows:
$$V_c = \{\, v_m \mid m = 1, \ldots, M_b\,\}.$$
The pool sampler provides contextual information at different scales using aggregation weights. Here, some feature vectors may provide local context while others may capture the global context.

3.10. Dynamic-DETR

Dynamic-DETR [28] introduces dynamic attention into the encoder–decoder network of DETR to address slow training convergence and the detection of small objects. First, a convolution-based dynamic encoder adds different types of attention to the self-attention module of the encoder network to make training convergence faster. The attention in this encoder adapts to several factors, such as spatial extent, scale, and input feature dimensions. Second, the cross-attention in the decoder network is replaced with RoI-based dynamic attention. This decoder helps the network focus on small objects, reduces learning difficulty, and speeds up convergence. In Figure 5, the bottom-right box represents Dynamic-DETR. The dynamic encoder–decoder network is explained in detail as follows.
Dynamic Encoder: Dynamic-DETR uses a convolutional approach for the self-attention module. Given the feature vectors $F = \{F_1, \cdots, F_n\}$, where $n = 5$ is the number of feature levels from the feature pyramid, the multi-scale self-attention (MSA) is as follows:
$$\mathrm{Attn} = \mathrm{MSA}(F) \cdot F.$$
However, this cannot be applied directly because the feature maps from the FPN have different scales. The feature maps of different scales are therefore equalized within neighboring levels using 2D convolution, as in pyramid convolution [212], which focuses on the spatial locations of the un-resized mid-layer and transfers information to its scaled neighbors. Moreover, SE [213] is applied to combine the features and provide scale attention.
Dynamic Decoder: The dynamic decoder uses mixed attention blocks in place of multi-head layers to ease learning in the cross-attention network and improve the detection of small objects. It also uses dynamic convolution instead of a cross-attention layer, inspired by ConvBERT [214] in natural language processing (NLP). First, RoI pooling [1] is introduced in the decoder network, and the position embeddings are replaced with a box encoding $BE \in \mathbb{R}^{p \times 4}$ in image coordinates. The output of the dynamic encoder, along with the box encoding $BE$, is fed to the dynamic decoder to pool image features $R \in \mathbb{R}^{p \times s \times s \times c_i}$ from the feature pyramid as follows:
$$R = \mathrm{RoIPool}(F_{encoder}, BE, s),$$
where $s$ is the pooling size, and $c_i$ is the number of channels of $F_{encoder}$. To feed this into the cross-attention module, input embeddings $q_e \in \mathbb{R}^{p \times c_i}$ are required for the object queries. These embeddings are passed through a multi-head self-attention (MHSAttn) layer as follows:
$$q_e^{*} = \mathrm{MHSAttn}(q_e, q_e, q_e).$$
Then, these query embeddings are passed through the fully connected layer (dynamic filters) as follows:
$$\mathrm{Filter}_{q_e} = \mathrm{FC}(q_e^{*}).$$
Finally, cross-attention between the features and object queries is performed with a $1 \times 1$ convolution using the dynamic filters $\mathrm{Filter}_{q_e}$:
$$q_e^{F} = \mathrm{Conv}_{1 \times 1}(F, \mathrm{Filter}_{q_e}).$$
These features are passed through FFN layers to provide various predictions as updated object-embedding, updated box-encoding, and the object class. This process eases the learning of the cross-attention module by focusing on sparse areas and then spreading to global regions.
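As a rough, simplified sketch of this dynamic cross-attention step (an assumption, not the paper's implementation, and omitting the RoI pooling itself), the snippet below lets each query embedding generate a per-query filter that is applied to its own pooled features:

```python
# Simplified sketch of query-conditioned ("dynamic") cross-attention: queries
# produce 1x1 filters via an FC layer, which score and aggregate their pooled
# RoI features. The pooled features are assumed to be given.
import torch
import torch.nn as nn

class DynamicCrossAttn(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.to_filter = nn.Linear(dim, dim)    # FC producing Filter_qe

    def forward(self, q_emb, roi_feats):
        """q_emb: (B, Q, C) query embeddings, roi_feats: (B, Q, S*S, C) pooled features."""
        filt = self.to_filter(q_emb)                              # (B, Q, C)
        # 1x1 "convolution" of each query's filter with its pooled features
        attn = torch.einsum('bqsc,bqc->bqs', roi_feats, filt)     # per-position response
        out = torch.einsum('bqs,bqsc->bqc', attn.softmax(-1), roi_feats)
        return out                                                # updated object embeddings

out = DynamicCrossAttn()(torch.randn(2, 100, 256), torch.randn(2, 100, 49, 256))
```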

3.11. YOLOS-DETR

Vision transformer (ViT) [8], inherited from NLP, performs well on the image recognition task. ViT-FRCNN [215] uses a pre-trained ViT as the backbone of a CNN-based detector; it utilizes convolutional neural networks and relies on strong 2D inductive biases and region-wise pooling operations for object-level perception. Other related works, such as DETR [11], introduce 2D inductive bias using CNNs and pyramidal features. YOLOS-DETR [29] demonstrates the transferability and versatility of the transformer from image recognition to detection from a pure sequence perspective, using minimal knowledge of the spatial structure of the input. It closely follows the ViT architecture with two simple modifications. First, it removes the image classification ([CLS]) token and adds one hundred randomly initialized detection ([DET]) tokens, as in [216], to the input patch embeddings for object detection. Second, similar to DETR, a bipartite matching loss is used instead of the ViT classification loss. The transformer encoder takes the generated sequence as input as follows:
$$s_0 = [\, I_p^{1} L;\ \cdots;\ I_p^{n_i} L;\ I_d^{1};\ \cdots;\ I_d^{100}\,] + \mathrm{PE},$$
where $I \in \mathbb{R}^{h_i \times w_i \times c_i}$ is the input image, reshaped into 2D tokens $I_p \in \mathbb{R}^{n_i \times (r^2 \cdot c_i)}$. Here, $h_i$ is the height and $w_i$ is the width of the input image, $c_i$ is the number of channels, $(r, r)$ is the resolution of each token, and $n_i = h_i w_i / r^2$ is the total number of tokens. These tokens are mapped to $D_i$ dimensions with a linear projection $L \in \mathbb{R}^{(r^2 \cdot c_i) \times D_i}$, and the result of this projection is $I_p L$. The encoder also takes one hundred randomly initialized learnable tokens $I_d \in \mathbb{R}^{100 \times D_i}$. To keep the positional information, positional embeddings $\mathrm{PE} \in \mathbb{R}^{(n_i + 100) \times D_i}$ are also added. The transformer encoder contains a multi-head self-attention mechanism and one MLP block with a GELU [217] non-linear activation function, and layer normalization (LN) [218] is applied before each self-attention and MLP block as follows:
$$s'_n = \mathrm{MHSAttn}\left(\mathrm{LN}(s_{n-1})\right) + s_{n-1},$$
$$s_n = \mathrm{MLP}\left(\mathrm{LN}(s'_n)\right) + s'_n,$$
where $s_n$ is the token sequence at the $n$-th encoder layer. In Figure 6, the top-right block indicates YOLOS-DETR.
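A minimal sketch of the input sequence $s_0$ defined above follows; the patch size, embedding dimension, and the use of a strided convolution as the linear projection are assumptions.

```python
# Sketch of the YOLOS input sequence s_0: patch tokens from a linear projection,
# 100 learnable [DET] tokens, and positional embeddings added to the whole sequence.
import torch
import torch.nn as nn

class YOLOSEmbedding(nn.Module):
    def __init__(self, img=224, patch=16, c_in=3, dim=768, num_det=100):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.proj = nn.Conv2d(c_in, dim, kernel_size=patch, stride=patch)    # linear projection L
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det, dim))         # [DET] tokens I_d
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + num_det, dim))

    def forward(self, x):
        """x: (B, 3, H, W) -> s_0: (B, n_patches + 100, dim)."""
        patches = self.proj(x).flatten(2).transpose(1, 2)           # (B, n_patches, dim)
        det = self.det_tokens.expand(x.size(0), -1, -1)
        return torch.cat([patches, det], dim=1) + self.pos_embed    # s_0 = [I_p L; I_d] + PE

s0 = YOLOSEmbedding()(torch.randn(2, 3, 224, 224))    # (2, 196 + 100, 768)
```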

3.12. Anchor-DETR

DETR uses learnable embeddings as object queries in the decoder network. These input embeddings do not have a clear physical meaning and do not indicate where to focus. The network is therefore hard to optimize, as the object queries do not concentrate on specific regions. Anchor-DETR [30] solves this issue by formulating object queries as anchor points, which are extensively used in CNN-based object detectors. This query design can provide multiple object predictions in one region. Moreover, a few modifications to the attention are proposed that reduce memory cost and improve performance. In Figure 6, the yellow block shows Anchor-DETR. The two main contributions of Anchor-DETR, the query design and the attention variant, are explained as follows.
Row and Column Decoupled-Attention: DETR requires huge GPU memory, as in [219,220], because of the complexity of the cross-attention module. It is more complex than the self-attention module in the decoder. Although Deformable-DETR reduces memory cost, it still causes random memory access, making the network slower. Row–column decoupled attention (RCDA), as shown in the blue block of Figure 6, reduces memory and provides similar or better efficiency.
Anchor Points as Object Queries: CNN-based object detectors consider anchor points as relative positions on the input feature maps. In contrast, transformer-based detectors take uniform grid locations, handcrafted locations, or learned locations as anchor points. Anchor-DETR considers two types of anchor points: learned anchor locations and grid anchor locations. The grid anchor locations are fixed grid points on the input image. The learned anchor locations are randomly initialized from a uniform distribution between 0 and 1 and updated as learnable parameters.

3.13. Sparse-DETR

Sparse-DETR [31] filters the encoder tokens by a learnable cross-attention map predictor. After distinguishing these tokens in the decoder network, it focuses only on foreground tokens to reduce computational costs.
Sparse-DETR introduces a scoring module and aux-heads in the encoder, and a top-k query selection module for the decoder. In Figure 6, the light orange box represents Sparse-DETR. First, it determines the saliency of the tokens fed to the encoder using the scoring network, which selects the top $\rho\%$ of tokens. Second, the aux-head takes the top-k tokens from the output of the encoder network. Finally, the top-k tokens are used as the decoder object queries. The salient token prediction module refines only the encoder tokens selected from the backbone feature map according to the keeping ratio $\rho$ and updates the features $x_{l-1}$ as follows:
$$x_l^{m} = \begin{cases} x_{l-1}^{m} & m \notin \Omega_s \\ \mathrm{LN}\left(\mathrm{FFN}(y_l^{m}) + y_l^{m}\right) & m \in \Omega_s, \end{cases}$$
$$\text{where } y_l^{m} = \mathrm{LN}\left(\mathrm{DeformAttn}(x_{l-1}^{m}, x_{l-1}) + x_{l-1}^{m}\right),$$
where DeformAttn is the deformable attention, FFN is the feed-forward network, LN is layer normalization, and $\Omega_s$ is the set of salient tokens selected by the scoring network. Then, the decoder cross-attention map (DAM) accumulates the attention weights of the decoder object queries, and the network is trained by minimizing the loss between the prediction and the binarized DAM as follows:
$$\mathcal{L}_{dam} = \frac{1}{M} \sum_{k=1}^{M} \mathrm{BCELoss}\left(s_n(x_f)_k, \mathrm{DAM}_k^{b}\right),$$
where BCELoss is the binary cross-entropy (BCE) loss, $\mathrm{DAM}_k^{b}$ is the binarized DAM value of the $k$-th encoder token, and $s_n$ is the scoring network. In this way, Sparse-DETR reduces the computation by significantly pruning encoder tokens.
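The salient-token selection can be sketched as follows (a simplified illustration; the scoring network, refinement layer, and keeping ratio are placeholder assumptions): only the top $\rho$ fraction of tokens is refined, while the rest pass through the encoder layer unchanged.

```python
# Sketch of Sparse-DETR-style sparse encoder refinement: score all tokens, refine
# only the top rho fraction (the salient set), and keep the others as-is.
import torch
import torch.nn as nn

def sparse_encoder_layer(tokens, score_net, refine_layer, rho=0.1):
    """tokens: (B, L, C); refine_layer updates only the top rho*L salient tokens."""
    scores = score_net(tokens).squeeze(-1)                  # (B, L) saliency scores
    k = max(1, int(rho * tokens.size(1)))
    idx = scores.topk(k, dim=1).indices                     # salient token indices (Omega_s)
    out = tokens.clone()                                    # non-salient tokens kept as-is
    gathered = torch.gather(tokens, 1, idx[..., None].expand(-1, -1, tokens.size(-1)))
    refined = refine_layer(gathered)                        # stands in for deformable attn + FFN
    out.scatter_(1, idx[..., None].expand(-1, -1, tokens.size(-1)), refined)
    return out

score_net = nn.Linear(256, 1)
refine = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
x = sparse_encoder_layer(torch.randn(2, 1000, 256), score_net, refine)
```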

3.14. D2ETR

Much work [20,22,23,24,25] has aimed to make training convergence faster by modifying the cross-attention module, and several works [20] use multi-scale feature maps to improve performance on small objects; however, the high computational complexity has remained largely unaddressed. D2ETR [32] achieves better performance with a low computational cost. Without an encoder module, its decoder directly uses the finely fused feature maps provided by the backbone network through a novel cross-scale attention module. D2ETR contains two main modules: a backbone and a decoder. The backbone, based on the pyramid vision transformer (PVT), consists of two parallel streams: one for cross-scale interaction and another for intra-scale interaction. This backbone contains four transformer levels to provide multi-scale feature maps. All levels have the same architecture, depending on the basic module of the selected transformer. The backbone also contains three fusing levels in parallel with the four transformer levels; these fusing levels provide cross-scale fusion of the input features. The i-th fusing level is shown in the light green block of Figure 7. The cross-scale attention is formulated as follows:
$$f_j = L_j(f_{j-1}),$$
$$f_j^{*} = \mathrm{SA}(f_q, f_k, f_v),$$
$$f_q = f_j, \quad f_k = f_v = [\, f_1^{*}, f_2^{*}, \ldots, f_{j-1}^{*}, f_j\,],$$
where $f_j^{*}$ is the fused form of the feature map $f_j$. With $L$ denoting the last level, the final results of cross-scale attention are $f_1^{*}, f_2^{*}, \ldots, f_L^{*}$, and the last-level fused map is fed to the decoder. The decoder takes object queries in parallel and provides output embeddings that are independently transformed into class labels and box coordinates by a feed-forward network. By skipping the encoder and attending directly to the finely fused backbone features with cross-scale attention, D2ETR attains better performance at a low computational cost.

3.15. FP-DETR

Modern CNN-based detectors like YOLO [221] and Faster R-CNN [1] utilize specialized layers on top of backbones pre-trained on ImageNet to enjoy pre-training benefits such as improved performance and faster training convergence. The DETR network and its improved versions [21] only pre-train the backbone while training both encoder and decoder layers from scratch; thus, the transformer needs massive training data for fine-tuning. The main reason for not pre-training the detection transformer is the difference between the pre-training and final detection tasks. First, the decoder module of the transformer takes multiple object queries as input for detecting objects, while ImageNet classification takes only a single query (the class token). Second, the self-attention module and the projections on input query embeddings in the cross-attention module easily overfit a single class query, making the decoder network difficult to pre-train. Moreover, the downstream detection task involves both classification and localization, while the upstream task considers only classification of the objects of interest.
FP-DETR [33] reformulates the pre-training and fine-tuning stages for detection transformers. In Figure 7, the pink block indicates FP-DETR. It pre-trains only the encoder network of the detection transformer, as it is challenging to pre-train the decoder on the ImageNet classification task. Moreover, DETR uses both the encoder and the CNN backbone as feature extractors; FP-DETR replaces the CNN backbone with a multi-scale tokenizer and uses the encoder network to extract features. It fully pre-trains Deformable-DETR on the ImageNet dataset and fine-tunes it for the final detection task, achieving competitive performance.

3.16. CF-DETR

CF-DETR [34] observes that, for small objects, the COCO-style average precision (AP) of detection transformers at low IoU thresholds is better than that of CNN-based detectors. It refines the predicted locations by utilizing local information, since incorrect bounding box locations reduce performance on small objects. CF-DETR introduces a transformer-enhanced FPN (TEF) module, coarse layers, and fine layers into the decoder network of DETR. In Figure 7, the blue box represents CF-DETR. The TEF module provides the same functionality as an FPN, taking non-local E4 features extracted from the backbone and E5 features taken from the encoder output. The features of the TEF module and the encoder network are fed to the decoder as input. The decoder introduces a coarse block and a fine block. The coarse block selects foreground features from the global context. The fine block has two modules, adaptive scale fusion (ASF) and local cross-attention (LCA), which further refine the coarse boxes. In summary, these modules refine and enrich the features by fusing global and local information to improve detection transformer performance.

3.17. DAB-DETR

DAB-DETR [72] uses bounding box coordinates as object queries in the decoder and gradually updates them in every layer. In Figure 8, the purple block indicates DAB-DETR. These box coordinates make training convergence faster by providing positional information and by using the height and width values to modulate the positional attention map. This type of object query provides a better spatial prior for the attention mechanism and a simple query formulation.
The decoder network contains two main components: a self-attention network to update the queries and a cross-attention network for feature probing. The difference between the self-attention of the original DETR and DAB-DETR is that the query and key matrices also carry positional information taken from the bounding box coordinates. The cross-attention module concatenates the positional and content information in the key and query matrices and assigns them to the corresponding heads. The decoder takes input embeddings as content queries and anchor boxes as positional queries to find the object probabilities related to the anchors and content queries. In this way, dynamic box coordinates used as object queries provide better predictions, making training convergence faster and improving detection results for small objects.

3.18. DN-DETR

DN-DETR [35] uses noised object queries as an additional decoder input to reduce the instability of the bipartite-matching mechanism in DETR, which causes the slow convergence problem. In Figure 8, the dark green block indicates DN-DETR. The decoder queries have two parts: the denoising part, containing noised ground-truth box–label pairs as input, and the matching part, containing learnable anchors as input. The matching part $M = \{M_0, M_1, \ldots, M_{l-1}\}$ determines the resemblance between the ground-truth label pairs and the decoder output, while the denoising part $d = \{d_0, d_1, \ldots, d_{k-1}\}$ attempts to reconstruct the ground-truth objects as follows:
$$\mathrm{Output} = \mathrm{Decoder}(d, M, I \mid A),$$
where $I$ denotes the image features taken as input from the transformer encoder, and $A$ is the attention mask that prevents information transfer between the matching and denoising parts and among different noised versions of the same ground-truth objects. The decoder receives several noised versions of the ground-truth objects, where noise is added to the bounding boxes and class labels (e.g., label flipping), and a hyperparameter $\lambda$ controls the noise level. The training architecture of DN-DETR is based on DAB-DETR, as it also takes bounding box coordinates as object queries; the only difference between the two architectures is a class label indicator as an additional decoder input to assist label denoising. The bounding boxes are updated inconsistently in DAB-DETR, making relative offset learning challenging. The denoising training mechanism in DN-DETR improves performance and training convergence.
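A minimal sketch of this query noising (with an assumed noise scale $\lambda$ and label flip probability, not the authors' exact scheme) is as follows:

```python
# Sketch of denoising query construction: jitter ground-truth boxes relative to
# their size and randomly flip some labels, producing one noised query group.
import torch

def make_denoising_queries(gt_boxes, gt_labels, num_classes, lam=0.4, flip_p=0.2):
    """gt_boxes: (M, 4) normalized cxcywh, gt_labels: (M,) integer class ids."""
    noise = (torch.rand_like(gt_boxes) * 2 - 1) * lam            # uniform in [-lam, lam]
    noised_boxes = (gt_boxes + noise * gt_boxes[:, 2:].repeat(1, 2)).clamp(0, 1)
    flip = torch.rand_like(gt_labels.float()) < flip_p           # label flipping mask
    rand_labels = torch.randint(0, num_classes, gt_labels.shape)
    noised_labels = torch.where(flip, rand_labels, gt_labels)
    return noised_boxes, noised_labels   # one noised group; several levels are used in practice

nb, nl = make_denoising_queries(torch.rand(3, 4), torch.tensor([1, 5, 7]), num_classes=80)
```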

3.19. AdaMixer

AdaMixer [36] regards the encoder as an extra network between the backbone and the decoder that limits performance and slows training convergence because of its design complexity. AdaMixer therefore provides a detection transformer without an encoder. In Figure 8, the light green box represents AdaMixer. Its main modules are explained as follows.
Three-dimensional feature space: The input feature map from the CNN backbone with downsampling stride $s_i^f$ is first transformed by a linear layer to a common channel dimension $d_f$, and the coordinate of its z-axis is computed as follows:
$z_i^f = \log_2\!\left(s_i^f / s_b\right),$
where the height $h_i$ and width $w_i$ of the feature maps (with different strides) are rescaled to $h_i / s_b$ and $w_i / s_b$, with $s_b = 4$.
Three-dimensional feature-sampling process: In the sampling process, each query generates $I_p$ groups of offset vectors for $I_p$ points, $\{(\Delta x_j, \Delta y_j, \Delta z_j)\}_{I_p}$, where each group is obtained from the content vector $q_i$ through a linear layer $L_i$ as follows:
$\{(\Delta x_j, \Delta y_j, \Delta z_j)\}_{I_p} = L_i(q_i).$
These offset values are converted into sampling positions with regard to the position vector of the object query as follows:
$\tilde{x}_j = x + \Delta x_j \cdot 2^{\,z - r}, \quad \tilde{y}_j = y + \Delta y_j \cdot 2^{\,z + r}, \quad \tilde{z}_j = z + \Delta z_j.$
The interpolation over the 3D feature space first samples features by bilinear interpolation in the $(x, y)$ plane and then interpolates along the z-axis with Gaussian weighting, where the weight for the $i$-th feature map is as follows:
$\tilde{w}_i = \dfrac{\exp\!\big(-(\tilde{z} - z_i^f)^2 / \Gamma_z\big)}{\sum_i \exp\!\big(-(\tilde{z} - z_i^f)^2 / \Gamma_z\big)},$
where $\Gamma_z$ is the softening coefficient used to interpolate values over the z-axis ($\Gamma_z = 2$). This sampling scheme makes decoder learning easier because features are sampled adaptively according to each query.
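As a concrete illustration of the z-axis interpolation, the sketch below computes the Gaussian weights over feature levels for sampled z coordinates; the function and variable names are illustrative, and gamma_z corresponds to $\Gamma_z$ above.

```python
import torch

def z_interpolation_weights(z_tilde, z_levels, gamma_z=2.0):
    """Gaussian weighting over feature levels for sampled z coordinates.
    z_tilde: (num_points,) sampled z values; z_levels: (num_levels,) z_i^f of each feature map.
    Returns (num_points, num_levels) weights that sum to 1 over the levels."""
    diff = z_tilde.unsqueeze(-1) - z_levels.unsqueeze(0)   # (num_points, num_levels)
    return torch.softmax(-(diff ** 2) / gamma_z, dim=-1)

# Four feature levels with strides 4, 8, 16, 32 and s_b = 4 give z^f = 0, 1, 2, 3.
z_levels = torch.tensor([0.0, 1.0, 2.0, 3.0])
z_tilde = torch.tensor([0.3, 2.6])
print(z_interpolation_weights(z_tilde, z_levels))
```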
AdaMixer decoder: The decoder module in AdaMixer takes a content vector $q_i$ and a positional vector $(x_i, y_i, z_i, r_i)$ as input object queries. Position-aware multi-head self-attention is applied between these queries as follows:
$\mathrm{Attn}(q_i, k_i, v_i) = \mathrm{Softmax}\!\left(\dfrac{q_i k_i^{T}}{\sqrt{d}} + \alpha X\right) v_i,$
where $X_{kl} = \log\!\left(\frac{|box_k \cap box_l|}{|box_k|} + \epsilon\right)$ with $\epsilon = 10^{-7}$. A value of $X_{kl} \approx 0$ indicates that $box_k$ lies inside $box_l$, whereas $X_{kl} = \log(\epsilon)$ (a large negative value) indicates no overlap between $box_k$ and $box_l$. This position vector is updated at every stage of the decoder. The AdaMixer decoder takes a content vector and a positional vector as input object queries. For this, multi-scale features taken from the CNN backbone are converted into a 3D feature space, since the decoder must reason over the $(x, y)$ plane as well as adapt to the scales of detected objects. It takes the sampled features from this feature space as input and applies the adaptive mixing mechanism to produce final predictions for the input queries, without using an encoder network, thereby reducing the computational complexity of detection transformers.
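The positional bias $X$ can be computed directly from the decoded boxes, as in the following illustrative sketch; boxes are assumed to be in $(x_1, y_1, x_2, y_2)$ format, and the example values only demonstrate the two limiting cases discussed above.

```python
import torch

def position_bias(boxes, eps=1e-7):
    """Pairwise bias X[k, l] = log(|box_k ∩ box_l| / |box_k| + eps) for boxes in
    (x1, y1, x2, y2) format; it is added (scaled by alpha) to the attention logits."""
    x1 = torch.max(boxes[:, None, 0], boxes[None, :, 0])
    y1 = torch.max(boxes[:, None, 1], boxes[None, :, 1])
    x2 = torch.min(boxes[:, None, 2], boxes[None, :, 2])
    y2 = torch.min(boxes[:, None, 3], boxes[None, :, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)           # (K, K) intersections
    area_k = ((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]))[:, None]
    return torch.log(inter / area_k + eps)

boxes = torch.tensor([[0.0, 0.0, 1.0, 1.0],     # large box
                      [0.2, 0.2, 0.4, 0.4],     # fully inside the first box
                      [2.0, 2.0, 3.0, 3.0]])    # disjoint from both
print(position_bias(boxes))   # row 1: ~0 w.r.t. box 0 (inside), ~log(eps) w.r.t. box 2
```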

3.20. REGO-DETR

REGO-DETR [37] proposes an RoI-based refinement method to improve the attention mechanism in DETR. In Figure 9, the purple block denotes REGO-DETR. It contains two main modules: a multi-level recurrent mechanism and a glimpse-based decoder. In the multi-level recurrent mechanism, bounding boxes detected at the previous level are used to extract glimpse features, which are converted into refined attention with the help of the earlier attention outputs that describe objects. The $k$-th processing level is as follows:
$O_{class}^{(k)} = DF_{class}\!\left(H_{de}^{(k)}\right), \quad O_{bbox}^{(k)} = DF_{bbox}\!\left(H_{de}^{(k)}\right) + O_{bbox}^{(k-1)},$
where $O_{class} \in \mathbb{R}^{M_d \times M_c}$ and $O_{bbox} \in \mathbb{R}^{M_d \times 4}$. Here, $M_d$ and $M_c$ represent the total number of predicted objects and classes, respectively. $DF_{class}$ and $DF_{bbox}$ are functions that convert the input features into the desired outputs, and $H_{de}^{(k)}$ is the decoded attention of this level, defined as follows:
$H_{de}^{(k)} = \left[H_{gm}^{(k)}, H_{de}^{(k-1)}\right],$
where $H_{gm}^{(k)}$ denotes the glimpse features obtained according to $H_{de}^{(k-1)}$ and previous levels. These glimpse features are transformed by multi-head cross-attention into refined attention outputs conditioned on the previous attention outputs as follows:
$H_{gm}^{(k)} = \mathrm{Attn}\!\left(V^{(k)}, H_{de}^{(k-1)}\right).$
The glimpse features $V^{(k)}$ are extracted with the following operation:
$V^{(k)} = FE_{ext}\!\left(X, RI\!\left(O_{bbox}^{(k-1)}, \alpha^{(k)}\right)\right),$
where $FE_{ext}$ is the feature extraction function, $\alpha^{(k)}$ is a scalar parameter, and $RI$ denotes the RoI computation. In this way, region of interest (RoI)-based refinement makes the training of the detection transformer converge faster and improves performance.
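A rough sketch of the glimpse-feature extraction step is given below, using torchvision's RoIAlign as a stand-in for the feature extraction function $FE_{ext}$; the enlargement factor alpha and the pooled output size are illustrative choices, not values from the REGO-DETR implementation.

```python
import torch
from torchvision.ops import roi_align

def glimpse_features(feature_map, boxes, alpha=1.5, output_size=7, spatial_scale=1.0):
    """Extract glimpse features from RoIs enlarged by a scalar alpha (illustrative value).
    feature_map: (1, C, H, W); boxes: (K, 4) as (x1, y1, x2, y2) in image coordinates."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]) * alpha
    h = (boxes[:, 3] - boxes[:, 1]) * alpha
    enlarged = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    # RoIAlign pools a fixed-size grid of features from each enlarged region.
    return roi_align(feature_map, [enlarged], output_size=output_size,
                     spatial_scale=spatial_scale, aligned=True)       # (K, C, 7, 7)

features = torch.randn(1, 256, 50, 50)
boxes = torch.tensor([[10.0, 10.0, 20.0, 25.0], [30.0, 5.0, 45.0, 40.0]])
print(glimpse_features(features, boxes).shape)    # torch.Size([2, 256, 7, 7])
```

The pooled glimpse features play the role of $V^{(k)}$ and are then attended over by the previous decoded attention $H_{de}^{(k-1)}$.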

3.21. DINO

DN-DETR adds positive noise to the anchors taken as object queries at the decoder input and assigns labels only to those anchors with ground-truth objects nearby. Building on DAB-DETR and DN-DETR, DINO [38] proposes a mixed object query selection method for anchor initialization and a look-forward-twice mechanism for box prediction. It introduces a contrastive denoising (CDN) module, which takes positional queries as anchor boxes and adds an additional DN loss. In Figure 9, the red block indicates DINO. This detector uses two hyperparameters $\lambda_1$ and $\lambda_2$ with $\lambda_1 < \lambda_2$. Given a ground-truth bounding box $b_i = (x_i, y_i, w_i, h_i)$ taken as input to the decoder, its corresponding generated anchor is denoted as $a_i$, and the average top-$k$ distance between boxes and anchors is defined as follows:
$ATD(k) = \frac{1}{k} \sum \left\{ \mathrm{MK}\!\left(\left\{\|b_0 - a_0\|_1, \|b_1 - a_1\|_1, \ldots, \|b_{N-1} - a_{N-1}\|_1\right\}, k\right) \right\},$
where $\|b_i - a_i\|_1$ is the distance between a bounding box and its anchor, and $\mathrm{MK}(x, k)$ is the function that returns the top-$k$ elements of $x$. The $\lambda$ parameters act as thresholds on the noise added to the anchors that are fed as object queries to the decoder, yielding two types of anchor queries: positive queries, whose noise scale is smaller than $\lambda_1$, and negative queries, whose noise scale lies between $\lambda_1$ and $\lambda_2$. In this way, anchors with no ground-truth object nearby are labeled as “no object”. Thus, DINO speeds up training convergence and improves performance on small objects.
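The contrastive denoising idea can be sketched as follows; the noise model and the $\lambda$ values are illustrative simplifications of the scheme described above rather than the exact procedure used in DINO.

```python
import torch

def contrastive_denoising_boxes(gt_boxes, lambda1=0.5, lambda2=1.0):
    """For each ground-truth box, create one positive query (noise scale < lambda1)
    and one negative query (noise scale between lambda1 and lambda2).
    gt_boxes: (N, 4) normalized (cx, cy, w, h); lambda values are illustrative."""
    def add_noise(boxes, low, high):
        scale = low + (high - low) * torch.rand_like(boxes)            # noise magnitude
        sign = torch.randint(0, 2, boxes.shape).to(boxes.dtype) * 2 - 1
        noise = sign * scale * boxes[:, 2:].repeat(1, 2)               # scaled by box w, h
        return (boxes + noise).clamp(min=1e-4, max=1.0)

    positives = add_noise(gt_boxes, 0.0, lambda1)      # supervised with the ground-truth class
    negatives = add_noise(gt_boxes, lambda1, lambda2)  # supervised as "no object"
    return positives, negatives

gt = torch.tensor([[0.5, 0.5, 0.2, 0.3]])
pos, neg = contrastive_denoising_boxes(gt)
print(pos, neg)
```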
DINOv2 [222], a self-supervised vision transformer model developed by Meta AI, was trained on a large-scale dataset of 142 million images without any labels or annotations. It produces high-performance visual features that can be employed directly with classifiers as simple as linear layers on a variety of computer vision tasks; these features are robust and perform well across domains without any fine-tuning. DINOv3 [223], also developed by Meta AI, is the third generation of the DINO framework: a 7-billion-parameter vision transformer trained on 1.7 billion unlabeled images. It introduces several innovations, including Gram anchoring, which stabilizes dense feature maps during training, and axial RoPE (rotary positional embeddings) with jittering, which enhances robustness to varying image resolutions, scales, and aspect ratios. These advancements enable DINOv3 to achieve state-of-the-art performance across a wide range of vision tasks, including object detection, semantic segmentation, and depth estimation.

3.22. Co-DETR

Co-DETR [39] addresses a key limitation of DETR's standard one-to-one label assignment, which restricts each ground-truth object to a single predicted query. In Figure 10, the light red block indicates Co-DETR. This design yields only a few positive samples during training, leaving many decoder queries unused and slowing gradient flow, particularly in the early stages of learning. Co-DETR overcomes this by introducing a collaborative hybrid assignment strategy that combines the original one-to-one assignment with a one-to-many assignment implemented through auxiliary heads. The one-to-one assignment preserves the unique matching of each object, maintaining the stability and structure of DETR's training, while the one-to-many assignment leverages heuristics from classical object detectors, such as ATSS or Faster R-CNN, to assign multiple predicted queries to the same ground-truth object, providing denser supervision for both the encoder and decoder. The auxiliary heads are only active during training and are discarded at inference, so they add no computational cost at test time.
The total training loss is expressed as follows:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{DET}} + \sum_{h \,\in\, \text{auxiliary heads}} \mathcal{L}_{\text{aux},h},$
where $\mathcal{L}_{\text{DET}}$ is the standard DETR loss and $\mathcal{L}_{\text{aux},h}$ represents the one-to-many assignment loss from each auxiliary head. This hybrid assignment improves gradient flow by increasing the number of positive samples per batch, enhances encoder supervision through additional feedback signals, and leads to better detection performance on benchmarks such as COCO and LVIS. By enriching training supervision without altering the inference process, Co-DETR enables faster convergence, more effective learning, and higher accuracy in DETR-based object detectors.
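A minimal sketch of this hybrid training objective is shown below; it simply adds the auxiliary one-to-many losses to the primary DETR loss during training and drops them at inference, without the per-head loss weighting that a full implementation would typically include.

```python
import torch

def co_detr_total_loss(primary_loss, aux_losses, training=True):
    """Combine the one-to-one matching loss with the one-to-many auxiliary-head losses.
    primary_loss: scalar tensor; aux_losses: list of scalar tensors (one per auxiliary head)."""
    if not training:
        # Auxiliary heads are discarded at inference, so only the primary branch remains.
        return primary_loss
    return primary_loss + sum(aux_losses, torch.zeros_like(primary_loss))

total = co_detr_total_loss(torch.tensor(1.8), [torch.tensor(0.6), torch.tensor(0.4)])
print(total)   # tensor(2.8000)
```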

3.23. LW-DETR

LW-DETR [40] is a lightweight, transformer-based object detection model designed for high accuracy and real-time performance. It streamlines the standard DETR architecture by using an optimized vision transformer (ViT) encoder and a shallow decoder. The model first processes an input image by breaking it into patches and extracting features through the encoder. These features are then refined via a convolutional projection layer before being passed to the decoder, which uses a set of object queries to predict bounding boxes and class labels. In Figure 10, the blue block indicates LW-DETR. LW-DETR further improves efficiency through several strategies: interleaved window and global attention reduce the complexity of self-attention, multi-level feature aggregation captures richer representations, and window-major feature map organization optimizes attention computation. During training, the model employs deformable cross-attention to focus on relevant regions, IoU-aware classification loss to enhance localization accuracy, and encoder–decoder pre-training to learn robust features. The total training loss combines classification, bounding box regression, and IoU losses to guide learning effectively.
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}} + \lambda_{\text{giou}} \mathcal{L}_{\text{GIoU}},$
where $\mathcal{L}_{\text{cls}}$ is the classification loss, $\mathcal{L}_{\text{box}}$ is the bounding box regression loss, $\mathcal{L}_{\text{GIoU}}$ is the generalized intersection over union (GIoU) loss, and $\lambda_{\text{giou}}$ balances the contributions of the losses. Experimental results show that LW-DETR achieves higher accuracy than many real-time detectors, including YOLO variants, while maintaining low computational cost, making it suitable for real-time object detection tasks.
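For completeness, the following sketch shows one way the GIoU term can be computed and combined with the classification and box-regression losses; the weighting value lambda_giou = 2.0 is illustrative and not taken from the LW-DETR paper.

```python
import torch

def generalized_iou(pred, target):
    """GIoU for matched box pairs in (x1, y1, x2, y2) format; pred, target: (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union
    # Smallest box enclosing both prediction and target.
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return iou - (enclose - union) / enclose

def lw_detr_style_loss(l_cls, l_box, pred_boxes, gt_boxes, lambda_giou=2.0):
    """L_total = L_cls + L_box + lambda_giou * L_GIoU (weight value is illustrative)."""
    l_giou = (1 - generalized_iou(pred_boxes, gt_boxes)).mean()
    return l_cls + l_box + lambda_giou * l_giou

pred = torch.tensor([[10.0, 10.0, 20.0, 20.0]])
gt = torch.tensor([[12.0, 12.0, 22.0, 22.0]])
print(lw_detr_style_loss(torch.tensor(0.5), torch.tensor(0.2), pred, gt))
```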

3.24. RT-DETR

RT-DETR [41] (real-time detection transformer) is a transformer-based object detection model developed by Baidu, designed for high-speed, end-to-end inference suitable for real-time applications. In Figure 10, the purple block indicates RT-DETR. The model employs a hybrid encoder that processes multi-scale features by decoupling intra-scale interactions and cross-scale feature fusion. This efficient design reduces computational costs while retaining rich feature representations. The encoder outputs multi-scale feature maps, which are then passed to a DETR-style decoder. An IoU-aware query selection mechanism is utilized to focus on the most relevant object queries, enhancing detection accuracy. Additionally, the inference speed can be adjusted by changing the number of decoder layers, allowing for flexible deployment across different real-time scenarios.
Subsequent versions build upon this foundation to further enhance performance. RT-DETRv2 [224] introduces selective multi-scale sampling and replaces the grid-sample operator with a discrete sampling operator, improving the detection of objects at different scales. It also employs dynamic data augmentation and scale-adaptive hyperparameter tuning to enhance training efficiency without increasing inference latency. RT-DETRv3 [225] addresses limitations of sparse supervision and insufficient decoder training by adding a CNN-based auxiliary branch for dense supervision, a self-attention perturbation strategy to diversify label assignment, and a shared-weight decoder branch for dense positive supervision. In summary, the RT-DETR series demonstrates a clear evolution in real-time object detection, with each version introducing architectural and training innovations that enhance both speed and accuracy. The original RT-DETR establishes the foundation for real-time performance, while v2 and v3 progressively improve detection capability without compromising inference efficiency.
It is important to compare modifications in detection transformers to understand their effect on network size, training convergence, and performance. In this work, we use the COCO2014 mini validation set (minival) as a benchmark, since COCO is a widely accepted standard for evaluating object detection models [75]. All images are preprocessed using standard resizing and normalization procedures, and data augmentation, such as random horizontal flipping, is applied, consistent with typical DETR training protocols. The performance of DETR and its variants is evaluated using mean average precision (mAP), calculated as the mean of each object category's average precision (AP), where AP corresponds to the area under the precision–recall curve [226]. Following the standard COCO evaluation protocol, objects are classified into three size categories based on pixel area: small (area $< 32^2$ pixels), medium ($32^2$–$96^2$ pixels), and large (area $> 96^2$ pixels). This categorization allows for detailed analysis across object scales, with AP$_S$, AP$_M$, and AP$_L$ reporting performance for small, medium, and large objects, respectively. For a fair comparison, all results are obtained by loading the original pre-trained PTH files released by the respective authors and validating them on the COCO minival set. This approach allows us to reproduce the reported performance of each model while focusing on the architectural differences and improvements introduced by various DETR variants.
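For reference, the sketch below shows how this evaluation can be reproduced with the standard pycocotools API, assuming ground-truth annotations and model detections are available in COCO JSON format; the file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Illustrative file names; substitute your own annotation and detection-result paths.
coco_gt = COCO("annotations/instances_minival2014.json")
coco_dt = coco_gt.loadRes("detections.json")        # detections in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # reports AP, AP50, AP75 and AP_S / AP_M / AP_L (small/medium/large)
```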

4. Results and Discussion

Many advancements to DETR have been proposed, such as backbone modification, query design, and attention refinement, to improve performance and training convergence. Table 3 shows the performance comparison of all DETR-based detection transformers on the COCO minival set. We observe that the original DETR requires 500 training epochs to perform well and has low AP on small objects. The modified versions improve both performance and training convergence; for example, DINO reaches an mAP of 49.0% at only 12 epochs and performs well on small objects.
We also quantitatively analyze DETR and its updated versions in terms of training convergence and model size on the COCO minival set. The left side of Figure 11 shows the mAP of the detection transformers using a ResNet-50 backbone as a function of training epochs. The original DETR, represented by a brown line, converges slowly: it reaches an mAP of 35.3% at 50 training epochs and 44.9% at 500 training epochs. DINO, represented by a red line, converges within far fewer epochs and gives the highest mAP at every epoch count. The attention mechanism in DETR computes pairwise attention scores between every pair of feature vectors, which is computationally expensive, especially for large input images. Moreover, the self-attention mechanism in DETR relies on fixed positional encodings to encode the spatial relationships between different parts of the input image, which can slow down training and increase convergence time. In contrast, Deformable-DETR and DINO introduce modifications that speed up training: Deformable-DETR introduces deformable attention layers, which better capture spatial context and improve detection accuracy, while DINO uses a denoising learning approach that trains the network to learn more generalized features useful for object detection, making training faster and more effective.
The right side of Figure 11 compares all detection transformers in terms of model size. YOLOS-DETR uses DeiT-S as the backbone instead of DeiT-Ti, which increases the model size by roughly 20×. DINO and REGO-DETR achieve comparable mAP, but REGO-DETR is nearly double the model size of DINO. These networks use more complex architectures than the original DETR, which increases the number of parameters and the overall network size.
We also provide a qualitative analysis of DETR and its updated versions on objects of all sizes in Figure 12. For small objects, the mAP of the original DETR is 15.2% at 50 epochs, while Deformable-DETR reaches 26.4% at 50 epochs. The deformable attention mechanism in Deformable-DETR interpolates features from neighboring pixels at sampled locations, which is particularly useful for small objects that may occupy only a few pixels in an image. This allows Deformable-DETR to capture more precise and detailed information about small objects, leading to better performance than DETR.
While DINO demonstrates impressive accuracy and fast convergence, its computational footprint remains a significant concern. With approximately 860 GFLOPs per inference, DINO is far more demanding than lightweight alternatives such as Nano YOLO variants, which typically operate in the range of 5–10 GFLOPs. This stark difference highlights a fundamental limitation of many DETR-based models: despite their accuracy gains, their inference cost makes them impractical for latency-critical or resource-constrained applications. In contrast, RT-DETR and LW-DETR provide lightweight and real-time DETR variants, achieving competitive accuracy with a substantially lower computational load (136–259 GFLOPs for RT-DETR and 67.7 GFLOPs for LW-DETR). Additionally, Co-DETR focuses on enhancing contextual reasoning to further boost detection performance, achieving very high AP scores, though at a higher computational cost similar to DINO. Thus, future research must address not only accuracy and convergence speed but also the efficiency gap that separates DETR variants from lightweight CNN-based detectors, ensuring their practical applicability in real-world scenarios.
While Table 3 and Figure 11 and Figure 12 show performance improvements, it is also important to consider computational cost, memory footprint, and implementation complexity. Models like DINO and REGO-DETR achieve high mAP but require significantly more parameters and GFLOPs, making them less suitable for resource-constrained scenarios. Deformable-DETR provides a balanced trade-off by improving small object detection and convergence speed without drastically increasing computational load. YOLOS-DETR, while compact in design, relies on a transformer backbone (DeiT-S) that increases the memory requirement by up to 20×, highlighting a trade-off between model size and detection speed. Therefore, selecting a DETR variant depends not only on accuracy but also on hardware constraints, dataset characteristics, and real-time requirements.

5. Open Challenges and Future Directions

Detection transformers have shown promising results on various object detection benchmarks. However, several open challenges remain, providing directions for future improvements. Table 4 summarizes the advantages and limitations of the various improved versions of DETR. Some of the key open challenges and future directions are as follows.
Improving the attention mechanisms: The performance of detection transformers heavily relies on the attention mechanism to capture dependencies between spatial locations in an image. To date, around 60% of modifications in DETR have focused on the attention mechanism to improve performance and training convergence. Future research could explore more refined attention mechanisms to better capture spatial information or incorporate task-specific constraints.
Adaptive and dynamic backbones: The backbone architecture significantly affects network performance and size. Current detection transformers often use fixed backbones or remove them entirely. Only about 10% of DETR modifications have targeted the backbone to improve performance or reduce model size. Future work could investigate dynamic backbone architectures that adjust their complexity based on the input image, potentially enhancing both efficiency and accuracy.
Improving the quantity and quality of object queries: In DETR, the number of object queries fed to the decoder is typically fixed during training and inference, but the number of objects in an image varies. Later approaches, such as DAB-DETR, DN-DETR, and DINO, demonstrate that adjusting the quantity or quality of object queries can significantly impact performance. DAB-DETR uses dynamic anchor boxes as queries, DN-DETR adds positive noise to queries for denoising training, and DINO adds both positive and negative noise for improved denoising. Future models could dynamically adjust the number of object queries based on image content and incorporate adaptive mechanisms to improve query quality.
Emerging directions: Beyond attention mechanisms, backbones, and object queries, several additional challenges remain. Improving training efficiency through faster convergence strategies and sample-efficient learning could make DETR more practical for large-scale applications. Integrating multitask learning, such as jointly performing detection, segmentation, and tracking, can leverage shared representations for better performance. Enhancing robustness and generalization under occlusions, domain shifts, or low-resolution inputs is also critical. Interdisciplinary approaches could incorporate reinforcement learning to guide model adaptation, NLP-inspired sequence modeling to improve feature interactions, or graph-based reasoning techniques to capture relationships between objects. Concrete research challenges include designing models that dynamically adapt to new tasks or domains and developing cross-modal attention mechanisms that integrate multiple data sources for richer scene understanding.

6. Conclusions

Detection transformers have transformed object detection by enabling fully end-to-end models that eliminate the need for proposal generation and complex post-processing, while also providing insights into the inner workings of deep neural networks. This review presented a detailed overview of DETR and its variants, focusing on recent advancements designed to improve performance and training convergence. In particular, modifications to the attention module in the encoder–decoder network and updates to object queries have enhanced training stability and performance, especially for small objects. Other improvements include backbone refinements, query design enhancements, and attention mechanism optimizations, all of which contribute to better accuracy and efficiency.

From this survey, several high-level patterns emerge. Slow convergence and limited small-object detection remain central challenges, driving innovations in attention mechanisms, query design, and backbone architecture. Across DETR variants, commonalities include the use of transformer-based attention, modular encoder–decoder design, and strategies to increase positive supervision, while differences arise in how variants balance accuracy versus efficiency, implement multi-scale feature fusion, and assign object queries. Research diverges along two primary paths: accuracy-focused methods leverage deeper backbones and extensive pre-training, while efficiency-oriented approaches adopt lightweight, sparse, or deformable architectures such as RT-DETR and LW-DETR, which achieve competitive performance with lower computational cost. Recent trends further emphasize efficiency, multitask learning, and cross-modal integration, enabling faster convergence, improved generalization, and broader scene understanding that encompasses detection, segmentation, tracking, and vision–language reasoning.

Key insights from this survey indicate that model design is increasingly shaped by the trade-off between real-time deployment and high accuracy, and that modular, adaptive architectures are central to achieving this balance. Overall, DETR has evolved into a modular and flexible framework capable of balancing accuracy and efficiency. Future directions point toward adaptive architectures that dynamically allocate computational resources based on input complexity, robust training strategies for challenging environments, and richer contextual reasoning through multimodal integration. By uniting architectural innovation with practical deployment considerations, transformers are poised to drive the next generation of scalable, intelligent, and versatile visual perception systems.

Author Contributions

Writing, review and editing, T.S.; review, K.A.H. and M.Z.A.; supervision and project administration, M.L. and D.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work was partially funded by the European project AIRISE under Grant Agreement ID 101092312.

Acknowledgments

All individuals included in this paper have consented to the acknowledgment.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
  2. Girshick, R.B. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
  3. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  4. Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  5. Shehzadi, T.; Majid, A.; Hameed, M.; Farooq, A.; Yousaf, A. Intelligent predictor using cancer-related biologically information extraction from cancer transcriptomes. In Proceedings of the 2020 International Symposium on Recent Advances in Electrical Engineering & Computer Sciences (RAEE & CS), Islamabad, Pakistan, 20–22 October 2020; Volume 5, pp. 1–5. [Google Scholar] [CrossRef]
  6. Sarode, S.; Khan, M.S.U.; Shehzadi, T.; Stricker, D.; Afzal, M.Z. Classroom-Inspired Multi-mentor Distillation with Adaptive Learning Strategies. In Proceedings of the Intelligent Systems and Applications, Amsterdam, The Netherlands, 27–28 August 2025; Arai, K., Ed.; Springer Nature: Cham, Switzerland, 2025; pp. 294–324. [Google Scholar] [CrossRef]
  7. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. Available online: https://dl.acm.org/doi/10.5555/3295222.3295349 (accessed on 25 September 2025).
  10. Khan, M.S.U.; Shehzadi, T.; Noor, R.; Stricker, D.; Afzal, M.Z. Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification. arXiv 2024, arXiv:2406.14370. [Google Scholar] [CrossRef]
  11. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
  12. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Object Detection with Transformers: A Review. arXiv 2023, arXiv:2306.04670. [Google Scholar] [CrossRef]
  13. Sheikh, T.U.; Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. UnSupDLA: Towards Unsupervised Document Layout Analysis. arXiv 2024, arXiv:2406.06236. [Google Scholar]
  14. Ehsan, I.; Shehzadi, T.; Stricker, D.; Afzal, M.Z. End-to-End Semi-Supervised approach with Modulated Object Queries for Table Detection in Documents. Int. J. Document Anal. Recognit. 2024, 27, 363–378. Available online: https://api.semanticscholar.org/CorpusID:269626070 (accessed on 25 September 2025). [CrossRef]
  15. Shehzadi, T.; Stricker, D.; Afzal, M.Z. A Hybrid Approach for Document Layout Analysis in Document images. arXiv 2024, arXiv:2404.17888. [Google Scholar] [CrossRef]
  16. Shehzadi, T.; Sarode, S.; Stricker, D.; Afzal, M.Z. Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer. arXiv 2024, arXiv:2405.00187. [Google Scholar]
  17. Saeed, W.; Saleh, M.S.; Gull, M.N.; Raza, H.; Saeed, R.; Shehzadi, T. Geometric features and traffic dynamic analysis on 4-leg intersections. Int. Rev. Appl. Sci. Eng. 2024, 15, 171–188. [Google Scholar] [CrossRef]
  18. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Liwicki, M.; Afzal, M.Z. Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images. arXiv 2023, arXiv:2306.13526. [Google Scholar] [CrossRef]
  19. Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection. arXiv 2024, arXiv:2404.01819. [Google Scholar]
  20. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  21. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised Pre-training for Object Detection with Transformers. arXiv 2020, arXiv:2011.09094. [Google Scholar]
  22. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
  23. Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. arXiv 2021, arXiv:2101.07448. [Google Scholar]
  24. Sun, Z.; Cao, S.; Yang, Y.; Kitani, K. Rethinking Transformer-based Set Prediction for Object Detection. arXiv 2020, arXiv:2011.10881. [Google Scholar]
  25. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. arXiv 2021, arXiv:2108.06152. [Google Scholar]
  26. Liu, F.; Wei, H.; Zhao, W.; Li, G.; Peng, J.; Li, Z. WB-DETR: Transformer-Based Detector without Backbone. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2959–2967. [Google Scholar] [CrossRef]
  27. Wang, T.; Yuan, L.; Chen, Y.; Feng, J.; Yan, S. PnP-DETR: Towards Efficient Visual Analysis with Transformers. arXiv 2021, arXiv:2109.07036. [Google Scholar]
  28. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the 2021 International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; Available online: https://www.microsoft.com/en-us/research/publication/dynamic-detr-end-to-end-object-detection-with-dynamic-attention/ (accessed on 25 September 2025).
  29. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. arXiv 2021, arXiv:2106.00666. [Google Scholar] [CrossRef]
  30. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query Design for Transformer-Based Detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Available online: https://api.semanticscholar.org/CorpusID:237513850 (accessed on 25 September 2025).
  31. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  32. Lin, J.; Mao, X.; Chen, Y.; Xu, L.; He, Y.; Xue, H. D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention. arXiv 2022, arXiv:2203.00860. [Google Scholar] [CrossRef]
  33. Wang, W.; Cao, Y.; Zhang, J.; Tao, D. FP-DETR: Detection Transformer Advanced by Fully Pre-training. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022; Available online: https://openreview.net/forum?id=yjMQuLLcGWK (accessed on 25 September 2025).
  34. Cao, X.; Yuan, P.; Feng, B.; Niu, K. CF-DETR: Coarse-to-Fine Transformers for End-to-End Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Available online: https://api.semanticscholar.org/CorpusID:250293790 (accessed on 25 September 2025).
  35. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2239–2251. [Google Scholar] [CrossRef]
  36. Gao, Z.; Wang, L.; Han, B.; Guo, S. AdaMixer: A Fast-Converging Query-Based Object Detector. arXiv 2022, arXiv:2203.16507. [Google Scholar] [CrossRef]
  37. Chen, Z.; Zhang, J.; Tao, D. Recurrent Glimpse-based Decoder for Detection with Transformer. arXiv 2021, arXiv:2112.04632. [Google Scholar]
  38. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar] [CrossRef]
  39. Zong, Z.; Song, G.; Liu, Y. DETRs with Collaborative Hybrid Assignments Training. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 6725–6735. [Google Scholar] [CrossRef]
  40. Chen, Q.; Su, X.; Zhang, X.; Wang, J.; Chen, J.; Shen, Y.; Han, C.; Chen, Z.; Xu, W.; Li, F.; et al. LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection. arXiv 2024, arXiv:2406.03459. Available online: https://arxiv.org/abs/2406.03459 (accessed on 25 September 2025).
  41. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. Available online: https://openaccess.thecvf.com/content/CVPR2024/html/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024_paper.html (accessed on 25 September 2025).
  42. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G. Recent Advances in Convolutional Neural Networks. arXiv 2015, arXiv:1512.07108. [Google Scholar] [CrossRef]
  43. Borji, A.; Cheng, M.; Jiang, H.; Li, J. Salient Object Detection: A Survey. arXiv 2014, arXiv:1411.5878. [Google Scholar] [CrossRef]
  44. Chen, G.; Wang, H.; Chen, K.; Li, Z.; Song, Z.; Liu, Y.; Chen, W.; Knoll, A. A Survey of the Four Pillars for Small Object Detection: Multiscale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 936–953. [Google Scholar] [CrossRef]
  45. Agarwal, S.; du Terrail, J.O.; Jurie, F. Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks. arXiv 2018, arXiv:1809.03193. [Google Scholar]
  46. Yang, M.H.; Kriegman, D.; Ahuja, N. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 34–58. [Google Scholar] [CrossRef]
  47. Zhao, B.; Feng, J.; Wu, X.; Yan, S. A survey on deep learning-based fine-grained object classification and semantic segmentation. Int. J. Autom. Comput. 2017, 14, 119–135. Available online: https://api.semanticscholar.org/CorpusID:53076119 (accessed on 25 September 2025). [CrossRef]
  48. Goswami, T.; Barad, Z.; Desai, P.; Nikita, P. Text Detection and Recognition in images: A survey. arXiv 2018, arXiv:1803.07278. [Google Scholar] [CrossRef]
  49. Chaudhari, S.; Polatkan, G.; Ramanath, R.; Mithal, V. An Attentive Survey of Attention Models. arXiv 2019, arXiv:1904.02874. [Google Scholar] [CrossRef]
  50. Han, J.; Zhang, D.; Cheng, G.; Liu, N.; Xu, D. Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey. IEEE Signal Process. Mag. 2018, 35, 84–100. [Google Scholar] [CrossRef]
  51. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.W.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. arXiv 2018, arXiv:1809.02165. [Google Scholar] [CrossRef]
  52. Enzweiler, M.; Gavrila, D.M. Monocular Pedestrian Detection: Survey and Experiments. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2179–2195. [Google Scholar] [CrossRef]
  53. Ülkü, I.; Akagündüz, E. A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images. arXiv 2019, arXiv:1912.10230. [Google Scholar]
  54. Cheng, G.; Han, J. A Survey on Object Detection in Optical Remote Sensing Images. arXiv 2016, arXiv:1603.06201. [Google Scholar] [CrossRef]
  55. Sommer, L.W.; Schuchert, T.; Beyerer, J. Fast Deep Vehicle Detection in Aerial Images. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 2–29 March 2017; pp. 311–319. [Google Scholar] [CrossRef]
  56. Zhang, P.; Niu, X.; Dou, Y.; Xia, F. Airport Detection on Optical Satellite Images Using Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1183–1187. [Google Scholar] [CrossRef]
  57. Bach, M.; Stumper, D.; Dietmayer, K. Deep Convolutional Traffic Light Recognition for Automated Driving. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 851–858. [Google Scholar] [CrossRef]
  58. de la Escalera, A.; Moreno, L.; Salichs, M.; Armingol, J. Road traffic sign detection and classification. IEEE Trans. Ind. Electron. 1997, 44, 848–859. [Google Scholar] [CrossRef]
  59. Shehzadi, T.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Mask-Aware Semi-Supervised Object Detection in Floor Plans. Appl. Sci. 2022, 12, 9398. [Google Scholar] [CrossRef]
  60. Hariharan, B.; Arbelaez, P.; Girshick, R.B.; Malik, J. Simultaneous Detection and Segmentation. arXiv 2014, arXiv:1407.1808. [Google Scholar] [CrossRef]
  61. Hariharan, B.; Arbeláez, P.A.; Girshick, R.B.; Malik, J. Hypercolumns for Object Segmentation and Fine-grained Localization. arXiv 2014, arXiv:1411.5752. [Google Scholar]
  62. Dai, J.; He, K.; Sun, J. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015, arXiv:1512.04412. [Google Scholar]
  63. Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv 2014, arXiv:1412.2306. [Google Scholar]
  64. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
  65. Wu, Q.; Shen, C.; van den Hengel, A.; Wang, P.; Dick, A.R. Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge. arXiv 2016, arXiv:1603.02814. [Google Scholar]
  66. Bai, S.; An, S. A survey on automatic image caption generation. Neurocomputing 2018, 311, 291–304. [Google Scholar] [CrossRef]
  67. Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X.; et al. T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos. arXiv 2016, arXiv:1604.02532. [Google Scholar] [CrossRef]
  68. Arkin, E.; Yadikar, N.; Xu, X.; Aysa, A.; Ubul, K. A survey: Object detection methods from CNN to transformer. Multimed. Tools Appl. 2022, 82, 21353–21383. [Google Scholar] [CrossRef]
  69. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
  70. Arkin, E.; Yadikar, N.; Muhtar, Y.; Ubul, K. A Survey of Object Detection Based on CNN and Transformer. In Proceedings of the 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), Chengdu, China, 16–18 July 2021; pp. 99–108. [Google Scholar] [CrossRef]
  71. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  72. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  73. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. arXiv 2019, arXiv:1905.05055. [Google Scholar] [CrossRef]
  74. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.N.; Lee, B. A Survey of Modern Deep Learning based Object Detection Models. arXiv 2021, arXiv:2104.11892. [Google Scholar] [CrossRef]
  75. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
  76. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A Survey of Deep Learning-based Object Detection. arXiv 2019, arXiv:1907.09408. [Google Scholar] [CrossRef]
  77. Ahmed, M.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Survey and Performance Analysis of Deep Learning Based Object Detection in Challenging Environments. Sensors 2021, 21, 5116. [Google Scholar] [CrossRef] [PubMed]
  78. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2009, 88, 303–308. Available online: https://www.microsoft.com/en-us/research/publication/the-pascal-visual-object-classes-voc-challenge/ (accessed on 25 September 2025). [CrossRef]
  79. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  80. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  81. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. Available online: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 25 September 2025).
  82. Bar, A.; Wang, X.; Kantorov, V.; Reed, C.J.; Herzig, R.; Chechik, G.; Rohrbach, A.; Darrell, T.; Globerson, A. DETReg: Unsupervised Pretraining with Region Priors for Object Detection. arXiv 2021, arXiv:2106.04550. [Google Scholar]
  83. Bateni, P.; Barber, J.; van de Meent, J.; Wood, F. Improving Few-Shot Visual Classification with Unlabelled Examples. arXiv 2020, arXiv:2006.12245. [Google Scholar]
  84. Wang, X.; Yang, X.; Zhang, S.; Li, Y.; Feng, L.; Fang, S.; Lyu, C.; Chen, K.; Zhang, W. Consistent Targets Provide Better Supervision in Semi-supervised Object Detection. arXiv 2022, arXiv:2209.01589. [Google Scholar] [CrossRef]
  85. Li, Y.; Huang, D.; Qin, D.; Wang, L.; Gong, B. Improving Object Detection with Selective Self-supervised Self-training. arXiv 2020, arXiv:2007.09162. [Google Scholar] [CrossRef]
  86. Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection. arXiv 2022, arXiv:2210.02368. [Google Scholar]
  87. Hashmi, K.A.; Pagani, A.; Stricker, D.; Afzal, M.Z. BoxMask: Revisiting Bounding Box Supervision for Video Object Detection. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2029–2039. [Google Scholar] [CrossRef]
  88. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294. [Google Scholar] [CrossRef]
  89. Li, C.; Yang, J.; Zhang, P.; Gao, M.; Xiao, B.; Dai, X.; Yuan, L.; Gao, J. Efficient Self-supervised Vision Transformers for Representation Learning. arXiv 2021, arXiv:2106.09785. [Google Scholar]
  90. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
  91. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
  92. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar] [CrossRef]
  93. Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  94. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  95. Fu, C.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659. [Google Scholar] [CrossRef]
  96. Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587. [Google Scholar] [CrossRef]
  97. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. arXiv 2017, arXiv:1711.06897. [Google Scholar]
  98. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. arXiv 2018, arXiv:1808.01244. [Google Scholar]
  99. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv 2014, arXiv:1406.4729. [Google Scholar]
  100. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
  101. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  102. Qiao, S.; Chen, L.; Yuille, A.L. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. arXiv 2020, arXiv:2006.02334. [Google Scholar] [CrossRef]
  103. Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. arXiv 2019, arXiv:1901.07518. [Google Scholar] [CrossRef]
  104. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. arXiv 2017, arXiv:1712.00726. [Google Scholar] [CrossRef]
  105. Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  106. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  107. Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv 2018, arXiv:1801.04381. [Google Scholar]
  108. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
  109. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083. [Google Scholar] [CrossRef]
  110. Wang, R.J.; Li, X.; Ao, S.; Ling, C.X. Pelee: A Real-Time Object Detection System on Mobile Devices. arXiv 2018, arXiv:1804.06882. [Google Scholar]
  111. Ma, N.; Zhang, X.; Zheng, H.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv 2018, arXiv:1807.11164. [Google Scholar] [CrossRef]
  112. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv 2018, arXiv:1807.11626. [Google Scholar]
  113. Yousaf, A.; Sazonov, E. Food Intake Detection in the Face of Limited Sensor Signal Annotations. In Proceedings of the 2024 Tenth International Conference on Communications and Electronics (ICCE), Da Nang, Vietnam, 31 July–2 August 2024; pp. 351–356. [Google Scholar] [CrossRef]
  114. Cai, H.; Gan, C.; Han, S. Once for All: Train One Network and Specialize it for Efficient Deployment. arXiv 2019, arXiv:1908.09791. [Google Scholar]
  115. Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image. arXiv 2017, arXiv:1703.07570. [Google Scholar]
  116. Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. arXiv 2016, arXiv:1612.00496. [Google Scholar]
  117. Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. arXiv 2019, arXiv:1903.10955. [Google Scholar] [CrossRef]
  118. Li, P.; Chen, X.; Shen, S. Stereo R-CNN based 3D Object Detection for Autonomous Driving. arXiv 2019, arXiv:1902.09738. [Google Scholar] [CrossRef]
  119. Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T. Geometry-based Distance Decomposition for Monocular 3D Object Detection. arXiv 2021, arXiv:2104.03775. [Google Scholar]
  120. Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into Localization Errors for Monocular 3D Object Detection. arXiv 2021, arXiv:2103.16237. [Google Scholar] [CrossRef]
  121. Liu, Y.; Wang, L.; Liu, M. YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection. arXiv 2021, arXiv:2103.09422. [Google Scholar] [CrossRef]
  122. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. arXiv 2020, arXiv:2006.11275. [Google Scholar]
  123. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017, arXiv:1711.06396. [Google Scholar]
  124. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. arXiv 2018, arXiv:1812.05784. [Google Scholar]
  125. Xu, Q.; Zhong, Y.; Neumann, U. Behind the Curtain: Learning Occluded Shapes for 3D Object Detection. arXiv 2021, arXiv:2112.02205. [Google Scholar] [CrossRef]
  126. Zheng, W.; Tang, W.; Chen, S.; Jiang, L.; Fu, C. CIA-SSD: Confident IoU-Aware Single-Stage Object Detector from Point Cloud. arXiv 2020, arXiv:2012.03015. [Google Scholar] [CrossRef]
  127. Zheng, W.; Tang, W.; Jiang, L.; Fu, C. SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud. arXiv 2021, arXiv:2104.09804. [Google Scholar]
  128. Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. arXiv 2020, arXiv:2012.15712. [Google Scholar] [CrossRef]
  129. Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.; Zhao, M. Improving 3D Object Detection with Channel-wise Transformer. arXiv 2021, arXiv:2108.10723. [Google Scholar] [CrossRef]
  130. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel Transformer for 3D Object Detection. arXiv 2021, arXiv:2109.02497. [Google Scholar] [CrossRef]
  131. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. arXiv 2019, arXiv:1911.10150. [Google Scholar]
  132. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. arXiv 2017, arXiv:1712.02294. [Google Scholar]
  133. Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. arXiv 2020, arXiv:2012.10992. [Google Scholar] [CrossRef]
  134. Yoo, J.H.; Kim, Y.; Kim, J.S.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection. arXiv 2020, arXiv:2004.12636. [Google Scholar]
  135. Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. arXiv 2020, arXiv:2009.00784. [Google Scholar]
  136. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. arXiv 2019, arXiv:1904.04745. [Google Scholar]
  137. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
  138. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv 2020, arXiv:2012.15840. [Google Scholar]
  139. Strudel, R.; Pinel, R.G.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. arXiv 2021, arXiv:2105.05633. [Google Scholar] [CrossRef]
  140. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
  141. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar] [CrossRef]
  142. Kirillov, A.; He, K.; Girshick, R.B.; Rother, C.; Dollár, P. Panoptic Segmentation. arXiv 2018, arXiv:1801.00868. [Google Scholar]
  143. Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.L.; Chen, L. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. arXiv 2020, arXiv:2003.07853. [Google Scholar]
  144. Neuhold, G.; Ollmann, T.; Bulò, S.R.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5000–5009. [Google Scholar] [CrossRef]
  145. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. arXiv 2016, arXiv:1604.01685. [Google Scholar] [CrossRef]
  146. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text to Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; Proceedings of Machine Learning Research: New York, NY, USA, 2016; Volume 48, pp. 1060–1069. Available online: https://proceedings.mlr.press/v48/reed16.html (accessed on 25 September 2025).
  147. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; Metaxas, D.N. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv 2016, arXiv:1612.03242. [Google Scholar]
  148. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv 2017, arXiv:1710.10916. [Google Scholar]
  149. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv 2017, arXiv:1711.10485. [Google Scholar]
  150. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  151. Murahari, M.D.; Reddy; Sk, M.; Basha, M.M.M.; Hari, M.N.C.; Student, P. DALL-E: CREATING IMAGES FROM TEXT. 2021. Available online: https://api.semanticscholar.org/CorpusID:261026641 (accessed on 25 September 2025).
  152. Wang, X.; Yeshwanth, C.; Nießner, M. SceneFormer: Indoor Scene Generation with Transformers. arXiv 2020, arXiv:2012.09793. [Google Scholar]
  153. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative Pretraining From Pixels. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Daumé III, H., Singh, A., Eds.; Proceedings of Machine Learning Research (PMLR): New York, NY, USA, 2020; Volume 119, pp. 1691–1703. Available online: https://proceedings.mlr.press/v119/chen20s.html (accessed on 25 September 2025).
  154. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. arXiv 2020, arXiv:2012.09841. [Google Scholar]
  155. Jiang, Y.; Chang, S.; Wang, Z. TransGAN: Two Transformers Can Make One Strong GAN. arXiv 2021, arXiv:2102.07074. [Google Scholar]
  156. Bhunia, A.K.; Khan, S.H.; Cholakkal, H.; Anwer, R.M.; Khan, F.S.; Shah, M. Handwriting Transformers. arXiv 2021, arXiv:2104.03964. [Google Scholar] [CrossRef]
  157. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009; Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 25 September 2025).
  158. Coates, A.; Ng, A.; Lee, H. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; Gordon, G., Dunson, D., Dudík, M., Eds.; Proceedings of Machine Learning Research: New York, NY, USA, 2011; Volume 15, pp. 215–223. Available online: https://proceedings.mlr.press/v15/coates11a.html (accessed on 25 September 2025).
  159. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
  160. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  161. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R.B. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv 2019, arXiv:1911.05722. [Google Scholar]
  162. Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning Representations by Maximizing Mutual Information Across Views. arXiv 2019, arXiv:1906.00910. [Google Scholar] [CrossRef]
  163. Hénaff, O.J.; Srinivas, A.; Fauw, J.D.; Razavi, A.; Doersch, C.; Eslami, S.M.A.; van den Oord, A. Data-Efficient Image Recognition with Contrastive Predictive Coding. arXiv 2019, arXiv:1905.09272. [Google Scholar]
  164. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. Available online: https://api.semanticscholar.org/CorpusID:11758569 (accessed on 25 September 2025).
  165. Gao, C.; Chen, Y.; Liu, S.; Tan, Z.; Yan, S. AdversarialNAS: Adversarial Neural Architecture Search for GANs. arXiv 2019, arXiv:1912.02037. [Google Scholar]
  166. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. arXiv 2019, arXiv:1912.04958. [Google Scholar]
  167. Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning Texture Transformer Network for Image Super-Resolution. arXiv 2020, arXiv:2006.04139. [Google Scholar] [CrossRef]
  168. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. arXiv 2020, arXiv:2012.00364. [Google Scholar]
  169. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. arXiv 2021, arXiv:2108.10257. [Google Scholar] [CrossRef]
  170. Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A General U-Shaped Transformer for Image Restoration. arXiv 2021, arXiv:2106.03106. [Google Scholar] [CrossRef]
  171. Kumar, M.; Weissenborn, D.; Kalchbrenner, N. Colorization Transformer. arXiv 2021, arXiv:2102.04432. [Google Scholar]
  172. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. arXiv 2015, arXiv:1505.00468. [Google Scholar]
  173. Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. arXiv 2018, arXiv:1811.10830. [Google Scholar]
  174. Lee, K.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked Cross Attention for Image-Text Matching. arXiv 2018, arXiv:1803.08024. [Google Scholar] [CrossRef]
  175. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. arXiv 2014, arXiv:1411.4555. [Google Scholar]
  176. Chen, Y.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Learning UNiversal Image-TExt Representations. arXiv 2019, arXiv:1909.11740. [Google Scholar]
  177. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. arXiv 2020, arXiv:2004.06165. [Google Scholar]
  178. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A Joint Model for Video and Language Representation Learning. arXiv 2019, arXiv:1904.01766. [Google Scholar] [CrossRef]
  179. Li, G.; Duan, N.; Fang, Y.; Jiang, D.; Zhou, M. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv 2019, arXiv:1908.06066. [Google Scholar] [CrossRef]
  180. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.; Chang, K. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar] [CrossRef]
  181. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
  182. Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv 2019, arXiv:1908.07490. [Google Scholar] [CrossRef]
  183. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv 2019, arXiv:1908.02265. [Google Scholar]
  184. Lee, S.; Yu, Y.; Kim, G.; Breuel, T.M.; Kautz, J.; Song, Y. Parameter Efficient Multimodal Transformers for Video Representation Learning. arXiv 2020, arXiv:2012.04124. [Google Scholar]
  185. Sun, N.; Zhu, Y.; Hu, X. Faster R-CNN Based Table Detection Combining Corner Locating. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1314–1319. [Google Scholar] [CrossRef]
  186. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A. Image Transformer. arXiv 2018, arXiv:1802.05751. [Google Scholar]
  187. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. arXiv 2019, arXiv:1904.09925. [Google Scholar]
  188. Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.D.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. arXiv 2019, arXiv:1902.09630. [Google Scholar] [CrossRef]
  189. van den Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; van den Driessche, G.; Lockhart, E.; Cobo, L.C.; Stimberg, F.; et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv 2017, arXiv:1711.10433. [Google Scholar]
  190. Gu, J.; Bradbury, J.; Xiong, C.; Li, V.O.K.; Socher, R. Non-Autoregressive Neural Machine Translation. arXiv 2017, arXiv:1711.02281. [Google Scholar]
  191. Ghazvininejad, M.; Levy, O.; Liu, Y.; Zettlemoyer, L. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6112–6121. [Google Scholar] [CrossRef]
  192. Stewart, R.; Andriluka, M. End-to-end people detection in crowded scenes. arXiv 2015, arXiv:1506.04878. [Google Scholar]
  193. Romera-Paredes, B.; Torr, P.H.S. Recurrent Instance Segmentation. arXiv 2015, arXiv:1511.08250. [Google Scholar]
  194. Park, E.; Berg, A.C. Learning to decompose for object detection and instance segmentation. arXiv 2015, arXiv:1511.06449. [Google Scholar]
  195. Ren, M.; Zemel, R.S. End-to-End Instance Segmentation and Counting with Recurrent Attention. arXiv 2016, arXiv:1605.09410. [Google Scholar]
  196. Salvador, A.; Bellver, M.; Baradad, M.; Marqués, F.; Torres, J.; Giró-i-Nieto, X. Recurrent Neural Networks for Semantic Instance Segmentation. arXiv 2017, arXiv:1712.00617. [Google Scholar]
  197. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar] [CrossRef]
  198. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168. [Google Scholar] [CrossRef]
  199. Zhang, H.; Wang, J. Towards Adversarially Robust Object Detection. arXiv 2019, arXiv:1907.10310. [Google Scholar] [CrossRef]
  200. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking Classification and Localization in R-CNN. arXiv 2019, arXiv:1904.06493. [Google Scholar] [CrossRef]
  201. Song, G.; Liu, Y.; Wang, X. Revisiting the Sibling Head in Object Detector. arXiv 2020, arXiv:2003.07540. [Google Scholar] [CrossRef]
  202. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv 2019, arXiv:1905.03197. [Google Scholar] [CrossRef]
  203. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. Available online: http://jmlr.org/papers/v15/srivastava14a.html (accessed on 25 September 2025).
  204. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. arXiv 2020, arXiv:2011.12450. [Google Scholar]
  205. Zhang, X.; Wan, F.; Liu, C.; Ji, R.; Ye, Q. FreeAnchor: Learning to Match Anchors for Visual Object Detection. arXiv 2019, arXiv:1909.02466. [Google Scholar] [CrossRef]
  206. Kim, K.; Lee, H.S. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. arXiv 2020, arXiv:2007.08103. [Google Scholar] [CrossRef]
  207. Li, H.; Wu, Z.; Zhu, C.; Xiong, C.; Socher, R.; Davis, L.S. Learning from Noisy Anchors for One-stage Object Detection. arXiv 2019, arXiv:1912.05086. [Google Scholar]
  208. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar] [CrossRef]
  209. Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494. [Google Scholar] [CrossRef]
  210. Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double Attention Networks. arXiv 2018, arXiv:1810.11579. [Google Scholar]
  211. Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN Models for Fine-Grained Visual Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar] [CrossRef]
  212. Wang, X.; Zhang, S.; Yu, Z.; Feng, L.; Zhang, W. Scale-Equalizing Pyramid Convolution for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13356–13365. Available online: https://api.semanticscholar.org/CorpusID:218537867 (accessed on 25 September 2025).
  213. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  214. Jiang, Z.; Yu, W.; Zhou, D.; Chen, Y.; Feng, J.; Yan, S. ConvBERT: Improving BERT with Span-based Dynamic Convolution. arXiv 2020, arXiv:2008.02496. [Google Scholar]
  215. Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward Transformer-Based Object Detection. arXiv 2020, arXiv:2012.09958. [Google Scholar] [CrossRef]
  216. Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable Label Assignment for Dense Object Detection. arXiv 2020, arXiv:2007.03496. [Google Scholar] [CrossRef]
  217. Hendrycks, D.; Gimpel, K. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv 2016, arXiv:1606.08415. [Google Scholar] [CrossRef]
  218. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  219. Ma, X.; Kong, X.; Wang, S.; Zhou, C.; May, J.; Ma, H.; Zettlemoyer, L. Luna: Linear Unified Nested Attention. arXiv 2021, arXiv:2106.01540. [Google Scholar] [CrossRef]
  220. Shen, Z.; Zhang, M.; Yi, S.; Yan, J.; Zhao, H. Factorized Attention: Self-Attention with Linear Complexities. arXiv 2018, arXiv:1812.01243. [Google Scholar]
  221. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  222. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. Available online: https://arxiv.org/abs/2304.07193 (accessed on 25 September 2025). [CrossRef]
  223. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. Available online: https://arxiv.org/abs/2508.10104 (accessed on 25 September 2025). [PubMed]
  224. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140. Available online: https://arxiv.org/abs/2407.17140 (accessed on 25 September 2025).
  225. Wang, S.; Xia, C.; Lv, F.; Shi, Y. RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision. arXiv 2024, arXiv:2409.08475. [Google Scholar]
  226. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
  227. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar]
  228. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
Figure 1. Statistical overview of the literature on transformers. (a) Number of citations per year for transformer papers. (b) Citations in the last 12 months for detection transformer papers. (c) Percentage of modifications to the original detection transformer (DETR) aimed at improving performance and convergence speed. (d) Number of peer-reviewed publications per year that used DETR as a baseline. (e) A non-exhaustive timeline of important developments in DETR for detection tasks.
Figure 2. An overview of the detection transformer (DETR) and the modifications proposed by recent methods to improve performance and training convergence. DETR treats detection as a set prediction task and uses a transformer to free the network from post-processing steps such as non-maximum suppression (NMS). Each module added to DETR is shown in a different color with its corresponding label (on the right side).
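Because DETR scores every predicted query against every ground-truth box and enforces a one-to-one assignment, duplicate detections are pushed toward the "no object" class and NMS becomes unnecessary. The snippet below is a simplified NumPy/SciPy sketch of this bipartite (Hungarian) matching step, not the official DETR code: the real matcher also includes a generalized IoU term [188], operates on batched PyTorch tensors, and the cost weight used here is only an illustrative value.

# Simplified sketch of DETR-style bipartite matching (illustration only).
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """Return (query_idx, gt_idx) pairs minimizing a class + L1 box cost."""
    probs = np.exp(pred_logits)
    probs /= probs.sum(-1, keepdims=True)                             # softmax over classes
    class_cost = -probs[:, gt_labels]                                 # [num_queries, num_gt]
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)   # pairwise L1 distance
    cost = class_cost + l1_weight * box_cost                          # GIoU term omitted here
    return linear_sum_assignment(cost)                                # optimal one-to-one assignment

# Toy example: 4 queries, 2 ground-truth objects, 3 classes (the last one is "no object").
rng = np.random.default_rng(0)
query_idx, gt_idx = hungarian_match(
    pred_logits=rng.normal(size=(4, 3)),
    pred_boxes=rng.uniform(size=(4, 4)),   # (cx, cy, w, h), normalized to [0, 1]
    gt_labels=np.array([0, 1]),
    gt_boxes=rng.uniform(size=(2, 4)),
)
print(query_idx, gt_idx)   # which query is matched to which ground-truth object

Unmatched queries are supervised as "no object", which is what removes the need for hand-crafted duplicate suppression at inference time.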
Figure 3. The structure of the original DETR after the addition of Deformable-DETR [20], UP-DETR [21], and Efficient-DETR [22]. The network is the plain DETR, with improvements indicated by small colored boxes: the dark pink block indicates Deformable-DETR, the bright cyan block indicates UP-DETR, and the dull green block represents Efficient-DETR.
Figure 4. The structure of the original DETR after the addition of SMCA-DETR [23], TSP-DETR [24], and Conditional-DETR [25]. The network is the plain DETR, with improvements indicated by small colored boxes: the purple block indicates SMCA-DETR, the orange block indicates TSP-DETR, and the yellow block represents Conditional-DETR.
Figure 5. The structure of the original DETR after the addition of WB-DETR [26], PnP-DETR [27], and Dynamic-DETR [28]. The network is the plain DETR, with improvements indicated by small colored boxes: the magenta block indicates WB-DETR, the blue block indicates PnP-DETR, and the green block represents Dynamic-DETR.
Figure 6. The structure of the original DETR after the addition of YOLOS-DETR [29], Anchor-DETR [30], and Sparse-DETR [31]. The network is the plain DETR, with improvements indicated by small colored boxes: the yellow block indicates YOLOS-DETR, the light blue block indicates Anchor-DETR, and the light orange block represents Sparse-DETR.
Figure 7. The structure of the original DETR after the addition of D2ETR [32], FP-DETR [33], and CF-DETR [34]. The network is the plain DETR, with improvements indicated by small colored boxes: the light green block indicates D2ETR, the pink block indicates FP-DETR, and the blue block represents CF-DETR.
Figure 8. The structure of the original DETR after the addition of DAB-DETR [72], DN-DETR [35], and AdaMixer [36]. The network is the plain DETR, with improvements indicated by small colored boxes: the purple block indicates DAB-DETR, the dark green block indicates DN-DETR, and the light green block represents AdaMixer.
Figure 9. The structure of the original DETR after the addition of REGO-DETR [37] and DINO [38]. The network is the plain DETR, with improvements indicated by small colored boxes: the purple block indicates REGO-DETR, and the red block represents DINO.
Figure 10. The structure of the original DETR after the addition of Co-DETR [39], RT-DETR [41], and LW-DETR [40]. The network is the plain DETR, with improvements indicated by small colored boxes: the light red block represents Co-DETR, the blue block indicates LW-DETR, and the purple block indicates RT-DETR.
Figure 11. Comparison of all DETR-based detection transformers on the COCO minival set. (Left) Performance comparison of detection transformers using a ResNet-50 [80] backbone with respect to training epochs; networks labeled with DC5 use a dilated feature map. (Right) Performance comparison of detection transformers with respect to model size (parameters in millions).
Figure 12. Comparison of DETR-based detection transformers on the COCO minival set using a ResNet-50 backbone. (Left) Performance comparison on small objects. (Middle) Performance comparison on medium objects. (Right) Performance comparison on large objects.
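For reference, the small/medium/large split used in Figure 12 (and in the APS, APM, and APL columns of Table 3) follows the COCO convention based on object area in pixels. The helper below is a minimal illustration of those thresholds; COCO evaluates the annotated segment area, which is approximated here by the bounding-box area.

# Illustrative COCO-style size buckets (areas in squared pixels).
def coco_size_bucket(box_width: float, box_height: float) -> str:
    area = box_width * box_height          # approximation of the annotated object area
    if area < 32 ** 2:
        return "small"                     # APS: area < 32^2
    if area <= 96 ** 2:
        return "medium"                    # APM: 32^2 <= area <= 96^2
    return "large"                         # APL: area > 96^2

print(coco_size_bucket(20, 20), coco_size_bucket(50, 50), coco_size_bucket(120, 120))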
Table 1. Overview of improvements to the detection transformer (DETR) that speed up training convergence and improve performance on small objects. Here, Bk represents the backbone, Pre denotes pre-training, Attn indicates attention, and Qry represents the query design of the transformer network; each method represents an improvement over the baseline DETR, with check marks indicating where modifications were introduced. The main contributions of each network are summarized in the last column. All GitHub links in this table were accessed on 25 September 2025.
Methods | Publication | Highlights
DETR [11] (GitHub: https://github.com/facebookresearch/detr) | ECCV 2020 | Transformer, set-based prediction, bipartite matching
Deformable-DETR [20] (GitHub: https://github.com/fundamentalvision/Deformable-DETR) | ICLR 2021 | Deformable-attention module
UP-DETR [21] (GitHub: https://github.com/dddzg/up-detr) | CVPR 2021 | Unsupervised pre-training, random query patch detection
Efficient-DETR [22] | arXiv 2021 | Reference point and top-k query selection module
SMCA-DETR [23] (GitHub: https://github.com/gaopengcuhk/SMCA-DETR) | ICCV 2021 | Spatially modulated co-attention module
TSP-DETR [24] (GitHub: https://github.com/Edward-Sun/TSP-Detection) | ICCV 2021 | TSP-FCOS and TSP-RCNN modules for cross-attention
Conditional-DETR [25] (GitHub: https://github.com/Atten4Vis/ConditionalDETR) | ICCV 2021 | Conditional spatial queries
WB-DETR [26] (GitHub: https://github.com/aybora/wbdetr) | ICCV 2021 | Encoder–decoder network without a backbone, LIE-T2T encoder module
PnP-DETR [27] (GitHub: https://github.com/twangnh/pnp-detr) | ICCV 2021 | PnP sampling module including a pool sampler and a poll sampler
Dynamic-DETR [28] | ICCV 2021 | Dynamic attention in the encoder–decoder network
YOLOS-DETR [29] (GitHub: https://github.com/hustvl/YOLOS) | NeurIPS 2021 | Pre-training of the encoder network
Anchor-DETR [30] (GitHub: https://github.com/megvii-research/AnchorDETR) | AAAI 2022 | Row and column decoupled attention, object queries as anchor points
Sparse-DETR [31] (GitHub: https://github.com/kakaobrain/sparse-detr) | ICLR 2022 | Cross-attention map predictor, deformable-attention module
D2ETR [32] (GitHub: https://github.com/alibaba/easyrobust/tree/main/ddetr) | arXiv 2022 | Fine fused features, cross-scale attention module
FP-DETR [33] (GitHub: https://github.com/encounter1997/FP-DETR) | ICLR 2022 | Multiscale tokenizer in place of a CNN backbone, pre-training of the encoder network
CF-DETR [34] | AAAI 2022 | TEF module to capture spatial relationships, a coarse and a fine layer in the decoder network
DAB-DETR [72] (GitHub: https://github.com/IDEA-Research/DAB-DETR) | ICLR 2022 | Dynamic anchor boxes as object queries
DN-DETR [35] (GitHub: https://github.com/IDEA-Research/DN-DETR) | CVPR 2022 | Positive noised object queries
AdaMixer [36] (GitHub: https://github.com/MCG-NJU/AdaMixer) | CVPR 2022 | 3D sampling module, adaptive mixing module in the decoder
REGO [37] (GitHub: https://github.com/zhechen/Deformable-DETR-REGO) | CVPR 2022 | A multi-level recurrent mechanism and a glimpse-based decoder
DINO [38] (GitHub: https://github.com/facebookresearch/dino) | arXiv 2022 | Contrastive denoising module, positive and negative noised object queries
Co-DETR [39] (GitHub: https://github.com/Sense-X/Co-DETR) | ICCV 2023 | Collaborative hybrid assignments for faster convergence and improved training stability
LW-DETR [40] (GitHub: https://github.com/Atten4Vis/LW-DETR) | arXiv 2024 | Lightweight DETR with an optimized ViT encoder, a shallow decoder, and global attention
RT-DETR [41] (GitHub: https://github.com/lyuwenyu/RT-DETR) | CVPR 2024 | Hybrid encoder with multi-scale features, IoU-aware query selection, adaptable inference speed
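Several query-design entries in Table 1 (object queries as anchor points, dynamic anchor boxes, noised queries) refine the same underlying mechanism: a fixed set of learned query embeddings cross-attends to the encoded image features, and each query is decoded into one class distribution and one box. The PyTorch sketch below is a minimal illustration of that mechanism under assumed dimensions and layer counts; it is not the implementation of any specific method in the table, and a real DETR additionally uses positional encodings, auxiliary losses per decoder layer, and the matching loss described above.

# Minimal sketch of object queries decoded against image features (assumes a recent PyTorch).
import torch
import torch.nn as nn

class TinyDETRHead(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, d_model)    # learned object queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(d_model, num_classes + 1)    # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                    # (cx, cy, w, h) in [0, 1]

    def forward(self, memory):
        # memory: [batch, num_tokens, d_model] flattened backbone/encoder features
        queries = self.query_embed.weight.unsqueeze(0).repeat(memory.size(0), 1, 1)
        decoded = self.decoder(queries, memory)                  # cross-attention to image tokens
        return self.class_head(decoded), self.box_head(decoded).sigmoid()

# Toy usage with random "encoder features" for a batch of 2 images with 600 tokens each.
head = TinyDETRHead()
logits, boxes = head(torch.randn(2, 600, 256))
print(logits.shape, boxes.shape)   # torch.Size([2, 100, 92]) torch.Size([2, 100, 4])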
Table 2. Overview of previous surveys on object detection. For each paper, the publication details are provided.
Title | Year | Venue | Description
Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey [50] | 2018 | SPM | Provides an overview of different object detection domains, including object detection (OD), salient OD, and category-specific OD.
Object Detection in 20 Years: A Survey [73] | 2019 | TPAMI | Gives an overview of the evolution of object detectors.
Deep Learning for Generic Object Detection: A Survey [51] | 2019 | IJCV | A review of deep learning techniques for generic object detection.
A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images [53] | 2020 | PRJ | Reviews deep learning-based methods for semantic segmentation.
A Survey of Modern Deep Learning based Object Detection Models [74] | 2021 | ICV | Briefly overviews deep learning-based (regression-based single-stage and candidate-based two-stage) object detectors.
A Survey of Object Detection Based on CNN and Transformer [70] | 2021 | PRML | Reviews the benefits and drawbacks of deep learning-based object detectors and introduces transformer-based methods.
Transformers in computational visual media: A survey [71] | 2021 | CVM | Focuses on backbone design and low-level vision using vision transformer methods.
A survey: object detection methods from CNN to transformer [68] | 2022 | MTA | Compares various CNN-based detection networks and introduces transformer-based detection networks.
A Survey on Vision Transformer [69] | 2023 | TPAMI | Provides an overview of vision transformers and summarizes state-of-the-art research on vision transformers (ViTs).
Table 3. Performance comparison of all DETR-based detection transformers on the COCO minival set. Networks labeled with DC5 use a dilated feature map. AP50 and AP75 are computed at IoU thresholds of 0.5 and 0.75, respectively, and AP is also reported for small (APS), medium (APM), and large (APL) objects. + indicates bounding-box refinement and ++ denotes two-stage Deformable-DETR. ** indicates that Efficient-DETR used 6 encoder layers and 1 decoder layer. S denotes small and B denotes base. † indicates the distillation mechanism by Touvron et al. [227]. ‡ indicates that the model is pre-trained on ImageNet-21k. All models use 300 object queries as input to the decoder network, except DETR, which uses 100. Models marked with * use three pattern embeddings. All GitHub links in this table were accessed on 25 September 2025.
Methods | Backbone | Publication | Epochs | GFLOPs | Parameters (M) | AP | AP50 | AP75 | APS | APM | APL
DETR [11] (GitHub: https://github.com/facebookresearch/detr) | DC5-ResNet-50 | ECCV 2020 | 50 | 187 | 41 | 35.3 | 55.7 | 36.8 | 15.2 | 37.5 | 53.6
DETR [11] | DC5-ResNet-50 | ECCV 2020 | 500 | 187 | 41 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1
DETR [11] | DC5-ResNet-101 | ECCV 2020 | 500 | 253 | 60 | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3
Deformable-DETR [20] (GitHub: https://github.com/fundamentalvision/Deformable-DETR) | ResNet-50 | ICLR 2021 | 50 | 173 | 40 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0
Deformable-DETR [20] | ResNet-50 + | ICLR 2021 | 50 | 173 | 40 | 45.4 | 64.7 | 49.0 | 26.8 | 48.3 | 61.7
Deformable-DETR [20] | ResNet-50 ++ | ICLR 2021 | 50 | 173 | 40 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7
UP-DETR [21] (GitHub: https://github.com/dddzg/up-detr) | ResNet-50 | CVPR 2021 | 150 | 86 | 41 | 40.5 | 60.8 | 42.6 | 19.0 | 44.4 | 60.0
UP-DETR [21] | ResNet-50 | CVPR 2021 | 300 | 86 | 41 | 42.8 | 63.0 | 45.3 | 20.8 | 47.1 | 61.7
Efficient-DETR [22] | ResNet-50 | arXiv 2021 | 36 | 159 | 32 | 44.2 | 62.2 | 48.0 | 28.4 | 47.5 | 56.6
Efficient-DETR [22] | ResNet-101 | arXiv 2021 | 36 | 239 | 51 | 45.2 | 63.7 | 48.8 | 28.8 | 49.1 | 59.0
Efficient-DETR [22] | ResNet-101 ** | arXiv 2021 | 36 | 289 | 54 | 45.7 | 64.1 | 49.5 | 28.2 | 49.1 | 60.2
SMCA-DETR [23] (GitHub: https://github.com/gaopengcuhk/SMCA-DETR) | ResNet-50 | ICCV 2021 | 50 | 152 | 40 | 43.7 | 63.6 | 47.2 | 24.2 | 47.0 | 60.4
SMCA-DETR [23] | ResNet-50 | ICCV 2021 | 108 | 152 | 40 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6
SMCA-DETR [23] | ResNet-101 | ICCV 2021 | 50 | 218 | 58 | 44.4 | 65.2 | 48.0 | 24.3 | 48.5 | 61.0
TSP-DETR [24] (GitHub: https://github.com/Edward-Sun/TSP-Detection) | FCOS-ResNet-50 | ICCV 2021 | 36 | 189 | 51.5 | 43.1 | 62.3 | 47.0 | 26.6 | 46.8 | 55.9
TSP-DETR [24] | RCNN-ResNet-50 | ICCV 2021 | 36 | 188 | 63.6 | 43.8 | 63.3 | 48.3 | 28.6 | 46.9 | 55.7
Conditional-DETR [25] (GitHub: https://github.com/Atten4Vis/ConditionalDETR) | DC5-ResNet-50 | ICCV 2021 | 50 | 195 | 44 | 43.8 | 64.4 | 46.7 | 24.0 | 47.6 | 60.7
Conditional-DETR [25] | DC5-ResNet-101 | ICCV 2021 | 50 | 262 | 63 | 45.0 | 65.5 | 48.4 | 26.1 | 48.9 | 62.8
WB-DETR [26] (GitHub: https://github.com/aybora/wbdetr) | – | ICCV 2021 | 500 | 98 | 24 | 41.8 | 63.2 | 44.8 | 19.4 | 45.1 | 62.4
PnP-DETR [27] (GitHub: https://github.com/twangnh/pnp-detr) | DC5-ResNet-50 | ICCV 2021 | 500 | 145 | 41 | 43.1 | 63.4 | 45.3 | 22.7 | 46.5 | 61.1
Dynamic-DETR [28] | ResNet-50 | ICCV 2021 | 12 | – | 58 | 42.9 | 61.0 | 46.3 | 24.6 | 44.9 | 54.4
YOLOS-DETR [29] (GitHub: https://github.com/hustvl/YOLOS) | DeiT-S [227] † | NeurIPS 2021 | 150 | 194 | 31 | 36.1 | 56.5 | 37.1 | 15.3 | 38.5 | 56.2
YOLOS-DETR [29] | DeiT-B [227] † | NeurIPS 2021 | 150 | 538 | 127 | 42.0 | 62.2 | 44.5 | 19.5 | 45.3 | 62.1
Anchor-DETR [30] (GitHub: https://github.com/megvii-research/AnchorDETR) | DC5-ResNet-50 * | AAAI 2022 | 50 | 151 | 39 | 44.2 | 64.7 | 47.5 | 24.7 | 48.2 | 60.6
Anchor-DETR [30] | DC5-ResNet-101 * | AAAI 2022 | 50 | 237 | 58 | 45.1 | 65.7 | 48.8 | 25.8 | 49.4 | 61.6
Sparse-DETR [31] (GitHub: https://github.com/kakaobrain/sparse-detr) | ResNet-50-ρ-0.5 | ICLR 2022 | 50 | 136 | 41 | 46.3 | 66.0 | 50.1 | 29.0 | 49.5 | 60.8
Sparse-DETR [31] | Swin-T-ρ-0.5 [228] | ICLR 2022 | 50 | 144 | 41 | 49.3 | 69.5 | 53.3 | 32.0 | 52.7 | 64.9
D2ETR [32] (GitHub: https://github.com/alibaba/easyrobust/tree/main/ddetr) | PVT2 | arXiv 2022 | 50 | 82 | 35 | 43.2 | 62.9 | 46.2 | 22.0 | 48.5 | 62.4
Def D2ETR [32] | PVT2 | arXiv 2022 | 50 | 93 | 40 | 50.0 | 67.9 | 54.1 | 31.7 | 53.4 | 66.7
FP-DETR-S [33] (GitHub: https://github.com/encounter1997/FP-DETR) | – | ICLR 2022 | 50 | 102 | 24 | 42.5 | 62.6 | 45.9 | 25.3 | 45.5 | 56.9
FP-DETR-B [33] | – | ICLR 2022 | 50 | 121 | 36 | 43.3 | 63.9 | 47.7 | 27.5 | 46.1 | 57.0
FP-DETR-B ‡ [33] | – | ICLR 2022 | 50 | 121 | 36 | 43.7 | 64.1 | 47.8 | 26.5 | 46.7 | 58.2
CF-DETR [34] | ResNet-50 | AAAI 2022 | 36 | – | – | 47.8 | 66.5 | 52.4 | 31.2 | 50.6 | 62.8
CF-DETR [34] | ResNet-101 | AAAI 2022 | 36 | – | – | 49.0 | 68.1 | 53.4 | 31.4 | 52.2 | 64.3
DAB-DETR [72] (GitHub: https://github.com/IDEA-Research/DAB-DETR) | DC5-ResNet-50 * | ICLR 2022 | 50 | 216 | 44 | 45.7 | 66.2 | 49.0 | 26.1 | 49.4 | 63.1
DAB-DETR [72] | DC5-ResNet-101 * | ICLR 2022 | 50 | 296 | 63 | 46.6 | 67.0 | 50.2 | 28.1 | 50.5 | 64.1
DN-DETR [35] (GitHub: https://github.com/IDEA-Research/DN-DETR) | ResNet-50 | CVPR 2022 | 50 | 94 | 44 | 44.1 | 64.4 | 46.7 | 22.9 | 48.0 | 63.4
DN-DETR [35] | DC5-ResNet-50 | CVPR 2022 | 50 | 202 | 44 | 46.3 | 66.4 | 49.7 | 26.7 | 50.0 | 64.3
DN-DETR [35] | ResNet-101 | CVPR 2022 | 50 | 174 | 63 | 45.2 | 65.5 | 48.3 | 24.1 | 49.1 | 65.1
DN-DETR [35] | DC5-ResNet-101 | CVPR 2022 | 50 | 282 | 63 | 47.3 | 67.5 | 50.8 | 28.6 | 51.5 | 65.0
AdaMixer [36] (GitHub: https://github.com/MCG-NJU/AdaMixer) | ResNet-50 | CVPR 2022 | 36 | 132 | 139 | 47.0 | 66.0 | 51.1 | 30.1 | 50.2 | 61.8
AdaMixer [36] | ResNeXt-101-DCN | CVPR 2022 | 36 | 214 | 160 | 49.5 | 68.9 | 53.9 | 31.3 | 52.3 | 66.3
AdaMixer [36] | Swin-S [228] | CVPR 2022 | 36 | 234 | 164 | 51.3 | 71.2 | 55.7 | 34.2 | 54.6 | 67.3
REGO [37] (GitHub: https://github.com/zhechen/Deformable-DETR-REGO) | ResNet-50 ++ | CVPR 2022 | 50 | 190 | 54 | 47.6 | 66.8 | 51.6 | 29.6 | 50.6 | 62.3
REGO [37] | ResNet-101 ++ | CVPR 2022 | 50 | 257 | 73 | 48.5 | 67.0 | 52.4 | 29.5 | 52.0 | 64.4
REGO [37] | ResNeXt-101 ++ | CVPR 2022 | 50 | 434 | 119 | 49.1 | 67.5 | 53.1 | 30.0 | 52.6 | 65.0
DINO [38] (GitHub: https://github.com/facebookresearch/dino) | ResNet-50-4scale * | arXiv 2022 | 12 | 279 | 47 | 49.0 | 66.6 | 53.5 | 32.0 | 52.3 | 63.0
DINO [38] | ResNet-50-5scale * | arXiv 2022 | 12 | 860 | 47 | 49.4 | 66.9 | 53.8 | 32.3 | 52.5 | 63.9
DINO [38] | ResNet-50-5scale * | arXiv 2022 | 24 | 860 | 47 | 51.3 | 69.1 | 56.0 | 34.5 | 54.2 | 65.8
DINO [38] | ResNet-50-5scale * | arXiv 2022 | 36 | 860 | 47 | 51.2 | 69.0 | 55.8 | 35.0 | 54.3 | 65.3
Co-DETR [39] (GitHub: https://github.com/Sense-X/Co-DETR) | ResNet-50 * | ICCV 2023 | 12 | 279 | 47 | 52.1 | 69.3 | 57.3 | 35.4 | 55.5 | 67.2
Co-DETR [39] | ResNet-50 * | ICCV 2023 | 36 | 860 | 47 | 54.8 | 72.5 | 60.1 | 38.3 | 58.4 | 69.6
Co-DETR [39] | Swin-L (IN-22K) * | ICCV 2023 | 12 | 860 | 47 | 59.3 | 77.3 | 64.9 | 43.3 | 63.3 | 75.5
Co-DETR [39] | Swin-L (IN-22K) * | ICCV 2023 | 24 | 860 | 47 | 60.4 | 78.3 | 66.4 | 44.6 | 64.2 | 76.5
Co-DETR [39] | Swin-L (IN-22K) * | ICCV 2023 | 36 | 860 | 47 | 60.7 | 78.5 | 66.7 | 45.1 | 64.7 | 76.4
LW-DETR [40] (GitHub: https://github.com/Atten4Vis/LW-DETR) | – | arXiv 2024 | 50 | 67.7 | 54.6 | 54.4 | – | – | 48.0 | 52.5 | 56.1
RT-DETR [41] (GitHub: https://github.com/lyuwenyu/RT-DETR) | ResNet-50 * | CVPR 2024 | 72 | 136 | 42 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0
RT-DETR [41] | ResNet-101 * | CVPR 2024 | 72 | 259 | 76 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1
RT-DETRv2 [224] (GitHub: https://github.com/supervisely-ecosystem/RT-DETRv2) | ResNet-50 * | arXiv 2024 | 72 | 136 | 42 | 53.4 | – | – | – | – | –
RT-DETRv2 [224] | ResNet-101 * | arXiv 2024 | 72 | 259 | 76 | 54.3 | – | – | – | – | –
RT-DETRv3 [225] (GitHub: https://github.com/clxia12/RT-DETRv3) | ResNet-50 * | arXiv 2024 | 72 | 136 | 42 | 53.4 | – | – | – | – | –
RT-DETRv3 [225] | ResNet-101 * | arXiv 2024 | 72 | 259 | 76 | 54.6 | – | – | – | – | –
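The AP, AP50, AP75, APS, APM, and APL values in Table 3 follow the standard COCO evaluation protocol. Assuming detections have been exported in COCO's JSON result format, they can be reproduced with the reference pycocotools API as in the sketch below; the annotation and result file paths are placeholders rather than files associated with this review.

# Hedged sketch: computing the COCO detection metrics reported in Table 3 with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # placeholder path to ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")           # [{"image_id", "category_id", "bbox", "score"}, ...]

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP50, AP75, and AP for small/medium/large objects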
Table 4. Overview of the advantages and limitations of detection transformers. All GitHub links in this table were accessed on 25 September 2025.
Methods | Publication | Advantages | Limitations
DETR [11] (GitHub: https://github.com/facebookresearch/detr) | ECCV 2020 | Removes the need for hand-designed components such as NMS and anchor generation. | Low performance on small objects and slow training convergence.
Deformable-DETR [20] (GitHub: https://github.com/fundamentalvision/Deformable-DETR) | ICLR 2021 | Deformable attention module, which makes training convergence faster. | The number of encoder tokens increases by 20 times compared to DETR.
UP-DETR [21] (GitHub: https://github.com/dddzg/up-detr) | CVPR 2021 | Pre-training for multi-task learning and multi-query localization. | Pre-training covers only patch localization; CNN and transformer pre-training still need to be integrated.
Efficient-DETR [22] | arXiv 2021 | Reduces decoder layers by employing a dense and sparse set-based network. | GFLOPs roughly double compared to the original DETR.
SMCA-DETR [23] (GitHub: https://github.com/gaopengcuhk/SMCA-DETR) | ICCV 2021 | Regression-aware mechanism to increase convergence speed. | Low performance in detecting small objects.
TSP-DETR [24] (GitHub: https://github.com/Edward-Sun/TSP-Detection) | ICCV 2021 | Addresses issues of the Hungarian loss and the cross-attention mechanism of the transformer. | Uses feature points in TSP-FCOS and proposals in TSP-RCNN, as in CNN-based detectors.
Conditional-DETR [25] (GitHub: https://github.com/Atten4Vis/ConditionalDETR) | ICCV 2021 | Conditional spatial queries remove the dependency on content embeddings and ease training. | Outperforms DETR and Deformable-DETR mainly with stronger backbones.
WB-DETR [26] (GitHub: https://github.com/aybora/wbdetr) | ICCV 2021 | Pure transformer network without a backbone. | Low performance on small objects.
PnP-DETR [27] (GitHub: https://github.com/twangnh/pnp-detr) | ICCV 2021 | Sampling module provides foreground features and a small quantity of background features. | Breaks the 2D spatial structure by keeping foreground tokens and reducing background tokens.
Dynamic-DETR [28] | ICCV 2021 | Dynamic attention provides small feature resolution and improves training convergence. | Still depends on CNN components, namely a convolution-based encoder and an RoI-based decoder.
YOLOS-DETR [29] (GitHub: https://github.com/hustvl/YOLOS) | NeurIPS 2021 | Converts a ViT pre-trained on ImageNet-1k into an object detector. | The pre-trained ViT still needs improvement, as it requires long training schedules.
Anchor-DETR [30] (GitHub: https://github.com/megvii-research/AnchorDETR) | AAAI 2022 | Object queries as anchor points that predict multiple objects at one position. | Treats queries as 2D anchor points, which ignores object scale.
Sparse-DETR [31] (GitHub: https://github.com/kakaobrain/sparse-detr) | ICLR 2022 | Improves performance by updating only the tokens referenced by the decoder. | Performance is strongly dependent on the backbone, especially for large objects.
D2ETR [32] (GitHub: https://github.com/alibaba/easyrobust/tree/main/ddetr) | arXiv 2022 | Decoder-only transformer network to reduce computational cost. | Decreases computational complexity significantly but has low performance on small objects.
FP-DETR [33] (GitHub: https://github.com/encounter1997/FP-DETR) | ICLR 2022 | Pre-training of the encoder-only transformer. | Low performance on large objects.
CF-DETR [34] | AAAI 2022 | Refines coarse features to improve the localization accuracy of small objects. | The addition of three new modules increases the network size.
DAB-DETR [72] (GitHub: https://github.com/IDEA-Research/DAB-DETR) | ICLR 2022 | Anchor boxes as queries; attention for objects at different scales. | Positional prior only for foreground objects.
DN-DETR [35] (GitHub: https://github.com/IDEA-Research/DN-DETR) | CVPR 2022 | Denoising training provides a positional prior for foreground and background regions. | Denoising training adds only positive noise to object queries, ignoring background regions.
AdaMixer [36] (GitHub: https://github.com/MCG-NJU/AdaMixer) | CVPR 2022 | Faster convergence; improves the adaptability of the query-based decoding mechanism. | Large number of parameters.
REGO [37] (GitHub: https://github.com/zhechen/Deformable-DETR-REGO) | CVPR 2022 | Attention mechanism gradually focuses on foreground regions more accurately. | Multi-stage RoI-based attention modeling increases the number of parameters.
DINO [38] (GitHub: https://github.com/facebookresearch/dino) | arXiv 2022 | Impressive results on small and medium-sized datasets. | Performance drops for large objects.
Co-DETR [39] (GitHub: https://github.com/Sense-X/Co-DETR) | ICCV 2023 | Enhances encoder feature learning and decoder attention via collaborative hybrid assignments. | Increases training complexity due to multiple assignment heads.
LW-DETR [40] (GitHub: https://github.com/Atten4Vis/LW-DETR) | arXiv 2024 | Achieves real-time detection with a lightweight transformer design using an optimized ViT encoder and window attention. | Limited evaluation on benchmarks; less mature than YOLO-style detectors.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
