Object Detection with Transformers: A Review
Abstract
1. Introduction
- A detailed review of transformer-based detection methods from an architectural perspective. We categorize and summarize improvements to the detection transformer (DETR) according to backbone modifications, pre-training level, attention mechanism, query design, etc. This analysis aims to help researchers develop a deeper understanding of the key components of detection transformers and how they affect performance.
- A performance evaluation of detection transformers. We evaluate improvements in detection transformers on the popular MS COCO benchmark [75]. We also highlight the advantages and limitations of these approaches.
- An analysis of accuracy and computational complexity of improved versions of detection transformers. We present an evaluative comparison of state-of-the-art transformer-based detection methods with respect to attention mechanisms, backbone modifications, and query designs.
- An overview of the key building blocks of detection transformers and of future directions for further improving their performance. We examine key architectural design modules that affect network performance and training convergence and offer possible suggestions for future research. Readers interested in ongoing developments in detection transformers can refer to our GitHub repository: https://github.com/mindgarage-shan/transformer_object_detection_survey (accessed on 25 September 2025).
2. Object Detection and Transformers in Vision
2.1. Object Detection
2.2. Transformer for Segmentation
2.3. Transformers for Scene and Image Generation
2.4. Transformers for Low-Level Vision
2.5. Transformers for Multi-Modal Tasks
3. Detection Transformers
3.1. DETR
3.2. Deformable-DETR
Algorithm 1: Multi-scale deformable attention in Deformable-DETR.
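At a high level, multi-scale deformable attention lets each query attend to only a small, learned set of locations: the query predicts sampling offsets and attention weights per head, level, and point, samples the value features bilinearly at those locations, and aggregates the results. The following is a minimal PyTorch sketch of this idea; the module name, default sizes, and the pure `grid_sample`-based implementation are illustrative assumptions, whereas the reference Deformable-DETR code uses a custom CUDA operator and also handles padding masks and level embeddings.

```python
# Minimal sketch of multi-scale deformable attention (illustrative, not the reference code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDeformableAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_levels=4, n_points=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.n_levels, self.n_points = n_heads, n_levels, n_points
        self.head_dim = d_model // n_heads
        # Each query predicts sampling offsets and attention weights
        # for every (head, level, point) combination.
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, query, reference_points, value_list):
        """
        query:            (B, Nq, C) query embeddings
        reference_points: (B, Nq, 2) normalized (x, y) reference locations in [0, 1]
        value_list:       list of n_levels feature maps, each of shape (B, C, H_l, W_l)
        """
        B, Nq, C = query.shape
        offsets = self.sampling_offsets(query).view(
            B, Nq, self.n_heads, self.n_levels, self.n_points, 2)
        weights = self.attention_weights(query).view(
            B, Nq, self.n_heads, self.n_levels * self.n_points).softmax(-1)
        weights = weights.view(B, Nq, self.n_heads, self.n_levels, self.n_points)

        out = query.new_zeros(B, Nq, self.n_heads, self.head_dim)
        for lvl, feat in enumerate(value_list):
            H, W = feat.shape[-2:]
            # Project values and split into heads: (B * n_heads, head_dim, H, W).
            v = self.value_proj(feat.flatten(2).transpose(1, 2))        # (B, H*W, C)
            v = v.transpose(1, 2).reshape(B * self.n_heads, self.head_dim, H, W)
            # Sampling location = reference point + predicted offset, normalized per level.
            wh = torch.tensor([W, H], dtype=query.dtype, device=query.device)
            loc = reference_points[:, :, None, None, :] + offsets[:, :, :, lvl] / wh
            grid = (2.0 * loc - 1.0).permute(0, 2, 1, 3, 4)             # (B, heads, Nq, points, 2)
            grid = grid.reshape(B * self.n_heads, Nq, self.n_points, 2)
            sampled = F.grid_sample(v, grid, align_corners=False)       # (B*heads, head_dim, Nq, points)
            sampled = sampled.reshape(B, self.n_heads, self.head_dim, Nq, self.n_points)
            w = weights[:, :, :, lvl].permute(0, 2, 1, 3)               # (B, heads, Nq, points)
            out = out + (sampled * w[:, :, None]).sum(-1).permute(0, 3, 1, 2)
        return self.output_proj(out.reshape(B, Nq, C))


# Toy usage with hypothetical sizes: 100 queries over 4 feature levels.
if __name__ == "__main__":
    attn = MultiScaleDeformableAttention()
    q, ref = torch.randn(2, 100, 256), torch.rand(2, 100, 2)
    feats = [torch.randn(2, 256, s, s) for s in (64, 32, 16, 8)]
    print(attn(q, ref, feats).shape)  # torch.Size([2, 100, 256])
```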
3.3. UP-DETR
Algorithm 2: Patch detection pre-training in UP-DETR.
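The pre-training idea behind UP-DETR is to generate free localization supervision from unlabeled images: random patches are cropped, their CNN features are added to the object queries, and the detector is trained to predict where each patch came from. The snippet below is a schematic PyTorch sketch of one such pre-training step; the `detr(image, patch_queries=...)` interface and the helper names are assumptions for illustration, and the actual method additionally freezes the CNN backbone, adds a patch-feature reconstruction loss, and assigns patches to query groups with attention masking.

```python
# Schematic sketch of random query patch detection pre-training (illustrative interfaces).
import random
import torch
import torch.nn.functional as F


def sample_patch_and_box(image, min_size=32):
    """Crop a random patch from an unlabeled image; return the patch and its
    normalized (cx, cy, w, h) box in the source image."""
    _, H, W = image.shape
    w = random.randint(min_size, W // 2)
    h = random.randint(min_size, H // 2)
    x = random.randint(0, W - w)
    y = random.randint(0, H - h)
    patch = image[:, y:y + h, x:x + w]
    box = torch.tensor([(x + w / 2) / W, (y + h / 2) / H, w / W, h / H])
    return patch, box


def updetr_pretraining_step(detr, backbone, images, num_patches=10):
    """One unsupervised step: detect randomly cropped patches in their source images."""
    losses = []
    for image in images:                                  # each image: (3, H, W) tensor
        patches, boxes = zip(*[sample_patch_and_box(image) for _ in range(num_patches)])
        # Global-pooled CNN features of each patch are added to the object queries.
        patch_feats = torch.stack([
            backbone(p.unsqueeze(0)).mean(dim=(-2, -1)).squeeze(0) for p in patches])
        pred_logits, pred_boxes = detr(image.unsqueeze(0), patch_queries=patch_feats)
        # Simplified supervision: query i is responsible for patch i (no Hungarian
        # matching here); binary patch/background classification + L1 box regression.
        cls_loss = F.cross_entropy(pred_logits[0, :num_patches],
                                   torch.zeros(num_patches, dtype=torch.long))
        box_loss = F.l1_loss(pred_boxes[0, :num_patches], torch.stack(boxes))
        losses.append(cls_loss + box_loss)
    return torch.stack(losses).mean()
```

At fine-tuning time, the randomly cropped patch queries are discarded and the pre-trained transformer weights are used to initialize the downstream detector.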
3.4. Efficient-DETR
3.5. SMCA-DETR
3.6. TSP-DETR
3.7. Conditional-DETR
3.8. WB-DETR
3.9. PnP-DETR
3.10. Dynamic-DETR
3.11. YOLOS-DETR
3.12. Anchor-DETR
3.13. Sparse-DETR
3.14. D2ETR
3.15. FP-DETR
3.16. CF-DETR
3.17. DAB-DETR
3.18. DN-DETR
3.19. AdaMixer
3.20. REGO-DETR
3.21. DINO
3.22. Co-DETR
3.23. LW-DETR
3.24. RT-DETR
4. Results and Discussion
5. Open Challenges and Future Directions
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
- Girshick, R.B. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Lin, T.; Goyal, P.; Girshick, R.B.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
- Shehzadi, T.; Majid, A.; Hameed, M.; Farooq, A.; Yousaf, A. Intelligent predictor using cancer-related biologically information extraction from cancer transcriptomes. In Proceedings of the 2020 International Symposium on Recent Advances in Electrical Engineering & Computer Sciences (RAEE & CS), Islamabad, Pakistan, 20–22 October 2020; Volume 5, pp. 1–5. [Google Scholar] [CrossRef]
- Sarode, S.; Khan, M.S.U.; Shehzadi, T.; Stricker, D.; Afzal, M.Z. Classroom-Inspired Multi-mentor Distillation with Adaptive Learning Strategies. In Proceedings of the Intelligent Systems and Applications, Amsterdam, The Netherlands, 27–28 August 2025; Arai, K., Ed.; Springer Nature: Cham, Switzerland, 2025; pp. 294–324. [Google Scholar] [CrossRef]
- Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2013, arXiv:1311.2524. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. Available online: https://dl.acm.org/doi/10.5555/3295222.3295349 (accessed on 25 September 2025).
- Khan, M.S.U.; Shehzadi, T.; Noor, R.; Stricker, D.; Afzal, M.Z. Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification. arXiv 2024, arXiv:2406.14370. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision–ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar] [CrossRef]
- Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Object Detection with Transformers: A Review. arXiv 2023, arXiv:2306.04670. [Google Scholar] [CrossRef]
- Sheikh, T.U.; Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. UnSupDLA: Towards Unsupervised Document Layout Analysis. arXiv 2024, arXiv:2406.06236. [Google Scholar]
- Ehsan, I.; Shehzadi, T.; Stricker, D.; Afzal, M.Z. End-to-End Semi-Supervised approach with Modulated Object Queries for Table Detection in Documents. Int. J. Document Anal. Recognit. 2024, 27, 363–378. Available online: https://api.semanticscholar.org/CorpusID:269626070 (accessed on 25 September 2025). [CrossRef]
- Shehzadi, T.; Stricker, D.; Afzal, M.Z. A Hybrid Approach for Document Layout Analysis in Document images. arXiv 2024, arXiv:2404.17888. [Google Scholar] [CrossRef]
- Shehzadi, T.; Sarode, S.; Stricker, D.; Afzal, M.Z. Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer. arXiv 2024, arXiv:2405.00187. [Google Scholar]
- Saeed, W.; Saleh, M.S.; Gull, M.N.; Raza, H.; Saeed, R.; Shehzadi, T. Geometric features and traffic dynamic analysis on 4-leg intersections. Int. Rev. Appl. Sci. Eng. 2024, 15, 171–188. [Google Scholar] [CrossRef]
- Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Liwicki, M.; Afzal, M.Z. Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images. arXiv 2023, arXiv:2306.13526. [Google Scholar] [CrossRef]
- Shehzadi, T.; Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection. arXiv 2024, arXiv:2404.01819. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised Pre-training for Object Detection with Transformers. arXiv 2020, arXiv:2011.09094. [Google Scholar]
- Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient DETR: Improving End-to-End Object Detector with Dense Prior. arXiv 2021, arXiv:2104.01318. [Google Scholar]
- Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast Convergence of DETR with Spatially Modulated Co-Attention. arXiv 2021, arXiv:2101.07448. [Google Scholar]
- Sun, Z.; Cao, S.; Yang, Y.; Kitani, K. Rethinking Transformer-based Set Prediction for Object Detection. arXiv 2020, arXiv:2011.10881. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. arXiv 2021, arXiv:2108.06152. [Google Scholar]
- Liu, F.; Wei, H.; Zhao, W.; Li, G.; Peng, J.; Li, Z. WB-DETR: Transformer-Based Detector without Backbone. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2959–2967. [Google Scholar] [CrossRef]
- Wang, T.; Yuan, L.; Chen, Y.; Feng, J.; Yan, S. PnP-DETR: Towards Efficient Visual Analysis with Transformers. arXiv 2021, arXiv:2109.07036. [Google Scholar]
- Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the 2021 International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; Available online: https://www.microsoft.com/en-us/research/publication/dynamic-detr-end-to-end-object-detection-with-dynamic-attention/ (accessed on 25 September 2025).
- Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. arXiv 2021, arXiv:2106.00666. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query Design for Transformer-Based Detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Available online: https://api.semanticscholar.org/CorpusID:237513850 (accessed on 25 September 2025).
- Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
- Lin, J.; Mao, X.; Chen, Y.; Xu, L.; He, Y.; Xue, H. D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention. arXiv 2022, arXiv:2203.00860. [Google Scholar] [CrossRef]
- Wang, W.; Cao, Y.; Zhang, J.; Tao, D. FP-DETR: Detection Transformer Advanced by Fully Pre-training. In Proceedings of the International Conference on Learning Representations, Virtual Event, 25–29 April 2022; Available online: https://openreview.net/forum?id=yjMQuLLcGWK (accessed on 25 September 2025).
- Cao, X.; Yuan, P.; Feng, B.; Niu, K. CF-DETR: Coarse-to-Fine Transformers for End-to-End Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Available online: https://api.semanticscholar.org/CorpusID:250293790 (accessed on 25 September 2025).
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2239–2251. [Google Scholar] [CrossRef]
- Gao, Z.; Wang, L.; Han, B.; Guo, S. AdaMixer: A Fast-Converging Query-Based Object Detector. arXiv 2022, arXiv:2203.16507. [Google Scholar] [CrossRef]
- Chen, Z.; Zhang, J.; Tao, D. Recurrent Glimpse-based Decoder for Detection with Transformer. arXiv 2021, arXiv:2112.04632. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar] [CrossRef]
- Zong, Z.; Song, G.; Liu, Y. DETRs with Collaborative Hybrid Assignments Training. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 6725–6735. [Google Scholar] [CrossRef]
- Chen, Q.; Su, X.; Zhang, X.; Wang, J.; Chen, J.; Shen, Y.; Han, C.; Chen, Z.; Xu, W.; Li, F.; et al. LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection. arXiv 2024, arXiv:2406.03459. Available online: https://arxiv.org/abs/2406.03459 (accessed on 25 September 2025).
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. Available online: https://openaccess.thecvf.com/content/CVPR2024/html/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024_paper.html (accessed on 25 September 2025).
- Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G. Recent Advances in Convolutional Neural Networks. arXiv 2015, arXiv:1512.07108. [Google Scholar] [CrossRef]
- Borji, A.; Cheng, M.; Jiang, H.; Li, J. Salient Object Detection: A Survey. arXiv 2014, arXiv:1411.5878. [Google Scholar] [CrossRef]
- Chen, G.; Wang, H.; Chen, K.; Li, Z.; Song, Z.; Liu, Y.; Chen, W.; Knoll, A. A Survey of the Four Pillars for Small Object Detection: Multiscale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 936–953. [Google Scholar] [CrossRef]
- Agarwal, S.; du Terrail, J.O.; Jurie, F. Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks. arXiv 2018, arXiv:1809.03193. [Google Scholar]
- Yang, M.H.; Kriegman, D.; Ahuja, N. Detecting faces in images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 34–58. [Google Scholar] [CrossRef]
- Zhao, B.; Feng, J.; Wu, X.; Yan, S. A survey on deep learning-based fine-grained object classification and semantic segmentation. Int. J. Autom. Comput. 2017, 14, 119–135. Available online: https://api.semanticscholar.org/CorpusID:53076119 (accessed on 25 September 2025). [CrossRef]
- Goswami, T.; Barad, Z.; Desai, P.; Nikita, P. Text Detection and Recognition in images: A survey. arXiv 2018, arXiv:1803.07278. [Google Scholar] [CrossRef]
- Chaudhari, S.; Polatkan, G.; Ramanath, R.; Mithal, V. An Attentive Survey of Attention Models. arXiv 2019, arXiv:1904.02874. [Google Scholar] [CrossRef]
- Han, J.; Zhang, D.; Cheng, G.; Liu, N.; Xu, D. Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey. IEEE Signal Process. Mag. 2018, 35, 84–100. [Google Scholar] [CrossRef]
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.W.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. arXiv 2018, arXiv:1809.02165. [Google Scholar] [CrossRef]
- Enzweiler, M.; Gavrila, D.M. Monocular Pedestrian Detection: Survey and Experiments. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 2179–2195. [Google Scholar] [CrossRef]
- Ülkü, I.; Akagündüz, E. A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images. arXiv 2019, arXiv:1912.10230. [Google Scholar]
- Cheng, G.; Han, J. A Survey on Object Detection in Optical Remote Sensing Images. arXiv 2016, arXiv:1603.06201. [Google Scholar] [CrossRef]
- Sommer, L.W.; Schuchert, T.; Beyerer, J. Fast Deep Vehicle Detection in Aerial Images. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 2–29 March 2017; pp. 311–319. [Google Scholar] [CrossRef]
- Zhang, P.; Niu, X.; Dou, Y.; Xia, F. Airport Detection on Optical Satellite Images Using Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1183–1187. [Google Scholar] [CrossRef]
- Bach, M.; Stumper, D.; Dietmayer, K. Deep Convolutional Traffic Light Recognition for Automated Driving. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 851–858. [Google Scholar] [CrossRef]
- de la Escalera, A.; Moreno, L.; Salichs, M.; Armingol, J. Road traffic sign detection and classification. IEEE Trans. Ind. Electron. 1997, 44, 848–859. [Google Scholar] [CrossRef]
- Shehzadi, T.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Mask-Aware Semi-Supervised Object Detection in Floor Plans. Appl. Sci. 2022, 12, 9398. [Google Scholar] [CrossRef]
- Hariharan, B.; Arbelaez, P.; Girshick, R.B.; Malik, J. Simultaneous Detection and Segmentation. arXiv 2014, arXiv:1407.1808. [Google Scholar] [CrossRef]
- Hariharan, B.; Arbeláez, P.A.; Girshick, R.B.; Malik, J. Hypercolumns for Object Segmentation and Fine-grained Localization. arXiv 2014, arXiv:1411.5752. [Google Scholar]
- Dai, J.; He, K.; Sun, J. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015, arXiv:1512.04412. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep Visual-Semantic Alignments for Generating Image Descriptions. arXiv 2014, arXiv:1412.2306. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
- Wu, Q.; Shen, C.; van den Hengel, A.; Wang, P.; Dick, A.R. Image Captioning and Visual Question Answering Based on Attributes and Their Related External Knowledge. arXiv 2016, arXiv:1603.02814. [Google Scholar]
- Bai, S.; An, S. A survey on automatic image caption generation. Neurocomputing 2018, 311, 291–304. [Google Scholar] [CrossRef]
- Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X.; et al. T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos. arXiv 2016, arXiv:1604.02532. [Google Scholar] [CrossRef]
- Arkin, E.; Yadikar, N.; Xu, X.; Aysa, A.; Ubul, K. A survey: Object detection methods from CNN to transformer. Multimed. Tools Appl. 2022, 82, 21353–21383. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
- Arkin, E.; Yadikar, N.; Muhtar, Y.; Ubul, K. A Survey of Object Detection Based on CNN and Transformer. In Proceedings of the 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), Chengdu, China, 16–18 July 2021; pp. 99–108. [Google Scholar] [CrossRef]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
- Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. arXiv 2019, arXiv:1905.05055. [Google Scholar] [CrossRef]
- Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.N.; Lee, B. A Survey of Modern Deep Learning based Object Detection Models. arXiv 2021, arXiv:2104.11892. [Google Scholar] [CrossRef]
- Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312. [Google Scholar]
- Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A Survey of Deep Learning-based Object Detection. arXiv 2019, arXiv:1907.09408. [Google Scholar] [CrossRef]
- Ahmed, M.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Survey and Performance Analysis of Deep Learning Based Object Detection in Challenging Environments. Sensors 2021, 21, 5116. [Google Scholar] [CrossRef] [PubMed]
- Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2009, 88, 303–308. Available online: https://www.microsoft.com/en-us/research/publication/the-pascal-visual-object-classes-voc-challenge/ (accessed on 25 September 2025). [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. Available online: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 25 September 2025).
- Bar, A.; Wang, X.; Kantorov, V.; Reed, C.J.; Herzig, R.; Chechik, G.; Rohrbach, A.; Darrell, T.; Globerson, A. DETReg: Unsupervised Pretraining with Region Priors for Object Detection. arXiv 2021, arXiv:2106.04550. [Google Scholar]
- Bateni, P.; Barber, J.; van de Meent, J.; Wood, F. Improving Few-Shot Visual Classification with Unlabelled Examples. arXiv 2020, arXiv:2006.12245. [Google Scholar]
- Wang, X.; Yang, X.; Zhang, S.; Li, Y.; Feng, L.; Fang, S.; Lyu, C.; Chen, K.; Zhang, W. Consistent Targets Provide Better Supervision in Semi-supervised Object Detection. arXiv 2022, arXiv:2209.01589. [Google Scholar] [CrossRef]
- Li, Y.; Huang, D.; Qin, D.; Wang, L.; Gong, B. Improving Object Detection with Selective Self-supervised Self-training. arXiv 2020, arXiv:2007.09162. [Google Scholar] [CrossRef]
- Hashmi, K.A.; Stricker, D.; Afzal, M.Z. Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection. arXiv 2022, arXiv:2210.02368. [Google Scholar]
- Hashmi, K.A.; Pagani, A.; Stricker, D.; Afzal, M.Z. BoxMask: Revisiting Bounding Box Supervision for Video Object Detection. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2029–2039. [Google Scholar] [CrossRef]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294. [Google Scholar] [CrossRef]
- Li, C.; Yang, J.; Zhang, P.; Gao, M.; Xiao, B.; Dai, X.; Yuan, L.; Gao, J. Efficient Self-supervised Vision Transformers for Representation Learning. arXiv 2021, arXiv:2106.09785. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2015, arXiv:1512.02325. [Google Scholar]
- Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
- Fu, C.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional Single Shot Detector. arXiv 2017, arXiv:1701.06659. [Google Scholar] [CrossRef]
- Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587. [Google Scholar] [CrossRef]
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. arXiv 2017, arXiv:1711.06897. [Google Scholar]
- Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. arXiv 2018, arXiv:1808.01244. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv 2014, arXiv:1406.4729. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
- Qiao, S.; Chen, L.; Yuille, A.L. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. arXiv 2020, arXiv:2006.02334. [Google Scholar] [CrossRef]
- Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. arXiv 2019, arXiv:1901.07518. [Google Scholar] [CrossRef]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. arXiv 2017, arXiv:1712.00726. [Google Scholar] [CrossRef]
- Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.G.; Zhu, M.; Zhmoginov, A.; Chen, L. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv 2018, arXiv:1801.04381. [Google Scholar]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar] [CrossRef]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. arXiv 2017, arXiv:1707.01083. [Google Scholar] [CrossRef]
- Wang, R.J.; Li, X.; Ao, S.; Ling, C.X. Pelee: A Real-Time Object Detection System on Mobile Devices. arXiv 2018, arXiv:1804.06882. [Google Scholar]
- Ma, N.; Zhang, X.; Zheng, H.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv 2018, arXiv:1807.11164. [Google Scholar] [CrossRef]
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv 2018, arXiv:1807.11626. [Google Scholar]
- Yousaf, A.; Sazonov, E. Food Intake Detection in the Face of Limited Sensor Signal Annotations. In Proceedings of the 2024 Tenth International Conference on Communications and Electronics (ICCE), Da Nang, Vietnam, 31 July–2 August 2024; pp. 351–356. [Google Scholar] [CrossRef]
- Cai, H.; Gan, C.; Han, S. Once for All: Train One Network and Specialize it for Efficient Deployment. arXiv 2019, arXiv:1908.09791. [Google Scholar]
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image. arXiv 2017, arXiv:1703.07570. [Google Scholar]
- Mousavian, A.; Anguelov, D.; Flynn, J.; Kosecka, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. arXiv 2016, arXiv:1612.00496. [Google Scholar]
- Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving. arXiv 2019, arXiv:1903.10955. [Google Scholar] [CrossRef]
- Li, P.; Chen, X.; Shen, S. Stereo R-CNN based 3D Object Detection for Autonomous Driving. arXiv 2019, arXiv:1902.09738. [Google Scholar] [CrossRef]
- Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T. Geometry-based Distance Decomposition for Monocular 3D Object Detection. arXiv 2021, arXiv:2104.03775. [Google Scholar]
- Ma, X.; Zhang, Y.; Xu, D.; Zhou, D.; Yi, S.; Li, H.; Ouyang, W. Delving into Localization Errors for Monocular 3D Object Detection. arXiv 2021, arXiv:2103.16237. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, L.; Liu, M. YOLOStereo3D: A Step Back to 2D for Efficient Stereo 3D Detection. arXiv 2021, arXiv:2103.09422. [Google Scholar] [CrossRef]
- Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. arXiv 2020, arXiv:2006.11275. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. arXiv 2017, arXiv:1711.06396. [Google Scholar]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. arXiv 2018, arXiv:1812.05784. [Google Scholar]
- Xu, Q.; Zhong, Y.; Neumann, U. Behind the Curtain: Learning Occluded Shapes for 3D Object Detection. arXiv 2021, arXiv:2112.02205. [Google Scholar] [CrossRef]
- Zheng, W.; Tang, W.; Chen, S.; Jiang, L.; Fu, C. CIA-SSD: Confident IoU-Aware Single-Stage Object Detector from Point Cloud. arXiv 2020, arXiv:2012.03015. [Google Scholar] [CrossRef]
- Zheng, W.; Tang, W.; Jiang, L.; Fu, C. SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud. arXiv 2021, arXiv:2104.09804. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. arXiv 2020, arXiv:2012.15712. [Google Scholar] [CrossRef]
- Sheng, H.; Cai, S.; Liu, Y.; Deng, B.; Huang, J.; Hua, X.; Zhao, M. Improving 3D Object Detection with Channel-wise Transformer. arXiv 2021, arXiv:2108.10723. [Google Scholar] [CrossRef]
- Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel Transformer for 3D Object Detection. arXiv 2021, arXiv:2109.02497. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. arXiv 2019, arXiv:1911.10150. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. arXiv 2017, arXiv:1712.02294. [Google Scholar]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. arXiv 2020, arXiv:2012.10992. [Google Scholar] [CrossRef]
- Yoo, J.H.; Kim, Y.; Kim, J.S.; Choi, J.W. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection. arXiv 2020, arXiv:2004.12636. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. arXiv 2020, arXiv:2009.00784. [Google Scholar]
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. arXiv 2019, arXiv:1904.04745. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar] [CrossRef]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv 2020, arXiv:2012.15840. [Google Scholar]
- Strudel, R.; Pinel, R.G.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. arXiv 2021, arXiv:2105.05633. [Google Scholar] [CrossRef]
- Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-Alone Self-Attention in Vision Models. arXiv 2019, arXiv:1906.05909. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar] [CrossRef]
- Kirillov, A.; He, K.; Girshick, R.B.; Rother, C.; Dollár, P. Panoptic Segmentation. arXiv 2018, arXiv:1801.00868. [Google Scholar]
- Wang, H.; Zhu, Y.; Green, B.; Adam, H.; Yuille, A.L.; Chen, L. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. arXiv 2020, arXiv:2003.07853. [Google Scholar]
- Neuhold, G.; Ollmann, T.; Bulò, S.R.; Kontschieder, P. The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5000–5009. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. arXiv 2016, arXiv:1604.01685. [Google Scholar] [CrossRef]
- Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative Adversarial Text to Image Synthesis. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; Proceedings of Machine Learning Research: New York, NY, USA, 2016; Volume 48, pp. 1060–1069. Available online: https://proceedings.mlr.press/v48/reed16.html (accessed on 25 September 2025).
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; Metaxas, D.N. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv 2016, arXiv:1612.03242. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv 2017, arXiv:1710.10916. [Google Scholar]
- Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv 2017, arXiv:1711.10485. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
- Murahari, M.D.; Reddy; Sk, M.; Basha, M.M.M.; Hari, M.N.C.; Student, P. DALL-E: Creating Images from Text. 2021. Available online: https://api.semanticscholar.org/CorpusID:261026641 (accessed on 25 September 2025).
- Wang, X.; Yeshwanth, C.; Nießner, M. SceneFormer: Indoor Scene Generation with Transformers. arXiv 2020, arXiv:2012.09793. [Google Scholar]
- Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative Pretraining From Pixels. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; III, H.D., Singh, A., Eds.; Proceedings of Machine Learning Research (PMLR): New York, NY, USA, 2020; Volume 119, pp. 1691–1703. Available online: https://proceedings.mlr.press/v119/chen20s.html (accessed on 25 September 2025).
- Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. arXiv 2020, arXiv:2012.09841. [Google Scholar]
- Jiang, Y.; Chang, S.; Wang, Z. TransGAN: Two Transformers Can Make One Strong GAN. arXiv 2021, arXiv:2102.07074. [Google Scholar]
- Bhunia, A.K.; Khan, S.H.; Cholakkal, H.; Anwer, R.M.; Khan, F.S.; Shah, M. Handwriting Transformers. arXiv 2021, arXiv:2104.03964. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009; Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 25 September 2025).
- Coates, A.; Ng, A.; Lee, H. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; Gordon, G., Dunson, D., Dudík, M., Eds.; Proceedings of Machine Learning Research: New York, NY, USA, 2011; Volume 15, pp. 215–223. Available online: https://proceedings.mlr.press/v15/coates11a.html (accessed on 25 September 2025).
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R.B. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv 2019, arXiv:1911.05722. [Google Scholar]
- Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning Representations by Maximizing Mutual Information Across Views. arXiv 2019, arXiv:1906.00910. [Google Scholar] [CrossRef]
- Hénaff, O.J.; Srinivas, A.; Fauw, J.D.; Razavi, A.; Doersch, C.; Eslami, S.M.A.; van den Oord, A. Data-Efficient Image Recognition with Contrastive Predictive Coding. arXiv 2019, arXiv:1905.09272. [Google Scholar]
- Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. Available online: https://api.semanticscholar.org/CorpusID:11758569 (accessed on 25 September 2025).
- Gao, C.; Chen, Y.; Liu, S.; Tan, Z.; Yan, S. AdversarialNAS: Adversarial Neural Architecture Search for GANs. arXiv 2019, arXiv:1912.02037. [Google Scholar]
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. arXiv 2019, arXiv:1912.04958. [Google Scholar]
- Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning Texture Transformer Network for Image Super-Resolution. arXiv 2020, arXiv:2006.04139. [Google Scholar] [CrossRef]
- Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. arXiv 2020, arXiv:2012.00364. [Google Scholar]
- Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. arXiv 2021, arXiv:2108.10257. [Google Scholar] [CrossRef]
- Wang, Z.; Cun, X.; Bao, J.; Liu, J. Uformer: A General U-Shaped Transformer for Image Restoration. arXiv 2021, arXiv:2106.03106. [Google Scholar] [CrossRef]
- Kumar, M.; Weissenborn, D.; Kalchbrenner, N. Colorization Transformer. arXiv 2021, arXiv:2102.04432. [Google Scholar]
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. arXiv 2015, arXiv:1505.00468. [Google Scholar]
- Zellers, R.; Bisk, Y.; Farhadi, A.; Choi, Y. From Recognition to Cognition: Visual Commonsense Reasoning. arXiv 2018, arXiv:1811.10830. [Google Scholar]
- Lee, K.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked Cross Attention for Image-Text Matching. arXiv 2018, arXiv:1803.08024. [Google Scholar] [CrossRef]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. arXiv 2014, arXiv:1411.4555. [Google Scholar]
- Chen, Y.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Learning UNiversal Image-TExt Representations. arXiv 2019, arXiv:1909.11740. [Google Scholar]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. arXiv 2020, arXiv:2004.06165. [Google Scholar]
- Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A Joint Model for Video and Language Representation Learning. arXiv 2019, arXiv:1904.01766. [Google Scholar] [CrossRef]
- Li, G.; Duan, N.; Fang, Y.; Jiang, D.; Zhou, M. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. arXiv 2019, arXiv:1908.06066. [Google Scholar] [CrossRef]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.; Chang, K. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar] [CrossRef]
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv 2019, arXiv:1908.07490. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv 2019, arXiv:1908.02265. [Google Scholar]
- Lee, S.; Yu, Y.; Kim, G.; Breuel, T.M.; Kautz, J.; Song, Y. Parameter Efficient Multimodal Transformers for Video Representation Learning. arXiv 2020, arXiv:2012.04124. [Google Scholar]
- Sun, N.; Zhu, Y.; Hu, X. Faster R-CNN Based Table Detection Combining Corner Locating. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1314–1319. [Google Scholar] [CrossRef]
- Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A. Image Transformer. arXiv 2018, arXiv:1802.05751. [Google Scholar]
- Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. arXiv 2019, arXiv:1904.09925. [Google Scholar]
- Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.D.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. arXiv 2019, arXiv:1902.09630. [Google Scholar] [CrossRef]
- van den Oord, A.; Li, Y.; Babuschkin, I.; Simonyan, K.; Vinyals, O.; Kavukcuoglu, K.; van den Driessche, G.; Lockhart, E.; Cobo, L.C.; Stimberg, F.; et al. Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv 2017, arXiv:1711.10433. [Google Scholar]
- Gu, J.; Bradbury, J.; Xiong, C.; Li, V.O.K.; Socher, R. Non-Autoregressive Neural Machine Translation. arXiv 2017, arXiv:1711.02281. [Google Scholar]
- Ghazvininejad, M.; Levy, O.; Liu, Y.; Zettlemoyer, L. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6112–6121. [Google Scholar] [CrossRef]
- Stewart, R.; Andriluka, M. End-to-end people detection in crowded scenes. arXiv 2015, arXiv:1506.04878. [Google Scholar]
- Romera-Paredes, B.; Torr, P.H.S. Recurrent Instance Segmentation. arXiv 2015, arXiv:1511.08250. [Google Scholar]
- Park, E.; Berg, A.C. Learning to decompose for object detection and instance segmentation. arXiv 2015, arXiv:1511.06449. [Google Scholar]
- Ren, M.; Zemel, R.S. End-to-End Instance Segmentation and Counting with Recurrent Attention. arXiv 2016, arXiv:1605.09410. [Google Scholar]
- Salvador, A.; Bellver, M.; Baradad, M.; Marqués, F.; Torres, J.; Giró-i-Nieto, X. Recurrent Neural Networks for Semantic Instance Segmentation. arXiv 2017, arXiv:1712.00617. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar] [CrossRef]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168. [Google Scholar] [CrossRef]
- Zhang, H.; Wang, J. Towards Adversarially Robust Object Detection. arXiv 2019, arXiv:1907.10310. [Google Scholar] [CrossRef]
- Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking Classification and Localization in R-CNN. arXiv 2019, arXiv:1904.06493. [Google Scholar] [CrossRef]
- Song, G.; Liu, Y.; Wang, X. Revisiting the Sibling Head in Object Detector. arXiv 2020, arXiv:2003.07540. [Google Scholar] [CrossRef]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv 2019, arXiv:1905.03197. [Google Scholar] [CrossRef]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. Available online: http://jmlr.org/papers/v15/srivastava14a.html (accessed on 25 September 2025).
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. arXiv 2020, arXiv:2011.12450. [Google Scholar]
- Zhang, X.; Wan, F.; Liu, C.; Ji, R.; Ye, Q. FreeAnchor: Learning to Match Anchors for Visual Object Detection. arXiv 2019, arXiv:1909.02466. [Google Scholar] [CrossRef]
- Kim, K.; Lee, H.S. Probabilistic Anchor Assignment with IoU Prediction for Object Detection. arXiv 2020, arXiv:2007.08103. [Google Scholar] [CrossRef]
- Li, H.; Wu, Z.; Zhu, C.; Xiong, C.; Socher, R.; Davis, L.S. Learning from Noisy Anchors for One-stage Object Detection. arXiv 2019, arXiv:1912.05086. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar] [CrossRef]
- Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494. [Google Scholar] [CrossRef]
- Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-Nets: Double Attention Networks. arXiv 2018, arXiv:1810.11579. [Google Scholar]
- Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN Models for Fine-Grained Visual Recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar] [CrossRef]
- Wang, X.; Zhang, S.; Yu, Z.; Feng, L.; Zhang, W. Scale-Equalizing Pyramid Convolution for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13356–13365. Available online: https://api.semanticscholar.org/CorpusID:218537867 (accessed on 25 September 2025).
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
- Jiang, Z.; Yu, W.; Zhou, D.; Chen, Y.; Feng, J.; Yan, S. ConvBERT: Improving BERT with Span-based Dynamic Convolution. arXiv 2020, arXiv:2008.02496. [Google Scholar]
- Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward Transformer-Based Object Detection. arXiv 2020, arXiv:2012.09958. [Google Scholar] [CrossRef]
- Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. AutoAssign: Differentiable Label Assignment for Dense Object Detection. arXiv 2020, arXiv:2007.03496. [Google Scholar] [CrossRef]
- Hendrycks, D.; Gimpel, K. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv 2016, arXiv:1606.08415. [Google Scholar] [CrossRef]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
- Ma, X.; Kong, X.; Wang, S.; Zhou, C.; May, J.; Ma, H.; Zettlemoyer, L. Luna: Linear Unified Nested Attention. arXiv 2021, arXiv:2106.01540. [Google Scholar] [CrossRef]
- Shen, Z.; Zhang, M.; Yi, S.; Yan, J.; Zhao, H. Factorized Attention: Self-Attention with Linear Complexities. arXiv 2018, arXiv:1812.01243. [Google Scholar]
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193. Available online: https://arxiv.org/abs/2304.07193 (accessed on 25 September 2025). [CrossRef]
- Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. Available online: https://arxiv.org/abs/2508.10104 (accessed on 25 September 2025). [PubMed]
- Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140. Available online: https://arxiv.org/abs/2407.17140 (accessed on 25 September 2025).
- Wang, S.; Xia, C.; Lv, F.; Shi, Y. RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision. arXiv 2024, arXiv:2409.08475. [Google Scholar]
- Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
Methods | Bk (Backbone) | Pre (Pre-Training) | Attn (Attention) | Qry (Query) | Publication | Highlights
---|---|---|---|---|---|---
DETR [11] GitHub https://github.com/facebookresearch/detr | - | - | - | - | ECCV 2020 | Transformer, set-based prediction, bipartite matching (sketched after this table) |
Deformable-DETR [20] GitHub https://github.com/fundamentalvision/Deformable-DETR | ✓ | ICLR 2021 | Deformable-attention module | |||
UP-DETR [21] GitHub https://github.com/dddzg/up-detr | ✓ | CVPR 2021 | Unsupervised pre-training, random query patch detection | | |
Efficient-DETR [22] | ✓ | arXiv 2021 | Reference point and top-k query selection module | | |
SMCA-DETR [23] GitHub https://github.com/gaopengcuhk/SMCA-DETR | ✓ | ICCV 2021 | Spatially-Modulated Co-attention module | |||
TSP-DETR [24] GitHub https://github.com/Edward-Sun/TSP-Detection | ✓ | ICCV 2021 | TSP-FCOS and TSP-RCNN modules for cross attention | |||
Conditional-DETR [25] GitHub https://github.com/Atten4Vis/ConditionalDETR | ✓ | ICCV 2021 | Conditional spatial queries | |||
WB-DETR [26] GitHub https://github.com/aybora/wbdetr | ✓ | ICCV 2021 | Encoder–decoder network without a backbone, LIE-T2T encoder module | |||
PnP-DETR [27] GitHub https://github.com/twangnh/pnp-detr | ✓ | ICCV 2021 | PnP sampling module including pool sampler and poll sampler | |||
Dynamic-DETR [28] | ✓ | ICCV 2021 | Dynamic attention in the encoder–decoder network | |||
YOLOS-DETR [29] GitHub https://github.com/hustvl/YOLOS | ✓ | NeurIPS 2021 | Pre-training encoder network | |||
Anchor-DETR [30] GitHub https://github.com/megvii-research/AnchorDETR | ✓ | ✓ | AAAI 2022 | Row and Column decoupled-attention, object queries as anchor points | ||
Sparse-DETR [31] GitHub https://github.com/kakaobrain/sparse-detr | ✓ | ICLR 2022 | Cross-attention map predictor, deformable-attention module | |||
D2ETR [32] GitHub https://github.com/alibaba/easyrobust/tree/main/ddetr | ✓ | arXiv 2022 | Fine fused features, cross-scale attention module | | |
FP-DETR [33] GitHub https://github.com/encounter1997/FP-DETR | ✓ | ✓ | ICLR 2022 | Multiscale tokenizer in place of CNN backbone, pre-training encoder network | ||
CF-DETR [34] | ✓ | AAAI 2022 | TEF module to capture spatial relationships, a coarse and a fine layer in the decoder network | |||
DAB-DETR [72] GitHub https://github.com/IDEA-Research/DAB-DETR | ✓ | ICLR 2022 | Dynamic anchor boxes as object queries | |||
DN-DETR [35] GitHub https://github.com/IDEA-Research/DN-DETR | ✓ | CVPR 2022 | Positive noised object queries | |||
AdaMixer [36] GitHub https://github.com/MCG-NJU/AdaMixer | ✓ | CVPR 2022 | 3D sampling module, Adaptive mixing module in the decoder | |||
REGO [37] GitHub https://github.com/zhechen/Deformable-DETR-REGO | ✓ | CVPR 2022 | A multi-level recurrent mechanism and a glimpse-based decoder | |||
DINO [38] GitHub https://github.com/facebookresearch/dino | ✓ | arXiv 2022 | Contrastive denoising module, positive and negative noised object queries | |||
Co-DETR [39] GitHub https://github.com/Sense-X/Co-DETR | ICCV 2023 | Collaborative hybrid assignments for faster convergence and improved training stability | ||||
LW-DETR [40] GitHub https://github.com/Atten4Vis/LW-DETR | ✓ | arXiv 2024 | Lightweight DETR with optimized ViT encoder, shallow decoder, and global attention | |||
RT-DETR [41] GitHub https://github.com/lyuwenyu/RT-DETR | ✓ | ✓ | CVPR 2024 | Hybrid encoder with multi-scale features, IoU-aware query selection, adaptable inference speed |
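The set-based prediction and bipartite matching highlighted for DETR in the first row above can be sketched with the Hungarian algorithm: each ground-truth object is assigned to exactly one query by minimizing a pairwise matching cost, and all unmatched queries are supervised as "no object", which is what removes the need for NMS. The cost used below (negative class probability plus a weighted L1 box distance) is a simplified stand-in; DETR's full matching cost also includes a generalized IoU term, and the weights and sizes shown are illustrative assumptions.

```python
# Minimal sketch of one-to-one bipartite (Hungarian) matching between predictions and ground truth.
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """
    pred_logits: (num_queries, num_classes) raw class scores
    pred_boxes:  (num_queries, 4) normalized (cx, cy, w, h) boxes
    gt_labels:   (num_gt,) ground-truth class indices
    gt_boxes:    (num_gt, 4) ground-truth boxes
    Returns matched (query_index, gt_index) pairs; every other query is
    supervised as the "no object" class.
    """
    prob = pred_logits.softmax(-1)                        # (Q, C)
    cost_class = -prob[:, gt_labels]                      # (Q, G): prefer confident correct classes
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)     # (Q, G): L1 distance between boxes
    cost = cost_class + box_weight * cost_box
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(q_idx.tolist(), g_idx.tolist()))


# Toy usage: 100 queries, 92 logits (91 classes + "no object"), two ground-truth objects.
preds = torch.randn(100, 92), torch.rand(100, 4)
targets = torch.tensor([3, 17]), torch.rand(2, 4)
print(hungarian_match(*preds, *targets))                  # e.g. [(12, 0), (57, 1)]
```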
Title | Year | Venue | Description |
---|---|---|---|
Advanced Deep-Learning Techniques for Salient and Category-Specific Object Detection: A Survey [50] | 2018 | SPM | It provides an overview of different object detection domains, including object detection (OD), salient OD, and category-specific OD. |
Object Detection in 20 Years: A Survey [73] | 2019 | TPAMI | This work gives an overview of the evolution of object detectors. |
Deep Learning for Generic Object Detection: A Survey [51] | 2019 | IJCV | A review on deep learning techniques on generic object detection. |
A Survey on Deep Learning-based Architectures for Semantic Segmentation on 2D images [53] | 2020 | PRJ | Deep learning-based methods for semantic segmentation are reviewed. |
A Survey of Modern Deep Learning based Object Detection Models [74] | 2021 | ICV | It briefly overviews deep learning-based (regression-based single-stage and candidate-based two-stage) object detectors. |
A Survey of Object Detection Based on CNN and Transformer [70] | 2021 | PRML | A review of the benefits and drawbacks of deep learning-based object detectors and introduction of transformer-based methods. |
Transformers in computational visual media: A survey [71] | 2021 | CVM | It focuses on backbone design and low-level vision using vision transformer methods. |
A survey: object detection methods from CNN to transformer [68] | 2022 | MTA | Comparison of various CNN-based detection networks and introduction of transformer-based detection networks. |
A Survey on Vision Transformer [69] | 2023 | TPAMI | This paper provides an overview of vision transformers and focuses on summarizing the state-of-the-art research in the field of vision transformers (ViTs). |
Methods | Backbone | Publications | Epoch | GFLOPs | Parameters (M) | AP | AP50 | AP75 | APS | APM | APL |
---|---|---|---|---|---|---|---|---|---|---|---|
DC5-ResNet-50 | 50 | 187 | 41 | 35.3 | 55.7 | 36.8 | 15.2 | 37.5 | 53.6 | ||
DETR [11] GitHub https://github.com/facebookresearch/detr | DC5-ResNet-50 | ECCV 2020 | 500 | 187 | 41 | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 |
DC5-ResNet-101 | 500 | 253 | 60 | 44.9 | 64.7 | 47.7 | 23.7 | 49.5 | 62.3 | ||
ResNet-50 | 50 | 173 | 40 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 | ||
Deformable-DETR [20] GitHub https://github.com/fundamentalvision/Deformable-DETR | ResNet-50 + | ICLR 2021 | 50 | 173 | 40 | 45.4 | 64.7 | 49.0 | 26.8 | 48.3 | 61.7 |
ResNet-50 ++ | 50 | 173 | 40 | 46.2 | 65.2 | 50.0 | 28.8 | 49.2 | 61.7 | ||
UP-DETR [21] GitHub https://github.com/dddzg/up-detr | ResNet-50 | CVPR 2021 | 150 | 86 | 41 | 40.5 | 60.8 | 42.6 | 19.0 | 44.4 | 60.0 |
ResNet-50 | 300 | 86 | 41 | 42.8 | 63.0 | 45.3 | 20.8 | 47.1 | 61.7 | ||
ResNet-50 | 36 | 159 | 32 | 44.2 | 62.2 | 48.0 | 28.4 | 47.5 | 56.6 | ||
Efficient-DETR [22] | ResNet-101 | arXiv 2021 | 36 | 239 | 51 | 45.2 | 63.7 | 48.8 | 28.8 | 49.1 | 59.0 |
ResNet-101 ** | 36 | 289 | 54 | 45.7 | 64.1 | 49.5 | 28.2 | 49.1 | 60.2 | ||
ResNet-50 | 50 | 152 | 40 | 43.7 | 63.6 | 47.2 | 24.2 | 47.0 | 60.4 | ||
SMCA-DETR [23] GitHub https://github.com/gaopengcuhk/SMCA-DETR | ResNet-50 | ICCV 2021 | 108 | 152 | 40 | 45.6 | 65.5 | 49.1 | 25.9 | 49.3 | 62.6 |
ResNet-101 | 50 | 218 | 58 | 44.4 | 65.2 | 48.0 | 24.3 | 48.5 | 61.0 | ||
TSP-DETR [24] GitHub https://github.com/Edward-Sun/TSP-Detection | FCOS-ResNet-50 | ICCV 2021 | 36 | 189 | 51.5 | 43.1 | 62.3 | 47.0 | 26.6 | 46.8 | 55.9 |
RCNN-ResNet-50 | 36 | 188 | 63.6 | 43.8 | 63.3 | 48.3 | 28.6 | 46.9 | 55.7 | ||
Conditional-DETR [25] GitHub https://github.com/Atten4Vis/ConditionalDETR | DC5-ResNet-50 | ICCV 2021 | 50 | 195 | 44 | 43.8 | 64.4 | 46.7 | 24.0 | 47.6 | 60.7 |
DC5-ResNet-101 | 50 | 262 | 63 | 45.0 | 65.5 | 48.4 | 26.1 | 48.9 | 62.8 | ||
WB-DETR [26] GitHub https://github.com/aybora/wbdetr | - | ICCV 2021 | 500 | 98 | 24 | 41.8 | 63.2 | 44.8 | 19.4 | 45.1 | 62.4 |
PnP-DETR [27] GitHub https://github.com/twangnh/pnp-detr | DC5-ResNet-50 | ICCV 2021 | 500 | 145 | 41 | 43.1 | 63.4 | 45.3 | 22.7 | 46.5 | 61.1 |
Dynamic-DETR [28] | ResNet-50 | ICCV 2021 | 12 | - | 58 | 42.9 | 61.0 | 46.3 | 24.6 | 44.9 | 54.4
YOLOS-DETR [29] GitHub https://github.com/hustvl/YOLOS | DeiT-S [227] † | NeurIPS 2021 | 150 | 194 | 31 | 36.1 | 56.5 | 37.1 | 15.3 | 38.5 | 56.2 |
DeiT-B [227] † | 150 | 538 | 127 | 42.0 | 62.2 | 44.5 | 19.5 | 45.3 | 62.1 | ||
Anchor-DETR [30] GitHub https://github.com/megvii-research/AnchorDETR | DC5-ResNet-50 * | AAAI 2022 | 50 | 151 | 39 | 44.2 | 64.7 | 47.5 | 24.7 | 48.2 | 60.6 |
DC5-ResNet-101 * | 50 | 237 | 58 | 45.1 | 65.7 | 48.8 | 25.8 | 49.4 | 61.6 | ||
Sparse-DETR [31] GitHub https://github.com/kakaobrain/sparse-detr | ResNet-50--0.5 | ICLR 2022 | 50 | 136 | 41 | 46.3 | 66.0 | 50.1 | 29.0 | 49.5 | 60.8 |
Swin-T--0.5 [228] | 50 | 144 | 41 | 49.3 | 69.5 | 53.3 | 32.0 | 52.7 | 64.9 | ||
D2ETR [32] GitHub https://github.com/alibaba/easyrobust/tree/main/ddetr | PVT2 | arXiv 2022 | 50 | 82 | 35 | 43.2 | 62.9 | 46.2 | 22.0 | 48.5 | 62.4
Def D2ETR [32] | PVT2 | 50 | 93 | 40 | 50.0 | 67.9 | 54.1 | 31.7 | 53.4 | 66.7 |
FP-DETR-S [33] GitHub https://github.com/encounter1997/FP-DETR | - | 50 | 102 | 24 | 42.5 | 62.6 | 45.9 | 25.3 | 45.5 | 56.9 | |
FP-DETR-B [33] GitHub https://github.com/encounter1997/FP-DETR | - | ICLR 2022 | 50 | 121 | 36 | 43.3 | 63.9 | 47.7 | 27.5 | 46.1 | 57.0 |
FP-DETR-B ‡ [33] GitHub https://github.com/encounter1997/FP-DETR | - | 50 | 121 | 36 | 43.7 | 64.1 | 47.8 | 26.5 | 46.7 | 58.2 | |
CF-DETR [34] | ResNet-50 | AAAI 2022 | 36 | - | - | 47.8 | 66.5 | 52.4 | 31.2 | 50.6 | 62.8 |
ResNet-101 | 36 | - | - | 49.0 | 68.1 | 53.4 | 31.4 | 52.2 | 64.3 | ||
DAB-DETR [72] GitHub https://github.com/IDEA-Research/DAB-DETR | DC5-ResNet-50 * | ICLR 2022 | 50 | 216 | 44 | 45.7 | 66.2 | 49.0 | 26.1 | 49.4 | 63.1 |
DC5-ResNet-101 * | 50 | 296 | 63 | 46.6 | 67.0 | 50.2 | 28.1 | 50.5 | 64.1 | ||
DN-DETR [35] GitHub https://github.com/IDEA-Research/DN-DETR | ResNet-50 | CVPR 2022 | 50 | 94 | 44 | 44.1 | 64.4 | 46.7 | 22.9 | 48.0 | 63.4 |
DC5-ResNet-50 | 50 | 202 | 44 | 46.3 | 66.4 | 49.7 | 26.7 | 50.0 | 64.3 | ||
ResNet-101 | 50 | 174 | 63 | 45.2 | 65.5 | 48.3 | 24.1 | 49.1 | 65.1 | ||
DC5-ResNet-101 | 50 | 282 | 63 | 47.3 | 67.5 | 50.8 | 28.6 | 51.5 | 65.0 | ||
AdaMixer [36] GitHub https://github.com/MCG-NJU/AdaMixer | ResNet-50 | CVPR 2022 | 36 | 132 | 139 | 47.0 | 66.0 | 51.1 | 30.1 | 50.2 | 61.8 |
ResNeXt-101-DCN | 36 | 214 | 160 | 49.5 | 68.9 | 53.9 | 31.3 | 52.3 | 66.3 | ||
Swin-s [228] | 36 | 234 | 164 | 51.3 | 71.2 | 55.7 | 34.2 | 54.6 | 67.3 | ||
REGO [37] GitHub https://github.com/zhechen/Deformable-DETR-REGO | ResNet-50 ++ | CVPR 2022 | 50 | 190 | 54 | 47.6 | 66.8 | 51.6 | 29.6 | 50.6 | 62.3 |
ResNet-101 ++ | 50 | 257 | 73 | 48.5 | 67.0 | 52.4 | 29.5 | 52.0 | 64.4 | ||
ReNeXt-101 ++ | 50 | 434 | 119 | 49.1 | 67.5 | 53.1 | 30.0 | 52.6 | 65.0 | ||
DINO [38] GitHub https://github.com/facebookresearch/dino | ReNet-50-4scale * | arXiv 2022 | 12 | 279 | 47 | 49.0 | 66.6 | 53.5 | 32.0 | 52.3 | 63.0 |
ResNet-50-5scale * | 12 | 860 | 47 | 49.4 | 66.9 | 53.8 | 32.3 | 52.5 | 63.9 | ||
ReNet-50-5scale * | 24 | 860 | 47 | 51.3 | 69.1 | 56.0 | 34.5 | 54.2 | 65.8 | ||
ResNet-50-5scale * | 36 | 860 | 47 | 51.2 | 69.0 | 55.8 | 35.0 | 54.3 | 65.3 | ||
Co-DETR [39] GitHub https://github.com/Sense-X/Co-DETR | ReNet-50 * | ICCV 2023 | 12 | 279 | 47 | 52.1 | 69.3 | 57.3 | 35.4 | 55.5 | 67.2 |
ReNet-50 * | 36 | 860 | 47 | 54.8 | 72.5 | 60.1 | 38.3 | 58.4 | 69.6 | ||
Swin-L(IN-22K) * | 12 | 860 | 47 | 59.3 | 77.3 | 64.9 | 43.3 | 63.3 | 75.5 | ||
Swin-L(IN-22K) * | 24 | 860 | 47 | 60.4 | 78.3 | 66.4 | 44.6 | 64.2 | 76.5 | ||
Swin-L(IN-22K) * | 36 | 860 | 47 | 60.7 | 78.5 | 66.7 | 45.1 | 64.7 | 76.4 | ||
LW-DETR [40] GitHub https://github.com/Atten4Vis/LW-DETR | - | arXiv 2024 | 50 | 67.7 | 54.6 | 54.4 | - | - | 48.0 | 52.5 | 56.1 |
RT-DETR [41] GitHub https://github.com/lyuwenyu/RT-DETR | ReNet-50* | CVPR 2024 | 72 | 136 | 42 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0 |
ResNet-101 * | 72 | 259 | 76 | 54.3 | 72.7 | 58.6 | 36.0 | 58.8 | 72.1 | ||
RT-DETRv2 [224] GitHub https://github.com/supervisely-ecosystem/RT-DETRv2 | ReNet-50 * | arXiv 2024 | 72 | 136 | 42 | 53.4 | - | - | - | - | - |
ResNet-101 * | 72 | 259 | 76 | 54.3 | - | - | - | - | - | ||
RT-DETRv3 [225] GitHub https://github.com/clxia12/RT-DETRv3 | ReNet-50 * | arXiv 2024 | 72 | 136 | 42 | 53.4 | - | - | - | - | - |
ResNet-101 * | 72 | 259 | 76 | 54.6 | - | - | - | - | - |
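The table above mixes very different training schedules and compute budgets, which makes raw AP values hard to compare at a glance. The minimal Python sketch below is purely illustrative: the `Entry` record and the AP-per-100-GFLOPs ratio are hypothetical constructs introduced here for illustration and are not part of any surveyed codebase. It reprints a handful of rows copied verbatim from the table and ranks them by this crude accuracy-per-compute ratio; the ratio should be read with care, since GFLOPs and epochs may not be measured under identical settings across papers.

```python
# Illustrative only: a few representative entries copied from the table above
# (COCO AP, GFLOPs, parameters in millions, training epochs).
from dataclasses import dataclass


@dataclass
class Entry:
    method: str
    backbone: str
    epochs: int
    gflops: float
    params_m: float
    ap: float


ENTRIES = [
    Entry("DETR",            "DC5-ResNet-50",    500, 187, 41, 43.3),
    Entry("Deformable-DETR", "ResNet-50 ++",      50, 173, 40, 46.2),
    Entry("DN-DETR",         "DC5-ResNet-101",    50, 282, 63, 47.3),
    Entry("DINO",            "ResNet-50-4scale",  12, 279, 47, 49.0),
    Entry("RT-DETR",         "ResNet-50",         72, 136, 42, 53.1),
    Entry("Co-DETR",         "Swin-L (IN-22K)",   36, 860, 47, 60.7),
]


def ap_per_100_gflops(e: Entry) -> float:
    """Crude accuracy-per-compute ratio; higher means cheaper accuracy."""
    return 100.0 * e.ap / e.gflops


if __name__ == "__main__":
    # Rank the selected entries by the accuracy-per-compute ratio.
    for e in sorted(ENTRIES, key=ap_per_100_gflops, reverse=True):
        print(f"{e.method:<16} {e.backbone:<18} AP={e.ap:4.1f}  "
              f"GFLOPs={e.gflops:5.0f}  params={e.params_m:3.0f}M  "
              f"epochs={e.epochs:3d}  AP/100GFLOPs={ap_per_100_gflops(e):4.1f}")
```

Such a ranking highlights, for example, that RT-DETR reaches higher AP than the original DETR at roughly two-thirds of the GFLOPs and a fraction of the training epochs, whereas Co-DETR buys its top accuracy with a much larger compute budget.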
Methods | Publications | Advantages | Limitations |
---|---|---|---|
DETR [11] GitHub https://github.com/facebookresearch/detr | ECCV 2020 | Removes the need for hand-designed components such as NMS and anchor generation by matching predictions to objects one-to-one (see the matching sketch after this table). | Low performance on small objects and slow training convergence.
Deformable-DETR [20] GitHub https://github.com/fundamentalvision/Deformable-DETR | ICLR 2021 | Deformable attention network, which makes training convergence faster. | Number of encoder tokens increases by 20 times compared to DETR. |
UP-DETR [21] GitHub https://github.com/dddzg/up-detr | CVPR 2021 | Pre-training for multi-task learning and multi-query localization. | Pre-training targets patch localization only; CNN and transformer pre-training still need to be integrated.
Efficient-DETR [22] | arXiv 2021 | Reduces decoder layers by employing a dense and sparse set-based network. | GFLOPs roughly double compared to the original DETR.
SMCA-DETR [23] GitHub https://github.com/gaopengcuhk/SMCA-DETR | ICCV 2021 | Regression-aware mechanism to increase convergence speed. | Low performance in detecting small objects.
TSP-DETR [24] GitHub https://github.com/Edward-Sun/TSP-Detection | ICCV 2021 | Addresses issues with the Hungarian loss and the Transformer cross-attention mechanism. | Relies on feature points in TSP-FCOS and proposals in TSP-RCNN, as in CNN-based detectors.
Conditional-DETR [25] GitHub https://github.com/Atten4Vis/ConditionalDETR | ICCV 2021 | Conditional queries remove the dependency on content embeddings and ease training. | Outperforms DETR and Deformable-DETR only when stronger backbones are used.
WB-DETR [26] GitHub https://github.com/aybora/wbdetr | ICCV 2021 | Pure transformer network without backbone. | Low performance on small objects. |
PnP-DETR [27] GitHub https://github.com/twangnh/pnp-detr | ICCV 2021 | Sampling module provides foreground features plus a small quantity of background features. | Breaks the 2D spatial structure by sampling foreground tokens and discarding most background tokens.
Dynamic-DETR [28] | ICCV 2021 | Dynamic attention handles small feature resolutions and improves training convergence. | Still dependent on CNNs, using a convolution-based encoder and an RoI-based decoder.
YOLOS-DETR [29] GitHub https://github.com/hustvl/YOLOS | NeurIPS 2021 | Converts a ViT pre-trained on the ImageNet-1k dataset into an object detector. | The pre-trained ViT still needs improvement, as it requires long training schedules.
Anchor-DETR [30] GitHub https://github.com/megvii-research/AnchorDETR | AAAI 2022 | Object queries as anchor points that can predict multiple objects at one position. | Treats queries as 2D anchor points, which ignores object scale.
Sparse-DETR [31] GitHub https://github.com/kakaobrain/sparse-detr | ICLR 2022 | Improves performance by updating only the tokens referenced by the decoder. | Performance is strongly dependent on the backbone, especially for large objects.
D2ETR [32] GitHub https://github.com/alibaba/easyrobust/tree/main/ddetr | arXiv 2022 | Decoder-only transformer network to reduce computational cost. | Decreases computational complexity significantly but performs poorly on small objects.
FP-DETR [33] GitHub https://github.com/encounter1997/FP-DETR | ICLR 2022 | Pre-training of the encoder-only transformer. | Low performance on large objects.
CF-DETR [34] | AAAI 2022 | Refines coarse features to improve localization accuracy for small objects. | The addition of three new modules increases the network size.
DAB-DETR [72] GitHub https://github.com/IDEA-Research/DAB-DETR | ICLR 2022 | Anchor boxes as queries; attention adapts to objects of different scales. | Positional prior covers only foreground objects.
DN-DETR [35] GitHub https://github.com/IDEA-Research/DN-DETR | CVPR 2022 | Denoising training provides a positional prior for foreground and background regions. | Denoising training adds only positive noise to object queries, ignoring background regions.
AdaMixer [36] GitHub https://github.com/MCG-NJU/AdaMixer | CVPR 2022 | Faster convergence; improves the adaptability of the query-based decoding mechanism. | Large number of parameters.
REGO [37] GitHub https://github.com/zhechen/Deformable-DETR-REGO | CVPR 2022 | Attention mechanism gradually focuses on foreground regions more accurately. | Multi-stage RoI-based attention modeling increases the number of parameters.
DINO [38] GitHub https://github.com/facebookresearch/dino | arXiv 2022 | Impressive results on small and medium-sized datasets. | Performance drops for large objects.
Co-DETR [39] GitHub https://github.com/Sense-X/Co-DETR | ICCV 2023 | Enhances encoder feature learning and decoder attention via collaborative hybrid assignments. | Increases training complexity due to multiple assignment heads. |
LW-DETR [40] GitHub https://github.com/Atten4Vis/LW-DETR | arXiv 2024 | Achieves real-time detection with a lightweight transformer design using optimized ViT encoder and window attention. | Limited evaluation on benchmarks; less mature than YOLO-style detectors. |
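Several of the limitations listed above (DETR's slow convergence, TSP-DETR's changes to the Hungarian loss, DN-DETR's denoising queries) revolve around the one-to-one bipartite matching between predicted queries and ground-truth boxes. The sketch below is a minimal illustration, assuming a SciPy environment and a toy cost matrix rather than DETR's actual classification/L1/GIoU cost; it shows only the Hungarian assignment step that these detectors build on, not any particular author's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = predicted queries, cols = ground-truth objects.
# In DETR-style detectors the cost combines classification, L1 box, and
# GIoU terms; random values here only illustrate the assignment step.
rng = np.random.default_rng(0)
num_queries, num_gt = 6, 3
cost = rng.uniform(size=(num_queries, num_gt))

# Hungarian algorithm: each ground-truth object is matched to exactly one
# query; the remaining queries are supervised as "no object" (background).
query_idx, gt_idx = linear_sum_assignment(cost)
match = {int(q): int(g) for q, g in zip(query_idx, gt_idx)}

for q in range(num_queries):
    if q in match:
        print(f"query {q} -> ground truth {match[q]} (cost {cost[q, match[q]]:.2f})")
    else:
        print(f"query {q} -> no object (background)")
```

Because this assignment is recomputed at every training iteration on still-noisy predictions, the query-to-object mapping can flip between iterations; the denoising training of DN-DETR and its refinement in DINO are aimed at stabilizing exactly this step.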
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).