MRD: A Linear-Complexity Encoder for Real-Time Vehicle Detection
Abstract
1. Introduction
- A Mamba RT-DETR (MRD) model based on the state space model (SSM) is proposed. By leveraging the linear complexity of the SSM, the model improves vehicle detection performance while reducing the number of parameters (a minimal sketch of the linear-time scan is given after this list).
- An EVDMamba module is designed. It introduces a dilated scanning mechanism and streamlines the block structure to strengthen SSM-based image feature extraction (see the scan-order sketch below).
- An RSCG module is introduced that combines gated aggregation with efficient convolutions and residual connections to capture local dependencies and improve model robustness (see the gated-block sketch below).
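To make the first contribution concrete, here is a minimal sketch of the linear-time recurrence that SSM/Mamba-style blocks rely on. The function name, shapes, and parameterization are illustrative assumptions, not the authors' implementation (Mamba additionally makes the parameters input-dependent and uses a hardware-aware parallel scan):

```python
import torch

def ssm_scan(x, A, B, C):
    """Sequential state-space scan: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.
    x: (L, D) token sequence; A, B, C: (D, N) per-channel parameters.
    Cost is O(L) in sequence length, vs. O(L^2) for self-attention."""
    L, D = x.shape
    h = torch.zeros(D, A.shape[-1])             # hidden state, one per channel
    ys = []
    for t in range(L):                          # one state update per token
        h = A * h + B * x[t].unsqueeze(-1)      # (D, N) elementwise recurrence
        ys.append((h * C).sum(-1))              # project state back to (D,)
    return torch.stack(ys)                      # (L, D)

y = ssm_scan(torch.randn(64, 32),
             *(0.1 * torch.randn(32, 16) for _ in range(3)))  # (64, 32) output
```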
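In the same hedged spirit, one way a dilated scanning mechanism can be realized is by permuting the flattened patch sequence so that every `rate`-th patch is visited first, widening the spatial stride between consecutive scan steps; the exact pattern used in EVDMamba may differ:

```python
import torch

def dilated_scan_indices(H, W, rate=2):
    """Return a permutation over an H x W patch grid that visits each of the
    rate*rate interleaved sub-grids in turn, so neighboring positions in the
    1D scan are `rate` patches apart in 2D."""
    idx = torch.arange(H * W).view(H, W)
    order = [idx[i::rate, j::rate].reshape(-1)
             for i in range(rate) for j in range(rate)]
    return torch.cat(order)                      # (H*W,) index permutation

tokens = torch.randn(8 * 8, 64)                  # 64 patches, 64-dim features
perm = dilated_scan_indices(8, 8, rate=2)
scanned = tokens[perm]                           # feed this order to the SSM scan
```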
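Finally, a sketch of a gated convolutional block with a residual connection, illustrating the gated-aggregation idea behind RSCG; the module name, channel layout, and kernel sizes are assumptions rather than the paper's design:

```python
import torch
import torch.nn as nn

class GatedConvResidual(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.value = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=channels)          # cheap depthwise conv
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())          # per-pixel gate in [0, 1]
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        # the gate modulates local features; the residual preserves the input
        return x + self.proj(self.value(x) * self.gate(x))

out = GatedConvResidual(64)(torch.randn(1, 64, 32, 32))  # shape is preserved
```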
2. Related Work
2.1. Object Detection
2.2. Visual State Space Models
3. Materials and Methods
3.1. Overall Architecture
3.2. Mamba
3.3. EVDMamba
4. Results
4.1. Datasets
4.2. Training Details and Evaluation Metrics
4.3. Performance Comparison on the BDD100K Dataset
4.4. Performance Comparison on the KITTI Dataset
4.5. Ablation Study
4.6. Visualization
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Experimental environment.

| Name | Configuration |
|---|---|
| CPU | Intel Xeon CPU E5-2686 v4 |
| RAM | 92 GB |
| GPU | NVIDIA RTX 3090 (24 GB) |
| System | Linux |
| PyTorch | 2.0.0 |
| CUDA | 11.8 |
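For reference in the tables below, mAP@50:95 follows the COCO convention of averaging AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05. A small helper makes the convention explicit (the function name is ours, for illustration only):

```python
import numpy as np

def map_50_95(ap_at_iou):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95 (COCO style).
    ap_at_iou: dict mapping an IoU threshold to the AP measured at it."""
    thresholds = np.round(np.arange(0.50, 1.00, 0.05), 2)   # 10 thresholds
    return float(np.mean([ap_at_iou[t] for t in thresholds]))
```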
Performance comparison on the BDD100K dataset.

| Model | Parameters (M) | FLOPs (G) | mAP@50 | mAP@50:95 |
|---|---|---|---|---|
| RT-DETR-r18 | 19.9 | 57.0 | 0.500 | 0.276 |
| RT-DETR-r34 | 30.0 | 87.2 | 0.511 | 0.280 |
| RT-DETR-r50 | 42.0 | 129.6 | 0.525 | 0.290 |
| YOLOv5m | 20.1 | 48.0 | 0.479 | 0.265 |
| YOLOv6s | 17.2 | 21.89 | 0.467 | 0.259 |
| YOLOv8m | 25.8 | 49.9 | 0.527 | 0.292 |
| MRD | 9.1 | 18.5 | 0.535 | 0.302 |
Performance comparison on the KITTI dataset.

| Model | Parameters (M) | FLOPs (G) | mAP@50 | mAP@50:95 |
|---|---|---|---|---|
| RT-DETR-r18 | 19.9 | 57.0 | 0.743 | 0.522 |
| YOLOv5m | 20.1 | 48.0 | 0.695 | 0.509 |
| YOLOv6s | 17.2 | 21.89 | 0.689 | 0.504 |
| YOLOv8m | 25.8 | 49.9 | 0.731 | 0.514 |
| MRD | 9.1 | 18.5 | 0.752 | 0.530 |
Ablation study (backbone: ResNet34 or EVDMamba; neck: HybridEncoder or Enhanced FPAFN; √ marks the component used).

| ResNet34 | EVDMamba | HybridEncoder | Enhanced FPAFN | Parameters (M) | mAP@50 | mAP@50:95 |
|---|---|---|---|---|---|---|
| √ | | √ | | 30.0 | 0.511 | 0.280 |
| √ | | | √ | 22.8 | 0.508 | 0.274 |
| | √ | √ | | 9.8 | 0.526 | 0.287 |
| | √ | | √ | 9.1 | 0.535 | 0.302 |