VM-RTDETR: Advancing DETR with Vision State-Space Duality and Multi-Scale Fusion for Robust Pig Detection
Simple Summary
Abstract
1. Introduction
- We introduce a Vision State-Space Duality (VSSD) module into the backbone network. Its core, a novel Non-Causal State-Space Duality (NC-SSD) mechanism, overcomes the unidirectional constraint of traditional state-space models by enabling bidirectional contextual modeling. This allows for highly efficient parallel computation while significantly strengthening the model’s ability to capture global structural features and long-range dependencies within an image.
- We design a Multi-Scale Efficient Hybrid Encoder (M-Encoder). This module utilizes parallel multi-scale depth-wise convolutional kernels to perform hierarchical feature extraction, simultaneously capturing fine-grained local details and broader contour information. This design effectively enriches the feature representation, improving the model’s robustness to scale variations caused by differing viewing distances and individual pig sizes.
- Through extensive experiments, we demonstrate that our VM-RTDETR model achieves state-of-the-art performance on challenging pig detection datasets, significantly boosting key metrics such as AP, AP50, and AP75 compared to existing mainstream approaches. The model provides a robust and efficient solution for intelligent livestock farming.
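The M-Encoder's parallel multi-scale depthwise kernels can be illustrated with a minimal NumPy sketch. This is a hypothetical toy version, not the authors' exact module: the kernel sizes (3, 5, 7), the identity branch, and element-wise summation as the fusion step are assumptions made for illustration only.

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Depthwise 2-D convolution with 'same' padding and stride 1.
    x: (C, H, W) feature map; kernel: (C, k, k), one filter per channel."""
    C, H, W = x.shape
    k = kernel.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for c in range(C):            # each channel is convolved independently
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernel[c])
    return out

def multi_scale_branch(x, kernel_sizes=(3, 5, 7), rng=None):
    """Run parallel depthwise convolutions at several kernel sizes and fuse
    them by element-wise summation, keeping the input resolution unchanged."""
    rng = rng or np.random.default_rng(0)
    C = x.shape[0]
    outs = [x]                    # identity branch preserves the original features
    for k in kernel_sizes:
        kern = rng.standard_normal((C, k, k)) / (k * k)
        outs.append(depthwise_conv2d(x, kern))
    return sum(outs)              # small kernels keep local detail, large ones contour context
```

Because every branch uses 'same' padding, the fused output has the same shape as the input, so such a block can drop into an encoder without altering downstream tensor shapes.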
2. Materials and Methods
2.1. Dataset
2.2. The Proposed Model, VM-RTDETR
2.2.1. Backbone
2.2.2. M-Encoder
2.2.3. Decoder
2.2.4. Synergistic Design of VSSD and M-Encoder
2.3. Evaluation Metric
- AP: The Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. This is the primary metric for comprehensive performance evaluation.
- AP50: The AP at an IoU threshold of 0.5.
- AP75: The AP at a stricter IoU threshold of 0.75.
- APm: AP for medium objects (32 × 32 ≤ pixel area < 96 × 96).
- APl: AP for large objects (pixel area ≥ 96 × 96).
- Params: The total number of trainable parameters, indicating the model’s size.
- GFLOPs (Giga Floating-Point Operations): The total number of floating-point operations required for a single forward pass, measured in billions. It reflects the model’s computational complexity. A lower value indicates higher computational efficiency.
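All of the AP variants above reduce to comparing box IoU against a sweep of thresholds. The following is a hedged toy sketch of that core step only, not the full COCO evaluation protocol (which also involves score ranking, greedy matching, and recall interpolation); the helper names are illustrative.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# COCO-style threshold sweep: 0.50, 0.55, ..., 0.95 (ten thresholds)
THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def match_fraction(pred, gt):
    """Fraction of the ten IoU thresholds at which a single prediction
    would count as a true positive against a single ground-truth box."""
    return float(np.mean(iou(pred, gt) >= THRESHOLDS))
```

A loosely localized box can be a true positive under AP50 yet a false positive under AP75, which is why AP75 is the stricter metric: a prediction covering only half of its ground truth has IoU 0.5 and passes only the first threshold in the sweep.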
3. Experimental Results and Analyses
3.1. Experimental Setup
3.2. Comparison of Different Models
3.3. Ablation Studies
3.3.1. Verification of the VSSD Module
3.3.2. Verification of the M-Encoder Module
3.3.3. Verification of the VM-RTDETR Model
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
| Term | Configurations |
|---|---|
| Operating system | Ubuntu 18.04 |
| GPU | NVIDIA Tesla P100 |
| CPU | Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz |
| GPU environment | CUDA 11.8 |
| Deep learning framework | PyTorch 2.1.1 |
| Python | 3.10.17 |
| Epochs | 100 |
| Memory | 128 GB |
| Model | AP (%) | AP50 (%) | AP75 (%) | APm (%) | APl (%) | GFLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 35.8 | 80.9 | 26.4 | 24.9 | 37.6 | - | - |
| YOLOv5n | 51.1 | 91.2 | 50.2 | - | - | 7.7 | 2.6 |
| YOLOv5s | 54.7 | 93.6 | 55.1 | - | - | 24.0 | 9.1 |
| YOLOv6n | 51.2 | 91.3 | 49.6 | - | - | 13.0 | 4.5 |
| YOLOv6s | 54.9 | 93.5 | 55.3 | - | - | 44.7 | 16.4 |
| YOLOv7 | 59.9 | 91.2 | - | - | - | 37.1 | 105.1 |
| YOLOv8n | 52.2 | 92.1 | 51.7 | - | - | 8.7 | 3.2 |
| YOLOv8s | 54.2 | 92.9 | 54.4 | - | - | 28.6 | 11.2 |
| YOLOv9s | 55.7 | 93.4 | 56.1 | - | - | 26.7 | 7.2 |
| YOLOv12s | 52.5 | 92.0 | 52.1 | - | - | 19.4 | 9.1 |
| YOLOv12n | 50.9 | 91.8 | 49.4 | - | - | 6.0 | 2.5 |
| R50-RTDETR | 59.5 | 94.9 | 61.6 | 48.0 | 61.0 | 137.7 | 42.7 |
| VM-RTDETR | 60.9 | 95.5 | 63.3 | 50.7 | 62.3 | 97.9 | 28.4 |
| Model | Backbone | AP (%) | AP50 (%) | AP75 (%) | APm (%) | APl (%) | GFLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|---|
| R18-RTDETR | R18 | 59.5 | 94.9 | 61.7 | 49.3 | 60.9 | 61.1 | 20.0 |
| R34-RTDETR | R34 | 59.6 | 95.2 | 61.8 | 49.4 | 61.1 | 93.3 | 31.3 |
| R50-RTDETR | R50 | 59.5 | 94.9 | 61.6 | 48.0 | 61.0 | 137.7 | 42.7 |
| R101-RTDETR | R101 | 59.7 | 95.3 | 62.1 | 48.7 | 61.2 | 260.6 | 76.4 |
| VMamba-RTDETR | VMamba | 60.2 | 95.1 | 62.5 | 50.0 | 61.7 | 87.0 | 26.3 |
| V-RTDETR | VSSD | 60.4 | 95.3 | 62.9 | 50.0 | 61.8 | 97.9 | 28.4 |
| Model | Backbone | Encoder | AP (%) | AP50 (%) | AP75 (%) | APm (%) | APl (%) | GFLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| RTDETR | R50 | T | 59.5 | 94.9 | 61.6 | 48.0 | 61.0 | 137.7 | 42.7 |
| M-RTDETR | R50 | ME | 59.8 | 95.1 | 62.3 | 48.2 | 61.4 | 137.7 | 42.7 |
| Model | Backbone | Encoder | AP (%) | AP50 (%) | AP75 (%) | APm (%) | APl (%) | GFLOPs (G) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|
| RTDETR | R50 | T | 59.5 | 94.9 | 61.6 | 48.0 | 61.0 | 137.7 | 42.7 |
| M-RTDETR | R50 | ME | 59.8 | 95.1 | 62.3 | 48.2 | 61.4 | 137.7 | 42.7 |
| V-RTDETR | VSSD | T | 60.4 | 95.3 | 62.9 | 50.0 | 61.8 | 97.9 | 28.4 |
| VM-RTDETR | VSSD | ME | 60.9 | 95.5 | 63.3 | 50.7 | 62.3 | 97.9 | 28.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hao, W.; Xu, S.-A.; Shu, H.; Li, H.; Han, M.; Li, F.; Liu, Y. VM-RTDETR: Advancing DETR with Vision State-Space Duality and Multi-Scale Fusion for Robust Pig Detection. Animals 2025, 15, 3328. https://doi.org/10.3390/ani15223328