RSO-YOLO: A Real-Time Detector for Small and Occluded Objects in Autonomous Driving Scenarios
Abstract
1. Introduction
- We introduce the bidirectional feature pyramid network (BiFPN) [4] and space-to-depth convolution (SPD-Conv) [5], which together replace the original neck. BiFPN performs efficient multi-scale feature fusion through weighted bidirectional cross-scale connections, while SPD-Conv mitigates semantic information loss by reorganizing fine spatial detail into the channel dimension rather than discarding it during downsampling (a minimal sketch of both operations follows this list). Together, they improve both computational efficiency and detection performance. Additionally, a shallow P2 branch is incorporated into the detection head to improve fine-grained feature perception, which is critical for reliable small-object detection.
- We propose the feature enhancement and compensation module (FECM). This module enhances discriminative features in unoccluded regions and compensates for semantic deficiencies in occluded areas, thereby improving the detection of occluded objects and alleviating the feature loss commonly encountered under occlusion.
- We design a lightweight global cross-dimensional coordinate detection head (GCCHead), built upon the global cross-dimensional coordinate module (GCCM). The head groups and synergistically enhances features to reduce computational complexity while improving detection accuracy, making it well suited to real-time autonomous driving systems.
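As a point of reference for the first contribution, the sketch below illustrates the two publicly documented operations it names: the SPD-Conv building block [5] (space-to-depth rearrangement followed by a non-strided convolution) and the fast normalized weighted fusion used by BiFPN [4]. This is a minimal PyTorch illustration written for this summary; the channel sizes, class names, and surrounding wiring are assumptions, not the authors' exact RSO-YOLO implementation.

```python
# Minimal sketches of SPD-Conv [5] and BiFPN-style weighted fusion [4].
# Illustrative only: sizes and names are assumed, not taken from RSO-YOLO.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided 3x3 convolution.

    Instead of halving resolution with a stride-2 convolution (which drops
    fine detail), each 2x2 spatial patch is folded into the channel axis,
    so no pixel information is discarded before the convolution.
    """

    def __init__(self, in_ch: int, out_ch: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, C*s*s, H/s, W/s): spatial detail moves to channels.
        x = F.pixel_unshuffle(x, self.scale)
        return self.act(self.bn(self.conv(x)))


class WeightedFusion(nn.Module):
    """BiFPN fast normalized fusion of N feature maps of the same shape."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)                 # keep the learned weights non-negative
        w = w / (w.sum() + self.eps)       # fast normalization without a softmax
        return sum(wi * fi for wi, fi in zip(w, feats))


if __name__ == "__main__":
    x = torch.randn(1, 64, 80, 80)
    down = SPDConv(64, 128)                # 80x80 -> 40x40, detail kept in channels
    fuse = WeightedFusion(num_inputs=2)
    y = down(x)
    print(y.shape)                         # torch.Size([1, 128, 40, 40])
    print(fuse([y, torch.randn_like(y)]).shape)
```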
 
2. Related Work
2.1. General Object Detection
2.2. Detection of Small and Occluded Objects in Autonomous Driving Scenarios
2.2.1. Small Object Detection
2.2.2. Occluded Object Detection
2.3. Model Lightweighting
3. Methods
3.1. Design of RSO-YOLO Network Architecture
3.1.1. Structural Composition of RSO-YOLO
3.1.2. Design Rationale
3.2. BiFPN and SPD-Conv
3.2.1. BiFPN
3.2.2. SPD-Conv
3.3. FECM
3.4. GCCHead
4. Experiments
4.1. Datasets
4.1.1. SODA10M Dataset
4.1.2. BDD100K Dataset
4.1.3. FLIR ADAS Dataset
4.2. Experimental Environment and Parameters
4.3. Evaluation Metrics
4.4. Ablation Experiment
4.4.1. Ablation Experiments on the SODA10M Dataset
4.4.2. Ablation Experiments on the BDD100K Dataset
4.5. Comparative Experiments
4.5.1. Neck Network Comparative Experiment
4.5.2. Performance Comparison Experiment of Occluded Object Detection Based on FECM
4.5.3. Detection Head Comparative Experiment
4.5.4. Comparative Experiments Across Different Datasets
4.5.5. Experimental Comparison with Other Improved Models
4.5.6. Quantitative Experiments on Small and Occluded Object Detection
4.6. Visualization Analysis
4.6.1. Visualization of FECM Mechanism
4.6.2. Detection Results Visualization
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Szűcs, H.; Hézer, J. Road safety analysis of autonomous vehicles: An overview. Period. Polytech. Transp. Eng. 2022, 50, 426–434.
2. Yang, T.; Tong, C. Real-time detection network for tiny traffic sign using multi-scale attention module. Sci. China Technol. Sci. 2022, 65, 396–406.
3. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
4. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
5. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459.
6. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
10. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872.
11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
12. Li, Y.; Li, J.; Meng, P. Attention-YOLOV4: A real-time and high-accurate traffic sign detection algorithm. Multimed. Tools Appl. 2023, 82, 7567–7582.
13. Luo, Y.; Ci, Y.; Jiang, S.; Wei, X. A novel lightweight real-time traffic sign detection method based on an embedded device and YOLOv8. J. Real Time Image Process. 2024, 21, 24.
14. Yang, Y.; Feng, F.; Liu, G.; Di, J. MEL-YOLO: A novel YOLO network with multi-scale, effective and lightweight methods for small object detection in aerial images. IEEE Access 2024, 12, 194280–194295.
15. Yang, Y.; Yang, S.; Chan, Q. LEAD-YOLO: A lightweight and accurate network for small object detection in autonomous driving. Sensors 2025, 25, 4800.
16. Wang, Z.; Li, Y.; Liu, Y.; Meng, F. Improved object detection via large kernel attention. Expert Syst. Appl. 2024, 240, 122507.
17. Liu, G.; Huang, Y.; Yan, S.; Hou, E. RFCS-YOLO: Target detection algorithm in adverse weather conditions via receptive field enhancement and cross-scale fusion. Sensors 2025, 25, 912.
18. Zhao, L.; Fu, L.; Jia, X.; Cui, B.; Zhu, X.; Jin, J. YOLO-BOS: An emerging approach for vehicle detection with a novel BRSA mechanism. Sensors 2024, 24, 8126.
19. Tan, X.; Leng, X.; Luo, R.; Sun, Z.; Ji, K.; Kuang, G. YOLO-RC: SAR ship detection guided by characteristics of range-compressed domain. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18834–18851.
20. He, J.; Chen, H.; Liu, B.; Luo, S.; Liu, J. Enhancing YOLO for occluded vehicle detection with grouped orthogonal attention and dense object repulsion. Sci. Rep. 2024, 14, 19650.
21. Wang, Y.; Guan, Y.; Liu, H.; Jin, L.; Li, X.; Guo, B.; Zhang, Z. VV-YOLO: A vehicle view object detection model based on improved YOLOv4. Sensors 2023, 23, 3385.
22. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
23. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
24. Hou, R.; Xu, B.; Ren, T.; Wu, G. MTNet: Learning modality-aware representation with transformer for RGBT tracking. arXiv 2025, arXiv:2508.17280.
25. Wang, H.; Wang, C.; Fu, Q.; Si, B.; Zhang, D.; Kou, R. MINIAOD: Lightweight aerial image object detection. IEEE Sens. J. 2025, 25, 9167–9184.
26. Gu, Y.; Si, B. A novel lightweight real-time traffic sign detection integration framework based on YOLOv4. Entropy 2022, 24, 487.
27. Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once (YOLOv5n-L) approach. Expert Syst. Appl. 2023, 213, 119108.
28. Chen, W.; Liu, J.; Liu, T.; Zhuang, Y. PCPE-YOLO with a lightweight dynamically reconfigurable backbone for small object detection. Sci. Rep. 2025, 15, 29988.
29. Yuan, X.; Kuerban, A.; Chen, Y.; Lin, W. Faster light detection algorithm of traffic signs based on YOLOv5s-A2. IEEE Access 2022, 11, 19395–19404.
30. Han, J.; Liang, X.; Xu, H.; Chen, K.; Hong, L.; Mao, J.; Ye, C.; Zhang, W.; Li, X.; Liang, X.; et al. SODA10M: A large-scale 2D self/semi-supervised object detection dataset for autonomous driving. arXiv 2021, arXiv:2106.11118.
31. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F. BDD100K: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2636–2645.
32. Teledyne FLIR. FLIR Thermal Dataset for Algorithm Training; FLIR Systems: Wilsonville, OR, USA, 2021.
33. Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. GiraffeDet: A heavy-neck paradigm for object detection. arXiv 2022, arXiv:2202.04256.
34. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21.
35. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
36. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zhang, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
 
Ablation experiments on the SODA10M dataset (A = BiFPN, B = P2 branch, C = GCCHead, D = FECM).

| Model | BiFPN | P2 | GCCHead | FECM | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | | | | 69.4 | 51.6 | 58.5 | 39.6 | 2.6 | 120 | 6.5 |
| A | √ | | | | 72.7 | 53.0 | 59.2 | 40.2 | 2.2 | 130 | 4.2 |
| B | | √ | | | 68.1 | 57.0 | 63.0 | 44.6 | 2.7 | 117 | 8.4 |
| C | | | √ | | 72.6 | 52.6 | 59.0 | 40.0 | 2.2 | 116 | 4.9 |
| D | | | | √ | 67.7 | 53.7 | 59.3 | 40.4 | 2.6 | 100 | 6.6 |
| AB | √ | √ | | | 68.6 | 59.1 | 64.5 | 44.7 | 2.4 | 118 | 5.9 |
| ABC | √ | √ | √ | | 69.9 | 60.5 | 65.5 | 45.6 | 2.1 | 110 | 4.8 |
| ABD | √ | √ | | √ | 72.8 | 57.8 | 65.4 | 45.5 | 2.6 | 104 | 6.5 |
| ABCD (Ours) | √ | √ | √ | √ | 77.1 | 57.6 | 66.5 | 46.6 | 2.2 | 108 | 5.2 |
Ablation experiments on the BDD100K dataset (same component labels as above).

| Model | BiFPN | P2 | GCCHead | FECM | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FPS | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | | | | 65.2 | 48.3 | 55.2 | 27.4 | 2.6 | 118 | 6.5 |
| A | √ | | | | 68.5 | 49.8 | 56.0 | 28.1 | 2.2 | 128 | 4.2 |
| B | | √ | | | 64.0 | 53.5 | 59.8 | 29.2 | 2.7 | 105 | 8.4 |
| C | | | √ | | 67.3 | 49.6 | 55.9 | 27.9 | 2.2 | 114 | 4.9 |
| D | | | | √ | 63.8 | 50.4 | 56.5 | 28.3 | 2.6 | 104 | 6.6 |
| AB | √ | √ | | | 64.5 | 55.6 | 61.3 | 30.5 | 2.4 | 108 | 5.9 |
| ABC | √ | √ | √ | | 65.8 | 54.1 | 62.2 | 30.9 | 2.1 | 116 | 4.8 |
| ABD | √ | √ | | √ | 68.6 | 54.3 | 62.4 | 31.1 | 2.6 | 102 | 6.5 |
| ABCD (Ours) | √ | √ | √ | √ | 72.5 | 56.8 | 65.9 | 31.3 | 2.2 | 106 | 5.2 |
Neck network comparison experiment.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) |
|---|---|---|---|
| Baseline | 58.5 | 39.6 | 2.6 |
| GFPN | 59.6 | 40.6 | 3.0 |
| BiFPN (ours) | 59.2 | 40.2 | 2.2 |
Performance comparison of occluded-object detection with and without FECM.

| Model | FECM | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) |
|---|---|---|---|---|
| Baseline | | 46.0 | 27.3 | 2.6 |
| Baseline | √ | 47.2 | 28.2 | 2.6 |
Detection head comparison experiment.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) |
|---|---|---|---|
| Baseline | 58.5 | 39.6 | 2.6 |
| v10Detect | 58.0 | 38.6 | 2.9 |
| GCCHead (ours) | 59.0 | 40.0 | 2.2 |
Comparison with other YOLO models on the SODA10M dataset.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FPS | GFLOPs |
|---|---|---|---|---|---|
| Baseline | 58.5 | 39.6 | 2.6 | 120 | 6.5 |
| YOLOv8n | 59.6 | 40.4 | 3.2 | 103 | 8.7 |
| YOLOv9t | 59.9 | 40.6 | 2.0 | 94 | 7.9 |
| YOLOv10n | 58.9 | 39.9 | 2.3 | 182 | 6.7 |
| YOLOv11n | 58.8 | 39.9 | 2.6 | 169 | 6.5 |
| YOLOv13n | 59.2 | 40.1 | 2.4 | 150 | 6.4 |
| RSO-YOLO (Ours) | 66.5 | 46.6 | 2.2 | 108 | 5.2 |
Comparison with other YOLO models on the BDD100K dataset.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FPS | GFLOPs |
|---|---|---|---|---|---|
| Baseline | 55.2 | 27.4 | 2.6 | 120 | 6.5 |
| YOLOv8n | 55.8 | 27.8 | 3.2 | 222 | 8.7 |
| YOLOv9t | 54.9 | 26.7 | 2.0 | 96 | 7.9 |
| YOLOv10n | 55.5 | 27.6 | 2.3 | 137 | 6.7 |
| YOLOv11n | 55.1 | 26.9 | 2.6 | 151 | 6.5 |
| YOLOv13n | 55.4 | 27.4 | 2.4 | 140 | 6.4 |
| RSO-YOLO (Ours) | 65.9 | 31.3 | 2.2 | 108 | 5.2 |
Comparison with other YOLO models on the FLIR ADAS dataset.

| Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | FPS | GFLOPs |
|---|---|---|---|---|---|
| Baseline | 61.2 | 29.2 | 2.6 | 116 | 6.5 |
| YOLOv8n | 62.2 | 30.1 | 3.2 | 125 | 8.7 |
| YOLOv9t | 62.0 | 29.6 | 2.0 | 101 | 7.9 |
| YOLOv10n | 60.8 | 29.0 | 2.3 | 121 | 6.7 |
| YOLOv11n | 61.3 | 29.3 | 2.6 | 132 | 6.5 |
| YOLOv13n | 62.1 | 29.8 | 2.4 | 130 | 6.4 |
| RSO-YOLO (Ours) | 68.4 | 35.4 | 2.2 | 126 | 5.2 |
Comparison with other improved models.

| Model | Backbone | mAP@0.5 (%) | Params (M) | FPS |
|---|---|---|---|---|
| Attention-YOLOv4 | YOLOv4 | 62.8 | 65.7 | 42 |
| MEL-YOLO | YOLOv5 | 53.8 | 3.1 | 46.2 |
| YOLOv8-ghost-EMA | YOLOv8 | 55.6 | 2.6 | 106 |
| RFCS-YOLO | YOLOv7 | 56.7 | 36.2 | 68 |
| YOLO-BOS | YOLOv8 | 63.5 | 3.27 | 100 |
| RSO-YOLO (Ours) | YOLOv12 | 66.5 | 2.2 | 108 |
Detection performance by object scale (YOLOv12 baseline vs. RSO-YOLO).

| Category | YOLOv12 mAP@0.5 (%) | RSO-YOLO mAP@0.5 (%) | YOLOv12 mAP@0.5:0.95 (%) | RSO-YOLO mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| Small object subset | 47.3 | 53.8 | 26.8 | 32.5 |
| Medium object subset | 60.4 | 63.4 | 41.2 | 43.8 |
| Large object subset | 67.5 | 69.0 | 52.3 | 54.2 |
| Overall | 58.5 | 66.5 | 39.6 | 46.6 |
Detection performance by occlusion status (YOLOv12 baseline vs. RSO-YOLO).

| Category | YOLOv12 mAP@0.5 (%) | RSO-YOLO mAP@0.5 (%) | YOLOv12 mAP@0.5:0.95 (%) | RSO-YOLO mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| Occluded object subset | 46.0 | 49.3 | 25.8 | 28.4 |
| Non-occluded object subset | 62.6 | 70.2 | 44.2 | 50.4 |
| Overall | 58.5 | 66.5 | 39.6 | 46.6 |
Share and Cite
Wang, Q.; Zhou, Z.; Zhang, Z. RSO-YOLO: A Real-Time Detector for Small and Occluded Objects in Autonomous Driving Scenarios. Sensors 2025, 25, 6703. https://doi.org/10.3390/s25216703