PONet: A Compact RGB-IR Fusion Network for Vehicle Detection on OrangePi AIpro
Abstract
1. Introduction
- We propose PONet, a novel lightweight multi-modal object detection network optimized for RGB-IR fusion under constrained edge computing environments.
- We integrate Polarized Self-Attention into PONet to improve feature representation while keeping the model compact and efficient.
- We achieve competitive performance on the VEDAI dataset, reaching 82.2% mAP@0.5 while running at 34 FPS on the OrangePi AIpro 20T, validating the network's suitability for edge deployment.
2. Related Work
2.1. Visible–Infrared Object Detection Methods
2.2. Lightweight Models for Object Detection
3. Methods
3.1. Overall Architecture
3.2. Fusion Module
- Cross-Modality Enhancement: By employing SE blocks in a cross-guided manner, each modality benefits from the global context of the other. This promotes the emergence of features that are both locally discriminative and globally coherent.
- Modality-Specific Preservation: The residual connections, modulated by learnable weights, ensure that each modality retains its intrinsic characteristics, which is critical in early fusion stages to avoid over-blending.
- Computational Efficiency: The SE blocks and update modules are lightweight, relying only on global pooling, small fully connected layers, and depth-preserving convolutions.
- Implicit Attention Alignment: Our symmetrical design fosters bidirectional conditioning, improving modality alignment without the need for explicit warping or matching. A minimal code sketch of this cross-guided fusion follows this list.
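The following minimal PyTorch sketch illustrates the cross-guided SE fusion described above. It is an illustration under stated assumptions, not the paper's released implementation: the class names (SEWeights, CrossGuidedSEFusion), the reduction ratio of 16, the scalar residual weights, and the 1×1 fusion projection are all illustrative choices.

```python
import torch
import torch.nn as nn

class SEWeights(nn.Module):
    """Squeeze-and-Excitation gate branch: global pooling -> small bottleneck -> sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # squeeze: global spatial context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # excitation: per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)

class CrossGuidedSEFusion(nn.Module):
    """Each modality is recalibrated by SE gates computed from the *other* modality;
    learnable residual weights preserve modality-specific features (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ir_to_rgb = SEWeights(channels, reduction)   # IR context gates RGB channels
        self.rgb_to_ir = SEWeights(channels, reduction)   # RGB context gates IR channels
        self.alpha = nn.Parameter(torch.ones(1))          # learnable residual weight (RGB)
        self.beta = nn.Parameter(torch.ones(1))           # learnable residual weight (IR)
        self.proj = nn.Conv2d(2 * channels, channels, 1)  # depth-preserving fusion projection

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        rgb_enh = rgb * self.ir_to_rgb(ir) + self.alpha * rgb  # cross-guided gating + residual
        ir_enh = ir * self.rgb_to_ir(rgb) + self.beta * ir
        return self.proj(torch.cat([rgb_enh, ir_enh], dim=1))

# Example: fuse 64-channel RGB and IR feature maps at 128x128 resolution
fusion = CrossGuidedSEFusion(channels=64)
out = fusion(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```

Gating each stream with channel statistics pooled from the other stream realizes the bidirectional conditioning, while the learnable residual scalars control how much unfused, modality-specific signal is preserved.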
3.3. Polarized Self-Attention
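Although this outline omits the section body, Polarized Self-Attention follows the published formulation of Liu et al. [33]: a channel-only branch that preserves full channel resolution and a spatial-only branch that preserves full spatial resolution, each with softmax-sigmoid ("polarized") gating. The sketch below is an illustrative PyTorch rendering of that public design, not PONet's exact code.

```python
import torch
import torch.nn as nn

class PolarizedSelfAttention(nn.Module):
    """Parallel Polarized Self-Attention (after Liu et al. [33])."""
    def __init__(self, channels: int):
        super().__init__()
        c2 = channels // 2
        # Channel-only branch
        self.ch_wv = nn.Conv2d(channels, c2, 1)
        self.ch_wq = nn.Conv2d(channels, 1, 1)
        self.ch_wz = nn.Conv2d(c2, channels, 1)
        self.ln = nn.LayerNorm(channels)
        # Spatial-only branch
        self.sp_wv = nn.Conv2d(channels, c2, 1)
        self.sp_wq = nn.Conv2d(channels, c2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        c2 = c // 2
        # Channel-only attention: softmax over spatial positions, sigmoid gate per channel
        v = self.ch_wv(x).reshape(b, c2, h * w)                    # b x c/2 x hw
        q = torch.softmax(self.ch_wq(x).reshape(b, h * w, 1), 1)   # b x hw x 1
        z = torch.matmul(v, q).unsqueeze(-1)                       # b x c/2 x 1 x 1
        ch_gate = torch.sigmoid(
            self.ln(self.ch_wz(z).reshape(b, c, 1).transpose(1, 2))
        ).transpose(1, 2).reshape(b, c, 1, 1)
        ch_out = ch_gate * x
        # Spatial-only attention: softmax over channels, sigmoid gate per pixel
        sv = self.sp_wv(x).reshape(b, c2, h * w)                           # b x c/2 x hw
        sq = torch.softmax(self.sp_wq(x).mean(dim=(2, 3)).reshape(b, 1, c2), -1)
        sp_gate = torch.sigmoid(torch.matmul(sq, sv).reshape(b, 1, h, w))
        sp_out = sp_gate * x
        return ch_out + sp_out  # parallel composition of the two branches

# Example usage
psa = PolarizedSelfAttention(64)
y = psa(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```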
4. Experiments
4.1. Datasets and Evaluation Metrics
4.1.1. Datasets
4.1.2. Evaluation Metrics
- AP: Average Precision, the area under the precision-recall curve:

$$\mathrm{AP} = \int_{0}^{1} p(r)\,dr$$

- p(r): Precision at recall value r.
- The integral from 0 to 1 averages precision over all recall levels to yield the overall average precision.
- mAP: Mean Average Precision, the average of the per-class AP values:

$$\mathrm{mAP} = \frac{1}{K}\sum_{i=1}^{K}\mathrm{AP}_i$$

- K: The total number of classes in the dataset.
- AP_i: The Average Precision for class i, obtained by integrating that class's precision-recall curve.
- Averaging the AP values over all K classes gives the mAP, the standard summary metric in object detection; a short numerical sketch of both computations follows this list.
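As an illustrative numerical sketch (not the paper's evaluation pipeline), the snippet below computes AP via all-point interpolation of a precision-recall curve and averages per-class APs into mAP; the toy PR-curve values are invented for demonstration.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Envelope the precision curve so it is monotonically non-increasing
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas at the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(aps: list[float]) -> float:
    """mAP: the mean of per-class AP values over K classes."""
    return sum(aps) / len(aps)

# Toy example with a 4-point precision-recall curve
ap = average_precision(np.array([0.1, 0.4, 0.7, 1.0]),
                       np.array([1.0, 0.9, 0.7, 0.5]))
print(round(ap, 3))                         # 0.73
print(mean_average_precision([ap, 0.6]))    # mean over two toy classes
```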
4.2. Implementation Details
4.3. Algorithm Performance Experiment
5. Application on OrangePi AIpro 20T
5.1. Deployment Performance
- Platform: OrangePi AIpro 20T (Ascend 310B);
- Model Format: .om (converted from ONNX using ATC; a conversion sketch follows this list);
- Input Resolution: 1024 × 1024 (RGB+IR);
- Average Inference FPS: 34;
- End-to-End Latency: ∼29.4 ms per frame.
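For context on the toolchain above: deployment on Ascend hardware typically means exporting the trained network to ONNX and compiling it offline with Huawei's ATC. The sketch below uses a stand-in module, since PONet's definition is not reproduced here; the input names, opset version, and soc_version string are assumptions rather than values reported by the paper.

```python
import torch
import torch.nn as nn

# Stand-in for the trained network: any nn.Module taking (rgb, ir) tensors.
# The real PONet definition is not reproduced in this outline.
class TwoStreamStub(nn.Module):
    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        return torch.cat([rgb, ir], dim=1)

model = TwoStreamStub().eval()
dummy_rgb = torch.randn(1, 3, 1024, 1024)  # visible input at the deployed resolution
dummy_ir = torch.randn(1, 3, 1024, 1024)   # infrared input (channel count is an assumption)

torch.onnx.export(
    model, (dummy_rgb, dummy_ir), "ponet.onnx",
    input_names=["rgb", "ir"], output_names=["out"],
    opset_version=12,
)

# The ONNX graph is then compiled to an .om file offline with Huawei's ATC, e.g.:
#   atc --model=ponet.onnx --framework=5 --output=ponet --soc_version=Ascend310B1
# (--framework=5 selects ONNX; the exact soc_version string depends on the board.)
```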
5.2. Discussion
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327.
2. Payghode, V.; Goyal, A.; Bhan, A.; Iyer, S.S.; Dubey, A.K. Object detection and activity recognition in video surveillance using neural networks. Int. J. Web Inf. Syst. 2023, 19, 123–138.
3. Yang, B.; Li, J.; Zeng, T. A Review of Environmental Perception Technology Based on Multi-Sensor Information Fusion in Autonomous Driving. World Electr. Veh. J. 2025, 16, 20.
4. Zhao, H.; Chu, K.; Zhang, J.; Feng, C. YOLO-FSD: An improved target detection algorithm on remote-sensing images. IEEE Sens. J. 2023, 23, 30751–30764.
5. Hussain, M. YOLOv1 to v8: Unveiling each variant–a comprehensive review of YOLO. IEEE Access 2024, 12, 42816–42833.
6. Li, Y.; Hu, Z.; Zhang, Y.; Liu, J.; Tu, W.; Yu, H. DDEYOLOv9: Network for detecting and counting abnormal fish behaviors in complex water environments. Fishes 2024, 9, 242.
7. Krišto, M.; Ivasic-Kos, M.; Pobar, M. Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 8, 125459–125476.
8. Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-frame infrared small-target detection: A survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119.
9. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926.
10. Sun, J.; Yin, M.; Wang, Z.; Xie, T.; Bei, S. Multispectral object detection based on multilevel feature fusion and dual feature modulation. Electronics 2024, 13, 443.
11. Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection using Deep Fusion Convolutional Neural Networks. In Proceedings of the ESANN, Bruges, Belgium, 27–29 April 2016; Volume 587, pp. 509–514.
12. Wang, Y.; Tang, C.; Shi, Q. Cross-Modality Fusion Deformable Transformer for Multispectral Object Detection. In Proceedings of the International Conference on Guidance, Navigation and Control, Changsha, China, 9–11 August 2024; pp. 372–382.
13. Shao, Y.; Huang, Q.; Mei, Y.; Chu, H. MOD-YOLO: Multispectral object detection based on transformer dual-stream YOLO. Pattern Recognit. Lett. 2024, 183, 26–34.
14. Meng, F.; Chen, X.; Tang, H.; Wang, C.; Tong, G. B2MFuse: A Bi-branch Multi-scale Infrared and Visible Image Fusion Network based on Joint Semantics Injection. IEEE Trans. Instrum. Meas. 2024, 73, 5037317.
15. Zhang, Y.; Yu, H.; He, Y.; Wang, X.; Yang, W. Illumination-guided RGBT object detection with inter- and intra-modality fusion. IEEE Trans. Instrum. Meas. 2023, 72, 2508013.
16. Fu, L.; Gu, W.b.; Ai, Y.b.; Li, W.; Wang, D. Adaptive spatial pixel-level feature fusion network for multispectral pedestrian detection. Infrared Phys. Technol. 2021, 116, 103770.
17. Gallagher, J.E.; Oughton, E.J. Surveying You Only Look Once (YOLO) Multispectral Object Detection Advancements, Applications and Challenges. IEEE Access 2025, 13, 7366–7395.
18. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object detection. Sensors 2023, 23, 2934.
19. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415.
20. Xu, J.; Tan, X.; Luo, R.; Song, K.; Li, J.; Qin, T.; Liu, T.Y. NAS-BERT: Task-agnostic and adaptive-size BERT compression with neural architecture search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1933–1943.
21. Liu, Y.; Zhang, W.; Wang, J. Zero-shot adversarial quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 1512–1521.
22. Zhu, J.; Tang, S.; Chen, D.; Yu, S.; Liu, Y.; Rong, M.; Yang, A.; Wang, X. Complementary relation contrastive distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 9260–9269.
23. Wimmer, P.; Mehnert, J.; Condurache, A. Interspace pruning: Using adaptive filter representations to improve training of sparse CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12527–12537.
24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
25. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
26. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
27. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
28. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
29. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589.
30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
32. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 13713–13722.
33. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782.
34. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203.
35. Pham, M.T.; Courtrai, L.; Friguet, C.; Lefèvre, S.; Baussard, A. YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images. Remote Sens. 2020, 12, 2501.
36. Betti, A.; Tucci, M. YOLO-S: A lightweight and accurate YOLO-like network for small target detection in aerial imagery. Sensors 2023, 23, 1865.
37. Ju, M.; Niu, B.; Jin, S.; Liu, Z. SuperDet: An efficient single-shot network for vehicle detection in remote sensing images. Electronics 2023, 12, 1312.
38. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-based object detection method for remote sensing images. IEEE Access 2023, 11, 125122–125137.
39. Ren, Z. Enhanced YOLOv8 Infrared Image Object Detection Method with SPD Module. J. Theory Pract. Eng. Technol. 2024, 1, 1–7.
40. Shen, L.; Lang, B.; Song, Z. Infrared object detection method based on DBD-YOLOv8. IEEE Access 2023, 11, 145853–145868.
41. Shen, J.; Chen, Y.; Liu, Y.; Zuo, X.; Fan, H.; Yang, W. ICAFusion: Iterative cross-attention guided feature fusion for multispectral object detection. Pattern Recognit. 2024, 145, 109913.
42. Zhang, J.; Lei, J.; Xie, W.; Li, Y.; Yang, G.; Jia, X. Guided hybrid quantization for object detection in remote sensing imagery via one-to-one self-teaching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614815.
43. Zhu, J.; Chen, X.; Zhang, H.; Tan, Z.; Wang, S.; Ma, H. Transformer based remote sensing object detection with enhanced multispectral feature extraction. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5001405.
Per-class AP@0.5 (%) on VEDAI, with model size and inference speed (single-modality baselines and PSA ablation):

| Method | All | Car | Pickup | Camping | Truck | Other | Tractor | Boat | Van | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s-RGB | 80.0 | 94.7 | 90.9 | 95.7 | 91.7 | 77.6 | 92.1 | 48.5 | 49.0 | – | – |
| YOLOv5s-IR | 71.8 | 91.4 | 88.5 | 92.6 | 87.4 | 66.3 | 68.2 | 31.5 | 48.3 | – | – |
| Baseline (without PSA) | 81.5 | 92.4 | 92.0 | 93.7 | 93.8 | 79.9 | 85.5 | 43.8 | 70.9 | 4.13 | 104 |
| PONet (ours) | 82.2 | 95.2 | 92.1 | 97.9 | 94.7 | 88.5 | 94.7 | 48.7 | 45.9 | 3.76 | 158 |
Comparison with representative single- and multi-modality detectors on VEDAI:

| Method | Input Modality | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) |
|---|---|---|---|---|
| YOLO-S [36] | RGB | 70.40 | – | 78.0 |
| SuperDet [37] | RGB | 77.60 | – | – |
| DS-YOLOv8 [38] | RGB | 78.90 | 51.10 | – |
| SPD-YOLOv8 [39] | Thermal | 63.70 | 52.10 | – |
| DBD-YOLOv8 [40] | Thermal | 76.00 | – | – |
| SuperYOLO [19] | RGB+Thermal | 75.09 | – | 7.0 |
| ICAFusion [41] | RGB+Thermal | 76.62 | 44.93 | 120.2 |
| GHOST [42] | RGB+Thermal | 80.31 | 49.05 | 9.7 |
| Multispectral DETR [43] | RGB+Thermal | 82.70 | 50.80 | 73.0 |
| PONet (Ours) | RGB+Thermal | 82.20 | 52.70 | 3.76 |