GEM-YOLO: A Lightweight and Real-Time RGBT Object Detector with Gated Multimodal Fusion
Abstract
1. Introduction
- (a) Adaptive Multimodal Gated Fusion Mechanism (GFM): We design a lightweight global-context gating mechanism between the dual backbones to adaptively regulate the contributions of RGB and thermal features under different scene conditions. Instead of static fusion, GFM estimates complementary modality weights from globally pooled RGB–IR features through a compact Multi-Layer Perceptron (MLP) followed by Softmax normalization, enabling scene-adaptive modality calibration with negligible computational overhead (the gate is summarized in the equations after this list).
- (b) Feature-Preserving Backbone via SPD-Conv: To alleviate the “small object disappearance” problem, we redesign the backbone by incorporating Space-to-Depth (SPD) convolutions [12]. By replacing traditional strided convolutions with SPD blocks, spatial details are reorganized into the channel dimension, which helps preserve texture and contour features for small thermal targets. This feature-preserving downsampling design maintains fine-grained target representations under repeated resolution reduction.
- (c) Real-Time Lightweight Neck based on Ghost and Ghost-Shuffle Convolution (GSConv): We reconstruct the detector’s neck using Ghost Modules [13] and GSConv [14] to alleviate the computational bottleneck of dual-stream networks. By replacing redundant standard convolutions with more efficient operations, we reduce the computational cost to less than 9.0 GFLOPs—comparable to single-stream networks—while maintaining competitive feature representation capability.
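In compact form, the gate in contribution (a) can be written as follows (our notation, not necessarily the paper’s: $F_{rgb}$ and $F_{ir}$ are the two modality feature maps, $\mathrm{GAP}$ is global average pooling, and $[\,\cdot\,;\,\cdot\,]$ is channel-wise concatenation):

$$
[\alpha_{rgb},\ \alpha_{ir}] = \mathrm{Softmax}\big(\mathrm{MLP}\big([\mathrm{GAP}(F_{rgb});\ \mathrm{GAP}(F_{ir})]\big)\big), \qquad F_{fused} = \alpha_{rgb}\,F_{rgb} + \alpha_{ir}\,F_{ir}.
$$

Because $\alpha_{rgb} + \alpha_{ir} = 1$, the gate trades the two modalities off against each other per scene rather than amplifying both, which is what distinguishes it from static fixed-weight fusion.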
2. Related Work
2.1. RGBT Fusion Mechanisms: From Static to Dynamic
2.2. Small Object Detection: The Downsampling Dilemma
2.3. Lightweight Architectures: The Dual-Stream Burden
3. The Proposed Method
3.1. Overall Architecture
- (a) Dual-Stream Decoupled Backbone: Two parallel CSP-Darknet backbones extract feature hierarchies from the visible (RGB) and thermal (IR) images independently. To mitigate the loss of fine-grained spatial information during downsampling, we integrate Space-to-Depth (SPD) convolution modules at the low-resolution stages (specifically the P3 and P4 layers). Here, SPD-Conv replaces the original stride-2 downsampling convolutions at the corresponding backbone stages, rather than the CSP feature extraction blocks themselves. Separate backbone branches are intentionally preserved because the visible and thermal modalities differ substantially in low-level contrast distributions, local structural patterns, and sensor-specific noise characteristics. While partial weight sharing would be more efficient, it may suppress modality-specific representation learning at early stages. GEM-YOLO therefore prioritizes modality-decoupled feature extraction in the backbone and offsets the additional computational burden later through the lightweight Ghost/GSConv neck.
- (b) Multi-Scale Gated Fusion: Effective fusion requires the integration of features at multiple semantic levels [35]. We perform fusion at three distinct scales corresponding to downsampling strides of 8, 16, and 32. At each stage, the Gated Fusion Module (GFM) is deployed to dynamically calibrate the contribution of each modality based on global context, rather than simple linear superposition.
- (c) Lightweight Semantic Aggregation: The fused features are aggregated using a Path Aggregation Network (PANet). To offset the computational overhead of the dual-stream backbone, the neck is reconstructed using Ghost Modules [13] and GSConv [14]. Furthermore, CBAM [36] attention blocks are embedded to refine the fused features before they are fed into the three decoupled detection heads.
3.2. Adaptive Multimodal Gated Fusion Mechanism (GFM)
- 1. Joint Feature Embedding: the RGB and IR feature maps are globally average-pooled and concatenated into a joint global descriptor.
- 2. Modality Weight Learning: a compact MLP maps the joint descriptor to one logit per modality, normalized by Softmax.
- 3. Adaptive Fusion: the resulting complementary weights rescale the two feature maps, whose weighted sum forms the fused representation, as sketched below.
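To make the three steps concrete, the following PyTorch sketch shows one plausible implementation of the GFM (a minimal sketch under assumptions: a 16-unit hidden layer, one scalar weight per modality, and class/variable names of our choosing — the released code may differ):

```python
import torch
import torch.nn as nn

class GatedFusionModule(nn.Module):
    """Sketch of the GFM: pooled joint embedding -> compact MLP ->
    Softmax weights -> weighted sum of the RGB and thermal feature maps."""

    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.mlp = nn.Sequential(                  # compact weight learner
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),                  # one logit per modality
        )

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        b = f_rgb.size(0)
        # 1. Joint feature embedding from globally pooled descriptors
        g = torch.cat([self.pool(f_rgb).flatten(1),
                       self.pool(f_ir).flatten(1)], dim=1)   # (B, 2C)
        # 2. Modality weight learning with Softmax normalization
        w = torch.softmax(self.mlp(g), dim=1)                # (B, 2), rows sum to 1
        # 3. Adaptive fusion: scene-dependent weighted sum
        w_rgb = w[:, 0].view(b, 1, 1, 1)
        w_ir = w[:, 1].view(b, 1, 1, 1)
        return w_rgb * f_rgb + w_ir * f_ir

# Example: fuse stride-16 feature maps from the two backbone streams.
gfm = GatedFusionModule(channels=256)
fused = gfm(torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40))
```

Since the gate operates only on pooled vectors, its cost is a few thousand multiply–accumulates per scale, consistent with the ablation study’s +0.07 GFLOPs and +0.03 M parameters for GFM.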
3.3. Feature-Preserving Backbone via High-Precision Light SPD-Conv
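As a concrete illustration of the feature-preserving downsampling used here, the sketch below follows the SPD-Conv formulation of Sunkara and Luo [12] with a scale factor of 2; the module name and channel widths are our assumptions, not GEM-YOLO’s exact configuration:

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth downsampling: rearrange each 2x2 spatial neighborhood
    into channels (C -> 4C, H -> H/2, W -> W/2) without discarding any
    activation, then mix the reorganized features with a stride-1 conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.spd = nn.PixelUnshuffle(downscale_factor=2)   # lossless rearrangement
        self.conv = nn.Conv2d(4 * in_channels, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(self.spd(x))))

# A 2x downsampling step that keeps all spatial evidence in the channels:
x = torch.randn(1, 64, 80, 80)
y = SPDConv(64, 128)(x)        # -> (1, 128, 40, 40)
```

Unlike a stride-2 convolution, which evaluates its kernel at only a quarter of the spatial positions and discards the intermediate responses, the rearrangement retains every activation, so the texture and contour cues of small thermal targets survive repeated resolution reduction.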
3.4. Lightweight Ghost-Neck with Attention Refinement
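The two neck primitives can be sketched as follows, based on the published GhostNet [13] and GSConv [14] designs (ratio-2 variants; the kernel sizes and activation are assumptions rather than GEM-YOLO’s exact settings):

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1, groups=1):
    """Convolution + BatchNorm + SiLU helper shared by both primitives."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class GhostModule(nn.Module):
    """Half the outputs come from a regular 1x1 conv ("intrinsic" features);
    the other half are generated from them by a cheap depthwise conv."""

    def __init__(self, c_in, c_out):
        super().__init__()
        assert c_out % 2 == 0
        self.primary = conv_bn_act(c_in, c_out // 2, k=1)
        self.cheap = conv_bn_act(c_out // 2, c_out // 2, k=3, groups=c_out // 2)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class GSConv(nn.Module):
    """Dense conv for half the channels, depthwise conv for the other half,
    then a channel shuffle that interleaves the two halves."""

    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        assert c_out % 2 == 0
        self.dense = conv_bn_act(c_in, c_out // 2, k=k, s=s)
        self.depthwise = conv_bn_act(c_out // 2, c_out // 2, k=5, groups=c_out // 2)

    def forward(self, x):
        y = self.dense(x)
        out = torch.cat([y, self.depthwise(y)], dim=1)
        b, c, h, w = out.shape                 # channel shuffle with 2 groups
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

# Both replace standard convolutions in the PANet neck at roughly half the FLOPs:
x = torch.randn(1, 128, 40, 40)
print(GhostModule(128, 128)(x).shape, GSConv(128, 256)(x).shape)
```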
4. Experiments
4.1. Dataset
4.1.1. FLIR
4.1.2. M3FD
4.2. Experimental Environment and Setting
4.3. Experimental Evaluation Metrics
- 1. Accuracy Metrics (P, R, and mAP)
- mAP@50 (%): The mean Average Precision calculated at a relatively loose Intersection over Union (IoU) threshold of 0.5. It reflects the model’s basic ability to locate and classify targets.
- mAP@50:95 (%): The average mAP calculated at different IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This is a much stricter metric that heavily penalizes imprecise bounding boxes, thereby effectively reflecting the model’s high-precision localization capability for small targets.
- 2. Efficiency Metrics (Params, GFLOPs, FPS, and Latency)
- Params (M): Measured in millions, this metric reflects the spatial complexity and memory footprint of the model.
- GFLOPs (G): Measured in billions of operations, this metric represents the time complexity and computational cost of the network. It should be emphasized that GFLOPs quantify theoretical computational complexity, but do not directly correspond to actual runtime latency, which is also influenced by hardware architecture, memory bandwidth, operator implementation, and parallel execution efficiency.
- FPS: This metric denotes the number of images processed per second during inference and is used to assess the real-time capability of the model.
- Latency (ms): This metric denotes the average inference time required to process a single image, directly reflecting the response speed of the model in practical deployment.
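For reference, the composite accuracy metric and the two speed metrics are related by the standard definitions below (restated in our notation):

$$
\mathrm{mAP@50{:}95} = \frac{1}{10}\sum_{t\,\in\,\{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}_t, \qquad \mathrm{FPS} = \frac{1000}{\mathrm{Latency\ (ms)}}.
$$

As a sanity check against the tables in Section 5.1, the proposed model’s 2.1 ms average latency on FLIR corresponds to 1000/2.1 ≈ 476 FPS, matching its reported 476.2 FPS.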
5. Results
5.1. Comparative Experimental Results and Analysis
5.2. Ablation Study
5.3. Scale-Wise Detection Analysis
5.4. Discussion
5.4.1. Global Gating Behavior and Its Limitation
5.4.2. Failure Cases and Balanced Error Analysis
5.4.3. Separate Versus Shared Backbone Design
5.4.4. Discussion on Comparison Scope
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. Available online: https://docs.ultralytics.com/ (accessed on 10 January 2025).
- Sun, C.; Chen, Y.; Qiu, X.; Li, R.; You, L. MRD-YOLO: A Multispectral Object Detection Algorithm for Complex Road Scenes. Sensors 2024, 24, 3222. [Google Scholar] [CrossRef] [PubMed]
- Li, C.; Song, D.; Tong, R.; Tang, M. Multispectral Pedestrian Detection via Simultaneous Detection and Segmentation. In Proceedings of the 29th British Machine Vision Conference (BMVC), Newcastle, UK, 3–6 September 2018; p. 225. [Google Scholar]
- Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly Aligned Cross-Modal Learning for Multispectral Pedestrian Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5127–5137. [Google Scholar]
- Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; IEEE: New York, NY, USA, 2020; pp. 443–447. [Google Scholar]
- Wagner, J.; Fischer, V.; Herman, M.; Behnke, S. Multispectral Pedestrian Detection Using Deep Fusion Convolutional Neural Networks. In Proceedings of the 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), Bruges, Belgium, 27–29 April 2016; pp. 509–514. [Google Scholar]
- Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO Architecture from Infrared and Visible Images for Object Detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef] [PubMed]
- FLIR Systems. Free FLIR Thermal Dataset for Algorithm Training. Available online: https://www.flir.com/industries/automotive/ (accessed on 15 January 2025).
- Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-Aware Dual-Adversarial Learning and a Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5802–5811. [Google Scholar]
- Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Grenoble, France, 19–23 September 2022; pp. 443–459. [Google Scholar]
- Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1580–1589. [Google Scholar]
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
- Zhao, P.; Ye, X.; Du, Z. Object Detection in Multispectral Remote Sensing Images Based on Cross-Modal Cross-Attention. Sensors 2024, 24, 4098. [Google Scholar] [CrossRef] [PubMed]
- Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A Visible-Infrared Paired Dataset for Low-Light Vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
- Lee, S.; Kim, T.; Shin, J.; Kim, N.; Choi, Y. INSANet: INtra-INter Spectral Attention Network for Effective Feature Fusion of Multispectral Pedestrian Detection. Sensors 2024, 24, 1168. [Google Scholar] [CrossRef] [PubMed]
- Li, R.; Xiang, J.; Sun, F.; Yuan, Y.; Yuan, L.; Gou, S. Multiscale Cross-Modal Homogeneity Enhancement and Confidence-Aware Fusion for Multispectral Pedestrian Detection. IEEE Trans. Multimedia 2024, 26, 852–863. [Google Scholar] [CrossRef]
- Shen, J.; He, J.; Liu, Q.; Zhang, Z.; Wang, G.; Lu, D. MSDF-Mamba: Mutual-Spectrum Perception Deformable Fusion Mamba for Drone-Based Visible–Infrared Cross-Modality Vehicle Detection. Remote Sens. 2025, 17, 4037. [Google Scholar] [CrossRef]
- Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-Thermal Fusion Network for Semantic Segmentation of Urban Scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583. [Google Scholar] [CrossRef]
- Zhou, K.; Chen, L.; Cao, X. Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 787–803. [Google Scholar]
- Chen, Y.-T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 139–158. [Google Scholar]
- Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided Attentive Feature Fusion for Multispectral Pedestrian Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2021; pp. 1807–1817. [Google Scholar] [CrossRef]
- Fang, Q.; Han, D.; Wang, Z. Cross-Modality Fusion Transformer for Multispectral Object Detection. arXiv 2021, arXiv:2111.00273. Available online: https://arxiv.org/abs/2111.00273 (accessed on 9 March 2026).
- Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15089, pp. 1–21. [Google Scholar]
- Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677. [Google Scholar]
- Ma, T.; Cheng, K.; Chai, T.; Prasad, S.; Zhao, D.; Li, J.; Zhou, H. MDCENet: Multi-Dimensional Cross-Enhanced Network for Infrared Small Target Detection. Infrared Phys. Technol. 2024, 141, 105475. [Google Scholar] [CrossRef]
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; pp. 107984–108011. [Google Scholar]
- Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
- Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 78–96. [Google Scholar]
- Wang, Z.; Liao, X.; Yuan, J.; Yao, Y.; Li, Z. CDC-YOLOFusion: Leveraging Cross-Scale Dynamic Convolution Fusion for Visible-Infrared Object Detection. IEEE Trans. Intell. Veh. 2025, 10, 2080–2093. [Google Scholar] [CrossRef]
- Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
- Chen, Y.; Yuan, X.; Wang, J.; Wu, R.; Li, X.; Hou, Q.; Cheng, M.-M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-Time Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4240–4252. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 7464–7475. [Google Scholar]
| Dataset | Total Pairs | Classes (Count) | Original Image Size | Input Image Size | Data Split (Train:Validation:Test) |
|---|---|---|---|---|---|
| FLIR | 5136 | 3 | 640 × 512 (pixel) | 640 × 640 (pixel) | 7:2:1 |
| M3FD | 4200 | 6 | 1024 × 768 (pixel) | 640 × 640 (pixel) | 7:2:1 |
| Datasets | Split | Images | Objects | Small | Medium | Large | Small Ratio | Medium Ratio | Large Ratio |
|---|---|---|---|---|---|---|---|---|---|
| FLIR | Train | 3595 | 28,631 | 16,002 | 10,637 | 1992 | 55.89% | 37.15% | 6.96% |
| FLIR | Val | 1027 | 7960 | 4415 | 2980 | 565 | 55.46% | 37.44% | 7.10% |
| FLIR | Test | 514 | 4130 | 2254 | 1599 | 277 | 54.58% | 38.72% | 6.71% |
| M3FD | Train | 2940 | 23,990 | 9867 | 9298 | 4825 | 41.13% | 38.76% | 20.11% |
| M3FD | Val | 840 | 7080 | 2956 | 2827 | 1297 | 41.75% | 39.93% | 18.32% |
| M3FD | Test | 420 | 3338 | 1250 | 1399 | 689 | 37.45% | 41.91% | 20.64% |
| Environment | Configuration |
|---|---|
| Hardware | Intel(R) Xeon(R) Gold 6430; RAM 120 GB; NVIDIA GeForce RTX 4090 24 GB × 1 |
| Software | Windows 11; Python 3.12.3; PyTorch 2.2.2; CUDA 11.8; Ultralytics YOLO (v8.3.75) |
| Parameter | Value |
|---|---|
| Epochs | 150 |
| Batch size | 16 |
| Input size | 640 × 640 |
| Optimizer | SGD (lr0 = 0.01, momentum = 0.937, weight_decay = 0.0005) |
| Learning rate | Initial lr0 = 0.01; cosine decay to lr_final = 1 × 10⁻⁴ (lrf = 0.01) |
| Workers | 8 |
| Data Augmentation | Mosaic, HSV, Flip (Mosaic disabled in the last 10 epochs) |
| Precision (AMP) | False |
| Method | Classes | Precision (%) | Recall (%) | mAP@50 (%) | mAP@50:95 (%) | Params (M) | GFLOPs (G) | FPS | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| Dual-stream YOLOv5n | All | 84.4 ± 0.3 | 73.2 ± 0.3 | 82.4 ± 0.3 | 45.7 ± 0.3 | 3.67 | 9.81 | 400.0 ± 3.2 | 2.5 ± 0.02 |
| | Person | 85.2 ± 0.3 | 74.1 ± 0.3 | 83.1 ± 0.3 | 44.3 ± 0.3 | | | | |
| | Bicycle | 88.1 ± 0.3 | 80.0 ± 0.3 | 89.1 ± 0.3 | 57.9 ± 0.3 | | | | |
| | Car | 80.0 ± 0.3 | 65.4 ± 0.4 | 74.9 ± 0.4 | 35.0 ± 0.4 | | | | |
| Dual-stream YOLOv8n | All | 85.3 ± 0.2 | 73.0 ± 0.2 | 82.7 ± 0.2 | 46.5 ± 0.2 | 4.36 | 11.34 | 434.8 ± 3.5 | 2.3 ± 0.02 |
| | Person | 86.0 ± 0.2 | 73.9 ± 0.2 | 83.0 ± 0.2 | 45.2 ± 0.2 | | | | |
| | Bicycle | 88.7 ± 0.2 | 80.4 ± 0.2 | 89.2 ± 0.2 | 58.3 ± 0.2 | | | | |
| | Car | 81.1 ± 0.3 | 64.7 ± 0.3 | 75.8 ± 0.3 | 36.1 ± 0.3 | | | | |
| Dual-stream YOLOvXn | All | 83.4 ± 0.3 | 74.2 ± 0.3 | 82.4 ± 0.3 | 45.9 ± 0.3 | 3.38 | 9.33 | 500.0 ± 4.1 | 2.0 ± 0.02 |
| | Person | 84.5 ± 0.3 | 75.1 ± 0.3 | 83.3 ± 0.3 | 44.2 ± 0.3 | | | | |
| | Bicycle | 85.6 ± 0.3 | 80.6 ± 0.3 | 88.6 ± 0.3 | 58.1 ± 0.3 | | | | |
| | Car | 80.1 ± 0.4 | 66.9 ± 0.4 | 75.2 ± 0.4 | 35.5 ± 0.4 | | | | |
| Dual-stream YOLOv10n | All | 82.6 ± 0.3 | 72.1 ± 0.3 | 81.2 ± 0.3 | 45.1 ± 0.3 | 3.73 | 11.23 | 666.7 ± 5.3 | 1.5 ± 0.01 |
| | Person | 83.6 ± 0.3 | 72.4 ± 0.3 | 82.0 ± 0.3 | 44.2 ± 0.3 | | | | |
| | Bicycle | 85.5 ± 0.3 | 78.5 ± 0.3 | 87.3 ± 0.3 | 57.4 ± 0.3 | | | | |
| | Car | 78.6 ± 0.4 | 65.5 ± 0.4 | 74.3 ± 0.4 | 33.6 ± 0.4 | | | | |
| Dual-stream YOLOv11n | All | 83.2 ± 0.2 | 74.2 ± 0.2 | 81.8 ± 0.2 | 45.6 ± 0.2 | 3.79 | 9.31 | 370.4 ± 3.0 | 2.7 ± 0.02 |
| | Person | 83.6 ± 0.2 | 74.8 ± 0.2 | 83.2 ± 0.2 | 44.8 ± 0.2 | | | | |
| | Bicycle | 85.7 ± 0.2 | 79.7 ± 0.2 | 88.2 ± 0.2 | 57.7 ± 0.2 | | | | |
| | Car | 80.2 ± 0.3 | 68.1 ± 0.3 | 74.1 ± 0.3 | 34.4 ± 0.3 | | | | |
| MBNet | All | 82.8 ± 0.3 | 72.5 ± 0.3 | 80.6 ± 0.3 | 44.9 ± 0.3 | 6.8 | 16.2 | 285.7 ± 2.5 | 3.5 ± 0.03 |
| | Person | 83.9 ± 0.3 | 73.2 ± 0.3 | 81.5 ± 0.3 | 43.6 ± 0.3 | | | | |
| | Bicycle | 86.2 ± 0.3 | 78.9 ± 0.3 | 86.8 ± 0.3 | 57.0 ± 0.3 | | | | |
| | Car | 78.3 ± 0.4 | 65.4 ± 0.4 | 73.5 ± 0.4 | 34.1 ± 0.4 | | | | |
| ProbEn | All | 83.5 ± 0.2 | 73.3 ± 0.2 | 82.1 ± 0.2 | 45.4 ± 0.2 | 7.5 | 17.4 | 263.2 ± 2.2 | 3.8 ± 0.03 |
| | Person | 84.6 ± 0.2 | 74.0 ± 0.2 | 82.8 ± 0.2 | 44.0 ± 0.2 | | | | |
| | Bicycle | 86.9 ± 0.2 | 79.8 ± 0.2 | 88.4 ± 0.2 | 57.6 ± 0.2 | | | | |
| | Car | 79.0 ± 0.3 | 66.1 ± 0.3 | 75.0 ± 0.3 | 34.6 ± 0.3 | | | | |
| Dual-stream RT-DETR | All | 82.2 ± 0.4 | 71.6 ± 0.4 | 81.0 ± 0.4 | 44.5 ± 0.4 | 66.34 | 194.03 | 70.4 ± 0.9 | 14.2 ± 0.15 |
| | Person | 82.8 ± 0.4 | 72.1 ± 0.4 | 81.7 ± 0.4 | 43.2 ± 0.4 | | | | |
| | Bicycle | 84.4 ± 0.4 | 77.5 ± 0.4 | 86.6 ± 0.4 | 55.9 ± 0.4 | | | | |
| | Car | 79.4 ± 0.5 | 65.2 ± 0.5 | 74.6 ± 0.5 | 34.5 ± 0.5 | | | | |
| Proposed | All | 84.5 ± 0.2 | 74.2 ± 0.2 | 82.8 ± 0.2 * | 46.5 ± 0.2 | 3.44 | 7.58 | 476.2 ± 3.5 | 2.1 ± 0.02 |
| | Person | 85.8 ± 0.2 * | 75.3 ± 0.2 | 83.8 ± 0.2 * | 45.2 ± 0.2 | | | | |
| | Bicycle | 88.6 ± 0.2 * | 80.5 ± 0.2 | 89.2 ± 0.2 * | 58.3 ± 0.2 | | | | |
| | Car | 80.2 ± 0.3 | 67.0 ± 0.3 | 75.5 ± 0.3 | 36.0 ± 0.3 | | | | |
| Model | Stream Setting | GFLOPs |
|---|---|---|
| Original YOLOv11n | Single-stream | 4.67 |
| Dual-stream YOLOv11n baseline | Dual-stream | 9.31 |
| Proposed | Dual-stream | 7.58 |
| Method | Classes | Precision (%) | Recall (%) | mAP@50 (%) | mAP@50:95 (%) | Params (M) | GFLOPs (G) | FPS | Latency (ms) |
|---|---|---|---|---|---|---|---|---|---|
| Dual-stream YOLOv5n | All | 77.8 ± 0.4 | 62.0 ± 0.4 | 68.5 ± 0.4 | 43.5 ± 0.4 | 3.67 | 9.81 | 400.0 ± 3.2 | 2.5 ± 0.02 |
| | Person | 78.8 ± 0.4 | 70.5 ± 0.4 | 78.3 ± 0.4 | 45.5 ± 0.4 | | | | |
| | Car | 86.5 ± 0.4 | 77.0 ± 0.4 | 84.1 ± 0.4 | 57.0 ± 0.4 | | | | |
| | Bus | 84.5 ± 0.5 | 76.4 ± 0.5 | 80.6 ± 0.5 | 61.0 ± 0.5 | | | | |
| | Motorcycle | 69.9 ± 0.5 | 47.5 ± 0.5 | 52.9 ± 0.5 | 31.3 ± 0.5 | | | | |
| | Lamp | 72.4 ± 0.6 | 37.3 ± 0.6 | 45.4 ± 0.6 | 20.3 ± 0.6 | | | | |
| | Truck | 74.6 ± 0.5 | 63.0 ± 0.5 | 70.0 ± 0.5 | 45.8 ± 0.5 | | | | |
| Dual-stream YOLOv8n | All | 82.0 ± 0.3 | 62.8 ± 0.3 | 70.3 ± 0.3 | 44.7 ± 0.3 | 4.36 | 11.34 | 270.3 ± 2.8 | 3.7 ± 0.03 |
| | Person | 80.9 ± 0.3 | 72.5 ± 0.3 | 78.6 ± 0.3 | 46.6 ± 0.3 | | | | |
| | Car | 87.3 ± 0.3 | 77.5 ± 0.3 | 85.1 ± 0.3 | 58.3 ± 0.3 | | | | |
| | Bus | 85.3 ± 0.4 | 74.4 ± 0.4 | 78.0 ± 0.4 | 61.0 ± 0.4 | | | | |
| | Motorcycle | 80.7 ± 0.4 | 45.8 ± 0.4 | 54.1 ± 0.4 | 30.4 ± 0.4 | | | | |
| | Lamp | 80.8 ± 0.5 | 37.8 ± 0.5 | 50.0 ± 0.5 | 23.4 ± 0.5 | | | | |
| | Truck | 76.7 ± 0.4 | 68.9 ± 0.4 | 76.0 ± 0.4 | 48.5 ± 0.4 | | | | |
| Dual-stream YOLOvXn | All | 79.5 ± 0.4 | 62.1 ± 0.4 | 69.0 ± 0.4 | 43.8 ± 0.4 | 3.38 | 9.34 | 370.4 ± 3.1 | 2.7 ± 0.02 |
| | Person | 81.1 ± 0.4 | 70.4 ± 0.4 | 78.4 ± 0.4 | 46.0 ± 0.4 | | | | |
| | Car | 86.2 ± 0.4 | 76.5 ± 0.4 | 84.2 ± 0.4 | 57.3 ± 0.4 | | | | |
| | Bus | 86.0 ± 0.5 | 75.0 ± 0.5 | 78.8 ± 0.5 | 59.5 ± 0.5 | | | | |
| | Motorcycle | 73.2 ± 0.5 | 49.3 ± 0.5 | 53.5 ± 0.5 | 32.3 ± 0.5 | | | | |
| | Lamp | 73.2 ± 0.6 | 35.5 ± 0.6 | 45.9 ± 0.6 | 21.2 ± 0.6 | | | | |
| | Truck | 77.1 ± 0.5 | 66.2 ± 0.5 | 73.0 ± 0.5 | 45.9 ± 0.5 | | | | |
| Dual-stream YOLOv10n | All | 78.6 ± 0.4 | 59.7 ± 0.4 | 67.8 ± 0.4 | 42.7 ± 0.4 | 3.73 | 11.23 | 454.5 ± 3.8 | 2.2 ± 0.02 |
| | Person | 79.7 ± 0.4 | 69.3 ± 0.4 | 77.5 ± 0.4 | 45.0 ± 0.4 | | | | |
| | Car | 85.5 ± 0.4 | 74.1 ± 0.4 | 83.4 ± 0.4 | 56.3 ± 0.4 | | | | |
| | Bus | 81.2 ± 0.5 | 73.7 ± 0.5 | 77.1 ± 0.5 | 58.6 ± 0.5 | | | | |
| | Motorcycle | 78.0 ± 0.5 | 42.2 ± 0.5 | 52.9 ± 0.5 | 28.9 ± 0.5 | | | | |
| | Lamp | 70.4 ± 0.6 | 37.1 ± 0.6 | 46.1 ± 0.6 | 21.4 ± 0.6 | | | | |
| | Truck | 76.7 ± 0.5 | 61.7 ± 0.5 | 70.0 ± 0.5 | 45.4 ± 0.5 | | | | |
| Dual-stream YOLOv11n | All | 82.6 ± 0.3 | 57.5 ± 0.3 | 66.5 ± 0.3 | 42.6 ± 0.3 | 3.79 | 9.31 | 344.8 ± 3.0 | 2.9 ± 0.02 |
| | Person | 82.9 ± 0.3 | 66.9 ± 0.3 | 77.2 ± 0.3 | 45.4 ± 0.3 | | | | |
| | Car | 88.6 ± 0.3 | 74.6 ± 0.3 | 83.9 ± 0.3 | 57.4 ± 0.3 | | | | |
| | Bus | 84.3 ± 0.4 | 74.4 ± 0.4 | 78.3 ± 0.4 | 59.3 ± 0.4 | | | | |
| | Motorcycle | 81.2 ± 0.4 | 37.9 ± 0.4 | 47.6 ± 0.4 | 28.6 ± 0.4 | | | | |
| | Lamp | 79.8 ± 0.5 | 31.1 ± 0.5 | 42.8 ± 0.5 | 19.7 ± 0.5 | | | | |
| | Truck | 78.7 ± 0.4 | 60.3 ± 0.4 | 69.3 ± 0.4 | 45.4 ± 0.4 | | | | |
| MBNet | All | 80.0 ± 0.4 | 61.2 ± 0.4 | 68.2 ± 0.4 | 43.0 ± 0.4 | 6.8 | 16.2 | 285.7 ± 2.7 | 3.5 ± 0.03 |
| | Person | 80.5 ± 0.4 | 70.7 ± 0.4 | 77.9 ± 0.4 | 45.5 ± 0.4 | | | | |
| | Car | 85.8 ± 0.4 | 75.6 ± 0.4 | 83.6 ± 0.4 | 56.5 ± 0.4 | | | | |
| | Bus | 83.4 ± 0.5 | 73.9 ± 0.5 | 77.5 ± 0.5 | 59.0 ± 0.5 | | | | |
| | Motorcycle | 76.1 ± 0.5 | 44.5 ± 0.5 | 51.6 ± 0.5 | 29.5 ± 0.5 | | | | |
| | Lamp | 73.9 ± 0.6 | 35.6 ± 0.6 | 44.8 ± 0.6 | 20.4 ± 0.6 | | | | |
| | Truck | 75.5 ± 0.5 | 63.8 ± 0.5 | 71.2 ± 0.5 | 45.6 ± 0.5 | | | | |
| ProbEn | All | 80.8 ± 0.3 | 61.7 ± 0.3 | 68.8 ± 0.3 | 43.3 ± 0.3 | 7.5 | 17.4 | 263.2 ± 2.4 | 3.8 ± 0.03 |
| | Person | 81.2 ± 0.3 | 71.0 ± 0.3 | 78.2 ± 0.3 | 45.8 ± 0.3 | | | | |
| | Car | 86.6 ± 0.3 | 75.9 ± 0.3 | 84.0 ± 0.3 | 56.8 ± 0.3 | | | | |
| | Bus | 83.9 ± 0.4 | 74.5 ± 0.4 | 78.2 ± 0.4 | 59.5 ± 0.4 | | | | |
| | Motorcycle | 77.5 ± 0.4 | 45.0 ± 0.4 | 52.1 ± 0.4 | 30.0 ± 0.4 | | | | |
| | Lamp | 74.8 ± 0.5 | 36.2 ± 0.5 | 45.5 ± 0.5 | 20.8 ± 0.5 | | | | |
| | Truck | 76.0 ± 0.4 | 64.6 ± 0.4 | 71.7 ± 0.4 | 46.0 ± 0.4 | | | | |
| Dual-stream RT-DETR | All | 79.2 ± 0.5 | 60.5 ± 0.5 | 67.2 ± 0.5 | 42.2 ± 0.5 | 66.34 | 194.03 | 69.9 ± 1.0 | 14.3 ± 0.15 |
| | Person | 79.8 ± 0.5 | 69.9 ± 0.5 | 77.0 ± 0.5 | 44.8 ± 0.5 | | | | |
| | Car | 84.9 ± 0.5 | 74.5 ± 0.5 | 82.6 ± 0.5 | 55.6 ± 0.5 | | | | |
| | Bus | 82.6 ± 0.6 | 73.2 ± 0.6 | 76.6 ± 0.6 | 58.1 ± 0.6 | | | | |
| | Motorcycle | 74.8 ± 0.6 | 43.3 ± 0.6 | 50.5 ± 0.6 | 28.4 ± 0.6 | | | | |
| | Lamp | 72.9 ± 0.7 | 34.5 ± 0.7 | 43.9 ± 0.7 | 19.7 ± 0.7 | | | | |
| | Truck | 74.5 ± 0.6 | 62.9 ± 0.6 | 70.2 ± 0.6 | 44.9 ± 0.6 | | | | |
| Proposed | All | 79.2 ± 0.3 | 61.8 ± 0.3 | 69.0 ± 0.3 * | 43.8 ± 0.3 | 3.44 | 7.58 | 384.6 ± 3.2 | 2.6 ± 0.02 |
| | Person | 80.5 ± 0.3 | 71.2 ± 0.3 | 78.8 ± 0.3 * | 46.0 ± 0.3 | | | | |
| | Car | 86.2 ± 0.3 | 76.5 ± 0.3 | 84.1 ± 0.3 | 57.0 ± 0.3 | | | | |
| | Bus | 85.0 ± 0.4 | 75.9 ± 0.4 | 80.2 ± 0.4 * | 60.5 ± 0.4 | | | | |
| | Motorcycle | 74.8 ± 0.4 | 46.5 ± 0.4 | 53.5 ± 0.4 | 31.0 ± 0.4 | | | | |
| | Lamp | 75.3 ± 0.5 | 36.2 ± 0.5 | 46.0 ± 0.5 | 20.8 ± 0.5 | | | | |
| | Truck | 76.5 ± 0.4 | 65.8 ± 0.4 | 72.1 ± 0.4 * | 47.2 ± 0.4 * | | | | |
| Method | Input Size | Backend | Precision Mode | Params (M) | GFLOPs (G) | Latency (ms) | FPS |
|---|---|---|---|---|---|---|---|
| Dual-stream YOLOv5n | 640 × 640 | TensorRT | FP16 | 3.67 | 9.81 | 45.2 ± 0.3 | 22.1 ± 0.2 |
| Dual-stream YOLOv8n | 640 × 640 | TensorRT | FP16 | 4.36 | 11.34 | 52.6 ± 0.4 | 19.0 ± 0.2 |
| Dual-stream YOLOvXn | 640 × 640 | TensorRT | FP16 | 3.38 | 9.34 | 44.1 ± 0.3 | 22.7 ± 0.2 |
| Dual-stream YOLOv10n | 640 × 640 | TensorRT | FP16 | 3.73 | 11.23 | 51.3 ± 0.4 | 19.5 ± 0.2 |
| Dual-stream YOLOv11n | 640 × 640 | TensorRT | FP16 | 3.79 | 9.31 | 43.5 ± 0.3 | 23.0 ± 0.2 |
| MBNet | 640 × 640 | TensorRT | FP16 | 6.8 | 16.2 | 71.4 ± 0.5 | 14.0 ± 0.2 |
| ProbEn | 640 × 640 | TensorRT | FP16 | 7.5 | 17.4 | 76.9 ± 0.6 | 13.0 ± 0.2 |
| Dual-stream RT-DETR | 640 × 640 | TensorRT | FP16 | 66.34 | 194.03 | N/A | N/A |
| Proposed | 640 × 640 | TensorRT | FP16 | 3.44 | 7.58 | 38.5 ± 0.3 | 26.0 ± 0.2 |
| Method | GFM | SPD-Conv | Ghost-Neck | Params (M) | GFLOPs (G) | FLIR mAP@50 (%) | FLIR mAP@50:95 (%) | M3FD mAP@50 (%) | M3FD mAP@50:95 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Dual-stream baseline | | | | 3.79 | 9.31 | 81.9 ± 0.2 | 45.6 ± 0.2 | 66.5 ± 0.3 | 42.6 ± 0.3 |
| +GFM | ✓ | | | 3.82 | 9.38 | 82.3 ± 0.2 | 45.9 ± 0.2 | 68.0 ± 0.3 | 43.2 ± 0.3 |
| +SPD-Conv | ✓ | ✓ | | 3.75 | 9.12 | 83.1 ± 0.2 | 46.8 ± 0.2 | 69.4 ± 0.3 | 44.3 ± 0.3 |
| Proposed | ✓ | ✓ | ✓ | 3.44 | 7.58 | 82.8 ± 0.2 | 46.5 ± 0.2 | 69.0 ± 0.3 | 43.8 ± 0.3 |
| Model Variant | Fusion Strategy | Params (M) | GFLOPs (G) | FLIR mAP@50 (%) | FLIR mAP@50:95 (%) | M3FD mAP@50 (%) | M3FD mAP@50:95 (%) |
|---|---|---|---|---|---|---|---|
| Dual-stream baseline | Concatenation | 3.79 | 9.31 | 82.2 | 45.9 | 66.8 | 42.9 |
| Dual-stream baseline + GFM | Global-context gating | 3.82 | 9.38 | 82.5 | 46.1 | 68.2 | 43.4 |
| Variant | FLIR AP_s | FLIR AP_m | FLIR AP_l | M3FD AP_s | M3FD AP_m | M3FD AP_l |
|---|---|---|---|---|---|---|
| w/o SPD-Conv | 31.8 | 55.9 | 60.7 | 16.3 | 54.1 | 73.4 |
| w/SPD-Conv | 34.2 | 56.5 | 61.2 | 18.5 | 54.8 | 73.9 |
| Gain | +2.4 | +0.6 | +0.5 | +2.2 | +0.7 | +0.5 |
| Method | FLIR AP_s | FLIR AP_m | FLIR AP_l | M3FD AP_s | M3FD AP_m | M3FD AP_l |
|---|---|---|---|---|---|---|
| YOLOv5n | 33.6 ± 0.3 | 51.5 ± 0.4 | 53.4 ± 0.5 | 17.1 ± 0.2 | 53.2 ± 0.3 | 73.0 ± 0.4 |
| YOLOv8n | 33.2 ± 0.3 | 57.6 ± 0.4 | 69.9 ± 0.5 | 16.6 ± 0.2 | 54.1 ± 0.3 | 71.4 ± 0.4 |
| YOLOvXn | 32.8 ± 0.3 | 56.8 ± 0.4 | 59.5 ± 0.5 | 16.6 ± 0.2 | 53.7 ± 0.3 | 72.8 ± 0.4 |
| YOLOv10n | 33.0 ± 0.3 | 54.5 ± 0.4 | 65.7 ± 0.5 | 17.4 ± 0.2 | 53.3 ± 0.3 | 72.3 ± 0.4 |
| YOLOv11n | 32.3 ± 0.3 | 55.2 ± 0.4 | 59.6 ± 0.5 | 17.4 ± 0.2 | 53.2 ± 0.3 | 72.2 ± 0.4 |
| MBNet | 32.2 ± 0.3 | 53.7 ± 0.4 | 58.2 ± 0.5 | 17.3 ± 0.2 | 52.7 ± 0.3 | 71.6 ± 0.4 |
| ProbEn | 32.6 ± 0.3 | 55.0 ± 0.4 | 59.5 ± 0.5 | 17.6 ± 0.2 | 53.0 ± 0.3 | 72.1 ± 0.4 |
| RT-DETR | 30.6 ± 0.4 | 51.5 ± 0.5 | 53.4 ± 0.6 | 16.6 ± 0.3 | 45.4 ± 0.4 | 63.0 ± 0.5 |
| Proposed | 34.2 ± 0.2 | 56.5 ± 0.3 | 61.2 ± 0.4 | 18.5 ± 0.2 | 54.8 ± 0.3 | 73.9 ± 0.4 |
