Improved CenterNet-Based Multimodal Object Detection for Low-Light and Complex Environments
Abstract
1. Introduction
1. A multimodal object detection framework based on an improved CenterNet is proposed. It incorporates Haar wavelet detail priors extracted from infrared images into the network, establishing a dual-source input mechanism driven by the semantic information of the fused images and the texture structure of the infrared images. This enhances the input representation in low-light and complex scenes.
2. A Feature Fusion Attention (FFA) module is designed to spatially align the semantic features of the fused images with the multi-sub-band detail features of the infrared wavelet decomposition and to fuse them at an early stage. SimAM-based lightweight attention enhancement and a residual stable injection strategy are then applied to achieve efficient cross-modal expression of semantic and texture information.
3. A Heatmap-Guided Detection Head (HGDH) module is designed that explicitly selects target-related features by generating spatial probability masks from the predicted heatmaps. Combined with contextual modeling and lightweight attention enhancement, this module improves target localization accuracy and detection robustness in complex scenes.
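The first contribution relies on Haar wavelet detail sub-bands computed from the infrared image. As a rough illustration only, and not the authors' implementation, a single-level 2D Haar decomposition can be sketched in NumPy as follows. The function name `haar_dwt2`, the orthonormal 1/2 scaling, and the LL/LH/HL/HH labels are assumptions made for this sketch; sub-band naming conventions differ between libraries.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar transform of a 2D array (hypothetical helper).

    Returns four sub-bands (ll, lh, hl, hh), each half the input size.
    """
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2:
        raise ValueError("expected a 2D (grayscale) image")
    h, w = img.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")

    # The four pixels of every non-overlapping 2x2 block.
    a = img[0::2, 0::2]  # top-left
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right

    ll = (a + b + c + d) / 2.0  # low-pass approximation
    lh = (a + b - c - d) / 2.0  # top row minus bottom row (horizontal edges)
    hl = (a - b + c - d) / 2.0  # left column minus right column (vertical edges)
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir = rng.random((8, 8))            # stand-in for an infrared image
    ll, lh, hl, hh = haar_dwt2(ir)
    details = np.stack([lh, hl, hh])   # detail prior with shape (3, 4, 4)
    print(details.shape)
```

With this scaling the transform is orthonormal, so the summed energy of the four sub-bands equals the energy of the input. Stacking the three detail sub-bands gives one possible form of the "multi-sub-band detail features" mentioned above.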
2. Related Work
2.1. Object Detection with Visible and Infrared Images
2.2. Multimodal Object Detection Based on Fused Images
2.3. CenterNet Detection Models with Dual-Modal Inputs
3. Materials and Methods
3.1. Overall Network Architecture
3.2. Feature Fusion Attention (FFA) Module
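The Introduction describes the FFA module as combining SimAM-based attention with a residual injection of infrared detail features into the semantic features. The following NumPy sketch illustrates those two ingredients under stated assumptions; it is not the authors' implementation. `simam` follows the published parameter-free SimAM formulation, while `fuse_with_residual`, the 1x1 projection matrix `proj`, and the injection scale `alpha` are hypothetical choices introduced here. The detail features are assumed to be already resized to the spatial resolution of the semantic features.

```python
import numpy as np


def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM attention on a (C, H, W) feature map."""
    x = np.asarray(x, dtype=np.float64)
    n = x.shape[1] * x.shape[2] - 1
    if n < 1:
        raise ValueError("feature map needs more than one spatial position")
    # Squared deviation of each activation from its channel mean.
    d = (x - x.mean(axis=(1, 2), keepdims=True)) ** 2
    # Channel-wise variance estimate.
    v = d.sum(axis=(1, 2), keepdims=True) / n
    # Inverse of the minimal energy: larger for more distinctive activations.
    energy_inv = d / (4.0 * (v + e_lambda)) + 0.5
    # Sigmoid gating of the input.
    return x * (1.0 / (1.0 + np.exp(-energy_inv)))


def fuse_with_residual(semantic, detail, proj, alpha=0.1):
    """Inject attended detail features into semantic features (assumed form).

    semantic: (C, H, W) features from the fused image
    detail:   (D, H, W) wavelet detail features, already spatially aligned
    proj:     (C, D) weights acting as a 1x1 convolution (hypothetical)
    alpha:    scale of the residual injection (hypothetical)
    """
    projected = np.einsum("cd,dhw->chw", proj, detail)
    return semantic + alpha * simam(projected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    semantic = rng.normal(size=(16, 32, 32))
    detail = rng.normal(size=(3, 32, 32))
    proj = rng.normal(size=(16, 3)) * 0.1
    fused = fuse_with_residual(semantic, detail, proj)
    print(fused.shape)
```

Because the detail branch enters only through an additive term, the semantic features pass through unchanged when the detail input is zero, which is one common way to keep early fusion stable.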
3.3. Heatmap-Guided Detection Head (HGDH)
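According to the Introduction, the HGDH module turns the predicted heatmaps into spatial probability masks that select target-related features. A minimal NumPy sketch of that idea is given below, assuming a sigmoid over per-class center logits, a maximum over classes to obtain one mask, and a residual form of the gating. The function name `heatmap_guided_features` and these specific choices are assumptions for illustration, not the authors' design.

```python
import numpy as np


def heatmap_guided_features(features, heatmap_logits):
    """Reweight features with a mask derived from predicted heatmaps.

    features:       (C, H, W) feature map entering the detection head
    heatmap_logits: (K, H, W) per-class center-point logits
    Returns the reweighted features and the (1, H, W) probability mask.
    """
    features = np.asarray(features, dtype=np.float64)
    logits = np.asarray(heatmap_logits, dtype=np.float64)
    if features.shape[1:] != logits.shape[1:]:
        raise ValueError("features and heatmaps must share spatial size")

    # Per-class probability that a location is an object center.
    prob = 1.0 / (1.0 + np.exp(-logits))
    # Collapse classes: a location matters if any class is likely there.
    mask = prob.max(axis=0, keepdims=True)
    # Residual gating (assumed): keep the original features and
    # amplify them where the mask is high.
    return features + features * mask, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 16, 16))
    logits = rng.normal(size=(3, 16, 16))
    out, mask = heatmap_guided_features(feats, logits)
    print(out.shape, mask.shape)
```

With this form, locations where the mask is near zero keep their original features, while locations with high center probability are scaled by up to a factor of two.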
3.4. Hourglass Backbone
3.5. Collaborative Mechanism of the Proposed Modules
4. Results
4.1. Datasets
4.1.1. RH-25 Dataset
4.1.2. Screening Criteria
4.2. Experimental Setup
4.2.1. Experimental Environment and Parameter Configuration
4.2.2. Evaluation Metrics
4.3. Ablation Study
4.3.1. FFA Module
4.3.2. HGDH Module
4.4. Comparison Experiments
4.5. Generalization Experiments
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CenterNet | Objects-as-Points Detector |
| FFA | Feature Fusion Attention |
| HGDH | Heatmap-Guided Detection Head |
References
1. Samu, J.; Yang, C. Airport ground-based aerial object surveillance technologies for enhanced safety: A systematic review. Drones 2025, 10, 22.
2. Li, G.; Wang, Y.; He, B.; Pang, T.; Gao, M. Low-light multimodal object detection: A survey. Comput. Sci. Rev. 2025, 58, 100804.
3. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Electr Network; IEEE: Piscataway, NJ, USA, 2021; pp. 3489–3497.
4. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2022; pp. 5792–5801.
5. Hou, Z.; Yang, C.; Sun, Y.; Ma, S.; Yang, X.; Fan, J. An object detection algorithm based on infrared-visible dual modal feature fusion. Infrared Phys. Technol. 2024, 137, 105107.
6. Xiao, X.; Wang, B.; Miao, L.; Li, L.; Zhou, Z.; Ma, J.; Dong, D. Infrared and visible image object detection via focused feature enhancement and cascaded semantic extension. Remote Sens. 2021, 13, 2538.
7. Hu, Z.; Jing, Y.; Wu, G. Decision-level fusion detection method of visible and infrared images under low light conditions. EURASIP J. Adv. Signal Process. 2023, 2023, 38.
8. Cao, Y.; Luo, X.; Yang, J.; Cao, Y.; Yang, M.Y. Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection. Inf. Fusion 2022, 88, 1–11.
9. Liu, X.; Huo, H.; Li, J.; Pang, S.; Zheng, B. A semantic-driven coupled network for infrared and visible image fusion. Inf. Fusion 2024, 108, 102352.
10. Li, X.; Qian, Y.; Guo, R.; Ao, N. I-CenterNet: Road infrared target detection based on improved CenterNet. IET Image Process. 2023, 17, 57–66.
11. Tan, W.; Geng, B.; Bai, X. A study on infrared-visible fusion multimodal object detection algorithm based on cross-modal information bottleneck and minimum redundancy transformation. Sci. Rep. 2026, 16, 12991.
12. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-enhanced CenterNet for small object detection in remote sensing images. Remote Sens. 2022, 14, 5488.
13. Wang, R.; Zhou, Z.; Li, S.; Zhang, Z. Advances and challenges in infrared-visible image fusion: A comprehensive review of techniques and applications. Artif. Intell. Rev. 2026, 59, 18.
14. Liu, J.; Wu, G.; Liu, Z.; Wang, D.; Jiang, Z.; Ma, L.; Zhong, W.; Fan, X.; Liu, R. Infrared and visible image fusion: From data compatibility to task adaption. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2349–2369.
15. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
16. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning; PMLR 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 11863–11874.
17. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 483–499.
18. Sun, C.; Chen, Y.; Qiu, X.; Li, R.; You, L. MRD-YOLO: A multispectral object detection algorithm for complex road scenes. Sensors 2024, 24, 3222.
19. Thaker, K.; Chennupati, S.; Rawashdeh, N.; Rawashdeh, S.A. Multispectral deep neural network fusion method for low-light object detection. J. Imaging 2023, 10, 12.
20. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
22. Ren, W.; Luo, L.; Ren, J. A two-stage approach for infrared and visible image fusion and segmentation. Appl. Sci. 2025, 15, 10698.
23. Yu, H.; Gao, J.; Zhou, S.; Li, C.; Shi, J.; Guo, F. Cross-modality target detection using infrared and visible image fusion for robust objection recognition. Comput. Electr. Eng. 2025, 123, 110133.
24. An, R.; Guo, Y.; Wang, Z.; Zhao, Z.; Deng, C.; Li, J.M. RT-DETR: A robust visible-infrared object detector with adaptive cross-modal feature fusion. Infrared Phys. Technol. 2025, 153, 106346.
25. Li, N.; Huang, S.; Wei, D. Infrared small target detection algorithm based on ISTD-CenterNet. Comput. Mater. Contin. 2023, 77, 3.
26. Zhou, J.; Chen, Z.; Huang, X. Weakly perceived object detection based on an improved CenterNet. Math. Biosci. Eng. 2022, 19, 12833–12851.
27. Wu, D.; Wang, Y.; Wang, H.; Wang, F.; Gao, G. DCFNet: Infrared and visible image fusion network based on discrete wavelet transform and convolutional neural network. Sensors 2024, 24, 4065.
28. Sun, J.; Wei, M.; Wang, J.; Zhu, M.; Lin, H.; Nie, H.; Deng, X. CenterADNet: Infrared video target detection based on central point regression. Sensors 2024, 24, 1778.
29. Liu, R.; Liu, Y.; Wang, H.; Du, S. WaveFusionNet: Infrared and visible image fusion based on multi-scale feature encoder–decoder and discrete wavelet decomposition. Opt. Commun. 2024, 573, 131024.
30. Dong, A.; Wang, L.; Liu, J.; Lv, G.; Zhao, G.; Cheng, J. MFIFusion: An infrared and visible image enhanced fusion network based on multi-level feature injection. Pattern Recognit. 2024, 152, 110445.
31. Ju, L.; Kittler, J.; Rana, M.A.; Yang, W.; Feng, Z. Keep an eye on faces: Robust face detection with heatmap-assisted spatial attention and scale-aware layer attention. Pattern Recognit. 2023, 140, 109553.
32. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 3–19.
33. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42.
34. Guo, J.; Ma, J.; García-Fernández, Á.F.; Zhang, Y.; Liang, H. A survey on image enhancement for low-light images. Heliyon 2023, 9, e14558.
35. Guan, F.; Fang, Z.; Wang, L.; Zhang, X.; Zhong, H.; Huang, H. Modelling people’s perceived scene complexity of real-world environments using street-view panoramas and open geodata. ISPRS J. Photogramm. Remote Sens. 2022, 186, 315–331.
36. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
37. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437.
38. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
39. Hwang, S.; Han, D.; Jeon, M. Multispectral detection transformer with infrared-centric feature fusion. arXiv 2025, arXiv:2505.15137.
40. Yaseen, M. What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arXiv 2024, arXiv:2408.15857.
41. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
42. Wu, W.; Zhang, X.; Yin, H.; Dai, S.; Zhang, H.; Zhang, Y. FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection. arXiv 2025, arXiv:2511.10046.

| Baseline | Hourglass (Two-Stack) | FFA | HGDH | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 | FPS |
|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 99.44 | 68.89 | 93.12 | 0.81 | 12.11 |
| ✓ | ✓ | | | 99.27 | 74.34 | 93.89 | 0.85 | 12.76 |
| ✓ | | ✓ | | 99.07 | 76.80 | 95.17 | 0.87 | 13.31 |
| ✓ | | | ✓ | 99.82 | 79.12 | 96.29 | 0.88 | 13.75 |
| ✓ | ✓ | ✓ | ✓ | 98.42 | 87.05 | 96.63 | 0.93 | 15.24 |

| Model | Modality | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|
| YOLOv8n [40] | Fused + IR | 95.19 | 89.24 | 88.28 | 0.92 | 3.01 | 4.10 | 25.20 |
| YOLOv11n [41] | Fused + IR | 95.46 | 90.08 | 89.02 | 0.93 | 2.59 | 3.22 | 20.14 |
| Faster R-CNN [20] | Fused + IR | 68.39 | 95.42 | 94.64 | 0.80 | 28.28 | 257.75 | 21.12 |
| RT-DETR [24] | Fused + IR | 88.11 | 93.52 | 91.38 | 0.91 | 32.81 | 54.00 | 28.43 |
| FreDFT [42] | Fused + IR | 56.78 | 91.67 | 89.16 | 0.70 | 23.52 | 36.72 | 22.47 |
| IC-Fusion [39] | Fused + IR | 55.36 | 92.03 | 89.93 | 0.69 | 26.24 | 113.62 | 15.12 |
| Baseline [15] | Fused + IR | 99.44 | 68.89 | 93.12 | 0.81 | 95.43 | 251.43 | 12.11 |
| Ours | Fused + IR | 98.42 | 87.05 | 96.63 | 0.93 | 192.50 | 528.28 | 15.24 |

| Modality | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 | FPS |
|---|---|---|---|---|---|
| Fused | 99.17 | 69.32 | 91.78 | 0.82 | 12.03 |
| IR | 99.05 | 72.55 | 91.65 | 0.84 | 12.14 |
| Fused + IR | 99.44 | 68.89 | 93.12 | 0.81 | 12.11 |

| Model | Modality | Precision (%) | Recall (%) | mAP@0.5 (%) | F1 |
|---|---|---|---|---|---|
| YOLOv11n [41] | Fused + IR | 90.31 | 71.14 | 71.69 | 0.80 |
| IC-Fusion [39] | Fused + IR | 77.58 | 68.43 | 70.74 | 0.73 |
| Baseline [15] | Fused + IR | 89.59 | 56.63 | 70.90 | 0.69 |
| Ours | Fused + IR | 90.15 | 70.46 | 73.12 | 0.79 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yao, Z.; Xu, H.; Zhang, H.; Tu, X.; Yin, J. Improved CenterNet-Based Multimodal Object Detection for Low-Light and Complex Environments. Sensors 2026, 26, 3735. https://doi.org/10.3390/s26123735

