Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention
Abstract
1. Introduction
- To achieve effective feature interaction and adaptive channel selection, we design a dynamic channel adjustment strategy based on channel-exchange principles, enhancing the fusion of complementary information from different modalities.
- To capture key features across scales, we introduce a multi-scale activated attention mechanism that strengthens the model’s focus on critical features from each modality and improves detection accuracy.
- Extensive experimental results on multiple datasets show that our approach outperforms state-of-the-art methods in both image fusion and object detection tasks, demonstrating superior performance in complex scenarios.
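The dynamic channel adjustment builds on the channel-exchange idea of Wang et al. [29], in which channels of one modality whose batch-norm scaling factors fall below a threshold are replaced by the corresponding channels of the other modality. The following is a minimal NumPy sketch of that exchange step only; the function name, the use of raw scaling factors as the importance score, and the threshold value are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def channel_exchange(x_rgb, x_ir, gamma_rgb, gamma_ir, threshold=0.02):
    """Swap low-importance channels between two modality streams.

    x_rgb, x_ir: feature maps of shape (C, H, W).
    gamma_rgb, gamma_ir: per-channel batch-norm scaling factors,
    used here as a proxy for channel importance (assumption).
    A channel whose own gamma is below `threshold` is replaced by
    the other modality's channel at the same index.
    """
    swap_rgb = (gamma_rgb < threshold)[:, None, None]  # broadcast over H, W
    swap_ir = (gamma_ir < threshold)[:, None, None]
    out_rgb = np.where(swap_rgb, x_ir, x_rgb)
    out_ir = np.where(swap_ir, x_rgb, x_ir)
    return out_rgb, out_ir

# Toy example: 4 channels; channel 2 of the RGB stream is "unimportant".
rgb = np.ones((4, 2, 2))
ir = np.full((4, 2, 2), 5.0)
g_rgb = np.array([0.5, 0.3, 0.001, 0.4])
g_ir = np.array([0.2, 0.3, 0.4, 0.5])
fused_rgb, fused_ir = channel_exchange(rgb, ir, g_rgb, g_ir)
# channel 2 of the RGB stream now carries IR features (all 5.0);
# the IR stream is unchanged because none of its gammas fall below 0.02
```

In the full model this exchange would be applied inside the backbone at each fusion stage, with the BN scaling factors learned jointly with a sparsity penalty, as in [29].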
2. Related Work
3. Methodology
3.1. Architecture Overview
3.2. Dynamic Channel Adjustment
3.3. Multi-Scale Activated Attention Mechanism
3.4. FPN, Detector and Loss Function
4. Experiment
4.1. Experiment Setup
4.2. Implementation
4.3. Comparison Experiment and Ablation
- Faster R-CNN [31]: A domain-adaptive object detection method based on Faster R-CNN that mitigates image-level and instance-level shifts using adversarial training and consistency regularization.
- GNN-based [32]: A joint multi-object tracking (MOT) approach based on Graph Neural Networks (GNNs) that simultaneously optimizes object detection and data association by modeling spatial and temporal relationships.
- CDDFuse [33]: A multimodal image fusion method that employs a correlation-driven dual-branch architecture combining Transformer and CNN components.
- ProbEn [34]: A multimodal object detection method based on probabilistic ensembling to effectively integrate information from multiple sensor modalities.
- DETR [25]: An end-to-end object detection framework that uses a transformer encoder–decoder architecture with bipartite matching loss to directly predict object sets.
- SwinF [35]: A feature fusion network based on Swin Transformer, designed to enhance object detection performance while reducing computational complexity through hierarchical windowing operations.
- TransFusion [36]: A unified multimodal model that combines next-token prediction and diffusion in a single transformer to jointly model discrete text and continuous image data.
- MMA-UNet [37]: A multimodal asymmetric UNet designed for balanced feature fusion by employing specialized encoders and cross-scale fusion strategies.
Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
---|---|---|---|---|---|
Faster R-CNN [31] | 0.236 | 0.289 | 10.346 | 0.412 | 4.366 |
GNN-based [32] | 0.250 | 0.239 | 10.899 | 0.475 | 5.342 |
CDDFuse [33] | 0.386 | 0.363 | 12.803 | 0.645 | 6.301 |
ProbEn [34] | 0.296 | 0.332 | 11.996 | 0.338 | 3.865 |
MMA-UNet [37] | 0.446 | 0.478 | 13.634 | 0.702 | 6.398 |
Ours | 0.473 | 0.513 | 13.834 | 0.731 | 6.621 |
Method | VIF | SSIM | PSNR (dB) | Qabf | EN |
---|---|---|---|---|---|
Faster R-CNN [31] | 0.436 | 0.589 | 14.546 | 0.675 | 6.376 |
GNN-based [32] | 0.421 | 0.633 | 15.112 | 0.702 | 7.381 |
CDDFuse [33] | 0.530 | 0.630 | 16.103 | 0.736 | 7.653 |
ProbEn [34] | 0.492 | 0.573 | 15.102 | 0.593 | 6.784 |
SwinF [35] | 0.502 | 0.641 | 16.330 | 0.732 | 7.801 |
MMA-UNet [37] | 0.512 | 0.673 | 16.381 | 0.742 | 7.931 |
Ours | 0.506 | 0.692 | 17.031 | 0.781 | 7.832 |
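The fusion metrics reported above have standard definitions; for instance, PSNR measures pixel-level fidelity in dB and EN the information content of the fused image. A generic sketch of these two (not the authors' evaluation code; 8-bit intensity range assumed):

```python
import numpy as np

def psnr(fused, reference, max_val=255.0):
    """Peak signal-to-noise ratio in dB between fused and reference images."""
    mse = np.mean((fused.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def entropy(img):
    """Shannon entropy (EN) of an 8-bit image's intensity histogram, in bits."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # ignore empty bins: 0 * log2(0) is defined as 0
    return float(-np.sum(p * np.log2(p)))

# images differing by a constant offset of 10 give MSE = 100
val = psnr(np.full((8, 8), 10.0), np.zeros((8, 8)))  # about 28.1 dB
# a perfectly uniform 256-level histogram attains the 8-bit maximum
en_max = entropy(np.arange(256, dtype=np.uint8).repeat(4))  # 8.0 bits
```

VIF, SSIM, and Qabf likewise follow their standard formulations in the fusion literature and are typically computed with off-the-shelf implementations.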
Method | AP50 (Person) | AP50 (Car) | AP75 (Person) | AP75 (Car) | mAP (Person) | mAP (Car) |
---|---|---|---|---|---|---|
Faster R-CNN [31] | 0.831 | 0.853 | 0.804 | 0.736 | 0.769 | 0.683 |
GNN-based [32] | 0.830 | 0.901 | 0.763 | 0.862 | 0.736 | 0.801 |
CDDFuse [33] | 0.932 | 0.916 | 0.902 | 0.910 | 0.897 | 0.864 |
ProbEn [34] | 0.906 | 0.913 | 0.842 | 0.869 | 0.811 | 0.844 |
DETR [25] | 0.892 | 0.927 | 0.889 | 0.900 | 0.883 | 0.812 |
SwinF [35] | 0.896 | 0.945 | 0.873 | 0.902 | 0.886 | 0.872 |
TransFusion [36] | 0.915 | 0.928 | 0.903 | 0.901 | 0.891 | 0.884 |
MMA-UNet [37] | 0.926 | 0.903 | 0.913 | 0.910 | 0.892 | 0.873 |
Ours | 0.941 | 0.939 | 0.933 | 0.920 | 0.881 | 0.894 |
w/o DCA | 0.915 | 0.883 | 0.904 | 0.873 | 0.849 | 0.831 |
w/o MAAM | 0.908 | 0.890 | 0.891 | 0.846 | 0.833 | 0.825 |
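AP50 and AP75 denote average precision at IoU thresholds of 0.5 and 0.75, respectively: a detection is counted as a true positive only if its box overlaps a ground-truth box by at least that IoU. The underlying IoU computation is standard (boxes as `(x1, y1, x2, y2)` corner coordinates; a generic sketch, not tied to any particular codebase):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# identical boxes overlap fully; a half-shifted box overlaps by 1/3
full = iou((0, 0, 10, 10), (0, 0, 10, 10))   # 1.0
half = iou((0, 0, 10, 10), (5, 0, 15, 10))   # 1/3
```

AP then averages precision over the recall curve at the chosen threshold, and mAP averages AP over classes (or, COCO-style, additionally over IoU thresholds).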
Method | Sensor | Lathe | Forklift | Filter |
---|---|---|---|---|
Faster R-CNN [31] | 0.734 | 0.883 | 0.801 | 0.884 |
GNN-based [32] | 0.702 | 0.912 | 0.814 | 0.861 |
CDDFuse [33] | 0.816 | 0.943 | 0.902 | 0.933 |
ProbEn [34] | 0.819 | 0.952 | 0.897 | 0.947 |
DETR [25] | 0.821 | 0.973 | 0.936 | 0.914 |
SwinF [35] | 0.820 | 0.978 | 0.933 | 0.961 |
TransFusion [36] | 0.812 | 0.956 | 0.912 | 0.938 |
MMA-UNet [37] | 0.807 | 0.992 | 0.943 | 0.970 |
Ours | 0.843 | 0.986 | 0.970 | 0.976 |
w/o DCA | 0.801 | 0.953 | 0.919 | 0.902 |
w/o MAAM | 0.783 | 0.937 | 0.904 | 0.934 |
4.4. Ablation
4.5. Computational Complexity Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Limitations and Future Work
References
- Zheng, Y.; Blasch, E.; Liu, Z. Multispectral Image Fusion and Colorization; SPIE Press: Bellingham, WA, USA, 2018; Volume 481.
- Ouardirhi, Z.; Mahmoudi, S.A.; Zbakh, M. Enhancing object detection in smart video surveillance: A survey of occlusion-handling approaches. Electronics 2024, 13, 541.
- Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; Wang, C. Multimodal industrial anomaly detection via hybrid fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8032–8041.
- Baumgartner, M.; Jäger, P.F.; Isensee, F.; Maier-Hein, K.H. nnDetection: A self-configuring method for medical object detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; pp. 530–539.
- Chauhan, R.; Ghanshala, K.K.; Joshi, R. Convolutional neural network (CNN) for image detection and recognition. In Proceedings of the 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Chandana, R.; Ramachandra, A. Real time object detection system with YOLO and CNN models: A review. arXiv 2022, arXiv:2208.00773.
- Dimitri, G.M.; Spasov, S.; Duggento, A.; Passamonti, L.; Lió, P.; Toschi, N. Multimodal and multicontrast image fusion via deep generative models. Inf. Fusion 2022, 88, 146–160.
- Liu, L.; Muelly, M.; Deng, J.; Pfister, T.; Li, L.J. Generative modeling for small-data object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6073–6081.
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
- Beal, J.; Kim, E.; Tzeng, E.; Park, D.H.; Zhai, A.; Kislyuk, D. Toward transformer-based object detection. arXiv 2020, arXiv:2012.09958.
- Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 72–80.
- Liu, Y. Cross-Modal Attention for Robust Object Detection. Comput. Vis. Image Underst. 2024, 189, 103305.
- Li, X.; Liu, J.; Tang, Z.; Han, B.; Wu, Z. MEDMCN: A novel multi-modal EfficientDet with multi-scale CapsNet for object detection. J. Supercomput. 2024, 80, 12863–12890.
- Zhan, Y.; Zeng, Z.; Liu, H.; Tan, X.; Tian, Y. MambaSOD: Dual Mamba-driven cross-modal fusion network for RGB-D salient object detection. Neurocomputing 2025, 631, 129718.
- Liu, S.; Liu, Z. Multi-channel CNN-based object detection for enhanced situation awareness. arXiv 2017, arXiv:1712.00075.
- Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Zhang, J. Bayesian fusion for infrared and visible images. Signal Process. 2020, 177, 107734.
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645.
- Pedersoli, M.; Gonzàlez, J.; Hu, X.; Roca, X. Toward real-time pedestrian detection based on a deformable template model. IEEE Trans. Intell. Transp. Syst. 2013, 15, 355–364.
- Yan, J.; Lei, Z.; Wen, L.; Li, S.Z. The fastest deformable part model for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2497–2504.
- Zhou, H.; Yu, G. Research on pedestrian detection technology based on the SVM classifier trained by HOG and LTP features. Future Gener. Comput. Syst. 2021, 125, 604–615.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Zhang, Z. Transformer-Based Fusion for Multi-Sensor Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 1391–1403.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Sriharipriya, K.C. Enhanced pothole detection system using YOLOX algorithm. Auton. Intell. Syst. 2022, 2, 22.
- Wang, Y.; Huang, W.; Sun, F.; Xu, T.; Rong, Y.; Huang, J. Deep multimodal fusion by channel exchanging. Adv. Neural Inf. Process. Syst. 2020, 33, 4835–4845.
- Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive Faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348.
- Wang, Y.; Kitani, K.; Weng, X. Joint object detection and multi-object tracking with graph neural networks. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13708–13715.
- Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916.
- Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling; Springer: Cham, Switzerland, 2022.
- Li, T.; Wang, H.; Li, G.; Liu, S.; Tang, L. SwinF: Swin Transformer with feature fusion in target detection. J. Phys. Conf. Ser. 2022, 2284, 012027.
- Zhou, C.; Yu, L.; Babu, A.; Tirumala, K.; Yasunaga, M.; Shamis, L.; Kahn, J.; Ma, X.; Zettlemoyer, L.; Levy, O. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model. arXiv 2024, arXiv:2408.11039.
- Huang, J.; Li, X.; Tan, T.; Li, X.; Ye, T. MMA-UNet: A Multi-Modal Asymmetric UNet Architecture for Infrared and Visible Image Fusion. arXiv 2024, arXiv:2404.17747.
Dataset | Sensors | Lathes | Forklifts | Filters | Total |
---|---|---|---|---|---|
SAIC | 590 | 367 | 465 | 267 | 1689 |
Configuration | Person mAP | Car mAP |
---|---|---|
Full model | 0.881 | 0.894 |
w/o | 0.819 (−7.0%) | 0.837 (−6.4%) |
w/o | 0.843 (−4.3%) | 0.869 (−2.8%) |
w/o | 0.847 (−3.9%) | 0.858 (−3.6%) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ye, Y.; Chen, M. Multimodal Network for Object Detection Using Channel Adjustment and Multi-Scale Attention. Appl. Sci. 2025, 15, 4298. https://doi.org/10.3390/app15084298