SC-CoSF: Self-Correcting Collaborative and Co-Training for Image Fusion and Semantic Segmentation
Abstract
1. Introduction
- We design a Self-correction and Collaboration Fusion Module (Sc-CFM). It integrates a Self-correction Long-Range Relationship Branch (Sc-LRB) to strengthen semantic modeling, a Self-correction Fine-Grained Branch (Sc-FGB) to capture multi-scale visual information, and a Dual-branch Collaborative Recalibration (DCR) mechanism that recalibrates the semantic and visual features jointly. Sc-CFM refines the semantic and visual content embedded in the latent representation and fuses the enhanced semantic guidance with the raw visual signal; this two-stage refinement suppresses feature-level noise, reinforces cross-modal alignment, and yields discriminative features that improve downstream task performance (a hedged structural sketch follows this list).
- We propose an Interactive Context Restoration Mamba Decoder and a Region-adaptive Weighted Reconstruction Decoder, which recover long-range context lost during upsampling and reduce feature redundancy in the fused image.
- We establish a coupling module that integrates image fusion and semantic segmentation in a complementary manner, so that each task benefits from the other and the advantages of both modalities are fully exploited.
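To make the collaboration idea concrete, a minimal PyTorch-style sketch of a dual-branch block with mutual recalibration is given below. It only mirrors the high-level structure described above (a long-range/semantic branch, a fine-grained/visual branch, and cross-gating in the spirit of DCR); the layer choices, channel width, gating form, and all names are illustrative assumptions rather than the authors' Sc-CFM implementation.

```python
# Illustrative sketch only: the real Sc-CFM uses self-correcting branches and a
# DCR mechanism whose details are defined in the paper; here we only mimic the
# high-level structure (two branches + mutual recalibration) with common layers.
import torch
import torch.nn as nn


class DualBranchRecalibration(nn.Module):
    """Toy stand-in for Sc-CFM: semantic branch, visual branch, cross-gating."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # "Long-range" branch: dilated convs as a cheap proxy for global context.
        self.semantic_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # "Fine-grained" branch: standard 3x3 convs for local detail.
        self.visual_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        # DCR-like step: each branch gates the other with a learned sigmoid mask.
        self.gate_sem = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_vis = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_branch(x)          # semantic-oriented features
        vis = self.visual_branch(x)            # detail-oriented features
        sem = sem * self.gate_vis(vis) + sem   # visual branch recalibrates semantics
        vis = vis * self.gate_sem(sem) + vis   # semantic branch recalibrates details
        return self.fuse(torch.cat([sem, vis], dim=1))


if __name__ == "__main__":
    feats = torch.randn(1, 64, 120, 160)             # e.g. backbone features
    print(DualBranchRecalibration(64)(feats).shape)  # torch.Size([1, 64, 120, 160])
```

In the actual module, the long-range branch performs long-range relationship modeling and both branches are self-correcting, which the dilated-convolution stand-in above does not capture; the sketch is a structural analogy only.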
2. Related Work
2.1. Multi-Modality Fusion and Segmentation
2.2. State Space Models
2.3. Multi-Task Learning
3. Methodology
3.1. Problem Formulation
3.2. Overall Architecture
3.2.1. Sc-CFM
Sc-LRB
Sc-FGB
DCR
3.2.2. ICRM
3.2.3. ReAW
3.3. Loss Function
4. Experiment
4.1. Datasets and Implementation Details
4.1.1. Datasets
4.1.2. Implementation Details
4.2. Results of Semantic Segmentation
4.3. Results of Image Fusion
4.4. Complexity Analysis
4.5. Ablation Study
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhou, W.; Lin, X.; Lei, J.; Yu, L.; Hwang, J.N. MFFENet: Multiscale Feature Fusion and Enhancement Network for RGB–Thermal Urban Road Scene Parsing. IEEE Trans. Multimed. 2022, 24, 2526–2538.
- Zhou, W.; Dong, S.; Xu, C.; Qian, Y. Edge-aware guidance fusion network for rgb–thermal scene parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 3571–3579.
- Zhao, Z.; Bai, H.; Zhu, Y.; Zhang, J.; Xu, S.; Zhang, Y.; Zhang, K.; Meng, D.; Timofte, R.; Van Gool, L. DDFM: Denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8082–8093.
- Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal infrared image colorization for nighttime driving scenes with top-down guided attention. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15808–15823.
- Wang, Y.; Miao, L.; Zhou, Z.; Zhang, L.; Qiao, Y. Infrared and visible image fusion with language-driven loss in CLIP embedding space. arXiv 2024, arXiv:2402.16267.
- Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974.
- Li, H.; Yang, Z.; Zhang, Y.; Jia, W.; Yu, Z.; Liu, Y. MulFS-CAP: Multimodal Fusion-Supervised Cross-Modality Alignment Perception for Unregistered Infrared-Visible Image Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3673–3690.
- Zhou, M.; Zheng, N.; He, X.; Hong, D.; Chanussot, J. Probing Synergistic High-Order Interaction for Multi-Modal Image Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 840–857.
- Liu, H.; Mao, Q.; Dong, M.; Zhan, Y. Infrared-Visible Image Fusion Using Dual-Branch Auto-Encoder With Invertible High-Frequency Encoding. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 2675–2688.
- Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. CDDFuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
- Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396.
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166.
- Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42.
- Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8115–8124.
- Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2017, 5, 30–43.
- Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098.
- Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75.
- Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-task learning for dense prediction tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3614–3633.
- Zhang, D.; Zheng, R. TriangleNet: Edge Prior Augmented Network for Semantic Segmentation through Cross-Task Consistency. Int. J. Intell. Syst. 2022, 2023, 1–16.
- Gonçalves, D.N.; Junior, J.M.; Zamboni, P.; Pistori, H.; Li, J.; Nogueira, K.; Gonçalves, W.N. MTLSegFormer: Multi-task Learning with Transformers for Semantic Segmentation in Precision Agriculture. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 6290–6298.
- Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Zhang, K.; Xu, S.; Chen, D.; Timofte, R.; Van Gool, L. Equivariant Multi-Modality Image Fusion. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 25912–25921.
- Li, X.; Li, X.; Ye, T.; Cheng, X.; Liu, W.; Tan, H. Bridging the Gap between Multi-focus and Multi-modal: A Focused Integration Framework for Multi-modal Image Fusion. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 1617–1626.
- Yuan, Y.; Wu, J.; Jing, Z.; Leung, H.; Pan, H. Multimodal Image Fusion based on Hybrid CNN-Transformer and Non-local Cross-modal Attention. arXiv 2022, arXiv:2210.09847.
- Liu, H.; Zhang, J.; Yang, K.; Hu, X.; Stiefelhagen, R. CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers. IEEE Trans. Intell. Transp. Syst. 2022, 24, 14679–14694.
- Jiang, C.; Liu, X.; Zheng, B.; Bai, L.; Li, J. HSFusion: A high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation. arXiv 2024, arXiv:2407.10047.
- Alonso, C.A.; Sieber, J.; Zeilinger, M.N. State Space Models as Foundation Models: A Control Theoretic Overview. arXiv 2024, arXiv:2403.16899.
- Bragman, F.J.S.; Tanno, R.; Ourselin, S.; Alexander, D.C.; Cardoso, M.J. Stochastic Filter Groups for Multi-Task CNNs: Learning Specialist and Generalist Convolution Kernels. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1385–1394.
- Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
- Zhang, Z.; Cui, Z.; Xu, C.; Jie, Z.; Li, X.; Yang, J. Joint task-recursive learning for semantic segmentation and depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 235–251.
- Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. arXiv 2017, arXiv:1711.02257.
- Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-based Semi-supervised Learning for Object detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: San Francisco, CA, USA, 2019; Volume 32.
- Bu, Z.; Jin, X.; Vinzamuri, B.; Ramakrishna, A.; Chang, K.W.; Cevher, V.; Hong, M. Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate. arXiv 2024, arXiv:2410.22086.
- Mortaheb, M.; Vahapoglu, C.; Ulukus, S. FedGradNorm: Personalized Federated Gradient-Normalized Multi-Task Learning. In Proceedings of the 2022 IEEE 23rd International Workshop on Signal Processing Advances in Wireless Communication (SPAWC), Oulu, Finland, 4–6 July 2022; pp. 1–5.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Chen, C.F.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 347–356.
- Wu, D.; Wang, Y.; Wu, X.; Qu, T. Cross-attention Inspired Selective State Space Models for Target Sound Extraction. arXiv 2024, arXiv:2409.04803.
- Guo, M.H.; Lu, C.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. arXiv 2022, arXiv:2209.08575.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13708–13717.
- Zhang, H.; Zuo, X.; Jiang, J.; Guo, C.; Ma, J. MRFS: Mutually reinforcing image fusion and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26974–26983.
- Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5108–5115.
- Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
- Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748.
- Chen, Y.; Zhan, W.; Jiang, Y.; Zhu, D.; Guo, R.; Xu, X. LASNet: A light-weight asymmetric spatial feature network for real-time semantic segmentation. Electronics 2022, 11, 3238.
- Liang, M.; Hu, J.; Bao, C.; Feng, H.; Deng, F.; Lam, T.L. Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks. IEEE Robot. Autom. Lett. 2023, 8, 4060–4067.
- Zhao, S.; Liu, Y.; Jiao, Q.; Zhang, Q.; Han, J. Mitigating Modality Discrepancies for RGB-T Semantic Segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 9380–9394.
- Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A Unified Unsupervised Image Fusion Network. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 502–518.
- Rao, D.; Xu, T.; Wu, X.J. TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network. IEEE Trans. Image Process. 2023, 1.
- Tang, W.; He, F.; Liu, Y.; Duan, Y.; Si, T. DATFuse: Infrared and visible image fusion via dual attention transformer. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3159–3172.
- Roberts, J.W.; Van Aardt, J.A.; Ahmed, F.B. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522.
- Aslantas, V.; Bendes, E. A new image quality metric for image fusion: The sum of the correlations of differences. AEU-Int. J. Electron. Commun. 2015, 69, 1890–1896.
Per-class IoU (%) and mIoU (%) on the MFNet dataset:

| Method | Car | Person | Bike | Curve | Car Stop | Guardrail | Color Cone | Bump | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| SeAFusion | 84.2 | 71.1 | 58.7 | 33.1 | 20.1 | 0.0 | 40.4 | 33.9 | 48.8 |
| LASNet | 84.2 | 67.1 | 56.9 | 41.1 | 39.6 | 18.9 | 48.9 | 40.1 | 54.9 |
| SegMiF | 87.8 | 71.4 | 63.2 | 47.5 | 31.1 | 0.0 | 48.9 | 50.5 | 56.1 |
| MDRNet+ | 87.1 | 69.8 | 60.9 | 47.8 | 37.8 | 6.2 | 57.1 | 56.0 | 56.8 |
| SGFNet | 88.4 | 77.6 | 64.3 | 45.8 | 31.0 | 0.6 | 57.1 | 55.0 | 57.6 |
| ConvNeXt | 89.1 | 71.9 | 62.3 | 44.3 | 43.0 | 0.0 | 51.7 | 52.6 | 57.0 |
| EAEFNet | 87.6 | 72.6 | 63.8 | 48.6 | 35.0 | 14.2 | 52.4 | 58.3 | 58.9 |
| Ours | 90.7 | 74.1 | 65.9 | 47.1 | 45.7 | 1.8 | 55.6 | 59.2 | 59.8 |
Per-class IoU (%) and mIoU (%) on the FMB dataset:

| Method | Car | Person | Truck | Traffic Lamp | Traffic Sign | Building | Vegetation | Pole | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| SeAFusion | 76.2 | 59.6 | 15.1 | 34.4 | 68.0 | 80.1 | 83.5 | 38.4 | 51.9 |
| MDRNet+ | 79.6 | 61.3 | 20.7 | 19.4 | 71.5 | 82.2 | 85.5 | 44.0 | 55.9 |
| SegMiF | 79.0 | 31.1 | 25.9 | 49.1 | 74.7 | 80.1 | 84.6 | 49.5 | 56.0 |
| LASNet | 79.4 | 54.7 | 32.2 | 23.2 | 70.6 | 82.1 | 86.1 | 45.3 | 56.1 |
| ConvNeXt | 75.0 | 63.0 | 36.0 | 30.1 | 65.3 | 82.4 | 83.2 | 44.8 | 57.4 |
| EAEFNet | 83.3 | 65.4 | 30.6 | 23.8 | 72.5 | 83.9 | 86.3 | 48.6 | 59.7 |
| SGFNet | 78.1 | 68.9 | 45.0 | 45.0 | 72.1 | 83.1 | 85.8 | 45.4 | 60.4 |
| Ours | 77.5 | 67.4 | 39.3 | 46.7 | 72.3 | 83.4 | 85.0 | 52.9 | 61.8 |
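The mIoU column in the two segmentation tables above is the mean of the per-class intersection-over-union scores. As a reference for how such numbers are typically computed, below is a minimal NumPy sketch; the class indexing and the ignore-label convention (255) are generic assumptions, not details taken from this paper.

```python
# Minimal sketch of per-class IoU / mIoU from label maps (assumed convention:
# classes are integers 0..num_classes-1 and 255 marks ignored pixels).
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore: int = 255) -> float:
    valid = gt != ignore
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:          # class absent in both prediction and ground truth: skip
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy example: two 4x4 label maps with 3 classes
pred = np.array([[0, 0, 1, 1], [0, 2, 1, 1], [2, 2, 2, 1], [0, 0, 2, 2]])
gt   = np.array([[0, 0, 1, 1], [0, 2, 2, 1], [2, 2, 2, 1], [0, 0, 0, 2]])
print(round(miou(pred, gt, num_classes=3), 3))
```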
| Method | FLOPs (G) | Params (M) | Task |
|---|---|---|---|
| MDRNet+ | 891.82 | 210.87 | Segmentation |
| LASNet | 371.03 | 93.58 | Segmentation |
| ConvNeXt | 278.49 | 114.36 | Segmentation |
| EAEFNet | 316.49 | 147.21 | Segmentation |
| SGFNet | 225.63 | 125.12 | Segmentation |
| CDDFuse | 863.22 | 1.19 | Image Fusion |
| DATFuse | 8.68 | 0.01 | Image Fusion |
| TGFuse | 137.34 | 19.34 | Image Fusion |
| U2Fusion | 633.09 | 0.66 | Image Fusion |
| SeAFusion | 102.53 | 13.06 | Segmentation & Image Fusion |
| SegMiF | 526.20 | 45.60 | Segmentation & Image Fusion |
| Ours | 304.52 | 139.62 | Segmentation & Image Fusion |
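For context on the complexity figures, parameter counts in millions can be read directly from a PyTorch model, while FLOPs additionally depend on the input resolution and a profiling tool (e.g., fvcore or thop), neither of which is specified here. The snippet below is a minimal sketch of the parameter count only, with a toy model standing in for any of the networks above.

```python
# Sketch of how parameter counts like those in the table are typically obtained
# in PyTorch; FLOPs require a profiler and a fixed input size, not shown here
# to avoid assuming a specific tool or resolution.
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Total number of parameters, in millions."""
    return sum(p.numel() for p in model.parameters()) / 1e6

if __name__ == "__main__":
    toy = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1))
    print(f"{count_params_m(toy):.2f} M parameters")
```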
Ablation study on the MFNet dataset (per-class IoU and mIoU, %):

| Model | Car | Person | Bike | Curve | Car Stop | Guardrail | Color Cone | Bump | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| A | 88.0 | 71.4 | 65.5 | 42.2 | 40.3 | 0.2 | 50.4 | 48.1 | 56.0 |
| B | 90.0 | 73.7 | 64.3 | 45.5 | 40.0 | 0.7 | 52.3 | 62.4 | 58.6 |
| C | 88.8 | 61.0 | 64.6 | 39.1 | 42.4 | 1.7 | 51.9 | 58.5 | 56.2 |
| D | 90.5 | 73.7 | 66.1 | 46.0 | 39.1 | 5.3 | 55.3 | 61.0 | 59.5 |
| F | 90.5 | 73.8 | 66.0 | 43.8 | 40.6 | 1.1 | 54.8 | 63.9 | 59.2 |
| Full Model | 90.7 | 74.1 | 65.9 | 47.1 | 45.7 | 1.8 | 55.6 | 59.2 | 59.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Yang, D.; Qiao, L.; Shu, Y. SC-CoSF: Self-Correcting Collaborative and Co-Training for Image Fusion and Semantic Segmentation. Sensors 2025, 25, 3575. https://doi.org/10.3390/s25123575