VI-MSFFN: A Visible-Infrared Multi-Scale Feature Fusion Network for Cross-Modal Detection in Remote Sensing

Yue, Yurong; Qin, Weiwei; Chi, Hao; An, Baiwei; Wu, Dingyi; Guo, Wenxin; Xiong, Jingyi

doi:10.3390/rs18121938

This is an early access version; the complete PDF, HTML, and XML versions will be available soon.

by Yurong Yue, Weiwei Qin *, Hao Chi, Baiwei An, Dingyi Wu, Wenxin Guo and Jingyi Xiong

Rocket Force University of Engineering, Xi’an 710025, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1938; https://doi.org/10.3390/rs18121938

Submission received: 25 April 2026 / Revised: 3 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

Abstract

To address the insufficient robustness of single-modality detection and the limited accuracy of multi-scale object detection in remote sensing image detection (RSID) under complex environments, this paper proposes a multimodal RSID network named VI-MSFFN. The model adopts a symmetric parallel dual-branch architecture that extracts visible and infrared features independently and models them collaboratively. A cross-modal multi-scale sparse cross-attention fusion module is proposed and applied to the P4 and P5 feature layers, and a cross-modal fusion strategy that coordinates high-level and low-level features is constructed; together they achieve efficient and robust cross-modal feature fusion, enhance multi-scale object modeling, and suppress feature redundancy and noise. In addition, a progressive feature interaction and fusion architecture is designed that combines spatial-domain and frequency-domain information to strengthen deep object representation. Experimental results on the VEDAI and Drone Vehicle datasets demonstrate that VI-MSFFN achieves state-of-the-art (SOTA) performance in detection accuracy, robustness, and generalization ability. The proposed method effectively addresses the detection challenges of RSID and has significant application value in multimodal RSID.

Keywords: remote sensing image detection; VI-MSFFN; symmetric parallel dual-branch; progressive feature interaction and fusion architecture; joint spatial-frequency domain
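
The abstract does not say how the sparse cross-attention is computed, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes that sparsity means each visible-branch token attends to its top-k highest-scoring infrared tokens, and that the fused feature is a residual sum; both are assumptions. The names `topk_sparse_cross_attention`, `fuse_level`, and `k`, as well as the toy token values standing in for flattened P4 and P5 feature maps, are hypothetical.

```python
import math


def topk_sparse_cross_attention(queries, keys, values, k):
    """Each query attends only to its k highest-scoring keys.

    Top-k selection is an assumed sparsity rule, not taken from the paper.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k largest scores; all other keys get zero weight.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Numerically stable softmax restricted to the kept scores.
        top = max(scores[i] for i in keep)
        exps = {i: math.exp(scores[i] - top) for i in keep}
        total = sum(exps.values())
        # Weighted sum of the selected value vectors.
        fused.append(
            [sum(exps[i] / total * values[i][d] for i in keep) for d in range(out_dim)]
        )
    return fused


def fuse_level(visible_tokens, infrared_tokens, k):
    """Assumed residual fusion: visible features plus attention over infrared."""
    attended = topk_sparse_cross_attention(
        visible_tokens, infrared_tokens, infrared_tokens, k
    )
    return [
        [v + a for v, a in zip(vis, att)]
        for vis, att in zip(visible_tokens, attended)
    ]


# Made-up tokens standing in for flattened P4 and P5 feature maps.
visible = {"P4": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "P5": [[0.5, 0.5]]}
infrared = {"P4": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], "P5": [[0.2, 0.8], [0.9, 0.1]]}

# Fusion is applied only at the two levels named in the abstract.
fused = {lvl: fuse_level(visible[lvl], infrared[lvl], k=2) for lvl in ("P4", "P5")}
```

Restricting the softmax to a few keys per query is one common way to limit the influence of weakly related tokens from the other modality, which matches the stated goal of suppressing redundancy and noise; the actual module may use a different mechanism.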
