Abstract
The fusion of synthetic aperture radar (SAR) and visible images offers complementary spatial and spectral information, enabling more reliable and comprehensive scene interpretation. However, SAR speckle noise and the intrinsic modality gap make it difficult for existing methods to extract consistent and complementary features. To address these issues, we propose VGSRF-Net, a Retinex-guided, SAR-reconstruction-driven fusion network that leverages visible-image priors to refine SAR features. Its cross-modality reconstruction module (CMRM) reconstructs SAR features under the guidance of visible priors, reducing modality discrepancies before fusion and yielding an improved multi-modal representation. The multi-modal feature joint representation module (MFJRM) strengthens cross-modal complementarity by integrating global contextual interactions with local dynamic convolution, further aligning the two feature streams. Finally, the feature enhancement module (FEM) refines multi-scale spatial features and selectively enhances high-frequency details in the frequency domain, improving structural clarity and texture fidelity. Extensive experiments on diverse real-world remote sensing datasets demonstrate that VGSRF-Net surpasses state-of-the-art methods in denoising, structural preservation, and generalization under varying noise and illumination conditions.
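To make the pipeline order described in the abstract concrete, the following is a minimal sketch in PyTorch that mirrors only the stated data flow (visible-guided SAR feature reconstruction, joint multi-modal representation, then feature enhancement). The class name VGSRFNetSketch, all module internals, layer choices, and channel counts are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the three-stage flow described in the abstract.
# All internals are placeholders; only the CMRM -> MFJRM -> FEM ordering
# comes from the text.
import torch
import torch.nn as nn


class CMRM(nn.Module):
    """Cross-modality reconstruction: refine SAR features with visible priors (placeholder internals)."""
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, sar_feat, vis_feat):
        # Concatenate SAR features with visible priors and re-project.
        return self.refine(torch.cat([sar_feat, vis_feat], dim=1))


class MFJRM(nn.Module):
    """Joint multi-modal representation (simple conv stand-in for global/local interactions)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, sar_feat, vis_feat):
        return self.fuse(torch.cat([sar_feat, vis_feat], dim=1))


class FEM(nn.Module):
    """Feature enhancement (single residual conv as a stand-in for multi-scale / frequency refinement)."""
    def __init__(self, channels: int):
        super().__init__()
        self.enhance = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, fused):
        return fused + self.enhance(fused)


class VGSRFNetSketch(nn.Module):
    """End-to-end ordering implied by the abstract: encode, CMRM, MFJRM, FEM, decode."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.sar_encoder = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.vis_encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.cmrm = CMRM(channels)
        self.mfjrm = MFJRM(channels)
        self.fem = FEM(channels)
        self.head = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, sar, vis):
        sar_feat = self.sar_encoder(sar)
        vis_feat = self.vis_encoder(vis)
        sar_feat = self.cmrm(sar_feat, vis_feat)   # reconstruct SAR features before fusion
        fused = self.mfjrm(sar_feat, vis_feat)     # joint multi-modal representation
        fused = self.fem(fused)                    # enhance structural / textural detail
        return self.head(fused)


if __name__ == "__main__":
    net = VGSRFNetSketch()
    out = net(torch.randn(1, 1, 64, 64), torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```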