                
                        
            Open Access Review
            
                Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey            
            
            by Guoqing Zhou, Lihuang Qian and Paolo Gamba
                            
                
                    
            1 Guangxi Key Laboratory of Spatial Information and Geomatics, College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541004, China
         
                    
            2 Department of Electrical, Biomedical, and Computer Engineering, University of Pavia, 27100 Pavia, Italy
         
    
    
            
            * Author to whom correspondence should be addressed.
         
    
    
    
 
             
            
                Remote Sens. 2025, 17(21), 3532; https://doi.org/10.3390/rs17213532 (registering DOI)
            
            
                    
            Submission received: 29 August 2025 / Revised: 15 October 2025 / Accepted: 22 October 2025 / Published: 24 October 2025
            
                
    
            
            
                        
            
            
            
        
                        
        
                        
        
                        
        
        
                        
        
                        
                                                                            
                                                                            
            
                            Abstract
            
            
            Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm that runs from unlabeled datasets through model pre-training to downstream tasks. These models outperform existing models in accuracy across numerous open benchmark datasets. However, when confronted with multimodal data, such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, there is currently no comprehensive survey dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To fill this gap, this article provides a systematic review of MM-RSFMs from the vision–X perspective. First, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and the vision–X cross-modal interaction methods, namely vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed, and perspectives for MM-RSFMs are outlined. The survey reveals that current MM-RSFMs face the following key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) absence of unified evaluation criteria, and (5) insufficient security measures.
                    
                            
            
                            
            
                        
                        
                        
                    
                        
            
            
    
        
     
            
                
                
                    
MDPI and ACS Style
Zhou, G.; Qian, L.; Gamba, P. Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey. Remote Sens. 2025, 17, 3532. https://doi.org/10.3390/rs17213532
 
                 
                                    