Open Access | Review
Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey
by Guoqing Zhou 1, Lihuang Qian 1,*, and Paolo Gamba 2
1 Guangxi Key Laboratory of Spatial Information and Geomatics, College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541004, China
2 Department of Electrical, Biomedical, and Computer Engineering, University of Pavia, 27100 Pavia, Italy
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(21), 3532; https://doi.org/10.3390/rs17213532
Submission received: 29 August 2025 / Revised: 15 October 2025 / Accepted: 22 October 2025 / Published: 24 October 2025
Abstract
Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm of “unlabeled datasets—model pre-training—downstream tasks”, achieving superior accuracy and performance compared to existing models across numerous open benchmark datasets. However, when confronted with multimodal data such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, no comprehensive survey is yet dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To fill this gap, this article provides a systematic review of MM-RSFMs from a novel perspective. Firstly, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and the cross-modal interaction methods of vision–X, namely vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed, and perspectives for MM-RSFMs are outlined. This survey reveals that current MM-RSFMs face the following key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) absence of unified evaluation criteria, and (5) insufficient security measures.
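To make the vision–language branch of this taxonomy concrete, the sketch below shows the CLIP-style contrastive pre-training objective that underlies many vision–language foundation models: paired image and caption embeddings are pulled together while unpaired ones are pushed apart, after which the encoders can be transferred to downstream tasks. This is a minimal, generic illustration, not the pipeline of any specific MM-RSFM covered by the survey; the encoder architectures, embedding sizes, and data here are all placeholder assumptions.

```python
# Minimal sketch of CLIP-style contrastive image-text pre-training.
# All modules, dimensions, and data below are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionEncoder(nn.Module):
    """Stand-in for a ViT/CNN backbone; maps image patches to embeddings."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim),
        )
    def forward(self, x):
        return self.net(x)

class ToyTextEncoder(nn.Module):
    """Stand-in for a transformer text encoder; mean-pools token embeddings."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)
    def forward(self, tokens):
        return self.proj(self.embed(tokens).mean(dim=1))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image should match the i-th caption."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One illustrative training step on random "RS image / caption" pairs.
vision, text = ToyVisionEncoder(), ToyTextEncoder()
opt = torch.optim.AdamW(list(vision.parameters()) + list(text.parameters()), lr=1e-4)
images = torch.randn(8, 3, 64, 64)          # fake batch of RS image patches
captions = torch.randint(0, 1000, (8, 12))  # fake batch of tokenized captions
loss = clip_loss(vision(images), text(captions))
loss.backward(); opt.step()
print(f"contrastive loss: {loss.item():.3f}")
```

In a real MM-RSFM the toy encoders would be replaced by large pre-trained backbones and the random tensors by curated RS image–caption pairs, but the alignment objective keeps this same batch-wise contrastive form.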