Review

Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey

Guoqing Zhou, Lihuang Qian and Paolo Gamba
1 Guangxi Key Laboratory of Spatial Information and Geomatics, College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541004, China
2 Department of Electrical, Biomedical, and Computer Engineering, University of Pavia, 27100 Pavia, Italy
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(21), 3532; https://doi.org/10.3390/rs17213532
Submission received: 29 August 2025 / Revised: 15 October 2025 / Accepted: 22 October 2025 / Published: 24 October 2025
(This article belongs to the Section AI Remote Sensing)

Abstract

Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm of "unlabeled datasets → model pre-training → downstream tasks", and they achieve superior accuracy compared with existing models across numerous open benchmark datasets. However, when confronted with multimodal data such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, there is currently no comprehensive survey dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To fill this gap, this article provides a systematic review of MM-RSFMs from a novel perspective. First, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and the vision–X cross-modal interaction methods, namely vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed and future perspectives for MM-RSFMs are outlined. This survey reveals that current MM-RSFMs face five key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) the absence of unified evaluation criteria, and (5) insufficient security measures.
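To make the "unlabeled datasets → model pre-training → downstream tasks" paradigm and the vision–language branch of the vision–X taxonomy concrete, the sketch below shows CLIP-style contrastive pre-training, the alignment mechanism that many multimodal foundation models build on: paired image and caption batches are embedded, projected into a shared space, and pulled together with a symmetric cross-entropy loss. This is an illustration only, not the method of any specific model surveyed in the paper; PyTorch is assumed, and the encoder architectures, sizes, and names (ToyVisionLanguageAligner, embed_dim) are hypothetical placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionLanguageAligner(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Stand-ins for real backbones (e.g., a vision Transformer and a text Transformer).
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        self.text_encoder = nn.Sequential(
            nn.EmbeddingBag(1000, 256),  # mean-pools token embeddings per caption
            nn.Linear(256, embed_dim),
        )
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, images, token_ids):
        # Embed both modalities and L2-normalize onto the shared unit sphere.
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(token_ids), dim=-1)
        # Pairwise cosine similarities, scaled by the temperature.
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(len(images))
        # Symmetric loss: match each image to its caption and each caption to its image.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = ToyVisionLanguageAligner()
images = torch.randn(8, 3, 64, 64)           # a batch of 8 RGB patches
token_ids = torch.randint(0, 1000, (8, 16))  # 8 captions of 16 token ids each
loss = model(images, token_ids)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")

In practice, RS-oriented variants of this recipe replace the toy encoders with pre-trained backbones and train on large remote sensing image–caption corpora before transferring to downstream tasks such as scene classification or retrieval.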
Keywords: remote sensing foundation model (RSFM); multimodal data; generative pre-trained Transformer (GPT); Earth observation; self-supervised learning
