1. Introduction
Remote sensing (RS) plays an increasingly important role in earth observation (EO) with the help of multimodal sensors mounted on satellites or unmanned aerial vehicles (UAVs). By capturing all-weather, round-the-clock data across various electromagnetic spectra, RS provides critical insights for many applications, e.g., environmental monitoring, urban planning, land use mapping, disaster response, and climate change analysis [1,2,3].
The traditional downlink-all strategy, which transmits all raw data from orbit to ground stations for processing, faces severe bandwidth and latency constraints due to the huge data volumes generated by modern high-resolution sensors [4]. Beyond wasting transmission resources, this limitation also hinders real-time decision making and timely responses in critical applications such as disaster management. To address these challenges, there is a growing emphasis on onboard techniques, where data is analyzed directly on the sensing platforms (e.g., satellites and UAVs) before transmission. Onboard processing reduces reliance on ground stations, minimizes data transmission costs, enhances data privacy, and enables real-time analytics in communication-denied environments [5].
To enable this onboard intelligence, numerous near-real-time, lightweight deep learning algorithms have been developed for RS applications [6,7]. Although these methods are computationally efficient and suitable for resource-constrained edge devices, they often lack the generalization capabilities required to handle the diverse and complex conditions of RS data. Specifically, small networks trained on limited datasets are task-specific and struggle to adapt to new scenarios and sensor modalities without extensive retraining and redeployment, limiting the autonomy and flexibility of onboard computing.
More recently, the emergence of foundation models (FMs), large-scale pretrained models capable of generalizing across diverse tasks, has revolutionized the field of artificial intelligence (AI) [8]. Compared to lightweight networks, FMs have demonstrated superior performance across various domains, owing to their massive parameter counts in line with scaling laws [9]. Furthermore, they are more flexible for onboard deployment: after pretraining, FMs can be adapted to a wide range of downstream tasks via parameter-efficient fine-tuning on minimal task-specific data [10], significantly reducing the number of parameters that must be updated and uplinked. The RS community has also witnessed the development of large-scale remote sensing foundation models (RSFMs) [11,12,13] tailored for applications such as scene classification, semantic segmentation, target detection, and change detection. These models leverage vast amounts of multimodal RS data to learn rich representations that capture the complex spatial, spectral, and temporal patterns inherent in EO imagery.
However, deploying RSFMs on resource-constrained onboard devices such as satellites remains a significant challenge due to their substantial compute, memory, and energy requirements [4]. The gap between model complexity and device capability hinders the practical adoption of RSFMs for real-time onboard analytics. To bridge this gap, various model compression and optimization techniques, including quantization [14,15], pruning [16,17], and knowledge distillation [18,19,20], can be applied to reduce the size and computational demands of RSFMs. Furthermore, advancements in specialized hardware accelerators [21] and edge computing architectures [22] are essential to support efficient execution.
Although onboard RSFMs show great potential over traditional AI in both performance and flexibility, research on them remains largely unexplored. Regarding existing reviews, on the one hand, several surveys have examined onboard techniques for RS [4,5,6,21,23]; they primarily focus on the development of lightweight AI algorithms without extending to onboard FMs. On the other hand, many reviews have summarized advancements in RSFMs and highlighted the capabilities of FMs in handling diverse RS tasks [11,12,13,24], but they do not delve into the unique considerations and solutions required for deploying these models on edge devices in real-world applications.
To this end, this article presents, for the first time, a comprehensive review of the deployment of RSFMs on resource-constrained onboard devices, aiming to fill the gap between existing research on onboard processing techniques and advancements in RSFMs. Specifically, a promising pipeline covering hardware analysis, RSFM development, and model compression techniques is introduced. The remainder of this review is organized as follows. Section 2 briefly surveys the background of onboard techniques and FMs for RS. Section 3 discusses promising pipelines for the deployment of onboard RSFMs and the available hardware platforms and products. Section 4 introduces the frameworks and datasets of existing RSFMs. Section 5 reviews model compression and acceleration methods for deploying RSFMs on resource-constrained devices. Section 6 discusses the challenges and opportunities of onboard RSFMs as future research directions.
4. Remote Sensing Foundation Models (RSFMs)
The current literature typically categorizes RSFMs into two primary streams based on the modalities involved: vision–language models (VLMs) and vision foundation models (VFMs) [11]. In this review, we focus on VFMs, as they are most relevant to the deployment of onboard RSFMs. Table 5 gives an overview of existing VFMs in the field of RS, including their backbone, pretraining strategy, data modality, and model size.
4.1. Architecture and Pretraining Strategy
The rapid evolution of VFMs in RS has been marked by a transition from convolutional neural networks (CNNs) to vision transformers as the dominant backbone. Early efforts primarily utilized CNN-based architectures such as ResNet [66] to extract hierarchical features. However, the field has largely shifted towards transformer-based backbones such as the basic vision transformer (ViT) [67] and Swin transformer (SwinT) [68], due to their superior ability to model long-range dependencies and global context in complex, multi-scale geospatial data. Some methods also combine the strengths of both CNNs and ViTs to further enhance feature extraction capability and efficiency.
Pretraining paradigms fall into supervised learning and self-supervised learning. However, supervised pretraining requires large-scale annotated datasets [128,133,144], which are hard to obtain in RS due to the high cost of expert labeling. To leverage the massive volumes of unlabeled Earth observation data, self-supervised learning, typically involving contrastive learning and masked image modeling (MIM), as shown in Figure 2, has become the predominant pretraining strategy for VFMs.
4.1.1. Contrastive Learning
Contrastive learning trains representations by pulling together the embeddings of related views and pushing apart those of unrelated views without supervision. In practice, this is implemented by forming positive pairs (different augmentations, temporal revisits, or cross-modal pairs) and contrasting them against many negatives using an objective such as InfoNCE [69]. Building on this foundation, MoCo [70] introduces a momentum encoder and a dynamic queue to efficiently maintain a large dictionary of negatives, while SimCLR [71] demonstrates the importance of large-batch training and strong augmentation pipelines for contrastive success.
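As an illustration, the following is a minimal PyTorch sketch of the InfoNCE objective, assuming two batches of projected embeddings from different augmented views of the same images; the function name and temperature value are illustrative rather than taken from any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.07):
    # Each embedding in z1 is attracted to its positive counterpart in z2
    # and repelled from all other samples in the batch (the negatives).
    z1 = F.normalize(z1, dim=1)          # (N, D) view-1 embeddings
    z2 = F.normalize(z2, dim=1)          # (N, D) view-2 embeddings
    logits = z1 @ z2.t() / temperature   # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# e.g., 256 image pairs with 128-dimensional projections
loss = info_nce_loss(torch.randn(256, 128), torch.randn(256, 128))
```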
In the context of RS, research utilizing contrastive learning focuses on exploiting the unique meta-information inherent in RS data, such as temporal timestamps and geolocation. SeCo [122] pioneers the use of temporal invariance, treating images of the same location captured at different times (seasons) as positive pairs to learn representations that are invariant to seasonal changes but sensitive to semantic content. Extending this concept to the spatial domain, GASSL [123] and CSP [126] integrate geospatial coordinates directly into the pretraining objective. Specifically, CSP employs a dual-encoder architecture to separately encode images and their geolocations, aligning visual features with the corresponding geolocation embeddings to improve performance. MATTER [124] introduces a method centered on material and textural consistency, leveraging multi-temporal alignment to learn representations invariant to illumination and viewing angles. SkySense [139] and its successor SkySense V2 [147] employ massive SwinT backbones combined with ViT, with up to billions of parameters; V2 introduces a mixture-of-experts (MoE) strategy to further scale model capacity while maintaining computational efficiency during inference.
4.1.2. Masked Image Modeling
Masked image modeling (MIM) learns representations by masking parts of the input and training the model to reconstruct the missing content or its discrete proxy, encouraging the encoder to capture both local texture and broader context required for reconstruction. In computer vision, two complementary flavors emerge: the first is MAE-based pixel reconstruction [
73], which masks many image patches and reconstructs raw pixels with an asymmetric encoder–decoder. The second is token-based MIM introduced in BEiT [
149], which tokenizes images into discrete visual tokens and predicts token IDs for masked locations. A simpler variant, SimMIM [
72], demonstrates that straightforward random masking plus pixel reconstruction can be highly effective.
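To make the mechanism concrete, below is a minimal sketch of MAE-style per-sample random masking, assuming patch embeddings of shape (batch, tokens, dim); the 75% mask ratio follows common MAE practice, while the function itself is illustrative.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (N, L, D) patch embeddings; keep a random subset per sample.
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))
    noise = torch.rand(N, L, device=patches.device)   # random score per token
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(N, L, device=patches.device)    # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask  # encoder sees only `visible`; decoder reconstructs the rest

# e.g., a ViT-B/16 on 224x224 inputs yields 196 tokens of width 768
visible, mask = random_masking(torch.randn(8, 196, 768))
```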
MIM methods are well suited to RS because they force models to capture fine-grained spatial and spectral details. For instance, SatMAE [125] adapts the MAE framework to multispectral RS images, introducing temporal and spectral positional embeddings that allow the model to independently mask and reconstruct patches across time series and multispectral bands. 3DMAE [150] employs a novel 3D vertical masking strategy to effectively capture inter- and intra-modality correlations between paired SAR and optical images. To address the significant resolution variations in RS imagery, Scale-MAE [129] incorporates a ground sample distance (GSD)-based positional encoding, combined with a Laplacian pyramid decoder, to explicitly learn scale-invariant representations. RingMo [132] proposes a specialized masking strategy that preserves the unmasked tokens of small targets, preventing them from being lost during random masking. Similarly, MA3E [136] introduces angle-aware embeddings into the reconstruction objectives, forcing the model to learn the rotational invariance crucial for oriented object detection. To extend MIM to multispectral data, S2MAE [141] utilizes 3D masking to capture continuous spectral signatures in hyperspectral data, whereas HyperSIGMA [143] applies a specialized MIM strategy to reduce the high dimensionality of hyperspectral imagery. RobSense [146] leverages MIM to reconstruct missing modalities, thereby enhancing robustness against incomplete data inputs.
4.1.3. Hybrid Strategy
An increasing number of VFMs combine contrastive and MIM objectives to exploit both invariance learning and generative understanding. CMID [
134] unifies contrastive and MIM tasks under a single architecture, introducing cross-view interactions that enhance spatial–spectral coherence. Cross-Scale MAE [
131] utilizes scale augmentation to enforce consistency between synthesized multi-scale views of the same input using both contrastive and generative losses. OmniSat [
135] focuses on modality fusion, exploiting the precise spatial alignment of different sensors to learn joint embeddings that remain effective even when specific modalities are missing during inference. AnySat [
145] utilizes a joint embedding predictive architecture (JEPA) with scale-adaptive encoders to predict latent representations rather than raw pixels, facilitating scalable learning across resolutions.
4.2. Data Modality
Different from natural images, RS data span various modalities, as shown in Figure 3, including high-resolution optical images (RGB), multispectral imagery (MSI), synthetic aperture radar (SAR), hyperspectral imagery (HSI), the digital elevation model (DEM), etc. While the majority of existing VFMs focus on RGB images due to the abundance of available data, some methods address the unique physical challenges of other modalities. For instance, SARATR-X [142] represents a pioneering effort in SAR-specific FMs, incorporating speckle noise modeling into MIM pretraining to enhance robustness against SAR artifacts. The fusion of optical and SAR imagery is also surveyed by Zhang et al. [2]. HyperSIGMA [143] tackles the challenge of extreme spectral redundancy across hundreds of contiguous bands by utilizing a sparse sampling attention (SSA) mechanism, allowing the model to scale to over one billion parameters while effectively modeling the intricate spectral correlations that standard ViTs miss.
To handle multimodal diversity, recent advancements have shifted toward unified architectures capable of fusing diverse RS modalities to improve robustness against the extreme domain gaps between them. For example, SkySense [139] and SkySense V2 [147] integrate RGB, MSI, and SAR data during pretraining, leveraging their complementary information to learn more holistic representations. msGFM [140] focuses on self-supervised fusion by exploiting the spatial alignment of different sensors and using sensor-specific embeddings to bridge the domain gap between optical, SAR, and DEM data. AnySat [145] introduces a JEPA equipped with scale-adaptive encoders, allowing the model to map data of any resolution, scale, or modality into a common latent space and to predict representations rather than raw pixels. TerraMind [148] pushes the generative frontier with an any-to-any framework based on a symmetric transformer. It introduces “thinking-in-modalities”, a mechanism that generates synthetic intermediate modalities (e.g., creating a SAR image from an optical input) to enhance performance on downstream tasks. CDPrompt [151] utilizes SAM with a lightweight domain tuning module and automatically generated in-domain prompts derived from SAR to enable robust multimodal change detection in missing-modality scenarios.
To capture the dynamic evolution of the Earth’s surface, many models also integrate explicit mechanisms for time series processing. SatMAE [125] adapts the MAE framework to temporal data by introducing temporal positional encodings and employing an independent masking strategy across time steps, forcing the model to reconstruct distinct temporal states. SpectralGPT [138] utilizes a 3D generative pretrained transformer architecture that treats sequential data as a continuous stream of tokens, modeling the 3D dependencies inherent in volumetric data. AnySat [145] handles temporal dynamics by defining input patches as 3D tensors (height × width × time) within its scale-adaptive encoder, allowing it to process varying sequence lengths naturally.
4.3. Parameter-Efficient Fine-Tuning (PEFT)
The adaptation of pretrained VFMs to downstream applications relies on rigorous fine-tuning methodologies designed to transfer knowledge from generalized pretext tasks to specific EO objectives. Historically, the standard adaptation protocol has been full fine-tuning (FFT), wherein all parameters of the pretrained backbone are updated via backpropagation using task-specific loss functions. However, this is not suitable for deployment onboard resource-constrained satellites. For a multi-billion-parameter model, FFT requires high-end GPU clusters with massive RAM capacity to store gradient states and optimizer moments. Furthermore, FFT creates a separate, full-sized copy of the model for each downstream task. In a practical remote sensing workflow, where a single satellite operator might require distinct models for different tasks, the storage requirements of FFT become prohibitive. Additionally, FFT on small, specialized remote sensing datasets carries a high risk of catastrophic forgetting, where the model overfits to the narrow downstream distribution and loses the generalizable feature representations acquired during pretraining. Linear probing offers a more efficient alternative by freezing the backbone and training only a lightweight task-specific head on top of the features extracted by the pretrained model. However, it may under-utilize the rich representations learned during pretraining.
To address these challenges, parameter-efficient fine-tuning (PEFT) techniques have emerged as a dominant frontier. PEFT freezes the vast majority of the pretrained weights and updates only a minimal subset of learnable parameters (often fewer than 1%) to reach performance comparable to FFT. Common PEFT methods include adapter tuning, prompt tuning, and reparameterization tuning, as shown in Figure 4.
Adapter tuning is arguably the most versatile PEFT approach, introducing small bottleneck modules (adapters) into each layer of the pretrained model. During fine-tuning, only the adapter parameters are updated, while the original model weights remain frozen. This modular design allows easy integration into existing architectures and can be tailored to different tasks by varying the adapter size and placement. In RS applications, to address the limitations of adapters in dense prediction tasks, UPetu [75] proposes a unified approach specifically for RSFMs, arguing that standard adapters are designed for classification and lack the spatial sensitivity required for pixel-level tasks. It integrates two complementary modules, i.e., an efficient quantization adapter module and a context-aware prompt module, to strengthen the correlation between fine-grained feature information and task-specific knowledge.
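A minimal sketch of a generic bottleneck adapter inserted residually after a frozen sub-layer is shown below; the dimensions and the zero-initialized up-projection (which makes the adapter start as an identity mapping) are common conventions, not the specific UPetu design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project, non-linearity, up-project, residual add.
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # adapter starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

adapter = Adapter()
y = adapter(torch.randn(4, 196, 768))
# During fine-tuning, the backbone is frozen and only adapters are trained:
# for p in backbone.parameters(): p.requires_grad = False
# for p in adapter.parameters(): p.requires_grad = True
```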
Prompt tuning draws inspiration from natural language processing, where task-specific prompts are prepended to the input data to steer the model’s attention during inference. In vision transformers, this involves learning a set of prompt tokens that are concatenated with the input patch embeddings. During fine-tuning, only these prompt tokens are updated, allowing the model to adapt to new tasks with minimal parameter changes. Some recent works explore prompt tuning for adapting the segment anything model (SAM) [65] to RS tasks [76,77].
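The following sketch illustrates this idea for vision transformers: learnable tokens are concatenated with the frozen patch embeddings, and only the prompts receive gradients. The class and parameter names are illustrative assumptions in the style of visual prompt tuning.

```python
import torch
import torch.nn as nn

class PromptTokens(nn.Module):
    # Learnable prompt tokens prepended to the patch embeddings of a frozen ViT.
    def __init__(self, num_prompts=10, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, patch_embeddings):              # (N, L, D)
        n = patch_embeddings.size(0)
        prompts = self.prompts.expand(n, -1, -1)      # broadcast over the batch
        return torch.cat([prompts, patch_embeddings], dim=1)  # (N, P+L, D)

tokens = PromptTokens()(torch.randn(4, 196, 768))     # -> (4, 206, 768)
```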
A typical approach to reparameterization tuning is low-rank adaptation (LoRA) [56]. The efficacy of LoRA relies on the intrinsic dimension hypothesis, which posits that the optimal parameters for a specific downstream task reside in a low-dimensional subspace of the high-dimensional parameter space of the pretrained model. Instead of updating the full weight matrix, LoRA optimizes a low-rank approximation of the update, significantly reducing memory requirements without introducing inference latency.
The structure of LoRA is illustrated in Figure 4c. In a standard neural network layer, the forward pass is defined as $h = W_0 x$, where $x$ is the input and $W_0 \in \mathbb{R}^{d \times k}$ is the frozen pretrained weight. LoRA decomposes the weight update $\Delta W$ into two smaller matrices $A$ and $B$ as follows:
$$h = W_0 x + \Delta W x = W_0 x + B A x,$$
where $\Delta W = B A$ is a parallel path for the weight update, and $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$, are the trainable low-rank matrices. During fine-tuning, only $A$ and $B$ are updated, while $W_0$ remains frozen. To ensure that training begins exactly at the pretrained state (i.e., $\Delta W = 0$ at step zero), $A$ is initialized with Gaussian noise $\mathcal{N}(0, \sigma^2)$ and $B$ is initialized to zero. In this way, the total number of trainable parameters is reduced from $d \times k$ to $r \times (d + k)$, leading to significant memory savings.
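A minimal PyTorch sketch of a LoRA-augmented linear layer implementing the formulation above is given below; the rank, initialization scale, and α/r scaling factor are illustrative hyperparameter choices following common LoRA conventions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Frozen pretrained linear layer W0 plus a trainable low-rank update BA.
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                         # freeze W0
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, rank))         # zero init => dW = 0
        self.scaling = alpha / rank

    def forward(self, x):
        # h = W0 x + B A x (the low-rank path runs in parallel)
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), rank=8)
y = layer(torch.randn(4, 768))   # only A and B train (~2% of W0's parameters here)
```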
LoRA has been successfully applied to various VFMs in RS, demonstrating its effectiveness in adapting large models to specific EO tasks with minimal computational overhead. Some LoRA-based strategies have also been proposed to further address the unique challenges of RS. For instance, directly adapting a model trained on RGB or MSI to HSI (which has hundreds of spectral bands) requires handling 3D data cubes, where spectral correlation is as important as spatial correlation. Standard LoRA, which operates on 2D matrices, may fail to capture these tensor interactions efficiently. To this end, Ligan et al. [74] apply Kronecker product adaptation (KronA) for fine-tuning SpectralGPT [138] for HSI classification, which is superior to standard LoRA and achieves accuracy competitive with FFT while updating only 0.056% of the parameters.
Table 6 summarizes the comparative performance of typical PEFT methods applied to RS. Reparameterization tuning (e.g., LoRA and its variants) requires the fewest trainable parameters and introduces no additional inference latency, since it only adds a parallel low-rank path during training. Moreover, it maintains the original compute graph and memory access patterns, making it hardware-friendly for onboard applications where real-time processing is critical. Adapter tuning introduces moderate storage overhead due to the additional adapter layers, which require further optimization such as quantization for efficient deployment.
4.4. Dataset
The success of VFMs heavily relies on large-scale and diverse pretraining datasets that capture the complex variability of the Earth’s surface. Early pretraining efforts largely relied on MillionAID [152] and fMoW [153]. MillionAID is a large-scale benchmark containing over one million RGB images across 51 categories, primarily designed for scene classification. fMoW contains temporal sequences of high-resolution MSI imagery with metadata (time, location, and angles), enabling models to learn temporal dynamics beyond mere static appearance. However, these annotated datasets are limited in scale and diversity compared to the massive corpora used for natural image FMs. Recent state-of-the-art VFMs instead collect massive unlabeled multimodal datasets from various satellite platforms (e.g., Sentinel-1, Sentinel-2, Landsat, ALOS-2, NAIP, etc.) to fully leverage self-supervised pretraining, as summarized in Table 7.
After pretraining, the utility of RS VFMs is rigorously evaluated through their transferability to downstream EO tasks. Following the pretrain-then-fine-tune paradigm, they leverage generalized representations learned from massive unlabeled corpora to achieve state-of-the-art performance on specific target tasks. Typical downstream tasks and their benchmarks include scene classification (UC Merced Land Use (UCM) [154], EuroSAT [155], BigEarthNet (BEN) [156], and NWPU-RESISC45 [157]), semantic segmentation (Potsdam and Vaihingen [158], iSAID [159]), change detection (OSCD [160], LEVIR-CD [161]), and object detection (DOTA [162], DIOR [163]).
5. Model Optimization and Compression
To enable the deployment of large-scale RSFMs onboard resource-constrained satellites, model optimization and compression techniques are essential to reduce model size and computational requirements while maintaining performance. Key strategies include quantization, pruning, and knowledge distillation, as illustrated in Figure 5. The applications of these compression techniques to RS models are also discussed in this section.
5.1. Quantization
Quantization reduces the precision of model weights and activations from high-precision formats (e.g., 32-bit floating point) to lower-precision formats (e.g., 8-bit integer), significantly reducing memory footprint and computational load. In addition, some hardware accelerators only support specific precision formats. The implementation of quantization generally falls into post-training quantization (PTQ) and quantization-aware training (QAT): PTQ applies quantization after model training, while QAT incorporates quantization effects during training to mitigate accuracy loss.
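As a concrete example, the following sketch shows uniform affine INT8 quantization with a scale and zero-point derived from simple min/max calibration; the function names are illustrative.

```python
import torch

def quantize_int8(x, scale, zero_point):
    # Affine mapping to the integer grid: q = round(x / scale) + zero_point.
    q = torch.round(x / scale) + zero_point
    return q.clamp(-128, 127).to(torch.int8)

def dequantize(q, scale, zero_point):
    # Approximate reconstruction: x_hat = (q - zero_point) * scale.
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(64, 64)
scale = (w.max() - w.min()) / 255.0              # min/max calibration
zero_point = torch.round(-w.min() / scale) - 128
q = quantize_int8(w, scale, zero_point)
w_hat = dequantize(q, scale, zero_point)         # w - w_hat is the quantization error
```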
5.1.1. Post-Training Quantization (PTQ)
PTQ is a straightforward approach that quantizes a pretrained model without further retraining, which is computationally efficient and easy to implement. However, PTQ may lead to significant accuracy degradation if the model’s weight distribution is susceptible to quantization noise. This risk is elevated in RSFMs where subtle spectral distinctions (e.g., distinguishing between stressed and healthy vegetation or different mineral types) rely on high-precision feature representations in the deeper layers of the network. To address this challenge, advanced PTQ methods mainly focus on preserving relevant information in weights and activations during quantization, which can be divided into calibration-only and reconstruction-based techniques.
Calibration-only PTQ estimates quantization parameters (e.g., range, scale, and zero-point) via simple statistics or analytic rules on a small calibration set. Simple analytic corrections are applied, e.g., min/max or percentile clipping for activations, per-channel or per-tensor scale choices, and symmetric/asymmetric zero-points. For instance, to address scenarios where training data is unavailable, Nagel et al. [166] propose a data-free method that utilizes weight equalization and bias correction to mitigate quantization errors directly from the pretrained model parameters. ZeroQ [167] constructs a synthetic calibration dataset by optimizing input noise to match the batch normalization statistics of the original full-precision network. Building upon synthetic data generation, GDFQ [168] employs a generative adversarial network (GAN) to produce diverse, distribution-matching samples that facilitate knowledge distillation for high-precision quantization.
Reconstruction-based PTQ further refines the quantized weights by minimizing the reconstruction error between the outputs of the full-precision and quantized models on a calibration set; such methods are compute-intensive at quantization time but achieve much stronger accuracy at low bitwidths. For instance, AdaRound [169] challenges the assumption that rounding to the nearest integer is optimal by formulating weight quantization as a quadratic unconstrained binary optimization problem that minimizes the layer-wise reconstruction error. BRECQ [170] extends the scope of optimization from individual layers to residual blocks, demonstrating that block-wise reconstruction guided by the Hessian of the task loss significantly reduces error accumulation compared to layer-wise methods. Adapting these reconstruction principles to non-convolutional architectures, APHQ-ViT [171] addresses the unique activation distributions of vision transformers by employing an average perturbation Hessian metric to calibrate weights according to the model’s sensitivity.
5.1.2. Quantization-Aware Training (QAT)
QAT is a more advanced approach that simulates quantization errors during the training or fine-tuning process, allowing the model to adapt its weights to the lower precision representation and learn to be robust to quantization noise. Specifically, during forward propagation, weights and activations are quantized to the target precision using simulated quantization functions, thus injecting quantization noise into the training signal. During backpropagation, a differentiable surrogate (commonly the straight-through estimator, STE) is used to propagate gradients through non-differentiable rounding/clamping operations. Although requiring access to the original training data and involving additional computational overhead during training, QAT can significantly improve the final accuracy of quantized models compared to PTQ, especially for aggressive quantization levels.
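A minimal sketch of the fake-quantization operator with a straight-through estimator is shown below: the forward pass rounds to an INT8 grid, while the backward pass treats rounding as the identity. The class name and the fixed scale are illustrative; real QAT pipelines learn or calibrate the scale per tensor or channel.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    # Simulated INT8 quantization: forward rounds and clamps, backward is identity.
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale                      # dequantized value used downstream

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient through the rounding op.
        return grad_output, None

x = torch.randn(16, 128, requires_grad=True)
y = FakeQuantSTE.apply(x, torch.tensor(0.05))
y.sum().backward()                            # gradients reach x despite rounding
```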
Pioneering the extreme limit of low-precision training, Hubara et al. [172] introduce binarized neural networks (BNNs) to radically reduce memory consumption by constraining weights and activations to single-bit values $\{-1, +1\}$ during the forward pass. Moving towards practical deployment on standard hardware, Jacob et al. [173] propose a simulation framework that models quantization noise during training to enable inference using strictly integer arithmetic without sacrificing accuracy. As attention-based models gained prominence, Q-ViT [174] identifies that the self-attention mechanism is highly sensitive to quantization noise and proposes an information rectification module to preserve the distribution of attention scores. To scale QAT to the era of large language models, EfficientQAT [175] overcomes the prohibitive memory costs of end-to-end training by employing a block-wise reconstruction strategy that allows efficient fine-tuning of massive parameter counts.
5.2. Pruning
Model pruning removes parameters (weights, channels, or structures) from a trained neural network to reduce storage, FLOPs, and latency while attempting to preserve task performance. Pruning methods differ by granularity (unstructured and structured), timing (post-training, one-shot, during fine-tuning, and sparse training from scratch), and criterion (magnitude, sensitivity, second-order, gradient-based, etc.). This review categorizes pruning techniques into unstructured and structured methods.
5.2.1. Unstructured Pruning
Unstructured pruning removes individual scalar weights without regard to their spatial or channel-wise organization. Given the network weights $W$ and the dataset $\mathcal{D}$, unstructured pruning applies a sparse binary mask $m \in \{0, 1\}^{|W|}$ to zero out unimportant weights, as follows:
$$\min_{m} \; \mathcal{L}(m \odot W; \mathcal{D}) \quad \text{s.t.} \quad \|m\|_0 \leq k,$$
where $\mathcal{L}$ is the loss function, $\|m\|_0$ counts the number of retained weights, and $k$ determines the desired sparsity level.
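A minimal sketch realizing such a mask via global magnitude ranking (the simplest saliency criterion) is given below; the function name and sparsity level are illustrative.

```python
import torch

def global_magnitude_masks(weights, sparsity=0.9):
    # weights: dict of layer name -> tensor. Keep the k largest-magnitude
    # weights across all layers; zero out the rest via a binary mask m.
    scores = torch.cat([w.abs().flatten() for w in weights.values()])
    k = max(1, int(scores.numel() * (1 - sparsity)))   # retained weights
    threshold = torch.topk(scores, k).values.min()
    return {name: (w.abs() >= threshold).float() for name, w in weights.items()}

weights = {"fc1": torch.randn(512, 256), "fc2": torch.randn(10, 512)}
masks = global_magnitude_masks(weights)
pruned = {n: weights[n] * masks[n] for n in weights}   # W <- m ⊙ W
```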
Early theoretical approaches framed pruning as an optimization problem that minimizes the loss increase due to weight removal. Optimal brain damage [176] utilizes the diagonal of the Hessian matrix to estimate weight saliency, proving that low-sensitivity weights can be removed with minimal error. This is refined by optimal brain surgeon [177], which employs the full inverse Hessian to update the remaining weights, eliminating the need for retraining. Han et al. [178,179] present a practical magnitude-based pipeline, which combines pruning with quantization and entropy coding to obtain large storage reductions.
To avoid the cost of dense pretraining, a class of single-shot, pruning-at-initialization methods has emerged. SNIP [180] scores connections by gradient sensitivity evaluated on mini-batches and prunes once at initialization. Wang et al. propose gradient signal preservation (GraSP) [181] to choose masks that preserve gradient flow. Iterative synaptic flow pruning (SynFlow) [182] introduces a data-agnostic synaptic flow saliency to avoid layer collapse during iterative pruning. These approaches motivate deeper analysis of when and why pruning should occur at initialization.
An alternative trajectory focuses on dynamic sparse training (DST), which maintains a sparse model throughout training and periodically rewires connections (dropping low-utility weights and growing promising ones). Methods such as RigL [183] use gradient and magnitude signals to guide growth steps. DST directly targets training-time FLOP and memory reductions rather than only post hoc compression, approaching dense accuracy while offering substantial compute savings during training.
Scaling unstructured pruning to billion-parameter LLMs has produced pragmatic one-shot methods that avoid costly retraining or heavy curvature estimation. SparseGPT [184] introduces an approximate second-order reconstruction scheme tailored to GPT-family architectures and demonstrates accurate one-shot pruning. Wanda [185] proposes a simple per-output score (weight magnitude × input activation norm) that outperforms plain magnitude pruning without retraining.
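The Wanda criterion is simple enough to sketch directly; the following is a minimal per-output-row implementation under the assumption of a linear layer and a batch of calibration activations (the function and variable names are illustrative).

```python
import torch

def wanda_prune(weight, activations, sparsity=0.5):
    # weight: (d_out, d_in); activations: (num_tokens, d_in) calibration inputs.
    # Score = |W_ij| * ||x_j||_2, ranked independently within each output row.
    scores = weight.abs() * activations.norm(p=2, dim=0)   # broadcast over rows
    k = int(weight.size(1) * sparsity)                     # weights removed per row
    drop = torch.topk(scores, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight).scatter_(1, drop, 0.0)
    return weight * mask                                   # one-shot, no retraining

w = wanda_prune(torch.randn(768, 768), torch.randn(2048, 768))
```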
5.2.2. Structured Pruning
On the other hand, structured pruning removes entire groups of parameters (filters, channels, neurons, attention heads, blocks, etc.) so that the pruned network contains smaller dense tensors rather than highly irregular sparsity. As modern accelerators are optimized for dense matrix operations, structured pruning explicitly targets hardware friendliness by enforcing block patterns and creating models that map directly to existing BLAS libraries without requiring specialized sparse kernels.
Early practical approaches frame structured pruning as filter or channel removal chosen by simple importance criteria. Li et al. [186] propose pruning entire convolutional filters whose removal minimally affects accuracy before fine-tuning, showing direct inference speed gains. ThiNet [187] introduces a data-driven filter selection strategy that estimates a filter’s contribution by solving a small reconstruction problem with responses from the next layer, illustrating that per-filter selection benefits from inter-filter redundancy and task signals rather than only within-filter magnitudes. Network slimming [188] moves towards training-time structured sparsity induction by introducing $\ell_1$ regularization on BatchNorm scale parameters to encourage channel sparsity during training. Subsequent structured methods emphasize better criteria and redundancy removal. FPGM [189] prunes filters according to their redundancy relative to others (the geometric median) rather than those with the smallest norms. NISP [190] propagates final-response importance scores backward to rank filters globally, offering another principled way to prioritize structural removals.
A complementary line of work shifts from hand-crafted heuristics to automated, device-aware pruning pipelines. NetAdapt [191] iteratively selects per-layer compression ratios using real latency or energy measurements on the target hardware instead of proxy FLOPs. AMC [192] formulates layer-wise pruning as a reinforcement learning policy search, automatically discovering compression schedules that balance accuracy against latency constraints.
The hardware ecosystem motivates a new middle ground between fully unstructured and coarse structured sparsity. To balance flexibility and efficiency, N:M sparsity [193] introduces a semi-structured pattern (e.g., 2:4 sparsity) supported by NVIDIA Ampere GPUs, where two out of every four weights are zeroed to double compute throughput. STEP [194] further proposes a preconditioned training framework that learns N:M fine-grained structured sparsity masks from scratch, enabling high-accuracy sparse networks under hardware-friendly N:M constraints without relying on dense pretrained models. Isomorphic pruning [195] proposes grouping parameters by computational topology to effectively handle the heterogeneous substructures of attention mechanisms.
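A minimal sketch of enforcing a 2:4 pattern by magnitude is shown below: within every contiguous group of four weights along the input dimension, the two largest-magnitude entries are kept; the function name is illustrative.

```python
import torch

def prune_2_of_4(weight):
    # Semi-structured 2:4 sparsity: keep the 2 largest-magnitude weights
    # in every contiguous group of 4 along the input dimension.
    d_out, d_in = weight.shape
    assert d_in % 4 == 0
    groups = weight.abs().reshape(d_out, d_in // 4, 4)
    keep = torch.topk(groups, k=2, dim=2).indices
    mask = torch.zeros_like(groups).scatter_(2, keep, 1.0)
    return weight * mask.reshape(d_out, d_in)

w = prune_2_of_4(torch.randn(128, 256))   # exactly 50% sparsity, hardware-friendly
```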
5.3. Knowledge Distillation
Knowledge distillation transfers knowledge from a large, complex teacher model to a smaller, simpler student model by training the student to mimic the teacher’s outputs. For model compression, the teacher is typically a pretrained VFM, while the student is a compact model suitable for deployment on resource-constrained platforms. The most influential formulation, which established the dominant response-based paradigm, is that of Hinton et al. [196], where the student model is trained to match the softened output probabilities of the teacher using a temperature parameter to smooth the logits.
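A minimal sketch of this response-based loss is shown below, blending the temperature-softened KL term with the standard cross-entropy on hard labels; the temperature, weighting, and class count are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: the teacher's probabilities smoothed by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_preds = F.log_softmax(student_logits / temperature, dim=1)
    # T^2 keeps soft-target gradients on the same scale as the hard-label term.
    kd = F.kl_div(log_preds, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = distillation_loss(torch.randn(32, 45), torch.randn(32, 45),
                         torch.randint(0, 45, (32,)))   # e.g., 45 scene classes
```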
While response-based methods have proven effective, they fail to capture the internal representational power of deep networks, particularly in very deep architectures where the final logit output is insufficient to guide the intermediate layers. To this end, feature-based distillation methods have emerged that align intermediate feature maps between the teacher and student. FitNets [197] pioneers this direction by introducing auxiliary regression losses on hidden layer outputs, allowing the student to learn richer representations beyond final outputs. Attention transfer [198] further refines this idea by matching attention maps derived from feature activations, allowing the student to learn where to look rather than exactly what to see.
Rather than distilling individual data points, relation-based methods transfer the relationships among multiple samples encoded by the teacher. Relational knowledge distillation (RKD) [199] encourages the student to preserve the distances and angles between sample pairs in the embedding space, ensuring that the structural topology of the teacher’s learned manifold is maintained. Contrastive representation distillation (CRD) [200] integrates contrastive learning by maximizing a lower bound on the mutual information between the teacher and student representations.
With the development of large-scale FMs and LLMs, some distillation efforts focus on scaling laws and efficient distillation pipelines. DistilBERT [201] demonstrates that a student model with half the layers of BERT can retain 97% of its performance by combining response-based distillation with masked language modeling during pretraining. TinyBERT [202] introduces a two-stage distillation process that first pretrains the student on large corpora using both response-based and feature-based losses, followed by task-specific fine-tuning with additional losses on attention distributions and hidden states.
5.4. Application in RS
The model compression techniques discussed above have been applied in various RS tasks to enable efficient deployment and near-real-time inference onboard resource-constrained devices.
Guo et al. [16] and Lu et al. [17] both explore structured pruning of CNNs such as VGG and ResNet tailored for RS image classification. The former utilizes a sensitivity function to selectively remove non-semantic filters, while the latter proposes an energy-based framework built on singular value decomposition (SVD) that remains robust on undertrained models. Both methods achieve moderate reductions in model size and FLOPs with comparable accuracy (e.g., compressing ResNet-50 from 25.78 M to 15∼18 M parameters).
Focusing on distilling knowledge from pretrained CNNs into a lightweight student, CDKD [18] and DKD [19] are proposed for change detection and target detection, respectively. The student model of CDKD (FC-Siam-conc) achieves performance comparable to the full ResNet-50 or VGG-19 with only 1.54 M parameters. DKD reduces the parameters of a heavy RetinaNet-152 by nearly 4× (from 71.03 M to 19.9 M) and the FLOPs by over 2×. In the context of transformer-based architectures, Wang et al. [20] propose burden-free distillation (BFD) to transfer general semantic knowledge from the visual encoder of CLIP (ViT-B/16) to task-specific change detection models via dual-temporal feature matching and a patch contrastive loss. Compared to network pruning, distillation generally achieves higher compression ratios with less accuracy drop, but requires additional training effort.
With respect to quantization, Li et al. [14] propose SPMix-Q, which leverages layer-wise sensitivity heterogeneity to assign progressively decreasing bit-widths, achieving segmentation performance comparable to the full-precision counterpart with only 1/13 of the model size and 1/29 of the computational cost. GHOST [15] employs a clustering-based hybrid quantization strategy to automatically optimize bit-widths. It also integrates distillation, utilizing a one-to-one self-teaching mechanism to distill knowledge from a full-precision teacher. To address the efficiency of fine-tuning large-scale RSFMs, Dong et al. [75] present UPetu, a unified PEFT framework that integrates quantization directly into adapter modules (EQAM) to reduce the size of the updated parameters.
Due to its simplicity and flexibility, quantization is often combined with other techniques such as distillation [15] and PEFT [75]. Furthermore, the hardware friendliness of quantization makes it a critical enabler for onboard deployment, often integrated directly into hardware-specific compilation pipelines to optimize latency, memory, and energy efficiency. For FPGA-based deployments, D’Abbondanza et al. [203] utilize hls4ml [80] to convert a QAT-trained U-Net, quantized to four bits via Brevitas, into FPGA firmware, achieving an 8.8× efficiency gain over Vitis AI’s DPU. Ziaja et al. [204] validate the standard Vitis AI [79] workflow by quantifying the performance of models converted to INT8 for DPU execution. Neris et al. [205] demonstrate that converting floating-point CNNs to 16-bit fixed-point precision using Vitis HLS significantly reduces resource usage on the Xilinx Kintex UltraScale [98]. On GPU-accelerated embedded systems, both Ijaz et al. [206] and Jankovic et al. [61] leverage TensorRT [62] (the standard optimization engine for NVIDIA GPUs) to perform PTQ. The former find that FP16 quantization on the Jetson Xavier NX or Nano offers the best balance of throughput and accuracy for disaster management CNNs, while the latter demonstrate that INT8 quantization is essential for real-time transformer inference on UAVs, reducing model size by ∼70%.
While low-bit quantization (e.g., INT4 or binary) offers substantial memory savings, it poses a significant risk to small-target representation in RS imagery. Unlike natural images, where targets often dominate the frame, RS images frequently contain small targets, such as vehicles and ships, that occupy only a few pixels due to high-altitude imaging. The activation maps corresponding to these objects rely on subtle high-frequency variations that are easily zeroed out or merged with the background noise floor when the dynamic range is compressed into a low-bit format. For instance, on the NUAA-SIRST dataset, reducing precision from 32-bit to INT4 using standard PTQ caused an IoU drop from 72.69% to 60.32% [14]. Empirical evidence indicates that INT8 is generally regarded as the “safe boundary” for uniform quantization. To push beyond this limit without losing small targets, SPMix-Q [14] proposes a mixed-precision strategy: it maintains high precision, such as FP16 or INT8, for the initial shallow layers that capture fine-grained spatial details, while allowing aggressive quantization, such as 2–4 bits, for the deeper semantic layers.
Most existing model optimization techniques are primarily developed and validated on optical images, which may not fully capture the unique characteristics of other data modalities, e.g., high spectral dimensionality, varying spatial resolutions, and sensor-specific noise patterns. For instance, Shinde et al. [207] show that uniform pruning or fixed-bit quantization is suboptimal for land cover classification of RS imagery, which often involves multi-resolution input and varying layer importance. Their adaptive layer-wise pruning combined with resolution scaling jointly considers spatial resolution and spectral information to preserve discriminative features. For multispectral input data, the high dimensionality and strong inter-band correlation allow channel-wise pruning and low-rank factorization with limited accuracy loss. Zou et al. [208] observe that channel importance in RS models is less separable than in natural-image models due to top-down viewpoints, scale variation, and atmospheric noise. They propose RemoteTrimmer, which enables effective structural pruning by amplifying inter-channel importance differences via channel attention and introducing an adaptive mining loss that focuses training on difficult, noise-corrupted samples. For SAR input data, aggressive early-layer pruning or quantization can significantly degrade performance due to the need to preserve scattering and texture statistics under speckle noise [142].
7. Conclusions
The rapid evolution of Earth observation systems, together with the increasing diversity and resolution of remote sensing modalities, has created an urgent demand for intelligent onboard processing capable of overcoming the long-standing limitations of bandwidth, latency, and operational inflexibility. This review provides the first perspective on the deployment of RSFMs on resource-constrained onboard platforms, bridging the gap between recent advances in foundation model research and the practical constraints of spaceborne computing. While RSFMs offer superior performance and generalization, their implementation faces severe challenges stemming from resource constraints, environmental factors, and the energy efficiency ratio of the hardware. To narrow this gap, from the algorithm perspective, model compression techniques including quantization, pruning, and knowledge distillation are analyzed as essential enablers for saving onboard resources. From the hardware and resource perspective, a typical case study and analysis are presented to demonstrate the feasibility of deploying RSFMs onboard LEO satellites under diverse scenarios, considering critical memory, power, and compute constraints. Ongoing progress in hardware–software co-optimization and edge-oriented toolkits has also made the deployment pipeline more convenient and efficient.
Furthermore, continued research on radiation-tolerant high-performance platforms, memory optimization, distributed inference, and human-centric VLM interfaces will further unlock the potential of larger-scale RSFMs for onboard autonomy in deeper space. We hope this review can open the door for future developments of next-generation intelligent Earth observation agents empowered by RSFMs.