Article

Promptable Foundation Models for SAR Remote Sensing: Adapting the Segment Anything Model for Snow Avalanche Segmentation

1 DEIB—Dipartimento di Elettronica Informazione e Bioingegneria, Politecnico di Milano, 20133 Milano, Italy
2 NORCE—Norwegian Research Centre AS, 9019 Tromsø, Norway
3 Department of Mathematics and Statistics, UiT The Arctic University of Norway, 9037 Tromsø, Norway
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(3), 519; https://doi.org/10.3390/rs18030519
Submission received: 3 January 2026 / Revised: 30 January 2026 / Accepted: 3 February 2026 / Published: 5 February 2026
(This article belongs to the Section Environmental Remote Sensing)

Highlights

What are the main findings?
  • The Segment Anything Model can be effectively adapted to Sentinel-1 Synthetic Aperture Radar composites for avalanche segmentation via lightweight domain adaptation modules.
  • Prompt-engineering strategies and an encoder-efficient fine-tuning procedure improve robustness to imprecise prompts while keeping training practical.
What are the implications of the main findings?
  • Integrated into a semi-automatic annotation tool, the adapted model reduces expert workload and speeds up the creation of high-quality avalanche inventories.
  • The proposed adaptation recipe provides a transferable path to bring segmentation foundation models beyond RGB and into remote sensing workflows.

Abstract

Remote sensing solutions for avalanche segmentation and mapping are key to supporting risk forecasting and mitigation in mountain regions. Synthetic Aperture Radar (SAR) imagery from Sentinel-1 can be effectively used for this task, but training an effective detection model requires gathering a large dataset with high-quality annotations from domain experts, which is prohibitively time-consuming. In this work, we aim to facilitate and accelerate the annotation of SAR images for avalanche mapping. We build on the Segment Anything Model (SAM), a segmentation foundation model trained on natural images, and tailor it to Sentinel-1 SAR data. Adapting SAM to our use case requires addressing several domain-specific challenges: (1) domain mismatch, since SAM was not trained on satellite or SAR imagery; (2) input adaptation, because SAR products typically provide more than three channels while the SAM is constrained to RGB images; (3) robustness to imprecise prompts that can affect target identification and degrade the segmentation quality, an issue exacerbated in small, low-contrast avalanches; and (4) training efficiency, since standard fine-tuning is computationally demanding for the SAM. We tackle these challenges through a combination of adapters to mitigate the domain gap, multiple encoders to handle multi-channel SAR inputs, prompt-engineering strategies to improve avalanche localization accuracy, and a training algorithm that limits the training time of the encoder, which is recognized as the major bottleneck. We integrate the resulting model into a segmentation tool and show experimentally that it speeds up the annotation of SAR images.

Graphical Abstract

1. Introduction

Mapping avalanche activity is a crucial component of forecasting and risk mitigation in mountainous regions [1]. Every year, more than 100 avalanche-related fatalities are reported across Europe, and infrastructure, roads, and buildings are damaged by this phenomenon [2]. In-field measurements are the most reliable option to quantitatively assess the area covered by an avalanche, but they are expensive, limited by accessibility, risky for observers, and unsuitable for broad and continuous monitoring [3]. Remote sensing has therefore emerged as a valuable alternative, enabling safe, large-scale, and frequent monitoring of avalanche activity through the systematic acquisition of high-quality satellite imagery, particularly at high latitudes [4]. Examples of SAR backscatter images and the corresponding expert masks are shown in Figure 1.
Among the available remote sensing modalities, Synthetic Aperture Radar (SAR) data is especially well suited for avalanche detection, as it is independent of weather conditions and exhibits clear scattering patterns of snow debris [5]. However, the manual identification of snow avalanches in SAR images is complex, time-consuming, and requires expert knowledge [6]. To overcome these limitations, several automated approaches for continuous avalanche monitoring have been proposed in recent years [7]. In particular, deep learning-based image segmentation methods currently represent the most promising solutions. Nevertheless, existing models still suffer from a high rate of false positives and fail to reach the same level of accuracy as human experts [8]. Therefore, manual annotation of SAR images still represents the gold standard in the field [4].
Among the major obstacles limiting further improvements in deep learning methods is the scarcity of labeled data needed to train more accurate models. Manual annotations are not only costly to produce but also prone to errors. Smaller avalanches are often overlooked, while imprecise drawing of the mask contours introduces label noise that negatively influences the performance of segmentation models [2,9]. These inaccuracies arise from a combination of annotator subjectivity, speckle noise in the SAR images, and ambiguity in interpreting the actual contours of the debris.
The goal of this study is to develop a tool that facilitates the annotation of snow avalanches and improves the quality of the segmentation masks in SAR imagery. To this end, we adapt the SAM [10] to the task of avalanche annotation and evaluate its integration into a semi-automatic annotation workflow. The SAM is a computer vision foundation model able to identify and segment any object in natural images with remarkable accuracy, requiring minimal user inputs in the form of prompts, namely simple clicks or bounding boxes (BBs) drawn around the objects of interest. Rather than relying on explicit class information, the SAM leverages prompt-based guidance to localize the target object, which makes it highly flexible and transferable to downstream tasks with minimal retraining. However, images from the SAR domain significantly differ from the RGB images on which the SAM was originally trained, preventing its straightforward application for avalanche segmentation in SAR images. Recent studies have demonstrated that adapting the SAM to specialized imaging modalities can substantially reduce annotation effort while maintaining high segmentation quality [11,12,13,14]. These findings motivate the exploration of the SAM as a semi-automated annotation tool for snow avalanches in SAR data.
The main contribution of our work is to extend the SAM to our use case by addressing the following key challenges:
  • Domain adaptation: Given the limited amount of training data, effective domain adaptation must be achieved by fine-tuning only a small subset of the SAM’s parameters.
  • Input adaptation: The standard SAM architecture can only process three-channel inputs, whereas raw SAR products consist of a different number of channels.
  • Improved prompt robustness: The SAM struggles with small targets and with imprecise prompts, e.g., a bounding box (BB) much larger than the object of interest. Identifying prompt strategies that improve robustness, especially for segmenting small avalanches, is therefore critical.
  • Training optimization: Given that even the lightest SAM variant exceeds 90 million parameters, the fine-tuning procedure must be carefully designed to be computationally feasible on commercial hardware.
To address these challenges, we (1) employ adapters [15] in combination with decoder fine-tuning for tackling domain shifts, (2) introduce a multiple encoder method inspired by [16] to process the six channels of SAR images, (3) propose a specific prompt strategy based on BBs, and (4) introduce a custom algorithm for training the SAM.
The experimental results demonstrate that our approach successfully adapts the SAM to the SAR domain, achieving competitive or superior performance compared with existing methods in the literature. Moreover, the proposed prompt strategy reduces the sensitivity to prompt precision, enabling performance comparable to prompt-free segmentation approaches when using a full-image (minimum precision) prompt. Finally, the integration of our method into a semi-automatic annotation tool significantly improves annotation efficiency, demonstrating its practical value for generating a large-scale inventory of snow avalanches.

1.1. Related Work

In this subsection, we provide a few essential notions on SAR remote sensing (Section 1.1.1), we mention the most effective solutions for segmenting avalanches in SAR images (Section 1.1.2), and then we introduce the SAM and SAM adaptation (Section 1.1.3 and Section 1.1.4, respectively).

1.1.1. Synthetic Aperture Radar Images for Avalanche Mapping

SAR images are obtained from the backscattered energy of the microwave signals emitted by the radar itself. Unlike optical sensors that are passive and capture reflected sunlight, SAR is an active technology that operates independently of sunlight or cloud cover. At typical operating frequencies (e.g., X, C, or L band), SAR sensors are highly sensitive to surface properties, such as roughness and moisture. Therefore, debris deposited by snow avalanches can be distinguished from the surrounding undisturbed snow because of its different roughness and structural properties, which increase the backscatter and enable detection in SAR images [2,9].
SAR sensors operate using different polarization modes, which describe the orientation of the transmitted and received electromagnetic waves. These are generally categorized into co-polarized signals (VV or HH), where the transmit and receive orientations are the same, and cross-polarized signals (VH or HV), where they are orthogonal. When full polarimetric measurements are available (all co- and cross-polarized channels), target decomposition methods can be used to interpret dominant scattering mechanisms [17,18,19]. Sentinel-1, however, provides only dual polarized data (either VV/VH or HH/HV); thus, standard decompositions are not directly applicable, and dual polarization variants are typically more approximate [20,21]. Therefore, in this work, we directly use VV and VH backscattering:
  • Vertical transmit, vertical receive (VV) is most sensitive to rough surface scattering and the most informative data source for avalanche segmentation [9].
  • Vertical transmit, horizontal receive (VH) is most sensitive to volume scattering and often used to complement the VV channel.
Human experts annotate avalanche debris by looking at RGB composites obtained by combining two co-registered SAR images taken at consecutive times t0 and t1. The time offset between two different passes can vary between 6 and 12 days. The most common RGB composites are given by [VV0, VV1, VV0] or created through specific algorithms like those described in Appendix B.1. Unfortunately, SAR images are affected by speckle noise, which negatively influences their use for many tasks. Speckle noise can be reduced during image preprocessing, e.g., by applying a Lee filter [22]. Modern noise removal approaches consist of pretraining deep learning models in a self-supervised way to reduce the impact of speckle noise on the final performance [23,24].
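As an illustration, a composite of this kind can be assembled as follows; this is a minimal sketch, where the dB clipping range and the rescaling scheme are our assumptions, not values taken from the paper:

```python
import numpy as np

def sar_rgb_composite(vv_t0, vv_t1, db_min=-25.0, db_max=0.0):
    """Build a [VV0, VV1, VV0] composite from two co-registered VV
    backscatter images in dB. The clipping range is illustrative."""
    def rescale(img):
        # Clip to a plausible backscatter range, then map to [0, 1].
        return (np.clip(img, db_min, db_max) - db_min) / (db_max - db_min)
    return np.stack([rescale(vv_t0), rescale(vv_t1), rescale(vv_t0)], axis=-1)
```

Changes between the two acquisitions then appear as color shifts in the composite, since the red and blue channels come from t0 and the green channel from t1.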
Useful auxiliary inputs that aid the human annotator in performing manual segmentation are the Digital Elevation Map (DEM) and the Meteorological Fields (Mets) data. DEM images associate with each pixel a real value representing the elevation above sea level, expressed in meters. In the context of snow avalanche detection, DEM data can inform the model about areas where avalanches cannot occur, i.e., flat surfaces far from mountain slopes. The DEM can be used to derive the Slope Angle (SA), a topographical feature that can be used to identify release zones, i.e., those regions where avalanches can release debris. In particular, avalanche debris can be found in the proximity of slopes whose inclination ranges between 30 and 50° [9]. The SA is defined as follows:
θ = arctan( |∇E(p)| ),
where E(p) is the elevation associated with pixel p and ∇E(p) represents its spatial gradient.
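The slope angle computation above can be sketched with NumPy's finite-difference gradient; the 10 m pixel spacing matches the dataset's ground sampling distance:

```python
import numpy as np

def slope_angle(dem, pixel_size=10.0):
    """Slope angle theta = arctan(|grad E|), returned in degrees.
    dem: 2-D elevation array in meters; pixel_size: ground sampling
    distance in meters (10 m for this dataset)."""
    dz_dy, dz_dx = np.gradient(dem, pixel_size)  # finite-difference gradient
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
```

A plane that rises 10 m per 10 m pixel yields a uniform 45° slope, and a flat DEM yields 0° everywhere.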
The relevance of Met data for avalanche detection has been highlighted in previous work [2,25,26], but its impact on automatic segmentation is still to be determined. The Met data consist of a time series associated with each SAR image, spanning the entire duration [t0, t1] between the two satellite acquisitions.

1.1.2. Automated Avalanche Detection with Deep Learning

Automatic detection of snow avalanches with deep learning is still a relatively new field. Deep learning approaches must reach a certain degree of reliability before being deployed in avalanche warning services to assess the avalanche danger and support decision making in specific communities [2]. The most prominent deep learning model for avalanche segmentation, proposed by Bianchi et al. [9], is based on a U-Net with an encoder-decoder structure and skip connections, illustrated in Figure 2, which takes as input SAR images, the terrain slope, and other topographic features. While the model achieved performance superior to existing approaches in automated avalanche detection, it still produces several detections not corresponding to annotations in the test set. Although most of them are false alarms, some of the false positives were actual avalanches missed during labeling by the expert, highlighting a limitation in the manual annotation process.

1.1.3. Segment Anything Model

The SAM [10] is a computer vision foundation model that can identify and segment almost any object in natural RGB images, achieving remarkable accuracy. The SAM does not leverage any class information but instead relies on a minimal user prompt to isolate objects from their background, thus generating binary masks. This makes the SAM adaptable to many downstream tasks with minimal to no retraining, often through prompt engineering (given the right prompt, the model generalizes well even for unseen objects and different domains). Prompts can be points, BBs, masks, and text. BBs in particular are represented by a tuple [x1, y1, x2, y2] that corresponds to the top-left and bottom-right corners.
The original SAM was trained using a data engine technique composed of three subsequent phases: assisted manual, semi-automatic, and fully automatic. The procedure progressively reduces the presence of a human in the loop, leading to the creation of the SA-1B dataset [10], the largest dataset currently available for segmentation on natural images.
The architecture of the SAM, depicted in Figure 3, is composed of the following:
  • Image Encoder: This is a classical pretrained Vision Transformer (ViT) [27] which takes as input 1024 × 1024 RGB images and outputs 256 × 64 × 64 embeddings.
  • Prompt Encoder: Prompts can be either sparse (points, boxes, and text) or dense (segmentation masks). Sparse prompts are mapped to 256-dimensional embeddings and used later for computing cross-attention with the image embedding in the decoder. Dense prompts are processed with convolutional layers and directly mapped to the image embedding.
  • Mask Decoder: The decoder takes as input the image embedding and the encoded sparse prompts and outputs a probability map. The decoder architecture updates both the image and prompt embeddings by relying on self-attention applied to the prompt embedding and on bidirectional cross-attention between the image features and the prompt embeddings.
There are three main versions of the SAM, which differ in their number of parameters:
  • ViT-B: 91 million parameters;
  • ViT-L: 308 million parameters;
  • ViT-H: 636 million parameters.
The choice between these versions is driven by the target latency, hardware requirements, and training set size in the case of SAM retraining.

1.1.4. Adapting the SAM

The literature presents several solutions for adapting the SAM to different domains. Med-SAM consists of a successful retraining of the SAM on medical images, resulting in an effective tool for assisting doctors and medical experts [11]. In Med-SAM, both the SAM encoder and the decoder are fine-tuned on a very large collection of labeled medical images. This retraining strategy is not viable in our specific case due to the shortage of annotated images and computational constraints.
Another successful approach to adapting the SAM consists of substituting the decoder. This strategy allowed applying the SAM to the SAR domain and enabled the construction of SAMRS, the largest dataset for semantic segmentation in remote sensing [13]. Substituting the decoder, usually paired with encoder adaptation, enables multi-class segmentation but gives up the possibility of using prompts [13,14,15,28]. It is possible to modify the decoder of the SAM to perform multi-class segmentation while still allowing prompts by changing the convolutional layers of the decoder. However, this modification also requires fundamental changes to the prompt handling, and to our knowledge, it has not been applied in practice. In our study, we did not explore substitution or major modifications to the decoder for two reasons:
  • Avalanche segmentation can be cast as a binary segmentation problem, where avalanches play the role of the foreground class.
  • We wanted to preserve prompts, which are fundamental for semi-automatic segmentation.
Since the SAM image encoder is characterized by a large number of parameters and layers, it represents a significant computational bottleneck. As a consequence, full fine-tuning of the SAM is often an impractical solution for domain adaptation.
The recent literature focuses on Parameter-Efficient Fine-Tuning (PEFT) strategies to adapt the SAM to a new domain. Many methods have been proposed in this direction, and the most popular ones are Low-Rank Adaptation (LoRA) [29], adapters [15], and Auto-SAM [30]. The latter tries to adapt the SAM to a new domain through the introduction of a parallel network and belongs to another family of methods that tries to improve SAM performance on new domains by adding additional prompts or modifying existing ones [12,14,16,30].

2. Materials and Methods

Section 2.1 describes the avalanche dataset, the available modalities (multi-temporal SAR channels, DEM, and derived SA), and the preprocessing needed to meet the SAM’s fixed input resolution. Section 2.2 details how we adapt the SAM to the SAR domain by training lightweight adapters in the ViT image encoder and fine-tuning the mask decoder. Section 2.3 describes our BB-based prompting strategy, including prompt generation from masks and augmentation to handle imprecise user inputs. Section 2.4 presents a compute-efficient training scheme that reuses image embeddings for all prompts associated with the same image. Section 2.5 introduces our multi-encoder architecture to leverage a larger number of input channels, based on supervised embedding alignment and fusion. Section 2.6 then combines these elements into the final three-phase training procedure and reports the main optimization settings. Finally, Section 2.7 presents the web-based tool used to assess the method in a human-in-the-loop annotation workflow.

2.1. Dataset

The dataset consisted of 2681 labeled samples acquired from various regions in Norway. Each sample maintained a ground sampling distance of 10 m × 10 m per pixel, with image resolutions ranging from 355 × 363 to 512 × 512 pixels. Given the 10 m pixel spacing, deposits smaller than 1–5 pixels may not be consistently detectable; therefore, the dataset and reported performance primarily reflect avalanches that are resolvable at this scale.
Each observation comprises three distinct data modalities: SAR, DEM, and Met. Our preliminary results showed that providing Met data to the segmentation model did not improve the performance, despite the relevance of these data in avalanche detection (see Appendix D). Therefore, Met data are not discussed further in the following.
The SAR data consists of two SAR images, each with both the VV and VH channels, collected at time steps t0 and t1. In our dataset, the two images were taken either 6 or 12 days apart, and the image values are represented as a normalized radar cross-section (sigma nought) expressed in decibels (dB). To create an RGB image composite, each SAR channel must also be rescaled from the original dB scale to [0, 1]. For this dataset, the images used for manual labeling were created through Algorithm A1. An example of such an RGB composite is presented in Figure 4. The algorithm and the details on manual labeling are deferred to Appendix B.1.
The values of the single-channel DEM images in our dataset ranged from 19.11 m to 2274.41 m above sea level, with a mean value of 675.82 m and a standard deviation of 380.62 m. The images had to be rescaled to make them compatible with RGB standard values ([0, 255] integer or [0, 1] float). The DEM had a spatial resolution of 10 m and was resampled to the same grid as the SAR images. We also processed the DEM images to derive the SA values, expressed in angular degrees.
To satisfy the fixed input requirement of the SAM image encoder (1024 × 1024 pixels), each sample was resized such that its longest dimension matched the target resolution. For non-square samples, the remaining area was zero-padded, preserving the original spatial proportions and preventing geometric distortion of the SAR and DEM features. Additionally, each RGB image was normalized to a zero mean and unit standard deviation before being fed to the encoder.
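A minimal sketch of this preprocessing step, assuming nearest-neighbor resampling for brevity (the actual interpolation method used in the pipeline is not specified here):

```python
import numpy as np

def prepare_for_sam(img, target=1024):
    """Resize so the longest side equals `target`, zero-pad the rest,
    then normalize to zero mean and unit standard deviation.
    img: H x W x C float array."""
    h, w, c = img.shape
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Nearest-neighbor resampling via index lookup (illustrative only).
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    out = np.zeros((target, target, c), dtype=img.dtype)
    out[:new_h, :new_w] = resized          # zero-pad the remaining area
    return (out - out.mean()) / (out.std() + 1e-8)
```

Note that in this sketch the normalization statistics include the padded region; whether padding is excluded from the statistics in the actual pipeline is not stated.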

2.2. Domain Adaptation

As previously discussed, the SAM was originally trained on RGB images from the natural image domain, which substantially differed from the SAR and DEM data in our dataset. Adapting the SAM to avalanche segmentation requires a fine-tuning step in which a subset of the model parameters is retrained. Training the entire model would lead to severe risks of overfitting, given the limited size of our dataset and the large number of model parameters (the smallest version of the SAM used in our experiments was based on ViT-B and contained over 91 M parameters). On top of that, fine-tuning the SAM requires great computational effort. In the following, we separately discuss how we adapted the encoder and decoder components of the SAM.

2.2.1. Image Encoder

In the SAM, most of the parameters are concentrated in the image encoder, which in the model based on ViT-B comprises approximately 86 M parameters and represents the major bottleneck to performing both training and inference. As discussed in Section 1.1.4, several methods have been proposed in the literature to adapt the SAM’s image encoder and, in general, large foundation models to different domains. In this work, we experimented with Auto-SAM [30], LoRA [29], and adapters [15]. Among these, we found that the adapters yielded the best performance in terms of the Intersection over Union (IoU) metric on the avalanche class, which served as our main validation metric. Further details are deferred to Appendix F.
Adapters are trainable components that are placed in the transformer block of the ViT between the multi-head attention and the residual connection and in parallel with the Multi-Layer Perceptron (MLP) layer, as shown in Figure 5. These modules transform the intermediate hidden states x of the ViT as follows:
Adapter(x) = Up( ReLU( Down(x) ) ),
where ReLU is the standard activation function and Up ( · ) and Down ( · ) represent two fully connected layers that perform upscaling and downscaling, respectively.
Adding adapters to every transformer block of the ViT-B introduced 7 M parameters, which amount to only about 10% of the encoder parameters and thus reduce the number of trainable parameters by over 90% with respect to full fine-tuning. We did not find a benefit in using dropout and set the MLP ratio, i.e., the ratio between the number of output and input neurons of the down-projection layer, to 0.25.
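A bottleneck adapter of this form can be sketched in PyTorch as follows; this is a minimal sketch, with the residual pathway supplied by the surrounding transformer block, and `dim=768` corresponding to the ViT-B hidden size:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: Adapter(x) = Up(ReLU(Down(x))).
    The surrounding ViT block provides the residual connections."""
    def __init__(self, dim=768, mlp_ratio=0.25):
        super().__init__()
        hidden = int(dim * mlp_ratio)       # 0.25 -> 192 for ViT-B
        # nn.Linear applies the default U(-sqrt(k), sqrt(k)) initialization,
        # with k = 1 / in_features, as described in the text.
        self.down = nn.Linear(dim, hidden)  # down-projection
        self.up = nn.Linear(hidden, dim)    # up-projection

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))
```

Only these adapter parameters (plus the decoder) are trained, while the pretrained attention and MLP weights of the encoder remain frozen.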
Following the standard implementation of the PyTorch (v2.5.1) Linear layer, the weights and biases in the Up and Down layers of the adapters were initialized from a uniform distribution U(−√k, √k), where the parameter k is defined as follows:
k = 1 / in_features.
This is a rather generic and uninformative weight initialization, which usually works best when many training samples are available to learn the best weight configuration. However, preliminary results showed that initialization schemes more tailored to our use case, including pretraining the adapters with self-supervised objectives such as missing data imputation and speckle denoising, did not yield significant improvements in our experiments. Additional details on pretraining and weight initialization are discussed in Appendix C.

2.2.2. Decoder

Since the decoder already outputs a binary mask, which in our case served as the avalanche class, we did not have to change the architecture of the decoder. Therefore, we simply fine-tuned the decoder as illustrated in previous work [13,14,15]. Plain fine-tuning without modifying the decoder architecture allowed us to preserve the prompt for the semi-automatic annotation, which is the main use case for our model.

2.3. Robustness to Inaccurate Prompts

We adopted BBs as prompts for the SAM, as they are intuitive, easy to provide, and widely recognized as the most effective prompting strategy for semi-automatic annotation, particularly in the context of SAR imagery [13]. We created BBs from the segmentation masks as follows: (1) computing for each avalanche the minimum enclosing rectangle; (2) enlarging the BBs through an ad hoc augmentation strategy; (3) merging intersecting BBs to increase efficiency and simulate a more realistic human input. All three steps are illustrated in Figure 6.
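The first and third steps can be illustrated as follows, assuming an instance-labeled mask (0 = background, 1..N = avalanche ids); the helper names are ours:

```python
import numpy as np

def boxes_from_mask(instance_mask):
    """Minimum enclosing box [x1, y1, x2, y2] for each labeled avalanche.
    instance_mask: 2-D int array, 0 = background, 1..N = instance ids."""
    boxes = []
    for lab in np.unique(instance_mask):
        if lab == 0:
            continue
        ys, xs = np.nonzero(instance_mask == lab)
        boxes.append([xs.min(), ys.min(), xs.max(), ys.max()])
    return boxes

def merge_intersecting(boxes):
    """Repeatedly merge overlapping boxes into their common enclosing box."""
    boxes = [list(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        out = []
        while boxes:
            a = boxes.pop(0)
            rest = []
            for b in boxes:
                # Axis-aligned overlap test.
                if a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]:
                    a = [min(a[0], b[0]), min(a[1], b[1]),
                         max(a[2], b[2]), max(a[3], b[3])]
                    changed = True
                else:
                    rest.append(b)
            boxes = rest
            out.append(a)
        boxes = out
    return boxes
```

Step (2), the augmentation of the boxes, is described in the next paragraph.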
Since the SAM does not explicitly leverage any class information, the prompt alone determines the target object to be segmented. In the context of avalanche segmentation, we noticed that inaccurate BBs led to a drop in segmentation performance, suggesting that inaccurate localization hampers the model’s ability to correctly identify avalanche debris. In an operational setting, however, some degree of imprecision in human-provided prompts is unavoidable. To improve SAM robustness when inaccurate BBs are prompted, our prompt strategy enlarges the BB by displacing the four coordinates with a random value drawn from a uniform distribution U(0, k), where k denotes the maximum number of offset pixels. During training, we used a mixed prompt strategy in which 80% of the prompts were accurate boxes (k = 40), 10% were inaccurate (k = 200), and in 10% of the cases, the BBs were replaced with full-image BBs. To perform model selection in the validation stage, we only used accurate boxes. Instead of drawing the displacement from the uniform distribution U(0, 40), we used a fixed value of 20. We observed that the introduction of the inaccurate and full-image BBs during training made the model more robust to imprecise prompts without affecting the usage of accurate prompts in validation.
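The mixed prompt strategy can be sketched as follows, using the sampling proportions and k values from the text; the function names and the clipping to image bounds are our assumptions:

```python
import numpy as np

def augment_box(box, img_size, k, rng):
    """Enlarge a box by displacing each coordinate outward by U(0, k) pixels,
    clipped to the image bounds."""
    x1, y1, x2, y2 = box
    dx1, dy1, dx2, dy2 = rng.uniform(0, k, size=4)
    h, w = img_size
    return [max(0, x1 - dx1), max(0, y1 - dy1),
            min(w - 1, x2 + dx2), min(h - 1, y2 + dy2)]

def training_prompt(box, img_size, rng):
    """Mixed strategy: 80% accurate (k = 40), 10% inaccurate (k = 200),
    10% full-image boxes."""
    u = rng.random()
    h, w = img_size
    if u < 0.8:
        return augment_box(box, img_size, k=40, rng=rng)
    if u < 0.9:
        return augment_box(box, img_size, k=200, rng=rng)
    return [0, 0, w - 1, h - 1]
```

At validation time, a fixed displacement of 20 pixels would be applied instead of the random draw.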
On top of improving robustness to inaccurate prompts, we found that our strategy yields two additional benefits. The first is that the proposed augmentation strategy trains the model to perform full-image segmentation, enabling prompt-free inference. The second benefit is that training with imprecise and full-image prompts improves the segmentation performance over small avalanches. This is a particularly important result since the predictions from the baseline model exhibit a positive correlation between the avalanche size and the IoU, indicating that smaller avalanches are more challenging to segment. The performance degradation on small targets can be attributed to different factors:
  • Bias in the dataset: Small avalanches are more difficult to segment accurately for human annotators and are also more affected by image noise, making ground truth labels less reliable.
  • Architectural: ViT, which serves as the backbone of SAM, lacks skip connections that would preserve fine-grained spatial details from the early layers. Additionally, the fixed receptive field is not well suited for detecting objects at different scales.
The original SAM paper found that eliminating all connected components with areas smaller than 100 pixels significantly improved IoU performance, highlighting a fundamental limitation of the SAM in segmenting small objects. In Appendix E, we discuss additional details and attempts at improving the detection performance on small avalanches.

2.4. Resource Optimization

As discussed in Section 2.2, we employed the ViT-B variant of the SAM, which contains 91 million parameters, 86 million of which reside in the image encoder and constitute the primary computational bottleneck during fine-tuning. In the SAM, the image encoder and the prompt encoder operate independently; the image embedding depends only on the input image, while the prompt embedding depends exclusively on the provided prompt (e.g., the BB). This architectural property becomes particularly relevant in our setting, where multiple BB prompts are associated with different avalanches appearing in the same image. We note that this differs from the more common setting, where a natural image contains a single object of interest.
Processing each image-prompt pair independently would result in redundant and computationally expensive recomputation of the same image embedding. To avoid this redundancy, we computed the image embedding only once per image and reused it for all associated prompts. Algorithm 1 details the data preparation procedure that enables the resource optimization. The key idea is to replicate the image embedding (Line 6 of Algorithm 1) so that it can be paired with each prompt and processed in parallel by the decoder. This allows all prompts associated with the same image to be evaluated simultaneously, significantly improving training efficiency. All the repeated image embeddings are then concatenated to form a single expanded batch (Line 7), which is fed to the decoder together with the corresponding concatenated prompt embeddings (Line 8).
Depending on the number of prompts, the proposed parallelization can occupy a large amount of memory; however, even when memory constraints force each image to be processed individually, reusing the embedding still reduces the overall compute time. Indeed, handling prompts efficiently at run time allowed us to train simultaneously on all the avalanches of the same image and reduced the training time by approximately 63% without impacting the number of epochs needed to reach convergence.
Algorithm 1 Data preparation for resource optimization.
Input: 
Image embeddings I   =   { z i } i   =   1 B , where z i     R C   ×   H   ×   W is the embedding of the ith image and B is the batch size, Prompt embeddings P   =   { p i } i   =   1 B , where p i     R L i   ×   2   ×   C and Li is the number of prompts for the ith image
Output: 
Expanded image embeddings E ^ and concatenated prompt embeddings P ^
  1: function PrepareDecoderInput($I$, $P$)
  2:     $K \leftarrow \sum_{i=1}^{B} L_i$ ▹ Total number of prompts
  3:     $\hat{I} \leftarrow [\,]$
  4:     $\hat{P} \leftarrow [\,]$
  5:    for $i = 1, \dots, B$ do
  6:        $\hat{z}_i \leftarrow \mathrm{Repeat}(z_i, L_i)$ ▹ New shape: $L_i \times C \times H \times W$
  7:        $\hat{I} \leftarrow \mathrm{Concat}(\hat{I}, \hat{z}_i)$
  8:        $\hat{P} \leftarrow \mathrm{Concat}(\hat{P}, p_i)$
  9:    end for
10:    return $\hat{I}$, $\hat{P}$ ▹ Final shapes: $K \times C \times H \times W$ and $K \times 2 \times C$
11: end function
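In PyTorch-like code, the procedure amounts to a few lines (a minimal sketch with hypothetical variable names; the `Repeat` operation maps to a broadcasted `expand`):

```python
import torch

def prepare_decoder_input(image_embs, prompt_embs):
    """Pair each image embedding with all of its prompts (sketch of Algorithm 1).

    image_embs:  list of B tensors, each of shape (C, H, W)
    prompt_embs: list of B tensors, each of shape (L_i, 2, C)
    Returns tensors of shape (K, C, H, W) and (K, 2, C), with K = sum of L_i.
    """
    expanded, prompts = [], []
    for z, p in zip(image_embs, prompt_embs):
        # Repeat the image embedding once per prompt of this image (a view, not a copy).
        expanded.append(z.unsqueeze(0).expand(p.shape[0], *z.shape))
        prompts.append(p)
    return torch.cat(expanded, dim=0), torch.cat(prompts, dim=0)
```

The expanded batch can then be passed to the decoder in a single forward call.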

2.5. Input Adaptation

The SAM image encoder was pretrained on RGB images, and it expects a three-channel input. Our avalanche dataset, however, provides six co-registered channels: VV0, VV1, VH0, VH1, DEM, and SA. To exploit all available information without altering the pretrained encoder architecture, we adopted a multi-encoder strategy inspired by SAM with Multiple Modalities (SAMM) [16].
Concretely, we used two SAM image encoders that process complementary triplets, namely a primary encoder fed with [VV0, VV1, DEM] and a secondary encoder fed with [VH0, VH1, SA], with all channels normalized to a zero mean and unit standard deviation. Both encoders share the same backbone architecture and are adapted with the same PEFT mechanism described in Section 2.2. For a batch of size $B$, the two image encoders produce embeddings $e_1, e_2 \in \mathbb{R}^{B \times 256 \times 64 \times 64}$.
A key requirement of this design is that the embeddings produced by the two encoders are compatible with a single mask decoder so that they can be fused and decoded consistently. In the following, we describe our task-aware alignment strategy and the fusion mechanism used to combine the aligned embeddings.

2.5.1. Embedding Alignment

In the SAMM, the auxiliary encoder is aligned to a frozen primary encoder by minimizing a distance metric between their embeddings (e.g., Mean Squared Error (MSE)), an unsupervised objective often referred to as embedding unification. While this facilitates combining modalities, it also encourages the secondary encoder to reproduce information already present in the primary representation. This is not necessarily optimal for segmentation, where the goal is to extract complementary features that improve the final mask prediction.
We instead aligned the secondary encoder to the task-specific representation learned by the primary model. After adapting the primary model to SAR avalanche segmentation (Section 2.2), we froze its mask decoder and trained the secondary encoder using the supervised segmentation loss computed on the decoder output. Because the decoder parameters are fixed, the secondary encoder must generate embeddings that lie in the same space expected by the decoder, enabling subsequent fusion while preserving complementary information from the secondary modality. In our experiments, this supervised alignment strategy yielded the best performance among the considered input adaptation variants (Appendix B).
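A minimal sketch of this alignment step (with hypothetical module and loss names, not the exact implementation) is given below; gradients flow through the frozen decoder but update only the secondary encoder:

```python
import torch

def supervised_alignment_step(secondary_encoder, frozen_decoder, seg_loss,
                              images, prompt_embs, masks, optimizer):
    """One step of the task-aware alignment (module names are hypothetical).

    The decoder is frozen, so the optimizer only updates the secondary
    encoder, pushing its embeddings into the space the decoder expects.
    """
    for p in frozen_decoder.parameters():
        p.requires_grad_(False)                # keep the decoder fixed
    optimizer.zero_grad()
    emb = secondary_encoder(images)            # e.g. (B, 256, 64, 64)
    logits = frozen_decoder(emb, prompt_embs)  # decode with the frozen head
    loss = seg_loss(torch.sigmoid(logits), masks)
    loss.backward()                            # gradients flow through the decoder
    optimizer.step()
    return loss.item()
```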

2.5.2. Embedding Fusion

Once both encoders produced aligned embeddings, we fused them at the embedding level. As a simple baseline, we used a global convex combination:
$\hat{e}_F = \alpha \cdot e_1 + (1 - \alpha) \cdot e_2$
With  α = 0.5 , this baseline is already an improvement over training on a single modality, confirming that the supervised alignment enables complementary information to be exploited.
To allow the relative contribution of each modality to vary spatially, we introduced a Selective Fusion Gate (SFG) (Figure 7). The SFG predicts an element-wise weight tensor $\omega \in [0, 1]^{B \times 256 \times 64 \times 64}$ from the concatenation of $e_1$ and $e_2$ and computes the fused embedding as follows:
$\hat{e}_F = \omega \odot e_1 + (1 - \omega) \odot e_2,$
where ⊙ denotes element-wise multiplication.
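As an illustration, a minimal SFG can be implemented as a single convolution followed by a sigmoid; this is a sketch, and the exact gate architecture used in our experiments may differ:

```python
import torch
import torch.nn as nn

class SelectiveFusionGate(nn.Module):
    """Element-wise gating of two aligned embeddings (illustrative sketch)."""

    def __init__(self, channels=256):
        super().__init__()
        # Predict a per-element weight from the concatenated embeddings.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # omega in [0, 1]
        )

    def forward(self, e1, e2):
        omega = self.gate(torch.cat([e1, e2], dim=1))
        return omega * e1 + (1 - omega) * e2
```

Note that when the two embeddings agree, the convex combination returns them unchanged regardless of the predicted weights.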
We note that other input adaptation strategies, including channel selection and patch-embedding modifications, were also investigated but did not yield comparable performance improvements. Further details can be found in Appendix B.

2.6. Training Procedure

The overall training procedure consists of three sequential phases illustrated in Figure 8 and detailed below:
  • Phase 1: Primary modality adaptation. The primary model is trained on [VV0, VV1, DEM]. As discussed in Section 1.1.1, the VV polarization is the most informative SAR source for avalanche mapping and is used to create RGB composites for manual annotation. We therefore used [VV0, VV1] as the primary SAR inputs and complemented them with the DEM. The model at this stage leverages the approaches discussed in Section 2.2 (adapter-based encoder tuning and decoder fine-tuning), Section 2.3 (prompt-robust training), and Section 2.4 (resource optimization). This supervised training stage is necessary to account for the domain shift from natural RGB images to SAR data.
  • Phase 2: Secondary modality alignment. A secondary model is trained to extract image embeddings from the remaining three input channels ([VH0, VH1, SA]) in a supervised manner. As described in Section 2.5.1, we forced alignment to the same embedding space through the frozen decoder of the primary model trained in Phase 1 (Figure 8). Freezing the decoder reduces the individual performance of the secondary model since we only trained the adapters, but it facilitates the combination of the embeddings later on, which is the main goal. Moreover, this secondary modality is supposed to complement the main modality, which justifies using the decoder from the primary modality.
  • Phase 3: Embedding fusion. In the final phase, an SFG is trained to combine the embeddings produced by the two encoders. Once again, we performed supervised training, and we leveraged the frozen decoder of the primary model trained in Phase 1 (Figure 8).
Experiments were carried out using the AdamW optimizer [31] with a learning rate of $10^{-5}$, early stopping (patience = 30 epochs), and a ReduceLROnPlateau scheduler (factor = 0.1, patience = 10) monitoring the validation IoU for both Phase 1 and Phase 2. For Phase 3, we reduced the early-stopping patience to 10 and the scheduler patience to 4, again monitoring the validation IoU.
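The schedule above can be sketched as follows (the model is a stand-in and the validation IoUs are supplied externally; only the optimizer, scheduler, and early-stopping logic reflect the stated hyperparameters):

```python
import torch

# Hyperparameters of Phases 1-2 (Phase 3 shortens the patiences to 10 and 4).
model = torch.nn.Linear(4, 1)  # stand-in for the adapted SAM
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)  # monitors validation IoU

def early_stopping_loop(val_ious, patience=30):
    """Stop once the validation IoU has not improved for `patience` epochs."""
    best, bad = -1.0, 0
    for epoch, iou in enumerate(val_ious):
        scheduler.step(iou)  # may reduce the learning rate by a factor of 0.1
        if iou > best:
            best, bad = iou, 0
        else:
            bad += 1
            if bad >= patience:
                return epoch  # epoch index at which training stops
    return len(val_ious) - 1
```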
In the preprocessing step (see Figure 8), we applied image augmentations to reduce overfitting and help the model generalize. In particular, we applied translations, rotations (up to 360°), flips, Gaussian noise ($\sigma = 0.01$), and random masking. Gaussian noise was introduced to counteract the impact of speckle noise on the segmentation performance. After image augmentation, we calculated the prompts for the current images in the batch as explained in Section 2.3. To address class imbalance, we used the Dice loss, as it performed well and correlates directly with the IoU metric. Since the mask decoder generates a continuous probability map, a binarization step is required. We applied a global threshold of 0.5, which yielded the best performance in our empirical evaluations, to produce the final segmentation mask. All models were trained on an NVIDIA RTX 6000 Ada GPU (NVIDIA Corporation, Santa Clara, CA, USA).
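A simplified version of the joint image-mask augmentation could be written as follows (90° rotations stand in for arbitrary-angle rotations, which would require interpolation; parameter names are illustrative):

```python
import torch

def augment(image, mask, sigma=0.01, mask_ratio=0.1):
    """Jointly augment a multi-channel image (C, H, W) and its label mask (1, H, W).

    Simplified sketch: geometric transforms are applied identically to image
    and mask; noise and random masking only perturb the image.
    """
    # Random 90-degree rotation (a stand-in for arbitrary-angle rotations).
    k = int(torch.randint(0, 4, (1,)))
    image, mask = torch.rot90(image, k, (1, 2)), torch.rot90(mask, k, (1, 2))
    # Random horizontal / vertical flips.
    if torch.rand(1) < 0.5:
        image, mask = image.flip(2), mask.flip(2)
    if torch.rand(1) < 0.5:
        image, mask = image.flip(1), mask.flip(1)
    # Additive Gaussian noise, mimicking speckle-like perturbations.
    image = image + sigma * torch.randn_like(image)
    # Random masking: zero out a fraction of the pixels in every channel.
    keep = (torch.rand(1, *image.shape[1:]) > mask_ratio).to(image.dtype)
    return image * keep, mask
```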

2.7. Segmentation Tool

We developed a web-based tool for semi-automated segmentation in collaboration with a geoscientist responsible for the annotations of the SAR images in the dataset. The tool is designed to support efficient human-in-the-loop annotation and provides the following core functionalities:
  • Data loading: It loads the files to annotate (SAR image and the DEM) at two different time instants t 0 and t 1 .
  • Data visualization: The interface allows visualizing both the RGB composites (obtained from the SAR images as described in Algorithm A1) and the DEM, displayed by simulating a light source to create shadows and highlights (hillshade format), which transforms raw elevation data into a 3D-like representation of the terrain.
  • Semi-automatic segmentation: The tool allows the annotator to draw a BB on the SAR image around an area with snow avalanche debris. The input data and the prompt are fed to the adapted SAM, which returns a probability map. The annotator then obtains the final segmentation mask by adjusting the threshold applied to the probability map.
  • Mask editing: This enables manual correction and refinement of the mask generated by the deep learning model.
The same web tool can also operate in a fully manual mode, where the annotator draws the avalanches by freehand.
The web page of the software application for image segmentation is shown in Figure 9.

3. Results

In Section 3.1, we first conduct an ablation study to assess the benefits provided by each component in our adapted SAM architecture with respect to baseline methods. Then, in Section 3.2, we assess the adapted SAM operating in a fully automatic segmentation setting and compare it to popular architectures for image segmentation. Finally, in Section 3.3, we quantitatively assess the practical benefits of our semi-automatic segmentation tool in a real-world annotation pipeline. Qualitative results for the ablation study are provided in Figure 10. Unless otherwise stated, all models were trained on the avalanche detection dataset using the VV and VH SAR channels along with the DEM and SA as inputs. The performance in the experiments was evaluated according to the IoU, precision, and recall, which are defined in Appendix A.

3.1. Ablation Study

We compared the effectiveness of the adapted SAM against the following:
  • The zero-shot version of the SAM, taking as input an RGB image created through Algorithm A1;
  • The SAM model from Phase 1 (SAM with adapters, with fine-tuning of the decoder with VV channels and the DEM as input);
  • The SAMM method [16] with the six-channel input.
We used the same test set with pre-calculated accurate boxes as prompts. The results are reported in Table 1. Our approach obtained superior performance in almost all metrics, with improvements in the IoU and recall, which are the most important metrics in avalanche detection. In particular, our model achieved the highest IoU among all tested methods.

3.2. Fully Automatic Segmentation

We evaluated the capabilities of our adapted SAM in a fully automated segmentation setting. In this experiment, we simulated a prompt-free setting by providing a single full-image bounding box (covering the entire image domain) as the prompt for our model. It is important to note that this represents a secondary case study, as the primary focus of this work is the semi-automatic, prompt-based annotation tool.
We compared our model against three standard, fully automated segmentation baselines trained and fine-tuned on the same multi-channel avalanche dataset: SegFormer-B1 (13.7 M parameters) [32], U-Net (14.7 M parameters) [33], and a DeepLabV3+ (26.7 M parameters) [34] equipped with a ResNet-50 backbone pretrained on Sentinel-1 imagery [35]. Notably, U-Net and SegFormer are the deep learning models used in previous work to perform fully automated segmentation of avalanches from SAR images [8,36].
Table 2 shows that our approach achieved a comparable IoU and recall, which is critical to minimize the risk of undetected events. We note that this is a non-trivial result, as the SAM is a prompt-based model, which is not designed to operate with imprecise or full-image bounding-box prompts. By contrast, our training strategy explicitly exposes the model to inaccurate and full-image prompts, enabling it to perform well in this challenging setting. These results highlight the effectiveness of the proposed prompt augmentation strategy (Section 2.3) and indicate the potential to perform fully automated and minimal prompt segmentation with foundation models in future applications.
We note that further increasing the performance of our SAM as a fully automated segmentation model would likely require dedicated retraining (e.g., prioritizing full-image prompts by increasing their proportion during training). We also note that in this case, the benefit of using a foundation model like the SAM could be offset by the latency incurred when the image embedding cannot be precalculated. This trade-off is closely tied to the specific requirements of the application and must be analyzed on a case-by-case basis, depending on the most important performance measure (inference time, precision, IoU, or recall).

3.3. Semi-Automatic Segmentation Tool

To evaluate how much the proposed SAM-based semi-automated segmentation tool (Section 2.7) speeds up the annotation procedure in an operational pipeline, we compared the time required to annotate images in the semi-automatic and fully manual modes within the web tool we developed. First, we asked an expert geoscientist to generate high-quality annotations for 50 SAR images from the test set. Manual segmentation took the expert between 1 and 3 min per image, while the semi-automatic modality took about 5-30 s, indicating a substantial speedup.
To evaluate whether the improvement was statistically significant, we conducted a matched-pair analysis on 25 images, i.e., we tested the significance of the difference in time for segmenting the same image manually or with the automated annotation tool. This experiment yielded a 60.28% speedup (using median values), confirmed by a highly significant p-value of $10^{-5}$ from the paired one-tailed t-test. These outcomes are consistent with similar domain adaptation studies (e.g., MedSAM [11]), confirming SAM's effectiveness in creating segmentation labels in different domains.

4. Discussion

This study investigated the adaptation of the Segment Anything Model (SAM) framework to snow avalanche segmentation in Synthetic Aperture Radar (SAR) imagery, with the dual objectives of improving segmentation quality and reducing the effort required for manual annotation. By combining parameter-efficient domain adaptation, prompt-robust training, multi-channel input handling, and compute-aware training, we showed that foundation models can be effectively transferred to this highly specialized remote sensing task.

4.1. Summary of Contributions

The primary contribution of this work is an end-to-end methodology to adapt the SAM to SAR avalanche data while preserving its prompt-based interaction. Among the investigated domain adaptation approaches, adapters proved the most effective, enabling efficient fine-tuning by reducing the number of trainable encoder parameters by more than 90%. In addition, the proposed training strategy improved robustness to imprecise prompts, which is essential in realistic human-in-the-loop annotation scenarios. To overcome the limitation of the SAM to three input channels (Section 2.5 and Appendix B), we introduced a multi-encoder architecture based on supervised embedding alignment and fusion, designed to extract complementary information from secondary input channels. Overall, our final model achieved an IoU of 0.5981 using accurate BB prompts, representing a 5% improvement over the baseline methods. When used for fully automatic segmentation, our adapted SAM model achieved comparable performance to popular image segmentation architectures, namely U-Net [33] and SegFormer [32] trained end-to-end on the same training set.
The second major contribution is the development of a semi-automatic avalanche annotation tool that, under the hood, runs the proposed SAM-based segmentation model. The segmentation tool offers multiple interaction modes, including drawing a BB prompt, threshold-based refinement of the returned probability map, and manual mask editing. We measured a speedup of 60.28% compared with the manual annotation process, directly addressing the main bottleneck for scaling up avalanche inventories. This tool has significant implications for operational avalanche monitoring systems, where timely and accurate detection is critical for public safety. Indeed, facilitating the annotation procedure could enable avalanche forecasting centers to process significantly larger volumes of SAR imagery, potentially improving the temporal and spatial coverage of avalanche monitoring programs. This scalability is particularly relevant given the increasing availability of SAR data from missions such as Sentinel-1, which provides regular coverage of mountainous regions regardless of weather conditions.
More broadly, increasing the amount of high-quality labels enables training more accurate and reliable models for snow avalanche detection, while larger and more diverse datasets remain the main bottleneck for automated snow avalanche mapping. In the long term, the proposed tool can support a positive feedback loop in which improved models reduce annotation effort and facilitate further dataset expansion.

4.2. Challenges and Limitations

One of the main technical challenges faced during this work was the long training time, which we addressed with the proposed efficient training algorithm. Looking ahead, the most important performance bottleneck concerns the detection of small avalanches; aside from the inclusion of imprecise prompts, the other solutions we tested did not consistently improve segmentation performance (Appendix E).
Another limitation is that the dataset includes only acquisitions from the Norway region. The generalization of our model to other geographic areas with different snow conditions and terrain characteristics is not guaranteed and requires additional validation. Nevertheless, we argue that the proposed methodology is general, and the same architecture could be kept as is and retrained on data from different regions.
We also acknowledge that the simpler adapter model with three input channels created through Algorithm A1 represents a strong baseline in terms of the IoU. Nevertheless, the proposed multi-encoder procedure provides a principled way to incorporate additional channels when they are available and informative for the downstream task.
It is also worth noting that the goal of this work was not to explicitly address speckle noise through dedicated denoising architectures. Instead, we focused on building a practical system that leverages the SAM for semi-automatic avalanche annotation, improving robustness and annotation efficiency despite the presence of speckle noise. The investigation of task-specific, speckle-aware network designs is, therefore, considered outside the scope of this study.
Finally, our results focused on avalanches whose deposits are resolvable at Sentinel-1’s 10 m scale; small events may be under-detected.

5. Conclusions

Snow avalanche mapping in SAR imagery is inherently challenging due to speckle noise, acquisition timing, and inter-annotator variability. Our results show that a promptable foundation model, once properly adapted, can act as an effective assistant for this task; it produces accurate masks from simple BBs and, when integrated in an operational pipeline, substantially reduces the annotation time. This creates a practical opportunity to scale up high-quality avalanche inventories, which can in turn improve both prompt-based and fully automatic detection systems.
Beyond avalanche mapping, the proposed methodology for multi-modal input handling and prompt-based training is applicable to other SAR-based detection tasks, including flood detection [37] and oil spill detection [36].
Future research should prioritize the following key directions:
  • Expand training data through larger annotation campaigns supported by the tool, with quality control protocols that improve contour consistency across annotators.
  • Investigate multi-scale decoding and training strategies that increase sensitivity to small targets while controlling false positives.
  • Validate and retrain on acquisitions from other regions and seasons, and explore additional inputs when available (e.g., meteorological products or higher-resolution topographic descriptors).
  • Conduct field evaluations with forecasting centers to assess usability, latency, and reliability in real annotation workflows.

Author Contributions

Conceptualization, F.M.B. and J.G.; methodology, R.G., C.S. and F.M.B.; software, R.G.; validation, R.G. and J.G.; formal analysis, R.G., C.S. and F.M.B.; investigation, R.G., C.S. and F.M.B.; resources, F.M.B., G.B. and J.G.; data curation, J.G.; writing—original draft preparation, R.G.; writing—review and editing, C.S., J.G., G.B. and F.M.B.; visualization, R.G.; supervision, G.B. and F.M.B.; project administration, G.B.; funding acquisition, F.M.B. All authors have read and agreed to the published version of the manuscript.

Funding

F.M.B. was supported by the Norwegian Research Council project no. 345017 (RELAY: Relational Deep Learning for Energy Analytics). The authors wish to thank Nvidia Corporation for donating the GPUs used in this project.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author. Code is available at https://github.com/RiccardoGelato/AdaptingSAMToSARAvalancheDetection (accessed on 2 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Evaluation Metrics

The IoU represents the main metric and determines how closely the predicted avalanche area matches the area in the ground truth. The IoU is defined as follows:
$\mathrm{IoU} = \dfrac{|A \cap B|}{|A \cup B|},$
where A is the predicted mask and B is the ground truth annotation.
The precision is defined as follows:
$\mathrm{Precision} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$
where TP represents the true positives and FP represents the false positives. Precision determines the percentage of predicted avalanche pixels that were actually correct, and this is crucial in cases where false alarms carry high operational costs or severe consequences.
The recall is defined as follows:
$\mathrm{Recall} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$
where FN represents the false negatives. Recall determines the percentage of actual avalanche pixels that the model successfully identifies, which is crucial in safety-critical applications where missing a detection can have severe consequences. Recall is particularly important for fully autonomous avalanche detection systems, where missed avalanches can have significant impacts and endanger lives. It is often desirable to avoid missing potentially hazardous regions at the cost of making the model less conservative.
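The three metrics can be computed directly from binary masks, e.g.:

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """IoU, precision, and recall for binary masks (Appendix A definitions)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # correctly predicted avalanche pixels
    fp = np.logical_and(pred, ~gt).sum()   # false alarms
    fn = np.logical_and(~pred, gt).sum()   # missed avalanche pixels
    iou = tp / (tp + fp + fn)              # |A ∩ B| / |A ∪ B|
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, precision, recall
```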

Appendix B. Input Adaptation

Appendix B.1. Standard RGB Creation for Manual Segmentation

To perform manual segmentations, expert annotators rely on two different types of RGB composites. The first is simply obtained by combining the following channels: [ VV 0 , VV 1 , VV 0 ] . The other RGB composite is obtained through Algorithm A1.
Algorithm A1 SAR polarimetric data to RGB image conversion.
  1: function CreationOfRGBImage($VH_0$, $VH_1$, $VV_0$, $VV_1$)
  2:     $VH_i \leftarrow \mathrm{rescale}(VH_i, -27, -7)$ for $i = 0, 1$
  3:     $VV_i \leftarrow \mathrm{rescale}(VV_i, -23, -3)$ for $i = 0, 1$
  4:     $a \leftarrow \mathrm{rescale}(VH_1 - VH_0, 0, 0.25)$
  5:     $b \leftarrow \mathrm{rescale}(VV_1 - VV_0, 0, 0.25)$
  6:     $w \leftarrow \mathrm{rescale}(a - b, 0, 1)$
  7:     $R \leftarrow w \cdot VH_0 + (1 - w) \cdot VV_0$
  8:     $G \leftarrow w \cdot VH_1 + (1 - w) \cdot VV_1$
  9:     $B \leftarrow w \cdot VH_0 + (1 - w) \cdot VV_0$
10:     $RGB \leftarrow [R, G, B]$
11:    return $RGB$
12: end function
The function rescale converts the SAR data, which is usually provided in a logarithmic scale (dB), to the interval [ 0 , 1 ] to enhance the effectiveness of the detection algorithms. Algorithm A2 shows the details of the rescale function, where “arr” represents the input data while “lo” and “hi” are user-defined thresholds.
Algorithm A2 Rescale algorithm.
  1: function rescale($arr$, $lo$, $hi$)
  2:     $arr \leftarrow (arr - lo) / (hi - lo)$
  3:     $arr \leftarrow 0$ if $arr < 0$
  4:     $arr \leftarrow 1$ if $arr > 1$
  5:     $arr \leftarrow 0$ if isnan($arr$)
  6:    return $arr$
  7: end function
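In NumPy, Algorithm A2 is a few lines (the order of clipping and NaN handling is interchangeable here):

```python
import numpy as np

def rescale(arr, lo, hi):
    """Map values (e.g. dB backscatter) to [0, 1], clipping and zeroing NaNs."""
    arr = (np.asarray(arr, dtype=float) - lo) / (hi - lo)
    arr = np.clip(arr, 0.0, 1.0)        # clamp values below lo / above hi
    return np.nan_to_num(arr, nan=0.0)  # NaNs (e.g. nodata pixels) become 0
```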
After the initial rescaling of the raw SAR data (Lines 2-3 of Algorithm A1), we calculated the differences between time steps $t_0$ and $t_1$ and rescaled them again between 0 and 1 (Lines 4-5). The pixel values in the difference images will be higher where a new avalanche occurred due to the greater change in backscattering. We then subtracted the two difference images a and b, obtaining a new output w (Line 6), which will be zero where the backscattering difference is higher in the VV image and a value between 0 and 1 otherwise. We then used w for a convex combination of VV and VH (Lines 7-9).
Overall, the VV polarization is more informative for avalanche detection. This behavior is reflected in Algorithm A1. The weight w is typically small, and thus the resulting RGB composite is dominated by VV, and the influence of VH is reduced. The VH channel becomes informative only where its backscatter change exceeds that of VV; even then, its contribution to the final image remains minor.
We also underline that in addition to the RGB composites, the annotator relies on the topographic information contained in the DEM or, more precisely, in products derived from the DEM, namely the hillshade representation (a better way to visualize the topography) and the SA, which gives information about the slope and steepness.

Appendix B.2. Channel Combination

A straightforward unimodal way to adapt the SAM to avalanche segmentation is to rescale the selected modalities and stack them into a three-channel (RGB) input. Figure A1 shows examples of the considered configurations. Using the standard channels (VV0, VV1, VH0, VH1, DEM, and SA), the simplest composites are obtained by rescaling and directly stacking as follows:
  • Vertical + DEM creates an RGB with [VV0, VV1, DEM];
  • Horizontal + DEM creates an RGB with [VH0, VH1, DEM].
Figure A1. Visual comparison of different SAR input modalities and feature combinations: (a) the RGB composite for manual interpretation from Algorithm A1; (b) the composite of (a) with the DEM substituted into the third channel; (c) an RGB with [VV1–VV0, VH1–VH0, DEM]; (d) an RGB with [VV0, VV1, DEM]; (e) an RGB with [VH0, VH1, DEM]; and (f) the reference ground truth mask for avalanche segmentation.
We also considered more complex variants:
  • Difference + DEM creates an RGB with [VV1–VV0, VH1–VH0, DEM];
  • Standard + DEM creates an RGB with [R, G, DEM], where R and G are computed as shown in Algorithm A1.
All SAR-derived channels were obtained in their rescaled form as a byproduct of Algorithm A2. The DEM was rescaled to [0, 1] by dividing by 4000 m, while the SA was rescaled by dividing by $\frac{1}{2}\pi$ radians (the maximum theoretical slope). All channels were normalized to a zero mean and unit standard deviation before being fed to the model.
Table A1 summarizes the performance obtained when using as input the different combinations of channels.
Table A1. Performance metrics (IoU, precision, and recall) for different input configurations. Bold values indicate the best performance for each metric.
Input Configuration    IoU      Precision    Recall
Vertical + DEM         58.21    77.18        78.38
Horizontal + DEM       54.58    71.88        78.30
Difference + DEM       57.39    75.68        77.77
Standard + DEM         59.16    79.08        77.78
These unimodal three-channel composites already provide competitive performance; the best configuration (Standard + DEM) reached an IoU of 59.16, which is close to the final results reported in the main body (Table 1). This supports the discussion in the main text that careful input selection yields a strong starting point and that more complex multi-modal strategies provide a modest but meaningful additional gain. Compared with this best unimodal configuration, our final model achieved lower precision but higher recall, and it outperformed it in the key metrics for avalanche segmentation. Given the presence of false negatives in the labels used for training, a less conservative model is preferable, as it can mitigate this problem.

Appendix B.3. Patch Embedding

The SAM’s patch embedding is the first layer of the image encoder and the only component whose parameters depend on the number of input channels. Replacing the patch embedding layers is, therefore, the most direct way to ingest all available modalities. However, this solution also discards the pretrained projection that maps the input image to the token embeddings expected by the subsequent ViT blocks, which can negatively affect optimization and downstream components.
We experimented with directly embedding the six-channel input (VV0, VV1, VH0, VH1, DEM, and SA) and trained the SAM with a newly initialized patch embedding. As expected, this model did not match the strongest unimodal baselines, reaching a maximum IoU of 57.17.
To mitigate the disadvantage of losing the pretrained parameters, we investigated self-supervised initialization strategies for the new patch embedding (while also training the adapters):
  • Masked reconstruction: We trained the modified encoder on unlabeled data to reconstruct the input from a masked version (masking ratio of 30 % ). To limit the training time and encourage the encoder to carry most of the representation burden, we used a lightweight decoder composed of three transposed convolution layers.
  • Embedding distillation: We extracted image embeddings with the original SAM from a three-channel input and trained the modified model to reproduce the same embeddings, starting from the six-channel input.
After self-supervised pretraining, supervised fine-tuning improved the IoU to 57.73 and 58.63, respectively, with the embedding distillation objective providing the largest gain. We hypothesize that the unsupervised pretraining stabilizes optimization by improving gradient flow and by reducing the mismatch between the encoder output and the embedding space expected by the decoder. Nevertheless, the best patch-embedding variant remained below the strongest three-channel configuration (Table A1) and was therefore not retained in the final model.

Appendix B.4. Prefix Net

Instead of modifying the pretrained image encoder, the Prefix Net approach learns a mapping from an n-channel input to a three-channel pseudo-RGB image, which is then fed to the original SAM patch embedding and encoder. This can be seen as a data-driven alternative to the hand-engineered channel combinations described in Appendix B.2.
We considered two variants:
  • Light: This is a small convolutional network designed for efficient end-to-end training. The input is first processed by two 3 × 3 convolutional layers with batch normalization. This block maintains the input resolution while projecting the features into a 64-channel latent space. A transpose convolutional layer (ConvTranspose2d) with a 2 × 2 kernel and a stride of two increases the resolution to 1024 × 1024. The final stage consists of a ReLU activation followed by a 3 × 3 convolutional layer that maps the 64 intermediate features to 3.
  • Heavy: This is a multi-branch convolutional architecture with parallel kernels to better capture multi-scale patterns and improve small-avalanche segmentation, inspired by Dong et al. [25].
End-to-end training of the heavy variant exceeded our GPU memory budget. We therefore trained the prefix network first with the original SAM and then fine-tuned the SAM (with adapters) using the pretrained prefix output. This two-stage procedure reached a 54.45 IoU, substantially below the light variant and below the best unimodal baselines. We expect that joint or alternating optimization could be more effective given sufficient hardware resources.
The light Prefix Net can be trained jointly with the adapters, and it achieved an IoU of 59.65, slightly outperforming the best hand-crafted three-channel composite. It also weakens the correlation between the avalanche size and the IoU, improving performance on small avalanches. In addition, the learned mapping tends to denoise the inputs and produces visually interpretable three-channel images (Figure A2), which can support manual inspection and annotation, as discussed in Section 2.5.
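As a concrete illustration, the light variant can be sketched in PyTorch as follows. The 5-channel input and the 512 × 512 input resolution are assumptions (the latter inferred from the 2× upsampling to SAM's 1024 × 1024 input); this is a sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

class LightPrefixNet(nn.Module):
    """Maps an n-channel SAR composite to a 3-channel pseudo-RGB image that
    the frozen SAM patch embedding can consume. Layer shapes follow the
    description in the text; the input resolution (half of SAM's 1024x1024,
    so that the transpose convolution doubles it) is an assumption."""

    def __init__(self, in_channels: int = 5):
        super().__init__()
        self.body = nn.Sequential(
            # two 3x3 convolutions with batch norm; resolution is preserved
            # while features are projected into a 64-channel latent space
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            # 2x2 transpose convolution with stride 2 doubles the resolution
            nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2),
            # ReLU followed by a final 3x3 convolution mapping 64 -> 3
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# toy 256x256 input for speed; in the described setting a 512x512 composite
# would be mapped to a 1024x1024 pseudo-RGB image
pseudo_rgb = LightPrefixNet(in_channels=5)(torch.randn(1, 5, 256, 256))
```

The output can then be passed directly to the unmodified SAM patch embedding, which makes the mapping trainable end-to-end together with the adapters.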
Figure A2. (left) Output of the Prefix Net. (center) Standard image used by experts for manual segmentation. (right) Ground truth mask. The avalanches are more clearly visible in the output of the Prefix Net.

Appendix C. Self-Supervised Pretraining

As discussed in Section 2.2, we adapted the SAM to the SAR avalanche domain by fine-tuning lightweight adapter modules while keeping the pretrained image-encoder backbone frozen. Since the adapters are randomly initialized, we explored self-supervised pretraining as a way to provide a more informative initialization, improve training stability, and potentially learn noise-robust representations for SAR imagery.
We tested multiple self-supervised initialization methods, which are summarized here for completeness:
  • Masked autoencoders: This is the standard method of pretraining for ViTs, which reconstructs the input from a masked version [38].
  • Teacher-student: This learns view-invariant representations by matching student and teacher embeddings across multiple augmentations.
  • Self-supervised denoising: This reconstructs a filtered target to encourage noise-robust features.
Since annotations are not required, we collected additional data from Norway, obtaining an independent dataset for self-supervised training that is substantially larger than the supervised avalanche dataset (over 10,000 images for training and over 1000 for validation). The larger dataset implied a significantly higher computational cost: in our setting, each pretraining run required several days to converge.

Appendix C.1. Masked Autoencoders

We followed the Masked Autoencoder (MAE) paradigm [38]. During pretraining, we kept the pretrained SAM image encoder frozen and trained only the adapters together with a lightweight reconstruction head. The head maps the image embedding back to the input space and is implemented as a small convolutional decoder with three layers (kernel size of three). A lightweight decoder is commonly preferred in MAE-style training because it encourages the encoder to learn informative representations rather than delegating the reconstruction capacity to the decoder.
We used as input the standard three-channel composite produced by Algorithm A1. The masking ratio was set to 30% (lower than the 75% typically used in natural-image MAE [38]), motivated by the lower and noisier information content of SAR composites. On the NVIDIA RTX 6000 Ada GPU, training converged in about four days with early stopping (patience of 20 epochs). After pretraining, we discarded the reconstruction head and fine-tuned the model for avalanche segmentation using the baseline supervised protocol (adapters and decoder unfrozen). This initialization reached a best IoU of 58.63, which is below the performance obtained with randomly initialized adapters.
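The masking step and the reconstruction head can be sketched as follows. The patch size, the bilinear upsampling back to the input resolution, and the intermediate channel widths are assumptions; the text only specifies a 30% masking ratio and a three-layer decoder with a kernel size of three.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_patches(img: torch.Tensor, patch: int = 16, ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random subset of (patch x patch) tiles at the given ratio."""
    b, _, h, w = img.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=img.device) > ratio).float()
    return img * F.interpolate(keep, scale_factor=patch, mode="nearest")

class ReconstructionHead(nn.Module):
    """Three-layer convolutional decoder mapping SAM's 256x64x64 image
    embedding back to the 3x1024x1024 input space. The upsampling strategy
    and channel widths are assumptions."""

    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        out = self.convs(emb)
        return F.interpolate(out, size=(1024, 1024), mode="bilinear", align_corners=False)

masked = mask_patches(torch.randn(1, 3, 1024, 1024))        # encoder input
recon = ReconstructionHead()(torch.randn(1, 256, 64, 64))   # reconstruction
```

The reconstruction loss is computed between `recon` and the unmasked composite, after which the head is discarded and only the adapters are kept.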

Appendix C.2. Teacher-Student

We also explored a DINO-style teacher-student training procedure [39]. We used the SAM with adapters as the backbone for both teacher and student, initialized with identical weights. During self-supervision, only the adapters are trained; the teacher is updated through an exponential moving average of the student parameters.
Due to GPU memory constraints (48 GB of VRAM), we used only two global views and did not include local crops. The global views were generated through standard image augmentations (rotation, translation, flips, Gaussian blur, grayscale, random crops, and color jitter). Training converged in about four days on the NVIDIA RTX 6000 Ada GPU, with early stopping (patience of 20 epochs). Fine-tuning from the student initialization for avalanche segmentation yielded a best IoU of 58.73, which again was not an improvement over the baseline with random adapter initialization.
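The exponential moving average update of the teacher can be sketched as follows; a toy linear module stands in for the SAM backbone with adapters, and the momentum value is an assumption.

```python
import copy
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   momentum: float = 0.996) -> None:
    """DINO-style EMA: the teacher tracks the student's (adapter) parameters;
    only the student receives gradients during self-supervised training."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# minimal usage sketch: teacher starts as an exact copy of the student
student = torch.nn.Linear(4, 4)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_teacher(student, teacher)
```

In the actual procedure the student is updated by the self-supervised loss between the two global views, and `update_teacher` is called once per optimization step.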

Appendix C.3. Self-Supervised Denoising

Motivated by prior work on self-supervised speckle denoising for SAR data [23,24], we considered reconstruction tasks in which the target is not the raw input but a denoised or feature-enhanced version. We used the SAM with adapters (only the adapters being trainable) and attached a lightweight fully convolutional decoder (three layers) to reconstruct the target. As in the MAE setup, 30% of the input was masked. In addition, when the three-channel input included a topographic channel, we reconstructed the SA instead of the DEM, forcing the model to preserve terrain-relevant cues.
We evaluated multiple target transformations, including the gradient-based representation proposed in [23], a Lee filter, and an edge-based target inspired by IRSAM [40]. Figure A3 shows the representative outputs. In our data, the gradient- and Lee-based targets tended to suppress or blur avalanche boundaries, whereas the edge-based targets provided a simpler objective that better preserved boundary information. We therefore adopted edge reconstruction (for the SAR channels) together with SA reconstruction as our denoising pretext task.
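For illustration, the Lee-filtered and edge-based targets can be sketched as follows. The window size, the global noise-variance estimate, and the choice of the Sobel operator are our assumptions; the gradient-based target of [23] and the edge operator of IRSAM [40] differ in their details.

```python
import numpy as np
from scipy.ndimage import uniform_filter, sobel

def lee_filter(img: np.ndarray, size: int = 7) -> np.ndarray:
    """Classic Lee speckle filter [22]: blend toward the local mean where
    the local variance is close to the (estimated) noise variance."""
    mean = uniform_filter(img, size)
    sq_mean = uniform_filter(img ** 2, size)
    var = np.maximum(sq_mean - mean ** 2, 0.0)
    noise_var = var.mean()                      # global noise estimate (assumption)
    gain = var / (var + noise_var + 1e-12)
    return mean + gain * (img - mean)

def edge_target(img: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude as a simple edge-based reconstruction target."""
    gx, gy = sobel(img, axis=0), sobel(img, axis=1)
    return np.hypot(gx, gy)
```

Either function is applied channel-wise to the SAR bands to produce the reconstruction target of the pretext task.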
Figure A3. Examples of denoising targets considered for self-supervision: (a) input composite; (b) Lee-filtered target; (c) edge-based target; (d) gradient-based target; (e) reference ground truth mask.
After supervised fine-tuning, this initialization achieved a best IoU of 58.89, which was the strongest among the self-supervised objectives but still below the baseline with random adapter initialization. Given the additional training time, we did not include this strategy in the final model.

Appendix C.4. Conclusions on Model Initialization

Overall, self-supervised pretraining did not improve the downstream segmentation performance in our setting. We attribute this outcome to a mismatch between the image-embedding distribution induced by the self-supervised objectives and the embedding space expected by the pretrained SAM mask decoder. In practice, this manifests as an unstable initial fine-tuning phase that is not fully compensated during supervised training, even when the learned features appear qualitatively meaningful.
We found that freezing the image encoder for the first 10 epochs of supervised fine-tuning (allowing the decoder to adapt to the pretrained embeddings) improved stability; the results reported above include this additional step. Nevertheless, none of the tested objectives consistently outperformed the baseline with randomly initialized adapters, and we therefore adopted standard initialization in the final approach.

Appendix D. Meteorological Data

Meteorological conditions play an important role in avalanche release and are widely used in operational forecasting [2,26]. In our dataset, each image is associated with five time series, each measuring a different meteorological variable (temperature, wind speed, air pressure, precipitation amount, and relative humidity), as described in Table A2.
Table A2. Descriptive statistics for meteorological variables.
| Variable | Mean | Std | Unit |
|---|---|---|---|
| Air Temperature (2 m) | 270.59 | 3.74 | K |
| Wind Speed (10 m) | 7.1 | 4.42 | m/s |
| Air Pressure at Sea Level | 100,348.15 | 1650.62 | Pa |
| Precipitation Amount | 0.274 | 0.68 | mm |
| Relative Humidity (2 m) | 0.86 | 0.135 | fraction |
In our dataset, the meteorological data has extremely low spatial resolution: only one time series is associated with each SAR image, spanning the entire interval [t0, t1] between the two satellite passes with a temporal resolution of 1 h. Since the SAM operates on spatial prompts (Figure 3), using these time series requires converting them into a dense, image-aligned representation. To provide spatial structure, we conditioned the meteorological embedding on topography by pairing it with the SA. Inspired by MetNet2 [41], we encoded the normalized slope map with a small convolutional block (Conv-LayerNorm-ReLU), and we encoded the normalized meteorological sequences with an LSTM [42] to capture temporal dependencies. The LSTM output is projected to a feature vector, broadcast over the spatial grid, concatenated with the slope features, and processed with additional convolutional blocks to produce the final dense prompt encoding.
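A sketch of this meteorology-conditioned prompt encoder is given below. The channel widths, the GroupNorm stand-in for LayerNorm, and the 64 × 64 output grid (matching SAM's embedding resolution) are assumptions.

```python
import torch
import torch.nn as nn

class MetPromptEncoder(nn.Module):
    """Dense prompt encoding from meteorological time series + slope map.
    The structure follows the description in the text; layer sizes are
    illustrative assumptions."""

    def __init__(self, n_vars: int = 5, hidden: int = 64, out_ch: int = 256):
        super().__init__()
        # Conv-Norm-ReLU block for the normalized slope angle map
        self.slope_enc = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1),
            nn.GroupNorm(1, hidden),   # normalization stand-in (assumption)
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(n_vars, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * hidden, out_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )

    def forward(self, met: torch.Tensor, slope: torch.Tensor) -> torch.Tensor:
        # met: (B, T, n_vars) hourly series; slope: (B, 1, H, W)
        s = self.slope_enc(slope)
        _, (h, _) = self.lstm(met)                 # final hidden state
        v = self.proj(h[-1])                       # (B, hidden)
        v = v[:, :, None, None].expand(-1, -1, *s.shape[-2:])  # broadcast
        return self.fuse(torch.cat([s, v], dim=1))

dense = MetPromptEncoder()(torch.randn(2, 24, 5), torch.randn(2, 1, 64, 64))
```

The resulting tensor is used as a dense prompt embedding, analogous to the mask prompts that the SAM prompt encoder already produces.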
We trained this meteorology-conditioned prompt jointly with the adapter baseline, enabling it for the 50% of samples for which meteorological data was available. At test time, however, using the prompt degraded performance (IoU of 58.12) compared with disabling it (IoU of 59.06). We attribute this to the extremely coarse spatial granularity of the meteorological measurements and to the fact that, in our setting, they do not provide discriminative information at the pixel level beyond what is already captured by the SAR data and the topography. For these reasons, meteorological data is not included in the final model.

Appendix E. Small Avalanche Optimization

As discussed in Section 2.3, segmentation performance decreases for small avalanches, leading to a positive correlation between the avalanche area and the IoU (see Figure A4). In addition to the BB augmentation strategy adopted in the main body (Section 2.3), Table A3 reports this correlation for the model with adapters and a fine-tuned decoder, using the image created by Algorithm A1 as input.
Table A3. Correlation coefficient and p value for variable pairs based on prompt training strategy.
| Variable Pair | Prompt | Correlation Coefficient (r) | p Value |
|---|---|---|---|
| IoU vs. Mask Area | Accurate Only | 0.2056 | 4.86 × 10⁻⁸ |
| IoU vs. Mask Area | Ours | 0.1668 | 1.06 × 10⁻⁵ |
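The statistics in Table A3 are Pearson correlations between the per-avalanche IoU and the mask area; a sketch of the computation (on synthetic data, with the log scale of Figure A4 assumed) is:

```python
import numpy as np
from scipy.stats import pearsonr

def size_iou_correlation(ious, mask_areas):
    """Pearson r and p value between per-avalanche IoU and log mask area.
    The log transform is an assumption matching the log-scale plot."""
    r, p = pearsonr(np.asarray(ious), np.log(np.asarray(mask_areas)))
    return r, p

# synthetic illustration only: larger masks tend to get higher IoU
rng = np.random.default_rng(0)
areas = rng.uniform(50, 5000, size=300)
ious = 0.4 + 0.05 * np.log(areas) + rng.normal(0, 0.05, size=300)
r, p = size_iou_correlation(ious, areas)
```

A positive r with a small p value, as in Table A3, indicates that smaller avalanches are systematically segmented less accurately.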
We evaluated additional approaches commonly used to improve the detection of small targets:
  • Loss reweighting: modify the objective to emphasize hard examples and mitigate class imbalance (e.g., Dice and focal losses) [10,43];
  • Multi-scale feature extraction: combine features computed at different receptive fields to better capture small structures and sharp boundaries [25];
  • High-resolution feature injection: add skip connections so that the decoder can leverage earlier, higher-resolution features, such as in U-Net-like architectures [9,33,44].
Figure A4. Scatter plot of the IoU versus the mask area on a log scale. The clear positive correlation suggests that smaller snow avalanches are harder to detect.
Regarding loss reweighting, we used Dice as the default objective because it addresses class imbalance and is directly related to the IoU. We also tested Dice + Focal, which is frequently used for small or hard-to-classify targets [45,46], and class-weighted losses (doubling the avalanche weight, following [9]). In our experiments, these variants performed worse than Dice alone and were not pursued further.
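For reference, a minimal soft Dice loss of the kind we used as the default objective can be sketched as follows (the smoothing constant is an assumption):

```python
import torch

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss on binary masks. Because it is computed from the
    overlap between prediction and target, it is robust to the strong
    foreground/background imbalance of avalanche masks."""
    probs = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

# near-perfect prediction yields a loss close to zero
target = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
logits = (target * 2 - 1) * 20
```

Dice + Focal simply adds a focal cross-entropy term [43] to this objective with a weighting coefficient.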
Multi-scale and skip connection mechanisms are not directly available in the SAM, whose image encoder is a ViT and whose convolutional components appear only in the patch embedding and in the final stages of the encoder and decoder. One Auto-SAM-style solution is to add an auxiliary convolutional branch in parallel to the SAM [30]. We also investigated skip connection variants that inject higher-resolution features into the decoder.
In particular, we evaluated HQ-SAM [47] and IRSAM [40], which expose the decoder to intermediate encoder features. Notably, IRSAM also aims to reduce the impact of speckle noise on segmentation performance. These approaches achieved IoU values of 59.21 and 59.02, respectively. The gains were marginal and did not justify the added architectural complexity and memory overhead.
A limitation of HQ-SAM is that the connected features originate from global attention blocks (the first connection is after 3 of 12 transformer blocks), where representations are already highly processed. The added convolutional adapters must simultaneously preserve information for the remaining transformer blocks and provide high-frequency signals for the skip connection. To provide less-processed, higher-resolution signals, we implemented an additional convolutional branch that feeds features directly from the input to the mask decoder using the HQ-SAM fusion mechanism. The tested branch consists of the following:
  • Initial block: strided 3 × 3 convolutions reducing 1024 × 1024 × 3 to 512 × 512 × 64 , followed by BatchNorm and ReLU;
  • Middle block: three residual stages with 3 × 3 convolutions and max pooling, reducing to 64 × 64 × 256 (with a 1 × 1 projection in the identity path when channel dimensions change);
  • Final block: a linear convolution producing the features injected into the decoder.
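The branch can be sketched in PyTorch as follows. The per-stage channel widths (128/256/256) are assumptions; the text only fixes the 64-channel stem output and the final 64 × 64 × 256 features injected into the decoder.

```python
import torch
import torch.nn as nn

class ResStage(nn.Module):
    """3x3 residual block followed by 2x2 max pooling; a 1x1 projection
    aligns the identity path when the channel count changes."""

    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout),
        )
        self.skip = nn.Conv2d(cin, cout, 1) if cin != cout else nn.Identity()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(torch.relu(self.conv(x) + self.skip(x)))

class HighResBranch(nn.Module):
    """Feeds lightly processed input features directly to the mask decoder
    via the HQ-SAM fusion mechanism (sketch under assumed channel widths)."""

    def __init__(self, out_ch: int = 256):
        super().__init__()
        self.stem = nn.Sequential(        # 1024x1024x3 -> 512x512x64
            nn.Conv2d(3, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.stages = nn.Sequential(      # three residual stages, /2 each
            ResStage(64, 128), ResStage(128, 256), ResStage(256, 256),
        )
        self.head = nn.Conv2d(256, out_ch, 1)  # final linear convolution

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))

# toy 256x256 input for speed; a 1024x1024 input yields 64x64x256 features
feats = HighResBranch()(torch.randn(1, 3, 256, 256))
```

On the full 1024 × 1024 input, the spatial reduction by a factor of 16 matches the 64 × 64 resolution of the SAM image embedding, so the branch output can be fused directly in the decoder.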
This variant reached a best IoU of 59.39 in early experiments. However, the improvement was not reproducible once we introduced the less precise prompt strategy and the multi-encoder components of the final method. After careful evaluation, we found that these architectural and loss-function variants were less effective for small-avalanche performance than the BB augmentation procedure described in Section 2.3.

Appendix F. Parameter-Efficient Fine-Tuning

As discussed in Section 2.2, we adapted the image encoder of the SAM to the SAR avalanche domain using adapters. Table A4 compares them with the other PEFT strategies evaluated in this work, namely LoRA and Auto-SAM. We report the two best models for each method, together with the corresponding hyperparameters. Each model was trained on the Standard + DEM images, i.e., RGB composites with channels [R, G, DEM], where R and G are computed as shown in Algorithm A1. See Appendix B for details.
Table A4. Comparison of adapters, LoRA, and Auto-SAM performance using precise prompts based on IoU, precision, and recall. Bold values indicate the best performance for each metric.
| Model | IoU | Precision | Recall |
|---|---|---|---|
| Adapters (MLP-ratio = 0.25) | **59.16** | **79.08** | 77.78 |
| Adapters (MLP-ratio = 0.5) | 58.91 | 78.84 | 77.73 |
| LoRA (r = 130, α = 130) | 58.99 | 76.46 | **82.10** |
| LoRA (r = 64, α = 64) | 58.35 | 76.49 | 80.33 |
| Auto-SAM (ResNet-50) | 58.97 | 77.72 | 80.49 |
| Auto-SAM (custom ResNet) | 59.14 | 77.95 | 79.82 |
We note that Auto-SAM performed particularly poorly when applied by following the procedure described in the original paper [30]. The Auto-SAM results reported here were therefore obtained after the domain shift had already been addressed with the most successful adapter configuration (MLP-ratio of 0.25). As the parallel backbone, we tested ResNet-50 and a custom ResNet composed of three parallel networks with different kernel sizes, designed to capture details at different scales and reduce the impact of speckle noise [25]. Ultimately, adapters outperformed the other two PEFT methodologies in terms of IoU and were therefore selected for our proposed model. Auto-SAM alone failed to achieve effective domain adaptation; its original medical-imaging domain was closer to the standard SAM's expected input than SAR avalanche imagery is. LoRA, while the most efficient in terms of inference latency and only slightly below adapters in IoU, proved more difficult to fine-tune because it has two hyperparameters (rank and alpha) compared with the adapters' single MLP-ratio.
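For completeness, the adapter module itself (two linear layers with an activation, controlled by the single MLP-ratio hyperparameter) can be sketched as follows. The residual placement and the embedding dimension are assumptions; Figure 5 shows the exact positioning within each transformer block, after the multi-head attention and in parallel with the MLP.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project by the MLP-ratio, apply a
    nonlinearity, and project back. With dim=768 and mlp_ratio=0.25,
    only ~2 x 768 x 192 weights are trained per block, while the ViT
    backbone stays frozen. The GELU activation and the residual
    connection are assumptions."""

    def __init__(self, dim: int = 768, mlp_ratio: float = 0.25):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

tokens = torch.randn(2, 16, 768)   # short ViT token sequence (toy shape)
out = Adapter()(tokens)
```

By contrast, LoRA injects low-rank updates into the attention weight matrices themselves, which introduces the rank and alpha hyperparameters mentioned above.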

References

  1. Eckerstorfer, M.; Malnes, E.; Müller, K. A complete snow avalanche activity record from a Norwegian forecasting region using Sentinel-1 satellite-radar data. Cold Reg. Sci. Technol. 2017, 144, 39–51. [Google Scholar] [CrossRef]
  2. Kapper, K.L.; Goelles, T.; Muckenhuber, S.; Trügler, A.; Abermann, J.; Schlager, B.; Gaisberger, C.; Eckerstorfer, M.; Grahn, J.; Malnes, E.; et al. Automated snow avalanche monitoring for Austria: State of the art and roadmap for future work. Frontiers 2023, 4, 1156519. [Google Scholar] [CrossRef]
  3. Eckerstorfer, M.; Vickers, H.; Malnes, E.; Grahn, J. Near-real time automatic snow avalanche activity monitoring system using Sentinel-1 SAR data in Norway. Remote Sens. 2019, 11, 2863. [Google Scholar] [CrossRef]
  4. Eckerstorfer, M.; Bühler, Y.; Frauenfelder, R.; Malnes, E. Remote sensing of snow avalanches: Recent advances, potential, and limitations. Cold Reg. Sci. Technol. 2016, 121, 126–140. [Google Scholar] [CrossRef]
  5. Data-driven avalanche forecasting using weather and satellite data. In International Snow Science Workshop (ISSW) Proceedings; Montana State University: Bozeman, MT, USA, 2024; Available online: https://arc.lib.montana.edu/snow-science/item.php?id=3109 (accessed on 2 February 2026).
  6. Eckerstorfer, M.; Malnes, E. Manual detection of snow avalanche debris using high-resolution Radarsat-2 SAR images. Cold Reg. Sci. Technol. 2015, 120, 205–218. [Google Scholar] [CrossRef]
  7. Vickers, H.; Eckerstorfer, M.; Malnes, E.; Larsen, Y.; Hindberg, H. A method for automated snow avalanche debris detection through use of synthetic aperture radar (SAR) imaging. Earth Space Sci. 2016, 3, 446–462. [Google Scholar] [CrossRef]
  8. Bianchi, F.M.; Grahn, J. Snow avalanches. In Data-Driven Earth Observation for Disaster Management; Elsevier: Amsterdam, The Netherlands, 2026; pp. 69–88. [Google Scholar]
  9. Bianchi, F.M.; Grahn, J.; Eckerstorfer, M.; Malnes, E.; Vickers, H. Snow avalanche segmentation in SAR images with fully convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 75–82. [Google Scholar] [CrossRef]
  10. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, C.Y.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  11. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  12. Wu, J.; Fu, R.; Fang, H.; Liu, Y.; Wang, Z.; Xu, Y.; Jin, Y.; Arbel, T. Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation. arXiv 2023, arXiv:2304.12620. [Google Scholar] [CrossRef]
  13. Wang, D.; Zhang, J.; Du, B.; Xu, M.; Liu, L.; Tao, D.; Zhang, L. SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Mode. Adv. Neural Inf. Process. Syst. 2023, 36, 8815–8827. [Google Scholar]
  14. Yan, Z.; Li, J.; Li, X.; Zhou, R.; Zhang, W.; Feng, Y.; Diao, W.; Fu, K.; Sun, X. RingMo-SAM: A Foundation Model for Segment Anything in Multimodal Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5625716. [Google Scholar] [CrossRef]
  15. Pu, X.; Jia, H.; Zheng, L.; Wang, F.; Xu, F. ClassWise-SAM-Adapter: Parameter Efficient Fine-tuning Adapts Segment Anything to SAR Domain for Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 4791–4804. [Google Scholar] [CrossRef]
  16. Xiao, A.; Xuan, W.; Qi, H.; Xing, Y.; Yokoya, N.; Lu, S. Segment Anything with Multiple Modalities. arXiv 2024, arXiv:2408.09085. [Google Scholar] [CrossRef]
  17. Cloude, S.R.; Pottier, E. An Entropy Based Classification Scheme for Land Applications of Polarimetric SAR. IEEE Trans. Geosci. Remote Sens. 1997, 35, 68–78. [Google Scholar] [CrossRef]
  18. Freeman, A.; Durden, S.L. A Three-Component Scattering Model for Polarimetric SAR Data. IEEE Trans. Geosci. Remote Sens. 1998, 36, 963–973. [Google Scholar] [CrossRef]
  19. Karachristos, K.; Koukiou, G.; Anastassopoulos, V. A Review on PolSAR Decompositions for Feature Extraction. J. Imaging 2024, 10, 75. [Google Scholar] [CrossRef]
  20. Ji, K.; Wu, Y. Scattering Mechanism Extraction by a Modified Cloude–Pottier Decomposition for Dual Polarization SAR. Remote Sens. 2015, 7, 7447–7470. [Google Scholar] [CrossRef]
  21. Mascolo, L.; Cloude, S.R.; Lopez-Sanchez, J.M. Model-Based Decomposition of Dual-Pol SAR Data: Application to Sentinel-1. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5220119. [Google Scholar] [CrossRef]
  22. Lee, J.S. Digital Image Enhancement and Noise Filtering by Use of Local Statistics. IEEE Trans. Pattern Anal. Mach. Intell. 1980, PAMI-2, 165–168. [Google Scholar] [CrossRef]
  23. Li, W.; Yang, W.; Liu, T.; Hou, Y.; Li, Y.; Liu, Z.; Liu, Y.; Liu, L. Predicting Gradient is Better: Exploring Self-Supervised Learning for SAR ATR with a Joint-Embedding Predictive Architecture; Elsevier: Amsterdam, The Netherlands, 2024. [Google Scholar]
  24. Dalsasso, E.; Denis, L.; Muzeau, M.; Tupin, F. Self-supervised training strategies for SAR image despeckling with deep neural networks. In Proceedings of the EUSAR 2022; 14th European Conference on Synthetic Aperture Radar, Leipzig, Germany, 25–27 July 2022. [Google Scholar]
  25. Dong, H.; Ma, W.; Jiao, L.; Liu, F.; Li, L. A Multiscale Self-Attention Deep Clustering for Change Detection in SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5207016. [Google Scholar] [CrossRef]
  26. Abermann, J.; Eckerstorfer, M.; Malnes, E.; Hansen, B.U. A Large Wet Snow Avalanche Cycle in West Greenland Quantified Using Remote Sensing and In Situ Observations; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  28. Zhang, Y.; Ma, Q.; Ge, B.; Wei, M.; Huang, Y.; Ji, Z. Flood Area Segmentation By SAM Based On SAR Data And DEM Assistance. In Proceedings of the IGARSS 2024—2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024. [Google Scholar]
  29. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 25–29 April 2022. [Google Scholar]
  30. Shaharabany, T.; Dahan, A.; Giryes, R.; Wolf, L. AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder. arXiv 2023, arXiv:2306.06370. [Google Scholar] [CrossRef]
  31. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  33. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  34. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 801–818. [Google Scholar]
  35. Cambrin, D.R.; Vaiani, L.; Gallipoli, G.; Cagliero, L.; Garza, P. Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources. arXiv 2025, arXiv:2507.10403v1. [Google Scholar]
  36. Bianchi, F.M.; Espeseth, M.M.; Borch, N. Large-scale detection and categorization of oil spills from SAR images with deep learning. Remote Sens. 2020, 12, 2260. [Google Scholar] [CrossRef]
  37. Amitrano, D.; Di Martino, G.; Di Simone, A.; Imperatore, P. Flood detection with SAR: A review of techniques and datasets. Remote Sens. 2024, 16, 656. [Google Scholar] [CrossRef]
  38. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  39. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
  40. Zhang, M.; Wang, Y.; Guo, J.; Li, Y.; Gao, X.; Zhang, J. IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection. arXiv 2024, arXiv:2407.07520. [Google Scholar] [CrossRef]
  41. Espeholt, L.; Agrawal, S.; Sønderby, C.; Kumar, M.; Heek, J.; Bromberg, C.; Gazen, C.; Carver, R.; Andrychowicz, M.; Hickey, J.; et al. Deep learning for twelve hour precipitation forecasts. Nat. Commun. 2022, 13, 5145. [Google Scholar] [CrossRef]
  42. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  44. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  45. Waseem Ashraf, M.; Sultani, W.; Shah, M. Dogfight: Detecting Drones from Drones Videos. arXiv 2021, arXiv:2103.17242. [Google Scholar] [CrossRef]
  46. Dong, R.; Pan, X.; Li, F. DenseU-Net-Based Semantic Segmentation of Small Objects in Urban Remote Sensing Images. IEEE Access 2019, 7, 65347–65356. [Google Scholar] [CrossRef]
  47. Ke, L.; Ye, M.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Segment Anything in High Quality. Adv. Neural Inf. Process. Syst. 2023, 36, 29914–29934. [Google Scholar]
Figure 1. Avalanche segmentation: (a,c) SAR backscatter images created through Algorithm A1, discussed in Appendix B.1, and (b,d) corresponding ground truth masks.
Figure 2. Illustration of a U-Net architecture, composed of an encoder with several downsampling blocks and a decoder composed of upsampling blocks, for performing full-image segmentation.
Figure 3. Overview of the SAM architecture. A heavyweight image encoder outputs an image embedding, given an RGB image. The prompt encoder identifies the segmentation target, which is then segmented by the mask decoder.
Figure 4. Visual comparison of different input modalities and the final image created for manual segmentation with Algorithm A1.
Figure 5. Complete ViT architecture with adapter-modified transformer blocks. Each adapter consists of two linear layers and an activation function, positioned after the multi-head attention and in parallel with the MLP.
Figure 6. Creation of BBs, highlighted in red, to improve robustness to inaccurate prompts. From left to right, we show the creation of the minimum enclosing BB from the segmentation mask, the random increase in the BB dimensions, and the merging of overlapping BBs.
Figure 7. Selective Fusion Gate. Given two image embeddings e 1 and e 2 , the gate predicts weights ω from their concatenation and produces the fused embedding e ^ F .
Figure 8. Overview of the training procedure. In the first phase, a model with VV0, VV1, and DEM as input is trained with adapters and its decoder fine-tuned using our prompt and efficient parallelization strategies. In the second phase, a model with VH0, VH1, and SA as input is trained using the supervised embedding alignment strategy. The third phase combines the image embeddings through an SFG.
Figure 9. Segmentation tool. Here we show the View Image page, which allows for both manual and semi-automatic segmentation of avalanches.
Figure 10. Visual comparison of the performance of the different models in the ablation study on the avalanche detection dataset. From left to right: (1) input SAR composite with the bounding-box prompt highlighted in red; (2) ground truth manually drawn by a human expert; (3) zero-shot SAM; (4) SAM with adapters (Phase 1); (5) SAMM baseline [16]; and (6) our proposed adapted SAM. Each row represents a distinct sample from the test set of the avalanche dataset.
Table 1. Comparison of SAM adaptation methods for precise prompts. Bold values indicate the best performing method according to each metric.
Model      IoU           Precision     Recall
SAM        34.29         29.51         82.41
Phase 1    57.88 ± 0.4   75.79 ± 1.2   79.20 ± 0.8
SAMM       59.17 ± 0.3   75.58 ± 0.8   79.97 ± 1.0
Ours       59.81 ± 0.3   75.60 ± 0.6   80.99 ± 0.8
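The IoU, precision, and recall figures reported in the tables are standard mask-overlap metrics. As a minimal sketch (not the authors' evaluation code), assuming binary prediction and ground-truth masks and reporting values in percent:

```python
import numpy as np

def mask_metrics(pred, gt):
    """Compute IoU, precision, and recall (in percent) for binary segmentation masks."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()    # predicted avalanche, truly avalanche
    fp = np.logical_and(pred, ~gt).sum()   # predicted avalanche, actually background
    fn = np.logical_and(~pred, gt).sum()   # missed avalanche pixels
    iou = 100.0 * tp / (tp + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return iou, precision, recall

# Toy example: 2x3 masks with tp = 2, fp = 1, fn = 1
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
iou, precision, recall = mask_metrics(pred, gt)  # 50.0, 66.67, 66.67
```

In practice these counts would be accumulated over all test samples (or averaged per sample) before reporting, which is a detail the tables do not specify.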
Table 2. Comparison of segmentation models on full-image detection. Bold values indicate the best-performing method for each metric.
Model        IoU           Precision     Recall
U-Net        42.26 ± 0.1   68.50 ± 0.3   62.73 ± 0.4
Segformer    43.28 ± 0.7   68.07 ± 1.1   65.13 ± 0.9
DeepLabV3+   41.99 ± 0.4   65.70 ± 1.9   67.85 ± 2.2
Ours         42.30 ± 0.8   62.56 ± 0.9   66.23 ± 1.9

Share and Cite

MDPI and ACS Style

Gelato, R.; Sgaravatti, C.; Grahn, J.; Boracchi, G.; Bianchi, F.M. Promptable Foundation Models for SAR Remote Sensing: Adapting the Segment Anything Model for Snow Avalanche Segmentation. Remote Sens. 2026, 18, 519. https://doi.org/10.3390/rs18030519

