1. Introduction
Sea surface oil spill pollution is a major environmental threat that adversely affects the marine environment and the organisms that inhabit it [
1,
2,
3]. Even though most of the oil that ends up at sea comes from natural underwater seepages, a significant amount is anthropogenic. Most of this anthropogenic oil does not originate from accidents, but rather from illegal routine sea operations. Thus, oil spills have a high correlation with ship routes, oil platforms, pipelines and other offshore installations [
4,
5,
6]. Due to its high kinematic viscosity (higher than that of the surrounding seawater), oil has a damping effect on short gravity/capillary waves, thereby reducing the backscatter coefficient and appearing as dark patches in synthetic aperture radar (SAR) images. This effect varies spatially across the scene, depending on the type of oil, the thickness of the spill and the prevailing weather conditions [
7,
8].
SAR images have been successfully used for oil spill identification and delineation for decades [
9,
10,
11]. Their ability to monitor the Earth regardless of time of day and weather conditions, combined with the minimal effect of the atmosphere on the microwave region of the spectrum, makes them very useful for observing various marine phenomena. For an oil spill to be observable in an SAR image, it needs to be distinguishable from the surrounding water. The contrast is most pronounced in VV polarization and at wind speeds ranging from approximately 3 to 7–10 m/s. At lower wind speeds, the surrounding sea has similarly low backscatter values, and at higher wind speeds the oil mixes completely with the surrounding water (unless it is thick enough). Additionally, the backscattering coefficient decreases as the incidence angle increases, making oil spills more visible at angles between 20 and 45 degrees.
However, oil spills are not the only phenomenon that can cause a damping effect. Other phenomena, termed look-alikes, appear very similar to an actual spill. Most commonly, look-alikes can be produced by low winds in the lee of islands, natural surface films produced by fish and plankton and algal blooms. It has been observed that oil spills produce a damping effect in the range of 0.6–13.0 dB, while look-alikes produce one in the range of 0.8–11.3 dB. This overlap in values makes their discrimination difficult. One way to separate oil spills from look-alikes is to consider their shape. Oil spills are more linear and have clearer boundaries, especially on the downwind side. They may also exhibit feathering due to the wind [
12,
13].
Deep learning is the current state of the art for pixel-wise image segmentation. It has largely replaced traditional methods that require handcrafted feature extraction. Encoder–decoder designs such as U-Net [
14] have become the standard for combining multi-scale context via skip connections, which is especially effective for thin, fragmented targets and precise boundary delineation. Many modern models strengthen this encoder–decoder architecture with residual backbones, most commonly ResNet [
15], to improve optimization stability and representation. DeepLabV3 [
16] uses atrous (dilated) convolutions and multi-rate context aggregation to improve robustness to scale variation, while SegNeXt [
17] revisits efficient convolutional attention as a practical alternative to global self-attention. Swin-T [
18] and ViT [
19] provide strong hierarchical and token-based representations for dense prediction, while ConvNeXt [
20] shows that modernized convolutional backbones can achieve comparable segmentation performance to transformer-based architectures. More recently, foundation models have reshaped segmentation when ground truth is scarce. The Segment Anything Model (SAM) performs promptable segmentation, producing masks conditioned on points, boxes, or text to enable strong zero-shot transfer across diverse image distributions [
21]. In remote sensing, however, systematic domain shifts caused by sensor physics, spatial resolution and spatial and temporal variations motivate dedicated evaluation and adaptation. SAM has been studied as a baseline and annotation tool for point and box-driven workflows [
22], and it has also been used to build large-scale remote-sensing segmentation resources that support pretraining when pixel labels are expensive [
23]. Several works propose adaptations tailored to remote-sensing characteristics: RS-SAM [
24] integrates multi-scale modeling, SAM-RSIS [
25] uses box-prompted progressive adaptation for instance segmentation, and SAM-assisted semantic segmentation injects object/boundary constraints into conventional semantic heads to improve mask quality [
26]. Parameter-efficient fine-tuning methods like LoRA provide a pragmatic route to adapting SAM-like encoders to geographically diverse distributions with limited trainable parameters and potentially limited labels [
27]. Related training-free or low-supervision approaches further demonstrate that SAM’s representations can be repurposed beyond natural images when domain gaps are explicitly addressed [
28].
In SAR oil spill mapping, deep learning is increasingly used to perform semantic segmentation on data that are noisy and hard to interpret. Speckle, variations in incidence angle, and many look-alike features can appear similar to oil slicks, so accurate boundary delineation and low false alarm rates are key goals in practice [
29,
30,
31]. To cope with strong class imbalance and limited labeled data, ref. [
29] proposed a two-stage approach that first classifies patches and then applies a U-Net-style segmentation model with an imbalance-aware loss. OSCNet [
30] showed that CNN features can improve separation of oil from look-alikes in classification tasks that are often used prior to segmentation. Benchmark datasets and evaluations, such as those introduced by [
31], have supported more systematic comparison of segmentation models and highlight that distinguishing oil from look-alikes remains difficult at the pixel level.
Recent studies have focused on making models scalable and improving boundary quality. Ref. [
32] introduced a fully convolutional model (OFCN/OFCNet) designed to handle variable-sized SAR images and run efficiently at a large scale (e.g., using sliding-window inference). The study reported performance comparable to that of human operators in large-area detection and categorization when training data were sufficiently diverse and inference was carefully engineered. Other work has targeted boundary errors more directly: CBD-Net combines multi-scale features with edge-focused supervision to sharpen spill outlines, supported by the manually labeled SOS dataset that helps with more consistent benchmarking [
33]. Attention-based encoder–decoder variants, such as dual-attention U-Net models, also aim to reduce missed spill regions and unclear boundaries in noisy SAR scenes [
34]. Reliability can be further improved by adding extra inputs to reduce look-alike confusion, for example, by using SAR intensity together with derived statistics (e.g., variance) and environmental/context information (e.g., wind) rather than relying on a single channel [
35]. More recently, hybrid CNN–Transformer models have been explored to capture both local texture and wider context, while multi-task approaches (including GAN-based frameworks) attempt to learn oil and look-alike discrimination and pixel-level segmentation jointly under limited labeled data [
36,
37]. In [
38], a modified version of SegNeXt was introduced for segmenting different kinds of marine phenomena and polluting agents, including marine debris and oil spills, in Sentinel-2 images.
A common limitation in SAR oil spill segmentation is the lack of labeled data and the difficulty in training models that generalize well. To reduce this dependence on fixed datasets, self-evolving training approaches iteratively create new training samples and update the model as additional SAR scenes become available, which can partially overcome the limits of static labeled corpora [
39]. At the same time, detector-based methods trained on large Sentinel-1 collections support scalable monitoring, but accurate spill extent detection still depends on reliable downstream segmentation [
40,
41]. Other studies, including CNN-based segmentation, adversarial learning methods, and dual-stream U-Net variants, show that improving generalization with limited supervision remains a key open problem [
42,
43,
44].
Recent studies have also begun to explore the Segment Anything Model (SAM) and its variants for oil spill detection. SAM-OIL [
45] first introduced SAM into SAR-based oil spill detection by combining YOLOv8-generated bounding boxes, an adapted SAM, and ordered mask fusion to produce class-aware segmentation masks. Subsequent works extended this idea through stronger domain adaptation and feature fusion strategies. CoRemoteSAM-Oil [
46] and HSRD-Net [
47] combine RemoteSAM with ResNet-based visual features to improve generalization and suppress false alarms, while DADS-SAM [
48] applies parameter-efficient SAM fine-tuning to complex UAV-based port oil spill scenes. More recently, OilSAM2 [
49] introduced a memory-augmented SAM2 framework to reuse multi-scale information across SAR image collections. Overall, these works show that SAM-based foundation models are promising for oil spill segmentation but also highlight the need for SAR-specific adaptation, efficient fine-tuning, and robust handling of look-alikes, scarce labels, and variable sea-state conditions.
A complementary line of radar-polarimetry research shows that scattering behavior is not fixed but changes with polarization basis, time, frequency, and observation geometry. The General Polarimetric Correlation Pattern (GPCP) [
50] formalizes this idea by visualizing and characterizing target scattering diversity across multiple domains. PolSAR and dual-polarimetric SAR data provide extra scattering information that can reduce the ambiguity of dark-spot interpretation. Several studies have therefore incorporated polarimetric information in segmentation networks. Ma et al. [
51] used Sentinel-1 dual-polarimetric amplitude, phase, and Cloude decomposition parameters in an improved DeepLabv3+ model. Wang et al. [
52] proposed BO-DRNet using quad-polarimetric RADARSAT-2 features and Bayesian hyperparameter optimization. Another Wang et al. study [
53] introduced a Cloude–Pottier-based relative polarimetric feature. Liao et al. [
54] used PolSAR and deep learning for coastal oil spill risk monitoring in Jiaozhou Bay. More recent work combines polarimetric inputs with stronger segmentation designs. OSDTAU-Net [
55] uses dual-polarimetric SAR with Transformer and attention mechanisms. A scene-adaptive PolSAR network [
56] uses dynamic convolution and boundary constraints. Xiang et al. [
57] combine composite polarimetric scattering power entropy with multi-scale hybrid feature fusion to improve detection of small and elongated slicks. CDANet [
58] uses a multi-year, multi-region, multi-polarimetric SAR dataset to improve generalization. Finally, PBITU-Net [
59] combines dual-polarimetric features with oil–seawater boundary information.
Despite the strong performance reported by the aforementioned studies, with F1-scores reaching as high as 0.98, their experimental validation is typically based on small image collections and limited test sets, averaging approximately three test images. Consequently, their reported performance may not fully reflect robustness across diverse oil spill appearances, acquisition conditions, and sea states.
In this study, we aim to exploit the deep features of the SAM foundation model through OSDA-SAM, a novel architecture for segmenting oil spills. Unlike standard SAM, which was trained primarily on natural RGB images and typically requires prompts, OSDA-SAM operates in a prompt-free, single-shot segmentation setting while keeping the SAM backbone frozen. The SAR-to-SAM domain gap is addressed through three lightweight adaptation components. First, the Input Domain Adaptation Block (IDAB) maps SAR-derived inputs into a SAM-compatible representation using a residual convolutional adapter and learnable channel-wise normalization, making the SAR input cleaner and better suited to SAM. Second, LoRA introduces trainable low-rank updates into selected encoder projections, allowing the internal SAM representations to adapt to SAR oil spill features without full fine-tuning. This preserves the general segmentation knowledge of SAM while reducing trainable parameters and improving stability under limited labeled data. Third, the Residual Feature Space Adapter (RFSA) is applied after the frozen encoder to refine the generated image embeddings through lightweight pointwise convolutions and residual feature reweighting, providing an additional feature-level correction.
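For readers less familiar with low-rank adaptation, the update applied to a frozen projection can be written compactly. The notation below follows the standard LoRA formulation and is not taken from this paper: $W_0$ is a frozen pretrained projection, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is a scaling constant. Which encoder projections are adapted, and with what rank, is specified in the methodology rather than here.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only $A$ and $B$ receive gradients, so each adapted projection contributes $r(d + k)$ trainable parameters instead of $dk$, which is why the trainable parameter count stays small while the backbone remains frozen.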
The proposed OSDA-SAM differs from existing SAM-based oil spill segmentation approaches by avoiding manual prompts, memory modules, or substantial modifications to the SAM architecture. It also differs from approaches that either apply SAM directly in a zero-shot manner or mainly focus on improving multi-scale encoder features. Instead, OSDA-SAM keeps the SAM backbone largely unchanged and introduces only lightweight adaptation modules at selected stages. Because marine SAR images differ substantially from optical images, we place particular emphasis on adapting the input before it enters the encoder. This makes the SAR data more compatible with the type of input SAM was originally trained on, while keeping the overall architecture simple and efficient.
Our approach employs a two-stage methodology similar to [
29]. In the first stage, a ConvNeXt classification model is trained to distinguish oil spill patches from look-alikes. In the second stage, the proposed OSDA-SAM segmentation model is used to delineate the oil spill boundaries. Since the SAM image encoder is constrained to three-channel inputs, our approach uses VV backscatter together with selected texture descriptors rather than incorporating the full set of dual-polarimetric features, which would require modifying the pretrained encoder and reduce the benefit of using SAM as a foundation model. By evaluating this architecture on a broader dataset, this study aims to provide a more robust assessment of oil spill segmentation performance under diverse real-world conditions. The main contributions of this study are as follows:
- (1)
We propose OSDA-SAM, a novel network architecture for segmenting images containing oil slicks. The architecture is based on the SAM foundation model and is efficiently adapted to the task through LoRA linear layers, taking advantage of the rich knowledge already encoded in the model.
- (2)
An Input Domain Adaptation Block (IDAB), consisting of a residual convolutional adapter and a learnable channel-wise normalization, is designed to map SAR-derived inputs to a SAM-compatible representation.
- (3)
The effect of incorporating GLCM-derived statistics alongside VV backscatter as an input to deep learning models for oil spill detection and segmentation is analyzed.
- (4)
A statistical analysis of the effect of wind speed on the performance of our deep learning model is conducted.
The paper is organized as follows:
Section 2 describes the methodology and data used in the study;
Section 3 presents the results of the proposed method compared with the current state of the art;
Section 4 discusses the effectiveness of the algorithm; and
Section 5 concludes the study.
3. Results
3.1. Classification Results
The first stage of the proposed pipeline performs patch-level classification to discriminate true oil spills from visually similar dark look-alike phenomena, reducing the amount of imagery that must be processed by the second-stage segmentor.
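The control flow of this screening-then-segmentation design can be sketched in a few lines. This is a minimal illustration only: `classify` and `segment` are hypothetical callables standing in for the ConvNeXt-T classifier and OSDA-SAM, the 0.5 threshold is an assumed value, and the toy stand-ins at the bottom exist purely to make the example executable.

```python
from typing import Callable, List, Sequence

Patch = List[List[float]]  # 2-D grid of backscatter values
Mask = List[List[int]]     # 2-D grid of 0/1 labels


def two_stage_inference(
    patches: Sequence[Patch],
    classify: Callable[[Patch], float],  # hypothetical: returns P(oil spill)
    segment: Callable[[Patch], Mask],    # hypothetical: returns a binary mask
    threshold: float = 0.5,              # assumed decision threshold
) -> List[Mask]:
    """Run the segmentor only on patches flagged by the screening classifier."""
    masks: List[Mask] = []
    for patch in patches:
        if classify(patch) >= threshold:
            masks.append(segment(patch))
        else:
            # Rejected patches skip stage two and get an all-background mask.
            masks.append([[0] * len(row) for row in patch])
    return masks


# Toy stand-ins (not the real models): a patch is "oil" if it has a very dark
# pixel, and the toy segmentor marks every very dark pixel as oil.
def toy_classifier(patch: Patch) -> float:
    return 1.0 if min(min(row) for row in patch) < -20.0 else 0.0


def toy_segmentor(patch: Patch) -> Mask:
    return [[1 if value < -20.0 else 0 for value in row] for row in patch]


demo_patches = [
    [[-10.0, -25.0], [-24.0, -9.0]],  # contains dark pixels
    [[-8.0, -9.0], [-10.0, -11.0]],   # background only
]
demo_masks = two_stage_inference(demo_patches, toy_classifier, toy_segmentor)
print(demo_masks)
```

The sketch makes the asymmetry discussed in this section explicit: a patch rejected in stage one never reaches the segmentor, so a missed spill cannot be recovered downstream.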
A quantitative comparison of the six backbones is summarized in
Table 2. In general, all models have high specificity (0.94–0.98), meaning that most non-oil patches are correctly rejected. This is desirable because background and look-alike patches greatly outnumber oil spill patches. However, the models differ in how they balance missed spills against false alarms.
Among the tested architectures, ConvNeXt-T offers the best overall trade-off, with 0.98 recall, 0.94 precision, 0.98 specificity, and the best F1-score of 0.96. These properties make it particularly well-suited for a first-stage filter: high recall ensures that actual spills are rarely missed before segmentation, while its higher precision relative to the other high-recall models limits unnecessary segmentation of background patches. Transformer-based architectures are also very competitive. ViT matches the best recall of 0.98 and reaches an F1-score of 0.95 with 0.91 precision and 0.97 specificity, suggesting a slightly higher false positive rate than ConvNeXt-T but comparable sensitivity. Swin-T also reaches a recall of 0.98 but with lower precision (0.86) and specificity (0.94), suggesting more frequent confusion with look-alike structures, which would result in more patches being sent to stage two.
The CNN baselines remain strong but show a more conservative detection profile. ResNet-18 and ResNet-50 maintain high precision (0.93) and specificity (0.98), but recall is lower (0.90–0.92), meaning more missed oil spill patches compared to the best-performing models. VGG-16 improves sensitivity (0.97 recall) while keeping good specificity (0.97), but precision (0.91) and F1 (0.94) remain below ViT and ConvNeXt-T. In practice, these differences matter because false negatives at this stage cannot be recovered by the segmentor, whereas false positives primarily increase computational cost and may introduce occasional spurious masks.
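For reference, the four patch-level metrics discussed above follow directly from the confusion matrix. The sketch below uses hypothetical counts, chosen only so that the resulting values land near the ConvNeXt-T operating point quoted in the text; they are not the confusion matrix from this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, precision, specificity and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one positive and one
    negative patch, and at least one positive prediction).
    """
    recall = tp / (tp + fn)        # share of true oil patches that are found
    precision = tp / (tp + fp)     # share of predicted oil patches that are oil
    specificity = tn / (tn + fp)   # share of non-oil patches correctly rejected
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (100 oil, 300 non-oil patches).
metrics = classification_metrics(tp=98, fp=6, tn=294, fn=2)
print({name: round(value, 2) for name, value in metrics.items()})
```

With these illustrative counts, only 2 of 100 oil patches are lost before segmentation, while 6 of 300 non-oil patches are forwarded unnecessarily, which is the trade-off that favors a high-recall screening model.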
Qualitative examples for ConvNeXt-T are shown in
Figure 6. In the figure, the Sentinel-1 VV backscatter patches are visualized alongside the target labels T and prediction outputs P (Oil Spill or Background). Classification errors occur in hard cases under challenging conditions, where the contrast between oil spills and the sea is very low (last example of
Figure 6d), or where look-alikes have shapes very similar to those of slicks. Wind speed and incidence angle values for the false positive and false negative cases overlap with those of correctly classified patches, reinforcing the idea that the models base their decisions primarily on shape. For patches that contain both look-alikes and oil slicks (
Figure 6c), ConvNeXt-T has no trouble identifying the slick. Taken together, the results justify selecting ConvNeXt-T as the screening model for the two-stage framework due to its best overall F1 and its favorable high-recall/high-specificity operating point.
3.2. Segmentation Results
The second stage of the proposed pipeline focuses on the pixel-wise delineation of oil slicks in Sentinel-1 SAR patches. All segmentation models were evaluated on the test set using the metrics defined earlier (recall, precision, F1-score, and IoU).
Table 3 summarizes the quantitative comparison between representative encoder–decoder and modern segmentation baselines (U-Net, DeepLabV3, SegNeXt, OFCNet, CBDNet) and the proposed SAM-adapted segmentor (OSDA-SAM).
Overall, the results show that the SAM-based adaptation provides the most accurate and balanced masks. OSDA-SAM achieves the highest F1-score (0.86) and IoU (0.75), indicating improved overlap with ground truth and stronger boundary consistency. In addition, it attains the best recall (0.86), demonstrating that it is less prone to missing thin or low-contrast slick regions compared to the CNN baselines. The strongest competing method is CBDNet (IoU 0.72, F1 0.83), while U-Net reaches a similar F1-score (0.83) with slightly lower IoU (0.71). Among the remaining baselines, DeepLabV3 produces moderate overlap (IoU 0.70, F1 0.82), and SegNeXt/OFCNet show the lowest IoU (0.69) and lower recall (0.77), suggesting a tendency toward under-segmentation on subtle slick pixels.
U-Net attains the highest precision (0.88), implying fewer false positive pixels, but this comes with lower recall (0.80), consistent with conservative masks that may omit faint spill regions. In contrast, OSDA-SAM balances recall and precision at 0.86/0.86, yielding the best overall F1 and IoU.
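The pixel-level metrics can be illustrated on a toy mask pair. The sketch below computes the scores for a single prediction and reference mask; whether the reported values are pooled over all test pixels or averaged per patch is defined in the methodology and is not assumed here.

```python
def mask_metrics(pred, target):
    """Pixel-wise precision, recall, F1 and IoU for two binary masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # oil predicted, oil in reference
            elif p and not t:
                fp += 1          # oil predicted, background in reference
            elif t and not p:
                fn += 1          # oil missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    union = tp + fp + fn
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 0.0
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy 2 x 3 masks (illustrative values only).
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
scores = mask_metrics(pred_mask, true_mask)
print(scores)
```

When both scores are computed from the same pixel counts, they are tied by IoU = F1 / (2 - F1). An F1-score of 0.86 therefore corresponds to an IoU of about 0.75, in line with the OSDA-SAM values above; scores averaged per patch need not satisfy this identity exactly.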
In
Figure 7a, the upper portion of the slick lies in a low-contrast region. OSDA-SAM is the only method that segments it correctly, with U-Net delivering the second-best result and SegNeXt performing the worst. For highly elongated spills such as the example in
Figure 7b, the proposed network successfully delineates the full length of the slick, unlike the competing approaches.
Figure 7c shows a very thin spill against a bright background. Most models perform well, particularly CBDNet and U-Net, but OSDA-SAM produces the thinnest mask. Finally,
Figure 7d depicts a challenging scenario with a very dark sea, resulting in low contrast between the slick and the water. Here, OSDA-SAM captures most of both the small and large slicks, although all methods exhibit substantial misclassifications. In this case, CBDNet performs very well, while U-Net fails to detect the small region entirely.
For scenes that contain both look-alikes and oil spills, OSDA-SAM can segment the slick regions more effectively. For instance,
Figure 7e presents an oil slick that, according to OSPO, is in the same area as an elongated look-alike feature. The results indicate that the proposed model can distinguish between the two and accurately delineate the oil spill region. OSDA-SAM therefore helps separate look-alike features from oil spills in many scenes where both are present, adding a further layer of robustness.
Examining the precision–recall (PR) curves on the test dataset further reinforces the previous findings.
Figure 8 shows the calculated curves. The PR curves provide an informative view of model performance under class imbalance, since they focus directly on the retrieval quality of the positive class. The proposed model achieves the highest PR-AUC (0.9259), indicating the best overall trade-off between precision and recall across decision thresholds. Its curve remains above the competing methods over most of the recall range, showing that it preserves higher precision while maintaining strong sensitivity. In practical terms, this suggests that the proposed model is more effective at identifying positive samples without incurring the same level of false positives as the alternative approaches.
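As a reminder of how a PR curve and its area are obtained, the sketch below sweeps the decision threshold over a handful of made-up scores. The scores and labels are illustrative, and the step-wise (average-precision) integration is an assumption; the text does not state which integration rule produced the reported PR-AUC.

```python
def pr_curve(scores, labels):
    """(recall, precision) points from lowering the threshold one sample at a time.

    Assumes at least one positive label and no tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Step-wise area under the PR curve: sum of precision x recall increment."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Made-up per-pixel oil probabilities and reference labels (1 = oil).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 1, 0]
toy_points = pr_curve(toy_scores, toy_labels)
toy_ap = average_precision(toy_points)
print(toy_points, round(toy_ap, 4))
```

Because true negatives never enter either axis, the curve is unaffected by the large number of easy background pixels, which is why it is informative under the class imbalance noted above.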
Taken together, these findings indicate that adapting a foundation segmentation model to SAR imagery can improve oil slick delineation relative to conventional CNN segmentors, particularly by preserving more complete slick structures without sacrificing precision.
Table 4 reports the computational overhead of OSDA-SAM compared with the baseline segmentation models. The comparison was conducted on a system equipped with an NVIDIA RTX 4070 SUPER GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i5-13600 CPU (Intel Corporation, Santa Clara, CA, USA), and 32 GB of DDR5 RAM. Runtime was averaged over 10 runs using the same input resolution and a batch size of 1 for all models. As expected, OSDA-SAM has a substantially larger total parameter count because it is built on the SAM foundation model. However, only a small fraction of these parameters is trainable, since the SAM backbone remains frozen and adaptation is performed through lightweight modules. Specifically, OSDA-SAM has 639.94M total parameters but only 2.66M trainable parameters, which is lower than all baseline models. This confirms the parameter-efficient nature of the proposed adaptation strategy. In terms of inference speed, OSDA-SAM is slower than the CNN-based baselines, with an inference time of 134.74 ms per batch and a throughput of 7.42 images/s. This additional cost is mainly due to the use of the large SAM image encoder. Although OSDA-SAM is slower, its inference time remains within a usable range for patch-based SAR analysis, especially since the first classification stage reduces the number of patches that need to be passed to the segmentation model.
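The reported figures are mutually consistent, as a short calculation shows. The snippet below uses only the numbers quoted above and assumes that throughput is the reciprocal of per-batch latency at batch size 1.

```python
latency_ms = 134.74          # reported inference time per batch (batch size 1)
total_params_m = 639.94      # reported total parameters, in millions
trainable_params_m = 2.66    # reported trainable parameters, in millions

# At batch size 1, images per second is 1000 ms divided by the latency in ms.
throughput = 1000.0 / latency_ms

# Share of the model that is actually updated during training.
trainable_share = trainable_params_m / total_params_m

print(f"{throughput:.2f} images/s, {100 * trainable_share:.2f}% of parameters trainable")
```

Under that assumption, the latency reproduces the stated 7.42 images/s, and the trainable parameters amount to roughly 0.4% of the full model.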
3.3. Effect of Wind on Segmentation Performance
To examine whether wind conditions influence segmentation performance, we compared patch-level F1-scores (N = 49 patches) from the OSDA-SAM test set with the mean wind speed of each patch derived from collocated ERA5 data (
Figure 9). The resulting scatter does not support a simple linear relationship; instead it suggests a “wind-window” effect: low F1-scores occur across a wide range of wind speeds, whereas higher F1-scores—particularly those above the operational “good” threshold (F1 = 0.8)—are predominantly observed within 1–7 m/s, a range commonly reported to enhance oil spill visibility in SAR imagery. To quantify this observation, we performed a two-stage analysis: (i) we tested whether wind speed dispersion differs across F1 bins using the median-centered Brown–Forsythe (Levene) test and (ii) we tested whether the probability of achieving good performance (F1 ≥ 0.8) is higher inside the 1–7 m/s wind window than outside it using Fisher’s exact test.
The Brown–Forsythe test [
69] is a robust variant of Levene’s test for equality of variances. It evaluates whether groups have the same variability, but it centers each group on the median rather than the mean, making it less sensitive to outliers and non-normal distributions. We applied the test after splitting the patches into three equal F1 tertiles (
Figure 10). The test yielded
p = 0.003348, indicating that wind speed variance differs significantly across tertiles. The ratio of the interquartile ranges (IQRs) between the lowest and highest tertile was 2.6, suggesting that wind speed exhibits ~2.6× larger typical dispersion in the low-F1 tertile than in the high-F1 tertile. However, a bootstrap 95% confidence interval for the IQR ratio was wide (0.729–7.58) and included values < 1, implying that the exact effect size cannot be estimated precisely from these data. Overall, the results provide strong evidence of heteroscedasticity (different wind dispersion across F1 levels), while the magnitude of the dispersion difference remains uncertain due to the limited sample size.
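The test statistic itself is simple to compute. The sketch below is a minimal pure-Python version: the tertile split rule (sorted order cut at n/3 and 2n/3) is an assumption about how the groups were formed, and the data are made up. The p-value follows by comparing the statistic with an F distribution with (k - 1, N - k) degrees of freedom; SciPy's `scipy.stats.levene` with `center="median"` performs the same test end to end.

```python
from statistics import mean, median


def split_by_f1_tertile(f1_scores, wind_speeds):
    """Group wind speeds into three near-equal bins ordered by F1 (assumed rule)."""
    order = sorted(range(len(f1_scores)), key=lambda i: f1_scores[i])
    n = len(order)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [[wind_speeds[i] for i in order[a:b]] for a, b in zip(cuts, cuts[1:])]


def brown_forsythe_statistic(groups):
    """One-way ANOVA F statistic on absolute deviations from group medians."""
    # Replace each observation by its absolute distance to its group median.
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    k = len(deviations)
    n = sum(len(d) for d in deviations)
    grand_mean = mean(v for d in deviations for v in d)
    group_means = [mean(d) for d in deviations]
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for d, m in zip(deviations, group_means) for v in d)
    return ((n - k) / (k - 1)) * between / within


# Made-up wind speeds (m/s) for two groups, for illustration only.
toy_groups = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
toy_statistic = brown_forsythe_statistic(toy_groups)
print(toy_statistic)
```

Centering on the median rather than the mean is what distinguishes this variant from the classical Levene test and makes it less sensitive to the skewed wind distributions that small samples tend to produce.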
To test whether the probability of obtaining a “good” F1-score (F1 ≥ 0.8) is higher when wind speed falls within the “good window” than when it falls outside it, Fisher’s exact test [
70] was performed on a 2 × 2 contingency table (good vs. not-good performance; inside vs. outside the window). Fisher’s exact test is appropriate here because the number of patches outside the window is small (n = 7). The test returned
p = 0.004837, providing strong evidence of an association between wind-window membership and good performance. In our test set, 31/42 patches inside the window achieved F1 ≥ 0.8 (
proportion 0.738), compared with 1/7 patches outside the window (
proportion 0.143). The estimated odds ratio was 15.84 (95% confidence interval: 1.65–798), indicating substantially higher odds of good performance inside the window; however, the confidence interval is wide because of the small number of samples outside the window and only one “good” case outside. For interpretability, the corresponding risk ratio was 5.167 and the risk difference was 0.595; i.e., being inside the window is associated with an approximately 60-percentage-point-higher probability of good performance. Note that in this dataset, “outside the window” corresponds almost entirely to wind speeds > 7 m/s (no samples < 1 m/s), so the result primarily reflects reduced performance under higher winds.
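The contingency-table arithmetic can be reproduced from the counts given above (31 of 42 good patches inside the window, 1 of 7 outside). The sketch below implements the two-sided Fisher test directly from the hypergeometric distribution; the function name is illustrative and the "sum all tables no more likely than the observed one" rule is the usual two-sided convention, assumed here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Add up every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


# Rows: inside / outside the 1-7 m/s window; columns: F1 >= 0.8 / F1 < 0.8.
inside_good, inside_bad = 31, 11
outside_good, outside_bad = 1, 6

p_value = fisher_exact_two_sided(inside_good, inside_bad, outside_good, outside_bad)
risk_inside = inside_good / (inside_good + inside_bad)
risk_outside = outside_good / (outside_good + outside_bad)
risk_ratio = risk_inside / risk_outside
risk_difference = risk_inside - risk_outside
sample_odds_ratio = (inside_good * outside_bad) / (inside_bad * outside_good)

print(round(p_value, 6), round(risk_ratio, 3), round(risk_difference, 3),
      round(sample_odds_ratio, 2))
```

This reproduces the reported p-value, risk ratio and risk difference. The simple cross-product odds ratio comes out near 16.9 rather than 15.84; the reported figure may come from a different estimator (for example, a conditional maximum-likelihood estimate), which the text does not specify.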
Overall, the 1–7 m/s wind window appears to be a favorable condition for achieving good patch-level segmentation performance (F1 ≥ 0.80). Under moderate wind conditions that favor clearer slick contrast in SAR imagery, performance is more stable, whereas low F1-scores occur across a broader range of wind speeds. This suggests that wind speed influences detectability but does not, by itself, determine segmentation quality. As a robustness check, aggregating patches by acquisition (one observation per scene ID) yielded consistent results (Fisher p = 0.004702; Wilcoxon p = 0.009338).
3.4. Ablation on GLCM Texture Features
To measure the effect of the proposed texture-enhanced input representation, an ablation study was conducted in which the two GLCM statistics, homogeneity and variance, were introduced together with the preprocessed Sentinel-1 VV backscatter and compared against the VV-only input representation. The reasoning behind this is that oil slicks are expected to have higher homogeneity and lower variance than the surrounding sea surface, and this information can improve the boundary separability and reduce false positives.
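To make the two descriptors concrete, the sketch below builds a grey-level co-occurrence matrix for a tiny quantized patch and evaluates homogeneity and variance using the standard Haralick-style definitions. The number of grey levels, the pixel offset and the use of a single whole-patch matrix are all assumptions made for illustration; in practice, texture channels are normally computed in a sliding window around every pixel, and the settings used in this study are given in the methodology.

```python
def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized grey-level co-occurrence matrix for one offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions (symmetric GLCM)
    total = sum(map(sum, counts))
    return [[value / total for value in row] for row in counts]


def homogeneity(p):
    """Large when co-occurring grey levels are close to each other."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))


def variance(p):
    """Spread of grey levels around the GLCM mean."""
    n = len(p)
    mu = sum(i * p[i][j] for i in range(n) for j in range(n))
    return sum((i - mu) ** 2 * p[i][j] for i in range(n) for j in range(n))


# Toy patches quantized to 4 grey levels: a smooth "slick" and a rough "sea".
smooth_patch = [[1, 1, 1], [1, 1, 1]]
rough_patch = [[0, 3, 0], [3, 0, 3]]
smooth_glcm = glcm(smooth_patch, levels=4)
rough_glcm = glcm(rough_patch, levels=4)
print(homogeneity(smooth_glcm), variance(smooth_glcm))
print(homogeneity(rough_glcm), variance(rough_glcm))
```

The smooth patch attains the maximum homogeneity and zero variance, while the rough patch shows the opposite pattern, which is the contrast the texture channels are expected to expose between slicks and the surrounding sea.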
3.4.1. Effect on Classification (Patch-Level)
Across the six classification backbones, adding texture channels produces no consistent improvement in recall, precision, specificity, or F1-score. In most cases, the changes are marginal (near zero), and, in several cases, performance slightly decreases (
Table 5). This indicates that, for the classification stage, the texture descriptors do not add discriminative information beyond what the networks already learn from the intensity channel.
This result is expected for two main reasons. First, classification is context-driven, and the relevant context is global rather than local. The classifier assigns a single label to the whole patch, so its decision may depend on mesoscale information, such as the geometric arrangement of dark patches, their form and connectivity, and the relationship between slick patterns and the sea background. While texture maps may provide useful local information, they do not directly represent the global geometry that may be required to distinguish actual slicks from look-alike patterns. Hence, the GLCM features are largely redundant, at least for architectures that are good at capturing multi-scale context. Second, a practical factor is performance saturation. The strongest backbone already achieves very high scores with the baseline input, leaving limited headroom for handcrafted channels to help.
3.4.2. Effect on Segmentation (Pixel-Level)
To measure the contribution of texture to segmentation performance, we remove the two GLCM channels and compare the VV-only input with the texture-aided input (
Table 6). In the SAR domain, slicks are generally more locally homogeneous and smoother than the surrounding ocean, but intensity contrast may be low, and speckle noise may obscure or break up boundaries, leading to increased uncertainty in boundary regions. Homogeneity and variance offer a direct characterization of local structure that, together with VV, can alleviate such ambiguity in low-contrast boundary areas, potentially aiding completeness. This is most apparent in the case of OSDA-SAM, where the addition of texture results in the most pronounced gains (ΔRecall +0.04, ΔF1 +0.02, ΔIoU +0.03 with ΔPrecision ≈ 0.00), suggesting that texture is mainly beneficial for segmenting noisy and difficult cases. In the case of fully trainable CNN segmentors, the impact is more modest and less robust since many such models can learn filters that resemble texture from intensity images. For example, U-Net makes only a slight improvement (ΔRecall +0.01, ΔPrecision +0.01, ΔIoU +0.01), while other models display near-zero or mixed changes.
Figure 11 presents the feature maps produced by the SAM encoder of the OSDA-SAM architecture for representative oil spill patch inputs. Because the feature maps are multi-channel and differ in resolution from the input, they are averaged across channels and resampled to the input resolution for visualization. The VV backscatter of each patch is shown alongside the aggregated feature maps from the model trained using VV only, the model trained with texture features, and the corresponding ground-truth information.
Figure 11a illustrates a case where the texture-based model exhibits stronger activation over small oil slicks.
Figure 11c shows that the model assigns reduced relevance to ocean objects that are not associated with oil slicks. The remaining examples further demonstrate that incorporating texture information suppresses activation in dark areas unrelated to oil spills, making them less prominent in the resulting feature maps.
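The visualization procedure described above (channel averaging followed by resampling to the input resolution) can be sketched as follows. The function name `aggregate_feature_map`, the nearest-neighbour resampling, and the final min-max scaling are assumptions made for illustration; the interpolation actually used for Figure 11 is not stated in the text.

```python
import numpy as np

def aggregate_feature_map(features, out_size):
    """Average a (C, H, W) feature tensor over channels and upsample it.

    Nearest-neighbour resampling is used for simplicity; this is an assumed
    choice, not necessarily the one used for the published figure.
    """
    mean_map = features.mean(axis=0)                      # (H, W)
    h, w = mean_map.shape
    rows = np.arange(out_size[0]) * h // out_size[0]      # source row for each output row
    cols = np.arange(out_size[1]) * w // out_size[1]      # source column for each output column
    upsampled = mean_map[np.ix_(rows, cols)]
    # Min-max scale to [0, 1] so that maps from different models share a display range.
    lo, hi = upsampled.min(), upsampled.max()
    if hi == lo:
        return np.zeros_like(upsampled)
    return (upsampled - lo) / (hi - lo)
```

Because the scaling is applied per map, activations from the VV-only and texture-aided models would be comparable in spatial pattern but not in absolute magnitude.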
The benefit to OSDA-SAM is much stronger because of its domain adaptation architecture. The SAM backbone is kept mostly frozen, and the SAR-to-SAM gap is bridged with lightweight adaptation (pixel-space mapping, adaptive normalization, and low-rank updates). In this setting, the texture channels are particularly useful because they supply SAR-relevant second-order structure that the mostly frozen backbone does not relearn from scratch, assisting in the separation of slick from sea under speckle and low contrast. When the adaptation modules align these channels into a SAM-compatible space, the model can leverage the new structure to better include thin slick extensions and faint boundary pixels, leading to the recall-driven improvement seen in the ablation.
3.5. Ablation on OSDA-SAM Components
To further investigate the significance of each adaptation module in the OSDA-SAM architecture, an ablation study was conducted by deactivating one component at a time and retraining the model on the training dataset. The results on the test dataset are presented in Table 7. The most significant module is LoRA: omitting it drops the F1-score by 5 percentage points to 0.81, while omitting IDAB or RFSA lowers it to 0.83 and 0.85, respectively. This is expected, as LoRA adapts the weights of the model itself, while IDAB transforms the inputs and RFSA only provides additional guidance within the feature representation after the SAM encoder.
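The reason LoRA acts on the model weights themselves can be illustrated with a generic low-rank update applied to a single frozen linear layer. This is a sketch of the general LoRA technique, not the OSDA-SAM code: the class name `LoRALinear`, the rank, the scaling factor `alpha`, and the initialization constants are assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weights
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable, small random init
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, in_dim); the second term is the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_parameters(self):
        # Only A and B are updated, i.e. rank * (in_dim + out_dim) values.
        return self.A.size + self.B.size
```

Since `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training then changes the effective weight matrix directly. Input-side or post-encoder modules, by contrast, leave the encoder weights untouched.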
4. Discussion
This work proposed a two-stage deep learning approach for operational oil spill mapping in Sentinel-1 SAR images, targeting the challenge of distinguishing real oil slicks from look-alike regions and accurately segmenting slick boundaries. The first stage uses a ConvNeXt-T patch classifier to screen SAR images for patches that potentially contain oil slicks, and the second stage carries out prompt-free, single-shot image segmentation using the proposed OSDA-SAM, a domain-adapted version of the Segment Anything Model. OSDA-SAM bridges the SAR-to-RGB domain gap with an Input Domain Adaptation Block that includes residual pixel-space correction and channel-wise normalization, along with low-rank updates in the frozen SAM image encoder and a lightweight residual adapter on top of the image embeddings.
Quantitative analysis on the test set indicates that the screening step reaches high sensitivity with a low false alarm rate. The best trade-off of the competing backbones is provided by ConvNeXt-T, with a recall of 0.98, precision of 0.94, specificity of 0.98, and F1-score of 0.96. This makes it a good choice for shrinking the search space in dense prediction tasks while rarely eliminating actual spills. For boundary delineation, OSDA-SAM provides the strongest and most balanced masks, with a recall of 0.86 and precision of 0.86, and the highest overlap with the ground truth (F1-score of 0.86 and IoU of 0.75). These results outperform the state-of-the-art CNN-based baselines, such as CBDNet (IoU of 0.72, F1-score of 0.83) and U-Net (IoU of 0.71, F1-score of 0.83). This indicates that after aligning the input distribution, the segmentation priors captured by a foundation model can be leveraged for more coherent and complete oil-slick masks.
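For reference, the metrics quoted above follow the standard confusion-matrix definitions, which can be sketched as below. The function name `binary_metrics` and the convention of returning 0 for an undefined ratio are assumptions; the aggregation used in this study (for example, per patch versus over all pixels) is not specified here.

```python
import numpy as np

def binary_metrics(pred, truth):
    """Precision, recall, specificity, F1 and IoU for binary labels or masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))     # oil predicted, oil present
    fp = int(np.sum(pred & ~truth))    # oil predicted, no oil present
    fn = int(np.sum(~pred & truth))    # oil missed
    tn = int(np.sum(~pred & ~truth))   # background correctly rejected

    def ratio(num, den):
        return num / den if den else 0.0   # assumed convention for empty denominators

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "iou": ratio(tp, tp + fp + fn),
    }
```

When F1 and IoU are computed from the same counts, they are related by IoU = F1 / (2 − F1); an F1-score of 0.86 corresponds to an IoU of about 0.75, which agrees with the values reported for OSDA-SAM.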
The classification-stage results compare favorably with those reported in prior work. For instance, ref. [29] reports 99% accuracy, 84% precision, and an F1-score of 80% in a Sentinel-1 two-stage framework. Recall is comparable, while precision and F1 are higher in the present study. For segmentation, ref. [32] reports an F1-score of 0.892, which is slightly higher, likely due to the substantially larger training dataset used. Ref. [36] reports an F1-score of 78.48 on a large and varied dataset, like ours, which is lower than the value achieved here, further indicating strong robustness under diverse operating conditions. Studies [30, 34, 38] were trained and tested on smaller datasets with less scene variation and therefore we cannot draw direct comparisons. In [31], multiclass oil spill segmentation yields a markedly lower F1-score of 53.79, consistent with the increased difficulty of multiclass discrimination. Finally, study [35] reports very high performance, but a direct comparison is not appropriate due to differences in sensor resolution and evaluation on a small set of large scenes.
An ablation study on the texture-enhanced input representation further helps clarify the importance of domain-specific information. Adding Gray-Level Co-occurrence Matrix statistics (homogeneity and variance) to VV backscatter does not provide any consistent improvement in patch-level classification, with marginal and sometimes negative changes across backbones. This indicates that the screening decision is dominated by mesoscale morphology and context information already captured from intensity images. On the other hand, the same texture augmentation does help pixel-wise segmentation in most models, and the improvement is most significant for OSDA-SAM, with absolute improvements of 0.04, 0.02, and 0.03 in recall, F1-score, and IoU, respectively, compared to VV-only inputs, while precision is left essentially unchanged. This is consistent with the interpretation that homogeneity and variance capture SAR-relevant second-order structural information useful for delineating low-contrast boundaries under speckle noise, especially when the backbone is mostly frozen and only lightly adapted.
At the dataset level, the samples span the full Sentinel-1 IW incidence angle range as well as the expected wind speed range, indicating that the reported results are not limited to a narrow set of viewing geometries or environmental conditions. For the classification stage, the qualitative error analysis shows that most misclassifications occur in inherently difficult cases (e.g., faint slicks or highly linear look-alikes), while the associated wind speed and incidence angle values remain within typical ranges. This suggests that classification performance is driven more by scene appearance and shape-related cues than by acquisition or environmental parameters alone. For the segmentation stage, the analysis suggests that patch-level segmentation performance is generally more consistent under moderate wind conditions, where oil slick contrast in SAR is typically clearer, while lower F1-scores appear across a broader range of wind speeds. In turn, this indicates that wind speed affects detectability but does not, by itself, determine segmentation quality.
Several limitations should be considered when interpreting these results. Although manual inspection and correction were performed, the reference labels originate from operational mapping products that combine visual interpretation with automated tools and may therefore contain boundary uncertainty or occasional omissions. Also, oil spills are undetectable in the presence of dense look-alike areas (last example of row d in Figure 7). Furthermore, inaccuracies in land masking and a dense presence of fish farms can heavily affect image statistics, influencing the binning process during the GLCM estimation and ultimately leading to a loss of sea texture detail. Finally, the two-stage design introduces an unrecoverable error path: spills missed by the classifier are not passed to the segmentation stage. The missed oil spills, which account for approximately 2% of the cases, are typically characterized by low contrast, diffuse or fragmented boundaries, and limited separability from the surrounding sea surface. These characteristics suggest that they are mainly thin or low-volume spills, as well as weathered, aged, or dispersed oil slicks. In some cases, the spills are also mixed with look-alike areas, primarily under low-wind conditions, placing them near the SAR detectability limits. Since the number of missed cases is small and they generally correspond to weak or ambiguous slick signatures, their impact on the overall pollution area assessment is expected to be limited.
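The unrecoverable error path of the two-stage design can be illustrated with a minimal sketch of the cascade. The function name `two_stage_inference`, the 0.5 decision threshold, and the `classify` and `segment` callables are hypothetical stand-ins for the ConvNeXt-T classifier and OSDA-SAM, not the actual interfaces of the proposed system.

```python
import numpy as np

def two_stage_inference(patches, classify, segment, threshold=0.5):
    """Run segmentation only on patches the classifier flags as containing oil.

    `classify` maps a patch to an oil probability and `segment` maps a patch
    to a boolean mask; both are hypothetical callables supplied by the caller.
    """
    masks = []
    for patch in patches:
        if classify(patch) >= threshold:
            masks.append(np.asarray(segment(patch), dtype=bool))
        else:
            # Rejected patches receive an empty mask, so a spill missed at this
            # point cannot be recovered by the segmentation stage.
            masks.append(np.zeros(patch.shape, dtype=bool))
    return masks
```

In this arrangement, the recall of the screening stage caps the fraction of spill patches that can ever be delineated, which is why the roughly 2% of missed spills noted above translate directly into omitted masks.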
These limitations point to several directions for future work. Possible avenues for limiting the effect of misclassifications during the first stage include joint/curriculum training of both stages and active learning to focus annotation efforts on hard cases, as well as test-time augmentations. Even though noise is currently handled at the data preprocessing level (Lee filter, thermal noise removal) and Sentinel-1 is considered a low-noise SAR system, greater integration of noise robustness into the model design can also be investigated in the future. Furthermore, the fusion of other SAR features, such as multiple polarizations, and of auxiliary environmental information, such as wind speed and sea surface temperature, could be examined. Finally, investigating additional GLCM features, such as entropy, as input could lead to further improvements in the segmentation stage.