Article

FCS-Net: A Frequency-Spatial Coordinate and Strip-Augmented Network for SAR Oil Spill Segmentation

1 School of Yonyou Digital & Intelligence, Nantong Institute of Technology, Nantong 226001, China
2 Division of Information and Communication Convergence Engineering, Mokwon University, Daejeon 35349, Republic of Korea
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(2), 168; https://doi.org/10.3390/jmse14020168
Submission received: 5 December 2025 / Revised: 1 January 2026 / Accepted: 9 January 2026 / Published: 13 January 2026

Abstract

Accurate segmentation of marine oil spills in synthetic aperture radar (SAR) images is crucial for emergency response and environmental remediation. However, current deep learning methods are still limited by two long-standing bottlenecks: first, multiplicative speckle noise and complex background clutter make it difficult to accurately delineate actual oil spills; and second, limited receptive fields often lead to the geometric fragmentation of elongated, irregular oil films. To surmount these challenges, this paper proposes a novel framework termed the Frequency-Spatial Coordinate and Strip-Augmented Network (FCS-Net). First, we leverage the ConvNeXt-Small backbone to extract robust hierarchical features, utilizing its large kernel design to capture broad contextual information. Second, a Frequency-Spatial Coordinate Attention (FS-CA) module is proposed to integrate spatial coordinate encoding with global frequency-domain information. Third, to maintain the morphological integrity of elongated targets, we introduce a Strip-Augmented Pyramid Pooling (SAPP) module which employs anisotropic strip pooling to model long-range dependencies. Extensive experiments on the multi-source SOS dataset demonstrate the effectiveness of FCS-Net. The proposed method achieves state-of-the-art performance, reaching an mIoU of 87.78% in the Gulf of Mexico and 89.62% in the challenging Persian Gulf, outperforming strong baselines and demonstrating superior robustness in complex ocean scenarios.

1. Introduction

The ocean acts as the linchpin of our global climate and a sanctuary for biodiversity, yet it remains perilously vulnerable to human industrial expansion—most notably, oil spills. These disasters, ranging from violent rig explosions and collisions to intentional illegal discharges, do more than pollute the water; they wreak havoc on marine ecosystems and jeopardize coastal livelihoods for decades [1,2,3,4,5]. Consequently, the timely and accurate monitoring of oil spills is paramount for effective emergency response and environmental mitigation [6]. High-precision vessel and buoy observations face cost and weather constraints that limit their scalability compared to remote sensing solutions [7,8,9]. Synthetic Aperture Radar (SAR) dominates marine surveillance by offering all-weather day-and-night imaging capabilities [10,11]. Oil films dampen sea surface capillary waves to reduce radar backscatter and form dark spots distinct from the brighter surrounding water [12,13,14]. Operational monitoring systems must distinguish these spills from natural “look-alikes” such as low-wind areas or organic films that produce similar dark signatures [15,16].
Conventional SAR processing chains combined adaptive thresholding for dark spot identification with manual geometric and textural feature extraction for subsequent statistical classification. Solberg et al. [17] and Del Frate et al. [1] implemented these techniques using Support Vector Machines (SVMs) and Mahalanobis distance classifiers. However, rigid reliance on hand-crafted features hinders generalization to varied oil spill appearances and complicates the discrimination of oil spills from similar-looking phenomena in dynamic marine environments [18].
The launch of the Sentinel-1 satellite in 2014 marked a turning point, as the deep learning (DL) era replaced manual feature engineering with large-scale data-driven processing [19]. Krestenitis et al. [15] pioneered this transition by establishing modern standards for segmentation performance. The trend has been further accelerated by the release of the SOS dataset [11], which contains 3877 samples from disasters such as the Deepwater Horizon oil spill and offers boundary-aware annotations that substantially improve the localization capability of disaster-related benchmarks.
While fully convolutional networks (FCNs) [20] and U-Net [21] dominate pixel-level segmentation, newer methods focus on improving their structural stability. Innovations include a sequence-based dark spot recognition method proposed by Chen et al. [22], and deeper extraction methods in OSCNet [23]. To handle complex scale variations, DeepLabV3+ [24] has achieved success through contextual aggregation, as proposed by Wang et al. [25], and later extended with Bayesian optimization. Importantly, by addressing the classic problem of extremely limited labeled data, architectures such as MTGAN [26] can now use generative adversarial learning to perform segmentation tasks on very few samples.
Although these initial techniques obtained classification accuracies of more than 95% on benchmark datasets, they suffered from inherent speckle noise and similarities between oil spills and look-alikes. To address these deficiencies, most recent methods explicitly increase feature discrimination through high-level modules. For example, Li et al. [27] proposed multi-convolutional layers (MCLs) to accumulate both global and local contextual information. Similarly, AUOSD [28] and DAM-UNet [29] exploit attention models that integrate polarization information with dual attention mechanisms to improve boundary delineation. Indeed, the reliability and strong accuracy of attention-based and transformer-based architectures have been empirically demonstrated across diverse application domains, ranging from automated fault diagnosis in building systems [30] to object counting in UAV imagery [31]. Inspired by this cross-domain success, a major breakthrough of 2023–2024 in the maritime domain was the adoption of Transformer-based hybrid models such as OSDTAU-Net [32], which use global attention mechanisms to capture long-distance dependencies ignored by pure CNNs. This period also saw advances in foundation models, such as the SAM-OIL framework proposed by Wu et al. [33], which employs a Segment Anything Model with a learnable adapter designed to address blurred edges. To tackle the extreme class imbalance problem, where oil spills usually constitute less than 2% of the pixels, two-stage approaches (e.g., the Frost filter used by Shaban et al. [34]) have been adopted, along with specialized loss functions such as the generalized Dice loss. Meanwhile, studies are also seeking to minimize operational costs through lighter designs like FA-MobileUNet [35], which reaches real-time detection speeds without compromising the ability to distinguish oil spills from other substances.
Together, these joint efforts are assembling intelligent systems fit for maritime surveillance tasks.
Although these techniques have revolutionized remote sensing in general and oil spill segmentation in particular, we posit that standard CNN architectures lack an appropriate inductive bias for SAR oil spill segmentation. Most CNNs use isotropic square kernels and pooling layers designed to capture compact objects (such as cars and buildings), which are ill-suited to the anisotropic, filamentary structure of oil spills. This inherent limitation leads to a failure to preserve the smoothness of spill edges. Moreover, spatial image intensity alone is insufficient, because “spectral ambiguity” arises when common features (e.g., low-wind areas) mimic the backscattering properties of oil. We argue that a good solution must go beyond the spatial domain. To address these problems, we introduce FCS-Net, which matches oil spill geometry and removes semantic inconsistency via multi-domain feature representations. The main contributions of this work are as follows:
(1) Backbone Innovation: We adopt ConvNeXt [36] as the encoder backbone. Its modernized architecture combines the inductive biases of CNNs with the global receptive field of Transformers, enabling stronger hierarchical features to be extracted from noisy SAR data.
(2) Dual-Domain Attention: We design a new Frequency-Spatial Coordinate Attention (FS-CA) module. By integrating Coordinate Attention with a specifically designed Fast Fourier Transform (FFT)-based process, FS-CA jointly encodes precise spatial coordinate cues and global frequency-domain texture information within a unified attention formulation. This dual-domain design reduces false alarms caused by background, thereby improving robustness in complex SAR ocean scenarios.
(3) Multi-Scale Context Aggregation: We introduce a Strip-Augmented Pyramid Pooling (SAPP) module. Unlike conventional PPMs [37], which use only square pooling kernels, SAPP adds dedicated strip pooling branches that encode anisotropic long-range dependencies along the near-horizontal and near-vertical directions. This design markedly improves the segmentation completeness of slim, continuous oil spill targets and effectively suppresses the fragmentation problem prevalent in prior art.

2. Dataset Description

2.1. Data Sources and Collection

The Deep-SAR Oil Spill (SOS) dataset [11] is a multi-source benchmark dataset developed specifically for oil spill segmentation in maritime environments. It contains two subsets acquired by different SAR sensors: the L-band ALOS PALSAR subset and the C-band Sentinel-1A subset. In this study, we use these two subsets as separate evaluation settings to assess performance under heterogeneous acquisition configurations, without performing cross-sensor data fusion. Complete technical specifications and contextual information for the two subsets are summarized in Table 1.
The subset derived from the Phased Array type L-band Synthetic Aperture Radar (PALSAR) onboard the ALOS satellite [38] focuses on the Deepwater Horizon oil spill in the Gulf of Mexico. The data were obtained from a temporal sequence of images collected between May and August 2010, within the geographical limits of 87.056° W to 91.371° W and 25.719° N to 29.723° N. Level 1.5 PALSAR products were selected as the raw data because they had already undergone multi-look processing and map projection. Operating in L-band with HH polarization, the PALSAR sensor offers a pixel spacing of 12.5 m and provides vital structural information on the spill’s macro-geometry. This subset is characterized by continuous, thick, and unbroken oil slicks.
Subset 2 is located in the Persian Gulf. These images are C-band Level-1 interferometric wide-swath ground range products acquired by the Sentinel-1A satellite, with a ground spatial resolution of 5 m × 20 m. The incidence angle of Sentinel-1A varied between approximately 29.1° and 46° when acquiring images before and after the event. The time series covers the region from 46.893° E to 52.812° E and from 25.505° N to 48.995° N. While Sentinel-1A provides both VV and VH polarizations, this subset uses only VV polarization data, as VV is generally more sensitive to surface roughness variations and thus facilitates the delineation of oil film boundaries [39]. The raw data underwent standard preprocessing steps, including filtering, radiometric calibration, and topographic correction, to derive backscattering intensity. Regarding environmental conditions, the dataset represents scenarios with low to moderate wind speeds. Under such conditions, look-alike phenomena (e.g., low-wind zones) may appear as dark patches similar to oil spills. In this study, these ambiguous patterns are treated as part of the complex background.

2.2. Dataset Scale and Configuration

The SOS dataset was created by applying a sliding-window approach to full scenes, yielding 8070 labeled images. Each image was resized to 256 × 256 pixels for use as input to deep learning models. A key characteristic of this publicly available dataset is the extensive augmentation applied to increase model generalization; specifically, additive Gaussian noise and multiplicative speckle noise were injected during augmentation. Because SAR imagery has a naturally low signal-to-noise ratio (SNR), this augmentation encourages models trained on the dataset to remain robust when deployed on real scenes.
To guarantee strict comparability with existing benchmarks, we follow the official data split protocol of the dataset creators exactly. This protocol specifies a training-to-testing ratio of 8:2 and provides pixel-level binary ground truth masks. In this official configuration, the PALSAR subset comprises 3877 images (3101 training/776 testing) and the Sentinel-1 subset 4193 images (3354 training/839 testing). In total, 6455 training and 1615 testing samples are used, ensuring a fair comparative evaluation.
To avoid any risk of test-set leakage during hyper-parameter tuning, we further define a validation split within the official training set. Specifically, we randomly sample 10% of the official training images as a held-out validation set and use the remaining 90% for training. This sampling is performed once before training, and the resulting validation file list/indices are kept fixed for all baseline comparisons and ablation studies. The official test split remains unchanged and is used only for final evaluation.
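The one-time 90/10 split described above can be sketched in a few lines. This is an illustrative sketch, not the authors' released code; the `make_validation_split` helper and the seed value are assumptions introduced here for clarity.

```python
import random

def make_validation_split(train_ids, val_fraction=0.10, seed=42):
    """Split the official training IDs once into train/val subsets.

    Seeding the shuffle makes the split reproducible, so the resulting
    index lists can be saved and reused unchanged for all baselines
    and ablation studies.
    """
    ids = sorted(train_ids)          # canonical order before shuffling
    rng = random.Random(seed)        # fixed seed -> same split every run
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]  # (train subset, validation subset)

# With the PALSAR training size reported above (3101 images):
train_ids, val_ids = make_validation_split(range(3101))
assert len(train_ids) == 2791 and len(val_ids) == 310
```

Saving `val_ids` to disk after this single call is what keeps the validation set identical across all experiments.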

2.3. Descriptive Statistics

The statistical distribution of the dataset directly informs the model training strategy. The most prominent characteristic is a high level of class imbalance at the pixel level: oil spill pixels are far outnumbered by background pixels despite the large overall count of images. Unless care is taken with the loss function during training (e.g., class-balancing losses such as weighted cross-entropy or Dice loss), models will converge to the trivial solution of predicting all background.
Another important characteristic of the dataset is its high morphological variability. The PALSAR subset contains very large, connected, blocky oil spills, while the Sentinel-1 subset is generally composed of long, thin, fragile curvilinear morphologies. To segment such disparate feature sizes successfully, the segmentation network should combine multiple scales of feature extraction in its design. Illustrations of these morphological attributes, alongside examples of visual confusion (look-alikes), are included in Figure 1.

2.4. Characteristics and Challenges

The SOS dataset presents three primary challenges that serve as the motivation for the architectural design proposed in this study:
Look-Alikes and Spatial Ambiguity: The dataset exhibits several “look-alike” phenomena, including low-wind regions and ship wakes. Under low to moderate wind conditions, the visual and statistical difference between oil films and look-alikes (e.g., wind slicks) can be minimal. This study therefore focuses on extracting the annotated oil targets from this challenging composite background, without explicitly categorizing the physical types of look-alikes. Because the spatial textures and intensity profiles of these features coincide with those of real oil, they cannot be discriminated on the basis of spatial-domain information alone.
Noise Interference and Label Uncertainty: Real SAR images are inherently corrupted by coherent speckle noise. Combined with the diffuse (transition-zone) nature of oil spill boundaries, this results in large label uncertainty. Models therefore need strong feature encoders that attenuate high-frequency noise while preserving semantic boundary information.
Multi-Scale Anisotropy: The target objects vary greatly in morphology, from large blocky slicks to fine thread-like filaments (linear features). These elongated structures are highly anisotropic (direction-dependent), so regular pooling with a k × k square kernel performs poorly on them; capturing such targets requires modeling long-range dependencies along specific spatial dimensions.

3. The FCS-Net Architecture

In this section, we present the Frequency-Spatial Coordinate and Strip-Augmented Network (FCS-Net). Designed as an integrated framework, FCS-Net is customized to address the two-fold challenge in SAR segmentation: it preserves the geometric continuity of slender oil spills while effectively suppressing background clutter, thereby yielding robust segmentation results.

3.1. Overview of the Architecture

We have developed a novel end-to-end segmentation system named FCS-Net. Figure 2 shows its overall architecture, which adopts a standard encoder-decoder layout and is carefully designed to mitigate inherent speckle noise while maintaining the integrity of elongated, fragmented targets.
The architecture builds on the ConvNeXt-Small encoder as its backbone, exploiting the large receptive field of its kernels to provide global context during image processing. The FS-CA module is inserted at each encoder stage as well as in the decoder to limit confusion between visually similar regions. FS-CA performs a two-step filtering that uses both the spatial location and the frequency content of the features to suppress noise, improving the quality and fidelity of oil film representations.
In addition to the FS-CA module, the encoder and decoder are linked through a Strip-Augmented Pyramid Pooling (SAPP) module. Traditional pyramid pooling ignores the orientation of image structures, so slender features are averaged away in each pooled representation. SAPP adds a strip-oriented branch to ensure a coherent feature representation for thin oil films.
The decoder follows a U-Net-like design, generating the final output via transposed convolutions with skip connections that efficiently merge low-level representations with deep semantic features. Finally, a lightweight two-layer convolutional prediction head is applied to the decoded features to progressively reduce the channel dimension and project them to a single-channel probability map, yielding the final binary segmentation mask.

3.2. Encoder Module

As the core of feature extraction in FCS-Net, the encoder module learns strong hierarchical representations from raw SAR data. To address both semantic consistency and boundary learning, we design a composite network in which the ConvNeXt-Small backbone is fused with our Frequency-Spatial Coordinate Attention (FS-CA) mechanism. Structurally, the encoder consists of four consecutive stages; deeper stages progressively down-sample the input resolution so that multi-scale contexts are captured along the way. Meanwhile, the embedded FS-CA modules enhance the intermediate features at each stage to suppress speckle noise and eliminate ambiguous background interference. Finally, the encoder outputs a collection of four multi-scale feature maps, which serve as input to the context aggregation and decoding steps.

3.2.1. Hierarchical Feature Extraction via ConvNeXt Backbone

To establish a robust feature representation for SAR imagery, we employ ConvNeXt-Small [36] as the primary encoder backbone. ConvNeXt is a modernized pure convolutional architecture that systematically incorporates design choices from Vision Transformers (ViTs) [40]—such as optimized stage compute ratios and “patchify” stem layers—to achieve scalability and global context awareness while retaining the inductive biases of standard CNNs. This architecture effectively bridges the gap between the computational efficiency of convolutions and the representation capability of Transformers, making it an ideal candidate for complex maritime scene analysis. The detailed architecture is illustrated in Figure 3.
To fortify the model against the scarcity of labeled SAR data, the backbone is primed with ImageNet-22K pre-trained weights, providing a rich initial feature prior. The processing begins as the input $X \in \mathbb{R}^{H \times W \times C_{in}}$ is ingested by a strided $4 \times 4$ convolutional stem. From there, features evolve through four hierarchical stages following a $\{3, 3, 27, 3\}$ block distribution. Crucially, the internal inverted bottleneck blocks depart from standard ResNet conventions [41]: they prioritize spatial context aggregation via $7 \times 7$ depthwise convolutions before expanding channel dimensions fourfold through GELU-activated pointwise layers. This structure, stabilized by Layer Scale residuals, culminates in a set of multi-scale feature maps $\{E_1, E_2, E_3, E_4\}$ with channels $C = \{96, 192, 384, 768\}$, effectively encapsulating everything from low-level textures to high-level semantics.
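Under the stated configuration (stem stride 4, cumulative stage strides 4/8/16/32, channels {96, 192, 384, 768}), the shapes of the four encoder outputs for a 256 × 256 input can be tabulated with a small sketch. The `convnext_small_pyramid` helper is hypothetical and only reproduces the shape arithmetic, not the network itself.

```python
def convnext_small_pyramid(h, w):
    """Spatial sizes and channel widths of the four encoder outputs.

    The 4x4 strided stem divides resolution by 4; each later stage's
    downsampling layer divides by a further 2, giving cumulative
    strides of 4, 8, 16, and 32.
    """
    channels = (96, 192, 384, 768)
    strides = (4, 8, 16, 32)
    return [(c, h // s, w // s) for c, s in zip(channels, strides)]

# For the 256 x 256 SOS inputs:
pyramid = convnext_small_pyramid(256, 256)
assert pyramid == [(96, 64, 64), (192, 32, 32), (384, 16, 16), (768, 8, 8)]
```

The last entry, (768, 8, 8), is the bottleneck feature consumed by the SAPP module described later.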
In SAR oil spill detection, backscatter statistics offer a distinctive cue for separating oil spill patterns from the sea background. The $7 \times 7$ depthwise kernels yield a far larger effective receptive field than the $3 \times 3$ filters of traditional ResNet blocks, allowing a broader sea-state context to be captured without being dominated by local intensity effects; this reduces false positives in contiguous dark areas. Furthermore, the use of layer normalization instead of batch normalization, together with GELU activations, improves optimization: the model exhibits reduced sensitivity to high-frequency artefacts while preserving the morphological characteristics of oil spill boundaries.

3.2.2. Feature Refinement with Frequency-Spatial Coordinate Attention

The Frequency-Spatial Coordinate Attention (FS-CA) module, illustrated in Figure 4, resides at the end of every encoder stage. It addresses two significant hurdles of SAR imagery: the geometric fragmentation of targets and the multiplicative speckle noise commonly found in SAR data. By cascading spatial Coordinate Attention [42] (CA) and Frequency Attention [43] (FA) blocks, the module synergistically enhances features in both the physical and spectral domains.
The CA block (refer to Figure 4a) starts the refinement process. To preserve precise positional information while encoding long-range relational structure, two directional pooling layers with kernels $(H, 1)$ and $(1, W)$ are applied to the input feature $X$, generating pooled feature maps along the vertical and horizontal directions, respectively. These spatially aware pooled maps are concatenated and processed by a shared $1 \times 1$ convolution, capturing cross-channel correlations between the two directions. The resulting intermediate feature is then passed through a sigmoid activation to produce the horizontal attention vector $a^h$ and the vertical attention vector $a^w$. Finally, the input feature is reweighted to generate the output spatial feature:
$$X_{ca} = X \cdot a^{h} \cdot a^{w}$$
This mechanism effectively highlights the continuous boundaries of slender oil strips that are often eroded by standard isotropic pooling.
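The directional pooling and broadcast reweighting of the CA block can be sketched as follows. This is a skeleton under simplifying assumptions: the shared 1 × 1 convolution and normalization of the full Coordinate Attention block are omitted, and the attention vectors are taken directly as sigmoids of the pooled features.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention_sketch(x):
    """Skeleton of the CA reweighting X_ca = X * a^h * a^w.

    x: feature map of shape (C, H, W). Only the directional pooling
    and the broadcast reweighting are kept; the learned 1x1 conv of
    the full CA block is omitted for brevity.
    """
    pool_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): pool over width
    pool_w = x.mean(axis=1, keepdims=True)  # (C, 1, W): pool over height
    a_h = sigmoid(pool_h)                   # horizontal attention vector
    a_w = sigmoid(pool_w)                   # vertical attention vector
    return x * a_h * a_w                    # broadcasts back to (C, H, W)

x = np.random.randn(96, 64, 64)
assert coordinate_attention_sketch(x).shape == x.shape
```

Because both attention factors lie in (0, 1), the output magnitude never exceeds the input's, which is the attenuating behavior the text describes.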
In Figure 4b, the Frequency Attention (FA) module is designed to enhance feature discrimination under spectrally ambiguous conditions by exploiting frequency-domain cues, thereby attenuating noise-dominated responses that may lead to spurious activations in low-backscatter backgrounds. The module begins with a $3 \times 3$ dilated convolution (dilation rate 3) that enlarges the receptive field over the input $X_{ca}$ to capture semantic context. The features are then projected into a frequency representation $F = \mathcal{F}(X_{dilated})$ by applying the 2D fast Fourier transform (FFT).
For spectral refinement, we employ a dual-branch strategy. Crucially, the Spectral Channel Attention branch diverges from conventional spatial analysis: it calculates inter-channel frequency correlations via normalized dot products, effectively identifying and suppressing feature maps that are dominated by noise.
$$A_{spec} = \mathrm{Softmax}\left( \tau \cdot \frac{F \cdot F^{H}}{\lVert F \rVert \cdot \lVert F \rVert} \right) \cdot F$$
where $F^{H}$ denotes the conjugate transpose and $\tau$ is a learnable temperature parameter. This operation explicitly models global dependencies among frequency components. In parallel, a spectral gating branch modulates the spectrum to preserve structural details. Specifically, a lightweight convolutional subnetwork computes a gating mask from the real part of the frequency components, $\mathrm{Re}(F)$, which is then element-wise multiplied with the original spectrum. Finally, the outputs from both branches are concatenated and transformed back to the spatial domain via the inverse fast Fourier transform (IFFT). The result is then fused with the original input through a residual connection.
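The spectral-gating path can be illustrated with NumPy's FFT routines. This sketch is a deliberate simplification: the learned gating subnetwork is replaced by a plain sigmoid of the spectrum's real part, and the parallel spectral channel-attention branch is omitted entirely.

```python
import numpy as np

def spectral_gating_sketch(x):
    """Sketch of the FA spectral-gating path with a residual fusion.

    x: (C, H, W) real-valued feature map. The full module uses a
    learned convolutional gate and a second channel-attention branch;
    here the gate is simply sigmoid(Re(F)).
    """
    F = np.fft.fft2(x, axes=(-2, -1))              # 2D FFT per channel
    gate = 1.0 / (1.0 + np.exp(-F.real))           # mask from real part
    F_gated = F * gate                             # element-wise modulation
    y = np.fft.ifft2(F_gated, axes=(-2, -1)).real  # back to spatial domain
    return x + y                                   # residual connection

x = np.random.randn(4, 8, 8)
out = spectral_gating_sketch(x)
assert out.shape == x.shape
```

The residual `x + y` matches the fusion step of the text: the frequency-filtered signal refines, rather than replaces, the spatial features.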

3.3. Anisotropic Context Aggregation Module

The SAPP module is located at the bottleneck of the network, and its structure is shown in Figure 5. The standard PPM [37] uses isotropic square kernels, which struggle to fit slender targets and are prone to producing boundary breaks. In contrast, SAPP adds an anisotropic strip pooling [44] branch to the PPM structure. Strip pooling can capture the large aspect ratio of marine oil films and overcomes the fragmentation problem faced by standard pooling techniques.
Unlike conventional square pooling, strip pooling is designed to capture long-range dependencies along one spatial dimension while ignoring irrelevant information in the other. Specifically, given the bottleneck feature map $E_4 \in \mathbb{R}^{C \times H \times W}$, the strip pooling branch decomposes global context aggregation into two orthogonal operations using kernel shapes of $(1, W)$ and $(H, 1)$. Mathematically, the horizontal pooling output $y^{h} \in \mathbb{R}^{C \times H \times 1}$ and the vertical pooling output $y^{v} \in \mathbb{R}^{C \times 1 \times W}$ are computed as follows:
$$y^{h}_{c,i} = \frac{1}{W} \sum_{0 \le j < W} x_{c,i,j}, \qquad y^{v}_{c,j} = \frac{1}{H} \sum_{0 \le i < H} x_{c,i,j}$$
The one-dimensional context vectors are expanded back to the original spatial dimensions via bilinear interpolation and combined by element-wise summation. This procedure lets disjoint parts of an elongated oil spill be represented jointly, since contextual information is extracted along the primary axis of correlation of the entire spill area.
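The two pooling equations and the subsequent re-expansion can be sketched directly. Here NumPy broadcasting stands in for the bilinear interpolation step (equivalent in this case, since each context vector is constant along the axis being expanded).

```python
import numpy as np

def strip_pooling_sketch(x):
    """Strip pooling per the equations above.

    x: (C, H, W). Averages along the width with a (1, W) kernel and
    along the height with an (H, 1) kernel, then broadcasts both 1-D
    contexts back to (C, H, W) and sums them element-wise.
    """
    y_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): horizontal context
    y_v = x.mean(axis=1, keepdims=True)  # (C, 1, W): vertical context
    return y_h + y_v                     # broadcast expansion + summation

# A single bright horizontal strip propagates along its whole row:
g = np.zeros((1, 3, 3))
g[0, 1, :] = 3.0
out = strip_pooling_sketch(g)
assert np.allclose(out[0, 1, :], 4.0)  # row mean 3.0 + column mean 1.0
```

The toy example shows why fragmentation is suppressed: every pixel on the bright row receives the same strong context, even if parts of the strip were locally dark.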
The SAPP module fuses the anisotropic branch with standard square pooling branches. The square branches use bin sizes of 1 × 1 , 2 × 2 , 3 × 3 , and 6 × 6 . A 1 × 1 convolution compresses the concatenated multi-scale features. A channel attention mechanism refines these features through dynamic weighting. The weighting allows the network to adaptively emphasize the most discriminative context scales. These scales range from local texture details to global shape patterns. The aggregated representation is then passed to the decoder for subsequent processing.

3.4. Feature Refinement Decoder

The decoder adopts a 4-stage U-shaped architecture [21], in which each stage corresponds to one level of the encoder. The spatial dimensions are doubled via a 2D transposed convolution with stride 2. The upsampled features and the encoder features at the same resolution are concatenated along the channel axis to recover the spatial information lost during encoding. After concatenation, the features pass through a double-convolution block comprising two 3 × 3 convolutions, each followed by batch normalization and a ReLU activation. The final prediction head consists of two lightweight convolutional layers that progressively reduce the channel dimension and project the decoded features to a single-channel probability map, yielding the final binary segmentation mask at full resolution.
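At the shape level, one decoder stage (2× upsampling followed by skip concatenation) behaves as sketched below. Nearest-neighbor repetition stands in for the learned transposed convolution, and the double-convolution block is omitted; the point is only the resolution and channel bookkeeping.

```python
import numpy as np

def decoder_stage_sketch(deep, skip):
    """Shape-level sketch of one decoder stage.

    deep: lower-resolution feature (C1, H, W); skip: encoder feature
    (C2, 2H, 2W). Upsamples `deep` by 2x (nearest-neighbor here, a
    transposed convolution in the real network) and concatenates it
    with `skip` along the channel axis.
    """
    up = deep.repeat(2, axis=1).repeat(2, axis=2)  # (C1, 2H, 2W)
    assert up.shape[1:] == skip.shape[1:], "resolutions must match"
    return np.concatenate([up, skip], axis=0)      # channels: C1 + C2

deep = np.random.randn(768, 8, 8)    # bottleneck-level feature
skip = np.random.randn(384, 16, 16)  # matching encoder feature
merged = decoder_stage_sketch(deep, skip)
assert merged.shape == (1152, 16, 16)
```

The double-convolution block that follows in the real decoder would then compress these 1152 channels back down before the next stage.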

4. Experimental Results and Analysis

4.1. Implementation Details

All experiments were conducted under the same hardware/software conditions to ensure equitable comparisons and reproducibility. All models were developed and evaluated on a high-performance workstation equipped with an Intel Xeon Gold 6326 CPU (2.9 GHz), 256 GB RAM, and two NVIDIA GeForce RTX 4090 GPUs (24 GB VRAM each) for acceleration. The workstation runs Manjaro Linux v25.0.10, and models were implemented in Python 3.10 with PyTorch 2.9.1 on CUDA 12.8.
To standardize the training process, all input images were resized to 256 × 256 pixels for network compatibility. Training used the Adam [45] optimizer with an initial learning rate of 0.001, a batch size of 16, and 80 epochs.
At the pixel level, there is a significant class imbalance between oil slick (positive class) pixels and ocean background (negative class) pixels. To counter this imbalance, a composite loss function combining binary cross-entropy (BCE) and Dice loss was used: $L_{total} = L_{BCE} + L_{Dice}$. BCE provides stable gradients during optimization, while the Dice loss directly measures the overlap between the predicted and ground truth regions. Combining the two stabilizes training and yields predictions that more faithfully follow the ground truth boundary.
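A minimal NumPy sketch of the composite loss $L_{total} = L_{BCE} + L_{Dice}$, assuming the network output has already been passed through a sigmoid. The smoothing constant `eps` is an implementation detail introduced here for numerical stability, not a value taken from the paper.

```python
import numpy as np

def bce_dice_loss(probs, target, eps=1e-6):
    """Composite loss: binary cross-entropy plus (1 - Dice coefficient).

    probs:  predicted probabilities in [0, 1], any shape.
    target: binary ground truth mask of the same shape.
    """
    probs = np.clip(probs, eps, 1.0 - eps)  # keep log() finite
    bce = -np.mean(target * np.log(probs)
                   + (1.0 - target) * np.log(1.0 - probs))
    inter = np.sum(probs * target)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(target) + eps)
    return bce + dice

# A perfect prediction drives both terms toward zero:
t = np.array([[0.0, 1.0], [1.0, 0.0]])
assert bce_dice_loss(t, t) < 1e-3
```

In practice the same formula would be applied batch-wise to the network's sigmoid outputs, with gradients handled by the framework's autograd.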
Validation protocol and model selection. We follow the official SOS 8:2 split for training and testing, keeping the test set strictly untouched during development. Within the official training split, we use a held-out validation set (10% of the training images) that is randomly selected once and then fixed. All hyper-parameters reported in this paper (e.g., learning rate, loss weights, and architectural variants) are selected based on validation performance only. Final metrics are obtained by evaluating the checkpoint with the best validation performance on the official test set.

4.2. Evaluation Metrics

To quantitatively assess the segmentation performance, we employ four standard metrics: Mean Intersection over Union (mIoU), F1-Score, Precision, and Recall. The definitions are as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
$$F1\text{-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i}$$
where TP, FP, and FN denote the counts of true positive, false positive, and false negative pixels, respectively, and C represents the number of classes (C = 2). To ensure a fair and rigorous evaluation, all metrics are computed by aggregating the TP, FP, and FN counts over all pixels across the entire independent test split (i.e., global/micro-averaging), rather than by averaging the scores of individual images. This approach avoids potential biases caused by the varying sizes of oil spill targets in different SAR scenes. Furthermore, all quantitative results reported in this section are obtained exclusively from the test split, which was not involved in the training process.
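The global (micro) aggregation protocol can be illustrated with a small, self-contained sketch; the two toy "images" below are hypothetical and the function covers only the positive (oil) class:

```python
def micro_metrics(images):
    """Aggregate TP/FP/FN over all images, then compute global scores.

    `images` is a list of (pred, gt) pairs of equal-length binary lists.
    Illustrative sketch of micro-averaging for the positive class only.
    """
    tp = fp = fn = 0
    for pred, gt in images:
        for p, g in zip(pred, gt):
            tp += p == 1 and g == 1
            fp += p == 1 and g == 0
            fn += p == 0 and g == 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# A tiny slick (image 1) and a large slick (image 2): micro-averaging
# weights every pixel equally, so the large scene dominates -- unlike
# per-image averaging, which would overweight the tiny target.
small = ([1, 0, 0, 0], [1, 1, 0, 0])
large = ([1] * 90 + [0] * 10, [1] * 95 + [0] * 5)
p, r, f1, iou = micro_metrics([small, large])
```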
Precision quantifies the reliability of the detection outcomes, i.e., the proportion of identified anomalies that are truly oil spills. It reflects the system’s capability to filter out complex sea background interference and minimize false alarms. A high false positive rate can trigger costly and unnecessary emergency responses, so precision is a critical factor in the operational efficiency of agencies responsible for maritime surveillance.
Recall quantifies the capability to detect every pixel belonging to an oil slick. Small, thin sheens and broken slicks are easily missed, and every undetected oil spill pixel allows oil to continue spreading without intervention. High recall is therefore essential for minimizing false negatives and mitigating the environmental impact of oil.
The F1-Score is the harmonic mean of precision and recall, providing a single measure that balances the competing requirements of resource efficiency (few false alarms) and ecological protection (few misses).
SAR imagery is dominated by sea background pixels, so standard pixel accuracy is deceptive under this class distribution: a model that predicts every pixel as background can still achieve high accuracy. Mean Intersection over Union (mIoU) avoids this problem by penalizing both over-segmentation and under-segmentation, thereby evaluating the true geometric fidelity between predictions and ground truth. mIoU therefore serves as the primary benchmark in this study.
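A toy calculation makes this concrete; the 100-pixel scene below is hypothetical:

```python
# Toy scene: 5% oil, 95% background. An "all background" prediction
# scores high pixel accuracy yet finds no oil at all.
gt = [1] * 5 + [0] * 95          # 5 oil pixels, 95 sea pixels
pred = [0] * 100                 # degenerate all-background prediction

accuracy = sum(p == g for p, g in zip(pred, gt)) / len(gt)   # high despite failure
tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
# Oil-class IoU exposes the failure that accuracy hides.
oil_iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
```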

4.3. Comparative Analysis

To rigorously evaluate the segmentation performance of the proposed FCS-Net, we conducted comparative experiments against five established methods: the classic U-Net [21], Deeplabv3+ [24], SegFormer [46], Swin-Unet [47], and the boundary-aware CBD-Net [11]. The quantitative results on both the Gulf of Mexico (PALSAR) and Persian Gulf (Sentinel-1) subsets are detailed in Table 2.

4.3.1. Performance on the Gulf of Mexico Subset

FCS-Net sets a new state of the art on the Gulf of Mexico (PALSAR) subset, which is dominated by wide, continuous slick traces. For a fair comparison, every competing method was configured with its best available backbone and pre-trained weights. As shown in Table 2, our method achieves an mIoU of 87.78% and an F1 score of 88.91%, exceeding the U-Net baseline by 5.4% and the Transformer-based SegFormer by 3.06%. We attribute this large performance gain to the ConvNeXt backbone: its large-kernel design yields a broader effective receptive field, enabling the model to maintain internal homogeneity within large-scale slicks. By bridging CNNs and Transformers in this way, the backbone eliminates the “holes” commonly produced by traditional architectures.
This quantitative superiority is visually corroborated in Figure 6. Conventional CNN-based methods, such as U-Net (Figure 6c) and Deeplabv3+ (Figure 6d), produce prediction maps marred by jagged boundaries and fragmented contours. Similarly, the Transformer-based SegFormer (Figure 6e) and Swin-Unet (Figure 6f) exhibit limitations in preserving fine structural details, often leading to disconnected components. In contrast, FCS-Net (Figure 6h) successfully maintains the topological integrity of the slicks, generating smooth, contiguous maps that faithfully align with the Ground Truth (Figure 6b).

4.3.2. Performance on the Persian Gulf Subset

The Persian Gulf (Sentinel-1) subset presents distinct challenges, primarily characterized by very thin oil sheens and interference from look-alike features such as ship wakes. In this complex environment, FCS-Net demonstrates superior performance, achieving an mIoU of 89.62% and surpassing the U-Net baseline by 7.35% (Table 2).
The per-metric breakdown reflects the contributions of the individual architectural components. The markedly higher Recall (94.02% vs. 88.67% for U-Net) illustrates the impact of the Strip-Augmented Pyramid Pooling (SAPP) module in recovering very thin, linear oil signatures that would otherwise be lost by isotropic pooling. Precision also improves substantially (91.76% vs. 85.99% for U-Net, a gain of 5.77%), which we attribute to the Frequency-Spatial Coordinate Attention (FS-CA) mechanism: its frequency-domain representation allows the model to properly discriminate actual oil from wave textures.
The visual comparison in Figure 7 further corroborates this robustness. When dealing with narrow oil signatures, baseline methods like U-Net (Figure 7c) and Deeplabv3+ (Figure 7d) struggle to preserve topological connectivity, resulting in fragmented predictions. In contrast, FCS-Net (Figure 7h) maintains the continuity of thin structures while effectively suppressing background noise, demonstrating superior generalization in complex maritime environments.

4.3.3. Failure Case Analysis

To deeply analyze the failure cases, we categorize the errors into false negatives and false positives, as visualized in Figure 8:
As indicated by the red boxes in Rows (I) and (II), the model fails to extract features of small-scale targets. In Row (I), the target is submerged by strong inherent speckle noise. In Row (II), the model exhibits poor edge-processing capability for thin objects under interference, resulting in the fracture of continuous structures.
As indicated by the blue boxes in Rows (III) and (IV), the model generates false alarms in background areas. In both cases, the texture features of the background (e.g., sea surface) bear a strong resemblance to the targets. This high similarity confuses the model, leading it to misclassify background clutter as targets.

4.4. Ablation Studies

To quantify the contribution of each component and verify the architectural soundness of the design, we conducted a two-phase ablation study. In the first phase, we identified the best configuration by iteratively adding the proposed modules onto the ConvNeXt-Small backbone to form the final FCS-Net. In the second phase, we evaluated the generalizability of the FS-CA and SAPP modules by transferring them to a vanilla ResNet-34 backbone.

4.4.1. Component Effectiveness on ConvNeXt

Table 3 details the stepwise construction of the proposed FCS-Net. To ensure statistical reliability, we repeated each experiment using five different random seeds and reported the results as mean values ± standard deviations.
The improvement observed when moving from the ResNet-34 baseline (Experiment 1) to ConvNeXt-Small (Experiment 2) can be credited to the latter’s more capable design: replacing ResNet-34 increases the average mIoU by approximately 3.18% on the PALSAR subset and 3.46% on the Sentinel-1 subset. The ResNet architecture is limited by its small, local receptive fields and therefore cannot effectively model the heterogeneous physical characteristics of large-scale oil spills. ConvNeXt’s modern design, with its inverted bottleneck and large depthwise kernels, learns a quasi-global receptive field similar to that of a Transformer, allowing it to capture the complex and varied textures of SAR images more accurately than models that rely on small kernels.
In Experiment 3, the FS-CA module was added, yielding a 2.25% increase in mIoU over the ConvNeXt baseline. By applying a Fourier transform, FS-CA identifies oil spills as characteristic patterns in the frequency domain and separates them from artefacts caused by multiplicative speckle noise, a known limitation of purely spatial convolutional architectures. Experiment 4 added the SAPP module, which recovers linear features by geometrically restoring broken or damaged connections. Finally, with all proposed modules combined in Experiment 5, FCS-Net reached an overall mIoU of 89.62% on the Sentinel-1 data. Notably, the low standard deviation ( ± 0.0084 ) across the five runs indicates that the synergy between the backbone (semantic foundation), FS-CA (texture cleaning), and SAPP (topology repair) not only improves accuracy but also significantly enhances the robustness of the model.

4.4.2. Generalizability Verification on ResNet-34

To demonstrate that the FS-CA and SAPP modules are generic, plug-and-play components that are independent of specific encoders, we apply them to the widely used ResNet-34.
The most convincing evidence of the universality of our modules is provided in Table 4, where they were deployed atop the lighter ResNet-34 backbone. Despite the limited capacity of this legacy encoder, the full FCS-Net variant clearly outperforms the vanilla baseline by an impressive 3.91% in mIoU on the challenging PALSAR subset ( 85.47 % vs. 81.56 % ). This finding is crucial because it disentangles our performance improvements from the stronger inductive biases of ConvNeXt: the gains provided by frequency-domain denoising (FS-CA) and anisotropic aggregation (SAPP) are not tuned to one particular strong backbone but are model-agnostic structural refinements. By alleviating the limited spatial context of typical CNNs, these modules prove to be portable solutions for boosting diverse architectures.

4.5. Zero-Shot Cross-Dataset Generalization Analysis

The M4D dataset [15] is an international benchmark for evaluating oil spill classification models, comprising Sentinel-1 (S1) SAR imagery collected worldwide under both calm and stormy conditions. To assess the generalizability of FCS-Net in unseen environments, we adopted M4D as the evaluation dataset for this experiment. Although both SOS and M4D contain S1 imagery, they share no geographic overlap and differ in their acquisition and processing pipelines. They further differ in scale (lower resolution for the SOS dataset) and geographic coverage (higher latitudes for the M4D dataset), making this a demanding test of domain shift.
To construct the evaluation set, we randomly sampled 100 oil spill images and their associated annotations from M4D. Label re-mapping was performed to align the fine-grained M4D labels with our binary task. Specifically, all “Oil Spill” labels were retained as the positive class, while all other categories (including “Look-alikes”, “Land”, “Ships”, and “Sea Surface”) were merged into a single background class. It is important to emphasize that all models used in this evaluation were trained exclusively on the Sentinel-1 subset of the SOS dataset and were applied directly to the M4D samples without any fine-tuning or domain adaptation.
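The re-mapping step can be sketched as follows; the class-id assignments here are illustrative assumptions, not the official M4D encoding:

```python
# Illustrative class ids (NOT the official M4D encoding): only "Oil Spill"
# survives as the positive class; everything else collapses to background.
M4D_CLASSES = {0: "Sea Surface", 1: "Oil Spill", 2: "Look-alike",
               3: "Ship", 4: "Land"}
OIL_ID = 1

def to_binary(mask):
    """Map a multi-class M4D mask (2-D list of class ids) to {0, 1}."""
    return [[1 if px == OIL_ID else 0 for px in row] for px_row in [] or mask for row in [px_row]]

# Simpler equivalent, kept explicit for readability:
def to_binary(mask):
    return [[1 if px == OIL_ID else 0 for px in row] for row in mask]

mask = [[0, 1, 1], [2, 1, 4]]      # look-alike (2) and land (4) become background
binary = to_binary(mask)
```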
Table 5 presents the quantitative comparison results on the M4D dataset. As expected, due to the significant domain shift between the source domain (SOS) and the target domain (M4D) (e.g., different sea states, wind speeds, and incident angle distributions), all models showed a performance decline on M4D compared to the SOS internal test.
FCS-Net proved more robust than both conventional CNN and Transformer architectures. We compared it against the strongest representative of each family: DeepLabv3+ (the highest-performing conventional CNN) and SegFormer (the leading Transformer architecture). FCS-Net achieved the highest mIoU of 68.97%. On the Oil IoU metric, which directly reflects oil spill detection capability, its score of 42.89% was substantially greater than those of DeepLabv3+ (36.27%) and SegFormer (38.83%). These results show that FCS-Net not only recognizes oil spills from their spatial characteristics in SAR imagery, but also localizes them accurately in ocean regions unseen during training.
Figure 9 further provides a visual comparison of cross-domain testing, clearly demonstrating how different models behave under unfamiliar sea conditions. Columns (a)–(c) show the original images and processed labels of the M4D dataset, where the cyan areas represent actual oil spills and the red areas represent easily confused “look-alikes”.
The specific model comparison analysis is as follows: First, the capture of subtle features (Figure 9, first row): As shown in the red box of Row (I), when facing thin strip-shaped oil spills, the baseline model is prone to segmentation breaks, while FCS-Net, due to its sensitivity to high-frequency components in the frequency domain, successfully preserves the topological integrity of the oil film. Second, robustness to extreme interference (Figure 9, second row): Row (II) shows a challenging scene containing a large area of oil-like film. It can be seen that, faced with such extremely similar dark area features, all three models exhibited a certain degree of false positives, which objectively reflects the physical limitations of cross-domain SAR image semantic segmentation. Third, the comparison between aggressive and balanced strategies (Figure 9, third row): Row (III) reveals the core difference between SegFormer and FCS-Net. Although SegFormer (column e) has the highest recall rate in quantitative metrics, it exhibits an overly aggressive prediction tendency visually, easily misclassifying the oil-like film within the blue box as an oil spill. In contrast, FCS-Net (column f) adopts a more balanced strategy, effectively suppressing background clutter while ensuring the detection of the core oil spill area. This explains why, in Table 5, SegFormer has a high recall (Recall > 76%) but extremely low precision (Precision < 44%), while FCS-Net achieves the best Oil IoU while maintaining substantially higher precision than SegFormer.

5. Conclusions

5.1. Main Results and Contributions

FCS-Net is designed to address the specific difficulties of isolating marine oil slicks in SAR imagery and to overcome the limitations of existing segmentation methods.
To achieve these goals, the network integrates three complementary components. First, ConvNeXt-Small serves as the encoder backbone, exploiting the large receptive fields produced by its large-kernel design. Second, the Frequency-Spatial Coordinate Attention (FS-CA) module leverages both frequency and spatial information; it is designed to suppress speckle noise and enhance the feature representation of oil slicks against complex sea textures through low-frequency filtering.
Third, to accommodate the complex geometry of marine oil slicks, the Strip-Augmented Pyramid Pooling (SAPP) module combines anisotropic strip pooling with a max-pooling-based pyramid pooling (PPM) branch, enabling multi-scale context aggregation and preserving long-range dependencies for sparsely distributed oil targets.
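For illustration, the anisotropic strip-pooling idea underlying SAPP can be sketched in PyTorch following Hou et al. [44]; the layer sizes and the gated fusion below are our assumptions, not the exact SAPP configuration:

```python
import torch
from torch import nn

class StripPool(nn.Module):
    """Minimal sketch of anisotropic strip pooling (after Hou et al., CVPR 2020).

    Horizontal and vertical 1-D pools capture long-range context along each
    axis, matching the elongated geometry of oil slicks better than square
    pooling windows.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # H x 1 vertical strip
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # 1 x W horizontal strip
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        sh = self.conv_h(self.pool_h(x)).expand(-1, -1, h, w)  # broadcast along W
        sw = self.conv_w(self.pool_w(x)).expand(-1, -1, h, w)  # broadcast along H
        return x * torch.sigmoid(self.fuse(sh + sw))           # gated fusion (assumed)
```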
Experiments on the multi-source SOS dataset (ALOS PALSAR and Sentinel-1) confirm that FCS-Net delivers state-of-the-art results, achieving mIoU scores of 87.78% (ALOS PALSAR) and 89.62% (Sentinel-1) and clearly exceeding established baselines such as Deeplabv3+ and CBD-Net. The ablation studies further show that FS-CA and SAPP each provide a distinct, complementary benefit: FS-CA preserves high prediction precision by minimizing false positives, while SAPP achieves high recall by recovering narrow linear features.

5.2. Limitations and Future Directions

Despite the demonstrated effectiveness of FCS-Net, several limitations remain to be addressed, spanning physical, methodological, and operational dimensions.
First, segmentation performance is inherently constrained by the physical imaging mechanisms of single-polarization SAR. Extreme environmental conditions, particularly varying wind speeds, can significantly affect the contrast between oil slicks and the background. Specifically, consistent with established oceanographic understanding, under conditions such as low wind speeds or the presence of surfactants, the backscatter difference between oil films and wind slicks can be minimal and temporally unstable, creating intrinsic ambiguities that cannot be resolved using intensity data alone. In future work, we will incorporate multiple types of remote sensing data to alleviate this physical barrier, combining SAR features with optical imagery (e.g., Sentinel-2) and/or multi-polarimetric features so that the additional spectral and polarimetric information can disambiguate these complex events.
Second, regarding geometric modeling, we acknowledge that the SAPP module currently relies on axis-aligned orthogonal decomposition. While this provides an efficient long-range context, it constitutes an orthogonal approximation rather than strict rotation invariance for arbitrarily oriented spills. In future studies, we plan to investigate rotation-adaptive strip modeling—such as deformable or steerable strip operators—to explicitly align the receptive field with the intrinsic orientation of highly curved or diagonal targets.
Third, the cross-sensor domain shift presents a challenge for broad deployment. While this study validated the model on C-band and L-band data, ensuring consistent operational utility requires broader generalization. Consequently, we plan to extend our evaluation to a wider assortment of sensors, such as Gaofen-3, ZY3-XL, and RADARSAT-2. This will allow us to rigorously test the model’s transferability across diverse frequencies and imaging modes.

Author Contributions

S.W. undertook the primary investigation, including experimental execution, simulation, and data curation, and drafted the original manuscript. B.-W.M. contributed to the experimental procedures and provided analytical support. Y.H. was responsible for project administration and supervision. D.G. secured the necessary funding. All authors have critically reviewed the content and approved the final submission. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Plan Project of Nantong (No. JC2023023).

Data Availability Statement

The dataset used in this study is available at https://grzy.cug.edu.cn/zhuqiqi (accessed on 1 July 2025).

Acknowledgments

The authors would like to express their sincere gratitude to the ECHO Group of China University of Geosciences for providing the SOS: Deep-SAR Oil Spill Dataset (available at https://grzy.cug.edu.cn/zhuqiqi (accessed on 1 July 2025)). We also extend our special thanks to Yuanzhi Zhang for his kind invitation to submit this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Del Frate, F.; Petrocchi, A.; Lichtenegger, J.; Calabresi, G. Neural Networks for Oil Spill Detection Using ERS-SAR Data. IEEE Trans. Geosci. Remote Sens. 2000, 38, 2282–2287. [Google Scholar] [CrossRef]
  2. Kontovas, C.A.; Psaraftis, H.N.; Ventikos, N.P. An Empirical Analysis of IOPCF Oil Spill Cost Data. Mar. Pollut. Bull. 2010, 60, 1455–1466. [Google Scholar] [CrossRef]
  3. Jernelöv, A. The Threats from Oil Spills: Now, Then, and in the Future. AMBIO 2010, 39, 353–366. [Google Scholar] [CrossRef]
  4. Fingas, M.; Brown, C. Review of Oil Spill Remote Sensing. Mar. Pollut. Bull. 2014, 83, 9–23. [Google Scholar] [CrossRef]
  5. Dong, S.; Feng, J.; Gu, Z.; Yin, K.; Long, Y. A Review of Artificial Intelligence and Remote Sensing for Marine Oil Spill Detection, Classification, and Thickness Estimation. Remote Sens. 2025, 17, 3681. [Google Scholar] [CrossRef]
  6. Keramitsoglou, I.; Cartalis, C.; Kiranoudis, C.T. Automatic Identification of Oil Spills on Satellite Images. Environ. Model. Softw. 2006, 21, 640–652. [Google Scholar] [CrossRef]
  7. Topouzelis, K.N. Oil Spill Detection by SAR Images: Dark Formation Detection, Feature Extraction and Classification Algorithms. Sensors 2008, 8, 6642–6659. [Google Scholar] [CrossRef] [PubMed]
  8. Fingas, M.; Brown, C.E. Oil Spill Remote Sensing. In Handbook of Oil Spill Science and Technology; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2014; Chapter 12; pp. 311–356. [Google Scholar] [CrossRef]
  9. Wang, B.; Chen, L.; Song, D.; Chen, W.; Yu, J. SinGAN-Labeler: An Enhanced SinGAN for Generating Marine Oil Spill SAR Images with Labels. J. Mar. Sci. Eng. 2025, 13, 422. [Google Scholar] [CrossRef]
  10. Garcia-Pineda, O.; Staples, G.; Jones, C.E.; Hu, C.; Holt, B.; Kourafalou, V.; Graettinger, G.; DiPinto, L.; Ramirez, E.; Streett, D.; et al. Classification of Oil Spill by Thicknesses Using Multiple Remote Sensors. Remote Sens. Environ. 2020, 236, 111421. [Google Scholar] [CrossRef]
  11. Zhu, Q.; Zhang, Y.; Li, Z.; Yan, X.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Oil Spill Contextual and Boundary-Supervised Detection Network Based on Marine SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5213910. [Google Scholar] [CrossRef]
  12. Alpers, W.; Hühnerfuss, H. The Damping of Ocean Waves by Surface Films: A New Look at an Old Problem. J. Geophys. Res. Oceans 1989, 94, 6251–6265. [Google Scholar] [CrossRef]
  13. Minchew, B.; Jones, C.E.; Holt, B. Polarimetric Analysis of Backscatter From the Deepwater Horizon Oil Spill Using L-Band Synthetic Aperture Radar. IEEE Trans. Geosci. Remote Sens. 2012, 50, 3812–3830. [Google Scholar] [CrossRef]
  14. Zhang, J.; Yang, P.; Ren, X. Detection of Oil Spill in SAR Image Using an Improved DeepLabV3+. Sensors 2024, 24, 5460. [Google Scholar] [CrossRef] [PubMed]
  15. Krestenitis, M.; Orfanidis, G.; Ioannidis, K.; Avgerinakis, K.; Vrochidis, S.; Kompatsiaris, I. Oil Spill Identification from Satellite Images Using Deep Neural Networks. Remote Sens. 2019, 11, 1762. [Google Scholar] [CrossRef]
  16. Chen, Y.T.; Chang, L.; Wang, J.H. Full-Scale Aggregated MobileUNet: An Improved U-Net Architecture for SAR Oil Spill Detection. Sensors 2024, 24, 3724. [Google Scholar] [CrossRef] [PubMed]
  17. Solberg, A.; Storvik, G.; Solberg, R.; Volden, E. Automatic Detection of Oil Spills in ERS SAR Images. IEEE Trans. Geosci. Remote Sens. 1999, 37, 1916–1924. [Google Scholar] [CrossRef]
  18. Brekke, C.; Solberg, A.H.S. Oil Spill Detection by Satellite Remote Sensing. Remote Sens. Environ. 2005, 95, 1–13. [Google Scholar] [CrossRef]
  19. Al-Ruzouq, R.; Gibril, M.B.A.; Shanableh, A.; Kais, A.; Hamed, O.; Al-Mansoori, S.; Khalil, M.A. Sensors, Features, and Machine Learning for Oil Spill Detection and Monitoring: A Review. Remote Sens. 2020, 12, 3338. [Google Scholar] [CrossRef]
  20. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar] [CrossRef]
  22. Chen, H.; Qi, X.; Yu, L.; Dou, Q.; Qin, J.; Heng, P.A. DCAN: Deep Contour-Aware Networks for Object Instance Segmentation from Histology Images. Med. Image Anal. 2017, 36, 135–146. [Google Scholar] [CrossRef]
  23. Zeng, K.; Wang, Y. A Deep Convolutional Neural Network for Oil Spill Detection from Spaceborne SAR Images. Remote Sens. 2020, 12, 1015. [Google Scholar] [CrossRef]
  24. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  25. Wang, D.; Wan, J.; Liu, S.; Chen, Y.; Yasir, M.; Xu, M.; Ren, P. BO-DRNet: An Improved Deep Learning Model for Oil Spill Detection by Polarimetric Features from SAR Images. Remote Sens. 2022, 14, 264. [Google Scholar] [CrossRef]
  26. Fan, J.; Liu, C. Multitask GANs for Oil Spill Classification and Semantic Segmentation Based on SAR Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 2532–2546. [Google Scholar] [CrossRef]
  27. A Novel Multi-Scale Feature Map Fusion for Oil Spill Detection of SAR Remote Sensing. Remote Sens. 2024, 16, 1684. [CrossRef]
  28. Chen, Y.; Wang, Z. Marine Oil Spill Detection from SAR Images Based on Attention U-Net Model Using Polarimetric and Wind Speed Information. Int. J. Environ. Res. Public Health 2022, 19, 12315. [Google Scholar] [CrossRef]
  29. Oil Spill Identification Based on Dual Attention UNet Model Using Synthetic Aperture Radar Images. J. Indian Soc. Remote Sens. 2023, 51, 121–133. [CrossRef]
  30. Wang, S. Evaluating Cross-Building Transferability of Attention-Based Automated Fault Detection and Diagnosis for Air Handling Units: Auditorium and Hospital Case Study. Build. Environ. 2025, 287, 113889. [Google Scholar] [CrossRef]
  31. Wang, S. Effectiveness of Traditional Augmentation Methods for Rebar Counting Using UAV Imagery with Faster R-CNN and YOLOv10-based Transformer Architectures. Sci. Rep. 2025, 15, 33702. [Google Scholar] [CrossRef]
  32. Song, W.; Ma, X.; Song, W. Automatic Detection of Marine Oil Spills from Polarimetric SAR Images Using Deep Convolutional Neural Network Model. Ecol. Indic. 2024, 169, 112934. [Google Scholar] [CrossRef]
  33. Wu, W.; Wong, M.S.; Yu, X.; Shi, G.; Kwok, C.Y.T.; Zou, K. Compositional Oil Spill Detection Based on Object Detector and Adapted Segment Anything Model From SAR Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4007505. [Google Scholar] [CrossRef]
  34. Shaban, M.; Salim, R.; Abu Khalifeh, H.; Khelifi, A.; Shalaby, A.; El-Mashad, S.; Mahmoud, A.; Ghazal, M.; El-Baz, A. A Deep-Learning Framework for the Detection of Oil Spills from SAR Data. Sensors 2021, 21, 2351. [Google Scholar] [CrossRef]
  35. Chang, L.; Chen, Y.T.; Cheng, C.M.; Chang, Y.L.; Ma, S.C. Marine Oil Pollution Monitoring Based on a Morphological Attention U-Net Using SAR Images. Sensors 2024, 24, 6768. [Google Scholar] [CrossRef]
  36. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  37. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  38. Shimada, M.; Isoguchi, O.; Tadono, T.; Isono, K. PALSAR Radiometric and Geometric Calibration. IEEE Trans. Geosci. Remote Sens. 2009, 47, 3915–3932. [Google Scholar] [CrossRef]
  39. Alpers, W.; Holt, B.; Zeng, K. Oil Spill Detection by Imaging Radars: Challenges and Pitfalls. Remote Sens. Environ. 2017, 201, 133–147. [Google Scholar] [CrossRef]
  40. Dosovitskiy, A. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  43. Deng, M.; Sun, S.; Li, Z.; Hu, X.; Wu, X. FMNet: Frequency-Assisted Mamba-Like Linear Attention Network for Camouflaged Object Detection. arXiv 2025. [Google Scholar] [CrossRef]
  44. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012. [Google Scholar]
  45. Kingma, D.P. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  46. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  47. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Computer Vision—ECCV 2022 Workshops; Springer: Cham, Switzerland, 2022. [Google Scholar] [CrossRef]
Figure 1. Typical samples from the SOS dataset.
Figure 2. FCS-Net overall structure diagram.
Figure 3. Schematic illustration of the ConvNeXt-Small backbone integrated into FCS-Net. (a) Internal composition of the ConvNeXt block, highlighting the utilization of large-kernel 7 × 7 depthwise convolutions for expanded receptive fields, complemented by an inverted bottleneck design. (b) Global hierarchical topology of the encoder, initiating with a convolutional stem and evolving through four feature stages with a depth distribution of {3, 3, 27, 3}.
Figure 4. Diagram of the Frequency-Spatial Coordinate Attention (FS-CA) module. (a) Coordinate Attention (CA). (b) Frequency Attention (FA).
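The FA branch in Figure 4b exploits global frequency-domain information. The sketch below illustrates the general idea only, not the paper's exact FA design: a feature map is transformed with a 2-D FFT, a low-frequency band is retained to emphasize global structure, and the result is transformed back. The low-pass mask, the `keep_ratio` parameter, and the function name are assumptions for illustration.

```python
import numpy as np

def frequency_branch(x, keep_ratio=0.25):
    """Toy frequency-domain branch (illustrative sketch only).

    Transforms a 2-D feature map to the frequency domain, keeps a
    centered low-frequency band, and inverts the transform. The
    rectangular low-pass mask and keep_ratio are assumptions.
    """
    F = np.fft.fftshift(np.fft.fft2(x))          # center the zero frequency
    h, w = x.shape
    ch, cw = h // 2, w // 2
    rh = max(1, int(h * keep_ratio))
    rw = max(1, int(w * keep_ratio))
    mask = np.zeros_like(F)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1.0  # keep low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

x = np.random.default_rng(0).standard_normal((8, 8))
y = frequency_branch(x)
```

In a real attention module this low-frequency response would typically be turned into a multiplicative gate over the spatial features rather than used directly.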
Figure 5. Architecture of the Strip-Augmented Pyramid Pooling (SAPP) module. The upper branch employs standard square pooling for global context, while the lower branch utilizes strip pooling to capture anisotropic features of slender oil spills.
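The anisotropic strip pooling used in the lower branch of the SAPP module (Figure 5) can be sketched as follows: each row is averaged across the width and each column across the height, producing two 1-D context vectors that are broadcast back to the full map. The additive fusion and the function name are illustrative assumptions, not the module's exact formulation.

```python
import numpy as np

def strip_pool(x):
    """Anisotropic strip pooling on a single 2-D feature map (sketch).

    Horizontal strip pooling averages over the width axis (one value per
    row); vertical strip pooling averages over the height axis (one value
    per column). The two context vectors are broadcast back to H x W and
    fused additively, capturing long-range row/column dependencies.
    """
    h_ctx = x.mean(axis=1, keepdims=True)  # shape (H, 1): row context
    v_ctx = x.mean(axis=0, keepdims=True)  # shape (1, W): column context
    return h_ctx + v_ctx                   # broadcast to (H, W)

x = np.arange(12, dtype=float).reshape(3, 4)
y = strip_pool(x)
```

Because each output pixel aggregates an entire row and an entire column, elongated targets such as thin oil strips receive coherent context along their full extent, which is exactly what isotropic square pooling windows fail to provide.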
Figure 6. Visual comparison of segmentation results on the Gulf of Mexico subset. Rows (I–IV) correspond to representative samples exhibiting distinct characteristics: (I) irregular shapes, (II) oil spills with internal holes, (III) slender oil strips, and (IV) oil slicks affected by interference. From left to right: (a) Original SAR Image, (b) Ground Truth, (c) U-Net, (d) Deeplabv3+, (e) SegFormer, (f) Swin-Unet, (g) CBD-Net, and (h) FCS-Net (Ours). The red boxes highlight specific regions of interest to demonstrate the differences in boundary continuity and detail preservation among the methods.
Figure 7. Visual comparison of segmentation results on the Persian Gulf subset. Rows (I–IV) correspond to representative samples exhibiting distinct characteristics: (I) slender oil films, (II) slender oil films containing holes, (III) strip-like oil films, and (IV) irregular oil films. From left to right: (a) Original SAR Image, (b) Ground Truth, (c) U-Net, (d) Deeplabv3+, (e) SegFormer, (f) Swin-Unet, (g) CBD-Net, and (h) FCS-Net (Ours). The red boxes highlight specific regions of interest to demonstrate the differences in boundary continuity and detail preservation among the methods.
Figure 8. Failure case analysis. The rows represent different error types: (I) missed detection of small targets due to speckle noise, (II) structural fracture of thin oil spills, (III) false alarms caused by large-scale background clutter, and (IV) false alarms due to strip-like interference. (a) Original SAR images, (b) Ground Truth, and (c) Prediction results. Red boxes indicate false negatives, and blue boxes indicate false positives.
Figure 9. Qualitative visualization of zero-shot cross-dataset evaluation on the M4D dataset. The columns represent: (a) Original SAR imagery from M4D; (b) Original M4D annotations (Cyan: Oil Spills, Red: Look-alikes); (c) Re-mapped binary ground truth used in this study (Oil only); (d) DeepLabv3+ predictions; (e) SegFormer predictions; (f) FCS-Net (Ours) predictions. Row (I): The red box highlights FCS-Net’s superior ability to maintain the topological integrity of thin oil strips compared to baselines. Row (II): A challenging scenario with large look-alike areas where all models exhibit false positives, illustrating the difficulty of distinguishing dark spots in SAR. Row (III): The blue box demonstrates that while SegFormer is highly aggressive (high recall), it misclassifies look-alikes as oil. In contrast, FCS-Net adopts a more balanced strategy, effectively suppressing false alarms.
Table 1. Details of the PALSAR and Sentinel subsets.

| Attribute | PALSAR Subset | Sentinel Subset |
|---|---|---|
| Source Region | Gulf of Mexico | Persian Gulf |
| Satellite Source | ALOS (PALSAR) | Sentinel-1A |
| Frequency Band | L-band | C-band |
| Total Images | 3877 | 4193 |
| Image Size | 256 × 256 pixels | 256 × 256 pixels |
| Spatial Resolution | 12.5 m | 5 m × 20 m |
| Polarization | HH | VV |
| Spill Characteristics | Thick, unbroken slicks | Thin sheens, linear features |
| Background Complexity | Low (macro-geometry) | High (ship wakes, look-alikes) |
Table 2. Quantitative comparison of segmentation performance on the SOS dataset. P = PALSAR subset, S = Sentinel-1 subset. The best results are highlighted in bold.

| Method | Backbone | mIoU (P) | mIoU (S) | F1 (P) | F1 (S) | Precision (P) | Precision (S) | Recall (P) | Recall (S) |
|---|---|---|---|---|---|---|---|---|---|
| U-net | res34 (1k) | 81.56 | 82.27 | 82.60 | 87.31 | 78.40 | 85.99 | 87.28 | 88.67 |
| U-net | res50 (22k) | 82.38 | 82.17 | 83.39 | 87.25 | 80.62 | 85.72 | 86.36 | 88.84 |
| Deeplabv3+ | res34 (1k) | 83.83 | 83.54 | 84.88 | 88.26 | 83.07 | 87.45 | 86.77 | 89.09 |
| Deeplabv3+ | res50 (22k) | 84.06 | 86.55 | 85.11 | 90.60 | 83.60 | 89.36 | 86.69 | 91.88 |
| SegFormer | MiT-B2 (1k) | 84.55 | 83.76 | 85.69 | 88.30 | 82.78 | 89.21 | 88.81 | 87.42 |
| SegFormer | MiT-B3 (1k) | 84.72 | 84.68 | 85.83 | 89.10 | 83.77 | 89.08 | 87.98 | 89.11 |
| Swin-Unet | Swin-Small (1k) | 81.60 | 81.82 | 82.58 | 86.94 | 79.41 | 85.81 | 86.02 | 88.09 |
| Swin-Unet | Swin-Small (22k) | 81.93 | 82.03 | 82.86 | 87.19 | 81.16 | 85.05 | 84.62 | 89.43 |
| CBD-Net | res34 (1k) | 83.96 | 82.94 | 85.07 | 87.81 | 82.24 | 86.70 | 88.11 | 88.95 |
| CBD-Net | res50 (22k) | 84.41 | 84.04 | 85.47 | 88.69 | 84.10 | 87.35 | 86.89 | 90.07 |
| FCS-Net | ConvNeXt-Small (1k) | 87.22 | 89.37 | 88.27 | 92.74 | **87.64** | 90.81 | 88.91 | **94.75** |
| FCS-Net | ConvNeXt-Small (22k) | **87.78** | **89.62** | **88.91** | **92.87** | 87.38 | **91.76** | **90.50** | 94.02 |
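The metrics reported in Table 2 can be computed from binary masks as in the minimal sketch below, assuming a two-class (oil/background) setting in which mIoU averages the IoU of both classes and F1/precision/recall are reported for the oil (positive) class; the function name and dictionary layout are illustrative.

```python
import numpy as np

def binary_seg_metrics(pred, gt):
    """mIoU, F1, precision, recall for a binary oil/background mask pair.

    mIoU averages the IoU of the oil class and the background class;
    the remaining metrics treat oil as the positive class.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()     # oil predicted as oil
    fp = np.logical_and(pred, ~gt).sum()    # background predicted as oil
    fn = np.logical_and(~pred, gt).sum()    # oil missed
    tn = np.logical_and(~pred, ~gt).sum()   # background kept as background
    iou_oil = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"mIoU": (iou_oil + iou_bg) / 2, "F1": f1,
            "Precision": precision, "Recall": recall}

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
m = binary_seg_metrics(pred, gt)
```

A production implementation would guard the divisions against empty classes and accumulate the confusion counts over the whole test set before computing the ratios.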
Table 3. Stepwise ablation study of individual components. FS-CA: Frequency-Spatial Coordinate Attention; SAPP: Strip-Augmented Pyramid Pooling. The best results are highlighted in bold.

| Exp. | Backbone | FS-CA | SAPP | mIoU (PALSAR) | F1 (PALSAR) | mIoU (Sentinel-1) | F1 (Sentinel-1) |
|---|---|---|---|---|---|---|---|
| 1 | ResNet-34 | – | – | 81.56 ± 0.0003 | 82.60 ± 0.0005 | 82.27 ± 0.0016 | 87.31 ± 0.0013 |
| 2 | ConvNeXt-S | – | – | 84.74 ± 0.0030 | 85.87 ± 0.0030 | 85.73 ± 0.0034 | 89.75 ± 0.0024 |
| 3 | ConvNeXt-S | ✓ | – | 85.86 ± 0.0030 | 87.01 ± 0.0032 | 87.98 ± 0.0111 | 91.67 ± 0.0091 |
| 4 | ConvNeXt-S | – | ✓ | 86.15 ± 0.0032 | 87.31 ± 0.0035 | 87.60 ± 0.0097 | 91.42 ± 0.0078 |
| 5 | ConvNeXt-S | ✓ | ✓ | **87.78 ± 0.0015** | **88.91 ± 0.0015** | **89.62 ± 0.0084** | **92.87 ± 0.0069** |
Table 4. Generalizability analysis on the ResNet-34 backbone. The consistent improvements verify that the proposed modules are effective independently of the backbone architecture. Bold values indicate the best performance in each metric.

| FS-CA | SAPP | mIoU (PALSAR, %) | F1 (PALSAR, %) | mIoU (Sentinel-1, %) | F1 (Sentinel-1, %) |
|---|---|---|---|---|---|
| – | – | 81.56 | 82.60 | 82.27 | 87.31 |
| ✓ | – | 83.78 | 84.87 | 83.59 | 88.29 |
| – | ✓ | 83.95 | 85.00 | 83.60 | 88.28 |
| ✓ | ✓ | **85.47** | **86.57** | **86.59** | **90.63** |
Table 5. Zero-shot cross-dataset evaluation on the M4D dataset. All models were trained on the SOS dataset (Sentinel-1 subset) and directly tested on 100 random samples from M4D without fine-tuning. FCS-Net demonstrates superior generalization capability, especially in Oil IoU. Bold values indicate the best performance in each metric.

| Method | Training Source | Target Domain | mIoU (%) | Oil IoU (%) | Oil Recall (%) | Oil Precision (%) |
|---|---|---|---|---|---|---|
| DeepLabv3+ | SOS (Sentinel-1) | M4D | 65.67 | 36.27 | 51.46 | **55.13** |
| SegFormer | SOS (Sentinel-1) | M4D | 66.07 | 38.83 | **76.84** | 43.97 |
| FCS-Net (Ours) | SOS (Sentinel-1) | M4D | **68.97** | **42.89** | 67.54 | 54.02 |
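The re-mapping from M4D's multi-class annotations (oil spills vs. look-alikes) to the binary ground truth used in the zero-shot evaluation can be sketched as below. The class indices (1 = oil) are illustrative assumptions and do not reflect M4D's actual label encoding.

```python
import numpy as np

def to_binary_oil(mask, oil_id=1):
    """Collapse a multi-class annotation to a binary oil/background map.

    Only pixels labeled as oil (here assumed to be class index 1) map to 1;
    look-alikes and sea are merged into the background class 0, matching
    the binary output space of models trained on SOS.
    """
    return (mask == oil_id).astype(np.uint8)

# Hypothetical 2 x 3 annotation: 0 = sea, 1 = oil, 2 = look-alike.
m4d_mask = np.array([[0, 1, 2],
                     [2, 1, 0]])
binary = to_binary_oil(m4d_mask)
```

Note that under this mapping any look-alike region predicted as oil counts as a false positive, which is why the Oil Precision column in Table 5 directly reflects each model's look-alike suppression ability.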

Share and Cite

MDPI and ACS Style

Wang, S.; Min, B.-W.; Gao, D.; Hong, Y. FCS-Net: A Frequency-Spatial Coordinate and Strip-Augmented Network for SAR Oil Spill Segmentation. J. Mar. Sci. Eng. 2026, 14, 168. https://doi.org/10.3390/jmse14020168
