Comparison of CNNs and Vision Transformers-Based Hybrid Models Using Gradient Profile Loss for Classification of Oil Spills in SAR Images

Abstract: Oil spillage over a sea or ocean surface is a threat to marine and coastal ecosystems. Spaceborne synthetic aperture radar (SAR) data have been used efficiently for the detection of oil spills due to their operational capability in all-day all-weather conditions. The problem is often modeled as a semantic segmentation task. The images need to be segmented into multiple regions of interest such as sea surface, oil spill, lookalikes, ships, and land. Training of a classifier for this task is particularly challenging since there is an inherent class imbalance. In this work, we train a convolutional neural network (CNN) with multiple feature extractors for pixel-wise classification and introduce a new loss function, namely, "gradient profile" (GP) loss, which is a constituent of the more generic spatial profile loss proposed for image translation problems. For the purpose of training, testing, and performance evaluation, we use a publicly available dataset with selected oil spill events verified by the European Maritime Safety Agency (EMSA). The results obtained show that the proposed CNN trained with a combination of GP, Jaccard, and focal loss functions can detect oil spills with an intersection over union (IoU) value of 63.95%. The IoU values for the sea surface, lookalikes, ships, and land classes are 96.00%, 60.87%, 74.61%, and 96.80%, respectively. The mean intersection over union (mIoU) value for all the classes is 78.45%, which accounts for a 13% improvement over the state of the art for this dataset. Moreover, we provide extensive ablation on different convolutional neural networks (CNNs) and vision transformers (ViTs)-based hybrid models to demonstrate the effectiveness of adding GP loss as an additional loss function for training. Results show that GP loss significantly improves the mIoU and F1 scores for CNNs as well as ViTs-based hybrid models. GP loss turns out to be a promising loss function in the context of deep learning with SAR images.


Introduction
Oil spills are one of the major causes of sea oil pollution and they pose a significant threat to the marine and coastal ecosystems. Ship accidents, bilge dumping, and offshore oil platforms are the main sources of sea oil pollution [1]. Over the last few decades, spaceborne synthetic aperture radar (SAR) has been widely used for the detection and classification of oil spills and lookalikes. Oil on a sea surface can generally be seen as a dark stretch in SAR images because it dampens the capillary waves and reduces the backscatter [2]. Nevertheless, dark stretches can also occur as a result of natural phenomena such as low wind areas, algae blooms, grease ice, etc. [1,3]. They are generally called lookalikes. These lookalikes add to the complexity of the classification problem. Even a visual inspection may not suffice to separate an oil spill from a lookalike, and an automated algorithm can similarly mistake a lookalike for an oil spill and vice versa.
In this context, deep learning may prove useful. For example, semantic segmentation with deep convolutional neural networks (DCNNs) can be used to assign a class label to every pixel in the remotely sensed images. DCNNs are inspired by the functioning of the human brain: they learn complex features from a large amount of data and extract information in a hierarchical manner, which has resulted in striking successes in the field of remote sensing and geospatial analysis [4]. Unlike object-based detection methods, semantic segmentation can delimit the boundaries and position of the target of interest accurately, which renders it suitable for processing remotely sensed data [5,6]. The swath of a typical SAR image over a sea may include contextual information, such as part of the coastline (land), ship(s), natural sea surface, and lookalike(s), besides the oil spill itself [5]. Therefore, in the context of identification of oil spills, a multi-class classification framework is needed. There are numerous classification models based on semantic segmentation, including UNet [7][8][9][10] and the DeepLab series [11], which have been used for the detection and classification of oil spills. In spite of this, oil spill detection and its discrimination from lookalikes remains a challenging problem, especially when multiple classes have to be trained and tested.
Recently, the authors in [12] proposed a family of Convolutional Neural Networks (CNNs), termed EfficientNetV2. Usually, the training of CNNs requires high-powered computational resources such as GPUs. The EfficientNetV2 family has fewer trainable parameters, which significantly reduces the training time. We intend to use EfficientNetV2 for semantic-segmentation-based multiclass classification of SAR images and to highlight the choice of GP loss as a promising loss function for training CNNs. In addition, the authors in [13] proposed self-attention models, i.e., Transformers, for language processing applications [14,15]. Compared to CNNs, Transformers have a larger model capacity, but their generalization capability is worse. After the development of Transformers, several attempts have been made to use the power of self-attention for different computer vision tasks [16][17][18]. With increasing interest in Vision Transformers (ViTs), the authors in [19] considered the advantages of both CNNs and ViTs to propose a new family of hybrid models. These models are termed CMTs: Convolutional Neural Networks Meet Vision Transformers. CMTs obtained state-of-the-art performance on various benchmark datasets. The authors in [20] utilized the generalization capability of CNNs and the model capacity of Transformers to propose a new family of hybrid architectures referred to as CoAtNets. We intend to carry out ablation studies on these hybrid models to show the effectiveness of using GP loss for training hybrid models for the oil spill classification problem. Our training dataset is small, so hybrid models may not prove useful in this case; nonetheless, it allows us to show the advantage of adding GP loss as an additional loss along with the focal and Jaccard loss functions.

Related Work
The advantage of utilizing CNNs over traditional approaches is that they can be trained end-to-end and learn the input-output mapping from examples [21]. This end-to-end training simplifies the task and reduces the human effort needed to define critical thresholds and parameters. Topouzelis et al. [22] utilized two neural networks (shallow and deep) for classification of potential oil spills from lookalikes. The same framework has been utilized in various later studies with SAR imagery [23,24]. The authors in [25] proposed a method for oil spill detection and classification based on SegNet [26], which is a deep convolutional neural network for semantic segmentation. The model is applied to SAR images with pre-confirmed oil spills and performs well under high clutter conditions. However, the model is limited to classification of SAR images into two classes, i.e., oil spill and lookalikes. The authors in [27] proposed a DCNN for semantic segmentation of SAR images into multiple regions of interest. The deployed model was trained on a publicly available oil spill dataset [28]. An instance-based segmentation model, namely the mask region-based convolutional neural network (Mask R-CNN), is proposed for the detection and segmentation of oil spills and lookalikes in [29]. The results indicate that the instance-based segmentation model outperforms traditional deep learning models. Krestenitis et al. [30] proposed a DCNN based on the architecture of DeepLab [11] for semantic segmentation of SAR images into regions of interest such as sea surface, oil spills, lookalikes, ships, and land. The deep learning model was trained on manually annotated SAR images. The authors in [28] provided a comparison of existing CNNs based on semantic segmentation for detection of oil spills and lookalikes.
Recently, the oil spill detection dataset developed by the authors in [28] has been used in several studies regarding oil spill classification. The authors in [31] developed a two-stage deep learning framework for classification of potential oil spills. The first stage is a 23-layer CNN that classifies the patches based on the percentage of oil spill class pixels. The second stage is a UNet CNN for semantic segmentation of SAR images. Moreover, they used generalized Dice loss for training and evaluated their results on the test dataset using the Dice score. The authors in [32] proposed a feature merge network (FMNet) for semantic segmentation of SAR images. Initially, they utilized a threshold method to extract global features from SAR images. After that, the results from the initial step are used to extract high-dimensional features. In the final step, the extracted features are combined with the high-dimensional features of the original SAR image. In [33], the authors proposed a CNN based on UNet for semantic segmentation of SAR images into multiple regions of interest, i.e., sea surface, oil spill, lookalikes, ship, and land. However, the training is performed with the standard cross-entropy loss function, which does not cater for the high class imbalance. The authors in [34] proposed a two-stage framework for detection of oil spills and ships using side-looking airborne radar (SLAR) images. It consists of three pairs of CNNs, with each pair trained to detect a specific class, i.e., ships, oil spills, and coast. However, the authors used their own oil spill detection dataset based on SLAR images to compute different performance metrics, i.e., precision, recall, and F1 scores. In [35], the authors proposed an oil spill convolutional network (OSCNet) for feature extraction and target classification in SAR images. They used an oil spill detection dataset that consists of 20,000 SAR dark patches based on Envisat, ERS-1, ERS-2, and COSMO Sky-Med data. The dataset was developed by the Ocean Remote Sensing Institute (ORSI), Ocean University of China (OUC). The authors stated that the proposed CNN performs better than the hand-crafted features needed by traditional machine learning algorithms.
The training of neural networks naturally necessitates the choice of one or more loss functions. At times, combination of multiple loss functions yields better performance. Commonly used loss functions for CNNs in the context of semantic segmentation include cross-entropy (CE) and focal loss. Since CE loss treats all samples and classes equally, it is not suitable when there is a large class imbalance [36]. Typically for oil spill problems, and remote sensing applications in general, the desired class may have fewer samples by several orders of magnitude than other class(es). To address this concern, CE loss can be tailored to give priority to class(es) with fewer samples. However, it can result in noise amplification [37]. Focal loss can be considered an extension of CE loss, with an addition of a modulating factor to facilitate differentiation between false positives and negatives. A common denominator among these loss functions is that they classify each pixel individually irrespective of the spatial relationship over semantically constant regions.
Until the present time, several methods have been proposed for detection and classification of oil spills and lookalikes. Most of these are based on classification of SAR images into just two classes of interest, i.e., oil spills and lookalikes. Oil spill events resulting from ship accidents and illegal ship discharge (bilge dumping) are more common, creating a need for detection of accurate position of ships besides the spillage. The detection and classification algorithms based on multiple regions of interest such as sea surface, oil spills, lookalikes, ships, and land areas are currently lacking. Moreover, for training CNNs, the loss function that considers spatial relationship over semantically constant regions is not studied to the best of our knowledge.
In this paper, we investigate the performance of different CNNs and ViTs-based hybrid architectures for semantic segmentation of SAR images into multiple relevant classes, i.e., sea surface, oil spill, lookalikes, ship, and land. Moreover, we introduce the use of a new loss function termed GP loss, which is a constituent of the more generic spatial profile loss proposed for image translation problems [38]. It computes similarity in gradient space between ground truth and predicted class labels by considering rows and columns as spatial profiles. Despite a small oil spill detection dataset of 1112 SAR images, the use of GP loss as an additional loss, along with the focal and Jaccard loss functions for training CNNs and hybrid models, results in significant performance improvement in terms of mean intersection over union (mIoU) and F1 scores.

Dataset
The detection of oil spills remains a challenging problem for the research community. Due to the absence of a common benchmark dataset, earlier work on oil spill detection and classification [27,39,40] utilized different custom datasets corresponding to the specific approaches used at the time. Until recently, to the best of our knowledge, there has been no common baseline available in the literature for comparison of different deep-learning-based semantic segmentation approaches. Krestenitis et al. [28] recently developed a labeled dataset of several oil spill events, and it is publicly available through their website (https://mklab.iti.gr/, accessed on 23 September 2020). The dataset contains spaceborne SAR acquisitions containing oil spill events verified by the European Maritime Safety Agency (EMSA) through the CleanSeaNet service. These SAR images are from the Sentinel 1 constellation operated by the European Space Agency (ESA). The images cover a ground range of approximately 250 km in interferometric wide swath (IW) mode with a resolution of 10 m. The images are dual-polarized, i.e., VV and VH, but only VV polarized images were retained for developing the dataset. After a series of preprocessing steps, the authors in [28] retained 1112 SAR images, which were split into training and test data subsets comprising 1002 and 110 images, respectively. The dataset contains manually annotated ground truth masks with a distinct RGB color assigned to each of the classes, viz., sea surface, land area, oil spill, lookalikes, and ships. Two example training SAR images along with their ground truth masks and class labels are shown in Figure 1. We use this dataset not only for training the classifiers, but also as a benchmark to compare our results against those published by the developers in [28].
Figure 1. Training dataset: A sample of two Sentinel 1 SAR images (left) along with ground truth masks (right) and class labels, viz., sea surface (black), oil spill (cyan), lookalikes (red), ship (brown), and land (green). The dataset was prepared by Krestenitis et al. [28] from the MKLab ITI-CERTH, Greece. It comprises validated oil spill records from the European Maritime Safety Agency (EMSA).

Methodology
The proposed methodology for oil spill detection is based on semantic segmentation of SAR images. Due to irregularity in oil slick shape and texture, a single label for the entire image is not sufficient to detect potential oil spills. Similarly, other approaches, such as object-based detection [41] and assigning multiple labels to single image [42], do not perform well in an oil spill detection case. In contrast, semantic segmentation classifies the multiple classes of interest in a single image at pixel-level, making it suitable for complex problems such as oil spill detection and classification [5,6].

UNet
UNet [7] is a popular CNN that was originally proposed for biomedical image segmentation and is also used in many remote sensing applications [8][9][10]. It consists of an encoder (contracting path) and a decoder (expansive path), as shown in Figure 2. The encoder has a structure similar to that of a typical CNN. Each encoder block consists of two 3 × 3 convolutional layers, each followed by a rectified linear unit (ReLU), and a maximum pooling layer with kernel size 2 × 2 and stride 2. At the end of each encoder block, the number of feature channels is doubled to learn complex low-level features. The decoder consists of upsampling and concatenation layers, followed by two 3 × 3 convolutional layers, rectified linear unit (ReLU), and a maximum pooling layer with kernel size 2 × 2 and stride 2. Finally, a 1 × 1 convolution is used to map the feature channels to the desired number of classes. The encoder part reduces the spatial dimensions of the input SAR image and increases the overall number of filters to extract complex low-level feature maps. In contrast, the decoder part transforms high-level features by combining the feature information from the encoder part using skip connections. Finally, the decoder maps the high-level features to the output, which is a semantic segmentation mask containing five classes of interest, i.e., sea surface, oil spills, lookalikes, ships, and land areas.

Figure 2. UNet architecture [7]. It consists of an encoder part that extracts the complex low-level features by reducing the image dimensions and increasing the number of channels. The decoder part upsamples the low-level features and maps the high-level features to the output, which is a semantic segmentation mask containing the desired number of classes, i.e., five in our case.
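As a rough illustration of the contracting path described in this subsection, the following NumPy sketch implements 2 × 2 max pooling with stride 2 and tracks how the spatial size shrinks while the channel count doubles from block to block. The starting width of 64 channels, the depth of four blocks, and the use of "same" padding in the convolutions (so that only pooling changes the spatial size) are assumptions made for illustration; they are not taken from the backbones used in this work.

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = x.shape
    # Group pixels into non-overlapping 2 x 2 windows and take the maximum of each.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def encoder_shapes(size=320, base_channels=64, depth=4):
    """(spatial size, channels) at the output of each encoder block.

    Assumes 'same'-padded convolutions, so only pooling changes the spatial
    size. base_channels and depth are illustrative assumptions.
    """
    shapes, channels = [], base_channels
    for _ in range(depth):
        size //= 2                      # effect of 2 x 2 pooling with stride 2
        shapes.append((size, channels))
        channels *= 2                   # channels double for the next block
    return shapes

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)                # shape (2, 2, 1)
shapes = encoder_shapes()               # sizes 160, 80, 40, 20 for a 320 x 320 input
```

Under these assumptions, a 320 × 320 input is reduced to 20 × 20 after four blocks, which is the resolution the decoder then has to upsample back from.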

EfficientNetV2
EfficientNetV2 is a new family of CNNs proposed by Tan et al. [12]. These CNNs have better training efficiency in terms of fewer trainable parameters, which reduces the training time. These models are developed by jointly optimizing the training speed and parameter efficiency using training-aware neural architecture search (NAS) and scaling. The major differences between the standard EfficientNet backbones and EfficientNetV2 CNNs are as follows:
1. In the initial layers, EfficientNetV2 extensively utilizes both MBConv and fused-MBConv structures, as shown in Figure 3.
2. EfficientNetV2 uses a small expansion ratio for MBConv modules. This reduces the memory overhead and results in faster training.
3. EfficientNetV2 uses a small kernel size of 3 × 3. This reduces the receptive field, which is compensated by adding some additional layers.
4. The original EfficientNet has a last stride 1 × 1 stage with a large number of trainable parameters. EfficientNetV2 removes this stage to reduce memory usage and increase the training speed.
We implement the EfficientNetV2S, EfficientNetV2B0, EfficientNetV2B1, EfficientNetV2B2, and EfficientNetV2B3 architectures for semantic segmentation of SAR images into five classes, viz., sea surface, oil spill, lookalikes, ship, and land. We train all the variants with and without the addition of GP loss to check its effectiveness in a semantic-segmentation-based setting.

Convolutional Neural Networks Meet Vision Transformers (CMTs)
CMTs are a new family of hybrid models proposed by Guo et al. [19]. Each model has a CMT stem, which consists of a single 3 × 3 convolutional layer with stride 2 × 2 and two 3 × 3 convolutional layers with stride 1 × 1. The rest of the network is made of alternating 3 × 3 convolutional layers with stride 2 × 2 and CMT blocks, as shown in Figure 4. Each CMT block consists of a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN). The LPU extracts the local information and is defined as follows:

$$\mathrm{LPU}(X) = \mathrm{DWConv}(X) + X, \quad (1)$$

where $X \in \mathbb{R}^{H \times W \times d}$, $H \times W$ represents the dimensions of the input image at the current stage, and $d$ represents the dimensions of the features. DWConv(.) is depthwise convolution. For details about the LMHSA and IRFFN modules, the readers are referred to [19]. Combining the aforementioned modules, the CMT block can be defined as follows:

$$\tilde{X}_i = \mathrm{LPU}(X_{i-1}), \qquad \hat{X}_i = \mathrm{LMHSA}(\mathrm{LN}(\tilde{X}_i)) + \tilde{X}_i, \qquad X_i = \mathrm{IRFFN}(\mathrm{LN}(\hat{X}_i)) + \hat{X}_i, \quad (2)$$

where $\tilde{X}_i$ and $\hat{X}_i$ are the outputs from the LPU and LMHSA modules for block $i$, respectively. LN(.) represents layer normalization. We implement different variants of CMTs, viz., CMTTiny, CMTExtraSmall, and CMTSmall, and add a classification head at the end of each architecture for semantic segmentation of SAR images. The classification head upsamples the features extracted by each CMT architecture and maps the high-level features to the output, which is a semantic segmentation mask containing five classes of interest, i.e., sea surface, oil spill, lookalikes, ship, and land.

Figure 4. An overview of the CMT architecture used for semantic segmentation of SAR images for oil spill classification. The architecture is based on two modules, viz., CMT stem and CMT block. Each CMT block consists of LPU, LMHSA, and IRFFN modules. For our classification problem, the input is a 320 × 320 × 3 SAR image and the output is a 320 × 320 × 5 semantic segmentation mask with five desired classes.
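To make the local perception unit more concrete, the sketch below implements a depthwise 3 × 3 convolution with a residual connection in plain NumPy, following the LPU definition of the CMT paper [19], i.e., LPU(X) = DWConv(X) + X. The zero padding, the 3 × 3 kernel size, and the function names are assumptions for illustration; this is not the implementation used for the experiments.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3 x 3 convolution with zero padding.

    x: (H, W, d) feature map; kernels: (3, 3, d), one 3 x 3 filter per channel.
    Each channel is filtered independently (no mixing across channels).
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            # Shifted view of the padded input, weighted per channel.
            out += padded[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out

def lpu(x, kernels):
    """Local perception unit: depthwise convolution plus the identity shortcut."""
    return depthwise_conv3x3(x, kernels) + x

rng = np.random.default_rng(0)
features = rng.random((6, 6, 4))
identity_kernels = np.zeros((3, 3, 4))
identity_kernels[1, 1, :] = 1.0         # centre tap only: DWConv(X) == X
doubled = lpu(features, identity_kernels)
```

With the centre-tap kernel the depthwise convolution returns its input unchanged, so the LPU output is exactly twice the input, which is a quick way to check the residual path.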

Convolution and Self-Attention Networks (CoAtNets)
CoAtNets are a family of hybrid models, recently proposed by the authors in [20]. CoAtNets are built on two key insights, which are as follows:
1. The advantages of both depthwise convolution and self-attention can be achieved by unifying them using simple relative attention.
2. Vertical stacking of convolution and attention layers can improve the generalization, efficiency, and capacity of the models.
The CoAtNet models are composed of five stages, i.e., S0-S4, as shown in Figure 5. The first stage consists of two 3 × 3 convolutional layers with stride 2 × 2 and 1 × 1, respectively. The second and third stages perform downsampling with depthwise convolution. Each of these stages consists of two 1 × 1 convolutional layers and one 3 × 3 depthwise convolution layer. The fourth and fifth stages consist of relative attention and feed-forward network (FFN) modules. For details about the relative attention and FFN modules, the readers are referred to [20]. We implement the CoAtNet-0 variant of this family and add a classification head to upsample the low-level features and map the high-level features to the output, which is a semantic segmentation mask containing all the relevant classes, i.e., five in our case.

Figure 5. An overview of the CoAtNet architecture used for semantic segmentation of SAR images for oil spill classification. It has five stages, viz., S0, S1, S2, S3, and S4. Each stage reduces the dimensions of the input image by a factor of 1/2. For our classification problem, the input is a 320 × 320 × 3 SAR image and the output is a 320 × 320 × 5 semantic segmentation mask with five desired classes.
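Since each of the five stages halves the spatial dimensions, the resolution seen by each stage and the upsampling factor required of the classification head follow from simple arithmetic. The short sketch below works this out for the 320 × 320 input used here; the function name is hypothetical.

```python
def stage_resolutions(input_size=320, num_stages=5):
    """Spatial size at the output of stages S0..S4 when each stage halves it."""
    sizes, size = [], input_size
    for _ in range(num_stages):
        size //= 2
        sizes.append(size)
    return sizes

sizes = stage_resolutions()             # outputs of S0, S1, S2, S3, S4
upsampling_factor = 320 // sizes[-1]    # factor the head must recover
```

For a 320 × 320 input this gives 160, 80, 40, 20, and 10 pixels per side, so the classification head has to upsample by a factor of 32 to return to the input resolution.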

Experimental Setup
We implement the UNet CNN with different encoder backbones from the ResNet series to extract complex low-level features. These features are then upsampled by the simple decoder module of the UNet CNN. Moreover, we implement the EfficientNetV2 family of CNNs, as well as the CMT and CoAtNet families of hybrid models. Apart from the UNet CNN, we add a classification head to each architecture for upsampling the complex low-level features and mapping the high-level features to the output for semantic segmentation of SAR images. All the models are trained on the benchmark dataset introduced in Section 2. The models are trained with ImageNet pretrained weights for an input shape of 320 × 320 with a batch size of 12. A stochastic optimization method, namely Adam, is used. This is an efficient method for stochastic optimization with low memory requirements [43]. We apply data augmentation on the fly. Random data augmentation generally improves the performance in various computer vision and remote sensing applications [44]. More specifically, we apply a series of random transformations including a zoom range, width shift range, and height shift range of 0.3, rotation of 90°, and random vertical and horizontal flips. These random transformations are applied to the SAR images as well as the ground truth masks during the training phase.
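The key point of the augmentation step is that every random transformation must be applied identically to the SAR image and to its ground truth mask, otherwise the pixel labels no longer line up. The NumPy sketch below illustrates this for flips and rotations only. It is a simplification: rotations are restricted to multiples of 90°, and the zoom and shift transformations are omitted; the function name and the 0.5 flip probability are assumptions.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flips and 90-degree rotations to image and mask.

    image: (H, W, C) array; mask: (H, W) array of class labels; H == W assumed.
    """
    if rng.random() < 0.5:              # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:              # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))         # number of 90-degree rotations
    # np.rot90 rotates in the plane of the first two axes, so channels are kept.
    return np.rot90(image, k), np.rot90(mask, k)

rng = np.random.default_rng(0)
mask = rng.integers(0, 5, size=(8, 8))                   # five classes
image = np.stack([mask] * 3, axis=-1).astype(float)      # toy image that mirrors the mask
aug_image, aug_mask = augment_pair(image, mask, rng)
```

Because the toy image is built from the mask, the first image channel should still equal the mask after augmentation, which confirms that both were transformed consistently.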

Commonly Used Semantic Segmentation Loss Functions
This subsection briefly discusses the different loss functions used for training the semantic segmentation networks.

Categorical Cross-Entropy Loss
The cross entropy is a measure of the difference between two probability distributions. Considering the case of binary classification, the cross-entropy loss is expressed as follows [45]:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1, \\ -\log(1 - p) & \text{otherwise}, \end{cases} \quad (3)$$

where $y \in \{\pm 1\}$ is the ground truth class and $p \in [0, 1]$ is the predicted probability for the class $y = 1$. In the context of multi-class classification, this loss is referred to as the categorical cross-entropy loss. It measures the performance of a classification model by comparing the probability distributions of the ground truths and the predicted class labels. If we define a new variable $p_t$:

$$p_t = \begin{cases} p & \text{if } y = 1, \\ 1 - p & \text{otherwise}, \end{cases} \quad (4)$$

then Equation (3) can be rewritten as $\mathrm{CE}(p_t) = -\log(p_t)$.
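The binary cross-entropy and the auxiliary variable p_t translate directly into a few lines of NumPy. In the sketch below, labels take values in {+1, -1} as in the text; the clipping constant eps is an assumption added only to avoid taking the logarithm of zero.

```python
import numpy as np

def p_t(p, y):
    """Probability assigned to the true class: p if y == 1, else 1 - p."""
    return np.where(y == 1, p, 1.0 - p)

def cross_entropy(p, y, eps=1e-7):
    """Element-wise binary cross-entropy, CE(p_t) = -log(p_t)."""
    return -np.log(np.clip(p_t(p, y), eps, 1.0))

p = np.array([0.8, 0.8])
y = np.array([1, -1])
losses = cross_entropy(p, y)            # -log(0.8) and -log(0.2)
```

The same prediction of 0.8 is cheap when the true label is +1 and expensive when it is -1, which is exactly the behaviour the piecewise definition encodes.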

Categorical Focal Loss
This loss function helps in addressing the data imbalance problem. The hard examples tend to increase the classification error. Training a CNN with categorical focal loss encourages the model to pay more attention to these examples, resulting in improved classification performance. It prevents a large number of false negatives from saturating the CNN during the training phase. Mathematically, the focal loss is defined by adding the modulating factor $(1 - p_t)^{\gamma}$ to the cross-entropy loss [45]:

$$\mathrm{FL}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t), \quad (5)$$

where $\alpha$ and $\gamma$ are the hyperparameters of the focal loss.
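A minimal NumPy sketch of the focal loss for the binary case is given below. The default values alpha = 0.25 and gamma = 2.0 are common choices from the focal loss literature and are assumptions here; the values used in our experiments are not implied by this example.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Element-wise focal loss: -alpha * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the class y == 1; y: labels in {+1, -1}.
    alpha and gamma defaults are illustrative assumptions.
    """
    pt = np.clip(np.where(y == 1, p, 1.0 - p), eps, 1.0)
    return -alpha * (1.0 - pt) ** gamma * np.log(pt)

p = np.array([0.9, 0.3])                # one easy and one hard positive example
y = np.array([1, 1])
plain_ce = focal_loss(p, y, alpha=1.0, gamma=0.0)   # reduces to cross-entropy
focal = focal_loss(p, y, alpha=1.0, gamma=2.0)
ratio = focal / plain_ce                # equals (1 - p_t)**gamma
```

With gamma = 2, the easy example (p_t = 0.9) is down-weighted by a factor of 0.01 while the hard example (p_t = 0.3) keeps 0.49 of its cross-entropy, so the hard example dominates the total loss.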

Jaccard Loss
Jaccard index is one of the most commonly used metrics for semantic-segmentation-based classification problems. It measures the similarity between the ground truth mask and the predicted class labels. Considering $y$ to be the ground truth mask and $\hat{y}$ the predicted class labels, the Jaccard loss function can be computed as follows [46]:

$$L_{\mathrm{jac}}(y, \hat{y}) = 1 - \frac{(y \cdot \hat{y}) + \epsilon}{(y + \hat{y} - y \cdot \hat{y}) + \epsilon}, \quad (6)$$

where $\epsilon$ is used to prevent division by zero. The subtrahend is equivalent to the intersection over union (IoU) value. Therefore, the use of Jaccard loss for the training aims to directly increase the IoU (which itself is a commonly used figure of merit for classification performance).
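The Jaccard loss can be sketched in NumPy as follows, treating the products and sums in the definition as sums over all pixels of a (soft) mask. The value of eps and the summation over the whole array are assumptions for illustration.

```python
import numpy as np

def jaccard_loss(y, y_hat, eps=1e-7):
    """1 - (intersection + eps) / (union + eps), summed over all pixels."""
    intersection = np.sum(y * y_hat)
    union = np.sum(y) + np.sum(y_hat) - intersection
    return 1.0 - (intersection + eps) / (union + eps)

truth = np.array([1.0, 1.0, 0.0, 0.0])
half = np.array([1.0, 0.0, 0.0, 0.0])       # covers half of the true region
disjoint = np.array([0.0, 0.0, 1.0, 1.0])   # no overlap with the true region

loss_perfect = jaccard_loss(truth, truth)
loss_half = jaccard_loss(truth, half)
loss_disjoint = jaccard_loss(truth, disjoint)
```

A perfect prediction gives a loss of 0, a prediction with no overlap gives a loss close to 1, and covering half of the true region (intersection 1, union 2) gives 0.5.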

Gradient Profile Loss
Common cross-entropy-based losses used in semantic segmentation focus on classifying each pixel individually and do not take into account the spatial relationship over semantically constant regions. To some extent, the use of an IoU-based loss (Jaccard) caters for this since it tries to increase the intersection over union of final predictions over a region. In order to illustrate this point, Figure 6 shows three images, i.e., source A (left), target B (center), and target C (right). The targets B and C have the same number of white pixels but their spatial structure is different. First, we compute the mean absolute difference ($D_{\mathrm{pixel}}$) between source A and each of the targets B and C by considering each pixel independently. As a result, we obtain the same value of 0.3750 for both targets. This method does not capture the different spatial patterns of targets B and C. The complex spatial patterns in an image can be better captured by considering pixel variations along a given direction. To demonstrate this, we consider the columns of an image as vectors and compute the Euclidean distance between source A and each of the targets B and C. The mean of these distances ($D_{GP}$) between A and B is 10.9545. Similarly, the mean of the distances between A and C is 6.7082. By considering columns or rows of an image as spatial profiles, we can accurately capture the complex spatial patterns. With this motivation, we introduce the use of an additional loss that is computed in a way which preserves the spatial structure of the target label map over the entire image, rather than over regions or pixels. This is achieved by matching prediction probabilities along horizontal and vertical directions in the output segmentation maps. A whole row or column, a.k.a. a profile, of the output prediction map is considered as a vector and matched in vector space by computing cosine similarity.
This is inspired by the recently proposed spatial profile loss (SPL) [38] for use in image translation tasks. SPL computes such similarities on different color spaces and gradient spaces of the image. Our contribution is to incorporate such a matching on prediction probabilities in a semantic segmentation task. Since we are matching probability distributions along profiles, we compute this similarity over the gradients of the prediction class maps. Formally, the similarity over each image channel is measured as follows:

$$S(y, \hat{y}) = \sum_{c} \left( \frac{1}{H} \mathrm{tr}\left(y_c \cdot \hat{y}_c^{\tau}\right) + \frac{1}{W} \mathrm{tr}\left(y_c^{\tau} \cdot \hat{y}_c\right) \right), \quad (7)$$

where $y$ represents the ground truth mask of size $H \times W$, $\hat{y}$ represents the predicted class labels of the same dimension, tr(.) represents the trace of a matrix, $(.)^{\tau}$ represents the transpose of a matrix, and the subscript $c$ represents each image channel. The first and second terms compute the similarity between the row and column profiles of the ground truth mask and the predicted class labels, respectively. We compute the similarity given in Equation (7) in the image gradients' space, and call the resulting loss the gradient profile (GP) loss [38]:

$$L_{GP}(y, \hat{y}) = -S(\nabla y, \nabla \hat{y}). \quad (8)$$
The image gradients for each channel of an image can be easily computed by measuring the difference between an image and its one-pixel shifted version.

Figure 6. Graphical demonstration showing the importance of complex spatial patterns in different images with the same number of pixels. We compute the mean absolute difference between source A (left) and each of the targets B (center) and C (right). Targets B and C have the same number of white pixels, which results in the same value of mean absolute difference. Complex spatial patterns can be captured by considering rows or columns of an image as spatial profiles.
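The following NumPy sketch puts the pieces of this subsection together: image gradients from one-pixel shifts, row and column profiles compared by cosine similarity via the trace of a matrix product, and the negated similarity as the loss. It is a sketch under stated assumptions rather than the training code: profiles are normalized to unit length so that the trace sums cosine similarities, the vertical and horizontal gradient images are scored separately and their similarities added, and all function names are hypothetical.

```python
import numpy as np

def unit_rows(m, eps=1e-8):
    """Scale each row of a 2-D array to unit Euclidean length."""
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)

def profile_similarity(y, y_hat):
    """Sum over channels of mean row-profile and column-profile cosine similarity.

    y, y_hat: (H, W, C) arrays.
    """
    h, w, c = y.shape
    total = 0.0
    for k in range(c):
        a, b = y[:, :, k], y_hat[:, :, k]
        # Diagonal of A_rows @ B_rows^T holds the row-wise cosine similarities.
        rows = np.trace(unit_rows(a) @ unit_rows(b).T) / h
        # Transposing first turns columns into rows, giving column-wise similarities.
        cols = np.trace(unit_rows(a.T) @ unit_rows(b.T).T) / w
        total += rows + cols
    return total

def image_gradients(m):
    """Differences between an (H, W, C) array and its one-pixel shifted versions."""
    dy, dx = np.zeros_like(m), np.zeros_like(m)
    dy[1:, :, :] = m[1:, :, :] - m[:-1, :, :]    # vertical shift
    dx[:, 1:, :] = m[:, 1:, :] - m[:, :-1, :]    # horizontal shift
    return dy, dx

def gp_loss(y, y_hat):
    """Negative profile similarity computed in gradient space."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    dy_t, dx_t = image_gradients(y)
    dy_p, dx_p = image_gradients(y_hat)
    return -(profile_similarity(dy_t, dy_p) + profile_similarity(dx_t, dx_p))

rng = np.random.default_rng(0)
target = rng.random((16, 16, 5))        # stand-in for a five-class mask
other = rng.random((16, 16, 5))         # unrelated prediction
loss_same = gp_loss(target, target)
loss_other = gp_loss(target, other)
```

A prediction whose gradient profiles match the target yields a strongly negative loss, whereas an unrelated prediction yields a loss near zero, so minimizing this quantity pushes the predicted map towards the spatial structure of the ground truth.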

Results
The training of UNet CNN is conducted with different backbones. In particular, we used resnet50, resnet101, and resnet152 backbones. Moreover, we provide extensive ablation results on EfficientNetV2 CNNs, CMTs, and CoAtNet hybrid models. Among the different loss functions, we used categorical focal loss and Jaccard loss, as well as the GP loss. All models were trained for 62 epochs.

Comparison against State of the Art
We evaluate the performance of the classification in terms of the IoU values. The results are compared against those from the earlier work [28], as reproduced in Table 1. The proposed framework with the resnet101 backbone (Table 1) offers a significant improvement in terms of mIoU, which has increased by 13.5%, as well as in the classwise IoU scores for all classes. The best results reported in the earlier work are achieved for the mobilenetv2 backbone; we have also outperformed those results by a significant margin for all classes except the sea surface class. For the oil spill and lookalike classes, we improve by 10.6% and 5.47%, respectively. Moreover, we carried out an ablation study by training DeepLabv3+ and UNet with the mobilenetv2 backbone and a combination of GP, Jaccard, and focal loss functions. The results (row # 3 & 5, Table 1) show an improvement with mIoU scores of 75.44% and 74.84%, which account for a 10% and 9% improvement over the state of the art for this dataset, respectively. These results emphasize the advantage of adding GP loss as an additional loss function for training different backbones.
Additionally, we compare the results of our classification framework with results from earlier works on classification of oil spills [28,31,32,34,35]. Table 2 illustrates the comparison in terms of mIoU, F1 scores, datasets used for oil spill detection, and number of classes considered for classification. Our proposed classification framework (row # 6, Table 2) provides improved results with an mIoU score of 78.45% on the oil spill detection dataset developed by MKLab ITI-CERTH, Greece, with five classes of interest, viz., sea surface, oil spills, lookalikes, ship, and land (row # 2 & 3, Table 2). We also compare our results in terms of F1 score against those published in [31] for the same dataset (row # 1, Table 2). We achieved an F1 score of 82.47%, which accounts for a 2.47% improvement. This highlights the significance of our proposed methodology for multi-class classification of oil spills from lookalikes, sea surface, ship, and land. For the sake of completeness and the reader's interest, we also state the performance metrics reported by other studies on different datasets (row # 4 & 5, Table 2) with fewer classes. A direct comparison of our results with those is not possible due to differences in dataset characteristics and number of classes.

Table 1. Comparison of classification results with the state of the art (as reported by the earlier work [28]) assessed over the test SAR images in terms of the intersection over union (IoU) score.
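For reference, the classwise IoU and the mIoU used throughout this comparison can be computed from label maps as sketched below. The function name and the toy label arrays are hypothetical; the five reported classwise IoU values are taken from the abstract and are used only to check that their mean reproduces the reported mIoU.

```python
import numpy as np

def iou_per_class(y_true, y_pred, num_classes):
    """Intersection over union for each class label in two integer label maps."""
    ious = []
    for c in range(num_classes):
        t, p = (y_true == c), (y_pred == c)
        union = np.logical_or(t, p).sum()
        inter = np.logical_and(t, p).sum()
        # A class absent from both maps has no defined IoU.
        ious.append(inter / union if union else float("nan"))
    return np.array(ious)

# Toy example with two classes and four pixels.
toy_true = np.array([0, 0, 1, 1])
toy_pred = np.array([0, 1, 1, 1])
toy_iou = iou_per_class(toy_true, toy_pred, num_classes=2)   # 1/2 and 2/3

# Reported classwise IoU (%): sea surface, oil spill, lookalikes, ships, land.
reported_iou = [96.00, 63.95, 60.87, 74.61, 96.80]
miou = float(np.mean(reported_iou))     # 78.446, i.e., 78.45 after rounding
```

The mean of the five reported classwise values is 78.446%, which agrees with the reported mIoU of 78.45%.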

Ablation on ResNet Series
For our ablation study, we experiment with different resnet backbones trained with different loss function combinations. The results are evaluated in terms of mIoU as well as F1 score, as reported in Table 3. When only cross-entropy loss was used in [28] with the resnet101 backbone, the mIoU achieved was merely 64.97% (row # 1, Table 1). If we use a combination of categorical focal and Jaccard loss, the mIoU score jumps to 76.52% (row # 3, Table 3). Moreover, even resnet50, with 19 million fewer trainable parameters than resnet101, performs better with this combination (row # 1, Table 3). Remarkably, the addition of GP loss further improves the overall classification performance in terms of both mIoU and F1 scores for each backbone in our study. Classwise results are also improved by GP loss. For the oil spill class, in particular, GP loss improves the IoU score by roughly 3-5% for each backbone.
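The combination of categorical focal and Jaccard loss mentioned above can be sketched as follows. This is a minimal pure-Python illustration, not the training code behind the reported results: the function names, the focal parameters `gamma=2.0` and `alpha=0.25`, and the unweighted sum of the two terms are all assumptions made here.

```python
import math
from typing import Sequence


def categorical_focal_loss(probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           gamma: float = 2.0,
                           alpha: float = 0.25,
                           eps: float = 1e-7) -> float:
    """Mean focal loss over pixels; probs[i] holds the class probabilities of pixel i."""
    total = 0.0
    for p, t in zip(probs, labels):
        pt = min(max(p[t], eps), 1.0 - eps)  # probability of the true class, clipped
        # Well-classified pixels (pt close to 1) are down-weighted by (1 - pt)^gamma.
        total += -alpha * (1.0 - pt) ** gamma * math.log(pt)
    return total / len(labels)


def jaccard_loss(probs: Sequence[Sequence[float]],
                 labels: Sequence[int],
                 eps: float = 1e-7) -> float:
    """Soft Jaccard loss: 1 - |P n G| / |P u G|, accumulated over pixels and classes."""
    inter = 0.0   # sum of predicted probability assigned to the true class
    sum_p = 0.0   # total predicted probability mass
    sum_g = 0.0   # total ground-truth mass (one per pixel for one-hot labels)
    for p, t in zip(probs, labels):
        inter += p[t]
        sum_p += sum(p)
        sum_g += 1.0
    union = sum_p + sum_g - inter
    return 1.0 - (inter + eps) / (union + eps)


def focal_jaccard_loss(probs, labels) -> float:
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return categorical_focal_loss(probs, labels) + jaccard_loss(probs, labels)


labels = [0, 1, 2, 1]
perfect = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]
loss_perfect = focal_jaccard_loss(perfect, labels)
loss_uniform = focal_jaccard_loss(uniform, labels)
```

The focal term concentrates the gradient on hard pixels, while the Jaccard term is driven by region overlap rather than by pixel counts, which is why such a pairing is a common choice under class imbalance.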

Ablation on EfficientNetV2
We experiment with different architectures from the EfficientNetV2 family of CNNs. The mIoU and F1 scores are used as evaluation metrics. For EfficientNetV2B0 (row # 3 & 4, Table 4) with 15.7 million trainable parameters, training with a combination of focal and Jaccard loss resulted in mIoU and F1 scores of 64.64% and 68.61%, respectively. By adding GP loss as an additional loss function, the mIoU and F1 scores improve to 75.27% and 79.26%, respectively. This accounts for an 11% improvement with the addition of GP loss. For EfficientNetV2Small (row # 1 & 2, Table 4), EfficientNetV2B1 (row # 5 & 6, Table 4), and EfficientNetV2B2 (row # 7 & 8, Table 4), there is an improvement of 2% in mIoU and F1 scores when GP loss is added to the focal and Jaccard loss functions. For EfficientNetV2B3 (row # 9 & 10, Table 4), there is a 1% improvement in mIoU and F1 scores with the addition of GP loss for training. Overall, GP loss improves performance for architectures with different numbers of trainable parameters.

Ablation on CMTs
To check the effectiveness of GP loss as an additional loss function for training, we experiment with CMTs, a family of hybrid models developed by combining CNNs and ViTs. The generalization ability of CNNs and the capacity of ViTs are combined for better generalization and scaling. For CMTTiny (row # 1 & 2, Table 5) with 18.0 million trainable parameters, the addition of GP loss results in a significant improvement in terms of mIoU and F1 scores. The mIoU score increases by about 5%, from 67.16% to 72.43%, and the F1 score increases by about 5%, from 70.82% to 76.10%. The training of CMTXS with 23.8 million trainable parameters without the addition of GP loss (row # 3, Table 5) results in low mIoU and F1 scores. However, with the addition of GP loss, the performance improves significantly, resulting in mIoU and F1 scores of 72.72% and 76.78%, which account for a 17% and 18% improvement, respectively. Therefore, GP loss proves useful for training with a small training dataset. For CMTSmall (row # 5 & 6, Table 5) with 34.6 million trainable parameters, training without GP loss results in mIoU and F1 scores of 41.00% and 43.43%, respectively. These are the lowest scores among all the trained models, which can be attributed to the small number of training images and trainable features. However, there is a significant improvement with the addition of GP loss for training. The mIoU and F1 scores improve to 64.50% and 67.29%, which account for a 23% and 24% improvement, respectively.

Ablation on CoAtNet
We experiment with the CoAtNet family of hybrid models, recently proposed by Dai et al. [20]. We train CoAtNet-0, which is the base variant of the CoAtNet series with 29.4 million trainable parameters. The training of CoAtNet-0 (row # 7 & 8, Table 5) is first performed using a combination of focal and Jaccard loss functions, for which we obtained mIoU and F1 scores of 67.00% and 70.77%, respectively. After the addition of GP loss as an additional loss function, the mIoU and F1 scores improved to 73.61% and 77.00%, which accounts for an improvement of more than 6% in both scores. Hence, GP loss turns out to be a promising loss function for training different CNNs and ViTs-based hybrid models.
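The improvements quoted for the hybrid models are absolute differences in percentage points between the scores with and without GP loss. The short sketch below recomputes them from the scores stated in the text for Table 5; the dictionary layout and the helper name `gp_gain` are illustrative assumptions, and CMTXS is left out because its scores without GP loss are not quoted in the text.

```python
# (score without GP loss, score with GP loss) in percent, as quoted in the text.
reported = {
    "CMTTiny":   {"mIoU": (67.16, 72.43), "F1": (70.82, 76.10)},
    "CMTSmall":  {"mIoU": (41.00, 64.50), "F1": (43.43, 67.29)},
    "CoAtNet-0": {"mIoU": (67.00, 73.61), "F1": (70.77, 77.00)},
}


def gp_gain(without_gp: float, with_gp: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(with_gp - without_gp, 2)


gains = {
    model: {metric: gp_gain(*pair) for metric, pair in scores.items()}
    for model, scores in reported.items()
}

for model, g in gains.items():
    print(f"{model}: mIoU +{g['mIoU']:.2f}, F1 +{g['F1']:.2f}")
```

The largest gain belongs to CMTSmall, the model with the weakest scores when trained without GP loss.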

Qualitative Results
A few selected results are shown in Figure 7 for qualitative analysis. These SAR images are processed with the UNet (resnet101) CNN, trained with a combination of GP, Jaccard, and focal loss functions. Referring to the top sub-figure, the model has accurately classified the oil spill, lookalikes, and land area. It has also detected a small area of lookalikes that is not labeled in the ground truth mask. It is difficult to say whether this is a labeling error. Nonetheless, in the computation of our performance metrics, it is counted as an error. Referring to the middle sub-figure, the model has detected the oil spills and a nearby ship. In the bottom sub-figure, the model has accurately detected the oil spill and land area, but a few lookalikes predicted by the classifier close to the land seem to be in error. As per the ground truth, these dark areas are just sea surface, which may represent naturally calm water close to the land.

Figure 7. SAR images (left) along with ground truth masks (center) and predicted class labels (right). The classification framework used is based on UNet architecture with resnet101 pretrained encoder backbone, trained with a combination of focal, Jaccard, and GP loss functions. The images are acquired by Sentinel 1, and the training/test dataset is developed by MKLab ITI-CERTH, Greece.

Conclusions & Outlook
This paper reports an investigation into the performance of different CNNs and hybrid (CNNs + ViTs) models for oil spill classification in SAR images, and introduces the use of a new loss function, namely, gradient profile (GP) loss, which offers significant improvements in classification performance. The problem is set up as a multi-class classification: a potential oil spill in an image has to be distinguished from the other possible classes, namely natural sea surface, land, ship, and lookalikes. A labeled dataset comprising 1112 SAR images is used, which is split into training and test subsets comprising 1002 and 110 images, respectively. The state-of-the-art result reported for this dataset is an mIoU of 65.06%, using the mobilenetv2 backbone on the DeepLabv3+ architecture. Our proposed framework relies on the UNet neural network architecture, and we show our best results with the resnet101 backbone. We achieve an mIoU of 76.52% with this framework when training with a combination of Jaccard and focal loss functions, and a further improvement of 1.93% (an overall improvement of 13.5% over the state of the art) by including the GP loss function. GP loss explicitly takes into account spatial relationships over semantically constant regions by computing cosine similarities over horizontal and vertical spatial profiles in the gradient space. We have also performed extensive ablation studies in which only the GP loss is excluded from the loss combination in successive experiments on three different resnet backbones, EfficientNetV2 CNNs, CMTs, and CoAtNet hybrid models. In each case, the inclusion of GP loss significantly improves classwise performance (particularly for oil spill, which is an imbalanced class) as well as the overall performance.
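The idea of comparing horizontal and vertical profiles in the gradient space can be sketched for a single-channel map as follows. This is only an illustration under stated assumptions, not the implementation used for the reported results: the forward-difference gradients, the averaging over profiles, the `1 - similarity` form of the loss, and all function names are choices made here.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def gradients(img: Sequence[Sequence[float]]) -> Tuple[Grid, Grid]:
    """Forward differences along the horizontal (gx) and vertical (gy) direction."""
    h, w = len(img), len(img[0])
    gx = [[img[i][j + 1] - img[i][j] for j in range(w - 1)] for i in range(h)]
    gy = [[img[i + 1][j] - img[i][j] for j in range(w)] for i in range(h - 1)]
    return gx, gy


def cosine(u: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine similarity of two profiles; a flat (all-zero) profile yields 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def profile_similarity(a: Grid, b: Grid) -> float:
    """Mean cosine similarity over matching rows and over matching columns."""
    rows = [cosine(ra, rb) for ra, rb in zip(a, b)]
    cols = [cosine(ca, cb) for ca, cb in zip(zip(*a), zip(*b))]
    return 0.5 * (sum(rows) / len(rows) + sum(cols) / len(cols))


def gp_loss(pred: Sequence[Sequence[float]],
            target: Sequence[Sequence[float]]) -> float:
    """1 - mean profile similarity of the two gradient maps (range 0 to 2)."""
    sims = [profile_similarity(gp, gt)
            for gp, gt in zip(gradients(pred), gradients(target))]
    return 1.0 - sum(sims) / len(sims)


# Toy 3x3 checkerboard mask, a half-confidence copy, and its inverse.
target = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
scaled = [[0.5 * v for v in row] for row in target]
inverted = [[1.0 - v for v in row] for row in target]

loss_same = gp_loss(target, target)
loss_scaled = gp_loss(scaled, target)
loss_inverted = gp_loss(inverted, target)
```

In this sketch, a prediction whose edges coincide with those of the target gives a loss close to 0 regardless of the prediction's magnitude, while a prediction with reversed edges gives a loss close to 2. The loss therefore rewards agreement in region boundaries, which complements overlap-based and pixel-wise terms.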
Nevertheless, it is worth noting that training has been performed on a rather small training set with a large class imbalance. It is probable that a larger dataset would help to improve the scores further, though decent results (with F1 > 80%) are achieved already. We thank the researchers who set up this dataset [28,30]. In future work, we aim to further improve our classification scores and explore the choice of GP loss as a preferred loss function for other remote sensing applications.