Application of Deep Networks to Oil Spill Detection Using Polarimetric Synthetic Aperture Radar Images

Featured Application: Using polarimetric synthetic aperture radar (SAR) remote sensing to detect and classify sea surface oil spills, for the early warning and monitoring of marine oil spill pollution.

Abstract: Polarimetric synthetic aperture radar (SAR) remote sensing is a powerful tool for oil spill detection and classification because it can distinguish mineral oil from biogenic lookalikes. Various features can be extracted from polarimetric SAR data. Because these features are numerous and correlated, their selection and optimization strongly affect the performance of oil spill classification algorithms. In this paper, deep learning algorithms, namely the stacked autoencoder (SAE) and the deep belief network (DBN), are applied to optimize the polarimetric feature sets and reduce the feature dimension through layer-wise unsupervised pre-training. An experiment was conducted on a RADARSAT-2 quad-polarimetric SAR image acquired during the Norwegian oil-on-water exercise of 2011, in which verified mineral oil, emulsion, and biogenic slicks were analyzed. The results show that the deep networks outperformed both a support vector machine (SVM) and a traditional artificial neural network (ANN) with similar parameter settings, especially when the number of training samples is limited.


Introduction
As one of the most significant sources of marine pollution, oil spills have caused serious environmental and economic damage to the ocean and coastal zones [1]. Oil spills near the coast can be caused by ship accidents, explosions on oil rig platforms, broken pipelines, and the deliberate discharge of tank-cleaning wastewater from ships. The NEREIDs program, sponsored by the European Commission, was the first robust attempt to use shipping, geological and metocean data to characterize oil spills in one of the major oil exploration areas of the world, prior to any major oil spill accident. Based on these data, oil spill models were established to simulate the development and trajectories of oil spills, to investigate the susceptibility of the coastal zone, and to find suitable measures to alleviate their impact on the environment [2][3][4][5].
Early warning and near-real-time monitoring of oil slicks play a very important role in oil spill clean-up operations and in alleviating the impact on the coastal environment [2,3]. Synthetic aperture radar (SAR) is one of the most promising remote sensing systems for oil spill monitoring because it can provide valuable information about the position and size of an oil spill [1]. Moreover, its wide coverage and all-day, all-weather capabilities make SAR well suited to large-scale oil spill monitoring and early warning [6][7][8].
In their early stages, studies of oil spill detection were mainly based on single-polarization SAR images [9][10][11][12]. The theoretical rationale of SAR oil spill detection is that oil slicks on the sea surface dampen short gravity and capillary waves, so the Bragg scattering from the sea surface is largely weakened. As a result, oil spills appear as "dark" areas in SAR images. The ideal sea surface wind speed for oil spill detection is 3-14 m/s [13]. However, other man-made or natural phenomena can produce very similar low-backscatter areas on the sea surface, e.g., biogenic slicks, waves, currents and low-wind areas. Conventional oil spill detection procedures use intensity, morphological texture, and auxiliary information to distinguish mineral oil from its lookalikes, with a processing chain divided into three main steps [13]: (1) dark spot detection; (2) feature extraction; and (3) classification between mineral oil and its lookalikes.
Single-polarization SAR-based oil spill detection algorithms need auxiliary information and a large number of data samples to distinguish mineral oil from its lookalikes. Moreover, the shape and texture of oil slicks vary, which affects the robustness of intensity-based oil spill classification algorithms. The polarimetric observation capabilities of advanced SAR sensors are much better suited to oil spill detection [14]. For instance, biogenic slicks and mineral oil are difficult to distinguish in single-polarization SAR images, yet their polarimetric scattering mechanisms differ markedly: for oil-covered areas, Bragg scattering is largely suppressed and high polarimetric entropy is observed, whereas for a biogenic slick, Bragg scattering is still dominant but has a lower intensity. Polarimetric behavior similar to that of oil-free areas should therefore be expected in the presence of biogenic films. Hence, polarimetric features can greatly help the classification between mineral oil and biogenic lookalikes [14].
Various polarimetric features have been proposed to classify oil spills. The standard deviation of the copolarized phase difference (the phase difference between the VV (vertical transmit, vertical receive) and HH (horizontal transmit, horizontal receive) channels) has shown a strong oil classification capability on C-, X-, and L-band data [15]. Nunziata et al. (2011) proposed the pedestal height to describe the different polarization signatures of mineral oil and biogenic lookalikes [16]. Minchew et al. (2012) took advantage of the copolarization ratio to study the mixing status of crude oil and sea water [17]. Zhang et al. (2011) used the conformity coefficient as a binary classifier [18]. Other polarimetric features such as the degree of polarization, entropy, alpha angle, and Bragg likelihood angle have also been used to classify oil spills [19][20][21].
Several previous studies developed automatic oil spill classification algorithms. Marghany (2001) developed models to discriminate between the textures of oil and water by using co-occurrence textures [22]. Gambardella et al. (2008) proposed one-class classification with an optimized feature selection algorithm and obtained promising oil spill classification results [23]. Frate et al. (2000) proposed a semiautomatic detection of oil spills using a neural network [24]. Garcia-Pineda et al. (2008) developed the Textural Classifier Neural Network Algorithm (TCNNA) to map an oil spill in the Gulf of Mexico Deepwater Horizon accident [11]. Marghany (2013) used a genetic algorithm (GA) for the automatic detection of an oil spill from ENVISAT ASAR (Advanced Synthetic Aperture Radar) data [25]. Li et al. (2013) used a Support Vector Machine (SVM) to detect oil spills based on morphological features with very limited data samples [26].
Polarimetric SAR features contain a large amount of complementary and redundant information, and their extraction and optimization are closely related to the performance of oil spill classification [27]. Deep learning algorithms are very capable of exploring complex correlations between features and achieve very promising fitting results on complicated problems. Deep learning has become a very popular technique for image processing, computer vision, and natural language processing. To the authors' knowledge, deep learning has not yet been used for feature optimization in oil spill detection based on polarimetric SAR data, which makes it a very promising research topic.
A deep neural network with multiple layers of neurons is much more capable of describing complex functions than a shallow network [28]. However, the traditional gradient descent technique works poorly on a deep neural network when the weights are initialized randomly. The reason is that when the derivative is calculated by back-propagation, the magnitude of the gradient (propagated from the output layer toward the first layer of the network) decreases dramatically as the network depth increases. As a result, the gradient of the overall loss function with respect to the weights of the first few layers is very small. Thus, when gradient descent is used, the weights of the first layers change very slowly and cannot learn effectively from the samples. This problem is often referred to as "gradient dispersion" (also known as the vanishing gradient problem). In 2006, Hinton et al. proposed the deep belief network (DBN), a belief network built from Restricted Boltzmann Machines (RBMs) one layer at a time, to take advantage of the complementary priors of the data. Inspired by the DBN, Bengio et al. (2006) used a stacked autoencoder, a deep multilayer neural network whose weights are initialized by a greedy layer-wise unsupervised training strategy [29].
Moreover, feature dimension reduction can be seen as an early fusion step. Fusion at different stages of the classification procedure is a booming research field that has been shown to improve classification results. For instance, Vergara et al. fused the outputs of non-independent detectors to derive the optimum classification result [30]. Late fusion of the scores of several classifiers could be adapted to the present problem in future research.
The aim of this paper is to explore the capabilities of deep learning algorithms for polarimetric SAR-based marine oil spill detection. Section 2 introduces the research methods, including the representation of polarimetric SAR data, the feature extraction methods, and the deep learning algorithms DBN and SAE. In Section 3, experiments are conducted on RADARSAT-2 data containing verified oil spills and biogenic lookalikes, and the performance of different algorithms for oil spill classification is compared on various sample sizes. Finally, conclusions are drawn in Section 4, where the significance of the study and future work are briefly presented.

Fundamentals of Polarimetric SAR
The scattering characteristics of an observed target can be described by the scattering matrix S, which links the scattered and incident electromagnetic (EM) fields in the backscatter coordinate system:

$$\mathbf{E}^{s} = \frac{e^{-jkr}}{r}\,\mathbf{S}\,\mathbf{E}^{i}$$

where $\mathbf{E}^{i}$ and $\mathbf{E}^{s}$ are the incident and scattered fields, k is the wavenumber of the EM wave, and r is the distance. Fully polarimetric SAR observations can be achieved in quad-polarimetric mode, in which horizontally and vertically polarized signals are transmitted alternately and received coherently. The 2 × 2 scattering matrix is used to represent the single look complex quad-pol SAR data:

$$\mathbf{S} = \begin{bmatrix} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{bmatrix}$$

where the subscripts of $S_{ij}$ describe the transmitted and received polarization, respectively, with h denoting the horizontal direction and v denoting the vertical direction.
To exploit the statistical properties of SAR data and reduce the effect of speckle noise, the covariance matrix is often derived from the scattering matrix by multilooking its second-order products, which have the form $\langle S_{ij} S_{mn}^{*} \rangle$, where "*" denotes the complex conjugate and "$\langle\ \rangle$" stands for multilooking with an averaging window.
Multilooking is applied as a standard procedure to obtain the second-order statistics (covariance matrix, coherency matrix) of SAR data. An averaging window of 5 × 5 is normally used to balance speckle reduction against the loss of spatial resolution.
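The multilooking step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the processing chain used in the study: it assumes reciprocity ($S_{hv} = S_{vh}$) and the lexicographic scattering vector $[S_{hh}, \sqrt{2} S_{hv}, S_{vv}]$, and the function names `boxcar` and `covariance_matrix` as well as the random test data are hypothetical.

```python
import numpy as np

def boxcar(img, win=5):
    """Average a 2-D array over a win x win sliding window (edges are replicated)."""
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")
    rows, cols = img.shape
    out = np.zeros(img.shape, dtype=padded.dtype)
    for dy in range(win):
        for dx in range(win):
            out += padded[dy:dy + rows, dx:dx + cols]
    return out / (win * win)

def covariance_matrix(s_hh, s_hv, s_vv, win=5):
    """Multilooked 3x3 covariance matrix per pixel, shape (rows, cols, 3, 3).

    Assumption of this sketch: reciprocity holds, so the scattering vector is
    k = [S_hh, sqrt(2) * S_hv, S_vv] and C = <k k^H>.
    """
    k = np.stack([s_hh, np.sqrt(2.0) * s_hv, s_vv], axis=-1)
    c = np.empty(s_hh.shape + (3, 3), dtype=complex)
    for i in range(3):
        for j in range(3):
            # second-order product <k_i k_j^*>, averaged over the window
            c[..., i, j] = boxcar(k[..., i] * np.conj(k[..., j]), win)
    return c

# Synthetic single look complex channels standing in for real quad-pol data
rng = np.random.default_rng(0)
shape = (32, 32)
s_hh, s_hv, s_vv = (rng.normal(size=shape) + 1j * rng.normal(size=shape) for _ in range(3))
C = covariance_matrix(s_hh, s_hv, s_vv)
```

Each pixel of `C` is a Hermitian matrix with real, non-negative diagonal entries (the multilooked channel powers). A larger window reduces speckle further at the cost of spatial resolution.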

Features Extraction for Oil Spills Detection
Previous studies have shown experimentally that various SAR features can assist oil spill detection and classification [31]. In this study, ten features, namely single VV channel intensity, entropy, alpha angle, degree of polarization, ellipticity, pedestal height, copolarized phase difference (CPD), conformity coefficient, correlation coefficient and coherence coefficient, are extracted from the covariance matrix (or from the coherency matrix and Stokes vector derived from the covariance matrix) [32] of the polarimetric SAR data. The ten features and their behavior on clean sea surface and on sea surface covered by different materials are given in Table 1. Detailed definitions of the features and of their behavior on different targets are provided in [27].
Note to Table 1: "lower" and "higher" mean that the property of the feature on a certain type of surface is close to that of the other surface that has the property "low" or "high", but slightly lower or higher. "Std. copolarized phase difference (CPD)" stands for the standard deviation of the CPD.
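As one example of how such a feature can be computed, the sketch below derives the polarimetric entropy from the eigenvalues of a 3 × 3 Hermitian covariance (or coherency) matrix, using the usual definition $H = -\sum_i p_i \log_3 p_i$ with $p_i = \lambda_i / \sum_j \lambda_j$. The exact definitions used in the study are those of [27]; the function name `polarimetric_entropy` and the test matrices here are illustrative assumptions.

```python
import numpy as np

def polarimetric_entropy(c):
    """Entropy H in [0, 1] of a Hermitian 3x3 matrix, or of a stack of such matrices."""
    eigvals = np.linalg.eigvalsh(c)              # real eigenvalues of a Hermitian matrix
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative round-off
    total = np.maximum(eigvals.sum(axis=-1, keepdims=True), 1e-30)
    p = eigvals / total                          # pseudo-probabilities, summing to 1
    # Treat 0 * log(0) as 0 by replacing zero probabilities with 1 inside the log
    log3_p = np.log(np.where(p > 0, p, 1.0)) / np.log(3.0)
    return -(p * log3_p).sum(axis=-1)

h_single = polarimetric_entropy(np.diag([1.0, 0.0, 0.0]))  # one dominant mechanism, H = 0
h_mixed = polarimetric_entropy(np.eye(3))                   # three equal mechanisms, H close to 1
```

Low entropy corresponds to a single dominant scattering mechanism, while values near 1 indicate a mixture of mechanisms with comparable power, which is the high-entropy behavior reported for oil-covered areas.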

Restricted Boltzmann Machine
An RBM is a neural network consisting of a visible layer and a hidden layer. The connections between the neurons of the visible layer ($v_i$, $i = 1, \ldots, N_v$) and those of the hidden layer ($h_j$, $j = 1, \ldots, N_h$) are bidirectional and full. The basic structure of an RBM is shown in Figure 1. In an RBM, W represents the weights between connected neurons, and each neuron has a bias coefficient: b for the visible neurons and c for the hidden neurons.
The energy of the RBM can be represented by:

$$E(v, h) = -\sum_{i=1}^{N_v} b_i v_i - \sum_{j=1}^{N_h} c_j h_j - \sum_{i=1}^{N_v} \sum_{j=1}^{N_h} v_i W_{ij} h_j$$

The probability of activation of the hidden layer neuron $h_j$ is:

$$P(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_{i=1}^{N_v} W_{ij} v_i\Big)$$

Similarly, the neurons in the visible layer can be activated by the bidirectionally connected hidden neurons:

$$P(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_{j=1}^{N_h} W_{ij} h_j\Big)$$

where $\sigma$ is the activation function, e.g., the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Since the neurons of the same layer are not connected to each other in an RBM, they are conditionally independent:

$$P(h \mid v) = \prod_{j=1}^{N_h} P(h_j \mid v), \qquad P(v \mid h) = \prod_{i=1}^{N_v} P(v_i \mid h)$$

Based on the input data vector x, the probability of activation of each hidden layer neuron can be calculated. Similarly, based on the activation states of the hidden layer neurons, the activation states of the visible layer can be calculated. Through a contrastive divergence algorithm [28], the parameters (b, c, W) of the RBM can be fitted to the input data vector x iteratively using Gibbs sampling. An RBM can be seen as a feature detector, which is often used for dimension reduction of the data. Training an RBM amounts to finding the probability distribution that is most likely to produce the training samples.
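A single contrastive divergence update (CD-1, i.e., one Gibbs step) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the configuration used in the study: binary units, full-batch updates, and a hypothetical toy data set of two repeated binary patterns; the class name `RBM`, the learning rate and the layer sizes are likewise illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias

    def p_hidden(self, v):
        """P(h_j = 1 | v) for every hidden unit."""
        return sigmoid(self.c + v @ self.W)

    def p_visible(self, h):
        """P(v_i = 1 | h) for every visible unit."""
        return sigmoid(self.b + h @ self.W.T)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 parameter update on the batch v0; returns the reconstruction error."""
        ph0 = self.p_hidden(v0)
        h0 = (self.rng.random(ph0.shape) < ph0).astype(float)  # sample h ~ P(h | v0)
        pv1 = self.p_visible(h0)                                # reconstruct the visible layer
        ph1 = self.p_hidden(pv1)
        n = v0.shape[0]
        # positive phase (data) minus negative phase (reconstruction)
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b += lr * (v0 - pv1).mean(axis=0)
        self.c += lr * (ph0 - ph1).mean(axis=0)
        return float(np.mean((v0 - pv1) ** 2))

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, size=200)]   # hypothetical toy data
rbm = RBM(n_visible=6, n_hidden=2, rng=rng)
errors = [rbm.cd1_step(data) for _ in range(500)]
```

After training, `rbm.p_hidden(data)` gives a two-dimensional representation of the six-dimensional input, which is the sense in which an RBM acts as a feature detector for dimension reduction.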

The Structure of DBN
A DBN is a generative model that establishes a joint distribution between the label and the data sample. It considers not only P(label | observation) but also P(observation | label). In a DBN, several RBMs are connected: the hidden layer of one RBM is the visible layer of the next RBM, and the output of one RBM is the input of the next. During pre-training, the RBMs are trained one at a time, and the preceding RBM is trained before the current one. Usually, when the top RBM is trained, the label information is also included among the visible units.

The Fine-Tuning of DBN
Contrastive wake-sleep algorithms are usually used to fine-tune the pre-trained DBN. In the wake stage, the states of the nodes in each layer are generated from the external features and the recognition weights (upward), and the generative weights (downward) are modified using a gradient descent algorithm. In the sleep stage, the states of the bottom neurons are generated from the top-level representation (the states learned in the wake stage) and the generative weights updated in the wake stage, and the recognition weights of each layer are then modified.

Autoencoder
As shown in Figure 2, an autoencoder is built from three layers: an input layer, a hidden layer and an output layer. The symbols used in Figure 2 are explained below:
• n: the size of the input and output layers.
• m: the size of the hidden layer.
• $x \in \mathbb{R}^{n}$, $h \in \mathbb{R}^{m}$, $y \in \mathbb{R}^{n}$: the data vectors of the input, hidden and output layers, respectively.
• $b \in \mathbb{R}^{m}$, $c \in \mathbb{R}^{n}$: the bias vectors of the hidden and output layers, respectively.
• $W \in \mathbb{R}^{m \times n}$: the weight matrix between the input and hidden layers.
• $W' \in \mathbb{R}^{n \times m}$: the weight matrix between the hidden and output layers.

From the input layer to the hidden layer, the input signal is encoded by:

$$h = f(x) = S_f(Wx + b) \qquad (10)$$

From the hidden layer to the output layer, the output of the hidden layer is decoded by:

$$y = g(h) = S_g(W'h + c) \qquad (11)$$

In Equations (10) and (11), f() and g() stand for the encoding and decoding functions, respectively, and $S_f$ and $S_g$ are the corresponding activation functions of the encoder and decoder. The sigmoid function can be chosen as the activation function, and $W^{T}$ can be taken as the weights $W'$ of the decoder.
Given input vectors, the autoencoder aims to minimize the difference between an input x and the output y. The reconstruction error can be described by the cross-entropy function:

$$L(x, y) = -\sum_{i=1}^{n} \big[ x_i \log y_i + (1 - x_i) \log (1 - y_i) \big]$$

For the training set S, the average reconstruction error can hence be established as:

$$L(\theta) = \frac{1}{|S|} \sum_{x \in S} L\big(x, g(f(x))\big)$$

By minimizing $L(\theta)$, the parameters $\theta = \{W, b, c\}$ of the autoencoder can be fitted. The learning of an autoencoder does not need label information, so it is an unsupervised procedure. The output of the hidden layer h can be seen as a representation of the input x.
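The encoding, decoding and reconstruction-error steps can be sketched as below. The sketch assumes sigmoid activations and tied weights (the decoder uses $W^{T}$), as suggested in the text; the random inputs, the initial weights and the function names are hypothetical, and the sizes n = 10 and m = 6 simply mirror the ten input features and six reduced features mentioned in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """h = f(x) = S_f(W x + b); x has shape (batch, n) and W has shape (m, n)."""
    return sigmoid(x @ W.T + b)

def decode(h, W, c):
    """y = g(h) = S_g(W^T h + c), with the decoder weights tied to W^T."""
    return sigmoid(h @ W + c)

def reconstruction_error(x, y, eps=1e-12):
    """Average cross-entropy between inputs x and reconstructions y, both in [0, 1]."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    per_sample = -np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y), axis=1)
    return float(per_sample.mean())

rng = np.random.default_rng(0)
n, m = 10, 6
x = rng.random((5, n))                    # hypothetical features scaled to [0, 1]
W = 0.1 * rng.standard_normal((m, n))     # untrained weights, for illustration only
b, c = np.zeros(m), np.zeros(n)
h = encode(x, W, b)
y = decode(h, W, c)
loss = reconstruction_error(x, y)
```

Because the cross-entropy is smallest when y equals x, `reconstruction_error(x, x)` gives a lower bound for `loss`; training moves the parameters so that the loss approaches this bound.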

The Stacking of Autoencoders
In an SAE, autoencoders are stacked so that the output h(k) of the hidden layer of one autoencoder serves as the input of the next autoencoder. Each layer is trained with a greedy unsupervised layer-wise training strategy, and the upper layers represent the relevant high-level abstractions (Figure 3). Stacked autoencoders can establish a deep neural network more efficiently by initializing its weights in a region near a good local minimum.
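The greedy layer-wise strategy can be sketched as follows: each autoencoder is trained on the hidden output of the one before it, and only the encoder parameters are kept. This is a simplified illustration, not the authors' implementation. It assumes tied weights, sigmoid activations and plain batch gradient descent; the layer sizes [8, 6], the learning rate, the number of epochs and the random data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(x, m, epochs=200, lr=0.1, seed=0):
    """Fit one tied-weight autoencoder by gradient descent on the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n = x.shape[1]
    W = 0.1 * rng.standard_normal((m, n))
    b, c = np.zeros(m), np.zeros(n)
    for _ in range(epochs):
        h = sigmoid(x @ W.T + b)            # encode
        y = sigmoid(h @ W + c)              # decode with W^T
        dy = (y - x) / len(x)               # loss gradient at the decoder pre-activation
        dh = (dy @ W.T) * h * (1.0 - h)     # back-propagated to the encoder pre-activation
        W -= lr * (h.T @ dy + dh.T @ x)     # decoder and encoder contributions (tied weights)
        b -= lr * dh.sum(axis=0)
        c -= lr * dy.sum(axis=0)
    return W, b

def pretrain_stack(x, layer_sizes):
    """Greedy layer-wise pretraining: train one autoencoder per layer, bottom-up."""
    params, rep = [], x
    for m in layer_sizes:
        W, b = train_autoencoder(rep, m)
        params.append((W, b))
        rep = sigmoid(rep @ W.T + b)        # hidden output feeds the next autoencoder
    return params, rep

rng = np.random.default_rng(1)
x = rng.random((200, 10))                   # hypothetical 10-D features in [0, 1]
params, top = pretrain_stack(x, [8, 6])
```

The returned `params` would then initialize the corresponding layers of a deep network before supervised fine-tuning, and `top` is the reduced representation passed to the classifier.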

Fine-Tuning of the SAE
Normally, the last layer of the SAE is connected to a classifier, which can be a neural network, a Softmax classifier, an SVM, etc. Finally, a fine-tuning process is carried out on either the whole network or only the classifier, taking advantage of the label information through supervised classification with the back-propagation algorithm.

The Experiment Data
In this study, RADARSAT-2 quad-pol SAR data acquired during the 2011 Norwegian oil-on-water experiment (59°59′ N, 2°27′ E) were used for analysis. The data were received at 17:27 UTC on 8 June 2011 in fine-quad polarimetric mode, with a spatial resolution of 4.7 × 4.8 m in the range and azimuth directions. The incidence angle of the image is 34.5-36.1° and the local wind speed was 1.6-3.3 m/s. For the convenience of processing and display, a data sample of 2000 × 2000 pixels was picked from the single look complex (SLC) data. The pseudo RGB image of the RADARSAT-2 data on the Pauli basis is provided in Figure 4. In the scene, three verified slicks were present; from left to right, they were: biogenic film, emulsions and mineral oil [33]. The biogenic film was simulated by Radiagreen plant oil. The emulsions were made of Oseberg blend crude oil mixed with 5% IFO380 (Intermediate Fuel Oil), released 5 h before the radar acquisition. Additionally, the Balder crude oil was released 9 h before the radar acquisition [34].
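A pseudo RGB composite on the Pauli basis can be produced from the SLC channels roughly as follows. The sketch uses the common color assignment (red for $|S_{hh} - S_{vv}|$, green for the cross-polarized term, blue for $|S_{hh} + S_{vv}|$) and assumes reciprocity; the percentile stretch is a display choice of this sketch, and the function name `pauli_rgb` and the random input are hypothetical, so the result is not necessarily identical to Figure 4.

```python
import numpy as np

def pauli_rgb(s_hh, s_hv, s_vv):
    """Pseudo RGB image on the Pauli basis, values in [0, 1], shape (rows, cols, 3)."""
    red = np.abs(s_hh - s_vv) / np.sqrt(2.0)    # even-bounce component
    green = np.sqrt(2.0) * np.abs(s_hv)         # cross-polarized component (S_hv == S_vh assumed)
    blue = np.abs(s_hh + s_vv) / np.sqrt(2.0)   # odd-bounce (surface) component
    rgb = np.stack([red, green, blue], axis=-1)
    # Stretch each channel by its 98th percentile so a few bright pixels do not dominate
    scale = np.percentile(rgb, 98, axis=(0, 1))
    return np.clip(rgb / np.maximum(scale, 1e-30), 0.0, 1.0)

# Synthetic complex channels standing in for a subset of the SLC data
rng = np.random.default_rng(0)
shape = (16, 16)
s_hh, s_hv, s_vv = (rng.normal(size=shape) + 1j * rng.normal(size=shape) for _ in range(3))
rgb = pauli_rgb(s_hh, s_hv, s_vv)
```

Over the open sea the surface (blue) component usually dominates, so slick-covered areas with damped Bragg scattering appear as darker patches in such a composite.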
Appl. Sci. 2017, 7, 968

The Experiment Procedure
The SLC quad-polarimetric SAR data were first multilooked, and then the covariance matrix and coherency matrix of the data samples were generated. As mentioned before, 10 features were extracted and saved as a 10-dimensional vector for each pixel.
As shown in Figure 5, 24,000 data samples were picked from the image, including 12,000 verified positive (mineral oil) and 12,000 negative (clean sea surface and biogenic slick) samples. The data samples were picked using square boxes of 20 × 20 pixels, for convenience and to keep the samples pure, and their order was then shuffled.
In order to test the performance of the different algorithms and avoid over-fitting, six-fold cross-validation was applied. The data set was first divided equally into six subsets. Five-sixths of the data samples were then used as the training set and the remaining one-sixth as the testing set. The classification was repeated with a different one-sixth of the samples used as the testing set each time, so that after six runs every sample in the data set had been predicted exactly once. Finally, the cross-validation accuracy is the overall percentage of samples that are correctly classified.
In order to test the performance of these algorithms on smaller sample sizes, the whole data set was divided into smaller groups. The 24,000 data samples were randomly divided into 5 groups (4800 samples each: 4000 for training and 800 for testing) and into 25 groups (960 samples each: 800 for training and 160 for testing), and classification was conducted on each group. The classification accuracy for each smaller sample size was obtained by averaging the classification results over the corresponding groups.
In the experiment, the two deep learning algorithms introduced previously (i.e., DBN and SAE) were tested on oil spill detection and classification. In addition, two traditional supervised classifiers, a neural network (NN) and an SVM, were used for comparison.

The key parameters of the applied classifiers are shown in Tables 2-5. LIBSVM, a library for support vector machines [33], was used to implement the SVM algorithm. The parameters C and γ were derived using a shrinking heuristics search technique. The neural network has ten input neurons, two hidden layers and two output neurons. The first layers of the deep learning algorithms SAE and DBN were initialized by unsupervised pretraining, and their outputs were then connected to a neural network with two output neurons.
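The six-fold cross-validation described in this section can be sketched as follows. The splitting logic follows the text; everything else is an assumption made for illustration: the nearest-class-mean classifier is only a placeholder for the SAE, DBN, NN and SVM classifiers, the two-class Gaussian data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k=6, seed=0):
    """Shuffle the sample indices and split them into k (nearly) equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(x, y, fit_predict, k=6):
    """Overall fraction of samples classified correctly when each fold is tested once."""
    folds = k_fold_indices(len(y), k)
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(x[train_idx], y[train_idx], x[test_idx])
        correct += int(np.sum(pred == y[test_idx]))
    return correct / len(y)

def nearest_mean(x_train, y_train, x_test):
    """Placeholder classifier (not from the paper): assign each sample to the nearest class mean."""
    classes = np.unique(y_train)
    means = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(x_test[:, None, :] - means[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
# Synthetic 10-D, two-class data standing in for the labeled pixels
x = np.vstack([rng.normal(0.0, 1.0, (600, 10)), rng.normal(1.5, 1.0, (600, 10))])
y = np.repeat([0, 1], 600)
accuracy = cross_validate(x, y, nearest_mean)
```

With 24,000 samples, `k_fold_indices(24000)` yields six folds of 4000 samples, which matches the 20,000 training and 4000 testing samples used for the largest data set.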

Results and Discussion
To examine the feature dimension reduction capability of the deep neural networks, scatter plots of the main original features and of the features derived by principal component analysis (PCA), DBN and SAE are shown in Figure 6. The conformity coefficient and the degree of polarization (DoP) of the HH and VV transmit/receive combinations, two of the most effective features in oil spill classification [31], are plotted in Figure 6a. Scatter plots of the first two components derived by PCA are shown in Figure 6b. In this paper, DBN and SAE are used to reduce the dimension of the polarimetric features to six, and the reduced features are then fed into a fully connected neural network. To show these features in a scatter plot, PCA was applied to the six features, and the first two components are shown in Figure 6c,d. It can be observed that the deep neural network algorithms effectively extracted the information from the high-dimensional features and improved the separability of mineral oil and non-mineral-oil samples.
The classification results are summarized in Table 6. Some key findings are discussed below:
• SAE achieved the highest classification accuracy (lowest testing error) of all the algorithms at every sample size, and DBN performed almost as well. The SAE and DBN applied in the experiment had similar structures and both took advantage of greedy unsupervised layer-wise pretraining, so very similar performances were achieved. The unsupervised pretraining worked as a feature optimizer that reveals latent relationships and reduces noise in the features, which helps to improve the performance of the subsequent supervised classification.
• On the small training data sets, the deep learning algorithms performed much better than the neural network. When the number of training samples is reduced, the parameters of a traditional NN cannot be sufficiently tuned. Thanks to unsupervised pretraining, deep learning algorithms such as SAE and DBN are much more capable of reaching an optimized solution of the learning problem.
• When the sample size is reduced, the classification error increases (i.e., the accuracy is reduced), because the characteristics of the studied object cannot be sufficiently expressed by a limited number of data samples.
• On the large training data set, the NN performed almost as well as the deep learning algorithms. With a large number of training samples, the parameters of the NN can be sufficiently adjusted. In this experiment, the NN has only a few hidden layers, so the gradient of the objective function can propagate effectively to the front layers. As a result, comparable classification performance was achieved by the NN on the large data set.
• SVM performed better than the NN on small sample sizes. SVM is based on structural risk minimization, which performs well on relatively small data sets. It maximizes the classification margin, which is determined by a few support vectors, and can thereby avoid the "curse of dimensionality". However, despite these advantages, the SVM is equivalent to a NN with one hidden layer, so in learning complicated relationships its performance is no better than that of the other three, more complex classifiers applied in the experiment.
The confusion matrices of the cross-validation testing results are shown in Tables 7-10. The best classification results were achieved by SAE on the largest data set (20,000 training and 4000 testing samples). Of the 24,000 samples tested, 251 pixels were misclassified: 101 were false positives (commission errors) and 150 were false negatives (omission errors). DBN produced a similar number of false positives, with a slightly higher false negative rate. The confusion matrices of NN and SVM show that, compared with the deep learning algorithms, they achieved lower false negative and higher false positive rates. In the binary output of SAE (Figure 7), a few pixels in the area covered by the biogenic slick are classified as mineral oil. A possible reason for these "misclassifications" is the effect of signal noise in the space-borne SAR data, or the uniform distribution of the mineral oil and biogenic slicks. Such misclassifications can be largely eliminated by a simple postprocessing step: morphological erosion and dilation (the "corrosion and swelling" operations) can be applied to the binary classification result to fill the small holes (missed alarms) in large oil-covered areas and to remove isolated positive targets (false alarms) on the sea surface.
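The corrosion (erosion) and swelling (dilation) postprocessing mentioned above can be sketched as follows. This is only an illustration under stated assumptions: a 3 × 3 square structuring element, a closing followed by an opening, neighbourhoods clipped at the image border, and a made-up toy mask. The paper does not specify these choices, and the function names are hypothetical.

```python
def _filter3x3(mask, reduce_fn):
    """Apply reduce_fn (any or all) over each pixel's 3x3 neighbourhood, clipped at the border."""
    rows, cols = len(mask), len(mask[0])
    return [
        [
            int(reduce_fn(
                mask[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def dilate(mask):
    """A pixel becomes 1 if any pixel in its neighbourhood is 1."""
    return _filter3x3(mask, any)


def erode(mask):
    """A pixel stays 1 only if every pixel in its neighbourhood is 1."""
    return _filter3x3(mask, all)


def clean_mask(mask):
    """Closing (fills small holes), then opening (removes isolated positives)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))


# Toy binary result: 1 = mineral oil, 0 = non-oil.
mask = [[0] * 13 for _ in range(9)]
for r in range(2, 7):
    for c in range(2, 7):
        mask[r][c] = 1
mask[4][4] = 0    # small hole (missed alarm) inside the oil-covered area
mask[4][10] = 1   # isolated positive (false alarm) on the sea surface

cleaned = clean_mask(mask)
print(cleaned[4][4], cleaned[4][10])  # 1 0
```

The size of the structuring element sets the largest hole or speck that is corrected, so it would need to be chosen against the smallest slick that should still be detected.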
The receiver operating characteristic (ROC) curves of these classifiers in oil spill detection were computed for 20,000 training and 4000 testing samples (Figure 8a,b) and for 4000 training and 800 testing samples (Figure 8c,d). All the ROC curves lie very close to the upper-left corner of the ROC plot (Figure 8a,c). In the zoomed-in plots (Figure 8b,d), some minor differences can be observed. Compared with the other classifiers, SVM achieved a lower true positive rate under low false positive rate requirements and a higher true positive rate under high false positive rate requirements. For NN, the situation is just the opposite. The deep networks, SAE and DBN, achieved a moderate true positive rate over the whole false positive rate range.
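How such ROC curves are obtained from classifier outputs can be shown with a short sketch. The scores and labels below are made-up toy values, not results from the experiment, and the helper names (`roc_points`, `auc`, `tpr_at_fpr`) are hypothetical. The last helper reads off the true positive rate available under a given false positive rate requirement, which is the comparison made in the paragraph above.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), sweeping the decision threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def tpr_at_fpr(points, max_fpr):
    """Highest true positive rate reachable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


# Toy classifier outputs: label 1 = mineral oil, 0 = non-oil.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
print(auc(points))               # 0.875
print(tpr_at_fpr(points, 0.0))   # 0.5
print(tpr_at_fpr(points, 0.25))  # 1.0
```

Two classifiers with nearly identical areas under the curve can still differ in `tpr_at_fpr` at a strict false positive limit, which is why the zoomed-in plots are informative.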

Conclusions
In this paper, the capability of polarimetric SAR to detect and classify marine oil spills was investigated. Potential features were extracted from the covariance matrix, the coherence matrix and the Stokes vector of the original SLC quad-pol SAR data. Deep learning algorithms and classic classifiers were compared and analyzed. A key finding of this paper is that, given an insufficient number of data samples, deep learning algorithms such as SAE and DBN can achieve better performance than traditional algorithms by initializing their parameters from a position closer to the optimum solution. Polarimetric SAR data showed a strong capacity for distinguishing mineral oil from its biogenic lookalikes. This can be achieved in a one-step operation, with no need to first segment and then classify data samples based on auxiliary information. The advantages demonstrated by polarimetric SAR can greatly boost the efficiency and accuracy of marine oil spill detection. Further studies will be conducted on features extracted from compact polarimetric SAR modes, which offer wider swaths, to achieve larger monitoring areas and shorter revisit times: two of the prime requirements for marine surveillance over large areas.

Figure 1. The illustration of a Restricted Boltzmann Machine (RBM) with two hidden units and four visible units.

Figure 2. The structure of an autoencoder.

Figure 4. Pauli RGB image of RADARSAT-2 data. (RADARSAT-2 Data and Products © MacDonald, Dettwiler and Associates Ltd., Vancouver, BC, Canada, 2011 - All Rights Reserved. RADARSAT is an official mark of the Canadian Space Agency.)

Figure 5. Demonstration of a selected area for analysis (taking the VV² image as background); 24,000 pixels are picked as data samples.

Figure 6. Scatter plots of the main original features (a) and of the features derived by principal component analysis (PCA), deep belief network (DBN) and stacked autoencoder (SAE) (b-d).

Figure 7. Classification result achieved by SAE; 0 stands for non-oil and 1 stands for mineral oil.

Figure 8. Receiver operating characteristic (ROC) curves of the classifiers. (a) ROC curves based on 20,000 training and 4000 testing samples; (b) zoomed-in view of (a); (c) ROC curves based on 4000 training and 800 testing samples; (d) zoomed-in view of (c).

Table 1. Features investigated in this study.

Table 2. Parameter settings of the neural network.

Table 3. Parameter settings of the support vector machine (SVM).

Table 5. Parameter settings of the deep belief network (DBN).

Table 6. Testing error (1 - accuracy) of classification achieved by different classifiers and on different sizes of data samples.