Exploratory Data Analysis of Synthetic Aperture Radar (SAR) Measurements to Distinguish the Sea Surface Expressions of Naturally-Occurring Oil Seeps from Human-Related Oil Spills in Campeche Bay (Gulf of Mexico)

An Exploratory Data Analysis (EDA) aims to use Synthetic Aperture Radar (SAR) measurements for discriminating between two oil slick types observed on the sea surface: naturally-occurring oil seeps versus human-related oil spills—the use of satellite sensors for this task is poorly documented in scientific literature. A long-term RADARSAT dataset (2008–2012) is exploited to investigate oil slicks in Campeche Bay (Gulf of Mexico). Simple Classification Algorithms to distinguish the oil slick type are designed based on standard multivariate data analysis techniques. Various attributes of geometry, shape, and dimension that describe the oil slick Size Information are combined with SAR-derived backscatter coefficients—sigma-(σo), beta-(βo), and gamma-(γo) naught. The combination of several of these characteristics is capable of distinguishing the oil slick type with ~70% of overall accuracy, however, the sole and simple use of two specific oil slick’s Size Information (i.e., area and perimeter) is equally capable of distinguishing seeps from spills. The data mining exercise of our EDA promotes a novel idea bridging petroleum pollution and remote sensing research, thus paving the way to further investigate the satellite synoptic view to express geophysical differences between seeped and spilled oil observed on the sea surface for systematic use.


Introduction
Oil floating on the surface of the ocean is commonly detected using instruments flying onboard airplanes or satellites [1].Currently, the most useful systems to detect oil-contaminated areas at sea are active radars, i.e., Synthetic Aperture Radars (SAR) operating in the microwave region of the electromagnetic spectrum [2].As some environmental phenomena can generate signatures similar to oil in SAR imagery, this technology may yield false positives, in which the non-unique signature of oil caused by false targets can induce to ambiguous interpretations [3].The so-called "look-alike features (or radar look-alikes)" range from atmospheric phenomena (e.g., rain cells) and oceanographic features (e.g., upwelling zones), amongst others [4].Many scholars have described the use of SAR measurements to distinguish oil from look-alike features, and indeed, a broad scientific basis is devoted to the detection and identification of oil in SAR imagery, e.g., [5].
A supplementary application for satellite measurements is the recognition of the oil slick type.Herein, for the sake of brevity and simplicity, the term oil slick indicates the sea surface expression of oil naturally seeped out of the seafloor (i.e., oil seep) or spilled after man-made activities (i.e., oil spill).The latter accounts for operational, accidental, or illegal spills from platforms or ships-i.e., discharge of water containing petroleum products.Information about the oil slick type is an improvement requirement for several purposes, for example, environmental monitoring, emergency response, etc. [6].Reliable surveillance based on satellite remote sensors capable of distinguishing the oil slick type can add accuracy to oil spill forecast systems, as well as enhance tridimensional mathematical models that backtrack (i.e., hindcast) oil seeps observed on the sea surface to its seafloor origin [7].
Environmental and economical issues associated with oil slicks are a constant concern to the oil and gas exploration and production industry.As such, the satellite synoptic view is an attractive option to distinguish the oil slick type that can lead to considerable scientific advances.From the standpoint of environmental monitoring programs, it can reduce ambiguities about the source of the observed oil (i.e., seeped or spilled).This can develop the relationship between governmental agencies and oil and gas exploration and production industry, thus reducing political uproar.From an economic standpoint, it can lead to offshore discoveries bringing invaluable information to explore active petroleum systems.Therefore, if the distinguishing of seeps and spills becomes possible, satellite information can directly assist in the search to find new oil fields in offshore exploration frontiers, a constant goal of a crucial sector for the world's economic development.
Oil seeps and oil spills are both products of mineral oil floating on the sea surface, therefore, it is expected that their surface signatures are fairly similar in satellite imagery.Peer-reviewed and grey literature present limited information available about oil slick type differentiation using satellite sensors; to this matter, the reader is encouraged to see [8] and the vast list of references therein.As a result, the objective of the present research focuses on performing an Exploratory Data Analysis (EDA) to evaluate the use of SAR measurements to distinguish the sea surface expressions of oil seeps from oil spills to a useful level of confidence for systematic use.
In the course of achieving this objective, three scientific questions are sought about the sea surface expression of oil slicks observed in the Campeche Bay region (Gulf of Mexico-Figure 1), as determined by digital classification of satellite imagery: 1.
Does seeped oil floating on the ocean surface have a SAR backscatter signature distinctive enough to distinguish it from anthropogenically-spilled oil? 2.
Can the oil slick Size Information be used to distinguish seeps from spills? 3.
Which characteristics lead to the generation of a system capable of distinguishing between seeped and spilled oil?
To address these matters, a long-term RADARSAT-2 dataset (2008-2012) is exploited to perform a data mining of selected oil slicks' characteristics.Therefore, an additional goal is devised to design innovative qualitative-quantitative Classification Algorithms to distinguish man-made from natural oil slicks.

Dataset
An environmental monitoring observed the occurrence of oil slicks for more than a decade (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012) in Campeche Bay-this has been carried out by Pemex (Petróleos Mexicanos; Mexico City, Mexico) to support its decision-making processes, thus demonstrating the operational usefulness SAR measurements to execute effective oil slick mapping.Such monitoring was based on the interpretation of SAR scenes of both RADARSAT satellites analyzed by domain specialists with support of ancillary meteo-oceanographic data from Earth Observing System (EOS) sensors.All logged oil slicks underwent a post-processing validation, in which internal Pemex field reports were considered-i.e., the sea surface source recognized by the satellite image analyses were corroborated by additional field observations.The long-term RADARSAT dataset exploited by Pemex was made available for this investigation and it is herein referred to as Campeche Bay Oil Slick Satellite Database (i.e., CBOS-Data).It is good to point out that the CBOS-Data does not account for look-alike features and that oil seepage sites on the sea surface were identified with at least three different observations about the same location, thus indicating oil seep cases.For a thorough description of the Pemex's operational environmental monitoring strategy and its produced dataset (i.e., CBOS-Data), the reader is encouraged to see a D.Sc.Thesis [8], a peer-reviewed article [9], and a conference paper [10]-these three references also provide a description of the occurrence and spatio-temporal distribution of the observed oil seeps and oil spills in the region.
To guarantee the quality of circumstances to cope with the objectives of the present research, our EDA only considers oil slicks observed between 2008 and 2012 with RADARSAT-2 (16-bit and VV-polarized-i.e., both ScanSAR Narrow beam modes: SCNA and SCNB).This mostly aims to circumvent technical specification differences (e.g., radiometric depth and polarization) between the two single-polarized RADARSAT measurements, as all analyzed RADARSAT-1 scenes are 8-bit and HH-polarized.Even though RADARSAT-2 provides a range of polarimetric information (i.e., linear cross-, dual parallel cross-, and full quad-polarized), our EDA focuses on using single linear parallel polarimetric SAR measurements from Pemex's provided information-i.e., CBOS-Data.As such, the dataset exploited in our EDA consists of 277 RADARSAT-2 scenes and 4916 oil slicks, of which 2021 (41%) have been identified as oil seeps and 2895 (59%) as oil spills.Herein, this dataset is referred to as Campeche Bay Oil Slick Modified Database: CBOS-MOD.The entire content of the CBOS-MOD has been used to perform the proposed data mining exercises, as well as to design the innovative qualitative-quantitative Classification Algorithms, thus ensuring best statistical reliability of our EDA.

Dataset
An environmental monitoring observed the occurrence of oil slicks for more than a decade (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012) in Campeche Bay-this has been carried out by Pemex (Petróleos Mexicanos; Mexico City, Mexico) to support its decision-making processes, thus demonstrating the operational usefulness SAR measurements to execute effective oil slick mapping.Such monitoring was based on the interpretation of SAR scenes of both RADARSAT satellites analyzed by domain specialists with support of ancillary meteo-oceanographic data from Earth Observing System (EOS) sensors.All logged oil slicks underwent a post-processing validation, in which internal Pemex field reports were considered-i.e., the sea surface source recognized by the satellite image analyses were corroborated by additional field observations.
The long-term RADARSAT dataset exploited by Pemex was made available for this investigation and it is herein referred to as Campeche Bay Oil Slick Satellite Database (i.e., CBOS-Data).It is good to point out that the CBOS-Data does not account for look-alike features and that oil seepage sites on the sea surface were identified with at least three different observations about the same location, thus indicating oil seep cases.For a thorough description of the Pemex's operational environmental monitoring strategy and its produced dataset (i.e., CBOS-Data), the reader is encouraged to see a D.Sc.Thesis [8], a peer-reviewed article [9], and a conference paper [10]-these three references also provide a description of the occurrence and spatio-temporal distribution of the observed oil seeps and oil spills in the region.
To guarantee the quality of circumstances to cope with the objectives of the present research, our EDA only considers oil slicks observed between 2008 and 2012 with RADARSAT-2 (16-bit and VV-polarized-i.e., both ScanSAR Narrow beam modes: SCNA and SCNB).This mostly aims to circumvent technical specification differences (e.g., radiometric depth and polarization) between the two single-polarized RADARSAT measurements, as all analyzed RADARSAT-1 scenes are 8-bit and HH-polarized.Even though RADARSAT-2 provides a range of polarimetric information (i.e., linear cross-, dual parallel cross-, and full quad-polarized), our EDA focuses on using single linear parallel polarimetric SAR measurements from Pemex's provided information-i.e., CBOS-Data.As such, the dataset exploited in our EDA consists of 277 RADARSAT-2 scenes and 4916 oil slicks, of which 2021 (41%) have been identified as oil seeps and 2895 (59%) as oil spills.Herein, this dataset is referred to as Campeche Bay Oil Slick Modified Database: CBOS-MOD.The entire content of the CBOS-MOD has been used to perform the proposed data mining exercises, as well as to design the innovative qualitative-quantitative Classification Algorithms, thus ensuring best statistical reliability of our EDA.

Methods
The Methods' structure of our EDA is divided in two main Sections: Workable-Database Preparation (Section 3.1) and Multivariate Data Analysis (Section 3.2).Key concepts are emphasized providing enough detailed information to enable any knowledgeable scientist to replicate the results of our EDA with any SAR dataset containing identified oil slicks [8].

RADARSAT Re-Processing
The RADARSAT-2 images, provided in SGF-format (SAR Georeferenced Fine Resolution), are given in uncalibrated grey-level count, i.e., Digital Numbers (DNs).These are enough for qualitative usage, but they are not recommended to cross-compare time series of SAR images [11,12].Conversely, SAR backscatter coefficients are essential for quantitative analyses.The use of sigma-naught (σ o ), beta-naught (β o ), or gamma-naught (γ o ) permit the comparison of data acquired with the same sensor from different dates or beam modes, as well as the evaluation of data acquired with various sensors [13][14][15].The conversion from DN to σ o , β o , and γ o has been completed using the PCI Geomatica (version 2014, PCI Geomatics; Markham, ON, Canada).Various radiometric-calibrated image products are used in our EDA to describe the radar signal backscatter strength of individual oil slicks.Table 1 lists the twelve forms of σ o , β o , and γ o that are given with the amplitude of the received radar beam (C 1 ) and in dB units (C 2 ), both with and without the application of the Frost filter with 3 × 3 window [16].A thirteenth image product gives the incidence angle values per pixel: INC.ang.
]. DN: Digital Number.A: Range-dependent gain that varies per pixel in range direction.B: Constant offset nominally set to zero for SGF products.cc: 2 for the amplitude (or 1 for the intensity) of the received radar beam.

New Slick-Feature Attributes
Most new slick-feature attributes are bridged from elemental characteristics found in the literature; however, these have been suggested to distinguish oil slicks from look-alike features.Herein, they are explored to distinguish seeps from spills.A particular set of basic statistical measures is experimentally used to describe the oil slicks' characteristics.Four Attribute Types (Tables 2 and 3) are investigated:  3 rd ) Size Information: From the two basic morphological features originally present in the CBOS-Data, i.e., area (Area) and perimeter (Per), seven ratios are derived to describe the oil slicks' geometry, shape, and dimension (Table 2).The first is the area to perimeter ratio (AtoP).Fiscella et al. [17] suggest using the perimeter to area ratio (PtoA).Fiscella et al. [17] and Singha et al. [18] recommend using a dimensionless normalized perimeter to area ratio (PtoA.nor).While small PtoA.nor values are related to simple geometry, larger values come from oil slicks with more complex geometries [19].A dimensionless complexity measure is given by [20]: COMPLEX.ind.In addition, Bentz et al. [21] uses another dimensionless descriptor to illustrate how compact (i.e., close to a circle) is a sea-surface feature: COMPACT.ind.Two other indices have been utilized by [22]: SHAPE.ind and FRACTAL.ind.While these indices yield values close to the unit for regular forms (i.e., circular or square, respectively), larger numbers represent form irregularity [23].The total number of pixels inside each oil slick polygon (LEN) is also provided.3).

•
4 th ) SAR Backscatter Signature: All pixels within individual polygons are utilized to calculate the different basic statistical measures experimentally used to characterize the oil slick SAR backscatter signature (Table 3).These are separately calculated for the twelve radiometric-calibrated image products (Table 1).As suggested by [18,21], an arithmetic mean (AVG) of all pixels of each oil slick is computed.Three other central tendency representations are also considered: median (MED), mode (MOD), and mid-mean (MDM).To analyze the SAR backscatter signature spread, six dispersion measures are considered: standard deviation (STD), coefficient of dispersion (COD), variance (VAR), total range (RNG), average absolute deviation (AAD), and median absolute deviation (MAD).The coefficient of variation (COV) evaluates the relative relationship between dispersion and central tendency, thus comparing the degree of variation of data with different units and different meanings.COV originally involves the ratio between STD and AVG, but herein other pair-values are also given (Table 3).Different authors recommend exploring COV, e.g., [18,24] refer to it as power-to-mean ratio, whereas [20] describe it as the oil slick's homogeneity, and for [21] it depicts the oil slick's heterogeneity.However, Miranda [25] emphasizes that oil slicks with different spatial structures (i.e., with completely different pixel configuration) can have identical average or median values and the same standard deviation.Minimum (MIN) and maximum (MAX) values of the pixels inside each polygon are also used as supplementary quantities.
Information from the area surrounding oil slicks may play an essential role in the classification between oil slicks and look-alike features [17,20,26].Nonetheless, herein, no information from the background oil-free surface around oil slicks has been taken into consideration.This decision aims to evaluate as simple as possible range of variables solely accounting for information within the oil slicks polygons' limits.

Data Treatment Processes
The new slick-feature attributes (Tables 2 and 3) have both non-numeric values (i.e., qualitative variables) and numeric values (i.e., quantitative variables).The multivariate data analysis techniques applied herein (see Section 3.2.)requires variables to be quantitative descriptors.To guarantee that all slick-feature attributes have discrete or continuous numerical metric values with no categorical variables, four Data Treatment Processes are applied: 1.
Log 10 Transformation: Various non-linear transformations (e.g., square, cubic, or fourth root, as well as natural and base 10 logarithms) were tested to bring non-symmetric distributions to, or at least close to, a normal distribution pattern.Through the visual analyses of their histograms, Log 10 had best results to stabilize the data, thus decreasing the effect of outliers and reducing skewness.The exception is FRACTAL.indwith its negative to positive range that was transformed as the cubic root.

2.
Negative Values Scaling: Because log functions cannot be applied to negative values, all pixels inside the oil slicks were inspected before such transformation.This mostly relates to the SAR Backscatter Signature (4 th Attribute Type: Table 3), as the other Attribute Types all have positive values (except FRACTAL.ind).This specific Data Treatment Process consists of subtracting the minimum negative pixel value within the oil slick (PIXmin) from each pixel inside the oil slick (PIX) and adding one to it: PIXpos = [PIX − PIXmin + 1], where PIXpos is the new positive value.This simply adds two constants to each pixel, i.e., PIXmin and 1.The pixel with the minimum negative value becomes equal to 1, as PIX = PIXmin gives PIXpos = 1.Although this changes the values of the pixels inside some oil slicks, the relationship among pixel values within individual oil slicks is preserved [27,28].

3.
Ranging Standardization: While some quantitative attributes are dimensionless (e.g., dB), others have incompatible units.To compare the oil slick characteristics in the subsequent data mining exercise of our EDA, a common scale is recommended [29][30][31].Of the various available methods, there is no dearth of standardization; however, Milligan and Cooper [32] advocated that Ranging could be more effective than other methods (e.g., z-score).The Ranging standardization, besides bounding the magnitude of the attributes between zero (0) and one (1), also equalizes the variability of the attributes to a common scale: where Xranged is the new Ranged value, X is the actual attribute value for a particular oil slick, Xmin and Xmax are, respectively, the minimum and maximum values of this attribute among all oil slicks [27].For instance, if taking the SIG.amp.AVG value of one oil slick: subtract the minimum AVG for this radiometric-calibrated image product among all 4916 oil slicks, then the result is divided by the maximum AVG of all 4916 oil slicks minus the minimum AVG of all 4916 oil slicks.
The Xranged variables assume non-negative values, and only one oil slick has Xranged = 0 and another one Xranged = 1, respectively when X = Xmin and when X = Xmax.

4.
Dummy Variables: Certain qualitative attributes, such as string variables (Category and Bmode) and spatio-temporal variables (SARtime, SARdate, cLAT, and cLONG), have been converted to a specific indicator type: "dummy variable" [30].For convenience, these have values of one (1) or zero (0), thus referring to its presence (Yes) or absence (No), respectively.These types of binary-coded variables attempt to decompose complex variabilities (or hidden knowledge) into more useful information.

Attribute Selection Methods
The attributes' relevance is explored for three reasons: to eliminate redundant (or irrelevant) variables, to select more representative variables, and to reduce the variable-hyperspace dimension.A lower degree of dependence from one attribute to another reduces the messiness of the subsequent data mining exercise of our EDA (see Section 3.2.).Two Attribute Selection Methods (i.e., R-mode), having variable selection strategies different in essence, are explored mostly based on their simplicity:

•
Unweighted Pair Group Method with Arithmetic Mean (UPGMA): The resemblance between variables is revealed on a semi-automatic manner, in which an equally weighted pair wise group relationship among variables is used to form groups [27].A variable is attributed to a group that has a larger similarity measure, i.e., Pearson's r correlation coefficient, with all other variables of that same group [30,31].A free scientific data analysis software package (PAST: PAleontological STatistics version 2.17c; Oslo, Norway) was utilized [33].Rooted-tree dendrograms (i.e., diagrams of relationships) with a bootstrapping of 100 replicates help to find hierarchical relationships (i.e., affinity) among variables.Due to the large subjectivity while using dendrograms to form groups, two user-defined thresholds are explored [34].Their choice aims to guarantee repeatability, as different thresholds give different groups of variables.The first threshold subjectively uses the Cophenetic Correlation Coefficient (CCC) that is automatically calculated in PAST to specify the clusters; its value varies depending upon dataset [35,36].The second threshold is simply the fixed similarity value of 0.5.These two thresholds are used to draw horizontal lines (i.e., phenon lines) across the dendrograms, and when such lines cross a branch, a group is formed; from each group of similar variables only one variable is selected [37,38].The Attribute Selection occurs twice for each UPGMA implementation, one for each phenon line, i.e., CCC and 0.5.This variable selection gives preference to an arbitrary user-defined strategy based on the simplicity of their meanings and on their calculation form, in which simple variables are preferred over complicated ones-e.g., basic variables in lieu of ratios (e.g., Area versus AtoP), C 1 (amplitude) compared with C 2 (dB), SAR backscatter signature without rather than with Frost filter, and central tendency is preferable to dispersion metrics.The variables choice follows the order presented on Tables 1-3.

• Correlation-Based Feature Selection (CFS):
The CFS Attribute Selection is a fully automated process based on a heuristic variable selection strategy [39,40].The "Merit" of various groups of selected variables is evaluated to select attributes with low inter-correlation but highly correlated to the categories being distinguished [41,42].In essence, a Merit is calculated as a measure of the usefulness of the selected best possible group of attributes.A scientific machine learning open source package (WEKA: Waikato Environment for Knowledge Analysis version 3.6.12;Waikato, New Zealand) was utilized.The user specifies an evaluation function and a search algorithm [43][44][45].The former is the method by which the groups of attributes are evaluated (Backward Sequential Selection: CfsSubsetEval), and the latter improves the evaluation function (best-first searching strategy: BestFirst).
A summary of the analyzed 11 original sets, as well as the 33 optimal subsets of selected variables, i.e., UPGMA (CCC and 0.5) and CFS, is presented in Figure 2.These are collectively referred to as 44 data sub-divisions.It is expected that these particularly fruitful attribute-wise evaluations reveal hidden complexities into information capable of distinguishing the oil slick type.
Although literature suggests not using images given in DNs to compare SAR images [11,12,46], herein, an assessment explores DNs to quantify the consequences of using DNs.Only Frost filter and Antenna Pattern Compensation (APC) were applied.DN signatures are expressed by the same set of basic statistical measures calculated experimentally used to describe SAR backscatter signature (4 th Attribute Type: Table 3), with the same Data Treatment Processes.5-1) and 33 optimal subsets of selected variables (shown in both sides).These have undergone two Attribute Selection Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and CFS (Correlation-Based Feature Selection).The UPGMA implementation employs two user-defined thresholds to form groups on the analysis of dendrograms: CCC (Cophenetic Correlation Coefficient) and 0.5 (fixed similarity value).The CFS automatically provides the selected variables.

Principal Components Analysis (PCA)
PCA reduces the variable-hyperspace dimension, helping on the interpretation of large multivariate datasets.It also assures the fulfillment of the Discriminant Analysis pre-requisite by avoiding multi-collinearity among the explored hypothetical variables, i.e., Principal Components (PCs).PAST (version 3.06; Oslo, Norway) was utilized to run the PCAs, which were applied to the 44 data sub-divisions.
Even though some guidelines are available to decide the number of PCs to account for, there is no "rule of thumb" to determine the best "stopping rule" [47,48].Two cut-offs were explored herein [49].The first one is simply referred to as "Scree Plot", as it analyzes the scree plot, but the PC-selection occurs based on the random model curve of expected eigenvalues, i.e., "broken stick".The broken stick is also plotted on the scree plot and PCs falling below this curve, i.e., to its right side, are not considered [50,51].Furthermore, if the broken stick curve crosses the bootstrapping eigenvalue error bars, i.e., is inside the 95% confidence interval, this PC is also not considered [30,52].The second is a simpler and direct method based on the so-called Kaiser-Guttman criterion, thus referred to as "Kaiser Criterion", which discards any PC after a specific lower bound: eigenvalue of 1 [49,53].Only one of these two cut-offs was used to select meaningful PCs.This occurred based on the analysis of the global picture of the percentages of variance concentrated on the PCs (see [8]).The score values of the selected meaningful PC were subjected to the subsequent data mining exercise of our EDA.
Loadings expressing the relationship between variables (rows) and PCs (columns) verify the importance (i.e., meaning) of the original variables on each selected PC, thus determining if there were variables with higher weight (i.e., loadings) influencing each PC.Bootstrapping error bars  5-1) and 33 optimal subsets of selected variables (shown in both sides).These have undergone two Attribute Selection Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and CFS (Correlation-Based Feature Selection).The UPGMA implementation employs two user-defined thresholds to form groups on the analysis of dendrograms: CCC (Cophenetic Correlation Coefficient) and 0.5 (fixed similarity value).The CFS automatically provides the selected variables.

Principal Components Analysis (PCA)
PCA reduces the variable-hyperspace dimension, helping on the interpretation of large multivariate datasets.It also assures the fulfillment of the Discriminant Analysis pre-requisite by avoiding multi-collinearity among the explored hypothetical variables, i.e., Principal Components (PCs).PAST (version 3.06; Oslo, Norway) was utilized to run the PCAs, which were applied to the 44 data sub-divisions.
Even though some guidelines are available to decide the number of PCs to account for, there is no "rule of thumb" to determine the best "stopping rule" [47,48].Two cut-offs were explored herein [49].The first one is simply referred to as "Scree Plot", as it analyzes the scree plot, but the PC-selection occurs based on the random model curve of expected eigenvalues, i.e., "broken stick".The broken stick is also plotted on the scree plot and PCs falling below this curve, i.e., to its right side, are not considered [50,51].Furthermore, if the broken stick curve crosses the bootstrapping eigenvalue error bars, i.e., is inside the 95% confidence interval, this PC is also not considered [30,52].The second is a simpler and direct method based on the so-called Kaiser-Guttman criterion, thus referred to as "Kaiser Criterion", which discards any PC after a specific lower bound: eigenvalue of 1 [49,53].Only one of these two cut-offs was used to select meaningful PCs.This occurred based on the analysis of the global picture of the percentages of variance concentrated on the PCs (see [8]).The score values of the selected meaningful PC were subjected to the subsequent data mining exercise of our EDA.
Loadings expressing the relationship between variables (rows) and PCs (columns) verify the importance (i.e., meaning) of the original variables on each selected PC, thus determining if there were variables with higher weight (i.e., loadings) influencing each PC.Bootstrapping error bars were plotted per variable, and those having bars not crossing the abscissa are deemed to influence the PC [52].Peres-Neto et al. [47] suggest 1000 row-wise bootstrapping replicates for the eigenvalues 95% confidence interval; however, as some PCAs took more than 10 h to run, only 500 replicates were used.

Correlation Matrix
Correlation matrices have been used to verify residual inter-variable correlation that might still be present after the Attribute Selection and PCA.These matrices provide the parametric Pearson's r correlation coefficient, p (uncorrelated), to determine the relationship among all original values of the CBOS-MOD variables (no PCA applied), as well as among the scores of the selected PCs (with PCA applied).

Discriminant Function
Discriminant Analyses are linear combinations among the selected characteristics that seek the highest discriminating power to minimize the probability of miss-discrimination [31,49].Discriminant Functions are mathematical features that arrange objects among reported groups [33], in which the dependent variable is the oil slick type (i.e., seep or spill).Each analyzed sample is a member of only one oil slick type, i.e., simple binary discrimination.In PAST (version 2.17c; Oslo, Norway), Discriminant Functions were tested using the 44 data sub-divisions as input.
The accuracy of the Discriminant Functions is assessed via Confusion Matrices, i.e., simple 2-by-2-tables, in which one can quantify misclassification chances by using a cross validation estimate [27,54].Table 4 illustrates adapted Confusion Matrices following a wide-open character design proposed by [55,56].Even though the Confusion Matrix is a well-known tool for analyzing classification performance, a two-fold reason is given to introduce basic details about its usage.Firstly, usually the nomenclature of these metrics may vary, e.g., [57][58][59], etc.Secondly, usually only the Overall Accuracy is reported, but [60] demonstrated that this may lead to a quality gap on the information provided that can mislead the user [61,62].To this matter, various metrics should come into play to report the effectiveness of the Discriminant Functions.As a starting point, one should answer a first set of four questions focusing on the lines of Table 4, frequently referred to as Producer's Accuracy (PA) and Commission Error (CE).A second set of four questions is given to interpret the columns of Table 4, converging to what is usually called as User's Accuracy (UA) and Omission Error (OE).

Oil Slick Classification Algorithm
The Classification Algorithm design implies not only the understanding of the Overall Accuracy (i.e., Confusion Matrix diagonal), but it also requires a full line-and column-wise inspection of Table 4, i.e., Sensitivity-Specificity balance and a trade-off between the Positive and Negative Predictive Values.The former is obtained by looking into lines (PA/CE), whereas the latter is achieved exploring columns (UA/OE).Steadiness is sought among line and column information.This is because lines give the reference frame of what was previously known (i.e., spills and seeps): how many of the known oil slick samples are correctly (or incorrectly) identified?Conversely, the column reference frame indicates how many oil slick samples identified by the algorithm are correctly (or incorrectly) identified?While the first question (about lines: PA/CE) measures the success of identifying known samples, the second question (about columns: UA/OE) provides a metric of the success of the algorithm to identify oil slick samples.

Results
The Results presented herein mirror the organization structure of the Methods.

RADARSAT Re-Processing
Each processed RADARSAT-2 scene is about 27 GB, as it accounts for several image products (Table 1), and together, the 277 scenes occupy almost 7.5 TB of disk space.

New Slick-Feature Attributes
Typical values (i.e., represented by a set of basic qualitative-quantitative statistics: minimum, maximum, average, and standard deviation) of the slick-feature attributes are presented in [8], along with histograms with their frequency distributions.

Data Treatment Processes
The skewness of most data transformed data attributes is reduced after the application of the Negative Values Scaling and the Log 10 Transformation.The exception is the FRACTAL.ind,which has a cubic root transformation applied to it, and cLAT and cLONG, which did not change much in comparison to the frequency distributions prior to the Data Treatment Processes.
The application of these processes has five main consequences: (1) The Negative Values Scaling gives all pixels inside individual oil slicks that had negative pixel values a positive value; (2) The MIN attribute of 9 radiometric-calibrated image products lost its meaning and were removed, as PIXpos of several oil slicks became 1, as PIX = PIXmin gives PIXpos = 1; (3) The frequency of distribution of all variables has been brought to a Gaussian distribution with the Log 10 Transformation; (4) All qualitative variables are coded with values lying between 0 and 1 after the Ranging Standardization; and (5) The 7 qualitative attributes have been replaced by 33 new dummy variables that are binary-coded to 1 or 0.

Attribute Selection Methods
Table 5-1 presents the CCC (UPGMA) and Merit (CFS) values of the 11 original sets, whereas Table 5-2 reviews the number of selected variables of the 33 optimal subsets.Even though CCC values are quite similar, there is a small variation on the separate investigations of σ o , β o , and γ o (with and without Size Information) that gives a slight variation in their number of selected variables.A higher CCC value stands out when only Size Information is considered: 0.9143.
The CFS-Merit decreasing trend reveals important information about the oil slick type predictability (Table 5-1).The CFS implementation using the all CBOS-MOD variables (502) mostly selected categorical contextual dummy variables (13 out of 15) and its Merit is the highest as possible: 1.00.The considerable CFS-Merit drop (0.131) when such variables are removed (i.e., all SAR backscatter signature with Size Information: 433) depicts the complexity of the exploratory nature of this research: the use of radiometric and size variables to distinguish seeps from spills is indeed not an easy task.
If Size Information variables are involved, there is a higher CFS-Merit as compared to when they are not considered, e.g., all SAR backscatter signatures with (0.131) and without (0.103) size.The same pattern is observed on the separate analysis of each SAR backscatter signature, e.g., σ o with size (0.127) versus only σ o (0.099).When using only Size Information the CFS-Merit (0.112) is higher than when separately exploring the SAR backscatter signature (σ o , β o , and γ o ): 0.099, 0.099, and 0.097, respectively.These are strong indications that the sole use of radiometric variables to differentiate the oil slick type is indeed intricate, as well as suggests that by using the Size Information one has a somewhat better chance in differentiating seeps from spills.The CFS-Merit resulted from the analysis of DNs showed the worse value (0.059), thus agreeing with literature that does not recommend the use of DN values to cross-compare time series of SAR images [11,12,46].
The CFS selected attributes are presented on Table 6 (separate analyses of Size Information and DN variables) and on Table 7 (each SAR backscatter coefficients with and without Size Information).A series of other tables with the results of the several CFS implementations is presented in [8].The CFS selected attributes are presented on Table 6 (separate analyses of Size Information and DN variables) and on Table 7 (each SAR backscatter coefficients with and without Size Information).A series of other tables with the results of the several CFS implementations is presented in [8].
The results of the dendrogram analysis are depicted in Figure 3. Figure 3 (1 st row left) illustrates the UPGMA dendrogram exploring only Size Information: of the initial 10 attributes (Table 2), the CCC (0.9143) forms six groups and the fixed similarity gives five groups: 6 and 5 variables, respectively (Table 6).Similarly, Figure 3 (1 st row right) presents assessment of the DN values: of the initial 35 attributes, the CCC (0.8020) and the fixed similarity thresholds found 5 and 3 groups (variables), respectively (Table 6).Figure 3 also depicts the dendrograms of the separate UPGMA analysis of σo (CCC = 0.8262), βo (CCC = 0.8031), and γo (CCC = 0.8240) together with Size Information, whereas Table 7 presents an inventory with the UPGMA selected attributes.Carvalho [8] presents other dendrograms and lists the selected attributes of the 44 data sub-divisions.

Principal Components Analysis (PCA)
A comparison of Table 5-2,3 confirms the PCA dimensionality reduction.After analyzing the global picture of the percentages of variance concentrated on particular PCs among the 44 data sub-divisions a user-defined criterion was used to decide which of the two cut-offs should be used to select the meaningful PCs: if the "Scree Plot" PC concentrated less than 80% of the variance, the "Kaiser Criterion" PC was selected; otherwise, the Scree Plot PC was the chosen cut-off (see [8]).The number of selected PCs varies per data sub-divisions (Table 5-3).
Figure 4 (left panels) demonstrates the Scree Plot PC selection strategy that explores the broken stick analyzing only the Size Information.The 4 th row of Figure 4 depicts the bootstrapping eigenvalue error bars touching the broken stick curve for the PCA of the CFS; as such, no PC has been selected with the Scree Plot.Although the Kaiser Criterion selected one PC concentrating 90% of the variance, the Discriminant Function requires at least 2 variables (i.e., in this case, two PCs); therefore, the first two PCs have been selected (Table 5-3).Carvalho [8] illustrates the full proficiency of the broken stick PCs selection strategy using the Scree Plot.
Figure 4 (middle panels) exemplifies scatterplots illustrating the relationship between the first two PCs resulted from the PCA exploring only Size Information: 10, 6, 5, and 3 variables (Tables 2  and 6).It is evident the considerable overlap between the scores of the two populations: seeps (green possible: 1.00.The considerable CFS-Merit drop (0.131) when such variables are removed (i.e., all SAR backscatter signature with Size Information: 433) depicts the complexity of the exploratory nature of this research: the use of radiometric and size variables to distinguish seeps from spills is indeed not an easy task.
If Size Information variables are involved, there is a higher CFS-Merit as compared to when they are not considered, e.g., all SAR backscatter signatures with (0.131) and without (0.103) size.The same pattern is observed on the separate analysis of each SAR backscatter signature, e.g., σo with size (0.127) versus only σo (0.099).When using only Size Information the CFS-Merit (0.112) is higher than when separately exploring the SAR backscatter signature (σo, βo, and γo): 0.099, 0.099, and 0.097, respectively.These are strong indications that the sole use of radiometric variables to differentiate the oil slick type is indeed intricate, as well as suggests that by using the Size Information one has a somewhat better chance in differentiating seeps from spills.The CFS-Merit resulted from the analysis of DNs showed the worse value (0.059), thus agreeing with literature that does not recommend the use of DN values to cross-compare time series of SAR images [11,12,46].All SAR Backscatter Signature (σ o , β o , and γ o ; See also Tables 1 and 3).σ o Sigma-naught: SIG.amp, SIG.amp.FF, SIG.dB, and SIG.dB.FF.β o Beta-naught: BET.amp, BET.amp.FF, BET.dB, and BET.dB.FF.γ o Gamma-naught: GAM.amp, GAM.amp.FF, GAM.dB, and GAM.dB.FF.† Same as σ o with Size Information.† † Same as σ o only.§ PCA not performed because most categorical are dummy variables.# 13 are categorical contextual dummy variables (see [8]).¥ Because the Discriminant Function requires at least 2 variables.DNs: Digital Numbers.
The results of the dendrogram analysis are depicted in Figure 3. Figure 3 (1 st row left) illustrates the UPGMA dendrogram exploring only Size Information: of the initial 10 attributes (Table 2), the CCC (0.9143) forms six groups and the fixed similarity gives five groups: 6 and 5 variables, respectively (Table 6).Similarly, Figure 3 (1 st row right) presents assessment of the DN values: of the initial 35 attributes, the CCC (0.8020) and the fixed similarity thresholds found 5 and 3 groups (variables), respectively (Table 6).Figure 3 also depicts the dendrograms of the separate UPGMA analysis of σ o (CCC = 0.8262), β o (CCC = 0.8031), and γ o (CCC = 0.8240) together with Size Information, whereas Table 7 presents an inventory with the UPGMA selected attributes.Carvalho [8] presents other dendrograms and lists the selected attributes of the 44 data sub-divisions.

Principal Components Analysis (PCA)
A comparison of Table 5-2,3 confirms the PCA dimensionality reduction.After analyzing the global picture of the percentages of variance concentrated on particular PCs among the 44 data sub-divisions a user-defined criterion was used to decide which of the two cut-offs should be used to select the meaningful PCs: if the "Scree Plot" PC concentrated less than 80% of the variance, the "Kaiser Criterion" PC was selected; otherwise, the Scree Plot PC was the chosen cut-off (see [8]).The number of selected PCs varies per data sub-divisions (Table 5-3).
Figure 4 (left panels) demonstrates the Scree Plot PC selection strategy that explores the broken stick analyzing only the Size Information.The 4 th row of Figure 4 depicts the bootstrapping eigenvalue error bars touching the broken stick curve for the PCA of the CFS; as such, no PC has been selected with the Scree Plot.Although the Kaiser Criterion selected one PC concentrating 90% of the variance, the Discriminant Function requires at least 2 variables (i.e., in this case, two PCs); therefore, the first two PCs have been selected (Table 5-3).Carvalho [8] illustrates the full proficiency of the broken stick PCs selection strategy using the Scree Plot.
Figure 4 (middle panels) exemplifies scatterplots illustrating the relationship between the first two PCs resulted from the PCA exploring only Size Information: 10, 6, 5, and 3 variables (Tables 2 and 6).It is evident the considerable overlap between the scores of the two populations: seeps (green circles) and spills (red triangles).The same pattern is observed on all other scatterplots, not only for the relationship between the first two PCs, but also for all other combinations of PCs [8].
The top three rows of Figure 4 (middle panels) reveal a peculiar gap in the cloud of data points affecting both populations, i.e., spills and seeps.This is mostly due to the FRACTAL.ind and its bi-modal (multi-modal) frequency distribution [8].Such gap does not exist in the CFS analysis (Figure 4: 4 th row middle panel), in which FRACTAL.ind is not accounted for (Table 6).However, this gap is not as prominent on the scatterplots when the FRACTAL.ind is accounted for on the analysis of Size Information together with the original set of SAR backscatter signature variables, i.e., σ o , β o , or γ o .This is probably because, in this case, each PC is influenced by a much larger number of variables: 433 or 151 (Tables 1 and 3).
Figure 4 (right panels) depicts the influence (i.e., meaning) of each variable to each PC quantified by the inspection of Loadings Plots considering only Size Information: 10, 6, 5, and 3 variables (Tables 2  and 6).Almost all variables largely influence the first PC, the exception is on the CFS (Figure 4: 4 th row right) that has the perimeter (Per) being the sole variable to account for the variance on the first PC; the second PC has the opposite occurring.In fact, the only PC of all PCAs that had a preferable influence was indeed this case, as the loadings expressions of all PCs of the 44 data sub-divisions have shown that each PC receives the influence of every single variable [8].(5) were selected using the fixed similarity value of 0.5.See also Figure 3 (1 st row left: CCC = 0.9143).CFS implementation: Variables in bold (1) was also selected with the UPGMA implementation.Bottom: Only Digital Numbers (DNs) variables (35).UPGMA implementation: Variables in bold (3) were selected using the fixed similarity value of 0.5.See also Figure 3 (1 st row right: CCC = 0.8020).CFS implementation: Variables in bold (3) are the same as those selected with the UPGMA implementation.2: 10 variables), UPGMA-CCC (Table 6: 6 variables), UPGMA-Fixed (Table 6: 5 variables), and CFS (Table 6: 3 variables).Left panels: Scree Plots (Eigenvalue x Principal Components (PC)), where the dashed red line corresponds to the broken stick curve.Bootstrapping eigenvalue error bars are shown as black lines: 2.5% and 97.5%.Middle panels: Scatterplots of the first two PCs, in which red triangles correspond to oil spills and green circles to oil seeps.Right panels: Loadings Plots.

Correlation Matrix
The correlation matrices were computed in PAST (version 2.17c; Oslo, Norway).Just about all information from the original values of the CBOS-MOD variables (no PCA applied: Table 5-2) showed residual inter-variable correlation among the variables of the 44 data sub-divisions.On the other hand, and as expected, the information from the scores of the selected PCs (with PCA applied: Table 5-3) presented no inter-variable correlation.This confirms that only the PCA information is carried to design the Classification Algorithms.

Discriminant Function
Table 8 summarizes the Discriminant Analyses' findings, where it is possible to notice the correspondence between Hotelling's t2 and Overall Accuracy.The former represents a measure of equality of the means of the groups been discriminated (i.e., spills and seeps), in which larger values correspond to better discrimination of the a priori informed groups.Likewise, the latter is the total classification efficiency that reports the success rate in correctly telling apart a priori known spills and seeps samples.Initially, only the Overall Accuracy is explored; see Section 4.2.5 for the full assessment of the other metrics.The Overall Accuracy somewhat matches the Hotelling's t2  2: 10 variables), UPGMA-CCC (Table 6: 6 variables), UPGMA-Fixed (Table 6: 5 variables), and CFS (Table 6: 3 variables).Left panels: Scree Plots (Eigenvalue x Principal Components (PC)), where the dashed red line corresponds to the broken stick curve.Bootstrapping eigenvalue error bars are shown as black lines: 2.5% and 97.5%.Middle panels: Scatterplots of the first two PCs, in which red triangles correspond to oil spills and green circles to oil seeps.Right panels: Loadings Plots.

Correlation Matrix
The correlation matrices were computed in PAST (version 2.17c; Oslo, Norway).Just about all information from the original values of the CBOS-MOD variables (no PCA applied: Table 5-2) showed residual inter-variable correlation among the variables of the 44 data sub-divisions.On the other hand, and as expected, the information from the scores of the selected PCs (with PCA applied: Table 5-3) presented no inter-variable correlation.This confirms that only the PCA information is carried to design the Classification Algorithms.

Discriminant Function
Table 8 summarizes the Discriminant Analyses' findings, where it is possible to notice the correspondence between Hotelling's t2 and Overall Accuracy.The former represents a measure of equality of the means of the groups been discriminated (i.e., spills and seeps), in which larger values correspond to better discrimination of the a priori informed groups.Likewise, the latter is the total classification efficiency that reports the success rate in correctly telling apart a priori known spills and seeps samples.Initially, only the Overall Accuracy is explored; see Section 4.2.5 for the full assessment of the other metrics.The Overall Accuracy somewhat matches the Hotelling's t2 effectiveness.Because the results about the complete exploration of the CBOS-MOD accounted for many categorical contextual dummy variables, its discrimination power almost reached 100%.This corroborates the quality of excellence of the explored dataset, i.e., CBOS-MOD.DN is again the worst, matching expectations [11,12,46].A slightly better discrimination is observed while using only the Size Information.Higher Hotelling's t2 is observed for the complete exploration of the CBOS-MOD that had undergone a major data reduction (Table 5-2: 502, 59, and 46 variables, and Table 5-3: 10, 21, and 18 PCs).This was anticipated as many categorical contextual dummy variables are accounted for.On the other hand, DNs (e.g., UPGMA-CCC: 163.0) has the lower Hotelling's t2 by far.The same result observed for its CFS-Merit (Table 5-1: 0.059); important to note that Hotelling's t2 and CFS-Merit are not comparable.
Column-wise, the Hotelling's t2 behavior is fairly similar to the one from the CFS-Merit (Table 5-1).The pattern shown by the sole analysis of Size Information presents higher Hotelling's t2 as compared to when these attributes are not included.Line-wise, higher Hotelling's t2 values are usually observed for the UPGMA-CCC optimal subset.Again, the sole use of Size Information has about the same effectiveness to discriminate seeps from spills, as when these attributes are combined with the SAR backscatter signature.This is an indication that the SAR backscatter signature is not adding much to the whole differentiation process.The UPGMA-Fixed optimal subsets tend to present lower Hotelling's t2, line-wise speaking.With the UPGMA-CCC optimal subset, noted above, it is possible to confirm that the number of variables has some influence on the discrimination process.This might be related not only to the amount of variables, but also to their quality, i.e., whether there is redundant correlated information or not.The UPGMA-CCC Hotelling's t2 values are somewhat equivalent to those of the CFS.Even though their variables are similar, they are not the same (e.g., Tables 6 and 7).This also corroborates the importance of reducing the dimensionality on the attribute-domain, for instance, by using Attribute Selection and PCA.However, these processes should be applied with caution so as not to lose information (i.e., underestimation) or to include noise (i.e., overestimation) [47-49].

Oil Slick Classification Algorithm
The full set of metrics shown in Table 4 is assessed to design the qualitative-quantitative Classification Algorithms, which are simple-to-use and simple in their formulation.Because the UPGMA-Fixed showed lower discriminating power than that of UPGMA-CCC and CFS on the initial Overall Accuracy analysis (Table 8), only the latter two are presented.
The Discriminant Function results using selected PCs from all SAR backscatter signature variables (Table 3: σ o , β o , and γ o ) analyzed with and without Size Information (Table 2) are presented in Table 9.A two-fold encouraging outcome is revealed: (1) The influence of Size Information on the discrimination between spills and seeps: if comparing the top with the lower tables, better results are observed when these variables are accounted for (top) than when they are not (lower); and (2) It confirms that the Overall Average (64%) may not inform the real discrimination performance; for instance, Specificity (59%) and Positive Predictive Value (54%) are not very effective, thus agreeing with [60] that the user should be aware of the real accuracy of the algorithm by examining different metrics.The results of the separate analyses of σ o variables with and without Size Information are presented in Table 10.The improvement observed when Size Information is considered is quite similar to the one observed in Table 9.This reinforces the fact that there is a benefit to using these variables, as all metrics evaluated are improved when they are accounted for.Confusion Matrices for β o and γ o with similar results are presented in [8].The Discriminant Functions results for separate analysis of Size Information (best) and of DNs variables (worst) are presented in Table 11.A two-fold positive outcome is evident: (1) The relevancy of Size Information to distinguish oil seeps from oil spills on the sea surface of the Campeche Bay region; and (2) The corroboration with literature (e.g., [11,12,46] that DN images should not be used to compare time-series of SAR imagery.Because the Multivariate Data Analysis revealed the use of Size Information (i.e., 3 rd Attribute Type: Table 2) as best to distinguish the oil slick type, a separate analysis explores the only two basic morphological features originally present in the CBOS-Data: area and perimeter.The same Data Treatment Process has been applied preceding the PCA, and two PCs have been selected prior the Discriminant Function.Table 12 presents quite similar results to those shown in Tables 9-11, thus demonstrating that the sole and simple use of area and perimeter is equally capable of distinguishing human-related oil spills from naturally-occurring oil seeps to a useful accuracy.
Because the Multivariate Data Analysis revealed the use of Size Information (i.e., 3 rd Attribute Type: Table 2) as best to distinguish the oil slick type, a separate analysis explores the only two basic morphological features originally present in the CBOS-Data: area and perimeter.The same Data Treatment Process has been applied preceding the PCA, and two PCs have been selected prior the Discriminant Function.Table 12 presents quite similar results to those shown in Tables 9-11, thus demonstrating that the sole and simple use of area and perimeter is equally capable of distinguishing human-related oil spills from naturally-occurring oil seeps to a useful accuracy.Therefore, there is no need to evoke more complicated variables for distinguishing the oil slick type.

Discussion
Even though the three SAR backscatter coefficients can easily be transferred from one to another [8], the pool of attributes selected by the UPGMA and CFS implementations indicate the utility of σo, βo, and γo is not exactly the same (Table 7).It is also true that because σo is directly related to the reflectivity of the ground horizontal range plane, it is a common SAR backscatter coefficient considered in oil slick detection; see [8] and references therein.Nonetheless, the data mining exercise of our EDA has demonstrated that σo, βo, and γo present differences among each other (Table 7) and that it is worth analyzing them separately.
Two aspects agreed by the remote-sensing scientific community regarding the use of geometry, shape, and dimension attributes in the use of SAR measurements to identify oil slicks are: (1) Ship spills have distinct Size Information from other oil spills [63]; and (2) These attributes are crucial to distinguish oil slicks from look-alike features [64,65].Thus, considering the exploratory nature of our Classification Algorithms, it should be emphasized that only a small fraction of the CBOS-Data (3.9%) and of the CBOS-MOD (3.2%) represents ship spills.In addition, the sole use of only two basic oil slicks' morphological features (i.e., area and perimeter) is an innovative way to distinguish oil from oil, i.e., natural from man-made oil slicks-a task poorly documented in the scientific literature-peer-reviewed, as well as grey literature; see [8] and references therein.
The data mining exercise of our EDA promotes a novel idea bridging petroleum pollution and remote sensing research, thus paving the way for further investigations using the satellite synoptic view to express geophysical differences between spilled and seeped oil observed on the sea surface.Despite the substantial amount of work performed in the data mining exercise of our EDA that uses standard multivariate data analysis techniques applied to SAR measurements for the successful distinguishment of sea surface expressions of naturally-occurring oil seeps from human-related oil spills in the Campeche Bay region [8], a list of nine main suggestions that may be further explored to achieve improved accuracies for systematic use is given:

Discussion
Even though the three SAR backscatter coefficients can easily be transferred from one to another [8], the pool of attributes selected by the UPGMA and CFS implementations indicate the utility of σ o , β o , and γ o is not exactly the same (Table 7).It is also true that because σ o is directly related to the reflectivity of the ground horizontal range plane, it is a common SAR backscatter coefficient considered in oil slick detection; see [8] and references therein.Nonetheless, the data mining exercise of our EDA has demonstrated that σ o , β o , and γ o present differences among each other (Table 7) and that it is worth analyzing them separately.
Two aspects agreed by the remote-sensing scientific community regarding the use of geometry, shape, and dimension attributes in the use of SAR measurements to identify oil slicks are: (1) Ship spills have distinct Size Information from other oil spills [63]; and (2) These attributes are crucial to distinguish oil slicks from look-alike features [64,65].Thus, considering the exploratory nature of our Classification Algorithms, it should be emphasized that only a small fraction of the CBOS-Data (3.9%) and of the CBOS-MOD (3.2%) represents ship spills.In addition, the sole use of only two basic oil slicks' morphological features (i.e., area and perimeter) is an innovative way to distinguish oil from oil, i.e., natural from man-made oil slicks-a task poorly documented in the scientific literature-peer-reviewed, as well as grey literature; see [8] and references therein.
The data mining exercise of our EDA promotes a novel idea bridging petroleum pollution and remote sensing research, thus paving the way for further investigations using the satellite synoptic view to express geophysical differences between spilled and seeped oil observed on the sea surface.Despite the substantial amount of work performed in the data mining exercise of our EDA that uses standard multivariate data analysis techniques applied to SAR measurements for the successful distinguishment of sea surface expressions of naturally-occurring oil seeps from human-related oil spills in the Campeche Bay region [8], a list of nine main suggestions that may be further explored to achieve improved accuracies for systematic use is given:

•
Information from the area surrounding the oil slicks, i.e., background oil-free surface, could come into play, e.g., damping ratio [66]; • Information about radar beam incidence angles and environmental configurations (e.g., wind conditions) could add knowledge to the differentiation process; • Fractal dimension (e.g., box-counting or dynamic fractal approaches) could be used to measure, analyze, and classify the oil slicks textures and shapes [67,68]; The number of individual non-contiguous parts forming oil slick polygons could be explored along with other contextual properties, such as the distance between centroid and the shoreline or the distance from a possible point source, i.e., oil rig complex or oil seep site location on the sea surface [21]; Other methods instead of Ranging Standardization could be applied during the Data Treatment Process, e.g., z-score [30]; As an alternative to the proposed automatic Attribute Selection Methods to select groups and variables, visual analyses of the UPGMA dendrograms could be instituted, or even other custom methods, e.g., Ward's [30]; • Different "stopping rule" methods could select meaningful PCs [47,48,69]; • Artificial Neural Networks (ANN), largely used to distinguish between oil slicks and look-alike features (e.g., [70][71][72]), indeed could be tested to differentiate oil seeps from oil spills using variables from the Attribute Selection Methods used herein, e.g., area and perimeter; and; • Clustering Analyses (e.g., K-means) could also reveal natural groupings among oil slicks, thus supporting the oil slick type differentiation process.
In addition to the abovementioned, the data mining exercise of our EDA surely serves as an archetype for developing other ways to differentiate the oil slick type [73].It also paves the way for full polarimetric investigations that could indeed search other ways to differentiate oil seeps from oil spills using SAR measurements (e.g., [74,75]).

Conclusions
The data mining exercise of our EDA has successfully demonstrated that it is feasible to use multivariate data analysis techniques, applied to SAR measurements, to distinguish the sea surface expressions of naturally-occurring oil seeps from human-related oil spills observed in Campeche Bay.Indeed, the use of an assorted pool of standard multivariate data analysis techniques (e.g., R-mode Correlation, Principal Components Analysis, and Discriminant Function) has been effectively applied to analyze the remote sensing library of a specific long-term dataset (2008-2012) that consists of 277 RADARSAT-2 scenes and 4916 oil slicks.
The answers to the three scientific questions about the sea surface expression of oil slicks are: 1.
Yes.The Size Information (i.e., attributes describing the geometry, shape, and dimension of the oil slicks) can be used to distinguish seeps from spills, and in fact, the sole and simple use of area and perimeter can also distinguish natural from man-made oil slicks with an overall accuracy of about 70%; and;

3.
The synergistic combination of various oil slick characteristics leads to systems capable of distinguishing between seeped and spilled oil, in which different combinations of variables promote similar differentiation; however, the one leading to the most effective classification is represented by the sole and simple use of area and perimeter (70%), whereas the one with the worst capabilities have variables expressed in Digital Numbers (DNs).
The proposed oil slick type Classification Algorithms of our EDA have useful accuracies, and range from uncomplicated algorithms combining the synergy of different variables to very simple ones using only SAR backscatter signature or only Size Information.In fact, the effective differentiation between oil seeps and oil spills achieved using only basic morphological features, such as the area and the perimeter of oil slicks, as determined by digital classification of satellite imagery, is a relevant outcome directly contributing to the remote sensing research, as well as to the activities of the oil and gas exploration and production industry.The field of environment monitoring also benefits from the outcomes of the data mining exercise conducted during our EDA.Conclusively, if the techniques presented herein are capable of distinguishing two types of oil observed on the sea surface (i.e., naturally-occurring oil seeps from human-related oil spills) based on RADARSAT-2 measurements, it is likely they can also produce good results if applied to the problem of discriminating oil from look-alike features (for instance, algal blooms or low wind zone signatures), indeed, solely using SAR measurements.

Figure 1 .
Figure 1.Gulf of Mexico highlighting the Campeche Bay region.The rectangle illustrates the region from which most oil slicks (98%) have been observed [9,10].Isobaths of 200 m, 1000 m, and 3000 m are also shown.Courtesy of Adriano Vasconcelos.

Figure 1 .
Figure 1.Gulf of Mexico highlighting the Campeche Bay region.The rectangle illustrates the region from which most oil slicks (98%) have been observed [9,10].Isobaths of 200 m, 1000 m, and 3000 m are also shown.Courtesy of Adriano Vasconcelos.

Figure 2 .
Figure 2. Summary of 44 data sub-divisions: 11 original sets (shown in the middle along with the number of variables; see also Table5-1) and 33 optimal subsets of selected variables (shown in both sides).These have undergone two Attribute Selection Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and CFS (Correlation-Based Feature Selection).The UPGMA implementation employs two user-defined thresholds to form groups on the analysis of dendrograms: CCC (Cophenetic Correlation Coefficient) and 0.5 (fixed similarity value).The CFS automatically provides the selected variables.

Figure 2 .
Figure 2. Summary of 44 data sub-divisions: 11 original sets (shown in the middle along with the number of variables; see also Table5-1) and 33 optimal subsets of selected variables (shown in both sides).These have undergone two Attribute Selection Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and CFS (Correlation-Based Feature Selection).The UPGMA implementation employs two user-defined thresholds to form groups on the analysis of dendrograms: CCC (Cophenetic Correlation Coefficient) and 0.5 (fixed similarity value).The CFS automatically provides the selected variables.
How many known oil seep samples are correctly identified?Sensitivity: A, also given in percentage by (A × 100)/[A + B].PA/CE Q2) How many known oil spill samples are correctly identified?Specificity: D, also given in percentage by (D × 100)/[C + D].PA/CE Q3) How many known oil seep samples are misidentified?False Negative cases: B, also given in percentage by (B × 100)/[A + B].Coupled with Sensitivity.PA/CE Q4) How many known oil spill samples are misidentified?False Positive cases: C, also given in percentage by (C × 100)/[C + D].Linked to Specificity.UA/OE Q1) How many oil seeps identified by the Function are indeed known oil seeps?Positive Predictive Value: A, also given in percentage by (A × 100)/[A + C].UA/OE Q2) How many oil spills identified by the Function are indeed known oil spills?Negative Predictive Value: D, also given in percentage by (D × 100)/[B + D].UA/OE Q3) Of samples identified by the Function as oil seeps, how many are oil spills?Inverse of the Positive Predictive Value: C, also given in percentage by (C × 100)/[A + C].UA/OE Q4) Of samples identified by the Function as oil spills, how many are oil seeps?Inverse of the Negative Predictive Value: B, also given in percentage by (B × 100)/[B + D].

Figure 4 .
Figure 4. Results of the PCA exploring only Size Information.From top to bottom: original set (Table2: 10 variables), UPGMA-CCC (Table6: 6 variables), UPGMA-Fixed (Table6: 5 variables), and CFS (Table6: 3 variables).Left panels: Scree Plots (Eigenvalue x Principal Components (PC)), where the dashed red line corresponds to the broken stick curve.Bootstrapping eigenvalue error bars are shown as black lines: 2.5% and 97.5%.Middle panels: Scatterplots of the first two PCs, in which red triangles correspond to oil spills and green circles to oil seeps.Right panels: Loadings Plots.

Figure 4 .
Figure 4. Results of the PCA exploring only Size Information.From top to bottom: original set (Table2: 10 variables), UPGMA-CCC (Table6: 6 variables), UPGMA-Fixed (Table6: 5 variables), and CFS (Table6: 3 variables).Left panels: Scree Plots (Eigenvalue x Principal Components (PC)), where the dashed red line corresponds to the broken stick curve.Bootstrapping eigenvalue error bars are shown as black lines: 2.5% and 97.5%.Middle panels: Scatterplots of the first two PCs, in which red triangles correspond to oil spills and green circles to oil seeps.Right panels: Loadings Plots.

Table 5 .
Data mining summary of the 11 original sets and the 33 optimal subsets of selected variables proposed on Figure3, collectively referred to as to as 44 data sub-divisions.1)Values of UPGMA-CCC and CFS-Merit.2) Number of variables.3)Number of selected Principal Components (PCs).See also Table9for full PCA results.

Table 5 .
Data mining summary of the 11 original sets and the 33 optimal subsets of selected variables proposed on Figure3, collectively referred to as to as 44 data sub-divisions.1)Values of UPGMA-CCC and CFS-Merit.2) Number of variables.3)Number of selected Principal Components (PCs).See also Table9for full PCA results.

Table 5 .
Data mining summary of the 11 original sets and the 33 optimal subsets of selected variables proposed on Figure3, collectively referred to as to as 44 data sub-divisions.1)Values of UPGMA-CCC and CFS-Merit.2) Number of variables.3)Number of selected Principal Components (PCs).See also Table9for full PCA results.

Table 8 .
Discriminant Function results: Hotelling's t2 values and Overall Accuracy of the 44 data sub-divisions proposed on Figure 2.

Table 12 .
Confusion Matrices of the Discriminant Functions using selected Principal Components (PCs): sole and simple use of two specific oil slicks Size Information, i.e., area and perimeter.See also Table4for further information about Confusion Matrixes.See Table4for support.

Table 12 .
Confusion Matrices of the Discriminant Functions using selected Principal Components (PCs): sole and simple use of two specific oil slicks Size Information, i.e., area and perimeter.See also Table4for further information about Confusion Matrixes.See Table4for support.