Reﬁned Analysis of RADARSAT-2 Measurements to Discriminate Two Petrogenic Oil-Slick Categories: Seeps versus Spills

: Our research focuses on refining the ability to discriminate two petrogenic oil-slick categories: the sea surface expression of naturally-occurring oil seeps and man-made oil spills. For that, a long-term RADARSAT-2 dataset (244 scenes imaged between 2008 and 2012) is analyzed to investigate oil slicks (4562) observed in the Gulf of Mexico (Campeche Bay, Mexico). As the scientific literature on the use of satellite-derived measurements to discriminate the oil-slick category is sparse, our research addresses this gap by extending our previous investigations aimed at discriminating seeps from spills. To reveal hidden traits of the available satellite information and to evaluate an existing Oil-Slick Discrimination Algorithm, distinct processing segments methodically inspect the data at several levels: input data repository, data transformation, attribute selection, and multivariate data analysis. Different attribute selection strategies similarly excel at the seep-spill differentiation. The combination of different Oil-Slick Information Descriptors presents comparable discrimination accuracies. Among 8 non-linear transformations, the Logarithm and Cube Root normalizations disclose the most effective discrimination power of almost 70%. Our refined analysis corroborates and consolidates our earlier findings, providing a firmer basis and useful accuracies of the seep-spill discrimination practice using information acquired with space-borne surveillance systems based on Synthetic Aperture Radars.


Introduction
The impact of mineral oil pollution is a widely spread source of environmental concern in various ecosystems [1,2]. The detection of the sea surface expression of oil using space-borne surveillance systems is an extensively studied subject [3][4][5]. Oil floating on the surface of the ocean can be located, to some extent, with different types of remote sensing sensors-e.g., thermal infrared (AVHRR: Advanced Very High Resolution Radiometer [6]), visible/near infrared (MODIS: Moderate Resolution Imaging Spectroradiometer [7]), etc.-but generally, most attempts concentrate on using 1.
Among the several Data Transformation Approaches we tested, which one provides the most accurate oil-slick category discrimination? 2.
Is there a specific Attribute Selection Process that excels at choosing variables to discriminate seeps from spills? 3.
Which combination of Oil-Slick Information Descriptors promotes the best discrimination between seeps and spills?

Methods
We developed a comprehensive Exploratory Data Analysis (EDA) to reveal hidden information contained in the satellite-derived measurements and to refine the analysis to discriminate slicks by category, as proposed in our earlier studies [19][20][21][22]. The design of our EDA focuses on a data-driven scheme to investigate possible ways to improve the seep-spill discrimination with the simplest possible analysis and the lowest satellite-imaging cost. The research strategy employed herein is a development of our previous investigations [19][20][21][22], and consists of four distinct Data Processing Segments (i.e., A, B, C, and D in Figure 2)-devised in eight individual Phases-separately described in detail and introduced in a complete manner easily enabling replicability of our data mining exercise. A summary of our EDA design is depicted in Figure 2. While in-house Python codes are used to run the oil slick RADARSAT-2 related analyses (i.e., Phases 1-4), PAST (PAleontological STatistics: version 3.20, Oslo, Norway [26]) is used in the implementation of Phases 5-8. Research strategy developed to refine the ability to discriminate between two petrogenic oilslick categories (i.e., seeps versus spills), as proposed in our previous studies [19][20][21][22]. The proposed

Methods
We developed a comprehensive Exploratory Data Analysis (EDA) to reveal hidden information contained in the satellite-derived measurements and to refine the analysis to discriminate slicks by category, as proposed in our earlier studies [19][20][21][22]. The design of our EDA focuses on a data-driven scheme to investigate possible ways to improve the seep-spill discrimination with the simplest possible analysis and the lowest satellite-imaging cost. The research strategy employed herein is a development of our previous investigations [19][20][21][22], and consists of four distinct Data Processing Segments (i.e., A, B, C, and D in Figure 2)-devised in eight individual Phases-separately described in detail and introduced in a complete manner easily enabling replicability of our data mining exercise. A summary of our EDA design is depicted in Figure 2. While in-house Python codes are used to run the oil slick RADARSAT-2 related analyses (i.e., Phases 1-4), PAST (PAleontological STatistics: version 3.20, Oslo, Norway [26]) is used in the implementation of Phases 5-8.

Methods
We developed a comprehensive Exploratory Data Analysis (EDA) to reveal hidden information contained in the satellite-derived measurements and to refine the analysis to discriminate slicks by category, as proposed in our earlier studies [19][20][21][22]. The design of our EDA focuses on a data-driven scheme to investigate possible ways to improve the seep-spill discrimination with the simplest possible analysis and the lowest satellite-imaging cost. The research strategy employed herein is a development of our previous investigations [19][20][21][22], and consists of four distinct Data Processing Segments (i.e., A, B, C, and D in Figure 2)-devised in eight individual Phases-separately described in detail and introduced in a complete manner easily enabling replicability of our data mining exercise. A summary of our EDA design is depicted in Figure 2. While in-house Python codes are used to run the oil slick RADARSAT-2 related analyses (i.e., Phases 1-4), PAST (PAleontological STatistics: version 3.20, Oslo, Norway [26]) is used in the implementation of Phases 5-8. Research strategy developed to refine the ability to discriminate between two petrogenic oilslick categories (i.e., seeps versus spills), as proposed in our previous studies [19][20][21][22]. The proposed Research strategy developed to refine the ability to discriminate between two petrogenic oil-slick categories (i.e., seeps versus spills), as proposed in our previous studies [19][20][21][22]. The A multi-year dataset of RADARSAT-2 scenes imaged between 2008 and 2012 gave rise to the oil slick data archive analyzed in our earlier investigations [19][20][21][22]. This data archive consists of polygons representative of oil slicks that had been identified and field validated as seeps and spills by domain experts. For more information about this dataset, see [19][20][21][22]. The workable dataset explored herein is defined after fine-tuning this data archive along the 1st Data Processing Segment (Figure 2A: Input Data Repository-Phases 1-3).

Phase 1: Data Quality Control
The initial oil slick data archive from our previous studies [19][20][21][22] is sorted by the satellite scene-imaging configuration (i.e., beam modes determining the acquisition swath width and ground resolution), thus establishing the amount of RADARSAT-2 imagery and the seeps and spills of our workable dataset.

Phase 2: Positive Domain Rescaling
The initially available oil slick data archive analyzed in our earlier investigations [19][20][21][22] had undergone a linear scaling action (Negative Values Scaling Filter: NVSF) that is comprised of a two-fold procedure applied to individual oil slicks: the subtraction of the minimum negative pixel value within each oil slick from every single pixel of such oil slick, followed by the addition of 1 to every single pixel-the minimum pixel value becomes 1. This brings all pixel values to the positive domain, which is a requirement of data normalization procedures that cannot be applied to negative values, e.g., log 10 . The NSVF is applied at the pixel level, i.e., taking into account all pixels of each oil slick to provide a single measure representative of all pixels of such oil slick (see below: Section 2.3.2). Nevertheless, previously, the NVSF was only applied to certain oil slicks: those having at least one negative pixel value-for instance, oil slicks that had spurious negative SAR backscatter signature caused by intrinsic multiplicative random granular speckle noise destructive imprecision in the range-dependent gain calculation [27,28].
Although we also conduct this filtering strategy, we apply it in the present research to all oil slicks. In essence, hereafter, for our purpose, the NVSF is referred to as Minimum Values Scaling Filter (MVSF), such that: PIXpos = (PIX-PIXmin) + 1, in which PIXpos corresponds to the new positive pixel value, PIX is the original pixel value, PIXmin is the minimum pixel value of all pixels of each oil slick. Therefore, this is a dissimilarity between our previous investigations and the current EDA: NVSF versus MVSF. The reason for applying the MVSF to all oil slicks is three-fold: (1) To avoid possible biases caused by gradient differences among oil slicks with and without NVSF; (2) To circumvent the application of despeckle filtering (e.g., Frost Filter: FFrost [29]; see also Phase 3) that eventually would eliminate negative values, but would alter (e.g., smoothing) the SAR backscatter signature values-the lack of such filter is justifiable to preserve the data-driven design of our EDA; and (3) To exploit data transformations that do not accept negative values (see below: Phase 4).

SAR Backscatter Signature
Previously, we explored twelve SAR backscatter signatures: SAR backscatter coefficients corresponding to the radar cross-section (RCS: σ) normalized by the unit area calculated in three different surface planes (i.e., σ o , β o , and γ o [30][31][32][33][34]) computed in four radiometric-calibrated image products-i.e., the amplitude (1st) of the received radar beam and its dimensionless physical quantity form that represents power expressed in dB (2nd), both with (3rd) and without (4th) despeckle filtering (FFrost: 3-by-3 window). However, herein we perform a simplification for a more controlled EDA solely using σ o given in amplitude without despeckle filtering. As such, from this point onwards, unless otherwise stated, any reference to SAR backscatter signature synonymously refers to this simplification.

Oil-Slick Information Descriptors
As before [19][20][21][22], we start our research analyzing the same ten attributes describing the oil slicks' geometry, shape, and dimension (these are collectively referred to as Size Information Descriptors) derived from two basic morphological features characterizing the oil slicks-i.e., area (Area) and perimeter (Per): Analogously, we also exploit the same 36 basic descriptive statistics metrics experimentally explored to characterize the oil slicks' SAR backscatter signature as in our previous investigations [19][20][21][22]. These metrics are calculated based on all pixels inside individual oil slick polygons: Herein we introduce two new variables that describe the distribution patterns of the pixels within each oil slick: Skewness (SKW) and Kurtosis (KUR). As such, this collection of 38 basic descriptive statistics metrics characterizing the oil slick's SAR backscatter signature is henceforth referred to as SAR Information Descriptors. Together, these two types of Oil-Slick Information Descriptors (i.e., Size and SAR) determine the initial number of variables (48) accounted in our workable dataset.

Phase 5: Attribute Selection Processes
The processes of selecting relevant attributes deals with the complex matter of reducing dimensionality in the variable-hyperspace domain (see also Phase 6); this generally helps to elucidate the problem solution of numerical ecology assessments and to improve the performance of classification algorithms [42,45]. As such, another difference from our earlier studies is the number of explored attributes: before, we investigated 44 data sub-divisions with 502, 433, 423, 151, 141, 35, 10, and 2 variables [19][20][21][22]. Indeed, we considerably reduce these numbers with the SAR backscatter signature simplification (see Phase 3: Section 2.3.1). Additionally, we start with 48 Oil-Slick Information Descriptors (see Phase 3: Section 2.3.2) but use even fewer variables upon the completion of the Attribute Selection Processes (see below: Section 2.5.1).

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
Two attribute selection strategies (i.e., R-mode) have been performed in our previous investigations [19][20][21][22]: UPGMA [42,43,46] and CFS (Correlation-Based Feature Selection [47,48]). Based on our earlier results, we only implement the former as it allows a user-defined strategy to select relevant variables: the choice of the similarity index (Pearson's r correlation coefficient) used in the UPGMA dendrogram as cut-off to form groups of similar variables, i.e., phenon line [49,50]. See also [19][20][21][22] for further information about analyses and interpretations of rooted tree UPGMA dendrograms.
Moreover, an imperative distinction from our earlier investigations is that herein we are experimenting the use of a strict cut-off level, i.e., a fixed similarity value of 0.3, in relation to the previous fixed value of 0.5 and varying one ranging around 0.9 [19][20][21][22]. The selection of the 0.3 similarity cut-off is enlightened by the Bonferroni Adjustment as the level of minimum significance (p value) for large datasets (n > 100); below this there is no statistically significant correlation and variables are considered different from one another [51].

Histograms and Correlation Matrices
Histograms and correlation matrices assist in the verification of residual inter-variable correlation and to help with the decision of which variables to select on the groups formed on the UPGMA analyses.

Phase 6: Principal Component Analysis (PCA)
PCAs reduce the large correlated variables set into a smaller set of uncorrelated hypothetical variables-Principal Components (PCs)-containing most of the relevant information of the initial larger set [42,43]. The rotation of the original axes to the new orthogonal coordinate system is implemented in the same manner as our earlier work: square symmetric correlation matrix and 1000 bootstraps [52]. However, the approach to select relevant axes (i.e., PCs) is a departure from our earlier investigations. While, herein we use only the Kaiser Cut, i.e., Kaiser-Guttman criterion (eigenvalues > 1 [53]), previously we explored several PC-selection practices, e.g., Jolliffe, Scree Plot (Knee/Elbow), and a combined strategy using the Scree Plot (broken stick) with Kaiser [54][55][56][57].

Phase 7: Discriminant Function
Discriminant Analysis differs from Clustering Analysis as it is not meant to determine to which group each object belongs [43]. Instead, Discriminant Functions use a priori measured information (Oil-Slick Information Descriptors) and knowledge of the object's (oil slick) group membership (seep or spill), to obtain the maximum discriminating power that minimizes the probability of erroneous discrimination: [DF(X) = (W 1 X 1 + W 2 X 2 + . . . + W n X n )−C off ]; in which DF(X) corresponds to the dependent variable (i.e., Discriminant Function); X n to the independent variables (i.e., Oil-Slick Information Descriptor value); W n to the independent variables' weight; and C off to the constant offset [58][59][60][61].
The use of uncorrelated attributes (selected PCs from Phase 6), or at least with the lowest possible degree of dependence (UPGMA selected variables from Phase 5), is a pressing need for Discriminant Functions [62], and as such, this concerns a crucial development of the current EDA from our previous investigations [19][20][21][22]: herein, we are not only using the PCA scores (PCs) as input to the Discriminant Functions, we are also testing the use of UPGMA dendrogram selected variables (see Phase 5: Section 2.5.1) without passing through the PCA.

Phase 8: Confusion Matrices (2-by-2 Tables)
The Oil-Slick Discrimination Algorithm accuracy is reported based on the Discriminant Function results by means of the complete understanding of adapted 2-by-2 Tables (Confusion Matrices: CMs). See also [19][20][21][22][63][64][65] for information on how to analyze and to better interpret 2-by-2 Tables. The conjunct interpretation of five metrics [66] is essential to fully evaluate the algorithm's effectiveness. Table 1 gives a picture of these metrics that are color-coded for clarity:  or spill), to obtain the maximum discriminating power that minimizes the probability of erroneous discrimination: [DF(X) = (W1X1 + W2X2 + ... + WnXn)-Coff]; in which DF(X) corresponds to the dependent variable (i.e., Discriminant Function); Xn to the independent variables (i.e., Oil-Slick Information Descriptor value); Wn to the independent variables' weight; and Coff to the constant offset [58][59][60][61]. The use of uncorrelated attributes (selected PCs from Phase 6), or at least with the lowest possible degree of dependence (UPGMA selected variables from Phase 5), is a pressing need for Discriminant Functions [62], and as such, this concerns a crucial development of the current EDA from our previous investigations [19][20][21][22]: herein, we are not only using the PCA scores (PCs) as input to the Discriminant Functions, we are also testing the use of UPGMA dendrogram selected variables (see Phase 5: Section 2.5.1) without passing through the PCA.

Phase 8: Confusion Matrices (2-by-2 Tables)
The Oil-Slick Discrimination Algorithm accuracy is reported based on the Discriminant Function results by means of the complete understanding of adapted 2-by-2 Tables (Confusion Matrices: CMs). See also [19][20][21][22][63][64][65] for information on how to analyze and to better interpret 2-by-2 Tables. The conjunct interpretation of five metrics [66] is essential to fully evaluate the algorithm's effectiveness. Table 1 gives a picture of these metrics that are color-coded for clarity:

Phase 1: Data Quality Control
The initially available oil slick data archive is composed of 4,916 oil slick polygons-2021 oil seeps (41%) and 2895 oil spills (59%)-imaged with 277 RADARSAT-2 scenes (Table 2 I), all of which are 16-bit and VV polarized [19][20][21][22]. These include two different RADARSAT beam modes-Wide [W1 and W2: 354 oil slicks (7%)] and ScanSAR Narrow [SCNA and SCNB: 4562 oil slicks (93%)]-that own two fundamental imaging differences: (1) W1 and W2 are Single Beam Modes (i.e., a strip-map SAR mode with certain imaging aspects constant along the entire scene), whereas SCNA and SCNB are ScanSAR Modes (i.e., combine two or more of the Single Beam Modes) [67]; the latter provides larger area coverage: swath width of 300 km-almost twice that of W1 and W2: 170 km and 150 km, respectively; and (2) Wide has a finer ground resolution of 25 m, which is 1 4 of the ScanSAR Narrow one: 50 m. The initially available oil slick data archive is composed of 4,916 oil slick polygons-2021 oil seeps (41%) and 2895 oil spills (59%)-imaged with 277 RADARSAT-2 scenes (Table 2I), all of which are 16bit and VV polarized [19][20][21][22]. These include two different RADARSAT beam modes-Wide [W1 and W2: 354 oil slicks (7%)] and ScanSAR Narrow [SCNA and SCNB: 4562 oil slicks (93%)]-that own two fundamental imaging differences: (1) W1 and W2 are Single Beam Modes (i.e., a strip-map SAR mode with certain imaging aspects constant along the entire scene), whereas SCNA and SCNB are ScanSAR Modes (i.e., combine two or more of the Single Beam Modes) [67]; the latter provides larger area coverage: swath width of 300 km-almost twice that of W1 and W2: 170 km and 150 km, respectively; and (2) Wide has a finer ground resolution of 25 m, which is ¼ of the ScanSAR Narrow one: 50 m. Table 2. Number (and percentage) of explored oil slicks (seeps and spills) and satellite images.
Regarding their specification differences, eventual inaccuracies may be introduced to beam mode cross-comparisons. Notwithstanding that W1 and W2 provide better delineation of smaller oil slicks with their finer ground resolution, only SCNA and SCNB are kept in our analysis as these represent more than 90% of the available scenes. Furthermore, the ScanSAR Narrow swath width is more appropriate for monitoring applications requiring large-scale coverage such as the one that gave rise to the initially available oil slick data archive [19][20][21][22]. In fact, the lower scene cost of using ScanSAR Narrow to monitor larger ocean regions is rather preferable than the smaller area coverage of the Wide images.
Consequently, our workable dataset is composed of the collection of oil slick polygons imaged with the two ScanSAR Narrow beam modes: 4562 oil slicks-1994 oil seeps (44%) and 2568 oil spills (56%)- Table 2II. Despite the fact that our EDA has 7% (354) fewer oil slicks than our previous study [19][20][21][22], representing about 1% (27) fewer seeps and approximately 11% (327) fewer spills (Table 2III), such data reduction results in a more balanced dataset as compared to the one explored in our previous investigations, i.e., a smaller difference between the number of analyzed spills and seeps: 13% instead of 18% (Table 2: I-II). Indeed, this provides a firmer basis in the oil-slick category discrimination. Moreover, the oil slick polygons imaged with SCNA and SCNB come from 244 RADARSAT-2 scenes imaged between 2008 and 2012-12% (33) fewer images than our earlier investigations ( Table 2).

Phase 2: Positive Domain Rescaling
As the MVSF is applied at the pixel level to all oil slicks in our workable dataset (Table 2II: 4562), it affects the values of the 38 SAR Information Descriptors but not of the 10 Size Information Descriptors (see Phase 3: Section 2.3.2). The latter is independent of the MVSF application as they are derived from and include the two basic morphological oil slick features: Area and Perimeter.

Phase 3: Slick Feature Refinement
The consequence of MVSF (see Phase 2) is two-fold: (1) the SAR Information Descriptors are not the same as in our previous investigations and need to be recomputed for all analyzed oil slicks; (2) Regarding their specification differences, eventual inaccuracies may be introduced to beam mode cross-comparisons. Notwithstanding that W1 and W2 provide better delineation of smaller oil slicks with their finer ground resolution, only SCNA and SCNB are kept in our analysis as these represent more than 90% of the available scenes. Furthermore, the ScanSAR Narrow swath width is more appropriate for monitoring applications requiring large-scale coverage such as the one that gave rise to the initially available oil slick data archive [19][20][21][22]. In fact, the lower scene cost of using ScanSAR Narrow to monitor larger ocean regions is rather preferable than the smaller area coverage of the Wide images.
Consequently, our workable dataset is composed of the collection of oil slick polygons imaged with the two ScanSAR Narrow beam modes: 4562 oil slicks-1994 oil seeps (44%) and 2568 oil spills (56%)- Table 2 II. Despite the fact that our EDA has 7% (354) fewer oil slicks than our previous study [19][20][21][22], representing about 1% (27) fewer seeps and approximately 11% (327) fewer spills (Table 2 III), such data reduction results in a more balanced dataset as compared to the one explored in our previous investigations, i.e., a smaller difference between the number of analyzed spills and seeps: 13% instead of 18% (Table 2: I-II). Indeed, this provides a firmer basis in the oil-slick category discrimination. Moreover, the oil slick polygons imaged with SCNA and SCNB come from 244 RADARSAT-2 scenes imaged between 2008 and 2012-12% (33) fewer images than our earlier investigations ( Table 2).

Phase 2: Positive Domain Rescaling
As the MVSF is applied at the pixel level to all oil slicks in our workable dataset ( Table 2 II: 4562), it affects the values of the 38 SAR Information Descriptors but not of the 10 Size Information Descriptors (see Phase 3: Section 2.3.2). The latter is independent of the MVSF application as they are derived from and include the two basic morphological oil slick features: Area and Perimeter.

Phase 3: Slick Feature Refinement
The consequence of MVSF (see Phase 2) is two-fold: (1) the SAR Information Descriptors are not the same as in our previous investigations and need to be recomputed for all analyzed oil slicks; (2) MIN loses its meaning as its value for all oil slicks becomes 1; accordingly, it is not pursued in our analysis.

Phase 4: Data Transformation Approaches
Although the NLTs can be independently applied to each attribute, for consistency, during our EDA, all-numeric variables uniformly undergo the same column-wise transformation. Because three Oil-Slick Information Descriptors-i.e., Fractal, SKW, and KUR-have values that range from negative to positive, they are not used on half of the NLTs that require only positive values: NLT.1; NLT.2; NLT.3; and NLT.4.

Phase 5: Attribute Selection Processes
Histograms show that the distribution of some Size Information Descriptors is the same as others, sometimes being inverted independent of NLT, meaning that there is no new information revealed. As a result, only one of these variables is selected, for instance: (1) AtoP and PtoA have equal but inverted distributions; (2) PtoAnor, Complex, Compact, and Shape, also have equal distribution but Compact is inverted from the three other. Of these variables, we only keep PtoA and Compact, as Area and Perimeter appear in opposition in their formula: PtoA has area in the denominator, as opposed to Compact, which has area in the numerator; the contrary holds true for the perimeter (see Phase 3: Section 2.3.2).

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
The combined analysis of dendrograms and correlation matrices show that the 24 COV pair-values have a strong intra-correlation, as well as that they are highly correlated with most of the other variables; hence, they are not further explored. Therefore, out of the 48 initial Oil-Slick Information Descriptors (see Phase 3: Section 2.3.2), only 19 remain for further analyses-Size (6) Figure 3 depicts eight UPGMA dendrograms (one for each of the analyzed NLT), in which it is possible to observe a number of differences, as well as resemblances, between them; mostly regarding inter-variable correlations. An evident characteristic of the two Logarithm functions (NLT.2: Log 10 ; and NLT.3: Ln) is that their dendrograms are equal; the same holds true for their correlation matrices that are also identical.
Prior to the uncorrelated variables selection, we have to identify the groups of correlated variables. The process of defining and/or interpreting groups in UPGMA dendrograms is quite subjective [43], but, at first glance, the global picture of Figure 3 clearly reveals how equivalent are the groups between the several NLTs; these are color-coded for clarity. In the visual analysis of Figure 3, one can note that variables tend to group based on their main characteristics, following the Oil-Slick Information Descriptor features, such that: An advanced analysis of the UPGMA dendrograms shown in Figure 3 discloses that: • The three morphological ratios (Red group) are not correlated with any other variable (similarity close to or equal to zero)-PtoA and Compact form an uncorrelated group, and Fractal usually stands alone; the exception is in NLT.0 where Compact is the one by itself; • The two groups of SAR Information Descriptor, i.e., Green (central tendency) and Blue (dispersion), generally form a larger group-Geen + Blue-the exception is in NLT.7;     The phenon line, represented by the horizontal red dashed line in Figure 3 (i.e., 0.3 Pearson's r correlation coefficient) defines the actual groups from which we select one variable of each-groups are formed when this cut-off line crosses a vertical line (i.e., branch or edge) [49,50]. In fact, the groups formed in this manner match the preliminary visual analysis of the dendrograms: One should pay close attention to the two Red groups, as from them, three variables are selected-e.g., NLT.0 (Fractal, PtoA, and Compact)-because such variables have no correlation.
The number of selected variables ranges between 4 and 7 variables, depending on the NLT (

Phase 6: Principal Component Analysis (PCA)
The scatterplots show a large overlap between seeps and spills, but their centroids are somehow distinctively independent of NLT. When all variables (16 or 19) are directly input to the PCA, the cumulative variance of the selected PCs (3 to 7) ranges between 80 to 90% for all NLTs. However,

Phase 6: Principal Component Analysis (PCA)
The scatterplots show a large overlap between seeps and spills, but their centroids are somehow distinctively independent of NLT. When all variables (16 or 19) are directly input to the PCA, the cumulative variance of the selected PCs (3 to 7) ranges between 80 to 90% for all NLTs. However, when the input is the UPGMA selected variables (4 to 7), the PC-selection (2 to 4 PCs) shows a much lower cumulative variance: from 52% to 70%; the exceptions are the Logarithm functions (NLT.2: Log 10 ; and NLT.3: Ln) with 99.5% (2 PCs). Table 4 reports the number of selected PCs and their cumulative variance per NLT.   Figure 4 portrays the scheme defining these four input dataset versions for each NLT (8x). Another improvement from our earlier studies is that besides exploring the seep-spill discrimination capabilities of using the PC-scores and values of the variables, as well as the sole use of Area with Perimeter as before [19][20][21][22], we also test a separate analysis with a pair of Size Information Descriptors (PtoA with Compact) and with a pair of SAR Information Descriptors (AVG with SKW)see Figure 4. These are chosen based on the interpretation of the UPGMA dendrograms (Phase 5: Section 3.5.1-see also Figure 3). Although the histograms of the Discriminant Functions' axes show that seep and spill properties overlap, independent of NLT, their centroids are separate.

Phase 8: Confusion Matrices (2-by-2 Tables)
Each NLT is evaluated with the four input dataset versions (Figure 4), and usually, Set.1 presents the highest discrimination power. However, these variables (16 or 19) are strongly correlated ( Figure  3) and do not fulfill a Discriminant Functions requirement to use independent, or the least as correlated as possible, attributes [62]. The second best discrimination accuracy occurs with Set.2, which is closely followed by Set.3. The lowest observed accuracies are from Set.4, as the selected PCs have a very low cumulative variance in the selected PCs; the exceptions are the Logarithm functions (NLT.2: Log10; and NLT.3: Ln-see Table 4).  Figure 4 portrays the scheme defining these four input dataset versions for each NLT (8x). Another improvement from our earlier studies is that besides exploring the seep-spill discrimination capabilities of using the PC-scores and values of the variables, as well as the sole use of Area with Perimeter as before [19][20][21][22], we also test a separate analysis with a pair of Size Information Descriptors (PtoA with Compact) and with a pair of SAR Information Descriptors (AVG with SKW)-see Figure 4. These are chosen based on the interpretation of the UPGMA dendrograms (Phase 5: Section 3.5.1-see also Figure 3). Although the histograms of the Discriminant Functions' axes show that seep and spill properties overlap, independent of NLT, their centroids are separate.  (Table 5) and Set.3 (Table 6) indicates that these two attribute selection strategies-i.e., 1) no UPGMA variable selection with PCA; and 2) UPGMA selected variables without PCA-promote comparable seep-spill discrimination accuracies.
A careful analysis of

Phase 8: Confusion Matrices (2-by-2 Tables)
Each NLT is evaluated with the four input dataset versions (Figure 4), and usually, Set.1 presents the highest discrimination power. However, these variables (16 or 19) are strongly correlated ( Figure 3) and do not fulfill a Discriminant Functions requirement to use independent, or the least as correlated as possible, attributes [62]. The second best discrimination accuracy occurs with Set.2, which is closely followed by Set.3. The lowest observed accuracies are from Set.4, as the selected PCs have a very low cumulative variance in the selected PCs; the exceptions are the Logarithm functions (NLT.2: Log 10 ; and NLT.3: Ln-see Table 4).
The  When considering the separate analysis of the Oil-Slick Information Descriptors (i.e., Size and SAR), Set.3 and Set.4 (see Figure 4: without PCA and with PCA, respectively) present the same result-these are shown in Table 7. The foremost outcome revealed in Table 7 is that the sole use of SAR Information Descriptors (AVG with SKW) is not as effective as using only Size Information Descriptors (Area with Perimeter and PtoA with Compact). Table 7 also discloses that these two pairs of Size Information Descriptors have the same results in the Logarithm function (NLT.2), and in fact, these results present superior discrimination power than in the other two analyzed NLTs, i.e., NLT.0 (No Transformation) and NLT.6 (Cube Root). Slightly better Overall Accuracies are achieved when using Area with Perimeter than PtoA with Compact; however, one should note that the False Negatives of the former pair are much higher than those of using the second pair: 67.7% against 21.0% (NLT.0), and 43.4% against 28.9% (NLT.6).
We can also evaluate the results of using several variables (Tables 5 and 6) against the use of individual pairs of attributes, i.e. the separate analysis of Size and SAR Information Descriptors (Table 7). If one compares the outcomes of NLT.2 (Log 10 ) in Tables 5-7, it is possible to notice that the sole use of the two Size Information Descriptor pairs (Table 7) has equivalent results as the ones from the other two attribute selection strategies, i.e., no UPGMA variable selection with PCA (Set.2: Table 5) versus UPGMA selected variables without PCA (Set.3: Table 6).

Conclusions
Our research addresses a gap in our scientific knowledge regarding the discrimination of the oil-slick category, i.e., sea surface expression of oil seeps versus oil spills observed in Campeche Bay ( Figure 1). We report on analyses to refine the ability of using SAR-derived measurements for this task, thus addressing expanded Data Processing Segments (A, B, C, and D in Figure 2) as compared to our previous investigations [19][20][21][22]. A firmer basis to discriminate slicks by category has been established with the specific data-driven design of our Exploratory Data Analysis (EDA). An innovative strategy to select uncorrelated attributes based on the Bonferroni Adjustment (i.e., Pearson's r correlation coefficient of 0.3 [51]) has been successfully implemented using rooted tree dendrograms (Unweighted Pair Group Method with Arithmetic Mean: UPGMA-see Figure 3). We investigate several Non-Linear Transformations (NLTs-see Phase 4: Data Transformation Approaches) and various strategies to select uncorrelated attributes: we tested more than 32 combinations of Data Transformation Approaches, i.e., eight NLTs versus four input dataset versions (see Set.1, Set.2, Set.3, and Set.4 in Phase 7: Discriminant Function- Figure 4).  Figure 3 and Table 3) without the dendrogram selection (no UPGMA: Unweighted Pair Group Method with Arithmetic Mean) but with the application of the PCA (Principal Component Analysis-see Phase 6: Section 3.6- Table 4). Note that NLT.2 and NLT. 3 Figure 3 and Table 3) without the dendrogram selection (no UPGMA: Unweighted Pair Group Method with Arithmetic Mean) but with the application of the PCA (Principal Component Analysissee Phase 6: Section 3.6- Table 4). Note that NLT.2 and NLT.3 have the same outcome.  Figure 3 and Table 3) and without the application of the PCA (Principal Component Analysis-see Phase 6). Note that NLT.2 and NLT.3 have the same outcome.   Figure 3 and Table 3) and without the application of the PCA (Principal Component Analysis-see Phase 6). Note that NLT.2 and NLT.3 have the same outcome.   Figure 4.
Based on our comprehensive approach to find a simple way to discriminate seeps from spills, we are able to answer the three scientific questions: 1.
The two Logarithm functions (NLT.2: Log 10 ; and NLT.3: Ln) and Cube Root (NLT.6) have the most accurate seep-spill discrimination among the eight Data Transformation Approaches tested.

2.
Of the different strategies tested for selecting relevant attributes (i.e., four input dataset versions-see Phase 7: Section 3.7), two (Set.2 and Set.3) have comparable discrimination power with Overall Accuracies of almost 70%; however, the sole use of UPGMA dendrograms (i.e., Set.3) excels at selecting uncorrelated variables as it provides a simpler form avoiding the implementation of additional Multivariate Data Analysis Techniques (i.e., PCA). This is clearly observed in an inspection of Table 6 (Set.3) and in a comparison with Table 5 [Set.2: the use of all variables (see Phase 5: Section 3.5.1- Figure 3 and Table 3) without the dendrogram selection (i.e., no UPGMA) but with the application of the PCA (see Phase 6: Section 3.6- Table 4)].

3.
The use of a collection of variables from two attribute selection strategies, i.e., Set.2 [no UPGMA with PCA (19 or 16 attributes but with 3 to 7 PCs- Table 5)] and Set.3 [UPGMA and no PCA (4 to 7 variables- Table 6)] is equally capable of discriminating seeps from spills. However, these are comparable to the sole use of the two Size Information Descriptor pairs (Area with Perimeter and PtoA with Compact) that outperform the SAR Information Descriptor pair (AVG with SKW)-see Table 7.
Our EDA also demonstrates that using simple and low-cost RADARSAT-2 beam modes (SCNA and SCNB), one can achieve useful seep-spill discrimination accuracies, thus supporting new products for the RADARSAT Constellation Mission (RCM): RADARSAT-2 Mode Selection for Maritime Surveillance (R2MS2).
Author Contributions: The paper was conceived and written by G.A.C. under the supervision of P.J.M., E.T.P., F.P.M., and L.L. All authors participated of the research conceptualization, experiment design, data analysis/interpretation, as well as of the investigation quality improvement, read, edit, and approval of the final manuscript.
Funding: Financial support has been provided by the Programa Nacional de Pós Doutorado (PNPD) of Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.