Probabilistic Substrate Classification with Multispectral Acoustic Backscatter: A Comparison of Discriminative and Generative Models

We propose a probabilistic graphical model for discriminative substrate characterization, to support geological and biological habitat mapping in aquatic environments. The model, a fully connected conditional random field (CRF), is demonstrated using multispectral and monospectral acoustic backscatter from heterogeneous seafloors in Patricia Bay, British Columbia, and Bedford Basin, Nova Scotia. Unlike previously proposed discriminative algorithms, the CRF model considers both the relative backscatter magnitudes of different substrates and their relative proximities. The model therefore combines the statistical flexibility of a machine learning algorithm with an inherently spatial treatment of the substrate. The CRF model predicts substrates such that nearby locations with similar backscattering characteristics are likely to be in the same substrate class. The degrees of allowable proximity and backscatter similarity are controlled by parameters that are learned from the data. CRF model results were evaluated against a popular generative model, the Gaussian Mixture Model (GMM), which includes no spatial dependencies, only covariance between substrate backscattering responses at different frequencies. Both models are used in conjunction with sparse bed observations/samples in a supervised classification. A detailed accuracy assessment, including a leave-one-out cross-validation analysis, was performed for both models. Using multispectral backscatter, the GMM model trained on 50% of the bed observations achieved average accuracies of 75% and 89% in Patricia Bay and Bedford Basin, respectively. The same metrics for the CRF model were 78% and 95%. Further, the CRF model achieved a 91% mean cross-validation accuracy across four substrate classes at Patricia Bay, and a 99.5% mean accuracy across three substrate classes at Bedford Basin, which suggests that the CRF model generalizes extremely well to new data.
This analysis also showed that the CRF model was much less sensitive to the specific number and locations of bed observations than the generative model, owing to its ability to incorporate spatial autocorrelation in substrates. The CRF therefore may prove to be a powerful ‘spatially aware’ alternative to other discriminative classifiers.


Introduction
High resolution bathymetry and acoustic backscatter can be gathered in a cost-effective manner over large areas in aquatic systems with multibeam sonar. Automated substrate classification for benthic habitat and geologic mapping is a growing area of research [1][2][3], due to the efficiency with which dense information on both terrain and substrates may be collected with a single instrument, and driven by a pressing need for accurate substrate maps for the purposes of marine spatial planning and management, design of marine protected areas, fisheries resource management, maritime archeology, minerals exploration, and scientific research in the fields of submarine geology and benthic ecology [4,5]. Similar needs in freshwater environments are driving increasing use of multibeam backscatter in rivers and lakes [6,7]. Recently, studies on the appropriate collection and analysis of backscatter [8,9], including the uncertainty [10] and repeatability [11] of backscatter measurements, have made substrate acoustics a more accessible field for ecologists and geologists. The quantitative analysis of substrate topography has been reviewed by [12]. General overviews of substrate classification using multibeam acoustic backscatter are provided by [2,13].
Existing approaches to benthic substrate characterization using data collected by multibeam echosounders can be categorized as unsupervised, which determine the type and number of substrates from the data itself, or supervised (or 'task-specific'), which estimate substrates based on some independent information about the distribution of sediments, such as physical samples or benthic photographs. Many unsupervised approaches employ analytical models that capture the physics of substrate-sound interactions [14,15] and are especially suitable for relatively simple clastic substrates at relatively low acoustic frequencies [6]. Supervised methods are more commonly used for heterogeneous substrates with a large biogenic component [4,16], especially at relatively high frequencies (several hundred kilohertz) [17]. Supervised approaches typically require a set of training samples to find the relationship between relatively sparse ground-truth bed observations and features derived from multibeam data, in order to quantitatively describe the associated substrate [2]. Extrapolating this relationship to broad scales from limited observations requires robust spatial modeling. To that end, machine learning has proved a popular approach based on single frequency (monospectral) multibeam backscatter [4,[18][19][20], for two principal reasons. First, the outputs are readily amenable to probabilistic interpretation, whereas quantification of spatially distributed uncertainty is more difficult to achieve using unsupervised approaches. Second, technological advances in collecting images of the bed using cabled and autonomous video systems [21] have been commensurate with advances in global positioning systems and inertial navigation, sonar hardware, and automated substrate classification algorithms.
The goal of all probabilistic models for substrate classification is to find P(y|x), that is, the conditional probability distribution of a single discrete label y = y_i given a vector of features x_i assigned to grid node i = {1, . . ., N}, where y_i can take any value from a pre-defined set, M = {1, 2, . . .}, of numeric labels each corresponding to an observed discrete substrate type. Feature vector x may in practice be any combination of the primary data streams from multibeam echosounders (bathymetry and backscatter), derived bathymetric products such as roughness, slope, aspect, etc. [4,18], or derived backscatter metrics [22]. One way to formalize the problem is as a function approximation. We assume y = f(x) for some unknown function f. The goal of a machine learning algorithm is to estimate f given a labeled training data set, and then to make predictions using ŷ = f̂(x) (the ˆ symbol denotes an estimate). To handle ambiguity, the most probable substrate type is found by solving

$\hat{y} = \arg\max_{c \in C} P(y = c \mid \mathbf{x})$,

where C is a vector of possible substrate classes and

$P(y = c \mid \mathbf{x}) = \frac{\exp(\mathbf{x}^{T}\mathbf{w}_{c})}{\sum_{k=1}^{K}\exp(\mathbf{x}^{T}\mathbf{w}_{k})}$

is the softmax function [23], in which x^T w denotes the inner product of x and weight w, and K is the number of possible outcomes. One might break the classification problem down into two stages: the inference stage, in which the training data (substrate features, x, and their substrate labels, y) are used to learn a model for P(y|x), and the subsequent decision stage, where these posterior probabilities are used to assign optimal labels. An alternative is to solve both the inference and decision stages simultaneously, by learning a discriminant function from the input data that maps inputs x directly into labels y. The difference between these two approaches is the difference between a generative [7,24] and a discriminative probabilistic model (Figure 1). A generative model estimates the probability distributions of individual substrate labels (classes) to then generate estimates of their posterior probabilities. With a discriminative model, instead of first modeling the joint distribution of x and y together, P(x, y), a conditional distribution of y given x, P(y|x), is modeled directly, without attempting to model underlying (or 'generating') probability distributions. This bypasses the need to capture the distributions over x, to model the correlations between them, and therefore to specify an underlying prior model that describes P(x, y). Any chosen set of features x may have complicated dependencies, and the relationships among x, and between x and y, might change significantly when applied to the next data set, the next location, the next time, the next spatial scale, etc. Discriminative approaches, including neural networks [25], support vector machines [18], decision trees [26], and random forests [27], have therefore been popular. Both approaches train a model from the data that makes optimal label assignments. The generative approach first models the label-conditioned probabilities (P(x|y)) for each label class individually, or the joint distribution (P(x, y)) between substrate features and substrate types (labels), then finds the posterior probabilities of each label (P(y|x)). The discriminative approach learns a decision boundary (or discriminant function) between labels that maps substrate features directly into labels.
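As a minimal numerical sketch of the softmax classification described above: the weight matrix W (one row per class) and the two-band feature vector x below are invented purely for illustration, not fitted to any backscatter data.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical weights w_c for K = 3 substrate classes and a 2-band
# backscatter feature vector x (values are illustrative only).
W = np.array([[ 0.9, -0.2],   # class 0 (e.g., mud)
              [-0.1,  0.8],   # class 1 (e.g., sand)
              [-0.6, -0.5]])  # class 2 (e.g., gravel)
x = np.array([1.2, 0.3])

posteriors = softmax(W @ x)         # P(y = c | x) for each class c
y_hat = int(np.argmax(posteriors))  # MAP label assignment
```

The arg max over the softmax outputs implements the two-stage view: the posteriors are the inference stage, and the arg max is the decision stage.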
With the recent development of multispectral multibeam technology [28,29], it is now possible to examine the backscattering response of a set of substrates across multiple frequencies. With multispectral backscatter, a discriminative approach has much more information with which to draw its decision boundaries. However, until it is understood how substrates respond to multiple frequencies, we require a substrate model that does not rely on a statistical assumption of independence, i.e., the assumption that the substrate features do not depend on or affect each other. To that end, we describe a discriminative substrate classification algorithm that relaxes the independence assumption, for use with either mono- or multispectral backscatter, and that may be applied in a supervised (y drawn from direct observations of the bed) framework. We evaluate the skill of the proposed classification method using multispectral backscatter data at two sites, comparing the classification results with those from a more standard generative machine learning algorithm, namely a Gaussian Mixture Model [6,7,24].

Graphical Models
Graphical models are an efficient way to describe complex, high-dimensional distributions. Suppose we have a simple seafloor everywhere consisting of either sand or gravel, and suppose we also have three extracted seafloor features called feature 1, feature 2, and feature 3. These features could be any relevant metric extracted from backscatter, bathymetry, or other sources of available information such as hydrodynamic maps. Let's further suppose each of our features may take values 1 through 100. Overall, therefore, our probability space has 2 (sand or no sand) × 2 (gravel or no gravel) × 100 × 100 × 100 = 4 million values, corresponding to the possible assignments of the five variables. Our task is to evaluate the likelihood of each. Computing a joint probability distribution over this space allows us to ask questions such as how likely the substrate is sand given minimal values of all three features. The probability expression for this case encodes our beliefs about this particular situation, and would be

P(sand = true | feature 1 = 1, feature 2 = 1, feature 3 = 1).

However, specifying a joint probability distribution over four million possible values could be a substantial computational task and, in reality, the seafloor typically consists of many more classes, and it may be possible to extract many more features that have many more possible values. Graphical models are an efficient way to describe such complex, high-dimensional distributions. The nodes in the graph correspond to our variables (substrates and features) and the edges between them correspond to the probabilistic interactions between them. The goal is to infer the structure of the graph, which tells us the set of independencies that hold for the distribution [30]. For example, our 'target' distribution P in Equation (3) may satisfy the conditional independence (feature 1 ⊥ feature 2 | gravel, sand), where ⊥ denotes 'independent from', which asserts that

P(feature 1 | gravel, sand, feature 2) = P(feature 1 | gravel, sand).

The above says that, if we are interested in the distribution over feature 1, and we know whether the substrate is sand or gravel, feature 2 is no longer informative. However, this doesn't imply that feature 2 is independent of feature 1, only that all the information we may obtain from feature 2 on the chances of feature 1 we already obtain by knowing whether the substrate is sand or gravel. The other advantage of the graphical model is that, rather than encoding the probability of each of the 4 million values in our example, it facilitates breaking up the problem into smaller factors, each over a much smaller space of possibilities. The overall distribution is found as the product of these factors. For example, suppose that the probability of the event "feature 1 = 50, no sand, gravel, feature 2 = 100, feature 3 = 1" is found by multiplying five numbers: P(feature 1 = 50), P(sand = false | feature 1 = 50), P(gravel = true | feature 1 = 50), P(feature 2 = 100 | sand = false, gravel = true), and P(feature 3 = 1 | gravel = true). This representation is significantly more compact, requiring far fewer parameters than the original four million. The graph structure defines the factorization of P as a set of factors and the variables they encompass, called their 'scope' [30].
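The factorized bookkeeping described above can be sketched in a few lines; all probability values below are invented purely to illustrate how five small factors replace one enormous joint table.

```python
# Toy factorization of the sand/gravel example: instead of one table with
# ~4 million entries, the joint probability of a single assignment is the
# product of five small factors. All values are illustrative only.
p_f1        = 0.02   # P(feature 1 = 50)
p_nosand_f1 = 0.70   # P(sand = False | feature 1 = 50)
p_gravel_f1 = 0.60   # P(gravel = True | feature 1 = 50)
p_f2_given  = 0.01   # P(feature 2 = 100 | sand = False, gravel = True)
p_f3_gravel = 0.05   # P(feature 3 | gravel = True)

# Joint probability of the full assignment = product of the factors
joint = p_f1 * p_nosand_f1 * p_gravel_f1 * p_f2_given * p_f3_gravel
```

Each factor's 'scope' is only the handful of variables it mentions, which is what makes the representation compact.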

Conditional Random Field
A conditional random field (CRF) is an undirected graphical model that we use here to probabilistically predict substrate labels based on weak supervision, which in the present case is discrete label annotations of substrate types at discrete locations. Seafloor (backscatter) features x and their associated labels y are mapped to graphs, whereby each node is connected by an edge to its neighbors according to a connectivity rule. Linking each node of the graph created from x to every other node enables modeling of the long-range spatial connections within the data by considering both proximal and distal pairs of grid nodes, resulting in pixel (fine) scale labeling at boundaries and transitions between different label classes.
Conditional Random Fields (CRFs) [31] have been used extensively for task-specific predictions, where estimates of labels for sparse regions of the data are used in conjunction with the underlying features to draw decision boundaries for each label, resulting in a label per feature [32,33]. The primary advantage of CRFs for substrate classification is the relaxation of the independence assumption [31], which says that the substrate features do not depend on each other. Especially for multispectral backscatter, this is not always the case, and it could lead to serious inaccuracies.
A CRF is parameterized the same as a Gibbs distribution, but is normalized differently. We start with an unnormalized measure of the joint distribution,

$\tilde{P}_{\Phi}(x, y) = \prod_{i=1}^{I} \phi_{i}(D_{i})$,

where Φ = {φ_1(D_1), . . ., φ_I(D_I)}, the φ_i are factors, and the D_i are sets of vectors of backscatter at each location i that are termed the 'scope' of their associated φ_i [30] (see previous subsection). In order to model the conditional distribution P(y|x), a normalization constant (called a partition function), Z, is required that is a function of x [30]:

$Z(x) = \sum_{y} \tilde{P}_{\Phi}(x, y)$.

Then the probability of a substrate labeling y given the multibeam-derived data x is

$P(y \mid x) = \frac{1}{Z(x)} \tilde{P}_{\Phi}(x, y)$.

We model P_Φ(x, y) using a Gibbs energy function, E, and the conditional distribution is rewritten

$P(y \mid x, \theta) = \frac{1}{Z(x)} \exp\left(-E(y, x, \theta)\right)$,

where θ = [θ_β, θ_γ] are a set of tunable model parameters that are described below. Minimizing the energy in (8) yields the most probable label assignment, whereby the maximum a posteriori (or MAP) labeling (y ∈ M) is ŷ = arg max_{y∈M} P(y|x, θ), which chooses the most likely y considering x. Features x are mapped to graphs, where each pixel in x represents a graph node, and every node is connected with an edge to its neighbors according to a connectivity rule. In CRFs based on 'local' connectivity, nodes connect adjacent pixels in x [31,34], whereas in the fully connected definition, each node is linked to every other [35].
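The relationship between energies, the partition function, and the MAP assignment can be illustrated with a toy computation; the three candidate-labeling energies below are invented for illustration only.

```python
import numpy as np

# Toy illustration of Gibbs normalization: given energies E(y, x) for
# each candidate labeling y, the partition function Z(x) converts the
# unnormalized Gibbs measure exp(-E) into a distribution P(y | x).
energies = np.array([1.0, 2.5, 0.3])  # E for three candidate labelings
unnorm = np.exp(-energies)            # unnormalized Gibbs measure
Z = unnorm.sum()                      # partition function Z(x)
P = unnorm / Z                        # conditional distribution P(y | x)
y_map = int(np.argmin(energies))      # MAP labeling: minimum energy
```

Note that the minimum-energy labeling is necessarily the maximum-probability labeling, which is why inference can work with energies directly and avoid computing Z.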

Fully Connected Conditional Random Field
Linking each node of the graph created from x to every other enables modeling of the long-range connections within the data by considering both proximal and distal pairs of grid nodes, resulting in pixel (fine) scale labeling at boundaries and transitions between different substrates. Equation (8) is obtained by summing unary (ψ_i(y_i)) and pairwise (ψ_ij(y_i, y_j)) potentials:

$E(y, x, \theta) = \sum_{i} \psi_{i}(y_{i}) + \sum_{i<j} \psi_{ij}(y_{i}, y_{j})$,

where i and j range from 1 to N, the number of backscatter grid cells. The vectors f_i and f_j are features created from x. Here, f_i and f_j are a function of both the relative positions and the magnitudes of the data. Whereas unary potentials depend on the label describing the backscattering of the substrate at a single grid location, pairwise potentials depend on the labels of the acoustic response at two grid locations. The unary potentials, ψ_i, define a log-likelihood over the label assignment y, and therefore represent the cost of assigning label y_i to grid node i. In this paper, unary potentials are defined using interpreted photographic observations or physical samples of the bed at discrete locations.
The pairwise potentials ψ_ij(y_i, y_j) are the cost of simultaneously assigning label y_i to grid node i and y_j to grid node j, defined as

$\psi_{ij}(y_{i}, y_{j}) = \Lambda(y_{i}, y_{j}) \sum_{l} k_{l}(\mathbf{f}_{i}, \mathbf{f}_{j})$,

where l indexes the feature vectors derived from x, and each k_l is a function that determines the similarity between connected grid nodes by means of a given feature f_l. The function Λ quantifies label 'compatibility' by imposing a penalty for nearby similar grid nodes that are assigned different labels. The so-called 'Potts' model is used for the compatibility function, Λ(y_i, y_j) = [y_i ≠ y_j]; that is, Λ(y_i, y_j) = 1 if y_i ≠ y_j and |i − j| ≤ µ, and 0 otherwise. We make the proximity tolerance µ (units of pixels, controlling the minimum distance between i and j that may be assigned different labels without penalty) a free parameter.
A computationally efficient inference algorithm for a fully connected CRF was introduced by [35], whereby the pairwise potentials (10) are linear combinations of Gaussian kernels, k, which are filters applied to the backscatter data that preserve strong edges [36]. We use a simplified version of the pairwise kernels proposed by [35], given by:

$k(\mathbf{f}_{i}, \mathbf{f}_{j}) = \exp\left(-\frac{|x_{i} - x_{j}|^{2}}{2\theta_{\beta}^{2}}\right) + \exp\left(-\frac{|p_{i} - p_{j}|^{2}}{2\theta_{\gamma}^{2}}\right)$,

where p_i and p_j are grid positions. The first kernel quantifies the observation that nearby grid nodes with similar multifrequency backscatter response are likely to be in the same substrate class. The degree of similarity is controlled by the parameter θ_β (non-dimensional). As θ_β increases, larger differences in the l-th feature are tolerated. The second kernel removes small isolated regions. Following [35], we set θ_γ = 1. The parameters µ and θ_β are then determined from the data. The true distribution, P(y), cannot be directly computed [35]. Rather, it is approximated through inference by evaluating a set of candidate distributions Q(y), minimizing the Kullback–Leibler (KL) divergence to the true distribution: the product of independent marginals Q(y_ι|x) over each variable is computed as an approximation of P(y|x) using the iterative algorithm detailed by [35,37] to minimize

$\Delta(Q \| P) = \sum_{\iota} Q_{\iota} \ln(Q_{\iota}/P_{\iota})$.
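A minimal sketch of these two ingredients follows, assuming the simplified two-kernel form described above (an appearance term scaled by θ_β plus a smoothness term on grid position scaled by θ_γ); the node backscatter vectors, positions, and parameter values are illustrative, and this is not the authors' PriSM implementation.

```python
import numpy as np

def pairwise_kernel(x_i, x_j, p_i, p_j, theta_beta=300.0, theta_gamma=1.0):
    """Sum of an appearance kernel (backscatter similarity, width theta_beta)
    and a smoothness kernel (spatial proximity, width theta_gamma)."""
    appearance = np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * theta_beta ** 2))
    smoothness = np.exp(-np.sum((p_i - p_j) ** 2) / (2.0 * theta_gamma ** 2))
    return float(appearance + smoothness)

def kl_divergence(Q, P):
    """KL(Q || P) for discrete distributions, the quantity minimized
    during mean-field inference."""
    Q, P = np.asarray(Q, float), np.asarray(P, float)
    return float(np.sum(Q * np.log(Q / P)))

# Two nodes with similar three-band backscatter, 5 m apart on the grid
k = pairwise_kernel(np.array([100.0, 200.0, 300.0]),
                    np.array([110.0, 190.0, 310.0]),
                    np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```

With θ_β = 300, small backscatter differences leave the appearance kernel near 1, while the smoothness kernel decays quickly with distance when θ_γ = 1.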

Generative Model
The generative approach models the label-conditional probabilities and their prior probabilities. By trying to capture the probability distribution over all the variables x (backscatter, roughness, etc.), the probability distribution of each label, y, is explicitly modeled such that label estimates y* are found using

$y^{*} = \arg\max_{y} P(x \mid y)\, P(y)$.

With the joint probability distribution function, given a particular y, one can then calculate (or 'generate') its respective x. For this reason they are called generative models: sampling from them can 'generate' synthetic data.

Naïve Bayes Model
One approach is to infer the label-conditioned probabilities P(x|y_i) for each individual label y_i, and separately infer the prior label probabilities P(y_i). Then one may find the posterior label probabilities P(y|x) using Bayes' theorem,

$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$,

where P(x) = ∑ P(x|y)P(y). For illustrative purposes, perhaps the simplest way to accomplish this would be to assume that, once the class label is known, all the features are independent. The resulting so-called 'naïve Bayes' classifier is

$P(y \mid x) \propto P(y) \prod_{l} P(x_{l} \mid y)$,

which ignores the correlations between the features. For most substrate classification tasks, this is an unreasonable assumption. For example, the backscatter response of a given substrate type at a given frequency would be correlated, at least to some degree, with that at a different frequency. Further, the backscatter response of a given substrate at a given frequency would correlate with that of a similar substrate. If backscatter and roughness were both input features, they would likely correlate with each other. In short, any chosen set of features x may have complicated dependencies, and the relationships among x, and between x and y, might change significantly when moving to the next data set, the next location, the next time, the next spatial scale, etc.
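A minimal Gaussian naïve Bayes sketch of the classifier above, with per-class, per-feature Gaussians; the synthetic two-band 'backscatter' training values and class means are invented for illustration only.

```python
import numpy as np

# Synthetic stand-ins for per-frequency backscatter of two substrates
rng = np.random.default_rng(0)
X_mud  = rng.normal([-30.0, -35.0], 2.0, size=(50, 2))  # class 0
X_rock = rng.normal([-18.0, -20.0], 2.0, size=(50, 2))  # class 1
X = np.vstack([X_mud, X_rock])
y = np.array([0] * 50 + [1] * 50)

# Per-class Gaussian parameters and priors, estimated from the labels
means  = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
stds   = np.array([X[y == c].std(axis=0)  for c in (0, 1)])
priors = np.array([np.mean(y == c) for c in (0, 1)])

def predict(x):
    # log P(y) + sum_l log P(x_l | y), i.e., features assumed
    # independent given the class label
    loglik = -0.5 * ((x - means) / stds) ** 2 - np.log(stds)
    return int(np.argmax(np.log(priors) + loglik.sum(axis=1)))
```

The product over features in the log domain becomes a sum; the ignored off-diagonal covariance is exactly what the GMM below reinstates.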

Gaussian Mixture Model
A potentially significant drawback of generative models such as Equation (15) for substrate characterization is that they ignore any correlation structure between the features (such as backscatter), which is often an incorrect independence assumption that leads to spurious statistical confidence between features and their labels. Therefore, more sophisticated types of generative models than Equation (15) are typically used for substrate classification, allowing quantification of the correlations between x. Instead of using Equation (13) to find the posterior probabilities of each label, one may model the joint distribution of x and y together, P(x, y), and then normalize to obtain posterior probabilities. An example of one such approach is the Gaussian Mixture Model (GMM) [6,7,24,38], which is a weighted sum of M component Gaussian probability density functions, g, with unknown parameters, given by (e.g., [7])

$P(x \mid \lambda) = \sum_{m=1}^{M} w_{m}\, g(x \mid \mu_{m}, \Sigma_{m})$,

where the w_m are the mixture weights such that ∑ w_m = 1 and 0 ≤ w_m ≤ 1, and g(x|µ_m, Σ_m) are the m = 1 : M component Gaussian densities, where λ = [w_m, µ_m, Σ_m], µ_m is the mean, and Σ_m is the covariance matrix for the mth component. The latent variable vector λ is estimated using the expectation-maximization (EM) algorithm, which maximizes the likelihood of the model given the training data (see [7,39] for more details of the implementation). The relationships between x are modeled by means of w and Σ. There are many functional forms for Σ [7], but for simplicity, here we only consider the full covariance matrix between the features x.
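The mixture density above can be evaluated directly for fixed parameters, as in this sketch for a two-component mixture over two backscatter frequencies with full covariance matrices; all parameter values are illustrative, not fitted by EM.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density g(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(Sigma))
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

w      = np.array([0.6, 0.4])                      # mixture weights, sum to 1
mus    = [np.array([-30.0, -35.0]), np.array([-18.0, -20.0])]
Sigmas = [np.array([[4.0,  1.0], [ 1.0, 3.0]]),    # full covariance: the
          np.array([[5.0, -1.0], [-1.0, 4.0]])]    # off-diagonal terms couple
                                                   # the two frequencies
def gmm_density(x):
    # p(x | lambda) = sum_m w_m g(x | mu_m, Sigma_m)
    return sum(wm * gaussian_pdf(x, m, S) for wm, m, S in zip(w, mus, Sigmas))
```

The off-diagonal entries of each Σ_m are what allow the GMM, unlike naïve Bayes, to capture correlation between the backscatter responses at different frequencies.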

Backscatter
All backscatter data used in this study were collected using an R2Sonic 2026 multibeam echosounder (R2Sonic LLC, Austin, TX, USA). All data were provided by R2Sonic for the 2017 Multispectral Challenge, with positional/attitude corrections applied, corrected for tide and sound velocity effects, and with gross outliers removed. Backscatter snippet (time-series) records were processed using the Qimera FMGeocoder Toolbox (FMGT) version 7.7.8 (QPS BV, Zeist, The Netherlands). Backscatter data were corrected for system frequency response (Table 1), transmit power, receive gain, beam pattern, and angular effects. Corrections for frequency-dependent spreading/absorption by water (Table 1) were computed using standard methods based on ideal seawater [40] (ignoring the influence of bubbles, suspended sediment, and biological organisms), given data from CTD casts taken during the surveys. Backscatter data were mosaicked using no data within 10° of nadir where possible, using 50% line blending, and filling gaps using adjacent backscatter where feasible. In order to meet the objectives of the present paper, it is necessary neither to compute and display the backscatter in units of decibels, nor to further process the data for additional acoustic or physical variables. Both the GMM and CRF models would function the same with backscatter in arbitrary linear units or acoustic power on a decibel scale. All backscatter grids were exported as 1 × 1 m resolution GeoTIFF images, and all subsequent processing was carried out using the 'PriSM' toolbox written by the authors (see Acknowledgments), implementing all methods in the above section. Two datasets were processed in this way. The first comes from Patricia Bay, British Columbia, Canada (survey conducted in November 2016). The second comes from Bedford Basin, Nova Scotia, Canada (survey conducted in April 2016). See [41] for a description of this site.

Bed Observations
Observations of the bed from 35 stations in Patricia Bay within the surveyed extent are available from [42], in six categories: fine sand (FS), gravel sand (GS), muddy sand (MS), sand (S), sand/cobble (SC), and sandy mud (mS). The categories GS and S had only one observation apiece; therefore, the one S observation was recategorized as FS, and the one GS observation was recategorized as SC. Four grouped categories were therefore used for the substrate classification.
In Bedford Basin, 61 georeferenced images of the substrate were provided by R2Sonic at 27 survey stations.Following [43], each photo was classified visually into one of three substrate classes, namely: Class 1-mud; Class 2-cobbles and boulders with interstitial mud; and Class 3-cobbles, gravel, and boulders.
The sparse labels, y, are derived from direct observations of the substrate at a few discrete locations, and the features x are derived from the multibeam data. Whereas both the CRF and GMM models are designed to work with multivariate input features x of any depth (i.e., any number of layers of information on coincident spatial grids), in this contribution we wish to explore the utility of backscatter alone for supervised substrate characterization. We compare models using inputs of both traditional monospectral (grids of backscattering response at individual outgoing acoustic frequencies) and multispectral backscatter, to explore and ultimately demonstrate the enhanced discriminatory power of the latter. Future work will explore any added classification benefits of additional information from products derived from the bathymetry and backscatter, as addressed further in the Discussion.
For both sites, it was necessarily assumed that the bed did not change in the time elapsed between bed sampling and the multibeam surveys, such that the bed observations taken prior to the multibeam surveys are an accurate reflection of the spatial distribution of substrates at the later time of the sonar surveys. Further, each bed observation was assigned to all grid nodes within a 10 m radius of its location. This is a reasonable assumption even for very spatially heterogeneous beds (see analyses presented by [7]), especially given any spatial uncertainties in the horizontal location ascribed to each observation, which are unknown but estimated to be of the order 1-10 m. Figure 2 shows false-color backscatter imagery and color-coded bed labels for each of the datasets used in this study. Frequency distributions of backscatter intensity were compiled based on locations of known substrate types (Figure 2A.1-A.3 for Patricia and Figure 2B.1-B.3 for Bedford). Despite the considerable amount of overlap between the resulting curves, which is partially a result of the inherently stochastic nature of backscatter [44] as well as the unaccounted-for variability within the substrate groups, the generally modest degree of separation between the peaks shows that the task of using the GMM or CRF model for pixel-level substrate classification is, at the very least, analytically tractable. This analysis also serves to demonstrate the potential value of multispectral over monospectral backscatter. In a general sense, because each substrate has a different distribution of backscatter magnitude according to outgoing frequency, there is much more information from which a probabilistic classifier may draw decision boundaries. For example, the backscatter associated with Bedford substrates Class 2 and Class 3 has a narrower range at high frequency compared to low, and all curves tend to be more multimodal at lower frequencies and more unimodal at higher frequencies. Further, the backscatter associated with Patricia Bay substrates mS and FS is only distinguishable at 100 kHz (Figure 2A.1).
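The 10 m radius assignment of bed observations to grid nodes can be sketched as follows, assuming the 1 m grid resolution of the exported GeoTIFFs so that cell indices approximate metres; the grid size, observation position, and class value are illustrative.

```python
import numpy as np

def label_radius(label_grid, obs_col, obs_row, label, radius=10.0):
    """Assign `label` to every grid node within `radius` cells of an
    observation at (obs_col, obs_row). With 1 m cells, radius 10 ~ 10 m."""
    ny, nx = label_grid.shape
    rows, cols = np.mgrid[0:ny, 0:nx]
    dist = np.hypot(cols - obs_col, rows - obs_row)
    label_grid[dist <= radius] = label
    return label_grid

grid = np.zeros((50, 50), dtype=int)        # 0 = unlabeled
grid = label_radius(grid, 25, 25, label=3)  # one observation of class 3
n_labeled = int((grid == 3).sum())          # cells inside the 10 m disc
```

In practice, an observation's easting/northing would first be converted to grid indices using the GeoTIFF's geotransform; that step is omitted here.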

GMM Model Implementation Details
Following [7], the GMM model was initialized using the observed per-substrate mean backscatter for µ_m, and unit weight w_m and covariance Σ_m. Model fitting continued until a convergence criterion was satisfied, which in the present study was when the average gain in posterior probability from the previous iteration fell below 0.01.
The number of Gaussians may be optimized using the data (see [6,7] for two different approaches).In the present study, however, in order to keep the focus on the relative performance of the two models and the two types of backscatter (monospectral and multispectral), the number of modeled Gaussians in the mixture was set to equal the number of substrates known to be present in the surveyed area (four for Patricia, and three for Bedford).Any posterior probability below a certain threshold was reclassified as 'unknown' substrate.This threshold was arbitrarily set to 0.8 for all data sets.A 50% subsample of the bed observations, drawn at random but with the condition that at least one bed observation per class was included, was used to train the model.The remaining 50% of the bed observations were used to test the performance of the substrate classification model.
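The 50% random subsample with the at-least-one-per-class condition can be sketched as below; the routine and the label counts are illustrative stand-ins, not the PriSM implementation.

```python
import random

def stratified_half_split(labels, seed=0):
    """Random 50% train/test split, guaranteeing at least one training
    observation per substrate class."""
    rng = random.Random(seed)
    idx_by_class = {}
    for i, c in enumerate(labels):
        idx_by_class.setdefault(c, []).append(i)
    train = set()
    for c, idx in idx_by_class.items():  # force one observation per class
        train.add(rng.choice(idx))
    remaining = [i for i in range(len(labels)) if i not in train]
    rng.shuffle(remaining)
    n_extra = max(0, len(labels) // 2 - len(train))
    train.update(remaining[:n_extra])    # top up to 50% of observations
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

# Illustrative class counts loosely echoing the Patricia Bay station labels
labels = ['FS'] * 10 + ['MS'] * 10 + ['SC'] * 3 + ['mS'] * 12
train_idx, test_idx = stratified_half_split(labels)
```

Posterior probabilities below the 0.8 threshold would then be reclassified as 'unknown' before scoring against the held-out half.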

CRF Model Implementation Details
To explore the sensitivity of the CRF model to θ_β, keeping µ constant at a value of 100, the model was run using 10 equal-increment values between 30 and 800, using multispectral backscatter, and the results were evaluated. Once the appropriate value for θ_β across all data sets had been decided and fixed, the model was run using 10 equal-increment values of the Potts compatibility parameter µ between 3 and 300. In these tests, all inference algorithms were run for 30 iterations. Once the appropriate value for µ across all data sets had been decided and fixed, we explored the effect of the number of inference iterations, from 1 to 20. Average posterior probabilities at the final model iteration tended to be very high for the CRF model. Therefore, in the sensitivity tests involving varying θ_β and µ, we instead used per-pixel posterior probabilities averaged over all model iterations, which is a more conservative estimate of probability, being, for each pixel, a function of the rate of convergence to the final posterior probability, and which therefore showed greater variability across the ranges of parameter values.
As with the GMM model, the final CRF model used a 50% subsample of the bed observations, drawn at random but with the condition that at least one bed observation per class was included, to train the model. The remaining 50% of the bed observations were used to test the performance of the substrate classification model. In all model runs except those with fewer bed observations, any posterior probability below a certain threshold was reclassified as 'unknown' substrate. For consistency with the GMM model, this threshold was set to 0.8.

Model Evaluation
Each model was trained using a 50% subsample of the bed observations, drawn at random but with the condition that at least one bed observation per class was included.The remaining 50% of the bed observations were used to test the performance.Any posterior probability below 0.8 was reclassified as 'unknown' substrate.A 'confusion matrix', which is the matrix of normalized correspondences between true and estimated labels is used to visualize model skill.A perfect correspondence between true and estimated labels is scored 1.0 along the diagonal elements of the matrix.Misclassifications are readily identified as off-diagonal elements.Systematic misclassifications are recognized as off-diagonal elements with large magnitudes.
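A minimal row-normalized confusion matrix of the kind described above can be computed as follows; the tiny label arrays are illustrative only.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    true class i predicted as class j; perfect skill is 1.0 on the diagonal."""
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C / C.sum(axis=1, keepdims=True)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 pixel misclassified as class 2
C = confusion_matrix(y_true, y_pred, 3)
```

A systematic confusion between two classes would appear as a persistently large off-diagonal entry in the corresponding row.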
We also performed leave-one-out cross-validation for multispectral inputs whereby, upon successive iterations, one bed observation is randomly removed, a previously removed observation is replaced, and the model refitted. We did this for each site and each model: 35 iterations for Patricia and 27 for Bedford, corresponding to the number of bed observation stations at each site. This procedure checks how well a model generalizes to new data by examining the variability in each classified grid cell over those iterations, and also provides a means to examine spatially which areas of the bed were sensitive to a lack of ground-truth bed observations. Such information could be used, for example, to target further bed sampling.
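The structure of this leave-one-out style loop can be sketched as below; `fit_and_classify` is a hypothetical stand-in for a full GMM or CRF fit-and-predict run, and the five dummy stations are illustrative.

```python
def loo_iterations(observations, fit_and_classify):
    """Withhold one bed observation per iteration, refit on the rest, and
    collect the resulting classified maps for per-pixel variability analysis."""
    maps = []
    for i in range(len(observations)):
        training = observations[:i] + observations[i + 1:]
        maps.append(fit_and_classify(training))
    return maps

obs = list(range(5))  # 5 dummy stations (35 for Patricia, 27 for Bedford)
# Stand-in classifier: here it just reports the training-set size
maps = loo_iterations(obs, fit_and_classify=lambda tr: len(tr))
```

Per-pixel agreement across the collected maps then measures sensitivity to the removal of any single station.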

Conditional Random Field (CRF) Model: Sensitivity to θβ and µ
The parameter θβ controls the degree of allowable similarity in backscatter between CRF graph nodes. A relatively large θβ means that backscatter features with relatively large differences in magnitude are assigned the same substrate label. The response to θβ was not sensitive to the value of µ. For each tested value of θβ, three metrics were computed: (1) accuracy (the number of correctly labeled pixels divided by the number of all tested pixels), (2) average posterior probability, and (3) the proportion of pixels with posterior probabilities > 0.8. An appropriate value for θβ was considered to be one at which all three metrics stabilized. Accuracy is based on comparing pixels with known and model-predicted labels. When θβ is small, accuracies (determined only at those calibration locations within the vicinity of available observations of the bed) are spuriously high, even though the average probability across the entire survey (i.e., in areas both inside and outside of the calibration) may be relatively low. As θβ increases, accuracy and percent classified both tend to decrease slightly, stabilizing at uncertainty values that are more realistic given the observed variability in backscatter vectors (Figure 3). Based on the above criteria, across data from both sites, the appropriate value was determined to be θβ = 300. The compatibility function imposes a penalty for grid nodes that have similar backscattering but are assigned different labels, up to a distance of µ. This 'proximity tolerance' parameter therefore specifies the distance between pairs of pixels beyond which they are considered far enough apart to have similar backscattering but different substrate labels. The model was run using 10 equal-increment values of µ between 3 and 300 (Figure 4). The optimal value of µ = 100 was determined by the same criteria used to determine θβ.
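The three stabilization metrics used to select θβ (and, identically, µ) could be tallied as in the following sketch, where `run_crf` is a hypothetical callable returning per-pixel class posteriors at the test locations for a given parameter value:

```python
import numpy as np

def sweep_metrics(param_values, run_crf, y_test):
    """For each candidate parameter value, compute (1) accuracy at labeled
    test pixels, (2) mean maximum posterior probability, and (3) the
    fraction of pixels with maximum posterior > 0.8."""
    rows = []
    for v in param_values:
        post = run_crf(v)                  # posteriors, shape (n_pixels, n_classes)
        pred = post.argmax(axis=1)
        pmax = post.max(axis=1)
        rows.append((np.mean(pred == y_test),   # (1) accuracy
                     pmax.mean(),               # (2) average posterior
                     np.mean(pmax > 0.8)))      # (3) proportion confidently classified
    # One row per candidate value; pick the value at which all columns stabilize.
    return np.array(rows)
```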

CRF: Number of Iterations
The effect of the number of CRF model iterations on the classification result was evaluated using average posterior probability as a metric. With θβ = 300 and µ = 100, the number of iterations was increased from 1 to 20, in increments of 1. Average probability rapidly increases from 1 to 5 iterations, stabilizing beyond approximately 10 iterations (Figure 5). The speed with which average posterior probabilities stabilize (the convergence rate) seems to be related to the spatial density of bed observations, with Patricia having the greater spatial density of bed observations and the faster convergence rate, and Bedford having the lower spatial density and correspondingly the lower convergence rate (Figure 5).
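A generic version of this stopping criterion, in which inference updates are iterated until the mean maximum posterior stabilizes, might look like the following sketch. The `step` function is a placeholder for the actual CRF inference update, which is not reproduced here:

```python
import numpy as np

def iterate_until_stable(step, q0, max_iter=20, tol=1e-3):
    """Apply the inference update repeatedly, tracking the mean maximum
    posterior per iteration, and stop once that metric changes by less
    than tol between successive iterations."""
    q, history = q0, []
    for _ in range(max_iter):
        q = step(q)
        history.append(q.max(axis=-1).mean())  # convergence metric (cf. Figure 5)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return q, history
```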

Patricia Bay Substrate Classification
The GMM approach was used to generate, from multispectral backscatter inputs, a substrate classification map (Figure 6A) and associated map of prediction probabilities (Figure 6B). Using optimal settings (θβ = 300, µ = 100, and 15 iterations), the CRF approach was also used to generate a substrate classification map (Figure 6C) and associated map of prediction probabilities (Figure 6D) using multispectral data. In all cases, there is excellent qualitative correspondence between substrate patches and their nearby bed observations. Scrutiny of Figure 6A,C reveals that the greatest difference between substrate maps was in the FS class, which had an overall greater extent in the GMM prediction (Figure 6A). The other significant difference between the two approaches was in the maps of prediction probabilities (Figure 6B,D), which tended to be far higher for the CRF model (Figure 6D). Rather than being an advantageous aspect of the CRF prediction, we suggest that those high probabilities are not particularly useful for assessing the spatial variability in the quality of substrate predictions. The GMM and CRF approaches were also used to generate substrate classification maps (Figure 7), using each of the three monospectral bands separately as inputs. There is generally a much greater propensity for the GMM model to (over-)predict the FS class (Figure 7A.1-A.3) because the GMM model takes into account only relative backscatter magnitudes, not their relative spatial locations as the CRF model does.

Bedford Basin Substrate Classification
Like at the Patricia Bay site, both the GMM and CRF approaches were used to generate substrate classification maps (Figure 8A,C) and associated maps of prediction probabilities (Figure 8B,D). The most noteworthy difference between the two model outputs is that the GMM model predicted a greater spatial extent of Class 2 substrates (Figure 8A) compared to the CRF model. This seems plausible considering that these areas correspond to relatively large magnitude backscattering (the western end of Figure 2B) across all three spectral bands. It seems likely that the CRF model penalized the prediction of Class 2 in this area because of its large distance to any Class 2 bed observations. Unfortunately, it is impossible to say whether the GMM or the CRF model was correct in this instance because there is no ground truth information in this area; supervised models can only be evaluated based on how well they predict known substrate classes.
Like at Patricia Bay, the GMM and CRF approaches were also used to generate substrate classification maps (Figure 9), using each of the three monospectral bands separately as inputs. Once again, there is generally a much greater propensity for the GMM model to (over-)predict a certain class, this time Class 2 (Figure 9A.1-A.3), especially at 200 kHz. It seems likely that the GMM model assigns spuriously high weights to this intermediate substrate class because the simple covariance structure of the model is unable to capture the correlations between the two end-member substrate classes; the intermediate class therefore becomes a 'probability sink'. Like at Patricia Bay, the CRF-derived substrate maps are spatially more homogeneous because the relative spatial locations of the bed observations are given weighting comparable to that given to relative differences in backscattering magnitude. At both sites, the relative spatial homogeneity of CRF-predicted classes seems more physically plausible than the relative heterogeneity of the GMM predictions. That is not to say that such variability in substrates is not real: in fact, it is entirely possible that the acoustic variability is due to variability in substrate composition that is not adequately captured by the few classes into which all bed observations were grouped. However, given the limited number of substrate classes, the GMM models have 'over-fit' [23] the backscatter data, capturing acoustic variability as substrate variability, unfiltered by the strong effects of spatial autocorrelation over the substrate, which the CRF model captures well.

Synthesis of All Model Results
The model synthesis shown in Figure 10 and described below uses 50% of the bed observations to test performance. At Patricia Bay, the average CRF classification accuracy across all substrate types was 78% using multispectral inputs (Figure 10A), compared with 75% for the GMM model (Figure 10B); at Bedford Basin, the corresponding accuracies were 95% for the CRF (Figure 10C) and 89% for the GMM (Figure 10D). Overall, therefore, the CRF performed significantly better than the GMM model at both sites with multispectral backscatter inputs. For monospectral inputs, the CRF performed significantly better than the GMM model at Bedford (91% versus 75% overall accuracy for the CRF and GMM models, respectively), whereas the differences were negligible at Patricia Bay (70.5% versus 71% overall accuracy for the CRF and GMM models, respectively). Care must be taken to interpret these task-specific classification results in physical terms. For example, these results do not necessarily reflect the relative importance (or, strictly speaking, information content) of each acoustic frequency compared to multispectral backscatter, because the relative contributions of the different frequencies to overall model performance would be neither linear nor additive.
At Patricia Bay, the leave-one-out cross-validation procedure carried out on the CRF map yielded a 91% mean accuracy across all four substrate classes. The maximum variability in accuracy between folds was largest (6.1%) for the FS class and relatively small for the MS (3.9%), SC (2.3%), and mS (0.07%) classes. The GMM result was significantly worse: a 65% mean accuracy across all four substrate classes, with variability in accuracy between folds of between 20 and 39% depending on the class. At Bedford Basin, the CRF result was a 99.5% mean accuracy across all three substrate classes. The maximum variability in accuracy between folds was largest (2.9%) for Class 2 and relatively small for Class 1 (1.53%) and Class 3 (1.47%). The GMM result was an 84% mean accuracy across all substrate classes, with variability in accuracy between folds of between 4 and 16% depending on the class.
Overall, the CRF model creates outputs that are much less sensitive to the number and location of bed observations than those of the GMM model, which suggests that the CRF model generalizes to new data much better. The standard deviations in bed classifications (Figure 11) show that, at both sites, the GMM model was much more sensitive to the presence and location of individual bed observations, especially at Patricia Bay (Figure 11A). The CRF model is much less sensitive (standard deviations in classifications are much lower), owing to its ability to incorporate spatial autocorrelation in substrates as well as relative differences in backscattering magnitude.

Discussion and Conclusions
We suggest that the high classification accuracies and probabilities of the CRF model outputs were the result of that model incorporating spatial information, as well as backscatter magnitudes, through the use of pairwise potentials that quantify the computational cost of simultaneously assigning label y_i to grid node i and y_j to grid node j (10), parameterized using a 'proximity tolerance' parameter µ. It is encouraging that substrate classification outputs were not overly sensitive to parameter values (Figures 3 and 4), at least beyond a certain low value. Collectively, this suggests that CRF models applied elsewhere might use similar values for θβ, which controls the degree of allowable similarity in backscatter between graph nodes, and µ. Model parameters θβ and µ have clear physical meaning, being related to backscatter similarity and physical proximity, respectively. The model might be further extended by adding a kernel to (11) that captures only the spatial proximity of features, with a parameter controlling the degree of allowable similarity in position between graph nodes. The 'Potts model' for label compatibility is oversimplistic because it penalizes a pair of nearby grid nodes equally irrespective of their labels. Compatibility functions that learn from the data could be implemented using methods discussed by [37], whereby substrates that are physically likely to be proximal are penalized less than those unlikely to be nearby.
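For concreteness, a pairwise potential of the kind described, combining a Potts compatibility with a Gaussian kernel over position (scaled by µ) and backscatter (scaled by θβ), can be sketched for a single pair of graph nodes. This is an illustrative reconstruction of the general form, not the exact potentials in (10) and (11):

```python
import numpy as np

def pairwise_potential(p_i, p_j, x_i, x_j, y_i, y_j,
                       mu=100.0, theta_beta=300.0, w=1.0):
    """Potts-style pairwise potential: a penalty is incurred only when two
    nodes take different labels, and the penalty is largest for node pairs
    that are close in space (within mu) and similar in backscatter (within
    theta_beta)."""
    if y_i == y_j:
        return 0.0  # Potts compatibility: no cost for agreeing labels
    # Gaussian kernel over spatial separation and backscatter difference.
    k = np.exp(
        -np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
        / (2.0 * mu ** 2)
        - np.sum((np.asarray(x_i, float) - np.asarray(x_j, float)) ** 2)
        / (2.0 * theta_beta ** 2)
    )
    return w * k  # large cost: nearby nodes with similar backscatter disagree
```

Note how the penalty decays with both distance and backscatter difference, so distant or acoustically dissimilar pairs are free to take different labels.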
The CRF is analytically more complicated than other discriminative approaches; however, owing to the implementation of an efficient inference algorithm [35], the added computational cost is negligible. Among discriminative approaches, which also include support vector machines [19], decision trees [20,26], neural networks [25], and random forests [27], among others, there are many potential advantages to the CRF. The biggest advantage is that a CRF model does not need to make independence assumptions about the data, or model the correlations between substrate features, to learn a discriminant function that maps inputs x directly into labels y. This is especially important for use with emerging multispectral sonar technology because, until it is understood how substrates respond quantitatively to multiple frequencies, we require a substrate model that does not rely on a statistical assumption of independence, which holds that the substrate features neither depend on nor affect each other.
We used CRF models based on gridded backscatter magnitude only, purposefully ignoring other potentially useful derived products of bathymetry (depth, roughness, slope, aspect, curvature, etc.) or backscatter [22]. While the CRF model can use features x consisting of any number of spatially co-registered inputs, using backscatter alone demonstrates suitability for task-specific substrate classification without transforming the data into a chosen set of inputs, selected from among a set of candidates, amenable to a specific machine learning algorithm. A disadvantage of this so-called feature engineering is that it makes it harder to evaluate the relative importance of all features that make up a given realization of x, especially among disparate data sets. Using backscatter alone is relatively powerful, elegant, and parsimonious.
Generative approaches are statistically more complex, owing to the added complexity of statistically modeling the joint probability of the substrate features (backscatter at frequency 1, frequency 2, etc.) and their labels together, and they make restrictive assumptions about the distribution of the data, such as independence assumptions between backscatter features given a label. As here, when the task requires only finding the posterior probabilities of substrates, that is, the conditional distribution of y given x, P(y|x), generative models can be overly complicated and, in the words of [39], "excessively demanding of data". The label-conditional probabilities may contain a lot of detail that is not used. An example of this is shown in Figure 12, in which the left-hand mode of substrate y_1 in the left plot has no effect on the corresponding posterior probability in the right plot. Compare the multiple modes in the label-conditional probabilities shown in Figure 2, where we may also observe that the label-conditional density assumptions of the GMM, namely, that each class may be modeled as a Gaussian (that is, a unimodal function), gave a comparatively poor approximation to the true distributions. However, generative models are by definition more powerful when direct observations of the bed are relatively scarce: if the joint probability distribution function is statistically well behaved, then given a particular y, one can calculate (or 'generate') its respective x. For generative models, therefore, the training set can in theory contain data from multiple sites and times. If the generative model is sufficiently broad in scope, it can be applied to areas of the substrate with no bed observations, whereas the discriminative approach tends to rely on bed observations from within the area mapped by the sonar. Thus far, however, in practice this tends not to be the case. In other words, models to estimate substrates tend to be built for specific sites and are not typically transferred between sites without recalibration or reparameterization, and the practical scope of generative and discriminative models has become similar. In the longer term, this situation may change, and generative models may become more widely applicable outside of their calibration. Until that time, however, we predict that discriminative models will remain a popular and useful approach to task-specific substrate classification, especially if they better model spatial dependencies, as the CRF does. Given the relative advantages of both model types, there has long been interest in a hybrid approach [45], which is perhaps now on the horizon with the discovery of the generative applications [46] of deep neural networks [33,47], which, to our knowledge, have yet to be applied to substrate characterization but would be an interesting and potentially useful development.
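The generative route described here, modeling the label-conditional densities P(x|y) per class and inverting with Bayes' rule to obtain the posteriors P(y|x), can be illustrated with a minimal single-Gaussian-per-class sketch (a simplification of a full GMM, written in plain numpy for illustration only):

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density, evaluated row-wise over X."""
    d = mean.size
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def gmm_posteriors(X_train, y_train, X):
    """Generative classification: fit one Gaussian per class to model
    P(x|y), then apply Bayes' rule to obtain P(y|x)."""
    classes = np.unique(y_train)
    joint = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)  # P(y): class frequency in training set
        # Small diagonal jitter keeps the covariance invertible.
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        joint.append(prior * gaussian_pdf(X, Xc.mean(axis=0), cov))
    joint = np.array(joint)                         # P(y) P(x|y), shape (C, n)
    return classes, (joint / joint.sum(axis=0)).T   # posterior P(y|x), shape (n, C)
```

The unimodal-Gaussian assumption here is exactly the restrictive label-conditional density assumption criticized above: any multimodality in P(x|y), as in Figure 2, is smoothed away.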

Figure 1 .
Figure 1. Schematic of generative versus discriminative approaches to probabilistic substrate classification. Both approaches train a model from the data that makes optimal label assignments. The generative approach first models the label-conditioned probabilities (P(x|y)) for each label class individually, or the joint distribution (P(x, y)) between substrate features and substrate types (labels), then finds the posterior probabilities of each label (P(y|x)). The discriminative approach learns a decision boundary (or discriminant function) between labels that maps substrate features directly into labels.

Figure 2 .
Figure 2. (A,B) (Patricia and Bedford sites, respectively): False color image with red, green, and blue values of each pixel corresponding to 8-bit values of backscattering strength at, respectively, 100, 200, and 400 kHz outgoing acoustic frequencies. Overlain are the locations of bed observations, color-coded according to the legend. At Patricia, the substrate classes are fine sand (FS), muddy sand (MS), sand/cobble (SC), and sandy mud (mS). At Bedford, Class 1 is mud; Class 2 is cobbles and boulders with interstitial mud; and Class 3 is cobbles, gravel, and boulders. A1-A3 and B1-B3 (Patricia and Bedford sites, respectively): Normalized non-dimensional [-] frequency distributions of 8-bit digital integers of backscatter intensity, compiled based on locations of known substrate types. From left to right: response at 100, 200, and 400 kHz, respectively.

Figure 3 .
Figure 3. Accuracy, average posterior probability, and proportion classified as a function of θβ, for the Patricia (A) and Bedford (B) datasets.

Figure 4 .
Figure 4. Accuracy, average posterior probability, and proportion classified as a function of µ, for Patricia (A) and Bedford (B) datasets.

Figure 5 .
Figure 5. Average conditional random field (CRF) posterior probability as a function of iteration, for the Patricia (A) and Bedford (B) sites.

Figure 6 .
Figure 6. Substrate classification using multispectral Patricia Bay data. Top row: Substrate map (A) and associated probability (B) using the Gaussian Mixture Model (GMM) algorithm. Bottom row: Substrate map (C) and associated probability (D) using the CRF algorithm.

Figure 8 .
Figure 8. Substrate classification using multispectral Bedford Basin data. Top row: Substrate map (A) and associated probability (B) using the GMM algorithm. Bottom row: Substrate map (C) and associated probability (D) using the CRF algorithm.

Figure 10 .
Figure 10. Confusion matrices for CRF- and GMM-derived substrate classifications, for the Patricia (A,B) and Bedford (C,D) sites. Rows, from top to bottom, indicate multispectral, 100 kHz, 200 kHz, and 400 kHz inputs, respectively. Columns, from left to right, indicate Patricia/CRF, Patricia/GMM, Bedford/CRF, and Bedford/GMM, respectively. The numbers in each cell indicate the proportion of all classifications for each class.

Figure 11 .
Figure 11. Standard deviation in substrate classifications computed using 35 and 27 realizations of model outputs at Patricia (A,C) and Bedford (B,D), respectively. The top row shows results from GMM outputs, and the bottom row shows CRF outputs. Each realization was based on one fewer bed observation than the full set, drawn randomly. The bed observations are shown for reference.

Figure 12 .
Figure 12. Example of label-conditional probability densities (left) for two substrates, y_1 and y_2, as a function of a single substrate feature, x, and the corresponding posterior probabilities (right). The vertical line in the right plot shows the discriminative decision boundary. Notice that the left-hand mode of substrate y_1 in the left plot has no effect on the corresponding posterior probability in the right plot. Figure modified from [39].