A Spatially Explicit Comparison of Quantitative and Categorical Modelling Approaches for Mapping Seabed Sediments Using Random Forest

: Seabed sediment composition is an important component of benthic habitat and there are many approaches for producing maps that convey sediment information to marine managers. Random Forest is a popular statistical method for thematic seabed sediment mapping using both categorical and quantitative supervised modelling approaches. This study compares the performance and qualities of these Random Forest approaches to predict the distribution of ﬁne-grained sediments from grab samples as one component of a multi-model map of sediment classes in Frobisher Bay, Nunavut, Canada. The second component predicts the presence of coarse substrates from underwater video. Spatial and non-spatial cross-validations were conducted to evaluate the performance of categorical and quantitative Random Forest models and maps were compared to determine di ﬀ erences in predictions. While both approaches seemed highly accurate, the non-spatial cross-validation suggested greater accuracy using the categorical approach. Using a spatial cross-validation, there was little di ﬀ erence between approaches—both showed poor extrapolative performance. Spatial cross-validation methods also suggested evidence of overﬁtting in the coarse sediment model caused by the spatial dependence of transect samples. The quantitative modelling approach was able to predict rare and unsampled sediment classes but the ﬂexibility of probabilistic predictions from the categorical approach allowed for tuning to maximize extrapolative performance. Results demonstrate that the apparent accuracies of these models failed to convey important di ﬀ erences between map predictions and that spatially explicit evaluation strategies may be necessary for evaluating extrapolative performance. Di ﬀ erentiating extrapolative from interpolative prediction can aid in selecting appropriate modelling methods.


Introduction
There is growing pressure on marine ecosystems due to human use, especially near coasts where interactions between terrestrial and marine drivers have the potential to generate large cumulative impacts [1]. Coastal ecosystems provide many important goods and services to both coastal and inland Using a supervised modelling approach, ground-truth sediment data are commonly treated in two ways to produce classified maps of seabed sediment according to schemes such as those described in Figure 1: 1. Quantitative measures of a substrate property, such as grain size fraction (e.g., percent mud, sand, and gravel), are used to predict quantitative gradational values across the full environmental data coverage (e.g., [26]). These predictions are useful for management or further modelling but some applications require classified or thematic maps, which can be produced by classifying the quantitative predictions according to some scheme (e.g., Figure 1). This is useful for summarizing sediment composition in a single map or for ensuring compatibility with regional management plans or similar research (see Strong et al. [25] for discussion of classification and compatibility).
2. Ground-truth data are aggregated according to a classification scheme prior to modelling, thereby treating them as categorical variables (e.g., [27]). It may also be the case that inherited data (e.g., from the literature, online databases, legacy data) are already classified and the quantitative data are unavailable or that datasets consist of sediment classes derived from visual assessment. In these cases, the options available to the modeler are limited and the categorical classification approach may be the logical choice. Using this approach, a model can predict the occurrence of the observed classes over the full extent of the environmental data.
Here we will refer to these as "quantitative" and "categorical" modelling approaches. While the quantitative approach is also known as "continuous" or "regression" modelling and categorical is Using a supervised modelling approach, ground-truth sediment data are commonly treated in two ways to produce classified maps of seabed sediment according to schemes such as those described in Figure 1: 1. Quantitative measures of a substrate property, such as grain size fraction (e.g., percent mud, sand, and gravel), are used to predict quantitative gradational values across the full environmental data coverage (e.g., [26]). These predictions are useful for management or further modelling but some applications require classified or thematic maps, which can be produced by classifying the quantitative predictions according to some scheme (e.g., Figure 1). This is useful for summarizing sediment composition in a single map or for ensuring compatibility with regional management plans or similar research (see Strong et al. [25] for discussion of classification and compatibility).
2. Ground-truth data are aggregated according to a classification scheme prior to modelling, thereby treating them as categorical variables (e.g., [27]). It may also be the case that inherited data (e.g., from the literature, online databases, legacy data) are already classified and the quantitative data are unavailable or that datasets consist of sediment classes derived from visual assessment. In these cases, the options available to the modeler are limited and the categorical classification approach may be the logical choice. Using this approach, a model can predict the occurrence of the observed classes over the full extent of the environmental data.
Here we will refer to these as "quantitative" and "categorical" modelling approaches. While the quantitative approach is also known as "continuous" or "regression" modelling and categorical is Geosciences 2019, 9,254 4 of 34 commonly referred to as "classification," we use the terms "quantitative" and "categorical" to reduce confusion, since the other terms also have other meanings that are relevant here (e.g., "classified" maps can be produced from either approach and all predictions are "spatially continuous").
Each of these broad approaches contain numerous individual modelling techniques with their own intricacies, many of which have been compared in the ecological and conservation management literature (e.g., presence-absence models [28], regression models [29], machine learning [30], geostatistical and hybrid methods [18,31]). Among machine learning techniques, Random Forest [32] is a particularly flexible and accurate method that is capable of both quantitative and categorical modelling. This flexibility, coupled with widespread availability via popular statistical and GIS software (e.g., R, ArcGIS), has made Random Forest popular for seabed mapping. Given the goal of producing a classified (i.e., "thematic") seabed map using continuous sediment data, Random Forest could be applied using either quantitative or categorical approaches ( Figure 2) at the discretion of the user.
Geosciences 2019, 9, x FOR PEER REVIEW 4 of 35 commonly referred to as "classification," we use the terms "quantitative" and "categorical" to reduce confusion, since the other terms also have other meanings that are relevant here (e.g., "classified" maps can be produced from either approach and all predictions are "spatially continuous"). Each of these broad approaches contain numerous individual modelling techniques with their own intricacies, many of which have been compared in the ecological and conservation management literature (e.g., presence-absence models [28], regression models [29], machine learning [30], geostatistical and hybrid methods [18,31]). Among machine learning techniques, Random Forest [32] is a particularly flexible and accurate method that is capable of both quantitative and categorical modelling. This flexibility, coupled with widespread availability via popular statistical and GIS software (e.g., R, ArcGIS), has made Random Forest popular for seabed mapping. Given the goal of producing a classified (i.e., "thematic") seabed map using continuous sediment data, Random Forest could be applied using either quantitative or categorical approaches ( Figure 2) at the discretion of the user. There are apparent advantages and disadvantages to both quantitative and categorical sediment modelling approaches for producing classified maps. Unclassified quantitative predictions on their own constitute a useful result for further modelling and mapping and are flexible once producedit is easy to classify and reclassify quantitative values as necessary. The modelling process can be complex though, potentially involving data transformations such as additive log-ratios for compositional data [33], multiple models for different log-ratios [26] and multiple corresponding tuning and variable selection procedures. On the other hand, the categorical modelling procedure can be more straightforward, requiring little data manipulation once ground-truth measurements have been aggregated into classes and explanatory variables have been selected. Class labels can be predicted to new data relatively easily. Once produced though, classes are more static compared to quantitative predictions. It may be possible to simply aggregate mapped classes to a more general scheme (e.g., Folk to simplified Folk; Figure 1) but it may also be necessary to re-classify the groundtruth, select new variables and re-tune model parameters for a new classification, especially if the original scheme is a poor match for the data.
Otherwise, characteristics of the ground-truth data and the type of prediction required by the models may be important qualities for determining the suitability of modelling approaches. For example, sample size, distribution and bias, class prevalence and spatial dependence are known to have profound effects on the performance of distribution models [29,34,35] and particularly Random Forest [36]. These and other dataset characteristics might influence the appropriateness of the There are apparent advantages and disadvantages to both quantitative and categorical sediment modelling approaches for producing classified maps. Unclassified quantitative predictions on their own constitute a useful result for further modelling and mapping and are flexible once produced-it is easy to classify and reclassify quantitative values as necessary. The modelling process can be complex though, potentially involving data transformations such as additive log-ratios for compositional data [33], multiple models for different log-ratios [26] and multiple corresponding tuning and variable selection procedures. On the other hand, the categorical modelling procedure can be more straightforward, requiring little data manipulation once ground-truth measurements have been aggregated into classes and explanatory variables have been selected. Class labels can be predicted to new data relatively easily. Once produced though, classes are more static compared to quantitative predictions. It may be possible to simply aggregate mapped classes to a more general scheme (e.g., Folk to simplified Folk; Figure 1) but it may also be necessary to re-classify the ground-truth, select new variables and re-tune model parameters for a new classification, especially if the original scheme is a poor match for the data.
Otherwise, characteristics of the ground-truth data and the type of prediction required by the models may be important qualities for determining the suitability of modelling approaches. For example, sample size, distribution and bias, class prevalence and spatial dependence are known to have profound effects on the performance of distribution models [29,34,35] and particularly Random Forest [36]. These and other dataset characteristics might influence the appropriateness of the approach Geosciences 2019, 9, 254 5 of 34 selected for producing classified maps. For instance, rare classes may be difficult to model using a categorical approach when they have been sampled few times but may cause less of an issue when modelled as quantitative variables. In some cases, clustered or uneven sampling may create spatial dependence in the response data [37], violating assumptions of independence [38]. This could have unintended consequences for prediction and apparent model accuracy, especially when extrapolating to new locations [37,39] and these could depend partly on the modelling approach. Here we refer to extrapolation in a spatial sense as predictions outside of the sampled area, whereas interpolation occurs between sample locations. An implicit assumption then, is that interpolation operates within the sampled environmental conditions, while extrapolation may predict outside of them.
The primary goal of this study was to create a classified seabed sediment map for inner Frobisher Bay, Nunavut, Canada from grab samples and underwater video using the Random Forest statistical modelling algorithm. Ground-truth characteristics, however, suggested that spatial dependence might be an issue when extrapolating seabed sediment characteristics to unsampled locations and evaluating these predictions. We therefore undertook a spatially explicit investigation of the qualities of the two modelling approaches-quantitative and categorical ( Figure 2) -for predicting sediment grain size classes from grab samples using Random Forest. Coarse substrates that were not adequately represented in grab samples were modelled separately using underwater video data and the two predictions were subsequently combined to produce a single map of surficial sediment distribution.
Specifically, when evaluating the quantitative and categorical Random Forest models for producing classified maps, we investigated: (1) their performance when extrapolating grain size predictions to new locations and if this was affected by spatial autocorrelation; (2) the appropriateness of three levels of classification based on the relative proportions of grain size measurements; and (3) if the two approaches produced similar maps. Because the observations of coarse sediment from video transects were likely to be spatially autocorrelated, we investigated if the proximity of these samples: (1) inflated the apparent accuracy of coarse substrate predictions; and (2) caused overfitting in model training. The results of these investigations informed the selection of modelling approach, while also providing spatially explicit accuracy estimates. Based on the results, we provide recommendations on the utility and potential pitfalls of these approaches in a spatial context.

Study Area
Frobisher Bay is a long (~265 km), northwest-southeast-oriented macrotidal fjord located in southeastern Baffin Island, Nunavut, Canada ( Figure 3). It can be partitioned into two morphologically distinct sections. The inner part spans from the northwest head of the bay to the mid-bay islands, with a maximum depth of approximately 350 m. Much of this section is shallow (< 100 m) with extreme tides (> 10 m) resulting in extensive tidal flats. The mid-bay islands separate the predominantly muddy shallow inner bay from the coarse and bedrock-dominated outer bay, which deepens to over 800 m and opens to the North Atlantic. The southwest coast of the outer bay is the fault boundary of a half-graben and is characterized by steep rock cliffs.
This study focuses on inner Frobisher Bay. The morphology and orientation of submarine features here are a product of repeated Quaternary glaciations, the most recent of which receded between 9-7 ka BP [40,41]. These have produced a complex, heterogeneous seabed, with erosional and depositional glacial features such as scour troughs and moraines indicative of northwest-southeast ice flow. Currently, seabed sediments are re-mobilized by several non-glacial processes, including tidal currents, submarine slope failures and iceberg and sea-ice scour [42].  [43]) and coastline reproduced from ESRI [44], with (b) location on southeastern Baffin Island and (c) the study area -inner Frobisher Bay.   [43]) and coastline reproduced from ESRI [44], with (b) location on southeastern Baffin Island and (c) the study area -inner Frobisher Bay.

Ground-Truth
Ideally, a sediment model would rely on a single consistent source of ground-truth data, yet grab samples commonly fail to accurately represent coarser sediments [45,46], while it can be difficult to consistently distinguish mud from sand in underwater video [47]. To overcome these limitations, we modelled finer grain sizes (< 4000 µm) using grab sample data and coarse substrates (≥ 4000 µm; that is, pebble, cobble, boulder) from video observations. Grab samples (n = 239) and underwater video (n = 78) were collected in 2015 and 2016 to provide substrate ground-truth for the MBES data (Figures 4 and 5). Ground-truth sites were selected from the area of MBES coverage prior to each field season using a random approach, stratified by water depth up to 200 m and seabed slope. Because sampling and mapping occurred simultaneously, only part of the final mapped area was available for sample site selection each year, resulting in unsampled areas.
A live-feed Deep Blue Pro underwater camera with a GoPro Hero4 was deployed at each selected site to collect high-definition video for a four-minute drift. Two lights were attached to the camera mount to illuminate the seabed and two green lasers, spaced 5 cm apart, were attached for scale. Positioning was obtained using a Garmin 18x PC GPS with video overlay, providing coordinates at the surface for the duration of the recording. GPS accuracy was rated at < 3 m and

Ground-Truth
Ideally, a sediment model would rely on a single consistent source of ground-truth data, yet grab samples commonly fail to accurately represent coarser sediments [45,46], while it can be difficult to consistently distinguish mud from sand in underwater video [47]. To overcome these limitations, we modelled finer grain sizes (< 4000 µm) using grab sample data and coarse substrates (≥ 4000 µm; that is, pebble, cobble, boulder) from video observations. Grab samples (n = 239) and underwater video (n = 78) were collected in 2015 and 2016 to provide substrate ground-truth for the MBES data (Figures 4 and 5). Ground-truth sites were selected from the area of MBES coverage prior to each field season using a random approach, stratified by water depth up to 200 m and seabed slope. Because sampling and mapping occurred simultaneously, only part of the final mapped area was available for sample site selection each year, resulting in unsampled areas.
A live-feed Deep Blue Pro underwater camera with a GoPro Hero4 was deployed at each selected site to collect high-definition video for a four-minute drift. Two lights were attached to the camera mount to illuminate the seabed and two green lasers, spaced 5 cm apart, were attached for scale. Positioning was obtained using a Garmin 18x PC GPS with video overlay, providing coordinates at the surface for the duration of the recording. GPS accuracy was rated at < 3 m and though efforts were made to keep the location of the GPS above the camera during drifts, it is likely that the locational error was greater under windy or high-current conditions due to horizontal drift of the camera from the vessel, especially at the greatest depths. Positional error potentially exceeded 10 m under high-drift conditions; if this was suspected, the drift was cut short. Still frames were extracted for analysis every 10 s for the duration of the video. If coarse substrates (pebble, cobble, boulder) were visible in a frame, they were labelled as "present." All observations were aggregated so that coarse substrates were labelled as "present" or "absent" for each 10 m raster cell that was sampled.
Up to three sediment samples were collected from the area near each site using either a 24-l Van Veen or a 2.4-l Petite Ponar grab and individually georeferenced from the surface using the ship's GPS location. Each grab sample was sub-sampled for~100 g of sediment, which was considered sufficient for measuring grain size composition up to 4000 µm. These were stored in a sample jar and frozen for transport. In the lab, samples were thawed and dried at low heat in an oven. Samples were dry-sieved for 5 min in a mechanical sieve shaker to separate mud (< 63 µm), sand (63-2000 µm) and gravel (2000-4000 µm) fractions. Many samples had a high proportion of flocculant mud that failed to disperse during dry sieving. To obtain an accurate measure of the mud fraction, samples were gently spray-rinsed through a 63 µm sieve and agitated by hand, washing away the mud fraction. The remaining sediments coarser than 63 µm were re-dried and weighed to estimate the proportion of mud that was lost. The weights of each fraction were divided by the total weight to obtain percent mud, sand and gravel composition.

Statistical Modelling
The Random Forest machine learning algorithm was used to model both sediment grain size and presence of coarse substrates. Random Forest is an ensemble modelling method that uses bagging [48] to combine the results of many individual classification or regression trees, whereby many bootstrap samples of the modelling dataset are drawn and a decision tree is fit to each [32]. A random set of predictor variables is selected to partition the data at each decision tree node and the data not selected for a given bootstrap sample (termed "out-of-bag" observations) are predicted by the decision tree. This provides several useful metrics, such as (1) a predicted class or value for each observation based on majority vote, (2) estimates of accuracy that are comparable to cross-validation and (3) estimates of variable importance (see Cutler et al. [49] for explanation in an ecological context) and these can be used to evaluate the quality of the model.
Because Random Forest is an ensemble of decisions trees, it is generally considered robust to noisy or unimportant predictors [32,50]. Large numbers of quantitative and categorical variables can therefore be included in the model and non-linear responses are automatically modeled with interaction (a quality of decision trees). Furthermore, it can perform both quantitative and categorical modelling (e.g., Figure 2). These qualities have made Random Forest popular for environmental modelling, yet the resulting ensemble of trees is complex, making it difficult to understand exactly how model predictions are derived. Ways in which variables interact in the model, for example, is difficult to know given that interactions can be multiple levels deep for each decision tree, of which there are hundreds or thousands. Though this complexity makes Random Forest a powerful predictor in the range of sampled conditions, it may be less effective at extrapolation [51,52], especially for rare classes. This is sometimes alleviated by subsampling the sample dataset to ensure equal class observations, yet this method can potentially cause rare classes to be overpredicted [36]. These qualities must be considered when evaluating the suitability of Random Forest for a given modelling application.
For the categorical grain size modelling approach, samples were assigned class labels according to three ternary schemes prior to modelling to test different levels of data aggregation. The Folk [20] and simplified Folk schemes were used according to Long [21]. Here, the "slightly gravelly" boundary for the Folk classification is at 1% rather than "trace," as in Folk [20]. The third classification was simply "muddy" or "sandy" if there was a majority of either size fraction. This was used instead of the EUNIS simplification to the Folk classification (Figure 1c), which was not appropriate given the data -most samples were muddy and < 5% gravel. The EUNIS simplification would aggregate nearly all the samples into the class "mud and sandy mud." For the quantitative approach, percent mud, sand, and gravel measurements were transformed to an unbound additive log-ratio (ALR) scale [33] with respect to the mud fraction: where ALR sm and ALR gm are the additive log-ratios of sand to mud and gravel to mud and M, S, and G are the proportions of mud, sand, and gravel size fractions measured in a grab sample, respectively. Note that the results are unaffected by the choice of size fraction to serve as the denominator [53]. Model predictions were then back-transformed to a compositional scale bound between 0 and 1, corresponding to the relative percentage of each size fraction and summing to 1 for each sample [54]: To produce a classified map from the quantitative output, predictions were classified according to the three ternary schemes above.
The presence or absence of coarse substrates was recorded for each underwater video still frame to produce binary presence-absence data. These were used to train a categorical Random Forest model, essentially treating the presence or absence of coarse substrates as a two-class categorical response. Random Forest can output class probabilities rather than the class of majority vote, which allows the presence threshold to be tuned according to the classification goal, rather than the arbitrary default value of 0.5. Though tuning this threshold cannot solve the problem of poor model fit, it can potentially increase the efficiency of meeting explicit predictive goals for a binary classifier (such as balancing correctly classified presences and absences or accurately predicting a certain sediment class), especially when model prediction and performance are highly sensitive to class prevalence [55].

Explanatory Variables
In addition to the primary MBES data (bathymetry and backscatter), 11 secondary variables were tested for inclusion in each of quantitative and categorical grain size models and the coarse substrate model using a multiscale approach ( [56]; Appendix B). Five bathymetric derivatives that describe most of the topographic structure of a surface suggested by Lecours et al. [57] were calculated using their toolbox for ArcGIS (the eastness and northness of seabed slope aspect, relative difference to the mean bathymetric value, slope, local bathymetric standard deviation; [58]). Local bathymetric standard deviation was omitted because it was highly correlated with seabed slope but did not perform as well. Three measures of surface curvature (total, plan, profile) were derived using the "Curvature" tool in ArcGIS Pro v.2.2.3 and two measures of surface complexity (surface area:planar area [SA:PA], vector ruggedness measure [VRM]) were derived using the Benthic Terrain Modeler toolbox [59] for their potential as topographic surrogates that influence bottom currents and potentially sediment transport. The range of backscatter values in a circular neighborhood and their standard deviation (SD) in a 3 x 3-pixel neighborhood were derived for each spatial scale as potential surrogates for local substrate variability. The Euclidean distance to the nearest coast was calculated as a potential predictor of sediment grain size based on transport distance from terrestrial sources.
Random Forest is generally considered robust to noise, ignoring unimportant predictors, yet there are benefits to variable reduction such as decreased variability between model runs and more accurate estimates of error [36]. We simplified the predictor set by removing variables that had Spearman's ρ ≥ 0.70-a common threshold for reducing correlated variables (e.g., [60][61][62]; Appendix B).

Evaluating Model Performance
All predictions were first tested using a leave-one-out cross validation (LOO CV). Using this approach, a single sample is removed from the dataset and all other samples (N-1) are used to train the model. The class of the withheld sample is then predicted using the N-1 model. This is repeated for every sample in the dataset, producing observed and predicted classes at every sample location, which are used to estimate predictive performance. Error matrices were computed to observe the success in predicting the observed classes. From this we derived the percent correctly classified and the kappa coefficient [63]. Percent correctly classified is simply the proportion of test samples that were assigned the correct label. Kappa is a statistic that reflects whether the model achieved better results than to be expected at random and is calculated from where κ is the value of kappa between −1 and 1 (with 0 describing predictions that are no better than random), p o is the proportion correctly classified and p e is the proportion correctly classified due to chance, based on the frequency of observations and predictions of each class. Because the performance of the coarse presence-absence model depends on the probability threshold, the threshold-independent area under the receiver operating characteristic curve (AUC), which is the plot of sensitivity against 1-specificity [64] and the maximum kappa values at all thresholds (e.g., [28]) were used to compare candidate models. Spatial autocorrelation is known to inflate estimates of predictive performance [36,65]. To determine its effects on the modelling approaches tested here and whether models were able to extrapolate to unsampled locations, we also conducted a spatial leave-one-out cross-validation (S-LOO CV) [66]. This procedure is identical to LOO CV, except that a spatial buffer is placed around the withheld test point and training data from within this buffer are omitted from both model training and testing so that there are no training data proximal to the test. This aims to eliminate spatial bias in accuracy assessment by removing points that are spatially autocorrelated with the test site up to the specified buffer distance.
We calculated empirical variograms for the observed grain size values and coarse substrate observations to determine a suitable buffer distance. We used the variogram model range to estimate the distance beyond which the effects of autocorrelation are negligible. This distance has been suggested as adequate for S-LOO CV [67]. Variograms were calculated up to 5000 m and multiple models were tested for characterizing the major range using the Geostatistical Wizard in ArcGIS Pro v.2.2.3. The coarse substrate model was trained using image frames from video transects and these were expected to be highly autocorrelated due to their proximity. Therefore, in addition to the above two assessment procedures, we conducted a spatially resampled leave-one-out cross-validation (SR-LOO CV) to determine whether this spatial dependence affected model fitting in addition to performance estimation. Because Random Forest is an "embarrassingly parallel" algorithm [50], separate "forests" can be combined and treated as a single model to make predictions. The SR-LOO CV builds on S-LOO CV by using the same spatial buffering procedure (i.e., the withheld test sample is spatially buffered; points within the buffer are excluded from training and testing), except that each training point for each "leave-one-out" iteration is also spatially buffered, so that no adjacent points are used for model fitting or testing (similar to the algorithm in Holland et al. [68]). Because this severely limits the number of samples available for training each iteration of Random Forest, we can randomly subsample each "leave-one-out" training dataset many times (100 here) to acquire different subsets of spatially independent training data. By producing only a small number of trees (ntree) from each of the 100 spatially independent data subsets, it is possible to achieve the same number of trees in a forest that are desirable in a full "leave-one-out" iteration (e.g., ntree = 2000 here). This effectively creates the same number of trees as in a normal Random Forest run but produces the "forest" using the combination of many small, spatially independent subsets, rather than the full dataset.
The results of these cross-validations were analyzed to address the goals of this study. For grain size predictions, we first investigated the performance of the quantitative and categorical approaches to determine if these models could successfully extrapolate sediment grain sizes at unsampled locations and whether spatial autocorrelation interfered with assessing map accuracy. Performance was assessed using multiple classification schemes at different levels of grain size detail (i.e., aggregation) to select one that fit the data. We then compared maps produced using both approaches to determine if and how any differences in model performance manifested in the mapped predictions and if the maps agreed. For predicting the presence of coarse substrates, we tested whether the proximity of sample points within video transects inflated estimates of predictive performance and also whether their proximity caused overfitting of Random Forest models.

Map Prediction
Predictions of sediment grain size and the presence of coarse substrates were combined to produce a single map of seabed sediment distribution. The results from the accuracy assessments and map comparison, and also the qualities inherent in the two modelling approaches, were used to select a suitable model for predicting sediment grain size classes. The probability of coarse substrate presence was predicted for the entire study area. To combine these predictions with those of grain size, an occurrence threshold was set to maximize the sum of sensitivity and specificity, which aims to balance the class accuracy of predictions [60] and which has been demonstrated to perform well compared to other threshold selection criteria [55]. Thus, the combined map predicts the sediment grain size class and whether coarse substrates are present for each pixel throughout the study area (e.g., "muddy with coarse substrate").

Grain Size Data
The sampled substrates were primarily muddy, with some sandy sediments ( Figure 6). When classified according to Folk [20], the most common class was (g)sM (40.72%), followed by sM (38.14%) and (g)mS (13.40%). The simplified Folk scheme eliminates the "slightly gravelly" modifier, aggregating the classes (g)sM and (g)mS with sM and mS, increasing these class proportions to 78.87% and 18.04%, respectively. In the "muddy/sandy" classification, 79.38% of samples fall into the "muddy" class, with the remaining 20.62% in "sandy." Coarse substrates were observed in 20.06% of raster cells containing underwater video observations (e.g., Figure 7; Table 1). Geosciences 2019, 9, x FOR PEER REVIEW 13 of 35

Spatial Autocorrelation
Estimates of major range from the Geostatistical Wizard variogram models were used to determine a buffer distance for the spatial leave-one-out cross-validations. Circular variogram models provided a distinct major range compared to the more asymptotic models (e.g., exponential, Gaussian, Bessel), were relatively stable with varying input parameters and fit the data comparatively well. The major ranges of the circular mud and sand variogram models were 1497 m and 1210 m, respectively, when calculated at a maximum distance of 5000 m (see Appendix C). There appeared to be little change in gravel measurement variance with increasing distance and they did not yield a useable variogram model. We selected a buffer distance of 1500 m for the S-LOO CV based on the

Spatial Autocorrelation
Estimates of major range from the Geostatistical Wizard variogram models were used to determine a buffer distance for the spatial leave-one-out cross-validations. Circular variogram models provided a distinct major range compared to the more asymptotic models (e.g., exponential, Gaussian, Bessel), were relatively stable with varying input parameters and fit the data comparatively well. The major ranges of the circular mud and sand variogram models were 1497 m and 1210 m, respectively, when calculated at a maximum distance of 5000 m (see Appendix C). There appeared to be little change in gravel measurement variance with increasing distance and they did not yield a useable variogram model. We selected a buffer distance of 1500 m for the S-LOO CV based on the

Spatial Autocorrelation
Estimates of major range from the Geostatistical Wizard variogram models were used to determine a buffer distance for the spatial leave-one-out cross-validations. Circular variogram models provided a distinct major range compared to the more asymptotic models (e.g., exponential, Gaussian, Bessel), were relatively stable with varying input parameters and fit the data comparatively well. The major ranges of the circular mud and sand variogram models were 1497 m and 1210 m, respectively, when calculated at a maximum distance of 5000 m (see Appendix C). There appeared to be little change in gravel measurement variance with increasing distance and they did not yield a useable variogram model. We selected a buffer distance of 1500 m for the S-LOO CV based on the mud and sand major range estimates. The major range of the circular coarse substrate model was 199 m; we selected a buffer distance of 200 m for S-LOO CV and SR-LOO CV methods based on this model.

Variable Selection
Different sets of scale-dependent variables were selected for modelling ALR sm , ALR gm , grain size classes, and the presence of coarse substrates (Tables 2 and 3). Backscatter range was commonly correlated with the backscatter standard deviation (SD) (ρ ≥ 0.70) and only one of these two variables was generally selected, except in the classification model where the correlation between backscatter range at 250 m scale and backscatter SD at 100 m scale was below this threshold. The different curvature measures were often correlated at similar scales and also to RDMV. Total curvature was correlated with one of these variables at every scale tested and consistently had a weaker relationship with the response-it was therefore removed from all models. The two measures of complexity, SA:PA and VRM, were also correlated at similar scales.

Grain Size Model Evaluation and Comparison
The LOO CV suggested that the simplified Folk and muddy/sandy classes were predicted accurately, with predictions from the categorical approach generally more accurate than the quantitative, while Folk classes were not predicted accurately using either approach ( Table 4). The error matrix shows that this poorer performance was the result of several uncommon Folk classes that were not successfully predicted (Appendix D); the categorical Folk predictions had a percent correctly classified of 54.64% and predictions that were only marginally better than random (κ = 0.25). The greater accuracies of the simplified Folk and muddy/sandy schemes is a product of rare classes being aggregated into broader ones, reducing their misclassification. Though the quantitative approach was generally less accurate than the categorical using LOO CV with all schemes, it was similar in that the simplified schemes were predicted more successfully than the Folk. Again, this was a result of aggregating rare Folk classes into broader ones, making them easier to predict.
In contrast to the LOO CV results, there was little difference in performance between the two approaches when evaluated using S-LOO CV -both demonstrated poor predictive performance. All performance metrics were lower using the S-LOO CV, except for the quantitative Folk predictions ( Table 4). The percent correctly classified was reduced only marginally-by 7.92% to not at all-yet kappa values all indicate that the performance of these models is hardly better than by random chance based on class prevalence. The disparity between percent correctly classified and kappa scores in the S-LOO CV assessments is the result of an increased inability to predict the rarer classes (see error matrices in Appendix D). Results of the unclassified mud, sand, and gravel quantitative predictions are presented in Appendix E. Table 4. Performance of quantitative and categorical grain size predictions using three schemes with spatial and non-spatial cross-validations. Despite similarities in predictive performance when evaluated using S-LOO CV, there were obvious differences in map predictions between categorical and quantitative Random Forest model predictions. The quantitative approach predicted the occurrence of classes that were not observed in ground-truth samples, such as the Folk class gM in the deep channels in the southeast part of the bay (Figure 8b), which was the third most commonly predicted class and was only predicted in unsampled areas. Conversely, the categorical approach generally predicted the most common class (sM) in these areas. (g)sM was the most commonly predicted Folk class in the classified quantitative map, occurring in 65.36% of raster cells, while it was only the second most common for the categorical map, occurring in 44.04% of cells. sM was the most common for the categorical map, occurring in 52.83% of cells, while it was the second most common in the classified quantitative map, occurring in only 27.79% of cells. The prediction of unsampled classes using the quantitative approach accounted for much of the disagreement between maps but they also disagreed on the extent of the most common classes throughout the study area. The primary differences between categorical and quantitative Random Forest approaches for the simplified Folk classified maps (Figure 9) was the prediction of unsampled classes. Again, the quantitative approach predicted extensive areas of gM in the deep southeast channel that were not predicted using the categorical approach. In all other areas the two map predictions were similar, with 93.15% and 96.37% of quantitative and categorical maps classified as sM, respectively. Both approaches predicted similar distributions of mS, primarily near Iqaluit, in the northernmost mapped area and near the southwestern coast. "Muddy/sandy" maps ( Figure 10) were highly similar between the two approaches with nearly 97% agreement, having eliminated all unsampled classes. "Sandy" sediment was primarily predicted where mS occurred in the simplified Folk maps. Predicted class The primary differences between categorical and quantitative Random Forest approaches for the simplified Folk classified maps (Figure 9) was the prediction of unsampled classes. Again, the quantitative approach predicted extensive areas of gM in the deep southeast channel that were not predicted using the categorical approach. In all other areas the two map predictions were similar, with 93.15% and 96.37% of quantitative and categorical maps classified as sM, respectively. Both approaches predicted similar distributions of mS, primarily near Iqaluit, in the northernmost mapped area and near the southwestern coast. "Muddy/sandy" maps ( Figure 10) were highly similar between the two approaches with nearly 97% agreement, having eliminated all unsampled classes. "Sandy" sediment was primarily predicted where mS occurred in the simplified Folk maps. Predicted class proportions between the approaches were similar, with "sandy" sediment predicted in 3.61% of cells in the classified quantitative map and 4.60% in the categorical map -the remaining being "muddy." Geosciences 2019, 9, x FOR PEER REVIEW 18 of 35 proportions between the approaches were similar, with "sandy" sediment predicted in 3.61% of cells in the classified quantitative map and 4.60% in the categorical map -the remaining being "muddy."

Coarse Model Assessment
The LOO CV suggested that the presence of coarse substrates was predicted accurately but the S-LOO CV did not ( Table 5). The maximum kappa value obtained using LOO CV ( = 0.62; Table 5) suggested that the model had potential to predict much better than expected by chance (depending on the threshold selected). The threshold-independent AUC value (0.86) also suggested strong predictive performance. Conversely, S-LOO CV yielded lower maximum kappa and AUC values. Accuracy of the SR-LOO CV, however, was higher than the S-LOO CV with the same spatial

Coarse Model Assessment
The LOO CV suggested that the presence of coarse substrates was predicted accurately but the S-LOO CV did not ( Table 5). The maximum kappa value obtained using LOO CV (κ = 0.62; Table 5) suggested that the model had potential to predict much better than expected by chance (depending on the threshold selected). The threshold-independent AUC value (0.86) also suggested strong predictive performance. Conversely, S-LOO CV yielded lower maximum kappa and AUC values. Accuracy of the SR-LOO CV, however, was higher than the S-LOO CV with the same spatial constraints (200 m buffer), suggesting potential model overfitting caused by the proximity of the training data that may have been alleviated when training data were forced to be independent. Table 5. Threshold-independent accuracy of coarse substrate model using spatial and non-spatial cross validation (CV) approaches and with spatially independent training data. The map of coarse substrates shows the probability of occurrence for each pixel (Figure 11). Coarse substrates were predicted to occur throughout the bay but primarily on the flanks of topographic highs and on several coasts (see the northern and westernmost mapped areas; Figure 11). Coarse substrates were also predicted in the southeast section of the bay on the flanks of deep channels and near the islands to the east, where backscatter return was high ( Figure 5). constraints (200 m buffer), suggesting potential model overfitting caused by the proximity of the training data that may have been alleviated when training data were forced to be independent. Table 5. Threshold-independent accuracy of coarse substrate model using spatial and non-spatial cross validation (CV) approaches and with spatially independent training data. The map of coarse substrates shows the probability of occurrence for each pixel (Figure 11). Coarse substrates were predicted to occur throughout the bay but primarily on the flanks of topographic highs and on several coasts (see the northern and westernmost mapped areas; Figure  11). Coarse substrates were also predicted in the southeast section of the bay on the flanks of deep channels and near the islands to the east, where backscatter return was high ( Figure 5).

Combined Map and Model Tuning
The two-class categorical muddy/sandy predictions were selected to combine with coarse substrate predictions to produce a single map of surficial sediments (Figure 12). Because the grain size scheme is binary, the categorical Random Forest predictions have the distinct advantage of a flexible threshold of occurrence, which can be readily optimized. Setting the threshold to maximize the sum of sensitivity and specificity (0.18; [55]) for predicting "sandy" sediments using S-LOO CV yielded κ = 0.27. Having selected this model, the performance was tested after dropping further unimportant predictors identified from estimates of variable importance. Maintaining only the top six variables (bathymetry, backscatter, 300 m profile curvature, 10 m and 450 m slopes, 10 m VRM) resulted in more accurate and more stable grain size predictions using S-LOO CV. At the 0.18 threshold, predictions were classified correctly~70% of the time, with κ = 0.34 (Table 6).

Discussion
The predicted seabed sediment classes generally agreed with expectation given the geomorphology of the bay, yet particular locations without ground-truth data require further investigation. The majority of the low-relief seabed was classified as "muddy," which is not surprising given what was observed in grab samples and underwater video (e.g., Figures 6 and 7a). Sandy sediments predicted south and southwest of Iqaluit may be partially attributable to sediment input from the Sylvia Grinnell River, directly west of the city. This is also an area of distinct sea-ice scouring [42] with higher acoustic backscatter than the surrounding seabed ( Figure 5) and several distinctly reflective features that were classified as "sandy with coarse substrate." This class was also predicted at several locations along the coast, fining to muddier grain sizes with increasing distance and depth. Otherwise, exposed coarse substrates predicted along the flanks of steep topographic features may be attributable to current winnowing of unstable fine sediments [69]. This is likely the  Table 6.
Accuracies of grain size and coarse substrate components of combined seabed sediment predictions. The presence or absence of coarse substrates was dichotomized using a 0.27 threshold of occurrence to maximize the sum of sensitivity and specificity [55]. These predictions were 75.34% accurate with κ = 0.40 using the SR-LOO CV method ( Table 6). The final seabed sediment class was determined by specifying the predicted grain size class ("muddy" or "sandy") and whether coarse substrates are present ( Figure 12).

Discussion
The predicted seabed sediment classes generally agreed with expectation given the geomorphology of the bay, yet particular locations without ground-truth data require further investigation. The majority of the low-relief seabed was classified as "muddy," which is not surprising given what was observed in grab samples and underwater video (e.g., Figures 6 and 7a). Sandy sediments predicted south and southwest of Iqaluit may be partially attributable to sediment input from the Sylvia Grinnell River, directly west of the city. This is also an area of distinct sea-ice scouring [42] with higher acoustic backscatter than the surrounding seabed ( Figure 5) and several distinctly reflective features that were classified as "sandy with coarse substrate." This class was also predicted at several locations along the coast, fining to muddier grain sizes with increasing distance and depth. Otherwise, exposed coarse substrates predicted along the flanks of steep topographic features may be attributable to current winnowing of unstable fine sediments [69]. This is likely the case in the high-relief, deep southeastern channels, where coarse substrates were predicted extensively. Further investigation is necessary in these deep channels though-this was an unsampled area of high disagreement between the categorical and quantitative models (Figures 8 and 9). One might expect a muddier composition at the bottoms of these deep channels, yet sandier grain sizes were predicted, likely as a product of the high backscatter response ( Figure 5).

Model Comparison
There was little difference in accuracy between quantitative and categorical Random Forest approaches when using spatially explicit cross-validation methods but their maps differed substantially. Using a two-class scheme, it was possible to tune the threshold of occurrence for the probabilistic output of the categorical model to obtain a higher accuracy than the quantitative model and this was selected for the final map. The most noticeable difference between maps was that the quantitative approach predicted extensive patches of sediment classes that were not observed in the ground-truth data, where the categorical approach predicted the most commonly observed classes. The predicted proportions and distributions of the classes also differed between approaches.
Although the quantitative Random Forest approach failed at extrapolation in this study, it has several characteristics that may be otherwise desirable. Because classification of quantitative predictions is done post hoc, this method might avoid some of the difficulty associated with predicting unbalanced classes-one of the major shortcomings of the categorical approach. Furthermore, as demonstrated here, predictions are not constrained to the classes that were sampled. Thus, if the model were fit well, it may be possible to predict rare and unsampled classes at new locations, while this is not feasible with the categorical approach. This may be a particularly useful quality if unsampled areas are expected to contain different sediment characteristics than the sample sites, yet it requires a high degree of confidence in the modelled relationships between grain size composition and the explanatory variables. The spatial leave-one-out CV error matrices for classified quantitative predictions (Appendix D) failed to indicate that the model could successfully predict rare classes in this study and we did not have the confidence to adopt predictions of unobserved classes in unsampled areas. It is quite possible that these areas do actually contain different sedimentary characteristics though, as their morphology and backscatter characteristics were unique but there is no way to confirm this. Sampling these areas would be a priority in future work.
It is also worth considering characteristics of the unclassified quantitative predictions of mud, sand, and gravel, which may offer some advantages over classification. These predictions represent gradational changes in sediment composition, which are more realistic and potentially more desirable than discrete classes. If classes are required, quantitative predictions are completely flexible with regards to classification scheme. Because the quantitative values remain the same, it is not necessary to run through the model fitting procedure to test different classifications -the class boundaries simply need to be adjusted. Other methods could also be used to optimize the classification of quantitative predictions to produce relevant and distinct classes, such as multivariate clustering. This way, it is possible to define an appropriate number of classes with boundaries that are most relevant to a given study area.
The qualities of the categorical Random Forest approach ultimately made it more suitable for this study but one major difficulty was assessing whether rare classes were predicted correctly. Because the data were spatially autocorrelated, samples of rare classes were likely to occur close to one another. Using a spatial CV approach, in which samples proximal to the test data are omitted from model training (as in S-LOO CV) or proximal samples are not allowed for either training or testing (as in SR-LOO CV), can potentially remove most or all other samples of a rare class, making it impossible to assess if it was predicted accurately. Selecting a classification scheme that matches the data facilitates the estimation of accuracy by ensuring an adequate number of samples for training and testing each class. Here, the Folk and simplified Folk schemes each contained several classes with very few observations (< 5) and the success in predicting these could not be confidently determined. The muddy/sandy classification better fit the data, providing adequate samples of each class to evaluate predictive success.
Though the muddy/sandy classification was a good fit given these data ( Figure 6), the class prevalence was still unbalanced. Another solution afforded by the categorical approach is that the threshold of occurrence can be optimized. Setting this threshold to maximize the sum of sensitivity + specificity equally weights the success in predicting each class and this produced higher extrapolative (i.e., S-LOO CV) accuracy than the quantitative approach, especially after removing superfluous variables. Another common approach used to predict rare classes is to subsample the dataset to ensure equal class representation but that requires enough samples of the rarest class to allow a reasonable subsample size. Furthermore, some research has suggested that the proportions of classes in the training data should be representative of the actual proportions of these classes [36].

Spatial Assessment
Spatial autocorrelation inflated estimates of predictive accuracy regardless of the modelling approach or classification scheme for both grain size and coarse substrate models, hindering the ability to determine whether the models could successfully extrapolate grain size classes at unsampled locations. Similar to LOO CV, many common model validation techniques (e.g., sample partitioning, k-fold CV) have no spatial component. For this study, non-spatial techniques failed to correctly estimate the model's ability to extrapolate. If LOO CV were used in isolation to evaluate the categorical simplified Folk predictions for example, the percent correctly classified and κ values (85.50%; 0.52) would suggest the model is highly accurate and reliable. In reality though, it fails to extrapolate beyond the sphere of spatial autocorrelation influence, with predictions no better than random.
The SR-LOO CV for the coarse substrate model suggested not only that spatial autocorrelation inflated estimates of accuracy but also that Random Forest was spatially overfitting, hindering extrapolation. This is an important issue for severely autocorrelated datasets that is not necessarily solved by other spatial validation approaches that allow for proximal training samples such as S-LOO CV or spatial blocking. Though SR-LOO CV shows promise for reducing overfitting and providing non-biased estimates of accuracy, we note that the computational effort may not always be realistic. One hundred random samples of each spatially-buffered leave-one-out sample set (n = 324) yielded 32,400 sub-samples and corresponding Random Forest models for one SR-LOO CV run. Furthermore, the "embarrassingly parallel" qualities of Random Forest were leveraged to implement SR-LOO CV here, which is not characteristic of most other modelling methods. Simpler alternatives could involve aggregating sample transects to a single point and adjusting the raster resolution, yet this may be less attractive if a high resolution is desired. How such methods compare with the SR-LOO CV remains to be explored.

Spatial Prediction
It is important to distinguish between interpolating a well-sampled area and extrapolating to unsampled locations [67]. Though it is becoming standard practice to report predictive accuracy, it is less common to differentiate between these predictive spatial qualities, which are partly determined by sample distribution and intensity. Again, if interpolation is the goal, with somewhat uniform and well-distributed sampling, then standard non-spatial model evaluation methods may be appropriate (e.g., LOO CV, k-fold CV, partitioning). If samples are clustered, with parts of the study area unsampled, then it is necessary to evaluate for extrapolation, which may require a spatially explicit approach, as was the case here. This also may affect the appropriateness of categorical and quantitative approaches -if extrapolating to a potentially new sedimentary environment is the goal and if there is confidence in the modelled relationships between sediment and explanatory variables, then a quantitative approach may be useful for identifying unsampled or rare sediment classes. Here, we found that the flexibility of the threshold of occurrence using a categorical Random Forest approach resulted in superior extrapolative performance compared to a quantitative approach for a binary classification scheme. Given a set of classification requirements (e.g., regional compatibility) and a desire to maximize predictive accuracy of predetermined classes, it may be desirable to test both approaches where feasible -our results do not suggest the consistent superiority of one method over the other.
Recently there have been calls for greater transparency in reporting map quality, including uncertainty and error, to determine whether thematic maps are fit for purpose (e.g., [70,71]). This becomes especially important when providing maps as tools for management, where end-users may lack the technical understanding to critically evaluate a map [72]. The spatial component of distribution modelling is a potential source of data error [71] that is commonly neglected [52], yet which can be exacerbated due to marine sampling constraints. Here we have demonstrated the necessity of spatially explicit analysis for comparing the error and predictions between two seabed sediment mapping approaches and the potential pitfalls of neglecting to do so. Though many approaches have been tested and compared in the seabed mapping literature, these qualities are often ignored. The SR-LOO CV approach used here to model the presence of coarse substrates is similar to the variable scale selection procedure used by Holland et al. [68] but uses "embarrassingly parallel" Random Forests so that no samples are fully omitted. This is the first application of the approach in this context to our knowledge. Though the SR-LOO CV method was well-suited to modelling video transect data in this study, we acknowledge several other useful strategies [67] and tools [73] that are worth considering to address spatial sampling bias. Geostatistical methods may also be a preferable alternative for handling spatially dependent data depending on the modelling goals. The focus of this study was on modelling seabed sediments but the findings are relevant to other similar benthic distribution models including those of species and biotopes.

Conclusions
Neither categorical nor quantitative Random Forests performed consistently better between the classification schemes tested but the mapped predictions and the qualities of the models differed substantially, ultimately informing on the suitability of these methods. The ability of the quantitative approach to predict rare and unsampled classes may be an important quality depending on sample distribution and mapping goals, yet we found the probabilistic threshold qualities of the categorical approach with a binary scheme (i.e., "muddy/sandy") made it more suitable for extrapolation in this study. Extrapolation was a necessary quality for these models because sample sites were clustered, with some areas not sampled.
There was evidence that the proximity of transect video observations caused both inflated estimates of accuracy and overfitting in the Random Forest models of coarse substrate. We conclude therefore that it should not be taken for granted that Random Forest will not overfit amidst autocorrelated data and spatially explicit methods may be necessary to ensure spatial independence, regardless of the modelling algorithm used. This is especially relevant for seabed mapping given the prevalence of transect data in this field.
From the results of this study we recommend that seabed map producers be specific about their predictive goals -especially whether the models are required to extrapolate to new environments and locations or whether they will "fill in the gaps" between sample sites (i.e., interpolate). This distinction can determine the appropriateness of modelling approaches and evaluation methods. We found it necessary to use spatially explicit strategies to evaluate whether the models in this study were able to extrapolate and that modelling highly autocorrelated data required both model fitting and testing in a spatially independent context. Acknowledgments: Thanks to the captain and crew of the RV Nuliajuk for assisting with data collection, to Drs. Alvin Simms and David Schneider for feedback on statistical methodology and to Kirsten Costello for assisting with data processing. Thanks to the community of Iqaluit and the Government of Nunavut Department of Environment Fisheries and Sealing Division for supporting this project.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript or in the decision to publish the results.

Multibeam Echosounder Data Processing
Bathymetry data were imported to Qimera version 1.7; erroneous values were removed manually or using conservative spline filters. The data were corrected for tides using the Arctic9 tide model [74]. Data acquired in each survey year and system were processed separately, exported as a 10 m floating point geoTIFF grid and mosaicked in ESRI ArcGIS Pro v.2.1 to a single raster ( Figure 4).
Uncalibrated MBES backscatter data from each survey year and system were processed using the Fledermaus Geocoder Toolbox (FMGT) and exported separately as floating point geoTIFF grid files. Focal statistics were used in ESRI ArcGIS Pro to smooth the data over a 5 x 5-cell neighborhood to reduce noise. The use of multisource backscatter datasets presents several difficulties as relative dB values partially depend on the acquisition parameters of individual MBES systems (e.g., operating frequency; [75]). If each survey has been adequately ground-truthed, disparate datasets can be analyzed separately and their results combined [76]. Here, some of the datasets had few or no ground truth samples. We therefore adopted a normalization approach by which separate datasets were harmonized using a "bulk shift" methodology [56,77]. This standardizes each survey using the most extensive as reference, operating under the assumption that relative backscatter strength is a function of substrate properties and is relatively stable throughout a given survey. All surveys were thus mosaicked to a single raster at 10 m resolution and a low pass filter was applied to smooth out remaining data noise ( Figure 5).

Variable Scale Selection
For the quantitative models, we calculated Spearman's correlation coefficient for each scale of each predictor variable with grain size composition (ALR sm and ALR gm ) to test for non-parametric monotonic relationships. We attempted to determine up to two appropriate scales (i.e., "intrinsic scales"; [78]) for each predictor by identifying local peaks in the plot of correlation versus variable scale. Because calculating correlation coefficients between a multi-level categorical response (viz., grain size classes) and quantitative predictors is not as straightforward, we used univariate multinomial logistic regressions to test the ability of each predictor at each scale to explain the grain size sediment class. The residual deviance of the univariate models was plotted against variable scale and up to two local minima were identified in each graph as intrinsic scales. All correlation scores and multinomial logistic regressions were calculated in R using the cor() and multinom() functions within the "stats" and "nnet" packages [79,80].
We tested whether selected scales of a given variable were correlated with each other and removed the weakest variable (based on relationship with the response) if Spearman's ρ ≥ 0.70 between predictors [60][61][62]. We then tested the correlation between all remaining scales of all variables and removed weaker variables in cases where Spearman's ρ ≥ 0.70.

Variogram Analysis
Variograms were calculated for measurements of each grain size fraction and the presence of coarse sediments. The following model fits were obtained from the ArcGIS Geostatistical Wizard.

Error Matrices
Because the leave-one-out cross-validations (including spatial leave-one-out) produce a prediction for each sample point in the dataset, error matrices can be calculated between observed and predicted sediment classes for each sample. Note a slight difference in observed grain size classes

Error Matrices
Because the leave-one-out cross-validations (including spatial leave-one-out) produce a prediction for each sample point in the dataset, error matrices can be calculated between observed and predicted sediment classes for each sample. Note a slight difference in observed grain size classes caused by the complete absence of gravel from some samples. Quantitative samples were transformed to additive log-ratios, which do not allow zero values. The following tables (A1-A14) correspond to those discussed and compared in results Section 3.4 ("Grain Size Model Evaluation and Comparison"). Error matrices were also calculated to estimate the predictive performance of the categorical grain size and coarse substrate models after optimizing the threshold of occurrence to maximize the sum of sensitivity + specificity. Tables A13 and A14 correspond to results Section 3.6 ("Combined Map and Model Tuning").