# A Spatially Explicit Comparison of Quantitative and Categorical Modelling Approaches for Mapping Seabed Sediments Using Random Forest

*Reviewer 1:* Anonymous

*Reviewer 2:* Anonymous

**Round 1**

*Reviewer 1 Report*

Major Issues:

1) I think that a better job needs to be done in differentiating “quantitative” and “categorical” approaches. Despite the nice flow chart in Figure 2, I have trouble seeing why we should expect the results to be substantially different, since they both ultimately lead to the same classification scheme. Shouldn’t the classification be fairly robust with respect to the methodology used to reach that classification? Small differences are to be expected, of course, but a sample is a sample, and its characterization should not be too dependent on the method used to make that characterization. I also think that the authors mischaracterize the difference in performance between categorical and quantitative predictions for non-spatial cross-validation (Table 4). They assert that this difference is “significant”, but there is no test of that significance. Do we really think that, say, an 85.5% success rate is significantly better than 82.25%? That seems like a statistical fantasy. I think that is marginal, at best, and likely not significant at all, no matter the value of kappa (more on that below). In any case, there is definitely no significance for the spatial cross-validation, which is likely the less biased method per the authors’ own assertion. So it seems like the results largely bear out the likelihood that the two methods are not really that different, and that calls into question the basic motivation for the study.

2) The authors do a very poor job of defining terms and parameters. This is particularly true of the description of the Random Forest Algorithm (e.g., “bagging”, “regression tree”, “forest”?) and the description of the variables selected for sediment grain size models in Tables 2 and 3 (e.g., “backscatter range”, “eastness”, “northness”, “RDMV”?). This made much of the text incomprehensible to the non-expert. Derivative quantitative parameters should be described with a mathematical equation. The evidently critical parameter “kappa” is also undefined – it is only stated, very vaguely, that it “reflects” whether the model achieved better results than to be expected at random. Is this “reflected” by larger or smaller values of kappa? When is kappa significant? How is kappa defined mathematically? Provide a reference for kappa.

3) The sidescan map is probably the most important determinant for sedimentary properties, but it is also the most problematic issue for this study. As described by the authors, the data are derived from many different sources, some at 30 kHz, and others ~300 kHz. This is a clear case of mixing apples and oranges – one cannot assume that the sedimentary response at 30 kHz will be the same as it is at 300 kHz. I think the only valid course of action would be to treat these as completely different data sets, and different inputs into the Random Forest Algorithm. Can that be done? I also think that there are a lot of potential pitfalls in the “normalization” approach to merging the different backscatter data sets (Appendix A). I think this is fine for a qualitative view of the data, but highly suspect as a quantitative methodology because each data set does not sample the same region, and therefore the extrema for each will be different.

4) The conclusions are very vague – I’m not sure that the authors have provided any useful guidance to potential users of these methods.
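As a point of reference for the questions about kappa in Major Issue 2, the following is a minimal sketch that assumes the manuscript's kappa is Cohen's kappa computed from an error matrix. Under that assumption, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of correct classifications and p_e is the proportion expected by chance from the row and column totals. Larger values indicate better-than-chance agreement: 1 is perfect agreement and 0 is chance level. The function name and the example matrix are hypothetical and are not taken from the manuscript.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square error matrix (rows: observed, columns: predicted)."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of samples on the diagonal.
    p_observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of matching row and column totals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical two-class error matrix, invented for illustration only.
example = [[20, 5],
           [10, 15]]
print(round(cohens_kappa(example), 3))  # observed 0.70, chance 0.50 -> 0.4
```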

Minor Issues:

1) I think “random forest machine learning algorithm” should be in the title, since this is an essential component of the methodology.

2) Line 152: Add a sentence or two describing Random Forest algorithm.

3) Line 172: delete “conveniently”

4) Figure 3 and all map figures: the lat/lon annotation is too small

5) Section 2.4: Better describe Random Forest algorithm, and define all jargon terms.

6) Line 254: Delete “Note that”.

7) Line 276: Why is “distance to nearest coast” used as a parameter?

8) Line 321: change “autocorrelated” to “correlated”

9) Line 324: What is meant by “embarrassingly parallel”?

10) Line 370: Need to show the underwater video frames here.

11) Line 388: define “SD”

12) Table 2: change “scale” to “correlation scale”

13) Table 4: The use of footnotes here is very confusing.

14) Appendix C: switch in terminology from “scale” to “major range”. All these semivariogram parameters need to be described.

*Author Response*

We appreciate the detailed comments from Reviewer 1. Attached are responses indicating changes that were made to the manuscript based on these.

Author Response File: Author Response.docx

*Reviewer 2 Report*

Review of: “A spatially-explicit comparison of quantitative and categorical modelling approaches for mapping seabed sediments”

Overview: This paper provides a quantitative comparison of two approaches for modelling and mapping ocean bottom characteristics, focusing on the presence and consequences of spatial autocorrelation not accounted for by the models. It is of wide interest and very clearly written. While the ideal in this situation would be to use modelling approaches that explicitly take spatial correlations into account, the authors rightly point out that these are not currently widely available in an implementation that most users can (or do) take advantage of, so this discussion of the importance of the problem is a key step. (In fact, ignoring crucial independence assumptions is a glaring problem with many, many data science applications today.) This paper will be a valuable addition to the literature and I have only a few minor comments/questions, detailed below.

Detailed Remarks:

Line 33: change to “interpolate or extrapolate”

End of abstract: I’m not sure what is meant by spatial “characteristics” of modelling approaches. Do you mean spatial properties of the study/prediction site, or the way that spatial information is incorporated into the model, or something else?

Line 130: “Categorical models may only require tuning parameters to a single set of variables to predict the entire range of sampled classes.” I’m not sure what this means. Specifically, what are the parameters and variables in question? My initial take was that this means basically “fitting the model to one dataset allows you to make predictions” which (while always true) can’t be the right interpretation…

Line 145: I might suggest “…consequences for prediction and model accuracy, *especially* when extrapolating…” If there is spatial dependence that isn’t properly accounted for in the model, then all model uncertainty estimates will be suspect. Also, it’s not only that it may give strange/incorrect average predictions outside the sampled region – unsampled spots within the sampled region could also have similar issues, especially if they have characteristics similar to sampled spots but are spatially distant.

Line 174: fjord, not fjard?

Paragraph at 240-250: To me, this oversells random forest a bit, and I would also point out its three major drawbacks (discussed or pointed out in other places). Since it is an ensemble of trees, at the end of the analysis it is very hard to visualize or understand exactly how the classification happened, making it a bit of a black box; like most machine learning algorithms, it performs poorly on rare classes; and (most crucially here) it is best suited to classifying a set of independent cases (so while it may deal well with dependence between the different predictors, residual correlation can affect its accuracy). There isn’t an obvious, user-friendly option that doesn’t have the same (or other equally bad) drawbacks, so I’m not suggesting a different analysis, just a clearer up-front statement of these drawbacks.

Lines 258-9: Interesting; I wonder if this is a general fault of the EUNIS classification (does it work well in European waters and less well globally)? If so that might be worth pointing out. If this is just an idiosyncrasy of the current study area, then obviously disregard this comment – bottom properties are not my main expertise, clearly!

Eqn 1-2: Can you please specify explicitly whether these are base-10 or natural logarithms? The log() notation means base-10 in most mathematics applications, but in several software platforms (including R used in this paper) the function log() does a natural log, and based on later equations I think that is what is meant here. It would be helpful just to clarify.

Eqn 3-5: I would strongly prefer to see correct standard mathematical notation instead of computer-code-like shorthands in published equations. (Basically: write $e^{-i\omega t}$ instead of exp(-iωt).) This is particularly important for long-term interpretability, as software platforms and functions change over time. See Edwards and Auger-Méthé 2018 (https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210X.13105).

Lines 273-274: This is true, but not a comprehensive solution to the problem of poor prediction of rare classes (in other words, for some situations all the tuning in the world will not change the fact that prediction performance is poor for rare classes).

Lines 686-699, Appendix B: This procedure is probably adequate, but it may miss cases where one candidate predictor is redundant with a set of others (multicollinearity). One way to check this, and also have a convenient way to look at correlations between/with categorical predictors, would be to use variance inflation factors instead. Another advantage is that it is simpler to decide which of a set of correlated variables to exclude (exclude those with largest VIF instead of referring to fitted models.) These VIFs can be computed in R, for example (there are various packages that implement variations on VIFs), by using function vif() in package car.
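To illustrate the quantity behind the suggestion above, here is a minimal sketch of variance inflation factors for numeric predictors only. It is not the `vif()` function from the R package car, and it does not handle categorical predictors. It relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix, which for two predictors reduces to 1 / (1 - r²). All function names and the example columns are hypothetical.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """Variance inflation factor of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Hypothetical predictors, invented for illustration: the second nearly duplicates the first.
depth = [1.0, 2.0, 3.0, 4.0]
depth_like = [1.0, 2.0, 3.0, 5.0]
print([round(v, 1) for v in vif([depth, depth_like])])
```

A predictor with a large VIF is well explained by the others, so under this approach it would be the first candidate for exclusion.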

Line 695: do you mean function cor(), or package corrr? I don’t see a package cor.

Appendix C: I don’t understand what you mean when you say that “model fits” were from ArcGIS but “variograms and models” were from R. The figures and fitted model parameters, etc. should come from the same platform (unless you can confirm that the algorithms and results are identical). And I’m not sure what the difference between “model” and “model fits” is.

Lines 351-353: There are several levels/types of dependence here, and the permutation test only helps with one of them (the fact that both kappas are estimated using the same ground-truth dataset). There is also the spatial dependence, which is a main focus of the paper. I think this permutation test will only be accurate if the individual points sampled in the model are independent of each other, and a main point of this paper is that they are not and that this matters. So ideally, the permutation test would ALSO somehow take into account the spatial correlation of the observations. Sometimes this is done via block resampling; here I guess this would correspond to shuffling the category labels in blocks instead of totally at random. Failing to do so usually results in p-values that are smaller than they really should be.
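One possible reading of the block-wise idea above can be sketched as follows. This is an illustration under stated assumptions, not the authors' test: it assumes both models predict a class for the same test samples, that each sample carries a hypothetical spatial block identifier, and that the statistic is the difference in overall accuracy (kappa could be passed in as `stat` instead). Under the null hypothesis that the two models are interchangeable, their predictions are swapped for whole blocks at a time, which keeps within-block spatial dependence intact.

```python
import random


def accuracy(truth, pred):
    """Proportion of samples whose predicted class matches the observed class."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def block_permutation_pvalue(truth, pred_a, pred_b, block_ids,
                             n_perm=2000, seed=0, stat=accuracy):
    """Two-sided p-value for the difference in `stat` between two models.

    The two models' predictions are swapped jointly for all samples in a block,
    so samples in the same block are never permuted independently.
    """
    rng = random.Random(seed)
    observed = abs(stat(truth, pred_a) - stat(truth, pred_b))
    blocks = sorted(set(block_ids))
    at_least_as_extreme = 0
    for _ in range(n_perm):
        swap = {blk: rng.random() < 0.5 for blk in blocks}
        perm_a = [b if swap[blk] else a for a, b, blk in zip(pred_a, pred_b, block_ids)]
        perm_b = [a if swap[blk] else b for a, b, blk in zip(pred_a, pred_b, block_ids)]
        if abs(stat(truth, perm_a) - stat(truth, perm_b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical data, invented for illustration: six samples in three spatial blocks.
truth = ["sand", "sand", "mud", "mud", "gravel", "sand"]
model_a = ["sand", "sand", "mud", "sand", "gravel", "sand"]
model_b = ["sand", "mud", "mud", "sand", "sand", "sand"]
blocks = [0, 0, 1, 1, 2, 2]
print(block_permutation_pvalue(truth, model_a, model_b, blocks))
```

With few blocks the number of distinct permutations is small, so the smallest attainable p-value is correspondingly large; that is the price of respecting the spatial structure.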

Lines 372-374: It is annoying that these are different, and it is not totally clear why you didn’t just choose one and use it (and not talk about the other).

Table 4 and similar and Appendix E: Including Appendix E with the error matrices is crucial, and allows the interested reader to compute measures of performance and accuracy other than just % correct classification. (So don’t remove this please!)

Table 4: Why is the SR-LOO CV approach not included here (and in corresponding part of the results section)?

Lines 557-562: This statement sounds like an argument in favor of the approach that ignores spatial correlation. But that seems to me an unwise interpretation. You can either have model results that are wrong because correlation structure in the data is ignored (and reported prediction accuracy is inflated), or you can have results that are wrong because your algorithm is terrible at making accurate predictions of rare classes (and areas with rare classes are mis-classified). Unless you have a preference for one kind of error depending on the way the results will be used, or can show that the magnitude of one of the errors is much bigger, I don’t see why there’s a natural preference for one or the other. A model that commits either kind of mistake is not going to be trustworthy.

Lines 563-568: Another common approach to solve the rare-classes problem is to resample the training data so that there is approximately equal representation of all the classes in the training dataset (and the rest of the data are used for classification/prediction). Maybe the dataset here is too small for this, but is it perhaps worth a sentence or so of discussion (especially since the sub-sampling of the training data might also get rid of some of the spatial correlation problems)?
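The resampling idea above can be sketched as a simple class-balanced split. This is a minimal illustration, assuming down-sampling of every class to a common size; the function name, the `per_class` parameter, and the example labels are hypothetical, and other balancing schemes (for example, over-sampling rare classes) are equally possible.

```python
import random
from collections import defaultdict


def balanced_split(labels, per_class=None, seed=0):
    """Return (train_indices, held_out_indices) with equal class counts in training.

    If `per_class` is None, every class is down-sampled to the size of the
    rarest class, which leaves no held-out samples for that class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    size = per_class if per_class is not None else min(len(v) for v in by_class.values())
    train = []
    for indices in by_class.values():
        train.extend(rng.sample(indices, min(size, len(indices))))
    chosen = set(train)
    held_out = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), held_out


# Hypothetical sediment class labels, invented for illustration.
labels = ["sand"] * 6 + ["mud"] * 3 + ["gravel"] * 2
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx))  # 6 5
```

Setting `per_class` below the size of the rarest class keeps some rare-class samples available for validation, at the cost of an even smaller training set.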

Line 584: Not quite sure what “which allow adjacent samples for model fitting” means. Are some words missing in there?

*Author Response*

The comments from Reviewer 2 were well-informed and highly useful - we feel they have improved the quality of this manuscript substantially. Attached are responses to these indicating where changes were made to the text.

Author Response File: Author Response.docx