Limitations of Predicting Substrate Classes on a Sedimentary Complex but Morphologically Simple Seabed

: The ocean ﬂoor, its species and habitats are under pressure from various human activities. Marine spatial planning and nature conservation aim to address these threats but require su ﬃ ciently detailed and accurate maps of the distribution of seabed substrates and habitats. Benthic habitat mapping has markedly evolved as a discipline over the last decade, but important challenges remain. To test the adequacy of current data products and classiﬁcation approaches, we carried out a comparative study based on a common dataset of multibeam echosounder bathymetry and backscatter data, supplemented with groundtruth observations. The task was to predict the spatial distribution of ﬁve substrate classes (coarse sediments, mixed sediments, mud, sand, and rock) in a highly heterogeneous area of the south-western continental shelf of the United Kingdom. Five di ﬀ erent supervised classiﬁcation methods were employed, and their accuracy estimated with a set of samples that were withheld. We found that all methods achieved overall accuracies of around 50%. Errors of commission and omission were acceptable for rocky substrates, but high for all sediment types. We predominantly attribute the low map accuracy regardless of mapping approach to inadequacies of the selected classiﬁcation system, which is required to ﬁt gradually changing substrate types into a rigid scheme, low discriminatory power of the available predictors, and high spatial complexity of the site relative to the positioning accuracy of the groundtruth equipment. Some of these issues might be alleviated by creating an ensemble map that aggregates the individual outputs into one map showing the modal substrate class and its associated conﬁdence or by adopting a quantitative approach that models the spatial distribution of sediment fractions. We conclude that further incremental improvements to the collection, processing and analysis of remote sensing and sample data are required to improve map accuracy. To assess the progress in benthic habitat mapping we propose the creation of benchmark datasets.


Introduction
The majority of anthropogenic pressures on the seabed are concentrated near land within the continental shelf [1][2][3]. The increasing use of the marine environment and overlapping nature of these and characterisation of benthic marine environments [12], mapping of seabed habitats for better marine management [13], seafloor backscatter data from swath mapping echosounders [14], marine geomorphometry [15] and geological seafloor mapping (https://www.mdpi.com/journal/geosciences/ special_issues/Geological_Seafloor_Mapping). Review papers were presented dealing with acoustic seabed classification [16], benthic habitat mapping [17], spatial scale and geographic context in benthic habitat mapping [18], marine geomorphometry [19] and the transfer of knowledge from terrestrial mapping to image-based seabed classification [20]. A sizeable amount of published studies compared predominantly supervised classification approaches in their ability to map seabed substrates and habitats [21][22][23][24][25][26][27][28]. Such studies typically use a measure of map accuracy as a metric to compare the different methods in their ability to derive reliable seabed maps.
The proliferation of seabed mapping studies has shown that MBES data, when combined with groundtruthing samples, can produce accurate and valuable data products [29][30][31][32]. However, the diversity of options available means that users often may choose different approaches to the same problem. Such choices are frequently based on aspects like availability of specific software and familiarity of the user with certain types of analytical approaches. Given the breadth of options that are available (e.g., rule-based classification, geostatistics, machine learning, object-based image analysis) it is unlikely that one user will be able to master them all. Knowledge on which approach is most successful under certain circumstances is also still limited. Comparison studies like those mentioned above are therefore still needed. In an endeavour to shed some light on the aforementioned questions we conducted a seabed classification comparison exercise. We invited researchers from the three European seabed mapping programmes MAREANO, INFOMAR and Marine Environmental Mapping Programme (MAREMAP), together with the Royal Belgian Institute of Natural Sciences, to apply their preferred seabed mapping methods to a common dataset of acoustic remote sensing and physical and optical groundtruthing data. To test the limits of the applied methods a highly heterogenous and therefore challenging site to map was selected.

Aim and Objectives
Five approaches to creating a substrate map from a common dataset were undertaken to explore the issues and complexities associated with habitat mapping. The main objectives of the study were as follows: (i) examine how each approach succeeded in interpreting the acoustic data; (ii) quantitatively measure the accuracy of the outputs from each method of interpretation; (iii) explore agreement between the maps and the possibility of creating an ensemble map; (iv) discuss the limitations of such approaches in terms of technical issues, prescribed classifications and stakeholder expectations; (v) suggest possible solutions to the issues highlighted.

Study Site and Data
The area selected for the common dataset was the East of Haig Fras Marine Conservation Zone (MCZ) which covers an area of approximately 400 km 2 (20 km by 20 km) and is situated within the Celtic Sea on a plateau of the UK continental shelf. Made up of a complex mosaic of rock and sediment substrates, this site has been the focus of four different surveys as part of the UK MPA Programme. The intent of these surveys was to provide evidence for the site designation and subsequent monitoring of seabed habitat condition. An initial acoustic survey in 2012 provided 100% coverage MBES data [33]. Data were acquired using a Kongsberg Simrad EM710 and processed to International Hydrographic Organisation (IHO) Order 1 specifications. Bathymetry data were processed using Caris HIPS and SIPS v7.1 and gridded at 2 m by 2 m resolution. Backscatter were processed in FMGT v7.8.2 using Remote Sens. 2020, 12, 3398 4 of 23 standard settings and full swath and having removed cross track lines. Backscatter data were exported to the same spatial grid as the bathymetry. Groundtruthing of the site was then undertaken in 2012 and 2013 to support the creation of habitat maps [34] and the designation of the MCZ. A subsequent groundtruthing survey was undertaken in 2015 as the first timepoint in what will be a long-term monitoring programme of benthic faunal communities within the MCZ [35].
For the purposes of this mapping exercise, groundtruth samples collected during all three groundtruthing surveys have been used. For observations of sediment type, only samples collected using a 0.1 m 2 Mini Hamon grab were retained for analysis and a minimum distance of 300 m between stations was applied to remove resampled stations. From each grab a 0.5 L subsample was used for Particle Size Analysis (PSA). Sedimentary substrates (coarse sediment (CS), sand and muddy sand (Sa), mud and sandy mud (Mu) and mixed sediments (Mx)) were classified according to Long [36] as shown in Figure 1. GPS positions for the sediment grabs were recorded at the instant the grab contacted the seabed. The position was calculated by applying a known offset based on the distance between the side gantry and the antenna location.
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 23 undertaken in 2012 and 2013 to support the creation of habitat maps [34] and the designation of the MCZ. A subsequent groundtruthing survey was undertaken in 2015 as the first timepoint in what will be a long-term monitoring programme of benthic faunal communities within the MCZ [35].
For the purposes of this mapping exercise, groundtruth samples collected during all three groundtruthing surveys have been used. For observations of sediment type, only samples collected using a 0.1 m 2 Mini Hamon grab were retained for analysis and a minimum distance of 300 m between stations was applied to remove resampled stations. From each grab a 0.5 L subsample was used for Particle Size Analysis (PSA). Sedimentary substrates (coarse sediment (CS), sand and muddy sand (Sa), mud and sandy mud (Mu) and mixed sediments (Mx)) were classified according to Long [36] as shown in Figure 1. GPS positions for the sediment grabs were recorded at the instant the grab contacted the seabed. The position was calculated by applying a known offset based on the distance between the side gantry and the antenna location. Observations of hard substrate were derived from seabed imagery using a combination of towed camera sledge (2012) and drop camera (2013 and 2015) transects. In 2012, a random subset of the sediment sample locations was targeted for seabed imagery. The 2013 and 2015 surveys then specifically targeted hard substrate based on acoustic data and existing maps [34]. Originally, transects were conducted as 10-minute tows at approximately 0.5 knots (0.26 m s −1 ) with a still image taken every 30 s. For the mapping exercise only still images where the analyst observed the presence of boulders or bedrock were retained as hard substrate observations. A subset of these were then randomly sampled to produce a single observation per transect. Positional data for seabed imagery were recorded using High Precision Acoustic Positioning (HiPAP), model 500 to measure the offset of the frame relative to the vessel.
In total the groundtruth dataset included 384 samples consisting of five substrate types. A random stratified sampling approach (based on substrate class) was used to split the data into training and testing samples based on a 7:3 split (70% training/30% testing) ( Table 1). Only the training samples (totalling 268) were provided to the individual participants for the mapping exercise. Testing samples were used to estimate the accuracy of the individual maps. Table 1 gives an overview of the samples, including the symbols used throughout this study and their relation to EUNIS and the Marine Habitat Classification for Britain and Ireland (previously Marine Nature Conservation Review, MNCR).Modelled current speed data were also available for the MCZ [35]. Using the hydrodynamic software Telemac2D v7.1, an unstructured triangular mesh was used to model current speeds at various depths within the model domain. The model was developed for the Observations of hard substrate were derived from seabed imagery using a combination of towed camera sledge (2012) and drop camera (2013 and 2015) transects. In 2012, a random subset of the sediment sample locations was targeted for seabed imagery. The 2013 and 2015 surveys then specifically targeted hard substrate based on acoustic data and existing maps [34]. Originally, transects were conducted as 10-minute tows at approximately 0.5 knots (0.26 m s −1 ) with a still image taken every 30 s. For the mapping exercise only still images where the analyst observed the presence of boulders or bedrock were retained as hard substrate observations. A subset of these were then randomly sampled to produce a single observation per transect. Positional data for seabed imagery were recorded using High Precision Acoustic Positioning (HiPAP), model 500 to measure the offset of the frame relative to the vessel.
In total the groundtruth dataset included 384 samples consisting of five substrate types. A random stratified sampling approach (based on substrate class) was used to split the data into training and testing samples based on a 7:3 split (70% training/30% testing) ( Table 1). Only the training samples (totalling 268) were provided to the individual participants for the mapping exercise. Testing samples were used to estimate the accuracy of the individual maps. Table 1 gives an overview of the samples, including the symbols used throughout this study and their relation to EUNIS and the Remote Sens. 2020, 12, 3398

of 23
Marine Habitat Classification for Britain and Ireland (previously Marine Nature Conservation Review, MNCR).Modelled current speed data were also available for the MCZ [35]. Using the hydrodynamic software Telemac2D v7.1, an unstructured triangular mesh was used to model current speeds at various depths within the model domain. The model was developed for the south-west UK continental shelf based on a 3 km node spacing with increased resolution (25 m node spacing) within the study area. By modelling the data across a 30-day spring-neap cycle the mean and maximum current velocity were extracted for the area of interest. This 25 m node spacing was then interpolated to the same spatial grid as the bathymetry. All participants were provided with the training groundtruth observations and four predictor variables (MBES bathymetry, MBES backscatter, mean current speed at the seabed and maximum current speed at the seabed) with which to derive a seabed substrate map ( Figure 2).

Methods
Unlike studies that tested the performance of different types of spatial prediction methods or machine learning algorithms e.g., [25,28], the choice of the appropriate method was left to the scientists who were involved in this study. Participants were also free to choose which predictor variables to include in the analysis. The five methods used either pixels or image objects/superpixels as the unit of analysis [37]. All classification methods were supervised, i.e., model building was guided by observed classes and their relation to predictor variables. A central assumption of all supervised classification methods is that all substrate types present in the area to be mapped have been observed or sampled.

Method A
The random forest algorithm was used for classification. The algorithm was chosen due to reported high predictive accuracy in studies focusing on the comparison of supervised classifications of MBES data [23,38,39] and has generally proven highly successful in data mining applications [40]. The main underlying assumption of this method is that the predictive power of multiple decision trees is higher than that of a single tree. Bootstrapped samples from the training data are used to construct the individual trees in the forest introducing the first element of randomness. In turn, a random subset of the predictor features is used at the node splits throughout the construction of the model. The result is the construction of unique trees (whose correlation amongst them decreases considerably by this approach). Decisions about the class allocation (labelling) are made on the basis of majority votes of individual trees. Following a feature selection procedure based on the random forest algorithm [41], the random forest model was run growing 500 trees and leaving the parameters as default (m try = 2).
To reduce user identified artefacts in the bathymetry a pre-processing step was applied to the bathymetry layer using a majority filter and a neighbourhood of 27 pixels. From this smoothed bathymetry layer, a range of derivatives were generated for inclusion in the model. Feature selection identified five predictor variables that were used for subsequent model training, namely: backscatter, Remote Sens. 2020, 12, 3398 6 of 23 smoothed bathymetry, Moran's autocorrelation, slope, and bathymetric position index. The routine was implemented in R [42] and models applied using the RandomForest package [43].

Method B
A composite image was generated using 3 variables: backscatter, bathymetry, and maximum current speed. This composite image was classified using the multivariate classification toolset in Esri ArcGIS v10.3 following the methods of Calvert et al. [44] and Dolan et al. [45].

Method B
A composite image was generated using 3 variables: backscatter, bathymetry, and maximum current speed. This composite image was classified using the multivariate classification toolset in Esri ArcGIS v10.3 following the methods of Calvert et al. [44] and Dolan et al. [45].
The backscatter data were segmented to generate superpixels for classification. The segmented image was filtered to remove segments generated due to nadir effects. The resulting gaps were filled using neighbourhood analysis of the surrounding cells. This augmented, segmented image was input as the backscatter variable into the composite image. The bathymetry and maximum current speed data did not require any pre-processing and could therefore be input directly as bands 2 and 3 into the composite image. Groundtruth data locations were used to extract statistical descriptions of each substrate class for each band. This statistical information was stored in a signature file and used to classify the entire area using the Maximum Likelihood Classification tool.

Method C
A supervised classification based on substrate observations and a large array of predictor variables was carried out. Subsequently, image pixel-level results were generalised by applying a multiresolution segmentation and averaging over the resulting image objects. Apart from the segmentation process, which was carried out with eCognition software, all processes were run in R statistical software.
Image pre-processing included de-speckling of the backscatter data with a 3 pixel by 3 pixel Lee filter (Speckle Function in ArcGIS Image Analysis). The following derived variables were calculated from the primary variables bathymetry and de-speckled backscatter: backscatter roughness, curvature, eastness, northness, roughness, slope topographic position index (tpi) and vector ruggedness measure (vrm). The calculation was carried out at multiple scales with neighbourhood sizes of 3, 5, 7 . . . 51. Subsequently, multi-scale means, and standard deviations were calculated across scales. Together with mean and maximum current speed, 272 potential predictor variables were submitted to a two-step predictor variable selection process [46]: the Boruta algorithm [41] was used to identify important variables. Correlated variables were subsequently removed. The final selection included backscatter (17), roughness (21) backscatter roughness (5), tpi (51) and vrm (3).
A random forest classification model was trained on the training samples and the selected predictor variables. The number of trees in the forest was set to n tree = 500, while the number of predictor variables to consider at any given split (m try ) was tuned with a grid search algorithm. A value of m try = 2 was subsequently used and class probabilities were predicted at the pixel level.
A multi-resolution image segmentation was carried out on the backscatter (17) layer with a scale parameter of 10, shape = 0.1 and compactness = 0.5. The segmentation result was then used as a means of generalising the noisy pixel-level predictions by calculating mean class probabilities per image object and, based on this, the modal substrate class.

Method D
Seabed maps were generated using the Remote Sensing Object Based Image Analysis (RSOBIA) segmentation toolbar [47] in ArcMap, following the approach applied in [48]. The two main independent datasets were considered to be the bathymetry and backscatter mosaic. Both were of comparable acquired resolution and therefore could be taken to be of good coverage of the survey area. The modelled current velocity data were not used.
The RSOBIA toolbar is designed to be quick and user-friendly. The "MBES segmentation" tool requires the bathymetry and backscatter files and the user to define the number of different clusters to be identified and the minimum size of the clusters [48]. The values of these two user defined parameters are based on whether the focus is to produce a high-resolution interpretation of small patches of the survey area or a more regional interpretation of large patches of the survey area. The default values of 20 clusters and 10,000 pixels minimum size were initially used and an Remote Sens. 2020, 12, 3398 8 of 23 interpretation created. A second run of the program (with a smaller 200 pixels minimum polygon size) produced a much higher resolution interpretation. The RSOBIA tool calculates the slope and local roughness from the bathymetry data and combining the backscatter data with these two derivatives uses a k-means clustering algorithm [49] to find data clusters (classes) which are then put through an iterative elimination process to provide geographically distinct areas. More clusters than final data classes were given so that similar clusters can be amalgamated by the user. The fine scale interpretation was selected as the final segmentation.
The segmentation was then compared with the sample grab data. The segmentation class numbers were correlated with sample type and used to spread classification descriptions to areas without actual samples. In some cases, two differing sample types were found to be in a single segmentation class, at which point expert judgement was given.

Method E
Six derivatives of bathymetry were created using the Terrain Attribute Selection for Spatial Ecology (TASSE) tool [50] based on a circular radius of 25 cells. Some noise was present in the bathymetry data, so this radius was selected to characterise the local depth variation rather than the noise. Backscatter, bathymetry, the six derivatives of bathymetry (bathymetry mean, eastness, northness, Relative Deviation from Mean Value (RDMV), standard deviation and slope) and maximum currents at the seabed were imported into eCognition 9.3 for segmentation using the multiresolution segmentation algorithm. Various scale parameters and input variables were trialled with the final selection based on expert judgement. In this case, oversegmentation was considered preferable to undersegmentation so the intention was to ensure that boundaries visible to the eye were also present in the segmentation. The final segmentation used backscatter, bathymetry, and the standard deviation of bathymetry layers (with weightings of 2:1:1, respectively) and a scale parameter of 30 (default shape and compactness). Objects where a sample was present were classified according to that sample class and the mean of each predictor variable was then calculated for all objects along with geometric variables such as size and shape.
Variable selection was determined using Principle Component Analysis. Six variables were to be included in the model as these explained 80.1% of the variance and had eigenvalues >0.99. For each component, the most highly correlated variable was included. This resulted in bathymetry, backscatter, eastness, northness, RDMV and slope being retained for modelling. The most probable seabed substrate type was modelled using the methodology described in Mitchell et al. [51]. A bootstrap aggregation modelling approach was used to derive multiple random forest [52] models from the training data by repeatedly randomly subsetting the samples with replacement. Models were then used to predict across the study site with the final prediction derived from a plurality vote. Individual random forest models were generated under default settings based on a 70% split of the training data, and the plurality vote was based on 25 subsets of the training data.

Measuring Map Accuracy
The assessment of map accuracy was based on the testing samples, which were not used in the predictions. Predictions made with the methods described above were extracted for every location in the testing dataset and confusion matrices were built. Overall accuracy was used to evaluate the global accuracy of the predictions, while error of omission and error of commission were selected as class-specific metrics of accuracy. The overall accuracy gives the percentage of cases correctly allocated and is calculated by dividing the total number of correct allocations by the total number of samples [53]. The error of omission is the number of incorrectly classified samples of one class divided by the total number of reference samples of that class. The error of commission is the number of incorrectly classified samples of one class divided by the total number of samples that were classified as that class [54]. The overall accuracy, its 95% confidence intervals and a one-sided test to evaluate whether the overall accuracy was significantly higher than the no information rate (NIR) were calculated by applying the confusionMatrix function of the caret package [55]. The confidence interval is estimated using a binomial test. The NIR is taken to be the proportion of the most frequent class. Errors of omission and commission are not provided by the function but can be calculated from the confusion matrix.

Ensemble Map and Map Agreement
All predictions were rasterised (if necessary) at a resolution of 10 m and aligned. Class-specific agreement between the map outputs was determined by counting the number of instances a class was predicted for every pixel. The respective pixel was assigned the class that was most frequently predicted (modal class) to derive an ensemble map. The agreement between the map outputs of the modal class served as an indicator of confidence in the ensemble predictions. All predictions were rasterised (if necessary) at a resolution of 10 m and aligned. Class-specific agreement between the map outputs was determined by counting the number of instances a class was predicted for every pixel. The respective pixel was assigned the class that was most frequently predicted (modal class) to derive an ensemble map. The agreement between the map outputs of the modal class served as an indicator of confidence in the ensemble predictions. The boxplots of backscatter and bathymetry by substrate class reveal a large overlap in the ranges of those predictor variables (Figure 4). This is particularly obvious for the backscatter response of CS and Mx, which are near-identical, and to a lesser degree Mu and Sa. Bathymetry does not appear to separate between substrate types apart from Mu. Mean and maximum current speed appear to have virtually no discriminatory power. The boxplots of backscatter and bathymetry by substrate class reveal a large overlap in the ranges of those predictor variables (Figure 4). This is particularly obvious for the backscatter response of CS and Mx, which are near-identical, and to a lesser degree Mu and Sa. Bathymetry does not appear to separate between substrate types apart from Mu. Mean and maximum current speed appear to have virtually no discriminatory power.

Map Accuracy
All maps have overall accuracies significantly higher than the NIR ( Figure 5). Overall accuracies vary from 46% to 59%. R is generally predicted with acceptable accuracy (mean omission error ≈ 15%, mean commission error ≈ 25%), but prediction errors are high for all other substrates ( Figure 6). Accordingly, the overall accuracies are low (mean of 50%). The ensemble map does not perform better than the individual methods A to E. Contingency tables relating to the five produced maps are available as Supplementary Tables S1-S5. The data behind Figure 6 is available as Supplementary Table S6.

Map Accuracy
All maps have overall accuracies significantly higher than the NIR ( Figure 5). Overall accuracies vary from 46% to 59%. R is generally predicted with acceptable accuracy (mean omission error ≈ 15%, mean commission error ≈ 25%), but prediction errors are high for all other substrates ( Figure 6). Accordingly, the overall accuracies are low (mean of 50%). The ensemble map does not perform better than the individual methods A to E. Contingency tables relating to the five produced maps are available as Supplementary Tables S1-S5. The data behind Figure 6 is available as Supplementary Table S6. vary from 46% to 59%. R is generally predicted with acceptable accuracy (mean omission error ≈ 15%, mean commission error ≈ 25%), but prediction errors are high for all other substrates ( Figure 6). Accordingly, the overall accuracies are low (mean of 50%). The ensemble map does not perform better than the individual methods A to E. Contingency tables relating to the five produced maps are available as Supplementary Tables S1-S5. The data behind Figure 6 is available as Supplementary Table S6.

Map Agreement
Individual map outputs of the five methods are shown in the Supplementary Figures S1-S5. The agreement between methods for the five substrate classes is shown in Figure 7. These maps might be interpreted as showing the probability of a specific substrate class to occur in a specific location (pixel). There was no class predicted by the majority of maps for only 15.4% of the study area; however, only 15.8% of the area was predicted as the same class in all maps ( Table 2). The ensemble map (Figure 8) based on the predictions of Methods A to E shows the modal class, i.e. the substrate class that was most frequently predicted and the agreement between the predictions. The contingency table of the ensemble map (Table 3) shows that R is generally predicted with high accuracy, e.g. 25 of the 27 reference samples labelled as R were also predicted as R. Sedimentary substrates did, however, show a high degree of confusion and low accuracy. This is especially true in the case of CS, which was most frequently predicted as Mx. No class (i.e., map agreement <3) was most frequently predicted in areas of Mx, followed by Sa. Based on the test set, we also assessed prediction accuracy for individual agreement classes (Table 4). This highlights that map accuracy is higher when all five predictions agree.

Map Agreement
Individual map outputs of the five methods are shown in the Supplementary Figures S1-S5. The agreement between methods for the five substrate classes is shown in Figure 7. These maps might be interpreted as showing the probability of a specific substrate class to occur in a specific location (pixel). There was no class predicted by the majority of maps for only 15.4% of the study area; however, only 15.8% of the area was predicted as the same class in all maps ( Table 2). The ensemble map (Figure 8) based on the predictions of Methods A to E shows the modal class, i.e., the substrate class that was most frequently predicted and the agreement between the predictions. The contingency table of the ensemble map (Table 3) shows that R is generally predicted with high accuracy, e.g., 25 of the 27 reference samples labelled as R were also predicted as R. Sedimentary substrates did, however, show a high degree of confusion and low accuracy. This is especially true in the case of CS, which was most frequently predicted as Mx. No class (i.e., map agreement < 3) was most frequently predicted in areas of Mx, followed by Sa. Based on the test set, we also assessed prediction accuracy for individual agreement classes (Table 4). This highlights that map accuracy is higher when all five predictions agree.  Table 2. Total predicted area, as a percentage of the mapped area, for each class in the ensemble map. Derived from right panel in Figure 8. Where the five methods failed to predict a majority class this was classified as 'No class'.   Table 2. Total predicted area, as a percentage of the mapped area, for each class in the ensemble map. Derived from right panel in Figure 8. Where the five methods failed to predict a majority class this was classified as 'No class'.

Discussion
The overall accuracy of the maps from the various mapping methods were lower than expected, although significantly higher than the NIR. These five attempts to map the site use various approaches that are commonly applied, yet accuracies of ≈ 50% suggest limited success. An 85% target accuracy has often been adopted in thematic mapping for remotely sensed data but stems from

Discussion
The overall accuracy of the maps from the various mapping methods were lower than expected, although significantly higher than the NIR. These five attempts to map the site use various approaches that are commonly applied, yet accuracies of ≈50% suggest limited success. An 85% target accuracy has often been adopted in thematic mapping for remotely sensed data but stems from assumptions that pixels are purely a single thematic unit, areal extent is definable and that all groundtruth data are positionally accurate and error-free [56]. The accuracy of the five maps would make monitoring change and implementing management strategies difficult, as the map error would be considered so great that actual seabed change would need to be enormous to be detected. While these data types (acoustics, grabs and videos of rock) and the classification scheme (EUNIS) are commonly used in conservation and spatial management and the mapping methods have been widely applied with success in other studies, supporting their use, the results suggest that these are not always suitable. It appears timely to assess what issues would be preventing better maps from being achieved and consider alternative approaches for challenging sites such as this one. While there are other sources of error, we consider the main issues for this site to fit under four main categories: (i) Sample type and classification scheme; (ii) acoustic discrimination; (iii) scale and (iv) outputs. Each of these sources of error are discussed below along with potential options that could be used in the future to overcome these issues.

Sample Type and Classification Scheme
Model development and accuracy assessments used reference data in the form of grab and video samples to determine what the 'true' substrate type was for a given location. Therefore, it is worth considering how representative the reference data are that have been classified. For example, the variation in sediment composition across the site is gradational and the classification based on a rigid scheme might appear arbitrary (Figure 3). Small variations, within the margin of error for different analysis techniques [57], could result in two samples of the same sediment, in terms of broad character and associated biota, being classified as different classes. The need to fit continuous data into predefined categories can therefore be problematic as samples that differ by only a few percent may be acoustically similar but classified as different sediment types. This is a common issue of broad scale classification schemes such as the EUNIS level III, that was adopted from a geologically minded classification scheme, and not based on class descriptions that are defined from acoustic data. It also means the majority of sediment information (originally measured to half-phi grain size distributions) was discarded prior to any attempt to model sediments. Further considerations for repeatability of sediment sampling relate to how subsamples of gravelly sediments are taken, how shell fragments are treated during grain size analysis, and the potential loss of parts of the fine fraction during retrieval of grab samples from the seabed.
For a map to support marine spatial planning and environmental conservation it must have a classification scheme that is relevant to the environment that is being measured and these classes must be mappable. There are many alternative classification methods and schemes to EUNIS level III from which maps of higher accuracy could be created. For example, in order to define sediment classes that are distinctly different one option may be to group sediment fractions in a way that maximises entropy between the groups [58]. It is also worth keeping in mind that if the maps are created to support our understanding of how the biology varies across a site, then a focus on physical properties may not be ideal. The question may be how large a variation in grain size fractions is required to change the biological community, given changes of a few percent (of mud, sand, and gravel fractions) do not appear to impact the acoustic signature? Cooper et al. [59] have suggested defining meaningful habitat groups based on infaunal species assemblages and then developing maps from these classes. It is possible that some of the issues encountered in this challenging site would become irrelevant, for example if the infaunal species assemblages observed in similar classes (like CS and Mx) supported the same communities of species.
The EUNIS level III scheme was developed at a transnational level, and has succeeded in supporting harmonisation of samples and maps across research institutes, governments and organisations [60,61]. Site specific classifications, such as maximum entropy and infaunal assemblage classes, may not be desirable where maps form part of larger programmes and management strategies. More detailed classification schemes are available (i.e., further subdividing the sediment classes such as Folk 11 or Folk 16 [60,62]; however, increased classes may also require more samples as each class would become "rarer". On the other hand, reducing the number of classes by aggregating "overlapping" categories (in this case merging Mx with CS and Sa with Mu) has the potential to improve classification accuracy by compromising thematic resolution. It is therefore up to the manager to determine whether less classes would still be relevant for managing the area at an ecological scale.

Acoustic Discrimination
A discussion of defining mappable classes leads to the second issue: insufficient information in the two acoustic datasets used by all analysts (bathymetry and backscatter) limited the ability to discriminate between the thematic classes. In this study the bathymetric range was only 10 m over an area of 400 km 2 and therefore much of the interpretation was dependent on the backscatter strength.
Other studies have noted that a single-frequency backscatter layer contains insufficient information to accurately predict the full Folk sediment spectrum, and class-aggregation may be required [63][64][65]. Attempts to map this site encountered the same issue. Differences between R and areas of Sa or Mu were evident in the backscatter (likely resulting from the contribution of the differences in acoustic impedance contrasts), but backscatter failed at separating Mx from CS as well as having considerable overlap between Mu and Sa classes (Figure 4). It is clear from both pioneering [66] and more recent research e.g., [67] that the backscatter strength registered by an MBES does not exclusively relate to the relative percentages of the sediment fractions. Rather, it relates to differences in a combination of acoustic impedance contrasts, sediment, and topographic roughness (i.e., at the level of the grain size which exhibits intrinsic roughness and at the level of sub-beam topographic roughness such as oscillatory ripples) and sediment volume inhomogeneities [68]. Refining the relationship between frequency, seafloor type (including porosity, compactness and permeability) and backscatter response is continually being improved [14,16,63,65,69,70]. Further developments to better utilise backscatter data such as angular response analysis (ARA) [71][72][73] and the hyper-angular cube (HAC) [21,74] may also increase the predictive power of acoustic data [75]. The recent development of MBES systems that can acquire multi-frequency backscatter simultaneously also has the potential to improve mapped accuracy [29,76]. As the low and high frequencies interact with soft sediments differently [77] this could potentially prove most beneficial for complex but morphologically flat sites such as this one, essentially doubling the amount of information available compared to a single frequency MBES.

Scale
Scale issues are of fundamental importance in seabed mapping but are frequently not explicitly dealt with [18]. Scale mismatches frequently occur between observations and the resolution of predictor variables. In this study, the pixel size of the predictor variables was 2 m by 2 m. This compares with an area of 0.1 m 2 (1/40 of the area covered by a pixel) sampled with a mini Hamon grab as employed in this study. Furthermore, only a sub-sample of approximately 500 mL was used for subsequent grain-size analysis. This indicates a scale mismatch between the response and predictor variables. The implications are that grab samples might not necessarily be representative of the environment to which they are related. The field of view of the optical data is dependent on the type of equipment. In the case of a drop camera, the field of view varies with the altitude above the seabed between maybe 0.2-1 m 2 . The towed sledge system did provide a constant field of view of approximately 0.7 m 2 . Such fields of view are much closer to the size of a pixel of 4 m 2 , hence optical data might be more representative when compared to grab sample data.
Several methods used superpixels or image objects rather than pixels as the unit of analysis. The size of the created polygons is however variable and dependent on the segmentation parameters (which differ between approaches) and the spatial heterogeneity of the predictor variable that is being segmented. Despite this, it follows that the mismatch between the area covered by image objects or superpixels and the groundtruthing data is even more pronounced. Object-based methods have been advocated in the literature over the last 20 years or so [78,79], due to their perceived strengths [80]. GEOBIA approaches have also been applied to marine datasets for more than ten years [29,30,48,[81][82][83][84][85]. However, it should be kept in mind that it is much easier in terrestrial mapping to obtain a holistic view of the ground conditions than in marine mapping. Drop-camera systems with the ability to rotate and tilt [31], thereby capturing a larger expanse of the seabed might be a way to bridge the gap between the area covered by seabed imagery and predictor variables.
A critical issue in this study was the high complexity and spatial heterogeneity of the site, with changes in substrate types at scales of several metres to a few tens of metres. For the towed camera sledge and the drop camera, which were both fitted with HiPAP systems, we assume positioning errors of 5-6 m in approximately 100 m water depth [86]. For the grab samples, the position was recorded as the offset to the side gantry relative to the vessel's GPS. Additional offset from this recorded position due to drift might be on the order of 10 m. Given these estimated positioning accuracies, it is likely that in many cases a sample was associated with incorrect values of the response variables. Achieving high map accuracy requires a positioning accuracy in line with the spatial heterogeneity of the site. In situations like those encountered here, this might mean pushing the positioning accuracy requirements to the limits of what is technically feasible. An alternative might be a simulation approach by which many models were fitted to training data that used samples that were randomly relocated based on a bivariate probability function around the original location [28].
In addition to spatial scale, there is the possibility that temporal scale may have been a source of error due to the time interval between the acquisition of MBES data and subsequent groundtruth sampling (up to three years). Movement of benthic sediment through major storms or currents are common and large moving dune features have been observed at depths > 100 m, particularly where strong tidal currents are present ( [87]: maximum~2.0 ms −1 ). However, our analysis assumed no change had occurred to sediment distributions between surveys and features sampled in the MBES data were represented in the groundtruth samples. This assumption was based on: the absence of obvious mobile sedimentary features in the bathymetry, such as sandwaves; relatively slow currents predicted across the site ( Figure 2C,D); and the minimum depth (94 m) suggesting wave action would be of limited influence. Modelled data from Aldridge et al. [88] supports this assumption, as annual maximum disturbances for this area were <7.5 cm for the most mobile sediment component (sand). Although given the observed heterogeneity of the site, the actual annual maximum disturbance is likely to be even less. Nevertheless, we cannot rule out that some change may have occurred and where possible, avoiding this temporal scale issue is desirable.

Output
In the previous sections we have discussed how changes to the classification scheme, either by adopting a site-specific scheme or by merging classes, may have improved map accuracy. However, this is often out of control for the map producer, as it is commonly the stakeholders who specify the format of map outputs. Another option would be to aggregate the information contained in five maps into a single map of the modal class associated with spatial information on agreement between predictions. Such ensemble maps have been proposed as an effective way of improving classification performance [89,90]. While this was not supported by our results ( Figure 5) and other studies [91], the ensemble map provides additional information on the agreement between the outputs derived with different methods. Intuitively, this could be interpreted as a spatially-explicit measure of confidence [92]. Maps like Figure 8 could fulfil the recommendation to include spatial representations of confidence [20], thereby allowing future map users to interpret the outputs based on their needs. However, a more detailed analysis (Table 3) revealed that map accuracy increased markedly only in the case of complete agreement. Nevertheless, such information is valuable for the map user: for example, if the goal were to investigate or monitor a specific substrate type, then samples could be positioned to target the relevant class in areas of complete agreement. Conversely, if the aim were to improve map accuracy then reclassifying areas of low agreement as hybrid classes, or even acquiring additional samples may be preferred.
In this exercise scientists were tasked with creating a thematic map for the site. However, considering the limited success, an alternate approach may have been to generate predictions as class-specific probability maps or quantitative sediment composition maps [31,93,94]. This could be achieved by generating maps in two stages. Initially, R substrate could be delineated using a presence/absence mapping approach [95,96]. Sediment substrates could then be mapped using methods similar to Misiuk et al. [93] and Mitchell et al. [94]. Spatial predictions like these may better capture the heterogeneity of the seafloor, highlighting areas that are relatively fine or coarse, even though that variation may be within a single EUNIS level III class. Unlike thematic mapping based on predefined substrate classes, quantitative modelling is also able to map rare and unsampled classes [93]. In addition, should stakeholders and other map users require thematic maps these could be calculated from the predictions by applying the desired classification scheme.

Way Forward
Whether field sampling, sample processing, remote sensing or subsequent classification, the approaches used in seabed mapping are continually improving. Here we have presented a range of methods and discussed some of the main issues and possible solutions as they relate to a heterogenous and complex site. However, this is not an exhaustive list (for a detailed error analysis see Strong [97]). Improvements to the seabed mapping workflow are not typically great leaps forward (although a small number of exceptions may exist such as single beam echosounders to MBES) but rather a series of incremental changes to each component of the workflow. However, what may work in one location does not always transfer to all environments. In the terrestrial remote sensing community, the creation of benchmark datasets allows users to compare approaches on common data to understand the relative strengths and weaknesses of new innovations [98][99][100]. For seabed mapping, different benchmark datasets should cover different environments, MBES systems (including multi-spectral backscatter) and sampling methods, with sufficient observations (already split into training and testing data) to allow comparison. The data used in this study are published alongside this manuscript and are accessible for use under Crown Copyright to serve as one such example of a benchmark dataset.

Conclusions
The aim of this paper was to test a suite of well-established supervised classification methods on a common dataset. We highlighted the challenges summarising a complex seabed into five thematic classes using a classification scheme that targets harmonisation of maps across marine institutions at a European level. Although the backscatter mosaic is one of the most frequently used [17] and important predictor variables in marine habitat mapping [28,29], it should not be relied on solely to discriminate the subtle differences between sediment types with only fractional differences in sediment composition at such fine scales. Promising new possibilities at various stages of the classification process are currently being developed. These include developments of remote sensing technology and improvements of backscatter processing approaches, including the merging of uncalibrated backscatter mosaics [101], calibrated backscatter [102], multispectral backscatter [103], ARA [104] and HAC [21], acquiring more representative samples by applying rigorous sampling designs [105,106] and the derivation of novel features from the primary MBES data with increased discriminatory power for sediment classification [107,108]. We expect progress to mainly occur through refinement at each stage of the mapping process as discussed above. Alternatively, class-specific probability maps or quantitative sediment composition maps may overcome some of the issues encountered with maps of low thematic accuracy. The rapid uptake of maps has shown their importance, but further comparative exercises like this are required to identify research gaps at the various stages of the classification process focussing on challenging sites. Progress in seafloor substrate and habitat mapping will ultimately improve decision making in marine spatial planning and conservation.   Table S6. Results of the accuracy assessment. Overall map accuracy (as shown in Figure 5). Benchmark dataset: Diesing et al.