Machine Learning for Gully Feature Extraction Based on a Pan-Sharpened Multispectral Image: Multiclass vs. Binary Approach

: Gullies reduce both the quality and quantity of productive land, posing a serious threat to sustainable agriculture, hence, food security. Machine Learning (ML) algorithms are essential tools in the identiﬁcation of gullies and can assist in strategic decision-making relevant to soil conservation. Nevertheless, accurate identiﬁcation of gullies is a function of the selected ML algorithms, the image and number of classes used, i


Introduction
Despite decades of focused scientific research and societal concerns [1], soil erosion by water, remains a major cause of land degradation and increasingly threatens agriculture in both developed and developing nations [2]. Globally, approximately 12 million hectares of productive land are lost due to soil erosion [1,3]. Urgent intervention is required to prevent not only further damages to productive land but also prevent irreversible loss of ecosystem services [3,4]. Soil erosion results from natural factors but anthropogenic activities including unsustainable land use practices are thought to accelerate soil erosion [2,[5][6][7][8]. Soil erosion by water can manifest itself in manifolds, i.e., sheet, rill, and gully erosion [9]. Gully erosion is the most detrimental of all forms of erosion because it can quickly remove and transport enormous quantities of soil [10]. Globally, it is reported that gullies contribute around 50%-80% of total sediment production in semiarid regions [11]. Gullies can damage infrastructure testing the performance of these ML methods, including the Linear Discriminant Analysis (LDA), another promising ML method.
In this study, the selected ML methods are applied to a newly launched SPOT-7 multispectral (pan-sharpened) product whose potential in gully feature extraction has not been tested before until now. Besides, so far, there has not been any direct attempt to evaluate the impact of the class number on the classification accuracy of gully erosion. Yet, the class number is among major factors affecting image classification, thus, the resulting accuracy [49]. We proposed a simple, but practically relevant methodology consisting of three ML algorithms (RF, SVM, LDA) × two approaches of class numbers (binary and multiclass) × six combinations of study areas as train and test sets. The primary objective of this study was to evaluate the accuracy of the LDA, SVM, and RF methods in gully feature extraction across three spatially independent study areas. Our hypotheses were the following: (i) SPOT-7 pan-sharpened product can offer acceptable classification accuracy of gully erosion, (ii) application of multiclass approach can results in better classification accuracy in relation to the binary (eroded and non-eroded) approach.

Study Area
Our area of investigation is located in the Eastern Cape Province, South Africa, and consists of three study areas (s1-s3), each covering 1.26 km 2 ( Figure 1). Extensive erosion features, mainly gullies, commonly occur in relatively gentle slopes whereas rills are found in steep sloping areas. The area has a mixture of land use types including the built-up areas characterized by dispersed rural settlements, unpaved road networks, and agricultural activities. Agriculture, including crop and animal farming, is common in the region. The climate is semiarid: winter is frequently dry and cold followed by occasionally intense rainfalls in summer. The average annual rainfall is 671 mm with yearly temperatures varying from 7 to 30 • C. Topography is highly uneven across the region. It ranges from approximately 1098 m in the central parts to more than 1500 m in the hilly northern and eastern parts of the study area. The vegetation is predominantly grassland, distributed across the elevated and mountainous areas. Low-lying areas with gentle slope, where most human activities like settlement and farming take place, have sparse vegetation cover.

Data Acquisition and Pre-Processing
The data used in this study consist of the SPOT-7 image, obtained from the South African National Space Agency (SANSA). In South Africa, SPOT-7 image is available at no cost for educational purposes or PhD research, in our case, and/or research projects that are of public interest to the country. The image contains four multispectral bands: red, green, blue (altogether RGB), and near infrared (NIR) with 5.5 m geometric resolution and a high-resolution panchromatic band (1.5 m). We improved the low geometric resolution of the SPOT-7 multispectral image using the panchromatic band from the same sensor. The Gram-Schmidt pan-sharpening method was applied, which is a widely used and accepted image fusion technique for images obtained from the same sensor [50][51][52]. ENVI software was used for pan sharpening (ENVI version 5.3-Exelis Visual Information Solutions, Boulder, Colorado).

Gully Feature Extraction from Satellite Image
In order to extract gully features, we employed three ML methods: RF, SVM, and DA. Being the most widely used ML methods, the selected ML methods have been embedded in various software packages [53]. In this study, we ran all three ML methods in Python programming environment.

Random Forest (RF)
RF is a robust non-parametric classifier, which is independent of assumptions on data distributions or homoscedasticity. The algorithm uses several hundred of decision trees with bagging [54]. All trees have a unique set of cases resampled from the train dataset. Sampling is performed by random selection and bootstrapping with replacements drawn from original observations, i.e., the same case can have multiple appearance in the realizations [55]. Furthermore, the number of variables involved for decision trees is the square root of the total number of variables. Each decision tree contributes a single vote to the classification, and the class with the majority vote is selected as the final classification outcome [44,56]. The larger the number of variables, the more diverse the trees become. According to variable sampling, the algorithm determines variable importance, that is, if an omitted variable results in large decrease in average accuracy, it will be considered important. RF is considered to be accurate and was successfully applied for several tasks including but not limited to grassland species discrimination [57], invasive species identification [58], soil erosion mapping [56,59,60], soil organic carbon stock mapping [61], tree species classification [62], and extracting water-related features [63]. We employed the RF model to extract gully erosion features. In the RF model, we set the ntree (number of tress) parameter to 100 and the Gini impurity was selected as the split criterion. The mtry (number of features in each split) parameter was left at its default value (i.e., mtry = 2).

Data Acquisition and Pre-Processing
The data used in this study consist of the SPOT-7 image, obtained from the South African National Space Agency (SANSA). In South Africa, SPOT-7 image is available at no cost for educational purposes or PhD research, in our case, and/or research projects that are of public interest to the country. The image contains four multispectral bands: red, green, blue (altogether RGB), and near infrared (NIR) with 5.5 m geometric resolution and a high-resolution panchromatic band (1.5 m). We improved the low geometric resolution of the SPOT-7 multispectral image using the panchromatic band from the same sensor. The Gram-Schmidt pan-sharpening method was applied, which is a widely used and accepted image fusion technique for images obtained from the same

Support Vector Machines (SVM)
Proposed by Vapnik [64], the SVM model is based on statistical learning theory to overcome problems relating to regression and classification [65][66][67]. Since its introduction, the model has been in widespread use in several researches [35,44,56,68]. The model searches for the right boundary surface among data points belonging to different classes. The aim is to find a flat boundary, called hyperplane, which can separate the classes into homogeneous partitions where each partition contains only data points of a given class. The model works well only with linear data space, and in cases of complex and higher dimensional space, the so-called 'kernel-trick' is used to transform the non-linear data to linear where hyperplanes can be applied [69]. SVM has two important parameters. (i) The C parameter penalizes the misclassifications: a low C value results in a simple model with soft margin, while a model with large C value prioritizes the perfect classifications; and (ii) the γ parameter defines the role of a single training pixel: too small values result in constrained models, whereas too large values lead to overfitting, and both small and large γ-values can perform poorly with a test dataset [70]. Several common kernel functions for the SVM model include linear, Radial Basis Function (RBF), polynomial, and sigmoid [49]. In the current study, we applied the SVM model with the RBF kernel function.

Linear Discriminant Analysis (LDA)
Discriminant Analysis (DA) is a parametric classification method requiring multivariate normal distribution, equal covariance matrices and prefers equal number of cases within categories [71]. In this study, we applied a linear DA (LDA), an ordination technique, i.e., dimension reduction, which substitutes the original variables (bands) with discriminant functions (DFs). DF scores are calculated in the m dimensional space defined by the input variables (where m is the number of a priori categories), in our case it meant land cover types, based on decision boundaries, which can either be of linear or quadratic functions [51,72]. LDA had been successfully applied in land degradation [73]. We applied linear functions for the classification of gullies.

Reference Data Collection and Accuracy Assessment
We collected the reference data based on a field survey, high-resolution SPOT image, and ancillary data (Google Earth). The reference data is not publicly available but can be provided to anybody on request. We delineated those areas where the land cover were identifiable both in the field and in the images; therefore, seven land cover classes had been distinguished: dense vegetation (DV), stressed vegetation (SV), gully (G), bare soil (BS), mixed bare soil (MS), i.e., exposed rocks, unpaved roads, bare soils, etc., settlement (S), and the roads (R). Besides, we intended to reveal the case when land cover was classified only into two classes: gully and non-gully areas; thus, as another approach we investigated the case when all non-gully classes were reclassified into one class. In the following, we refer to the seven-class solution as 'multi-class' (m), and the two-class solution as 'binary' (b) approach.
We evaluated the classification accuracy of the algorithms with repeated 10-fold cross-validation using three repetitions. We applied stratified random sampling of the whole data sets (20,795, 31,784, and 22,512 data, respectively, in s1, s2, and s3 areas): we selected 1000-1000 cases for the binary approach and 350 cases per category for the multiclass approach from the whole reference dataset, depending on the available number of data per classes and to avoid autocorrelation (selecting adjacent pixels). This approach did not require a train and test database, as the whole dataset was randomly split into 10 subsamples and 9 were used to train the models and 1 for testing and the procedure ends when all subsamples were used as a test set against all the other subsamples [74,75]. Finally, it was repeated three times and accuracy measures of 30 models can be used to evaluate the models' performance with median and quartiles.
Although cross-validation is a reliable tool, it does not provide information on the class level accuracies; thus, we also used the confusion matrix. This matrix provides class-specific accuracies like Producer's Accuracy (PA) and User's Accuracy (UA), which indicate commission and omission errors, respectively [18,19,76]. PA is the probability that a pixel of a given class was classified correctly whereas UA is the probability that a given pixel was predicted to be in the class it was supposed to be [77][78][79]. Table 1 provides basic information on the calculation of the accuracy indices used. Table 1. Accuracy assessment indices [79].

Accuracy Equation Description
Producer's accuracy (PA) PA = n ii n icol Where n ii is the number of pixels correctly classified in each class; and n icol is the column total representing reference data. User's Accuracy (UA) UA = n ii n irow Where n irow is the row total representing predicted classes.
Overall Accuracy (OA) Where N is the total number of pixels in the confusion matrix, and r is the number of rows.
Overall Accuracy (OA), which is a ratio of correctly classified pixels, was also calculated, but on the usual way: we had three study areas (s1-s3), and one was used to train the model and another to test it. In other words, train data of one study area (e.g., s1) was used to perform the model generation and this model was applied on completely independent data of the other two study areas (s2 and s3). We repeated this procedure in all combinations of the study areas; thus, altogether we had 36 models: 3 ML algorithms (RF, SVM, LDA) × 2 approaches of class numbers (binary and multiclass) × 6 combinations of study areas as train and test sets (s1→s2, s2→s1, s2→s3, s3→s2, s1→s3, s3→s1).

Statistical Evaluation
Shapiro-Wilk test was used to check the assumption of normal distribution [80]. Reference data of the land cover classes were skewed, but classification accuracy measures (UA, PA, OA) followed the normal distribution. We applied hypothesis testing to explore if the land cover classes had the same medians (H0) or differed from each other (H1); i.e., we evaluated the training datasets from the aspect of differences in reflectance. For example, if reflectance values differ, we can suppose that classification of the satellite image can successfully distinguish gullies from other land cover classes. In case of the binary approach, we applied the Yuen test with 0.2 value of trim and 599 times bootstrapping [81], while for multiclass approach, robust ANOVA was used with a post hoc test based on trimmed means (using the WRS2 package of R). For Yuen test, we also evaluated the effect size (ξ), which is a standardized measure to quantify the magnitude of difference; therefore, it can be compared with the analysis performed on different datasets. Calculation of effect sizes for the multiclass approach is not defined for robust ANOVA. Therefore, we applied the Dunnett's test [82], using gullies as a control and all other land cover classes were compared with it. This way, we were able to calculate the confidence intervals of differences instead of effect sizes, but we were also able to see how the differences were distributed around zero. At the same time, we limited the number of comparisons and the degree of freedom (avoiding a full factorial comparison). Besides, considering the aim of this study, which was to extract gullies, this was a reasonable step.
Model performance conducted on classification accuracy measures was evaluated with General Linear Modelling (GLM) using the study areas, the number of classes (i.e., binary or multiclass) and the algorithms as factors in different combinations and the UA and PA as dependent variables. In this case, we applied the ω 2 as a measure of effect size, which is less biased by the low sample size (in our case, it was limited to the results of 36 models), as suggested by Levine [83]. Statistical analyses were performed in R 3.6.2 software [84], with the WRS2 [85], jamovi 1.2. [86], and with the GAMLj module [87].

Differences in Reflectance Values
Results confirmed that, generally, all bands had significantly different values in case of the binary approach; the only exception was observed in case of the blue band, in Study area #1 ( Figure 2). However, effect sizes indicated only small effect for red and blue bands, medium for green and large for NIR (Table 2). Therefore, NIR band had the largest relevance in discriminating the gullies. Next, we repeated the analysis with the multiclass approach (Figure 3), using seven categories, and the robust ANOVA test confirmed significant models for each combination of bands and study areas (Table 3).

Differences in Reflectance Values
Results confirmed that, generally, all bands had significantly different values in case of the binary approach; the only exception was observed in case of the blue band, in Study area #1 ( Figure  2). However, effect sizes indicated only small effect for red and blue bands, medium for green and large for NIR (Table 2). Therefore, NIR band had the largest relevance in discriminating the gullies. Next, we repeated the analysis with the multiclass approach (Figure 3), using seven categories, and the robust ANOVA test confirmed significant models for each combination of bands and study areas (Table 3).

Figure 2.
Distribution of reflectance values of Systeme Pour l'Observation de la Terre (SPOT-7) image by bands (red, green, blue (RGB) and near infrared (NIR)), study areas (s1-s3) and classification categories (NG: non-gully, G: gully). Table 2. Results of robust independent samples t-test performed on SPOT-7 bands using the binary (gully-non-gully) approach (study areas: s1-s3, t: value of t-statistic, p: significance, ξ: effect size). As ANOVA only reports that there is at least one significantly different category, a post hoc test was needed to reveal the differences among the categories. With seven categories, there are too many combinations with pairwise comparison; therefore, we focused only on the difference in gully and the other categories, according to the primary aim of this study. Gullies usually had significant differences, but mostly in red and green bands, differences were not of significant with the stressed vegetation and the roads (Table 3). In this case, we presented the mean differences of the land cover categories. Confidence intervals were in accordance with the post hoc test, except in the study area #2 (s2) where the robust post hoc test indicated significant differences (Table 4), while the confidence intervals ( Figure 4, based on Dunnett's test statistics) showed non-significance between gullies and roads.  As ANOVA only reports that there is at least one significantly different category, a post hoc test was needed to reveal the differences among the categories. With seven categories, there are too many combinations with pairwise comparison; therefore, we focused only on the difference in gully and the other categories, according to the primary aim of this study. Gullies usually had significant differences, but mostly in red and green bands, differences were not of significant with the stressed vegetation and the roads (Table 3). In this case, we presented the mean differences of the land cover categories. Confidence intervals were in accordance with the post hoc test, except in the study area #2 (s2) where the robust post hoc test indicated significant differences (Table 4), while the confidence intervals (Figure 4, based on Dunnett's test statistics) showed non-significance between gullies and roads. Table 2. Results of robust independent samples t-test performed on SPOT-7 bands using the binary (gully-non-gully) approach (study areas: s1-s3, t: value of t-statistic, p: significance, ξ: effect size).

Gully Feature Extraction
Gully classification results based on the multiclass and binary approach are presented in Figures 5  and 6, respectively. SVM and RF showed more or less the same results, whereas LDA yielded completely different results, most notably in study area #1 (s1→s2, s1→s3), study area #2 (s2→s3), and study area #3 (s3→s2). However, in some cases, i.e., study area #2 (s2→s1) and study area #3 (s3→s1), LDA showed almost the same results as SVM and RF. Overall, the multiclass and binary approach appear to have yielded almost the same results in terms of gully extraction, although there are slight differences in accuracy results, as presented in the sub-sections that follow (Sections 3.3 and 3.4).

Gully Feature Extraction
Gully classification results based on the multiclass and binary approach are presented in Figure  5 and Figure 6, respectively. SVM and RF showed more or less the same results, whereas LDA yielded completely different results, most notably in study area #1 (s1→s2, s1→s3), study area #2 (s2→s3), and study area #3 (s3→s2). However, in some cases, i.e., study area #2 (s2→s1) and study area #3 (s3→s1), LDA showed almost the same results as SVM and RF. Overall, the multiclass and binary

Overall Model Performance Evaluation
Overall accuracies, based on three times, repeated 10-fold cross-validation (i.e., results of 30 models), showed that gully identification can be successful if we use RF or SVM classification algorithms. Both RF and SVM provided similar accuracies; medians were between 92%-96%. Binary approach resulted in better OA values; nevertheless, the multiclass (m) classification was only 2% worse than the binary (b) (Figure 7). The rank was the same in all study areas in the first three places: SVMb, RFb, LDAb with the binary approach, then the multiclass classifications had some mixing on the 4 th place between SVMm and RFm, but LDAm always had the worst performance. Regarding the first two places, SVMb had only a slight superiority (<0.5%) and, additionally, lower quartiles were higher for RFb; thus, generally, these models can be regarded as more reliable. Furthermore, the multiclass approach was not as effective as the binary approach, even the LDAb's lower quartiles

Overall Model Performance Evaluation
Overall accuracies, based on three times, repeated 10-fold cross-validation (i.e., results of 30 models), showed that gully identification can be successful if we use RF or SVM classification algorithms. Both RF and SVM provided similar accuracies; medians were between 92%-96%. Binary approach resulted in better OA values; nevertheless, the multiclass (m) classification was only 2% worse than the binary (b) (Figure 7). The rank was the same in all study areas in the first three places: SVMb, RFb, LDAb with the binary approach, then the multiclass classifications had some mixing on the 4th place between SVMm and RFm, but LDAm always had the worst performance. Regarding the first two places, SVMb had only a slight superiority (<0.5%) and, additionally, lower quartiles were higher for RFb; thus, generally, these models can be regarded as more reliable. Furthermore, the multiclass approach was not as effective as the binary approach, even the LDAb's lower quartiles were higher in all study areas than the best multiclass solution's upper quartile, and differences were 2%-8%.

Model Performance Evaluation on Class Level
Class level evaluation was performed only with the accuracy metrics of gullies. Results showed that most classifiers failed to produce both high PA and UA values (Figure 8). Worst performances belonged to the LDA algorithm, UA values of the multiclass classification was under 30%, misclassifications meant large number of commission error, large number of pixels were classified as gullies, which belonged to other categories; although, we also have to note that there were also successful models (L-m-2-1 was as good as some RF and SVM models). Regarding the best 80-80% quarter (Figure 8), there were 4 RF and 3 SVM models, 2 binary and 5 multiclass types. In this quarter, where the errors were within a reasonable measure, highest UAs belonged to R-b-2-3 (86.0%) and R-m-2-3 (83.8%), while PAs were 95.1% and 96.1%, respectively. Nevertheless, highest PA belonged to the R-b-3-2 (93.7%) but the corresponding UA was only 79.4%. Best PA belonged to an LDA model (L-b-2-3) with 99.5% but its UA was only 18.5%. ISPRS Int. J. Geo-Inf. 2020, 9, 252 12 of 20 were higher in all study areas than the best multiclass solution's upper quartile, and differences were 2%-8%.

Model Performance Evaluation on Class level
Class level evaluation was performed only with the accuracy metrics of gullies. Results showed that most classifiers failed to produce both high PA and UA values (Figure 8). Worst performances belonged to the LDA algorithm, UA values of the multiclass classification was under 30%, misclassifications meant large number of commission error, large number of pixels were classified as gullies, which belonged to other categories; although, we also have to note that there were also successful models (L-m-2-1 was as good as some RF and SVM models). Regarding the best 80-80% quarter (Figure 8), there were 4 RF and 3 SVM models, 2 binary and 5 multiclass types. In this quarter, where the errors were within a reasonable measure, highest UAs belonged to R-b-2-3 (86.0%) and Rm-2-3 (83.8%), while PAs were 95.1% and 96.1%, respectively. Nevertheless, highest PA belonged to the R-b-3-2 (93.7%) but the corresponding UA was only 79.4%. Best PA belonged to an LDA model (L-b-2-3) with 99.5% but its UA was only 18.5%.

Statistical Evaluation of Factors Biasing Class Level Performance
GLM revealed that study areas and the algorithms can provide explanations for the different efficiency of the classification models; adjusted R 2 s indicated 59.3% explained variance for UA and 56.1% for PA (Tables 5 and 6). Study area had significant effect on both the PA and UA results, but the applied algorithm had significant direct effect only on PA.

Statistical Evaluation of Factors Biasing Class Level Performance
GLM revealed that study areas and the algorithms can provide explanations for the different efficiency of the classification models; adjusted R 2 s indicated 59.3% explained variance for UA and 56.1% for PA (Tables 5 and 6). Study area had significant effect on both the PA and UA results, but the applied algorithm had significant direct effect only on PA. Table 5. Summary of General Linear Modelling (GLM) performed with PA as independent variable (Alg: algorithm, type: binary or multiclass approach, stud: Study area; SS: Sum of Squares, df: degree of freedom, F: F-statistic, p: significance, ω 2 : effect size; p < 0.05 is highlighted with bold). In case of PA, the interaction between the algorithms and study areas was not significant. Effect size (ω 2 ) indicated large effect for the study areas for PA, and for the algorithms for UA. Furthermore, the interaction was significant between the applied algorithms and the type (number of categories), i.e., classification algorithms perform differently with the binary or multiclass approaches.

Discussion
Despite the spectral heterogeneity of gullies, spectral differences of SPOT-7 bands showed that gully mapping with remote sensing data can be reasonable. The average OA results obtained by the binary (93.68%) and multiclass (74.27%) approach justifies this assertion. Almost all bands, except for blue band in study area #1, had significantly different values in case of the binary approach, hence, high OA. However, it is worth noting that merely relying on statistical significance can be misleading. In spite of significant differences: effect sizes were only small (sometimes moderate) in case of RGB bands, only the NIR band indicated large (nevertheless, in this case very large) effect. Accordingly, while p-values only show that differences can be regarded as significant or not significant, effect sizes express the magnitude, and as standardized measures proved the relevance of differences with land cover categories, too: even with larger interquartile ranges, differences can be significant, but in classifications these small differences (indicated by effect sizes of <0.3) lead to misclassifications. Such interpretation is in accordance with the findings of Szabó et al. [88].
Although effect sizes were not calculated to land cover pairs of the multiclass approach, confidence intervals provided valuable information on the differences. Usually, all pairs had significant differences, and the confidence intervals were within a small range except some land cover categories: gullies and stressed vegetation and roads had non-significant differences in RGB bands (mostly in red and green bands), but the NIR band always reported significant differences. However, according to the confidence intervals, NIR band was so successful to discriminate the roads and gullies: 95% confidence ranges were close to the "zero" line, which indicated non-significance (Figure 9). This was not signed by the p-values (all of them were p < 0.001); however, if we use confidence ranges and their distance from zero as an effect size, these cases indicate low values with lower efficiency to distinguish categories. However, despite the critics of the statistical evaluation, considering the results, we can accept that reference dataset contained reliable data about land cover categories, and presumably, can be used in classification models. binary (93.68%) and multiclass (74.27%) approach justifies this assertion. Almost all bands, except for blue band in study area #1, had significantly different values in case of the binary approach, hence, high OA. However, it is worth noting that merely relying on statistical significance can be misleading. In spite of significant differences: effect sizes were only small (sometimes moderate) in case of RGB bands, only the NIR band indicated large (nevertheless, in this case very large) effect. Accordingly, while p-values only show that differences can be regarded as significant or not significant, effect sizes express the magnitude, and as standardized measures proved the relevance of differences with land cover categories, too: even with larger interquartile ranges, differences can be significant, but in classifications these small differences (indicated by effect sizes of <0.3) lead to misclassifications. Such interpretation is in accordance with the findings of Szabó et al. [88].
Although effect sizes were not calculated to land cover pairs of the multiclass approach, confidence intervals provided valuable information on the differences. Usually, all pairs had significant differences, and the confidence intervals were within a small range except some land cover categories: gullies and stressed vegetation and roads had non-significant differences in RGB bands (mostly in red and green bands), but the NIR band always reported significant differences. However, according to the confidence intervals, NIR band was so successful to discriminate the roads and gullies: 95% confidence ranges were close to the "zero" line, which indicated non-significance ( Figure  9). This was not signed by the p-values (all of them were p < 0.001); however, if we use confidence ranges and their distance from zero as an effect size, these cases indicate low values with lower efficiency to distinguish categories. However, despite the critics of the statistical evaluation, considering the results, we can accept that reference dataset contained reliable data about land cover categories, and presumably, can be used in classification models.  As different classifiers were applied, we evaluated their overall performance. RF and SVM are robust algorithms, as data distribution does not bias the results, i.e., outliers have only slight effect on the classification accuracy, while LDA assumes multivariate normality and balanced element number in the categories [89,90]. This was also true in our study as both RFb and SVMm outperformed the LDA. But it should be noted, LDAb had better classification results in case of binary approach than RFm and SVMm. Multiclass classifications' OA values were below the binary; thus, one can think that it is more reasonable to use only categories when possible, i.e., in our case the gully and non-gully categories. Beygelzimer et al. [91] also found that the binary approach can perform better than the multiclass. To gain this result, the authors developed a new "weighted one-against-all" method. However, in our case, the reality is that this result is only generally true, on category level, the multiclass approach was more efficient regarding the gully identification. These findings are in high agreement with those of Allwein et al. [92]. Our findings revealed that multiclass solutions with the 7-class classification had better performance, only two binary approach models were in the best 80% quadrant (delineated by UA and PA) with RF and SVM classifiers. LDA's performance varied and models provided ambiguous results: first three within the 36 models were LDA using the binary approach, and the second best was an LDA in the multiclass types based on the PA; however, the corresponding UA values were very low (e.g., the best PA, 99.51%, belonged to L-b-2-3 and the UA was only 18.5%). Thus, in these study areas with these reference datasets, we found that LDA models were severely biased by the outliers, while RF and SVM overcame this issue and provided accurate outcomes with the same input train data.
We applied GLM to investigate the reliability of the train datasets with applying the models trained with them on two other areas. All areas were similar regarding their spectral characteristics, but the six combinations revealed that in case of Study area #2, PAs were above the other two areas, but it was not true for UAs where Study area #3 had the largest values (Tables 5 and 6; Figure 9). Although the study areas had differences regarding the reference data, both RF and SVM provided PAs and UAs above 75%, which is relatively high, considering the lack of ancillary data (e.g., a digital elevation model). LDA models had large dependency on training data quality and, therefore, the error of commissions was very large (medians indicated >50% errors).
GLM revealed the multivariate effects on PA and UA and confirmed the relevance of the classification algorithms. Number of classes (binary or multiclass) did not have significant effect; although results of 10-fold cross-validation suggested that binary classifications had better performance, class-level comparisons did not support this finding in case of gullies. Nevertheless, this study showed that space-borne images like SPOT-7 can be successfully applied to automatically detect gullies using relevant ML algorithms. Although the spectral bands of SPOT-7 played a key role in discriminating gullies among other land cover categories, pan sharpening too, to some extent, contributed to successful extraction of gullies. Automatic detection of gullies with remote sensing is a challenge, because gullies do not generally differ from the surfaces they incise [23]. As a result, visual interpretation of high-resolution images has been preferred for monitoring gullies over large areas. On a country scale, Mararakanye and Le Roux [27] performed visual interpretation of SPOT-5 image to monitor gully erosion in South Africa. Recently, Karydas and Panagos [93] performed a preliminary assessment on the presence of ephemeral gullies in Greece through visual interpretation of Google Earth images. Undoubtedly, there is a need for more reliable methods, which can help to automatically identify the areas affected by gullies [22,23]. This in turn ensures consistent monitoring of gully development over space and time. The use of appropriate ML algorithms and satellite images seems to be adequate for these surveys, but without digital elevation models (DEMs), the simple classifications can be flawed. However, in areas where gullies can be identified visually, the spectral reflectance permits the discrimination of gullies from other land cover categories, as was the case in the present study. For example, most gullies were located on bare soil surface but with the improved geometric resolution, the applied pan-sharpened product successfully discriminated gullies from bare soil although there is still a room for improvement. Where available, future research should incorporate ancillary data (i.e., DEM) to accurately discern gullies from paved/unpaved road network. A major limitation in this study was to discriminate unpaved roads from bare soil and exposed rocks. Instead, these land cover types were grouped into one class, called mixed bare soil (MS), but this did not affect gully, the target class. Although this study was conducted in small areas, the methodology applied could be adopted for application in larger areas.

Conclusions
Gully erosion is a crucial problem confronting sustainable agriculture today and cannot be left unabated, if sustainable agriculture is to be achieved. This study employed three commonly used ML algorithms including DA, SVM, and RF in gully feature extraction. Using these ML algorithms, we relied on two different approaches of class number: binary and multiclass. Based on the findings of this study, we drew the following conclusions:

•
Despite having a small number of bands (RGB and NIR), the pan-sharpened product from SPOT-7 multispectral image successfully discriminated gullies (with OAs >95%). • Repeated k-fold cross-validation was an efficient tool to analyze the representativeness of the reference data as reflected in different classification algorithms. It showed that the binary approach performed better (i.e., higher OAs with narrow interquartile ranges) with all classifiers than the multiclass approach. • GLM effectively identified the biasing factors of a given class, in this case gullies, accuracy metrics (PA and UA), with the option of statistical interaction Accordingly, we revealed that algorithms performed differently with the binary or multiclass approach in case of PA, while there was no interaction in case of UA, i.e., the number of classes did not influence UAs in the function of different classifiers. • LDA can accurately identify gullies at least from the PA's perspective, but had usually low corresponding UA values. • SVM and RF showed better performance compared to LDA in identifying gullies, usually with >80% of PA and UA in different areas.
Overall, we conclude that the application of different study areas for training and prediction ensures independent circumstances for models, which allow to assess the possibilities of generalization of the case studies. OAs can be misleading as the accuracy can be different at class level. In our case, the classifiers showed best performance with the binary approach, according to the OA, but the multiclass approach was more efficient in gully identification based on the PA and UA on class level. Our aim was to obtain the best representation of gullies; therefore, we suggest the use of several classes instead of only two classes (i.e., binary) for better gully feature extraction.