Classiﬁcation Efﬁcacy Using K-Fold Cross-Validation and Bootstrapping Resampling Techniques on the Example of Mapping Complex Gully Systems

: The availability of aerial and satellite imageries has greatly reduced the costs and time associated with gully mapping, especially in remote locations. Regardless, accurate identiﬁcation of gullies from satellite images remains an open issue despite the amount of literature addressing this problem. The main objective of this work was to investigate the performance of support vector machines (SVM) and random forest (RF) algorithms in extracting gullies based on two resampling methods: bootstrapping and k-fold cross-validation (CV). In order to achieve this objective, we used PlanetScope data, acquired during the wet and dry seasons. Using the Normalized Difference Vegetation Index (NDVI) and multispectral bands, we also explored the potential of the PlanetScope image in discriminating gullies from the surrounding land cover. Results revealed that gullies had signiﬁcantly different ( p < 0.001) spectral proﬁles from any other land cover class regarding all bands of the PlanetScope image, both in the wet and dry seasons. However, NDVI was not efﬁcient in gully discrimination. Based on the overall accuracies, RF’s performance was better with CV, particularly in the dry season, where its performance was up to 4% better than the SVM’s. Nevertheless, class level metrics (omission error: 11.8%; commission error: 19%) showed that SVM combined with CV was more successful in gully extraction in the wet season. On the contrary, RF combined with bootstrapping had relatively low omission (16.4%) and commission errors (10.4%), making it the most efﬁcient algorithm in the dry season. The estimated gully area was 88 ± 14.4 ha in the dry season and 57.2 ± 18.8 ha in the wet season. Based on the standard error (8.2 ha), the wet season was more appropriate in gully identiﬁcation than the dry season, which had a slightly higher standard error (8.6 ha). For the ﬁrst time, this study sheds light on the inﬂuence of these resampling techniques on the accuracy of satellite-based gully mapping. More importantly, this study provides the basis for further investigations into the accuracy of such resampling techniques, especially when using different satellite images other than the PlanetScope data.


Introduction
Defined as the detachment, transportation, and deposition of soil particles by the erosive forces of raindrop and runoff [1,2], soil erosion by water represents one of the most typical forms of land degradation affecting many countries around the world [3]. While soil erosion has many negative effects, the most concerning one include the decline in soil fertility, resulting in limited food production [4,5]. This, in turn, contributes to food insecurity in several developing countries, particularly in those ones where a considerable segment of their population strongly relies on agriculture for their survival [6]. South (kNN), maximum likelihood (ML), artificial Neural Network (ANN), convolutional neural networks (CNN), discriminant analysis (DA), and minimum distance (MD). One study mapped the areas susceptible to gully erosion using RF and ANN [32], and found that RF performed better than ANN. Noi and Kappas [33] compared SVM, RF, and kNN in land cover classification, and found that SVM, followed by RF, were better than kNN. Phinzi et al. [34] reported that both SVM and RF outperformed linear discriminant analyst (LDA) in a study on gully detection. Although deep learning methods such as CNNs have shown better performance over SVM and RF [35], like most deep learning methods, CNNs also strongly rely on the availability of abundant high-quality training/ground truth data [36]. While CNNs perform well in detecting and differentiating active gullies from other forms of surface erosion (e.g., sheet and rills), they have errors in detecting complex gully systems [37]. For these reasons, SVM and RF still attract most researchers' attention, because of their low computational complexity and higher interpretability capabilities compared to deep learning algorithms [36].
The wide usage of these machine learning algorithms in remote sensing proved that learning features from dataset is more efficient and practical than merely defining the features [38]. Although the application of machine learning in soil erosion research is not new, previous investigations commonly use coarser spatial imagery such as Landsat, ASTER and Sentinel/Sentinel-SAR (Synthetic Aperture Radar), which from an economic point of view makes sense, given that such images are obtainable at no cost. Besides, these sensors are good for wide area mapping of soil erosion. However, what has become apparent from previous studies, is that such sensors cannot identify individual gullies (especially small discontinuous gullies) with sufficient detail, this limitation is attributable to their low spatial resolution [15]. Whereas other optical sensors such as IKONOS, WorldView, and RapidEye with relatively higher spatial resolution exist for gully mapping, these sensors are not readily or freely available, as such, their high acquisition costs limit their application for gully mapping. Similarly, the use of LiDAR-derived elevation data from airborne surveys including Unmanned Aerial Vehicles (UAVs) is limited by a lack of financial resources. Depending on the availability of data and objective of a given study, multi-source and multi-sensor data fusion are common in remote sensing since this provides synthetic data that have the combined advantages of different sensors [39]. Multi-sensor or pixel level data fusion are mainly applied to optical images, for example, the fusion of high resolution panchromatic and low resolution multi-spectral images [40], was successfully applied in gully feature extraction [34]. Multi-source data fusion concerns feature level and decision level fusion of data from various sources such as SAR, optical images, LiDAR, geographic information system (GIS) data, and in-situ data [40]. In our case, we did not perform any data fusion due to lack of data (including the panchromatic band) with suitable spatial resolution necessary for detecting individual gullies.
Despite the unavailability of a higher spatial resolution panchromatic band, the 3 m PlanetScope image, which is available free of charge for research purposes, offers a great potential for detecting individual gullies. However, the capability of PlanetScope image in classifying gullies in different seasons (dry and wet) in an arid or semi-arid environment had been investigated only in areas of large forms (1-5 km length, 100-600 m width) [41]. While machine learning algorithms such as the SVM and RF have been frequently applied, little efforts have been made to investigate the influence of resampling techniques, particularly, bootstrapping and k-fold cross-validation (CV), on the accuracy relations. We identified gullies from PlanetScope images based on these resampling methods. Our aim was (i) to compare the satellite's bands reflectance values from the aspect of gullies, (ii) to reveal which classifier (RF or SVM) and resampling technique (CV or bootstrapping) perform better regarding the overall and class level accuracy metrics, and (iii) which season is more appropriate to identify the gullies.

Study Area
The study area was located in the rural part of eastern South Africa, characterized by extensive erosion where permanent gully erosion was the most prominent erosion type [42]. Geographically, the study area lies between 30 • 42 30 -30 • 43 55 S 28 • 46 22 -28 • 48 47 E, covering a surface of about 10 km 2 ( Figure 1). Subsistence agriculture (e.g., crop farming and livestock rearing) and settlement were the main land use types. Grassland was the most common vegetation type throughout the area, with some forest patches found in the north-western section of the study area. The topography ranges from 1213 m-1658 m, with the north-western and south-western sections being steeper than other parts of the area. Steep mountain slopes with gently undulating footslopes characterize the geomorphology of the area [14]. The climate is semi-arid with temperatures ranging from 7-30 • C. Winters are cold and dry, with less vegetation due to limited rainfall. Rainfall mostly occurs during the summer season reaching approximately 670 mm on average per year. Although the study area has limited annual rainfall, it experiences high-intensity rainfall events. Gully development in the area was further fostered by the predominance of highly erodible soils such as duplex and dispersive soils [9,43], predominantly underlain by mudstone and sandstone of the Beaufort Group [44]. Although vegetation exists in the wet season, its effectiveness in protecting soil against erosion and inappropriate land-use practices such as overgrazing usually reduces vegetation cover, making the area susceptible to soil erosion. The study area features both continuous and discontinuous gully networks with distinct occurrences and appearances, i.e., narrow, wide, vegetated, shallow, deep with shadows, etc. [14,42]. Additionally, some gullies resemble the unpaved road network in appearance. Such complexity of gullies within the area makes the area particularly suitable for study.

Study Area
The study area was located in the rural part of eastern South Africa, characterized by extensive erosion where permanent gully erosion was the most prominent erosion type [42]. Geographically, the study area lies between 30°42′30″-30°43′55″S 28°46′22″-28°48′47″E, covering a surface of about 10 km 2 ( Figure 1). Subsistence agriculture (e.g., crop farming and livestock rearing) and settlement were the main land use types. Grassland was the most common vegetation type throughout the area, with some forest patches found in the north-western section of the study area. The topography ranges from 1213 m-1658 m, with the north-western and south-western sections being steeper than other parts of the area. Steep mountain slopes with gently undulating footslopes characterize the geomorphology of the area [14]. The climate is semi-arid with temperatures ranging from 7-30 °C. Winters are cold and dry, with less vegetation due to limited rainfall. Rainfall mostly occurs during the summer season reaching approximately 670 mm on average per year. Although the study area has limited annual rainfall, it experiences high-intensity rainfall events. Gully development in the area was further fostered by the predominance of highly erodible soils such as duplex and dispersive soils [9,43], predominantly underlain by mudstone and sandstone of the Beaufort Group [44]. Although vegetation exists in the wet season, its effectiveness in protecting soil against erosion and inappropriate land-use practices such as overgrazing usually reduces vegetation cover, making the area susceptible to soil erosion. The study area features both continuous and discontinuous gully networks with distinct occurrences and appearances, i.e., narrow, wide, vegetated, shallow, deep with shadows, etc. [14,42]. Additionally, some gullies resemble the unpaved road network in appearance. Such complexity of gullies within the area makes the area particularly suitable for study.

Data Acquisition and Pre-Processing
Two cloud-free PlanetScope orthorectified products (Level 3B) for the wet and dry seasons acquired on 23 January 2017 and 25 June 2017, respectively, were used in this study. The images were downloaded from the Planet explorer website (https://www. planet.com/explorer (accessed on 30 July 2020)). The orthorectified scenes had already been radiometrically and geometrically corrected and projected to the Universal Traverse Mercator (UTM) projection, referenced to the world geodetic system (WGS84) datum. With a spatial resolution of 3 m and temporal resolution of 1 day, the PlanetScope image is comprised of 4 spectral bands: red, green, blue (RGB), and near-infrared (NIR). The flowchart summarizing the workflow followed in this study is presented in Figure 2.

Gully Classification
Classification of gullies from the PlanetScope image was conducted in Python software using random forest (RF) and support vector machines (SVM). These were the most widely applied algorithms and their detailed description has been provided in the literature [10,34,36,[45][46][47]. The RF, developed by Breiman [48], is a robust machine learning algorithm that is increasingly becoming more popular in remote sensing of soil erosion. The algorithm has several parameters that need to be tuned, amongst which the ntree (number of trees) and mtry (number of features in each split) are the most important that should be considered when training the algorithm [49]. The models were built using only 4 variables (e.g., four multispectral bands of the PlanetScope image), thus we tested all possible values of the mtry parameter. For the ntree parameter, we tested different values ranging from 50 to 1000. After ntree = 100, the accuracies stagnated while the computational time kept increasing [50]; thus, the final model was trained with 100 individual decision trees, selecting 2 random variables at each split.
The support vector machine (SVM) model was capable of overcoming both classification and regression problems [51,52]. To achieve this, SVM searched for the flat boundary (hyperplane) in some feature space that best separated the classes into homogeneous partitions where each partition contained only data points of a given class [34,49]. In reality, however, it was difficult to find a hyperplane that perfectly separated the classes using just the original features [49]. SVM overcomes this problem in 2 ways: first, loosen what is meant by "perfectly separates," and second, use the so-called kernel trick to expand the feature space to the extent that perfect separation of classes is more likely [49]. The radial basis function (RBF) was chosen for the kernel type. For RBF, a C penalty parameter against misclassifications and a kernel coefficient (γ) as a decision boundary have to be specified, which greatly affects the performance of the model [53]. Hyperparameter tuning was performed with the grid search method.

Reference Data Collection and Accuracy Assessment
The reference data were collected through field surveys and visual interpretation of high-resolution Google Earth images. We delineated the study area into 7 land cover classes, of which all were identifiable both in the field and in images (Google Earth and PanetScope): forest, built-up, agriculture, gully, bare soil, and mixed bare soil (i.e., exposed rocks, unpaved roads/dirty roads, and exposed soil mostly in ploughed fields). A total of 966 points were collected using stratified random sampling in ArcMap. Each land cover class was assigned a number of points proportional to its size. Remote Sens. 2021, 13, 2980 6 of 18

Reference Data Collection and Accuracy Assessment
The reference data were collected through field surveys and visual interpretation of high-resolution Google Earth images. We delineated the study area into 7 land cover classes, of which all were identifiable both in the field and in images (Google Earth and We evaluated the overall performance of the RF and SVM algorithms using CV and bootstrapping. Kappa coefficients and overall accuracy (OA) were among the most commonly used metrics to evaluate classification accuracy [54]. However, the use of kappa in remote sensing classification accuracy is becoming less common [33,55]. Pontius and Millones [56], Flight and Julious [57], and more recently, Delgado and Tibau [58], Remote Sens. 2021, 13, 2980 7 of 18 recommend against using kappa because of its inherent limitations. A major limitation of kappa was that it is highly sensitive to the distribution of the marginal totals, potentially producing unreliable results [57]. Thus, we used OA to assess the overall performance of the models. Contrary to the conventional error matrix, which used all of the available data to test the model, CV splits the reference dataset into training and testing data. It used the majority of the data for training and the remainder, often called the holdout sample, was used to test the model, ensuring that the model was robust [49]. In total, we used 17,757 pixels for the wet season and 30,597 pixels for the dry season, generated from PlanetScope. We repeated the 5-fold CV 20 times, meaning that final accuracies were computed from 100 models. Before each repetition, the dataset was randomly shuffled and new folds were generated to increase the robustness of the models. Unlike CV, in bootstrapping, the original data were randomly sampled with replacement, meaning that, after a data point (bootstrap sample) was selected for inclusion in the subset, it was still available for further selection [49]. Two parameters must be chosen before running bootstrapping: sample size and the number of repetitions. In our case, the sample size was the same as the original dataset [59], and we applied 100 repetitions. The models were validated on the samples that were not included in the bootstrap sample.
We used the traditional error matrix to assess the model performance at class level as bootstrapping and CV do not provide class accuracies. An error matrix compared reference data to the classified map using various accuracy indices [54], but in this study, we only focused on class level accuracies/errors: producer's accuracy (PA) and user's accuracy (UA); PA was also known as sensitivity or recall while UA was sometimes referred to as precision. The difference of the possible 100% accuracy and the PA represented the omission error, which occurred when a pixel was excluded from the class to which it belonged. A difference of 100% and UA represented a commission error, which occurred when a pixel was incorrectly included in the class where it did not belong. We computed unbiased area-based PAs and UAs, following "good practice" recommendations for accuracy assessment [60]. The F1-score was also reported as the harmonic mean of UA and PA [61]. Additionally, we computed unbiased areal coverages (ha) of gullies along with their standard errors (ha) and associated ± 95% confidence intervals (ha). We generated 6 algorithms based on the combination of the classifiers: svm and rf, seasons: dry (d) and wet (w), and resampling methods: bootstrapping (b) and cross validation (cv), i.e., rf-d-b, rf-d-cv, rf-w-cv, rf-w-b, svm-d-cv, svm-d-b, svm-w-cv, and svm-w-b.

Statistical Analysis
NDVI values of the images, and specifically focusing on the gullies, were compared by the 2 seasons with the robust Mann-Whitney test using the Monte Carlo p (p MC ) with 9999 permutations. We applied the General Linear Model (GLM) to determine the effects of spectral bands (4 bands; RGB + NIR), seasons (wet and dry), and the LULC classes (7 classes). Furthermore, we also determined the statistical interactions to reveal if factorial variables had a common effect (e.g., effects of spectral bands differed by LULC classes or were different in the dry or wet seasons). Besides, we also determined the effect size (ω 2 ) as a standardized measure of the variables' contribution in the model (higher values indicate larger contribution, ω 2 > 0.14 was considered as a large effect [62]. The Dunnett test [63] was used to determine if gullies had significant differences from other land cover types (H0: mean reflectance values of gullies was identical with the other land cover types). The Dunnett test was developed to perform multiple comparisons against 1 control group; in this case, gullies' land cover type was chosen as the control. As in the Dunnett test, the number of comparisons was limited (related to a full factorial approach; i.e., 6 instead of 21). Furthermore, the test compared the factor groups' means with the control group's mean (unlike other tests, which compare group means to the grand mean); thus, it can reveal small significant differences [64], and our intent was to find all overlaps in the reflectance with the gullies.

Spectral Bands, Land Cover Classes and Seasons as Determinants of Reflectance
The difference of NDVI values was significant between the two seasons (U = 35303, z = 19.102, p MC < 0.0001). The NDVI for the wet season had relatively higher values ranging from −0.36 to 0.81, while the values for the dry season lay in the range −0.41 to 0.59. The dry season had bimodal distribution while the wet season had multimodal distribution ( Figure 3). Such bimodal distribution in the dry season represents non-vegetation (first mode) and vegetation pixels (second mode). Like in the dry season, the first mode in the wet season was indicative of non-vegetation pixels denoted by lower NDVI values compared to the last two modes, represented by relatively higher NDVI values. These last two modes represent vegetated areas: vegetation and forest pixels, respectively.
land cover types). The Dunnett test was developed to perform multiple comparisons against 1 control group; in this case, gullies' land cover type was chosen as the control. As in the Dunnett test, the number of comparisons was limited (related to a full factorial approach; i.e., 6 instead of 21). Furthermore, the test compared the factor groups' means with the control group's mean (unlike other tests, which compare group means to the grand mean); thus, it can reveal small significant differences [64], and our intent was to find all overlaps in the reflectance with the gullies.

Spectral Bands, Land Cover Classes and Seasons as Determinants of Reflectance
The difference of NDVI values was significant between the two seasons (U = 35303, z = 19.102, pMC < 0.0001). The NDVI for the wet season had relatively higher values ranging from -0.36 to 0.81, while the values for the dry season lay in the range −0.41 to 0.59. The dry season had bimodal distribution while the wet season had multimodal distribution ( Figure 3). Such bimodal distribution in the dry season represents non-vegetation (first mode) and vegetation pixels (second mode). Like in the dry season, the first mode in the wet season was indicative of non-vegetation pixels denoted by lower NDVI values compared to the last two modes, represented by relatively higher NDVI values. These last two modes represent vegetated areas: vegetation and forest pixels, respectively. We also compared the NDVIs' of the gullies in the dry and wet seasons. Accordingly, the difference was significant (U = 162, z = 9.5534, p < 0.0001). The mean difference was 0.08 in the wet season, green vegetation was also present in gullies, thus, NDVI was larger. According to the results of the GLM we found that the spectral bands, LULC classes, and the seasons, as factorial variables and the interactions, were significant (p < 0.001) and explained 92.3% of the variance. Among the factors, the difference of dry and wet seasons had the largest effect on the reflectance (0.868). The bands and LULC classes had almost We also compared the NDVIs' of the gullies in the dry and wet seasons. Accordingly, the difference was significant (U = 162, z = 9.5534, p < 0.0001). The mean difference was 0.08 in the wet season, green vegetation was also present in gullies, thus, NDVI was larger. According to the results of the GLM we found that the spectral bands, LULC classes, and the seasons, as factorial variables and the interactions, were significant (p < 0.001) and explained 92.3% of the variance. Among the factors, the difference of dry and wet seasons had the largest effect on the reflectance (0.868). The bands and LULC classes had almost the same effect with a bit lower value (~0.6), however, also indicating a large effect. Regarding the interactions, we confirmed that reflectance by the band was different by LULC classes and seasons. The contribution of these interactions was large (Table 1). Furthermore, the effect size of the interaction between the seasons and the LULC classes was the lowest, being only third related to the interactions with the bands (0.141), but it still indicated a large effect. The interaction of all factors (spectral bands, seasons, LULC classes) also had a large effect but only with a smaller value (0.185). The post hoc test performed with the Dunnett test revealed significant differences (p < 0.001) between the gullies and other LULC classes in the dry season ( Figure 4). The difference was not significant between the gullies and the agricultural areas (blue band), the vegetation and agricultural areas (green band and red band) in the wet season. Table 2 ranks the original band's importance in terms of discriminating gullies. We also studied the differences of NDVI and found that this spectral index was not as successful in discriminating the gullies as the original bands. It did not differ from the mixed bare soil and the vegetation in the dry season. Although NDVI performed better in the wet season, the difference was not significant with the built-up class.
garding the interactions, we confirmed that reflectance by the band was different by LULC classes and seasons. The contribution of these interactions was large (Table 1). Furthermore, the effect size of the interaction between the seasons and the LULC classes was the lowest, being only third related to the interactions with the bands (0.141), but it still indicated a large effect. The interaction of all factors (spectral bands, seasons, LULC classes) also had a large effect but only with a smaller value (0.185). The post hoc test performed with the Dunnett test revealed significant differences (p < 0.001) between the gullies and other LULC classes in the dry season ( Figure 4). The difference was not significant between the gullies and the agricultural areas (blue band), the vegetation and agricultural areas (green band and red band) in the wet season. Table 2 ranks the original band's importance in terms of discriminating gullies. We also studied the differences of NDVI and found that this spectral index was not as successful in discriminating the gullies as the original bands. It did not differ from the mixed bare soil and the vegetation in the dry season. Although NDVI performed better in the wet season, the difference was not significant with the built-up class.

Accuracy Assessment of Gully Mapping
Using machine learning algorithms (RF and SVM), the cross-validation (CV) resampling method yielded better OA compared to bootstrapping for both the wet and dry seasons ( Figure 5). Two apparent trends can be observed from these results based on OA: (i) RF consistently performed better than SVM irrespective of the season or resampling methods: bootstrapping and cross-validation; (ii) dry season had better OAs than the wet season, but this was not reflected in class level accuracy indices for gully classification. Based on the unbiased UA, all algorithms showed good performance in gully classification, recording UA above 70% ( Figure 6). In particular, the best performance belonged to the svm-d-b (93.4%), whereas the worst UA belonged to the rf-w-b model (77%). For most models, PA was generally low relative to UA. Only half of the models recorded a PA greater than 70%, with the best performance belonging to svm-w-cv (89.2%), while the other half fell below 70%, with the svm-d-b model recording the lowest PA (32.5%).
An unbiased area estimate of gullies (ha) is presented in Table 3. With the highest PA (89.2%) and lowest standard error (3.7 ha), svm-w-cv provided the most accurate gully areal coverage (57.2ha). The highest standard error (11.5 ha) belonged to rf-w-b model, which had a gully area of 55.2 ± 25ha. However, in the F1-score ranking, rf-d-b and rf-d-cv algorithms achieved the best results (>0.90), but RF algorithms belonging to the wet season had relatively low score (0.82). On the other hand, all SVM algorithms (svm-d-cv, svm-d-b, svm-w-cv, and svm-w-b) recorded lower F1-scores, ranging 0.85-0.88. The two resampling techniques recorded the same omission error (85.1%), but slightly different commission errors, e.g., bootstrapping had 40.8% error of commission compared to 37.8% error for k-fold CV (Table 4).

Accuracy Assessment of Gully Mapping
Using machine learning algorithms (RF and SVM), the cross-validation (CV) resampling method yielded better OA compared to bootstrapping for both the wet and dry seasons ( Figure 5). Two apparent trends can be observed from these results based on OA: (i) RF consistently performed better than SVM irrespective of the season or resampling methods: bootstrapping and cross-validation; (ii) dry season had better OAs than the wet season, but this was not reflected in class level accuracy indices for gully classification. Based on the unbiased UA, all algorithms showed good performance in gully classification, recording UA above 70% ( Figure 6). In particular, the best performance belonged to the svm-d-b (93.4%), whereas the worst UA belonged to the rf-w-b model (77%). For most models, PA was generally low relative to UA. Only half of the models recorded a PA greater than 70%, with the best performance belonging to svm-wcv (89.2%), while the other half fell below 70%, with the svm-d-b model recording the lowest PA (32.5%).
An unbiased area estimate of gullies (ha) is presented in Table 3. With the highest PA (89.2%) and lowest standard error (3.7 ha), svm-w-cv provided the most accurate gully areal coverage (57.2ha). The highest standard error (11.5 ha) belonged to rf-w-b model, which had a gully area of 55.2 ± 25ha. However, in the F1-score ranking, rf-d-b and rf-dcv algorithms achieved the best results (>0.90), but RF algorithms belonging to the wet season had relatively low score (0.82). On the other hand, all SVM algorithms (svm-d-cv, svm-d-b, svm-w-cv, and svm-w-b) recorded lower F1-scores, ranging 0.85-0.88. The two resampling techniques recorded the same omission error (85.1%), but slightly different commission errors, e.g., bootstrapping had 40.8% error of commission compared to 37.8% error for k-fold CV (Table 4).    Table 3. Estimated gully area (ha) with associated standard error (ha) at ± 95% CI (ha) for each algorithm (rf: random forest, svm: support vector machine, d: dry, w: wet, b: bootstrapping, cv: cross-validation, CI: confidence interval). The two resampling techniques recorded the same omission error (85.1%), but slightly different commission errors, e.g., bootstrapping had 40.8% error of commission compared to 37.8% error for k-fold CV.   The two resampling techniques recorded the same omission error (85.1%), but slightly different commission errors, e.g., bootstrapping had 40.8% error of commission compared to 37.8% error for k-fold CV.

Gully Distribution
Results indicated that gullies can be spectrally discriminated from other land cover classes, both in the dry and wet season; although there were observable differences in the distribution of the extracted gullies in these two seasons (Figure 7). In the wet season, there seem to be more gullies than there are in the dry season. This difference in gully areal coverage between the two seasons is more pronounced in Figure 7a, corresponding to rf-d-b and Figure 7b, representing the svm-w-cv model.
Differences in gully reflectance among the two seasons also had a bearing on gully classification. The underlying statistical test revealed that the difference was significant (U = 162, z = 9.5534, p < 0.0001), and the mean difference was 0.08. The wet season had more vegetation covering bare surfaces, and because of this, spectral differences were more pronounced during the wet season (Figure 8). On the contrary, in the dry season, most gullies spectrally resembled the bare surfaces they dissect. Consequently, the algorithms were less efficient in extracting gullies occurring on bare soil surfaces in the dry season. This probably explains the high commission error (43.1%) and standard error (8.6 ha) in the dry season.

Gully Distribution
Results indicated that gullies can be spectrally discriminated from other land cover classes, both in the dry and wet season; although there were observable differences in the distribution of the extracted gullies in these two seasons (Figure 7). In the wet season, there seem to be more gullies than there are in the dry season. This difference in gully areal coverage between the two seasons is more pronounced in Figure 7a, corresponding to rf-d-b and Figure 7b, representing the svm-w-cv model.
Differences in gully reflectance among the two seasons also had a bearing on gully classification. The underlying statistical test revealed that the difference was significant (U = 162, z = 9.5534, p < 0.0001), and the mean difference was 0.08. The wet season had more vegetation covering bare surfaces, and because of this, spectral differences were more pronounced during the wet season (Figure 8). On the contrary, in the dry season, most gullies spectrally resembled the bare surfaces they dissect. Consequently, the algorithms were less efficient in extracting gullies occurring on bare soil surfaces in the dry season. This probably explains the high commission error (43.1%) and standard error (8.6 ha) in the dry season.

Discussion
Remotely sensed data are inherently subject to errors, hence, error assessment is essential for data assimilation, one of the primary uses of satellite data products [65]. In this section, we discuss errors associated with the derived gully maps, offering a possible explanation for such error sources. Different resampling methods undoubtedly play an important role in classification accuracy, hence, the final model selection. Specifically, we explored the influence of bootstrapping and k-fold cross-validation techniques in gully classification, considering different seasons (dry and wet) and classifiers (SVM and RF). Results revealed that k-fold CV performs slightly better than bootstrapping in terms of commission error. Kohavi et al. [66], in his study of CV and bootstrap for accuracy estimation and model selection, also reported k-fold CV as the best method to use over bootstrapping. Kim [67] estimated classification error rate, comparing repeated k-fold CV, repeated hold-out and bootstrap, and found that the repeated k-fold CV was better than bootstrap. The author further reported that bootstrapping had bias problems for both large and small samples, despite its small variance, hence, the expectation for better performance for small samples.
Although the results of our study are generally in agreement with previous studies, it is worth noting that the performance of the bootstrapping and k-fold CV varied considerably at class level with algorithm and season. There are instances where bootstrapping performed better than k-fold CV in gully classification. For instance, the best model, namely, svm-d-b, based on UA, belonged to bootstrapping. Such results are important because most studies using either bootstrapping or k-fold CV rarely focus on class level accuracy when evaluating the performance of these resampling techniques. More importantly, even at the class level, different accuracy metrics ought to be considered. This increases the robustness and reliability of the accuracy results, making it possible for researchers to draw correct deductions on the behavior of the algorithms under investigation [61]. However, various class accuracy metrics (UA, PA, standard error, and F1-score) used in the current study, all derived from the confusion matrix, disagreed with one another in some instances. For example, some algorithms that obtained high PA values had low corresponding UA values or vice-versa. This is also true with F1-score vs. either PA or UA. Based on the F1-score, the best algorithms belonged to RF (e.g., rf-d-b and rf-d-cv). Given the disagreement amongst various accuracy metrics, we relied on the standard error as a reliable measure to judge the accuracy of the algorithms.
In the wet season, the algorithms proved to be more efficient in gully classification on bare soil surfaces due to the existence of vegetation cover in bare soil surfaces, making it possible to discriminate gullies. Such findings are comparable or similar to those of previous studies. For example, one study automatically identified gullies based on ASTER images acquired during the dry and wet seasons [68]. The study concluded that the wet season-acquired image performed better than the dry season one. It is worth noting that the wet season is not always appropriate for gully identification in all situations. The success of gully identification depends on the complexity of gully appearance as influenced by their morphological characteristics (shape, size, length, depth, etc.) [42], sensor type and/or resolution, and classification algorithms [69], amongst other factors. For example,

Discussion
Remotely sensed data are inherently subject to errors, hence, error assessment is essential for data assimilation, one of the primary uses of satellite data products [65]. In this section, we discuss errors associated with the derived gully maps, offering a possible explanation for such error sources. Different resampling methods undoubtedly play an important role in classification accuracy, hence, the final model selection. Specifically, we explored the influence of bootstrapping and k-fold cross-validation techniques in gully classification, considering different seasons (dry and wet) and classifiers (SVM and RF). Results revealed that k-fold CV performs slightly better than bootstrapping in terms of commission error. Kohavi et al. [66], in his study of CV and bootstrap for accuracy estimation and model selection, also reported k-fold CV as the best method to use over bootstrapping. Kim [67] estimated classification error rate, comparing repeated k-fold CV, repeated hold-out and bootstrap, and found that the repeated k-fold CV was better than bootstrap. The author further reported that bootstrapping had bias problems for both large and small samples, despite its small variance, hence, the expectation for better performance for small samples.
Although the results of our study are generally in agreement with previous studies, it is worth noting that the performance of the bootstrapping and k-fold CV varied considerably at class level with algorithm and season. There are instances where bootstrapping performed better than k-fold CV in gully classification. For instance, the best model, namely, svm-d-b, based on UA, belonged to bootstrapping. Such results are important because most studies using either bootstrapping or k-fold CV rarely focus on class level accuracy when evaluating the performance of these resampling techniques. More importantly, even at the class level, different accuracy metrics ought to be considered. This increases the robustness and reliability of the accuracy results, making it possible for researchers to draw correct deductions on the behavior of the algorithms under investigation [61]. However, various class accuracy metrics (UA, PA, standard error, and F1-score) used in the current study, all derived from the confusion matrix, disagreed with one another in some instances. For example, some algorithms that obtained high PA values had low corresponding UA values or vice-versa. This is also true with F1-score vs. either PA or UA. Based on the F1-score, the best algorithms belonged to RF (e.g., rf-d-b and rf-d-cv). Given the disagreement amongst various accuracy metrics, we relied on the standard error as a reliable measure to judge the accuracy of the algorithms.
In the wet season, the algorithms proved to be more efficient in gully classification on bare soil surfaces due to the existence of vegetation cover in bare soil surfaces, making it possible to discriminate gullies. Such findings are comparable or similar to those of previous studies. For example, one study automatically identified gullies based on ASTER images acquired during the dry and wet seasons [68]. The study concluded that the wet season-acquired image performed better than the dry season one. It is worth noting that the wet season is not always appropriate for gully identification in all situations. The success of gully identification depends on the complexity of gully appearance as influenced by their morphological characteristics (shape, size, length, depth, etc.) [42], sensor type and/or resolution, and classification algorithms [69], amongst other factors. For example, Sentinel and Landsat images performed relatively well in the dry season than in the wet season [70]. Although gully classification was successful in the wet season relative to the dry season, there were few locations where gullies were filled up with vegetation. Such gullies could not be automatically classified, in which case we relied on visual interpretation of high-resolution aerial photographs and/or dry season PlanetScope images.
Gully appearance also played an important role in gully classification. Consistent with previous studies [42,71], the classification algorithms were efficient in detecting continuous gullies mostly in linear shape. Conversely, the algorithms proved to be less efficient in areas with high gully density, often surrounded by transitional zones to non-gully [71], but these areas form a relatively small portion of the study area and had negligible influence on the accuracy. The SVM combined with CV (e.g., svm-w-cv) reflected the best performance in the wet season with the least standard error (3.7 ha) and highest PA (89.2%), followed by a RF model (rf-d-b), recording slightly different standard error (6.1 ha) and PA (83.6%). Nevertheless, 50% of the models obtained a PA that is below 70%. Despite this discrepancy, the estimated gully areas (ha), based on area-weighted metrics, are unbiased and can be relied upon.
From a practical point of view, the identification of gullies from satellite images with reasonable accuracies is of paramount importance to gully rehabilitation. Like all remote sensing-derived products, gully maps are subject to errors, and hence, accuracy assessment is a prerequisite [54]. However, most remote sensing-based gully studies tend to rely on accuracy indices, such as PA and UA, without taking into account the uncertainty of the estimated gully areas. Although it is not a requirement, it is often recommended to provide not only PA and UA but also unbiased quantitative area estimates such as the area-weighted metrics and confidence intervals [60]. In this study, we quantified gullied areas (ha) together with their associated levels of uncertainties, such as standard errors (ha) and confidence intervals (ha).
RF combined with bootstrapping resampling provided the best gully area (88 ± 14.4 ha) estimate with the least standard error (6.1 ha) in the dry season. In the wet season, SVM combined with CV resampling estimated gully area (57.2 ± 18.8 ha) with the lowest standard error (3.7 ha). These findings shed light on the influence of these resampling techniques on the accuracy of satellite-based gully mapping but also provides the basis for further investigations into the accuracy of such resampling techniques, especially when using different satellite images other than the PlanetScope data, preferable, freely available ones, with higher spatial resolution. Initially, we planned to use both PlanetScope and SPOT-7 images, also obtainable free of charge for the test area, but SPOT-7 image scenes acquired in the wet and dry season months were not available for the test area. Nevertheless, given that we only mapped gullies in a small part of the problem area, we are planning to test the method in other areas with wider spatial coverage. However, mapping gullies over large areas, particularly using automatic methods, is still a challenge due to the complexity of gullies over such large areas [14]. Thus far, even advanced methods such as CNNs have errors in detecting complex gully systems [37]. It is worth noting that the detection of gullies mainly depends on the spatial resolution of the image used. For example, at larger scales, gullies have only been mapped at a spatial resolution of up to 2.5 m in South Africa [14,15]. To overcome this challenge, the future implementation of our method, will in part, require the use of a high spatial resolution (<2 m) image, for instance, pansharpened SPOT-7 image (1.5 m) or WorldView (0.5 m), which can detect individual gullies. Another limitation of this method relates to climate. The method is suitable for application in arid/semi-arid regions where gullies are often not covered by trees [42]. Our study demonstrated that gullies could be better identified in the dry season with RF combined with bootstrapping, whereas SVM combined with k-fold CV is best for identifying gullies in the wet season. Therefore, we recommend the use of RF and SVM for mapping gullies in the dry and wet seasons, respectively. Provided that PlanetScope provides global spatial coverage with daily revisit time, we particularly recommend it for continuous monitoring of gullies at any location.

Conclusions
The aim of this study was to assess the efficacy of cross-validation and bootstrapping in gully classification and also to reveal how well the PlanetScope images perform in gully extraction in the dry and wet seasons of a semi-arid climate. We found the following outcomes.

•
Gullies were spectrally different in all bands of the PlanetScope images, both in the dry and the wet seasons. • NDVI values did not differ from all land cover classes regarding the reflectance values; thus, it was not involved in gully classification. • Dry and wet seasons ensured different classification accuracy, but gully extraction was successful. RF outperformed the SVM algorithm in terms of OA, but the differences of the OAs were < 4%. Differences were larger in the dry season (3.5%) and smaller in the wet season (~1%). • Generally, based on the OAs, CV performed better with the RF algorithm than the bootstrapping (with~1.0-1.5% differences), but on a class level, bootstrapping provided the most accurate gully extraction with the RF in the dry season, whereas CV was efficient with SVM in the wet season.
Accordingly, both resampling techniques were efficient, but RF with bootstrapping resampling technique in the dry season can be suggested to map gullies. In the future, we plan to extend the mapping in larger areas to help landowners and managers to fight against erosion and to plan the interventions at the hot spot areas.
Author Contributions: Conceptualization, K.P. and S.S.; methodology, K.P.; software, K.P. and D.A.; validation, K.P. and D.A., formal analysis, K.P.; investigation, K.P.; resources, S.S.; data curation, K.P. and D.A.; writing-original draft preparation, K.P.; writing-review and editing, K.P. and S.S.; visualization, K.P., D.A., and S.S.; supervision, S.S. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: PlanetScope images can be purchased from the PlanetLabs Inc. Limited, non-commercial access to PlanetScope imagery can also be gained through the Education and Research Program (https://www.planet.com/markets/education-and-research/ (accessed on 30 July 2020)). Reference data can be provided by the authors on demand.