Determining Cover Management Factor with Remote Sensing and Spatial Analysis for Improving Long-Term Soil Loss Estimation in Watersheds

: The universal soil loss equation (USLE) is a widely used empirical model for estimating soil loss. Among the USLE model factors, the cover management factor (C-factor) is a critical factor that substantially impacts the estimation result. Assigning C-factor values according to a land-use/land-cover (LULC) map from ﬁeld surveys is a typical traditional approach. However, this approach may have limitations caused by the difﬁculty and cost in conducting ﬁeld surveys and updating the LULC map regularly, thus signiﬁcantly affecting the feasibility of multi-temporal analysis of soil erosion. To address this issue, this study uses data mining to build a random forest (RF) model between eight geospatial factors and the C-factor for the Shihmen Reservoir watershed in northern Taiwan for multi-temporal estimation of soil loss. The eight geospatial factors were collected or derived from remotely sensed images taken in 2004, a digital elevation model, and related digital maps. Due to the memory size limitation of the R software, only 4% of the total data points (population dataset) in each C-factor class were selected as the sample dataset (input dataset) for analysis using the stratiﬁed random sampling method. Seventy percent of the input dataset was used to train the RF model, and the other 30% was used to test the model. The results show that the RF model could capture the trend of vegetation recovery and soil loss reduction after the destructive event of Typhoon Aere in 2004 for multi-temporal analysis. Although the RF model was biased by the majority class’s large sample size (C = 0.01 class), the estimated soil erosion rate was close to the measurement obtained by the erosion pins installed in the watershed (90.6 t/ha-year). After the model’s completion, we furthered our aim to address the input dataset’s imbalanced data problem to improve the model’s classiﬁcation performance. An ad-hoc down-sampling of the majority class technique was used to reduce the majority class’s sampling rate to 2%, 1%, and 0.5% while keeping the other minority classes at a 4% sample rate. The results show an improvement of the Kappa coefﬁcient from 0.574 to 0.732, the AUC from 0.780 to 0.891, and the true positive rate of all minority classes combined from 0.43 to 0.70. However, the overall accuracy decreases from 0.952 to 0.846, and the true positive rate of the majority class declines from 0.99 to 0.94. The best average C-factor was achieved when the sampling rate of the majority class was 1%. On the other hand, the best soil erosion estimate was obtained when the sampling rate was 2%.


Introduction
Severe soil erosion will increase soil sedimentation and severely reduce the water storage and supply capabilities of reservoirs. A previous analysis revealed that 30% of map and a look-up table to determine the C-factors in this study. An RF model was then built to predict the C-factor values from the geospatial factors for use in the USLE model. The rest of the paper is organized as follows. Section 2 introduces the study area, the geospatial data, and the data mining algorithm. Section 3 presents the C-factor modeling and soil erosion estimation results and a discussion focusing on the class imbalance (imbalanced data) problem. Finally, Section 4 concludes this paper and provides possible future research directions.

Methods
To address the research goal, it is necessary to develop a strategy to derive more consistent and reliable factors when applying the USLE model to evaluate soil loss. With the rapid development of geospatial technologies and data availability, there is a great potential to improve the USLE modeling based on geospatial data, such as remotely sensed images and geographic information system (GIS) data layers, to obtain a more accurate soil loss estimation at the regional scale, especially for long-term and multi-temporal studies. The data types, data processing, and data mining analysis are described in the following sub-sections.

Study Area
A mountainous area of 760 km 2 of the Shihmen Reservoir watershed in northern Taiwan was selected as the study area ( Figure 1). The elevation of the study area ranges approximately from 250 to 3500 m above sea level, measured from a 10-m digital elevation model (DEM). The terrain increases steeply from north to south, and steep slopes are ubiquitous in the watershed. The average annual precipitation is approximately 2500 mm. There are 13 geological formations and six soil types in the study area [24]. The major land-cover is forest (both natural and artificial) and there is little agricultural activity.
wan was selected as the study area ( Figure 1). The elevation of the study area ranges approximately from 250 to 3500 m above sea level, measured from a 10-m digital elevation model (DEM). The terrain increases steeply from north to south, and steep slopes are ubiquitous in the watershed. The average annual precipitation is approximately 2500 mm. There are 13 geological formations and six soil types in the study area [24]. The major land-cover is forest (both natural and artificial) and there is little agricultural activity. The Shihmen Reservoir watershed is one of the major reservoirs in Taiwan, providing drinking water to more than three million people in three northern cities. However, heavy rainfall induced by typhoons, such as Typhoon Aere in August 2004, may generate a large amount of debris and driftwood, resulting in a water supply shortage and causing various water resource management problems. Hence, a long-term land cover monitoring project from 2004 to 2009 was implemented using remote sensing technologies to support water resource and water supply management [25]. Based on the collected geospatial data, this study further explored the effectiveness of the USLE model by modeling the C-factor in the Shihmen Reservoir watershed to estimate the multi-temporal soil erosion rates.

Materials and Data Preprocessing
To connect the geospatial data to the C-factor, a total of eight attributes (as listed in Table 2) were considered in the modeling. They are elevation, slope, NDVI, soil adjusted vegetation index (SAVI), shortest distance to roads, shortest distance to rivers, geology, and soil type. In contrast to studies using many attributes, we used only eight attributes because research has shown that attribute reduction could improve the model performance [26]. We also used fewer attributes because of the need to maximize the number of data points that can be processed by the R software (see Section 2.3). The purpose of preprocessing is three-fold, including data collection, labeling, and feature derivation. Some of the derived features could be obtained by spatial analysis from the original data. For example, the elevation and slope information could be derived from the DEM. As for the multi-temporal satellite images listed in Table 3, this study generated the NDVI and SAVI indices year by year for modeling purposes. Equations (1) and (2) show the NDVI and SAVI equations, where NIR and RED represent the radiance or reflectance of near-infrared The Shihmen Reservoir watershed is one of the major reservoirs in Taiwan, providing drinking water to more than three million people in three northern cities. However, heavy rainfall induced by typhoons, such as Typhoon Aere in August 2004, may generate a large amount of debris and driftwood, resulting in a water supply shortage and causing various water resource management problems. Hence, a long-term land cover monitoring project from 2004 to 2009 was implemented using remote sensing technologies to support water resource and water supply management [25]. Based on the collected geospatial data, this study further explored the effectiveness of the USLE model by modeling the C-factor in the Shihmen Reservoir watershed to estimate the multi-temporal soil erosion rates.

Materials and Data Preprocessing
To connect the geospatial data to the C-factor, a total of eight attributes (as listed in Table 2) were considered in the modeling. They are elevation, slope, NDVI, soil adjusted vegetation index (SAVI), shortest distance to roads, shortest distance to rivers, geology, and soil type. In contrast to studies using many attributes, we used only eight attributes because research has shown that attribute reduction could improve the model performance [26]. We also used fewer attributes because of the need to maximize the number of data points that can be processed by the R software (see Section 2.3). The purpose of preprocessing is three-fold, including data collection, labeling, and feature derivation. Some of the derived features could be obtained by spatial analysis from the original data. For example, the elevation and slope information could be derived from the DEM. As for the multitemporal satellite images listed in Table 3, this study generated the NDVI and SAVI indices year by year for modeling purposes. Equations (1) and (2) show the NDVI and SAVI equations, where NIR and RED represent the radiance or reflectance of near-infrared and red bands, respectively, and L indicates the soil correction factor that is commonly set to 0.5 [27]. However, the mountainous topography might distort the spectral responses so that the same land cover types located on phototropic and apheliotropic areas might have significant variations. To reduce the topographic effect, this study applied a typical Minnaert correction [28] to rectify the spectral responses based on a non-Lambertian assumption before deriving the vegetative features. Similar to the use of DEM and SPOT images, the rest of the data, such as the distance information to roads and rivers, were generated by spatial analysis. Additionally, the vector data were rasterized to 10 × 10 m to be compatible with one another.
where NIR and RED are the radiance or reflectance of near-infrared and red bands, and L is the soil correction factor (0.5). Loamy fine sand/coarse sandy loam/sandy loam 4.
No data After feature derivation and assembly, a look-up table was used to assign C-factor values to the official 2004 LULC map (the only map available during the study period). The C-factor's point values were based on the research of Jhan [29] and Lin [30], which in turn were based on the design manual of the Soil and Water Conservation Bureau of Taiwan. Without conducting numerous experiments in the field to determine the C-factors at various locations, the look-up table provides the next best option to assign credible C-factor values to different LULC classes in the study area. As can be seen from Figure 2, there were 23 land use classes assigned to 12 different C-factor classes. The higher the Cfactor, the less the land cover and the higher the soil erosion. The correspondence between the LULC classes and the C-factor classes are summarized in Table 4. The geospatial data and the corresponding C-factor values of each grid cell in the study area were extracted and assembled as an analytic dataset (herein referred to as the population dataset) used by the data mining algorithm (after sampling) to create a C-factor model. The model is used for multi-temporal analysis of C-factor change and soil erosion. After feature derivation and assembly, a look-up table was used to assign C-factor values to the official 2004 LULC map (the only map available during the study period). The C-factor's point values were based on the research of Jhan [29] and Lin [30], which in turn were based on the design manual of the Soil and Water Conservation Bureau of Taiwan. Without conducting numerous experiments in the field to determine the C-factors at various locations, the look-up table provides the next best option to assign credible Cfactor values to different LULC classes in the study area. As can be seen from Figure 2, there were 23 land use classes assigned to 12 different C-factor classes. The higher the Cfactor, the less the land cover and the higher the soil erosion. The correspondence between the LULC classes and the C-factor classes are summarized in Table 4. The geospatial data and the corresponding C-factor values of each grid cell in the study area were extracted and assembled as an analytic dataset (herein referred to as the population dataset) used by the data mining algorithm (after sampling) to create a C-factor model. The model is used for multi-temporal analysis of C-factor change and soil erosion.   Table 4. LULC classes merged by C-factor classes (revised from Jhan [29] and Lin [30]).

C-Factor Class
Land Use/Land Cover Class

Data Mining Analysis
To extract useful, unknown, and potential information from the vast dataset, data mining is an efficient approach [31,32]. Random Forests (RF), one of the popular data mining algorithms proposed by Breiman [33], has excellent performance in analyzing many complicated remote sensing issues [34][35][36][37]. The procedure for applying the data mining algorithm to construct a C-factor model for soil erosion estimation is illustrated in Figure 3. We used the geospatial data derived from GIS and SPOT images and the official 2004 LULC map (from field surveys) to construct the C-factor model using the randomForest() package of the R software. Among the data mining procedures, the RF algorithm is a supervised approach that adopts multiple decision trees (DT), bootstrap aggregation (bagging), and internal cross-validation techniques. It integrates all tree-based results into the best model for analysis [33]. Recent years have seen increased attention being given to the RF algorithm in the geo-informatics domain. Belgiu and Dragut [34] reviewed its application and future direction in remote sensing. Nevertheless, constructing the C-factor model based on the RF algorithm and geospatial data has rarely been conducted until now. The primary benefit of the RF algorithm is that it can avoid the over-fitting issue to improve prediction accuracy [38]. The RF algorithm employs measures such as the Gini index, information gain (IG), or entropy to evaluate the degree of impurity of discrete or numeric input data. The smaller the Gini index of an attribute, the higher the priority should be selected to construct a conditional node and ignore the other attributes. The RF algorithm performs numerous iterations and randomly divides the training dataset (in terms of the number of data and the number of attributes) into many subsets to build many trees and generate better results than the DT method. The detailed steps can be seen in Guo et al. [39].  Owing to the memory size limitation of R, only about 4% of the population dataset can be imported and used in the modeling (303,682 points out of 7,592,062 points). Therefore, we used the stratified random sampling method to select the same percentage of data points from each of the 12 C-factor classes (herein referred to as the sample dataset or the input dataset), as shown in Table 5. Note that the input dataset is highly unbalanced, with 92.5% of the data from the C = 0.01 class (forests). The percentage of the rest of the C-factor classes ranges from close to 0% to 2.9%. Table 5. The percentage composition of C-factor classes in the input datasets under different majority class sampling rates (C = 0.01 class). Owing to the memory size limitation of R, only about 4% of the population dataset can be imported and used in the modeling (303,682 points out of 7,592,062 points). Therefore, we used the stratified random sampling method to select the same percentage of data points from each of the 12 C-factor classes (herein referred to as the sample dataset or the input dataset), as shown in Table 5. Note that the input dataset is highly unbalanced, with 92.5% of the data from the C = 0.01 class (forests). The percentage of the rest of the C-factor classes ranges from close to 0% to 2.9%. After sampling, 70% of the input dataset was used as the training data and inputted into the RF algorithm to create the C-factor model. The remaining 30% was the test data and used to assess the performance of the model. Once the model was validated, it was applied to the population dataset to generate the C-factor maps from 2004 to 2008. Finally, the C-factor and the other USLE factors were combined in the USLE model to calculate the soil erosion rates from 2004 to 2008 (see Section 2.4 for details).
Categorical and numerical data are two major geospatial data types. The randomForest() package of R uses the Gini index to split nodes in order to reduce impurity at each node [40]. We used 1000 trees in this study, and three variables were tried at each split. The Gini index of dataset D is defined in Equation (3), where m is the number of categories, n i is the number of data points in the ith category, and N is the total number of data points. If a binary split was performed on attribute A, the Gini index given the split was defined in Equation (4), where D 1 and D 2 are the datasets after the split [41].
To verify the data mining-based C-factor model, this study calculated the confusion matrix (or error matrix), overall accuracy (OA), Kappa coefficient, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve for the quantitative evaluation. Overall accuracy refers to the percentage of correctly classified samples as shown in Equation (5), where M represents the element in the confusion matrix; M total is the sum of M; M diag is the sum of M on the diagonal line; Nc is the number of labels; and i and j are the row and column indices. The Kappa coefficient is shown in Equation (6), which reflects the reliability of the modeling results. When the Kappa coefficient is close to 1, it shows excellent agreement between prediction and observation. By contrast, the results are worse than random assignment when a negative Kappa value appears.

USLE Computation
The USLE equation is shown in Equation (7), and the meanings of all USLE factors are listed in Table 6. Although the goal of the study is to evaluate the C-factor, other USLE factors are also needed to estimate the annual soil loss. For these factors, this study followed the investigations of Chen et al. [14] and Liu et al. [42] to produce the rainfall erosivity, soil erodibility, slope length, and slope steepness layers, respectively. We also assumed that the support practice factor is 1. The generated R m -and K m -factor distribution maps are shown in Figure 4a,b, and the L-and S-factors are combined as a topographic factor (LS-factor) shown in Figure 4c. (a) Rm-factor (b) Km-factor (c) LS-factor  Soil erodibility factor t-hour/MJ-mm L Slope length factor --S Slope steepness factor --C Cover management factor --

Results and Discussion
The results of this study are presented as tables, graphs, and statistical metrics in the following sub-sections.

Cover-Management Factor Modeling
According to the data preprocessing and study procedures described in previous sections, the C-factor RF model was constructed using the training data and tested with the test data. The C-factors were from the official 2004 LULC map (the only map available during the study period), and the results are shown in Table 7. Unlike a previous study [30], which only used at most 100 points from each C-factor class and substantially overestimated soil erosion, we tried to maximize the number of data points that could be processed to train the RF model given the memory size limitation of R. Consequently, 4% of the total data points (population dataset) were used (303,682 points selected out of 7,592,062 points). The training dataset result shows that OA = 1, Kappa = 1, and AUC = 1 (the confusion matrix is not shown here to avoid redundancy). This indicates that the RF model can correctly distinguish all 212,578 data points in the training dataset. By contrast, Table 7 shows that the result of the test dataset has less remarkable metrics. While OA is still very high (0.9516), Kappa is only 0.5741, and AUC is 0.7804. Using this RF model, we predicted the C-factor distribution maps from 2004 to 2008 using SPOT images. The resulting maps are shown in Figure 5b-f.   Figure 5a shows the true C-factor distribution from the official 2004 LULC map. The similarity between the prediction (Figure 5b-f) and the reference C-factor is evident. The red pixels (high C-factor values) of all maps cluster along the central river valley near the center and lower portions of the watershed. However, some of the C-factor cannot be distinguished reliably, and some omission and commission errors have occurred. As shown in column 4 (shaded) of Table 7, the most noticeable trend is a prediction bias towards the C = 0.01 class, representing natural and artificial forests. This result reflects the overwhelming majority of the C = 0.01 class in the sample (92.5%) when building the RF model (Table 5). Hence, the RF model tends to classify pixels into the C = 0.01 class. Because of this bias towards a low C-factor value (0.01), the average of 2004 C-factors predicted by the RF model is only 0.0115 (Table 8)    To reduce the classification error, we experimented with an ad-hoc down-sampling of the majority class technique that used only 2% data from the majority class (C = 0.01 class) while maintaining a 4% sample rate of the other minority C-factor classes. The percentages of data points from each of the 12 C-factor classes in the input dataset were previously shown in Table 5. Again, the result of the training dataset shows a perfect classification of OA = 1, Kappa = 1, and AUC = 1 (again, the confusion matrix is not shown here to avoid redundancy). The result of the test dataset (Table 9) shows that OA = 0.9230, Kappa = 0.6484, and AUC = 0.7807. Compared with Table 7, we can see that the OA decreases from 0.9516 to 0.9230, the Kappa increases from 0.5741 to 0.6484, and the AUC remains about the same. Using only 2% data from the C = 0.01 class, the predicted average C-factors from 2004 to 2008 range from 0.0124 to 0.0133 (Table 8), higher than the 4% case and closer to the true average C-factor value. In other words, the reduction of the sampling rate from 4% to 2% increases the Kappa coefficient at the expense of OA. Simultaneously, the reduction of the sampling rate also brings the predictions closer to the reference value (ground truth). Although we had adopted the stratified random sampling method to obtain a representative sample of all C-factor classes, it did not completely avoid the imbalanced data problem. The next section will present how the sampling rate of the majority class affects soil erosion estimates.

Soil Erosion Estimation
Up until 2017, the USLE model was the only method for estimating the amount of soil loss in the technical regulations for soil and water conservation in Taiwan [43]. Combining the R m -, K m -, and LS-factor layers ( Figure 4) with the C-factor layer ( Figure 5), the USLE model was used in this study to estimate the amount of soil erosion. The result based on the official 2004 LULC map is listed in the second column of Table 10 Figure 6a. As such, a high resemblance between the predictions (Figure 6b-f) and the reference value (Figure 6a) is evident. The red pixels (high soil erosion rates) of all maps cluster near the center and lower portions of the watershed, indicating good modeling results. The results of the 4% RF model are similar, but we did not include them here to avoid redundancy. soil erosion maps in Figure 6b-f for the years 2004 through to 2008, while the soil erosion map based on the official 2004 LULC map is shown in Figure 6a. As such, a high resemblance between the predictions (Figure 6b-f) and the reference value (Figure 6a) is evident. The red pixels (high soil erosion rates) of all maps cluster near the center and lower portions of the watershed, indicating good modeling results. The results of the 4% RF model are similar, but we did not include them here to avoid redundancy.   Chen et al. [14] compiled a table of the calculated amounts of soil erosion of the Shihmen Reservoir watershed from previous studies. The table shows that soil erosion ranges from 1 to 3310 t/ha-year. Our results are close to the lower end of the soil erosion range. By contrast, the only study using a similar methodology to this study [30], which related nine decision factors, including the gray level co-occurrence matrix (GLCM) to the C-factor, only used at most 100 points from each C-factor class. The study achieved a Kappa coefficient of 0.758, but estimated the soil erosion to be 359.4-629.9 t/ha-year. Chen et al. [14] compiled a table of the calculated amounts of soil erosion of the Shihmen Reservoir watershed from previous studies. The table shows that soil erosion ranges from 1 to 3310 t/ha-year. Our results are close to the lower end of the soil erosion range. By contrast, the only study using a similar methodology to this study [30], which related nine decision factors, including the gray level co-occurrence matrix (GLCM) to the C-factor, only used at most 100 points from each C-factor class. The study achieved a Kappa coefficient of 0.758, but estimated the soil erosion to be 359.4-629.9 t/ha-year.
Based on the erosion pins installed in the Shihmen Reservoir watershed [44] and the measurements collected from 8 September 2008 to 10 October 2011, the soil erosion depth ranged from 2.17 to 13.03 mm/year [37]. The average erosion depth is 6.5 mm/year, which is equivalent to 90.6 t/ha-year if the unit weight of soil is assumed to be 1.4 t/m 3 [42]. Thus, our results are more comparable with the erosion pin measurements. Specifically, it shows that the 4% sampling rate underestimates soil erosion, whereas the 2% sampling rate overestimates soil erosion.
It is worth noting that the long-term land cover and landslide monitoring project [25] from 2004 to 2009 indicated that only typhoon Aere in 2004 induced significant land degradation and mass movement in the period. Large amounts of debris and driftwood flowed into the Shihmen Reservoir. The high turbidity in the water caused the water distribution system to be shut down for an unprecedented 18 days. A similar situation has not happened since. The removal of land cover in the watershed is the reason why the calculated soil erosion based on the official 2004 LULC map is as high as 116.3 t/ha-year. After Typhoon Aere, as the land stabilized and vegetation re-grew to provide new ground cover, soil erosion reduced substantially. This explains why the measured soil erosion is only 90.6 t/ha-year between 2008 and 2011 (Table 10).
According to Table 10, both of the 4% and 2% C-factor models predict a peak of soil erosion in 2004, followed by a gradual decrease until a small rebound in 2008. This is consistent with the vegetation recovery in the study area from 2005 to 2007 and the hike in rainfall brought by Typhoons Kalmaegi, Sinlaku, and Jangmi in 2008 [45]. However, the differences in soil erosion rates are not as marked as expected from year to year. Nevertheless, these results confirm that it is possible to develop a multi-temporal data mining model for the C-factor and the corresponding soil erosion.

Discussion
Although this study's results demonstrated the feasibility of constructing a data mining model for the USLE C-factor, the modeling results were affected by the majority class's sampling rate (C = 0.01 class). At the beginning of the study, we used the stratified random sampling method (instead of the simple random sampling method) to ensure that each class (strata) of the population dataset (total data points) were represented. It helped to avoid incorrect analysis, but it did not solve the class imbalance problem entirely. So, we experimented with the down-sampling of the majority class (C = 0.01 class) to 2% and kept the other minority classes at a 4% sample rate. A better result was achieved as indicated by the higher Kappa coefficient (but at the expense of lower OA). To investigate if the classification performance can be further improved, we applied an ad-hoc down-sampling of the majority class technique (similar to random under-sampling) to the population dataset with 1% and 0.5% sampling rates. The resulting percentage compositions of each of the 12 C-factor classes in the input dataset were previously shown in Table 5. After building corresponding RF models, the overall results are summarized in Table 11, which shows the OA, the Kappa coefficient, the AUC, the true positive rate of all minority classes combined, the true positive rate of the majority class, the average C-factor of the LULC map, the predicted average C-factor of 2004, the predicted soil erosion rate of 2004, the predicted soil erosion rate of 2008, and the measured soil erosion rate by the erosion pins. It was found that the down-sampling strategy works well. With the decrease in the majority class sampling rate, the Kappa coefficient increases from 0.574 to 0.732, and the AUC increases from 0.780 to 0.891. Moreover, the true positive rate of all minority classes combined also increases from 0.43 to 0.70. However, the overall accuracy decreases with the down-sampling from 0.952 to 0.846, and the true positive rate of the majority class declines from 0.99 to 0.94.
At first, it appears that 0.5% is the best sampling rate in this study, but the average C-factor indicates otherwise. The C-factor value starts from 0.0115 at 4% and then becomes 0.0130 at 2% and 0.0156 at 1%, gradually approaching the reference value of 0.0164. However, the average C-factor suddenly dips back down to 0.0115 at 0.5%, deviating from the reference value again. Therefore, judging from the average C-factor, 1% is the majority class's best sampling rate.
The unexpected result of the average C-factor is further investigated in Figure 7, in which the predicted percentage compositions of four different sampling rates were plotted for all minority classes. The red line indicates the "true" percentage composition from the 2004 official LULC map. As shown in the figure, the difference in the C = 0.16 class between the LULC map and the different sampling rates determines how accurate the final average C-factor is. As the sampling rate decreases from 4% to 2% and 1%, the respective percentage of the C = 0.16 class approaches and then surpasses that of the LULC map. When the sampling rate is further reduced to 0.5%, the percentage of the C = 0.16 class reverses course and drops back to the same level as that of the 4% sampling rate. This explains why the 0.5% sampling rate did not yield a better average C-factor, even though its other metrics (such as OA, Kappa, and AUC) were superior. reference value again. Therefore, judging from the average C-factor, 1% is the majority class's best sampling rate. The unexpected result of the average C-factor is further investigated in Figure 7, in which the predicted percentage compositions of four different sampling rates were plotted for all minority classes. The red line indicates the "true" percentage composition from the 2004 official LULC map. As shown in the figure, the difference in the C = 0.16 class between the LULC map and the different sampling rates determines how accurate the final average C-factor is. As the sampling rate decreases from 4% to 2% and 1%, the respective percentage of the C = 0.16 class approaches and then surpasses that of the LULC map. When the sampling rate is further reduced to 0.5%, the percentage of the C = 0.16 class reverses course and drops back to the same level as that of the 4% sampling rate. This explains why the 0.5% sampling rate did not yield a better average C-factor, even though its other metrics (such as OA, Kappa, and AUC) were superior. Finally, if we compare the predicted soil erosion rates with the rate measured by the erosion pins, we arrive at yet another different conclusion. The 2% sampling rate predicts a soil erosion rate of 93.2 t/ha-year, which is the closest to the measured rate of 90.6 t/hayear. In this scenario, 2% is the best sampling rate. The best sampling rates under different criteria are shaded in Table 11 for easy comparison.
The results presented above suggest that, apart from the class imbalance problem, other factors are responsible for this study's modeling performance. Using the overall Finally, if we compare the predicted soil erosion rates with the rate measured by the erosion pins, we arrive at yet another different conclusion. The 2% sampling rate predicts a soil erosion rate of 93.2 t/ha-year, which is the closest to the measured rate of 90.6 t/ha-year. In this scenario, 2% is the best sampling rate. The best sampling rates under different criteria are shaded in Table 11 for easy comparison.
The results presented above suggest that, apart from the class imbalance problem, other factors are responsible for this study's modeling performance. Using the overall evaluation metrics (such as OA, Kappa, and AUC) in this study is not entirely appropriate and could be misleading. Since our goal was a two-step approach to model the C-factor and eventually the soil erosion, the predicted soil erosion rate is the most important indicator. A more balanced dataset does not always yield a better modeling result in terms of soil loss, and we consider 2% to be the best sampling rate in this study.

Conclusions
Unlike previous studies, this research developed C-factor models based on data mining techniques to improve the soil erosion assessment in the Shihmen Reservoir watershed in northern Taiwan. Eight geospatial data were selected and used in the modeling. The multi-temporal vegetative indices (NDVI and SAVI) derived from multispectral satellite images were rectified by a topographic correction to reduce variations over time and better characterize the surface radiances of the same targets. The C-factor models built using the RF-based data mining algorithm were used with USLE to estimate the spatiotemporal soil losses from 2004 to 2008. The results were compared against past studies and the measurements of erosion pins. They showed promising classification performance.
It is found that the soil erosion rate in 2004 was the highest because of the unprecedented destruction of Typhoon Aere in 2004. As the vegetation in the watershed re-grew after the typhoon to provide new ground cover, the soil erosion rate decreased steadily until 2008, when a surge of rainfall occurred due to Typhoons Kalmaegi, Sinlaku, and Jangmi. This trend was successfully captured by the RF models, which demonstrates the feasibility of the multi-temporal analysis. Furthermore, using an ad-hoc down-sampling of the majority class technique (at 2% sampling rate), the soil erosion rate was predicted to be 93.2 t/ha-year, very close to the 90.6 t/ha-year measured by the erosion pins installed in the watershed.
In addition, this study provides a case of an imbalanced data problem that differs from other imbalanced data problems in that a more balanced dataset does not always yield a better modeling result. The best sampling rate of the majority class based on different metrics are summarized as follows: Funding: This study was supported, in part, by the Ministry of Science and Technology of Taiwan (ROC) under project No. NSC 100-2628-E-008-014-MY3, MOST 108-2221-E-008-001-MY3, and MOST 109-2121-M-027-001.