Slope Failure Prediction Using Random Forest Machine Learning and LiDAR in an Eroded Folded Mountain Belt

The probabilistic mapping of landslide occurrence at a high spatial resolution and over a large geographic extent is explored using random forests (RF) machine learning; light detection and ranging (LiDAR)-derived terrain variables; additional variables relating to lithology, soils, distance to roads and streams and cost distance to roads and streams; and training data interpreted from high spatial resolution LiDAR-derivatives. Using a large training set and all predictor variables, an area under the receiver operating characteristic (ROC) curve (AUC) of 0.946 is obtained. Our findings highlight the value of a large training dataset, the incorporation of a variety of terrain variables and the use of variable window sizes to characterize the landscape at different spatial scales. We also document important variables for mapping slope failures. Our results suggest that feature selection is not required to improve the RF modeling results and that incorporating multiple models using different pseudo absence samples is not necessary. From our findings and based on a review of prior studies, we make recommendations for high spatial resolution, large-area slope failure probabilistic mapping.


Introduction
Slope failures, including landslides, are estimated to cause 25 to 50 fatalities and more than 3 billion dollars in damage each year in the United States alone [1][2][3][4]. Based on a review of government statistics, aid agency reports and research papers, Petley [5] estimates that 32,322 global fatalities from non-seismically induced landslides occurred between 2004 and 2010. Further, although uncertainty still remains, it has been suggested that global climate change may result in alterations in the global and local frequency, intensity and distribution of failures [6][7][8][9][10]. Thus, there is a need to develop methods to monitor and predict slope failure occurrence that are appropriate for mapping large spatial extents using available geospatial data and mapped reference locations.
Accurate and consistent geospatial data are of great importance in mapping and predicting slope failures, as they represent key factors that may contribute to or inhibit slope stability, such as geomorphologic, lithologic, soil and land use characteristics [11][12][13][14]. As early as the 1970s and 1980s, researchers were using geospatial data and statistical modeling to assess these hazards [11][12][13][14]. More recently, machine learning methods have been applied to mapping, predicting and modeling slope failures [15][16][17][18][19][20][21][22][23][24][25]. Generally, machine learning methods have been successfully applied to a wide range of predictive modeling and classification tasks in the geospatial sciences, which is at least partially attributed to their ability to model complex patterns and relationships from a variety of input data without distribution assumptions [26][27][28]. Additionally, the more recent development of deep learning methods and convolutional neural networks is further expanding our ability to map and model slope failure features, as recently demonstrated by Sameen et al. [29] and Wang et al. [30].
Development of light detection and ranging (LiDAR) technologies and their application to mapping bare earth terrain over large spatial extents with high spatial detail have improved our ability to quantify and map geomorphic features and processes [31][32][33][34]. In the United States, the federal government has implemented the 3D Elevation Program (3DEP) (https://www.usgs.gov/core-science-systems/ngp/3dep) with the goal of providing LiDAR coverage for the entire country, excluding Alaska, which will be mapped using interferometric synthetic aperture radar (InSAR) data [35,36]. Further, the United States Geological Survey (USGS) is currently curating a nation-wide landslide inventory with contributions from local, state and federal agencies (https://www.usgs.gov/natural-hazards/landslide-hazards) [37]. Given the risks that slope failures pose and these recent developments in the availability of quality digital terrain data, landslide inventories and computational and machine learning methods, we argue that there is a need to develop methods that leverage these data and algorithms to map and predict slope failures over large spatial extents.
This research explores the mapping of slope failures throughout the entirety of the 10,765 km² Northern Appalachian Ridges and Valleys Major Land Resource Area (MLRA) within the state of West Virginia. We make use of terrain data derived from LiDAR, additional geospatial data and the position of mapped slope failure head scarps interpreted from LiDAR-derivatives. The objective of this research is to provide recommendations for large-area slope failure mapping from LiDAR and additional geospatial data as highlighted by our results and previous studies. We specifically address the following questions: (1) Does combining multiple models using different sets of pseudo absence data improve slope failure prediction? The use of pseudo absence samples is an approach to generate negative (i.e., no slope failure) examples and is explained in more detail in the Methods section. (2) Does incorporating additional variables representing lithology, soil characteristics and proximity to roads or streams improve the model in comparison to just using terrain variables? (3) How does reducing the training sample size impact model performance? (4) How does predictor variable feature selection impact model performance? (5) What variables are most important for predicting slope failure occurrence? (6) Does calculating terrain variables using multiple window sizes improve the prediction?
Modeling is conducted using the random forest (RF) algorithm to obtain a probabilistic output as opposed to a classification of slope failure extents. In this study, we define slope failures as the movement of a mass of rock, earth or debris down a slope [38]. Our goal is to predict the likelihood of slope failure occurrence broadly, regardless of material or movement type. Debris flows, lateral spread and slides, both translational and rotational, are predicted. Since many failures may not have a distinct spectral signature due to age, canopy cover or spectral confusion with other landscape features and because there is a need to assess future risk along with inventorying existing failures, it is common to attempt to map the likelihood of occurrence or the future susceptibility to failure [11,15,20,21,25,33,42,45,54,55]. For example, Trigila et al. [54] compared logistic regression (LR) and RF for shallow landslide susceptibility mapping from a variety of terrain, lithologic and land use variables. Using terrain variables only, Goetz et al. [45] compared generalized additive models (GAM), generalized linear models (GLM), weights of evidence (WOE), support vector machines (SVM), RF and bootstrap aggregated classification trees (bundling) with penalized linear discriminant analysis (BPLDA) for generating susceptibility models. Duo et al. [25] noted the value of SVM for predicting earthquake and rainfall induced landslide susceptibility in comparison to four other methods. We expand upon such studies by exploring the application of LiDAR, a variety of predictor variables and RF machine learning over a large spatial extent, which is uncommon in the literature. Although this study focuses on probabilistic mapping using the RF traditional machine learning method, it should be noted that deep learning methods that rely on convolutional neural networks have been explored for slope failure mapping and predictive tasks in several recent studies [25,29,30,56,57,58,59,60,61].

Random Forest for Spatial Predictive Modeling
RF is a nonparametric, ensemble decision tree (DT) method capable of accepting continuous and categorical predictor variables to perform classification, regression and probabilistic prediction [62]. DTs rely on recursive binary splits of the data based on learned decision rules to divide the data into more homogenous subsets [63]. Ensemble methods combine multiple decision trees to potentially improve upon the predictive performance of a single tree [27,64]. For RF specifically, each tree in the ensemble is trained using a subset of the available training samples, selected using bootstrapping or random sampling with replacement. Additionally, instead of defining an optimal split using any available variable, only a random subset of the predictor variables is considered at each node or split. The goal is to decrease the correlation between trees in the ensemble by providing each with a different set of training data and input features, resulting in a large number of weak classifiers that, when combined, act as a strong model with the ability to generalize well and not overfit to the training examples [62]. RF has many positive attributes for predictive modeling, including its ability to accept a variety of input predictor variables that may be correlated and/or scaled differently. It is also generally robust to complex feature spaces, can be trained quickly, can accept categorical predictor variables and can provide an assessment of variable importance based on the withheld, or out-of-bag (OOB), data in each tree [27,62].
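For readers unfamiliar with the implementation, the following is a minimal sketch, not the workflow used in this study, of fitting an RF classifier and obtaining class probabilities and OOB-based importance with the randomForest R package; the data frame train, its factor column class and the object new_data are hypothetical placeholders.

```r
library(randomForest)

# Hypothetical training data: "class" is a two-level factor (failure / not failure)
rf <- randomForest(class ~ ., data = train,
                   ntree = 500,        # number of trees in the ensemble
                   mtry = 6,           # predictors randomly considered at each split
                   importance = TRUE)  # permutation importance computed from OOB samples

rf$importance                                             # OOB-based variable importance
probs <- predict(rf, newdata = new_data, type = "prob")   # probabilistic prediction
```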
The RF algorithm has been applied to many mapping and spatial predictive modeling tasks with remotely sensed and/or geospatial data as input [15,27,51,52,54,[65][66][67][68]. It has also been used extensively to obtain probabilistic predictions as opposed to classification products. For example, Evans and Cushman [65] assessed the algorithm for predicting conifer tree species occurrence. Maxwell et al. [69] assessed the algorithm for predicting the likelihood of palustrine wetland occurrence based on topographic variables, while Strager et al. [70] used the algorithm to predict the likelihood of surface mine expansion.
Many studies have already assessed RF for probabilistic mapping of landslide occurrence or susceptibility (for example, References [15,21,45,55,68]). Goetz et al. [45] document strong performance for RF when applied to slope failure susceptibility modeling, as it generally outperformed the other tested methods and was not negatively impacted by a high-dimensional feature space and highly correlated variables. Trigila et al. [54] note the convenience of incorporating categorical predictor variables without the need to generate dummy variables. Taalab et al. [21] document that the algorithm can be used for both probabilistic susceptibility mapping and differentiation of failure types. Thus, prior research generally suggests that RF is an appropriate algorithm for slope failure mapping and, based on our previous experience and a review of the literature, we suggest that it is an optimal method to investigate this mapping problem over large spatial extents with a variety of input variables. Given that many prior studies have noted the value of this algorithm for slope failure mapping and modeling and its many positive characteristics as described above, no additional methods were investigated. Further, as highlighted in our review of the literature, prior studies already offer a comparison of different machine learning methods for this mapping task; as a result, algorithm comparison was not pursued in this study.

LiDAR and Terrain Variables for Mapping Slope Failures
LiDAR is an active remote sensing method that relies on laser range finding to generate accurate horizontal and elevational coordinates at a high spatial resolution from a terrestrial, airborne or satellite platform. An emitted photon, generally in the visible or near infrared spectrum, can strike an object and a portion of the energy can reflect back to the sensor for detection. Further, a single laser pulse can be divided into multiple returns, allowing for vegetation canopy penetration and the mapping of subcanopy terrain features, in contrast to other elevation mapping methods, such as InSAR. In addition to laser range finding, LiDAR also relies on global positioning system (GPS) measurements to reference the point cloud to a geospatial datum and an inertial measurement unit (IMU), which measures the orientation and motion of the aircraft [71]. As highlighted in the review by Jaboyedoff et al. [32], LiDAR offers detailed terrain and geomorphic information for characterizing and detecting the topographic signature of slope failures; however, there are some limitations, such as the expense and time required to collect and post-process the data, the lack of world-wide open and freely available data, the absence of historic LiDAR data due to the only recent development of these technologies for mapping large spatial extents and variability in the data in regard to point density and collection conditions.
A variety of terrain variables can be calculated from raster-based digital terrain models (DTMs), which can be generated from LiDAR data; further, Goetz et al. [72], Goetz et al. [45] and Mahalingam et al. [73] all suggest that terrain variables are highly important in predicting landslide occurrence and susceptibility. Goetz et al. [72] suggest that empirical or trained models that incorporate terrain variables often outperform physical models that attempt to model slope failure susceptibility based on our understanding of the physical processes that produce them. Table 1 provides some example terrain variables that have been used in different slope failure mapping and modeling studies. Note that this is not an exhaustive list and is simply meant to provide some examples. Also, many of these studies incorporate additional, non-terrain variables, such as variables relating to lithology, soil characteristics or land use, that are not summarized here. A review of these papers suggests that several variables have consistently been used in slope failure studies, such as topographic slope [74], topographic aspect [74], topographic wetness index (TWI) [75] and measures of curvature, such as profile curvature (PrC), which measures curvature in the direction parallel to the maximum slope, and plan or planform curvature (PlC), which measures curvature in the direction perpendicular to the maximum slope [76,77]. However, it does not appear that a consistent or optimal set of terrain variables has been determined. Also, many of these variables are calculated based on a neighborhood or moving window analysis [69,76,77,78] and studies have not generally investigated the effect of altering this window size. Thus, there is a need for further investigation of terrain variables for mapping and predicting slope failures.
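As a simple illustration of deriving such variables from a DTM, the sketch below uses the open-source terra R package; this is an assumption for illustration only, as the terrain variables in this study were produced with ArcGIS Pro and SAGA (see the Predictor Variables section), and dtm.tif is a hypothetical input.

```r
library(terra)

dtm <- rast("dtm.tif")                               # hypothetical LiDAR-derived DTM

slp <- terrain(dtm, v = "slope",  unit = "degrees")  # topographic slope
asp <- terrain(dtm, v = "aspect", unit = "degrees")  # topographic aspect
tpi <- terrain(dtm, v = "TPI")                       # topographic position index
rgh <- terrain(dtm, v = "roughness")                 # local roughness

predictors <- c(slp, asp, tpi, rgh)                  # multi-layer stack for later modeling
```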

Study Area
The 10,765 km² study area extent is defined relative to the Northern Appalachian Ridges and Valleys MLRA within the state of West Virginia (Figure 1). MLRAs are defined based on common patterns of physiography, lithology, soil, climate, water resources and land use [79]. This paper highlights findings for this specific MLRA; however, it is part of a larger project to assess slope failure occurrence across the entire state of West Virginia [80], which is ongoing at the time of writing. We have chosen to produce separate models for each MLRA in the state on the assumption that patterns and variable importance may vary based on landscape conditions and that a single model for the entire state would be inappropriate. The Northern Appalachian Ridges and Valleys MLRA is characterized by an eroded mountain belt of long, linear ridges and valleys, a pattern resulting from Paleozoic mountain building, folding and thrust faulting. Rock units vary in age from Precambrian to early Mississippian, with more recently formed units occurring in the western extent of the study area. The Precambrian exposure of metamorphosed basement rock occurs only in the extreme eastern extent of the study area in the Blue Ridge Mountains. The mapped extent also contains a portion of the Great Valley, which is relatively flat and dominated by Cambrian and Ordovician limestone, dolomite and shale. Within the folded mountain belt, valleys are commonly composed of shale and siltstone while ridges are supported by resistant sandstone and limestone [81][82][83]. Based on the 1:250,000 scale geologic map of West Virginia by Cardwell et al. [81], 30 geologic formations occur throughout the extent. The landscape is dominated by a trellis stream network with elevations ranging from 0 to 1400 m and an average elevation of 540 m. Mean annual temperature is near 0°C and yearly total precipitation is around 65 cm, though this can vary greatly based on elevation and topographic aspect; for example, east-facing slopes tend to receive less precipitation than west-facing slopes due to a rain shadow effect.
The region is dominated by oak-pine and oak-hickory forests, with large expanses of agriculture in the Great Valley [82].

Reference Data
Reference data were generated based on manual interpretation of hillshades and slopeshades produced at a 1 m spatial resolution. These raster-based representations of the terrain were derived from publicly available LiDAR data that cover the entire study area extent. These data were funded by the Federal Emergency Management Agency (FEMA) and are made freely available as part of the 3DEP program (https://www.usgs.gov/core-science-systems/ngp/3dep) and can also be obtained from the West Virginia GIS Technical Center and West Virginia View (http://data.wvgis.wvu.edu/elevation/). Hillshades are generated by modeling illumination over the landscape based on topography and the position of an illuminating source in the local sky [84] while slopeshades are created from a topographic slope model where dark shades represent steep slopes and light shades represent shallow slopes. In contrast to hillshades, slopeshades are illumination invariant, as they do not rely on modeling brightness relative to a specific illumination position [75,85].
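A minimal sketch of how such interpretation surfaces can be generated, assuming the terra R package rather than the software actually used to produce them; dtm_1m.tif is a hypothetical input.

```r
library(terra)

dtm <- rast("dtm_1m.tif")                                 # hypothetical 1 m DTM

slope_rad  <- terrain(dtm, v = "slope",  unit = "radians")
aspect_rad <- terrain(dtm, v = "aspect", unit = "radians")

# Hillshade: brightness modeled for a single illumination position (sun altitude/azimuth)
hs <- shade(slope_rad, aspect_rad, angle = 45, direction = 315)

# Slopeshade: slope rendered in inverted grayscale, so steep slopes appear dark
slopeshade <- terrain(dtm, v = "slope", unit = "degrees")
plot(slopeshade, col = rev(gray.colors(256)))             # dark = steep, light = shallow
```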
Each feature was mapped as a single point at the interpreted initiation location of the failure, generally the head scarp. Individual points were used rather than multiple points or areal extents due to the difficulty of consistently and accurately mapping the full extent of each failure and to reduce spatial autocorrelation in the training data. This process was completed by two trained analysts under the supervision of a professional geomorphologist. Debris flows, lateral spread and slides were mapped and all locations were interpreted by one analyst then verified by the other. A total of 1,798 slope failure points were generated using this method. Examples are provided in Figure 2.
Each feature was mapped as a point at the interpreted initiation location of the failure, generally the head scarp. So, each feature was mapped as an individual point as opposed to multiple points or an areal extent due to the difficulty of accurately mapping the full extent of the failure consistently and to reduce spatial autocorrelation in the training data. This process was completed by two trained analysts under the supervision of a professional geomorphologist. Debris flows, lateral spread and slides were mapped and all locations were interpreted by one analyst then verified by the other. A total of 1,798 slope failures were generated using this method. Examples are provided in Figure 2.  RF requires both presence and absence data to create a probabilistic prediction [62]. We generated pseudo absence data as 100,000 random points throughout the study area extent. Any random point that was within 30 m of a slope failure observation was removed. Also, additional slope failure data were obtained from the West Virginia Department of Transportation (WVDOT) and the West Virginia Geologic and Economic Survey (WVGES) and any random points that occurred within these mapped extents or within 30 m of them were removed. Given that a complete inventory of failures was generated, we argue that it is unlikely to randomly select a slope failure feature in the pseudo absence data. Similar methods were used by Strager et al. [70] for predicting future surface mine extents and Maxwell et al. [69] for predicting palustrine wetland occurrence. Table 2 provides a list of all terrain predictor variables used in the study. A total of 43 variables are include, of which 32 represent terrain variables calculated from the LiDAR-derived DTM. All modeling was conducted at a 2 m spatial resolution. Most of the terrain variables were calculated within ArcGIS Pro 2 using built-in tools, such as the Slope Tool [86] or the ArcGIS Geomorphometry & Gradient Metrics Toolbox [87]. All curvature measures were produced using the Morphometric Features module from the open-source System for Automated Geoscientific Analysis (SAGA) software package [88][89][90][91]. Since many raster-based terrain calculations rely on local neighborhoods or moving windows to measure local patterns and compare a cell to its neighbors, the window size and shape can have an impact on the resulting measures and representation of the terrain [92,93]; therefore, we used multiple window sizes to calculate all terrain variables that rely on moving windows in order to capture patterns at multiple scales. Specifically, we used circular windows with radii of 7, 11 and 21 cells. These scales were decided upon based on measures of ridge-to-valley distance across the study area extent The curvature measures calculated in SAGA rely on second-order polynomials that can accept moving windows of variable size [89][90][91]. Other than terrain variables, we also calculated 11 additional features (Table 3). Distance to the nearest US, state and local road were calculated as three separate variables using the Euclidean Distance Tool in ArcGIS Pro 2 [86]. We also calculated the distance from all mapped streams using the same method. To further characterize these factors, we also calculated cost distance to roads and streams by weighting the distance relative to topographic slope, since failures may arise along steep slopes resulting from road construction or stream incision. This was accomplished using the Cost Distance Tool in ArcGIS Pro 2 [86]. 
The three remaining variables are categorical and represent lithological and soil characteristics. A professional geomorphologist categorized all geologic formations in the extent based on their geomorphic presentation as defined in Table 4. We did not include the individual formations as a predictor variable due to the large number of categories. To characterize the soils, a soil scientist augmented the Soil Survey Geographic Database (SSURGO) [102] to derive measures of dominant soil parent material (DSPM) and drainage class. The derived categories are also listed in Table 4.

Beyond providing additional information for the predictive modeling, including these variables allowed us to incorporate expert knowledge into the prediction. As highlighted in the literature review above, a consistent set of variables has not been determined for slope failure likelihood prediction tasks. The variables in this study were selected based on suggestions from the literature, initial experimentation, data availability and expert knowledge. Although a larger number of features could be evaluated, we argue that the variables generated for this study offer a detailed representation of the geomorphic, soil and lithologic characteristics of the terrain, even given data availability and other limitations.
All raster-based variables were then extracted at the mapped slope failure and pseudo absence point locations using the Extract Multi Values to Points Tool in ArcGIS Pro 2 [86] in order to generate tables from which to extract training and validation data.
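As an illustration of this extraction step, a small sketch assuming the terra R package in place of the ArcGIS Pro tool named above; predictors.tif and points.shp are hypothetical inputs.

```r
library(terra)

preds <- rast("predictors.tif")              # hypothetical multi-band predictor stack
pts   <- vect("points.shp")                  # hypothetical failure and pseudo absence points

vals     <- extract(preds, pts, bind = TRUE) # attach extracted cell values to the points
model_df <- as.data.frame(vals)              # table used for training and validation
```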

RF Modeling and Validation
The randomForest package [103] within the open-source data analysis software R [104] was used to generate and validate the RF models. Of the 1798 available slope failure samples, 1500 were randomly selected for training while the remaining 298 were withheld for validation. As one goal of this study is to assess whether incorporating a variety of pseudo absence samples can improve model performance, and also to avoid model bias resulting from an imbalanced training sample, we paired the 1500 training samples with five non-overlapping sets of pseudo absence samples selected using random sampling without replacement. This process resulted in five training datasets, each containing all 1500 slope failure samples and a different set of pseudo absence samples for a total of 3000 samples, along with a validation dataset of 596 samples; all datasets contain an equal number of samples in the presence and absence classes. A model was then trained using each training set and 500 trees, as this was found to be adequate to stabilize the results. The mtry parameter, which defines the number of variables available for splitting at each node in the decision trees, was optimized using 5-fold cross validation with 10 values tested. Hyperparameter optimization was performed separately for each model. All five models were then combined into a single model containing 1500 trees. To compare models using fewer variables or training samples, models were also generated using feature and training sample subsets.
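The following sketch outlines the model-building logic described above. The exact tuning code was not specified, so the caret package is assumed here for the 5-fold cross validation over 10 mtry values, and train_sets is a hypothetical list of the five presence/pseudo absence training data frames, each with a factor column class.

```r
library(randomForest)
library(caret)

# Fit one RF: tune mtry with 5-fold CV over 10 candidate values, then refit at 500 trees
fit_one <- function(df) {
  ctrl  <- trainControl(method = "cv", number = 5)
  tuned <- train(class ~ ., data = df, method = "rf",
                 ntree = 500, tuneLength = 10, trControl = ctrl)
  randomForest(class ~ ., data = df, ntree = 500, mtry = tuned$bestTune$mtry)
}

models   <- lapply(train_sets, fit_one)            # five models, one per pseudo absence set
combined <- do.call(randomForest::combine, models) # concatenate the trees into one ensemble
```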
Variable importance measures produced by RF have been shown to be biased if variables are highly correlated [105,106]. As demonstrated in Figure 3, which compares correlation between a subset of the terrain variables based on Spearman's rho [107], variable correlation is an issue in this study. Further, calculating the same measure at different window sizes results in sets of highly correlated variables; for example, Spearman's rho between the three SP measures was above 0.80 in all cases. Therefore, we used a measure of variable importance based upon conditional random forests, which takes correlation into account in the importance calculation, as implemented in the R party package [105,106]. To explore the impact of feature space reduction, we used a feature selection method from the rfUtilities R package [65], which selects variables using RF-based variable importance estimates. Since our product is a probabilistic prediction as opposed to a classification, models are assessed and compared using receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) measure as implemented in the pROC R package [108][109][110]. An ROC curve plots the true positive rate against the false positive rate at various thresholds. The AUC measure is the area under the ROC curve and is equivalent to the probability that the classifier will rank a randomly chosen positive (true) record higher than a randomly chosen negative (false) record. Generally, values over 0.9 indicate excellent prediction rates [108][109][110]. To statistically compare models, we also made use of DeLong's test for two ROC curves, which provides a p-value for the statistical comparison of ROC curves [109,111]. Note that a balanced validation sample was used in this study, as ROC curves have been shown to be misleading when applied to imbalanced datasets [112].
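A hedged sketch of how the packages named above can be applied; train_df, valid_df, the mtry value and the probability vectors are hypothetical placeholders.

```r
library(party)        # conditional variable importance
library(rfUtilities)  # RF-based feature selection
library(pROC)         # ROC curves, AUC and DeLong's test

# Conditional importance, which accounts for correlated predictors
cf  <- cforest(class ~ ., data = train_df,
               controls = cforest_unbiased(ntree = 500, mtry = 6))
imp <- varimp(cf, conditional = TRUE)

# Importance-based feature selection
sel <- rf.modelSel(xdata = train_df[, setdiff(names(train_df), "class")],
                   ydata = train_df$class)

# ROC/AUC on the withheld validation data and DeLong's test between two candidate models
roc_a <- roc(valid_df$class, probs_model_a)
roc_b <- roc(valid_df$class, probs_model_b)
auc(roc_a)
roc.test(roc_a, roc_b, method = "delong")
```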
To further assess the classification results, we calculated overall accuracy and the Kappa statistic using a 0.5 probability threshold. We also calculated precision, recall, specificity and the F1 score relative to the slope failure class using the number of true positive (TP), false positive (FP), true negative (TN) and false negative (FN) withheld validation samples. Precision represents the proportion of predicted slope failures that are actual slope failures, while recall represents the proportion of actual slope failures that are correctly predicted. Specificity represents the proportion of not slope failure locations that are correctly identified as not slope failure. The F1 score is the harmonic mean of precision and recall [112]. The equations for these metrics are provided below in Equations (1)-(4). Lastly, to provide an additional measure of performance that does not rely on selecting a threshold, we also calculated the area under the precision-recall curve (AUC (PRC)) using the PRROC package in R [29,113,114].
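Consistent with the verbal definitions above, Equations (1)-(4) take the following standard forms, with the slope failure class treated as the positive class:

```latex
\begin{align*}
\text{Precision}   &= \frac{TP}{TP + FP} \tag{1} \\
\text{Recall}      &= \frac{TP}{TP + FN} \tag{2} \\
\text{Specificity} &= \frac{TN}{TN + FP} \tag{3} \\
\text{F1}          &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align*}
```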
In order to make predictions across the full spatial extent and at each 2 m cell location, the trained model was applied to the raster-based predictor variables using a combination of R and Python scripts. Since all predictor variables across the full study area extent gridded at a 2 m cell size sum to several terabytes of data, it was not possible to generate the prediction across the entire extent at once. Instead, predictions were made over 858 4-by-4 km tiles with a 100 m overlap to avoid data gaps. Also, terrain variables were derived for each tile prior to performing the prediction and then deleted, which allowed the prediction to be generated without excessive storage requirements. Once all tiles were processed, they were merged to generate a continuous probabilistic prediction across the entire study area extent.
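A simplified sketch of the tiled prediction idea, assuming the terra R package; the production run used a combination of R and Python scripts, tile_paths is a hypothetical vector of per-tile predictor stacks and combined is the fitted RF model.

```r
library(terra)
library(randomForest)

# Predict the failure-class probability for one tile of gridded predictors
predict_tile <- function(path, rf_model) {
  tile_stack <- rast(path)                                  # predictor stack for one tile
  p <- predict(tile_stack, rf_model, type = "prob", na.rm = TRUE)
  p[[2]]                                  # assumes the failure class is the second level
}

tile_preds <- lapply(tile_paths, predict_tile, rf_model = combined)
full_map   <- merge(sprc(tile_preds))                       # mosaic the overlapping tiles
writeRaster(full_map, "slope_failure_probability.tif", overwrite = TRUE)
```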

Impact of Combining Multiple Models
AUC calculated from the withheld validation samples for each separate model and the combined model are provided in Table 5. AUC varied by only 0.006 between all the models; further, based on DeLong's test, a statistical difference between pairs of models was only observed between Model 5 and the combined model (p-value = 0.021). This generally suggests that providing a wide variety of pseudo absence examples to train multiple models did not improve the classification performance; thus, the sampling scheme used here was not necessary to stabilize the prediction. This result is further supported by the AUC (PRC) metric and all threshold-based metrics, as metrics were similar for all models and the combined model.
Figure 4 represents the distribution of predicted probabilities for the withheld validation data using a kernel density function. In support of the 0.946 AUC value obtained for the combined model, this plot suggests a strong separation between slope failure samples and random pseudo absence data. The median probability for the slope failure locations is 0.84 while the median probability for the pseudo absence points is 0.15. Of the slope failure points, 92.6% have a predicted probability higher than 0.50 while only 18.5% of the pseudo absence data have a probability higher than 0.50. Using a probability threshold of 0.5, the overall accuracy for predicting the validation data is 87.1% and the Kappa statistic is 0.742. For the slope failure class specifically, precision is 0.834 and recall is 0.926. The resulting prediction across the entire mapped extent and some example areas at a larger scale are provided in Figure 5. Red areas are those that are predicted to have a high likelihood of slope failure occurrence while green areas are predicted as having a low likelihood. Based on a visual inspection, the figure suggests a strong relationship between predicted occurrence and topographic slope and incision.

Impact of Predictor Variables
Table 6 provides comparisons for models using subsets of the predictor variables, while the ROC curves are visualized in Figure 6. These models were created by combining five models trained with different pseudo absence data, as described above. At the 95% confidence level, statistical differences were noted between all experiments other than those using all the variables and just the terrain variables (p-value = 0.479). Further, using all variables provided only a 0.002 improvement in AUC and a 0.003 improvement in AUC (PRC) in comparison to only using the terrain variables. The model using all variables provided the best performance while models trained with only the ancillary data provided the poorest performance, which highlights the value of including terrain variables. Further, overall accuracy was lower than 80% and the Kappa statistic was lower than 0.60 for all models that did not incorporate terrain variables.
The lithologic, soil, distance and cost distance variables were not able to provide a statistically comparable performance to the results obtained using only the terrain data and combining these variables with the terrain data did not statistically improve the model performance. As previously noted by Duo et al. [25], LiDAR-derived terrain variables are valuable for slope failure predictive modeling tasks.

Impact of Sample Size
Figures 7 and 8 summarize the impact of sample size on model performance. Note that the sample size is the number of samples for each class, not the overall number of samples. In Figure 7, red stars indicate statistical difference at the 95% confidence level between the model and the model trained with 1500 samples per class. All models performed statistically significantly poorer than the model with 1500 samples other than the models trained with 500 and 1250 samples, although the p-value when using 500 samples was 0.084, just larger than the 0.05 threshold. Further, an increase in performance is noted up to the model with 1500 samples, though the largest changes occur between smaller sample sizes. AUC values larger than 0.900 are observed until the sample size is reduced to fewer than 75 samples per class. Figure 8 shows patterns similar to those in Figure 7; improvement in performance metrics is observed as sample size increases, with the largest improvement at lower sample sizes. This suggests that increased sample size can improve the results; however, this benefit diminishes as sample size increases. Further, this highlights the value of developing large slope failure inventories to support model generation.

Feature Reduction and Feature Importance
Figures 9 and 10 show how model performance varies with feature selection. In comparison to the model using all predictor variables, a statistically significant difference in AUC is only observed when the feature set is reduced to variables in the upper 2.5th percentile of importance or fewer (p-value = 0.036). Model performance stabilizes once roughly the upper 10th percentile of variables is included. In contrast to the sample size results explored above, this generally suggests that the model is not negatively impacted by substantial feature reduction. Also, feature selection does not improve the modeling results, as the highest AUC is obtained when all variables are included. Additional metrics, which are shown in Figure 10, further support this observation. Similar results were noted by Maxwell et al. [115] for general land cover mapping and RF has generally been shown to be robust to complex and large feature spaces [27,62]. Practically, this suggests that feature selection may not need to be undertaken to improve the predictive performance of the model. However, feature selection could be used to reduce the number of variables that must be produced, following a pilot study to assess which variables are most important. This could be particularly useful if large extents are to be mapped.
Figure 11 summarizes the variable importance results obtained using conditional variable importance. The five most important variables in the model are topographic slope (Slp), surface area ratio (SAR), cross-sectional curvature (CSC), surface relief ratio (SRR) and plan curvature (PlC). Specifically, the most important CSC, SRR and PlC variables are those calculated using a 7-cell radius circular window. All variables calculated using a 7-cell radius circular window are found to be more important than their counterparts calculated using an 11-cell or 21-cell radius window, suggesting the importance of characterizing local terrain conditions. Generally, terrain features show high importance in the model. Other than distance to US roads and cost distance from streams, the lithologic, soil, distance and cost distance variables are found to be of comparatively low importance. This makes sense, as adding these variables does not statistically improve the model performance in comparison to only using the terrain variables, as discussed above. It should be noted that variable importance estimates are not consistent, given the large standard deviations displayed by the error bars, which were calculated by replicating the experiment five times using different training sample subsets.
This study confirms the importance of some variables noted as valuable in prior studies. For example, Goetz et al. [45] note the value of Slp, TR and PlC and Trigila et al. [54] document the importance of Slp, aspect and PlC. Interestingly, other studies contradict our results and the results of Goetz et al. [45] and Trigila et al. [54]. For example, Taalab et al. [21] and Pourghasemi and Kerle [68] both document low importance of PrC and PlC for landslide susceptibility mapping using RF. Some prior studies suggest the value of including non-terrain variables; for example, Trigila et al. [54] document the value of including lithology while Taalab et al. [21] highlight the value of distance from streams. As noted in prior studies (for example, Goetz et al. [45]), the importance of variables may vary based on the characteristics of the study area, the mapped failures and the modeling methods being used. This again highlights the value of assessing a variety of variables for predicting landslide occurrence, perhaps using a pilot study. Additional studies to compare importance assessment methods and the value of variables between different study area extents are needed.

Effect of Variable Window Sizes
The results in Table 7 were generated for models using only the terrain variables. Models were produced using all the terrain variables that were not calculated using different window sizes along with the variables calculated at the window size of interest. The model that incorporated variables calculated at only the 21-cell window size was statistically less accurate in regard to AUC than the model using the variables calculated at all window sizes (p-value = 0.001), while the models using only 7-cell (p-value = 0.337) and 11-cell (p-value = 0.078) windows were not statistically different from the model using all window sizes. The 7-cell window model statistically outperformed the 21-cell model (p-value = 0.337) but not the 11-cell model (p-value = 0.398), again highlighting the value of using smaller window sizes in this study. The additional metrics generally support these observations. These results generally suggest that there is value in incorporating terrain measures at multiple scales.

Discussion and Recommendations
Given recent developments in the availability of landslide inventories and high spatial resolution LiDAR data over broad spatial extents, we argue that there is a need to develop methodologies for predicting the likelihood of slope failure occurrence using these data. Thus, the primary objective of this study is to provide recommendations for producing these large-area slope failure mapping products based on our findings and prior studies.
In order to alleviate the impact of training data class imbalance and to provide the algorithm with many examples of pseudo absence data, we produced five separate models and then combined the results into one model, which is one benefit of using RF. However, we found that this was not necessary, since the combined model did not outperform the separate models based on a variety of metrics. Thus, providing the model with one set of pseudo absence data was adequate; however, we had to produce a complete inventory of slope failures across the study area extent to minimize the chance of randomly selecting a slope failure as an absence location. If a complete inventory cannot be produced, we suggest that a manual interpretation of the random pseudo absence points be performed in order to avoid any false negative samples.
Generally, incorporating measures of lithology, soils and proximity to roads and streams did not statistically improve the model in comparison to just using the LiDAR-derived terrain variables. This is an encouraging finding, as this may alleviate the need to produce variables from a wide variety of input data that may be of different quality and scale. For example, we used lithology data from a 1:250,000 scale geologic map in this study, which is much coarser than the available LiDAR data and was a limitation. Similar boundary uncertainties are an issue in the SSURGO soil data.
Reducing the sample size tended to decrease the model accuracy; however, AUC remained above 0.90 with only 75 samples per class. Further, the largest improvements for a variety of metrics were observed at smaller sample sizes. When adding additional samples past 75, performance metrics increased at a slower rate but improvement was still documented. So, we suggest that developing a large training dataset is of great importance for obtaining quality predictions and is worth the investment in resources. As noted above, we used a point feature at the head scarp to represent each slope failure feature in the training and validation datasets. A review of the literature suggests that there is not a consistent method used to represent slope failure features when generating likelihood or susceptibility models; some studies use points (for example, References [19,45,48,72]) while others use polygons (for example, References [15,73]). Thus, there is a need for further investigation of the impact of sample selection and feature representation methods in slope failure modeling.
In contrast to sample size, our results suggest that RF is not heavily impacted by feature selection. The best performance was obtained using all variables; however, results were not statistically different when using all variables versus the top 10th percentile of variables. Even though variable selection may not be necessary, it may still be desirable as a means to reduce model complexity and the need to produce a large set of variables over a large spatial extent. A pilot study over a smaller extent or multiple smaller extents could be used to determine appropriate variables.
The five most important variables in the model were topographic slope (Slp), surface area ratio (SAR), cross-sectional curvature (CSC), surface relief ratio (SRR) and plan curvature (PlC). Generally, we also document that variables calculated using a 7-cell radius moving window showed greater importance than their counterparts calculated using 11 or 21 cells, which suggests the need to measure local conditions. However, including the measures at multiple scales did improve the model, so we suggest using multiple window sizes for calculating terrain variables that rely on moving windows. More work is required to assess the impact of window size and to determine optimal scales at which to produce these variables. The optimal terrain variables may be case specific and may depend on the characteristics of the slides and the landscape. We recommend experimenting with a variety of variables, perhaps as a pilot study.
In a risk management context, these findings generally suggest that LiDAR data are of great value in mapping slope failures and producing likelihood models since they allow for the interpretation of slope failure locations for producing inventories and training data for models. Further, as highlighted in this study and prior studies, a variety of terrain variables can be generated from LiDAR that are valuable for predicting slope failure occurrence. Once these models are generated, occurrence and risk can be summarized relative to aggregating units, such as property boundaries, to generalize the model and provide valuable information to regulators and land owners.

Conclusions
Slope failure and landslide mapping is an important application of geospatial data due to the threats to property and life that these hazards pose. With the development of slope failure inventories and high spatial resolution LiDAR data over large spatial extents, there is a need to develop consistent methods for mapping and predicting these features. This study specifically highlights the value of large, high-quality training datasets along with a characterization of the terrain using a variety of terrain variables calculated at different scales. In the United States specifically, we argue for the adoption of consistent methods that make use of landslide inventories, such as those currently being curated by the USGS, and LiDAR data, such as the 3DEP products, to consistently generate products over large spatial extents.