Feature Engineering of Geohazard Susceptibility Analysis Based on the Random Forest Algorithm: Taking Tianshui City, Gansu Province, as an Example

: In this paper, Feature Engineering (FE) was applied to Landslide Susceptibility Mapping (LSM), while the most suitable conditioning feature dataset and analysis method were tested and analyzed. Tianshui city was taken as the study area, three types of geohazard (collapse, landslide, and unstable slopes) were used, while a total of twenty-three conditioning features were generated; two dimensionless methods (normalization and standardization) were tested afterward. Four Random-Forest-based (RF-based) feature selection methods using different indicators (Gini Impurity, GI; Out of Bag Accuracy, OOBA) were proposed and tested separately. The LSMs of four models were carried out under the guidance results of FE, namely Classiﬁcation and Regression Tree (CART), Random Forest (RF), Logistic Regression (LR), and Support Vector Machine for Classiﬁcation (SVC). For feature enhancement, standardization had signiﬁcant advantages over normalization. All RF-based methods were proven effective, lifting the AUC by 0.01~0.02. The RF model achieved the highest LSM accuracies, respectively, 0.949 (landslide), 0.957, and 0.949 (unstable slopes), improved by 0.008 (landslide), 0.005 (collapse), and 0.013 (unstable slopes). This proved that the FE helped to improve LSM and can help to decide the dominant conditioning factors for regional geohazards.


Introduction
Landslide is a natural phenomenon that includes mass down-slide movements of rocks, soil, or debris flows under gravity [1], and it has become the most common geohazard due to city expansion and climate change. Landslides can cause massive casualties and economic losses. In the last century, landslides caused over 16,000 casualties [2]. From 2004 to 2010, the Durham Fatal Landslide Database (DFLD) recorded 2620 non-seismic landslides and 32,322 related deaths [3]. China has long suffered from landslides and correlated hazards [4,5]. Hence, research on landslides and other geohazards is of great academic and social importance.
Landslide susceptibility represents the likelihood of landslide occurrence under a certain combination of geo-conditions [6]. In contrast, landslide susceptibility mapping (LSM) refers to its visualization through techniques such as geographical information science (GIS) and remote sensing (RS) [7]. LSM is the basis of geohazards prevention and mitigation and has always been a hot-spot research topic. In the quantitative LSM process, landslides are assumed to be the coupling results of multiple conditioning factors (also discussed as conditioning features when using ML models). The most frequently used methods are formula-based statistical methods and big-data-driven machine learning (ML) methods. some researchers have already used FE or feature selection to help analyze the LSM result, yet they mainly focused on feature importance ranking [21,32,33,35]. Sun et al. [40] used a simple feature importance ranking method, while Zhou et al. [35] used an iteration method; both authors retrained the ML model accordingly and concluded that the feature selection would improve the accuracy. However, whether the iteration method is superior has not been fully discussed.
The most suitable data pre-procession method and feature selection method type, and whether eliminating the unimportant features would improve the result, have not been fully discussed. In detail, there are three aims listed: (1) Explore the most suitable data-preprocessing principles.
(2) Determine whether using the elected features to retrain the model would improve the accuracy. (3) Compare the simple ranking feature selection idea and the iteration feature election idea. Therefore, this paper presents a relatively comprehensive FE-guided LSM by setting experiments upon all the steps in the FE to explore the most suitable combination, namely feature extraction, feature enhancement, and feature selection.

Materials and Methods
Tianshui city was selected as the study area. The detailed workflow is shown in Figure 1. First, 23 conditioning features were generated, and 4 RF-based FE methods were proposed and tested. The LSM is performed with 4 ML models: the CART, RF, SVC (SVM for classification), and LR. The results were evaluated through the Receiver Operating Characteristic curve (ROC curve) and the Area Under Curve (AUC).
been applied to many computing engineering fields yet has seldom been mentioned in the LSM. The FE helps to estimate the most suitable data pre-procession method and the best-matched conditioning feature dataset, thus making the ML-based LSM more convincing and explainable. Moreover, it helps to analyze the dominant conditioning factors of certain kinds of geohazards and compare the in between differences in between. In recent years, some researchers have already used FE or feature selection to help analyze the LSM result, yet they mainly focused on feature importance ranking [21,32,33,35]. Sun et al. [40] used a simple feature importance ranking method, while Zhou et al. [35] used an iteration method; both authors retrained the ML model accordingly and concluded that the feature selection would improve the accuracy. However, whether the iteration method is superior has not been fully discussed.
The most suitable data pre-procession method and feature selection method type, and whether eliminating the unimportant features would improve the result, have not been fully discussed. In detail, there are three aims listed: (1) Explore the most suitable data-preprocessing principles.
(2) Determine whether using the elected features to retrain the model would improve the accuracy. (3) Compare the simple ranking feature selection idea and the iteration feature election idea. Therefore, this paper presents a relatively comprehensive FE-guided LSM by setting experiments upon all the steps in the FE to explore the most suitable combination, namely feature extraction, feature enhancement, and feature selection.

Materials and Methods
Tianshui city was selected as the study area. The detailed workflow is shown in

Geological Conditions of the Study Area
Tianshui city, situated in the Gansu Province of western China, was selected as the study area. It is one of the prefecture-level cities within the Gansu province; both urbaniza-  10 18 N (latitude). Located at the eastern end of the Qilian orogenic belt, the transition zone between the Guanzhong Plain and the Loess Plateau, the terrain is high in the west and relatively low in the east [41], ranging from 736~3118 m. Fractures are distributed widely, among which the most typical ones are the West Qinling North Rim Fault [42], the Tongwei Fault, and the Qingshui Fault. As a result, tectonic activities take place frequently. Tianshui city belongs to the semi-humid and semi-arid continental monsoon climate zone [41], and the precipitation is strongly affected by seasons and terrain, ranging from 500 to 600 mm. The annual temperature is from roughly 8 to 19 • C. The Wei River, the largest tributary of the Great Yellow River, traverses the entire study area from west to east. The basin has been covered with Quaternary deposits with loess covering 10 to 30 m at the top of the strata, which is unstable [42]. Lithology is gneiss (Proterozoic), rocks (Triassic sedimentary and metamorphic), and sandstone (Devonian and Cretaceous) [42].

Landslide Inventories
Where geohazards have occurred tend to have more significant potential to breed new ones; hence, landslide inventories are required in LSM [1]. This study acquired the landslide inventory by the geological field survey of hazards [42]. It contained 968 landslides, 183 collapses, and 243 unstable slopes. All the spatial attributes of the records were verified manually. The accurate occurrence times were not logged due to the delayed field investigations and historical hazard records that took place decades ago. In Tianshui city, the most frequently occurring geohazards can be divided into three types: landslide, collapse, and unstable slope. The landslides often move along the horizontal direction, while collapses move along the tangential direction at a much higher speed. An unstable slope refers to a slope that is prone to sliding. It may not have slid or collapsed but obtains a high possibility of forming into a landslide or collapse.
Geological investigation results show that mudstone bedrock and overlying loess strata have been strongly affected by historical earthquakes in the study area, causing densely distributed geohazards [42]. Landslides aggregately distribute in river erosion zones and valleys. The overlying lithology grouping is mostly very soft, while the depositions are mainly loess, clayey, and debris, indicating that the landslides are typical mounded (earthy). Collapses mainly distribute along faults and are mostly caused by earthen slope slumps, metamorphic rock avalanches, and collapses.
The inventory contained 2027 records in total. Since the 0~10 • slope covered area is considered gentle, the records were more likely to be deposits or misclassified loess; thus, this part of records was eliminated. Afterward, the remaining were 968 landslides, 183 collapses, and 243 unstable slopes. The detailed distributions are shown in Figure 2. Table 1 shows the raw data resources of the conditioning feature extraction, while the detailed description is in Section 2.3.1. The geodetic reference system is unified to the WGS84 coordinate system using a Universal Transverse Mercator (UTM) projection 48 N band. Meanwhile, the spatial resolution is unified to 30 m. The raw data have a spatial resolution of 0.05 degrees for groundwater volume and precipitation. A 5-year mean value of each feature was calculated to reduce the impact of resampling errors. The platforms pre-processed the data obtained with Google Earth Engine, and codes completed the calculation.   Table 1 shows the raw data resources of the conditioning feature extraction, while the detailed description is in Section 2.3.1. The geodetic reference system is unified to the WGS84 coordinate system using a Universal Transverse Mercator (UTM) projection 48 N band. Meanwhile, the spatial resolution is unified to 30 m. The raw data have a spatial resolution of 0.05 degrees for groundwater volume and precipitation. A 5-year mean value of each feature was calculated to reduce the impact of resampling errors. The platforms pre-processed the data obtained with Google Earth Engine, and codes completed the calculation.

Feature Engineering
The FE process has 4 parts: feature extraction, feature enhancement, feature selection, and evaluation. As feature extraction and feature enhancement should be completed before model training, the two processes could be joined and described as feature preprocessing. The evaluation was merged into the LSM validation process and will not be discussed separately.

Feature Extraction
In quantitative analysis, LSM is a coupling result of multi-variables (conditioning features). However, the features used for LSM vary from region, study area scale, geohazard type, and evaluation model [39,43]. To generate a comprehensive analysis, the selected features should cover all formation aspects (topography, lithology, hydrogeology, vegetation, anthropogenic activity, and land cover). Finally, a total of 23 features were constructed. Except for lithology, aspect, distance to faults, distance to rivers, distance to  Table 2 is an aggregation of the classified features' properties. Based on the previous work of authors [44], we have discovered that the factor classification principles have no specific impact on model training or accuracies, only the distinctness in the LSM maps. This paper adopted the equal-interval classification method by considering the geographical significance of distance decay. Each class in the reassigned results of distances to faults, rivers, and roads represents a natural range. Within each interval, the geological influence can be equally seen. The total distances to faults, roads, and rivers were determined after a comprehensive analysis of the study area scale and its natural impacts. As the influence declined with the distance, the contribution to LSM out of the maximum distance could be neglected. Therefore, they were all reassigned to "11".

1.
Lithology feature: Different lithology types would lead to differences in slope stability and chemical properties. The study area is located at the eastern end of the Qilian orogenic belt. The lithology mainly consists of Precambrian metamorphic rocks with a small amount of magmatic rock [42]. Since the lithology feature in the study was mixed and complex, this paper grouped it referring to hardness and reassigned values of 1~15. The approximate lithology types included under each grouping are shown in Table 3.
Features such as elevation, slope gradient, aspect, and cumulative solar radiation contribute to landslides indirectly, as they influence the catchment area and vegetation coverage, thus affecting landslide susceptibility. Areas with a higher slope gradient tend to be more unstable, as the gravitational potential energy gradually transforms into kinetic energy along with the increasing slope gradient [33]. Curvature (the second-order derivative of slope gradient), profile curvature (the curvature in the maximum slope direction), plan curvature (the curvature perpendicular to the maximum slope direction) [19] were all calculated. The Topographic Wetness Index (TWI) and the Topographical Roughness Index (TRI) are micro-geomorphic indices. TWI is a quantitative simulation of soil's dry and wet conditions in a watershed, while TRI describes regional topographic changes [45].  (1), where A s refers to the catchment area that can be calculated by flow accumulation, and β refers to slope gradient. TRI can be calculated according to Equation (2), where H refers to the altitude of the calculation unit, and H max is the highest in the region, while H min is the lowest. These above features were generated using DEM with continuous values, except for the aspect feature, which was divided into 9 directions and was further reassigned accordingly to 1~9 [27] at 45 • intervals. Table 2 shows the detailed sorting results.
Distance to faults and fault density factor reflect the influence of tectonic fractures, which reduce the strength of the rock mass, causing geological activities. These two features were generated by the faults vector. For distance to fault, a total of 10 ring buffers were generated at an interval of 2 km; for the fault density, the analysis radius was set to 5 km.

3.
Vegetation feature: In this paper, the Normalized Differences Vegetation Index (NDV I) is used as a vegetation feature [32]. Landsat 8-OLT images are applied to calculate a 5-year mean NDVI value (2012~2017), the calculation is shown in Equation (3): for Landsat 8-OLT images, the Band Red is the 4th band, while the Band N IR is the 5th band.

Hydrologic features:
The erosive force directly affects the slope foot and the river incision, while precipitation and groundwater jointly affect the infiltration, thereby affecting the stability of the slope. Precipitation and groundwater can reduce the shear strength and change the lithology composition through chemical interaction [46]. In this paper, the selected hydrologic features were precipitation [47], groundwater volume, Normalized Difference Water Index (NDWI), Normalized Difference Water Index (MNDWI), distance to rivers [32], and river density.
Five-year mean (2012~2017) precipitation and groundwater volume features were generated. Both NDW I and MNDW I [48] can be representative of waterbody distribution. To test whether either feature would contribute more to landslides, both features at the 5-year mean (2012~2017) value were generated. The calculations are shown in Equations (4) and (5) For Landsat 8-OLT images, the Band Green is the 3rd band, the Band N IR is the 5th band, while the Band SW IR1 is the 6th band.
Rivers have an important influence on landslides, especially seismic-induced geological hazard development [49]. In this study, distance to rivers and river density were generated to estimate the influence upon landslide of rivers. Considering that evapotranspiration would cause a decrease in the influence of rivers, for distance to rivers, a total of 10 ring buffers were generated at an interval of 200 m; for the river density, the analysis radius was set to 0.5 km.

5.
Anthropogenic activity and land cover features: Anthropogenic activities (unreasonable artificial slope cutting, soil and water damage due to road construction, mining deposits, water storage, and drainage purposes [1,22]) lead to a decrease in slope stability. This study used land cover, distance to roads, road density, and Normalized Difference Building Index (NDBI) [22,40,46].
The land cover feature used in this paper is the global 30 m land cover classification products that were released in 2020, provided by the Institute of Air and Space Information Innovation, Chinese Academy of Sciences (http://data.casearth.cn/, accessed on 12 January 2021). For distance to roads, a total of 10 ring buffers were generated at an interval of 200 m; for the road density, the analysis radius was set to 0.1 km.
The NDBI is applied as a quantitative estimator of buildings in the study area. This paper calculated the mean NDBI of 2012~2017, while the calculation formula is shown in Equation (6).
For Landsat 8-OLT images, the Band N IR is the 5th band, while the Band SW IR2 is the 7th band.

Feature Enhancement
The commonly used feature enhancement methodology is dimensionless, i.e., data with different ranges or distributions converted into a uniform format. Linear methods are commonly used, such as valorization, centralization, normalization (Min-Max scaling), standardization (Z-score scaling), weighting, log function conversion [4], and inverse tangent function conversion. For ML, normalization and standardization are the most used methods. In this study, both methods were tested.

1.
Normalization (Min-Max scaling): The normalization [39,40] (Min-Max scaling) is the extreme difference scaling. The process is to scale the data between [0, 1], with the formula shown as Equation (7). For each column, x min is the minimum value in the dataset, while x max is the maximum. 2. Standardization: Data standardization refers to scaling the data distribution to a normal distribution with 0 as the mean and 1 as the standard deviation (i.e., the standard normal). The formula is shown in Equation (8). For each column, µ is the mean value of the data, while σ is the standard deviation. x

Feature Selection
Generally, the feature selection methods can be sorted into filter, wrapper, and embedded methods. The filter methods use statistical metric calculation results (such as variance, correlation coefficient, chi-square index, maximum information, etc.) by setting a threshold to eliminate features with lower importance and more noise [50][51][52][53]. While the wrapper methods are also known as recursive elimination feature selection methods, in each round, features that do not meet the evaluation threshold are eliminated, and the remaining features are used as input for the next round of training [54,55]. With the development of ML models, the embedded methods have been popular. Methods are usually algorithmically built in, such as the feature importance ranking for DT-based models and penalty term ranking for LR models.
This study selected the RF, the integrated algorithm of the DT, as the basis of the feature selection method. By combining 2 indictors, a total of 4 feature selection methods were built.

1.
Random Forest algorithm: The RF is a typical bagging ensemble ML algorithm [56]. An RF is constructed using several decision trees, and each tree obtains its classification result using individual classification. The modeling operates in parallel while the output of the whole forest is obtained by voting on all judgment results. Figure 3 shows the workflow of the RF-based classification algorithm using random bootstrap sample selection.
The basic process of RF-based classification is as follows: (1) Construct the original training sample dataset for DTs, the number of cases is N while the number of input variables is M. (2) Generate sub-training datasets by sampling with the replacement bootstrap method for n times, meaning that the generated RF has n trees in total. (3) To select the features for each non-leaf node (internal node), the model first randomly selects a certain number of features from all features and uses them as split features and then selects the best-performing one for node splits. (4) The classifier output is determined by a majority vote of by each tree in the RF.

Random Forest algorithm:
The RF is a typical bagging ensemble ML algorithm [56]. An RF is constructed using several decision trees, and each tree obtains its classification result using individual classification. The modeling operates in parallel while the output of the whole forest is obtained by voting on all judgment results. Figure 3 shows the workflow of the RF-based classification algorithm using random bootstrap sample selection. The basic process of RF-based classification is as follows: (1) Construct the original training sample dataset for DTs, the number of cases is while the number of input variables is . (2) Generate sub-training datasets by sampling with the replacement bootstrap method for times, meaning that the generated RF has trees in total. (3) To select the features for each non-leaf node (internal node), the model first randomly selects a certain number of features from all features and uses them as split features and then selects the best-performing one for node splits. (4) The classifier output is determined by a majority vote of by each tree in the RF.
Gini Impurity (GI) and Entropy are both commonly used in the RF algorithm. After repeated training processes, the model scores and accuracies using GI were always higher than using Entropy; therefore, this paper adopted GI as the indicator. The calculation is shown in Equation (9), where ( ) represent the GI of each feature , is the total number of samples, and is the probability each sample's occurrence. Gini Impurity (GI) and Entropy are both commonly used in the RF algorithm. After repeated training processes, the model scores and accuracies using GI were always higher than using Entropy; therefore, this paper adopted GI as the indicator. The calculation is shown in Equation (9), where I Gini ( f ) represent the GI of each feature f , m is the total number of samples, and f i is the probability each sample's occurrence.
GI should be at a maximum when a node is equally divided among all classes, which means that the split uses the least helpful information. The split is kept until the terminal nodes have a few cases or are all pure. Therefore, the RF forms according to Equation (10), and the multiple independent DT classifiers can be written as {h(X, θ k ), n = 1, 2, . . . N}. (10) where N represents the number of DTs, and h i (x) is the classifier result of DT i .

Feature selection methods:
While the RF is applied as the base feature selection model, this paper constructed the filter-embedded method and the wrapper-embedded method accordingly. RF uses 2 indicators to measure feature importance: GI and Out of Bag Accuracy (OOBA). The method using GI is sorted as the Mean Decrease Impurity method (MDI), while the method using OOBA is the Mean Decrease Accuracy (MDA) method. By combining the MDI and MDA with filter and wrapper thoughts, a total of 4 methods were proposed: filter-MDI, filter-MDA, wrapper-MDI, and wrapper-MDA.
In the RF training process, after n sampling rounds, a subset of n samples equal in size to the original training set is obtained, and the possibility p of each sample being selected is calculated after Equation (11). Since the sampling process is put back, while n is large enough, p would converge to 1 − 1 e . Therefore, approximately 37% of the training data would never be involved in the modeling; these data are called the Out of Bag Data (OOB), and OOBA can also be used as an evaluating criterion for model accuracy [57]. In both the MDI and MDA processes, an RF model is first built, and the criterion is calculated accordingly. For MDI, the features are ranked referring to GI, while for MDA, the features are ranked referring to the change of OOBA. By replacing each feature with noise data and calculating the variance, the change of OOBA is measured. More significant variance indicates the higher feature importance.
For the filter-MDI, the feature selection is performed by setting a threshold of feature remanence. It took only one round of training with the lowest time complexity. The threshold is set at 17, which refers to 70% of total features, equally in both filter-MDI and filter-MDA. The wrapper-MDI recursively performed the feature selection. In each round, the RF model is re-trained, the feature of the lowest GI is eliminated, and the remaining features forms into new subsets for model training. To determine how many features remained that would make out the highest LSM accuracy, we drew an AUCchanging curve that ranked the relative feature importance by the order in which features are eliminated. The same is the wrapper-MDA, while the metrics switched from GI to OOBA. The wrapper-MDA had the highest time complexity for its two-tier iteration.

Landslide Susceptibility Modeling
To explore the FE's effect on different ML models except for RF, another 3 ML models were used, namely LR, SVC (SVM for classification), and CART. Python 3.7 was used for modeling; all ML models (including RF) were built according to the scikit-learn module. These models are all the most-used models in binary classification. The CART is compared to the integrated method (the RF), while the LR and SVM are selected to compare whether the tree-based model or functional model performs better in LSM.

Logistic Regression Model
The LR is a generalized linear regression analysis model commonly used for dichotomous classification. This method has the advantages of simplicity, parallelizability, and strong interpretability [31,58,59]. This paper chose the "L2" penalty to avoid overfit. "C" (the hyperparameter controlling regularization degree) was set as 0.04. The "lib linear" solver was set accordingly. As the LSM was a binary classification process, the "multi_class" was set as "ovr".

Classification and Regression Tree Model
The CART model is one typical model using DT as the base estimator. It is a nonparametric supervised ML method that generates tree-formed decision rules to complete the tasks. It consists of a Root Node, a series of Internal Nodes, and Leaf Nodes; each Internal Node represents an attribute judgment, each branch represents a judgment result output, and each Leaf Node represents a classification result [32,36]. In the CART modeling procession, the hyperparameters were set as 15 (max_depth), "gini" (criterion), 30 (min_samples_leaf), and 15 (min_sample_split).

Support Vector Machine (for Classification)
The SVC model is a supervised generalized classifier for binary classification. It is a nonlinear learning algorithm developed from pairwise theory using certain kernel functions to calculate a margin hyperplane that can maximize the heterogeneity between samples [24,59]. In the SVC modeling procession, "rbf", representing the Gauss kernel function, was selected. The Gamma hyperparameter was set as 0.07.

Validation
In binary classification processes, samples can be divided into positive or negative samples [6,60]. The results are sorted into 4: positive samples that are predicted to be positive (true positive, TP); negative samples that re-predicted to be positive (false positive, FP); negative samples that are predicted to be negative (true negative, TN); positive samples that are predicted to be negative (false negative, FN). The ROC curve is drawn with the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis. The closer the curve is to the upper left corner, the more accurate the model is. TPR and FPR can be calculated according to Equations (12) and (13).
where P is the number of positive samples in the original dataset,FN can be calculated by P − TP, while TN can be calculated by P − FN.
The AUC refers to the area under the ROC curve, with a range of [0.5, 1]. This paper evaluated the FE by comparing the AUC of LSM before and after the FE process. The ROC curve was plotted using python, while the AUC was also calculated.

Results
In this section, the detailed experimental procedure and the corresponding results are presented in the order of sampling construction, FE and LSM. As there are three types of geohazards in the study area, the experiments and results were constructed separately.

Sampling Dataset Preparations and Feature Extraction
As the quantitative LSM can be regarded as a binary classification process, landslide inventories were used to construct the sampling dataset, which contained positive samples (landslide points) and negative samples (non-landslide points). In this study, three geohazard sampling datasets were built separately. Considering the scale of Tianshui city, the landslide inventories are relatively small. Thus, it is necessary to enhance the sampling dataset size to avoid overfitting. The differences between positive and negative sample numbers should not be too large, while the total should not be too small. For negative samples, the spatial distribution should be homogeneous. After a comprehensive analysis, the ratio of positive and negative samples was set to 1:3 (landslide sampling dataset), 1:4 (collapse sampling dataset), and 1:4 (unstable slope sampling set). The sample number within each dataset is shown in Table 4. While Figure 4 shows the points' distribution of each dataset, to ensure the sample purity and thus improve the separability, the negative sample points were separated from the positive sample points in spatial location by at least 2 km. Figure 5 shows the graphics for all feature layers after the pre-procession.

Results and Comparison of Feature Enhancement Methods
This paper compared the model score and AUC under different ML methods to decide whether the Min-Max scaling or the Z-score scaling is superior. The results are shown in Table 5.
Comparing the data in Table 5, it is evident that in most cases, the standardizationprocessed results are better than normalization-processed ones, the maximum improvement in model score was 12.54, while the maximum improvement in AUC is 0.095. Moreover, the improvement was more obvious for models with higher data requirements, such as SVC and LR. Thus, the following feature selection and LSM results were all based on data that used the standardization method.

Results and Comparison of Feature Selection Methods
Four feature selection methods were separately performed, referring to Section 2.3.3. The multicollinearity test was first performed using the SPSS platform (ver. 2021). The multi-correlation between features is also an indicator of FE's effectiveness, for features with a high correlation may cause model instability. We compared the numbers of features with multi-correlation before and after the FE to add extra validation to the experiment. In multicollinearity analysis, the Variance Inflation Factor (VIF) has been a standard evaluation index. The pre-multicollinearity analysis suggested that high correlations existed among the condition factor dataset, with a high VIF up to over 2000; multi-correlations are specifically high among NDVI, NDWI, and MNDWI; as well as curvature, plane curvature, and profile curvature features.
The ratio of the training dataset to the testing dataset was set to 7:3. To control the variables, the other hyperparameters of RF were set to the same value, namely 600 (n_estimators), "gini" (criterion), 80 (random_state), and 20 (max_depth). Since the sample set in this study was small, the results converged under this parameter combination. samples, the spatial distribution should be homogeneous. After a comprehensive analysis, the ratio of positive and negative samples was set to 1:3 (landslide sampling dataset), 1:4 (collapse sampling dataset), and 1:4 (unstable slope sampling set). The sample number within each dataset is shown in Table 4. While Figure 4 shows the points' distribution of each dataset, to ensure the sample purity and thus improve the separability, the negative sample points were separated from the positive sample points in spatial location by at least 2 km.    samples, the spatial distribution should be homogeneous. After a comprehensive analysis, the ratio of positive and negative samples was set to 1:3 (landslide sampling dataset), 1:4 (collapse sampling dataset), and 1:4 (unstable slope sampling set). The sample number within each dataset is shown in Table 4. While Figure 4 shows the points' distribution of each dataset, to ensure the sample purity and thus improve the separability, the negative sample points were separated from the positive sample points in spatial location by at least 2 km.

Results and Comparison of Feature Enhancement Methods
This paper compared the model score and AUC under different ML methods to decide whether the Min-Max scaling or the Z-score scaling is superior. The results are shown in Table 5. For filter methods, the elimination threshold should first be settled. Commonly used methods are setting fixed thresholds, while for RF-based FE methods, the threshold could also be settled by drawing a learning curve of hyperparameters (max_features). However, the RF model had certain randomness that led to max_feature values varied from datasets, causing low stability in threshold selection based on the learning curve; thus, this paper adopted a fixed threshold instead. In most of the LSM processions, the number of factors ranges from 2 to 22, with an average of 9 [43]. This paper calculated the GI differences according to the feature ranking result as Equation (14) to decide the most suitable threshold that would bring the most minor information loss and the most representative of the feature dataset. D GI = GI k − GI k−1 (k = 1, 2, 3, . . . , n) (14) where GI k is the related GI of feature k, and Figure 6 shows the result. where is the related GI of feature , and Figure 6 shows the result. After repeated experiments, the last peak always appeared in the range of 16~20 (feature numbers), indicating that when the remaining features are appropriately 70% (i.e., 17 features) remaining, the GI is relatively low and will not vary dramatically. Therefore, 70% (i.e., 17 features) was finally selected as the threshold.

Results of Filter-MDI
In the filter-MDI procession, the Gini impurity value for each feature was output through the interface feature_importance. Figure 7 shows the ranking results. After repeated experiments, the last peak always appeared in the range of 16~20 (feature numbers), indicating that when the remaining features are appropriately 70% (i.e., 17 features) remaining, the GI is relatively low and will not vary dramatically. Therefore, 70% (i.e., 17 features) was finally selected as the threshold.

Results of Filter-MDI
In the filter-MDI procession, the Gini impurity value for each feature was output through the interface feature_importance. Figure 7 shows the ranking results.
The changes in model score and AUC are shown in Table 6. Improvements can be observed after the feature selection, where the most significant improvement appears in the collapse dataset, up to 1.45 of model score and 0.018 of AUC.
ture numbers), indicating that when the remaining features are appropriately 70% (i.e., 17 features) remaining, the GI is relatively low and will not vary dramatically. Therefore, 70% (i.e., 17 features) was finally selected as the threshold.

Results of Filter-MDI
In the filter-MDI procession, the Gini impurity value for each feature was output through the interface feature_importance. Figure 7 shows the ranking results. The changes in model score and AUC are shown in Table 6. Improvements can be observed after the feature selection, where the most significant improvement appears in the collapse dataset, up to 1.45 of model score and 0.018 of AUC.

Results of Filter-MDA
For the filter-MDA procession, as the OOBA change is a relative value, its ranking indicated the relative feature importance as well. The results are shown in Figure 8. Ranking differed from the results processed by filter-MDA, such as NDBI, slope direction, lithology, site type, geologic lithology, distance to rivers, distance to faults, and NDVI. However, the features with the highest importance are still slope, elevation, and precipitation.

Results of Filter-MDA
For the filter-MDA procession, as the OOBA change is a relative value, its ranking indicated the relative feature importance as well. The results are shown in Figure 8. Ranking differed from the results processed by filter-MDA, such as NDBI, slope direction, lithology, site type, geologic lithology, distance to rivers, distance to faults, and NDVI. However, the features with the highest importance are still slope, elevation, and precipitation. The changes in model score and AUC are shown in Table 7. The most significant improvement of model score appeared in the collapse dataset, up to 1.17, while the AUC improvements were even.  The changes in model score and AUC are shown in Table 7. The most significant improvement of model score appeared in the collapse dataset, up to 1.17, while the AUC improvements were even.

Results of Wrapper-MDI
In the wrapper-MDI procession, the feature importance is ranked in the order of iterated elimination. By continuously eliminating features, the model was rebuilt, and AUC also changed. The number of remaining features corresponding with the highest AUC determined the feature selection threshold. Figure 9 shows the detailed AUC changing curve during the procession. To be noted, the optimal number of features is not a fixed value but varies with the dataset division. In the wrapper-MDI feature selection procession, the impurity is only a rejection indicator and would not be shown in the final ranking. The feature importance results are shown in Table 8.  To be noted, the optimal number of features is not a fixed value but varies with the dataset division. In the wrapper-MDI feature selection procession, the impurity is only a rejection indicator and would not be shown in the final ranking. The feature importance results are shown in Table 8. The changes in AUC are shown in Table 9. To be noted, due to frequent iterations, the results represent the improvement after feature selection; the highest AUC may not correspond to the best model performance. Table 9. Changes in AUC after using the filter-MDA feature selection method. As the models were rebuilt in every iteration, the model scores are not logged.

Geohazards
Before  Figure 10 shows the detailed AUC changing curve during the wrapper-MDA procession.
Remote Sens. 2022, 14, x FOR PEER REVIEW 20 of 28 Table 9. Changes in AUC after using the filter-MDA feature selection method. As the models were rebuilt in every iteration, the model scores are not logged.  Figure 10 shows the detailed AUC changing curve during the wrapper-MDA procession. The ranking results are shown in Table 10.  The ranking results are shown in Table 10.  Table 11 shows the changes in AUC. Compared with the filter-indicator methods, the two wrapper-indicator methods achieved higher AUCs, and the wrapper-MDA performed better than the wrapper-MDI.

Landslide Susceptibility Mapping
By comparing the repeated experimental process and the results, this study finally chose to retain 70% of the total number of features (i.e., 16). Seven features were eliminated from each dataset. While deciding which feature to be eliminated, ones that ranked in the bottom 40% under all four feature selection methods were first chosen. If the total did not account for seven, then features ranked in the bottom 40% under three feature selection methods were chosen. Table 12 shows the eliminated features for each dataset. In Section 3.3, this paper discussed the discovery of high multi-correlation, while the VIF decreased significantly after the FE. The VIFs among the remaining features ranged from 1 to 7. When the VIF value is less than 10, the multi-correlation among features is low and acceptable, indicating that the FE has reduced data redundancy.
The remaining features were used to model the prediction. The LSMs were completed by outputting the probability of "predicted landslide hazard". Four ML models (CART, RF, LR, and SVC) were prepared and used. This paper sliced the prediction dataset in the order of county administrative boundaries. The results were divided into five classes: very low, low, medium, high, and very high, with an interval of 0.2. The ranking represented the susceptibility of geohazards. Figure 11 shows the LSM results.
To validate the results, this paper adopted the ROC curves and AUC. Figure 12 shows the ROC curves.
In addition, we compared the AUC changes before and after the FE. The results are shown in Table 13. For models with high data quality requirements (such as LR and SVC), the AUC improvement after the FE can be more than 0.24. Remote Sens.   To validate the results, this paper adopted the ROC curves and AUC. Figure 12 shows the ROC curves. In addition, we compared the AUC changes before and after the FE. The results are shown in Table 13. For models with high data quality requirements (such as LR and SVC), the AUC improvement after the FE can be more than 0.24.

Discussion
In this section, discussions that reflect the experiments are proposed. This paper summarizes the results and sorts the discourse into geohazards' FE results, FE improvement for different ML methods, and regional LSM.

1.
Effectiveness of the FE: As discussed in Section 3.4, the most suitable feature enhancement method is standardization; the improvements are more evident for LR or SVC. Four RF-based methods were proposed and tested for feature selection. As the RF modeling required parameter adjust-Remote Sens. 2022, 14, 5658 23 of 27 ment and the dataset needed to be shuffled and re-divided, the accuracy is not fixed. Both wrapper-indicator methods appeared to be more unstable than the filter-indicator methods.
The filter-MDI feature selection method is the most stable, as it requires one-time modeling, with the simplest structure and the highest time efficiency. While the wrapper-MDA achieved the highest AUC improvement, the process is time-consuming and unstable. In most cases, the AUC changing curve oscillated hard, affecting the results and leading to apparent differences between the results obtained by other methods. This indicated that the iteration algorithm might not be the most suitable for LSM. Although it achieved the highest AUC improvement, the simple ranking filter methods are already enough to reach the goal of FE.
Still, by comparing the accuracy change of LSM models before and after the FE, it can be concluded that in most cases, the FE could bring a promised improvement upon LSM, especially for the LR and the SVC. Repeated experiments have been carried out to ensure that the FE would improve the results. For landslide, unstable slope, and most cases of collapse, the AUC after the FE was always higher than the original, although the increases sometimes seemed to be very tiny. For the CART and the RF, the accuracy improvements are not much. The reason may be that the FE methods proposed in this paper are all RF-based.

2.
High susceptibility area distribution analysis and the correlated domain conditioning features for different geohazards: Referring to Section 3.4 and the previous work of [42], the geohazards in Tianshui city mainly consist of loess-cutting and loess-plating forms. The high susceptibility areas are mainly distributed on the riverbanks, where the soil structure is easily cut and eroded by flowing water and rainfall. Since Tianshui is in the arid region of northwest China, the rivers are mostly surrounded by residential areas, and human activities can also damage the stability of the slopes.
The highest and lowest important conditioning features among the three hazards had certain similarities. In the order of ranking results, the dominant features are shown as follows: (1) Slope gradient, elevation, precipitation, NDVI, lithology, land cover, and groundwater volume (landslide); (2) Precipitation, elevation, slope gradient, distance to roads, distance to faults, groundwater volume, and lithology (collapse); (3) Precipitation, NDVI, lithology, slope gradient, elevation, distance to roads, distance to faults, and road density (unstable slope).
By comparing the results for different hazards, the features most closely associated with regional geohazards are slope gradient, elevation, and precipitation. Moreover, details can be observed for different geohazards. Landslides and unstable slopes are more closely associated with vegetation cover (i.e., the NDVI). Meanwhile, the collapse and unstable slopes are related more to roads, reflecting that these two types of geohazards are highly influenced by human activities such as artificial slope cutting during road works. Distance to faults is another domain reason for collapse and unstable slopes, yet it is not equally important for landslides. Groundwater volume has been an important conditioning feature for landslides and collapses but is not prominent for unstable slopes. Meanwhile, comparing the eliminated features in Table 12, we can conclude that this aspect has a weak connection between all types of geohazards. This is because the overall low vegetation cover of the study area, collapse, and unstable slopes has little relation with rivers. However, they are still affected by rainfall and groundwater.
These discussions show that the principles concluded by ML algorithms are somehow consistent with the laws of geography and geology, thus proving the efficiency of the FE. However, the results are subject to the study area due to the limited landslide inventory and study area. The geohazard type in Tianshui city is simple (loess geohazard mainly), while the triggering condition is mostly rainfall-flush of gravity. Multiple study areas should be involved for further studies to test the FE and to reveal the hazard formation modes thoroughly.

3.
Accuracy and validation of LSM: From Table 13, the ML model with the best analysis results and highest AUC is always the RF model. Considering the cartographic mapping results, the CART model performed the worst with too many block patches and a clear slice boundary of the dataset. However, this phenomenon appeared in both DT-based models due to the large spatial resolution scale that caused homogeneous value distribution of some conditioning factors (e.g., precipitation, groundwater volume, etc.).
The overall test dataset is sliced into the administrative boundary, as we aimed to give clear advice for each county. All four ML models had short optimization in the boundaries. The LR and SVC had more continuous mapping results; therefore, the administrative boundaries are relatively insignificant. Regarding the zoning effect, the analysis results of the RF model had the highest differentiation between the medium, high, and very high susceptibility areas. The very high susceptibility area is the smallest, making it the best differed in the susceptibility results in riverbanks and other low-flat areas. Above all, the RF model outperformed the other three ML models, but the mapping results appeared to be dispersed; thus, the result is slightly worse than the one obtained by the SVC.

4.
The correlating mitigation advice: For landslides, the high and very high susceptibility areas are located in Qinzhou County, north Maiji County, west Zhangjiachuan County, and east Wushan County, mostly in farming and wetland areas. Along the Shaoxing River, Weihe River, Baimao River, and Qingshui River, with a slope gradient ranging mostly below 20 • , the corresponding elevation is below 2 km. Lithology types are mostly sandy loess and clay. For collapses, higher susceptibility to landslides is distributed in the central part of the study area, mainly on both sides of the West Qinling-Beiyuan Faults and the Lixian-Luojiabao Faults. The high and very high susceptibility areas of landslides and collapses overlapped, indicating that the two types of geohazards have similar characteristics. Since unstable slopes can be regarded as developing landslides or collapses, the high susceptibility area of unstable slopes is small and located within the other two.
The high susceptibility areas of three geohazards in Tianshui city are generally distributed along the low and gentle terrain of rivers and plates that are used mostly for farming and construction, indicating that the soil and water conservation capacity has been damaged. In addition, all three geohazards show a strong tendency to be distributed along both sides of the road, indicating that the stability decreases in rock-soil bodies due to anthropogenic activities have been severe.

Conclusions
By summarizing Section 4, the conclusions are as follows: The FE has been proven effective and can help improve the accuracy of LSM. The wrapper-indicator methods performed better than the filter-indicator methods in accuracy yet appeared to be more unstable and time-consuming. However, the improvement differences between feature selection methods are not significant enough to ignore the time cost and the overfitting of iteration methods; the simplest filter-MDI method can already meet the demand for efficient and accurate LSM. This paper recommends using the simple method to balance the modeling complexity, accuracy, and robustness. What matters the most is whether the FE is involved in the LSM, not which method is applied for feature selection.
By analyzing the feature importance ranking results, this paper lists the features of the most significant influence: (1) Slope gradient, elevation, precipitation, NDVI, lithology, land cover, and groundwater volume (landslide); (2) Precipitation, elevation, slope gradient, distance to roads, distance to faults, groundwater volume, and lithology (collapse); (3) Precipitation, NDVI, lithology, slope gradient, elevation, distance to roads, distance to faults, and road density (unstable slope).
This paper jointly analyzed the high-susceptibility regions of each geohazard with the dominant conditioning features and discovered that the geohazards in Tianshui city are strongly influenced by hydrology and anthropogenic activities. It is recommended to replan the farming areas and increase the protection of wetlands to enhance the soil water conservation capacity, as well as carry out geotechnical reinforcement along the road to avoid casualties and economic losses due to disasters.