Analysis of Conditioning Factors in Cuenca, Ecuador, for Landslide Susceptibility Maps Generation Employing Machine Learning Methods

Bravo-López, Esteban; Fernández Del Castillo, Tomás; Sellers, Chester; Delgado-García, Jorge

doi:10.3390/land12061135

¹ Department of Cartographic, Geodetic and Photogrammetric Engineering, Photogrammetric and Topometric Systems Research Group, Centre for Advanced Studies in Earth Sciences, Energy and Environment, University of Jaen, 23071 Jaen, Spain

² Instituto de Estudios de Régimen Seccional del Ecuador (IERSE), Vicerrectorado de Investigaciones, Universidad del Azuay, Cuenca 010204, Ecuador

* Authors to whom correspondence should be addressed.

Land 2023, 12(6), 1135; https://doi.org/10.3390/land12061135

Submission received: 1 May 2023 / Revised: 13 May 2023 / Accepted: 25 May 2023 / Published: 27 May 2023

(This article belongs to the Special Issue Landslide and Natural Hazard Monitoring)

Abstract

Landslides are events that cause great impact in different parts of the world. Their destructive capacity generates loss of life and considerable economic damage. In this research, several Machine Learning (ML) methods were explored to select the most important conditioning factors, in order to evaluate the susceptibility to rotational landslides in a sector surrounding the city of Cuenca (Ecuador) and with them to elaborate landslide susceptibility maps (LSM) by means of ML. The methods implemented to analyze the importance of the conditioning factors checked for multicollinearity (correlation analysis and VIF), and, with an ML-based approach called feature selection, the most important factors were determined based on Classification and Regression Trees (CART), Feature Selection with Random Forests (FS RF), and Boruta and Recursive Feature Elimination (RFE) algorithms. LSMs were implemented with Random Forests (RF) and eXtreme Gradient Boosting (XGBoost) methods considering a landslide inventory updated to 2019 and 15 available conditioning factors (topographic (10), land cover (3), hydrological (1), and geological (1)), from which, based on the results of the aforementioned analyses, the six most important were chosen. The LSM were elaborated considering all available factors and the six most important ones, with the previously mentioned ML methods, and were compared with the result generated by an Artificial Neural Network with resilient backpropagation (ANN rprop-) with six conditioning factors. The results obtained were validated by means of AUC-ROC value and showed a good predictive capacity for all cases, highlighting those obtained with XGBoost, which, in addition to a high AUC value (>0.84), obtained a good degree of coincidence of landslides at high and very high susceptibility levels (>72%). 
Despite the findings of this research, it is necessary to study in depth the methods applied for the development of future research that will contribute to developing a preventive approach in the study area.

Keywords:

Cuenca, Ecuador; machine learning; feature selection; random forests; extreme gradient boosting; landslides

1. Introduction

Landslides are events that cause great impact in different parts of the world. Their destructive capacity generates loss of life and considerable economic damages [1,2]. In view of this, it is necessary to have methods that promote the prevention and mitigation of the effects that these events can cause. One of these methods is Landslide Susceptibility Mapping (LSM), which allows us to visualize the probability of landslide occurrence in specific areas based on their characteristics and thus to know the zones most prone to landslides [3,4]. LSM is fundamental for proper landslide risk management and allows the generation of tools (maps) that help determine the probable distribution of landslides in a specific zone [5].

Mountainous areas corresponding to or close to large mountain ranges such as the Andes suffer the effects caused by landslides. Ecuador is a country crossed by this mountain range and has suffered major disasters caused by catastrophic landslides such as La Josefina (Azuay) in March 1993 and more recently those of Gulag-Marianza (Cuenca, Azuay) in March 2022 (both located in the vicinity of the study area) and Alausí (Chimborazo) in March 2023. All these events have caused huge losses as mentioned above and reflect the need to establish effective prevention processes that help the institutions involved in this task to make the right decisions and thus properly determine the sites that are potentially susceptible to landslides [6]. Furthermore, the scarcity of LSM application studies in the aforementioned country reflects the imperative need to carry out research of this kind in the areas that have suffered the effects of these events over time.

Among the most common methods to develop LSM are quantitative methods, and within these, one category of analysis that has recently generated interest in the scientific community is based on Machine Learning (ML) models. Their main advantage is the ability to represent nonlinear relationships, such as that between conditioning factors (independent variables) and susceptibility to landslides (dependent variable), in addition to not requiring a normal distribution of the independent variables [7]. The application of ML through its different algorithms is evidenced in several studies around the world. The most widely applied ML methods are the conventional ones such as Support Vector Machines (SVM) [8], Artificial Neural Networks (ANN) [9,10,11,12], Random Forests (RF) [13], Decision Trees, Logistic Regression (LR) [14], and Gradient Boosting, among others. Recent ML methods and their application to LSM are reviewed in the study of Tehrani [15].

It should be considered that one of the most important aspects for the generation of LSM with greater accuracy is not only the ML model to be applied but also the selection of an appropriate set of conditioning factors [16]. In this sense, factor selection analyzes the relevance of the factors to be used to build prediction models and is applied to eliminate irrelevant variables and simplify LSM generation [17]. The number of conditioning factors to be considered for LSM is not determined, as it depends on the information available for each study area; therefore, the availability of factors could vary within a wide spectrum between 2 and 596 [18]. The availability of variables for landslide susceptibility analysis generates difficulties such as data collection in certain areas and the use of several factors for the implementation of these analyses, which could affect the computational performance and the accuracy of the results [7]. Therefore, it is a fundamental task to determine which are the prevailing variables that affect landslide susceptibility, based on the available data and the characteristics of the study area. These variables are usually classified thematically, and the most used in different studies are the geo-lithological, topographical (slope and elevation, among others), and those related to land cover [18]. In this context, feature selection allows the generation of LSMs considering only relevant landslide conditioning factors and thus optimizing the model, improving its predictive capability and the performance of computational processes [5]. The factor selection is generally based on an analysis of correlation between factors, which makes it possible to discard some of them and determine the most important ones [7].

Several studies have analyzed the importance of conditioning factors for LSM generation. Micheletti et al. [17] explored ML algorithms (SVM, RF, and AdaBoost) for the selection of geological and morphological factors, and with them generated LSMs in Vaud canton (Switzerland). Vasu and Lee [19] developed LSMs in Seoul (Republic of Korea) starting from 23 conditioning factors; using an Extreme Learning Machine, they selected 13 factors and obtained good performance in the final LSMs. Liu et al. [7] analyzed the predictive performance of various ML-based feature selection algorithms and conventional ML algorithms (LR, ANN, SVM, RF, and Gaussian Process) for LSM generation in Hunan province (China). Meena et al. [5] implemented statistical and ML models (RF and XGBoost) to generate LSM in the province of Belluno (Italy) and evaluated the importance of 14 conditioning factors. Liao et al. [20] identified the essential conditioning factors for landslide susceptibility modeling using hybrid ML methods and different grid resolutions in Wushan and Wuxi counties (China). There are other studies on the feature selection methods implemented in this research, although few of them have been applied specifically to landslide susceptibility analysis: CART [21,22]; Recursive Feature Elimination (RFE) [23,24]; RF [25,26]; and Boruta [27,28]. In general, the selection and combination of conditioning factors for landslide susceptibility modeling has been addressed in several studies [29,30,31,32].

Another approach used to evaluate the importance of variables (i.e., conditioning factors) is based on eXplainable Artificial Intelligence (XAI), which allows the importance of variables to be analyzed with a broader approach [33]. Among the XAI methods is SHAP (SHapley Additive exPlanations), which has been applied in several landslide susceptibility studies [33,34,35,36]. It is based on cooperative game theory and explains the prediction results, thus improving the explainability of ML models [35]. SHAP is easy to operate [36], and its results can be presented graphically [34]. Another XAI method is LIME (Local Interpretable Model-agnostic Explanations), which reflects the behavior of a classifier in predicting samples, whereby it is possible to observe the prediction behavior of a model [35,37]. Based on the above, these methods are a reliable option to be considered in the analysis of conditioning factors that influence landslide generation.

Several studies have been carried out using RF, XGBoost, or both for LSM development. Sahin [38] evaluated the predictive capability of ensemble tree methods for LSM in Ayancik district (Turkey) and found that XGBoost performed the best, followed by RF. Sahin [39] performed a comparative analysis of four gradient boosting algorithms for LSM application in Ulus district (Turkey) and found that XGBoost achieved outstanding performance. Can et al. [40] evaluated the performance of XGBoost for LSM in the upper Ataturk Dam basin (Turkey) and obtained successful results. Deng et al. [41] applied RF to generate LSM with slope units in Maoxian County (China), obtaining satisfactory results. Kavzoglu and Teke [6] analyzed the performance of different ensemble ML predictive algorithms in Macka County (Turkey) and confirmed the superiority of NGBoost. Wei et al. [42] carried out a comparative study of tree based ensemble models for LSM in Laiyuan County (China) and found that XGBoost and RF obtained the best performance. Badola et al. [43] conducted a landslide susceptibility analysis using XGBoost in the Chamoli district (India) and obtained a satisfactory performance. Daviran et al. [44] compared three ML techniques (SVM, ANN, and RF) in the Tarom-Khalkhal sub-basin (Iran) and obtained superior performance with RF. Sahin [16] developed a framework for LSM in the Babadag district (Turkey) and obtained excellent performance with both XGBoost and RF. Zhang et al. [45] generated LSM in Fengjie County (China) and obtained similar performance for both algorithms, although with a slight superiority for RF.

This research is the continuation of a previously conducted study [46], whose objective is to analyze different ML techniques and determine the one with the best results in terms of susceptibility in the study area (a sector around the urban area of Cuenca, Ecuador), which has no previous landslide susceptibility studies based on the proposed methods. In this study, a methodology was applied to identify the most important conditioning factors from a set of 15, based on feature selection algorithms (CART, RFE, RF, and Boruta), which have rarely been applied so far to determine relevant factors in LSM generation, together with the traditionally applied multicollinearity methods (Variance Inflation Factor, VIF, and correlation analysis). Feature selection identified the six most relevant factors in the area, with which LSMs were elaborated using ML methods (RF and XGBoost) and compared with the LSMs obtained using all available factors. To provide continuity with the previous research, LSMs were also generated by applying an ANN (rprop-). Finally, a statistical analysis was performed using a Wilcoxon test to determine which of the maps obtained presented the highest significance in terms of susceptibility levels. The contribution of this research is to determine whether feature selection algorithms improve LSM performance by applying only certain factors, compared to the results obtained by applying all available variables.

2. Materials and Methods

2.1. Study Area

The study area is located in Azuay province, in the south of Ecuador, and corresponds specifically to Cuenca canton (one of the most populated in Ecuador), including its entire urban area and some of the rural parishes that surround it. In the study area defined for this research, a landslide inventory was conducted in 2019, the details of which can be seen in the studies of Miele et al. [47] and Bravo-López et al. [46]. This inventory included a significant number of rotational landslides (75% of landslides recorded). The average altitude of the area is 2853 m above sea level (m.a.s.l.), and it is a mountainous area because it is located in the Andes Mountains. Climatic conditions are variable, with rainy periods (from December to May) and dry periods (from June to November) and an average annual rainfall of 940 mm. This information comes from 15 meteorological stations of ETAPA-EP (Empresa Pública de Teléfonos, Agua Potable y Alcantarillado) for the years 2017 to 2020. The geological characteristics of the study area are defined by detrital materials such as conglomerates, sands, clays, and silts corresponding to the Miocene-Pliocene, interspersed with andesitic-rhyolitic tuffs and volcanic material in general, corresponding to a volcano-sedimentary series [46,48]. Figure 1 illustrates some details of the study area.

2.2. Materials

2.2.1. Landslide Inventory

This research used a landslide inventory carried out in 2019 by the Institute for Sectional Regime Studies of Ecuador (IERSE) of the University of Azuay (Cuenca, Ecuador) together with the University Federico II (Naples, Italy), with the objective of determining the areas most prone to these phenomena and of recording their relevant characteristics. The inventory covered an area of 380 km², which was divided into 36 quadrants of 13.5 km² to facilitate the collection of information. A fundamental aspect of the inventory is that it was based on field work, with technical personnel collecting information in situ. This procedure was carried out with the Mobile Application for Regional Landslide Inventories (MARLI), the details of which can be found in the study of Sellers et al. [48]. In addition to field work, photo-interpretation was carried out with Planet satellite images (available online: https://www.planet.com/explorer/ accessed on 1 November 2019) and orthophotos of the study area. In areas of difficult access, differential interferometry (DInSAR) was applied with COSMO-SkyMed (available online: https://esatellus.service-now.com/csp?id=project_proposal&dataset=CosmoSkyMed accessed on 1 November 2019) and Copernicus Sentinel-1 images (available online: https://scihub.copernicus.eu/dhus/#/home accessed on 1 November 2019) to identify ground displacements and movements. In total, 710 landslides were recorded, most of which are rotational (529). The inventory is currently being updated in order to improve the quality of the information and thus optimize the models, which contributes to better management of the risk that landslides pose to the environment. Since the present study is the continuation of the one by Bravo-López et al. [46], the inventory is described in greater detail in that research, as well as in the study of Miele et al. [47], which details aspects related to the application of DInSAR.

2.2.2. Conditioning Factors

Conditioning factors selection is a fundamental step for an adequate LSM generation. In this research, the selection of factors was based mainly on the availability of information in the study area, and it is important to point out that there is no selection standard, since each study area has different characteristics [42]. In this sense, information on 15 landslide conditioning factors was available: topographic (10): elevation, slope, slope aspect, curvature, stream power index (SPI), sediment transport index (STI), topographic position index (TPI), terrain ruggedness index (TRI), topographic wetness index (TWI), and solar radiation index; land cover (3): land cover, distance to roads, and normalized difference vegetation index (NDVI); hydrological (1): distance to rivers; and geological (1): lithology. Based on previous research [46], topographic factors were derived from a three-meter spatial resolution digital elevation model (DEM) obtained from the National Information System on Rural Land and Technological Infrastructure (SIGTIERRAS) updated to 2010. A land cover map was also obtained from SIGTIERRAS, whose scale is 1:25,000 and is updated to 2018. The distance maps to rivers and roads (scale 1:25,000) were obtained from the Military Geographical Institute (IGM) and are updated to 2013. Finally, the lithological information was obtained from the National Information System (SNI), using a 1:100,000 scale geological map, updated to 2005. It is important to point out that all the information previously described has been taken from official sources of the government of Ecuador.

2.3. Methods

The methodology builds on stages developed in the previous research mentioned in the introduction, in which the landslide inventory was obtained; from that inventory, only landslides of rotational type were selected. The steps followed in this research were (1) obtaining information: landslide inventory and conditioning factors; (2) training and testing data extraction; (3) multicollinearity analysis (correlation and VIF); (4) conditioning factors selection (feature selection); (5) ML model implementation: RF and XGBoost; (6) results validation; (7) obtaining landslide susceptibility maps (LSM); and (8) statistical comparison between susceptibility levels (Figure 2).

2.3.1. Conditioning Factors Characteristics

As indicated in Section 2.2.2, 15 landslide conditioning factors classified thematically into topographic (10), land cover (3), hydrologic (1), and geologic (1) categories were used in this research. Table 1 shows some details of these factors.

Topographic factors are important in determining the stability of terrain, as they relate to the spatial distribution of important aspects such as soil moisture or flow direction [49]. In this context, elevation or altitude controls climatic aspects and precipitation levels, which directly influence the occurrence of landslides [50], with a higher probability of occurrence in areas with higher elevation [51]. In the study area, the altitude varies between 2583 and 3398 m.a.s.l. Slope is another important factor: when the slope increases, instability increases [52], which is directly related to the occurrence of landslides, making slope a fundamental factor for the generation of LSM [50]. In the study area, this factor ranges, in degrees, from values lower than 16 to values higher than 64. Aspect (orientation) indicates exposure to local climatic conditions [52] and some atmospheric processes such as rain, wind, sunlight, and soil moisture [50]. The map of this factor was scaled between 0 (for south orientation) and 1 (for north orientation). In the study area, there are few flat areas, where very few rotational landslides have been recorded (less than 1%). Curvature refers to changes in slope or aspect in a certain direction [53]. It is also related to permeability and surface runoff [50]. In this research, total curvature was considered, which varies between values less than −4.7 and greater than 4. Stream Power Index (SPI) indicates erosive processes caused by surface runoff. High values of this index are related to a high susceptibility to landslides [52]. In the study area, this factor varies between values lower than −2.8 and higher than 2.6. Sediment Transport Index (STI) indicates the amount of sediment transported by water currents [54]; in the study area it is fairly uniform, since most of the values are close to zero. Topographic Position Index (TPI) indicates the difference, within a given radius, between the elevation of a center point and the average elevation [55]; thus, like curvature, it is an expression of the concave or convex shape that generally coincides with the low or high topographic positions within the slopes. In the study area, most of the values of this factor fluctuate between 0.2 and 1.8. Terrain Ruggedness Index (TRI) indicates changes in the soil surface caused by erosion [20,54]. This factor presents low values in most of the study area. Topographic Wetness Index (TWI) indicates the spatial distribution of soil moisture [20], as well as the saturation conditions of the soil during rainfall [52]. In the study area, this factor varies between values lower than 3.9 and higher than 9.3. The solar radiation index is useful to represent the average annual intensity of solar radiation in a sector. Although its use is uncommon [56], it is important as an indicator of stability because it is related to the amount of vegetation on a slope, since the less solar radiation a slope receives, the more vegetation it will have [57].

Land (soil) cover refers to the materials that cover the Earth’s surface [58], both natural elements (water bodies, forests, rocks, or soils) and human-modified surfaces (roads, buildings, and agriculture). Land use refers to the socio-economic appropriation of land [58]. Soil covers relate to the stability of slope materials; each cover has characteristics that can influence landslide generation [59], as it affects soil mechanical behavior and moisture in different ways. Thus, vegetation can protect soil from erosion and improve slope stability through mechanical anchoring and soil suction by roots [60], and conversely, deforestation, road construction, slope cutting, or building construction on slopes often reduce slope stability. For all these reasons, it is one of the factors that is usually included among the conditioning factors or determinants of susceptibility [18]. Since this is a qualitative variable, it has to be transformed into a quantitative variable (with normalized values between 0 and 1) according to a criterion, which in this study was the greater or lesser protective vegetation cover offered to the land uses considered. The values adopted are shown in Table 2.

Normalized Difference Vegetation Index (NDVI) indicates the amount of vegetation cover and ranges from −1 to +1. Higher values reflect dense vegetation, which makes slopes more resistant and less susceptible to landslides [56]. This index was obtained from SIGTIERRAS orthophotos (derived from a single image) updated to 2010. Distance to roads is related to the road construction process, which has an impact on slope stability in mountainous areas [61]. The hydrological factor corresponds to distance to rivers: the smaller this distance, the greater the erosion at the base of slopes and the saturation of the slope material [54].

Lithology is related to geomechanical characteristics of the soil such as cohesion and angle of internal friction, playing an important role in landslide generation [62], and is therefore included in almost all susceptibility analyses. In this study, the units of the original geologic map [46,48] were classified considering the strength of the materials in the following classes: (i) metavolcanites, (ii) conglomerates and tuffaceous sandstones, (iii) slope and colluvial deposits, (iv) volcanic tuffs and agglomerates, (v) alluvial deposits, (vi) massive siltstones and shales, and (vii) clays and sands. These classes were transformed to quantitative values on a scale of 1 to 10 according to the values in Table 3 and were also normalized with values between 0 and 1 [63].

The information described above was obtained based on QGIS raster visualization methods, applying a single band pseudocolor rendering type, linear interpolation, and equal interval mode. Figure 3 shows the conditioning factors used in this research.

2.3.2. Extraction of Training and Test Datasets

For LSM generation, it is necessary to train and evaluate the models with training (70%) and test (30%) datasets [64,65]. The division of the dataset into 70% for training and 30% for testing is an empirical approach that helps to avoid overfitting, although it is not the only possible division [66]. These datasets are important because they teach the model the classes to predict, check its degree of fit (training), and verify its predictive ability (test) [67]. This research continued with the methodology applied in the previous study [46]: points labeled 0 were obtained randomly in landslide-free zones, and points labeled 1 were taken from the centroids of the polygons corresponding to landslides. Based on the experience of the previous publication [46], and in order to avoid sampling biases, a balanced number of samples was considered for landslide-free zones and for landslides [19]. In total, 1040 points were analyzed, of which 529 were classified as “landslide” (1) and 511 as “no landslide” (0). The numerical values of the conditioning factors were extracted at the location of these points in the zones with and without landslides. The cartographic unit used in this study was the grid cell; these units are important because each contains a set of specific terrain conditions [14], which is relevant in LSM generation.
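The 70/30 division described above can be illustrated with a minimal Python sketch. It is not the authors' workflow (which was carried out in R); the function name `split_train_test`, the seed, and the `(point_id, label)` tuples are hypothetical stand-ins for the sampled points and their extracted factor values.

```python
import random

def split_train_test(samples, train_fraction=0.7, seed=42):
    """Shuffle labelled samples and split them into training and test subsets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical stand-in for the sampled points: 529 landslide points (label 1)
# and 511 landslide-free points (label 0), as reported in the text.
points = [(i, 1) for i in range(529)] + [(529 + i, 0) for i in range(511)]
train, test = split_train_test(points)
print(len(train), len(test))
```

A plain random split like this one does not guarantee identical class proportions in both subsets; with a nearly balanced sample such as 529 versus 511 points, the difference is usually small.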

2.3.3. Multicollinearity Analysis

Multicollinearity refers to the linear relationship between two or more variables [68]. The utility of this analysis lies in ensuring the independence of the factors to be considered in LSM development [42], since multicollinearity affects the accuracy of a model [41]. In this research, multicollinearity was analyzed using two methods: (i) correlation analysis and (ii) variance inflation factor (VIF).

Based on previous research [46], Spearman’s correlation coefficient [52] was used to analyze the relationships between conditioning factors because it is not affected by the distribution of the data, and most of the variables used have a non-normal distribution (Figure 4). A value of 0.7 was taken as the threshold [69]; therefore, according to this analysis, all factors that did not exceed it could be considered for the implementation of the models.
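As an illustration of this screening step, the following Python sketch computes Spearman's coefficient as the Pearson correlation of average ranks and flags factor pairs above the 0.7 threshold. The factor values are made up, the function names are hypothetical, and comparing the absolute value of the coefficient with the threshold is an assumption, since the text does not state how negative correlations were handled.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def correlated_pairs(factors, threshold=0.7):
    """List factor pairs whose |rho| exceeds the threshold (assumed criterion)."""
    names = list(factors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = spearman(factors[a], factors[b])
            if abs(rho) > threshold:
                flagged.append((a, b, rho))
    return flagged

# Made-up values for three factors at six sample points (illustration only).
toy = {
    "slope": [5.0, 12.0, 20.0, 31.0, 44.0, 58.0],
    "tri": [0.1, 0.3, 0.2, 0.6, 0.9, 1.4],
    "ndvi": [0.4, -0.1, 0.7, 0.2, 0.5, 0.0],
}
flagged = correlated_pairs(toy)
print(flagged)
```

In this toy example only the slope and TRI pair is flagged, so one of the two would be a candidate for removal before modeling.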

VIF is a method widely used in geosciences to analyze multicollinearity [42]. Its name reflects the fact that it reports the extent to which the variance of the estimated coefficients is inflated by collinear independent variables. According to Craney and Surles [70], VIF reports how much of the variability of a regressor is explained by the rest of the regressors in the model due to the correlation between these regressors. For p − 1 independent variables, VIF is calculated with:

$$\mathrm{VIF}_{i} = \frac{1}{1 - r_{i}^{2}}, \quad i = 1, \dots, p - 1 \qquad (1)$$

where $r_{i}^{2}$ is the coefficient of determination obtained by fitting a regression model for the i-th independent variable on the other p − 2 independent variables [70]. If the VIF value is greater than 10, this factor should be discarded [42]. This analysis can also be combined with the calculation of tolerance; however, in this study only the VIF value was determined for each conditioning factor and was implemented with the rms R package [71].
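To make Equation (1) concrete, the sketch below computes the VIF of each factor in plain Python by regressing it on the remaining factors with ordinary least squares. It is an illustrative stand-in, not the rms implementation used in the study; the helper names and the toy values are assumptions. Note that a VIF of 10 corresponds to a coefficient of determination of 0.9.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on the predictors, with an intercept."""
    n = len(y)
    rows = [[1.0] + [col[k] for col in predictors] for k in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yk for r, yk in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)             # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))
    ss_tot = sum((yk - mean_y) ** 2 for yk in y)
    return 1.0 - ss_res / ss_tot

def vif(factors):
    """Return {name: 1 / (1 - r_i^2)} for every factor, as in Equation (1)."""
    result = {}
    for name, column in factors.items():
        others = [col for other, col in factors.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(column, others))
    return result

# Made-up values: "tri" is roughly half of "slope" plus small noise, so both
# should show a VIF well above the threshold of 10.
toy_factors = {
    "elevation": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "slope": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "tri": [1.1, 0.4, 2.0, 1.6, 2.9, 2.6, 4.1, 3.4],
}
vif_values = vif(toy_factors)
print({name: round(value, 1) for name, value in vif_values.items()})
```

The sketch fails with a division by zero when a factor is an exact linear combination of the others, which is the limiting case of an infinite VIF.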

2.3.4. Conditioning Factors Selection (Feature Selection)

The objective of this selection is to eliminate the variables that are less important and thus try to improve the prediction of landslide susceptibility in order to optimize it [5,17]. In this study, the feature selection tasks were performed on the model training data, and the main objective is to analyze the most important conditioning factors that should improve the performance and accuracy of the models, as well as to avoid overestimated predictions. Based on the criteria of Meena et al. [5], no standard threshold value has been defined to select or rule out a feature for LSM, so it is necessary to develop several tests. In the following, the basics of each algorithm applied in this research are described in a very concise manner.

Feature Selection with CART

CART (Classification and Regression Trees) is a method that creates a binary tree by Binary Recursive Partitioning (BRP) and is based on partitioning rules, which make it possible to select, from a large number of explanatory variables, the most important ones for determining a response variable [72]. BRP divides the data into subsets according to the available factors by creating nodes, starting from a root node that contains all the objects [72], which in turn is divided into child nodes defined by yes/no answers about predictor values [61]. For each candidate split point of a variable, the node is divided into two child nodes that separate the objects according to whether their values are lower or higher than that point. The variable and split point with the highest reduction in impurity are then selected, and the parent node is divided into two child nodes accordingly. The process is repeated until the tree reaches a maximum size, after which it is pruned to an optimal size [72]. In this study, this method was implemented with the rpart package [73] of R software.
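The split-selection step can be sketched in Python as follows. This is a simplified, single-level illustration that uses Gini impurity and ranks factors by the impurity reduction of their best root split; it is an assumption made for clarity and does not reproduce how rpart grows, prunes, or scores a full tree. All names and values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(values, labels):
    """Return (threshold, impurity_reduction) of the best binary split on one variable."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    unique = sorted(set(values))
    # candidate split points: midpoints between consecutive distinct values
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        reduction = parent - weighted
        if reduction > best[1]:
            best = (threshold, reduction)
    return best

def rank_factors(factors, labels):
    """Order factors by the impurity reduction of their best root split."""
    scores = {name: best_split(col, labels)[1] for name, col in factors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up sample: 0 = no landslide, 1 = landslide.
labels = [0, 0, 0, 1, 1, 1]
factors = {
    "slope": [5.0, 10.0, 15.0, 30.0, 35.0, 40.0],   # separates the classes perfectly
    "ndvi": [0.2, 0.8, 0.5, 0.3, 0.9, 0.6],         # classes are interleaved
}
ranking = rank_factors(factors, labels)
print(ranking)
```

Here slope is ranked first because a single threshold removes all impurity, whereas no threshold on the interleaved NDVI values can do so.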

Feature Selection with Random Forest

This algorithm builds several simple classifiers with randomly chosen features, thus achieving a good exploration of potential feature subsets [25]. One of the main components of RF is bagging, whose main objective is to reduce the variance of a prediction function [74]; as a result, not all training data are included in the construction of each base hypothesis, which makes it possible to estimate the degree of confidence in feature importance. Out-of-bag data evaluate subsets of features independently, without requiring a separate test set. When constructing an RF, at each node features are drawn at random as split candidates, and the node is split so as to maximize the information gain, which is used as a measure of correlation between feature and class. The information gain values measure not only the individual relevance of each feature but also its contribution across a variety of possible feature subsets. This allows the algorithm to explore the importance of each feature within the dataset [25]. In this study, the method described above was implemented with the caret package [75] of R software.
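The role of bagging and out-of-bag data can be illustrated with a short Python sketch of bootstrap sampling. It only shows how each tree's in-bag and out-of-bag sets arise; it does not train a forest or compute importances. The training-set size of 728 (70% of the 1040 points) and the number of trees are assumptions for illustration.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw a bootstrap sample with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)
n_samples = 728            # hypothetical training-set size
oob_fractions = []
for _ in range(100):       # one bootstrap sample per tree of a 100-tree forest
    in_bag, out_of_bag = bootstrap_indices(n_samples, rng)
    oob_fractions.append(len(out_of_bag) / n_samples)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(round(mean_oob, 3))
```

On average a fraction of about (1 − 1/n)^n, close to 0.37, of the training points is left out of each bootstrap sample, which is why every tree has unseen data available for evaluating features without a separate test set.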

Feature Selection with Boruta

This method is based on Random Forests and is a useful algorithm for feature selection and classification problems [27]. By using RF, it adds randomization to the selection of samples and variables, which makes it possible to identify the variables that are of real importance [28]. This randomness provides a clearer view of the properties that are actually relevant [76]. In summary, the method first adds randomness to the dataset by creating shuffled copies of each feature, known as shadow features. An RF-based classifier is then trained, and the Z score of each attribute is compared with the maximum Z score among shadow attributes (MZSA). Features whose Z score is lower than the MZSA are rejected, since they are considered irrelevant. Variables that are close to the maximum Z score value are confirmed or rejected based on the mean Z score value [77]. In this study, the described method was implemented with the Boruta package [28] of R software.

Feature Selection with Recursive Feature Elimination (RFE)

This method consists of a wrapper-type feature selection algorithm that performs filter-based selection using SVM [27]. This algorithm has no effect on correlation methods and, in simple words, produces a classification of subsets of features [78]. The main configuration options are the number of features to choose and the type of algorithm that will perform feature selection. In summary, the operation of this method is based on the search for features in the training dataset, eliminating those that are less robust [79]. By means of recursion processes, the analyzed dataset is reduced to select suitable features [7] by retaining the dominant ones and eliminating the weak ones [20], using iterative procedures that perform such backward elimination [78]. In this study, RFE was implemented with caret package [75] of R software.

2.3.5. Machine Learning Models Implementation

In the earlier study conducted as part of this research project, LSMs were generated by applying an artificial neural network with the neuralnet R package and analyzing several variants of the backpropagation algorithm (for more details see [46]). In summary, the algorithm with the best results, mainly in terms of the ROC-AUC metric, was rprop- (0.761). This method was therefore used as a baseline for comparison with the algorithms defined for this study (RF and XGBoost), which were run both with the conditioning factors selected as most relevant according to the feature selection results and with the 15 available factors.

Random Forests

This method consists of a combination of deep decision trees, so that each tree depends on values taken from a vector sampled independently and randomly and with the same distribution for all trees in the forest [80], obtaining a predicted value for each decision tree [5]. Some relevant RF characteristics are its predictive accuracy, low tendency to overfit, relatively low computational cost, and its ability to work with high dimensional data [17]. Its predictive efficiency and lack of overfitting problems are due to the law of large numbers [80]. In addition, good levels of randomization allow excellent results in classification [5] and in regression, although somewhat less so in the latter. Basically, the Random Forest method requires the setting of two essential parameters which are the number of trees (ntree) (a large number according to Breiman [80]) and the number of variables to be randomly selected from the set of available features (mtry) [17,38]; however, another common hyperparameter is nodesize which refers to the minimum number of observations at a terminal node [81]. Due to its effectiveness, it has been widely used for geological disaster susceptibility analysis [7], and in this study it was implemented using caret package of R software [75].

Extreme Gradient Boosting (XGBoost)

This method is an optimized gradient boosting algorithm designed for high performance. It uses boosting techniques to improve accuracy and reduce overfitting [5]. Due to its good performance and high effectiveness, it is used to implement regression and classification models [38]. Its main goal is to combine weak learners to improve accuracy [5,40] and thereby create a stronger learner through an additive training strategy [39]. The prediction model is produced as a boosting ensemble of weak classification trees by gradient descent, which optimizes the loss function [38]. The model requires a careful selection of parameters, which can be repeated several times until the best ones are obtained, thus optimizing its performance [39]. The main parameters can be classified into general, booster, and task parameters [82]. In this study, we used the caret R package [75] to implement this model, and we mainly configured the following parameters: nrounds, which sets the maximum number of boosting iterations, i.e., the number of trees to grow; max_depth, which controls the depth of each tree (a vector of possible values can be specified); colsample_bytree, which controls the fraction of features supplied to each tree; and subsample, which sets the fraction of samples used per tree.

As noted, the configuration of hyperparameters for each model also deserves mention. Hyperparameters are parameters that are set before the algorithm is run. To obtain more accurate models, these parameters need to be adjusted, which can be done by trial and error [44]. Table 4 shows the configuration of the main hyperparameters considered for the RF model, and Table 5 shows those corresponding to the XGBoost model. The role of each hyperparameter was explained above in the description of each algorithm. Both models also support automated hyperparameter tuning, which searches for a good combination of values; however, since it is a lengthy process, this operation was not carried out in this research. The ML models were implemented in RStudio Server and run on a high-performance computer (HPC) of the Corporación Ecuatoriana para el Desarrollo de la Investigación y la Academia (CEDIA).

2.3.6. Results Validation

Validating the results of a model is essential for LSMs to have scientific validity. In this study, emphasis was placed on validating the predictive capacity of the models by means of the area under the receiver operating characteristic (ROC) curve (AUC), which has become a standard method for evaluating the accuracy of predictive models [83]. This method quantitatively summarizes the ROC curve and thus describes the ability of a model to perform prediction tasks, allowing specific verification of the model’s accuracy and efficiency in predicting susceptibility [53]. The AUC value generally fluctuates between 0.5 and 1. A high value (close to 1) indicates higher predictive accuracy and reliability, while a value close to or below 0.5 indicates that the prediction is no better than a random guess [53]. Another validation metric implemented in the present study was the F-Score, a metric based on the confusion matrix. It is the harmonic mean of precision and recall and therefore provides a balanced view of how well landslide and non-landslide data are classified. The F-Score value varies between 0 and 1, and values close to 1 indicate better performance [84]. More theoretical details of the confusion matrix are described in previous research [46]. ROC, AUC, and F-Score were computed on the test data (30%) with the ROCR [85], pROC [86], and caret [75] R packages.
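As a minimal illustration of the two metrics, the Python sketch below computes the AUC as the proportion of (landslide, non-landslide) pairs that the model ranks correctly, which is equivalent to the area under the ROC curve, and the F-Score as the harmonic mean of precision and recall. The scores, labels, 0.5 cut-off, and function names are assumptions made for the example; the study itself relied on the ROCR, pROC, and caret R packages.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def f_score(predicted, labels):
    """F-Score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative test data only: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
# Assumed cut-off of 0.5 to turn susceptibility scores into classes.
predicted = [1 if s >= 0.5 else 0 for s in scores]

auc_value = auc(scores, labels)          # 8 of 9 pairs ranked correctly
f_value = f_score(predicted, labels)     # precision = recall = 2/3
```

The pairwise formulation is quadratic in the number of samples, so it is suitable only for small examples; dedicated packages compute the same quantity from sorted scores.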

2.3.7. Obtaining Landslide Susceptibility Maps (LSM)

Once the algorithms have been run, the prediction values that are the basis of the LSMs are obtained. In this sense, it is necessary to define the susceptibility levels based on the prediction values generated by the models. The quantile method [5,40] was used for this task, as it is a classification method that separates values into groups with the same number of elements in each category, thus providing a better distribution of values in each class as opposed to other approaches such as natural breaks which can lead to limited or excess data in some classes [5]. Five susceptibility levels were defined based on the method described above: very low, low, medium, high, and very high. The LSM and spatial analysis procedures carried out in this research were developed entirely with free software (QGIS, SAGA, and GRASS).
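The quantile classification can be sketched in a few lines of Python. The example below derives four class breaks from a list of prediction values so that each of the five susceptibility levels receives roughly the same number of values. The prediction values and function names are hypothetical, and GIS software may use a slightly different rule for placing the breaks.

```python
LEVELS = ["very low", "low", "medium", "high", "very high"]


def quantile_breaks(values, n_classes=5):
    """Upper class limits chosen so each class holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The k-th break is the largest value among the first k/n_classes of the data.
    return [ordered[max((k * n) // n_classes, 1) - 1] for k in range(1, n_classes)]


def classify(value, breaks, names=LEVELS):
    """Assign a value to the first class whose upper limit it does not exceed."""
    for upper, name in zip(breaks, names):
        if value <= upper:
            return name
    return names[-1]


# Illustrative model outputs only (one value per pixel in a real map).
predictions = [0.05, 0.10, 0.22, 0.31, 0.38, 0.47, 0.55, 0.68, 0.81, 0.93]
breaks = quantile_breaks(predictions)
classes = [classify(p, breaks) for p in predictions]
print(breaks)  # [0.1, 0.31, 0.47, 0.68]
```

With these ten values, each susceptibility level receives exactly two values, which is the equal-count property that motivates the choice of the quantile method.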

2.3.8. Statistical Comparison between LSM

A non-parametric test, the Wilcoxon signed-rank test, was used to assess the statistical significance of systematic pairwise differences between landslide susceptibility models at the significance level α = 0.05 [87,88,89]. The null hypothesis of this test is that there is no statistically significant difference in performance between a selected pair of models. The null hypothesis is retained when the p-value is at least 0.05 and the z-value lies in the range −1.96 ≤ z ≤ +1.96; otherwise, it is rejected.

In other words, if the p-value is lower than the significance threshold (p-value < 0.05) and the z-value falls outside its critical values (z < −1.96 or z > +1.96), then the null hypothesis is rejected. Therefore, a significant difference between the two compared models exists [11,87,88,89,90].
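The decision rule can be made concrete with a minimal Python sketch of the Wilcoxon signed-rank test. It assumes the normal approximation without tie or continuity corrections, and the paired values for the two models are invented for the example; the function name is hypothetical and does not correspond to the software used in the study.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation); returns (z, p)."""
    # Zero differences carry no information about direction and are dropped.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p


# Illustrative paired values for two models (e.g., scores at the same locations).
model_a = [71, 67, 83, 79, 65, 91, 97, 63, 81, 78]
model_b = [70, 65, 80, 75, 60, 85, 90, 55, 72, 68]
z, p = wilcoxon_signed_rank(model_a, model_b)
reject_null = p < 0.05 and abs(z) > 1.96
```

In this example model A is higher at every location, so the z-value falls outside ±1.96 and the null hypothesis of no difference is rejected.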

3. Results

3.1. Multicollinearity Analysis

3.1.1. Correlation Analysis

Spearman’s coefficient was applied in this analysis because the data for most of the factors are not normally distributed (Table 6), as demonstrated in the previous research [46]. SPI and STI presented the highest correlation value (0.95), and slope and solar radiation were also highly correlated, with a value of −0.84. STI and slope had a correlation value of 0.71, which also exceeded the defined threshold (0.7). Based on this analysis, the solar radiation, SPI, and STI factors were discarded; it should be noted that, in such cases, at least one of the highly correlated factors should be discarded [7].
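The screening step can be sketched in Python as follows: Spearman’s coefficient is computed as the Pearson correlation of the ranks, and every pair of factors whose absolute coefficient exceeds the 0.7 threshold is flagged. The factor values below are invented for illustration, and the helper names are hypothetical.

```python
from itertools import combinations

THRESHOLD = 0.7  # threshold stated in the text


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative samples only (not the study's data).
factors = {
    "slope": [5, 12, 18, 25, 33, 40],
    "solar_radiation": [980, 900, 910, 800, 760, 700],
    "elevation": [2600, 2450, 2900, 2500, 2750, 2550],
}
flagged = [
    (a, b)
    for a, b in combinations(factors, 2)
    if abs(spearman(factors[a], factors[b])) > THRESHOLD
]
print(flagged)  # [('slope', 'solar_radiation')]
```

For each flagged pair, at least one factor would then be dropped, as was done with solar radiation, SPI, and STI in this study.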

3.1.2. Variance Inflation Factor (VIF)

To corroborate the VIF values, this analysis was performed on both the raw rasters and the training dataset, which yielded similar values (Table 7). In both approaches, the highest VIF values belong to the slope (VIF > 16) and solar radiation (VIF > 12) factors, which in both cases exceed the defined maximum acceptable value (10). Slope was nevertheless retained, because it is one of the most effective [10,11,62], important [52,91], and influential [92] factors for landslide susceptibility analysis. This importance is reflected in the results obtained by the feature selection algorithms described in Section 3.2, which show slope as an important factor. Solar radiation had already been discarded because of its high correlation.
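For readers unfamiliar with the VIF, the following Python sketch computes it for a small set of factors. It relies on the identity that the VIF of factor j, defined as 1 / (1 − R_j²), equals the j-th diagonal element of the inverse of the Pearson correlation matrix. The data and function names are assumptions for illustration, and the study may have computed the VIF differently.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pearson correlation matrix with an exact unit diagonal."""
    m = len(columns)
    return [
        [1.0 if i == j else pearson(columns[i], columns[j]) for j in range(m)]
        for i in range(m)
    ]


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Illustrative samples only (not the study's data).
names = ["slope", "solar_radiation", "elevation"]
data = [
    [5, 12, 18, 25, 33, 40],
    [980, 900, 910, 800, 760, 700],
    [2600, 2450, 2900, 2500, 2750, 2550],
]
vif_by_factor = dict(zip(names, vif(data)))
too_collinear = [name for name, value in vif_by_factor.items() if value > 10]
```

A perfectly collinear set of factors makes the correlation matrix singular, in which case the inversion fails with a division by zero; the sketch does not handle that case.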

3.2. Feature Selection

The results obtained with the application of feature selection algorithms reflected different values (Table 8). This is because the performance of each algorithm is different [7]; therefore, for each algorithm, the factors with the highest value were determined to be the most important.

From the values shown in Table 8, the factors with the highest scores were identified. Based on the characteristics of each method, the CART algorithm determined distance to roads, elevation, NDVI, lithology, solar radiation, and slope as the most important factors. Random Forest determined distance to roads, elevation, TRI, NDVI, solar radiation, slope, and lithology as most important. Boruta determined distance to roads, elevation, TRI, NDVI, solar radiation, slope, and lithology as most relevant, while RFE determined distance to roads, elevation, TRI, NDVI, solar radiation, and slope as most important. The least relevant factors were also identified: all methods assigned a low value (or none at all, in the case of Boruta) to land cover and distance to rivers. Based on these findings, the six most important factors were selected: distance to roads, elevation, NDVI, TRI, slope, and lithology. These factors were used to implement the ML models and obtain the LSMs, which were then compared with the LSMs obtained with all the conditioning factors. Although solar radiation was important in all the feature selection algorithms implemented, this factor was discarded due to its high correlation value and VIF, as shown in Section 3.1.1 and Section 3.1.2, respectively, and lithology was considered instead, since it showed a certain level of importance according to the scores obtained in the feature selection methods. As noted above, lithology is related to the geomechanical characteristics of the soil, which is why it is included in almost all susceptibility analyses [18]. Therefore, although it does not present the highest importance values, it was finally included among the selected factors, also because of its relative independence from the other factors.

3.3. Machine Learning Models and Validation

The values obtained for the area under the ROC curve (AUC) with the test data (Table 9, Figure 5) show the acceptable predictive level obtained with ANN considering six factors (AUC = 0.732, F-Score = 0.664). The Random Forests model shows a better predictive ability than ANN, as its AUC value is approximately 0.79, with a minimal advantage for the 6-factor model (0.795) over the 15-factor model (0.793). This similarity is also reflected in the F-Score value, although with a slightly larger difference (RF 6 factors = 0.729; RF 15 factors = 0.707). XGBoost models show good predictive ability, as their AUC values exceed 0.8. Specifically, the best predictive performance was obtained with XGBoost considering 15 factors (AUC = 0.871, F-Score = 0.781), while the performance obtained with six factors was somewhat lower (AUC = 0.845, F-Score = 0.758).

3.4. Landslide Susceptibility Analysis

Based on the total number of pixels in the study area, pixels corresponding mainly to water bodies, which were not assigned to any susceptibility level, were discarded. With this consideration, the values corresponding to each susceptibility level were determined, obtaining notable differences in the maps generated (Figure 6). The classification method used was the quantile method (see Section 2.3.7), which has generated a similar number of pixels for each susceptibility level with RF and XGBoost algorithms (around 20% for each level), with small differences in ANN, which accumulates a higher number of pixels in the high class (~25%) and decreases its amount in the middle class (~14%) (Table 10). Figure 7 and Table 10 show the percentage of rotational landslides that were assigned to each susceptibility level based on the implemented algorithms. The highest number of landslides accumulated in the high and very high susceptibility classes, 62% for ANN with six factors, 66% and 71% in RF (6 and 15 factors, respectively), and 73% and 78% in XGBoost (6 and 15 factors). Similarly, the lowest percentage of landslides occurred in the very low and low susceptibility classes: 32% in ANN, 17% and 14% in RF, and 12% and 8% in XGBoost.

3.5. Statistical Comparison between LSM

Table 11 presents the systematic pairwise comparison of the five landslide susceptibility models, conducted with the Wilcoxon signed-rank test at the 5% significance level. The results show a systematic difference in performance between each pair of models except for the RF (6 factors) and XGBoost (6 factors), RF (6 factors) and RF (15 factors), ANN (6 factors) and XGBoost (6 factors), and ANN (6 factors) and XGBoost (6 factors) pairs, for which the difference in performance was not statistically significant.

4. Discussion

The results obtained show that prediction levels vary with the number of factors considered. Of all the implemented methods, the best predictive capacity was obtained by the XGBoost algorithm with 15 conditioning factors (AUC = 0.871, F-Score = 0.781), followed by the same method with six factors (AUC = 0.845, F-Score = 0.758). RF performance is very similar with 6 factors (AUC = 0.795) and with 15 factors (AUC = 0.793); however, the difference in F-Score values was slightly larger (see Table 9). ANN performance with 6 factors is acceptable in terms of its AUC value (0.732) and F-Score (0.664), but it is lower than in the previous research [46], which considered 10 factors and obtained an AUC value of 0.761 and an F-Score of 0.853. The implementation of ANN with 15 factors did not generate good results (AUC = 0.633, F-Score = 0.436), as the LSM obtained did not adequately distinguish the different susceptibility levels, and for this reason it was discarded in this research. In general, it was observed that using a smaller number of factors (6), i.e., applying feature selection, allows a similar (RF) or even greater (ANN) predictive capacity to be achieved than with all factors (15), which supports its usefulness. Overall, the XGBoost model with 15 factors is the most suitable for the data used in this study. This is consistent with the results of the statistical analyses in this study, such as the AUC and F-Score metrics obtained.

According to the above, the results obtained in this research should be analyzed from two perspectives: (i) feature selection and (ii) ML models. In the first case, the number of important factors selected (6 out of 15) differs from the number obtained by Vasu and Lee [19] who, out of 23 conditioning factors, obtained 13 factors as the most important. Meena et al. [5] assessed the importance of 14 conditioning factors, eliminating the five least important ones using the trial-and-error method and found that the elimination did not affect the overall accuracy of LSM as in the present study. It is important to consider that due to the variability of the results generated by the factor selection algorithms, some relevant factors may be discarded if only the results of a single method are taken as a basis [17]; for that reason, it is necessary to try with different algorithms and thus verify the coincidences and differences of the results, in order to properly choose the factors to be used for the generation of LSM, which was done in this study. If the performance obtained by discarding certain factors generates good results, it implies that considering only certain factors is sufficient for LSM [5]. In this sense, the choice of factors has the advantage of speeding up the LSM computation and generation process by using only the most important factors and not all available factors.

On the other hand, selecting a large number of factors can generate two problems: (i) producing a generalized model due to excessive dimensionality and (ii) over-fitting it, producing unreliable results [19]. It should be kept in mind that each study area has its own characteristics, so that a factor that is important in one area will not necessarily be important in another [5]. For these reasons, selecting conditioning factors to reduce dimensionality is necessary to identify those that are most relevant and that therefore allow the generation of LSM of acceptable quality based on the characteristics of each study area [19]. There is also a clear need to update the information in order to have more accurate models. Although information from official sources was used in this research, it has not received the necessary updates from the competent institutions. Furthermore, it is necessary to consider anthropic and climatic factors to implement more efficient models that contribute to an adequate mitigation of the risk generated by these events [5]. Further studies on factor importance analysis using feature selection are needed to improve the accuracy of the resulting LSMs and to reduce the time involved in obtaining them, both computationally and in data collection.

Regarding the results obtained in the development of LSM, they are notably similar to those of comparable studies in some cases and different in others, especially when implementing RF. Sahin [38] evaluated the predictive ability of ensemble tree methods for LSM with 15 conditioning factors, obtaining an AUC value of 0.897 for XGBoost and 0.886 for RF. As in the present research, the best performing method was XGBoost, followed by RF. Sahin [39] compared four gradient-boosting algorithms (including XGBoost) with RF, obtaining a superior value for XGBoost (second place of the four methods analyzed) with an AUC value of 0.886, which is similar to that obtained in the present study (0.871 with 15 factors). Can et al. [40] evaluated the performance of XGBoost with 10 conditioning factors, obtaining a good performance of this method, reflected in its AUC value (0.96). Kavzoglu and Teke [6] compared the performance of algorithms with 12 conditioning factors, considering a novel method (NGBoost) alongside more common ones (RF and XGBoost). The AUC value obtained with XGBoost was the same as in the present study when applying 15 factors (0.871), but for RF it was higher (0.863) than those obtained in the present research. Wei et al. [42] compared tree-based ensemble models (considering RF and XGBoost) with 12 conditioning factors. The results reflected the good performance of XGBoost and RF, as evidenced by their high AUC values (0.91 and 0.94, respectively). Badola et al. [43] carried out a landslide susceptibility analysis with seven conditioning factors using XGBoost, which performed satisfactorily (AUC = 0.91). Daviran et al. [44] considered the ANN and RF methods with 15 conditioning factors, obtaining AUC values of 0.82 (ANN) and 0.88 (RF). Finally, Zhang et al. [45] applied RF and XGBoost for LSM with 16 conditioning factors. Although the AUC values of the two algorithms were very similar, in this case RF obtained a higher performance (0.866) than XGBoost (0.864), which is close to that obtained with 15 factors in this research.

Among the limitations of this study, the scarcity of updated official information, mentioned above, must be considered. Because of this scarcity, information on fundamental parameters such as precipitation and fault lines was not available. These factors should be considered in future studies, together with an update of the landslide inventory to verify the variation between the different susceptibility levels obtained by the applied models. Analysis of different landslide typologies could also be implemented, based on the updated landslide inventory information. Moreover, although the ROC curve is a prevalent validation tool, it is only an indicator of the overall reliability of prediction maps and does not reveal the uncertainty of spatial prediction patterns [93]. In view of this, it is necessary to include validation parameters that make it possible to quantify this uncertainty. Furthermore, it is necessary to continue investigating other ML models in addition to those proposed in this study, with the aim of verifying the variation in the susceptibility levels of the study area. This opens a line of research that can provide scientifically valuable contributions for the treatment of landslides in this area, where there is an evident lack of research applying the methods proposed in this study.

The generation of good results with both RF and XGBoost was evidenced not only by the AUC and F-Score values but also by the high levels of coincidence of landslides in the high and very high susceptibility zones. In the case of RF, it was approximately 66% with 6 factors and 71% with 15 factors, while with XGBoost even better coincidence levels were obtained: 73% with 6 factors and 78% with 15 factors, which consolidates this method as the one that provides the best results in the study area of all those analyzed so far (ANN and RF). This means that in the future it will be necessary to investigate this method more exhaustively, carrying out various analyses with different conditioning factors to verify with which of them a better performance is obtained.

5. Conclusions

In this research, landslide susceptibility maps of an area surrounding the city of Cuenca (Ecuador) were prepared, analyzing the most important conditioning factors based on the results of multicollinearity analysis (correlation analysis and VIF), as well as the implementation of feature selection algorithms using Machine Learning (ML). There were 15 conditioning factors: 10 topographic (elevation, slope, slope aspect, curvature, SPI, STI, TPI, TRI, TWI, and solar radiation), three land cover (land cover, distance to roads, and NDVI), one hydrological (distance to rivers), and one geological (lithology), from which, based on the results of the analyses mentioned above, the six most important were chosen: distance to roads, elevation, NDVI, TRI, slope, and lithology. In addition, a landslide inventory (with 529 rotational landslides) updated to 2019 was available, which together with the conditioning factors allowed the generation of training (70%) and test (30%) datasets to verify the predictive capacity of the ML models applied for the generation of landslide susceptibility maps (LSM), using grid cells as mapping units.

LSMs were implemented with two ML methods (RF and XGBoost), which gave good results both with the six most important factors and with all available factors. The best result, based on test values, was obtained with XGBoost (15 factors), whose AUC value was 0.871 and F-Score value was 0.781, followed by XGBoost with 6 factors (AUC = 0.845, F-Score = 0.758), RF with 6 factors (AUC = 0.795, F-Score = 0.729), and RF with 15 factors (AUC = 0.793, F-Score = 0.707). As this study is a continuation of previous research, in which a Multilayer Perceptron ANN was implemented and the different approaches of the Backpropagation algorithm were analyzed, these results were compared with that method, considering the six most important factors and the most important algorithmic approach of the MLP ANN (rprop-). Its AUC value was 0.732 and its F-Score value was 0.664, which, although an acceptable result, was largely surpassed by the new algorithms implemented. In general, the results showed that when applying feature selection, and according to the characteristics of each algorithm, it is possible to achieve different predictive capabilities. With RF, this capacity is very similar with both 6 and 15 factors; with XGBoost the predictive capacity is slightly higher when applying 15 factors, and with ANN the predictive capacity is higher with 6 factors. This shows the usefulness of feature selection for obtaining LSM, although further studies are needed.

The above evidence shows an advance in the application of ML for landslide susceptibility analysis in the study area, mainly due to the good results obtained with XGBoost; however, further research is needed to verify other ML models, as well as the importance of the factors that can be considered, dividing the study area into specific sectors based on susceptibility levels. Both RF and XGBoost present a good coincidence of landslides (66% or greater) in the zones of high and very high susceptibility, although this varies with the number of factors used. Overall, it can be concluded that the XGBoost model with 15 factors is the most suitable for the data used in this study. This is consistent with the results of the statistical analyses in this research, such as the AUC (0.871) and F-Score (0.781) obtained for this LSM. In addition, the scarcity of research of this kind in the study area opens up a research opportunity that should be explored to identify the best methodologies that, based on the availability and quality of the data, can contribute to adequate land use planning with preventive approaches, considering that the study area has suffered catastrophic events generated by mass movements.

Author Contributions

Conceptualization, E.B.-L. and T.F.D.C.; data selection and preparation, E.B.-L. and C.S.; software implementation, E.B.-L.; methodology, E.B.-L., T.F.D.C. and C.S.; validation, E.B.-L. and T.F.D.C.; investigation, E.B.-L., T.F.D.C. and C.S.; writing—original draft preparation, E.B.-L.; writing—review and editing, E.B.-L., T.F.D.C., C.S. and J.D.-G.; supervision, T.F.D.C. and J.D.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Azuay (project No. 2022-0171), by the project “Captura de Información Geográfica mediante sensores móviles redundantes de bajo coste. Aplicación a la gestión inteligente del territorio” (FEDER-UJA project No. 1265116), and by SPS-LIDAR (National Research Agency of Spain; ref. RTI2018-099638-B-I00).

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to Fundación Carolina (Spain) for funding the doctoral stays at the University of Jaen (E.B.-L.), the Photogrammetric and Topometric Systems Research Group of University of Jaen, and IERSE and Vicerrectorado de Investigaciones of the University of Azuay for the support provided to develop this research. We also thank the reviewers for their suggestions for improving this research.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Map of the study area with landslide locations.

Figure 2. Methodology implemented in this research.

Figure 3. Conditioning factor maps of the study zone: (a) elevation, (b) slope, (c) aspect, (d) curvature, (e) SPI, (f) TWI, (g) TPI, (h) TRI, (i) STI, (j) solar radiation, (k) land cover, (l) distance to roads, (m) NDVI, (n) distance to rivers, and (o) lithology (lithology classes according to the strength of the materials: (1) metavolcanites, (2) conglomerates and tuffaceous sandstones, (3) slope and colluvial deposits, (4) volcanic tuffs and agglomerates, (5) alluvial deposits, (6) massive siltstones and shales, and (7) clays and sands).

Figure 4. Distributions of conditioning factors in study zone.

Figure 5. ROC curves of the ML algorithms implemented to generate the LSM.

Figure 6. Landslide Susceptibility Maps (LSM) obtained with each algorithm and factors selection: (a) ANN (6 factors), (b) RF (6 factors), (c) RF (15 factors), (d) XGBoost (6 factors), (e) XGBoost (15 factors).

Figure 7. Distribution of susceptibility levels for each LSM.

Table 1. Conditioning factors utilized in this research.

Thematic Variable (Obtained from)	Conditioning Factor	Source	Scale/Resolution
Topographical (Digital Elevation Model (DEM))	Aspect, Curvature, Elevation, Slope, SPI, STI, TPI, TRI, TWI, Solar radiation	SIGTIERRAS-IERSE	3 m
Land (Soil) Cover (Land cover map)	Land cover	SIGTIERRAS	1:25,000
Land (Soil) Cover (Orthophoto)	NDVI	SIGTIERRAS	0.3 m
Land (Soil) Cover (Roads layer)	Distance to roads	IGM	1:25,000
Hydrological (Rivers layer)	Distance to rivers	IGM	1:25,000
Geo-lithological (Geological map)	Lithology	SNI	1:100,000

SNI: Sistema Nacional de Información (Quito, Ecuador); IGM: Instituto Geográfico Militar (Quito, Ecuador); IERSE: Instituto de Estudios de Régimen Seccional del Ecuador (Universidad del Azuay).

Table 2. Land use classification applied in this study.

Land Use	Class
Water	0
Infrastructure and urban development (including urbanizations, etc.)	2 to 3
Forests and forestry plantations	4
Shrubs	5
Herbaceous vegetation (scrub)	6
Grasslands	8
Crops	9
Moorlands and wastelands	10

Table 3. Lithology classification applied in this study.

Lithology	Class
Metavolcanites	2
Volcanic tuffs and agglomerates	4
Conglomerates and tuffaceous sandstones	5
Colluvial slope deposits	6
Alluvial slope deposits	7
Massive siltstones and shales	8
Clays and sands	10

Table 4. Hyperparameter settings of RF model.

Hyperparameter	Setting Value
method	rf
mtry	12
ntree	500
nodesize	14

Table 5. Hyperparameter settings of XGBoost model.

Hyperparameter	Setting Value
method	xgbTree
nrounds	100
max_depth	(2, 3, 5, 10)
colsample_bytree	1
subsample	1
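
The settings in Table 5 can be read as a small tuning grid. The following is a minimal Python sketch, assuming that the parenthesised `max_depth` values are the candidates explored and that every other hyperparameter is held fixed; `XGB_GRID` and `expand_grid` are hypothetical names used only for illustration and do not reproduce the R workflow used in the study.

```python
from itertools import product

# Values copied from Table 5; list-valued entries are treated as candidates (assumption).
XGB_GRID = {
    "nrounds": [100],
    "max_depth": [2, 3, 5, 10],
    "colsample_bytree": [1],
    "subsample": [1],
}


def expand_grid(grid):
    """Return one dict per combination of candidate hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


candidates = expand_grid(XGB_GRID)
for candidate in candidates:
    print(candidate)
```

Under these assumptions the grid has four candidates, one per tree depth, so any tuning reduces to choosing `max_depth`.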

Table 6. Spearman’s correlation values for all conditioning factors. The pairs whose absolute correlation exceeds the threshold value (0.7) are slope and solar radiation (−0.84), slope and STI (0.71), and SPI and STI (0.95).

	Aspect	Curvature	Elevation	Dist. Rivers	Dist. Roads	Land Cover	Lithology	NDVI	Slope	Solar Rad.	SPI	STI	TPI	TRI	TWI
Aspect	1	0	−0.03	−0.09	0.07	0.02	0.02	0.01	0.11	−0.01	0.06	0.08	−0.39	0.1	−0.09
Curvature	0	1	0.03	0.03	0.01	−0.01	−0.02	−0.06	0.02	0.03	−0.33	−0.27	0.54	0.02	−0.34
Elevation	−0.03	0.03	1	0.17	0.28	0.14	−0.37	0.41	0.25	0.16	0.15	0.19	0.07	0.13	−0.15
Dist. rivers	−0.09	0.03	0.17	1	−0.21	−0.02	−0.06	−0.19	−0.17	0.27	−0.09	−0.13	0.1	−0.24	0.1
Dist. roads	0.07	0.01	0.28	−0.21	1	0.17	−0.23	0.39	0.43	−0.29	0.25	0.33	−0.03	0.24	−0.22
Land cover	0.02	−0.01	0.14	−0.02	0.17	1	−0.07	0.05	−0.01	0.08	0.04	0.02	0	−0.03	0.02
Lithology	0.02	−0.02	−0.37	−0.06	−0.23	−0.07	1	−0.17	−0.22	0.06	−0.12	−0.16	−0.06	−0.13	0.12
NDVI	0.01	−0.06	0.41	−0.19	0.39	0.05	−0.17	1	0.36	−0.22	0.26	0.31	−0.05	0.25	−0.15
Slope	0.11	0.02	0.25	−0.17	0.43	−0.01	−0.22	0.36	1	−0.84	0.52	0.71	0.01	0.51	−0.56
Solar rad.	−0.01	0.03	0.16	0.27	−0.29	0.08	0.06	−0.22	−0.84	1	−0.44	−0.6	0.03	−0.44	0.43
SPI	0.06	−0.33	0.15	−0.09	0.25	0.04	−0.12	0.26	0.52	−0.44	1	0.95	−0.26	0.16	0.33
STI	0.08	−0.27	0.19	−0.13	0.33	0.02	−0.16	0.31	0.71	−0.6	0.95	1	−0.21	0.28	0.09
TPI	−0.39	0.54	0.07	0.1	−0.03	0	−0.06	−0.05	0.01	0.03	−0.26	−0.21	1	−0.01	−0.28
TRI	0.1	0.02	0.13	−0.24	0.24	−0.03	−0.13	0.25	0.51	−0.44	0.16	0.28	−0.01	1	−0.44
TWI	−0.09	−0.34	−0.15	0.1	−0.22	0.02	0.12	−0.15	−0.56	0.43	0.33	0.09	−0.28	−0.44	1
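
A correlation screen of the kind summarised in Table 6 can be sketched in a few lines. The Python example below is an illustration only: it computes Spearman's coefficient as the Pearson correlation of average ranks and flags pairs whose absolute value exceeds 0.7. The helper names and the `toy` samples are hypothetical and are not data from the study.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def flag_collinear(columns, threshold=0.7):
    """Return (name_a, name_b, rho) for every pair with |rho| above the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        rho = spearman(columns[a], columns[b])
        if abs(rho) > threshold:
            flagged.append((a, b, round(rho, 2)))
    return flagged


# Hypothetical toy samples, not values extracted from the study rasters.
toy = {
    "slope": [5, 12, 20, 31, 44, 52],
    "solar_rad": [90, 80, 75, 60, 42, 30],
    "aspect": [10, 300, 45, 200, 120, 270],
}
print(flag_collinear(toy))
```

Because the coefficient works on ranks, any monotonic relation between two factors is detected, including a nonlinear one.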

Table 7. VIF analysis results.

Conditioning Factor	VIF Value (Rasters)	VIF Value (Training Dataset)
Aspect	1.3883	1.4643
Curvature	2.0677	2.0101
Elevation	2.6585	2.3921
Distance to rivers	1.1912	1.1009
Distance to roads	1.4028	1.2795
Land cover	1.0723	1.0740
Lithology	1.1968	1.2218
NDVI	1.3955	1.2853
Slope (degree)	25.3757	16.9914
Solar radiation	14.3339	12.0909
SPI	4.6122	3.4571
STI	1.3294	1.7197
TPI	2.1803	2.0362
TRI	1.3092	1.1574
TWI	5.2104	3.1945
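
The VIF of a factor is conventionally defined as 1/(1 − R²), where R² comes from regressing that factor on all the others. The Python sketch below inverts this relation for the raster column of Table 7 and lists the factors above a cutoff. The cutoff of 10 is a common rule of thumb and is an assumption here, not a value taken from this section; the function names are hypothetical.

```python
# VIF values for the raster data, copied from Table 7.
VIF_RASTERS = {
    "Aspect": 1.3883, "Curvature": 2.0677, "Elevation": 2.6585,
    "Distance to rivers": 1.1912, "Distance to roads": 1.4028,
    "Land cover": 1.0723, "Lithology": 1.1968, "NDVI": 1.3955,
    "Slope (degree)": 25.3757, "Solar radiation": 14.3339,
    "SPI": 4.6122, "STI": 1.3294, "TPI": 2.1803, "TRI": 1.3092, "TWI": 5.2104,
}


def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the R^2 of the auxiliary regression."""
    return 1.0 - 1.0 / vif


def above_cutoff(vifs, cutoff=10.0):
    """Return the factor names whose VIF exceeds the cutoff (assumed rule of thumb)."""
    return sorted(name for name, value in vifs.items() if value > cutoff)


print(above_cutoff(VIF_RASTERS))
print(round(implied_r_squared(VIF_RASTERS["Slope (degree)"]), 3))
```

Read this way, a VIF of about 25 for slope means that roughly 96% of its variance is explained by the remaining factors.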

Table 8. Conditioning factor importance according to the feature selection algorithms. The highest-scoring conditioning factors and their scores are shown in bold.

Conditioning Factor	CART	FS RF	Boruta	RFE
Aspect	1.243	3.180	-	1.217
Curvature	5.356	5.927	4.678	5.587
Elevation	18.551	24.058	17.984	22.798
Distance to rivers	4.869	5.392	2.102	1.289
Distance to roads	33.637	41.921	31.796	48.887
Land cover	0.573	1.689	2.186	3.469
Lithology	9.413	9.379	6.608	4.773
NDVI	10.698	16.874	14.336	13.070
Slope (degree)	6.369	13.403	10.293	10.324
Solar radiation	7.367	13.433	11.422	10.916
SPI	2.265	7.168	3.381	3.492
STI	2.665	6.927	4.135	5.839
TPI	4.412	6.911	4.053	3.498
TRI	3.066	18.172	16.394	18.253
TWI	0.826	6.309	5.086	5.793
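
One simple way to compare the four rankings in Table 8 is to count how often each factor appears among the six highest scores of a method. The Python sketch below does this with the tabulated values. It is an illustrative tally under that assumption and not necessarily the selection procedure followed in the study, which also draws on the multicollinearity analysis; `top_k` and `vote_counts` are hypothetical helpers.

```python
from collections import Counter

METHODS = ("CART", "FS RF", "Boruta", "RFE")

# Scores copied from Table 8; None stands for the entry shown as "-".
SCORES = {
    "Aspect": (1.243, 3.180, None, 1.217),
    "Curvature": (5.356, 5.927, 4.678, 5.587),
    "Elevation": (18.551, 24.058, 17.984, 22.798),
    "Distance to rivers": (4.869, 5.392, 2.102, 1.289),
    "Distance to roads": (33.637, 41.921, 31.796, 48.887),
    "Land cover": (0.573, 1.689, 2.186, 3.469),
    "Lithology": (9.413, 9.379, 6.608, 4.773),
    "NDVI": (10.698, 16.874, 14.336, 13.070),
    "Slope (degree)": (6.369, 13.403, 10.293, 10.324),
    "Solar radiation": (7.367, 13.433, 11.422, 10.916),
    "SPI": (2.265, 7.168, 3.381, 3.492),
    "STI": (2.665, 6.927, 4.135, 5.839),
    "TPI": (4.412, 6.911, 4.053, 3.498),
    "TRI": (3.066, 18.172, 16.394, 18.253),
    "TWI": (0.826, 6.309, 5.086, 5.793),
}


def top_k(method_index, k=6):
    """Return the k factor names with the highest score for one method."""
    scored = [
        (values[method_index], name)
        for name, values in SCORES.items()
        if values[method_index] is not None
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


def vote_counts(k=6):
    """Count in how many methods each factor reaches the top k."""
    votes = Counter()
    for index in range(len(METHODS)):
        votes.update(top_k(index, k))
    return votes


for name, count in vote_counts().most_common():
    print(f"{name}: {count} of {len(METHODS)} methods")
```

With k = 6, distance to roads, elevation, NDVI, slope, and solar radiation reach the top six in all four methods, TRI in three, and lithology only in CART.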

Table 9. ROC AUC and F-Score values for test datasets.

Algorithm	AUC (Testing)	F-Score (Testing)
ANN (6 factors)	0.732	0.664
ANN (15 factors)	0.633	0.436
RF (6 factors)	0.795	0.729
RF (15 factors)	0.793	0.707
XGBoost (6 factors)	0.845	0.758
XGBoost (15 factors)	0.871	0.781
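
For readers who wish to reproduce metrics of the kind reported in Table 9, the following is a minimal Python sketch. The AUC is computed as the probability that a landslide sample receives a higher score than a non-landslide sample (ties count one half), and the F-score as 2TP/(2TP + FP + FN). The 0.5 cutoff used to binarise the scores is an assumption, since this section does not state the cutoff, and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def f_score(labels, scores, cutoff=0.5):
    """F1 score after binarising the scores at the given cutoff (assumed value)."""
    tp = sum(1 for label, s in zip(labels, scores) if label == 1 and s >= cutoff)
    fp = sum(1 for label, s in zip(labels, scores) if label == 0 and s >= cutoff)
    fn = sum(1 for label, s in zip(labels, scores) if label == 1 and s < cutoff)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical test samples: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3), round(f_score(labels, scores), 3))
```

The AUC does not depend on any cutoff, whereas the F-score does, so the two columns of the table need not rank the models identically.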

Table 10. Susceptibility classes per algorithm and percentage of landslides per class.

ANN (6 Factors)
Susceptibility	Pixel Amount	Pixels (%)	Landslides (%)
Very low	8,492,791	19.97	17.6
Low	8,563,546	20.14	14.4
Medium	6,319,583	14.86	5.7
High	10,678,501	25.11	32.1
Very high	8,469,395	19.92	30.2
RF (6 Factors)
Susceptibility	Pixel Amount	Pixels (%)	Landslides (%)
Very low	8,610,562	20.25	6.2
Low	8,357,722	19.65	10.8
Medium	8,718,899	20.50	17.0
High	8,335,882	19.60	23.4
Very high	8,500,751	19.99	42.5
RF (15 Factors)
Susceptibility	Pixel Amount	Pixels (%)	Landslides (%)
Very low	8,680,409	20.51	5.5
Low	8,144,119	19.24	8.1
Medium	8,577,562	20.26	15.1
High	8,118,914	19.18	20.4
Very high	8,808,285	20.81	50.9
XGBoost (6 Factors)
Susceptibility	Pixel Amount	Pixels (%)	Landslides (%)
Very low	8,380,324	19.71	2.5
Low	8,676,920	20.40	9.1
Medium	8,493,746	19.97	15.7
High	8,495,564	19.98	22.7
Very high	8,477,262	19.94	50.1
XGBoost (15 Factors)
Susceptibility	Pixel Amount	Pixels (%)	Landslides (%)
Very low	8,404,377	19.85	1.7
Low	8,323,806	19.66	6.8
Medium	8,374,361	19.78	13.6
High	8,463,257	19.99	23.4
Very high	8,763,488	20.70	54.4
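
The degree of coincidence of landslides with the high and very high susceptibility levels follows directly from the last column of Table 10. The short Python sketch below adds the two percentages for each map; the dictionary and function names are hypothetical.

```python
# Landslide percentages in the High and Very high classes, copied from Table 10.
LANDSLIDE_SHARE = {
    "ANN (6 factors)": {"High": 32.1, "Very high": 30.2},
    "RF (6 factors)": {"High": 23.4, "Very high": 42.5},
    "RF (15 factors)": {"High": 20.4, "Very high": 50.9},
    "XGBoost (6 factors)": {"High": 22.7, "Very high": 50.1},
    "XGBoost (15 factors)": {"High": 23.4, "Very high": 54.4},
}


def high_and_very_high(shares):
    """Sum the landslide percentages of the two highest susceptibility classes."""
    return round(shares["High"] + shares["Very high"], 1)


coincidence = {model: high_and_very_high(s) for model, s in LANDSLIDE_SHARE.items()}
for model, value in coincidence.items():
    print(f"{model}: {value}%")
```

The sums are 62.3% for ANN, 65.9% and 71.3% for RF with 6 and 15 factors, and 72.8% and 77.8% for XGBoost with 6 and 15 factors.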

Table 11. Pairwise comparison of LSM using Wilcoxon signed-rank test.

Models Comparison	p Value	Significance
XGBoost (15 factors) vs. XGBoost (6 factors)	0.0473	Yes
XGBoost (15 factors) vs. RF (15 factors)	0.0044	Yes
XGBoost (15 factors) vs. RF (6 factors)	0.0054	Yes
XGBoost (15 factors) vs. ANN (6 factors)	0.0099	Yes
XGBoost (6 factors) vs. RF (15 factors)	0.0248	Yes
XGBoost (6 factors) vs. RF (6 factors)	0.1706	No
XGBoost (6 factors) vs. ANN (6 factors)	0.0577	No
RF (15 factors) vs. RF (6 factors)	0.1681	No
RF (15 factors) vs. ANN (6 factors)	0.9677	No
RF (6 factors) vs. ANN (6 factors)	0.3492	No
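
A pairwise comparison of this kind can be sketched as follows. The Python example implements a two-sided Wilcoxon signed-rank test with the normal approximation, without continuity or tie-variance correction, for paired values produced by two models on the same samples. The 0.05 significance level is an assumption consistent with the labels in Table 11, and the function names and toy inputs are hypothetical; a statistical package would normally be preferred for real data.

```python
from math import erf, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-sided p-value of the Wilcoxon signed-rank test (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging the ranks of ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return min(1.0, 2 * (1 - normal_cdf))


def is_significant(p_value, alpha=0.05):
    """Label a difference as significant when p is below alpha (assumed 0.05)."""
    return p_value < alpha


# Hypothetical toy inputs: one model scores consistently higher than the other.
p_shifted = wilcoxon_signed_rank([i + 1 for i in range(20)], [0] * 20)
# Hypothetical toy inputs: differences are perfectly balanced in sign and size.
p_balanced = wilcoxon_signed_rank([1, 0, 2, 0, 3, 0], [0, 1, 0, 2, 0, 3])
print(is_significant(p_shifted), is_significant(p_balanced))
```

The normal approximation is rough for small samples, so the second toy case only illustrates the mechanics of the ranking.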

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
