Performance Evaluation of the GIS-Based Data-Mining Techniques Decision Tree, Random Forest, and Rotation Forest for Landslide Susceptibility Modeling

Park, Soyoung; Hamm, Se-Yeong; Kim, Jinsoo

doi:10.3390/su11205659


by Soyoung Park ¹, Se-Yeong Hamm ² and Jinsoo Kim ³,*

¹ BK21 Plus Project of the Graduate School of Earth Environmental Hazard System, Pukyong National University, Busan 48513, Korea

² Department of Geological Sciences, Pusan National University, Busan 46241, Korea

³ Department of Spatial Information Engineering, Pukyong National University, Busan 48513, Korea

* Author to whom correspondence should be addressed.

Sustainability 2019, 11(20), 5659; https://doi.org/10.3390/su11205659

Submission received: 29 August 2019 / Revised: 2 October 2019 / Accepted: 10 October 2019 / Published: 14 October 2019

(This article belongs to the Special Issue Sustainable Applications of Remote Sensing and Geospatial Information Systems to Earth Observations)


Abstract:

This study analyzed and compared landslide susceptibility models using decision tree (DT), random forest (RF), and rotation forest (RoF) algorithms at Woomyeon Mountain, South Korea. Out of a total of 145 landslide locations, 102 locations (70%) were used for model training, and the remaining 43 locations (30%) were used for validation. Fourteen landslide conditioning factors were identified, and the contribution of each factor was evaluated using the RRelief-F algorithm with a 10-fold cross-validation approach. Three factors (timber diameter, age, and density) made no contribution to landslide occurrence. Landslide susceptibility maps (LSMs) were produced using DT, RF, and RoF models with the 11 remaining landslide conditioning factors: altitude, slope, aspect, profile curvature, plan curvature, topographic position index, elevation-relief ratio, slope length and slope steepness, topographic wetness index, stream power index, and timber type. The performances of the LSMs were assessed and compared based on sensitivity, specificity, precision, accuracy, kappa index, and receiver operating characteristic curves. The results showed that the ensemble learning methods outperformed the single classifier (DT) and that the RoF model had the highest prediction capability compared to the DT and RF models. The results of this study may be helpful in managing areas vulnerable to landslides and establishing mitigation strategies.

Keywords:

decision tree; ensemble learning; landslide susceptibility; random forest; rotation forest

1. Introduction

Landslide susceptibility is the likelihood of a landslide occurring in a certain area given the local terrain attributes [1]. It is usually assumed that landslides will occur in the future if the conditions are the same as those that produced them in the past [2]. A landslide susceptibility map (LSM) portraying the spatial distribution of landslide susceptibility can be a useful tool for decision-makers in developing effective hazard mitigation and land-use planning. Therefore, the most important prerequisite is to produce LSMs containing reliable and accurate information [3].

In recent decades, a number of different techniques have been developed and applied to produce LSMs, including heuristic, deterministic (engineering approach), and probabilistic (non-deterministic or data-driven) methods [4,5], among which probabilistic methods are generally the most widely used [5,6,7,8]. Probabilistic methods, also known as statistical methods, are based on statistical correlations between historical records of landslide occurrences and a set of influencing parameters.

Among the probabilistic methods, machine learning methods have become popular in recent years. Machine learning is a branch of artificial intelligence that can effectively overcome the limitations of data-dependent bivariate and multivariate statistical methods [9]. Machine learning offers several advantages; for example, there is no need for prior elimination of outliers, data transformation, or statistical assumptions, and interactions between landslides and landslide conditioning factors are automatically identified. The biggest advantage is that the prediction accuracy typically exceeds that of conventional methods [10,11,12].

Due to their robustness, machine learning methods, such as artificial neural network [13,14], fuzzy logic [15,16], neuro-fuzzy [17], support vector machine [18,19], random forest (RF) [3,20], and naïve Bayes tree [21] methods, have been widely applied in landslide susceptibility analyses. Since the early 1990s, ensemble classifiers (learning models) have received substantial attention in machine learning due to their ability to improve prediction accuracy and deal with complex high-dimensional data [22]. Ensemble classifiers can be established through the combination of two or more classifiers or ensemble frameworks such as stacking, Bagging, AdaBoost, random subspace, MultiBoost, RF, and rotation forest (RoF) [23].

Among these, RoF is a relatively new ensemble technique introduced in 2006. Rodriguez et al. [24] compared model performance between ensemble methods such as Bagging, AdaBoost, RF, and RoF using 33 data sets from the UCI Machine Learning Repository. Their experimental results showed that RoF outperformed the other ensemble methods in accuracy and diversity. Since then, RoF has been applied in various fields, including disease diagnosis [25], customer churn prediction [26], bankruptcy prediction [27], pattern recognition [28], and land-use and land-cover classification [29,30]. In the above studies, the RoF model showed outstanding performance and strong generalization for successful classification. Nevertheless, few, if any, studies have applied RoF in landslide susceptibility mapping and no research has compared the results between the decision tree (DT), RF, and RoF methods.

Therefore, this study aimed to use tree-based classifiers, such as DT, RF, and RoF, to build landslide models and produce LSMs based on these landslide models. In addition, the performances of three LSMs were evaluated and compared based on receiver operating characteristic (ROC) curves and statistical indices. Throughout this process, the ultimate aim of this study was to assess the accuracy of the RoF model compared to other methods and determine whether the RoF model could contribute to creating accurate LSMs.

2. Study Area

The study area was Woomyeon Mountain, located in the Seocho district of Seoul, South Korea. It covers an area of ~6.68 km², between longitudes 126°59′02″ E and 127°01′41″ E, and latitudes 37°27′00″ N and 37°28′55″ N (Figure 1). The altitude ranges from 20 m to 310 m above sea level, and the slope angle near the top of the mountain is approximately 30–35°. The entire area surrounding Woomyeon Mountain is a zone of distributed temperate deciduous forests. The dominant species are oak trees, including Mongolian, Oriental, sawtooth, and other oaks.

Geologically, biotite-banded gneiss is distributed in the bedrock, and the terrain overlying the gneiss is often vulnerable to landslides due to severe weathering and the presence of faults (Figure A1). In addition, granite gneiss with relatively poor compositional differentiation is excavated en masse, and part of an embedded dike is present. The gneiss outcrop is poor due to severe weathering throughout, and its foliation structure is irregular due to multiple flexures [31,32].

The soil profile can be divided into three main layers: the colluvium layer, transition zone layer, and clay layer. The colluvium layer is generally composed of loose material, i.e., gravel and silty sand, which extends to a maximum depth of 3.0 m from the surface. The transition zone, located beneath the colluvium layer (i.e., between the colluvium and bedrock) and having a thickness of 0.2 m to 0.5 m, is mainly composed of a clay layer. It is anticipated that landslides could be generated in this layer. The clay layer overlies a subsoil of stiff weathered bedrock [31,33].

At the end of July 2011, torrential rains amounting to 40% of the annual precipitation in the area occurred in central South Korea over four days (26–29 July), with >50 mm of hourly precipitation recorded. In the study area, the 230–267 mm of rainfall that occurred ~15 h prior to 27 July, and the subsequent heavy rainfall (86–113 mm/h) that lasted for approximately 1 h, led to landslides and debris flows. This resulted in serious damage to human life and property, with 68 casualties (18 dead and 50 injured), 30 buried homes, and 116 inundated homes [31].

3. Methodology

This study was performed using the following main steps: (1) construction of the spatial database including landslides and landslide conditioning factors; (2) preparation of the training and validation datasets for landslide susceptibility modeling; (3) feature selection using the RRelief-F algorithm; (4) landslide susceptibility mapping using DT, RF, and RoF models; and (5) validation and performance comparison among LSMs. Production of a dataset with a 5-m spatial resolution and mapping of the results were implemented using ArcGIS ver. 10.2 (ESRI, Inc., Redlands, CA, USA). In addition, the RRelief-F algorithm and the DT, RF, and RoF models were implemented using R ver. 3.5.2 (R Foundation for Statistical Computing, Vienna, Austria).

3.1. Construction of the Spatial Database

3.1.1. Landslide Inventory Map

The creation of databases for areas prone to landslides is key to the accuracy of landslide susceptibility analyses. In this study, aerial photographs obtained from the National Geographic Information Institute were used to create a landslide inventory map. These orthorectified images have a spatial resolution of 0.5 m, and were produced by taking images every two years over the entirety of South Korea, dividing the country into two regions.

In this study, 12 orthorectified aerial photographs were obtained from before (early summer of 2011) and after (summer of 2012) the occurrence of the landslides. After the occurrence, locations were confirmed by visual comparison, and a landslide inventory map was created by comparing the images with field surveys and national reports. The map includes a total of 145 landslide occurrence locations.

3.1.2. Landslide Conditioning Factors

In general, although landslides are caused by interactions between various factors, there is no consensus on which factors affect the occurrence of landslides. Here, the landslide conditioning factors used in previous studies were examined, with 14 factors ultimately selected by considering the availability of data for the study area. These factors were constructed using thematic maps produced by various national institutions (Figure 2). The available pedology (scale 1:25,000) and geology (1:50,000) maps were not used because they were only small-scale thematic maps.

A digital elevation model (DEM) with a spatial resolution of 5 m was constructed using topographic maps (1:5000) collected from the Korean National Geographic Information Institute. Based on the DEM, seven factors: altitude, slope, aspect, profile curvature, plan curvature, topographic position index (TPI), and elevation-relief ratio (ERR), were produced. Because the study area contains mountainous terrain, these geomorphic factors are extremely important in considering landslide occurrence, since the geomorphology of a slope affects its instability, contributing to the occurrence of slope failures. Altitude, slope, and aspect are basic geomorphic factors, and have been used in previous landslide studies. Curvature enables the determination of the likelihood of deposition; here, the curvature in the maximum slope direction (profile curvature) and the curvature in the direction perpendicular to the maximum slope direction (plan curvature) were used. The TPI represents the erosion/accumulation capacity of a terrain by calculating the difference between the elevation value of a certain cell and the average elevation value of adjacent cells. The TPI is zero on plain slope surfaces, negative in valleys and canyon bottoms, and positive in ridgetops and hilltops [34]. The ERR, which was proposed by Pike and Wilson (1971) [35], is closely related to morphometric erosional evolution. It is calculated using Equation (1), and reflects the stage of geomorphic development, as well as lithological differences:

ERR = \frac{\bar{z} - z_{min}}{z_{max} - z_{min}}

(1)

where \bar{z}, z_{min}, and z_{max} represent the average, minimum, and maximum altitude, respectively.
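As a minimal sketch of how the ERR in Equation (1) and the TPI described above could be evaluated for one 3 × 3 neighbourhood of DEM cells, the following Python snippet may help. The function names and the elevation values are hypothetical and are not taken from the study, which derived these factors in ArcGIS.

```python
def elevation_relief_ratio(elevations):
    """ERR = (mean - min) / (max - min) over the cells of one window."""
    z_min, z_max = min(elevations), max(elevations)
    if z_max == z_min:
        # Perfectly flat window: the ratio is undefined.
        return float("nan")
    z_mean = sum(elevations) / len(elevations)
    return (z_mean - z_min) / (z_max - z_min)


def topographic_position_index(center, neighbours):
    """TPI = elevation of a cell minus the mean elevation of adjacent cells."""
    return center - sum(neighbours) / len(neighbours)


# Hypothetical 3 x 3 window of elevations (m), listed row by row.
window = [100.0, 102.0, 104.0, 110.0, 120.0, 101.0, 103.0, 105.0, 100.0]
err = elevation_relief_ratio(window)

center = window[4]
neighbours = window[:4] + window[5:]
tpi = topographic_position_index(center, neighbours)
```

In this made-up window the centre cell is higher than its neighbours, so the TPI is positive, which matches the ridgetop case described in the text.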

Hydrological conditions also affect the occurrence of landslides. Thus, three hydrogeological factors, slope length and steepness (LS), topographic wetness index (TWI), and stream power index (SPI), were calculated using the DEM. LS, one of the six factors of the revised universal soil loss equation (RUSLE) that describes soil erosion, represents the combined effects of slope length and steepness, and affects soil particle transport. Thus, an increase in this index indicates a higher possibility of landslide occurrence. The TWI is related to the saturation of soil moisture in local areas of topographic convergence. As saturation increases, soil and rock strengths decrease. The SPI measures the erosion power of a stream, with an increase indicating easier erosion and a higher probability of slope failure [3,36,37]. These factors are calculated as follows:

LS = \left(\frac{A_{s}}{22.13}\right)^{0.6} \left(\frac{\sin \beta}{0.0896}\right)^{1.3}

(2)

TWI = \ln \left(\frac{\alpha}{\tan \beta}\right)

(3)

SPI = A_{s} \times \tan \beta

(4)

where A_{s} is the specific catchment area, \beta is the local slope gradient, and \alpha is the local upslope area.
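Equations (2)–(4) can be evaluated cell by cell once the catchment area and slope are known. The sketch below is illustrative only: it assumes the slope is supplied in degrees, and the function names are hypothetical rather than part of any GIS toolbox used in the study.

```python
import math


def ls_factor(a_s, beta_deg):
    """Equation (2): combined slope length and steepness factor."""
    beta = math.radians(beta_deg)
    return (a_s / 22.13) ** 0.6 * (math.sin(beta) / 0.0896) ** 1.3


def twi(alpha, beta_deg):
    """Equation (3): topographic wetness index (undefined for a zero slope)."""
    return math.log(alpha / math.tan(math.radians(beta_deg)))


def spi(a_s, beta_deg):
    """Equation (4): stream power index."""
    return a_s * math.tan(math.radians(beta_deg))


# Hypothetical cell: catchment area of 100 and a 45-degree slope.
example = (ls_factor(100.0, 45.0), twi(100.0, 45.0), spi(100.0, 45.0))
```

Because tan β approaches zero on flat terrain, a practical implementation would need to clamp the slope to a small positive value before computing the TWI; that handling is left out here.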

In this study, timber type, diameter, age, and density were extracted from the 1:5000-scale forest-type map produced by the Korea Forest Research Institute and used as landslide conditioning factors. The timber type classifies forest areas into coniferous, deciduous, and mixed forests according to tree species, with 12 detailed species identified in the study area. Timber diameter was classified into four grades based on the trunk diameter measured at a height of ~1.2 m above the ground. Timber age classifies trees into nine grades in 10-year increments. Timber density represents the proportion of a given forest area occupied by tree crowns, and thus reflects tree growth.

3.2. Preparation of Training and Validation Datasets

Training and validation datasets were created using landslide areas, non-landslide areas, and landslide conditioning factors corresponding to these areas to create an LSM using machine learning.

The areas in the landslide inventory map created through the previous process were randomly divided into two groups comprising 70% (102 areas) and 30% (43 areas) of the total. To analyze landslide susceptibility using machine learning, certain information regarding non-landslide areas is also required, which was extracted using combined systematic and random sampling. In particular, non-landslide areas were extracted at 20-px (100-m) intervals, and extracted areas were randomly selected to have the same numbers as the landslide areas.
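The sampling scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the raster size (800 × 700 cells), the landslide cell coordinates, the random seeds, and both function names are hypothetical; only the 145 locations, the 70:30 split, and the 20-px spacing come from the text.

```python
import random


def split_landslides(n_total=145, train_pct=70, seed=0):
    """Randomly split landslide location ids into training and validation sets."""
    rng = random.Random(seed)
    ids = list(range(n_total))
    rng.shuffle(ids)
    # Integer arithmetic with rounding half up: 145 * 70% -> 102.
    n_train = (n_total * train_pct + 50) // 100
    return ids[:n_train], ids[n_train:]


def sample_non_landslides(n_rows, n_cols, landslide_cells, n_needed, step=20, seed=0):
    """Systematic grid at `step`-pixel spacing, then a random draw from that grid."""
    rng = random.Random(seed)
    grid = [
        (r, c)
        for r in range(0, n_rows, step)
        for c in range(0, n_cols, step)
        if (r, c) not in landslide_cells
    ]
    return rng.sample(grid, n_needed)


train_ids, val_ids = split_landslides()
landslide_cells = {(0, 0), (20, 40)}  # hypothetical landslide pixels
absent = sample_non_landslides(800, 700, landslide_cells, n_needed=145)
```

Drawing the non-landslide points from a regular grid keeps them spatially spread out, while the random draw brings their number down to match the landslide points.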

The frequency ratio (FR) was used to extract the values of the landslide conditioning factors corresponding to the extracted landslide and non-landslide areas. The FR analyzes the correlation between a landslide area and a landslide conditioning factor, with an increase in its value indicating a larger impact on landslide occurrence. In this study, FR values for the landslide conditioning factors were calculated and normalized, and then new classes were assigned (Table 1).
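For one conditioning factor, the FR of each class is commonly computed as the share of landslides falling in that class divided by the share of the study area the class covers. The sketch below follows that common definition; the counts are invented, and the min-max normalization is an assumption, since the text does not state which normalization was applied.

```python
def frequency_ratio(landslide_counts, pixel_counts):
    """FR per class = (% of landslides in class) / (% of area in class)."""
    total_ls = sum(landslide_counts)
    total_px = sum(pixel_counts)
    return [
        (ls / total_ls) / (px / total_px)
        for ls, px in zip(landslide_counts, pixel_counts)
    ]


def min_max_normalize(values):
    """Rescale values to [0, 1]; assumed here, not confirmed by the source."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical factor with three classes.
fr = frequency_ratio(landslide_counts=[10, 30, 60], pixel_counts=[5000, 3000, 2000])
fr_norm = min_max_normalize(fr)
```

An FR above 1 means the class contains proportionally more landslides than its area would suggest.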

Finally, the values newly assigned to each landslide conditioning factor were applied to the point data to construct the training dataset (204 points: 102 landslide and 102 non-landslide) and the validation dataset (86 points: 43 landslide and 43 non-landslide). The training dataset was used to build the DT, RF, and RoF models, while the validation dataset was used to verify them.

3.3. Relief-F Feature Selection Method

Each landslide conditioning factor contributes differently to landslide occurrence, and thus the choice of factors affects the prediction accuracy of the model. Here, before training the model, the relevance of the landslide conditioning factors used in this study was evaluated using the Regressional Relief-F (RRelief-F) algorithm, a feature selection method.

The core of the original Relief algorithm [38] evaluates the quality of an attribute based on how well it distinguishes nearby instances. Therefore, a good attribute has the same value for instances of the same class, and distinguishes between instances of different classes [39,40]. To achieve this, the Relief algorithm calculates weights for all attributes through iterative estimation of two nearest neighbors (the nearest hit and miss) when randomly selected instances are provided [40].
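The nearest-hit and nearest-miss update can be made concrete with a short sketch of the original two-class Relief algorithm. This is not RRelief-F and not the "FSelector" implementation used in the study; the Manhattan distance on range-scaled attributes and the toy data are assumptions made for illustration.

```python
import random


def relief(X, y, n_iter=None, seed=0):
    """Original Relief for two classes and numeric attributes."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    n_iter = n_iter or n
    # Range of each attribute, used to scale differences into [0, 1].
    span = [(max(r[a] for r in X) - min(r[a] for r in X)) or 1.0 for a in range(d)]

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / span[a]

    def dist(i, j):
        return sum(diff(a, i, j) for a in range(d))

    weights = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(i, j))
        miss = min(misses, key=lambda j: dist(i, j))
        for a in range(d):
            # Reward separation from the nearest miss, penalize it from the nearest hit.
            weights[a] += (diff(a, i, miss) - diff(a, i, hit)) / n_iter
    return weights


# Toy data: attribute 0 separates the classes, attribute 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.9, 0.4], [1.0, 0.8], [0.8, 0.6]]
y = [0, 0, 0, 1, 1, 1]
w = relief(X, y)
```

On this toy data the informative attribute receives a positive weight and the noisy one a negative weight, which mirrors the rule of discarding factors with zero or negative weights.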

The Relief algorithm has been extended through many studies, such as the Relief-F algorithm [39]. Here, the RRelief-F algorithm introduced by Robnik-Sikonja and Kononenko [40], which is a Relief-F algorithm suitable for regression problems where the predicted value (class) is continuous, was used.

3.4. Landslide Susceptibility Modeling

3.4.1. Decision Trees

A DT is a basic tree-based approach for developing a model that classifies or predicts the values of dependent variables by learning various decision rules inferred from all available data. DT analysis methods include various algorithms, such as the chi-square automatic interaction detector decision tree (CHAID) [41], classification and regression tree (CART) [42], ID3 [43], C4.5 [44], and C5.0 [44] algorithms. In this study, the CART algorithm was used. Because the other ensemble methods used in this study construct individual DTs using a slightly modified CART algorithm, the CART algorithm is suitable for comparing the results.

As its name suggests, CART creates classification and regression trees based on a binary partitioning algorithm to predict categorical dependent variables (classification) and continuous dependent variables (regression). The learning process used by CART repeatedly and recursively divides all independent variables into subsets using an appropriate splitting criterion. A key point is that it maximizes homogeneity within the subsets and heterogeneity between the subsets.

Among various splitting criteria, CART generally uses the Gini index, which is calculated as follows [21,42]:

Gini (p_{1}, p_{2}, \dots, p_{n}) = \sum_{i \neq j} p_{i} p_{j} = 1 - \sum_{j} p_{j}^{2}

(5)

where p_{i} and p_{j} are the probabilities of landslide occurrences in classes i and j, respectively. For a two-class problem, the Gini index ranges between 0 and 0.5.
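To illustrate how the Gini index guides a binary split, the sketch below computes the impurity of a parent node and the size-weighted impurity of its two children. The class counts are hypothetical, and the helper names are not taken from "rpart".

```python
def gini(probabilities):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probabilities)


def gini_from_counts(counts):
    total = sum(counts)
    return gini([c / total for c in counts])


def weighted_child_gini(left_counts, right_counts):
    """Impurity after a binary split, weighted by child node size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini_from_counts(left_counts) + (
        n_right / n
    ) * gini_from_counts(right_counts)


# Hypothetical node with 50 landslide and 50 non-landslide points,
# split into children holding (40, 10) and (10, 40) points.
parent = gini_from_counts([50, 50])
children = weighted_child_gini([40, 10], [10, 40])
decrease = parent - children
```

CART would prefer the candidate split with the largest impurity decrease; a balanced two-class node has the maximum impurity of 0.5.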

3.4.2. Random Forest

An ensemble model improves classification accuracy by learning several models and combining the results predicted by each. RF, introduced by Breiman [45], is a tree-based ensemble learning algorithm that uses a concept known as bagging, wherein multiple DTs are formed using bootstrapped samples of the original dataset, which are then aggregated.

In the RF model, approximately 66% (“in-bag”) of the bootstrapped samples are used for the training of each DT, and the remaining 33% (“out-of-bag”) are used for evaluating the accuracy of the final ensemble model [3]. Here, for accuracy evaluation, the out-of-bag (OOB) error, which is an unbiased estimate of the generalization error, was used. This value is calculated as the proportion of misclassifications (%) over all OOB elements [20]. Therefore, the accuracy of the RF model improves when the OOB error is minimized.
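The in-bag and out-of-bag shares follow directly from sampling with replacement: the expected fraction of points never drawn in a bootstrap sample of size n is (1 − 1/n)^n, which tends to e^{-1} ≈ 0.368, so the two-thirds and one-third figures are round approximations. A small simulation, with an arbitrary n and seed chosen for illustration, shows this:

```python
import math
import random


def oob_fraction(n, seed=0):
    """Fraction of n points left out of one bootstrap sample of size n."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n


simulated = oob_fraction(10_000)
expected = (1 - 1 / 10_000) ** 10_000  # close to exp(-1)
```

Each tree therefore has its own held-out points, which is what allows the OOB error to be estimated without a separate validation set.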

Final decisions of class membership and model construction (output) are determined by the majority vote among all trees [46]. In addition, two variable-importance measures, the mean decrease in accuracy and mean decrease in node impurity (mean decrease in Gini), are calculated and used to prioritize the factors used [47]. These values indicate the contribution of each factor used when the RF model is constructed.

3.4.3. Rotation Forest

RoF is an advanced ensemble learning algorithm introduced by Rodriguez et al. [24] that creates accurate and diverse classifiers based on feature transformation. The RoF framework combines bagging techniques with random subspaces and principal component analysis (PCA) to construct an ensemble classifier [48]. Here, the process used was the same as that of the RF model, if one excludes PCA.

The main procedure for constructing an RoF model is as follows: (1) Attribute sets are randomly divided into K subsets, where K is a user-defined parameter. (2) Sample subsets are acquired through bootstrap sampling, and feature transformation (PCA) is applied to each sample subset. (3) The rotation matrix is realigned according to the sequence of the original attribute sets. (4) Base classifiers are trained based on the rotated data. (5) The results of various base classifiers are integrated and the predicted final class label is the final output [49].

In this process, the confidence for each class is calculated with the average combination method across all classifiers. The final class label is assigned to the class with the highest confidence value [29], and the success of the RoF model depends on the rotation matrix and base classifiers created by the transformation methods [30]. More details can be found in Rodriguez et al. [24].
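Steps (1)–(3) of the procedure, which build the rotation matrix for a single base classifier, can be sketched with NumPy as below. This is a simplified illustration, not the "rotationForest" implementation: it omits the random removal of classes used by Rodriguez et al., and the 75% bootstrap fraction, the function name, and the random data are assumptions.

```python
import numpy as np


def rotation_matrix(X, k, rng, sample_frac=0.75):
    """Build one block-structured rotation matrix from per-subset PCA."""
    n, d = X.shape
    # Step 1: randomly partition the attributes into k subsets.
    subsets = np.array_split(rng.permutation(d), k)
    R = np.zeros((d, d))
    for cols in subsets:
        # Step 2: bootstrap the samples and run PCA on this attribute subset.
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=True)
        block = X[np.ix_(rows, cols)]
        block = block - block.mean(axis=0)
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, components = np.linalg.eigh(cov)  # all components are kept
        # Step 3: place the loadings at the original attribute positions.
        R[np.ix_(cols, cols)] = components
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))      # hypothetical data: 50 samples, 6 factors
R = rotation_matrix(X, k=3, rng=rng)
X_rotated = X @ R                 # Step 4 would train a tree on X_rotated
```

Because every principal component is retained, the matrix is orthogonal, so the rotation changes the axes seen by each tree without discarding information.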

3.5. Model Performance Assessment

The probabilities predicted by the DT, RF, and RoF models were classified into two classes, landslide presence and absence, using different cut-off values. The predicted classes were compared to observed landslide locations to create a contingency table, which provides information on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values depending on whether the landslide occurrence was correctly classified. Statistical indices were calculated using the aforementioned four values as follows.

Sensitivity = \frac{TP}{TP + FN}

(6)

Specificity = \frac{TN}{TN + FP}

(7)

Precision = \frac{TP}{TP + FP}

(8)

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}

(9)

Sensitivity represents the proportion of the landslide pixels correctly classified, while specificity represents the proportion of correctly classified non-landslide pixels. Precision, a positive predictive value, represents the ratio of actual landslide pixels to those classified as landslides by the model. Accuracy represents the ratio of the correctly classified landslide and non-landslide pixels [21,50]. Here, a model was better if the precision and accuracy were higher. A value of one represents a perfect model [48].

In addition, the kappa coefficient, calculated based on the difference between the observed and predicted landslides, evaluates the reliability of each model. It is calculated as follows [51]:

k = \frac{P_{obs} - P_{exp}}{1 - P_{exp}}

(10)

P_{obs} = \frac{TP + TN}{SA}

(11)

P_{exp} = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{SA^{2}}

(12)

where SA is the total number of pixels in the study area. The kappa coefficient ranges from 0 to 1. When it is 0.8 or larger, the model performance is considered to be nearly perfect [52].
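Equations (6)–(12) can be evaluated together from the four cells of a contingency table. In the sketch below the counts are hypothetical and are not taken from the study's tables; SA is taken as the sum of the four cells.

```python
def classification_indices(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, accuracy, and kappa from a 2 x 2 table."""
    total = tp + tn + fp + fn
    p_obs = (tp + tn) / total
    p_exp = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": p_obs,
        "kappa": (p_obs - p_exp) / (1 - p_exp),
    }


# Hypothetical table with 43 landslide and 43 non-landslide points.
indices = classification_indices(tp=36, tn=33, fp=10, fn=7)
```

When the observed classes are balanced, as in this example, P_exp equals 0.5 and the kappa coefficient reduces to 2 × accuracy − 1.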

The ROC curve is a plot of sensitivity (on the y-axis) against (1−specificity) (x-axis). Morphologically, the performance of the model is better if the left edge of the ROC curve is closer to the top left corner. To express this quantitatively, the area under the ROC curve is calculated as follows [21]:

AUROC = \frac{\sum TP + \sum TN}{Y + N}

(13)

where Y is the total number of landslides and N is the total number of non-landslides. The area under the ROC curve (AUROC) ranges from 0.5 to 1.0. The model’s classification accuracy is considered to be very high if the AUROC is > 0.9, whereas the model is considered to have cognitive discrimination ability if the AUROC is between 0.7 and 0.9 [50].
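In practice the area under the empirical ROC curve is often computed with the equivalent pairwise (Mann–Whitney) form: the fraction of landslide/non-landslide pairs in which the landslide point receives the higher predicted probability, counting ties as one half. The sketch below uses that standard formulation rather than Equation (13), and the scores are invented.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical susceptibility scores: 1 = landslide, 0 = non-landslide.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
area = auroc(scores, labels)
```

This form makes the interpretation explicit: a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.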

4. Results

4.1. Landslide Conditioning Factor Analysis

Table 2 shows the results of applying the RRelief-F algorithm to the 14 landslide conditioning factors used in this study using the “FSelector” package. These values represent the weights (importance) of the conditioning factors for predicting landslide occurrence. Thus, larger weights imply highly relevant factors corresponding to high predictive abilities for landslide occurrence. Conditioning factors with zero or negative values mean that their underlying processes do not contribute to landslide occurrence. Because these conditioning factors may affect the accuracy of the model, they were removed before further analysis.

Among the conditioning factors used, 11 had values greater than zero. In particular, TPI had the largest value (0.182) and profile curvature had the smallest (0.009). However, the weights of timber diameter, age, and density were 0.000, indicating that they had null predictive ability. Therefore, these factors were excluded during further analyses of landslide susceptibility.

4.2. Landslide Susceptibility Mapping

Landslide models were constructed by training the DT, RF, and RoF models using the training dataset. The models were trained with 10-fold cross-validation to reduce variability. During the training process, the parameter values of each model were optimized to improve the predictive performance of the landslide models. The landslide models constructed by DT, RF, and RoF were then applied to calculate landslide susceptibility indices (LSIs) throughout the whole study area. Thereafter, LSMs were constructed and reclassified into five susceptibility classes using the natural breaks method.

4.2.1. Decision Trees

In the DT model, four a priori parameters (minsplit, minbucket, maxdepth, and cp) were optimized using the “mlr” package, and then the “rpart” package was used to construct the landslide model. Here, minsplit is the minimum number of observations in the parent node, minbucket is the minimum number of observations in a terminal node, maxdepth is the maximum depth of a tree, and cp is the complexity parameter. The optimized values for minsplit, minbucket, maxdepth, and cp were 6, 6, 17, and 0.001, respectively.

The LSI values ranged from 0.00 to 1.00. These values were reclassified into very low susceptibility (0.00–0.10), low susceptibility (0.10–0.33), moderate susceptibility (0.33–0.57), high susceptibility (0.57–0.63), and very high susceptibility (0.63–1.00) to produce the LSM (Figure 3a). The area percentages of each class were 26.48%, 9.82%, 18.78%, 7.18%, and 37.74%, respectively. The area percentage of the very high susceptibility class was higher in the DT LSM compared to the LSMs produced by RF and RoF (Figure 4).

4.2.2. Random Forest

In the RF model, selection of the hyper-parameters and construction of the landslide model were implemented using the “caret” and “randomForest” packages, respectively. The parameters were ntree and mtry, representing the number of trees in the forest and the number of variables tested at each node, respectively. The optimized values for ntree and mtry were 4000 and 1, respectively.

The LSM was produced by classifying areas into five susceptibility classes: very low (0.01–0.23), low (0.23–0.42), moderate (0.42–0.61), high (0.61–0.80), and very high (0.80–1.00) (Figure 3b). The area percentages of each class were 19.90%, 22.19%, 19.92%, 19.35%, and 18.63%, respectively (Figure 4).
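Once the class breaks are known, reclassifying an LSI value is a simple lookup. The sketch below uses the RF break values reported above; assigning a value that falls exactly on a break to the higher class is an assumption, and the natural breaks computation itself (done in ArcGIS) is not reproduced.

```python
from bisect import bisect_right

RF_BREAKS = [0.23, 0.42, 0.61, 0.80]  # upper bounds of the first four classes
CLASSES = ["very low", "low", "moderate", "high", "very high"]


def susceptibility_class(lsi, breaks=RF_BREAKS):
    """Map a landslide susceptibility index to one of five classes."""
    return CLASSES[bisect_right(breaks, lsi)]


example_classes = [susceptibility_class(v) for v in (0.10, 0.50, 0.95)]
```

Applying the same function with a different `breaks` list would reproduce the DT or RoF classification schemes.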

4.2.3. Rotation Forest

Two a priori parameters were also optimized for the RoF model using the “caret” package: the number of variable subsets (K) and the number of base classifiers (L). The landslide model with K = 1 and L = 26 was constructed using the “rotationForest” package.

The LSI also ranged from 0.00 to 1.00. These values were reclassified as very low susceptibility (0.01–0.22), low susceptibility (0.22–0.40), moderate susceptibility (0.40–0.60), high susceptibility (0.60–0.78), and very high susceptibility (0.78–0.93) (Figure 3c). The distribution of LSI values for each susceptibility class was similar to the LSM produced by RF. However, the very low (25.41%) and very high (23.13%) susceptibility classes were higher than in the LSM produced by RF, while the other classes were slightly lower (Figure 4).

4.3. Model Validation and Comparison

4.3.1. Statistical Indices

Table 3 shows the performances of the three landslide models examined using the statistical indices. Overall, on the training dataset, the RF model was superior to the DT and RoF models as it produced the highest values for sensitivity (1.000), specificity (0.962), precision (0.961), and accuracy (0.980). The RoF model exhibited an overall performance similar to that of the RF model, but its index values were slightly smaller. The DT model performed the worst, with an accuracy that was 0.108 lower than that of the RF model. This occurred because the performances for the classification of landslide and non-landslide pixels were reduced by 0.120 and 0.097, respectively. The kappa statistics of the three landslide models ranged from 0.745 to 0.843, indicating general agreement.

The performances of the three models were also evaluated using the validation dataset. Although the RF model performed best on the training dataset, the RoF model offered the highest predictive performance here. Its accuracy was 0.802, which was 0.023 higher than that of the RF model (0.779), and the sensitivity values indicate that the RoF model classified landslide pixels better than the RF model. The DT model again performed the worst. The kappa statistics of the RF and RoF models were 0.558 and 0.605, respectively, indicating moderate and substantial agreement. The kappa statistic of the DT model, however, was 0.395, indicating only fair agreement between the model and the validation data.

4.3.2. Receiver Operating Characteristic Curve

The ROC curve was analyzed for success and prediction rates. The success rate, analyzed using the training dataset, represents how well the landslide model fits the data. The models produced similar AUROC values for the success rate, but the RF model had the highest value (Figure 5a, Table 4).

For the prediction rate, analyzed using the validation dataset, the RoF model produced the highest AUROC value (0.868), followed by RF (0.853) and DT (0.772) (Figure 5b, Table 5). This indicates that the RoF model is the best predictor of landslides among the three models. The ROC results were consistent with those obtained using the statistical indices.

5. Discussion

In this study, the LSMs produced by the DT, RF, and RoF models were evaluated by statistical indices and AUROC. Among the models, the DT model is the easiest and most straightforward to interpret. In addition, the DT model has many advantages: (1) it is a type of statistical analysis with no statistical distribution assumption, (2) it can handle data from various measurement scales, (3) variable transformation is not required, and (4) it facilitates the construction of the rules for prediction of complex relationships [50,53].

Despite these advantages, the DT model is too simple to effectively describe many real-world situations [54]. This can be recognized from the results in this study, where the LSM produced by the DT model had the lowest performance based on analyses using statistical indices and AUROC. In addition, the difference between the success rate and prediction rate curves for the LSM produced by the DT model was 0.188, which was the highest value, followed by the LSM produced by the RF model (0.147) and the LSM produced by the RoF model (0.122).

By contrast, the ensemble learning models, RF and RoF, outperformed the DT model because they both enhanced the goodness-of-fit and prediction ability. This is reasonable because classifiers based on ensemble learning can reduce both bias and variance and are less prone to over-fitting than their base classifiers, which improves their predictive capability [55]. These results are consistent with those of previous studies [3,22,49], indicating that ensemble learning contributes to improving the performance of single (weak) classifiers.

Among the ensemble learning models, the RoF model had a better performance than the RF model, but the difference was not large. The key to its robust performance lies in the main idea of the RoF model, which is to encourage diversity and individual accuracy simultaneously within an ensemble classifier. Specifically, diversity is promoted by using PCA to perform feature extraction for each base classifier and accuracy is sought by keeping all principal components and also using the entire dataset to train each base classifier.
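
The rotation step described above can be illustrated with a simplified sketch in plain Python. It makes several assumptions that are not taken from the paper: feature subsets have exactly two features (so PCA has a closed form), PCA is fitted on a random 75% subsample, and tree training is omitted. All function names are hypothetical.

```python
import math
import random

def pca_axes_2d(points):
    """Principal axes of 2-D points, returned as a 2x2 matrix with the axes as columns."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rotation_matrix(X, rng, sample_frac=0.75):
    """Block-diagonal rotation for one base classifier (feature count must be even)."""
    n_features = len(X[0])
    features = list(range(n_features))
    rng.shuffle(features)  # a different random split for every base classifier
    R = [[0.0] * n_features for _ in range(n_features)]
    for col in range(0, n_features, 2):
        a, b = features[col], features[col + 1]
        sample = rng.sample(X, int(sample_frac * len(X)))  # subsample used only for PCA
        axes = pca_axes_2d([(row[a], row[b]) for row in sample])
        # Keep both principal components, so no information is discarded.
        R[a][col], R[a][col + 1] = axes[0]
        R[b][col], R[b][col + 1] = axes[1]
    return R

def rotate(X, R):
    """Project every sample onto the rotated axes (X_rot = X R)."""
    n = len(R)
    return [[sum(row[k] * R[k][j] for k in range(n)) for j in range(n)] for row in X]

rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(40)]  # hypothetical data
R = rotation_matrix(X, rng)
X_rot = rotate(X, R)  # the entire rotated dataset would then train one base tree
```

Because every component is kept, `R` is orthogonal and the rotated data carry the same information as the original data. Diversity comes only from the random feature split and the subsample used for PCA, which differ for each base classifier.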

In addition, the RoF model has several advantages for predicting landslide events. For example, it does not require assumptions about the distributions of the landslide conditioning factors, it has low bias, and it can deal efficiently with unbalanced data and over-fitting. However, the RF model is better than the RoF model in terms of computational efficiency. Kavzoglu and Colkesen [56] reported that the RoF model required much more time (23.03 s) than the RF model (4.01 s) on the training dataset. This is due to the complexity of the RoF model, which employs two parameters and PCA in the modeling stage.

As mentioned earlier, the AUROC values of the success rate curves were very high, almost reaching a value of 1, but the AUROC values of the prediction rate curves were lower by 0.122 to 0.188 (roughly 12% to 20% in relative terms). This shows that the DT, RF, and RoF models fit the training data too closely. As a result, the models had high accuracy on the training data but larger error on unseen data. This problem, called overfitting, is a central problem in machine learning and data mining.

Avoiding or solving the overfitting problem is not easy because there can be many causes. In this study, three causes are plausible. First, the non-landslide sample was set to the same size as the landslide sample, even though the actual non-landslide area is much larger than the landslide area. Second, the training and validation datasets were split at a 70:30 ratio without assessing how the sampling ratio affects accuracy. Third, despite the RRelief-F feature selection process, the remaining landslide conditioning factors still included noise.

In future studies, the following should be performed to enhance the accuracy of LSMs: (1) effective analysis of the ratio of non-landslide area, (2) evaluation of the sampling ratio for the training and validation datasets, (3) consideration of differences between various feature selection methods, and (4) comparison of model performance by additional ensemble methods.

6. Conclusions

This study used three machine learning models, DT, RF, and RoF, to analyze landslide susceptibility at Woomyeon Mountain, South Korea. Fourteen landslide conditioning factors were produced using thematic maps generated by government organizations. These factors were evaluated for their contributions to the models using the RRelief-F algorithm. Finally, 11 landslide conditioning factors were selected, excluding timber diameter, timber age, and timber density. Landslide susceptibility analyses and mapping were performed with these 11 landslide conditioning factors using DT, RF, and RoF models. The LSMs produced by DT, RF, and RoF were evaluated using statistical indices and AUROC. Overall, the three LSMs showed reasonable goodness-of-fit and good performances with the training and validation datasets. In particular, the ensemble learning models, RF and RoF, outperformed the DT model, and the RoF model had a higher performance than the RF model. These results demonstrate that ensemble learning methods have a powerful ability to improve prediction accuracy compared to the single classifier approach. In addition, the RoF model proved to be effective for landslide susceptibility assessment in the study area. The LSMs produced in this study may be useful for decision makers, planners, and engineers in landslide-prone areas.

Author Contributions

S.P. wrote the paper and analyzed the data; S.-Y.H. suggested the idea for the study; J.K. managed the paperwork.

Acknowledgments

This research was financially supported by the BK21 plus Project of the Graduate School of Earth Environmental Hazard System. In addition, this work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF–2017R1A2B2009033).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Figure A1. Geological maps (scale, 1:50,000) of the study area obtained from the Korea Institute of Geoscience and Mineral Resources.

References

1. Brabb, E.E. Innovative Approaches to Landslide Hazard Mapping. In Proceedings of the 4th International Symposium on Landslides, Toronto, ON, Canada, 16–21 September 1984; USGS: Reston, VA, USA, 1985; pp. 307–324.
2. Guzzetti, F.; Carrara, A.; Cardinali, M.; Reichenbach, P.; Galli, M.; Ardizzone, F. Landslide Hazard Evaluation: An Aid to a Sustainable Development. Geomorphology 1999, 31, 181–216.
3. Sahin, E.K.; Colkesen, I.; Kavzoglu, T. A Comparative Assessment of Canonical Correlation Forest, Random Forest, Rotation Forest and Logistic Regression Methods for Landslide Susceptibility Mapping. Geocarto Int. 2018, 1–23.
4. Pham, B.T.; Shirzadi, A.; Bui, D.T.; Prakash, I.; Dholakia, M.B. A Hybrid Machine Learning Ensemble Approach Based on a Radial Basis Function Neural Network and Rotation Forest for Landslide Susceptibility Modeling: A Case Study in the Himalayan Area, India. Int. J. Sediment. Res. 2018, 33, 157–170.
5. Yilmaz, I. Comparison of Landslide Susceptibility Mapping Methodologies for Koyulhisar, Turkey: Conditional Probability, Logistic Regression, Artificial Neural Networks, and Support Vector Machine. Environ. Earth Sci. 2010, 61, 821–836.
6. Yalcin, A.; Reis, S.; Aydinoglu, A.C.; Yomralioglu, T. A GIS-Based Comparative Study of Frequency Ratio, Analytical Hierarchy Process, Bivariate Statistics and Logistics Regression Methods for Landslide Susceptibility Mapping in Trabzon, NE Turkey. Catena 2011, 85, 274–287.
7. Mohammady, M.; Pourghasemi, H.R.; Pradhan, B. Landslide Susceptibility Mapping at Golestan Province, Iran: A Comparison between Frequency Ratio, Dempster–Shafer, and Weights-Of-Evidence Models. J. Asian Earth Sci. 2012, 61, 221–236.
8. Park, S.; Choi, C.; Kim, B.; Kim, J. Landslide Susceptibility Mapping Using Frequency Ratio, Analytic Hierarchy Process, Logistic Regression, and Artificial Neural Network Methods at the Inje Area, Korea. Environ. Earth Sci. 2013, 68, 1443–1464.
9. Pham, B.T.; Pradhan, B.; Bui, D.T.; Prakash, I.; Dholakia, M.B. A comparative study of different machine learning methods for landslide susceptibility assessment: A case study of Uttarakhand area (India). Environ. Modell. Softw. 2016, 84, 240–250.
10. Naghibi, S.A.; Pourghasemi, H.R.; Dixon, B. GIS-Based Groundwater Potential Mapping Using Boosted Regression Tree, Classification and Regression Tree, and Random Forest Machine Learning Models in Iran. Environ. Monit. Assess. 2016, 188, 44.
11. Pourghasemi, H.R.; Rahmati, O. Prediction of the Landslide Susceptibility: Which Algorithm, Which Precision. Catena 2018, 162, 177–192.
12. Bui, D.T.; Pradhan, B.; Lofman, O.; Revhaug, I.; Dick, O.B. Landslide Susceptibility Mapping at Hoa Binh Province (Vietnam) Using an Adaptive Neuro–Fuzzy Inference System and GIS. Comput. Geosci. 2012, 45, 199–211.
13. Conforti, M.; Pascale, S.; Robustelli, G.; Sdao, F. Evaluation of Prediction Capability of the Artificial Neural Networks for Mapping Landslide Susceptibility in the Turbolo River Catchment (Northern Calabria, Italy). Catena 2014, 113, 236–250.
14. Kawabata, D.; Bandibas, J. Landslide Susceptibility Mapping Using Geological Data, a DEM from ASTER Images and an Artificial Neural Network (ANN). Geomorphology 2009, 113, 97–109.
15. Pourghasemi, H.R.; Pradhan, B.; Gokceoglu, C. Application of Fuzzy Logic and Analytical Hierarchy Process (AHP) to Landslide Susceptibility Mapping at Haraz Watershed, Iran. Nat. Hazards 2012, 63, 965–996.
16. Zhu, A.X.; Wang, R.; Qiao, J.; Qin, C.Z.; Chen, Y.; Liu, J.; Zhu, T. An Expert Knowledge-Based Approach to Landslide Susceptibility Mapping Using GIS and Fuzzy Logic. Geomorphology 2014, 214, 128–138.
17. Pradhan, B. A Comparative Study on the Predictive Ability of the Decision Tree, Support Vector Machine and Neuro-Fuzzy Models in Landslide Susceptibility Mapping Using GIS. Comput. Geosci. 2013, 51, 350–365.
18. Bui, D.T.; Pradhan, B.; Lofman, O.; Revhaug, I. Landslide Susceptibility Assessment in Vietnam Using Support Vector Machines, Decision Tree, and Naive Bayes Models. Math. Probl. Eng. 2012, 2012, 26.
19. Lee, S.; Hong, S.M.; Jung, H.S. A Support Vector Machine for Landslide Susceptibility Mapping in Gangwon Province, Korea. Sustainability 2017, 9, 48.
20. Youssef, A.M.; Pourghasemi, H.R.; Pourtaghi, Z.S.; Al-Katheeri, M.M. Landslide Susceptibility Mapping Using Random Forest, Boosted Regression Tree, Classification and Regression Tree, and General Linear Models and Comparison of Their Performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 2016, 13, 839–856.
21. Chen, W.; Zhang, S.; Li, R.; Shahabi, H. Performance Evaluation of the GIS–Based Data Mining Techniques of Best-First Decision Tree, Random Forest, and Naive Bayes Tree for Landslide Susceptibility Modeling. Sci. Total Environ. 2018, 644, 1006–1018.
22. Kadavi, P.; Lee, C.W.; Lee, S. Application of ensemble–based machine learning models to landslide susceptibility mapping. Remote Sens. 2018, 10, 1252.
23. Nguyen, Q.K.; Tien Bui, D.; Hoang, N.D.; Trinh, P.; Nguyen, V.H.; Yilmaz, I. A Novel Hybrid Approach Based on Instance Based Learning Classifier and Rotation Forest Ensemble for Spatial Prediction of Rainfall-Induced Shallow Landslides Using GIS. Sustainability 2017, 9, 813.
24. Rodriguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation Forest: A New Classifier Ensemble Method. IEEE Trans. Pattern Anal. 2006, 28, 1619–1630.
25. Liu, K.H.; Huang, D.S. Cancer Classification Using Rotation Forest. Comput. Biol. Med. 2008, 38, 601–610.
26. De Bock, K.W.; Van Den Poel, D. An Empirical Evaluation of Rotation-Based Ensemble Classifiers for Customer Churn Prediction. Expert Syst. Appl. 2011, 38, 12293–12301.
27. Nanni, L.; Lumini, A. An Experimental Comparison of Ensemble of Classifiers for Bankruptcy Prediction and Credit Scoring. Expert Syst. Appl. 2009, 36, 3028–3033.
28. Choudhury, S.D.; Tjahjadi, T. Clothing and Carrying Condition Invariant Gait Recognition Based on Rotation Forest. Pattern Recog. Lett. 2016, 80, 1–7.
29. Du, P.; Samat, A.; Waske, B.; Liu, S.; Li, Z. Random Forest and Rotation Forest for Fully Polarized SAR Image Classification Using Polarimetric and Spatial Features. ISPRS J. Photogramm. 2015, 105, 38–53.
30. Xia, J.; Du, P.; He, X.; Chanussot, J. Hyperspectral Remote Sensing Image Classification Based on Rotation Forest. Geosci. Remote. Sens. Lett. 2016, 11, 239–243.
31. Korean Geotechnical Society (KGS). The Study on Investigation of Cause and Development of Restoration Policy about Landslide in Wumyon Area; Korean Geotechnical Society: Seoul, Korea, 2011. (In Korean)
32. Park, S.; Kim, J. Landslide Susceptibility Mapping Based on Random Forest and Boosted Regression Tree Models, and a Comparison of Their Performance. Appl. Sci. 2019, 9, 942.
33. Park, D.W.; Nikhil, N.V.; Lee, S.R. Landslide and Debris Flow Susceptibility Zonation using TRIGRS for the 2011 Seoul Landslide Event. Nat. Hazards Earth Syst. Sci. 2013, 13, 2833–2849.
34. Weiss, A. Topographic Position and Landforms Analysis. In Proceedings of the Poster Presentation, ESRI User Conference, San Diego, CA, USA, 9–13 July 2001; The Nature Conservancy: Arlington County, VA, USA, 2001; p. 200.
35. Pike, R.J.; Wilson, S.E. Elevation-Relief Ratio, Hypsometric Integral, and Geomorphic Area-Altitude Analysis. Geol. Soc. Am. Bull. 1971, 82, 1079–1084.
36. Beven, K.J.; Kirkby, M.J. A Physically Based, Variable Contributing Area Model of Basin Hydrology/Un Modele a Base Physique De Zone Dappel Variable de Lhydrologie Du Bassin Versant. Hydrol. Sci. J. 1979, 24, 43–69.
37. Moore, I.D.; Grayson, R.B.; Ladson, A.R. Digital Terrain Modelling: A Review of Hydrological, Geomorphological, and Biological Applications. Hydrol. Process. 1991, 5, 3–30.
38. Kira, K.; Rendell, L.A. Practical Approach to Feature Selection. In Proceedings of the 9th International Conference on Machine Learning, Aberdeen, Scotland, UK, 1–3 July 1992; ICML: Aberdeen, Scotland, UK, 1992; pp. 249–256.
39. Kononenko, I. Estimating Attributes: Analysis and Extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, Berlin, Germany, 6–8 April 1994; Springer Science Business Media: Berlin, Germany, 1994; pp. 171–182.
40. Robnik-Sikonja, M.; Kononenko, I. An Adaptation of Relief for Attribute Estimation in Regression. In Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 8–12 July 1997; IEEE: Nashville, TN, USA, 1997; Volume 5, pp. 296–304.
41. Kass, G.V. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Appl. Stat. 1980, 29, 119–127.
42. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman & Hall: London, UK, 1984.
43. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
44. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: Burlington, NJ, USA, 1993.
45. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
46. Micheletti, N.; Foresti, L.; Robert, S.; Leuenberger, M.; Pedrazzini, A.; Jaboyedoff, M.; Kanevski, M. Machine Learning Feature Selection Methods for Landslide Susceptibility Mapping. Math. Geosci. 2014, 46, 33–57.
47. Arabameri, A.; Pradhan, B.; Pourghasemi, H.; Rezaei, K.; Kerle, N. Spatial Modelling of Gully Erosion Using GIS and R Programing: A Comparison among Three Data Mining Algorithms. Appl. Sci. 2018, 8, 1369.
48. Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann Publishers: Burlington, NJ, USA, 2011.
49. Shirzadi, A.; Soliamani, K.; Habibnejhad, M.; Kavian, A.; Chapi, K.; Shahabi, H.; Ahmad, A. Novel GIS Based Machine Learning Algorithms for Shallow Landslide Susceptibility Mapping. Sensors 2018, 18, 3777.
50. Dou, J.; Yunus, A.P.; Bui, D.T.; Merghadi, A.; Sahana, M.; Zhu, Z.; Pham, B.T. Assessment of Advanced Random Forest and Decision Tree Algorithms for Modeling Rainfall-Induced Landslide Susceptibility in the Izu-Oshima Volcanic Island, Japan. Sci. Total Environ. 2019, 662, 332–346.
51. Garosi, Y.; Sheklabadi, M.; Conoscenti, C.; Pourghasemi, H.R.; Van Oost, K. Assessing the Performance of GIS-Based Machine Learning Models with Different Accuracy Measures for Determining Susceptibility to Gully Erosion. Sci. Total Environ. 2019, 664, 1117–1132.
52. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
53. Tehrany, M.S.; Pradhan, B.; Jebur, M.N. Spatial Prediction of Flood Susceptible Areas Using Rule Based Decision Tree (DT) and a Novel Ensemble Bivariate and Multivariate Statistical Models in GIS. J. Hydrol. 2013, 504, 69–79.
54. Elith, J.; Leathwick, J.R.; Hastie, T. A Working Guide to Boosted Regression Trees. J. Anim. Ecol. 2008, 77, 802–813.
55. Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2004; pp. 203–234.
56. Kavzoglu, T.; Colkesen, I. An Assessment of the Effectiveness of a Rotation Forest Ensemble for Land-Use and Land-Cover Mapping. Int. J. Remote Sens. 2013, 34, 4224–4241.

Figure 1. Location of the study area and landslide dataset.

Figure 2. Landslide conditioning factors used in this study: (a) altitude, (b) slope, (c) aspect, (d) profile curvature, (e) plan curvature, (f) topographic position index, (g) elevation-relief ratio, (h) slope length and slope steepness, (i) topographic wetness index, (j) stream power index, (k) timber type, (l) timber diameter, (m) timber age, and (n) timber density.

Figure 3. Landslide susceptibility maps (LSMs) produced from three models: (a) decision tree, (b) random forest, (c) rotation forest.

Figure 4. Distribution of each susceptibility class on LSMs.

Figure 5. Receiver operating characteristic curves for LSMs produced from three models: (a) success rate curve using the training dataset and (b) prediction rate curve using the validation dataset.

Table 1. Normalized classes for landslide conditioning factors assigned using the frequency ratio.

Factor	Class	Number of Pixels in Domain	Number of Landslide Pixels	Frequency Ratio	Normalized Class
Altitude	20.00–65.56	49,073	8	0.427	1
	65.56–95.17	68,099	18	0.692	2
	95.17–125.92	54,415	20	0.962	3
	125.92–158.95	40,131	23	1.5	6
	158.95–195.40	26,637	12	1.179	4
	195.40–236.40	17,935	16	2.336	7
	>236.40	10,757	5	1.217	5
Slope	0.00–7.10	12,845	1	0.204	1
	7.10–13.38	44,632	6	0.352	2
	13.38–18.29	68,462	26	0.994	4
	18.29–22.94	64,308	26	1.059	5
	22.94–27.85	46,338	28	1.582	7
	27.85–34.13	24,538	14	1.494	6
	>34.13	5924	1	0.442	3
Aspect	Flat	2124	0	0	1
	North	33,682	12	0.933	5
	Northeast	32,212	7	0.569	2
	East	36,219	8	0.578	3
	Southeast	34,795	17	1.279	7
	South	37,218	16	1.126	6
	Southwest	35,630	19	1.396	8
	West	25,446	14	1.44	9
	Northwest	29,721	9	0.793	4
Profile curvature	Concave	131,376	62	1.236	3
	Flat	3774	1	0.694	1
	Convex	131,897	39	0.774	2
Plan curvature	Concave	124,461	57	1.199	3
	Flat	6489	1	0.403	1
	Convex	136,097	44	0.846	2
Topographic position index	−8.87 to −1.82	7749	6	2.027	7
	−1.82 to −0.78	35,071	18	1.344	6
	−0.78 to −0.14	65,552	31	1.238	5
	−0.14 to 0.50	80,859	31	1.004	4
	0.50 to 1.30	47,450	10	0.552	2
	1.30 to 2.50	24,331	6	0.646	3
	>2.50	6035	0	0	1
Elevation-relief ratio	0.01–0.33	9143	0	0	1
	0.33–0.42	26,087	1	0.1	2
	0.42–0.48	52,853	25	1.238	5
	0.48–0.52	72,581	30	1.082	4
	0.52–0.58	60,144	33	1.437	7
	0.58–0.67	32,250	6	0.487	3
	>0.67	13,989	7	1.31	6
Slope length and slope steepness	0.00–4.83	63,353	10	0.413	2
	4.83–12.55	106,388	37	0.911	3
	12.55–21.24	69,171	36	1.363	4
	21.24–34.75	22,367	15	1.756	5
	34.75–56.96	4185	3	1.877	6
	56.96–96.64	1376	1	1.903	7
	>93.64	207	0	0	1
Topographic wetness index	−7.57 to −4.79	24,064	3	0.326	1
	−4.79 to −1.25	4072	1	0.643	3
	−1.25 to 2.01	49,146	14	0.746	4
	2.01 to 3.35	92,435	42	1.19	6
	3.35 to 4.79	70,461	33	1.226	7
	4.79 to 7.28	21,767	8	0.962	5
	>7.28	5102	1	0.513	2
Stream power index	−13.82 to −11.17	1402	1	1.867	7
	−11.17 to −8.61	7285	0	0	1
	−8.61 to −4.47	21,359	3	0.368	2
	−4.47 to −0.15	46,837	7	0.391	3
	−0.15 to 1.27	91,085	41	1.178	5
	1.27 to 3.12	87,844	45	1.341	6
	>3.12	11,235	5	1.165	4
Timber type	Pine	858	0	0	3
	Nut pine	9578	2	0.547	5
	Larch	5577	2	0.939	7
	Pitch pine	1724	0	0	2
	Sawtooth oak	46,898	14	0.782	6
	Mongolian oak	4739	0	0	1
	Oriental oak	2244	1	1.167	10
	Other oak	104,176	41	1.03	8
	Poplar	5643	5	2.32	12
	False acacia	63,896	28	1.147	9
	Other broadleaf	15,027	8	1.394	11
	Mixed forest	6687	1	0.392	4
Timber diameter	6–18 cm	8677	2	0.603	1
	18–30 cm	258,370	100	1.013	2
Timber age	21–30 years	6203	1	0.422	2
	31–40 years	251,913	100	1.039	3
	41–50 years	8931	1	0.293	1
Timber density	51–70%	11,930	3	0.658	1
	>70%	255,117	99	1.016	2

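The frequency ratio and normalized class columns of Table 1 can be reproduced from the pixel counts. The sketch below does this for the altitude factor, taking the frequency ratio as the share of landslide pixels in a class divided by the share of all pixels in that class, and the normalized class as the rank of that ratio. The function names are hypothetical, and breaking ties by row order is an assumption that may not match the ordering used in the table for classes with a ratio of 0.

```python
# Altitude rows of Table 1: (class, pixels in class, landslide pixels in class).
altitude = [
    ("20.00-65.56", 49073, 8),
    ("65.56-95.17", 68099, 18),
    ("95.17-125.92", 54415, 20),
    ("125.92-158.95", 40131, 23),
    ("158.95-195.40", 26637, 12),
    ("195.40-236.40", 17935, 16),
    (">236.40", 10757, 5),
]

def frequency_ratio(rows):
    """Share of landslide pixels in each class divided by its share of all pixels."""
    total_pixels = sum(pixels for _, pixels, _ in rows)
    total_slides = sum(slides for _, _, slides in rows)
    return [(slides / total_slides) / (pixels / total_pixels)
            for _, pixels, slides in rows]

def normalized_class(ratios):
    """Rank classes by frequency ratio (1 = lowest); ties keep row order (assumption)."""
    order = sorted(range(len(ratios)), key=lambda i: ratios[i])
    ranks = [0] * len(ratios)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

fr = frequency_ratio(altitude)
print([round(v, 3) for v in fr])  # [0.427, 0.692, 0.962, 1.5, 1.179, 2.336, 1.217]
print(normalized_class(fr))       # [1, 2, 3, 6, 4, 7, 5]
```
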
Table 2. Relative importance of conditioning factors in this study using the RRelief-F algorithm.

No	Conditioning Factor	Importance
1	Elevation-relief ratio	0.182
2	Aspect	0.158
3	Altitude	0.157
4	Slope	0.148
5	Timber type	0.107
6	Topographic wetness index	0.102
7	Slope length	0.101
8	Topographic position index	0.100
9	Stream power index	0.078
10	Plan curvature	0.058
11	Profile curvature	0.009
12	Timber diameter	0.000
13	Timber age	0.000
14	Timber density	0.000

Table 3. Performance assessment of the three models with the calibration and validation datasets using statistical indices.

	Calibration Dataset			Validation Dataset
	DT	RF	RoF	DT	RF	RoF
Sensitivity	0.880	1.000	0.957	0.707	0.833	0.882
Specificity	0.865	0.962	0.891	0.689	0.740	0.750
Precision	0.863	0.961	0.828	0.674	0.698	0.698
Accuracy	0.873	0.980	0.922	0.698	0.779	0.802
Kappa	0.745	0.961	0.843	0.395	0.558	0.605
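
The indices in Table 3 are all derived from a 2×2 confusion matrix. The sketch below uses the common textbook definitions, which may differ in detail from the definitions given in Section 3.5 of the paper. The counts are hypothetical and were chosen only to make the arithmetic easy to follow; they are not the study's confusion matrix.

```python
def classification_indices(tp, fn, fp, tn):
    """Standard indices from a binary confusion matrix.

    tp/fn: landslide cells predicted as landslide / non-landslide.
    fp/tn: non-landslide cells predicted as landslide / non-landslide.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Hypothetical counts for 50 landslide and 50 non-landslide cells.
result = classification_indices(tp=40, fn=10, fp=5, tn=45)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
# sensitivity 0.800, specificity 0.900, precision 0.889, accuracy 0.850, kappa 0.700
```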

Table 4. Parameters of the receiver operating characteristic curve with the calibration dataset.

Model	AUROC	Std. Error	Asymptotic Sig.	Asymptotic 95% CI Lower Bound	Asymptotic 95% CI Upper Bound
DT	0.960	0.011	0.000	0.938	0.982
RF	1.000	0.000	0.000	0.999	1.000
RoF	0.990	0.004	0.000	0.982	0.998

Table 5. Parameters of the receiver operating characteristic curve with the validation dataset.

Model	AUROC	Std. Error	Asymptotic Sig.	Asymptotic 95% CI Lower Bound	Asymptotic 95% CI Upper Bound
DT	0.772	0.051	0.000	0.672	0.872
RF	0.853	0.043	0.000	0.770	0.937
RoF	0.868	0.039	0.000	0.792	0.944

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Park, S.; Hamm, S.-Y.; Kim, J. Performance Evaluation of the GIS-Based Data-Mining Techniques Decision Tree, Random Forest, and Rotation Forest for Landslide Susceptibility Modeling. Sustainability 2019, 11, 5659. https://doi.org/10.3390/su11205659
