Development of a Novel Hybrid Intelligence Approach for Landslide Spatial Prediction

We proposed an innovative hybrid intelligent approach, namely, the MultiBoost-based naïve Bayes trees (MBNBT) method, for the spatial prediction of landslides in the Mu Cang Chai District of Yen Bai Province, Vietnam. The MBNBT, which is an ensemble of the MultiBoost (MB) meta classifier and the naïve Bayes trees (NBT) base classifier, has rarely been applied for landslide susceptibility mapping around the world. For the modeling, we selected 248 landslide locations in the hilly terrain of the study area. Fifteen landslide conditioning factors were selected for the construction of the database based on the One-R attribute evaluation (ORAE) technique. Model validation was done using statistical metrics, namely, sensitivity, specificity, accuracy, mean absolute error (MAE), root mean square error (RMSE), and the area under the receiver operating characteristics curve (AUC). Performance of the hybrid model was evaluated and compared with popular soft computing benchmark models, namely, multilayer perceptron neural network (MLPN), Support Vector Machines (SVM), and single NBT. Results indicated that the proposed MBNBT model (AUC = 0.824) outperformed the popular models, namely, the MLPN (AUC = 0.804), SVM (AUC = 0.804), and single NBT (AUC = 0.800). Analysis of the model results also suggested that the MB meta classifier ensemble could enhance the prediction power of the NBT model. Therefore, the MBNBT is a suitable method for the assessment of landslide susceptibility in landslide-prone areas.


Study Area
The Mu Cang Chai District is located between latitudes 21°39'00" N and 21°50'00" N and longitudes 103°56'00" E and 104°23'00" E, covering an area of approximately 1196 km² in Yen Bai Province, Vietnam (Figure 1). The study area is occupied mainly by forest and barren, cultivable, and scrub lands. The topography of the area is hilly, with elevation ranging from 280 m to 2820 m. Geologically, the area is covered by eruptive (Ngoi Thia and Tu Le complexes) and intrusive magmatic rocks (Tram Tau formation and Phu Sa Phin complex) associated with sedimentary and metamorphic rocks. Three main faults, namely, Nghia Lo, Phong Tho-Van Yen, and Nam Co-Minh, affect the stability of rocks in the study area. The area is located in a tropical monsoon region with an annual mean precipitation ranging from 3700 mm to 5490 mm, falling mainly during the monsoon period (May to October). Mean temperature ranges from 9.7 °C (December-January) to 28 °C (June-July). The annual mean temperature is 14.3 °C and the humidity is approximately 81%.

Data Used
A geographic information system (GIS) database including the landslide inventory and affecting factors was created for the landslide spatial analysis. A landslide inventory map records the locations of landslides and other information, such as the date of occurrence and the types of ground/rock mass movements, wherever available. The landslide inventory in this study was created from 248 historical landslide events, which were identified and mapped in the study area by interpreting air photos, Landsat imagery, and Google Earth images. Field surveys were carried out under a national project in Vietnam to check the ground truth of the landslide occurrences (Figure 1). The largest landslide event, involving a volume of 100,000 m³, was recorded at the Che Cu Na commune (2011). The types of landslides that occurred in the study area included translational (35 events), mixed (36 events), toppling (45 events), rotational (124 events), and debris slides (8 events).
Landslide affecting factors (parameters) such as slope, aspect, profile curvature, elevation, distance to rivers, river density, curvature, distance to roads, road density, plan curvature, distance to faults, fault density, land use, lithology, and rainfall were considered for the landslide analysis, as they are known to be important factors for landslide occurrences in any area [42,43]. More specifically, slope affects landslide occurrences, as landslides often occur at certain critical slope angles depending on the nature of the ground mass and the orientation of the sliding plane [44,45]. Aspect is related to the precipitation falling on the slope, solar radiation, soil conditions, and vegetation; thus, it is considered one of the conditioning factors in landslides [44,45]. Profile curvature, plan curvature, and curvature represent the morphology of the surface, which controls the run-off and accumulation of surface water; thus, they have an effect on landslide occurrences [44]. Elevation is considered one of the conditioning factors in landslides, as it is related to the weathering of soil and rocks on the slope [45]. At higher elevations, the weathering is generally much less. Distance to rivers and river density are landslide conditioning factors, as the ground mass near rivers is generally more saturated with water and a high-density drainage area drains out more surface water (run-off) [44]. Distance to roads and road density affect landslide occurrences, as excavation for roads creates more instability in the ground/rock mass and thus more landslides [44]. Distance to faults and fault density are considered conditioning factors in landslides, as faults themselves cause landslides and the ground/rock mass near faults is generally more fractured and vulnerable to sliding [44]. Land-use patterns greatly affect landslide occurrences. Landslides occur more often in barren lands and in areas of agricultural activity [46].
Lithology is also considered a landslide conditioning factor, as the physical properties of soil and rock materials including their strength, porosity, permeability, and weathering affect sliding [45]. Rainfall is considered as one of the triggering factors in landslides, as it reduces soil cohesion and increases the pore pressure [47], and its influence on slope stability also depends on the duration and intensity of the rainfall.
In this study, the factor maps were generated from the following sources: a digital elevation model (DEM) with a 20 m spatial resolution, constructed from contours extracted from topographic maps at a 1:50,000 scale; a geology map at a 1:50,000 scale collected from the Vietnam Institute of Geosciences and Mineral Resources; Google Earth images; and meteorological data on the area. The factor maps were classified into different classes, as shown in Figure 2, in raster format at 20 m resolution. The lithological map was classified into six groups based on lithological characteristics: Group 1, acid-neutral igneous magmatic rocks and tuff; Group 2, acid-neutral intrusive magmatic rocks; Group 3, terrigenous sedimentary rocks with rich aluminosilicate components; Group 4, mafic-ultramafic magma rocks; Group 5, carbonate rocks; and Group 6, quaternary deposits [48,49]. Meteorological data were collected from the global weather data for SWAT [29,50] for a 31-year period (1984-2014) to generate a rainfall map. The maps of distance to features (i.e., roads, rivers, faults) and of feature density were constructed using features extracted from the topographic and geological maps. In order to use the datasets for modeling, the conditioning factors were reclassified into various sub-classes [43] on the basis of the frequency analysis of landslides occurring in the area [8,9,10].

MultiBoost (MB)
The MB is considered an effective ensemble machine learning method that can significantly enhance the performance of weaker classifiers [51]. It was proposed by Webb [52] as a combination of the AdaBoost ensemble and the wagging technique (a variant of bagging). The main principle of the MB is to use the weighted aggregation of multiple classifiers generated during the selection of the bootstrap samples for classification [53]. The MB takes advantage of both wagging (which reduces variance) and AdaBoost (which reduces bias and variance); therefore, it is more efficient than AdaBoost or wagging alone [52]. So far, the MB has been used effectively in the medical [53] and computer sciences [54]. However, it has rarely been applied in landslide studies.
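The wagging component can be illustrated with a minimal sketch. It assumes the weight-reset scheme usually attributed to Webb's description: each training sample receives a random weight drawn from an approximate continuous Poisson distribution, the weights are rescaled to sum to the number of samples, and the boosting run of T iterations is split into roughly sqrt(T) sub-committees at whose boundaries the weights are reset. The function names are hypothetical, and the base-classifier training itself is omitted.

```python
import math
import random


def wagging_weights(n_samples, rng):
    """Draw random instance weights and rescale them to sum to n_samples.

    Each raw weight is -log(u), with u drawn uniformly from
    {1/1000, ..., 999/1000} (an approximate continuous Poisson draw).
    """
    raw = [-math.log(rng.randint(1, 999) / 1000.0) for _ in range(n_samples)]
    scale = n_samples / sum(raw)
    return [w * scale for w in raw]


def committee_boundaries(n_iterations):
    """Iterations after which the weights are reset (about sqrt(T) sub-committees)."""
    n_committees = max(1, int(math.sqrt(n_iterations)))
    return [math.ceil(i * n_iterations / n_committees)
            for i in range(1, n_committees + 1)]


rng = random.Random(0)                # fixed seed, for reproducibility only
weights = wagging_weights(348, rng)   # 174 landslide + 174 non-landslide samples
boundaries = committee_boundaries(20) # 20 iterations, as used later in the study
```

Between two boundaries the sample weights would be updated as in ordinary AdaBoost; at each boundary they would be replaced by a fresh call to `wagging_weights`.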

Naïve Bayes Trees (NBT)
The NBT, which was proposed by Kohavi [55], is a hybrid intelligent approach that combines two machine learning methods: Naïve Bayes (NB) and decision trees (DTs). The NBT is a classification tree method in which the tree structure is constructed using the DTs method at the nodes and the NB method at the leaves [27]. The main purpose of the NBT method is to weaken the independence assumption of the NB and to deal with the fragmentation problems of the DTs method [55]. Because the NBT method takes advantage of both NB and DTs, it is known for having better classification accuracy than the single NB or DTs [55]. However, its predictive capability can be improved further by integrating it with ensemble techniques. Moreover, although the NBT has been used effectively in various fields, namely, the computer sciences [56] and medical sciences [57], its application in the study of landslides is still limited. In the current study, the NBT method has been integrated with the MB ensemble method to construct the novel hybrid model (MBNBT) for landslide prediction.

Support Vector Machines (SVM)
The SVM, which was first developed by Vapnik [58], is a strong supervised classification method. It is a learning algorithm based on statistical learning theory and the structural risk minimization principle [59]. First, the original data are mapped into a high-dimensional feature space by a kernel function, and a separating hyperplane is constructed from the training dataset in that space [60]. This hyperplane divides the target dataset into the two classes of landslide and non-landslide [61]. The result of the SVM modelling depends on which of four kernel functions is used for the transformation of the data: the radial basis kernel function (RBF), polynomial kernel function (PF), sigmoidal kernel function (SF), or linear kernel function (LF) [58,62]. In these functions, γ is the gamma term for all kernel types except linear, d is the polynomial degree term for the polynomial kernel, and r is the bias term for the sigmoid kernel. γ, d, and r are user-controlled parameters, and their correct definition significantly increases the accuracy of the SVM solution [58]. The result obtained from the SVM therefore depends on the optimal choice of the kernel parameters. In the present study, the RBF kernel, the most widely used kernel function, was used to produce the landslide susceptibility map.
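For reference, the four kernels named above are commonly written as follows. This is the standard parameterization (as used, for example, in LIBSVM) and is given here as an assumption about the intended forms. In this convention, the offset r appears in the polynomial kernel as well as in the sigmoid kernel.

```latex
\begin{align}
K_{\mathrm{LF}}(x_i, x_j)  &= x_i^{T} x_j \\
K_{\mathrm{PF}}(x_i, x_j)  &= \left(\gamma\, x_i^{T} x_j + r\right)^{d}, \quad \gamma > 0 \\
K_{\mathrm{RBF}}(x_i, x_j) &= \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0 \\
K_{\mathrm{SF}}(x_i, x_j)  &= \tanh\!\left(\gamma\, x_i^{T} x_j + r\right)
\end{align}
```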

Multi-Layer Perceptron Networks (MLPNs)
An MLPN is one of the most important and most common ANNs [63], which are artificial intelligence information processing systems. The ANN allows for the solution of complex problems of classification, functional estimation, and optimization through the estimation of a linear or non-linear relationship between the input and output data [64,65,66,67,68]. It can map information from one multivariate space to another, building a model that generalizes and predicts output from input [64,69,70,71,72,73]. This non-linear function approximation algorithm is often used to solve classification problems [74]. In addition, the ANN can be used to classify a terrain into ordinal zones of landslide susceptibility [75].
The MLPN, which is often used for defining non-linear relationships, has one or more hidden layers between its input and output layers [76]. Hidden layers can increase the network performance when modeling complex functions [77]. While the input layer is responsible for receiving data, the output layer determines the results of the model.

Feature Selection Based on the One-R Attribute Evaluation Technique
In landslide prediction problems, it is very important to select suitable factors, which can be used to generate the optimal input data for training and testing of the machine learning models. Feature selection is one of the techniques for this task in landslide susceptibility modeling. It evaluates the importance of each factor in predicting the final results, so that irrelevant or unimportant factors can be removed from the input space. Thus, it can increase the quality of the input data and enhance the predictive capability of landslide models by decreasing the dimensionality of the input space, preventing redundancy, and reducing noise and over-fitting problems [78]. Different feature selection methods are used to select suitable factors for prediction modeling, such as Information Gain [79,80], Forward Elimination [20], Backward Elimination [81], and One-R attribute evaluation (ORAE) [82]. Out of these methods, the ORAE, which is one of the effective filter selection methods, was selected for the first time for landslide susceptibility modeling in this study. The main principle of the ORAE is to use the statistical correlation between the output variable and a set of input factors to select the most important factors for modeling. Using the ORAE, one rule (One-R) is constructed separately for each factor in the training dataset, and an error metric is calculated for each rule. On the basis of these error metrics, the factors are sorted independently according to their importance for the prediction problem, with the smallest error indicating the most important factor.
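The One-R idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: each factor is already reclassified into discrete classes, the rule for a factor predicts the majority label within each class, and the rule is scored by its accuracy on the training data (the smallest error corresponds to the highest accuracy). The function names and the toy data are hypothetical and do not come from the study. Library implementations may differ, for example by estimating the score with cross-validation.

```python
from collections import Counter, defaultdict


def one_r_accuracy(values, labels):
    """Accuracy of the one-factor rule: predict the majority label per factor class."""
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    # Within each factor class, the majority label is the rule's prediction.
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(labels)


def rank_factors(table, labels):
    """Return (factor, accuracy) pairs sorted from most to least predictive."""
    scores = {name: one_r_accuracy(column, labels) for name, column in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): 1 = landslide, 0 = non-landslide.
labels = [1, 1, 1, 0, 0, 0]
table = {
    "road_density": ["high", "high", "high", "low", "low", "low"],
    "aspect": ["N", "S", "N", "S", "N", "S"],
}
ranking = rank_factors(table, labels)
```

In this toy example `road_density` separates the two classes perfectly and is ranked first, while `aspect` is only slightly better than chance.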

Validation Methods
In this study, several statistical criteria, namely, sensitivity (SEN) [61], specificity (SPC), accuracy (ACC), MAE, RMSE [83], and area under the ROC curve (AUC) [43,84], were used to validate the applied prediction models. In general, higher SEN, SPC, ACC, and AUC values and lower MAE and RMSE errors show better performance of the models, and vice versa [43]. The SEN, SPC, ACC, and AUC are computed using four metrics: true positive (TP) (landslide correctly classified as landslide), true negative (TN) (non-landslide correctly classified as non-landslide), false positive (FP) (non-landslide incorrectly classified as landslide), and false negative (FN) (landslide incorrectly classified as non-landslide) [85,86]. By definition, the SEN and SPC are the fractions of landslide and non-landslide pixels, respectively, that are correctly classified [19,44]. The validation metrics can be formulated as follows:

$$\mathrm{SEN} = \frac{TP}{TP + FN}$$

$$\mathrm{SPC} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_{est.} - X_{obs.}\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{est.} - X_{obs.}\right)^{2}}$$

where X_est. and X_obs. are the predicted values obtained from modeling and the actual values obtained from real observation, respectively, and n is the total number of samples used in the datasets. Another standard statistical tool for validating the models is ROC curve analysis [87,88]. The ROC curve is plotted with 100-specificity on the x-axis and sensitivity on the y-axis. The AUC is used to judge the performance of a model: a value of 1 indicates a perfect model, while an AUC equal to 0.5 indicates a model that performs no better than chance [41]. The AUC is calculated based on the following equation:

$$\mathrm{AUC} = \frac{\sum TP + \sum TN}{P + N}$$

where P and N are the total numbers of landslide and non-landslide samples, respectively.
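A minimal sketch of how these metrics could be computed from predicted susceptibility scores is given below. Several choices are assumptions made for illustration and are not stated in the text: the 0.5 threshold used to turn scores into classes, the use of the raw scores as the predicted values in MAE and RMSE, and the pairwise-ranking (Mann-Whitney) formulation of the AUC, which is a standard way to obtain the area under the ROC curve. The toy data are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 = landslide."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def validation_metrics(y_true, scores, threshold=0.5):
    """SEN, SPC, ACC from thresholded scores; MAE, RMSE from the raw scores."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = len(y_true)
    return {
        "SEN": tp / (tp + fn),
        "SPC": tn / (tn + fp),
        "ACC": (tp + tn) / n,
        "MAE": sum(abs(s - t) for s, t in zip(scores, y_true)) / n,
        "RMSE": math.sqrt(sum((s - t) ** 2 for s, t in zip(scores, y_true)) / n),
    }


def roc_auc(y_true, scores):
    """Probability that a landslide sample outscores a non-landslide one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three landslide and three non-landslide samples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
metrics = validation_metrics(y_true, scores)
auc = roc_auc(y_true, scores)
```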

Development of the MBNBT Model for Landslide Susceptibility Mapping
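Landslide susceptibility assessment using the MBNBT was carried out in four main steps: (1) dataset generation, (2) model construction, (3) model validation and comparison, and (4) development of the landslide susceptibility map.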

Generation of Datasets
In the initial step, the training and testing datasets were generated for landslide spatial prediction by dividing the landslides into training and testing parts [89]. In this study, the training dataset was generated from 174 landslide locations (70% of the landslides) and 174 non-landslide locations, whereas the testing dataset was generated from 74 landslide locations (the remaining 30%) and 74 other non-landslide locations. The training and testing landslides were randomly selected at a ratio of 70/30 using the built-in random point selection tool in the ArcGIS software. Fifteen landslide conditioning factors were taken into account when generating the datasets. The ORAE feature selection method was applied to evaluate the importance of these factors in order to select suitable factors for model construction and validation.
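The 70/30 split described above can be reproduced outside a GIS with a short sketch. The integer identifiers stand in for point locations, and the seed and function name are hypothetical. The study itself used the random point selection tool in ArcGIS, so this only illustrates the arithmetic of the split.

```python
import random


def split_locations(locations, train_fraction, rng):
    """Shuffle the locations and split them into training and testing parts."""
    shuffled = list(locations)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(42)                     # fixed seed, for reproducibility only
landslide_ids = list(range(248))            # 248 inventory landslide locations
non_landslide_ids = list(range(248, 496))   # an equal number of non-landslide points

ls_train, ls_test = split_locations(landslide_ids, 0.7, rng)
nl_train, nl_test = split_locations(non_landslide_ids, 0.7, rng)

# Label 1 = landslide, 0 = non-landslide.
training = [(i, 1) for i in ls_train] + [(i, 0) for i in nl_train]
testing = [(i, 1) for i in ls_test] + [(i, 0) for i in nl_test]
```

With 248 landslides, a 70% share rounds to 174 training and 74 testing locations, which matches the counts reported in the text.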

Model Construction
Model construction involved two main steps: (i) optimization and (ii) classification. (i) Optimization: In this step, the MB ensemble was used to optimize the original training dataset in order to create optimal inputs for classification. Different sub-training datasets were first generated using different numbers of iterations. Thereafter, the optimal input data were determined with the optimal number of iterations. In this study, the optimal number of iterations was set to 20.
(ii) Classification: In this step, the NBT classifier was applied to classify two variables (landslide and non-landslide) using optimal training datasets generated by the MB for the spatial prediction of landslides.

Model Validation and Comparison
In this step, the testing dataset was used to validate the performance of the models. The methods, namely, SEN, SPC, ACC, MAE, RMSE, and AUC were used to validate the models. In addition, other single benchmark models, namely, SVM, MLPN, and single NBT were selected for comparison.

Development of Landslide Susceptibility Map
In this step, the results of the trained models were used to produce the landslide susceptibility maps. These maps were divided into five susceptibility classes (very high, high, moderate, low, and very low) based on the susceptibility indices, using the geometrical intervals method [90].

Importance of Landslide Conditioning Factors Using the ORAE Method
The results of the factor selection based on the ORAE method are shown in Figure 4.

Model Validation and Comparison
Validation of the newly proposed model and the benchmark landslide models was carried out on both the training and testing datasets. The comparative validation results of the different models are shown in Figure 5 and Tables 1 and 2 for the training and validation datasets. The training dataset was used for goodness-of-fit analysis, whereas the predictive capability of the models was evaluated using the validation dataset. The results based on the training dataset revealed that all models predicted the spatial distribution of landslides very well. However  Table 2). The results of the benchmark models showed that although the NBT had the best goodness-of-fit on the training dataset (AUC = 0.831), the highest predictive capability on the validation dataset was achieved by the MLPN (AUC = 0.810) model, followed by the NBT (AUC = 0.802) and SVM (AUC = 0.800) models (Table 2). It can be concluded that the newly proposed model (MBNBT) outperformed the other soft computing benchmark models (SVM, MLPN, and NBT) for the spatial prediction of landslides.

Development of Landslide Susceptibility Map
Mapping of the areas with landslide potential in the Mu Cang Chai District was carried out using all models, and the final maps are shown in Figures 6-9, individually. The distribution of pixels among the susceptibility classes was also calculated and is shown in Figures 6-9. The results showed that most of the pixels in the study area fell in the very low class (39.19%), followed by the very high class (23.40%), high class (16.86%), low class (10.35%), and moderate class (10.21%), whereas the landslide pixels were found mainly in the very high class (82.86%), followed by the high class (8.87%), moderate class (4.03%), very low class (2.42%), and low class (2.02%). Figure 10 shows the frequency ratio analysis of the four machine learning methods for landslide spatial prediction in the study area. Analysis of the findings indicated that the produced map was highly appropriate, as most landslide pixels were found in the very high and high classes.
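As a worked illustration of the frequency ratio, the percentages quoted above can be combined class by class. The sketch assumes the usual definition, namely the share of landslide pixels in a class divided by the share of all pixels in that class; whether Figure 10 uses exactly this definition is an assumption. The variable names are hypothetical.

```python
class_names = ["very low", "low", "moderate", "high", "very high"]

# Share of all study-area pixels in each class (%), as quoted in the text.
area_pct = {"very low": 39.19, "low": 10.35, "moderate": 10.21,
            "high": 16.86, "very high": 23.40}

# Share of landslide pixels in each class (%), as quoted in the text.
landslide_pct = {"very low": 2.42, "low": 2.02, "moderate": 4.03,
                 "high": 8.87, "very high": 82.86}

# Frequency ratio: values above 1 mean landslides are over-represented in the class.
frequency_ratio = {name: landslide_pct[name] / area_pct[name] for name in class_names}

for name in class_names:
    print(f"{name:>9}: FR = {frequency_ratio[name]:.2f}")
```

Under this definition, only the very high class has a frequency ratio above 1 (about 3.5), which is consistent with the statement that landslide pixels concentrate in the most susceptible zone.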

Verification of the Landslide Susceptibility Map
Validation of the efficiency of the machine learning models in producing landslide susceptibility maps was done using statistical metrics. The accuracy of the models was evaluated using the ROC curve and AUC methods. The training (goodness-of-fit) and validation (prediction accuracy) datasets were analyzed based on the ROC curve method (Figure 11). All four studied models showed high goodness-of-fit (AUC > 0.814); however, the newly proposed model (MBNBT) had the best prediction accuracy (AUC = 0.825) for the spatial prediction of landslides.

Discussion
The objective of spatial landslide modeling is to generate a valid and accurate susceptibility map [44,89]. So far, many models have been built for landslide susceptibility mapping over the past few decades, out of which, machine learning algorithm-based ensemble models have received more attention in recent years [44]. In the present study, a novel hybrid intelligent model, namely, MBNBT, was introduced for the spatial prediction of landslides in the Mu Cang Chai District of Yen Bai Province (Vietnam). The performance of the MBNBT was compared with benchmark models such as MLPN, SVM, and single NBT.
Fifteen landslide conditioning factors (slope, aspect, profile curvature, curvature, plan curvature, elevation, distance to rivers, river density, distance to roads, road density, distance to faults, fault density, land use, lithology, and rainfall) were selected for landslide modeling based on site conditions and experience. The ORAE technique was used to select the most important conditioning factors for the landslide spatial prediction. Considering the acceptable performance of the models in the training and testing stages, the ORAE can be considered a powerful technique for selecting important factors, which enhances the predictive capability of base/individual models while decreasing noise and reducing over-fitting problems. The results of the ORAE method showed that all fifteen studied factors contributed to landslide occurrences, but road density was the most important factor, as most of the observed landslides in the study area were located near roads. However, we want to make it clear that roads themselves are not responsible for landslides; rather, the excavation for the roads creates instability in the surrounding ground mass, which leads to landslides [13,27,44]. Thus, human interference in the existing geo-environmental conditions of the area plays an important role in landslide occurrences.
A comparative study of the predictive capability of the models using the SEN, SPC, ACC, MAE, RMSE, and AUC methods indicated that the proposed novel model MBNBT was the best model for landslide spatial prediction. However, the other models also gave a reasonable performance. The results of the MLPN model showed that the performance of this model was better than that of the SVM and NBT models, which is in agreement with the findings of Pradhan and Lee [91] and Conforti et al. [92]. Thus, the MLPN model can be used successfully for the spatial prediction of landslides. Garosi et al. [93] reported that the MLPN had good predictive performance. However, the data sampling method employed significantly affects the performance of the MLPN, especially when the training dataset is small [94], and this problem is considered a major deficiency of MLPNs [93]. The performance of the SVM model was reasonable in landslide susceptibility mapping [95,96]. The present study showed that the NBT had the lowest performance compared to the other models. This is because the NB-based algorithm relies on the assumption of independence among predictor variables, which affects its predictive accuracy [97]; therefore, the performance of the NBT depends on the independence assumption [98]. Analysis of the model results showed that the proposed MBNBT was a better model because hybrid intelligence is considered more effective than single classifiers [96]. The MBNBT takes advantage of the combination of two machine learning methods, namely, MB and NBT. More specifically, the MB used in the MBNBT is known as an effective ensemble method that is able to improve the classification accuracy of single classifiers like the NBT [52]. Likewise, the NBT used in the MBNBT is also a good and encouraging method in landslide prediction [98] and has the advantages of both DTs and NB [55].
On the other hand, the input dataset used for MBNBT was optimized during the training process; therefore, it helped to increase the classification accuracy of the MBNBT compared with other single classifiers (MLPN, SVM, and NBT). Overall, all the studied models had reasonable efficiency for predicting the area of landslide occurrence in the study area, but the MBNBT model had the highest efficiency; thus, it can be used for better landslide susceptibility mapping.

Concluding Remarks
In the present study, a novel hybrid machine learning model, namely, MBNBT, was proposed for the spatial prediction of landslides in the Mu Cang Chai District of Yen Bai Province. This model is a combination of two effective machine learning techniques: the MB ensemble and the NBT base classifier. The ORAE technique was used for the selection of landslide affecting factors. Model validation was done using statistical metrics: SEN, SPC, ACC, MAE, RMSE, and AUC. Performance of the proposed model was compared with other popular models, namely, MLPN, SVM, and single NBT. Results indicated that the proposed model, MBNBT, outperformed (AUC = 0.824) the MLPN (AUC = 0.804), SVM (AUC = 0.804), and NBT (AUC = 0.800) models. Thus, the proposed novel model, MBNBT, is a promising machine learning method for landslide spatial prediction that can also be applied in other landslide-prone areas. In this study, we used a ratio of 70/30, which is a common ratio for generating training and testing datasets in landslide prediction modeling. However, we propose to evaluate the performance of the models with different ratios of training and testing datasets in future works to determine whether a better ratio exists.
In the present study, all the conditioning factors used for modeling had some effect on the prediction results. Thus, all these factors were considered in the analysis. However, for the removal of less important features, sensitivity analysis can be adopted depending on the required factor of safety of the slope. It is also proposed to identify and classify the different vegetation in the area that can prevent seepage and also act as an anchor in stabilizing the ground mass.