Novel Ensemble Landslide Predictive Models Based on the Hyperpipes Algorithm: A Case Study in the Nam Dam Commune, Vietnam

Abstract: Development of landslide predictive models with strong prediction power has become a major focus of many researchers. This study describes the first application of the Hyperpipes (HP) algorithm for the development of five novel ensemble models that combine the HP algorithm with the AdaBoost (AB), Bagging (B), Dagging, Decorate, and Real AdaBoost (RAB) ensemble techniques for mapping the spatial variability of landslide susceptibility in the Nam Dan commune, Ha Giang province, Vietnam. Information on 76 historical landslides and ten geo-environmental factors (slope degree, slope aspect, elevation, topographic wetness index, curvature, weathering crust, geology, river density, fault density, and distance from roads) was used to construct the training and validation datasets that are the prerequisites for building and testing the proposed models. Using different performance metrics (i.e., the area under the receiver operating characteristic curve (AUC), negative predictive value, positive predictive value, accuracy, sensitivity, specificity, root mean square error, and Kappa), we verified the proficiency of all five ensemble learning techniques in increasing the fitness and predictive powers of the base HP model. Based on the AUC values derived from the models, the ensemble ABHP model, which yielded an AUC value of 0.922, was identified as the most efficient model for mapping the landslide susceptibility in the Nam Dan commune, followed by RABHP (AUC = 0.919), BHP (AUC = 0.909), Dagging-HP (AUC = 0.897), Decorate-HP (AUC = 0.865), and the single HP model (AUC = 0.856), respectively. The novel ensemble models proposed for the Nam Dan commune and the resultant susceptibility maps can aid land-use planners in the development of efficient mitigation strategies in response to destructive landslides.


Introduction
Landslides are global geohazards that are responsible for substantial death and injury [1], as well as damage to the natural and built environment [2]. Previous studies reported that landslides annually cause more than 4300 fatalities and global economic losses of tens of billions of USD [3]. The distribution maps of landslide susceptibilities are essential tools for mitigating the devastating effects of landslides. These maps provide engineers and managers with operational guidance and reference for making timely decisions to proactively manage unstable terrain and mitigate the effects of potential landslides [4]. However, the preparation of reliable maps remains challenging because landslides are influenced by an interplay of different climate, geologic, geographic, and anthropogenic factors [5].
Landslide susceptibility can be modeled using a variety of GIS-aided methods, including logistic regression [6], simplified statistical/probabilistic frequency ratios [7], analytical hierarchy process [8], statistical indices [9], weight of evidence [10], evidential belief functions [11], certainty factors [12], and geographically weighted regression [13]. Another suite of approaches is machine learning methods, which are well suited to processing large datasets that exhibit the non-linear and complex relationships typically associated with environmental problems, particularly natural hazards such as floods [14], wildfires [15], sinkholes [16], drought [17], earthquakes [18], gully erosion [19], and land subsidence [20,21]. These methods can recognize the discrepancy between historical records and different landscape-level variables to predict future events [22].
Another approach that has been proven effective in many fields of science, including landslide modeling and mapping, is ensemble modeling [41], which emerged in the early 1990s [42]. In terms of accuracy and robustness, ensemble methods have outperformed single model methods [43]. Recent advances in machine learning algorithms and computational powers, coupled with an increase in the availability of high-resolution data, now allow a wider use of ensemble modeling in landslide prediction [44]. Examples of popular ensemble techniques include Bagging, Rotation Forest, Random Subspace and MultiBoost. These methods, in turn, have been used to enhance the performance of base NBT [45], BLR [32], LMT [37], ANN [46], random forest [47], J48 decision tree [48], SVM [49], stochastic gradient descent [41], and alternating decision tree [50] methods.
Because landslides have such serious impacts on society, there is a need to continually improve the predictability of landslide susceptibility models for different environmental settings. The key objective of our modeling research is to improve the reliability of landslide modeling and the effectiveness of landslide susceptibility maps for land-use planners and decision makers. To achieve this, we use a Hyperpipes (HP) model in an ensemble learning framework to predict landslide susceptibility in the Ha Giang province, Vietnam. We couple the HP model with AdaBoost, Real AdaBoost, Bagging, Dagging, and Decorate ensemble learners. To the best of our knowledge, HP has not yet been employed as a base method for developing ensemble models for the prediction of natural hazards. HP is a straightforward, fast-executing algorithm that performs well on different datasets and can tailor its process to extend the scope of the datasets on which it is effective. Here, we first formulate the likelihood of landslide occurrences as a binary classification problem. We next propose multiple copies of the HP model trained by different ensemble learning techniques. Lastly, we verify and compare the models in terms of their fitness and predictive powers, and generate distribution maps of landslide susceptibility with each model.

Description of the Study Area
Nam Dan commune is situated in the southern part of the Xin Man district, Ha Giang province, Vietnam (Figure 1). With a relatively complex terrain and an average altitude of 1000 m, the region is visibly divided by high mountains. The climate of the area is tropical monsoon, so it is often stormy in summer, with an annual rainfall of about 2300-2400 mm. The area has two distinct seasons: a rainy season and a dry season. While the rainy season typically extends from May to October, the dry season starts in November and ends in April of the next year. Forests cover most parts of the Nam Dan commune. However, in recent years, the forest area has been reduced by the development of farmlands and by fires, which has intensified the risk of landslides that, in turn, cause significant damage to equipment and people. Through a field survey, over 71 landslide sites that occurred from 2011 to 2019 have been detected in the Nam Dan commune, mostly concentrated along provincial road No. 178. Local people are living on the slopes and at the foot of the landslides, which can be extremely dangerous.

Modeling Methodology
In this paper, the prediction of landslide susceptibility for the Nam Dan commune was modeled in four main steps that are shown in Figure 2 and described as follows: (1) data collection and preparation; (2) development and validation of the models using the Weka software [51]; (3) generation of susceptibility maps using ArcGIS (https://desktop.arcgis.com/en/); and (4) reliability analysis of the generated susceptibility maps.

Generation of an Inventory Map of Historical Landslides
A map of the landslides that have occurred in the recent past is an important component for modeling landslide susceptibility using machine learning methods [52]. To generate such a map for the Nam Dan commune, the records of historical landslides were obtained from the SRV-10/0026 project that had been conducted in the study area. These records were verified and updated via visual interpretation of remotely sensed data and field surveys (October 2018 and August 2019); landslide locations that occurred at the talus of provincial road No. 178 in August 2019 were also added to the initial records. In total, 76 locations of landslide events were identified across the Nam Dan commune. Most of the slides were of the shallow landslide type. The detected landslides were divided into a training dataset and a validation dataset that comprised 53 (~70%) and 23 (~30%) landslides, respectively.
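The 70/30 partition described above can be illustrated with a minimal sketch. It assumes a simple random split, which the text does not state explicitly; the function name `split_inventory`, the seed, and the use of integer identifiers for landslide locations are all hypothetical.

```python
import random

def split_inventory(landslide_ids, train_fraction=0.7, seed=0):
    """Randomly split landslide locations into training and validation sets.

    Assumption: a plain random split; the study may have used another scheme.
    """
    ids = list(landslide_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# 76 inventory locations, identified here only by hypothetical integer IDs
train_ids, valid_ids = split_inventory(range(76))
print(len(train_ids), len(valid_ids))
```

With 76 locations and a 0.7 fraction, the sketch reproduces the 53/23 partition sizes reported in the text.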

Landslide Conditioning Factors
Modeling of landslide susceptibility is the process of investigating the associations between historical landslides and a suite of explanatory variables known as landslide conditioning factors, which typically characterize the geomorphology, hydrology, climate, and anthropogenic conditions of the research area [49,53]. For modeling the landslide susceptibility in the Nam Dan commune, ten conditioning factors were selected and used: slope aspect, slope degree, elevation, curvature, topographic wetness index (TWI), weathering crust, river density, fault density, distance from roads, and geology. Maps of these factors are presented in Figures 3 and 4. Currently, the development of data processing techniques and the data-sharing policies of organizations such as NASA have facilitated the collection of such conditioning factors. Slope aspect is a widely used topographic factor for landslide susceptibility mapping because it is related to many other factors that affect landsliding, such as solar radiation and rainfall amount [54]. Slope degree is a significant factor for modeling landslide susceptibility due to its strong effect on the sliding of material and the flow of water [55]. Elevation characterizes the local climate conditions and the resistance of the slope-forming materials to failure [56]. Curvature affects the accumulation and flow of water on the terrain surface, thus influencing the probability of landslide occurrences [57]. TWI was selected as another conditioning factor because it captures the relationship between topography and moisture [58]. The thematic maps for these five topographic factors were derived from a 10-m Digital Elevation Model (DEM) that was constructed using 1:10,000 topographic maps. Weathering crust was selected in this study to represent differences in the weathering conditions of the rocks, which in turn influence the probability of landsliding [59].
Data for the weathering crust were derived from the 1:10,000 geomorphology map prepared by the SRV-10/0026 project. Rivers generally adversely affect the stability of slopes due to erosion and scouring of the side slopes of the valley and saturation of the groundmass within flood height [60,61]. Therefore, the inclusion of information related to rivers is vital for landslide modeling and mapping. In this study, the rivers of the Nam Dan commune were derived from the DEM and used to generate a thematic map of river density. Faults induce instability in the ground and rock mass that typically leads to landslides [62,63]. In this study, faults were interpreted from SAR images (COSMO-SkyMed with descending orbit, band X, 3 m resolution) and used to generate a thematic map of fault density for the Nam Dan commune. Road construction is one of the main causes of landslide occurrences in many regions [64]. A thematic map of distance to roads is usually used as a proxy for explaining the effects of human activities on landsliding. In this study, the road networks of the Nam Dan commune were derived from a topographic map with a scale of 1:10,000. Geology is another important factor for landslide susceptibility modeling because it reflects the rockmass/groundmass condition of the Nam Dan commune for the stability analysis [65]. In this study, geological units of the Nam Dan commune were identified using the geological map of the General Department of Geology and Minerals of Vietnam.
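As an illustration of how one of the DEM-derived factors is typically computed, the sketch below uses the widely cited definition TWI = ln(a / tan β), where a is the specific catchment area and β is the local slope angle. The text does not give the formula used in this study, so this definition, the function name `twi`, and the minimum-slope guard are assumptions.

```python
import math

def twi(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """Topographic wetness index, ln(a / tan(beta)).

    specific_catchment_area: upslope area per unit contour length (m).
    slope_deg: local slope in degrees; clamped to min_slope_deg (an assumed
    guard) so that flat cells do not cause a division by zero.
    """
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))

# Same contributing area, gentler slope: wetter (higher TWI)
print(twi(100.0, 45.0), twi(100.0, 5.0))
```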

Factors Importance
To check the significance of the factors identified for landslide modeling in the Nam Dan commune, we used the One-R feature selection technique [66]. The One-R technique works based on correlation analysis among the conditioning factors and the historical landslides to select the most significant factors for the modeling process [66]. Using the One-R technique, each factor is assigned a weight that indicates the average merit (AM) of the factor in landslide modeling. Based on the AM, one can focus more accurately on the crucial factors and remove unnecessary factors to increase the predictive performance of the models.
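One common way to score a single factor in the spirit of One-R is to build a one-attribute rule (each factor class predicts its majority label) and report the rule's accuracy as a percentage. The sketch below follows that reading; it is an assumption that the AM values reported later correspond to such accuracy percentages, and the function name and toy data are hypothetical. The score here is computed on the training data itself, whereas a cross-validated average would normally be preferred.

```python
from collections import Counter, defaultdict

def one_r_merit(factor_classes, labels):
    """Accuracy (%) of a one-attribute rule: each factor class predicts
    the majority landslide label observed for that class."""
    by_class = defaultdict(Counter)
    for value, label in zip(factor_classes, labels):
        by_class[value][label] += 1
    # Each class contributes the count of its majority label
    correct = sum(c.most_common(1)[0][1] for c in by_class.values())
    return 100.0 * correct / len(labels)

# Hypothetical toy data: 1 = landslide, 0 = non-landslide
labels = [1, 1, 0, 0, 0, 0]
road_distance = ["near", "near", "near", "far", "far", "far"]
aspect = ["N", "S", "N", "S", "N", "S"]
print(one_r_merit(road_distance, labels), one_r_merit(aspect, labels))
```

In this toy case the road-distance classes separate the labels better than aspect, so they receive the higher merit.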

Hyperpipes (HP) Algorithm
HP is a very straightforward and fast classification algorithm that can deal with a large number of attributes [67][68][69]. The HP algorithm works as follows: (1) from the original training dataset, a single pipe is created for each class, and each pipe is labeled with its class; (2) the dataset is analyzed instance-by-instance; (3) for every attribute value in an instance, if the value has not yet occurred in the pipe of the instance's class, it is added to that pipe; (4) a new instance is compared with the attribute values currently stored in the pipes of every class; and (5) the instance is then assigned the class of the pipe that matches it best. The HP algorithm has mostly been used in medical science [67,68]. However, this algorithm can also be suggested for landslide modeling due to its several advantages [70].
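The five steps above can be sketched as follows. This is a minimal illustration, not the Weka implementation used in the study: it assumes that a pipe stores the observed range of each numeric attribute and the set of observed values of each nominal attribute, and that the best match is the pipe containing the largest share of an instance's attribute values. The class and attribute names are hypothetical.

```python
class HyperPipesSketch:
    """Minimal Hyperpipes-style classifier (illustrative assumption)."""

    def fit(self, rows, labels):
        self.pipes = {}  # class label -> per-attribute bounds
        for row, label in zip(rows, labels):
            pipe = self.pipes.setdefault(label, [None] * len(row))
            for j, value in enumerate(row):
                if isinstance(value, (int, float)):
                    lo, hi = pipe[j] if pipe[j] else (value, value)
                    pipe[j] = (min(lo, value), max(hi, value))  # widen range
                else:
                    pipe[j] = (pipe[j] or set()) | {value}      # add nominal value
        return self

    def match(self, row, label):
        """Fraction of attribute values that fall inside the class pipe."""
        hits = 0
        for value, bound in zip(row, self.pipes[label]):
            if isinstance(bound, tuple):
                hits += bound[0] <= value <= bound[1]
            elif isinstance(bound, set):
                hits += value in bound
        return hits / len(row)

    def predict(self, row):
        return max(self.pipes, key=lambda label: self.match(row, label))

# Hypothetical toy data: (slope in degrees, lithology); 1 = landslide
rows = [(10, "granite"), (12, "granite"), (30, "schist"), (35, "schist")]
model = HyperPipesSketch().fit(rows, [0, 0, 1, 1])
print(model.predict((32, "schist")), model.predict((11, "granite")))
```

Because training only widens bounds and prediction only counts containment, both steps are linear in the number of attributes, which is consistent with the speed attributed to HP in the text.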

AdaBoost
AdaBoost, proposed by Freund and Schapire [71], is a boosting ensemble technique developed for improving the performance of weak classifiers. Using an adaptive boosting approach, AdaBoost generates one classifier at a time; each classifier is trained on a dataset drawn selectively from the initial dataset, with the likelihood of selecting difficult data samples progressively increased at each step [57,72]. Therefore, this technique can control not only the bias but also the variance among the data. Generally, an initial classifier is built from a subset formed from the initial dataset. The model based on this initial classifier is then used to predict all samples in the original dataset. Afterwards, a new subset is generated after the classification and evaluation of the error. This process is repeated until the base classifier reaches its optimum performance. AdaBoost has been frequently used in combination with various classifiers, such as logistic regression [57], functional tree [73], and neural network [74], for landslide prediction.
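The re-weighting idea described above can be made concrete with one boosting round. The sketch assumes the standard discrete AdaBoost update, in which the classifier weight is alpha = 0.5 ln((1 - err) / err) and sample weights are multiplied by exp(±alpha) before normalization; the text does not spell out the update, and the function name is hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One discrete-AdaBoost round.

    weights: current sample weights (summing to 1).
    correct: booleans, True where the current classifier was right.
    Returns the classifier weight alpha and the updated sample weights.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < err < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples are up-weighted, correct ones down-weighted
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, which forces the next classifier to concentrate on it.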

Bagging
Bagging, one of the primary ensemble techniques, was developed by Breiman [75]. It combines multiple component learners to decrease the variance of the base classifier, leading to an increase in prediction accuracy. Bagging uses bootstrap sampling with replacement to generate multiple training subsets with sizes equal to that of the initial training dataset [76]. Each subset is then used to train one classifier. Finally, all the classifiers are aggregated by a majority vote for classification. Using this methodology, Bagging can mitigate the faults of component learners and improve the recognition performance of unstable classifiers. Bagging has been frequently used in combination with various classifiers, such as alternating decision tree [38], Credal decision tree [77], kernel logistic regression [78], functional tree [73], and random forest [79], for landslide susceptibility mapping.
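The bootstrap-and-vote procedure described above can be sketched in a few lines. The base learner here is a deliberately simple nearest-class-mean rule on one feature, chosen only to keep the example short; it is not the HP model, and all names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def fit_nearest_mean(sample):
    """Toy base learner: predict the class whose feature mean is closest."""
    values = defaultdict(list)
    for x, label in sample:
        values[label].append(x)
    means = {label: sum(v) / len(v) for label, v in values.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging_predict(data, fit, x, n_models=11, seed=0):
    """Train n_models learners on bootstrap samples and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter(fit(bootstrap_sample(data, rng))(x) for _ in range(n_models))
    return votes.most_common(1)[0][0]

# Hypothetical (slope, label) pairs; 1 = landslide
data = [(5, 0), (8, 0), (10, 0), (30, 1), (35, 1), (40, 1)]
print(bagging_predict(data, fit_nearest_mean, 33))
```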

Dagging
Originally developed by Ting and Witten [80], Dagging is an ensemble learning technique that integrates various classifiers trained on different samples of the training dataset in order to enhance predictive accuracy [81]. The structure of the training and classification phases of the Dagging ensemble is similar to that of the Bagging algorithm. However, instead of bootstrap samples, the Dagging ensemble generates several disjoint, stratified samples and feeds each part of the data to a copy of the base classifier [80]. Finally, all the classifiers are aggregated by a majority vote for classification. Combined with models such as alternating decision tree [82] and functional tree [83], this technique has been frequently used for the prediction of landslides.
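The difference from Bagging lies in how the training data are partitioned. The sketch below shows one way to build disjoint, stratified folds; each fold would then train its own copy of the base classifier. The round-robin assignment, the function name, and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_disjoint_folds(labels, n_folds, seed=0):
    """Split sample indices into n_folds disjoint folds that preserve
    the class proportions of the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for k, index in enumerate(indices):
            folds[k % n_folds].append(index)  # deal indices round-robin
    return folds

# Hypothetical balanced dataset: 10 landslide and 10 non-landslide samples
labels = [1] * 10 + [0] * 10
folds = stratified_disjoint_folds(labels, n_folds=5)
print([len(f) for f in folds])
```

Unlike bootstrap samples, these folds never share a sample, so each base classifier sees a different part of the data.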

Decorate
The Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples method, also called Decorate, was introduced by Melville and Mooney [84]. Decorate is a straightforward meta-learner that can create diverse classifiers from the distribution characteristics of the original training dataset. While Bagging and AdaBoost generate classifiers from the training dataset alone, Decorate augments the training dataset by generating artificial training examples and adding them to the original training dataset, and an ensemble is then generated iteratively. In each iteration, a classifier trained on the original data plus the newly created artificial training examples is considered for addition to the current ensemble. The procedure is repeated until the desired committee size is obtained. For the Decorate algorithm, the two key phases are (1) generating and adding the artificial training examples; and (2) evaluating the accuracy of the classifier trained on the augmented training dataset. Among the different ensemble learning techniques, Decorate is the only technique that utilizes artificially generated samples to improve prediction quality [85].
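The "oppositional relabeling" in the method's name refers to how artificial examples receive their labels: a label is drawn with probability inversely proportional to the current ensemble's predicted class probability, so that the new examples disagree with the ensemble. The sketch below shows only this labeling step, under that commonly described rule; the function name and the small floor `eps` for zero probabilities are assumptions.

```python
def oppositional_label_distribution(ensemble_probs, eps=1e-3):
    """Label distribution for an artificial example.

    ensemble_probs: class probabilities predicted by the current ensemble.
    Returns probabilities proportional to 1 / p for each class, so the
    class the ensemble favors becomes the least likely label.
    """
    inverse = [1.0 / max(p, eps) for p in ensemble_probs]
    total = sum(inverse)
    return [value / total for value in inverse]

# Ensemble is 90% sure of class 0, so the artificial example is
# labeled class 1 with 90% probability
print(oppositional_label_distribution([0.9, 0.1]))
```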

Real AdaBoost
Based on the AdaBoost ensemble technique, Schapire and Singer [86] proposed the Real AdaBoost ensemble technique. Real AdaBoost, an adaptive boosting technique, has been widely employed in many fields related to statistics and machine learning [87]. Real AdaBoost uses local optimum criteria that often yield better convergence than the AdaBoost technique [88]. Real AdaBoost has been used in combination with classifiers such as alternating decision tree [82] for landslide prediction.
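To show how Real AdaBoost differs from discrete AdaBoost, the sketch below uses a commonly described formulation in which each weak learner outputs a class probability p(x) that is converted into a real-valued contribution f(x) = 0.5 ln(p / (1 - p)), and sample weights are multiplied by exp(-y f(x)) with labels y in {-1, +1}. The text does not state the update rule, so this formulation and all names are assumptions.

```python
import math

def real_contribution(p, eps=1e-6):
    """Real-valued vote from a class-probability estimate p = P(y = +1 | x)."""
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithm finite
    return 0.5 * math.log(p / (1.0 - p))

def real_adaboost_reweight(weights, labels, contributions):
    """Update sample weights; labels are -1 or +1."""
    raw = [w * math.exp(-y * f) for w, y, f in zip(weights, labels, contributions)]
    total = sum(raw)
    return [w / total for w in raw]

# Two landslide samples (y = +1): one predicted confidently, one poorly
f_values = [real_contribution(0.9), real_contribution(0.2)]
print(real_adaboost_reweight([0.5, 0.5], [1, 1], f_values))
```

The poorly predicted sample gains weight in proportion to how confident the wrong prediction was, rather than by a single fixed factor as in discrete AdaBoost.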

Validation Methods
To assess and compare the ability of the six models proposed in this study, the following quantitative statistical measures were used: positive predictive value (PPV), negative predictive value (NPV), sensitivity (SST), specificity (SPF), accuracy (ACC), Kappa index, root mean square error (RMSE), and area under the receiver operating characteristic (ROC) curve (AUC). A full description of these methods and their corresponding formulas can be found in the literature [89][90][91].
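For reference, the measures listed above can be computed from a confusion matrix and predicted probabilities as in the following sketch. It uses the standard textbook definitions (the cited literature gives the formulas used in the study); the function names and the example counts are hypothetical.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """PPV, NPV, sensitivity, specificity, accuracy, and Cohen's Kappa."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - p_exp) / (1 - p_exp),
    }

def rmse(y_true, y_prob):
    """Root mean square error between 0/1 labels and predicted probabilities."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true))

def auc(y_true, y_prob):
    """AUC as the probability that a landslide sample outranks a non-landslide one."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confusion counts and scores
print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```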

Susceptibility Mapping
Once the HP model and its five ensemble models were successfully trained and validated using the training and validation datasets, they were used to generate the landslide susceptibility maps. To do so, we first computed the landslide susceptibility indices for all pixels of the study area, and then categorized the indices into different classes that depicted very low, low, moderate, high, and very high susceptibility to landslide occurrence within the Nam Dan commune [5].

Analysis of Factor Significance
The relevance of conditioning factors to landslide occurrence exerts a significant effect on the quality of the prediction output [92]. However, no universal method has yet been suggested for the best selection of conditioning factors [93]. Here, we used the One-R technique to quantify the significance of the selected factors. The results revealed that distance from roads, with AM = 82.379, was the most useful variable for describing the distribution of landslide susceptibility in the Nam Dan commune, followed by elevation (AM = 80.552), river density (AM = 72.915), weathering crust (AM = 68.75), and fault density (AM = 67.186) (Table 1). Since all ten factors showed AM ≠ 0, they were all included in the modeling process [91,94]. From another point of view, the factor analysis revealed that, in our study area, landslides are mostly associated with road networks, demonstrating the impact of human activities on slope instabilities, which intensifies the landslide susceptibilities [95].

Evaluation of Models Performance
The models were trained using the training dataset and then validated using the validation dataset. In the training phase (Table 2), the single HP and ensemble Decorate-HP models achieved the highest possible value of the PPV metric (100%), which indicated that these two models had the highest capability to distinguish between landslide pixels and other pixels across the area. In terms of the NPV metric, the ensemble RABHP model, with a value of 85.94%, performed the best in distinguishing between non-landslide pixels and other pixels across the area. For the SST metric, which measures the model's ability to correctly assign the landslide pixels to the landslide class, the RABHP model, with a value of 87.50%, was the best model. Regarding the SPF metric, which measures the model's ability to correctly classify the non-landslide pixels in the non-landslide class, the single HP and ensemble Decorate-HP models, with a value of 100%, were dominant over the other models. In terms of the ACC metric, which indicates the overall model accuracy, the RABHP model (92.19%) was identified as the most accurate model in the training phase. Further, this model achieved the highest agreement (Kappa = 0.844) between observations and predictions. To evaluate the practicability of the six proposed models, they must be tested with unseen data (i.e., the validation dataset) to compute the prediction rate of the models. In the validation phase (Table 3) [24,96]. The most logical explanation for these different performances is that the methods were developed based upon different computational algorithms, and therefore perform differently on different datasets. For example, Quinlan [97] suggested that the main reason for AdaBoost's failure is overfitting. According to Breiman [75], Bagging performs best if the base model is unstable.
Regarding the magnitude of the training and validation errors, the ensemble ABHP and Decorate-HP models, with the lowest and highest RMSEs, were identified as the most and least accurate models, respectively. The RMSEs ranged from 0.334 (ABHP) to 0.495 (Decorate-HP) in the training phase and from 0.362 (ABHP) to 0.496 (Decorate-HP) in the validation phase of the modeling process (Figure 5). The overall performance of the models proposed for the Nam Dan commune was evaluated by the ROC method, which showed that all six models had excellent training and validation performances. More specifically, a comparative analysis of the AUC values derived from the models indicated that the ensemble RABHP model, with an AUC of 0.968, had the best training performance (i.e., fitness to the training data), followed by the BHP, Dagging-HP, ABHP, Decorate-HP, and HP models, which achieved AUC values of 0.95, 0.937, 0.933, 0.895, and 0.895, respectively (Figure 6a). Regarding predictive power (i.e., validation performance) (Figure 6b), the ROC-AUC method showed that the ABHP model, with an AUC of 0.922, was the best model for predicting future landslide susceptibilities, followed by the RABHP, BHP, Dagging-HP, Decorate-HP, and HP models, which achieved AUC values of 0.919, 0.909, 0.897, 0.865, and 0.856, respectively. Although the models ranked differently in the training and validation phases, the ROC-AUC method clearly demonstrated that the ensemble models outperformed the single HP model. By integrating the single HP model with the ensemble learning techniques, the generalization and predictive power of the HP model improved by up to 8% and 7%, respectively. In agreement with our results, several previous studies reported the superiority of ensemble modeling over single modeling approaches [79,98].
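The improvement figures can be cross-checked against the reported AUC values. The sketch below assumes that they are relative gains of the best ensemble over the single HP model in each phase (RABHP in training, ABHP in validation); the text does not state how the percentages were derived, and the function name is hypothetical.

```python
def relative_gain_percent(ensemble_auc, base_auc):
    """Relative AUC improvement of an ensemble over the base model, in percent."""
    return 100.0 * (ensemble_auc - base_auc) / base_auc

# AUC values as reported in the text
train_gain = relative_gain_percent(0.968, 0.895)  # RABHP vs. HP, training
valid_gain = relative_gain_percent(0.922, 0.856)  # ABHP vs. HP, validation
print(round(train_gain, 1), round(valid_gain, 1))
```

Under this assumption the gains come out at roughly 8.2% and 7.7%, which is in line with the "up to 8% and 7%" stated above if the values were truncated.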

Evaluation of Susceptibility Maps
Using the models' outputs, the distribution maps of landslide susceptibility were constructed for the Nam Dan commune to support proper land-use planning for the development of infrastructure across the area. The maps were prepared to show very low, low, moderate, high, and very high susceptibilities to landslide occurrence within the commune (Figure 7). In the map produced using the ABHP model, approximately 5% and 15% of the land area fall in the very high and high susceptibility zones, respectively. In the map generated by the BHP model, approximately 12% of the Nam Dan commune is located in the high susceptibility zone and 5% in the very high susceptibility zone. In the map generated by the Dagging-HP model, 17% of the land area falls in the high susceptibility zone and 10% in the very high susceptibility zone. In the map generated by the Decorate-HP model, 65% of the land area falls in the high susceptibility zone and 15% in the very high susceptibility zone. The RABHP model locates approximately 27% of the land area in the high susceptibility zone and 10% in the very high susceptibility zone. In the map generated by the HP model, approximately 50% of the land area is in the high susceptibility zone and 12% in the very high susceptibility zone (Figure 8a). The frequency ratio analysis of the susceptibility maps revealed that the map derived from the ABHP model was the most trustworthy in comparison with the maps derived from the other five models. The percentage and frequency of landslide pixels in the high and very high susceptibility classes were higher in the ABHP map than in the maps of the other models (Figure 8b,c).
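The frequency ratio used to compare the maps can be illustrated with the usual definition: for each susceptibility class, the share of landslide pixels in that class divided by the share of the study area in that class. The text does not give the formula, so this definition is an assumption, and the pixel counts below are hypothetical rather than taken from the study.

```python
def frequency_ratio(landslide_pixels, class_pixels):
    """Frequency ratio per susceptibility class.

    landslide_pixels: class name -> number of landslide pixels in the class.
    class_pixels:     class name -> total number of pixels in the class.
    A ratio above 1 means landslides are over-represented in the class.
    """
    total_landslide = sum(landslide_pixels.values())
    total_area = sum(class_pixels.values())
    return {
        name: (landslide_pixels[name] / total_landslide) / (class_pixels[name] / total_area)
        for name in class_pixels
    }

# Hypothetical two-class example
ratios = frequency_ratio({"low": 2, "high": 8}, {"low": 800, "high": 200})
print(ratios)
```

A reliable map concentrates landslide pixels in a small high-susceptibility area, which shows up as a large ratio for the high classes and a small ratio for the low ones.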

Conclusions
Development of landslide predictive models with strong prediction power has become a major focus of many researchers. At present, attempts are in progress to develop reliable, accurate landslide models for different environmental settings worldwide. Here, we described the first attempt to model landslide susceptibilities for the Nam Dan commune of Vietnam using five novel ensemble models that integrated the HP algorithm with the AdaBoost, Bagging, Dagging, Decorate, and Real AdaBoost ensemble techniques. Several performance metrics were used to measure the fitness (i.e., training performance) and predictive power (i.e., validation performance) of the models over the training and validation phases using information from 76 historical landslides and 10 geo-environmental variables. The results showed that all five ensemble models performed excellently for landslide susceptibility mapping and improved the fitness and predictive powers of the base HP model. These improvements underscore the effectiveness of ensemble learning techniques in overcoming the variance, bias, and noise problems that are commonly associated with the landslide modeling process. Among the developed models, the ABHP model, which had the highest validation AUC value (0.922), showed the best performance for predicting landslide susceptibility across the Nam Dan commune. The landslide susceptibility maps produced within this case study, particularly the map produced by the ABHP model, can guide land-use planners and decision-makers to make more informed and efficient decisions in response to future landslides.