# A Novel Hybrid Approach Based on Instance Based Learning Classifier and Rotation Forest Ensemble for Spatial Prediction of Rainfall-Induced Shallow Landslides using GIS

## Abstract


## 1. Introduction

## 2. Study Area and Data

#### 2.1. Study Area

^{2}, between longitudes 106°41’34” E and 106°48’32” E, and latitudes 21°49’43” N and 21°57’13” N. The altitude varies from 194.5 m to 800 m above sea level, with a mean of 328 m and a standard deviation of 84.7 m. Slope angles in the study area range from 0° to 84°. Approximately 23.7% of the study area has ground slopes of less than 8°, and about 10.2% falls in slopes from 8° to 15°. Around 21.1% of the study area falls in slopes of 15°–25°, whereas areas with slopes of 25°–45° account for 43.5% of the total study area. Only 1.5% of the study area has slopes larger than 45°.

#### 2.2. Data Used

## 3. Theoretical Background of the Methods Used

#### 3.1. Instance Based Learning Algorithm

_{1}, X_{2}, …, X_{n}) and Y ∈ {1, 0}. In the current context of landslide susceptibility analysis, X_{i} is an input vector that represents the 14 influencing factors (slope, slope length, aspect, curvature, elevation, valley depth, toposhade, TWI, SPI, STI, landuse, soil type, lithology, and distance to faults), and Y_{i} is one of the two classes, landslide or non-landslide. In the training phase, the input dataset is mapped into feature space, and the feature space is then partitioned into multiple regions whose decision boundaries are based on the similarity in the content of the dataset [59]. In the prediction phase, the distances between each pixel in the new dataset and all the training pixels are calculated. Based on k thresholds, the nearest neighbors are determined by sorting these distances. The landslide and non-landslide classes of the nearest neighbors are then determined. Finally, the prediction value for each pixel is obtained by a simple majority vote over the classes of its nearest neighbors.

_{i}) is the similarity between the new data and the training data X_{i}; and Z(X_{i}, Y_{i}) is the category value of the training data X_{i}.
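The prediction phase described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`knn_predict`, `manhattan`, `euclidean`) and the two-factor toy pixels are assumptions made for the example, and ties in distance are resolved by input order.

```python
from collections import Counter


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Return the majority class among the k training samples nearest to query."""
    # Rank all training samples by their distance to the query pixel.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    # Simple majority vote over the classes of the k nearest neighbors.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy pixels with two factor values each; 1 = landslide, 0 = non-landslide.
train_X = [(30.0, 5.0), (35.0, 6.0), (28.0, 4.5), (5.0, 9.0), (8.0, 8.5), (3.0, 9.5)]
train_y = [1, 1, 1, 0, 0, 0]

prediction = knn_predict(train_X, train_y, (32.0, 5.5), k=3, distance=manhattan)
```

Passing the distance function as a parameter mirrors the comparison of distance metrics reported later in the paper; in practice the factor values would usually be scaled before distances are computed.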

#### 3.2. Rotation Forest Ensemble

_{1}, X_{2}, …, X_{n}) and Y ∈ {1, 0}, the training phase of the Rotation Forest ensemble is as follows:

**Step 1.** Set up the parameters: choose the k-NN algorithm as the base classifier, the ensemble size (L), and the number of feature subsets (K).

**Step 2.** Train the classifier ensemble model: for i = 1…L.

- (a) Split X into K subsets (each subset contains M features): S_{i, j} for j = 1…K.
  - Generate S′_{i, j} by randomly eliminating a subset of classes.
  - Generate a new set S″_{i, j} by selecting a bootstrap sample with a size of 75% from S′_{i, j}.
  - Perform Principal Component Analysis on S″_{i, j} to obtain the coefficients $\mathrm{a}_{i,j}^{(1)},\dots,\mathrm{a}_{i,j}^{(M_{j})}$ and store them in a matrix C_{i, j}.
  - Arrange the matrices C_{i, j} in a rotation matrix R_{i}:

$$R_{i}=\left[\begin{array}{cccc}\mathrm{a}_{i,1}^{(1)},\dots,\mathrm{a}_{i,1}^{(M_{1})} & [0] & \dots & [0]\\ [0] & \mathrm{a}_{i,2}^{(1)},\dots,\mathrm{a}_{i,2}^{(M_{2})} & \dots & [0]\\ \vdots & \vdots & \ddots & \vdots\\ [0] & [0] & \dots & \mathrm{a}_{i,K}^{(1)},\dots,\mathrm{a}_{i,K}^{(M_{K})}\end{array}\right]$$

  - Construct $R_{i}^{a}$ by rearranging the rows of R_{i} to match the order of the influencing factors in the training dataset.
- (b) Construct the base classifier D_{i} using the training set ${\mathit{YR}}_{i}^{a}$.

**Step 3.** Calculate the landslide susceptibility index.

_{N} is as follows: (i) Build the transformed data ${Y}_{N}={X}_{N}{R}_{i}^{a}$ and run it through the L classifiers to get the degree of support for the landslide and the non-landslide classes, d_{i,j}, with i = 1,…,L and j = 1, 2 for the landslide and the non-landslide classes, respectively. (ii) The landslide susceptibility index (LSI) is then estimated for each pixel of X_{N} using the average combination method as follows:
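A minimal sketch of the average combination step, assuming the standard form in which the LSI of a pixel is the mean of the landslide-class support over the L base classifiers, LSI = (1/L) Σ d_{i,1}. The function name and the support values below are hypothetical and serve only to illustrate the arithmetic.

```python
def landslide_susceptibility_index(support):
    """Average the landslide-class support d_{i,1} over the L base classifiers.

    support: list of (d_landslide, d_non_landslide) pairs, one pair per classifier.
    """
    if not support:
        raise ValueError("need the output of at least one classifier")
    # Only the landslide-class support (j = 1) enters the average (assumed form).
    return sum(d_landslide for d_landslide, _ in support) / len(support)


# Hypothetical degrees of support from L = 4 base classifiers for one pixel.
support_for_pixel = [(0.8, 0.2), (0.6, 0.4), (1.0, 0.0), (0.4, 0.6)]
lsi = landslide_susceptibility_index(support_for_pixel)  # approximately 0.7
```

With k-NN base classifiers, each d_{i,1} can be read as the fraction of the k nearest neighbors labeled as landslide, so the index stays between 0 and 1.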

## 4. Proposed Hybrid Modeling Approach Based on Instance Based Learning Algorithm and Rotation Forest Ensemble for Spatial Prediction of Rainfall-Induced Shallow Landslides

#### 4.1. The GIS Database

#### 4.2. Feature Selection

_{i}, D) is the number of samples associated with the class Y_{i}, landslide or non-landslide; and S_{j} is the class j of the influencing factor S.
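As an illustration of how an information gain score of this kind can be computed, the sketch below assumes the usual entropy-based definition: the entropy of the class labels minus the weighted entropy of the labels within each class S_{j} of the factor. The function names and the toy slope classes are hypothetical and are not taken from the study data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(factor_classes, labels):
    """Entropy of the labels minus the weighted entropy within each factor class."""
    total = len(labels)
    groups = {}
    for factor_class, label in zip(factor_classes, labels):
        groups.setdefault(factor_class, []).append(label)
    # Weight the entropy of each factor class S_j by its share of the samples.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical samples: slope class per pixel and landslide (1) / non-landslide (0) label.
slope_class = [1, 1, 2, 2, 3, 3, 3, 3]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
ig = information_gain(slope_class, labels)  # approximately 0.704 bits
```

A factor whose classes separate landslide from non-landslide pixels well yields a high score, while a factor that carries no class information yields a score of zero.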

#### 4.3. The Hybrid Model: Configuration and Training

#### 4.4. Performance Assessment and the Final Trained Hybrid Model

## 5. Results and Analysis

#### 5.1. Determination of the Best Distance Metric and k Value

#### 5.2. Feature Selection and Predictive Ability of Landslide Influencing Factors

#### 5.3. Model Training and Assessment

#### 5.4. Cartographic Presentation of the Landslide Susceptibility Map

#### 5.5. Usability Assessment of the Proposed Hybrid Model

^{2}) is then calculated (using Equation (5)), and Chi-square comparisons with the critical table value at the significance level α = 5% are employed to assess the significance of differences between the susceptibility models. If the Chi-square value exceeds the critical table value of 3.841, the null hypothesis is rejected and the prediction power of the two susceptibility models is said to be significantly different [63].

_{ij} is the number of pixels misclassified by the susceptibility model i; and PI_{ji} is the number of pixels misclassified by the susceptibility model j.
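The decision rule can be sketched as follows. The exact form of Equation (5) is assumed here to be the common continuity-corrected McNemar statistic, χ² = (|PI_{ij} − PI_{ji}| − 1)² / (PI_{ij} + PI_{ji}), with PI_{ij} and PI_{ji} read as the discordant counts (pixels misclassified by one model but not by the other). The function name and the counts are hypothetical.

```python
CRITICAL_VALUE = 3.841  # chi-square critical value at alpha = 5%, as quoted in the text


def mcnemar_chi_square(pi_ij, pi_ji, continuity_correction=True):
    """McNemar's chi-square statistic from the two discordant counts (assumed form)."""
    discordant = pi_ij + pi_ji
    if discordant == 0:
        return 0.0  # the two models never disagree, so there is no evidence of a difference
    diff = abs(pi_ij - pi_ji)
    if continuity_correction:
        diff = max(diff - 1, 0)
    return diff ** 2 / discordant


# Hypothetical discordant counts for two susceptibility models.
chi2 = mcnemar_chi_square(120, 80)   # (|120 - 80| - 1)^2 / 200 = 7.605
significant = chi2 > CRITICAL_VALUE  # True: the null hypothesis is rejected
```

Setting `continuity_correction=False` gives the uncorrected statistic; which variant the study used cannot be determined from the text alone.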

## 6. Discussion and Conclusion

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

**Figure 2.** Landslide influencing factors: (**a**) slope map; (**b**) aspect; (**c**) elevation; (**d**) valley depth; (**e**) landuse; (**f**) soil type; (**g**) lithology; and (**h**) distance to faults. ACL: Annual Crop Land; PA: Populated Area; PTL: Protective Forest Land; PDL: Productive Forest Land; PL: Paddy Land; BL: Barren Land; PCL: Perennial Crop Land; WSL: Water Surface Land; GL: Grass Land; FA: Ferralic Acrisols; DG: Dystric Gleysols; PA: Plinthic Acrisols; WA: Water Area; DF: Dystric Fluvisols; EF: Eutric Fluvisols; RF: Rhodic Ferralsols; RM: Rocky Mountain; CO: Conglomerate; and QD: Quaternary Deposit.

**Figure 5.** Success rate and prediction rate curves, and their areas under the curve (AUC), for the landslide susceptibility map in this study.

**Figure 9.** ROC curves and AUC analysis using the validation data for: (**a**) the proposed hybrid model; (**b**) the Random Forest model; (**c**) the J48 Decision Trees model; and (**d**) the Neural Net model.

No | Influencing Factors | Classes |
---|---|---|
1 | Slope (°) | (1) 0–8; (2) 8–15; (3) 15–25; (4) 25–35; (5) 35–45; (6) >45 |
2 | Slope length (m) | (1) 0–10; (2) 10–30; (3) 30–50; (4) 50–80; (5) 80–120; (6) >120 |
3 | Aspect | (1) Flat; (2) North; (3) Northeast; (4) East; (5) Southeast; (6) South; (7) Southwest; (8) West; (9) Northwest |
4 | Curvature | (1) <−2; (2) −2 to −0.01; (3) −0.01 to 0.01; (4) 0.01 to 2; (5) >2 |
5 | Elevation (m) | (1) <260; (2) 230–300; (3) 300–350; (4) 350–450; (5) 450–550; (6) >550 |
6 | Valley depth (m) | (1) <10; (2) 10–30; (3) 30–50; (4) 50–70; (5) 70–100; (6) >100 |
7 | Toposhape | (1) Ridge; (2) Saddle; (3) Flat; (4) Ravine; (5) Convex hillside; (6) Saddle hillside; (7) Slope hillside; (8) Concave hillside; (9) Inflection hillside; (10) Unknown hillside |
8 | TWI | (1) <5; (2) 5–6; (3) 6–7; (4) 7–8; (5) 8–9; (6) >9 |
9 | SPI | (1) <30; (2) 30–100; (3) 100–200; (4) 200–300; (5) >300 |
10 | STI | (1) <10; (2) 10–30; (3) 30–50; (4) 50–70; (5) >70 |
11 | Landuse | (1) Annual crop land; (2) Populated area; (3) Protective forest land; (4) Productive forest land; (5) Paddy land; (6) Barren land; (7) Perennial crop land; (8) Water surface land; (9) Grass land |
12 | Soil type | (1) Ferralic acrisols; (2) Dystric gleysols; (3) Plinthic acrisols; (4) Water area; (5) Dystric fluvisols; (6) Eutric fluvisols; (7) Rhodic ferralsols; (8) Rocky mountain |
13 | Lithology | (1) Conglomerate; (2) Basalt; (3) Quaternary deposit; (4) Siltstone; (5) Limestone; (6) Sandstone; (7) Tuff |
14 | Distance to faults (m) | (1) 0–100; (2) 100–200; (3) 200–300; (4) 300–400; (5) >400 |

No | Distance Metrics | Classification Accuracy (%), Training Data | Classification Accuracy (%), Validation Data |
---|---|---|---|
1 | Euclidean | 83.3 | 74.4 |
2 | Manhattan | 83.4 | 75.9 |
3 | Chebyshev | 79.6 | 73.4 |
4 | Minkowski | 83.3 | 74.4 |

No. | Influencing Factor | Tolerance | VIF | IG |
---|---|---|---|---|
1 | Aspect | 0.88 | 1.14 | 0.20 |
2 | Slope | 0.38 | 2.63 | 0.19 |
3 | Sediment transport index | 0.16 | 6.15 | 0.11 |
4 | Stream power index | 0.18 | 5.68 | 0.06 |
5 | Distance to faults | 0.90 | 1.11 | 0.05 |
6 | Toposhade | 0.68 | 1.46 | 0.05 |
7 | Topographic wetness index | 0.59 | 1.69 | 0.05 |
8 | Curvature | 0.68 | 1.47 | 0.05 |
9 | Lithology | 0.88 | 1.14 | 0.04 |
10 | Landuse | 0.91 | 1.10 | 0.03 |
11 | Slope length | 0.46 | 2.19 | 0.03 |
12 | Soil type | 0.94 | 1.07 | 0.03 |
13 | Valley depth | 0.91 | 1.10 | 0.02 |
14 | Elevation | 0.91 | 1.11 | 0.01 |

**Table 4.** Model performance using the training data (PPV: Positive predictive value; NPV: Negative predictive value).

No | Parameter | Proposed Hybrid Model | Random Forest Model | J48 Decision Trees Model | Neural Nets Model |
---|---|---|---|---|---|
1 | True positive | 3579 | 3637 | 3531 | 3528 |
2 | True negative | 2931 | 3385 | 3296 | 2781 |
3 | False positive | 214 | 156 | 262 | 265 |
4 | False negative | 862 | 408 | 497 | 1012 |
5 | PPV (%) | 94.4 | 95.9 | 93.1 | 93.0 |
6 | NPV (%) | 77.3 | 89.2 | 86.9 | 73.3 |
7 | Sensitivity (%) | 80.6 | 89.9 | 87.7 | 77.7 |
8 | Specificity (%) | 93.2 | 95.6 | 92.6 | 91.3 |
9 | Accuracy (%) | 85.8 | 92.6 | 90.0 | 83.2 |
10 | Kappa index | 0.716 | 0.851 | 0.799 | 0.663 |
11 | AUC | 0.948 | 0.981 | 0.942 | 0.905 |

**Table 5.** Model validation using the validation data (PPV: Positive predictive value; NPV: Negative predictive value).

No | Parameter | Proposed Hybrid Model | Random Forest Model | J48 Decision Trees Model | Neural Nets Model |
---|---|---|---|---|---|
1 | True positive | 1256 | 762 | 1017 | 1227 |
2 | True negative | 1278 | 1528 | 1421 | 1176 |
3 | False positive | 408 | 902 | 647 | 437 |
4 | False negative | 386 | 135 | 242 | 488 |
5 | PPV (%) | 75.5 | 45.8 | 61.1 | 73.7 |
6 | NPV (%) | 76.8 | 91.9 | 85.5 | 70.7 |
7 | Sensitivity (%) | 76.5 | 85.0 | 80.78 | 71.6 |
8 | Specificity (%) | 75.8 | 62.9 | 68.71 | 72.9 |
9 | Accuracy (%) | 76.1 | 68.8 | 73.3 | 72.2 |
10 | Kappa index | 0.523 | 0.376 | 0.466 | 0.444 |
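
The derived rows of the validation table follow from the four confusion counts. The sketch below uses the standard definitions of these measures (assumed to match those used in the study) and reproduces the hybrid model column; the function name is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard performance measures from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the predicted and actual class totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }


# Counts for the proposed hybrid model on the validation data (from the table above).
hybrid_validation = confusion_metrics(tp=1256, tn=1278, fp=408, fn=386)
# Rounded: PPV 75.5%, NPV 76.8%, sensitivity 76.5%, specificity 75.8%,
# accuracy 76.1%, kappa 0.523
```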

**Table 6.** Statistical comparison of the prediction power of the landslide susceptibility models in this study using McNemar’s test.

No | Pairwise Comparison | Chi-Square (χ^{2}) | p-value | Significance |
---|---|---|---|---|
1 | The hybrid model vs. Random Forest | 687.077 | <0.0001 | Yes |
2 | The hybrid model vs. J48 Decision Trees | 181.845 | <0.0001 | Yes |
3 | The hybrid model vs. Neural Net | 10.081 | 0.0015 | Yes |

