A Comparison Study of Landslide Susceptibility Spatial Modeling Using Machine Learning

Nurwatik, Nurwatik; Ummah, Muhammad Hidayatul; Cahyono, Agung Budi; Darminto, Mohammad Rohmaneo; Hong, Jung-Hong

doi:10.3390/ijgi11120602


by Nurwatik Nurwatik ¹,*, Muhammad Hidayatul Ummah ¹, Agung Budi Cahyono ¹, Mohammad Rohmaneo Darminto ¹ and Jung-Hong Hong ²

¹ Department of Geomatics Engineering, Faculty of Civil, Planning, and Geo Engineering, Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia

² Department of Geomatics, National Cheng Kung University, Tainan City 701, Taiwan

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2022, 11(12), 602; https://doi.org/10.3390/ijgi11120602

Submission received: 29 September 2022 / Revised: 14 November 2022 / Accepted: 21 November 2022 / Published: 2 December 2022

(This article belongs to the Special Issue Geo-Information for Watershed Processes)


Abstract: One hundred seventeen landslides occurred in Malang Regency throughout 2021, triggering the need for practical hazard assessments to strengthen the disaster mitigation process. To help locate landslide-prone areas more precisely, this research compares machine learning algorithms to produce an accurate landslide susceptibility model. This research applies three machine learning algorithms, RF (random forest), NB (naïve Bayes), and KNN (k-nearest neighbor), and 12 conditioning factors. The conditioning factors consist of slope, elevation, aspect, NDVI, geological type, soil type, distance from the fault, distance from the river, river density, TWI, land cover, and annual rainfall. This research builds seven models over three ratios between the training and testing datasets: 50:50, 60:40, and 70:30 for the KNN and NB algorithms and 70:30 for the RF algorithm. This research measures the performance of each model using eight parameters (ROC–AUC, ACC, SN, SP, BA, GM, CK, and MCC). The results indicate that RF 70:30 generates the best performance, as shown by the evaluation parameters ACC (0.884), SN (0.765), GM (0.863), BA (0.857), CK (0.749), MCC (0.876), and AUC (0.943). Overall, the seven models have reasonably good accuracy, ranging between 0.806 and 0.884. Furthermore, based on the best model, the study area is dominated by high susceptibility with an area coverage of 51%, which occurs in the areas with steep slopes. This research is expected to improve the quality of landslide susceptibility maps in the study area as a foundation for mitigation planning. Furthermore, it can provide recommendations for further research on splitting ratio scenarios between training and testing data.

Keywords: landslide susceptibility; machine learning; k-nearest neighbor; naïve Bayes; random forest


1. Introduction

Landslides are downslope movements of soil and rock masses. They occur when a volume of material slides over a layer of rock containing clay after water saturation acts as the trigger [1]. A landslide is a natural phenomenon controlled by geological factors, rainfall, and land use on the slopes [2]. Indonesia is a country with a high potential for landslides. According to data from the National Disaster Management Agency of Indonesia (BNPB), there were 632 landslide incidents in 2021, accounting for 20% of all disasters in Indonesia that year.

Malang Regency is situated in East Java Province and is highly vulnerable to landslides. The Malang Regency Regional Disaster Management Agency (BPBD) data accounted for 117 landslides in 2021, reaching 44% of the total disasters in Malang Regency throughout 2021. Geographical conditions render the Malang Regency highly vulnerable to landslides. It is located in a highland area with various slopes, from sloping to very steep, as it is surrounded by the Tengger Mountains, Mounts Kawi and Kelud, and Mounts Arjuna and Welirang.

Landslide susceptibility assessment is a fundamental action for improving the mitigation process. Periodic assessment is necessary, since landslides recur and the conditioning factors change over time. Landslide locations can be investigated, vulnerable areas assessed, and impacts analyzed by using a terrestrial survey [3], satellite monitoring [4], or spatial modeling [5]. Spatial modeling has become a prompt solution, along with the growth of technologies and the availability of various data sources. It can integrate various data sources through algorithms, such as the machine learning approach, to produce maps. Machine learning (ML) is a branch of computational algorithms developed and designed to imitate human intelligence by learning from environmental data [6]. Machine learning is capable of solving prediction and classification problems [7]. In landslide susceptibility modeling, machine learning can generate a prediction using coordinate data of landslide occurrences as training data and landslide conditioning factors as the predictors [8].

Research using the keywords landslide susceptibility and machine learning has grown significantly since 2018 [9]. Research conducted by [10] applied the NB (naïve Bayes) algorithm, the RBF (radial basis function) classifier, and the RBF network to landslide susceptibility modeling for Longhai, China. It indicated that the naïve Bayes algorithm showed high performance in predicting landslide susceptibility, with an AUC value of 0.872. Moreover, research conducted by [11] applied ANN (artificial neural network), SVM (support vector machine), DT (decision tree), and RF (random forest) algorithms and combined models of ANN and SVM in the Cameron Highlands district in the state of Pahang, Malaysia. According to this research, the RF algorithm produced the best performance, with an AUC value on the testing data of 0.82. Research conducted by [12] carried out spatial modeling of landslide susceptibility in the Wayanad district in the southern part of India using RF, SVM, and KNN (k-nearest neighbor) algorithms. The KNN algorithm showed a good predictive ability for landslide susceptibility, with a maximum AUC value of 0.981. The maximum entropy (MAXENT) algorithm was developed as part of the development of machine learning algorithms for spatial analysis and has shown good performance. The MAXENT algorithm can perform various spatial analyses, including predictions of urban waterlogging-prone areas, fire hazards, and land subsidence studies [13,14,15]. A recent study, however, showed that using the MAXENT algorithm in the evaluation of landslide susceptibility produced a lower accuracy than RF [16].

In the study area, previous research on spatial modeling of landslide susceptibility applied scoring and overlay analysis, logistic regression, and spatial multi-criteria evaluation [17,18,19]. Those methods are subjective: they rely on the consistency of various experts in the adjustment process and require time-consuming handling of multiple data sources. In addition, a landslide susceptibility model using the conventional scoring method, multi-criteria evaluation, and expert judgment yields lower accuracy [17]. Considering the condition of Malang Regency as a mobility center with high tourist attraction, a high-accuracy landslide susceptibility assessment is necessary to mitigate casualties. Therefore, this research applies machine learning algorithms to assess landslide susceptibility in the study area. Using a statistical approach and machine learning techniques can help to reduce the subjectivity of the analysis. The model can be evaluated quantitatively, and the contribution level of each variable can be quantified [20,21]. RF, KNN, and NB are three machine learning algorithms that have produced accurate models of landslide susceptibility in various case studies.

Therefore, this research compares the spatial modeling of landslide susceptibility using three machine learning algorithms (RF, NB, and KNN). This research applies three splitting ratios for training and testing data, 50:50, 60:40, and 70:30, for NB and KNN. The RF algorithm uses only 70:30, following the best splitting ratio reported in previous research [22]. Eight evaluation parameters were used to test the performance of the seven models: ROC–AUC (receiver operating characteristic and area under the curve), accuracy (ACC), sensitivity (SN), specificity (SP), balanced accuracy (BA), geometric mean (GM), Cohen’s kappa (CK), and Matthews correlation coefficient (MCC).

2. Materials and Methods

2.1. Study Area

Figure 1 depicts the study area of this research, showing the distribution of landslide and non-landslide locations. The study area is Malang Regency, which is located geographically at 112°17′10.9″–112°57′0.0″ E and 7°44′55.11″–8°26′35.45″ S. Malang Regency has 33 sub-districts, 12 urban villages, and 378 villages. Malang Regency is the second largest regency in East Java Province, with an area of 334,786 ha. The topography of Malang Regency varies, with elevation values between 0 and 3660 MASL. It has several mountains, including Mount Semeru (3676 MASL), Mount Kelud (1731 MASL), Mount Welirang (3156 MASL), and Mount Arjuno (3339 MASL). Consequently, the slope varies between 0° and 85.2°. The geological type is dominated by tuff formation with extrusive intermediate pyroclastic composition derived from volcanic deposits. Malang Regency has a tropical climate with an average surface temperature of 18.25 °C to 31.45 °C.

2.2. Data Sources

2.2.1. Data Training Sample

The training sample consisted of landslide and non-landslide areas [23]. The data type was a point feature acquired from the Malang Regency Disaster Management Agency’s daily reports from 2012 to 2021. The data collected on landslide occurrence during 2012–2021 yielded 88 points. The number of landslides increased in certain locations from 2012 to 2021. Hence, it was assumed that past events are still actively occurring at some locations. Moreover, the non-landslide training sample was obtained by randomly extracting points with a slope of less than 2° [24]. The number of non-landslide training samples was set to 88 points to match the number of landslide points. Thus, the total number of training sample points was 176.

2.2.2. Spatial Data Landslide Conditioning Factors

The selection of landslide conditioning factors is essential to achieving high modeling accuracy. Standard rules related to the parameters that affect the landslide susceptibility model do not exist [25]. Landslide conditioning factors depend on the characteristics of the case study, the type of occurrence of the landslide, and the scale of analysis [26]. This research proposed 12 landslide conditioning factors to produce landslide susceptibility maps considering study area conditions, literature studies, and data availability. The 12 parameters consist of topography, land cover, and hydrological and trigger factors. Topographic factors consist of elevation, slope, and aspect. Moreover, land cover factors include geological type, soil type, distance from faults, and vegetation density. Hydrological factors include TWI (Topographic Wetness Index), distance from the river, and river density, whilst the triggering factor is average annual rainfall in 2012–2021. Figure 2 visualizes the landslide conditioning factors in the study area.

Elevation Data

This research used DEMNAS as the elevation data with a resolution of 0.27 arc-second or 8 m, published in 2018 by the Indonesian Geospatial Information Agency (https://tanahair.indonesia.go.id/demnas, accessed on 26 February 2022) [27]. DEMNAS was used to extract the elevation, aspect, and slope parameters. Based on the DEMNAS, the study area has an elevation of 0 to 3660 MASL with a slope ranging from 0° to 73°. Moreover, the aspect ranges from 0° to 360°, with the slope direction measured clockwise. It consists of north, northeast, east, southeast, south, southwest, west, northwest, and flat. Besides the topographic factor parameters, the elevation data were also used to generate the TWI of the study area. The TWI ranged from 1.8 to 16.8. For modeling purposes, this research resampled all the data to 30 m. In addition, all datasets were projected into the same coordinate system during the resampling process.

Geological Map Data

Geological map data were acquired from the Geological Agency of the Ministry of Energy and Mineral Resources of Indonesia at a scale of 1:100,000. The latest geological map was created in 1992 by the same ministry. The map was produced from measurements of direct outcrop points in the field, which started in 1921 during the Dutch-Indies period [28]. Geological type and fault information were extracted from the geological map. Furthermore, this research applied Euclidean distance analysis to calculate the distance from the fault location; moreover, the geological type was converted into a raster format and resampled. According to the geological type, the study area is dominated by a tuff formation with a coverage area of 16%. The formation is a pyroclastic extrusive rock originating from volcanic deposits.

The study area consists of 34 geological unit formations. Table 1 represents the characteristics related to the types of formations, rock formations, and deposits. In general, the rock conditions are composed of rocks brought by volcanic activity consisting of tuff, sandy tuff, volcanic breccia, agglomerates, and lava. Moreover, the distance between the study area and the fault ranges between 0 and 50,000 m. The type of fault which crosses the study area is a local fault with shear, descending, and horizontal faults [29,30,31]. The local faults pass through the Sub-district of Sumbermanjing, Bantur, Gedangan, Gondang Legi, Turen, Wajak, Poncokusumo, and Dampit. The fault which passes through Sumbermanjing Sub-district is a descending type, while those passing through Sumbermanjing and Bantur Sub-district are shear-type and horizontal, respectively.

Soil Type Data

The Indonesian Ministry of Agriculture produced the soil-type map data at a scale of 1:50,000 in 2014. The data were rasterized to convert them into a raster format. Then, this research resampled the map to 30 m. According to the soil type, cambisol dominates the study area with a coverage area of 60%. Cambisol soil types are rich in mineral matter and vary in drainage, depth, and base saturation [32].

Landsat-8 OLI TIRS Imagery Data

Landsat-8 OLI TIRS imagery data were acquired from the USGS (United States Geological Survey) directory using the Google Earth Engine (https://developers.google.com/earth-engine/datasets, accessed on 24 March 2022). The imagery was acquired on 19 August 2021, with a cloud cover of 5.51%. This specific date in 2021 was chosen for the land-cover and NDVI analysis because this research tried to utilize the latest and best data specifications with a relatively low cloud cover, since the latest land cover and NDVI data are necessary to produce a good landslide hazard prediction model [33]. The imagery has a spatial resolution of 30 m on a multispectral sensor [34]. The Landsat-8 OLI TIRS imagery data were used to extract the land cover and vegetation density parameters. The vegetation index was computed using the NDVI algorithm. NDVI can be used to estimate the level of greenery density in an area of land [35]. The NDVI algorithm is given in Equation (1), where $\mathrm{NIR}$ is the near-infrared band and $R$ is the red band of Landsat-8 [36].

\mathrm{NDVI} = \frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}

(1)
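As an illustration of Equation (1), the following minimal Python sketch computes NDVI per pixel. The function name `ndvi` and the reflectance values are hypothetical and are not taken from the study; returning `None` for a zero denominator is also an assumption about how empty pixels might be handled.

```python
def ndvi(nir, red):
    """Return NDVI = (NIR - R) / (NIR + R) for one pixel.

    Returns None when NIR + R is zero, where the index is undefined.
    """
    total = nir + red
    if total == 0:
        return None
    return (nir - red) / total


# Hypothetical (NIR, red) reflectance pairs for three pixels.
pixels = [(0.45, 0.05), (0.20, 0.18), (0.02, 0.06)]
values = [ndvi(nir, red) for nir, red in pixels]
```

Values near 1 correspond to dense vegetation, while negative values typically correspond to water or bare surfaces.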

This research applied the supervised classification random forest method to generate land cover. It comprised water bodies, forests, vegetation (including agricultural land), built-up land, and bare land. The classification was reasonably well-accepted, with overall accuracy and kappa accuracy values of 0.89 and 0.86, respectively. The classification results indicated that forests cover 38% of the study area. According to the vegetation density, the result showed that the vegetation includes a variety of land cover, namely, water bodies, low-density vegetation, medium-density vegetation, and high-density vegetation, with a density index between −0.46 and 0.83 [37].

Annual Rainfall Data

Annual rainfall data were derived from daily CHIRPS data with a resolution of 5 km retrieved from 2012 to 2021. This research used CHIRPS data as the source for annual rainfall because the number of rain gauge stations covering the study area is very limited. Consequently, the rain gauge station data were less representative. CHIRPS, by contrast, is a terrestrial rainfall database that combines three types of rainfall information (global climatology, satellite-based rainfall estimates, and in-situ rainfall observations) [38]. It can be accessed at https://data.chc.ucsb.edu/products/CHIRPS-2.0/ (accessed on 24 March 2022). Retrieval and processing of this dataset were carried out using the Google Earth Engine. Following the process, raster extraction produced 190 rainfall grid points, which were assumed to be rainfall measuring points. Then, this research applied ordinary kriging to generate rainfall values over the study area. Based on the average annual rainfall data, the study area receives 1750.56–3338.21 mm/year.

River Net Data

River net data were obtained from a topographic map produced by the Indonesian Geospatial Information Agency. The river net data have a scale of 1:25,000 and were published in 1999. These data are the latest data owned by the Indonesian Geospatial Information Agency. River net data were used to extract the hydrological factor parameters composed of the distance from the river and the density of the river. Euclidean distance analysis was carried out to measure the distance parameter from the river. Based on the distance parameter from the river, the study area has a distance value from the river between 0 and 5055.16 m. A line density analysis proceeded with units of km/km² to generate river density. The result demonstrated that the river density has a value of 0–6.58 km/km².
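The Euclidean distance analysis described above can be illustrated with a small brute-force sketch. It is not the GIS tool used in the study: the function name `distance_raster`, the toy mask, and the 30 m cell size applied here are assumptions for illustration, and production tools use far more efficient distance transforms.

```python
import math


def distance_raster(river_mask, cell_size):
    """Distance from every cell to the nearest river cell, in map units.

    river_mask is a list of rows; truthy cells mark the river.
    At least one river cell must be present.
    """
    river_cells = [
        (r, c)
        for r, row in enumerate(river_mask)
        for c, is_river in enumerate(row)
        if is_river
    ]
    result = []
    for r, row in enumerate(river_mask):
        result.append([
            min(math.hypot(r - rr, c - cc) for rr, cc in river_cells) * cell_size
            for c in range(len(row))
        ])
    return result


# Toy 3 x 3 grid with a river running down the first column.
mask = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
]
distances = distance_raster(mask, cell_size=30.0)
```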

2.3. Methods

This research applied three machine learning algorithms composed of random forest, naïve Bayes, and k-nearest neighbor to compare their performance in generating landslide susceptibility analysis. Figure 3 illustrates the workflow of this research. In general, the landslide susceptibility analysis consisted of 3 major steps: (1) conditioning factor parameters preparation, (2) modelling, and (3) model evaluation.

Random forest is an ensemble learning model built from a set of decision trees (DTs). Each DT depends on a sample of independent data values, and the distribution of each decision tree is the same [39]. RF is effective for predictions, as it uses the strength of each DT and their correlation and is less sensitive to the problem of over-fitting [40]. It works by performing a majority vote over the results of all DTs. Equation (2) denotes the RF algorithm, where $\hat{C}_{rf}$ is the class produced by the random forest, and the hat operator in $\hat{C}$ indicates that the class is the estimated class; $x$ is an input vector; $\hat{C}_{n}$ is the class predicted by the $n$th tree; and $N$ is the number of trees in the random forest [41].

\hat{C}_{rf} = \mathrm{majority\ vote} \, \{\hat{C}_{n}(x)\}_{n=1}^{N}

(2)
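The majority vote in Equation (2) can be sketched in a few lines of Python. The list of per-tree predictions is hypothetical, and interpreting the share of "landslide" votes as a susceptibility score is an assumption for illustration rather than the procedure documented in this study.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees for one input vector x."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return counts.most_common(1)[0][0]


# Hypothetical predictions of N = 5 trees for a single location.
votes = ["landslide", "non-landslide", "landslide", "landslide", "non-landslide"]
predicted = majority_vote(votes)

# Fraction of trees voting "landslide" (assumed here as a probability-like score).
share = votes.count("landslide") / len(votes)
```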

The k-nearest neighbor (KNN) algorithm is a machine learning algorithm that uses neighboring techniques to determine the class of a point [42]. A point is classified based on its closest neighbors in the training data. KNN is categorized as a non-parametric ML model because the computational process does not depend on the data distribution [12]. The shortest distance between the new data and the training data is commonly determined using the Euclidean distance (Equation (3)), where $X_{iv}$ and $X_{jv}$ are the values of characteristic $v$ for samples $i$ and $j$, respectively; $p$ is the number of characteristics over which the sum runs; and $v$ is the index of an individual characteristic [43].

d_{i j} = \sqrt{\sum_{v = 1}^{p} {(X_{i v} - X_{j v})}^{2}}

(3)
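A minimal sketch of KNN classification with the Euclidean distance of Equation (3) follows. The training points, labels (1 for landslide, 0 for non-landslide), and function names are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Classify query by majority vote among its k nearest training samples.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda sample: euclidean(sample[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]


# Hypothetical normalized samples with two features each.
train = [
    ((0.9, 1.2), 1), ((1.1, 0.8), 1), ((0.8, 1.0), 1),
    ((-1.0, -0.9), 0), ((-1.2, -1.1), 0), ((-0.7, -1.3), 0),
]
label = knn_predict(train, (1.0, 1.0), k=3)
```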

Naïve Bayes (NB) is a supervised learning method based on statistical measurement for classification purposes. NB is based on Bayes' theorem, which is well suited to high-dimensional data and is not affected by the distribution of the data [44]. NB is a simple form of a Bayesian network, with all variables considered independent of each other [45]. Equation (4) denotes the NB algorithm for landslide susceptibility modeling, where $x_{i}$ is the parameter of the factors causing landslides; $y_{i}$ is the classification variable for landslides and non-landslides; $P(y_{i})$ is the prior probability of $y_{i}$; and $P(x_{i} \mid y_{i})$ is the conditional probability of $x_{i}$ given $y_{i}$, which can be calculated by Equation (5), where $\mu$ and $\sigma$ are the mean and standard deviation of $x_{i}$ within the class [10].

y = \underset{y_{i} \in \{\text{landslide}, \text{non-landslide}\}}{\arg\max} \; P(y_{i}) \prod_{i = 1}^{14} P(x_{i} \mid y_{i})

(4)

P(x_{i} \mid y_{i}) = \frac{1}{\sqrt{2 \pi} \sigma} e^{- \frac{{(x_{i} - \mu)}^{2}}{2 \sigma^{2}}}

(5)
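Equations (4) and (5) can be illustrated with a small Gaussian naïve Bayes sketch. The two-feature toy samples, the function names, and the use of the population standard deviation per class are assumptions for illustration; the sketch also assumes that no feature is constant within a class, since that would give a zero standard deviation.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density of Equation (5)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def fit(samples):
    """Estimate the prior and per-feature (mu, sigma) for every class.

    samples is a list of (features, label) pairs.
    """
    model = {}
    for label in {lab for _, lab in samples}:
        rows = [features for features, lab in samples if lab == label]
        prior = len(rows) / len(samples)
        params = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            variance = sum((v - mu) ** 2 for v in column) / len(column)
            params.append((mu, math.sqrt(variance)))
        model[label] = (prior, params)
    return model


def predict(model, x):
    """Return the class maximizing prior times the product of likelihoods."""
    scores = {}
    for label, (prior, params) in model.items():
        score = prior
        for value, (mu, sigma) in zip(x, params):
            score *= gaussian_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical samples: (slope in degrees, NDVI).
samples = [
    ((25.0, 0.3), "landslide"), ((30.0, 0.4), "landslide"), ((20.0, 0.5), "landslide"),
    ((1.0, 0.6), "non-landslide"), ((2.0, 0.7), "non-landslide"), ((1.5, 0.5), "non-landslide"),
]
model = fit(samples)
result = predict(model, (22.0, 0.4))
```

With many factors, the product of small likelihoods can underflow, so practical implementations usually sum log-probabilities instead.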

For landslide susceptibility modelling, a raster stack ensures that all parameters have the same resolution. This research then extracts the value of each parameter at the landslide occurrence points and normalizes the numeric data using the z-score so that all numeric data are on the same scale (Equation (6)), where $X$ is the data value; $\bar{X}$ is the average value of all the data; and $S$ is the standard deviation of the overall data [46].

Z = \frac{X - \bar{X}}{S}

(6)
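The z-score of Equation (6) can be sketched as follows. The slope values are hypothetical, and using the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not state which one was applied.

```python
import statistics


def z_scores(values):
    """Normalize values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical slope values in degrees.
slopes = [2.0, 4.0, 6.0, 8.0]
z = z_scores(slopes)
```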

Following the normalization process, the data are split into training and testing sets. The training data are used to generate prediction models, while the testing data are used to evaluate the built models. The ratios between training and testing are 70:30, 60:40, and 50:50 for the NB and KNN algorithms, whereas the RF algorithm uses only a ratio of 70:30. In general, no specific rule exists for determining the splitting ratio between training and testing data, since each machine learning algorithm has its own optimum splitting ratio. However, commonly used splitting ratio schemes are 50:50, 60:40, and 70:30. The KNN and NB algorithms use these three scenarios to obtain optimum model accuracy [10,22,41,47,48]. For the RF algorithm, in contrast, previous research in an area with the same physical characteristics showed that RF achieved maximum accuracy with a splitting ratio of 70:30 [22]. Therefore, the RF algorithm only used a splitting ratio of 70:30 in this research.
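The three splitting scenarios can be sketched with a simple seeded random split of the 176 sample points. The function name `split`, the seed, and the use of a plain (non-stratified) shuffle are assumptions; the text does not state how the split was drawn.

```python
import random


def split(samples, train_fraction, seed=42):
    """Shuffle a copy of samples and cut it into training and testing parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Indices standing in for the 176 training sample points.
samples = list(range(176))
sizes = {}
for fraction in (0.5, 0.6, 0.7):
    train, test = split(samples, fraction)
    sizes[fraction] = (len(train), len(test))
```

Under these assumptions the 50:50, 60:40, and 70:30 scenarios give 88/88, 106/70, and 123/53 training/testing points, respectively.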

After the modeling process, eight evaluation parameters were used to assess the performance of each model: ROC–AUC (receiver operating characteristic and area under the curve), accuracy (ACC), sensitivity (SN), specificity (SP), balanced accuracy (BA), geometric mean (GM), Cohen’s kappa (CK), and Matthews correlation coefficient (MCC). The evaluation values were obtained from the four cells of the confusion matrix: tp and fp are the numbers of samples correctly and incorrectly predicted as positive, and tn and fn are the numbers of samples correctly and incorrectly predicted as negative. Table 2 denotes the equations and objectives of each evaluator.
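The confusion-matrix-based evaluators can be sketched as below. The formulas are the standard textbook definitions and are an assumption here, since they may differ in detail from those listed in Table 2; the counts passed to `evaluate` are hypothetical.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Compute ACC, SN, SP, BA, GM, CK, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn)          # sensitivity: share of landslides detected
    sp = tn / (tn + fp)          # specificity: share of non-landslides detected
    ba = (sn + sp) / 2
    gm = math.sqrt(sn * sp)
    # Cohen's kappa compares observed agreement with chance agreement.
    chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    ck = (acc - chance) / (1 - chance)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "SN": sn, "SP": sp, "BA": ba, "GM": gm, "CK": ck, "MCC": mcc}


# Hypothetical counts for 100 test points.
metrics = evaluate(tp=40, fp=10, tn=45, fn=5)
```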

3. Results

3.1. Continuous Data Parameter Normality Characteristics

Some machine learning algorithms assume that the training data are normally distributed, so identifying the normality characteristics of the data is necessary when evaluating the application of these algorithms. This research uses the non-parametric Kolmogorov–Smirnov (K-S) test to assess the normality characteristics. The K-S test uses the cumulative distribution to determine the distribution of the data [52]. Moreover, the K-S test is a reliable and efficient way to establish goodness of fit for various purposes [53]. Table 3 denotes the results of the K-S test for the landslide and non-landslide training datasets.

The significance level is set to α = 0.05 (5%), and the decision criterion uses sig.α, i.e., the p-value. If sig.α < α, then H₀ is rejected [54]. The results of the K-S test show that none of the parameters in the non-landslide training data are normally distributed. Moreover, in the landslide training data, only the TWI parameter is normally distributed, with a p-value of 0.09.
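To illustrate how the K-S test compares cumulative distributions, the sketch below computes only the K-S statistic D against a normal distribution whose mean and standard deviation are estimated from the sample. It does not compute the p-value, and the estimation of the parameters from the sample, the function names, and the toy data are assumptions rather than the procedure used in this study.

```python
import math
import statistics


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def ks_statistic(values):
    """Largest gap between the empirical CDF and the fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # Check the gap just after and just before the step at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical samples: one roughly symmetric, one strongly skewed.
symmetric = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
skewed = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 50.0]
d_symmetric = ks_statistic(symmetric)
d_skewed = ks_statistic(skewed)
```

A larger D indicates a larger departure from normality; the p-value reported in the test is derived from D and the sample size.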

3.2. Landslide Susceptibility Modeling Results

Before performing KNN for the landslide susceptibility model, the value of k, i.e., the number of nearest neighbors considered for a point, must be estimated. The k-value was estimated using the cross-validation technique. Cross-validation is performed with three iterations to optimize the accuracy. Figure 4 illustrates the results of the k-value estimation. Based on the cross-validation results, the optimum k-value for the 50:50 KNN model was 3, with a maximum accuracy of 0.814, while for the 60:40 KNN model, the optimum k-value was also 3, with a maximum accuracy of 0.796. The KNN 70:30 model reached its maximum accuracy of 0.817 when the k-value was 7.
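The selection of k by cross-validation can be sketched as follows. Interpreting the "three iterations" as three folds is an assumption, as are the toy data, the candidate k-values, and the function names; ties between candidates are resolved in favor of the first one listed.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(samples, k, folds=3):
    """Mean accuracy of KNN over interleaved folds."""
    correct = 0
    for fold in range(folds):
        test = samples[fold::folds]
        train = [s for i, s in enumerate(samples) if i % folds != fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(samples)


def best_k(samples, candidates, folds=3):
    """Return the candidate k with the highest cross-validated accuracy."""
    scores = {k: cv_accuracy(samples, k, folds) for k in candidates}
    return max(scores, key=scores.get), scores


# Hypothetical, well-separated samples; labels alternate so each fold is mixed.
landslide = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
stable = [(-1.0, -1.0), (-1.2, -0.9), (-0.8, -1.1), (-1.1, -1.3), (-0.9, -0.7), (-1.3, -1.0)]
samples = []
for a, b in zip(landslide, stable):
    samples.append((a, 1))
    samples.append((b, 0))

k_opt, scores = best_k(samples, candidates=(1, 3, 5))
```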

The KNN yields probability values of landslides from 0 to 1. The average probability values of the KNN 50:50, KNN 60:40, and KNN 70:30 models are 0.449, 0.338, and 0.365, respectively. The probability values are then classified into low susceptibility (0–0.3), moderate susceptibility (0.3–0.6), and high susceptibility (0.6–1) [42]. Figure 5 illustrates the result of each scenario, where (A), (B), and (C) show the results of landslide susceptibility modeling using the KNN. The 50:50 KNN model indicates that high susceptibility dominates the study area with an area of 147,319.29 ha (42%), whereas the 60:40 KNN model is dominated by moderate susceptibility with an area of 195,318.54 ha (56%). For the 70:30 KNN model, the study area was dominated by low susceptibility with an area of 180,326.16 ha (51%).

Likewise, the NB algorithm applies three scenarios between training and testing: 50:50, 60:40, and 70:30. The results indicate that the probability values of landslides in the NB 50:50 model range from 6.24 × 10⁻¹⁰ to 1, with an average of 0.451. Moreover, the NB 60:40 model generates a probability range between 5.68 × 10⁻¹⁴ and 1, with an average of 0.424. In the NB 70:30 model, the probability values of landslides lie between 5.87 × 10⁻¹² and 1, with an average value of 0.299. In addition, the NB models also classify the susceptibility. Figure 6 illustrates the proportion of the study area based on the probability classification of landslides. The classification of all scenarios shows that low susceptibility dominates the study area; 51% of the study area was classified as low susceptibility in the NB 50:50 model, with 179,493.57 ha, as opposed to the NB 60:40 model with 231,354.63 ha (66%). Moreover, the area with low susceptibility in the NB 70:30 model was 235,410.39 ha (67%). On the contrary, only RF 70:30 generates more than 50% high susceptibility.

The RF algorithm implements only one scenario between training and testing (70:30) to produce a landslide susceptibility map. In the RF modelling, it was necessary to estimate the best mtry, which is the number of variables randomly sampled as candidates at each split, before building the DTs. The best mtry was estimated using a cross-validation technique. Figure 7 illustrates the results of the cross-validation.

According to the results, the mtry value that produces the highest accuracy (0.896) of the RF model was 11. The RF 70:30 model generates probability values from 0.01 to 1, with an average of 0.595. After the RF model was generated, the level of susceptibility to landslides was classified based on the respective probability values. The result indicates that high susceptibility dominates 51% of the study area, with 177,208.83 ha distributed along the edge of the study area.

4. Discussion

This research produces seven landslide susceptibility models. All models indicate that high levels of landslide susceptibility are located on the edge of the study area, except for the KNN 60:40 and KNN 70:30 models. The probability values generated by all models range from 0 to 1. A probability value approaching 0 indicates no susceptibility to landslides. On the contrary, a probability value close to 1 indicates an increased susceptibility to landslides [55]. The probability values are then classified into three levels of landslide susceptibility, low, moderate, and high, for probability values from 0 to 0.3, from 0.3 to 0.6, and from 0.6 to 1, respectively.
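The three-level classification can be sketched as a simple threshold function. How the boundary values 0.3 and 0.6 themselves are assigned is not stated in the text, so placing them in the higher class is an assumption, as is the function name.

```python
def susceptibility_class(probability):
    """Map a landslide probability in [0, 1] to low, moderate, or high."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability < 0.3:
        return "low"
    if probability < 0.6:
        return "moderate"
    return "high"


# Average probabilities reported for three of the models.
averages = {"KNN 50:50": 0.449, "NB 70:30": 0.299, "RF 70:30": 0.595}
classes = {name: susceptibility_class(p) for name, p in averages.items()}
```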

The evaluation is conducted on both training and testing data. For the training data, ACC and CK were measured. Figure 8 depicts the evaluation results of each model using these parameters. The RF 70:30 model generates the highest values for ACC and CK, with values of 0.915 and 0.819, respectively. Among the NB models, the NB 60:40 model yields the highest ACC and CK, with values of 0.863 and 0.691, respectively. For the KNN algorithm, the KNN 50:50 model produces the highest ACC and CK values among all the scenarios, with values of 0.823 and 0.597, respectively.

Eight parameters (ACC, SN, SP, BA, GM, MCC, CK, and ROC–AUC) are used to evaluate the performance of each model. Figure 9 depicts the results of the evaluation of each model using these parameters. The RF 70:30 model generates the highest values for six evaluation parameters, namely, ACC, SN, GM, BA, CK, and MCC, with values of 0.884, 0.765, 0.863, 0.857, 0.749, and 0.876, respectively. Moreover, for the SP parameter, the NB 50:50 and KNN 50:50 models have the highest value among the models, namely, 0.977. The NB 50:50 model had the lowest performance on six evaluation parameters, namely, ACC (0.806), SN (0.536), GM (0.757), BA (0.724), CK (0.556), and MCC (0.601). Moreover, the KNN 70:30 model obtains the lowest performance for the SP evaluation parameter (0.846).

ROC–AUC measures the performance of each model in distinguishing landslide from non-landslide as a binary value. The ROC is the curve relating SP and SN, and the AUC is the area under it. Figure 10 illustrates the ROC–AUC results for the training and testing data. Based on the ROC graph, all models have an AUC of more than 0.7, which indicates that the models had good performance [46]. On the testing data, the RF 70:30 model generates the highest AUC, 0.943. On the contrary, the model with the lowest AUC value was KNN 70:30 (0.852), meaning that its performance in identifying landslides was the weakest among the models. In line with the testing data, the RF 70:30 model also produces the highest AUC on the training data, with a value of 1. Moreover, the lowest AUC value for the training data is obtained with KNN 60:40, with an AUC value of 0.922. In addition, among the KNN models, KNN 70:30 produced the lowest values of ACC, SP, CK, and MCC, with respective values of 0.814, 0.846, 0.611, and 0.624. The optimum KNN splitting ratio between training and testing was 50:50, which produces the highest values in five of the eight evaluation parameters: ACC (0.833), SP (0.977), CK (0.625), MCC (0.658), and AUC (0.881).
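The AUC can be illustrated without drawing the ROC curve by using its rank interpretation: the probability that a randomly chosen landslide point receives a higher score than a randomly chosen non-landslide point. The sketch below uses this pairwise formulation, which is a standard equivalent of the area under the ROC curve; the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    labels use 1 for landslide and 0 for non-landslide; ties count as half.
    """
    positives = [s for lab, s in zip(labels, scores) if lab == 1]
    negatives = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test labels and predicted landslide probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
value = auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.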

Figure 11 depicts the relative variable contribution degree of each model. In general, slope had the highest relative contribution degree in all models, with a value of 100%. However, each model produces a different ranking of the contribution degrees of the parameters. Regarding the lowest contribution degree, NDVI has the lowest relative contribution degree in the RF 70:30 and NB 70:30 models, with merely 0.44% and 5.28%, respectively. Land use has the lowest relative contribution degree in the NB 50:50 model (7.31%), and geological type has the lowest in the NB 60:40 model (10.13%). In the KNN 50:50 and KNN 60:40 models, soil type had the lowest relative contribution degree, at 8.64% and 10.12%, respectively. In the KNN 70:30 model, the parameter with the lowest relative contribution degree was aspect, with a value of 9.24%.

Among all models, RF was the most appropriate model for discriminating non-landslide areas from landslide areas based on landslide conditioning factors, considering the model evaluation performance and accuracy [56]. The evaluation parameters comprise ROC–AUC, ACC, SP, SN, GM, BA, CK, and MCC. According to the evaluation results, RF 70:30 was the best model, with the highest value in seven of the eight evaluation parameters. For the NB algorithm, the optimal ratio between training and testing data is 60:40, as it generated the highest value in five of the eight parameters [57]. Moreover, the scenario with the lowest performance is 50:50, since it generates the lowest value in six of the eight evaluation parameters. RF performs the best in this research, followed by KNN and NB, in that order. In a previous study that applied these algorithms with similar conditioning factors, KNN, RF, and NB also yielded good performance, with AUC values of 0.8903, 0.8690, and 0.8639, respectively [58]. The strong performance of these three algorithms in predicting landslides was also confirmed when additional conditioning factors such as curvature, lithology, road ratios, and forest area ratios were used [59].
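The threshold-based evaluation parameters named above can all be derived from a single binary confusion matrix. The following Python sketch uses the standard textbook definitions of ACC, SN, SP, BA, GM, CK (Cohen's kappa), and MCC; the function name and the example counts are hypothetical and do not reproduce the study's confusion matrices.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: landslide samples predicted correctly/incorrectly.
    tn/fp: non-landslide samples predicted correctly/incorrectly.
    """
    n = tp + fn + fp + tn
    sn = tp / (tp + fn)              # sensitivity (true positive rate)
    sp = tn / (tn + fp)              # specificity (true negative rate)
    acc = (tp + tn) / n              # overall accuracy
    ba = (sn + sp) / 2               # balanced accuracy
    gm = math.sqrt(sn * sp)          # geometric mean of SN and SP
    # Cohen's kappa: agreement corrected for chance agreement p_e.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    ck = (acc - p_e) / (1 - p_e) if p_e < 1 else 0.0
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "BA": ba,
            "GM": gm, "CK": ck, "MCC": mcc}


# Hypothetical counts, not the study's results.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it depends on the ranking of continuous scores rather than on one thresholded confusion matrix.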

The KNN algorithm shows the lowest performance when only the best model of each algorithm is compared. However, across all splitting ratio schemes, the NB algorithm produces lower performance than the KNN and RF algorithms. Based on the results of the continuous data normality test in the previous sub-section, the training data are not normally distributed, whereas the NB algorithm assumes that continuous data are normally distributed [60]. This research therefore applied a numerical normality test, the Kolmogorov–Smirnov (K-S) test, to the training data to determine the normality of the data distribution. The K-S result indicates that NB’s performance depends on the training data distribution.
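A one-sample K-S normality check of the kind described above can be sketched in a few lines of Python. The sketch compares the empirical distribution of one conditioning factor with a normal distribution fitted to the same sample and uses the large-sample 5% critical value 1.36/√n. Two caveats are assumptions of this sketch rather than statements about the study: the function name and samples are hypothetical, and because the mean and standard deviation are estimated from the sample, the plain K-S critical value is conservative (a Lilliefors-type correction would be stricter).

```python
import math
from statistics import NormalDist, mean, stdev


def ks_normality(sample, coefficient=1.36):
    """One-sample K-S check of a sample against a fitted normal distribution.

    Returns (D, critical value, rejected). The default coefficient 1.36
    gives the large-sample critical value at the 5% significance level.
    """
    n = len(sample)
    fitted = NormalDist(mean(sample), stdev(sample))
    d = 0.0
    for i, x in enumerate(sorted(sample)):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    critical = coefficient / math.sqrt(n)
    return d, critical, d > critical


# Hypothetical samples, not the study's data.
skewed = [math.exp(k / 10) for k in range(100)]                    # right-skewed
bell = [NormalDist().inv_cdf((k + 0.5) / 50) for k in range(50)]   # near-normal

d_skewed, crit_skewed, skewed_rejected = ks_normality(skewed)
d_bell, crit_bell, bell_rejected = ks_normality(bell)
print(f"skewed: D={d_skewed:.3f}, critical={crit_skewed:.3f}, reject={skewed_rejected}")
print(f"bell:   D={d_bell:.3f}, critical={crit_bell:.3f}, reject={bell_rejected}")
```

A factor that fails this check violates the Gaussian likelihood assumed by a standard NB classifier for continuous inputs, which is one plausible reason for weaker NB results on such data.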

The most influential parameter in all models is slope. In several studies of landslide susceptibility modeling using machine learning algorithms, slope likewise has the highest relative contribution among the parameters [8,11,47]. Other research reports different leading factors, such as elevation [61] and rainfall [62], with the slope parameter ranking fifth. In this research, the landslide training data dominantly occurred on slopes between 8° and 30°, which are classified as rather steep slopes. Regarding the influence of topography on landslide occurrence, ref. [63] found that landslides tend to occur at slope values between 15° and 25°, as the slope angle controls shear forces and stresses on a slope [64]. The slope angle affects the magnitude of the shear stress and the degree to which slope stability is reduced [65]. As the slope angle increases, the tangential stress increases in the consolidated soil layer, while the axial stress (shear strength increases on a steeper slope) and the slope stability level decrease accordingly. As a result, a steeper slope angle raises the potential for rock mass movement and ultimately triggers soil movement down the slope [64]. Variations in the slope value affect the magnitude of the stress on the potential shear surface and determine the deformation mechanism [66]. Furthermore, the saturation of the fill slope causes the rock mass to slide down the slope because the high compressibility and mobility of air in the unsaturated void allow the fill slope to initiate undrained failure. The saturation level on the fill slope is determined by the type of soil and the hydrological conditions [67,68].
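The dependence of stress on slope angle can be made concrete with the classical infinite-slope idealization. This is a textbook simplification introduced here for illustration and is not part of the article's method. It assumes a planar, dry slope with no pore water pressure, and the symbols are assumptions of this sketch: γ is the soil unit weight, z the vertical depth of the potential slip plane, β the slope angle, c′ the effective cohesion, and φ′ the effective friction angle.

```latex
\begin{align}
\tau &= \gamma z \sin\beta \cos\beta, \\
\sigma_n &= \gamma z \cos^2\beta, \\
FS &= \frac{c' + \sigma_n \tan\varphi'}{\tau}
\end{align}
```

Under these assumptions the normal stress σ_n falls as β increases, while the ratio τ/σ_n = tan β rises. For a cohesionless soil (c′ = 0) the factor of safety reduces to FS = tan φ′ / tan β, which decreases steadily as the slope steepens.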

The parameter with the lowest relative contribution level differs between models. NDVI had the lowest relative contribution level in the RF 70:30 and NB 70:30 models. The results of these two models indicate that they are only weakly associated with the NDVI data. The landslide training data tend to occur at NDVI values between 0.24 and 0.787, which are classified as low to high vegetation density [37]. Vegetation density has been reported not to contribute significantly to identifying landslide occurrence [69]. On the other hand, ecological damage, indicated by low vegetation density, can trigger landslides. Therefore, it is necessary to consider ecological restoration as the primary means of preventing and controlling landslides [70]. Vegetation can be an effective measure for mitigating landslides, as it can promote the shear strength of the soil through a series of mechanical and hydrological effects [71].

The land cover parameter has the lowest relative contribution level in the NB 50:50 model. This research plots all the training datasets regardless of the land cover type. However, the locations were mainly in forest areas, and built-up areas, including roads, were not significant, as the spatial resolution of the imagery is 30 m while road width is usually less than 30 m. As a result, misclassification possibly occurred where roads were covered by vegetation. Nevertheless, land cover is an essential factor in the assessment of landslide susceptibility [72]. Changes in land cover, such as deforestation to support various human activities, can increase slope instability, which causes landslides [73].

The soil type parameter has the lowest relative contribution level in the KNN 50:50 and 60:40 models. In the study area, 57.90% of landslides occurred on Gleisol soil. Gleisol has a loamy texture, as it is formed in a basin area and is affected by excessive water [74]. Loamy soil increases the potential for landslides because the loose soil becomes relatively soft after being exposed to water and breaks when the air temperature is too high [75]. There is a relationship between soil type and landslide occurrence regarding geotechnical properties [76]. The geotechnical properties consist of hydraulic conductivity, infiltration rate, runoff and increased pore water pressure on the slope, volume change, and the rate of decrease in shear strength during rain [77]. These geotechnical properties are also related to the type of geology of an area [78]. Other areas that have the potential for landslides are sandy slope areas. When sandy slope areas also receive high rainfall, slope instability will increase, and ultimately landslides will occur [79]. In the NB 60:40 model, the geological parameter has the lowest contribution. Based on the distribution of the training dataset, landslides tend to occur on rocks originating from volcanic deposits. Volcanic deposits are easily weathered rocks, especially tuff, which is highly to completely weathered. On the other hand, previous studies have shown that geology or lithology contributes relatively significantly [58,69,80].

In the KNN 70:30 model, aspect has the lowest contribution, in contrast to other research finding that aspect has a relatively significant contribution level [58,69,80]. In this research, the landslides dominantly occurred on slopes facing northeast, with a percentage of 28%. The direction of the slope is related to the intensity of sunlight it receives. North-facing slopes are more susceptible to landslides in the southern hemisphere, and south-facing slopes are more susceptible in the northern hemisphere [81]. In the northern hemisphere, slopes facing south receive a higher intensity of sunlight than slopes facing north. In areas continuously exposed to direct sunlight, the organic content of the soil is low, which makes the soil easy to disperse and ultimately triggers landslides [81]. However, aspect does not contribute significantly to this research model, since Indonesia is situated in the equatorial region. As a result, sunlight intensity is almost the same in all directions [82].

Evaluation of landslide susceptibility is carried out to accurately determine areas that are susceptible to landslides [83]. Mistakes in determining landslide susceptibility can lead to false judgment, resulting in loss of life and property. The landslide susceptibility map becomes fundamental for evaluating sustainable disaster mitigation issues [83]. A machine learning approach can accurately and efficiently predict the level of landslide susceptibility. The application of machine learning to evaluate landslide susceptibility has not been widely implemented in Indonesia. In addition, the landslide susceptibility map in the study area still applies the conventional scoring method with low accuracy. As a result, machine learning has the potential to be implemented. Moreover, machine learning can efficiently update landslide susceptibility maps continuously. Determining the splitting ratio between training and testing data is crucial in determining the model’s accuracy. Hence, this research is expected to provide recommendations for further research using the RF, KNN, and NB algorithms. Subsequently, it can save time in the process of determining the splitting ratio between training and testing for landslide susceptibility modelling.
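The three splitting scenarios discussed in this research (50:50, 60:40, and 70:30) can be generated from one labeled inventory as sketched below. The sketch assumes stratified random sampling, so that the landslide to non-landslide proportion is preserved in both subsets; the article does not specify whether stratification was used, and the function name, seed, and inventory size are hypothetical.

```python
import random


def stratified_split(labels, train_fraction, seed=0):
    """Split sample indices into training/testing sets, class by class.

    Each class is shuffled and cut separately so that both subsets keep
    the class proportions of the full inventory.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx += idx[:cut]
        test_idx += idx[cut:]
    return train_idx, test_idx


# Hypothetical balanced inventory: 100 landslide and 100 non-landslide points.
labels = [1] * 100 + [0] * 100
splits = {}
for fraction in (0.5, 0.6, 0.7):
    splits[fraction] = stratified_split(labels, fraction)
    train_idx, test_idx = splits[fraction]
    print(f"{fraction:.0%} training: {len(train_idx)} train, {len(test_idx)} test")
```

Fixing the seed makes each scenario reproducible, which helps when the same split has to be reused across the RF, KNN, and NB models.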

5. Conclusions

This research compares the performance of the RF, KNN, and NB algorithms in producing a spatial model of landslide susceptibility in Malang Regency, East Java Province, Indonesia. According to the results, the RF algorithm yielded the highest values of the evaluation parameters ACC, SN, GM, BA, CK, and MCC, with respective values of 0.884, 0.765, 0.863, 0.857, 0.749, and 0.876. In addition, RF generates the best performance, with an AUC of 0.943. On the other hand, the optimum splitting ratios between the training and testing data for the NB and KNN algorithms in the case study were 60:40 and 50:50, with AUC values of 0.928 and 0.916, respectively. Slope has the highest relative contribution degree in all the models, with the same value of 100%. According to the best model, high susceptibility dominates Malang Regency, covering 51% of the study area. Thus, the predictive model can assist policymakers in promoting sustainable mitigation for potentially affected locations. However, optimization methods and prior knowledge concerning the selection of landslide conditioning factors and landslide occurrence inventories are necessary to improve prediction accuracy. This research recommends utilizing multi-temporal data for more complex analyses in future research.

Author Contributions

Conceptualization, Nurwatik Nurwatik and Muhammad Hidayatul Ummah; methodology, Nurwatik Nurwatik and Muhammad Hidayatul Ummah; software, Muhammad Hidayatul Ummah; validation, Muhammad Hidayatul Ummah; formal analysis, Nurwatik Nurwatik; resources, Muhammad Hidayatul Ummah; data curation, Muhammad Hidayatul Ummah; writing—original draft preparation, Muhammad Hidayatul Ummah; writing—review and editing, Nurwatik Nurwatik; visualization, Muhammad Hidayatul Ummah; supervision, Nurwatik Nurwatik and Mohammad Rohmaneo Darminto; project administration, Agung Budi Cahyono; funding acquisition, Agung Budi Cahyono and Jung-Hong Hong. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Directorate of Research and Community Service (DRPM-ITS), Institut Teknologi Sepuluh Nopember, grant number 1623/PKS/ITS/2022.

Data Availability Statement

The data presented in this study are openly available at https://tanahair.indonesia.go.id/portal-web (accessed on 25 February 2022), https://tanahair.indonesia.go.id/demnas/ (accessed on 26 February 2022), and https://developers.google.com/earth-engine/datasets (accessed on 24 March 2022).

Acknowledgments

The author acknowledges with profound thanks DRPM-ITS for its technical and financial support. The author also acknowledges the tremendous support of the Geoinformatics Laboratory, Department of Geomatics Engineering, and ITS.

Conflicts of Interest

The authors declare no conflict of interest.

References

Skempton, A.W.; Hutchinson, J. Stability of natural slopes and embankment foundations. In Proceedings of the 7th International Conference on Soil Mechanics and Foundation Engineering, Mexico City, Mexico, 29 August 1969; pp. 291–340.
Muntohar, A. Tanah Longsor: Analisis-Prediksi-Mitigasi, 1st ed.; Universitas Muhammadiyah Yogyakarta: Kasihan, Indonesia, 2012.
Keefer, D.K. Investigating landslides caused by earthquakes—A historical review. Surv. Geophys. 2002, 23, 473–510.
Lu, P.; Bai, S.; Casagli, N. Investigating spatial patterns of persistent scatterer interferometry point targets and landslide occurrences in the Arno River basin. Remote Sens. 2014, 6, 6817–6843.
Hong, H.; Naghibi, S.A.; Pourghasemi, H.R.; Pradhan, B. GIS-based landslide spatial modeling in Ganzhou City, China. Arab. J. Geosci. 2016, 9, 1–26.
El Naqa, I.; Murphy, M.J. What is machine learning? In Machine Learning in Radiation Oncology; Springer: Berlin/Heidelberg, Germany, 2015; pp. 3–11.
Ahmad Hania, A. Mengenal Artificial Intelligence, Machine Learning, & Deep Learning. Available online: https://amt-it.com/mengenal-perbedaan-artificial-inteligence-machine-learning-deep-learning/ (accessed on 3 July 2022).
Youssef, A.M.; Pourghasemi, H.R. Landslide susceptibility mapping using machine learning algorithms and comparison of their performance at Abha Basin, Asir Region, Saudi Arabia. Geosci. Front. 2021, 12, 639–655.
Yanbin, M.; Hongrui, L.; Lin, W.; Wengang, Z.; Zhengwei, Z.; Haiqing, Y.; Luqi, W.; Xingzhong, Y. Machine learning algorithms and techniques for landslide susceptibility investigation: A literature review. J. Civ. Environ. Eng. 2022, 44, 53–67.
He, Q.; Shahabi, H.; Shirzadi, A.; Li, S.; Chen, W.; Wang, N.; Chai, H.; Bian, H.; Ma, J.; Chen, Y.; et al. Landslide spatial modelling using novel bivariate statistical based Naïve Bayes, RBF Classifier, and RBF Network machine learning algorithms. Sci. Total Environ. 2019, 663, 1–15.
Al-Najjar, H.A.H.; Pradhan, B. Spatial landslide susceptibility assessment using machine learning techniques assisted by additional data created with generative adversarial networks. Geosci. Front. 2021, 12, 625–637.
Abraham, M.T.; Satyam, N.; Lokesh, R.; Pradhan, B.; Alamri, A. Factors Affecting Landslide Susceptibility Mapping: Assessing the Influence of Different Machine Learning Approaches, Sampling Strategies and Data Splitting. Land 2021, 10, 989.
Adab, H.; Atabati, A.; Oliveira, S.; Moghaddam Gheshlagh, A. Assessing fire hazard potential and its main drivers in Mazandaran province, Iran: A data-driven approach. Environ. Monit. Assess. 2018, 190, 670.
Lin, J.; He, P.; Yang, L.; He, X.; Lu, S.; Liu, D. Predicting future urban waterlogging-prone areas by coupling the maximum entropy and FLUS model. Sustain. Cities Soc. 2022, 80, 103812.
Rahmati, O.; Golkarian, A.; Biggs, T.; Keesstra, S.; Mohammadi, F.; Daliakopoulos, I.N. Land subsidence hazard modeling: Machine learning to identify predictors and the role of human activities. J. Environ. Manage 2019, 236, 466–480.
Shahzad, N.; Ding, X.; Abbas, S. A Comparative Assessment of Machine Learning Models for Landslide Susceptibility Mapping in the Rugged Terrain of Northern Pakistan. Appl. Sci. 2022, 12, 2280.
Laila Nugraha, A.; Sukmono, A.; Sugistu Firdau, H.S.; Lestari, S. Study of Accuracy in Landslide Mapping Assessment Using GIS and AHP, A Case Study of Semarang Regency. KnE Eng. 2019.
Bachri, S.; Sumarmi; Yudha Irawan, L.; Utaya, S.; Dwitri Nurdiansyah, F.; Erfika Nurjanah, A.; Wahyu Ning Tyas, L.; Amri Adillah, A.; Setia Purnama, D. Landslide Susceptibility Mapping (LSM) in Kelud Volcano Using Spatial Multi-Criteria Evaluation. IOP Conf. Ser. Earth Environ. Sci. 2019, 273, 012014.
Bachri, S.; Shrestha, R.P.; Yulianto, F.; Sumarmi, S.; Utomo, K.S.B.; Aldianto, Y.E. Mapping landform and landslide susceptibility using remote sensing, gis and field observation in the southern cross road, Malang regency, East Java, Indonesia. Geosciences 2021, 11, 4.
Ghasemian, B.; Shahabi, H.; Shirzadi, A.; Al-Ansari, N.; Jaafari, A.; Kress, V.R.; Geertsema, M.; Renoud, S.; Ahmad, A. A Robust Deep-Learning Model for Landslide Susceptibility Mapping: A Case Study of Kurdistan Province, Iran. Sensors 2022, 22, 1573.
Pham, B.T.; Vu, V.D.; Costache, R.; Van Phong, T.; Ngo, T.Q.; Tran, T.-H.; Nguyen, H.D.; Amiri, M.; Tan, M.T.; Trinh, P.T.; et al. Landslide susceptibility mapping using state-of-the-art machine learning ensembles. Geocarto Int. 2022, 37, 5175–5200.
Darminto, M.R.; Widodob, A.; Alfatinahc, A.; Chuc, H.-J. High-Resolution Landslide Susceptibility Map Generation using Machine Learning (Case Study in Pacitan, Indonesia). Int. J. Adv. Sci. Eng. Inf. Technol. 2021, 11, 369–379.
Tsangaratos, P.; Ilia, I. Comparison of a logistic regression and Naïve Bayes classifier in landslide susceptibility assessments: The influence of models complexity and training dataset size. Catena 2016, 145, 164–179.
Xu, C.; Dai, F.; Xu, X.; Lee, Y.H. GIS-based support vector machine modeling of earthquake-triggered landslide susceptibility in the Jianjiang River watershed, China. Geomorphology 2012, 145–146, 70–80.
Vakhshoori, V.; Pourghasemi, H.R.; Zare, M.; Blaschke, T. Landslide susceptibility mapping using GIS-based data mining algorithms. Water 2019, 11, 2292.
Tseng, C.M.; Lin, C.W.; Hsieh, W.D. Landslide susceptibility analysis by means of event-based multi-temporal landslide inventories. Nat. Hazards Earth Syst. Sci. Discuss. 2015, 3, 1137–1173.
Iswari, M.Y.; Anggraini, K. Demnas: Model Digital Ketinggian Nasional Untuk Aplikasi Kepesisiran. Oseana 2018, 43.
Ronodirdjo, M.Z. Buku Ajar Pengantar Geologi; Duta Pustaka Ilmu: Mataram, Indonesia, 2019.
Varianti, E. Geologi daerah Sumberbening dan sekitarnya Kecamatan Bantur Kabupaten Malang Provinsi Jawa Timur. J. Online Mhs. Bid. Tek. Geol. 2019, 1, 1–10.
Wasis, W.; Sunaryo, S.; Susilo, A. Local Fault Line Tracing in Sri Mulyo Village, Dampit Sub District, Malang Regency Based on Geophysical Data. Nat. B J. Health Environ. Sci. 2011, 1, 41–50.
Islami, D.A.L. Al Geologi daerah Klepu dan sekitarnya, Kecamatan Sumbermanjing Wetan Kabupaten Malang, Provinsi Jawa Timur. J. Online Mhs. Bid. Tek. Geol. 2017, 1, 1–12.
Martins, K.G.; Marques, M.C.M.; dos Santos, E.; Marques, R. Effects of soil conditions on the diversity of tropical forests across a successional gradient. For. Ecol. Manag. 2015, 349, 4–11.
Viet, L.D.; Chi, C.N.; Tien, C.N.; Quoc, D.N. The Effect of the Normalized Difference Vegetation Index to Landslide Susceptibility using Optical Imagery Sentinel 2 and Landsat 8. In Proceedings of the 4th Asia Pacific Meeting on Near Surface Geoscience & Engineering, Online, 30 November–2 December 2021; Volume 2021, pp. 1–5.
Yang, I.; Acharya, T.D. Exploring Landsat 8. Available online: https://www.researchgate.net/profile/Tri-Acharya/publication/311901147_Exploring_Landsat_8/links/589c0de6458515e5f4549e58/Exploring-Landsat-8.pdf%0Ahttp://earthobservatory.nasa.gov/IOTD/ (accessed on 20 April 2022).
Gessesse, A.A.; Melesse, A.M. Chapter 8—Temporal relationships between time series CHIRPS-rainfall estimation and eMODIS-NDVI satellite images in Amhara Region, Ethiopia. In Extreme Hydrology and Climate Variability; Melesse, A.M., Abtew, W., Senay, G.B.T.-E.H., Eds.; Elsevier: Amsterdam, The Netherlands, 2019; pp. 81–92. ISBN 978-0-12-815998-9.
Pettorelli, N. The Normalized Difference Vegetation Index; Oxford University Press: Oxford, UK, 2013; ISBN 0199693161.
Hashim, H.; Abd Latif, Z.; Adnan, N.A. Urban vegetation classification with ndvi threshold value method with very high resolution (vhr) pleiades imagery. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci.-ISPRS Arch. 2019, 42, 237–240.
Funk, C.C.; Peterson, P.J.; Landsfeld, M.F.; Pedreros, D.H.; Verdin, J.P.; Rowland, J.D.; Romero, B.E.; Husak, G.J.; Michaelsen, J.C.; Verdin, A.P. A quasi-global precipitation time series for drought monitoring. US Geol. Surv. Data Ser. 2014, 832, 1–12.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104.
Cunningham, P.; Delany, S.J. K-Nearest Neighbour Classifiers-A Tutorial. ACM Comput. Surv. 2021, 54, 1–25.
Gonçalves, D.N.S.; Gonçalves, C.D.M.; De Assis, T.F.; Silva, M.A. Da Analysis of the difference between the euclidean distance and the actual road distance in Brazil. Transp. Res. Procedia 2014, 3, 876–885.
Vikramkumar, B.V. Trilochan Bayes and Naive Bayes Classifier. arXiv 2014, arXiv:1404.0933.
Zhang, H. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, Sarasota, FL, USA, 12–14 May 2004; Volume 2, pp. 562–567.
Kurniawan, D. Pengenalan Machine Learning dengan Python; PT Elex Media Komputindo: Jakarta, Indonesia, 2020.
Akinci, H.; Kilicoglu, C. Random Forest-Based Landslide Susceptibility Mapping in Coastal Regions of Artvin, Turkey. ISPRS Int. J. Geo-Inf. 2020, 9, 553.
Li, X.; Cheng, J.; Yu, D.; Han, Y. Research on Non-Landslide Selection Method for Landslide Hazard Mapping. Res. Sq. 2021, 1–11.
Hossin, M.; Sulaiman, M.N. A Review on Evaluation Metrics for Data Classification Evaluations. Int. J. Data Min. Knowl. Manag. Process 2015, 5, 1–11.
Zhang, H.; Song, Y.; Xu, S.; He, Y.; Li, Z.; Yu, X.; Liang, Y.; Wu, W.; Wang, Y. Combining a class-weighted algorithm and machine learning models in landslide susceptibility mapping: A case study of Wanzhou section of the Three Gorges Reservoir, China. Comput. Geosci. 2022, 158, 104966.
Chicco, D.; Warrens, M.J.; Jurman, G. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access 2021, 9, 78368–78381.
Aslam, M. Introducing Kolmogorov-Smirnov Tests under Uncertainty: An Application to Radioactive Data. ACS Omega 2020, 5, 914–917.
Massey Jr, F.J. The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46, 68–78.
Fleming, T.R.; O’Fallon, J.R.; O’Brien, P.C.; Harrington, D.P. Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 1980, 36, 607–625.
Lee, S.; Choi, J.; Min, K. Landslide susceptibility analysis and verification using the Bayesian probability model. Environ. Geol. 2002, 43, 120–131.
Hussain, M.A.; Chen, Z.; Zheng, Y.; Shoaib, M.; Shah, S.U.; Ali, N.; Afzal, Z. Landslide Susceptibility Mapping Using Machine Learning Algorithm Validated by Persistent Scatterer In-SAR Technique. Sensors 2022, 22, 3119.
Bui, D.T.; Tsangaratos, P.; Nguyen, V.T.; Van Liem, N.; Trinh, P.T. Comparing the prediction performance of a Deep Learning Neural Network model with conventional machine learning models in landslide susceptibility assessment. Catena 2020, 188, 104426.
Abu El-Magd, S.A.; Ali, S.A.; Pham, Q.B. Spatial modeling and susceptibility zonation of landslides using random forest, naïve bayes and K-nearest neighbor in a complicated terrain. Earth Sci. Inform. 2021, 14, 1227–1243.
Park, S.-J.; Lee, D.-K. Predicting susceptibility to landslides under climate change impacts in metropolitan areas of South Korea using machine learning. Geomat. Nat. Hazards Risk 2021, 12, 2462–2476.
Soria, D.; Garibaldi, J.M.; Ambrogi, F.; Biganzoli, E.M.; Ellis, I.O. A ‘non-parametric’ version of the naive Bayes classifier. Knowl.-Based Syst. 2011, 24, 775–784.
Marjanović, M.; Kovačević, M.; Bajat, B.; Voženílek, V. Landslide susceptibility assessment using SVM machine learning algorithm. Eng. Geol. 2011, 123, 225–234.
Merghadi, A.; Yunus, A.P.; Dou, J.; Whiteley, J.; ThaiPham, B.; Bui, D.T.; Avtar, R.; Abderrahmane, B. Machine learning methods for landslide susceptibility studies: A comparative overview of algorithm performance. Earth-Sci. Rev. 2020, 207, 103225.
Nakileza, B.R.; Nedala, S. Topographic influence on landslides characteristics and implication for risk management in upper Manafwa catchment, Mt Elgon Uganda. Geoenviron. Disasters 2020, 7, 1–13.
Dai, F.C.; Lee, C.F.; Li, J.; Xu, Z.W. Assessment of landslide susceptibility on the natural terrain of Lantau Island, Hong Kong. Environ. Geol. 2001, 40, 381–391.
Nourani, V.; Pradhan, B.; Ghaffari, H.; Sharifi, S.S. Landslide susceptibility mapping at Zonouz Plain, Iran using genetic programming and comparison with frequency ratio, logistic regression, and artificial neural network models. Nat. Hazards 2014, 71, 523–547.
Çellek, S. Effect of the slope angle and its classification on landslides. Himal. Geol. 2022, 43, 85–95.
Christian, J.T.; Baecher, G.B. DW Taylor and the foundations of modern soil mechanics. J. Geotech. Geoenviron. Eng. 2015, 141, 2514001.
Take, W.A.; Bolton, M.D.; Wong, P.C.P.; Yeung, F.J. Evaluation of landslide triggering mechanisms in model fill slopes. Landslides 2004, 1, 173–184.
Kim, J.-C.; Lee, S.; Jung, H.-S.; Lee, S. Landslide susceptibility mapping using random forest and boosted tree models in Pyeong-Chang, Korea. Geocarto Int. 2018, 33, 1000–1015.
Gonzalez-Ollauri, A.; Mickovski, S.B. Hydrological effect of vegetation against rainfall-induced landslides. J. Hydrol. 2017, 549, 374–387.
Norris, J.E.; Stokes, A.; Mickovski, S.B.; Cammeraat, E.; Van Beek, R.; Nicoll, B.C.; Achim, A. Slope Stability and Erosion Control: Ecotechnological Solutions; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; ISBN 1402066767.
Guillard, C.; Zezere, J. Landslide Susceptibility Assessment and Validation in the Framework of Municipal Planning in Portugal: The Case of Loures Municipality. Environ. Manag. 2012, 50, 721–735.
Karsli, F.; Atasoy, M.; Yalcin, A.; Reis, S.; Demir, O.; Gokceoglu, C. Effects of land-use changes on landslides in a landslide-prone area (Ardesen, Rize, NE Turkey). Environ. Monit. Assess. 2009, 156, 241–255.
Tufaila, M.; Alam, S. Karakteristik tanah dan evaluasi lahan untuk pengembangan tanaman padi sawah di kecamatan oheo kabupaten konawe utara. Agriplus 2014, 24, 184–194.
Balai, B. Ksda Faktor Penyebab Tanah Longsor. Available online: http://ksdasulsel.menlhk.go.id/post/faktor-penyebab-tanah-longsor#:~:text=Tanahyangkurangpadatdan,longsor%2Cterutamabilaterjadihujan (accessed on 3 July 2022).
Mahmood, K.; Kim, J.M.; Ashraf, M. The effect of soil type on matric suction and stability of unsaturated slope under uniform rainfall. KSCE J. Civ. Eng. 2016, 20, 1294–1299.
Yeh, H.-F.; Lee, C.-C.; Lee, C.-H. A rainfall-infiltration model for unsaturated soil slope stability. Sustain. Environ. Res. 2008, 18, 271–278.
Igwe, O. The geotechnical characteristics of landslides on the sedimentary and metamorphic terrains of South-East Nigeria, West Africa. Geoenviron. Disasters 2015, 2, 1–14.
Di, B.; Stamatopoulos, C.A.; Stamatopoulos, A.C.; Liu, E.; Balla, L. Proposal, application and partial validation of a simplified expression evaluating the stability of sandy slopes under rainfall conditions. Geomorphology 2021, 395, 107966.
Chen, W.; Xie, X.; Peng, J.; Wang, J.; Duan, Z.; Hong, H. GIS-based landslide susceptibility modelling: A comparative assessment of kernel logistic regression, Naïve-Bayes tree, and alternating decision tree models. Geomat. Nat. Hazards Risk 2017, 8, 950–973.
Gilliam, F.S.; Hédl, R.; Chudomelová, M.; McCulley, R.L.; Nelson, J.A. Variation in vegetation and microbial linkages with slope aspect in a montane temperate hardwood forest. Ecosphere 2014, 5, 1–17.
Singh, S. Understanding the Role of Slope Aspect in Shaping the Vegetation Attributes and Soil Properties in Montane Ecosystems. Available online: www.tropecol.com (accessed on 4 April 2022).
van Westen, C. Landslide Risk Assessments for Decision-Making. In Proceedings of the 2012 UR Forum, Cape Town, South Africa, 2–6 July 2012; pp. 67–71.

Figure 1. Location of the study area. (A) Elevation of the study area and distribution of training points. (B) The location of the study area in East Java Province. (C) The location of East Java Province in Indonesia.

Figure 2. Landslide conditioning factors; (A) annual rainfall; (B) geological type; (C) aspect; (D) slope; (E) distance to fault; (F) elevation; (G) soil type; (H) distance to river; (I) TWI; (J) NDVI; (K) river density; (L) land cover.

Figure 3. Research workflow of the comparison of landslide susceptibility prediction using RF, NB, and KNN algorithms.

Figure 4. Cross-validation to obtain the best k value for each scenario of the KNN algorithm.

Figure 5. Landslide susceptibility modeling result of 7 models. (A) KNN 50:50 model. (B) KNN 60:40 model. (C) KNN 70:30 model. (D) NB 50:50 model. (E) NB 60:40 model. (F) NB 70:30 model. (G) RF 70:30 model.

Figure 6. Landslide susceptibility classes’ percentages for each model.

Figure 7. Cross-validation results to obtain the best mtry of the RF model.

Figure 8. Result of evaluation of each model on training data.

Figure 9. Result of evaluation of each model.

Figure 10. ROC–AUC plot of training and testing data.

Figure 11. Relative contribution degree of each model (SL = slope; RD = river density; DR = distance to river; EL = elevation; AR = annual rainfall; DF = distance to fault; AS = aspect; GT = geological type; ST = soil type; LU = land use).

Table 1. Geological unit, Malang Regency.

Code	Formation	Rock Formation	Deposit	Area (km²)
Qvtm1	Malang tuff	E: I: PA	Volcanism: subaerial—Volcanism	633.995
Qpkb	Kawi-butak volcanic rock	E: I: PC	Volcanism: subaerial—Volcanism	446.265
Tomm3	Mandalika formation	E: I: L	Volcanism: subaerial—Volcanism	401.839
Qpj	Jombang formation	ST: CC: CE: B	Volcanism: subaerial—Volcanism:	331.369
Tmn5	Nampol formation	ST: CC: M: S	Sedimentation: transitional—Sed	277.764
Qvt2	Tengger volcanic rock	E: I: PA	Volcanism: subaerial—Volcanism	238.221
Qvaw	Arjuna-Welirang volcanic rock	E: I: PC	Volcanism: subaerial—Volcanism	184.673
Tmw1	Wuni formation	ST: CC: CE: B	—	184.217
Qp	Western volcanic rock	E: I: PC	Volcanism: subaerial—Volcanism	171.113
Qpat	Anjasmara old volcanic rock	E: I: PC	Volcanism: subaerial—Volcanism	160.352
Qvs2	Semeru volcanic deposit	E: I: L	Volcanism: subaerial—Volcanism	96.447
Qpva	Anjasmara young volcano	E: I: PC	Volcanism: subaerial—Volcanism	87.365
Tomt	Tuff member	E: I: PA	Volcanism: subaerial—Volcanism	70.339
Tmcl	Campurdarat formation	ST: CC: LS	Sedimentation: littoral—Sedimen	45.227
Qpvb1	Buring volcanic deposit	E: MC: L	Volcanism: subaerial—Volcanism	39.199
Qas	Swamp and river deposits	S: CC: M: S	Sedimentation: terrestrial: fluv	26.346
Non	Lake	-	-	20.240
Tmwl1	Wonosari formation	ST: R: LS	Sedimentation: littoral: reef—S	15.264
Qvk4	Kelud young volcano	E: I: PC	Volcanism: subaerial—Volcanism:	13.200
Qpvk	Kelud old volcanic rock	E: I: L	Volcanism: subaerial—Volcanism	11.789
Tomi	Rock intrusion	IE: I	Plutonism: sub-volcanic—Plutoni	11.564
Qpvp	Marikeng volcanic rock	IE: I	Plutonism: sub-volcanic—Plutoni	6.937
Qvlh	Lava deposit	E: I: PC	Volcanism: subaerial—Volcanism	5.602
Qvs	Tengger volcanic sand	E: I: PA	Volcanism: subaerial—Volcanism	4.173
Qvk5	Kepolo volcanic deposit	E: I: L	Volcanism: subaerial—Volcanism	3.084
Qpw	Welang formation	ST: CC: M: S	Sedimentation: terrestrial: allu	2.546
Qvj	Jembangan volcanic deposit	E: MC: L	Volcanism: subaerial—Volcanism	2.225
Qt5	Terrace deposit	ST: CC: A	Sedimentation: terrestrial: allu	2.179
Qlk	Katu’s peak lava	E: I: L	Volcanism: subaerial—Volcanism	1.829
Qal	Alluvial and coastal deposit	ST: CC: A	Sedimentation: terrestrial: fluv	1.130
Qvb5	Bromo volcanic rock	E: I: PC	Volcanism: subaerial—Volcanism:	0.810
Qlks	Lava Parasite Kepolo Mt. Semeru	E: I: L	Volcanism: subaerial—Volcanism	0.727
Qlk1	Parasitic andesite lava	E: I: L	Volcanism: subaerial—Volcanism	0.058
Qlv	Avalanche deposits from volcanoes	E: I: PC	Volcanism: subaerial—Volcanism	0.035
Qpk1	Kalipucang formation	ST: CC: CE: CL	Sedimentation: terrestrial: fluv	0.001

Rock Formation: ST = sediment, CC = clastic, E = extrusive, I = intermediate, L = lava, PC = polymic, A = alluvium, M = medium, PA = pyrocla, R = reef, LS = limestone, S = sands, CE = coarse, B = brecc, MC = mafic.

Table 2. Evaluation metrics, their equations, and their objectives.

Metric	Equation	Objective
ACC	$\frac{tp + tn}{tp + fp + tn + fn}$	Indicates the ratio of correct predictions to the total number of evaluation samples [49].
SN	$\frac{tp}{tp + fn}$	Measures the fraction of correctly classified positive patterns [49].
SP	$\frac{tn}{tn + fp}$	Measures the fraction of correctly classified negative patterns [49].
BA	$\frac{sn + sp}{2}$	Measures the average of sn and sp, i.e., the mean sensitivity obtained on each class [50].
GM	$\sqrt{sn \times sp}$	Measures the square root of the product of sn and sp [50].
CK	$\frac{2 \times ((tp \times tn) - (fp \times fn))}{((tp + fp) \times (fp + tn)) + ((tp + fn) \times (fn + tn))}$	Measures the agreement between two raters (observation and prediction) [51].
MCC	$\frac{(tp \times tn) - (fp \times fn)}{\sqrt{(tp + fp) \times (tp + fn) \times (tn + fp) \times (tn + fn)}}$	Measures the performance of the classification algorithm through the correlation between observations and predictions [51].
ROC-AUC	$AUC = \frac{S_{p} - n_{p} (n_{p} + 1) / 2}{n_{p} n_{n}}$	The ROC curve plots sn (y-axis) against 1 − sp (x-axis), and AUC is the area under (the integral of) the ROC curve [10].
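
The metrics in Table 2 can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch, not the authors' implementation: the function names, dictionary keys, and example counts are hypothetical. To avoid ambiguity, the two combined sn/sp scores are labeled by what they compute (arithmetic mean and geometric mean). The AUC uses the standard rank-sum (Mann–Whitney) form, in which S_p is the sum of the ranks of the positive samples, n_p and n_n are the numbers of positive and negative samples, and the subtracted term is n_p(n_p + 1)/2.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (true positive rate)
    sp = tn / (tn + fp)                      # specificity (true negative rate)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    # Cohen's kappa in its closed form for a 2 x 2 confusion matrix
    ck = 2 * (tp * tn - fp * fn) / (
        (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    )
    # Matthews correlation coefficient; defined as 0 when the denominator is 0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "acc": acc,
        "sn": sn,
        "sp": sp,
        "arithmetic_mean_sn_sp": (sn + sp) / 2,
        "geometric_mean_sn_sp": math.sqrt(sn * sp),
        "ck": ck,
        "mcc": mcc,
    }


def auc_from_scores(pos_scores, neg_scores):
    """AUC via the rank-sum formula, using average ranks for tied scores."""
    scored = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    n_p, n_n = len(pos_scores), len(neg_scores)
    rank_sum_pos = 0.0
    i = 0
    while i < len(scored):
        j = i
        # Find the end of the group of tied scores starting at position i
        while j < len(scored) and scored[j][0] == scored[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2           # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(label for _, label in scored[i:j])
        i = j
    return (rank_sum_pos - n_p * (n_p + 1) / 2) / (n_p * n_n)


# Hypothetical counts and scores, for illustration only
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

Because the arithmetic mean of two non-negative numbers is never smaller than their geometric mean, the two combined scores differ only when sn and sp are unequal, which makes them a quick check on class imbalance in a model's errors.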

Table 3. Results of the K-S test on the training dataset.

Parameter	Landslide D-Value	Landslide p-Value	Landslide Normal Distribution	Non-Landslide D-Value	Non-Landslide p-Value	Non-Landslide Normal Distribution
River Density	0.167	2.32 × 10⁻⁶	No	0.096	0.04519	No
Annual Rainfall	0.165	3.47 × 10⁻⁶	No	0.104	0.01946	No
Distance to Fault	0.258	6.76 × 10⁻¹⁶	No	0.151	3.92 × 10⁻⁵	No
Elevation	0.107	0.01467	No	0.101	0.02766	No
Distance to River	0.192	5.39 × 10⁻⁷	No	0.152	3.03 × 10⁻⁵	No
NDVI	0.175	5.39 × 10⁻⁷	No	0.149	4.74 × 10⁻⁵	No
Slope	0.140	2.04 × 10⁻⁴	No	0.113	7.78 × 10⁻³	No
TWI	0.088	9.00 × 10⁻²	Yes	0.205	8.66 × 10⁻¹⁰	No

Hypothesis: H₀ = normally distributed; H₁ = not normally distributed.
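
The decision rule behind Table 3 can be illustrated with a one-sample K-S test against a normal distribution. The sketch below is an illustration under stated assumptions rather than the procedure used in the study: the function name is hypothetical, the mean and standard deviation are estimated from the sample, the p-value comes from the asymptotic Kolmogorov distribution, and the 0.05 significance level is an assumption chosen because it is consistent with the Yes/No decisions in Table 3. When the parameters are estimated from the same sample, this classical p-value tends to be conservative (a Lilliefors-type correction would be stricter).

```python
import math
import statistics


def ks_normality_test(values, alpha=0.05):
    """One-sample K-S test of `values` against a fitted normal distribution.

    Returns (D, p_value, is_normal), where is_normal means H0 is not rejected.
    """
    n = len(values)
    mu = statistics.fmean(values)            # assumption: parameters estimated
    sigma = statistics.stdev(values)         # from the sample itself
    xs = sorted(values)

    # D is the largest gap between the empirical CDF and the normal CDF
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        d = max(d, i / n - cdf, cdf - (i - 1) / n)

    # Asymptotic Kolmogorov survival function evaluated at sqrt(n) * D
    lam = math.sqrt(n) * d
    if lam < 0.2:
        p_value = 1.0                        # series converges slowly; p is ~1 here
    else:
        p_value = 2.0 * sum(
            (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, 101)
        )
        p_value = min(max(p_value, 0.0), 1.0)

    return d, p_value, p_value >= alpha


# Hypothetical, strongly right-skewed sample (exponential quantiles)
skewed = [-math.log(1 - (i - 0.5) / 200) for i in range(1, 201)]
d_stat, p_val, is_normal = ks_normality_test(skewed)
```

A factor is labeled as normally distributed only when the p-value is at or above the significance level, which matches the pattern in Table 3, where only TWI for landslide points (p = 0.09) is not rejected.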

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

