Entropy-Based Hybrid Integration of Random Forest and Support Vector Machine for Landslide Susceptibility Analysis

: Landslide susceptibility mapping is a crucial step in comprehensive landslide risk management. The purpose of the present study is to analyze the landslide susceptibility of Mandi district, Himachal Pradesh, India, based on optimum feature selection and hybrid integration of the Shannon entropy (SE) model with random forest (RF) and support vector machine (SVM) models. An inventory of 1723 rainfall-induced landslides was generated and randomly selected for training (1199; 70%) and validation (524; 30%) purposes. A set of 14 relevant factors was selected and checked for multicollinearity. These factors were ﬁrst ranked using Information Gain and Chi-square feature ranking algorithms. Furthermore, Wilcoxon Signed Rank Test and One-Sample T-Test were applied to check their statistical signiﬁcance. An optimum subset of 11 landslide causative factors was then used for generating landslide susceptibility maps (LSM) using hybrid SE-RF and SE-SVM models. These LSM’s were validated and compared using receiver operating characteristic (ROC) curves and performance matrices. The SE-RF performed better with training and validation accuracies of 96.93% and 88.94%, respectively, compared with the SE-SVM model with training and validation accuracies of 94.05% and 82.4%, respectively. The prediction matrices also conﬁrmed that the SE-RF model is better and is recommended for the landslide susceptibility analysis of similar mountainous regions worldwide.


Introduction
Landslides are one of the most devastating natural disasters inflicting death and destruction, especially in the mountainous regions, around the globe. Their occurrence can be attributed to the downward movement of soil mass or debris due to natural triggering factors like rainfall and earthquakes, or through anthropogenic activities like deforestation and road construction [1,2]. Landslides have affected 4.5 million people and have caused the death of 18,000 people worldwide [3]. The central Himalayan region of northern India also witnesses frequent landslides, especially in monsoon season. The state of Himachal Pradesh witnessed 18 major landslide incidences during the monsoon season of 2020. An estimated 20 buildings collapsed, and 44 people died due to these incidences, inflicting a revenue loss of $ $57.5 million to the state [4]. To reduce these losses, it is necessary to predict the potential landslide-prone areas in order to administer adequate response and emergency measures on time.
The landslide susceptibility analysis is considered a foremost tool for comprehensive landslide risk management [5,6]. The susceptibility analysis is considered a complex process. It is a readily perceived research area in recent times, and experts have proposed various techniques and methodologies in different geological and meteorological settings [7,8]. Using appropriate prediction models, the analysis involves zoning the area into potentially

Study Area and Landslide Inventory
The Mandi district in the state of Himachal Pradesh, India, has geographic coordinates of 31 • 13' and 32 • 05' north latitudes and 76 • 37' and 77 • 25' east longitudes, and a 3951 km 2 area ( Figure 1). The area's elevation increases from west to east and south to north, and ranges between 500 m to 3400 m. The area falls in the mid-hills-sub-humid zone and high hills temperate wet agro-climatic zone, which receives an annual average rainfall of 1240 mm and an annual temperature of 24 • C. The road density of the Mandi district is 155 km per 100 km 2 . The two major National Highways that run across the district's length and breadth are NH-3 (Atari-Manali-Leh) and NH-154 (Pathankot-Sundernagar-Bilaspur). The Beas river runs through the northern part of the district whereas the southern part is drained by the Satluj river.
(Pathankot−Sundernagar−Bilaspur). The Beas river runs through the northern part of the district whereas the southern part is drained by the Satluj river. The landslide inventory map's main elements include the geographical location of the landslide, its date and time of occurrence, triggering factors like heavy rainfall or earthquake, and damages incurred in terms of life and property [59]. The landslide inventory of the Mandi district was prepared by analyzing high-resolution satellite and Google Earth images, historical reports, and field surveys using handheld GPS (Figure 1). A total of 1723 rainfall-induced landslides and their location, type, and triggering factor, etc., were documented, with areas ranging from 26 m 2 to 23,164 m 2 and an average area of about 844 m 2 . Based on accepted terminology, the landslides were then categorized as shallow transitional slides with some deep rotational landslides [60]. Furthermore, it was found that a common triggering factor for all the landslides was rainfall during monsoon season and road construction activities in the region. The landslide inventory was further split into 1199 (70%) training and 524 (30%) validation datasets using a random sampling procedure in the ArcGIS environment.

Landslide Causative Factors (LCF's)
The interaction between geological, morphometric, topographical, and hydrological factors in a region influences landslides' occurrence. Hence, the appropriate selection of these causative factors is a primary step in landslide susceptibility analysis [61]. In the present study, 14 landslide causative factors, namely, slope gradient, plan curvature, slope aspect, elevation, drainage density, lithology, geology, land use and landcover (LULC), normalized difference vegetation index (NDVI), soil characteristics, lineament density, stream power index (SPI), topographic wetness index (TWI), and distance from the roads, were identified using expert opinions and data availability. The list of various data sources is presented in Table 1. These factors were rasterized to a resolution of 30 m in the GIS environment. The landslide inventory map's main elements include the geographical location of the landslide, its date and time of occurrence, triggering factors like heavy rainfall or earthquake, and damages incurred in terms of life and property [59]. The landslide inventory of the Mandi district was prepared by analyzing high-resolution satellite and Google Earth images, historical reports, and field surveys using handheld GPS (Figure 1). A total of 1723 rainfall-induced landslides and their location, type, and triggering factor, etc., were documented, with areas ranging from 26 m 2 to 23,164 m 2 and an average area of about 844 m 2 . Based on accepted terminology, the landslides were then categorized as shallow transitional slides with some deep rotational landslides [60]. Furthermore, it was found that a common triggering factor for all the landslides was rainfall during monsoon season and road construction activities in the region. The landslide inventory was further split into 1199 (70%) training and 524 (30%) validation datasets using a random sampling procedure in the ArcGIS environment.

Landslide Causative Factors (LCF's)
The interaction between geological, morphometric, topographical, and hydrological factors in a region influences landslides' occurrence. Hence, the appropriate selection of these causative factors is a primary step in landslide susceptibility analysis [61]. In the present study, 14 landslide causative factors, namely, slope gradient, plan curvature, slope aspect, elevation, drainage density, lithology, geology, land use and landcover (LULC), normalized difference vegetation index (NDVI), soil characteristics, lineament density, stream power index (SPI), topographic wetness index (TWI), and distance from the roads, were identified using expert opinions and data availability. The list of various data sources is presented in Table 1. These factors were rasterized to a resolution of 30 m in the GIS environment. An ALOS-PALSAR Digital Elevation Model (DEM) with a 12.5 m resolution was used to derive the elevation, slope gradient, slope aspect, curvature, topographical wetness index (TWI), stream power index, and drainage density of the study area using ArcGIS software. The slope gradient of an area is defined as the rate of change of elevation over a distance in the direction of the steepest fall that influences landslides in a particular area [62]. The slope gradient map was classified as flat (<15 • ), moderate (15-25 • ), moderately steep (25-35 • ), steep (35-45 • ), and very steep (>45 • ). The plan curvature of the slope represents the direction of the maximum slope and has an essential role in landslide occurrence as it controls the inflow and outflow of the drainage networks of an area [63]. The curvature map of the area was divided into five categories: convex, slight convex, flat, slightly concave, and concave. The slope's aspect is the orientation of the slope, measured clockwise in degrees from 0 to 360, where 0 is north-facing, 90 is east-facing, 180 is south-facing, and 270 is westfacing. Aspect is again an essential parameter for better understanding the slope stability in a particular direction [64]. Elevation is extensively used in landslide susceptibility. Even though a direct relationship between landslide occurrences and elevation could not be established, it still affects other parameters like rainfall and seismicity [65]. The Beas river basin surrounds the northern part of the Mandi District. It is characterized by a lower elevation, including the Balh valley in the Mandi District. The southwestern part of the Mandi District is characterized by low to moderate elevation zones with elevations ranging from 600 to 1500 m. The drainage density of an area influences the surface runoff and slope erosion potential [52]. The Beas and Satluj rivers in the Mandi District, and their tributaries and distributaries, drain the area well. Although most of the study area has a very low drainage density, there is still a considerable area with a moderate to high drainage density and significant landslides. TWI is used to quantify the hydrological impact of the drainage networks on the wetness/saturation of the soils on the slopes. The greater the value of the TWI, greater is the soil's water content, and the more significant is its hence tendency for erosion [66]. The TWI map was generated by the combined arithmetic applications of slope gradient and flow accumulation parameters, using TWI = (Ln (As)/Tan (β)), where Ln is the natural log, As is the flow accumulation, and β is the slope gradient in radians. The TWI of the area was classified into the following five categories: very low, low, moderate, high, and very high. Lineament density is defined as the total length of all of the lineaments divided by the area under study. Stream power index (SPI) describes the potential of flow erosion of a particular point on the topographic surface. In a given catchment area, the amount of water contributed by the upslope area increases with the increase in the slope gradient. This, in turn, increases the SPI and risk of erosion at a given point on the surface. An SPI map of the study area was also prepared by the combined arithmetic application of morphometric variables, which includes the slope gradient and flow accumulation parameters using SPI = As x Tan(β). The lineament density was computed using the line density tool in the GIS environment. The Operational Land Imager (OLI) and Thermal Infrared Sensor (TRIS) of the Landsat-8 satellite with 12-Band multispectral images with a 30-m resolution were procured from 2015 to 2020, from the first week of October to the last week of November, were considered adequate due to the availability of cloud-free data just after the monsoon season is over. The NDVI map is one of the most fundamental and widely accepted indexes to detect vegetation and landcover changes caused by infrastructural developmental activities [67]. NDVI provides information about landcover changes using the energy absorbed and emitted by the objects on the Earth's surface, and is prepared by image analysis techniques on high-resolution Landsat-8 images using the following: NDVI = (NIR − RED)/(NIR + RED), where NIR (Near Infra-Red Band) and RED bands represent the spectral reflectance bands of the electromagnetic spectrum. The LULC characteristics of an area indicate the physical characteristics of the Earth's surface and the change brought due to human interference. Deforestation, intensive agriculture, and new infrastructure development may lead to soil and slide degradation, which are the main causative factors for landslide occurrences [35]. The LULC map of the area was generated using Landsat-8 images by the maximum likelihood classification method in ERDAS Imagine software. An area's geological and lithological boundaries are closely related to the slope and rock strength, and such boundaries may lead to increased landslide activity [61]. The study area lies within the lesser Himalayan region. The Jutog, Chail, Shah, and Tertiary group of rocks are predominant in the district. The oldest rocks belong to Jutog groups, whereas the youngest valley fills are of a recent age, comprising clay, sand, and gravel beds. The Jutog formation comprises slates, Schists, and Quartzite, with Hematite and magnetite bands included in Chail formation. The Granitic rocks are found to occur around the Karsog area. Thin bands of Slates are also found to occur. Salt grit, locally known as lokhan, is overlain by Mandi Darla volcanics [68]. The soil type is characterized by the percentage of sand, silt, and clay minerals present, configuring the soil's texture and hydraulic properties. Even though it is not straightforward to define the complex relationship between the soil's hydrological properties and the mechanics through which landslides occur, soils with a higher permeability still allow the water to flow through them, making them more susceptible to landslides. Additionally, different types of soils have different cohesion values. Therefore, the infiltrated water might erode the soils with lesser cohesion values [69]. Five soils were identified in the study area, along with their depth, drainage, and erosion properties. The road construction activities in mountainous regions result in loss of support and crack development due to an increased strain in the upper soil mass. In addition, road construction leads to a change in each area's natural drainage corridor [70]. Hence, landslide occurrences are more common along the road alignment. The distance to the roads of the study area was divided into six classes: 0-100, 100-200, 200-300, 300-400, 400-500, and >500 m. The thematic layers of landslide causative factors were prepared using Arc GIS 10.4.1 and GEOMATICA. The mathematical calculations for statistical and machine learning methods were carried out in SPSS software and integrated R-ArcGIS bridge for spatial data analysis.
In the landslide susceptibility analysis, landslide causative factors are usually selected based on the area's landslide categories, geological, and topographical characteristics [43]. As no definite guidelines are available for the optimum selection of these factors, many research studies selected these factors randomly or based on data availability. However, to avoid overfitting data and achieve the maximum predictive potential from the model, it is necessary to quantify and select the best-suited subset from the available factors and to remove non-essential factors with a low correlation to landslide occurrence [61]. Many researchers have used various techniques, like linear correlation, factor analysis, chi-square ranking, and multi-factor approach, to carry out feature selection process but encountered problems of excess time consumption and inability in order to decide the threshold for minimum factor inclusion in the model [71][72][73][74]. However, some studies have used a hybrid approach of combining feature ranking with a statistical significance test to select the optimum feature subset. The statistical significance indicates the level of confidence based on which a null hypothesis can be accepted or rejected. A hypothesis can only be accepted at a 95% significance level (p > 0.05) for the systematic pairwise difference between the different model performances. A k-fold cross-validation procedure is generally applied to split the dataset into subsets, allowing for different training samples for each process. In the present study, Chi-squared and information gain algorithms were used for feature ranking, logistic regression (LR) was used as the initial predictive model, and Wilcoxon's Signed-Rank test and One-Sample T-Test were used to measure the statistical significance level to obtain the optimum feature subset. The detailed methodology of the study is depicted in Figure 2.
Geomatics 2021, 1, FOR PEER REVIEW 6 model, it is necessary to quantify and select the best-suited subset from the available factors and to remove non-essential factors with a low correlation to landslide occurrence [61]. Many researchers have used various techniques, like linear correlation, factor analysis, chi-square ranking, and multi-factor approach, to carry out feature selection process but encountered problems of excess time consumption and inability in order to decide the threshold for minimum factor inclusion in the model [71][72][73][74]. However, some studies have used a hybrid approach of combining feature ranking with a statistical significance test to select the optimum feature subset. The statistical significance indicates the level of confidence based on which a null hypothesis can be accepted or rejected. A hypothesis can only be accepted at a 95% significance level (p > 0.05) for the systematic pairwise difference between the different model performances. A k-fold cross-validation procedure is generally applied to split the dataset into subsets, allowing for different training samples for each process. In the present study, Chi-squared and information gain algorithms were used for feature ranking, logistic regression (LR) was used as the initial predictive model, and Wilcoxon's Signed-Rank test and One-Sample T-Test were used to measure the statistical significance level to obtain the optimum feature subset. The detailed methodology of the study is depicted in Figure 2.

Shannon Entropy (SE) Model
The entropy of a system conceptually measures the degree of randomness, disorder, uncertainty, or instability of a system [6,33]. Claude Shannon, in 1948, developed the concept of entropy to analyze a fundamental communication problem of information theory, but later, this theory was found to be helpful in other areas. The concept of the entropy of landslides refers to the probability distribution of landslide occurrences concerning its frequency in each subclass of landslide causative factors [34,63]. Thus, entropy values can

Shannon Entropy (SE) Model
The entropy of a system conceptually measures the degree of randomness, disorder, uncertainty, or instability of a system [6,33]. Claude Shannon, in 1948, developed the concept of entropy to analyze a fundamental communication problem of information theory, but later, this theory was found to be helpful in other areas. The concept of the entropy of landslides refers to the probability distribution of landslide occurrences concerning its frequency in each subclass of landslide causative factors [34,63]. Thus, entropy values can be used to calculate the relative weights of the data based on an index system using the following equations E jmax = log 2 N j , j = number of subclasses where E j and E jmax are the entropy values, H j is the information coefficient, and N j is the number of classes in each landslide causative factor. W j is the relative weight assigned to each landslide causative factor and FR is the frequency ratio value.

Random Forest (RF) Model
The RF model is a supervised learning algorithm that combines decision tree predictors. Each tree has randomly sampled independent data, and each tree fits independently in the data subset, achieved by splitting existing samples and regenerating random new samples using bootstrapping [75]. The RF model is a widely used model for regression and classification problems. It provides a high prediction accuracy, low errors, and can reduce the risk of overfitting [42,54]. To achieve good model performance and minimize the errors in the RF model, three hyperparameters are defined, namely: (i) the number of trees to be grown/combined (ntree), (iii) the maximum number of features to be considered at each split (mtree), and (iv) the size of the terminal nodes (nodesize).

Support Vector Machine (SVM) Model
SVM is a machine learning algorithm based on statistical learning and structural risk minimization theory [76]. The primary aim of SVM is to separate the non-linear dataset using an optimal hyperplane into two sample classes. The optimal classification hyperplane maximizes the margin of separation and splits the dataset points as ±1, where +1 refers to the presence and -1 refers to the absence of point on the classification hyperplane. The distance between the training points adjoining the classification hyperplane (support vectors) is known as the classification margin [77]. The activation kernel function transforms non-linear data into a higher dimensional feature space for linear classification. The kernel functions can be classified as linear, polynomial, radial, and sigmoid. Previous studies used these radial basis kernels, as well as polynomial kernel functions, the most in the landslide susceptibility analysis [78,79]. The optimum hyperplane is generated using the decision function f(x) = (ω.φ(x)) + b, where ω represents the coefficient vector defining the orientation of the classification hyperplane, φ(x) is the input sample x converted to high dimensional feature space, and b is the offset of hyperplane taken from origin.

Multicollinearity Analysis
A multicollinearity test was conducted to identify the interdependence among the landslide causative factors. Any collinearity among variables can result in errors in output and decrease the model's predictive potential [80]. Variance inflation factor (VIF) values > 10 or tolerance values < 0.1 suggest the problem of collinearity among the independent variables. Out of the 14 landslide causative factors initially selected for analysis, it was found that all factors have acceptable values of VIF and tolerance (Table 2). Hence, all 14 landslide causative factors were deemed suitable for further analysis of the optimum feature selection and landslide susceptibility analysis.

Optimum Selection of LCF's
In the present study, landslide causative factors' quality and usefulness were determined using information gain and Chi-square ranking algorithms. The logistic regression (LR) model was applied iteratively to access the prediction capabilities of feature datasets with an additional feature in each step.
A k-fold method was applied to split the landslide inventory dataset into 10 subsets to produce a new training dataset for each iterative step. It can be observed from Table 3 that both these methods produced different weights. In the next step, the Wilcoxon signed-rank test and One-Sample T-Test were applied as a statistical significance test for a pairwise comparison of the prediction models. The results of all of the possible model scenarios are shown in Table 4. It was observed that Case-4 and Model-11 had relevant features, a high prediction performance, and a high confidence level, and were selected as the optimum feature subsets for further analyses. The selected features are shown in Figure 3. Geomatics 2021, 1, FOR PEER REVIEW 9

LSM Using SE-RF Model
The Ej values calculated for the Shannon entropy model were used to reclassify the subsets of landslide causative factors, and each factor was assigned a relative Wj value ( Table 5). The landslide inventory data were split into 10-folds using the k-fold cross-validation method. These factors were then used as inputs for the FR model. The hyperparameters for the RF model were taken as ntree = 250, mtree = 5, and nodesize = 5. The LSM SE-RF produced was classified into five susceptibility classes, as follows: very low (0-0.196), low (0.196-0.372), moderate (0.372-0.552), high (0.552-0.745), and very high (0.745-1). The percentage of area in each susceptibility class was calculated as very low (22.47%), low (23.72%), moderate (22.12%), high (17.51%), and very high (14.17%; Figure 4a).

LSM using SE-SVM Model
The 10 k-fold cross-validation dataset and landslide causative factors reclassified using SE model factors were considered as an input in the radial kernel-based SVM algorithm for calculating the LSMSE-SVM values. These values ranged from 0 to 1, where values closer to 0 indicated a lower probability of landslide occurrence and values closer to 1 indicated a higher probability of landslide occurrence. The LSM produced using the SE-SVM model was classified into five categories, namely very low (0-0.349), low (0.349-0.450), moderate (0.450-0.556), high (0.556-0.674), and very high (0.674-1) (Figure 4b), using the natural breaks classification method in a GIS environment. The LSMSE-SVM map analysis indicated that the study area percentage was very low, low, moderate, high, and very high, with 17.76%, 24.35%, 26.12%, 19.28%, and 12.49%, respectively (Figure 4b).

LSM Using SE-SVM Model
The 10 k-fold cross-validation dataset and landslide causative factors reclassified using SE model factors were considered as an input in the radial kernel-based SVM algorithm for calculating the LSM SE-SVM values. These values ranged from 0 to 1, where values closer to 0 indicated a lower probability of landslide occurrence and values closer to 1 indicated a higher probability of landslide occurrence. The LSM produced using the SE-SVM model was classified into five categories, namely very low (0-0.349), low (0.349-0.450), moderate (0.450-0.556), high (0.556-0.674), and very high (0.674-1) (Figure 4b), using the natural breaks classification method in a GIS environment. The LSM SE-SVM map analysis indicated that the study area percentage was very low, low, moderate, high, and very high, with 17.76%, 24.35%, 26.12%, 19.28%, and 12.49%, respectively (Figure 4b).
It was observed that the Wj values were the highest for the drainage density (0.269), TWI (0.140), and NDVI (0.121), and the ranking algorithms also suggested that these features had high ranking coefficients. Hence, these were identified as the primary factors responsible for higher landslide susceptibility in the study area.

Performance and Validation of Models
In the present study, the predictive performance of the hybrid LSM models was evaluated using various statistical and visual performance metrics. The confusion matrix is generally used in classification problems. It represents the counts from actual and predicted  Table 6. The results of the prediction matrices indicate that both models have good accuracies, prediction, and recall values, and the MAE and RMSE errors generated in both models are under acceptable limits. Based on these statistical and visual performance metrics, it can be found that both models have a good predictive potential and acceptable values of errors.
Geomatics 2021, 1, FOR PEER REVIEW 13 It was observed that the Wj values were the highest for the drainage density (0.269), TWI (0.140), and NDVI (0.121), and the ranking algorithms also suggested that these features had high ranking coefficients. Hence, these were identified as the primary factors responsible for higher landslide susceptibility in the study area.

Performance and Validation of Models
In the present study, the predictive performance of the hybrid LSM models was evaluated using various statistical and visual performance metrics. The confusion matrix is generally used in classification problems. It represents the counts from actual and predicted values using true positive (TP), true negative (TN), false positive (FP), and falsenegative (FN) values, where TP indicates the number of actual landslide pixels classified accurately, TN indicates the number of non-landslide pixels classified accurately, FP indicates the number of non-landslide pixels classified as landslides pixels, and FN indicates the number of actual landslide pixels classified as non-landslides pixels. The confusion matrix accuracy, precision, recall, root mean square error (RMSE), and mean absolute error (MAE) values were calculated. Receiver operating characteristic (ROC) curves with area under curve (AUC) values are generally used to access the classification performance. The higher the AUC's value, the better the model is for accurately predicting the landslide and non-landslide pixels. The landslide inventory training dataset was used to generate the AUC prediction curve, while the validation dataset was used to generate the AUC validation curve and performance matrices. It was found that the SE-FR and SE-SVM models had training accuracies with AUC = 96.933 and AUC = 94.053, respectively. In contrast, the prediction capability of the models had AUC = 88.945 and AUC = 82.4 values, respectively ( Figure 5). In terms of the results obtained from the confusion matrices, the SE-FR and SE-SVM models had values for accuracy of 0.896 and 0.854, precision of 0.958 and 0.931, respectively, and recall of 0.814 and 0.790, respectively. The MAE and RMSE values of the SE-FR and SE-SVM models were 0.135 and 0.174, and 0.295 and 0.347, respectively as shown in Table 6. The results of the prediction matrices indicate that both models have good accuracies, prediction, and recall values, and the MAE and RMSE errors generated in both models are under acceptable limits. Based on these statistical and visual performance metrics, it can be found that both models have a good predictive potential and acceptable values of errors.

Discussion
Statistical and ML modeling is an essential component for determining the landslide susceptibility of an area. The accuracy of statistical models depends on the data quality, appropriate landslide causative factors, and model structure. The process of the generation of landslide susceptibility maps is complex and requires a multistep analysis. The present study focusses on three main issues: (a) the optimum selection of landslide causative factors using feature selection process, (b) the mapping of landslide susceptibility of Mandi District using hybrid SE-RF and SE-SVM models, and (c) the comparison of these two hybrid models based on performance matrices. In the present study, 14 LCFs were evaluated to find the optimum subset. It was found that the feature selection process, which used a hybrid approach, resulted in the selection of 11 optimum feature subsets.
The analysis of the LSMs produced using SE-RF and SE-SVM models indicated that areas with high TWI values, particularly in the 0-100 m distance to roads, are highly prone to landslides. In mountainous regions like the study area, the continuous excavation of slopes for road construction activities and the infiltration of water, especially during monsoon season, results in an increased burden on slopes. The soil mass in such areas becomes unstable and often results in sliding. Lower elevation regions have seen such anthropological activities on a larger scale than the higher elevation regions of the study area. In addition, higher-elevation regions have less accessibility, and few landslides are reported.
Similarly, the analysis of the geology map confirms that the Middle Siwalik Group was highly prone to landslides due to the presence of sedimentary rocks like mediumto coarse-grained sandstone and conglomerate. The rest of the causative factors, such as curvature, aspect, and lineaments, have a lower influence on the landslide susceptibility of the region. Such a combination of LCF's is seen in similar studies of mountainous regions [5,11,12,36,48,70,81].
RF and SVM are two highly efficient machine learning models that can tackle complex non-linear relationships among variables, and are readily used by researchers in classification problems. FR is a combined tree-based model that can handle high dimensional spaces and categorical features with a high accuracy and is easily interpretable. A disadvantage of the RF model is its incapacity to calculate the relative importance of each subclass of the landslide causative factors. SVM uses "support vectors" and performs better when data are sparce and non-linearly separable. SVM has the advantage of having non-linear kernel functions but has a higher tendency of overfitting. SE is a statistical bivariate model that can calculate the factors' relative weights and subclasses with relative ease and minimal time consumption. The analysis of the results of this study indicated that the integration of SE-RF and SE-SVM models resulted in increased accuracy and efficiency for both models.
In comparison with each other, the SE-RF model performed better than the SE-SVM model. The SE-SVM model has a +2.88% higher AUC for model validation and +6.54% higher AUC for model prediction. The performance matrices also indicated an increase in +4.2% accuracy and +2.7% precision, with a 3.9% decrease in MAE and 5.2% decrease in RMSE errors. This may be attributed to the overdependence of the SVM model on data pre-processing and kernel functions. Such results are consistent with the findings of [80] who combined LR and SVM with the IOE method, who combined EBF method with RF to obtain landslide susceptibility, who used an ensemble of WOE with different kernel functions of SVM, and who used various DEM's and an integrated FR-RF model for assessment of the landslide susceptibility. Thus, using a hybrid approach to integrate statistical and ML models helps eliminate the disadvantages of individual models and increases the overall efficiency prediction capabilities of the models.

Conclusions
The analysis of landslide susceptibility is the primary step in managing and mitigating landslide risk in a mountainous region. Many statistical and ML algorithms have been used in recent years, but no definitive method is considered best for preparing the LSM of a region. The hybrid integration of these methods has the advantage of a better prediction potential compared with individual models. In the present study, a statistical SE model is integrated with RF and SVM models to overcome the shortcomings of the individual method. A total of 14 LCFs (slope gradient, plan curvature, slope aspect, elevation, drainage density, lithology, geology, land use and landcover (LULC), normalized difference vegetation index (NDVI), soil characteristics, lineament density, stream power index (SPI), topographic wetness index (TWI), and distance from the roads) were identified. A feature selection process was carried out using two feature ranking algorithms, i.e., information gain and Chi-square, which were used to determine the individual scores of the LCFs, and Wilcoxon signed-rank test and One-Sample T-Test were used to determine the statistical significance of the factors. The results of both hybrid models indicated TWI and distance from roads to be the two primary factors responsible for landslide occurrences in the study area. The results also indicated that although both models performed satisfactorily, the SE-RF model had +2.88% and +6.54% higher AUC values than the SE-SVM model. The main advantage of such an approach is that only relevant LCFs were used to generate the LSM. The integration of models helps establish an effective spatial relationship between landslide occurrences and LCFs, while reducing overfitting problems. This study will help regional planners and stakeholders in effective landslide risk management and sustainable developmental activities.