Geohazards Susceptibility Assessment along the Upper Indus Basin Using Four Machine Learning and Statistical Models

: The China–Pakistan Economic Corridor (CPEC) project passes through the Karakoram Highway in northern Pakistan, which is one of the most hazardous regions of the world. The most common hazards in this region are landslides and debris ﬂows, which result in loss of life and severe infrastructure damage every year. This study assessed geohazards (landslides and debris ﬂows) and developed susceptibility maps by considering four standalone machine-learning and statistical approaches, namely, Logistic Regression (LR), Shannon Entropy (SE), Weights-of-Evidence (WoE), and Frequency Ratio (FR) models. To this end, geohazard inventories were prepared using remote sensing techniques with ﬁeld observations and historical hazard datasets. The spatial relationship of thirteen conditioning factors, namely, slope (degree), distance to faults, geology, elevation, distance to rivers, slope aspect, distance to road, annual mean rainfall, normalized difference vegetation index, proﬁle curvature, stream power index, topographic wetness index, and land cover, with hazard distribution was analyzed. The results showed that faults, slope angles, elevation, lithology, land cover, and mean annual rainfall play a key role in controlling the spatial distribution of geohazards in the study area. The ﬁnal susceptibility maps were validated against ground truth points and by plotting Area Under the Receiver Operating Characteristic (AUROC) curves. According to the AUROC curves, the success rates of the LR, WoE, FR, and SE models were 85.30%, 76.00, 74.60%, and 71.40%, and their prediction rates were 83.10%, 75.00%, 73.50%, and 70.10%, respectively; these values show higher performance of LR over the other three models. Furthermore, 11.19%, 9.24%, 10.18%, 39.14%, and 30.25% of the areas corresponded to classes of very-high, high, moderate, low, and very-low susceptibility, respectively. The developed geohazard susceptibility map can be used by relevant government ofﬁcials for the smooth implementation of the CPEC project at the regional scale.


Introduction
The Himalayan-Karakoram Mountain (HKM) in northern Pakistan are among the most important mountain ranges globally, with changing topographic and climatic characteristics [1,2]. The irregular topography makes this region extremely vulnerable to geohazards, including landslides, debris flow, glacial erosion, flash floods, river incision, and periglacial action [3,4]. These geohazards cause severe damages to human lives and infrastructure [5][6][7][8]. The China-Pakistan Economic Corridor (CPEC) project extends from China to Pakistan through this mountain range. This project has been a driver of the recent rapid growth of urban and rural areas in Pakistan. However, the expected development of new infrastructure in this region faces many challenges as geological surveying and assessment activities have been impeded by geohazards.
In particular, landslides and debris flow are the most prominent geohazards affecting human lives and infrastructure in this region. These hazards are considered to be catastrophic natural hazards with potential for causing significant damage to human lives, infrastructure, agricultural livestock, and forest growth; thus, they have profound effects on social activity and economic growth [9][10][11]. The occurrence of these geohazards varies according to different environmental settings, from mountainous to coastal areas, tectonically active to tectonically stable and inactive regions, and very dry to humid areas [12,13]. Therefore, identification of the spatial distribution of these geohazards is crucial for disaster management.
Geographical Information System (GIS) and Remote Sensing (RS) approaches are useful for determining the spatial distribution of geohazards, which facilitated the development of susceptibility maps [14][15][16][17]. GeoHazard Susceptibility Maps (GHSMs) are used to determine the relative probability of hazard occurrence by considering the causes of previous events. The term susceptibility can be defined as the probability of a particular hazard in a particular area. In our case, the probability of an area to experience landslides and debris flows occurrence [18]. In GHSMs, an area is divided into different classes depending on the probability of events, such as slope failure and mass movements. According to the availability of data, GHSMs can be prepared using different approaches, including the fuzzy logic [19][20][21], Multi-Criteria Decision Making (MCDM) [22], Weights-of-Evidence (WoE) [23,24], Logistic Regression (LR) [25], Support Vector Machine (SVM) [26], neuro-fuzzy [27], Artificial Neural Networks (ANNs) [28], Frequency Ratio (FR) [29], and Shannon Entropy (SE) models [30]. Among these listed models, LR, FR, WoE, and SE were found to be more accurate, easy to implement, and highly reliable [31][32][33].
Thus far, many researchers have performed geohazard susceptibility analyses for the development of susceptibility maps. However, only Ahmed et al. [34] have carried out a regional-level study on the susceptibility of northern Pakistan to geohazards; they used Weighted Overlay Model (WOM) and fuzzy logic methods. Some individual studies have also been conducted on smaller-scale areas in this region: Derbyshire et al. [35] studied the terrain type and geomorphic processes of a 200 km section of the Karakoram Highway (KKH) from Gilgit to Khunjerab Pass. Kamp et al. [36] developed a Landslide Susceptibility Map (LSM) using the Multi-Criteria Evaluation (MCE) and the Analytical Hierarchy Process (AHP) method. Bacha et al. [37] developed a detailed LSM of Hunza-Nagar district using WoE and FR models. Recently, Ali et al. [38] developed an LSM of the KKH from Hassanabdal to Khunjerab pass, using AHP and WOM models. Khan et al. [39] studied the spatial distribution of landslides and developed an LSM using the FR approach (Haramosh Valley and Bagrote Valley).
Although geospatial techniques have shown enormous potentials for assessing hazard susceptibility, their use remains relatively low in the context of the study area. In addition, comprehensive multi-hazard (landslides and debris flows) risk management and susceptibility mapping is lacking at a regional scale, especially in areas frequently affected by geohazards. Previously, Ahmed et al. [35] developed an LSM for the upper Indus basin (Pakistan) by considering elevation, slope, aspect, curvature, seismic hazards, geological structure, precipitation (monsoon), drainage network, and Normalized Differ-ence Vegetation Index (NDVI). However, to generate a detailed hazard map, all factors that control their spatial distribution of geohazards need to be considered. The addition of more conditioning factors will improve the precision and accuracy of GHSMs. Though there have been several studies assessing individual landslide or debris flow susceptibility in the study area, none of the works have reported a significant role in controlling the spatial distribution of geohazards in the case study site. The present study aims to fill this gap. Furthermore, the implementation of only one model is not sufficient for the susceptibility assessment of an area because every model has drawbacks.
To address these limitations, thirteen conditioning factors, namely, slope (degree), distance to faults, geology, elevation, distance to rivers, slope aspect, distance to road, annual mean rainfall, NDVI, profile curvature, Stream Power Index (SPI), Topographic Witness Index (TWI), and Land Cover (LC), were considered in this study. Moreover, four machine learning and statistical models, SE, LR, WoE, and FR, were adopted. The main objectives of this research are (i) to analyze factors influencing the spatial distribution of geohazards (landslides and debris flows) and (ii) to compare the performances of the four models. The results of the study are not only relevant to other researchers in the field, but also policymakers, in that the study provides useful insights for managing geohazards.

Study Area
The study area is located at 31 • 0 N-38 • 0 N longitude and 70 • 0 E-80 • 0 E latitude, covering the upper Indus basin (Pakistan) and southwestern part of the Tashkurgan Valley (China) (Figure 1). The transboundary study area was selected based on the importance of the CPEC project. The main route of the CPEC project (KKH) passes through the northern areas of Pakistan and connects to China through the Khunjerab Pass. The mainstream channel in this area is the Indus River, which flows along the KKH, from north to south. The river originates from the Tibetan Plateau (China), entering Pakistan in the Dardistan region through Jammu Kashmir (Indian Territory), crossing the Himalayas, and finally flows into the Arabian Sea. This region is highly prone to geohazards, including debris flow and landslides, because of its uneven terrain, high seismicity, and high intensity of monsoon precipitation (200-500 mm), spanning from July to mid-September (http://www.pmd.gov.pk/en/ (accessed on 2 February 2021)). Topographically, the study area is part of the Himalaya, Hindukush, and Karakoram (HHK) mountain ranges. The elevation of the study area ranges from 160 m to 8564 m above sea level (a.s.l.), with an average altitude of 3500 m. Slope angles in this region range from 0 to 87 • . The common geomorphological features in the lower reach of the valleys are alluvial fans and flood terraces with some big debris fans, while snow and glaciers cover the higher slopes and peaks.
The study area features different geological formations of various ages, exposed along the Indus River and KKH (Figure 2a and Table 1). Tectonically, this region is characterized by orogenic features that began to form with the start of the Indo-Eurasian collision (50 million years). Active faults, crustal shorting, and subduction are ongoing processes with an uplift rate of 7 mm/year and a convergence rate of 4-5 cm/year along the Indo-Eurasian collision zone [40,41]. Important tectonic features of this region include the main boundary thrust, main mantle thrust, main Karakoram thrust, Kamila Jal shear zone, Raikot fault, and Karakoram fault (Figure 2b) [42,43]. All these tectonic features are responsible for high seismicity, ultimately causing brittle deformation along the KKH. Over the last two centuries, frequent earthquakes with magnitudes greater than 6.5 Mw (18976.5 Mw ( , 19056.5 Mw ( , 19346.5 Mw ( , 19506.5 Mw ( , and 2005 have been recorded in this region [44]. The geological, geomorphological, and hydrogeological environment of the area is favorable for geohazards [34,[45][46][47]. Valley hills in the Indus River basin exhibit an extensive amount of mass movement, rockslide avalanches (volumes >1 × 10 10 m 3 ), and slope failures, especially along the KKH [48][49][50]. Considering that the main route of the CPEC project passes through this highly active region, susceptibility assessment of geohazards is essential.

Data Collection
The first step for the development of GHSMs is the determination of causative factors. Based on previous studies and the characteristics of geohazards in the study area [29,34,38], NDVI, TWI, distance to faults, elevation, annual mean rainfall, slope (degree), slope aspect, profile curvature, distance to rivers, distance to road, LC, geology, and SPI were selected as conditioning factors ( Figure 3). The elevation of the study area was extracted from a Shuttle Radar Topography Mission (SRTM) 30 m digital elevation model (https://earthexplorer.usgs.gov/ (accessed on 2 February 2021)), and the slope, slope aspect, profile curvature, SPI, TWI, and drainage network layers were developed from the elevation map in the ArcGIS platform. The annual mean rainfall map was prepared based on 27 weather stations (from 2000 to 2015) in the study area, using the Inverse Distance Weighted (IDW) tool in ArcGIS [51,52]. Geology and faults of the study area were collected from a regional geological map of the study area (at the scale of 1:1,000,000) (http://gsp.gov.pk and http://en.cgs.gov.cn/ (accessed on 2 February 2021)). A brief description of these conditioning parameters is listed in Table 2.

Geohazards Inventory
A geohazard inventory map was prepared after a field visit, using the interpretation of satellite imageries (Landsat 7 and Landsat 8: https://earthexplorer.usgs.gov/ (accessed on 2 February 2021)) and Google Earth images. During the field survey (from 16 October to 26 October 2019), the coordinates of the geohazards were taken and these coordinate points were digitized as polygons. The total numbers of sample points are 480, of which 240 are geohazard points (210 landslides and 30 debris flow events), and 240 are non-hazards points ( Figure 1). All these points were used as sampling points. For both landslide and debris flow inventories, hazard pixels were assigned a value of 1, and non-hazard pixels were assigned a value of 0. To evaluate the predictability, field verified data were potentially used. Finally, the total hazard events were randomly divided into training models (70%, comprising 168 hazard points) and validation models (30%, comprising 72 hazard points) [53]. Distance to road The presence of road/highway can cause many geohazards mainly due to changes in slopes stability and shear stress resulting from the excavation, undercutting of the slope, changes in hydrological conditions, and additional loads. The KKH is the main route of the CPEC project that connects Pakistan to China. The KKH is passing through northern Pakistan, where the reconstruction and re-routing are still ongoing. Therefore, only KKH was considered for this study to prepare the distance to the road thematic map.
Extracted from Google earth Acharya and Lee [29] Distance to rivers River channels affect slope stability by slope toe erosion and saturating the lower part of the material. The Tolerance (TOL) and Variance Inflation Factor (VIF) were used to test the multicollinearity of all conditioning parameters, as linear collinearity among the conditioning parameters decreases the predictive capability of a model [66]. Generally, a TOL less than 0.10 or a VIF greater than 5.0 indicates multicollinearity among the conditioning factors [67].
Multicollinearity analysis was performed, and the values of TOL and VIF were found to be >0.10 and <5.0, respectively, which indicate that the variables are free from collinearity ( Table 3). LR is a multivariate statistical approach used to evaluate the association between dependent variable and independent variables [68,69]. The advantage of this model is that the data need not have a normal distribution and the variables can be either categorical, continuous, or have any combination of these two [70,71]. The mathematical expression of the LR model is given below [26]: where ρ is the probability of hazards or non-hazards occurrence, z is the linear combination, n represents the number of conditioning variables, x i (i = 0, 1, 2, . . . , n) are the conditioning variables, l 0 is the intercept of this model, and l i (i = 0, 1, 2, . . . , n) are the regression coefficients for the independent variables of the LR model.

Weights-of-Evidence (WoE)
WoE is a quantitative "data-driven" approach [72]. This method was first used in the field of medicine [73] and then applied to the identification of mineral potential [74]. WoE is based on Bayes' hypothesis-the concept of previous and subsequent probability to predict the relationship between the spatial distribution of hazard affected areas and analyzed hazard factors as reported by Bonham-Carter [75] as follows: where p is the probability of occurrence of a particular event, and ln is the natural log. Similarly, B h denotes the presence of a potential geohazard predictive factor, B h is the absence of a potential geohazard predictive factor, A h is the presence of geohazards, and A h denotes the absence of geohazards. In Equations (2) and (3), W + h and W − h show the relationship of each factor's class to geohazard occurrence, where positive weights are directly proportional for the occurrence of geohazards and vice-versa with negative weights. To calculate the weights of each conditioning factor, Equations (4) and (5) were used [76]: where N pix 1 is the number of pixels showing the presence of geohazards and conditioning factors; N pix 2 is the presence of geohazards and absence of conditioning factors; N pix 3 is the absence of geohazards and the presence of conditioning factors; N pix 4 is the absence of both geohazards and conditioning factors. The final weight W h C was estimated using the following equation:

Frequency Ratio (FR)
FR is a widely used probability-based approach for GHSM. The FR is defined as the ratio of the probability of an event occurrence to the probability of non-occurrence for given attributes [75,77]. The spatial relationship between the location of geohazards and each parameter was obtained. To obtain the FR values for each class of the conditioning parameter, a combination between the geohazard inventory map and factors map was established through Equation (7).
where FR represents the Frequency Ratio (FR) of the class i of factor j; N pix(HX i ) is the number of pixels with geohazards within class i of the conditioning factor X; N pix NX j is the number of pixels within the conditioning factor X j ; m is the number of classes in the conditioning factor X i ; n is the number of factors in this area [29]. The derived frequency ratio was used to calculate relative frequency (Equation (8)). Further, the predictive rate was calculated using Equation (9) [29].

Shannon Entropy (SE)
The SE model is based on the instability, disorder, imbalance, and uncertainty of a system. SE indicates the most favorable class of a variable in a natural environment for hazard occurrence [78]. Entropy values were used for calculating the target weights of the index system. The mathematical expression of weight, W j , was used for calculating the information coefficient for this model (as a whole variable weight value) as follows [79]: where, P ij is the probability density and FR is the frequency ratio. Furthermore, based on Equation (11), each class of the factor is divided into different ranks (Table 4).
H j and H jmax represent entropy values in Equations (12) and (13) H jmax = log 2 sj, sj − number o f classes (13) W j = I j P j I j represents the information coefficient and W j is the final weight value of the variables as a whole.

Model Validation
In this study, derived maps were validated using geohazard inventory points, which were split into two categories: training data (70%) and validation data (30%). The Area Under the Receiving Operative Characteristic (AUROC) approach was adopted to evaluate the performance of the standalone models for geohazard susceptibility assessment [80,81].

Application of Logistic Regression
The LR analysis revealed the most effective variables influencing the distribution of geohazards. With sig (p) values less than 0.05, the most effective variables were found to be distance to faults, slope, elevation, geology, LC, and rainfall ( Figure 4). The remaining conditioning factors were insignificant, with higher sig (p) (>0.05) values [26,82]. Furthermore, the positive and negative coefficients of LR indicate their contribution to the occurrence of geohazards. Based on the obtained coefficients, the probability of geohazards was calculated using Equation (16), and the GHSM was generated accordingly.  Table 4. Ultimately, the GHSI for LR was calculated, using Equation (16). The probability value varies from 0 to 1 [25]. The final SI values were reclassified into five classes: very-low, low, moderate, high, and very-high ( Figure 5), using the natural breaks method [83]. In this model the percentage of very-high, high, moderate, low and very low susceptibility areas are 11.19%, 9.24%, 10.18%, 39.14%, and 30.25%, respectively, as shown in Figure 6.

Application of Weights-of-Evidence
The weight and contrast were calculated for each class of conditioning variables using the WoE model. The output of total weight indicated the significance of each factor as shown in Table 4. In this model, distance to road, distance to faults, geology, elevation, and slope showed higher magnitudes of W h C (Figure 7). The class of distance to road with a distance interval of 0 to 1000 m showed high susceptibility to geohazards, followed by the class with a distance interval of 3000 to 4000 m, with W h C values of 2.663 and 1.912, respectively. The presence of the road can lead to frequent geohazards mainly due to changes in slope stability and shear stress resulting from the excavation and undercutting of fragile slopes [84,85]. In the case of distance to faults, distance intervals from 3000 to 6000 m, 9000 to 12,000 m, and 12,000 to 15,000 m showed weights (W h C) of 0.831, 0.647 and 0.722, respectively, showing a higher probability of geohazards. Elevation intervals of 1000-2000, 2000-3000, and 3000-4000 m showed high (W h C) values of 1.618, 1.892, and 0.468, respectively. The investigation of lithological units revealed that the Ig and Pzu groups (composed of slate, phyllite, and shale) have higher W h C values (2.501 and 2.082), followed by the group Q, Mzpzi, and Ks groups (1.124, 0.984, and 0.367). These findings suggest that rock units with the lowest shear strength (gneisses, phyllites, slate, and shale) are prone to mass wasting. The derived W h C weights (−2.252 to 1.130) showed that slope factor has a significant impact on the occurrence of geohazards, which is in line with the findings of a previous study [70]. Slope classes of 30-45 • , and 45-60 • showed higher W h C weights, with values of 1.130 and 0.796, respectively. Gentle slopes are generally believed to experience a low frequency of geohazards because of their lower shear stresses and low gradients [86]. Distance to rivers with buffer zones of 0-1000 m and >6000 m showed higher susceptibility to geohazards, with W h C values of 2.001 and 4.833, respectively. In terms of profile curvature, concave surfaces (negative curvature) with high W h C values (0.567) are more prone to geohazards, compared to convex surfaces (positive curvature). Concave surfaces have higher W h C values than convex surfaces because surfaces with negative curvature retain more water for longer periods after heavy rainfall than surfaces with a positive curvature [29]. As a result, concave surfaces have increased soil moisture content with longer saturation periods, promoting erosion and decreasing soil stability [59]. In this model, the weights of LC, rainfall, slope aspect, SPI, TWI, and NDVI were of less importance because of very low W h C values (0 or less than 0) ( Table 4). The final susceptibility map was divided into five classes from very low to very high ( Figure 5); the percentages of these classes are shown in Figure 6.

Application of Frequency Ratio
The Predictive Ratio (PR) was calculated from the frequency ratio and relative frequency, respectively. NDVI, SPI, LC, TWI, and elevation exhibited relatively high values of PR, followed by geology, distance to faults, and rainfall ( Figure 4). Distance to rivers, slope, profile curvature, and distance to road have a moderate impact on the prediction of geohazard occurrence. In this model, the slope aspect was the least significant factor with a low PR value. The PR values ranged from 1.00 to 5.62, and they were directly proportional to susceptibility, with higher values indicating higher hazard susceptibility. Regarding NDVI, barren surfaces showed a much high PR value of 5.62 (Table 4). Furthermore, these areas are extremely erodible, and thus exhibit higher SPI values. Compared with natural vegetation and shrublands, forest areas showed lower susceptibility to geohazards. TWI reflects the relationship between the accumulation of water at any location and the gravitational force acting on an inclined surface toward a downslope stream [56]. In this model, TWI showed an inverse relationship with geohazards; TWI values less than five indicate a high frequency of geohazards in this region. Decreasing distance from rivers and roads was proportional to susceptibility, with shorter distances showing higher PR values. River channels have significant effects on fluvial processes that influence bedrock incision in seismically active mountain ranges [87]. In this study, we observed that most of the geohazards in this region occurred along the main Indus River and associated tributaries, which are in line with the study of Ahmed, Rogers and Ismail [34]. The PR values calculated for geology and distance to faults were high. Exposed outcrops in the study area were intensely sheared, folded, and faulted due to seismicity, resulting in high susceptibility to geohazards. Geohazards have most frequently been observed in the Misgar Slates and exposed outcrop of the Jijal complex along the river. Similarly, the elevation and slope play a key role in the occurrence of geohazards. Elevations of 2000-3000 m showed higher susceptibility. Slope angles of 45-60 • showed the highest PR value (2.03), followed by slope angles of 30-45 • (2.00). After determining the PR values of all conditional factors, a GHSM was prepared using Equation (10). The final susceptibility map was reclassified into five classes (very-low, 20.18%; low, 25.34%; moderate, 15.97%; high, 27.16%; very-high, 11.04%) using an expert-based classification method [26], as shown in Figures 5 and 6.

Application of Shannon Entropy
We also used the SE model to minimize the imbalance between factors and obtain a realistic status of their effect on the susceptibility of geohazards. The relationship between the conditioning parameters and geohazards was obtained, and the most significant factors affecting the occurrence of hazards were also identified. The summarized results are listed in Table 4. According to the SE results, geology, elevation, distance to road, distance to faults, NDVI, and distance to rivers are the most significant factors controlling the occurrence and distribution of geohazards in this region ( Figure 4). Therefore, these conditioning factors can better indicate the occurrence and distribution of geohazards in the study area. A GHSM was developed using the following equation [88]: Based on the SE model (Equation (11)), each class of conditional factors was divided into different ranks ( Table 4). The SE-GHSM was divided into five classes from very-low to very-high susceptibility classes as shown in Figures 5 and 6. Figure 8 shows a comparison of various results from field observations, aerial imageries, and the LR model. A, B, C, and D areas show very-high susceptibility to geohazards. In the Hunza valley, an area indicated ground subsidence and shallow landslides, which is attributable to underlying and overlying unconsolidated sediments (slate formations). The historical landslide in Attabad-Hunza blocked river flow and submerged 19 km of the KKH, developing a natural dam. Moreover, the undercutting of slopes for the reconstruction of KKH has increased the susceptibility of the region to geohazards. The entire city of Gilgit is located on a large debris fan. This section is characterized by hot and dry valleys with low precipitation rate, and some old and large debris fans were observed during the field visit. The Nanga Parbat Haramosh massif (western syntaxes) is the most susceptible area of this region, with high tectonic influence and an uplift rate of 1 cm/year [40,89,90]. The dominant tectonic features are MMT and the Raikot fault, which are responsible for the high seismicity and brittle deformation of bedrocks, ultimately leading to mass movement. The Dasu section along the KKH is strongly influenced by rainfall compared to the elevated areas.

Validation of Geohazards Susceptibility Maps
The AUROC curve was used for validating the obtained results. Success and prediction rate curves were plotted to compare the results of GHSMs for existing hazard sites. The performance of the models was evaluated by plotting AUROC curves for success rate (Figure 9a). The LR, WoE, FR, and SE models showed AUROC values of 85.30%, 76.00%, 74.60%, and 71.40%, respectively (Figure 9a). In terms of prediction rate, the AUROC values of LR, WoE, FR, and SE were 83.10%, 75.00%, 73.50%, and 70.10%, respectively (Figure 9b). The findings revealed that the GHSM prepared using the LR model had better predictive capability than those from the other three models. Other studies have also reported that the LR model has higher performance and is more suitable for susceptibility analysis compared to FR and WoE [68,70]. Our findings are in line with those of Rasyid et al. [70] in that a standalone LR model has high precision and is appropriate for hazard susceptibility assessment. Moreover, Polykretis and Chalkias [68] carried out landslide susceptibility by comparing WoE, LR, and ANN models and found that success and predication rate of LR model performed better (LR = 0.928 and 0.836, ANN = 0.912 and 0.780, WoE = 0.893 and 0.847). In another study, Benchelha et al. [91] found that the success and predication rate of LR is higher (91.8% and 91.1%) than ANN model (88.6% and 87.7%). Therefore, it is concluded that the LR model is the best option for mapping hazard potential under similar settings.

Model's Results Comparison
The model results were compared to assess variations between the results of each GHSM. The parameters can be primarily interpreted based on the absolute value of the correlation coefficient as follows: very weak similarities 0-20%, weak to low similarities 20-40%, moderate similarities 40-70%, strong similarities 70-90%, and very strong similarities 90-100% [92]. The similarity coefficient of LR vs. WoE was 57.70%, and those of LR vs. FR and LR vs. SE maps were 47.40%. This indicates moderate similarities of LR with WoE, and low similarities of LR with FR and SE maps ( Table 5). The value of 0.898 shows a strong correlation between FR and SE (89.80%). This correlation suggests that the maps developed using the LR and WoE methods have moderate similarities, and those using the FR and SE models have a strong similarity.

Pixel-Wise Spatial Distribution of Susceptibility Class Comparison
To compare geohazard susceptibility maps, classified maps were ranked into five groups, i.e., very low, low, moderate, high, and very high. Then, they were compared with each of the constructed maps using a two-dimensional multiplication matrix [93,94]. The study found that the matrix's diagonal elements represent the same spatial distribution of susceptibility classes for both maps (LR-WoE, LR-FR, LR-SE, WoE-FR, WoE-SE, and FR-SE). For example, Figure 10a indicates that 39.21% of areas have been occupied by the same hazard intensity for the two susceptibility maps, e.g., susceptibility maps developed from LR and WoE. Meanwhile, 60.79% spatial distributions show dissimilarities in which 31.03% (upper elements of a diagonal matrix) distributions show the higher categories of susceptibility by the geohazard susceptibility map constructed using the WoE model. A total of 29.76% (lower elements of a diagonal matrix) distributions show the lower categories compared to the constructed map using the LR model ( Figure 10a). Further, the comparison results for other models are summarized in Figure 10b-f. These differences are mainly due to the influence of factors in controlling the spatial distribution of geohazards in standalone models (Figures 4 and 5) and the proposition of the susceptibility map class [95].
The LR, WoE, FR, and SE models all indicated the significance of the causative factor on the occurrence of geohazards. In the LR model, distance to faults, slope angle, elevation, geology, LC, and precipitation were the most significant factors. The results of these analyses are supported by the findings of previous studies on this region. Ahmed, Rogers and Ismail [34] reported that geology, faults, slope angle, and rainfall are the most important factors for the occurrence of geohazards in the upper Indus basin. Rahim et al. [96] found that geology, faults, slope angle, and LC have the greatest impact on geohazards in this region. According to the WoE model, the distance from faults with distance intervals of 3000-6000 m, 9000-12,000 m, and 12,000-15,000 m have weights (W h C) of 0.831, 0.647 and 0.722 respectively, indicating a positive correlation with the occurrence of geohazards. Slope classes of 30 [37,97]. According to the FR model, NDVI, SPI, LC, TWI, elevation, geology, and distance to faults are important factors controlling geohazards in this region. The significance of these factors has also been reported by previous studies [29,38]. According to the SE model, geology, elevation, distance to road, faults, NDVI, and rivers are the most important factors. These findings are in line with those of previous studies [29,34,38]. Furthermore, AUROC curves verified that the LR model has a higher performance than the other three models, suggesting that the LR model is the most suitable for regional-level analysis of geohazards under settings similar to those of the study area. Similarities in the influencing factors of geohazard occurrence among the developed maps were assessed. The LR and WoE models provided maps of moderate similarity, whereas the FR and SE models provided highly similar maps.

Conclusions
In this study, regional level geohazard susceptibility mapping from the Punjab plains (Pakistan) to the Tashkurgan Valley (China) was conducted. Thirteen conditioning factors were considered and their TOL and VIF values were calculated to verify the absence of multicollinearity for each variable. Four models, LR, WoE, FR, and SE, were used to develop geohazard susceptibility maps, and the suitability of each model was validated through AUROC curves. Furthermore, model results were analyzed to determine the statistical importance of the coefficients and the spatial association between geohazards and each class of the conditioning factors. The main conclusions of the study can be summarized as follows: 1.
The LR results suggested that distance to faults, slope, elevation, geology, LC, and rainfall are the most significant parameters controlling geohazards occurrence in the study area.

2.
According to the AUROC curve, the LR model showed the highest accuracy with an AUROC value of 85.30%, closely followed by the WoE model (76.00%), FR model (74.60%), and SE model (71.40%). Likewise, the prediction rate of the LR model was 83.10%. Thus, the LR model can be considered more accurate than the other three models.
Areas susceptible to geohazards, visualized in the hazard susceptibility maps, represent important sites for geohazard assessment in the study area. The GHSM prepared using the LR model can be considered to be more realistic and suitable because of its reliability and higher accuracy. Therefore, the developed map can provide useful information for the local communities to recognize and avoid hazardous areas as well as for associated government authorities to enforce preventive measures in areas of high to very-high susceptibility.
Further, we believe that this paper will be of interest to the international readership because the developed maps and the approach will be used by relevant individuals or organizations involved in major construction projects in areas susceptible to geological disasters. One of our study's main limitations is the lack of historical geohazard information. The advanced machine-learning algorithms require a large number of training datasets. Therefore, we will consider advanced machine-learning and data-driven models for susceptibility mapping with more detailed datasets in a future study.

Acknowledgments:
The authors would like to show gratitude to the University of Chinese Academy of Sciences and the Institute of Mountain Hazards and Environment, Chinese Academy of Sciences for their financial assistance. The authors would also like to acknowledge and appreciate the provision of rainfall data by the Pakistan Meteorological Department (PMD), without which this study would not have been possible.

Conflicts of Interest:
The authors declare no conflict of interest.