Machine Learning-Driven Landslide Susceptibility Mapping in the Himalayan China–Pakistan Economic Corridor Region

: The reliability of data-driven approaches in generating landslide susceptibility maps depends on data quality, analytical method selection, and sampling techniques. Selecting optimal datasets and determining the most effective analytical methods pose significant challenges. This study assesses the performance of seven machine learning classifiers in the Himalayan region of the China–Pakistan Economic Corridor, utilizing statistical techniques and validation metrics. Thirteen geo-environmental variables were analyzed, including topographic (8), land cover (1), hydrological (1), geological (2), and meteorological (1) factors. These variables were evaluated for multicollinearity, feature importance, and their influence on landslide incidences. Our findings indicate that Support Vector Machines and Logistic Regression were highly effective, particularly near fault zones and roads, due to their effectiveness in handling complex, non-linear terrain interactions. Conversely, Random Forest and Logistic Regression demonstrated variability in their results. Each model distinctly identified landslide susceptibility zones ranging from very low to very high risk. Significant conditioning variables such as elevation, rainfall, lithology, slope, and land use were identified, reflecting the unique geomorphological conditions of the Himalayas. Further analysis using the Variance Inflation Factor and Pearson correlation coefficient showed minimal multicollinearity among the variables. Moreover, evaluations of Area Under the Receiver Operating Characteristic Curve (AUC-ROC) values confirmed the strong predictive capabilities of the models, with the Random Forest Classifier performing exceptionally well, achieving an AUC of 0.96 and an F-Score of 0.86. This study shows the importance of model selection based on dataset characteristics to enhance decision-making and strategy effectiveness.


Introduction
Landslides pose significant global challenges, resulting in tragic loss of life and substantial economic losses [1].It ranks as the third most prevalent natural disaster, leading to economic damages exceeding USD 2.7 billion and impacting more than 3 million people globally, with 10,338 fatalities recorded from 2008 to 2017 [2].The Abbottabad district, situated in the foothills of Balakot-devastated by the 2005 Kashmir earthquake-serves as a prime example of a region facing significant landslide risks within the Himalayan region in Pakistan.This susceptibility arises due to the rugged terrain, heavy rainfall during the monsoon season, high seismic activity, steep slopes, and unstable soil cover [3].To mitigate these devastating effects, Landslide Susceptibility Mapping (LSM) has become a crucial tool.LSM allows experts to identify areas at high risk by evaluating terrain characteristics, thus playing a pivotal role in managing landslide risks [4].This approach is particularly valuable in generating susceptibility maps that highlight areas prone to landslides within Land 2024, 13, 1011 2 of 22 specific regions [5].Notably, regions with hilly terrain adjacent to large mountain ranges, such as the Himalayas, are particularly susceptible.The China-Pakistan Economic Corridor (CPEC) route traverses these mountains and has historically suffered from significant landslide incidents.Neighboring countries like India and Nepal have also faced severe landslide challenges, showing the urgent need for preventative strategies along the CPEC corridor [6].Despite its importance, LSM research in these critical areas has faced several challenges, emphasizing the need for continued investigation in regions with a history of landslides.
The integration of diverse machine learning models marks a significant improvement in LSM.This advancement enhances the method's ability to reveal complex interactions between conditioning factors and landslide risks without primarily relying on traditional statistical assumptions [7,8].The effectiveness of LSM depends not only on the choice of machine learning model but also on the careful selection of conditioning factors.These factors, which can range from a few to several hundred, are chosen based on the characteristics of the data and the unique attributes of the study area [9,10].Common factors influencing landslide susceptibility include geological, topographical, and land cover elements [11].The selection process of these factors is crucial-it involves assessing the importance of each factor and removing those that are unnecessary, thereby streamlining LSM and improving its accuracy [12].This process often includes correlation analyses among factors to identify those most critical for predicting landslide susceptibility [13].
Extensive research has highlighted the crucial role of conditioning factors in LSMs, with many studies utilizing machine learning algorithms to assess landslide susceptibility in diverse geographical areas.Notably, Micheletti et al. applied Support Vector Machines and Random Forest algorithms in Switzerland's Vaud canton [14], while Lee, Hong, and Jung integrated Support Vector Machines with Geographic Information System for producing and evaluating landslide maps in PyeongChang and Inje, Korea [15].Similarly, Meena et al. investigated the use of statistical and machine learning models in Italy's Belluno province [16].In addition, various machine learning models have undergone rigorous testing for Landslide susceptibility modeling; for example, in northern Pakistan, a comparative study of Support Vector Machine, Random Forest, Maximum Entropy, Gradient Boosting Machines, and Logistic Regression was conducted, with optimization achieved through cross-validation techniques [17].The Random Forest Classifier was also critically evaluated in Saudi Arabia and China [18,19].Despite the widespread application of machine learning in landslide susceptibility analysis, there remains a gap in understanding these dynamics within the Himalayas, particularly in the CPEC.Our study addresses this gap by applying advanced machine learning techniques to investigate the unique interactions of conditioning factors and landslides in this region, aiming to enhance the accuracy of region-specific susceptibility models.
This research, building upon prior studies [17,20], continues to explore the most effective machine learning techniques for landslide susceptibility analysis in the Himalayan terrains of the CPEC.We employ seven machine learning models, including Support Vector Machines and Random Forest Classifiers, to analyze terrain vulnerabilities in the study area.Our study uniquely evaluates 13 conditioning factors specific to this region, such as slope, aspect, lithology, curvature, and land use, etc., thereby enhancing the accuracy of the landslide susceptibility models.Feature selection algorithms like Classification and Regression Trees (CART) and Recursive Feature Elimination (RFE) optimize the influential factors, supported by Variance Inflation Factor and correlation analysis.Additionally, the Wilcoxon test statistically assesses the model's significance.The goal is to improve hazard mitigation and land-use planning through comprehensive landslide susceptibility insights.

Study Area
The study area is situated in the western Himalayas, specifically within the China-Pakistan Economic Corridor (CPEC), as shown in Figure 1.It spans from 32.9 • N to 39.7 • N in latitude and 71.2 • E to 77.8 • E in longitude, characterized by towering mountains, glaciers, rivers, fault lines, and forests.This region is historically susceptible to natural disasters like earthquakes, landslides, and floods, particularly the 2005 earthquake, which had a magnitude of Mw = 7.6 and caused extensive damage to infrastructure, destabilizing numerous slopes in the area [21,22].

Study Area
The study area is situated in the western Himalayas, specifically within the China-Pakistan Economic Corridor (CPEC), as shown in Figure 1.It spans from 32.9° N to 39.7° N in latitude and 71.2° E to 77.8° E in longitude, characterized by towering mountains, glaciers, rivers, fault lines, and forests.This region is historically susceptible to natural disasters like earthquakes, landslides, and floods, particularly the 2005 earthquake, which had a magnitude of Mw = 7.6 and caused extensive damage to infrastructure, destabilizing numerous slopes in the area [21,22].Geologically, the region is divided into two seismic zones: the Harman Seismic Zone (HSZ) and the Indus Kohistan Seismic Zone (IKSZ) [23].Prominent fault lines, including the Main Mantle Thrust (MMT) and the Main Karakoram Thrust (MKT), play a crucial role in shaping the region's geological characteristics [24].This region holds strategic importance within the upper Indus basin, notably due to the presence of mega infrastructures like the Diamer-Bhasha Dam and Dasu Hydropower Project [25].The Karakoram Highway (KKH), a key component of the China-Pakistan Economic Corridor (CPEC), traverses the Karakoram, Hindu-Kush, and Himalayan mountain ranges, shows the region's importance in the northern sector of CPEC.Furthermore, the study area serves as the source of numerous glaciers and river systems, including the origin point of the Indus River.The climate is semi-arid, with southern areas receiving moisture from monsoon currents originating in the Indian Ocean, while the northern regions remain relatively arid [26].This unique interplay of climatic zones, strategic importance, and varying altitudes makes it an essential area for geological and climatic hazard research.Geologically, the region is divided into two seismic zones: the Harman Seismic Zone (HSZ) and the Indus Kohistan Seismic Zone (IKSZ) [23].Prominent fault lines, including the Main Mantle Thrust (MMT) and the Main Karakoram Thrust (MKT), play a crucial role in shaping the region's geological characteristics [24].This region holds strategic importance within the upper Indus basin, notably due to the presence of mega infrastructures like the Diamer-Bhasha Dam and Dasu Hydropower Project [25].The Karakoram Highway (KKH), a key component of the China-Pakistan Economic Corridor (CPEC), traverses the Karakoram, Hindu-Kush, and Himalayan mountain ranges, shows the region's importance in the northern sector of CPEC.Furthermore, the study area serves as the source of numerous glaciers and river systems, including the origin point of the Indus River.The climate is semi-arid, with southern areas receiving moisture from monsoon currents originating in the Indian Ocean, while the northern regions remain relatively arid [26].This unique interplay of climatic zones, strategic importance, and varying altitudes makes it an essential area for geological and climatic hazard research.

Datasets
A landslide inventory crucial for landslide susceptibility mapping utilizes historical data, satellite imagery, and field surveys to examine the relationship between landslide occurrences and their causative factors.In this research, landslide locations were identified and mapped using high-resolution Sentinel-2 satellite (10 m) and Google Earth imagery [27,28], which are well-suited for detecting small to large-scale slope failures and landform alterations.The landslide inventory was validated using literature [29][30][31][32][33], historical data from the NASA geohazard catalog, and a comprehensive field survey of the study area, as shown in Figure 2 The inventory consists of 2197 polygon shapes drawn on satellite images based on validation data, each representing a landslide within the study area.The majority of landslides, accounting for 71.13%, are classified as very small, each covering an area of less than 1 × 10 5 m 2 .Small landslides, which cover areas ranging from 1 × 10 5 to 3 × 10 5 m 2 , make up 16.42% of the total landslides.Medium-sized landslides, covering areas between 3 × 10 5 and 5 × 10 5 m 2 , constitute 10.72% of all landslides.Large landslides, each with an area spanning 5 × 10 5 to 1 × 10 6 m 2 , are relatively uncommon, accounting for only 1.52%.The least frequent category is very large landslides, which cover an area greater than 1 × 10 6 m 2 and represent just 0.21% of all documented landslides.For the analysis, non-landslide areas were marked using ESRI ArcGIS, employing a random selection technique based on Yao et al.'s methodology [34].This delineation process divided the locations into two categories: 1538 landslide locations for model training (70%) and 659 for validation (30%).For landslide susceptibility analysis in this study, thirteen conditioning factors were carefully selected, as shown in Table 1 and Figures 3 and 4, which are categorized into topographic, land cover, hydrological, geological, and meteorological types.These factors include valley depth, terrain relief, diverse land use classifications, and the impact of rainfall, proximity to road networks, lithology, and active fault lines [35][36][37].Analytical indices such as the Terrain Ruggedness Index (TRI) and the Topographic Wetness Index (TWI), alongside factors like elevation variations, land use changes, and slope gradients, were also crucial [38,39].This comprehensive approach provides a robust framework for assessing landslide risks, ensuring a detailed analysis of susceptibility in the region.The Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), available through NASA's Earthdata Search website (https://earthdata.nasa.gov/)(accessed on 2 February 2024), offers a spatial resolution of 30 m.These data were used in deriving several critical factors for landslide susceptibility analysis, including slope, as- The Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), available through NASA's Earthdata Search website (https://earthdata.nasa.gov/)(accessed on 2 February 2024), offers a spatial resolution of 30 m.These data were used in deriving several critical factors for landslide susceptibility analysis, including slope, aspect, curvature, terrain relief, valley depth, the Topographic Wetness Index (TWI), the Terrain Ruggedness Index (TRI), and elevation.Further parameters were incorporated using resampled 30 m resolution data: these data sources include USGS Global Land Cover, the National Library of Australia for Lithology and Fault Distance, Natural Earth Data for Road Distance, and the Climate Hazards Group for Rainfall data.
Thematic maps for the selected conditioning factors were produced using ArcGIS software 10.8.1., with all data resampled to a consistent 30 m resolution for uniformity.The datasets were standardized and normalized to minimize redundancy and organize the data effectively.Each raster layer representing a conditioning factor was categorized into various levels, with all maps created for this study following the WGS 1984 datum and projected in the UTM zone 42 system.

Statistical Feature Filtering and Assessment
Multicollinearity in the dataset, arising from interdependent conditioning factors such as slope, elevation, and rainfall, was addressed to enhance the accuracy of landslide susceptibility modeling.
To analyze the relationships between these variables, the Pearson Correlation Coefficient (PCC) was employed, which quantifies the linear correlation between pairs of variables, providing insights into the strength and direction of their relationships [40].The PCC is defined by the following formula [41]: where x i and y j are individual sample values and x and y their respective means.Tolerance (TOL) and the Variance Inflation Factor (VIF) were used for multicollinearity assessment, which is crucial in identifying and mitigating the influence of strongly correlated variables to ensure the robustness and reliability of landslide susceptibility models [42].The computation for VIF is as follows [43]: where R 2 i is the determination coefficient for the ith variable.Moreover, the Weight of Evidence (WoE) approach, which is based on Bayesian probability principles, investigates spatial relationships between landslides and key factors using a logarithmic linear framework.It calculates positive (α) and negative (β) correlations, and their combined effect (∆) on landslide occurrences as follows [44]: where Q represents probability, X signifies specific factor categories, X ′ represents categories not within the factors, Y relates to landslide-prone regions, and Y ′ refers to areas without landslides.Furthermore, a combination of advanced algorithms was employed for optimized feature selection and statistical comparison of machine learning models.Classification and Regression Trees (CART) were used to construct binary decision trees and identify critical variables, while Recursive Feature Elimination (RFE) was implemented for its systematic approach to iteratively refining the feature set [45].Furthermore, for the statistical comparison of different landslide susceptibility models, the Wilcoxon signed-rank test was employed.This non-parametric method assesses the significance of differences between models at a set level (typically α = 0.05), using p-values and z-values to determine whether the null hypothesis of no significant difference can be rejected or affirmed [46].This comprehensive approach, combining feature selection algorithms with rigorous statistical testing, provided a robust framework for landslide susceptibility analysis.

Machine Learning Techniques
In this study, seven well-established machine learning models were used, each selected for its specific strengths in predicting landslide susceptibility.These classifiers, comprise Support Vector Machine, Logistic Regression, Random Forest, K-Nearest Neighbors, Gradient Boosting Decision Tree, Gaussian Bayesian Network, and Convolutional Neural Network, which are briefly outlined in terms of their fundamental principles as follows: The Support Vector Machine, utilizing the Radial Basis Function (RBF) kernel, is effective for high-dimensional data and is represented by the following equation [47,48]: where the kernel function k x i , x j applies an exponential decay, exp, to the negative scaled squared distance x i − x j 2 between vectors x i and x j , modulated by g.Moreover, Logistic Regression, used for binary classification and thus effective in determining event occurrence, models landslide probabilities as follows [49]: where P is the predicted landslide probability, Y is the logit function, B is the intercept, A i are coefficients, and X i are predictors affecting landslide risk.The Random Forest Classifier (RFC) enhances accuracy through its ensemble approach, making it suitable for diverse data subsets [50,51].It can be described by the following equation: where F(x) is the averaged result of multiple predictions f b(x) made by an ensemble of B different models, with b as the index iterating through each model to compute the ensemble's output.The Convolutional Neural Network features exceptional capabilities in feature extraction and resistance to overfitting, which makes it essential for processing input data in the following formula [52].
Here, Rescaled x normalizes x between 0 and 1, where xmin and xmax are its minimum and maximum values, shifting the range so that all values are proportionally adjusted.
The Gradient Boosting Decision Tree (GBDT) iteratively improves accuracy, making it efficient for modeling complex relationships using the following equation [53,54]: This indicates the prediction Y t (X i ) refined through prior predictions Y t−1 (X i ) and the current weak learner's contribution F t (X i ), with η moderating the update.
Furthermore, GBDT employs the F-MSE impurity metric [55]: Land 2024, 13, 1011 Here, in the GBDT F − MSE formula, w L and w R are weights for the left and right nodes, respectively.They normalize the sums ∑ L and ∑ R of squared errors (r i − ŷi ) 2 to calculate node-specific mean squared errors.The final F − MSE, a squared, weighted difference of these errors, emphasizes efficient branching to improve prediction accuracy.
The cubic K-Nearest Neighbor model is advantageous for geological data analysis, facilitating precise classification as follows [56,57]: In this formula, u i and v i are the i-th components of vectors u and v, respectively.Finally, the Gaussian Naive Bayes model, known for its straightforward and effective application, determines landslide likelihood using the following equation [58]: Here, the Naive Bayes classifier z NB predicts the class label (either "landslide presence" or "absence") by selecting the class wj that maximizes the combined probabilities of the prior P wj and the conditional probabilities ∏ n i=1 P z i w j for each feature zi.This method uses argmax predict the most likely class, assuming features are independent.Moreover, the likelihood of the predictors, given the outcome category, is expressed as a Gaussian distribution [59]: In this formula, λ represents the mean, and ρ represents the standard deviation associated with each feature zi for the outcome wj.

Validation and Reliability Evaluation
The validation of landslide susceptibility models was conducted using two key indicators.The Receiver Operating Characteristic (ROC) curve was employed to measure classification effectiveness, with the Area Under the Curve (AUC) indicating accuracy levels [60].A higher AUC value, approaching 1, denotes greater model accuracy, while values near or below 0.5 suggest limited predictive ability.Additionally, the F-Score is a harmonic mean of precision and recall, which ranges from 0 to 1, with values closer to 1 indicating better performance [61].For the validation of models, these indicators were applied to 30% testing data.

Assessment of Explanatory Variables
The relationships among thirteen geographical and environmental factors were analyzed using Pearson correlation coefficients, as illustrated in Figure 5.A key finding was the strong positive correlation (0.92) between slope and terrain relief, indicating that steeper slopes are often found in areas of significant relief.Conversely, a negative correlation (−0.55) between slope and the Topographic Wetness Index (TWI) suggested that higher slope areas tend to have lower TWI values.Some factors display significant correlations with each other's, while others exhibit minimal associations; this dynamic interrelation among conditioning factors emphasizes the diverse influences on landslide susceptibility.
Furthermore, the details of multicollinearity assessment are presented in Table 2.All conditioning variables showed minimal multicollinearity (VIF values below 3), indicating that each factor contributes uniquely to the models.Elevation displayed the highest VIF value of 2.98, indicating significant interactions with other variables, suggesting a central role in the analysis due to its low tolerance of 0.33.Conversely, 'Distance to active_faults', with its lower VIF of 1.10 and higher TOL of 0.96, exhibited a relatively independent influence within the model.This highlights the importance of selecting relevant variables for accurately modeling landslide susceptibility.Furthermore, the details of multicollinearity assessment are presented in Table 2.All conditioning variables showed minimal multicollinearity (VIF values below 3), indicating that each factor contributes uniquely to the models.Elevation displayed the highest VIF value of 2.98, indicating significant interactions with other variables, suggesting a central role in the analysis due to its low tolerance of 0.33.Conversely, Distance to active_faults', with its lower VIF of 1.10 and higher TOL of 0.96, exhibited a relatively independent influence within the model.This highlights the importance of selecting relevant variables for accurately modeling landslide susceptibility.

Factors Influencing Landslide Susceptibility
The impact of various causative factors on landslides across different categories can be observed in Figure 6.Proximity to active fault zones and road construction within 2 km leads to landslide occurrences of 12.4% and 12.1%, respectively, due to geological instability and altered terrain.Anthropogenic activities within elevation ranges between 1940 and 2660 m contribute to an 18.2% landslide incidence.Additionally, sub-classes such as shrub, sparse vegetation, and snow increase susceptibility to landslides by reducing soil stability, limiting root reinforcement, and promoting saturation, respectively.Aspect direction also plays a role, with south and west subclasses exhibiting landslide occurrences of 20.3% and 12.8%, respectively.This can be attributed to increased exposure to solar radiation and weathering on south-facing slopes, as well as enhanced susceptibility to rainfall-induced erosion and soil saturation on west-facing slopes, influenced by prevailing wind and precipitation patterns.In TWI > 50 sub-class areas, landslide is notably high at 61.9% due to increased saturation and reduced soil stability linked to higher topographic wetness indices, amplifying slope instability.Valley depths exceeding 1000 m experience a high landslide occurrence rate of 20.4%.Geological formations such as Eocene Limestone exhibit high landslide occurrence rates due to weathering and erosion processes, while human activities like quarrying and construction further amplify landslide risks in limestone-rich areas.

Factors Influencing Landslide Susceptibility
The impact of various causative factors on landslides across different categories can be observed in Figure 6.Proximity to active fault zones and road construction within 2 km leads to landslide occurrences of 12.4% and 12.1%, respectively, due to geological instability and altered terrain.Anthropogenic activities within elevation ranges between 1940 and 2660 m contribute to an 18.2% landslide incidence.Additionally, sub-classes such as shrub, sparse vegetation, and snow increase susceptibility to landslides by reducing soil stability, limiting root reinforcement, and promoting saturation, respectively.Aspect direction also plays a role, with south and west subclasses exhibiting landslide occurrences of 20.3% and 12.8%, respectively.This can be attributed to increased exposure to solar radiation and weathering on south-facing slopes, as well as enhanced susceptibility to rainfall-induced erosion and soil saturation on west-facing slopes, influenced by prevailing wind and precipitation patterns.In TWI > 50 sub-class areas, landslide occurrence is notably high at 61.9% due to increased saturation and reduced soil stability linked to higher topographic wetness indices, amplifying slope instability.Valley depths exceeding 1000 m experience a high landslide occurrence rate of 20.4%.Geological formations such as Eocene Limestone exhibit high landslide occurrence rates due to weathering and erosion processes, while human activities like quarrying and construction further amplify landslide risks in limestone-rich areas.

Feature Priority Selection
The key factors for landslide susceptibility mapping were selected using the Classification and Regression Trees (CART) Importance and Recursive Feature Elimination (RFE) Ranking methods, as depicted in Figure 7.According to the CART Importance analysis, Rainfall emerged as the most critical factor with a value of 0.16, followed by elevation at 0.13, slope at 0.11, lithology at 0.09, and land use at 0.10.Conversely, the RFE Ranking method identified elevation as the most important feature, assigning it a rank of 1.It was closely followed by rainfall with a rank of 2, slope at 3, lithology at 4, and land use at 5. Valley depth was evaluated as the least significant factor, with a CART Importance value of 0.02 and an RFE Ranking of 13.This analysis demonstrates that while certain factors vary in importance, each one contributes to the predictive accuracy of landslide modeling.Therefore, all factors were selected for further detailed analysis.

Feature Priority Selection
The key factors for landslide susceptibility mapping were selected using the Classification and Regression Trees (CART) Importance and Recursive Feature Elimination (RFE) Ranking methods, as depicted in Figure 7.According to the CART Importance analysis, Rainfall emerged as the most critical factor with a value of 0.16, followed by elevation at 0.13, slope at 0.11, lithology at 0.09, and land use at 0.10.Conversely, the RFE Ranking method identified elevation as the most important feature, assigning it a rank of 1.It was closely followed by rainfall with a rank of 2, slope at 3, lithology at 4, and land use at 5. Valley depth was evaluated as the least significant factor, with a CART Importance value of 0.02 and an RFE Ranking of 13.This analysis demonstrates that while certain factors vary in importance, each one contributes to the predictive accuracy of landslide modeling.Therefore, all factors were selected for further detailed analysis.

Landslide Susceptibility Analysis
In the process of landslide susceptibility mapping, conditioning factors are integrated into study area pixels, various models predict landslide probabilities, and the area is segmented into five susceptibility categories-very low to very high-using the quantile method.In this process, selected landslide-causing factors are input variables, and grid units are classified as landslide' (1) or non-landslide' (0) in the model's output.The result findings align closely with the geographical attributes of the study area.Predominantly, very-high susceptibility zones are concentrated in mountainous regions, whereas verylow susceptibility zones are covered in vegetated areas.Table 3 and Figure 8 show that the VERY LOW susceptibility category consistently covers a significant portion of the study area, ranging from 26.68% to 59.61%, with corresponding landslide frequencies varying between 0.8% and 6.76%.Moreover, at the LOW susceptibility level, the models show varying spatial extents, ranging from 12.68% to 41.62%, along with corresponding landslide frequencies of 2.9% to 7.46%.The MODERATE susceptibility zones cover a range of

Landslide Susceptibility Analysis
In the process of landslide susceptibility mapping, conditioning factors are integrated into study area pixels, various models predict landslide probabilities, and the area is segmented into five susceptibility categories-very low to very high-using the quantile method.In this process, selected landslide-causing factors are input variables, and grid units are classified as 'landslide' (1) or 'non-landslide' (0) in the model's output.The result findings align closely with the geographical attributes of the study area.Predominantly, very-high susceptibility zones are concentrated in mountainous regions, whereas very-low susceptibility zones are covered in vegetated areas.Table 3 and Figure 8 show that the VERY LOW susceptibility category consistently covers a significant portion of the study area, ranging from 26.68% to 59.61%, with corresponding landslide frequencies varying between 0.8% and 6.76%.Moreover, at the LOW susceptibility level, the models show varying spatial extents, ranging from 12.68% to 41.62%, along with corresponding landslide frequencies of 2.9% to 7.46%.The MODERATE susceptibility zones cover a range of 7.61% and 15.89% of the total area, with corresponding landslide frequencies that vary from 4.42% to 11.87%.HIGH susceptibility areas exhibit substantial spatial variability between 6.22% and 11.05%, while the landslide frequencies also vary significantly, ranging from 10.81% to 20.98%.The VERY HIGH susceptibility zones consistently occupy smaller proportions of the study area, ranging from 6.10% to 13.59%, while showing the highest landslide frequencies, ranging from approximately 53.57% to 75.33%.Furthermore, Random Forest Classifiers and Convolutional Neural Networks tend to categorize larger regions as very low risk, while Logistic Regression tends to identify smaller zones in this risk category.For low-risk areas, a Support Vector Machine covers a larger area compared to a Random Forest Classifier and K-Nearest Neighbors.The moderate-risk zone exhibits notable variability across the models.In the high-risk category, the Convolutional Neural Network and Gaussian Bayesian Network delineate larger areas with a heightened risk of landslides, while the Support Vector Machine designates a smaller extent.In the very high-risk zone, K-Nearest Neighbors and the Support Vector Furthermore, Random Forest Classifiers and Convolutional Neural Networks tend to categorize larger regions as very low risk, while Logistic Regression tends to identify smaller zones in this risk category.For low-risk areas, a Support Vector Machine covers a larger area compared to a Random Forest Classifier and K-Nearest Neighbors.The moderate-risk zone exhibits notable variability across the models.In the high-risk category, the Convolutional Neural Network and Gaussian Bayesian Network delineate larger areas with a heightened risk of landslides, while the Support Vector Machine designates a smaller extent.In the very high-risk zone, K-Nearest Neighbors and the Support Vector Machine allocate larger areas, with the Random Forest Classifier indicating a higher frequency of landslides, whereas the Random Forest Classifier and Gradient Boosting Decision Tree allocate smaller extents.

Models Correlations in LSM
This study classifies the pairwise correlations of machine learning model performance into three distinct categories-strong, moderate, and weak-based on the p-values presented in Table 4. Strong correlations were observed among model pairs such as Support Vector Machine vs. Logistic Regression, Support Vector Machine vs. Gaussian Bayesian Network, and Gaussian Bayesian Network vs. Gradient Boosting Decision Tree, with p-values approaching 1, which suggests similar methodologies and possibly common mechanisms of models in landslide susceptibility assessment.The high correlation indicates these models have significant overlap in data processing and interpretation, likely due to similar algorithmic structures or dataset feature sensitivities.Moderate correlations, with p-values between 0.25 and 0.75, were observed in pairs like Support Vector Machine vs. Random Forest Classifier and Logistic Regression vs. Gaussian Bayesian Network, which shows these models share certain methodological aspects but also have notable differences in their evaluations, indicating a mix of shared and unique approaches in their processing and analysis.Furthermore, weak correlations, seen in pairs like Random Forest Classifier vs. Logistic Regression and Gaussian Bayesian Network vs. Convolutional Neural Network with p-values close to 0, highlight distinct model approaches in landslide susceptibility, emphasizing the need for model selection that aligns with specific study goals for effective analysis.

Models Validation
The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and F-score values obtained from the test data indicate an acceptable level of predictive performance, as detailed in Table 5.In the analysis of machine learning models for their effectiveness, the Convolutional Neural Network and Random Forest Classifier show superior performance with F-scores of 0.84 and 0.86, respectively, and AUCs of 0.93 and 0.96 Figure 9.These models outperformed Logistic Regression, which had a lower F-score of 0.57 and an AUC of 0.87, indicating its lesser efficacy in this context.The Support Vector Machine and Gaussian Naive Bayes models also demonstrated moderate effectiveness, with the Support Vector Machine achieving an F-score of 0.69 and an AUC of 0.91, while the Gaussian Bayesian Network lagged slightly with an F-score of 0.63 and an AUC of 0.86, reflecting their varying degrees of suitability for landslide susceptibility analysis.9.These models outperformed Logistic Regression, which had a lower F-score of 0.57 and an AUC of 0.87, indicating its lesser efficacy in this context.The Support Vector Machine and Gaussian Naive Bayes models also demonstrated moderate effectiveness, with the Support Vector Machine achieving an F-score of 0.69 and an AUC of 0.91, while the Gaussian Bayesian Network lagged slightly with an F-score of 0.63 and an AUC of 0.86, reflecting their varying degrees of suitability for landslide susceptibility analysis.

Discussion
The findings of this research contribute significantly to the understanding of landslide susceptibility within the Himalayan region enclosed by the China-Pakistan Economic Corridor.Data analysis showed the Random Forest Classifier as the leading model, outperforming other models like Support Vector Machine, K-Nearest Neighbors, and Gaussian Bayesian Network [62].Random Forest Classifier s superior performance is attributed to its ensemble method, effectiveness in handling non-linear patterns, and ability to process high-dimensional data, making it particularly suitable for the Himalayan dataset.However, the effectiveness of these machine learning models is critically dependent on the quality of the input data.For landslide susceptibility mapping, it is essential to precisely capture the unique characteristics and dynamics of each landslide to ensure reliable analysis and effective mitigation strategies.Meanwhile, accurate representation of their extent and shape is crucial for evaluating impacted areas and developing effective mitigation measures.Additionally, not specifying the state of activity and the typology of landslides can result in risk assessments that are either underestimated or overestimated.To improve accuracy and reliability in landslide risk assessments, future studies will focus

Discussion
The findings of this research contribute significantly to the understanding of landslide susceptibility within the Himalayan region enclosed by the China-Pakistan Economic Corridor.Data analysis showed the Random Forest Classifier as the leading model, outperforming other models like Support Vector Machine, K-Nearest Neighbors, and Gaussian Bayesian Network [62].Random Forest Classifier's superior performance is attributed to its ensemble method, effectiveness in handling non-linear patterns, and ability to process high-dimensional data, making it particularly suitable for the Himalayan dataset.However, the effectiveness of these machine learning models is critically dependent on the quality of the input data.For landslide susceptibility mapping, it is essential to precisely capture the unique characteristics and dynamics of each landslide to ensure reliable analysis and effective mitigation strategies.Meanwhile, accurate representation of their extent and shape is crucial for evaluating impacted areas and developing effective mitigation measures.Additionally, not specifying the state of activity and the typology of landslides can result in risk assessments that are either underestimated or overestimated.To improve accuracy and reliability in landslide risk assessments, future studies will focus on monitoring various landslide types and incorporating detailed classifications of their activity states.
Furthermore, a strong correlation between slope and terrain relief reflects the area's unique geomorphology, arising from the tectonic collision of the Indian and Eurasian plates [63,64].Steeper slopes correspond to higher relief, inherently increasing landslide susceptibility [65].The minimal correlation between aspect and rainfall indicates their separate influences on landslide susceptibility.Aspect influences micro-environments by affecting sunlight exposure and vegetation types, impacting landslide risk [66,67].For instance, south-facing slopes are generally warmer, supporting drought-resistant flora, while north-facing slopes are cooler and moister, suitable for shade-loving species [68].On the other hand, in the Himalayas, the moderate correlation between valley depth and elevation suggests a complex link to landslide susceptibility.Higher elevations, with deeper valleys, face increased landslide risks due to factors like soil saturation and reduced vegetation.This relationship, while significant, is part of a broader set of variables, including geology and human activities, highlighting the intricate dynamics of landslides in this region.
Landslide susceptibility in the study area is influenced by a complex interplay of environmental and geographical factors, such as sparse vegetation, notably in shrub areas, which increases landslide risk by reducing soil stabilization.Deep valleys and certain elevations are more susceptible, potentially due to freeze-thaw cycle stress [69].Climate plays a significant role, with specific rainfall patterns linked to higher landslide incidences, likely from soil saturation [70].Moreover, indices like the Topographic Wetness Index (TWI) and Terrain Ruggedness Index (TRI) provide insights, showing that higher moisture levels and irregular terrain surfaces promote landslides [71].Additionally, terrain relief and steep slopes are crucial, with certain ranges more prone to landslides due to climatic impacts and gravitational forces.Proximity to active faults and roads also affects landslide occurrence, arising from geological instability and human-induced topographic changes.Terrain aspect, particularly south and west-facing slopes, and certain terrain curvatures, influence landslide susceptibility, altering sunlight exposure, moisture conditions, and water flow patterns, which can increase erosion risks.
In the field of landslide susceptibility modeling, the use of various factors can lead to complications such as overly generalized models due to high dimensionality and overfitting, which can make the results unreliable [72,73].Each study area's distinct characteristics require a customized approach of selecting conditioning factors, aiming to minimize dimensionality and pinpoint the most important factors for that area.The need for up-todate information is also paramount, as data from official sources often lack recent updates, affecting the precision of the models.It is important to select factors in both humanrelated and climatic elements to create more efficient models for managing landslide risks.Moreover, additional research is needed to analyze the importance of these factors and to refine feature selection methods, enhancing the accuracy and reducing the time and computational effort involved in LSM development and data gathering.
The study in the Himalayas' China-Pakistan Economic Corridor region highlights the importance of landslide susceptibility assessment due to unique geological and infrastructural features [17,74].It reveals that Support Vector Machines and Logistic Regression are particularly effective in identifying high-risk areas, like those near fault lines and roads.These models excel in defining boundaries and estimating probabilities, with the Support Vector Machine also correlating well with tree-based methods and Convolutional Neural Networks, indicating their versatility in assessing various risk factors like terrain ruggedness and rainfall patterns.However, the varied correlations of Random Forest Classifier with other models point to the need for diverse approaches, especially for localized factors.This research shows the significance of strategic model selection, utilizing the strengths of different models to address the complex challenges of landslide risk assessment in this geologically sensitive region.
While our research focuses on a specific region within the Himalayas, potentially limiting the direct applicability of our findings to areas with different geological and environmental conditions, it provides valuable comparative insights that extend beyond the immediate study area.The relationships identified between factors such as slope, terrain relief, and topographic wetness index can inform similar studies in other mountainous or geologically active regions [75,76].This comprehensive framework for integrating multiple environmental variables to predict landslide susceptibility can be adapted to different geographical settings, enhancing the accuracy of susceptibility maps when adjusted for local data.The robust performance metrics of the Random Forest Classifier and Convolutional Neural Network models in our study suggest that these models can achieve similar accuracy in other regions with appropriate adjustments.Additionally, the insights gained from our analysis, such as identifying high-risk zones and understanding the influence of various environmental factors, provide a framework for risk management and policy development that can be applied elsewhere [77,78].Although different geological contexts may require recalibration of the models and the quality of available data may vary by region, our work significantly contributes to the field of landslide susceptibility mapping and offers valuable guidelines for future research and policy development in diverse geographical contexts.
Landslide Susceptibility Mapping (LSM) is a critical tool for identifying areas at risk and predicting possible landslide incidents, which are significant threats to environmental stability and socioeconomic advancement.By utilizing LSM, stakeholders such as policymakers, governmental leaders, land use planners, real estate professionals, environmental groups, transportation and utility entities, agricultural and forestry sectors, insurance firms, and the general public can develop informed strategies to mitigate the adverse impacts of landslides [79,80].This research highlights the effectiveness of LSM in pinpointing areas prone to landslides within the China-Pakistan Economic Corridor in the Himalayas, particularly high-risk zones primarily located in the central regions of our study area, along the Karakoram Highway, and in towns such as Gilgit, Jaglot, and Shandur.Recognizing these high-risk areas aids decision-makers in strategically avoiding them for agricultural, residential, and infrastructural development.To protect these susceptible areas, a combination of structural, geotechnical, political, legal, and administrative measures can be implemented to safeguard both the population and the environment from the threats posed by landslides.

Conclusions
This study conducted a comprehensive analysis of landslide susceptibility within the Himalayan region of the China-Pakistan Economic Corridor, utilizing high-resolution Sentinel-2 satellite imagery and Google Earth data to identify 2197 landslide polygons.The landslides were validated through literature reviews, NASA's geo-hazard catalog, and extensive field surveys, revealing that the majority of landslides, 71.13%, are very small, each covering area less than 1 × 10 5 m 2 , while very large landslides, each covering more than 1 × 10 6 m 2 , are the least common, representing only 0.21% of the total landslides.Nonlandslide areas were delineated through ESRI ArcGIS using a random selection approach.The dataset was then divided into 70% for model training (1538 locations) and 30% for validation (659 locations), ensuring a robust model development process.Additionally, the study employed 13 conditioning factors, including slope, aspect, curvature, land use, land cover, and lithology, etc., obtained from various sources.These factors were analyzed using Pearson Correlation Coefficients, which reveal a strong positive correlation (0.92) between slope and terrain relief.This indicates that steeper slopes are prevalent in regions with significant relief, thereby impacting landslide susceptibility.Conversely, a negative correlation (−0.55) was observed between slope and the Topographic Wetness Index, suggesting lower wetness indices in steeper areas, which can influence water runoff patterns and erosion susceptibility.
Furthermore, the multicollinearity assessment revealed minimal interaction among most variables; however, elevation exhibited a higher VIF of 2.98, implying more pronounced interactions with other conditioning factors.These findings show the complexity of interactions in geographical modeling and emphasize the importance of careful selection of predictive variables in landslide susceptibility models.Moreover, machine learning models varied in their categorization of risk zones, with the Random Forest Classifier and Convolutional Neural Network categorizing larger regions as VERY LOW risk and the Support Vector Machine covering more area as LOW risk than Random Forest Classifier and K-Nearest Neighbors, highlighting the significance of model selection based on the specific characteristics and data sensitivities of the study area.These findings align with the study's aim to precisely model landslide susceptibility and highlight the need for improved prediction through dynamic variables and machine learning techniques.This research is crucial for identifying at-risk areas and predicting landslides, which threaten environmental stability and socioeconomic progress.Using landslide susceptibility mapping, stakeholders can develop strategies to mitigate impacts.Implementing structural, geotechnical, political,

Figure 1 .
Figure 1.Study area and elevation map with highlighted landslide locations and major roads.

Figure 1 .
Figure 1.Study area and elevation map with highlighted landslide locations and major roads.

Figure 2 .
Figure 2. Field investigation maps depicting landslides in the study area, including (a) Study area map showing landslide polygons in red and elevation in grayscale.The red square indicates the specific region detailed in subfigures (c-f).(b) Detailed topographic map of the area within the red square in (a), showing the distribution of landslide polygons marked in red.The yellow lines in (cf) delineate the boundaries of observed landslides.(c) Bara Khun area, (d) Shandur area, (e) Morkhun area, and (f) Astore area.

Figure 2 .
Figure 2. Field investigation maps depicting landslides in the study area, including (a) Study area map showing landslide polygons in red and elevation in grayscale.The red square indicates the specific region detailed in subfigures (c-f).(b) Detailed topographic map of the area within the red square in (a), showing the distribution of landslide polygons marked in red.The yellow lines in (c-f) delineate the boundaries of observed landslides.(c) Bara Khun area, (d) Shandur area, (e) Morkhun area, and (f) Astore area.

Figure 2 .
Figure 2. Field investigation maps depicting landslides in the study area, including (a) Study area map showing landslide polygons in red and elevation in grayscale.The red square indicates the specific region detailed in subfigures (c-f).(b) Detailed topographic map of the area within the red square in (a), showing the distribution of landslide polygons marked in red.The yellow lines in (cf) delineate the boundaries of observed landslides.(c) Bara Khun area, (d) Shandur area, (e) Morkhun area, and (f) Astore area.

Land 2024 , 22 Figure 5 .
Figure 5. Heat map of Pearson correlations for landslide variables, with red-to-blue grids showing high-to-low strength and labels representing r-values from Pearson correlations analysis, ranging from −1 to +1.

Figure 5 .
Figure 5. Heat map of Pearson correlations for landslide variables, with red-to-blue grids showing high-to-low strength and labels representing r-values from Pearson correlations analysis, ranging from −1 to +1.

Figure 7 .
Figure 7. Landslide covariate priority selection by CART and RFE models.(a) displays the priority of different covariates as determined by the CART (Classification and Regression Trees) model, quantifying each covariateʹs importance in predicting landslide occurrence.The values on the radar chart reflect the normalized importance scores ranging from 0 to 0.18, with higher values indicating greater importance in the model.(b) illustrates the ranking of covariates by the RFE (Recursive Feature Elimination) method, where each covariate's contribution to model accuracy is ranked from 0 to 12, with lower numbers representing higher priority.

Figure 7 .
Figure 7. Landslide covariate priority selection by CART and RFE models.(a) displays the priority of different covariates as determined by the CART (Classification and Regression Trees) model, quantifying each covariate's importance in predicting landslide occurrence.The values on the radar chart reflect the normalized importance scores ranging from 0 to 0.18, with higher values indicating greater importance in the model.(b) illustrates the ranking of covariates by the RFE (Recursive Feature Elimination) method, where each covariate's contribution to model accuracy is ranked from 0 to 12, with lower numbers representing higher priority.

Figure 9 .
Figure 9. Receiver Operating Characteristic (ROC) curves for various machine learning models evaluating landslide susceptibility.

Figure 9 .
Figure 9. Receiver Operating Characteristic (ROC) curves for various machine learning models evaluating landslide susceptibility.

Table 1 .
Shows conditioning factors with sources and spatial resolutions used for landslide susceptibility in the study.

Table 2 .
Variance inflation factor and tolerance values for landslide susceptibility factors.

Table 2 .
Variance inflation factor and tolerance values for landslide susceptibility factors.

Table 3 .
Susceptibility classes and landslide percentages as identified by machine learning models in this study.

Table 4 .
Wilcoxon signed-rank test used for comparing models in the study.

Table 5 .
Evaluation metrics for machine learning models in landslide susceptibility assessment.

Table 5 .
Evaluation metrics for machine learning models in landslide susceptibility assessment.