Predictive Modelling of Landslide Susceptibility in the Western Carpathian Flysch Zone

: Landslides are the most common geodynamic phenomenon in Slovakia, and the most affected area is the northwestern part of the Kysuca River Basin, in the Western Carpathian ﬂysch zone. In this paper, we evaluate the susceptibility of this region to landslides using logistic regression and random forest models. We selected 15 landslide conditioning factors as potential predictors of a dependent variable (landslide susceptibility). Classes of factors with too detailed divisions were reclassiﬁed into more general classes based on similarities of their characteristics. Association between the conditioning factors was measured by Cramer’s V and Spearman’s rank correlation coefﬁcients. Models were trained on two types of datasets—balanced and stratiﬁed, and both their classiﬁcation performance and probability calibration were evaluated using, among others, area under ROC curve ( AUC ), accuracy ( Acc ), and Brier score ( BS ) using 5-fold cross-validation. The random forest model outperformed the logistic regression model in all considered measures and achieved very good results on validation datasets with average values of AUC val = 0.967, Acc val = 0.928, and BS val = 0.079. The logistic regression model results also indicate the importance of assessing the calibration of predicted probabilities in landslide susceptibility modelling.


Introduction
Landslides are one of the main natural hazards that put people's lives at risk, cause damage to property, and require high costs in order to stabilise sites and protect investments in concerned areas when implementing investment projects. On the other hand, human activity significantly and frequently affects slope stability and unsuitably located building sites initiate slope movements. Landslide risk increases mainly through unregulated urbanisation, unsuitable land use, gradual deforestation, and removing continuous grassland vegetation cover. Landslides are influenced by changes in climatic conditions, mainly the character of regional precipitation, occurrence of torrential rains, and their distribution within yearly cycles.
Landslides belong to slope deformations that significantly affect the status and efficient use of land. They pose a constant threat mainly where there are construction sites without adequate measures. They repeatedly cause damage to land linear and other buildings, as well as to agricultural and forest soils. Nowadays, landslide risk in some regions of Slovakia is increasing as a result of more intensive construction activity shifting from flat and slightly inclined locations to slopes and more exposed locations. This trend is evident mainly in municipalities in mountainous areas of Slovakia. It is caused mainly by the lack of suitable construction sites in flat locations but also by targeted construction on slopes as a result of the attractiveness of the environment. For this reason, the need to assess the landslide susceptibility of the area, optimise forecasts of this phenomenon, and efficiently affect human activities in the area is constantly increasing.
The monitored area of Slovakia belongs to the Western Carpathian flysch area, where landslides are the most common geodynamic phenomenon. Landslide occurrence assessment in this area showed that fewer slope deformations occur in flysch highlands than in flysch uplands [1]. Geodynamic phenomena in flysch highlands are characterised mainly by extensive creep failure or landslides with mass movement on layered surfaces. Surface creep is typical mainly for flysch uplands and results in landslides and earth flows.
Maps of landslide susceptibility are useful tools for identifying future potential susceptible areas of landslide occurrence. They are used in land planning, forecasting natural risks, and mitigating measures [2]. The authors studied the influence of sensitivity and scaling issues in the landslide susceptibility mapping using random forests. They indicated that the unit (scale) and the training process strongly influence the classification accuracy and the prediction process. Maps of landslide susceptibility depict the relative probability of a given type of landslide occurring in a given area, without taking into consideration the probability of occurrence in time [3]. They explicitly or implicitly represent a forecast of future occurrences of the landslide areas [4]. They are based on comparing and subsequent statistical processing and assessing the spatial correlation between the predisposing factors (inclination, land use, geological bedrock, etc.), the landslide distribution in the area, and the time dimension of landslides connected to triggering events (precipitation, earthquake, volcanoes, and anthropogenic activities) [5,6]. These maps aim to provide a document that helps mitigate future damage to property and human health in the area. Mapping the landslide susceptibility of the area is crucial for safe planning of human activities, constructions, etc. [7]. At the regional level, landslide susceptibility maps represent the first efficient step in reaching detailed assessment and lowering landslide risk, which contributes to overall safety [8,9].
In the Czech Republic and Slovakia, many authors inquired into landslide susceptibility, see, e.g., in [49][50][51][52]. In their work, the authors of [49] evaluated 11 input parameters that were derived from geological conditions of the area, climatic and hydrological conditions, morphometry of the georelief, and the current landscape structure. The evaluation was processed by a bivariate method and the weight of the parameter as a whole. An important step in urban planning and proposals of land use for development is also producing and implementing models of landslide risk. The work in [53] is the only one to have used predictive modelling when evaluating landslide risk in Slovakia, who applied the Support Vector Machines method and compared it with the bivariate analysis. In all cases, it became apparent that the suitability of the used method is determined by the quality and the amount of the input data, required output data precision and their interpretability.
Landslides are often strongly correlated with multiple causes, resulting in complex parameterisations. Understanding the relationship between the existing soil landslide distribution and the soil landslide influencing factors is a fundamental requirement for soil landslide susceptibility mapping. This study aims to create a landslide susceptibility prognosis of the slopes in the study area using statistical predictive modelling for future use in land use planning. The fifteen generated layers of causative factors were used as predictor variables and were grouped into five classes: topographic, geological, hydrological, soil, and land cover. We chose two modelling methods: logistic regression and random forests. The resulting models were compared with the real occurrence of landslides. In this article, we comprehensively evaluate the susceptibility of flysch areas to landslides in the Western Carpathians, where this methodology has not yet been used. Until now, multivariate or bivariate analyses have been used here. The Kysuce region was selected as a model area as it belongs to the Western Carpathian flysch zone and is composed of sedimentary rocks, primarily sandstone, and clay. The characteristics of these rocks significantly affect the relief and water regime. They are prone to erosion and with their low permeability they can cause frequent landslides during torrential rains.

Study Area
The study area was defined as the entire river basin of the Kysuca River ( Figure 1  The Western Carpathian geological structure is mainly the result of alpine orogeny. The Outer Western Carpathians are formed by the Outer Subcarpathic flysch belt (Krosno and Magura units) and klippen belt (Orava unit-Mesozoic sediments) [55]. The flysch belt represents a massive accretionary wedge with a nappe structure. It is a Cretaceous Formation mainly from the Paleogene. It is a typical alternation of clay shales and sandstones that sedimented in a deep sea environment, where they were transported by gravitational and turbidity currents (Krosno and Magura units) [56].
Three fundamental tectonic units make up the geological structure of the area, from northwest to southeast it is Silesian Nappe, Magura Nappe, and Klippen Belt. The most common geological units in the area are the Zlín Formation (355.65 km 2 ), deluvial sediments (231.62 km 2 ), the Bystrica Unit (229.26 km 2 ), and the Soláň Formation (182.63 km 2 ). The Zlín Formation covers approximately one-third of the study area. This formation has higher susceptibility to landslides, which is caused mainly by the different geomorphological values of the claystone and sandstone sequence. The higher share of claystone results in greater susceptibility of this formation to exogenous degradation processes. They are mostly degraded due to frequent imbibition and drying processes resulting from higher precipitation rates.
From a precipitation perspective, Kysuce belongs to the humid climatic area. The average annual precipitation in the northern areas of the Kysuce valley is 700-1110 mm. The highest average monthly precipitation is in June and July (  -1980  56  53  50  66  88  121  126  100  67  58  66  65  915   1981-2000  60  50  60  67  87  111  107  87  79  52  70  70  902   2001-2020  63  53  53  50  103  94  120  83  81  68  58 56 882 There are 1869 landslides registered in this area. Landslides are registered in the State Geological Institute of Dionýz Štúr databases and their locations were verified in the field. The database contains point landslides (those of small proportion) and area-wide landslides. The Atlas of slope stability maps SR [57] is the basis for digital layers that are continuously updated. Landslides take up almost 20 % (174.21 km 2 ) of the monitored area and the majority, 62 %, occurred mainly due to climatic factors associated with erosion (lateral and deep) and abrasion. According to the activity, the landslides were divided into three categories: active, potential and stabilised ( Figure 2). An active (live) landslide is a landslide in motion. If the motion is currently being suspended, but the causes of its formation can appear under suitable conditions, then the landslide is considered potential (temporarily suspended). In stabilised (permanently suspended-non-active) landslide the causes for its occurrence are no longer present, or they have been removed by human intervention. The majority of the landslides present in the study area, 70 %, were classified as potential landslides, with a total of 1284 (122.36 km 2 ). Stabilised landslides follow with 544 (50.20 km 2 ), and then 41 active landslides (1.64 km 2 ). For the purposes of the predictive modelling of landslide susceptibility the active and potential landslides were grouped together to form the landslide category and the stabilised landslides were assigned to the non-landslide category together with the landslide-free areas (see Figure 3p for the map of landslides as used during the modelling).

Methods and Data
There is no global guideline for selecting landslide conditioning factors, but they should be selected based on the characteristics of the case study area, data availability, and literature review. One of the most important steps in spatial prediction modelling is selecting appropriate and effective conditioning factors among all available factors [58]. When selecting the input variables, we concentrated on those variables that can be triggers for landslides. They are geological, hydrogeological and soil properties, morphometric parameters of relief, hydrophysical soil properties, and hydraulic properties of rocks. We selected two methods of predictive modelling: logistic regression and random forests. In statistical modelling and subsequent evaluation, we proceeded as follows: 1.
selection of input variables and their reclassification, 2.
correlation analysis of the variables and their selection into statistical models, 3.
statistical modelling by logistic regression and random forests methods, 4.
interpretation of the individual methods, and 5.
evaluation of the landslide susceptibility of the area.

Variable Selection and Data Preparation
When selecting the input variables (landscape parameters), we used a database of landscape-ecological complex elements [59], which represents a vector representation of synthetic units expressing relevant characteristics of the abiotic component of the landscape sphere together with elements of land cover. When processing the data, we used real landslide occurrences registered in the State Geological Institute of Dionýz Štúr [57].
In connection to the environmental factors used in evaluation of landslide hazard, there is a tendency to use those layers of data that can be easily obtained from the digital model of relief and satellite images, while lesser emphasis was on those layers of data that needed detailed analysis in the field (environmental factors aimed at the digital model of the terrain, geology and soil, geomorphology, usage of the area and risk elements) [60][61][62]. Hydrological conditions, such as high precipitation and infiltration and exfiltration of groundwater, have strong influence on landslide occurrence [63]. Lithology of rocks and slope inclination are significant factors contributing to the higher activity of slides. When evaluating and mapping the susceptibility to landslides, the majority of authors use the degree of slope inclination and lithology as independent variables [21,64]. The authors of [65] divide factors influencing the occurrence of landslides into two types: external and internal factors. External factors comprise precipitation, earthquakes, and antropogenous factors. However, internal factors also deserve the attention-lithology of rocks, degree of slope inclination, location of the slope, profile of the slope, and elevation gradient.
We selected 15 conditioning factors that acted as input variables in the models of landslide susceptibility. Each of these 15 factors, except for altitude, was obtained in an individual vector layer. Altitude was obtained in a raster layer with 15 m × 15 m grid cells. Therefore, to combine altitude with the other variables, the study area was divided into pixels (grid cells) of size 15 m × 15 m and data properties of each pixel were attributed according to the majority representation of the individual variable classes in the given pixel. Furthermore, the selected modelling methods work better with pixels than with UCU-s and the advantages and disadvantages of both spatial units are comparable [66].
The selected variables are of different types which needs to be addressed during modelling. Of the 15 input variables all but one are categorical variables, with eight nominal (surface relief morphology, hydrogeological units, soil type, land cover, geological units, granulometry, profile curvature, and orientation) and six ordinal (inclination, transmissivity coefficient, filtration coefficient, groundwater level, soil depth, and soil skeleton). The last variable (altitude) is considered a continuous variable. The dependent variable (landslide susceptibility) was taken to be a binary variable.
Some of the classes of the categorical nominal variables contained a small number of pixels which may cause difficulties with interpreting the results during the modelling process and applying the models to other areas. Therefore, classes of the categorical nominal variables were reclassified into more general classes to address this. A list of all conditioning factors and their classes used in modelling and the landslide occurrence being modelled is provided below (visualisations of the occurrence of the variables in the study area can be found in Figures 3 and 4): • Geological units (GU) ( Figure 3a)-a synthetic unit that complexly states the attributes of abiotic components of the landscape. Legend: 0 water; 1 fluvial sediments; 2 proluvial sediments; 3 deluvial sediments; 4 quartzy, arcosic and graywacky sandstones to conglomerates, dominance of black-brown shaly claystones, sandstones to pebbly sandstones and conglomerates; 5 medium grained sandstone to fine conglomerates, at places with biotite, local claystones, marlstones, intercalations of Bystrica-type claystones and quartzy sandstones (thin-layered flysch); 6 sandstone-shale facies: gray quartz and graywacky sandstones, green and grey shales, marlstones and pelosiderites (thin-layered flysch); 7 graywacky sandstones, local shales (sandy flysch); 8 Bystrica facies: Bystrica-type claystones, arcosic and graywacky sandstones with glauconite and conglomerates (flysch); 9 fine-to medium grained quartz carbonate sandstones, green-grey silty claystones, layered grey soft limestones with dark cherts, radiolarites, radiolarian limestones [67]. • Hydrogeological units (HGU) (Figure 3b)-units comprising lithology, lithostratigraphy of rocks, and hydrogeological properties of rock environment. Legend: 0 water; 1 lithofacial undivided loamy slope deposits, rocky soil and debris; 2 sandy-stony and boulder debris cones; 3 loams, sandy loams, sandy-loamy to boulderlike gravels in alluvial fans; 4 sands, sandy gravels and fine-to pebbly-gravels of terraces; 5 claystones, calcareous claystones and marlstone; 6 clayey flysch (flysch with claystones, siltstones and marlstones dominance); 7 normal flysch-claystones, siltstones and sandstones; 8 carbonate flysch (calcareous sandstones and marlstones); 9 sandy flysch (flysch with sandstone dominance); 10 conglomerate flysch (flysch with conglomerates dominance); 11 sandstones with claystones; 12 limestones, quartzy, sandy, marly and nodular limestones, radiolarites. Slope inclination (SIn) (Figure 3i)-expresses a continuous gradient field of the altitudes. It is a key morphometric parameter defining acute intensity of gravitationally conditioned geomorphological processes. It is a characteristic derived from a DTM, calculated in GIS using the algorithm "Neighbourhood method" [68]. Legend (categories are in degrees): (3,7]; 4 (7,12]; 5 (12,17]; 6 (17,25]; 7 (25,35]; 8 (35,90]. • Groundwater level (GWL) (Figure 3j)-a continuous upper groundwater surface limit in the earth crust filling up its cavities. The data were acquired from hydrogeological drilling and engineer-geological probes. These measurements were then extrapolated using Spline interpolation algorithm that enabled the definition of the barriers.  During landslide susceptibility modelling using the logistic regression the categorical nominal variables are transformed into c − 1 binary indicator variables (where c is the number of classes for the given variable). The categorical ordinal variables were transformed to continuous variables by taking the middle of the interval the classes represent as their values, as was suggested in [69]. This still allows for the modelling of the nonlinear relationship between the landslide susceptibility and the input variables by adding the powers of the variables (e.g., slope).
After discarding pixels with missing values for some variables (these were the pixels corresponding to the Nová Bystrica reservoir), the dataset contained 4,223,653 pixels from which 550,896 were labelled as landslide areas.

Correlation Analysis of the Input Variables
A common feature of the real-world data is that the input variables (also called predictors) are not independent from each other. This may result in unstable estimates of the parameters which prevent identifying the variables relevant to modelling landslide susceptibility and interpreting their effects [70]. Therefore, the next step was to perform the correlation analysis for the input variables to determine the relation between them. The manner of measuring the association varies based on the type of variables (continuous, categorical ordinal, or categorical nominal).
To measure the association between the two categorical variables, we used Cramer's V that is defined for two variables X and Y with categories A i and B i , respectively, where i = 1, . . . , r and j = 1, . . . , k, as , where n is the number of pixels; n i· and n ·j are the number of pixels in categories A i and B j , respectively; and n i,j is the number of pixels that are in both A i and B j . Values of Cramer's V lie between 0, indicating no association, and 1, indicating complete association. For calculating Cramer's V the ordering of categories of categorical ordinal variables was neglected.
To measure the association between the continuous or categorical ordinal variables we used Spearman's rank correlation coefficient that is defined as where n is the number of pixels, x i and y i are ranks of the i-th pixel with respect to variables X and Y, andx andȳ are mean ranks of variables X and Y. Values of Spearman's rank correlation coefficient lie in the interval [−1, 1], with values −1 and 1 indicating perfect linear correlation and 0 indicating no linear correlation.

Landslide Susceptibility Modelling
The main aims of statistical modelling are predicting an unknown variable from accessible data or describing in as understandable a manner as possible the relation between the explained and explanatory variables [71]. Selecting the modelling technique should thus depend on the problem we are facing. Linear models are easily interpreted but their predictive power is not as substantial as it is for nonlinear models. On the other hand, nonlinear models have higher predictive power, however it is at the cost of lower interpretability. In our evaluation we will use two of the most used statistical modelling methods-logistic regression and random forests.

Logistic Regression
Logistic regression [72] is one of the most popular and widely used multivariate statistical methods for landslide susceptibility modelling [66,73]. It is a linear model and attempts to model the probability, p, of the event (in this case landslide) occurring based on the values of the input variables. The formula for the probability is defined as where n is the number of input variables, X = (X 1 , X 2 , . . . , X n ) is the vector of input variables and β i (i = 0, 1, . . . , n) are the coefficients of the logistic regression needed to be estimated from the data. For more information about logistic regression see, e.g., in [74].

Random Forests
Random forests [75] are one of the tree-based methods widely used for classification problems. It is an ensemble method which means that it is a collection of basic models that are all trained to solve the same classification problem and their outputs are combined to provide improved performance. For random forests the basic model used is a decision tree.
The decision tree attempts to classify the data by applying a series of deterministic binary criteria. As such it is composed of two node types: decision nodes and end (also called leaf) nodes. In decision nodes, the current split criterion is applied to determine the next node the data point is passed into. In end nodes, the data point is assigned a class that is determined during the decision tree training.
During the training process each decision tree in the random forest is trained on the random portion of the whole training data set and at each node a subset of variables randomly selected from all input variables is considered in creating a criterion used to split the data in that node. This is done in order to reduce the correlation between decision trees in the forest and so decrease the error rate. The class assigned in each end node is determined as the class of the majority of the training data in the end node. For the training of random forests two hyperparameters need to be selected: number of trees in the forest N tree , and number of input variables considered in creating a splitting criterion in each node m try . These variables are randomly chosen from all input variables available to the random forest and are chosen for each node separately. It has been shown in [75] that higher values of m try increase both the correlation between the trees and the classification accuracy of individual trees. Therefore, multiple values of m try need to be considered during the training. According to the authors of [71], increasing N tree has a positive effect on the performance of the model; however, empirical studies performed in [76,77] show that for most of the examined datasets random forest performance does not substantially improve after the first 100 and 64 or 128 trees, respectively, were trained. Therefore, the number of trees in the random forest model used in this study was set to N tree = 100. The number of variables considered for splitting in a node m try was selected to optimise the model performance.
When a random forest model is used for classification, a new data point is run down every tree in the forest to obtain a set of classes assigned to the data by the trees. The final class can then be assigned based on the majority vote. For more information about random forests see, e.g., in [78].
Since we are also interested in the probability of landslide occurrence, we use the random forest method described in [79]. To estimate the probability, first the proportion of landslide pixels in the end node is determined for each tree. Then, the individual estimates are averaged to calculate the final estimate.
The random forests method also allows us to evaluate the importance of predictor variables in relation to the performed task. Two variable importance measures most commonly used are Gini importance [75] and permutation importance [80]. For variable X, the Gini importance takes the accumulated improvement in the splitting criterion over all nodes and trees where the variable X is used for splitting as its importance. The permutation importance is computed as the decrease in the performance of the model when the values of the variable are randomly permuted. In this study the permutation importance is used.

Training and Validation Datasets
To assess the quality of the prediction models it is useful not only to look at the performance of the model on the data that was used to create it, but also on the as of yet unknown data to assess its so-called generalisation ability. The usual practice to do this is to perform k-fold cross-validation. The whole dataset is randomly divided into k equal-sized subsets. The model training and validation is then performed k-times each time with a different subset being the validation dataset not used during training and representing new data. For more information see, e.g., in [81].
After considering the number of pixels in the dataset, number of levels of input variables and their frequencies, the 5-fold cross-validation was performed. We also modified the division of the whole dataset slightly to account for the imbalance in the number of landslide and non-landslide pixels in the dataset.
The usual practice in landslide susceptibility modelling is to train models on the balanced dataset, i.e., on a dataset containing the same number of landslide and nonlandslide pixels. To achieve such training datasets during the cross-validation, landslide pixels were divided into 5 subsets and an equal number of non-landslide pixels was added to each subset. However, since landslides occur only rarely in nature, we also performed cross-validation with the stratified subsets, i.e., the ratio of landslide and non-landslide pixels in the subset was the same as the ratio in the whole dataset. The validation dataset for both types of training datasets was always stratified to simulate the frequency of landslide occurrence in nature.

Evaluation of Model Performance
One of the easiest ways to visualise performance of the binary classification model is to create a confusion matrix. It is a 2 × 2 table with columns corresponding to the labels from the dataset and rows to the predicted labels. The confusion matrix cells then represent the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) pixels.
There are many measures that can be computed from these values depending on the aim of the study. The most frequently used measure is accuracy, which is the ratio between the number of correctly classified pixels and the overall number of pixels: Its reliability is, however, greatly reduced for highly unbalanced datasets, as is the case for the landslide susceptibility modelling, because it places more weight on the more common class, lowering the ability of the model to predict the rare class.
Matthews correlation coefficient (MCC) addresses the issue of imbalanced classes [82]. It is defined as follows: .
MCC values lie in the interval [−1, 1], with −1 and 1 indicating the perfect misclassification and perfect classification, respectively, and 0 indicates that the model is only as good as the random assignment of classes based on their ratio. In addition to accuracy and MCC we also report true positive rate (TPR) and true negative rate (TNR), also called sensitivity and specificity respectively, defined as follows: Both of the models used in this study predict the probability of a landslide occurring for the given pixel. A common way to assign class, i.e., landslide or non-landslide, to a pixel is to select a threshold probability value z (e.g., z = 0.5) and assign labels in such a way that if the predicted probability is lower than z, the pixel is labelled as non-landslide, and as landslide otherwise. It can easily be seen that a confusion matrix is highly dependent on the threshold probability value z. In this study, it is chosen in such a way that the value of MCC is maximised for the given dataset.
In contrast to the confusion matrix, the receiver operating characteristics (ROC) curve is a threshold-independent characteristic of the model. It describes the relation between the TPR (or sensitivity) and 1 − TNR (or 1 − specificity). The area under the ROC curve (AUC) is then used as a quantitative measure to evaluate the models. AUC values lie in the interval [0, 1]. An AUC value of 1 indicates that the model is capable of perfectly differentiating between classes, while a value 0.5 indicates that the model is only as good as the random assignment of classes based on their ratio, and values lower than 0.5 indicate a problem with the model. According to the authors of [83], the model performance is considered acceptable if AUC ∈ [0.7, 0.8), excellent if AUC ∈ [0.8, 0.9), and outstanding if AUC ≥ 0.9. The Brier score [84] was selected for evaluating the probability estimates of the models and is defined as where N is the number of pixels in a dataset, y i is the occurrence of landslides for i-th pixel, i.e., 1 for the landslide pixels and 0 for the non-landslide pixels, x i is vector of values of input variables for i-th pixel, andp(x i ) is the predicted probability. Similarly to accuracy, it also favours the more common class which may result in poor calibration of probabilities for the rarer class [85]. We therefore also compute a stratified Brier score [85] that computes the Brier score for each class separately. Values of both overall (when computed for landslide and non-landslide pixels together) and stratified Brier scores lie in the interval [0, 1] with smaller values indicating better calibration of probabilities.

Computational Software
Data preparation was performed in ArcGis 10.3 and R version 4.1.0 [86]. Association between conditioning factors was computed using the cramerV() function from the rcompanion package [87] and function cor() from the stats package, and results were visualised using the function corrplot.mixed() from the corrplot package [88]. Hierarchical clustering of conditioning factors was performed using the hclust() function from the stats package. Logistic regression models were trained using the function glm() from the stats package. The random forest model for predicting probabilities was trained using the function ranger() from the ranger package [89]. Model evaluation using 5-fold cross-validation and selected performance metrics was performed using custom codes in R. Graphs were produced using functions from the ggplot2 package [90]. Resulting susceptibility maps were produced in ArcGis 10.3.

Correlation Analysis of the Input Variables
For the input variables that are of the categorical type (all but the altitude variable), we assessed the association between the pairs of variables using Cramer's V (results are shown in Figure 4). It is possible to use the obtained association V X,Y between variables X and Y to define dissimilarity between variables X and Y as d X,Y = 1 − V X,Y and use the agglomerative hierarchical clustering method (for technical details see, e.g., in [81]) to create a dendrogram that shows how similar variables are ( Figure 5). In this dendrogram three groups of variables that provide similar information (high values of Cramer's V and thus close connection in dendrogram) are identifiable:   . The rock complex in sandstone development is the main collector of groundwater, characterised by crack-intergrain permeability, and the drainage of this complex is considerably variable. These complexes are made by rocks with low permeability. Unlike this complex, rocks in sandstone-clay development are characterised by rhythmic alternation of sandstones and clay (or prevalence of sandstones in some parts of the formation), i.e., rocks with aquiferous properties with isolators that limit groundwater circulation in the complex. The clayey substrate is poorly permeable and most precipitation water flows over the surface, followed by more intensive erosion development (especially in deforested areas) with subsequent formation of landslips. The alternation of geological layers with different hydrogeological parameters in flysch areas represents a triggering factor for landslides during heavy rainfalls. In the second group are soil characteristics. The association linkages between variables are SSk-ST (0.68), SDe-ST (0.66), SDe-SIn (0.63), SSk-SDe (0.62), SSk-SIn (0.51), and ST-SIn (0.41). Landslide processes often arise due to the saturation of soil pores with water at deeper soils and the action of gravitational forces on unstable slopes (e.g., during long-term rains or as a result of snow melting). In the third group are relief characteristics. The two variables show very high association linkage SRM-PC (0.90). It is a connection of transport slope and linear relief curvature.
Variables-inclination of the slope, transmissivity coefficient, filtration coefficient, groundwater level, soil depth, and soil skeleton-are ordinal variables and for them, together with the continuous variable altitude, we have created the matrix of Spearman's rank correlation coefficients ( Figure 6). The values obtained confirm the findings of association between transmissivity coefficient-filtration coefficient (0.82), and soil depth-soil skeleton (−0.72) based on the values of Cramer's V. In addition, the Spearman's rank correlation coefficient gives the nature of the association based on whether the value is positive or negative: pixels with higher transmissivity coefficient usually also have higher filtration coefficient, whereas pixels with bigger soil depth usually have lower soil skeleton.

Logistic Regression
The presence of correlated variables in the logistic regression may negatively impact the training process and hinder the interpretability of the estimated coefficients [83]. Therefore, the selection of one variable from the groups identified in the previous section was performed first in the following manner. Using 5-fold cross-validation with balanced training datasets a simple logistic regression model with only one prediction variable was trained for each of the candidate variables and average AUC and Brier score values for both training (AUC train and BS train ) and validating (AUC val and BS val ) datasets were computed (Table 3). After taking into consideration the computed AUC and Brier score values, association with the other variables, and also the interpretability of the variables, we chose the following as the representatives of the three groups of variables showing association. From the first group, geological units were chosen since the AUC values were highest and Brier score values were almost as good as values for the hydrogeological units. In the second group, the soil type achieved the best values and was chosen. From the third group, profile curvature was chosen despite its slightly lower values than those for surface relief morphology because the difference is not large and profile curvature exhibits lower association with other variables.
The logistic regression model with predictor variables geological units, profile curvature, soil type, grain-size of soils, orientation, land cover, groundwater level, and altitude was then trained and its performance evaluated using 5-fold cross-validation with two different types of training datasets as described in Section 3.4.
The computed average values of the selected metrics for evaluating the classification performance and calibrating predicted probabilities are summarised in Table 4. The average values of AUC val for both stratified and balanced training datasets were approximately the same -AUC val ≈ 0.732, which according to the work in [83] is acceptable, albeit not very high. The average values of other considered classification measures computed on the validation datasets were also very close for both types of training datasets. The higher average accuracy and TNR values as well as significantly lower average value of TPR on the validation dataset than on the balanced training dataset are probably the consequence of a higher ratio of non-landslide pixels in the validation dataset. The average values of stratified Brier scores show that model trained on balanced datasets achieved similar scores for both landslide and non-landslide pixels, while the model trained on the stratified datasets greatly favours non-landslide pixels at the expense of landslide pixels, as is evident from the high values on landslide pixels and very low values on non-landslide pixels. The overall Brier score was better for stratified datasets, however, this is because almost 87 % of all pixels are non-landslide. The computed average values of the considered measures achieved on training datasets are also reported for completeness.   Table 4. Average values of metrics based on classification performance of (AUC, MCC, Acc, TPR, and TNR) and calibration of probabilities (BS) by logistic regression and random forest models. Bar indicates the average value of metric and standard deviation is given in parentheses. Note that for BS lower values indicate better performance.

Random Forests
To find the number of variables considered for the splitting condition randomly chosen from all 15 input variables for each node separately, m try , that optimises the random forests model performance, models with m try varying from 1 to 15 were trained and their performance in terms of the AUC and the overall Brier score was evaluated using 5-fold cross-validation with balanced datasets (Figure 7). The highest average value of AUC val was achieved for m try = 8 and the lowest average value of BS val was achieved for m try = 9. Therefore, the selected value of m try used for computing other performance measures and producing a landslide susceptibility map was 8. The computed average values of the selected metrics for evaluating the classification performance and calibrating predicted probabilities for the random forests model with N tree = 100 and m try = 8 are summarised in Table 4. The average values of AUC val were close for both balanced and stratified training datasets and were exceptionally high, AUC val = 0.967 and AUC val = 0.971 for balanced and stratified datasets, respectively. For the other classification measures considered, models trained on stratified datasets performed slightly better, except for true positive rate TPR. Again, the higher average TNR value and the significantly lower average TPR value on the validation dataset than on the balanced training dataset are probably the consequence of a higher ratio of nonlandslide pixels in the validation dataset. The better average value of the overall Brier score was achieved by the model trained on the stratified datasets. The average values of stratified Brier scores indicate that training on the stratified datasets puts more weight on the non-landslide pixels as can be expected due to them representing a majority in the datasets while training on the balanced datasets puts more weight on the landslide pixels. The computed average values of the considered measures achieved on training datasets are also reported for completeness.
As random forest models show a very good ability to distinguish between landslide and non-landslide pixels, the significance of the variables used in the model was evaluated using the permutation importance measure [80]. The results are shown in Figure 8. Altitude was found to be the most significant variable, followed by orientation and groundwater level. Orientation plays an important role in the climatic factors of the area. The area is rich in atmospheric precipitation, which is due to its greater exposure to the prevailing northwestern current. It is part of a wider area lying at the interface of oceanic and continental influences, where air masses of various properties alternate several times during the year. Ocean currents reduce the differences between summer and winter (winters are warmer and summers are colder), causing more cloud cover, more rainfall, and more frequent fogging. Changes in the groundwater level can cause landslides, especially in the vicinity of watercourses (this is mainly the fluctuation of the groundwater level in spring or autumn). Geological structure, geomorphological conditions and character of georelief and related hydrogeological conditions of the area as well as anthropogenic factors represented by landscape structure and land use can also be considered relevant variables that reflect favourable conditions for landslides.

Comparison of the Models
The type of training datasets has very little effect on the classification performance on the validation datasets for both the logistic regression and the random forest models, with logistic regression displaying smaller differences. The random forest model outperformed the logistic regression model in all considered measures, achieving very good values of AUC val = 0.967, Acc val = 0.928, and TNR val = 0.949 when trained on balanced datasets compared to values achieved by the logistic regression model-AUC val = 0.732, Acc val = 0.706, and TNR val = 0.722. Although the random forest model is better at correctly classifying landslide pixels with its value of TPR val = 0.787 than the logistic regression model with its value of TPR val = 0.602, the comparison with the very good ability to correctly classify non-landslide pixels (TNR val = 0.949) indicates room for improvement. The logistic regression model used in this study achieved similar performance, based on AUC values, as models used in several studies, e.g., [91] (AUC = 0.763), [92] (AUC = 0.753), [6] (AUC = 0.769), and [93] (AUC = 0.76), while others report higher AUC values, e.g., [94] (AUC = 0.869), [95] (AUC = 0.849), and [96] (the highest reported value was AUC = 0.940). The random forest model achieved a very high average value of AUC, which is in agreement with several studies that reported similar values, e.g., [97] (AUC = 0.960), [48] (AUC = 0.949), [98] (AUC = 0.972), and [99] (AUC = 0.956), while others report lower AUC values, e.g., [91] (AUC = 0.782), [46] (AUC = 0.781), and [6] (AUC = 0.812). The observed viability in model performance reported in the literature may be due to the different methods used in data preparation for model training, e.g., mapping units used (pixels, polygons, slope units), number of data points representing landslides (single point or multiple), and handling of categorical conditioning factors (as dummy variables or transformation into continuous variables). The study performed in [96], where the authors used logistic regression to model landslide susceptibility in four different areas and reported values of AUC between 0.851 and 0.940, suggests that there is also some inherent variability in the effect of conditioning factors on landslide susceptibility. The finding that the random forest model achieves better performance at the task of landslide susceptibility modelling presented in this study is in agreement with studies [6,91,93], where a comparison between the logistic regression models and random forest model was also performed.
A common practice in landslide susceptibility modelling is to immediately convert probabilities predicted by the models to sensitivity classes using Jenks natural breaks method. As this method automatically selects sensitivity ranges based on the input data, it can obscure poor calibration of predicted probabilities which are an important output of the susceptibility models.
As is evident from the average values of stratified Brier scores, the training dataset composition has an impact on the predicted probabilities which can be illustrated by creating histograms of these probabilities (Figure 9). Logistic regression predicted low probabilities when trained on stratified training datasets compared to balanced training datasets (Figure 9a,b), severely underestimating the probability of landslide occurrence even for the landslide pixels. If the Jenks natural breaks method was used before assessing the predicted probabilities, distinguishing which model is better would be difficult, even though the one trained on the stratified datasets predicts probabilities of landslide occurrence below 0.5 for the majority of landslide pixels. A similar instance of a model underestimating landslide occurrence probability can be seen in [100], where the upper bound of the very high susceptibility class is 0.5.
With the random forest models, the predicted probabilities cover the whole interval for both types of datasets (Figure 9c,d), although their effect can be seen in plateauing of the histograms for predicted probabilities for landslide (red histogram in Figure 9c) and nonlandslide (green histogram in Figure 9d) pixels for stratified and balanced training datasets, respectively. Histograms of probabilities predicted by random forest models (Figure 9c,d) indicate that depending on the application and which type of pixels are more important to correctly classify, using a balanced training dataset might result in better performance.
Furthermore, these histograms provide us with insights into why the random forest model greatly outperformed the logistic regression model. For the random forest model the distribution of predicted probabilities for all pixels (grey histograms in Figure 9c,d) exhibits a pronounced U-shape and histograms for landslide and non-landslide pixels (red and green histograms in Figure 9c,d) are very different, as could be expected from a well performing classification model. On the other hand, for the logistic regression model the histograms for landslide and non-landslide pixels (red and green histograms in Figure 9a,b) are very similar to each other.  . Histograms of predicted landslide susceptibility probabilities by logistic regression (LR; (a,b)) and random forest (RF; (c,d)) models trained on different types of datasets for different types of pixels-for all pixels in the background, for landslide pixels in the forefront, and for non-landslide pixels in the middle. Predicted probabilities for individual testing datasets from cross-validation were combined to create one dataset. Note the logarithmic scale on the vertical axes.

Landslide Susceptibility Mapping
A standard result of the landslide susceptibility modelling is the map showing spatial distribution of the predicted probabilities. To create such a map, logistic regression and random forest models trained on balanced datasets were used and for each model the predicted probabilities for individual validation datasets from cross-validation were aggregated to create an overall validation dataset. These datasets were then used to create landslide susceptibility maps ( Figure 10).  According to the logistic regression model, only 0.018% of the Kysuce region was classified as without susceptibility to landslides. According to the random forests model, it was up to 24.8%. For the logistic regression model, the slopes with very low susceptibility cover 17.2% of the area while for the random forests model it is 40.6 % of the area. The slopes with very high susceptibility cover 2.6% of the area for the logistic regression model and 12.7% of the area for the random forests model (Figure 11). When compared with currently registered landslides that cover 18.26% of the area (1869 landslides), the random forests model appears to be more realistic. Furthermore, compare the map results in Figure 10. The probability of landslide occurrence according to the random forest model is the highest on pixels with soil consisting of deluvial sediments or possibly sandy flysch, which are characterised as permeable rocks with middle transmissivity, and groundwater level spanning 2.5 m below the surface covered by coniferous forests or meadows and grasslands on a transport slope with linear curvature at an inclination from 9.5 • to 21 • that is oriented mostly towards south, southeast, west and southwest. The probability of landslide occurrence according to the logistic regression model is the highest on pixels with soil consisting of deluvial sediments (permeable rocks with medium flow) and groundwater level spanning 1.25 m below the surface covered by non-forest woody vegetation or coniferous forests on a slope depression with concave curvature at an inclination from 9.5 • to 14.5 • that is oriented mostly towards south, southwest, west, and northwest.

Conclusions
Our paper represents a study of the selected landslide hazard in the landscape evaluated by two statistical predictive methods (the logistic regression method and the random forests method). The results show that (based on Cramer's V correlation matrix) three groups of variables seem to be highly correlated. Based on Spearman's rank correlation coefficient computed for categorical ordinal variables, the high correlation between the variables (transmissivity coefficient-filtration coefficient, the depth of soil-soil skeleton) was confirmed.
The geological-substrate complex, horizontal form of the relief, and soil type were chosen as the representative variables from the variables in groups in the logistic regression model. The model trained on the balanced datasets achieved values of AUC val = 0.732, MCC val = 0.234, Acc val = 0.706, TPR val = 0.602, TNR val = 0.722, and BS val = 0.213, which indicate that logistic regression is ill equipped for such a complex task as landslide susceptibility modelling.
The random forest model trained on balanced datasets vastly outperformed the logistic regression model and achieved very good values on all performance metrics considered: AUC val = 0.967, MCC val = 0.700, Acc val = 0.928, TPR val = 0.787, TNR val = 0.949, and BS val = 0.079. Altitude, orientation, and depth of groundwater emerged as some of the more significant variables in building the random forest model.
The results presented in this study show that while the training dataset type (stratified or balanced) has little impact on the classification performance of the logistic regression model; it has great impact on the predicted landslide susceptibility probabilities, with balanced training datasets producing probabilities that are less likely to underestimate landslide occurrence. With the random forest model the training dataset composition influences on which pixel type more weight is put, however, the effect on the predicted probabilities is mild. The results produced by the logistic regression model clearly indicate the need to incorporate usage of metrics such as Brier score into the landslide susceptibility modelling methodology.
Due to gradual urbanization and increasing desire for higher quality of life, the architects are forced to inquire into evaluating complex engineer-geological relations in more detail when assessing land, underground, linear, water, and other types of buildings.
An important condition is a detailed knowledge of the current state of the geological environment, geo-barriers, terrain characteristics, and other components of the natural environment. Predicting geological processes that can take place in the future in the given area is a necessity as well. Here, modelling the characteristics of the area and objectively assessing possibilities or conditions of area usage are of great benefit. Determining optimal conditions for area usage and appropriate construction location in the area can be means of reducing high expenses on possible future reconstruction and, last but not least, increasing the safety of the inhabitants.