A Meta-Learning Approach of Optimisation for Spatial Prediction of Landslides

: Optimisation plays a key role in the application of machine learning in the spatial prediction of landslides. The common practice in optimising landslide prediction models is to search for optimal/suboptimal hyperparameter values in a number of predetermined hyperparameter conﬁgurations based on an objective function, i.e., k-fold cross-validation accuracy. However, the overhead of hyperparameter optimisation can be prohibitive, especially for computationally expensive algorithms. This paper introduces an optimisation approach based on meta-learning for the spatial prediction of landslides. The proposed approach is tested in a dense tropical forested area of Cameron Highlands, Malaysia. Instead of optimising prediction models with a large number of hyperparameter conﬁgurations, the proposed approach begins with promising conﬁgurations based on several basic and statistical meta-features. The proposed meta-learning approach was tested based on Bayesian optimisation as a hyperparameter tuning algorithm and random forest (RF) as a prediction model. The spatial database was established with a total of 63 historical landslides and 15 conditioning factors. Three RF models were constructed based on (1) default parameters as suggested by the sklearn library, (2) parameters suggested by the Bayesian optimisation (BO), and (3) parameters suggested by the proposed meta-learning approach (BO-ML). Based on ﬁve-fold cross-validation accuracy, the Bayesian method achieved the best performance for both the training (0.810) and test (0.802) datasets. The meta-learning approach achieved slightly lower accuracies than the Bayesian method for the training (0.769) and test (0.800) datasets. Similarly, based on F1-score and area under the receiving operating characteristic curves (AUROC), the models with optimised parameters either by the Bayesian or meta-learning methods produced more accurate landslide susceptibility assessment than the model with the default parameters. In the present approach, instead of learning from scratch, the meta-learning would begin with hyperparameter conﬁgurations optimal for the most similar previous datasets, which can be considerably helpful and time-saving for landslide modelings.


Introduction
Landslides are a threat to human society in most parts of the world today [1], leading to substantial economic losses and deaths [2]. For that reason, landslide spatial prediction is important for managing landslide-prone areas [3]. There are many factors affecting landslides, such as topography, lithology, hydrology, rainfall, vegetation, and human activity [4][5][6][7]. Such factors are known as causative or conditioning factors that have complex and nonlinear relationships [8,9]. These factors or dataset can be extracted from remote sensing sensors employed in the landslide modelling data preparation process. Nowadays, more remotely sensed data such as satellite images, aerial photogrammetry, including light detection and ranging (LiDAR), and radio detection and ranging (RADAR) can be obtained for landslide data equationtion, making landslide susceptibility modelling (LSM) more efficient in response to landslide disaster management [10][11][12].
Machine learning and statistical modelling are among popular methods for landslide susceptibility assessment [13][14][15]. In the literature, the effect of data quantity and quality on statistical and machine learning models' performance was widely investigated [16][17][18]. For example, the significance of input data [19], the role of absence of landslide data [16,20] effect of balanced and imbalanced data [21,22] and effect of feature transformation on optimisation [12] have been addressed in recent studies [23]. For such purpose of LSMs, software such as ArcGIS, QGIS, and GRASS GIS has been widely used for spatial analysis and visualization, while platforms such as Python, R, R studio, and Matlab have been broadly utilized for prediction and modelling [24].
In recent years, machine learning methods, including deep learning techniques such as convolutional neural networks (CNN) and recurrent neural networks (RNN), have been found successful in landslide modeling compared to costly methods requiring site investigations or domain experts [25]. Machine learning methods have achieved substantial results in assessing landslide susceptibility due to the absence of prior knowledge requirements [7,26,27]. Machine learning approaches also achieve higher prediction accuracy because they can reliably identify nonlinear relationships between causative factors and the likelihood of landslide occurrence [28,29].
Nevertheless, machine learning algorithms need hyperparameter optimisation to optimise prediction capacity, which is an expensive task computationally and requires additional validation datasets [30]. Hyperparameters are parameters set before starting the training process. Optimisation of machine learning algorithms can be achieved with manual search or automatic search methods. Manual search attempts to set the hyperparameters manually. Expert users should identify key parameters with a greater effect on the performance of the model. Professional background and practical experience are required for a manual search. Therefore, tuning hyperparameters with a manual search is not effective and cannot be reproduced easily [31]. Automatic search algorithms are proposed to overcome the drawbacks of a manual search. The most common automatic search methods include grid search, random search, and Bayesian optimisation [32]. Grid search tries each combination of possible hyperparameter values on the training set and evaluates the performance on the cross-validation set according to a pre-defined metric. Although this method achieves automatic tuning, it suffers from the curse of dimensionality. Random search attempts to use random combinations of a range of values. Compared to the grid search, the random search is more efficient in high-dimensional space [31]. Another efficient algorithm and smarter than the grid and random search is Bayesian optimization (BO) [33]. With grid and random search, each hyperparameter guess is independent. However, Bayesian approaches use knowledge of previous algorithm iterations.
Grid and random search were commonly used to optimise support vector machines for spatial prediction of landslide studies [3,[34][35][36][37][38]. Recently, Bayesian optimisation was also used to select hyperparameters of machine learning models to determine landslide susceptibility. Scholars used Bayesian optimisation to select optimal hyperparameters of a CNN for landslide susceptibility assessment [33]; this study showed that Bayesian optimisation could enhance CNN's accuracy by nearly 3% compared to default configurations, outperforming the artificial neural network (ANN) and support vector machine (SVM). Sun D. et al. [39] used Bayesian optimisation to select the hyperparameters of a random forest (RF) model for assessing landslide susceptibility. Their findings showed that the optimised model had higher reliability and landslide prediction compared to non-optimised models.
In addition to the above optimisation methods, many studies have developed other optimisation techniques for machine learning-based landslide susceptibility. The most common methods are population intelligence-inspired (or metaheuristic), including biogeography-based optimizations. Pham et al. [40] used swarm intelligence optimisation named moth flame optimiser (MFO) to find optimal hyperparameters of a CNN model for the assessment of landslide susceptibility. They used MFO to optimise two main hyperparameters of the CNN model, which are a number of convolutional filters and a number of neurons in the fully connected layers. The results of their research suggested that the MFO-optimised CNN model could produce better results than RF, random subspace, and CNN without optimising hyperparameters.
Though the problem remains with the above optimization methods, which are computationally expensive and require additional data sets for complex machine learning algorithms. A substantial number of evaluations are required to find optimal models. A promising approach is to apply meta-learning to the hyperparameter search problem [41,42]. Meta-learning refers to systematically and data-driven learning from prior experience. The key concept behind meta-learning for hyperparameter search is to recommend appropriate configurations for a new dataset based on well-known configurations based on similar previously tested datasets [43]. The first step in meta-learning optimization is to obtain meta-data, which is data identifying prior learning tasks and previously learned models. It includes the exact configurations of the algorithms used to train the models, hyperparameter settings, pipeline compositions and/or network architecture. It also includes the resulting model evaluations, such as accuracy and training time, the learned model parameters, as well as measurable properties of the task itself, also known as meta-features. The second step is to learn from this meta-data, extract, and transfer information to direct the search for optimum models for new tasks.
Prior knowledge on LSM, especially their optimised hyperparameters is important but not utilised in previous studies. This research fills this gap by developing a metalearning-based optimisation of a RF algorithm for spatial prediction of landslides. It aims to speed up optimisation by starting from promising configurations based on several basic and statistical meta-features. The contribution of the research involves developing a set of meta-features that are spatial or statistical appropriate for this research aims. The proposed approach may also naturally be integrated into other machine learning algorithms, making it useful for practical applications.

Study Area
The study area is a sub-area of Cameron Highlands located in the north-eastern tip of Pahang State, Malaysia (latitudes 101 • 24 00 E and 101 • 25 10 E, longitudes 4 • 30 00 N and 4 • 30 55 N) ( Figure 1). It is a tropical mountainous region with frequent occurrence of landslides and flash flooding triggered by strong, prolonged rainfall [44]. The combination of topography, climate, and human activities creates natural hazards that pose a major threat to Cameron Highlands [20,44]. Government reports and previous research indicate that landslides in this area have been common, and that there has been significant damage to property in the past [20].
Geomorphology of the area is mostly hilly landforms, mainly the western and northern parts. The land slope ranges from 0 • to 78 • and the lowest and highest altitudes are 1153 m and 1765 m. Forest and tea plantations, temperate vegetables and flower farms are the main vegetation cover of the area. The primary lithology in this area is megacrystic biotite granites, the other geological structures being schists and phyllite [44]. Cameron Highlands has a mild climate with average annual rainfall between March and May and November to December. The average daytime and night-time temperatures are 24 • C and 14 • C, respectively. Approximately 8% (5500 ha) of the area is agricultural land, 86% (60,000 ha) is cultivated, 4% (2750 ha) is housed, and the rest is used for recreation and other activities. The selected sub-area occupies about 25 km 2 .

Geospatial Database
Valid datasets are required to assess landslide susceptibility with machine learning methods ranging from digital elevation models, geological data to geographic information system (GIS) data such as stream network, road network, and administrative division/subdivision to historical landslides aka landslide inventory. The datasets collected and prepared in the current research are shown in Table 1. A 0.5 m digital elevation model (DEM) was generated using LiDAR point clouds and down-sampled to 2 m. The details of the LiDAR data and campaign are documented in Table 1. The DEM was generated by removing non-ground points by the Multiscale Curvature Classification (MCC) algorithm [45] and the Inverse Distance Weighted (IDW) interpolation techniques were implemented in ArcGIS Pro 2.4. The DEM is then used to generate other causative factors such as altitude, slope, aspect, plan curvature, profile curvature, ruggedness index, relative topographic position, topographic wetness index, and sediment transport index. Satellite images from Landsat 7/ETM+ were used to generate vegetation density and normalised difference vegetation indices. Geological data such as lithology and lineaments have also been analysed in this study. However, the study area contained only one type of lithology (granite) and was therefore not considered being a factor in the prediction models. Lineament data was used to prepare the distance from lineaments factor using the GIS Euclidean distance function. Additionally, GIS data such as stream and road networks, land use, administrative divisions/subdivisions, and historical landslides, were used to prepare other causative factors and to prepare training and test samples.
This research was conducted using open-source libraries including Numpy, Scikitlearn, and Pandas in Python platform. In addition, ArcGIS Pro 2.4 was implemented for data preparation, spatial analysis, and result presentation. All experiments were conducted in Python using Scikit-Learn and Keras on a computer with a Core i7-4510U CPU running at 2.60 GHz, 16 GB of RAM, and a x64-based processor. Landslide inventory data is essential for training and validating machine learning and statistical predictive models. A typical landslide inventory database contains information, such as the location of past and recent landslide events, type of landslides, and other statistical information about the landslide sites and their impacts. The landslide inventory data of the area was prepared by [46] in their previous works in the same area. A total of 63 landslides were identified in the area, as shown in Figure 1. The database shows that most of the landslides are shallow rotational and a few translational in type. Landslide data has been randomly split into a training set (80% of the landslides; 50) and test set (20% of the landslide; 13). Table 2 shows the descriptive statistics of landslide causative factors at landslide locations calculated with mean values from a 10 m circular buffers.

Landslide Causative Factors
This work involved a total of 15 causative factors, two of which were categorical while the others were numerical. All factors were selected based on previous works done in the same area, which showed a significant correlation to landslide events. Although some research indicates that numerical factors should be reclassified from continuous to categorical; this study did not reclassify numerical factors into categorical to allow more reliable calculation of meta-features required in the proposed meta-learning, reduce the predictive model's sensitivity to the type of reclassification models and range of values used to convert continues values into categorical and reduce the time-complexity of the models as they do not have to do extra processes with new examples during prediction. Thus, two factors, namely vegetation density and land use, as well as 13 numerical factors, as indicated in Table 3, are included in this research (the maps of landslide causative factors are provided in Appendix A). The categorical data were transformed to numerical form. The one-hot encoding was used for the ordinal representation. The integer encoded variable is removed and one new binary variable is added for each unique integer value into the variable. Each category value is converted into a new column in this strategy and assigned a 1 or 0 (representing true/false) value to the column [47]. Figure 2 shows the overall methodology for evaluating spatial prediction of landslides using RF and a meta-learning approach for optimising hyperparameters.

Overall Research Methodology
Several geospatial datasets were obtained and pre-processed, including a 2 m digital elevation model, geological data (i.e., lineaments), land use, stream network, road network, and satellite images. Following that, a proper geospatial database was developed to include 15 causative factors from the collected datasets. These include altitude, slope, aspect, plan curvature, profile curvature, distance from roads, distance from streams, and distance from lineaments, land use, normalised difference vegetation index, vegetation density, ruggedness index, relative topographic position, topographic wetness index, and sediment transport index. The database also included the research area boundary (area of interest, short AOI) and historical landslides, which are essential to prepare training and test samples for the modelling phase.
In addition to historical landslides, non-landslide samples are required to prepare training and test samples. They are created randomly at far-off locations by a particular distance to existing landslides. The total number of non-slide samples is equal to the number of landslide locations to prevent data set imbalances. The non-slide samples produced are combined with landslide samples to create the training (80%) and test (20%) samples used to train and test the modelling technique. From training samples, 15 percent were held aside to optimise model hyperparameters. The selection process of non-landslide samples should be performed carefully. There are several strategies for selecting training samples in the literature. The fundamental approach is random sampling which we used in this study [10,11]. The non-landslide samples were collected from the rest of the area, exhibiting different features such as buildings, trees, etc. We utilized landslide inventories as a guide to select these non-landslide points. Accordingly, the generation of the nonlandslide points was performed randomly via ArcGIS Pro 2.4. tool, satisfying the following conditions. First, any non-landslide sample needs to be buffered at a minimum distance of 500 m away from landslides. Second, the distance between any two non-landslide samples must be more than 100 m.
Three RF models were developed using three different hyperparameter configurations such as (1) default values from sklearn library, (2) best values from Bayesian optimisation (BO), and (3) best values from the proposed meta-learning optimisation method (BO-ML). Table 3. List of landslide causative factors used in this research, their calculation method and rationale.

Factor Equation/Calculation Method Rationale
Altitude (m) Extracted from the smoothed version of DEM (calculated by the de-noising algorithm described by [48].
Elevated area affects slope loading. Higher altitude areas increase the likelihood of landslides, especially if the orientation of the sliding plane is closer to open excavation [49], Altitude also affects the extent of rock weathering [50].
A significant topographic parameter in any landslide susceptibility study and commonly used in previous research [3,44]. Higher landslide frequency is often found in steep slopes [33,44].
Regulates topographic moisture levels influenced by solar radiation and precipitation [51].
Plan curvature Calculated using the equation for the calculation of plan curvature as [52]. Plan curvature is negative for diverging flow along ridges and positive for convergent areas, e.g., along valley bottoms.
Curvature represents slope changes along a curve's tiny arcs, influencing slope instability by altering landform character [53]. Plan curvature is the curvature perpendicular to the direction of the peak slope. The convex surface drains moisture immediately while the concave surface holds moisture for long.
Profile curvature Calculated using the same equation for the calculation of plan curvature as [52]. Profile curvature is negative for slope increasing downhill (convex flow profile, typical of upper slopes) and positive for slope decreasing downhill (concave, typical of lower slopes).
Profile curvature refers to the convergence and flow divergence across a surface.
Distance from roads (m) For each cell, the Euclidean distance to the closest road feature was calculated.
A landslide susceptibility assessment routinely uses anthropogenic factors such as distance from roads. Some of the most common actions during construction are shallow to deep excavations, foreign load application, and vegetative cover removal along highways and roads [54].
Distance from streams (m) For each cell, the Euclidean distance to the closest stream feature was calculated.
A hydrological community's intermittent flow regime and gullies encompass erosive and saturation processes. Subsequently, pore water pressure may increase, leading to landslides in areas adjacent to drainage channels [54]. Table 3. Cont.

Factor Equation/Calculation Method Rationale
Distance from lineaments (m) For each cell, the Euclidean distance to the closest lineament feature was calculated.
Lithological factors are used in many landslide research; these affect the type and mechanism of landslides as rocks differ in terms of internal structure, mineral composition, and susceptibility to landslides [55].

Land use
Prepared from 10m SPOT 5 satellite images using maximum likelihood classification. 10 land use and land cover types were identified, e.g., water, transportation, agriculture, residential and bare land. The classified map's overall accuracy was 87.20%, verified with field surveys.
Human activities influence patterns of land use, contributing to landslides [56], more common in barren areas than forests and residential areas [57].
NDVI is highly correlated with photosynthesis activity, hence with vegetation density [58] Greater NDVI values mean more amount of vegetation cover.

Vegetation density
The area was divided into four classes of vegetation densities, i.e., non-vegetation, low-density, moderate-density, and high-density vegetation. Non-vegetated areas were identified based on aggregating non-vegetation classes of land use data. The three density levels of vegetation were determined by classifying the NDVI raster using (0.176-0.286, 0.287-0.417, 0.418-1.0) ranges for the three classes, respectively.
Vegetation cover plays an important role in causing landslides in Cameron Highlands. Such areas are vulnerable to unstable erosion. The vegetation root leads to hill slope stabilization and reduction in landslide occurrences. The higher vegetation density value indicates the higher vegetation concentration per area unit.

Ruggedness index (RI)
For each grid cell in the DEM, the root-mean-square-deviation (RMSD) is calculated using the residuals (i.e., elevation differences) between a grid cell and its eight neighbours. Details can be found in [59].
Used to describe and quantify local relief. It also affects erosion and deposition rate, activity rates, and age of deposits. Changes in gradient result in increased rainfall accumulation and infiltration.
Relative topographic position (RTP)  To optimise RF hyperparameters with the proposed meta-learning method, the following steps are followed. First, a total of 110 datasets were created by random sample features and samples from the available data. For each dataset, a set of basic and statistical meta-features were calculated. Additionally, for each dataset, the best hyperparameters of RF were obtained through Bayesian optimisation. Each dataset contained a pair of meta-features and best values of hyperparameters. Datasets were then randomly split into training datasets (100) and validation datasets (10). To find best hyperparameter values for a new dataset, k nearest datasets were determined by comparing meta-features with k nearest neighbour algorithm and running Bayesian optimisation by starting from a list of hyperparameter configurations suggested by meta-learning. Lastly, the model predictive ability was evaluated based on three accuracy metrics, including k-fold cross-validation accuracy (k = 5 in this work), F1-score, and area under the receiving operating characteristic curves (AUROC).

Modelling: Random Forest
Random forest (RF) developed by Breiman, L. [60] in 2001 is a common supervised classification algorithm based on the learning of decision trees [60]. It can also be extended to other tasks such as regression, clustering, and detecting interactions. A forest is formed by constructing several binary trees. Each tree fits a bootstrap sample from the training set with a random subset of features and samples selected at each node to minimize correlation among the generated trees [61]. The random selection of the data set may affect the model's performance, so using a collection of several trees helps ensure model stability. For each tree grown on a bootstrap sample, the error rate "out-of-bag" (OOB) equal to the standard deviation error between predicted and observed values is calculated using samples left out of the bootstrap sample. OOB samples are also used to establish a ranking of importance for the features. Majority vote of all trees decides the final predictions.
When splitting a node during tree construction using the concept of the largest Gini coefficient, the selected split is no longer the best split among all features [62]. Alternatively, the split is the best split among a random subset of features. Because of this randomness, the bias of the forest typically increases marginally (regarding a single non-random tree bias) [63,64]. Nevertheless, due to averages, its variance often decreases, typically more than compensating for the increase in bias, producing a better overall model. RF estimates the relative importance and ranks input features concerning the predictability of landslides using features used as decision nodes in the trees [36]. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. Thus, the estimated fraction of the samples they contribute to can be used to estimate the relative importance of the features. Here, the fraction of samples a feature contributes to is combined with the decrease in impurity from splitting them to create a normalised estimate of the predictive power of that feature. Other benefits of the RF algorithm include [65]: • RF is ideal for working with mixed variables, i.e., both categorical and numerical, most likely in landslide modelling, • In RF, each tree has access to specific subspace feature sets. This random selection of features to split each node contributes to a favourable error rate. This randomness also offers high accuracy rates for outliers in predictors, and • RF is a good feature engineering tool. That means finding the most relevant features from the training dataset.
The challenge with RF is to optimise a number of hyperparameters to increase efficiency and predictive capacity [66]. The most important parameters include the number of trees in the forest, maximum depth of the trees, the maximum number of features considered at each split, minimum samples required in a leaf, minimum samples required to split a node, and the function to measure split quality (also known as criterion). Selecting optimised values for these parameters can reduce the model's time complexity, increase model generalization (minimize the OOB error), and produce a more accurate estimation of feature importance. Appendix B contains more information on these parameters [60].

Optimisation: Bayesian Optimisation
This research uses Bayesian optimisation (BO) to determine optimal landslide predictive models-hyperparameters. Unlike grid and random search, BO finds optimum hyperparameters faster, relying on prior knowledge gained through iterations. The algorithm depends on the Gaussian process (GP) model to fit the posterior distribution of an objective function by increasing the number of samples, allowing it to find the optimal solution and optimise the hyperparameter [39]. GP model can easily determine a predictive distribution of objective function. The model's predictive distribution determines the possible values of the objective function at each point of the input space. By considering this predictive distribution, BO methods guide the search, focusing on those input space regions that are expected to provide the most information about the solution to the optimisation problem.
In this research, BO was used to optimising six RF model hyperparameters as shown in Table 4. It was based on 20 iterations, also known as calls (n). As an objective function, negative minimum AUROC with five-fold cross-validation was used. After each iteration, the better model configuration was found until the 20-call convergence. At the cost of additional processing time, better configurations can be found when looking for larger search spaces.

Meta-Learning:
The idea of meta-learning is to find the best values of RF hyperparameters for new datasets based on previous knowledge regarding best hyperparameters. For a new dataset, the Bayesian optimisation, instead of learning from scratch, would begin with hyperparameter configurations that were optimal for the most similar previous datasets. The similarity between datasets can be measured using common distance metrics such as p-norm distance.
Let us denote the best hyperparameter configurations asθ 1 , . . . ,θ N for the previously encountered datasets D 1 , . . . , D N , respectively. For a new dataset, first, the algorithm computes the p-norm distance (d p ) from this dataset to all the training (encountered) datasets and sort them by increasing distance to the new dataset. The similarity between the datasets was measured based on their meta-features, which can be computed for the training and validation (new) datasets.
where d p is the p-norm distance between D i and D j datasets, and m i and m j are the set of meta-features for the two datasets.
Then, t hyperparameter configurations will be determined based on the most t similar datasets based on d p . In this research, t was set to 10. Starting from these best t hyperparameter configurations, Bayesian optimisation will run for 10 calls to find the best hyperparameter values for the new dataset. This will help to save time and computing resources as there will not be a need to search large hyperparameter spaces.
Implemented Meta-Features: To evaluate the proposed meta-learning approach, a number of basic and statistical meta-features were implemented. This research implemented 13 basic meta-features such as the number of instances and the number of features, describe the basic dataset structure [67]. Further, it implemented 14 statistical meta-features which characterise the data via descriptive statistics such as kurtosis and skewness [68]. Such meta-features can characterize the complexity of datasets providing an evaluation of the performance of the algorithm [69]. Broad data characterization, deep data exploration and various meta-learning-based data assessments can be obtained by extracting numerous meta-feature functions [69]. These meta-features can be general information (e.g., simple measures, number of instances, attributes, and classes) and statistical information (e.g., standard statistical measures describing data distribution and discriminant measures including Min., Max, Standard deviation, Skewness, Kurtosis, etc.). These meta-features can be simply extracted by instantiating the "MFE class" in Python. It calculates a group of meta-features as summary functions to abstract these values. After fitting, the "Extract" method extracts the corresponding measures [70]. The details of these meta-features are given in Appendix C.

Datasets:
The evaluation of the meta-learning optimisation was based on datasets randomly sampled (both factors and samples) from the datasets of Cameron Highlands. A total of 110 datasets were created with a different number of factors and a number of samples. For developing and training the meta-learning models, this research used 100 datasets randomly selected from the generated datasets. The remaining 10 datasets were kept for validating the method. Table 5 shows the shape and the list of causative factors selected in each validation dataset. For each training dataset, this research first shuffled the samples and then split it in stratified fashion into 80% training and 20% validation. Before training any machine learning model, the dataset needs to be well-shuffled to avoid bias or patterns in the split datasets for training, testing, and validation datasets. If shuffling is neglected, the data can be sorted, or similar data points could lie next to each other, leading to slow convergence.
Moreover, stratified sampling seeks to split a dataset so that every split is similar with respect to datasets to ensure the same distribution of classes on datasets. It is often desired to ensure that the train and test sets have almost a similar percentage of samples of each target class as the complete set.
Then, we computed the validation performance (AUROC) for Bayesian optimisation by five-fold cross-validation using the validation dataset. For each training and validation dataset, the basic and statistical meta-features were computed and stored on disk. As such, the computed datasets stored on the disk were the meta-features and performance of RF with best hyperparameters found by the Bayesian optimisation for the training dataset and only meta-features for the validation datasets.

Performance Assessment
The performance of the proposed models was evaluated using three standard accuracy metrics including k-fold cross-validated accuracy, F1-score, and area under receiving characteristic curves (ROC) (i.e., AUROC).

k-fold cross-validated accuracy:
Accuracy is the most common metric for classification tasks, including landslide susceptibility mapping. It is the fraction of the correctly predicted sample and can be calculated using the following equation [71]: where TP is true positive, the number of positive samples correctly predicted, TN is true negative, the number of negative samples correctly predicted, FP is false positive, the number of positive samples wrongly predicted as negative, and FN is false negative, the number of negative samples wrongly predicted as positive.
To avoid miscalculating the accuracy of a model, it is suggested to use k-fold crossvalidation, which is a model validation approach based on partitioning the data into k subsets (k = 5 in this research). The basic idea behind this approach is to hold a set at a time and train the model on the remaining set. Then, test the model on hold out set. However, accuracy is not always the best metric for evaluating classification models because it does not care about positive events. Therefore, additional metrics are often used to explain the performance of the proposed models.

F1-score:
The F1-score is the harmonic mean of recall (r) and precision (p), with a higher score as a better model. Precision (p) is determined by dividing the true positives (number of landslide pixels) with the total number of pixels classified as a landslide. The sensitivity (r) is the degree of true positives correctly predicted and distinguished and can be defined as the number of true positives divided by the total number of pixels belonging to the landslide class. The F1-score is calculated using the following equation [72]: The problem with the F1-score metric is assuming a 0.5 threshold for selecting which samples are predicted as positive. Changing this threshold would change performance metrics. ROC curve is a very common method to solve this problem.
ROC curves and AUROC: ROC, a graphical representation of model success and predictive accuracy, is another important accuracy metric often used to assess models of landslide susceptibility. The area under ROC is known as AUROC, a quantitative measure summarizing model performance. ROC curves help to understand the balance between true-positive and false-positive rates. A perfect model has 1.0 AUROC, and 0.5 AUROC indicates random models. The closer the AUROC to 1.0, the higher the model's performance [73].
ROC curves plot y-axis sensitivity and x-axis specificity, corresponding to decision thresholds [74,75]. AUROC calculates a trapezoidal equation as follows: where T i is the ith percent landslide susceptibility, C i is the ith cumulative percentage of landslide occurrence, n is the number of the percent landslide susceptibility index value, and B is the baseline value (i.e., B is usually equal to zero).

The Relative Importance of Causative Factors
The influence of every conditioning factor on the occurrence of landslides varies [76,77]; thus, examining the importance of the factors for landslide occurrence can provide valuable information. To date, there is no agreement on the selection of reliable landslide conditioning factors due to the complexity of landslides [7,78]. However, several strategies can support the identification of the most and least contributing factors. RF is a popular technique that is widely used for this purpose due to its ability to feature importance/selections.
Multicollinearity can occur when a predictor variable in a multiple regression model can be linearly predicted from the others with superior accuracy. However, decision trees algorithms (such as RF) are resistant to multicollinearity or outliers by nature [79]. Moreover, the success of a meta-learning technique significantly relies on the quality and quantity of the meta-data features employed for learning. To well characterize the metadata, a collection of many meta-features discriminating among various learning tasks is needed [79]. This ability of meta-learning can offer a better generalization to other areas with almost similar geo-environmental and topographical characters. Nevertheless, we have selected the most essential and best available causing factors based on the previous projects in the study area [20].
In the present study, we used RF as the standard algorithm for the modelling, as the accuracy of RF is not certainly affected by the multicollinearity loads [80][81][82][83]. In fact, this is one of the main advantages of RF, which meta-learning could benefit and learn from the features in a convenient scheme.

Results
This paper developed a meta-learning approach to optimising RF models for the assessment of landslide susceptibility at Cameron Highlands. This proposed approach was compared to RF with default hyperparameter settings (sklearn) and Bayesian optimisation to evaluate its predictive ability. Table 6 shows the default values of the critical RF hyperparameters taken from the sklearn software package (https://scikit-learn.org, access on 1 September 2020). RF, with these default values, achieved AUROC of 0.779 and 0.761 for the training and test datasets, respectively. Comparing the default model with random models on the test dataset achieved AUROC of 0.742 ± 0.015 for 20 randomly sampled parameters. Figure 3 shows the ROC plots of these random models. The results suggest that RF with the default parameters compares to a random model, indicating the need to optimise these critical parameters systematically to produce more accurate landslide predictions.   Table 7 shows the performance of the RF model with two optimisation methods, i.e., Bayesian method and the proposed meta-learning based approach compared to the model with the default parameters. Considering the five-fold cross-validation accuracy metric, the Bayesian method achieved the best performance for both the training (0.810) and test (0.802) datasets. The meta-learning approach achieved slightly lower accuracies than the Bayesian method for the training (0.769) and test (0.800) datasets. The model with the default parameters achieved the worse performance. Similarly, the Bayesian method outperformed the other two models based on F1-score on both the training and test datasets. Based on this metric, the meta-learning approach (BO-ML) also achieved better results than the model with the default parameters. Finally, based on the AUROC metric, the models with optimised parameters either by the Bayesian or meta-learning methods produced more accurate landslide predictions than the model with the default parameters of the sklearn. Figure 4 illustrates the ROC plots of the three investigated models using the test dataset.  To further evaluate the proposed meta-learning optimisation method, Table 8 shows the best hyperparameter settings obtained by the Bayesian method for the 10 validation datasets (see Section 2.3.4). These best hyperparameter configurations were obtained based on the search space given earlier in Table 4. The performance of the three RF models (i.e., default parameters, Bayesian method, meta-learning approach) are given in Table 9 for the training dataset and Table 10 for the test dataset using the three different evaluation metrics, i.e., five-fold cross-validation accuracy, F1-score, and AUROC.  Table 9. Performance of the three investigated models on the 10 validation datasets using the training subset.

Landslide Susceptibility Maps
The successfully tested models were used along with the training dataset to prepare the landslide susceptibility maps of the study area. For each pixel, the probability of landslide occurrence was computed using RF with the default parameters, Bayesian method, and the proposed meta-learning approach. The resulted data was used in GIS and the three susceptibility maps were produced. The maps were reclassified into five susceptibility classes, namely very low, low, moderate, high, and very high susceptibility using the natural break classification method implemented with the ArcGIS Pro 2.4 function ( Figure 5). Then, the number of landslides were identified in each susceptibility class in the produced susceptibility maps to quantify the landslide density (number of landslides/susceptibility class area). Figure 6 shows the area in pixels of the five susceptibility classes and the landslide density graphs based on the results from the three tested models. The three models identified the largest number of landslides in the higher susceptibility classes. Higher susceptibility values found in areas with high altitudes, far from roads, and close to streams.
As the five susceptibility levels by the natural break classification approach are based on the histogram of data distribution, the classes were therefore distributed within the five class ranges as (very-low (<0.2), low (0.2-0.4), moderate (0.4-0.6), high (0.6-0.8), and very-high (>0.8) which is common in natural break classification [84]. Figure 6a shows that almost more than 51% of the area belongs to very-low and low classes, while almost 3% of the area ranges in a very-high susceptible area which seems logical within the study area. More importantly, as is shown in the landslide density graph in Figure 6b, the historical landslides mostly fell into the high and very-high susceptibility regions, revealing a good correspondence with the map generated by the present meta-learning approach (BO-ML).

Discussion
LSMs are essential tools for managing landslide-prone areas. The models are also required to have a high predictive ability and be stable to be used successfully by governmental agencies to develop landslide risk strategies. This paper contributes to the improvement of RF models with Bayesian optimisation and meta-learning for accurate landslide susceptibility mapping at Cameron Highlands. Unlike existing studies which optimise machine learning models from scratch, the proposed optimisation approach starts with promising hyperparameter configurations that performed well on encountered datasets. This approach saves time and computing resources as it runs at fewer iterations compared to other grid and Bayesian methods.
This research used RF as a prediction model because it showed a encouraging performance and stable predictions in previous studies [15,24,85]. In the present study, as meta-learning begins with promising configurations based on several basic and statistical meta-features, RF appears an attractive method for dealing with such meta-features. It searches for the best feature among a random subset of features instead of searching for the most critical feature while splitting a node, which provides additional randomness to the model. Moreover, it is a powerful method that is less sensitive to multicollinearity, noise, and outliers since it employs random features for the modelling [86,87]. These random features in the learning process make the overfitting issue to be minimal. Another advantage is its ability to recognize the most and least important causing factors, which is an additional benefit to support the aim of the present work.
Since RF can assess the relative importance of the landslide causative factors, we looked at how it performs with different hyperparameter configurations. Figure 7 shows the relative importance of the causative factors obtained by the default model, Bayesian method, and the proposed meta-learning approach.
The three models agreed that the distance from roads and vegetation density are the most and least important causative factors regarding the predictability of landslides in the study area, respectively. These results were also confirmed as most of the landslides have occurred along the roadside (Figure 8).  The default model identified NDVI and distance from streams as the second and third most important factors and plan curvature and distance from lineaments as the second and third least important factors. Instead, the Bayesian method showed that STI and distance from streams are the second and third most important factors. This model also showed that distance from lineaments and TWI are the second and third least important factors in the study area, respectively. Finally, the meta-learning approach calculated altitude and distance from streams as the second and third most important factors. This approach agreed with the default model that the plan curvature and distance from lineaments are the second and third least important factors regarding the predictability of landslides in Cameron Highlands. While the three models agree on some factors as the most or least important in the study area, further analyses should be carried out if computing factor importance is the research focus rather than the accuracy of landslide predictions However, the same meta-learning approach can be used to optimise other machine learning models, especially those require extensive computing power, e.g., neural networks and ensemble models. The meta-learning optimisation can also be a promising solution when many models need to be tested as it provides faster optimisation to those models.
Meta-learning also can be used with other optimisation methods other than Bayesian techniques such as evolutionary approaches [40]. As those are already being tested for landslide prediction, it could be worth exploring meta-learning with such optimisation techniques. Since the application of meta-learning as an approach of optimisation is very rare in landslide prediction studies, advances in these techniques can improve the accuracy and efficiency of landslide prediction models which ultimately can lead to better landslide risk management.
This research confirms that optimising the RF hyperparameters can lead to more accurate prediction of landslides. The use of Bayesian optimisation with RF yielded more accurate predictions by~7% (five-fold cross-validation accuracy),~0.125 F1-score, and~0.103 AUROC compared to the model with default parameters. The meta-learning approach achieved predictions better than the default model by~6.8% (five-fold crossvalidation accuracy),~0.105 F1-score, and~0.059 AUROC. The critical parameters of RF investigated and optimised in this research have controls to model bias, variance, and overfitting issues, which explains why the optimised parameters yield more accurate predictions than the default parameters. The results highlight the importance of integrating optimisation techniques such as Bayesian and meta-learning to RF and possibly other machine learning models to achieve significant improvements in preparing landslide susceptibility maps. Other related studies also reported improvements to the RF [39] and deep learning methods [33] with Bayesian optimisation for landslide susceptibility assessment.
Although the Bayesian optimization and meta-learning optimization techniques were relatively close to each other, BO-ML was successfully able to identify the largest number of landslides in the higher susceptibility classes as per density graphs in Figure 6b, which is a promising result. From another side, the Bayesian optimization effectively tunes a few hyper-parameters; however, its productivity reduces a lot when the search dimension grows in large amounts, leading to a situation where it is at the same level as random search [88]. In addition, it can suffer from a high computational cost if the number of evaluations is high.
Nevertheless, the proposed optimisation approach starts with promising hyperparameter configurations that performed well on encountered datasets. This approach saves time and computing resources as it runs at fewer iterations compared to other methods. As the meta-learning optimisation technique is very limited in the landslide prediction, advances in these techniques can boost the accuracy and efficiency of landslide modeling, especially when the data are large and more complex that needs a massive processing time. It can improve model generalization and produce a more accurate estimation of contributing factors. In addition, in previous studies, the prior knowledge extracted from those techniques was not used to optimise prediction models, but rather optimal values were found by searching large hyperparameter search spaces. It is therefore simple to execute and can be quickly utilized to off-the-shelf hyperparameter optimizers.
As a guideline for employing meta-learning in landslide prediction domain, the selected set of meta-features represents a first footstep in the direction of the meta-learner design, which is capable of suggesting the proper bias for base-learning in landslide studies. To generate such a meta-learning scheme, we recommend the researchers further evaluate the chosen meta-features' applicability in characterizing diverse learning domains, including groups of related features that need to be considered.
As a limitation of the study, as with any machine learning algorithm, the success of a meta-learning technique significantly relies on the quality of the meta-data employed for learning. To characterize meta-data, a compilation of many meta-features, discriminating among various learning tasks is to be identified/considered. Additionally, careful analysis and usage are needed as complex models need more time to adapt; therefore, it could be challenging to assess their final operation in the early stages of learning.

Conclusions
A large number of previous studies have optimised the spatial prediction of landslides. The knowledge extracted in such studies may play a significant role in studies in areas with similar topographic and geological characteristics. However, this prior knowledge was not used to optimise prediction models, but rather optimal values were found by searching large hyperparameter search spaces. This paper introduced the use of meta-learning to optimise spatial prediction, improving the optimisation process of spatial prediction models of landslides. The approach was based on a set of spatial and statistical meta-features for input data, which can be easily calculated for a given datasets. In this way, the optimisation algorithms can be directed towards the optimal values of the hyperparameters faster and as a result yielding improved optimisation.
To test the proposed approach, RF was used as a benchmark prediction model for its generalisation ability and based on previous studies as it often achieved accurate predictions. The meta-learning optimisation was used to find best values of critical parameters of the model can be compared to a benchmark method (Bayesian optimisation) and the model with default values of parameters commonly used in machine learning tools such as Python's scikit-learn library. Experiments on comparing RF with default parameters to a random model suggested the need to optimise critical parameters systematically. Based on the AUROC metric, optimised models (Bayesian or meta-learning) produced more accurate results than the model with the default parameters.
Despite the successful landslide prediction results obtained in this research with Bayesian and meta-learning optimisation techniques, further research is required to identify and implement more specific meta-features to landslides and spatial data, i.e., spatial distribution, spatiotemporal data clusters, landslide types, and landslide geometry. The more meta-features, the more accurate estimation of data similarity can be obtained which will result in better identification of suboptimal hyperparameter values. As a result, the prediction models can be more accurately and reliably be optimised and used on new datasets.
Overall, advancing our understanding of hyperparameter optimisation of a landslide prediction model can have a positive impact on decision making for planning and managing landslide-prone areas. By implementing a proper meta-learning scheme, further generalized state-of-the-art models can be developed for the landslide modelings. Although, in this research, we conducted a comparative study on a relatively small region, implementing the model on even larger configuration spaces can be considered to assess the model's exportability. Furthermore, the study could be simulated in other topographical set-ups with a variety of factors and different kinds of landslides (e.g., debris flow, rockfall, deep-seated landslides, etc.).    At higher numbers of trees, the RF is relatively stable, but at one point it can still overfit and time complexity of the model can increase. This should be tuned.

Criterion
Function to measure split quality. Gini impurity and information gain are two common functions.
Both Gini and information gain impurity metrics work well. However, the latter is more computationally heavy due to the log in the Entropy equation.

Maximum depth
Tree's maximum depth. The longest path between root and leaf node or max number of splits possible within each tree Since this parameter is used to control over-fitting as higher depth allows the model to learn very detailed relationships to a particular sample, it should be tuned. Appendix B Table A1. List of the most critical hyperparameters of RF.

Parameter Explanation Necessity of Tuning
Number of estimators or trees Number of trees to create the RF.
At higher numbers of trees, the RF is relatively stable, but at one point it can still overfit and time complexity of the model can increase. This should be tuned.

Criterion
Function to measure split quality. Gini impurity and information gain are two common functions.
Both Gini and information gain impurity metrics work well. However, the latter is more computationally heavy due to the log in the Entropy equation.

Maximum depth
Tree's maximum depth. The longest path between root and leaf node or max number of splits possible within each tree Since this parameter is used to control over-fitting as higher depth allows the model to learn very detailed relationships to a particular sample, it should be tuned.

Maximum features
Number of features to consider when looking for best splitting. These are randomly selected.
The square root of the total number of features is suggested as a thumb-rule. The lower is greater for the reduction of variance, however, the bias will increase. Higher values can lead to over-fitting. It should, therefore, be tuned. This can smooth the model. A smaller leaf makes the model more prone to capturing noise in train data and should, therefore, be tuned.

Minimum samples split
Specifies the minimum number of samples required for splitting in a node.
To control over-fitting. Higher values prevent a model from learning relationships that could be very specific to the particular sample selected for a tree. Too high values can also contribute to underfitting. The ratio between the categorical to numerical factors.

Meta Features Equation Explanation
Rationale Statistical features (features describe statistical properties of the dataset) Skewness E(X−µ X ) 3 σ 3

X
List of skewness values for the causative factors. It defines the measure of the symmetry of data distribution for a given X value. When X is far below its mean (X − µ x ) 3 is a big negative number, and when X is far above its mean (X − µ x ) 3 is a big positive number. Feature