Transfer Learning with Attributes for Improving the Landslide Spatial Prediction Performance in Sample-Scarce Area Based on Variational Autoencoder Generative Adversarial Network

Owing to the complexity of obtaining landslide inventory data, it is a challenge to establish a landslide spatial prediction model with limited labeled samples. This paper proposes a novel strategy, namely transfer learning with attributes (TLA), which makes good use of existing landslide inventory data and is based on a variational autoencoder generative adversarial network (VAEGAN), to improve landslide spatial prediction performance in sample-scarce areas. Unlike transfer learning (TL), TLA pretrains the model with data reconstructed by the VAEGAN, so that the model learns the landslide attributes of the sample-scarce area in advance. Accordingly, a database containing a total of 986 landslides in three study areas with 14 landslide-influencing factors was established. Each of three models, i.e., convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM) and gated recurrent units (GRUs), was selected both as the feature extractor of the VAEGAN, to reconstruct the data with attributes, and as the prediction model, to generate the landslide susceptibility maps, in order to investigate and validate the proposed TLA strategy. The experimental results showed that the TLA strategy increased the mean value of the evaluators, namely the area under the receiver-operating characteristic (AUROC), F1-score, precision, recall and accuracy, by about 2–7% compared with TL. These results indicate that the generated data carry the attributes of the specific study area and that the TLA strategy is effective in sample-scarce areas.


Introduction
Landslides occur on all continents and are a global threat to human infrastructure and the environment [1]. Urbanization (e.g., highway expansion) accelerates the demand for landslide prevention and control. However, landslide assessment is a complicated task that requires an understanding of geotechnics, geomorphology, hydrology and statistics [2]. Although establishing a physical model to evaluate the stability of a slope is reliable [3], this approach is suitable only for a small research area or a single slope [4]. Thus, it is a challenge to use physical models to evaluate the stability of a large number of slopes, as doing so is time-consuming and expensive [5].
Recently, landslide spatial prediction (LSP) has become essential for producing landslide susceptibility maps (LSMs) [6]. Many models for LSMs have been proposed, such as statistical [7] and machine-learning (ML) [8] models. However, these methods cannot extract critical information from the landslide-influencing factors; in addition, machine-learning methods are prone to overfitting, which makes it difficult to improve the prediction accuracy [9]. To address these problems, deep-learning frameworks (e.g., convolutional neural networks (CNNs)) have received increasing attention, as they can reach a level equivalent to or higher than that of human experts in prediction accuracy and objectivity in many fields [10]. Deep-learning frameworks are now commonly used in LSP [11][12][13] and have been confirmed to perform better than machine-learning methods in LSP [14]. These LSP models take landslide-influencing factors as inputs to generate spatial probability maps of landslides [15]. However, an excellent LSP model of an area needs a lot of data for training. Collecting landslide inventory data is a challenge, whether the data come from an onsite field survey or from a search of remote-sensing images and historical data, because a lot of professional knowledge is needed [16]. Thus, it is difficult to establish a strong model in sample-scarce areas [17].
For data with similar features, TL, which makes good use of the existing landslide inventory data of other regions, can solve the above problem. That is, a model is first trained with the data from a sample-abundant region, and this pretrained model is then fine-tuned by using data from the sample-scarce area. The TL strategy utilizes the learned representation of a well-trained model, which enables the model to transfer the learned knowledge to other data sets [18,19]. Although TL has impressive effects, some critical problems remain to be solved when the TL strategy is applied to LSP. In particular, for data with dissimilar features, it is hard to generalize a model pretrained with the data from a sample-abundant area to a sample-scarce area [16], owing to the data-set bias (e.g., the amount of rainfall differs considerably between areas), which makes the effect of TL negligible.
Thus, it is necessary to increase the attribute similarity between the data set used for training and the one used for fine-tuning the models. Reducing the data-set bias by improving the attribute similarity between the source domain and the target domain can enhance the performance of the models when TL is applied [20]. The technique of reconstructing data so that they include not only their own attributes but also other attributes has achieved good results in the field of image enhancement [21]. In [21], the original data set is purposely reconstructed with new features contained in another data set by using a variational autoencoder generative adversarial network (VAEGAN), which increases the attribute similarity between the data set of the source domain and that of the target domain. However, this technique has not yet been used in LSP.
Therefore, in order to improve the performance of an LSP model in sample-scarce areas, the TLA strategy is proposed in this paper. It consists of two steps: (1) reconstructing the data with similar attributes of landslide-influencing factors in two areas on the basis of the VAEGAN and (2) first pretraining the LSP model with the reconstructed data and then fine-tuning it with limited samples from the sample-scarce area. In other words, the TLA strategy transfers the learned representation from the existing landslide inventory data to a sample-scarce area and increases the compatibility of the models. In summary, the contributions of this work are as follows:
1. The transferability is investigated by applying the proposed TLA strategy in two study areas.
2. Three deep-learning frameworks (CNNs, GRUs and BiLSTM) are selected as the feature extractor of the VAEGAN to assess the attribute similarities of autoencoded (reconstructed) data between the source domain and the target domain.
3. The Bayesian optimization algorithm is used to obtain the best hyperparameters and training options for three LSP models (CNNs, GRUs and BiLSTM).

Study Area and Landslide Inventory
This paper includes three study areas in China: the first spans Luoding County and Xinyi County (LX) in Guangdong Province (Figure 1a); the second is in Guigang County (GG) in Guangxi Province (Figure 1b); and the third is in Zigui County (ZG) in Hubei Province. A landslide survey is an indispensable procedure for compiling data statistics and understanding the spatial distribution of landslides [22]. The landslide inventories of the study areas were initially obtained from an archive database in 2012 and were then supplemented through several field surveys and geoinformatics (ArcGIS) analysis of Google Earth images (historical images from 2010 to 2016). The description of the landslides in the study areas can be found in Table A1 of Appendix A. Figure 1 shows the landslide distribution of the first two study areas. The inventory contains 484 landslide locations in LX and 88 landslide locations in GG. These historical landslide locations were identified by using expert knowledge and onsite investigation, as documented in the report by the China Geological Environment Monitoring Institute [23].
According to Reichenbach et al. [1], 596 factors were used to assess landslide susceptibility from 1983 to 2016, and the average number of factors used in each model is nine. In addition, the selected landslide-influencing factors should be measurable, operable, uneven, complete and nonredundant [24]. Some studies [25][26][27] have shown that using between 4 and 12 factors is suitable for LSP. In this paper, 14 landslide-influencing factors (Table 1) are considered, which can be classified as seven topography factors (altitude, aspect, plan curvature, profile curvature, surface roughness, topographic wetness index and slope), five environmental factors (normalized difference vegetation index, land use, rainfall, distance to rivers and distance to roads) and two geological factors (lithology and distance to faults). The thematic maps of these landslide-influencing factors of LX and GG are shown in Figures 2 and 3, respectively.

Table 1. Landslide-influencing factors.
Topography
Altitude: Slope is closely related to the local altitude, so altitude is one of the factors that influence landslides [29]. Data source: the digital elevation model (DEM) of the study area, with a 30 m resolution, can be downloaded from http://www.gscloud.cn/home (accessed on 23 June 2022).

Slope angle: This directly affects slope stability and has been widely used in landslide sensitivity analysis [9]. Data source: DEM derivatives.
Aspect: This is related to landslides in that slopes with different orientations are affected differently by precipitation and solar radiation [30]. Data source: DEM derivatives.
Plan curvature: This reflects the rate of change of the aspect along the contour and thus can affect the flow of water across a surface [31]. Data source: DEM derivatives.
Profile curvature: This influences the acceleration and deceleration of flow along the slope and thus provides valuable information about erosion and deposition [32]. Data source: DEM derivatives.
SDS: This is an index reflecting the degree of surface fluctuation and erosion intensity [33]. Data source: $\mathrm{SDS} = 1/\cos(\mathrm{slope})$.
TWI: TWI describes the topographic properties of hydrological processes in that both the slope and the local upslope contributing area are considered [34]. Data source: $\mathrm{TWI} = \ln(\mathrm{SCA}/\tan\beta)$, where SCA is the specific catchment area (m²/m) and $\beta$ is the slope angle (degrees) at the position.

Environmental
Land use: Land use is a key factor in landslides and, through vegetation cover, has an important influence on the stability of slopes [35]. Data source: land-use data are from a study in 2015 [36].
Rainfall: This is a key landslide-inducing factor in that it can affect the shear strength of the slope [37]. Data source: the rainfall (mm/month) raster data in the study area were obtained by applying the inverse-distance weighting interpolation method [38] to rain stations (http://data.cma.cn/, accessed on 5 January 2022) in the vicinity of the study area.
NDVI: This reflects the greenness of an area and may alter the distribution of soil and hydrological processes on slopes [39]. Data source: the NDVI data were obtained from the MOD13Q1 product, which was downloaded from https://search.earthdata.nasa.gov/search (accessed on 3 July 2022) and processed by the MODIS projection tool [40]. To minimize potential atmospheric effects, the NDVI data used in this paper are the average values over the entire year of 2015.
Distance to rivers: Rivers affect slope stability in that they can cut and erode banks, which reshapes the landscape. In addition, fluctuations in the water level greatly affect the groundwater level of the slope [29]. Data source: the data come from the National Geomatics Center of China (http://www.ngcc.cn/ngcc/, accessed on 20 May 2022), and the Euclidean distance tool of ArcGIS was used to obtain the river distance in the study area.
Distance to roads: This is considered to be one of the most important human factors affecting the occurrence of landslides [7]. Data source: the road distance raster data were obtained in the same way as the river distance.

Geological
Lithology: The mechanical and hydrological properties of rock masses (such as permeability and friction angle) differ between lithological units, so this factor can greatly affect slope stability [37]. Data source: geological maps of the study area were obtained from the National 1:200,000 Digital Geological Map (Public Edition) of China [41]. The lithological formations in the study areas are described in Table A2 of Appendix A.
Distance to faults: This has an important influence on the distribution and scale of landslides in the study area [30]. Data source: the data on the locations of faults in the study area were obtained from the National 1:200,000 Digital Geological Map (Public Edition) Spatial Database of China [41], and the Euclidean distance tool of ArcGIS was used to obtain the fault distance in the study area.

In addition to the positive samples, the same number of negative (nonlandslide) samples were randomly collected from the first two study areas, subject to some basic priorities (for example, low slopes) [11,16]. The heat maps of the correlation matrices of the landslide-influencing factors and the output variable for the data sets of the three study areas are shown in Figure A1, which illustrates the correlation between parameters. Following previous studies [42][43][44], the samples in both areas of this study are split into a training set (80%) and a validation set (20%) to compare and validate the performance of each method.
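The two formula-based factors in Table 1, SDS and TWI, can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the function names, the sample values and the clamping of flat cells to a small minimum slope are hypothetical choices made here, not details given in the paper.

```python
import numpy as np

def surface_roughness(slope_deg):
    """SDS = 1 / cos(slope), with the slope given in degrees."""
    slope_rad = np.radians(np.asarray(slope_deg, dtype=float))
    return 1.0 / np.cos(slope_rad)

def topographic_wetness_index(sca, slope_deg, min_slope_deg=0.1):
    """TWI = ln(SCA / tan(beta)).

    Flat cells are clamped to min_slope_deg (an assumption of this sketch)
    so that tan(beta) never becomes zero.
    """
    slope = np.maximum(np.asarray(slope_deg, dtype=float), min_slope_deg)
    beta = np.radians(slope)
    return np.log(np.asarray(sca, dtype=float) / np.tan(beta))

# Hypothetical raster cells: slope angle (degrees) and specific catchment area (m^2/m).
slope = np.array([0.0, 30.0, 45.0, 60.0])
sca = np.array([100.0, 50.0, 20.0, 10.0])

sds = surface_roughness(slope)
twi = topographic_wetness_index(sca, slope)
```

Steeper cells give a larger SDS, while cells with a large catchment area and a gentle slope give a larger TWI.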

Overview
This paper achieves transfer learning with attributes to improve model performance in sample-scarce areas by using a VAEGAN. The method is shown schematically in Figure 4. First, a VAEGAN is trained on the landslide-influencing factors (positive and negative samples). Second, the factors with the landslide-related attributes of the two areas (LX and GG) are extracted and reconstructed by the VAE, as specified in Section 3.4. Finally, the LSP models are established with a CNN, and the transferability of the models is tested. The thematic maps of the landslide-influencing factors and the samples were produced in ArcGIS 10.8 (Environmental Systems Research Institute, Inc., Redlands, CA, USA), and the deep-learning framework was implemented in MATLAB 2022a (MathWorks Inc., Natick, MA, USA).
Land 2023, 12, x FOR PEER REVIEW

Assessment for Landslide-Influencing Factors
The selection of features is important for the prediction of LSMs [29]. Studies have shown that many factors can be selected [1]. However, redundant features interfere with the recognition ability of a model, reduce its generalizability and increase the operation time [45]. To verify the validity of the selected landslide-influencing factors and to eliminate irrelevant factors so as to improve the predictive ability of the model, the gain ratio (GR) technique [46] is adopted in this paper. When the GR of a factor is less than or equal to zero, the factor is assumed to be irrelevant to landslides and should not be used as an input of the model.
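As an illustration of the GR criterion, the following minimal sketch computes the gain ratio as information gain divided by split information. It assumes that each factor has already been discretized into classes; the function names and the toy data are hypothetical, and the paper's own GR implementation may differ in detail.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(factor_classes, labels):
    """Gain ratio of a discretized factor with respect to the class labels."""
    factor_classes = np.asarray(factor_classes)
    labels = np.asarray(labels)
    values, counts = np.unique(factor_classes, return_counts=True)
    weights = counts / counts.sum()
    # Entropy of the labels that remains after splitting on the factor.
    conditional = sum(
        w * entropy(labels[factor_classes == v]) for v, w in zip(values, weights)
    )
    info_gain = entropy(labels) - conditional
    split_info = float(-np.sum(weights * np.log2(weights)))
    # A factor with a single class carries no information.
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: 1 = landslide, 0 = nonlandslide.
labels = np.array([0, 0, 1, 1])
informative = np.array([0, 0, 1, 1])   # separates the two classes perfectly
irrelevant = np.array([0, 1, 0, 1])    # unrelated to the labels

gr_informative = gain_ratio(informative, labels)
gr_irrelevant = gain_ratio(irrelevant, labels)
```

Under the rule stated above, the second toy factor would be excluded because its GR is not greater than zero.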

Convolutional Neural Network
As a nonlinear tool that can extract key attributes from large amounts of data [47], the CNN is used as both the LSP model and the feature extractor of the VAEGAN in this paper. In LSP, the CNN input is the landslide-influencing factors in vector format, and the output consists of the landslide (positive class) and nonlandslide (negative class) labels [9,11,48]. In the TLA strategy, the feature extractor of the VAEGAN is a CNN without the classification layer. Furthermore, to avoid numerical problems, all dimensions of the input data for the LSP model are normalized to [0, 1] by using Equation (1) before being fed into the models.

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
where $x$ is the original input data and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the original input data, respectively.
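The min-max normalization of Equation (1) can be sketched per factor (column) as follows. The function name, the sample values and the handling of constant columns are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def min_max_normalize(x):
    """Scale every column of x to [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    # Constant columns would divide by zero; map them to 0 instead (assumption).
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

# Hypothetical samples with two factors, e.g., altitude (m) and slope (degrees).
samples = np.array([[100.0, 10.0],
                    [300.0, 20.0],
                    [500.0, 40.0]])
normalized = min_max_normalize(samples)
```

Each factor is scaled independently, so factors with large numerical ranges (such as altitude) do not dominate factors with small ranges.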

The Training of VAEGANs
As shown in Figure 5, the VAEGAN has two components, a VAE and a GAN, which share the same generator. Unlike regular autoencoders, VAEs learn a probability distribution on the latent space extracted by the encoder from the input, so that the distribution of the outputs of the decoder matches that of the observed data. Next, a latent vector sampled from this distribution is decoded to reconstruct new data. Meanwhile, adding a discriminator to the VAE, which distinguishes whether the data are real or generated, has a positive effect on model performance [21]. The loss of the VAEGAN is defined in terms of the following quantities: $b$ is the minibatch size, $x_i$ is the input data, $\bar{x}_i$ is the reconstructed data decoded from the prior distribution ($z$), and $\mu_i$ and $\sigma_i$ are the mean and standard deviation (prior distribution) of the latent space, respectively. $x_{p,i}$ is the reconstructed data decoded from the normal distribution ($z_p$). $\mathrm{Dis}(x_i)$, $\mathrm{Dis}(\bar{x}_i)$ and $\mathrm{Dis}(x_{p,i})$ are the outputs of the discriminator when the input is the original data $x_i$, the reconstructed data $\bar{x}_i$ and the reconstructed data decoded from the normal distribution $x_{p,i}$, respectively.
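To make the roles of these quantities concrete, the sketch below computes three terms that are commonly combined in VAE-GAN objectives: a KL-divergence regularizer on the latent distribution, a reconstruction error and an adversarial discriminator loss. This is an assumed, generic formulation; the exact terms and weights used in the paper are not reproduced here, and all function names are hypothetical.

```python
import numpy as np

def kl_divergence(mu, sigma):
    """Minibatch mean of KL(N(mu, sigma^2) || N(0, I)); inputs have shape (b, d)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    per_sample = 0.5 * np.sum(mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0, axis=1)
    return float(np.mean(per_sample))

def reconstruction_error(x, x_rec):
    """Minibatch mean of the squared error between inputs and reconstructions."""
    x = np.asarray(x, dtype=float)
    x_rec = np.asarray(x_rec, dtype=float)
    return float(np.mean(np.sum((x - x_rec) ** 2, axis=1)))

def discriminator_loss(dis_real, dis_rec, dis_prior, eps=1e-8):
    """Binary cross-entropy: real data labeled 1, both kinds of generated data labeled 0."""
    dis_real = np.asarray(dis_real, dtype=float)
    dis_rec = np.asarray(dis_rec, dtype=float)
    dis_prior = np.asarray(dis_prior, dtype=float)
    terms = (np.log(dis_real + eps)
             + np.log(1.0 - dis_rec + eps)
             + np.log(1.0 - dis_prior + eps))
    return float(-np.mean(terms))

# Hypothetical minibatch: b = 4 samples, 14 normalized factors, 3 latent dimensions.
b = 4
x = np.full((b, 14), 0.5)
x_rec = x.copy()
kl_value = kl_divergence(np.zeros((b, 3)), np.ones((b, 3)))
rec_value = reconstruction_error(x, x_rec)
dis_value = discriminator_loss(np.full(b, 0.5), np.full(b, 0.5), np.full(b, 0.5))
```

In this toy minibatch the latent distribution already equals the standard normal and the reconstruction is perfect, so only the discriminator term is nonzero; an undecided discriminator (all outputs 0.5) yields a loss of 3 ln 2.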

Data Reconstruction and Transfer Learning with Attributes
Reconstructing data with attributes endows the data with specified knowledge, thereby increasing the commonality and representativeness of the samples [20]. Specifically, a VAEGAN is pretrained by using the data sets of LX and GG, and the latent vector representations of each data set are then extracted by the pretrained VAEGAN. That is, for each attribute, the mean vectors of the latent space of the landslide-influencing factors in each area are computed. Next, the attributes that exist only in one study area (e.g., GG) are obtained by computing the difference between the mean vectors (operation [a] in Figure 6). Finally, the data containing the attributes of the two areas are obtained by reconstructing data from the vector that integrates the attribute vector (with attributes) and the latent space of LX (without attributes) (operation [b] in Figure 6). The reconstructed data contain the attributes of the landslide-influencing factors in both GG and LX, which lets the LSP model learn attributes beyond a single study area. The TLA is then achieved by fine-tuning the model pretrained on the reconstructed data with the labeled samples of GG (Figure 7).
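Operations [a] and [b] amount to simple vector arithmetic in the latent space, which the following sketch illustrates. Random vectors stand in for the encoder outputs, the latent dimension of 8 and the `strength` parameter are hypothetical, and the decoding step performed by the trained VAEGAN is omitted.

```python
import numpy as np

def attribute_vector(z_target, z_source):
    """Operation [a]: difference between the mean latent vectors of two areas."""
    return np.mean(z_target, axis=0) - np.mean(z_source, axis=0)

def add_attributes(z_source, attr, strength=1.0):
    """Operation [b]: shift every source latent vector by the attribute vector."""
    return np.asarray(z_source, dtype=float) + strength * attr

rng = np.random.default_rng(0)
z_lx = rng.normal(0.0, 1.0, size=(200, 8))  # stand-in for encoded LX samples
z_gg = rng.normal(0.5, 1.0, size=(40, 8))   # stand-in for encoded GG samples

attr = attribute_vector(z_gg, z_lx)
z_lx_with_attr = add_attributes(z_lx, attr)
# A trained decoder would map z_lx_with_attr back to the 14 factors (not shown).
```

After the shift, the LX latent vectors keep their individual variation but share the mean of the GG latent vectors, which is the sense in which the reconstructed data carry the attributes of both areas.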

Evaluators of Model Performance
To investigate the effectiveness of the TLA strategy, five measures are introduced: accuracy, recall, precision, the F1-score and the area under the receiver-operating characteristic (AUROC). The first four are calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively. In addition, the landslide-frequency ratio (FR) can be used to assess the model performance even if the landslide susceptibility zones in the LSM vary [29]. The FR is mathematically expressed as:
$$FR_i = \frac{LA_i / TLA}{A_i / TA}$$
where the landslide area of each susceptibility zone is $LA_i$, the total landslide area in the study area is $TLA$, the area of each susceptibility zone is $A_i$ and the total area of the study area is $TA$. The FR index also considers the relationship between the susceptibility zones of different grades and the landslide area, which indicates how reasonably the model predicts the susceptibility zones.
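The count-based evaluators and the FR can be computed as in the sketch below, which uses the standard confusion-matrix definitions. The function names and the example counts and areas are hypothetical; AUROC is omitted because it requires the full set of predicted probabilities.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def frequency_ratio(zone_landslide_area, total_landslide_area, zone_area, total_area):
    """FR of one susceptibility zone: share of landslide area over share of total area."""
    return (zone_landslide_area / total_landslide_area) / (zone_area / total_area)

# Hypothetical validation result: 40 TP, 45 TN, 5 FP, 10 FN.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)

# Hypothetical zone holding 30% of the landslide area on 10% of the study area.
fr_very_high = frequency_ratio(30.0, 100.0, 10.0, 100.0)
```

An FR above 1 means that landslides are over-represented in the zone relative to its size, so a reasonable LSM should show FR increasing from the very low to the very high susceptibility zone.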

Importance Analysis of Factors
The influencing degree of factors can be reflected by the GR, and the result is shown in Figure 8.
In LX, the topography factors (e.g., SDS and slope) are more important for landslides than the geological factors (e.g., distance to faults), hydrological factors (e.g., TWI) and environmental factors (e.g., rainfall and land use).
In GG, the topography factors (e.g., plan and profile curvature) are likewise more important for landslides than the environmental factors (e.g., rainfall and land use), geological factors (e.g., distance to faults) and hydrological factors (e.g., TWI).
In general, the degree of influence of these landslide-influencing factors differs between the two areas, but the topography factors are the most important in both. For example, SDS had the highest impact on landslides in LX (GR = 0.113), and plan curvature had the highest impact in GG (GR = 0.056). Thus, all landslide-influencing factors are considered to have a positive impact on landslides.

Evaluation of Supervised Learning
The samples of each study area are randomly divided into 80% and 20% for the training and the validation of the CNN models, respectively. According to Bayesian optimization [49], the best CNN architecture features three convolutional layers, no pooling layer and a piecewise decaying learning-rate strategy (Table 2 and Figure 9). The absence of pooling is reasonable because the function of a pooling layer is to extract key features from a large amount of information; for LSP, where the input dimension is small, adding pooling layers may discard key features and degrade performance instead of improving it. For the learning rate, piecewise decay performs better than a constant rate, which indicates that a small step size in the later stage of training helps the search reach a smaller value of the loss function.
Table 3 lists the prediction results of the CNNs with the best hyperparameters. The model performs better when it is trained with the LX data set; in AUROC in particular, the model trained with LX outperforms the one trained with GG by about 6%. This is because there are sufficient samples in LX. Figure 10 shows the LSMs of the two study areas. The natural break classification [7] was used to classify the landslide susceptibility indices (probabilities) as very low, low, moderate, high and very high.
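The piecewise decaying learning-rate strategy mentioned above can be illustrated with a minimal sketch. The initial rate, drop factor and drop period used here are hypothetical placeholders, not the tuned values of Table 2, and the function name is introduced only for illustration.

```python
def piecewise_decay(epoch, initial_lr=1e-3, drop_factor=0.5, drop_period=10):
    """Learning rate at a given epoch under a step-wise (piecewise) decay.

    All default values are illustrative assumptions, not the paper's settings.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    n_drops = epoch // drop_period          # number of completed decay steps
    return initial_lr * drop_factor ** n_drops


# The rate stays constant within a period and shrinks between periods,
# so later iterations take smaller steps on the loss surface.
schedule = [piecewise_decay(e) for e in (0, 9, 10, 25)]
```

A constant schedule corresponds to `drop_factor=1.0`, which makes the two strategies compared in the text easy to switch between.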

Influence of Transfer Learning with Attribute Strategy on Model Performance
To evaluate the influence of the TLA on model performance, experiments were carried out, and the results are listed in Table 4. The training and testing data sets of each experiment are given on the left, and the mean value of the evaluators (accuracy, AUROC, recall, precision, F1-score) is on the right. No transfer technique is applied in Experiment A; in Experiment B, the GG model is pretrained on the LX data set and fine-tuned with GG-labeled samples. In Experiment C, the GG model is pretrained on the data set with attributes (LX), which is reconstructed by the VAEGAN, and then fine-tuned with GG-labeled samples.
Experiments A and B show that the improvement in model performance from the TL is negligible. Figure 11a shows more clearly that the AUROC actually decreased when the TL strategy was applied. Moreover, in the early stage of training (Figure 11b), the loss of the TL strategy is larger than that of the SL. These results indicate that the sample attributes of the two study areas are quite different and that the model pretrained on LX is not suitable for GG. However, in Experiment C, the model performance is significantly improved by pretraining the model with the data reconstructed by a VAEGAN with a CNN feature extractor. The TLA strategy increased the mean value of evaluators by about 7% compared with the SL and the TL. Meanwhile, in GG, the converged loss of the TLA is lower than that of the TL and the SL, which indicates that the reconstructed data (LX) contain representative attributes of both areas and that the performance of the LSP models in sample-scarce areas can be improved by using the TLA.
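The "mean value of evaluators" used to compare the experiments can be made concrete with a short sketch. It assumes binary labels (1 = landslide), a 0.5 decision threshold and the rank-based (Mann-Whitney) definition of AUROC; the function names and the toy data are hypothetical and do not come from Table 4.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_evaluator(y_true, scores, threshold=0.5):
    """Accuracy, AUROC, recall, precision, F1-score and their arithmetic mean."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    metrics = {
        "accuracy": (tp + tn) / len(y_true),
        "auroc": auroc(y_true, scores),
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }
    metrics["mean"] = sum(metrics.values()) / 5  # mean of the five evaluators
    return metrics


# Toy example with made-up scores, only to show the call.
result = mean_evaluator([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Because the mean mixes threshold-dependent scores (recall, F1-score) with the threshold-free AUROC, it is worth inspecting the individual entries as well as the mean.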

Discussion
According to the above results, the prediction accuracy of the LSP models was improved by using the TLA strategy to increase the attribute similarity of the data sets. To further explore the transferability of the TLA, experiments were conducted with different combinations of the LSP models and the feature extractors in the VAEGAN. Additionally, another study area was employed to assess the transferability of the TLA strategy, as shown in the following discussion.

Comparison of LSP Model
The improvement in model performance in the previous section supports the effectiveness of the CNN and the TLA strategy in establishing LSP models with limited samples. However, different deep-learning models have different effects on the evaluators in LSP [50]. Therefore, to validate the CNN framework proposed in this paper, the GRU model and the BiLSTM model are added. The hyperparameters and training parameters of these models are obtained by using the Bayesian optimization algorithm.
Table 5 shows the results of LX and GG in supervised learning. All the performance values of the models decreased in the sample-scarce area, especially those of the GRU. In the comparisons, the BiLSTM model achieves excellent results in recall and F1-score. Additionally, the CNN is better than the others in AUROC, accuracy and precision, which makes the CNN the best in the mean value of the evaluators.

Evaluation and Comparison of the Model Transferability
To further explore the performance of the model with the TLA strategy, this paper proposes two additional models, GRU-VAEGAN and BiLSTM-VAEGAN. In operation [a] in Figure 5, the feature extractor in the VAEGAN is changed from a CNN to a GRU (Experiment D) and a BiLSTM (Experiment E), respectively. Next, five groups of experiments were conducted.
Table 6 and Figure 12 show the mean values of the above evaluators and the ROC curves. When the TLA techniques (in Experiments C, D and E) were used, the performance values of the models improved, especially for the CNN. The best ROC curve of the TLA reaches the highest value, 0.844, compared with the SL (0.771) and the TL (0.772).
The loss during training is one of the evaluators reflecting the quality of the data set [19]. The training results are shown in Figure 13. For the LSP models, although the GRU converges the fastest, the CNN model has the lowest converged loss. Additionally, the BiLSTM is less stable than the others. For transferability (e.g., Figure 13a), the loss is much lower in the early training stage when the TLA strategy is used, especially when a CNN is used as the VAEGAN feature extractor. This indicates that the TLA technique is effective.
Figure 14 shows the LSMs in GG predicted by the models. In Figure 14h,k,n, most grids in the study area are classified as in a very high susceptibility zone, which is inappropriate. This indicates that the GRU is underfitting when the training samples are inadequate. The model performance seems to be improved by the TLA strategy. In fact, most samples of the testing data set are predicted as "landslide", which increases the F1-score and the mean value of evaluators. However, the AUROC and accuracy are poor. This conclusion can also be supported from another aspect. For the data set of the target domain (GG), when the data are reconstructed by a GRU-VAEGAN, the reconstructed data contain little attribute similarity and can even be regarded as noise, which decreases the effect of the TLA strategy because of data-set bias. This indicates that the GRU is not a suitable feature extractor for a VAEGAN. In Figure 13a, in the case TLA (GRU, CNN), the loss at the beginning is higher than in the other cases, indicating that the attribute similarity between the data set used to pretrain the model and that of the target domain is minor. The loss nevertheless converges to a lower value because of the strong fitting ability of the CNN.
The LSMs of well-trained models should have higher landslide-density values (frequency ratios, FRs) in very high susceptibility zones than in other zones [29]. To qualitatively evaluate the model performance, the FR method was used; the FR values in the high and very high susceptibility zones of each situation are shown in Figure 15. For the LSP models, the high and very high landslide susceptibility zones predicted by the CNN contained 75% of the historical landslides but accounted for only about 30% of the total area, reaching the highest FR value among the CNN, GRU and BiLSTM. For transferability, Figure 15b-d shows the FRs when the LSP model is the CNN, GRU and BiLSTM, respectively. It can be concluded that, compared with the SL and the TL, the TLA strategy can significantly improve the performance of an LSP model.
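The FR comparison can be illustrated with a small sketch. It assumes the commonly used definition of the frequency ratio, namely the share of historical landslides falling in a zone divided by the share of the study area covered by that zone; the function name and the counts are hypothetical and are not taken from Figure 15.

```python
def frequency_ratio(landslides_in_zone, total_landslides, cells_in_zone, total_cells):
    """Frequency ratio of a susceptibility zone.

    FR = (landslides in zone / all landslides) / (zone area / total area).
    Values above 1 mean landslides are over-represented in the zone.
    """
    if total_landslides <= 0 or total_cells <= 0 or cells_in_zone <= 0:
        raise ValueError("totals and zone area must be positive")
    landslide_share = landslides_in_zone / total_landslides
    area_share = cells_in_zone / total_cells
    return landslide_share / area_share


# Hypothetical counts reproducing the proportions quoted in the text:
# 75% of landslides inside zones covering about 30% of the area.
fr_high_zones = frequency_ratio(75, 100, 30, 100)
```

With these proportions the FR is about 2.5. A map that places most grids in the very high class (as in Figure 14h,k,n) pushes the area share toward 1 and the FR toward 1, which is why the FR penalizes such maps even when recall is high.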

Application of Transfer Learning with Attribute in Other Study Area
To further investigate the transferability of the TLA strategy between areas with large differences, Zigui County (ZG), Hubei Province, China, was added. Zigui County is located in the Three Gorges Reservoir Area (TGRA) of the Yangtze River Basin. Data on 409 historical landslides, as shown in Figure 16, were obtained from the landslide inventory. The thematic maps of Zigui County can be found in Figure A2 of Appendix A.
The performed experiments are consistent with the previous ones, and the results are shown in Tables 7 and 8. The AUROC is shown in Figure 17. In the SL, the CNN achieves the best results in the evaluators. However, the CNN is also more sensitive than the others to whether the data set carries the attributes. In Table 8, when the TL was applied, only the CNN model performance values decrease. This is because the distance between the two places is large, so the attributes of the landslide-influencing factors are different (e.g., lithology and rainfall). Applying the TL strategy between data sets with large differences reduces the model performance because the differences act as noise during model training. However, the reconstructed data contain the attributes of both study areas; thus, the model performance values increase when the TLA strategy is applied.
Figure 18 shows the training process of the models. As before, the BiLSTM takes longer to converge and is less stable than the CNN and GRU. Furthermore, when the TLA technique is used, the training process is more efficient and yields a lower converged loss.
The LSMs of ZG predicted by the models are shown in Figure A3 of Appendix A. The LSMs predicted by the GRU and BiLSTM contain more high and very high susceptibility zones than the one predicted by the CNN. Figure 19 shows the FRs in different situations, which helps to evaluate more clearly the rationality of these LSMs and the transferability of the two techniques (TL and TLA). The model performance is improved by using the TLA, especially by pretraining the LSP model (the CNN) with the data reconstructed by the CNN-VAEGAN, indicating that the TLA strategy has strong compatibility regardless of the large distance between ZG and LX.
The evaluation of the models showed that the proposed TLA strategy can improve the performance of the LSP model in both GG and ZG. When the TLA strategy was applied, the evaluators of all the models improved. Moreover, the choice of feature extractor in the VAEGAN significantly affected the transferability in LSP. When the feature extractor of the VAEGAN is a CNN, the reconstructed data contain more similar attributes of both study areas and improve the transferability, which facilitates efficient and reliable prediction in the sample-scarce area.

Findings and Limitations of This Study
At present, there are also some outstanding studies that have contributed to improving model performance in landslide susceptibility mapping [51], building damage assessment after earthquakes [52] and flood assessment [53] by applying the TL strategy. These studies complete the TL strategy either by directly gathering knowledge from previous, similar situations (known as case-based reasoning) or by selecting data from a source area that has a similar distribution to the target area (known as domain adaptation). In contrast, this study proposes the TLA strategy, which increases the attribute similarity between the source and target domain data sets. The TLA strategy achieves this by reconstructing the data of the source area according to the attributes of both the source area and the target area. Compared with case-based reasoning, the TLA strategy improves model performance in sample-scarce areas and achieves better prediction results. The goals of both domain adaptation and the TLA are to enhance model performance by using samples that are more similar to the target domain distribution. However, domain adaptation focuses on selecting similarly distributed data from the source domain, while the proposed TLA strategy reconstructs the source domain samples by using the VAEGAN to generate data with attributes similar to those of the target domain. The comparison of these two strategies in sample-scarce areas deserves further investigation in future research.
To further explore the effect of sample size on the TLA strategy, several experiments were conducted in the GG and ZG study areas.
The GG study area was assumed to have only 11, 22, 33, 44, 55, 66, 77 and 88 samples in turn. In each case, the samples were randomly selected from the landslide inventory, and the training and test data sets were divided in the ratio of 4:1. As shown in Figure 20, the model performance values improve as the number of training samples increases. Additionally, with 22 samples, the LSP model performed best when the TLA strategy was applied, compared to the SL and the TL. Similarly, the ZG study area was assumed to have only 51, 102, 153, 204, 255, 306, 357 and 409 samples in turn. As shown in Figure 21, the model performance was lower when the TL strategy was applied with a sample size of 102, compared to the SL. However, the model performance reached 0.809 in the mean value of evaluators by applying the TLA strategy, which exceeded the performance of the SL (0.796) and TL (0.734) strategies.
This study found that the transferability of the TLA is also affected by the distance between the study areas. Compared to GG, the application of the TL strategy in ZG is less effective than the SL owing to the increase in the distance between the source and target domains, which results in a greater difference in the attribute similarity of the landslide-influencing factors. The smaller the distance between the study areas (e.g., LX and GG), the more obviously the TLA strategy can improve the performance of the model. Although the increase in the distance reduces the similarity between the data sets (e.g., LX and ZG), the TLA strategy can still improve the performance of the model to some extent, which demonstrates the robustness of the proposed TLA strategy.
In summary, this parametric study shows that to obtain a reasonable landslide spatial model by using the TLA strategy, at least 55 and 102 properly selected sample points are required for the GG and ZG test data sets, respectively. These sample points resulted in mean evaluator values of 0.762 and 0.809 for the GG and ZG test data sets, respectively.
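The sample-size experiments can be sketched as follows. This is a minimal illustration that assumes simple random selection without replacement and a rounded 4:1 split; how the original experiments rounded the split or seeded the random selection is not stated, so the function name, the seed and the integer stand-ins for inventory records are all hypothetical.

```python
import random


def subsample_and_split(samples, n, train_ratio=0.8, seed=0):
    """Randomly pick n samples, then split them 4:1 into training and test sets."""
    if n > len(samples):
        raise ValueError("n exceeds the number of available samples")
    rng = random.Random(seed)              # fixed seed only for repeatability
    chosen = rng.sample(samples, n)        # selection without replacement
    n_train = round(n * train_ratio)       # rounding rule is an assumption
    return chosen[:n_train], chosen[n_train:]


# Integer IDs stand in for landslide inventory records of a GG-sized area.
inventory = list(range(88))
splits = {n: subsample_and_split(inventory, n) for n in (11, 22, 33, 44, 55, 66, 77, 88)}
```

Each entry of `splits` would then feed one SL, TL or TLA run, so that the three strategies are compared on identical subsets at every sample size.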
Meanwhile, the proposed TLA also has the following limitations:
1. As the landslide prediction model in this study is limited to a single deep-learning framework, hybrid methods (e.g., hybrid deep-learning frameworks and hybrid deep-learning and machine-learning frameworks) are worth trying in order to improve the reliability and accuracy of LSMs.
2. The landslide range and spatial information are not considered: the landslide inventory in this paper consists of single points, which limits the input of the LSP models to a 1D sequence format. Future research can focus on incorporating information from remote-sensing images and exploring the feature-processing ability of CNNs on high-dimensional (landslide pixel spatial) data.

Conclusions
For the first time, the present study proposed a TLA strategy based on VAEGAN models (CNN-VAEGAN, GRU-VAEGAN and BiLSTM-VAEGAN) for LSP, which can facilitate and expedite the training process with limited landslide samples. The main conclusions are as follows:
1. The CNN framework was not only an excellent choice for the LSP model but also a worthwhile choice as the feature extractor of the VAEGAN in the TLA strategy.

2. For LSP under the SL strategy, the CNN was more reliable than the BiLSTM and GRU, achieving the best mean value of evaluators (AUROC, accuracy, precision, recall, F1-score and FR) in the three study areas.

3. In terms of transferability, the TLA strategy developed in this research improved the performance of landslide prediction models in sample-scarce areas and surpassed TL, reflecting the practicability and advantage of the method proposed in this paper.
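The FR evaluator listed in conclusion 2 can be illustrated with a small calculation. The sketch below assumes that FR denotes the frequency ratio in its common form, i.e., the share of landslides falling in a susceptibility zone divided by the share of the study area occupied by that zone. The exact definition used for the LSMs is not restated in this section, and the function name `frequency_ratio` and all counts are hypothetical.

```python
def frequency_ratio(landslides_in_zone, total_landslides, cells_in_zone, total_cells):
    """Share of landslides in a zone divided by the share of area (grid cells) in that zone."""
    if total_landslides <= 0 or total_cells <= 0 or cells_in_zone <= 0:
        raise ValueError("totals and zone area must be positive")
    landslide_share = landslides_in_zone / total_landslides
    area_share = cells_in_zone / total_cells
    return landslide_share / area_share


# Hypothetical counts per susceptibility zone: (landslides, grid cells).
zones = {
    "very low": (2, 4000),
    "low": (3, 2500),
    "moderate": (5, 1500),
    "high": (15, 1200),
    "very high": (25, 800),
}
total_landslides = sum(n for n, _ in zones.values())
total_cells = sum(c for _, c in zones.values())

fr = {
    name: frequency_ratio(n, total_landslides, c, total_cells)
    for name, (n, c) in zones.items()
}
for name, value in fr.items():
    print(f"{name}: FR = {value:.2f}")
```

Under this definition, an FR above 1 in the high and very high zones means that landslides are concentrated there more than the zone area alone would suggest, which is the behavior expected of a useful susceptibility map.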

Figure 1. Landslide inventory maps of the first two study areas in China: (a) the first one across Luoding County and Xinyi County (LX) of Guangdong Province, (b) the second one in Guigang County (GG) of Guangxi Province, (c) the locations of the study areas in Guangdong Province and Guangxi Province, (d) the locations of the two provinces in China.

Figure 4. Overview of the workflow in this study.

Figure 6. Illustration of generating the data with attributes by a VAE.

Figure 7. Transfer learning and transfer learning with attributes.

Figure 8. Importance of landslide-influencing factors, based on (a) LX and (b) GG.

Figure 9. CNN model selected by Bayesian optimization.

Figure 11. (a) AUROC curves and (b) training efficiency on the validation set of SL, TL and TLA. The symbol "LX→GG" indicates that the LSP model is pretrained on the data set of LX and then fine-tuned on the data set of GG, and LX represents the reconstructed data that contains the attributes of the data set in LX and the data set in the target domain (e.g., GG).

Figure 12. AUROC curves on the testing data set of (a) CNN, (b) GRU and (c) BiLSTM. The pair in brackets, e.g., (CNN, GRU), indicates that the decoder of the VAEGAN is a CNN and the LSM model is a GRU.

Figure 15. FR values of LSMs (GG) in high and very high susceptibility zones: (a) comparison of LSP models, (b) comparison of transferability of CNN, (c) comparison of transferability of GRU, (d) comparison of transferability of BiLSTM.

Figure 19. FR values of LSMs (ZG) in high and very high susceptibility zones: (a) comparison of LSP models, (b) comparison of transferability of CNN, (c) comparison of transferability of GRU, (d) comparison of transferability of BiLSTM.

Figure 20. The mean value of evaluators for different numbers of samples participating in the study area (GG, target area; LX, source area).

Figure 21. The mean value of evaluators for different numbers of samples participating in the study area (ZG, target area; LX, source area).

Figure A1. Heat maps of the correlation matrix of the landslide-influencing factors and one output variable for the data sets in (a) LX, (b) GG and (c) ZG.

Table 1. The information about landslide-influencing factors used in this study [28].

Table 2. The best-selected parameters of CNNs.

Table 3. LSM model performance comparison in LX and GG (supervised learning).
Note: Boldface indicates the best case.

Table 4. Transferring ability comparison of different methods (LX to GG).
Note: the scores in the table represent the mean value of evaluators; boldface indicates the best case.

Table 5. LSM model performance comparison in LX and GG (supervised learning).
Note: Boldface indicates the best case.

Table 6. Comparison of the transferability of different models (LX to GG).
Note: the scores in the table represent the mean value of evaluators; boldface indicates the best case.

Table 7. LSM model performance comparison in ZG (supervised learning).
Note: Boldface indicates the best case.

Table 8. Comparison of the transferability of different models (LX to ZG).
Note: the scores in the table represent the mean value of evaluators; boldface indicates the best case.

Table A2. Cont.
Lower part consists of gravel, sand and gravel interbedded with sandstone, siltstone; upper part consists of quartz-rich gravel, conglomerate with gravel and sand, quartz sandstone, siltstone and siltstone with shale.
Thick layer of limestone, interbedded limestone, shale, sandstone, limestone, carbonate limestone, coal seams, shale-interbedded sandy mudstone and shale.
C1, Ceshui Formation: The main composition is quartz sandstone and fine sandstone, interbedded with black shale and non-smoldering coal beds. In some local areas there are interbedded limestones and mudstones.
Qh, Dawanzhen Formation: Sand and gravel interbedded with clayey sand.
Nh, Liantuo Formation and Nantuo Formation: Mainly grey-white, grey-green, purple-red sandstone and conglomerate, with conglomerate at the base; grey-green, purple-red to conglomeratic rock.
JxQb, Yunkai group (including Fengdongkou Formation, Lankeng Formation and Shawanping Formation): A group of metamorphic rocks containing a complex of metamorphic volcanic rocks, metamorphic iron and phosphate mineral layers.