Landslide Detection from Open Satellite Imagery Using Distant Domain Transfer Learning

Abstract: Using convolutional neural network (CNN) methods and satellite images for landslide identification and classification is an efficient and popular approach in geological hazard investigations. However, traditional CNNs have two disadvantages: (1) insufficient training images from the study area and (2) uneven distribution of the training set and validation set. In this paper, we introduce distant domain transfer learning (DDTL) methods for landslide detection and classification. We are the first to introduce scene classification satellite imagery into the landslide detection task. In addition, to extract information from satellite images more effectively, we add an attention mechanism to DDTL (AM-DDTL). The Longgang study area, a district in Shenzhen City, Guangdong Province, has only 177 samples as the landslide target domain. We examine the effect of DDTL by comparing three methods: the conventional CNN, the pretrained model and DDTL. We also compare different attention mechanisms based on the DDTL. The experimental results show that the DDTL method has better detection performance than the normal CNN, and the AM-DDTL models achieve 94% classification accuracy, which is 7% higher than the conventional DDTL method. The requirements for the detection and classification of potential landslides at different disaster zones can be met by applying the AM-DDTL algorithm, which outperforms traditional CNN methods.


Introduction
With the rapid development of cities, urban construction sites have expanded to low hills, with engineering construction occurring on many unstable slopes [1]. Under extreme conditions, such as seismic shaking and heavy rainfall, these unstable slopes can easily become landslides and cause severe damage to the natural environment, property and personal safety [2,3]. For example, on 7 June 2008, heavy rainfall triggered many landslides, debris flows and floods in the Hong Kong area [4]. As landslides often cause serious destruction to human settlements, roads and agricultural lands, we must identify and manage potential landslide areas and warn citizens about their occurrence in order to ensure their safety [5]. Currently, two methods are used to identify potential landslides: field investigation and indoor remote sensing interpretation [6]. Although the first method is effective and accurate for landslide detection, it requires considerable labor, material and financial resources [7]. For remote sensing interpretation, specialists judge whether a landslide has occurred according to optical images, digital elevation model (DEM) data and other geological information [8]. This method is time-consuming, and its interpretation accuracy may be poor [8,9]. Therefore, a novel method that can automatically recognize landslides must be constructed based on new technologies and new methods [10].
New technologies, such as machine learning, have produced unprecedented opportunities for the prediction, identification and management of geological disasters in the era of big data [11]. Landslide detection is regarded as the study of satellite image recognition and classification [6]. Machine learning methods, such as support vector machines (SVMs) [12,13], artificial neural networks (ANNs) [14] and deep learning, have been extensively used in landslide research [1,15], especially convolutional neural networks (CNNs), which have become a research focus in the field of landslide detection [16,17]. The SVM, ANN, RF and CNN methods have been used for potential landslide detection using optical remote sensing images and topography [18], and the CNN has been considerably improved as the mainstream deep learning method used to extract information from remote sensing images [19]. A residual network based on the CNN was designed for landslide detection [20]. To detect special slope failures, the CNN approach was developed [21]. For images with diverse spectral characteristics, a recognition method based on CNNs was proposed [21]. Fully convolutional networks (FCNs) and patch-based deep convolutional neural networks (DCNNs) were used to classify land cover types [22]. A CNN with a suitable input image size and training strategies is effective for slope failure detection [23]. The new method combining CNN and multilayer has been applied to the classification of very-high-resolution remote sensing imagery and displayed better performance [24]. For land cover classification tasks, the attentive spatial temporal graph CNN can exploit both spatial and temporal dimensions in satellite images and time series data [25]. The attention mechanisms have been gradually introduced into the structure of CNN. They include channel attention, spatial attention and non-local attention; these three kinds of attention mechanisms perform better on super-resolution single images [26]. 
Moreover, a deep neural network equipped with a control gate and feedback attention mechanism can perform pixel-wise classification for very-high-resolution remote sensing images [27]. Combining a CNN and attention mechanism can lead to good performance for landslide detection [6].
Machine learning has two requirements before training a classification model: it needs sufficient training data, and the training data must have the same distribution as the validation data [28,29]. Unfortunately, in this study, the study area, Longgang, China, is unique and has only 177 landslide samples, which is not enough to train a robust classification model. Obtaining labeled training data always requires considerable costs and has limitations [1,28]. Therefore, we introduce the transfer learning algorithm to solve the problem [29,30]. Transfer learning can eliminate these two disadvantages by transferring knowledge from the source domain to the target domain [31].
Transfer learning assumes that the source domain and the target domain are more or less similar [32,33], but for the landslide detection problem, few open landslide datasets are available as the source domain [6,28]. Therefore, we first introduce distant domain transfer learning (DDTL) for remote sensing imagery classification. The DDTL algorithm only needs a small amount of labeled target data and unlabeled source data from completely different domains [34]. In addition, we first apply the scene classification dataset to the landslide classification task as the source domain. Then, we creatively integrate the attention module into the DDTL framework because the attention mechanism can extract the key information and suppress redundant, useless information, thus improving the results of the model [35,36].

Study Area and Dataset
The study covers the whole Longgang district in the city of Shenzhen, Guangdong Province, China, with an area of 388.22 km², as shown in Figure 1. The topography of Longgang District is complex and dominated by low mountains and hills. The central part of this region is a low-lying alluvial plain, which easily catches water. In the subtropical ocean monsoon zone, the average annual rainfall of this region is 1935.8 mm, and landslides often occur, resulting from heavy rainfall due to strong typhoons [16,37]. The stability of the slopes in this study area is poor because the thick, weathered soil layer has low strength within the range of influence of the fault zone [38]. The fault zone of Longgang District is shown in Figure 2. Urban construction sites have expanded to low hills and platform areas, the original shape of the hillsides has changed, and the stability of the mountains has been degraded by intense human engineering activities. Under severe conditions, such as concentrated rainfall or earthquakes, the reduction in the mechanical strength of the soil can easily cause landslides and other geological disasters [12,39]. For example, on 20 December 2015, a landslide on an unstable artificial slope composed of construction waste destroyed almost all the buildings along the flow path [40,41].
Due to the natural environment and engineering activities, many landslides result from unstable slopes in this region [1], as shown in Figure 3. In the dataset used in this study, 177 landslides, including collapses, a few mudslides and mainly unstable slopes, were interpreted by experts with satellite images and other geological information. Subsequently, we selected 431 non-landslide regions, including supported stable slopes and some roads, rivers, farmland and residential areas. Compared to other landslide datasets, the Longgang landslide dataset is more complicated and unique because it contains many supported stable slopes, which are difficult for computer vision to recognize. The landslides in this region are mainly caused by artificially unstable slopes with a certain space geometry and certain texture features [42]. In addition, satellite images contain much useless information because roads and houses are situated near unstable slopes in low hill areas.

The Source Domain Datasets
In this section, we introduce the source domains used in the distant domain transfer learning algorithm, as shown in Table 1. The primary source domain is the Bijie landslide dataset [6]. The auxiliary source domains are the UC Merced land use dataset [31,43], the Google image datasets of SIRI-WHU and WHU-RS19 [43-45] and the NWPU-RESISC45 dataset [46].
The Bijie landslide dataset, for the city of Bijie, Guizhou Province, consists of 770 landslide samples and 2003 negative samples. The dataset is visually similar to the Longgang target domain. The landslide images mainly include rock falls and rock slides, while the environmental remote sensing images consisting of mountains, villages, roads, rivers, forests and agricultural land were chosen manually [6]. To the best of our knowledge, this dataset is the only open, accurate and large remote sensing landslide dataset. It was created to promote automatic landslide detection studies using optical remote sensing images, although the features of landslides differ between regions.
The UC Merced land use dataset has 21 classes of land use images meant for research purposes, and each class has 100 images. The images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the United States. The pixel resolution of this public domain imagery is 1 foot. Each image measures 256 × 256 pixels [31,43].
The SIRI-WHU dataset contains 12 classes with 200 images each, and each image measures 200 × 200 pixels, with a 2 m spatial resolution. This dataset was acquired from Google Earth (Google Inc.) and mainly covers urban areas in China; the scene image dataset was designed by the RS_IDEA Group at Wuhan University (SIRI-WHU) [43-45].
The WHU-RS19 dataset contains 12 categories of physical scenes in satellite imagery collected from Google Earth. The dataset consists of airports, bridges, rivers, forests, meadows, ponds, parking, ports, viaducts, residential areas, industrial areas and commercial areas, with 50 samples for each class. This dataset has a total of 1005 images [47].
The NWPU-RESISC45 dataset, consisting of 31,500 images, has 45 scene classes with 700 images in each class, created by Northwestern Polytechnical University (NWPU) for remote sensing image classification. Similar to other remote sensing datasets, the 45 classes of this dataset include the same scene images, such as airplanes, airports and beaches [46]. The remote sensing images of the Longgang landslide dataset and the five source domain datasets are shown in Figure 4.

Methods
In this study, we applied the distant domain transfer learning algorithm for Longgang landslide detection. Moreover, we combined the DDTL and the attention module to refine the feature maps and improve the performance of target classification. In this section, we first introduce the image enhancement approach; then, we briefly review the theory of distant domain transfer learning and attention modules, and we subsequently introduce the improved attention module. For comparison, we also introduce the traditional CNN model and the similar pretrained transfer learning model.

Image Enhancement
Optical remote sensing images are carriers of natural information, and only interpreting the natural information of these images can realize their application value. However, for some blurry and shadowed remote sensing images, it is necessary to enhance their quality. The technology of image enhancement, using a series of algorithms, enhances the useful information and improves the visual quality in order to meet the requirements of computer vision [48].
Improving the quality of images using image enhancement techniques is crucial for preparing image datasets. To apply deep learning methods for image classification, it is necessary to improve the remote sensing image quality in order to obtain good training outcomes. Low-illumination optical remote sensing images, because of uneven light or other reasons, have poor visual effects, and the differences in the image features are low, so they cannot meet the requirements of image classification and recognition [49].
Among the Longgang landslide images are many low-light images from Google Earth, so, in this paper, we introduce the image contrast enhancement algorithm to process the images in order to improve the remote sensing image quality for computer vision systems [50].
To solve the enhancement problem, the contrast enhancement algorithm fuses the input image P with a processed version of the image under another exposure, which reduces the complexity [50]. The fused image is calculated as follows:
$$R^c = W \circ P^c + (1 - W) \circ g(P^c, k), \tag{1}$$
where W is the weight matrix, which can enhance the low contrast of underexposed regions and preserve the well-exposed regions; c is the index of the three color channels; R is the enhanced result; and $\circ$ denotes element-wise multiplication. The weight matrix is calculated as follows:
$$W = T^{\mu}, \tag{2}$$
where T is the scene illumination map solved by optimization, and µ is a parameter controlling the enhancement degree. More details about the optimization problem are provided in [50].
In the algorithm, because no information about the picture is available, we use the fixed parameters (a = −0.3293, b = 1.1258) to calculate $g(P^c, k)$ [50,51]. The formula is as follows:
$$g(P^c, k) = e^{b(1 - k^a)} (P^c)^{k^a}. \tag{3}$$
Finally, the parameter k is calculated by maximizing the image entropy of the enhanced brightness as follows:
$$k = \arg\max_k H(g(B, k)), \tag{4}$$
$$H(B) = -\sum_{i=1}^{N} p_i \log_2 p_i, \tag{5}$$
where B is the brightness component, $p_i$ is the i-th bin of the histogram of B, and N is often set to 256. To improve the calculation efficiency, we resize the input image to 50 × 50 when optimizing k.
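The fusion procedure described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation of [50]: the illumination map T is approximated by the per-pixel maximum over the color channels instead of being solved by optimization, k is found by a simple grid search, and the function names, the value of µ and the search range for k are hypothetical choices made for this example.

```python
import numpy as np

A, B_PARAM = -0.3293, 1.1258  # fixed parameters a and b quoted in the text

def brightness_transform(p, k):
    """g(P, k) = exp(b * (1 - k**a)) * P ** (k**a), clipped to [0, 1]."""
    gamma = k ** A
    beta = np.exp(B_PARAM * (1.0 - gamma))
    return np.clip(beta * p ** gamma, 0.0, 1.0)

def entropy(b, n_bins=256):
    """Shannon entropy (in bits) of the histogram of b, values in [0, 1]."""
    hist, _ = np.histogram(b, bins=n_bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def best_exposure_ratio(brightness, candidates=np.linspace(1.0, 7.0, 61)):
    """Grid search (an assumption) for the k maximizing the entropy of g(B, k)."""
    scores = [entropy(brightness_transform(brightness, k)) for k in candidates]
    return float(candidates[int(np.argmax(scores))])

def enhance(image, mu=0.5):
    """Fuse an (H, W, 3) image in [0, 1] with a brighter synthetic exposure."""
    # ASSUMPTION: crude illumination map = per-pixel maximum over channels;
    # the cited method refines T by solving an optimization problem.
    t = np.clip(image.max(axis=2), 1e-3, 1.0)
    w = (t ** mu)[..., None]                    # weight matrix W = T ** mu
    step = max(1, min(image.shape[:2]) // 50)   # subsample to roughly 50 x 50
    k = best_exposure_ratio(t[::step, ::step])
    return w * image + (1.0 - w) * brightness_transform(image, k)

# Toy low-light image: uniform noise confined to the dark range [0, 0.3].
rng = np.random.default_rng(0)
dark = rng.random((64, 64, 3)) * 0.3
enhanced = enhance(dark)
```

Because the weight W is small where the illumination is low, dark pixels are taken mostly from the brightened exposure, while well-lit pixels stay close to the input.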

Distant Domain Transfer Learning
Similar to traditional machine learning, transfer learning has gradually emerged as a focus of research and will become popular in the information field because of its unique advantages [52]. Regarding the scarcity of research landslide images, transfer learning can improve the model performance by transferring knowledge from source domain data [53]. For traditional transfer learning, the source domain and the target domain should be similar or have a certain connection [54]. In this way, the dependence on constructing the target model with much labeled target data can be reduced [55,56].
An important problem in transfer learning that has not been well studied is the fact that conventional transfer learning algorithms assume that the source domain and the target domain should be similar. For the landslide detection task of Longgang District, there are no similar open landslide datasets that can be used [57]. Therefore, we first introduce the novel feature-based distant domain transfer learning algorithm (DDTL), which requires only a small set of labeled target data and unlabeled source data from completely different domains [34,52,58]. The DDTL model contains two main modules, a feature extractor and a target classifier, and the structure of DDTL is shown in Figure 4. Moreover, we innovatively added the attention mechanism to the DDTL (AM-DDTL) and improved the attention model to make it more suitable for remote sensing landslide detection tasks.
In this paper, the AM-DDTL algorithm can solve the Longgang landslide detection problem in which there is no dataset similar to the target domain given the uniqueness of landslides. To the best of our knowledge, this is the first time that AM-DDTL has been applied to the field of landslide detection.
In the AM-DDTL problem, we assume that the labeled data of the Longgang target domain are too scarce to train a robust classification model. The source domain for the Longgang landslide dataset is also insufficient.
In the AM-DDTL algorithm, we select other remote sensing datasets as the source domains, denoted as $S = \{(x_1^1, \cdots, x_1^n), \cdots, (x_{S_N}^1, \cdots, x_{S_N}^n)\}$, where $n$ and $S_N$ represent the number of samples in each source domain and the number of source domains, respectively. Then, we have one labeled target domain denoted as $T = \{(x_{T_1}^1, y_{T_1}^1), \cdots, (x_{T_N}^n, y_{T_N}^n)\}$, where $n$ and $T_N$ represent the number of samples in each target domain and the number of target domains, respectively. Let $P(x)$ and $P(y|x)$ be the marginal and conditional distributions of a dataset, respectively. In the AM-DDTL problem, we have the following equations:
$$P_T(x) \neq P_{S_i}(x), \qquad P_T(y|x) \neq P_{S_i}(y|x), \qquad i = 1, \cdots, S_N.$$
In this section, we introduce the convolutional autoencoder and the attention mechanisms separately; then, we explain how to evaluate the classification results. The framework of the DDTL is shown in Figure 5.

Autoencoder
The autoencoder first transforms the input information into a lower-dimensional representation and then reconstructs the initial input information using the lower-dimensional representation in the decoder part [59]. As an autoencoder trained on an unsupervised model can extract useful features from unlabeled data, deep learning models with autoencoders or unsupervised methods have been widely used in image and natural language processing [60].
The convolutional autoencoder with convolutional layers is suitable for the landslide detection task because the study area has only a few labeled landslide images that can be used. Therefore, the convolutional autoencoder can be used for feature extraction for the target landslide images and the source domain remote sensing images in the model.
The convolutional autoencoder, a kind of feed-forward neural network, is often used to solve image processing problems. It consists of three main components: the encoder $E_{conv}(\cdot)$, the decoder $D_{conv}(\cdot)$ and the loss function, which measures the information loss due to compression. The encoder and the decoder each contain an input layer, convolutional layers and an output layer. The convolutional autoencoder is shown in Figure 6. The mechanism of the autoencoder can be expressed as follows:
$$f_s = E_{conv}(X_s), \qquad \hat{X}_s = D_{conv}(f_s),$$
where $X_s$ is the input image, $f_s$ is the image feature, and $\hat{X}_s$ is the output image.

Attention Mechanism
In convolutional neural networks for computer vision or natural language processing, attention modules that imitate human visual attention are very popular. We introduce the attention module into the DDTL framework to improve the transfer ability.
The squeeze-and-excitation (SE) module is a novel architectural unit [36]. The SE module focuses on the channel relationship of the given input feature map $F_{in} \in \mathbb{R}^{c \times h \times w}$ to improve the extraction of useful information from the feature map. The SE module contains a squeeze operation $M_{sq}$, an excitation operation $M_{ex}$ and a scale operation $M_{sc}$. The squeeze operation compresses the feature map $F_{in} \in \mathbb{R}^{c \times h \times w}$ to a one-dimensional feature map $F_{sq} \in \mathbb{R}^{c \times 1 \times 1}$ by global average pooling. After the squeeze operation, the one-dimensional feature map $F_{sq}$ is passed through an excitation operation, which produces a set of weights $F_{ex} \in \mathbb{R}^{c \times 1 \times 1}$, one for every channel. The weights are applied to the feature map $F_{in}$ in the scale operation, and the output $F_{out}$ of these operations is the outcome of the SE module, which can be passed directly into the other convolution layers of the network.
The operation of the SE module is shown as follows:
$$F_{sq} = M_{sq}(F_{in}), \qquad F_{ex} = M_{ex}(F_{sq}), \qquad F_{out} = M_{sc}(F_{in}, F_{ex}) = F_{in} \odot F_{ex},$$
where the symbol $\odot$ denotes element-wise multiplication in this paper.
Unlike the SE module, which attends only to channel relationships, the convolutional block attention module (CBAM) is an effective attention module for image recognition using convolutional neural networks in computer vision tasks [35]. This module processes the input features of the image along two separate dimensions in sequence, a channel attention module followed by a spatial attention module, to help information flow within the network by learning which information to emphasize or suppress.
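The squeeze, excitation and scale steps of the SE module can be sketched in NumPy as follows. This is a minimal illustration, assuming the usual SE excitation design (two fully connected layers with a ReLU in between and a sigmoid at the end); the reduction ratio `r`, the random weights and the omission of biases are hypothetical choices for the example, not settings reported in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(f_in, w1, w2):
    """Squeeze-and-excitation on a (c, h, w) feature map.

    w1 has shape (c // r, c) and w2 has shape (c, c // r); they are the two
    excitation weight matrices (biases omitted for brevity).
    """
    f_sq = f_in.mean(axis=(1, 2))                    # squeeze: global average pooling -> (c,)
    f_ex = sigmoid(w2 @ np.maximum(w1 @ f_sq, 0.0))  # excitation: FC -> ReLU -> FC -> sigmoid
    f_out = f_in * f_ex[:, None, None]               # scale: one weight per channel
    return f_out, f_ex

# Toy feature map with c = 8 channels and a hypothetical reduction ratio r = 2.
rng = np.random.default_rng(0)
c, h, w, r = 8, 5, 5, 2
f_in = rng.normal(size=(c, h, w))
w1 = rng.normal(size=(c // r, c))
w2 = rng.normal(size=(c, c // r))
f_out, f_ex = se_block(f_in, w1, w2)
```

Each channel of `f_out` is the corresponding channel of `f_in` multiplied by a single weight between 0 and 1, which is how the module emphasizes or suppresses whole channels.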
To improve the performance of the DDTL framework for remote sensing image classification, we propose an improved attention module based on CBAM (improved CBAM), which is well adapted to the remote sensing image recognition and classification task. On the basis of the CBAM module, inspired by the 3D spatial-channel attention module [61], we added a submodule to improve the extraction capacity for the input feature map. The improved attention module differs from the 3D spatial-channel attention module: in the improved attention module, the channel attention and spatial attention are connected in series, whereas in the 3D spatial-channel attention module, they are connected in parallel. As shown in Figure 7, the submodule has three convolution layers, with kernel sizes of 1 × 1, 3 × 3 and 7 × 7. The feature map $F_{out}$ from the CBAM module is sent into the submodule and then processed separately by the three convolutional layers. After the outputs of the three convolutions are concatenated, a convolutional layer with a 7 × 7 kernel size follows the submodule:
$$F'_{out} = f^{7 \times 7}\left(\left[f^{1 \times 1}(F_{out});\; f^{3 \times 3}(F_{out});\; f^{7 \times 7}(F_{out})\right]\right),$$
where $F'_{out}$ is the outcome of the improved CBAM model and $F_{out}$ is the outcome of the CBAM model. $f$ is the convolution operator, and its superscript (e.g., 1 × 1) denotes the kernel size.
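The multi-scale submodule described above (parallel 1 × 1, 3 × 3 and 7 × 7 convolutions, concatenation, then a 7 × 7 convolution) can be sketched with a naive NumPy convolution. This is an illustration only: the channel counts, the zero "same" padding, the random kernels and the absence of activation functions are assumptions, since the text does not specify them, and the helper names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Stride-1 convolution with zero 'same' padding (cross-correlation, as in CNN layers).

    x: (c_in, h, w); kernel: (c_out, c_in, k, k) with odd k.
    """
    c_out, c_in, k, _ = kernel.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Kernel tap (i, j) applied to the correspondingly shifted window.
            out += np.einsum("oc,chw->ohw", kernel[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out

def multiscale_submodule(f_cbam, k1, k3, k7, k_fuse):
    """Parallel 1x1, 3x3 and 7x7 branches, channel concatenation, then a 7x7 convolution."""
    branches = [conv2d_same(f_cbam, k) for k in (k1, k3, k7)]
    stacked = np.concatenate(branches, axis=0)   # concatenate along the channel axis
    return conv2d_same(stacked, k_fuse)

# Toy CBAM output with c = 4 channels; every branch keeps c channels (an assumption).
rng = np.random.default_rng(0)
c = 4
f_cbam = rng.normal(size=(c, 16, 16))
k1 = rng.normal(size=(c, c, 1, 1))
k3 = rng.normal(size=(c, c, 3, 3))
k7 = rng.normal(size=(c, c, 7, 7))
k_fuse = rng.normal(size=(c, 3 * c, 7, 7))
f_improved = multiscale_submodule(f_cbam, k1, k3, k7, k_fuse)
```

The three kernel sizes give the submodule receptive fields of different extents over the same feature map, and the final 7 × 7 convolution mixes them back to the original channel count.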

Reconstruction Loss
The reconstruction loss function is used to evaluate the difference between the input and output images of the convolutional autoencoder. After the images from the target and source domains are processed by the autoencoder, we define the reconstruction errors between the original images and the processed images as the loss function of the feature extractor $L_R$. It is necessary to ensure that the difference between the input and output is small enough to obtain better extracted features. $L_R$ is given as follows:
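A minimal NumPy sketch of such a reconstruction loss is given here, assuming a squared-error measure averaged over the images of each domain and summed over the source and target batches. The exact norm and weighting are assumptions made for illustration and are not taken from the text; the function name is hypothetical.

```python
import numpy as np

def reconstruction_loss(x_src, x_src_hat, x_tgt, x_tgt_hat):
    """Average squared reconstruction error, summed over the two domains.

    Each argument is a batch of images with shape (n, h, w, channels);
    the *_hat arrays are the autoencoder outputs for the matching inputs.
    """
    def per_domain(x, x_hat):
        diff = (x_hat - x).reshape(len(x), -1)
        # Squared L2 norm per image, then the mean over the batch.
        return float(np.mean(np.sum(diff ** 2, axis=1)))
    return per_domain(x_src, x_src_hat) + per_domain(x_tgt, x_tgt_hat)

# Toy check: 2 x 2 single-channel images reconstructed as all ones.
zeros = np.zeros((2, 2, 2, 1))
ones = np.ones((2, 2, 2, 1))
loss_perfect = reconstruction_loss(zeros, zeros, zeros, zeros)
loss_off_by_one = reconstruction_loss(zeros, ones, zeros, ones)
```

The loss is zero only when both domains are reconstructed exactly, so minimizing it pushes the encoder to keep the information needed to rebuild source and target images alike.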

Domain Loss
Usually, minimizing the reconstruction error L R can uncover the high-level features of the images. However, the distribution mismatch between the source and the target domains cannot be ignored, so only minimizing L R is not sufficient to obtain a robust model. To extract the same or similar features shared by the target domain and the source domain, we can minimize the domain distance between these domains. In other words, it is necessary to solve the problem of the distribution mismatch between the target and source domains, and we use the domain loss function to constrain the problem. The transfer model adds an adaptation layer to the convolutional autoencoder in order to measure the domain loss L D between the target domain and the source domain [62].
Regarding the domain loss $L_D$, we use the maximum mean discrepancy (MMD) [63], an important statistical domain distance estimator, to measure the domain distance. The domain loss is expressed as follows:
$$MMD(X, Y) = \left\| \frac{1}{n_1}\sum_{i=1}^{n_1}\phi(x_i) - \frac{1}{n_2}\sum_{j=1}^{n_2}\phi(y_j) \right\|,$$
where $n_1$ and $n_2$ are the numbers of instances of the two domains, and $\phi(\cdot)$ is the kernel that maps the two sets of features into a common reproducing kernel Hilbert space (RKHS), where the distance of two domains is maximized.
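The MMD between two feature sets can be estimated without forming the feature map explicitly, by expanding the squared norm into kernel evaluations. The sketch below uses a Gaussian (RBF) kernel and the biased estimator; the kernel choice and the bandwidth `sigma` are assumptions for illustration, as the text does not state which kernel was used.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a (n1, d) and b (n2, d)."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    """Biased MMD estimate between feature sets x (n1, d) and y (n2, d)."""
    mmd_sq = (rbf_kernel(x, x, sigma).mean()
              + rbf_kernel(y, y, sigma).mean()
              - 2.0 * rbf_kernel(x, y, sigma).mean())
    return float(np.sqrt(max(mmd_sq, 0.0)))

# Toy features: two samples from the same distribution and one shifted sample.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(200, 4))
feat_b = rng.normal(size=(200, 4))
feat_shifted = rng.normal(loc=3.0, size=(200, 4))
d_same = mmd(feat_a, feat_b)
d_shifted = mmd(feat_a, feat_shifted)
```

In this toy setting the shifted sample yields a clearly larger value than the sample drawn from the same distribution, which is the behavior a domain loss relies on.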

Classification Loss
After the encoder process, the high-level features extracted from the target domain are used for target classification with two fully connected layers. The fully connected layers can find the best feature combination for each class in the target task [64]. In the DDTL model, the output layer with a cross-entropy function is used to calculate the classification loss as follows:
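As a minimal sketch of a cross-entropy classification loss of this kind, the NumPy code below assumes a two-class softmax output (non-landslide and landslide) and integer labels. The label encoding and the averaging over the batch are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy for integer labels (assumed: 0 = non-landslide, 1 = landslide)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]  # probability assigned to the true class
    return float(-np.mean(np.log(picked + 1e-12)))

# An uninformative classifier versus a confident, correct one.
labels = np.array([0, 1, 0, 1])
loss_uniform = cross_entropy(np.zeros((4, 2)), labels)
loss_confident = cross_entropy(np.array([[10.0, -10.0], [-10.0, 10.0]]), np.array([0, 1]))
```

With equal logits the loss equals ln 2, the value for a classifier that guesses both classes with probability 0.5, and it approaches zero as the correct class receives almost all of the probability.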

The Total Loss of the DDTL
According to the above three loss functions, the total loss is defined as follows:
$$L(p_E, p_D, p_C) = L_R + L_D + L_C,$$
where $p_E$, $p_D$ and $p_C$ are the parameters of the encoder, decoder and classifier, respectively. The framework of the proposed AM-DDTL model is shown in Figure 8.

Results
In this section, we first present the outcome of image enhancement and introduce the improved CBAM DDTL model. Then, we present an overview of the different attention mechanism comparison results. In addition, we explore the different effects of different source domains in obtaining a more accurate classification model using landslide remote sensing images.

The Result of Image Enhancement
In this study, we processed optical images using illumination estimation techniques with a weight matrix and synthesized multiple exposure images using a camera response model [50]. The original high-resolution images extracted from Google Earth (Google Inc.) and the image-enhanced images are shown in Figure 9. Moreover, we conducted several experiments to explore the effect of image enhancement using different deep learning methods.
The enhanced images had good visual quality, and the image enhancement algorithm not only improved the brightness of the images but also made the details of the images clearer than before. Landslides have a strong contrast with the surrounding environment in remote sensing images.
To determine whether the image enhancement algorithm is useful for deep learning and computer vision, we compared different deep learning methods on the enhanced images and on the original images. The results are shown in Table 2. The experiments using enhanced images obtained higher accuracy than those using the original images.

Landslide Detection by Pretrained Model
The most popular algorithm of traditional transfer learning is the pretrained model [65,66]. Pretrained models are trained on the ImageNet dataset [64,67,68] and then applied to other image classification tasks, which can reduce the time consumed and improve work efficiency. In other words, the target domain classification model is obtained by removing the top layers of the existing ImageNet-trained model and retraining an output layer on the target domain. The structure of the pretrained model is shown in the right-hand picture in Figure 10.
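The idea of keeping a pretrained feature extractor fixed and retraining only an output layer can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `frozen_features` is a hypothetical stand-in for an ImageNet-trained backbone with its top removed (it is not VGG or ResNet), the four toy "images" are invented pixel lists, and the learning rate and epoch count are arbitrary.

```python
import math

def frozen_features(image):
    """Hypothetical stand-in for a frozen pretrained backbone.

    A real backbone would return a high-dimensional feature vector; here we
    return two hand-made features (mean intensity and intensity spread).
    """
    return [sum(image) / len(image), max(image) - min(image)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_output_layer(features, labels, lr=0.5, epochs=500):
    """Train only a sigmoid output layer with per-sample gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Toy target domain (invented): 1 = landslide, 0 = background.
images = [
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.9, 0.3, 0.7],
    [0.4, 0.5, 0.45, 0.5],
    [0.5, 0.55, 0.5, 0.52],
]
labels = [1, 1, 0, 0]

features = [frozen_features(img) for img in images]
w, b = train_output_layer(features, labels)
predictions = [predict(w, b, x) for x in features]
print(predictions)
```

Only `w` and `b` are updated during training; the feature function never changes, which mirrors the frozen-backbone setting described in the text.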

Landslide Detection by the CNN Model
The convolutional neural network is widely used to solve image classification problems and is one of the most popular deep learning algorithms [18]. Compared with traditional machine learning algorithms, it has outstanding ability in image classification because of its convolution layers and subsampling layers, which can effectively extract useful feature maps from images [64,69,70,71]. In this study, we constructed the CNN model with five convolutional layers with 3 × 3 kernels followed by a max-pooling layer. The output of the convolutional part of the network is connected to a flatten layer and dense layers. The convolutional layer and the max-pooling layer can find and preserve the features of the image for classification, and the sigmoid activation function of the dense layers is well suited to binary classification. The structure of the normal CNN model is shown in the left-hand picture in Figure 10.
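To make the layer stack described above more concrete, the following sketch traces how the spatial size of a feature map could evolve through five convolution and pooling stages, and shows the sigmoid used for the binary output. Several details are assumptions rather than facts from the paper: a 224 × 224 input, stride-1 "same" padding for the 3 × 3 convolutions, and a 2 × 2 max-pooling step after every convolutional layer.

```python
import math

def conv_same(size):
    """A stride-1 3x3 convolution with 'same' padding keeps the spatial size."""
    return size

def max_pool(size, window=2):
    """Non-overlapping max pooling divides the spatial size by the window."""
    return size // window

def feature_map_sizes(input_size, n_blocks=5):
    """Spatial size after each (assumed) conv + max-pool block."""
    sizes = [input_size]
    for _ in range(n_blocks):
        sizes.append(max_pool(conv_same(sizes[-1])))
    return sizes

def sigmoid(z):
    """Maps the final dense-layer logit to a landslide probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

sizes = feature_map_sizes(224)
print(sizes)         # [224, 112, 56, 28, 14, 7]
print(sigmoid(0.0))  # 0.5
```

Under these assumptions the flatten layer would receive a 7 × 7 map per channel; a different input size or pooling arrangement changes that number.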
As shown in Table 2, DDTL achieved the highest accuracy (88.01%) for the classification task, but the CNN model only achieved 86.16% classification accuracy because the training data were insufficient [18]. Moreover, the pretrained models, VGG-16, VGG-19 and ResNet-50, achieved accuracies of 87.09%∼89.86%. The pretrained models achieved higher accuracy because these models were trained on a massive number of images and obtained the initial parameters. The settings of the pretrained model were more similar to those of the DDTL model; however, the accuracy of the pretrained model was lower than that of the DDTL model. The difference in the classification accuracy may have been caused by the different domain statistical distributions. In other words, the massive images used to train the pretrained models and Longgang landslide datasets had different distributions.
We used the remote sensing images (RGB) and DEM data of the Bijie landslide dataset as the source domain to explore the effect of DEM data on the landslide classification task. As shown in Table 3, the total loss of DDTL with RGB + DEM was lower compared with the results of RGB, and the result of classification using only DEM data was the worst. Therefore, in the experiments in this study, we used the DEM as the supplementary geomorphological data for the source domain.

The Comparison of Different Attention Mechanisms
We developed an improved attention mechanism based on the CBAM attention mechanism, and the improved CBAM was found to be suitable for information extraction from satellite images. Additionally, we compared the SE, CBAM and improved CBAM to obtain a better understanding of the effect of the attention mechanism. The outcome is shown in Figure 11 and Table 4.
Regarding the improved attention mechanism, we added to the CBAM a submodule consisting of convolutional layers with different kernel sizes to extract more effective information. The submodule consists of three parallel convolutional layer subblocks with kernel sizes of 1 × 1, 3 × 3 and 7 × 7, which can extract the characteristics of landslides at different scales. To select and verify the subblock's structure, we compared the effect of the different kernel sizes. Figure 12 shows that the three parallel convolutional layer subblocks produce the best classification effect. In addition, the convolutional layer with a kernel size of 1 × 1 also performs well because this kernel size can extract information from each pixel of the feature map, although this capability may also extract more useless information.
We obtained feature maps with different attention mechanisms in the DDTL model: the SE-DDTL module, the CBAM-DDTL module and the improved CBAM-DDTL module. We compared the feature maps that these models produced for a landslide image, and the outcome of the comparison is shown in Figure 13. We found that the feature maps of CBAM-DDTL and the improved CBAM-DDTL captured the landslide features well. The normal DDTL model without an attention mechanism consistently obtained only a few landslide feature pixels, and the feature extraction effect of the SE-DDTL model was between that of DDTL without an attention mechanism and DDTL with CBAM. In later experiments, we used the improved CBAM-DDTL as the model.
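The three parallel kernel sizes described above can be illustrated with a small, dependency-free sketch. It is not the trained submodule: the kernels here are fixed averaging filters rather than learned weights, and fusing the three branches by averaging is an assumption, since this section does not state how the branch outputs are combined.

```python
def conv2d_same(fmap, k):
    """Zero-padded 'same' convolution of a 2D list with a k x k averaging kernel."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:  # outside the map counts as 0
                        total += fmap[ii][jj]
            out[i][j] = total / (k * k)
    return out

def parallel_multiscale(fmap, kernel_sizes=(1, 3, 7)):
    """Run the branches in parallel on the same input, then fuse them (assumed: mean)."""
    branches = [conv2d_same(fmap, k) for k in kernel_sizes]
    h, w = len(fmap), len(fmap[0])
    fused = [
        [sum(branch[i][j] for branch in branches) / len(branches) for j in range(w)]
        for i in range(h)
    ]
    return branches, fused

# A 9 x 9 map with a single bright pixel in the centre.
fmap = [[0.0] * 9 for _ in range(9)]
fmap[4][4] = 1.0
branches, fused = parallel_multiscale(fmap)
print([round(branch[4][4], 4) for branch in branches])
```

The 1 × 1 branch leaves each pixel untouched, while the 3 × 3 and 7 × 7 branches spread the same response over progressively larger neighbourhoods, which is the multi-scale behaviour the text attributes to the submodule.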

The Comparison of Different Source Domains
In this section, we first compare the different source domains, as shown in Figures 14 and 15. We selected five datasets: the WHU-RS19 dataset, the UC Merced land use dataset, the Google Earth dataset of SIRI-WHU, the Bijie landslide dataset and the NWPU-RESISC45 dataset. They differ in the number and variety of remote sensing images. First, the NWPU-RESISC45 dataset achieves the best performance in terms of both the classification loss and the total loss. In addition to reaching the lowest loss values, the curve of the NWPU-RESISC45 dataset is smoother than those of the other source domains. Using the NWPU-RESISC45 dataset as the source domain, the model needs fewer than twenty epochs to reach equilibrium, while, for the other source domains, the model may need more than 40 epochs to obtain a good result. However, the Bijie landslide dataset did not achieve the best performance, although it was visually more similar to the Longgang landslide dataset. The SIRI-WHU and Bijie landslide datasets followed the same overall training curve and reached a well-trained state at the same time, but the curve of the Bijie landslide dataset had some unstable values and larger fluctuations. The UC Merced land use dataset's training curve eventually started to stabilize, but for WHU-RS19, training still continued after 100 epochs.
Regarding the domain loss of different source domains, as shown in Figure 15, the domain loss of the NWPU-RESISC45 dataset first increases and then settles at the lowest stable value. The domain loss of the WHU-RS19 dataset continues to rise because the training is ongoing. The SIRI-WHU and Bijie landslide datasets share the same curve shape for domain loss.
After comparing the differences in the single source domains, we also explored the best combination of these source domains because the high-level features of different source domains are different, and they may overlap or intersect. For this purpose, we compared the single domain with the multisource domain to explore the effects of these remote sensing image datasets and found the optimal source domain combination that could obtain the best classification effect for landslide images. The outcome of this experiment is shown in Figure 16.
Figure 16 and Table 5 show that NWPU-RESISC45 obtains the best classification effect, whether acting as the single source domain or combined with the Bijie landslide dataset, and the two curves almost coincide. The most obvious result is the substantial improvement in the classification effect when the three remote sensing scene classification datasets, WHU-RS19, the UC Merced land use dataset and SIRI-WHU, are combined with the Bijie landslide dataset. This is especially true for the WHU-RS19 dataset: using WHU-RS19 as the single source, the model did not reach a stable state, while the multisource model obtained a better effect after only approximately 40 training epochs. For the other two datasets, the gap between the two curves in the total loss represents the improvement of the classification effect.
For the domain loss, the UC Merced land use dataset and the SIRI-WHU dataset are consistent, and the two curves of single source and multisource gradually stabilize after rising, although the amount of training required differs. In addition, comparing the curves of the SIRI-WHU dataset and the UC Merced land use dataset shows that the two curves of the SIRI-WHU dataset are more similar in shape than those of the UC Merced land use dataset.

Effectiveness Verification of Landslide Detection
In order to verify the effect of the landslide detection model, we evaluated whether the model could identify the landslides and further aid in indoor remote sensing interpretation work. The northern region of Longgang, having a more complex terrain environment, was used as the verification area. Here, there may have been undiscovered or recently occurring landslides. We applied our improved CBAM-DDTL model using the Bijie landslide dataset and NWPU-RESISC45 as the source domain. Finally, the model found 12 suspicious landslide candidates.
After the targeted detailed investigations in the field, the 12 candidates were confirmed by the authors. This demonstrated that more than 95% of the candidates were real landslides. The verification experiment in this area showed that our proposed landslide detection model provides outstanding performance.

Discussion
In this study, we compared the CNN model, pretrained model and DDTL method. The CNN model obtained the worst classification accuracy, and the DDTL model and the pretrained model obtained similar classification results, which were better than those of the CNN model. For the CNN model and the pretrained model, we used the Longgang landslide dataset, which contains only 177 landslide images, to train the classification model, and we divided the whole landslide dataset into a training set and a validation set at a ratio of 7:3, as in previous research [1,71]. After multiple training steps, we obtained the training results. We conclude that the CNN model did not obtain promising classification results because only a few images were used for its training, so it could not extract useful features for landslide classification. The pretrained model obtained classification results similar to those of the DDTL model because these pretrained models have initial parameters trained on massive numbers of images and can therefore extract the high-level feature maps of the remote sensing images. However, after intensive study, we found that the distribution of the massive image set used to train the pretrained model did not match that of the target, the Longgang landslide dataset, which may have influenced the classification effect. The pre-obtained parameters can alleviate these shortcomings and yield better results.
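The 7:3 division mentioned above can be sketched as a seeded shuffle followed by a cut. The function name, the seed and the use of integer indices are illustrative assumptions; the paper does not describe its shuffling procedure, and the count of 177 is used here only because it is the number of landslide images quoted in the text (the full dataset may contain other samples as well).

```python
import random

def split_7_3(samples, train_ratio=0.7, seed=0):
    """Shuffle reproducibly, then cut into training and validation subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local generator, global state untouched
    n_train = round(len(items) * train_ratio)
    return items[:n_train], items[n_train:]

train_ids, val_ids = split_7_3(range(177))
print(len(train_ids), len(val_ids))  # 124 53
```

With so few samples, the validation subset holds only a few dozen images, which is one reason a single split can give noisy accuracy estimates.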
In this paper, we innovatively introduce an attention mechanism into distant domain transfer learning because remote sensing landslide images contain complex information, such as rivers, roads and lakes. This redundant information may harm landslide classification. Therefore, we introduce the attention mechanism into the convolution process for feature extraction. The SE attention mechanism extracts the deep landslide image features through channel attention. The CBAM attention mechanism, unlike the SE mechanism, extracts features through both channel and spatial attention and obtains good feature extraction. Therefore, the CBAM-DDTL model classifies more accurately than the SE-DDTL model. Regarding the improved CBAM-DDTL model inspired by [61], we process the outcome of the CBAM with three parallel convolutional layers with kernel sizes of 1 × 1, 3 × 3 and 7 × 7. To verify the effect of the subblock with three parallel convolutional layers, we compared the classification effect of each single kernel size (1 × 1, 3 × 3 and 7 × 7) with that of the three parallel convolutional layers. The parallel convolutional layers obtained the lowest total loss, which is the same as the outcome of [6]. We conclude that the reason is that the 1 × 1 kernel layer can discover high-level features at the scale of individual pixels, and running it in parallel with the other two kernel sizes can slightly increase the extraction effect.
Regarding the source domains, the NWPU-RESISC45 source domain obtains the lowest total loss, while WHU-RS19 obtains the worst performance. This suggests that the quantity and diversity of the source domain images are the dominant factors for the classification effect. In addition, as with medical image classification [58], a lower domain loss indicates a smaller domain distance: as Figure 16 shows, the combination of NWPU-RESISC45 and the Bijie landslide dataset results in not only the lowest total loss but also the lowest domain loss.
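The link between domain loss and domain distance can be illustrated with one simple choice of distance. The sketch below is an assumption, not the loss defined in the paper: it measures the squared Euclidean distance between the mean feature vectors of two domains and combines the three loss terms named in this paper (classification, reconstruction and domain) as a weighted sum with hypothetical weights.

```python
def mean_vector(features):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def domain_distance(source_features, target_features):
    """Squared distance between domain means (an assumed, simple domain loss)."""
    mu_s = mean_vector(source_features)
    mu_t = mean_vector(target_features)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

def total_loss(classification, reconstruction, domain, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms; the weights are hypothetical."""
    w_c, w_r, w_d = weights
    return w_c * classification + w_r * reconstruction + w_d * domain

# Invented 2-D features: one source close to the target, one far from it.
target = [[1.0, 1.0], [1.2, 0.8]]
near_source = [[1.0, 0.9], [1.1, 1.0]]
far_source = [[3.0, 3.0], [3.5, 2.5]]

print(domain_distance(near_source, target) < domain_distance(far_source, target))
```

Under this definition, a source domain whose features cluster near those of the target yields a smaller domain term and therefore a smaller total loss, all else being equal.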
From Figure 14, we can see that the Bijie landslide dataset does not achieve the best results. Therefore, we can conclude that even when the source domain is visually similar to the target domain, the classification results may not be optimal. The SIRI-WHU and Bijie landslide datasets have the same performance and approximately the same number of images. Moreover, in Figure 16, the two domain loss curves of a single source and multiple sources have the same shape, which shows that the two datasets share a certain number of high-level features.
Comparing the UC Merced land use dataset with the Bijie landslide dataset shows that the Bijie landslide dataset shares more features with the target domain than the UC Merced land use dataset does and reaches a stable classification loss value sooner. For the UC Merced land use dataset, the convolution operation continuously extracts deep features common to the target domain, and the model obtains a small classification loss. This is because the feature map of the Bijie landslide dataset is more similar to that of the Longgang landslide target domain than the feature map of the UC Merced land use dataset is.
The study in this paper has three limitations: (1) the feature extraction of remote sensing images using the DDTL algorithm is computationally expensive; (2) the landslide classification task also needs a publicly available landslide remote sensing dataset with a huge number of images for further research; (3) due to some limitations, we cannot use high-resolution remote sensing satellite images such as the Gaofen series and only use Google Earth images for analysis.

Conclusions
In this paper, we introduce the DDTL algorithm for landslide detection in the Longgang District based on remote sensing images and highly accurate DEM. Three contributions make this study distinctive. First, we innovatively introduce the DDTL algorithm. The DDTL algorithm does not require training data and testing data to have the same probability distribution, and it can handle the classification task, which only has a few labeled target samples, by transferring knowledge from completely different source domains. Second, we combine the attention mechanism and the DDTL algorithm and propose an improved attention mechanism that is suitable for landslide detection. Third, this is the first study to introduce scene classification datasets into the landslide detection task. We use the classification loss, reconstruction loss and domain loss to evaluate the DDTL model. All the experimental results show that the improved CBAM-DDTL has a better classification effect than other methods, and the NWPU-RESISC45 dataset is more suitable than other datasets for the landslide detection task using the DDTL model.