Landslide Extraction from High-Resolution Remote Sensing Imagery Using Fully Convolutional Spectral–Topographic Fusion Network

Abstract: Considering the complexity of landslide hazards, their manual investigation is inefficient and time-consuming, especially in high-altitude plateau areas; extracting landslide information using remote sensing technology therefore offers great advantages. In this study, comprehensive research was carried out on the landslide features of high-resolution remote sensing images in the Mangkam dataset. Based on the idea of feature-driven classification, we propose a landslide extraction model, the fully convolutional spectral–topographic fusion network (FSTF-Net), a deep convolutional neural network for multi-source data fusion that takes topographic factors (slope and aspect) and the normalized difference vegetation index (NDVI) as multi-source inputs for model training. A high-resolution remote sensing image classification method based on a fully convolutional network was used to extract the landslide information, thereby realizing the accurate extraction of landslides and surrounding ground-object information. With Mangkam County in the southeast of the Qinghai–Tibet Plateau, China, as the study area, the proposed method was evaluated using a high-precision digital elevation model (DEM) generated from stereoscopic images of Resources Satellite-3 together with multi-source high-resolution remote sensing image data (Beijing-2, WorldView-3, and SuperView-1). Results show that our method achieved a landslide detection precision of 0.85 and an overall classification accuracy of 0.89. Compared with the state-of-the-art DeepLab_v3+, our model increases the landslide detection precision by 5%. Thus, the proposed FSTF-Net model has high reliability and robustness.


Introduction
A landslide is a very common type of natural disaster. In addition to their impact on the physical environment, landslides often have serious socio-economic impacts [1]. The accurate extraction of landslide disasters can provide key information for their early prevention. Remote sensing datasets and processing technologies (such as segmentation [2] and classification [3]) can be used to provide services for quick information extraction and the emergency management of landslide disasters. Compared with synthetic aperture radar (SAR) and light detection and ranging (LiDAR), optical imaging sensors are more easily supported by different platforms and have significant advantages, including their wide coverage, short update cycle, few environmental limitations, and large amount of information. With the increasing number of satellites, the number of optical remote sensing images is rapidly growing. Therefore, landslide extraction using optical remote sensing images has become a research hotspot in recent years [4][5][6][7]. As an increasing number of remote sensing satellites with high spatial resolution (<2 m) are launched, researchers are able to obtain more abundant and detailed ground-object information. In addition to more effective spectral information, information such as ground-object texture, geometric structure, and shape can also be obtained, which improves the accuracy with which landslide distribution, quantity, and contours can be extracted [8][9][10].
Current landslide information extraction methods using high-resolution remote sensing images are mainly divided into pixel-based and object-based approaches [11,12]. In a study by Yong et al. [13], landslides were extracted from high-resolution remote sensing images based on an optimal partition algorithm. The random forest algorithm has also been used to extract landslide information based on the texture features of Landsat 8 images [14]. The pixel-based method relies on the assumption of homogeneous radiation within landslides that are in fact represented by heterogeneous polygons; it is therefore unable to process the multi-level spatial details of landslides provided by high-resolution images. Different machine learning methods and classifiers have been integrated with object-oriented extraction and used for landslide extraction. Heleno et al. [3] applied an automatic method for extracting rainfall-induced landslides on Madeira Island using a support vector machine (SVM) with a radial basis function kernel in an object-oriented framework. Ma et al. [15] used WorldView-2 images for the automatic detection of shallow landslides; the object-oriented method had a relatively high accuracy, reaching 85%. Efstratios Karantanellis et al. [16] developed a new object-based image analysis (OBIA) methodology, and its outputs demonstrated the potential for the accurate characterization of individual landslide objects. Since a landslide is a slope sliding process, topographically driven segmentation of the study area is significant for object-based methods [2,8,17]. From the perspective of pattern identification, the selection of artificial thresholds and representative feature extraction limits the accuracy of landslide-information extraction. Accuracy can therefore be effectively improved by learning features automatically from remote sensing datasets rather than crafting them manually, since effective features enable ground-object classification and landslide extraction.
Great theoretical and practical significance is involved in applying deep learning to remote sensing images and performing research into intelligent analysis with target identification and ground-object classification [18]. Recently, various deep learning architectures based on graph convolutional networks (GCNs) [19], generative adversarial networks (GANs) [20], and long short-term memory (LSTM) [21] have been applied to remote sensing and have been shown to produce state-of-the-art results. Multilayer autoencoders (AEs) are usually used for spectral–spatial feature learning, with good effect [22]. Residual spectral–spatial attention networks [23] have made great progress in hyperspectral image classification. A gated bidirectional network was proposed for the feature fusion of remote sensing scene classification [24]. Current deep learning methods have demonstrated advantages in landslide extraction. Ding et al. [25] applied a convolutional neural network (CNN) structure to landslide detection using GF-1 images in Shenzhen, achieving a landslide detection rate of 72.5%. Using a CNN and an improved region-growing algorithm, Yu et al. [26] extracted the areas and boundaries of landslides with high accuracy. A CNN model was developed by Lei et al. [27] to address the complexity and spatial uncertainties of landslides. Nikhil Prakash et al. [4] presented a modified U-Net model for the semantic segmentation of landslides at a regional scale from EO data, which performed better than traditional machine learning methods. Haojie Wang et al. [28] proposed a deep-learning method to identify natural terrain landslides using integrated geodatabases, which outperforms other machine learning algorithms owing to its strengths in feature extraction and multi-data processing.
Multiple deep learning networks, including VGG16, VGG19, ResNet50, ResNet101, DenseNet120, DenseNet201, UNet−, UNet+, and ResUNet, were compared in the study of Chang Li et al. [29]. Results showed that VGG models have the highest precision but the lowest recall. Shengwu Qin et al. [30] introduced a distant domain transfer learning (DDTL) method for landslide detection and classification, which outperforms traditional CNN methods. In these studies, good results could be obtained only with a large number of training samples. Moreover, these methods were applicable only to specific areas; when cross-scene and multi-sensor settings were considered, the performance of these models would be greatly reduced. Therefore, a landslide extraction network with a lifelong learning ability should be built.
Internal factors such as topography, geological structure, and lithology are the primary causes of landslides, while external factors, including rainfall, underground water, surface water, human activities, and earthquakes, accelerate their occurrence. Owing to the complexity and uncertainty of landslides, extracting landslide information is challenging. Since landslides have no unique spectral features or shapes, detection based on multiple features has proven effective. For instance, NDVI [9], topographical features (slope, aspect, and curvature) [31], morphological features [15], and other geological features have been used for landslide detection and extraction. A residual network was trained based on spectral and topographical features, and the results of different feature integration strategies were compared [6]. Xu et al. [32] proposed an end-to-end network model for post-earthquake landslide segmentation and extraction, in which a number of non-landslide areas were removed through the comprehensive use of geological features to improve the overall extraction accuracy. Peng Liu et al. [33] proposed an improved U-Net model that adds spatial information bands (DSM, slope, and aspect), whose extraction accuracy is 13.8% higher than that of the traditional U-Net model. With the accumulation of multi-source data, such as remote sensing and basic geographic data, it is necessary to integrate multi-source data into the network and design an appropriate network structure, thereby improving the accuracy of landslide extraction.
The contributions of this study include the following aspects. (1) Given that image segmentation and classification can be integrated in a fully convolutional network, such a network was chosen as the basic network for the pixel-level classification of landslides and surrounding ground objects, providing the quantity, distribution, and contour information of landslides. (2) Based on atrous convolution, pyramid pooling, and an encoder-decoder structure, the multi-scale features and contextual information of the fully convolutional network model were effectively integrated. Moreover, the multi-source data (slope, aspect, and NDVI) were input into a branch network to extract features that were then fused with the features extracted from the optical remote sensing RGB images. A landslide information extraction model, the fully convolutional spectral–topographic fusion network (FSTF-Net), was proposed. The FSTF-Net model can not only identify different shapes of landslides and ground objects but also obtain clear landslide boundaries. Our approach can considerably improve the extraction accuracy of landslides and surrounding ground objects.

Methods
In this paper, a fully convolutional spectral–topographic fusion network to extract landslide information from high-resolution remote sensing imagery is proposed. It contains three stages: data preparation, training, and classification, as illustrated in Figure 1. In the data-preparation stage (Figure 1a), the image data, slope, aspect, NDVI data, and labeled data were sliced into small patches using superpixel segmentation. Meanwhile, the multistage sampling method [34] was employed to ensure that the sample proportions of the various classes were relatively balanced. In the training stage, training samples were input to the proposed FSTF-Net network (Figure 1b), and stochastic gradient descent (SGD) was used to update the network parameters. In the final stage, the trained FSTF-Net was applied to the test data to generate the classification results of landslides and surrounding ground objects.
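As a rough illustration of the class-balancing step, the sketch below draws an equal number of patches per class; it is a simplified stand-in for the multistage sampling of [34], and all names and counts are hypothetical.

```python
import random

def balanced_sample(patches, labels, per_class, seed=0):
    """Draw an equal number of patches from every class so the class
    proportions in the training set are balanced. A simplified stand-in
    for the multistage sampling method, not the paper's implementation."""
    rng = random.Random(seed)
    by_class = {}
    for patch, cls in zip(patches, labels):
        by_class.setdefault(cls, []).append(patch)
    sample = []
    for cls in sorted(by_class):
        # sample with replacement so rare classes can still fill the quota
        sample.extend(rng.choices(by_class[cls], k=per_class))
    return sample

patches = ["p1", "p2", "p3", "p4", "p5"]
labels = ["landslide", "forest", "forest", "forest", "water"]
train = balanced_sample(patches, labels, per_class=2)
print(len(train))  # 6: two patches from each of the three classes
```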

Encoder-Decoder
To obtain the classification results of each pixel, not only is the class information of the target required, but so is the location information. The encoder-decoder network structure [35] was adopted in this study, which conformed to the end-to-end learning mode. The encoding part was used to extract the deep and abstract feature based on which the decoding part obtained the pixel-level prediction results.

Resnet and Atrous Spatial Pyramid Pooling
In this study, the encoding structure uses the feature extraction part of ResNet50 (conv1–block4) combined with ASPP as the backbone of the encoding module. The structure of ResNet50 consists of one convolutional layer and four blocks, each containing several bottleneck units. Inside each bottleneck unit, there is a shortcut connection between input and output; the bottleneck unit mitigates the vanishing gradient problem. In the standard ResNet50, the original image is downsampled by a factor of 32. We changed the stride of the 3 × 3 convolution in the first bottleneck of block4 (block4-1) and, to keep the receptive field of the remaining convolution kernels in block4 unchanged, replaced the standard convolutions with atrous convolutions with a rate of 2, yielding feature maps downsampled 16 times relative to the original image.
Landslides have different directions, structures, boundaries, and shapes, and thus the multi-scale feature must be considered and redundant information eliminated so as to improve the accuracy of extraction. Since there is severe non-uniformity in the landslide area, it is difficult to obtain effective landslide features. However, this difficulty can be solved by the pyramid pooling (PP) module, which is suitable for the feature learning of the landslide area and solving the misclassification problem of small area landslides.
To make feature learning effective, we extract outputs from convolutional layers at different scales. The last block of the ResNet50 network (block4) is followed in parallel by the ASPP network. There are four parallel atrous convolutions: one 1 × 1 convolution and three 3 × 3 convolutions with dilation rates of 6, 12, and 18. Global average pooling is applied to the last feature map to incorporate global contextual information into the model, generating a 1 × 1 convolution kernel with 256 filters; this provides better global information than maximum pooling. The final output of the encoding stage is the fusion of the multi-scale feature maps. After fusion, a 1 × 1 convolution reduces the dimensionality of the three feature maps of different sizes, which then enter the decoding module to obtain a feature map with the same size as the original image. Figure 2 shows the details of the encoder structure.
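The multi-scale receptive fields produced by these dilation rates can be verified with simple arithmetic; a minimal sketch (not the paper's code):

```python
def effective_kernel_size(k, rate):
    """Effective receptive size of a dilated (atrous) kernel:
    k_eff = k + (k - 1) * (rate - 1)."""
    return k + (k - 1) * (rate - 1)

# The three 3 x 3 atrous convolutions of the ASPP module
sizes = {rate: effective_kernel_size(3, rate) for rate in (6, 12, 18)}
print(sizes)  # {6: 13, 12: 25, 18: 37}
```

The growing effective kernel sizes (13, 25, 37 pixels) show how the three parallel branches capture landslide context at progressively larger scales without increasing the parameter count.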

Decoder
Inspired by the encoder-decoder structure of UNet, which has been widely used in the dense semantic classification of remote sensing images, the skip connection between the downsampling and upsampling paths restores the spatial information lost during the max pooling operation [36]. The decoder structure connects each convolutional layer with the corresponding deconvolutional layer so that the deconvolutional layer can perform targeted upsampling. Based on the feature fusion from the encoder, a multi-step decoder structure is adopted to restore the feature map to the original resolution. The feature map is upsampled twofold by bilinear interpolation and concatenated with the corresponding low-level feature from the encoder (conv1 of bottleneck4 in block2 of ResNet); the lower-level feature in ResNet (conv1 of bottleneck3 in block1) is then also fused. The image resolution is refined after the last skip connection. As part of FSTF-Net, the ResNet + ASPP + encoder-decoder (RAE-D) network structure is illustrated in Figure 3; it can be trained using high-resolution RGB images alone. Based on the integration of multi-scale features and contextual information in the pixel-level semantic classification of remote sensing images, the RAE-D network structure was used in this study for the RGB image features. Equipped with the advantages of atrous spatial pyramid pooling (ASPP) and the encoder-decoder structure, the encoder generates a feature pyramid with different levels: low-level features focus on details, while high-level features capture the overall situation. The PP module was integrated into the backbone network to overcome the global pooling problem. In the decoder, an additional connection was added to the network to provide the top-level classification layer with access to low-level information.
Two skip connections were used to integrate the low-level and high-level features into the final feature map, thereby effectively obtaining more contextual information and transmitting detailed information from the lower layer to a higher layer. Hence, the spatial information destroyed by pooling could be better restored, and the landslide obtained after the decoder in the final prediction had a clear boundary contour.

Proposed FSTF-Net
In this study, a landslide information extraction model named FSTF-Net was proposed, based on a fully convolutional network with multi-source data fusion; it is composed of a fully convolutional network containing encoding and decoding stages plus an additional branch network. The final feature is obtained after the RGB bands of the image pass through one branch composed of ResNet50 and ASPP. In addition, the features of the slope, aspect, and NDVI are extracted in another branch and then integrated into the backbone framework to improve the extraction accuracy of landslides and surrounding ground objects.

Fusion of Multiple Sources
The multi-source data fusion strategies available to deep-learning classification networks can be divided into layer stacking [7] and feature fusion [37], as shown in Figure 4. In layer stacking, multiple sources are integrated as multiple inputs of a single network model and the number of input channels is increased; targeted learning cannot be carried out on the multi-source image data. In addition, different features have different semantic expressions, so the effect of layer stacking is not ideal [6]. In this study, feature-level fusion based on multi-source remote sensing image data was investigated. Multiple features were independently obtained through the backbone structures of different branches; the features were then fused and input into the classifier, and the weight coefficients of each feature were learned automatically. This method improved image classification performance and information extraction accuracy. There are two ways of achieving feature-level fusion. In early studies, the corresponding elements of two input vectors were directly added or multiplied, or the maximum value was taken [38]. The other method is to concatenate multiple vectors along a specified axis. Recent studies [39,40] showed that concatenation (Concat) can encode features from different sources more effectively; thus, Concat was used for feature fusion in the network structure of this study, as shown in Figure 5. Designing the network architecture is difficult when input data from different sources are collected; therefore, the network model must be designed according to the input datasets and their types. In this study, landslide extraction was based on the spectral data, slope, aspect, and NDVI using the proposed fusion network, and two parallel independent branch networks were used as feature extractors to convert the spectral data and topographical data into abstract feature representations.
Before classification, these features were fused through Concat; that is, the various feature vectors were stacked along a specified axis and then classified after being restored to the original resolution through the decoding module, thereby realizing landslide-information extraction.
Let $F_i$ $(i = 1, 2, \ldots, n)$ be the output of the $i$th branch network to be fused, and let $\mathrm{Concat}(\cdot)$ denote the feature fusion operation along the third (channel) dimension. The fusion equation is:

$$F = \mathrm{Concat}(F_1, F_2, \ldots, F_n)$$
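The Concat fusion described above can be sketched in a few lines; the feature-map shapes and channel counts below are assumptions for illustration only:

```python
import numpy as np

def concat_fusion(*features):
    """Stack branch feature maps of shape (H, W, C_i) along the
    channel (third) axis, as in Concat-based feature-level fusion."""
    return np.concatenate(features, axis=-1)

# Hypothetical branch outputs: channel counts are illustrative, not the paper's
rgb_features = np.zeros((32, 32, 256))   # RGB (ResNet50 + ASPP) branch
topo_features = np.zeros((32, 32, 128))  # slope/aspect/NDVI (VGG-16) branch
fused = concat_fusion(rgb_features, topo_features)
print(fused.shape)  # (32, 32, 384)
```

Because the channels are merely stacked, the classifier that follows can learn a separate weight for every channel of every branch, which is how the fusion weights are "automatically learned".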

Spectral-Topographic Fusion Network
In this study, a new fusion network model, FSTF-Net, was proposed to extract landslides and surrounding ground-object information from high-resolution remote sensing images based on spectral and topographical information. The aim was to obtain better extraction results from multi-source data fusion than from single-source images. ResNet, PP, VGGNet, and feature fusion were used as core parts in this study. FSTF-Net is composed of two parallel networks that are merged at the final stage so that the entire network can learn the fused features of the branch networks. The spectral RGB data and multi-source data (NDVI, slope, and aspect) are the inputs of the proposed FSTF-Net. The scheme of the proposed FSTF-Net is displayed in Figure 6. In the study of Sameen et al. [6], the CNN model performed better than ResNet for inputs such as topographical variables. When a CNN model is used for landslide extraction, a deeper model does not ensure better accuracy, and the network depth has no impact on the final results. Therefore, to build a network with efficient computing memory, conv1 to conv5 of VGG-16 were selected as the backbone of the branch network for the topographical and NDVI data, to extract their high-level features. This branch performs four successive 2-fold downsampling operations, eventually generating a feature map downsampled 16 times, as shown in Figure 7. The FSTF-Net network integrates the advantages of multi-scale atrous convolution and skip connections, which not only obtain multi-scale features and retain contextual information but also improve the extraction ability for slope, aspect, and NDVI. The network structure is shown in Figure 8.
The red dotted line in the figure represents the features extracted from the high-resolution RGB images using the ResNet-with-ASPP branch network, and the blue dotted line shows the extraction of the slope, aspect, and NDVI multi-source data features using the VGG-16 branch network. The high-level output features from the branch networks are fused, and the fusion network model is then trained by error back-propagation. In this study, the cross-entropy function was used as the loss function, and the network parameters were updated using SGD. Conditional random fields (CRFs) were not used for post-processing because they require the adjustment of additional hyper-parameters, which yields only a small improvement or even adverse effects. The loss of the model is calculated by the cross-entropy, a well-known default loss function:

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

where $\hat{y}_i$ is the softmax output that corresponds to the prediction for each pixel, and $y_i$ is the true classification result, i.e., the label value of the pixel.
$w_{ij}$, $b$, $\hat{y}_i$, and $y_i$ are the $j$th weight of the $i$th neuron, the bias, the $i$th output of the network, and the actual classification result, respectively. Dropout was introduced into the method, which reduced the number of iterated parameters during training and prevented over-fitting.
In the training stage of the model, we used stochastic gradient descent (SGD) to update the parameters:

$$\Delta W^{(t+1)} = m\,\Delta W^{(t)} - d\,\eta\,W^{(t)} - \eta\,\nabla J\!\left(W^{(t)}\right)$$
$$W^{(t+1)} = W^{(t)} + \Delta W^{(t+1)}$$

where $\Delta W^{(t+1)}$ is the parameter increment, a combination of the original parameters, the gradient, and the historical increment; $W^{(t)}$ and $W^{(t+1)}$ are the original and updated parameters; $\eta$ is the preset learning-rate parameter controlling the iteration step length; $J(W)$ is the cost function; and $d$ and $m$ are the weight decay and momentum parameters.
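A minimal sketch of one such update step, assuming the common momentum-plus-weight-decay form of SGD (function and hyper-parameter names are ours, not the paper's):

```python
import numpy as np

def sgd_momentum_step(w, grad, increment, lr=0.001, momentum=0.9,
                      weight_decay=0.0005):
    """One SGD update with momentum and weight decay:
    increment <- m * increment - lr * (grad + d * w);  w <- w + increment."""
    increment = momentum * increment - lr * (grad + weight_decay * w)
    return w + increment, increment

w = np.array([1.0])
w, inc = sgd_momentum_step(w, grad=np.array([0.2]), increment=np.zeros(1))
# increment = -0.001 * (0.2 + 0.0005 * 1.0) = -0.0002005, so w ~ 0.9997995
```

The momentum term reuses the previous increment, smoothing the descent direction, while the weight-decay term shrinks the weights slightly at every step.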

Dropout and Batch Normalization
In this study, the feature extraction ability of the proposed FSTF-Net structure gradually improved as the number of network layers increased; however, the number of parameters in the network also increased, which leads to overfitting and the degradation of network performance. Hence, dropout [41] and batch normalization (BN) [42] were introduced to the network to improve the calculation and learning efficiency. Dropout sets a probability of eliminating nodes in the neural network, thereby reducing the number of iterative parameters and preventing overfitting. BN was added to each convolutional layer in this study to reduce differences in data distribution, thereby eliminating the local fluctuation caused by weight updating and lowering the probability of overfitting. In the network structure, the BN layer was inserted before the data of each layer are input into the activation function. The input mini-batch was set to $B = \{x_{1 \cdots m}\}$, and the trainable parameters of the BN algorithm are $\gamma$ and $\beta$.
Mini-batch mean:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$$
Mini-batch variance:
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$
Normalization:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
Output of batch normalization:
$$y_i = \gamma \hat{x}_i + \beta$$
To introduce dropout into the network structure, it was added to the bottom layer of the network before the classifier, after the rectified linear unit (ReLU) activation function and the BN layer, as shown in Figure 9. Through dropout and BN optimization, the proposed network achieved good classification results and showed great generalization ability. Figure 9. Dropout-optimized network model.
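The four BN steps (mini-batch mean, variance, normalization, and scale-and-shift output) can be sketched as a forward pass; a minimal illustration, not the network implementation:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalization forward pass over a mini-batch x of shape
    (m, features): mean, variance, normalize, then scale and shift."""
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale-and-shift output

x = np.array([[1.0], [2.0], [3.0]])
y = batch_norm(x)
# y has (approximately) zero mean and unit variance within the batch
```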

Study Area
Mangkam County, in the southeast of the Qinghai-Tibet Plateau, China, was taken as the study area in the present work (Figure 10). It is located in the plateau monsoon climate region, with rough topography, frequent geological disasters, and severe environmental conditions (low pressure, anoxia, severe cold, gales, and intense radiation). Since there is no communication signal in most regions and people have to cross a wide range of depopulated zones, it is very difficult to explore landslide disasters manually. It is also difficult for short-range unmanned aerial vehicles (UAVs) to monitor detailed terrestrial changes within such a severe environment. Thus, using high-resolution satellite remote sensing to obtain surface information has become the only choice.

Datasets
Our datasets are composed of three different high-spatial-resolution optical satellite remote sensing datasets from January to March 2018, namely Beijing-2, WorldView-3, and SuperView-1, as shown in Table 1. Among them, Beijing-2 is capable of collecting satellite imagery with a 0.8 m spatial resolution panchromatic band and 3.2 m spatial resolution multispectral bands. The WorldView-3 satellite images include a panchromatic image with 0.31 m spatial resolution and eight-band multispectral imagery with a resolution of 1.24 m, which can be applied to the extraction of the key elements of a landslide body. SuperView-1 satellite images include panchromatic images with a 0.5 m spatial resolution and multispectral imagery with a resolution of 2 m. Taking into account the needs of landslide information extraction and the monitoring of the surrounding environment, and since the edges of landslides have a certain degree of ambiguity, our datasets were labeled into the following six classes: landslide, building, forest, water, road, and bare land, which can better distinguish the landslides and surrounding ground objects in the study area (Figure 11). Owing to the regional character of landslide disasters, topographical data (slope and aspect) and NDVI data were chosen as the extraction factors in our study with which to measure the difference between landslide and non-landslide features based on the geological conditions, environmental features, and topography of the study area and its surroundings. Although other factors such as rainfall and geological structure (lithology and seismic intensity) are also important for landslide extraction, these did not change significantly over the small coverage of the high-resolution remote sensing data in the study area.
A stereoscopic image obtained by Resources Satellite-3 was used to generate the DEM, whose accuracy can reach 5 m. The images of the study area collected by Resources Satellite-3 in February 2018 had a product level of 1A (a radiation-corrected product obtained through pre-processing) and a cloud cover of 0%. At the dataset pre-processing stage [43,44], in order to fit the inputs of the fully convolutional network model, we applied cubic convolution resampling [45] to bring the DEM to a spatial resolution of 0.3 m. The NDVI data were calculated from Landsat 8 (30 m) imagery and resampled to match the resolution of the network input. The two topographical layers (slope and aspect) were generated from the DEM. The sources and formats of the obtained landslide-extraction factors are shown in Table 2. All data were converted into grid format in the WGS-1984 coordinate system. The landslide information database was mainly derived from 175 pairs of sample data, including remote sensing images, topographical data, NDVI, and label data. The sample dataset was organized in the "image-label" format, so that for each pixel in the original image, the corresponding ground-object class and multi-source feature values could be obtained (Figure 12).
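The NDVI itself is computed as (NIR − Red)/(NIR + Red); a minimal sketch, where the Landsat 8 band assignment is our assumption rather than a detail stated in the text:

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red). For Landsat 8 reflectance this
    would use band 5 (NIR) and band 4 (red) -- our assumption here."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids divide-by-zero

value = ndvi([0.5], [0.1])  # ~0.667, typical of vegetated pixels
```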

Training and Metrics
Based on the aforementioned datasets, two experiments were designed to verify the proposed method: (1) high-resolution RGB images were used to train the encoder-decoder structure RAE-D of the FSTF-Net model, and (2) RGB images and multi-source data were integrated to train the full FSTF-Net model.
To evaluate the effectiveness of multi-source data fusion, the feature learning performances of four end-to-end classification network models (ASPP, DeepLabV3+, RAE-D, and FSTF-Net) were compared. ASPP and DeepLabV3+ were used as universal high-resolution remote sensing image classification networks; because of their three-channel network structure, they cannot realize multi-source data fusion. RAE-D is an important branch structure of the proposed FSTF-Net model. As the branch network taking RGB high-resolution remote sensing images as input, RAE-D served as a reference structure for assessing the effect of omitting multi-source data fusion on the final performance. The results demonstrated the effectiveness and advantages of FSTF-Net.
Our approach was implemented with the TensorFlow framework in a Linux environment. The network layers were built with tensorflow.keras, which allows the easy customization of multiple networks. The Sklearn library was used to evaluate the final accuracy. Training used the SGD method for 300 epochs. A "step" policy for learning-rate adjustment (gamma = 0.1, step size = 15,000) was used. Batch normalization was applied after each convolutional layer. The basic parameters for calculating the increments were a momentum of m = 0.9 and a weight decay of 0.0005, and the base learning rate was 0.001. In the training stage, we first randomly shuffled all of the samples and then fed them into the network in batches. The labeled data were used to evaluate accuracy. To ensure the consistency of the algorithm comparison, all models used the same training and test datasets.
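Assuming the conventional definition of the "step" policy, the learning-rate schedule under these settings can be sketched as:

```python
def step_lr(base_lr, gamma, step_size, iteration):
    """'Step' learning-rate policy as commonly defined:
    lr = base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)

# With the settings reported here: base_lr=0.001, gamma=0.1, step_size=15000
lr0 = step_lr(0.001, 0.1, 15000, 0)       # 0.001 at the start of training
lr1 = step_lr(0.001, 0.1, 15000, 20000)   # decayed once, to 1e-4
```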
The performance of the various methods can be evaluated based on the following criteria: per-class precision, overall accuracy (OA), average recall, F1-score, and G-mean, which are considered easily interpretable and have better theoretical properties than other classification measures. The research in this paper mainly concerned the information extraction of landslides and surrounding ground objects from high-resolution remote sensing images. However, due to the fuzzy boundaries of the extracted landslides and their complicated internal structure, it is difficult to accurately evaluate the results of the landslide information extraction using classification accuracy alone. Taking into account the existing accuracy evaluation methods, our landslide extraction accuracy indicators are divided into two schemes: a remote sensing image classification accuracy evaluation based on the confusion matrix, and a landslide-specific target detection accuracy evaluation based on error analysis.
The following indicators are used as the classification accuracy evaluation criteria:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{G\text{-}mean} = \sqrt{\mathrm{Recall} \times \frac{TN}{TN + FP}}$$
where $TP$, $TN$, $FP$, and $FN$ represent the numbers of true positives, true negatives, false positives, and false negatives, respectively, as predicted by the network model.
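These criteria can be computed directly from the confusion-matrix counts; a minimal sketch using the standard definitions (the example counts are invented):

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, overall accuracy (OA), F1-score, and G-mean
    from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return {"precision": precision, "recall": recall, "OA": oa,
            "F1": f1, "G-mean": (recall * specificity) ** 0.5}

m = classification_metrics(tp=8, tn=80, fp=2, fn=10)
print(m["precision"], m["OA"])  # 0.8 0.88
```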
For evaluating the extraction of a single landslide target, a more reliable approach is error analysis. The samples are divided into TP (a landslide that was correctly detected), FN (a landslide that was falsely detected as a non-landslide), TN (a non-landslide that was correctly classified), and FP (a non-landslide that was falsely detected as a landslide). The four evaluation parameters used were: the detection percentage (DP), which indicates the probability that the algorithm correctly recognizes a landslide; the omission error (OE), which is driven by FN; the commission (misclassification) error (CE), which is driven by FP; and the quality percentage (QP), a comprehensive index of target extraction accuracy affected by both OE and CE. The higher the QP, the higher the overall accuracy of landslide extraction. The four comprehensive accuracy indicators are defined as:

DP = TP / (TP + FN)
OE = FN / (TP + FN)
CE = FP / (TP + FP)
QP = TP / (TP + FP + FN)

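The four error-analysis indicators follow directly from the counts; a minimal sketch (function names are illustrative):

```python
# Error-analysis indicators for single-target landslide extraction,
# computed from true positives (tp), false negatives (fn), and
# false positives (fp). Note that DP + OE = 1 by construction.
def detection_percentage(tp, fn):
    return tp / (tp + fn)          # DP

def omission_error(tp, fn):
    return fn / (tp + fn)          # OE

def commission_error(tp, fp):
    return fp / (tp + fp)          # CE

def quality_percentage(tp, fp, fn):
    return tp / (tp + fp + fn)     # QP
```

Because QP penalizes both omissions and commissions in one ratio, it is the most conservative of the four indicators.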
Experimental Results
To better evaluate the performance of our method in the extraction of landslides and surrounding ground objects, Basic FCN, FCN-8s, ASPP, and DeepLab_v3+ [46] were adopted as baselines for comparison with our proposed model. Our model achieved the highest OA value of 0.89. Although the improvement in overall accuracy was modest, the landslide classification accuracy changed significantly relative to DeepLab_v3+: the extraction accuracy was enhanced by 0.03. As shown in Table 3, the accuracy of landslide extraction reached approximately 0.86. As part of FSTF-Net, the ResNet + ASPP + encoder-decoder (RAE-D) backbone also performed very well on small targets, with road and building classification accuracies of 0.80 and 0.78, respectively.
Overall, according to the consistency analysis of Table 3 and Figure 13, although the RAE-D network obtained high classification accuracy, our FSTF-Net obtained the highest accuracy among all methods in the experiment. This is because the architecture of FSTF-Net combines spectral information from RGB and NDVI images with geographical information from slope and aspect. The approach was also evaluated for landslide extraction accuracy based on error analysis. The DP, OE, CE, and QP for landslides in these experiments are shown in Table 4. Compared with the results obtained by ASPP and DeepLab_v3+, the proposed FSTF-Net method performed better in landslide extraction. The purpose of multi-source data fusion by deep learning in this study was to improve the final accuracy of landslide extraction by integrating more types of data, including basic geographical data such as DEM data and traditional RGB image data, thereby obtaining a better representation of landslide features.

Importance of Multi-Source Data Fusion
The encoding-decoding structure of RAE-D takes only RGB images as input. Compared with DeepLab_v3+, it adds a skip connection to restore the detailed information of ground objects. As shown in Figure 14, landslides can be identified based on spectral information. When using the feature fusion network FSTF-Net, the additional topographic information improved the extraction of landslides and surrounding ground objects and reduced salt-and-pepper noise. It is worth noting that DeepLab_v3+ cannot distinguish small targets well because it uses fewer low-level features than FSTF-Net and RAE-D. A deep convolutional network is effective in the recognition of complex image patterns and semantic classification. However, whether the landslide boundary can be recovered by pixel-level classification should be discussed. To illustrate this, the landslide areas extracted by FSTF-Net and RAE-D were overlapped, as shown in Figure 15a,b, and the landslide boundary and range extracted by the different models were compared. The red area is the true value, the yellow is the result of RAE-D, and the purple is that of FSTF-Net. It can be seen that FSTF-Net benefited from the multi-source data, and the extracted landslide boundary was significantly better, with a more complete shape. The landslide area in the figure has some additional extended structures that can be learned from the topographical features. In summary, the results of FSTF-Net were more accurate, continuous, and close to the true value. In addition, although topographical information does not significantly improve the overall accuracy of the results, it is very helpful for distinguishing built-up areas from landslide areas that have similar spectral features. As shown in Figure 16, the red polygon marks the landslide area and the blue polygon marks the non-landslide built-up area. Their topographical information can be effectively distinguished because most landslides occur on steep slopes.
From the hillshade, slope, and aspect, the blue flat area is unlikely to be a landslide area. Therefore, by training on and learning the topographic features of landslides, the FSTF-Net model can clearly distinguish a landslide area from a non-landslide area. The deep fusion network proposed in this study not only used two-branch networks to obtain multiple features at the same time but also learned a multi-scale representation of the features from the branch networks. Without an additional supervised learning method, the branch networks were integrated into the network to improve its extraction capabilities.
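The two-branch fusion idea described above can be sketched minimally: feature maps from a spectral branch (RGB + NDVI) and a topographic branch (slope, aspect) are merged along the channel axis before further convolutions. The shapes and channel counts below are assumptions for illustration, not values from the paper:

```python
import numpy as np

def fuse_branches(spectral_feat: np.ndarray, topo_feat: np.ndarray) -> np.ndarray:
    """Concatenate per-pixel features from the two branches (channels-last).

    Both branches must share the same spatial resolution; the fused map
    then feeds the shared decoder of the fusion network.
    """
    assert spectral_feat.shape[:2] == topo_feat.shape[:2], "spatial sizes must match"
    return np.concatenate([spectral_feat, topo_feat], axis=-1)

# Hypothetical feature maps: 48 channels from the spectral branch,
# 16 channels from the topographic branch, on a 64 x 64 grid.
spectral = np.zeros((64, 64, 48))
topo = np.zeros((64, 64, 16))
fused = fuse_branches(spectral, topo)  # shape (64, 64, 64)
```

Channel concatenation (rather than element-wise addition) lets the subsequent convolutions learn their own weighting between spectral and topographic evidence.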

Analysis of Landslide Change Detection
The landslide extraction results of the FSTF-Net model from the March 2018 images were compared with those from the January and February 2018 images to analyze the changes in landslide information in parts of the study area from January to March 2018. Figure 17 shows the specific information of new landslide area #1. The area is located near an artificial mining area, and landslides have already occurred nearby.

Applications of Proposed Approach
To verify the feasibility and applicability of the proposed approach, high-resolution remote sensing images of landslides caused by the Jiuzhaigou earthquake were used for the information extraction of the landslide and surrounding ground objects. The datasets used as the experimental data are shown in Table 5. We also resampled the DEM and NDVI with the cubic convolution method to fit the inputs of the fully convolutional network model. In a plateau area under the same constraints, the FSTF-Net model trained in the Mangkam County area was used for transfer learning. Results show that the model can directly identify and classify the Jiuzhaigou landslide and surrounding ground objects without extra training. Thus, the proposed FSTF-Net model has great advantages in cross-scene and multi-sensor scenarios. The classification accuracy of the Jiuzhaigou landslide and surrounding ground objects using the FSTF-Net model is shown in Table 6. Among the results, the classification accuracy of vegetation was the highest, reaching approximately 88%. The accuracy of the landslides was also impressive (79%), and the overall accuracy was 82% (Figure 19). From the perspective of landslide extraction accuracy, the DP, OE, CE, and QP of landslides are shown in Table 7. The model trained on the Mangkam dataset was applied to the Jiuzhaigou landslide extraction and high-quality results were obtained. From the Jiuzhaigou landslide extraction results, it can be clearly seen that the FSTF-Net model successfully extracted most of the landslide contours, without obvious omissions. The value of DP was approximately 81%. In addition, the result was not influenced by other ground objects such as trees, verifying the advantages of using spectral and topographic information in the FSTF-Net model.
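The cubic convolution resampling mentioned above is based on Keys' piecewise-cubic kernel; a minimal one-dimensional sketch (function names are illustrative, and the paper's actual resampling was two-dimensional over the DEM/NDVI grids):

```python
def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Keys' cubic convolution kernel; a = -0.5 is the common choice."""
    x = abs(x)
    if x < 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

def resample_1d(samples, t: float) -> float:
    """Interpolate a 1-D signal at fractional position t using the
    four nearest samples, clamping indices at the boundaries."""
    i = int(t)
    return sum(
        samples[min(max(i + k, 0), len(samples) - 1)] * cubic_kernel(t - (i + k))
        for k in range(-1, 3)
    )
```

A 2-D resampling applies the same kernel separably along rows and columns; in practice, libraries such as scipy's `ndimage.zoom` with `order=3` give a comparable cubic result.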
Although the classification accuracy of the proposed method can still be improved, it presents many advantages: it requires less preparatory work and has high extraction efficiency. The overall performance was slightly lower than on the Mangkam dataset, which may be due to the difference in scenes: vegetation coverage in the Jiuzhaigou area is higher than in Mangkam County, and the network did not learn this prior knowledge. A possible solution to this problem is transfer learning; that is, a small sample dataset from the new landslide extraction area can be used to fine-tune the model and improve its performance. This method is especially suitable for post-disaster evaluation, where time constraints are tight and the number of landslide samples is small.
Aiming at landslide extraction from high-resolution remote sensing images, we propose a comprehensive and widely applicable scheme. For two landslide extraction tasks in different regions, the designed model is able to use the information obtained in the first task as prior knowledge about landslides and apply it to the second task. This is a major improvement, because the training of a common deep learning model is not very flexible and is difficult to adapt to different regions. The model proposed in this study follows the trend of information extraction with specific constraints and specific targets. Currently, the model is limited to the extraction of landslide information from high-resolution remote sensing images in plateau areas. However, the deep learning network framework and its variants are universal and can provide references for the analysis and application of other remote sensing image data.

Conclusions
In this study, we proposed a deep convolutional neural network named FSTF-Net for landslide extraction. Based on multi-source data fusion, the network is an end-to-end accurate landslide extraction framework. The following conclusions can be drawn:

•
Based on atrous convolution, pyramid pooling, and an encoding-decoding structure, the multi-scale features and the contextual information of the fully convolutional network model were effectively integrated to improve the performance of the network. The multi-source data, including topographical factors (slope and aspect) and NDVI, were input into the network and integrated with the features extracted from the remote sensing images. Through the improvement and optimization of the network structure, the end-to-end FSTF-Net model based on multi-source data was obtained. Comparison with other existing networks showed that the FSTF-Net model achieved accurate landslide extraction and detailed recovery of different types of ground objects in complex scenes. Based on the existing multi-source data, the model effectively increased the accuracy of landslide extraction. The overall classification accuracy reached 89%, and the accuracy of landslide detection was 85%.

•
Taking the geological disaster caused by the Jiuzhaigou earthquake in 2017 as an example, high-resolution remote sensing satellite images were collected from Google Earth. Based on these images, the FSTF-Net model trained on the Mangkam dataset was used to extract the information of landslides and surrounding ground objects after the Jiuzhaigou earthquake. The accuracy of the landslide detection was 81%. The method not only greatly reduced labor costs and time but also ensured the accuracy and reliability of the interpretation of the surface environment, providing a reference for subsequent research on the automatic extraction of landslide information.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.