Automatic Mapping of Landslides by the ResU-Net

Massive landslides over large regions can be triggered by heavy rainfall or major seismic events. Mapping regional landslides quickly is important for disaster mitigation. In recent years, deep learning methods have been successfully applied in many fields, including automatic landslide identification. In this work, we propose a deep learning approach, the ResU-Net, to map regional landslides automatically. This method and a baseline model (U-Net) were both tested in Tianshui city, Gansu province, where heavy rainfall triggered more than 10,000 landslides in July 2013. All models were run on a 3-band (near-infrared, red, and green) GeoEye-1 image with a spatial resolution of 0.5 m. At such a fine spatial resolution, the study area is spatially heterogeneous. The tested study area is 128 km², 80% of which was used to train the models and the remaining 20% to validate their accuracy. The proposed ResU-Net achieved higher accuracy than the baseline U-Net in this mountain region, improving F1 by 0.09. Compared with the U-Net, the ResU-Net performs better in discriminating landslides from bare floodplains along river valleys and from unplanted terraces. By incorporating environmental information, the ResU-Net may also be applied to other landslide mapping tasks, such as landslide susceptibility and hazard assessment.


Introduction
In mountainous regions, massive landslides can be induced by major earthquakes or heavy rainfall. Recent examples include the 2008 Wenchuan earthquake [1][2][3][4], the 1999 Chi-Chi earthquake, and the 2009 Typhoon Morakot [5]. These landslides usually spread over hundreds of thousands of square kilometers. Fast mapping of these landslides is urgently needed for post-disaster response, yet it remains a challenge to the geoscience community.
Before remote sensing emerged, field surveys played an important role in mapping regional landslides. From the middle of the 20th century, manual interpretation of landslides from aerial photos, validated by field reconnaissance, was the most commonly used tool to map widely distributed landslides [6], though it was time- and labor-consuming. In the 2000s, very high spatial resolution satellite images became available and semi-automatic landslide mapping has been attempted [7]. These methods to map regional landslides are often carried out in one of two ways: pixel-based or object-based image analysis [8,9]. Pixel-based methods use different characteristics between landslide pixels and non-landslide pixels to pick out landslides. For pixel-based methods, change detection based on spectral information has been the most commonly used technique [10][11][12][13], exploiting the difference between pre- and post-event images.

High Resolution Satellite Images
The GeoEye-1 image was acquired on 8 October 2013, two months after the July 2013 Tianshui rainfall-triggered landslide event. The panchromatic band has a nominal spatial resolution of 0.5 m. The multispectral bands have a 1.65 m ground sampling distance, with four bands in the near-infrared (NIR), red, green, and blue. Covering an area of 128 km², the GeoEye-1 image is in the WGS84 coordinate system, spanning longitudes 105°44'48"–105°55'45"E and latitudes 34°13'11"–34°19'20"N.

Landslide Inventory Maps
The inventory of the 2013 Tianshui rainfall-triggered landslides was manually produced from the GeoEye-1 image and validated by field reconnaissance. Up to 17,515 landslides triggered by the heavy rainfall were delineated as individual solid polygons in the study area (Figure 1). According to the classification scheme proposed by [37] and modified by [38], most landslides in this study area are shallow debris avalanches and debris flows (Figure 2). Debris avalanches are located on slopes, and debris flows, transformed from landslide debris on slopes, are located in valley bottoms. During our field reconnaissance, we found that the depth of most debris avalanches is ≈1 m. In this work, all of these types are treated as a single landslide class. Several field surveys were carried out after the event, and landslide photos were taken in August 2014 and September 2018 (Figure 2) to validate the interpretation.


Figure 2. Landslide photos (a–c) taken during field reconnaissance. Locations and directions of these photos are shown on the GeoEye-1 RGB composite images in the right column.


Preparation of Training and Validation Datasets
The spatial resolution of the GeoEye-1 image used in this work is 0.5 m in the panchromatic band and 1.65 m in the multispectral bands. Using the Gram-Schmidt Spectral Sharpening module in ENVI 4.8, we pansharpened the multispectral and panchromatic bands. The pansharpened image is 33,368 × 22,900 pixels with a 0.5 m ground sampling distance. To simplify the model and speed up training, we converted the 16-bit original image to 8-bit and used only three spectral bands (near-infrared, red, and green) in the analysis. The landslide inventory maps were converted to raster at the same resolution as the pansharpened image. Landslide pixels were encoded as 1 and non-landslide pixels as 0.
Training convolutional neural networks (CNNs) requires many thousands of annotated training samples. As the CNN models operate on square input patches, the satellite image and inventory map layer were divided into 1443 patches of 600 × 600 pixels. In total, 80% of these image patches (1154) were randomly selected as training data, and the remaining 20% (289) were taken as the validation area to evaluate the performance of the models.
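The patch preparation described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: a small 1200 × 1800 array stands in for the full 33,368 × 22,900 mosaic, and the function and variable names are assumptions.

```python
import numpy as np

# Tile a pansharpened image and its label raster into fixed-size patches,
# then randomly hold out 20% of the patches for validation.
def make_patches(image, label, size=600):
    h, w = label.shape
    patches = []
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            patches.append((image[r:r + size, c:c + size, :],
                            label[r:r + size, c:c + size]))
    return patches

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(1200, 1800, 3), dtype=np.uint8)  # NIR, red, green
lab = rng.integers(0, 2, size=(1200, 1800), dtype=np.uint8)       # 1 = landslide
patches = make_patches(img, lab)                                  # 2 x 3 = 6 patches
idx = rng.permutation(len(patches))
n_train = int(0.8 * len(patches))
train_ids, val_ids = idx[:n_train], idx[n_train:]
```

On the real mosaic the same loop yields the 1443 patches used in this study; the random permutation gives the 80/20 split.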

Baseline Model
The U-Net was presented as a semantic segmentation network for biomedical image segmentation [34]. It is designed for small training datasets, with a training strategy that relies heavily on data augmentation to use the available annotated samples more efficiently. U-Net is a deep learning framework based on fully convolutional networks and comprises two parts: a contracting path (encoder) that captures context via a compact feature map, and an expanding path (decoder) for precise localization. The U-Net combines low-layer context information and high-layer semantic information through skip connections to improve segmentation accuracy. In this paper, we used U-Net as a baseline model against which to compare the ResU-Net.

Residual Learning Block
Generally, deeper networks could perform better, but increasing depth can also lead to model degradation. To address this problem, He et al. [27] presented a residual learning framework that eases the training of networks and reduces training error with increasing depth. The residual learning block can be written as

y = F(x_i) + x_i,

where x_i and y are the input and output vectors of the i-th residual block, and F(x) represents the residual mapping to be learned. In this study, a skip connection was implemented between the first and the third convolutional layers, performing identity mapping between input and output vectors (Figure 3). The output of each identity shortcut was added to the output of the stacked convolutional layers. In this work, a residual block includes three convolutional layers and three batch normalization (BN) layers [39]. The first and second convolutional layers are each followed by a BN layer and a rectified linear unit (ReLU) activation, as shown in Figure 3. The (i + 1)-th residual learning block takes the output of the i-th residual block as its input, easing training [27].
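The residual mapping y = F(x_i) + x_i can be illustrated with a framework-agnostic toy example. Here a single ReLU-activated linear map stands in for the paper's stack of three conv/BN layers; this is an assumption for illustration only, not the model's actual layers.

```python
import numpy as np

def residual_block(x, F):
    """Residual mapping plus the identity shortcut: y = F(x) + x."""
    return F(x) + x

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
F = lambda x: np.maximum(W @ x, 0.0)  # stand-in for the stacked conv/BN layers

x = rng.normal(size=4)
y = residual_block(x, F)
# Subtracting the learned residual recovers the input exactly,
# which is why gradients flow easily through very deep stacks:
assert np.allclose(y - F(x), x)
```

Because the shortcut is an identity, the block only has to learn the residual F, which is what lets the ResU-Net stack many such blocks without degradation.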

Proposed ResU-Net
To address the training degradation problem with increasing network depth, we adopted residual learning blocks to replace every convolutional layer of the encoding path of the U-Net baseline model. The network architecture of the ResU-Net is illustrated in Figure 4 and Table 1; it consists of an encoding path (left side) and a decoding path (right side).
Figure 4. Architecture of ResU-Net.

The encoding path comprises an input unit, a head unit, and four residual units. The head unit includes two convolutional layers, each followed by a BN layer and a ReLU. The first residual unit (Residual unit_1 in Table 1) includes three residual blocks stacked with nine convolutional layers. The second residual unit (Residual unit_2 in Table 1) consists of four residual blocks stacked with 12 convolutional layers.
The third residual unit (Residual unit_3 in Table 1) includes six residual blocks stacked with 18 convolutional layers. The fourth residual unit (Residual unit_4 in Table 1) includes three residual blocks stacked with nine convolutional layers. Default parameters in the structures of the residual units were used in this work [27]. Down-sampling of the feature map was performed by conv12, conv24, and conv42 with a stride of 2, while doubling the number of feature channels.

The decoding path contains repeated applications of four concatenate blocks, one addition block, and an output unit. Each concatenate block consists of one 1 × 1 convolution and an up-sampling of the feature map that halves the number of feature channels. We built a concatenation between the output feature maps of each concatenate block and the output feature maps of the corresponding residual unit in the encoding path. At the final layer of the decoding path, a 1 × 1 convolution filter and a sigmoid activation were used to map segmentation results to a binary classification. Copy and crop operations were implemented between the output features of every residual block and the convolutional layers of the decoding path; they are necessary for multi-scale feature fusion. As shown in Table 1, the ResU-Net has 56 convolutional layers.
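The shape bookkeeping of a decoder concatenate block can be sketched with NumPy. This is an illustrative sketch under assumed layer sizes, not the paper's exact layers: the feature map is upsampled 2×, a 1 × 1 convolution halves its channels, and the result is concatenated with the corresponding encoder features along the channel axis.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weights):
    """1 x 1 convolution as a channel mixing: weights is (C_out, C_in)."""
    return np.tensordot(weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
decoder_feat = rng.normal(size=(8, 16, 16))   # coarse decoder features
encoder_feat = rng.normal(size=(4, 32, 32))   # skip features from the encoder
W = rng.normal(size=(4, 8))                   # halve channels: 8 -> 4

up = conv1x1(upsample2x(decoder_feat), W)     # (4, 32, 32)
fused = np.concatenate([up, encoder_feat], axis=0)
assert fused.shape == (8, 32, 32)             # channels from both paths
```

The same doubling of spatial size and halving of channels repeats through the four concatenate blocks until the output unit produces the per-pixel probability.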

Implementation of Training
The ResU-Net and U-Net models were implemented using the PyTorch framework. All the code used in this research is available as supplementary material on GitHub: https://github.com/WenwenQi/Deep-Residual-U-Net-for-extracting-landslides. All models were optimized by the mini-batch Stochastic Gradient Descent algorithm [40]. The performance of deep learning algorithms relies heavily on the training dataset, layer depth, input window size, and training strategy [29]. We set the initial learning rate to 0.001, decayed by a factor of 0.1 every 20 epochs, for 200 epochs in total. We trained the ResU-Net and U-Net models with a mini-batch size of 5 on an NVIDIA GeForce GTX 1080 Ti GPU (12 GB memory). Momentum was set to 0.9 and weight decay to 5 × 10⁻⁵. There were 1148 fixed-size training image tiles (600 × 600 pixels) available for training. Data augmentation was used during training, flipping images horizontally, vertically, and both, to increase the training data volume and avoid over-fitting.
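The step-decay schedule above (initial rate 0.001, multiplied by 0.1 every 20 epochs) can be sketched as a simple function; in PyTorch the same behavior is provided by torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1).

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=20):
    """Step decay: multiply the base rate by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Rates over the 200 training epochs:
assert learning_rate(0) == 0.001                       # epochs 0-19
assert abs(learning_rate(20) - 1e-4) < 1e-15           # epochs 20-39
assert learning_rate(199) == 0.001 * 0.1 ** 9          # final epochs
```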
In this work, we used the Binary Cross Entropy loss function (BCELoss), which can be described as

l_n = −w_n [y_n · log(x_n) + (1 − y_n) · log(1 − x_n)], l(x, y) = (1/N) Σ_n l_n,

where N is the batch size, x_n and y_n are the predicted probability and label of the n-th sample, l_n is the loss of each sample in the mini-batch, and w_n is a manual rescaling weight given to the loss of each batch element. By default, the loss l(x, y) is averaged over the loss elements in the batch; this form is also used for measuring the reconstruction error of an auto-encoder. We used the Kaiming uniform initialization method [41] to initialize the weights of the model, which takes the Rectified Linear Unit and Parametric Rectified Linear Unit (ReLU/PReLU) into account. This method is a robust initialization tool allowing extremely deep models to converge. Using the trained models, a series of landslide probability maps was produced for the validation area. All output probability maps were tiles of 600 × 600 pixels. We added geo-transformation and projection information to every probability-map tile and mosaicked all tiles using the GDAL/OGR library.
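The mean-reduced BCELoss can be sketched in plain Python. This is a minimal re-implementation for illustration; PyTorch's nn.BCELoss computes the same quantity.

```python
import math

def bce_loss(probs, labels, weights=None):
    """Binary cross entropy averaged over the mini-batch (reduction='mean')."""
    weights = weights or [1.0] * len(probs)
    losses = [-w * (y * math.log(x) + (1 - y) * math.log(1 - x))
              for x, y, w in zip(probs, labels, weights)]
    return sum(losses) / len(losses)

# Confident, correct predictions (landslide at 0.9, background at 0.1)
# incur a small loss of -log(0.9) per sample:
loss = bce_loss([0.9, 0.1], [1, 0])
assert abs(loss - (-math.log(0.9))) < 1e-12
```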

Evaluation
Evaluation was carried out in the validation part of the study area using manually interpreted landslides. To evaluate the performance of the proposed ResU-Net, we calculated recall, precision, and F1. Recall is the fraction of the total reference landslides that are actually detected. Precision is the fraction of the total detected landslides that are really landslides. The F1 score, the harmonic mean of precision and recall, combines the two into a single criterion; it ranges from 0 to 1, and the higher the F1, the better the predictions:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall),

where TP (true positive) is correctly identified landslides; FN (false negative) is true landslides omitted by the method; and FP (false positive) is non-landslides mistakenly detected as landslides.
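These definitions can be written directly from the confusion-matrix counts. The counts below are made-up illustrative numbers, not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from pixel-level confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
assert p == 0.8 and r == 0.8
assert abs(f1 - 0.8) < 1e-12  # harmonic mean of equal values is that value
```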

Results
Table 2 shows that the proposed ResU-Net has the best results in terms of precision, recall, and F1. Precision in the ResU-Net improved slightly (by 0.03) compared to the U-Net. Recall improved the most (0.13 higher than U-Net), and F1 is 0.09 higher. With the same hardware, it took ≈29 and ≈72 h to train the U-Net and ResU-Net models, respectively. After training, both models need similar testing time (2–3 s) per image tile (600 × 600 pixels). We further show the mapping results of the two models in the validation area (Figure 5). The yellow, red, and blue polygons are correctly detected landslides, omission errors, and commission errors, respectively.
Although both methods correctly detect most landslides (yellow polygons in Figure 5), they have difficulty detecting small landslides (Figure 5b1–b6 and Figure 5c1–c6). The ResU-Net is better than the U-Net at extracting small and slender landslide polygons (Figure 5b2,b3 and Figure 5c2,c3).
Compared to the ResU-Net, the U-Net has difficulty delineating landslide boundaries and produces more omission errors (red polygons in Figure 5b1–b6). As shown in Figure 5b6, the U-Net has more omission errors around the boundaries of large landslides.
As the GeoEye-1 image we used has very high spatial resolution (0.5 m), landslides are spectrally heterogeneous ground features. Floodplains in this image look very similar to landslides and pose a challenge for landslide identification. The U-Net model has more omission errors on floodplains (Figure 5b5). The problem with very high spatial resolution images like this is that the same ground features can have different spectral values, whereas different ground features can be spectrally similar. For the ResU-Net, a few omission errors and more commission errors occur along the floodplains (Figure 5c5). In Figure 5b7,c7, we can see that both models have difficulty distinguishing landslides from roads and buildings. However, more false positives and false negatives occur with the U-Net than with the proposed ResU-Net.


Model Comparisons
In this work, we proposed a ResU-Net by integrating the U-Net [34] and residual learning [27]. To identify landslides, feature selection (e.g., NDVI, textural information) is a crucial step in traditional machine learning models [10,13], and selecting the most efficient input features relies heavily on expert experience. Deep learning models were developed on the basis of large datasets and high computing power. When we train deep learning models, the inputs are image patches containing the spectral features of three bands (near-infrared, red, and green), from which semantic features are automatically generated and used to identify landslides. Converting low-level (spectral) features to high-level (semantic) features is a major advantage of deep learning models over traditional machine learning models.
Although deep learning methods have the advantage of automatic semantic feature extraction, they can face training degradation with limited annotated training samples. Most existing methods to identify landslides use many environmental factors and only a few convolutional layers (four- or seven-layer depth) to avoid model degradation [28][29][30]. By adopting residual learning in every convolutional layer of the encoding path, the proposed ResU-Net avoids model degradation and reaches higher accuracy than the U-Net (F1 improved by 0.09). By integrating the residual learning units, the ResU-Net can have up to 56 convolutional layers without training degradation. In addition, the ResU-Net applies a shortcut connection strategy to pass some high-resolution low-level features to the final classification module, which uses image information more efficiently. Compared with the U-Net, the proposed ResU-Net performs better in delineating landslide boundaries (Figure 5).
The outputs of these models (the ResU-Net and the U-Net) are landslide probability maps, where higher values indicate more confidence in predicting landslides. To get the final binary landslide map, a threshold must be applied to the probability map: pixels with output values larger than the threshold are assigned as landslide pixels, and the rest as non-landslide pixels. In general, lower thresholds detect more true landslides but risk assigning false pixels as landslides (more commission errors), leading to lower precision and F1 but higher recall. In contrast, higher thresholds result in fewer predicted landslide pixels (more omission errors), leading to lower recall and F1 but higher precision. As determining the threshold is beyond the scope of this work, we used thresholds that balance precision, recall, and F1 for all models.
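The thresholding step can be sketched with NumPy. The 2 × 2 probability map and the threshold of 0.5 below are made-up illustrative values, not the tuned thresholds used for the models.

```python
import numpy as np

prob = np.array([[0.10, 0.60],
                 [0.45, 0.90]])   # model output: landslide probabilities
threshold = 0.5                   # illustrative value only
binary = (prob > threshold).astype(np.uint8)  # 1 = landslide, 0 = background
```

Raising the threshold flips borderline pixels such as the 0.45 and 0.60 values, which is exactly the precision/recall trade-off described above.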

Limitations of the Model
Landslides are earth surface processes of various spatial scales, ranging from a few square meters to many thousands of square meters. Neither deep learning model considers landslide scale differences, and both perform poorly in detecting small landslides. During semantic feature extraction in the convolution and pooling procedures, small landslides of only a few image pixels can become blurred. Building image pyramids and introducing pyramid pooling algorithms are two possible ways to address this problem [29,31].
The high accuracy of landslide extraction in this region may be partly because the proposed algorithm was applied to a rural area. Models can easily discriminate landslides from dense vegetation, but there are bare lands in our study area that are spectrally similar to landslides. It is also possible that, by adding a high spatial resolution DEM, the model could more easily discriminate landslides from flat floodplains [42]. However, the optical image in this work has a spatial resolution of 0.5 m, and a DEM of such high spatial resolution is very difficult to acquire; this may also be a problem in other mountain regions. Therefore, extracting landslides from spectral information alone is a worthwhile effort.
Landslide extraction using deep learning models requires large amounts of label data and raw imagery. To run the ResU-Net and U-Net, we divided the entire study area into 1443 image patches of 600 × 600 pixels and selected 80% of the study area, i.e., 1154 image patches, to train both models. The amount of training data is therefore very large: 1154 × 600 × 600 × 4 pixel values (three optical bands and a layer of landslide labels). As the GeoEye-1 data are 16-bit, there are more than 1.6 billion values to be used as training data. We trained on a server (Ubuntu 16.04, Intel® Xeon® CPU E5-2620) accelerated by an NVIDIA GeForce GTX 1080 Ti GPU (12 GB memory). These requirements for label data and hardware may limit the method's use to some extent. Besides, there are some omission and commission errors in our results (Figure 5), which indicate that information other than spectral data (such as a DEM) should be used to further improve model performance.

Potential Applications
With more high spatial resolution sensors available in the future, large volumes of very high-resolution images will be available for regional landslide mapping. During post-disaster assessment, fast mapping of landslides can be urgently needed. This method provides a possible solution for fast mapping of landslides triggered by major earthquakes or extreme rainfall. Using an NVIDIA GPU, landslides in an image tile of 600 × 600 pixels can be automatically mapped in ≈3 s. This fast mapping method may play an important role in future landslide inventory mapping.
The proposed method for identifying landslides could also be used in other types of landslide mapping. Take landslide susceptibility mapping as an example, which is similar to the landslide detection in this work: both use landslides as the dependent variable. The difference is that, instead of spectral bands from remote sensing images, landslide susceptibility mapping uses environmental layers such as lithology, slope, and land use/cover type [43]. In the future, if these environmental factors are considered, the model proposed in this paper could be readily applied to landslide susceptibility and hazard mapping.

Conclusions
Automatic mapping of landslides from optical images could greatly benefit the generation of regional landslide inventories. Although existing deep learning methods can extract features efficiently for landslide identification, they may face training degradation with limited annotated samples. To solve this problem, we proposed the ResU-Net for automatic landslide mapping, built on the architecture of U-Net by integrating residual learning. By hybridizing both algorithms, the proposed method achieves higher accuracy than the baseline U-Net in a spatially heterogeneous mountain region using 3-band optical images. The higher accuracy indicates that the ResU-Net performs better in avoiding model degradation with limited annotated training samples. The ResU-Net also has promising potential for other types of regional landslide studies, such as landslide susceptibility and hazard mapping, by integrating environmental information.