Study of the Automatic Recognition of Landslides by Using InSAR Images and the Improved Mask R-CNN Model in the Eastern Tibet Plateau

Abstract: Landslide hazards are spatially scattered, temporally random, and poorly characterized. Owing to the large spatial coverage and high sensitivity of InSAR observations, InSAR is becoming one of the main techniques for identifying active landslides. The difficulty lies in quickly extracting landslide information from extensive InSAR image data. Because the instance segmentation model Mask R-CNN provides highly robust target recognition, we introduce and optimize this model to achieve fast and accurate recognition of landslides in InSAR observations, taking the landslide-prone eastern edge of the Tibetan Plateau as the test area. First, the InSAR patch landslide instance segmentation dataset (SLD) is established by developing code that converts InSAR observations into the common objects in context (COCO) annotation format. Then, the Mask R-CNN+++ model is built by adding three components to the native Mask R-CNN network: the ResNext module, which increases the fineness of the segmentation results and enhances the noise resistance of the model; the deformable convolutional block (DCB), which improves the feature extraction ability of the network for geometric morphological changes of landslide patches; and an attention mechanism, which selectively enhances useful features and suppresses less valuable ones. The model achieves 92.94% accuracy on the test set, and its active landslide recognition speed on ordinary computer hardware is 72.3 km²/s. Overall, the results show that the optimized model is more sensitive to morphological changes in the images, which yields smoother recognition boundaries and further improves the generalization ability of segmentation detection. The approach is expected to help identify and monitor active landslides under complex surface conditions on a large spatial scale. Moreover, active landslides with different geometric features, motion patterns, and intensities are expected to be further segmented.


Introduction
Due to the rapid expansion of modern human activities and the increase in extreme meteorological events because of global climate change, the occurrence of landslides has also R-CNN [35] in deep learning [36] combines target detection and segmentation tasks in one network model by segmenting the target pixels within the detection frame while locating the target location. Therefore, this model is used as the basis for further improving the characteristics of InSAR patches to automatically recognize landslide patches in InSAR observation results.
This study selects part of the landslide-prone eastern edge of the Tibetan Plateau as the test area and analyses InSAR observations with an instance segmentation model from deep learning. (1) Conversion code is implemented to automatically generate the standard COCO annotation format from InSAR images and the corresponding vector files, and the SLD (InSAR Landslide Dataset) is established. (2) Using the Mask R-CNN network as the base model, an instance segmentation model for InSAR landslide patches (Mask R-CNN+++) is established by replacing the convolutional blocks of the feature extraction network and adding an attention mechanism to the feature pyramid network, so that active landslides are recognized automatically from InSAR result maps. In summary, an automatic active landslide recognition method based on deep learning and InSAR observation results is established. In addition, a format conversion tool is developed using ArcPy [37] to form a complete image recognition and processing workflow that reduces unnecessary human intervention, thereby improving work efficiency and facilitating large-scale, short-period recognition tasks.

Research Area
The study area (Figure 1) is located on the eastern edge of the Tibetan Plateau, in the middle section of the Jinsha, Lancang, and Nujiang river basins, one of the areas in China where landslide hazards are most developed [38]. The area was influenced by the collision between the Indian and Eurasian plates and experienced several tectonic deformation stages during the Cenozoic, such as right-slip compressional torsion, large-scale slip extrusion, and left-slip tensional torsion, with complex tectonic activity. Since the Quaternary, tectonic activity in the region has remained strong [39,40], and seismic activity has been frequent [41]. The strong and continuous tectonic activity and runoff erosion have shaped the overall topography of the study area, which is high in the northwest, low in the southeast, and cut by dense canyons [42]. The average elevation ranges from 4200 m in the north to approximately 1800 m in the south [43], with a maximum elevation difference of more than 3000 m per square kilometre. This topography also strongly influences the regional climate pattern. The plateau area has a subcold semihumid plateau climate with active freeze-thaw phenomena and the development of permafrost and seasonal permafrost. The high mountain valley area is influenced by the near north-south trending mountains and the resulting atmospheric circulation, with significant temperature differences between day and night and reduced precipitation, so that river valleys are hot and high mountain areas are cold. The highly undulating terrain, variable climate, and tectonic movements lead to vigorous internal and external dynamic effects, a fragile geological environment, and frequent disasters in the area, thus posing severe threats to engineering construction and human safety.

InSAR Data Processing
The deep learning approach in this study is based on InSAR observations derived from Sentinel-1A data (https://scihub.copernicus.eu, accessed on 10 September 2021), which has a C-band wavelength of 5.6 cm. The observations comprise 30 scenes of ascending and descending orbit radar images acquired in IW (interferometric wide swath) mode. The period is from January 2019 to March 2020, the polarisation mode is VV, the average incidence angle of the images is 42.5 degrees, and the spatial resolution is 25 m × 25 m; the azimuth resolution is 13.8~13.9 m, and the range resolution is 2.3~2.4 m.
Sentinel-1A images must be processed before they can be used. The general D-InSAR processing flow was applied. SAR data with short temporal and spatial baselines were first selected for D-InSAR calculation [44] to obtain high-quality relative interferometric data, followed by removal of the stratified atmospheric component, which is positively correlated with elevation [45]. Then, filtering [46], superposition enhancement, random phase error removal, and deformation phase resonance enhancement were performed sequentially. The resulting surface deformation map of the study area, with a resolution of 25 m, is used as the base data for deep learning and as the validation data. Finally, all InSAR observations from January 2019 to March 2020 for a region (II in Figure 1a) whose topographic conditions differ from those of the SLD are selected as the images to be recognized, to further validate the generalization capability of Mask R-CNN+++.

Automatic Recognition Solutions
This study first interprets a certain number of active landslides from InSAR observations to form a dataset, which is divided into three parts: a training set, a validation set, and a test set. The dataset uses the standard COCO [47] (common objects in context) annotation format. During implementation, different optimization methods are overlaid on the original Mask R-CNN model, and the network is trained with transfer learning and data augmentation. Then, the different optimization methods are evaluated from multiple perspectives, based on computer vision evaluation metrics and expert assessment, and iterative improvements are made to determine the most suitable model structure and hyperparameter values. Finally, the optimized model is used to establish the recognition process, complete the format conversion for the whole InSAR observation result map, and identify the active landslides in the study area. The research flow chart is shown in Figure 2.

Construction and Partitioning of the Dataset
Dataset construction is the basis of deep learning and is divided into two steps: data collection and data annotation. In ground deformation detection with InSAR observations, a certain amount of landslide location and boundary information must be obtained in advance by manual interpretation as the learning target, and this process requires data annotation. The annotation follows the COCO format used by most instance segmentation algorithms. Therefore, this paper develops a method that automatically generates COCO-format annotations from vector files and raw images for sample acquisition (dataset construction). The method includes three steps: data preparation, format transformation, and data generation. First, the original images and vector files are read, and the image files (Image and Mask) are generated by aligning and cropping the images. Then, the storage paths and JSON files are generated by reading the geometry, label, and category information of the image samples. Finally, the files are converted into a dataset in the COCO annotation format. The detailed steps are shown in Figure 3.
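The format transformation step can be illustrated with a minimal sketch. It is not the authors' ArcPy-based tool: it assumes the landslide boundary has already been converted from the vector file into pixel coordinates of one image patch, and the function names (`polygon_area`, `make_coco`) and the file name `patch_0001.png` are hypothetical. Only the COCO fields themselves (`images`, `categories`, `annotations`, `segmentation`, `bbox`, `area`, `iscrowd`) follow the published COCO convention.

```python
import json


def polygon_area(xs, ys):
    """Area of a simple polygon in pixel units (shoelace formula)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        j = (i + 1) % n
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return abs(total) / 2.0


def make_coco(image_name, width, height, polygons, category="landslide"):
    """Build a minimal COCO-style dict for one image patch.

    polygons: list of [(x, y), ...] vertex lists in pixel coordinates.
    """
    coco = {
        "images": [{"id": 1, "file_name": image_name,
                    "width": width, "height": height}],
        "categories": [{"id": 1, "name": category}],
        "annotations": [],
    }
    for ann_id, poly in enumerate(polygons, start=1):
        xs = [float(x) for x, _ in poly]
        ys = [float(y) for _, y in poly]
        x0, y0 = min(xs), min(ys)
        coco["annotations"].append({
            "id": ann_id,
            "image_id": 1,
            "category_id": 1,
            # COCO polygons are flat lists: x1, y1, x2, y2, ...
            "segmentation": [[c for xy in zip(xs, ys) for c in xy]],
            # COCO boxes are [x, y, width, height]
            "bbox": [x0, y0, max(xs) - x0, max(ys) - y0],
            "area": polygon_area(xs, ys),
            "iscrowd": 0,
        })
    return coco


# Hypothetical rectangular landslide patch inside a 256 x 256 image.
example = make_coco("patch_0001.png", 256, 256,
                    [[(10, 20), (60, 20), (60, 50), (10, 50)]])
print(json.dumps(example["annotations"][0], indent=2))
```

In a full pipeline, one such dictionary would typically hold all images and annotations of a split and be written to a single JSON file.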
The InSAR observations of the study area and the 421 active landslide boundary vectors obtained through interpretation are used to form the landslide dataset (SLD), which has a sample size of 636 and a resolution of 30 m. Following the typical practice for dividing segmentation datasets into training, validation, and test sets, 380 images are randomly selected for training (60%), 128 for testing (20%), and 128 for validation (20%). The image size is 256 × 256 pixels, the input shape is 256 × 256 × 3, and the images are segmented so as to minimize edge effects and computational cost. In addition, online augmentation, with operations such as flipping and brightening applied while the model is trained, further increases the number of training samples.
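The random partition of the 636 samples into 380 training, 128 validation, and 128 test images can be sketched as follows. The function name `split_dataset`, the integer sample identifiers, and the fixed seed are assumptions made for illustration; the paper does not state how the random selection was implemented.

```python
import random


def split_dataset(sample_ids, n_train=380, n_val=128, n_test=128, seed=0):
    """Shuffle the sample identifiers and cut them into three disjoint subsets."""
    ids = list(sample_ids)
    if n_train + n_val + n_test != len(ids):
        raise ValueError("subset sizes must add up to the number of samples")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


train_ids, val_ids, test_ids = split_dataset(range(636))
print(len(train_ids), len(val_ids), len(test_ids))
```

Fixing the seed keeps the test set unchanged between experiments, which matters when different model variants are compared on the same images.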

Model Description
Mask R-CNN extends the target detection model Faster R-CNN by adding a mask branch to its classification and bounding-box regression branches, so it can both detect objects and segment instances [48][49][50]. The model first passes an image through the backbone network and extracts feature maps C2, C3, C4, and C5 at different resolutions from different stages, which are used to build a feature pyramid network (FPN). C2, C3, C4, and C5 contain the feature information in bottom-up order, from high to low resolution. The FPN structure then produces P2, P3, P4, P5, and P6, which fuse multiscale features and improve the scale robustness of the model. Based on the anchors generated by the RPN, the model performs binary classification (foreground versus background) and regression to filter the region proposals. Then, ROIAlign resizes each ROI to a fixed size of 7 × 7 or 14 × 14 pixels. Finally, the ROIs are fed into the fully connected layers and the FCN for the classification, regression, and segmentation tasks. By using ROIAlign instead of the ROI pooling of Faster R-CNN and by combining the residual network with the FPN for feature extraction, the network segments targets with high quality while detecting them. Although the model has achieved excellent results, it needs to be optimized for specific data in specific scenarios to meet the needs of different tasks.
To better identify active landslides, this paper first replaces the bottom ResNet block in the original feature extraction network with a ResNext block, which can increase the fineness of the segmentation results and enhance the noise resistance of the model without a significant increase in computational effort. Second, the standard convolution in the higher levels of the feature extraction network is replaced with a modulated deformable convolution to capture higher-level implicit features in the InSAR observations and improve the feature extraction capability of the network for geometrically changing objects. Finally, an attention mechanism is introduced in the feature pyramid construction to rescale the different channels, selectively enhancing useful features and suppressing less valuable ones. The attention mechanism models pixel-level dense context-aware relationships by recalibrating the channel dependencies according to the global context to achieve advanced feature enhancement. The network structure is shown in Figure 4.

ResNext Convolution Block
Deep neural networks have a robust feature extraction capability, but a deeper network is not always better. Adding layers may be accompanied by degradation, so more feature information cannot be obtained simply by increasing the number of network layers. The ResNext convolutional block [51] is an improvement on Inception and ResNet. Its internal structure is shown in Figure 5.
The module is based on the "split-transform-merge" strategy of Inception: the 256 input channels are first split into 32 groups, each group is subjected to the same convolutional operation, and finally the results of all the groups are fused with the original input. The module also inherits the repeated-layer strategy of ResNet; the difference is that the number of paths increases, and the same topology is used on each path, forming the group convolution of the ResNext module.
This structure allows ResNext to improve accuracy without increasing parameter complexity, while the shared topology reduces the number of hyperparameters. In the verification of ResNext on ImageNet [53] reported by Voulodimos et al. [52], the top-5 error was reduced by 0.62% when going from the 50-layer to the 101-layer residual network, but by only 0.11% when going from the 101-layer to the 152-layer residual network, while the overall time and computational effort increased significantly. Therefore, the ResNext module is selected to replace the underlying ResNet module in this paper, with the aim of increasing the fineness of the segmentation results and enhancing the noise resistance of the model.
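The claim that the grouped structure leaves the parameter complexity essentially unchanged can be checked with a short count. The sketch below assumes the widely used bottleneck templates: a ResNet bottleneck of 256 → 64 → 64 → 256 channels, and a ResNext bottleneck with 32 groups of width 4 (the "32 × 4d" configuration). The group width and the function names are assumptions; the text above only states that 256 input channels are split into 32 groups. Biases and batch normalization are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution with the given number of groups."""
    return (c_in // groups) * c_out * k * k


def resnet_bottleneck_params(c=256, width=64):
    """1x1 reduce, 3x3 transform, 1x1 expand (single path)."""
    return (conv_params(c, width, 1)
            + conv_params(width, width, 3)
            + conv_params(width, c, 1))


def resnext_bottleneck_params(c=256, cardinality=32, group_width=4):
    """Same layout, but the 3x3 convolution is split into `cardinality` groups."""
    width = cardinality * group_width
    return (conv_params(c, width, 1)
            + conv_params(width, width, 3, groups=cardinality)
            + conv_params(width, c, 1))


print("ResNet bottleneck :", resnet_bottleneck_params())
print("ResNext bottleneck:", resnext_bottleneck_params())
```

Under these assumptions the two blocks differ by well under 1% in weight count, even though the ResNext block uses 32 parallel paths.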

Deformable Convolution Block
Due to the high variability of surface undulations, images captured by satellites often exhibit significant variations in geometric features. The DCB (deformable convolutional block) [54] is designed to cope with such geometric variations. In the deformable convolutional layer, an additional two-dimensional offset is added to the regular grid sampling locations of the standard convolution (see the schematic DCB structure in Figure 6). For example, for a 3 × 3 kernel with a dilation of 1, the regular grid $\mathcal{R}$, which defines the receptive field size and dilation of the standard convolution, can be expressed as:
$$\mathcal{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\} \tag{1}$$
Thus, for each location $p_0$ on the output feature map $y$, we have:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n) \tag{2}$$
where $x$ represents the input feature map, $w$ represents the weights of the sampled values, and $p_n$ enumerates the locations in $\mathcal{R}$. In deformable convolution, the regular grid $\mathcal{R}$ is augmented with offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, where $N = |\mathcal{R}|$. Therefore, the deformable convolution can be expressed as:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \tag{3}$$
Now, the sampling is performed at the irregular offset positions $p_n + \Delta p_n$. The offsets are learned from the preceding feature maps by additional convolutional layers in parallel. The offset field has a channel dimension of 2N, corresponding to N two-dimensional offsets. As the offsets are usually fractional, bilinear interpolation is used to compute the values at the shifted sampling points. Therefore, to improve the feature extraction capability of the backbone network for objects with changing shapes, this paper adopts the deformable convolution, which can effectively model the geometric changes of the target.
Remote Sens. 2022, 14, x FOR PEER REVIEW
Figure 6. Illustration of a 3 × 3 deformable convolution. The offset field comes from the input feature map and has the same spatial resolution as the input.
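The sampling rule of deformable convolution, a weighted sum over grid positions shifted by fractional offsets and read with bilinear interpolation, can be sketched in NumPy for a single output location of a single-channel map. This is an illustration only: in the network the offsets are predicted by a parallel convolutional layer, whereas here they are supplied by hand, and the modulation term of the modulated variant is omitted. The names `bilinear` and `deform_conv_at` are hypothetical.

```python
import numpy as np

# Regular 3 x 3 sampling grid with dilation 1, as (row, col) displacements.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(x, py, px):
    """Bilinearly interpolate the 2-D map x at (py, px); values outside x count as zero."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(py - yy)) * (1.0 - abs(px - xx))
                value += weight * x[yy, xx]
    return value


def deform_conv_at(x, w, p0, offsets):
    """Sum over the grid of w(p_n) * x(p0 + p_n + dp_n) for one output location p0."""
    out = 0.0
    for n, (dy, dx) in enumerate(R):
        py = p0[0] + dy + offsets[n, 0]
        px = p0[1] + dx + offsets[n, 1]
        out += w[n] * bilinear(x, py, px)
    return out


x = np.arange(25, dtype=float).reshape(5, 5)   # toy feature map, x[r, c] = 5r + c
w = np.full(9, 1.0 / 9.0)                       # averaging kernel
no_offset = np.zeros((9, 2))                    # reduces to standard convolution
half_right = np.tile([0.0, 0.5], (9, 1))        # shift every sample by half a pixel

y_standard = deform_conv_at(x, w, (2, 2), no_offset)
y_deformed = deform_conv_at(x, w, (2, 2), half_right)
print(y_standard, y_deformed)
```

With zero offsets the function reproduces the standard convolution, which is a convenient sanity check for any deformable implementation.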

Attentional Mechanisms
Landslides in InSAR observations often exhibit different characteristics at different pyramid levels. An attention mechanism shifts attention to the most critical regions of an image and ignores irrelevant parts [55]; it thus captures critical information from complex image features and further weakens the requirement for training sets when constructing semantic associations between individual pixels in an image. Introducing attention mechanisms into the construction of the feature pyramid can address the problem of feature layer imbalance for landslides of different sizes. The attention mechanisms introduced in this study are the convolutional block attention module and the self-attention module.
The convolutional block attention module (CBAM) [56] connects channel attention and spatial attention in series (see the schematic CBAM structure in Figure 7). The algorithm decouples the channel attention map from the spatial attention map to improve computational efficiency and exploits global spatial information by introducing global pooling. The CBAM has two sequential submodules, channel and spatial. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the CBAM sequentially derives a one-dimensional channel attention vector $s_c \in \mathbb{R}^{C}$ and a two-dimensional spatial attention map $s_s \in \mathbb{R}^{H \times W}$. By applying channel attention and spatial attention in sequence, the CBAM uses the cross-channel and spatial relationships of the features to tell the network what to attend to and where. More specifically, the CBAM emphasizes informative channels and reinforces informative local regions.
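The two sequential CBAM submodules can be sketched in NumPy as follows. The sketch follows the commonly described CBAM design (global average and max pooling fed through a shared two-layer MLP for the channel vector, then channel-wise average and max maps fed through a 7 × 7 convolution for the spatial map). The random weights, the reduction ratio of 2, and the toy tensor size are assumptions for illustration, not values from this paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Channel vector s_c (length C) from global pooling and a shared two-layer MLP."""
    avg = x.mean(axis=(1, 2))   # global average pooling, shape (C,)
    mx = x.max(axis=(1, 2))     # global max pooling, shape (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)   # ReLU between the two layers

    return sigmoid(mlp(avg) + mlp(mx))


def spatial_attention(x, kernel):
    """Spatial map s_s (H x W) from channel-wise pooling and a k x k convolution."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # shape (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = x.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention, to x of shape (C, H, W)."""
    s_c = channel_attention(x, w1, w2)
    x = x * s_c[:, None, None]
    s_s = spatial_attention(x, kernel)
    return x * s_s[None, :, :], s_c, s_s


rng = np.random.default_rng(0)
C, H, W, ratio = 8, 6, 6, 2                       # toy sizes (assumed)
feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // ratio, C)) * 0.1       # hypothetical MLP weights
w2 = rng.normal(size=(C, C // ratio)) * 0.1
kernel = rng.normal(size=(2, 7, 7)) * 0.1         # hypothetical 7 x 7 kernel
refined, s_c, s_s = cbam(feat, w1, w2, kernel)
print(refined.shape, s_c.shape, s_s.shape)
```

Because both attention maps pass through a sigmoid, they act as soft gates between 0 and 1 rather than hard selections.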
The self-attention (SA) module [57] is a particular case of the attention mechanism that differs from the standard one (see the schematic SA structure in Figure 8). The SA module aims to select, from the global information, the information that is most critical to the current task, so that the full features of the image can be better utilized. The module first applies a linear transformation and channel compression to the convolutional feature map by using two 1 × 1 convolutions. It then reshapes the two resulting tensors into matrices, transposes one of them, multiplies them, and obtains the attention map by applying softmax. In addition, the original feature map is linearly transformed by another 1 × 1 convolution and multiplied by the attention map to obtain the self-attention feature map. Finally, the self-attention feature map and the original convolutional feature map are combined in a weighted sum to give the final output. The self-attention feature map can be regarded as the product of the feature map and its transpose; this operation strengthens the association between distant features in the image, so the dependency between any two pixels can be learned and global features can be obtained.
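The self-attention computation described above can be sketched in NumPy. A 1 × 1 convolution is a per-pixel linear map, so it is written here as a matrix product on the flattened feature map. The random weights, the compression to half the channels, and the fixed mixing weight `gamma` are assumptions; in a trained network these would be learned.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv, gamma=0.5):
    """Self-attention over all pixel pairs of a (C, H, W) feature map."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)            # (C, N) with N = H * W pixels
    q = wq @ flat                         # first 1x1 convolution, compressed channels
    k = wk @ flat                         # second 1x1 convolution, compressed channels
    v = wv @ flat                         # third 1x1 convolution on the original map
    attn = softmax(q.T @ k, axis=-1)      # (N, N); row i weights every pixel j
    out = v @ attn.T                      # out[:, i] = sum_j attn[i, j] * v[:, j]
    mixed = gamma * out + flat            # weighted sum with the original features
    return mixed.reshape(c, h, w), attn


rng = np.random.default_rng(0)
C, H, W = 8, 4, 4                          # toy sizes (assumed)
feat = rng.normal(size=(C, H, W))
wq = rng.normal(size=(C // 2, C)) * 0.1    # hypothetical weights
wk = rng.normal(size=(C // 2, C)) * 0.1
wv = rng.normal(size=(C, C)) * 0.1
sa_out, attn = self_attention(feat, wq, wk, wv)
print(sa_out.shape, attn.shape)
```

The attention matrix has one row per pixel and one column per pixel, which is why the module can relate distant locations but also why its memory cost grows quadratically with the feature map size.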

Model Training
The model is implemented with the deep learning frameworks TensorFlow and Keras. The experiments were run on a Windows computer with an NVIDIA GeForce GTX 2080Ti GPU, 16 GB of RAM, and a dual-core Intel(R) Xeon(R) CPU E5-2637. To enhance the generalization ability and robustness of the model, the following data augmentation and transfer learning techniques are used: (1) adding Gaussian noise to and sharpening the images, (2) horizontal and vertical image flipping, and (3) initializing the network with generic parameters pretrained on the COCO dataset. The model was trained on the SLD for 400 iterations: the first 200 iterations used an initial learning rate of 0.01, and a further 200 iterations over all network layers used a learning rate of 0.001. Finally, the trained model was applied to the corresponding InSAR observations in the study area.
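The flipping and noise augmentations, together with the two-stage learning rate, can be sketched in NumPy. The key point is that geometric transforms must be applied identically to the image and its mask, while noise is added to the image only. The noise level, the flip probability, the zero-based iteration index, and the function names are assumptions; sharpening is omitted for brevity, and the paper's actual training code is not shown in the text.

```python
import numpy as np


def augment(image, mask, rng, noise_std=0.02):
    """Randomly flip an image/mask pair and add Gaussian noise to the image only.

    image: float array (H, W, 3) scaled to [0, 1]; mask: array (H, W).
    """
    if rng.random() < 0.5:                    # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                    # vertical flip
        image, mask = image[::-1, :], mask[::-1, :]
    noisy = image + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0), mask.copy()


def learning_rate(iteration):
    """0.01 for the first 200 iterations, 0.001 afterwards (zero-based index assumed)."""
    return 0.01 if iteration < 200 else 0.001


rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))
mask = np.zeros((256, 256), dtype=np.uint8)
mask[40:90, 100:180] = 1                      # hypothetical landslide patch
aug_image, aug_mask = augment(image, mask, rng)
print(aug_image.shape, aug_mask.shape, learning_rate(0), learning_rate(200))
```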

Evaluation Indicators
To quantitatively evaluate the recognition performance of network structures with different optimization approaches, this paper uses accuracy, the F1-score, and the mean intersection over union (mIoU) as quantitative metrics to evaluate the recognition results. In Equations (4)-(6), TP denotes the number of correctly extracted landslides, FP denotes the number of nonlandslides incorrectly extracted as landslides, FN denotes the number of landslides that were missed, and TN denotes the number of correctly identified nonlandslides. The relation between these index parameters is shown in Table 1.
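As an illustration of how the three metrics follow from the confusion-matrix counts, the sketch below uses the standard definitions of accuracy, the F1-score, and IoU. That these standard forms match Equations (4)-(6) term by term is an assumption, and the function names and example counts are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all samples that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def iou(tp, fp, fn):
    """Intersection over union for one class."""
    return tp / (tp + fp + fn)


def mean_iou(per_class_counts):
    """Average the IoU over classes; each entry is a (tp, fp, fn) tuple."""
    values = [iou(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(values) / len(values)


# Made-up counts for illustration only.
acc = accuracy(tp=8, fp=1, fn=1, tn=10)      # 18 / 20
f1 = f1_score(tp=8, fp=1, fn=1)              # precision = recall = 8 / 9
miou = mean_iou([(8, 1, 1), (10, 1, 1)])     # mean of 8/10 and 10/12
```

The sketch does not guard against empty classes (zero denominators), which a production evaluation script would need to handle.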

Indicator Results
Since the model improvements target the feature extraction stage, we use the feature extraction network ResNet101 + FPN as the baseline and compare the ResNext module, the DCB module, and the attention mechanisms (CBAM and SA) experimentally. The results are shown in Table 2. Replacing only the bottom layer of the network with ResNext, that is, transforming the original single-path convolution into a multipath convolution for prediction, improves all three indices. Compared with adding only ResNext, additionally replacing the top layer of the network with the DCB to extract the features of deformed objects improves the mIoU by 2.17%. Introducing the SA module in the FPN improves the F1-score. The CBAM module, which combines spatial attention with channel attention, improves some indices, but the mIoU decreases by 0.77% when this module is added. Therefore, the SA module, which learns the dependencies between any two pixel points and thus obtains global features, is the better choice of the two attention modules. The results show that introducing the DCB and SA modules in the base network for feature extraction is the optimal solution. In terms of model performance, the addition of the ResNext module can increase the receptive field of the network, thereby significantly improving the generalization ability of the model in segmentation detection.

Test Set Recognition Results
The native model focuses only on the accuracy of the area and pays little attention to the variation in landslide geometry; consequently, its detection results may portray the active landslide morphology inaccurately. After ResNext, the DCB, and the attention mechanisms (CBAM and SA) are added, the model becomes more suitable for capturing the specific morphological features of landslides in InSAR observation maps. The typical landslides identified with the help of the optimized model in this paper are shown in Figure 9, which reflects the recognition effect of Mask R-CNN+++ on different types of landslides under different combination conditions. The detection results of the Mask R-CNN+++ proposed in this paper outline landslide boundaries accurately and extract InSAR landslide patches better, although some smaller landslides are missed.

Application Recognition Results
This paper uses the established Mask R-CNN+++ to identify landslides in the InSAR images of the entire study area. The overall number of landslide deformations is finally predicted to be 891. The results are shown in Figure 10 below. The recognition results show that most of the active landslides around the rivers have been successfully detected, especially in the landslide-prone high mountain canyon areas, thus showing that the method proposed in this paper is very effective. The picture below shows the recognition results of the whole image, and the identified landslides are indicated in yellow, red, and black.

Identify the Preliminary Results of the Classification
The model used in this paper can perform the instance segmentation task, i.e., classifying pixels by using semantic segmentation and distinguishing different object instances. Mask R-CNN adds a mask branch for pixel segmentation to Faster R-CNN. A fully convolutional network for predicting segmentation masks is introduced behind the ROIAlign layer and applied to each single ROI to predict the segmentation mask in a pixel-to-pixel manner; the size and class of the mask to use are decided in combination with the classes predicted in the Faster R-CNN branch. In this way, the model can segment different targets in the same class. With its advantage of surface coverage, InSAR deformation observation can capture various types of surface deformation. Different deformation features show different InSAR patches, distinguished by colour, shape, intensity, texture, spatial structure combination, and geomorphic location. Therefore, in this paper, the deformation areas automatically identified by the above model are transformed into shapefiles, statistically analysed in terms of number, perimeter, area, and patch brightness, and further divided into different target types within the same category by combining existing experience.
In terms of quantity, the number decreases sequentially from object 1 and object 2 to object 3; accordingly, the sum of the perimeter and area of each target type decreases, but each target type corresponds to a similar average value. From the geometric morphological perspective, the ratio of perimeter to area is lowest for object 2, thus reflecting that mainly elongated patches are dominant. The ratio is highest for object 3, which is dominated by small circular patches, and object 1 is in the middle. The results are shown in Figure 11 below.
The information on surface change is expressed directly as periodic fringes in the InSAR patches, and the patch brightness can be used to indicate the intensity of change. In terms of patch brightness, object 1 shows a higher brightness and a near-circular chair-like shape, while object 2 and object 3 have lower brightness and similar intensity. The results are shown in Figure 12 below.
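The perimeter and area statistics discussed above can be computed from polygon vertices as in the minimal sketch below. It assumes each exported patch is a simple polygon given as a list of (x, y) vertices in a projected coordinate system with metre units; the function names and the two example rectangles are hypothetical and are not patches from the study area.

```python
import math


def polygon_area(points):
    """Area by the shoelace formula; the first vertex is not repeated at the end."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of the closed ring."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def perimeter_area_ratio(points):
    """Perimeter divided by area (units: 1 / length)."""
    return polygon_perimeter(points) / polygon_area(points)


square = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
strip = [(0.0, 0.0), (400.0, 0.0), (400.0, 25.0), (0.0, 25.0)]
ratio_square = perimeter_area_ratio(square)  # 400 m / 10000 m^2
ratio_strip = perimeter_area_ratio(strip)    # 850 m / 10000 m^2
```

Because the ratio has units of inverse length, it depends on patch size as well as on shape, so comparisons between target types reflect both properties.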
In summary, from the perspective of landslide type classification, object 1 is a strongly deformed landslide, object 2 is a collapse-debris flow type, and object 3 is a small landslide with a relatively slow deformation rate. Object 1 is brighter in colour, thus reflecting multiple phase-wrapping cycles and indicating that its deformation is the strongest among the three. Objects 2 and 3 are darker and single in colour, thus reflecting that the deformation is relatively small and lies within a single 2π cycle. Object 1 occurs mainly in topographic conditions with a considerable height and steep slope. The soft surface material on its slope is sagging, and the upper geotechnical body of the slope is in an unstable state, thus providing the driving force for the deformation and destabilization of the landslide. Object 1 shows clear deformation patches of landslide form, thus reflecting obvious overall deformation and clear deformation boundaries. Therefore, this type is dominated by the strong deformation of landslide accumulations, with few newly developed bedrock deformations.
Object 2 appears mostly in the form of debris flows and riverbank collapses. In a debris flow, the slope rock mass fragments strongly while moving after the destabilization of the landslide source area; it gradually disintegrates into debris particles whose sizes usually span several orders of magnitude and spreads as a fluid-like flow over long distances and large areas. In a riverbank collapse, hydrodynamic action impacts the soil particles on the bank slope: the cohesive material in the soil body detaches from the soil particles, fissures form, and the expansion of the fissures leads to the dispersion and disintegration of aggregates.
Object 3 is mainly characterized by small landslides with low-velocity deformation. Most deformations are distributed as independent individuals, irregularly rounded or oval, generally from several tens to one hundred metres in diameter. The edges of these deformations are visible in the image. The morphology of the deformation patch is subcircular, and the deformation boundary is clear.
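The link between fringe cycles and deformation magnitude mentioned above follows from the standard repeat-pass InSAR relation, in which one full 2π cycle corresponds to half a radar wavelength of line-of-sight displacement. The sketch below gives magnitudes only, since sign conventions differ between processors. The wavelength of 0.0555 m is an assumed C-band value used purely for illustration, because the sensor is not named in this section, and the function name is hypothetical.

```python
import math


def los_displacement(phase_rad, wavelength_m):
    """Line-of-sight displacement magnitude for an unwrapped phase change.

    Uses d = wavelength * phase / (4 * pi), so one 2*pi cycle equals wavelength / 2.
    """
    return wavelength_m * phase_rad / (4.0 * math.pi)


WAVELENGTH_M = 0.0555  # assumed C-band wavelength, for illustration only

one_cycle = los_displacement(2.0 * math.pi, WAVELENGTH_M)     # half a wavelength
three_cycles = los_displacement(6.0 * math.pi, WAVELENGTH_M)  # three fringes
```

Under this assumption, a patch confined to a single cycle moved by less than about 2.8 cm along the line of sight between acquisitions, whereas a patch with several cycles moved by a corresponding multiple of that amount.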

Advantages of the Present Model in the Recognition Process
This paper uses a deep learning instance segmentation model with InSAR observations to identify the locations and boundaries of active landslides that are highly dispersed over a large area. In 2017, He et al. [35] proposed Mask R-CNN, an instance segmentation model built on the Faster R-CNN target detection framework; Mask R-CNN can perform target classification, target detection, semantic segmentation, instance segmentation, and many other tasks. Compared with the native Mask R-CNN, the Mask R-CNN+++ (ResNext-DCB + CBAM\SA) proposed in this paper improves accuracy, the F1-score, and mIoU by up to 3%. In addition, the landslide edges extracted by the optimized model are smoother and more accurate in scope. The recognition of the whole image of the study area (5.32 × 10⁴ km²) takes 736 s in total, i.e., approximately 72.3 km²/s, which is far more efficient than manual interpretation. The improved Mask R-CNN model can serve active landslide investigation in complex geological environments and in areas with a wide distribution, large scale, and high risk of significant landslide hazards. This study finds that the attention mechanism, the addition of deformable convolution, and the optimal input scale can effectively improve deep learning accuracy, but there is also room for improvement. First, although adding the attention mechanism improves the overall model accuracy to some extent, the increase in computational burden cannot be ignored, especially in the case of the self-attention mechanism, which has significantly improved the recognition effect. Second, including deformable convolution at the higher levels of the feature extraction network lets the network adapt by changing the sampling positions on the input, that is, it adjusts the receptive field to accommodate changes such as object magnification.

Impact of Instance Segmentation Model Optimization on the Recognition Process
Landslides develop at different scales, and small landslides are often more numerous, resulting in many small targets in the images. Limited by the resolution of remote sensing images, the deformation characteristics of such landslides are frequently blurred. Therefore, the labelling process in the premodelling stage causes some information loss and leads to some offset in boundary recognition during detection. Sun et al. [58] added deformable convolution at the top level of the instance segmentation model to enhance the recognition of the changing geometry of the target, but used ResNet50 as the backbone feature extraction network. This paper likewise replaces the top convolution of the feature extraction network with a deformable convolutional block that adaptively captures salient regions according to the morphological features of the landslide. Additionally, this paper selects ResNet101, which has a more robust extraction capability, as the backbone network and replaces the bottom convolutional layer with ResNext to avoid information loss during sampling and enlarge the receptive field. The comparison experiments show that replacing convolutional blocks in the network improves the F1-score by nearly 4%, thereby indicating a better balance between precision and recall. In the actual recognition process, the background information of remote sensing images can improve the generalization ability of the model, but this information also interferes with the discrimination ability of the model during detection. Therefore, an attention mechanism is added to the feature pyramid network to select the information that strongly correlates with the detected target from the considerable redundant information and to strengthen the dependencies between features, thereby enhancing feature expression. 
As researchers continue to develop new types of attention mechanisms [59][60][61], two types, namely, the CBAM and the SA, are selected in this paper. After they are introduced, the mIoU improves by nearly 7%, thus indicating that the gap between the model's predictions and the labelled ground truth is reduced; of the two, the SA is more suitable for extracting the overall contour of landslides. Different types of convolutional kernels are used in the feature extraction network to extract features from the images of landslide observation results while maintaining the spatial relationship between image pixels. Methods such as data augmentation and transfer learning reduce the amount of data needed while improving the recognition performance of the model.

Uncertainty in Observation and Identification under Different Natural Conditions
Landslides are controlled by many factors, such as topography, lithology, tectonics, meteorology, hydrology, vegetation, and human activities. The combinations of these factors also vary, thus resulting in different landslide patterns that appear in the InSAR results as varied geometries and colour changes of landslide patches. For example, for the retrogressive landslide along the Jinsha River, which was formed by the traction damage process of the landslide body triggered by the rising water level [62], the InSAR patches show large deformation at low elevations and many colour change cycles, while the deformation in other parts is relatively small; the red-layered soft rock landslides between Nangqian town and Gama town on the south bank of the Lancang River [63] are characterized by minor colour differences in the strip patches but develop in clusters. In this case, the recognition of all types of active landslides cannot be fully guaranteed, even if data augmentation techniques are used in training the model. An ideal landslide recognition model oriented to InSAR observation result maps should be able to solve the recognition and classification problems of different types of landslides. Although current deep learning technology has a powerful ability to analyse large datasets, it cannot identify all landslides indiscriminately because the sample dataset needs to be continuously improved. In the actual sample collection process, only some representative landslide types are usually considered [64], and the classifier in the trained model can then meet practical needs, especially when it is difficult to obtain a sufficient number of less common landslide samples. This choice is beneficial for conducting research, but it can also lead to some undesirable results. The more frequent landslides are more easily learned by the model, while landslides that are uncommon or have problems with the original data are easily misidentified, and the features learned by the model are more limited. In this case, it is necessary to combine geomorphological features and optical remote sensing features to supplement the judgement further.

Influence of the Input Scale on the Identification Results
The input scale plays a vital role in modern detection frameworks. It is reasonable to expect that an optimal image training scale exists, and clarifying this scale is important for improving deep learning performance. The resolution of the input images can be very high, and as a result, decision-makers come to depend heavily on image data analysis. However, due to memory and GPU capacity limitations, a whole remote sensing image cannot be directly used for detection or incorporated into segmentation frameworks. For example, Chen et al. [65] used the Mask R-CNN model for urban village recognition in whole optical images based only on cropping, without preserving the geographic coordinates of the image. This approach can complete the monitoring of large images but cannot export the identified targets in a standard format for dissemination and more in-depth study. This paper addresses oversized image recognition by splitting the high-resolution input into multiple subimages with a sliding window based on coordinate information, passing the subimages through the deep learning pipeline, and fusing the subimage results to generate the final result. This approach requires high computer hardware performance and a long processing time, and the strategy for cropping images also needs to be considered. Therefore, the next step of this work is to enable the fast acquisition of essential features from large images and the prediction of their results by using different sampling results of large images as input data for the network and constructing a corresponding optimization scheme for the detector.
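The coordinate-aware tiling described above can be sketched as follows. This is an illustration under assumptions, not the procedure implemented in this paper: the tile size of 512 pixels, the stride of 256 pixels, and all function names are hypothetical, and the six-element affine geotransform follows the GDAL convention (x origin, pixel width, row rotation, y origin, column rotation, pixel height).

```python
def tile_origins(length, tile, stride):
    """Start offsets that cover [0, length); the last tile is flush with the edge."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def tile_windows(width, height, tile, stride):
    """Windows (x_off, y_off, win_width, win_height) over the whole image."""
    return [(x, y, min(tile, width - x), min(tile, height - y))
            for y in tile_origins(height, tile, stride)
            for x in tile_origins(width, tile, stride)]


def pixel_to_geo(geotransform, col, row):
    """Map a full-image pixel position to map coordinates (GDAL-style affine)."""
    x0, dx, rx, y0, ry, dy = geotransform
    return (x0 + col * dx + row * rx, y0 + col * ry + row * dy)


def tile_to_geo(geotransform, window, col_in_tile, row_in_tile):
    """Map a pixel position inside one tile back to map coordinates."""
    x_off, y_off, _, _ = window
    return pixel_to_geo(geotransform, x_off + col_in_tile, y_off + row_in_tile)


windows = tile_windows(width=1000, height=600, tile=512, stride=256)
geotransform = (100.0, 0.5, 0.0, 50.0, 0.0, -0.5)  # made-up values
corner = tile_to_geo(geotransform, windows[-1], 10, 20)
```

Keeping the window offsets alongside each subimage is what allows a mask predicted in tile coordinates to be written back as a georeferenced polygon, for example in a shapefile. Overlapping tiles also need a merging rule for detections that appear in more than one tile, which this sketch leaves out.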

Potential Uncertainty of Geological Body Characteristics and Landslide Type Recognition
Currently, instance segmentation model improvement focuses on improving recognition accuracy and real-time performance [66], and application in specific tasks is still being attempted. In landslide research, the size and motion pattern of InSAR deformation patches of landslides can be used to determine their main hazard formation patterns [67] and further analyse and evaluate landslide hazards. However, landslide classification needs to be based on geometric features, motion patterns, and intensity.
The geometry of anomalously deformed slopes in InSAR observations is controlled by complex geological, topographic, and geomorphological conditions. The spatial contours of these slopes are often closely associated with different types of active landslides. For example, semicircular landslides are semicircular at the upper boundary and slightly open at the lower part, in the shape of a dustpan, which is a typical morphology of medium-sized landslides [68]. A rectangular landslide has a flat main scarp, and its overall length in the direction of sliding is smaller than its width; it spreads transversely, and its shape resembles a rectangle. A tongue-shaped landslide body is narrow in the upper part, wide in the lower part, and high in the middle, with a shape similar to a tongue, such as a translational landslide [69]. Irregular landslides are controlled by complex geological conditions, including topographic and geomorphological conditions. Such conditions form multistage landslides, multiple juxtaposed landslides, or irregular shapes.
Different forms of landslide movement also have their own characteristics in the InSAR deformation map; landslides can be divided into two basic types: retrogressive and thrust. For retrogressive landslides, the deformation at the toe of the slope is larger than that at the top, while the deformation of other parts is relatively small. The width of the landslide decreases from the back end to the front end. As two boundaries control the deformation area, the deformation in some parts of the boundary changes abruptly. In contrast, the thrust-type landslide body has a large deformation at the high main scarp and a blocked, weak deformation in the low area. The larger deformation areas are located near the landslide perimeter. This is because the rear section of a thrust-type landslide slides first under a significant sliding thrust, and the deformation of the rear geotechnical body then develops continuously forward and towards both sides.
The size of the InSAR deformation patch corresponds to the range of feature deformation. Small and medium landslides have a small surface deformation range, thus resulting in small deformation patches; therefore, they are at a disadvantage in identification. The opposite is true for large landslides, which develop over many years and have a larger overall surface deformation range. Moreover, large landslides are more likely to cause human and economic losses, and distinguishing them is of great significance in disaster recognition. The scientific, systematic, and practical classification of landslides can promote the understanding of landslides in the study area and provide an essential reference for future landslide field surveys, monitoring and early warning, emergency prevention and control, and other related research work. For example, Xu et al. [70] characterized the spatial and temporal distribution and deformation patterns of significant geological hazards, such as landslides, collapses, and debris flows, by using space-based InSAR and other monitoring means and established a comprehensive classification and early warning system based on deformation observation characteristics. This method can issue early warning information before geological hazards actually occur so that the residents in the danger zone can be evacuated to protect their lives and properties. Xie et al. [71] revealed the spatial and temporal distribution and deformation patterns of significant geological hazards in the upper reaches of the Minjiang River by using a combination of InSAR monitoring and in situ monitoring in 2020. This method can further reduce the loss of life and property caused by landslides within the Minjiang River Gorge. Rosi et al. [72] used satellite SAR data processed with the persistent scatterer InSAR (PSI) technique to produce density maps that highlight areas of different landslide densities and sizes. 
Therefore, the method constructed in this paper cannot identify all small slope bodies. However, because the method can accurately identify large and medium-sized landslides, it will effectively improve the efficiency of identifying typical active landslides in mountainous areas with greater confidence. The instance segmentation model can further distinguish active landslides with different geometric features, motion patterns, and intensities of deformation by distinguishing different targets in the same class, thereby improving deep learning for landslide recognition. Due to the grey box characteristics of deep learning, although the model can distinguish between different targets, the geological meaning of these targets cannot yet be given directly and needs to be further clarified by combining geology and statistics. The preliminary statistics reveal differences in the geometric features, motion patterns, and deformation intensity of the targets distinguished by the model. Moreover, according to the theoretical analysis, there is still room for further exploration; distinguishing more geological attributes of landslides will be studied in the next step.

Conclusions
How to use deep learning to recognize active landslides is a challenging but exciting question. When using deep learning as an automatic method for InSAR landslide mapping, specific research on its three major components, namely, the dataset, the algorithm, and computing power, is still needed. In this study, a Mask R-CNN+++ model is proposed for the rapid recognition of active landslides in some areas of the eastern edge of the Tibetan Plateau, where landslides are highly prevalent. The dataset is constructed with an automatic dataset generation procedure for raw images and vector files, which produces a landslide dataset (SLD) conforming to the COCO annotation format. The image instance segmentation Mask R-CNN model is optimized by introducing deformable convolutional layers, residual multibranch ResNext blocks, and attention mechanisms. Correspondingly, the model increases the fineness of the segmentation results, enhances its noise resistance, and improves the feature extraction ability of the network for geometrically changing objects. Moreover, the computational effort does not increase significantly, and the model identifies smoother landslide edges, adapts to a broader range of geological environments, and identifies more diverse landslide forms. The instance segmentation model can also distinguish different targets in the same category and differentiate active landslides with different scales and deformation degrees. Combined with landslide theory, we preliminarily reveal the differences in the geometric features, motion patterns, and intensity of landslides. This paper solves the mismatch between the input image size of the model and the size of the whole remote sensing image and realizes the rapid recognition of active landslides over an extensive range of high mountain valley areas. 
With this deep learning model, the identification of general actively deforming slopes in InSAR monitoring results over an extensive spatial range is characterized by a clear delineation of deformation boundaries, high efficiency, and good accuracy. However, the model tends to miss ancient landslide accumulations with blurred deformation patches and discontinuous, incomplete morphology, as well as small-scale deformation bodies; these need to be appropriately combined with geomorphological and optical remote sensing features for further supplementary judgement.