Automated Characterization of Yardangs Using Deep Convolutional Neural Networks

Abstract: The morphological characteristics of yardangs are direct evidence of wind and fluvial erosion of lacustrine sediments in arid areas. These features can be critical indicators in reconstructing local wind directions and environmental conditions. Thus, the fast and accurate extraction of yardangs is key to studying their regional distribution and evolution. However, existing automated methods to characterize yardangs are of limited generalization and may only be feasible for specific types of yardangs in certain areas. Deep learning methods, which are superior in representation learning, provide potential solutions for mapping yardangs with complex and variable features. In this study, we apply Mask region-based convolutional neural networks (Mask R-CNN) to automatically delineate and classify yardangs using very high spatial resolution images from Google Earth. The yardang field in the Qaidam Basin, northwestern China is selected for the experiments, and the method yields mean average precisions of 0.869 and 0.671 at intersection over union (IoU) thresholds of 0.5 and 0.75, respectively. Manual validation on images of additional study sites shows an overall detection accuracy of 74%, while more than 90% of the detected yardangs are correctly classified and delineated. We conclude that Mask R-CNN is a robust model to characterize multi-scale yardangs of various types and enables research into the morphological and evolutionary aspects of aeolian landforms.

The rapid development of remote sensing (RS) technologies has enabled the observation and measurement of yardangs at various spatial scales. De Silva et al. [22] studied the relationship between yardang morphologies and the material properties of the host lithology using multi-source RS data combined with field surveys.
In this study, we use the instance segmentation method to distinguish each individual yardang as a separate instance, and specifically adopt the Mask R-CNN model proposed by He et al. [39] to conduct the experiment. The Mask R-CNN is an extension of the Faster R-CNN [41], integrating the object detection and segmentation tasks by adding a mask branch to the architecture. This model can be described as a two-stage algorithm. Firstly, it generates candidate object bounding boxes on feature maps produced by the convolutional backbone. Secondly, it predicts the class of the object, refines the bounding box and generates a binary mask for each region of interest (RoI). In the remote sensing domain, the Mask R-CNN is usually applied to detect man-made objects such as buildings [42], aircraft [43] and ships [44]. However, it has hardly been used for terrain feature extraction. Zhang et al. [45] used this method to map ice-wedge polygons from VHSR imagery and systematically assessed its performance. They reported that the Mask R-CNN can detect up to 79% of the ice-wedge polygons in their study area with high accuracy and robustness. Chen et al. [46] detected and segmented rocks along a tectonic fault scarp using Mask R-CNN and estimated the spatial distributions of their traits. Maxwell et al. [47] extracted valley fill faces from LiDAR-derived digital elevation data using the Mask R-CNN. They used this method to map geomorphic features using LiDAR-derived data but reported poor transferability to photogrammetrically derived data.
It should be noted that in previous research, the spatial scales of the geomorphic objects and their shape irregularity did not vary significantly.
We conclude that there is a need to investigate these methods for aeolian landforms, and specifically for multi-scale yardangs of varying complexity. Therefore, the main purposes of this study are: (1) to automatically characterize and classify three types of yardangs using the Mask R-CNN model on VHSR remote sensing images provided by Google Earth; (2) to evaluate the accuracy and generalization capacity of the model by applying it to the test dataset and to extra image scenes at different sites. In this study, we separately assess the correctness of the detection, classification and delineation tasks, and analyze their performance at different spatial resolutions (0.6, 1.2, 2.0 and 3.0 m) of the images.

Study Area and Data
The Qaidam Basin, located in the northeastern part of the Tibetan Plateau, is a hyperarid basin with an average elevation of 2800 m above sea level. Covering an area of approximately 120,000 km², it is the largest intermontane basin, enclosed by the Kunlun Mountains in the south, the Altyn-Tagh Mountains in the northwest and the Qilian Mountains in the northeast [48]. The East Asian Monsoons, which extend to the southeastern basin, carry the main moisture source and cause a gradual reduction in annual precipitation from the southeast to the northwest, decreasing from 100 mm to 20 mm [7,25]. The northwesterly to westerly winds prevailing in the basin sculpt enormous yardang fields, covering nearly 1/3 of the whole basin [19,49]. The basin is also the highest-elevation yardang field on Earth.
The yardangs are mainly distributed in the central eastern and northwestern parts of the basin. The classification schemes for yardangs differ depending on the geographical region. In this study, we focused on three major types: long-ridge, mesa and whaleback yardangs, which are widely developed in the Qaidam Basin. Table 1 illustrates the characteristics by which they can be distinguished in remotely sensed images. In this study, we used very high spatial resolution images from Google Earth. Due to the vast and uneven distribution of yardangs, we first selected 24 rectangular image subsets of different areas that host representative yardangs (Figure 1). Ten of these image subsets, with a spatial resolution of 1.19 m, target areas with long-ridge yardangs, which are usually longer than 200 m. They are mainly distributed in the northernmost part of the basin. The other 14 subsets, with a resolution of 0.60 m, cover areas where a significant amount of mesa and whaleback yardangs are located. All of the image subsets are composed of R (red), G (green) and B (blue) bands and were downloaded using the LocaSpaceViewer software.

Annotated Dataset
The DL model used in this study requires rectangular image subsets as the input. In order to generate an adequate dataset that can be used to train and test the Mask R-CNN algorithm, all 24 image subsets were clipped into 512 × 512 pixel image patches with an overlap of 25% in their X and Y directions, respectively, using the Geospatial Data Abstraction Library (GDAL, https://gdal.org/ (accessed on 31 December 2020)). Image patches containing at least one yardang object with a clear and complete boundary were considered valid. Finally, 899 image patches were selected for further manual annotation. We used the "LabelMe" image annotation tool [50] to annotate all the yardang objects using polygonal annotations. For the instance segmentation step, every single object instance was required to be differentiated. Therefore, we distinguished each instance of the same class by adding sequence numbers after their class names. We delineated an approximately even number of yardang objects for long-ridge, mesa and whaleback yardangs (2568, 2539 and 2532 polygonal objects, respectively) to avoid class imbalance problems [51]. The annotations were saved in JavaScript Object Notation format (.json) and were converted into usable datasets using Python. These datasets were then randomly split into three sub-datasets in the proportion 8:1:1. The resulting training dataset consisted of 719 patches, while the validation and test datasets contained 90 patches each.
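The clipping and splitting steps above can be sketched in a few lines of Python. This is a minimal numpy illustration (the study used GDAL for the actual clipping), assuming each image subset is already loaded as an H × W × 3 array:

```python
import numpy as np

def clip_patches(image, size=512, overlap=0.25):
    """Clip an image array (H, W, 3) into size x size patches
    with the given fractional overlap in X and Y."""
    stride = int(size * (1 - overlap))  # 384 px for 25% overlap
    patches = []
    for y in range(0, image.shape[0] - size + 1, stride):
        for x in range(0, image.shape[1] - size + 1, stride):
            patches.append(image[y:y + size, x:x + size])
    return patches

def split_dataset(patches, seed=0):
    """Random 8:1:1 train/validation/test split of valid patches."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(patches))
    n_train = int(0.8 * len(patches))
    n_val = int(0.1 * len(patches))
    train = [patches[i] for i in idx[:n_train]]
    val = [patches[i] for i in idx[n_train:n_train + n_val]]
    test = [patches[i] for i in idx[n_train + n_val:]]
    return train, val, test
```

In practice, the patch-validity check (at least one complete yardang boundary) is applied between these two steps, so only annotated patches enter the split.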

Mask R-CNN Algorithm
The Mask R-CNN has been one of the most popular deep neural network algorithms and aims to perform object instance segmentation [39]. As an extension of the Faster R-CNN [41], a branch that outputs the object mask was added in the Mask R-CNN, resulting in much finer localization and extraction of objects. The Mask R-CNN is a two-stage procedure. In the first stage, the Residual Learning Network (ResNet) [52] backbone architecture extracts features from raw images, and the Feature Pyramid Network (FPN) [53] improves the representation of the object by generating multi-scale feature maps. Next, the Region Proposal Network (RPN) scans the feature maps and proposes regions of interest (RoI) which may contain objects. In the second stage, each proposed RoI is mapped to the relevant area of the corresponding feature map, from which RoIAlign extracts a fixed-size feature map. Then, for each RoI, the classifier predicts the class and refines the bounding box, whilst the FCN [33] predicts the segmentation mask. For a detailed discussion of the Mask R-CNN, readers can consult He et al. [39].
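The two-stage procedure can be summarized in pseudocode. The names (`resnet_backbone`, `fpn`, `rpn`, `roi_align`, and the per-RoI heads) are placeholders for the components described above, not runnable library calls:

```
def mask_rcnn_forward(image):
    # Stage 1: the backbone extracts features, the FPN builds
    # multi-scale feature maps, and the RPN proposes RoIs.
    feature_maps = fpn(resnet_backbone(image))
    rois = rpn(feature_maps)

    # Stage 2: RoIAlign crops a fixed-size feature map for each RoI;
    # the heads predict the class, refine the box, and output a mask.
    results = []
    for roi in rois:
        features = roi_align(feature_maps, roi)
        results.append((classify(features),      # yardang type
                        refine_box(features),    # refined bounding box
                        fcn_mask(features)))     # binary segmentation mask
    return results
```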

Implementation
In this study, we specifically used an open-source package developed by the Matterport team to implement the Mask R-CNN method [54]. The model is built on Python 3, Keras and TensorFlow, and the source code is available on GitHub (https://github.com/matterport/Mask_RCNN (accessed on 31 December 2020)). The experiments were conducted on a personal computer equipped with an Intel i5-8600 CPU, 8 GB of RAM and an NVIDIA GeForce GTX 1050 5 GB graphics card.
To train the model on our dataset, we created a subclass of the Dataset class and modified the dataset-loading function provided in the source code. We concluded from many studies [45,47,55–57] that using pre-trained weights learned from other tasks to initialize the model benefits training efficiency and results. Although different data and classes are used and targeted in other tasks (e.g., everyday object detection), some common and distinguishable features can be learned and applied to another disparate dataset (known as transfer learning) [30,56]. As a result, we used the pre-trained weights learned from the Microsoft Common Objects in Context (MS-COCO) dataset [58] to train our model instead of building the Mask R-CNN from scratch. In the training process, we adopted ResNet-101 as the backbone and used a learning rate of 0.002 to train the head layers for 2 epochs. This was followed by training all layers at a learning rate of 0.001 for 38 epochs. The default learning momentum of 0.9 and weight decay of 0.0001 were used for all epochs. We adopted 240 training and 30 validation steps per epoch with a mini-batch size of one image patch. We set all the losses (RPN class loss, RPN bounding box loss, Mask R-CNN class loss, Mask R-CNN bounding box loss and Mask R-CNN mask loss) to be equally weighted and kept the default configuration for the other parameters. The full description of parameter settings can be found in the above-mentioned Matterport Mask R-CNN repository.
Data augmentation is a strategy to artificially increase the amount and diversity of data and is commonly used to minimize overfitting when training neural networks [28,30,59–61]. Therefore, we applied random augmentations to the training dataset, including horizontal flips, vertical flips and rotations of 90°, 180° and 270°, using the imgaug library [62].
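As a hedged illustration, the random flips and right-angle rotations applied via imgaug can be reproduced with plain numpy. This sketch mirrors the augmentation set used here, not the actual imgaug pipeline:

```python
import numpy as np

def augment(patch, rng):
    """Randomly apply the augmentations used for training:
    horizontal flip, vertical flip, and a 0/90/180/270 degree rotation."""
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=0)  # vertical flip
    k = rng.integers(0, 4)              # number of 90-degree rotations
    return np.rot90(patch, k)
```

Note that for instance segmentation, the same geometric transform must also be applied to the annotation masks, which imgaug handles automatically.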
The validation dataset was used to avoid overfitting and to decide which model weights to use. Once a final model was obtained, it was used to detect yardangs in the test datasets and in other study sites. The detection confidence threshold was set at 70%, which meant that objects with confidence scores below 0.7 were ignored.
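The confidence filtering is a simple post-processing step. A minimal sketch, assuming each detection is represented as a dict with a `score` field (a hypothetical structure, not the Matterport output format):

```python
def filter_detections(detections, threshold=0.7):
    """Keep only detections whose confidence score meets the
    threshold used in this study (0.7)."""
    return [d for d in detections if d["score"] >= threshold]
```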

Accuracy Assessment
We assessed the trained Mask R-CNN model on the test dataset based on the mean average precision (mAP) over intersection over union (IoU) thresholds ranging from 0.50 to 0.95 in 0.05 increments. The mAP computes the mean average precision value for each category over multiple IoUs and represents the area under the precision-recall curve. The IoU is the ratio of the intersection area to the union area of the predicted mask and the ground-truth mask of an object [61,63]. This can be described by Equation (1), where A is the inferred yardang polygon and B is the manually digitized one:

IoU = area(A ∩ B) / area(A ∪ B)    (1)
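For binary masks, Equation (1) reduces to a ratio of pixel counts. A numpy sketch:

```python
import numpy as np

def mask_iou(pred, truth):
    """IoU of two binary masks: intersection area over union area."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter / union) if union else 0.0
```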
We calculated the IoU value for each image patch in the test dataset, while a confusion matrix was used to analyze the efficiency of the method on specific categories. In order to test the transferability and robustness of the Mask R-CNN method, we applied the model on 80 randomly selected image subsets (spatial resolution of 0.60 m and spatial extent of 1 km × 1 km), excluding the previously annotated datasets in the Qaidam Basin (Figure 2). We then manually assessed the accuracies of detecting, classifying and delineating yardangs. The criteria for the accuracy evaluation of the three sub-tasks are as follows. For the detection task, a true positive (TP) indicates a correctly detected yardang and a false positive (FP) indicates a wrongly detected yardang, which in fact belongs to the background. A missed ground-truth yardang is a false negative (FN). We then calculated the precision, recall and overall accuracy (OA) of the method for detecting yardangs using the equations below:

precision = TP / (TP + FP), recall = TP / (TP + FN), OA = TP / (TP + FP + FN)
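Under these definitions, the three detection metrics follow directly from the TP, FP and FN counts. A minimal sketch:

```python
def detection_scores(tp, fp, fn):
    """Precision, recall and overall accuracy from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    oa = tp / (tp + fp + fn)
    return precision, recall, oa
```

As a consistency check, the counts reported for the 0.6 m case study (3361 detections of which 447 are false positives, hence TP = 2914, and FN = 556) yield a precision of 0.87, recall of 0.84 and OA of 74%, matching the reported values.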

For classification and delineation tasks, we focused on those instances that were correctly detected as yardangs. A positive classification refers to a correct type of a detected yardang. Likewise, a positive delineation means the Mask R-CNN successfully outlined a yardang based on the interpreter's judgment. We then calculated the accuracies of these tasks.

Model Optimization and Accuracy
We recorded the learning curve of the model and evaluated its accuracy. As shown in Figure 3, the overall loss is the sum of the other five losses. The validation loss fluctuates and reaches its lowest value after 27 epochs (0.527). Although the training loss continues to decline, it does so very slowly, and no substantial improvement in validation loss is recorded in the following epochs. This indicates overfitting: after 27 epochs, the model tends to learn the detail and noise in the training data, which negatively affects the performance of detecting yardangs. Therefore, we selected the Mask R-CNN model that was trained for 27 epochs. Convergence after a small number of epochs was also noted in related works [45,47,57]. It can be attributed to the use of pre-trained weights and to the relatively small amount of training data for a specific object class, which offers fewer features to learn than a common benchmark dataset.
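Selecting the checkpoint with the lowest validation loss can be expressed as a one-liner. A sketch, assuming a list of per-epoch validation losses:

```python
def best_epoch(val_losses):
    """Return the 1-based epoch whose validation loss is lowest,
    i.e. the checkpoint kept to avoid overfitting."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1
```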
We applied the final model on the test dataset that contained 90 image patches including 768 yardangs (291 long-ridge, 263 mesa and 214 whaleback yardangs). The total numbers of predicted long-ridge, mesa and whaleback yardangs are 306, 285 and 197, respectively. The IoU measures how much the predicted yardang boundaries overlap with the ground truth. We obtained a mean IoU value of 0.74 for the test dataset, and 90% of the test patches yielded IoU values greater than 0.5 (Figure 4a).
Mean average precisions (mAPs) at different IoU thresholds are plotted in Figure 4b, where the mAP averaged over the 10 IoU thresholds is 0.588. The mAP50 (mAP at IoU = 0.50) is 0.869, indicating that the model performs well under a common standard for object detection. The mAP75 (mAP at IoU = 0.75) is 0.671 under a more stringent metric, showing the relatively high accuracy and effectiveness of this model. The normalized confusion matrix (Figure 4c) displays the fine classification results, showing that the three types of yardangs can be clearly distinguished. However, the prediction accuracy for whaleback yardangs is 0.76, the lowest of the three types. The morphological features of whaleback yardangs are more complex due to their wide variance in size and shape [7], which results in relatively more false and missed detections of this type.

Case Studies and Validation
To assess the transferability of this method, we conducted case studies using the 80 Google Earth image subsets described in Section 2.5, whose original spatial resolution is 0.6 m. We resampled these image subsets to three coarser spatial resolutions of 1.2 m, 2.0 m and 3.0 m using the nearest-neighbor method. We then applied the same trained model to the four groups of images with spatial resolutions from 0.6 to 3.0 m and analyzed its performance.
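The nearest-neighbor down-sampling can be sketched with index arithmetic in numpy (GDAL or similar raster tools would be used in practice):

```python
import numpy as np

def resample_nearest(image, src_res, dst_res):
    """Nearest-neighbor resampling from src_res to a coarser dst_res
    (metres per pixel), as done for the 1.2, 2.0 and 3.0 m cases."""
    scale = src_res / dst_res           # < 1 when coarsening
    h = int(image.shape[0] * scale)
    w = int(image.shape[1] * scale)
    rows = (np.arange(h) / scale).astype(int)  # nearest source row
    cols = (np.arange(w) / scale).astype(int)  # nearest source column
    return image[rows][:, cols]
```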

Case Study Results of 0.6 m Resolution Images
A total of 3361 yardangs are detected in this experiment, of which 447 are false positives, while 556 yardangs are missed. Table 2 shows the accuracies of detection, classification and delineation, where the overall accuracy (OA) of detection is 74%, with a precision and recall of 0.87 and 0.84, respectively. For successfully detected yardangs, the Mask R-CNN shows an ability to distinguish the fine classes of yardangs with a classification OA of 91%. The model can infer the outline of yardangs with a delineation OA of 95%, which shows that this method can be used for a precise characterization of yardangs. Figure 5 presents enlarged examples of classification and delineation results, with their locations marked in Figure 2. The long-ridge yardangs are arranged densely with very narrow spacing and show a wide range of lengths (from meters to hundreds of meters). Under these circumstances, there may be some missing instances or overlapping boundaries (Figure 5b,d). Although the shapes of mesa yardangs are mostly irregular, they can be well depicted because of their complete and clear boundaries (Figure 5f,h). For most of the whaleback yardangs, although their sizes vary widely, the delineation results are satisfactory. In some cases, the boundaries between the whaleback yardangs and their background can be very blurred and can thus cause some oddly shaped polygons (Figure 5j,l).

Case Study Results of 1.2 m Resolution Images
A total of 1939 yardangs are detected in this case study. Based on visual inspection, 189 of them are false positives. The number of false negatives is 1719, which means approximately half of the yardangs are not detected. The down-sampled images lose some key features for detecting small targets and, as expected, the detection recall decreases from 0.84 to 0.5. This results in a significant reduction of the detection OA from 74% to 48%. Table 3 provides the detection, classification and delineation accuracies. Compared to the original images, the OA of classification and delineation drop by 6% and 2%, respectively. Some long-ridge yardangs shorter than 50 m are missed, and some overlapping polygons are created because of the ambiguous boundaries between yardangs (Figure 6b,d). The delineation results of mesa yardangs are still satisfying, but several small residual yardangs are missed (Figure 6f,h). However, the detection of whaleback yardangs is drastically affected by the reduced spatial resolution. In Figure 6l, the number of detected whaleback yardangs decreases by nearly 80% compared to the result on 0.6 m spatial resolution images (Figure 5l). Most of the missed yardangs were shorter than 40 m.

Case Study Results of 2.0 m Resolution Images
The detection, classification and delineation accuracies for this case study are summarized in Table 4. Due to the increase in missed detections, the OA of detection drops to 37%. Similarly, the OA of classification decreases slightly, by 1% compared to the result on 1.2 m resolution images. It is found that 94% of all detected yardangs can be correctly delineated regardless of the resampling, which indicates the image segmentation stability of this model. In this case study, many small end-to-end long-ridge yardangs are detected as one long yardang as a result of blurred boundaries (Figure 7b,d). The mesa yardangs can still be well delineated, but more relatively small yardangs are missed (Figure 7f,h). The number of detected whaleback yardangs decreases as the image pixel size increases, and yardangs longer than 50 m are more likely to be detected (Figure 7j,l).

Case Study Results of 3.0 m Resolution Images
The numbers of true positives and false negatives equate to 1042 and 2426, respectively (Table 5), which means that only 30% of the ground-truth features are predicted on the 3.0 m resolution images. The OA of detection, classification and delineation are 29%, 83% and 93%, respectively. Even though the outlines of yardangs are still recognizable to human eyes, the loss of boundary information is significant and makes it hard for the Mask R-CNN to detect and depict them. Therefore, the reduction in detected yardangs happens for all types, especially for long-ridge yardangs (Figure 8b,d) and whaleback yardangs (Figure 8j,l), which are densely aligned and accompanied by many small yardang objects. Similar to the other case studies, the detection and delineation results of mesa yardangs are stable, benefitting from their relatively simple and clear morphological characteristics (Figure 8f,h).

Transferability
The four case studies evaluate the transferability of the Mask R-CNN with respect to image content and spatial resolution. Although the study sites are geographically close to the locations of the training data, given the complex and varied development environments of yardangs, they still exhibit variance in micro-landforms to different degrees. The trained Mask R-CNN still achieved a detection OA of 74% on 0.6 m spatial resolution images of the new study sites. The OA of the Mask R-CNN decreases as the pixel size of the image increases. However, the OA of detection, classification and delineation react differently to the increasing pixel size. The OA of detection drops by 45% (Figure 9a) when downsampling the images from 0.6 m to 3.0 m, a much greater degradation than the 8% reduction in classification OA (Figure 9b) and the 2% reduction in delineation OA (Figure 9c). This is mainly because the coarser resolution images lose key information for effectively separating yardangs from the background. Nonetheless, for those yardangs that have been successfully detected, their boundaries are clear enough to be depicted, and the differences in the geomorphic characteristics of the three types of yardangs remain significant.

Advantages and Limitations of Using Google Earth Imagery
Google Earth is a large collection of remotely sensed imagery including satellite and aerial images. The full coverage of the yardang fields in the study area and the availability of VHSR images make it possible to characterize yardangs with fine differentiation. The Google Earth images are acquired with different sensors and at different points in time, causing variance in the color and texture of yardangs across images. This introduces more diversity to the training data and helps the model learn more features of yardangs. However, the distribution of VHSR images in the study area is uneven. For example, the VHSR image subsets we used for annotation can be mosaics of images that have been resampled to the same resolution. Some of them are up-sampled from coarser images; in this case, the content and information of the images improve little, if at all, as the pixel size decreases. In addition, some yardangs partly blend into the background or are covered by gravel and sand. In such cases, it is hard for human interpreters to make complete and accurate annotations of yardangs on the images. As a result, the model may not sufficiently extract the detailed features of some yardangs during the training stage.

Advantages and Limitations of the Method
The results of this study, which focuses on characterizing aeolian yardangs, demonstrate that the Mask R-CNN can be used to extract multi-scale and multi-type terrain features synthetically. The model learns multi-level feature representations from the training data and could be successfully applied to the test dataset and other regions. Thanks to transfer learning, a notable advantage of deep learning, this method can potentially be applied to other yardang fields on Earth or even on other planets. To the best of our knowledge, this is the first study to use a DL method to automatically detect, classify and delineate yardangs on remotely sensed imagery. However, there are still some limitations.

Advantages and Limitations of Using Google Earth Imagery
Google Earth is a large collection of remotely sensed imagery including satellite and aerial images. The full coverage of the yardang fields in the study area and the availability of VHSR images make it possible to characterize yardangs with fine differentiations. The Google Earth images are acquired with different sensors and for different points in time, causing the variance in color and texture of yardangs in different images. This introduces more diversity to the training data and helps the model to learn more features of yardangs. However, the distribution of VHSR images in the study area is uneven. For example, the VHSR image subsets we used for annotation can be the mosaic results of images which have been resampled to the same resolution. Some of them are up-sampled from coarser images; however, the content and information of the images are not or very limitedly promoted as the pixel size of the image decreases. In addition, some yardangs are partly integrated with the background or are covered by gravel and sand. In such cases, it is hard for human interpreters to make an integral and accurate annotation of yardangs on images. As a result, the model may not sufficiently extract the detailed features of some yardangs during the training stage.

Advantages and Limitations of the Method
The results of this study, which focuses on characterizing aeolian yardangs, demonstrate that Mask R-CNN can be used to extract multi-scale and multi-type terrain features synthetically. The model learns multi-level feature representations from the training data and can be successfully applied to the test dataset and other regions. Thanks to transfer learning, a notable advantage of deep learning, this method can potentially be applied to other yardang fields on Earth or even on other planets. To the best of our knowledge, this is the first study that uses a deep learning method to automatically detect, classify and delineate yardangs on remotely sensed imagery. However, there are still some limitations.
Methodologically, the performance of Mask R-CNN largely depends on the quantity and quality of the training data, which require massive time and labor inputs for data selection and labeling. In addition, as a supervised learning method, it has a large number of hyperparameters that influence model performance. Given the computing hardware available for this study, fully optimizing the Mask R-CNN model was not feasible. We adjusted some of the settings to make this heavyweight model easier to train on a smaller GPU. Despite this, the training process still took around 10 hours for 40 epochs, much longer per experiment than traditional machine learning methods require, which made comparative studies extremely costly. Therefore, the optimization relied on a limited series of experiments tuning a small number of model parameters. Based on the case studies, we analyzed the effect of spatial resolution on mapping yardangs and found that the overall accuracy (OA) of detection decreases as images are downsampled (Figure 9). Due to the relatively high spatial resolution (0.60 m and 1.19 m) of the training data, the prediction performance was very poor on coarser images. This indicates that the model requires a certain consistency between the training data and the prediction data.
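The resolution experiment described above coarsens the 0.6 m image subsets to 1.2, 2.0 and 3.0 m using nearest-neighbor resampling. A minimal numpy sketch of that operation is shown below; the function name `nearest_downsample` is illustrative, not part of the study's code, and production pipelines would typically use a raster library instead.

```python
import numpy as np

def nearest_downsample(img: np.ndarray, src_res: float, dst_res: float) -> np.ndarray:
    """Resample an image array (H, W[, C]) from src_res to a coarser dst_res
    (meters per pixel) by picking the nearest source pixel for each output pixel."""
    scale = dst_res / src_res            # e.g. 0.6 m -> 3.0 m gives a factor of 5
    h, w = img.shape[:2]
    new_h, new_w = int(h / scale), int(w / scale)
    rows = (np.arange(new_h) * scale).astype(int)   # nearest source row per output row
    cols = (np.arange(new_w) * scale).astype(int)   # nearest source column per output column
    return img[rows][:, cols]

# A 10x10 chip at 0.6 m coarsened to 3.0 m keeps one pixel in five along each axis
tile = np.arange(100).reshape(10, 10)
coarse = nearest_downsample(tile, 0.6, 3.0)
```

Because no new pixel values are interpolated, nearest-neighbor resampling preserves the original radiometry, which is why it is a common choice when only the sampling density should change.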
Another crucial factor that affects the model performance lies in the yardangs' intrinsic feature complexity. To improve the quality of the training data, we selected representative yardangs with very distinctive features. We also purposely balanced the number of annotated instances for the three types of yardangs to avoid potential biases, i.e., to prevent the model from becoming more sensitive to a certain class. Although we collected training data from different sites and used data augmentation, the training dataset is not fully adequate considering the wide range of possible variations of yardang shapes. When applied to regions that host atypical yardangs, the prediction results are usually unsatisfactory. For example, some long-ridge yardangs show an ambiguous morphology: their upwind sides are eroded into small ridges with tapered heads, while their downwind sides gradually widen and converge without clear boundaries [7]. Under such circumstances, Mask R-CNN may delineate only parts of the yardang or may generate a bizarre polygon that does not appropriately represent its direction and geometry. This also happens to whaleback yardangs when they occur in dense arrangements that are hard to identify separately. The surfaces of yardangs are not always flat and smooth. In some places, remnant cap layers sit on the top surfaces of long-ridge and mesa yardangs. Mask R-CNN sometimes recognizes these portions of a whole yardang as small yardangs and thus generates several intersecting or nested polygons, which does not reflect the real situation.
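Data augmentation of the kind mentioned above must transform the image and its instance annotations together, otherwise the masks drift out of alignment with the augmented image. A minimal numpy sketch of one common augmentation, a joint horizontal flip, is given below; `hflip_instance` is a hypothetical helper for illustration, not the augmentation code used in the study.

```python
import numpy as np

def hflip_instance(img: np.ndarray, mask: np.ndarray):
    """Horizontally flip an image chip and its binary instance mask together,
    so the annotation stays registered to the augmented image."""
    return img[:, ::-1].copy(), mask[:, ::-1].copy()

img = np.zeros((4, 4), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 0:2] = 1                      # a yardang instance in the left half
img_f, mask_f = hflip_instance(img, mask)
# after the flip the instance occupies the right half of the mask
```

Flips, rotations and similar label-preserving transforms enlarge the effective training set without new annotation effort, though, as noted above, they cannot fully cover the natural variability of yardang shapes.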
Although incredibly slow, the development of yardangs is a dynamic process, and it is common for terrain features to have gradational boundaries in transitional zones, which raises new issues for the precise classification of yardangs. For instance, the downwind parts of some long-ridge yardangs are cut off by fluvial erosion and gradually turn into mesa yardangs [20]. Once separated blocks are observed, interpreters recognize them as mesa yardangs. However, some of these yardangs retain the features of long-ridge yardangs and mislead Mask R-CNN into predicting them as long-ridges with relatively shorter lengths. This can be explained by the fact that region-based convolutional neural network (R-CNN) models use only the information within a specific proposal region of an image and ignore contextual information, which is also crucial for object detection tasks [40]. Specifically, yardangs are spatially clustered, and the visual identification of an ambiguously featured yardang would take the landscape pattern and the types of surrounding yardangs as references. These global and local surrounding contexts are not considered in Mask R-CNN. As a result, the model generated slightly confused classification results for yardangs in such situations.

Recommendations and Future Work
A limited amount of training data has always been a large obstacle to applying deep learning in the remote sensing domain, especially for the extraction of natural terrain features [36,37]. Therefore, a large image dataset covering aeolian landforms with labeled data from multiple spatial resolutions and different sources is needed. A model pre-trained on such thematic data may improve the performance of transfer learning compared to pre-trained weights from everyday photographs. Mask R-CNN shows its advantage in extracting objects with unique color and textural features [42,45,57,64] from optical images. However, the elevation information of yardangs, an important parameter for distinguishing between different types of yardangs and for differentiating them from other classes, is not involved in this study. Previous studies verified the high potential of CNN-based deep learning models for learning features from digital elevation data and other terrain derivatives [37,47,65]. Therefore, more attention should be given to additional information (e.g., DEMs) or to combined representations that integrate multi-source data as inputs for deep learning models. This can foreseeably improve the detection and delineation of yardangs whose spectral and geometric characteristics are similar to those of other aeolian landforms. Further ways to improve terrain feature mapping could be the implementation of unsupervised [66], semi-supervised [67] and weakly-supervised [68] deep learning models for remote sensing image classification and segmentation. These approaches are an efficient way to extract features from unlabeled data and can break through the limitations caused by the scarcity of annotation data.
To verify the effectiveness of our method for characterizing yardangs, future work should conduct comparative experiments with different algorithms, including object-based image analysis (OBIA) [69] and other conventional machine learning methods. We also need to compare different CNN-based semantic segmentation methods such as UNet and DeepLabv3+, taking into account recent progress in the computer vision domain. As black-box models, deep neural networks learn features that are not transparent, and their prediction results are usually of poor interpretability. Additional refinement of the current models, such as introducing domain knowledge into the training process, may alleviate this problem.

Conclusions
In this project, we apply Mask R-CNN, a deep learning model for instance segmentation, to the automated characterization of yardangs. The experimental results indicate that Mask R-CNN can successfully learn the features of yardangs from the training dataset. In particular, the method can efficiently localize yardangs, depict their boundaries and classify them into the correct fine categories simultaneously during the prediction stage. Our method yields a mean IoU of 0.74 on the testing dataset and a mean average precision of 0.869 at an IoU threshold of 0.5. Based on visual inspection, the automatically delineated boundaries of yardangs are very similar to the manually annotated ones and accurate enough to support further morphological analysis. When applied to additional images with a wider geographical extent and complex backgrounds, Mask R-CNN shows good transferability and can detect up to 74% of yardangs on 0.6 m spatial resolution images. The model trained on VHSR images has the potential to be applied to coarser images; however, detection accuracy is more sensitive to changes in image resolution than the classification and delineation tasks.
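The IoU figures quoted above compare a predicted instance mask against its ground-truth annotation; a prediction counts as a true positive at a given threshold only if the overlap exceeds it. A minimal numpy sketch of that computation follows (the `mask_iou` helper is illustrative; published benchmarks typically rely on standard evaluation toolkits):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 0.0

pred = np.zeros((10, 10), bool); pred[2:8, 2:8] = True   # predicted mask, 36 px
gt   = np.zeros((10, 10), bool); gt[4:10, 4:10] = True   # ground truth, 36 px
iou = mask_iou(pred, gt)        # overlap 16 px, union 56 px
matched = iou >= 0.5            # below the 0.5 threshold: not a true positive
```

Raising the threshold from 0.5 to 0.75 demands much tighter boundary agreement, which is why the mean average precision drops from 0.869 to 0.671 between the two settings.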
The extraction ability of the model differs slightly among the three types of yardangs. It tends to yield better results for mesa yardangs, which have less complicated forms than long-ridge and whaleback yardangs. Since we focus on the most representative yardangs in this study, further studies are needed on the characterization of yardang features with variable complexities. Future work should also include a systematic analysis of model optimization and comparative studies between deep learning and other methods. We hope that largely automated and accurate yardang characterization based on deep learning will support future scientific studies on yardangs. The resulting data should support geomorphometric and spatial distribution studies of yardangs and other aeolian landforms.