Deep Convolutional Neural Networks for Automated Characterization of Arctic Ice-Wedge Polygons in Very High Spatial Resolution Aerial Imagery

Abstract: The microtopography associated with ice-wedge polygons governs many aspects of Arctic ecosystem, permafrost, and hydrologic dynamics from local to regional scales owing to the linkages between microtopography and the flow and storage of water, vegetation succession, and permafrost dynamics. Widespread ice-wedge degradation is transforming low-centered polygons into high-centered polygons at an alarming rate. Accurate data on the spatial distribution of ice-wedge polygons at a pan-Arctic scale are not yet available, despite the availability of sub-meter-scale remote sensing imagery. This is because the necessary spatial detail quickly produces data volumes that hamper both manual and semi-automated mapping approaches across large geographical extents. Accordingly, transforming big imagery into 'science-ready' insightful analytics demands novel image-to-assessment pipelines that are fueled by advanced machine learning techniques and high-performance computational resources. In this exploratory study, we tasked a deep-learning-driven object instance segmentation method (i.e., the Mask R-CNN) with delineating and classifying ice-wedge polygons in very high spatial resolution aerial orthoimagery. We conducted a systematic experiment to gauge the performance and transferability of the Mask R-CNN across spatial resolutions (0.15 m to 1 m) and image scene contents (a total of 134 km²) near Nuiqsut, Northern Alaska. The trained Mask R-CNN reported mean average precisions of 0.70 and 0.60 at thresholds of 0.50 and 0.75, respectively. Manual validations showed that approximately 95% of individual ice-wedge polygons were correctly delineated and classified, with an overall detection accuracy of 79%. Our findings show that the Mask R-CNN is a robust method to automatically identify ice-wedge polygons from fine-resolution optical imagery. Overall, this automated imagery-enabled intense mapping approach can provide a foundational framework that may propel future pan-Arctic studies of permafrost thaw, tundra landscape evolution, and the role of high latitudes in the global climate system.


Introduction
Polygonal topography typical of the Circum-Arctic permafrost region is associated with ice wedges, which develop through repeated frost (or thermal contraction) cracking and the formation of ice veins as the cracks fill with meltwater in springtime; this process occurs over hundreds to thousands of years and leads to the formation of large massive-ice bodies [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. Wedge ice is the most common type of massive ground ice in the permafrost region. While the vertical extent of wedge-shaped epigenetic ice wedges does not exceed the depth of frost cracking (usually 2 to 4 m), syngenetic wedges that have formed during continuing sedimentation may have irregular shapes and can be very large. For example, ice wedges in yedoma (ice-rich syngenetic permafrost formed in the late Pleistocene within unglaciated areas of Eurasia and North America) can reach up to 10 m in width and more than 40 m in vertical extent [11,15,16].
Frost cracking and the subsequent development of ice wedges create a network of polygons that forms the archetypal polygonal patterned tundra occupying a large portion of the Arctic. Leffingwell [17] described two major types of ice-wedge polygons (IWPs): polygons with depressed blocks and polygons with elevated blocks. Later, these types were renamed low-centered (LCP) and high-centered (HCP) polygons, respectively, and since at least the 1950s, many authors have used these terms to distinguish the major types of polygons [2,14,[18][19][20][21][22][23][24][25][26][27]. In some publications, a third type, flat-centered (FC) polygons, has been added [28][29][30][31]. Drew and Tedrow [32] and Mackay [33] suggested more complicated classifications of IWPs based on their morphology; both classifications include six types of polygons. Some authors have described "mixed" polygons, most notably polygons transitional from LCP to HCP [22,23,34]. In this study, we focus on the two major types, LCP and HCP polygons, which are widely represented within the Circum-Arctic permafrost region.
Low- and high-centered polygons have very distinctive features that can be relatively easily detected in satellite imagery and aerial photos. LCP polygons are framed by elevated rims that develop above actively growing ice wedges; sometimes the centers of LCP polygons contain shallow ponds. HCP polygons have elevated (dry) centers and well-developed troughs over the ice wedges, where the troughs are often filled with water. LCP polygons are typical of aggrading stages of ice-wedge formation, while HCP polygons usually indicate partly degraded ice wedges [20,35]. LCP polygons are commonly observed within relatively young terrain units (e.g., young drained-lake basins, floodplains) with actively growing ice wedges and are usually of larger diameter than HCP polygons, which prevail primarily within older terrain units (e.g., yedoma, old drained-lake basins) [36]. The microtopography associated with IWPs governs many aspects of permafrost [34,35,37,38], vegetation [22,25,39,40], and hydrologic dynamics [27,41,42], and Arctic ecosystems in general [30,32,35,38], at plot-to-local (1-100 m), landscape (100 m-10 km), and regional (10-1000 km) scales, mainly due to the influence of polygon type on the flow and storage of water [27,41].
Through satellite imagery, aerial photos, and ground observations, large-scale ice-wedge degradation has been observed across the Arctic, and in many places this degradation has transformed LCP polygons into HCP polygons in less than a decade [27,29,37,38,[43][44][45][46][47][48][49][50][51][52][53][54][55][56]. In most cases, ice-wedge degradation, which is extremely hazardous for both the environment and infrastructure, has been triggered by climatic fluctuations [27,38], wildfires [57], human activities [48], or any other factors that lead to an increase in active-layer thickness. Degradation of ice wedges is a quasi-cyclic process that includes five main stages: undegraded, degradation-initial, degradation-advanced, stabilization-initial, and stabilization-advanced [34,37,38]. In the continuous permafrost zone, accumulation of organic matter in the troughs developing on top of degrading wedges eventually leads to a decrease in active-layer thickness and the formation of an ice-rich soil layer, protecting the ice wedges from further degradation. However, under certain conditions, ice-wedge degradation may result in complete melting of the ice wedges and the formation of large thermokarst lakes [34,48]. The probability of such a transformation is higher in areas with warmer permafrost, like the Seward Peninsula in Alaska [58]. Understanding the spatiotemporal dynamics behind the evolution of ice-wedge polygonal tundra demands objective and detailed maps consolidating ice-wedge extent and successional stage [56,59,60].
Vast geographical extent, remoteness, logistical challenges, and high cost retard field-based mapping of IWPs at large scales. Remote sensing (RS) provides transformational opportunities to observe, monitor, and measure the Arctic polygonal landscape at multiple spatial scales. For example, Pina et al. [76] developed an algorithm to segment terrain polygons on the surface of Mars using the dynamics of watershed contours with high spatial resolution images acquired by the Mars Orbiter Camera onboard Mars Global Surveyor. Pina et al. [77] and Bandeira et al. [78] further used the geometrical and topological information in Pina et al. [76] to analyze the polygonal networks.
To the best of our knowledge, no study has so far developed a fully automated and scalable method capable of accurately detecting and characterizing IWPs (polygon types: LCP and HCP) from VHSR imagery over large geographical areas (e.g., landscape to pan-Arctic) in an operational context. Developing automated methods to map ice-wedge polygonal tundra across the pan-Arctic requires thinking beyond manual/conventional image interpretation approaches and, instead, exploiting 'out-of-the-box' sophisticated image processing techniques. Recently, deep learning (DL) [79] has shown great potential for object instance segmentation, i.e., detecting and delineating each distinct object in an image [80], on common objects in everyday pictures. While DL is securing a solid maturity in computer vision applications, its potential is still at the exploratory stage in the context of remote sensing image classification. Previous studies have used DL with RS imagery to extract geographical information. For example, Penatti et al. [81] conducted the first study to test the transferability of convolutional neural network (CNN) approaches from everyday object detection to remote sensing image classification and found that pre-trained CNN models can conceivably be repurposed for remote sensing scenes. The underlying rationale is that low-level and mid-level descriptors of both everyday objects and geo-objects exhibit quite similar behaviors, even though they finally attain drastically different class labels. Marmanis et al. [82] proposed a CNN-based semantic segmentation method for VHSR aerial images that outputs the boundaries between classes and tested it on the ISPRS Potsdam semantic labeling dataset. Ding et al. [83] developed an improved Faster R-CNN framework with a VGG16-Net backbone for object detection in Google Earth optical remote sensing images. Gevaert et al. [84] applied a fully convolutional network (FCN) to distinguish non-ground and ground objects in VHSR imagery. Accordingly, automated computer vision techniques have reached a stage that, if adopted, could meaningfully inform the natural sciences. Simultaneously, high-resolution (>0.25 m) imagery collected by DigitalGlobe has been available for over five years to the polar science community through a license with the U.S. Federal Government [85].
Designing an automated ice-wedge polygon mapping tool is the first step toward allowing the Arctic science community to characterize the transforming continuous tundra landscape. The current study builds on our ongoing Arctic permafrost research project, which aims to understand the complex and interlinked processes responsible for the evolution of the pan-Arctic ice-wedge polygon tundra landscape. While polar geoscience stands at the precipice of a revolution enabled by petabytes of very high spatial resolution satellite imagery, the distribution of ice-wedge polygons and their status are still largely unknown. Most of the large image repositories are still used for reconnaissance and manual interpretation purposes. This reflects a prevalent methodological vacuum between 'big imagery' and Arctic science products. Given the sheer data volumes interlaced with scene complexities, conventional remote sensing image classification methods fail to fill this vacuum. Our overarching goal is to capitalize on 'image-to-assessment' pipelines that are catalyzed by deep learning and high-performance computational resources. Recently, He et al. [86] showed that the Mask R-CNN algorithm outperforms state-of-the-art instance segmentation methods such as MNC [87] and FCIS [88] (MNC and FCIS won the COCO 2015 and 2016 segmentation challenges, respectively). Although the Mask R-CNN has shown superior performance in everyday object detection in computer vision applications, only limited studies/projects have applied the Mask R-CNN in the RS domain. For example, Pešek [89] developed an open-source GIS tool that allows users to train their own network to produce vector masks from RS imagery. In 2018, Remillard launched a project, "Images to OpenStreetMap (OSM)", focused on delineating baseball, soccer, tennis, football, and basketball fields from Microsoft's Bing imagery and adding the delineated objects to OSM (https://github.com/jremillard/images-to-osm). In the same year, crowdAI started a "Mapping Challenge" aimed at translating RS imagery into thematic maps (https://github.com/crowdAI/crowdai-mapping-challenge-mask-rcnn). The main goal of this study is to explore the potential of a state-of-the-art DL CNN method (the Mask R-CNN) for characterizing the tundra ice-wedge polygon landscape. We systematically assessed the performance and transferability of the Mask R-CNN in the detection, delineation, and classification of IWPs by repeating the same inference procedure on image scenes of four different spatial resolutions (0.15, 0.25, 0.5, and 1 m) over two study sites near the community of Nuiqsut, Northern Alaska. We conducted a three-step quantitative assessment and detailed visual inspections to assess the correctness and completeness of the classification results. To assess transferability, we trained the Mask R-CNN using only annotations from the 0.15 m resolution VHSR imagery of the first study site (Nuiqsut) and then used the trained Mask R-CNN to make instance segmentations on the resampled 0.5 m and 1 m resolution VHSR imagery of Nuiqsut and on the 0.25 m resolution VHSR imagery of the second study site (Crea Creek).

Study Area and Image Data
Our study encompasses two areas near the northern Alaska community of Nuiqsut, which is located within the National Petroleum Reserve-Alaska, about 25 km inland from the Beaufort Sea [90] (Figure 1). The two study areas are referred to here as 'Nuiqsut' (42 km²) and 'Crea Creek' (92 km²). The study areas are generally characterized by low-gradient Arctic tundra and include large features such as lakes and vegetated drained thaw-lake basins along with the typical LCPs and HCPs. The VIS-NIR aerial images and LiDAR point clouds were acquired in September 2013 using a fixed-wing aircraft [91][92][93]. The original orthorectified images have resolutions of 0.15 m (Nuiqsut) and 0.25 m (Crea Creek), in the NAD_1983_StatePlane_Alaska_4_FIPS_5004_Feet and NAD83_Alaska_Albers coordinate systems, respectively. We projected the images into the polar stereographic coordinate system. We synthesized two additional image scenes from the original Nuiqsut scene by downsampling with the conventional nearest-neighbor resampling technique.

Annotated Data
We selected 340 subsets, each with dimensions of 600 × 600 pixels (i.e., 90 m × 90 m), from the false-color composite of the Nuiqsut image (Figure 2) for manual annotation. We chose the 600 × 600 pixel dimensions based on two considerations: (1) to maximize the number of IWPs per subset; and (2) to minimize errors in the manual annotation process, as subsets that are too small or too large are difficult to annotate. To avoid class imbalance, we annotated roughly equal numbers of HCP and LCP polygons (3728 and 3764 polygonal objects, respectively). We used the "VGG Image Annotator" web tool (http://www.robots.ox.ac.uk/~vgg/software/via/via.html) to annotate training samples for object instance segmentation and saved the training data in JavaScript Object Notation (.json) format, which is also the format used by other machine learning training datasets, such as COCO (http://cocodataset.org/#home). Finally, we randomly split the 340 annotated subsets into three sub-datasets based on an 80:10:10 split: a training dataset (272 subsets (Figure 2a)), a validation dataset for minimizing overfitting (33 subsets (Figure 2b)), and a test dataset for evaluating the performance of the trained DL algorithm (35 subsets (Figure 2c)).
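For illustration, the short sketch below parses a VIA JSON export and reproduces an 80:10:10 split; the file name and the "label" attribute key are assumptions for illustration rather than the project's actual settings.

```python
import json
import random

# Load a VIA project export; the top-level dict maps "<filename><filesize>"
# keys to records holding the image filename and its annotated regions.
with open("via_region_data.json") as f:  # hypothetical file name
    annotations = list(json.load(f).values())

random.seed(42)
random.shuffle(annotations)

# 80:10:10 split into training, validation, and test subsets.
n = len(annotations)
train = annotations[:int(0.8 * n)]
val = annotations[int(0.8 * n):int(0.9 * n)]
test = annotations[int(0.9 * n):]

# Each region carries polygon vertices plus a class label. VIA 2.x stores
# regions as a list, VIA 1.x as a dict, so handle both layouts.
for a in train[:1]:
    regions = a["regions"].values() if isinstance(a["regions"], dict) else a["regions"]
    for r in regions:
        xs = r["shape_attributes"]["all_points_x"]
        ys = r["shape_attributes"]["all_points_y"]
        label = r["region_attributes"].get("label")  # assumed key, e.g., "LCP"/"HCP"
        print(a["filename"], label, len(xs), "vertices")
```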

General Workflow
Our automated imagery-based ice-wedge polygon extraction workflow rests on four key steps (Figure 3): (1) division of VHSR imagery into overlapping patches; (2) object instance segmentation of the input patches; (3) mask-to-polygon conversion; and (4) elimination of duplicate polygons to compose a unique polygon set. Two block sizes of 600 × 600 pixels and 360 × 360 pixels were used to partition the VHSR scenes of Nuiqsut and Crea Creek, respectively, with an overlap of 20%, using the Geospatial Data Abstraction Library (GDAL, http://www.gdal.org/). It is worth noting that the two block sizes were selected separately in order to match the scale of the annotated data (i.e., 90 × 90 m). The DL algorithm performed the object instance segmentation, producing predicted binary masks with classification information. Finally, we cleaned the output polygons (e.g., removed duplicates) using scikit-image (http://scikit-image.org/) and GDAL.
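As a concrete illustration of step (1), the sketch below tiles a scene into overlapping patches with GDAL; the block size and overlap follow the values above, while the function and output names are ours.

```python
from osgeo import gdal

def split_scene(path, block=600, overlap=0.2):
    """Divide a VHSR scene into overlapping square patches, e.g., 600 x 600 px
    blocks with 20% overlap as used for the Nuiqsut scene."""
    src = gdal.Open(path)
    step = int(block * (1 - overlap))  # 480 px stride for a 600 px block
    patches = []
    for yoff in range(0, src.RasterYSize, step):
        for xoff in range(0, src.RasterXSize, step):
            xsize = min(block, src.RasterXSize - xoff)
            ysize = min(block, src.RasterYSize - yoff)
            # gdal.Translate crops the window and preserves georeferencing,
            # so every patch can be mapped back to scene coordinates later.
            out = f"patch_{xoff}_{yoff}.tif"
            gdal.Translate(out, src, srcWin=[xoff, yoff, xsize, ysize])
            patches.append(out)
    return patches
```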


Deep Learning Algorithm
We chose the state-of-the-art Mask R-CNN method [86] to implement the object instance segmentation due to its simplicity and effectiveness [94]. The Mask R-CNN extends the Faster R-CNN (a fast and effective algorithm for object detection [95]) by adding a branch for predicting masks for distinct objects [86]. Methodologically, the Mask R-CNN is a two-stage algorithm: (1) it generates proposals (i.e., candidate object bounding boxes) after scanning the image; and (2) it predicts the class, bounding box, and binary mask for each region of interest (RoI) [86]. In terms of structure (Figure 4), the Mask R-CNN mainly consists of (1) a backbone Residual Learning Network (ResNet) [96] for feature extraction; (2) a Feature Pyramid Network (FPN) [97] for improving the representation of objects at multiple scales; (3) a Region Proposal Network (RPN) for generating RoIs; (4) an RoI classifier for class prediction of each RoI; (5) a Bounding Box Regressor (BBR) for refining RoIs; and (6) an FCN [98] with RoIAlign [86] and bilinear interpolation for predicting pixel-accurate masks. A deeper discussion of the Mask R-CNN algorithm is beyond the scope of this study; thus, we refer readers to He et al. [86] for a detailed discussion of its mathematical basis.
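To make the role of RoIAlign concrete: it samples feature maps at fractional coordinates via bilinear interpolation instead of rounding them to grid cells. The following self-contained numeric sketch (ours, not code from [86]) shows that core operation.

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location: the operation
    RoIAlign relies on to avoid quantizing RoI coordinates."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * fmap[y0, x0] +
            (1 - wy) * wx * fmap[y0, x1] +
            wy * (1 - wx) * fmap[y1, x0] +
            wy * wx * fmap[y1, x1])

# Each RoIAlign output bin averages a few such samples taken inside the bin.
fmap = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.25))  # 8.25, blended from four neighbors
```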

Accuracy Assessment
We conducted a three-step assessment of the trained Mask R-CNN. First, we assessed the mean average precision (mAP: the mean of the average precision values of each class) of the trained Mask R-CNN on the hold-out test dataset. Second, we randomly selected an additional 30 subsets (Figure 5) for each study site (including non-IWP samples and excluding the previously selected 340 subsets for Nuiqsut) and manually validated the accuracies of detection, delineation, and classification. We evaluated these accuracies based on the following criteria: a positive detection indicates an IWP is correctly detected by the Mask R-CNN, and a negative detection indicates an IWP is not correctly detected; likewise, a positive delineation indicates the Mask R-CNN successfully infers the outline of an IWP based on the interpreter's judgement, and a positive classification indicates the Mask R-CNN infers the correct type of a detected IWP. Our manual validation included the following steps: in Step (1), we created 30 random square polygons (100 × 100 m) for each study area with 12 attributes in a shapefile database (true-positive, false-positive, true-negative, and false-negative for each of detection, delineation, and classification) to assess the commission and omission errors; in Step (2), we manually counted all IWPs within or crossing the boundaries of the validation square polygons; and in Step (3), we filled in the number for each attribute based on the manual counting. Third, we corroborated the quantitative assessments with detailed visual inspections by coupling imagery and LiDAR DTMs.
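The per-task metrics follow directly from those counts; the helper below sketches the arithmetic, with hypothetical example counts.

```python
def validation_metrics(tp, fp, tn, fn):
    """Precision, recall, and overall accuracy for one task (detection,
    delineation, or classification) from the per-square tallies."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # penalizes commission errors
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # penalizes omission errors
    oa = (tp + tn) / (tp + fp + tn + fn)              # overall accuracy
    return precision, recall, oa

# Hypothetical counts summed over the 30 validation squares of one site:
print(validation_metrics(tp=230, fp=5, tn=10, fn=60))
```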

Implementation
We implemented the Mask R-CNN method using an open-source package built on Keras and TensorFlow, developed by Matterport on GitHub [99]. The code is available on GitHub (https://github.com/matterport/Mask_RCNN). We conducted experiments on a customized server equipped with an Intel i5 CPU, 16 GB RAM, a GeForce GTX 970 graphics card, and a GeForce GTX 1080 Ti graphics card. In the training process, a graphics processing unit (GPU) (the GeForce GTX 1080 Ti) was used to train the package's ResNet-101 (101 layers) backbone Mask R-CNN with a mini-batch size of 2 images, 272 steps per epoch, a learning rate of 0.001, a learning momentum of 0.9, and a weight decay of 0.0001. The accessibility of this open-source package was the main reason we chose the ResNet-101 rather than the ResNet-152 backbone. We modified the dataset-loading function for our customized training data and left the other parameters at their default settings. For a full description of the parameter settings of the Mask R-CNN submodels (e.g., ResNet, RPN, BBR), we refer readers to the model part of the Mask R-CNN repository on GitHub (https://github.com/matterport/Mask_RCNN). Instead of building the Mask R-CNN from scratch, we trained our model using the pre-trained COCO weights, because COCO provides a large amount of training data from which the Mask R-CNN has learned common and discriminative features (known as transfer learning). To minimize overfitting, we used the validation dataset to select the most generalized Mask R-CNN. Additionally, random horizontal-flip augmentation was used to introduce variety into the training data. In the inference process, we set the detection confidence threshold at 70% (i.e., detections with confidence less than 70% were ignored). Two GPUs were used to accelerate the inference of input patches with 20% overlap using a data processing queue. As a result, the 0.15 m aerial imagery of Nuiqsut (63,103 × 62,584 pixels; a total of 8134 patches of 600 × 600 pixels) could be processed in approximately 21 min (~6.3 fps).
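For orientation, a condensed sketch of this training setup with the Matterport package is given below; the configuration values mirror those reported here, while the class name, weight-file path, layer selection, and dataset objects (whose loading code is omitted) are illustrative.

```python
from mrcnn.config import Config
import mrcnn.model as modellib
from imgaug import augmenters as iaa

class IWPConfig(Config):
    """Settings mirroring those reported above; all other fields keep the
    package defaults."""
    NAME = "iwp"                    # illustrative experiment name
    BACKBONE = "resnet101"
    NUM_CLASSES = 1 + 2             # background + LCP + HCP
    IMAGES_PER_GPU = 2              # mini-batch of 2 images
    STEPS_PER_EPOCH = 272
    LEARNING_RATE = 0.001
    LEARNING_MOMENTUM = 0.9
    WEIGHT_DECAY = 0.0001
    DETECTION_MIN_CONFIDENCE = 0.7  # ignore detections below 70%

config = IWPConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Transfer learning: start from the COCO weights and re-initialize the heads
# whose shapes depend on the number of classes.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val are mrcnn.utils.Dataset subclasses that load the
# VIA annotations (not shown); random horizontal flips supply augmentation.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE, epochs=100,
            augmentation=iaa.Fliplr(0.5), layers="all")
```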

Model Optimization and Accuracy
We optimized the Mask R-CNN model during the training process and evaluated the accuracy of the optimized model prior to its application in the case studies. We adopted an early-stopping strategy to minimize overfitting using the hold-out validation dataset. We trained the Mask R-CNN for a predefined 100 epochs so that we could record the full learning curve of the model. As seen in Figure 6a,c-f, the validation loss values reached their lowest and then rebounded, while the training loss values continued descending, which indicates that after the 8th epoch the model tended to "memorize" the training data rather than "learn" to generalize the features of IWPs. Similarly, Figure 6b shows that the Mask R-CNN bounding box refinement stopped improving after the 8th epoch. Overall, the validation loss reached its lowest value around the 8th epoch (cyan vertical lines in Figure 6). This indicates that the Mask R-CNN tended to overfit after the 8th epoch, although all training losses decayed continuously (see Figure 6). Therefore, we selected a Mask R-CNN trained for 8 epochs. The convergence at the 8th epoch can be explained by two main factors. First, we trained our model using the pre-trained COCO weights instead of training from randomized weights, so most of the learning for the weights of low-level features was skipped and our training effectively focused on adjusting the Mask R-CNN to recognize IWPs. Second, the relatively small size of our training data, given the depth of the network, also contributed to this early convergence. Finally, we evaluated the Mask R-CNN trained for 8 epochs on the hold-out test dataset (35 subsets including 799 IWPs (i.e., 385 HCPs and 414 LCPs)) (Figure 2c). It reported mean average precisions (mAPs) of 0.695 and 0.601 at intersection-over-union (IoU: the ratio between the intersection and the union of the inferred outline and the ground-truth outline of an object) thresholds of 0.5 and 0.75, respectively.
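The mAP evaluation can be reproduced with the utilities bundled in the same open-source package; the sketch below assumes an inference-mode model (IMAGES_PER_GPU = 1) and a Dataset object holding the 35 test subsets.

```python
import numpy as np
from mrcnn import utils
import mrcnn.model as modellib

def map_over_test_set(model, dataset, config, iou_threshold=0.5):
    """Mean average precision over the hold-out test subsets at a given
    IoU threshold, built on mrcnn.utils.compute_ap."""
    aps = []
    for image_id in dataset.image_ids:
        image, _, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
            dataset, config, image_id)
        r = model.detect([image])[0]  # one-image batch in inference mode
        ap, _, _, _ = utils.compute_ap(
            gt_bbox, gt_class_id, gt_mask,
            r["rois"], r["class_ids"], r["scores"], r["masks"],
            iou_threshold=iou_threshold)
        aps.append(ap)
    return np.mean(aps)

# e.g., map_over_test_set(model, dataset_test, config, 0.50) -> ~0.695
#       map_over_test_set(model, dataset_test, config, 0.75) -> ~0.601
```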

Case Studies
We conducted four case studies to systematically assess the performance and transferability of the Mask R-CNN by repeating the same inference procedure over four image scenes: (1) the original 0.15 m resolution scene of Nuiqsut; (2) the resampled 0.5 m resolution scene of Nuiqsut; (3) the resampled 1 m resolution scene of Nuiqsut; and (4) the original 0.25 m resolution scene of Crea Creek.

Case Study 2: 0.5 m Resolution Image Scene

Case 2 is based on the resampled VHSR imagery at 0.5 m resolution for Nuiqsut. Fewer IWPs are detected than in case study 1 (Table 2). This is likely mainly because the Mask R-CNN was trained on the 0.15 m resolution samples; thus, it fails to discern some of the key low-level features that are imperative to defining IWPs at 0.5 m resolution. However, the precision of detection improved by 0.01 compared to case study 1. As in case study 1, the Mask R-CNN makes accurate delineations and classifications despite the resampling. In the resampled imagery, linear features between incomplete or disjoint IWPs are prone to disappear (Figure 8a,g,i). Indeed, fewer IWPs are detected in the 0.5 m resolution scenes with incomplete or disjoint IWPs (Figure 8b,h,j). Resampling does not drastically affect the performance of the Mask R-CNN in scenes with clearly visible troughs or rims (Figure 8d,f,l,n). The total numbers of delineated LCPs and HCPs from the 0.5 m image of Nuiqsut are 10,763 and 1318, respectively.

Case Study 3: 1 m Resolution Image Scene
Case 3 is based on the resampled VHSR imagery at 1 m resolution for Nuiqsut. The increase in pixel size from 0.15 m to 1 m results in a substantial loss of the IWPs' edge information (Figure 9a,g,i). Therefore, compared with case studies 1 and 2, fewer IWPs are detected using the same Mask R-CNN (Table 3). The OA and recall of detection are 0.15, more than five times lower than in case study 1 (0.15 m resolution). For LCPs with wide-spanning rims and troughs, the approach is still able to detect, accurately delineate, and classify IWPs when the widths of the rims and troughs are greater than 1 m (Figure 9d,f,l). The OAs of delineation and classification are 88% and 100%, respectively. In addition, the number of detected HCPs decreases dramatically as the pixel size of the imagery increases, even though the troughs between HCPs are still visible to the human eye at 1 m resolution (Figures 7h,p and 8h).

Case Study 4: 0.25 m Resolution Image Scene

Case 4 is based on the original VHSR imagery at 0.25 m resolution for the Crea Creek study area (Table 4). Based on our manual validation, around 72% of the total IWPs can be detected, with an approximate precision of 0.98. Very similar to the previous case studies, the OAs of delineation and classification of IWPs are high, even in this new image scene with a slightly coarser resolution, particularly for IWPs with clear edges (Figure 10b,f). The Mask R-CNN approach misses incomplete or disjoint IWPs (Figure 10h,l,n), which is reasonable because those IWPs are not distinct objects. Disjoint IWPs might be detected and classified more effectively using scene classification or semantic segmentation.

Transferability
Case studies 1-3 depict the degree of transferability of the Mask R-CNN with respect to the spatial resolution and scene content of the input imagery (Figure 11 and Table 5). We trained the Mask R-CNN using annotations from the 0.15 m resolution VHSR imagery for Nuiqsut. The OA of the Mask R-CNN decreases as the pixel size of the imagery increases (Figure 12). The total numbers of delineated LCPs and HCPs decrease as the imagery resolution coarsens (Table 5). HCPs and LCPs responded to the increasing pixel size to different degrees: the numbers of LCPs and HCPs missed with 0.15-0.5 m resampling are much lower than with 0.5-1 m resampling. Table 5 also shows that HCPs are more sensitive to resampling than LCPs. For example, 83% of the HCPs delineated in the 0.5 m imagery were missed in the 1 m imagery, compared to 57% of the LCPs. This is mainly because LCPs have more distinguishable features than HCPs, such as a wet center and a dry rim. Case study 4 demonstrates the transferability of the Mask R-CNN in terms of microtopography. The two study areas (Nuiqsut and Crea Creek) are geographically close to each other, but they still manifest fine-scale variations in microtopography. The Mask R-CNN trained on the VHSR imagery for Nuiqsut can still achieve a 72% detection OA in a new area and at a different spatial resolution.

Ice-Wedge Polygons' Convolutional Feature Maps
The advent of CNNs has led to breakthroughs in image processing [79]. Similarly to other DL applications in computer vision, the backbone architecture of the Mask R-CNN relies on convolutional feature maps (CFMs) from the CNN. A CFM is the output obtained by applying a given filter (also known as a kernel) to the previous layer. During the computation of CFMs, sliding-window operations translate the pixel information within the current window into a stack of CFMs by computing dot products (Figures 13 and 14) with an optimized number of selected filters (such as horizontal/vertical filters, sum/average pooling, etc.). The ResNet-101 of the used Mask R-CNN has five stages (a stage names a group of layers in the ResNet according to its position in the network): c1, c2, c3, c4, and c5. Figure 13 depicts four CFMs from each of the last output layers of stages 1 to 5 for HCPs. Each stage in the ResNet consists of a certain number of residual blocks, and each residual block has three layers. For example, Figure 14 (4th row) presents four feature maps from the last layer of the last block (i.e., the 23rd) in the 4th stage; a 101-layer ResNet includes 23 residual blocks in its 4th stage. Through the proper use of CFMs, the Mask R-CNN can detect the locations of HCPs (Figure 13 (4th row)) and LCPs (Figure 14 (4th row)) based on the high-level features, which then allows delineation and classification of HCPs and LCPs using other representative features from the CFMs.
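As a generic illustration of extracting CFMs from a ResNet backbone, the sketch below uses a stock Keras ResNet-50 for brevity (the model above uses ResNet-101, whose 4th stage ends at block 23 rather than block 6); it is not the Mask R-CNN's own code.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Build the backbone and tap the last output layer of its 4th stage.
backbone = ResNet50(weights="imagenet", include_top=False)
stage4_out = backbone.get_layer("conv4_block6_out").output
extractor = Model(inputs=backbone.input, outputs=stage4_out)

# A stand-in 600 x 600 px patch; real patches would be normalized imagery.
patch = np.random.rand(1, 600, 600, 3).astype("float32")
feature_maps = extractor.predict(patch)
print(feature_maps.shape)  # (1, 38, 38, 1024): 1024 stage-4 CFMs
```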

Visual Inspection
We qualitatively compared enlarged views of automatically detected IWPs with a LiDAR DTM (1 m spatial resolution) of polygons in the Crea Creek area (Figures 15-17). The LiDAR DTM study included a Canny edge raster analysis (the Canny operator is a widely used edge detector) and terrain profiles in the XX' and YY' directions. The LiDAR data were compared with a false-color composite VHSR image for an HCP, an LCP, and an incomplete/disjoint IWP (Figures 15a, 16a, and 17a). The troughs of HCPs are indistinguishable in the 1 m VHSR image because of the surrounding mixed pixels (Figure 15a). The HCP becomes more visible with the help of the LiDAR DTM and the Canny edges derived from it (Figure 15b,c). We highlighted the breakpoints where the automated delineation spotted the boundaries in the microtopography to depict the close agreement between the automated delineation and the Canny edge information from LiDAR (Figure 15c-e). The exact boundary of an LCP is almost invisible in optical imagery at 1 m resolution (Figure 16a). The LiDAR DTM and its derived Canny edges can only detect the "inner circle" edge of an LCP, hence the slight disagreement with the automated delineation (Figure 16b,c). In contrast, the automated method captures the actual boundary (Figure 16d,e). Incomplete/disjoint IWPs are difficult to detect because they usually do not show distinct edges. The Canny edge raster for the incomplete/disjoint IWP presents highly disorganized edges across the scene (Figure 17). However, the Mask R-CNN was able to detect a small proportion of incomplete/disjoint IWPs (Figure 17a,d,e), despite the image lacking distinct and organized boundaries.
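For reference, a Canny edge raster like the one used here can be derived from the DTM with GDAL and scikit-image; in this sketch the file name and the sigma value are assumptions.

```python
import numpy as np
from osgeo import gdal
from skimage import feature

# Read the 1 m LiDAR DTM into an array (hypothetical file name).
dtm = gdal.Open("crea_creek_dtm_1m.tif").ReadAsArray().astype(np.float64)

# Normalize elevations so the default hysteresis thresholds behave sensibly,
# then run the Canny detector; True marks microtopographic breaks.
dtm = (dtm - dtm.min()) / (dtm.max() - dtm.min())
edges = feature.canny(dtm, sigma=2.0)

print(edges.sum(), "edge pixels")
```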

Discussion
Deep learning leverages computational models to learn representations of data with multiple levels of abstraction. Choosing the right computational model and data is important in deep learning applications. In this exploratory study, we selected the Mask R-CNN, an object instance segmentation method, to conduct automated characterization of ice-wedge polygons. Its simplicity, effectiveness, and proven success in processing everyday images have previously been reported by He et al. [86] and Liu et al. [94]. To the best of our knowledge, this is the first study to use the Mask R-CNN method (or any other object instance segmentation method, such as PANet [94] or MaskLab [100]) to detect, delineate, and classify IWPs from VHSR remote sensing imagery. However, future efforts should include a systematic comparison of Mask R-CNN performance against other CNN-driven semantic segmentation methods in IWP mapping. We also need to compare how well CNN methods perform against well-established knowledge-driven paradigms, such as object-based image analysis (OBIA [101]), and other conventional (e.g., pixel-based) classification methods. It would also be valuable to conduct a detailed analysis of the CNN model optimization procedures to improve the processing time by examining a variety of combinations of hyperparameters. Optimization is a critical part of choosing the right computational model, because many parameters (e.g., regularization and the training schedule) control the performance of the overall model. In this study, besides all other default regularization settings, we only chose an early-stopping strategy to minimize overfitting. Without proper implementation, the model could either not function or only function on the training data. The quality of model outputs pivots on the quality of the annotated data. Producing accurate and stable annotated datasets is time-, labor-, and cost-intensive, with manual mapping representing one extreme. Therefore, it would be advantageous to conduct a comprehensive study on the minimum requirements and quality of training datasets intended for Mask R-CNN training. Here, we only used a training dataset at 0.15 m resolution, while the automated mappings at coarser resolutions might have benefitted from training data at comparable resolutions.
The Mask R-CNN was successful in characterizing the Arctic ice-wedge polygonal landscape from VHSR images (Figure 12). The algorithm successfully utilized the multi-level feature representations learned from the training data for the detection, delineation, and classification of the targets of interest (Figures 13 and 14). However, the overall accuracy of detection decreases as the spatial resolution of the input imagery coarsens. This is likely due to the resolution dependency of the training dataset (0.15 m) in the Mask R-CNN approach, meaning that the consistency of the spatial resolution of the training data and the prediction data can significantly impact the Mask R-CNN's performance. Sub-meter resolution commercial satellite imagery (<0.25 m) is available across the pan-Arctic domain. Thus, a Mask R-CNN-derived product, with a relaxation to spatial resolution, could potentially be of high value for science applications that can benefit from a pan-Arctic ice-wedge polygon map.
The case studies, with their OAs of delineation and classification, elucidate the robustness of the model to the spatial resolution of the training data. We envision two possible ways to improve the detection ability of the Mask R-CNN: (1) enriching the training data in both size and source, such as training data from multiple spatial resolutions (e.g., UAS, aerial, and satellite imagery), because only a relatively small amount of annotated data was prepared for the Mask R-CNN due to the costly annotation process; or (2) lowering the confidence threshold for object detection. In this study, we deliberately selected the two extremes of the ice-wedge polygon types, low-centered and high-centered polygons, to gain preliminary insights into the feasibility and appropriateness of the Mask R-CNN. We expect to progressively enhance the thematic depth of the model to discern other polygon classes, such as flat-centered, transitional, and mixed polygons, and conical mounds (i.e., thermokarst mounds or baydzherakhs that form as a result of deep melting of large ice wedges, predominantly in yedoma regions). It is not surprising that all the OAs of the IWP classification are above 95% (Figure 12). Increasing the thematic complexity by considering additional types of IWPs (such as types A-F [6]) could potentially lower the OAs of classification.
So far, human-augmented ice-wedge polygon mapping has been the most widely used method in Arctic science applications. Mapping ice-wedge polygons can be challenging; in many instances, flat- and low-centered polygons exhibit ill-developed polygon boundaries. Traditional methods coupled with shallow classifiers (such as nearest-neighbor, support vector machine, and random forest) are ill-equipped to cope with subtle scene variations, especially in the context of regional-scale mapping applications. Edge detectors, such as Canny, can decode some of the hidden discontinuities in the LiDAR DTM; for example, the Canny edge detector can effectively outline the boundary of an HCP (Figure 15c). Based on our observations, the Mask R-CNN can still successfully detect a small proportion of incomplete/disjoint IWPs with inconspicuous edges, which we attribute to the feature maps from the CNN (Figures 13 and 14). Scene classification might be a solution for detecting incomplete/disjoint IWPs as a complement to the Mask R-CNN. Here, we systematically tasked the Mask R-CNN algorithm with the original and spatially synthesized imagery as a stress test to observe its behavior against lost spatial detail. Future efforts could expand the stress test by coupling spatial artifacts with spectral, radiometric, and structural artifacts. Such experiments could unravel the susceptibility of the automated algorithm to scene content variations, which are common in remote sensing imagery due to a multitude of drivers, such as sensor characteristics (e.g., spatial resolution, spectral widths, bit depth), acquisition parameters (e.g., low-angle imaging), seasonal variations (e.g., early summer versus late summer), and pre-processing steps (e.g., pansharpening and data compression).
Human interpreters are highly capable of seeking high-level semantics while suppressing unrelated natural variations and process-induced artifacts in an image scene. We would ideally expect the automated algorithm to respond in the same manner; otherwise, scene dependency could greatly impede the scalability of the mapping task over a large domain with multiple sensors and varying imaging conditions. When the trough network is pronounced and well connected, ice-wedge polygons exhibit bona fide [102] object characteristics, in which the object boundaries are crisp and can be demarcated clearly and unambiguously from their respective environment. In some instances, ice-wedge polygons lack a distinct physical border and blend with the landscape. This forces the model to realize an indeterminate (fuzzy) border zone (or boundary-like region), rather than a crisp discontinuity delimiting the object of interest. High-centered polygons circumscribe a spectrally/texturally homogeneous entity resembling the real-world representation (Figure 15). In contrast, low-centered polygons sometimes resist such faithful representation. Inundated polygon centers and the dissipating wetness along their boundaries are readily available cues that support a positive detection of an IWP and its thematic recognition as a low-centered IWP (Figure 16). However, its exact physical boundary is largely undefined in the imagery. The detection would be far more challenging in the absence of water at the polygon center. Vogt et al. [103] coupled spatio-structural properties with functional dispositions and temporal developments when deciding the crispness of object boundaries. Tundra vegetation is closely linked to moisture status [104], which in turn is closely linked to microtopography [105], both of which can vary dramatically within just a meter. An overly coarse image, such as the 1 m VHSR image in the LCP scenario (Figure 16c), blurs the contrasts and spectrally homogenizes the center, rim, and trough polygon features, which are functionally heterogeneous entities in the landscape. Albrecht et al. [106] experimentally showed the variability in scale and generalization of manual object delineations and emphasized the necessity for degrees of freedom related to fuzziness in object boundaries. This conceptually aligns with Vogt et al. [103], who argued that the indeterminacy of object boundaries comes in various degrees of abruptness and, thus, that discontinuity is a matter of scale and granularity. Accordingly, a slight geometrical disagreement between automatically detected polygon boundaries from the imagery and DTM-derived polygon edges would probably persist even in a manual delineation.
Very high spatial resolution remote sensing imagery is radically transforming our ability to understand the most remote, logistically challenging, and ecologically fragile landscapes of the Polar Regions, in greater detail than ever before [107]. These image data products and their derivatives (e.g., stereo DEMs) offer unprecedented opportunities to map, model, and forecast biological, geological, and hydrological functioning in the Arctic and Antarctic in an intensive fashion at extensive spatial scales. The general notion that 'topography is stationary with respect to climate change' does not hold true for the Arctic, because microtopographic changes in polygonal tundra due to increasing permafrost temperatures are now visible at sub-decadal scales [41]. Thawing of ice-rich permafrost can be visible in RS imagery because of subsequent ground subsidence and the development of troughs in ice-wedge polygonal tundra. The entire Arctic has been imaged by commercial satellites at sub-meter resolution four times, on average, in the last six years, some places every month, and the collection is set to continue for at least another three years [85]. Arctic researchers can now observe disturbance across the Arctic region in a near real-time fashion owing to the recent and increasing number of SmallSat launches.
However, the scientific opportunities of 'big imagery' have so far been poorly explored. Shifting from megabytes to petabytes of imagery demands novel computational approaches to interpretation, coupled with efficient use of high-performance parallel and distributed computing resources [108] and reliance on an ecosystem of integrated digital resources and services [109], such as the Extreme Science and Engineering Discovery Environment (XSEDE, https://www.xsede.org/). These cyberinfrastructure platforms can support data discovery and access. Parallel to the revolution in the spatial resolution of commercial satellite imagery, we are experiencing a radical shift towards computer vision approaches that successfully link low-level motifs with high-level semantics. Deep learning approaches excel at identifying the higher-order texture features and contextual characteristics critical for uncovering complex data representations in satellite imagery. These methods are promising compared to standard remote sensing classification approaches (e.g., object-based image analysis), at the expense of greater computational complexity. For example, Witharana and Lynch [107] systematically showed the conceptual and implementation caveats of the OBIA framework in the autonomous censusing of Antarctic wildlife from sub-meter satellite imagery in continental-scale, repeatable mapping applications.
The efficient use of GPUs is essential for the success of CNNs [79]. Conducting pan-Arctic-scale characterization of ice-wedge polygons from commercial satellite imagery demands a computationally scalable and efficient solution. Here, the performance of I/O and of the Mask R-CNN prediction must be optimized, with the image data streamed directly from the data repositories to the computing clusters in order to minimize I/O bottlenecks. Strategies such as the Message Passing Interface (MPI, a distributed-memory parallel model) may allow the implementation of Mask R-CNN predictions on a large number of available GPUs (Figure 18). A master node would control the data flow, such as image distribution and result collection, using a job queue. Each GPU node would then receive an assigned image, execute the prediction process across the whole image, and return the final result to the master node.
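A skeletal mpi4py version of that master-worker queue might look as follows (run with at least two MPI ranks); the scene names and the inference call are placeholders.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Master: feed scene paths to the GPU nodes and gather polygon outputs.
    scenes = ["scene_001.tif", "scene_002.tif", "scene_003.tif"]
    for i, scene in enumerate(scenes):
        worker = 1 + i % (comm.Get_size() - 1)  # simple round-robin job queue
        comm.send(scene, dest=worker, tag=1)
    for _ in scenes:
        print("collected", comm.recv(source=MPI.ANY_SOURCE, tag=2))
    for worker in range(1, comm.Get_size()):
        comm.send(None, dest=worker, tag=1)     # shutdown signal
else:
    # Worker: run inference on each assigned scene and return the result.
    while True:
        scene = comm.recv(source=0, tag=1)
        if scene is None:
            break
        # A real worker would split the scene into patches and run the trained
        # Mask R-CNN here; this placeholder just echoes the assignment.
        comm.send(f"{scene}: polygons", dest=0, tag=2)
```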

Conclusions
Imagery-based IWP mapping is still largely constrained to time- and labor-intensive human-augmented workflows. The rapid influx of sub-meter resolution commercial satellite imagery to the polar science community demands high-performance image analysis workflows to meet the ever-increasing demand for imagery-enabled science products. We applied the Mask R-CNN to automatically detect and delineate ice-wedge polygons and identify their type. Four case studies from two study sites were used to assess the performance and transferability of the Mask R-CNN in characterizing the tundra ice-wedge polygon landscape. Our results show that: (1) the Mask R-CNN can detect up to 79% of IWPs in VHSR imagery with a pixel resolution of 0.15 m and around 72% of IWPs in imagery with a pixel resolution of 0.25 m; (2) besides its promising detection performance, the Mask R-CNN can delineate and classify detected IWPs accurately; and (3) the stress test of the Mask R-CNN on resampled imagery shows its flexibility and potential for automatically mapping IWPs in coarser RS images. The findings of this study provide an extensible framework for imagery-enabled intense mapping of ice-wedge polygons at extensive spatial and temporal scales.
While the Mask R-CNN shows promising ability in automatically characterizing IWPs, further studies are necessary to fully understand the use of deep learning-driven object instance segmentation in characterizing IWPs. Our future work will focus on four main directions: (1) a comprehensive comparison of how well the Mask R-CNN performs against other methods (e.g., OBIA, PANet, and MaskLab); (2) a full analysis of model optimization procedures, examining a variety of combinations of hyperparameters; (3) a detailed analysis of the minimum requirements and quality of training datasets intended for Mask R-CNN training; and (4) a stress analysis coupling spatial artifacts with spectral, radiometric, and structural artifacts.