Convolutional Neural Networks for Automated Built Infrastructure Detection in the Arctic Using Sub-Meter Spatial Resolution Satellite Imagery

Rapid global warming is catalyzing widespread permafrost degradation in the Arctic, leading to destructive land-surface subsidence that destabilizes and deforms the ground. Consequently, human-built infrastructure constructed upon permafrost is currently at major risk of structural failure. Risk assessment frameworks that attempt to study this issue assume that precise information on the location and extent of infrastructure is known. However, complete, high-quality, uniform geospatial datasets of built infrastructure that are readily available for such scientific studies are lacking. While imagery-enabled mapping can fill this knowledge gap, the small size of individual structures and vast geographical extent of the Arctic necessitate large volumes of very high spatial resolution remote sensing imagery. Transforming this ‘big’ imagery data into ‘science-ready’ information demands highly automated image analysis pipelines driven by advanced computer vision algorithms. Despite this, previous fine resolution studies have been limited to manual digitization of features on locally confined scales. Therefore, this exploratory study serves as the first investigation into fully automated analysis of sub-meter spatial resolution satellite imagery for automated detection of Arctic built infrastructure. We tasked the U-Net, a deep learning-based semantic segmentation model, with classifying different infrastructure types (residential, commercial, public, and industrial buildings, as well as roads) from commercial satellite imagery of Utqiagvik and Prudhoe Bay, Alaska. We also conducted a systematic experiment to understand how image augmentation can impact model performance when labeled training data is limited. When optimal augmentation methods were applied, the U-Net achieved an average F1 score of 0.83. 
Overall, our experimental findings show that the U-Net-based workflow is a promising method for automated Arctic built infrastructure detection that, combined with existing optimized workflows, such as MAPLE, could be expanded to map a multitude of infrastructure types spanning the pan-Arctic.


Introduction
Permafrost, defined as Earth materials that remain at or below 0 °C for at least two consecutive years, underlies approximately 24% of the exposed land surface of the Northern Hemisphere [1]. However, climate change has led to widespread warming of the permafrost landscapes across the Arctic [2], where land surface temperatures are reported to have increased by more than 0.5 °C per decade since 1981, exceeding average global warming by a factor of between 2 and 3 [3]. This rapid warming causes degradation of permafrost that, if ice-rich, results in destructive processes such as differential land-surface subsidence [4]. Consequently, built infrastructure (e.g., roads and railroads, fuel and water pipelines, residential and public buildings, industrial facilities, airports, etc.) across the Arctic are
It has been noted that very high spatial resolution (VHSR) imagery (<5 m resolution) is crucial in providing the required level of detail for accurate detection and classification of individual structures in the Arctic [22][23][24]. Therefore, the use of medium-resolution imagery means that many features can be missed, and those that are detected cannot be subcategorized. Despite this, all of the published products had limited access to VHSR imagery due to high imagery costs and low availability. However, the entire Arctic has been imaged by Maxar commercial satellite sensors at a sub-meter resolution, providing free 'big' imagery data to U.S. National Science Foundation Polar Program-funded researchers via the Polar Geospatial Center at the University of Minnesota. The conspicuous shortfalls of traditional remote sensing image analysis when confronted with large volumes of VHSR imagery [30] have catalyzed a migration towards computer vision-based algorithms, namely the convolutional neural network (CNN).
High spatial resolution images present scene objects much larger than the associated pixel size, introducing complex properties such as geometry, context, pattern, and texture that compose objects at multiple levels. Furthermore, higher spatial resolution significantly increases intra-class spectral variability, given the increased number of pixels constructing image features [31]. As such, traditional image analysis methods, namely per-pixel-based approaches, are ill-equipped to handle VHSR imagery, whereas CNNs are better equipped. For example, urban area extraction from coarse spatial resolution imagery may be satisfied by an algorithm that solely exploits high reflectance in the near-infrared region (which is characteristic of urban areas). However, as individual urban structures become visible at finer resolutions, detecting these objects will require an algorithm that can exploit features beyond the spectral reflectance values, such as edges, corners, and curves of buildings and roads, geometric patterns visible on building rooftops, textural differences between human-built structures and natural landscape backgrounds, etc. Through several processing layers, CNNs can learn to optimize the convolutional filters required to extract these features at multiple levels of abstraction, which are then assembled into feature representations used to detect and classify scene objects.
Several studies have successfully implemented CNN algorithms, namely the Mask R-CNN and the U-Net, for automated detection of various kinds of built infrastructure from VHSR imagery at multiple scales. The Mask R-CNN performs object instance segmentation, in which each individual object associated with a given class is detected, delineated with a bounding box, and classified [32]. The U-Net performs semantic segmentation, in which each pixel in an image is classified based on the detected object it is associated with. However, while object instance segmentation would treat, for example, multiple buildings of the same type as distinct structures, semantic segmentation would treat them as a single entity and therefore does not count the number of individual structures. Therefore, training the Mask R-CNN is more computationally intensive than training the U-Net, but both have recently achieved favorable results in infrastructure detection from high spatial resolution remote sensing imagery. For example, Tiede et al. tasked a Mask R-CNN with detecting dwellings in Khartoum, Sudan, from 0.5 m Pléiades satellite imagery, achieving an F1 score of 0.78 [33]. Wang  Shanghai, and Khartoum, achieving an F1 score of 0.704 [35]. Yang et al. tasked a modified U-Net with extracting roads from aerial imagery in the Massachusetts Roads dataset and DeepGlobe Road Extraction dataset, achieving F1 scores of 0.784 and 0.794, respectively [36]. However, deep learning-based infrastructure detection from VHSR imagery has so far not been tested in the Arctic.
In this paper, we present the first study on CNN-based automated detection of built infrastructure at two Arctic locations using VHSR imagery. Our overall objective was to understand the ability of the U-Net CNN to perform semantic segmentation of sub-meter-resolution satellite imagery for detection of built infrastructure in the Arctic. Target classes included residential and commercial buildings, public buildings, industrial buildings, and roads. Additionally, we conducted a systematic experiment to understand how image augmentation improves the performance of the U-Net CNN when training data is limited.

Study Area and Data
We selected two study sites on the North Slope of Alaska: (1) Utqiagvik and (2) Prudhoe Bay (Figure 1). Utqiagvik is the largest city of the North Slope Borough and the 12th-most populated city in Alaska; therefore, infrastructure is strongly developed there, with residential and commercial buildings, public buildings, pipelines, and roads (Figure 2a,b). Prudhoe Bay is one of the most prominent industrial areas in the Arctic [16]. The Prudhoe Bay oil field comprises an extensive network of infrastructure supporting the oil and gas extraction process, including multiple gathering centers, flow stations, pipelines, and roads connecting all facilities (Figure 2c,d). Therefore, Utqiagvik provided training samples for residential/commercial, public, and road infrastructure classes. Prudhoe Bay provided training samples for an industrial infrastructure class and added to the road class.
To train and test the U-Net CNN, we utilized six VHSR commercial satellite images in total from the WorldView-02 (WV-02) and QuickBird-02 (QB-02) sensors, two for the Utqiagvik site and four for the Prudhoe Bay site. We strictly utilized the blue, green, red, and near-infrared bands of the imagery. Specific details of the imagery used at each site, including acquisition date, sensor, and spatial resolution, are given in Table 2. All of the images were provided by the Polar Geospatial Center at the University of Minnesota.

Generalized Workflow
Our workflow rests upon four stages: (1) input preparation, (2) model training and validation, (3) model evaluation, and (4) output postprocessing (Figure 3). Input preparation is based on two key operations. First, annotated infrastructure samples from each image were rasterized, and then satellite images and corresponding annotated raster layers for each site were split into smaller tiles sized at 256 pixels by 256 pixels. Second, these tile pairs (both images and masks) were randomly partitioned into sub-datasets for training, validation, and testing, utilizing an 80:10:10 split. Our training dataset consisted of 119 tile pairs, and both our validation and testing datasets consisted of 17 tile pairs (153 tile pairs in total). Once the input was prepared, we trained and validated the model, applying image augmentation techniques to the training dataset in order to synthetically inflate its size. Next, we evaluated the model's performance on the testing dataset, which the model had not previously seen, and obtained accuracy metrics and model predictions. Finally, we performed postprocessing on the output by stitching the predicted tiles together into a final map.
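The tiling and partitioning steps of input preparation can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in this study: the function names, the non-overlapping tiling scheme, the fixed seed, and the rounding rule are all assumptions, and the rounding rule need not reproduce the exact 119/17/17 partition reported above.

```python
import random

def tile_origins(height, width, tile=256):
    """Upper-left (row, col) corners of non-overlapping tiles that fit fully in the raster."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def split_pairs(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle image/mask tile pairs and partition them into train/val/test subsets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # fixed seed (an assumption) keeps the split repeatable
    n_val = round(len(pairs) * fractions[1])
    n_test = round(len(pairs) * fractions[2])
    n_train = len(pairs) - n_val - n_test  # remainder goes to training
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

# Example: a hypothetical 600 x 520 pixel raster yields four full 256 x 256 tiles.
origins = tile_origins(600, 520)
train, val, test = split_pairs(range(153))
```

Because each pair is shuffled as a unit, an image tile and its mask always land in the same subset.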

Annotated Data Collection
In most remote sensing applications of a CNN, annotated data would need to be produced by drawing features of interest through an on-screen digitizing process. However, given that infrastructure of major settlements is consistently monitored by some governments, high-quality geospatial datasets are consistently maintained and can be utilized for CNN development if one can gain access to them. In the case of this study, we were able to obtain such a dataset. In addition, volunteered mapping efforts such as OpenStreetMap provide global coverage of buildings and roads in several areas of the Arctic. However, quality assessment must be performed to ensure locational accuracy before using OpenStreetMap in CNN training, given inconsistencies due to the nature of this kind of data.

Annotated data for the Utqiagvik study site comprised a geospatial vector dataset of building footprints (polygon features) and road centerlines (line features), which were digitized from 2019 aerial photography of the city by the North Slope Borough (NSB) GIS division. (This imagery belongs to the NSB and was not a part of our dataset, but the data layers extracted from the imagery were provided to us upon request.) In the training data, we applied a buffer to road centerlines to convert them to polygons. The buffer size was chosen so that the resulting road polygons accurately overlapped the actual roads in the imagery. In the NSB dataset, features corresponding to a building footprint were classified as either a residential, commercial, public, or unoccupied building. We omitted unoccupied buildings and merged the residential and commercial classes together, as there were not enough commercial building features in the dataset to train the U-Net to detect this type of infrastructure. Manual editing of this data was conducted to ensure that polygon features corresponding to specific buildings and roads aligned with those structures in the satellite imagery, accounting for discrepancies between the aerial photography used for digitization and the satellite imagery used for the analysis.
Furthermore, given the difference in acquisition dates of the aerial photography (2019) and the satellite imagery (2002, 2009, 2014), either certain digitized structures were not present in the satellite imagery, or structures present in the satellite imagery were not digitized. As a result, some features had to be removed from the dataset, and missing structures had to be digitized. Annotated data for the Prudhoe Bay study site comprised OpenStreetMap data that provided footprints of industrial structures and roads. Manual editing of this data was conducted to ensure that polygon features aligned with the industrial structures and roads in the satellite imagery.
The number of buildings and roads from each location that make up the dataset is described in Table 3. These samples were then rasterized, since CNNs require their input to be in the form of an image. After splitting the data into sub-datasets, we measured the size of each target class as the number of pixels in the labeled masks belonging to each class, as seen in Table 4.
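Measuring class size as a pixel count over the labeled masks can be illustrated with a short sketch. The integer class codes below are hypothetical, since the encoding used for the rasterized masks is not stated here, and the masks are tiny nested lists rather than real 256 × 256 tiles.

```python
from collections import Counter

# Hypothetical integer encoding of the mask classes (an assumption for illustration).
CLASS_NAMES = {
    0: "background",
    1: "residential/commercial",
    2: "public",
    3: "industrial",
    4: "road",
}

def class_pixel_counts(masks):
    """Count labeled pixels per class over an iterable of 2-D masks (lists of rows)."""
    counts = Counter()
    for mask in masks:
        for row in mask:
            counts.update(row)
    # Report every known class, including those with zero pixels.
    return {name: counts.get(code, 0) for code, name in CLASS_NAMES.items()}

# Two toy 2 x 2 masks standing in for rasterized annotation tiles.
toy_masks = [[[0, 1], [1, 4]], [[0, 0], [3, 4]]]
sizes = class_pixel_counts(toy_masks)
```

Reporting zero-pixel classes explicitly makes under-represented classes easy to spot before training.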

Deep Learning Algorithm
We chose to task a U-Net CNN with semantic segmentation of VHSR satellite imagery due to its success in various image analysis tasks and its computational efficiency. The U-Net was first developed for biomedical image segmentation [37] and has since spread to a wide range of applications, such as remote sensing. The U-Net is a fully convolutional neural network defined by its U-shaped architecture, hence the name "U-Net," that consists of an encoding and a decoding path. The encoding path is also known as the analysis path or contracting path, and the decoding path is also known as the synthesis path or expansive path. The former follows the typical CNN design, consisting of repeated application of convolutions, each followed by a rectified linear unit (ReLU), and max pooling operations. The function of the encoding path is to reduce the spatial dimensions of the input layers and increase the number of feature channels. In this route, each 3 × 3 convolution is followed by a ReLU, and 2 × 2 max pooling downsamples the feature maps, with the number of feature channels doubling at each downsampling step. The decoding path functions opposite to the encoding path: it reduces the number of channels and increases the spatial dimensions of the layers. The number of channels is halved in an upsampling process using a 2 × 2 up-convolution at the start of each decoding step. Afterward, 3 × 3 convolution layers followed by ReLU are used. Skip connections concatenate the corresponding feature layer from the encoding path to recover the information lost during downsampling in the encoding route. Finally, a 1 × 1 convolution maps the feature channels to the target classes to generate a pixelwise classified prediction map.
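The channel and resolution bookkeeping described above can be traced with a few lines of arithmetic. The sketch below assumes a four-level U-Net with 64 base channels, as in the original U-Net design, and 'same' padding so that only pooling and up-convolution change the spatial size. It does not describe the ResNet-50 encoder variant, whose stage widths differ.

```python
def unet_shapes(size=256, base_channels=64, depth=4):
    """Return (stage, channels, side_length) tuples along a plain U-Net with 'same' padding."""
    shapes = []
    channels, side = base_channels, size
    # Encoder: each level doubles the channels; 2x2 max pooling halves the side length.
    for level in range(depth):
        shapes.append((f"enc{level + 1}", channels, side))
        channels, side = channels * 2, side // 2
    shapes.append(("bottleneck", channels, side))
    # Decoder: each 2x2 up-convolution halves the channels and doubles the side length.
    for level in range(depth, 0, -1):
        channels, side = channels // 2, side * 2
        shapes.append((f"dec{level}", channels, side))
    return shapes

for stage, channels, side in unet_shapes():
    print(f"{stage:>10}: {channels:4d} channels, {side} x {side}")
```

Each decoder stage has the same spatial size as the encoder stage of the same level, which is what allows the skip connection to concatenate the two.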
To account for limited training data, we implemented transfer learning to leverage knowledge representations of low-level features (e.g., edges, lines, corners) learned by networks that have been pre-trained on large datasets for other computer vision tasks. We replaced the encoding path of the U-Net with a ResNet-50 backbone that was pre-trained on the ImageNet dataset. ResNet, or the residual neural network [38], is a CNN that utilizes identity skip connections to address the degradation problem that arises when accuracy gets saturated and degrades rapidly as network depth increases. It is constructed by stacking multiple bottleneck residual blocks, which consist of series of 1 × 1, 3 × 3, and 1 × 1 convolutions, as seen in Figure 4b. The backbone is frozen so as to avoid losing any of the information that its layers contain during training. Meanwhile, the U-Net decoder remains unfrozen and trainable in order to adjust to the parameters of the pre-trained layers. Figure 4a depicts the architecture of our U-Net model with a ResNet-50 backbone.

Model Training
The model was constructed and trained using PyTorch 1.10 and the Segmentation Models for PyTorch library (https://github.com/qubvel/segmentation_models.pytorch (accessed on 1 April 2021)), with a hardware configuration of an Intel Core i7-10750H 6-Core Processor and NVIDIA GeForce RTX 2060 with 6 GB of dedicated VRAM. Hyperparameters for model training are listed in Table 5.

To account for limited training data, we employed image augmentation to synthetically inflate the training data space through data warping, which generates additional samples through transformations applied in the dataspace [39]. We created copies of existing image tiles by applying four non-destructive geometric transformations that do not add to or detract from an image's information: random 90° rotation, horizontal flip (reflection across horizontal axis), vertical flip (reflection across vertical axis), and transposition (reflection across either diagonal axis). Figure 5 provides a diagram visualizing these transformations. We conducted a systematic experiment to understand how these different transformations improve the performance of the U-Net and determine the optimal set of augmentations. The experiment consisted of six trials, in which we trained the model under different conditions: one trial for each of the selected transformations applied to the training dataset individually (four trials), one trial for all of the transformations applied together, and one trial for no image augmentation.
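The four geometric transformations can be illustrated on a toy tile. The sketch below is not the augmentation code used in this study: it works on nested Python lists, the function names are hypothetical, and it follows the common image-library convention in which a horizontal flip mirrors the tile left to right and transposition reflects across the main diagonal. The key point is that the same transformation must be applied to an image tile and its mask so that labels stay aligned with pixels.

```python
import random

def rot90(tile, k=1):
    """Rotate a 2-D tile counter-clockwise by k * 90 degrees."""
    out = [list(row) for row in tile]
    for _ in range(k % 4):
        out = [list(row) for row in zip(*out)][::-1]
    return out

def hflip(tile):
    """Mirror the tile left to right."""
    return [list(row[::-1]) for row in tile]

def vflip(tile):
    """Mirror the tile top to bottom."""
    return [list(row) for row in tile[::-1]]

def transpose(tile):
    """Reflect the tile across its main diagonal."""
    return [list(row) for row in zip(*tile)]

def augment_pair(image, mask, transform):
    """Apply one transformation identically to an image tile and its mask."""
    return transform(image), transform(mask)

image = [[1, 2], [3, 4]]
mask = [["a", "b"], ["c", "d"]]

# Random 90-degree rotation: draw k once, then reuse it for both image and mask.
k = random.Random(0).randrange(4)
rot_image, rot_mask = augment_pair(image, mask, lambda t: rot90(t, k))
tr_image, tr_mask = augment_pair(image, mask, transpose)
```

All four operations only rearrange pixels, so the set of pixel values in a tile is unchanged, which is the sense in which they are non-destructive.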

Figure 5. Diagram of selected geometric transformations. "M1 reflection" and "M3 reflection" refer to horizontal and vertical flipping, respectively. "M2 reflection" and "M4 reflection" refer to transposition.

Accuracy Assessment
The accuracy of infrastructure detection performed by the model was assessed through standard semantic segmentation metrics: Recall, Precision, and F1 score. Recall represents the fraction of pixels truly belonging to each class that the model labeled correctly and is calculated as the ratio of true positives (TP) identified by the model to the actual number of positives, i.e., true positives plus false negatives (FN):
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1} \]
Precision represents the fraction of detected pixels in each class that belong to the assigned class and is calculated as the ratio of true positives to all positives identified by the model, i.e., true positives plus false positives (FP):
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2} \]
F1 score combines both recall and precision to assess overall model performance:
\[ F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3} \]
Recall, Precision, and F1 score were calculated for each target class. F1 score was averaged across all classes for an overall assessment of model performance. Furthermore, accuracy assessment was conducted for each trial of the augmentation experiment to determine the optimal augmentation method(s). Finally, we utilized the confusion matrix to visualize true and false positives and negatives for each class that the model was trained to detect.
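As a worked illustration of these per-class metrics, the sketch below computes Precision, Recall, and F1 from flattened true and predicted pixel labels. It is a minimal standard-library example, not the evaluation code used in this study. Scoring a class with no true positives as zero is an assumed convention, and whether the background class enters the average is left to the caller.

```python
def per_class_scores(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from flat sequences of pixel labels."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Assumed convention: undefined ratios are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

def mean_f1(scores):
    """Unweighted average of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

# Toy example with three classes; class 2 is never predicted, so its F1 is 0.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 0]
scores = per_class_scores(y_true, y_pred, classes=[0, 1, 2])
average = mean_f1(scores)
```

Under this convention a class that the model misses entirely contributes an F1 of zero, which pulls the unweighted average down sharply.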

Quantitative Metrics
The results of model accuracy assessment and the augmentation experiment are displayed in Table 6. Transposition and all augmentations applied together yielded the two highest average F1 scores, 0.83 and 0.82, respectively. These two methods were considered to be the optimal augmentations, compared to the other four methods that yielded significantly lower scores. The next highest average F1 score came from the model trained on the dataset with random 90° rotation applied. This disparity between the top two scores and the bottom four scores can be attributed to the fact that either roads or public buildings were completely missed by the model in the bottom four trials. The residential/commercial and industrial classes were the most stable in terms of model detection. This may be attributed to the fact that they were better represented in the training data compared to the public and road classes (Table 4). Furthermore, roads at both study sites are largely unpaved and narrow, making it difficult for the model to detect roads as features separate from the background. However, as shown, optimal image augmentation methods can aid performance when training data is lacking.
Confusion matrices showing the number of correctly and incorrectly classified pixels for each infrastructure class and augmentation trial are available in Figure 6 and corroborate Table 6. These are a useful tool for visualizing the true and false positives and negatives used to calculate the reported accuracy metrics, as well as understanding how the model confuses classes during detection. It can be seen that there are varying sources of false positives and negatives. In the two highest-scoring model trials (transposition and all augmentations), confusion between infrastructure classes is largely reduced, and the only significant source of confusion is misclassification of infrastructure as background and vice versa. However, when less effective augmentation methods, or no augmentation, are applied, infrastructure classes are confused for each other at significantly higher rates. For example, public buildings are largely confused for residential/commercial and industrial buildings when either horizontal/vertical flipping or no augmentations are applied. Furthermore, the most notable difference in model confusion between the optimal and less optimal augmentations is that no infrastructure classes were missed when the optimal augmentations were applied, as seen in Table 6.
Table 6. Per-class accuracy metrics and average F1 score resulting from augmentation experiment.

Visual Results
In addition to quantitative evaluation, visual results are shown in Figures 7-11, which were produced using the model with the highest average F1 score. Selected model predictions on input tiles from the test dataset are shown in Figure 7, with each infrastructure class and both Utqiagvik and Prudhoe Bay study sites being shown. A final map of predicted infrastructure in Utqiagvik is shown in Figure 8. Figures 9-11 show convolutional feature maps (CFMs) extracted during training from the final activation function of each major stage of the model, as shown in Figure 4. A CFM is the output of a convolution operation between a given filter (or kernel) and input image (or output of a previous layer), and it is computed as the dot product of these two during a sliding-window operation. Individual CFMs visualize the features that a CNN is learning at a particular stage in the network. Viewing all the CFMs together from an entire training process reveals the complete feature representation with multi-level abstraction that a network has constructed for target features. In the U-Net encoder, convolutional filters detect increasingly more abstract features, or low-level features, that are propagated to the decoder, which constructs these into higher level features and eventually an output segmentation map. Four CFMs are selected from each stage of the encoder and decoder, serving as examples of the feature representations that the model constructs for each infrastructure class. Ultimately, CFMs prove to be a useful diagnostic, as they allow researchers to internally assess the learning process at each stage and visually identify where detection fails or succeeds.
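The sliding-window dot product that produces a CFM can be shown on a toy example. The sketch below applies a hand-written 2 × 2 vertical-edge kernel followed by a ReLU. The kernel and the image are hypothetical stand-ins, since the filters in a trained network are learned rather than specified. As in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(feature_map):
    """Zero out negative responses, as the activation after a convolution does."""
    return [[max(0, v) for v in row] for row in feature_map]

# Toy 4 x 4 single-band image: dark on the left, bright on the right.
image = [[0, 0, 1, 1]] * 4
# Hypothetical kernel that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = relu(conv2d_valid(image, vertical_edge))
```

The resulting 3 × 3 feature map is nonzero only in the column where the window straddles the brightness boundary, which is the kind of localized response visible in early-stage CFMs.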

Discussion
In this paper, we presented the first exploratory study of automated Arctic built infrastructure detection from VHSR satellite imagery using a CNN. Only one previous study [22] has successfully demonstrated automated detection of infrastructure with machine learning and deep learning on the pan-Arctic scale, while others relied on manual digitization or semi-automated analysis at local scales. This study served as an initial assessment of CNN-based automated detection from VHSR satellite imagery by testing the methodology on two Alaskan North Slope locations and five common infrastructure types. Overall, model accuracy assessment shows that the U-Net CNN with a transfer learning approach can successfully automate detection of various infrastructure types with high segmentation accuracy. Buildings for residential and commercial use, public use, and industrial use, as well as roads, can all be detected and delineated as individual features (Figures 7 and 8).
We conducted an image augmentation experiment to specifically address the challenge of limited training data that hampers most CNN-based detection tasks. Results show that optimal augmentation methods can reduce inter-class confusion among infrastructure types and improve the overall F1 score from a minimum of 0.62 to a maximum of 0.83 (Table 6). However, augmentation yielded virtually no improvement in the recall of the residential/commercial and road classes, which was 0.65 for both of these classes when all augmentation methods were applied. This indicates that the model still misses a large portion of residential/commercial buildings and roads, either by completely failing to detect a structure or not detecting the full extent of a structure. Ultimately, this implies that there is a limit on the amount of synthetic inflation that image augmentation can induce in small training datasets. Therefore, we expect to see improved performance in these classes as we collect or produce more training samples. We recognize this to be the most important area of improvement for this work, as CNNs are "data-hungry" models and can be drastically improved by training with more samples.
Furthermore, expanding the training dataset includes expanding the geographic extent of automated infrastructure detection by sampling different communities and industrial locations across the Alaskan North Slope and the broader pan-Arctic region. This will allow us to assess the transferability of CNN-based automated infrastructure detection, which is a significant step in developing our methodology because infrastructure and its landscape context can vary widely across the Arctic. For example, infrastructure across different settlements can vary in terms of size, shape, building material, density of surrounding infrastructure, and more. In addition, landscape backgrounds differ across Arctic regions, resulting in variability of contextual information that a CNN would need to recognize in order to properly detect infrastructure. If the training data is unable to capture this inherent variability, the operational utility of the CNN will be severely limited. Therefore, systematic experimentation is required in order to understand the transferability mechanism.
Expanding the training dataset can also include enhancing the thematic depth of the model to discern other infrastructure classes. As mentioned, in this study we focused on roads and different building types, but there are several characteristic infrastructure types that define the Arctic built environment which we have not addressed. These include impervious cover, gravel pads, and fuel and water pipelines, all of which are essential structures for studying and understanding the interlinkages between permafrost disturbance and infrastructure in expanding industrial areas. Of particular relevance in Arctic permafrost regions is piping infrastructure, which is largely constructed above ground because this makes it easier to maintain when permafrost thaws and reduces the risk of disturbing permafrost. Because these pipelines are visible from above, remote sensing is especially relevant for monitoring Arctic infrastructure. However, experimentation in this aspect is necessary: as the number of classes increases, the learning process for a CNN becomes more complex and may lead to higher inter-class confusion, especially between linear features like roads and pipelines that may appear similar in satellite imagery.
Finally, as these points of improvement are addressed and the automated built infrastructure detection workflow is refined, we will have the opportunity to integrate it with existing optimized automated detection workflows, such as the Mapping Application for Arctic Permafrost Land Environment (MAPLE) [40,41]. MAPLE has successfully produced the first pan-Arctic ice-wedge polygon map, with over 1 billion individual ice-wedge polygons detected and classified, along with mapped surface water. MAPLE is also being expanded to automatic detection of ice-wedge troughs. Because ice-wedge polygon type (low-centered or high-centered) and the growth of troughs can indicate permafrost degradation, combining built infrastructure maps with ice-wedge polygon, ice-wedge trough, and water maps can help identify areas where infrastructure is susceptible to the damaging effects of permafrost thaw.

Conclusions
Imagery-based infrastructure mapping of Arctic permafrost landscapes has been constrained to human-augmented workflows, namely manual digitization and semi-automated workflows, on locally confined scales. Only one study has successfully automated mapping on the pan-Arctic scale, but it is limited to the use of 10 m Sentinel-1 and Sentinel-2 imagery within a 100 km distance of the Arctic coast. The rapid influx of sub-meter spatial resolution commercial satellite imagery into the Arctic science community provides the opportunity to map infrastructure across the entire Arctic at a fine scale (<1 m). However, the image analysis workflows required for this task have not been developed or tested. In this study, we applied the U-Net with a ResNet-50 backbone, combined with image augmentation, to automatically detect different infrastructure types in two Alaskan North Slope locations (industrial Prudhoe Bay and the City of Utqiagvik). Our results show that with limited training data, the U-Net can achieve an average F1 score of 0.83 in multi-class semantic segmentation of VHSR satellite imagery for automated infrastructure detection when optimal augmentation methods are applied.
While the U-Net shows promising ability in automatically detecting Arctic built infrastructure, further studies are necessary to extend the geographic and thematic domain of the workflow and to fully understand its capabilities in both respects. Therefore, our future work will focus on two main directions, both centered on expanding the training dataset: (1) systematic experimentation on the transferability of the U-Net across Arctic locations, and (2) enhancing the thematic depth of the U-Net by adding more infrastructure classes to the training dataset.