Towards Circumpolar Mapping of Arctic Settlements and Infrastructure Based on Sentinel-1 and Sentinel-2

Abstract: Infrastructure expands rapidly in the Arctic due to industrial development, and climate change impacts are at the same time pronounced there: ground temperatures are increasing and coastal erosion is intensifying, for example. A consistent account of the current human footprint is needed in order to evaluate the impact on the environment as well as the risk to infrastructure. Identification of roads and settlements with satellite data is challenging due to the small size of single features and the low density of feature clusters. The spatial resolution and spectral characteristics of satellite data are the main issues regarding their separation. The Copernicus Sentinel-1 and -2 missions recently began to provide good spatial coverage and, at the same time, comparably fine pixel spacing, starting at 10 m for the modes available across the entire Arctic. The purpose of this study was to assess the capabilities of both Sentinel-1 C-band Synthetic Aperture Radar (SAR) and Sentinel-2 multispectral information for Arctic-focused mapping. Settings differ across the Arctic (historic settlements versus industrial, locations on bedrock versus tundra landscapes), and reference data are scarce and inconsistent. The type of features and the data scarcity demand specific classification approaches. Two machine learning approaches have been tested: Gradient Boosting Machines (GBM) and deep learning (DL)-based semantic segmentation. Records for the Alaskan North Slope, Western Greenland, and Svalbard, in addition to high-resolution satellite data, have been used for validation and calibration. Deep learning is superior to GBM with respect to user's accuracy; GBM therefore requires comprehensive postprocessing. SAR provides added value in the case of GBM: VV is of benefit for road identification and HH for the detection of buildings. Unfortunately, the Sentinel-1 acquisition strategy varies across the Arctic, and the majority of the land area is covered in VV+VH only.
DL is of benefit for road and building detection but misses large proportions of other human-impacted areas, such as the gravel pads typical of gas and oil fields. A combination of results from both GBM (Sentinel-1 and -2 combined) and DL (Sentinel-2; Sentinel-1 optional) is therefore suggested for circumpolar mapping.


Introduction
The Arctic environment is changing rapidly due to climate change [1], but also due to direct human impacts [2,3]. Infrastructure expansion is driven by resource exploitation, transport routes, and military history. Arctic infrastructure has expanded in many regions in recent decades following industrial development, so that nowadays a mixture of indigenous settlements and cities resulting from industrial growth can be found. The manifold environmental impacts have been documented extensively. Regarding detectability with SAR, double-bounce scattering has been shown to contribute 24% of the backscatter of urban pixels at C-band (based on fully polarimetric data). Related workflows generally focus on built-up areas only (see, e.g., in [17,23]) due to their good separability from other landcover types. Ambiguities are, however, manifold: many other surfaces are misclassified as built-up (e.g., high backscatter also occurs over block fields). An external data source is required in order to pre-select areas with settlements for manual postprocessing, and the insufficiency of such records in turn reduces the potential for improving settlement records. For example, the Global Urban Footprint (GUF) dataset based on TerraSAR-X [17] leaves out many smaller settlements across the Arctic. Iannelli and Gamba [24] suggest a combination with derivatives of digital elevation models to at least treat ambiguities in mountain areas. In addition to the backscatter amplitude, texture can be utilized [25] to improve built-up area detection. Utilizing repeated acquisitions, interferometric SAR coherence can be combined with backscatter intensity to derive built-up areas (the authors of [23,26] tested this for semi-arid and arid regions). Results by Corbane et al. [26] for a town in the Netherlands suggest that the use of coherence (VV) increases the error of omission, but that performance increases with decreasing built-up density. The association of built-up areas or the urban footprint with certain landcover classes varies across studies.
Talukdar et al. [27], for example, include roads in built-up areas, with a focus on multispectral data only. Stromann et al. [28] treat roads as a separate class and distinguish between low- and high-density built-up areas. In general, however, only the term "urban area" is used and binary outputs (urban versus non-urban) are generated.
A further challenge in this context is the variation in the acquisition strategies of SAR missions across the Arctic. Sentinel-1, for example, acquires in different polarizations with varying coverage. Developments targeting global coverage focus on VV+VH (see, e.g., in [17,23]), as this is the dominant setting for land areas globally. In the Arctic, acquisitions are, however, partially made in HH+HV due to requirements of sea ice and glaciology-related applications. The viewing direction of the sensor and the orientation of building walls also play a role in detectability [29,30]. Depending on the properties of the feature, objects are spatially offset in the resulting images; size-related offsets in SAR differ in direction from offsets in optical images [31].
The Sentinel satellites of the recently initiated European Copernicus program are expected to provide an improvement for this type of application. Optical as well as radar data are acquired at resolutions starting from 10 m. It is estimated that Sentinel-2 can identify roads starting from 3 m width in agricultural environments [32]. This opens the way for the identification of buildings and roads of the dimensions found in the Arctic. The task, however, is data intensive and cannot be accomplished manually (as in previous studies [2,3]) across the entire Arctic. Threshold approaches for SAR, and the application of indices (the Normalized Difference Built-up Index [33] or the Index-based Built-up Index [34]) in the case of multispectral data alone, are often of limited applicability [24]. Machine learning techniques have been shown to be applicable for settlement detection in environments other than the Arctic (see, e.g., in [35][36][37][38][39]). Specifically, deep learning is expected to allow the identification of human footprint-related features and to deal with ambiguities. A range of remote sensing studies, including in urban environments, exist for these approaches [40,41].
A challenge for semiautomatic techniques is the need for extensive training data. Settlements in Arctic environments are usually small and only scarce reference data exist (limited to areas of manual HR (high resolution) image interpretation). This requires sparsity-aware algorithms. Gradient boosting machine (GBM) learning, as suggested by Chen and Guestrin [42], is expected to be applicable for sparse data. It has been shown to outperform support vector machines (SVM) or random forest for VHR (very high resolution) data in urban environments [37]. This may also apply to 10 m resolution, which resolves many relevant artificial objects better than, e.g., Landsat [2,21]. Stromann et al. [28] tested Sentinel-1 and -2 with SVM for urban environments with good results, but did not compare it to other approaches.
A further option is deep learning, a special type of machine learning that can solve more complex problems. Deep learning-based semantic segmentation generally uses convolutional neural networks (CNNs). This method is usually applied to VHR optical and SAR data. Deep learning has been shown to reduce the additional (manual) mapping effort for delineating the urban footprint from TanDEM-X/TerraSAR-X data [39]. It has been applied using VHR optical data for mapping (informal) urban villages in China [36] and informal settlements in Africa [38,43,44]. Wurm et al. [45] combine Sentinel-2 and TerraSAR-X data for informal settlement detection. Deep learning has so far not been tested for human footprint identification in tundra environments. Here, we cannot yet utilize VHR data for the entire Arctic (access limitations, data volume), but Sentinel-2 is expected to provide the required detail in this context (identification of roads and small features as typical for tundra settlements). Road monitoring applications based on SAR data usually focus on subsidence monitoring, but deep learning, specifically the UNet neural network architecture [46], has been shown to be of value for road extraction from Sentinel-1 backscatter amplitude [47]. Furthermore, the results of Zhao et al. [41] for cities in China demonstrate the superior performance of deep learning for the separation of urban targets from Sentinel-1.
The added value of Sentinel-2 for impervious surfaces in urban environments has been demonstrated, e.g., by Lefebvre et al. [48]. Pesaresi et al. [35] also used Symbolic Machine Learning (SML) for detailed urban land cover mapping in central Europe and suggest that Sentinel-1 can complement Sentinel-2 and specifically aid the treatment of ambiguities in Sentinel-2 reflectance. Zhou et al. [49] suggested the fusion of Sentinel-1 with multispectral (Landsat) as well as Hyperion data (better spectral resolution than Landsat but the same spatial resolution). They point out that texture features in SAR are an asset in this context. Iannelli and Gamba [24] investigated the fusion of Sentinel-1 and Sentinel-2 for urban mapping (urban versus non-urban for larger cities). A two-step approach is suggested: first, a mask is prepared with Sentinel-2 (a combination of several approaches, including random forest), which then serves as input for a threshold-based classification of Sentinel-1 to treat the ambiguities in C-band to some extent. The results largely match the extent of built-up areas in the GUF dataset. A joint use of Sentinel-1 and Sentinel-2 within an SVM approach is also suggested by Stromann et al. [28] for specific features of large cities (demonstrated for Stockholm and Beijing). High-density and low-density built-up areas have been considered in addition to road features. High-density built-up areas show high confusion with roads and bedrock, whereas low-density areas are confused with green spaces. In general, only built-up areas and/or larger cities are considered. For Arctic studies, however, information is required on the distribution of not only the buildings of smaller, low-density settlements, but also roads and other features related to infrastructure.
The objective of this study is to evaluate the capabilities of Sentinel-1 and Sentinel-2 for circumpolar mapping of infrastructure beyond what is typically represented in global datasets. Target classes include further human-impacted areas (such as gravel pads) in addition to buildings and roads. Two machine learning approaches applicable to data-scarce settings (gradient boosting machines and deep learning-based semantic segmentation) are assessed, and recommendations are eventually formulated.

Satellite Data
The Sentinel-1 and -2 missions are part of ESA's Copernicus program. Whereas Sentinel-1 carries synthetic aperture radar systems, Sentinel-2 provides multispectral data.
The Sentinel-1 mission currently consists of two satellites in a near-polar, sun-synchronous orbit, 180 degrees apart from each other. The two earth observation satellites Sentinel-1A (launched in April 2014) and Sentinel-1B (launched in April 2016) carry an identical C-band SAR sensor [50]. The Interferometric Wide Swath (IW) mode combines a swath width of 250 km with a relatively good ground resolution of 5 × 20 m. A pixel spacing of 10 × 10 m is commonly used as the nominal resolution of derived products. This is also the case for the Ground Range Detected (GRD) products as distributed by Copernicus. Information can be captured in dual polarization (HH+HV or VV+VH; H-horizontal, V-vertical). Mostly VV+VH is available for the Arctic land area for this mode and resolution; Greenland and several high Arctic islands are covered in HH+HV mode due to requirements of glacier monitoring. The simultaneous availability of HH and HV also varies over time in this region, as more data are acquired in HH only. This results in limitations for certain applications, e.g., regarding the mapping of vegetation [51]. Both polarization combinations (HH+HV and VV+VH) are only available for Svalbard, where they are acquired from different orbits (ascending versus descending), resulting in opposing looking directions (Figure 1). GRD products were used for this study; they are detected, multi-looked, and projected to ground range using an Earth ellipsoid model [52]. As temporal variations of backscatter can occur with changes in liquid water content, only winter data (December and/or January; frozen soil conditions) are used for cross-Arctic consistency and comparability (see Table 1). The use of such data also allows for a simplified normalization with respect to the incidence angle influence on backscatter intensity when deriving the backscatter coefficient σ0, which is commonly used [53].
For Sentinel-2, acquisitions from July to early August (snow-free season) have been considered (see, e.g., the selected scenes of the validation granules in Table 2).

Study Areas and Calibration and Validation Data
Consistent information on human impact is required across the entire Arctic and is of special concern in settings with presence of permafrost [12][13][14]. The areas of interest have therefore been selected with respect to permafrost extent (source: [56,57]). All calibration and validation sites are located in sporadic to continuous permafrost (Figure 3).
High spatial resolution datasets are required in order to calibrate and validate the landcover classifications. These need to reflect the requirements of the algorithms, satellite data availability, environmental settings (potential reflectance ambiguities), and target categories. The relevant categories (roads, buildings, and gravel pads) need to be included. Ideal sources are vector datasets based on aerial photographs and/or very high spatial resolution satellite data. Published datasets are, however, scarce. The records in [3,58] include three sites (each 20 km²) at the Prudhoe Bay oil extraction site for which the human impact on the environment has been mapped. This area is characterized by herbaceous shrub tundra. The region was the first developed oilfield in the Arctic and is the largest in the United States. The built-up layer of the Global Human Settlement (GHS) dataset [59] includes approximately 30 pixels of 250 × 250 m with up to 20% coverage in the proximity of Prudhoe Bay, which indicates the presence of human impact but provides no information on actual objects. The records in [3] consider changes from 1968 up until 2011. Openly available datasets in other environmental settings are lacking. This is a challenge for settlements which differ from the Prudhoe Bay environmental setting, specifically those which are largely built on bedrock. Misclassifications are expected in such areas due to reflectance ambiguities, which need to be assessed. A dataset has therefore been specifically assembled for the Greenland west coast (eight settlements) and Svalbard (Longyearbyen). Longyearbyen is represented neither in the GUF [17] nor in the GHS. Four of the Greenland sites are covered by the GHS and six out of the eight by the GUF (12 m, 2012 version). Cadastral information has been selected from national sources, checked as part of an in situ campaign in 2018, and annotated.
Surface and roof types have been mapped (Figure 2) as well as the functions (public, industrial, and residential) of buildings [54,60]. The datasets are complemented with a reference area in the proximity of each covered settlement. This reference area represents a typical environmental setting and does not include any infrastructure or human impact. The Prudhoe Bay, Greenland, and Svalbard datasets are used for independent (of the data record types used for calibration) validation in our study. Dataset characteristics of the validation records and the acquisitions used over these regions are provided in Tables 1 and 2. (Figure 3 caption: dark gray denotes sporadic to continuous permafrost (source: [56]); granule 41WPR is pointed out as it is used for Gradient Boosting Machine calibration only.)
Algorithms such as deep learning require a large amount of training data. OpenStreetMap data are therefore commonly used as input for deep learning in remote locations (see, e.g., in [39]). Such data can provide useful information in densely populated areas, where completeness and accuracy can be expected to be relatively high. The majority of settlements in the Arctic are, however, not fully represented, which limits the choice of calibration areas. Approximately 12,000 building and 3200 road objects have been used; they represent the complete set of available features within the selected Sentinel-2 granules. Street and building information is available in vector format via OSM Geofabrik (Geofabrik GmbH, Karlsruhe, Germany, http://www.geofabrik.de/), but other human-impacted areas are not precisely included, so an alternative source is required. The Google online hybrid map includes very high resolution images for some Arctic settlements. It has therefore been used to identify human-impacted areas across several Sentinel-2 granules for the deep learning (Figure 3). It has further been used to assemble a training dataset dedicated to the pixel-based machine learning approach.
Remote Sens. 2020, 12, 2368

Specific emphasis was on the selection of sufficiently large areas (with respect to the 10 m spatial resolution) with homogeneous properties in this case.

Table 1. Validation data and corresponding Sentinel-1 (S-1) and Sentinel-2 (S-2) acquisitions. Reference objects represent areas without any human impact. Dates for Sentinel-2 are provided in Table 2.

Methods
An automated workflow for downloading, preprocessing, and classifying Sentinel-1 and Sentinel-2 data has been set up, with the mapping of coastal infrastructure at circum-Arctic extent as the final goal. Classification was performed with two different methods: a pixel-based classification using a Gradient Boosting Machine [42] and a windowed semantic segmentation approach using the deep learning framework Keras [61] with the TensorFlow backend, referred to as GBM and DL, respectively, in the following.

Preprocessing of Satellite Data
Data from a total of 50 Sentinel-2 granules have been considered for this study. Sentinel-2 data are available as granules (100 × 100 km tiles) in UTM projection, largely at Level-1C (orthorectified, top-of-atmosphere reflectance). As the data are available for top-of-atmosphere reflectance only, atmospheric correction was required in addition to cloud masking. Sentinel-2 provides a spatial resolution of 10 m for some bands but not for all. As shown by Kumpula et al. [2], scale is crucial for Arctic infrastructure and impact mapping. Enhancement of the spatial resolution of the coarser bands therefore needs to be considered to exploit the full multispectral capabilities offered by Sentinel-2.
In order to account for anomalies due to undetected clouds in Sentinel-2, three acquisitions have been combined. They represent the same year where possible, but in most cases several years were utilized (see, e.g., the dates for the validation sites in Table 1). This strategy does, however, require atmospheric correction in order to account for related differences between the dates. We therefore applied atmospheric correction to the Sentinel-2 data using sen2cor, which also generates a cloud mask during the process. We further performed super-resolution based on the tool DSen2 [62], which uses a convolutional neural network. Lanaras et al. [62] showed that their approach clearly outperforms simpler upsampling methods and better preserves spectral characteristics. The original model was trained on Level-1 data, which have not been atmospherically corrected, and on global sampling. We therefore retrained and tested their model on Level-2 data (output of sen2cor) using the same published training and testing routines (https://github.com/lanha/DSen2) for selected granules from our study sites (acquired only in July and August). The code was only modified to read in the Level-2 data. We used nine Sentinel-2 scenes for training and three scenes (a subset of the classification calibration granules, see Figure 3) for testing of our model, which amounts to one-fifth of the dataset sizes used by Lanaras et al. [62] for their global model. The Root Mean Square Error (RMSE) between the original 20 m resolution data and the result of super-resolution performed on synthetically downsampled 20 m resolution data is the primary suggested evaluation metric [62] and has therefore been applied for evaluation. After the super-resolution step, clouds were masked using the cloud mask output from sen2cor.
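The evaluation protocol described above, RMSE between the original 20 m bands and the super-resolved version of synthetically degraded data, can be sketched as follows. This is an illustrative numpy sketch only: nearest-neighbour upsampling stands in for the actual DSen2 CNN, and all function names are our own.

```python
import numpy as np

def block_downsample(band, factor=2):
    """Synthetically degrade a band by block averaging (e.g., 20 m -> 40 m)."""
    h, w = band.shape
    return band.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample_nn(band, factor=2):
    """Placeholder for the super-resolution model: plain nearest-neighbour
    upsampling, used here only to illustrate the evaluation protocol."""
    return band.repeat(factor, axis=0).repeat(factor, axis=1)

def rmse_dn(original, restored):
    """RMSE in Sentinel-2 digital numbers (reflectance multiplied by 10,000)."""
    return float(np.sqrt(np.mean((original - restored) ** 2)))

# Protocol: degrade the original 20 m band, restore it, compare with the original.
band_20m = np.arange(16, dtype=float).reshape(4, 4)   # toy stand-in for a 20 m band
restored = upsample_nn(block_downsample(band_20m))
error = rmse_dn(band_20m, restored)
```

In the actual evaluation, the DSen2 model prediction replaces `upsample_nn`, and the RMSE is reported in digital numbers.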
For preprocessing of Sentinel-1, we applied border noise removal based on the bidirectional all-samples method of Ali et al. [63], calibration, thermal noise removal, and orthorectification using the digital elevation model (DEM) GETASSE30 (Global Earth Topography And Sea Surface Elevation at 30 arc second resolution; shown to be applicable in [53]). These steps were carried out with the Sentinel Application Platform (SNAP) toolbox provided by the European Space Agency, and σ0 was derived. The backscatter-incidence angle relationship varies by landcover type and needs to be determined for each location to allow normalization and the subsequent combination and comparison of scenes over space and time. This is specifically of concern when large areas need to be covered; otherwise, typical striping patterns in orbit direction (across near to far range of the radar scene) occur (e.g., as seen in Figures 9 and 10 of Mahdianpari et al. [64]). Widhalm et al. [53] suggest a simplified method which is applicable to frozen tundra environments: a linear dependency is assumed, with validity for an incidence angle range of approximately 20° to 45°. A challenge for the normalization of Sentinel-1 data in our case is posed by the variable availability of polarizations (HH and HV versus VV and VH). So far, normalization parameters have been published only for HH and VV [51,53]; models still need to be calibrated for HV and VH. Existing studies used a landcover map based on Landsat for the Usa Basin, which is located to the west of the Northern Urals, Russia (Virtanen and Ek [65]; see also Bartsch et al. [19], Widhalm et al. [53], and Bartsch et al. [51]). Data from multiple orbits, representing a range of incidence angles for each location, need to be used.
Representative scenes over the Usa Basin have therefore been preprocessed in addition, and a function was fit to each landcover-specific sample to describe the relationship of the local incidence angle with σ0 for VH (IW mode) and HV (Extra Wide swath mode). The relationship between the slope values k for all the different landcover classes and σ0 at 30° was eventually derived in order to obtain the normalization functions.
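Assuming the linear backscatter-incidence angle model described above, the normalization itself reduces to a one-line correction. The sketch below is illustrative; the slope value in the example is a hypothetical placeholder, not one of the published parameters.

```python
import numpy as np

def normalize_sigma0(sigma0_db, inc_angle_deg, slope_db_per_deg, ref_angle_deg=30.0):
    """Normalize sigma0 (dB) to a reference incidence angle, assuming a linear
    dependence on the local incidence angle (valid roughly for 20-45 degrees).
    The landcover-dependent slope is derived from the calibration step."""
    inc_angle_deg = np.asarray(inc_angle_deg, dtype=float)
    return sigma0_db - slope_db_per_deg * (inc_angle_deg - ref_angle_deg)

# Example: a pixel observed at 40 degrees with a hypothetical slope of -0.2 dB/deg
corrected = normalize_sigma0(-15.0, 40.0, -0.2)
```

Applying this per pixel with the local incidence angle grid removes the near-to-far-range striping between adjacent orbits.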
After normalization of the Sentinel-1 data, a mosaic was created for each sub-region, and reprojected and subset to match the Sentinel-2 granules extent.

Calibration and Model Assessment Data Preparation
The different requirements of the algorithms as well as the inconsistencies in the input data (satellite data as well as calibration data) need to be considered. Two sets of training data have therefore been generated: one set for each of the tested algorithms. They consist of separate datasets for each of the polarization mode combinations (VV+VH and/or HH+HV as available), due to the spatial variability in polarization of the Sentinel-1 coverage (Figure 4). For the assessment of model efficiency (referred to as internal assessment in the following; usage of data consistent with the calibration data), the datasets have been split into two parts. The proportion used for internal assessment was 40% in the case of GBM and 20% for DL; the separation was made randomly. Twenty percent is commonly used for DL [66], as a comparably large amount of training data is needed in this case. This is less critical for GBM; therefore, twice that proportion was used for internal assessment. In order to demonstrate the need for targeted calibration data for each approach, the GBM algorithm has also been tested using the dataset created for DL. A proportion of 40% has been selected for internal assessment in this case to ensure comparability. The calibration dataset for the deep learning, based on OpenStreetMap, initially includes only buildings and roads in vector format. In order to represent other human-impacted surfaces, additional information had to be collected manually for this class for selected settlements with high spatial resolution coverage in Google Hybrid Maps. The training information also needs to be spatially complete for a certain area (window; here 512 × 512 pixels) in the case of deep learning. The vectors (polygons and lines) have been rasterized matching the 10 m spatial resolution of the Sentinel data. Pixels along the rendered line path in the case of lines, and pixels that have their center inside a polygon in the case of polygons, were considered during rasterization.
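The polygon part of the rasterization rule, a pixel is burned only if its center lies inside the polygon, can be sketched for the simple case of an axis-aligned rectangle; in practice a GIS rasterization routine handles arbitrary polygons and line rendering. Grid parameters and function names here are illustrative.

```python
import numpy as np

def rasterize_rectangle(xmin, ymin, xmax, ymax, origin, pixel_size=10.0, shape=(512, 512)):
    """Burn an axis-aligned rectangle into a 10 m raster: a pixel is labeled
    if its center falls inside the polygon, mirroring the rule used for the
    training masks. `origin` is the (x, y) upper-left corner of the grid."""
    ox, oy = origin
    rows, cols = shape
    # pixel-center coordinates (y decreases downwards, as in map rasters)
    xs = ox + (np.arange(cols) + 0.5) * pixel_size
    ys = oy - (np.arange(rows) + 0.5) * pixel_size
    xx, yy = np.meshgrid(xs, ys)
    mask = (xx >= xmin) & (xx <= xmax) & (yy >= ymin) & (yy <= ymax)
    return mask.astype(np.uint8)
```

For a 10 × 10 grid with a 10 m pixel size, a 30 × 30 m rectangle aligned to the upper-left corner covers exactly 3 × 3 pixel centers.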
The total DL training area covered approximately 7000 square kilometers and was used for the VV+VH and/or Sentinel-2 configurations. The area which could potentially be covered with sufficient-quality training data for the HH+HV extent (Greenland and Svalbard) is too small for DL calibration. HH+HV application with DL was therefore excluded from the analyses.
As an alternative to OpenStreetMap, which does not reflect the limitations of the satellite records (especially spatial resolution), an additional calibration dataset for GBM has been prepared based on visual interpretation of the high resolution datasets available in Google Hybrid Maps as well as the Sentinel-1 and -2 data available over the selected sites. In total, 463 polygons have been defined in the proximity of Varandai, in the Northern Urals, and on the Yamal peninsula (all in Russia, see Figure 3) for the Gradient Boosting Machine algorithm for classifications with VV+VH: 156 polygons for the road class, 141 polygons for the building class, and 166 polygons for other human-impacted areas (bright objects). Similarly, for classifications with the HH and HV channels of Sentinel-1, 65 polygons for the road class, 79 polygons for the building class, and 32 polygons for other human-impacted areas (bright objects) have been defined within the HH+HV covered area, in the vicinity of Longyearbyen and Barentsburg (Svalbard, Norway) as well as Ilulissat and Aasiaat (Greenland). The polygons represent target objects which are relatively large (extending over several pixels) and homogeneous. Samples have also been collected for water bodies and shrub to herbaceous tundra areas in order to spectrally separate these areas in the pixel-wise GBM classification. This type of dataset is referred to as the Region of Interest (ROI) dataset in the following.

Gradient Boosting Machine Learning
The GBM algorithm has been calibrated with the ROI dataset (specifically created for GBM, see previous section) and the rasterized tiles (windows created for DL) in order to evaluate the need for dedicated calibration records. In both cases, 60% has been used for calibration and 40% for internal assessment; the separation has been based on a stratified random selection. Different versions have been computed to test the effect of including Sentinel-1 data with varying polarizations and to compare the results to each other. We used a gradient boosting tree classifier from the xgboost library [42], which has the advantage of providing support for the graphics processing unit (GPU), leading to a significant speed-up during training and prediction. Gradient boosting iteratively adds trees, each one fitted to the residual errors of the ensemble built so far, thereby improving the quality of fit; each regression tree contains a continuous score on each of its leaves. During training, we used 10-fold cross-validation (training set split into ten equally sized parts, run ten times with each part acting as validation set once and the remaining nine parts as training set) and the macro-averaged F1-score (arithmetic mean over class-wise F1-scores, also known as the Dice similarity coefficient) for hyperparameter selection. Specifically, we tested combinations of the number of trees, maximum tree depth, and learning rate of the GBM classifier, but these parameters had to be restricted in favor of prediction time. The model with the combination that obtained the highest macro-averaged F1-score during cross-validation was automatically selected for the final training and prediction steps. The selected parameters are a learning rate of 0.1 and a maximum tree depth of 3 for all data combinations (ten different models were trained with different polarization combinations and input data). The selected number of trees varies across models and is either 500 or 1000.
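The selection criterion, the macro-averaged F1-score, can be written out explicitly. Below is a minimal numpy sketch of the metric itself (the actual runs used xgboost's cross-validation machinery; the function name is our own):

```python
import numpy as np

def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: arithmetic mean of per-class F1 (Dice) scores,
    as used for cross-validated hyperparameter selection. Inside the grid
    search, this is evaluated once per fold for each candidate combination
    of number of trees, maximum tree depth, and learning rate."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))
```

Averaging over classes (rather than pooling all pixels) prevents the dominant tundra and water classes from masking poor performance on rare classes such as roads.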
The trained models were saved as PKL files (Python pickle files, used for serialization of objects at runtime) and subsequently used to classify entire granules.
Target classes additionally include water and tundra for better separability (see Table 3). They represent the main landcover classes surrounding Arctic settlements.

Machine Learning Based on Neural Networks-Deep Learning
We split the training granules into tiles of 512 × 512 pixels with an overlap of 256 pixels and selected a total of 459 of these 512 × 512 tiles for training. The labeled masks required for training were likewise obtained through a rasterized version of the reference data. The dataset was further split into training (80%) and validation (20%) sets used during the training process. Due to the nature of the data (raster), a random split has been applied for the separation of the two sample sets. The selection of the appropriate ratio of tiles with and without infrastructure reference was found to be critical for model convergence and was obtained in a trial-and-error fashion. The Keras library [61] was used for the implementation of the training and classification routines, which were also run on GPU for computational efficiency. We used a UNet neural network architecture [46], which is especially useful for small datasets, as is clearly the case here. Here, small refers to the number of images commonly used for training CNNs (often tens of thousands in the case of semantic segmentation [67]). In contrast to pixel-based classification, deep learning-based semantic segmentation can capture context information [46]; that is why we expect it to be useful even for a small number of training images. We used data augmentation in the form of horizontal and vertical flips of the images (see, e.g., in [68][69][70]). We used a custom loss function and metric for monitoring progress during training. A commonly used metric for semantic segmentation with deep learning is also the F1-score, which in this context is often referred to as the Dice similarity coefficient. The metric used here is the global average of the Dice similarity coefficients calculated for each class separately. It is therefore essentially calculated in the same way as the macro-averaged F1-score used with the GBM classifier, but it uses predicted probabilities instead of hard class values.
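The class-averaged soft Dice coefficient used as training metric can be expressed as follows. This numpy version mirrors the logic of the custom Keras metric (the exact tensor implementation is not reproduced here; the smoothing term is our assumption):

```python
import numpy as np

def soft_dice(y_true_onehot, y_prob, eps=1e-7):
    """Average of per-class soft Dice coefficients, computed on predicted
    probabilities rather than hard labels (cf. the macro-averaged F1-score).
    Inputs have shape (..., n_classes); sums run over batch/spatial axes."""
    axes = tuple(range(y_prob.ndim - 1))
    intersection = np.sum(y_true_onehot * y_prob, axis=axes)
    denom = np.sum(y_true_onehot, axis=axes) + np.sum(y_prob, axis=axes)
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return float(np.mean(dice_per_class))

def dice_loss(y_true_onehot, y_prob):
    """Training loss: one minus the class-averaged soft Dice coefficient."""
    return 1.0 - soft_dice(y_true_onehot, y_prob)
```

Using probabilities keeps the loss differentiable, so it can be minimized directly by gradient descent during UNet training.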
The loss function used for training is the complement of the averaged Dice coefficient (one minus the averaged Dice coefficient). The classification of entire Sentinel-2 granules was subsequently performed using a sliding window of 512 × 512 pixels, again with an overlap of 256 pixels.
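The sliding-window tiling over a granule can be sketched as follows; the handling of the image border (shifting the last window back inside the image) is our assumption, as the exact border treatment is not detailed here.

```python
import numpy as np

def window_offsets(height, width, size=512, stride=256):
    """Top-left (row, col) offsets of overlapping windows covering an image.
    With stride = size // 2, interior pixels are covered by several windows,
    whose predictions can then be blended."""
    def axis_offsets(n):
        offs = list(range(0, max(n - size, 0) + 1, stride))
        # shift the final window back so it stays fully inside the image
        if n > size and offs[-1] != n - size:
            offs.append(n - size)
        return offs
    return [(r, c) for r in axis_offsets(height) for c in axis_offsets(width)]
```

For a 10,980-pixel Sentinel-2 granule edge this yields 43 offsets per axis; for a toy 1024 × 1024 image, the 256-pixel stride gives a 3 × 3 grid of windows.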
Target classes in the case of DL are confined to the artificial object types (see Table 3). Other surrounding landcover types do not need to be considered due to the nature of the deep learning algorithm.

Validation Strategy
In a first step, model efficiency was assessed based on the held-out proportion of the two algorithm-specific datasets (see the previous sections on the classification algorithms). In a second step, external datasets were used for independent assessment. The external datasets are not consistent across all tested regions (type of settlement and environment as well as coverage by specific polarizations). They have nevertheless been utilized for an independent assessment, as they are not algorithm-specific selections (see Section 2.2). Confusion matrices have been extracted for the external validation, covering users and producers accuracy in percent.
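For reference, users and producers accuracy can be derived from a confusion matrix as sketched below. This is a NumPy illustration with hypothetical counts; the orientation convention (rows as reference class, columns as mapped class) is an assumption, not a detail given in the study.

```python
import numpy as np

def users_producers_accuracy(cm):
    """Per-class accuracies from a confusion matrix.

    cm[i, j]: number of pixels with reference class i mapped to class j
    (rows: reference, columns: mapped class - an assumed convention).
    Producers accuracy = diagonal / row sums (omission errors);
    users accuracy     = diagonal / column sums (commission errors).
    Both are returned in percent.
    """
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)
    producers = 100.0 * diag / cm.sum(axis=1)
    users = 100.0 * diag / cm.sum(axis=0)
    return users, producers
```

For example, a two-class matrix [[40, 10], [20, 30]] yields a producers accuracy of 80% and 60% and a users accuracy of about 67% and 75% for the two classes.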

Normalization Parameters for HV and VH and Performance of the Adapted Super-Resolution Scheme
A linear relationship between the slope (characterizing the local incidence angle and backscatter relationship) and frozen-condition backscatter is apparent for the tested landcover types, with an R² of 0.91 for HV and 0.70 for VH (Figure 5). HV- and VH-polarized backscatter at C-band is in general lower than for the co-polarized channels, but the slope-backscatter relationships of HV and VH are almost identical to each other (Figure 5) and similar (merely offset) to those of HH [53] and VV [51]. The RMSE of our model for the super-resolution scheme (28.9) is lower than the one reported for the original model (34.5 [62]). Values are given in digital numbers (DN) of the Sentinel-2 data, i.e., reflectance multiplied by 10,000.
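The derivation of the normalization parameters from such a linear relationship can be sketched as a simple least-squares fit. The values below are purely illustrative stand-ins, not the study's actual per-landcover statistics.

```python
import numpy as np

# Illustrative values only: per-landcover-type slope of the local incidence
# angle relationship (dB/deg) and frozen-condition backscatter (dB).
slope = np.array([-0.21, -0.18, -0.15, -0.12, -0.10])
sigma0_frozen = np.array([-24.0, -22.5, -21.0, -19.8, -18.5])

# Linear model sigma0 = a * slope + b, yielding the normalization parameters
a, b = np.polyfit(slope, sigma0_frozen, 1)

# Coefficient of determination R^2 of the fit
pred = a * slope + b
ss_res = np.sum((sigma0_frozen - pred) ** 2)
ss_tot = np.sum((sigma0_frozen - sigma0_frozen.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

The fitted coefficients a and b then allow backscatter to be normalized for the local incidence angle in an operational preprocessing chain.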

Model Efficiency Assessment
The best results are obtained for the combination of Sentinel-1 with Sentinel-2 using Gradient Boosting Machines (see Table 4). A combination does not lead to a higher score in the case of DL. The highest F1-score is achieved with the use of Sentinel-2 only. A similar level of accuracy is, however, reached when both available polarization bands (VV+VH) are used in combination with Sentinel-2.
The available schemes provide class-specific scores for GBM only. The highest scores are reached for "other human impact" in all cases. The GBM results for the other classes differ between the VV+VH and HH+HV combinations. The accuracy of road identification is lower compared to buildings in the case of VV and/or VH. The opposite is the case for HH and/or HV usage. The actual score for roads is, however, similar for both polarization combinations. This suggests the best performance for VV and/or VH combinations. The usage of samples dedicated to application with Gradient Boosting Machines clearly provides better results for this method than usage of the OpenStreetMap-based sample dataset created for DL.

Table 4. Model efficiency assessment results (F1-score) for the Gradient Boosting Machine (GBM) and the Deep Learning (DL) algorithm. Separation by polarization combination of Sentinel-1 data (V-vertical, H-horizontal). The column "Tundra/other" reflects the tundra class in the case of GBM-specific training data, and both waterbodies and tundra in the case of DL-specific samples.

Assessment with External Data
The main sites for testing the performance of the algorithms are Prudhoe Bay, Western Greenland, and Svalbard. The Prudhoe Bay data allow assessment of users and producers accuracy, as they also provide spatially continuous information on classes other than human impact.
The results for Longyearbyen confirm the performance difference between VV+VH and HH+HV (Figure 6) already shown with the model efficiency assessment. Buildings are better detected with HH+HV and roads with VV+VH. The ability of the deep learning approach to separate human-made objects from the surroundings can be clearly shown for this area. Whereas DL achieves 100% accuracy, GBM shows errors of 40% and higher. The combination of Sentinel-1 with Sentinel-2 does not lead to better results for all categories compared to the usage of Sentinel-2 only in the case of DL, based on the full record in Lu et al. [54]. An example showing the University of Svalbard (UNIS) building demonstrates, however, the potential added value of Sentinel-1 VV+VH (Figure 7, ID #1). The full extent is only captured with VV+VH inclusion. The consideration of at least VV or HH, respectively, is of advantage in the case of GBM (Figure 6). This is also confirmed by the results for Greenland (Figure 8).

The users accuracy derived for the Prudhoe Bay area shows similar performance for GBM and DL in the case of roads (Figure 9). GBM shows better results for buildings than DL. In general, many objects are not detected by either algorithm. This is especially the case for roads (~60% not included), but also for other human-impacted areas (~40%). Only fragments of roads are identified by both algorithms (see Figure 10c,d). However, the results indicate that the spatial resolution is sufficient to identify the presence of roads, although not their complete extent. The DL results do not include all human-impacted areas. Often only a portion of a gravel pad is detected (Figure 10d). These areas are better represented in the GBM results, but many other natural surfaces bare of vegetation are included as well, e.g., river banks.

Surface Types and Building Properties
Roof material and shape play a role in the detectability of buildings. As these can be associated with certain uses, this also causes differences between residential, industrial, and public buildings. The capabilities of VV and HH polarization are distinct according to the results for Longyearbyen, where both are available (Figure 11). The inclusion of HH in the GBM classification leads to better results than VV in all cases. Agreement ranges between 60% and 90% for all material types except wooden roofs. Percentages are highest for sloped roofs. This can be confirmed with the comparisons for Greenland (Figure 12), although the agreement for the other types is much lower than at Longyearbyen. An agreement of more than 60% can be reached with VV only in the case of metal roofs.
The usage of Sentinel-1 for the deep learning approach does not lead to differences in detectability regarding building and road types (Figure 11). Deep learning results are similar to GBM based on the inclusion of Sentinel-1 with HH polarization. This can be confirmed with data from the Greenland sites (see Figure 12). DL clearly outperforms GBM only in the case of buildings with wooden roofs.
Building size clearly plays a role for identification (Figure 13). Buildings smaller than 1000 m² were in several cases not detected by either classification approach at Longyearbyen. Partial identification is common for the Gradient Boosting Machine results, both in the case of VV+VH and HH+HV. Smaller objects can, however, be identified fully with the deep learning approach, and partial identification is less common (only 20%; a further 20% undetected and 60% completely covered).
The assessment results for road types differ between Longyearbyen and the Greenland settlements. Deep learning seems to be more appropriate in the case of Longyearbyen. Gradient Boosting Machines appear to be superior for the Greenland sites for both gravel and asphalt roads. As expected, asphalt roads can be better separated than gravel roads.

Sensor-Specific and Scene Selection Issues
The combination of several Sentinel-2 acquisitions to deal with undetected clouds has been shown to be applicable in Arctic regions before [51]. A challenge is the short snow-free season and the resulting variability in phenological stages. We have chosen July to August as the months for scene selection. Snow patches could still occur at northern sites in July in some years. A check for snow presence (e.g., based on the Normalized Difference Snow Index) might be necessary for circumpolar application. The usage of late-season acquisitions (September) may lead to different results in separability from the surrounding tundra.
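A snow-presence check of the kind mentioned could be sketched as follows. The NDSI uses the Sentinel-2 green band (B3) and the first shortwave infrared band (B11); the 0.4 threshold is a commonly used value in the literature, not one validated in this study.

```python
import numpy as np

def ndsi(green, swir1, eps=1e-9):
    """Normalized Difference Snow Index from Sentinel-2 reflectances.

    green: band 3 (approx. 560 nm), swir1: band 11 (approx. 1610 nm).
    Snow reflects strongly in the visible but absorbs in the SWIR,
    so high NDSI values indicate likely snow cover.
    """
    green = np.asarray(green, dtype=float)
    swir1 = np.asarray(swir1, dtype=float)
    return (green - swir1) / (green + swir1 + eps)

def likely_snow(green, swir1, threshold=0.4):
    # commonly used threshold (~0.4); would need tuning for a
    # circumpolar application
    return ndsi(green, swir1) > threshold
```

Granules flagged by such a check could then be excluded from, or down-weighted in, the scene selection.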
In most cases, acquisitions from several years (up to three) needed to be combined (Table 2). Changes in landcover (construction of new roads, etc.) during that period may therefore not be fully represented. The first acquisition is therefore the most applicable time stamp for the classification. The full time span should, however, be included in the metadata of any derived product. An alternative would be to use only one cloud-free image, but detailed visual inspection would be necessary for each granule, under the assumption that undetected clouds can be manually identified. Such a strategy is nevertheless expected to lead to spatial inconsistencies, as the dates and years of suitable scenes vary across the Arctic.
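One possible implementation of such a multi-acquisition combination is a cloud-masked per-pixel median composite. The sketch below assumes per-date cloud masks are available; it illustrates the general idea and is not necessarily the exact compositing used in this study.

```python
import numpy as np

def cloud_minimizing_composite(stack, cloud_masks):
    """Per-pixel median composite over several acquisitions.

    stack:       array of shape (T, H, W, B) with T acquisitions
    cloud_masks: boolean array (T, H, W), True where a pixel is cloudy

    Cloudy observations are set to NaN and excluded; the median over
    the remaining dates also suppresses undetected clouds, since their
    bright reflectances behave as outliers.
    """
    stack = np.asarray(stack, dtype=float)
    masked = np.where(np.asarray(cloud_masks)[..., None], np.nan, stack)
    with np.errstate(all="ignore"):
        return np.nanmedian(masked, axis=0)
```

A pixel that is cloudy on every date remains NaN in the output and can be flagged as a gap in the derived product.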
Previous studies which target larger area coverage have focused on VV+VH usage in the case of Sentinel-1, as this is the commonly available acquisition mode [23,71,72]. Our results indicate that Sentinel-1 VV+VH and HH+HV provide complementary information with respect to the infrastructure types of interest. A combination would theoretically be possible for Longyearbyen, where both data types are available. The polarization combinations are, however, acquired from different orbits, ascending and descending. This leads to differences in acquisition geometry and to a spatial offset of the associated backscatter in the case of buildings. Higher backscatter related to a certain building can be spatially offset by several pixels (Figure 1). Quad-pol acquisitions would therefore be of advantage. Such data are potentially available in C-band for the Arctic from Radarsat-2, but are not spatially consistent and are subject to access restrictions.
The spatial offsets in the case of buildings in SAR data subsequently also pose a major issue when combined with multispectral data. The detectability of objects with SAR strongly depends on shape and orientation with respect to the sensor (see, e.g., in [29,30]). This effect may specifically explain the partially lower performance of the combined versus the Sentinel-2-only version in the case of the deep learning algorithm (compare, e.g., Figure 6). The misalignment is expected to reduce the separability of objects. This difference is not apparent in the GBM accuracy values, which may partially result from the fact that the training data have been defined taking sensor-specific properties into account (spatial resolution and heterogeneity). The impact of the misalignment effect could not be tested for the combination of HH+HV and Sentinel-2 in the case of the deep learning algorithm. The limitation of Sentinel-1 HH+HV availability to Greenland and Svalbard reduces the extent of the training dataset considerably. The usage of full-coverage cadastral information for training instead of OpenStreetMap may allow for an extension to HH+HV usage.
The direct HH+HV and VV+VH comparison was only possible for Longyearbyen, Svalbard. This settlement is located in an area with rather sparse vegetation. Performance may differ for tundra regions with higher vegetation coverage; specifically, misclassifications are expected to be lower.

Suitability of Algorithms
The simplified normalization approach introduced in [53] for HH can be shown to be applicable for all available polarization combinations of Sentinel-1 (Figure 5). This enables operational preprocessing. The good performance (better RMSE than reported before) of the adapted super-resolution scheme can be explained by the regional focus. Lanaras et al. [62] explicitly note that even better results than theirs could likely be achieved if a user focusing on a specific task and geographic region retrains their proposed network with images from that particular environment.
Both algorithms were selected due to their known performance in the case of scarce data. Training data generation still poses a major challenge for Arctic settlement detection. For example, DL training for HH+HV could not be implemented: too few settlements with good-quality OpenStreetMap data exist for Greenland and Svalbard (the only regions for which HH+HV acquisitions are available). The definition of calibration areas for GBM requires the availability of very high spatial resolution satellite data in order to identify relatively homogeneous objects. The availability of such data is also limited across the Arctic.
The performance of the algorithms differs between the classes, and each shows weaknesses in certain cases. GBM provides better detection of gravel pads (see Figure 10). DL clearly represents roads in their linear shapes, but other human-impacted areas are not fully detected. The error of commission in the GBM results is, however, very high, especially where river banks, beaches, and bedrock surfaces occur. These are often misclassified as buildings and other human-impacted areas, which requires extensive manual postprocessing. DL (Sentinel-2 only)-based detection of single buildings is more complete than with GBM (Figure 13). This can also be seen in the comparison with OpenStreetMap data in the case of Longyearbyen (Figure 7). Both buildings which are part of the validation dataset (#1 and #5) are composed of a mixture of classes in the GBM results. A combination of results from both algorithms might aid automation, for example, preselection of objects/target regions based on the DL classification using Sentinel-2 only. A similar two-step approach was suggested by Iannelli and Gamba [24] in the case of urban area detection for larger cities.
As HH+HV is unavailable in IW mode for most of the Arctic, the target class "buildings" should therefore preferably be derived with the deep learning method. Whereas roads and buildings have specific shapes, other human-impacted areas do not. The advantage of the pixel-based Gradient Boosting Machines approach is in this case its independence from object shape. This is specifically of added value for the identification of airstrips, which are partially not covered in the DL results (see the Prudhoe Bay example, Figure 10).
The F1-score was used in all cases of model assessment. Global accuracy metrics are not appropriate in the case of imbalanced class frequencies; metrics calculated per class and averaged over all classes are usually used to avoid biases from the dominant classes [73]. There is a large variety of metrics used for assessing classifier performance. Here, we used the macro-averaged F1-score (the arithmetic mean of class-wise F1-scores). The F1-metric is widely used for semantic segmentation tasks [73] and can be readily applied to the pixel-based classification using the GBM. Opitz and Burst [74] note that there is another formula for calculating the macro F1 in the literature, but clearly recommend the use of the arithmetic mean of class-wise F1-scores in the case of imbalanced classes.
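The two macro-F1 variants mentioned by Opitz and Burst can be illustrated as follows. This is a NumPy sketch; the confusion-matrix orientation (rows as reference class) is an assumed convention.

```python
import numpy as np

def f1_scores_from_confusion(cm):
    """Class-wise F1-scores from a confusion matrix (rows: reference)."""
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)
    precision = diag / np.maximum(cm.sum(axis=0), 1e-12)
    recall = diag / np.maximum(cm.sum(axis=1), 1e-12)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

def macro_f1(cm):
    # arithmetic mean of class-wise F1-scores: the variant recommended
    # by Opitz and Burst for imbalanced classes, and the one used here
    return float(np.mean(f1_scores_from_confusion(cm)))

def macro_f1_averaged_pr(cm):
    # alternative formula found in the literature: harmonic mean of the
    # macro-averaged precision and recall (NOT the variant used here)
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)
    p = np.mean(diag / np.maximum(cm.sum(axis=0), 1e-12))
    r = np.mean(diag / np.maximum(cm.sum(axis=1), 1e-12))
    return float(2 * p * r / max(p + r, 1e-12))
```

The two formulas agree for balanced classes but diverge with increasing class imbalance, which is why the arithmetic-mean variant is preferred here.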

Suitability of Training and Validation Data
The need to utilize targeted calibration datasets (different for GBM and DL) limits a direct comparison of the model efficiency results in all cases. The model efficiency assessment in the case of GBM (use of a GBM-specific calibration dataset versus a DL-specific dataset), however, underlines the need for such a differentiation. The strengths and weaknesses of each approach can be demonstrated by using external independent data for the validation. Especially the separation of naturally bare areas from human infrastructure (roads and buildings) is better with DL.
A spatially consistent assessment was only possible for the Prudhoe Bay area. The unavailability of the subclass "buildings" was, however, a limitation. Alternative datasets might exist around the Arctic but are not openly available.
OpenStreetMap cannot be expected to be complete (due to lack of input in general as well as changes over time) within the training areas, which may have reduced the calibration performance and subsequently the classification accuracy of the DL results. This is specifically expected over the Siberian sites. There are also constraints in the case of GBM. We have partially combined typical tundra environments (NW Siberia) and settlements on bedrock (Svalbard and Greenland) for the calibration (see Figure 3). This could not be followed in the case of HH+HV, as this mode is largely acquired over regions where settlements are built on bedrock or on very sparsely vegetated sites. This may limit the comparability of the performance results. Figure 14 shows features included in OSM compared to the GBM and DL results for a subset of a Sentinel-2 granule located in Western Siberia. This granule was excluded from the DL calibration due to obvious incompleteness (see Figure 3). The majority of detected features are not represented in OSM.

Target Classes
Validation of SAR-derived settlement characteristics is usually limited to the presence of built-up areas and/or their density (see, e.g., in [26,71]). Evaluations are made for continents or across the globe, but not specifically for the Arctic. The producers accuracy in the case of GBM/HH as well as the DL results for Longyearbyen is similar to the results obtained for Sentinel-1 application on the North American and Asian continents for built-up area identification [71]. The error of omission is similar for GBM, but the performance of DL (90-100% of natural surfaces classified correctly) is clearly better than that reported for these global studies. Corbane et al. [26] suggested the use of interferometric coherence for urban area detection. The assumption would be a stable signal return over time not only from buildings but also from all human-impacted areas. This is specifically applicable in highly vegetated regions. We therefore did not consider the usage of coherence, as naturally bare areas are common in our study areas. Long-term coherence in C-band is comparably high for high Arctic land surfaces, for example, around northern settlements on Greenland [75]. The use of long-term coherence may, however, help to improve built-up area detection in shrub tundra areas, where signal decorrelation from summer season to summer season can occur [76]. The F1-score for the GBM results in the case of roads is similar to results reported for a range of algorithms tested by Zhang et al. [47] on Sentinel-1 VV+VH for a district of Beijing. There, VV was found superior to VH, whereas our results show no consistent difference. Performance is, however, partially higher for Sentinel-1 HH+HV and Sentinel-2 combinations (>0.9; Table 4). It should be noted that previous studies (including [23,26,41,47,71]) have focused on VV+VH usage only in the case of Sentinel-1, as this is the commonly available acquisition mode for IW over land.
All of the defined target classes contain various surface types which differ in reflectance (both optical and microwave). Material as well as shape (in the case of buildings) play a role (Figure 11). Ideally, this should be reflected in the target classes (larger diversity), but the practical implementation is impeded by the lack of sufficient training data. According to Brunner et al. [77], the variation of the strength of the double-bounce signal with incidence angle depends on the surrounding material (asphalt versus vegetation). Such effects may also lead to uncertainties in our results.
The differences regarding classification accuracy between Longyearbyen and the Greenland sites with respect to building use (especially for "residential"; see Figures 11 and 12) might relate to differences in material, size, and shape, or could also be an artifact of the limited sample availability, specifically in the case of Longyearbyen.
Many relevant surface features cannot be detected with Sentinel-1 and Sentinel-2. This includes pipelines and informal trails. Hinkel et al. [78] suggest a horizontal resolution in the range of 1.4 to 2.5 m for the detection of tundra trails. Trails can exceed the pixel size in some cases, but often do not appear as bare areas. There can be ponds along the route due to thermokarst, as well as adapted vegetation communities. The vegetation composition changes due to increased wetness (as disturbance results in ground thaw). Detection would require the introduction of an additional class and dedicated training data.
Existing studies on Arctic settlements and their change over time are mostly limited to coarser spatial resolution data. Vegetation change or heat island effects are investigated [79,80]. Landsat data have been applied for a landcover map which also considers "human infrastructures" in the taiga-tundra transition zone, with a comparably low proportion (~8%) of vegetation-free surfaces [10]. Settlements and roads have not been separated in this case. The producers accuracy was estimated at 82%, but the users accuracy was only derived based on a merged dataset of "bare land (sand and stone)" and "human infrastructures". Therefore, the results cannot be directly compared. Landsat time series can provide insight into long-term trends and have been shown to be of high value for studying the natural environment in the Arctic (see, e.g., in [81]). A combination with the identified infrastructure may provide insight into the timing of its construction.

Conclusions
Both tested machine learning approaches, Gradient Boosting Machines and deep learning, have advantages and disadvantages. Deep learning is superior to GBM with respect to users accuracy. GBM results in misclassifications in the case of, for example, river beds (common in tundra settings) and bedrock areas. The results therefore require comprehensive postprocessing (including manual refinement). SAR provides added value in the case of GBM. VV is of benefit for road identification and HH for the detection of buildings. The Sentinel-1 acquisition strategy, however, varies across the Arctic. The majority of the Arctic is covered in VV+VH only. DL is of benefit for road and building detection but misses large proportions of other human-impacted areas such as gravel pads, which are typical for gas and oil fields in the Arctic. A combination of both GBM (Sentinel-1 and -2 combined) and DL (Sentinel-2, with Sentinel-1 optional) is therefore suggested for circumpolar mapping of features not represented in global datasets. This comprises thematic content (roads and human-impacted areas, as not included in, e.g., GUF and GHS), inclusion of all Arctic human-impacted areas (gaps currently exist in GUF, GHS, and OSM), and provision of consistent information with a time stamp (especially a shortcoming of OSM). The actual spatial extent of features, rather than representation as points, can then be considered for advanced risk assessment studies with respect to climate change impacts.