Semi-Automated Semantic Segmentation of Arctic Shorelines Using Very High-Resolution Airborne Imagery, Spectral Indices and Weakly Supervised Machine Learning Approaches

Abstract: Precise coastal shoreline mapping is essential for monitoring changes in erosion rates, surface hydrology, and ecosystem structure and function. Monitoring water bodies in the Arctic National Wildlife Refuge (ANWR) is of high importance, especially considering the potential for oil and natural gas exploration in the region. In this work, we propose a modified variant of the Deep Neural Network based U-Net architecture for the automated mapping of 4-band orthorectified NOAA airborne imagery using sparsely labeled training data, and compare its performance to traditional Machine Learning (ML) approaches (namely random forest and xgboost) and to spectral water indices (the Normalized Difference Water Index (NDWI) and the Normalized Difference Surface Water Index (NDSWI)) to support shoreline mapping of Arctic coastlines. We conclude that it is possible to modify the U-Net model to accept sparse labels as input, with results comparable to other ML methods (an Intersection-over-Union (IoU) of 94.86% using U-Net vs. an IoU of 95.05% using the best performing method).


Introduction
The Arctic Ocean has the longest coastline of any ocean on Earth and has recently been characterized as one of the most climate change vulnerable ecosystems on the planet [1]. Some of the most widely seen changes that are occurring include the lengthening of open-water seasons, stronger storms, declining sea ice extent and thickness, lake drainage, and increased rates of coastal erosion, all of which are impacting a range of ecosystem goods and services, wilderness areas, numerous mostly coastal indigenous communities and their subsistence-based food security and cultural and heritage sites, as well as industry, defense, and energy related infrastructure [2][3][4][5]. Approximately 65% of the Arctic coastline is unlithified [6], persists barely above a rising sea level, and/or exhibits degrading ice-rich permafrost [7] that appears to be subsiding in many Arctic locations [8]. Importantly, most sustained arctic coastal erosion studies appear to show an increase in the rate of coastal erosion, and several suggest erosion rates for the last few decades are 50-100% greater than those recorded in the previous half century [9][10][11][12]. Increasing rates of coastal erosion translate to large amounts of soil organic carbon loss, which contribute to positive feedback loops that strengthen warming trends and accelerate ecological changes across the globe.
consists of a CNN logistic regression classifier [58]. Most of the research involving advanced DL architectures has been conducted using 30 m spatial resolution Landsat images.
Despite advances in techniques for land cover mapping in recent years, access to labeled data remains a limiting factor. Deep neural networks usually require thousands of training images in which the desired features have already been annotated. However, a recently developed CNN-based model, U-Net [59], has been shown to provide highly accurate semantic segmentation with small amounts of training data. Since manual labeling from satellite images requires considerable time and effort to generate training labels, the U-Net model is a good alternative in this situation. While the U-Net model requires significantly less data than traditional deep-learning-based models, one of its shortcomings is the need for accurate pixel-level annotations during training. These annotations, typically referred to as dense labels, as seen in Figure 1b, require a considerable amount of time and effort, and often skilled annotators with a good understanding of the region of interest and labeled classes. Sparsely labeled data, as seen in Figure 1c, where only some pixels of a given image are labeled, can be collected in large amounts relatively quickly and cheaply, often without the need for an expert. Promising advances in computer vision research have produced methods that can learn from unlabeled or partially labeled data [60], and many research projects have focused on training CNN models with sparse training labels [61][62][63]. In this paper, we present supervised ML methods that can learn from sparsely labeled data to segment land and water pixels in high-resolution imagery, and we compare their performance to remote sensing-derived indices for detecting water and land boundaries. In summary, the main objectives of this paper are to:
1. Compare the performances of different ML algorithms and remote sensing indices derived from VHR airborne multispectral imagery for shoreline mapping on the Beaufort Sea coast of the Arctic National Wildlife Refuge, Alaska;
2. Modify the U-Net model (a supervised learning approach for deep neural networks) to accept sparse labels as input while generating densely segmented labels as output.

Study Area and Data Sources
Coastlines along the Beaufort Sea are geomorphologically variable but can generally be classified as either bays/inlets, deltas, exposed bluffs, lagoons or tapped basins [64], where coastal lagoons make up over 50% of the region [65]. There are numerous factors controlling erosion of these coastal features including duration of sea ice-free extent, wind fetch length, nearshore bathymetry, land cover type, and ground ice-content [11,66]. These coastal features contain substantial stores of soil organic carbon (SOC) [66] and the erosion and subsequent release of SOC is partially responsible for the organic matter input to nearshore marine environments that supports productive food webs [67] while riverine sediment transport also plays a significant role [68]. This influx of terrestrial and organic materials along with the range of water depths and sediment compositions are what makes these nearshore waters (from a remote sensing perspective) optically complex [4,68]. Within these areas are shallow lagoons and embayments with depths typically no greater than 10 m [69]. Sea level is minimally impacted by tidal range along the Beaufort Sea Coast (<50 cm) but can be dramatically elevated by a couple of meters through wind driven action [3,70].
This study focuses on the coastal margin of the "1002 area" of the Arctic National Wildlife Refuge (ANWR). ANWR is a ∼78,000 km² coastal plain region on the eastern North Slope of Alaska and was established as a refuge in 1980 through the Alaska National Interest Lands Conservation Act by Congress, recognizing the large potential for oil and gas resources and its importance as a wildlife habitat [3]. The 1002 area consists of barrier islands, salt marshes, coastal lagoons, coastal bluffs and river deltas that provide habitats for over 42 fish species, 37 land mammals, eight marine mammals and ∼200 residential and migratory bird species (Figure 2) [3]. These coastlines can display narrow, low-lying beaches while backshore coastal morphology can consist of sand and gravel beaches, barrier islands, wetlands, barrier spits, and low-lying permafrost coastal bluffs with a range of ice content that are typically 2-6 m above sea level in some areas [3,69,71]. Surface features of the coastal plain within our study area consist of tapped and untapped thermokarst lakes, coalesced low-center polygons, and braided rivers rich with sediment from the interior Brooks Range [3]. We obtained National Oceanic and Atmospheric Administration (NOAA) RSD high-resolution airborne RGB and NIR imagery covering the roughly 170-kilometer coastline of the 1002 area of ANWR collected between 18-19 July 2017. For both RGB and NIR scenes, 265 orthomosaiced image tiles were downloaded through NOAA's Data Access Viewer (https://www.coast.noaa.gov/dataviewer/ (accessed on 23 January 2021)). Each image tile measured 2.5 km × 2.5 km for a total imagery footprint of 1672.80 sq. km (Figures 3 and 4). Image tiles were downloaded in either 3-band (RGB) or 1-band (NIR), 8-bit GeoTIFF format and later reprojected to a NAD83/Alaska Albers (EPSG: 3338) coordinate reference system. Finally, corresponding RGB and NIR image tiles were composited into 265 4-band images using ArcMap 10.6.
NOAA RSD collected this imagery from a Beechcraft King Air 350CER manned aircraft flying at a nominal altitude of ∼2286 m above ground level (AGL) with two Applanix Digital Sensor System (DSS) SN580 cameras (one each for RGB and NIR). Image capture and precision georeferencing were synchronized and completed with an on-board Applanix POS/AV410 Global Navigation Satellite System (GNSS) and Inertial Measurement Unit (IMU). The RGB and NIR camera systems had focal lengths of 52 mm and CCD pixel sizes of 5.2 × 5.2 µm and 6.0 × 6.0 µm, respectively. The ground sampling distance (GSD) of posted RGB and NIR orthomosaic image tiles was 35 cm. Stated horizontal accuracy for posted orthomosaics was +/−1.5 m at 95% CI.

Methods
The task of identifying and mapping geomorphological features in remote sensing images fits well within the framework of semantic segmentation. Semantic segmentation is one of the oldest and most widely studied problems in computer vision [72][73][74][75][76] and involves understanding not only what objects are in the scene, but also in which regions of the image the objects are located and at what spatial extent. In recent years, land cover mapping using semantic segmentation of satellite/airborne images has seen great success in different application domains. This can partly be credited to an increasingly large amount of fully annotated images. However, collecting large-scale, accurate pixel-level annotations is time consuming and sometimes requires substantial financial investment and skilled labor. We get around such challenges by introducing a new method to train the modified U-Net model with easy-to-generate sparse labels. Here, we apply two remote sensing indices (NDWI [33] and NDSWI [35]), two classical ML techniques (random forest and eXtreme Gradient Boosting, or xgboost [77]), and a modified U-Net (Section 3.2) to automate the mapping of water bodies from airborne imagery using a high-performance cloud computing environment. We then fine-tuned the threshold to generate binary labels using a Decision Stump (DS) (Section 3.3). We utilized a modified version of the U-Net architecture [59] to perform semantic segmentation, i.e., pixelwise classification in images. Using the ANWR as an area of study, we leveraged freely available high-resolution orthomosaic imagery for training and evaluation. We generated sparse labels for training and dense labels for evaluation using the approach in Section 3.1. Using these resources, we developed an extensible pipeline, a dataset, and baseline methods that can be utilized for generating land/water masks from high-resolution airborne images.
We also present qualitative and quantitative results describing properties of our models.

Label Creation Strategy
Composited 4-band airborne image tiles were used to manually delineate areas of both water and land pixels by creating polygon features in ESRI shapefile format within ArcMap. These shapefiles were then used as model training and testing datasets. Of the 265 total airborne scenes, 165 scenes were hand-annotated with sparse labels by two remote sensing and ecological scientists familiar with landforms in the study area. Annotations were used for training after filtering out scenes that contained only water or only land, or that had image artifacts due to orthomosaicing. A total of 20 different scenes were used to create dense labels for the test sets. The strategy used by the specialists to make the sparse labels consisted of creating circular polygon features of various sizes within some (but not all) land and water regions of each scene. The dense labels, however, were annotated more carefully, requiring considerably more time to delineate every single water pixel in each scene. While coastal water annotation was visually more straightforward, with the exception of deltaic regions, some terrestrial features required additional scrutiny to determine the presence of surface water. Annotators used a combination of visual characteristics to classify a feature as containing surface water:

1. Dark color in the visual spectrum, indicating sufficient light attenuation in standing water;
2. The presence of reflected light due to ripples or waves caused by wind; and
3. The presence of accumulated white water on the western shorelines of water bodies caused by prevailing easterly winds.
The corresponding land labels were created by inverting water polygons for each scene.

Spectral Water Indices
This study focused on testing and utilizing the capacity of NDWI and NDSWI for masking water and land pixels along arctic coastal tundra shorelines. Due to the selection of available bands in the source imagery used in this study (near-infrared, red, green, and blue), we were limited in the number of indices useful for water detection (primarily those that utilize the near-infrared band as opposed to the shortwave-infrared). Moreover, NDSWI was specifically developed using in situ hyperspectral data of tundra wetlands, with the goal of developing an index that would not be confounded by the atmospheric moisture to which other spectral water indices have shown sensitivity in these arctic coastal marine ecosystems [35]. The equations for the indices are shown in Equation (1).
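For reference, the band combinations commonly cited for these indices (NDWI from the green and NIR bands [33]; NDSWI from the blue and NIR bands [35]) can be computed per pixel as in the sketch below. This assumes 8-bit band arrays and the widely used normalized-difference forms; the exact formulation used in this study is given in Equation (1).

```python
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """NDWI = (Green - NIR) / (Green + NIR); water tends toward +1."""
    green, nir = green.astype(np.float64), nir.astype(np.float64)
    denom = green + nir
    # Where both bands are zero, return 0 instead of dividing by zero
    return np.divide(green - nir, denom, out=np.zeros_like(denom), where=denom != 0)

def ndswi(blue: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """NDSWI = (Blue - NIR) / (Blue + NIR); developed for tundra surface water."""
    blue, nir = blue.astype(np.float64), nir.astype(np.float64)
    denom = blue + nir
    return np.divide(blue - nir, denom, out=np.zeros_like(denom), where=denom != 0)

# Tiny 2x2 patch of 8-bit digital numbers: water-like pixels have low NIR
green = np.array([[120, 80], [60, 200]], dtype=np.uint8)
nir = np.array([[30, 160], [20, 180]], dtype=np.uint8)
water_mask = ndwi(green, nir) > 0  # positive index values are water-like
```

Both indices are bounded to [−1, 1], which is why the thresholding step in Section 3.3 treats them separately from the 0-1 probabilistic model outputs.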

Machine Learning
Fundamental to any form of geospatial remote sensing image processing is the need for reliable, repeatable, and accurate identification and delineation of landscape features (e.g., shoreline, bluff edge, and beach width). Arctic Coastal Change Detection (ACCD) is challenged by the need to detect change in coastal features (waterline, bluff edge, beach, etc.) over thousands of kilometers of coast at high spatial and temporal resolutions. This type of land cover mapping is an application of a broader class of problems in the computer vision community, known as semantic segmentation, for which supervised ML-based approaches have performed well. Furthermore, these techniques have improved rapidly in recent years due to progress in deep learning and semantic segmentation with Convolutional Neural Networks (CNNs) [59,78]. Recently, high-performance computing, ML, and deep learning approaches have provided solutions for efficient and accurate landscape feature mapping across different ecosystems. In the Arctic, studies have delineated polygonal tundra geomorphologies [45,79], arctic lake features [80], glacier extents [48,81,82], and coastal features [40][41][42]83]. In this research, we propose an automated pipeline using traditional ML-based methods (random forest and xgboost) and a deep neural network based U-Net architecture for arctic coastal mapping and compare their performances. One advantage of ML-based approaches over spectral indices is that ML-based land cover mapping generalizes to features such as impervious surfaces, wetlands, and plant functional types (PFTs), for which the spectral indices are not well defined.

Threshold Fine-Tuning
Given that NDWI and NDSWI values range from −1 to 1, that the outputs from random forest, xgboost, and U-Net are probabilistic in nature and range from 0 to 1, and that we require binary labels as the final segmentation mask, we needed to effectively convert these intensities to a binary mask. A simple and widely used approach is to apply a threshold of 0.5 to probabilistic output intensities and to select an appropriate threshold from the literature for NDWI (>=0.3 for water [84]) and NDSWI. Other methods for threshold fine-tuning include analyzing the ROC curve [85], Otsu's method [86], and DS, a one-level decision tree [87]. We implement DS with IoU as the single input feature using exhaustive search.
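A decision stump over a single feature amounts to an exhaustive scan of candidate thresholds, keeping the one that maximizes the target metric. A minimal sketch of that search (function and variable names are ours, not from the released code):

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-Union for two boolean masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

def best_threshold(intensity, truth, lo=0.0, hi=1.0, step=0.01):
    """One-level decision stump: exhaustively scan thresholds and keep the
    one maximizing water-class IoU on the validation set."""
    thresholds = np.arange(lo, hi + step, step)
    scores = [iou(intensity >= t, truth) for t in thresholds]
    k = int(np.argmax(scores))
    return thresholds[k], scores[k]

# Toy validation set: intensities separate water cleanly above ~0.2
intensity = np.array([0.1, 0.2, 0.7, 0.9])
truth = np.array([False, False, True, True])
t, score = best_threshold(intensity, truth)
```

For the spectral indices, `lo=-1.0` would be passed instead, since their range is −1 to 1.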

Architectural Overview
We used two different spectral water indices (NDWI and NDSWI) and three ML methods (random forest, xgboost, and a modified variant of the U-Net architecture [59]) for this study. Since U-Net expects training labels to be dense, we modified dice loss [88] by masking it with the pixel locations of the sparse labels. This masked dice loss, with gradient descent as the optimization algorithm, was then used to train the modified U-Net architecture.
Our approach is summarized in the multi-step pipeline presented in Figure 5, using NDWI and NDSWI to generate intensity masks (see Section 3.2 below). These approaches for generating intensity masks do not require training labels. To train the ML models, we first converted the raw vector sparse labels to corresponding image masks for each image. We then augmented the 4-band orthorectified airborne imagery with NDWI and NDSWI as additional channels, divided each airborne image and its corresponding mask into 225 subregions, filtered out any subregion in which fewer than 10% of pixels were labeled, and randomly split the remainder into training, testing, and validation data sets. The training and validation samples are normalized dynamically during training with the mean and standard deviation of the test set. As a post-processing step, we then generated a binary mask by thresholding the intensity for the "water" class using optimal thresholds calculated on the validation set. The final scores that we report use the densely labeled test set.
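The tiling and filtering step above can be sketched as follows. The 512 × 512 tile size follows Section 4.4 and the 10% cutoff follows the text; the mask convention (0 = unlabeled, 1 = land, 2 = water) and function name are our assumptions for illustration.

```python
import numpy as np

def tile_and_filter(image, mask, tile=512, min_labeled=0.10):
    """Split an image (H, W, C) and its sparse label mask (H, W) into
    non-overlapping tiles, keeping only tiles where at least `min_labeled`
    of the pixels carry a label. Mask: 0 = unlabeled, 1 = land, 2 = water."""
    h, w = mask.shape
    kept = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            m = mask[r:r + tile, c:c + tile]
            # Fraction of pixels with any label in this tile
            if (m > 0).mean() >= min_labeled:
                kept.append((image[r:r + tile, c:c + tile], m))
    return kept
```

The kept tiles would then be shuffled and split into training, testing, and validation sets before normalization.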

Masked Dice Loss
With the introduction of Convolutional Neural Networks (CNNs), different application areas involving semantic segmentation have achieved good results [48,59,78]. One downside of these CNN architectures is the need for densely labeled segmentation data, which can be time consuming to obtain in large amounts. It is therefore a promising direction in computer vision research to develop semantic segmentation methods that can learn without the need for dense labels. Such labels are also commonly referred to as weak labels. Previous research has reported semantic segmentation networks trained with various types of weak labels, such as image-level annotations [60,89] and sparse labels [61,63]. In this research, we present masked dice loss and a method to train a deep neural architecture using sparse training labels and masked dice loss. Dice loss is based on the Sorensen-Dice coefficient [90,91], a statistic used to gauge the similarity of two samples. In the computer vision community, it was introduced for 3D medical image segmentation [92]. Dice loss is given by the equation:

DiceLoss = 1 - (1/|C|) Σ_{c∈C} [ 2 Σ_{i∈I} p_{i,c} g_{i,c} / ( Σ_{i∈I} p_{i,c} + Σ_{i∈I} g_{i,c} ) ]

where C is the set of classes that are present in the image, I is the set of pixels in the image, p_{i,c} denotes the probabilistic output from the model for class c at position i, and g_{i,c} denotes the ground truth value for class c at position i. For our purpose, we need to train the semantic segmentation network using sparse labels, so we restrict the sums to the set of sparsely labeled pixels I_s ⊆ I, giving the masked dice loss:

MaskedDiceLoss = 1 - (1/|C|) Σ_{c∈C} [ 2 Σ_{i∈I_s} p_{i,c} g_{i,c} / ( Σ_{i∈I_s} p_{i,c} + Σ_{i∈I_s} g_{i,c} ) ]

With the masked dice loss, gradients are computed and back-propagated based only on the outputs for the pixels present in the ground truth sparse labels.
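The paper trains the network in PyTorch; the forward computation of the masked loss can be sketched in NumPy as below (class-averaged dice restricted to labeled pixels; the small `eps` stabilizer for classes absent from a tile is our addition, not from the paper):

```python
import numpy as np

def masked_dice_loss(probs, target, labeled, eps=1e-7):
    """Dice loss computed only over sparsely labeled pixels.
    probs, target: (C, H, W) arrays of per-class probabilities / one-hot truth;
    labeled: (H, W) boolean mask of sparsely annotated pixel locations."""
    p = probs[:, labeled]   # (C, N_labeled): predictions at labeled pixels only
    g = target[:, labeled]  # (C, N_labeled): ground truth at labeled pixels only
    dice = (2.0 * (p * g).sum(axis=1) + eps) / (p.sum(axis=1) + g.sum(axis=1) + eps)
    return 1.0 - dice.mean()
```

Because unlabeled pixels never enter the sums, predictions there contribute nothing to the loss, which is exactly what permits training on sparse annotations.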
Reproducibility: Our implementation is based on scikit-learn [93] and pytorch [94]. All networks were trained on an Azure NC6 Virtual Machine powered by an NVIDIA Tesla K80 GPU. The code to replicate our process is available at (https://github.com/Aryal007/coastal_mapping (accessed on 30 May 2021)). Figure 6 shows examples of output land/water segmentation masks using the different spectral indices and ML-based approaches. From the figure, we can observe that the U-Net model produces intensities closer to the extreme values for land/water classification than the other methods. For quantitative evaluation, the intensity masks need to be converted to binary masks with unique values for each class (in this case, land and water), as can be seen in Figure 7.
Figure 6. Output intensity on a sample image from the test set using different models. The intensity has been normalized to the 0-1 range for NDWI and NDSWI for plotting. The intensity for U-Net, Random Forest, and XGBoost represents the probability of the corresponding pixel being classified as water.
Figure 7. The intensity masks are converted to respective binary masks using thresholds as seen in Section 3.3 for evaluation.

Threshold Fine-Tuning
While visualizing the results, instead of simply finding the threshold that works best for each method, we plot a curve showing the performance at each possible threshold with a step size of 0.01. We use the validation set to determine the threshold that produces the highest IoU for the water class using each method. The highest IoU and the threshold that yielded it on the validation set are summarized in Table 1. Figure 8 shows the histogram of pixels corresponding to the land and water classes in the validation set, the IoU for the land/water classes at thresholds spaced 0.01 apart, the threshold that yielded the maximum IoU, and the maximum possible IoU. Based on the distribution of the histograms for the land and water classes, we expect the results from the remote sensing index features to be more sensitive to the threshold value. This is exactly what we see in the line graph showing land and water IoU at each threshold interval. It is also important to note that the DS-computed thresholds for NDWI (0.78), random forest (0.4), and xgboost (0.38) differ from the commonly used thresholds (0.3, 0.5, and 0.5, respectively), while the DS-computed threshold for the U-Net model is close to the commonly used value of 0.5. This means that the performance of the U-Net model is less dependent on finding the optimal threshold than that of the other methods. At the time of writing, we could not find a published recommended threshold value for NDSWI. Based on our findings, we propose a value of 0.48 as the optimal NDSWI threshold for land/water segmentation in environments similar to the arctic coastal plain of Alaska.

Evaluations
We evaluated the performance of the spectral water indices, the ML models, and U-Net on the densely labeled test set, using IoU, Precision, and Recall as our evaluation metrics. The comparative performance of the different methods can be seen in Table 2.
Table 2. Experimental results for binary segmentation masks generated using different methods. The IoU when using random forest is higher than when using all other methods for both land and water classes.

Region Based Evaluations
Each scene was divided into 225 (512 × 512-pixel) subregions with no overlap. The subregions were classified as coastal if they contained portions of the land/water interface adjacent to mainland backshore environments. This classification excluded barrier islands and deltaic regions where narrow beaches and/or permafrost bluffs do not directly interface with the waterline. A total of 162 subregions were classified as coastal. We see a similar performance trend across all the models, with random forest performing the best in terms of IoU. The comparative performance of the different methods for the region-based evaluations can be seen in Table 3.

Discussion
Unprecedented change in the Arctic has drawn the attention of numerous large-scale and long-term research initiatives. NASA's Arctic-COLORS (https://arctic-colors.gsfc.nasa.gov/ (accessed on 12 November 2021)) and Arctic Boreal Vulnerability Experiment (ABoVE) (https://above.nasa.gov/ (accessed on 12 November 2021)), as well as a newly funded National Science Foundation Long Term Ecological Research project (Beaufort Lagoon Ecosystems LTER), are a few of a much larger group of initiatives currently conducting research to better understand biogeochemistry, land-marine interactions, and how arctic coastal change is modifying ecological properties and processes. More accurate and higher-resolution mapping data will no doubt aid the various research efforts being conducted in this field. Technological advances in remote sensing, computer vision, and high-performance computing (HPC), along with the increase in large-scale, agency-level airborne campaigns such as NOAA's coastal imaging missions conducted by its Remote Sensing Division (RSD), provide a unique opportunity for mapping arctic shorelines across large areas at high spatial resolutions on Alaska's North Slope.
NOAA airborne image collections across coastal regions primarily bolster NOAA's mission goal of coastal resiliency by serving as baseline datasets for creating high-resolution orthomosaic imagery to aid navigation, determine pre- and post-storm conditions, and facilitate coastal-zone management. Shoreline vectors (digital representations of the interface between land and water) derived from this imagery are used, among many other research and management purposes, in efforts to track and quantify rates of coastal change. Generally, these shorelines are derived by operators mono- or stereoscopically, manually digitized, or created through feature extraction routines, and published to NOAA's Continuously Updated Shoreline Product (CUSP). Furthermore, these image collections extend well beyond Alaska's coastal regions. Similar VHR, 4-band airborne imagery is collected along the majority of coastlines in the contiguous United States (including the Great Lakes coastal areas) and is freely available to the public. With the open-source methodology presented here, similar land/water segmentation efforts can be expanded to a wide range of coastal regions coinciding with available NOAA image collections.
We show that accurate shoreline mapping in the "1002 area" of the Arctic National Wildlife Refuge (ANWR) can be obtained using two different remote sensing indices, two traditional ML-based approaches, and a DL method. However, the models have not been tested outside of this region. Direct comparison of the results presented here to previous work is difficult due to the variety of methods and source imagery used in the literature. Similar work to create land/water masks from 1 m resolution, airborne, color-infrared imagery using NDWI thresholds and object-based classification methods in Arctic-Boreal regions reports recall, precision, and IoU of 0.94, 0.87, and 0.83, respectively, for the water class, and recall, precision, and IoU of 0.98, 0.99, and 0.98, respectively, for the land class [95]. While we expect the performance of the different models to be relatively similar for other regions on this task, we may see an improvement in performance from DL methods with a greater number of land cover classes [96]. Based on the body of literature around the performance of the U-Net architecture, one would expect U-Nets to outperform single-pixel-based models. However, random forest may have performed better on this task because the U-Net model was trained using sparse labels. Furthermore, as CNN models make the explicit assumption that the inputs are images and thus compute the outputs of neurons connected to local regions of the input, we may not have been able to utilize the full spatial properties of the U-Net model due to the sparsely labeled training data. Further research should consider an intercomparison with a model trained using dense labels to utilize the full spatial properties of the convolutional neural network-based architecture.
For future work, we will also investigate the performance of the U-Net model trained with sparse labels to classify edge pixels, using metrics that better capture the changes along the highest priority coastal sections in Alaska [97].

Conclusions
We have addressed the problem of training a DL-based U-Net model using sparse labels and have shown that sparsely labeled data allow it to learn a distribution comparable to those of other ML-based methods. The results are very competitive, but the random forest model provides slightly better results than the U-Net model trained using sparse labels for our task of land/water classification. Additionally, from an operational perspective, our findings suggest that efficient and accurate surface water mapping can be achieved with less labeling effort and a lower barrier to entry in terms of computer science expertise. However, since the performance of the remote sensing indices is highly dependent on finding the optimal threshold, an exhaustive search for the threshold is needed to observe the best results. The remote sensing indices (NDWI and NDSWI) perform relatively similarly to the ML-based approaches, which could be attributed to the simplistic nature of the task itself (only two output classes) or to the limited multispectral properties incorporated in the input indices.

Funding: Miguel Velez-Reyes, Craig Tweedie, and Stephen Escarzaga were partially supported by the National Oceanic and Atmospheric Administration, Office of Education Educational Partnership Program award number NA16SEC4810008 and by NSF LTER award number 1656026. Sergio Vargas Zesati was partially supported by NASA award numbers NNX17AC58A and 80NSSC21K1164 and NSF ITEX-AON award number 1836861. The content of the paper is solely the responsibility of the award recipients and does not necessarily represent the official views of the U.S. Department of Commerce, National Oceanic and Atmospheric Administration.

Data Availability Statement:
All data used during the study were downloaded through NOAA's Data Access Viewer (https://www.coast.noaa.gov/dataviewer/ (accessed on 23 January 2021)).

Acknowledgments:
We would like to thank Microsoft for providing us with the free Microsoft Azure resources through their AI for Earth grant program. All tests to obtain the results in this paper were conducted on a Microsoft Azure deployment which consists of two HDInsight clusters (an Azure HDInsight Apache Kafka cluster, and an HDInsight Apache Spark cluster), communicating directly within the premises of a single virtual network. We also acknowledge NOAA for providing a rich dataset which this work has been built on. This research was supported in part by the Department of Computer Science, and the Environmental Science and Engineering program at the University of Texas at El Paso.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: