1. Introduction
Automated image analysis has long been a challenging problem in multiple domains. Over the last decade, deep learning (DL) convolutional neural networks (CNNs) have reshaped the boundaries of computer vision (CV) applications, enabling unparalleled opportunities for automated image analysis. Applications range from everyday image understanding through industrial inspection to medical image analysis [
1]. Conspicuous shortfalls of traditional per-pixel approaches when confronted with sub-meter-scale remote sensing imagery (satellite and aerial) have shifted the momentum towards novel paradigms, such as object-based image analysis (OBIA) [
2], in which homogeneous assemblages of pixels are considered in the classification process. Recently, OBIA has been flanked by the challenges of big data [
3] and scalability [
4]. The success of DLCNNs in CV applications has attracted great interest from the remote sensing community [
5]. There has been an explosion of studies integrating DLCNNs to address remote sensing classification problems, spanning from general land use and land cover mapping [
6,
7] to targeted feature extraction [
8,
9,
10]. Deep learning CNNs excel at object detection [
10,
11,
12], semantic segmentation (multiple objects of the same class are treated as a single object) [
7,
13], and semantic object instance segmentation (multiple objects of the same class are treated as distinct individual objects) [
14]. Over the years, a plethora of DLCNN architectures have been proposed, developed, and tested, and the influx of new DLCNNs continues to grow. Each has its own merits and disadvantages with respect to the detection and/or classification problem at hand. Appreciation for DLCNNs in the remote sensing domain is increasing, yet some facets unique to remote sensing image analysis have been overlooked along the way.
Remote sensing scene understanding deviates from everyday image analysis in multiple ways, including the imaging sensors and their characteristics, the coverage and viewpoints, and the objects in question and their behaviors. From the standpoint of Earth imaging, the image can be perceived as a reduced representation of the scene [
2]. The image modality departs from the scene modality depending on the sensor characteristics. Scene objects are real-world objects, whereas image objects are assemblages of spatially arranged samples that model the scene. Images are only snapshots, and their size and shape depend on the sensor type and spatial sampling. For instance, certain land cover types, such as vegetation, are well pronounced, exhibiting greater discriminative capacity in the near-infrared (NIR) region than in the visible range. If the imaging sensor is constrained to the visible range, we prevent ourselves from taking advantage of the NIR wavelengths in classification algorithms. Similar to spectral strengths, the spatial resolution of the imaging sensor can either prohibit or permit our ability to construct the shape of geo objects and their spatial patterning [
15]. There is no single spatial scale that explains all the objects, but the semantics we pursue are organized into a continuum of scales [
16,
17]. In essence, an image represents the sensor’s view of reality, not an explicit representation of scene objects.
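The discriminative value of the NIR region noted above can be illustrated with the normalized difference vegetation index (NDVI), a standard red/NIR band ratio. The following minimal Python sketch uses toy reflectance values of our own choosing, not data from this study:

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - red) / (NIR + red); healthy vegetation scores near +1."""
    return (nir - red) / (nir + red + 1e-9)  # small epsilon avoids division by zero

# Toy reflectances: a vegetated pixel vs. a bare-soil pixel (illustrative values).
red = np.array([0.08, 0.30])
nir = np.array([0.50, 0.35])
print(ndvi(nir, red))  # vegetation (~0.72) separates clearly from soil (~0.08)
```

An RGB-only input denies a classifier access to exactly this kind of contrast.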
These are practical challenges that apply to very high spatial resolution (VHSR) multispectral (MS) commercial satellite imagery. The luxury of VHSR satellite imagery is that the wavelengths are not confined to a traditional panchromatic band, standard red-green-blue (RGB) channels, or NIR. VHSR satellite imagery includes both the visible and NIR regions and, therefore, produces an array of multiple spectral channels. For instance, the WorldView-2 sensor captures eight MS channels at less than 2 m resolution, and data fusion techniques allow resolution-enhanced MS products at sub-meter spatial resolutions. Besides spatial details, discriminating one geo object from another can be straightforward or difficult depending on their spectral responses recorded in the MS channels. Selection of optimal spectral bands is a function of the type of environment and the kind of information pursued in the classification process. In remote sensing mapping applications, land cover types and their constituent geo objects exhibit unique reflectance behaviors in different wavelengths, or spectral channels, enabling opportunities to discriminate them from each other and characterize them into semantic classes. This leaves the user with the question of selecting the optimal spectral channels from the MS satellite imagery for DLCNN model applications. The decision is difficult when candidate DLCNN architectures restrict the input to only three spectral channels. An intriguing question is whether one should adhere to the RGB channels, ruling out the criticality of the other spectral bands, or mine all MS bands to choose the optimal bands for model predictions. To the best of our knowledge, this problem is poorly explored despite its validity in remote sensing applications. Here, we make an exploratory attempt to understand this problem based on a case study that branches out from our ongoing project on Arctic permafrost thaw mapping from commercial satellite imagery [
18,
19,
20,
21].
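To make the scale of this band-selection question concrete: a three-channel model input can be drawn from an eight-band image in C(8, 3) = 56 ways, of which the conventional RGB composite is only one. A minimal Python sketch, with band names assuming the WorldView-2 layout described in Section 2:

```python
from itertools import combinations

# WorldView-2 band order (illustrative; see Section 2 for the sensor's band list).
WV2_BANDS = ["coastal", "blue", "green", "yellow",
             "red", "red_edge", "nir1", "nir2"]

# Every candidate three-band composite from the eight available bands.
composites = list(combinations(range(len(WV2_BANDS)), 3))
print(len(composites))  # 56

# The conventional RGB composite is just one point in this search space.
rgb = tuple(sorted(WV2_BANDS.index(b) for b in ("red", "green", "blue")))
print(rgb in composites)  # True: indices (1, 2, 4) for blue, green, red
```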
Permafrost thaw has been observed across the Arctic tundra [
22]. Ice-rich permafrost landscapes commonly include ice wedges, whose growth and degradation create polygonized land surface features termed ice-wedge polygons (IWPs). The lack of knowledge on fine-scale morphodynamics of polygonized landscapes introduces uncertainties into regional and pan-Arctic estimates of carbon, water, and energy fluxes [
23]. Logistical challenges and high costs hamper field-based mapping of permafrost-related features over large spatial extents. In this regard, VHSR commercial satellite imagery enables transformational opportunities to observe, map, and document the micro-topographic transitions occurring in polygonal tundra at multiple spatial and temporal frequencies.
The entire Arctic has been imaged at 0.5 m resolution by commercial satellite sensors (DigitalGlobe, Inc., Westminster, CO, USA). However, this imagery is still largely underutilized, and derived Arctic science products are rare. A considerable number of local-scale studies have analyzed ice-wedge degradation processes using satellite imagery and manned/unmanned aerial imagery and LiDAR data [
24,
25,
26,
27]. Most of the studies to date have relied on manual image interpretation and/or semi-automated approaches [
25,
26,
28]. Therefore, there is a need and an opportunity to utilize VHSR commercial imagery in regional-scale mapping efforts to spatio-temporally document microtopographic changes due to thawing ice-rich permafrost. The bulk of remote sensing image analysis methods suffer from scalability issues and image complexities, but DLCNNs hold great promise for high-throughput image analysis. Several pilot efforts [
18,
20,
29,
30,
31] have demonstrated the potential adaptability of pre-trained DLCNN architectures in ice-wedge polygon mapping via the transfer learning strategy. However, the potential impacts of MS band statistics on DLCNN model predictions have been overlooked.
Owing to increasing access to MS imagery and the growing demand for a suite of pan-Arctic-scale permafrost map products, there is a timely need to understand how the spectral statistics of input imagery influence DLCNN model performance. Although DLCNN architectures are designed to learn higher-order abstractions of imagery without pivoting on variations in low-level motifs, studies have documented the potential impacts of image quality, spectral/spatial artifacts of image compression, and other pre-processing factors on DLCNNs. Dodge et al. [
32] described the impacts of image quality on multiple deep neural network models for image classification and showed that DL networks are sensitive to image quality distortions. Subsequently, Dodge and Karam [
33] compared human and deep learning recognition performance under quality distortions and demonstrated that DL performance remains much lower than human performance on distorted images. Vasiljevic et al. [
34] also investigated the effect of image quality on recognition by convolutional networks, which suffered significant performance degradation due to blurring and a mismatch between training and input image statistics. Moreover, Karahan et al. [
35] examined the influence of image degradations on the performance of deep CNN-based face recognition approaches, and their results indicated that blur, noise, and occlusion cause a significant decrease in performance. These findings from previous studies provide useful insights for developing CV applications that account for image quality and can perform reliably across image datasets.
Benchmark image datasets in CV applications are largely confined to RGB imagery, and trained DLCNNs are typically used with those data. Examples include ImageNet, COCO, VisionData, MobileNet, etc. [
36,
37,
38,
39]. Priority for three spectral channels has become the de facto standard in everyday image analysis. As discussed above, this is a limiting factor in remote sensing applications. Training a DLCNN architecture from scratch requires enormous amounts of training data to curtail overfitting. Because of this, transfer learning is becoming the standard practice for working with a limited amount of training data. In such circumstances, input channels are confined to three spectral channels, even when the original image contains more than three. In remote sensing, the choice of multispectral bands could significantly affect the capacity to be invariant to quality distortions. Selecting the optimal spectral band combination from all available multispectral channels for model training and prediction is important because the dominant land cover types (heterogeneity) control the global image statistics as well as the local spectral variance. Our contention is that improper band selection, or reliance solely on RGB channels, can hamper mapping accuracies. The information content in MS channels should be prudently capitalized on in DLCNN model predictions; otherwise, we are discarding valuable cues that are advantageous in automated detection and classification processes. The central objective of this exploratory study is to understand to what degree MS band statistics govern DLCNN model predictions. We scaffold our analysis on a case study that includes ice-wedge polygons in two common tundra vegetation types (tussock and non-tussock sedge) as candidate geo objects. We chose Mask R-CNN as the candidate DLCNN architecture to detect ice-wedge polygons from eight-band WorldView-2 commercial satellite imagery. A systematic experiment was designed to understand the impact of choosing the optimal three-band combination on model prediction. We tested five cohorts of three-band combinations, coupled with statistical measures that gauge the spectral variability of the input MS bands.
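As a sketch of how such per-band statistics and three-band composites can be prepared for a three-channel model input, the following illustration computes a per-band coefficient of variation and assembles a percentile-stretched 8-bit composite. The specific stretch, the choice of statistic, and the synthetic tile are our assumptions for illustration, not the exact preprocessing pipeline of this study:

```python
import numpy as np

def band_statistics(image: np.ndarray) -> np.ndarray:
    """Per-band coefficient of variation (std / mean) over a (bands, rows, cols) stack."""
    flat = image.reshape(image.shape[0], -1).astype(np.float64)
    return flat.std(axis=1) / flat.mean(axis=1)

def stretch_to_uint8(band: np.ndarray, lo: float = 2.0, hi: float = 98.0) -> np.ndarray:
    """Linear percentile stretch of a 16-bit band to the 0-255 range."""
    p_lo, p_hi = np.percentile(band, [lo, hi])
    scaled = np.clip((band - p_lo) / (p_hi - p_lo + 1e-9), 0.0, 1.0)
    return (scaled * 255).astype(np.uint8)

def make_composite(image: np.ndarray, band_idx) -> np.ndarray:
    """Stack three stretched bands into an (rows, cols, 3) model-ready array."""
    return np.dstack([stretch_to_uint8(image[i]) for i in band_idx])

# Synthetic stand-in for an 8-band, 16-bit WV2 tile (bands, rows, cols).
rng = np.random.default_rng(0)
tile = rng.integers(0, 2**11, size=(8, 512, 512))

print(np.round(band_statistics(tile), 3))  # spectral variability per band
rgb = make_composite(tile, (4, 2, 1))      # red, green, blue (0-based indices)
print(rgb.shape, rgb.dtype)                # (512, 512, 3) uint8
```

In practice, the stretched three-band composite takes the place of an ordinary RGB photograph at the input layer of the detection network.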
2. Study Area and Image Data
Our study area covers coastal and upland tundra near Nuiqsut on the North Slope of Alaska (
Figure 1). We obtained two summer-time WorldView-2 (WV2) commercial satellite image scenes from tussock sedge and non-tussock sedge tundra regions from the Polar Geospatial Center (PGC) at the University of Minnesota. The WV2 sensor records spectral reflectance in eight discrete spectral bands representing coastal blue (band 1), blue (band 2), green (band 3), yellow (band 4), red (band 5), red edge (band 6), NIR1 (band 7), and NIR2 (band 8). The spatial resolution of the data product is ~0.5 m with 16-bit radiometric resolution. Scenes were chosen from tussock and non-tussock sedge tundra regions based on the Circumpolar Arctic Vegetation Map (CAVM) [
40], which presents important baseline reference data for pan-Arctic vegetation monitoring in tundra ecosystems [
41]. The foothills of northern Alaska contain heterogeneous tundra types, such as tussock sedge, dwarf shrub, and moss tundra [
41]. The region generally covers (
Figure 1, details in [
40]): (i) Non-tussock sedge, dwarf-shrub, moss tundra: moist tundra dominated by sedges and dwarf shrubs <40 cm tall, with a well-developed moss layer; (ii) Tussock sedge, dwarf-shrub, moss tundra: moist tundra dominated by tussock cottongrass and dwarf shrubs <40 cm tall; and (iii) Sedge: wetland complexes in the colder/warmer areas of the Arctic, dominated by sedges, grasses, and mosses. For our analysis, we chose two WV2 satellite image scenes focusing on only two candidate vegetation types: (1) tussock sedge and (2) non-tussock sedge tundra.
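For readers reproducing a comparable setup, the following is a minimal sketch of reading one of the eight-band WV2 scenes. The file name is a placeholder, and rasterio is our assumed I/O library, not necessarily the tooling used in this study:

```python
import rasterio  # assumes the rasterio package is installed

# WV2 band numbering as delivered (1-based, matching rasterio's indexing).
WV2_BANDS = {1: "coastal_blue", 2: "blue", 3: "green", 4: "yellow",
             5: "red", 6: "red_edge", 7: "nir1", 8: "nir2"}

# "wv2_tussock_scene.tif" is a hypothetical file name, not the study's data path.
with rasterio.open("wv2_tussock_scene.tif") as src:
    assert src.count == 8                      # expect all eight MS bands
    nir1 = src.read(7)                         # single band, returned as a 2-D array
    print(src.dtypes[0], src.res, nir1.shape)  # e.g., uint16 and ~0.5 m pixels
```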