Landscape classification with deep neural networks.

Abstract: The application of deep learning, specifically deep convolutional neural networks (DCNNs), to the classification of remotely sensed imagery of natural landscapes has the potential to greatly assist in the analysis and interpretation of geomorphic processes. However, the general usefulness of deep learning applied to conventional photographic imagery at a landscape scale is, as yet, largely unproven. If DCNN-based image classification is to gain wider application and acceptance within the geoscience community, demonstrable successes need to be coupled with accessible tools to retrain deep neural networks to discriminate landforms and land uses in landscape imagery. Here, we present an efficient approach to train/apply DCNNs with/on sets of photographic images, using a powerful graphical method, called a conditional random field (CRF), to generate DCNN training and testing data using minimal manual supervision. We apply the method to several sets of images of natural landscapes, acquired from satellites, aircraft, unmanned aerial vehicles, and fixed camera installations. We synthesize our findings to examine the general


The growing use of image classification in the geosciences
There is a growing need for fully automated pixel-scale classification of large datasets of color digital photographic imagery, to aid analysis and interpretation of natural landscapes and geomorphic processes. The task of classifying natural objects and textures in images of landforms is increasingly common across a wide variety of geomorphological research [1][2][3][4][5][6][7], providing impetus for the development of completely automated methods to maximize speed and objectivity. The task of labeling image pixels into discrete classes is called object class segmentation or semantic segmentation, whereby an entire scene is parsed into object classes at a pixel level [8][9].
There is a growing trend in studies of coastal and fluvial systems for using automated methods to extract information from time-series of imagery from fixed camera installations [10][11][12][13][14][15][16], UAVs [17][18][19] and other aerial platforms [20]. Fixed camera installations are designed for generating time-series of images for assessment of geomorphic change in dynamic environments. Many aerial imagery datasets are collected for building digital terrain models and orthoimages using Structure-from-Motion (SfM) photogrammetry [21,22]. Numerous complementary or alternative uses of such imagery and elevation models for the purposes of geomorphic research include facies description and grain size calculation [23,24], geomorphic and geologic mapping [25,26], vegetation structure description [27,28], physical habitat quantification [29,30], and geomorphic/ecologic change detection [31][32][33]. In this paper, we utilize and evaluate two emerging themes in computer vision research, namely deep learning and structured prediction, that, when combined, are shown to be extremely effective in application to pattern recognition and semantic segmentation of highly structured, complex objects in images of natural scenes.

Application of deep learning to landscape scale image classification
Deep learning is the application of artificial neural networks with more than one hidden layer to the task of learning and subsequently recognizing patterns in data [34,35]. A class of deep learning algorithms called deep convolutional neural networks (DCNNs) are extremely powerful at image recognition, resulting in a massive proliferation of their use [36,37] across almost all scientific disciplines [38,39]. A major advantage of DCNNs over conventional machine learning approaches to image classification is that they do not require so-called 'feature engineering' or 'feature extraction', which is the art of either transforming image data so that they are more amenable to a specific machine learning algorithm, or providing the algorithm with more data by computing derivative products from the imagery, such as rasters of texture or alternative color spaces [40,6,12]. In deep learning, features are automatically learned from data using a general-purpose procedure. Another reputed advantage is that DCNN performance generally improves with additional data, whereas conventional machine learning performance tends to plateau [41]. For these reasons, DCNN techniques will find numerous applications where automated interpretation and quantification of natural landforms and textures are used to investigate geomorphological questions.
However, many claims about the efficacy of DCNNs for image classification are largely based upon analyses of conventional photographic imagery of familiar, mostly anthropogenic objects [42,6], and it has not been demonstrated that this efficacy holds for image classification of natural textures and objects. Aside from their relatively large spatial scale, images of natural landscapes collected for geomorphological objectives tend to be taken from the air or from a high vantage point, with a nadir (vertical) or oblique perspective. In contrast, images that make up many libraries upon which DCNNs are trained and evaluated tend to be taken from ground level, with a horizontal perspective. In addition, variations in lighting and weather greatly affect distributions of color, contrast and brightness; certain land covers change appearance due to changing seasons (such as deciduous vegetation); and geomorphic processes alter the appearance of land covers and landforms causing large intra-class variation, for example, still/moving, clear, turbid, and aerated water. Finally, the distinction of certain objects and features may be difficult against similar backgrounds, for example groundcover between vegetation canopies.
The most popular DCNN architectures have been designed and trained on large generic image libraries such as ImageNet [43], mostly developed as a result of international computer vision competitions [44] and primarily for application to close-range imagery with small spatial footprints [42], but more recently they have been used for landform/land use classification tasks in large spatial footprint imagery such as that used in satellite remote sensing [45][46][47][48][49]. These applications have involved design and implementation of new or modified DCNN architectures, or relatively large existing DCNN architectures, and have largely been limited to satellite imagery. Though powerful, DCNNs are also computationally intensive to train and deploy, very data hungry (often requiring millions of examples to train from scratch), and require expert knowledge to design and optimize.
Collectively, these issues may impede widespread adoption of these methods within the geoscience community.
In this contribution, a primary objective is to examine the accuracy of DCNNs for oblique and nadir conventional medium-range imagery. Another objective is to evaluate the smallest, most lightweight existing DCNN models, retrained for specific land use/land cover purposes, with no retraining from scratch and no modification or fine-tuning to the data. We utilize a concept known as 'transfer learning', where a model trained on one task is re-purposed on a second related task [35].
Fortunately, several open-source DCNN architectures have been designed for general applicability to the task of recognizing objects and features in non-specific photographic imagery. Here, we use existing pre-trained DCNN models that are designed to be transferable for generic image recognition tasks, which facilitates rapid DCNN training when developing classifiers for specific image sets.
Training is rapid because only the final layers in the DCNN need to be retrained to classify a specific set of objects.
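The essence of this final-layer retraining can be sketched with a toy example: a frozen feature extractor (simulated here by fixed random 'bottleneck' feature vectors) has already mapped each image tile to features, and only a new softmax classification head is trained. All names and data below are illustrative, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def retrain_head(features, labels, n_classes, lr=0.5, epochs=200):
    """Train only a new softmax head on fixed 'bottleneck' features.

    In transfer learning, the (hypothetical) frozen DCNN base is not
    updated; only the final layer's weights W and biases b are learned
    here, by gradient descent on the cross-entropy loss.
    """
    n, d = features.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[labels]          # one-hot targets
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)  # softmax probabilities
        G = (P - Y) / n                    # cross-entropy gradient
        W -= lr * features.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

# Toy 'bottleneck' features for two well-separated classes
X = np.vstack([rng.normal(0, 1, (20, 8)) + 2, rng.normal(0, 1, (20, 8)) - 2])
y = np.array([0] * 20 + [1] * 20)
W, b = retrain_head(X, y, n_classes=2)
pred = (X @ W + b).argmax(axis=1)
```

Because only W and b are trained (a few hundred parameters here, versus millions in the full network), this step converges in seconds, which is why retraining only the final layers is fast.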

Pixel-scale image classification
Automated classification of pixels in digital photographic images involves predicting labels, y, from observations of features, x, which are derived from relative measures of color in red, green and blue spectral bands in imagery. In the geosciences, the labels of interest naturally depend on the application but may be almost any type of surface land cover (such as specific sediment, landforms, geological features, vegetation type and coverage, water bodies, etc) or description of land use (rangeland, cultivated land, urbanized land, etc). The relationships between x and y are complex and non-unique, because the labels we assign depend nonlinearly on observed features, as well as on each other. For example, neighboring regions in an image tend to have similar labels (i.e. they are spatially autocorrelated). Depending on the location and orientation of the camera relative to the scene, labels may be preferentially located. Some pairs of labels (e.g. ocean and beach sand) are more likely to be proximal than others (e.g. ocean and arable land).
A natural way to represent the manner in which labels depend on each other is provided by graphical models [50], where input variables (in the present case, image pixels and their associated labels) are mapped onto a graph consisting of nodes, and edges between the nodes describe the conditional dependence between the nodes. Whereas a discrete classifier can predict a label without considering neighboring pixels, graphical models can take this spatial context into account, which makes them very powerful for classifying data with large spatial structure, such as images. Much work in learning with graphical models [51] has focused on generative models that explicitly attempt to model a joint probability distribution P(x,y) over inputs, x, and outputs, y. However, this approach has important limitations for image classification, where the dimensionality of x is potentially very large, and the features may have complex dependencies, such as the dependencies or correlations between multiple metrics derived from images. In such cases, modeling the dependencies among x is difficult and leads to unmanageable models, but ignoring them can lead to poor classifications.
A solution to this problem is a discriminative approach, similar to that taken in classifiers such as logistic regression. The conditional distribution P(y|x) is modeled directly, which is all that is required for classification. Dependencies that involve only variables in x play no role in P(y|x), so an accurate conditional model can have much simpler structure than a joint model, P(x,y). The posterior probabilities of each label are modeled directly, so no attempt is made to capture the distributions over x, and there is no need to model the correlations between them. Therefore, there is no need to specify an underlying prior statistical model, and the conditional independence assumption of a pixel value given a label, commonly used by generative models, can be relaxed. This is the approach taken by conditional random fields (CRFs), which are a combination of classification and graphical modeling known as structured prediction [52,50]. They combine the ability of graphical models to compactly model multivariate data (the continuum of land cover and land use labels) with the ability of classification methods to leverage large sets of input features, derived from imagery, to perform prediction. In CRFs based on 'local' connectivity, nodes connect adjacent pixels in x [51,53], whereas in the fully connected definition, each node is linked to every other [54,55]. CRFs have recently been used extensively for task-specific predictions such as photographic image segmentation [56,57,42] where, typically, an algorithm estimates labels for sparse (i.e. non-contiguous) regions (i.e. supra-pixel) of the image. The CRF uses these labels in conjunction with the underlying features (derived from a photograph) to draw decision boundaries for each label, resulting in a highly accurate pixel-level labeled image [55,42].

Paper purpose, scope, and outline
In summary, this paper evaluates the utility of DCNNs for both image recognition and semantic segmentation of images of natural landscapes. Whereas previous studies have demonstrated the effectiveness of DCNNs for classification of features in satellite imagery, we specifically use examples of high-vantage and nadir imagery that are commonly collected during geomorphic studies and in response to disasters/natural hazards. In addition, whereas many previous studies have utilized relatively large DCNN architectures, either specifically designed to recognize landforms, land cover or land use, or trained from scratch using a specific dataset, the comparatively simple approach taken here is to repurpose an existing comparatively small, very fast MobileNetV2 DCNN framework to a specific task. Further, we demonstrate how structured prediction using a fully connected CRF can be used in a semi-supervised manner to efficiently generate ground-truth label imagery and DCNN training libraries. Finally, we propose a hybrid method for accurate semantic segmentation based on combining 1) the recognition capacity of DCNNs to classify small regions in imagery, and 2) the fine-grained localization of fully connected CRFs for pixel-level classification.
The rest of the paper is organized as follows. First, we outline a workflow for efficiently creating labeled imagery, retraining DCNNs for image recognition, and semantic classification of imagery. A user-interactive tool has been developed that enables the manual delineation of exemplative regions of specific classes in the input image, used in conjunction with a fully connected conditional random field (CRF) to estimate the class of every pixel within the image. The resulting label imagery can be used to train and test DCNN models. Training and evaluation data sets are created by selecting tiles from the image in which the proportion of pixels corresponding to a given class is greater than a given threshold. Then we detail the transfer learning approach applied to DCNN model repurposing, and describe how DCNN model predictions on small regions of an image may be used in conjunction with a CRF for semantic classification. We chose the MobileNetV2 framework, but any one of several similar models may alternatively be used. The retrained DCNN is used to classify small, spatially distributed regions of pixels in a sample image, which are used in conjunction with the same CRF method used for label image creation to estimate a class for every pixel in the image. We introduce four datasets for image classification. The first is a large satellite dataset consisting of various natural land covers and landforms, and the remaining three are from high-vantage or aerial imagery. These three are also used for semantic classification. In all cases, some data are used for training the DCNN, and some for testing classification skill (out-of-calibration validation). For each of the datasets, we evaluate the ability of the DCNN to correctly classify regions of images or whole images. We assess the skill of the semantic segmentation. Finally, we discuss the utility of our findings to broader application of these methods for geomorphic research.

Fully connected Conditional Random Field
A conditional random field (CRF) is an undirected graphical model that we use here to probabilistically predict pixel labels based on weak supervision, which could be manual label annotations or classification outputs from discrete regions of an image based on outputs from a trained DCNN. Image features x and labels y are mapped to graphs, whereby each node is connected by an edge to its neighbors according to a connectivity rule. Linking each node of the graph created from x to every other node enables modeling of the long-range spatial connections within the data by considering both proximal and distal pairs of grid nodes, resulting in refined labeling at boundaries and transitions between different label classes. We use the fully connected CRF approach detailed in [55], which is summarized briefly below. The probability of a labeling y given an image-derived feature x is

$$P(\mathbf{y}\,|\,\mathbf{x};\theta) = \frac{1}{Z(\mathbf{x})} \exp\left(-E(\mathbf{y}\,|\,\mathbf{x})\right)$$

where $\theta$ is a set of hyperparameters, $Z(\mathbf{x})$ is a normalization constant, and $E$ is an energy function that is minimized, obtained by

$$E(\mathbf{y}) = \sum_{i} \psi_u(y_i) + \sum_{i<j} \psi_p(y_i, y_j)$$

where $i$ and $j$ are pixel locations in the horizontal (row) and vertical (column) dimensions. The feature vectors $\mathbf{f}_i$ and $\mathbf{f}_j$ are created from $x_i$ and $x_j$ and are functions of both relative position and intensity of the image pixels. Whereas the $\psi_u(y_i)$ indicate so-called 'unary potentials', which depend on the label at a single pixel location ($i$) of the image, 'pairwise potentials', $\psi_p(y_i, y_j)$, depend on the labels at a pair of separated pixel locations ($i$ and $j$) on the image. The unary potentials represent the cost of assigning label $y_i$ to grid node $i$. In this paper, unary potentials are defined either through sparse manual annotation or automated classification using DCNN outputs. The pairwise potentials are the cost of simultaneously assigning label $y_i$ to grid node $i$ and $y_j$ to grid node $j$, and are computed using image feature extraction, defined by:

$$\psi_p(y_i, y_j) = \mu(y_i, y_j) \sum_{m=1}^{M} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$$

where $m = 1{:}M$ indexes the features derived from x, and where the function $\mu$ quantifies label 'compatibility', by imposing a penalty for nearby similar grid nodes that are assigned different labels. Each $k^{(m)}$ is the sum of two Gaussian kernel functions that determine the similarity between connected grid nodes by means of a given feature:

$$k(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\alpha^2} - \frac{|I_i - I_j|^2}{2\theta_\beta^2}\right) + w^{(2)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\right)$$

where $p$ denotes pixel position and $I$ pixel color intensity. The first (appearance) Gaussian kernel quantifies the observation that nearby pixels, with a distance controlled by $\theta_\alpha$ and a color similarity controlled by $\theta_\beta$, are likely to belong to the same class; the second (smoothness) kernel removes small isolated label regions, at a spatial scale controlled by $\theta_\gamma$.
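As a concrete illustration of the two Gaussian kernels, the following sketch evaluates the appearance and smoothness terms for a single pair of pixels, using the θα, θβ, and θγ values adopted later in the paper; the function name and the kernel weights w1 and w2 are illustrative, not part of the paper's code.

```python
import math

def pairwise_kernel(p_i, p_j, I_i, I_j, w1=1.0, w2=1.0,
                    theta_alpha=60.0, theta_beta=5.0, theta_gamma=60.0):
    """Sum of the two Gaussian kernels used by the fully connected CRF.

    p_i, p_j : (row, col) pixel positions
    I_i, I_j : (r, g, b) pixel intensities
    The first (appearance) kernel favors a shared label for nearby,
    similarly colored pixels; the second (smoothness) kernel penalizes
    small isolated label regions.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))  # squared spatial distance
    c2 = sum((a - b) ** 2 for a, b in zip(I_i, I_j))  # squared color distance
    appearance = w1 * math.exp(-d2 / (2 * theta_alpha ** 2)
                               - c2 / (2 * theta_beta ** 2))
    smoothness = w2 * math.exp(-d2 / (2 * theta_gamma ** 2))
    return appearance + smoothness
```

Two identical pixels receive the maximum affinity (w1 + w2), while pairs that are distant or differently colored receive exponentially smaller values, so the energy minimization is free to assign them different labels.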

Generating DCNN training libraries
We developed a user-interactive program that segments an image into smaller chunks, the size of which is defined by the user. On each chunk, cycling through a pre-defined set of classes, the user is prompted to draw (using the cursor) example regions of the image that correspond to each label.
Unary potentials are derived from these manual on-screen image annotations. These annotations should be exemplative, i.e. a relatively small portion of the region in the chunk that pertains to the class, rather than delimiting the entire region within the chunk that pertains to the class. Typically, the CRF algorithm only requires a few example annotations for each class. For very heterogeneous scenes, however, where each class occurs in several regions across the image (such as the water and anthropogenic classes in Figure 1), example annotations should be provided for each class in each region where that class occurs.
Using this information, the CRF algorithm estimates the class of each pixel in the image (Figure 1). Finally, the image is divided up into tiles of a specified size, T. If the proportion of pixels within the tile belonging to a given class is greater than a specified amount, Pclass, then the tile is written to a file in a folder denoting its class. This simultaneously and efficiently generates both ground-truth label imagery (to evaluate classification performance) and sets of data suitable for training a DCNN. A single photograph typically takes 5-30 minutes to process with this method, so all the data required to retrain a DCNN (see section below) may take only up to a few hours to generate. CRF inference time depends primarily on image complexity and size, but is also affected secondarily by the number and spatial heterogeneity of class labels.
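The tiling step can be sketched as follows (a minimal, pure-Python illustration assuming the label image is a nested list of integer class ids; the function name is ours, not the paper's):

```python
from collections import Counter

def extract_training_tiles(label_image, T, P_class=0.9):
    """Split a 2-D label image into non-overlapping T x T tiles and keep
    a tile only if its dominant class occupies more than a fraction
    P_class of its pixels, discarding mixed-class tiles.

    Returns (row, col, class) tuples marking the kept tiles.
    """
    kept = []
    nrows, ncols = len(label_image), len(label_image[0])
    for r in range(0, nrows - T + 1, T):
        for c in range(0, ncols - T + 1, T):
            pixels = [label_image[r + i][c + j]
                      for i in range(T) for j in range(T)]
            cls, count = Counter(pixels).most_common(1)[0]
            if count / (T * T) > P_class:  # dominant class passes threshold
                kept.append((r, c, cls))
    return kept
```

In the actual workflow, each kept tile would be cropped from the source photograph and written to a folder named for its class; the folder hierarchy then serves directly as the DCNN training library.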

Retraining a deep neural network (transfer learning)
The training library, consisting of image tiles each labeled according to a set of classes, whose generation is described in section 2.2, is used to retrain an existing DCNN architecture to classify similar unseen image tiles. Among many suitable popular and open-source frameworks for image classification using deep convolutional neural networks, we chose MobileNetV2 [58] because it is relatively small and efficient (computationally faster to train and execute) compared to many competing architectures designed to be transferable for generic image recognition tasks, such as Inception [59], ResNet [60], and NASNet [61], and it is smaller and more accurate than MobileNetV1 [62]. It is also pretrained for various tile sizes (image windows with horizontal and vertical dimensions of 96, 128, 192, and 224 pixels), which allows us to evaluate the effect of tile size on classifications.
All of the aforementioned models are implemented within TensorFlow Hub [63], which is a library specifically designed for reusing pre-trained TensorFlow [64] models for new tasks. Like MobileNetV1 [62], MobileNetV2 uses depthwise separable convolutions in which, instead of performing a full 2D convolution across all input channels at once, a depthwise convolution filters each input channel separately and a pointwise (1x1) convolution then combines the filtered channels. This requires far fewer parameters, so the model is very small and efficient compared to a model of the same depth using standard 2D convolutions. MobileNetV2 introduces two new features to the architecture: 1) shortcut connections between the bottlenecks, called inverted residual layers, and 2) linear bottlenecks between the layers. A bottleneck layer contains few nodes compared to the previous layers, and is used to obtain a representation of the input with reduced dimensionality [59], leading to large savings in computational cost. Residual layers connect the beginning and end of a convolutional block with a skip connection, which gives the network access to earlier activations that were not modified in the convolutional block, and makes it possible to build very deep networks without commensurate increases in parameters. Inverted residuals are a type of residual layer with fewer parameters, which leads to greater computational efficiency. A 'linear' bottleneck is one in which the last convolution of a residual layer has a linear output before it is added to the initial activations. According to [58], this preserves more information than the more traditional non-linear bottlenecks, which leads to greater accuracy.
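The parameter savings from depthwise separable convolutions can be illustrated with a quick count; the layer sizes below are arbitrary examples, not MobileNetV2's actual configuration:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k 2-D convolution over c_in channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise (one k x k filter per input channel) plus
    pointwise (1 x 1) convolution, as used by the MobileNet family."""
    return k * k * c_in + c_in * c_out

# Example: a 3x3 convolution mapping 32 channels to 64 channels
standard = conv_params(3, 32, 64)                   # 18432 weights
separable = depthwise_separable_params(3, 32, 64)   # 288 + 2048 = 2336 weights
```

For this example layer the separable form uses roughly an eighth of the weights, and the ratio grows with the number of output channels, which is why the whole network remains small.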
For all datasets, we only used tiles (in the training and evaluation) where at least 90% of the tile pixels were classified as a single class (that is, Pclass > 0.9). This avoided including tiles depicting mixed land cover/use classes. We chose tile sizes of T = 96x96 pixels and T = 224x224 pixels, spanning the full range of input sizes for which pretrained MobileNetV2 models are available. Each training and testing image tile was normalized against varying illumination and contrast, which greatly aids transferability of the trained DCNN model. We calculated a normalized image ($X'$) from a non-normalized image ($X$) using $X' = (X - \mu)/\sigma$, where µ and σ are the mean and standard deviation, respectively [47]. We chose to scale every tile by the maximum possible standard deviation for an 8-bit image, σ = 255. For each tile, µ was chosen as the mean across all three bands for that tile. This procedure could be optimized for a given dataset, but in our study the effects of varying values of σ were minimal.
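The normalization can be sketched as follows (a pure-Python illustration; the helper name is ours, and the tile is represented as a nested list rather than an image array for clarity):

```python
def normalize_tile(tile, sigma=255.0):
    """Normalize an 8-bit RGB tile as X' = (X - mu) / sigma.

    mu is the tile mean over all pixels and all three bands; sigma is
    fixed at 255, the maximum possible standard deviation for 8-bit
    imagery, as described in the text.
    tile: nested list [rows][cols][3] of values in 0-255.
    """
    values = [v for row in tile for px in row for v in px]
    mu = sum(values) / len(values)
    return [[[(v - mu) / sigma for v in px] for px in row] for row in tile]
```

The result is centered near zero and bounded within [-1, 1], so tiles taken under very different illumination present similar input distributions to the DCNN.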

CRF-based semantic segmentation
For pixel-scale semantic segmentation of imagery, we have developed a method that harnesses the classification power of the DCNN with the discriminative capabilities of the CRF. An input image is windowed into small regions of pixels, the size of which is dictated by the size of the tile used in the DCNN training (here, T=96x96 or T=224x224 pixels). Some windows, ideally with an even spatial distribution across the image, are classified with a trained DCNN. Collectively, these predictions serve as unary potentials (known labels) for a CRF to build a probabilistic model for pixelwise classification given the known labels and the underlying image (Figure 2).
Adjustable parameters are: 1) the proportion of the image for which to estimate unary potentials (controlled by both T and the number/spacing of tiles), and 2) a threshold probability, Pthres, above which a DCNN classification was used in the CRF. Across each dataset, we found that using 50% of the image as unary potentials, and Pthres = 0.5, resulted in good performance. CRF hyperparameters were also held constant across all datasets. We found that good performance across all datasets was achieved using θα = 60, θβ = 5, and θγ = 60. Holding all of these parameters constant facilitates comparison of the general success of the proposed method. However, it should be noted that accuracy could be further improved for individual datasets by optimizing the parameters for those specific data. This could be achieved by minimizing the discrepancy between ground-truth label images and model-generated estimates using a validation dataset.
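The Pthres filtering step can be sketched as follows (a hypothetical helper; the tuple format for DCNN window predictions is illustrative):

```python
def select_unary_labels(predictions, p_thres=0.5):
    """Keep only DCNN window predictions confident enough to serve as
    CRF unary potentials.

    predictions: list of (row, col, class, probability) for each
    classified window. Entries with probability <= p_thres are
    discarded, leaving those pixels to be labeled by the CRF alone.
    """
    return [(r, c, k) for (r, c, k, p) in predictions if p > p_thres]

# Example: three windows classified by the DCNN; the low-confidence
# 'sand' prediction is withheld from the CRF
preds = [(0, 0, 'water', 0.92), (0, 96, 'sand', 0.41), (96, 0, 'rock', 0.77)]
unary = select_unary_labels(preds, p_thres=0.5)
```

Withholding low-confidence predictions means the CRF fills those regions in from the image features and the surrounding confident labels, rather than propagating a possibly wrong DCNN guess.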

Metrics to assess classification skill
Standard metrics of precision, P, recall, R, accuracy, A, and F1 score, F, are used to assess classification of image regions and pixels, where TP, TN, FP, and FN are, respectively, the frequencies of true positives, true negatives, false positives, and false negatives:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad A = \frac{TP + TN}{TP + TN + FP + FN}, \quad F = \frac{2PR}{P + R}$$
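These four metrics follow directly from the confusion-matrix counts; a minimal sketch (the counts in the example are arbitrary, not results from this study):

```python
def classification_scores(tp, tn, fp, fn):
    """Precision, recall, accuracy, and F1 score from the frequencies of
    true positives, true negatives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1

P, R, A, F = classification_scores(tp=80, tn=90, fp=20, fn=10)
```

Note that F1, the harmonic mean of precision and recall, is insensitive to the number of true negatives, which is why it is often preferred over accuracy for classes that occupy a small fraction of an image.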

Data
The chosen datasets encompass a variety of collection platforms (oblique stationary cameras, oblique aircraft, nadir UAV, and nadir satellite) and landforms/land covers, including several shoreline environments (coastal, fluvial and lacustrine).

NWPU-RESISC45
To evaluate the MobileNetV2 DCNN with a conventional satellite-derived land use/land cover dataset, we chose the NWPU-RESISC45, which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU).
The entire dataset, described by [6], contains 31,500 high-resolution images from Google Earth imagery, in 45 scene classes with 700 images in each class. The majority of those classes are urban/anthropogenic. We chose to use a subset of 11 classes corresponding to natural landforms and land cover (Figure 3), namely: beach, chaparral, desert, forest, island, lake, meadow, mountain, river, sea ice, and wetland. All images are 256x256 pixels. We randomly chose 350 images from each class for DCNN training, and 350 for testing.

Lake Ontario
The dataset consists of 48 images obtained in July 2017 from a Ricoh GRII camera mounted to a 3DR Solo quadcopter, a small unmanned aerial system (UAS), flying 80-100 m above ground level in the vicinity of Braddock Bay, New York, on the shores of southern Lake Ontario [65]. A random subset of 24 images was used for training, and 24 for testing (Supplemental data S1C and S1D). Training and testing tiles were generated for five classes (Table A2 and Figure 5).

Grand Canyon
The dataset consists of 14 images collected from stationary autonomous camera systems monitoring eddy sandbars along the Colorado River in Grand Canyon. The camera systems, sites, and imagery are described in [16]. Imagery came from various seasons and river flow levels, and sites differ considerably in terms of bedrock geology, riparian vegetation, sunlight/shade, and water turbidity. One image from each of seven sites was used for training, and one from each of those same seven sites was used for testing (Supplemental data S1E and S1F). Training and testing tiles were generated for four classes (Table A3 and Figure 6).

California coast
The dataset consists of a sample of 75 images from the California Coastal Records Project (CCRP) [66], of which 45 were used for training, and 30 for testing (Supplemental data S1G and S1H). The photographs were taken over several years and times of the year, from sites all along the California coast, with a handheld digital single-lens reflex camera from a helicopter flying at approximately 50-600 m elevation [20]. The set includes a very wide range of coastal environments, photographed at very oblique angles with a correspondingly large horizontal footprint. Training and testing tiles were generated for ten classes (Table A4 and Figure 7).

Tile classification accuracy
Confusion matrices (Supplemental A, Figures S2A through S2E) for all classes reveal that most mis-classifications occur between similar groupings, for example swash and surf, and roads and buildings/anthropogenic. If the model systematically fails to distinguish between certain very similar classes, confusion matrices provide the means with which to identify which classes to group (or, by the same token, split), if necessary, to achieve even greater overall classification accuracies. In most cases, however, the accuracy over all of the classes is less important than adequate prediction skill for each class, in which case fine-tuning of model hyperparameters should be undertaken to improve differentiation between similar classes. Only for certain data and classes did the distinction between T=96 and T=224 tiles make a significant difference, particularly for the Lake Ontario data, where classifications were systematically better using T=224.

Pixel classification accuracy
With no fine-tuning of model hyperparameters, we achieved average pixelwise classification accuracies of between 70 and 78% (F1 scores, Table 3) across four datasets, based on CRF modeling of sparse DCNN predictions with T=96 tiles (Figure 8). Classification accuracy for a given feature was strongly related to the size of that feature (Figure 9). For those land covers/uses that are much greater in size than a 96x96-pixel tile, average pixelwise F1 scores were much higher, ranging from 86 to 90%.

Confusion matrices (Supplemental A, Figures S2F through S2I) again show that mis-classifications systematically tend to occur only between pairs of the most similar classes.
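The kind of confusion-matrix analysis described above can be sketched as follows (a hypothetical helper; the class names follow the swash/surf example mentioned earlier, but the counts are illustrative, not this study's results):

```python
def most_confused_pairs(conf, classes, top=3):
    """Rank the off-diagonal entries of a confusion matrix (rows = true
    class, columns = predicted class) to find the class pairs the model
    most often mixes up -- candidates for grouping into a single class.
    """
    pairs = [(conf[i][j], classes[i], classes[j])
             for i in range(len(classes))
             for j in range(len(classes))
             if i != j]
    return sorted(pairs, reverse=True)[:top]

classes = ['swash', 'surf', 'sand']
conf = [[90,  9,  1],   # true swash
        [12, 85,  3],   # true surf
        [ 2,  1, 97]]   # true sand
worst = most_confused_pairs(conf, classes, top=2)
```

Here the two largest off-diagonal counts both involve swash and surf, signalling that merging the pair (or refining their training tiles) would raise overall accuracy.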
Discussion

The general usefulness of deep learning applied to conventional photographic imagery at a landscape scale has, as yet, been largely unproven. Here, consistent with previous studies that have demonstrated the ability of DCNNs to classify land use/cover in long-range remotely sensed imagery from satellites [6,9,[45][46][47][48][49]], we demonstrated that DCNNs are powerful tools for classifying landforms and land cover in medium-range imagery acquired from UAS, aerial, and ground-based platforms. Further, we found that the smallest and most computationally efficient widely available DCNN architecture, MobileNetV2, classifies land use/cover with comparable accuracies to larger, slower DCNN models such as AlexNet [67,45,6], VGGNet [68,45,6], GoogLeNet [6,69,70], or custom-designed DCNNs [9,46,47]. Although we deliberately chose a standard set of model parameters, and achieved reasonable pixel-scale classifications across all classes, even greater accuracy is likely attainable with a model fine-tuned to a particular dataset [6]. Here, reported pixel-scale classification accuracies are only estimates because they do not take into account potential errors in the ground-truth data (label images), which could have arisen due to human error and/or imperfect CRF pixel classification. A more rigorous quantification of classification accuracy would require painstaking pixel-level classification of imagery using a fully manual approach, which would take hours to days for each image, possibly in conjunction with field measurements to verify land cover represented in imagery.
In remote sensing, the acquisition of pixel-level reference/label data is time-consuming and limiting [46], so acquiring a suitably large dataset for training DCNN is often a significant challenge.
Therefore most studies that use pixel-level classifications only use a few hundred reference points [71,72]. We suggest a new method for generating pixel-level labeled imagery for use in developing and evaluating classifications (DCNN-based and others), based on manual on-screen annotations in combination with a fully connected conditional random field (CRF, Figure 1). As stated in section 2.2, the CRF model will typically only require a few example annotations for each class as priors, so for efficiency's sake annotations should be more exemplative than exhaustive, i.e. relatively small portions of the regions of the image associated with each class. However, the optimal number and extent of annotations depend on the scene and the (number of) classes, and therefore learning an optimal annotation process for a given set of images is highly experiential.
This method for generating label imagery will find general utility for training and testing any algorithm for pixelwise image classification. We show that, in conjunction with transfer learning and small, efficient DCNNs, it provides the means to rapidly train a DCNN with a small dataset. In turn, this facilitates rapid assessment of the general utility of DCNN architectures for a given classification problem, and provides the means to fine-tune a feature class or classes iteratively based on classification mismatches. The workflow presented here can be used to quickly assess the potential of a small DCNN like MobileNetV2 for a specific classification task. This 'prototyping' stage can also be used to assess which classes should be grouped, or split, depending on analysis of confusion matrices such as those presented in Supplemental 2, Figures S2A through S2E. If promising, larger models such as ResNet [60] or NASNet [61] could be used, within the same framework provided by TensorFlow Hub, for even greater classification accuracy.
Recognizing the capabilities of the CRF as a discriminative classification algorithm given a set of sparse labels, we propose a pixelwise semantic segmentation algorithm based upon DCNN-estimated regions of images in combination with the fully connected CRF. We offer this hybrid DCNN-CRF approach to semantic segmentation as a simpler alternative to so-called 'fully convolutional' DCNNs [8,39,73] which, in order to achieve accurate pixel-level classifications, require much larger, more sophisticated DCNN architectures [37] that are often computationally more demanding to train. Since pooling within the DCNN results in a significant loss of spatial resolution, these architectures require an additional set of convolutional layers that learn the 'upscaling' between the last pooling layer, which will be significantly smaller than the input image, and the pixelwise labeling at the required finer resolution. This process is imperfect, so label images appear coarse at object/label boundaries [73], and some post-processing algorithm, such as a CRF or similar approach, is required to refine predictions. For this reason, our hybrid approach may be a simpler route to semantic segmentation, especially for rapid prototyping (as discussed above) and in cases where the scales of spatially continuous features are larger than the tile size used in the DCNN (Figure 9). However, for spatially isolated features, especially those confined to small spatially contiguous areas, the more complicated fully convolutional approach to pixelwise classification might be necessary.
The CRF is designed to classify (or, in instances where some unary potentials are considered improbable by the CRF model, reclassify) pixels based on both the color/brightness and the proximity of nearby pixels with the same label. When DCNN predictions are used as unary potentials, we found that, typically, the CRF algorithm requires DCNN-derived unary potentials, regularly spaced, for at least one quarter of pixels in relatively simple scenes and about one half in relatively complicated scenes (e.g. Figure 10B) for satisfactory pixelwise classifications (e.g. Figure 10C). With standardized parameter values that were not fine-tuned to individual images or datasets, CRF performance was mixed, especially for relatively small objects/features (Table 3). This is exemplified by Figure 10, where several small outcropping rocks whose pixel labels were not included as CRF unary potentials were either correctly or incorrectly labeled by the CRF, despite the similarity in their location, size, color, and relative proximity to correctly labeled unary potentials.

Conclusions
In summary, we have developed a workflow for efficiently creating labeled imagery, retraining DCNNs for image recognition, and semantic classification of imagery. A user-interactive tool has been developed that enables the manual delineation of exemplative regions of specific classes in the input image, used in conjunction with a fully connected conditional random field (CRF) to estimate the class of every pixel within the image. The resulting label imagery can be used to train and test DCNN models. Training and evaluation datasets are created by selecting tiles from the image in which the proportion of pixels corresponding to a given class is greater than a given threshold.
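The tile-selection rule can be sketched in a few lines of numpy (the 90% threshold, the non-overlapping grid, and the function name are illustrative assumptions, not our exact implementation):

```python
import numpy as np

def select_tiles(label_img, tile=96, thresh=0.9):
    """Slide a non-overlapping tile grid over a pixelwise label image and
    keep (row, col, class) for tiles where a single class occupies more
    than `thresh` of the pixels."""
    keep = []
    h, w = label_img.shape
    for i in range(0, h - tile + 1, tile):
        for j in range(0, w - tile + 1, tile):
            patch = label_img[i:i + tile, j:j + tile]
            classes, counts = np.unique(patch, return_counts=True)
            k = counts.argmax()
            if counts[k] / patch.size > thresh:
                keep.append((i, j, int(classes[k])))
    return keep

# Toy example: a 192x192 label image, left half class 0, right half class 1.
labels = np.zeros((192, 192), dtype=int)
labels[:, 96:] = 1
tiles = select_tiles(labels, tile=96, thresh=0.9)
```

Tiles straddling class boundaries fail the threshold and are discarded, which keeps the training set dominated by unambiguous examples of each class.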
The training tiles are then used to retrain a DCNN. We chose the MobileNetV2 framework, but any one of several similar models could be used instead. The retrained DCNN is used to classify small, spatially distributed regions of pixels in a sample image, which are used in conjunction with the same CRF method used for label image creation to estimate a class for every pixel in the image.
The CRF pairwise potential consists of two Gaussian kernels. The first is an 'appearance' kernel: nearby pixels, with proximity controlled by θ_α (the standard deviation for the location component of the color-dependent term), and with similar color, with similarity controlled by θ_β (the standard deviation for the color component of the color-dependent term), are likely to be in the same class. The second Gaussian is a 'smoothness' kernel that removes small isolated label regions, according to θ_γ, the standard deviation for its location component. This penalizes small, spatially isolated pieces of segmentation, thereby enforcing more spatially consistent classification. Hyperparameter θ_β controls the degree of allowable similarity in image features between CRF graph nodes: a relatively large θ_β means image features with relatively large differences in intensity may be assigned the same class label. Similarly, a relatively large θ_α means image pixels separated by a relatively large distance may be assigned the same class label.
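For reference, the pairwise potential of the standard fully connected CRF that these hyperparameters control can be sketched as follows, where p_i are pixel positions, I_i pixel color vectors, and the kernel weights w⁽¹⁾ and w⁽²⁾ are our notation rather than symbols taken from the text:

```latex
k(\mathbf{f}_i, \mathbf{f}_j) =
  \underbrace{w^{(1)} \exp\!\left(
    -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2}
    -\frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2}
  \right)}_{\text{appearance kernel}}
  + \underbrace{w^{(2)} \exp\!\left(
    -\frac{\lVert p_i - p_j \rVert^2}{2\theta_\gamma^2}
  \right)}_{\text{smoothness kernel}}
```

The appearance kernel encourages nearby, similarly colored pixels toward the same label; the smoothness kernel suppresses small isolated label regions.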

Figure 1 .
Figure 1. Application of the semi-supervised CRF at Seabright Beach, Santa Cruz, California for generation of DCNN training tiles and ground-truth labeled images. From left to right: (A) the input image, (B) the hand-annotated sparse labels, and (C) the resulting CRF-predicted pixelwise labeled image.
We used both tile sizes (T=96 and T=224 pixels) available for MobileNets, in order to compare the effect of tile size. All model training was carried out in Python using TensorFlow library version 1.7.0 and TensorFlow-hub version 0.1.0. For each dataset, model training hyperparameters (1000 training epochs, a batch size of 100 images, and a learning rate of 0.01) were kept constant, but not necessarily optimal. For most datasets, there are relatively small numbers of very general classes (water, vegetation, etc.), which in some ways creates a more difficult classification task than datasets with many more specific classes, owing to the greater expected within-class variability associated with broadly defined categories. Model retraining (sometimes called 'fine-tuning') consists of tuning the parameters in just the final layer, rather than all the weights within all of the network's layers: the model, up to the final classifying layer, is first used to generate image feature vectors for each input tile, then only the final, so-called fully connected, model layer that actually does the classification is retrained. For each training epoch, 100 feature vectors from tiles chosen at random from the training set are fed into the final layer to predict classes. Those class predictions are then compared against the actual labels, and the mismatch is used to update the final layer's weights through back-propagation.
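The retraining step described above amounts to multinomial logistic regression on precomputed feature vectors. The following self-contained numpy sketch uses random stand-in features and labels; the feature length of 1280 matches MobileNetV2's penultimate layer, but everything else here is illustrative rather than our exact TensorFlow Hub code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 200 precomputed feature vectors (e.g. from the
# penultimate MobileNetV2 layer, length 1280) and 4 integer class labels.
n, d, k = 200, 1280, 4
feats = rng.normal(size=(n, d)).astype(np.float32)
labels = rng.integers(0, k, size=n)

# Final fully connected layer: only these weights are updated, via
# softmax cross-entropy gradient descent.
W = np.zeros((d, k), dtype=np.float32)
b = np.zeros(k, dtype=np.float32)
lr, batch = 0.01, 100

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(1000):
    idx = rng.choice(n, size=batch, replace=False)
    X, y = feats[idx], labels[idx]
    p = softmax(X @ W + b)
    p[np.arange(batch), y] -= 1.0  # p is now the gradient wrt the logits
    W -= lr * X.T @ p / batch
    b -= lr * p.mean(axis=0)

# Training-set accuracy of the retrained final layer.
acc = (np.argmax(feats @ W + b, axis=1) == labels).mean()
```

Because the feature extractor is frozen, each "epoch" touches only the small W and b arrays, which is why retraining is fast even on modest hardware.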

Figure 2 .
Figure 2. Application of the unsupervised CRF for pixelwise classification, based on unary potentials of regions of the image classified using a DCNN. From left to right: (A) the input image, (B) the DCNN-estimated sparse labels, and (C) the resulting CRF-predicted pixelwise labeled image.
True positives are image regions/pixels correctly classified as belonging to a certain class by the model, while true negatives are those correctly classified as not belonging to a certain class. False negatives are regions/pixels incorrectly classified as not belonging to a certain class, and false positives are those incorrectly classified as belonging to a certain class. Recall is a measure of the ability to detect the occurrence of a class, which here is a given landform, land use, or land cover. Precision and recall are useful where the number of observations belonging to one class is significantly lower than the number belonging to the other classes. These metrics are therefore used in the evaluation of pixelwise segmentations, where the number of pixels corresponding to each class varies considerably. The F1 score is an equal weighting of recall and precision and quantifies how well the model performs in general. A 'confusion matrix', the matrix of normalized correspondences between true and estimated labels, is a convenient way to visualize model skill. A perfect correspondence between true and estimated labels is scored 1.0 along the diagonal elements of the matrix. Misclassifications are readily identified as off-diagonal elements, and systematic misclassifications as off-diagonal elements with large magnitudes. Full confusion matrices for each test and dataset are provided as Supplemental Data 2.
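As a concrete illustration (using a hypothetical 3-class confusion matrix, not values from our datasets), these metrics can be computed directly from a confusion matrix whose rows are true classes and columns are predicted classes:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, cols = predicted.
cm = np.array([[50,  5,  0],
               [ 4, 40,  6],
               [ 1,  2, 30]])

tp = np.diag(cm).astype(float)        # true positives per class
precision = tp / cm.sum(axis=0)       # of pixels predicted as class c, fraction correct
recall    = tp / cm.sum(axis=1)       # of true class-c pixels, fraction detected
f1 = 2 * precision * recall / (precision + recall)

# Normalized confusion matrix: perfect skill puts 1.0 on the diagonal,
# and systematic misclassifications show as large off-diagonal entries.
cm_norm = cm / cm.sum(axis=1, keepdims=True)
```

Per-class precision and recall diverge when classes are imbalanced, which is why both are reported for pixelwise segmentations.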

Figure 3 .
Figure 3. Example tiles from NWPU dataset. Classes, in columns, from left to right, are beach,

Figure 4 .
Figure 4. Example tiles from Seabright Beach. Classes, in columns, from left to right, are anthropogenic/buildings, foam, road/pavement, sand, other natural terrain, vegetation, and water.

Figure 5 .
Figure 5. Example tiles from Lake Ontario shoreline. Classes, in columns, from left to right, are anthropogenic/buildings, sediment, other natural terrain, vegetation, and water.

Figure 6 .
Figure 6. Example tiles from Grand Canyon. Classes, in columns, from left to right, are rock/scree, sand, vegetation, and water.

Figure 7 .
Figure 7. Example tiles from CCRP dataset. Classes, in columns, from left to right, are buildings/anthropogenic, beach, cliff, road, sky, surf/foam, swash, other natural terrain, vegetation, and water.

Figure 8 .
Figure 8. Example images (left column), DCNN-derived unary potentials (middle column), and CRF-derived pixelwise semantic segmentation (right column) for each of the four datasets, from top to bottom: Seabright, Lake Ontario, Grand Canyon, and CCRP.

Figure 9 .
Figure 9. Average recall versus average area (in square pixels) of classes.
to use our extensive pixel-level label dataset to evaluate and facilitate the development of custom DCNN models for specific classification tasks in the geosciences.

Figure 10 .
Figure 10. Classification of a typical CCR image: (A) Original image; (B) DCNN predictions; (C) CRF predictions; (D) and (E) show the same region (magnification ×2) from the DCNN and CRF labels, respectively. The colored ellipses in (D) indicate small rocky areas either misclassified (red ellipses) or correctly classified (yellow ellipses).

Table 1 .
Whole tile classification accuracies and F1 scores for each dataset and tile size, using the test tile set not used to train the model.

Table 2 .
Mean whole tile classification accuracies (%), per class, for each of the non-satellite datasets (T=96 / T=224), using the test tile set not used to train the model.
For each image set, classes are already available for all image tiles used for testing, so the DCNN model is simply retrained against the pre-defined classes for each dataset. This results in five separate retrained models, one for each of the five datasets. With no fine-tuning of model hyperparameters (of which the most important are the number of training epochs, learning rate, and batch size), we achieved average classification accuracies of between 91 and 98% (F1 scores) across five datasets with T=224 tiles, and between 88 and 97% with T=96 tiles (Table 1). Over 26 individual land cover/use classes (Table 2) in four datasets, average classification accuracies ranged between 49 and 99%. Full confusion matrices (Supplemental 2, Figures S2A through S2E) are provided for each test and dataset.

Table 3 .
Mean precision (P), recall (R), F1 score (F), and accuracy (A) (all %) per class for pixelwise classifications using each of the non-satellite datasets (T=96), using the test set of label images.
Dark shadows on cliffs were sometimes misclassified as water, most likely because the water class contains examples of shallow kelp beds, which are also almost black. A separate 'shadow' or 'kelp' class might have ameliorated this issue. We found that optimizing CRF parameters to reduce such misclassifications could be done for an individual image, but not in a systematic way that would improve similar misclassifications in other images. Whereas here we have used RGB imagery, the CRF would work in much the same way with larger multivariate datasets such as multispectral or hyperspectral imagery, or other raster stacks consisting of information on coincident spatial grids.
geoscience community, similar demonstrable examples need to be coupled with accessible tools and datasets to develop deep neural network architectures that better discriminate landforms and land uses in landscape imagery. To that end, we invite interested readers to use our data and code (see Acknowledgements) to explore variation in classifications among multiple DCNN architectures, and

Table A2 .
Our work demonstrates the general effectiveness of a repurposed, small, very fast, existing DCNN framework (MobileNetV2) for classification of landforms, land use, and land cover features in both satellite and high-vantage, oblique and nadir imagery collected using planes, UAVs, and static monitoring cameras. With no fine-tuning of model parameters, we achieve average classification accuracies of between 91 and 98% (F1 scores) across five disparate datasets, ranging between 71 and 99% over 26 individual land cover/use classes across four datasets. Further, we demonstrate how structured prediction using a fully connected CRF can be used in a semi-supervised manner to very efficiently generate ground-truth label imagery and DCNN training libraries. Finally, we propose a hybrid method for accurate semantic segmentation of imagery of natural landscapes based on combining 1) the recognition capacity of DCNNs to classify small regions in imagery, and 2) the fine-grained localization of fully connected CRFs for pixel-level classification. For land covers/uses that are typically much greater in size than a 96x96 pixel tile, average pixelwise F1 scores range from 86 to 90%; smaller, more isolated features have lower pixelwise accuracies.
Classes and number of tiles used for the Lake Ontario dataset.

Table A3 .
Classes and number of tiles used for the Grand Canyon dataset.

Table A4 .
Classes and number of tiles used for the California Coastal Records dataset.