A Data-Driven Approach to Classifying Wave Breaking in Infrared Imagery

We apply deep convolutional neural networks (CNNs) to estimate wave breaking type (e.g., non-breaking, spilling, plunging) from close-range monochrome infrared imagery of the surf zone. Image features are extracted using six popular CNN architectures developed for generic image feature extraction. Logistic regression on these features is then used to classify breaker type. The six CNN-based models are compared without and with augmentation, a process that creates larger training datasets using random image transformations. The simplest model performs optimally, achieving average classification accuracies of 89% and 93%, without and with image augmentation respectively. Without augmentation, average classification accuracies vary substantially with CNN model. With augmentation, sensitivity to model choice is minimized. A class activation analysis reveals the relative importance of image features to a given classification. During its passage, the front face and crest of a spilling breaker are more important than the back face. For a plunging breaker, the crest and back face of the wave are most important, which suggests that CNN-based models utilize the distinctive ‘streak’ temperature patterns observed on the back face of plunging breakers for classification.


Introduction
In the surf zone, the spatio-temporal patterns and dynamics of wave breaking generate nearshore currents and transport sediment, which changes seafloor topography. This, in turn, affects wave transformation processes, spatial gradients in energy dissipation, and nearshore hydrodynamic circulation patterns [1]. This circulation determines the fate of seabed nutrients, contaminants, and pathogens, and exerts control on the seabed and water column as habitats [2]. Numerical modeling of the onset of wave breaking and the amount of energy lost during breaking (dissipation) usually relies on parameterizations such as Thornton and Guza [3] and Duncan [4]. The evolution of individual waves toward breaking is not fully understood [5], and there is no consensus on a method for predicting the statistics of breaking waves [6]. Therefore, detailed observations of wave breaking are needed to develop improved parameterizations of wave transformation and energy dissipation processes across the surf zone.
Breaking waves are classified into discrete classes, namely collapsing, surging, plunging, and spilling [7]. Plunging (when the wave crest curls forward and abruptly impinges on the water surface) and spilling (when an aerated roller cascades down the front face of the wave) are the most common breaker types on open coastlines and sandy beaches. Spilling and plunging breakers have different rates of energy dissipation [8]. Therefore, a robust, fully automated, objective approach to classifying breaker type and other wave properties from remote sensing data would be a valuable tool for studying surf zone energy budgets. Thermal infrared (IR) imagery has been used to study deep-water wave breaking [9], microscale breaking [10], and surf zone wave breaking [11] because it is relatively insensitive to reflected light from the sun and sensitive to the subtle temperature difference between relatively warm active foam (the foam produced during active breaking) and relatively cool residual foam (the foam left behind in the wake of a breaking wave). Additionally, distinctive streaky temperature patterns have been observed on the back face of plunging waves in IR imagery [12,13]. However, to our knowledge, no attempt has been made either to mechanistically relate the energy content of wave breaking to specific IR temperature patterns, or to harness those IR textures to infer statistical descriptors of spilling and plunging breakers. Here, we successfully attempt the latter.
In recent years, while applied machine learning technologies have developed at an unprecedented pace [14], there has been a concomitant interest in their application to imagery of coastlines for academic research [15,16,17], beach monitoring, and leisure purposes. Indeed, the recent proliferation in coastal imaging systems is driven in part by the expectation that such artificially intelligent technologies will make data-driven observations of surf zone hydrodynamics feasible. A specific class of machine learning algorithms called deep convolutional neural networks (CNNs, also known as DCNNs or Convnets) has been shown to produce state-of-the-art performance for a variety of image recognition and classification tasks (e.g., [18,19,20,21]). A major reputed advantage of deep learning over conventional machine learning is that it does not require manual feature selection and extraction. These practices select or transform image data to make them more amenable to a specific algorithm or to reduce model overfitting, thereby increasing model generality. Another attractive aspect of deep learning models is that they can be used as generic feature extractors by initializing each model "neuron" with a set of weights and biases learned from a different dataset, such as Imagenet [22], a library of millions of labeled generic images. This widespread practice is called 'transfer learning' [23] and makes model training considerably faster.
Here, we employ CNN models on a training set of IR images of breaking or near-breaking surf zone waves to create a function that estimates wave breaking type from an arbitrary single image. Feature extraction is automatic, and predictions are made based on image texture related to small-scale spatial patterns in sea surface temperature. The success of the classification is a test of how well generic CNN models, initialized with weights learned from a different dataset and a different set of classes, extract geophysically relevant information from imagery.

Field Site and Instrumentation
Observations of breaking waves in the outer surf zone were collected using a thermal IR camera during a field campaign, 7-8 November 2016, at the US Army Corps of Engineers (USACE) Field Research Facility (FRF) in Duck, NC, USA. A DRS UC640-17 long-wavelength (8-14 µm), uncooled VOx microbolometer IR camera was mounted to the FRF pier and stabilized by four guy lines attached to the pier railings. This camera viewed the sea surface at a 45° incidence angle and collected data continuously at 10 Hz. A continuously operating Riegl VZ-400 LIDAR (LIght Detection and Ranging), with ≤1 cm accuracy, measured the sea surface elevation across a profile intersecting the field of view of the IR camera. An example IR image of a spilling breaker is shown in Figure 1B. For context, Figure 1A shows a surf zone-scale view of the same breaker, and Figure 1C displays the LIDAR profile of the breaker. In this study, 10.5 hours of data were used, during which time wave height and period both varied significantly, from 0 to 5.94 m and 2.32 to 19.36 s, respectively, measured using the LIDAR profiles.
The dataset used here consists of 9400 oblique images. Example IR images of unbroken waves, spilling breakers, and plunging breakers are shown in Figure 2. For each of the training images, breaker type is manually classified. The distribution of pixel intensities in the IR imagery is used to determine whether or not the image contains a breaking wave [11]; then, the image is manually labeled based on the patterns of image intensity. Plunging breakers are typified by an organized streak pattern on the back face of the wave (Figure 2A), whereas spilling breakers are identified by a more unorganized texture on the back face of the wave (Figure 2B).

Model for Discrete Classification of Breaker Type
A supervised machine learning approach estimates a function f that maps inputs x (a training set of IR images of waves) to corresponding outputs y (labels of wave breaker class) such that y = f(x). This function is then used to make predictions on unlabeled data. Several open-source CNN architectures have been designed to recognize objects and features in non-specific photographic imagery [24]. Among numerous suitable, popular, state-of-the-art and open-source frameworks for whole-image classification using CNNs, we choose MobilenetV2 [25], Xception [26], Resnet-50 [27], InceptionV3 [28], Inception-ResnetV2 [29] and VGG19 [19] (in increasing order of number of tunable model parameters). The major difference between Inception-based and Mobilenet-based architectures is that Mobilenet-based models use depthwise separable convolutions while InceptionV3 uses standard convolutions. In a standard convolution, the filter operates on all image channels of the input image, so the matrix multiplication between the input and filter is multidimensional. In a depthwise separable convolution, by contrast, separate filters operate on each image channel, and output feature maps are then generated using a pointwise filter. Inception-ResnetV2 is a hybrid of the Inception and Resnet architectures that introduces residual connections, which add the output of the convolution operation of the inception module to its input. The motivating idea is that the next layer will learn the concepts of the previous layer plus the input of that previous layer (the data that was used to learn those concepts). This also allows the model to be much deeper, but with a similar number of model parameters. VGG19 is an older model architecture that is very deep but does not have residual connections or depthwise separable convolutions.
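The difference between the two convolution types can be made concrete by counting weights. The following is a minimal sketch, not taken from any of the cited architectures: the function names and the layer size (3 × 3 kernel, 32 input channels, 64 output channels) are hypothetical, and biases are ignored.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard convolution: every filter spans all input channels."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per channel
    pointwise = in_channels * out_channels               # 1x1 filter that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)
separable = depthwise_separable_params(3, 32, 64)
print(standard, separable)  # 18432 2336
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which illustrates why a model built on depthwise separable convolutions can have far fewer tunable parameters.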
These CNNs automatically generate a hierarchy of features that are learned from input data using a general-purpose procedure. This procedure applies a set of convolution filters to extract local features and spatially shares the parameters of each filter. The weights of the filters in each layer are learned using back-propagation [14], an optimization process that minimizes the error between the output produced by the CNN and the ground-truth label. Each model is implemented using the TensorFlow symbolic math library [30]. Logistic regression on the extracted features is then used for discrete classification. Let x be the features extracted from the last pooling layer of the CNN.
CNNs are made of many connected layers, and each layer consists of nodes. A node combines input from the input image data with a set of coefficients, called weights, which either amplify or dampen that input. Typically, if a CNN is trained 'end-to-end', significance is assigned to inputs with regard to the task that the algorithm is trying to learn (in our case, wave classification) by modifying the values of each weight until optimal classification performance is achieved. Here, we simply take an existing set of weights, determined for each chosen model architecture by a training process for the Imagenet dataset [22]. At each node, the input-weight products are summed and the sum is passed through the node's activation function; this determines whether and to what extent that signal should progress further through the network to affect the ultimate classification. The activations of the final nodes are called features, which are used for classification within a multinomial logistic regression. The model so weighted becomes a generic feature extractor that is optimized for the 1000 Imagenet classes rather than the three wave breaking classes of interest here; nonetheless, we show that the generic CNN is powerful enough to classify those three classes, despite not being optimized for them. This is an example of so-called 'transfer learning' [23].
To handle ambiguity, the most probable wave breaker class, y, is found by transforming x into a discrete probability distribution over the k = 3 possible classes using multinomial logistic regression, known in the field of machine learning as softmax regression [31]. Outputs from this discriminative approach, which models the conditional distribution p(y|x) directly, are found to be consistently superior to the equivalent outputs from a simple generative approach, whereby the posterior label probabilities are found using Bayes' theorem. The inferior skill of this so-called 'naive Bayes' approach is likely due to the extracted image features not contributing independently to the probability of a given classification (i.e., the features are correlated, violating the assumptions of the model).
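As an illustration of how a feature vector x becomes a probability distribution over the k = 3 classes, the following minimal sketch applies a linear score per class followed by softmax normalization. The feature values, weights, and biases are invented for illustration only; in practice the weights would be fitted to the extracted CNN features, and the feature vector would be much longer.

```python
import math

CLASSES = ("unbroken", "spilling", "plunging")


def class_scores(features, weights, biases):
    """Linear score per class: dot(weights[c], features) + biases[c]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical three-element feature vector and regression coefficients.
features = [0.5, -1.0, 2.0]
weights = [[0.2, 0.1, -0.3],
           [0.4, -0.2, 0.1],
           [-0.1, 0.3, 0.5]]
biases = [0.0, 0.1, -0.1]

probs = softmax(class_scores(features, weights, biases))
predicted = CLASSES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The predicted class is the one with the largest probability; the full distribution is retained, so ambiguous images show up as two classes with similar probabilities.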
Models are trained without and with image augmentation. Augmentation is implemented using the following random changes to the images: (1) shifts in either or both image dimensions of up to 10%; (2) rotations up to ±10 degrees; (3) shear in either axis up to 5 degrees; and (4) zoom up to 20% by image area. Vertical and horizontal image flipping is another common strategy employed to augment training datasets, but it was not employed here because image textures are anisotropic (i.e., the direction of wave propagation is important to image texture). All classified images are randomly split into a training set and a testing set. The data used to train the CNN models consist of 5455 (7455) images of unbroken waves, 138 (2138) images of plunging breakers, and 1904 (3904) images of spilling breakers without (with) augmentation. The testing data consist of 1820 images of unbroken waves, 47 images of plunging breakers, and 636 images of spilling breakers. The same square area is extracted from each image, and this square region is then reduced to 299 × 299 pixels for computational efficiency.
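The random transformations listed above can be sketched as a parameter sampler. This is an illustrative assumption about how such limits might be drawn, not the implementation used in the study: the function name, the dictionary keys, the symmetric ranges, and the reading of "zoom up to 20% by image area" as an area scale between 0.8 and 1.2 are all hypothetical. A single shear angle is drawn for simplicity.

```python
import math
import random


def sample_augmentation(rng: random.Random) -> dict:
    """Draw one set of random transformation parameters within the stated limits."""
    # Assumption: zoom may shrink or enlarge the imaged area by up to 20%.
    area_scale = rng.uniform(0.8, 1.2)
    return {
        "shift_x": rng.uniform(-0.10, 0.10),    # fraction of image width
        "shift_y": rng.uniform(-0.10, 0.10),    # fraction of image height
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "shear_deg": rng.uniform(-5.0, 5.0),
        "linear_zoom": math.sqrt(area_scale),   # side-length scale for that area change
        # No flip entries: flipping would reverse the wave propagation direction.
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_augmentation(rng) for _ in range(2000)]
print(samples[0])
```

Drawing 2000 parameter sets mirrors the training counts reported above, which increase by 2000 images per class with augmentation.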

Results
To assess classification performance, we compare the automated CNN classifications against the test dataset (Table 1). Based on Rank-1 scores, the percentage of samples for which the model correctly estimates wave breaker class, there is little difference between MobilenetV2, InceptionV3, and Inception-ResnetV2 (in order of increasing model complexity). Resnet-50 is the worst performing model, without and with image augmentation. The general prediction skill of the model for individual classes is assessed using an F1 score, the harmonic mean of precision and recall, which weights the two equally. Precision and recall are standard accuracy metrics employed when the number of observations belonging to one class (plunging) is significantly lower than those belonging to the other classes (spilling and unbroken), which is the case here. Precision is the proportion of positive identifications that are correct (a precision of 1 means there are no false positives), and recall is the proportion of actual positives identified correctly (a recall of 1 means there are no false negatives). F1 scores (Table 1) reveal greater differences between models and highlight the prediction improvement achieved using image augmentation. For each trained model, a 'confusion matrix', which is the matrix of normalized correspondences between true and estimated labels, was used to visualize model skill. A perfect correspondence between true and estimated labels is scored 1.0 along the diagonal elements of the matrix. Misclassifications are readily identified as off-diagonal elements. Systematic misclassifications are recognized as off-diagonal elements with large magnitudes. Comparison of the confusion matrices for models trained without (Figure 3) and with (Figure 4) augmentation shows that the latter ensures individual class predictions have errors of less than 50%, regardless of CNN model used. All augmented models with the exception of Resnet-50 are accurate to within 20% for all classes. Without augmentation, misclassifications are much more likely.
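The metrics described above can be computed from a confusion matrix as in the following minimal sketch. The ten labels are invented toy data, not results from this study, and the function names are hypothetical; rows of the matrix are true classes and columns are estimated classes, left as raw counts rather than normalized.

```python
CLASSES = ("unbroken", "spilling", "plunging")


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows as true classes and columns as estimated classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, estimate in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[estimate]] += 1
    return matrix


def precision_recall_f1(matrix, k):
    """Precision, recall, and F1 (their harmonic mean) for class index k."""
    true_pos = matrix[k][k]
    false_pos = sum(row[k] for row in matrix) - true_pos   # rest of column k
    false_neg = sum(matrix[k]) - true_pos                  # rest of row k
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
truth = ["unbroken"] * 4 + ["spilling"] * 4 + ["plunging"] * 2
estimate = ["unbroken", "unbroken", "unbroken", "spilling",
            "spilling", "spilling", "spilling", "plunging",
            "plunging", "spilling"]

matrix = confusion_matrix(truth, estimate, CLASSES)
rank1 = sum(matrix[i][i] for i in range(len(CLASSES))) / len(truth)
plunging_scores = precision_recall_f1(matrix, CLASSES.index("plunging"))
print(matrix, rank1, plunging_scores)
```

In this toy case the Rank-1 score is 0.7 while the F1 score of the rare plunging class is only 0.5, which shows how per-class scores can expose weaknesses that overall accuracy hides when classes are imbalanced.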

Discussion
Transfer learning (that is, using a deep neural network with weights learned from another dataset) as a generic feature extractor, combined with a simple logistic regression classifier, was sufficient for estimating wave breaker class from IR images of the surf zone with high accuracy. For most models, however, image augmentation improved classification performance. This supports the general consensus among machine learning experts that CNNs require large datasets and that performance can be expected to increase with more data [32]. This holds even if the classical notion of data information content is challenged by the enormous amount of redundancy in the augmented data. It is widely known that effective use of logistic regression requires large sample sizes [31], hence the importance of augmentation. Augmentation would also further aid end-to-end model training that optimizes the model used as a feature extractor for the specific dataset.
The result demonstrates that deep learning is a powerful tool for task-specific classification of dynamic natural features in geophysical imagery. In this specific application, we find that deep learning methods are sensitive to subtle variations of IR image tone, contrast, saturation, and texture that collectively indicate a changing dynamic state. The feature extractors initialized with weights learned on conventional photographic imagery were sufficient to extract the salient features of a geophysical dataset. Unlike many studies applying CNNs to photographic imagery for non-physical purposes, our classification results also suggest that neither residual network connections (such as in Resnet-50 and Inception-ResnetV2) nor very deep architectures (such as VGG19) are crucial for high accuracy (Table 1). It will be interesting to observe how well these observations hold for other geophysical classification problems.
The global average pooling (GAP) layers in modern CNN feature extractors are crucial to minimize overfitting by reducing the total number of parameters in the model. Furthermore, Zhou et al. [33] and subsequent studies have demonstrated that CNNs with GAP layers that have been trained for a classification task can also be used for object localization. In a conventional sense, this is where an object is in the image. In the present case, we can use this technique to indicate how important each location is with respect to the wave breaker class prediction. The localization is expressed as a 'class activation map', where relatively large values indicate regions that are relatively important for the CNN to perform the classification task. We implemented the method of Zhou et al. [33] on five sequential example images of each wave breaker class. Scrutiny of the results in Figure 5 reveals that different features are important, depending on the type of wave or breaker and also on the stage in its temporal evolution. During the passage of a spilling breaker, regions of the image near the wave crest and on the front face of the wave (Figure 5A-E) are most important. For a plunging breaker (Figure 5F-J), regions of the image near the wave crest and on the back face of the wave are most important. This supports a hypothesis that the CNN is picking up on the distinctive 'streak' temperature patterns observed on the back face of plunging breakers [12,13]. Therefore, we argue that not only does the network generally make the correct classification, it also assigns those classes for the right reasons. Such visualization techniques could therefore lead to improved mechanistic understanding of the hydrodynamics or thermodynamics of wave breaking.
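A class activation map can be sketched as a weighted sum of the final convolutional feature maps, with one weight per map for the class of interest, following the general idea in [33]. The example below is a toy illustration under stated assumptions: the two 2 × 2 feature maps and the class weights are invented, the function names are hypothetical, and no upsampling to image resolution is shown.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[r][c] = sum_k w_k * f_k[r][c]."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(rows):
            for c in range(cols):
                cam[r][c] += weight * fmap[r][c]
    return cam


def global_average(grid):
    """Mean over all spatial positions, as a GAP layer computes per feature map."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


# Two hypothetical 2x2 feature maps and their weights for one class.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
class_weights = [0.5, 1.0]

cam = class_activation_map(feature_maps, class_weights)
class_score = sum(w * global_average(f) for w, f in zip(class_weights, feature_maps))
print(cam, class_score)
```

Because averaging is linear, the spatial mean of the map equals the class score computed from the pooled features (bias aside), which is why large map values mark the locations that contribute most to that class.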
Images that were misclassified can almost all be grouped into one of the following four categories: (1) breaking waves just entering the field of view; (2) images containing wake from the previous wave; (3) waves at the onset of breaking, where a portion of the wave crest is breaking while the rest remains unbroken; and (4) waves in transition from the onset of plunging to the developing breaker stage. Since waves exist somewhere on the continuum between end-member cases of spilling and plunging, those waves whose IR signature is not clearly spilling or plunging pose a challenge to both the CNN and the eye. Future work to incorporate knowledge of sequential classifications into the CNN algorithm may therefore help to define additional classes, such as the transition between onset and steady state breaking. The ratio of prediction probabilities of spilling and plunging waves may also be a useful metric to determine where a given wave exists on the continuum of breaker type.
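One possible form of such a ratio metric is the logarithm of the plunging-to-spilling probability ratio, which is zero for an evenly ambiguous wave and grows in magnitude toward either end member. This is a hypothetical sketch: the function name, the use of a logarithm, and the small `eps` guard against division by zero are assumptions rather than anything proposed in the text.

```python
import math


def breaker_continuum_index(p_spilling: float, p_plunging: float, eps: float = 1e-12) -> float:
    """Log ratio of class probabilities: negative toward spilling, positive toward plunging."""
    return math.log((p_plunging + eps) / (p_spilling + eps))


# Hypothetical classifier outputs for three waves.
ambiguous = breaker_continuum_index(0.5, 0.5)
mostly_plunging = breaker_continuum_index(0.1, 0.8)
mostly_spilling = breaker_continuum_index(0.8, 0.1)
print(ambiguous, mostly_plunging, mostly_spilling)
```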

Conclusions
We successfully classify breaker type in infrared imagery of surf zone waves using deep convolutional neural networks (CNNs). Six CNN-based models are tested using weights learned from generic datasets and a logistic regression classifier. The simplest model (MobilenetV2) performs optimally across all three wave classes (non-breaking, spilling, and plunging), with average classification accuracies of 89% and 93%, without and with image augmentation respectively. Classification error is less than 20% for all classes in all models, except Resnet-50, when training datasets are augmented using random image transformations, and neither residual network connections (Resnet-50 and Inception-ResnetV2) nor very deep networks (VGG19) are required for high classification accuracy.
Class activation maps reveal that the regions of the image that are important for CNN-based class determination correspond to dynamically relevant features of the different breaker types (i.e., the aerated roller on the front face and at the crest of a spilling breaker [3,4], and the strained and streaky back face of a plunging breaker [12,13]). Misclassification is most common for images of ambiguous breaker type and breakers transitioning from onset to steady state. For these times when a discrete class may not be appropriate to describe the wave state, CNN-based models may be useful for defining the continuum of breaker type and identifying the dynamic features of a developing breaker.
We have presented a technique that may be applied to imagery representing a small spatial footprint of just a few square meters. Therefore, it seems likely that data-driven approaches such as this might be used to identify specific waves in a field of breaking waves by analyzing small regions of imagery with large spatial footprints of thousands of square meters, which would open new research avenues. Another potential next step in the application of CNN techniques to IR imagery in the surf zone is image segmentation, which involves classifying every pixel in an image [23,34,35]. Of particular interest would be the spatial extent of active and passive foam, as well as streaks and other thermal patterns, which may be useful for estimating wave energy dissipation.

Figure 1. (A) An IR image of the surf zone and the pier, with detected breaking indicated by the transparent red layer. The corresponding LIDAR transect (solid black curve) and the pier IR camera field of view (dashed outline) are projected into the image; (B) corresponding example image from the pier IR camera with the LIDAR transect (black line) projected into the image; (C) corresponding example LIDAR transect (same as shown in (A,B)), with active breaking highlighted in red on the front face of the wave, and the field of view of the pier IR camera marked by dashed lines.

Figure 2. Example IR images of the three categorical wave classes: (A) plunging; (B) spilling; and (C) unbroken waves. The back face of a spilling breaker exhibits an unorganized texture, whereas that of a plunging breaker is characterized by an organized streak pattern.

Figure 5. Class activation maps of five sequential example images for each of three wave breaker classes: (A-E) plunging; (F-J) spilling; and (K-O) unbroken. Relatively high values (red colors) are regions of relative importance with respect to the class considered.

Table 1. CNN classification results. Rank-1 and F1 scores for each of six models without (and with) image augmentation.