Automatic Hierarchical Classification of Kelps Using Deep Residual Features

Across the globe, remote image data is rapidly being collected for the assessment of benthic communities from shallow to extremely deep waters, on continental slopes and in the abyssal seas. Exploiting this data is presently limited by the time it takes for experts to identify organisms found in these images. With this limitation in mind, a large effort has been made globally to introduce automation and machine learning algorithms to accelerate both classification and assessment of marine benthic biota. One major issue lies with organisms that move with swell and currents, such as kelps. This paper presents an automatic hierarchical classification method (local binary classification), as opposed to the conventional flat classification, to classify kelps in images collected by autonomous underwater vehicles. The proposed kelp classification approach exploits learned feature representations extracted from deep residual networks. We show that these generic features outperform the traditional off-the-shelf CNN features and the conventional hand-crafted features. Experiments also demonstrate that the hierarchical classification method outperforms the traditional parallel multi-class classifications by a significant margin (90.0% vs. 57.6% and 77.2% vs. 59.0%) on the Benthoz15 and Rottnest datasets, respectively. Furthermore, we compare different hierarchical classification approaches and experimentally show that the sibling hierarchical training approach outperforms the inclusive hierarchical approach by a significant margin. We also report an application of our proposed method to study the change in kelp cover over time for annually repeated AUV surveys.


Introduction
Kelp forests support diverse and productive ecological communities throughout temperate and arctic regions worldwide. Environmental anomalies such as cyclones, storms, marine heat waves and climate change have a detrimental effect on benthic marine life including kelps [1]. Significant declines in kelp beds have been observed around the globe in recent decades, with the main drivers identified as eutrophication and climate change related environmental stressors. For instance, large-scale disappearance of kelp was observed in 2002 on the southern coast of Norway [2]. In Spain, large-scale reductions in two main species of kelp have also been observed since the 1980s [3].
Similarly, kelp populations in Australia have decreased as a consequence of climate change driven environmental stressors. On the east coast of Tasmania, the coverage of giant kelp Macrocystis pyrifera in the present decade is around 9% of the coverage in the 1940s [4]. This decline is consistent with the intrusion of warmer, nutrient-poor water from the East Australian Current, which now extends 350 km further south than in the 1940s [5]. Wernberg et al. [6] reported a rapid climate-driven transition of kelp forests to seaweed turfs in the Australian temperate reef communities, with kelp forests showing a 100 km poleward contraction from their pre-heatwave distribution on the Western Australia coast. This trend is alarming for the numerous endemic species that rely on kelp forests for support. Loss of kelp forests is also a major threat for Australia's fishing and tourism industries, which generate more than 10 billion Australian dollars per annum [7]. There is thus a pressing and immediate need for monitoring programs to document changes in kelp dominated habitats along coastlines worldwide and especially in temperate Australia.
Autonomous underwater vehicles (AUVs) are emerging as highly effective tools for monitoring changes in benthic marine environments, because (i) they can autonomously conduct non-destructive sampling in remote marine habitats; (ii) they can repeatedly survey the same spatial region to detect change over time; and (iii) they are fitted with a range of instrumentation to acquire both physical and biological data. AUVs were used to monitor the marine benthos across temperate and tropical environments in Australia [8,9]; to survey invasive pest species [10]; to document rapid loss of corals associated with warming events [9,11]; to describe benthic community structure at depths greater than 1000 m [12]; and assess environmental impacts of the Deepwater Horizon oil spill [13]. In a large-scale study of deep waters, the distribution patterns of kelp forests were investigated to provide useful insights on the effect of environmental changes on the kelp population [14]. The survey took an extremely long time to complete as marine biologists had to manually classify images and to identify kelp from imagery.
AUV driven monitoring can generate large quantities of imagery. For example, an AUV deployed in Western Australia collected more than 15,000 stereo image pairs each day and was deployed between 10 and 12 days each year [9]. Manual analysis of such a large number of images per deployment (150,000 to 200,000 stereo image pairs) takes a significant amount of time and effort and is the major bottleneck in data acquisition from AUV surveys. To promptly identify changes in benthic species, especially dominant habitat formers (such as kelps and corals), it is necessary to match image-analysis time to surveying time so that data can be analyzed rapidly and change patterns can be identified. Automatic classification is critical to speed up image analysis, and consequently automatic classification of benthic species has attracted interest from ecologists and computer scientists (e.g., [15–19]). Nonetheless, automated classification of AUV collected imagery is challenging because images are captured in dynamic shallow water with little to no control over lighting and with significant variations in what is visible and how it is perceived.
In this paper, we tackle the challenge of automatically annotating underwater imagery for the presence of kelp to detect changes in the coverage of Australian kelp forests. The common practice is to study the distribution and density of benthic species, which involves manually annotating a smaller dataset and then extrapolating these results to make inferences about the sites under study. Automating the process of determining kelp coverage will significantly decrease image processing times and will allow for large-scale analysis of datasets and for early identification of changes in kelp cover. To automate this process, it is paramount to select appropriate features. In computer vision tasks, the general trend has shifted from conventional hand-crafted features to off-the-shelf deep features [20]. Hand-crafted features, which usually encode one aspect of the data (i.e., color, shape or texture), were a popular choice as image representations for benthic marine species recognition tasks in the works of [15,18,21,22]. However, given that hand-crafted features are designed specifically for the task at hand, they generally do not perform well when applied to a different task. Recently, Convolutional Neural Networks (CNNs) and features extracted from pre-trained CNNs have become the preferred choice for benthic marine image classification tasks, e.g., [19,23–25]. These off-the-shelf features are image representations learned by a deep network trained on a larger dataset such as ImageNet. Off-the-shelf CNN features are generic and have shown better performance than hand-crafted features on a variety of image recognition tasks [20]. In this paper, we propose to apply image representations extracted from deep residual networks (ResNets) to further improve the automatic annotation of benthic species. Besides better performance, one big advantage of ResNets is their faster training time and ease of optimization.
Figure 1 depicts the evolution of classification pipelines for automatic benthic marine species annotation. The main motivation for using ResNet as a base network to extract features for kelp classification is its superior performance over previous deep networks [26]. Moreover, the feature extraction is fast due to the low computational complexity of ResNets and the reduced number of floating point operations (FLOPs). Also, the feature extracted from ResNet is 2048-dimensional, which is half the size of the traditional 4096-dimensional feature vector of previous networks such as VGG16 [27]. These compact features result in reduced memory requirements for storing the features of large benthic marine datasets.

The main contributions of this paper are:
1. The first application of deep learning for automated kelp coverage analysis.
2. A supervised kelp image classification method based on features extracted from deep residual networks, termed Deep Residual Features (DRF).
3. A comparison of the classification performance of the DRF with the widely used off-the-shelf CNN features for automatic annotation of kelps.
4. Experiments demonstrating DRF's superior classification accuracy compared to previous methods for kelp classification.
5. A comparison of hierarchical image classification with multi-class image classification, reporting the accuracies and mean f1-scores for two large datasets.
6. An application of our proposed method to automatically analyze kelp coverage across five regions of Rottnest Island in Western Australia.
7. A demonstration of the performance of the proposed kelp coverage analysis technique using ground truth data provided by marine experts, showing a high correlation with previously conducted manual surveys.
The paper is organized as follows. In Section 2, we will briefly review related work. In Section 3, we present our proposed approach and explain the features extracted from deep networks. We then report the experimental results and kelp coverage analysis. In Section 4, we discuss the next steps required to implement our proposed method to a platform to rapidly analyze benthic images. Section 5 concludes this paper.

Kelp Classification
Previous studies on automatic classification and segmentation of kelps in benthic marine imagery were based on hand-crafted features (Table 1). To the best of our knowledge, deep networks or features extracted from deep networks have not yet been applied to solve this problem. Here we briefly summarize a few of the prominent studies focused on automating kelp identification. Denuelle and Dunbabin [16] generated kelp probability maps using Haralick texture features computed across an entire image. They reported that supervised and unsupervised segmentation yielded similar results. Color imbalance resulted in a significant number of false positives, implying that the images collected must be diverse enough to cover the various possible underwater lighting and visibility conditions. When compared to manual segmentation by experts, the results showed good agreement.
Bewley et al. [17] presented a technique for the automatic detection of kelps using AUV gathered images. The proposed method used local image features which are fed to Support Vector Machines (SVM) [29] to identify whether kelp is present in the image under examination. Several descriptors, such as Local Binary Patterns (LBP) and Principal Component Analysis, were compared across multiple scales. This algorithm was tested on benthic data (collected from Tasmania in 2008), which contained 1258 images with 62,900 labels and 19 classes. The f1-score, which is the harmonic mean of precision and recall, was used to evaluate the performance of their proposed method: a maximum f1-score of 0.69 was reported for kelps. It was also suggested that practical systems can be built to assist scientists with automatic identification of kelps. They also concluded that results could be improved by using combinations at multiple scales, finding superior descriptors and using more supplementary AUV data. The study concluded that for a local geographical region, and for a particular species, sufficient generalization is possible.
This work was extended in [28] for a multi-class classification problem in the presence of a taxonomical hierarchy. A local classifier was trained for each node of the hierarchy tree for LBP features and the classification results were compared through multiple hierarchy training methods. This algorithm achieved an f1-score of 0.75 for kelps and an overall mean f1-score of 0.197 for all 19 classes present in the dataset.

Deep Learning for Benthic Marine Species Recognition
In recent years, deep networks and off-the-shelf CNN features have become the first choice to tackle computer vision tasks. Only a handful of studies have developed benthic marine species recognition methods based on deep learning. Beijbom et al. [23] trained three and five-channel deep CNNs based on the CIFAR10 LeNet architecture [30] to improve the classification performance for coral and non-coral species. Reflectance and fluorescence images were registered together to obtain a five-channel image, which improved the classification performance by a significant margin. This was the first reported study to employ training of deep networks (from scratch) for benthic marine species recognition.
Off-the-shelf CNN features [20] along with multi-scale pooling were first used for coral classification in [19] on the Moorea Labelled Coral (MLC) dataset, which is a challenging dataset introduced in [18]. This paper also explored a hybrid feature approach, combining CNN features with texton maps to further improve the classification accuracy on this dataset. Class imbalance, the disproportionate difference in the number of points allocated to some classes compared to others, is an additional problem. This is a common issue in benthic marine datasets, as some species are significantly more abundant than others. To address the class imbalance, a cost-sensitive learning approach was studied in [31] using off-the-shelf CNN features for the MLC dataset. In another study, features extracted from pre-trained deep networks were used to generate coral population maps for the Abrolhos Islands in Western Australia [24]. This study reported a trend of decreasing live coral cover in this region, which is consistent with the manual analysis of AUV images conducted by marine researchers [9,11].
Deep residual networks (ResNets) are a special class of CNNs and are deeper, faster to train and easier to optimize than previous CNN architectures [26]. ResNets employ techniques such as residual learning and identity mapping for shortcut connections [32], which enables them to overcome the limitations of traditional CNNs and outperform them in training speed and accuracy. ResFeats, features extracted from the output of convolutional layers of a 50-layer ResNet (ResNet-50), were reported to improve the performance of different image classification tasks in [33], including coral classification on the MLC dataset. Although these features are computationally expensive large arrays, we chose to use the image representations extracted from the layers closer to the output end of ResNet-50 to reduce computation cost and alleviate the need for dimensionality reduction.

Methods and Results
In this section, we outline the key components of our proposed method (Figure 2) and present the adopted experimental protocols.

Benthoz15 Dataset
This Australian benthic data set (Benthoz15) [34] consists of an expert-annotated set of geo-referenced benthic images and associated sensor data. These images were captured by AUV Sirius during Australia's Integrated Marine Observing System (IMOS) benthic monitoring program at multiple temperate locations (Table 2) around Australia [8]. Marine experts manually annotated each of these images according to the Collaborative and Automation Tools for Analysis of Marine Imagery and Video (CATAMI) classification scheme. For each image, up to 50 randomly selected pixels were hand labelled using the Coral Point Count with Excel Extensions (CPCe) software package [35]. For each labelled pixel (point), a square patch of 224 × 224 pixels, centered at the labelled pixel, is extracted and used as the input for feature extraction. Because the pixels are selected at random, several of them fall on class boundaries, making the classification problem more challenging. The whole dataset contains 407,968 expert labelled points, taken from 9874 distinct images collected at different depths and sites over the past few years. There are 145 distinct class labels in this dataset, with the number of labelled pixels per class ranging from 2 to 98,380. Of these 145 classes, 33 belong to macroalgae (MA) species, and 63,722 of the labelled points belong to the kelp class. Further details on the labeling methodology can be found in [34].
The Rottnest Island dataset was also collected by AUV Sirius and contains 297,800 expert labelled points, taken from 5956 distinct images collected at different depths from five sites around Rottnest Island from 2010 to 2013 (Table 3). Three of the five sites are labelled north (15 m, 25 m and 40 m depth) and two are labelled south (15 m and 25 m depth). There are 78 distinct class labels in this dataset, with the number of labelled pixels per class ranging from 2 to 155,776 (Table A1). This wide range makes the classification quite challenging. Of these 78 classes, 25 belong to macroalgae species, and 156,000 of the labelled points belong to the kelp class.
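The patch extraction step described above can be sketched as follows. This is a minimal illustration in NumPy rather than the pipeline used in the experiments: the function name `extract_patch` is hypothetical, and the reflective padding used for points near the image border is an assumption, since the text does not say how border points were handled.

```python
import numpy as np

def extract_patch(image, row, col, size=224):
    """Return a size x size patch centred on the labelled pixel (row, col).

    `image` is an (H, W, C) array. Border handling by reflection is an
    assumption made for this sketch; the paper does not specify it.
    """
    half = size // 2
    # Pad so that labelled points near the border still yield a full patch.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the window starting at (row, col) in `padded` is centred on it.
    return padded[row:row + size, col:col + size, :]

# Toy image standing in for an AUV frame.
image = np.arange(300 * 400 * 3).reshape(300, 400, 3)
patch = extract_patch(image, 150, 200)
```

Each labelled point then contributes one such patch, which is the unit passed to the feature extractor.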

Classification Methods
Deep residual features are extracted from the output of the last convolutional block of a 50-layer deep residual network (ResNet-50) [26] that is pre-trained on ImageNet. Figure 3 shows the architecture of the ResNet-50 deep network which we have used for feature extraction. The ResNet-50 is made up of five convolutional blocks stacked on top of each other (Figure 3). The convolutional blocks of a ResNet differ from those of traditional CNNs because of the introduction of a shortcut connection between the input and output of each block. Identity mappings, when used as shortcut connections in ResNets [32], can lead to better optimization and reduced complexity. This in turn allows one to use deeper ResNets which are faster to train and computationally less expensive than conventional CNNs, e.g., VGGnet [27].
Figure 3. ResNet-50 architecture [26] shown with the residual units, the size of the filters and the outputs of each convolutional layer. DRF extracted from the last convolutional layer of this network is also shown. Key: The notation k × k, n in the convolutional layer block denotes a filter of size k and n channels. FC 1000 denotes the fully connected layer with 1000 neurons. The number on the top of the convolutional layer block (× 3, × 4, × 6, × 3) represents the repetition of each unit. nClasses represents the number of output classes.
The image representations extracted from the fully connected layers of deep networks pre-trained on ImageNet [20] capture the overall shape of the object contained in the region of interest. The features extracted from the deeper layers encode class specific properties (i.e., shape, texture and color) and give superior classification performance compared to features from shallower layers [36]. Hence, we propose to extract the features from the output of the last convolutional block of ResNet-50 (Figure 3). The output of the Conv5 block is a 7 × 7 × 2048 dimensional array and is used as input to the FC-1000 layer. This large array is, however, first converted to a 2048-dimensional vector by a max-pool layer. We extract this 2048-dimensional vector and name it DRF. We do not use the FC-1000 layer for feature extraction because it is the output layer that classifies the 1000 classes of the ImageNet dataset, which was used to pre-train this network. Our feature extraction method is different from the conventional method employed in previous deep networks such as VGGnet. The presence of multiple fully connected layers in the VGGnet makes the feature extraction straightforward. In contrast, the only fully connected layer in ResNet is class specific to the ImageNet dataset. Therefore, we propose to use the output of the last convolutional block for DRF extraction.
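The pooling step that turns the Conv5 output into a DRF can be illustrated with a short NumPy sketch. It assumes, following the description above, that the 7 × 7 × 2048 array is reduced by taking the maximum over the two spatial axes; the function name `deep_residual_feature` is hypothetical and random values stand in for real ResNet-50 activations.

```python
import numpy as np

def deep_residual_feature(conv5_output):
    """Collapse a (7, 7, 2048) activation array into a 2048-vector.

    Global max pooling over the two spatial axes, as described in the text.
    """
    conv5_output = np.asarray(conv5_output)
    return conv5_output.max(axis=(0, 1))

rng = np.random.default_rng(0)
conv5 = rng.random((7, 7, 2048))  # stand-in for real Conv5 activations
drf = deep_residual_feature(conv5)
```

The resulting vector has one entry per Conv5 channel, which is what makes it half the size of a 4096-dimensional VGG16 feature.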
There are three different approaches described in [37] to deal with the hierarchical classification problem:
1. Flat Classification: This approach ignores the hierarchy and treats the problem as a parallel multi-class classification problem.
2. Local Binary Classification: A binary classifier is trained for every node in the hierarchical tree of the given problem.
3. Global Classification: A single classifier is trained for all classes and the hierarchical information is encoded in the data.
We have used the local binary classification technique in this paper to distinguish kelps from other taxa. This approach is easier to implement and more useful when not all nodes in the hierarchy are labeled down to the leaf-node level. For example, some macroalgae are not labeled to the species level in the Benthoz15 dataset [34]. Moreover, this approach also allows for the use of different features, training sets and classifiers for each node of the hierarchy tree. The hierarchy tree for kelps is shown in Figure 4.
Figure 4. Hierarchy tree for kelps in our benthic data. In each node, the first line shows the node number, the second line shows the name of the species, and the third and fourth lines show the number of labels belonging to that particular species in the Benthoz15 and Rottnest Island data, respectively.

Training and Testing Protocols
In this paper, two training approaches are used, namely inclusive training and sibling training. In the inclusive training method, all the non-kelp samples from the entire dataset are treated as negative samples, i.e., nodes 1.2 and 1.1.2 in Figure 4. In the sibling training method, however, only the non-kelp samples that fall under the macroalgae node are treated as negative samples, i.e., node 1.1.2 in Figure 4. We use a linear Support Vector Machine (SVM) [29] classifier because it has shown excellent performance with features extracted from deep networks [20]. We use the SVM classifier in a one-vs-all configuration with a linear kernel. We perform 3-fold cross validation within the training set to optimize the SVM parameters, and the mean performance is reported in Section 3.
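The difference between the two training policies comes down to which nodes supply the negative samples. The sketch below illustrates this with plain Python. The node numbers 1.2 and 1.1.2 are taken from the text; the kelp node id `1.1.1` and the function name `build_binary_training_set` are assumptions made for illustration only.

```python
KELP = "1.1.1"              # assumed node id for kelp (not stated in the text)
OTHER_MACROALGAE = "1.1.2"  # non-kelp macroalgae, the sibling of kelp
NON_MACROALGAE = "1.2"      # everything outside the macroalgae node

def build_binary_training_set(samples, policy):
    """Map (feature, node_id) pairs to (feature, label) pairs.

    Label 1 marks kelp and label 0 marks a negative sample. Samples whose
    node is not used by the chosen policy are dropped.
    """
    if policy == "inclusive":
        negatives = {OTHER_MACROALGAE, NON_MACROALGAE}
    elif policy == "sibling":
        negatives = {OTHER_MACROALGAE}
    else:
        raise ValueError(f"unknown policy: {policy}")

    training_set = []
    for feature, node_id in samples:
        if node_id == KELP:
            training_set.append((feature, 1))
        elif node_id in negatives:
            training_set.append((feature, 0))
    return training_set

# Toy samples: the feature is just a placeholder string here.
samples = [("a", "1.1.1"), ("b", "1.1.2"), ("c", "1.2"), ("d", "1.1.1")]
inclusive = build_binary_training_set(samples, "inclusive")
sibling = build_binary_training_set(samples, "sibling")
```

The sibling set is never larger than the inclusive set, since it simply omits the negatives drawn from outside the macroalgae node.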

Image Enhancement and Implementation Details
We applied a color channel stretch to each image in the dataset to reduce the effect of underwater color distortion. We calculated the averages of the lowest 1% and the highest 99% of the intensities for each color channel. The average of the lowest 1% intensities was subtracted from all the intensities in each respective channel and the negative values were set to zero. These intensities were then divided by the average of the highest 99% of the intensities. This process enhanced the color information of benthic marine images.
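A possible reading of this color channel stretch is sketched below in NumPy. Two interpretations are assumptions of this sketch rather than statements from the text: "the lowest 1%" is taken to mean the pixels at or below the 1st percentile, and "the highest 99%" is taken to mean the pixels at or above the 99th percentile. Both averages are computed on the original intensities before the subtraction, following the order in which the steps are described. The function names are hypothetical.

```python
import numpy as np

def stretch_channel(channel, low_pct=1.0, high_pct=99.0):
    """Stretch one color channel as described in the text (assumed reading)."""
    channel = np.asarray(channel, dtype=np.float64)
    low_cut, high_cut = np.percentile(channel, [low_pct, high_pct])
    # Averages of the darkest and brightest tails of the original intensities.
    low_mean = channel[channel <= low_cut].mean()
    high_mean = channel[channel >= high_cut].mean()
    # Subtract the dark average and set negative values to zero.
    shifted = np.clip(channel - low_mean, 0.0, None)
    if high_mean == 0:
        return np.zeros_like(shifted)  # completely black channel
    return shifted / high_mean

def stretch_image(image):
    """Apply the stretch independently to each channel of an (H, W, C) image."""
    channels = [stretch_channel(image[..., c]) for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)

# Toy channel covering every 8-bit intensity once.
ramp = np.arange(256).reshape(16, 16)
stretched = stretch_channel(ramp)
stretched_rgb = stretch_image(np.stack([ramp, ramp, ramp], axis=-1))
```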
For feature extraction, we used a pre-trained ResNet-50 [26] deep network architecture in our experiments. We used the publicly available model of this network, which was pre-trained on the ImageNet dataset. We implemented our proposed method using MatConvNet [38] and the SVM classifier using LIBLINEAR [39] (Figure 2). We performed our experiments with three different classification approaches: flat classification, and local binary classification with both inclusive and sibling training policies. The overall classification accuracy alone is not an effective measure of binary classifier performance for datasets exhibiting a skewed class distribution. Therefore, to evaluate the performance of our classifier, we have used four evaluation criteria: overall classification accuracy, mean f1-score (the average of the f1-scores of each class involved in the test data), and the precision and recall values for kelp.
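The four evaluation criteria can be written out in a few lines of plain Python. This is a minimal sketch with hypothetical function names and toy labels; the mean f1-score averages over the classes present in the test labels, as defined above.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, cls):
    """Precision, recall and f1-score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # f1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_f1(y_true, y_pred):
    """Average of the per-class f1-scores over classes in the test labels."""
    classes = sorted(set(y_true))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in classes) / len(classes)

# Toy labels for illustration only.
y_true = ["kelp", "kelp", "kelp", "other", "other", "other"]
y_pred = ["kelp", "kelp", "other", "kelp", "other", "other"]
kelp_precision, kelp_recall, kelp_f1 = precision_recall_f1(y_true, y_pred, "kelp")
```

Because every class contributes equally to the mean f1-score, classes with very few samples can pull it down sharply even when the overall accuracy is moderate.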

Classification Results
In this section, we report the results of three different types of features for the three training methods on the two datasets: (i) Maximum Response (MR) filter and texton maps of [18] as baseline handcrafted features. We used a publicly available implementation of this method; (ii) CNN features extracted from a VGG16 network pretrained on ImageNet dataset [27]; (iii) Our proposed DRFs extracted from a pretrained ResNet-50.
Classification by the DRF method always outperformed the traditional CNN features and MR features in both datasets as it consistently showed higher accuracy, higher f1 scores, higher precision of kelps and higher kelp recall than previously used features. Additionally, hierarchical classification (sibling and inclusive) in comparison to flat classification, also improved f1-score and recall of kelps while providing lower training times. The sibling training method achieved the highest f1-score for both datasets. Because f1-score is an evaluation metric based on both precision and recall, we recommend the sibling training method as the top performing practical method for classification and automated coverage analysis of kelps.

Benthoz15 Dataset
To highlight the superior classification performance of DRF, we have included a comparative study of DRF, the traditionally used CNN features extracted from VGGnet [27], and MR features (Table 4). The DRF method performs better than both of the other features for all three classification experiments. The lowest overall accuracy was achieved by the flat multi-class classification method (57.6%). Additionally, a very low mean f1-score of 0.05 was observed, since many of the 145 classes had very few samples for training and testing. Nonetheless, the flat classification method achieved the highest precision (71%) for kelps among the three methods: out of every 100 samples that this method labels as kelp, 71 are actually kelp. However, this method resulted in the worst recall value of 65% (Table 4).
The best classification accuracy is achieved with the inclusive training method (90%), for which all the non-kelp samples are bundled together in the negative class. This training scheme achieves a mean f1-score of 0.79, which is similar to the highest f1-score of 0.80 obtained using the sibling training method (Table 4).
The sibling training method is more challenging than the inclusive training method because the negative samples only include macroalgae classes, and some of these classes are very similar to kelp in appearance. This accounts for a drop in classification accuracy from 90% to 83.4%. However, the sibling training method resulted in the highest mean f1-score (0.80) and recall value (78%) for kelp. Moreover, statistical testing supports the hypothesis that all three DRF classifiers are better than their VGG and MR counterparts at a significance level of 0.05. For each DRF feature X and competing feature Y ∈ {MR, VGG}, we performed a paired t-test over randomly chosen image samples (N = 50,000), using the SVM classifier. For each pairing of features (X, Y), feature X gave better classification than feature Y: the calculated p-value was less than 0.05, which led us to reject the null hypothesis that both classifiers show similar performance.
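The arithmetic behind a paired t-test can be sketched as follows. The text does not state which per-sample quantity was paired, so this sketch assumes per-sample correctness indicators (1 for a correct prediction, 0 otherwise) for the two classifiers on the same samples. The p-value uses a normal approximation to the t-distribution, which is only reasonable for large N such as the 50,000 samples mentioned above; the function names are hypothetical.

```python
import math

def paired_t_statistic(scores_x, scores_y):
    """t-statistic for paired samples: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the differences.
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all paired differences are identical")
    return mean_diff / math.sqrt(variance / n)

def two_sided_p_value_normal(t):
    """Two-sided p-value under a normal approximation (large N only)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Toy correctness indicators for two classifiers on the same eight samples.
correct_x = [1, 1, 1, 0, 1, 1, 0, 1]
correct_y = [0, 1, 0, 0, 1, 0, 0, 1]
t_stat = paired_t_statistic(correct_x, correct_y)
p_value = two_sided_p_value_normal(t_stat)
```

With only eight toy samples the normal approximation is not appropriate for inference; the example is meant to show the computation, not to reproduce the reported test.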

Rottnest Island Dataset
The DRF was then applied to the Rottnest Island data, where it once again outperformed the VGG and MR features for all the classification experiments (Table 5). The hierarchical methods performed better than the flat classification method for all evaluation criteria except precision. The flat classification method, however, achieved the worst recall value. This is consistent with the results obtained on the Benthoz15 dataset. The mean f1-score for the flat classifier (0.03) is again very low, given that all 78 classes are classified at the same time. The sibling training method comes out as the best method with respect to accuracy (77.2%), mean f1-score (0.76) and recall value (79%) of kelps. Moreover, the sibling training method is also the fastest method because it has fewer negative examples than the inclusive method.
Fine-tuning a deep network is also a popular approach for transfer learning [40]. We also compared our proposed method with fine-tuning. Fine-tuning a ResNet-50 on the Rottnest Island data achieved an overall classification accuracy of 58.8%, compared to the 59.0% achieved by our proposed method. For the Benthoz15 dataset, fine-tuning a ResNet-50 resulted in an overall classification accuracy of 57.1%, which is 0.5% lower than our proposed method. The performance change was marginal for both datasets. Hence, we concluded that the classification accuracy achieved by both methods on benthic marine datasets is comparable. One important aspect to compare is the computational time required by these two approaches. The time needed to extract off-the-shelf features from a ResNet and classify them using an SVM classifier is far less than the time required to fine-tune a 50-layer ResNet on a dataset as large as 297,800 input images. Our proposed method requires a few hours to run. However, fine-tuning a ResNet-50 on the Rottnest Island dataset takes at least 2 days on an Nvidia Titan-X GPU. Given these considerations, we selected our proposed method over fine-tuning a ResNet on a marine dataset.
Table 5. A comparison of flat, inclusive and sibling classification methods for kelp classification on the Rottnest Island dataset for MR, VGG, and DRF methods. The flat classification focuses on all the classes present in the dataset whereas the inclusive and sibling classification only includes kelps and non-kelps. Mean f1-score corresponds to the average of the individual f1-score of each class involved in the experiment. Best scores are shown in bold font.

Method Accuracy (%) Mean f1-score Precision of Kelps (%) Recall of Kelps (%)
One of many challenges in benthic cover estimations through image analysis is the large amount of time required to manually classify the imagery. The average time for manual annotation with 50 sample points per image is 8 minutes. A trained marine expert can annotate up to 8 images per hour. The proposed method is significantly less time consuming, as it results in an annotation rate of 1800 images per hour using an Nvidia Titan-X GPU. This is approximately 225 times faster than manual annotation by experts. Nonetheless, note that the proposed machine learning algorithm is only classifying 'kelp' vs. 'non kelp'. Although it is faster, it is not yet trained to classify 145 potential benthic classes. This paper evaluates the technique for a single class and presents a way forward to develop the methodology for other classes and faster processing times, which will allow scientists to promptly analyze changes in benthic community composition.
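The speed-up quoted above follows directly from the two annotation rates. The short calculation below restates that arithmetic; the helper `annotation_hours` is a hypothetical name introduced only for this illustration.

```python
MINUTES_PER_IMAGE_MANUAL = 8       # from the text: 50 sample points per image
MANUAL_IMAGES_PER_HOUR = 8         # from the text: "up to 8 images per hour"
AUTOMATIC_IMAGES_PER_HOUR = 1800   # from the text: Nvidia Titan-X GPU

# 60 / 8 = 7.5 images per hour, which the text rounds up to 8.
manual_rate_from_minutes = 60 / MINUTES_PER_IMAGE_MANUAL

# Ratio of the two annotation rates quoted in the text.
speedup = AUTOMATIC_IMAGES_PER_HOUR / MANUAL_IMAGES_PER_HOUR

def annotation_hours(n_images, images_per_hour):
    """Hours needed to annotate n_images at a given rate."""
    return n_images / images_per_hour
```

For example, `annotation_hours(1800, MANUAL_IMAGES_PER_HOUR)` gives the manual effort needed to match one hour of automatic annotation.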

Kelp Coverage Analysis
We extended our method to estimate kelp cover for the Rottnest Island dataset. The expert identified coverage was calculated by aggregating the pixel level ground truth labels in every image. We calculated the estimated kelp coverage by aggregating the predicted labels for the same locations for which the expert labels were available. Kelp cover estimated from the annotations generated by our proposed method was compared to the cover based on expert classification (Figure 5; Table 6). Scatter plots were generated for each of the five sites and for all the data included in the 2013 test set. An important application of our proposed method is to estimate the population trends of kelp across spatial and time scales. To accomplish this task, we split the Rottnest Island data by site instead of by year and trained a classifier on this basis. The three sites from the north constitute the training set and the two southern sites form the test set.
The first sub-plot in Figure 5 shows kelp coverage for all of the data included in the test set. The slope of the line generated by linear regression is very close to the ideal case. This highlights the robustness of our proposed algorithm. The remaining sub-plots show kelp coverage for each of the five sites. These sub-plots show good agreement between the annotations generated by our proposed method and the annotations provided by the human experts (Table 6). Moreover, we also calculated the R-squared (R²) value for each plot to show the correlation between the actual and predicted cover. Our proposed method achieved a high R² value for each individual site and for all sites combined.
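The coverage comparison can be sketched in plain Python. Percent cover is computed per image as the share of labelled points assigned to kelp, and the agreement between expert and estimated cover is summarized by the slope of a least-squares line and its R² value. Taking R² as the squared Pearson correlation of a simple linear fit is an assumption of this sketch, and the function names are hypothetical.

```python
def percent_cover(point_labels, target="kelp"):
    """Percentage of labelled points in one image that belong to `target`."""
    return 100.0 * sum(label == target for label in point_labels) / len(point_labels)

def linear_fit_r2(x, y):
    """Least-squares slope and intercept of y on x, plus R² of the fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Toy example: 10 kelp points out of 50 labelled points in one image.
cover = percent_cover(["kelp"] * 10 + ["sand"] * 40)

# Toy expert vs. estimated cover values lying exactly on a line.
expert_cover = [0.0, 10.0, 20.0, 30.0]
estimated_cover = [1.0, 21.0, 41.0, 61.0]
slope, intercept, r_squared = linear_fit_r2(expert_cover, estimated_cover)
```

In the ideal case the estimated cover equals the expert cover, which corresponds to a slope of 1, an intercept of 0 and an R² of 1.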
It is important to note that the DRF classification seems to over-predict kelp cover at high percentages of cover and to under-predict it at lower ones. Nevertheless, the estimated kelp coverage is not significantly different from the coverage calculated by the experts from the ground truth labels (Figure 6). This indicates the robustness of our proposed method for estimating kelp coverage. These results are beneficial to marine scientists since many surveys focus on estimating kelp coverage, which is an important metric of the health of kelp forests. Figure 7 shows the expert-identified and estimated percent cover of kelp across years for sites 2 and 4. For site 2, a slight overestimation of kelp cover by the DRF classification is visible; however, no distinct trend of change across years is observable in either the manual or the automatic classification. The estimation of kelp cover for site 4, on the other hand, shows no overestimation and, as for site 2, no trend in kelp cover over the years.

Discussion
The use of AUVs to survey benthic marine habitats has allowed scientists to investigate remote locations such as off-shore and deep sites, which are beyond the limits of traditional SCUBA diving. Nonetheless, the efficiency of image collection is not matched by the availability of data for ecological analysis, as image classification is time consuming and costly given that it is performed manually by marine experts. Additionally, manual classification has other disadvantages such as observer discrepancies and biases. Automated analysis of imagery is thus essential to fully benefit from the advantages of remote surveying technologies such as AUVs. In this study, we have addressed this problem by evaluating a machine learning automated image classification method using Deep Residual Features (DRF) for a key marine benthic species: the kelp Ecklonia radiata.
We have demonstrated that the image representations extracted from pre-trained deep residual networks can be effectively used for benthic marine image classification in general and kelps in particular. These powerful and generic features outperform traditional off-the-shelf CNN features, which have already shown superior performance over conventional hand-crafted features [19,20].
The sibling and inclusive hierarchical training methods further enhance performance when compared to flat multi-class classification methods. The sibling and inclusive training methods show comparatively similar performance. However, the sibling method is superior because it has lower training time than the inclusive method. Furthermore, estimations of kelp cover by automated DRF classification closely resemble those of manual expert classifications with the added advantage of faster processing times. This work provides evidence that automatic annotations may save resources and time while providing effective estimates of benthic cover.
This method was also applied to a dataset to compare kelp coverage for multiple sites, across three depths and for a consecutive time series of four years (2010-2013) at Rottnest Island. The patterns observed showed differences in percent cover of the kelp Ecklonia radiata between sites (with higher percentage cover of kelp in shallower sites compared to deeper sites) and no considerable change in kelp cover across years. These trends were similar to those observed in manually classified data, once more confirming the usefulness of automated image classification methods and the ability to use them for ongoing monitoring of kelp beds with AUV technology.
In this study, we found no evidence of catastrophic loss of kelp over the years at any of the sites surveyed at Rottnest Island. These results are comparable to previous estimates of change in E. radiata cover across depth in Australia, performed with manually classified images [14]. They are in contrast with trends of significant and continuous kelp decline reported in the region after an extreme marine heatwave which resulted in widespread mortality of benthic species including corals, seagrasses, invertebrates and kelp [6]. The loss of kelp in Western Australia resulted in a range contraction of 100 km [6] and in the closure of fisheries for crab and scallop, benthic species associated with kelp habitat. Importantly, the kelp loss was reported in habitats shallower than 15 m, with little attention to the response of deeper habitats to the heatwave [9]. This may be why our results contrast with studies reporting catastrophic loss of kelp, since our shallowest locations were at 15 m of depth, and most in situ studies take place even shallower (about 12 m). Additionally, all our sites were located off-shore (even the shallow ones), which may indicate that off-shore sites are less impacted by environmental pressures. Because of their distance from shore and from human populations, these sites may not be exposed to the other environmental disturbances that affect coastal habitats. The interaction of several disturbances was shown to cause ecological responses such as widespread mortality of marine benthic species [41]. Kelps growing offshore and in deeper locations (>15 m of depth) appear to be less impacted by extreme warming in contrast to coastal shallow reefs [42]. As a result of the catastrophic consequences that extreme climatic events may have on key habitat building species, such as kelp, deeper marine regions were identified as potential refugia for shallow marine species [43-45].
This emphasizes the importance of AUV surveys to provide information on offshore and deep locations which may be influenced by different factors to their inshore counterparts [9]. The use of automated image analysis for processing AUV images will streamline the processing of these images to efficiently identify patterns observed in deep and remote locations and compare them with patterns observed in shallow and inshore sites.
The rapid characterization of ecological changes is crucial in light of the catastrophic threats to marine biodiversity posed by the rise of extreme climatic events driven by climate change and other anthropogenic stressors. Technology has enabled the rapid collection of images even in remote locations through autonomous underwater vehicles, remotely operated vehicles, automated cameras and even satellite imagery. The subsequent annotation of such imagery is typically time consuming and, consequently, the automation of marine species classification from digital images has become a priority. This study focuses on the kelp species E. radiata, which is the dominant habitat builder of temperate reefs in Australia, though automated classification has also been applied to other important marine species. For example, progress in automated tropical coral identification has resulted in accurate classification to the level of genera [46]. Other successful automated classification techniques for coral reefs combine the collection of multifaceted data, minimal manual classification effort (around 2% of pixels) and machine learning techniques, resulting in cm-scale benthic habitat maps of high taxonomic resolution and accuracy of up to 97% [47]. Similarly, for pelagic species such as fish, automated classification has advanced rapidly, with automated fish detection and identification algorithms also measuring basic fish morphological features such as total length [48,49]. In contrast, automated methods for identification of marine macroalgae from benthic images still result in low agreement [46], highlighting the need for more research into unequivocal definitions of algal groups for image classification.
Although the proposed DRF classification method allowed us to compare kelp cover at different sites and across different years with only marginal differences from the estimations based on manual annotations, there were some errors associated with the proposed technique. We observed an over-prediction of kelp at high percentage cover and under-prediction at low cover. Nonetheless, the over-prediction was smaller when data was divided per site, and at some sites (4 and 5) it was negligible. Overall, the estimated kelp cover closely resembles manual classification and, taking into consideration the cost effectiveness of automated DRF classification methods, the benefits of the automated classification method outweigh the drawbacks. As such, automated classification of kelp from AUV-derived images constitutes a cost-effective method for estimating kelp abundance across space and time.
A comparison of the best overall accuracies of hierarchical classification across the two datasets shows that both the sibling and inclusive DRF classifiers achieved better classification accuracy on the Benthoz15 dataset than on the Rottnest Island dataset. For example, the inclusive DRF classifier for the Benthoz15 dataset (Table 4) has an absolute gain of 15% over the respective classifier for the Rottnest dataset (Table 5). This substantial difference is possibly due to the high presence of the brown alga Scytothalia dorycarpa in the Rottnest Island data. Scytothalia dorycarpa is very similar to kelp in appearance and usually occurs in areas of the sea floor with high cover of kelp. Therefore, marine scientists may misclassify it as kelp in poor quality images. This misclassification is possible if the point falls on the edge of Scytothalia dorycarpa, where the boundary between the two species is not clear. The expert misclassification of Scytothalia dorycarpa as kelp may also explain the over-prediction of kelp by the DRF classification method at high percentage cover. The over-prediction of the automated classification is actually an overestimation of the kelp cover by the manual annotation method. The subjectivity in the classification is removed by the automated analysis, which uses several features to classify kelp. Figure 8 illustrates the similarity of appearance of these two species. Poor quality images (low light and resolution) will also affect the manual classification of other classes of algae such as 'turf matrix', 'fine branching red algae' or other canopy forming brown algae. These and other algae classes are not as common as kelp at the sites surveyed at Rottnest Island. Thus, misclassification associated with manual annotations may also explain the over-prediction of kelp at low percentage covers. At low cover of kelp, a turf and foliose matrix of red algae occurs on the rocks.
In areas of low kelp cover it is easy for an expert to distinguish kelp from other classes. However, perhaps because of the imbalance in the data used to train the classifier, other classes are sometimes classified as kelp, resulting in over-prediction by the DRF classification method. These issues highlight the need for larger training datasets for deep learning-based automatic annotation. Extensive and comprehensive training sets will allow for better classifier training and give the opportunity to increase the amount of biota classified automatically (e.g., other algae species, corals, sponges, invertebrates such as sea urchins, and lobsters). Future work will explore multi-class classification of benthic marine species across diverse benthic habitats so that methods based on deep learning algorithms can be applied to numerous ecological problems that include other benthic marine species. Scientists who use data extracted from image classification should keep these considerations in mind when manually annotating images, since these datasets are extremely valuable for deep learning-based automatic classification.

Conclusions
The aim of this study was to investigate deep learning techniques for automatic annotation of kelp species in complex underwater scenery. Towards this end, we evaluated a Deep Residual Features (DRF)-based method to carry out this task and showed that it outperformed the widely adopted off-the-shelf CNN-based classification. We also established that hierarchical classification with the sibling method gave superior results compared to the flat multi-class approach, with the added advantage of faster training times. Our results suggest that the proposed automatic kelp annotation method can significantly reduce the number of human-hours spent on manual annotation. Additionally, our proposed method can enhance the effectiveness of AUV monitoring campaigns by facilitating the early detection of changes in the population of key species through rapid image processing times, as demonstrated with examples from the Rottnest Island dataset. To conclude, the proposed DRF-based automatic annotation of benthic images is, to date, the most accurate machine learning technique for estimation of kelp cover.