SeeCucumbers: Using Deep Learning and Drone Imagery to Detect Sea Cucumbers on Coral Reef Flats

Abstract: Sea cucumbers (Holothuroidea, or holothurians) are a valuable fishery resource and are also crucial nutrient recyclers, bioturbation agents, and hosts for many biotic associates. Their ecological impacts could be substantial given their high abundance in some reef locations, and thus monitoring their populations and spatial distribution is of research interest. Traditional in situ surveys are laborious and only cover small areas, but drones offer an opportunity to scale observations more broadly, especially if the holothurians can be automatically detected in drone imagery using deep learning algorithms. We adapted the object detection algorithm YOLOv3 to detect holothurians from drone imagery at Hideaway Bay, Queensland, Australia. We successfully detected 11,462 of 12,956 individuals over 2.7 ha with an average density of 0.5 individuals/m². We tested a range of hyperparameters to determine the optimal detector performance and achieved 0.855 mAP, 0.82 precision, 0.83 recall, and 0.82 F1 score. We found as few as ten labelled drone images were sufficient to train an acceptable detection model (0.799 mAP). Our results illustrate the potential of using small, affordable drones with direct implementation of open-source object detection models to survey holothurians and other shallow water sessile species.


Introduction
Sea cucumbers (Holothuroidea), or holothurians (also known as bêche-de-mer), are a valuable fishery resource due to their high market demand [1][2][3][4]. They also play an important role as recyclers of nutrients to other trophic levels, hosts for many biotic associates, and crucial bioturbation agents that maintain and improve sediment quality [5,6]. Species such as Holothuria atra, H. mexicana, Isostichopus badionotus, and Stichopus chloronotus are prolific bioturbators, capable of processing the upper 3 to 5 mm of all marine sediments available in their habitat at least once per annum [6,7]. Since the volume of sediments ingested and defecated by sea cucumbers is remarkable (9-82 kg per individual per year), their role in maintaining biodiversity, primary productivity, and sediment health could be substantial over long timescales in areas where they are highly abundant [5]. For example, a recent study calculated that Holothuria atra were likely responsible for the bioturbation of more than 64,000 metric tonnes per year at Heron Island Reef in the southern Great Barrier Reef [8]. Therefore, investigating the population dynamics and distribution patterns of common holothurian species is an important step towards quantifying their fishery value and their ecological functions in the ecosystem.
The efficiency of a detection model could be improved by using more advanced hardware, faster DL algorithms, or better training procedures. More powerful hardware can shorten the computing time for both training and detection, but such improvement is beyond the control of ecologists. Training regimes and DL algorithms, on the other hand, can be implemented and optimised by any developer or researcher with programming ability, for instance by changing the input training dataset, tuning the hyperparameters of the learning algorithm, or selecting different evaluation metrics. The size of the training dataset determines the time and labour required to prepare the data (i.e., labelling holothurians in our case). Hyperparameters are the configurations of the learning algorithm itself, set before the learning process starts (e.g., the selection of pre-trained weights and anchor boxes, see Section 2.3.3), which impact the performance of the resulting model [38]. In this study, we selected the third version of YOLO (YOLOv3) due to its widespread use in the literature and industry and its well-established open-source support community. It also offers faster processing with minimal reduction in performance when compared to other object detection models, such as Single-Shot Detector, RetinaNet, and Regions with CNN (R-CNN) [34].
Our work contributes an automatic holothurian detection model using the YOLOv3 architecture and was delivered through the following steps: (1) summarized common evaluation metrics to select the most suitable for assessing holothurian detection models; (2) investigated the minimum training and labelling dataset sizes required to achieve an acceptable detection model; (3) tuned the YOLOv3 hyperparameters to select the optimal detection model; and (4) applied the optimal training model to quantify the density of holothurians at Hideaway Bay reef in North Queensland, Australia.

Study Site
Hideaway Bay (20.072914° S, 148.481359° E) is a mainland attached fringing reef located on Cape Gloucester in the Mackay Whitsunday Region of North Queensland, Australia (Figure 1a). The reef extends up to 350 m offshore and over 3 km alongshore [39]. A recent survey showed that the environmental conditions at monitoring sites in this region are generally characterised by relatively high turbidity and high rates of sedimentation [40], with the reef flat largely dominated by terrigenous sediments [39]. Little is known about the holothurian population in this area, yet easy access and calm weather made it an ideal site for drone imagery data collection.

Data Acquisition
Drone imagery was captured in July 2020 using a DJI Phantom 4 Pro, a multirotor drone suitable for flying slowly at low altitudes and taking off and landing in small spaces. We used the free Drone Deploy mission planning app to create a flight path over the area of interest at 20 m altitude with 75% overlap and 75% sidelap between nadir images, suitable for creating an orthomosaic in future studies. As the orthomosaic process can introduce errors such as double mapping or ghosting when combining overlapping images [41], we considered individual images better suited to our sea cucumber counting application. We therefore selected 63 of the total images, representing only those with no or very little overlap (every fourth photo along a run, and every fourth flightline). The resolution of these images was 4864 × 3648 pixels (px) (FOV = 73.7°, GSD = 0.57 cm) (Figure 1b). The average area covered by one drone image was approximately 423 m² (Figure 1b). Since the clarity of marine drone imagery is subject to turbidity, wave conditions, and light and shade variation, all images were taken at low tide under calm conditions with a low level of turbidity [42] to minimize the training dataset complexity. Generally speaking, taking images in the early morning minimizes sun glint, and a wind speed of less than 5 knots will not create significant ripples or waves that reduce image quality [42]. A total area of 26,662 m² (∼2.7 ha) was surveyed.
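The survey arithmetic above can be sketched in a few lines (values are taken from the text; the variable names are ours). Note that the full ground footprint of a frame at this GSD is larger than the quoted ~423 m², which corresponds to the total surveyed area divided by the 63 selected images, i.e., the unique area each image contributes once overlap is discarded:

```python
# Sketch of the image footprint arithmetic (values from the text above;
# variable names are illustrative, not from the original workflow).
GSD_M = 0.0057            # ground sample distance: 0.57 cm per pixel
WIDTH_PX, HEIGHT_PX = 4864, 3648
N_IMAGES = 63
TOTAL_AREA_M2 = 26662     # total surveyed area (~2.7 ha)

# Full ground footprint of a single frame at 20 m altitude
footprint_m2 = (WIDTH_PX * GSD_M) * (HEIGHT_PX * GSD_M)

# Average unique (non-overlapping) area contributed by each selected image
unique_area_m2 = TOTAL_AREA_M2 / N_IMAGES

print(f"full frame footprint: {footprint_m2:.0f} m^2")    # ~576 m^2
print(f"unique area per image: {unique_area_m2:.0f} m^2") # ~423 m^2
```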

Data Processing
Data were processed through five major steps ( Figure 2): (a) pre-process drone images; (b) use bounding boxes to label holothurians as required by YOLOv3 and prepare different sized training datasets to investigate the influences of dataset size on training results; (c) train and validate models using YOLOv3 deep learning object detection algorithm by tuning zero, one or two hyperparameters (for details see Section 2.3.3); (d) evaluate and determine an optimal holothurian detection model using common object evaluation metrics; and (e) apply the optimal detection model to map the sea cucumber density in the area of interest.

Image Pre-Processing
The 63 drone images were cropped to the default image input size of YOLOv3, 416 × 416 px (Figure 1c). As shown in Figure 3, each drone image was cropped into 108 smaller images (9 rows and 12 columns), giving a total of 6804 cropped images. The cropped images in the last row and column were resized (i.e., padded with black pixels, see Figure 3) to meet the default settings of YOLOv3 input images. This resizing approach preserves the aspect ratio and provides positive sea cucumber information without affecting classification accuracy [43].
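The tiling step can be sketched as follows (a minimal NumPy sketch under our own naming; the original pipeline may differ in implementation detail). Padding the bottom and right edges with zeros reproduces the black-pixel resizing described above:

```python
import numpy as np

def crop_to_tiles(image: np.ndarray, tile: int = 416) -> list:
    """Crop an image into tile x tile patches, padding the last row/column
    with black pixels so every patch matches YOLOv3's default input size.
    (Illustrative sketch; function name is ours.)"""
    h, w = image.shape[:2]
    rows = -(-h // tile)  # ceiling division
    cols = -(-w // tile)
    # Pad bottom and right edges with zeros (black) to a multiple of `tile`
    padded = np.zeros((rows * tile, cols * tile, image.shape[2]), dtype=image.dtype)
    padded[:h, :w] = image
    return [padded[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile]
            for r in range(rows) for c in range(cols)]

# A 4864 x 3648 px drone image yields 9 rows x 12 columns = 108 tiles
drone_image = np.zeros((3648, 4864, 3), dtype=np.uint8)
tiles = crop_to_tiles(drone_image)
print(len(tiles))  # 108
```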

Labelling and Dataset Preparation
Each cropped image was manually examined and each sea cucumber was identified and labelled manually by three trained volunteers using Labelme [44]. In order to maximize the available useful information, sea cucumbers under all conditions (fully exposed on sandy bottom or on coral reefs, partially covered by sediments or rubble, cut off by the edges of the images, etc.) were labelled with a tight rectangular box (Figure 1c,d). The pixel coordinates of the top left and bottom right corners of each box were saved with annotations in a JSON file for each cropped image, which was used as ground truth for later analyses. The cropped and labelled images were first randomly split into two subsets: training and validation (88%) and testing (12%). The testing dataset comprised 804 images that were reserved for ultimate model evaluation and never used during training and validation. The ML training and validation dataset comprised 6000 images. To study the importance of training sample size and identify the optimal number of labelled images required, this subset was randomly sampled into six training sets of 1000, 2000, 3000, 4000, 5000, and 6000 images. Each of the six training datasets was further split into 80% training (800, 1600, 2400, 3200, 4000, and 4800 cropped images, respectively) and 20% validation (200, 400, 600, 800, 1000, and 1200 cropped images, respectively) to facilitate the deep learning training process.
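The partitioning scheme described above can be sketched as follows (an illustrative reimplementation with our own function name and a fixed seed; the original random draws would of course differ):

```python
import random

def make_splits(n_images: int = 6804, n_test: int = 804, seed: int = 0):
    """Sketch of the dataset partitioning: hold out a test set, then draw
    nested training subsets, each split 80% training / 20% validation."""
    rng = random.Random(seed)
    indices = list(range(n_images))
    rng.shuffle(indices)
    test = indices[:n_test]        # 804 images held out for final evaluation
    pool = indices[n_test:]        # 6000 images for training/validation
    splits = {}
    for size in (1000, 2000, 3000, 4000, 5000, 6000):
        subset = rng.sample(pool, size)
        cut = int(size * 0.8)      # 80% training / 20% validation
        splits[size] = (subset[:cut], subset[cut:])
    return test, splits

test, splits = make_splits()
print(len(test), len(splits[5000][0]), len(splits[5000][1]))  # 804 4000 1000
```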

Model Training and Validation
YOLOv3 is an open-source deep learning object detection algorithm with a CNN architecture (Darknet-53) [34] that is often trained with hyperparameter tuning tailored for specific applications. For the purpose of this study we used a high performance computer to implement YOLOv3 [45] with Python 3.6, Keras 2.2.4 [37], and TensorFlow 1.13 [35]. We tuned two hyperparameters before starting the learning process: pre-trained model weights and anchor box size. By definition, pre-trained model weights are used during transfer learning, which refers to the situation of learning in a new setting through the transfer of knowledge from a related setting that has already been learned [46]. Meanwhile, anchor boxes serve as the initial guesses of the bounding boxes for detected objects [47]. Faster progress or improved performance are often expected by adopting such variations. The default settings for these two hyperparameters in YOLOv3 are the anchor boxes and pre-trained model weights obtained from the COCO dataset [45]. In this study, four hyperparameter scenarios were adopted: (A) default anchor boxes and COCO pre-trained weights; (B) modified anchor boxes with COCO pre-trained weights; (C) default anchor boxes with randomly initialised weights; and (D) modified anchor boxes with randomly initialised weights. To modify the anchor boxes, we changed their size and shape using k-means clustering of the labelled bounding boxes in the sea cucumber dataset (scenarios B and D) [34]. To determine the influence of the pre-trained model weights, the COCO derived pre-trained model weights were replaced with random numbers (scenarios C and D). Combining the four hyperparameter tuning scenarios (A-D) and the six different sized training datasets (i.e., 1000-6000 images), there were 24 training variations.
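The anchor box modification can be sketched as a k-means clustering of labelled box widths and heights. Note this is a simplified illustration under our own naming: YOLOv3 itself clusters with an IoU-based distance [34], whereas plain Euclidean distance is used here for brevity:

```python
import numpy as np

def kmeans_anchors(box_wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0):
    """Minimal k-means sketch for deriving YOLOv3 anchor boxes from labelled
    bounding-box (width, height) pairs. Euclidean distance stands in for the
    IoU-based distance used in the original YOLO clustering."""
    rng = np.random.default_rng(seed)
    centres = box_wh[rng.choice(len(box_wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest anchor centre
        d = np.linalg.norm(box_wh[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of its assigned boxes
        new = np.array([box_wh[labels == i].mean(axis=0) if (labels == i).any()
                        else centres[i] for i in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres[np.argsort(centres.prod(axis=1))]  # sort anchors by area

# Example: cluster synthetic box sizes into YOLOv3's 9 anchors
boxes = np.abs(np.random.default_rng(1).normal(40, 10, size=(500, 2)))
anchors = kmeans_anchors(boxes)
print(anchors.shape)  # (9, 2)
```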

Sea Cucumber Detection Evaluation
The detection models were applied to the ultimate unseen testing dataset (804 images) that had not been used in any of the previous training scenarios. Here we used evaluation metrics adapted from those commonly used in the Keras and TensorFlow libraries [48], the 2020 COCO Object Detection challenge [49,50], and the PASCAL VOC Challenge [51]. These include intersection over union (IOU), mean average precision (mAP), precision, recall, and F1 scores, which are calculated based on confusion matrices and confidence scores. A confusion matrix is the combination of ground truth data and detected results from an ML model, whereas the confidence score is a value output by a detection model indicating the certainty of each result (from 0 to 1, i.e., from not confident to very confident) [48]. The object detection evaluation metrics were calculated and interpreted as described in Table 1.
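The IOU metric, on which the true-positive threshold below is based, can be written directly from its definition (a minimal sketch with corner-coordinate boxes, matching the labelling format above):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2),
    as used for the IOU = 0.5 true-positive threshold in this study."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes are disjoint)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1428...
```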
The evaluation metrics measure the effectiveness of the model, and are thus influential in determining model selection according to the users' requirements [48]. For instance, choosing a model with the maximum F1 or mAP score would be the best option if the goal is to achieve a good balance between precision and recall. In other cases, high precision would be preferred if the desired information is the exact location of sea cucumbers, whereas high recall would be preferred if more accurate population counting is needed. To achieve either higher precision or higher recall, the model's training and detection results can be adjusted by modifying the IOU (intersection over union) and confidence score thresholds. In this study, the goal was to produce a density map of sea cucumbers, and both precision and recall values were important. Thus, using the F1 score or mAP, which combine precision and recall, was preferred. In this work, one object class was designated to group all sea cucumber species. In future, multiclass object detection for other taxa or specific sea cucumber species could be investigated by adding separate object classes for each detection target. Thus, the mAP was chosen as the primary criterion since it allows for the addition of more object classes in the future. Since there has been no research recommending an absolute mAP value to determine whether the performance of a model is acceptable, we used the top result in the COCO Detection Leaderboard (mAP = 0.770) as the judging criterion [52].
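The precision/recall/F1 trade-off discussed above follows directly from the confusion-matrix counts. A minimal sketch (the counts in the example are illustrative values chosen to land near this study's reported operating point, not the actual confusion matrix):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of detections that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of true sea cucumbers found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative counts only (not the study's actual confusion matrix):
p, r, f1 = precision_recall_f1(tp=83, fp=18, fn=17)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```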

Mapping Sea Cucumber Density
The output of the detection model was superimposed onto the input image detailing the location and confidence score of the output prediction within the image (Figure 4). The detected results of sea cucumber counts in each cropped image were added together to calculate the number of sea cucumbers present in the complete drone image using the optimal model obtained above. The images were georeferenced according to the geotagged metadata of the drone images and visualised as a sea cucumber density (i.e., number of sea cucumbers/area of the drone image) footprint map in ArcGIS Desktop 10.7 [53].
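The density calculation behind the footprint map reduces to counts divided by per-image area. A sketch (the 423 m² average image area comes from the text; the example counts are illustrative, with the peak chosen to match the maximum density reported in the Results):

```python
def density_map(counts_per_image, image_area_m2: float = 423.0):
    """Per-image sea cucumber density (individuals/m^2) and the survey-wide
    mean, mirroring the footprint-map calculation. Illustrative sketch."""
    densities = [c / image_area_m2 for c in counts_per_image]
    mean_density = sum(counts_per_image) / (image_area_m2 * len(counts_per_image))
    return densities, mean_density

# Illustrative per-image counts (detections summed over each image's 108 tiles)
counts = [0, 100, 605]
d, mean = density_map(counts)
print(f"max={max(d):.2f} individuals/m^2, mean={mean:.2f} individuals/m^2")
```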

Evaluation Metrics Definitions Interpretation and Relevance
Intersection over Union (IOU): IOU = area(A ∩ B)/area(A ∪ B), where A is the area of the detected bounding box and B is the area of the manually labelled bounding box.
By using an IOU threshold of 0.5 to define true positive detections, we required that at least 50% of the bounding box area identified by the ML approach overlapped with the area identified by the human observer. A higher IOU threshold would indicate higher accuracy of the detection location within an image, and thus result in fewer true positive detections.
In this study, a moderate IOU threshold (0.5) was chosen to enable comparison with other object detection challenges (it is used in both the COCO and PASCAL VOC object detection challenges) [49,51], and because the exact location of a sea cucumber individual was not the priority.
Precision values range from 0 for poor precision to 1 for perfect precision. Higher precision means fewer incorrect detections, i.e., less detection of objects that are not sea cucumbers. Recall values range from 0 for poor recall to 1 for perfect recall. Higher recall means fewer missed individuals, i.e., fewer true sea cucumbers left undetected.
This is the harmonic mean of precision and recall. The closer the F1 score is to a value of 1 the better the performance of the model. Instead of choosing either the model with the best precision or the best recall, the highest F1 score balances the two values. It is useful when both high precision and high recall are desired.

Mean average precision (mAP): mAP = (1/N) Σ AP, with AP = (1/n) Σ_r p_interp(r), where N is the number of object classes being detected (in our case, N = 1 since we only detect sea cucumbers), n is the number of recall levels (in ascending order) at which the precision p is first interpolated, and r is recall [51,54].
This metric is similar to the F1 score, but with the benefit that it can measure multiple categories if required.
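The interpolated average precision in the table above can be sketched as follows (an all-point interpolation in the PASCAL VOC style cited above; function name and example curve are ours):

```python
def average_precision(precisions, recalls):
    """All-point interpolated average precision from a precision-recall curve:
    make precision monotonically decreasing in recall, then take the area
    under the resulting curve. Illustrative sketch of the AP definition."""
    # Sort by recall and pad the curve at recall 0 and 1
    pairs = sorted(zip(recalls, precisions))
    r = [0.0] + [x for x, _ in pairs] + [1.0]
    p = [0.0] + [y for _, y in pairs] + [0.0]
    # Interpolate: precision at recall r is the max precision at any recall >= r
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangular areas between successive recall levels
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0
print(average_precision([1.0, 0.5], [0.5, 1.0]))  # 0.75
```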

Results and Discussion
A total of 6804 cropped images were created and a total of 12,956 sea cucumbers were manually labelled. Based on the evaluation, the performance of the detection models was influenced by the size of the training dataset and the hyperparameters used, as described and discussed below.

Model Performance Evaluation
Of the 24 variations tried, the worst performance came from modifying both hyperparameters (Scenario D) with the smallest training dataset (1000 images), which was unable to detect any sea cucumbers, resulting in an mAP value of 0 (Figure 5). The best detection result (mAP = 0.855) was achieved using 6000 cropped training images with no changes to the default hyperparameters (Scenario A). The corresponding optimal confidence score threshold was 0.27, which resulted in 0.82 precision, 0.83 recall, and 0.82 F1 score (Table 2). This indicates that 82% of the sea cucumbers detected were correct and 83% of true sea cucumbers were detected. The details of mAP variation and the associated precision and recall curves are provided in Appendix A Table A1.

Influence of Training Dataset Size
Without considering the impacts of hyperparameter tuning, increasing the training dataset size improved model performance (Figure 5 and Table 2). In scenarios A and B, the mAP value improved only marginally as the training dataset size increased from 1000 images (Scenario A = 0.799, Scenario B = 0.760) to 6000 images (Scenario A = 0.855, Scenario B = 0.838) (i.e., from 10 to 56 uncropped drone images). Yet in scenarios C and D, where the pre-trained model weights were removed, the mAP value increased dramatically as the training dataset size increased (Scenario C from 0.002 to 0.773, Scenario D from 0.000 to 0.750). Moreover, the training dataset size was also the major factor determining the training time needed: each 1000 images contributed approximately one hour of training time. Using the best mAP for the COCO dataset as the judging criterion (i.e., mAP = 0.770) [52], the minimum dataset size required to train an acceptable sea cucumber detection system would be 1000 cropped images (i.e., fewer than 10 drone images) under Scenario A (mAP = 0.799 > 0.770). This number, however, may change under different conditions, including more diverse sea cucumber species present, higher turbidity in the water column, or worse weather conditions.

Influence of Hyperparameter Tuning
Hyperparameter tuning had negative impacts on the detection models, contrary to our original expectation. The average mAP across all training dataset sizes with no tuning of the default hyperparameters (Scenario A) was 0.835 (Table 2). An average mAP of 0.813 was achieved by changing the anchor box size (Scenario B), and an average mAP of 0.545 was achieved by removing the COCO derived pre-trained model weights (Scenario C). Changing both hyperparameters (Scenario D) resulted in the lowest average mAP (0.345). Using the default pre-trained model weights means the model had been optimized by exposure to more than 120,000 labelled images [34,49] before the specific sea cucumber training, which made it better at recognizing patterns, colours, textures, etc. Without it, basic feature recognition was learnt from scratch from the labelled sea cucumber images alone. Therefore, providing more images during training significantly improved the output (Figure 4, scenarios C and D).
Using the default anchor boxes also performed better than using modified anchor boxes, which agrees with the original YOLOv3 paper, which stated that while changing anchor boxes might improve the performance of the model, it could decrease model stability [34]. Hence, keeping the default hyperparameters of YOLOv3 was preferable for our dataset. However, it remains questionable whether using pre-trained model weights will always improve model performance. If the dataset being studied is sufficiently diverse and large, training from scratch could outperform training from pre-trained weights derived from common object datasets.

Comparison to Previous Studies
It is also important to compare the performance between different DL algorithms rather than just focus on YOLOv3 alone. The optimal detection values (IOU = 0.5, confidence score threshold = 0.27, precision = 0.82, recall = 0.83, mAP = 0.855, F1 = 0.82) compare favourably with past ecological studies that utilise machine learning. Kilfoil et al. [16] used a ResNet 50 CNN model to detect sea cucumbers from drone imagery in French Polynesia. They reported similar evaluation metrics (F1 score = 0.68, precision = 0.80, recall = 0.59) at a Minimum Validation Criteria (MVC) threshold of 0.25 [16]. In their study, the MVC is defined as "the minimum acceptable probability that an object is a sea cucumber for it to be counted as such" [16] (the equivalent concept to our confidence score threshold, which was 0.27 for the optimal model). The precision and recall in this study also exceeded those reported by Kilfoil et al. [16], which was expected since the two studies utilise different object detectors (Faster R-CNN vs. YOLOv3) and CNN backbones (ResNet 50 vs. Darknet-53), and YOLOv3 should yield better and faster detection results [33,34]. However, such comparisons across different studies are difficult since these studies often used different evaluation metrics and assessed their models with different confidence thresholds. For instance, Beijbom et al. [55] used Cohen's kappa to evaluate the annotation accuracy of algae and hard corals, which varied from 43% to 96%. Villon et al. [56] reported that underwater fish species detection can reach a bounding box overlap precision above 55% using IOU = 0.5 and T = 98%, where T was defined as a probability threshold. It is therefore impossible to conclude that YOLOv3 is a better detector than Faster R-CNN or other algorithms. The differences could be a consequence of changing the IOU threshold and using different training datasets with different image capture quality, water column variation, and weather conditions.
Other environmental characteristics such as the complexity of the benthic habitat structure, the presence of holothurian-like organisms and coral reef patterns may also hinder or improve the performance of the object detection model. Since reproducibility is a major principle of scientific research, the failure to detail methodology and evaluation metrics in some ecological studies that utilise modern DL approaches becomes a shortcoming. The knowledge gap could be filled in the future by using the same datasets to compare the different CNN models and methodologies. This type of comparison requires researchers to make their datasets openly available to the community. The dataset and source code underlying this paper is made publicly available on GitHub (https://github.com/joanlyq/SeeCucumbers, accessed on 24 March 2021) and GeoNadir (https://data.geonadir.com/project-details/172, accessed on 24 March 2021) for future comparison.

Mapping Sea Cucumber Density
Within the area of each drone image, sea cucumber density ranged from 0 to 1.43 individuals/m² (Figure 6), and the average density across the whole surveyed area was 0.50 individuals/m². Details of sea cucumber density can be found in Table A2. A recent study at Heron Reef in the southern Great Barrier Reef used manually digitised drone images to calculate sea cucumber densities of 0.2 individuals/m² on the shore adjacent, sand dominated inner reef flat and 0.14 individuals/m² at the coral dominated outer reef [8]. While those densities are comparable with our study, it is interesting to note that at Hideaway Bay higher densities of sea cucumbers tended to be found further from shore in areas of higher coral cover (Figure 6). Heron Reef has no terrestrial sediment inputs, whereas Hideaway Bay has a mixed terrigenous and carbonate sediment environment [57]. However, further research and monitoring of sea cucumber populations at these two and other sites is required to understand these trends.

Potential Future Applications
This implementation has demonstrated the potential of using state-of-the-art object detection algorithms with drone RS to quantify holothurian density in shallow reef environments. This method offers many benefits over current techniques by increasing efficiencies in both data capture and information extraction. Traditional survey methods only cover several hundred square meters in a day and track tens of individual sea cucumbers [6,7], whereas the drone flights in this study collected data over an area of 2.7 ha in less than 30 min. The total dataset collection, labelling, and training process in this work took approximately 48 h for the best model, and only eight hours for the minimum acceptable model (using fewer than ten drone images to train with default YOLOv3 hyperparameters, achieving a 0.799 mAP). Similar to previous studies, manually counting and labelling holothurians from drone images was the most time consuming element of the workflow [8]. Using open source DL object detection models could reduce the counting time required for repeat surveys under similar water and other environmental conditions, as the labelling and training process only needs to be done once. It detects and quantifies holothurian counts over broad spatial scales instead of extrapolating from small scale transects. Even though the detection model may require updating as the dataset grows, the additional labelling is usually a small proportion of the full dataset. The model can improve over time with better and larger training datasets across different locations. It also increases the reproducibility of studies and allows data to be reviewed and reanalysed by different experts.
Beyond these immediate improvements in workflows, automated sea cucumber detection from drone images is the first step toward further fruitful outcomes. It offers researchers an entirely new stream of data for object level reef monitoring from aerial images. The detection model can be further applied to other ecological studies focusing on sessile marine invertebrates, such as movement patterns, bioturbation quantification, population dynamics, and preferred habitats. Being able to detect the coordinates of target objects in geo-tagged drone images would allow the development of a faster and more automated locating process for distribution analysis. The density footprint map can be further combined with benthic habitat or bathymetry maps to gain more insight into the factors impacting the distribution of sea cucumbers.
However, the current model is unable to detect holothurians to a species level. Thus, in situ surveys conducted by divers or snorkellers are complementary to RS surveys and crucial to understanding the ecological or biological function of specific species. A better understanding of the physical and physiological characteristics of different holothurian species could help to overcome current shortcomings. Future improvements in the algorithm or the image data platform may also eliminate the negative influence of noise due to water column characteristics and accommodate more diverse environments. This means that the methods and findings contained herein can also be used beyond the realm of the humble sea cucumber and applied to many other benthic features. Finally, the faster and easier acquisition of data will allow for long term monitoring on a larger scale, which will improve the accuracy and efficiency of conservation management.

Conclusions
As people become more aware of the ecological importance of sea cucumbers as well as their economic value, researchers are trying to devise efficient holothurian monitoring methods. There is also an increasing trend towards applying state-of-the-art machine learning technology to ecological studies. Our study not only presented an automatic sea cucumber detection model using drone imagery on coral reef flats, but was also the first to apply a DL model to quantify holothurian population and density over a broad spatial area. Under this workflow, we processed 63 high spatial resolution drone images of Hideaway Bay, Australia, and used YOLOv3 to detect holothurians. Performance was evaluated using common object detection metrics. All data and algorithms are open access and readily available online. In total, 11,462 out of 12,956 individuals were successfully detected, unevenly distributed across a 2.7 ha area. The object detector performed well, achieving an mAP of 0.855, a precision of 0.82, a recall of 0.83, and an F1 score of 0.82. We found that as few as ten labelled drone images were sufficient to train an acceptable detection model (0.799 mAP). Collectively, these results illustrate the potential of using affordable unoccupied aerial vehicles (UAVs, or drones) with direct implementation of open source object detection models to survey and monitor holothurians and other shallow water sessile species, increasing the efficiency, replicability, and area able to be covered.

Acknowledgments:
We would like to thank Todd McNeill for their help in collecting drone imagery, and Jane Williamson, Jordan Dennis, Edward Gladigau, and Holly Muecke for their help in labelling the dataset. We owe deep gratitude to Jonathan Kok, Alex Olsen, Nicolas Younes, Redbird Furgeson, and Raf Rashid for their valuable feedback on the manuscript. We also acknowledge useful assessments and corrections from four anonymous reviewers as well as the journal editors.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
This appendix provides supplementary information for the training and detection results. Table A1. Precision and recall curve summary of all 24 variations. The blue shaded area is equal to the mAP of each variation and the red dot is the precision and recall level obtained from the optimal confidence score threshold.