Repeatable Semantic Reef-Mapping through Photogrammetry and Label-Augmentation

In an endeavor to study natural systems at multiple spatial and taxonomic resolutions, there is an urgent need for automated, high-throughput frameworks that can handle a plethora of information. The coalescence of remote-sensing, computer-vision, and deep-learning ushers in a new era in ecological research. However, in complex systems, such as marine-benthic habitats, key ecological processes still remain enigmatic due to the lack of cross-scale automated approaches (mms to kms) for community structure analysis. We address this gap by working towards scalable and comprehensive photogrammetric surveys, tackling the profound challenges of full semantic segmentation and 3D grid definition. Full semantic segmentation (where every pixel is classified) is extremely labour-intensive and difficult to achieve using manual labeling. We propose using label-augmentation, i.e., propagation of sparse manual labels, to accelerate the task of full segmentation of photomosaics. Photomosaics are synthetic images generated from a projected point-of-view of a 3D model. In the absence of navigation sensors (e.g., with a diver-held camera), it is difficult to repeatably determine the slope-angle of a 3D map. We show this is especially important in complex topographical settings, prevalent in coral-reefs. Specifically, we evaluate our approach on benthic habitats, in three different environments in the challenging underwater domain. Our approach for label-augmentation shows human-level accuracy in full segmentation of photomosaics using labeling as sparse as 0.1%, evaluated on several ecological measures. Moreover, we found that grid definition using a leveler improves the consistency of the community-metrics obtained, which are otherwise affected by occlusions and topology (angle and distance between objects), and that we were able to standardise the 3D transformation with two percent error in size measurements.
By significantly easing the annotation process for full segmentation and standardizing the 3D grid definition we present a semantic mapping methodology enabling change-detection, which is practical, swift, and cost-effective. Our workflow enables repeatable surveys without permanent markers and specialized mapping gear, useful for research and monitoring, and our code is available online. Additionally, we release the Benthos data-set, fully manually labeled photomosaics from three oceanic environments with over 4500 segmented objects useful for research in computer-vision and marine ecology.


Introduction
Accelerations in technologies [1] have empowered ecological studies by facilitating digital representations of natural systems [2], thus reducing uncertainties in predicting their future-state [3]. Advances in computer-vision and remote-sensing enable cross-scale research. In the near future, deep neural networks will help to decipher process-from-pattern as part of automated workflows; preceded by data acquisition from robotic platforms and semantic segmentation of image-based maps [4,5]. Image-based mapping and semantic segmentation are used in an array of ecological studies and applications, ranging from studying vegetation patterns [6][7][8] and city-scapes [9] to farm-management [10]. Specifically, photogrammetry has become a popular approach for benthic research and reef monitoring [11][12][13][14][15][16][17][18][19][20][21]. Structure-From-Motion (SFM) photogrammetry estimates the 3D scene structure and relative motion using subsequent images. It is now possible to view an ecosystem within a digital framework as a continuum across spatial scales, and examine the individuals, populations, and communities that comprise it. Nevertheless, photogrammetry is not yet fully mature as a repeatable method for wide scale ecological surveys. First, the output 3D models and photomosaics need to be labeled rigorously for analysis. This is laborious and requires expert knowledge. Thus, there is an urgent need for automation in the full segmentation task (i.e., labeling each pixel) of photomosaics. Second, a 3D grid needs to be consistently defined for repeated surveys. Without proper data extraction that includes full, pixel-wise classification and labeling, the relevant information remains concealed in the image. Here, we address both issues, providing a more coherent solution for habitat-mapping and underwater photogrammetry. While our methods are applicable to all domains in which photogrammetry is used, here we focus on the benthic environment.
There are increasing efforts for automatic labeling using machine learning [4,6,9,20,22]. However, the commonly used tools [23,24] still provide point classification and not full segmentation. Such sparse sampling overlooks object/patch level information, such as the morpho-metrics (shape and size) of individual organisms, that can be provided by full semantic segmentation. Several methods for segmentation of benthic images and photomosaics have been demonstrated [25][26][27][28][29], including using multi-view images [30] and 3D models [31]. These works that are based on deep learning provide impressive results; however, deep neural networks rely on a high number of learning parameters and, because of that, they need to be trained with a large amount of data to avoid overfitting. Consequently, the main problem for successful automatic identification of marine species is the lack of training data and extensive variability within taxa [32,33], which prevents using labeled data from other locations and predicting labels that were not used in the training data. To overcome this, we propose propagating sparse labels using our Multi-Level Superpixel (MLS) approach [25]. In [25] this method was suggested as a way to quickly generate training data for deep learning semantic segmentation in several terrestrial and underwater domains. Here we show that even by itself it enables obtaining fast full segmentation with minimal human intervention. We test it extensively on photomosaics with respect to ecological measurements and show that it provides very high accuracy. Thus, it can be used as a complementary method for generating dense training data in cases where there are no available trained deep networks, as it is general and not domain specific. Challenges for deep-learning algorithms in underwater imaging include illumination and range, as well as image degradation caused by refraction and wavelength-specific attenuation [34,35].
An orthophoto is generated from a single angle-of-view on the 3D model through the process of orthorectification, where a planimetrically correct image is created by removing the effects of perspective (tilt) and relief (terrain). In an orthophoto, the objects are scaled and located in their true positions (topology), enabling direct measurements of areas and distances [36]. However, in the transition from 3D to 2D (orthorectification) there are six degrees-of-freedom that need to be set. In topographically complex structures, such as coral reefs, exporting different perspectives of the same 3D model affects the occlusions (Figure 1) and map-topology, and introduces artifacts and distortion on non-planar objects with limited input views. Thus, the distance and angle between organisms may differ without consistency in orthorectification. This can be detrimental, for example, in studies regarding neighbor-relations and size-distributions. Most solutions for defining the plane of projection try to define the Z-axis according to depth in the water-column. Usually, permanent markers such as plastic tubes or steel bolts are used for this purpose [15], and their depth and the distance between them need to be measured directly or indirectly [37]. Other means to solve this problem include towed buoys mounted with GPS sensors [38] in shallow water surveys, and positioning with acoustic data [39]. Yet, these solutions are impractical for deep and remote reef habitats such as Mesophotic Coral Ecosystems (MCEs, 30-150 m depth) [40].
To tackle this problem, we define the Z-axis as the depth axis by placing a spirit leveler within the survey plot and using it to transform the 3D model.
Benthic habitat mapping using acoustic and optic sensors encompasses a range of foci and scales, from species distribution models to community mapping and abiotic habitat mapping [41]. Optical imaging can provide much greater detail than acoustic sensors, which have wider scalability. However, benthic habitats are difficult to map due to the complex interactions between physical, chemical, biological, and behavioral elements that comprise them [42]. Here we present a multi-class community mapping scheme for benthic surveys.
The sessile communities that form and inhabit the reef are linked through cross-scale processes. For instance, in scleractinian corals, growth-rates and neighbor interactions occur at very small spatial scales, yet they operate within a much more expansive system, where dispersion is enhanced by predation and extreme weather events [43], and vicariance is reticulate through ocean currents [44]. Accordingly, both the minute and the enormous scales are significant in characterizing the physical and biological features of reef structures. The composition of taxa in space and time has been the focus of many studies in benthic ecology. However, reefs are so intricate (Figure 2) that, in the absence of adequate technology for community-level investigation, the dynamics of sessile organisms remain puzzling. Thus, fundamental questions regarding key ecological processes in the reef have remained largely the same for over five decades [45][46][47][48][49], as a simplified compartmentalization of the benthos is often made for handling complex phenomena. In ecological studies, the scale of investigation depends on the rate of events [50,51]. Benthic organisms have growth rates on the scale of mms to cms per year [52]. Therefore, our investigation necessitates cm-scale change-detection abilities. To assess and validate the change-detection ability of our workflow, we conduct a repeated survey and show that consistent orthorectification enables examining the growth and decay, spatial topology, and presence/absence counts of sessile reef organisms.
Our methodology for automated and repeatable semantic mapping can detect and relocate sessile organisms on the cm-scale across hundreds of metres. Such a tool can assist in constructing a multi-level, cross-scale view of underwater and terrestrial ecosystems, useful for research and monitoring efforts. In this paper, we describe its application on a new data-set that includes manually segmented photomosaics from three different regions: a rocky reef in the Eastern Mediterranean, a coral reef in the Northern Red-Sea, and a coral community in the Eastern Caribbean. We validate our approach through computer-vision metrics as well as relevant ecological metrics.
Our specific contributions are:
• Extensive ecological validation of semantic segmentation through label-augmentation of sparse annotations.
• Validation of 3D grid standardisation with a consumer-grade spirit-leveler.
• The Benthos data-set that includes three segmented photomosaics from different oceanic environments.

Imaging System and Photogrammetric Equipment
A NIKON D850 camera with a 35 mm NIKKOR lens in a Nauticam housing with four INON Z-240 strobes was used (Figure 3b). Photogrammetric targets are objects with distinguishable features and orientation. Our targets included measuring tapes, 0.5 m scale-bars, underwater colour charts (DGK), a spirit leveler, and dive slates with electrical tape markings (Figure 3a).

Plot Setup and Acquisition Protocol
When reaching the target depth, a distinguishable natural or artificial object that is relatively simple to navigate to was selected as a starting point for the survey. From that point, we measured the required transect length (5-30 m) using a measuring tape and marked its surroundings using photogrammetric targets and scale bars. In the orthorectification experiments, we aligned a spirit leveler in the survey plot. The spirit leveler has three bubble indicators (Figure 3a). When it is placed in such a way that the bubbles are centred, the leveler can be used to define a plane-of-projection. Optimally, the leveler was placed in the centre of the plot, and parallel to the transect. The leveler is used to transform the 3D model; thus it is paramount to obtain a good reconstruction of it by acquiring many (>15) images from different angles and distances.
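To make the idea of a leveler-defined plane-of-projection concrete, the following is a minimal sketch, not the transformation used in this study. It assumes that three non-collinear points on the reconstructed leveler surface have been picked in model coordinates, and that the model is already roughly upright so the sign of the plane normal can be resolved. The function names (`plane_normal`, `rotation_to_z`, `rotate`) are hypothetical. The rotation is built with Rodrigues' formula.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three non-collinear leveler points."""
    return _unit(_cross(_sub(p1, p0), _sub(p2, p0)))

def rotation_to_z(normal):
    """3x3 rotation matrix (Rodrigues' formula) mapping `normal` onto +Z."""
    n = _unit(normal)
    if n[2] < 0:  # assumption: model is roughly upright, so pick the upward normal
        n = tuple(-c for c in n)
    axis = _cross(n, (0.0, 0.0, 1.0))
    s = math.sqrt(sum(c * c for c in axis))  # sine of the rotation angle
    c = n[2]                                 # cosine of the rotation angle
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if s < 1e-12:  # normal already aligned with Z
        return eye
    k = tuple(a / s for a in axis)
    K = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    return [[eye[i][j] + s * K[i][j] + (1.0 - c) * K2[i][j] for j in range(3)]
            for i in range(3)]

def rotate(R, p):
    """Apply rotation matrix R to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))
```

Applying `rotate` with the resulting matrix to every vertex of the model would make the leveler plane horizontal, so that an orthophoto exported along the Z-axis uses the same plane-of-projection in every repeat.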
Before each survey, several test images were taken to adjust camera settings: ISO, aperture, shutter speed, and focus. Once the optimal camera settings were reached, the survey was initiated, and the settings were not changed throughout it. Images were acquired at 1 Hz using the camera's interval timer shooting function. The camera was held mainly downward-looking while the diver swam in a lawn-mower (boustrophedonic) pattern, performing close reciprocal passes over the survey plot to ensure overlap between parallel legs.

Study Sites and Data-Sets
We used image-sets from three distinct oceanic environments (Figures 2 and 4). This demonstrates the implementation of our workflow in different ecological zones and the generality of the method (Table 1).

Table 1. The different data-sets used in this study are from three oceanic regions. Some of the data-sets are labeled coarsely (not all pixels have a label) and some are manually segmented (full manual labeling; every pixel has a label). Classification is divided between a genus-specific scheme and a lower-level habitat-mapping scheme (Terrain) with eight classes that represent the terrain type.

Labeling and Classification
We used two manual labeling schemes: coarse labeling (a polygon inside the object covering its centre but not all of its pixels) and full segmentation, and two classification schemes: genus-specific (57 classes) and habitat mapping (eight classes) (Table 1). We used Labelbox, a dedicated tool for computer-vision applications, because of its flexibility, academic pricing benefits, and simple interface. Images were uploaded and labeled with a polygon project setup. In data-set Red Sea 20 (RS20) we used genus-specific classes for scleractinian corals, and other sessile groups at lower taxonomic resolutions. In data-sets Red-Sea (RS), Caribbean (CR), and Mediterranean (MD) we used eight classes in full manual labeling, by the terrain type. This requires less expertise and can be distributed among non-expert labelers such as undergraduate or high-school students, and even external workforces.

Label-Augmentation
This experiment reflects the amount of labeling effort required in order to obtain the highest quality of label-augmentation.

Augmentation from Sparse Annotations
Label-augmentation consists of expanding sparse labels to full segmentation by augmenting the number of labeled samples. We use the method previously developed by us [53], which has since been validated extensively on different types of data including city-scape images for autonomous driving, terrestrial orthophotos, and fluorescent and RGB coral images [25,54] (code available online https://github.com/Shathe/ML-Superpixels (accessed on 19 January 2021)). Here, we examine this method with respect to meaningful ecological measures. We apply label-augmentation on photomosaics (Figure 5), where the input is sparse annotations, and the output is a fully segmented map. A superpixel is a low-level grouping of neighboring pixels. The MLS approach uses superpixels to propagate the sparse labels. It computes several superpixel levels of different sizes and uses the sparse annotations as votes. It consists of applying the superpixel image segmentation iteratively, progressively decreasing the number of superpixels generated in each iteration. In the first iteration, the number of superpixels is very high, leading to very small superpixels that capture small details of the images. The following iterations decrease the number of superpixels, leading to larger superpixels covering unlabeled pixels. Successive iterations do not overwrite information; they only add new labeling information until all pixels are covered. To evaluate the method, we conducted an experiment to estimate how many initial seeds are required to achieve an accurate full segmentation and how different sparsities affect the augmentation performance. As our photomosaics were manually labeled densely, i.e., all the pixels were labeled, we simulate sparse labeling by randomly sampling initial seeds at several sparsity levels (10%, 1%, 0.1%, 0.01%, 0.001%) of the original dense labels, and augmenting them using the same method.
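The multi-level propagation and the seed-sampling experiment described above can be sketched as follows. This is an illustrative simplification and not the released ML-Superpixels code: square grid blocks stand in for a real superpixel algorithm, and the names `sample_seeds`, `block_superpixels`, and `propagate` are hypothetical. What the sketch preserves is the logic described in the text: levels go from many small segments to few large ones, seeds vote within each segment, and later levels never overwrite earlier labels.

```python
import random
from collections import Counter

UNLABELED = -1

def sample_seeds(dense, fraction, rng):
    """Simulate sparse labeling: keep a random fraction of the dense labels."""
    h, w = len(dense), len(dense[0])
    n_seeds = max(1, round(fraction * h * w))
    sparse = [[UNLABELED] * w for _ in range(h)]
    for idx in rng.sample(range(h * w), n_seeds):
        r, c = divmod(idx, w)
        sparse[r][c] = dense[r][c]
    return sparse

def block_superpixels(h, w, size):
    """Stand-in segmentation (assumption): size x size grid blocks, not real superpixels."""
    per_row = -(-w // size)  # ceiling division
    return [[(r // size) * per_row + (c // size) for c in range(w)]
            for r in range(h)]

def propagate(sparse, block_sizes):
    """Propagate seed labels over levels ordered from small to large segments."""
    h, w = len(sparse), len(sparse[0])
    out = [row[:] for row in sparse]
    for size in block_sizes:
        seg = block_superpixels(h, w, size)
        votes = {}
        for r in range(h):
            for c in range(w):
                if sparse[r][c] != UNLABELED:
                    votes.setdefault(seg[r][c], Counter())[sparse[r][c]] += 1
        for r in range(h):
            for c in range(w):
                # only fill pixels that are still unlabeled; never overwrite
                if out[r][c] == UNLABELED and seg[r][c] in votes:
                    out[r][c] = votes[seg[r][c]].most_common(1)[0][0]
    return out
```

For example, `propagate(sample_seeds(dense, 0.001, random.Random(0)), [2, 4, 8, 16])` would simulate a 0.1% sparsity level on a dense label map `dense`. In practice the grid blocks would be replaced by an image-driven superpixel segmentation so that segment boundaries follow object edges.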
These sparse labels simulate the way benthic data-sets are usually labeled for reducing the labeling cost.

Orthorectification
The purpose of this experiment was to simulate repeated surveys without permanent markers or navigation sensors. In this manner, repeated surveys can take place with the aid of natural and artificial references such as distinctive reef features or mooring sinkers. These objects serve as a starting point for the survey, and orthophotos can be registered in post-processing as long as they are consistently orthorectified.

Repeated-Survey Simulation
To estimate the ability of our pipeline to orthorectify using a leveler, and its change-detection sensitivity, we repeated image acquisition two to three times during the same dive, resulting in image sets that constitute technical replicates. Between repeats, the spirit leveler was moved around the scene. The ground-truth photomosaic represents a first temporal repeat or baseline survey, and it was fully manually labeled. The labels were then sparsified (subsampled), and augmented on all replicates, resulting in fully segmented photomosaic replicates. These were compared to the ground-truth mosaic for evaluation (Figure 6e).
In this experiment our replicates are expected to be identical and the negative-control (naïve) is expected to show the highest variance from the ground-truth. To generate the naïve (no leveler) photomosaic, one of the image-sets was exported twice, before and after transformation. We registered the replicate orthophotos using Matlab's manual image registration tool cpselect and 15-20 registration points (Figure 6c).

Evaluation Metrics
In all augmentation experiments, the augmented labels were evaluated against the original manual dense annotations. Several metrics were used to assess the performance of the augmentation, including recall, accuracy (per pixel), and the Intersection over Union (IoU). These metrics are normally used to assess the performance of CNNs in segmentation tasks.
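As a reference for how these metrics relate to each other, here is a minimal sketch using their standard definitions (per-pixel accuracy, per-class recall TP/(TP+FN), and per-class IoU TP/(TP+FP+FN)). The function names are illustrative, and label maps are assumed to be equally sized nested lists of class ids.

```python
def _counts(truth, pred, cls):
    """True positives, false positives and false negatives for one class."""
    tp = fp = fn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    return tp, fp, fn

def pixel_accuracy(truth, pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    pairs = [(t, p) for t_row, p_row in zip(truth, pred)
             for t, p in zip(t_row, p_row)]
    return sum(1 for t, p in pairs if t == p) / len(pairs)

def recall(truth, pred, cls):
    tp, _, fn = _counts(truth, pred, cls)
    return tp / (tp + fn) if tp + fn else 0.0

def iou(truth, pred, cls):
    tp, fp, fn = _counts(truth, pred, cls)
    return tp / (tp + fp + fn) if tp + fp + fn else 0.0

def mean_iou(truth, pred):
    """Average IoU over the classes present in the ground truth."""
    classes = sorted({t for row in truth for t in row})
    return sum(iou(truth, pred, c) for c in classes) / len(classes)
```

Because `mean_iou` weights every class equally while `pixel_accuracy` weights every pixel equally, a map dominated by one large class can score high accuracy and a noticeably lower mean IoU.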

Community-Metrics Comparisons
We developed Matlab code for community data extraction. All objects smaller than 0.0002 m² were excluded from the analysis because they arise from noise in the segmentations.
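The extraction step can be illustrated with a short Python sketch (the study itself used Matlab, and this is not that code). It assumes that an object is a 4-connected region of same-class pixels and that the area of one pixel in m² is known; both the connectivity choice and the `pixel_area_m2` parameter are assumptions made for illustration.

```python
from collections import deque

def extract_objects(label_map, pixel_area_m2, min_area_m2=0.0002):
    """Return (class_id, area_m2) for each 4-connected region at or above the size threshold."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for r0 in range(h):
        for c0 in range(w):
            if seen[r0][c0]:
                continue
            cls = label_map[r0][c0]
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            n_pixels = 0
            while queue:  # breadth-first flood fill over same-class neighbours
                r, c = queue.popleft()
                n_pixels += 1
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < h and 0 <= cc < w and not seen[rr][cc]
                            and label_map[rr][cc] == cls):
                        seen[rr][cc] = True
                        queue.append((rr, cc))
            area = n_pixels * pixel_area_m2
            if area >= min_area_m2:  # drop small regions treated as segmentation noise
                objects.append((cls, area))
    return objects
```

The resulting list of per-object classes and areas is the kind of input from which counts, relative areas, and size-frequency distributions can be derived.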

• Class-specific size-frequency distributions. We divided the classes into nine bins, starting from 0.0002 m² to 0.045 m² with a step size of 0.005 m². We used the χ² distance to assess the similarity of class size distributions between maps. Low values indicate high similarity between sets of data, where zero is the maximal similarity.

Size in m² = (∑ Pixels / 4) × 10⁻⁶ (5)
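The size-frequency comparison can be sketched as below. Two points are assumptions rather than details given in the text: objects larger than the last bin edge are counted in the last bin, and the χ² distance is taken in its common symmetric form, 0.5 × ∑ (pᵢ − qᵢ)² / (pᵢ + qᵢ), on normalised histograms. Function names are illustrative.

```python
def size_histogram(areas_m2, start=0.0002, step=0.005, n_bins=9):
    """Count object areas in n_bins equal-width bins beginning at `start`."""
    counts = [0] * n_bins
    for area in areas_m2:
        if area < start:
            continue  # below the minimum object size
        idx = min(int((area - start) / step), n_bins - 1)  # assumption: clamp large objects
        counts[idx] += 1
    return counts

def chi2_distance(hist_a, hist_b):
    """Symmetric chi-square distance between two normalised histograms."""
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    p = [a / sum_a if sum_a else 0.0 for a in hist_a]
    q = [b / sum_b if sum_b else 0.0 for b in hist_b]
    return 0.5 * sum((x - y) ** 2 / (x + y) for x, y in zip(p, q) if x + y > 0)
```

With this definition the distance is 0 for identical distributions and 1 for distributions with no overlapping bins, which matches the reading that lower values indicate higher similarity.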

Label-Augmentation
In this experiment we used fully manually labeled images to test the MLS augmentation approach from sparse seeds on wide-scale data: photomosaics from different reef environments. We used data-sets RS, MD, and CR. The label-augmentation experiment shows that augmented labeling and dense manual annotations provide very similar ecological outputs. Figure 7 depicts both the per-pixel accuracy and the IoU for all sparsity levels. The per-pixel metric (accuracy) is higher because the background classes (Sand, Rock) cover a larger area; that is, per-pixel metrics are biased towards dominant classes. The per-class metric (IoU) is lower because it gives more weight to small biotic classes, which are harder to propagate. Even so, these metrics show that we can properly propagate sparse labels even for small classes. For both metrics, a sparsity of 0.1% is optimal, as investing time in annotating beyond this level does not provide a substantial gain in augmentation quality in terms of either accuracy or IoU. The χ² distance was calculated between the class-specific size-frequency distributions (e.g., Figure 8 left), and averaged over eight classes in each mosaic replicate and its manual labels (ground-truth). The distance values are low and there is no significant decrease in distance above sparsity 0.1%. In data-set CR, the trend is even slightly reversed (Figure 8 right). In relative area (Figure 9 left), the error values are low and there is no substantial decrease in error above sparsity 0.1%. The error for sparsities 0.1-10% ranges from two to ten percent. In relative amount (Figure 9 right), the errors decrease above sparsity 0.1% in the RS and CR data-sets, but not in the MD data-set. The error for sparsities 0.1-10% is low and ranges from eight to 20 percent.
The label-augmentation experiment shows that the segmentation improves with denser seeds, most significantly up to 0.1% sparsity; annotations denser than 0.1% do not provide a substantial increase in accuracy. The effect of sparsity on the error in community metrics differed slightly between data-sets. Notably, the RS data-set was most affected by percent sparsity since it is the most topographically complex reef, with clearer trends of decrease in error with denser seed-labels. These results mean that to obtain a reliable segmentation, at least 0.1% of the pixels need to be labeled. Label-augmentation is an important contribution as an efficient sparse-to-dense approach for image-segmentation, and it alleviates the effort of generating training data for deep learning applications. Ideally, labels are augmented from point annotations of each object in the image, because the augmentation propagates seed labels. Therefore, an object that is not labeled will not show on the augmented image, and will be overridden by neighboring labels. In labeling, even humans fail to accurately label objects along the edges, and our accuracy in augmentation was affected by the segment edges, where most of the errors occur. Figure 10 shows the manual and augmented labels of the RS data-set, with close-up views on single coral colonies.

(Bottom) Average of χ² distance between the size-frequency distributions of eight classes. Each photomosaic was compared to its ground-truth: the fully manually labeled photomosaic. The distance values are low and there is no significant decrease in the distance from sparsity 0.1%. In data-set CR, the trend is even slightly reversed.

Figure 9. Percent average error in relative area and relative amount over eight classes in the different data-sets. In relative area (left), the error values are low and there is no significant decrease in error above sparsity 0.1%. The error for sparsities 0.1-10% ranges from two to ten percent. In relative amount (right), there is a decrease in error above sparsity 0.1% in the RS and CR data-sets, but not in the MD data-set. The error for sparsities 0.1-10% ranges from eight to 20 percent.

Orthorectification
The orthorectified photomosaics were superior to the naïve photomosaics in accuracy and IoU throughout all data-sets (Figure 11). The χ² distance was calculated between the class-specific size-frequency distributions (e.g., Figure 12 left), and averaged over eight classes in each mosaic replicate and its augmentation from sparsity 0.1% (ground-truth). In the orthorectification experiment (Figure 12 right), each photomosaic was compared to its ground-truth (augmented labels from a sparsity of 0.1%), and orthorectification using a leveler reduces the distance relative to the naïve photomosaics. The distance for photomosaics that have been orthorectified ranges from three to four percent, and for the naïve photomosaics from five to eight percent, with a similar effect in both data-sets.
In relative area (Figure 13 left) the orthorectification increases accuracy. Error was calculated as the ratio between each replicate and the ground-truth mosaic (multiplied by 100). The error for photomosaics that have been orthorectified ranges from two to five percent, and for the naïve photomosaics from four to seven percent, with a similar effect in both data-sets. In relative amount (Figure 13 right) the orthorectification also increases the accuracy. The error for photomosaics that have been orthorectified ranges from eight to 11 percent, and for the naïve photomosaics from 13 to 19 percent, with a stronger effect in the RS data-set. The orthorectification experiment shows that for both data-sets the error in size-frequency distribution, as well as in relative size and amount, decreased when the maps were orthorectified. This is a clear trend in which the naïve orthomosaics are inferior in repeatability to the orthorectified replicates. We found that the effect of orthorectification on the augmentation results was stronger on topographically complex reefs where more occlusions occur. Furthermore, despite small differences in accuracy (Figure 11), the presence/absence data as well as the topology of the maps differ significantly (Figure 10). Such artifacts generated by inconsistent orthorectification are inimical for studies interested in tracking single coral colonies over time.

Figure 13. Percent average error in total amount of individuals over eight classes in the different data-sets. In relative area (left), the error for photomosaics that have been orthorectified ranges from two to five percent, and for the naïve photomosaics from four to seven percent, with a similar effect in both data-sets. In relative amount (right), the error for photomosaics that have been orthorectified ranges from eight to 11 percent, and for the naïve photomosaics from 13 to 19 percent, with a stronger effect in the RS data-set.
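One possible reading of the error computation described above is sketched here. It assumes that the per-class value is first converted to a share of the map total (relative area or relative amount), that the error is the absolute deviation of the replicate-to-ground-truth ratio from 1, expressed in percent, and that the result is averaged over the ground-truth classes. These choices and the function names are assumptions for illustration, not a statement of the exact procedure used in the study.

```python
def relative_shares(totals):
    """Convert per-class totals (areas or counts) into fractions of the overall total."""
    overall = sum(totals.values())
    return {cls: value / overall for cls, value in totals.items()}

def mean_percent_error(replicate_totals, truth_totals):
    """Average over classes of |replicate share / ground-truth share - 1| x 100."""
    rep = relative_shares(replicate_totals)
    gt = relative_shares(truth_totals)
    errors = [abs(rep.get(cls, 0.0) / share - 1.0) * 100.0
              for cls, share in gt.items() if share > 0]
    return sum(errors) / len(errors)
```

The same function can be fed per-class areas (for relative area) or per-class object counts (for relative amount), for example `mean_percent_error({"coral": 45, "sand": 55}, {"coral": 50, "sand": 50})`.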

Discussion
We presented a methodology for cm-scale change-detection, solving two main issues: swift data-extraction and consistent 3D grid definition. We used the MLS approach for label-augmentation on photomosaics, and a 3D transformation of the map using a spirit leveler placed in the scene. We used image-sets from three distinct oceanic environments, implementing our workflow as a general and robust method in different ecological zones ( Figure 4, Table 1). The results support our method as a practical, rapid, and cost-effective solution that can be applied at the reef-scape scale with colony level resolution.
Labeling the RS and CR data-sets was difficult due to the large number of small segments. The CR data-set also had a large number of non-rigid corals and strong surge currents during image acquisition, resulting in reconstruction artifacts. In all data-sets, we found that it is important to label the objects close to the edges.
Label-augmentation is useful because it opens the possibility of augmenting previously labeled data with minimal adjustments. In that case, the detection ability in augmenting labels from sparse annotations will be at the resolution of the spacing between labels. Although labeling at 0.1% sparsity might take longer than traditional random point annotation, we showed that it yields significantly more data and is therefore worth the added time. In addition, labeling the orthophotos using polygons requires fewer than 10 clicks per object and yields more than 0.1% sparsity.
In the orthorectification experiment we tested whether 3D grid definition using a spirit leveler is superior to a non-intervention, naïve approach, and provides stable results.
Before deciding on the leveler method, we also tried to use a line and a float with weights for finding the Z-axis, but it does not reconstruct well in the model and is susceptible to currents. Alternatively, we considered using corner-shaped aluminum bars linked to permanent markers, which are intrusive and can be inconsistent due to subtle movements of the seabed. We also tried measuring the depth at the corners and centre of the plot, which proved to be time-consuming and less practical.
To conclude, slope angle and topographical complexity are the key factors that increase the necessity of this kind of approach for consistent 3D grid-definition. A two to five percent error in the orthorectification experiment translates into a cm change-detection threshold in real-world applications such as coral reef monitoring.
Full consistent semantic mapping brings forth unprecedented level-of-detail that will pave the way for a new generation of highly detailed ecological studies. Yielding extensive community metrics automatically is one of the main motivations of photogrammetric surveys. Here, we streamline and enhance the volume of information extracted automatically from photomosaics.
We successfully detected and classified organisms as small as 2 cm². However, orthorectification generates artifacts such as blur and holes, which also affect size/area measurements. We designed our workflow to be robust and adaptable rather than domain-specific, which is often the case with deep-learning and limited training-data. As our analysis shows, the commonly used accuracy and IoU evaluation metrics often do not tell the whole story, as ecological metrics provide a more comprehensive evaluation. Therefore, our evaluation criteria are useful for testing other image segmentation approaches as well.
Although this workflow is customized for an underwater setting, it is widely transferable. Previously, label-augmentation through the MLS approach was established to facilitate training semantic segmentation [25,53,54] and was demonstrated on multiple domains, such as multi-modal images of corals and terrestrial orthophotos, with similar accuracy, showing the generality and the wide range of problems that can be handled with this approach. Thus, this work is significant for all ecologists who wish to use photogrammetry in their research. Object Based Image Analysis (OBIA) has been used extensively in remote sensing [55] and benthic habitat mapping [56][57][58][59]. As long as the object of interest is larger than the pixel size, it can be delineated as a group of pixels. Here we use an adaptive superpixel approach, MLS, which can also be considered an object-based image analysis scheme. At the cm-scale resolutions of benthic organisms, we require high-resolution data to employ such object-oriented image analysis schemes, which can be provided by photogrammetric surveys with sub-cm resolution across tens of metres.
Shortcomings and remaining challenges for robust automatic surveys include the lack of specific algorithms for underwater photogrammetry that take into consideration non-rigid organisms such as soft-corals in surge currents, and the analysis of 3D photogrammetric outputs [31,60,61]. Many studies have used 2D maps (photomosaics) instead of 3D data (point-clouds or surface-meshes), which are also generated in the photogrammetric process. This reduction is made to simplify the technical aspects of data-analysis (labeling in 3D) in the absence of adequate software and workflows. Relative abundance (amount) and relative area (Figures 9 and 13) metrics are important for ecological studies because they reflect diversity and evenness measures as well as the well-being of the reef. Furthermore, since object-level separation is still ambiguous in photomosaic analysis (due to occlusions and angle-of-view), area measurements are more reliable than individual counts and allow estimating the percent live-cover. A main drawback of photomosaics is that even with consistent orthorectification there are occlusions and size distortions (artifacts) of non-planar objects that make measuring individual reef organisms challenging. An inherent limitation of orthophotos is that they fail to depict crevices, overhangs, and other non-planar reef formations. Therefore, some organisms are not represented proportionally in ecological estimations based on orthophotos. Nevertheless, they are still the prevalent tool for such estimates because of their advantages: scalability and resolution. Future studies should focus on deriving community metrics of benthic habitats using the full suite of visual information in photogrammetric surveys, linking the high-resolution information contained in the input images for the Structure From Motion (SFM) process across a wide-scale 3D map.
One of the biggest challenges in achieving a taxonomic segmentation of the seabed is the intricacy of the benthos. Sessile invertebrates often do not have clear boundaries and display overgrowth patterns that are difficult to classify. In such cases, even manual segmentation would not be accurate. Dedicated algorithms for underwater image enhancement [62] might improve edge detection and enable better segmentation. Moreover, significant work remains to be done on the accurate placement of underwater photomosaics on GPS coordinates [37,63]. This would benefit reef ecology because benthic mapping would become easily repeatable between teams as well as more precisely comparable across geographic grids. Such georeferencing is normally done using permanent markers or navigation sensors such as GPS buoys, which are only effective for shallow reefs. Diver-held navigation tools such as underwater tablets (e.g., http://allecoproducts.fi/about/, https://uwis.fi/en/ (accessed on 19 January 2021)) are emerging, and will complement photogrammetric surveys in the near future.
Classification of the benthos is one of the most important aspects of marine research and conservation, and underwater photogrammetry can supplement other modalities in benthic habitat mapping. For example, [64] combined acoustic and visual sensors to produce a wide-scale bathymetry coupled with high-resolution photomosaics from video. Furthermore, multibeam echosounders are becoming popular for indicating seabed substrate type. However, they still require calibration across sites and devices [65]. Spectral features obtained in acoustic surveys have also been shown to be a predictor of terrain type in acoustic habitat mapping. However, these also require ground truthing [66], which can be done using photomosaics. Adaptive workflows that combine acoustic and visual sensors will enable complex navigation tasks with multimodal data. For example, an acoustic survey can find points of interest, followed by a close-up visual survey. This will benefit not only ecological surveys and habitat depiction but also the development of new algorithms for multi-modal Simultaneous Localization and Mapping (SLAM).

Conclusions
We presented our work on benthic mapping and accelerated segmentation through photogrammetry and Multi-Level Superpixels, and showed the accuracy of repeated surveys using orthorectification and sparse label augmentation. We included objects as small as 2 cm² and showed that our method provides fast and reliable segmentation across scales. This approach is appropriate for anyone interested in using photogrammetry for ecological surveys, especially diver-based underwater surveys (e.g., transects, reef plots).
Photogrammetry is gaining traction among marine ecologists, and map-models of the benthos have outstanding resolution and scaling abilities. Diver-based photogrammetry can be conducted without extensive expertise or specialized equipment and is becoming a key tool in the benthic ecology toolbox. With meaningful annotations, photomosaics of the reef can capture the size, shape, and location of hundreds of individual reef organisms. Thus, the bottleneck in ecological studies is shifting from acquisition towards analysis. With advances in acquisition and computer-processing abilities, it is of great importance to explore new ways of data extraction, and the automation of classification and labeling needs to be integrated into marine surveys. Label-augmentation enables substantial time savings for complete scene understanding and measurements at the individual-to-population level, such as size-frequency distribution and relative abundance. However, to compare complementing maps over time, the 3D grid definition needs to be standardized, especially on topographically complex reefs.
We conducted repeated surveys of the same reef plot at minimal intervals of a few minutes, assuming no actual change in the terrain within this time-frame. Following this assumption, we expected the photomosaic replicates to contain identical ecological information.
However, comparing the similarity of orthophotos is not straightforward, due to differences such as color and artifacts (blur/holes) caused, for example, by slight differences in imaging distance and angle. We therefore compared the maps indirectly through label-augmentation. Comparing the segmented maps generated by augmenting a single set of sparse labels (from the original image) tests all steps of the workflow together, and includes noise from the photogrammetric (different input images) and orthorectification processes. Thus, it simulates an observer effect and a noisy real-world situation.
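As a minimal sketch of such an indirect comparison, two segmented replicates can be reduced to their class-cover compositions and compared with a dissimilarity index. The Bray-Curtis index used here is an illustrative choice of a common ecological measure and is not claimed to be the measure used in this work. The function names and toy replicates are hypothetical. Comparing compositions rather than pixels has the practical benefit that the replicates do not need identical pixel dimensions or perfect alignment.

```python
def class_cover(label_map):
    """Relative cover per class in a 2D grid of class names."""
    counts = {}
    for row in label_map:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def bray_curtis(cover_a, cover_b):
    """Bray-Curtis dissimilarity: 0 for identical compositions, 1 for disjoint."""
    labels = set(cover_a) | set(cover_b)
    diff = sum(abs(cover_a.get(k, 0.0) - cover_b.get(k, 0.0)) for k in labels)
    total = sum(cover_a.get(k, 0.0) + cover_b.get(k, 0.0) for k in labels)
    return diff / total if total else 0.0


# Hypothetical segmented replicates of the same plot (different pixel grids).
replicate_1 = [
    ["coral", "coral"],
    ["sand", "sand"],
]
replicate_2 = [
    ["coral", "sand", "sand"],
    ["coral", "sand", "algae"],
]

dissimilarity = bray_curtis(class_cover(replicate_1), class_cover(replicate_2))
print(f"Bray-Curtis dissimilarity between replicates: {dissimilarity:.3f}")
```

Under the no-change assumption, any dissimilarity above zero between replicates is attributable to noise in the workflow, which gives a rough lower bound on the change that can be detected between true time points.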
When applying this workflow in any setting, the most important factors to consider are the classification level (taxonomic/functional specificity), as it determines the level of expert knowledge required as well as the accuracy of automatic identification, and the expected change-detection ability, which is governed by the effective resolution and signal-to-noise ratio. Furthermore, it is important to consider the effect of the slope of the reef, in the sense that the top-down view is not always perpendicular to the reef-table.
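The slope effect can be illustrated with a simple geometric idealization: a flat patch tilted by an angle θ relative to the projection plane appears in a top-down view with its area scaled by cos θ. The sketch below assumes a perfectly planar patch and hypothetical function names. It is not the grid-definition procedure used in this work, and real reef surfaces will deviate from it.

```python
import math


def projected_area(true_area, slope_deg):
    """Top-down projected area of a flat patch tilted by slope_deg degrees."""
    return true_area * math.cos(math.radians(slope_deg))


def area_underestimate(slope_deg):
    """Fraction of the true area lost in a top-down projection."""
    return 1.0 - math.cos(math.radians(slope_deg))


# Hypothetical 100 cm^2 colony on increasingly steep reef slopes.
for slope in (0, 15, 30, 45, 60):
    area = projected_area(100.0, slope)
    loss = area_underestimate(slope)
    print(f"slope {slope:2d} deg: projected {area:6.1f} cm^2 ({loss:.0%} lost)")
```

Because the loss grows nonlinearly with the angle, small inconsistencies in the projection plane matter little on flat terrain but can noticeably change size measurements on steep slopes.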
At the moment, there are several tools for image segmentation with weak human interference [67]. Moreover, new tools with a promising outlook for benthic photomosaic segmentation tasks will soon be released [68].