Machine Learning for the Fast and Accurate Assessment of Fitness in Coral Early Life History

As coral reefs continue to degrade globally due to climate change, considerable effort and investment are being put into coral restoration. The production of coral offspring via asexual and sexual reproduction is one of the proposed tools for restoring coral populations and will need to be delivered at scale. Simple, inexpensive, and high-throughput methods are therefore needed for the rapid analysis of thousands of coral offspring. Here we develop a machine learning pipeline to rapidly and accurately measure three key indicators of coral juvenile fitness: survival, size, and color. Using machine learning, we classify pixels through an open-source, user-friendly interface to quickly identify and measure coral juveniles on two substrates (field-deployed terracotta tiles and experimental, laboratory PVC plastic slides). The method's ease of use and ability to be trained quickly and accurately using small training sets make it suitable for images of sexually produced coral species without existing datasets. Our results show higher survival accuracy for slides (94.6% accuracy with five training images) than for field tiles measured over multiple months (March: 77.5%, June: 91.3%, October: 97.9% accuracy with 100 training images). When using fewer training images, the accuracy of area measurements was also higher on slides (7.7% average size difference) than on tiles (24.2% average size difference for October images). The pipeline was 36× faster than manual measurements. The slide images required fewer training images than the tiles, and we provide cut-off guidelines for training on both substrates. These results highlight the importance of high-throughput methods, substrate choice, image quality, and the number of training images for measurement accuracy.
This study demonstrates the utility of machine learning tools for scalable ecological studies and conservation practices to facilitate rapid management decisions for reef protection.


Introduction
The continued increase in sea surface temperatures due to climate change has been the major driver in the loss of up to 50% of the world's coral reefs [1][2][3][4][5][6]. The occurrence of mass bleaching and mortality events, in which corals lose their symbiotic dinoflagellates (Symbiodiniaceae) en masse, has become more frequent [7]. Even relatively healthy ecosystems, such as the Great Barrier Reef (GBR), suffered back-to-back bleaching events in 2016 and 2017 [2,6]. An increase in the frequency of these mass bleaching events impedes a reef's ability to recover to previous levels of coral cover before the next disturbance event [8]. Coral reefs show some potential to endure anthropogenic impacts through rapid acclimation and adaptation [9][10][11]. However, some reef restoration may be needed to maintain resilience whilst global efforts to reduce warming are implemented. As a result, there has been a rapid increase in investment in intervention and restoration initiatives focused on improving coral survival [12][13][14], especially in the large-scale production of coral offspring; the scale of data generated by such efforts is comparable to the harnessing of "big data" previously experienced during the molecular biology sequencing revolution [53].
Machine learning, compared to deep learning algorithms, may be considered a more reproducible workflow when applied to new datasets. This is, in part, due to its robustness when encountering different backgrounds and its high replicability (see damage estimations; [54]). In contrast, deep neural networks are more complex and can face high variability in their optimisation, hyperparameters, and architecture [55]. We therefore propose that this ML method is more suitable for assessing relatively simple coral juvenile fitness traits. It should be noted that true deep learning, although substantially more complex and time-consuming relative to machine learning pixel classification, is considerably more powerful [56]. Deep learning should therefore be considered over pixel classification for more complex, field-based classifications if the size of the dataset and the desired accuracy of the outputs outweigh the effort of training. Pixel classification using random forests was chosen for this pipeline over other supervised ML techniques due to its ability to more accurately quantify the size of objects and its lower variability under different backgrounds and lighting [54]. Finally, deep learning often depends on large amounts of training data, whereas coral juveniles are often available only in low numbers and lack homogeneous structures, textures, and shapes. Training deep learning models for each time point of these experiments is therefore not feasible.
Here we present a novel ML pipeline for the high-throughput acquisition of coral juvenile data from images that is accessible to coral restoration practitioners with little training (Figure 1). This tool uses the open-source pixel classification software Ilastik and a custom-made script for Fiji ImageJ to rapidly and accurately measure three coral traits: survival, growth, and color. We use both field- and laboratory-based images of small coral juveniles (less than one year old) to compare the performance of the three fitness measurements between the ML pipeline and human "ground-truthed" measurements. We also assess the trade-off between training time and the accuracy of the model's measurements, and provide guidelines for practitioners on image acquisition, processing, and quality, substrate type, and the effect of the number of training images on the success of ML image analysis.

Datasets and Manual Measurements of Juveniles Using ImageJ
The machine learning (ML) image analysis pipeline was assessed using coral juveniles settled onto two types of substrates: (1) field-deployed terracotta tiles (hereafter 'tile images') and (2) laboratory-maintained PVC plastic slides (hereafter 'slide images'). Both substrates held coral juveniles of the species Acropora tenuis within the same age range (the first year of life). For the two coral juvenile experiments used here, there were thousands of tiles and slides from several time points. Specifically, there were 1200 tile images from three different time points in the field, and 900 slides with images taken at 12 time points in the laboratory. Terracotta tile images were obtained from a year-long field study conducted at Davies Reef on the Great Barrier Reef (site description and map can be found in Quigley et al. [19]). Images were taken at three time points using a Nikon D810 camera body and Nikon AF-S 60 mm micro lens (image resolution: 7360 × 4912 pixels) and an Olympus Tough TG-5 camera with Ikelite DS160 strobes (image resolution: 4000 × 3000 pixels). Plastic PVC slide images with settled juveniles were taken with a Nikon D810 camera body and Nikon AF-S 60 mm micro lens (image resolution: 7360 × 4912 pixels), as per the first time point of the tile field-based study. For both sets of images, "ground-truthed" juvenile measurements were taken manually. The area of each individual juvenile was measured using the polygon selection tool in the ImageJ software [57]. Size was calibrated using the scale bar present in each image. Color of juveniles was assessed using the CoralWatch Health Chart [18] and was matched to the closest score on the "D" scale by a single person to minimise observer bias. Survival of juveniles was classified by eye as either alive or dead.
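The scale-bar calibration described above is a simple unit conversion: a linear mm-per-pixel factor derived from the scale bar, squared for areas. A minimal sketch (the scale-bar length and pixel counts below are hypothetical values for illustration, not measurements from this study):

```python
def calibrate_area(area_px: float, scale_bar_px: float, scale_bar_mm: float) -> float:
    """Convert an area in pixels to mm^2 using a scale bar of known length.

    mm_per_px is a linear factor, so area scales with its square.
    """
    mm_per_px = scale_bar_mm / scale_bar_px
    return area_px * mm_per_px ** 2

# Hypothetical example: a 10 mm scale bar spans 500 px in the image,
# and a juvenile's polygon selection covers 12,500 px.
area_mm2 = calibrate_area(12_500, scale_bar_px=500, scale_bar_mm=10)
print(round(area_mm2, 2))  # 12500 px * (0.02 mm/px)^2 -> 5.0
```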

Training Image Production
For the tile images, the initial training set consisted of images of five alive juveniles, five dead juveniles, five dying juveniles, and five images of miscellaneous substrate and debris, including crustose coralline algae, snails, and macroalgae. Additional training images were produced by dividing the complete tile images into multiple images. Training images including coral juveniles were grouped and chosen at random. The slides in the slide images feature four wells containing settled coral juveniles. The number of juveniles varied between zero and six per well. In this analysis, the number of slides was taken into consideration, rather than the number of single juveniles, due to the gregarious settlement of juveniles and the formation of groups of settled juveniles. The initial training set for the slide images consisted of five images of wells containing alive juveniles and five images of different parts of the slides, including the number engraved on the slide for unique slide identification and the spaces between the slides in each slide rack. Additional training images included one image of a set of five slides; each of these was counted as five training images (see Supplementary Figure S1A). All images of juveniles from the tiles were first cropped and batch processed. Cropping images into sets was found to be more accurate than using the whole tile image, owing to the large number of false-positive occurrences from the complex and variable substrate and the lower resolution of the images.

Measurements of Juveniles Using the Ilastik Pipeline
Ilastik [50,58] is an open-source pixel classification tool used to produce simple semantic segmentations of images (Figure 2). This tool performs classification using a random forest classifier with 100 trees, chosen for its good 'generalisation' performance, robustness with small sample sizes, and reproducibility of results [59][60][61]. Pixel features and their scales are chosen to discriminate between the different classes of pixels. Ilastik gives the option to choose features that use color/intensity, edge (brightness or color gradient), and texture to discern objects. Each feature can be selected at different scales, corresponding to the sigma of the Gaussian filter, and classifier predictions are evaluated against the user's annotations. Filters with higher sigma values pull information from larger neighbourhoods. Initially, all features were selected to give the model the highest power, as recommended [58]. Through project testing, the 'suggested features' tool within Ilastik was used to select the features most effective at discriminating between pixels in different classes. After comparing project sensitivity ('all features' sensitivity = 0.95, 'suggested features' sensitivity = 0.93) and the average percentage difference of area ('all features' area difference = 7.3%, 'suggested features' area difference = 8.6%), we chose to use all features. This is recommended by the Ilastik developers [50] and demonstrates high accuracy [51]. Sensitivity is defined as the true positive rate, TP/(TP + FN) (see Supplementary Figure S2 and Supplementary Information S1). Exported segmentation masks were used to compute the coral health parameters using custom, open-source code (https://github.com/LaserKate/Coral_Ilastik_Pipeline; accessed on 7 August 2021) in Fiji ImageJ (version 1.53h). Outputs from this code include the survival count of juveniles per image, juvenile area, and their Hue, Saturation, and Lightness (HSL) values.
This data was then exported into a 'csv' file, which can be directly uploaded into other open-source software such as R (See Supplementary Figure S2 and Supplementary Information S1).
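The mask-to-measurements step can be sketched in Python. This is a simplified stand-in for illustration, not the study's Fiji ImageJ script: given a binary segmentation mask and the matching image, it labels connected regions, then reports a survival count, per-juvenile pixel area, and mean HSL color per region.

```python
import colorsys
import numpy as np

def measure_juveniles(mask: np.ndarray, rgb: np.ndarray):
    """Count connected 'juvenile' regions in a binary mask and report,
    per region, its pixel area and the mean HSL of the underlying image.

    mask: 2-D bool array (True = juvenile pixel), e.g. from a simple
          segmentation mask exported by a pixel classifier.
    rgb:  H x W x 3 float array with channels in [0, 1].
    """
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    results = []
    current = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                # New region: flood-fill its 4-connected neighbourhood.
                current += 1
                labels[i, j] = current
                stack, pixels = [(i, j)], []
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            stack.append((ny, nx))
                mean_rgb = rgb[tuple(zip(*pixels))].mean(axis=0)
                hue, light, sat = colorsys.rgb_to_hls(*mean_rgb)
                results.append({"id": current, "area_px": len(pixels),
                                "hsl": (hue, sat, light)})
    return results
```

The survival count per image is simply `len(results)`; pixel areas can be converted to mm² via the scale-bar calibration, and mean HSL values compared against the CoralWatch chart scores, mirroring the three outputs described above.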

Figure 2.
Infographic showing the evolution of images through the machine learning image analysis pipeline. The sequence includes: the original image, training annotations, the produced prediction layer, and uncertainty layer signalling areas that require further training. After training, the final segmentation layer shows the outline of the juvenile with precise boundaries. This is then converted into a simple segmentation mask, which uses grey scale for each classification label. The final image shows the juvenile after being labelled with a unique identifier and outlined to show the area measured.

Model Validation; Assessment of Pipeline Accuracy and Speed
Ilastik projects were produced using various numbers of training images for both the tile and slide substrates. The purpose of this was to test how many training images are needed to give the most accurate outputs. The total number of randomly selected test images was 111 March images, 81 June images, and 95 October images for the tiles. A total of 275 test images of juveniles were randomly selected for the slides. Further, there was no overlap of training and test data, and no areas in the test images were seen by Ilastik during training. As juvenile appearance varies between species, different projects would have to be trained for each species, especially if a different substrate was used. The accuracy of the survival counts, area, and HSL measurements was compared between each project, and the mean differences ± standard deviations between the manual method and the machine learning pipeline were calculated.
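Ilastik's core step, a random forest over per-pixel color/edge/texture features, can be approximated with scikit-learn. The sketch below uses synthetic two-class feature vectors (not the study's data) purely to illustrate the train/predict loop behind each project; the 100-tree setting matches the Ilastik default described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic per-pixel feature vectors (stand-ins for smoothed intensity,
# edge strength, and texture responses) for two classes:
# 0 = background/substrate, 1 = juvenile tissue.
background = rng.normal(loc=0.2, scale=0.05, size=(200, 3))
juvenile = rng.normal(loc=0.7, scale=0.05, size=(200, 3))
X = np.vstack([background, juvenile])
y = np.array([0] * 200 + [1] * 200)

# Random forest with 100 trees, as in the Ilastik pixel classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Classify new 'pixels'; a full image would be flattened to N x features
# and the predictions reshaped back into a segmentation mask.
pred = clf.predict([[0.18, 0.22, 0.21], [0.71, 0.69, 0.68]])
print(pred)  # [0 1]
```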
The number of juveniles counted by the pipeline was automatically recorded to an Excel file. This gave a 'survival count' of juveniles in each image. This count was then compared with the ground-truth manual survival counts from the same images. Regions of interest (ROIs) were visually (qualitatively) assessed against manual assessments to calculate false positive and false negative rates. For example, at times multiple juveniles were classified as a single juvenile due to their close proximity to each other (false positives). The accuracy of juvenile size measurements was assessed by comparing the pipeline ROI areas in the output Excel file with the ground-truth manual measurements made using ImageJ. Color accuracy was also assessed by comparing manual and pipeline assessments against the CoralWatch Health Chart "D" scale [18]. Manual assessments were made for every juvenile in the test images and given a visual score according to the health chart. The output .csv file from the pipeline contained an average HSL value for all the juveniles counted. These were directly compared with the HSL values of the CoralWatch Chart color scores. The time to perform the manual and pipeline measurements for survival, size, and color was quantified for both tile and slide images. This included data handling time for each step.

Statistical Analysis
To evaluate the difference in juvenile counts between methods, the "correct" values (defined as those taken from the manually derived measurements of images) were compared to Ilastik outputs for each test image. True positives were measurements identified in both the manual and pipeline assessments. False positives were juveniles counted by the pipeline but not manually, and false negatives were juveniles counted manually but not by the pipeline. From this, an F-score (F1) was calculated for each project [62] as F1 = 2 × (PPV × TPR)/(PPV + TPR), where PPV = TP/(TP + FP), TPR = TP/(TP + FN), and FNR = FN/(FN + TP). Abbreviations are as follows: PPV = positive predictive value, TPR = true positive rate (correct), FNR = false negative rate, TP = true positives, FP = false positives, and FN = false negatives. Survival means and standard deviations were calculated using R (version 4.0.3; [63]). The packages 'rstatix' and 'broom' were used for the area analysis [64,65]. For the tile image size and color analysis, a two-way ANCOVA and pairwise comparison were performed to examine the effects of the number of training images on area, with manual measurements set as the covariate. A Bonferroni adjustment was applied using the package 'emmeans' [66]. For the analysis of slide image areas and color, a one-way ANCOVA and pairwise comparison were performed to examine the effect of the number of training images on measurement accuracy.
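The score calculation above can be sketched directly from the counts (the TP/FP/FN values below are hypothetical, chosen only to exercise the formulas):

```python
def classification_scores(tp: int, fp: int, fn: int):
    """PPV, TPR, FNR, and F1 from true/false positive and false negative
    counts, matching the abbreviations defined in the text."""
    ppv = tp / (tp + fp)   # positive predictive value (precision)
    tpr = tp / (tp + fn)   # true positive rate (recall / sensitivity)
    fnr = fn / (fn + tp)   # false negative rate
    f1 = 2 * ppv * tpr / (ppv + tpr)
    return ppv, tpr, fnr, f1

# Hypothetical counts for one project's test set:
ppv, tpr, fnr, f1 = classification_scores(tp=90, fp=5, fn=5)
print(round(f1, 3))  # 0.947
```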

Assessment of Manual Versus Pipeline Calculations of Coral Juvenile Survival
The optimal number of tile training images for measuring survival was consistent across timepoints, although accuracy varied with which month's images were used in training (Figure 3A). For all months (March, June, and October), the optimal number of training images was 100, giving True Positive Rates (TPRs) of 77.5%, 91.3%, and 97.9%, respectively, when trained using images from the same time point (Figure 3B). The number of juveniles counted as alive was consistently low (<25%) if the model was trained using images from a different time point. When the slide images were input into the pipeline with five training images, an average TPR of 94.6% was obtained, with a 0.89% FNR and a 7.3% FPR relative to the ground-truth counts (Figure 3C,D). TPRs and FNRs did not drop to less than 0.9% and 0.5% when training with more than five training images. However, the lowest FPR was recorded with 40 training images (4.9%, Figure 3D). The number of training images with the highest F-score was 40 images (F1 = 0.973), with the initial training set having the lowest (F1 = 0.756). Training the initial set with just five additional slide images increased the F-score to 0.958.

Assessment of Manual vs. Pipeline Calculations of Coral Juvenile Size
The correlation between ground-truth manual measurements of coral juvenile area and the measurements of the segmentation masks made by the pipeline on the tile images varied by month and the number of images used in training (Figure 4A). For the March data, the best-fit regression was found when training with 150 images (R² = 0.634, p = 1.76 × 10^-14), although training with 60 images showed a similar fit (R² = 0.614, p = 6.84 × 10^-8). The best-fit regression for the June data was found when training with 40 images (R² = 0.936, p = 8.51 × 10^-15), whilst the October data was best fit when trained with 60 images (R² = 0.867, p = 5.76 × 10^-10; Figure 4A). The regressions indicate more accurate measurement by the pipeline at the later timepoints (i.e., June and October), indicating higher measurement accuracy in older corals. The area was consistently underpredicted (~30%) by the pipeline when trained with 20 or more images (Figure 4B). After 40 training images, the difference between manual and pipeline measurements did not vary considerably, with the mean difference in area around 27.9% ± 23.7 for 40 images and 29.1% ± 19.6 for 150 images (Supplementary Table S1). Pairwise comparisons for June and October showed a statistical difference between the initial training set and all additional numbers of training images (Supplementary Table S2; all p < 0.05). There was no statistical difference between any other numbers of training images. For March, the pairwise comparison showed that all analyses with more than 40 training images were not significantly different (Supplementary Table S2).
For the slide images, there was no significant difference seen when training with different numbers of images (F = 1.914, p = 0.054). The mean difference in the size of juveniles settled onto slides when using just the initial training images was an underestimate on average of −0.11% ± 116.75. When using five more images to train, the mean underestimated difference was −10.84% ± 15.91, substantially improving the standard deviation and decreasing the median range of values ( Figure 4C), although the pipeline mean was farther from the mean of the manual measurements. The lowest mean underestimated difference was measured when training with 40 images (−3.47% ± 19.35).
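The signed percentage differences reported here (negative values = pipeline underestimates) follow a simple convention that can be sketched as follows, with hypothetical paired area values for illustration:

```python
import statistics

def percent_differences(manual, pipeline):
    """Signed percentage difference of pipeline vs. manual areas;
    negative values mean the pipeline under-predicts size."""
    return [(p - m) / m * 100 for m, p in zip(manual, pipeline)]

# Hypothetical paired measurements (mm^2) for a handful of juveniles:
manual = [2.0, 1.5, 3.0, 2.5]
pipeline = [1.8, 1.5, 2.7, 2.4]
diffs = percent_differences(manual, pipeline)
print(round(statistics.mean(diffs), 2), round(statistics.stdev(diffs), 2))  # -6.0 4.9
```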

Assessment of Manual vs. Pipeline Calculations of Coral Juvenile Color
Coral color is used as a health proxy to indicate the relative abundance of symbiont cells (Symbiodiniaceae) inside coral tissues [18], where "D1" scores represent low symbiont densities (pale, bleached corals) and "D6" scores represent tissues with high symbiont densities (Figure 5A). Pairwise comparison for the slide images showed significant differences in over- and under-predicting HSL values when comparing the initial training set with all other numbers of training images (Figure 5B; Supplementary Tables S3-S5; all p < 0.05). However, there were no significant differences in over- or under-predicting HSL values when five or more training images were used, indicating that the accuracy of color assessment does not significantly improve with more than five training images.
Figure 4.
Boxplots of the percentage change in area (mm²) between manual measurements and the pipeline using the slide images (C). Note that the y-axes for (B) and (C) have different scales. For (B,C), any point below 0 on the y-axis indicates the pipeline is under-predicting the size of the juveniles relative to the manual measurements; any point above 0 indicates over-prediction.
In the tile images, the hue and lightness measurements did not significantly differ when using more training images for all three time-points (hue: F = 0.989, p = 0.462, lightness: F = 0.675, p = 0.800). However, pairwise comparisons between manual and pipeline measurements showed that saturation, when using the initial training set, significantly under-predicted juvenile values compared to predictions from all other numbers of training images ( Figure 5C; Supplementary Table S4).


Assessment of Time Saving for the Measurement of Coral Juvenile Survival, Size, and Color in Manual Versus the Pipeline Measurements
On average, it took ~720 h to manually measure 1200 tiles, compared to between ~115 and 215 h using the pipeline (depending on the number of training images used). This is equivalent to 6.2× faster per time point when using 20 training images, and 5× faster when training with 60 images (Figure 6A). The time to process slides was also significantly quicker using the pipeline compared to manual measurements (Figure 6B). Time efficiency drastically increased for larger datasets, with the analysis of 900 slides being 36× faster than manual measurements per time point using five training images, and still 4× faster when a large number (50) of images was used for training. The threshold at which manual measurement becomes slower than pipeline processing occurs at >250 tiles (60 training images) and >30 slides (5 training images; Figure 6A,B).
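The >250-tile and >30-slide thresholds are break-even points where a fixed up-front training cost plus a low per-image cost overtakes a purely manual per-image cost. A sketch of that arithmetic, with hypothetical per-item and training times (not the study's measured values):

```python
import math

def break_even(manual_per_item_h: float, train_h: float,
               pipeline_per_item_h: float) -> int:
    """Smallest number of items for which a fixed training cost plus a
    cheap per-item cost beats the purely manual per-item cost:
    N * manual >= train + N * pipeline  =>  N >= train / (manual - pipeline).
    """
    return math.ceil(train_h / (manual_per_item_h - pipeline_per_item_h))

# Hypothetical costs: 0.5 h/tile by hand, 120 h of one-off training,
# 0.25 h/tile through the pipeline afterwards.
print(break_even(0.5, 120, 0.25))  # 120 / 0.25 = 480 tiles
```

The same model explains why the slide threshold is so much lower: with only five training images the fixed cost is small, so the pipeline pays off after a few dozen items.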

Discussion
As coral reef ecosystems face continuing stress and degradation from persistent ocean warming and other anthropogenic pressures [6], conservation interventions are increasingly being considered to restore reefs and improve their resilience to stress. This includes understanding processes that promote the potential for coral recovery and resilience. While a global effort to reduce carbon dioxide emissions should be central, large investments are being made to design coral restoration interventions that can mitigate degradation [67]. Some of these interventions include the seeding of enhanced, heat-tolerant reef-building coral species onto reefs using assisted gene-flow methods to prepare them for future warming [19,68]. This may involve the seeding of hundreds of thousands of individuals, necessitating tools that can rapidly count and assess corals at large scales. Open-source tools are therefore critical and should be made available to aid in the development and vetting of these proposed management strategies [25]. We present a pipeline for coral juvenile analysis targeted at three key coral fitness traits relevant for ecological and restoration purposes that performs up to 36× faster than manual methods with minimal loss of accuracy, especially for survival and size. With current methods of manual measurement using tools such as ImageJ, analysis of coral juveniles can take many months [19]. This pipeline therefore has the potential to accelerate the turnover of results between experiments and to scale up experiments using a user-friendly, free, and reproducible interface. Once training is completed, the pipeline can batch process images with little further human input, allowing the scale of experimental analysis to be expanded with little increase in analysis time.
This pipeline is built on an open-source interactive learning and segmentation toolkit. It leverages machine learning (ML) algorithms to segment, classify, track, and count cells and other experimental data in an interactively supervised format, removing the need to understand the complex algorithms underpinning them [50,58,69]. This allows multiple user groups to take advantage of its capabilities. For example, the interface cues the user to draw label annotations onto images, allowing the program to classify pixels using a powerful non-linear algorithm. This creates a decision surface in feature space, generates an accurate classification, and projects the class assignment onto the original image. Other deep learning methods of image analysis are extremely powerful and show great success but rely on significant amounts of training data to build a high-dimensional feature space [70], compared to the small training set needed here for algorithm parameterisation through user supervision [50,58]. Additionally, this pipeline offers a limited number of pre-defined features that are powerful enough to detect image features the human eye cannot [29], which is relevant for small, cryptic, and taxonomically amorphous coral juveniles.

Overall Assessment of the Pipeline Performance in Accurately and Rapidly Measuring Survival, Size, and Color
The assessment of coral juvenile survival on the tile images was challenging given the diversity of organisms that recruited onto the tiles in the field and the changing community composition through time. However, this pipeline successfully detected and quantified surviving juveniles (77.5-97.9% detection with 100 training images) when trained with juvenile images from the same time point. The higher detection accuracy at later timepoints may also have been due to changing coral morphology over time as the juveniles grew and developed ever-increasing branch complexity. This also shows the importance of training with images from the corresponding time point when assessing field measurements, given the successional changes in flora and fauna and in environmental parameters such as temperature and light intensity. It further highlights the importance of minimizing the number of arbitrary objects in images that could be difficult for both the pipeline and the user to classify. Low accuracy when using images from different timepoints further suggests that this pipeline would perform poorly if assessing images of a substrate not seen in training. Moreover, image quality is important for the same reason.
While training, the pipeline creates uncertainty masks that can be retrained to give more accurate segmentation masks; however, low-resolution images lead to high uncertainty in these masks. This is especially problematic if the user is also uncertain how to classify specific areas of an image because of the low resolution. Finally, although high-resolution images give a more accurate analysis of the juveniles, they also increase the time the pipeline needs to compute pixel classifications for the training images. In contrast, the slides used in this study were specifically designed to provide a high-contrast (grey matte PVC versus brown coral tissue), low-complexity surface for rapid image classification. The slide images were therefore largely devoid of extraneous objects and presented fewer objects to classify. This allowed accurate training with fewer images, yielding an almost 100% accurate assessment of survival with as few as five training images.
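The uncertainty-mask idea can be sketched in a few lines: given the soft (per-class probability) output of a pixel classifier, flag pixels whose top-class probability is low as candidates for re-annotation. This is a minimal illustration of the concept, not the pipeline's actual uncertainty computation; the function name and threshold are hypothetical.

```python
import numpy as np

def uncertainty_mask(class_probs, threshold=0.25):
    """Flag ambiguous pixels for re-annotation.

    class_probs: (H, W, n_classes) array of per-pixel class probabilities,
    i.e. the soft output of a pixel classifier. Uncertainty is taken here
    as 1 - max probability, a simple stand-in for the uncertainty layers
    of interactive segmentation tools. Returns a boolean (H, W) mask."""
    confidence = class_probs.max(axis=-1)
    return (1.0 - confidence) > threshold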
Juvenile size analysis for the tile images required 40 training images to give the most accurate assessment of size compared with manual measurements. Although the pipeline significantly and consistently underestimated the size of juveniles on the complex surfaces of the tiles (regardless of the number of training images), by the last time point, when the juveniles were larger, the underestimation was less than 10% (compared with ~40% initially) and was consistent, suggesting that a calibration factor could be applied if needed. This further highlights the importance of image quality. When the boundary between a juvenile and the substrate was indistinguishable, it was difficult for both the human user and the ML algorithm to identify the juvenile accurately and to train the classifier to label pixels near the boundary. Indeed, when juveniles are very small and the substrate is complex, it becomes debatable whether the manual or the ML measurement is the more accurate one. In contrast, where substrate complexity was kept low by design on the slides, the manual and pipeline size measurements were almost indistinguishable (median ~0% difference), especially with more than five training images. It is important to note that the slide images still exhibited outliers with extreme differences in area (seen in the interquartile ranges for different numbers of training images). These typically occurred when juveniles had settled in close proximity to one another and the pipeline classified the slide pixels between them as "juvenile" pixels, joining several juveniles together. However, these cases were rare when using high-resolution images and can easily be excluded during qualitative visual inspection of the output images.
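Because the underestimation on the tiles was consistent at the last time point, a single multiplicative calibration factor could be fitted on a paired subset (images measured both manually and by the pipeline) and applied to the remaining pipeline measurements. A minimal sketch, with a hypothetical helper name and illustrative numbers:

```python
def area_calibration(pipeline_areas, manual_areas):
    """Fit a multiplicative correction from a paired subset where both
    pipeline and manual areas are known, and return a function that
    corrects new pipeline measurements. Only valid when the bias is
    consistent, as observed for the October tile images (<10%
    underestimation regardless of training-set size)."""
    factor = sum(manual_areas) / sum(pipeline_areas)
    return lambda area: area * factor
```

For example, if the pipeline reports 90 and 45 mm^2 where manual measurement gives 100 and 50 mm^2, the fitted factor (~1.11) restores subsequent pipeline areas to the manual scale.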
The assessment of coral color was more complex, given the three measurements used to quantitatively describe it. However, hue, saturation, and lightness (HSL) did not change when using more than five training images. The percentage difference in HSL between the manual and pipeline measurements when using only the initial training set, compared with five or more images, is likely due to the pipeline classifying pixels differently at the borders between juveniles and tiles, meaning that the pipeline averages HSL over a different number of pixels. A more accurate prediction of the object border and area reduces the influence of false-positive pixels on the color classification, producing more accurate HSL values that can be further improved with subsequent testing. Interestingly, the final October time point showed more variability in hue than March and June, indicating higher hue variability at later stages of coral development, potentially due to colonization by symbionts of new growth regions of the coral. Even though the HSL scores were consistent across the numbers of training images used in most instances, these values differed from the HSL scores on the coral health chart: for example, hue values were higher, while saturation and lightness values were lower. The coral health chart has been extremely important in standardizing the analysis of coral stress and bleaching and is a key tool for standardized in situ assessments [18]. Its simplicity means it can be used in citizen science initiatives, giving consistent comparisons, and it will remain a key tool in bleaching assessment. However, the method has some limitations: coral color is extremely diverse and may not fall within one of the four hue scales provided on the coral health chart.
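The per-juvenile HSL averaging described above can be sketched with the standard library's colorsys module (note that colorsys works in hue-lightness-saturation order on 0-1 values). The function below is an illustrative stand-in for the pipeline's color measurement, not its actual code:

```python
import colorsys

def mean_hsl(pixels, mask):
    """Average hue/saturation/lightness over pixels classified as juvenile.

    pixels: nested list of (r, g, b) tuples in 0-255; mask: same shape,
    truthy where the pixel belongs to the coral. Averaging over the whole
    individual smooths the patchy polyp-versus-coenosarc color
    distribution that makes chart-based scoring subjective."""
    hs, ss, ls = [], [], []
    for pixel_row, mask_row in zip(pixels, mask):
        for (r, g, b), keep in zip(pixel_row, mask_row):
            if keep:
                h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
                hs.append(h); ss.append(s); ls.append(l)
    n = len(hs)
    # Naive hue mean: fine for brown/green corals whose hues sit well away
    # from the 0/360 degree red wrap-around; use a circular mean otherwise.
    # Returned as hue in degrees, saturation and lightness in percent.
    return 360 * sum(hs) / n, 100 * sum(ss) / n, 100 * sum(ls) / n
```

A fully saturated brown pixel such as (128, 64, 0) maps to a hue of 30 degrees, 100% saturation, and ~25% lightness on these scales.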
Thus, this pipeline gives additional options in the assessment of hue, allowing the user to standardize to symbiont type, more accurately predict symbiont density, and potentially include other measurements such as fluorescence [71], thereby uncovering new traits of temporal color change between life stages that the color chart may not reveal. Finally, while saturation and lightness can be assessed using the coral chart scale, the output of this pipeline is a mean value over the entire coral juvenile. This potentially reduces the human bias in bleaching scores, given that color is often patchily distributed across a coral colony. When a coral juvenile first takes up symbionts, the symbionts are typically ingested through the mouth and first become visible in the tentacles. The coenosarc (the connecting tissue between corallites) frequently does not display symbionts until the polyp already holds dense aggregations of algal cells. This variation in cell density between polyp and coenosarc gives the coral a non-uniform color, making it difficult to score with the coral health chart [72] and introducing additional human bias. The proposed pipeline therefore has the potential to eliminate this bias by taking a mean value across the full surface of the individual.

Recommendations for Training Parameters for Pipeline Users
While the pipeline successfully analyzed coral juvenile survival, size, and color, several factors should be considered when preparing images. For example, different image types require different numbers of training images to produce the most accurate outputs, i.e., those minimizing the difference between manual and ML measurements. We advise users to train the pipeline with several numbers of training images until the difference between the output images and the test-set images is minimized. For the substrates used here, this was generally between 5 and 100 training images, depending on the trait. As coral juveniles grow, their appearance will also change, affecting accuracy. If several time points are analyzed with little visual change to the juveniles, the same pipeline and training images may be reused. However, if substantial time separates the time points, especially in field-based studies, we advise analyzing a test set of images from the corresponding time point.
One of the key benefits of this pipeline is the ability to increase the number of coral juveniles analyzed with little increase in analysis time. Once trained, the pipeline can be left to analyze a large number of images without supervision. Our results show that when training with 60 images (the number that gave accurate results for survival, size, and color on the tiles), the pipeline is nearly 5× faster than manual measurement even if retraining is needed at each time point. For the slide analysis, training with five images (likewise the number giving accurate results for all three traits) makes the pipeline 36× faster than manual measurement under the same retraining assumption. If retraining is not needed at every time point, the analysis is even faster. The close agreement between manual and pipeline estimates, achieved at a fraction of the time, clearly demonstrates the pipeline's efficiency.
Conversely, if an experiment includes only a small number of juveniles to analyze (approximately 150 slide or 200 tile images), the training time may outweigh the time it takes to measure the juveniles manually. Training time will also increase with more complex substrates. This is relevant because artificial substrates that use biomimicry of crustose coralline algae (CCA) are being discussed for sexually produced restoration corals to increase settlement success [15]. The two substrates used here were terracotta tiles and plastic PVC slides. The tiles are ideal for coral recruit settlement given their rough texture and crevices, which give larvae places to attach and escape predation. However, this texture also allows other organisms to attach easily, making the images more difficult to train on than the smooth, matte, and uniform PVC, on which less algae and fewer other organisms accumulate. The smooth surface may, however, hinder the attachment of certain coral species during settlement. The most suitable substrate would therefore be visually uniform yet provide sufficient micro-texture for attachment.
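The train-versus-manual trade-off can be made explicit with a simple break-even calculation. All timings in the example below are hypothetical, chosen only to illustrate how a threshold of roughly 150 images could arise:

```python
import math

def break_even_images(train_time, pipeline_per_image, manual_per_image):
    """Smallest number of images at which training the pipeline pays off.

    Pipeline cost: train_time + n * pipeline_per_image
    Manual cost:   n * manual_per_image
    Solving for n gives n > train_time / (manual - pipeline).
    Any consistent time unit works; inputs are illustrative, not measured."""
    return math.ceil(train_time / (manual_per_image - pipeline_per_image))
```

With an assumed 300 min of training, 0.5 min per image for the pipeline, and 2.5 min per image manually, break_even_images(300, 0.5, 2.5) returns 150: below that image count, manual measurement is faster.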
Image resolution was also extremely important for analysis accuracy, as image ambiguity increases training time and produces uncertainty for both the trainer and the pipeline. Images should be clear, with distinct boundaries, to provide accurate training classifications and thereby generate accurate outputs. If too many of the same objects are trained on, overtraining can occur [69]. Overtraining can then lead to overfitting, which arises when a classifier fits the training data too "tightly", causing unwanted behaviour in the ML algorithm: it performs well on the training data but not on independent test data. We recommend testing for this by comparing test-data outputs across several numbers of training images, as we have demonstrated here. Once accuracy starts to decrease, the algorithm is overfitting and the number of training images should be reduced.
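The recommended overfitting check (compare test accuracy across several training-set sizes and back off once accuracy declines) can be sketched as follows; the function name is hypothetical:

```python
def best_training_size(sizes, accuracies):
    """Return the training-set size with the highest test accuracy.

    sizes and accuracies are parallel lists from runs with increasing
    numbers of training images. If accuracy declines as images are added,
    the classifier is likely overfitting, so on ties the smaller
    (cheaper) training set is preferred."""
    best = max(range(len(sizes)), key=lambda i: (accuracies[i], -i))
    return sizes[best]
```

For example, if accuracy plateaus at 40 images and dips at 100, the function recommends stopping at 40.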

Future Directions in Tool Development for Coral Conservation
As new restoration intervention technologies are developed, further improvements will be made to conservation tools. For example, although this analysis used 2D images, the pipeline is also capable of processing 3D images [34]. This will deliver estimates of the size and color of 3D-scanned corals, improving the biological information, throughput, time efficiency, accuracy, and ease of use of current 3D methods [34,73-75] and extending them to early life stages, which are especially challenging to measure for small coral juveniles (see Quigley et al. [19]). Understanding coral ontogeny across all life stages is essential and requires multiple methods to capture the growth form and structural complexity of corals. Although 2D images suit the analysis of coral recruits and young juveniles, given the flat form of particular species in the early months, 3D analysis becomes essential as coral juveniles start to grow more complex shapes [68,74]. Open-source tools that combine powerful software libraries for biological image analysis with easy functionality will be pivotal to their rapid application [27]. For example, the plugin we designed here and classification algorithms trained in other open-source programs can be combined in software such as Fiji, so that segmentation masks can be created and analyzed simultaneously.

Conclusions
The use of ML in coral juvenile analysis can provide researchers, managers, and conservation practitioners with free, open-source tools to improve the health of coral reefs by enhancing the understanding of factors influencing coral growth and survival. Our pipeline successfully analyzed coral juveniles on two lab- and field-based substrates and improved efficiency by up to 36×. This adaptable pipeline could be applied to various organisms to assess diverse benthic communities. This information will help spur and refine conservation practices in the new era of large-scale and ambitious restoration projects globally.

Author Contributions: Conceptualization, K.Q., C.J.N.; methodology, All Authors; validation, All Authors; formal analysis, All Authors; investigation, All Authors; resources, All Authors; data curation, All Authors; writing-original draft preparation, A.M.; writing-review and editing, All Authors; visualisation, A.M.; supervision, K.Q., C.J.N.; project administration, K.Q.; funding acquisition, K.Q. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The Fiji software is available at https://imagej.net/Fiji (accessed on 7 August 2021). The Ilastik software is available at https://www.ilastik.org (accessed on 7 August 2021). The coral photos and analysis code for use in Fiji and Ilastik are available at https://github.com/LaserKate/Coral_Ilastik_Pipeline (accessed on 7 August 2021).