A Comparative Assessment of Ensemble-Based Machine Learning and Maximum Likelihood Methods for Mapping Seagrass Using Sentinel-2 Imagery in Tauranga Harbor, New Zealand

Seagrass has been acknowledged as a productive blue carbon ecosystem that is in significant decline across much of the world. A first step toward conservation is the mapping and monitoring of extant seagrass meadows. Several methods are currently in use, but mapping the resource from satellite images using machine learning is not widely applied, despite its successful use in various comparable applications. This research aimed to develop a novel approach for seagrass monitoring using state-of-the-art machine learning with data from Sentinel–2 imagery. We used Tauranga Harbor, New Zealand as a validation site for which extensive ground truth data are available to compare ensemble machine learning methods involving random forests (RF), rotation forests (RoF), and canonical correlation forests (CCF) with the more traditional maximum likelihood classifier (MLC) technique. Using a group of validation metrics including F1, precision, recall, accuracy, and the McNemar test, our results indicated that machine learning techniques outperformed the MLC with RoF as the best performer (F1 scores ranging from 0.75–0.91 for sparse and dense seagrass meadows, respectively). Our study is the first comparison of various ensemble-based methods for seagrass mapping of which we are aware, and promises to be an effective approach to enhance the accuracy of seagrass monitoring.


Introduction
Together with mangrove and salt marsh, seagrass has been evaluated as an effective coastal ecosystem for blue carbon storage [1][2][3]. However, ongoing degradation of seagrass meadows [4] is leading to a requirement for accurate mapping and monitoring methods to facilitate the MRV (Monitoring, Reporting, and Verification) approach necessary for broad scale evaluation of their contribution to blue carbon reservoirs [5]. In the last decade, satellite imagery has been used extensively in developing seagrass mapping techniques by employing various classification algorithms with or without parallel traditional field surveys [6]. Among them, Sentinel-2 imagery is becoming more popular for seagrass mapping. Operated by the European Space Agency since 2015, this sensor supports a high quality image at spatial resolutions between 10 and 60 m [7]. Sentinel-2 data have been distributed free-of-charge at the top-of-atmosphere corrected level (level 1C) for blue, green, red, and near infrared (NIR) bands at 10 m resolution, and provides a very good resource for

Study Site
Tauranga Harbor, North Island, New Zealand was selected as our study site ( Figure 1). The site supports a single seagrass species, Zostera muelleri, distributed in the intertidal parts of the harbor [32]. Z. muelleri tolerates a wide range of salinity (10-30 practical salinity unit (psu)), however, a salinity at 12 psu is optimal to produce the highest shoot density [33]. Z. muelleri is a small plant when growing intertidally as in the Tauranga Harbor, with leaves 5-30 cm in length and 0.1-0.4 cm in width. It has a maximum growth rate in the austral summer (December to March) with optimal temperature ranging from 27-33 °C [34][35][36]. Biomass declines gradually over winter to reach a minimum in early spring (October) [37]. Since satellite image based mapping uses surface reflectance as input data for the classification, our mapping only addressed the above-ground part of seagrass meadows, and may thus underestimate the colonized area if applied in late winter when the aboveground parts have senesced. Flowering and seed production in Z. muelleri is rare in New Zealand, and lateral spread is slow and by vegetative mechanisms [38,39].
In Tauranga Harbor, the tide regime is semi-diurnal, with a range of 0.2-2.1 m. The size of the harbor means that tide timings are different in various areas in the harbor [40]. Examination of harbor bathymetry and seagrass distribution, which is almost exclusively intertidal in the harbor, indicates that seagrass was occupying a depth range of 0.0-1.5 m at the Sentinel-2 acquisition time. Across the years 2018-2019, the mean air temperature ranged from 2.5 °C in winter to 31.6 °C in summer (data from the New Zealand Meteorological Service [41] whereas mean sea temperature ranged from 14 °C in winter to 23 °C in summer [42] . Previous mapping using aerial photography and manual classifications for the years 1959, 1996, and 2011 [43] suggested a loss of ~50% of seagrass area, making a strong case for the ongoing monitoring of seagrass cover. In addition, a wide range of seagrass density and substratum makes Tauranga Harbor a suitable location for testing novel classification algorithms ( Figure 2).

Field Survey
A seagrass mapping survey was undertaken between 1 and 7, April, 2019 ( Figure 1) in the intertidal areas of the harbor. At low tide, the boundary of seagrass meadows was delimited using a Garmin Etrex 30 global positioning system (GPS) with an accuracy of ±2 m. Other substrata recorded during the field survey were bare sand and muddy sand. Macroalgae were neither detected from our field survey nor mentioned in previous mapping reports [32,43].
Ground truth points (GTPs) were recorded by following the boundary between seagrass meadows and unvegetated areas. In addition, the internal boundaries of seagrass and bare substrate were recorded if they coexisted in the same meadow. The frequency with which GTPs were recorded was related to the 10 m pixel size of Sentinel-2, and varied according to seagrass density. The number of GTPs was 2-5 GTPs per pixel in the case of patchy meadows, and decreased to one GTP per pixel or one GTP per 2-3 pixels for continuous, dense meadows. GTPs were not collected for seagrass meadows smaller than 100 square meters (corresponding to the pixel size of Sentinel-2 imagery), and some meadows could not be accessed due to logistic constraints.
Seagrass class boundaries were determined visually, with "dense" and "sparse" boundaries recorded. A meadow was identified as "dense" if the coverage was greater than 80% (Figure 2a), and "sparse" if the coverage was less than 80% ( Figure 2b). A total of 4315 GTPs were recorded, with 2751 and 1564 for sparse and dense meadows, respectively; 237 GTPs were recorded for other substrata.

Satellite Data Acquisition and Image Pre-Processing
A Sentinel-2 scene acquired on May 1, 2019 was selected and downloaded from the GLOVIS website [44] ( Table 1). The Sentinel-2 scene was pre-processed at level 1C (atmospheric correction at the top of atmosphere), and in the projection of WGS-84 UTM 60S. Sentinel pixels were classified into non-seagrass, sparse, and dense seagrass classes according to our field observations. Field and remote sampling were closely synchronous, and we considered the field data to provide a sufficiently accurate representation of seagrass spatial distribution to develop the models.

Atmospheric Correction
Atmospheric correction was executed in a Python™ environment using the dark spectrum fitting method in ACOLITE [45]. This fast and free-to-download tool has been adapted for aquatic application and presents a reliable atmospheric correction for Landsat and Sentinel-2 [46]. This process converts pixel values from the top of atmosphere (at level 1C) to surface reflectance for water pixels.
The values and options of selected parameters are presented in Table 2. For our study site, sun glint was not observed in the acquired scene. Therefore, the parameter of sun glint correction was set to False. Short wave infrared spectral range (SWIR), a spectral band of Sentinel-2 with the wavelength close to 1600 nm, was used to mask land and cloud pixels from water pixels. Corrected surface reflectance for water pixels of blue ( ), green ( ), and red ( ) bands were used for the next step of water column correction.

Water Column Correction
Satellite image acquisition did not coincide with low tide in Tauranga Harbor (Table 1), and most, but not all, seagrass areas were inundated at the time of acquisition. We therefore applied a water column correction. Previous studies suggested that visible wavelengths are the most sensitive to seagrass meadows [47][48][49] and penetrate well into water [47]. Several studies have indicated that the NIR band is rapidly absorbed in an underwater environment [20,50], and may therefore contribute noise to the image, leading to a low accuracy of underwater habitat detection [48,51]. As a result, we decided upon the use of only visible spectra in the current study without the NIR band. After the water column correction step, water pixels in the blue (458-523 nm), green (543-578 nm), and red (650-680 nm) spectral bands were selected as input data for the evaluation of the model's performance. Water column correction was conducted using bottom reflectance index (BRI), as proposed by Sagawa et al. (2010) [52]. Instead of absolute values of bottom reflectance, this approach creates different indexes for various bottom types, which are used for the step of classification. The index is calculated using Equation (1): where is the bottom reflectance index of band i; is the surface reflectance for water pixel of band i ( ); is the attenuation coefficient of solar radiance in water column (m −1 ) of band i; is a geometric factor accounting for the path length through the water; and is the water depth (m).
Values of and of blue, green, and red bands were retrieved from the atmospheric correction step. Water depth ( ) was extracted from the bathymetry published by the National Institute of Water and Atmospheric Research (NIWA). g was calculated using Equation (2) [53]: where and g was calculated as 0.0245 for the Sentinel-2 scene at the study site.

Selection of Maximum Likelihood, Random Forest, Rotation Forest, and Canonical Correlation Forest Classifiers
MLC is the most popular classification method in remote sensing [6] and is based on probability theory. This model requires a normal distribution, equal covariance, and a sufficient number of training samples [54] to maintain a reliable result. Class mean vector and covariance matrices are used to minimize the class distance and maximize the probability of a feature belonging to the selected class by using quadratic or linear discrimination functions. This method, however, may overfit posterior probabilities and result in inaccurate classification when the assumptions are violated.
RF [55] builds a forest of decision trees from a bootstrap sampling of training data points, and uses only a selection of all the samples for each decision tree. For each subset, a decision tree is built using two thirds of the total samples for training and one third for validation of the RF model. Majority voting is applied to arrange labels to given classes. The RF model constructs and combines a large number of base-decision trees using the classification and regression tree (CART) algorithm. Known as a robust and consistent model, RF is the most popular ensemble based model for classification problems, and, in particular, for seagrass mapping [6].
In 2006, Rodriguez et al. (2006) presented RoF, a feature extraction based method [56]. RoF is expected to provide an alternative selection for both regression and classification tasks [28]. The training data created by a bootstrap sampling is split into K subsets and then principal component analysis (PCA) is applied for each subset. All the principal components are retained and a number of decision trees are built from these transformed datasets. RoF can perform better than RF with a smaller number of trees, and therefore reduces the time for running the model. Instead of estimating the average value from all the trees in RF, a confidence value is calculated to assign a label to a given class with the highest value of confidence.
CCF creates a number of canonical correlation trees (CCTs) using canonical correlation analysis (CCA) to maximize the correlation between the input data and the selected labels. The authors in [31] confirmed the robustness of the model and also introduced projection bootstrapping, which uses all of the data and improves the prediction accuracy over RF models. CCF is different from RF and RoF, as it lacks the tuning hyper-parameter, but offers similar accuracy with a smaller number of trees [31] and therefore, may consume less computation time for the training phase.

Training and Testing Dataset
Differences exist between the collection date of the GTPs and the acquisition date of the satellite image and, in the variable marine environment, this may be inevitable [57][58][59][60]. The sampling dates in April and May both fell in the austral fall. The difference between the field survey (1-7, April 2019) and the acquisition date of the Sentinel-2 image (1 May, 2019) was approximately 23 days, which was considered acceptable when compared with various reports in the literature [11,[58][59][60]. Therefore, the GTPs were reliable to support the selection of image pixels for the training and testing datasets. Following the GTPs, the regions of interest (pixels) of three classes (dense seagrass, sparse seagrass, and non-seagrass) were selected and randomly split into 60% for training and 40% for the testing phases, and used as a unique input for all selected models (Table 3). The optimization and performance of the RF, RoF, and MLC algorithms for image classification were conducted in a Python™ environment using a Jupyter notebook as the code editor. RF and RoF codes were sourced from Scikit-learn [61], GitHub [62], and MLC from mlpy [63,64]. For the RF and RoF models, optimization of the hyper-parameters used a grid search with fivefold cross-validation [26,28]. The optimization and performance of the CCF model was conducted in the MATLAB environment using the source code of Rainforth and Wood [31,65]. The CCF models were performed with 10, 30, 50, 100, 200, and 500 trees, and an optimal number was selected based on the lowest misclassification rate, highest Kappa coefficient, and an acceptable computation time.
The results of the CCF model were exported to a CSV format for model comparisons in Python TM . All source codes were open access, and the codes developed in this study will be submitted to GitHub. Image processing and classification were performed using a desktop computer with four 3.8 GHz physical cores and 16 Gb RAM.

Evaluation Criteria
Equations (3)-(7) involving accuracy, Kappa coefficient, precision, recall, and F1 were used to compare the performance of the selected models.
where is the predicted value and is the corresponding true value where is the observed agreement ratio and is the expected agreement where is the true positive; is the false positive; and is the false negative. In addition, the non-parametric McNemar test was used to statistically compare the accuracy of the selected models in this research. The test was executed in a Python™ environment using the mlxtend library [66]. The chi-square value (χ 2 ) was calculated from Equation (8) with Edward's continuity correction.
where is the false negative and is the false positive.

Results
Pixels were selected from each class, which had been verified by ground truth samplings, for the evaluation of automated classification. The main challenge to the automated classification that emerged relate to the mixing of sparse seagrass and bare sand in the same meadow. In particular, the substratum in very shallow water belonging to the non-seagrass class could be confused with sparse seagrass.

Hyper-Parameter Tuning for Random Forest, Rotation Forest, and Optimizing the Number of Trees for Canonical Correlation Forest Models
The hyper-parameters were tuned and consistently maintained for RF and RoF running during the training and testing phases (Table 4).  Figures 3a,d). As a result, we selected 200 trees as an optimal choice for CCF, which gave a computation time for the training and testing runs of 27 and 4.5 s, respectively; these times increased linearly with the number of trees (Figures 3b,c).  Table 3.

Comparing the Performance ofRandom Forest, Rotation Forest, Canonical Correlation Forest, and Maximum Likelihood Models for Seagrass Mapping
The ML methods consistently outperformed the MLC model for all evaluation metrics (Table 5), and the McNemar test indicated these effects were significant (Table 6). In particular, the precision values of the ML models were similarly high, and greater than that obtained by the MLC method for dense and sparse seagrass, whilst very high recall was observed for both classes using the MLC model. F1 values of the MLC were lower than the ML methods for both dense and sparse seagrass. Among the ML methods, RoF outperformed the CCF and RF models ( Figure 4 and Table 5), and this difference was also statistically significant ( Table 6). The RoF model showed the highest values for precision and F1 across all three ML methods for both dense and sparse seagrass classes ( Table 5).
All ensemble-based ML models were able to detect dense seagrass meadows with very high F1 scores ranging from 0.90-0.91 whereas the MLC model produced a F1 score of 0.75 ( Table 5). The performance of these ML models was consistent with a balance of high precision and recall. The good performance for the dense seagrass class was conceivably due to a spectral separation of this class from the sparse and non-seagrass classes. As can be seen from Table 5, the RoF model improved the accuracy of the classification in terms of F1 score by 1% compared to the CCF and RF, but by nearly 17% over the MLC model for the dense seagrass class. Corresponding improvements were 3% (CCF, RF) and 33% (MLC) for the sparse seagrass class.
As indicated above, the classification was more challenging when a mixture of seagrass and bare sand were present in the same meadow. The RoF model still produced the highest F1 score (0.75), followed by the CCF and RF (0.73) models (Table 5). Conversely, the MLC model yielded a F1 score of only 0.50, which was significantly lower than the three ML models. With respect to computation time, the CCF model requires more time to run whilst the MLC model executes both the training and testing phases very quickly (Table 5).  Table 5. Accuracy, precision, recall, and F1 of the model performing for the selected Sentinel-2 scene; best performance is indicated by the characters in bold. As the RoF model yielded the highest performance assessment for seagrass mapping in this research, we employed this model to create the classified maps ( Figure 5). The distribution of seagrass meadows was consistent in the middle and southern part of the harbor, which may suggest optimal sites for blue carbon assessment as well as potential targets for the long-term conservation of seagrass in Tauranga Harbor. The seagrass area was estimated as approximately 1027.59 ha in May 2019.

Discussion
As far as we are aware, this research is the first attempt to compare the performance of the RF, RoF, CCF, and MLC methods for seagrass mapping with a full radiometric correction of the image. Desirable characteristics for seagrass mapping are both high precision and recall. High precision means that the classifier is able to detect the seagrass pixels precisely, whilst high recall means that the classifier is able to find all possible pixels of seagrass. To give a final coherence score and harmonize the values of precision and recall, the F1 score is usually preferred to evaluate a model's performance. The research presented here suggests that ML models detect dense seagrass meadows well, and outperform the traditional MLC approach.
Of the machine learning ensemble approaches used, the CCF and RF models performed less well than the RoF model, contradicting a superior performance of CCF in other studies [31]. CCF produced lower recall whilst RF created a lower precision for both the dense and sparse seagrass classes. For sparse seagrass meadows, CCF detected more precisely than the RF model. In addition, MLC produced very high recall, but low precision scores for both seagrass classes, leading to a lower F1 score and accuracy than the ensemble based models. Overall, our results show that RF, RoF, and CCF are good performers with a balance of high precision and recall scores, whilst very low precision scores of dense and sparse seagrass classes ranging from 0.34-0.63 were found for MLC. These results confirm the robustness and consistency of machine learning ensemble based methods in comparison to the MLC. We hypothesize that the poor performance of MLC may be because of the need for input data to satisfy the built-in assumptions, described above, which are difficult to sustain in a spatially heterogeneous marine environment.
Of the methods tested here, only the RF technique has previously been applied to seagrass mapping using very high spatial resolution imagery. In that case, high precision (0.947) and recall (0.968) values were determined mapping Posidonia oceanica from digital airborne images, though no comparison to other methods was attempted [19]. In another seagrass study, the overall accuracy only reached 82% using the RF algorithm applied to RapidEye imagery [67]. Considering the size of the seagrass meadows and the mix of substrate in Tauranga Harbor, the measured scores in our results were reliable for both dense and sparse seagrass mapping using medium spatial resolution of Sentinel-2 data (10 m pixel size). In other studies that allowed for a comparison of RF, RoF, and CCF models, CCF slightly outperformed the RF and RoF models for land cover mapping [22], RoF outperformed RF and CCF for mangrove mapping [68] whilst a similar performance of RoF and CCF models was noted for landslide mapping [23].
In addition to the performance advantages, RF and RoF are easy to execute in Python™, whilst CCF is confined to the MATLAB environment. The open source operating environment and diverse Python™ libraries provide multiple solutions for seagrass mapping and monitoring, and enhance the capacity to develop novel algorithms for various tasks in marine science [69]. Several libraries in the Python™ environment support a built-in framework for classification problems with a long list of state-of-the-art machine learning algorithms [70]. This ease-of-use approach allows a person with minimum programming skills to make a classification more reliable with machine learning. Recently, cloud computing broadens the executable environments for the big Earth observation data, especially in coastal resource mapping. A cloud computation system is able to deal with massive amounts of remote sensing datasets, parallel processing of satellite image using multiple data centers, and can respond to real time monitoring on country and global scales [71,72]. The use of open source machine learning algorithms in Python TM and a cloud system, therefore, promises large scale and more reliable mapping in the future.
Our results have validated the performance of the RF, RoF, and CCF models for seagrass mapping and suggest that the RoF technique is a promising novel approach to further seagrass monitoring at various sites around the world. However, the current study does have some limitations. The mismatch between the number of ground truth points (GTPs) and Sentinel-2 pixel size (10  10 m) may raise a degree of uncertainty in classification, particularly for sparse seagrass. However, we considered that despite this mismatch, a sufficient and representative number of GTPs for each class was collected during the field survey (as presented in Subsection 2.2) to have confidence in the classification. Related issues that might explain the low values of precision and recall for sparse seagrass meadows are issues of mixed pixels, whereby small seagrass patches or dispersed clumps within a pixel challenge the classification process. To our knowledge, these effects are not easy to compensate for in the case of low to very low coverage seagrass using Sentinel-2 imagery. Thus, the use of very high spatial resolution sensors such as WorldView (~0.3-0.4 meters) [57,58] or Pleiades-1 (0.5 meters) is currently being investigated for future studies for seagrass mapping. Moreover, with the development of computer vision and pattern recognition, deep learning approaches using a variety of algorithms such as convolutional neural networks (CNNs or recurrent neural networks (RNNs) for semantic segmented imagery applied sub-pixel techniques should be encouraged for future studies [6].

Conclusions
We tested the performance of ML ensemble-based and MLC methods for seagrass mapping from Sentinel-2 data. Using Tauranga Harbor as a validation site, our comparison indicated that all MLbased approaches significantly outperformed MLC. MLC failed to detect sparse seagrass meadows, with a low F1 score of 0.50. We noted a better performance of RoF compared to the RF and CCF models with the highest F1 scores of 0.91 and 0.75 for dense and sparse seagrass classes, respectively.
Our results attest to the reliable application of the RoF model for the mapping and monitoring of seagrass in shallow water using Sentinel-2 imagery. Despite a lower accuracy for sparse than dense seagrass meadow classification, the CCF model shows potential for the mapping of seagrass and merits further testing at various scales and in various case studies. Regarding MLC, this model is still an applicable candidate for dense seagrass meadows, however, it may not be applicable for the mapping of sparse to very sparse seagrass meadows.