Reef-insight: A framework for reef habitat mapping with clustering methods via remote sensing

Environmental damage is of growing concern, particularly in coastal areas and the oceans, given climate change and the drastic effects of pollution and extreme climate events. Present-day analytical capabilities, along with advances in information acquisition techniques such as remote sensing, can be utilised for the management and study of coral reef ecosystems. In this paper, we present Reef-Insight, an unsupervised machine learning framework that combines clustering methods and remote sensing for reef habitat mapping. Our framework compares different clustering methods for reef habitat mapping using remote sensing data. We evaluate four major clustering approaches, namely k-means, hierarchical clustering, Gaussian mixture model, and density-based clustering, based on qualitative and visual assessments. We utilise remote sensing data featuring the One Tree Island reef in Australia's Southern Great Barrier Reef. Our results indicate that clustering methods using remote sensing data can identify benthic and geomorphic clusters in reefs that agree well with those reported in other studies. They also indicate that Reef-Insight can generate detailed reef habitat maps outlining distinct reef habitats and has the potential to enable further insights for reef restoration projects.


Introduction
Remote sensing provides the methodology that enables aerial data to be retrieved using advanced satellites and aerial vehicles [1,2]. In recent decades, remote sensing has been prominent in a number of applications, which include tropical forest environmental monitoring [3], environmental monitoring [4], mining environmental monitoring [5], coral reef monitoring [6], agriculture [7], surface moisture and soil monitoring [8], and space research [9]. Remote sensing data with machine learning methods have been increasingly used [10][11][12] in diverse applications, such as mineral exploration [13], environmental monitoring, and agriculture [14].
Coral reef mapping [15,16] provides valuable information about reef characteristics, such as the structure of the reef, geomorphic and benthic zones, and coral distribution, which can help in reef restoration projects [17,18]. Some of the related studies that used remote sensing are discussed as follows: Kennedy et al. [19] proposed a coral classification system that combines satellite data and local knowledge for identifying different geomorphic regions in a coral reef. Among various analytical techniques used on remote sensing data, Phinn et al. [20] assessed the quality of benthic and geomorphic community maps of coral reefs produced with multi-scale image analysis. Other map-processing approaches commonly used are supervised classification [21] and manual delineation of classes using images as a backdrop. Phinn et al. [22] evaluated eight commonly used benthic cover mapping techniques based on the two processing approaches stated above. Eight techniques were assessed on the basis of cost, accuracy, time, and relevance, where the preferred mapping approach was supervised learning using the classification of satellite data using basic field knowledge. Nguyen et al. [23] provided a review of coral reef mapping with multispectral satellite image correction, and pre-processing techniques and classification algorithms. Machine learning enables valuable information to be extracted from remotely sensed data with clustering [24], dimensionality reduction [25], and classification methods [11]. Machine learning methods are slowly becoming prominent for climate change problems [26]. These methods can also be used to understand and reconstruct data for climate and vegetation millions of years back in time [27]. Clustering [28] is an unsupervised machine learning method that is useful for remote sensing when labelled data are unavailable. 
The clusters produced using clustering techniques can be further improved using specific spatial information from the data or by applying basic domain knowledge [29,30]. Clustering techniques can be used for image segmentation tasks [31], where pixels are grouped into distinct regions (clusters) on the basis of given similarity measures. There are several clustering techniques that serve this purpose, such as the k-means clustering algorithm [32], agglomerative hierarchical clustering [33], density-based spatial clustering (DBSCAN) [34], and Gaussian mixture model (GMM) [35].
Our choice of clustering methods is based on their properties for the segmentation of image-based data. Our literature review indicates that the selected methods have strengths and limitations in various applications; however, our study evaluates them to segment reef community mapping based on satellite-based image data.
In the case of hyperspectral and multispectral data, which feature multiple bands and thus a large volume of data, dimensionality reduction methods, such as principal component analysis (PCA) [36], can reduce the number of bands in order to make the data suitable for clustering methods. Such methods have been used in remote sensing applications, such as mineral exploration with satellite data [25]. Existing work on creating reef maps using supervised learning techniques can be utilised for qualitative comparisons with clustering results and for labelling the regions. The success of clustering techniques in mapping regions and in the field of geosciences inspires the use of the technique for the benthic [37] and geomorphic mapping [38] of coral reefs. Although remote sensing data (multispectral and hyperspectral) have been used for coral reef mapping [6,39-43], little work has been conducted using clustering techniques, particularly with open source software frameworks. There are proprietary remote sensing software suites [44,45] that have inbuilt features for reef mapping, but these are not easily available. This is a problem not just for reproducible research, but also for the application of such methods in developing countries, where such software suites are not economically viable for research purposes; this slows down research and development in reef monitoring, which is a major focus of climate change-related research. The main problem in reef community mapping is to automatically detect different communities of coral systems, which is challenging with limited, noisy and sparse datasets.
In this paper, we present an unsupervised machine learning framework using novel clustering methods for the detection and mapping of coral reef habitats with remote sensing. We present Reef-Insight, a framework for reef community mapping using remote sensing with which we compared four major clustering approaches in order to determine which method is the most suitable based on qualitative and visual assessments. The clustering methods include k-means, GMM, agglomerative clustering, and DBSCAN. We utilise a remote sensing dataset from One Tree Island in the Southern Great Barrier Reef in Australia. Our framework provides the detection and generation of detailed maps that highlight distinct reef habitats that can guide scientists and policymakers in reef restoration projects.
Our key innovation is in the development of a framework that can take different forms of data (benthic and geomorphic maps) and evaluate prominent clustering methods for reef community mapping.
The rest of the paper is organised as follows: In Section 2, we present the proposed methodology, followed by experiments and results in Section 3. Section 4 provides a discussion, and Section 5 provides the conclusion with directions for future work.

Study Area
One Tree Island (located at 23°30′30″ south, 152°05′30″ east) is a coral reef in the southern Great Barrier Reef. It is a part of the Capricorn-Bunker group about 90 km east of Port Gladstone in Queensland, Australia. The University of Sydney maintains the research station on the island, and as such, the One Tree Island reef has been the subject of detailed biological and geological investigation over the past four decades (see [46-48]), including studies using remote sensing [49]. Hence, the reef habitats and geomorphic zones characterising the One Tree Island reef have been well studied. Our study area (Figure 1) is located at the eastern end of a coral reef that spans about 5.5 km in length and 3.5 km in width.

Data
We utilise the PlanetScope satellite imagery, which is available on the Allen Coral Atlas website. The PlanetScope (Dove) (https://earth.esa.int/eogateway/missions/planetscope, accessed on 1st June 2023) image-based data feature 3 spectral bands (red, green, and blue) at 8-bit radiometric resolution. The raw images are processed for atmosphere radiance, sensor and radiometric calibration, flatfield correction, debayering, orthorectification, and surface reflectance. Furthermore, mosaic-based processing is done to utilise the "best scene on top" technique to create the final mosaic. The mosaic process is taken from the implementation in the Allen Coral Atlas [51] (Figure 2). We create bathymetric image data at 10 m resolution with the Sentinel-2 surface reflectance dataset using Google Earth Engine (GEE) (https://developers.google.com/earth-engine/datasets/catalog/sentinel-2, accessed on 1st June 2023). Finally, we create a single mosaic (16-bit integer) by aggregating the median value of the input data over a period of 12 months. We utilise this bathymetric information for creating geomorphic maps.
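The median aggregation described above can be sketched as follows. This is a minimal illustration in plain Python, not the Google Earth Engine workflow used in the study: the function name `median_mosaic`, the use of `None` to mark masked pixels, and the toy reflectance values are all assumptions made for the example.

```python
from statistics import median


def median_mosaic(scenes):
    """Per-pixel median over a stack of equally sized 2-D scenes.

    `None` marks a masked pixel and is ignored. A pixel that is masked
    in every scene stays `None` in the output. Medians are truncated to
    integers to mimic an integer-valued mosaic.
    """
    rows, cols = len(scenes[0]), len(scenes[0][0])
    mosaic = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            values = [s[r][c] for s in scenes if s[r][c] is not None]
            out_row.append(int(median(values)) if values else None)
        mosaic.append(out_row)
    return mosaic


# Three toy scenes of a 2 x 2 area (hypothetical values).
scenes = [
    [[100, 220], [None, 40]],
    [[120, 200], [300, 45]],
    [[110, 900], [310, 50]],  # 900 mimics a contaminated pixel
]
mosaic = median_mosaic(scenes)
```

The median is a common choice for compositing because a single outlying observation, such as the value 900 in the toy data, does not shift the result.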

Benthic and Geomorphic Regions in Reef
The benthic [37] mapping of coral reefs draws on aerial imagery, underwater photos, acoustic surveys, and data from sediment samples. The benthic zone refers to the ecological region at the lowest level of a body of water, such as an ocean or the waters of a coral reef community, and includes the sediment surface. Geomorphology refers to the evolution of topographic and bathymetric features through physical, chemical, and biological processes on the Earth's surface [52]. Geomorphic coral reef mapping [38] refers to the mapping of topographic and bathymetric features of reef habitats [53].

Clustering Techniques

K-Means Clustering
K-means clustering is an algorithm that divides data into a set of $k$ clusters based upon a distance metric [32,54]. Given a dataset of $N$ samples $(x_1, x_2, \ldots, x_N)$, each a $d$-dimensional vector, the algorithm partitions (groups) the data into $k$ ($\leq N$) sets $C = \{C_1, C_2, \ldots, C_k\}$. The aim of the algorithm is to minimize the error given by the within-cluster sum of squares (WCSS), which is given as the sum of squared Euclidean distances between the data samples and the corresponding centroid in the original algorithm [55]. For a single cluster $C_k$,

$$\mathrm{WCSS}_k = \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $x_i$ is a data sample belonging to cluster $C_k$ and $\mu_k$ is the mean of the samples in cluster $C_k$. We assign each data sample to a given cluster such that the WCSS error to their assigned cluster centres, $\mu_k$, is minimised. The total WCSS error is given as follows:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

Although k-means clustering has been prominent in tabular data, it can also be used on image and remote sensing data for segmentation, which is also the focus of this paper. There have been applications of k-means clustering for remote sensing-based image segmentation, change detection, and land cover classification. Theiler et al. [56] proposed a variation of the k-means algorithm to utilise both the spectral and spatial properties of satellite imagery for image segmentation. Lv et al. [57] integrated k-means clustering with an adaptive majority voting model for land cover change detection. Celik [58] used dimensional reduction with PCA and k-means clustering for the task of change detection. Abbas et al. [59] utilised k-means and ISODATA [60] (which is an extension of k-means clustering) for land cover classification using remote sensing data. These applications motivate the use of k-means clustering in our proposed framework.
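The assignment and update steps that reduce the WCSS can be sketched as follows. This is a minimal plain-Python illustration of the classic (Lloyd) iteration and not the implementation used in this study. The function names, the toy RGB pixel values, and the choice to pass the initial centroids explicitly (for a deterministic example) are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(pixels, centroids, max_iter=100):
    """Lloyd's algorithm; `centroids` holds the initial cluster centres."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each pixel goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in pixels
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean_vector(members)
    return labels, centroids


def wcss(pixels, labels, centroids):
    """Total within-cluster sum of squares for a given assignment."""
    return sum(squared_distance(p, centroids[lab])
               for p, lab in zip(pixels, labels))


# Toy RGB pixels: three dark "water-like" and two bright "sand-like" values.
pixels = [(10, 60, 120), (12, 62, 118), (200, 190, 150),
          (205, 195, 155), (11, 61, 119)]
labels, centroids = kmeans(pixels, centroids=[pixels[0], pixels[2]])
total_error = wcss(pixels, labels, centroids)
```

For image segmentation, each pixel's band values form one sample, so an image of H x W pixels with three bands would be flattened to H*W three-dimensional vectors before clustering and the labels reshaped back to H x W afterwards.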

GMM
A GMM is based on a probabilistic model that assumes that data are generated from a mixture of Gaussian distributions with parameters that are adjusted by training. A GMM is useful for clustering, anomaly detection, and density estimation [61]. It consists of three parameters, which include mean ($\mu$), which defines the centre of each of the Gaussian distributions; covariance ($\Sigma$), which represents the spread; and mixing probability ($\pi$), defining the weight of the respective Gaussian distribution. The mixing coefficients for each cluster ($k$) are themselves probabilities ($\pi_k$), and must have a sum of 1:

$$\sum_{k} \pi_k = 1, \qquad 0 \leq \pi_k \leq 1$$
In contrast to GMMs, k-means clustering effectively places a circle (a hypersphere in the case of higher dimensions) at the centre of each cluster, with a radius defined by the most distant point in the cluster; GMMs can additionally handle oblong and ellipsoidal forms of clusters. The applications of GMMs in remote sensing data processing include image clustering, segmentation, and synthetic data generation. Bei et al. [62] presented an improved GMM that takes into account spatial information to improve image clustering. Yin et al. [63] combined the fuzzy region competition method with a GMM for image segmentation. Davari et al. [64] utilised a GMM for hyperspectral remote sensing that featured the challenge of large dimensions (features) with fewer training data points. Neagoe et al. [65] presented a cascade of k-means clustering and a GMM for semi-supervised classification.
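The role of the mean, spread, and mixing coefficients can be illustrated with a small expectation-maximisation (EM) sketch. This is a deliberately simplified one-dimensional example in plain Python, so each component has a scalar variance rather than a full covariance matrix. It is not the implementation used in this study; the function names, the variance floor, the initial values, and the toy data are assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, means, n_iter=25, var_floor=1e-4):
    """Fit a 1-D Gaussian mixture with EM, starting from `means`."""
    k = len(means)
    weights = [1.0 / k] * k          # mixing coefficients, sum to 1
    variances = [0.05] * k           # assumed initial spread
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        new_w, new_m, new_v = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mj = sum(r[j] * x for r, x in zip(resp, data)) / nj
            vj = sum(r[j] * (x - mj) ** 2 for r, x in zip(resp, data)) / nj
            new_w.append(nj / len(data))
            new_m.append(mj)
            new_v.append(max(vj, var_floor))  # avoid collapsing variances
        weights, means, variances = new_w, new_m, new_v
    return weights, means, variances, resp


# Toy single-band values with two well-separated groups.
data = [0.10, 0.12, 0.11, 0.80, 0.82, 0.79]
weights, means, variances, resp = fit_gmm_1d(data, means=[0.0, 1.0])
# Hard cluster labels: the component with the largest responsibility.
gmm_labels = [row.index(max(row)) for row in resp]
```

Unlike k-means, the responsibilities give a soft assignment, so each pixel carries a probability of belonging to every cluster before the hard label is taken.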

Agglomerative Clustering
Agglomerative hierarchical clustering, also known as agglomerative nesting (AGNES), is the most common type of hierarchical clustering used to group data samples in clusters based on their similarity [66][67][68]. The algorithm begins by treating each data instance as a singleton cluster, and pairs of clusters are merged until all clusters have been merged into a large cluster featuring all the data. This produces a dendrogram, which is a tree-based representation of the data samples. It produces a flexible and informative cluster tree instead of forcing users to choose a particular number of clusters, such as determining k in the k-means algorithm. Goncalves et al. [69] proposed an unsupervised clustering method combining self-organising maps (SOMs) with AGNES for the classification of remotely sensed data. Liu et al. [70] used hierarchical clustering for the image segmentation of high-resolution remote sensing images.
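The bottom-up merging described above can be sketched as follows. This is a naive plain-Python illustration, not the implementation used in this study. The choice of average linkage, the function names, and the toy points are assumptions; the recorded merge history corresponds to the information a dendrogram would display.

```python
import math


def average_linkage(points, a, b):
    """Mean pairwise Euclidean distance between two clusters of indices."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []                                   # dendrogram-like history
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(points, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels, merges


# Toy one-dimensional samples forming three obvious groups.
points = [(0.0,), (0.1,), (5.0,), (5.2,), (10.0,)]
agg_labels, merges = agglomerate(points, n_clusters=3)
```

Because every merge step compares all pairs of clusters, the cost of this naive version grows rapidly with the number of samples, which is why hierarchical clustering becomes expensive on full-resolution imagery.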

DBSCAN
DBSCAN [34] uses local density estimation to identify clusters of arbitrary shapes, which is not easily possible with traditional methods, such as k-means clustering. In DBSCAN, the data samples are seen as core points (density), reachable points, and outliers. The algorithm counts how many samples are located within a small distance from each core point and marks a region called the neighbourhood. The data samples in the neighbourhood of a core sample belong to the same cluster. This neighbourhood may include other core instances; therefore, a long sequence of neighbouring core instances forms a single cluster. Any sample that is not a core sample and does not have one in its neighbourhood is considered an anomaly. DBSCAN clustering has been prominent in a number of applications with tabular data [71] and has been used for remote sensing data. Wang et al. [72] presented an improved DBSCAN method for Lidar data, and the results showed that it could segment different types of point clouds with higher accuracy in a robust manner. Liang Zhang et al. [73] utilised DBSCAN clustering in their adaptive superpixel generation algorithm for synthetic-aperture radar (SAR) imagery. Liujun Zhu et al. [74] used DBSCAN for vegetation change detection using multi-temporal analysis.
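The notions of core samples, neighbourhoods, and outliers can be sketched as follows. This is a minimal plain-Python illustration with a brute-force neighbour search, not the implementation used in this study. The parameter names `eps` and `min_samples`, the label `-1` for outliers, and the toy points are assumptions.

```python
import math

NOISE = -1  # label given to samples that belong to no cluster


def region_query(points, i, eps):
    """Indices of all samples within distance `eps` of sample `i` (inclusive)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each sample with a cluster index, or NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border sample
            continue
        cluster += 1           # sample i is a core sample: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border sample, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is also core: keep growing
    return labels


# Two dense groups and one isolated sample.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
db_labels = dbscan(points, eps=1.5, min_samples=3)
```

Note that the number of clusters is not specified in advance; it follows from the density parameters, which is one reason DBSCAN can return many small clusters on noisy imagery.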

Framework
Next, we present the framework that incorporates the different clustering algorithms, i.e., k-means, GMM, agglomerative clustering, and DBSCAN, for the segmentation of two different types of satellite imagery to create coral reef maps (Figure 3). The initial step is to acquire remote sensing-based imagery of the coral region of interest ( Figure 3-Step a). The coral reef mosaic data taken from the Allen Coral Atlas [51] utilise sensor and radiometric calibration for image processing. Moreover, they employ the "best scene on top" (BOT) technique in the mosaicking process of PlanetScope imagery [51]. In Step b, we check what type of reef community mapping is desired by the user, i.e., benthic or geomorphic mapping. We then create a geomorphic map of the region, where the bathymetric data (Step c) are concatenated with the imagery obtained in the previous step. Next, we evaluate the clustering algorithms (Figure 3-Step d) to create clustering regions (segments). We need to evaluate the results and thus need a way to ensure that the acquired segments are meaningful. Hence, we apply qualitative analysis, where we assign each cluster a colour according to the map used for comparison; then, we compare the maps qualitatively (visually), side by side (Figure 3-Step e). This helps in assigning the labels to the clusters based on the visual similarities to the existing maps. If the results obtained are unsatisfactory, the clustering algorithm is again applied to the data with new parameters (Step g). The final step incorporates map refinement and clean-up (Step h), wherein we merge the extra clusters with the closest region of interest [51] to generate the coral reef map of interest (Step i). Map refinement by logical rules (Step h) remaps the smaller, excess clusters to the major cluster of a given label surrounding them. This ensures that the smaller clusters are merged to obtain a refined map with only the regions having labels of interest.
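The map refinement step (Step h) can be illustrated with a simple neighbourhood rule. This is a hypothetical sketch in plain Python of one way such a logical rule could be written, not the rule set used in this study: the function name `refine_map`, the 8-neighbour majority vote, and the label names in the toy map are assumptions.

```python
from collections import Counter


def refine_map(label_map, excess_labels):
    """Remap pixels of excess clusters to the majority label around them.

    Each pixel whose label is in `excess_labels` takes the most frequent
    non-excess label in its 8-neighbourhood. The pass is repeated so that
    larger excess patches are filled in from their edges.
    """
    rows, cols = len(label_map), len(label_map[0])
    refined = [row[:] for row in label_map]
    changed = True
    while changed:
        changed = False
        snapshot = [row[:] for row in refined]
        for r in range(rows):
            for c in range(cols):
                if snapshot[r][c] not in excess_labels:
                    continue
                votes = Counter(
                    snapshot[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))
                    if snapshot[rr][cc] not in excess_labels
                )
                if votes:
                    refined[r][c] = votes.most_common(1)[0][0]
                    changed = True
    return refined


# Toy clustered map with one small excess cluster inside a sand region.
raw_map = [
    ["ocean", "ocean", "ocean", "ocean"],
    ["ocean", "sand", "sand", "ocean"],
    ["ocean", "sand", "extra", "sand"],
    ["ocean", "sand", "sand", "sand"],
]
refined_map = refine_map(raw_map, excess_labels={"extra"})
```

In practice, which clusters count as excess would be decided during the visual comparison with the reference map, so the rule depends on the labels assigned in Step e.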

K-Means and GMM Clustering
We begin by finding the optimal number of clusters for the k-means and GMM clustering methods in our framework. In the case of k-means clustering, we use the elbow method, which plots the WCSS (the sum of squared distances between each data point and its cluster centre) against the number of clusters (k). The point where the curve starts to flatten, resembling an elbow, is chosen as the k value. In Figure 4, an elbow can be seen at k = 3. In the case of GMM, we use the Bayesian information criterion (BIC) to find the value of k. The gradient of the BIC score curve is used, much like the elbow of the WCSS curve, to estimate the optimal number of clusters for the data. A lower score indicates that the model better fits the data. However, in order to avoid over-fitting, the criterion penalises models with a large number of clusters. In Figure 4, we select the point that reflects the major change in the gradient, which can be seen at k = 2. In our case, we provide a visual comparison to evaluate the methods, i.e., a qualitative comparison of the clustered maps of the reef and the Allen Coral Atlas. Hence, while keeping the elbow method and BIC score in mind, we review the results obtained using the clustering methods and choose k based on maximum resemblance to the regions of interest in the coral maps (e.g., Figure 5b).
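The two selection criteria can be sketched numerically as follows. In this study the elbow and the change in the BIC gradient are read from the plots, so the automatic rule below (largest drop in the rate of WCSS decrease) is only one possible heuristic and is an assumption, as are the function names and the toy WCSS values. The BIC expression and the parameter count for a full-covariance GMM are the standard definitions.

```python
import math


def elbow_k(wcss_by_k):
    """Pick k where the decrease in WCSS slows down the most.

    `wcss_by_k` maps consecutive k values to their WCSS. The bend at k is
    the drop from k-1 to k minus the drop from k to k+1.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = ((wcss_by_k[prev_k] - wcss_by_k[k])
                - (wcss_by_k[k] - wcss_by_k[next_k]))
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


def gmm_n_params(k, d):
    """Free parameters of a k-component, d-dimensional full-covariance GMM:
    k*d means, k*d*(d+1)/2 covariance entries, and k-1 mixing weights."""
    return k * d + k * d * (d + 1) // 2 + (k - 1)


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower values indicate a better model."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Hypothetical WCSS values for k = 1..5 (not results from this study).
toy_wcss = {1: 1000.0, 2: 700.0, 3: 200.0, 4: 170.0, 5: 150.0}
chosen_k = elbow_k(toy_wcss)
params_k4_rgb = gmm_n_params(k=4, d=3)
```

The parameter count shows why the BIC penalty grows with k: every extra component of a three-band, full-covariance model adds ten free parameters.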

Comparison of Selected Clustering Results
Figures 6 and 7 present the benthic and geomorphic maps generated with the four clustering methods considered in our framework. Agglomerative clustering is computationally expensive; hence, we down-scaled the data to 20% of the original size, which led to the loss of finer details of the areas of interest (Figures 6d and 7d). DBSCAN gave an adequate result, especially for the geomorphic map (Figure 6c) of the region, by removing the bathymetric noise and focusing on the reef area. However, it produced a large number of small clusters, which is not ideal for the subsequent map refinement step and the visual comparison with existing maps. The k-means results are given in Figures 6a and 7a. The GMM results given in Figures 6b and 7b are satisfactory for both benthic and geomorphic mapping when compared with DBSCAN and AGNES.

Comparison with Allen Coral Atlas
Next, we execute the k-means and GMM clustering methods with a selected number of clusters (k = 4) that gives maximum resemblance to actual data. In the GMM, we set the covariance type parameter to full to ensure that each component of the GMM has its own general covariance matrix. The preliminary results obtained (Figure 8a) represent the four clusters. We observe that the GMM clustering method has extracted the flat sand region, represented by the black cluster, well. We then combine the clustered region and the sand region to obtain the final result. The clustering results of generating benthic coral maps obtained with k-means (Figure 9b) and the GMM (Figure 9c) showcase three clusters, namely, ocean, sand, and rock/rubble. Upon visual comparison with the benthic map from the Allen Coral Atlas (Figure 9a), we can see that GMM provides clusters with higher similarity than k-means clustering. Figure 8 shows the results of the logical rules (Figure 3-Step h) used to create the benthic map generated using GMM. Figure 8b has a black cluster that was remapped to a sand (yellow) cluster in the map refinement stage (Figure 9c).
In the case of the geomorphic map, we set the number of clusters (k = 7) for both methods, i.e., k-means and GMM. The preliminary results obtained using the GMM (Figure 8b) provide extra clusters in the ocean region by making a distinction in water bathymetry. We combine the clusters in the ocean region by visually comparing them with the Allen Coral Atlas geomorphic map. We observe that the final geomorphic maps generated using k-means (Figure 10b) and GMM (Figure 10c) contain four clusters: reef flat, lagoon/plateau, reef slope, and ocean. The reef flat and the lagoon/plateau region created with the GMM had a greater resemblance to the original geomorphic map given by the Allen Coral Atlas (Figure 10a). A general limitation in clustering methods for reef habitat mapping is the classification of local regions, which is due to a lack of labelled data. Nevertheless, this approach can be useful for gathering a basic overview of reef habitats without the need for manual labelling, which is a labour-intensive task.

Discussion
In this study, we presented a framework that processes remote sensing data and compared four clustering methods (k-means, GMM, AGNES, and DBSCAN) for generating reef habitat maps using remote sensing data of the One Tree Island reef. We utilised the Allen Coral Atlas mosaic of the region to generate benthic maps using clustering methods and incorporated bathymetric data for understanding the geomorphic habitats of the region. We selected the appropriate clustering method based on visual comparative analysis to ensure that the clustered regions took into account the local field knowledge and geology. Thus, we generated a three-class benthic map distinguishing among sand, rock/rubble, and ocean, and a four-class geomorphic map identifying reef flat, reef slope, lagoon, and ocean regions on the map. The quality of segmentation depended on parameter tuning and the choice of the clustering method.
The coral maps obtained with the proposed framework can be considered preliminary maps for understanding the geomorphology of the region of interest. Our framework can provide additional support to the supervised methods that are mainly used for reef mapping, such as the one used in the Allen Coral Atlas. In future studies, our framework can be used to understand more about surface geology in coral reefs, given the sea-level rise over thousands of years. Since there are drilled reef-core data of the site that give insights into the evolution of coral reef structures over thousands of years, this can help extend software such as BayesReef [75], which uses Bayesian inference for simulating one-dimensional reef cores, to three-dimensional reef evolution simulations.
In terms of the limitations of the framework, we note that there is a lengthy process of taking into account the visual comparison to find the optimal parameters for reef mapping. Moreover, it is also difficult to assess the accuracy of clusters on their own, and field knowledge and labelled maps to allocate labels to each clustering region are required. However, the study has revealed certain combinations of hyper-parameters (k value) that are useful for reef areas, and the same can be used in future studies in which the framework is applied to other regions. Furthermore, the current study considered a relatively small area, and it can be a challenge if clustering methods are used on a large area, such as the entire GBR region. The framework would then need to be extended using a distributed/parallel computing infrastructure so that the method can work with smaller regions, i.e., large regions can be divided using a grid and the results can be combined.
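The grid-based division of a large region mentioned above can be sketched as follows. This is a hypothetical plain-Python illustration only: the function name `grid_tiles`, the half-open pixel bounds, and the example dimensions are assumptions, and no distributed infrastructure is shown.

```python
def grid_tiles(height, width, tile_size):
    """Yield (top, left, bottom, right) pixel bounds of square grid tiles.

    Bounds are half-open, so a tile covers rows top..bottom-1 and columns
    left..right-1. Tiles on the lower and right edges may be smaller.
    """
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            yield (top, left,
                   min(top + tile_size, height),
                   min(left + tile_size, width))


# A 5 x 7 pixel scene split into tiles of at most 3 x 3 pixels.
tiles = list(grid_tiles(height=5, width=7, tile_size=3))
covered = sum((b - t) * (r - l) for t, l, b, r in tiles)
```

Each tile could then be clustered independently, possibly in parallel. Since cluster indices produced on separate tiles are not comparable, combining the results would require an additional step that matches clusters across tile boundaries, for example via the label assignment against a reference map.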
We note that we did not consider data with a temporal dimension, as they were not available in our dataset. In the future, given the availability of temporal satellite data, our framework could be extended to evaluate decadal changes in reef habitat mapping that can capture extreme climate events, such as cyclones, and other events, such as tsunamis, which have devastating impacts on coral reef systems [76]. In such a study, the need for parameter tuning and the evaluation of clustering methods can be reduced by using the results from this study. Temporal data can also be used to study short-term seasonal changes in reef habitats. The quality of segmentation depends on parameter tuning and the choice of the clustering method. Our paper evaluates different clustering algorithms and recommends the best for this problem. The method is replicable and readily available, given the results of the study and the availability of code and data.

Conclusions and Future Work
In this study, we presented a framework to compare different clustering methods for the task of reef habitat mapping using unlabelled remote sensing data. We used One Tree Island of the GBR to demonstrate the effectiveness of the framework. The framework transformed the raw clusters into a reef habitat map using field knowledge and map refinement operations based on logical rules that were gathered from expert knowledge. The results show that the k-means and GMM clustering methods are the most suitable for benthic and geomorphic reef mapping, as these methods created the maps that were the most visually similar to the maps obtained using related methods (Coral Atlas).
In future work, our framework can be used for reef change detection, especially when field inspection cannot be easily conducted; e.g., in case of natural disasters such as tsunamis, storms, and cyclones. The framework can help assess the impact of extreme climate events (cyclones and storms) on reef habitats, which can play a crucial role in reef restoration projects. Furthermore, the framework can also be utilised for generating maps using remote sensing data of the regions for which labelled data are unavailable, such as remote sensing data obtained from Mars and Moon exploration projects. Our framework is a way to address the challenges faced by reef scientists that involve finding labelled data for analysis and the need for manually labelling reef regions, especially in large regions. It can be considered a low-cost and robust approach to working with raw data during the exploration stage of a research study. In future work, our framework can be extended with other clustering methods and further validated using regions with labelled reef data.

Conflicts of Interest:
The authors declare no conflict of interest.