
An Improved Machine Learning-Based Method for Unsupervised Characterisation for Coral Reef Monitoring in Earth Observation Time-Series Data

1 Department of Earth Science and Engineering, Imperial College London, London SW7 2AZ, UK
2 Digital Environment Research Institute, Queen Mary University of London, London E1 4NS, UK
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(7), 1244; https://doi.org/10.3390/rs17071244
Submission received: 25 November 2024 / Revised: 7 February 2025 / Accepted: 13 March 2025 / Published: 1 April 2025

Abstract

This study presents an innovative approach to automated coral reef monitoring using satellite imagery, addressing challenges in image quality assessment and correction. The method employs Principal Component Analysis (PCA) coupled with clustering for efficient image selection and quality evaluation, followed by a machine learning-based cloud removal technique using an XGBoost model trained to detect land and cloudy pixels over water. The workflow incorporates depth correction using Lyzenga’s algorithm and superpixel analysis, culminating in an unsupervised classification of reef areas using KMeans. Results demonstrate the effectiveness of this approach in producing consistent, interpretable classifications of reef ecosystems across different imaging conditions. This study highlights the potential for scalable, autonomous monitoring of coral reefs, offering valuable insights for conservation efforts and climate change impact assessment in shallow marine environments.

1. Introduction

Space-borne Earth Observation (EO) provides the only data that are globally available, regularly collected, consistent, objective, and openly accessible for monitoring climate and environmental changes at the scale of our planet. These data include long-term archives from 1972 with the Landsat missions, the IKONOS mission, Sentinel-2A from 2015 and 2B from 2017, as well as the PlanetScope constellation of satellites from 2014 [1,2,3,4]. Imagery from these satellites effectively describes the state of ground targets, thereby offering valuable insights into the state of the landscape at a given point in time. Satellite image datasets encompass pixel-scale information potentially addressing a broad range of objectives, from vegetation and crop mapping to local ecological surveillance to tracking carbon budgets [5], alongside the unknown economic impacts of climate change [6]. Over the past 40 years, the crucial yet challenging task of monitoring coral reefs has been undertaken, with data-gathering initiatives tracing back to as early as the 1960s [7], and more comprehensive databases spanning from the 1980s to 2022, including some citizen science datasets [8,9]. Such datasets are often collected through underwater diving [10] or drone surveys [11,12], with many studies carried out by NGOs (and only a limited number of studies showing repeated site visits [13]). Such approaches are often combined with satellite imagery [14] to create more detailed classifications, often entailing extensive image corrections with iterative editing until the desired accuracy is achieved [15]. Considering the threats posed by climate change and anthropogenic activities [16], and the rapid temperature rise that has led to a reduction in both coral cover and diversity [17,18,19,20], there is an immediate and pressing need for accurate and swift global coral reef monitoring via EO, as careful oceanic management is also one of the UN's 17 Sustainable Development Goals [21]. The workflow presented here aims to complement such conventional approaches through the use of machine learning.
The monitoring of coral reefs using satellite imagery is crucial for understanding the health and dynamics of these vital ecosystems on a broad spatial scale. Traditional methods of assessing satellite image quality involve manual inspection by experts, which is both time-consuming and subject to variability in judgment. Additionally, the selection of images often relies on the availability of validation data and automatically extracted cloud cover, which may not always be up-to-date or representative of the part of the scene where coral reefs occur. This raises a critical question: how much of the available satellite data is usable for accurate and timely reef monitoring?
Scientific studies that focus on bathymetry or other environmental targets rely on high data integrity of reflectance measurements across the spectrum. Processing of satellite imagery for marine environments is a complex and imperfect task, with several crucial steps necessary to ensure high levels of data reliability and usability. Automating the image selection and correction process is essential to improve these data products further. Truly automated image processing requires advanced techniques to achieve high-quality data at local and regional scales, ensuring that the data used for subsequent analysis are both reliable and consistent. The image correction workflow typically comprises steps applied globally, such as atmospheric corrections and cloud masking, and context-dependent steps, such as deglinting, which is required for marine monitoring applications [20,22,23,24]. Atmospheric corrections, which account for the effects of the atmosphere on the signal received by the satellite sensor, are the first corrections essential to obtain reliable reflectance values [25,26,27]. For ESA satellites, the atmospheric correction algorithm sen2cor operates by using the Copernicus Atmosphere Monitoring Service (CAMS) or identifying Dark Dense Vegetation (DDV) [28] to correct for the effects of atmospheric scattering and absorption. The assumption of the presence of DDV holds true for land imagery [29] but not necessarily for bodies of water. Another key assumption of this method is that the surface's reflectance is Lambertian, meaning that the surface scatters incident sunlight uniformly in all directions. This is not true for water bodies, where the reflectance and absorption coefficients do not vary linearly [30] and reflectance may be specular and/or Lambertian in varying degrees. Experiments suggest that Landsat, for example, can give fairly accurate reflectance up to 5 m water depth [31]. This is relevant as the Sentinel-2 and PlanetScope constellation bands are designed to complement the Landsat bands [32].
The second correction of interest, also needed for an automated workflow, is cloud screening, i.e., the identification of cloud vs. non-cloud (invalid vs. valid) pixels. Many cloud detection algorithms exist, with f-mask [33] being very commonly used and s2cloudless [34] being a more recent but also widely used algorithm. Approaches to this task range from object detection algorithms to pixel-based machine learning and deep learning. The majority of these methods are tailored for use over land and are less effective over water bodies. As suggested by [34], there is still disagreement on the definition of transparent clouds, which is task-dependent. The choice of a cloud detection algorithm, as well as the selection of appropriate imagery, is often tailored to the specific problem at hand. Notably, these algorithms are typically optimised for terrestrial (land) applications, reflecting the primarily land-focussed objectives of the Sentinel-2 mission [35].
Image balancing techniques such as linear contrast enhancement, the balanced contrast enhancement technique, or histogram equalisation are commonly applied to satellite imagery [36,37], with multiple methods to stretch imagery primarily to aid visualisation and/or enhance the signal. Among water column correction techniques, Lyzenga's algorithm is a widely utilised method for reducing the influence of water depth on spectral reflectance in remote sensing studies of aquatic environments [38,39]. By transforming reflectance values logarithmically and incorporating depth-dependent attenuation coefficients, the algorithm generates depth-invariant indices (DIIs), enabling accurate identification and mapping of benthic features such as coral reefs, seagrass beds, and bathymetric structures [40,41,42,43], though sensor limitations do exist [44]. This approach has proven instrumental in environmental monitoring, particularly for applications requiring reliable spectral discrimination in optically shallow waters, such as habitat mapping, ecosystem assessments, and change detection analyses.
Here, we present a novel approach utilising unsupervised machine learning (ML) algorithms to improve marine scene correction and detection. We have three specific aims that collectively address the limitations in satellite imagery analysis for coral reef monitoring: (1) The quick look method for time-series image quality: we develop and implement a method for time-series quality assessment for quick-look satellite images. This method aims to streamline the evaluation process, making it faster and more consistent compared to manual assessments for image selection. (2) Automated image correction: by leveraging techniques such as Principal Component Analysis (PCA) and superpixel analysis, we identify and correct areas of variable water depth on a local scale, enhancing the overall reliability and making colour balance more consistent across scenes. (3) Interpretable unsupervised classification: we implement classification methods to map and monitor coral reefs with corrected satellite imagery, without the need for ground control data and in an automated way.
The combination of these three objectives establishes a data-centric, unsupervised machine learning approach, and we demonstrate improved efficiency and reliability of coral reef monitoring using satellite imagery. This provides the basis for future studies into the health and changes in these ecosystems over time.

2. Sentinel-2 Data

The Sentinel-2 Multi-Spectral Instrument (MSI) has 13 spectral bands across the Visible (VIS), Near Infrared (NIR), and Short-Wave Infrared (SWIR) parts of the electromagnetic spectrum, with spatial resolutions of 10, 20, or 60 m; their characteristics are presented in Table 1. The mission has a nominal revisit time of 5 days between the two Sentinel-2 satellites (2A and 2B); whilst the wavelength ranges of their spectral bands are similar, the centre wavelengths are slightly offset (Figure 1). In this study, we focus on the Sentinel-2 scenes for tile 55LCD in North Eastern Australia, tile 34SGJ in Eastern Greece, and tile 17RQH in the Bahamas, shown in Appendix Figure A3.

3. Methods

3.1. Libraries

We used Python 3.10.4 with pandas (2.2.2) [46,47], numpy (1.26.4) [48], scikit-learn (1.5.1) [49], opencv-python (4.10.0) [50], scikit-image (0.24.0) [51], matplotlib (3.9.1) [52], seaborn (0.13.2) [53], and xgboost (2.1.1) [54].

3.2. Background on Machine Learning in Remote Sensing

Machine learning is a branch of artificial intelligence that focusses on building systems that can learn from and make decisions based only on data. By using algorithms and statistical techniques, these systems can recognise patterns automatically, without needing to be manually programmed for each specific task. Due to its ability to handle complex tasks and adapt to new information efficiently, machine learning is integral to many applications, including image and speech recognition, recommendation systems, and autonomous driving.
One major advantage of machine learning techniques in the context of remote sensing is that, being data-driven, they offer superior objectivity, stability, and consistency compared to rule-based methods [55]. This is particularly important when monitoring water and ocean environments using digital satellite image data, which represent repeatable, consistently configured, objective measurements of reflectance from the Earth's surface. Machine learning algorithms commonly used in remote sensing include Support Vector Machines (SVMs), random forests (RFs), and Gradient Boosted Trees (XGBoost—XGB), with the latter often excelling in detecting and classifying features in complex and dynamic environments [56,57].
Most research tends to focus on supervised learning algorithms for a variety of tasks [58,59,60,61], especially in remote sensing (Figure A1). In supervised learning, data are paired with a known target value or “label”. Supervised learning has been used to tackle a wide variety of challenges, ranging from coral mapping to bleaching detection and spectral unmixing, most commonly applied to Sentinel-2 data [58,59,61,62]. This approach has been applied at various scales, from the classification of individual corals to entire satellite images. It also demands the existence of ground truth data, which often are not available, and entails certain assumptions about these labelled data, including a uniform quality of labels among all labellers [63], thereby necessitating expert verification. The requirement for high-quality labelled data can often be a stumbling block, as such data can be challenging to procure, process, and label. Ground truthing can also be impossible when dealing with historical satellite imagery if no in situ data were collected at the time.

3.3. Filtering Scenes

To begin processing the imagery, the first step is to find the most suitable images, i.e., images that are broadly consistent in brightness and colour balance across time. To automate this process, we first applied Principal Component Analysis (PCA), a statistical technique that transforms the original variables into a new set of uncorrelated (i.e., mutually orthogonal) principal components oriented along the directions of maximum variance and ordered by decreasing explained variance. In this context, PCA is particularly useful for compressing image information into a lower-dimensional space by selecting a few principal components, making it easier to identify and visualise patterns and differences in the imagery [64]. This is applied to the Preview Image Files (PVIs) provided directly by the ESA from both Sentinel-2A and -2B satellites. These preview images contain the three visible bands at 490 nm, 560 nm, and 665 nm at a resolution of 320 m per pixel [35], representing a coarse-resolution RGB image that is quick to process. These images are flattened into a feature vector with 352,947 dimensions, which is then compressed with PCA to retain 95% of the variance (usually resulting in 4 dimensions); this reduces the need to download terabytes of data unnecessarily. Based on the results of the PCA, the most appropriate scenes are filtered by clustering using DBSCAN, which groups together points that are close to each other based on a specified distance measure [65]. DBSCAN identifies clusters based on the density of the data in the feature space, thus discovering clusters of arbitrary shape and handling noise. In contrast, centroid-based methods such as KMeans clustering partition the data into clusters by minimising the sum of the distances between data points and their respective cluster centroids, iteratively updating to find the optimal centres. The primary advantage of density-based methods is that the number of clusters does not need to be defined, whilst methods such as KMeans require a user-defined number of cluster centres, which may not be known a priori [64]. The algorithm examines the neighbourhood around each data point using two parameters: the neighbourhood radius and the minimum number of points required to form a cluster. We use a value of 80 for the neighbourhood parameter and 5 for the minimum number of points. These first two steps (Figure 2) are accomplished on the PVIs; subsequently, once the clear cluster is established, the Level 1C images are downloaded and the remainder of the methodology uses these data.
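As a minimal sketch of this filtering step (steps 1 and 2 in Figure 2), the snippet below applies PCA and DBSCAN with the parameter values given above; the `pvi_arrays` input (a list of pre-loaded PVI RGB arrays) and the largest-cluster heuristic are illustrative assumptions rather than the exact implementation.

```python
# Sketch of PVI filtering with PCA + DBSCAN (illustrative, not the exact code).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def filter_pvis(pvi_arrays):
    # Flatten each 320 m/pixel RGB preview into one feature vector.
    X = np.stack([img.reshape(-1) for img in pvi_arrays]).astype(np.float32)

    # Keep enough components to retain 95% of the variance (typically ~4).
    X_pca = PCA(n_components=0.95).fit_transform(X)

    # Density-based clustering; eps=80 and min_samples=5 as in the text.
    labels = DBSCAN(eps=80, min_samples=5).fit_predict(X_pca)

    # Label -1 marks noise (glinty/cloudy outliers); keep the largest cluster.
    core = np.bincount(labels[labels >= 0]).argmax()
    return np.where(labels == core)[0]  # indices of the usable scenes
```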
Changes in sunlight intensity, angle, and atmospheric conditions can significantly affect the quality of remote sensing images. By applying PCA on the PVIs, we can highlight these variations and ensure that only the images with consistent and optimal illumination conditions are selected. This process helps capture the most dense and reliable time-series of images possible. Consequently, the selected images are better suited for further analysis, providing a robust dataset for subsequent processing, such as depth correction and cloud removal.
After this initial image selection, a second filtering step is applied to the selected scenes. Individual reefs are cropped using the UNEP World Conservation Monitoring Centre global distribution of coral reefs dataset [66], which consists of separate polygons containing the geographic extent of each reef. This ensures that only the subsets of reef areas are retained, and not vast areas of open sea (step 3 in Figure 2). This step is crucial as it reduces the computational load and allows for local-scale colour corrections, since entire-scene colour transformations would result in suboptimal correction. By applying PCA once again to these cropped images, we can further filter the data for optimum visualisation and observe the variation in colour and the amount of cloud present (step 4 in Figure 2). This additional step ensures that the final dataset is of the highest quality, with minimal atmospheric and illumination artefacts, facilitating more accurate and reliable remote sensing analysis. The cropped images can also be clustered for optimal image selection.
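A cropping step of this kind might look as follows, assuming GeoTIFF scenes and the UNEP-WCMC polygons as a shapefile; the file names are placeholders, geopandas and rasterio are not among the libraries listed in Section 3.1, and the snippet is a sketch rather than the authors' implementation.

```python
# Illustrative crop of a scene to individual reef polygons (step 3 in Figure 2).
import geopandas as gpd
import rasterio
from rasterio.mask import mask

reefs = gpd.read_file("unep_wcmc_coral_reefs.shp")   # reef extent polygons

with rasterio.open("S2_L1C_scene.tif") as src:
    reefs = reefs.to_crs(src.crs)                    # match scene projection
    for geom in reefs.geometry:
        # crop=True trims the output raster to the polygon's bounding box
        patch, patch_transform = mask(src, [geom], crop=True)
```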
After the image screening steps, cloud removal over water bodies is implemented. This step leverages the intrinsic spectral properties of water across the Sentinel-2 wavelengths through a supervised machine learning technique. We implement an XGBoost model, chosen due to the performance of gradient boosting models in cloud screening tasks [34,67] as well as bathymetric and water quality analysis [68,69]. Gradient boosting models work by combining weak learners, such as decision trees, into a strong predictive model. Each new learner is trained to correct the errors of its predecessors by fitting to the negative gradient of the loss function; the algorithm builds models iteratively, with each new tree attempting to minimise the residual errors of previous trees [70]. Gradient boosting has proven its ability to automatically discover complex data structures, including nonlinearity and high-order interactions, across a variety of remote sensing tasks [71,72,73,74].
The core assumption of this approach is that incident infrared radiation is strongly absorbed by water (both moisture in the atmosphere and liquid water on the surface), so that measured reflectance over water should be near zero. By extension, any high values can be considered noise (or atmosphere) over water. Generally speaking, the visible bands penetrate water, so we can use this relationship to find both the remaining clouds and mask out the land (step 5 in Figure 2) in a single step.
Specifically, the model uses the following subset of features, shown in Table 2.
The input bands are selected because they penetrate water to varying degrees, providing the spectral response of the ocean, which is known to be low compared to that of cloud.
In contrast, the target (predicted) infrared bands are chosen because they are strongly absorbed by water and atmospheric elements. Under ideal conditions, with a clear atmosphere and no glint contamination, these bands should appear very dark or nearly black over water bodies.
The training process for the XGBoost model involves the following (a minimal sketch follows the list):
  • Preparing a training dataset of cropped Sentinel-2 image patches with a range of imaging conditions from 2015 to 2022, containing 3,150,006 unique pixel values from location 14°56′24.3″S 145°42′00.7″E.
  • Extracting the image digital number (DN) values of bands 2, 3, 4, and 8 as input values.
  • Calculating the mean value of bands 9, 10, 11, and 12 as the target variable.
  • Training the XGBoost model to predict the mean target variable based on the input values.
  • Testing the algorithm on a different geographic region within the extracted scenes from 2015 to 2022 (Figure 3), with a total of 70 test images in tile 55LCD from location 14°56′24.3″S 145°42′00.7″E.
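As referenced above, a minimal sketch of this regression setup is shown below; the random arrays stand in for the real training pixels, and the hyperparameters are illustrative assumptions rather than the paper's tuned values.

```python
# Sketch of the cloud/land regressor: inputs are DN values of bands 2, 3, 4,
# and 8; the target is the per-pixel mean DN of bands 9, 10, 11, and 12.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 10_000, size=(100_000, 4)).astype(np.float32)  # B2,B3,B4,B8
y = rng.integers(0, 10_000, size=100_000).astype(np.float32)       # mean(B9..B12)

model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X, y)  # learns DN(visible/NIR) -> mean DN(infrared) over water
```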
This data-driven approach requires minimal human input: the only manual step is the initial selection of a threshold for the predicted values. Because the algorithm predicts the mean digital number of the target bands, a thresholding operation is necessary to create a binary mask separating areas of likely cloud or land from clear water. This removes the need for hand-crafted band maths, as is currently performed by default in Level-2 processed imagery [75], since here the relationship is computed directly.
To use the trained model for cloud and land masking, we perform the following (sketched in code after the list):
  • Apply the model to new Sentinel-2 images, predicting the mean value of bands 9, 10, 11, and 12.
  • Apply the predetermined threshold to the predicted values (the mean of the target bands).
  • Create a binary mask where values above the threshold are classified as cloud/land, and values below are classified as clear water.
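Continuing the sketch above, applying the model and the manually chosen DN threshold might look as follows (`model` is the regressor from the previous snippet; the helper itself is hypothetical):

```python
def cloud_land_mask(scene_b2348, model, threshold):
    # scene_b2348: H x W x 4 array of DN values for bands 2, 3, 4, and 8
    h, w, _ = scene_b2348.shape
    pred = model.predict(scene_b2348.reshape(-1, 4))  # predicted mean of B9-B12
    return (pred > threshold).reshape(h, w)           # True = likely cloud/land
```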
Figure 3. True colour composite images and their products after cloud masking; mask products in grey, blue arrows indicate false detections. (A1–A5): original cropped L1C image (acquired 15-06-2021), with the entire area affected by cirrus clouds; the fmask algorithm [33,76] is overly sensitive, while UKIS [77] and s2cloudless [34] are insensitive, resulting in almost no available imagery and cloud-covered areas, respectively; and ours. (B1–B5): original image (acquired 11-07-2016); similarly, fmask overestimates cloudy areas, UKIS underestimates, s2cloudless is very close, and finally ours.

3.4. Colour Correction

The next step of the overall method employs Lyzenga’s algorithm [31,78] to calculate the DIIs, which are essential for correcting the entire image and its bands based on deep water areas. Lyzenga’s method assumes that the radiance reflected from the bottom surface of the water body is approximately a linear function of surface reflectance and an exponential function of water depth. This approach helps to standardise the radiance values across different depths, enhancing the accuracy of remote sensing data for aquatic environments [31].
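For reference, the depth-invariant index for a band pair (i, j) follows the standard formulation of [31,78] (the notation below is reproduced from those works rather than from this paper):

$$X_i = \ln(L_i - L_{\infty i}), \qquad \mathrm{DII}_{ij} = X_i - \frac{k_i}{k_j} X_j, \qquad \frac{k_i}{k_j} = a + \sqrt{a^2 + 1}, \qquad a = \frac{\sigma_{ii} - \sigma_{jj}}{2\sigma_{ij}}$$

where $L_i$ is the measured radiance in band $i$, $L_{\infty i}$ is the deep-water radiance, and $\sigma_{ii}$, $\sigma_{jj}$, and $\sigma_{ij}$ are the variances and covariance of the log-transformed radiances computed over an area of uniform bottom type at varying depth.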
To automate this task, we use the Simple Linear Iterative Clustering (SLIC) technique to create image patches or superpixels [79]. Superpixels are groups of pixels that share similar characteristics, such as colour and texture, and are used to simplify image processing tasks by reducing the number of elements to be processed. These superpixels are intended to adhere to natural boundaries within the image, ensuring that they capture homogeneous regions effectively (Figure 4C).
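A minimal sketch of this segmentation step using scikit-image's SLIC implementation is shown below; the n_segments and compactness values are illustrative assumptions rather than the paper's settings.

```python
# Sketch of superpixel generation with SLIC (scikit-image).
from skimage.segmentation import slic

def make_superpixels(rgb_image):
    # Returns an integer label map: pixels sharing a label form one superpixel.
    return slic(rgb_image, n_segments=500, compactness=10, start_label=0)
```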
Superpixels can vary in shape and size, which leads to array mismatches in statistical machine learning methods, as a fixed size is expected when inputting the data. To address this issue, superpixels are padded to create uniform shapes at the end of the array, allowing for consistent processing across the entire image. Padding, the process of adding a fixed value (zeros in this case) around an array, ensures that each superpixel can be treated as a standardised unit, facilitating the application of subsequent algorithms and corrections. Details of how the padding is performed, as well as the effects of padding vs. resizing (Figure A6), are included in the Appendix.
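Such end-of-array zero padding might be sketched as follows (assuming each superpixel patch is an H × W × 3 array; this helper is illustrative):

```python
# Zero-pad variable-sized superpixel patches to a common shape so they can be
# stacked into a fixed-size feature matrix (one flattened row per superpixel).
import numpy as np

def pad_patches(patches):
    h = max(p.shape[0] for p in patches)
    w = max(p.shape[1] for p in patches)
    padded = [np.pad(p, ((0, h - p.shape[0]), (0, w - p.shape[1]), (0, 0)))
              for p in patches]
    return np.stack(padded).reshape(len(patches), -1)
```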
To identify regions of shallow and deep water, we use the superpixels, as they adhere to natural boundaries. We first apply a new PCA transformation on the superpixels to reduce the dimensionality of the data, retaining 95% of the sum of explained variance. The principal components (Figure 4A) are then clustered using DBSCAN.
Figure 4 shows that under cloudless imaging conditions, specifically with no cirrus clouds, the surrounding water and the reef are distinctly separable. Thus, it is possible to remove the effects of the water column on overall reflectance and calculate the DIIs (Section 3.4) with the percentiles of each class (deep sea and reef), ensuring that the final image accurately represents the true signal. In this case, the changing effect of the water column with depth is corrected for, and any remaining outliers, e.g., cirrus clouds, are identified and removed.
This process avoids a problem with DIIs whereby the user is required to select areas of deep sea and of equal water depth [78], meaning that studies that rely on it [80,81,82,83,84] are difficult to replicate if the sea and sand polygons are not provided. In contrast, this process is essentially automatic.

3.5. Final Classification

After the depth correction, Otsu's thresholding method [85,86] is employed to extract the relevant area of the reef itself, as implemented in the Python library scikit-image. Otsu's thresholding is an automatic image thresholding technique that determines the optimal threshold value by minimising the intra-class variance (or, equivalently, maximising the inter-class variance) of pixel intensities. The algorithm separates the two populations (i.e., water and reef) by exhaustively searching for the threshold that minimises the weighted sum of variances of the two classes. The intra-class variance $\sigma_w^2(t)$ for a threshold $t$ is defined as

$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t)$$

where $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by threshold $t$, and $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are their respective variances. The class probabilities are computed from the histogram bins:

$$\omega_0(t) = \sum_{i=0}^{t-1} p(i), \qquad \omega_1(t) = \sum_{i=t}^{L-1} p(i)$$

where $L$ represents the number of intensity levels and $p(i)$ is the probability of intensity level $i$ occurring in the image. By minimising this intra-class variance, we can effectively distinguish between the reef and the surrounding water, facilitating the accurate extraction of the reef area. The method performs optimally when the image histogram exhibits a bimodal distribution with a clear valley between two peaks, which is why the previous processing steps are necessary: they ensure good contrast between the reef structure and the surrounding water. This process therefore eliminates the surrounding shallow-water sections that still contain the spectral signatures of the water and/or glint, as their population distributions tend to differ after processing; separating the populations becomes much simpler as the two distributions overlap less.
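In code, this thresholding step reduces to a single scikit-image call (a sketch; the mask orientation depends on how the corrected image is stretched):

```python
# Otsu thresholding of the depth-corrected image to separate reef from water.
from skimage.filters import threshold_otsu

def reef_mask(corrected_band):
    t = threshold_otsu(corrected_band)  # exhaustive search over histogram bins
    return corrected_band > t           # binary mask of the brighter class
```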
The remaining area is then masked and a simple clustering algorithm is applied, in this case KMeans, where the appropriate number of clusters is determined using the silhouette score [87]. This is sufficient to group the spectral signatures of the pixels themselves, as the majority of the irrelevant data (such as cloud, deep water, etc.) have been removed. This step is performed at a pixel level using the three visible bands that penetrate water and the NIR band, with four clusters and randomly initialised centroids for the initial model. The same centroids are then reused to calibrate the next iteration of models, i.e., once an initial model is formed, we allow for day-to-day variation between spectral signatures by retuning the centroids on the next image and limiting the allowable number of iterations to fewer than 50.
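A minimal sketch of this centroid-reuse scheme is given below (the cluster count is fixed at four as in the text; in practice it would be chosen via the silhouette score):

```python
# Per-pixel KMeans with warm-started centroids across the time-series.
import numpy as np
from sklearn.cluster import KMeans

def cluster_series(images, n_clusters=4):
    labels, centroids = [], None
    for img in images:                       # img: H x W x 4 (B2, B3, B4, B8)
        X = img.reshape(-1, img.shape[-1])
        if centroids is None:
            km = KMeans(n_clusters=n_clusters, init="random", n_init=10)
        else:
            # Re-tune the previous date's centroids, capped below 50 iterations.
            km = KMeans(n_clusters=n_clusters, init=centroids,
                        n_init=1, max_iter=49)
        labels.append(km.fit_predict(X).reshape(img.shape[:2]))
        centroids = km.cluster_centers_
    return labels
```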

4. Results

4.1. Image Selection Using PCA

Separating out the PVIs using PCA results in approximately 205 usable scenes from a total of 505; after review, approximately 16 usable images would have been omitted had we used a cloud threshold of 50%, along with approximately 28 additional reef observations from partially cloud-obscured images that were false positives. This process, tested on a consumer CPU, takes 21 s (averaged over 10 runs) to transform, train, and predict suitable imagery on all the available PVIs from 2015 to 2025 for a single location.
By applying PCA, imagery that is broadly consistent with optimal conditions clusters similarly, while poor, glint-affected images form clusters further from the main cluster. Brighter imagery is observed to plot in the higher value range of PC1, whilst PC2 indicates the position of the cloudy areas. Where the spectral reflectance of the imagery is broadly consistent, the images plot in the centre (Figure 5A). The methods tested show that this process is applicable at both 320 m and 10 m per pixel, which is a major advantage as it means it can also be applied at reef scale at a higher resolution. This is illustrated in Figure 6A,B, where the same principle is applied to a different individual reef: in terms of broad trends, images with higher overall values group close to each other as PC1 increases, while imagery with partial cloud cover groups with negative PC2 values. More significantly, we observe that scene-level cloud thresholding omits valuable information in all of the testing, whereby at a reported 100% cloud cover several clear examples are missed, as shown with the red arrows in Figure 6B.

4.2. Cloud Removal Using XGBoost

For cloud removal, simply predicting the average value of bands 9, 10, 11, and 12 results in a mean absolute error of 35 DN on unseen random data points, and a mean absolute error of 124.5 DN on a geographically split region, showing that the algorithm was able to learn the overall spectral properties of a pixel in a water-dominated location. Qualitative observations show that, on the whole, thin cirrus clouds can be removed from imagery that lies within the primary, well-calibrated image cluster. Additional results and tests are included in the Appendix (Figure A7).

4.3. Colour Correction Using Superpixels

After applying the SLIC technique and PCA transformation, several qualitative patterns were observed. The superpixels adhered well to environmental boundaries, allowing clear differentiation between shallow and deep water regions. For instance, in the corrected imagery, previously indistinct underwater features, such as coral reefs, remaining cloud boundaries, and deep sea, can be observed in the histograms (Figure 7A,B). The stretch is visibly clearer and the two populations can be separated using automatic thresholding methods.
Using SLIC [79] with the extracted superpixels and another PCA model, it is possible to linearly separate the reef from the surrounding deep sea, whilst the classification (Figure 4) uses DBSCAN [65] to eliminate any bias in choosing the number of clusters. For the simple case of a well-calibrated single image, this separates deep sea and reef very well (Figure 4A). This is then used to extract DIIs and stretch the image appropriately.
A byproduct of this process is that automatic Otsu thresholding [86] (Figure 7) can be applied to separate areas of shallow water and sun-glint. This is possible because the pixel intensity distributions of the areas are more distinctly separated (as observed in the histograms, see Figure 7), enabling the thresholding algorithm to select thresholds more effectively.
Figure 7A illustrates this improvement compared to Figure 7B, where the original image histogram shows the reef and sea are almost inseparable. This methodology is valuable as it helps identify areas of deep water and the corresponding reef ecosystem, facilitating the production of the correct image stretch.
As shown in Figure 4A, there is clear separation between segments of the deep sea, while noise is generally confined to areas of high spectral reflectance within the wave front.

4.4. Time-Series Analysis and Change Detection

To allow for change detection over time, the processed data are clustered into semantically meaningful areas (Figure 8). Noticeably, the larger areas remain clustered into the same class, with misclassifications in areas that are cloudy. It is also worth noting that errors from the extraction of the DIIs are propagated to the clustering at the next stage, underscoring the need to view the data manipulation holistically rather than as a series of independent steps.
Qualitative results show that meaningful clusters can be found even under relatively complex imaging conditions, i.e., lots of thin cirrus clouds (Figure 8D). The spectral signature of the reef belt, shoals, and inner and outer tidal channels remain relatively consistent despite the different imaging conditions across time.

5. Discussion and Conclusions

Our findings demonstrate that meaningful information can be effectively extracted from Sentinel-2 imagery using our approach. Principal Component Analysis (PCA) followed by clustering proves to be a powerful method to select high-quality images and filter out unsuitable ones. This is found to be robust even when using PVIs, which allows us to avoid unnecessarily memory-intensive operations on full Sentinel Level 1C and 2A products. By focusing on the area of interest within a scene, and capturing spatial dependencies, PCA reveals clusters of usable data that traditional cloud screening methods miss. This is because traditional methods only report overall cloud cover and may overlook imagery usable for the application of interest (Figure A4 illustrates cases where images with high cloud percentages remain usable for reef areas). Per-pixel cloud screening using XGBoost leverages the intrinsic absorption properties of water. This technique automates the detection and masking of both clouds and land, relying solely on the spectral characteristics of selected bands. In clear water, the predicted values remain low, while elevated values indicate the presence of clouds, land, or atmospheric interference. Although wave fronts are inadvertently masked due to higher reflectance, applying a predefined cutoff (Figure 3) mitigates this issue. A clear visual improvement over benchmark methods is demonstrated in the coastal areas of interest in this study. Significantly, the method is easily adaptable to other satellite sensors with similar multi-spectral bands, such as Landsat and PlanetScope.
The use of superpixel segmentation through SLIC in tandem with PCA aids in the removal of glint effects, and classification of areas of deep water and reef. By over-segmenting images into representative patches that can be clustered without prior area-specific information, the workflow supports autonomous monitoring over extensive global archives. Our method is fully automated, and represents a significant time improvement over what is typically accomplished manually, whilst retaining visual accuracy. This corrected imagery integrates smoothly with traditional clustering algorithms for change detection.
In conclusion, this study presents a scalable workflow that addresses a number of key challenges in image quality assessment, cloud masking, and water depth correction. The integration of PCA for image filtering, XGBoost for automated cloud detection, and unsupervised clustering supports efficient analysis of vast Earth Observation datasets (approximately 1.6 TB from Sentinel-2 [88]). By streamlining processing under challenging conditions such as glint and cloud interference, our approach facilitates timely, data-driven decision-making crucial for ecological monitoring and conservation in rapidly changing environments.

Author Contributions

Conceptualization, Z.A. and C.M.J.; Methodology, Z.A.; Formal analysis, Z.A.; Data curation, Z.A.; Writing—original draft, Z.A.; Writing—review & editing, P.M., R.P. and C.M.J.; Visualization, Z.A. and R.P.; Supervision, P.M. and C.M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The code used to develop this work is available at https://github.com/z-alzayer/ShallowLearn (accessed on 12 March 2025). This study uses Sentinel-2 data made available by the Copernicus Programme of the European Union and the European Space Agency (ESA).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SWIR  Short-Wave Infrared
SLIC  Simple Linear Iterative Clustering
PCA   Principal Component Analysis
EO    Earth Observation

Appendix A. Additional Details

Supervised vs. unsupervised learning is discussed in the Introduction; the argument for the unsupervised approach is that if what is observed is not representative of reality (i.e., in terms of spectral mixing), then going fully unsupervised and understanding what the data represent may be a more valid approach. Figure A1 shows that supervised workflows are generally favoured, not only within the remote sensing community but overall as well.
The section that follows includes additional relevant experiments that highlight the effectiveness of the method.
Figure A1. Publications by year from the Scopus database, showing queries for supervised remote sensing and unsupervised remote sensing; effectively twice the number of publications focus on supervised methods, both in remote sensing and in other fields.

Appendix A.1. Image PCA Experiments

Reversing the PCA to reconstruct the data, we show that the learned space is indeed freer of atmospheric noise at the cluster centre, and that the more extreme values of PC2 are related to the cloud positioning, demonstrating that the method works as intended.
Figure A2. Resampling of the PC space from the clean clusters. The original clusters (left) show that the resampling space is full of very cloudy imagery, with increasing cloud in the y direction, to the north and south, respectively. The right panel shows an overall reduced amount of cloud (though thin cirrus clouds remain present), showing that the reverse process works as intended.
Figure A3 shows that standard cloud screening is fairly inadequate for building the fullest and/or densest time-series possible. Even at a very coarse scale, our method generally functions qualitatively better than the cloud screening.
Figure A3. (A) PCA applied and plotted with the first 2 components to PVIs of tile 55LCD comparing the ESA cloud coverage assessment with the imagery itself (a total of 554 tiles are used (A)). As PC1 increases so do the amount of clouds, similarly with PC2 it increases or decreases with the orientation of the clouds, whilst the images of good quality tend to plot close to each other. Similar trends are observed in both (C) (558 tiles) and (E) (511 tiles). Variation exists depending on the data itself. (B,D,F) show a time-series comparison compared with the ESA cloud coverage assessment, highlighting that in order to build the most dense time-series, simple numerical assessments of a single value of cloud cover are inadequate.
Figure A4. (A) PCA applied to the screened PVIs, with the automated cloud coverage assessment values as a background image, showing that well-balanced images group together, whilst glinty and cloudy images also group together differently. (B) PCA cluster analysis of the same reef, showing that automated cloud coverage assessments do not reliably indicate the usability of an image. Even points at 100% cloud coverage may contain usable data. (C,D) PCA plots from a different location, representing increased cloud cover with decreasing PC2 values, and suboptimal imaging towards positive PC1 values. (E,F) PCA plots from a third location, representing increased cloud cover with negative PC2 and poor image balance towards PC1, again with 100% cloud coverage filters lose valuable data points, suggesting that the screening process through a single number is suboptimal for building large datasets.
Figure A5. PC space with multiple superpixels on multiple reefs in Eastern Australia. (A–C) all show clear separation between the reef superpixels and the surrounding sea in the scatter plot, meaning it is possible to extract them automatically, either through unsupervised clustering (in this case DBSCAN was used) or through a straight-line formula, as the populations are separable in the PC space.
Figure A6. (A,B) show the PC space of (C,D), respectively, whereby zero-padding images leaves the size of each superpixel as a principal component, whereas resizing removes this relevant information. The classes are linearly separable in a two-class situation (where one class is red and the other is blue).

Appendix A.2. Cloud Prediction

Figure A7 shows that the XGBoost model works on both a much larger-scale L1C input and on PlanetScope data.
Figure A7. Prediction on a larger-scale L1C image (10 m per pixel) and on PlanetScope data (2.5 m per pixel).

References

  1. Goward, S.N.; Williams, D.L.; Arvidson, T.; Rocchio, L.E.; Irons, J.R.; Russell, C.A.; Johnston, S.S. Landsat’s enduring legacy: Pioneering global land observations from space. Photogramm. Eng. Remote Sens. 2022, 88, 357–358. [Google Scholar] [CrossRef]
  2. eoPortal Directory. IKONOS-2. Available online: https://www.eoportal.org/satellite-missions/ikonos-2#eop-quick-facts-section (accessed on 14 May 2024).
  3. European Space Agency (ESA). Introducing Sentinel-2. 2024. Available online: https://www.esa.int/Applications/Observing_the_Earth/Copernicus/Sentinel-2 (accessed on 12 March 2025).
  4. Planet Labs PBC. PlanetScope—Planet Satellites. 2010. Available online: https://support.planet.com/hc/en-us/articles/4407820499217-PlanetScope-Constellation-Sensor-Overview (accessed on 14 May 2024).
  5. Duarte, C.M. Reviews and Syntheses: Hidden Forests, the Role of Vegetated Coastal Habitats in the Ocean Carbon Budget. Biogeosciences 2017, 14, 301–310. [Google Scholar]
  6. Tol, R. The Economic Effects of Climate Change. J. Econ. Perspect. 2009, 23, 29–51. [Google Scholar] [CrossRef]
  7. Goreau, T.F. Mass Expulsion of Zooxanthellae from Jamaican Reef Communities after Hurricane Flora. Science 1964, 145, 383–386. [Google Scholar] [PubMed]
  8. van Woesik, R.; Kratochwill, C. A Global Coral-Bleaching Database, 1980–2020. Sci. Data 2022, 9, 35058458. [Google Scholar] [CrossRef]
  9. Belbin, L.; Wallis, E.; Hobern, D.; Zerger, A. The Atlas of Living Australia: History, current state and future directions. Biodivers. Data J. 2021, 9, e65023. [Google Scholar] [CrossRef]
  10. Rogers, R.; de Oliveira Correal, G.; De Oliveira, T.C.; De Carvalho, L.L.; Mazurek, P.; Barbosa, J.E.F.; Chequer, L.; Domingos, T.F.S.; de Andrade Jandre, K.; Leão, L.S.D.; et al. Coral health rapid assessment in marginal reef sites. Mar. Biol. Res. 2014, 10, 612–624. [Google Scholar] [CrossRef]
  11. Suan, A.; Franceschini, S.; Madin, J.; Madin, E. Quantifying 3D coral reef structural complexity from 2D drone imagery using artificial intelligence. Ecol. Inform. 2025, 85, 102958. [Google Scholar]
  12. Casella, E.; Collin, A.; Harris, D.; Ferse, S.; Bejarano, S.; Parravicini, V.; Hench, J.L.; Rovere, A. Mapping coral reefs using consumer-grade drones and structure from motion photogrammetry techniques. Coral Reefs 2017, 36, 269–275. [Google Scholar] [CrossRef]
  13. McLeod, E.; Shaver, E.C.; Beger, M.; Koss, J.; Grimsditch, G. Using resilience assessments to inform the management and conservation of coral reef ecosystems. J. Environ. Manag. 2021, 277, 111384. [Google Scholar]
  14. Scopélitis, J.; Andréfouët, S.; Phinn, S.; Arroyo, L.; Dalleau, M.; Cros, A.; Chabanet, P. The next step in shallow coral reef monitoring: Combining remote sensing and in situ approaches. Mar. Pollut. Bull. 2010, 60, 1956–1968. [Google Scholar] [CrossRef] [PubMed]
  15. Andréfouët, S. Coral reef habitat mapping using remote sensing: A user vs. producer perspective. Implications for research, management and capacity building. J. Spat. Sci. 2008, 53, 113–129. [Google Scholar] [CrossRef]
  16. Hughes, T.P.; Graham, N.A.; Jackson, J.B.; Mumby, P.J.; Steneck, R.S. Rising to the Challenge of Sustaining Coral Reef Resilience. Trends Ecol. Evol. 2010, 25, 633–642. [Google Scholar] [CrossRef]
  17. Bruno, J.F.; Selig, E.R.; Casey, K.S.; Page, C.A.; Willis, B.L.; Harvell, C.D.; Sweatman, H.; Melendy, A.M. Thermal Stress and Coral Cover as Drivers of Coral Disease Outbreaks. PLoS Biol. 2007, 5, e124. [Google Scholar] [CrossRef]
  18. Pandolfi, J.M.; Bradbury, R.H.; Sala, E.; Hughes, T.P.; Bjorndal, K.A.; Cooke, R.G.; McArdle, D.; McClenachan, L.; Newman, M.J.H.; Paredes, G.; et al. Global Trajectories of the Long-Term Decline of Coral Reef Ecosystems. Science 2003, 301, 955–958. [Google Scholar] [CrossRef]
  19. Hoegh-Guldberg, O.; Mumby, P.J.; Hooten, A.J.; Steneck, R.S.; Greenfield, P.; Gomez, E.; Harvell, C.D.; Sale, P.F.; Edwards, A.J.; Caldeira, K.; et al. Coral Reefs under Rapid Climate Change and Ocean Acidification. Science 2007, 318, 1737–1742. [Google Scholar] [CrossRef] [PubMed]
  20. Wicaksono, P.; Fauzan, M.A.; Kumara, I.S.W.; Yogyantoro, R.N.; Lazuardi, W.; Zhafarina, Z. Analysis of reflectance spectra of tropical seagrass species and their value for mapping using multispectral satellite images. Int. J. Remote Sens. 2019, 40, 8955–8978. [Google Scholar] [CrossRef]
  21. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. Resolution A/RES/70/1, United Nations. 2015. Available online: https://sustainabledevelopment.un.org/content/documents/21252030%20Agenda%20for%20Sustainable%20Development%20web.pdf (accessed on 20 March 2025).
  22. Emsley, D.S. Sen2Coral: Detection of Coral Bleaching from Space. Available online: https://sen2coral.argans.co.uk/ (accessed on 1 May 2024).
  23. Chan, J.C.W. Shallow water habitats monitoring using simulated PRISMA hyperspectral data and Depth Invariant Index—The case of coral reef in Maldives. IOP Conf. Ser. Earth Environ. Sci. 2022, 1109, 012066. [Google Scholar] [CrossRef]
  24. Hedley, J.; Roelfsema, C.; Koetz, B.; Phinn, S. Capability of the Sentinel 2 mission for tropical coral reef mapping and coral bleaching detection. Remote Sens. Environ. 2012, 120, 145–155. [Google Scholar] [CrossRef]
  25. Huang, Y.; Yang, H.; Tang, S.; Liu, Y.; Liu, Y. An Appraisal of Atmospheric Correction and Inversion Algorithms for Mapping High-Resolution Bathymetry Over Coral Reef Waters. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4204511. [Google Scholar] [CrossRef]
  26. Kutser, T.; Paavel, B.; Kaljurand, K.; Ligi, M.; Randla, M. Mapping shallow waters of the Baltic Sea with Sentinel-2 imagery. In Proceedings of the 2018 IEEE/OES Baltic International Symposium (BALTIC), Klaipeda, Lithuania, 12–15 June 2018; pp. 1–6. [Google Scholar]
  27. Xu, J.; Zhao, J.; Wang, F.; Chen, Y.; Lee, Z. Detection of coral reef bleaching based on sentinel-2 multi-temporal imagery: Simulation and case study. Front. Mar. Sci. 2021, 8, 584263. [Google Scholar] [CrossRef]
  28. Main-Knorn, M.; Pflug, B.; Louis, J.; Debaecker, V.; Müller-Wilm, U.; Gascon, F. Sen2Cor for Sentinel-2. In Image and Signal Processing for Remote Sensing XXIII; SPIE: Bellingham, WA, USA, 2017; pp. 37–48. [Google Scholar] [CrossRef]
  29. Kaufman, Y.J.; Sendra, C. Algorithm for Automatic Atmospheric Corrections to Visible and Near-IR Satellite Imagery. Int. J. Remote Sens. 1988, 9, 1357–1381. [Google Scholar] [CrossRef]
  30. Whitlock, C.H.; Poole, L.R.; Usry, J.W.; Houghton, W.M.; Witte, W.G.; Morris, W.D.; Gurganus, E.A. Comparison of Reflectance with Backscatter and Absorption Parameters for Turbid Waters. Appl. Opt. 1981, 20, 517–522. [Google Scholar] [CrossRef] [PubMed]
  31. Lyzenga, D.R. Remote Sensing of Bottom Reflectance and Water Attenuation Parameters in Shallow Water Using Aircraft and Landsat Data. Int. J. Remote Sens. 1981, 2, 71–82. [Google Scholar] [CrossRef]
  32. Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s Optical High-Resolution Mission for GMES Operational Services. Remote Sens. Environ. 2012, 120, 25–36. [Google Scholar]
  33. Qiu, S.; Zhu, Z.; He, B. Fmask 4.0: Improved cloud and cloud shadow detection in Landsats 4–8 and Sentinel-2 imagery. Remote Sens. Environ. 2019, 231, 111205. [Google Scholar]
  34. Skakun, S.; Wevers, J.; Brockmann, C.; Doxani, G.; Aleksandrov, M.; Batič, M.; Frantz, D.; Gascon, F.; Gómez-Chova, L.; Hagolle, O.; et al. Cloud Mask Intercomparison eXercise (CMIX): An evaluation of cloud masking algorithms for Landsat 8 and Sentinel-2. Remote Sens. Environ. 2022, 274, 112990. [Google Scholar] [CrossRef]
  35. European Space Agency. Sentinel-2 Products Specification Document; Technical Report S2-PDGS-TAS-DI-PSD; European Space Agency: Paris, France, 2021. [Google Scholar]
  36. Guo, L.J. Balance contrast enhancement technique and its application in image colour composition. Remote Sens. 1991, 12, 2133–2151. [Google Scholar] [CrossRef]
  37. Liu, J.G.; Mason, P.J. Essential Image Processing and GIS for Remote Sensing; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
  38. Lyzenga, D.R. Passive remote sensing techniques for mapping water depth and bottom features. Appl. Opt. 1978, 17, 379–383. [Google Scholar]
  39. Harahap, S.D.; Wicaksono, P. Relative Water Column Correction Methods for Benthic Habitat Mapping in Optically Shallow Coastal Water. In Recent Research on Geotechnical Engineering, Remote Sensing, Geophysics and Earthquake Seismology; Çiner, A., Ergüler, Z.A., Bezzeghoud, M., Ustuner, M., Eshagh, M., El-Askary, H., Biswas, A., Gasperini, L., Hinzen, K.G., Karakus, M., et al., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 181–183. [Google Scholar]
  40. Thalib, M.S.; Nurdin, N.; Aris, A. The Ability of Lyzenga’s Algorithm for Seagrass Mapping using Sentinel-2A Imagery on Small Island, Spermonde Archipelago, Indonesia. IOP Conf. Ser. Earth Environ. Sci. 2018, 165, 012028. [Google Scholar] [CrossRef]
  41. Pratomo, D.; Cahyadi, M.N.; Hariyanto, I.; Syariz, M.; Putri, S. Lyzenga Algorithm for Shallow Water Mapping Using Multispectral Sentinel-2 Imageries in Gili Noko Waters. BIO Web Conf. 2024, 89, 07006. [Google Scholar] [CrossRef]
  42. Sutrisno, D.; Sugara, A.; Darmawan, M. The assessment of coral reefs mapping methodology: An integrated method approach. Iop Conf. Ser. Earth Environ. Sci. 2021, 750, 012030. [Google Scholar]
  43. Manessa, M.D.M.; Kanno, A.; Sagawa, T.; Sekine, M.; Nurdin, N. Simulation-based investigation of the generality of Lyzenga’s multispectral bathymetry formula in Case-1 coral reef water. Estuar. Coast. Shelf Sci. 2018, 200, 81–90. [Google Scholar] [CrossRef]
  44. Hedley, J.D.; Roelfsema, C.M.; Phinn, S.R.; Mumby, P.J. Environmental and sensor limitations in optical remote sensing of coral reefs: Implications for monitoring and sensor design. Remote Sens. 2012, 4, 271–302. [Google Scholar] [CrossRef]
  45. European Space Agency. Sentinel-2 Spectral Response Functions (S2-SRF). Document COPE-GSEG-EOPG-TN-15-0007, Issue 3.2. Available online: https://landsat.usgs.gov/landsat/spectral_viewer/bands/Sentinel-2A%20MSI%20Spectral%20Responses.xlsx (accessed on 12 March 2025).
  46. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar] [CrossRef]
  47. The Pandas Development Team. pandas-dev/pandas: Pandas. 2020. Available online: https://zenodo.org/records/13819579 (accessed on 12 March 2025).
  48. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef] [PubMed]
  49. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  50. Bradski, G.; Kaehler, A. OpenCV. Dr. Dobb’s J. Softw. Tools 2000, 25. [Google Scholar]
  51. van der Walt, S.; Schönberger, J.L.; Nunez-Iglesias, J.; Boulogne, F.; Warner, J.D.; Yager, N.; Gouillart, E.; Yu, T.; The Scikit-Image Contributors. Scikit-Image: Image processing in Python. PeerJ 2014, 2, e453. [Google Scholar] [CrossRef]
  52. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  53. Waskom, M.L. seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021. [Google Scholar] [CrossRef]
  54. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  55. Goswami, A.; Sharma, D.; Mathuku, H.; Gangadharan, S.M.P.; Yadav, C.S.; Sahu, S.K.; Pradhan, M.K.; Singh, J.; Imran, H. Change Detection in Remote Sensing Image Data Comparing Algebraic and Machine Learning Methods. Electronics 2022, 11, 431. [Google Scholar] [CrossRef]
  56. Bamisile, O.; Cai, D.; Oluwasanmi, A.; Ejiyi, C.; Ukwuoma, C.C.; Ojo, O.; Mukhtar, M.; Huang, Q. Comprehensive assessment, review, and comparison of AI models for solar irradiance prediction based on different time/estimation intervals. Sci. Rep. 2022, 12, 9644. [Google Scholar]
  57. Zamani Joharestani, M.; Cao, C.; Ni, X.; Bashir, B.; Talebiesfandarani, S. PM2.5 prediction based on random forest, XGBoost, and deep learning using multisource remote sensing data. Atmosphere 2019, 10, 373. [Google Scholar] [CrossRef]
  58. Boonnam, N.; Udomchaipitak, T.; Puttinaovarat, S.; Chaichana, T.; Boonjing, V.; Muangprathub, J. Coral Reef Bleaching under Climate Change: Prediction Modeling and Machine Learning. Sustainability 2022, 14, 6161. [Google Scholar] [CrossRef]
  59. White, E.; Amani, M.; Mohseni, F. Coral Reef Mapping Using Remote Sensing Techniques and a Supervised Classification Algorithm. Adv. Environ. Eng. Res. 2021, 2, 28. [Google Scholar]
  60. Pavoni, G.; Corsini, M.; Ponchio, F.; Muntoni, A.; Edwards, C.; Pedersen, N.; Sandin, S.; Cignoni, P. TagLab: AI-assisted Annotation for the Fast and Accurate Semantic Segmentation of Coral Reef Orthoimages. J. Field Robot. 2022, 39, 246–262. [Google Scholar]
  61. Zeng, R.; Hochberg, E.J.; Candela, A.; Wettergreen, D.S. Spectral Unmixing and Mapping of Coral Reef Benthic Cover with Deep Learning. In Proceedings of the 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Rome, Italy, 13–16 September 2022; pp. 1–5. [Google Scholar]
  62. Li, J.; Knapp, D.E.; Fabina, N.S.; Kennedy, E.V.; Larsen, K.; Lyons, M.B.; Murray, N.J.; Phinn, S.R.; Roelfsema, C.M.; Asner, G.P. A Global Coral Reef Probability Map Generated Using Convolutional Neural Networks. Coral Reefs 2020, 39, 1805–1815. [Google Scholar] [CrossRef]
  63. Sheng, V.S.; Provost, F.; Ipeirotis, P.G. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV, USA, 24–27 August 2008; pp. 614–622. [Google Scholar]
  64. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006; pp. 5–43. [Google Scholar]
  65. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; AAAI Press: Menlo Park, CA, USA, 1996; pp. 226–231. [Google Scholar]
  66. UNEP-WCMC; WorldFish Centre; WRI; TNC. Global Distribution of Coral Reefs, Compiled from Multiple Sources Including the Millennium Coral Reef Mapping Project; Version 4.1, updated by UNEP-WCMC. Includes contributions from IMaRS-USF and IRD (2005), IMaRS-USF (2005) and Spalding et al. (2001); UN Environment Programme World Conservation Monitoring Centre: Cambridge, UK, 2021. [Google Scholar] [CrossRef]
  67. Sentinel Hub. Sentinel Hub’s Cloud Detector for Sentinel-2 Imagery. 2024. Available online: https://github.com/sentinel-hub/sentinel2-cloud-detector (accessed on 12 March 2025).
  68. Abdul Gafoor, F.; Al-Shehhi, M.R.; Cho, C.S.; Ghedira, H. Gradient Boosting and Linear Regression for Estimating Coastal Bathymetry Based on Sentinel-2 Images. Remote Sens. 2022, 14, 5037. [Google Scholar] [CrossRef]
  69. Krishnaraj, A.; Honnasiddaiah, R. Remote sensing and machine learning based framework for the assessment of spatio-temporal water quality in the Middle Ganga Basin. Environ. Sci. Pollut. Res. 2022, 29, 64939–64958. [Google Scholar]
  70. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  71. Jin, Q.; Fan, X.; Liu, J.; Xue, Z.; Jian, H. Estimating tropical cyclone intensity in the South China Sea using the XGBoost Model and FengYun Satellite images. Atmosphere 2020, 11, 423. [Google Scholar] [CrossRef]
  72. Bhagwat, R.U.; Shankar, B.U. A novel multilabel classification of remote sensing images using XGBoost. In Proceedings of the 2019 IEEE 5th International Conference for Convergence in Technology (I2CT), Bombay, India, 29–31 March 2019; pp. 1–5. [Google Scholar]
  73. Shao, Z.; Ahmad, M.N.; Javed, A. Comparison of Random Forest and XGBoost Classifiers Using Integrated Optical and SAR Features for Mapping Urban Impervious Surface. Remote Sens. 2024, 16, 665. [Google Scholar] [CrossRef]
  74. Niazkar, M.; Menapace, A.; Brentan, B.; Piraei, R.; Jimenez, D.; Dhawan, P.; Righetti, M. Applications of XGBoost in water resources engineering: A systematic literature review (December 2018–May 2023). Environ. Model. Softw. 2024, 174, 105971. [Google Scholar]
  75. S2 Processing—Sentiwiki.copernicus.eu. Available online: https://sentiwiki.copernicus.eu/web/s2-processing (accessed on 1 February 2025).
  76. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277. [Google Scholar]
  77. Wieland, M.; Fichtner, F.; Martinis, S. UKIS-CSMASK: A Python Package for Multi-Sensor Cloud and Cloud Shadow Segmentation. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, XLIII-B3-2022, 217–222. [Google Scholar] [CrossRef]
  78. Green, E.; Edwards, A.; Clark, C. Remote Sensing Handbook for Tropical Coastal Management; UNESCO Publishing: Paris, France, 2000. [Google Scholar]
  79. Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282. [Google Scholar] [CrossRef] [PubMed]
  80. Zhao, X.; Qi, C.; Zhu, J.; Su, D.; Yang, F.; Zhu, J. A satellite-derived bathymetry method combining depth invariant index and adaptive logarithmic ratio: A case study in the Xisha Islands without in-situ measurements. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104232. [Google Scholar]
  81. Siregar, V.; Agus, S.; Subarno, T.; Prabowo, N. Mapping shallow waters habitats using OBIA by applying several approaches of depth invariant index in North Kepulauan Seribu. IOP Conf. Ser. Earth Environ. Sci. 2018, 149, 012052. [Google Scholar]
  82. Manuputty, A.; Lumban-Gaol, J.; Agus, S.B. Seagrass mapping based on satellite image Worldview-2 by using depth invariant index method. ILMU Kelaut. Indones. J. Mar. Sci. 2016, 21, 37–44. [Google Scholar]
  83. Aljahdali, M.H.; Elhag, M. Calibration of the depth invariant algorithm to monitor the tidal action of Rabigh City at the Red Sea Coast, Saudi Arabia. Open Geosci. 2020, 12, 1666–1678. [Google Scholar]
  84. Komatsu, T.; Hashim, M.; Nurdin, N.; Noiraksar, T.; Prathep, A.; Stankovic, M.; Son, T.P.H.; Thu, P.M.; Van Luong, C.; Wouthyzen, S.; et al. Practical mapping methods of seagrass beds by satellite remote sensing and ground truthing. Coast. Mar. Sci. 2020, 43, 1–25. [Google Scholar]
  85. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar]
  86. Liao, P.S.; Chen, T.S.; Chung, P.C. A Fast Algorithm for Multilevel Thresholding. J. Inf. Sci. Eng. 2001, 17, 713–727. [Google Scholar]
  87. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  88. European Space Agency. Sentinel-2 Operations. 2024. Available online: https://www.esa.int/Enabling_Support/Operations/Sentinel-2_operations (accessed on 12 March 2024).
Figure 1. Spectral response functions of the MSI instrument onboard Sentinel-2A and Sentinel-2B [45].
Figure 2. Flowchart showing a full overview of the workflow, which follows a top-down approach: lower-resolution imagery is processed first, at 320 m pixel resolution, before working down to the native 10 m scale.
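The coarse-to-fine pass sketched in Figure 2 can be illustrated with a short resolution-pyramid example using OpenCV [50]. This is a minimal sketch: the function name, the intermediate factors, and the assumption that a 10 m band is already loaded as a NumPy array are ours, not the study's implementation.

```python
import numpy as np
import cv2  # OpenCV [50]

def resolution_pyramid(band_10m: np.ndarray, factors=(32, 8, 2, 1)):
    """Yield progressively finer versions of a 10 m Sentinel-2 band.

    A factor of 32 gives 320 m pixels (the coarse screening level in
    Figure 2); a factor of 1 returns the native 10 m grid. Cheap checks
    run on the coarse levels, so only promising scenes receive the
    full-resolution processing.
    """
    h, w = band_10m.shape
    for f in factors:
        coarse = cv2.resize(band_10m, (w // f, h // f),
                            interpolation=cv2.INTER_AREA)
        yield f * 10, coarse  # (pixel size in metres, downsampled band)
```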
Figure 4. (A) Results of applying PCA to the superpixels extracted using SLIC [79]: the sea segments tend to follow a cubic relationship, whilst the relevant reef areas are linear, and the PCs are negative owing to the inverse correlation between the input variables. (B) The original image and (C) the unsupervised classification result on the PCs of the superpixels. Red bounding boxes in (B) highlight points at the 1st and 80th percentiles of each cluster, with the corresponding points shown in scatter plot (A). Additional examples are provided in the Appendix (Figure A5).
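A minimal sketch of the superpixel classification summarised in Figure 4, using scikit-image's SLIC [79] and scikit-learn [49]; the segment count, compactness, and number of clusters are illustrative assumptions rather than the study's tuned parameters.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def classify_superpixels(img, n_segments=500, n_clusters=4):
    """SLIC superpixels -> per-segment mean spectra -> PCA -> KMeans."""
    segments = slic(img, n_segments=n_segments, compactness=10,
                    start_label=0, channel_axis=-1)
    # mean reflectance of every superpixel in each band
    feats = np.array([img[segments == s].mean(axis=0)
                      for s in range(segments.max() + 1)])
    pcs = PCA(n_components=2).fit_transform(feats)   # scatter in Figure 4A
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pcs)
    return labels[segments]  # per-pixel label map, as in Figure 4C
```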
Figure 5. (A) PCA applied to PVIs of tile 55LCD (a total of 554 tiles) and plotted on the first two components, comparing the ESA cloud coverage assessment with the imagery itself. As PC1 increases, so does the amount of cloud; PC2 increases or decreases as the orientation of the clouds shifts, whilst good-quality images tend to plot close to one another. Similar trends are observed in (B), time-series analysis vs. ESA cloud coverage.
Figure 6. (A) PCA applied to the screened PVIs, with the ESA automated cloud coverage assessment values as a background image, showing that well-balanced images plot close together whilst glinty and cloudy images each group separately. (B) PCA cluster analysis of the same reef, showing that automated cloud coverage assessments do not reliably indicate the usability of an image: points at 100% cloud coverage may still contain usable data and, conversely, imagery at 0% may be unusable (red arrows show examples). Additional examples are included in the Appendix (Figure A4).
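The PVI-based quality screening behind Figures 5 and 6 can be sketched as follows, again with scikit-learn [49]. How the previews are loaded, and the choice of two components and three clusters, are assumptions of this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def screen_previews(pvis: np.ndarray):
    """pvis: (n_images, H, W) stack of co-registered greyscale previews.

    Each preview is flattened and projected onto the first two principal
    components; acquisitions with similar cloud or glint patterns plot
    close together, so a 'good quality' cluster can be selected without
    relying solely on the metadata cloud percentage.
    """
    X = pvis.reshape(len(pvis), -1).astype(np.float32)
    pcs = PCA(n_components=2).fit_transform(X)
    clusters = KMeans(n_clusters=3, n_init=10).fit_predict(pcs)
    return pcs, clusters
```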
Figure 7. (A) Image histogram of B4 after extracting the DII, with Otsu thresholds shown in red; the first two thresholds capture the sea and glint components of the imagery. (B) Original image histogram of B4 scaled to 255; the histogram is improperly stretched, so the automatic Otsu thresholds pick out only the brightest pixels. (C) Results of the DII before clipping the sea and (D) after clipping the sea using automatic thresholding, eliminating the first two population thresholds and leaving the true signal of the reef.
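The sea/glint clipping in Figure 7 relies on multilevel Otsu thresholding [85,86], available in scikit-image [51]. A minimal sketch follows, assuming a NaN-free DII band and an illustrative class count of four:

```python
import numpy as np
from skimage.filters import threshold_multiotsu

def clip_sea(dii_band: np.ndarray, classes: int = 4):
    """Discard the two darkest histogram populations (sea and glint).

    threshold_multiotsu returns classes-1 thresholds; keeping only the
    pixels above the second threshold removes the first two populations
    and leaves the reef signal, as in Figure 7D.
    """
    thresholds = threshold_multiotsu(dii_band, classes=classes)
    reef = np.where(dii_band > thresholds[1], dii_band, np.nan)
    return reef, thresholds
```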
Figure 8. Clustering results from 2019 to 2022 using the depth-invariant indices of three bands (B04, B03, B02); images on the right are shown after colour enhancement. (A) 2019 and (B) 2020, with relatively good imaging conditions; classification of the areas is relatively consistent with the first observation. (C) 2021 and (D) 2022, with cloud shadows misclassified as shallow marine.
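The depth-invariant index underlying Figure 8 follows Lyzenga [38]. A worked sketch for one band pair is given below; the deep-water signals and the uniform-bottom (e.g., sand) calibration mask are inputs that must be supplied, and all names are illustrative.

```python
import numpy as np

def depth_invariant_index(Li, Lj, Lsi, Lsj, calib_mask):
    """Depth-invariant index for a band pair after Lyzenga [38].

    Li, Lj     : surface reflectance in bands i and j (2-D arrays)
    Lsi, Lsj   : deep-water signal of each band (scalars)
    calib_mask : boolean mask over a uniform bottom at varying depth,
                 used to estimate the attenuation-coefficient ratio
    """
    Xi = np.log(Li - Lsi)               # linearised radiance, band i
    Xj = np.log(Lj - Lsj)               # linearised radiance, band j
    a = (np.var(Xi[calib_mask]) - np.var(Xj[calib_mask])) / (
        2.0 * np.cov(Xi[calib_mask], Xj[calib_mask])[0, 1])
    ki_kj = a + np.sqrt(a**2 + 1.0)     # attenuation ratio k_i / k_j
    return Xi - ki_kj * Xj              # depth-invariant bottom index
```

Indices from the three pairs of B02, B03, and B04 then feed the KMeans clustering shown in panels (A–D).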
Table 1. Sentinel-2 spectral bands, their wavelength ranges, purposes, and spatial resolutions. Sensor characteristics provided by the ESA [45].
Band   | λ (nm)/Δλ (nm) | Purpose                                   | Resolution
B1     | 443/20         | Atmos. corr. (aerosol)                    | 60 m
B2     | 490/65         | Veg. senescence, Atmos. corr.             | 10 m
B3     | 560/35         | Total chlorophyll                         | 10 m
B4     | 665/30         | Max chlorophyll absorption                | 10 m
B5, B6 | 705, 740/15    | Red edge, Atmos. corr.                    | 20 m
B7     | 783/20         | LAI, NIR edge                             | 20 m
B8     | 842/105        | LAI                                       | 10 m
B8a    | 865/20         | NIR plateau, chlorophyll, biomass, LAI    | 20 m
B9     | 945/20         | Water vapour, Atmos. corr.                | 60 m
B10    | 1375/30        | Cirrus detection                          | 60 m
B11    | 1610/90        | Lignin, biomass, snow/ice/cloud           | 20 m
B12    | 2190/180       | Veg. conditions, soil erosion, burn scars | 20 m
Table 2. Input bands and target variables for the XGBoost cloud detection model.
Band | λ (nm) | Description           | Role
2    | 490    | Blue                  | Input Band
3    | 560    | Green                 | Input Band
4    | 665    | Red                   | Input Band
8    | 842    | Near-Infrared         | Input Band
9    | 945    | Water Vapour          | Target Band
10   | 1375   | Cirrus                | Target Band
11   | 1610   | Short-Wave Infrared 1 | Target Band
12   | 2190   | Short-Wave Infrared 2 | Target Band
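Table 2 pairs four 10 m VNIR input bands with four atmospheric/SWIR target bands. A minimal sketch of the corresponding XGBoost [54] setup is given below, fitting one gradient-boosted regressor per target band; the hyperparameters are illustrative, and the residual-based flagging noted in the docstring is our reading of the design rather than a verbatim account of the trained model.

```python
import numpy as np
import xgboost as xgb

def fit_cloud_model(X: np.ndarray, Y: np.ndarray):
    """X: (n_pixels, 4) reflectance in B2, B3, B4, B8 (inputs, Table 2).
    Y: (n_pixels, 4) reflectance in B9, B10, B11, B12 (targets, Table 2).

    Fits one XGBoost regressor per target band [54]. Over clear water the
    VNIR bands predict the water-vapour/cirrus/SWIR response well, so
    large prediction errors, or anomalous predicted values, can be used
    to flag cloudy or land pixels.
    """
    models = []
    for k in range(Y.shape[1]):
        m = xgb.XGBRegressor(n_estimators=200, max_depth=6,
                             learning_rate=0.1)
        m.fit(X, Y[:, k])
        models.append(m)
    return models
```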