Article

Machine and Deep Learning Framework for Sargassum Detection and Fractional Cover Estimation Using Multi-Sensor Satellite Imagery

by José Manuel Echevarría-Rubio 1,*, Guillermo Martínez-Flores 1,* and Rubén Antelmo Morales-Pérez 2

1 Departamento de Oceanología, Centro Interdisciplinario de Ciencias Marinas, Instituto Politécnico Nacional, Av. Instituto Politécnico Nacional s/n, Colonia Playa Palo de Santa Rita, La Paz 23096, Baja California Sur, Mexico
2 Instituto Mexicano de Tecnología del Agua, Paseo Cuauhnáhuac 8532, Colonia Progreso, Jiutepec 62550, Morelos, Mexico
* Authors to whom correspondence should be addressed.
Data 2025, 10(11), 177; https://doi.org/10.3390/data10110177
Submission received: 14 August 2025 / Revised: 25 October 2025 / Accepted: 27 October 2025 / Published: 1 November 2025
(This article belongs to the Section Spatial Data Science and Digital Earth)

Abstract

Over the past decade, recurring influxes of pelagic Sargassum have posed significant environmental and economic challenges in the Caribbean Sea. Effective monitoring is crucial for understanding bloom dynamics and mitigating their impacts. This study presents a comprehensive machine learning (ML) and deep learning (DL) framework for detecting Sargassum and estimating its fractional cover using imagery from key satellite sensors: the Operational Land Imager (OLI) on Landsat-8 and the Multispectral Instrument (MSI) on Sentinel-2. A spectral library was constructed from five core spectral bands (Blue, Green, Red, Near-Infrared, and Short-Wave Infrared). It was used to train an ensemble of five diverse classifiers: Random Forest (RF), K-Nearest Neighbors (KNN), XGBoost (XGB), a Multi-Layer Perceptron (MLP), and a 1D Convolutional Neural Network (1D-CNN). All models achieved high classification performance on a held-out test set, with weighted F1-scores exceeding 0.976. The probabilistic outputs from these classifiers were then leveraged as a direct proxy for the sub-pixel fractional cover of Sargassum. Critically, an inter-algorithm agreement analysis revealed that detections on real-world imagery are typically either of very high (unanimous) or very low (contentious) confidence, highlighting the diagnostic power of the ensemble approach. The resulting framework provides a robust and quantitative pathway for generating confidence-aware estimates of Sargassum distribution. This work supports efforts to manage these harmful algal blooms by providing vital information on detection certainty, while underscoring the critical need to empirically validate fractional cover proxies against in situ or UAV measurements.

1. Introduction

Over the past decade, the Caribbean Sea and adjacent regions have witnessed an unprecedented surge in the frequency and magnitude of pelagic Sargassum influxes [1,2]. Pelagic Sargassum, primarily composed of Sargassum natans and Sargassum fluitans, is a genus of brown macroalgae that forms extensive floating mats. Historically, Sargassum is associated with the Sargasso Sea, where it serves as a critical marine habitat [3]. However, massive blooms have increasingly originated from a new proliferation region in the tropical Atlantic since 2011, often referred to as the “Great Atlantic Sargassum Belt” (GASB) [1,4]. These blooms are advected westward, causing widespread inundations along the coastlines of the Caribbean, Central and South America, the Gulf of Mexico, and Florida [5].
These large-scale accumulations, colloquially known as “golden tides”, pose severe threats. Environmentally, the decomposition of beached Sargassum creates hypoxic conditions that harm coastal ecosystems such as coral reefs and seagrass beds [6,7,8]. It also releases noxious gases such as hydrogen sulfide (H2S), which impacts public health and tourism [7], and can alter water chemistry by leaching compounds, including heavy metals [9]. The socioeconomic impacts are profound for tourism- and fishery-dependent regions, where beach accumulations deter tourists, disrupt fishing activities, and impose significant cleanup costs on local communities [7].

1.1. Remote Sensing for Sargassum Monitoring

Effective management of Sargassum inundations requires robust, timely, and wide-area monitoring. While field surveys provide detailed information, they are impractical for large-scale monitoring due to cost and logistical constraints. Satellite remote sensing offers a vital alternative, providing synoptic views and consistent data acquisition over vast geographical areas [3,10,11].
Sensors with varying spatial and temporal resolutions offer complementary capabilities. Instruments like the Moderate Resolution Imaging Spectroradiometer (MODIS), with its daily global coverage, are suitable for monitoring large offshore aggregations and tracking major bloom events, such as the GASB [12,13]. However, its moderate spatial resolution (250 m to 1000 m) limits its utility for detailed coastal mapping [12]. Conversely, medium- to high-resolution sensors, such as the Operational Land Imager (OLI) on Landsat-8 and the Multispectral Instrument (MSI) on the Sentinel-2 satellites, offer greater spatial detail: 30 m for OLI (15 m panchromatic) and 10 m, 20 m, or 60 m for MSI, depending on the band [7,14]. This enhanced resolution is valuable for detecting Sargassum in complex coastal waters, estimating coverage, and validating detections from coarser spatial resolution sensors [10]. The synergy is clear: MODIS for regional surveillance and early warning, and Landsat/Sentinel-2 for detailed coastal assessments [15,16].

1.2. Existing Detection Methodologies and Challenges

Various remote sensing methodologies detect Sargassum by leveraging its distinct spectral characteristics, notably the “red-edge” feature and red absorption due to the presence of chlorophyll pigments. Standard methods include spectral indices, such as the Floating Algae Index (FAI) [17]. While computationally efficient, this index can suffer from false positives due to thin clouds, sunglint, or turbid waters, and its sensitivity can be affected by Sargassum density and physiological state [1,18,19].
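For reference, the index-based approach amounts to a few lines of code. The sketch below is a minimal NumPy implementation of the FAI linear-baseline formulation, with approximate Landsat-8 OLI band-center wavelengths used as placeholder defaults; substitute the appropriate centers for other sensors.

```python
import numpy as np

def fai(r_red, r_nir, r_swir, lam_red=655.0, lam_nir=865.0, lam_swir=1609.0):
    """Floating Algae Index: NIR reflectance minus a RED-SWIR linear baseline.

    Wavelengths (nm) are approximate Landsat-8 OLI band centers, used here
    only as illustrative defaults.
    """
    baseline = r_red + (r_swir - r_red) * (lam_nir - lam_red) / (lam_swir - lam_red)
    return r_nir - baseline

# A pixel with elevated NIR (the red-edge of floating vegetation) yields FAI > 0,
# while a clear-water pixel yields FAI < 0.
print(fai(0.05, 0.30, 0.04))
print(fai(0.05, 0.02, 0.01))
```

Because the index is a simple per-pixel arithmetic expression, it applies unchanged to whole reflectance arrays, which is why it remains a common baseline despite the false-positive issues noted above.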
To overcome the limitations of traditional index-based methods, more advanced Machine Learning (ML) and Deep Learning (DL) techniques have gained prominence. These approaches are adept at learning complex, non-linear relationships directly from multidimensional spectral data, offering the potential for higher accuracy and greater robustness against factors such as atmospheric haze or varying water constituents. Traditional ML algorithms like Random Forests (RF) and XGBoost (XGB) have been successfully applied to classify pixels based on their multispectral reflectance values [20,21]. In addition, DL models, particularly Convolutional Neural Networks (CNNs) such as ErisNet [6], have shown exceptional promise in distinguishing Sargassum from other features by not only analyzing spectral signatures but also leveraging spatial patterns within the imagery [7,22,23]. Despite these advances, challenges persist, including the need for accurate atmospheric correction, the effective handling of cloud cover, and the critical step of transitioning from simple detection to the quantitative estimation of fractional cover or biomass.

1.3. Study Rationale and Objectives

This study aims to develop, evaluate, and apply a comprehensive framework for improved Sargassum detection and quantitative estimation using multi-sensor satellite data. A key aspect is leveraging the probabilistic output of trained classifiers as a proxy for sub-pixel fractional cover, generating quantitative maps that estimate the percentage of Sargassum coverage within each pixel, thus providing a more nuanced view than simple presence/absence classifications. The specific objectives were: (1) to generate a binary-labeled spectral library from high-resolution satellite imagery (10 m to 30 m); (2) to train and evaluate three machine learning classifiers (Random Forest, K-Nearest Neighbors, and XGBoost) and two deep learning classifiers (a Multi-Layer Perceptron and a 1D Convolutional Neural Network); (3) to implement a workflow for applying these models to high-resolution imagery from Landsat-8 and Sentinel-2 to produce fractional cover maps, while ensuring the framework applies to other sensors like MODIS through band harmonization; and (4) to assess the consistency and reliability of the model ensemble. A novel aspect of this research is the consistent use of classifier probability outputs as a proxy for fractional cover, coupled with a systematic evaluation of a multi-algorithm workflow and a detailed analysis of inter-algorithm agreement to quantify model consensus and uncertainty.

2. Data and Methods

The overall methodology is summarized in Figure 1 and consists of four primary stages. First (Data Acquisition), a spectral library is created from high-resolution Landsat-8 and Sentinel-2 imagery through ground-truth labeling and spectral extraction. This library undergoes (Preprocessing), including surface reflectance harmonization and cleaning, to create the final training dataset. In the third stage (Model Development), this dataset is used to train and validate the ensemble of machine and deep learning models.
Finally, in the fourth stage (Application & Analysis), the trained models are deployed on new high-resolution satellite imagery (Landsat-8 and Sentinel-2) to generate fractional cover maps. As illustrated in the flowchart, the models are trained exclusively on the harmonized Landsat/Sentinel-2 spectral library. However, their design, which uses five core spectral bands common across sensors, makes them directly applicable to coarser resolution imagery such as MODIS, provided the same preprocessing and harmonization steps are applied. This study focuses on the application and validation using high-resolution data, while the framework’s multi-sensor capability enables future analysis of Sargassum dynamics across different spatial scales.

2.1. Study Area

This research focuses on the Caribbean Sea and its adjacent regions, including the Gulf of Mexico and the tropical Atlantic Ocean (Figure 2). This area was selected due to the severe and persistent influxes of pelagic Sargassum since 2011 [1]. The massive blooms, originating from the “Great Atlantic Sargassum Belt,” are transported westward, causing significant environmental and socioeconomic impacts along the coastlines of the Yucatán Peninsula, the Greater and Lesser Antilles, and Central America. The diverse and complex coastal environments within this region, ranging from clear oceanic waters to turbid nearshore zones, provide a robust and challenging setting for developing and validating a multi-sensor detection framework.

2.2. Satellite Data Acquisition and Specifications

To build a robust multi-scale detection framework, this study primarily utilized satellite scenes acquired between 2015 and 2024 from the Operational Land Imager (OLI) on Landsat-8 and the Multispectral Instrument (MSI) on Sentinel-2. While MODIS data was not used in the model training or validation presented here, its sensor specifications are included in the comparative analysis (Figure 3 and Table 1) to demonstrate the spectral alignment that makes the developed framework applicable to its imagery.
All data products were sourced from their official archives: the U.S. Geological Survey (USGS) EarthExplorer for Landsat-8 Collection 2 Level-2 Science Products (L2SP) and the Copernicus Data Space Ecosystem for Sentinel-2 Level-2A (L2A) products. A primary selection criterion for all scenes was cloud cover of less than 5% over the regions of interest, to ensure the quality of the spectral data. The key technical characteristics of these sensors, along with the five core spectral bands (Blue, Green, Red, NIR, and SWIR1) used for analysis, are presented in Figure 3 and detailed in Table 1.
The spectral library used to train the models was constructed exclusively from 34 high-resolution scenes (14 from Landsat-8 and 20 from Sentinel-2). These scenes were selected for their explicit depiction of diverse Sargassum and non-Sargassum features and were used for the manual ground-truth labeling process detailed in Section 2.3. The trained and validated models were then applied to a set of imagery to generate the final fractional cover maps. A comprehensive log of all satellite scenes used in this research, detailing their acquisition dates, sensors, and unique identifiers, is provided in Appendix A.

2.3. Data Preprocessing and Spectral Library Generation

All satellite imagery was obtained as Level-2 analysis-ready data products (L2SP for Landsat-8 and L2A for Sentinel-2). These products are delivered with rigorous, sensor-specific radiometric calibration and atmospheric correction already applied by the respective space agencies (USGS and ESA), converting top-of-atmosphere radiance to standardized surface reflectance. Preprocessing then comprised three key steps: cloud and shadow masking, harmonization of Landsat-8 reflectance to a Sentinel-2 equivalent, and selection of the five core spectral bands (Blue, Green, Red, NIR, and SWIR1). The concept of harmonizing multiple satellite sensors to produce analysis-ready data is well established in the remote sensing community [24,25]. To make Landsat-8 data spectrally compatible with Sentinel-2 for our models, we applied the band-specific linear transformations summarized in Table 2. A rigorous cloud and shadow masking protocol was then applied, using the quality assessment layers provided with each product, to ensure the integrity of the spectral data.
The harmonization process is applied on a per-pixel, per-band basis using the coefficients detailed in Table 2. This step is critical to ensure that Landsat-8 data is spectrally consistent with the Sentinel-2 data that serve as the baseline for the training library. This transformation follows the general linear equation shown in Equation (1), using band-specific coefficients derived from the harmonization framework established by Claverie et al. [24]:
SR_S2_equiv = (slope × SR_L8) + intercept    (1)
where SR_L8 is the original surface reflectance value of a Landsat-8 pixel, SR_S2_equiv is the resulting Sentinel-2-equivalent surface reflectance, and the corresponding ‘slope’ and ‘intercept’ values are selected from Table 2 for the specific band being processed.
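In code, the per-band harmonization reduces to a single vectorized linear expression. The sketch below uses hypothetical slope/intercept values standing in for the actual coefficients of Table 2, which are not reproduced here.

```python
import numpy as np

# Illustrative coefficients only -- the real per-band slope/intercept values
# are those listed in Table 2 (after Claverie et al.).
HARMONIZATION = {
    "blue":  {"slope": 0.98, "intercept": 0.004},
    "green": {"slope": 0.99, "intercept": 0.003},
}

def harmonize_l8_to_s2(sr_l8, band):
    """Apply Equation (1) per pixel: SR_S2_equiv = slope * SR_L8 + intercept."""
    c = HARMONIZATION[band]
    return c["slope"] * np.asarray(sr_l8, dtype=float) + c["intercept"]

blue_l8 = np.array([0.05, 0.10])           # Landsat-8 surface reflectance
print(harmonize_l8_to_s2(blue_l8, "blue"))  # Sentinel-2-equivalent values
```

Applying the transform band by band keeps the operation trivially parallel over full scenes, since each output pixel depends only on the corresponding input pixel.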
  • For Landsat-8 Collection 2, the QA_PIXEL band was used [26]. This band provides a bit-packed representation of surface conditions. Pixels were masked and excluded from the analysis if their QA_PIXEL value indicated either Cloud or Cloud Shadow with medium or high confidence. This corresponds to any pixel where the Cloud Confidence bits are 10 (Medium) or 11 (High), or where the Cloud Shadow Confidence bits are 10 (Medium) or 11 (High). For example, pixel values such as 22,280 (High Conf Cloud) and 23,888 (High Conf Cloud Shadow) were removed.
  • For Sentinel-2 Level-2A, the Scene Classification Layer (SCL) was used [27]. This layer provides a per-pixel classification of scene content. Pixels were masked if they belonged to any of the following classes, which represent clouds, cloud shadows, or other unwanted atmospheric features: Cloud Shadows (Class 3), Cloud Medium Probability (Class 8), Cloud High Probability (Class 9), or Thin Cirrus (Class 10).
This robust, sensor-specific filtering process ensured that only high-quality, clear-sky pixels were used.
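The two masking rules above can be sketched in a few lines. In the sketch below, the QA_PIXEL bit positions (8–9 for cloud confidence, 10–11 for cloud shadow confidence) follow the Collection 2 product documentation, and the SCL class codes are those listed above; treat the bit layout as an assumption to be checked against the current product guide.

```python
import numpy as np

def mask_landsat_qa(qa_pixel):
    """True where a pixel should be excluded (medium/high cloud or shadow).

    Assumed Collection 2 QA_PIXEL layout: bits 8-9 = cloud confidence,
    bits 10-11 = cloud shadow confidence; values 2 (medium) and 3 (high)
    trigger masking.
    """
    qa = np.asarray(qa_pixel).astype(np.uint16)
    cloud_conf = (qa >> 8) & 0b11
    shadow_conf = (qa >> 10) & 0b11
    return (cloud_conf >= 2) | (shadow_conf >= 2)

def mask_sentinel_scl(scl):
    """True where the Scene Classification Layer flags unwanted classes:
    3 = cloud shadow, 8/9 = cloud medium/high probability, 10 = thin cirrus."""
    return np.isin(scl, [3, 8, 9, 10])

# The two example QA_PIXEL values from the text are both masked:
print(mask_landsat_qa([22280, 23888]))  # both True
```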
The initial ground truth was generated by manually labeling pixels in QGIS from the selected scenes. Polygons delineated “sargassum” mats and diverse “no_sargassum” features (e.g., clear water, turbid water, thin clouds, sunglint, land). The raw spectral library underwent cleaning to remove invalid entries and reduce redundancy. The final training dataset comprised 196,037 spectral signatures (11,597 “sargassum” and 184,440 “no_sargassum”). Descriptive statistics (Table 3) and visualizations of the data distributions (Figure 4, Figure 5 and Figure 6) reveal distinct spectral characteristics between the classes. The “sargassum” class generally showed lower Blue, Green, Red, and SWIR1 reflectance, and a characteristic Red-NIR increase. The mean spectral signatures are shown in Figure 7.

2.4. Model Development

Five supervised learning algorithms were developed to classify pixels based on their five-band spectral signatures: three traditional machine learning (ML) models (Random Forest (RF), K-Nearest Neighbors (KNN), and XGBoost (XGB)) and two deep learning (DL) models (a Multi-Layer Perceptron (MLP) and a 1D Convolutional Neural Network (1D-CNN)). The entire framework was implemented in Python (v3.12.4), leveraging core scientific libraries including Scikit-learn (v1.5.2) for ML models and preprocessing, XGBoost (v2.1.3) for the gradient boosting model, and TensorFlow (v2.17) with the Keras API for the DL architectures.
The model development process began with the creation of a complete harmonized spectral library derived from 34 high-resolution Landsat-8 and Sentinel-2 images. This dataset was partitioned into a training set (70%) and a test set (30%). The split was performed using stratified sampling to ensure that the proportions of ‘sargassum’ and ‘no_sargassum’ pixels were identical in both the training and test sets, thereby preventing distributional bias. Features were then scaled using the StandardScaler from Scikit-learn, which was fitted only on the training data to prevent data leakage and subsequently applied to the test set.
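A minimal sketch of this partitioning-and-scaling step, using synthetic spectra in place of the real library (the 6% minority fraction below is illustrative, chosen to mimic the class imbalance of the actual dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                   # five core bands per pixel
y = (rng.random(1000) < 0.06).astype(int)   # ~6% minority "sargassum" class

# Stratified 70/30 split preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Scaler fitted on the training set only, then applied to both -- this is
# what prevents information from the test set leaking into preprocessing.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

print(y_tr.mean(), y_te.mean())  # near-identical class proportions
```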
A key aspect of this framework was the use of classifier probability outputs as a proxy for Sargassum sub-pixel fractional cover. The models are trained on a library containing unambiguous, pure-pixel signatures of dense Sargassum. Therefore, the resulting continuous probability output for a given pixel (ranging from 0.0 to 1.0) is interpreted as the model’s confidence that the pixel’s spectrum resembles that of a 100% Sargassum cover. This probability provides a quantitative estimate superior to a simple binary presence/absence classification, though it remains a proxy that requires empirical validation.
For the traditional ML models, hyperparameter tuning was conducted using Scikit-learn’s GridSearchCV. This process systematically searched for the optimal model configuration using 5-fold stratified cross-validation, with the weighted F1-score as the primary optimization metric. The full hyperparameter grids explored for each model, including parameters like n_estimators, max_depth, and learning_rate, are detailed in Appendix B. For the DL models, training was performed over 100 epochs with a batch size of 64, employing an early stopping callback that monitored validation loss to prevent overfitting and restore the best-performing model weights.
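For the ML models, the tuning setup can be sketched as follows; the grid values shown are small placeholders, not the full grids of Appendix B, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced stand-in for the spectral library.
X, y = make_classification(n_samples=400, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1_weighted",   # weighted F1 as the optimization metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to KNN and XGBoost by swapping the estimator and grid; the DL models were instead trained with early stopping on validation loss, as described above.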
The final, tuned models were then evaluated against the held-out 30% test set, which the models had never seen during training. For the ultimate goal of fractional cover estimation, the models were configured to output not a binary class label, but the continuous probability (from 0.0 to 1.0) that a pixel belongs to the ’sargassum’ class. This probabilistic output from each classifier serves as the direct proxy for the sub-pixel fractional cover of Sargassum.
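With scikit-learn-style classifiers, this probabilistic output is obtained directly from `predict_proba`. A sketch with a stand-in model, assuming the ‘sargassum’ class is encoded as label 1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

pixels = X[:10]                               # stand-in for a scene's spectra
frac_cover = clf.predict_proba(pixels)[:, 1]  # column 1 = P(sargassum), 0.0-1.0
print(frac_cover.round(2))
```

Reshaping this per-pixel vector back to the scene's row/column grid yields the fractional cover map; Keras models provide the analogous continuous output from their final sigmoid layer.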

2.5. Spatial Analysis and Inter-Algorithm Agreement

To evaluate the consistency and reliability of the trained models when applied to new data, a spatial inter-algorithm agreement analysis was performed on 12 representative high-resolution scenes (a mix of Landsat-8 OLI and Sentinel-2 MSI) from the study area. The workflow was as follows: First, for each of the 12 scenes, fractional cover maps were generated using all five of the trained models (RF, KNN, XGB, MLP, and 1D-CNN). A high-resolution global land mask was then applied to all maps to exclude land pixels, thereby restricting the statistical analysis to ocean areas exclusively.
For the agreement analysis, a distinction was made between visualization and quantitative statistics. For creating visual agreement maps, a sensitive probability threshold was used to show the spatial extent of even low-density detections. However, for the rigorous statistical comparison, a stricter detection threshold of fractional cover ≥0.5 was applied. This threshold was chosen because it represents a conservative measure, identifying only those pixels where a classifier’s confidence in the presence of Sargassum exceeds its confidence in its absence. This approach minimizes the inclusion of low-probability, ambiguous pixels in the quantitative agreement statistics, leading to a more robust assessment of model consensus.
The agreement between model pairs was quantified using Cohen’s Kappa [28]. To measure the overall reliability of the five-model ensemble, we used Fleiss’ Kappa [29], a standard method for assessing agreement among multiple raters; its application and principles in remote sensing accuracy assessment are well established [30].
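Both statistics are straightforward to compute on flattened detection maps. The sketch below uses scikit-learn's `cohen_kappa_score` for the pairwise case and a small hand-rolled Fleiss' kappa (the standard count-matrix formulation) for the ensemble case, on toy data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_subjects x n_categories) count matrix with a
    fixed number of raters per subject (here: 5 models voting per pixel)."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                          # raters per subject
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Pairwise: two models' binary detections over the same six pixels.
a = np.array([1, 1, 0, 0, 1, 0])
b = np.array([1, 0, 0, 0, 1, 0])
print(round(cohen_kappa_score(a, b), 3))

# Ensemble: per-pixel vote counts over two categories (no-sarg, sarg).
votes = np.array([[5, 0], [0, 5], [2, 3], [4, 1]])
print(round(fleiss_kappa(votes), 3))
```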

3. Results

3.1. Model Performance

The classification performance of the five trained models was rigorously evaluated on the test set, which comprised 30% of the total spectral library and was not used during model training or hyperparameter tuning. The evaluation metrics, calculated from binary classification results using a 0.5 probability threshold, are detailed in Table 4, with the exact formulas for the macro-averaged metrics outlined in Appendix C.
A key observation was the exceptionally high and consistent performance across all five models, with weighted F1-scores ranging from 0.9767 (1D-CNN) to 0.9838 (MLP). While scores this high can sometimes indicate overfitting, here they are interpreted as a direct consequence of the strong spectral separability between the sargassum and no_sargassum classes within the carefully curated, high-quality spectral library. The manual labeling process, which focused on clear, unambiguous pixels of dense Sargassum and spectrally distinct background features (e.g., clear water, deep ocean, land), produced a dataset in which the five core spectral bands provide a highly effective basis for discrimination, and all models learned these distinguishing features with minimal error. It is essential to note, however, that this test set, although independent of the training process, is drawn from the same distribution of scenes used to build the library. The similarity in performance on such ideal data provides a strong baseline, but it also motivates the inter-algorithm agreement analysis on full, real-world scenes, which presents a more challenging test of robustness and reveals greater divergence in model behavior when faced with ambiguous pixels.
Among the evaluated algorithms, the tree-based models and the MLP performed exceptionally well. The MLP classifier achieved the highest macro-averaged F1-score of 0.9838, closely followed by XGBoost at 0.9837, indicating an excellent balance of precision and recall across classes. The Random Forest and KNN models also demonstrated strong performance, with both achieving an F1-score of 0.9828. The 1D-CNN, while still demonstrating high performance with a macro F1-score of 0.9767, was the lowest-performing model on this specific dataset.
The training times (T-Time) varied significantly across the models, as shown in Figure 8. The MLP was the most computationally expensive, followed by the Random Forest model. In contrast, the KNN model was, by far, the fastest to train. A comprehensive visual summary of model performance, including the performance metrics, training times, and confusion matrices for each algorithm, is provided in Figure 8.

3.2. Qualitative Detection and Agreement Analysis

To assess the practical applicability and spatial performance of the trained models, they were deployed on full satellite scenes. The probabilistic outputs from the models were interpreted as fractional cover maps, providing a detailed, quantitative view of Sargassum distribution.
Figure 9 presents a comprehensive visual comparison for a Landsat-8 scene acquired on 23 July 2015. The figure shows the fractional cover maps generated by all five classifiers (MLP, RF, XGB, KNN, and 1D-CNN) alongside a traditional Floating Algae Index (FAI) map for reference. All models successfully delineate the significant Sargassum aggregations and intricate filaments visible in the scene. While their outputs are broadly similar, subtle differences in spatial extent and detection confidence are apparent, particularly in lower-density or ambiguous areas. These variations underscore the importance of not relying solely on a single algorithm and motivate a more thorough analysis of their consistency.
To quantify the consistency across the model ensemble, we generated inter-algorithm agreement maps. Figure 10 shows a representative agreement map for a Landsat-8 scene. In these maps, the color of each pixel indicates the number of models (from one to five) that detected Sargassum. High-agreement areas (shown in warm colors) represent high-confidence detections in which all models concur. In contrast, low-agreement areas, represented by cool colors, highlight regions of uncertainty where only one or a few models detect Sargassum.
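Such an agreement map reduces to thresholding each model's fractional cover output and counting per-pixel detections across the stack. A minimal sketch with random stand-in maps and an illustrative 0.5 threshold:

```python
import numpy as np

rng = np.random.default_rng(7)
# Five stand-in fractional cover maps for a tiny 4x4 "scene".
maps = rng.random((5, 4, 4))

detections = maps >= 0.5            # binary detection per model
agreement = detections.sum(axis=0)  # 0..5 models agreeing per pixel
print(agreement)
```

Rendering `agreement` with a warm-to-cool colormap reproduces the visualization described above: warm colors where all five models concur, cool colors where only one or two do.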
For a more granular examination, Figure 11 presents a detailed multi-panel analysis for three selected hotspots. While the output of any of the five classifiers could be shown, the K-Nearest Neighbors (KNN) model was chosen as a representative example of the traditional ML models due to its robust performance and fast processing time. Each hotspot panel provides a side-by-side comparison of the KNN classification, the true-color RGB image, the FAI output, and the overall algorithm agreement map. This detailed view confirms that features with a clear visual signature and a strong FAI signal typically correspond to high levels of model agreement.

3.3. Quantitative Agreement Summary

While the previous section evaluated the performance of each model independently against a test set, this section assesses the practical consistency and reliability of the five-model ensemble when applied to full satellite scenes. A quantitative analysis was performed on the 12 scenes for which fractional cover maps were generated, applying a strict detection threshold (fractional cover ≥ 0.5) to all ocean pixels. The purpose of using ensemble agreement statistics like Fleiss’ and Cohen’s Kappa, rather than individual accuracy metrics, is to understand how similarly or differently the models behave in real-world conditions and to quantify the certainty of a detection.

3.3.1. Overall Inter-Ensemble Reliability

To measure the statistical reliability of the entire five-model ensemble, Fleiss’ Kappa ( κ ) was calculated for each of the 12 processed scenes (Figure 12). The average Kappa score across all scenes was 0.311, indicating “Fair” agreement under the Landis & Koch (1977) scale [31]. However, the reliability was highly scene-dependent. Several Landsat-8 scenes achieved a “Moderate” level of agreement ( κ > 0.4 ), indicating a robust consensus among the models.
In contrast, two scenes exhibited negative Kappa scores, suggesting systematic disagreement where the models performed worse than random chance. This disagreement is attributed to challenging environmental conditions not fully represented in the training set. These conditions can cause specific model architectures to produce false positives while others do not, leading to significant disagreement and a negative Kappa score. This problem highlights the importance of scene-specific quality assessment.

3.3.2. Pairwise Agreement and Detection Distribution

To further diagnose the relationships between individual models within the ensemble, a pairwise analysis using Cohen’s Kappa ( κ ) was conducted, with the results averaged across all scenes and summarized in the heatmap in Figure 13. This analysis reveals distinct clusters of model behavior. A strong consensus block is evident among the traditional ML models, with the highest agreement observed between the two tree-based algorithms, Random Forest and XGBoost ( κ = 0.794 ). This high level of agreement suggests that for operational monitoring where computational efficiency is a priority, either the Random Forest or XGBoost model could serve as a reliable, lighter alternative to running the full ensemble. The KNN model also aligns closely with this group. In contrast, the neural network models (MLP and 1D-CNN) show weaker agreement with both the consensus block and with each other, suggesting they have learned different, potentially more complex, decision boundaries.
The divergence is further explained by the stacked distribution of agreement shown in Figure 13. This plot, which includes all pixels where at least one model made a detection, shows that the most common outcomes are either complete consensus (all five models flagged 36.8% of detected pixels) or minimal consensus (a single algorithm made the detection, collectively accounting for 34.8% of all detections).
Crucially, the stacked bars reveal the composition of these agreement levels. For the “1 Algo(s)” category, contributions are distributed relatively evenly across all five models, with each algorithm serving as the sole detector for 6-8% of the total detected pixels. This finding indicates that each model has unique sensitivities. However, the composition of the intermediate agreement levels (2, 3, and 4 Algos) is dominated by specific combinations. For instance, detections made by exactly two algorithms are most frequently observed in pairs such as Random Forest/XGBoost or KNN/XGBoost, reinforcing the tight coupling of traditional ML models as observed in the Kappa heatmap.
In contrast, detections at the three-algorithm level are primarily made by the complete traditional ML block of (KNN, Random Forest, XGBoost). This detailed breakdown confirms that Sargassum features are either so spectrally clear that every model confidently detects them, or ambiguous enough that their detection depends on the unique sensitivities of individual models or specific model combinations. This result reinforces the value of an ensemble approach for separating high-confidence, unanimous detections from more speculative ones that require further scrutiny.

4. Discussion

This study developed and evaluated a multi-sensor framework for detecting pelagic Sargassum and estimating its fractional cover. By training an ensemble of models on a curated spectral library, the framework demonstrated high classification accuracy and produced quantitative, nuanced maps of Sargassum distribution.

4.1. Effectiveness of Multi-Sensor Integration and Data Harmonization

The framework’s design enables a robust, tiered approach to monitoring by standardizing inputs from Landsat-8, Sentinel-2, and potentially MODIS to a common set of five spectral bands. This harmonization enables models, trained on high-resolution data, to be directly applied to coarser MODIS imagery, facilitating both regional-scale surveillance (MODIS) and detailed local assessments (Landsat/Sentinel-2). However, this study did not include model validation using MODIS imagery. Applying models trained on high-resolution data to MODIS pixels presents challenges, as the significant difference in spatial resolution means fractional cover estimates must be interpreted with caution due to much greater sub-pixel heterogeneity. The benefit of this approach is broad, multi-sensor applicability. However, it comes at the cost of forgoing sensor-specific features, such as the red-edge bands in Sentinel-2, which can enhance vegetation discrimination [15]. This trade-off between inter-sensor consistency and single-sensor optimization is a key consideration for developing future operational, multi-platform monitoring systems.

4.2. Model Performance and the Nature of the Test Set

The high classification metrics on the held-out test set (macro-averaged F1-scores of over 0.976) are particularly notable, underscoring the models’ robust performance on a challenging, imbalanced classification problem. Using macro-averaged metrics, which give equal importance to the rare ‘sargassum’ class and the abundant ‘no_sargassum’ class, confirms that the models’ high performance is not merely an artifact of accurately classifying the majority class. Instead, it reflects a genuine ability to learn the highly discriminative spectral signature of healthy, aggregated Sargassum, even when it constitutes a small fraction of the dataset. It is crucial, however, to interpret these metrics within the context from which they were derived: while the test set was independent, it was drawn from the same distribution as the training data. The true test of model robustness comes from application to full satellite scenes with varying environmental conditions, as explored through the inter-algorithm agreement analysis.

4.3. Interpreting Inter-Algorithm Agreement and Model Behavior

A central finding of this study was the informative contrast between the models’ strong performance on the curated test set and their more complex behavior on full, real-world scenes. The high macro-averaged F1-scores (>0.976) demonstrate that, under ideal conditions as captured in our high-quality spectral library, the five core bands provide a powerful basis for separating Sargassum from other features.
The overall “Fair” agreement (average Fleiss’ κ = 0.311) across the test scenes initially seemed modest. However, the distribution of detections reveals a crucial insight: detections are typically of either very high or very low confidence. A large portion of detected pixels were flagged either unanimously by all five models (36.8%) or contentiously by only a single model (34.8%), with fewer instances of intermediate agreement. This bimodal distribution highlights the diagnostic power of an ensemble; a simple presence/absence map would obscure this vital information about confidence.
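The agreement statistic itself is straightforward to reproduce from per-pixel vote counts. A minimal NumPy sketch of Fleiss’ κ for an ensemble of five models (the function and variable names are illustrative, not taken from the published code):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (N items x k categories) matrix of rater counts.

    For this framework: each row is a pixel, the two columns count how many
    of the 5 models voted 'no_sargassum' and 'sargassum', respectively.
    """
    n = counts.sum(axis=1)[0]                       # raters per item (here, 5 models)
    p_j = counts.sum(axis=0) / counts.sum()         # overall category proportions
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-item agreement
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy vote matrix: mostly unanimous pixels with two 4-vs-1 splits
votes = np.array([[0, 5], [5, 0], [4, 1], [1, 4], [0, 5]])
kappa = fleiss_kappa(votes)  # ≈ 0.67 for this toy example
```

Unanimous votes on every pixel give κ = 1; the bimodal high/low-confidence pattern described above pulls κ down even when many detections are unanimous.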
The drivers of this behavior become clear through pairwise model analysis. A consensus block of traditional ML models (RF, XGB, and KNN) behaves very similarly, suggesting these models rely on the same robust spectral thresholds. The high macro-averaged F1-scores provide strong evidence that this consensus rests on a solid foundation: all models have effectively learned the core spectral signature of pure ‘sargassum’ pixels. This explains the high-confidence, unanimous detections of spectrally unambiguous mats. In contrast, the neural network models (MLP and 1D-CNN) show weaker correlation with this block and with each other. Their divergence suggests they have learned more complex, non-linear feature representations, allowing them to identify potential Sargassum in spectrally ambiguous situations (e.g., low density, haze, or sub-pixel mats) that the other models miss, which explains their significant contribution to the large number of low-confidence, single-model detections. This finding powerfully underscores the value of an ensemble approach: the ensemble’s consensus defines high-confidence detections, while a single neural network model can flag an area of potential interest that requires further scrutiny or would otherwise be missed entirely. Finally, the scenes with poor or negative Kappa scores directly illustrate the system’s limitations.
These agreement patterns also provide practical guidance for operational users. A strong consensus block is evident among the traditional ML models, with XGBoost at its core. XGBoost shows the highest pairwise agreement with both Random Forest (κ = 0.793) and KNN (κ = 0.794), indicating it consistently captures the features identified by the other traditional models. This finding suggests that for computationally efficient operational monitoring, XGBoost can serve as a reliable and robust standalone alternative to running the complete five-model ensemble. Furthermore, the agreement level itself can be used to classify detections into confidence tiers: for instance, High-Confidence (detected by four or more models), Medium-Confidence (detected by two to three models), and Low-Confidence (detected by one model). This tiered approach allows end-users to prioritize high-confidence areas for management action while flagging low-confidence areas for further scrutiny.
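The tiering rule just described can be computed directly from per-model detection masks. The following is an illustrative helper, not part of the published code:

```python
import numpy as np

def confidence_tiers(model_masks):
    """Vote counts and confidence tiers from stacked boolean detections.

    model_masks: (n_models, H, W) boolean array, one detection mask per model.
    Returns (votes, tiers) where tiers encodes:
      3 = High-Confidence (>= 4 models), 2 = Medium (2-3 models),
      1 = Low (single model), 0 = no detection.
    """
    votes = np.sum(model_masks, axis=0)
    tiers = np.zeros(votes.shape, dtype=np.uint8)
    tiers[votes == 1] = 1
    tiers[(votes >= 2) & (votes <= 3)] = 2
    tiers[votes >= 4] = 3
    return votes, tiers
```

An end-user map can then render the three tiers as distinct colors, separating areas warranting management action from areas flagged only for further scrutiny.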
It is also crucial to situate this ensemble framework within the context of the latest advancements in deep learning. Recent studies have begun to explore more complex architectures, such as Transformers, for a wide range of remote sensing tasks, highlighting their ability to capture long-range spatial dependencies [32,33]. While our framework relies on more established architectures, its novelty lies not in a single architecture but in the systematic comparison of diverse models and in the use of their probabilistic consensus as a proxy for detection confidence and fractional cover. This model-agnostic, confidence-aware approach complements efforts to develop a single, highly optimized model, providing a practical methodology for quantifying operational uncertainty.

4.4. Global Scalability and Generalizability

While this framework was developed and validated in the Caribbean, its underlying methodology holds significant potential for global scalability. The spectral signature of pelagic Sargassum is fundamentally consistent, meaning the trained models could likely be applied to other regions affected by similar blooms, such as the coasts of West Africa or Southeast Asia. However, direct application would require careful validation, as regional variations in water constituents, atmospheric conditions, and phytoplankton assemblages could influence model performance. A region-specific fine-tuning of the models using a small, localized dataset would likely be necessary to achieve optimal accuracy, but the core five-band framework provides a robust starting point for global Sargassum monitoring efforts.

4.5. Limitations and Future Research

Strengths: This study’s strengths lie in its systematic comparison of five diverse ML and DL algorithms, its multi-sensor applicability, and its rigorous, quantitative analysis of inter-algorithm agreement. A key methodological strength is the evaluation of model performance using macro-averaged metrics. In the remote sensing domain, where the target feature (‘sargassum’) is often far less frequent than the background, this approach provides a more challenging and honest assessment of a model’s capabilities, confirming its utility for detecting a rare but critical phenomenon and providing a robust framework for assessing detection confidence.
Limitations: A primary limitation of this study is the current lack of direct empirical validation for the fractional cover estimates. While the probabilistic outputs serve as a robust proxy based on spectral similarity to pure Sargassum, they have not yet been calibrated or validated against concurrently acquired, high-resolution imagery from sources such as Unmanned Aerial Vehicles (UAVs) or in situ measurements. Such a validation campaign was beyond the logistical scope of this research, which focused on developing the large-scale satellite processing framework. Additionally, the model evaluation was performed using a random spatial split of the data. Future analyses should employ a temporal hold-out cross-validation strategy (e.g., training on data from 2015–2020 and testing on 2021–2024) to more rigorously assess the framework’s robustness against inter-annual and seasonal variability in bloom characteristics. Finally, while the binary “sargassum”/“no_sargassum” framework proved effective, future iterations could explore a multi-class classification (e.g., “sargassum”, “water”, “cloud”, “land”) to potentially improve performance by forcing the network to learn more distinct decision boundaries.
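The temporal hold-out strategy proposed above can be sketched in a few lines (an illustrative helper; the example dates follow the scene log in Appendix A):

```python
from datetime import date

def temporal_holdout(scenes, cutoff_year=2021):
    """Split (acquisition_date, scene_id) pairs by year: train before the
    cutoff, test from the cutoff onward, so no test scene leaks into training."""
    train = [s for s in scenes if s[0].year < cutoff_year]
    test = [s for s in scenes if s[0].year >= cutoff_year]
    return train, test

# Example with three scenes from the Appendix A log
scenes = [
    (date(2015, 7, 23), "016/046"),   # Landsat-8 OLI
    (date(2020, 9, 8), "T16QEH"),     # Sentinel-2 MSI
    (date(2024, 8, 18), "T16QEH"),    # Sentinel-2 MSI
]
train, test = temporal_holdout(scenes)
```

Splitting at the scene level, rather than the pixel level, is what guards against the inter-annual leakage that a random split cannot detect.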
Based on our findings, future work should prioritize four key areas:
  • Empirical Validation Workflow: The most critical next step is the rigorous validation of fractional cover estimates. A future project should implement a validation workflow involving: (a) planning UAV or drone survey campaigns over Sargassum aggregations to coincide with Landsat-8/Sentinel-2 overpasses; (b) co-registering the high-resolution UAV imagery with the satellite pixels; (c) calculating ground-truthed fractional cover from the UAV data; and (d) performing a statistical comparison to calibrate and validate the satellite-derived proxy maps.
  • Spatial Context: The next logical step is to integrate spatial context. Moving from a 1D (spectral) to a 2D (spectral-spatial) CNN architecture would allow the model to learn not only the spectral signature of Sargassum but also its characteristic texture and morphology, improving its ability to distinguish genuine Sargassum slicks from confounding features such as cloud edges or ship wakes.
  • Optimizing Spectral Inputs with Sentinel-2: Future work should explore using an expanded set of nine Sentinel-2 spectral bands, from the blue (Band 2) to the SWIR-1 (Band 11) region. This approach offers greater spectral resolution for discriminating Sargassum. Particular focus should be given to the four bands located between the red and NIR regions (i.e., the ‘Red-Edge’ bands 5, 6, 7, and the narrow NIR band 8a).
  • Semi-Supervised Learning: Given the resource-intensive nature of manual labeling, future research could explore semi-supervised learning techniques to alleviate this burden. By pre-training a model on a large corpus of unlabeled satellite imagery to learn the general statistics of ocean scenes, we could then fine-tune it on our smaller, high-quality labeled dataset to achieve greater robustness and generalization.
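As a concrete starting point for the spectral-spatial direction listed above, a 2D-CNN pipeline would first need patches rather than single-pixel spectra. A hedged sketch of that patch-extraction step (the names and the reflect-padding choice at image borders are assumptions):

```python
import numpy as np

def extract_patches(cube, centers, size=5):
    """Cut size x size spectral-spatial patches around pixel centers.

    cube: (H, W, bands) reflectance cube; centers: list of (row, col).
    Returns an (N, size, size, bands) array suitable as 2D-CNN input.
    """
    pad = size // 2
    # Reflect-pad so patches near scene edges stay fully inside the array
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    return np.stack([padded[r:r + size, c:c + size, :] for r, c in centers])
```

Feeding such patches to a small 2D convolutional network lets it learn texture and morphology (elongated slicks versus cloud edges or ship wakes) in addition to the per-pixel spectrum.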

5. Conclusions

A multi-sensor framework for detecting pelagic Sargassum and estimating its fractional cover was successfully developed, evaluated, and applied to an extensive dataset. By training an ensemble of diverse classifiers on a curated spectral library, we have established a robust methodology for quantitative Sargassum monitoring. The high classification performance on the test set confirms the strong spectral separability of dense Sargassum. However, the more complex behavior observed in real-world scenes, reflected by a “Fair” overall ensemble agreement, highlights that a crucial contribution of this framework is its ability to quantify detection confidence in the face of spectrally ambiguous features. The key conclusions are:
  • High Efficacy on Curated Data: All five evaluated classifiers (RF, KNN, XGB, MLP, 1D-CNN) demonstrated excellent performance on the test set, underscoring the strong spectral separability of Sargassum in high-quality satellite imagery.
  • Probabilistic Output as a Viable Proxy: Using classifier probability outputs as a direct proxy for fractional cover is a practical and effective method for transitioning from simple binary detection to quantitative mapping of Sargassum distribution and density.
  • Ensemble Analysis is Crucial for Confidence Assessment: The inter-algorithm agreement analysis, particularly Fleiss’ and Cohen’s Kappa statistics, proved invaluable. It moved the evaluation beyond simple accuracy metrics to reveal the system’s actual behavior, highlighting that detections are often made with either very high (unanimous) or very low (contentious) confidence, providing a critical framework for assessing the reliability of detections in an operational context.
  • A Foundation for Operational Monitoring: The developed methodology provides a robust and well-vetted basis for operational Sargassum monitoring. The findings emphasize the importance of using a model ensemble and agreement metrics to produce nuanced, reliable data products for mitigating the impacts of Sargassum blooms.
Ultimately, this work establishes that an ensemble of machine learning models provides a robust pathway toward quantitative, confidence-aware Sargassum monitoring, but underscores that rigorous empirical validation with in situ or UAV data is the critical next step for creating operational products.

Author Contributions

Conceptualization, J.M.E.-R. and G.M.-F.; methodology, J.M.E.-R.; software, J.M.E.-R.; validation, J.M.E.-R.; formal analysis, J.M.E.-R., G.M.-F. and R.A.M.-P.; investigation, J.M.E.-R., G.M.-F. and R.A.M.-P.; resources, J.M.E.-R. and G.M.-F.; data curation, J.M.E.-R.; writing—original draft preparation, J.M.E.-R.; writing—review and editing, J.M.E.-R., G.M.-F. and R.A.M.-P.; visualization, J.M.E.-R. and G.M.-F.; supervision, G.M.-F. and R.A.M.-P.; project administration, G.M.-F.; funding acquisition, J.M.E.-R., G.M.-F. and R.A.M.-P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by grants from the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional (SIP-IPN) with project numbers 20221330 and 20230026 awarded to G. Martínez-Flores. Additionally, J. M. Echevarría-Rubio was a recipient of a scholarship (No. 1043025) from the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) and the Beca de Estímulo Institucional de Formación de Investigadores (BEIFI). The APC was funded by Secretaría de Investigación y Posgrado (SIP-IPN).

Data Availability Statement

The spectral library dataset, trained models, and all code used in this study are openly available in the Zenodo repository: https://doi.org/10.5281/zenodo.17246345 (Version 2.0). This repository includes the Python scripts for training and classification, step-by-step Jupyter Notebook (v7.4.3) tutorials, and example Landsat-8 and Sentinel-2 scenes to ensure full reproducibility of the workflow. The repository’s README.md file provides a detailed description of the file structure and variable names. The full satellite imagery archive used for model development is publicly available from USGS EarthExplorer and the Copernicus Data Space Ecosystem, as cited in the manuscript.

Acknowledgments

The authors would like to express their gratitude to the Centro Interdisciplinario de Ciencias Marinas of the Instituto Politécnico Nacional (CICIMAR-IPN) and the Instituto Mexicano de Tecnología del Agua (IMTA) for the facilities and support provided to carry out this research. We also thank the space agencies that provide open access to the satellite imagery used in this study: the National Aeronautics and Space Administration (NASA) for Landsat-8 data and the European Space Agency (ESA) for Sentinel-2 data. We thank the anonymous reviewers for their thoughtful remarks and suggestions.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
DL: Deep Learning
ESA: European Space Agency
FAI: Floating Algae Index
GASB: Great Atlantic Sargassum Belt
KNN: K-Nearest Neighbors
LAADS DAAC: Level-1 and Atmosphere Archive and Distribution System Distributed Active Archive Center
L2SP: Level-2 Science Products
ML: Machine Learning
MLP: Multi-Layer Perceptron
MODIS: Moderate Resolution Imaging Spectroradiometer
MSI: Multispectral Instrument
NASA: National Aeronautics and Space Administration
NIR: Near-Infrared
OLI: Operational Land Imager
QGIS: Quantum Geographic Information System
RF: Random Forest
SR: Surface Reflectance
SWIR: Short-Wave Infrared
USGS: U.S. Geological Survey
XGB: Extreme Gradient Boosting

Appendix A. Satellite Imagery Log

The following table provides a comprehensive list of the satellite scenes used in this study. All scenes were selected based on having less than 5% cloud cover over the area of interest.
Table A1. List of satellite scenes used for ground-truth labeling and model training.
Sensor | Acquisition Date | Scene Identifier (Path/Row or Tile ID)
Landsat-8 OLI | 23 July 2015 | 016/046
Landsat-8 OLI | 23 July 2015 | 016/047
Landsat-8 OLI | 30 July 2015 | 017/046
Landsat-8 OLI | 30 July 2015 | 017/047
Landsat-8 OLI | 6 August 2015 | 018/047
Landsat-8 OLI | 6 August 2015 | 018/048
Landsat-8 OLI | 31 August 2015 | 017/048
Landsat-8 OLI | 11 January 2018 | 017/047
Landsat-8 OLI | 21 February 2018 | 016/047
Landsat-8 OLI | 21 February 2018 | 016/048
Landsat-8 OLI | 28 February 2018 | 017/047
Landsat-8 OLI | 28 February 2018 | 017/048
Landsat-8 OLI | 22 July 2018 | 017/047
Landsat-8 OLI | 4 November 2018 | 016/048
Sentinel-2 MSI | 6 July 2016 | T16QFJ
Sentinel-2 MSI | 1 July 2017 | T16QFJ
Sentinel-2 MSI | 28 July 2017 | T16QGH
Sentinel-2 MSI | 12 August 2017 | T16QGJ
Sentinel-2 MSI | 30 August 2017 | T16QFJ
Sentinel-2 MSI | 21 September 2017 | T16QGJ
Sentinel-2 MSI | 21 June 2018 | T16QFJ
Sentinel-2 MSI | 23 June 2018 | T16QGJ
Sentinel-2 MSI | 11 July 2018 | T16QEH
Sentinel-2 MSI | 27 August 2018 | T16QGH
Sentinel-2 MSI | 8 September 2020 | T16QEH
Sentinel-2 MSI | 25 June 2021 | T16QEH
Sentinel-2 MSI | 20 July 2021 | T16QEH
Sentinel-2 MSI | 14 August 2022 | T16QEH
Sentinel-2 MSI | 23 September 2022 | T16QEH
Sentinel-2 MSI | 20 June 2023 | T16QEH
Sentinel-2 MSI | 10 July 2023 | T16QEH
Sentinel-2 MSI | 19 August 2023 | T16QEH
Sentinel-2 MSI | 23 September 2023 | T16QEH
Sentinel-2 MSI | 18 August 2024 | T16QEH

Appendix B. Supplementary Code Access and Details

The computational workflows developed and utilized in this research are fundamental to the methodology and reproducibility. These include procedures for data preprocessing, ground truth generation, model training and evaluation, and satellite data classification. The key methodological components are available at https://zenodo.org/records/17246345 (accessed on 31 October 2025).

Details on Hyperparameter Grids

Hyperparameter grids for GridSearchCV are detailed in the supplementary code. For RF, “n_estimators” and “max_depth” were tuned; for KNN, “n_neighbors”, “weights”, and “metric”; for XGBoost, “n_estimators”, “learning_rate”, and “max_depth”, together with “scale_pos_weight”; and for MLP, “hidden_layer_sizes”, “activation”, “solver”, “alpha”, “learning_rate_init”, “batch_size”, and “early_stopping”.
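As a representative sketch of how such grids are laid out for scikit-learn’s GridSearchCV, the structure below uses the parameter names listed above; the specific candidate values are placeholders, not the published grids, which live in the supplementary code repository:

```python
# Illustrative GridSearchCV parameter grids; values are placeholders only
param_grids = {
    "rf":  {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    "knn": {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"],
            "metric": ["euclidean", "manhattan"]},
    "xgb": {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1],
            "max_depth": [4, 6], "scale_pos_weight": [1, 16]},
    "mlp": {"hidden_layer_sizes": [(64,), (64, 32)], "activation": ["relu", "tanh"],
            "solver": ["adam"], "alpha": [1e-4, 1e-3],
            "learning_rate_init": [1e-3], "batch_size": [128],
            "early_stopping": [True]},
}
```

Each dictionary is passed as the `param_grid` argument of a `GridSearchCV` wrapping the corresponding estimator; “scale_pos_weight” is XGBoost’s usual lever for the sargassum/no-sargassum class imbalance.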

Appendix C. Calculation of Macro-Averaged Performance Metrics

To ensure transparency and reproducibility, this section details the methodology used to calculate the macro-averaged performance metrics presented in Table 4 from the confusion matrices in Figure 8. The macro-average computes the metric independently for each class and then takes the unweighted mean. This approach treats all classes equally, regardless of their support (i.e., the number of instances in each class), which is crucial for evaluating performance on imbalanced datasets.
The core formulas for a single class are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
where TP = True Positives, FP = False Positives, and FN = False Negatives.
For a binary classification problem with classes “Sargassum” (positive) and “No Sargassum” (negative), the per-class metrics are first calculated. The macro-averaged metric is the arithmetic mean of the metric for each class:
Macro Precision = (Precision_sargassum + Precision_no_sargassum) / 2
Macro Recall = (Recall_sargassum + Recall_no_sargassum) / 2
Macro F1-Score = (F1-Score_sargassum + F1-Score_no_sargassum) / 2

Example Calculation: K-Nearest Neighbors (KNN)

Using the confusion matrix for the KNN model from Figure 8:
  • True Positives (TP): 3331 (“Sargassum” correctly identified)
  • False Negatives (FN): 148 (“Sargassum” missed, predicted as “No Sargassum”)
  • False Positives (FP): 78 (“No Sargassum” misidentified as “Sargassum”)
  • True Negatives (TN): 55,255 (“No Sargassum” correctly identified)
Step 1: Calculate metrics for the “Sargassum” (positive) class.
  • Precision_sargassum = 3331 / (3331 + 78) = 0.9780
  • Recall_sargassum = 3331 / (3331 + 148) = 0.9575
  • F1-Score_sargassum = (2 × 0.9780 × 0.9575) / (0.9780 + 0.9575) = 0.9676
Step 2: Calculate metrics for the “No Sargassum” (negative) class. For this class, the roles of the confusion matrix values are inverted:
  • For the negative class, TN (55,255) plays the role of TP, FN (148) the role of FP, and FP (78) the role of FN.
  • Precision_no_sargassum = 55,255 / (55,255 + 148) = 0.9973
  • Recall_no_sargassum = 55,255 / (55,255 + 78) = 0.9986
  • F1-Score_no_sargassum = (2 × 0.9973 × 0.9986) / (0.9973 + 0.9986) = 0.9980
Step 3: Calculate the final macro-averaged metrics.
  • Macro Precision = (0.9780 + 0.9973) / 2 = 0.9877
  • Macro Recall = (0.9575 + 0.9986) / 2 = 0.9781
  • Macro F1-Score = (0.9676 + 0.9980) / 2 = 0.9828
These final values correspond to the metrics reported for KNN in Table 4. The same procedure was applied to all other models.
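The three-step procedure can be condensed into a short function. The sketch below uses illustrative confusion-matrix counts rather than the published ones; small differences in the last decimal place can arise from rounding intermediate values, as in the worked example above:

```python
def macro_metrics(tp, fp, fn, tn):
    """Macro-averaged precision, recall, and F1 from a binary confusion matrix."""
    def prf(tp_, fp_, fn_):
        p = tp_ / (tp_ + fp_)
        r = tp_ / (tp_ + fn_)
        return p, r, 2 * p * r / (p + r)

    # Positive ("Sargassum") class, then the negative class with roles inverted:
    # the negative class's FP are the positive class's FN, and vice versa.
    p_pos, r_pos, f_pos = prf(tp, fp, fn)
    p_neg, r_neg, f_neg = prf(tn, fn, fp)
    return (p_pos + p_neg) / 2, (r_pos + r_neg) / 2, (f_pos + f_neg) / 2

# Hypothetical counts for illustration (not the KNN values from Figure 8)
macro_p, macro_r, macro_f1 = macro_metrics(tp=90, fp=10, fn=5, tn=895)  # macro F1 ≈ 0.957
```

Because each class contributes equally to the mean regardless of its support, a model that ignores the rare class is penalized heavily, which is the property that motivates macro averaging here.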

References

  1. Wang, M.; Hu, C.; Barnes, B.; Mitchum, G.; Lapointe, B.; Montoya, J. The Great Atlantic Sargassum Belt. Science 2019, 365, 83–87. [Google Scholar] [CrossRef] [PubMed]
  2. Rodríguez-Martínez, R.; Jordán-Dahlgren, E.; Hu, C. Spatio-Temporal Variability of Pelagic Sargassum Landings on the Northern Mexican Caribbean. Remote Sens. Appl. Soc. Environ. 2022, 27, 100767. [Google Scholar] [CrossRef]
  3. Ody, A.; Thibaut, T.; Berline, L.; Changeux, T.; André, J.; Chevalier, C.; Blanfuné, A.; Blanchot, J.; Ruitton, S.; Stiger-Pouvreau, V.; et al. From In Situ to Satellite Observations of Pelagic Sargassum Distribution and Aggregation in the Tropical North Atlantic Ocean. PLoS ONE 2019, 14, e0222584. [Google Scholar] [CrossRef] [PubMed]
  4. Gower, J.; Young, E.; King, S. Satellite Images Suggest a New Sargassum Source Region in 2011. Remote Sens. Lett. 2013, 4, 764–773. [Google Scholar] [CrossRef]
  5. Triñanes, J.; Putman, N.; Goñi, G.; Hu, C.; Wang, M. Monitoring Pelagic Sargassum Inundation Potential for Coastal Communities. J. Oper. Oceanogr. 2021, 16, 48–59. [Google Scholar] [CrossRef]
  6. Arellano-Verdejo, J.; Lazcano-Hernandez, H.; Cabanillas-Terán, N. ERISNet: Deep Neural Network for Sargassum Detection Along the Coastline of the Mexican Caribbean. PeerJ 2019, 7, e6842. [Google Scholar] [CrossRef] [PubMed]
  7. Laval, M.; Belmouhcine, A.; Courtrai, L.; Descloitres, J.; Salazar-Garibay, A.; Schamberger, L.; Minghelli, A.; Thibaut, T.; Dorville, R.; Mazoyer, C.; et al. Detection of Sargassum from Sentinel Satellite Sensors Using Deep Learning Approach. Remote Sens. 2023, 15, 1104. [Google Scholar] [CrossRef]
  8. Marsh, R.; Skliris, N.; Tompkins, E.; Dash, J.; Almela, V.; Tonon, T.; Oxenford, H.; Webber, M. Climate-Sargassum Interactions Across Scales in the Tropical Atlantic. PLoS Clim. 2023, 2, e0000253. [Google Scholar] [CrossRef]
  9. Chandler, C.; Ávila Mosqueda, S.; Salas-Acosta, E.; Magaña-Gallegos, E.; Mancera, E.; Reali, M.; Barreda-Bautista, B.; Boyd, D.; Metcalfe, S.; Sjogersten, S.; et al. Spectral Characteristics of Beached Sargassum in Response to Drying and Decay over Time. Remote Sens. 2023, 15, 4336. [Google Scholar] [CrossRef]
  10. Hernández, W.; Morell, J.; Armstrong, R. Using High-Resolution Satellite Imagery to Assess the Impact of Sargassum Inundation on Coastal Areas. Remote Sens. Lett. 2021, 13, 24–34. [Google Scholar] [CrossRef]
  11. Lazcano-Hernandez, H.; Arellano-Verdejo, J.; Rodríguez-Martínez, R. Algorithms Applied for Monitoring Pelagic Sargassum. Front. Mar. Sci. 2023, 10, 1216426. [Google Scholar] [CrossRef]
  12. Roger, J.C.; Ray, J.P.; Vermote, E.F. MODIS Surface Reflectance User’s Guide: Collections 6 and 6.1; Version 1.7; MODIS Land Surface Reflectance Science Computing Facility, 2023. Available online: https://modis-land.gsfc.nasa.gov (accessed on 22 October 2025).
  13. Wang, M.; Hu, C. Mapping and Quantifying Sargassum Distribution and Coverage in the Central West Atlantic Using MODIS Observations. Remote Sens. Environ. 2016, 183, 350–367. [Google Scholar] [CrossRef]
  14. Sun, D.; Chen, Y.; Wang, S.; Zhang, H.; Qiu, Z.; Mao, Z.; He, Y. Using Landsat 8 OLI Data to Differentiate Sargassum and Ulva prolifera Blooms in the South Yellow Sea. Int. J. Appl. Earth Obs. Geoinf. 2021, 98, 102302. [Google Scholar] [CrossRef]
  15. Wang, M.; Hu, C. Satellite Remote Sensing of Pelagic Sargassum Macroalgae: The Power of High Resolution and Deep Learning. Remote Sens. Environ. 2021, 264, 112631. [Google Scholar] [CrossRef]
  16. Sun, Y.; Wang, M.; Liu, M.; Li, Z.; Chen, Z.; Huang, B. Continuous Sargassum Monitoring Across the Caribbean Sea and Central Atlantic Using Multi-Sensor Satellite Observations. Remote Sens. Environ. 2024, 309, 114223. [Google Scholar] [CrossRef]
  17. Hu, C. A Novel Ocean Color Index to Detect Floating Algae in the Global Oceans. Remote Sens. Environ. 2009, 113, 2118–2129. [Google Scholar] [CrossRef]
  18. Hu, C.; Feng, L.; Hardy, R.; Hochberg, E. Spectral and Spatial Requirements of Remote Measurements of Pelagic Sargassum Macroalgae. Remote Sens. Environ. 2015, 167, 229–246. [Google Scholar] [CrossRef]
  19. Podlejski, W.; Descloitres, J.; Chevalier, C.; Minghelli, A.; Lett, C.; Berline, L. Filtering Out False Sargassum Detections Using Context Features. Front. Mar. Sci. 2022, 9, 960939. [Google Scholar] [CrossRef]
  20. Xiao, Y.; Liu, R.; Kim, K.; Zhang, J.; Cui, T. A Random Forest-Based Algorithm to Distinguish Ulva prolifera and Sargassum from Multispectral Satellite Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4201515. [Google Scholar] [CrossRef]
  21. Shin, J.; Lee, J.; Jang, L.; Lim, J.; Khim, B.; Jo, Y. Sargassum Detection Using Machine Learning Models: A Case Study with the First 6 Months of GOCI-II Imagery. Remote Sens. 2021, 13, 4844. [Google Scholar] [CrossRef]
  22. Hu, C.; Zhang, S.; Barnes, B.; Xie, Y.; Wang, M.; Cannizzaro, J.; English, D. Mapping and Quantifying Pelagic Sargassum in the Atlantic Ocean Using Multi-Band Medium-Resolution Satellite Data and Deep Learning. Remote Sens. Environ. 2023, 289, 113515. [Google Scholar] [CrossRef]
  23. Cui, B.; Zhang, H.; Jing, W.; Liu, H.; Cui, J. SRSe-Net: Super-Resolution-Based Semantic Segmentation Network for Green Tide Extraction. Remote Sens. 2022, 14, 710. [Google Scholar] [CrossRef]
  24. Claverie, M.; Ju, J.; Masek, J.; Dungan, J.; Vermote, E.; Roger, J.; Skakun, S.; Justice, C. The Harmonized Landsat and Sentinel-2 Surface Reflectance Data Set. Remote Sens. Environ. 2018, 219, 145–161. [Google Scholar] [CrossRef]
  25. Roy, D.; Kovalskyy, V.; Zhang, H.; Vermote, E.; Yan, L.; Kumar, S.; Egorov, A. Characterization of Landsat-7 to Landsat-8 Reflective Wavelength and Normalized Difference Vegetation Index Continuity. Remote Sens. Environ. 2016, 185, 57–70. [Google Scholar] [CrossRef] [PubMed]
  26. U.S. Geological Survey. Landsat 8-9 Collection 2 Level 2 Science Product Guide; Technical Report; Version 5.0; U.S. Geological Survey: Reston, VA, USA, 2024. [Google Scholar]
  27. Copernicus. Sentinel-2 Products Specification Document; Technical Report S2-PDGS-TAS-DI-PSD; European Space Agency (ESA): Paris, France, 2024. [Google Scholar]
  28. Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  29. Fleiss, J. Measuring Nominal Scale Agreement Among Many Raters. Psychol. Bull. 1971, 76, 378–382. [Google Scholar] [CrossRef]
  30. Congalton, R.; Green, K. Assessing the Accuracy of Remotely Sensed Data: Principles and Practices, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar] [CrossRef]
  31. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
  32. Aleissaee, A.; Kumar, A.; Anwer, R.; Khan, S.; Cholakkal, H.; Xia, G.; Khan, F. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
  33. Xia, J.; Romeiser, R.; Zhang, W.; Özgökmen, T. Use of Vision Transformer to Classify Sea Surface Phenomena in SAR Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10937–10956. [Google Scholar] [CrossRef]
Figure 1. Flowchart illustrating the main steps of the research methodology. From satellite data acquisition and preprocessing, through ground truth generation, model training and validation, to the application for fractional cover mapping of Sargassum.
Figure 2. Map of the study area used for Sargassum monitoring. The core region of interest is defined by the Caribbean Large Marine Ecosystem (shaded green). The map also shows the Great Atlantic Sargassum Belt (shaded red), the source of the massive Sargassum blooms that travel westward from the Equatorial Atlantic.
Figure 3. Comparative spectral characteristics of the satellite sensors used in this study. The plot shows the center wavelength and spectral range (bandwidth, represented by the width of each bar) for the core bands of Aqua-MODIS, Landsat-8 OLI, and Sentinel-2 MSI. This plot highlights the spectral alignment of the five bands (Blue, Green, Red, NIR, and SWIR1) that form the basis for the multi-sensor classification framework.
Figure 4. Reflectance distributions for the “sargassum” (coral) and “no_sargassum” (sky blue) classes. The histograms show that, while there is some overlap, the “sargassum” class generally occupies a narrower, lower range of reflectance values, particularly in the Blue, Green, and SWIR1 bands. The “no_sargassum” class exhibits a much broader, multi-modal distribution, reflecting its composition of various water types, land, and clouds.
Figure 5. Box plots comparing the statistical distribution of reflectance values for the “sargassum” and “no_sargassum” classes. The plots clearly illustrate that the median reflectance for “sargassum” is consistently lower than “no_sargassum” in all bands except the NIR, where the medians are nearly identical. The interquartile range (IQR) for the “sargassum” class is also notably tighter, indicating less spectral variability compared to the highly diverse “no_sargassum” class.
Figure 6. Pairwise scatter plots illustrating the relationships between spectral bands for a random subset of 1000 samples. The “sargassum” class (coral) forms a distinct, tight cluster, indicating a strong positive correlation between the visible bands (Blue, Green, Red) and lower overall reflectance. In contrast, the “no_sargassum” class (sky blue) is highly dispersed, showing a wide range of spectral behaviors. The diagonal plots show the kernel density estimate (KDE) for each band, reinforcing the distinct distributions observed in the histograms.
Figure 7. Mean spectral signature of the “sargassum” class derived from the training dataset. The solid line represents the mean reflectance, and the shaded area indicates the interquartile range (IQR), showing the variability between the 25th and 75th percentiles. The signature displays characteristic vegetation features, including absorption in the Blue and Red bands, a peak in the Green band, and a significant increase in reflectance from the Red to the NIR band (the “red edge” feature), which are critical for discrimination by the supervised models.
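The mean signature and IQR envelope shown in Figure 7 can be computed with a few array operations. The sketch below uses synthetic reflectance values (drawn around the class means reported in Table 3) as stand-ins for the actual spectral library:

```python
import numpy as np

# Hypothetical spectral library sample: rows = pixels, columns = the five
# bands (Blue, Green, Red, NIR, SWIR1). Values are synthetic placeholders
# generated around the "sargassum" class means, not the paper's data.
bands = ["Blue", "Green", "Red", "NIR", "SWIR1"]
rng = np.random.default_rng(42)
sargassum = rng.normal(loc=[0.12, 0.12, 0.11, 0.21, 0.06],
                       scale=[0.03, 0.05, 0.05, 0.08, 0.07],
                       size=(1000, 5))

mean_signature = sargassum.mean(axis=0)                 # solid line in Figure 7
q25, q75 = np.percentile(sargassum, [25, 75], axis=0)   # shaded IQR envelope

for name, m, lo, hi in zip(bands, mean_signature, q25, q75):
    print(f"{name}: mean={m:.3f}, IQR=[{lo:.3f}, {hi:.3f}]")
```

Plotting the mean as a line and filling between `q25` and `q75` reproduces the figure's solid-line-plus-shaded-band layout.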
Figure 8. Summary of model performance on the test set (30% of the spectral library). (a) Bar charts comparing key classification metrics for each model, alongside their total Training Time (T-Time) in seconds. While overall accuracy is high for all models, the macro-averaged metrics show more nuanced differences in performance. (b) Confusion matrices (CM) for each classifier, detailing the number and percentage of true positive, true negative, false positive, and false negative predictions. The CMs confirm that all models are highly effective at identifying the ‘no_sargassum’ (NS) class and show that the number of misclassified ‘sargassum’ (S) pixels (false negatives) is consistently low.
Figure 9. Visual comparison of Sargassum detection products for a Landsat-8 scene acquired on 23 July 2015. Panels (a–e) display the fractional cover maps generated by each of the five classifiers. The color intensity in these maps corresponds to the model’s output probability of Sargassum presence, visualized using a sensitive threshold to show the extent of even low-density patches. Panel (f) shows the traditional Floating Algae Index (FAI) for comparison. While all models successfully delineate the significant Sargassum aggregations, subtle differences in spatial extent and confidence of detections are apparent, particularly in lower-density filaments, underscoring the value of an ensemble approach.
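The FAI used for comparison in panel (f) is the baseline-subtraction index of Hu (2009): NIR reflectance minus a baseline interpolated linearly between the Red and SWIR1 bands. The sketch below uses the Sentinel-2 band centers from Table 1 and illustrative reflectance values, not values from the scene:

```python
# Floating Algae Index (Hu, 2009): NIR reflectance minus a baseline
# linearly interpolated between the Red and SWIR1 bands.
# Wavelengths are the Sentinel-2 band centers listed in Table 1.
LAMBDA_RED, LAMBDA_NIR, LAMBDA_SWIR1 = 665.0, 865.0, 1610.0

def fai(red, nir, swir1):
    baseline = red + (swir1 - red) * (LAMBDA_NIR - LAMBDA_RED) / (LAMBDA_SWIR1 - LAMBDA_RED)
    return nir - baseline

# Illustrative pixels: a vegetation-like spectrum yields a positive FAI,
# a clear-water spectrum a negative one.
print(fai(0.110, 0.214, 0.063))  # > 0, Sargassum-like
print(fai(0.020, 0.010, 0.005))  # < 0, water-like
```

Applied element-wise to harmonized band arrays, the same function produces the FAI map shown in the figure.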
Figure 10. Example of an inter-algorithm agreement map. The color indicates the number of models that detected Sargassum in a given pixel, where warmer colors (e.g., red) indicate higher agreement (more models concur) and cooler colors (e.g., purple) indicate lower agreement.
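An agreement map of this kind can be derived by thresholding each model's per-pixel probability and counting concurring votes. A minimal sketch, with random values standing in for the five model outputs:

```python
import numpy as np

# Five per-pixel probability maps (one per classifier). The random values
# here are placeholders for the actual RF/KNN/XGB/MLP/1D-CNN outputs.
rng = np.random.default_rng(0)
n_models, h, w = 5, 4, 4
prob_maps = rng.random((n_models, h, w))

# Count how many models exceed the 0.5 detection threshold at each pixel:
# 0 = no detection, 5 = unanimous agreement.
agreement = (prob_maps > 0.5).sum(axis=0)
print(agreement)
```

Rendering `agreement` with a warm-to-cool colormap gives the consensus visualization described in the caption.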
Figure 11. Detailed multi-panel analysis for three distinct Sargassum hotspots, demonstrating the relationship between visual evidence and model consensus. For each hotspot, the sub-panels provide a side-by-side comparison of: (1) the fractional cover map from the K-Nearest Neighbors (KNN) model, chosen as a representative classifier; (2) the true-color RGB image for visual ground-truthing; (3) the traditional Floating Algae Index (FAI); and (4) the overall inter-algorithm agreement map. In the agreement map, the color indicates the number of models (from one to five) that detected Sargassum above a probability threshold of 0.5, with warmer colors indicating higher consensus. The figure confirms that features with a clear visual signature in the RGB image and a strong FAI signal consistently correspond to areas of high inter-algorithm agreement, validating the ensemble’s high-confidence detections.
Figure 12. Inter-rater reliability (Fleiss’ Kappa, κ ) of the five-model ensemble for each of the 12 processed scenes. The dashed red line indicates the average agreement of 0.311. Bars are color-coded according to the Landis & Koch (1977) [31] interpretation standard. The results show moderate consensus across several scenes but also highlight scenes with poor or negative agreement, likely due to challenging atmospheric or ocean-surface conditions.
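Per-scene Fleiss’ κ can be computed directly from the per-pixel vote counts of the five models. A minimal sketch with a hypothetical four-pixel toy example (not the paper’s data):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts: (n_pixels, n_categories) array where each
    row sums to the number of raters (here, the five models voting
    'sargassum' vs. 'no_sargassum')."""
    counts = np.asarray(counts, dtype=float)
    n, m = counts.shape[0], counts[0].sum()
    p_j = counts.sum(axis=0) / (n * m)                      # category proportions
    P_i = ((counts ** 2).sum(axis=1) - m) / (m * (m - 1))   # per-pixel agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: four pixels, five model votes each.
votes_sarg = np.array([5, 4, 1, 0])                 # models voting "sargassum"
counts = np.column_stack([votes_sarg, 5 - votes_sarg])
print(fleiss_kappa(counts))
```

Running the same computation over every classified pixel of a scene yields one κ value per scene, as plotted in Figure 12.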
Figure 13. Overall quantitative agreement summaries. (a) Stacked bar chart showing the distribution of inter-algorithm agreement. Each colored segment represents a unique combination of models that agreed on a set of pixels, with its percentage contribution to the total number of detections. The bimodal nature, with large bars at 1 and 5 algorithms, highlights that detections are frequently either contentious or unanimous. (b) Heatmap of the average pairwise Cohen’s Kappa ( κ ) agreement between all model pairs. The warmer colors indicate stronger agreement, revealing a clear consensus block among the traditional ML models (RF, XGB, KNN).
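The pairwise Cohen’s κ heatmap in panel (b) can be sketched from binary detection masks. The three short masks below are hypothetical stand-ins for the thresholded model outputs:

```python
import numpy as np
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two binary detection masks (any shape)."""
    a, b = np.asarray(a).ravel(), np.asarray(b).ravel()
    po = (a == b).mean()                        # observed agreement
    pe = (a.mean() * b.mean()                   # chance agreement
          + (1 - a.mean()) * (1 - b.mean()))
    return (po - pe) / (1 - pe)

# Hypothetical binary masks from three of the five classifiers.
masks = {
    "RF":  np.array([1, 1, 0, 0, 1, 0]),
    "XGB": np.array([1, 1, 0, 0, 1, 1]),
    "KNN": np.array([1, 0, 0, 0, 1, 0]),
}
for (n1, m1), (n2, m2) in combinations(masks.items(), 2):
    print(f"{n1} vs {n2}: kappa = {cohen_kappa(m1, m2):.3f}")
```

Averaging each pair’s κ over all scenes fills the symmetric matrix rendered as the heatmap.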
Table 1. Comparison of satellite sensor specifications relevant for Sargassum detection. The top panel lists general characteristics, while the bottom panel details the specific spectral bands (Common Bands Used) with their corresponding Band ID, Spatial Resolution (SRes), and Center Wavelength (CW).
| Feature | Sentinel-2 (MSI) | Landsat-8 (OLI/TIRS) | Aqua (MODIS) |
|---|---|---|---|
| Operator | ESA | NASA/USGS | NASA |
| Nominal Revisit | ∼5 days (S2A, S2B) | 16 days | 1–2 days |
| Swath Width (km) | 290 | 185 | 2330 |

Common Bands Used (Sentinel-2 / Landsat-8 / MODIS):

| Channel | Band | SRes (m) | CW (nm) | Band | SRes (m) | CW (nm) | Band | SRes (m) | CW (nm) |
|---|---|---|---|---|---|---|---|---|---|
| Blue | B2 | 10 | ∼490 | B2 | 30 | ∼482 | B3 | 500 | ∼469 |
| Green | B3 | 10 | ∼560 | B3 | 30 | ∼561 | B4 | 500 | ∼555 |
| Red | B4 | 10 | ∼665 | B4 | 30 | ∼655 | B1 | 250 | ∼645 |
| NIR | B8A | 20 | ∼865 | B5 | 30 | ∼865 | B2 | 250 | ∼859 |
| SWIR1 | B11 | 20 | ∼1610 | B6 | 30 | ∼1609 | B6 | 500 | ∼1640 |
Table 2. Linear transformation equations used in the processing workflow for harmonizing Landsat-8 (L8) surface reflectance (SR) to match Sentinel-2 (S2). These band-specific adjustments ensure spectral consistency before the data is fed to the machine learning models. Note that Surface Reflectance (SR) is a unitless ratio.
| Harmonized Band | Landsat-8 Source Band | Harmonization Equation |
|---|---|---|
| Blue (B2 equivalent) | Blue (B2) | SR_S2_equiv = 0.9778 × SR_L8 + 0.0048 |
| Green (B3 equivalent) | Green (B3) | SR_S2_equiv = 1.0379 × SR_L8 − 0.0009 |
| Red (B4 equivalent) | Red (B4) | SR_S2_equiv = 1.0431 × SR_L8 − 0.0011 |
| NIR (B8A equivalent) | NIR (B5) | SR_S2_equiv = 0.9043 × SR_L8 + 0.0040 |
| SWIR1 (B11 equivalent) | SWIR1 (B6) | SR_S2_equiv = 0.9872 × SR_L8 − 0.0001 |
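Applying these band-specific linear adjustments is a simple element-wise operation. A sketch using the slopes and intercepts from Table 2 (the function name and input values are illustrative):

```python
import numpy as np

# Slope/intercept pairs for SR_S2_equiv = slope * SR_L8 + intercept,
# taken from Table 2.
HARMONIZATION = {
    "Blue":  (0.9778,  0.0048),
    "Green": (1.0379, -0.0009),
    "Red":   (1.0431, -0.0011),
    "NIR":   (0.9043,  0.0040),
    "SWIR1": (0.9872, -0.0001),
}

def harmonize_l8(band_name, sr_l8):
    """Map a Landsat-8 OLI surface reflectance array to its
    Sentinel-2 MSI equivalent for the given band."""
    slope, intercept = HARMONIZATION[band_name]
    return slope * np.asarray(sr_l8) + intercept

print(harmonize_l8("Blue", np.array([0.10, 0.20])))
```

Because surface reflectance is unitless, the harmonized output feeds directly into the classifiers without further scaling.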
Table 3. Descriptive statistics of surface reflectance for the “sargassum” and “no_sargassum” classes. The statistics reveal lower mean reflectance for the “sargassum” class in the Blue, Green, Red, and SWIR1 bands compared to the diverse “no_sargassum” class, which includes bright features such as clouds and land, resulting in a higher mean and standard deviation. Note the similar mean but distinct distribution for the NIR band. Data are derived from the final training dataset, with N(sargassum) = 11,597 pixels and N(no_sargassum) = 184,440 pixels.
| Statistic | Blue (Sarg / No Sarg) | Green (Sarg / No Sarg) | Red (Sarg / No Sarg) | NIR (Sarg / No Sarg) | SWIR1 (Sarg / No Sarg) |
|---|---|---|---|---|---|
| Mean | 0.121 / 0.202 | 0.122 / 0.192 | 0.110 / 0.179 | 0.214 / 0.213 | 0.063 / 0.135 |
| Std Dev | 0.032 / 0.115 | 0.055 / 0.115 | 0.054 / 0.123 | 0.078 / 0.135 | 0.070 / 0.102 |
| Min | 0.060 / 0.025 | 0.013 / −0.019 | −0.011 / −0.037 | −0.002 / −0.040 | −0.050 / −0.058 |
| 25% | 0.089 / 0.135 | 0.065 / 0.129 | 0.053 / 0.117 | 0.157 / 0.129 | −0.015 / 0.076 |
| Median | 0.139 / 0.163 | 0.144 / 0.172 | 0.134 / 0.162 | 0.213 / 0.194 | 0.118 / 0.138 |
| 75% | 0.149 / 0.251 | 0.174 / 0.232 | 0.160 / 0.223 | 0.271 / 0.282 | 0.124 / 0.184 |
| Max | 0.238 / 0.970 | 0.213 / 0.907 | 0.210 / 0.958 | 0.398 / 1.036 | 0.155 / 0.814 |
Table 4. Classification performance metrics on the test set. The high scores reflect strong spectral separability. Metrics were calculated using a 0.5 probability threshold, and macro-averaged precision, recall, and F1-score are reported to provide a balanced view independent of class imbalance. See Figure 8 for the confusion matrices and Appendix C for calculation formulas.
| Algorithm | Accuracy | Precision (Macro) | Recall (Macro) | F1-Score (Macro) | Train Time (s) |
|---|---|---|---|---|---|
| Random Forest | 0.9962 | 0.9898 | 0.9760 | 0.9828 | 2093.89 |
| KNN | 0.9962 | 0.9877 | 0.9781 | 0.9828 | 38.89 |
| XGBoost | 0.9964 | 0.9899 | 0.9776 | 0.9837 | 225.61 |
| MLP | 0.9964 | 0.9879 | 0.9798 | 0.9838 | 11,378.22 |
| 1D-CNN | 0.9948 | 0.9800 | 0.9734 | 0.9767 | 122.05 |
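The macro-averaged metrics reported above follow directly from a binary confusion matrix (Figure 8b): per-class precision, recall, and F1 are computed and then averaged without class weighting. A sketch with illustrative counts, not the paper’s actual cell values:

```python
import numpy as np

def macro_metrics(tn, fp, fn, tp):
    """Accuracy plus macro-averaged precision/recall/F1 for the
    two classes NS (negative) and S (positive)."""
    prec = [tn / (tn + fn), tp / (tp + fp)]   # per-class precision (NS, S)
    rec  = [tn / (tn + fp), tp / (tp + fn)]   # per-class recall (NS, S)
    f1   = [2 * p * r / (p + r) for p, r in zip(prec, rec)]
    acc  = (tp + tn) / (tn + fp + fn + tp)
    return acc, np.mean(prec), np.mean(rec), np.mean(f1)

# Illustrative counts for a strongly imbalanced test set.
acc, p, r, f1 = macro_metrics(tn=55_000, fp=120, fn=90, tp=3_500)
print(f"accuracy={acc:.4f} precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because the macro average weights both classes equally, it exposes minority-class ("sargassum") errors that overall accuracy hides under the dominant "no_sargassum" class.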