Article

Probabilistic Water Quality Monitoring Using Multi-Temporal Sentinel-2 Data: A Situational Awareness Framework for Harmful Algal Bloom Forecasting

by Muhammad Zaid Qamar, Cristiano Ciccarelli, Mohammed Ajaoud and Massimiliano Lega *
Department of Engineering, University of Napoli ‘Parthenope’, Centro Direzionale, Isola C4, 80143 Napoli, Italy
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(6), 959; https://doi.org/10.3390/rs18060959
Submission received: 28 January 2026 / Revised: 6 March 2026 / Accepted: 17 March 2026 / Published: 23 March 2026

Highlights

What are the main findings?
  • A confidence-based water quality monitoring framework integrating Sentinel-2 imagery with XGBoost quantile regression (0.05, 0.50, 0.95 quantiles) and LightGBM temporal forecasting achieved 2.9% and 5.7% MAPE for 10-day and 20-day harmful algal bloom forecasts with 90% prediction intervals.
  • Analysis of 235 data points from Lake Okeechobee revealed a 47.2% bloom frequency, with LightGBM outperforming XGBoost, Random Forest, and Ridge Regression across all temporal horizons (RMSE: 0.2333 vs. 0.3017+ for 10-day forecasts).
What are the implications of the main findings?
  • The probabilistic paradigm transforms water quality monitoring from deterministic predictions to uncertainty-aware decision support, enabling resource managers to implement risk-based responses through categorical classifications (LOW, MEDIUM, HIGH) that accommodate stakeholder-specific risk tolerances.
  • Integration of satellite remote sensing with quantile regression provides a scalable operational framework for water quality monitoring and early warning systems, bridging the algorithmic-operational gap through proof-of-concept visualization tools that communicate prediction reliability alongside forecasted bloom magnitude for actionable management decisions.

Abstract

Environmental monitoring systems require robust uncertainty quantification for effective decision-making in complex ecological processes. Harmful algal blooms represent a critical challenge where prediction uncertainty directly impacts resource allocation and response timing, yet current remote sensing-based prediction systems provide only deterministic classifications without confidence measures. This gap between algorithmic predictions and actionable risk assessment limits operational utility for stakeholders managing water quality under varying risk tolerances. This study developed a transferable probabilistic forecasting framework integrating Sentinel-2 multispectral imagery with quantile regression and ensemble machine learning to generate continuous confidence indicators for cyanobacteria density prediction, demonstrated through its application to Lake Okeechobee, Florida. The methodology combines spectral indices extracted from Sentinel-2 data with XGBoost for quantile regression at 0.05, 0.50, and 0.95 probability levels, and LightGBM for multi-horizon temporal forecasting. Sentinel-2’s 13 spectral bands spanning visible to shortwave infrared wavelengths, combined with its 5-day revisit frequency, provide a spectrally rich and temporally dense input space that is well-suited to gradient boosting methods such as XGBoost, which can exploit complex nonlinear interactions among spectral features to distinguish cyanobacterial signatures from background water constituents. LightGBM achieved mean absolute percentage errors of 2.9% for 10-day forecasts and 5.7% for 20-day forecasts, outperforming conventional regression models. The framework generates 90% prediction intervals that enable reliable risk classifications for operational bloom management.
This approach bridges the gap between satellite-based algal bloom detection and actionable decision-making by quantifying predictive uncertainty, representing a shift from binary classifications to probability-based environmental monitoring systems that accommodate varying stakeholder risk tolerances in water quality management applications.

1. Introduction

Freshwater quality degradation poses critical challenges to public health, ecosystem services, and economic activities globally, with harmful algal blooms (HABs) representing one of the most pressing contemporary environmental threats. The increasing frequency and intensity of HAB events, driven by nutrient loading and climate change, necessitate robust monitoring and prediction systems capable of supporting timely management interventions. However, traditional monitoring approaches face critical limitations in spatial coverage, temporal resolution, and predictive capability, creating urgent demand for advanced technologies that can enhance both the detection and forecasting of bloom events [1]. Environmental monitoring systems have traditionally focused on generating deterministic outcomes that fail to adequately communicate the inherent uncertainty in environmental forecasting. This limitation becomes critical in environmental systems where potential threats can rapidly escalate into disasters within seconds of unexpected events, necessitating robust monitoring throughout all operational phases [2]. The concept of situational awareness (SA), originally developed in military and aviation contexts, has become increasingly relevant in environmental monitoring and management. Situational awareness refers to the perception of environmental elements with respect to time and space, the comprehension of their meaning, and the projection of their status in the near future [3]. Recent advances in situational awareness techniques have demonstrated the importance of incorporating uncertainty in reasoning processes, particularly through Bayesian networks that provide probabilistic outputs for complex environmental systems. Decision-making under environmental uncertainty has been recognized as a critical challenge requiring sophisticated approaches to handle complexity and multidimensional criteria [4].
The SAFE (Situational Awareness for Environment) Team at the University of Naples ‘Parthenope’ has pioneered the application of situational awareness principles to environmental monitoring, developing methodologies and technologies that provide decision-makers with a comprehensive understanding of complex environmental systems [5,6]. This approach recognizes that effective environmental management requires contextual intelligence that enables timely and appropriate responses to emerging threats. The development of intelligent situation awareness support systems for safety-critical environments has shown the effectiveness of incorporating uncertainty modeling through causal processes that represent variables and their probabilistic relationships [7]. Recent advances have demonstrated the effectiveness of hierarchical monitoring approaches that coordinate the spatiotemporal resolution of multilayer and multispectral sensors to characterize pollution phenomena and enhance situational awareness in complex environmental scenarios [7].
Harmful algal blooms (HABs) represent a particularly challenging environmental phenomenon where situational awareness is critical. The rapid development of HABs, combined with their significant ecological, economic, and public health impacts, necessitates monitoring systems that provide both comprehensive coverage and actionable predictions. Traditional monitoring approaches, which typically involve discrete water sampling and laboratory analysis, often fail to capture the dynamic nature of bloom development and do not provide the lead time necessary for preventive action [8]. Recent studies have demonstrated the effectiveness of integrated fast detection strategies (FDSs) that combine satellite and drone remote sensing with molecular analytics and bio-monitoring to track and assess cyanobacterial harmful algal bloom (cyanoHAB) spread and toxin biomagnification across complex aquatic systems [9]. Current approaches combining satellite remote sensing with machine learning generate deterministic predictions without uncertainty quantification, limiting their utility for stakeholders with varying risk tolerances [9,10].
The emergence of advanced satellite remote sensing platforms, particularly the European Space Agency’s Sentinel-2 mission, has created new opportunities for comprehensive environmental monitoring. These platforms provide high-resolution multispectral imagery with frequent revisit times, enabling the detection and tracking of environmental phenomena across broad spatial and temporal scales. However, without site-specific analyses aimed at identifying pollution sources, transport, and fate (i.e., target), no remediation action can be developed [11]. When integrated with machine learning approaches and site-specific understanding, these data sources can forecast environmental conditions with unprecedented accuracy and lead time, providing the foundation for effective intervention strategies. Comparative evaluations of machine learning algorithms using Sentinel-2 data have shown promising results for mapping floating algal blooms, with ensemble methods demonstrating superior performance over individual algorithms [12]. Additionally, studies have successfully applied satellite band ratios combined with machine learning approaches for predicting harmful cyanobacterial blooms in eutrophic reservoirs, demonstrating the potential of these integrated methodologies [13,14]. The development of near real-time monitoring tools using satellite remote sensing has further enhanced the capability to track harmful algal blooms and turbidity in water bodies with practical applications for water management [15]. The integration of satellite data with proximal sensing platforms, such as drones, has been shown to provide enhanced spatial resolution and complementary information layers, creating a hierarchical monitoring framework that bridges the gap between large-scale satellite observations and detailed local assessments [16].
Recent advances in satellite-based monitoring using Sentinel-2 imagery and gradient boosting methods have demonstrated promise for HAB detection and prediction [12,17,18,19,20,21]. While existing frameworks show technical proficiency in bloom detection, they focus on point estimates rather than probabilistic assessments, failing to communicate the prediction reliability and site-specific confidence levels essential for effective environmental management under uncertainty. The importance of uncertainty quantification in environmental applications has been increasingly recognized, with neural network approaches showing promise for providing both central predictions and uncertainty information in various environmental contexts [22,23]. Studies of cyanobacterial blooms have revealed the critical importance of understanding both natural cycles and anthropogenic influences, as demonstrated by research showing the relationship between urban activities and bloom dynamics during the COVID-19 pandemic [24]. Furthermore, recent advances in bioindicator monitoring have shown the effectiveness of using cyanobacteria as reliable indicators of environmental changes, with innovative approaches enabling the detection and characterization of toxic blooms and their associated risks [25]. Decision-makers require not just predictions of bloom occurrence, but an understanding of the confidence associated with these predictions to allocate resources effectively and implement appropriate response measures. The analysis of regression confidence intervals and Bayesian credible intervals has provided valuable insights into parametric and predictive uncertainties in environmental modeling, offering frameworks for better uncertainty communication [26].
In this paper, we present a novel algorithmic framework that fundamentally reimagines environmental prediction through a confidence-based approach aligned with SA principles. Rather than generating binary classifications or point estimates, our system produces confidence metrics that express the likelihood of different environmental states occurring. This paradigm shift acknowledges the inherent uncertainty in environmental prediction while providing actionable intelligence to decision-makers, building upon established principles of multi-scale environmental monitoring and assessment [27]. Recent developments in uncertainty quantification for probabilistic machine learning in earth observation, particularly using conformal prediction methods, have demonstrated the feasibility of providing reliable uncertainty estimates for environmental applications [28].
The framework integrates multi-source satellite data, advanced feature engineering, quantile regression modeling, and time series forecasting to generate predictions with explicitly quantified uncertainty. The application of quantile regression methods has shown particular promise in environmental forecasting contexts, with studies demonstrating their effectiveness for the probabilistic forecasting of pollution levels and other environmental parameters [29]. Probabilistic forecasting approaches using quantile gradient boosting have proven successful in solar irradiance prediction and other environmental applications, providing both point estimates and uncertainty bounds [30]. By visualizing these metrics through an intuitive visual interface, our system operationalizes the theoretical foundations of situational awareness—transforming abstract concepts of environmental perception, comprehension, and projection into concrete, actionable outputs for decision-makers. The design and implementation of environmental decision support systems has evolved to emphasize the importance of interactive and graphical user–machine interfaces that provide useful information for environmental management problems [31].
To illustrate how the confidence-based paradigm translates into operational practice, we developed a proof-of-concept dashboard implementation, the HAB Risk Prediction System. This interactive web-based interface demonstrates the practical feasibility of communicating probabilistic predictions to end users. The current version requires users to upload a CSV file containing pre-extracted features from satellite imagery, after which the system automatically processes the data through the complete prediction pipeline. The dashboard displays three key metrics: risk level classification (Low, Medium, or High), predicted cyanobacteria density with uncertainty estimates shown as a percentage interval, and the analysis type indicating whether the prediction represents a forecast, historical analysis, or current assessment. The system also displays a visual risk gauge that clearly communicates the severity level, along with model performance indicators such as the mean absolute percentage error (MAPE) to convey prediction confidence. This enhanced reporting of uncertainty enables users to make informed decisions based on both the predicted values and their associated reliability.
We demonstrate the application of this framework to harmful algal bloom prediction in Lake Okeechobee, Florida, utilizing Sentinel-2 satellite imagery and a machine learning approach based on gradient boosting. Lake Okeechobee (Figure 1) has been the subject of extensive research on cyanobacterial bloom monitoring using satellite remote sensing, with studies demonstrating the effectiveness of measuring bloom magnitude and tracking temporal changes in bloom extent [10]. Recent research has highlighted the importance of understanding the impact of monitoring frequency on management decisions, particularly in the context of satellite and in situ cyanobacteria monitoring systems [32]. Advanced time-series approaches using MODIS satellite imagery and deep learning methods have shown promising results for predicting chlorophyll-a concentrations and harmful algal blooms in Lake Okeechobee specifically [17,33,34]. Through this case study, we illustrate how our confidence-based approach enhances decision support by providing nuanced risk assessments that integrate both predicted values and confidence metrics. This dashboard represents one component of a comprehensive suite of situational awareness tools being developed by our laboratory to enhance environmental decision-making capabilities.
Our key contributions include:
  • A confidence-based prediction framework replacing binary outputs with quantile regression-based uncertainty intervals;
  • An integrated architecture combining current condition assessment (XGBoost) with temporal forecasting (LightGBM);
  • A novel risk classification scheme incorporating both predicted values and confidence levels for actionable intelligence;
  • An interactive dashboard system enabling stakeholder-specific risk interpretation and decision support;
  • Empirical validation demonstrating operational utility and multi-horizon forecasting capabilities (10- to 20-day) through a representative case study in a large subtropical freshwater system.

2. Materials and Methods

2.1. Study Area

This study focused on Lake Okeechobee, Florida, USA (27.0°N, 80.8°W), the largest freshwater lake in Florida with a surface area of approximately 1730 km² and mean depth of 2.7 m (Figure 2). Lake Okeechobee represents an ideal testbed for probabilistic HAB prediction due to its well-documented history of harmful algal blooms, extensive monitoring networks, and ecological significance for the surrounding region [35]. The lake’s shallow, subtropical characteristics make it particularly susceptible to cyanobacterial bloom development, with documented seasonal and spatial variations in bloom frequencies since the mid-1980s [36]. Climatic conditions, including hurricanes and El Niño events, have been shown to significantly influence bloom patterns and dynamics in this system [37]. Recent research has highlighted the spatiotemporal diversity and community structure of cyanobacteria in this large shallow subtropical lake, emphasizing the complex environmental factors that influence bloom dynamics and necessitate probabilistic forecasting approaches [38]. The extensive historical monitoring data available for Lake Okeechobee enables the rigorous validation of our confidence-based prediction framework against both historical and current conditions.

2.2. System Architecture

Data collection, elaboration, and integration are fundamental to detecting environmental trends and developing robust prediction frameworks for complex ecological phenomena [39,40,41]. Our confidence-based HAB prediction framework operationalizes situational awareness through five integrated components that systematically address the three SA levels: perception (data ingestion and feature extraction), comprehension (density prediction with uncertainty quantification), and projection (temporal forecasting and risk classification). These components transform raw satellite observations into actionable risk assessments with explicit uncertainty quantification (Figure 3):
  • Data Ingestion and Caching Module: This component is responsible for the automated retrieval of Sentinel-2 satellite imagery from the Microsoft Planetary Computer platform, the extraction of relevant spectral bands, and storage of processed data in a structured cache system to avoid redundant downloads and enable efficient temporal analysis. The use of cloud computing platforms for satellite data processing has become increasingly important for large-scale environmental monitoring applications, providing scalable infrastructure for handling multi-temporal datasets [42].
  • Feature Engineering Module: This module extracts a comprehensive set of environmental indicators from satellite imagery, including spectral indices (e.g., Normalized Difference Vegetation Index, Chlorophyll Index), band ratios optimized for cyanobacteria detection, and contextual information such as land cover type derived from ancillary geospatial datasets. Spectral feature optimization has proven critical for water quality estimation using remote sensing approaches, with the careful selection of wavelength combinations significantly improving predictive performance [43,44,45,46].
  • Cyanobacteria Density Prediction Module: This component employs extreme gradient boosting (XGBoost) models trained with quantile regression loss functions to predict cyanobacteria density at three probability levels (0.05, 0.50, and 0.95 quantiles), providing explicit uncertainty bounds around point estimates. Recent advances in quantile extreme gradient boosting have demonstrated enhanced capability for uncertainty quantification in environmental applications, particularly for capturing both aleatoric and epistemic uncertainties in complex nonlinear systems [47].
  • Time Series Forecasting Module: This module utilizes Light Gradient Boosting Machine (LightGBM) algorithms to forecast cyanobacteria density over 10-day and 20-day horizons, incorporating temporal dependencies and seasonal patterns while explicitly modeling prediction uncertainty through quantile regression. LightGBM has shown particular effectiveness in environmental time series forecasting applications, including greenhouse temperature prediction and water quality monitoring, due to its computational efficiency and ability to handle high-dimensional feature spaces [48,49,50,51,52].
  • Risk Assessment and Visualization Module: This component synthesizes predicted values and uncertainty estimates into categorical risk classifications (LOW, MEDIUM, HIGH) and generates intuitive visualizations through an interactive web-based dashboard interface (Figure 4). From an operational deployment standpoint, the risk classification scheme accounts for both the magnitude of predicted cyanobacteria density and the width of prediction intervals, enabling integration into existing water management workflows where different agencies may maintain distinct action thresholds based on their regulatory mandates and resource constraints.
The integrated system operates with minimal human intervention, automatically fetching the most recent satellite imagery, processing spectral features, generating predictions with uncertainty bounds, and producing multi-horizon forecasts on a regular basis. This automated workflow ensures consistent monitoring and enables near-real-time risk assessment for operational HAB management.
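As a rough illustration of how a risk label could combine predicted magnitude with interval width, the sketch below maps the three quantile estimates to LOW, MEDIUM, or HIGH. The threshold values, the one-step escalation rule, and the names `classify_risk` and `RiskAssessment` are assumptions made for this example; they are not the classification rules used by the dashboard.

```python
from dataclasses import dataclass

# Placeholder thresholds in cells/mL (assumptions for illustration only).
MEDIUM_THRESHOLD = 20_000.0
HIGH_THRESHOLD = 100_000.0
LEVELS = ("LOW", "MEDIUM", "HIGH")


@dataclass(frozen=True)
class RiskAssessment:
    level: str             # "LOW", "MEDIUM", or "HIGH"
    relative_width: float  # (q95 - q05) / q50, a unitless measure of uncertainty
    escalated: bool        # True if the upper bound pushed the level up


def classify_risk(q05, q50, q95,
                  medium=MEDIUM_THRESHOLD, high=HIGH_THRESHOLD):
    """Map 0.05/0.50/0.95 quantile predictions to a categorical risk level."""
    if not q05 <= q50 <= q95:
        raise ValueError("quantiles must satisfy q05 <= q50 <= q95")

    def band(value):
        # Index into LEVELS for a single density value.
        return 2 if value >= high else 1 if value >= medium else 0

    base = band(q50)
    # Precautionary rule (assumption): if the upper bound of the 90% interval
    # reaches a higher band than the median, raise the level by one step.
    escalated = band(q95) > base
    idx = min(base + 1, 2) if escalated else base
    rel_width = (q95 - q05) / q50 if q50 > 0 else float("inf")
    return RiskAssessment(LEVELS[idx], rel_width, escalated)
```

An agency with a stricter mandate could pass its own `medium` and `high` values, which is one way to accommodate the distinct action thresholds mentioned above.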

2.3. Data Sources

2.3.1. Sentinel-2 Multispectral Imagery

The primary remote sensing data source for this study was Sentinel-2 multispectral imagery, accessed through the Microsoft Planetary Computer API [53]. The Sentinel-2 mission, consisting of twin satellites (Sentinel-2A launched June 2015, Sentinel-2B launched March 2017), provides systematic global coverage with a 5-day revisit frequency at the equator and improved revisit times at higher latitudes [54]. The MultiSpectral Instrument (MSI) aboard Sentinel-2 satellites acquires imagery in 13 spectral bands spanning visible, near-infrared, and shortwave infrared wavelengths, with spatial resolutions of 10 m, 20 m, and 60 m depending on the spectral band [55].
For this application, we utilized all 13 spectral bands available from Sentinel-2 Level-2A (L2A) products, which provide bottom-of-atmosphere reflectance with atmospheric corrections applied. To balance information content with computational efficiency and maintain consistency across multi-temporal analyses, all bands were resampled to a common spatial resolution of 60 m using bilinear interpolation. The specific spectral bands processed for feature extraction include:
  • Visible and Near-Infrared bands (10 m native resolution): Blue (B02, 490 nm), Green (B03, 560 nm), Red (B04, 665 nm), and Near-Infrared (B08, 842 nm);
  • Red Edge and Shortwave Infrared bands (20 m native resolution): Vegetation Red Edge (B05, 705 nm; B06, 740 nm; B07, 783 nm), Narrow Near-Infrared (B8A, 865 nm), and Shortwave Infrared (B11, 1610 nm; B12, 2190 nm);
  • Atmospheric and Quality bands (60 m native resolution): Coastal Aerosol (B01, 443 nm), Water Vapor (B09, 945 nm), and SWIR-Cirrus (B10, 1375 nm);
  • Auxiliary Products: Scene Classification Layer (SCL) for cloud masking and quality assessment, and Aerosol Optical Thickness (AOT) for atmospheric correction verification.
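The role of the Scene Classification Layer can be sketched in a few lines of NumPy. The class codes follow the standard Sentinel-2 L2A SCL legend (6 = water; 3, 8, 9, 10 = cloud shadow, medium- and high-probability cloud, thin cirrus). The helper names and the decision to average reflectance over water pixels only are assumptions for illustration, not a description of the authors' pipeline.

```python
import numpy as np

SCL_WATER = 6
# Classes treated as unusable here (assumption): no data, saturated/defective,
# cloud shadow, cloud medium probability, cloud high probability, thin cirrus.
SCL_INVALID = (0, 1, 3, 8, 9, 10)


def water_mask(scl):
    """Boolean mask that is True where the SCL marks a pixel as water."""
    return np.asarray(scl) == SCL_WATER


def percent_water(scl):
    """Percentage of all pixels in the window classified as water."""
    scl = np.asarray(scl)
    if scl.size == 0:
        return 0.0
    return 100.0 * np.count_nonzero(scl == SCL_WATER) / scl.size


def invalid_fraction(scl):
    """Fraction of pixels falling in the cloud/shadow/no-data classes."""
    scl = np.asarray(scl)
    if scl.size == 0:
        return 0.0
    return float(np.isin(scl, SCL_INVALID).mean())


def masked_band_mean(band, scl):
    """Mean reflectance of a band over water pixels only (NaN if none)."""
    mask = water_mask(scl)
    if not mask.any():
        return float("nan")
    return float(np.asarray(band, dtype=float)[mask].mean())
```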

2.3.2. Training Dataset

For model training, we utilized the Cyanobacteria Aggregated Manual Labels (CAML) dataset [56], a large-scale collection of in situ cyanobacteria measurements compiled from 14 data providers across the United States and archived through NASA’s SeaBASS data repository. The CAML dataset contains 23,570 georeferenced ground measurements of cyanobacteria cell counts (cells/mL) collected at inland water bodies throughout the U.S. over the period 2013–2021. The primary water quality parameter used from this dataset was cyanobacteria cell density, which served as the target variable for supervised learning. Sampling frequencies and monitoring protocols varied across the contributing data providers, as the dataset aggregates measurements from multiple state and federal monitoring programs with different sampling schedules. Severity levels within the dataset were categorized based on World Health Organization (WHO) cyanobacteria density thresholds, providing standardized risk classifications across diverse water bodies. After filtering for the availability of corresponding Sentinel-2 imagery (post-2015 for Sentinel-2A), 11,655 sample points remained across all U.S. water bodies represented in the dataset. This nationally distributed training set enabled the XGBoost density prediction model to learn generalized spectral–cyanobacteria relationships across diverse lake types, optical water properties, and geographic regions, rather than being constrained to site-specific patterns. The model was trained on this full national dataset to predict cyanobacteria density for a given location and date based on satellite-derived spectral features. Because the training data spanned a wide range of U.S. inland water bodies, the resulting model is inherently transferable and can be applied to any lake within the continental United States for which Sentinel-2 imagery is available. 
For the Lake Okeechobee case study presented in this work, a subset of 235 site-specific data points was used for the application and validation of the framework (see Section 3.1). The integration of extensive in situ cyanobacteria measurements with satellite remote sensing has proven essential for developing accurate machine learning models for water quality assessment [43].

2.3.3. Feature Engineering

We derived a comprehensive set of features from satellite imagery to capture spectral, spatial, and temporal characteristics relevant to algal bloom development. Our feature engineering approach extracted 68 distinct features organized into several categories. The feature categories were selected based on their established ecological relevance to cyanobacterial bloom detection. NDVI variants using different red-edge bands captured the spectral signature of chlorophyll-a and phycocyanin pigments characteristic of cyanobacteria, while band ratios (green/red, green/blue, red/blue) exploited the differential absorption and scattering properties of algal-laden versus clear water. Percentile-based features (95th and 5th percentiles of the green band) characterized spatial heterogeneity within each observation window, which served as a proxy for bloom patchiness. Water classification percentages provided contextual information about the proportion of valid water pixels, and full band statistics (mean, min, max, range) across all 13 Sentinel-2 bands captured the overall radiometric profile of the water surface. Temporal and metadata features (month of acquisition, days before sampling, land cover classification) accounted for seasonality and environmental context that influence bloom development. The specific features are organized as follows:
1. Normalized Difference Vegetation Index (NDVI) variants:
  • NDVI_B04: (B08 − B04)/(B08 + B04);
  • NDVI_B05: (B08 − B05)/(B08 + B05);
  • NDVI_B06: (B08 − B06)/(B08 + B06);
  • NDVI_B07: (B08 − B07)/(B08 + B07).
2. Band ratios:
  • Green/Red ratio: B03/B04;
  • Green/Blue ratio: B03/B02;
  • Red/Blue ratio: B04/B02;
  • Green 95th percentile to blue mean ratio: percentile(B03, 95)/mean(B02);
  • Green 5th percentile to blue mean ratio: percentile(B03, 5)/mean(B02).
3. Percentile-based features:
  • 95th percentile of green band values (green95th);
  • 5th percentile of green band values (green5th).
4. Water classification:
  • Percentage of pixels classified as water using the Scene Classification Layer (percent_water).
5. Band statistics for all 15 Sentinel-2 bands (AOT, B01-B12, B8A, SCL, WVP):
  • Mean values (e.g., B01_mean, B02_mean, …, WVP_mean);
  • Minimum values (e.g., B01_min, B02_min, …, WVP_min);
  • Maximum values (e.g., B01_max, B02_max, …, WVP_max);
  • Range values (e.g., B01_range, B02_range, …, WVP_range).
6. Temporal and metadata features:
  • Month of acquisition;
  • Days before sampling;
  • Land cover classification.
7. The complete feature set includes:
  • Satellite image features (25 features): B01_mean, B02_mean, B03_mean, B04_mean, B05_mean, B06_mean, B07_mean, B08_mean, B09_mean, B11_mean, B12_mean, B8A_mean, WVP_mean, AOT_mean, percent_water, green95th, green5th, green_red_ratio, green_blue_ratio, red_blue_ratio, green95th_blue_ratio, green5th_blue_ratio, NDVI_B04, NDVI_B05, NDVI_B06, NDVI_B07, AOT_range;
  • Satellite metadata features (2 features): month, days_before_sample;
  • Sample metadata features (1 feature): land_cover.
This comprehensive feature engineering approach ensures that both the spectral characteristics of the water body and the temporal context of the observation are captured, providing a robust foundation for both the current density prediction and time series forecasting models. The use of statistical relationships between remotely sensed spectral values and water quality parameters has been well-established in empirical remote sensing approaches [57,58].
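The index and ratio definitions listed above translate directly into code. The following is a minimal sketch that assumes each band arrives as a 2-D reflectance array covering the observation window and that the indices are computed from window-mean reflectances; whether the original pipeline works on means or per pixel is not stated, and the function names are hypothetical.

```python
import numpy as np


def safe_ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return float(num / den) if den != 0 else float("nan")


def spectral_features(bands):
    """Compute NDVI variants, band ratios, and green-band percentiles.

    bands: dict mapping band names ("B02", "B03", ..., "B08") to 2-D
    reflectance arrays over the same window (assumed already masked).
    """
    mean = {name: float(np.mean(arr)) for name, arr in bands.items()}
    feats = {}

    # NDVI variants: (B08 - Bxx) / (B08 + Bxx) for red and red-edge bands.
    for ref in ("B04", "B05", "B06", "B07"):
        feats[f"NDVI_{ref}"] = safe_ratio(mean["B08"] - mean[ref],
                                          mean["B08"] + mean[ref])

    # Simple band ratios.
    feats["green_red_ratio"] = safe_ratio(mean["B03"], mean["B04"])
    feats["green_blue_ratio"] = safe_ratio(mean["B03"], mean["B02"])
    feats["red_blue_ratio"] = safe_ratio(mean["B04"], mean["B02"])

    # Green-band percentiles as a proxy for spatial heterogeneity.
    g95 = float(np.percentile(bands["B03"], 95))
    g5 = float(np.percentile(bands["B03"], 5))
    feats["green95th"] = g95
    feats["green5th"] = g5
    feats["green95th_blue_ratio"] = safe_ratio(g95, mean["B02"])
    feats["green5th_blue_ratio"] = safe_ratio(g5, mean["B02"])
    return feats
```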

2.4. XGBoost and LightGBM

XGBoost (Extreme Gradient Boosting) was utilized in this work to predict cyanobacteria density using features extracted from satellite data such as spectral band values and vegetation/water indices computed from Sentinel-2 images. This machine learning algorithm is especially beneficial for remote sensing, as it can model complex, nonlinear relationships and is robust to overfitting due to inherent regularization methods [59]. The model was trained using data that are both spatially and temporally resolved, enabling it to capture variations in cyanobacterial dynamics driven by environmental and spectral predictors. The superior predictive performance and scalability of XGBoost made it an appropriate model for this research. The algorithm follows the gradient boosting framework presented by Chen and Guestrin [60], which has been widely used for modeling structured data.
In addition, the LightGBM (Light Gradient Boosting Machine) algorithm was employed to forecast cyanobacteria density over 10-day and 20-day horizons, based on historical time series of cyanobacterial concentrations. LightGBM is a tree-based gradient boosting algorithm known for its efficiency, scalability, and accuracy on large datasets [61]. Its ability to identify complex, nonlinear relationships makes it well-suited for time series forecasting applications in environmental monitoring. Previous research has shown the effectiveness of LightGBM in harmful algal bloom forecasting based on satellite data and machine learning techniques [62]. LightGBM has also been applied to river system water level prediction [63], demonstrating its suitability for different types of hydrological time series data. By including LightGBM in our forecasting platform, we aim to improve the predictive accuracy for cyanobacterial bloom events, thereby supporting proactive water quality management strategies.

2.5. Quantile Regression for Uncertainty Quantification

Rather than training a single model to predict a point estimate of cyanobacteria density, we employed quantile regression to model the conditional distribution of the response variable. Specifically, we trained three XGBoost models to predict the 0.05, 0.50, and 0.95 quantiles of cyanobacteria density, which together give the lower bound, median, and upper bound of a 90% prediction interval.
The objective function for quantile regression with XGBoost is:
$$L_\tau(y, \hat{y}) = \sum_{i=1}^{n} w_i \, \rho_\tau\left(y_i - \hat{y}_i\right)$$
where $\rho_\tau(u) = u\left(\tau - I(u < 0)\right)$ is the quantile loss function, $\tau$ is the quantile of interest (0.05, 0.50, or 0.95), $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $w_i$ is the weight for the $i$-th observation. The function $I(u < 0)$ equals 1 when $u < 0$ and 0 otherwise.
This approach allows us to directly model the conditional quantiles of cyanobacteria density without making assumptions about the distribution of the response variable. The resulting prediction intervals provide a measure of uncertainty that accounts for heteroscedasticity (i.e., varying levels of uncertainty across the range of predictor values). The application of quantile regression with XGBoost has shown promise for uncertainty quantification in environmental modeling, including applications for groundwater nitrate pollution prediction and atmospheric pollution forecasting [64,65,66].
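As a minimal illustration of the quantile loss defined above, the following Python sketch evaluates the weighted pinball loss for a constant prediction. The function name `pinball_loss` and the toy values are hypothetical and are not taken from the study; the sketch only shows how the loss penalizes under- and over-prediction asymmetrically.

```python
def pinball_loss(y_true, y_pred, tau, weights=None):
    """Weighted quantile (pinball) loss, summed over observations."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, y_hat, w in zip(y_true, y_pred, weights):
        u = y - y_hat                       # residual
        indicator = 1.0 if u < 0 else 0.0   # I(u < 0)
        total += w * u * (tau - indicator)  # rho_tau(u) = u * (tau - I(u < 0))
    return total


# Toy log-density values (illustrative only, not data from the study).
y = [6.0, 7.0, 8.0, 9.0, 12.0]
constant = [7.0] * len(y)

# Under-prediction costs tau per unit; over-prediction costs (1 - tau) per unit.
loss_low = pinball_loss(y, constant, tau=0.05)
loss_high = pinball_loss(y, constant, tau=0.95)

# For tau = 0.50, the best constant prediction among the samples is the median.
best_median = min(y, key=lambda c: pinball_loss(y, [c] * len(y), tau=0.50))
```

With these toy values, a prediction of 7.0 is cheap under τ = 0.05 and expensive under τ = 0.95, because most observations lie above it. This is why minimizing the loss at different τ values yields different conditional quantiles.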

Uncertainty Propagation Between Modeling Stages

The two-stage modeling architecture requires careful consideration of uncertainty propagation. In the first stage, XGBoost quantile regression generates estimates at the 0.05, 0.50, and 0.95 quantiles for each satellite observation, providing a distributional estimate of cyanobacteria density conditioned on spectral features. The median (0.50 quantile) predictions serve as the primary input time series for temporal forecasting, while the quantile spread (0.95–0.05) provides a measure of instantaneous uncertainty.
In the second stage, LightGBM forecasts future median values using the historical sequence of XGBoost median predictions. Importantly, uncertainty from the first stage is not directly propagated through the LightGBM model; rather, LightGBM generates its own forecast uncertainty through quantile regression at the same probability levels (0.05, 0.50, 0.95).
The final prediction intervals thus represent forecast uncertainty (temporal extrapolation error) rather than compounded uncertainty from both stages. This approach was adopted because the direct propagation of input uncertainty through tree-based ensemble methods is computationally expensive and would require Monte Carlo simulation at each forecast step. The resulting intervals should be interpreted as measures of forecast reliability given the historical patterns in the time series, acknowledging that unquantified input uncertainty may cause actual prediction errors to occasionally exceed the stated intervals.
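One simple way to check whether stated intervals are exceeded more often than expected is to compute their empirical coverage on held-out observations. The sketch below is a generic diagnostic, not a procedure reported in this study; the function names and all numbers are hypothetical.

```python
def interval_coverage(observed, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return inside / len(observed)


def mean_interval_width(lower, upper):
    """Average spread between the upper and lower quantile forecasts."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)


# Toy held-out values and 0.05 / 0.95 quantile forecasts (illustrative only).
observed = [7.1, 6.8, 9.9, 8.4, 10.6]
q05 = [6.5, 6.0, 8.0, 7.9, 8.8]
q95 = [7.9, 7.5, 9.5, 9.3, 10.2]

coverage = interval_coverage(observed, q05, q95)
width = mean_interval_width(q05, q95)
```

For a nominal 90% interval, coverage well below 0.90 on a sufficiently large held-out set would suggest that the intervals are too narrow, for example because first-stage input uncertainty is not reflected in them.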

2.6. Risk Classification

We classified bloom risk into three categories (LOW, MEDIUM, HIGH) based on both the predicted cyanobacteria density and the calculated bloom probability:
1. HIGH risk:
  • Predicted value exceeds the severe bloom threshold (11.5, corresponding to 100,000 cells/mL), or
  • Predicted value exceeds the moderate bloom threshold (10.0, corresponding to 20,000 cells/mL) and bloom probability ≥ 0.7.
2. MEDIUM risk:
  • Predicted value exceeds the moderate bloom threshold and bloom probability ≥ 0.4 but <0.7, or
  • Predicted value below the moderate bloom threshold but bloom probability ≥ 0.7.
3. LOW risk:
  • All other cases.
This classification scheme integrates both the magnitude of the predicted value and the confidence in that prediction, providing a more nuanced assessment than traditional binary classifications based solely on exceeding a threshold. The risk classification thresholds used in this study are log-transformed cell densities derived from established World Health Organization and EPA guidelines for cyanobacterial risk assessment [67,68].
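The three-level rule set can be written as a short decision function. The sketch below is one possible encoding, assuming that "exceeds" means strictly greater than and that "below" means not exceeding; the function and constant names are hypothetical. The thresholds are consistent with a natural-logarithm scale, since ln(100,000) ≈ 11.5 and ln(20,000) ≈ 9.9.

```python
MODERATE_THRESHOLD = 10.0  # log scale, approx. 20,000 cells/mL (from the text)
SEVERE_THRESHOLD = 11.5    # log scale, approx. 100,000 cells/mL (from the text)


def classify_bloom_risk(predicted, bloom_probability):
    """Map a predicted log-density and a bloom probability to LOW/MEDIUM/HIGH."""
    above_moderate = predicted > MODERATE_THRESHOLD

    # HIGH: severe threshold exceeded, or moderate threshold with high probability.
    if predicted > SEVERE_THRESHOLD:
        return "HIGH"
    if above_moderate and bloom_probability >= 0.7:
        return "HIGH"

    # MEDIUM: moderate threshold with intermediate probability,
    # or below the threshold but with high bloom probability.
    if above_moderate and 0.4 <= bloom_probability < 0.7:
        return "MEDIUM"
    if not above_moderate and bloom_probability >= 0.7:
        return "MEDIUM"

    # LOW: all other cases.
    return "LOW"
```

For example, `classify_bloom_risk(10.5, 0.5)` returns `"MEDIUM"`, while the same predicted value with a bloom probability of 0.2 falls through to `"LOW"`.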

3. Results

3.1. Historical Data Analysis

Our analysis of Lake Okeechobee utilized a substantial historical dataset spanning from 2017 to July 2025, comprising 235 data points of cyanobacteria density measurements. These 235 observations represent the Lake Okeechobee-specific subset of the broader CAML national dataset (11,655 samples across all U.S. water bodies) used for model training (see Section 2.3.2). While the XGBoost density prediction model was trained on the full national dataset to learn generalized spectral–cyanobacteria relationships, the 235 Lake Okeechobee data points served as the site-specific time series for temporal forecasting, validation, and risk assessment. The study period began in 2017 to coincide with the launch of Sentinel-2B in March 2017, which established the full twin-satellite constellation and enabled the 5-day revisit frequency necessary for consistent multi-temporal monitoring. Data prior to 2017 were excluded because single-satellite coverage (Sentinel-2A only) resulted in less frequent and less reliable imagery acquisition over the study area. The end date of July 2025 reflects the most recent imagery available at the time of analysis. The 235 data points were selected through rigorous quality control processes, including cloud cover filtering (threshold < 20%), complete spectral band availability, and minimum temporal spacing (5 days) to reduce autocorrelation. This filtering ensures that each data point represents an independent, high-quality observation rather than redundant or degraded measurements. The resulting dataset adequately captures the temporal dynamics of algal bloom development while avoiding the computational burden and statistical complications of highly correlated sequential observations. The historical data revealed significant bloom events, with 111 instances (47.2% of the dataset) exceeding the moderate bloom threshold of 10.0 (log-transformed value, equivalent to approximately 20,000 cells/mL). 
This high frequency of bloom occurrences highlights the importance of effective monitoring and prediction systems for this water body.
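The quality control criteria described above (cloud cover below 20%, complete spectral bands, and a minimum spacing of 5 days) can be sketched as a chronological filter. The greedy selection order, the record fields (`date`, `cloud_pct`, `bands_complete`), and the toy scenes are assumptions made for illustration; the study's actual selection procedure may differ.

```python
from datetime import date


def quality_filter(scenes, max_cloud=20.0, min_gap_days=5):
    """Keep scenes that pass the cloud and band checks and are spaced apart in time."""
    kept = []
    for scene in sorted(scenes, key=lambda s: s["date"]):
        # Reject cloudy scenes and scenes with missing spectral bands.
        if scene["cloud_pct"] >= max_cloud or not scene["bands_complete"]:
            continue
        # Enforce the minimum spacing relative to the last accepted scene.
        if kept and (scene["date"] - kept[-1]["date"]).days < min_gap_days:
            continue
        kept.append(scene)
    return kept


# Toy acquisitions (illustrative only).
scenes = [
    {"date": date(2025, 4, 1), "cloud_pct": 5.0, "bands_complete": True},
    {"date": date(2025, 4, 3), "cloud_pct": 3.0, "bands_complete": True},    # too close
    {"date": date(2025, 4, 6), "cloud_pct": 35.0, "bands_complete": True},   # too cloudy
    {"date": date(2025, 4, 11), "cloud_pct": 10.0, "bands_complete": False}, # missing bands
    {"date": date(2025, 4, 16), "cloud_pct": 12.0, "bands_complete": True},
    {"date": date(2025, 4, 21), "cloud_pct": 19.9, "bands_complete": True},
]

selected = quality_filter(scenes)
```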
The most recent bloom event prior to our forecast period was detected on 17 April 2025, indicating ongoing vulnerability to harmful algal bloom development. Time series analysis of the historical data revealed seasonal patterns in bloom occurrence, with the highest frequency typically observed during the warmer months (late spring through early fall), consistent with the known ecology of cyanobacterial blooms in subtropical lakes. This seasonal pattern aligns with established research on the dynamics of cyanobacteria blooms in shallow Florida lakes, where bloom occurrence is strongly linked to hydrology and seasonal factors [69]. Studies have documented the relationship between wet and dry seasons and cyanobacterial community structure in Lake Okeechobee, with overlap between seasonal patterns affecting bloom development [38].

3.2. Temporal Forecasting Accuracy and Model Comparison

Among all tested models, LightGBM consistently achieved the lowest RMSE and MAPE values across the 10-day forecast horizon, with slightly less improvement over 20-day forecasts. Importantly, the objective of this model comparison was not to maximize predictive performance for Lake Okeechobee specifically, but rather to demonstrate that the confidence-based framework produces reliable uncertainty estimates across multiple modeling approaches. Our comparison against three baseline models (Table 1) confirmed that LightGBM’s performance advantage is genuine rather than artifactual, validating its selection for the probabilistic forecasting architecture [70,71]. The 23–28% improvement in RMSE over XGBoost and Random Forest for the 10-day forecasts indicates that the low error metrics stem from appropriate model selection rather than dataset peculiarities. Notably, even the simpler Ridge Regression baseline showed RMSE values that were substantially higher, confirming that the forecasting problem has inherent structure that our model successfully captures. Table 1 summarizes the comparative performance:
Figure 5 presents the RMSE comparisons, where LightGBM outperformed XGBoost, Random Forest, and Ridge Regression. Corresponding MAPE distributions further confirmed the robustness of LightGBM under temporal shifts.
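For reference, the error metrics used in this comparison can be computed as in the following sketch. The helper names and the toy forecast values are hypothetical. The final line uses the two 10-day RMSE values reported in the article's highlights (0.2333 for LightGBM and 0.3017 for the closest baseline) to show how the lower end of the stated 23–28% improvement range can be reproduced.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (undefined if any y_true is 0)."""
    n = len(y_true)
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y_true, y_pred)) / n


def relative_improvement(baseline, candidate):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Toy values (illustrative only).
y_true = [7.0, 8.0, 10.0]
y_pred = [7.2, 7.6, 10.5]
toy_rmse = rmse(y_true, y_pred)
toy_mape = mape(y_true, y_pred)

# RMSE values reported in the highlights for the 10-day horizon.
improvement = relative_improvement(baseline=0.3017, candidate=0.2333)
```

With the reported values, `improvement` evaluates to roughly 22.7%, which rounds to the 23% lower bound quoted above.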

3.3. Predictive Confidence and Temporal Dynamics

Figure 6 shows the R² scores for each model. LightGBM yielded an average R² > 0.40, demonstrating better explanatory power for median cyanobacteria densities compared to alternative models. While these R² values indicate room for improvement with expanded datasets, they are sufficient to validate the core methodological contribution: that quantile regression integrated with ensemble forecasting can produce meaningful uncertainty bounds even under data-limited conditions typical of many environmental monitoring applications. Forecast reliability was closely aligned with bloom seasonality; during bloom peaks, forecast MAPE rose moderately but remained within confidence intervals estimated via bootstrapping.

3.4. Feature Contributions

Feature importance analysis (Figure 7) highlights red-edge ratios, temporal lag variables, and proximity-to-shore as the most informative features. These align with known ecological drivers of cyanobacterial growth and spatial distribution.

3.5. Summary and Implications

LightGBM offers the best overall tradeoff between predictive accuracy and computational efficiency, especially at the 10-day forecast horizon. Its ability to ingest XGBoost-derived quantile sequences and produce forward forecasts with built-in confidence metrics marks a significant step toward operational HAB forecasting. Together, the integration of daily quantile estimation and LightGBM-based prediction pipelines offers a scalable, interpretable, and reliable solution for cyanobacterial bloom forecasting in large freshwater systems.
The mean absolute percentage error (MAPE), as shown in Figure 8 from the time series cross-validation, was 2.9%, indicating high confidence in the forecast accuracy. While these metrics are specific to Lake Okeechobee’s bloom dynamics, they served primarily to validate the framework’s capacity to generate reliable probabilistic forecasts rather than to establish performance benchmarks for this particular system. The low MAPE value demonstrates that confidence-based predictions can achieve operational-grade accuracy, supporting a broader adoption of probabilistic approaches in environmental monitoring. MAPE has been widely recognized as a reliable measure of forecast accuracy in time series prediction for environmental applications, with low values indicating high model performance [72,73]. The feature importance analysis for the 10-day LightGBM forecast identified rolling_std_5, lag_3, lag_2, rolling_mean_2, and rolling_min_5 as the top predictors, while the 20-day XGBoost model prioritized month, lag_6, day_of_year, rolling_mean_2, and rolling_mean_3. These results highlight the importance of both seasonal factors and recent short-term dynamics in forecasting bloom development. Our cross-validation approach using 5 folds means that each training set contains approximately 188 observations, which is sufficient for the model to learn the temporal patterns present in the data. The complexity of our feature set (16 engineered time series features including lags, rolling statistics, and seasonal indicators) was well-supported by this sample size, avoiding overfitting while capturing meaningful patterns.
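The feature names reported above (for example `lag_3`, `rolling_mean_2`, `rolling_std_5`, `rolling_min_5`) suggest lagged values and rolling statistics of the density time series. The sketch below shows one way such features could be built; the exact definitions, the use of past values only in each window, and the function name `make_features` are assumptions rather than the study's implementation.

```python
import statistics


def make_features(series, lags=(2, 3), windows=(2, 5)):
    """Build lag and rolling-window features for each step with enough history."""
    history = max(max(lags), max(windows))
    rows = []
    for t in range(history, len(series)):
        row = {}
        for k in lags:
            row[f"lag_{k}"] = series[t - k]
        for w in windows:
            window = series[t - w:t]  # past values only; series[t] is excluded
            row[f"rolling_mean_{w}"] = statistics.fmean(window)
            row[f"rolling_std_{w}"] = statistics.stdev(window)
            row[f"rolling_min_{w}"] = min(window)
        row["target"] = series[t]
        rows.append(row)
    return rows


# Toy log-density series (illustrative only).
series = [6.0, 6.5, 7.0, 8.0, 7.5, 9.0, 10.0]
rows = make_features(series)

# Training-set size per fold for 5-fold cross-validation on 235 observations.
train_size = 235 * (5 - 1) // 5
```

Restricting each window to values before the target step avoids leaking the quantity being forecast into its own predictors. The last line reproduces the figure of approximately 188 training observations per fold, assuming folds of equal size.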

3.6. Bloom Risk Assessment

Table 2 summarizes the bloom risk assessment for the forecast period based on the integrated confidence-based classification framework. Both forecast horizons were classified as MEDIUM RISK, with predicted log-transformed cyanobacteria densities of 6.9138 (10-day forecast) and 7.3592 (20-day forecast). The highest risk within the forecast period was predicted for 22 August 2025, with a log-transformed cyanobacteria density of 7.3592. The risk assessment gauge visualization implemented in the interactive dashboard (Figure 9) provides stakeholders with an intuitive representation of bloom severity levels, enabling the rapid interpretation of risk status for operational decision-making.
The consistency of the MEDIUM RISK classification across both forecast horizons suggests a stable bloom state that requires heightened monitoring but has not yet reached crisis levels requiring immediate intervention. Based on these probabilistic assessments, the system generated the following advisory recommendation: “Significant bloom risk detected in the forecast period. Recommend increased monitoring frequency and preliminary mitigation planning”. This actionable guidance provides clear operational direction for water resource managers while explicitly acknowledging the uncertainty inherent in multi-day forecasting.
The integration of the predicted values and confidence metrics in our risk classification framework provides a more nuanced assessment than traditional threshold-based approaches. While both forecasted values approached the moderate bloom threshold (10.0 on the log-transformed scale, approximately 20,000 cells/mL), neither was classified as HIGH RISK because the associated prediction intervals and bloom exceedance probabilities remained below the high-risk probability threshold established in our classification scheme. This exemplifies how the confidence-based approach avoids alarmist classifications when the certainty of exceeding critical thresholds is moderate, potentially preventing unnecessary resource allocation while maintaining appropriate vigilance. Traditional threshold-based approaches for harmful algal bloom detection and classification have been shown to have limitations when applied across multiple water bodies, as they do not account for the uncertainty inherent in predictions [74]. Recent developments in water quality risk assessment frameworks have emphasized the importance of dynamic threshold determination and probabilistic approaches rather than fixed thresholds for effective environmental management [75]. Our confidence-based framework addresses these limitations by incorporating both the predicted values and associated uncertainty bounds, providing a more robust foundation for decision-making in harmful algal bloom management that accommodates varying stakeholder risk tolerances and operational constraints.

4. Discussion

The confidence-based paradigm presented in this study represents a fundamental shift in environmental monitoring methodology, offering significant advantages over traditional deterministic approaches while addressing critical limitations in current remote sensing-based predictive frameworks. The framework’s practical utility for integrating uncertainty quantification with operational decision support systems is demonstrated through application to Lake Okeechobee, selected as a representative large subtropical freshwater system with well-documented bloom dynamics, though the methodology generalizes to comparable water bodies with sufficient historical data.

4.1. Advantages of the Confidence-Based Approach

The confidence-based framework provides several key advantages over traditional deterministic methods, demonstrating how uncertainty quantification fundamentally enhances situational awareness in environmental management practice. Where deterministic approaches limit decision-makers to a binary perception of environmental states, our results show that probabilistic outputs enable a richer comprehension of system dynamics and a more reliable projection of future conditions. This enhanced situational awareness facilitates more informed decision-making under uncertainty, allowing stakeholders to understand not just what conditions are predicted, but the reliability associated with those predictions. The integration of multiple remote sensing data sources further strengthens the approach by combining current conditions derived from Sentinel-2 multispectral imagery with historical patterns identified through time series analysis. This synergistic use of complementary data streams improves the prediction reliability while providing a more comprehensive understanding of environmental system dynamics than single-source approaches can achieve, consistent with findings from multi-source satellite integration studies demonstrating superior performance over single-sensor methods [76,77,78,79,80,81,82,83,84,85].
The context-sensitive risk assessment methodology represents another significant advantage of the framework. By considering both the predicted values and their associated confidence levels, the risk classification scheme enables more nuanced assessment that can be adapted to specific stakeholder needs and risk tolerances. This flexibility proves particularly valuable in environmental management contexts where different stakeholders may maintain varying thresholds for action based on their operational constraints, regulatory requirements, or resource availability. Despite acknowledging inherent uncertainty, the approach continues to provide actionable intelligence by categorizing risk into discrete levels (LOW, MEDIUM, HIGH) while preserving essential confidence information through prediction intervals. At the system level, this architecture enables modular integration with broader environmental monitoring infrastructure—categorical outputs can trigger automated alerts or feed into multi-hazard early warning systems, while continuous prediction intervals support more sophisticated decision-support tools requiring probabilistic inputs. This layered design ensures that the framework serves both immediate operational needs and longer-term system integration requirements.
The results from the Lake Okeechobee case study effectively demonstrate these advantages in practice. The LightGBM forecasting model successfully identified a persistent medium-risk state with moderate bloom probabilities across both the 10-day and 20-day forecast horizons, providing clear guidance for resource managers without overstating the risk or creating unnecessary alarm. The high forecast accuracy, as indicated by the low mean absolute percentage errors of 2.9% and 5.7% for the respective forecast horizons, further enhances the utility of this information for operational decision-making and resource allocation. These performance metrics compare favorably with existing HAB prediction systems while adding the critical dimension of uncertainty quantification that previous deterministic approaches lack.

4.2. Limitations and Challenges

Despite these advantages, the approach faces several important limitations and challenges that must be acknowledged and addressed in operational implementations. The effectiveness of the framework depends critically on regular access to high-quality satellite imagery, which can be limited by cloud cover, satellite revisit times, and data processing delays. This limitation proves particularly relevant in tropical and subtropical regions where persistent cloud cover during certain seasons can significantly impact data availability and system performance. The Sentinel-2 constellation’s 5-day revisit frequency at the equator partially mitigates this constraint, but extended periods of cloud cover can still create gaps in the monitoring record that affect both current condition assessment and forecast initialization.
Model generalizability presents another challenge, as the machine learning models, while trained on diverse satellite observations spanning multiple years and environmental conditions, may exhibit varying performance when applied to new regions with different bloom dynamics, phytoplankton community structures, or optical water properties. This limitation highlights the ongoing need for region-specific model calibration and validation efforts to ensure reliable performance across diverse geographic and environmental contexts. Transfer learning approaches and domain adaptation techniques may offer promising pathways to improve model generalizability while reducing the data requirements for new deployment locations.
Validation challenges further complicate the implementation and assessment of the approach. Validating predictions of harmful algal blooms requires contemporaneous in situ measurements, which are often limited in both spatial and temporal coverage compared to satellite observations. This constraint can make it difficult to rigorously evaluate model performance, particularly for multi-day forecasting applications where ground truth data may not be available until after prediction periods have elapsed. The sparse distribution of monitoring stations relative to the spatial coverage provided by satellite imagery also creates challenges in comprehensively validating spatially explicit predictions across entire water bodies.
Additionally, while the approach provides more comprehensive information than deterministic models, it requires users to interpret confidence metrics and prediction intervals effectively, which may present challenges for some stakeholders who lack experience with uncertainty-based decision-making frameworks. Effective visualization and communication strategies become essential to overcome these interpretation challenges and ensure the proper utilization of confidence-based predictions. The proof-of-concept dashboard interface developed for this study demonstrates one viable approach to addressing this challenge. As a prototype implementation, it validates the feasibility of communicating probabilistic predictions through intuitive visualizations, though production deployment would require continued refinement based on user feedback and operational experience to optimize information communication for diverse stakeholder groups.
In the Lake Okeechobee case study, these limitations were partially mitigated through the availability of a substantial historical dataset spanning 2013–2021 and the implementation of rigorous temporal cross-validation procedures to evaluate forecast accuracy. Critically, the primary contribution of this work is not optimized performance for this specific water body, but rather a demonstration that a confidence-based, probabilistic approach can successfully integrate satellite remote sensing with quantile regression to produce operationally useful uncertainty estimates. The generalizability of specific performance metrics to other water bodies remains to be established through additional validation studies; however, the underlying methodological framework (quantile regression for uncertainty quantification, multi-horizon forecasting with explicit prediction intervals, and risk classification incorporating confidence levels) transfers directly to comparable aquatic systems with sufficient historical data.
An important source of uncertainty arises from the spatial scale mismatch between satellite observations and point-based in situ measurements. Although Sentinel-2 provides native spatial resolutions of 10 m and 20 m for its primary spectral bands (with only the atmospheric quality bands at 60 m native resolution), each satellite pixel integrates reflectance over its entire footprint area, whereas in situ cyanobacteria samples represent a much smaller water volume at a discrete point location. This disparity means that satellite-derived spectral features represent an area-averaged signal that may not capture the fine-scale spatial heterogeneity of cyanobacterial distributions, particularly during early bloom stages or in patchy bloom conditions. In the present study, all bands were resampled to a common 60 m resolution for computational consistency across multi-temporal analyses, further increasing the averaging effect relative to the native 10–20 m bands. This scale mismatch constitutes an inherent source of uncertainty in any satellite–in situ modeling framework, and users should interpret the resulting prediction intervals as reflecting this uncertainty in addition to model-related and temporal uncertainty components. Future work could explore the use of finer spatial resolutions (10 m) for the primary visible and near-infrared bands to reduce this scale discrepancy, though this would come at increased computational cost.
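The averaging effect of coarser pixels can be illustrated with a small block-mean example. The sketch assumes that resampling from 10 m to 60 m behaves like a 6 × 6 block average, which is a simplification; the resampling method actually used is not specified here, and the reflectance values are invented for illustration.

```python
def block_mean(grid, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# 6 x 6 grid of 10 m pixels: clear water (0.02) with a 2 x 2 bloom patch (0.20).
grid = [[0.02] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        grid[r][c] = 0.20

coarse = block_mean(grid, factor=6)  # a single 60 m pixel
```

In this toy case, the 20 m × 20 m patch with a value of 0.20 is diluted to about 0.04 in the single 60 m pixel, which shows how small or patchy blooms can be muted in the area-averaged signal.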
Several additional contributions to prediction uncertainty during the training phase should be acknowledged. First, measurement uncertainty in the CAML in situ data arises from variability in sampling protocols, laboratory analytical methods, and cell counting procedures across the 14 contributing data providers. Second, atmospheric correction residuals in the Sentinel-2 Level-2A surface reflectance products introduce noise into the spectral features, particularly under hazy or partially cloudy conditions. Third, temporal misalignment between the satellite overpass time and the in situ sampling time can result in discrepancies when bloom conditions change rapidly. Fourth, model structural uncertainty stems from the inherent limitations of tree-based ensemble methods in extrapolating beyond the training data distribution. While the quantile regression framework captures a portion of these uncertainties through the width of prediction intervals, the individual contributions of these sources are not explicitly decomposed in the current implementation. Future extensions could incorporate formal uncertainty propagation methods or Bayesian frameworks to better disentangle these components.
It is worth noting that an alternative approach to uncertainty quantification would be to invert each Sentinel-2 image independently using a regression-based algorithm to estimate cyanobacteria concentrations at the pixel level, and then analyze the prediction uncertainty from the resulting spatiotemporal time series. Such a conventional workflow is well-established in water resources research and may yield similar uncertainty estimates when large historical datasets are available. However, the objectives of the two approaches differ in important respects. The image-inversion approach quantifies uncertainty in the retrieval of concentrations from individual images, focusing on instantaneous estimation accuracy. In contrast, our proposed framework is designed to produce multi-horizon probabilistic forecasts with explicit confidence metrics for future conditions, integrating both the uncertainty in current state estimation (via XGBoost quantile regression) and the additional uncertainty introduced by temporal extrapolation (via LightGBM forecasting). Furthermore, the image-inversion approach would require a separate per-pixel uncertainty model and would not directly provide the forward-looking risk classifications that are central to the situational awareness objectives of this work. The two approaches are therefore complementary rather than competing: the image-inversion method is better suited for retrospective spatial analysis of bloom extent and intensity, while our framework is optimized for operational forecasting and decision support under uncertainty.
The relatively modest size of the site-specific dataset represents an important limitation that warrants explicit discussion. While the XGBoost density prediction model benefits from a large national training set (11,655 samples across U.S. water bodies), the Lake Okeechobee-specific time series used for temporal forecasting comprises only 235 data points spanning 2017–2025. This sample size, although sufficient to demonstrate the viability of the confidence-based framework and to capture seasonal bloom dynamics, constrains the statistical power of the temporal forecasting models and may limit their ability to learn complex multi-year patterns or rare extreme bloom events. The 5-fold cross-validation procedure employed in this study provides approximately 188 training observations per fold, which supports the 16 engineered time series features used but leaves limited capacity for more complex model architectures. Future research should aim to expand the site-specific dataset through longer monitoring periods, the integration of additional in situ data sources beyond CAML, and potentially through data augmentation techniques that generate synthetic bloom trajectories while preserving the statistical properties of the observed time series. The continued operation of the Sentinel-2 constellation will naturally extend the available temporal record, enabling progressively more robust model training as additional years of imagery accumulate.
A related limitation is the notable decline in model explanatory power at longer forecast horizons, as reflected in the R² values observed across all models. For the 10-day forecast, LightGBM achieved an R² of 0.476, indicating that the model explains approximately 48% of the variance in cyanobacteria density. However, at the 20-day horizon, all models exhibited substantially reduced R² values (LightGBM: −0.001, XGBoost: 0.168, Random Forest: −0.061, Ridge Regression: 0.135), with most values near or below zero. This degradation is expected in environmental time series forecasting, as the predictive information contained in recent observations and short-term temporal features diminishes with increasing forecast lead time, and stochastic environmental drivers (e.g., wind events, precipitation, nutrient pulses) become increasingly dominant. Importantly, the low R² values at the 20-day horizon do not negate the utility of the probabilistic framework: the prediction intervals generated by quantile regression widen appropriately at longer horizons, providing honest uncertainty communication even when point prediction accuracy decreases. Future work could improve longer-horizon forecasts by incorporating exogenous environmental variables (e.g., meteorological forecasts, nutrient loading estimates, water level data) as additional predictors, by exploring recurrent neural network architectures (e.g., LSTM, GRU) that may better capture long-range temporal dependencies, and by increasing the training dataset size to provide more examples of multi-week bloom evolution patterns [86,87,88].
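To interpret the negative values reported above, it helps to recall the usual definition of the coefficient of determination, which is assumed here because the formula is not restated in this section:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the mean of the observed values in the evaluation set. Under this definition, $R^2 = 0$ corresponds to a forecast that is no better than always predicting $\bar{y}$, and $R^2 < 0$ arises whenever the squared forecast errors exceed the variance around the mean. A value of −0.001 therefore indicates a point forecast that is essentially equivalent to the mean predictor at that horizon.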

4.3. Implications for Environmental Monitoring and Decision Support

The confidence-based paradigm carries significant implications for how situational awareness is achieved in operational environmental management contexts. Our findings demonstrate that probabilistic outputs do not merely add information—they fundamentally restructure how decision-makers perceive, comprehend, and project environmental states, shifting cognition from binary categorization toward nuanced risk assessment. This transformation can lead to more robust decision-making under uncertainty and aligns with modern approaches to risk management in complex environmental systems. The probabilistic framework enables stakeholders to think more systematically about environmental risks and their associated uncertainties, potentially improving both the timing and effectiveness of management interventions.
The customizable nature of risk thresholds represents another significant implication of the framework. Different stakeholders typically maintain varying risk tolerances and definitions of what constitutes actionable intelligence based on their operational contexts, regulatory requirements, and available resources. The framework accommodates these differences by allowing risk thresholds to be adapted to specific use cases, thereby enhancing the utility of the system for diverse stakeholder groups including water utilities, environmental protection agencies, recreational water managers, and public health officials. This flexibility ensures that the same underlying remote sensing-derived predictions can be interpreted and acted upon differently by various organizations based on their specific needs and constraints.
Integration with existing monitoring systems offers additional advantages because the framework complements, rather than replaces, traditional in situ monitoring. Satellite-based predictions provide spatially comprehensive coverage that can guide targeted field sampling and optimize the deployment of limited monitoring resources. Combining the broad spatial coverage and frequent temporal sampling of satellite remote sensing with the high accuracy and detailed biogeochemical information of in situ measurements improves both the spatial coverage and the temporal resolution of environmental surveillance. This hierarchical monitoring approach aligns with established principles of multi-scale environmental observation that have proven effective in diverse contexts [16,89].
From an operational deployment perspective, the early warning capability is perhaps the most practically significant implication for environmental management. By combining current-condition assessment from recent satellite imagery with forward-looking multi-horizon forecasts, the framework can serve as an early warning system for potentially harmful environmental conditions and enable preemptive action that reduces negative impacts. This capability is particularly valuable in harmful algal bloom management, where early interventions such as adjusting water treatment protocols, issuing recreational advisories, or implementing nutrient management strategies can limit ecological damage and protect public health.
In the Lake Okeechobee case study, the 10-day and 20-day forecasts give resource managers lead time to implement appropriate monitoring and mitigation strategies. The system’s recommendation for “increased monitoring and preliminary mitigation planning” exemplifies a measured response that acknowledges the detected risk levels without unnecessarily escalating concern or triggering premature resource deployment. In principle, this approach to risk communication and management guidance can be adapted to forecast windows beyond 20 days with limited methodological modifications, providing longer planning horizons for both tactical and strategic environmental management decisions. The low MAPE at both horizons suggests potential for longer-range seasonal predictions, although the reduced explanatory power at the 20-day horizon (Table 1) indicates that such extensions would require further validation before informing water management planning and resource allocation.

5. Conclusions

This study presents a novel confidence-based paradigm for environmental monitoring and prediction that addresses the critical gap between algorithmic outputs and actionable risk assessment. The framework’s capabilities were demonstrated through harmful algal bloom forecasting in Lake Okeechobee, Florida, serving as a representative validation case for large shallow freshwater systems. The framework integrates Sentinel-2 multispectral satellite remote sensing, ensemble machine learning with explicit uncertainty quantification through quantile regression, and multi-horizon time series forecasting to provide a comprehensive assessment of bloom risk that acknowledges and communicates prediction uncertainty through intuitive visualization interfaces.
The framework advances situational awareness from theoretical construct to operational reality: conceptually grounded in established SA principles, operationally implemented through integrated sensing and machine learning components, and validated through interpretive analysis demonstrating enhanced decision-support capabilities across 10-day and 20-day forecast horizons with explicit confidence metrics. By focusing on probabilistic confidence levels rather than deterministic binary outcomes, the system accommodates the inherent uncertainty in environmental prediction while still providing actionable intelligence through categorical risk classifications (LOW, MEDIUM, HIGH) that integrate both the predicted values and prediction intervals.
The validation results from the Lake Okeechobee case study demonstrate the operational utility of this approach for providing nuanced risk assessments that inform decision-making in water quality management. Notably, these results validate the confidence-based methodology itself rather than representing optimized performance for this specific system; the framework’s value lies in its transferable architecture for uncertainty quantification, not in site-specific predictive accuracy, which would require local recalibration for each new deployment. The LightGBM forecasting model achieved mean absolute percentage errors of 2.9% and 5.7% for the 10-day and 20-day forecasts, respectively, and had the lowest RMSE of the four compared models at the 10-day horizon (0.2333), while providing explicit uncertainty bounds through 90% prediction intervals. At the 20-day horizon, XGBoost and Ridge Regression achieved lower RMSE than LightGBM (Table 1). The system’s ability to integrate multi-temporal satellite observations, quantify predictive uncertainty through quantile regression, and generate context-sensitive risk classifications represents a significant advancement in remote sensing applications for environmental monitoring and prediction.
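For reference, MAPE and the empirical coverage of a prediction interval can be computed as in the following sketch, which assumes the standard definition of MAPE and a simple coverage check. The function names and the sample values are illustrative assumptions and do not reproduce the study data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined for zero actuals)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def interval_coverage(actual, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    if not (len(actual) == len(lower) == len(upper)) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


# Made-up observations, point forecasts, and 5%/95% quantile bounds.
actual = [7.0, 6.5, 7.2, 6.8]
predicted = [6.9, 6.7, 7.0, 6.8]
lower = [6.5, 6.0, 6.6, 6.9]
upper = [7.3, 6.9, 7.1, 7.4]

print(f"MAPE: {mape(actual, predicted):.2f}%")
print(f"Coverage: {interval_coverage(actual, lower, upper):.2f}")
```

A well-calibrated 90% prediction interval should contain roughly 90% of held-out observations, so comparing empirical coverage with the nominal level is a useful complement to point-error metrics such as MAPE.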
Several key contributions emerged from this work. First, the confidence-based prediction framework successfully replaces traditional binary outputs with quantile regression-based uncertainty intervals, enabling stakeholders to assess risk based on their specific tolerance thresholds. Second, the integrated architecture combining current condition assessment (XGBoost) with temporal forecasting (LightGBM) provides both nowcasting and forecasting capabilities within a unified framework. Third, the novel risk classification scheme incorporating both predicted values and confidence levels enables more nuanced decision support than threshold-based approaches. Fourth, the interactive dashboard system facilitates stakeholder-specific risk interpretation through intuitive visualizations that communicate complex probabilistic information effectively.
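Quantile regression models are typically fitted by minimizing the pinball (quantile) loss, which penalizes under-prediction and over-prediction asymmetrically. The sketch below shows this standard loss in plain Python to illustrate why models trained at the 0.05 and 0.95 quantiles bracket the median; it is a generic definition rather than the exact objective configuration of the XGBoost models used here, and the function name and sample values are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff >= 0) is weighted by q,
        # over-prediction (diff < 0) by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


observed = [10.0]
too_low = [8.0]    # under-predicts by 2
too_high = [12.0]  # over-predicts by 2

for q in (0.05, 0.50, 0.95):
    print(q, pinball_loss(observed, too_low, q), pinball_loss(observed, too_high, q))
```

At q = 0.95 an under-prediction costs far more than an over-prediction of the same size, so the fitted value is pushed upward; at q = 0.05 the asymmetry is reversed, and at q = 0.50 the loss is symmetric and the fit targets the median.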
Future research should extend this confidence-based approach to other environmental monitoring applications beyond harmful algal blooms, such as coastal water quality assessment, atmospheric pollution forecasting, and land degradation monitoring. Integrating additional remote sensing data sources, including hyperspectral sensors, thermal infrared imagery, and synthetic aperture radar, could enhance model performance and provide complementary information for uncertainty quantification. More sophisticated visualization tools and decision support interfaces tailored to different stakeholder groups (water utilities, environmental protection agencies, public health officials) would further improve the operational utility of confidence-based predictions. Physics-informed machine learning approaches that incorporate a mechanistic understanding of bloom dynamics, nutrient cycling, and hydrodynamic processes could also improve both prediction accuracy and model interpretability while reducing data requirements for new deployment locations. Finally, two current limitations should be prioritized: the limited size of the site-specific dataset and the reduced explanatory power observed at the 20-day forecast horizon. Expanding the temporal record through continued satellite monitoring, incorporating exogenous meteorological and hydrological predictors, and exploring deep learning architectures capable of capturing long-range temporal dependencies could substantially improve forecast accuracy and reliability at extended lead times.
The confidence-based paradigm presented here advances the state-of-the-art in remote sensing applications for environmental prediction by explicitly addressing the critical gap between algorithmic predictions and actionable risk assessment. By providing stakeholders with both the predicted environmental states and associated confidence metrics, the framework supports more informed decision-making under uncertainty and enhances operational capabilities for environmental monitoring and management systems. This approach aligns with the broader trend toward probabilistic environmental modeling and represents a practical pathway for translating advanced remote sensing products into operational decision support tools that can improve environmental management outcomes in the face of complex ecological challenges.

Author Contributions

Conceptualization, M.Z.Q., C.C., M.A. and M.L.; Methodology, M.Z.Q., C.C., M.A. and M.L.; Software, M.Z.Q., C.C., M.A. and M.L.; Validation, M.Z.Q., C.C., M.A. and M.L.; Formal analysis, M.Z.Q., C.C., M.A. and M.L.; Investigation, M.Z.Q., C.C., M.A. and M.L.; Resources, M.L.; Data curation, M.Z.Q., C.C., M.A. and M.L.; Writing—original draft, M.Z.Q.; Writing—review & editing, M.Z.Q., C.C., M.A. and M.L.; Visualization, M.Z.Q., C.C., M.A. and M.L.; Supervision, M.L.; Project administration, M.L. All authors have contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

SeaBASS CAML dataset: http://dx.doi.org/10.5067/SeaBASS/CAML/DATA001. Sentinel-2 data (Microsoft Planetary Computer): https://doi.org/10.5281/zenodo.7261896.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AOT: Aerosol Optical Thickness
API: Application Programming Interface
FDS: Fast Detection Strategies
HAB: Harmful Algal Bloom
L2A: Level-2A (Sentinel-2 product)
LightGBM: Light Gradient Boosting Machine
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
MODIS: Moderate Resolution Imaging Spectroradiometer
MSI: MultiSpectral Instrument
R²: Coefficient of Determination
RMSE: Root Mean Square Error
SA: Situational Awareness
SCL: Scene Classification Layer
SWIR: Shortwave Infrared
XGBoost: Extreme Gradient Boosting

References

  1. Anderson, D.M.; Cembella, A.D.; Hallegraeff, G.M. Progress in understanding harmful algal blooms: Paradigm shifts and new technologies for research, monitoring, and management. Annu. Rev. Mar. Sci. 2012, 4, 143–176.
  2. Lega, M.; Napoli, R.M.A. A new approach to solid waste landfill aerial monitoring. WIT Trans. Ecol. Environ. 2008, 109, 193–199.
  3. Endsley, M.R. Toward a theory of situation awareness in dynamic systems. Hum. Factors 1995, 37, 32–64.
  4. Faucheux, S.; Froger, G.; Noël, J.-F. What forms of rationality for sustainable development? J. Socio-Econ. 1995, 24, 169–209.
  5. Lega, M.; Medio, G.; Severino, V.; Casazza, M.; Endreny, T.; Teta, R. Coastal Water Pollution Characterization: Enhanced Situational Awareness Through Multiscale Data Acquisition and Analysis. Int. J. Environ. Impacts 2024, 7, 188–202.
  6. Persechino, G.; Schiano, P.; Lega, M.; Napoli, R.M.A.; Ferrara, C.; Kosmatka, J. Aerospace-based support systems and interoperability: The solution to fight illegal dumping. WIT Trans. Ecol. Environ. 2010, 140, 203–214.
  7. Mohsen, N.; Lu, J.; Zhang, G. An intelligent situation awareness support system for safety-critical environments. Decis. Support Syst. 2014, 59, 325–340.
  8. Schaeffer, B.A.; Schaeffer, K.G.; Keith, D.; Lunetta, R.S.; Conmy, R.; Gould, R.W. Barriers to adopting satellite remote sensing for water quality management. Int. J. Remote Sens. 2013, 34, 7534–7544.
  9. Esposito, G.; De Rosa, T.; Di Matteo, V.; Ciccarelli, C.; Ajaoud, M.; Teta, R.; Lega, M.; Costantino, V. Bio-tracking, bio-monitoring and bio-magnification interdisciplinary studies to assess cyanobacterial harmful algal blooms (cyanoHABs)’ impact in complex coastal systems. Sci. Total Environ. 2025, 978, 179480.
  10. Mishra, S.; Stumpf, R.P.; Schaeffer, B.A.; Werdell, P.J.; Loftin, K.A.; Meredith, A. Measurement of cyanobacterial bloom magnitude using satellite remote sensing. Sci. Rep. 2019, 9, 18310.
  11. Lega, M.; Casazza, M.; Teta, R.; Zappa, C.J. Environmental impact assessment: A multilevel, multi-parametric framework for coastal waters. Int. J. Sustain. Dev. Plan. 2018, 13, 1041–1049.
  12. Colkesen, I.; Ozturk, M.Y.; Altuntas, O.Y. Comparative evaluation of performances of algae indices, pixel- and object-based machine learning algorithms in mapping floating algal blooms using Sentinel-2 imagery. Stoch. Environ. Res. Risk Assess. 2024, 38, 1613–1634.
  13. Nguyen, H.Q.; Ha, N.T.; Pham, T.L. Inland harmful cyanobacterial bloom prediction in the eutrophic Tri An Reservoir using satellite band ratio and machine learning approaches. Environ. Sci. Pollut. Res. 2020, 27, 9135–9151.
  14. Xie, Z.; Lou, I.; Ung, W.K.; Mok, K.M. Freshwater algal bloom prediction by support vector machine in Macau storage reservoirs. Math. Probl. Eng. 2012, 2012, 397473.
  15. Pamula, A.S.P.; Gholizadeh, H.; Krzmarzick, M.J.; Mausbach, W.E.; Lampert, D.J. A remote sensing tool for near real-time monitoring of harmful algal blooms and turbidity in reservoirs. JAWRA J. Am. Water Resour. Assoc. 2023, 59, 929–949.
  16. Medio, G.; Severino, V.; Teta, R.; Endreny, T.; Lega, M. Hierarchical Monitoring of Water Quality: Coordinating the Spatiotemporal Resolution of Multilayer and Multispectral Sensors to Characterize Pollution. In WIT Transactions on Ecology and the Environment; WIT Press: Southampton, UK, 2022; Volume 257.
  17. Bagherian, K.; Fernández-Figueroa, E.G.; Rogers, S.R.; Wilson, A.E.; Bao, Y. Predicting Chlorophyll-a Concentration and Harmful Algal Blooms in Lake Okeechobee Using Time-Series MODIS Satellite Imagery and Long Short-Term Memory. J. ASABE 2024, 67, 619–632.
  18. Ameer, S.; Shah, M.A.; Khan, A.; Song, H.; Maple, C.; Islam, S.U.; Asghar, M.N. Comparative analysis of machine learning techniques for predicting air quality in smart cities. IEEE Access 2019, 7, 128325–128338.
  19. Kang, Y.; Ozdogan, M.; Zhu, X.; Ye, Z.; Hain, C.; Anderson, M. Comparative assessment of environmental variables and machine learning algorithms for maize yield prediction in the US Midwest. Environ. Res. Lett. 2020, 15, 064005.
  20. Mermer, O.; Zhang, E.; Demir, I. Predicting Harmful Algal Blooms Using Ensemble Machine Learning Models and Explainable AI Technique: A Comparative Study. EarthArXiv 2024.
  21. Huang, Z.; Ma, R.; Liu, H.; Xue, K.; Hu, M.; Wei, X.; Li, H. Short-term spatial prediction of algal blooms in Lake Taihu via machine learning and GOCI observations. J. Environ. Manag. 2025, 388, 125964.
  22. Haynes, K.; Lagerquist, R.; McGraw, M.; Musgrave, K.; Ebert-Uphoff, I. Creating and evaluating uncertainty estimates with neural networks for environmental-science applications. Artif. Intell. Earth Syst. 2023, 2, e220061.
  23. Pyo, J.; Park, L.J.; Pachepsky, Y.; Baek, S.S.; Kim, K.; Cho, K.H. Using convolutional neural network for predicting cyanobacteria concentrations in river water. Water Res. 2020, 186, 116349.
  24. Teta, R.; Della Sala, G.; Esposito, G.; Stornaiuolo, M.; Scarpato, S.; Casazza, M.; Anastasio, A.; Lega, M.; Costantino, V. Monitoring Cyanobacterial Blooms during the COVID-19 Pandemic in Campania, Italy: The Case of Lake Avernus. Toxins 2021, 13, 471.
  25. Esposito, G.; Glukhov, E.; Gerwick, W.H.; Medio, G.; Teta, R.; Lega, M.; Costantino, V. Lake Avernus Has Turned Red: Bioindicator Monitoring Unveils the Secrets of ‘Gates of Hades’. Toxins 2023, 15, 208.
  26. Lu, D.; Ye, M.; Hill, M.C. Analysis of regression confidence intervals and Bayesian credible intervals for uncertainty quantification. Water Resour. Res. 2012, 48, W09521.
  27. Ajaoud, M.; Ciccarelli, C.; De Mizio, M.; Gargiulo, M.; Parrilli, S.; Savarese, C.; Tufano, F.; Lega, M. Bridging Sustainability and Environmental Impact Assessment: Multi-Scale Bioindication and Remote Sensing for Pollution Monitoring in Agroecosystems. Sustainability 2025, 17, 4115.
  28. Singh, G.; Moncrieff, G.; Venter, Z.; Cawse-Nicholson, K.; Slingsby, J.; Robinson, T.B. Uncertainty quantification for probabilistic machine learning in earth observation using conformal prediction. Sci. Rep. 2024, 14, 14954.
  29. Vasseur, S.P.; Aznarte, J.L. Comparing quantile regression methods for probabilistic forecasting of NO2 pollution levels. Sci. Rep. 2021, 11, 10394.
  30. Verbois, H.; Rusydi, A.; Thiery, A. Probabilistic forecasting of day-ahead solar irradiance using quantile gradient boosting. Sol. Energy 2018, 173, 313–327.
  31. Poch, M.; Comas, J.; Rodríguez-Roda, I.; Sànchez-Marrè, M.; Cortés, U. Designing and building real environmental decision support systems. Environ. Model. Softw. 2004, 19, 857–873.
  32. Reynolds, N.; Schaeffer, B.A.; Guertault, L.; Nelson, N.G. Satellite and in situ cyanobacteria monitoring: Understanding the impact of monitoring frequency on management decisions. J. Hydrol. 2023, 617, 128884.
  33. Neil, C.; Spyrakos, E.; Hunter, P.D.; Tyler, A.N. A global approach for chlorophyll-a retrieval across optically complex inland waters based on optical water types. Remote Sens. Environ. 2019, 229, 159–178.
  34. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
  35. Hamilton, D.P.; Carey, C.C.; Arvola, L.; Arzberger, P.; Brewer, C.; Cole, J.J.; Gaiser, E.; Hanson, P.C.; Ibelings, B.W.; Jennings, E.; et al. A Global Lake Ecological Observatory Network (GLEON) for synthesising high-frequency sensor data for validation of deterministic ecological models. Inland Waters 2015, 5, 49–56.
  36. Havens, K.E.; Hanlon, C.; James, R.T. Seasonal and spatial variation in algal bloom frequencies in Lake Okeechobee, Florida, USA. Lake Reserv. Manag. 1994, 10, 133–143.
  37. Phlips, E.J.; Badylak, S.; Nelson, N.G.; Havens, K.E. Hurricanes, El Niño and harmful algal blooms in two sub-tropical Florida estuaries: Direct and indirect impacts. Sci. Rep. 2020, 10, 1910.
  38. Lefler, F.W.; Barbosa, M.; Zimba, P.V.; Smyth, A.R.; Berthold, D.E.; Laughinghouse, H.D. Spatiotemporal diversity and community structure of cyanobacteria and associated bacteria in the large shallow subtropical Lake Okeechobee (Florida, United States). Front. Microbiol. 2023, 14, 1219261.
  39. Lega, M.; d’Antonio, L.; Napoli, R.M.A. Cultural heritage and waste heritage: Advanced techniques to preserve cultural heritage, exploring just in time the ruins produced by disasters and natural calamities. In Management and the Environment V; Popov, V., Itoh, H., Mander, U., Brebbia, C.A., Eds.; WIT Press: Ashurst Lodge, UK, 2010; pp. 123–134.
  40. Lega, M.; Napoli, R.M.A. Aerial infrared thermography in the surface waters contamination monitoring. Desalination Water Treat. 2010, 23, 141–151.
  41. Lega, M.; Kosmatka, J.; Ferrara, C.; Russo, F.; Napoli, R.M.A.; Persechino, G. Using advanced aerial platforms and infrared thermography to track environmental contamination. Environ. Forensics 2012, 13, 332–338.
  42. Mahdianpari, M.; Salehi, B.; Mohammadimanesh, F.; Homayouni, S.; Gill, E. The first wetland inventory map of Newfoundland at a spatial resolution of 10 m using Sentinel-1 and Sentinel-2 data on the Google Earth Engine cloud computing platform. Remote Sens. 2018, 11, 43.
  43. Dorne, E.; Wetstone, K.; Cerquera, T.B.; Gupta, S. Cyanobacteria Detection in Small, Inland Water Bodies with CyFi. Proceedings of the AGU. 2024. Available online: https://proceedings.scipy.org/articles/PDHK7238 (accessed on 12 May 2025).
  44. Paneru, B.; Paneru, B. AI for Water Sustainability: Global Water Quality Assessment and Prediction with Explainable AI with LLM Chatbot for Insights. arXiv 2024, arXiv:2409.10898.
  45. Shah, F.U.; Khan, A.U.; Khan, A.W.; Ullah, B.; Ali, S.; Ahmad, I.; Shah, S.U. Comparative analysis of ensemble learning algorithms in water quality prediction. J. Hydroinform. 2024, 26, 3041–3058.
  46. Van Nguyen, M.; Lin, C.H.; Chu, H.J.; Jaelani, L.M.; Syariz, M.A. Spectral feature selection optimization for water quality estimation. Int. J. Environ. Res. Public Health 2020, 17, 272.
  47. Yin, X.; Fallah-Shorshani, M.; McConnell, R.; Fruin, S.; Chiang, Y.Y.; Franklin, M. Quantile extreme gradient boosting for uncertainty quantification. arXiv 2023, arXiv:2304.11732.
  48. Cao, Q.; Wu, Y.; Yang, J.; Yin, J. Greenhouse temperature prediction based on time-series features and LightGBM. Appl. Sci. 2023, 13, 1610.
  49. Toharudin, T.; Caraka, R.E.; Pratiwi, I.R.; Kim, Y.; Tai, S.K.; Yustiawan, T.; Purnama, A. How to Handle Unbalanced Classification of PM2.5 Concentration Levels by Observing Meteorological Parameters in Jakarta-Indonesia Using AdaBoost, XGBoost, CatBoost, and LightGBM. IEEE Access 2023, 11, 35989–36003.
  50. Yu, Z.; Ma, J.; Qu, Y.; Pan, L.; Wan, S. PM2.5 extended-range forecast based on MJO and S2S using LightGBM. Sci. Total Environ. 2023, 873, 162369.
  51. Zhang, X.; Jiang, X.; Li, Y. Prediction of air quality index based on the SSA-BiLSTM-LightGBM model. Sci. Rep. 2023, 13, 5550.
  52. Zhou, S.; Song, C.; Zhang, J.; Chang, W.; Hou, W.; Yang, L. A hybrid prediction framework for water quality with integrated W-ARIMA-GRU and LightGBM methods. Water 2022, 14, 1322.
  53. Microsoft Planetary Computer. Available online: https://ui.adsabs.harvard.edu/abs/2022zndo...7261896O/abstract (accessed on 12 May 2025).
  54. Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA’s optical high-resolution mission for GMES operational services. Remote Sens. Environ. 2012, 120, 25–36.
  55. Muller-Karger, F.E.; Hestir, E.; Ade, C.; Turpie, K.; Roberts, D.A.; Siegel, D.; Miller, R.J.; Humm, D.; Izenberg, N.; Keller, M.; et al. Satellite sensor requirements for monitoring essential biodiversity variables of coastal ecosystems. Ecol. Appl. 2018, 28, 749–760.
  56. Gupta, S.; Gelbart, E.; Gupta, R.; Wetstone, K.; Dorne, E. Cyanobacteria Aggregated Manual Labels Dataset (NASA and DrivenData); SeaBASS; NASA Ocean Biology Distributed Active Archive Center: Greenbelt, MD, USA, 2024.
  57. Adjovu, G.E.; Stephen, H.; James, D.; Ahmad, S. Overview of the application of remote sensing in effective monitoring of water quality parameters. Remote Sens. 2023, 15, 1938.
  58. Gholizadeh, M.H.; Melesse, A.M.; Reddi, L. A comprehensive review on water quality parameters estimation using remote sensing techniques. Sensors 2016, 16, 1298.
  59. Lu, H.; Ma, X. Hybrid decision tree-based machine learning models for short-term water quality prediction. Chemosphere 2020, 249, 126169.
  60. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In KDD ′16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM, Inc.: New York, NY, USA, 2016; pp. 785–794.
  61. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
  62. DrivenData. Tick Tick Bloom Benchmark Report. 2024. Available online: https://drivendata.co/blog/tick-tick-bloom-benchmark/ (accessed on 12 May 2025).
  63. Bian, L.; Xie, H.; Wang, H.; Liu, H.; Meng, J.; Chen, J. Application, interpretability and prediction of machine learning method combined with LSTM and LightGBM-a case study for runoff simulation in an arid area. J. Hydrol. 2023, 625, 130091.
  64. Rahmati, O.; Choubin, B.; Fathabadi, A.; Coulon, F.; Soltani, E.; Shahabi, H.; Mollaefar, E.; Tiefenbacher, J.; Cipullo, S.; Bin Ahmad, B.; et al. Predicting uncertainty of machine learning models for modelling nitrate pollution of groundwater using quantile regression and UNEEC methods. Sci. Total Environ. 2019, 688, 855–866.
  65. Sun, W.; Tack, F.; Clarisse, L.; Schneider, R.; Stavrakou, T.; Van Roozendael, M. Inferring Surface NO2 Over Western Europe: A Machine Learning Approach With Uncertainty Quantification. J. Geophys. Res. Atmos. 2024, 129, e2023JD040676.
  66. Cressie, N.; Calder, C.A.; Clark, J.S.; Ver Hoef, J.M.; Wikle, C.K. Accounting for uncertainty in ecological analysis: The strengths and limitations of hierarchical statistical modeling. Ecol. Appl. 2009, 19, 553–570.
  67. World Health Organization. Guidelines for Safe Recreational Water Environments. Volume 1: Coastal and Fresh Waters; World Health Organization: Geneva, Switzerland, 2003; pp. 136–158.
  68. Office of Water. Recommendations for Cyanobacteria and Cyanotoxin Monitoring in Recreational Waters; United States Environmental Protection Agency: Washington, DC, USA, 2019; p. 5.
  69. Havens, K.E.; Ji, G.; Beaver, J.R.; Fulton, R.S., III; Teacher, C.E. Dynamics of cyanobacteria blooms are linked to the hydrology of shallow Florida lakes and provide insight into possible impacts of climate change. Hydrobiologia 2019, 829, 43–59.
  70. Ahmed, N.K.; Atiya, A.F.; Gayar, N.E.; El-Shishiny, H. An empirical comparison of machine learning models for time series forecasting. Econom. Rev. 2010, 29, 594–621.
  71. İleri, K. Comparative analysis of CatBoost, LightGBM, XGBoost, RF, and DT methods optimised with PSO to estimate the number of k-barriers for intrusion detection in WSNs. Int. J. Mach. Learn. Cybern. 2025, 16, 543–566.
  72. Moreno, J.J.M.; Pol, A.P.; Abad, A.S.; Blasco, B.C. Using the R-MAPE index as a resistant measure of forecast accuracy. Psicothema 2013, 25, 500–506.
  73. Ye, L.; Yang, G.; Van Ranst, E.; Tang, H. Time-series modeling and prediction of global monthly absolute temperature for environmental decision making. Adv. Atmos. Sci. 2013, 30, 382–396.
  74. Yang, C.; Tan, Z.; Li, Y.; Shen, M.; Duan, H. A comparative analysis of machine learning methods for algal bloom detection using remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8589–8605.
  75. Nong, X.; Zeng, J.; Chen, L.; Wei, J.; Zhang, Y. A novel water quality risk assessment framework for reservoir water bodies coupling key parameter selection and dynamic warning threshold determination. Sci. Rep. 2025, 15, 1242.
  76. Carrara, P.; Bordogna, G.; Boschetti, M.; Brivio, P.A.; Nelson, A.; Stroppiana, D. A flexible multi-source spatial-data fusion system for environmental status assessment at continental scale. Int. J. Geogr. Inf. Sci. 2008, 22, 781–799.
  77. Chang, N.B.; Bai, K.; Chen, C.F. Integrating multisensor satellite data merging and image reconstruction in support of machine learning for better water quality management. J. Environ. Manag. 2017, 201, 227–240.
  78. Chen, B.; Huang, B.; Xu, B. Multi-source remotely sensed data fusion for improving land cover classification. ISPRS J. Photogramm. Remote Sens. 2017, 124, 27–39.
  79. Li, Z.; Wang, H.; Zhang, T.; Zeng, Q.; Xiang, J.; Liu, Z.; Yang, R. Multi-Source Precipitation Data Merging for High-Resolution Daily Rainfall in Complex Terrain. Remote Sens. 2023, 15, 4345.
  80. He, Q.; Chen, C.; Wang, Y.; Sun, Y.; Liu, Y.; Hu, B. Fusion Method for Multi-Source Remote Sensing Daily Precipitation Data: Random Forest Model Considering Spatial Autocorrelation. J. Geo-Inf. Sci. 2024, 26, 1517–1530.
  81. Mak, H.W.L.; Laughner, J.L.; Fung, J.C.H.; Zhu, Q.; Cohen, R.C. Improved Satellite Retrieval of Tropospheric NO2 Column Density via Updating of Air Mass Factor (AMF): Case Study of Southern China. Remote Sens. 2018, 10, 1789.
  82. Peterson, K.T.; Sagan, V.; Sloan, J.J. Deep learning-based water quality estimation and anomaly detection using Landsat-8/Sentinel-2 virtual constellation and cloud computing. GIScience Remote Sens. 2020, 57, 735–748.
  83. Samadzadegan, F.; Toosi, A.; Dadrass Javan, F.; Asghari, A.; Fathololoumi, S.; Biswas, A. A critical review on multi-sensor and multi-platform remote sensing data fusion approaches: Current status and prospects. Int. J. Remote Sens. 2025, 46, 1327–1402.
  84. Yang, J.; Jiang, Y.; Song, Q.; Wang, Z.; Hu, Y.; Li, K.; Sun, Y. An Approach for Multi-Source Land Use and Land Cover Data Fusion Considering Spatial Correlations. Remote Sens. 2025, 17, 1131.
  85. Zhang, J. Multi-source remote sensing data fusion: Status and trends. Int. J. Image Data Fusion 2010, 1, 5–24.
  86. Cao, W.; Qi, W.; Lu, P. Air quality prediction based on time series decomposition and convolutional sparse self-attention mechanism transformer model. IEEE Access 2024, 12, 156789–156801.
  87. Chen, Y.; Chen, X.; Xu, A.; Sun, Q.; Peng, X. A hybrid CNN-Transformer model for ozone concentration prediction. Air Qual. Atmos. Health 2022, 15, 1449–1463.
  88. Liu, S.; Hu, Y. Air quality prediction based on factor analysis combined with Transformer and CNN-BILSTM-ATTENTION models. Sci. Rep. 2025, 15, 2156.
  89. Kumari, S.; Singh, S.K. Machine learning-based time series models for effective CO2 emission prediction in India. Environ. Sci. Pollut. Res. 2023, 30, 21844–21856.
Figure 1. Lake Okeechobee algal bloom. Source: NASA Earth Observatory (16 July 2022).
Figure 2. Lake Okeechobee and its geographical location.
Figure 3. Confidence-based HAB prediction framework.
Figure 4. HAB prediction system: Homepage.
Figure 5. Model performance comparison (RMSE).
Figure 6. Coefficient of determination (R²) for each model.
Figure 7. Feature importance analysis for the 10-day and 20-day forecasts.
Figure 8. MAPE results for the t + 1 and t + 2 time steps.
Figure 9. Risk assessment gauge from our dashboard.
Table 1. Model performance comparison for time series forecasting.
Model | RMSE (10 d) | MAE (10 d) | R² (10 d) | RMSE (20 d) | MAE (20 d) | R² (20 d)
LightGBM | 0.2333 | 0.2073 | 0.4760 | 0.4309 | 0.4135 | −0.0007
XGBoost | 0.3017 | 0.2640 | 0.1236 | 0.3928 | 0.3790 | 0.1684
Random Forest | 0.3235 | 0.3006 | −0.0075 | 0.4438 | 0.4283 | −0.0614
Ridge Regression | 0.5482 | 0.4048 | −1.8925 | 0.4007 | 0.3322 | 0.1347
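Several R² values in Table 1 are negative, which can look surprising next to the RMSE and MAE columns. The following is a minimal sketch of the standard definitions of the three metrics, not the authors' evaluation code. The function names and the sample values are hypothetical and chosen only for illustration. It shows that R² drops below zero whenever a model's squared error exceeds that of always predicting the mean of the observations.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only; not data from the study.
observed = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]      # tracks the observations
mean_only = [2.5, 2.5, 2.5, 2.5]      # always predicts the observed mean
poor_fit = [4.0, 3.0, 2.0, 1.0]       # worse than predicting the mean

for name, pred in [("close", close_fit), ("mean", mean_only), ("poor", poor_fit)]:
    print(name, round(rmse(observed, pred), 4), round(mae(observed, pred), 4),
          round(r2(observed, pred), 4))
```

Under these definitions, an R² near zero means the model performs about as well as a constant mean prediction on the test period, and a clearly negative R² means it performs worse than that baseline.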
Table 2. Summary of bloom risk assessment.
Forecast Period | Predicted Date | Predicted Value | Risk Classification
t + 1 (10 days ahead) | 12 August 2025 | 6.9138 | MEDIUM
t + 2 (20 days ahead) | 22 August 2025 | 7.3592 | MEDIUM
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qamar, M.Z.; Ciccarelli, C.; Ajaoud, M.; Lega, M. Probabilistic Water Quality Monitoring Using Multi-Temporal Sentinel-2 Data: A Situational Awareness Framework for Harmful Algal Bloom Forecasting. Remote Sens. 2026, 18, 959. https://doi.org/10.3390/rs18060959

