1. Introduction
Water is essential for maintaining environmental balance and supporting human life. It is a critical resource for agriculture, industry, and the sustaining of ecosystems. Its availability and quality are directly linked to human health, as water is necessary for hydration, food security, and sanitation [1]. According to the World Health Organization (WHO), approximately 2 billion people globally use water sources contaminated with feces, which can lead to diseases such as cholera, diarrhea, and dysentery, affecting health outcomes and economic productivity. Clean and safe water is crucial for preventing waterborne diseases, and the WHO emphasizes that proper water quality management reduces the risks associated with harmful microorganisms. Furthermore, proper hydration contributes to cognitive function, physical performance, and the prevention of conditions such as kidney stones and urinary tract infections.
Several factors contribute to the occurrence and intensification of algal blooms. Blooms are primarily driven by nutrient overload, especially excess nitrogen and phosphorus from agricultural runoff, wastewater discharges, and industrial pollutants. However, environmental variables such as temperature, pH, and electrical conductivity (EC) also significantly influence algal growth. Algal species thrive in warm waters (typically above 20 °C), and the presence of excessive nutrients promotes their rapid multiplication.
A study by Rosa [2] demonstrated that high temperatures and nutrient enrichment are pivotal in causing blooms in freshwater systems. Additionally, low pH can alter the solubility of nutrients, enhancing bloom conditions [3]. Changes in EC, which are often linked to ionic strength and nutrient concentrations, also create favorable conditions for the growth of certain harmful algal species [4]. Combined, these factors contribute significantly to the global rise in harmful algal blooms (HABs) in aquatic ecosystems.
The application of machine learning (ML) regression models to predict chlorophyll-a (Chl-a) concentrations has been explored extensively in oceans, rivers, lakes, and reservoirs [5,6]. These models typically use a range of water quality variables as inputs, measured in situ at different frequencies and times. For instance, Wei [7] examined the correlations between Chl-a and five key physicochemical factors (salinity, water temperature, depth, dissolved oxygen (DO), and pH) using 2100 measurements collected during an artificial upwelling process in the ocean. Similarly, Soro et al. [8] analyzed Chl-a levels in three major rivers in West Africa by integrating rainfall, water discharge, and water quality data (e.g., water temperature, pH, EC, DO, salinity, phosphate, nitrate, and ammonia) spanning a two-year period. In Korea, Shin et al. [9] used various ML models to predict Chl-a concentrations along the Nakdong River, training a model for one-day-ahead forecasting using daily water quality measurements and weather variables.
Recent studies have also addressed the limitations of daily granularity, which may not be sufficient for real-time monitoring. Barzegar et al. [10] developed a hybrid CNN-LSTM deep learning model for Chl-a prediction in Greece’s Small Prespa Lake, utilizing easily measurable water quality parameters such as pH, oxidation-reduction potential (ORP), temperature, and EC at 15 min intervals over one year. However, none of these studies analyzed the behavior of their models during different months of the year, which is essential for evaluating model accuracy during critical periods, such as bloom initiation or collapse. Additionally, few considered the model’s capability to trigger harmful algal bloom (HAB) alert levels.
Many studies have also advanced time series prediction methods. Bagherian et al. [11] proposed the use of Artificial Neural Networks (ANNs) and LSTM models to predict Chl-a concentrations over 1- and 4-day horizons using daily time series data. Yu et al. [12] applied LSTM models to smoothed monthly time series data, while Shamshirband et al. [13] employed ensemble methods based on ANNs for daily time series prediction, incorporating wavelet transformations to enhance model performance. Although these methods focus on Chl-a prediction using historical data, the variability in time periods and measurement frequencies, ranging from minutes to months, remains a significant challenge. Zhang et al. [14] utilized Sentinel-2 imagery to track Chl-a concentrations in Nansi Lake, China, and incorporated a neural network model, the Ocean Color Network, to improve Chl-a predictions. García-Nieto et al. [15] focused on the El Val reservoir, employing various ML techniques to predict Chl-a concentrations based on multiple variables, including water quality and hydrological data. Deng et al. [16] reviewed recent advancements in remote sensing and ML techniques for lake water quality management, highlighting the application of Sentinel-2 data in large reservoirs. Additionally, recent ML models such as XGBoost were applied to estimate Chl-a concentrations in tropical reservoirs, resulting in significant improvements in prediction accuracy [17].
HABs are not always detectable through biomass or chlorophyll content alone, particularly when toxic blooms occur at low cell densities. Additional indicators, such as phycocyanin for cyanobacteria and the presence of toxins or their metabolites, can be used, but Chl-a remains a widely used marker for algal presence and can improve detection tools for predicting HABs. Many studies rely on daily measurements, which fail to capture rapid fluctuations in Chl-a concentrations, limiting real-time monitoring capabilities. Additionally, most studies span short durations, making it difficult to assess model performance across seasonal variations. Some studies also incorporate costly or hard-to-measure variables, reducing the practicality and scalability of these models for large-scale, cost-effective monitoring. To address these challenges, we propose a series of ML-based regression models designed to estimate Chl-a fluorescence by uncovering the inherent relationships between four low-cost input variables: water temperature, pH, EC, and .
Remote sensing is highly effective for assessing Chl-a concentrations, but it has limitations, including low spatial resolution and interference from cloud cover, which can disrupt continuous, real-time monitoring. To overcome these challenges, we use buoy-based AMS equipped with sensors. These systems provide time series data, enabling more reliable and continuous monitoring than satellite imagery, especially in areas with inconsistent or unavailable satellite coverage. The alarm system is designed to trigger when the predicted Chl-a concentration exceeds 10 μg/L, corresponding to Alert Level 1 in the World Health Organization cyanobacterial bloom management framework [18]. A real-time alert system has been implemented to issue warnings when Chl-a concentrations exceed this critical threshold, enabling proactive management of water resources.
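As a minimal sketch of the threshold rule described above, the following Python snippet flags predicted chlorophyll-a (Chl-a) values that exceed 10 μg/L. Only the 10 μg/L threshold comes from the text; the function name `alert_indices`, the constant name, and the sample values are hypothetical and do not describe the deployed system.

```python
from typing import List, Sequence

# Threshold stated in the text: 10 μg/L (Alert Level 1).
ALERT_LEVEL_1_UG_PER_L = 10.0


def alert_indices(predicted: Sequence[float],
                  threshold: float = ALERT_LEVEL_1_UG_PER_L) -> List[int]:
    """Return the positions of predictions that strictly exceed the threshold."""
    return [i for i, value in enumerate(predicted) if value > threshold]


# Hypothetical predicted concentrations in μg/L, for illustration only.
sample_predictions = [4.2, 9.8, 10.0, 10.4, 15.1]
flagged = alert_indices(sample_predictions)
print(flagged)  # [3, 4]
```

The comparison is strict, so a prediction of exactly 10.0 μg/L does not raise an alert in this sketch; whether the boundary value should trigger is a design choice that the text does not specify.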
The integration of physics-based models and ML algorithms has shown promise in improving the prediction of Chl-a concentrations in freshwater lakes. Chen et al. [19] discussed a Bayesian Model Averaging (BMA) ensemble method that combines these models to reduce uncertainty and enhance forecasting accuracy for algal bloom predictions. The study highlights the potential of BMA to decrease model uncertainty, which remains an area for further exploration. Similarly, Zhao et al. [20] emphasized the limitations of existing multi-modal approaches for HAB prediction, especially in real-time applications. They proposed a single-model-based multi-task deep learning framework that improves prediction accuracy without relying on complex ensemble methodologies.
Several studies have explored ML techniques for real-time monitoring and prediction across diverse fields, from safety systems to environmental monitoring. In the domain of real-time monitoring, ensemble learning methods such as Random Forest (RF) and XGBoost have demonstrated exceptional performance in predicting algal blooms owing to their resilience against overfitting and capability to manage large, complex datasets. Jeong et al. [21] applied RF to predict algal bloom occurrences in freshwater ecosystems, addressing the challenge of class imbalance, as bloom events are less frequent than non-bloom events. They utilized resampling techniques and hyperparameter optimization to improve model accuracy. Zare et al. [22] also employed RF to forecast harmful algal blooms (HABs) in coastal environments, tackling data heterogeneity by using variable importance analysis to pinpoint the most influential features, thus enhancing model interpretability.
In time series prediction, Gang et al. [23] introduced a convolution sum discrete process neural network, which enhanced prediction accuracy, while Liu et al. [24] developed a hybrid model that integrates Convolutional Neural Networks (CNNs) and Fuzzy C-means clustering, providing improved stability in time series forecasting. These methodologies have been increasingly applied in environmental sciences, particularly in the prediction of HABs. Huang et al. [25] investigated deep hybrid neural networks for predicting chaotic time series, while Hill et al. [26] utilized ML in conjunction with remote sensing to detect HABs. Pyo et al. [27] employed deep learning models to predict cyanobacteria cell growth by integrating observed, numerical, and sensing data, thus highlighting the utility of multi-source data in environmental forecasting. Numerous studies have focused on Chl-a concentration, a critical indicator of algal blooms. Cho et al. [28] applied deep learning models for the time series prediction of daily Chl-a levels, while Lee et al. [29] improved HAB predictions in South Korean rivers through deep learning techniques. Additionally, Cho et al. [30] proposed a merged LSTM model for multi-step Chl-a concentration prediction, enhancing prediction accuracy. Recent work by Yussof et al. [31] demonstrated the efficacy of LSTM networks in predicting HABs, particularly along the west coast of Sabah. Beyond algal blooms, Ali et al. [32] used LSTM models to predict the time series of tsunami hydrodynamics, enhancing early warning systems and improving tsunami event predictions. Similarly, Ghosh et al. [33] applied LSTM networks to spectrum prediction, focusing on the opportunistic use of under-utilized radio frequency bands and demonstrating high accuracy in forecasting spectrum usage patterns. Collectively, these studies highlight the growing reliance on advanced deep learning techniques, such as LSTM and hybrid models, for improving the prediction and management of environmental hazards, such as algal blooms. A summary of the related work is given in Table 1.
While daily monitoring methods are valuable, they may not capture the rapid, dynamic changes in Chl-a concentrations that are critical for timely HAB detection. Our buoy-based system provides continuous, high-frequency monitoring, which offers a more immediate response to fluctuations in water quality parameters, especially in environments prone to fast-changing conditions. This study overcomes these limitations by utilizing four low-cost input variables, water temperature, pH, EC, and , through ML-based regression models trained on 38 months of data. Battery status data are also included in the monitoring system for maintenance purposes: the battery level serves as an operational indicator that helps ensure the uninterrupted functioning of the buoy system during data collection. The models are evaluated across seasonal variations to ensure robustness and scalability, offering a cost-effective solution for large-scale, real-time monitoring of HABs. The main contributions of our work are as follows:
A hybrid framework for water quality prediction that synergizes feature engineering, data balancing, deep spatial feature extraction, and temporal modeling, addressing the dual challenges of spatial complexity and temporal dependency in environmental time series data.
A feature optimization strategy that integrates domain knowledge (statistical aggregations of key parameters) with data-driven feature importance rankings to reduce redundancy while retaining interpretability in high-dimensional water quality datasets.
A spatial–temporal architecture that combines ResNet-18 for hierarchical feature learning from raw sensor data and LSTM for modeling dynamic water quality trends, enabling joint analysis of local patterns and long-term dependencies.
A deployable monitoring system that unifies feature selection, imbalance correction, and deep sequential learning into a streamlined pipeline, demonstrating practical feasibility for real-time water quality forecasting in resource-constrained environments.
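The "statistical aggregations of key parameters" mentioned in the contributions can be illustrated with a short, hedged sketch. The text does not state which aggregations or window lengths are used, so the choice of mean, minimum, maximum, and standard deviation, the window size, the function name `rolling_features`, and the sample readings below are all assumptions made purely for illustration.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def rolling_features(values: Sequence[float], window: int) -> List[Dict[str, float]]:
    """Compute mean, min, max, and population std over each full trailing window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    features = []
    # Slide a fixed-size window over the series; partial windows are skipped.
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        features.append({
            "mean": mean(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "std": pstdev(chunk),
        })
    return features


# Hypothetical water temperature readings in °C, for illustration only.
temps = [20.0, 22.0, 24.0, 26.0]
feats = rolling_features(temps, window=2)
print(len(feats))        # 3
print(feats[0]["mean"])  # 21.0
```

Each output row could then serve as an engineered feature vector for one time step. In practice the aggregated features would be ranked by importance before being passed to the downstream model, as the contribution describes.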