Harmful algal blooms (HABs), driven by environmental pollution, pose significant threats to water quality, public health, and aquatic ecosystems. This study enhances the prediction of HABs in Lake Erie, part of the Great Lakes system, by utilizing ensemble machine learning (ML) models coupled
[...] Read more.
Harmful algal blooms (HABs), driven by environmental pollution, pose significant threats to water quality, public health, and aquatic ecosystems. This study enhances the prediction of HABs in Lake Erie, part of the Great Lakes system, by utilizing ensemble machine learning (ML) models coupled with explainable artificial intelligence (XAI) for interpretability. Using water quality data from 2013 to 2020, various physical, chemical, and biological parameters were analyzed to predict chlorophyll-a (Chl-a) concentrations, which are a commonly used indicator of phytoplankton biomass and a proxy for algal blooms. This study employed multiple ensemble ML models, including random forest (RF), deep forest (DF), gradient boosting (GB), and XGBoost, and compared their performance against individual models, such as support vector machine (SVM), decision tree (DT), and multi-layer perceptron (MLP). The findings revealed that the ensemble models, particularly XGBoost and deep forest (DF), achieved superior predictive accuracy, with R
2 values of 0.8517 and 0.8544, respectively. The application of SHapley Additive exPlanations (SHAPs) provided insights into the relative importance of the input features, identifying the particulate organic nitrogen (PON), particulate organic carbon (POC), and total phosphorus (TP) as the critical factors influencing the Chl-a concentrations. This research demonstrates the effectiveness of ensemble ML models for achieving high predictive accuracy, while the integration of XAI enhances model interpretability. The results support the development of proactive water quality management strategies and highlight the potential of advanced ML techniques for environmental monitoring.
Full article