1. Introduction
Harmful algal blooms (HABs) have posed severe threats to water bodies and the supply of drinking water globally in recent decades [1,2,3]. Today, in Sweden and certain other regions across the globe, intensive algal blooms are mostly limited to lakes that release legacy phosphorus from their sediments. The blooms occur during periods when oxygen depletion in the lower parts of the water body triggers anoxic conditions in the sediment. Intensive blooms lead to several societal challenges. Neurotoxins and toxic phytoplankton, as well as the reduction in oxygen concentration in water bodies [4], can pose a direct risk to ecosystem and human health. Algal blooms may also cause indirect problems for drinking water production, such as the clogging of filters at the intake or even of sand filters. The current removal methods for algal cells include coagulation [5] (Lin et al., 1971), flotation [6], ultrafiltration [7], activated carbon, etc. The removal of dissolved toxins produced by algae is an even greater challenge. In particular, toxins with small molecular weights of below 400 daltons, such as Nodularins, would require nanofiltration for quantitative removal [8]. Cyanotoxin release mechanisms and the amount of toxins released are largely unknown. Proxies are therefore necessary, such as algal biovolume, species composition, and influencing factors, together with early indications or predictions of the presence of algal blooms [9]. Rapid prediction systems for algal blooms are urgently needed to guide necessary precautions in advance and to reduce the possible damage and loss for ecosystems and humans.
The main factors causing algal blooms have been summarized in past research. For instance, it is agreed that temperature; nutrient status [10,11], especially the N/P ratio; and light conditions are the main drivers of cyanobacterial blooms. The exact trigger mechanism of blooms, however, is still difficult to quantify and often depends on local conditions [12]. Under those circumstances, AI-based algorithms, which do not require explicit physical or chemical equations, may assist in early warning systems.
Recently, several studies have highlighted the use of AI-based numerical methods that allow the prediction of water quality and, in some cases, the occurrence of algae [13,14,15]. Saboe et al. (2021) predicted algae concentration and water quality parameters using a new combination of microbial potentiometric sensor signals and machine learning tools, and reported considerable prediction performance [13]. Going beyond the simple application of AI in water quality simulation, Ahmad et al. (2019) applied denoising through a neuro-fuzzy inference system (WDT-ANFIS) during AI training, which improved the prediction performance significantly [14]. Regarding the selection of suitable AI models, Wang et al. (2022) compared traditional fitting with various AI methods for the prediction of flocculant dosage and suggested the Elman neural network as the preferred choice in that case [15]. Recent studies have thus covered valuable areas such as new combinations of sensors, experimental measurements, and AI models, highlighting the synergy between AI methods and the selection of suitable AI algorithms, among other things.
However, research still neglects significant aspects that may influence the prediction performance. For instance, methods for selecting the best parameters, neuron numbers, and starting points have rarely been reported. The selection of input/output variables is either not treated as an important step or conducted using only simple pairwise correlation statistics. This approach cannot identify the role of interdependent variables. Auto-deep learning (AutoDL) applications allow for automatic data selection and the construction of layers in models, which can reduce human intervention in water quality prediction [16]. However, contrary to what might be expected, AutoDL did not perform better. As a result, achieving high and stable performance in the AI prediction of multiple factors remains an active research topic.
Therefore, in this work, we aimed to establish a new full AI scanning–focusing process for selecting the best model to address the above issues and improve the prediction performance of an early warning system for the occurrence of algae based on multiple water quality factors. The code is programmed to select the best prediction models with the most suitable water quality factors as well as the number of neurons in the hidden layer, random factors during the training process, etc. We define the best prediction models with the best combination of suitable factors as so-called “closed systems”, in which all included factors are highly correlated with each other. These identified best models of closed systems are then compared with other methods. The considerable performance of the closed system approach demonstrates its potential to improve prediction performance. Water quality factors, including algae concentration (cell numbers), water temperature, pH, conductivity, turbidity, dissolved organic matter, calculated CO2 concentration, and date (year/month/day), are involved as inputs and outputs. To be exact, in each training and application of the prediction model, the inputs are the date and the selected factors, while the outputs are the predicted values of these selected factors, excluding the date.
This study aims to improve the prediction performance with the new full-scanning–focusing process, which selects the best model with the most suitable combination of factors (closed system), rather than pursuing an extremely accurate prediction result during the calibration period.
In this paper, we discuss the threat of algal blooms and the gap in the knowledge of algae prediction and early warning systems, and we present our solution of a full-scanning–selecting system to improve the prediction performance of algae concentration (see Section 1). Section 2 describes the measurement and data collection processes that form the basis of the model as well as the modeling and prediction study design. Section 3 mainly focuses on the comparison of the prediction performances of various models in the training, validation, and application periods, together with the corresponding analysis and discussion, in order to demonstrate the performance and features of our new system. Finally, we summarize our study and discuss the limitations and future perspectives of this work.
2. Materials and Methods
In this work, we aimed to introduce a new scanning–focusing AI process for the prediction of algae concentration in lake water and several other factors crucial for water treatment. The flowchart of this work is shown in Figure 1. The raw data are based on weekly measured algae concentrations, simulated CO2 concentrations, and values of other factors measured at hourly or minute intervals, such as temperature, pH, conductivity, turbidity, and dissolved organic matter. Two further AI prediction processes are compared with the closed system: (1) a blind AI training process (BP) with all factors considered as inputs and (2) a simple process (SP) with only the date and the target factor considered as inputs. The prediction performances of various factors with the different AI processes are compared and summarized.
2.1. Description of the Görväln Drinking Water Treatment Plant
The Görväln drinking water treatment plant (DWTP) is located on the eastern side of Sweden’s third largest lake, Lake Mälaren. The plant is run by the municipal water company Norrvatten and produces drinking water for around 600,000 people in the greater Stockholm area. The process is a classical coagulation–rapid sand filtration process (Figure A1).
The DWTP intake is located inside the Görväln basin, where two water sources meet: one high in alkalinity (>1.2 mM) and high in organic carbon (>10 mg L−1) flowing from the north, and the other low in alkalinity (<0.5 mM) and generally much lower in organic carbon (<7 mg L−1) flowing from the western part of the lake. The basin has a turnover time of around 3 months. Concentrations of P–PO4 usually vary between 5 and 50 μg L−1, while the nitrate concentration lies between 300 and 1800 μg L−1. During spring and late autumn, chlorophyll concentrations in the basin may rise to 50 mg L−1 and the presence of blue–green algae has been repeatedly confirmed.
The DWTP is equipped with several online sensors that register a range of parameters (Table 1) at high temporal resolution. These signals are used to control the dosing of the coagulation process and to observe important changes in water quality.
2.1.1. Description of Algal Cell Count Method
Algae counts were carried out using an inverted microscope. For this purpose, 500 mL of water was sampled at the intake, and 50 mL was treated with Lugol’s solution and then left to settle for three days in an Utermöhl chamber (Hydrobios).
The final algal counts are given as the number of cells per liter of water (cells L−1). On a few occasions, data for both algal cell number and chlorophyll content are available. Because cell size varies, cell numbers and chlorophyll are not necessarily expected to correlate. In our case, we found a significant relationship, which indicates that a measurement of 4 × 10^6 cells corresponds to around 30–40 μg L−1 chlorophyll. This comparison allows us to convert the observed cell number to a hypothetical chlorophyll concentration in raw water. In general, for managing recreational waters in natural lakes, chlorophyll-a < 10 μg L−1 is regarded as safe; chlorophyll-a > 10 μg L−1 with a dominance of cyanobacteria would pose a relatively low probability of adverse health effects; chlorophyll-a > 10 μg L−1 and < 50 μg L−1 would pose a moderate probability of adverse health effects; and chlorophyll-a > 50 μg L−1 would pose a high probability of adverse health effects [17].
2.1.2. Calculation of Carbon Dioxide
The photosynthetic activity of algae leads to an uptake of carbon dioxide from the water body. While no direct measurements of carbon dioxide were available, it is possible to accurately calculate carbon dioxide concentration if pH, temperature, and alkalinity are known.
Carbon dioxide concentrations were calculated based on estimates for alkalinity and measured pH and water temperature using the reactive transport model PHREEQC, which is freely available from the USGS site (version 2.17 for Microsoft Windows; USGS, 2020) [18,19]. In surface waters that are metastable with respect to calcite (i.e., no actual dissolution or precipitation occurs due to the low deviation from the calcite saturation index), pH is controlled only by alkalinity, the presence of organic matter, temperature, and carbon dioxide. Bacteria may respire organic matter and thereby increase the carbon dioxide concentration, while photosynthetic algae decrease it. While pH, conductivity, and temperature data were available at 5 min intervals, alkalinity was only measured weekly. In the high-alkalinity lake water studied at this site, a strong correlation exists between conductivity and alkalinity. This observation can be used to produce a time series of calculated carbon dioxide concentration with much higher temporal resolution. An analysis revealed that the correlation between alkalinity and conductivity slightly worsens (Figure A2) if data are included that are more than 6 h away from the actual weekly measurements. Based on that criterion, we had access to around 2000 data points in the period 2015–2022 as data for the chemical equilibrium calculation. The correlation between pH and the content of dissolved carbonic acid is displayed in Appendix A (Figure A3).
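The two steps described above (estimating alkalinity from conductivity and then deriving carbon dioxide from pH and alkalinity) can be illustrated with a minimal Python sketch. It is not the PHREEQC calculation used in this work: it assumes that alkalinity is purely carbonate alkalinity, ignores activity corrections and organic matter, and uses fixed dissociation constants (pK1 = 6.35, pK2 = 10.33, approximate values at 25 °C) instead of temperature-dependent ones. The function names and the calibration numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dissolved_co2(ph, alkalinity, pk1=6.35, pk2=10.33):
    """Dissolved CO2 (mol/L) from pH and carbonate alkalinity (eq/L).

    Assumption: Alk = [HCO3-] + 2 [CO3 2-], with
    [HCO3-] = K1 [CO2] / [H+] and [CO3 2-] = K1 K2 [CO2] / [H+]^2.
    pk1 and pk2 are fixed 25 degC values, not temperature-corrected.
    """
    h = 10.0 ** (-ph)
    k1 = 10.0 ** (-pk1)
    k2 = 10.0 ** (-pk2)
    return alkalinity / (k1 / h + 2.0 * k1 * k2 / h ** 2)


# Hypothetical calibration pairs: weekly alkalinity (eq/L) matched with
# the conductivity reading closest in time (arbitrary units).
conductivity = [20.0, 25.0, 30.0, 35.0]
alkalinity = [0.8e-3, 1.0e-3, 1.2e-3, 1.4e-3]
slope, intercept = fit_line(conductivity, alkalinity)

# High-frequency estimate: alkalinity from a conductivity reading,
# then CO2 from the simultaneous pH reading.
alk_estimate = slope * 28.0 + intercept
co2_estimate = dissolved_co2(8.0, alk_estimate)
```

In this simplified picture, a rise in pH at constant alkalinity translates directly into a lower calculated CO2 concentration, which is the signal expected from photosynthetic uptake.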
2.2. Modeling for Prediction and Early Warning Systems
We introduce a new AI-focusing process to model the interaction among several vital water chemistry factors in a water body for prediction and early warning systems. The aims are to conduct high-quality modeling in complex environments, search for a relatively closed system for prediction, improve the prediction performance of AI-based modeling, and guide a reduction in the monitoring required.
Firstly, the interaction among the parameters in the water source was modeled with a feed-forward neural network (FNN) for prediction and analysis, carried out in Matlab. Neural networks were chosen because they are well suited to simulating complex systems. In this case, factors with available data, such as date, algae concentration, water temperature, pH, conductivity, turbidity, and dissolved carbon dioxide concentration, from January 2015 to June 2020, were involved in training and validation. The data from 2015 to 2019 were employed for training, while those from 2020 were used for validation. Data for 2022 were used for observation purposes. The fitting performance was tested via MSE and R2. Both were calculated from the normalized simulated and original values of all involved variables to represent an integrated fitting performance.
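As an illustration of how an integrated fitting performance over several variables could be computed, the following Python sketch pools the normalized values of all variables before evaluating MSE and R2. The normalization scheme is not specified here, so min-max scaling by the observed range is an assumption, as is the use of the coefficient of determination for R2; all function names are hypothetical.

```python
def min_max_normalize(values, lo, hi):
    """Scale values to the range defined by lo and hi (assumed scheme)."""
    return [(v - lo) / (hi - lo) for v in values]


def mse(observed, simulated):
    """Mean squared error between two equally long sequences."""
    return sum((o - s) ** 2 for o, s in zip(observed, simulated)) / len(observed)


def r_squared(observed, simulated):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def integrated_scores(observed_by_var, simulated_by_var):
    """Pool normalized values of all variables, then compute MSE and R2.

    Each variable is scaled by the range of its observed values so that
    variables with large magnitudes do not dominate the pooled score.
    """
    obs_all, sim_all = [], []
    for name, obs in observed_by_var.items():
        lo, hi = min(obs), max(obs)
        obs_all += min_max_normalize(obs, lo, hi)
        sim_all += min_max_normalize(simulated_by_var[name], lo, hi)
    return mse(obs_all, sim_all), r_squared(obs_all, sim_all)


# Made-up numbers purely for illustration.
observed = {"algae": [0.0, 10.0], "pH": [7.0, 8.0]}
simulated = {"algae": [0.0, 5.0], "pH": [7.0, 8.0]}
pooled_mse, pooled_r2 = integrated_scores(observed, simulated)
```

Scaling each variable before pooling is one way to keep, for example, cell counts in the millions from outweighing pH values in a single combined score.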
In the FNN model, a structure of one input layer, one hidden layer, and one output layer was applied. During training, 1 to 10 neurons in the hidden layer were fully scanned, and 10 parallel training tests were conducted with the same training parameter settings to account for the influence of random factors during training. This avoids chance results and allows the proper model structure to be selected for each condition. To study possible leading and lagged effects between factors, a two-dimensional input matrix (date by parameters) was applied. Here, we considered 7 weeks as the date range for the input in the training to project data one week into the future for the output.
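The windowing and the scan over hidden-layer sizes can be sketched in Python as follows. This is an outline under stated assumptions rather than the Matlab code used in this work: the data are assumed to be one row per week with the date in the first column, and `train_and_score` is a hypothetical callable that trains one network for a given neuron number and random seed and returns a validation error (lower is better). A stand-in scoring function replaces real training in the usage lines.

```python
import random


def make_windows(series, n_lags=7, horizon=1):
    """Build (input, target) pairs from weekly rows [date, factor_1, ...].

    Each input is a block of n_lags consecutive rows (date and factors);
    each target is the row `horizon` weeks after the block, without the date.
    """
    inputs, targets = [], []
    for start in range(len(series) - n_lags - horizon + 1):
        window = series[start:start + n_lags]
        target_row = series[start + n_lags + horizon - 1]
        inputs.append(window)
        targets.append(target_row[1:])  # drop the date column
    return inputs, targets


def scan_hidden_sizes(train_and_score, sizes=range(1, 11), repeats=10):
    """Try every hidden-layer size with several random seeds.

    Returns (best_score, n_hidden, seed) for the lowest score found.
    """
    best = None
    for n_hidden in sizes:
        for seed in range(repeats):
            score = train_and_score(n_hidden, seed)
            if best is None or score < best[0]:
                best = (score, n_hidden, seed)
    return best


def fake_train_and_score(n_hidden, seed):
    """Stand-in for real training: pretends 4 neurons work best."""
    rng = random.Random(n_hidden * 100 + seed)
    return abs(n_hidden - 4) + 0.1 * rng.random()


# Ten made-up weekly rows: [week index, one factor value].
rows = [[week, week * 10] for week in range(10)]
inputs, targets = make_windows(rows)
best_score, best_neurons, best_seed = scan_hidden_sizes(fake_train_and_score)
```

Keeping the scan separate from the training routine means the same loop can be reused for each candidate set of input factors.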
Secondly, to improve the prediction performance by selecting a satisfactory closed system and to reveal the correlation and interaction mechanisms between factors, an AI scanning–focusing process was conducted. We analyzed the involvement of various factors in the model outcome. For the prediction of algal blooms, the date and algal concentration were the basic input factors in our modeling, and all combinations of the 6 other factors were considered as extra inputs, resulting in a total of 63 combinations. The FNN modeling was run for each of those sets and the prediction performances were compared. Subsequently, the relatively closed systems with better performance were selected, which also indicated the factors that are more strongly related. Here, we applied the theory of Granger causality [20] to judge closed systems and relatedness: if the involvement of one factor improves the fitting performance of the whole system, it is retained. Finally, the systems with the highest performances were selected, and the proper structure of the FNN model for the corresponding conditions was obtained.
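The count of 63 combinations follows from taking every non-empty subset of the 6 extra factors (2^6 − 1 = 63) on top of the two basic inputs. A short Python sketch of this enumeration is given below; the factor labels are shorthand chosen for illustration, and `score` is a hypothetical callable standing in for the full train-and-validate step for one candidate system (lower is better).

```python
from itertools import combinations

BASE_INPUTS = ("date", "algae")
EXTRA_FACTORS = (
    "temperature",
    "pH",
    "conductivity",
    "turbidity",
    "dissolved_organic_matter",
    "CO2",
)


def candidate_systems(extra=EXTRA_FACTORS):
    """All input sets: the base inputs plus each non-empty subset of extras."""
    systems = []
    for k in range(1, len(extra) + 1):
        for subset in combinations(extra, k):
            systems.append(BASE_INPUTS + subset)
    return systems


def select_closed_system(systems, score):
    """Return the candidate with the lowest score (e.g. validation MSE)."""
    return min(systems, key=score)


def toy_score(system):
    """Made-up score that favors date-algae-temperature-pH, for illustration."""
    penalty = 0 if ("temperature" in system and "pH" in system) else 1
    return abs(len(system) - 4) + penalty


systems = candidate_systems()
best_system = select_closed_system(systems, toy_score)
```

The empty subset (date and algae only) is excluded from the 63 candidates; it corresponds to the simple process (SP) used for comparison.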
4. Conclusions
The variations in algae concentration and other vital factors in the Görväln drinking water plant were simulated and predicted using our new AI scanning–focusing process. In the relatively closed systems (1) date–algae–temperature–pH (DATH) and (2) date–algae–temperature–CO2 (DATC), the factors were found to be more strongly related in the time series than they were with other factors. The predictions using those systems showed clear advantages compared to those of the blind prediction (BP) model, which used all accessible factors during training. DATH displayed more stable predictions than DATC; this may be due to carbon dioxide concentration being strongly affected by factors outside the DATC system. In the real application test, DATH showed an outstanding prediction performance, even though the monitored data were obtained two years after those of the training period. The prediction described the values and trends more accurately than the simple process (SP), which only involved the date and algae concentration. In general, the higher performance of the new models was derived from the scanning–focusing process, which selected both the best model and the most closely related closed system. Our AI scanning–focusing process and model selection showed the potential for improving water quality prediction by identifying closed systems with highly correlated factors. This provides a new method that can be considered for enhancing the numerical prediction of factors in water quality monitoring and in wider environmental applications.
The study still has limitations in terms of further improving prediction performance, the need to involve more parameters and water quality indicators, and the establishment of an entire early warning system. This leaves room for future study.
In the next steps, we plan to further improve this modeling process by including more modeling parameters in the scanning and selection procedures and by producing the best models that can be realized within the framework of data measured in concert and NN modeling. For early warning purposes, our process will be combined with AI methods for recognition and classification to achieve high performance in predicting risky turns and state-changing points. The development target is a high-performance AI process for prediction and early warning systems, not only for the evaluation of water quality but also for other research and industrial areas.