3.1. Study Area and Construction of Data
This study evaluated the applicability of EPM by learning and predicting Chl-a at the Dasan water quality observatory in the Nakdong river basin. The Dasan water quality observatory is located in Goryeong-gun, Gyeongsangbuk-do, Republic of Korea. Automatic water quality measurement equipment was installed at the observatory in 2012, enabling the acquisition of daily water quality data.
Figure 5 shows the location of the Dasan water quality observatory.
According to Figure 5, the Maegok and Munsan intake stations are located approximately 4 km downstream from the Dasan water quality observatory. Preemptive water quality management measures are essential to ensure a safe water supply for these nearby intake stations. Therefore, daily water quality parameter data from the Dasan water quality observatory were utilized to proactively improve water quality at these intake stations.
Water quality data from the Dasan water quality observatory were obtained from the Water Environment Information System (https://water.nier.go.kr). Daily water quality data from 2014 to 2023 were used for deep learning training, and daily water quality data from 2024 were used for prediction. Data preprocessing was applied to the full data set from 2014 to 2024 and consisted of interpolation of missing data and data scaling. Linear interpolation was applied to missing data, and min-max normalization (MMN) was used for data scaling. Outlier processing was not performed, so that the effect of the measured water quality data on the output could be analyzed as measured. In total, 3652 daily records (2014 to 2023) were used for deep learning training, and additional validation was performed for the same period. A total of 366 daily records (2024) were used for deep learning predictions.
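The interpolation and train/prediction split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and filling gaps at the start or end of a series with the nearest observed value is an assumption, since the text does not say how such gaps were handled.

```python
from datetime import date, timedelta

def daily_dates(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=k) for k in range((end - start).days + 1)]

def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest observed values.

    Assumption: gaps at the very start or end are filled with the nearest
    observed value, because no neighbour exists on one side.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

# Split daily records by year: 2014-2023 for training, 2024 for prediction.
dates = daily_dates(date(2014, 1, 1), date(2024, 12, 31))
train_idx = [i for i, d in enumerate(dates) if d.year <= 2023]
test_idx = [i for i, d in enumerate(dates) if d.year == 2024]
```

The split sizes follow from the calendar: ten years including the leap years 2016 and 2020 give 3652 days, and the leap year 2024 gives 366 days.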
Table 1 lists the input parameters used for training and prediction.
According to Table 1, there are 11 input features for learning and predicting Chl-a. The input features were selected from the data measured at the Dasan water quality observatory with reference to previous studies [25,26,27,28,29,30,31,32]. Data preprocessing was applied to the data collected for each input feature. When the difference between the maximum and minimum values of the input and output data is large, the search range of the weights and biases expands, so a wide data range can reduce the accuracy of the ANN [33]. Therefore, data preprocessing is necessary before performing ANN training and prediction.
Ref. [34] compared the performance of MMN, Z-score normalization, and decimal-scaling normalization, which are preprocessing techniques for data with a wide range, and confirmed the superior performance of MMN. Therefore, this study used MMN for data preprocessing. MMN rescales each input variable to the range from 0 to 1, so that the maximum value is converted to 1 and the minimum value is converted to 0. The formula for MMN is shown in Equation (4).
$$\mathrm{MMN\ val}_i = \frac{\mathrm{Raw}_i - \mathrm{Raw}_{min}}{\mathrm{Raw}_{max} - \mathrm{Raw}_{min}} \qquad (4)$$
where $\mathrm{MMN\ val}_i$ is the i-th data converted using MMN, $\mathrm{Raw}_i$ is the i-th raw data, $\mathrm{Raw}_{max}$ is the maximum value of the raw data, and $\mathrm{Raw}_{min}$ is the minimum value of the raw data.
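As a minimal sketch of MMN, each value is shifted by the minimum of the raw data and divided by the range of the raw data. The function names are hypothetical, and mapping a constant series to 0 is an assumption made only to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]: the minimum maps to 0 and the maximum to 1."""
    raw_min, raw_max = min(values), max(values)
    span = raw_max - raw_min
    if span == 0:
        # Assumption: a constant series carries no range, so map it to 0.
        return [0.0 for _ in values]
    return [(v - raw_min) / span for v in values]

def min_max_denormalize(scaled, raw_min, raw_max):
    """Invert the scaling to recover values in the original units."""
    return [s * (raw_max - raw_min) + raw_min for s in scaled]

scaled = min_max_normalize([2.0, 4.0, 6.0])
restored = min_max_denormalize(scaled, 2.0, 6.0)
```

The inverse transform is what allows a prediction made on scaled data to be compared with concentration thresholds in the original units.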
To improve water quality, preemptive action is essential, and it is therefore important to consider the temporal relationships between input variables and the model output. Therefore, this study applied Time-lagged Cross Correlation (TLCC) to account for the time lag between each input variable and Chl-a. TLCC is a technique for analyzing simultaneous and time-lagged correlations between two time series. By shifting one time series forward or backward by a certain time interval and calculating the correlation coefficient with the other time series, it is possible to estimate the lead or lag relationship between two variables. The TLCC equation is shown in Equation (5).
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (5)$$
where $r$ is the correlation coefficient, $x_i$ is the i-th value of the $x$ variable, $\bar{x}$ is the mean of the $x$ variable, $y_i$ is the i-th value of the $y$ variable, $\bar{y}$ is the mean of the $y$ variable, and $n$ is the number of comparison data.
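A minimal sketch of TLCC follows. It assumes that only non-negative lags are searched (the input leads Chl-a) and that the lag with the largest absolute correlation is selected; neither detail is stated in the text, and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return sxy / sqrt(sxx * syy)

def tlcc(feature, target, max_lag):
    """Correlate feature[t - lag] with target[t] for lag = 0..max_lag."""
    scores = {}
    for lag in range(max_lag + 1):
        shifted_feature = feature[: len(feature) - lag]
        aligned_target = target[lag:]
        scores[lag] = pearson_r(shifted_feature, aligned_target)
    return scores

def best_lag(feature, target, max_lag):
    """Lag (in time steps) with the largest absolute correlation."""
    scores = tlcc(feature, target, max_lag)
    return max(scores, key=lambda lag: abs(scores[lag]))

# Synthetic check: the target repeats the feature three steps later.
feature = [(7 * i) % 11 for i in range(40)]
target = [0, 0, 0] + feature[:-3]
lag = best_lag(feature, target, max_lag=10)
```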
Table 2 shows the results of applying TLCC to each input feature.
According to Table 2, NO3-N exhibits the longest lag time of 9 days, followed by TN at 7 days and Turbidity at 2 days. All other input features have a lag time of 1 day. Therefore, when applying EPM, it is possible to take preemptive action based on data from the previous day.
3.2. MLP Learning and Prediction Based on Input Data Reconstruction Using SHAP
This study used Linear Regression (LR), RF, and MLP to learn and predict Chl-a at the Dasan water quality observatory. Among the three models, the model with the highest predictive performance was selected. The loss function for MLP learning was set to the root mean square error (RMSE), which is given in Equation (6).
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (6)$$
where $n$ is the number of data points, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value. LR, RF, and MLP prediction errors were also evaluated using RMSE, and additionally using MAE and R². The MAE formula is given in Equation (7).
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (7)$$
where $n$ is the number of data points, $y_i$ is the observed value, and $\hat{y}_i$ is the predicted value. The R² formula is given in Equation (8).
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (8)$$
where $n$ is the number of data points, $y_i$ is the observed value, $\bar{y}$ is the average of the observed values, and $\hat{y}_i$ is the predicted value.
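The three evaluation metrics can be sketched in a few lines. The function names are hypothetical, and R² is written here in its usual coefficient-of-determination form (one minus the ratio of residual to total sum of squares), which is assumed to match the definition used in this study.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

def r2(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.0, 2.0, 3.0, 6.0]
scores = {"RMSE": rmse(obs, pred), "MAE": mae(obs, pred), "R2": r2(obs, pred)}
```

In this toy case a single error of 2 on four points gives an RMSE of 1.0 and an MAE of 0.5, which illustrates that RMSE penalizes isolated large errors, such as missed peaks, more heavily than MAE.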
To learn and predict Chl-a, the structural parameters of RF and MLP must be set. The RF parameters used in this study were n_estimators = 50, max_depth = 37, min_samples_split = 6, and min_samples_leaf = 4, which showed high performance for high-frequency water quality data in Chl-a prediction [35]. The MLP consisted of five hidden layers with 10 nodes each, a structure that showed high learning accuracy in hydrological inflow prediction [19]. Adam and ReLU were used as the optimizer and activation function of the MLP, as they showed relatively good performance in hydrological runoff prediction [20,21].
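For illustration, the three model configurations could be expressed with scikit-learn as below. This is a sketch under assumptions: the text does not state which library was used, `max_iter` and `random_state` are illustrative choices, and scikit-learn's `MLPRegressor` minimizes squared error rather than RMSE directly (both have the same minimizer). The data are synthetic placeholders, not observatory measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

def build_models(random_state=0):
    """Return the three regressors with the structural parameters from the text."""
    return {
        "LR": LinearRegression(),
        "RF": RandomForestRegressor(
            n_estimators=50,
            max_depth=37,
            min_samples_split=6,
            min_samples_leaf=4,
            random_state=random_state,
        ),
        "MLP": MLPRegressor(
            hidden_layer_sizes=(10, 10, 10, 10, 10),  # five hidden layers, 10 nodes each
            activation="relu",
            solver="adam",
            max_iter=500,  # assumption: not given in the text
            random_state=random_state,
        ),
    }

# Synthetic stand-in for 11 scaled input features and a scaled target.
rng = np.random.default_rng(0)
X = rng.random((200, 11))
y = X @ rng.random(11)

models = build_models()
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```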
Table 3 shows the Chl-a prediction results of each model.
According to Table 3, MLP reduced RMSE by about 47% or more and MAE by more than 57% compared to LR. This is attributed to the inability of the linear regression model to sufficiently explain environmental data with rapid variability such as tides. Compared to RF, MLP reduced RMSE by about 11% or more and MAE by more than 2%. In particular, in terms of R², the explanatory power of the MLP model was more than 1.77 times that of LR and about 1.13 times that of RF.
Figure 6 is a comparison of the predicted and observed values of the three models.
According to Figure 6, MLP showed high prediction accuracy throughout the entire period and during the peak concentration interval. However, LR and RF showed lower accuracy than MLP during the peak concentration interval. In addition, LR showed the lowest prediction accuracy of the three models throughout the entire period. All three models showed low prediction accuracy in the period after the peak concentration because new patterns were not sufficiently reflected. This is attributed to input patterns in the prediction process that differed from those in the validation phase. Of the three models, MLP showed relatively high prediction performance, and the SV analysis was therefore performed with the MLP.
As described above, the MLP used for the SV analysis consisted of five hidden layers with 10 nodes each [19], with Adam as the optimizer and ReLU as the activation function [20,21]. We analyzed the SV for each input feature based on an MLP trained on Chl-a using the entire input data set composed of 11 input features. This study performs data-driven learning to improve predictive performance, and the input variable reconstruction process is based on a quantitative evaluation of the contribution and influence of each variable using explainable artificial intelligence (XAI). Therefore, the selection of input variables was not based on simple statistical correlations, but rather on model-informed assessments that reflect the statistical interactions and contribution patterns learned during the MLP training process. This approach does not directly identify causal relationships, but rather aims to derive the optimal input combination based on data to maximize predictive accuracy. In other words, the MLP-based model used in this study does not assume a linear causal structure such as AR or ARX, and it is not intended to infer causal relationships among water-quality variables. Instead, it is a data-driven predictive model designed to learn nonlinear statistical patterns in water-quality time series.
The average SV was calculated based on the SVs produced through 10 iterations.
Table 4 shows the SVs produced for each input feature.
The SHAP analysis revealed that Chl-a(t−1) had the highest SV (129.08) for predicting Chl-a, approximately 49 times higher than that of Turbidity. The previous day's Chl-a concentration therefore made the strongest statistical contribution to the model's prediction of the current Chl-a concentration. Excluding Chl-a(t−1), pH had the highest SV (8.08), and the SVs of the remaining features ranged from about 2 to 8.
The large SHAP value of Chl-a(t−1) reflects the strong temporal persistence typically observed in algal blooming dynamics. Because short-term Chl-a levels tend to change gradually rather than abruptly, the most recent Chl-a measurement contains substantial information about the current biological state of the system. As a result, the model assigns high predictive weight to Chl-a(t−1). This does not diminish the relevance of other water-quality variables; instead, the strong temporal continuity captured by Chl-a(t−1) explains a large portion of the short-term variability, leaving the remaining features to contribute more incrementally. In nonlinear models such as MLPs, this phenomenon can reduce the apparent marginal contribution of concurrent variables, not because their influence is removed, but because much of the predictable structure is already captured by the lagged Chl-a value. Therefore, the smaller SHAP values for pH, DO, and nutrients represent their additional predictive contribution beyond the substantial information already contained in Chl-a(t−1).
Based on the SHAP analysis results, the input data were reconstructed and used to train the MLP. The optimal input data were then selected through MLP learning and prediction on the reconstructed input data.
Table 5 shows the input data removed in descending order of SV, by case.
According to Table 5, Case 1 is the original input data, which consists of all 11 input features. In each subsequent case, one more input feature is eliminated, starting from the feature with the lowest SV. Learning and prediction were performed using MLP for each case with the reconstructed input data. The optimal input data were selected through a trade-off analysis between the MLP's verification and prediction results.
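The case construction can be sketched as a ranking followed by successive truncation. In the example below only the SVs of Chl-a(t−1) and pH are taken from the text; the DO and Tur values are placeholders for illustration, and the function name is hypothetical.

```python
def build_cases(mean_sv):
    """Build nested feature sets by dropping the lowest-SV feature one at a time.

    Case 1 keeps every feature; Case k drops the k - 1 features with the
    smallest mean SV.
    """
    ranked = sorted(mean_sv, key=mean_sv.get, reverse=True)  # highest SV first
    return {
        f"Case {k + 1}": ranked[: len(ranked) - k]
        for k in range(len(ranked))
    }

mean_sv = {
    "Chl-a(t-1)": 129.08,  # from the text
    "pH": 8.08,            # from the text
    "DO": 5.0,             # placeholder value, not a reported result
    "Tur": 2.6,            # placeholder value, not a reported result
}
cases = build_cases(mean_sv)
```

Each case would then be used to train and evaluate a separate MLP, so the number of trained models equals the number of cases.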
Table 6 shows the verification and prediction results for each case using MLP.
According to Table 6, the verification error increases as the case number increases. The prediction error also increased with the case number at first, but then decreased after a certain case. To analyze the trade-off between the verification and prediction results, both sets of results were normalized to a common scale and analyzed graphically.
Figure 7 shows the verification and prediction results with normalization applied, by case.
According to Figure 7, Case 2 showed the best overall results. Although Case 2 did not show the lowest verification error, it showed the best prediction results. The graphical analysis of the normalized results showed that Case 2 offered the best trade-off between verification and prediction, indicating that unnecessary features were removed during the learning process. Case 2 consists of the input data with Tur removed. After selecting the optimal input data, full-period verification and prediction were performed based on the original and reconstructed input data.
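One way to make such a trade-off analysis explicit is to normalize both error series across cases and rank the cases by their combined score. The sketch below assumes an equal-weight sum of normalized errors as the criterion, which is an illustrative choice; the study itself reports a graphical analysis. All error values are hypothetical placeholders, not results from Table 6.

```python
def normalize_across_cases(errors):
    """Min-max normalize a {case: error} mapping to [0, 1]."""
    low, high = min(errors.values()), max(errors.values())
    span = high - low
    return {case: (e - low) / span if span else 0.0 for case, e in errors.items()}

def select_case(verification, prediction):
    """Pick the case with the smallest sum of normalized errors (assumed criterion)."""
    v_norm = normalize_across_cases(verification)
    p_norm = normalize_across_cases(prediction)
    combined = {case: v_norm[case] + p_norm[case] for case in verification}
    return min(combined, key=combined.get), combined

# Hypothetical RMSE values for three cases, for illustration only.
verification_rmse = {"Case 1": 1.0, "Case 2": 1.2, "Case 3": 2.0}
prediction_rmse = {"Case 1": 3.0, "Case 2": 2.0, "Case 3": 2.8}
best_case, combined_scores = select_case(verification_rmse, prediction_rmse)
```

In this toy example the selected case has neither the lowest verification error nor a poor prediction error, which mirrors the kind of compromise described above.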
Table 7 shows the verification results for each method.
According to the verification results in Table 7, the Average RMSE of the reconstructed data increased by approximately 0.2527 compared to the original data. Furthermore, the Min RMSE and Max RMSE also increased by 0.1156 and 1.3298, respectively. This increase in verification error is believed to be due to the reduced model complexity resulting from the smaller number of input features. For ANN data models, accuracy depends on the constructed input data.
Figure 8 shows the verification results.
According to Figure 8, both the raw and reconstructed data demonstrate high verification accuracy over the entire period. Zooming in on the time points where peak Chl-a concentrations occurred, the raw data tended to overestimate the observed values. Conversely, the reconstructed data showed a slight underestimation of the observed values. This resulted in a slightly larger overall error compared to the raw data, and a slight increase in the Average RMSE. Predictions were made based on the verification results, and the prediction results are presented in Table 8.
The prediction performance comparison results in Table 8 show that the reconstructed data reduced the predicted RMSE by approximately 0.8749 compared to the original data. Furthermore, the peak difference from the observed value was also reduced by approximately 0.6405 mg/m³. This indicates that while the verification accuracy through MLP is somewhat reduced when using the reconstructed input data, the prediction performance is actually improved.
Figure 9 shows the prediction results.
According to the prediction results in Figure 9, the reconstructed data demonstrated high prediction accuracy throughout the entire period and during the peak concentration interval. However, accuracy deteriorated somewhat during periods of temporary fluctuations or when relatively high concentrations occurred after the peak. This is believed to be due to input patterns that differed from the verification phase during the prediction process. Conversely, the raw data tended to underestimate observed values at peak concentrations. Furthermore, in the post-peak period, when relatively high concentrations occurred, the new patterns were not sufficiently reflected, resulting in lower prediction accuracy.
After reconstructing the input data based on SHAP analysis, we compared the learning and prediction performance of the MLP. Case 2, which excludes Tur, showed a slight decrease in verification accuracy but improved prediction performance. This suggests that the XAI technique can help mitigate the limitations associated with the complexity of the MLP. Accordingly, Case 2, which exhibited high prediction performance, was applied to the EPM.
3.3. Application Results of EPM
In this study, the Algal Blooming Warning Operation Manual (2017) of the National Institute of Environmental Research (NIER) of the Ministry of Environment of the Republic of Korea was utilized to assess water pollution based on Chl-a predicted through the multilayer perceptron (MLP) [36]. The algal blooming warning system was introduced in 1998 to proactively monitor and minimize water pollution damage caused by the massive growth of blue-green algae due to rising water temperatures in summer. It operates under the Water Quality and Aquatic Ecosystem Conservation Act, and its warning criteria are divided into three levels: “Watch (≥15 mg/m³)”, “Warning (≥25 mg/m³)”, and “Outbreak (≥100 mg/m³)” based on Chl-a concentration. The issuance or release of a warning is determined based on whether two consecutive measurements meet the criteria. In this study, the warning level was used as the standard.
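The warning criteria can be sketched as a threshold lookup combined with a two-consecutive-measurement rule. The counting rule below (an occurrence is counted at each measurement where it and the preceding one both meet the criterion) is an assumption, because the text does not specify how occurrences were tallied; the function names are hypothetical.

```python
# Chl-a thresholds in mg/m^3, ordered from the most to the least severe level.
THRESHOLDS = [("Outbreak", 100.0), ("Warning", 25.0), ("Watch", 15.0)]

def level_of(chl_a):
    """Return the highest level whose threshold the concentration meets, or None."""
    for name, limit in THRESHOLDS:
        if chl_a >= limit:
            return name
    return None

def count_warnings(series, limit=25.0):
    """Count measurements where this and the previous value both meet the limit.

    Assumption: this is one possible reading of the two-consecutive-measurement
    rule; the default limit is the "Warning" level used as the standard.
    """
    return sum(
        1
        for previous, current in zip(series, series[1:])
        if previous >= limit and current >= limit
    )

predicted = [10.0, 26.0, 30.0, 12.0, 40.0, 41.0, 42.0]
n_warnings = count_warnings(predicted)
```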
Based on the MLP results of SHAP-based input data reconstruction, Chl-a was predicted using Case 2, which showed high accuracy. Scenarios were constructed by combining each input feature with Chl-a(t−1), which directly affects Chl-a prediction. In a previous study that simulated changes in TN, TP, and Chl-a concentrations in a lake, a scenario was constructed in which the external TN and TP loads were gradually reduced by 10% up to a maximum of 50% [37]. In this study, scenarios were constructed by virtually adjusting the scaled values of each input feature from 100% to 10% in 10% increments to examine how these adjustments influence the model's predicted Chl-a levels.
In this study, target levels for reducing predicted algal-bloom warning occurrences were set to 75%, 50%, and 25%, and model-based scenario outcomes were evaluated to identify feature-adjustment patterns associated with reduced predicted warning frequencies [35]. In addition, a scenario-based analysis was performed for pH and DO, which exhibited high SHAP-based contributions, allowing exploration of how adjustments to these variables affect predicted Chl-a levels within the model. In this study, Chl-a(t−1) appears as the most influential variable because the model learns strong temporal persistence in the Chl-a time series. Therefore, reducing the scaled value of Chl-a(t−1) in the scenario framework lowers the predicted Chl-a simply because the model statistically relies on this lagged relationship. However, Chl-a(t−1) is not a controllable management variable, and the scenario adjustments do not imply that managers can or should physically reduce “yesterday’s Chl-a” to influence today’s conditions. The results instead represent model sensitivity based on temporal autocorrelation learned during training.
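The scenario construction amounts to a 10 × 10 sensitivity grid over two scaled inputs. In the sketch below, `toy_predict` is a hypothetical stand-in for the trained MLP, the feature keys and sample rows are placeholders, and exceedances are counted per sample for simplicity rather than with the consecutive-measurement rule.

```python
RATIOS_PCT = list(range(100, 0, -10))  # 100, 90, ..., 10

def scenario_counts(predict, rows, feat_a, feat_b, limit):
    """Count predicted exceedances for every pair of scaling ratios.

    predict: callable mapping a feature dict to a predicted Chl-a value.
    rows: list of feature dicts holding scaled input values.
    """
    counts = {}
    for a in RATIOS_PCT:
        for b in RATIOS_PCT:
            n_exceed = 0
            for row in rows:
                adjusted = dict(row)  # copy so the original row is untouched
                adjusted[feat_a] = row[feat_a] * a / 100
                adjusted[feat_b] = row[feat_b] * b / 100
                if predict(adjusted) >= limit:
                    n_exceed += 1
            counts[(a, b)] = n_exceed
    return counts

def toy_predict(row):
    """Hypothetical linear stand-in for the trained model (illustration only)."""
    return 40.0 * row["chl_prev"] + 10.0 * row["pH"]

rows = [{"chl_prev": 1.0, "pH": 1.0}, {"chl_prev": 0.5, "pH": 0.8}]
counts = scenario_counts(toy_predict, rows, "chl_prev", "pH", limit=25.0)
```

Because only the model inputs are rescaled, the grid describes the sensitivity of the model output and not a physical intervention.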
3.3.1. pH-Based Algal Bloom Response Scenario Analysis
For the pH-based scenario analysis, scenarios were constructed by virtually adjusting the scaled pH and Chl-a(t−1) inputs, and the resulting number of predicted algal bloom warnings was calculated using the trained model. The results are shown in Figure 10.
According to Figure 10, the predicted number of algal bloom warnings decreased when the scaled values of Chl-a(t−1) and pH were lower within the model-based scenario space. These patterns indicate that Chl-a(t−1) and pH exhibit positive associations with the model's predicted Chl-a values within the scenario settings. In the model output, higher scaled values of Chl-a(t−1) and pH were associated with higher predicted Chl-a levels, which in turn resulted in a larger number of predicted warning occurrences. Therefore, the highest number of algal blooming warning alarms occurred in the upper right corner, where the Chl-a(t−1) and pH ratios were both 100%, and the lowest number occurred in the lower left corner, where both ratios were 10%.
According to the SHAP results, pH exhibited the highest model-based contribution among the features except Chl-a(t−1). This indicates that pH plays a relatively strong role in the model's prediction process, without implying a causal relationship. Consistent with the SV results, within the EPM scenario experiments, lower scaled pH values were associated with fewer predicted warning occurrences. A similar pattern was observed for Chl-a(t−1), reflecting the model's internal response to changes in input values. To achieve a predicted warning occurrence rate of 75% or less within the model, the scenario experiments indicated that the scaled pH input would need to be set to approximately 30% or lower when Chl-a(t−1) remained at 100%. If the Chl-a(t−1) ratio is reduced to 90%, the pH ratio must be 50% or less, and if the Chl-a(t−1) ratio is reduced to 80%, the pH ratio must be 90% or less. When the Chl-a(t−1) ratio is 70% or less, the target is satisfied at all pH ratios. To limit the alarm occurrence rate to 50% or less, the target cannot be met at any pH ratio when the Chl-a(t−1) ratio is between 100% and 90%. A Chl-a(t−1) ratio of 80% requires a pH ratio of 10% or less, and a Chl-a(t−1) ratio of 70% requires a pH ratio of 40% or less. When the Chl-a(t−1) ratio is 60% or less, the target is satisfied at all pH ratios.
To limit the alarm occurrence rate to 25% or less, the target cannot be met at any pH ratio when the Chl-a(t−1) ratio is between 100% and 60%. A Chl-a(t−1) ratio of 50% requires a pH ratio of 80% or less, and when the Chl-a(t−1) ratio is 40% or less, the target is satisfied at all pH ratios. The corresponding scenario thresholds represent combinations of scaled Chl-a(t−1) and pH values that produced the target levels of predicted warning occurrences in the model. These thresholds describe the model's behavior rather than physical or causal effects in the study area. These results reflect the model's statistical response to scaled inputs and should not be interpreted as implying that reducing pH, DO, or Chl-a(t−1) would physically control algal blooms in natural systems.
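Threshold statements of this kind can be read off a scenario grid programmatically. The sketch below assumes that the occurrence rate is measured relative to the unadjusted (100%, 100%) scenario and reports, for each ratio of the first feature, the largest ratio of the second feature that still meets the target. The grid is a hypothetical toy surface, not the model output shown in Figure 10.

```python
def max_ratio_meeting_target(grid, target_rate):
    """For each ratio of feature A, find the largest feature B ratio meeting the target.

    grid: {(a_pct, b_pct): warning_count}; the baseline is grid[(100, 100)].
    Returns {a_pct: b_pct or None}, where None means no ratio meets the target.
    """
    baseline = grid[(100, 100)]
    allowed = target_rate * baseline
    result = {}
    for a in sorted({key[0] for key in grid}, reverse=True):
        feasible = [b for (aa, b), count in grid.items() if aa == a and count <= allowed]
        result[a] = max(feasible) if feasible else None
    return result

# Hypothetical grid: warning counts fall with both ratios (illustration only).
ratios = range(100, 0, -10)
toy_grid = {(a, b): a * b // 100 for a in ratios for b in ratios}

thresholds_75 = max_ratio_meeting_target(toy_grid, 0.75)
thresholds_25 = max_ratio_meeting_target(toy_grid, 0.25)
```

A value of 100 in the result means the target is met at every ratio of the second feature, and `None` means it cannot be met for that ratio of the first feature.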
3.3.2. DO-Based Algal Bloom Response Scenario Analysis
For the DO-based scenario analysis, scenarios were constructed by virtually adjusting the scaled DO and Chl-a(t−1) inputs, and the resulting number of predicted algal-bloom warnings was computed using the trained model. The results are shown in Figure 11.
According to Figure 11, lower scaled values of Chl-a(t−1) and DO were associated with fewer predicted warning occurrences in the model output. Furthermore, the threshold values are identical to those for Chl-a(t−1) and pH, as they decrease from 75% to 50% and 25%, respectively. However, unlike the pH-based scenarios, the DO-based scenarios exhibited different model-response patterns across reduction targets. For example, when Chl-a(t−1) was fixed at 100% in the scenario space, the model predicted a 75% reduction in warning occurrences for certain pH adjustments. In contrast, even large reductions in DO did not yield the same model-predicted reduction. This indicates that DO contributed less strongly than pH to the model's prediction patterns. Furthermore, lower scaled Chl-a(t−1) values were consistently associated with fewer predicted warning occurrences within the model framework.
The 25% reduction target in Figure 10 shows that the threshold shape for reducing the number of algal bloom alerts by 50% is similar to the shape of the relationship between Chl-a(t−1) and pH. When interpreting scenario outcomes, combinations involving features with higher SHAP contributions yielded larger reductions in predicted warnings, reflecting the model’s internal sensitivity patterns.
3.3.3. 7 Water Quality Index-Based Algal Bloom Response Scenario Analysis
For the scenario analysis based on the remaining seven water quality indices, scenarios were constructed based on input feature reductions, and the number of algal bloom warning alerts generated was calculated based on the predicted results. The results are shown in Figure 12.
According to Figure 12, NO3-N, EC, PO4-P, TOC, TN, TP, and NH3-N showed the same results in the scenario to reduce the number of algal blooming warning alarm occurrences by 50%. In this scenario, the model was most sensitive to changes in Chl-a(t−1), which exhibited the highest SHAP contribution among the input features. In the scenario to reduce the number of algal blooming warning alarm occurrences by 75%, all features except PO4-P are influential, although not as influential as pH and DO. Based on these results, we found that scenario combinations involving pH or DO adjustments showed larger reductions in model-predicted warning occurrences compared with the other features. However, we also found that mitigation scenarios utilizing NO3-N, EC, PO4-P, TOC, TN, TP, and NH3-N are feasible. Specifically, when Chl-a(t−1) is fixed at 80%, TP is approximately 80% or higher, and NH3-N is approximately 20% or higher, the number of algal bloom warning alerts is reduced by 75%. TP and NH3-N exhibited negative SHAP associations with the model's predicted Chl-a values, indicating a model-learned pattern rather than a causal relationship.
Analysis of scenarios aimed at reducing the number of algal bloom warning alerts by 25% revealed that NO3-N, EC, PO4-P, TOC, TN, TP, and NH3-N have a lower impact than pH and DO. These features also have less influence than Chl-a(t−1), and adjusting the scaled Chl-a(t−1) values produced the largest changes in predicted warning occurrences within the model.