Abstract
South Korea’s National Institute of Environmental Research (NIER) operates an algae alert system to monitor water quality at public water supply source sites. Accurate prediction of dominant harmful cyanobacterial genera, such as Aphanizomenon, Anabaena, Oscillatoria, and Microcystis, is crucial for managing water source contamination risks. This study utilized data collected between January 2017 and December 2022 from Juam Lake and Tamjin Lake, which are representative water supply source sites at the Yeongsan River and Seomjin River basins. We performed an exploratory data analysis on the monitored water quality parameters to understand overall fluctuations. Using data from 2017 to 2021 as training data and 2022 data as test data, we compared the dominant algal classification accuracy of 11 statistical machine learning algorithms. The results indicated that the optimal algorithm varied depending on the survey site and evaluation criteria, highlighting the unique environmental characteristics of each site. By predicting dominant algae in advance, stakeholders can better prepare for water source contamination accidents. Our findings demonstrate the applicability of machine learning algorithms as efficient tools for managing water quality in water supply source systems using monitoring data.
1. Introduction
In South Korea, sites crucial for providing potable water to local residents are designated and managed as water protection zones. The importance of properly managing these water sources was underscored by the extreme drought in the Honam region of South Korea in 2022. To safeguard the water quality of these sources, the Korean government established an algae alert system in 1998. This system minimizes toxic effects caused by large numbers of harmful cyanobacteria by issuing alerts based on harmful cyanobacterial cell counts: Caution (at least 1000 cells for 2 consecutive counts), Warning (at least 10,000 cells for 2 consecutive counts), Outbreak (at least 1,000,000 cells for 2 consecutive counts), and Release (number of cyanobacterial cells below the alert threshold for 2 consecutive counts) [1,2,3]. Specifically, four representative genera of cyanobacteria, Aphanizomenon, Anabaena, Oscillatoria, and Microcystis, release harmful toxins causing acute liver disease in humans [4] and threatening the stability of aquatic ecosystems [5]. Researchers have explored various methods to reduce the abundance of these harmful cyanobacteria, including physical methods, such as algal blocking mats (ABM); chemical methods, such as plant–mineral composite (PMC) coagulants; and biological methods, such as using Unio douglasiae [6,7,8]. However, such methods are predominantly used reactively rather than proactively, i.e., they are used when water quality is declining or has already declined.
To predict future changes in water quality and, thus, enable more proactive management of water sources, recent studies have explored how to predict changes in specific water quality parameters, with a particular focus on statistical machine learning techniques. Such techniques are being investigated because they can process large amounts of water quality-related data and can be used to compare the usefulness of different water quality parameters. In particular, multiple studies have focused on predicting values of the water quality parameter chlorophyll-a (Chla). For instance, Kim, H. G. (2017) assessed the suitability of an artificial neural network technique for predicting Chla concentration at a midstream location in South Korea’s Nakdong River [9]. Moreover, Lee et al. (2020) investigated the ability of four statistical machine learning algorithms to predict Chla concentrations [10]. Similarly, Bui et al. (2020) used 16 novel hybrid machine learning algorithms and various water quality parameters to predict changes in the Water Quality Index (WQI) [11]. However, that study was limited in its ability to thoroughly compare the performance of the 16 algorithms; in particular, it did not apply more recent algorithms, such as AdaBoost or gradient boosting. The primary difference between these previous studies and the current study is that the former focused on accurately predicting the measured values of the water quality parameter Chla (a continuous variable), whereas this study aims to accurately classify the dominant algae (a categorical variable). Algal growth is influenced by many factors, the most important of which are the availability of nutrients, such as nitrogen (N) and phosphorus (P), and water quality parameters, such as water temperature. However, hydraulic/hydrological factors, such as water level and water storage capacity, also play a role, necessitating the consideration of all factors [12].
Therefore, considering the diverse variables related to water quality and hydraulic/hydrological factors, accurately predicting the dominant algae could enable authorities to better prepare for and respond to algal water pollution incidents. For this study, we utilized the water quality monitoring network data, algae alert system data, and hydraulic/hydrological data collected from Juam Lake and Tamjin Lake, representative water supply sources in the Yeongsan River and Seomjin River systems. Measurements were taken at seven-day intervals from January 2017 to December 2022 and were obtained through the National Institute of Environmental Research (NIER) Water Environment Information System. We compared and analyzed various statistical machine learning algorithms to determine their accuracy in classifying the dominant algae. By developing and implementing a predictive method for dominant algal occurrences, we aim to provide a more efficient approach to water quality management.
2. Materials and Methods
The methods for this study consisted of three main stages, namely data collection, exploratory data analysis, and a comparison of the classification performance of 11 selected algorithms. A flowchart of these key steps for the methodology is shown in Figure 1.
Figure 1.
Methodological flowchart used in this study.
2.1. Study Area
This study focused on Juam Lake and Tamjin Lake, two representative water supply sources in the Yeongsan River and Seomjin River systems in South Korea. The NIER Yeongsan River Environment Research Laboratory collects weekly samples to monitor water quality and respond to the algae alert system from the dam front (J1, 127°14′26.74″ E/35°03′23.78″ N) and Shinpyeong Bridge (J2, 127°13′59.11″ E/35°00′50.37″ N) at Juam Lake, and the dam front (T1, 126°52′52.01″ E/34°45′07.09″ N) and Yuchi stream confluence (T2, 126°52′11.82″ E/34°46′02.99″ N) at Tamjin Lake. Additionally, the Korea Water Resources Corporation conducts daily measurements of hydraulic/hydrological variables, such as water storage capacity.
Juam Lake is an artificial lake formed by the freshwater held back by Juam Dam, which has a height of 58 m and a length of 330 m. It is located in Daegwang-ri, Juam-myeon, Suncheon-si, Jeollanam-do, and has a total basin area of 1010 km² and a total water storage capacity of 457 × 10⁶ tons. Juam Dam supplies about 640 × 10³ tons of potable water to the western part of Jeollanam-do, including Gwangju, Naju, Mokpo, and Hwasun [13]. Tamjin Lake is an artificial lake created by the construction of Jangheung Dam, which has a height of 53 m and a length of 403 m. It has a total basin area of 193 km² and a total water storage capacity of 191 × 10⁶ tons. It is located in Yuchi-myeon, Jangheung-gun, Jeollanam-do, and supplies 73 × 10⁶ tons of potable water to 9 cities in Jeollanam-do [14]. Figure 2 shows the sampling sites at Juam Lake and Tamjin Lake.
Figure 2.
Sampling sites at Juam Lake and Tamjin Lake.
2.2. Data Collection
To conduct a comprehensive analysis, we collected and organized hydraulic/hydrological data, algae alert system data, and water quality monitoring network data from the survey sites. These data were measured at seven-day intervals from January 2017 to December 2022 and were obtained through the NIER Water Environment Information System. The number of observations for each sampling site was as follows: in Juam Lake, 307 observations at both the dam front (J1) and Shinpyeong Bridge (J2) sites, and in Tamjin Lake, 304 observations at both the dam front (T1) and Yuchi Stream Confluence (T2) sites. Overall, this study comprised a total of 1222 observations. For comparison of the performance of the statistical machine learning algorithms, the training data consisted of the measurements from 2017 to 2021 at each survey site, while the test data consisted of the remaining measurements from 2022. For the J1 site and J2 site for Juam Lake, the number of observations included in the training data and test data was 257 and 50, respectively. For the T1 site and T2 site for Tamjin Lake, the number of observations included in the training data and test data was 255 and 49, respectively. Table 1 shows the data variables used in this study.
Table 1.
Data variables used in this study.
Of the variables listed in Table 1, biochemical oxygen demand (BOD), chemical oxygen demand (COD), total nitrogen (TN), total phosphorus (TP), total organic carbon (TOC), suspended solids (SS), and electrical conductivity (EC) were collected from the water quality monitoring network data, while pH, dissolved oxygen (DO), temperature, turbidity, transparency, Chla, and dominant algae were obtained from the algae alert system data. The remaining variables, including low water level, inflow, discharge, and reservoir, were collected from the National Water Resources Management Information System (http://www.wamis.go.kr/ (accessed on 5 February 2023)). The genera of algae that were found from the data collection at the sampling sites are presented in Table 2. Figure 3 shows line graphs of the monthly mean number of algal cells sampled during the survey period, categorized according to the survey site and algal genus. Based on the results in Table 2 and Figure 3, during the survey period, chlorophytes or diatoms tended to dominate in spring, cyanophytes in early summer and summer, and chlorophytes along with diatoms in autumn and early winter [15]. For clarity, in South Korea, the period from 25 June to 19 July is considered early summer, and the period from 20 July to 7 September is considered summer. All data analyses in this study were performed using the statistical software R, version 4.2.1.
Table 2.
Genera of algae that were identified in the water samples collected from the sampling sites.
Figure 3.
Line graphs of average algal cell count at the sampling sites J1, J2, T1, and T2 from January 2017 to November 2022.
2.3. Data Analysis Methods
This section describes the data analysis methods employed in this study, starting with exploratory data analysis. This includes correlation analysis and pattern analysis using a self-organizing map (SOM) to examine the overall distribution of water quality parameters and the hydraulic/hydrological variables included in the data. We also briefly explain the principles of the 11 statistical machine learning algorithms, which we compared against each other to assess their relative predictive power in classifying the dominant algae.
2.3.1. Exploratory Data Analysis
Before analyzing the data, an exploratory data analysis was performed to investigate the overall characteristics of the variables in the data, including descriptive statistics, such as the mean or variance, and their distributions [16]. While no single prescribed analytical method or process exists, researchers may prefer different methods depending on their objectives. Generally, the first step is to determine whether the variables included in the data are continuous or categorical. The mean, standard deviation, density, and other distributional characteristics were calculated for continuous variables. For categorical variables, the number of categories and the number of observations for each category were examined. In this study, we employed correlation analysis to investigate the relationship between the water quality parameters and the hydraulic/hydrological parameters, and we applied pattern analysis using an SOM to visually confirm the results.
- 1. Correlation analysis
Correlation is a widely used statistical analysis method for investigating the relationships between continuous variables in a dataset. For this purpose, the Pearson correlation coefficient was calculated as shown in Equation (1), and a significance test was conducted on the resultant coefficient. Generally, the validity of the analytical results can be confirmed only when normality is satisfied, as assessed through a normality test, such as the Shapiro–Wilk (SW) test [17]. However, the SW test is limited because it can only be applied to independent random variables. Since all measurement variables in this study are time series data measured over a given time period, rather than independent random variables, the Jarque–Bera (JB) test was deemed more appropriate [18].
However, environment-related measurement variables typically do not satisfy normality and instead fluctuate considerably. Consequently, analytical results based on the Pearson correlation coefficient lose reliability for data with such variables. Therefore, we performed correlation analysis using the Spearman correlation coefficient, a non-parametric method that measures correlation based on ranks, as expressed in Equation (2):
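The rank-based calculation behind the Spearman coefficient can be sketched in a few lines (an illustrative Python snippet with toy values; the study's analyses were performed in R):

```python
import numpy as np

def rankdata(a):
    # Assign ranks 1..n; tied values share the average of the ranks they span.
    a = np.asarray(a, dtype=float)
    order = np.argsort(a)
    sorted_a = a[order]
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and sorted_a[j + 1] == sorted_a[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # average 1-based rank of the tie block
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman = Pearson correlation applied to the ranks of x and y.
    rx, ry = rankdata(x), rankdata(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because only the ranks enter the formula, any monotone (even strongly non-linear) association yields a coefficient of ±1, which is why the measure tolerates the skewed distributions typical of environmental data.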
- 2. Pattern analysis using SOM
An SOM is an artificial neural network technique that simultaneously performs dimension reduction and clustering [19]. With this technique, the nodes of the map compete for the observations in the high-dimensional data. Based on the winning node that emerges from this competition, learning results are obtained that preserve similarity as much as possible in the reduced dimensions. This principle is illustrated in Figure 4.
Figure 4.
Schematic diagram of a self-organizing map.
This process repeats the algorithm shown in Equation (3) until convergence, and the i-th lattice vector at time t is updated:
In the above Equation (3), η is a learning rate parameter that decays over time to prevent overfitting, and λ is a neighborhood function that applies larger updates to nodes near the winning node and smaller updates to distant nodes.
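A single SOM update step of the kind described by Equation (3) might look as follows (an illustrative Python sketch; the Gaussian neighborhood and the decay constants eta0, tau, and sigma0 are hypothetical choices, not the study's settings):

```python
import numpy as np

def som_step(weights, x, t, grid, eta0=0.5, tau=100.0, sigma0=1.0):
    # weights: (n_nodes, dim) lattice vectors; grid: (n_nodes, 2) lattice coordinates.
    eta = eta0 * np.exp(-t / tau)                      # decaying learning rate (eta term)
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # winning (best-matching) node
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)         # squared lattice distance to winner
    h = np.exp(-d2 / (2 * sigma0 ** 2))                # neighborhood weights (lambda term)
    return weights + eta * h[:, None] * (x - weights)  # pull nodes toward the input x
```

Each call pulls the winning node strongly toward the presented observation and its lattice neighbors more weakly, which is how similarity in the data comes to be preserved on the low-dimensional map.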
Through SOMs, Jung et al. (2020) performed a pattern analysis based on the water quality parameters measured at 28 sampling sites in the Nakdong River system in South Korea [20]. To determine which branches should be prioritized for management, they used a grading process through cluster analysis based on the characteristics of each site. From these findings, they were able to propose policy recommendations. In this study, we performed a pattern analysis on 17 measurement variables using this method and identified the correlations between them.
2.3.2. Statistical Machine Learning Algorithms for Dominant Algal Classification
We compared the performance of 11 statistical machine learning algorithms for classifying the dominant algae at each survey site. Detailed explanations of the principles of the applied algorithms can be found in the literature [21,22].
- 1. Three tree-based methods
A decision tree (DT) is a method for creating a decision model with a tree-like structure. The impurity of nodes is evaluated to select the optimal splitting criterion. Mean squared error, calculated using Equation (4), is used for regression, and the Gini coefficient, calculated using Equation (5), or the entropy coefficient, calculated using Equation (6), is used for classification. Compared to other algorithms, decision trees are visually simple and relatively easy to interpret [23].
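The two classification impurity measures can be computed directly from the class proportions at a node (an illustrative Python sketch with toy label vectors; the study's analyses were performed in R):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2; 0 for a pure node, larger when classes mix.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def entropy(labels):
    # Entropy impurity: -sum_k p_k log2 p_k; also 0 for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A split is chosen to maximize the impurity reduction from parent node to child nodes, which is also the quantity the tree-based algorithms later accumulate into variable importance scores.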
In contrast to the decision tree method, the bagging (Bag) method involves sampling with replacement, which allows observations extracted from the analysis data to be re-extracted across multiple bootstrap samples. The analysis first creates multiple decision tree models, one per sample, and then averages their prediction results for regression or takes a majority vote over their classification results, where the mode refers to the value with the highest frequency. Because bagging resamples the data with replacement, it greatly reduces the variance of the resulting model compared to that of a single decision tree model [24,25]. Figure 5 illustrates the principle of the bagging method.
Figure 5.
Schematic diagram of the bagging method.
Thirdly, the random forest (RF) method was proposed to address a shortcoming of bagging, namely the correlation between the multiple decision tree models fitted to the bootstrap samples. Similar to bagging, RF involves extracting multiple samples with replacement from the training data and fitting multiple decision tree models to them. However, in random forest, only a randomly selected subset of the variables is used for each sample. This results in better prediction or classification performance compared to bagging. Moreover, the types of variables selected for each sample differ, reducing the correlation between the trees that can occur with bagging [26].
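The two ingredients that bagging and random forest share, bootstrap resampling and majority voting, can be sketched as follows (an illustrative Python snippet; the helper names are hypothetical, and the study itself used R):

```python
import numpy as np
from collections import Counter

def bootstrap_indices(n, n_models, rng):
    # Sampling with replacement: each tree is fit to a resampled copy of the
    # data, so an observation may appear several times in one sample while
    # roughly 36.8% of observations are left out (the out-of-bag set).
    return [rng.integers(0, n, size=n) for _ in range(n_models)]

def majority_vote(predictions):
    # Classification by voting: the mode (most frequent label) wins.
    return Counter(predictions).most_common(1)[0][0]
```

Random forest differs from this plain scheme only in additionally restricting each tree to a random subset of the explanatory variables, which decorrelates the trees before their votes are combined.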
- 2. AdaBoost (Ada)
AdaBoost is a boosting algorithm that creates a strong learner by taking a weighted linear combination of multiple weak learners, as shown in Equation (7). By correcting or supplementing incorrectly predicted or classified instances from previous steps, it can yield more accurate results than possible through the three tree-based methods:
where H(x) is the final strong learner obtained, h_t(x) are the weak learners, and α_t are the weights of the weak learners. Figure 6 illustrates the principle of AdaBoost. A more detailed explanation can be found in the literature [27].
Figure 6.
Schematic diagram of the AdaBoost method.
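The weighted vote of Equation (7) reduces to a few lines (an illustrative Python sketch; the stump learners and the labels in {−1, +1} are toy assumptions, and alpha_from_error encodes the standard AdaBoost weight formula α = ½ ln((1 − ε)/ε), which is not shown in the text above):

```python
import numpy as np

def alpha_from_error(eps):
    # Weight of a weak learner given its weighted error rate eps in (0, 1):
    # accurate learners (small eps) receive large weights.
    return 0.5 * np.log((1 - eps) / eps)

def ada_predict(x, weak_learners, alphas):
    # Strong learner H(x) = sign(sum_t alpha_t * h_t(x)).
    s = sum(a * h(x) for h, a in zip(weak_learners, alphas))
    return 1 if s >= 0 else -1
```

A learner that is no better than chance (ε = 0.5) receives zero weight, so the final vote is dominated by the weak learners that actually correct earlier mistakes.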
- 3. Two gradient boosting methods
Gradient boosting (GB) involves iteratively using the gradient to create a model and then using its residual to create another model. This process reduces the portion of the variation that the previous model could not explain, thereby reducing bias. If the given training dataset is {(x_i, y_i)} and the previously created model is F_m(x), then gradient boosting gradually finds the function h(x) that models the residual, which is the difference between the actual value y and the predicted value F_m(x), as shown in Equation (8):
After the function h(x) is found in this process, the new model F_{m+1}(x) is updated as shown in Equation (9):
where the parameter ν is the learning rate; keeping it small reduces the risk of overfitting. Figure 7 illustrates the principle of gradient boosting. A more detailed explanation can be found in the study by Natekin et al. [28].
Figure 7.
Schematic diagram of the gradient boosting method.
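The residual-fitting recursion of Equations (8) and (9) can be demonstrated numerically (an illustrative Python sketch with toy targets; the weak learner here is idealized as fitting the residual exactly, so the loop isolates the effect of the learning rate ν):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])  # toy targets
nu = 0.3                            # learning rate (shrinkage)
F = np.full_like(y, y.mean())       # initial model F_0: constant mean prediction

for _ in range(20):
    r = y - F                       # residual: what the current model cannot explain
    h = r                           # idealized weak learner fitted to the residual
    F = F + nu * h                  # Equation (9): F_{m+1} = F_m + nu * h
```

Each round shrinks the unexplained residual by a factor of (1 − ν), so the model approaches the targets gradually; a small ν slows this convergence, which is exactly the mechanism that limits overfitting.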
Extreme gradient boosting (XGB) is an improved method that addresses the slow execution time and overfitting risks of gradient boosting by supporting parallel learning, and its built-in regularization makes it more stable and robust. Traditionally, after randomly dividing the training data into k parts, k − 1 parts are used as new training data and the remaining part is used as new test data to evaluate the performance of the algorithm. The cross-validation test performs this process on all k parts of the data, as shown in Figure 8. A detailed explanation can be found in a paper by Chen et al. [29].
Figure 8.
Example of k-fold cross-validation test.
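The fold construction described above can be sketched as follows (an illustrative Python snippet; contiguous folds are assumed for simplicity, whereas practical implementations usually shuffle the data first):

```python
def kfold_indices(n, k):
    # Split observation indices 0..n-1 into k folds; each fold serves once as
    # the validation set while the remaining k-1 folds form the training set.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, val))
        start += size
    return folds
```

Averaging the validation error over all k folds gives the stopping signal used by extreme gradient boosting to pick the best iteration before the test error starts to rise.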
- 4. Three discriminant analysis methods
Linear discriminant analysis (LDA) is a classification method using R. A. Fisher’s linear decision boundary. The given data are projected onto a specific one-dimensional axis, followed by a process that finds the optimal straight line that properly distinguishes the categories. This process makes it possible to find the linear decision boundary, as shown in Figure 9. A more detailed explanation can be found in an article by Izenman, A. J. [30].
Figure 9.
Schematic diagram of the linear discriminant analysis method.
Flexible discriminant analysis (FDA) is a method that addresses the limitations of linear discriminant analysis. Instead of relying on linear decision boundaries, FDA uses splines to create a non-linear decision boundary for classification. This allows non-linear relationships to be captured and improves the overall classification accuracy [31].
Finally, when the data contain many explanatory variables, regularized discriminant analysis (RDA) improves the estimation of the covariance matrix through regularization (e.g., shrinkage) to create a decision boundary with better classification performance. For this, an optimal parameter α is estimated from the training data; if α = 1, linear discriminant analysis is performed, and if α = 0, quadratic discriminant analysis is performed. Here, α ∈ [0, 1] serves as the weight between the linear decision boundary and the quadratic curved decision boundary [32].
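Fisher's projection axis underlying LDA can be computed directly from the class means and the pooled within-class scatter (an illustrative Python sketch with toy two-class data; fisher_direction is a hypothetical helper name, not part of any library):

```python
import numpy as np

def fisher_direction(X0, X1):
    # w proportional to S_W^{-1} (mu_1 - mu_0): the 1-D projection axis that
    # best separates the two classes relative to their within-class scatter.
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    return np.linalg.solve(S, mu1 - mu0)
```

Projecting the data onto w and thresholding the projection reproduces the linear decision boundary of Figure 9; FDA and RDA then relax, respectively, the linearity and the covariance estimation of this construction.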
- 5. Support Vector Machine (SVM)
SVM is a classification algorithm that maximizes the margin, i.e., the distance between the decision boundary and the support vectors. To move the original data from an input space with a complex non-linear distribution to a high-dimensional feature space, SVM uses the kernel method, which evaluates inner products in the feature space without explicitly constructing the transformation function. The kernel method converts the data into a linearly separable distribution and makes it easier to find the decision boundary [33]. Figure 10 illustrates this concept. This study used the radial basis kernel, shown in Equation (10), which is known to be the most flexible kernel type for all data distributions. Figure 11 illustrates the principle of the support vector machine, and a detailed explanation can be found in the study by Pisner et al. [34].
Figure 10.
Schematic diagram of the kernel trick.
Figure 11.
Schematic diagram that illustrates the concept of support vector machine.
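The radial basis kernel of Equation (10) is essentially a one-liner (an illustrative Python sketch; gamma is the kernel width parameter, with the common form K(x, z) = exp(−γ‖x − z‖²)):

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Similarity in the implicit feature space: 1 when x == z,
    # decaying toward 0 as the points move apart.
    x, z = np.asarray(x, float), np.asarray(z, float)
    return float(np.exp(-gamma * ((x - z) ** 2).sum()))
```

Because the kernel returns the feature-space inner product directly, the SVM optimization never needs the (infinite-dimensional) mapping itself, which is the kernel trick sketched in Figure 10.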
- 6. Deep Neural Network (DNN)
A deep neural network is a model in the form of a neural network created by constructing multiple hidden layers between the input and output layers. The model is trained through a backpropagation algorithm that updates the weights through stochastic gradient descent, as shown in Equation (11),
where η is the parameter that controls the learning rate, and J is the cost function. Typically, before executing a deep neural network, the appropriate activation function and cost function are determined according to the analysis conditions. In multiclass classification, the activation function is set to a softmax function, as shown in Equation (12), and the cost function is set to a cross-entropy function, as shown in Equation (13). A detailed explanation of deep neural networks can be found in a paper by Montavon et al. [35].
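The softmax activation of Equation (12) and the cross-entropy cost of Equation (13) can be sketched as follows (an illustrative Python snippet; the max subtraction is a standard numerical stability trick rather than part of the equations):

```python
import numpy as np

def softmax(z):
    # Map raw scores z to class probabilities that sum to 1.
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, y_onehot):
    # Penalize low predicted probability for the true class (one-hot y).
    return float(-(y_onehot * np.log(p)).sum())
```

The cost approaches zero only when the softmax output concentrates on the true class, so minimizing it by stochastic gradient descent drives the network toward confident, correct classifications of the dominant algal category.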
2.3.3. Evaluation Indexes
To evaluate the classification accuracy of the statistical machine learning algorithms, three representative criteria were used: accuracy, sensitivity, and specificity [36]. These criteria were calculated using a confusion matrix that organized the actual correct answers and those answers predicted from the classification, as shown in Table 3.
Table 3.
Confusion matrix of dominant algal classification.
“Accuracy” simply refers to the ratio of observations that match the correct answer through classification among all observations and can be calculated as shown in Equation (14) using the table above.
The advantages of accuracy are that it is easy to calculate and can be understood intuitively. However, because it weights all observations equally, it can be misleading when the classes are severely imbalanced. To compensate for this shortcoming, we also calculated sensitivity and specificity for the four algal categories (cyanophytes, diatoms, chlorophytes, and others). Specifically, we calculated weighted sensitivity and weighted specificity by taking the class-weighted average, and we used these two metrics as additional criteria to evaluate the algorithms. Sensitivity and specificity can be understood through the binary confusion matrix shown in Table 4.
Table 4.
Binary confusion matrix.
Sensitivity is the ratio of observations properly classified as positive compared to those that are actually positive, whilst specificity is the ratio of observations properly classified as negative compared to those that are actually negative [37]. Both ratios range from 0 to 1, with values closer to 1 indicating better algorithm performance. This is expressed in Equation (15):
For multiclass classification with at least three classes of categorical variables, as in this study, sensitivity and specificity are calculated using a binary confusion matrix for each class, and weighted sensitivity and weighted specificity are obtained by taking the weighted average over the classes [38]. Hence, to create a binary confusion matrix for the diatom category, we can set diatoms to “positive” and the remaining categories (cyanophytes, chlorophytes, and others) to “negative”. The weighted sensitivity and weighted specificity are expressed in Equation (16), where k, p_k, Se_k, and Sp_k are the serial number of each category, the probability of being included in that category, and the sensitivity and specificity for that category, respectively.
Moreover, there is a trade-off between sensitivity and specificity, whereby one decreases as the other increases [39]. Therefore, we additionally defined the G-mean, which balances these two metrics. It was obtained by taking the square root of the product of the weighted sensitivity and the weighted specificity, as in Equation (17). We applied this measure because the measurement data are imbalanced toward the diatom category.
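The per-class binary decomposition behind Equations (16) and (17) can be implemented directly from a confusion matrix (an illustrative Python sketch; class_metrics is a hypothetical helper returning weighted sensitivity, weighted specificity, and the G-mean):

```python
import numpy as np

def class_metrics(conf):
    # conf[i, j]: count of observations with true class i predicted as class j.
    conf = np.asarray(conf, float)
    n = conf.sum()
    sens, spec, weights = [], [], []
    for k in range(conf.shape[0]):          # one-vs-rest binary matrix per class
        tp = conf[k, k]
        fn = conf[k].sum() - tp
        fp = conf[:, k].sum() - tp
        tn = n - tp - fn - fp
        sens.append(tp / (tp + fn))          # Se_k
        spec.append(tn / (tn + fp))          # Sp_k
        weights.append(conf[k].sum() / n)    # p_k: class prevalence
    w_sens = float(np.dot(weights, sens))
    w_spec = float(np.dot(weights, spec))
    g_mean = float(np.sqrt(w_sens * w_spec))
    return w_sens, w_spec, g_mean
```

Because the weights p_k come from the observed class frequencies, the frequent diatom category dominates the weighted averages, and the G-mean then penalizes any algorithm that trades specificity away to inflate sensitivity (or vice versa).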
3. Results
3.1. Data Analysis
3.1.1. Exploratory Data Analysis for Monitoring Data
The descriptive statistics of the variables for each sampling site are presented in Table 5. This provides an overview of the distributions of the measurement variables for each sampling site [40]. We also calculated the JB test p-value for each variable to determine the normality test results. To identify the overall distribution of each explanatory variable, seven descriptive statistics were calculated: mean, standard deviation, median, minimum, maximum, skewness, and kurtosis. Skewness has a positive value when the tail is long toward the right and a negative value when the tail is long toward the left. A kurtosis value > 0 indicates that the center of the distribution is sharply peaked, and a value < 0 suggests that the center of the distribution is flat [41]. According to Table 5, none of the measurement variables show a value of zero for skewness or kurtosis at any of the sampling sites.
Table 5.
Descriptive statistics of the water quality and hydraulic/hydrological parameters at each survey site.
Furthermore, for every variable except pH, the JB test p-value is well below the 0.05 significance level at all sampling sites. Hence, since normality is frequently violated, the Spearman correlation coefficient needed to be used instead of the Pearson correlation coefficient in the correlation analysis [42].
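The skewness, kurtosis, and JB statistic reported in Table 5 follow the standard moment formulas, JB = n/6 · (S² + (K − 3)²/4) with K the non-excess kurtosis (an illustrative Python sketch with a toy sample; the study computed these in R):

```python
import numpy as np

def skewness(x):
    # Third standardized moment: 0 for a symmetric sample.
    x = np.asarray(x, float)
    d = x - x.mean()
    return float((d ** 3).mean() / (d ** 2).mean() ** 1.5)

def kurtosis(x):
    # Fourth standardized moment (non-excess): 3 for a normal distribution.
    x = np.asarray(x, float)
    d = x - x.mean()
    return float((d ** 4).mean() / (d ** 2).mean() ** 2)

def jarque_bera(x):
    # JB statistic: large values indicate departure from normality.
    S, K = skewness(x), kurtosis(x)
    return len(x) / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)
```

Under normality, the JB statistic is asymptotically chi-squared with 2 degrees of freedom, so a large statistic yields the small p-values that ruled out the Pearson coefficient here.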
To better visualize the results, Figure 12 presents the boxplots of the parameters at each survey site. In general, the Tamjin Lake sampling sites have higher water quality parameter values and hydraulic/hydrological data values than the Juam Lake sampling sites. However, the turbidity and transparency values are higher at the Juam Lake sampling sites than those at Tamjin Lake, while DO and temperature show similar trends for the sampling sites of both lakes.
Figure 12.
Boxplot of the data obtained at the four sampling sites J1, J2, T1, and T2.
Table 6 presents a contingency table of the variable “Dominant Algae”, a categorical variable. The table indicates that diatoms are dominant at all sampling sites during the monitoring period, followed by chlorophytes, cyanophytes, and other algae.
Table 6.
Contingency table of the variable “Dominant Algae”.
3.1.2. Correlation Analysis and SOM Pattern Analysis
In Section 3.1.1, we confirmed that the Spearman correlation coefficient, a non-parametric measure of rank correlation, must be applied for the correlation analysis. Using this, we performed a correlation analysis for each sampling site; the results of which are shown in Figure 13. The figures for each sampling site show the calculated Spearman correlation coefficients. According to the results of the correlation analysis, there are variations in the results at each survey site; however, in general, the water quality parameters that are mutually related (BOD, COD, TN, TP, etc.) show positive correlations, while the water quality and hydraulic/hydrological variables show negative correlations. The pattern analysis of the SOMs supports these results, as shown in Figure 14, Figure 15, Figure 16 and Figure 17. This analysis helped identify the overall movement of the measurement variables at each survey site during the survey period. The water quality parameters that exhibit significant positive correlations in the correlation analysis show similar patterns, while the water quality and hydraulic/hydrological variables that exhibit significant negative correlations show opposite patterns. However, it should be noted that this study used time series data, which are measured over a certain period and are not independent. As such, calculating the normality test p-value for each time-dependent measurement variable and performing a correlation analysis and interpretation based on this have limitations [43].
Figure 13.
Correlation matrix showing Spearman’s correlation analysis of water quality parameters at survey sites (a) J1, (b) J2, (c) T1, and (d) T2. The numbers inside the boxes represent the Spearman correlation coefficients.
Figure 14.
Self-organizing map for sampling site J1.
Figure 15.
Self-organizing map for sampling site J2.
Figure 16.
Self-organizing map for sampling site T1.
Figure 17.
Self-organizing map for sampling site T2.
3.2. Comparison of the Performance of the Statistical Machine Learning Algorithms
This section presents the results of the analysis of the dominant algal classification accuracy using 11 statistical machine learning algorithms.
3.2.1. Tree-Based Algorithm for Assessing Variable Importance
In this study, the classification performance of five tree-based algorithms, namely random forest, bagging, AdaBoost, gradient boosting, and extreme gradient boosting, was compared. Each algorithm computes the importance of each variable to determine which explanatory variable has the most influence on the response variable [44]. Variable importance increases as the reduction in the Gini coefficient or the sum of squared errors increases. In extreme gradient boosting, variable importance is calculated using three measurement criteria: gain, cover, and frequency.
Figure 18 presents the graphs of the error calculated when applying the random forest algorithm based on the training data at each sampling site. The OOB (out-of-bag) error in the legend refers to the error obtained by using the remaining data not included in the sampling with replacement, which allows duplication, from the training data as validation data [45]. The other items in the legend indicate the probability of an incorrect answer calculated as the error for each category when the dominant algae are classified as either cyanophytes, diatoms, chlorophytes, or other algae. Figure 18 demonstrates that each error converges to a specific value as the number of tree models used in random forest increases. The probability of error is the lowest when probabilistically judging that the dominant algae are diatoms. This confirms that the most frequent time points during the survey period were those when diatoms dominated. Figure 19 presents the cross-validation tests conducted by extreme gradient boosting, where the point indicating the smallest mlogloss error value is deemed the best iteration. As illustrated in Figure 19, the mlogloss error value progressively decreases with each iteration for the training data, but it increases after a certain point for the test data, indicating overfitting [46]. Therefore, one of the advantages of extreme gradient boosting is that it reduces the risk of overfitting through cross-validation.
Figure 18.
Graphs representing error when using random forest for the sampling sites (a) J1, (b) J2, (c) T1, and (d) T2.
Figure 19.
Cross-validation tests when using extreme gradient boosting for the sampling sites (a) J1, (b) J2, (c) T1, and (d) T2.
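The OOB error shown in Figure 18 can be reproduced in miniature. The example below is a sketch with synthetic data and four artificial classes, not the study's monitoring series.

```python
# Sketch of the out-of-bag (OOB) error for a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
# Four synthetic classes standing in for cyanophytes, diatoms,
# chlorophytes, and other algae
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + 2 * (X[:, 2] > 0.5)

# Each tree is trained on a bootstrap sample (sampling with replacement);
# the rows left out of that sample are "out-of-bag" and act as
# validation data for that tree.
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB error: {1 - rf.oob_score_:.3f}")
```

Because the OOB rows were never seen by the corresponding trees, this error behaves like a built-in validation estimate and stabilizes as trees are added, which is the convergence visible in Figure 18.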
Using this process, the variable importance of each algorithm was calculated for the training data at each survey site, with the results shown in Table 7 and Table 8. The results show that variable importance varies depending on the survey site and algorithm. Overall, temperature and DO are more important than the other measured variables in classifying the dominant algae at a given time point for each survey site. This suggests that water temperature and dissolved oxygen are closely associated with the likelihood of algal occurrence.
Table 7.
Variable importance of explanatory variables for dominant algal classification (bagging, AdaBoost, gradient boosting, and random forest). The top three measurement variable values, based on variable importance for each survey site and algorithm, are bolded. In instances where identical values are present, both variables are bolded.
Table 8.
Variable importance of explanatory variables for dominant algal classification using extreme gradient boosting. The top three measurement variable values, based on variable importance for each survey site and algorithm, are bolded. In instances where identical values are present, both variables are bolded.
These results align with the findings of Woo et al. (2020), who reported that the amount of harmful cyanobacteria occurring at nine water supply source sites in the main stream of the Nakdong River in South Korea from 2012 to 2019 was highly correlated with water temperature and dissolved oxygen [47]. However, at the Tamjin Lake–Yuchi River confluence (T2) site, the variable importance of nutrient-related measurement variables, such as BOD, TN, and Chla, is relatively high, surpassing that of DO. In turn, the variable importance of EC is relatively high at the Tamjin Lake dam front (T1) site. This indicates that nutrients, such as nitrogen and phosphorus, have a more significant influence on algal growth at the Tamjin Lake site compared to the Juam Lake site.
3.2.2. Comparison of Algorithms Based on Four Criteria
To compare the dominant algal classification performance of the 11 statistical machine learning algorithms described in Section 2.3.2, we used the measurements at each survey site from 2017 to 2021 as the training data and the measurements from 2022 as the test data. Each algorithm was trained on the training data, and classification performance on the test data was compared in terms of accuracy, weighted sensitivity, weighted specificity, and G mean. Table 9 presents these four criteria for each algorithm based on the classification results by survey site. For each survey site, the value for the algorithm showing the best performance under each criterion is highlighted in bold.
Table 9.
Result of dominant algal classification using 11 statistical machine learning algorithms (values in bold represent the criterion for which each algorithm shows the best performance, at each of the four sites).
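For reference, the four criteria can be computed from a multiclass confusion matrix as sketched below. The labels are a toy example, and the class-frequency weighting used for the weighted sensitivity and specificity is our assumption about the study's definition.

```python
# Sketch: accuracy, weighted sensitivity, weighted specificity, and G mean
# from a multiclass confusion matrix (toy labels, 4 classes).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 3, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 3, 1, 2])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
support = cm.sum(axis=1)                # number of true cases per class
weights = support / support.sum()       # assumed class-frequency weights

tp = np.diag(cm)
fn = support - tp
fp = cm.sum(axis=0) - tp
tn = cm.sum() - (tp + fn + fp)

sens = tp / (tp + fn)                   # per-class sensitivity (recall)
spec = tn / (tn + fp)                   # per-class specificity

accuracy = tp.sum() / cm.sum()
w_sens = np.sum(weights * sens)
w_spec = np.sum(weights * spec)
g_mean = np.sqrt(w_sens * w_spec)       # geometric mean of the two

print(accuracy, w_sens, w_spec, g_mean)
```

Unlike accuracy, the G mean is pulled down whenever either weighted sensitivity or weighted specificity is poor, which is why it is better suited to the imbalanced, diatom-dominated data of this study.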
The results show that the optimal algorithm varies depending on the survey site and evaluation criteria. Moreover, our findings indicate that algorithms with complex structures and training processes do not always yield optimal performance; even simple algorithms can sometimes analyze the given data sufficiently well. The data used in this study are imbalanced, with diatoms being the dominant algae in most cases. It is therefore most appropriate to select the optimal algorithm based on the G mean, which combines weighted sensitivity and weighted specificity as their geometric mean, rather than on accuracy alone.
Accordingly, the best algorithms for classifying the dominant algae are as follows: decision tree for the Juam Lake dam front (J1) site, random forest for the Juam Lake Shinpyeong Bridge (J2) site, support vector machine for the Tamjin Lake dam front (T1) site, and gradient boosting for the Tamjin Lake–Yuchi River confluence (T2) site. That the best algorithm differs across survey sites suggests that their environmental characteristics also differ: the statistical and distributional characteristics of the variables measured at each site affect how each algorithm operates, for example through optimal parameter estimation, so the best-performing algorithm varies from site to site.
4. Discussion
In this study, we analyzed the dominant algae from 2017 to 2022 at various sites in Juam Lake and Tamjin Lake, which are representative water supply sources in the Yeongsan River and Seomjin River systems in South Korea. We also briefly examined the seasonal characteristics of the dominant algae. Additionally, water quality and hydraulic/hydrological parameters related to algal occurrence were collected based on water quality monitoring network data, algae alert system data, and hydraulic/hydrological data to construct the data needed for analysis. We then performed an exploratory data analysis, including correlation analysis and pattern analysis of the SOM for each measurement variable according to the four survey sites, to investigate the overall relationships between the variables and their distributional characteristics. Based on four algorithm evaluation criteria, we also examined the dominant algal classification accuracy of 11 statistical machine learning algorithms for each survey site.
Through evaluating the algorithms, we found that the best one differs for each survey site, indicating that the environmental characteristics of each site also differ. In contrast to previous studies [48,49], which mainly used traditional multivariate statistical techniques, such as principal component analysis (PCA) or clustering analysis (CA), to evaluate the environmental characteristics of a survey site, our study evaluated those characteristics using recent statistical machine learning algorithms. Regarding seasonality, chlorophytes or diatoms tended to dominate in spring, cyanophytes in early summer and summer, and chlorophytes and diatoms in autumn and early winter. These results are based on the monthly average number of cells of each algal type measured during the 2017–2022 survey period at the Juam Lake and Tamjin Lake sites.
Through an exploratory data analysis using correlation analysis and pattern analysis of the SOM of the monitoring data, we analyzed the water quality parameters and hydraulic/hydrological variables measured at the Juam Lake and Tamjin Lake sites from 2017 to 2022. This revealed that, overall, mutually related water quality parameters (BOD, COD, TN, TP, etc.) showed positive correlations, while the water quality variables and hydraulic/hydrological variables showed negative correlations.
Using the data from 2017 to 2022 at the Juam Lake and Tamjin Lake monitoring sites of this study, we identified the best algorithms for classifying dominant algae. Based on the G mean, the following algorithms yielded the best performance and were selected: decision tree for the Juam Lake dam front (J1) site, random forest for the Juam Lake Shinpyeong Bridge (J2) site, support vector machine for the Tamjin Lake dam front (T1) site, and gradient boosting for the Tamjin Lake–Yuchi River confluence (T2) site.
This study presents rigorous analyses of water quality data from four survey sites to predict the dominant algae using machine learning algorithms. However, the limited number of survey sites in our study may limit the generalizability of these findings to other water sources, especially those in very different environments. Future research should, therefore, explore the prediction of dominant algae across a larger number of investigation sites to obtain more universal results. This would facilitate the development of a way to evaluate generalized environmental characteristics of water quality. Overall, this study provides valuable insights into the use of statistical machine learning algorithms for water quality management, highlighting the need for further research in this area.
5. Conclusions
The results presented in Section 4 were based solely on data collected from the Juam Lake and Tamjin Lake sites. It is important to note that incorporating additional measurement variables, such as precipitation, and extending the survey period, or analyzing data from water supply sources outside of the Yeongsan River and Seomjin River system, may give different results. As the amount of data increases, so does the prior knowledge obtained, which can then be used to train the algorithms further. This iterative process can potentially improve algorithm performance. Additionally, different water systems have unique water quality and hydraulic/hydrological characteristics, meaning that even the same algorithms may produce varying results when applied to different water systems. Therefore, more research investigating and comparing a wide range of water source points is necessary. This research approach can support stakeholders and authorities to more accurately classify dominant algal occurrences and, thus, more efficiently manage the quality of important water sources.
Author Contributions
Conceptualization, S.-Y.H. and K.-Y.J.; methodology, S.-Y.H. and B.-W.C.; software, S.-Y.H. and K.-Y.J.; validation, S.-Y.H., B.-W.C. and K.-Y.J.; formal analysis, S.-Y.H.; investigation, H.-S.C., M.-S.S., C.-H.L., H.-M.C. and D.-W.H.; resources, H.-S.C. and B.-W.C.; data curation, S.-Y.H. and K.-Y.J.; writing—original draft preparation, S.-Y.H.; writing—review and editing, S.-Y.H. and K.-Y.J.; visualization, S.-Y.H. and K.-Y.J.; supervision, J.-H.P. and D.-S.S.; project administration, S.-Y.H. and D.-S.S.; funding acquisition, J.-H.P. and D.-S.S. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a grant from the National Institute of Environmental Research (NIER), funded by the Ministry of Environment (ME) of the Republic of Korea (NIER-2022-01-01-044).
Data Availability Statement
The datasets used and analyzed during the current study are available from the corresponding author upon request.
Acknowledgments
We would like to thank the reviewers for their comments.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
References
- Kim, S.G. Green algae and algae warning system. Water Future 2017, 50, 22–26.
- Kim, K.B.; Jung, M.K.; Tsang, Y.F.; Kwon, H.H. Stochastic modeling of chlorophyll-a for probabilistic assessment and monitoring of algae blooms in the Lower Nakdong River, South Korea. J. Hazard. Mater. 2020, 400, 123066.
- Srivastava, A.; Ahn, C.Y.; Asthana, R.K.; Lee, H.G.; Oh, H.M. Status, alert system, and prediction of cyanobacterial bloom in South Korea. Biomed. Res. Int. 2015, 2015, 584696.
- Falconer, I.R.; Humpage, A.R. Health risk assessment of cyanobacterial (blue-green algal) toxins in drinking water. Int. J. Environ. Res. Public Health 2005, 2, 43–50.
- Fleming, L.E.; Rivero, C.; Burns, J.; Williams, C.; Bean, J.A.; Shea, K.A.; Stinn, J. Blue green algal (cyanobacterial) toxins, surface drinking water, and liver cancer in Florida. Harmful Algae 2002, 1, 157–168.
- Kim, Y.H. Harmful Cyanobacterial Bloom and Application of Physical, Chemical and Biological Control Methods. Ph.D. Thesis, Hanyang University, Seoul, Republic of Korea, 2022.
- Joo, J.H. Field Application and Development of Biologically Derived Substances (BDSs) to Mitigate Freshwater Harmful Cyanobacterial Blooms. Ph.D. Thesis, Hanyang University, Seoul, Republic of Korea, 2017.
- Guillaume, M.C.; Dos Santos, F.B. Assessing and reducing phenotypic instability in cyanobacteria. Curr. Opin. Biotechnol. 2023, 80, 102899.
- Kim, H.G. Prediction of Chlorophyll-A in the Middle Reach of the Nakdong River at Maegok Using Artificial Neural Networks. Master's Thesis, Department of Integrated Biological Science, The Graduate School of Busan National University, Busan, Republic of Korea, 2017.
- Lee, S.M.; Park, K.D.; Kim, I.K. Comparison of machine learning algorithms for Chl-a prediction in the middle of Nakdong river (focusing on water quality and quantity factors). J. Korean Soc. Water Wastewater 2020, 34, 277–288.
- Bui, D.T.; Khosravi, K.; Tiefenbacher, J.; Nguyen, H.; Kazakis, N. Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Sci. Total Environ. 2020, 721, 137612.
- Caissie, D.; Satish, M.G.; El-Jabi, N. Predicting water temperatures using a deterministic model: Application on Miramichi River catchments (New Brunswick, Canada). J. Hydrol. 2007, 336, 303–315.
- Choi, D.H.; Jung, J.W.; Lee, K.S.; Choi, Y.J.; Yoon, K.S.; Cho, S.H.; Park, H.N.; Lim, B.J.; Chang, N.I. Estimation of pollutant load delivery ratio for flow duration using LQ equation from the Oenam-cheon watershed in Juam Lake. J. Environ. Sci. Int. 2012, 21, 31–39.
- Park, H.G.; Kang, D.W.; Shin, K.H.; Ock, G.Y. Tracing source and concentration of riverine organic carbon transporting from Tamjin River to Gangjin Bay, Korea. KJEE 2017, 50, 422–431.
- Seo, K.A.; Jung, S.J.; Park, J.H.; Hwang, K.S.; Lim, B.J. Relationships between the Characteristics of Algae Occurrence and Environmental Factors in Lake Juam, Korea. J. Korean Soc. Water Environ. 2013, 29, 317–328.
- Cox, V. Exploratory data analysis. In Translating Statistics to Make Decisions; Apress: Berkeley, CA, USA, 2017; pp. 47–74.
- Das, K.R.; Imon, A.H.M.R. A brief review of tests for normality. Am. J. Theor. Appl. Stat. 2016, 5, 5–12.
- Thadewald, T.; Büning, H. Jarque–Bera test and its competitors for testing normality—A power comparison. J. Appl. Stat. 2007, 34, 87–105.
- Kohonen, T. The self-organizing map. Proc. IEEE 1990, 78, 1464–1480.
- Jung, K.Y.; Cho, S.H.; Hwang, S.Y.; Lee, Y.J.; Kim, K.H.; Na, E.H. Identification of High-Priority Tributaries for Water Quality Management in Nakdong River Using Neural Networks and Grade Classification. Sustainability 2020, 12, 9149.
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013; Volume 112, p. 18.
- Sugiyama, M. Introduction to Statistical Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 2015.
- Park, K.Y.; Kim, J.W. A short guide to machine learning for economists. Korean J. Econ. 2019, 26, 367–408.
- Han, S.W. A Study on Kernel Ridge Regression Using Ensemble Method. Master's Thesis, Department of Statistics, The Graduate School of Hankuk University of Foreign Studies, Seoul, Republic of Korea, 2016.
- Hwang, S.Y. A Study on Efficiency of Kernel Ridge Logistic Regression Classification Using Ensemble Method. Master's Thesis, Department of Statistics, The Graduate School of Hankuk University of Foreign Studies, Seoul, Republic of Korea, 2017.
- Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning; Springer: Boston, MA, USA, 2012; pp. 157–175.
- Schapire, R.E. Explaining AdaBoost. In Empirical Inference; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52.
- Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K. xgboost: Extreme Gradient Boosting. R Package Version 0.4-2. 2015, pp. 1–4. Available online: https://cran.microsoft.com/snapshot/2017-12-11/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 10 January 2023).
- Izenman, A.J. Linear discriminant analysis. In Modern Multivariate Statistical Techniques; Springer: New York, NY, USA, 2013; pp. 237–280.
- Reynès, C.; Sabatier, R.; Molinari, N. Choice of B-splines with free parameters in the flexible discriminant analysis context. Comput. Stat. Data Anal. 2006, 51, 1765–1778.
- Schölkopf, B.; Smola, A.J.; Williamson, R.C.; Bartlett, P.L. New support vector algorithms. Neural Comput. 2000, 12, 1207–1245.
- Friedman, J.H. Regularized discriminant analysis. J. Am. Stat. Assoc. 1989, 84, 165–175.
- Pisner, D.A.; Schnyer, D.M. Support vector machine. In Machine Learning; Academic Press: Cambridge, MA, USA, 2020; pp. 101–121.
- Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15.
- Parikh, R.; Mathai, A.; Parikh, S.; Sekhar, G.C.; Thomas, R. Understanding and using sensitivity, specificity and predictive values. Indian J. Ophthalmol. 2008, 56, 45.
- Xu, J.; Zhang, Y.; Miao, D. Three-way confusion matrix for classification: A measure driven view. Inf. Sci. 2020, 507, 772–794.
- Li, D.L.; Shen, F.; Yin, Y.; Peng, J.X.; Chen, P.Y. Weighted Youden index and its two-independent-sample comparison based on weighted sensitivity and specificity. Chin. Med. J. 2013, 126, 1150–1154.
- Trevethan, R. Sensitivity, specificity, and predictive values: Foundations, pliabilities, and pitfalls in research and practice. Front. Public Health 2017, 5, 307.
- Jung, K.Y.; Lee, I.J.; Lee, K.L.; Cheon, S.U.; Hong, J.Y.; Ahn, J.M. Long-term trend analysis and exploratory data analysis of Geumho River based on seasonal Mann-Kendall test. J. Environ. Sci. Int. 2016, 25, 217–229.
- Blanca, M.J.; Arnau, J.; López-Montiel, D.; Bono, R.; Bendayan, R. Skewness and kurtosis in real data samples. Methodology 2013, 9, 78–84.
- De Winter, J.C.; Gosling, S.D.; Potter, J. Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychol. Methods 2016, 21, 273.
- Bai, J.; Ng, S. Tests for skewness, kurtosis, and normality for time series data. J. Bus. Econ. Stat. 2005, 23, 49–60.
- Gregorutti, B.; Michel, B.; Saint-Pierre, P. Correlation and variable importance in random forests. Stat. Comput. 2017, 27, 659–678.
- Genuer, R.; Poggi, J.M. Random forests. In Random Forests with R; Springer: Cham, Switzerland, 2020; pp. 33–55.
- Roelofs, R.; Shankar, V.; Recht, B.; Fridovich-Keil, S.; Hardt, M.; Miller, J.; Schmidt, L. A meta-analysis of overfitting in machine learning. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
- Woo, C.Y.; Yun, S.L.; Kim, S.G.; Lee, W.T. Occurrence of Harmful Blue-green Algae at Algae Alert System and Water Quality Forecast System Sites in Daegu and Gyeongsangbuk-do between 2012 and 2019. J. Korean Soc. Environ. Eng. 2020, 42, 664–673.
- Jung, K.Y.; Ahn, J.M.; Kim, K.; Lee, I.J.; Yang, D.S. Evaluation of water quality characteristics and water quality improvement grade classification of Geumho River tributaries. J. Environ. Sci. Int. 2016, 25, 767–787.
- Sun, X.; Zhang, H.; Zhong, M.; Wang, Z.; Liang, X.; Huang, T.; Huang, H. Analyses on the temporal and spatial characteristics of water quality in a seagoing river using multivariate statistical techniques: A case study in the Duliujian River, China. Int. J. Environ. Res. Public Health 2019, 16, 1020.
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).