Using Machine Learning Models for Predicting the Water Quality Index in the La Buong River, Vietnam

Abstract: For effective management of water quantity and quality, it is essential to estimate the pollution level of the existing surface water. This case study aims to evaluate the performance of twelve machine learning (ML) models, including five boosting-based algorithms (adaptive boosting, gradient boosting, histogram-based gradient boosting, light gradient boosting, and extreme gradient boosting), three decision tree-based algorithms (decision tree, extra trees, and random forest), and four ANN-based algorithms (multilayer perceptron, radial basis function, deep feed-forward neural network, and convolutional neural network), in estimating the surface water quality of the La Buong River in Vietnam. Water quality data at four monitoring stations alongside the La Buong River for the period 2010–2017 were utilized to calculate the water quality index (WQI). Prediction performance of the ML models was evaluated by using two efficiency statistics (i.e., R² and RMSE). The results indicated that all twelve ML models have good performance in predicting the WQI but that extreme gradient boosting (XGBoost) has the best performance with the highest accuracy (R² = 0.989 and RMSE = 0.107). The findings strengthen the argument that ML models, especially XGBoost, may be employed for WQI prediction with a high level of accuracy, which will further improve water quality management.


Introduction
Surface water in rivers is a fundamental freshwater source, which plays an essential role in socio-economic development and the environment [1]. However, surface water bodies are under severe pressure because of intensive human activities, such as industrialization, urbanization, and population growth [2,3]. Additionally, poor management of water quantity and quality, together with climate change, has reduced water quality during the past few decades, leading to surface-water pollution [4,5]. Therefore, the evaluation and estimation of the water quality level in rivers are of great concern today.
The water quality index (WQI) has been extensively used to assess and classify surface water and groundwater quality. This index, introduced by Brown et al. [6], is computed from the physicochemical parameters of the water (e.g., temperature, pH, turbidity, dissolved oxygen (DO), biochemical oxygen demand (BOD), and concentrations of other pollutants) to estimate the level of water quality. The WQI provides quantitatively meaningful information to decision makers and planners for water resources management. However, the WQI formulations consist of lengthy calculations and thus require a lot of time and effort [5]. Additionally, the WQI formulations are inconsistent, as they usually utilize different equations [7]. Accordingly, to deal with these issues, an alternative approach for computationally efficient and accurate estimation of the WQI is needed.
In recent years, machine learning (ML) techniques have been extensively used for river water quality assessment, including WQI estimation [8]. These techniques have proved to be powerful tools for modeling complex non-linear behaviors in water-resource research [9]. Our literature review demonstrates that each ML algorithm has its strengths and shortcomings, and its behavior depends on the input water quality variables in the different study regions. Regarding the simulation and prediction of water quality, the capability of adaptive boosting (AdaBoost) [10], gradient boosting (GBM) [11], extreme gradient boosting (XGBoost) [12], decision tree (DT) [13,14], extra trees (ExT) [4], random forest (RF) [10,15], multilayer perceptron (MLP) [16], radial basis function (RBF) [17], deep feed-forward neural network (DFNN) [18], and convolutional neural network (CNN) [19] has been reported. Although there are many ML algorithms, researchers still face the problem of deciding which ML technique is most appropriate for a specific problem.
In Vietnam, the WQI proposed by the Ministry of Natural Resources and Environment (MONRE) [20] requires lengthy calculations and consequently demands a lot of time and effort. However, to the best of our knowledge, no study on the use of machine learning techniques in predicting the WQI has been conducted in Vietnam. Therefore, the present study aimed to assess the performance of twelve ML algorithms, consisting of five boosting-based algorithms (AdaBoost, GBM, histogram-based gradient boosting (HGBM), light gradient boosting (LightGBM), and XGBoost), three decision tree-based algorithms (DT, ExT, and RF), and four ANN-based algorithms (MLP, RBF, DFNN, and CNN), in predicting the WQI of the La Buong River in Vietnam. The La Buong River is one of the important rivers that supply water for domestic, agricultural, and industrial uses in the southern key economic region of Vietnam.

Study Area
The La Buong River (10°45′–11°00′ N, 106°50′–107°15′ E), a tributary of the Dong Nai River, has a length of approximately 56 km and a basin area of 475.8 km² (Figure 1). The La Buong River Basin is located in the western part of the Dong Nai province in the southern key economic region of Vietnam. The topography of the basin ranges from 10 to 385 m above sea level. The basin has a tropical monsoon climate with two different seasons: a 6-month rainy season, lasting from May to October, and a 6-month dry season, lasting from November to April. The average annual temperature was 25.4 °C, the average annual rainfall was 1786 mm, and the average annual streamflow was 7.1 m³/s in the period 1981–2015 [21]. Rhodic Ferralsols and Ferric Acrisols are the main soils of the basin (accounting for approximately 75% of the basin area). More than 80% of land in the basin is utilized for agricultural development (cashew, coffee, and rubber). The La Buong River Basin is heavily influenced by cropping activities and livestock in the upper basin and industrial activities in the lower basin. Urbanization and industrial development are predicted to rise in the coming years [22].

Data Collection and Processing
Eight years (2010 to 2017) of bimonthly WQ data at four WQ monitoring stations alongside the La Buong River (Figure 1) were collected from the Dong Nai Department of Natural Resources and Environment. The measured WQ data consisted of ten variables: temperature (T), pH, DO, BOD, COD, turbidity (TUR), total suspended solids (TSS), coliform, ammonium (NH₄⁺), and phosphate (PO₄³⁻). Sampling, preservation, storage, and analysis procedures followed the national guidelines for monitoring surface water.
In the current study, the ten WQ variables were utilized to compute the WQI based on Decision No. 879/QD-TCMT, issued by the Ministry of Natural Resources and Environment (MONRE) of Vietnam [20]. The WQI is expressed as follows:
$$\mathrm{WQI} = \frac{\mathrm{WQI}_{pH}}{100}\left[\frac{1}{5}\sum_{a=1}^{5}\mathrm{WQI}_{a} \times \frac{1}{2}\sum_{b=1}^{2}\mathrm{WQI}_{b} \times \mathrm{WQI}_{c}\right]^{1/3}$$
where WQI_a is the WQI value for the chemical variables (DO, BOD, COD, NH₄⁺, and PO₄³⁻), WQI_b is the WQI value for the physical variables (TSS and TUR), WQI_c is the WQI value for the biological variable (coliform), and WQI_pH is the WQI value for pH.
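The aggregation step can be illustrated with a minimal sketch. It assumes that the sub-index values (each on a 0–100 scale) have already been derived from the measured concentrations following the MONRE guideline [20], and that the aggregate takes the form of the pH sub-index acting as a multiplicative weight on the cube root of the product of the group means. The function name `aggregate_wqi` and its argument layout are hypothetical and are not taken from the guideline.

```python
def aggregate_wqi(wqi_a, wqi_b, wqi_c, wqi_ph):
    """Combine sub-indices (each on a 0-100 scale) into a single WQI value.

    wqi_a  : five chemical sub-indices (DO, BOD, COD, NH4+, PO4 3-)
    wqi_b  : two physical sub-indices (TSS, TUR)
    wqi_c  : biological sub-index (coliform)
    wqi_ph : pH sub-index
    The aggregate form is an assumption stated in the accompanying text.
    """
    if len(wqi_a) != 5 or len(wqi_b) != 2:
        raise ValueError("expected 5 chemical and 2 physical sub-indices")
    mean_a = sum(wqi_a) / 5  # mean of the chemical group
    mean_b = sum(wqi_b) / 2  # mean of the physical group
    # pH weight times the cube root of the product of the three groups
    return wqi_ph / 100 * (mean_a * mean_b * wqi_c) ** (1 / 3)
```

Because the three groups are multiplied, a sub-index of zero in any one group (for example, coliform) drives the whole index to zero regardless of the other variables.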
Based on the WQI values, the river water quality is classified into five levels: excellent (WQI = 91–100), good (WQI = 76–90), fair (WQI = 51–75), poor (WQI = 26–50), and very poor (WQI = 0–25). Full details on the guideline for calculating WQI can be found in MONRE [20]. The descriptive statistics of the WQ variables and WQI are presented in Table 1. The TSS, TUR, and coliform concentrations varied considerably, with high coefficient of variation (CV) values of 153.9% for TSS, 158.4% for TUR, and 343.2% for coliform. The large variation in these variables can be explained by the sources (point source and nonpoint source) and nature of the pollution [23]. Furthermore, the differences can be associated with seasonal effects of hydro-climatic conditions in the study area. Additionally, the WQI values indicated that the water quality of the La Buong River varies from very poor (WQI = 3.02) to excellent (WQI = 98.30).
The La Buong River WQ data were divided into two parts: 70% for the training process and 30% for the testing process. This ratio is widely used in data-driven modeling [1,7]. To improve the training speed and predictive accuracy of the ML models, the WQ data were normalized to a 0–1 range before the modeling process using the following equation:
$$x'_{i} = \frac{x_{i} - x_{\min}}{x_{\max} - x_{\min}}$$
where x′_i and x_i are the normalized and original values of a WQ variable (i.e., pH, DO, BOD, etc.) at a station, and x_min and x_max are the minimum and maximum values of that variable, respectively.
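The preprocessing described here (min-max scaling to the 0–1 range and a 70/30 division) can be sketched as follows. The text does not state whether the division was random or chronological, nor over which subset the minimum and maximum were taken; the sketch assumes a seeded random shuffle and scaling over the values supplied. The function names are hypothetical.

```python
import random

def min_max_normalize(values):
    """Scale a list of numbers to the 0-1 range using its own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle row indices with a fixed seed and cut at train_fraction.

    The random shuffle is an assumption; the source only gives the ratio.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test
```

With the 220 samples reported in Table 1, this split would give 154 training samples and 66 testing samples.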

Machine Learning Models
As mentioned above, the current study utilized twelve ML models for predicting WQI, grouped into three major families: boosting-based algorithms, decision tree-based algorithms, and ANN-based algorithms.

Boosting-Based Algorithms
Boosting is an ensemble meta-algorithm that aims to improve the predictive performance of several weaker learners, primarily by reducing bias and variance in supervised learning problems [24]. The basic principle of the boosting method is to create a model from the training data, and then to build a second model on top of the previous one that reduces the bias error arising where the first model could not infer the relevant patterns in the given data. Every time a new learner is added, the weights of the data are readjusted, which is also known as "re-weighting". These models are added sequentially until the training data are reasonably predicted or the maximum number of learners has been added to the ensemble model [25]. Five types of boosting-based algorithms were utilized in the current study, including adaptive boosting (AdaBoost), gradient boosting (GBM), histogram-based gradient boosting (HGBM), light gradient boosting (LightGBM), and extreme gradient boosting (XGBoost). Full details on these boosting-based algorithms can be found in Wu et al. [26].
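The sequential principle can be illustrated with a small, self-contained sketch. Note that it is the residual-fitting (gradient boosting) variant for squared error rather than the re-weighting scheme of AdaBoost, it uses one-split "stumps" on a single input variable as weak learners, and it is not the implementation used in this study. All function names, the learning rate, and the number of learners are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct input values")
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_learners=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # first model: the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_learners):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

Each added stump corrects part of what the ensemble so far gets wrong, so the training error shrinks as learners accumulate; the learning rate controls how much of each correction is accepted.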

Decision Tree-Based Algorithms
The decision tree and its many variants are another type of learning algorithm that divides the input space into regions and has separate parameters for each region [27]. They are non-parametric supervised learning methods that are widely applied for classification and regression and that represent decisions and decision making visually and explicitly. The typical structure of a decision tree is a tree-like flowchart, as the name suggests, in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules. In the present study, three decision tree-based models were assessed with respect to different learning algorithms, including decision tree (DT), extra trees (ExT), and random forest (RF). Full details on these decision tree-based algorithms can be found in Ahmad et al. [28].
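The node-test-leaf structure can be sketched for the regression case, which is the setting relevant to WQI prediction: each internal node tests one attribute against a threshold, and each leaf stores the mean target of the training samples that reach it (rather than a class label). The sketch below is illustrative only; the function names, the nested-dictionary representation, and the depth limit are assumptions.

```python
def _sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def build_tree(rows, targets, depth=0, max_depth=2):
    """Recursively split on the (feature, threshold) pair with the lowest error."""
    mean = sum(targets) / len(targets)
    if depth >= max_depth or len(targets) < 2:
        return {"leaf": mean}
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            error = _sse(left) + _sse(right)
            if best is None or error < best[0]:
                best = (error, feature, threshold)
    if best is None:  # every feature is constant: nothing to split on
        return {"leaf": mean}
    _, feature, threshold = best
    left_idx = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right_idx = [i for i, row in enumerate(rows) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [targets[i] for i in left_idx], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right_idx],
                            [targets[i] for i in right_idx], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the tests from the root until a leaf is reached."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]
```

Extra trees and random forest build many such trees on randomized versions of the problem and average their outputs; that ensemble step is not shown here.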

ANN-Based Algorithms
In recent decades, AI-based models have developed considerably, reaching state-of-the-art architectures that combine a number of learning algorithms and modern computational structures, and they have been applied across various aspects of river water quality modeling [8]. ANN-based models have recently gained popularity due to their robustness and capability to handle nonlinear data, whether with a typical structure (a single hidden layer) or an advanced structure (multiple hidden layers). Basically, an ANN includes three layers: input, hidden, and output layers. As the complexity of the problem increases, the number of layers rises and, consequently, so do the computational resources required. In this study, both of the mentioned structures of ANN-based models were utilized for predicting WQI, namely multilayer perceptron (MLP), radial basis function (RBF), deep feed-forward neural network (DFNN), and convolutional neural network (CNN). Full details on these ANN-based algorithms can be found in Tiyasha et al. [8] and Tahmasebi et al. [29].
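The input-hidden-output structure can be made concrete with a forward pass through a single-hidden-layer network. This is a sketch only: the ReLU activation, the layer sizes, and the function names are assumptions, and training (weight fitting) is omitted.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]

def mlp_forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (ReLU) -> single linear output."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)[0]
```

A deeper feed-forward network repeats the hidden step several times, which is why the number of parameters, and hence the computational cost, grows with the number of layers.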

Construction of ML Models
As a first important step in constructing the ML model, the input variables must be selected so that a sufficient number of variables carrying enough underlying information is available to predict WQI. Moreover, this selection can improve model accuracy by excluding variables that would degrade predictive performance. In the current study, ten WQ variables were identified as potential inputs. There are several existing methods to assess the input combinations, including the autocorrelation function, partial autocorrelation function, cross-correlation function, and correlation coefficient. Among these techniques, the correlation coefficient was selected for the current study because it is efficient and straightforward [4].
Table 2 shows that the WQ variable with the highest value of R² was coliform, followed by TSS, TUR, COD, BOD, PO₄³⁻, NH₄⁺, pH, DO, and T. It is noteworthy that coliform, TUR, and TSS had the highest correlations with WQI due to the impacts of cropping and livestock activities on water quality in the La Buong River. Based on the correlations of the ten WQ variables with WQI, ten input variable combinations are listed in Table 3.
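The ranking step can be sketched as follows, assuming that the R² reported for each variable is the squared Pearson correlation coefficient between that variable and the WQI (the text does not define it further). The function names are hypothetical, and any data passed in would have to come from the monitoring records; none are reproduced here.

```python
def r_squared_with_target(xs, ys):
    """Squared Pearson correlation between one variable and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov * cov / (var_x * var_y)

def rank_variables(columns, target):
    """Return (name, R^2) pairs sorted from strongest to weakest association."""
    scores = {name: r_squared_with_target(values, target)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A ranked list of this kind is the natural starting point for assembling input combinations such as those in Table 3, although the exact composition of the scenarios is given only in that table.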
After selecting the input WQ variables, the fitted values of the model parameters for each ML model were determined using a "trial and error" technique [23]. With the twelve ML models and ten scenarios of input variable combinations, 120 ML models for predicting the WQI were built during the training process, and the performance of these models was evaluated during the testing process [7]. In the present study, the scikit-learn library, a Python-based package, was utilized to develop the twelve ML models for predicting the WQI.

Performance Evaluation of ML Models
In the current study, two model efficiency statistics, namely, the root mean square error (RMSE) and the coefficient of determination (R²), were utilized to evaluate the goodness of fit between the predictions and observations. RMSE measures the deviation between the observed and predicted values, and R² measures the degree of correlation between the observed and predicted data [30].
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_{i} - P_{i}\right)^{2}}$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(O_{i} - P_{i}\right)^{2}}{\sum_{i=1}^{n}\left(O_{i} - \bar{O}\right)^{2}}$$
where n is the total number of predicted values, O_i is the observed value, Ō is the mean of the observed values, and P_i is the predicted value.
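Both statistics can be computed in a few lines. The sketch below assumes that R² takes the form 1 − SS_res/SS_tot, which uses only the quantities listed in the definitions (n, O_i, the mean of the observations, and P_i); if the squared correlation coefficient was intended instead, the values would differ whenever predictions are biased. The function names are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between paired observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r2(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

A lower RMSE and an R² closer to 1 both indicate a better fit, which is how the models are compared in the following sections.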

Performance Evaluation of Boosting-Based Models
Table 4 presents the model performance of the boosting-based algorithms during the testing process. Results showed that AdaBoost-S2 (R² = 0.973 and RMSE = 0.175) had the highest performance in predicting WQI among the AdaBoost models, GBM-S7 (R² = 0.989 and RMSE = 0.108) had the highest performance among the GBM models, HGBM-S2 (R² = 0.967 and RMSE = 0.183) had the highest performance among the HGBM models, LightGBM-S6 (R² = 0.986 and RMSE = 0.119) had the highest performance among the LightGBM models, and XGBoost-S9 (R² = 0.989 and RMSE = 0.107) had the highest performance among the XGBoost models under the S1–S10 scenarios. Additionally, the comparison plots of the measured WQI values with the WQI values predicted by AdaBoost-S2, GBM-S7, HGBM-S2, LightGBM-S6, and XGBoost-S9 in the testing period are shown in Figure 2. Generally, these models replicated the measured WQI very well during the testing period. However, there are small discrepancies between the measured and predicted values at high or low WQI (especially for AdaBoost-S2 and HGBM-S2). On the whole, the comparison between the boosting-based models under the S1–S10 scenarios shows XGBoost-S9 to be the best-performing model.

Performance Evaluation of Decision Tree-Based Models
Table 4 also presents the model performance of the decision tree-based algorithms during the testing process. The results indicated that DT-S5 (R² = 0.979 and RMSE = 0.147), ExT-S5 (R² = 0.985 and RMSE = 0.126), and RF-S5 (R² = 0.986 and RMSE = 0.121) had the highest performance in predicting WQI among the DT models, ExT models, and RF models under the S1–S10 scenarios, respectively. Figure 3 displays the comparisons between the predicted and measured WQI for the DT-S5, ExT-S5, and RF-S5 models during the testing period. In general, all three models reproduced the measured WQI well, with only small differences between the measured and predicted values at high or low WQI. Among the decision tree-based models, RF-S5 gave the most accurate prediction.

Performance Evaluation of ANN-Based Models
According to the model performance of the ANN-based algorithms during the testing period (Table 4), MLP-S4 (R² = 0.984 and RMSE = 0.132), RBF-S2 (R² = 0.887 and RMSE = 0.360), DFNN-S2 (R² = 0.973 and RMSE = 0.162), and CNN-S7 (R² = 0.982 and RMSE = 0.139) are the best models for predicting WQI among the MLP models, RBF models, DFNN models, and CNN models under the S1–S10 scenarios, respectively. Figure 4 illustrates the comparisons between the predicted and measured WQI for the MLP-S4, RBF-S2, DFNN-S2, and CNN-S7 models during the testing period. Generally, these four models reproduced the measured WQI well during the testing period. Small differences between the measured and predicted values at high or low WQI can be observed for most models; the exception is RBF-S2, which shows a considerable discrepancy. Among the ANN-based models, MLP-S4 gave the most accurate prediction (R² = 0.984 and RMSE = 0.132).
Water 2022, 14

Discussion
A comparison of twelve ML models, including five boosting-based algorithms (AdaBoost, GBM, HGBM, LightGBM, and XGBoost), three decision tree-based algorithms (DT, ExT, and RF), and four ANN-based algorithms (MLP, RBF, DFNN, and CNN), was conducted to evaluate their performance in predicting the WQI based on the model efficiency statistics. Our findings indicate that all ML models could predict the WQI well for this study area, but that the best scenario of input variables differs between models. This can be explained by the fact that each ML algorithm responds in a different way to different input variables and data patterns [31]. As reported by Morton and Henderson [32] and Yang and Moyer [33], water quality data are characterized by a nonlinear distribution. In general, AdaBoost, HGBM, RBF, and DFNN achieved their best results under the S2 scenario of the input variables; DT, ExT, and RF achieved their best results under the S5 scenario; and GBM and CNN achieved their best results under the S7 scenario. In addition, MLP, LightGBM, and XGBoost performed best in Scenarios S4, S6, and S9, respectively. These findings indicate that the most accurate prediction depends on the ML model parameters for the given scenario of input variables, which is consistent with the results of Hussain and Khan [31].
The comparison of all twelve ML models indicated that the XGBoost model outperforms the other ML models in the study area. Other studies report different rankings: DFNN performed better than XGBoost, MLP, and RF in the Mahanadi River Basin in India [5]. Asadollah et al. [4] indicated that ExT is superior to DT and support vector regression (SVR) in the Lam Tsuen River in Hong Kong. Moreover, DT performed better than the MLP model in the Rawal Dam lake in Pakistan [14]. In general, different ML algorithms give different performance when applied to different regions. Therefore, developing a generalized ML model for water quality assessment remains an open challenge.
As stated in previous studies, an important gap is the lack of consideration of cross-influences between the explanatory variables, namely, the cross-correlation between land-use classes and the cross-correlation between climate conditions in influencing river water quality [34–36]. Land-use change and climate change affect hydrological components, and consequently river discharge and pollutant transport [21]. Therefore, it is essential to take land-use and climate changes into account, which may improve the accuracy of the ML models.

Conclusions
This research work was conducted to investigate the capability of twelve ML models, namely, five boosting-based algorithms (AdaBoost, GBM, HGBM, LightGBM, and XGBoost), three decision tree-based algorithms (DT, ExT, and RF), and four ANN-based algorithms (MLP, RBF, DFNN, and CNN), in predicting the WQI. The four WQ monitoring stations alongside the La Buong River were considered as a case study. Two model efficiency statistics (i.e., R² and RMSE) were chosen for performance comparison of the different ML models. XGBoost achieved an R² of 0.989 and an RMSE of 0.107 in the testing process, making it the most appropriate ML algorithm in the study area. It was followed by GBM, LightGBM, RF, ExT, MLP, CNN, DT, DFNN, AdaBoost, HGBM, and RBF. Generally, our findings strengthen the argument that ML models, particularly XGBoost, can be utilized for predicting the WQI with a high degree of accuracy, which will further improve water quality management.
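The stated ordering can be checked directly from the testing-period statistics reported for the best scenario of each algorithm. The values below are copied from the results sections; sorting by ascending RMSE is an assumption about how the ranking was derived, and it reproduces the order given above.

```python
# (R^2, RMSE) of the best scenario per algorithm, as reported in the text.
reported = {
    "AdaBoost-S2": (0.973, 0.175),
    "GBM-S7": (0.989, 0.108),
    "HGBM-S2": (0.967, 0.183),
    "LightGBM-S6": (0.986, 0.119),
    "XGBoost-S9": (0.989, 0.107),
    "DT-S5": (0.979, 0.147),
    "ExT-S5": (0.985, 0.126),
    "RF-S5": (0.986, 0.121),
    "MLP-S4": (0.984, 0.132),
    "RBF-S2": (0.887, 0.360),
    "DFNN-S2": (0.973, 0.162),
    "CNN-S7": (0.982, 0.139),
}

# Rank from lowest to highest RMSE (lower is better).
ranking = sorted(reported, key=lambda name: reported[name][1])
algorithms = [name.split("-")[0] for name in ranking]
```

The gap between the top two entries is small (an RMSE of 0.107 for XGBoost-S9 against 0.108 for GBM-S7, with identical R²), so the leading position of XGBoost rests on a narrow margin.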

Figure 1. The La Buong River and location of the WQ monitoring stations.

Figure 3. Temporal variation in the observed and predicted WQI values for the best performance models using decision tree-based algorithms during the testing period. (a) DT-S5. (b) ExT-S5. (c) RF-S5.

Table 1. Descriptive statistics of the observed WQ variables and WQI in the La Buong River during 2010–2017 (n = 220).

Table 2. Coefficient of determination (R²) between the ten WQ variables and WQI.

Table 3. Scenarios of input variables for the current study.

Table 4. Efficiency statistics of the 12 ML models under the 10 scenarios of input variable combinations during the testing process.
