Assessment of Water Quality Target Attainment and Influencing Factors Using the Multivariate Log-Linear Model in the Nakdong River Basin, Republic of Korea

Abstract: Because identifying the factors affecting water quality is challenging, water quality assessment of an individual component based on the arithmetic mean method cannot adequately support management policies. Therefore, in this study, we assessed the water quality target attainment at 24 sites in the Nakdong River Basin by applying multivariate log-linear models to identify factors influencing water quality, including flow and season. The temporal and seasonal water quality trends and their dependence on flow were also analyzed using the calculated model coefficients. Specifically, weekly data on biological oxygen demand (BOD), total phosphorus (TP), and flow during 2013–2018 were used to investigate the 2018 water quality target attainment level for this river. The significance and suitability of the models were analyzed using the F-test, root mean squared error (RMSE), mean absolute percent error (MAPE), and adjusted R² values. All 24 models applied in this study showed statistical significance and suitability for the prediction of BOD and TP concentrations. Moreover, flow was identified as the main factor affecting water quality and had a predominant effect on BOD and TP concentrations in tributaries and the main stream, respectively. Furthermore, among the 24 sites, BOD and TP targets were attained at 18 and 17 sites, respectively.


Introduction
Presently, to assess the quality of public waters, the Ministry of Environment (MOE) in the Republic of Korea each year uses the arithmetic means of biological oxygen demand (BOD) and total phosphorus (TP) concentrations, based on annual data, to calculate water quality targets and determine the extent to which these targets are being attained. Additionally, the analysis of long-term data on water quality trends is used to identify the effects of national water quality management policies on the quality of public waters [1].
Water quality assessment based on the arithmetic mean method is widely used and easy to apply. However, it has limited applicability in assessing water quality due to seasonality, non-normality of data distribution, and ambiguous decision criteria for missing data and outliers. Further, during water quality assessment, we must account for the changes in water quality due to external environmental factors, such as abnormal climate phenomena (e.g., intense heat, extreme cold, droughts, and floods) and changes in flow owing to the presence of artificial structures, such as dams and weirs.
In particular, water quality is related to flow [2][3][4], which may be an important factor influencing changes in water quality [4]. In the Republic of Korea, simple log-linear models as well as multivariate log-linear models [5] have been applied to clarify the relationship between water flow and quality in the South Han River, North Han River, and Gyeongan Stream, which flow into Lake Paldang; the suitability of these models has been compared, and the water quality trend has been analyzed [6]. Additionally, various studies have been conducted to calculate loads, estimate parameters, and predict suspended particle loads using the LOADEST multivariate regression model, which is based on log-linear models [7][8][9]. Regarding studies on water quality assessment based on other methods, the attainment of water quality targets has been assessed using the arithmetic assessment method, which only uses total organic carbon (TOC) data [10]. Further, the exceedance of water quality targets has also been assessed by comparing the loads calculated for each section using the load duration curve [11][12][13][14][15]. In other countries, more detailed studies have also been conducted to assess water quality and loads using multivariate statistical models [5,16] that can explain the influence of different factors that contribute to water quality changes, including flow fluctuation, time, and season [17][18][19]. Specifically, Hirsch (2014) [20] investigated the negative effect of bias in large datasets on constituents such as dissolved nitrate and TP using the five-parameter LOADEST model (L5), the seven-parameter LOADEST model (L7), and the weighted regressions on time, discharge, and season (WRTDS) method, which are based on the multivariate regression model developed by Cohn et al. (1992) [5]. Furthermore, in 2019, the U.S. Geological Survey (USGS) analyzed the suspended sediment, total nitrogen, and TP concentrations as well as loads of the Kankakee River at Shelby, Indiana (USA), using regression analysis [21]. In Japan, the 75th percentile value is used when evaluating the attainment of targets for organic substances; if this value satisfies the standard, the target is considered to have been achieved [22].
However, in the Republic of Korea, water quality assessment methods that consider environmental factors, such as flow and season, have not yet been actively applied, implying that new water quality assessment methods that consider such parameters are required. Therefore, in this study, the attainment of water quality targets was determined by applying multivariate log-linear models that consider water quality, flow, and season to those representative sites in the sub-basins of the Nakdong River for which corresponding flow data were available. The water quality trend as well as its dependence on flow as a function of time and season were also assessed using the model coefficient values. The results of this study contribute to the development of specific water quality management policies that can be applied to manage the factors influencing water quality in the future. It is also expected that the method proposed in this study can serve as an objective assessment method for identifying the effects of water quality management policies on the attainment of water quality targets.

Materials and Methods
This study was conducted in three steps as shown in Figure 1. First, water quality, stream flow, and dam hydrologic data corresponding to the 2013-2018 period were collected at 24 sites in the Nakdong River sub-basin. Second, a multivariate log-linear model was applied to calculate the regression coefficients of seven parameters, and the F-test was used to validate the statistical significance of the model using a statistical analysis program, SPSS (IBM SPSS, USA; version 18.0). Third, the water quality trend as well as the main water quality influencing factors were identified using the calculated multivariate log-linear model, and the accuracy of the calculation was verified by considering the difference between the predicted and observed values. Finally, the reference flow was set from 2013 to 2016, and the attainment of the 2018 water quality target was evaluated using the models that showed statistical significance.

Description of the Study Area
The Nakdong River (Figure 2), which is one of the four major rivers in the Republic of Korea, serves as an important water resource for major cities, such as Daegu, Ulsan, and Busan. It is 525 km long, and its watershed area is 23,817 km² (latitude 35-37° N, longitude 124-131° E). Water from the Nakdong River basin supports a total population of over 13 million and is used for various purposes, including industrial and residential purposes (21.6%), agricultural purposes (51.0%), and flow maintenance (27.4%). Within the last few decades, rapid population growth coupled with industrial and urban development has resulted in the deterioration of the quality of water in this river. This is because contaminants, such as organic and inorganic materials, nitrates, and phosphates, are constantly introduced into this river [23]. Additionally, within the 2010-2011 period, eight weirs that are approximately 14-44 km apart were constructed along this river to manage water resources and control water flow [24]. The first of these, the Sangju weir, is located upstream, while the Changnyeong Haman weir is located downstream. The average annual precipitation in the central basin of this river (the Busan weather station) between 2004 and 2018 was 1050 mm, with more than half (57.5%) corresponding to the summer season, which is characterized by the highest temperatures, humidity levels, and evaporation rates [25]. Further, the Nakdong River basin occupies approximately 24% of the Korean land, and approximately 17.3% (5505 km²) of the total basin area is used for agricultural purposes. In terms of land cover, forests account for 68.6%, while wet paddy fields, dry paddy fields, urban areas, and others occupy 10.4%, 6.9%, 6.7%, and 7.4%, respectively [26]. The geological strata in the Nakdong River basin predominantly consist of sedimentary rocks, while metamorphic and igneous rocks are sparsely distributed.
Of the 33 midwatershed representative sites along this river, 24 sites, with both water quality and quantity data, were selected for the assessment of the attainment of the 2018 water quality target using multivariate log-linear regression models.

Method of Data Collection
Three parameters, namely, BOD, TP, and stream flow, were selected for the analysis and assessment of the attainment of the 2018 water quality target. Data on these parameters for 2013-2018, collected on a weekly basis at the 24 representative sites considered in this study, were obtained from the Water Environment Information System (WEIS) [27], which is the largest national water quality database in the Republic of Korea. Further, daily dam hydrologic data were obtained from two databases, the national Water Resources Management Information System (WAMIS) [28] and My Water System [29].
Regarding flow data, K-water observation data were used for the Andong1, Sangju2, Sangok, Dalseong, Hwanggang1-1, and Samrangjin sites, while MOE flow observation data were used for the other sites. For analysis, the flow data were coupled with water quality data based on the water quality data collection day.

Description of Multivariate Log-Linear Model
The multivariate statistical regression model proposed by Cohn et al. (1992) [5] was used to estimate concentrations and loads on the basis of trend, discharge, and seasonality.
The model included seven parameters: an intercept parameter, two parameters for a quadratic fit to the logarithm of discharge, two parameters for a quadratic fit to time, and two parameters for the sinusoidal function of seasonality. The model was defined as:

$$\ln C = \beta_0 + \beta_1 \ln(Q/\tilde{Q}) + \beta_2 [\ln(Q/\tilde{Q})]^2 + \beta_3 (T - \tilde{T}) + \beta_4 (T - \tilde{T})^2 + \beta_5 \sin(2\pi T) + \beta_6 \cos(2\pi T) + \varepsilon \quad (1)$$

where $C$ represents concentration (mg/L), $Q$ represents discharge (m³/s), $T$ represents time in decimal years, $\tilde{Q}$ and $\tilde{T}$ represent centering variables, $\beta_0$ represents the intercept, $\beta_1$ and $\beta_2$ represent regression coefficients corresponding to flow, $\beta_3$ and $\beta_4$ represent regression coefficients corresponding to time, $\beta_5$ and $\beta_6$ represent regression coefficients corresponding to seasonality, and $\varepsilon$ represents the error, which was assumed to be independent and normally distributed, with a zero mean and a constant variance. In this study, Equation (1) was used to calculate the regression coefficients corresponding to the seven parameters, and finally these regression coefficients were used to assess the attainment of the water quality target. $\tilde{Q}$ and $\tilde{T}$ were defined to reduce covariance among the independent parameters and enhance estimation precision. They were calculated via simplification, without affecting the prediction results, according to Equations (2) and (3), respectively [5].
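$$\tilde{T} = \bar{T} + \frac{\sum_{i=1}^{N} (T_i - \bar{T})^3}{2 \sum_{i=1}^{N} (T_i - \bar{T})^2} \quad (2)$$

$$\bar{T} = \frac{1}{N} \sum_{i=1}^{N} T_i \quad (3)$$

where $\tilde{T}$ represents the center of the calibration data, $\bar{T}$ represents the mean of the data, $T_i$ represents the $i$th sampled data, and $N$ represents the number of observations in the calibration dataset.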
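The model structure described above can be sketched in plain Python as follows. This is an illustrative sketch only, not the SPSS procedure used in this study. It assumes the centering form of Cohn et al. [5] (the mean plus the sum of cubed deviations divided by twice the sum of squared deviations), assumes that the discharge centering is applied to ln Q, and fits the seven coefficients by ordinary least squares through the normal equations. The function names `center`, `design_row`, `solve`, `fit`, and `predict` are hypothetical.

```python
import math

def center(values):
    """Centering value: mean + sum(d^3) / (2 * sum(d^2)), d = value - mean."""
    n = len(values)
    mean = sum(values) / n
    d2 = sum((v - mean) ** 2 for v in values)
    d3 = sum((v - mean) ** 3 for v in values)
    return mean + d3 / (2.0 * d2)

def design_row(q, t, ln_q_center, t_center):
    """Seven regressors for one sample (q in m3/s, t in decimal years)."""
    lq = math.log(q) - ln_q_center   # ln(Q / Q_center), assumed log-space centering
    dt = t - t_center
    return [1.0, lq, lq * lq, dt, dt * dt,
            math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit(conc, flow, time):
    """Least-squares fit of ln(C); returns (beta_0..beta_6, ln Q center, T center)."""
    ln_q_center = center([math.log(q) for q in flow])
    t_center = center(time)
    rows = [design_row(q, t, ln_q_center, t_center) for q, t in zip(flow, time)]
    y = [math.log(c) for c in conc]
    k = 7
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty), ln_q_center, t_center

def predict(beta, q, t, ln_q_center, t_center):
    """Back-transform the fitted ln(C) to a concentration in mg/L."""
    row = design_row(q, t, ln_q_center, t_center)
    return math.exp(sum(b * x for b, x in zip(beta, row)))
```

With weekly samples, `fit(conc, flow, time)` would be called once per site and per constituent (BOD or TP). The simple exponential back-transform in `predict` ignores retransformation bias, which a production workflow would need to consider.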

Evaluation Method for the Multivariate Log-Linear Model
The significance of the regression models fitted to the weekly 2013-2018 water quality and flow data was evaluated using the F-test. The null hypothesis, $H_0$, was $\beta_i = 0$, while the alternative hypothesis, $H_1$, was $\beta_i \neq 0$. After establishing the significance of the models (rejection of the null hypothesis based on the results of the F-test), the values of the individual regression coefficients and their significance (at α = 0.1) were estimated by conducting t-tests. Significance was classified as high when the p-value was below 0.01 (1% significance level), marginal when the p-value ranged between 0.01 and 0.1 (1-10% significance level), and absent (nonsignificant) when it was above 0.1 (10% significance level) [5,6].
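The quantities involved in this evaluation can be illustrated with a short Python sketch. It uses the standard textbook expressions for the overall F statistic and the adjusted R² of a regression with k predictors (k = 6 for the model used here, excluding the intercept) and applies the p-value bands stated above. The function names are hypothetical, and the treatment of p-values falling exactly on 0.01 or 0.1 is an assumption.

```python
def f_statistic(r2, n, k):
    """Overall F statistic for a regression with k predictors and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def adjusted_r2(r2, n, k):
    """Adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def classify_p_value(p):
    """Map a p-value to the significance bands used in the text."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < 0.01:
        return "highly significant"
    if p <= 0.1:          # boundary handling is an assumption
        return "marginally significant"
    return "nonsignificant"
```

Converting the F statistic to a p-value requires the F distribution with (k, n − k − 1) degrees of freedom, which a statistics package would provide; that step is omitted here.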
Among the calculated regression coefficients, $\beta_1$ and $\beta_3$ represent flow- and time-related coefficients, respectively. When $\beta_1$ is positive (+), it implies that the water quality concentrations have a tendency to increase as the flow increases, indicating that the watershed is highly affected by nonpoint pollution sources. Conversely, when it is negative (−), it implies that the water quality concentrations decrease as flow increases, indicating that the watershed is highly affected by point pollution sources. Further, when the time coefficient, $\beta_3$, is positive (+), it implies that the water quality concentrations increase with time, and it is indicative of an increase in the influence of pollution sources or a change in land use, for example, urbanization and the construction of agricultural and livestock complexes and industrial facilities. However, when it is negative (−), it implies that the water quality concentrations decrease with time, indicating an improvement in water quality owing to the influence of corresponding water management policies, such as the introduction of environmental management facilities.
The more influential of the two explanatory factors, time and flow, was considered as the main explanatory factor by comparing the absolute values of the standardized flow and time coefficients. Additionally, the regression coefficients $\beta_5$ and $\beta_6$, which are related to seasonal changes, offered the possibility to identify seasonal water quality trends [6].
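The interpretation rules described above can be sketched in Python as follows. This is a minimal illustration with hypothetical function names. It assumes that a standardized coefficient is the raw coefficient multiplied by the ratio of the standard deviation of the regressor to that of the response (ln C), and that a tie between the two factors is resolved in favor of flow.

```python
import statistics

def standardized_coefficient(beta, x, y):
    """Rescale a raw slope by sd(x) / sd(y) so that slopes become comparable."""
    return beta * statistics.stdev(x) / statistics.stdev(y)

def flow_interpretation(beta1):
    """Sign of the flow coefficient: dominant type of pollution source."""
    if beta1 > 0:
        return "concentration rises with flow: nonpoint sources dominate"
    if beta1 < 0:
        return "concentration falls with flow: point sources dominate"
    return "no flow dependence"

def time_interpretation(beta3):
    """Sign of the time coefficient: direction of the long-term trend."""
    if beta3 > 0:
        return "concentration increases over time"
    if beta3 < 0:
        return "concentration decreases over time"
    return "no time trend"

def main_factor(std_flow, std_time):
    """Return 'Q' (flow) or 'T' (time), whichever has the larger absolute value."""
    return "Q" if abs(std_flow) >= abs(std_time) else "T"  # tie -> Q (assumption)
```

For example, `main_factor(-0.4, 0.1)` returns `"Q"`, meaning that flow would be reported as the main explanatory factor for that site.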
To minimize the bias of the estimates calculated using the multivariate regression models and improve their accuracy, the suitability of the models was assessed using the root mean square error (RMSE), mean absolute percent error (MAPE), and adjusted R² values corresponding to the prediction error based on the comparison of the estimated and observed values corresponding to each site. The RMSE is suitable for the verification of the reliability of the absolute prediction error between predicted and observed values [30] calculated over a short period of time, whereas the MAPE, which is the percentage average of the absolute values of the errors between the actual and predicted values, can compensate for the shortcomings of size-dependent errors, such as units. Specifically, RMSE and MAPE can be expressed as shown in Equations (4) and (5), respectively; for both of them, a lower value can be interpreted as a higher suitability with lower bias [31].
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (H_{io} - H_{ie})^2} \quad (4)$$

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{H_{io} - H_{ie}}{H_{io}} \right| \quad (5)$$

where $n$ represents the total number of observed data, $H_{io}$ represents the $i$-th observed value, and $H_{ie}$ represents the $i$-th estimated value. Table 1 shows the accuracy interpretation criteria for MAPE values proposed by Lewis (1982) [32]. MAPE values below 10% are interpreted as highly accurate forecasting, while values above 50% are considered to be inaccurate forecasting. Further, based on the model suitability assessment results, 2018 BOD and TP concentrations were estimated, and the attainment of water quality targets for this year with respect to BOD and TP at the 24 sites considered in this study was assessed using the corresponding 2018 flow data.
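The two error measures and the interpretation bands of Lewis (1982) [32] can be sketched in Python as follows. The function names are hypothetical, and the assignment of values lying exactly on a band boundary (10, 20, or 50) is an assumption, since the criteria do not specify it.

```python
import math

def rmse(observed, estimated):
    """Root mean square error between paired observed and estimated values."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def mape(observed, estimated):
    """Mean absolute percent error (%); observed values must be nonzero."""
    n = len(observed)
    return 100.0 / n * sum(abs((o - e) / o) for o, e in zip(observed, estimated))

def interpret_mape(value):
    """Map a MAPE value (%) to the forecasting-accuracy bands of Lewis (1982)."""
    if value < 10:
        return "highly accurate forecasting"
    if value < 20:
        return "good forecasting"
    if value <= 50:
        return "reasonable forecasting"
    return "inaccurate forecasting"
```

With the made-up pairs `observed = [2.0, 4.0]` and `estimated = [1.0, 5.0]`, the RMSE is 1.0 and the MAPE is 37.5%, which falls in the "reasonable forecasting" band.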

Table 1. Accuracy interpretation criteria for MAPE values [32].

MAPE (%)    Interpretation
<10         Highly accurate forecasting
10-20       Good forecasting
20-50       Reasonable forecasting
>50         Inaccurate forecasting

Table 2 shows the F-test results and the adjusted R² values obtained using the multivariate log-linear models for each site. Specifically, the F-test results indicated that all the regression models for the 24 sites showed a highly significant linear relationship between the explanatory and response factors (p < 0.001). The adjusted R² values, which represent the explanatory power of a regression equation, ranged from 0.100 to 0.573 for the BOD models, with the explanatory power corresponding to tributaries (31.4% on average) being higher than that corresponding to the main stream (24.7% on average). Further, for the TP models, the adjusted R² values ranged from 0.197 to 0.803, with the explanatory power corresponding to tributaries (57.6% on average) being higher than that corresponding to the main stream (37.3% on average), as was the case with the BOD models. However, the explanatory powers corresponding to the TP models were higher than those corresponding to the BOD models. Furthermore, the explanatory powers of the BOD and TP models at the Geumhogang6 (40.0%) and Hoecheon2-1 (67.0%) sites in the midstream section were found to be higher than those in the upstream and downstream sections. Regarding BOD concentration, the Hakseong site in the Taehwa River watershed exhibited the highest explanatory power (57.0%), and regarding TP concentration, the Naeseongcheon3-1 site, a tributary in the midstream section of the Nakdong River, showed the highest explanatory power (79.9%).

Application of Multivariate Log-Linear Model
Generally, it is known that multivariate log-linear models can explain approximately 10-50% of the variability (explanatory power) of continuously observed concentrations [5]. The multivariate log-linear models calculated in this study explained approximately 10-60% and 10-80% of the variability in BOD and TP concentrations, respectively, indicating that they are suitable for predicting parameter concentrations and identifying the water quality trend taking flow into consideration. Further, the explanatory powers of these models were found to be higher in tributaries, which are sensitive to the influence of flow, than in the main stream. Table 3 shows the results of the examination of the coefficients of the regression models for the Nakdong River basin and their statistical significance. Further, Table 4 shows the results of the analysis of the water quality trend for each site within the 2013-2018 period. The flow-related regression coefficients, $\beta_1$ and $\beta_2$, were highly significant or significant for all the sites, except for the Yongsan site, while the time-related regression coefficients, $\beta_3$ and $\beta_4$, were highly significant or significant for all the sites, except the Yeonggang2-1, Sangju2, Daeam-1, Hwanggang5, and Yongsan sites. Furthermore, the coefficients $\beta_5$ and $\beta_6$, which are regression coefficients corresponding to seasonality, were significant for all the sites. Specifically, the flow-related regression coefficients ($\beta_1$) of BOD and TP concentrations were more significant than their time-related regression coefficients ($\beta_3$). Further, the flow-related coefficients of TP concentration, which ranged from −0.219 to 0.501, showed statistical significance at all sites, except for the Hwanggang1-1 and Milyanggang3 sites, while those of BOD concentration (Table 3), which ranged from −0.216 to 0.256, were statistically significant mainly at the upstream and downstream sites.
Furthermore, organic matter concentration showed a tendency to decrease owing to an increase in flow rather than time, and a clear tendency of improved water quality was observed, particularly in the downstream section. Regarding BOD concentration, the regression coefficient corresponding to time ($\beta_3$) was significant at 13 out of the 24 sites. For TP concentration, it ranged from −0.266 to 0.082 and showed statistical significance for all sites except the Sangju2, Sangok, Hakseong, Hyeongsangang4, and Hakseong sites. This implies that time had an effect on nutrient concentration, which, however, was more dependent on flow than time. Unlike BOD concentration, TP concentration showed more vulnerability to the inflow of nonpoint pollutants than to the pollutant dilution effect owing to an increase in flow during rainfall. The regression coefficients for seasonality ($\beta_5$ and $\beta_6$) were also found to be statistically significant at all the sites. As shown in Table 4, the change in BOD concentration at the main stream sites varied depending on the site: flow and time were identified as the main explanatory factors for four and seven sites, respectively. Regarding the change in BOD concentration in the tributaries, flow was identified as the main explanatory factor for 10 out of the 13 tributary sites. Further, regarding the change in TP concentration, flow was identified as the main explanatory factor for 9 of the 11 main stream sites. It was also observed that TP concentration tended to increase with flow across the basin but to decrease over time. This is considered to be due to the water quality improvement effect resulting from an increase in investments, from 2009 onward, aimed at improving the TP treatment capacities of water treatment facilities for the removal of point pollutants in the main stream sections of the four major rivers in the Republic of Korea [33,34].
Considering the 13 tributary sites, flow and time were identified as the main explanatory factors for six sites and seven sites, respectively. For the main stream sites, the main explanatory factor for TP concentration was flow rather than time, and it was different for each site in the tributaries.

Table 4. Water quality trend description and identification of main explanatory factors.

Notes: -, Nonsignificant; ↓, Decrease; ↑, Increase; Q, Flow; T, Time.

Table 5 shows the results of the analysis of the accuracy of the multivariate log-linear models in predicting 2018 water quality parameters. Overall, the predicted TP concentration values reflected the observed values to a greater extent than the predicted BOD concentration values. Further, the observed values were better reflected at the mid-upstream sites than at the downstream sites, and MAPE was below 50% for all the sites; this is indicative of reasonable forecasting. Furthermore, the RMSE values obtained for BOD and TP concentrations were in the ranges 0.3-1.5 and 0.012-0.064 mg/L, respectively, and among the 24 sites, the Andong1 and Banbyeoncheon2-1 sites exhibited the highest prediction accuracies for BOD and TP concentrations, respectively, i.e., RMSE values of 0.3 and 0.012 mg/L, respectively (Figure 3).

The observed and predicted 2018 data for the 24 sites were compared using the verified models as shown in Figure 4, and the comparison showed similar tendencies throughout the basin. When the accuracy of the predicted 2018 data was evaluated, the RMSE and MAPE values for BOD concentration were 0.2 mg/L and 10.6%, respectively, and for TP concentration, they were 0.007 mg/L and 13.7%, respectively, indicating the excellent suitability of the model results.

Assessment of the Attainment of the 2018 Water Quality Target
When the attainment of the 2018 water quality targets at the 24 sites was assessed using the multivariate log-linear models that considered changes in flow within the 2013-2018 period (Table 6), it was observed that the water quality targets with respect to BOD and TP concentrations were attained at 75.0% (18 sites) and 70.8% (17 sites) of the sites, respectively. These values are higher than those obtained using the annual average observed concentration method (58.3% and 50.0% for BOD and TP, respectively) [35]. Regarding BOD concentration, nonattainment changed to attainment mainly for the large tributaries, which are affected by flow to a greater extent than the main stream sites, and for TP concentration, the attainment rate improved for both the main stream and tributary sites. This appears to be due to the organic pollutant dilution effect resulting from an increase in flow and the influence of continuous efforts to improve water quality, such as the reinforcement of water quality standards for the water discharged from public sewage treatment facilities and the expansion of TP treatment capacity of sewage and wastewater treatment facilities [32,36].
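The arithmetic behind these attainment rates can be sketched in Python. The decision rule shown here, in which a site attains its target when the mean of its model-predicted 2018 concentrations does not exceed the target, is an assumption made for illustration and is not stated in this form in the text. The function names and the example concentrations are hypothetical.

```python
def attains_target(predicted, target):
    """True if the mean predicted concentration (mg/L) does not exceed the target."""
    return sum(predicted) / len(predicted) <= target

def attainment_rate(flags):
    """Percentage of sites whose attainment flag is True."""
    return 100.0 * sum(flags) / len(flags)

# Hypothetical site: predicted BOD values against a made-up 2.0 mg/L target.
site_ok = attains_target([1.8, 2.1, 1.9], 2.0)

# 18 and 17 attaining sites out of 24, as reported for BOD and TP.
bod_rate = attainment_rate([True] * 18 + [False] * 6)
tp_rate = attainment_rate([True] * 17 + [False] * 7)
```

Here `bod_rate` evaluates to 75.0 and `tp_rate` rounds to 70.8, which corresponds to 18 and 17 attaining sites out of 24.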

Conclusions
In this study, among 32 representative sites in the sub-basin of the Nakdong River, multivariate log-linear models were applied to 24 sites where flow data were available to analyze the water quality trend and assess the attainment of the water quality targets. The significance and suitability of the models were evaluated using the F-test, RMSE, MAPE, and adjusted R² values. It was observed that all 24 models showed statistical significance. Specifically, the explanatory power for TP concentration was higher than that for BOD concentration, and the explanatory power corresponding to tributaries was higher than that corresponding to the main stream.
Further, based on the multivariate log-linear models, flow was identified as the main factor that affects water quality with respect to BOD and TP. In particular, flow had a dominant effect on TP and BOD concentrations at main stream and tributary sites, respectively. This observation is closely related to the dilution of organic pollutant concentrations resulting from an increase in flow and to the increase in nutrient concentrations owing to the discharge of nonpoint pollutants during rainfall. For tributaries with low stream flow, it was observed that the change in water quality was sensitive to the inflow of pollutants. Considering each water quality parameter, BOD concentration tended to decrease, while TP concentration tended to increase as flow increased. However, both BOD and TP concentrations showed a tendency to improve over time. This is considered to be the effect of investments related to environmental improvement, such as the expansion of treatment facilities for biodegradable organic matter and TP treatment over the last 30 years.
Furthermore, based on the RMSE and MAPE values obtained, the multivariate log-linear models showed suitability in the prediction of water quality parameters. The evaluation of the 2018 water quality target attainment rate on this basis showed that the BOD concentration target was attained to a greater extent than the TP concentration target. This could be attributed to the significant improvement in BOD concentration in severely polluted main rivers owing to the dilution effect brought about by the increase in flow as well as the implementation of drastic water quality management policies in the 1980s and 1990s.
Based on their effectiveness in accurately explaining the water quality parameters, the multivariate regression models applied in this study were determined to be more suitable for the prediction of nutrient concentrations than organic matter concentrations and for the prediction of the water quality of tributaries rather than that of the main stream. This water quality assessment method allowed the identification of water quality influencing factors by linking water quality, flow, and time (seasonality). It is expected that the results derived in this study will be used as basic data for the preparation of integrated water quality and flow management plans as well as objective assessment methods for the identification of the effects of water quality management policies on water quality. In the future, it will be necessary to conduct further research with higher accuracy by expanding the number of survey sites for both water quality and flow and by securing long-term observation data.