Multiple Linear Regression Models for Predicting Nonpoint-Source Pollutant Discharge from a Highland Agricultural Region

Sediment runoff from dense highland field areas greatly affects the quality of downstream lakes and drinking water sources. In this study, multiple linear regression (MLR) models were built to predict diffuse pollutant discharge using the environmental parameters of a basin. Explanatory variables that influence the sediment and pollutant discharge can be identified with the model, and such research could play an important role in limiting sediment erosion in the dense highland field area. Pollutant load per event, event mean concentration (EMC), and pollutant load per area were estimated from stormwater survey data from the Lake Soyang basin. During the wet season, heavy rains cause large amounts of suspended sediment and the occurrence of such rains is increasing due to climate change. The explanatory variables used in the MLR models are the percentage of fields, subbasin area, and mean slope of subbasin as topographic parameters, and the number of preceding dry days, rainfall intensity, rainfall depth, and rainfall duration as rainfall parameters. In the MLR modeling process, four types of regression equations with and without log transformation of the explanatory and response variables were examined to identify the best performing regression model. The performance of the MLR models was evaluated using the coefficient of determination (R2), root mean square error (RMSE), coefficient of variation of the root mean square error (CV(RMSE)), the ratio of the RMSE to the standard deviation of the observed data (RSR) and the Nash–Sutcliffe model efficiency (NSE). The performance of the MLR models of pollutant load except total nitrogen (TN) was good under the condition of RSR, and satisfactory for the NSE and R2. In the EMC and load/area models, the performance for suspended solids (SS) and total phosphorus (TP) was good for the RSR, and satisfactory for the NSE and R2. The standardized coefficients for the models were analyzed to identify the influential explanatory variables in the models. In the final performance evaluation, the results of jackknife validation indicate that the MLR models are robust.


Introduction
In Lake Soyang basin of South Korea, large amounts of sediment are discharged from highland agricultural field regions in the wet season.To develop environmental preservation measures that protect water resources from the turbid water problem and diffuse pollution, prediction models are necessary to estimate the amount of pollutants discharged from subbasins.In this study, a multiple linear regression (MLR) model is established to predict the pollutant runoff discharge using environmental parameters, such as land use, rainfall, and topography.
In South Korea, rainfall events of 200 mm or more occurred only once annually, on average, until the end of the 1970s, but increased to a frequency of two per year in the 1980s and thereafter occurred five times in both 1984 and 1998.And, annual precipitation increased by 19% in the past decade, compared to the first half of the 20th century [1].Conditions in Lake Soyang, located in the upper reaches of the Han River, greatly affect the water quality of the water supply of the capital region of South Korea.Discharged sediments from the highland field area flow into Lake Soyang in the wet season.Consequently, the turbidity of the lake increases to high levels and persists for a long time.In July 2006, a heavy rain event occurred in the Lake Soyang watershed.Overall, 670 mm fell over 8 days, with a maximum intensity of 66 mm per hour.The suspended sediment stayed in Lake Soyang for an extended period of time because of stratification; thus, the turbidity of the lake remained high and was measured at over 20 nephelometric turbidity units (NTU) for 168 days [2,3].
Regression models have been developed to estimate the sediment discharge using the subbasin environmental parameters in many areas.Valtanen et al. [4] applied stepwise multiple linear regression (SMLR) analysis to identify the variables that best explained the variation in event mass loads (EMLs) in each study catchment during cold and warm periods.Runoff duration, peak flow, antecedent dry period, mean runoff intensity, total suspended solids (TSS), TN, TP and total organic carbon (TOC) were used as explanatory variables.Another SMLR analysis was also carried out to assess whether catchment variables explain the EML and EMC values during the cold and warm periods [4].The catchment variables included total impervious area and land use type.All data were log10-transformed to obtain approximately normal distributions.Bian et al. [5] proposed a procedure combining different statistical methods and a hydrological model to quantify the annual runoff response to spatial and temporal variations in impervious surface areas in an urbanized basin.A hydrological model relating annual runoff depth to precipitation, potential evapotranspiration and spatial metrics of the impervious area for baseline periods and periods of change was built using stepwise multiple regression analysis.Roman et al. [6] developed multivariate regression models to enable the prediction of mean annual suspended sediment discharge on the basis of basin characteristics, which is useful for many ungaged river locations in the eastern United States.The models are based on long-term mean sediment discharge estimates and explanatory variables, such as drainage area, mean elevation, and urban area, obtained from a combined dataset of 1201 US Geological Survey (USGS) stations.Tuset et al. [7] analyzed rainfall, runoff and sediment transport relationships in a meso-scale Mediterranean mountain catchment.The relationships among rainfall, runoff and suspended sediment transport were analyzed with Pearson correlations and multivariate regression analysis.The multivariate regression method was used to analyze the relationship between the independent variables (pre-event conditions, rainfall and runoff) and suspended sediment transport for all flood events.Seasonal relationships between total surface runoff and total sediment transport indicate that the sediment transport magnitude shows a clear seasonality influenced by rainfall intensity and sediment availability.
Buendia et al. [8] attempted to use empirical relationships to assess the relationship between sediment yield and basin scale and to provide an update on the main drivers controlling sediment yield in these particular river systems.Quantile regression analysis was used to assess the correlation between basin area and sediment yield, while additional basin-scale descriptors were related to sediment yield by means of multiple regression analysis.The performance of the model was tested through the jackknife validation method [8][9][10][11][12][13][14].Paule-Mercado et al. [15] used MLRs to identify the significant parameters affecting fecal-indicator bacteria concentrations and to predict the response of bacteria concentrations to changes in land use and land cover.Stormwater temperature, 5-day biochemical oxygen demand (BOD 5 ), turbidity, TSS, and antecedent dry days were the most influential independent variables for the bacteria concentrations at the monitoring sites.Several studies have utilized linear regression techniques to predict bacteria concentrations in rivers [16][17][18][19].Furthermore, regression models have been widely used to predict and characterize rainfall and runoff characteristics and to determine the relationship between these two variables [20][21][22][23][24][25][26].Process-based erosion prediction models have also been established to predict the intensity of soil erosion in a particular area [27][28][29][30].
In this study area, two types of environmental parameters affect the stormwater sediment runoff: meteorological factors, such as rainfall depth, rainfall intensity, rainfall duration; and number of preceding dry days and topographic factors, such as percentage of upland field area, subbasin area, and subbasin slope.In this study, the Pearson correlation test was employed to identify the linear relationship between the explanatory environmental parameters and the observed stormwater discharge.SMLR analysis was applied to identify the best performing regression model.Four types of regression equations were examined to determine the best MLR model.Explanatory and dependent variables with and without log e-transformation were tested.Then, the MLR models were validated via a jackknife validation procedure.

Study Area and Field Data
Lake Soyang formed following the construction of the Soyang River Dam.The dam was built to provide irrigation water, flood control and hydroelectric power.The dam has a height of 123 m, a length of 530 m, and a total storage capacity of 2900 million m 3 .The basin area is 2969.3km 2 ; the forest occupies 86.4%; and the dry field, paddy field and residential areas occupy 4.4%, 1.58%, and 1.60% of the basin, respectively.
In Lake Soyang basin (Figure 1), sediment discharge occurred mainly from the upper part of the basin.Land use in the tributary watersheds in the dense highland upland field area is shown in Table 1.In the case of Mandaecheon, the percentage of agricultural area is 27.7%, and upland fields represent 75% of the agricultural land.The Jungjohangcheon, Johangcheon and Jauncheon subbasins contain very small paddy field areas.In the highland area, to decrease the damage caused by the continuous cultivation of economic crops, manage pests, maximize crop productivity and improve soil fertility, 30-50 cm of soil dressing has been applied to the top layer of soil.This soil dressing is a significant contributor to the sediment discharge from the highland fields.
During rainfall, water sampling and flow measurements were performed at the same time.Field surveys were conducted to perform flow measurements at most of the survey points.For the remaining points, such as Inbukcheon, Bukcheon, and Soyang River, real-time water level and flow measurements were obtained from the Ministry of Land Infrastructure and Transport and the Korea Water Resources Corporation.The sampled water from the measurement sites was delivered to the laboratory as quickly as possible, and BOD, chemical oxygen demand (COD), SS, TP, TN, and TOC were analyzed using standard methods [2].In the statistical analysis of this study, the results of stormwater runoff surveys from 2013 to 2016 at nine points in the Jaun, Mandae and Gaha area [31] were used.Of the 79 rainfall events, nine data were too high or too low for rainfall amount due to runoff load measurements and calculation errors; those data were excluded.The discharge load survey was performed from the beginning of the rainfall to the point where it returned to the normal water level after the end of the rainfall.Runoff discharge data of 70 storm events were used to build the MLR models to predict the pollutant load, EMC and pollutant load per area.The range of the rainfall depth used in the MLR model construction was from 10 mm to 215 mm.

Data Analysis
Using water quality and runoff flow data from 70 rainfall events in the Lake Soyang basin, pollutant load per event, EMC, and pollutant load per area were estimated for each rainfall event.The total pollutant load during a rainfall event was calculated using Equation (1).The EMC was defined as the pollutant mass contained in the runoff event divided by the total flow volume of the event.The total pollutant load was divided by the subbasin area to estimate the pollutant load per area.

𝑇𝑜𝑡𝑎𝑙 𝑝𝑜𝑙𝑙𝑢𝑡𝑖𝑜𝑛 𝑙𝑜𝑎𝑑/𝑟𝑎𝑖𝑛𝑓𝑎𝑙𝑙 𝑒𝑣𝑒𝑛𝑡 = 𝐶 𝑄 ∆𝑡
where n represents the number of total measurements, Qi is the runoff flow at n number of time steps (∆t) and Ci is the concentration of a water quality measurement.The distribution of the nonpoint pollutant discharge for the 70 rainfall events from 2013 to 2016 is presented in Table 2. Figure 2 shows box plots of pollutant load per event, EMC, and pollutant load per area at the survey points in the Lake Soyang basin.The maximum, minimum and median values of the suspended sediment (SS) load/event were 46,125,100 kg, 613 kg and 263,083 kg, respectively.The maximum, minimum and median values of the TP load/event were 32,406 kg, 1.83 kg and 480 kg, respectively.As shown in Figure 2, all the mean values of the pollutant loads are larger than the values of the third quartile, and the distributions are biased toward the high values.The maximum, minimum and median values of the SS EMC were 1437 mg/L, 3.8 mg/L and 157 mg/L, respectively.The maximum, minimum and median values of the TP EMC were 1.96 mg/L, 0.011 mg/L and 0.27 mg/L, respectively.All the mean values of the EMCs lie between the 50th percentile and the 75th percentile, and the distributions are relatively uniform.

Data Analysis
Using water quality and runoff flow data from 70 rainfall events in the Lake Soyang basin, pollutant load per event, EMC, and pollutant load per area were estimated for each rainfall event.The total pollutant load during a rainfall event was calculated using Equation (1).The EMC was defined as the pollutant mass contained in the runoff event divided by the total flow volume of the event.The total pollutant load was divided by the subbasin area to estimate the pollutant load per area.

Total pollution load/rain f all event
where n represents the number of total measurements, Q i is the runoff flow at n number of time steps (∆t) and C i is the concentration of a water quality measurement.The distribution of the nonpoint pollutant discharge for the 70 rainfall events from 2013 to 2016 is presented in Table 2. Figure 2 shows box plots of pollutant load per event, EMC, and pollutant load per area at the survey points in the Lake Soyang basin.The maximum, minimum and median values of the suspended sediment (SS) load/event were 46,125,100 kg, 613 kg and 263,083 kg, respectively.The maximum, minimum and median values of the TP load/event were 32,406 kg, 1.83 kg and 480 kg, respectively.As shown in Figure 2, all the mean values of the pollutant loads are larger than the values of the third quartile, and the distributions are biased toward the high values.The maximum, minimum and median values of the SS EMC were 1437 mg/L, 3.8 mg/L and 157 mg/L, respectively.The maximum, minimum and median values of the TP EMC were 1.96 mg/L, 0.011 mg/L and 0.27 mg/L, respectively.All the mean values of the EMCs lie between the 50th percentile and the 75th percentile, and the distributions are relatively uniform.
The explanatory variables that are considered to explain the nonpoint pollutant discharge in the MLR models are the percentage of fields (% field), subbasin area (SA), and mean slope of subbasin (slope) as topographic parameters, and the number of preceding dry days (Ndry), rainfall intensity (Rint), rainfall depth (Rain), and rainfall duration (Dur) as rainfall parameters (Table 3).A Pearson correlation coefficient matrix was used to identify the correlations among the surveyed pollutant discharge estimates and the explanatory variables.The correlations among natural log-transformed variables were also tested using Pearson correlation.

MLR Model Building
MLR modeling was performed to predict the pollutant discharge from the subbasins in the Soyang River.The models were built to explain the pollutant discharge using the subbasin topographic and rainfall data.In the MLR modeling, four types of regression equations are examined: Type2 : where a 0 is the regression constant and a i is the regression coefficient of the explanatory variable X i .
In type 1, the original variables are used to build the MLR model.In type 2, dependent variables, such as pollutant load, EMC and load/area, are log e-transformed to reduce skewness.In type 3, all the explanatory and dependent variables are log e-transformed.In type 4, the dependent variables and some of the explanatory variables are log e-transformed.The fitness of the four regression equations was evaluated by the coefficients of determination of the MLR models.The MLR models were examined in terms of their ability to predict the runoff pollutant discharge for each water quality variable (SS, COD, BOD, TN, and TP).Collinearity may introduce serious stability problems, such as high mean square errors, in a regression model.Therefore, the collinearity of the predictor variables in the created MLR model were tested by calculating the variance inflation factor (VIF) [32].Collinearity is present when the largest VIF is greater than 10 or the average VIF value is substantially greater than 1 [32,33].VIFs were calculated to analyze the multicollinearity in this MLR model.
The MLR model performance was evaluated using the R 2 , RMSE, CV(RMSE), RSR and the NSE.
1/2 ( 8) where O i is the observed daily load, O is the mean of the observed daily load, p i is the calculated daily load, and n is the number of data values.The R 2 index describes the ability of the model to explain variability among the data.RSR incorporates the benefits of error index statistics and includes a scaling/normalization factor; the lower the RSR is, the better the model simulation performance.
The performance ratings for stream flow proposed by Moriasi et al. [33] were 'very good' (0.00 ≤ RSR ≤ 0.50), 'good' (0.50 < RSR ≤ 0.60), or 'satisfactory' (0.60 < RSR ≤ 0.70).NSE is a normalized statistic that reflects the relative magnitude of the residual variance compared with the variance in the observed data (good (NSE > 0.7), satisfactory (0.4 < NSE ≤ 0.7) and unsatisfactory (NSE ≤ 0.4)) [30,34].Finally, the performance of the MLR model was tested using the jackknife validation method [8,11,14].This method consists of deleting one site and carrying out the multiple regression analysis with the same dependent variables and the remaining sites.The pollutant discharge of the deleted site is calculated with the equation resulting from the multiple regression associated with the remaining sites.This process is repeated, deleting one site each time.

Correlation Analysis between Nonpoint Pollutant Discharge and Explanatory Variables
Table 4 shows the Pearson correlation between runoff discharge and subbasin characteristics, without log transformation of the variables.In Table 5, the Pearson correlation matrix between log-transformed variables is introduced.As the values of the correlation coefficients between log-transformed variables were slightly higher (r < 0.69; Table 5) than those of the non-log-transformed variables (r < 0.65; Table 4), we used log-transformed variables as explanatory variables in the MLR models.Compared to other environmental parameters, the rainfall depth and subbasin area showed a relatively significant correlation with most response discharge variables.The rainfall intensity had a relatively significant positive correlation with the response variables of EMC and load/area because the rainfall intensity directly affects the EMC and load/area of each storm event.On the other hand, the subbasin slope had a negative correlation with the response variables of EMC and load/area.

MLR Analysis
Four types of MLR models corresponding to Equations ( 3)-( 6) were tested to identify the most suitable models (Table 6).The R 2 values for SS, COD, BOD, TN, and TP in the type 1 MLR of pollutant load ranged from 0.275 to 0.447.The R 2 values for SS, COD, BOD, TN, and TP in the type 1 MLR of EMC and load/area were also low, indicating poor performance of the regression models.The R 2 values for SS, COD, BOD, TN, and TP in the type 2 MLR of pollutant load were 0.76, 0.67, 0.64, 0.65, and 0.80, respectively.The R 2 values of the type 2 MLR were quite high, but most of the VIF values were larger than 5, with a few values greater than 10.Thus, the VIF showed that multicollinearity was observed in the established models and that the type 2 MLR was not adapted.Although the R 2 values of the type 2 MLR for load/area were acceptable, the VIF values were high, indicating multicolinearity.VIF values and other statistics of MLRs were presented only for the selected model.The results of the MLR model employing the type 4 equation are listed in Tables 7-9.The R 2 values for SS, COD, BOD, TN, and TP in the type 3 MLR of the pollutant load were also fairly high, but all VIF values were less than 5.Among the type 3 MLR models, the SS, TN, and TP in the MLR of EMC and the SS and TP in the MLR of load/area showed acceptable R 2 values.The values of R 2 for SS, COD, BOD, TN, and TP in the type 4 MLR of pollutant load were 0.74, 0.69, 0.69, 0.61, and 0.74 respectively.The R 2 values of the type 4 MLR were a little better than those of the type 3 MLR, and all VIF values were less than 5. Thus, we selected the type 4 equation as the MLR model to predict the runoff pollutant discharge in the study area.However, the COD and BOD in the MLR of EMC and COD and TN in the MLR of load/area could not explain the variance in the pollutant discharge properly.
Using the stepwise variable selection method, two to five variables were retained in the pollutant load model, as shown in Table 7.In the case of the SS model, given the R 2 value, 73.6% of the variability of the dependent variable ln(SS load) is explained by the four explanatory variables.The MLR models indicated in Tables 7-9 are statistically significant at p < 0.0001 except for the ln(COD EMC) model (p = 0.00019).The R 2 values for SS, COD, BOD, TN, and TP in the type 4 MLR of pollutant load were fairly high (0.614 < R 2 < 0.741), as indicated in Table 7.The performance evaluation by CV(RMSE) [35] shows that the SS model was the best and that the other models of the water quality variables were also acceptable.The range of RSR for SS, COD, BOD, and TP in the MLR models of pollutant load (Table 7) was from 0.509 to 0.559, and the performance of the MLR for these variables was good [34].The RSR for TN was 0.622, and the performance of the TN model was satisfactory.The NSE values for the MLR models of pollutant load ranged from 0.61 to 0.74, and the MLR models of the pollutant load had good performance.As a special case, in linear regression forecasting models like this study, NSE is equal to the coefficient of determination, R 2 [36].Overall, all the MLR models of the pollutant load had good prediction performance.
All VIF values in Table 7 are lower than 5, and the mean VIF values are not large.These results suggest that the coefficient of regression for the explanatory variables could be statistically acceptable and that multicollinearity was not present in the established models.
Standardized coefficients refer to how many standard deviations a dependent variable will change in response to an increase of one standard deviation in the predictor variable.This statistic allows us to compare the relative contribution of each independent variable in the prediction of the dependent variable.The higher the absolute value of a coefficient is, the more important the weight of the corresponding variable.Standardized coefficients are useful for comparing effects across different measures.The standardized regression coefficients of Table 7 indicate that subbasin area (0.576 < β i < 0.709) and rainfall depth (0.453 < β i < 0.563) are important influential parameters for all the load predictions.In addition, % field has relatively small effects on the SS, BOD and TP models.−0.17 −0.60

3.423
Notes: a 0 is the regression constant; a i is the regression coefficient of the explanatory variable X i ; β i is the standardized regression coefficient.The area with the high density of highland fields in Lake Soyang basin has steeper slopes than the other areas.However, Lake Soyang basin also contains highly mountainous terrain; thus, the mean slopes of the dense highland field subbasins are lower than the average slope of the entire Lake Soyang basin.Therefore, the standardized regression coefficients of mean slope for the SS and TP load models have (−) signs, and the mean slope has a negative influence on the SS and TP loads.
The explanatory variables for the SS and TP models explained 65.5% and 66.2% of variation in the response variables of EMC.The R 2 values were fairly high, as indicated in Table 8.The R 2 value for the TN model of EMC was 0.539, and the TN model was acceptable [34].The CV(RMSE) value of the BOD model was quite high, and the model was not acceptable.The RSR values for SS and TP in the MLR models of EMC (Table 8) were 0.587 and 0.581, respectively, and the performances of these models were good.The RSR for the TN model was 0.679, and the performance of the TN model was satisfactory.However, the RSR values for the COD and BOD models were high, and these models were unsatisfactory.The NSE values for the MLR models of the EMC show that the SS, TP, and TN models were satisfactory but that the COD and BOD models were not satisfactory.The VIF values for the EMC models were lower than 5, and the mean VIF values were not large.Overall, the MLR models for SS and TP have good prediction performance, and the TN model has acceptable performance.
The standardized regression coefficients in Table 8 indicate that rainfall intensity and rainfall depth are influential explanatory variables for the EMC response variables.Rainfall intensity (0.234 < β i < 0.426) is an important factor for the TP, TN, and COD models, and rainfall depth is important for the SS and BOD models.In the pollutant load model, rainfall depth is a very important parameter, whereas rainfall intensity is not an important explanatory variable.However, rainfall intensity is an influential parameter for the EMC of a storm event.From the Pearson correlation matrix between natural log-transformed stormwater runoff discharge and subbasin characteristics in Table 5, we also can see that EMCs are better correlated to rainfall intensity than rainfall depth, and pollutant loads are much better correlated to rainfall depth than rainfall intensity.In agricultural areas such as the study area, the larger the rainfall intensity, the more nutrients are released from fertilizer and vegetation roots.The standardized regression coefficients of the mean slope for the SS and TP load models have (−) sign, and the mean slope has a large negative influence on the SS and TP EMC.Additionally, % field also has a negative impact on the SS and TP EMC.
The explanatory variables for the SS and TP models explained 69.5% and 67.5% of the variation in the load/area response variables, and the R 2 values were fairly high, as indicated in Table 9.The R 2 value for the BOD model of load/area was 0.51; thus, the BOD model was acceptable.The RSR values for SS and TP in the MLR models of load/area (Table 9) were 0.55 and 0.57, respectively, and the performances of these models were good.The RSR for the BOD model was 0.70, and the performance of the TN model was satisfactory.The NSE values in the MLR models of the load/area show that the SS, TP, and BOD models were satisfactory.The VIF values for the load/area models were less than 5, and the mean VIF values were not large.Overall, the MLR models of load/area for SS and TP have good performance, and the BOD model has acceptable performance.
The standardized regression coefficients in Table 9 indicate that rainfall depth (0.576 < β i < 0.634) is a highly influential parameter for all response variables in the load/area prediction.The β coefficients of the mean slope for the SS and TP load/area models are −0.79 and −0.49, respectively, and the absolute values of the coefficients are comparable to the coefficients of rainfall depth, indicating that the mean slope is a remarkable negative parameter on the SS and TP load/area results.

Jackknife Validation of the MLR Model
The performance of the jackknife validation was evaluated using R 2 , RSR and NSE (Table 10).The R 2 values were calculated by the linear regression between observed and jackknife validation values, and RSR and NSE were also calculated.The R 2 (Figure 3) and NSE values associated with the jackknife procedure were slightly lower than the results of the MLR models, whereas the RSR values were slightly higher than the MLR models.Therefore, the performance of the jackknife validation was slightly worse than that of the MLR models.The results of jackknife validation indicate that the MLR models are robust.
values were slightly higher than the MLR models.Therefore, the performance of the jackknife validation was slightly worse than that of the MLR models.The results of jackknife validation indicate that the MLR models are robust.

Conclusions
MLR models were built to predict the nonpoint-source pollutant discharge in the highland field area in the wet season using environmental parameters as explanatory variables.Runoff discharge data from 70 storm events were used to build the MLR models to predict the pollutant load, EMC and pollutant load per area.Pearson correlation tests were employed to identify the linear relationships between subbasin environmental parameters and the observed stormwater discharge.As the values of correlation coefficients between log-transformed variables were slightly higher than those of variables that had not been log transformed, the log-transformed variables were selected as explanatory variables in the MLR models.
The R 2 values for SS, COD, BOD, TN, and TP in the type 4 MLR of pollutant load were quite high (the best among the four examined MLR types), and all VIF values were less than 5. Thus, the type 4 equation was chosen as the MLR model to predict the runoff pollutant discharge.
The R 2 values for the five water quality variables in the MLR of pollutant load were fairly high (0.614 < R 2 < 0.741), and the RSR values for SS, COD, BOD, and TP in the MLR models of pollutant load ranged from 0.509 to 0.559.Hence, the performance of the MLR for these variables was good [34].The RSR for TN was 0.622, and the performance of the TN model was satisfactory.The NSE values for the MLR models of the pollutant load indicated good performance.Hence, most of the MLR models of the pollutant load have good prediction performance.
The MLR models of EMC for SS and TP also have good prediction performance, and TN model has acceptable performance.The MLR models of load/area for SS and TP have relatively good performance, and the BOD model has acceptable performance.Based on the R 2 , RSR and NSE values, the performance of the jackknife validation was slightly worse than that of the MLR models.Thus, the results of jackknife validation indicate that the MLR models are robust.
The results of the standardized coefficients for the MLR models indicate that subbasin area and rainfall depth are important influential parameters for all the load predictions.The mean slope exerts a negative influence on the SS and TP loads on account of topographic characteristics, as previously explained.For the pollutant load model, rainfall depth is a very important parameter, whereas rainfall intensity was not chosen as an explanatory variable.However, rainfall intensity has an influence on the EMC of the storm event.The mean slope has a large negative influence on the SS and TP EMC, and % field has a negative impact on the SS and TP EMC.Additionally, the rainfall depth is a highly influential parameter for all response variables of the load/area predictions, similar to the pollutant load models.The mean slope has a large negative influence on the SS and TP load/area.The average slope of fields, rather than the average slope of the whole sub-basin, can be an important explanatory variable for the pollutant discharge load of each subbasin.Similar or even better MLR results for EMC could have been obtained using peak rainfall intensity as explanatory variables.Therefore, future studies on MLR need to consider this.

Figure 2 .
Figure 2. Box plots of nonpoint pollutant discharge in the Lake Soyang basin.The top (a) and bottom (c) of each box represent the third and first quartiles, the solid line inside the box is the second quartile (b), and the dotted line inside the box is the mean.One whisker stretches from the third quartile to the maximum, and the other whisker stretches from the first quartile to the minimum.

Figure 2 .
Figure 2. Box plots of nonpoint pollutant discharge in the Lake Soyang basin.The top (a) and bottom (c) of each box represent the third and first quartiles, the solid line inside the box is the second quartile (b), and the dotted line inside the box is the mean.One whisker stretches from the third quartile to the maximum, and the other whisker stretches from the first quartile to the minimum.

Figure 3 .
Figure 3.Comparison between the observed and predicted values during storm events based on (a) MLR models and (b) the results of jackknife validation.Figure 3. Comparison between the observed and predicted values during storm events based on (a) MLR models and (b) the results of jackknife validation.

Figure 3 .
Figure 3.Comparison between the observed and predicted values during storm events based on (a) MLR models and (b) the results of jackknife validation.Figure 3. Comparison between the observed and predicted values during storm events based on (a) MLR models and (b) the results of jackknife validation.

Table 1 .
Land use of the tributary watersheds in the dense highland fields area.

Table 2 .
Distribution of the nonpoint pollutant discharge for the 70 rainfall events in the Lake Soyang basin.

Table 2 .
Distribution of the nonpoint pollutant discharge for the 70 rainfall events in the Lake Soyang basin.

Table 3 .
Explanatory variables considered in the regression models to predict pollutant discharge.

Table 4 .
Pearson correlation matrix between stormwater runoff discharge and subbasin characteristics.
Note: Bold marked correlations are significant at p < 0.01.

Table 5 .
Pearson correlation matrix between natural log-transformed stormwater runoff discharge and subbasin characteristics.
Note: Bold marked correlations are significant at p < 0.01.

Table 6 .
Coefficients of determination from four types of MLR analysis.

Table 7 .
MLR models for pollutant load discharge during stormwater events.

Table 8 .
MLR models for EMCs of stormwater events.

Table 9 .
MLR models for pollutant load per area during stormwater events.

Table 10 .
Three performance indicators for the stormwater runoff discharge values based on jackknife validation.