This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Water quality data are collected at various sampling frequencies, and they may be collected neither at a high frequency nor across the full range of streamflow conditions. Therefore, regression models are used to estimate pollutant data for days on which water quality was not measured. Pollutant load regression models were evaluated with six sampling frequencies for daily nitrogen, phosphorus, and sediment data. Annual pollutant load estimates varied with the sampling frequency and with the regression model used. Four distinct features were observed in the study. First, more frequent sampling did not necessarily lead to more accurate and precise annual pollutant load estimates. Second, using water quality data collected from storm events improved both accuracy and precision in annual pollutant load estimates for all water quality parameters. Third, the pollutant regression model automatically selected by LOADEST did not necessarily lead to more accurate and precise annual pollutant load estimates. Fourth, pollutant regression models behaved differently for different water quality parameters in annual pollutant load estimation.

Typically, water quality samples are costly to collect and to analyze. Water quality samples are collected by various sampling frequencies [

Thus, water quality data are often estimated using statistical methods such as regression models from measurements made intermittently in a certain period of time [

In addition to sampling frequencies and regression models, the water quality parameter itself influences load estimation, since each parameter behaves differently in watersheds. Streamflow and total nitrogen concentration displayed a poor relationship, whereas sediment concentration and streamflow were significantly correlated in pasture watersheds [

LOAD ESTimator (LOADEST) [

In this study, the predictive ability of LOADEST and LOADIN was evaluated with six sampling frequencies for three water quality parameters, and appropriate regression models and water quality sampling frequencies were suggested to obtain accurate annual pollutant load estimates.

Daily streamflow and water quality data were required to compute an observed “true load” and to create subsampled water quality datasets for LOADEST and LOADIN runs. Therefore, daily streamflow and water quality data for nitrogen, phosphorus, and sediment were collected from the National Center for Water Quality Research of Heidelberg University [

where _{i}_{i}

Locations of Daily Water Quality Data Stations.

Number of stations and daily water quality data values.

Water Quality Parameters | 1~10 years | 11~20 years | 21~37 years | Total |
---|---|---|---|---|
Nitrogen | 11 ^{a}^{b} | 5 ^{a}^{b} | 5 ^{a}^{b} | 21 ^{a}^{b} |
Phosphorus | 59 ^{a}^{b} | 5 ^{a}^{b} | 5 ^{a}^{b} | 69 ^{a}^{b} |
Sediment | 201 ^{a}^{b} | 5 ^{a}^{b} | 5 ^{a}^{b} | 211 ^{a}^{b} |

Notes: ^{ a} number of stations; ^{b} number of water quality data values.

Correlation Coefficients for Concentrations of Water Quality Parameters with Streamflow (adapted from Park, 2014 [

Six sampling frequencies were established with combinations of fixed intervals and inclusion of storm samples. The first three sampling frequencies represented collecting one water quality sample every week (weekly fixed interval sampling frequency), every two weeks (biweekly fixed interval sampling frequency), and every month (monthly fixed interval sampling frequency).

For fixed interval sampling, water quality data were selected on the same day every week, every two weeks, and every month, so that weekly, biweekly, and monthly sampling had seven, fourteen, and twenty-eight variants, respectively. For instance, the first subsampled water quality dataset for the weekly interval frequency consisted of the water quality data observed every Monday, while the second consisted of the data observed every Tuesday. Therefore, seven subsampled water quality datasets were prepared for the weekly interval frequency. The first subsampled water quality dataset for the biweekly interval frequency consisted of the water quality data observed on alternate Mondays, and the second consisted of the data observed on alternate Tuesdays. Therefore, fourteen subsampled water quality datasets were prepared for the biweekly interval frequency. The subsampling method for the monthly interval frequency was based on the day of the month. For instance, the first subsampled water quality dataset for the monthly interval frequency consisted of the water quality data observed on the 1st day of each month, and the second consisted of the data observed on the 2nd day of each month.
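The fixed interval subsampling described above can be sketched in Python with pandas. This is a minimal illustration, not the code used in the study: the function name `fixed_interval_variants` is hypothetical, the daily record is assumed to be a DataFrame with a gap-free daily `DatetimeIndex`, and biweekly offsets are counted from the first day of the record rather than from a particular weekday.

```python
import numpy as np
import pandas as pd


def fixed_interval_variants(daily: pd.DataFrame, frequency: str) -> list:
    """Return every fixed-interval subsample of a daily record.

    `daily` is assumed to have one row per day and a DatetimeIndex.
    """
    if frequency == "weekly":
        # One variant per weekday: Monday (0) through Sunday (6).
        return [daily[daily.index.dayofweek == d] for d in range(7)]
    if frequency == "biweekly":
        # Days elapsed since the first record; each offset 0..13
        # selects every fourteenth day.
        elapsed = np.asarray((daily.index - daily.index[0]).days)
        return [daily[elapsed % 14 == k] for k in range(14)]
    if frequency == "monthly":
        # Days 1..28 occur in every month, giving twenty-eight variants.
        return [daily[daily.index.day == d] for d in range(1, 29)]
    raise ValueError(f"unknown frequency: {frequency}")
```

Each returned DataFrame is one subsampled dataset, so a weekly run yields seven datasets, a biweekly run fourteen, and a monthly run twenty-eight.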

The other three sampling frequencies termed “mixed interval sampling” included additional samples from within storm events while maintaining the same sampling intervals as the first three frequencies. Storm samples in the study were defined as water quality data collected from peak flows in the “high flow” regime that represent the upper 10 percent of flows for a given analysis period [
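One way to flag storm samples and build a mixed interval dataset is sketched below. It is an illustration under stated assumptions: the high-flow threshold is taken as the 90th percentile of daily flow over the analysis period, and a "peak" is assumed to be a day whose flow is at least as large as that of both neighbouring days. The source does not specify the peak rule, and the names `storm_peak_days`, `mixed_interval_sample`, and the column name `flow` are hypothetical.

```python
import numpy as np
import pandas as pd


def storm_peak_days(flow: pd.Series, exceedance: float = 0.10) -> pd.DatetimeIndex:
    """Days that are local flow peaks within the upper `exceedance` share of flows."""
    threshold = flow.quantile(1.0 - exceedance)
    high = flow >= threshold
    # Assumed peak rule: flow is not exceeded by either neighbouring day.
    prev_day = flow.shift(1, fill_value=-np.inf)
    next_day = flow.shift(-1, fill_value=-np.inf)
    is_peak = (flow >= prev_day) & (flow >= next_day)
    return flow.index[(high & is_peak).to_numpy()]


def mixed_interval_sample(daily: pd.DataFrame, fixed_sample: pd.DataFrame,
                          flow_column: str = "flow") -> pd.DataFrame:
    """Add storm-peak days to a fixed-interval subsample of `daily`."""
    storm_days = storm_peak_days(daily[flow_column])
    # The union keeps each day once, so a storm day that is already in
    # the fixed-interval sample is not duplicated.
    days = fixed_sample.index.union(storm_days)
    return daily.loc[days]
```

The fixed-interval days are kept unchanged, which mirrors the statement that the mixed frequencies maintain the same sampling intervals as the first three.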

The regression model coefficients in LOADEST are calibrated by three statistical methods which are Adjusted Maximum Likelihood Estimation (AMLE), Maximum Likelihood Estimation (MLE), and Least Absolute Deviation (LAD) [

LOADEST has 11 regression models, and one of them is selected manually or automatically (model number option 0). The first nine regression models, numbered 1 to 9 (Equations (2)–(10); the left-hand side of each is the logarithm of pollutant load), are available to the automatic regression model selection, whereas the regression models numbered 10 and 11 (Equations (11) and (12)) are used to estimate pollutant loads for specific periods defined by users. Therefore, LOADEST was executed ten times for each subsampled water quality dataset: once with model number 0 to investigate the performance of the automatic model selection, and once with each of model numbers 1 to 9 to evaluate each regression model under manual model selection. Subsampled water quality datasets were used for LOADEST and LOADIN runs.

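$$\ln L = a_0 + a_1 \ln Q \tag{2}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 (\ln Q)^2 \tag{3}$$

$$\ln L = a_0 + a_1 \ln Q + a_2\, dtime \tag{4}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 \sin(2\pi\, dtime) + a_3 \cos(2\pi\, dtime) \tag{5}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 (\ln Q)^2 + a_3\, dtime \tag{6}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 (\ln Q)^2 + a_3 \sin(2\pi\, dtime) + a_4 \cos(2\pi\, dtime) \tag{7}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 \sin(2\pi\, dtime) + a_3 \cos(2\pi\, dtime) + a_4\, dtime \tag{8}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 (\ln Q)^2 + a_3 \sin(2\pi\, dtime) + a_4 \cos(2\pi\, dtime) + a_5\, dtime \tag{9}$$

$$\ln L = a_0 + a_1 \ln Q + a_2 (\ln Q)^2 + a_3 \sin(2\pi\, dtime) + a_4 \cos(2\pi\, dtime) + a_5\, dtime + a_6\, dtime^2 \tag{10}$$

$$\ln L = a_0 + a_1\, per + a_2 \ln Q + a_3 \ln Q\, per \tag{11}$$

$$\ln L = a_0 + a_1\, per + a_2 \ln Q + a_3 \ln Q\, per + a_4 (\ln Q)^2 + a_5 (\ln Q)^2\, per \tag{12}$$

where $a_0$ to $a_6$ are coefficients; $L$ is pollutant load; $Q$ is streamflow; $dtime$ is decimal time; and $per$ is an indicator of the user-defined period.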
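To make the structure of regression models 1 to 9 concrete, the sketch below builds the explanatory terms for a chosen model number and fits the coefficients by ordinary least squares in log space. Several points are assumptions rather than the study's procedure: the term lists follow the usual LOADEST model definitions, least squares stands in for the AMLE, MLE, and LAD calibration methods, the variables are centered on their sample means, and no retransformation bias correction is applied. All function and variable names are hypothetical.

```python
import numpy as np

# Explanatory terms of models 1-9, appended to the intercept in this order.
MODEL_TERMS = {
    1: ["lnq"],
    2: ["lnq", "lnq2"],
    3: ["lnq", "t"],
    4: ["lnq", "sin", "cos"],
    5: ["lnq", "lnq2", "t"],
    6: ["lnq", "lnq2", "sin", "cos"],
    7: ["lnq", "sin", "cos", "t"],
    8: ["lnq", "lnq2", "sin", "cos", "t"],
    9: ["lnq", "lnq2", "sin", "cos", "t", "t2"],
}


def design_matrix(flow, dtime, model, ln_q_center, t_center):
    """Intercept column plus the terms of the requested model number."""
    ln_q = np.log(np.asarray(flow, dtype=float)) - ln_q_center
    t = np.asarray(dtime, dtype=float) - t_center
    columns = {
        "lnq": ln_q, "lnq2": ln_q ** 2,
        "t": t, "t2": t ** 2,
        "sin": np.sin(2 * np.pi * t), "cos": np.cos(2 * np.pi * t),
    }
    terms = [columns[name] for name in MODEL_TERMS[model]]
    return np.column_stack([np.ones_like(ln_q)] + terms)


def fit_log_load(flow, dtime, load, model):
    """Least-squares fit of ln(load); returns coefficients and centers."""
    flow = np.asarray(flow, dtype=float)
    dtime = np.asarray(dtime, dtype=float)
    # Assumed centering: sample means of ln(flow) and decimal time.
    centers = (np.log(flow).mean(), dtime.mean())
    X = design_matrix(flow, dtime, model, *centers)
    coef, *_ = np.linalg.lstsq(X, np.log(np.asarray(load, dtype=float)), rcond=None)
    return coef, centers


def predict_load(flow, dtime, model, coef, centers):
    """Back-transform the fitted log load (no bias correction applied)."""
    X = design_matrix(flow, dtime, model, *centers)
    return np.exp(X @ coef)
```

Fitting on a subsampled dataset and then calling `predict_load` on the full daily streamflow record gives daily load estimates that can be summed to annual loads.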

LOADIN employs a regression model comprised of streamflow, decimal time, and eight model coefficients (Equation (13)). LOADIN finds values of the eight model coefficients that minimize differences between modeled and observed loads using a genetic algorithm and given water quality datasets.

where, _{i}_{i}

The regression models in LOADEST assume that instantaneous load is an exponential function of data variables (e.g., streamflow) [

Annual pollutant loads estimated by the regression models in LOADEST and LOADIN were compared to annual loads calculated by multiplying daily streamflow by concentration measurements (Equation (14)).

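$$L_{Yr} = \sum_{i} Q_i\, C_i \tag{14}$$

where $Q_i$ is the streamflow on day $i$, $C_i$ is the measured concentration on day $i$, and $L_{Yr}$ is the observed annual load, with units converted consistently.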

A ratio was used to compare the estimated annual pollutant loads to the observed annual pollutant loads (Equation (15)). A ratio greater than 1.0 indicates that a regression model overestimated annual pollutant load, a ratio smaller than 1.0 indicates that a regression model underestimated annual pollutant load, and a ratio of 1.0 indicates estimated annual pollutant load is the same as observed annual pollutant load.
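The comparison can be sketched as follows: daily observed loads are formed as streamflow times concentration, summed by year, and divided into the estimated annual loads. This is an illustration only. Grouping by calendar year and the `unit_factor` parameter are assumptions (the source does not state whether calendar or water years were used), and the function names are hypothetical.

```python
import pandas as pd


def annual_load(daily_load: pd.Series) -> pd.Series:
    """Sum daily loads by calendar year (assumed grouping)."""
    return daily_load.groupby(daily_load.index.year).sum()


def load_ratio(estimated_daily: pd.Series, flow: pd.Series,
               concentration: pd.Series, unit_factor: float = 1.0) -> pd.Series:
    """Ratio of estimated to observed annual load for each year.

    A value above 1.0 means the model overestimated the annual load,
    and a value below 1.0 means it underestimated it.
    """
    observed_daily = flow * concentration * unit_factor
    return annual_load(estimated_daily) / annual_load(observed_daily)
```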

Both accuracy and precision are critical in evaluation of regression models and sampling frequencies, because accuracy indicates the degree of systematic error and precision indicates the degree of dispersion [

LOADIN and the regression models numbered 1, 3, 4, and 7 in LOADEST (LT(1), LT(3), LT(4), and LT(7) in

A distinct feature was found in annual sediment load estimation. More frequent sampling led to more accurate but less precise annual sediment load estimates with LOADEST models numbered 1, 3, 4, and 7, when comparing the weekly sampling frequency to the monthly sampling frequency (

95% Confidence Intervals of Estimated to Observed Pollutant Load Ratios. (M: Monthly sampling, B: Biweekly sampling, W: Weekly sampling); (

LOADIN and regression models numbered 1, 3, 4, and 7 in LOADEST provided more accurate and more precise annual phosphorus load estimates than models numbered 2, 5, 6, 8, and 9 in LOADEST with fixed interval sampling frequencies (

In nitrogen load estimation, LOADEST provided poorer load estimates than LOADIN. In particular, regression models numbered 6, 8, and 9 in LOADEST were less precise than the other regression models with monthly fixed interval sampling. Compared to the annual nitrogen load estimates made by LOADEST, LOADIN provided more accurate and more precise load estimates with both fixed interval and mixed interval sampling frequencies. The biweekly mixed interval sampling frequency with LOADIN gave the most accurate and precise annual nitrogen load estimates.

The annual pollutant load estimates made by the automatic model selection of LOADEST provided neither the best precision nor the best accuracy. Annual sediment and phosphorus load estimates made by regression models numbered 1, 3, 4, and 7 in LOADEST were more accurate and precise than those of the automatically selected regression model. Regression models numbered 2, 5, 6, 8, and 9 provided less precise annual sediment and phosphorus load estimates than the other regression models; however, these models were often chosen by the automatic model selection in LOADEST. As a result, the annual pollutant load estimates from automatic model selection were less precise.

Pollutant regression models contain assumptions, and thus flow and water quality datasets need to match the assumptions to obtain accurate and precise pollutant load estimates. Sediment concentration data in the study matched the assumptions of LOADEST, and thus LOADEST provided more accurate and more precise pollutant load estimates than LOADIN. On the other hand, nitrogen concentration data in the study displayed a poor relationship to streamflow data. Nitrogen concentration data did not match the assumptions of LOADEST. Therefore, LOADEST is better for pollutant load estimation when flow and concentration (or load) have a strong relationship.

95% Confidence Intervals of Estimated to Observed Pollutant Load Ratio for Annual Phosphorus Load Estimates by Fixed Sampling.

Model | Monthly fixed interval | Biweekly fixed interval | Weekly fixed interval |
---|---|---|---|
LD | 0.791 ± 0.016 | 0.792 ± 0.026 | 0.767 ± 0.028 |
LT(1) | 0.828 ± 0.011 | 0.837 ± 0.014 | 0.839 ± 0.016 |
LT(3) | 0.842 ± 0.012 | 0.854 ± 0.014 | 0.858 ± 0.016 |
LT(4) | 0.851 ± 0.012 | 0.863 ± 0.015 | 0.868 ± 0.016 |
LT(7) | 0.872 ± 0.013 | 0.889 ± 0.015 | 0.898 ± 0.016 |
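Entries of the form "mean ± half-width" such as those in the table above can be produced from a set of estimated-to-observed ratios as sketched below. The source does not state how its intervals were computed, so the normal approximation (z = 1.96) and the function name `ratio_interval` are assumptions. Read this way, the distance of the mean from 1.0 reflects accuracy and the half-width reflects precision.

```python
import math
import statistics


def ratio_interval(ratios, z=1.96):
    """Mean and 95% half-width of load ratios (normal approximation, assumed)."""
    n = len(ratios)
    mean = statistics.fmean(ratios)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * statistics.stdev(ratios) / math.sqrt(n)
    return mean, half_width
```

For example, `ratio_interval(ratios)` applied to all station-year ratios of one model and one sampling frequency would give a single table cell.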

Both streamflow and water quality data are required to compute pollutant loads. However, water quality data are typically collected less frequently than streamflow, since it is costly to collect and to analyze water quality samples. Therefore, regression models are frequently used to estimate pollutant loads from limited water quality data. LOADEST and LOADIN were evaluated with subsampled water quality datasets from daily water quality data for nitrogen, phosphorus, and sediment.

Four distinct features were observed in the study. The first was that more frequent sampling did not necessarily lead to more accurate and more precise annual pollutant load estimates. The second was that supplementing fixed interval water quality data with water quality data collected from storm events improved both accuracy and precision in annual pollutant load estimates for all water quality parameters. The third was that the automatic model selection in LOADEST did not necessarily lead to more accurate and precise annual pollutant load estimates. The last was that the regression models behaved differently in annual pollutant load estimation for different water quality parameters.

The study indicates that regression models numbered 1, 3, 4, and 7 in LOADEST are more accurate and precise for annual phosphorus and sediment load estimation, that sampling for phosphorus and sediment needs to include storm samples, and that LOADIN is applicable to annual nitrogen load estimation.

The authors declare no conflict of interest.