A Web-based Tool to Interpolate Nitrogen Loading Using a Genetic Algorithm

Water quality data may not be collected at a high frequency, nor over the full range of streamflow data. For instance, water quality data are often collected monthly, biweekly, or weekly, since collecting and analyzing water quality samples is costly compared to collecting streamflow data. Regression models are often used to interpolate pollutant loads from measurements made intermittently. The Web-based Load Interpolation Tool (LOADIN) was developed to provide user-friendly interfaces and to allow use of streamflow and water quality data from the U.S. Geological Survey (USGS) via web access. LOADIN has a regression model that assumes instantaneous load is composed of the pollutant load based on streamflow and the pollutant load variation within the period. The regression model has eight coefficients, which are determined by a genetic algorithm using measured water quality data. LOADIN was applied to eleven water quality datasets from USGS gage stations located in Illinois, Indiana, Michigan, Minnesota, and Wisconsin, with drainage areas from 44 km² to 1,847,170 km². Measured loads were calculated by multiplying nitrogen data by the streamflow data associated with the measured nitrogen data. The estimated nitrogen loads were evaluated against the measured loads using Nash-Sutcliffe Efficiency (NSE) and coefficient of determination (R²). NSE ranged from 0.45 to 0.91, and R² ranged from 0.51 to 0.91 for nitrogen load estimation.


Introduction
Nutrients in streams or rivers are products of natural phenomena related not only to the stream but also to watershed characteristics [1,2]. However, excessive nutrient inputs to a stream lead to eutrophication or over-enrichment of waters and destroy aquatic ecosystems [3,4]. The US Clean Water Act was established to regulate pollutants, manage watersheds, and improve surface water quality to meet its water quality standards. The Act charges state and federal agencies with developing total maximum daily loads (TMDLs) for waterbodies for each pollutant [5,6]. In addition, the Act indicates that jurisdictions that have significantly contaminated waters need to establish priority rankings for waters on the lists and to develop TMDLs for those waters. Flow duration curves (FDC) and load duration curves (LDC) are a fundamental analysis approach for development of TMDLs, and they require streamflow and water quality data for the watershed of concern [7]. FDC and LDC are used to develop pollutant load reduction strategies for point and nonpoint source pollutants to meet TMDL targets. To develop an LDC, the water quality data associated with streamflow need to have the same temporal resolution as the streamflow data. However, water quality data are typically intermittent, since collecting water quality data typically requires more effort and expense than collecting streamflow data. Therefore, regression models are often used to interpolate pollutant loads from measurements made intermittently for a certain period of time [8], and they provide acceptable pollutant load estimates [9][10][11]. Typically, regression models determine the relationship between streamflow and water quality data and are often simple linear forms using logarithmic transformations [12][13][14]. Load Estimator (LOADEST) [15] is used to estimate pollutant loads from streamflow and decimal time, which is a fraction representing date and time.
LOADEST has 11 regression models and determines the model coefficients based on three statistical methods. Adjusted Maximum Likelihood Estimation (AMLE) and Maximum Likelihood Estimation (MLE) allow use of water quality datasets containing censored data, and assume that the model residual (or error) follows a normal distribution [15]. The regression models in LOADEST have two types of terms: one is logarithmic streamflow, and the other is decimal time. Logarithmic streamflow is used to identify the relationship between instantaneous pollutant load and streamflow, and decimal time is used to account for temporal variation in pollutant loads. LOADEST has been used to estimate suspended sediment and total phosphorus loads to evaluate water quality and biological responses to the implementation of best management practices [16][17][18][19][20][21], and it provided reasonable load estimates.
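As a rough illustration of the "simple linear forms using logarithmic transformations" mentioned above, the sketch below fits ln(load) = a0 + a1 ln(streamflow) by ordinary least squares. It is a simplification under stated assumptions: it uses ordinary least squares rather than AMLE or MLE, it has no decimal time terms and no retransformation bias correction, and the function names `fit_log_linear` and `predict_load` are hypothetical. It is not LOADEST itself.

```python
import math

def fit_log_linear(flows, loads):
    """Ordinary least squares fit of ln(load) = a0 + a1 * ln(flow).

    Assumption: plain OLS on log-transformed values; flows and loads
    must all be positive.
    """
    xs = [math.log(q) for q in flows]
    ys = [math.log(load) for load in loads]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1

def predict_load(a0, a1, flow):
    """Back-transform the fitted line to a load estimate for one flow value."""
    return math.exp(a0 + a1 * math.log(flow))

# Synthetic example: loads that follow load = 3 * flow**1.5 exactly.
example_flows = [1.0, 2.0, 4.0, 8.0]
example_loads = [3.0 * q ** 1.5 for q in example_flows]
example_a0, example_a1 = fit_log_linear(example_flows, example_loads)
```

A model of this kind can only reproduce loads that vary with streamflow; any seasonal signal has to come from additional time terms.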
However, LOADEST overestimated loads by up to 500% compared to measured annual total nitrogen loads [22]. Moreover, although LOADEST requires only streamflow and water quality data, significant effort is often required to prepare these inputs and to handle the data format for LOADEST runs. Thus, a web-based tool using a regression model was developed to estimate nitrogen loads associated with streamflow and to provide ready access to water quality data. The web-based tool was applied at eleven U.S. Geological Survey (USGS) gage station locations with nitrogen data to determine regression model behavior.

Web Interface Development
A web-based tool for pollutant load interpolation (LOADIN) [23] was developed in this study. Its features include automated, simple access to measured streamflow and water quality data via the web, and it requires no installation or updates on a desktop computer. LOADIN requires two inputs. One is streamflow data, used to calibrate the regression model coefficients and to estimate pollutant loads associated with streamflow; the other is water quality data. Therefore, the interface consists of two tables for the two inputs (Figure 1). The input data can be prepared by the user or retrieved through web access to the U.S. Geological Survey (USGS). USGS allows web access [24] to retrieve water quality data; only the USGS station number is required to request water quality data. LOADIN provides a map interface for the USGS gage stations for the entire U.S., using a database of the locations with streamflow and water quality data that is built and stored on the web server (Figure 2). LOADIN displays the locations of USGS gage stations, requests water quality data from the USGS server when the user finds and selects the USGS station of interest on the map, and parses the data into the water quality data table.

Regression Model to Estimate Loads
LOADIN uses a regression model (Equation (1)) to estimate instantaneous load using streamflow, decimal time, and eight coefficients. The regression model is composed of three parts. The first part includes streamflow and model coefficients and represents the variation of pollutant load with streamflow. The second part is composed of streamflow, model coefficients, and decimal time within a sine function. The third part is composed of streamflow, model coefficients, and decimal time within a cosine function. The second and third parts represent the variation of pollutant load with time (or season). LOADIN, similar to LOADEST, uses a regression model to estimate pollutant loads. However, the regression model in LOADIN assumes that instantaneous load is composed of the pollutant load from streamflow (i.e., Qi in Equation (1)) and the pollutant load variability for the given period as decimal time (i.e., Ti in Equation (1)), whereas the regression models in LOADEST assume that instantaneous load is an exponential function of streamflow and pollutant load variability for the given period.
Park and Engel [22] and Park [25] applied the regression models in LOADIN and LOADEST to annual nitrogen, phosphorus, and sediment load estimation. Phosphorus and sediment data (mg/L) were related to streamflow data (m³/s), while nitrogen data (mg/L) displayed seasonal variance and generally poor relationships with streamflow data. These differing relationships between water quality and streamflow data led to different regression model behaviors. The regression model in LOADIN did not provide reasonable annual phosphorus and sediment load estimates, whereas the regression models in LOADEST did. In other words, the phosphorus and sediment loads followed the assumption of the regression models in LOADEST that pollutant loads are a function of streamflow. On the other hand, LOADEST provided poorer annual nitrogen load estimates than LOADIN, since the nitrogen data displayed generally poor relationships with streamflow data. Thus, it was concluded that streamflow and water quality datasets need to match the assumptions of the regression model and that regression models need to be selected based on the water quality parameter [20,21].
where Loadi is the pollutant load at time step i; C0-7 are coefficients determined by an optimization algorithm; Qi is streamflow at time step i; and Ti is the decimal time at which streamflow was measured.
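To make the three-part structure concrete, the sketch below evaluates one possible model with a streamflow part, a sine part, and a cosine part, using eight coefficients. The functional form is an assumption chosen only to match the verbal description (it should not be read as Equation (1)), and the decimal time definition (year plus elapsed fraction of the year) is likewise an assumption. The names `decimal_time` and `assumed_load` are hypothetical.

```python
import math
from datetime import date

def decimal_time(d):
    """Year plus the elapsed fraction of that year.

    Assumption: this is one common way to express a date as decimal time.
    """
    start = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start).days
    return d.year + (d - start).days / days_in_year

def assumed_load(coeffs, q, t):
    """Illustrative three-part load model with eight coefficients (C0..C7).

    Assumed form, for illustration only:
        C0*Q^C1 + C2*Q^C3*sin(2*pi*C4*T) + C5*Q^C6*cos(2*pi*C7*T)
    """
    c0, c1, c2, c3, c4, c5, c6, c7 = coeffs
    flow_part = c0 * q ** c1                                    # load driven by streamflow
    sine_part = c2 * q ** c3 * math.sin(2 * math.pi * c4 * t)   # time variation, sine
    cosine_part = c5 * q ** c6 * math.cos(2 * math.pi * c7 * t) # time variation, cosine
    return flow_part + sine_part + cosine_part

# With the sine and cosine amplitudes set to zero, only the streamflow part remains.
example_t = decimal_time(date(2000, 7, 2))
example_load = assumed_load([2.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0], 3.0, example_t)
```

Setting the two time-dependent amplitudes to zero reduces the sketch to a pure streamflow relationship, which is the kind of behavior the sine and cosine parts are meant to relax for seasonally varying nitrogen data.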
LOADIN determines the eight coefficients that minimize differences between estimated and measured loads using a genetic algorithm. Developed by Holland [26], the genetic algorithm has solved sophisticated problems and has been applied in various areas, such as business, engineering, and science [27,28]. To determine the coefficients, three operators work through 500 generations: the selection operator, which reproduces the population with fitter individuals; the crossover operator, which creates offspring by combining strong individuals; and the mutation operator, which alters partial characteristics of offspring at random. LOADIN calculates measured loads by multiplying water quality data by the streamflow data associated with the measured water quality data, and determines the regression model coefficients using a genetic algorithm that compares estimated loads from the regression model to measured loads on the days for which measured loads are available.
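The selection, crossover, and mutation operators described above can be sketched as a small real-valued genetic algorithm. This is a minimal illustration, not the implementation in LOADIN: the tournament selection, uniform crossover, Gaussian mutation, elitism, population size, coefficient bounds, mutation rate, and sum-of-squared-errors objective are all assumptions, and `genetic_fit` and `sum_squared_error` are hypothetical names. Only the default of 500 generations comes from the text.

```python
import random

def sum_squared_error(coeffs, model, samples):
    """Sum of squared differences between estimated and measured loads.

    samples holds (streamflow, decimal_time, measured_load) tuples for the
    days on which measured loads are available.
    """
    return sum((model(coeffs, q, t) - load) ** 2 for q, t, load in samples)

def genetic_fit(model, samples, n_coeffs, bounds=(-5.0, 5.0), pop_size=60,
                generations=500, mutation_rate=0.1, seed=0):
    """Search for coefficients that minimize the error (assumed settings)."""
    rng = random.Random(seed)
    low, high = bounds

    def error(individual):
        return sum_squared_error(individual, model, samples)

    population = [[rng.uniform(low, high) for _ in range(n_coeffs)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=error)  # fittest (lowest error) first

        def select():
            # Selection: tournament of three; the lowest rank index is the fittest.
            return ranked[min(rng.sample(range(pop_size), 3))]

        children = [ranked[0][:]]  # elitism: keep the best individual unchanged
        while len(children) < pop_size:
            mother, father = select(), select()
            child = []
            for gene_m, gene_f in zip(mother, father):
                # Crossover: take each coefficient from one parent at random.
                gene = gene_m if rng.random() < 0.5 else gene_f
                # Mutation: occasionally perturb a coefficient at random.
                if rng.random() < mutation_rate:
                    gene += rng.gauss(0.0, 0.1 * (high - low))
                child.append(min(high, max(low, gene)))
            children.append(child)
        population = children
    return min(population, key=error)
```

As a usage example, calling `genetic_fit(model, samples, 8)` with an eight-coefficient model function would return a list of eight fitted coefficients. Because the best individual is carried forward unchanged, the lowest error in the population cannot increase from one generation to the next.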

Application of LOADIN
Since the regression model in LOADIN provided reasonable total nitrogen estimates, LOADIN was applied to nitrogen data (USGS Water Quality Parameter Name: Nitrate plus nitrite, water, filtered, milligrams per liter as nitrogen, USGS Water Quality Parameter Code: 00631) from 11 USGS gage stations (Figure 3). The USGS gage stations are located in Illinois, Indiana, Michigan, Minnesota, and Wisconsin, and have drainage areas from 44 km² to 1,847,180 km² (Table 1). Both streamflow and nitrogen data for the eleven USGS stations were retrieved by LOADIN, and the periods for nitrogen load estimates were defined based on the measured data. Streamflow data were daily, but nitrogen data intervals ranged from eight days (on average, USGS gage station 05427948) to fifty-nine days (on average, USGS gage station 04063700) (Table 2). Therefore, measured loads were calculated by multiplying nitrogen data by the streamflow data associated with the measured nitrogen data. The estimated nitrogen loads for the days on which nitrogen data were measured were evaluated by Nash-Sutcliffe Efficiency (NSE) and coefficient of determination (R²) (Table 2; Figure 4). For example, 231 measured daily loads were computed by multiplying nitrogen concentration data by streamflow data, since USGS gage station 03353637 had 231 measured nitrogen data points. The corresponding 231 estimated daily loads were extracted from the 5228 estimated daily loads (i.e., the same as the number of streamflow data points) to compare estimated nitrogen loads to measured nitrogen loads on these days. The NSE of estimated nitrogen loads to measured nitrogen loads ranged from 0.45 (USGS gage station 05427948) to 0.91 (USGS gage station 04101500), and the R² ranged from 0.51 (USGS gage station 05427948) to 0.91 (USGS gage station 04101500) (Table 2; Figure 4b,g). More frequent nitrogen data (or larger numbers of nitrogen data points) did not necessarily lead to higher NSE or R².
For instance, the NSE and R² for USGS gage station 04101500 (63 nitrogen data points) were 0.91 and 0.91, respectively; however, the NSE and R² for USGS gage station 05427948 (218 nitrogen data points) were 0.45 and 0.51, respectively. In addition, drainage area was not related to the regression model behavior (Table 2; Figure 4b,g).

Table 2. Nash-Sutcliffe Efficiency (NSE) and coefficient of determination (R²) of estimated loads to measured loads.
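The evaluation described above (measured load as concentration multiplied by streamflow, then NSE and R² on the days with measurements) can be sketched as follows. The unit conversion assumes concentration in mg/L and streamflow in m³/s, giving load in kg/day; streamflow reported in other units would need a different factor. R² is computed here as the squared Pearson correlation, which is an assumption about the definition used. The function names are hypothetical and the numbers are synthetic.

```python
# (mg/L) * (m^3/s) -> kg/day: 86,400 s/day divided by 1,000 g/kg.
# Assumption: concentration in mg/L and streamflow in m^3/s.
KG_PER_DAY_FACTOR = 86.4

def measured_load(concentration_mg_l, flow_m3_s):
    """Daily load (kg/day) from a concentration and the same-day streamflow."""
    return concentration_mg_l * flow_m3_s * KG_PER_DAY_FACTOR

def nse(measured, estimated):
    """Nash-Sutcliffe Efficiency: 1 - sum((obs - est)^2) / sum((obs - mean_obs)^2)."""
    mean_obs = sum(measured) / len(measured)
    residual = sum((o - e) ** 2 for o, e in zip(measured, estimated))
    variance = sum((o - mean_obs) ** 2 for o in measured)
    return 1.0 - residual / variance

def r_squared(measured, estimated):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(measured)
    mean_o = sum(measured) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(measured, estimated))
    var_o = sum((o - mean_o) ** 2 for o in measured)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov * cov / (var_o * var_e)

# Synthetic example: three sampled days (concentration in mg/L, flow in m^3/s).
example_measured = [measured_load(c, q) for c, q in [(1.0, 1.0), (2.0, 1.5), (0.5, 4.0)]]
example_estimated = [90.0, 250.0, 180.0]  # hypothetical model output for the same days
example_nse = nse(example_measured, example_estimated)
example_r2 = r_squared(example_measured, example_estimated)
```

NSE equals 1 for a perfect match and 0 when the estimates are no better than the mean of the measured loads, so it penalizes bias in a way that R² alone does not.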
The regression model in LOADIN contains terms for time variation, and therefore the time distribution of the nitrogen data collected was explored (Table 3). Sixty percent of the nitrogen data at USGS gage station 05427948 were collected from March to June, and twenty percent were collected in March. LOADIN provided poor load estimates for USGS gage station 05427948.
Similar to USGS gage station 05427948, forty-nine percent of the nitrogen data at USGS gage station 03303280 were collected from March to June, but the percentage for any single month did not exceed thirteen percent. Therefore, the quantity of nitrogen data was not biased toward a certain month. Twenty-one percent of the nitrogen data for USGS gage station 04101500 were collected in December, but, unlike USGS gage station 05427948, there were no consecutive months in which extensive nitrogen data were collected. LOADIN provided reasonable load estimates for USGS gage stations 03303280 and 04101500.
If a water quality dataset is biased toward a certain period, LOADIN may provide poor load estimates. Since ten of the eleven NSEs of the estimated loads to the measured loads (all except that for the nitrogen data from USGS station 05427948) were greater than 0.5, it was concluded that LOADIN performed well in nitrogen load estimation [29], as long as the water quality dataset is not biased toward a certain period. For example, if nitrogen and streamflow datasets were collected only during summer, estimating nitrogen loads for winter using LOADIN should be avoided.

Table 3. Percentage of nitrogen data collected in each month.
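A monthly breakdown of the kind summarized in Table 3 can be computed directly from the sampling dates, which gives a quick check for a dataset that is biased toward a certain period before loads are estimated. The sketch below is illustrative only: the function name `monthly_percentages` is hypothetical and the dates are synthetic, not taken from any USGS station.

```python
from collections import Counter
from datetime import date

def monthly_percentages(sample_dates):
    """Percentage of samples collected in each calendar month (1..12)."""
    counts = Counter(d.month for d in sample_dates)
    total = len(sample_dates)
    return {month: 100.0 * counts.get(month, 0) / total for month in range(1, 13)}

# Synthetic sampling dates: two in March, one in June, one in December.
example_dates = [date(2005, 3, 4), date(2005, 3, 18), date(2005, 6, 2), date(2005, 12, 9)]
example_percentages = monthly_percentages(example_dates)
march_to_june_share = sum(example_percentages[m] for m in range(3, 7))
```

A large share concentrated in a few consecutive months, as in this synthetic example, is the kind of pattern that would suggest caution when interpreting loads estimated for the remaining months.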

Conclusions
Although streamflow and water quality data associated with streamflow are required for pollutant load computations, collecting and analyzing water quality samples is costly and requires significant effort compared to collecting streamflow data. Therefore, water quality data are often collected less frequently than streamflow data. Regression models are often used to estimate (or interpolate) pollutant loads from limited water quality and streamflow data.
A web-based tool, LOADIN, to interpolate pollutant loads was developed in the study. LOADIN has several benefits in pollutant load estimation. The first is that it is easy to operate and use the tool as no installation or update by the users is required since it is a web-based tool. The second is that LOADIN accesses nationwide streamflow and water quality data from the USGS via web access.
LOADIN uses a regression model that assumes instantaneous load is composed of the pollutant load from streamflow and the pollutant load variability for the period. LOADIN was applied to eleven nitrogen datasets from USGS gage stations, and the estimated nitrogen loads were evaluated against measured nitrogen loads using NSE and R². Although LOADIN provided poor load estimates when the water quality dataset was biased toward a certain period, NSEs were greater than 0.5 for ten of the eleven USGS gage stations. Moreover, NSEs were greater than 0.8 for seven of the USGS gage stations. This indicates that LOADIN performance for nitrogen load estimation was satisfactory, unless the water quality dataset was biased toward a certain period.

Author Contributions
Youn Shik Park contributed to developing the regression model, programming the web interfaces, and assessing the feasibility of the genetic algorithm. Bernie Engel contributed to configuring the text on methods, results, and discussion. Youn Shik Park initiated the manuscript writing, and Bernie Engel made improvements to the English writing.