Distribution-Based Calibration of a Stormwater Quality Model

: Stormwater quality models are usually calibrated using observed pollutographs. As current models still rely on simpliﬁed model concepts for pollutant accumulation and wash-off, calibration results for continuous pollutant concentrations are highly uncertain. In this paper, we introduce an innovative calibration approach based on total suspended solids (TSS) event load distribution. The approach is applied on stormwater quality models for a ﬂat roof and a parking lot for which reliable distributions are available. Exponential functions are employed for both TSS buildup and wash-off. Model parameters are calibrated by means of an evolutionary algorithm to minimize the distance between a parameterized lognormal distribution function and the cumulated distribution of simulated TSS event loads. Since TSS event load characteristics are probabilistically considered, the approach especially respects the stochasticity of TSS buildup and wash-off and, therefore, improves conventional stormwater quality calibration concepts. The results show that both experimental models were calibrated with high goodness-of-ﬁt (Kolmogorov–Smirnov test statistic: 0.05). However, it is shown that events with high TSS event loads (>0.8 percentile) are generally underestimated. While this leads to a relative deviation of − 28% of total TSS loads for the parking lot, the error is compensated for the ﬂat roof (+5%). Calibrated model parameters generally tend to generate wash-off proportional to runoff, which is indicated by mass-volume curves. The approach itself is, in general, applicable and creates a new opportunity to calibrate stormwater quality models especially when calibration data is limited.


Introduction
Stormwater quality models are essential tools to support the planning of urban water infrastructure. Having reliable model outputs is of high relevance since infrastructural stormwater measures are cost-intensive and have a long service life. Available stormwater quality models still replicate natural pollutant processes in a simplified manner, which in turn leads to uncertain model results [1,2].
Pollutant processes are commonly differentiated into two conceptual phases, (i) buildup and (ii) wash-off, which are both deterministically described by empirical formulas [3]. These model concepts assume that the amount of pollutant masses at the surface generally increases to a maximum as a function of antecedent dry weather periods and decreases as a consequence of rainfall/runoff.
Previous studies, however, demonstrated the inadequacy of this simplified concept to continuously model pollutant concentrations. For example, Muschalla et al. [4] calibrated a buildup/wash-off A refinement of the exponential wash-off equation by incorporating stochastic fluctuations is analyzed by Daly et al. [19]. Here, the coefficient dominating the wash-off process is assumed to be random and consequently addressed by adding Gaussian noise. A good agreement to empirical distributions for TSS and TN (total nitrogen) is reported, but this required a large amount of data. Qin et al. [20] obtained the frequency distributions of (i) the event pollutant load; (ii) the event mean concentration; and (iii) the peak concentration of COD from a continuous simulation of an urbanized catchment. Exponential equations for buildup and wash-off were employed and calibrated with regard to continuous COD concentration measurement data using a genetic algorithm. It is, however, mentioned that the predictive power is limited because the study site undergoes further developments. Annual loads for micropollutants have been estimated based on theoretical distribution functions of the event mean concentration for three residential catchments by Hannouche et al. [21].
The literature shows various approaches to take the stochasticity of pollutant processes into account. While early studies primarily used probabilistic methods to overcome the scarcity of stormwater quality data, recent studies using continuous quality data tend to admit the variability of natural pollutant processes by employing stochastic concepts. With regard to continuous long-term stormwater quality simulations, the presented alternative modelling approaches incorporate stochasticity through (i) probabilistic description and transformation of model input data (rainfall-runoff); (ii) modification of empirical pollutant buildup/wash-off equations; (iii) distribution-based parameterization of intra-event dynamics; and (iv) probabilistic analysis of model results after calibration (post-processing).
It has, however, not been investigated whether available stormwater quality models can be calibrated towards probabilistic pollutant characteristics. Using a distribution-based calibration proposes an additional alternative to incorporate pollutant stochasticity. In contrast to approaches already introduced, this method maintains existing model concepts and avoids expensive post-processing.
The present paper reports on the development of an innovative stormwater quality model calibration approach using TSS event load distribution. The approach is demonstrated on two real-world models for which reliable distributions are available. Calibrated models are finally used to estimate annual TSS loads, which is a key parameter for emission control in several stormwater management guidelines.

Concept of Model Calibration
In this study, stormwater quality models are calibrated using a distribution-based approach depicted in Figure 1. Instead of replicating single-event characteristics or pollutographs, the approach aims to minimize the difference between observed and simulated TSS event load distribution. Since observed TSS event load distributions can be well approximated with theoretical distribution functions [22], the calibration uses a site-specific parameterized lognormal distribution as reference (cf. Section 2.3). The approach focuses on probabilistic event load distribution and places less emphasis on intra-event dynamics. Model results therefore need to be analyzed by means of mass-volume curves (MV curves) [23].

Experimental Sites and Measurement Data
Stormwater runoff and quality processes of a flat roof (FR, 50 m 2 ) and a parking lot (PL, 2350 m 2 ) are considered. At both sites, a long-term online monitoring campaign has been conducted and continuous runoff and quality data at the outlet of the catchment was collected [24]. Stormwater quality data are available from March 2013-November 2015 (stormwater quality observation period). Rainfall measurement is still operated.
Based on site-specific correlation functions for the relationship of TSS and turbidity, continuous TSS time series were estimated from online turbidity measurements. Measurement data were subjected to statistical analysis which is presented and discussed in Leutnant et al. [24]. The authors successfully measured 65 rainfall-runoff events at site FR and 46 events at site PL. The average of all event mean concentrations (µEMC) is 33 mg L −1 at site FR and 60 mg L −1 at site PL. Summing up the estimated TSS loads of all observed events yields 11.3 g m −2 at site FR and 10.6 g m −2 at site PL. Table 1 shows a summary of site and stormwater quality measurement data. Table 2 summarizes rainfall characteristics within and after the stormwater quality observation period. Detailed descriptive event statistics of rainfall, runoff, and TSS loads are given in Leutnant et al. [24].

Experimental Sites and Measurement Data
Stormwater runoff and quality processes of a flat roof (FR, 50 m 2 ) and a parking lot (PL, 2350 m 2 ) are considered. At both sites, a long-term online monitoring campaign has been conducted and continuous runoff and quality data at the outlet of the catchment was collected [24]. Stormwater quality data are available from March 2013-November 2015 (stormwater quality observation period). Rainfall measurement is still operated.
Based on site-specific correlation functions for the relationship of TSS and turbidity, continuous TSS time series were estimated from online turbidity measurements. Measurement data were subjected to statistical analysis which is presented and discussed in Leutnant et al. [24]. The authors successfully measured 65 rainfall-runoff events at site FR and 46 events at site PL. The average of all event mean concentrations (µEMC) is 33 mg L −1 at site FR and 60 mg L −1 at site PL. Summing up the estimated TSS loads of all observed events yields 11.3 g m −2 at site FR and 10.6 g m −2 at site PL. Table 1 shows a summary of site and stormwater quality measurement data. Table 2 summarizes rainfall characteristics within and after the stormwater quality observation period. Detailed descriptive event statistics of rainfall, runoff, and TSS loads are given in Leutnant et al. [24].

Theoretical Distribution Function for Site-Specific TSS Event Loads
Leutnant et al. [22] use TSS event load data from Leutnant et al. [24] and describe site-specific empirical TSS event load distributions by means of theoretical distribution functions. The authors demonstrated that the two-parameter lognormal distribution approximates the empirical TSS event load distribution well and can therefore be used to probabilistically describe TSS event loads. In their study, parameters of the lognormal distribution are optimized with respect to a likelihood function and the goodness-of-fit is evaluated by means of Kolmogorov-Smirnov and Anderson-Darling test statistics. Table 3 shows the general lognormal distribution formula and optimized parameter for µ (meanlog) and σ (sdlog) for sites FR and PL taken from Leutnant et al. [22]. Table 3. Lognormal distribution function and optimized parameters to describe TSS event load distribution for site FR and PL from Leutnant et al. [22].

Flat Roof
Parking Lot

Stormwater Quality Modelling
In this study, pollutant processes for buildup and wash-off are modelled with the widely used exponential equations implemented in the stormwater management model SWMM5 [25]. Buildup B(t) is mathematically described as a function of antecedent dry weather days t (Equation (1)). Pollutant wash-off W(t) is expressed as a function of the current runoff rate q(t) and available masses on the surface B(t) (Equation (2)). Both functions offer two individual parameters to be calibrated. Additionally, the initial buildup B 0 at the beginning of the simulation (t = 0) needs to be estimated. Table 4 shows the parameter used for calibration. Corresponding parameter ranges were extracted from the literature [5,26] and harmonized with the authors' experience.
, t denotes the number of preceding dry weather days.

Parameter Estimation and Goodness-of-Fit Assessment
For both sites, model parameters affecting runoff generation and hydrograph characteristics are initially calibrated by means of the multi-objective algorithm NSGA-2 [27]. The algorithm allows to optimize multiple objectives simultaneously and identifies Pareto-optimal solutions from which a compromise can be drawn. Here, a single objective is defined as an event-specific Nash-Sutcliffe Efficiency (NSE) [28]. Eight representative rainfall-runoff events were taken into account which consequently yields eight objectives to be optimized. Rainfall events have a minimum depth of 2 mm and a minimum intensity in 60 min of 2.5 mm h −1 . The compromise solution follows the L2-metric [29] which calculates the Euclidean distance of all pareto-optimal solutions to an ideal solution. The solution with the smallest Euclidean distance is considered as a compromise. Model parameters (i) surface roughness; (ii) depression storage and (iii) characteristic width of the overland flow are considered for calibration. The calibration yielded an average event-specific NSE of 0.73 for site FR and 0.72 for site PL (the results of water quantity calibration are not further discussed in this paper).
Once optimized parameters of runoff calibration are estimated, model parameters for pollutant buildup and wash-off (cf. Table 4) are optimized. The calibration aim is to fit the simulated TSS event load distribution to the parameterized lognormal distribution. For this purpose, the Kolmogorov-Smirnov (KS) statistic D n which numerically describes the equality of two distributions and tests whether a sample follows a specific distribution [30] is used as the objective function. This means that the smaller the KS statistic D n becomes, the higher the goodness-of-fit of the calibration. The KS statistic ranges from 0 ≤ D n ≤ 1. As this calibration only considers a single objective, a single objective optimization algorithm is used. A differential evolution algorithm [31] implemented by Ardia et al. [32] is applied. The following computation steps are performed: Simulation with a new set of parameters generated by the optimization algorithm.

2.
Determine and split events from simulation time series which satisfy selection criteria (Table 5). An event starts when runoff starts and ends if the maximum runoff within a predefined window is 0.

3.
Computation of runoff volume and TSS load per event.

4.
Selection of events which exceeds a minimum runoff volume (Table 5). This step is introduced because the small size of the catchments leads to a significant number of events with numerically low runoff volume which would result in weights that are disproportionate to these events.

5.
Computation of the cumulative TSS event load distribution function for the events remaining. 6.
Computation of Kolmogorov-Smirnov Distance D n according to Equation (3): with F SWMM being the simulated cumulative TSS event load distribution function and F lognormal the site-specific parameterized lognormal distribution function. 7.
Repeat steps 1-7 to minimize D n until convergence. The goodness-of-fit of the calibrated stormwater quality model is numerically assessed and visually evaluated through a direct comparison of the simulated distribution function and the parameterized lognormal distribution function for TSS event loads. Residuals of the simulated event loads and observed event loads are computed. Simulated intra-event dynamics are analyzed by means of mass-volume curves (MV curves).

Concept of Model Validation
The calibration uses measurement data from the site-specific stormwater quality observation period. Estimated parameters are expected to be valid beyond this period. The model validation therefore uses all available rainfall data from the five-year period (March 2013-April 2018, cf. Table 1). Equality of simulated TSS event load distributions from the five-year period and the observation period are evaluated using Kolmogorov-Smirnov's distance KS D N .

Model Parameter Uncertainty Analysis
The differential evolution algorithm applied belongs to the class of genetic algorithms which minimize an objective function by evolving a population of candidate solutions through successive generations [32]. In this study, the configuration of evolution strategy and mutating operators (crossover probability and differential weighting factor) follows the developer's recommendation. However, the maximum number of iterations is set to 400 and the number of population members (i.e., parameter sets per iteration) is set to 100, which results in 40,000 simulation runs per model in total. For estimating model parameter uncertainties, simulation results are divided into behavioral and non-behavioral groups. Parameter sets that yield the best 20% solutions are attributed as behavioral and subjected to descriptive statistical analysis (mean, standard deviation, and coefficient of variation).

Estimation of Annual TSS Event Loads
The calibrated stormwater quality models are further used to estimate annual TSS event loads and event mean concentrations originated from the study sites. Annual TSS event loads are estimated by considering all event loads from a moving window of 12 consecutive months to account for natural rainfall variability. Using the extended rainfall series, the simulation period comprises ca. five years, with 62 months which yields 50 (62-12) moving years.

Results and Discussion
Calibration results for both sites are shown in Table 6. Statistics for both model parameters and the Kolmogorov-Smirnov-based objective function are given. The best fit parameter sets yielded to an objective function of roughly 0.05 for both models.
According to the low Kolmogorov-Smirnov statistic D n of approx. 0.05 for both sites, the best-fit parameter sets obtained by the distribution-based calibration approach lead to well-approximated parameterized lognormal distributions. From a statistical perspective which also takes the number of samples into account, it can be legitimately assumed that both distributions (lognormal and simulated TSS event loads) follow the same distribution. Both KS statistics are below the critical values at the 90% significance level (0.082 for site FR and 0.118 at site PL). Distributions of simulated and observed event mean concentrations (EMC) are compared in Table 7. A notably high agreement of mean EMC is obtained for both sites (FR: 33 mg L −1 , PL: 62 mg L −1 ). It can also be observed that EMC percentiles of simulation for site FR are slightly higher than the observed percentiles until the 0.75 percentile. Site PL shows the opposite behavior: EMC percentiles of simulation are slightly lower than observed percentiles until the 0.5 percentile. However, in both cases, the maximum observed EMC percentiles are strongly underestimated which suggests an inappropriate accumulation process model to account for random influences (e.g., traffic-induced pollutant emissions [33]). The fact that events with high TSS event mean concentrations are underestimated affects the goodness-of-fit concerning the total TSS event load of the events observed (Table 8). This is especially evident at site PL, where the total TSS event load is underestimated by roughly 28%. Events with more than 0.5 g m −2 are poorly represented (cf. Figure 2). At site FR, the relative deviation is only about 5%. This signals that the error is compensated by events whose simulated TSS event load is higher than that which is observed (intersection at approx. 0.1 g m −2 , cf. Figure 2).   At site FR, the calibrated model replicates the distribution function until the 0.8 percentile with a high goodness-of-fit. Events exceeding this value are generally underestimated by the model and lead to lower simulated event loads than suggested by the lognormal distribution. Since the KS statistic represents the maximum distance between two cumulative distribution functions, the maximum 5% of the events with more than the 0.8 percentile of event loads are underestimated.
The results for site PL show a similar effect. Here, the model shows a good fitting of the distribution function until the 0.9-percentile, which accordingly implies that the maximum 5% of the events with more than the 0.9-percentile of event loads are underestimated.
Both calibrated models tend to underestimate events with high TSS loads which indicates that the calibration approach and the objective function applied is heavily influenced by events with low TSS event load which, as a matter of fact, is the case for the majority of events for both sites. Applying an alternative goodness-of-fit measure as an objective function, which also emphasizes the upper tailing of a distribution function could lead to superior model performance. This, however, remains At site FR, the calibrated model replicates the distribution function until the 0.8 percentile with a high goodness-of-fit. Events exceeding this value are generally underestimated by the model and lead to lower simulated event loads than suggested by the lognormal distribution. Since the KS statistic represents the maximum distance between two cumulative distribution functions, the maximum 5% of the events with more than the 0.8 percentile of event loads are underestimated.
The results for site PL show a similar effect. Here, the model shows a good fitting of the distribution function until the 0.9-percentile, which accordingly implies that the maximum 5% of the events with more than the 0.9-percentile of event loads are underestimated.
Both calibrated models tend to underestimate events with high TSS loads which indicates that the calibration approach and the objective function applied is heavily influenced by events with low TSS event load which, as a matter of fact, is the case for the majority of events for both sites. Applying an alternative goodness-of-fit measure as an objective function, which also emphasizes the upper tailing of a distribution function could lead to superior model performance. This, however, remains unclear as the applied pollutant model itself also has limitations in replicating natural pollutant processes [5,12,34].
Observed and simulated MV curves are shown in Figure 3. Simulated MV curves are calculated for both the stormwater quality observation period and the five-year period using all available rainfall data.
Water 2018, 10, x FOR PEER REVIEW 10 of 14 unclear as the applied pollutant model itself also has limitations in replicating natural pollutant processes [5,12,34]. Observed and simulated MV curves are shown in Figure 3. Simulated MV curves are calculated for both the stormwater quality observation period and the five-year period using all available rainfall data. Mass-volume curves for site FR reveal that simulated intra-event processes do not reflect the observed dynamics in general. In particular, the prevailing first-flush characteristic is not appropriately replicated. Instead, simulated wash-off tends to occur proportionally to runoff.
In contrast, statistics of simulated intra-event processes at site PL correspond well to the data observed. It can be seen that the calibrated model also tends to generate wash proportional to runoff. The high agreement of observed and simulated MV curves at site PL is obtained since the observed MV curves already show a more runoff-proportional wash-off behavior. Although the general characteristic at site PL is satisfactorily represented, the results from both sites indicate that the observed intra-event dynamic can hardly be deterministically described by the model for a continuous simulation period. As pointed out in previous studies [5,12], pollutant buildup and washoff are highly affected by stochastic inputs, which consequently limits the goodness-of-fit of replicating intra-event dynamics.
Simulated distribution functions from the observation period (used for calibration) are compared to the results using the five-year period (validation) in Figure 4. Corresponding goodnessof-fit is given in Table 9.
At site FR, the KS statistic between both distributions is 0.035, implying that the observation period is highly representative. The KS statistic of 0.062 from validation only slightly differs from calibration (KS: 0.053), which indicates a successful model validation.
In contrast, the distribution function from validation at site PL constantly underestimates the assumed lognormal distribution. This is also expressed by a higher KS statistic of 0.073. The distance between calibration and the validation period is slightly higher (KS: 0.083), indicating a less successful model validation. However, it is noticeable that the simulated TSS event distribution of the observation period falls below the lognormal distribution between 0.25 g m −2 and 0.4 g m −2 and exceeds the lognormal distribution for event loads higher than 0.5 g m −2 . This indicates that the observation period is less representative as the number of events is significantly lower.  Mass-volume curves for site FR reveal that simulated intra-event processes do not reflect the observed dynamics in general. In particular, the prevailing first-flush characteristic is not appropriately replicated. Instead, simulated wash-off tends to occur proportionally to runoff.
In contrast, statistics of simulated intra-event processes at site PL correspond well to the data observed. It can be seen that the calibrated model also tends to generate wash proportional to runoff. The high agreement of observed and simulated MV curves at site PL is obtained since the observed MV curves already show a more runoff-proportional wash-off behavior. Although the general characteristic at site PL is satisfactorily represented, the results from both sites indicate that the observed intra-event dynamic can hardly be deterministically described by the model for a continuous simulation period. As pointed out in previous studies [5,12], pollutant buildup and wash-off are highly affected by stochastic inputs, which consequently limits the goodness-of-fit of replicating intra-event dynamics.
Simulated distribution functions from the observation period (used for calibration) are compared to the results using the five-year period (validation) in Figure 4. Corresponding goodness-of-fit is given in Table 9.
At site FR, the KS statistic between both distributions is 0.035, implying that the observation period is highly representative. The KS statistic of 0.062 from validation only slightly differs from calibration (KS: 0.053), which indicates a successful model validation.
In contrast, the distribution function from validation at site PL constantly underestimates the assumed lognormal distribution. This is also expressed by a higher KS statistic of 0.073. The distance between calibration and the validation period is slightly higher (KS: 0.083), indicating a less successful model validation. However, it is noticeable that the simulated TSS event distribution of the observation period falls below the lognormal distribution between 0.25 g m −2 and 0.4 g m −2 and exceeds the lognormal distribution for event loads higher than 0.5 g m −2 . This indicates that the observation period is less representative as the number of events is significantly lower.  The validated models were finally used to estimate annual TSS loads (Table 10) which is of special interest for practical purposes. In the present study, the estimated mean annual TSS load for site FR is 9.9 g m −2 a −1 which according to Dierschke [35] represents a roof with "low to normal" load contribution. The annual TSS load for site PL was estimated at 13.7 g m −2 a −1 , which is significantly lower than reported from measurements by Burton and Pitt [36] (~40 g m −2 a −1 ). As already stated, the model disregards traffic-related stochastic inputs, which could explain the low annual TSS loads estimated. Consequently, the result must be carefully interpreted. This highlights the need to especially account for load-intensive events either through an alternative objective function or modification of the model concept which, e.g., occasionally allows the incorporation of pollutants from additional sources. Generally, the distribution-based calibration approach allows to calibrate stormwater quality models even if data is incomplete but tends to underestimate events with high TSS loads. However, compared to the conventional calibration, the approach has two clear advantages. First, the occurrence of events and its corresponding pollutant contribution is probabilistically considered, which implies that stochasticity is taken into account. Second, measurement data of stormwater  The validated models were finally used to estimate annual TSS loads (Table 10) which is of special interest for practical purposes. In the present study, the estimated mean annual TSS load for site FR is 9.9 g m −2 a −1 which according to Dierschke [35] represents a roof with "low to normal" load contribution. The annual TSS load for site PL was estimated at 13.7 g m −2 a −1 , which is significantly lower than reported from measurements by Burton and Pitt [36] (~40 g m −2 a −1 ). As already stated, the model disregards traffic-related stochastic inputs, which could explain the low annual TSS loads estimated. Consequently, the result must be carefully interpreted. This highlights the need to especially account for load-intensive events either through an alternative objective function or modification of the model concept which, e.g., occasionally allows the incorporation of pollutants from additional sources. Generally, the distribution-based calibration approach allows to calibrate stormwater quality models even if data is incomplete but tends to underestimate events with high TSS loads. However, compared to the conventional calibration, the approach has two clear advantages. First, the occurrence of events and its corresponding pollutant contribution is probabilistically considered, which implies that stochasticity is taken into account. Second, measurement data of stormwater quality processes are rarely completely available for continuous periods which consequently complicates the application of a conventional calibration approach and could result in misleading model outputs. Theoretical distribution functions are continuously defined.

Conclusions
An innovative calibration approach for stormwater quality models with respect to TSS event load distribution is introduced. The approach was applied on two experimental sites, (i) flat roof and (ii) parking lot, for which parameterized lognormal distribution functions were available. From this study, the following can be concluded:

•
Both models have been successfully calibrated, as indicated by the low Kolmogorov-Smirnov distance measure. TSS event load distribution functions from calibration period were compared to distribution functions obtained from simulation with extended rainfall data. • Maximum deviation between lognormal and simulated TSS event load distribution is 5%. • A high agreement of the observed and simulated mean of event mean concentrations (µEMC) was achieved for both sites (FR: 33.2 vs. 33.4 mg L −1 , PL: 60.3 vs. 62.9 mg L −1 ). • Using a theoretical distribution for calibration provides continuous probabilities and allows to calibrate stormwater quality models even if data is incomplete.

•
The approach is generally applicable and especially powerful if distribution functions become generalizable on a catchment-scale.

•
The objective function used for calibration employs the Kolmogorov-Smirnov statistic. Despite its simplicity, it has been shown that events with high TSS event loads tend to be underestimated. A more behavioral distance measure which also accounts for events with high loads remains open for future research. • Based on the calibrated models, annual TSS event loads were estimated. A load of 9.9 g m −2 a −1 was obtained for the flat roof site, and 13.7 g m −2 a −1 was obtained for the parking lot site.
The calibration approach still needs to be tested on larger catchments which consist of multiple subcatchments with different land uses. This requires catchment-specific theoretical distribution functions to be available. Additionally, it could be of interest to determine whether model parameters are correlated to parameters of the theoretical distribution function or catchment characteristics.