Sensitivity Analysis for Urban Drainage Modeling Using Mutual Information

This paper evaluates the sensitivity of Storm Water Management Model (SWMM) outputs to the model's input parameters. A global parameter sensitivity analysis is conducted to determine which parameters most affect the simulation results. Two methods of sensitivity analysis are applied. The first is the partial rank correlation coefficient (PRCC), which measures nonlinear but monotonic relationships between model inputs and outputs. The second is based on mutual information, which provides a general measure of the strength of association between two variables, including non-monotonic association. Both methods are based on Latin Hypercube Sampling (LHS) of the parameter space, so the same datasets can be used to obtain both measures of sensitivity. The utility of the PRCC and mutual information methods is illustrated by analyzing a complex SWMM model. The sensitivity analysis revealed that only a few key input variables contribute significantly to the model outputs; PRCCs and mutual information are calculated and used to determine and rank the importance of these key parameters. This study shows that PRCC and mutual information analysis can be considered effective methods for assessing the sensitivity of the SWMM model to uncertainty in its input parameters.


Introduction
Urban drainage models are widely used for the planning, design and management of urban drainage systems. These models have become very complex. Sensitivity analysis has been identified as an essential component in building models and evaluating their performance [1]. It can be applied to identify the relative influence of each model input parameter on the model outputs. Generally speaking, there are two main groups of sensitivity analysis: local and global approaches [2]. Local sensitivity analysis evaluates how the outputs change when one input parameter is varied at a time. In contrast, global sensitivity analysis (GSA) varies all parameters simultaneously and evaluates their contribution to the output uncertainty. The local approach has apparent limitations in complex hydrological models (e.g., SWMM), which often involve many nonlinear relationships between input and output variables [3,4].
This study focuses primarily on global sensitivity analysis (GSA) methods. Numerous methods are available for performing sensitivity analyses. Widely used GSA methods include variance-based methods [5], sampling-based methods such as Latin hypercube sampling with the partial rank correlation coefficient index (LHS-PRCC) [6], global screening methods [7], regression-based methods and others [8].
The Storm Water Management Model (SWMM) [9] of the U.S. Environmental Protection Agency (EPA) is chosen for this study to simulate various hydrologic/hydraulic processes in urban areas. It is designed for single-event or long-term (continuous) simulation of rainfall-runoff transformation, hydraulic performance and water quality. SWMM has four principal hydrologic/hydraulic processes: precipitation, rainfall losses, runoff transformation and flow routing. A detailed sensitivity analysis is needed to determine which SWMM parameters most strongly affect the rainfall-runoff-routing simulation in the model.
Although several global sensitivity analysis (GSA) methods have been proposed in many fields of science and engineering, there are not many reports on global sensitivity analysis in the urban drainage field [10]. Morris screening and standardised regression coefficients (SRCs) were applied to a sewer flow and water quality model by Gamerith et al. [11]. Vezzaro and Mikkelsen [12] recently applied a variance decomposition GSA method combined with Generalized Likelihood Uncertainty Estimation (GLUE) in order to identify the major sources of uncertainty in a storm-water quality model. Freni et al. [13] briefly discussed parameter sensitivity in urban stormwater quality modeling within the GLUE approach; Dotto et al. [14] analyzed parameter sensitivities using the formal Bayesian approach. Prat et al. [15] investigated sensitivities in an integrated model study of the sewer system and the waste water treatment plant with Monte Carlo simulations and Partial Correlation Coefficients (PCC).
One global analysis method is Partial Rank Correlation Coefficient analysis, which applies the Pearson correlation coefficient to rank-transformed data and assumes a monotonic relationship between the input and the output variables [16]. The partial rank correlation coefficient (PRCC) is widely used for sensitivity analysis [17,18].
While there are several approaches to quantifying the magnitude (strength) of relations between variables, mutual information, derived from information theory, provides a general measure of dependencies between variables. The concept of mutual information proposed by Shannon and Weaver in 1949 [19] was developed to quantify shared information between data sets. Unlike the linear correlation coefficient, which captures only the linear relationship between two variables, mutual information is theoretically able to identify various types of relations [20]. Mutual information is well suited to hydrological problems, because the relationship between any two hydrological variables is rarely linear [21]. Mutual information avoids this limitation because it measures the general statistical dependence between two hydrological variables. Bonnlander et al. [22] demonstrated the application of mutual information for selecting the key input variables of neural network models. Mishra et al. [23] described three global sensitivity methods (stepwise regression, mutual information analysis and classification trees) for determining uncertainty importance, and showed some sample applications for ground water models. Zeng et al. [24] compared stepwise regression analysis and mutual entropy analysis for identifying the key uncertainty variables affecting the parameters of normal groundwater level series.
In this study, two GSA methods, LHS-PRCC and mutual information (entropy) analysis, are applied to a conceptual hydrologic-hydraulic model (SWMM) in order to (i) compare the two methods in terms of the similarity of their results and their effectiveness; and (ii) identify and rank the important model parameters for selected model output variables. The two methods are selected for comparison because these potentially powerful sensitivity analysis techniques are not well known in the hydrologic modeling literature.
The structure of this paper is as follows. In Section 2, we describe two GSA methods: the partial rank correlation coefficient (PRCC) and mutual information analysis. We also introduce our working example and briefly describe the SWMM model. In Section 3, the applicability of PRCC and mutual information analysis is demonstrated using results from a SWMM model application at the Ya-hua garden district in Changsha city. The Latin Hypercube Sampling (LHS) method is used to generate 1000 sets of parameter samples. Sensitivities are evaluated using different model outputs. Finally, some conclusions of the study are given in Section 4.

Study Site
The selected study area was the Ya-hua garden district in Changsha city, with a drainage area of approximately 11.7 ha (Figure 1). Mean annual precipitation in the region is between 1200 and 2000 mm, and the maximum recorded daily precipitation was 338 mm in 1996.
Impervious surfaces covered 56% of the study area and pervious surfaces 44%. The average overland slope was 0.3%. The study area was subdivided into 24 sub-catchments, each 0.24 to 0.96 ha in size, based on the existing land use, conduit diameter, topography and drainage characteristics. The total number of conduits was 23, and the number of nodes was 24. The catchment was drained by a separate sewer system, and the storm water runoff flowed through 23 conduits to the catchment outlet labeled OUT1 (Figure 1). Rainfall was measured by rain gauges installed within the catchment. Table 1 presents the total runoff volume and peak discharge of the monitored rainfall events. The parameters of the SWMM model were calibrated against measurements from three events on 12/08/01, 28/07/02 and 25/08/03. A detailed description of the catchment is given in Ren [25].

Model Description
The Environmental Protection Agency (EPA) Storm Water Management Model (SWMM) is chosen for this study. SWMM is a physically-based distributed model, which means that a study area can be subdivided into any number of irregular sub-catchments. The model is capable of simulating all the hydrologic-hydraulic processes that occur in the urban drainage system [26]. SWMM consists of four functional program blocks, plus a coordinating executive block. SWMM has been widely applied to the planning and design of urban drainage systems.
In the present study, two primary functional blocks, the Runoff block and the Extran block, are considered to route flows. The Runoff block is designed to simulate continuous runoff hydrographs for each sub-catchment in the drainage basin. Runoff hydrographs from the various sub-catchments are taken as input to the Extran block. The Extran block performs the hydraulic analysis of the open and closed conduit systems. It solves the complete Saint-Venant equations, accounting for channel storage, backwater effects, surcharged flow and reverse flow.
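For reference, the one-dimensional Saint-Venant equations are commonly written as a continuity and a momentum equation. The form below is the standard textbook one, not a transcription of the SWMM source; the symbols are assumptions introduced here: $A$ is the flow cross-sectional area, $Q$ the discharge, $H$ the hydraulic head, $S_f$ the friction slope, $g$ the gravitational acceleration, $x$ the distance along the conduit and $t$ time.

$$
\frac{\partial A}{\partial t} + \frac{\partial Q}{\partial x} = 0
$$

$$
\frac{\partial Q}{\partial t} + \frac{\partial (Q^{2}/A)}{\partial x} + gA\,\frac{\partial H}{\partial x} + gA\,S_f = 0
$$

The friction slope is where the conduit Manning coefficient enters, which is one plausible reason the conduit roughness can influence routed peak discharge.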
SWMM generates the surface runoff of a catchment from rainfall using a distributed non-linear reservoir model that takes into account depression storage loss, infiltration and evaporation. Infiltration is the part of precipitation that soaks into the ground. This water is considered to be removed from the runoff process.
The major loss considered in rainfall-runoff modeling is infiltration loss. In SWMM, each sub-catchment is further divided into three subareas: an impervious area with depression storage, an impervious area without depression storage and a pervious area with depression storage. Infiltration loss is considered only for the pervious areas. For the pervious areas of a sub-catchment, SWMM offers three methods for computing infiltration loss: the Horton, Green-Ampt and Curve Number models. The Horton model is used in this study.
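As an illustration of how the minimum and maximum infiltration rates act in the Horton model, the sketch below evaluates the classic Horton decay curve, in which infiltration capacity falls exponentially from a maximum to a minimum rate. This is the textbook form only; how SWMM integrates it over a time step is not described here. The function name, the decay constant and the numerical values are hypothetical and are not taken from Table 2.

```python
import math

def horton_rate(t, f_max, f_min, decay):
    """Horton infiltration capacity at time t since the start of the storm.

    f_max and f_min are the maximum (initial) and minimum (final) rates,
    decay is the exponential decay constant (1/time, same time unit as t).
    """
    return f_min + (f_max - f_min) * math.exp(-decay * t)

# Hypothetical values: rates in mm/h, time in hours, decay in 1/h.
for hours in (0.0, 0.5, 1.0, 2.0):
    print(hours, round(horton_rate(hours, f_max=75.0, f_min=10.0, decay=4.0), 2))
```

The curve starts at the maximum rate and approaches the minimum rate, so for long or intense events the minimum rate mainly controls how much rainfall is lost on pervious areas.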

Model Parameters
In general, uncertainty sources in urban drainage modeling include parameters, data and model structure. Only model parameters are used as factors in the sensitivity analysis in this study; the contribution of model structure is neglected. The hydrologic parameters within the SWMM Runoff block are: sub-catchment slope, sub-catchment width, sub-catchment area, Manning roughness coefficient (n) for pervious and impervious areas, minimum and maximum infiltration rates, impervious area with depression storage, and impervious area without depression storage. In the Extran block, the principal input parameters are conduit length, Manning's n for the conduit, and cross-sectional geometry.
In this study, three scaling factors (Pct-Area, K-Width and K-Slope) [27] are introduced to replace sub-catchment area, sub-catchment width and sub-catchment slope. The actual value of each of these input parameters equals the product of the scaling factor and the measured value of the parameter. Sub-catchment area is strongly affected by the subjectivity of model generalization; for a small study area in particular, the uncertainty of the sub-catchment area is more pronounced. The model parameters and their variation ranges are shown in Table 2. Parameter ranges have been set to reasonable values based on the literature and on field measurements at the study site. To assess the influence of varying inputs on model outputs, all model parameters are assumed to be uniformly distributed, so each parameter has the same probability of taking any value within its specified range. The parameter values of the SWMM model vary from sub-watershed to sub-watershed depending on soil types, land cover (percent impervious, etc.), topography and/or other characteristics of the sub-watershed.

Latin Hypercube Sampling
In this paper, the Latin Hypercube Sampling (LHS) method is adopted to generate samples of the input variables, which avoids repeated sampling of the same interval. LHS is a stratified sampling method first introduced by McKay et al. [28]. It can significantly reduce the required number of simulations compared to the conventional Monte Carlo (MC) method [29]. In LHS, the range of each model input is divided into N equiprobable intervals, where N is the number of simulations. One representative parameter value is randomly selected from each interval. The representative values for each random variable are then combined so that each representative value is used only once in the simulation process. The objective of LHS is to ensure full coverage of the range of the input variables, so that every interval of each random variable is represented in the simulation. As the number of parameters increases, the required sample size generally increases. For uncertainty analysis, the number of simulation samples should be at least k+1, where k is the number of model input parameters, but much larger sample sizes are usually necessary to obtain reasonable results [16].
In this study, we used the LHS method to generate 1000 parameter samples from the range of values for each uncertain parameter in Table 2. Ranges of parameter values were determined from either literature values or field measurements (Table 2). Correlations between parameters were neglected when performing the sampling, and each uncertain parameter was assumed to follow a uniform distribution. This assumption can lead to some overestimation of uncertainty and a reduction in confidence in the model results [30].
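The sampling procedure described above can be sketched in a few lines. The following is a minimal illustration, assuming independent, uniformly distributed parameters as in this study; the function name `latin_hypercube` and the two parameter ranges are hypothetical and are not the Table 2 values.

```python
import random

def latin_hypercube(bounds, n_samples, seed=None):
    """Return n_samples parameter sets; bounds is a list of (low, high) pairs."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each of the n equal-probability intervals.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so intervals are paired at random across parameters.
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    # Transpose: one row per simulation, one column per parameter.
    return [list(row) for row in zip(*columns)]

# Hypothetical ranges for two parameters (illustration only).
samples = latin_hypercube([(0.01, 0.03), (0.5, 1.5)], n_samples=10, seed=1)
for row in samples[:3]:
    print(row)
```

Because every interval of every parameter is used exactly once, each row can be passed to one model run, and the resulting input-output pairs can be reused for both sensitivity measures.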

Partial Rank Correlation Coefficient (PRCC)
The choice of sensitivity analysis method depends strongly on the assumed relationship between the input parameters and the model output. For a nonlinear but monotonic relationship between two variables, PRCC [16,31] appears to be the best choice, because it provides a measure of monotonicity between a parameter and the model output after removing the linear effects of all parameters except the parameter of interest. We have therefore chosen PRCC as the preferred method for sensitivity analysis, as one of the most efficient and reliable sampling-based techniques. A correlation coefficient (CC) between input $x_j$ and output $y$ is calculated as follows:

$$
r_{x_j y} = \frac{\sum_{i=1}^{N} (x_{ij} - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_{ij} - \bar{x})^{2} \, \sum_{i=1}^{N} (y_i - \bar{y})^{2}}} \tag{1}
$$

where $x_{ij}$ and $y_i$ are the values of $x_j$ and $y$ in the $i$-th sample. To measure the potential nonlinear but monotonic relationships between $x_j$ and $y$, we calculated the PRCC, which is a robust sensitivity measure as long as little to no correlation exists between the inputs [16]. It represents a partial correlation on rank-transformed data: $x_j$ and $y$ are first rank transformed, then the linear regression models described in Equation (2) are built. A higher positive PRCC indicates that the parameter has a greater positive control on the response variable of interest, while a higher (absolute value) negative PRCC indicates a greater negative control [32]. In contrast, a PRCC value close to 0 indicates little effect on the response variable of interest.
PRCC is often combined with LHS for conducting sensitivity analysis. By combining the uncertainty analyses with PRCC, we are able to reasonably assess the sensitivity of our output variable to parameter variation.
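The PRCC computation can be sketched as follows: rank-transform every input and the output, regress the ranks of the parameter of interest and of the output on the ranks of all other parameters, and correlate the two sets of residuals. This is a minimal pure-Python illustration, not the implementation used in the study; all function names and the synthetic test function are hypothetical, and the least-squares step uses the normal equations, which assumes the inputs are not collinear.

```python
import math
import random

def ranks(values):
    """Rank-transform a list (1 = smallest); ties receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def residuals(target, predictors):
    """Residuals of a least-squares fit of target on predictors plus intercept."""
    design = [[1.0] + [p[i] for p in predictors] for i in range(len(target))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * t for row, t in zip(design, target)) for a in range(k)]
    coef = solve(xtx, xty)
    return [t - sum(c * v for c, v in zip(coef, row)) for row, t in zip(design, target)]

def pearson(u, v):
    """Pearson correlation coefficient of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den

def prcc(inputs, output):
    """inputs: one list per parameter; output: list of model results."""
    rx = [ranks(col) for col in inputs]
    ry = ranks(output)
    result = []
    for j, col in enumerate(rx):
        others = rx[:j] + rx[j + 1:]
        result.append(pearson(residuals(col, others), residuals(ry, others)))
    return result

# Synthetic stand-in for a model: monotonic and nonlinear in x1, decreasing in x2.
rng = random.Random(0)
x1 = [rng.random() for _ in range(200)]
x2 = [rng.random() for _ in range(200)]
x3 = [rng.random() for _ in range(200)]  # has no effect on y
y = [math.exp(3 * a) - 5 * b for a, b in zip(x1, x2)]
scores = prcc([x1, x2, x3], y)
print([round(s, 3) for s in scores])
```

Ranking the parameters by the absolute value of each score gives the importance order, and the sign indicates whether the output increases or decreases with the parameter.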

Mutual Information (Entropy) Analysis
Entropy is a measure of the uncertainty associated with random variables, or, alternatively, a measure of the amount of information contained in a distribution [33]. It is closely related to sensitivity, since a model is expected to be more sensitive to a parameter carrying more information than to one carrying less. Mutual information, which is based on the concept of entropy, provides a general framework for dealing with possible non-monotonicity in input-output relationships, which cannot be handled by linear correlation and regression based approaches [22,34].
As suggested by Mishra and Knowlton [35], the combination of the mutual information concept and contingency table analysis can be used to perform global sensitivity analysis. Table 3 is an example of an n × m contingency table [36]. A contingency table such as Table 3 is used to examine whether or not two variables are independent.

Table 3. An n × m contingency table.

x \ y     y_1     y_2     ...     y_m     Total
x_1       f_11    f_12    ...     f_1m    f_1·
x_2       f_21    f_22    ...     f_2m    f_2·
...       ...     ...     ...     ...     ...
x_n       f_n1    f_n2    ...     f_nm    f_n·
Total     f_·1    f_·2    ...     f_·m    N

Table 3 consists of n rows and m columns that correspond to the n categories of the variable x and the m categories of the variable y. We let $f_{ij}$ denote the joint frequency for (i, j), where the first subscript refers to the row number and the second to the column number. The marginal frequencies represent the row-wise and column-wise sums of the frequencies in the table, denoted by $f_{i\cdot}$ and $f_{\cdot j}$. The entropies of input variable x and output variable y can be defined by:

$$
H(x) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i), \qquad H(y) = -\sum_{j=1}^{m} p(y_j) \ln p(y_j)
$$

The discrete random variable x takes the values $x_i$, $i = 1, 2, \ldots, n$, and the discrete random variable y takes the values $y_j$, $j = 1, 2, \ldots, m$. The conditional entropy $H(x \mid y)$ is the average additional information provided by observing the variable x when the variable y is already known. The conditional entropy of x given y is defined as:

$$
H(x \mid y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \ln p(x_i \mid y_j)
$$

where $p(x_i \mid y_j) = p(x_i, y_j) / p(y_j)$ is the conditional probability of $x_i$ given $y_j$. The joint entropy $H(x, y)$ of x and y denotes the average total information gained by observing both x and y. It is defined as:

$$
H(x, y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \ln p(x_i, y_j)
$$

The mutual information $I(x, y)$ of x and y is the amount of information shared by x and y. It is defined in terms of their entropies:

$$
I(x, y) = H(x) - H(x \mid y) = H(y) - H(y \mid x) = H(x) + H(y) - H(x, y) \tag{6}
$$

The mutual information is always non-negative: $I(x, y) \geq 0$ for any two random variables x and y.
The mutual information is zero if and only if x and y are independent. The mutual information between the random variables x and y can therefore be considered a measure of dependence between these variables, or better yet, the statistical correlation of x and y.
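A small worked example may help fix these definitions. Assume a hypothetical 2 × 2 contingency table with $N = 100$ observations and joint frequencies $f_{11} = f_{22} = 40$, $f_{12} = f_{21} = 10$ (values invented for illustration). Estimating probabilities as relative frequencies, both marginals are $(0.5, 0.5)$, so that:

$$
H(x) = H(y) = -2 \times 0.5 \ln 0.5 = \ln 2 \approx 0.693
$$

$$
H(x, y) = -\left( 2 \times 0.4 \ln 0.4 + 2 \times 0.1 \ln 0.1 \right) \approx 0.733 + 0.461 = 1.194
$$

$$
I(x, y) = H(x) + H(y) - H(x, y) \approx 1.386 - 1.194 = 0.193
$$

If instead all four cells held 25 observations, the joint entropy would be $\ln 4 = 2 \ln 2$ and the mutual information would be exactly zero, which is the independent case.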
In the mutual entropy method, the sensitivity of the output to the input variables is estimated by the following two indicators [24].
The first indicator is the uncertainty coefficient, a quantitative measure of association defined by the two limits identified above:

$$
U(x, y) = \frac{2 I(x, y)}{H(x) + H(y)} \tag{7}
$$

This measure lies between 0 (corresponding to no association between x and y) and 1 (complete association).
The R-statistic is an alternative measure of association defined on the basis of mutual information:

$$
R = \sqrt{1 - \exp\left[ -2 I(x, y) \right]} \tag{8}
$$

where R varies between 0 and 1. R is zero if there is no association between x and y, and it approaches 1 as the association between x and y becomes perfect. R can be used to provide a computational representation of variable importance.
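The entropy-based measures can be computed directly from sampled input-output pairs. The sketch below bins both variables into equiprobable classes, builds the contingency table and evaluates the entropies, the mutual information, the uncertainty coefficient and the R-statistic. It assumes the common forms U = 2I/(H(x) + H(y)) and R = sqrt(1 - exp(-2I)); the function names and the synthetic data are hypothetical, and tied values are split between neighbouring classes by sort order.

```python
import math

def equiprobable_bins(values, n_bins):
    """Assign each value to one of n_bins classes holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, index in enumerate(order):
        bins[index] = position * n_bins // len(values)
    return bins

def entropy(counts, total):
    """Shannon entropy (natural log) of a list of frequencies; empty cells add 0."""
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def mutual_information_measures(x, y, n_bins=10):
    """Entropies, mutual information, U and R from an n_bins x n_bins table."""
    bx, by = equiprobable_bins(x, n_bins), equiprobable_bins(y, n_bins)
    total = len(x)
    table = [[0] * n_bins for _ in range(n_bins)]
    for i, j in zip(bx, by):
        table[i][j] += 1
    h_x = entropy([sum(row) for row in table], total)        # row marginals
    h_y = entropy([sum(col) for col in zip(*table)], total)  # column marginals
    h_xy = entropy([f for row in table for f in row], total)
    i_xy = h_x + h_y - h_xy
    u = 2.0 * i_xy / (h_x + h_y)
    # max() guards against a tiny negative value caused by rounding.
    r = math.sqrt(max(0.0, 1.0 - math.exp(-2.0 * i_xy)))
    return {"H(x)": h_x, "H(y)": h_y, "H(x,y)": h_xy, "I": i_xy, "U": u, "R": r}

# Synthetic check: a parabola is non-monotonic in x, yet shares much information.
x = [i / 1000 for i in range(1000)]
parabola = [(v - 0.5) ** 2 for v in x]
same = mutual_information_measures(x, x)
bowl = mutual_information_measures(x, parabola)
print({k: round(v, 3) for k, v in bowl.items()})
```

In the parabola case a rank correlation would be close to zero, while the mutual information remains large, which is the kind of non-monotonic dependence this measure is meant to detect.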

Results and Discussion
In the following sections, the results are presented and discussed in detail for the investigated model outputs.

Statistics Analysis of Model Outputs
The LHS method was used to explore the effects of input parameter uncertainty on three output variables: the peak discharge (Qp), the total runoff volume (V) and the time to peak (Tp). Using Latin hypercube sampling, 1000 samples were generated, with each model parameter assigned a uniform distribution over its range in Table 2.

CDF
The simulation runs of the SWMM model yield N observations of each output variable, with the sample size N set to 1000. Distribution functions for each output variable can be derived directly from these observations and characterized by simple descriptive statistics. Figure 2 shows the empirical cumulative distribution functions (CDFs) of the SWMM outputs obtained from the uncertainty analysis, and the descriptive statistics for these distributions are given in Table 4. The results show high output variability, which reflects the high degree of uncertainty in the initial estimates of the input parameters. These descriptive statistics cannot identify which of the input variables contribute most to the model output; consequently, partial rank correlation coefficients were calculated in order to identify these key variables.
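An empirical CDF of this kind can be derived from the simulated outputs by sorting them and assigning each value the fraction of samples at or below it. The sketch below is a minimal illustration; the function names and the sample values are hypothetical and do not come from the study.

```python
def empirical_cdf(values):
    """Return sorted values and the fraction of samples at or below each one."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]

def quantile(values, q):
    """Smallest sample value whose empirical CDF is at least q (0 < q <= 1)."""
    ordered, cdf = empirical_cdf(values)
    for v, p in zip(ordered, cdf):
        if p >= q:
            return v
    return ordered[-1]

# Hypothetical peak discharges from eight model runs (units arbitrary).
peaks = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44, 0.58]
print(quantile(peaks, 0.5), quantile(peaks, 0.9))
```

Quantiles read off the empirical CDF, such as the median or the 90th percentile, are one simple way to summarize the output spread alongside the mean and standard deviation.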

PRCC Analysis
The primary model outputs of interest for the sensitivity analyses were the peak discharge (Qp), the total runoff volume (V) and the time to peak (Tp). PRCCs were calculated between each of the 12 input parameters and the three output variables. Based on the magnitude of the absolute value of the PRCC, we ranked the relative importance of the 12 input parameters. PRCC values of the parameters are presented in Table 5. Figure 3 shows partial rank scatter plots of the ranks for the total runoff volume and each of the 12 input parameters. This visualisation shows how strongly an output variable is influenced by the input parameters. The visual analysis shows a negative relationship between Con-Mann and total runoff volume (a high parameter value leads to a low total runoff volume). The PRCC results for the total runoff volume show that Con-Mann has the highest influence on the results, followed by Pct-Area and K-Width. Pct-Area, N-Imperv and N-Perv are strongly positively correlated with the total runoff volume. On the other hand, Con-Mann, K-Width, Min.Infil.Rate and K-Slope are strongly negatively correlated with the total runoff volume. Table 5 shows that the ranking of the relative importance of the parameters can vary among different output variables. For example, Con-Mann has a relatively low influence on the time to peak, but a high influence on the peak discharge and total runoff volume.

Mutual Entropy Analysis
Mutual entropy analysis can be conducted using contingency tables. A contingency table is a statistical table that shows the relationship between two variables. To construct a contingency table, the first step is to bin the range of values, and then to count the number of observations in each bin. Since Mishra et al. [23] pointed out that 10-15 classes for each variable give stable results, we used a 10 × 10 table. The 1000 calculated output values and the corresponding sampled values of all input parameters were divided into 10 equiprobable intervals. Mutual information was calculated from the resulting contingency tables. The importance ranking of the input parameters according to the R statistic is shown in Tables 6-8. In Tables 6-8, the first and second columns show the variables in order of decreasing importance. The third column gives the values of the joint entropy H(x,y) as given by Equation (4). The mutual entropy I(x,y) (given in the fourth column) and the uncertainty coefficient U(x,y) (given in the fifth column) are calculated from Equations (6) and (7), respectively. The sixth column shows the R statistic value calculated using Equation (8).
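One consequence of using 10 equiprobable classes is worth spelling out. Assuming the classes hold exactly equal counts (which ties in an output, for example a time to peak reported in fixed time steps, could prevent), every marginal probability is $1/10$ and the marginal entropies are fixed:

$$
H(x) = H(y) = -\sum_{i=1}^{10} \frac{1}{10} \ln \frac{1}{10} = \ln 10 \approx 2.303
$$

Assuming also the symmetric form of the uncertainty coefficient, $U(x, y) = 2 I(x, y) / [H(x) + H(y)]$, this gives

$$
U(x, y) = \frac{I(x, y)}{\ln 10}, \qquad 0 \leq I(x, y) \leq \ln 10
$$

Under these assumptions $U$ is simply a rescaled $I$, so the two indicators rank the parameters identically, and differences in the joint entropy $H(x, y) = 2 \ln 10 - I(x, y)$ carry the same information.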
The most important parameter for total runoff volume and peak discharge identified via the R statistic is Con-Mann, which clearly dominates the importance ranking. For the time to peak, both methods identify N-Imperv as the most important parameter. For the other parameters, however, the rankings from PRCC and mutual information differ slightly.
The entropy analysis yields a parameter importance ranking similar to that of the PRCC analysis, which indicates the absence of any important non-monotonic relationships between inputs and outputs.

Conclusions
Sensitivity analysis of an urban drainage model with respect to uncertain model input parameters is presented in this paper. We review and compare two global sensitivity analysis methods that have proven to be among the most reliable and efficient, namely PRCC and a mutual information method. One example is presented to demonstrate the applicability of the selected global sensitivity analysis methods. Twelve model input parameters and three model outputs are investigated. LHS is used to generate input data from the assigned distributions and ranges in the SWMM model. A sensitivity analysis based on PRCC and the mutual information method has been conducted in order to identify the importance of the SWMM model parameters.
Statistical analysis of the SWMM model outputs reveals high output variability due to uncertain estimates of various model parameter values. However, only a few key input parameters contribute substantially to the model outputs. PRCC and entropy analysis are used to identify and rank the importance of these key input parameters. The example demonstrates how PRCC can be used to identify key input parameters in sample data sets derived from numerical simulations. Scatter plots offer a straightforward way to visualize the relationship between an input parameter and an output variable. However, the applicability of PRCC is restricted to nonlinear but monotonic relationships between outputs and inputs. Mutual information has the strength of identifying non-monotonic sensitivities. Therefore, using mutual information on stochastic models can complement LHS/PRCC results.
The PRCC and entropy-based sensitivity measures are found to give similar rankings of the SWMM input parameters. Both PRCC and mutual information are found to be sufficient to identify the most important parameters of the SWMM model.
The choice of model output variable significantly influences the importance and ranking of the parameters. It is therefore essential to investigate more than one model output when assessing sensitivities over the range of possible model responses.

Figure 1. Generalized distribution of the study area.

The CC ranges from −1 to +1. $\bar{x}$ and $\bar{y}$ are the mean values of x and y, respectively, and N is the number of samples. The partial correlation coefficient (PCC) provides a measure of the strength of the linear relationship between input $x_j$ and output y after the linear influence of the other variables has been eliminated. The PCC between $x_j$ and y is defined by the CC of $x_j - \hat{x}_j$ and $y - \hat{y}$, where $\hat{x}_j$ and $\hat{y}$ are represented in the following linear regression models:

$$
\hat{x}_j = c_0 + \sum_{p=1,\, p \neq j}^{k} c_p x_p, \qquad \hat{y} = b_0 + \sum_{p=1,\, p \neq j}^{k} b_p x_p \tag{2}
$$

In the entropy definitions, $p(x_i, y_j)$ is the joint probability of $x_i$ and $y_j$.

Figure 2. The cumulative distribution functions (CDF) of total runoff volume and peak discharge.

Figure 3. Partial rank scatter plots of the ranks for the total runoff volume and each of the 12 input parameters from 1000 Monte Carlo simulations. The y-axis represents the output variable residuals while the x-axis represents the residuals for each of the 12 input parameters.

Table 1. The total runoff volume and peak discharge in the district.

Table 2. Key model parameters involved in this study.

Table 4. Descriptive statistics from the uncertainty analysis.

Table 5. Result of sensitivity analysis.

Table 6. Results of mutual entropy analysis for total runoff volume.

Table 7. Results of mutual entropy analysis for peak discharge.

Table 8. Results of mutual entropy analysis for time to peak.