Prediction of Typhoon-Induced Flood Flows at Ungauged Catchments Using Simple Regression and Generalized Estimating Equation Approaches

: Typhoons are the main type of natural disaster in Korea, and accurately predicting typhoon-induced ﬂood ﬂows at gauged and ungauged locations remains an important challenge. Flood ﬂows caused by six typhoons since 2002 (typhoons Rusa, Maemi, Nari, Dienmu, Kompasu and Bolaven) are modeled at the outlets of 24 Geum River catchments using the Probability Distributed Moisture model. The Monte Carlo Analysis Toolbox is applied with the Nash Sutcliffe Efﬁciency as the criterion for model parameter estimation. Linear regression relationships between the parameters of the Probability Distributed Moisture model and catchment characteristics are developed for the purpose of generalizing the parameter estimates to ungauged locations. These generalized parameter estimates are tested in terms of ability to predict the ﬂood hydrographs over the 24 catchments using a leave-one-out validation approach. We then test the hypothesis that a more complex generalization approach, the Generalized Estimating Equation, which includes properties of the typhoons as well as catchment characteristics as predictors of PDM model parameters, will provide more accurate predictions. The results show that the predictions of Generalized Estimating Equation are comparable to those of the simpler, conventional regression. The simpler approach is therefore recommended for practical applications; however, further reﬁnements of the Generalized Estimating Equation approach may be explored.


Introduction
Typhoons are the main natural disaster in Korea. In recent years, Korea has experienced an increased frequency and strength of typhoons and an increase in the resulting damage [1,2]. The maximum annual flood peak historically is often caused by heavy rainfall in the wet season ("Changma") as well as typhoons; however, since 2002, typhoons are perceived to have become the main source of flood risk, while the Changma has become less influential in Korea. Estimation of typhoon-induced flood hydrographs and associated flood peaks is therefore considered to be increasingly important for engineering applications, such as design of civil engineering structures, stabilization of river banks and flood warning and management.
Gauged flood hydrographs are not available for most of the medium to small-sized catchments in Korea. The need to predict flows at locations where observed flows do not exist for model calibration and validation, i.e., the ungauged catchment problem, is common [3]. The common approach to this problem is selection of a preferred rainfall-runoff model followed by spatial generalization (regionalization) of its parameters. The regionalization methods depend on nature of the selected rainfall-runoff model and the catchments. Blöschl et al. [3] grouped the methods into three main categories: (a) a-priori estimation of model parameters from catchment characteristics (e.g., [4]); (b) transfer of calibrated model parameters from gauged catchments (e.g., [5]); (c) constraining model parameters by regionalized runoff characteristics and proxy data (e.g., [6]).
The literature on the topic of flood hydrograph regionalization is very well developed. Salinas et al. [7] summarize a comparative assessment of 20 flood prediction studies involving 3023 catchments, concluding that flood flow predictions in ungauged catchments are more accurate in humid than in arid catchments. However, in this and the broader literature, flood studies in Asian monsoon areas are not well represented compared to other humid areas. Although a number of studies have analyzed typhoon rainfall [8][9][10][11], there are particularly few studies focusing specifically on flood hydrology associated with East Asian typhoons combined with mountainous environments. One significant study was by [12], who analyzed how flood events in Korea are controlled by the mean, standard deviation, maximum rainfall intensity of the hourly rainfall time series. The typhoons' maximum rainfall intensity was found to be an important factor affecting the flood responses.
Classic statistical regionalization studies are based on the premise that there are predictable differences in rainfall-runoff model parameters between catchments, and that there are no changes in parameters between flood events. This means that the estimated model parameters are averaged over flood events to one representative parameter set per catchment [13,14]. This may underestimate uncertainty especially when large differences in rainfall properties between events, such as in typhoons, create response differences that cannot be captured by a simple rainfall-runoff model. This problem can be addressed empirically, by extending the classical regionalization to include inter-event variations in model parameters. One formal approach to do this is the Generalized Estimating Equation (GEE) method. GEE was proposed by [15] as an extension of the Generalized Linear Model (GLM) to longitudinal data analysis using quasi-likelihood estimation. GEE is widely used in biomedical studies to analyze the repeated measurements from a subject or correlated observation from a cluster [16,17]; however, the GEE method has not previously been applied in the rainfall-runoff regionalization context.
The aim of this study is to develop a method for prediction of typhoon-induced flood hydrographs for ungauged catchments, using a case study of six typhoon flood events in 24 Geum River catchments, Korea. The relationship between typhoon properties and flood hydrograph properties is explored empirically, as well as the ability to regionalize a simple rainfall runoff model across the catchments by using conventional regression and the GEE method.

Geum River Region
The Geum River is one of the major rivers in Korea, draining the mid-western region of the Korean peninsula ( Figure 1). Korea has a temperate climate with four distinct seasons, and the average annual temperature and rainfall in this region are 11.5 • C and 1285 mm, respectively [18]. The Geum River region experiences heavy rainfall in the summer from June through September including typhoons and tends to be cold and drier in the winter, although heavy snowfalls are common [19]. The region has a complex climate, influenced by both oceanic and continental climates.
The 24 gauged catchments used in the case study are shown in Figure 1. These catchments are relatively natural with the main land uses being agriculture and natural woodland. The catchments shown in white in Figure 1, although they include some water level stations, are not included in the analysis because their flood flows are heavily influenced by flood control effects of Daechung Multipurpose Dam and Geum River estuary barrier. Catchment Characteristics (CCs) are the basic descriptors that support understanding of differences between catchments. In this study, eight CCs were selected from a previous study of hydrological similarity [14]. The values of the selected CCs are given in Table 1.
-Area is size of the catchment, which affects flood volumes; -ALTBAR is the mean altitude of catchment above sea level; -DPS is the mean of the catchment slope, which affects surface runoff response times; -DD is the Drainage Density, a measure of the total length of all the rivers in a catchment area divided by the total area of the catchment, which affects how efficiently a catchment is drained; -FF is the Form Factor, the ratio of the catchment area to the squared value of the total catchment drainage length. It varies from zero (in a highly elongated area) to 1 (in a perfectly circular area), affecting runoff response times. DD and FF are calculated based on equations provided by National Water Resources Management Information System (WAMIS) in Korea; -Curve Number (CN) is an empirical parameter for predicting direct runoff, developed by the US Soil Conservation System, and affects the volume of runoff during a storm [20]; -FARL is Flood Attenuation Factor by Reservoir and Lakes, estimated based on the reservoir data and the catchment terrain database in WAMIS, affecting flood attenuation [21]; -SAAR is Standard Annual Average Rainfall in the period 1981 to 2010, which represents effects of long-term catchment wetness. Hydrological data (rainfall, flow and climatic data) were obtained from the Water Resources Management Information System [22] and the Hydrological Investigation Report of Korea [19]. The rainfall data were provided by the Korea Meteorological Agency (KMA). As shown in Figure 1, 16 rain gauges operated by the KMA are used to produce catchment average rainfall on an hourly time-step with the Thiessen polygon method. The potential evapotranspiration data were estimated by using the metrological data of the nearest weather station to study catchment with the FAO Penman-Monteith method, developed by the Food and Agriculture Organization of the United Nations [21,23].

Typhoons in the Geum River Region
Typhoons in Korea have increased in strength after 2002 [24]. Table 2 lists the periods over which these six typhoons are analyzed in this paper. Observed data from the six typhoons were prepared for the 24 Geum River catchments. Only three catchments have sufficient data for every typhoon-C1 (Bukil), C4 (Cheongju) and C11 (Indong)-however, all catchments have suitable data for two or more flood events, giving a total of 113 observed flood events and at least 8 events for each typhoon. Most of the events are from June to September, and most have single-peak flood hydrographs. The data quality is examined by looking at the runoff ratio (the total runoff divided by the total rainfall during an event, it should be less than one), the timing of peak rainfall and peak runoff (they should not be counter-intuitive), missing data and outliers.
The characteristics of each typhoon flood event are summarized as: total amounts of rainfall and runoff; maximum rainfall intensity; peak flow; lag time between rainfall and runoff peaks; rainfall duration; and antecedent rainfall over the 5 days preceding the event. Figure 2 shows box plots of the first 5 of these characteristics, showing the maxima and minima, and 25% and 75% quantiles. As shown in Figure 2, the flood characteristics vary widely within each typhoon (i.e., between catchments).  Table 3 shows selected data to further illustrate the nature of the hydrological responses. Table 3 shows that the rainfall duration ranges from 2 to 17 h, and total amount of rainfall ranges from 41 mm to 129 mm. Most of the events have maximum rainfall intensity greater than 10 mm/h. The flood events of Rusa, Mamie and Nari have runoff ratios (total flow/total rainfall) greater than 0.85. However, the runoff ratio for Kompasu is only 0.38, presumably related to the lower total rainfall of 40 mm in that typhoon. The wind rather than rainfall was the dominant source of damage in that typhoon event.  Figure 3 gives some further insight into the variations between flood events and to what degree these are explained by the total rainfall, maximum intensity and P5 (the total amount of rainfall in previous 5 days). This exhibits the expected trends that peak flow and total flow tend to increase with increased rainfall and peak intensity. However, runoff ratio and lag time have no significant linear relationship with the rainfall properties; and none of the flow indices are related to P5. The high runoff ratios and the apparent lack of sensitivity to P5 may be due to the generally wet antecedent conditions (e.g., average P5 for C4 = 62 mm). The wide variation of responses for any given set of rainfall and P5 values is likely to be associated with the variation in responses between catchments. The runoff ratios greater than 1 in Figure 3 may be due to rainfall or flow estimation errors and/or due to presence of baseflow. Understanding and modeling the variations in responses is the focus of the rest of this paper.

Method
The rainfall and flow data (both hourly) from six typhoon periods are prepared and model parameters (MPs) are calibrated for each. The MPs are regionalized by regressing the calibrated values against CCs. The conventional regression equations are developed using MP values that are averaged across the events, aiming solely to model the differences in response between catchments. The GEE regression model is developed using the MPs for each typhoon event, aiming to include the differences in responses between typhoons as well as between catchments. Descriptions of the PDM rainfall runoff model, the calibration and validation methods, and the regression and GEE methods are given below.

PDM Rainfall Runoff Model
The PDM (Probability Distributed Moisture) model is a lumped conceptual rainfall runoff model, which has been used as the main rainfall-runoff model in UK flood estimation procedures [25] and has been proposed as suitable for Korean flood estimation [26,27]. The version of the PDM developed by [25], illustrate schematically in Figure 4, is used here. The details of the PDM model structure are described in [28] and only a summary is provided here.
The PDM is composed of a soil moisture accounting model for estimating the effective rainfall and a routing model for converting effective rainfall into runoff at a define point in the river system. The soil moisture accounting model represents the spatial distribution of soil moisture storage as a Pareto distribution function as specified in Equation (1).
r k is the rainfall, ae k is the actual evapotranspiration, S k is the soil moisture storage, Q is the runoff, c k is the soil moisture storage capacity, C max is the maximum soil moisture storage capacity of the catchment, and b is a parameter that defines the strength of spatial variation of c. c is distributed uniformly between values of 0 and C max if b is equal to 1; and tends towards a single value, C max , as b tends towards 0. The routing model consists of two parallel reservoirs that represent quick and slow responses of catchment runoff. The parameters of the PDM model are summarized in Table 4, along with the ranges of values assumed feasible prior to model calibration [29].

Calibration
For calibration of the MPs in Table 4, a Uniform Random Search was implemented using the Monte-Carlo Analysis Toolbox (MCAT) [28]. 10,000 MP sets were randomly sampled from the prior ranges in Table 4, and objective functions were calculated for each sample. This approach to calibration achieves calibration performances that are sufficient approximations to the optimal performances. The initial soil moisture condition of each case is adopted from the results of preliminary, daily continuous-time modelling [26]. The first 10% of the calibration period was used as a warm-up period during which the errors were neglected in calculation of the objective function in order to reduce sensitivity to initial conditions. The calibration objective function is the Nash Sutcliffe Efficiency (NSE), which measures all-round model fit, although, is often observed to have a bias towards fitting high flows [25,30].
where, O i is observed flow at time i, O is the mean of observed flows, S i (θ) is simulated flow at time time i with parameter set θ, 1 of NSE indicates a perfect match of the simulated and observed flow.

Statistical Regression
In this study, we use the following notations. y ij is defined as a calibrated MP value of the j-th flood event from the i-th catchment. x i =(x i1 , · · · , x ip ) is the set of CCs for the i-th catchment and z ij = z ij1 , · · · , z ijq is the set of rainfall characteristics for the j-th flood event at the i-th catchment.
For each i-th catchment, let y i· = ∑ n i j=1 y ij /n i and z i· = z i·1 , · · · , z i·q = ∑ n i j=1 z ij1 n i , · · · , ∑ n i j=1 z ijq n i where n i denotes the number of flood events in the catchment.

Conventional Regression
In the conventional regression analysis, for each catchment, the PDM MPs are calibrated for each event and the values averaged over all events [9]. A linear least-squares regression equation is developed for each of the five averaged model MPs as follows: E(·) is an expectation and e i represents all random errors in the estimation approach, including randomness between catchments and non-convergence of the optimization of the MPs during calibration. The significant CCs are identified using forward selection with a significance level (p value) of 0.05. That is, we added the CC with the smallest p-value from a wald test and stop adding CCs when no variable has a p-value less than 0.05.

Generalized Estimating Equations (GEEs)
Multiple flood events in a catchment can be considered as repeated measurements from a subject, which corresponds to a catchment in this study [20,31,32]. Therefore, using event-aggregated MPs in (3) leads to loss of information about variability of MPs between events and is likely to underestimate the standard error of the MP estimate for any event. One possible solution to this is to consider a subject-specific random effect in the model fitting using a generalized linear mixed model; however, this would involve specifying the error distribution of y ij . Instead, a GEE approach is adopted that avoids the need to fully account for the error distributions while still potentially achieving robust results [16]. For our flood event data, the following GEE model can be considered: E(y ij ) = β 0 + β 1 x i1 + · · · + β p x ip + β p+1 z ij1 + · · · + β p+q z ijq (4) Note that, while a non-linear transformation of y is generally present in a GEE, here the identity link function is used. Equation (4) is separately estimated for each MP. Then, E(y ij ) corresponds to the mean value of an MP for the j-th flood event from the i-th catchment.
The estimation of the parameters β requires the within-subject covariance, that is "Exchangeable" structure assuming that flood events over time have the same correlation, to be specified [33] and See Appendix A. The estimation is based on iteratively reweighted least squares with quasi-likelihood, incorporating the forward selection method and p < 0.05 as the significance criterion. All the statistical analysis was performed using Statistical Analysis Systems (SAS) version 9.4.

Validation
Leave-one-out cross validation is applied. One catchment is kept as a target catchment for validation, and the regression models are developed with the remaining 23 catchments. The simulated hydrographs are visualized and the NSE is evaluated to represent the applicability of the approach for ungauged catchment prediction.

Calibration of the PDM Model
A separate PDM MP set was calibrated for all 113 events with sufficient observed data, in order to examine differences between them. The results for the 6 flood events at the C4 (Cheongju) catchment are presented here as an example. Figure 5 shows the scatter plots of MP value versus NSE obtained from the Uniform Random Search. Optimal values of rtq and f are relatively identifiable, whereas optimal values for rts are relatively non-identifiable, with c and b having intermediate identifiability. Figure 5 shows that, even for the more identifiable MPs, the optimum values are different across typhoons. A similar result is observed for the other gauges.  Table 5 shows the MP and NSE values for C4, and Figure 6 shows the corresponding modeled and observed hydrographs. Table 6 shows the NSE values representing calibration performance of the model applied to the 113 events. In rainfall runoff modelling, the estimated MPs are generally assumed to be the same over all flood events; however, there are wide variations in optimum model parameters in Table 5 and Figure 5. This could be due to a data quality problem, a limitation of model structures or different hydrological responses to flood events. We suppose that these variations in MP values may be partly due to the changes in hydrological response between events, i.e., the MP value must change to allow the PDM model to capture that changed response. Thus, the flood characteristics of each typhoon event are included in the GEE.  There is a wide variation in results across flood events. For typhoon Kompasu period, the range of NSE values across 18 catchments is −13.5 to 0.9; and for typhoon Bolaven the range across 23 catchments is −0.02 to 0.96. 68 of the 113 events have NSE values greater than 0.5. C18 and C23 have consistently low calibration performances with none of the modelled events having NSE > 0.5. This may be because these catchments have comparatively large data errors or have controls on flow, such as out-of-bank flood storage, which the simple PDM model cannot represent. The calibration results for typhoon Dianmu and Kompasu are poor in 12 and 13 of the catchments. This result is due to the inability of the model to simulate the low runoff ratios observed in many of the catchments (e.g., Table 3). It is likely that in these cases the observed rainfall was overestimated. Only the MP sets giving calibration NSE values greater than 0.5 are used for the statistical regression. Table 7 shows conventional regression equations based on the set of calibrated, event-averaged MPs (excluding C4, which will be used for validation of these equations). The R 2 of these equations varies between 0.1 (Cmax) to 0.79 (f).

Statistical Regression
The regression equations in Table 7 are applied to catchment C4 as an example. The estimated MPs are shown in Table 8, and Figure 7 shows the predicted hydrographs. The hydrographs at Kompasu and Bolaven are not considered acceptable in terms of volumes and peaks, indicating that MP estimates are not applicable to all flood events. The result at Dianmu is better than the calibrated model in terms of peak flow, at the expense of volume performance.  Table 9 shows the developed GEEs. In the GEE model of parameter b, the total amount of rainfall (TR) in a typhoon event was selected as an explanatory variable. The b parameter represents the degree of spatial variability of storage capacity in the catchment, therefore a possible explanation for the model of b is as follows: As total rainfall increases, the effect of variability of the soil moisture storage capacity is expected to reduce and hence the value of b is expected to reduce. However, this effect should be implicit in the PDM model, and so the reason b is related to TR is more complex. For the other MPs, only CCs are selected. The R 2 values in Table 9 are low compared to those in Table 7. This is expected because of the much lower variance of the MPs when they are averaged over the typhoons so that only 23 data points are used, as was done to obtain the results in Table 7, compared to the 107 data points used to obtain the results in Table 9 (107 points rather than 113 because the 6 points from C4 are kept back for model validation). In particular, the non-identifiable MPs would increase variance of the errors more if they are not averaged over events. The results in Table 9 may also be influenced by the covariance of the MP values between flood events in each catchment, whereby if the matrix R Equation (6) could be more accurately specified it may improve the GEE result. In Figure 7, the GEE results (shown in solid lines) are generally better performance in the peak flow than the conventional linear regression results (shown in dotted lines) for typhoons Rusa, Maemi, Bolaven and Kompasu. For typhoon Kompasu, the GEE matches closely the peak of the hydrograph, although it has over-predicted the recession flows. For typhoon Nari, the hydrographs simulated using GEE are slightly worse than those simulated using the conventional regression model. However, overall, the GEE gave slightly better performances compared to the conventional regression model. The results of cross validation are shown in Figure 8. The selected CCs for each MP are the same CCs shown in Tables 7 and 9. The results for the conventional regression and GEEs are compared with the locally calibrated model results for each typhoon flood event. In Figure 8a,b, if the points are located on the 45-degree line, the performances using regionalization are the same as those of calibration; whereas if the points are located above the 45-degree line, the performances using regionalization are better than those using the calibrated model. The calibration results are better than both regionalization approaches. The results of the two regionalization methods are compared in Figure 8c). The performances using the two methods are similar. This indicates that there is no practical benefit adopting the more complex GEE method, although there may be scope for refining its specification, in particular the link function and correlation structure. Furthermore, in this study we did not incorporate error models in order to estimate confidence intervals on predictions, and these may be more successful using the GEE approach.

Conclusions
This study analyses six typhoon flood events (Rusa, Maemi, Nari, Dienmu, Kompasu and Bolaven) at 24 Geum River catchments and applies the Probability Distributed Moisture (PDM) rainfall-runoff model to predict river flows. Two alternative statistical models, conventional regression and Generalized Estimating Equations (GEEs), are used to regionalize the PDM model parameters, so that typhoon-induced flows can be simulated at ungauged locations within the same region. The main outcomes are: - The application of a calibrated PDM model to modelling flood event flows in Geum River catchments shows, overall, an acceptable model performance. This supports the use of the PDM model or comparable rainfall-runoff models for simulating extreme flood events in Korea. -Using conventional regression equations to regionalize model parameters to ungauged catchments showed mixed success, for example when treating the C4 catchment as ungauged, there were good results for two flood events (typhoons Nari and Dianmu) and underestimated peaks for two events (typhoons Rusa and Kompasu). - The GEE model extends the conventional regression by including the inter-event variability in PDM model parameters as well as the inter-catchment variability. However only the model parameter b was found to be related to event properties; and validation results showed only slight improvement on the simpler regression approach. While for practical applications we would therefore recommend the simpler regression approach, refinements to the GEE approach may be explored, in particular its potential advantage for estimating an error model and confidence limits on predictions.
Future research should also increase the number of extreme flood events and study catchments to increase the statistical significance of the analysis and increase confidence in the comparison between the two approaches.
Author Contributions: Ho.L. designed the study; N.M. and Hy.L. conducted the majority of the data analysis and wrote the majority of the paper; S.K. provided the data from the models. J.K. contributed to the statistical analysis and interpretation.

Appendix A. GEE for Regionalization of Rainfall Runoff Model
GEE is a semi-parametric method which uses moment assumptions, instead of assuming a certain distribution as in a Generalized Linear Mixed Model. Besides, GEE is instead of attempting to model the within-subject covariance structure, to treat it as a nuisance and simply model the mean response. Sandwich estimator [32,34,35] enables to be robust against to the mis-specified within-subject covariance structure. The estimation is based on iteratively reweighted least squares (IRLS) with quasi-likelihood. For our flood event data, the following GEE model can be considered: g(E(y ij )) = β 0 + β 1 x i1 + · · · + β p x ip + β p+1 z ij1 + · · · + β p+q z ijq (A1) Note that g(·) represents a link function, which can be identity, logit and log functions depending on if response y ij is normally distributed, binary or a Poisson process. In this paper, we assume our predicted variables of interest (i.e., PDM MPs) are continuous, hence, we consider the identity link function only and Equation (A1) can be rewritten as: E(y ij ) = µ ij = β 0 + β 1 x i1 + · · · + β p x ip + β p+1 z ij1 + · · · + β p+q z ijq (A2) Based on the CCs and flood characteristics, the GEE model is E(y ij ) = µ ij = β 0 + β 1 Area i + β 2 ALTBAR i + β 3 FF i + β 4 DD i + β 5 SAAR i + β 6 FARL i +β 7 CN i + β 8 DPS i + β 9 TR ij + β 10 MRI ij + β 11 P5 ij (A3) Equation (A3) is separately estimated for each PDM MP. Then, µ ij corresponds to the mean value of an MP for the j-th flood event from the i-th catchment.
In GEE, within-subject covariance has some structure and this is called "working covariance". For estimation, we need to specify the correlation structure (denoted as R i ). For example, if there are three modeled flood events in the i-th catchment, we can consider R i as the following structure: The matrix in (A4) is the most general case and is called "unspecified" or "unstructured". In addition, there exist other structures such as (i) independent (ρ ij = 0 for all i and j); (ii) Exchangeable (ρ ij = ρ for all i and j) and (iii) AR (1) (ρ ij = ρ |i−j| ). In this paper, we use the "Exchangeable" structure assuming that flood events over time have the same correlation.