An Entropy-Based Generalized Gamma Distribution for Flood Frequency Analysis

Flood frequency analysis (FFA) is needed for the design of water engineering and hydraulic structures. The choice of an appropriate frequency distribution is one of the most important issues in FFA. A key problem in FFA is that no single distribution has been accepted as a global standard. The common practice is to try several candidate distributions and select the one that best fits the data, based on a goodness-of-fit criterion. However, this practice entails much calculation. Sometimes generalized distributions, which can specialize into several simpler distributions, are fitted, for they may provide a better fit to data. Therefore, the generalized gamma (GG) distribution was employed for FFA in this study. The principle of maximum entropy (POME) was used to estimate the GG parameters. Monte Carlo simulation was carried out to evaluate the performance of the GG distribution and to compare it with widely used distributions. Finally, the T-year design flood was calculated using the GG distribution and compared with that obtained from other distributions. Results show that the GG distribution is either superior or comparable to the other distributions.


Introduction
Flood frequency analysis (FFA) is needed for the design of water engineering and hydraulic structures. The sizing of bridges, culverts and other facilities; the design capacities of levees, spillways and other control structures; and reservoir operation or management all depend upon the estimated magnitude of various design flood values [1][2][3]. In FFA, flow data, such as the annual maximum data, are fitted using a theoretical frequency distribution, which is usually selected from a set of candidate distributions [4]. For example, the Pearson type three distribution (P3) has been recommended in China [5]. In the US, since 1967 the Log-Pearson type 3 distribution (LP3) has been the official distribution for all catchments which are fitted for planning and insurance purposes [6]. The UK has endorsed the GEV distribution [7,8] for FFA.
The choice of the appropriate model is one of the most important issues for FFA. The method commonly practiced is to try different distributions for the data at hand and choose the best-fitting distribution using a particular goodness-of-fit measure [9]. One disadvantage of this method is that many different distributions need to be tried, and the selected distribution may be the best according to one goodness-of-fit criterion but not according to another. In order to [...] FFA. The generalized gamma (GG) distribution is discussed in this study. It is a generalization of the two-parameter gamma distribution. The GG distribution includes as special cases the exponential distribution, the two-parameter gamma distribution, and the Weibull distribution, which provides sufficient flexibility to fit a large variety of data sets. After deciding on the distribution, the second issue is to estimate the parameters associated with the GG distribution. The popular techniques for parameter estimation include the methods of maximum likelihood (ML) [7], moments (MM) [10] and L-moments [11]. In addition, entropy theory can be used to derive more generalized distributions using different constraints [12]. The theory involves maximizing entropy in accord with the principle of maximum entropy (POME), in which the distribution parameters are determined, given the observed data and a set of constraints. Singh [12] indicated that the entropy method was reasonable and efficient for parameter estimation.
The objective of this study was therefore to propose an entropy based generalized gamma distribution for flood frequency analysis. The GG distribution parameters were estimated using POME. The GG distribution was tested using observed data sets. Also, Monte Carlo simulation was carried out to evaluate the predictive ability of the GG distribution and it was compared with some widely accepted distributions. Finally, the T-year design flood values were calculated and compared based on different FFA distributions.

Generalized Gamma Distribution
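Let X be a random variable and x be its specific value. The probability density function (PDF) of the generalized gamma (GG) distribution can be expressed as:
$$f(x) = \frac{r_2}{\beta\,\Gamma(r_1/r_2)} \left(\frac{x}{\beta}\right)^{r_1 - 1} \exp\left[-\left(\frac{x}{\beta}\right)^{r_2}\right], \quad x > 0 \qquad (1)$$
where Γ(•) is the gamma function, r1 and r2 are the shape parameters, and β is the scale parameter.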
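As a rough illustration of how the GG density reduces to its special cases, the following minimal sketch evaluates the density in the form f(x) = r2 / (β Γ(r1/r2)) (x/β)^(r1−1) exp[−(x/β)^r2] and compares it with the exponential (r1 = r2 = 1), Weibull (r1 = r2) and two-parameter gamma (r2 = 1) densities. The function names are illustrative choices and do not come from the paper.

```python
import math

def gg_pdf(x, r1, r2, beta):
    """Generalized gamma density with shapes r1, r2 and scale beta (x > 0)."""
    if x <= 0.0:
        return 0.0
    z = x / beta
    # Evaluate in log space so that large shape values do not overflow Gamma().
    log_f = (math.log(r2) - math.log(beta) - math.lgamma(r1 / r2)
             + (r1 - 1.0) * math.log(z) - z ** r2)
    return math.exp(log_f)

def exponential_pdf(x, beta):
    """Exponential density with scale beta: the GG case r1 = r2 = 1."""
    return math.exp(-x / beta) / beta

def weibull_pdf(x, c, beta):
    """Weibull density with shape c and scale beta: the GG case r1 = r2 = c."""
    return (c / beta) * (x / beta) ** (c - 1.0) * math.exp(-((x / beta) ** c))

def gamma_pdf(x, k, beta):
    """Two-parameter gamma density with shape k and scale beta: the GG case r1 = k, r2 = 1."""
    return x ** (k - 1.0) * math.exp(-x / beta) / (beta ** k * math.gamma(k))
```

For example, `gg_pdf(2.0, 2.5, 2.5, 1.5)` and `weibull_pdf(2.0, 2.5, 1.5)` should agree to floating-point precision.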

Estimation of Parameters of GG Distribution by POME
The GG distribution parameters were determined using the principle of maximum entropy (POME). The POME method involves the following steps: (1) specification of constraints; (2) maximization of entropy using the method of Lagrange multipliers; (3) derivation of the relation between Lagrange multipliers and constraints; (4) derivation of the relation between Lagrange multipliers and distribution parameters; and (5) derivation of the relation between distribution parameters and constraints. A flow chart showing the estimation procedure is shown in Figure 1.

Specification of Constraints
Flood discharge is considered as a random variable X, which ranges from 0 to infinity. Its probability density function (PDF) and cumulative distribution function (CDF) are denoted as f(x) and F(x), respectively, where x is a specific value of X. Since constraints encode the information that can be given for the random variable, following Singh [12], the constraints for the GG distribution can be expressed as:
$$\int_0^\infty f(x)\,dx = 1 \qquad (2a)$$
$$\int_0^\infty \ln x \, f(x)\,dx = E[\ln x] \qquad (2b)$$
$$\int_0^\infty x^q f(x)\,dx = E[x^q] \qquad (2c)$$
The first constraint is the total probability law, the second constraint is the mean of the log values (equivalently, the log of the geometric mean), and the third constraint is the mean of the values raised to a power q.

Maximization of Entropy Using the Method of Lagrange Multipliers
The Shannon entropy of X, H(X), can be expressed as [13]:
$$H(X) = -\int_0^\infty f(x) \ln f(x)\,dx \qquad (3)$$
The PDF f(x) can be obtained by maximizing the Shannon entropy subject to the given constraints, in accord with the principle of maximum entropy (POME). Following Singh [14,15], maximization of Equation (3), subject to Equations (2a) to (2c), using the method of Lagrange multipliers leads to:
$$f(x) = \exp\left(-\lambda_0 - \lambda_1 \ln x - \lambda_2 x^q\right)$$
where λ0, λ1 and λ2 are the unknown Lagrange multipliers.

Relation between Lagrange Multipliers and Constraints
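Since the Lagrange multiplier λ0 can be expressed by Equations (6) and (7), the following set of equations can be used to obtain λ0:
$$\lambda_0 = \ln \int_0^\infty \exp\left(-\lambda_1 \ln x - \lambda_2 x^q\right) dx \qquad (11a)$$
$$\lambda_0 = \ln \Gamma\left(\frac{1-\lambda_1}{q}\right) - \frac{1-\lambda_1}{q} \ln \lambda_2 - \ln q \qquad (11b)$$
Differentiation of Equation (11a) with respect to λ1 and λ2 yields:
$$\frac{\partial \lambda_0}{\partial \lambda_1} = -E[\ln x], \qquad \frac{\partial \lambda_0}{\partial \lambda_2} = -E[x^q] \qquad (12)$$
Defining b = (1 − λ1)/q, and differentiating Equation (11b) with respect to λ1 and λ2, we obtain:
$$\frac{\partial \lambda_0}{\partial \lambda_1} = -\frac{1}{q}\left[\phi(b) - \ln \lambda_2\right], \qquad \frac{\partial \lambda_0}{\partial \lambda_2} = -\frac{b}{\lambda_2} \qquad (13)$$
where ϕ(•) is the digamma function. Based on Equations (12) and (13), the relation between Lagrange multipliers and constraints can be expressed as:
$$E[\ln x] = \frac{1}{q}\left[\phi(b) - \ln \lambda_2\right], \qquad E[x^q] = \frac{b}{\lambda_2} \qquad (14)$$
Since there are three parameters, Equations (13) and (14) are not sufficient for calculating all the parameters, and one additional equation is therefore needed, which is given as:
$$\mathrm{var}(\ln x) = \frac{\partial^2 \lambda_0}{\partial \lambda_1^2} = \frac{\phi'(b)}{q^2} \qquad (15)$$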

Relation between Parameters and Constraints
Based on the relation between Lagrange multipliers and constraints, and that between Lagrange multipliers and parameters, the relation between parameters and constraints can be obtained. Comparing the entropy-based PDF with the GG PDF in Equation (1) gives λ1 = 1 − r1, q = r2 and λ2 = β^(−r2), so that:
$$E[\ln x] = \ln \beta + \frac{1}{r_2}\,\phi\left(\frac{r_1}{r_2}\right), \qquad E[x^{r_2}] = \frac{r_1}{r_2}\,\beta^{r_2}, \qquad \mathrm{var}(\ln x) = \frac{1}{r_2^2}\,\phi'\left(\frac{r_1}{r_2}\right) \qquad (16)$$
where ϕ(•) is the digamma function and ϕ′(•) is the trigamma function. For a given data set X, E(ln x) and var(ln x) can be calculated directly. There are three parameters and three equations in Equation (16). Therefore, this set of nonlinear equations can be solved for the parameters by the widely used Newton iteration method [16]. The initial values of the three parameters are set to (1, 1, 1). After multiple iterations, the optimal parameters can be obtained.
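The moment relations that a POME solver has to satisfy can be sketched in code. The block below is a minimal illustration, not the authors' implementation. It assumes the three conditions implied by the GG density, namely E[ln x] = ln β + ϕ(r1/r2)/r2, var(ln x) = ϕ′(r1/r2)/r2² and E[x^r2] = (r1/r2) β^r2, and returns their residuals, which a Newton iteration would drive to zero. The digamma and trigamma functions are approximated with a standard recurrence plus asymptotic series so that the sketch needs only the standard library. With SciPy available, `scipy.special.digamma` and `scipy.special.polygamma(1, x)` could be used instead. All function names are illustrative.

```python
import math

def digamma(x):
    """Digamma function for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return acc + math.log(x) - 0.5 / x - series

def trigamma(x):
    """Trigamma function for x > 0: upward recurrence, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return acc + inv + 0.5 * inv2 + series

def sample_constraints(data, r2):
    """Sample estimates of E[ln x], var(ln x) and E[x**r2] for positive data."""
    n = len(data)
    logs = [math.log(x) for x in data]
    mean_ln = sum(logs) / n
    var_ln = sum((v - mean_ln) ** 2 for v in logs) / n
    mean_pow = sum(x ** r2 for x in data) / n
    return mean_ln, var_ln, mean_pow

def pome_residuals(r1, r2, beta, mean_ln, var_ln, mean_pow):
    """Residuals of the three assumed moment conditions; all zero at a solution."""
    k = r1 / r2
    return (
        math.log(beta) + digamma(k) / r2 - mean_ln,
        trigamma(k) / r2 ** 2 - var_ln,
        k * beta ** r2 - mean_pow,
    )
```

Because the third constraint depends on r2, `sample_constraints` has to be re-evaluated at the current r2 in every iteration of the solver.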

The Descriptive Ability of GG Distribution
Annual maximum (AM) flood peak data from 10 gauging stations, namely sites 1 to 10, were selected (Table 1). These ten stations were selected for the diversity of their statistical properties and climate types (arid, semi-arid and humid). Besides AM series, partial-duration series can also be employed with the POME method. In this study, the AM series was considered since it is more widely used. The GG distribution was employed to fit the AM series of the 10 sites. The distribution parameters were estimated using Equation (16). The fitted GG distribution and the empirical frequency distribution of the AM series from sites 1, 5, 6 and 8 are shown in Figures 2-5. These four sites were selected because sites 5 and 8 have low skew, site 1 has moderate skew and site 6 has high skew, so the cumulative distributions and histograms of the AM series fitted by the GG distribution for these sites are representative. In the figures, the line represents the fitted distribution and the points represent the empirical frequencies of the observations. Results show that the GG distribution fitted the empirical data well. Histograms of the AM flood peak series fitted by the GG distribution for the four sites likewise show that the GG distribution fitted the empirical histograms well. The skewness coefficients of the AM series of sites 1, 5, 6 and 8 were 1.94, 0.66, 2.93 and 0.75, respectively, which shows that the GG distribution described both low and high skewed data well.
[...] [11,17], while the parameters of the NM and LP3 distributions were estimated by MM [18,19]. These FFA models were also fitted to the AM series for the 10 sites, and the values of RMSE and AIC were computed for each model using Equations (17) and (18) and are listed in Table 2.
In Equations (17) and (18), n denotes the sample size, K is the number of parameters of the distribution, P̂ is the theoretical non-exceedance probability calculated by the distribution, and P is the empirical non-exceedance probability. The root mean square error (RMSE) is a frequently used measure of the differences between the values predicted by a model or an estimator and the values actually observed. Smaller RMSE values indicate better model performance. The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. It includes a penalty that is an increasing function of the number of estimated parameters. Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Table 2 illustrates that for sites 1, 2, 4, 5, 8, 9 and 10, the GG distribution had the smallest RMSE values, which means the GG distribution fitted the observed AM data best at those sites. In addition, the GG distribution had the smallest AIC values for sites 2, 5, 8, and 10. Table 2 also indicates that the average RMSE and AIC values of the GG distribution are the smallest among all the compared distributions. Thus, on average, the GG distribution performs better than the other distributions.
Table 2 also shows that the GG, P3, GEV and LP3 distributions gave quite similar performances for most of the selected sites. However, the GG distribution performed better at several sites. For site 2, the RMSE values for the GG, P3, and GEV distributions were 0.028, 0.033 and 0.034, respectively. The AIC values were −417.67, −351.64 and −347.79, respectively. Thus, the GG distribution performed much better than the P3 and GEV distributions for site 2. Compared with the LP3 distribution, the GG distribution was more appropriate for sites 5 and 7. For site 5, the RMSE and AIC values for the LP3 distribution (GG distribution) were 0.016 (0.013) and −559.41 (−578.63), respectively. For site 7, the RMSE and AIC values for the LP3 distribution (GG distribution) were 0.019 (0.016) and −545.85 (−571.71), respectively. Thus, the GG distribution outperformed the LP3 distribution for those two sites. The above discussion shows that the GG distribution is either superior or comparable to the commonly used distributions.
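A goodness-of-fit comparison of this kind can be sketched in a few lines. The block below is only an illustration under stated assumptions, since Equations (17) and (18) are not reproduced here. It assumes that RMSE is the root of the mean squared difference between theoretical and empirical non-exceedance probabilities, that AIC takes the common least-squares form n ln(MSE) + 2K, and that empirical probabilities come from the Weibull plotting position i/(n + 1). Other plotting positions or AIC variants would change the numbers.

```python
import math

def empirical_probabilities(n):
    """Assumed plotting positions i/(n+1) for a sample sorted in ascending order."""
    return [i / (n + 1.0) for i in range(1, n + 1)]

def rmse(p_theory, p_empirical):
    """Root mean square error between theoretical and empirical probabilities."""
    n = len(p_theory)
    sse = sum((a - b) ** 2 for a, b in zip(p_theory, p_empirical))
    return math.sqrt(sse / n)

def aic(p_theory, p_empirical, k):
    """Assumed least-squares AIC: n * ln(MSE) + 2 * K, with K fitted parameters."""
    n = len(p_theory)
    mse = sum((a - b) ** 2 for a, b in zip(p_theory, p_empirical)) / n
    return n * math.log(mse) + 2 * k
```

Here `p_theory` would hold the fitted CDF evaluated at the sorted observations, and `k` would be 3 for the GG, P3, GEV and LP3 distributions.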
The maximum likelihood (ML) method was also employed for the GG distribution and compared with the proposed GG-POME model for site 5 (low skew) and site 6 (high skew). Figure 6 compares their probability density functions and indicates that the GG-POME model gives a better performance. The RMSE and AIC values of the GG-ML model for sites 5 and 6 were also calculated. For site 5, the RMSE and AIC values for the GG-ML (GG-POME) model are 0.023 (0.013) and −497.54 (−578.63), respectively. For site 6, the RMSE and AIC values for the GG-ML (GG-POME) model are 0.032 (0.024) and −379.24 (−427.53), respectively. This suggests that the GG-POME model outperforms the GG-ML model at these two sites.
Figure 6. Comparisons of probability density functions of GG-POME and GG-ML models for sites 5 and 6.

Monte Carlo Simulation
The predictive ability of the GG distribution was evaluated using Monte Carlo simulation and compared with that of the P3, GEV, and LP3 distributions. To test how well a candidate distribution estimated the magnitude-return period relationship, a parent distribution that was not identical to any of the candidate distributions was chosen. Cunnane [20] recommended that such a parent distribution should be a Wakeby distribution with certain parameters. In this study, three kinds of data sets were generated from the Wakeby distribution with the parameters shown in Table 3. The Wakeby distribution has a quantile function given as [21]:
$$x(F) = \xi + \frac{\alpha}{\beta}\left[1 - (1-F)^{\beta}\right] - \frac{\gamma}{\delta}\left[1 - (1-F)^{-\delta}\right] \qquad (19)$$
where F is the uniform (0, 1) variate, and ξ, α, β, γ, δ are the parameters. Then, the true quantile value Q_T was computed. S = 1000 samples of size n (n = 20, 50, 100) were generated from each Wakeby distribution and fitted by the four distributions to estimate the events with T = 10, 100 and 1000-year return periods. Table 4 lists the RB and RRMSE values computed for each distribution using Equations (20) and (21), in which Q_T is a given parent quantile, (Q̂_T)_1, ..., (Q̂_T)_S are the estimates for the samples generated from the Wakeby distribution, and S is the number of Monte Carlo trials. The relative bias (RB) and the relative root mean square error (RRMSE) were used to evaluate the accuracy and efficiency of a candidate model, respectively.
From Table 4, for all distributions and all cases, the RB and RRMSE values generally increased with the return period T. For a small return period (T = 10), the four selected distributions behaved very similarly regardless of the sample size. For moderate and large return periods (T = 100 and 1000), notable differences in the RB and RRMSE values were observed. The following discussion therefore focuses mainly on the quantile estimators for moderate and large return periods.
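The simulation design described above can be outlined in code. The following is a minimal sketch, not the authors' implementation. It assumes the Wakeby quantile function in its common five-parameter form, x(F) = ξ + (α/β)[1 − (1 − F)^β] − (γ/δ)[1 − (1 − F)^(−δ)], and assumes that RB and RRMSE are the mean and root mean square of the relative errors (Q̂ − Q_T)/Q_T expressed in percent. The exact forms of Equations (20) and (21) are not reproduced here, and the function names are illustrative.

```python
import math
import random

def wakeby_quantile(F, xi, alpha, beta, gamma, delta):
    """Wakeby quantile for non-exceedance probability F in [0, 1)."""
    u = 1.0 - F
    return (xi + (alpha / beta) * (1.0 - u ** beta)
            - (gamma / delta) * (1.0 - u ** (-delta)))

def wakeby_sample(n, params, rng):
    """Draw n values by inverse-transform sampling; params = (xi, alpha, beta, gamma, delta)."""
    return [wakeby_quantile(rng.random(), *params) for _ in range(n)]

def true_quantile(T, params):
    """Parent quantile Q_T for return period T, i.e. F = 1 - 1/T."""
    return wakeby_quantile(1.0 - 1.0 / T, *params)

def relative_bias(estimates, q_true):
    """Mean relative error of the quantile estimates, in percent (assumed form)."""
    return 100.0 * sum((q - q_true) / q_true for q in estimates) / len(estimates)

def rrmse(estimates, q_true):
    """Root mean square relative error of the estimates, in percent (assumed form)."""
    mean_sq = sum(((q - q_true) / q_true) ** 2 for q in estimates) / len(estimates)
    return 100.0 * math.sqrt(mean_sq)
```

In a full experiment, each of the S samples from `wakeby_sample` would be fitted by a candidate distribution, and the S fitted T-year quantiles would be passed to `relative_bias` and `rrmse` together with `true_quantile(T, params)`.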
For case 1 (Cv = 0.2, Cs = 0.16), the GG and P3 distributions were superior to the GEV and LP3 distributions. When the sample size was 100 or 50, the P3 distribution quantile estimators had the smallest RB values for both moderate and large return periods (T = 100 and 1000), but the GG distribution quantile estimators had smaller RRMSE values for T = 1000 than the other distributions. For a small sample size (n = 20), the GG distribution had the smallest RB and RRMSE values for both moderate and large return periods (T = 100 and 1000). For T = 1000, the RRMSE values of the GG, P3, GEV and LP3 distributions were 6.46, 9.08, 10.84 and 10.03, respectively. The GG distribution thus performed much better when the sample size was small, which indicates that it was more robust. Thus, for case 1, the P3 distribution was preferable when the sample size was larger than 50, while the GG distribution was more appropriate when the sample size did not exceed 50.
For case 2 (Cv = 0.36, Cs = 0.48), results indicated that for sample sizes n = 50 and n = 100, the GEV distribution quantile estimators had the smallest RB values for T = 100 and the LP3 distribution quantile estimators had the smallest RB values for T = 1000. However, their RRMSE values were quite large and increased significantly when the sample size decreased. For T = 1000, when the sample size decreased from 100 to 20, the RRMSE values of the GEV distribution rose from 16.35 to 34.45, and those of the LP3 distribution rose from 13.51 to 44.85, whereas the RRMSE values of the GG distribution rose more moderately, from 4.8 to 14.18. This was due to the poor accuracy of the GEV and LP3 parameter estimators, which had high variance for small sample sizes. In this case, the GG distribution performed significantly better than the other three distributions. Its RB values were quite small, and its RRMSE values were the smallest for all sample sizes and return periods. This was a good indication of the robustness of the GG distribution for this case.
For case 3 (Cv = 0.55, Cs = 0.97), all distribution quantile estimators had quite large RB and RRMSE values. For n = 50 and n = 100, the RB and RRMSE of the GEV distribution were the highest, amounting to 21.02 and 33.68, respectively, for n = 50 and T = 1000, while the GG distribution yielded 16.86 and 22.85, respectively. Also for n = 50 and n = 100, the LP3 distribution quantile estimators had the smallest RB values for both T = 100 and T = 1000, and the other three distributions had similar RB values. However, the LP3 distribution gave the worst performance for the small sample size (n = 20). Its RB and RRMSE values were 26.58 and 57.28, respectively, for T = 1000, whereas the GG distribution yielded 17.67 and 28.16, respectively. In this case, the RB values of the GG distribution were comparable to those of the P3 and GEV distributions and a little larger than those of the LP3 distribution for n = 50 and n = 100. The RRMSE values of the GG distribution were the smallest for both moderate and large return periods (T = 100 and 1000) regardless of the sample size. Also, when the sample size decreased from 100 to 20, the RB and RRMSE values of the GG distribution rose from 17.52 and 19.32 to 17.67 and 28.16, respectively. This suggests that the GG distribution was less affected by sample size. Thus, the GG distribution was superior to the other distributions for this case.
Therefore, the predictive ability of the GG distribution was found to be comparable or superior to that of the other distributions. It was also more robust, since it was less affected by sample size, and therefore estimated the magnitude-return period relationships better.

T-Year Design Flood Calculation
The Danjiangkou Reservoir lies in the upper Hanjiang basin and is the source of water for the Middle Route Project under the South-to-North Water Transfer Scheme in China [22]. The Geheyan Reservoir, with a volume of 3.12 billion m³, plays an important role in the management of the Qingjiang River [23]. Flood frequency analysis for these two sites was therefore considered in this study. The T-year design floods calculated by different FFA distributions at Danjiangkou Reservoir and Geheyan Reservoir are listed in Table 5. Figures 7 and 8 compare the frequency curves of the different distributions at these two reservoir sites. Table 5 indicates that the design floods for small return periods were similar for the four distributions. However, significant differences were observed for large return periods. The 1000-year design floods calculated by the GG and LP3 distributions at Danjiangkou Reservoir were 55,234 m³/s and 48,822 m³/s, respectively, and the 1000-year design floods calculated by the GEV and LP3 distributions at Geheyan Reservoir were 15,746 m³/s and 13,877 m³/s, respectively. Figure 7 indicates that the GG, P3, and GEV distributions had quite similar flood quantile estimators for large return periods at Danjiangkou Reservoir, whereas the 1000-year design flood calculated by the LP3 distribution was smaller than that calculated by the other three distributions. Figure 8 indicates that the 1000-year design flood at Geheyan Reservoir was the largest for the GEV distribution and the smallest for the LP3 distribution. The design flood calculated by the GG distribution was quite close to that by the LP3 distribution.
Besides, the P3 distribution has been adopted in China as a uniform procedure for FFA [24,25]. Table 2 shows that the RMSE and AIC values for the P3 distribution at Danjiangkou Reservoir were 0.023 and −419.04, respectively, and the GG distribution yielded 0.021 and −421.51, respectively. The RMSE and AIC values for the P3 distribution at Geheyan Reservoir were 0.027 and −249.79, respectively, and the GG distribution yielded 0.023 and −269.84, respectively. Thus, the performance of the GG distribution was better than that of the P3 distribution. Therefore, the design flood estimated by the GG distribution would be preferable in practice.
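To make the T-year calculation concrete, the following minimal sketch computes a GG design flood as the quantile with non-exceedance probability 1 − 1/T. It assumes the GG density in the form with shapes r1, r2 and scale β, for which (X/β)^r2 follows a standard gamma distribution with shape r1/r2, so the CDF is a regularized incomplete gamma function. The series evaluation and the bisection search are illustrative choices for a dependency-free example, not the procedure used in the paper. With SciPy, `scipy.stats.gengamma` (with `a = r1/r2`, `c = r2`, `scale = beta`) offers an equivalent quantile through its `ppf` method.

```python
import math

def reg_lower_gamma(a, y, max_terms=500):
    """Regularized lower incomplete gamma P(a, y), evaluated by its power series."""
    if y <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= y / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))

def gg_cdf(x, r1, r2, beta):
    """GG non-exceedance probability F(x) = P(r1/r2, (x/beta)**r2)."""
    if x <= 0.0:
        return 0.0
    return reg_lower_gamma(r1 / r2, (x / beta) ** r2)

def design_flood(T, r1, r2, beta):
    """T-year quantile: solve F(x) = 1 - 1/T by bisection on y = (x/beta)**r2."""
    p = 1.0 - 1.0 / T
    k = r1 / r2
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(k, hi) < p:  # expand the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(k, mid) < p:
            lo = mid
        else:
            hi = mid
    return beta * (0.5 * (lo + hi)) ** (1.0 / r2)
```

For the exponential special case (r1 = r2 = 1), `design_flood(T, 1.0, 1.0, beta)` reduces to β ln T, which gives a quick sanity check of the routine.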

Conclusions
In this study, the GG distribution with parameters estimated by POME was applied for FFA. Ten gauging stations were selected as a case study to test the GG distribution. Frequency estimates from the GG distribution were also compared with those of commonly used distributions. A Monte Carlo simulation study was carried out to evaluate the predictive ability of the GG distribution and compare it with other distributions. In addition, some characteristics of frequency curves at Danjiangkou Reservoir and Geheyan Reservoir were evaluated. The following conclusions are drawn from this study:
(1) The GG distribution is appealing for FFA. The cumulative distributions and histograms show that the GG distribution can fit both low and high skewed data well.
(2) The parameters estimated by POME are found to be reasonable. Both the marginal distributions and the histograms indicate that the GG distribution with parameters so estimated can be fitted successfully to empirical values.
(3) The performance of the GG distribution is comparable or superior to that of the other distributions. Results illustrate that for sites 1, 2, 4, 5, 8, 9 and 10, the GG distribution has the smallest RMSE values. In addition, the GG distribution has the smallest AIC values for sites 2, 5, 8, and 10. Thus, the GG distribution is preferred to the other distributions for those sites. Furthermore, the GG, P3, GEV, and LP3 distributions give similar performance for most of the selected sites, although the GG distribution fits better than the others at a few sites.
(4) The predictive ability of the GG distribution is found to be comparable or superior to that of widely accepted distributions. The GG distribution performs significantly better than the other three distributions when sample sizes are small. Thus, it is less affected by sample size and is more robust.