# Peaks-Over-Threshold-Based Regional Flood Frequency Analysis Using Regularised Linear Models

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Study Area and Data

The catchment areas of the selected 145 stations range from 11 km$^{2}$ to 1010 km$^{2}$, with an average of 360 km$^{2}$ and a median of 310 km$^{2}$. Streamflow record lengths range from 27 to 83 years, with an average of 42 years. Of the selected stations, 55 are in New South Wales (NSW) and 90 are in Victoria (VIC), both of which are Australian states.

The descriptive statistics of the seven independent variables, namely catchment area (A, km$^{2}$), mean annual rainfall (MAR, mm), catchment shape factor (SF, fraction), mean annual evapotranspiration (MAE, mm), catchment stream density (SDEN, km$^{-1}$), catchment mainstream slope (S1085, m·km$^{-1}$) and forest (FST, fraction), are summarised in Table 2. Table 3 shows the correlation coefficients between the independent variables. Some of the variables were found to be highly correlated. However, the Durbin–Watson statistics of the developed regression equations were close to 2.00, indicating that these correlations did not have much impact on the regression analysis. Moreover, penalised regression (as adopted here) is better able to deal with highly correlated variables.

#### 2.2. At-Site Flood Frequency Analysis

The dependent variable is $Q_T$ (flood discharge for the T-year return period), which is estimated by at-site flood frequency analysis.

#### 2.3. Linear Regression Analysis

Model performance is assessed using the median relative error ($RE_m$), relative error ($RE_r$), coefficient of determination ($R^2$) and ratio of predicted to observed flood quantile (Ratio).

Multiple linear regression (MLR) estimates the regression coefficients ($b_0$, $b_1$, $b_2$, …) by minimising the sum of squared errors (E) between the predicted and observed values of the dependent variable using a set of independent variables, X. The MLR model can be expressed by Equation (1):

$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_j X_{ij} + \dots + b_u X_{iu} + E_i \qquad (1)$$
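As an illustration of the least-squares estimation behind Equation (1), the following minimal sketch fits the coefficients by solving the normal equations with Gaussian elimination. It uses only the Python standard library; the function names `fit_mlr` and `predict` and the toy data are hypothetical and are not taken from the study, which reports using R.

```python
def fit_mlr(X, y):
    """Return [b0, b1, ..., bu] that minimise the sum of squared errors."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Forward elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    # Back substitution
    b = [0.0] * k
    for i in range(k - 1, -1, -1):
        b[i] = (M[i][k] - sum(M[i][j] * b[j] for j in range(i + 1, k))) / M[i][i]
    return b


def predict(b, x):
    """Evaluate b0 + b1*x1 + ... + bu*xu for one observation."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))


# Toy data generated from Y = 1 + 2*X1 - 3*X2 with no error term
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
b = fit_mlr(X, y)
print([round(v, 6) for v in b])
```

Because the toy data contain no error term, the fitted coefficients recover the generating values. With real catchment data the residuals $E_i$ would be non-zero, and correlated predictors can make the normal equations ill-conditioned, which is one motivation for penalised alternatives.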

$$\lVert x \rVert_1 = |x_1| + |x_2| + \dots + |x_n|$$
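To make the role of this norm concrete, the sketch below computes the L1 norm of a coefficient vector together with a combined penalty term. The elastic-net form used here, $\lambda[\alpha \lVert b \rVert_1 + \tfrac{1}{2}(1-\alpha)\lVert b \rVert_2^2]$, is the parameterisation used by the glmnet software and is an assumption on our part; the names `lam` and `alpha` and the example coefficients are hypothetical.

```python
def l1_norm(coeffs):
    """Sum of absolute values, as in the L1 norm above."""
    return sum(abs(b) for b in coeffs)


def squared_l2_norm(coeffs):
    """Sum of squared values (the ridge-type penalty)."""
    return sum(b * b for b in coeffs)


def elastic_net_penalty(coeffs, lam, alpha):
    """Assumed glmnet-style penalty: alpha=1 gives LASSO, alpha=0 gives ridge."""
    return lam * (alpha * l1_norm(coeffs)
                  + 0.5 * (1.0 - alpha) * squared_l2_norm(coeffs))


coeffs = [0.5, -1.5, 0.0, 2.0]  # hypothetical regression coefficients
print(l1_norm(coeffs))                        # L1 norm
print(elastic_net_penalty(coeffs, 0.1, 1.0))  # pure LASSO penalty
print(elastic_net_penalty(coeffs, 0.1, 0.0))  # pure ridge penalty
print(elastic_net_penalty(coeffs, 0.1, 0.5))  # equal mix
```

A penalised model adds this term to the sum of squared errors, so larger values of `lam` shrink the coefficients more strongly, and the L1 part can set some of them exactly to zero.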

#### 2.4. Model Construction

#### 2.5. Model Evaluation

The median relative error ($RE_m$) is a statistical measure for evaluating the prediction performance of a proposed model. The difference between the predicted flood quantile ($Q_{Pred}$) and the observed flood quantile ($Q_{Obs}$) is divided by $Q_{Obs}$ for each of the stations following leave-one-out cross-validation (LOOCV). The median of the absolute values over all the stations is then calculated, as shown in Equation (4):

$$RE_m = \mathrm{median}\left( \left| \frac{Q_{Pred} - Q_{Obs}}{Q_{Obs}} \right| \times 100 \right) \qquad (4)$$
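The calculation described above can be sketched in a few lines. The function names and the four-station example are hypothetical, and the result is expressed in percent to match the values reported later in the paper.

```python
from statistics import median


def relative_errors(q_pred, q_obs):
    """Signed relative error (%) at each station."""
    return [100.0 * (p - o) / o for p, o in zip(q_pred, q_obs)]


def median_relative_error(q_pred, q_obs):
    """Median of the absolute relative errors (%) over all stations."""
    return median(abs(e) for e in relative_errors(q_pred, q_obs))


# Hypothetical observed and LOOCV-predicted quantiles (m^3/s) at four stations
q_obs = [100.0, 200.0, 50.0, 400.0]
q_pred = [120.0, 150.0, 55.0, 400.0]
re_m = median_relative_error(q_pred, q_obs)
print(re_m)  # absolute errors are 20, 25, 10 and 0, so the median is 15.0
```

Taking the median rather than the mean keeps the statistic from being dominated by a few stations with very large errors.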

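The relative error ($RE_r$) measures the difference between $Q_{Pred}$ and $Q_{Obs}$ relative to $Q_{Obs}$, so that its sign reflects under- or over-estimation by the model, as shown in Equation (5):

$$RE_r = \frac{Q_{Pred} - Q_{Obs}}{Q_{Obs}} \times 100 \qquad (5)$$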

The coefficient of determination ($R^2$) is a statistical metric used to evaluate the goodness-of-fit of a regression equation. It quantifies the proportion of the total variability in the dependent variable that can be explained by the selected independent variables. The higher the $R^2$ value, the better the goodness-of-fit of the model, and a value of 1 indicates a perfect model. It is defined by Equation (6).
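As a rough illustration of "proportion of variability explained", the sketch below uses the common definition $R^2 = 1 - SSE/SST$. This particular form is an assumption, since the exact expression (and any transformation of the quantiles) used in the study is not reproduced here; the function name and data are hypothetical.

```python
def r_squared(observed, predicted):
    """Assumed definition: 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 4))
print(r_squared(observed, observed))  # a perfect model gives exactly 1.0
```

Under this definition a value of 1 corresponds to zero residual error, which agrees with the statement that 1 indicates a perfect model.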

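The Ratio is the quotient of $Q_{Pred}$ and $Q_{Obs}$ at a given station, as shown in Equation (7); a value smaller than 1 indicates an underestimation and a value greater than 1 indicates an overestimation by the developed prediction equation.

$$\mathrm{Ratio} = \frac{Q_{Pred}}{Q_{Obs}} \qquad (7)$$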
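The ratio and the relative error carry the same information at a station, since $RE_r = (\mathrm{Ratio} - 1) \times 100$ when $RE_r$ is expressed in percent. The sketch below illustrates this with hypothetical function names and two hypothetical stations.

```python
def ratio(q_pred, q_obs):
    """Predicted-to-observed flood quantile ratio."""
    return q_pred / q_obs


def relative_error(q_pred, q_obs):
    """Signed relative error in percent."""
    return 100.0 * (q_pred - q_obs) / q_obs


def bias_label(r):
    """Interpret a ratio: below 1 is underestimation, above 1 is overestimation."""
    if r < 1.0:
        return "underestimation"
    if r > 1.0:
        return "overestimation"
    return "exact"


# Hypothetical (predicted, observed) quantiles in m^3/s
stations = [(80.0, 100.0), (150.0, 100.0)]
for q_pred, q_obs in stations:
    r = ratio(q_pred, q_obs)
    print(r, relative_error(q_pred, q_obs), bias_label(r))
```

A median ratio near 1 and a median $RE_r$ near 0 are therefore two views of the same absence of systematic bias.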

## 3. Results and Discussion

Figure 5 shows the spatial distribution of absolute $RE_r$ values for the 2-year ARI for the four regression models. No significant spatial trend is noticed. Several stations in the inland region have very high absolute $RE_r$ values in both NSW and VIC. A similar pattern is also observed at the state boundary between NSW and VIC in the coastal region. Further study is needed to identify why these stations are associated with higher $RE_r$ values. It should be noted that the RFFA model recommended in ARR showed similar results; i.e., some stations had higher $RE_r$ values in model validation [35]. For the MLR model, slightly higher $RE_r$ values are observed in the upper region of NSW.

Figure 6 shows the absolute $RE_r$ values for the selected regression models for the 20-year ARI. The MLR and penalised regression models show a similar spatial distribution. A few stations located along the coastline of southern VIC have larger $RE_r$ values, in particular for MLR and LASSO. Further study is needed to find the reason for these higher $RE_r$ values. A larger portion of the inland region in VIC has greater $RE_r$ values for the 20-year ARI. On the other hand, the spatial pattern for the 20-year ARI is identical to that for the 2-year ARI at the boundary between NSW and VIC. Figures S3 and S4 plot the absolute $RE_r$ values for ARIs of 5 and 10 years, respectively. A similar distribution pattern of $RE_r$ values is observed in the coastal regions for these ARIs. In Figures S3 and S4, there are a few stations with larger values of absolute $RE_r$, unlike Figure 5.

Figures 6 and 7 show the absolute $RE_r$ values for the 20- and 100-year ARIs, respectively. Broad agreement between the penalised regression models is found for both of these ARIs. In contrast, the traditional MLR model shows a slight reduction in absolute $RE_r$ for the inland region of VIC. Figure S5 plots the absolute $RE_r$ for the 50-year ARI, which shows a pattern similar to those for ARIs of 20 and 100 years. Overall, the difference in absolute $RE_r$ across the selected regression models is minimal, as can be seen in Figure 8.

Figure 8 shows the cumulative count of stations within different ranges of absolute $RE_r$ values for ARIs of 2, 20 and 100 years. There are four classes based on a 25% interval of absolute $RE_r$ values. Overall, broad agreement between MLR and penalised regression models can be seen across all the selected ranges of absolute $RE_r$. For the 2-year ARI, the MLR model has the smallest count, with 40 stations having $RE_r$ < 25%, while the EN model has 42 stations. Only a small variability in the station counts is noted across the selected ARIs for all four regression models. Figure S6 plots the cumulative site count for ARIs of 5, 10 and 50 years for all the selected regression models. A distribution similar to that in Figure 8 is identified in Figure S6.
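A cumulative station count of this kind can be sketched as follows. The class boundaries of 25%, 50%, 75% and 100% are an assumption based on the stated 25% interval, and the function name and error values are hypothetical.

```python
def cumulative_counts(abs_re_r, thresholds=(25, 50, 75, 100)):
    """Number of stations whose absolute relative error (%) is below each threshold."""
    return {t: sum(1 for e in abs_re_r if e < t) for t in thresholds}


# Hypothetical absolute relative errors (%) at six stations
abs_re_r = [5.0, 30.0, 60.0, 80.0, 120.0, 20.0]
counts = cumulative_counts(abs_re_r)
print(counts)
```

Stations with errors above the last threshold (here the station at 120%) are not counted in any class, so the final cumulative count can be smaller than the total number of stations.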

Figure 9 shows the distribution of $R^2$ values of the selected regression models based on LOOCV for ARIs of 2, 20 and 100 years. Among the regression models for the 2-year ARI, MLR shows a median $R^2$ of 0.642, while the LASSO and EN models show slightly lower values. The RR model shows a median $R^2$ value of 0.645. For the 20-year ARI, the MLR model has the lowest median $R^2$ value of 0.575, while all the penalised models show median $R^2$ values larger than 0.58. Based on the distribution of $R^2$ in the boxplots, for the 2-year ARI, the best model is RR, followed by MLR, EN and LASSO. For the 20-year ARI, the best model is RR, followed by EN, LASSO and MLR. For the 100-year ARI, the best model is MLR, followed by RR, EN and LASSO. Figure S7 plots the $R^2$ values for the 5-, 10- and 50-year ARIs. As in Figure 9, no single model in Figure S7 shows the best performance across all the ARIs.

Figure 10 shows the distribution of the $Q_{Pred}/Q_{Obs}$ ratio (Equation (7)) for the regression models for ARIs of 2, 20 and 100 years. All the models show a median ratio close to the 1:1 line, which indicates broad agreement between the predicted and observed flood quantiles, without notable bias. Furthermore, the distributions of the ratio values (as shown by the boxplots) are very similar for all four models. Figure S8 plots the ratio values for ARIs of 5, 10 and 50 years, which broadly show results similar to those in Figure 10.

Figure 11 shows the distribution of $RE_r$ values for ARIs of 2, 20 and 100 years. The median $RE_r$ values match the zero line very well, which indicates that the developed regression models are mostly unbiased. The distribution of $RE_r$ values is quite similar for all the regression models (a very similar result is noticed for ARIs of 5, 10 and 50 years, as shown in Figure S9). It should be noted that for a few stations all the regression models overestimate the flood quantiles (shown as outliers in the boxplots).

The $RE_m$ values (Equation (4)) for the four regression models for all six ARIs are shown in Table 4. Although the $RE_m$ values are not remarkably different across the four regression models, LASSO has the smallest $RE_m$ values overall. The $RE_m$ values for LASSO are 37%, 44%, 43%, 44%, 43% and 46% for ARIs of 2, 5, 10, 20, 50 and 100 years, respectively. These are comparable to those of similar RFFA studies, such as that by Zalnezhad et al. [21], who reported $RE_m$ values of 42%, 33%, 36%, 40%, 44% and 54% for the same ARIs for an artificial neural network (ANN)-AM-based RFFA model for south-east Australia; the LASSO values are smaller for the 2-, 50- and 100-year ARIs and larger for the 5-, 10- and 20-year ARIs. Zalnezhad et al. [21] reported median $Q_{Pred}/Q_{Obs}$ ratio values in the range of 0.94 to 1.57, whereas the median ratio values in this study are very close to 1.00. The $RE_m$ values for LASSO are also smaller than those of the AM-based RFFA model recommended in Australian Rainfall and Runoff (ARR) [35], for which $RE_m$ values in the range of 57–64% were reported for ARIs of 2 to 100 years. The current study also provides more accurate predictions than the study of Aziz et al. [36], who reported $RE_m$ values in the range of 39% to 91% and median $Q_{Pred}/Q_{Obs}$ ratio values between 0.17 and 1.82 for an ANN-AM-based RFFA model in south-east Australia.

## 4. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Doeffinger, T.; Rubinyi, S. Secondary benefits of urban flood protection. J. Environ. Manag. **2023**, 326, 116617.
2. Gumbel, E.J. Statistics of Extremes; Columbia University Press: New York, NY, USA, 1958.
3. Kidson, R.; Richards, K.S. Flood frequency analysis: Assumptions and alternatives. Prog. Phys. Geogr. Earth Environ. **2005**, 29, 392–410.
4. Zhang, X.; Duan, K.; Dong, Q. Comparison of nonstationary models in analyzing bivariate flood frequency at the Three Gorges Dam. J. Hydrol. **2019**, 579, 124208.
5. Zeng, L.; Bi, H.; Li, Y.; Liu, X.; Li, S.; Chen, J. Nonstationary annual maximum flood frequency analysis using a conceptual hydrologic model with time-varying parameters. Water **2022**, 14, 3959.
6. Durocher, M.; Zadeh, S.M.; Burn, D.H.; Ashkar, F. Comparison of automatic procedures for selecting flood peaks over threshold based on goodness-of-fit tests. Hydrol. Process. **2018**, 32, 2874–2887.
7. Önöz, B.; Bayazit, M. Effect of the occurrence process of the peaks over threshold on the flood estimates. J. Hydrol. **2001**, 244, 86–96.
8. Bezak, N.; Brilly, M.; Šraj, M. Comparison between the peaks-over-threshold method and the annual maximum method for flood frequency analysis. Hydrol. Sci. J. **2014**, 59, 959–977.
9. Todorovic, P.; Rousselle, J. Some problems of flood analysis. Water Resour. Res. **1971**, 7, 1144–1150.
10. Pan, X.; Rahman, A.; Haddad, K.; Ouarda, T.B.; Sharma, A. Regional Flood Frequency Analysis Based on Peaks-Over-Threshold Approach: A Case Study for South-Eastern Australia. J. Hydrol. Reg. Stud. **2023**, 47, 101407.
11. Deidda, R.; Puliga, M. Performances of some parameter estimators of the generalized Pareto distribution over rounded-off samples. Phys. Chem. Earth Parts A/B/C **2009**, 34, 626–634.
12. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. **1996**, 58, 267–288.
13. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics **1970**, 12, 55–67.
14. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. **2005**, 67, 301–320.
15. Guru, N. Implication of partial duration series on regional flood frequency analysis. Int. J. River Basin Manag. **2022**, 1–20.
16. Hamdi, Y.; Duluc, C.M.; Bardet, L.; Rebour, V. Development of a target-site-based regional frequency model using historical information. Nat. Hazards **2019**, 98, 895–913.
17. Pan, X.; Rahman, A.; Haddad, K. Regional flood estimation for very frequent floods based on peaks-over-threshold approach: A case study for south-East Australia. In Hydrology & Water Resources Symposium 2022 (HWRS 2022): The Past, the Present, the Future; Engineers: Brisbane, Australia, 2022; pp. 265–276.
18. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 3 July 2023).
19. Hosking, J.R.M.; Wallis, J.R. Some statistics useful in regional frequency analysis. Water Resour. Res. **1993**, 29, 271–281.
20. Ali, S.; Rahman, A. Development of a kriging-based regional flood frequency analysis technique for South-East Australia. Nat. Hazards **2022**, 114, 2739–2765.
21. Zalnezhad, A.; Rahman, A.; Nasiri, N.; Vafakhah, M.; Samali, B.; Ahamed, F. Comparing performance of ANN and SVM methods for regional flood frequency analysis in South-East Australia. Water **2022**, 14, 3323.
22. Bobee, B.; Cavadias, G.; Ashkar, F.; Bernier, J.; Rasmussen, P. Towards a systematic approach to comparing distributions used in flood frequency analysis. J. Hydrol. **1993**, 142, 121–136.
23. Madsen, H.; Rosbjerg, D. The partial duration series method in regional index-flood modeling. Water Resour. Res. **1997**, 33, 737–746.
24. Silva, A.T.; Naghettini, M.; Portela, M.M. On some aspects of peaks-over-threshold modeling of floods under nonstationarity using climate covariates. Stoch. Environ. Res. Risk Assess. **2016**, 30, 207–224.
25. Silva, A.T.; Portela, M.M.; Naghettini, M. On peaks-over-threshold modeling of floods with zero-inflated Poisson arrivals under stationarity and nonstationarity. Stoch. Environ. Res. Risk Assess. **2013**, 28, 1587–1599.
26. Pickands, J., III. Statistical inference using extreme order statistics. Ann. Stat. **1975**, 3, 119–131.
27. Water Resources Council (US); Hydrology Committee. Guidelines for Determining Flood Flow Frequency (No. 17); US Water Resources Council, Hydrology Committee: Washington, DC, USA, 1975.
28. Bernardara, P.; Mazas, F.; Weiss, J.; Andreewsky, M.; Kergadallan, X.; Benoît, M.; Hamm, L. On the two step threshold selection for over-threshold modelling. Coast. Eng. **2012**, 2, 1–6.
29. Coles, S.; Bawa, J.; Trenner, L.; Dorazio, P. An Introduction to Statistical Modeling of Extreme Values; Springer: London, UK, 2001; Volume 208, p. 208.
30. Cunnane, C. A particular comparison of annual maxima and partial duration series methods of flood frequency prediction. J. Hydrol. **1973**, 18, 257–271.
31. Lang, M.; Ouarda, T.; Bobée, B. Towards operational guidelines for over-threshold modeling. J. Hydrol. **1999**, 225, 103–117.
32. Persiano, S.; Salinas, J.L.; Stedinger, J.R.; Farmer, W.H.; Lun, D.; Viglione, A.; Blöschl, G.; Castellarin, A. A comparison between generalized least squares regression and top-kriging for homogeneous cross-correlated flood regions. Hydrol. Sci. J. **2021**, 66, 565–579.
33. Lee, J.; Lee, O.; Choi, J.; Seo, J.; Won, J.; Jang, S.; Kim, S. Estimation of Real-Time Rainfall Fields Reflecting the Mountain Effect of Rainfall Explained by the WRF Rainfall Fields. Water **2023**, 15, 1794.
34. Srinivas, V.; Tripathi, S.; Rao, A.R.; Govindaraju, R.S. Regional flood frequency analysis by combining self-organizing feature map and fuzzy clustering. J. Hydrol. **2008**, 348, 148–166.
35. Rahman, A.; Haddad, K.; Kuczera, G.; Weinmann, E. Regional flood methods. Australian Rainfall and Runoff: A Guide to Flood Estimation. In Book 3, Peak Flow Estimation; Australian Government: Canberra, Australia, 2019; pp. 105–146.
36. Aziz, K.; Rahman, A.; Fang, G.; Shrestha, S. Application of artificial neural networks in regional flood frequency analysis: A case study for Australia. Stoch. Environ. Res. Risk Assess. **2013**, 28, 541–554.

**Figure 2.** Geographical locations of the selected 145 stream gauging stations in New South Wales and Victoria, Australia.

**Figure 3.** Observed versus predicted flood quantiles (m$^{3}$/s) for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

**Figure 4.** Residual quantile–quantile plots for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

**Figure 5.** Spatial distribution of absolute $RE_r$ values for different regression models for ARI = 2 years: (**a**) RR; (**b**) EN; (**c**) LASSO; (**d**) MLR.

**Figure 6.** Spatial distribution of absolute $RE_r$ values for different regression models for ARI = 20 years: (**a**) RR; (**b**) EN; (**c**) LASSO; (**d**) MLR.

**Figure 7.** Spatial distribution of absolute $RE_r$ values for different regression models for ARI = 100 years: (**a**) RR; (**b**) EN; (**c**) LASSO; (**d**) MLR.

**Figure 8.** Cumulative count of stations having a range of different $RE_r$ (%) for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

**Figure 9.** Distribution of $R^2$ values for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

**Figure 10.** Distribution of $Q_{Pred}/Q_{Obs}$ ratio (Equation (7)) for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

**Figure 11.** Distribution of $RE_r$ values for different regression models: (**a**) ARI = 2 years; (**b**) ARI = 20 years; (**c**) ARI = 100 years.

| Search Query with Boolean Operators | Number of Documents: Scopus (Title, Abstract, Keyword) | Number of Documents: Dimensions (Title and Abstract) | Number of Documents: Web of Science (Topic $^{1}$) |
|---|---|---|---|
| “Peaks over threshold” | 1394 | 1332 | 695 |
| “Partial duration series” | 301 | 251 | 291 |
| (“Partial duration series” OR “peaks over threshold”) | 1673 | 1563 | 954 |
| (“Partial duration series” OR “peaks over threshold”) AND (flood) | 437 | 384 | 307 |
| (“Partial duration series” OR “peaks over threshold”) AND (flood) AND (“Multiple Linear Regression” OR “Least Absolute Shrinkage and Selection Operator” OR LASSO OR “Ridge Regression” OR “Elastic Net Regression”) | 3 | 1 | 2 |

$^{1}$ Searches title, abstract, author keywords and Keywords Plus.

**Table 2.** Descriptive statistics of the independent variables based on the selected 145 catchments in New South Wales and Victoria, Australia.

| Independent Variable | Minimum | Maximum | Mean | Median | Standard Deviation |
|---|---|---|---|---|---|
| A (km$^{2}$) | 11.00 | 1010.00 | 360.21 | 310.00 | 258.77 |
| MAR (mm) | 485.32 | 1953.23 | 1001.80 | 926.96 | 327.85 |
| SF (fraction) | 0.26 | 1.43 | 0.77 | 0.77 | 0.21 |
| MAE (mm) | 932.70 | 1543.30 | 1111.25 | 1068.80 | 130.44 |
| SDEN (km$^{-1}$) | 0.52 | 5.47 | 1.97 | 1.58 | 1.01 |
| S1085 (m/km) | 0.80 | 69.90 | 12.77 | 9.59 | 10.95 |
| FST (fraction) | 0.01 | 1.00 | 0.59 | 0.65 | 0.33 |

**Table 3.** Correlation coefficients (with their corresponding p-values) between the independent variables (NA means not applicable).

| Variable | A | MAR | SF | MAE | SDEN | S1085 | FST |
|---|---|---|---|---|---|---|---|
| A | 1.000 | | | | | | |
| p-value | NA | | | | | | |
| MAR | −0.140 | 1.000 | | | | | |
| p-value | 0.093 | NA | | | | | |
| SF | −0.009 | −0.073 | 1.000 | | | | |
| p-value | 0.914 | 0.383 | NA | | | | |
| MAE | −0.080 | 0.346 | 0.038 | 1.000 | | | |
| p-value | 0.338 | 0.000 | 0.652 | NA | | | |
| SDEN | −0.219 | 0.347 | 0.067 | 0.615 | 1.000 | | |
| p-value | 0.008 | 0.000 | 0.424 | 0.000 | NA | | |
| S1085 | −0.463 | 0.206 | −0.004 | −0.097 | 0.161 | 1.000 | |
| p-value | 0.000 | 0.013 | 0.962 | 0.247 | 0.054 | NA | |
| FST | 0.015 | 0.328 | 0.048 | −0.022 | 0.173 | 0.437 | 1.000 |
| p-value | 0.863 | 0.000 | 0.566 | 0.791 | 0.037 | 0.000 | NA |

**Table 4.** $RE_m$ values (%) for the four regression models for the six ARIs.

| ARI (Years) | MLR | LASSO | RR | EN |
|---|---|---|---|---|
| 2 | 39 | 37 | 38 | 37 |
| 5 | 43 | 43 | 43 | 45 |
| 10 | 44 | 43 | 44 | 46 |
| 20 | 47 | 44 | 44 | 44 |
| 50 | 45 | 43 | 43 | 44 |
| 100 | 44 | 46 | 46 | 47 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Pan, X.; Yildirim, G.; Rahman, A.; Haddad, K.; Ouarda, T.B.M.J.
Peaks-Over-Threshold-Based Regional Flood Frequency Analysis Using Regularised Linear Models. *Water* **2023**, *15*, 3808.
https://doi.org/10.3390/w15213808
