# Flash-Flood Forecasting in an Andean Mountain Catchment—Development of a Step-Wise Methodology Based on the Random Forest Algorithm


## Abstract


## 1. Introduction

## 2. Study Area and Dataset

## 3. RF Technique

The RF models were implemented with the scikit-learn library for Python [39]. The main functions, attributes and methods employed are documented at http://scikit-learn.org.

#### 3.1. Algorithm

- (i) Construct each decision tree from a bootstrap sample drawn with (or without) replacement from the training dataset; the number of trees, and hence of bootstrap samples, is set by the n_estimators parameter. Each bootstrap sample contains a different subset (roughly two-thirds) of the dataset, and the observations left out of it form the out-of-bag (OOB) sample [19]. The OOB technique aims to obtain unbiased estimates of the regression error as well as estimates of the importance of the variables used in the tree construction process [40].
- (ii) Determine the number of features (max_features parameter) considered when searching for the best split, out of the total number of predictor variables in the dataset (n_features). The condition max_features < n_features ensures that no duplicated decision trees (DTs) exist in the forest; the resulting variety among the trees helps to avoid over-fitting. Ref. [20] recommends $max\_features=\sqrt{n\_features}$ for regression problems.
- (iii) Split each node of each decision tree into two descendant nodes using the best split criterion. For regression problems, the best split is chosen based on the mean squared error ($MSE$). The minimum number of samples required to split a node is controlled by the min_samples_split parameter.
- (iv) Grow each tree to the largest possible extent by repeating steps (i) to (iii) until the specified number of nodes has been reached. The optimal number of trees is reached when the $OOB_{error}$ stops decreasing. The depth of each tree is controlled by the max_depth and min_samples_leaf parameters, where min_samples_leaf is the minimum number of samples required at a leaf node. These limits reduce the structural complexity of the models and act as pruning criteria [22].
- (v) Determine the prediction as the mean response of all regression trees [20].
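Steps (i) to (v) can be sketched with scikit-learn's `RandomForestRegressor`. This is a minimal illustration under stated assumptions, not the configuration used in the study: the data are synthetic, and all parameter values below are placeholders. The split criterion is left at the library default, which is based on the squared error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a predictor matrix (NOT the study data):
# 500 samples, 9 features, noisy nonlinear target.
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=500)

model = RandomForestRegressor(
    n_estimators=200,      # step (i): number of trees / bootstrap samples
    bootstrap=True,        # step (i): sample with replacement
    max_features="sqrt",   # step (ii): features tried at each split
    min_samples_split=2,   # step (iii): minimum samples to split a node
    min_samples_leaf=1,    # step (iv): minimum samples at a leaf
    max_depth=None,        # step (iv): grow trees to the largest extent
    oob_score=True,        # evaluate on the out-of-bag observations
    random_state=0,
)
model.fit(X, y)

# Step (v): the forest prediction is the mean of the per-tree predictions.
tree_mean = np.mean([tree.predict(X[:5]) for tree in model.estimators_], axis=0)
forest_pred = model.predict(X[:5])

oob_r2 = model.oob_score_                  # R^2 computed on OOB samples
importances = model.feature_importances_   # impurity-based importances
```

Here `oob_score_` reports the coefficient of determination on the out-of-bag observations, so the OOB error mentioned in step (iv) can be tracked by refitting with increasing `n_estimators` and watching where this score levels off.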

#### 3.2. Determination of Model Hyper-Parameters

#### 3.3. Input Data Composition

## 4. Model Evaluation

#### 4.1. Goodness-of-Fit Statistics

#### Graphical Analysis

#### 4.2. Feature Selection Analysis

## 5. Step-Wise Methodology

## 6. Results and Discussion

#### 6.1. Determination of the Number of Discharge and Precipitation Lags

#### 6.2. Base Model: Discharge Lags as the Sole Input

#### 6.3. Precipitation Lags as Additional Inputs

#### 6.4. Feature Selection

#### 6.5. Graphical Analysis

#### 6.6. Forecasting Models of 8, 12, 18 and 24 h

## 7. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
ML | Machine learning |
RF | Random forest |
ANN | Artificial neural network |
SVM | Support vector machine |
ME | Mean error |
MSE | Mean squared error |
NSE | Nash-Sutcliffe efficiency |

## References

1. Stefanidis, S.; Stathis, D. Assessment of flood hazard based on natural and anthropogenic factors using analytic hierarchy process (AHP). Nat. Hazards **2013**, 68, 569–585.
2. Min, S.K.; Zhang, X.; Zwiers, F.W.; Hegerl, G.C. Human contribution to more-intense precipitation extremes. Nature **2011**, 470, 378–381.
3. Fi-John, C.; Hwang, Y.Y. A self-organization algorithm for real-time flood forecast. Hydrol. Process. **1999**, 13, 123–138.
4. D’Ercole, R.; Trujillo, M. Amenazas, Vulnerabilidad, Capacidades y Riesgo en el Ecuador; Coopi-IRD-Oxfam: Quito, Ecuador, 2002.
5. Nash, J.E.; Sutcliffe, J.V. River flow forecasting through conceptual models part I—A discussion of principles. J. Hydrol. **1970**, 10, 282–290.
6. Wang, W.; Gelder, P.H.A.J.M.V.; Vrijling, J.K.; Ma, J. Forecasting daily streamflow using hybrid ANN models. J. Hydrol. **2006**, 324, 383–399.
7. Galelli, S.; Castelletti, A. Assessing the predictive capability of randomized tree-based ensembles in streamflow modelling. Hydrol. Earth Syst. Sci. **2013**, 17, 2669–2684.
8. Braud, I.; Ayral, P.A.; Bouvier, C.; Branger, F.; Delrieu, G.; Dramais, G.; Le Coz, J.; Leblois, E.; Nord, G.; Vandervaere, J.P. Advances in flash floods understanding and modelling derived from the FloodScale project in South-East France. In Proceedings of the 3rd European Conference on Flood Risk Management, Innovation, Implementation, Integration (FLOODrisk 2016), Lyon, France, 18–20 October 2016; Volume 7, p. 04005.
9. Ruin, I.; Creutin, J.D.; Anquetin, S.; Lutoff, C. Human exposure to flash floods—Relation between flood parameters and human vulnerability during a storm of September 2002 in Southern France. J. Hydrol. **2008**, 361, 199–213.
10. Buytaert, W.; Celleri, R.; Willems, P.; De Bievre, B.; Wyseure, G. Spatial and temporal rainfall variability in mountainous areas: A case study from the south Ecuadorian Andes. J. Hydrol. **2006**, 329, 413–421.
11. Celleri, R.; Willems, P.; Buytaert, W.; Feyen, J. Space-time rainfall variability in the Paute Basin, Ecuadorian Andes. Hydrol. Process. **2007**, 21, 3316–3327.
12. Espinoza Villar, J.C.; Ronchail, J.; Guyot, J.L.; Cochonneau, G.; Naziano, F.; Lavado, W.; De Oliveira, E.; Pombosa, R.; Vauchel, P. Spatio-temporal rainfall variability in the Amazon basin countries (Brazil, Peru, Bolivia, Colombia, and Ecuador). Int. J. Climatol. **2009**, 29, 1574–1594.
13. Dinerstein, E.; Graham, D.J.; Olsen, D.M. Una Evaluación del Estado de Conservación de las Eco-Regiones Terrestres de América Latina y el Caribe; Banco Mundial: Washington, DC, USA, 1995.
14. Rossenaar, A.; Hofstede, R.G.M. Effects of burning and grazing on root biomass in the páramo ecosystem. In Páramo: An Andean Ecosystem under Human Influence; Academic Press: London, UK, 1992; pp. 211–213.
15. Bontempi, G.; Taieb, S.B.; Le Borgne, Y.A. Machine Learning Strategies for Time Series Forecasting; Springer: Berlin/Heidelberg, Germany, 2012; pp. 62–77.
16. Jin, L.; Kuang, X.; Huang, H.; Qin, Z.; Wang, Y. Study on the Overfitting of the Artificial Neural Network Forecasting Model. Acta Meteorol. Sin. **2005**, 19, 216–225.
17. Martens, D.; De Backer, M.; Haesen, R.; Vanthienen, J.; Snoeck, M.; Baesens, B. Classification with ant colony optimization. IEEE Trans. Evol. Comput. **2007**, 11, 651–665.
18. Kubal, C.; Haase, D.; Meyer, V.; Scheuer, S. Integrated urban flood risk assessment-adapting a multicriteria approach to a city. Nat. Hazards Earth Syst. Sci. **2009**, 9, 1881–1895.
19. Wang, Z.; Lai, C.; Chen, X.; Yang, B.; Zhao, S.; Bai, X. Flood hazard risk assessment model based on random forest. J. Hydrol. **2015**, 527, 1130–1141.
20. Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32.
21. Kühnlein, M.; Appelhans, T.; Thies, B.; Nauss, T. Improving the accuracy of rainfall rates from optical satellite sensors with machine learning—A random forests-based approach applied to MSG SEVIRI. Remote Sens. Environ. **2014**, 141, 129–143.
22. Rodriguez-Galiano, V.; Mendes, M.P.; Garcia-Soldado, M.J.; Chica-Olmo, M.; Ribeiro, L. Predictive modeling of groundwater nitrate pollution using Random Forest and multisource variables related to intrinsic and specific vulnerability: A case study in an agricultural setting (Southern Spain). Sci. Total Environ. **2014**, 476, 189–206.
23. Biau, G.; Scornet, E. A random forest guided tour. Test **2016**, 25, 197–227.
24. Feng, Q.; Liu, J.; Gong, J. Urban flood mapping based on unmanned aerial vehicle remote sensing and random forest classifier—A case of Yuyao, China. Water **2015**, 7, 1437–1455.
25. Lee, S.; Kim, J.C.; Jung, H.S.; Lee, M.J.; Lee, S. Spatial prediction of flood susceptibility using random-forest and boosted-tree models in Seoul metropolitan city, Korea. Geomat. Nat. Hazards Risk **2017**, 8, 1185–1203.
26. Zhao, G.; Pang, B.; Xu, Z.; Yue, J.; Tu, T. Mapping flood susceptibility in mountainous areas on a national scale in China. Sci. Total Environ. **2018**, 615, 1133–1142.
27. Albers, S.J.; Déry, S.J.; Petticrew, E.L. Flooding in the Nechako River Basin of Canada: A random forest modeling approach to flood analysis in a regulated reservoir system. Can. Water Resour. J. **2016**, 41, 250–260.
28. Sorjamaa, A.; Hao, J.; Reyhani, N.; Ji, Y.; Lendasse, A. Methodology for long-term prediction of time series. Neurocomputing **2007**, 70, 2861–2869.
29. Wang, W.C.; Chau, K.W.; Cheng, C.T.; Qiu, L. A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series. J. Hydrol. **2009**, 374, 294–306.
30. Han, J.; Pei, J.; Kamber, M. Data Mining: Concepts and Techniques; Elsevier: Amsterdam, The Netherlands, 2011.
31. Bendix, J.; Rollenbeck, R.; Göttlicher, D.; Cermak, J. Cloud occurrence and cloud properties in Ecuador. Clim. Res. **2006**, 30, 133–147.
32. Muñoz, P.; Célleri, R.; Feyen, J. Effect of the Resolution of Tipping-Bucket Rain Gauge and Calculation Method on Rainfall Intensities in an Andean Mountain Gradient. Water **2016**, 8, 534.
33. Orellana-Alvear, J.; Célleri, R.; Rollenbeck, R.; Bendix, J. Analysis of Rain Types and Their Z–R Relationships at Different Locations in the High Andes of Southern Ecuador. J. Appl. Meteorol. Climatol. **2017**, 56, 3065–3080.
34. IUSS Working Group WRB. World Reference Base for Soil Resources 2014; FAO: Rome, Italy, 2014.
35. Mosquera, G.M.; Lazo, P.X.; Célleri, R.; Wilcox, B.P.; Crespo, P. Runoff from tropical alpine grasslands increases with areal extent of wetlands. Catena **2015**, 125, 120–128.
36. Mosquera, G.M.; Célleri, R.; Lazo, P.X.; Vaché, K.B.; Perakis, S.S.; Crespo, P. Combined Use of Isotopic and Hydrometric Data to Conceptualize Ecohydrological Processes in a High-Elevation Tropical Ecosystem. Hydrol. Process. **2016**, 30, 2930–2947.
37. Willems, P. A time series tool to support the multi-criteria performance evaluation of rainfall-runoff models. Environ. Model. Softw. **2009**, 24, 311–321.
38. Breiman, L. Classification and Regression Trees; Routledge: Routledge, UK, 2017.
39. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
40. Boulesteix, A.L.; Janitza, S.; Kruppa, J.; König, I.R. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. **2012**, 2, 493–507.
41. Probst, P.; Wright, M.; Boulesteix, A.L. Hyperparameters and Tuning Strategies for Random Forest. arXiv, 2018; arXiv:1804.03515.
42. Willems, P. Parsimonious rainfall-runoff model construction supported by time series processing and validation of hydrological extremes—Part 1: Step-wise model-structure identification and calibration approach. J. Hydrol. **2014**, 510, 578–590.
43. Sudheer, K.P.; Gosain, A.K.; Ramasastri, K.S. A data-driven algorithm for constructing artificial neural network rainfall-runoff models. Hydrol. Process. **2002**, 16, 1325–1330.
44. Wu, C.L.; Chau, K.W. Rainfall-runoff modeling using artificial neural network coupled with singular spectrum analysis. J. Hydrol. **2011**, 399, 394–409.
45. Moriasi, D.N.; Arnold, J.G.; Van Liew, M.W.; Bingner, R.L.; Harmel, R.D.; Veith, T.L. Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans. ASABE **2007**, 50, 885–900.
46. Peleg, N.; Gvirtzman, H. Groundwater flow modeling of two-levels perched karstic leaking aquifers as a tool for estimating recharge and hydraulic parameters. J. Hydrol. **2010**, 388, 13–27.
47. Tang, Y.; Reed, P.; Werkhoven, K.V.; Wagener, T. Advancing the identification and evaluation of distributed rainfall-runoff models using global sensitivity analysis. Water Resour. Res. **2007**, 43, 1–14.
48. Cortez, P. Sensitivity Analysis for Time Lag Selection to Forecast Seasonal Time Series using Neural Networks and Support Vector Machines. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8.
49. de Almeida, I.K.; Almeida, A.K.; Anache, J.A.A.; Steffen, J.L.; Sobrinho, T.A. Estimation on time of concentration of overland flow in watersheds: A review. Geociências **2014**, 33, 661–671.

**Figure 1.** Location of the Tomebamba catchment at Matadero-Sayausí outlet in the Andean cordillera of Ecuador, South America (UTM coordinates).

**Figure 2.** Precipitation (averaged for Toreadora, Virgen and Chirimachay stations) and discharge hourly time series for the study period (January 2015 to July 2017). Note the horizontal red dashed line at a discharge of 50 m$^{3}$/s (historical indicator of a flood event).

**Figure 3.** Scheme of the step-wise methodology for developing precipitation-runoff forecasting models.

**Figure 4.** (**a**) Autocorrelation function (ACF) and (**b**) Partial autocorrelation function (PACF) of the Matadero-Sayausí discharge series. Gray hatch indicates the 95% confidence band.

**Figure 5.** Pearson cross-correlation comparison between the Toreadora (3955 m.a.s.l.), Virgen (3626 m.a.s.l.) and Chirimachay (3298 m.a.s.l.) precipitation stations and the Matadero-Sayausí (2693 m.a.s.l.) discharge station. Note the horizontal line at a cross-correlation of 0.20.

**Figure 6.** Feature relative importance of the 4-h discharge forecasting model B. Darkest bars indicate the features selected for a reduced version of the input (model C).

**Figure 7.** Model results of the parsimonious 4-h discharge forecasting (model C). (**a**) Training period, from January 2015 to July 2016. (**b**) Validation period, from July 2016 to July 2017.

**Figure 8.** (**a**) Empirical extreme value distribution of peak flows; (**b**) Comparison of nearly independent peak flow maxima.

**Figure 9.** (**a**) Empirical extreme value distribution of peak flows, and (**b**) Comparison of cumulative flow volumes for forecasting models of 4, 8, 12, 18 and 24 h.

**Figure 10.** Comparison of nearly independent peak flow maxima for forecasting models of 4, 8, 12, 18 and 24 h.

Hyper-Parameter | Values |
---|---|
n_estimators * | 50–700 |
max_features | 'auto', 'sqrt' and 'log2' |
min_samples_split | 2, 5 and 10 |
min_samples_leaf | 1, 2 and 4 |
max_depth * | 10–700 |
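A search over the hyper-parameter values listed above could look like the following sketch. Several things here are assumptions rather than the study's procedure: the use of `RandomizedSearchCV` with a `TimeSeriesSplit` cross-validation, the coarse sampling of the 50–700 and 10–700 ranges, the scoring metric, and the synthetic data. Recent scikit-learn releases no longer accept `'auto'` for `max_features` in regressors, so the equivalent value `1.0` (all features) is used in its place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the study data).
X = rng.normal(size=(200, 6))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# Candidate values taken from the table; the ranges are sampled coarsely
# (an assumption made to keep the example fast).
param_distributions = {
    "n_estimators": [50, 200, 350, 500, 700],
    "max_features": [1.0, "sqrt", "log2"],   # 1.0 stands in for 'auto'
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_depth": [10, 50, 100, 300, 700],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                          # number of sampled combinations
    cv=TimeSeriesSplit(n_splits=3),    # folds that respect temporal order
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_        # negative MSE, higher is better
```

A time-ordered split is used here because shuffled folds would let a model see observations from after the period it is validated on, which tends to overstate skill for serially correlated series such as hourly discharge.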

**Table 2.** Input data composition and model efficiencies for forecasting models and their parsimonious versions for prediction horizons of 4, 8, 12, 18 and 24 h.

Forecast Horizon [h] | Discharge Lags | Toreadora Lags | Virgen Lags | Chirimachay Lags | Total Features | NSE Training | NSE Validation |
---|---|---|---|---|---|---|---|
4 | 8 | 24 | 10 | 15 | 60 | 0.954 | 0.758 |
4 * | 8 | 9 | 9 | 9 | 38 | 0.972 | 0.761 |
8 | 8 | 32 | 19 | 23 | 85 | 0.868 | 0.581 |
8 * | 8 | 15 | 15 | 15 | 56 | 0.867 | 0.580 |
12 | 8 | 36 | 23 | 27 | 98 | 0.828 | 0.506 |
12 * | 8 | 18 | 18 | 18 | 65 | 0.829 | 0.503 |
18 | 8 | 42 | 29 | 33 | 115 | 0.772 | 0.442 |
18 * | 8 | 21 | 21 | 21 | 74 | 0.771 | 0.439 |
24 | 15 | 48 | 35 | 39 | 140 | 0.772 | 0.385 |
24 * | 15 | 21 | 21 | 21 | 81 | 0.767 | 0.384 |
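To make the input composition concrete, the sketch below builds a lagged feature matrix from one discharge and one precipitation series and computes the Nash-Sutcliffe efficiency (NSE). The helper names `make_lagged_inputs` and `nse` are hypothetical, and the lag convention (a lag count of n means the n most recent values, including the current time step) is an assumption that may differ from the one behind the table. The input series are toy values.

```python
import numpy as np

def make_lagged_inputs(discharge, precip, n_q_lags, n_p_lags, horizon):
    """Return (X, y) where each row holds the n most recent discharge and
    precipitation values at time t, and y is the discharge at t + horizon."""
    discharge = np.asarray(discharge, dtype=float)
    precip = np.asarray(precip, dtype=float)
    start = max(n_q_lags, n_p_lags) - 1   # first t with a full history
    stop = len(discharge) - horizon       # t + horizon must exist
    rows, targets = [], []
    for t in range(start, stop):
        q_hist = discharge[t - n_q_lags + 1 : t + 1][::-1]  # Q(t), Q(t-1), ...
        p_hist = precip[t - n_p_lags + 1 : t + 1][::-1]     # P(t), P(t-1), ...
        rows.append(np.concatenate([q_hist, p_hist]))
        targets.append(discharge[t + horizon])
    return np.array(rows), np.array(targets)

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    sse = np.sum((observed - simulated) ** 2)
    spread = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - sse / spread

# Toy series (NOT the study data): 20 hourly values each.
q = np.arange(20, dtype=float)
p = np.arange(100, 120, dtype=float)
X, y = make_lagged_inputs(q, p, n_q_lags=3, n_p_lags=2, horizon=4)
```

An NSE of 1 indicates a perfect match, while 0 means the model is no better than predicting the mean of the observations, which gives a reference point for the training and validation values in the table.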

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Muñoz, P.; Orellana-Alvear, J.; Willems, P.; Célleri, R. Flash-Flood Forecasting in an Andean Mountain Catchment—Development of a Step-Wise Methodology Based on the Random Forest Algorithm. *Water* **2018**, *10*, 1519.
https://doi.org/10.3390/w10111519
