A Brief Review of Random Forests for Water Scientists and Practitioners and Their Recent History in Water Resources

Random forests (RF) is a supervised machine learning algorithm, which has recently started to gain prominence in water resources applications. However, existing applications are generally restricted to the implementation of Breiman’s original algorithm for regression and classification problems, while numerous developments could also be useful in solving diverse practical problems in the water sector. Here we popularize RF and their variants for the practicing water scientist, and discuss related concepts and techniques that have received less attention from the water science and hydrologic communities. In doing so, we review RF applications in water resources, highlight the potential of the original algorithm and its variants, and assess the degree of RF exploitation in a diverse range of applications. Relevant implementations of random forests, as well as related concepts and techniques in the R programming language, are also covered.


Introduction
Breiman's [1] random forests (RF) is one of the most successful machine (statistical) learning algorithms for practical applications; see e.g., Biau and Scornet [2], and Efron and Hastie [3] (p. 324). Despite its practical value, until very recently and compared to other machine learning and artificial intelligence algorithms, random forests remained relatively obscure, with limited use in water science and hydrological applications. Thus, the potential of Breiman's [1] original algorithm and its variants in water resources applications remains far from fully exploited. Besides common applications of RF-based algorithms in regression and classification problems and computation of relevant metrics, their use for quantile prediction, survival analysis, and causal inference, to name a few, seems to be less known to water scientists and practitioners.
Random forests have been applied to several scientific fields and associated research areas, such as agriculture (see e.g., Liakos et al. [4]), ecology (see e.g., Cutler et al. [5]), land cover classification (see e.g., Gislason et al. [6]), remote sensing (see e.g., Belgiu and Drăguţ [7], Maxwell et al. [8]), wetland classification (see e.g., Mahdavi et al. [9]), bioinformatics (see e.g., Chen et al. [10]), as well as biological and genetic association studies (see e.g., Goldstein et al. [11]), genomics (see e.g., Chen and Ishwaran [12]), quantitative structure-activity relationships (QSARs) modeling [13], and single nucleotide polymorphism studies (SNP, [14]). An extensive review of the theoretical aspects of random forests can be found
In general, statistical learning has two purposes: prediction and inference. Prediction refers to the ability of the algorithm to predict a response variable based on a set of independent variables, while inference refers to understanding how changes of the independent variables affect the response variable (see e.g., James et al. [39], pp. 17-20). Breiman favored prediction over interpretation and understanding [40] and, therefore, he emphasized solving practical problems, although random forests are not solely a prediction algorithm (Breiman's approach to statistical science is also reflected in the interview [41]). In James et al. [39] (p. 25), random forests are presented as the most flexible algorithm (implying possibly, but not necessarily, that they are a skillful predictor) and, also, the second least interpretable one (the first being support vector machines), with linear models having been characterized by exactly the opposite behavior. The practice of selecting the most flexible model (i.e., a model that can select, combine and fit different functional forms, demonstrating increased capacity in relating dependent to independent variables [39], p. 22) irrespective of its interpretability, is in contrast with e.g., Iorgulescu and Beven [42], who are perhaps the first authors to cite Breiman [1] in a water resources journal, but then decide to use single decision trees instead of random forests in their rainfall-runoff application, because the former are more interpretable, albeit less skillful. Other criteria can also be considered when selecting an algorithm for practical problem solving. Examples include, but are not limited to, the required degree of predictive capacity for the problem under consideration, ease of model use and software availability, as well as user-related preferences (e.g., some users feel more comfortable implementing a general algorithm applicable to most cases, rather than investing time and effort in learning a new one tailored to a specific application).
In Breiman [40], a distinction is made between statistical models (e.g., those that use probability distributions to describe data) and algorithmic models (or black-box models for prediction and estimation purposes), with an explicit statement that sticking to the first class of models has hindered progress. This classification is similar to the distinction between physically-based and data-driven hydrological models in water resources; see e.g., Solomatine and Ostfeld [20]. The distinction between statistical and algorithmic models has been described in Cox and Efron [43] as an emphasis on prediction using noisy data, rather than trying to interpret the data. An ongoing interesting debate regarding Breiman's [40] stimulating paper and, more generally, the role of statistical vs. algorithmic modeling in predicting and explaining phenomena (see Shmueli [44], Boulesteix and Schmid [45]), shows that the two approaches converge. This is to be expected, as both statistical and machine learning approaches are subsets of the rapidly emerging field of data science (see e.g., Donoho [46]). However, the role of random forests as a generic framework for predictive modeling seems to be the dominant direction in RF-related research (see e.g., Hengl et al. [47]).
The general trend towards the use of algorithmic models can be attributed to the rapidly increasing availability of big data (see e.g., Efron and Hastie [3]). The latter can be efficiently handled by RF algorithms (see e.g., Genuer et al. [48]), with all applicable reservations and constraints regarding the blind use of such models in exploring data sets (see e.g., Cox et al. [49]). In any case, big data are also becoming rapidly available in hydrology (see e.g., Chen and Wang [50]) and, therefore, a shift of focus towards the use of algorithmic methods and tools (such as RF algorithms) for prediction and inference purposes is already happening.
Other issues that should be properly taken into account when implementing machine learning algorithms in general, and random forests in particular, include: the need for comparison in large datasets [51] using formal procedures [52], reproducibility of applications [53], and variable selection [54]. An additional important issue that is frequently neglected is that causal inference is different from prediction, although there is increasing research regarding causal inference, interpretability, and reliability of machine learning methods [55].
In this context, the main purpose of the present study is to: (a) provide a comprehensive review of random forests and their software implementation for the practicing water scientist, (b) introduce their variants for possible use in water resources problems, and (c) familiarize the reader with the use of RF algorithms in water science, providing appropriate guidelines for full exploitation of their merits according to the broader literature. Sections 2-4 serve as a brief introduction to random forests for water scientists and practitioners, including a concise overview of RF algorithms, their variants and related software implementation in the popular R language. In Section 5, we use a published case study to shed additional light on how random forests work and, also, highlight the importance of understanding the nuances of RF algorithms in practical applications, by discussing how the reviewed work could have been improved in the light of the findings of Sections 2-4. Section 6 reviews important applications of random forests in water science and technology. Concluding remarks and considerations are presented in Section 7.
Water 2019, 11, 910

Random Forests
This section presents random forests (RF) as introduced by Breiman [1], including related concepts and results. In brief, what distinguishes Breiman's RF algorithm from other RF implementations is the use of classification and regression trees (CARTs, [56]) as base learners [2]; see Section 2.1 below. For simplicity, and without loss of generality, hereafter we follow the RF parameter notation used in the randomForest R package [57], which is directly linked to Breiman's [1] original paper.

How Random Forests Work
Several papers and textbooks include detailed presentations of RF algorithms; see e.g., Breiman [1], Biau and Scornet [2], and the textbooks James et al. [39], Hastie et al. [58], Kuhn and Johnson [59]. The algorithm borrows concepts from earlier works such as [60-62] (see also Biau and Scornet [2]). In essence, random forests is a machine learning algorithm that combines the concepts of classification and regression trees, and bagging with some additional degree of randomization. Sections 2.1.1-2.1.3 present these concepts, and Section 2.1.4 discusses how and why they are combined.

Supervised Learning
Supervised learning algorithms are used to infer (i.e., learn) a function that combines a set of variables with the aim of predicting another variable. The arguments of the function are called predictor variables (also referred to as independent variables, exogenous variables, covariates and features). The variable to be predicted is called the dependent variable (also referred to as the predictand, response, outcome, endogenous variable, target variable and output). Supervised learning algorithms are classified into regression and classification algorithms, according to the type of the dependent variable. In regression algorithms, the dependent variable is quantitative, whereas in classification algorithms the dependent variable is qualitative. In the latter case, the dependent variable can also be ordered; i.e., the values of the variable are ordered but no metric is defined/used to quantitatively assess the observed differences (Hastie et al. [58], pp. 9-11). In what follows, we use p and n to denote the number of predictor variables and the size of the training set (i.e., the set used to fit the algorithm), respectively.

Classification and Regression Trees
Classification and regression trees (CARTs, [56]) are methods to partition the variable space based on a set of rules embedded in a decision tree (see Figure 1 below), where each node splits according to a decision rule; see e.g., Hastie et al. [58] (pp. 305-317), and the review in Loh [19]. In this way, the variable space is partitioned into a set of rectangles, and a model is fitted to each rectangle, which in the simplest case can be a constant. In regression trees, the decision rules for node splits are tuned/learnt by optimizing the sum of squared deviations, while in classification by optimizing the Gini index (a definition and interpretation of the Gini index can be found in Hastie et al. [58], pp. 309, 310). Note that, in general, tree-based algorithms (including CARTs) are very noisy (see e.g., Hastie et al. [58], p. 588), with major differences having been identified in the decision rules for splitting, and the sizes of trees.
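The two split criteria mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the cited implementations: the function names (sse, gini, best_regression_split) and the toy data are hypothetical, and the search covers a single predictor only, whereas a CART searches over all candidate predictors at every node.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_regression_split(x, y):
    """Return (threshold, total_sse) of the best rule 'x <= t' on one predictor."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint between neighbours
        left = [yy for xx, yy in pairs if xx <= t]
        right = [yy for xx, yy in pairs if xx > t]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (t, total)
    return best


# Toy data (hypothetical): two well-separated groups of observations.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, total_sse = best_regression_split(x, y)
print(threshold, round(total_sse, 2))
print(gini(["a", "a", "b", "b"]), gini(["a", "a"]))
```

On these toy data the selected threshold falls between the two groups, and the Gini index is largest for the evenly mixed node and zero for the pure one.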

Bagging
Bagging (abbreviation for bootstrap aggregation) is an ensemble learning method [18] proposed in Breiman [63]. It generates a bootstrap sample from the original data and then trains a model (e.g., a CART) using the generated sample. The procedure is repeated ntree times. Bagging's prediction is the average of the predictions of the ntree trained models. Thus, bagging reduces the variance of the prediction function, but it requires unbiased models to work effectively [58] (p. 587).
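The bagging procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the algorithm of any cited package: the base learner is a hypothetical one-split regression tree (fit_stump) on a single predictor, the toy data are invented, and only the name ntree follows the notation of the text.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree on one predictor; return a predict function."""
    pairs = sorted(zip(x, y))
    overall = sum(y) / len(y)
    best_t, best_cost, best_means = None, float("inf"), (overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal x values cannot be separated
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [b for a, b in pairs if a <= t]
        right = [b for a, b in pairs if a > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        cost = sum((b - ml) ** 2 for b in left) + sum((b - mr) ** 2 for b in right)
        if cost < best_cost:
            best_t, best_cost, best_means = t, cost, (ml, mr)
    if best_t is None:  # all x equal in this sample: fall back to the mean
        return lambda x_new: overall
    return lambda x_new: best_means[0] if x_new <= best_t else best_means[1]


def bagging_predict(x, y, x_new, ntree=100, seed=0):
    """Average the predictions of ntree base learners, each fitted to a bootstrap sample."""
    rng = random.Random(seed)
    n = len(x)
    preds = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        model = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / ntree


# Toy data (hypothetical): low responses for small x, high responses for large x.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
low = bagging_predict(x, y, 2.0)
high = bagging_predict(x, y, 11.0)
print(low, high)
```

Individual stumps fitted to unlucky bootstrap samples can be poor, but the average over ntree of them remains close to the level of the corresponding group, which is the variance reduction the text refers to.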

Random Forests
Random forests are bagging of CARTs with some additional degree of randomization. Bagging of CARTs is needed to alleviate their instability (see e.g., Ziegler and König [17] and Section 2.1.2). Further, randomization is used to reduce the correlation between the trees and, consequently, reduce the variance of the predictions (i.e., the average of the trees). Randomization is conducted by randomly selecting mtry predictor variables as candidates for splitting [58] (pp. 587-604).
Prediction in regression is performed by averaging the predictions of each tree, while in classification it is performed by obtaining the majority class vote from the individual tree class votes (see e.g., Hastie et al. [58], p. 592). An option for parameter tuning of random forests is to use out-of-bag (OOB) errors [2]. Out-of-bag samples (about 1/3 of the training set, see Biau and Scornet [2]) are the samples remaining after bootstrapping the training set. The aforementioned procedure resembles the well-known k-fold cross-validation (see e.g., Hastie et al. [58], pp. 592, 593).
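The "about 1/3" figure for the out-of-bag share follows from the bootstrap itself: an observation is missed by all n draws with probability (1 - 1/n)^n, which approaches exp(-1) ≈ 0.368 for large n. The Python sketch below checks this by simulation and also shows the random draw of mtry candidate predictors for a split. The function names (oob_fraction, candidate_variables) are hypothetical, and no trees are grown here.

```python
import random


def oob_fraction(n, ntree=200, seed=1):
    """Average share of the n observations left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(ntree):
        in_bag = {rng.randrange(n) for _ in range(n)}  # indices drawn at least once
        total += (n - len(in_bag)) / n
    return total / ntree


def candidate_variables(p, mtry, rng):
    """Draw mtry of the p predictor indices, without replacement, for one split."""
    return sorted(rng.sample(range(p), mtry))


n = 1000
theoretical = (1 - 1 / n) ** n   # probability that a given observation is out-of-bag
simulated = oob_fraction(n)
print(round(theoretical, 4), round(simulated, 4))

# Hypothetical example: 3 candidate predictors out of p = 9 for a single split.
print(candidate_variables(9, 3, random.Random(0)))
```

Since every tree leaves out a different subset, each observation can be predicted using only the trees that did not see it, which is what makes the OOB error usable for tuning without a separate validation set.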

Properties of Random Forests
While random forests are very complex to interpret (see e.g., Ziegler and König [17]), their theoretical properties have been studied extensively (see e.g., the detailed review in Biau and Scornet [2]), primarily through the use of simplified versions of the algorithm (also referred to as stylized versions, see Biau and Scornet [2]). In summary, random forests: (a) have been found to be consistent (see e.g., references [64-66]), (b) reduce the variance, while not increasing the bias of the predictions [67], (c) reach minimax rate of convergence (see e.g., Ziegler and König [17], Genuer [67]), (d) adapt to
Figure 1. Decision tree (see Hastie et al. [58], p. 306). X_j denote predictor variables. The tree has four internal nodes and five leaves (terminal nodes). X_j ≤ t_k and X_j > t_k correspond to the left and right branches of each internal split, respectively. R_i denotes the mean of the observations at leaf i [39] (p. 304).

Variable Importance Metrics
Estimation of variable importance (i.e., assessing the relative significance of predictor variables in modeling the behavior of response variables; see e.g., Hastie et al. [58], Chapter 10, Grömping [69], and Verikas et al. [70]) is possible with random forests, through the use of variable importance metrics. The latter rank the predictor variables in terms of their relative significance, but provide limited information regarding the absolute performance of individual predictors in modeling the response variables [16].
The two major variable importance metrics (VIMs) used in RF applications are: the mean decrease in node impurities resulting from splitting, and the more advanced (see Strobl et al. [71]) permutation VIM. The first metric averages the decrease over all trees of the Gini index in classification, and the residual sum of squares in regression. The second metric measures the mean decrease in accuracy in the OOB sample by randomly permuting the predictor variable of interest (see randomForest R package, [16]). VIMs for the case of ordinal response variables have also been proposed in Janitza et al. [72].
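The idea behind the permutation VIM can be sketched in Python as follows. The sketch departs from the RF procedure in two ways that should be kept in mind: it uses a fixed stand-in predictor instead of a fitted forest, and it evaluates on one data set instead of the per-tree OOB samples. All names (mse, permutation_importance, predict) and the synthetic data are hypothetical.

```python
import random


def mse(predict, X, y):
    """Mean squared error of a prediction function over rows X and targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, j, rng):
    """Increase in MSE after randomly shuffling column j of X (rows are lists)."""
    baseline = mse(predict, X, y)
    column = [row[j] for row in X]
    rng.shuffle(column)  # breaks the link between predictor j and the response
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return mse(predict, X_perm, y) - baseline


rng = random.Random(42)
# Synthetic data: the response depends on the first predictor only.
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(200)]
y = [2.0 * row[0] for row in X]
predict = lambda row: 2.0 * row[0]   # stand-in for a fitted model

imp_informative = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
print(imp_informative, imp_noise)
```

Shuffling the informative predictor inflates the error, while shuffling the predictor that the model ignores leaves it unchanged, so the two importances are clearly separated.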
Studies relating to empirical and theoretical properties of RF VIMs, as well as guidelines on where and how to use them, can be found in the review papers Biau and Scornet [2], Boulesteix et al. [16]. The reader is also referred to Grömping [73] for a comparison between linear regression models and RF VIMs, Boulesteix et al. [74] for a survey on Gini VIMs and Nicodemus et al. [75] for a survey on permutation VIMs. VIMs for cases with missing data can be found in Hapfelmeier et al. [76], and for cases with high-dimensional data (i.e., of the form n ≪ p) in Janitza et al. [77].

Parameters
Two parameters of RF algorithms already discussed are: the number of trained trees ntree (see Section 2.1.3), and the number of randomly selected predictor variables mtry (see Section 2.1.4). Other parameters are the number of observations sampsize used in each tree, and the maximum number of observations nodesize in each leaf [78]. The nodesize parameter is used to stop the tree expansion, while the parameter maxnodes (i.e., the maximum number of terminal nodes that the trees in the forest can have) can also be used for this task. General guidelines for selecting the optimal parameter values can be found in the review papers Biau and Scornet [2], Scornet [78]. As noted in Biau and Scornet [2], the default parameter values in the randomForest R package are satisfactory, although they can be optimized for any given problem at the cost of increased computational time.
The default value of ntree in randomForest R package is set to 500, but different values may be selected based on the required accuracy, taking into account its effect on the computational time [78]; i.e., the prediction accuracy of the algorithm is an increasing function of ntree, and the same holds for the computational burden that increases linearly with ntree.For example, while Probst and Boulesteix [79] propose setting ntree as large as computationally feasible, based on a large empirical study, they note that the performance increase rate of the RF algorithm tends to 0 for ntree ≥ 250.Boulesteix et al. [16] recommend increasing ntree until stabilization of the results is reached.
The set of possible values of mtry is {1, ..., p}. Its default value in the randomForest R package is set to ⌊p^(1/2)⌋ for classification tasks (⌊·⌋ denotes the floor, i.e., the largest integer not exceeding its argument), and ⌊p/3⌋ for regression tasks (see also Ziegler and König [17]). Lower mtry values result in faster computations and an increased number of induced randomizations (see Section 2.1.4). The problem of finding optimal values for mtry is far from conclusive and, in general, optimization of mtry may be useful [17]. However, empirical studies show that the aforementioned default values are either adequate, or too small [78]. A comprehensive interpretation of this is as follows: In the case when the majority of selected predictor variables is non-informative, small values of mtry may result in construction of inaccurate trees [16]. Furthermore, in the case when the number of informative variables is large, small mtry values may favor predictor variables whose effect is masked by stronger predictors [16], thus allowing for a higher level of performance/accuracy to be reached.
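As a worked illustration of these defaults, the short Python sketch below computes mtry from p. It assumes that the defaults round down and that the regression default is bounded below by 1, which is our reading of the randomForest package; the function name default_mtry is hypothetical.

```python
import math


def default_mtry(p, task):
    """Default number of candidate predictors per split, given p predictors.

    Assumption: floor(sqrt(p)) for classification and max(floor(p / 3), 1)
    for regression, following our reading of the randomForest defaults.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if task == "classification":
        return math.isqrt(p)          # integer square root, i.e. floor(sqrt(p))
    if task == "regression":
        return max(p // 3, 1)         # never fewer than one candidate
    raise ValueError("task must be 'classification' or 'regression'")


for p in (2, 10, 16, 100):
    print(p, default_mtry(p, "classification"), default_mtry(p, "regression"))
```

For p = 10, both defaults give 3 candidate predictors per split, while for p = 100 the regression default (33) is considerably larger than the classification default (10).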
The default value for nodesize in the randomForest R package is set to 1 for classification tasks, and 5 for regression tasks. Biau and Scornet [2] argue that the aforementioned values are supported by the literature (see also Díaz-Uriarte and De Andres [80]), while Boulesteix et al. [16] also favor small nodesize values, suggesting the use of parameter maxnodes to control the size of the trees. However, when compared to ntree and mtry, nodesize and maxnodes have less influence on the performance of the algorithm [16].
The set of possible values for sampsize is {1, ..., n}, and its default value in the randomForest R package is set to n, which corresponds to bootstrapping if sampling is conducted with replacement. Sub-sampling (i.e., sampsize < n) without replacement may be similar in performance to bootstrapping, although in this case sampsize must be tuned (see e.g., Scornet [78]).

Variable Selection
A general review on the task of variable selection, i.e., what predictor variables to include in an optimal model, can be found in Heinze et al. [81]. In random forests, variable selection can be conducted via variable importance metrics (VIMs, see Section 2.3), with non-significant variables exhibiting randomly distributed VIMs around zero [71]. Therefore, excluding variables with VIMs that fluctuate around zero is a reasonable strategy.
Selection strategies for predictor variables are presented in Díaz-Uriarte and De Andres [80], Genuer et al. [82]. Díaz-Uriarte and De Andres [80] suggest a stepwise approach where different predictor variables are tested and progressively removed until the lowest OOB error is reached. Genuer et al. [82] use a stepwise variable introduction strategy based on ascending VIMs; see Ziegler and König [17] for an assessment of the two approaches.

Interactions
According to Boulesteix et al. [83], for the simplest case of additive regression schemes, interaction "denotes deviations from the additive model that are reflected by the inclusion of the product of at least two predictor variables in the model". Clearly, interaction is fundamentally different from confounding (i.e., the correlation between the predictors, in the case of Gaussian variables), as it explicitly reflects deviations from the additivity assumption, through inclusion of non-linear operations among different predictors; see also Boulesteix et al. [16]. That said, while CARTs have the capacity to account for interactions among different predictor variables, the interconnection patterns in classification and regression trees do not necessarily imply the presence of interactions; see e.g., Boulesteix et al. [83].

Uncertainty, Time Series Forecasting, Spatial and Spatiotemporal Modeling
A theoretical investigation of the uncertainty of random forest algorithms through confidence interval estimation can be found in Wager et al. [84]. Also, Meinshausen [85] used a variant of random forests, referred to as quantile regression forests, for estimation of prediction intervals. Time series forecasting with the use of random forests has also been explored in recent years; see e.g., Tyralis and Papacharalampous [86], Papacharalampous et al. [87,88]. A demonstration of the use of random forests for spatial and spatiotemporal modeling can be found in Hengl et al. [47].

Twenty Two Reasons towards the Use of Random Forests
Perhaps, one of the most motivating arguments towards the use of random forest algorithms is that given in Efron and Hastie [3] (pp. 347, 348): "Random forests and boosting live at the cutting edge of modern prediction methodology. They fit models of breathtaking complexity compared with classical linear regression, or even with standard GLM modeling as practiced in the late twentieth century. They are routinely used as prediction engines in a wide variety of industrial and scientific applications. For the more cautious, they provide a terrific benchmark for how well a traditional parameterized model is performing: if the random forests does much better, you probably have some work to do, by including some important interactions and the like". In what follows, we present a (non-exhaustive) list of appealing properties of random forests, as presented in the recent literature (some of them are common to other machine learning algorithms):
1. They demonstrate increased predictive performance, as verified in competitions (see e.g., Biau and Scornet [2], Díaz-Uriarte and De Andres [80]).
2. They can capture non-linear dependencies between predictor and dependent variables (see e.g., Boulesteix et al. [16]).
3. They are non-parametric; i.e., no parametric statistical model needs to be defined for their use (see e.g., Boulesteix et al. [16]).
4. They are fast compared to other machine learning algorithms (see e.g., Ziegler and König [17]) and, also, they can operate in parallel computing mode.
5. They can be applied to large-scale problems (see e.g., Biau and Scornet [2]).
6. They are straightforward to use (see e.g., Athey et al. [89], and Efron and Hastie [3]
As suggested by the no-free-lunch-theorem [90], no algorithm is perfect and, therefore, random forests should not be approached as a remedy to all types of problems; see e.g., Boulesteix et al. [16] [92] and Section 1.

Random Forest Variants
Several variants of Breiman's [1] original RF algorithm have been developed, e.g., by varying the tree construction procedure, changing the data selection approach for the tree construction, and by using alternative methods to aggregate the developed trees for prediction purposes [16]. Biau and Scornet [2] and Criminisi et al. [15] present a non-exhaustive list of such variants, while Tripoliti et al. [93] propose modifications to the original algorithm for creating new variants. Table 2 presents a non-exhaustive list of older as well as recently developed variants of Breiman's [1] original RF algorithm in chronological order. These include, but are not limited to: (1) Bayesian additive regression trees (BART) for probabilistic prediction (see e.g., Chipman et al. [94]; BART are mostly motivated by boosting algorithms); (2) quantile regression forests, for estimation of conditional quantiles (see e.g., Meinshausen [85]); (3) generalized random forests and heteroscedastic Bayesian additive regression trees for modeling heterogeneous and/or heteroscedastic data (see e.g., references [89,95]); (4) distributional regression forests for estimation of the location, scale, and shape distribution parameters (i.e., similarly to generalized additive models for location, scale, and shape (GAMLSS), but with the use of trees instead of e.g., splines; see e.g., Schlosser et al. [96]); (5) multivariate random forests for prediction of multiple dependent variables (see e.g., Segal and Xiao [97]); (6) survival forests for implementing survival analysis (see e.g., Ishwaran et al. [98]), and (7) decision tree fields for combining the concepts of random forests and random fields in geostatistical applications (see e.g., Nowozin et al. [99]). RF variants particularly suited for interpretation, variable importance assessments, and causal inference (i.e., understanding how changes of the independent variables affect the response variables) include: conditional inference forests (see e.g., Hothorn et al. [100]), causal forests for formal statistical inference (see e.g., Wager and Athey [92]), and random intersection trees and iterative random forests for identification of interactions of high order (see e.g., Shah and Meinshausen [101], Basu et al. [102]).

Table 2. Variants of Breiman's [1] original RF algorithm.
Variant | Reference | Description
Information forests | [111-114] | Handles training data arriving sequentially or continuously, changing the underlying distribution.
Ranking forests | [115,116] | Ranking problems.
Random ferns | [117] | Same test parameters are used in all nodes of the same tree level. It corresponds to a lower parametric version of random forests.
Bayesian additive regression trees | [94] | Aggregation of trees, but inference and fitting is accomplished using Bayesian methods. Conditional means and quantiles can be computed.
Node harvest | [118] | Multiple single nodes.
Density forests | [15] | Density estimation of unlabeled data.
Manifold forests | [15] | Manifold learning (dimensionality reduction).
Semi-supervised forests | [15] | Semi-supervised learning.
Entangled forests | [119] | Entanglement of the tests applied at each tree node with other nodes in the forest.
Decision tree fields | [99] | Combination of random forests and random fields.
STAR model | [120] | They can be seen as single nodes equipped with one random projection and multiple decision thresholds.
Multivariate random forests | [97] | Predicts multiple dependent variables.
Dynamic random forests | [121] | Inclusion of trees in the ensemble learner depending on previous outputs.
Gradient forests | [122] | Use of alternative importance measures.
Regularized random forests | [123,124] | Improvements on variable selection within trees.
Cluster forests | [125] | Appropriate for clustering (unsupervised learning).
Weighted random forests | [126] | Incorporates tree-level weights for more accurate prediction and computation of variable importance.
Random intersection trees | [101] | High-order interaction discovery.
Hyper-Ensemble Smote Undersampled Random Forests | [91] | Undersampling of the majority class and oversampling of the minority class to learn from highly imbalanced data.
Integrated multivariate random forests | [127] | Integrated different data subtypes.
Generalized random forests | [89] | Generalization of random forests for adaptive, local estimation.
Iterative random forests | [102,128] | High-order interaction discovery.
Heteroscedastic Bayesian additive regression trees | [95] | Bayesian additive regression trees for modeling heteroscedastic data.
Local linear forests | [129] | They model smooth signals and fix boundary bias issues. They build on generalized random forests.
Distributional regression forests | [96] | Version of generalized additive models for location, scale, and shape parameters (GAMLSS), using trees.
Causal forests | [92] | Estimation of heterogeneous treatment effects. They can be used for statistical inference.
Neural random forests | [130] | Reformulation of random forests in a neural network setting.

Finally, Criminisi et al. [15] present several interesting ideas regarding the implementation of random forests in unsupervised and semi-supervised learning, such as density forests for density estimation (i.e., estimation of the latent probability density function from which unlabeled observations have been generated), manifold forests for dimensionality reduction, semi-supervised forests for semi-supervised learning, and cluster forests for clustering (i.e., a type of unsupervised learning).

R Software
A detailed search of the literature shows that most RF variants and related utilities are implemented and freely distributed as distinct packages in the R programming language (see Table 3 for a non-exhaustive list), which appears to be the most important source of tree-related software (see e.g., Boulesteix et al. [16], Ziegler and König [17]). R is a programming language and free software environment for statistical computing and graphics. It is widely used for data analysis and development of statistical software. The core of the language is extended through user-created packages, which include programming of statistical methods, advanced methods for creating visualizations and more. There is abundant literature on the use of the R programming language in statistical applications, including freely available internet resources with presentation of software implementations (e.g., RPubs, https://rpubs.com/). Random forest algorithms implemented in programming languages other than R are presented in Boulesteix et al. [16]. The R package directly linked to Breiman's [1] original paper is randomForest, which is also the most commonly used random forest related R package. An improved, faster version is the ranger R package; see e.g., Wright and Ziegler [131], where one can find comparisons regarding the speed of different random forest software implementations. Other available R packages deal with computation of variable importance and variable selection, imputation of missing values, and visualization (e.g., plotting of trees), while other packages are directly linked to specific applications and/or combinations of methods.

Random Forests in a Published Case Study
In this Section, we examine the streamflow forecasting case study by Papacharalampous and Tyralis [132], and how it could have been improved by considering the findings of Sections 2-4. Papacharalampous and Tyralis [132] use previous-day observed streamflow and precipitation as predictor variables to produce next-day forecasts; i.e., a common problem in hydrology (see e.g., Table 1), where numerous machine learning algorithms have been applied. Forecasts are generated by implementing random forests (specifically the ranger R package, with root mean square errors and mean absolute forecast errors as performance indicators), with recursive retraining (i.e., the algorithm is retrained based on past data at each step of the forecast sequence), and predictor variables selected using linear metrics (i.e., the estimated streamflow autocorrelations, and the estimated cross-correlations between precipitation and streamflow, at different lag times).
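The linear screening of lagged predictors described above can be illustrated with a short sketch. This is not the code of Papacharalampous and Tyralis [132]; it is a minimal Python illustration under stated assumptions: the function names, the synthetic precipitation and streamflow series, and the 0.5 cutoff are all hypothetical and chosen only for the example.

```python
from statistics import mean


def lagged_correlation(x, y, lag):
    """Pearson correlation between x[t - lag] and y[t] for a lag >= 0."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lag < 0 or lag > len(y) - 2:
        raise ValueError("lag must leave at least two paired values")
    xs = x[:len(x) - lag]  # predictor values, shifted back by `lag` steps
    ys = y[lag:]           # response values aligned with xs
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no linear information
    return cov / (vx * vy) ** 0.5


def select_lags(x, y, max_lag, cutoff):
    """Lags 1..max_lag whose absolute lagged correlation reaches the cutoff."""
    return [k for k in range(1, max_lag + 1)
            if abs(lagged_correlation(x, y, k)) >= cutoff]


# Synthetic daily series (hypothetical values, for illustration only).
precip = [0, 5, 0, 0, 12, 3, 0, 0, 7, 0, 0, 0, 9, 1, 0, 4, 0, 0, 6, 0]
flow = [1.0]
for t in range(1, len(precip)):
    # Toy catchment: today's flow depends on yesterday's flow and rainfall.
    flow.append(0.6 * flow[t - 1] + 0.8 * precip[t - 1])

# Autocorrelation screening (streamflow against its own past) and
# cross-correlation screening (precipitation against streamflow).
print("streamflow lags kept:", select_lags(flow, flow, 3, 0.5))
print("precipitation lags kept:", select_lags(precip, flow, 3, 0.5))
```

Because the screening relies on Pearson correlation, it only detects linear dependence; a lag that matters through a non-linear relationship could be discarded, which is one motivation for the importance-based alternatives discussed in this Section.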
Based on the findings of Sections 2-4, several improvements could have been possible. For example, variable selection could have been performed based on variable importance metrics, following the strategies presented in Section 2.5, rather than using linear metrics. In addition, different software options could have been considered (see Section 4), while the performance of the algorithm could have been assessed using multiple metrics (see e.g., references [133][134][135]). Note that while including additional (even redundant) predictor variables does not negatively influence the performance of random forests, the computational cost of training the algorithm increases, especially if its parameters require tuning. Therefore, if the aforementioned alternative options had been taken into account, there could have been a compromise between the number of predictor variables, the required degree of optimization, and the computational time.
Finally, several limitations of the algorithm could have been discussed in the study, including the inability of random forests to extrapolate outside the training range (see Section 2.8.2), as well as the intrinsic assumption of stationarity common to all machine learning algorithms. The latter precludes application of data-driven methods and models to resolve effects associated with changes in the catchment due to human influences; e.g., land cover changes.
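The extrapolation limitation mentioned above follows from the fact that each leaf of a regression tree predicts the mean of training responses, so an average over trees cannot leave the range of the training data. The sketch below illustrates this with a deliberately simplified setting: a single predictor and plain bagging (with one predictor, there is no random feature selection to perform). All function names and the synthetic linear data are assumptions made for the example; it is not the ranger or randomForest implementation.

```python
import random
from statistics import mean


def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree; a leaf is a number, a node is a tuple."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_sse, best_i = None, None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = mean(left), mean(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    if best_i is None:
        return mean(ys)  # leaf: mean of the training responses it contains
    threshold = (pairs[best_i - 1][0] + pairs[best_i][0]) / 2
    left_pairs, right_pairs = pairs[:best_i], pairs[best_i:]
    return (
        threshold,
        fit_tree([x for x, _ in left_pairs], [y for _, y in left_pairs], min_leaf),
        fit_tree([x for x, _ in right_pairs], [y for _, y in right_pairs], min_leaf),
    )


def predict_tree(tree, x):
    """Route x down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: each tree is fitted to a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, x) for tree in forest])


# Synthetic training data with a clear linear trend: y = 2x on x = 1..20.
train_x = [float(i) for i in range(1, 21)]
train_y = [2.0 * x for x in train_x]
forest = fit_forest(train_x, train_y)

# Inside the training range the ensemble follows the trend approximately;
# outside it, the prediction stays flat at the level of the last leaves.
for x_new in (10.0, 20.0, 40.0):
    print(x_new, predict_forest(forest, x_new))
```

In this toy setting the true response at x = 40 would be 80, yet the ensemble cannot predict more than the largest training response (40), and it returns the same value at x = 40 as at x = 20.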

Literature Search Results
In an effort to chart the use of random forests in water sciences, we used the Scopus database to conduct a literature search based on papers published in Journals related to the Water Science and Technology subject areas. The search was restricted to: (a) Journals with CiteScore ≥ 2 (for year 2017), and (b) papers published until 31 December 2018. CiteScore is a metric to track Journal performance published by Elsevier. While other paper selection criteria could also be applied, we feel that the adopted ones resulted in a sufficient list of representative papers. Studies citing Breiman's [1] original paper were selected as a starting basis. From the identified articles, we kept only those that include some type of implementation of random forest algorithms and/or their variants. Notably, most Journals with CiteScore larger than 2 include at least one implementation of random forests. The resulting list includes 203 papers (references ) published in 30 Journals. Parkhurst et al. [250] were the first to use random forests in the corresponding list of 203 papers, to solve water quality related problems. The next two articles on the list appear in the year 2008, one appears in 2009, while the number of papers including RF implementations increases exponentially after 2010; see Figure 2. Figure 3 shows the list of the 30 Journals (in descending order of published articles) that include some type of implementation of random forests and/or their variants, while Figure 4 illustrates the CiteScores of the selected Journals for year 2017. A visualization of the topics addressed per Journal is presented in Appendix A (Figure A1).
Journals exhibiting the largest numbers of published RF-related papers are Journal of Hydrology, Water Resources Research, and Water (see Figure 3). However, in many Journals, the number of RF-related articles is still relatively low. In fact, only seven Journals have published more than 10 articles with RF-related implementations.
As shown in Figure 5, random forests have been applied to solve practical problems from diverse regions of the world. While global data are frequently exploited (see 4th entry in Figure 5), most reviewed studies focus on data originating from the USA and China. This is mainly due to the extensive scientific research conducted by Universities and Research Institutes located in these countries, as well as the availability of open datasets in the USA.
As indicated by Figure 6, random forests have been mostly used for regression tasks, but the number of classification studies is also significant.
Random forests have been used to model a large variety of water-related variables. Here, we have grouped these variables into 21 categories presented in Figure 7. An important note to be made here is that a large part of the RF literature is devoted to remote sensing applications. As shown in Figure 7, the most frequently studied variable is streamflow, which embodies river discharge and related variables. Applications falling under this category include streamflow modeling; e.g., using data-driven rainfall-runoff models, while streamflow imputation of missing values is also of increased interest. A second theme frequently met is water chemistry, including water quality. These two themes are also the most frequently met in reviews of data-driven models in water resources (see also Section 1). Flow related statistics (e.g., the study of hydrological signatures) also tend to dominate the reviewed applications, as random forests can also be used for understanding/interpreting hydrologic phenomena, e.g., through the use of VIMs. Other variables frequently met in random forest applications are linked to ecology, land cover, urban water (including water demand and desalination), floods, and soil properties. Evidently, the variety of variables modeled using random forests is considerably larger than that commonly met in typical data-driven modeling.
Two additional important aspects to map are the reasons why random forests are used in water resources applications and their corresponding limitations as perceived by the authors (see Figure 8). In this context, we reviewed each paper in the list, and used a binary coding approach (i.e., 1 for true, 0 for false) to map reference to each of the specific reasons presented in Section 2.8.1 (reasons 1.1-1.22) and Section 2.8.2 (reasons 2.1-2.7). The obtained results are summarized in the next Figures.
As illustrated in Figure 8, random forests are mostly used due to their high predictive power (reason 1.1). This should be expected, as the same reason drives most applications of data-driven models in water resources. However, the use of random forests is also largely motivated by their capability to provide variable importance metrics (reason 1.14) and, perhaps, this makes them stand out from the general class of data-driven models, which focus solely on predictive modeling. Efficient modeling of non-linear relationships (reason 1.2) is also a principal reason for the use of random forests, while other reasons referring to their predictive performance and ease of use also prevail (see reasons 1.7, 1.8, 1.3). The efficiency of random forests in selecting variables (reason 1.15), modeling interactions (reason 1.12), and their flexibility (reason 1.13) are also of great importance. Reasons related to the simplicity and speed of the algorithm (reasons 1.4, 1.9) are also frequently mentioned.
Turning to the cautionary use of RF-related algorithms, the most frequently mentioned reasons relate to the reliability of VIMs (reason 2.3), and the inability of random forests to extrapolate outside the training range (reason 2.2). It is remarkable that none of the reviewed papers mentions that the theoretical properties of the algorithm are not well-understood (reason 2.1). Perhaps, this could be attributed to the fact that all the reviewed articles focus on practical applications. Another shortcoming of RF algorithms, which is not frequently mentioned, is the probable decrease of their performance due to their complete automation (reason 2.5).
Another clear outcome of the conducted review is that variants of random forests have been used less frequently than the original version of the algorithm (see Figure 9). The most implemented variant is conditional inference trees, followed by extremely randomized trees and quantile regression forests. The use of conditional inference trees alleviates shortcomings related to the reliability of the VIMs (reason 2.3), while quantile regression forests can provide probabilistic predictions; therefore, they are relevant to the context of uncertainty estimation. An interesting pattern related to the multiple implementations of extremely randomized trees is their introduction and demonstration in a series of papers published by a research team from Italy.

More in-depth Analysis on the Use of Random Forests
In order to identify possible dependencies between the different reasons outlined in Sections 2.8.1 and 2.8.2 on the use of random forests, Figure 10 presents a correlation matrix between the indicator (i.e., 0-1) series obtained for each reason based on the list of reviewed articles. By applying a low threshold equal to 0.3, the following reasonable connections are revealed:

• Ability to model non-linear relationships (reason 1.2), and ability to model interactions (reason 1.12).

• Flexibility of the algorithm (reason 1.13), and reliability of VIMs (reason 2.3).

Please note that the latter connection reflects articles dealing with conditional inference trees, raising the issue of reliability.
At the same threshold, the following connections are considered non-intuitive, as they originate from highly skewed samples (i.e., large fractions of zeros or ones in the indicator series):

• Ability to process small samples (reason 1.16), and free software implementation (reason 1.22).

• Ability to solve problems with many classes (reason 1.19), and free software implementation (reason 1.22).
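The thresholding of the indicator correlation matrix described above can be sketched as follows. The indicator series below are invented for eight hypothetical papers and do not reproduce the reviewed data; the function names are also assumptions. The sketch further assumes that only positive correlations at or above the threshold count as connections, which the text does not state explicitly.

```python
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length 0-1 indicator series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return None  # undefined when a reason is mentioned by all or no papers
    return cov / (va * vb) ** 0.5


def linked_reasons(indicators, threshold=0.3):
    """Pairs of reasons whose correlation is at least `threshold`."""
    links = []
    for r1, r2 in combinations(sorted(indicators), 2):
        r = pearson(indicators[r1], indicators[r2])
        if r is not None and r >= threshold:
            links.append((r1, r2, round(r, 2)))
    return links


# Hypothetical coding of eight papers (1 = reason mentioned, 0 = not).
indicators = {
    "1.2": [1, 1, 1, 0, 0, 1, 0, 0],   # non-linear relationships
    "1.12": [1, 1, 0, 0, 0, 1, 0, 0],  # interactions
    "1.16": [0, 0, 0, 0, 0, 0, 0, 1],  # small samples
    "1.22": [0, 0, 0, 0, 0, 0, 0, 1],  # free software
}

for link in linked_reasons(indicators):
    print(link)
```

In this toy example, reasons 1.16 and 1.22 are perfectly correlated only because a single paper mentions both, which mirrors the caveat about highly skewed indicator series: with very few ones, one co-occurrence is enough to exceed the threshold.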
Looking at the number of reasons mentioned in each paper on the use of random forests (see Figure 11), one sees that articles published in Water Resources Research are very attentive in explaining the modeling choices. We also investigated the potential of a possible linkage between the number of reported reasons and the supervised learning task, but no specific pattern could be extracted; see Figure 12.
Another type of dependence to examine is whether the type of variables modeled is related to the number of reported reasons for using random forests; see Figure 13. It appears that sediment-related studies reason in greater detail on the use of random forests, while frequently studied variables such as streamflow, water chemistry and flow related statistics (see top variables in Figure 13) appear to be almost equivalent in terms of the presented reasoning.
Finally, close inspection of Figure 14 shows that regression related tasks are mostly linked to hydrologic variables/applications (i.e., streamflow, precipitation, evapotranspiration, temperature, soil, agriculture, droughts), while classification is more abundant when modeling land cover, natural hazards and snow, which are closely related to remote sensing applications.

Concluding Remarks and Take-home Considerations
Random forests (RF) are simple and fast algorithms with high predictive performance, which can also assist with the interpretation of natural phenomena. Their properties have been recently explored in the area of water resources, resulting in an exponential increase of their use. In addition, due to their flexibility, numerous RF-variants have appeared lately to improve various aspects of modeling.
We expect an even higher increase of their use in water resources for prediction and inference purposes, as big data are rapidly becoming more available. In what follows, we outline some remarks and recommendations for practicing water scientists, hoping for full exploitation of the method for prediction and inference purposes:

1. Contrary to the general class of data-driven models, which focus mostly on forecasting and prediction over interpretation and understanding, random forests allow for explicit interpretation of the obtained results through variable importance metrics (VIMs); see Introduction.

2. Important considerations regarding the implementation of data-driven models in water science, such as splitting of the dataset into training and testing periods, preprocessing of variables, and variable selection, are explicitly dealt with by random forests. For example, tuning of the algorithm is commonly performed using OOB (out-of-bag) data (see Sections 2.1.4 and 2.5), preprocessing has generally small influence on the predictive performance of the algorithm (see reason 1.20 in Section 2.8.1), while there are many automatic variable selection procedures based on VIMs (see reason 1.15 in Section 2.8.1).

3. In 33% of the reviewed water-related studies (i.e., 67 out of 203) random forests were not the algorithm of focus but, rather, they were used to complement other modeling approaches to improve inference. This highlights their usefulness in water science.

4. The role of random forests as a useful complementary tool in water resources applications is related to their benchmarking nature (see e.g., the comment by Efron and Hastie [3] (pp. 347, 348) in Section 2.8.1, and reason 1.1), as well as their simplicity and ease of use (see Section 2.8.1). Other important properties of RF algorithms are their speed, and the fact that little (or no) tuning of their parameters is required to reach an acceptable predictive performance; see Section 6.1.

5. While some attractive properties of random forests are also shared by other data-driven methods (e.g., non-linear and non-parametric modeling), their selection is driven mostly by their increased predictive performance, their capability to capture non-linear dependencies and interactions of variables, as well as their speed, parsimonious parameterization, ease of use, and ability to handle big datasets; see Sections 6.1 and 6.2, and Figure 8. The use of VIMs for interpretation and variable selection is also noteworthy, as they are not commonly implemented by data-driven models other than random forests.

6. The large potential of random forests in water resources applications has been exploited only to a small degree. Perhaps, this is related to the fact that many RF-variants were introduced very recently, while the properties of the algorithm are not fully understood; see Section 6.1. Thus, the potential for further uses and improvements is large, including variants specializing in clustering, modeling of interactions, heteroscedasticity, survival analysis, computation of VIMs and more. The added value of random forests is also confirmed by a wide range of applications in diverse areas of research, such as streamflow modeling, imputation of missing values, water quality, hydrological signatures, ecology, land cover, urban water, floods, and soil properties among other applications; see Section 6.1 for further details.

7. Another important aspect is that most RF-variants have been implemented in the R programming language, and are freely available; see Table 3. This facilitates reproducibility of the results, research advancements, as well as further uses of the algorithm.
In closing, it is quite remarkable that only a few studies recognize possible shortcomings of random forests and their variants, such as their inability to extrapolate outside the training range, and the probable decrease of their performance due to their complete automation. Thus, better understanding of the theoretical properties of the algorithm, its limitations, as well as the conditions that may hinder applicability of random forests, constitute important topics for future consideration.

Figure 1 .
Figure 1.Decision tree example (adapted from Hastie et al. [58], p.306).Xj denote predictor variables.The tree has four internal nodes and five leaves (terminal nodes).Xj ≤ tk and Xj > tk correspond to the left and right branches of each internal split, respectively.Ri denotes the mean of the observations at leaf i [39] (p.304).

Figure 1 .
Figure 1.Decision tree example (adapted from Hastie et al.[58], p.306).X j denote predictor variables.The tree has four internal nodes and five leaves (terminal nodes).X j ≤ t k and X j > t k correspond to the left and right branches of each internal split, respectively.R i denotes the mean of the observations at leaf i [39] (p.304).

Figure 2 .
Figure 2. Total number of articles implementing random forests, or their variants, per year of publication.

Figure 3 .
Figure 3. Number of published papers per Journal that include RF-related implementations.Figure 3. Number of published papers per Journal that include RF-related implementations.

Figure 3 .
Figure 3. Number of published papers per Journal that include RF-related implementations.Figure 3. Number of published papers per Journal that include RF-related implementations.Water 2019, 11, x FOR PEER REVIEW 18 of 43

Figure 4 .
Figure 4. Journal CiteScores where RF-related papers are published.The Journals are ranked in descending order, based on the number of published RF-related papers (see also Figure 3).

Figure 4 .
Figure 4. Journal CiteScores where RF-related papers are published.The Journals are ranked in descending order, based on the number of published RF-related papers (see also Figure 3).

Figure 5 .
Figure 5. Number of published RF-related papers conditioned on region of application.

Figure 6 .
Figure 6.Grouping of RF-related articles based on supervised learning tasks.

Figure 5 .
Figure 5. Number of published RF-related papers conditioned on region of application.

Water 2019 , 43 Figure 5 .
Figure 5. Number of published RF-related papers conditioned on region of application.

Figure 6 .
Figure 6.Grouping of RF-related articles based on supervised learning tasks.

Figure 6 .
Figure 6.Grouping of RF-related articles based on supervised learning tasks.

Figure 7 .
Figure 7. Number of published RF-related papers conditioned on the examined variable.

Figure 7 .
Figure 7. Number of published RF-related papers conditioned on the examined variable.

Water 2019 ,
11, x FOR PEER REVIEW 22 of 43

Figure 9 .
Figure 9. Number of published papers implementing random forests and their variants.

Figure 9 .
Figure 9. Number of published papers implementing random forests and their variants.

6. 2 .
More in-Depth Analysis on the Use of Random Forests

Water 2019 , 43 Figure 10 .
Figure 10.Correlation matrix between the indicator (i.e., 0-1) series obtained for each reason (see Sections 2.8.1 and 2.8.2) based on the list of reviewed articles.

Figure 10 .
Figure 10.Correlation matrix between the indicator (i.e., 0-1) series obtained for each reason (see Sections 2.8.1 and 2.8.2) based on the list of reviewed articles.

Figure 11 .
Figure11.Boxplot of number of reasons per paper for using random forests conditioned on the Journal.The Journals are ranked in descending order based on the number of published papers implementing random forests (see Figure3).

Figure 12 .
Figure 12.Boxplot of number of reasons per paper for using random forests conditioned on the supervised learning task.

Figure 13. Boxplot of number of reported reasons for using random forests conditioned on the examined variable. The variables are ranked in descending order based on the number of papers implementing random forests (see Figure 7).

Figure 14. Ratio of supervised learning tasks conditioned on the examined variable. The respective numbers of papers are shown in Figure 7.

Figure A1. Fraction of papers modeling different variables (see Section 6.1 and Figure 7) conditioned on the Journal. The Journals are ranked in descending order based on the number of published papers on random forests (see Figure 3).
Boulesteix et al. [16], Ziegler and König [17], Díaz-Uriarte and De Andres [80], Boulesteix et al. [83]).
1.13. They are flexible (i.e., there is a large potential for modifications), and a large number of variants of random forests have been designed to perform different tasks (see e.g., Boulesteix et al. [16], Ziegler and König [17], Athey et al. [89], and Section 3).
1.14. They permit ranking of the relative significance of predictor variables through variable importance metrics (VIMs; see Section 2.3 and Biau and Scornet [2], Ziegler and König [17], Díaz-Uriarte and De Andres [80]).
1.15.
1.22. Free software implementations of RF algorithms exist (see e.g., Díaz-Uriarte and De Andres [80]), with most variants and extensions being available as contributed packages in the R programming language.

2.8.2. Why the Practicing Hydrologist Should Use Random Forests with Caution

In fact:
2.1. The theoretical properties of random forests are not fully understood, and they are usually interpreted based on simplified/stylized versions of the algorithm (see e.g., Biau and Scornet [2], Ziegler and König [17], and Section 2.2).
2.2. Random forests cannot extrapolate outside the training range; see Hengl et al. [47] for an example.
2.3. Variable importance metrics (VIMs) are not always reliable, as they are affected by high correlations and interactions (see e.g., Boulesteix et al. [16], Ziegler and König [17]).
2.4. Random forests are harder to interpret/understand than single trees (see e.g., Ziegler and König [17]).
2.5. The automation of random forests may result in a slight decrease of their predictive performance compared to, e.g., highly parameterized tree-based boosting (see e.g., Efron and Hastie [3], p. 324).
2.6. They cannot adequately model datasets with imbalanced data (i.e., datasets in which the number of observations of the response variable belonging to one class differs significantly from that of the other classes [91]).
2.7. Their original version is not suited for causal inference; see Wager and Athey
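
The extrapolation limitation in item 2.2 can be illustrated with a minimal sketch. It is not Breiman's full algorithm and not any R package's implementation: it is a simplified bagging of one-dimensional regression trees written in plain Python, assuming a single predictor, a noise-free linear response, and hypothetical helper names (`fit_tree`, `fit_forest`, `predict_forest`). Because every leaf stores a mean of training responses, the ensemble prediction cannot leave the range of the training responses.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree; a leaf stores the mean training response."""
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None  # (sum of squared errors, split index)
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[k - 1] == xs[k]:
            continue  # cannot split between identical predictor values
        left, right = ys[:k], ys[k:]
        m_left, m_right = mean(left), mean(right)
        sse = (sum((v - m_left) ** 2 for v in left)
               + sum((v - m_right) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        return mean(ys)  # leaf node
    k = best[1]
    threshold = (xs[k - 1] + xs[k]) / 2
    return (threshold,
            fit_tree(xs[:k], ys[:k], min_leaf),
            fit_tree(xs[k:], ys[k:], min_leaf))

def predict_tree(node, x):
    # Internal nodes are (threshold, left, right) tuples; leaves are floats.
    while isinstance(node, tuple):
        threshold, left, right = node
        node = left if x <= threshold else right
    return node

def fit_forest(xs, ys, n_trees=25, seed=1):
    """Bagging: each tree is fitted to a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    return mean([predict_tree(tree, x) for tree in forest])

# Illustrative data (an assumption): y = 2x observed for x in [0, 10].
xs = [i / 10 for i in range(101)]
ys = [2 * x for x in xs]
forest = fit_forest(xs, ys)

inside = predict_forest(forest, 5.0)    # within the training range
outside = predict_forest(forest, 20.0)  # far outside the training range
print("prediction at x = 5: ", inside)
print("prediction at x = 20:", outside, "(true value would be 40)")
```

In this sketch the prediction at x = 20 stays at or below the largest training response, although the underlying relationship would give 40. With several predictors, a full random forest would additionally sample a random subset of predictors at each split, which does not change this behavior.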

Table 2. Variants of random forests.

Table 3. R packages related to random forests (in alphabetical order), and their specific tasks. The packages can be found in the Comprehensive R Archive Network.