Multi-Variables-Driven Model Based on Random Forest and Gaussian Process Regression for Monthly Streamﬂow Forecasting

: Due to the inherent non-stationary and nonlinear characteristics of original streamﬂow and the complicated relationship between multi-scale predictors and streamﬂow, accurate and reliable monthly streamﬂow forecasting is quite difﬁcult. In this paper, a multi-scale-variables-driven streamﬂow forecasting (MVDSF) framework was proposed to improve the runoff forecasting accuracy and provide more information for decision-making. This framework was realized by integrating random forest (RF) and Gaussian process regression (GPR) with multi-scale variables (hydrometeorological and climate predictors) as inputs and is referred to as RF-GPR-MV. To validate the effectiveness and superiority of the RF-GPR-MV model, it was implemented for multi-step-ahead monthly streamﬂow forecasts with horizons of 1 to 12 months for two key hydrological stations in the Jinsha River basin, Southwest China. Other MVDSF models based on the Pearson correlation coefﬁcient (PCC) and GPR with/without multi-scale variables or the PCC and a backpropagation neural network (BP) or general regression neural network (GRNN), with only previous streamﬂow and precipitation, namely, PCC-GPR-MV, PCC-GPR-QP, PCC-BP-QP, and PCC-GRNN-QP, respectively, were selected as benchmarks. Experimental results indicated that the proposed model was superior to the other benchmark models in terms of the Nash–Sutcliffe efﬁciency (NSE) for almost all forecasting scenarios, especially for forecasting with longer lead times. Additionally, the results also conﬁrmed that the addition of large-scale climate and circulation factors was beneﬁcial for promoting the streamﬂow forecasting ability, with an average contribution rate of about 15%. The RF in the MVDSF framework improved the forecasting performance, with an average contribution rate of about 25%. This improvement was more pronounced when the lead time exceeded 3 months. Moreover, the proposed model could also provide prediction intervals (PIs) to characterize forecast uncertainty, as supplementary information to further help decision makers in relevant departments to avoid risks in water resources management.


Introduction
Accurate and reliable streamflow forecasting is an important basis for rational allocation and sustainable utilization of water resources [1,2]. It can not only provide key decision-making information for avoiding natural disasters such as floods and droughts but also be conducive to the safe and economic operation of reservoirs and the coordination In this paper, two types of IVS are adopted. One is a filter method named PCC, which is well-known for its simple operation. The other is random forest (RF), a typical ML method with the advantage over other commonly used ANNs and SVMs of needing no optimization [29]. In most previous applications of RF in hydrological prediction, it has been used as a prediction model rather than an IVS method. Pham et al. [30] evaluated the potential of RF to achieve streamflow forecasts. Shen et al. [31] used RF for correcting daily discharge predictions. Ahana et al. [32] examined the abilities of LSTM, ELM, and RF to predict monthly streamflow. This paper focuses on the utilization of the ability of RF for feature selection.
Another way to improve the performance of DDM forecasting models is by using hybrid models, which integrate signal decomposition algorithms and/or optimization algorithms. For example, Yaseen et al. [33] applied three different bio-inspired optimization algorithms (GA, PSO, and DE) to optimize the membership function of ANFIS. Chen et al. [4] used BSA to optimize the parameters of the standard ELM model, to improve flood forecasting accuracy. Maheswaran and Khosa [34] proposed a wavelet-Volterra coupled (WVC) model for one-month-ahead streamflow forecasting. Kalteh [35] investigated the relative accuracy of artificial neural network (ANN) and SVR models coupled with wavelet transforms in monthly river flow forecasting and compared them to regular ANN and SVR models, respectively. Sun et al. [27] considered the merits of adaptive variational mode decomposition (AVMD), a backtracking search algorithm (BSA), and regularized extreme learning machine (RELM) to develop a novel hybrid wind speed forecasting model. All this research suggested that hybrid models outperform single models. Unfortunately, coupling signal decomposition algorithms may introduce large decomposition errors when performing decomposition without using future information [6].
Additionally, most research using DDMs has focused on providing deterministic results with a single point value. However, deterministic forecasting cannot quantify the internal uncertainty level of forecasting, so it provides limited information for decision makers in relevant departments [36]. To better support modern water resources management, it is also valuable to predict an interval covering possible future streamflow, since forecasting errors are inevitable. Interval forecasting approaches can be divided into two groups: parametric and nonparametric methods. Parametric methods assume that the predictive distribution follows a given parametric distribution. The normal distribution is a widely used distribution. Nonparametric methods relax the assumption regarding the shape of the predictive distribution. Widely used nonparametric interval forecasting methods include kernel density estimation, quartile regression (QR), and bootstrap-based techniques [27,37]. Nowadays, a very well-known ML method named Gaussian process regression (GPR), which is a nonparametric kernel-based probabilistic approach, has gained considerable attention in various fields such as biodiesel properties [38] and solar irradiance [39] forecasting. The main merit of GPR over other ML methods is that it inherits the strong learning ability of ML and the strong reasoning capacity of the Bayesian method for solving uncertainty problems. Therefore, it can provide both deterministic forecasting results and the uncertainty interval for a given significance level.
In general, the existing research based on DDMs has room for improvement in the following aspects: (1) Model inputs lack climate information. Natural hydrological and other geophysical processes are interactive, and therefore model inputs only taking rainfall and/or streamflow into consideration do not fully characterize the impact of climate change on runoff. (2) The widely used PCC input selection method reflects linear relationships between streamflow and its potential forecasting factors, which is not entirely consistent with the actual relationships between them. (3) Numerous studies focus on 1-month-ahead streamflow forecasting and provide deterministic forecasting results, which cannot provide sufficient decision-making information for reliable and safe water resource management. To address the above questions, in this study, a multi-scale-variables-driven streamflow forecasting (MVDSF) framework is proposed. In the MVDSF framework, many climate factors and circulation factors are first adopted as supplementary predictors to represent climate change. Then, the dimensions of the input data are reduced to save computing resources and modeling time and to reduce the risk of overfitting. After that, using the selected variables as inputs, a data-driven model is constructed, and its parameters are optimized using a grid search algorithm. Finally, the well-tuned data-driven model is applied to predict streamflow.
Here, this framework is realized by combining RF and GPR with multi-scale variables (hydrometeorology, climate predictors) as inputs. The framework is referred to as RF-GPR-MV. The motivation behind the choice of RF is that it is a powerful tool for feature selection for high-dimensional variables but has been less frequently investigated with respect to monthly streamflow forecasting. The reason behind the choice of GPR is that: (1) it has been less frequently investigated in the field of multi-step-ahead monthly streamflow forecasting, (2) it has a fast learning speed compared to other ML models (ANN, SVM, and ELM), and (3) it has both nonlinear fitting ability and uncertainty quantification ability.
The RF-GPR-MV model can provide both deterministic and probabilistic streamflow prediction results. The RF is applied to reduce the input dimensions, and the GPR is applied as the base forecasting module. Comparative experiments have been conducted to verify the proposed RF-GPR-MV model at different hydrological stations. Specifically, the commonly used PCC is separately integrated with the single GPR and two frequently used classical neural networks (BP and GRNN) to validate the effectiveness of the proposed RF-GPR-MV when using high-dimensional inputs. Additionally, to illustrate the effect of climatic factors and circulation indices on forecasting accuracy, input that only uses the previous streamflow and antecedent precipitation is also adopted. Thus, several comparison models are constructed, abbreviated as PCC-GPR-MV, PCC-GPR-QP, PCC-BP-QP, and PCC-GRNN-QP. These models are applied to forecast the 1-month-ahead to 12-month-ahead streamflow at the two main hydrological stations in the Jinsha River basin. The results reveal that the RF-GPR-MV outperforms the other comparison models, emphasizing the contributions of climate and circulation indicators for monthly streamflow forecasting with long lead times.
The main contributions of this paper are as follows: (1) A high-dimensional nonlinear candidate input factor set with more than 400 items is constructed for the first time. (2) A hybrid RF-GPR-MV model in a MVDSF framework is designed for multi-step-ahead monthly streamflow forecasting. (3) The impact of different factors that contribute to the improvement of the hybrid model are evaluated quantitatively.
The reminder of this paper is organized as follows. Section 2 is the Materials and Methods section. Section 3 presents the framework and procedure of the model. Section 4 is a case study. Section 5 provides and discusses the results. Finally, Section 6 concludes this paper with a summary.

Streamflow Prediction Based on ML
Monthly streamflow prediction can be considered as a nonlinear regression problem. For this nonlinear regression problem, the main goal is to find a hydrologic cycle system transfer model for the relationship between streamflow Q and many covariates X based on n observed pairs S n = {(X 1 , y 1 ), (X 2 , y 2 ), · · · , (X n , y n )}, X ∈ R m , y ∈ R, and then make predictions for the streamflow. The covariates include local meteorological (precipitation, potential evapotranspiration, temperature, and humidity) and global climate information.
The streamflow forecasting can be mathematically expressed as: where Q(t + L) is the predicted streamflow at time t + d, L represents the lead time, Q(t − d 1 + 1) stands for the historical flow with t − d 1 + 1 time steps, P(t − d 2 + 1) represents the antecedent precipitation up to t − d 2 + 1 time steps, and Other(t − d 3 + 1) represents the other relevant local meteorological and/or global climate and circulation factors up to t − d 3 + 1 time steps that make higher contributions to the streamflow at time t + d.
The other relevant factors include local meteorological indices such as potential evapotranspiration, temperature, humidity, and/or the flow from major control stations in the upper reaches, together with climate indices such as ENSO. In addition, d i , i = 1, 2, and 3 is the time lag for these relevant factors and ϕ(·) is a hydrologic cycle system transfer function to describe the complicated nonlinear interaction between the flow and its relevant factors at the basin scale.
Two categories of models can be employed to estimate the function ϕ(·). One category includes PMs such as the famous Xin'anjiang hydrological model, and the other category includes DDMs based on ML techniques.
In many cases, DDMs based on ML can be used to replace PMs for multi-step advance monthly runoff prediction, for reasons such as insufficient understanding of the physical mechanisms of the water cycle, the low accuracy of long-term meteorological prediction results, and the lack of modeling data in some areas. In addition, DDMs are easy to implement and can be combined with other emerging technologies to improve their prediction performance.
In this paper, a new DDM named RF-GPR-MV is proposed by integrating RF and GPR for multi-step-ahead monthly streamflow forecasting. To further improve the physical fundamentals of the forecasting model, as well as local hydrometeorological factors, some global climate factors are also considered, in order to represent climate change in the proposed model. In this novel hybrid model, RF is applied to find the optimal input vector from the huge number of candidate input variables, and GPR, a well-known ML technique, is adopted as a basic forecasting module to characterize the hydrologic cycle system transfer function ϕ(·). The theories of RF and GPR are presented in the following subsections.

Random Forest
Random forest (RF) is an ensemble machine learning method for improving the performance of classification and regression trees, as well as for reducing overfitting risk. It has become popular since it was brought out and has been widely utilized in many fields such as rainfall forecasting [40], solar radiation forecasting [41], urban water consumption [42], and land cover classification [43]. It is a powerful machine learning tool for identifying features and/or fitting nonlinear relationships for high-dimensional data, especially in the case of small samples. In addition, it also can give an importance score for each input feature variable by permuting each feature.
Since all the data used in this study are time series, the rest of this section is limited to regression issues. For a given input vector X with m features, with a corresponding output vector Y, a training set S n with n observations can be constructed: The bootstrap sampling technique is firstly employed to obtain training samples from the original data. A bootstrap sample is generated by randomly selecting n observations with replacements from the original training data. Each observation has the same probability 1/n of being chosen and may appear more than once. The bootstrap samples S B n are used to establish B regression trees, and the rest of the out-of-bag (OOB) data S OOB n = {X OOB , y} = S B n / ∈ S n are applied to verify the performance of the built regression trees. All these trees compose a random forest, as shown in Figure 1. The final prediction results of the RF are obtained by aggregating all the regression trees. The prediction precision of each regression tree can be represented by the mean squared error (MSE) between the predicted values and the observed values of the OOB data. trees. All these trees compose a random forest, as shown in Figure 1. The final prediction results of the RF are obtained by aggregating all the regression trees. The prediction precision of each regression tree can be represented by the mean squared error (MSE) between the predicted values and the observed values of the OOB data.  Figure 1. Random forest.
The i-th regression tree Tb (b = 1, 2, …, B) is employed to predict the output ˆi y of OOB X .
There are many factors affecting the generation of streamflow, and these factors interact with each other. The correlations between variables must be considered in the process of determining the RF importance measure. The procedure of RF for estimating the correlated variable importance measure can be briefly described as follows: The i-th regression tree T b (b = 1, 2, . . . , B) is employed to predict the outputŷ i of X OOB . There are many factors affecting the generation of streamflow, and these factors interact with each other. The correlations between variables must be considered in the process of determining the RF importance measure. The procedure of RF for estimating the correlated variable importance measure can be briefly described as follows: Step 1. Estimate the mean vector µ X and covariance matrix C X from the original data X = {X 1 , X 2 , · · · , X n }.
Step 3. Use the regression trees T i to forecast the corresponding OOB data, where the estimation isŷ i .
Step 4. Divide the X OOB into two parts: vector X i OOB and matrix X ∼i OOB .
Step 5. Generate a new matrix X ∼i|i and vector X i|∼i on the basis of vector X i OOB and matrix X ∼i OOB . Their mean vectors and covariance matrices are different from the original µ X and C X , and the new ones should be used in the transformation process. For the multivariate normal distribution, µ ∼i|i , µ i|∼i ,C ∼i|i , and C i|∼i can be calculated as shown below.
The µ X and C X of X can be written as: The conditional mean vector and covariance matrix can be acquired via: After that, the Nataf transform can be used to generate the normal correlation samples X ∼i|i and X i|∼i .
Step 6. Construct new matrices X i OOBnew and X ∼i OOBnew based on matrix X ∼i|i , vector X i OOB , and matrix X ∼i OOB .
Step 7. Use the regression tree to predict the responseŷ i b of X i OOBnew and the responsê y ∼i b of X ∼i OOBnew , respectively. The errors of the correlated variables can be obtained, and the average values are the impact of variable X i .

Gaussian Process Regression
GPR, a very well-known machine learning (ML) method, is a nonparametric Bayesian approach [44]. It was developed by combining Bayesian and statistical theory. This method not only inherits the flexible inductive reasoning ability of Bayesian methods but also has the parallel processing, self-organizing, adaptive, and self-learning abilities of ML. Hence, it has obvious advantages in solving high-dimensional complex nonlinear regression problems with few samples. These characteristics mean that GPR is widely applied in theoretical research and many practical engineering problems [38].
Given a training set S n = {(X, y)|X ∈ R n×m , y ∈ R n } with n observations, where y is the variable to be predicted (such as monthly streamflow), X is the input vector for y with m-dimensional factors. In the GRP model, the input vector X and the target output y should obey the following equation: where g(x i ) represents the latent nonlinear function and its prior probability distribution, p(g(x i )) is a Gaussian distribution, the random residuals ε i are assumed to have iid Gaussian distributions with mean 0 and variance σ 2 n , and I n is the n-dimensional unit matrix. The stochastic process state set of the input variable X follows an n-dimensional joint Gaussian distribution. According to the definition of a Gaussian process, the state set g of the stochastic process is a Gaussian process, and its probability function, denoted GP, can be uniquely determined by its mean function E(X) and covariance function matrix K(X, X) .
According to the properties of GP, the target output y of the training input sample S n and the output of testing sample S test n = {(x test , y test )|x test ∈ R m , y test ∈ R} follow the multivariate Gaussian distribution as follows: where K(x test , X) = K(X, x test ) T is the covariance matrix of the testing input set x test and the training input variables X, and k(x test , x test ) is the covariance for x test itself.
Under the given conditions of the training input data X and output y, the posterior distribution of the predicted value y test can be inferred according to the Bayesian posterior probability mathematical formula of the new input x test : where E(y test ) is the expected value of y test and cov(y test ) is the posterior variance for y test , to measure the uncertainty of the predicted results. Based on the above statement, a GP can be determined by its mean function and the covariance function (CoF) matrix. The standard GP can be transformed into a Gaussian distribution with a mean function of 0. Hence, the key task of solving the GPR model is to determine its covariance function. The GPR requires that its covariance function is a positive definite matrix in the case of a finite input sample size. The above requirements satisfy the Mercer theorem, so the covariance function of the GPR is also called a kernel function and is used to measure the fitting degree between the measured value and the predicted value. There are a variety of choices for the CoF, among which the isotropic squared exponential covariance function (covSEiso) is the most widely used, because the covSEiso has the characteristic of being infinitely differentiable and can then ensure that the GPR is very smooth. Its mathematical expression is: where σ 2 f is the signal variance linked to the general function variance and l is the scale of the variance. In addition, σ 2 f gives the local correlation and l characterizes the correlation between the input and output. A smaller value of l means that the predicted results of the model change rapidly in the input space, indicating weak correlation.
Generally, θ= {l, σ 2 f is called the hyperparameter set for the CoF of the GPR. The most commonly used method for solving hyperparameters is the maximum likelihood function (MLF). MLF is used to estimate the unknown hyperparameters from the training data by inference. In this process of inference, the conditional probability p(y|X, θ ) of the training sample is calculated first, and then its likelihood function L(θ)= − logp(y|X, θ ) can be obtained. The mathematical expression for L(θ) is: Next, the derivative of L(θ) at θ is calculated as: Finally, the optimal θ can be obtained by minimizing the above partial derivative equation using conjugate gradient, Newton's method, and other optimization algorithms. Once the optimal θ is obtained, the expected valueŷ * = E(y test ) and the posterior variance σ = cov(y test ) of y test can be calculated using Equation (9).
According to the sigma rule for Gaussian distributions, the confidence interval of the predicted value for a given confidence level 1-α can be obtained as follows: Water 2022, 14, 1828 9 of 26

Framework and Procedure of the Proposed Model
Traditional DDMs for monthly streamflow forecasting lack physical mechanisms and usually provide deterministic results with a single streamflow value. They have limited capacity to forecast streamflow with nonlinear, highly irregular, non-stationary characteristics. Thus, they provide limited and less reliable information for activities related to water resource management. Therefore, in this study, an MVDSF framework is developed. It is realized by integrating RF and GPR with multi-scale variables (hydrometeorology, climate predictors) as inputs and is referred to as RF-GPR-MV. This framework contains four main stages: (1) data preparation, (2) selection of predictors, (3) model learning, and (4) validation and forecasting. In the first stage, contemporaneous data regarding hydrometeorology and climate variables associated with streamflow are collected. In the second stage, RF is applied to filter the dominant variables by discarding redundant and irrelated information, thereby reducing the dimension of the input vector. This can save time and decrease the risk of overfitting. In the third stage, GPR based on Bayesian theory is adopted to simulate the nonlinear relationships between the optimal input vector and the streamflow for a specific location. In the last stage, the optimized GPR is applied to predict the testing dataset. A diagram of the MVDSF framework is shown in Figure 2, and the RF-GPR-MV procedure is summarized as follows: Step 1: Make a preliminary determination of the alternative input variables and collect their historical observations.
According to the formation mechanism for streamflow, the alternative input factors can be selected and constructed from many related variables such as hydrological and meteorological variables, climate factors, and circulation indices that drive the operation of the water cycle system.
Step 2: Normalize the observations. The potential input variables have local and teleconnected relationships with runoff, and they have different physical dimensions. Therefore, all these observations must be normalized in this step to eliminate the influence of physical dimensions.
The response time of runoff to different variables is different. A PACF (partial autocorrelation function) is used to determine the optimal lag for each variable by forming predictors.
Step 4: Generate the original samples S n .
and they have different physical dimensions. Therefore, all these observations must be normalized in this step to eliminate the influence of physical dimensions.
Step 3: Determine the optimal lag di (i = 1, 2,…, m). The response time of runoff to different variables is different. A PACF (partial autocorrelation function) is used to determine the optimal lag for each variable by forming predictors.
Step 4: Generate the original samples n S . 3)Determine the optimal input vector based on VIM values  The S n = {(X 1 , y 1 ),(X 2 , y 2 ), . . . , (X n , y n )}, X ∈ R m , y ∈ R contains input predictors and the target output for different lead times, as shown in Figure 3.

Stage 2 predictors selection
contains input predictors and the target output for different lead times, as shown in Figure 3. Step 5: Apply the RF algorithm to reduce the dimensions of the input.
Step 5-1: Initialize the parameters of the RF algorithm. In this step, two key parameters of the RF will be initialized: the number of regression trees B and the maximum number of variables used to grow a regression tree mtry. Step 5: Apply the RF algorithm to reduce the dimensions of the input.
Step 5-1: Initialize the parameters of the RF algorithm. In this step, two key parameters of the RF will be initialized: the number of regression trees B and the maximum number of variables used to grow a regression tree mtry. Generally, mtry is recommended to be m/3, where m is the dimension of the alternative input vector.
Step 5-2: Calculate the VIM according to the steps in Section 2.2.
Step 5-3: Determine the optimal input vector based on the VIM values.
List the VIM in descending order. Then, the variables with higher VIM values are selected.
Step 6: Split the data obtained in Step 5 into training and testing datasets.
The normalized data obtained in Step 5 are split into two sets: the training and testing datasets. In this study, the training and testing datasets account for 75% and 25%, respectively, of the monthly data.
Step 7: Train the parameters of the GPR forecasting model. The training dataset is used to construct the GPR and learn its hyperparameters.
Step 8: Validate and predict the streamflow. The testing dataset is applied to cross-validating and forecasting the streamflow using the optimized model produced by Step 7. The output runoff values of the forecasting model should be denormalized to the range of the target output dataset. Then the RF-GPR-MV model outputs the deterministic results and the corresponding prediction interval.
Note that the differences between the five models (RF-GPR-MV, PCC-GPR-MV, PCC-GPR-QP, PCC-BP-QP, and PCC-GRNN-QP) involved in this study are in Step 1, Step 5 and/or Step 7. For example, in PCC-BP-QP, the previous runoff and precipitation are collected in Step 1, the PCC is applied to select the input variables in Step 5, and the parameters of BP are trained in Step 7.

Study Area
The Jinsha River basin, located in Southwest China (as shown in Figure 4), was chosen to demonstrate the ability of the proposed forecasting model to capture the climatehydrological relationship of the water cycle system. The Jinsha River basin covers an area of approximately 473,000 km 2 and is located approximately between 90 • 23 and 104 • 37 E and between 24 • 28 and 35 • 46 N. Most of the landscape in this region is mountainous. The Jinsha River basin basically belongs to the plateau climate zone. From north to south, it can be divided into the sub-arid climate zone of the plateau sub-cold zone, the humid climate zone of the plateau sub-cold zone, and the humid climate zone of the plateau temperate zone. The main stream of the Jinsha River has many merits, including abundant and stable runoff, a large river drop, abundant hydropower resources, and good development conditions. The region is rich in water resources and is one of the world's water resources enrichment areas. Due to these merits, it has been the largest hydropower energy base in China, and it plays an important role in the Chinese project "West-East Electricity Transmission Project". The main rainfall season is from May to September, with a flooding season from June to September. Owing to ubiquitous human activities and global warming, it is very difficult to find a basin that is not affected by natural and/or human factors such as abnormal climatic change, irrigation, water resources engineering projects, and land cover/use change [10]. Therefore, in this paper, we propose a novel data-driven model to better characterize the notoriously nonlinear and non-stationary streamflow, in order to better serve the region's water resources management.
activities and global warming, it is very difficult to find a basin that is not affected by natural and/or human factors such as abnormal climatic change, irrigation, water resources engineering projects, and land cover/use change [10]. Therefore, in this paper, we propose a novel data-driven model to better characterize the notoriously nonlinear and non-stationary streamflow, in order to better serve the region's water resources management.

Data Description and Potential Predictors
Typically, streamflow is affected by many factors, mostly associated with geographic and climatic conditions. Most previous studies focus on local meteorological factors dominated by geographic conditions [9]. Many of these have demonstrated that precipitation has a great influence on both short-and long-term streamflow. Hence, in this paper, the current and antecedent total monthly precipitation was chosen as one of the candidate impact factors (IFs). Previous precipitation can indicate the initial wetness conditions of the study area. Antecedent streamflow is also considered, as it usually represents the comprehensive states of the soil moisture and groundwater stores. Meanwhile, the initial catchment conditions also affect the generation of streamflow. For example, the states of the soil moisture and groundwater storage are relevant to the infiltration process of the hydrologic cycle, thus affecting the streamflow. Therefore, the average temperature, air pressure, and relative humidity were also selected as candidate predictors for streamflow forecasting.
In addition to the above regional meteorological variables representing the initial catchment conditions, some large-scale factors such as global climate variables [23,45]

Data Description and Potential Predictors
Typically, streamflow is affected by many factors, mostly associated with geographic and climatic conditions. Most previous studies focus on local meteorological factors dominated by geographic conditions [9]. Many of these have demonstrated that precipitation has a great influence on both short-and long-term streamflow. Hence, in this paper, the current and antecedent total monthly precipitation was chosen as one of the candidate impact factors (IFs). Previous precipitation can indicate the initial wetness conditions of the study area. Antecedent streamflow is also considered, as it usually represents the comprehensive states of the soil moisture and groundwater stores. Meanwhile, the initial catchment conditions also affect the generation of streamflow. For example, the states of the soil moisture and groundwater storage are relevant to the infiltration process of the hydrologic cycle, thus affecting the streamflow. Therefore, the average temperature, air pressure, and relative humidity were also selected as candidate predictors for streamflow forecasting.
In addition to the above regional meteorological variables representing the initial catchment conditions, some large-scale factors such as global climate variables [23,45] and atmospheric circulation are also considered, to represent the climate variability. Based on a great deal of previous research in the Jinsha River basin and in China on the local influence of large-scale atmospheric circulation, the Pacific decadal oscillation (PDO) [46], the North Atlantic oscillation (NAO) [47], the Atlantic multidecadal oscillation (AMO) [48], the sea surface temperatures (SSTs) over various Niño regions [49], the Pacific/North American pattern (PNA) [50], the Arctic oscillation (AO) [48], the quasi-biennial oscillation (QBO) [51], and the East Atlantic/West Russia (EA/WR) pattern [48] were identified as candidate predictors. Meanwhile, 17 circulation factors from 74 characteristic atmospheric oscillation factors provided by the Chinese National Climate Center (https://cmdp.ncccma.net/cn/prediction.htm, accessed on 1 January 2022) were also adopted. A summary of the candidate predictors is given in Table 1. PVII is the polar vortex intensity index; PVAI is the polar vortex area index; and PV represents the polar vortex. Each factor in the following is denoted by a bold abbreviation.
The datasets used in this study were as follows: (1) Observed streamflow The monthly streamflow of two hydrological stations, named Shigu (SG) and Panzhihua (PZH), in the Jinsha River basin, from 1961 to 2010, were collected from the Bureau of Hydrology, Changjiang Water Resources Commission.
(2) Local meteorological variables including precipitation, air pressure, temperature, and relative humidity These data, at a daily scale for 32 meteorological stations from 1961 to 2010, were downloaded from the Chinese National Meteorological Information Center (http://www. nmic.cn/, accessed on 12 April 2022). The daily meteorological data were then transformed into a monthly scale. Observations collected at one location cannot describe the meteorological conditions of the entire basin. Hence, the weight of each meteorological station was calculated using the Thiessen polygon method (TPM), and then the areal observed values for the corresponding meteorological variables in the studied area were obtained.
(3) Global climate or circulation factors Variables 5 to 21 in Table 1  Observations of meteorological, climate, and circulation factors, were collected for the same period of streamflow.

Model Input Predictor Selection
After determining the potential predictors and collecting their homochronous data, the potential input vectors can be established. The input selection technique was applied to determine suitable input vectors from the large number of candidate IFs, to reduce the amount of redundant and irrelevant information. Finally, the most relevant input vectors with the lowest dimensions were found.
We take the SG station as an example to show the rule for the establishment of potential input vectors. The monthly streamflow of the SG station in January of the following year was taken as the prediction variable. Alternative input vectors were established that included three parts: (1) Observations of streamflow with a lag of 1 to 12 months; (2) Local meteorological variables, global circulation factors, and climate factors (listed in Table 1 Nos. 1-33) with lags of up to 12 months; (3) Observations of the SASMI and EASMI from June to August in the last year.
In summary, there were 34 × 12 + 3 × 2 = 414 potential input factors, as shown in Table 2. The same potential input vector was used for the other 11 months in the following year. Then, the potential input vectors for each month in the following year for the other two hydrological stations were constructed in a similar manner.

No.
Method Model The fewer the variables in the input vector, the simpler the constructed forecasting model is for training. Therefore, the major concern is to reduce the input dimensions, thereby simplifying the forecasting model. In this study, the RF algorithm was adopted to find the input vector with the lowest dimension. The parameters of the RF were set as B = 500 and mtry = 138, according to [29].
The candidate input vectors for the various models are shown in Table 2.

Performance Evaluation
There are many measures for evaluating model performance in the published literature. The Nash-Sutcliffe coefficient of efficiency (NSE) is the most widely used. A disadvantage of the NSE is the use of squared differences, which causes larger values in a time series to be strongly overestimated and lower values to be neglected.
The present study also adopted other widely used indices: the Pearson's correlation coefficient (R), root mean square error (RMSE), mean absolute error (MAE), mean average percentage error (MAPE), mean bias error (MBE), and modified index of agreement (MIA). Among these, the RMSE is sensitive to extremely large or small values of a time series and reflects the degree of variation [52], the MAE reflects the actual forecasting error from a more balanced perspective, the MAPE is a measure of the accuracy for the forecasted streamflow series with no units, the MBE is a measure of the overall bias error or systematic error, and the MIA is a standardized measure of the degree of model prediction error [53]. Usually, the RMSE is higher than the MAE, and the degree to which the RMSE exceeds the MAE is an indicator of the extent to which large outliers (differences between observed values and the corresponding forecasts) exist in the testing set [34]. These indices are defined as follows: where Q obs,i and Q f ore,i are the i-th observed and predicted values of streamflow, respectively, Q obs and Q f ore are the average values of the observed and forecasted streamflow, respectively, and N is the total number of samples.

Results with Only Previous Q and P as Inputs
Forecasting results with different lead times using previous streamflow and precipitation for the two stations are shown in Figures 5 and 6, where the red dashed line is the threshold of 0.7 for the NSE. For example, with regard to the forecast model for SG station, the validation period NSE values for 1-to 3-month-ahead forecasting were equal to 0.87, 0.75, and 0.71 for the PCC-GPR-QP model, compared to values of 0.81, 0.56, and 0.55 for the PCC-BP-QP model and 0.76, 0.32, and 0.38 for the PCC-GRNN-QP model. Further analysis shows that the MAPE values for the 1-to 3-month-ahead forecasting associated with PCC-GPR-QP were 5%, 6%, and 5%; less than the values for the PCC-BP-QP and PCC-GRNN-QP models.
The NSE values presented in these figures provide a clear picture of the poor accuracy of the forecasting results. Specifically, for PCC-GPR-QP, the lead time 1 and lead time 2 forecasting results for these two stations have relatively high accuracy, as their NSE values exceed the threshold of 0.7. For PCC-BP-QP, the lead time 1 forecasting results for the two stations have acceptable accuracies, with NSE values beyond the threshold of 0.7. For PCC-GRNN-QP, the lead time 1 forecasting results for stations SG and PZH have relatively good accuracies, with NSE values beyond the threshold of 0.7. Therefore, the sequence of forecasting ability for a short lead time is: GPR > BP > GRNN. Clearly, the degree of deterioration becomes dramatic as the forecast period increases, especially for GRNN and BP with long lead times.
Overall, for the same inputs and lead times, the GPR always outperforms the BP and GRNN models and has a stronger anti-disturbance ability than the other two models, and therefore it is suitable as a basic forecasting module. However, the forecasting results for PCC-GPR-QP with a lead time beyond 4 still have room for improvement. time 2 forecasting results for these two stations have relatively high accuracy, as their NSE values exceed the threshold of 0.7. For PCC-BP-QP, the lead time 1 forecasting results for the two stations have acceptable accuracies, with NSE values beyond the threshold of 0.7. For PCC-GRNN-QP, the lead time 1 forecasting results for stations SG and PZH have relatively good accuracies, with NSE values beyond the threshold of 0.7. Therefore, the sequence of forecasting ability for a short lead time is: GPR > BP > GRNN. Clearly, the degree of deterioration becomes dramatic as the forecast period increases, especially for GRNN and BP with long lead times.
Overall, for the same inputs and lead times, the GPR always outperforms the BP and GRNN models and has a stronger anti-disturbance ability than the other two models, and therefore it is suitable as a basic forecasting module. However, the forecasting results for PCC-GPR-QP with a lead time beyond 4 still have room for improvement. time 2 forecasting results for these two stations have relatively high accuracy, as their NSE values exceed the threshold of 0.7. For PCC-BP-QP, the lead time 1 forecasting results for the two stations have acceptable accuracies, with NSE values beyond the threshold of 0.7. For PCC-GRNN-QP, the lead time 1 forecasting results for stations SG and PZH have relatively good accuracies, with NSE values beyond the threshold of 0.7. Therefore, the sequence of forecasting ability for a short lead time is: GPR > BP > GRNN. Clearly, the degree of deterioration becomes dramatic as the forecast period increases, especially for GRNN and BP with long lead times. Overall, for the same inputs and lead times, the GPR always outperforms the BP and GRNN models and has a stronger anti-disturbance ability than the other two models, and therefore it is suitable as a basic forecasting module. However, the forecasting results for PCC-GPR-QP with a lead time beyond 4 still have room for improvement.

Results with Local Hydrometeorological and Global Factors as Inputs
The validation results for the proposed model, RF-GPR-MV, and P listed in Tables 3-6, where the NSE values beyond the threshold of grey. Meanwhile, Figures 7-8 show the forecasting results with differe ually, using both hydrometeorological and global climate factors as i

Results with Local Hydrometeorological and Global Factors as Inputs
The validation results for the proposed model, RF-GPR-MV, and PCC-GPR-MV are listed in Tables 3-6

Results with Local Hydrometeorological and Global Factors as Inputs
The validation results for the proposed model, RF-GPR-MV, and PCC-GPR-MV are listed in Tables 3-6     Compared with the performance of PCC-GPR-QP, that of PCC-GPR-MV showed no obvious improvement, while RF-GPR-MV showed significant enhancement. This indicates that the PCC may not be suitable for complex nonlinear relationship extraction, especially for teleconnection. Specifically, the comparison of PCC-GPR-MV and RF-GPR-MV indicates that RF-GPR-MV has lower MAPE, RMSE, and MBE, as well as higher R, NSE, and MIA values than PCC-GPR-MV for almost all lead times. In summary, RF-GPR-MV integrates the dimensionality reduction advantage of RF and the strong learning and reasoning ability of GPR to fully extract the effective information for learning and then further improve the prediction accuracy. Figures 9 and 10 present the hydrograph produced by the RF-GPR-MV model in the training and testing periods. The reason for choosing the results for lead times 1 and 6 is that the observed streamflow was relatively stable when the lead time was 1, and the observed runoff fluctuated greatly when the lead time was 6, as this was close to the flood season. It shows that the prediction intervals (PIs) at the 50% confidence level generated by RF-GPR-MV can always capture the variations in streamflow during the testing phase. For both stations, the prediction interval for 6-month-ahead forecasting was significantly wider than that for 1-month-ahead forecasting, indicating a relatively high uncertainty. Although the uncertainties of forecasting with long lead times are large, they still fall within the given prediction intervals (PIs) at the 50% confidence level, except in some extremely high flow conditions.

Evaluation of the Contributions of Different Factors
To quantitatively evaluate the improvement achieved by the proposed RF-GPR-MV model over the RF-GPR-QP, PCC-GPR-MV, PCC-GPR-QP, PCC-BP-QP, and PCC-GRNN-QP models, the improved percentages of the MAPE index were adopted to evaluate the impacts of the different techniques on the improvement of the hybrid model. Overall, the proposed RF-GPR-MV model using both local hydrometeorological and global climate and circulation factors not only has higher accuracy than the commonly used monthly streamflow forecasting models but can also provide both deterministic results and uncertainty bounds.

Evaluation of the Contributions of Different Factors
To quantitatively evaluate the improvement achieved by the proposed RF-GPR-MV model over the RF-GPR-QP, PCC-GPR-MV, PCC-GPR-QP, PCC-BP-QP, and PCC-GRNN-QP models, the improved percentages of the MAPE index were adopted to evaluate the impacts of the different techniques on the improvement of the hybrid model. P MAPE is defined by:  Figure 11 quantitatively displays the contributions made by the RF method, the GPR model, and/or the multiple variables (MV) as inputs to the total improvement of the RF-GPR-MV model. Clearly, all these had positive effects on the total improvement of the proposed model, though RF (average P MAPE = 27%) contributed more than GPR (average P MAPE = 16%), which emphasizes the critical role of the input selection approach in monthly streamflow forecasting. Additionally, compared with using only previous streamflow and rainfall, using multiple variables (MV) as inputs improved the model performance by about 5%. It can be seen from Figure 12a that the forecasting results for RF-GPR-MV with lead time 1 achieved the smallest forecasting error when taking Q(t-12), Q(t-11), F15(t-5), and Niño4(t-11) as inputs. In this case, the contribution of Q(t-11) was 2.6%; when adding F15(t-5) and Niño4(t-11) as inputs, the forecasting performance was improved by 2.69% over Q(t-12) only. This shows that the contribution of the streamflow itself was greater than the contribution of large-scale climate factors for forecasting with short lead times (dry season). The results for RF-GPR-MV with lead time 6 ( Figure 12b) obtained the best performance when using all the top ten factors, including the local hydrometeorological indices P(t-2), Q(t-4), and P(t-8), the global climate factors AO(t-9), NAO(t-11), PDO(t-1), and NAO(t-2), and the circulation factors F10(t-2), F11(t-2), and SOI(t-10). Among these, the highest contribution to the monthly streamflow forecasting came from F10(t-2) (16%), followed by PDO(t-1) (15%), indicating the importance of large-scale factors in the flood season. It can be seen from Figure 12a that the forecasting results for RF-GPR-MV with lead time 1 achieved the smallest forecasting error when taking Q(t-12), Q(t-11), F15(t-5), and Niño4(t-11) as inputs. In this case, the contribution of Q(t-11) was 2.6%; when adding F15(t-5) and Niño4(t-11) as inputs, the forecasting performance was improved by 2.69% over Q(t-12) only. This shows that the contribution of the streamflow itself was greater than the contribution of large-scale climate factors for forecasting with short lead times (dry season). The results for RF-GPR-MV with lead time 6 ( Figure 12b) obtained the best performance when using all the top ten factors, including the local hydrometeorological indices P(t-2), Q(t-4), and P(t-8), the global climate factors AO(t-9), NAO(t-11), PDO(t-1), and NAO(t-2), and the circulation factors F10(t-2), F11(t-2), and SOI(t-10). Among these, the highest contribution to the monthly streamflow forecasting came from F10(t-2) (16%), followed by PDO(t-1) (15%), indicating the importance of large-scale factors in the flood season. Figure 13a shows that the forecasting results for RF-GPR-MV with lead time 1 at PZH station achieved the lowest MAPE value when using the local hydrometeorological factors Q(t-12), Q(t-11), Q(t-9), Ap(t-9), and Q(t-10). Figure 12b shows that the forecasting results with lead time 6 showed the best performance when taking Ap(t-12), AECP(t-5), AMO(t-10), Q(t-4), Niño3(t-9), and Rh(t-9) as inputs. In this case, the best input vector contained many types of variables varying from local hydrometeorological factors (Ap, Q, Rh) to large-scale climate factors (AECP, AMO, Niño3). Additionally, the forecasting performance using Q(t-4) was better by about 5% than when not using Q(t-4), while the forecasting performance with Niño3(t-9) was better by about 14% than without Niño3(t-9). This result confirms that large-scale climate factors have a great influence on local streamflow with long lead times (flood season).
These figures (Figures 11-13) emphasize that the addition of global climate and circulation factors can improve the forecasting performance for streamflow, especially for forecasting with long lead times. They also indicate that when the number of prediction factors continues to increase, the model's prediction error will also increase. This phenomenon reveals that increasing the number of input factors is not always beneficial to the forecasting results. The best results are obtained with just the appropriate number of input factors. Further, to quantify the contribution of large-scale factors to the model performance for different lead times, RF-GPR-MV was used, and the VIM values in descending order with the top 10 input factors provided by RF and the MAPE values for their corresponding forecasting models were compared with each other. For example, Figures 12  and 13 present the monthly streamflow forecasting results for RF-GPR-MV with lead time 1 (January of the following year) and lead time 6 (June of the following year) using different inputs at SG station and PZH station, respectively.
It can be seen from Figure 12a that the forecasting results for RF-GPR-MV with lead time 1 achieved the smallest forecasting error when taking Q(t-12), Q(t-11), F15(t-5), and Niño4(t-11) as inputs. In this case, the contribution of Q(t-11) was 2.6%; when adding F15(t-5) and Niño4(t-11) as inputs, the forecasting performance was improved by 2.69% over Q(t-12) only. This shows that the contribution of the streamflow itself was greater than the contribution of large-scale climate factors for forecasting with short lead times (dry season). The results for RF-GPR-MV with lead time 6 (Figure 12b) obtained the best performance when using all the top ten factors, including the local hydrometeorological indices P(t-2), Q(t-4), and P(t-8), the global climate factors AO(t-9), NAO(t-11), PDO(t-1), and NAO(t-2), and the circulation factors F10(t-2), F11(t-2), and SOI(t-10). Among these, the highest contribution to the monthly streamflow forecasting came from F10(t-2) (16%), followed by PDO(t-1) (15%), indicating the importance of large-scale factors in the flood season.  Figure 13a shows that the forecasting results for RF-GPR-MV with lead time 1 at PZH station achieved the lowest MAPE value when using the local hydrometeorological factors Q(t-12), Q(t-11), Q(t-9), Ap(t-9), and Q(t-10). Figure 12b shows that the forecasting results with lead time 6 showed the best performance when taking Ap(t-12), AECP(t-5), AMO(t-10), Q(t-4), Niño3(t-9), and Rh(t-9) as inputs. In this case, the best input vector AMO(t-10), Q(t-4), Niño3(t-9), and Rh(t-9) as inputs. In this case, the best input vector contained many types of variables varying from local hydrometeorological factors (Ap, Q, Rh) to large-scale climate factors (AECP, AMO, Niño3). Additionally, the forecasting performance using Q(t-4) was better by about 5% than when not using Q(t-4), while the forecasting performance with Niño3(t-9) was better by about 14% than without Ni-ño3(t-9). This result confirms that large-scale climate factors have a great influence on local streamflow with long lead times (flood season).  These figures (Figures 11-13) emphasize that the addition of global climate and circulation factors can improve the forecasting performance for streamflow, especially for forecasting with long lead times. They also indicate that when the number of prediction factors continues to increase, the model's prediction error will also increase. This phenomenon reveals that increasing the number of input factors is not always beneficial to the forecasting results. The best results are obtained with just the appropriate number of input factors.
In a previous study, Peng et al. [54] employed pure BP, PSO-ANN, and MVO-ANN models for the Jinsha basin at PZH station and achieved a model efficiency (in terms of MAPE) of 10%, which is much higher than that of the RF-GPR-MV model developed in this study. In the study by Yin et al. [55], a SWAT model using meteorological data and the ISI-MIP climate dataset was employed for the Jinsha basin for monthly streamflow forecasting, achieving a model efficiency (in terms of the NSE) of 0.82, which is lower than that of the RF-GPR-MV developed in this study. Moreover, in the study by Luo et al. [56], autocorrelation function (ACF), partial autocorrelation function (PACF), grey correlation analysis (GCA), support vector regression (SVR), and generalized regression neural network (GRNN) models were integrated to forecast monthly streamflow in the Jinsha basin. In this study, the NSE values for GCA-SVR, GCA-GRNN, PCC/ACF-SVR, and PACF-SVR were 0.86, 0.82, 0.83, and 0.82, respectively. The NSE value for the best model (GCA-SVR) was 0.86, which is still lower than that of the RF-GPR-MV model developed in this study.
In a previous study, Peng et al. [54] employed pure BP, PSO-ANN, and MVO-ANN models for the Jinsha basin at PZH station and achieved a model efficiency (in terms of MAPE) of 10%, which is much higher than that of the RF-GPR-MV model developed in this study. In the study by Yin et al. [55], a SWAT model using meteorological data and the ISI-MIP climate dataset was employed for the Jinsha basin for monthly streamflow forecasting, achieving a model efficiency (in terms of the NSE) of 0.82, which is lower than that of the RF-GPR-MV developed in this study. Moreover, in the study by Luo et al. [56], autocorrelation function (ACF), partial autocorrelation function (PACF), grey correlation analysis (GCA), support vector regression (SVR), and generalized regression neural network (GRNN) models were integrated to forecast monthly streamflow in the Jinsha basin. In this study, the NSE values for GCA-SVR, GCA-GRNN, PCC/ACF-SVR, and PACF-SVR were 0.86, 0.82, 0.83, and 0.82, respectively. The NSE value for the best model (GCA-SVR) was 0.86, which is still lower than that of the RF-GPR-MV model developed in this study.
In most of the studies employing ML models [2,13,28,54], only previous streamflow and/or precipitation were taken as inputs. The results in this study revealed the positive effect of climate factors on streamflow forecasting accuracy. This conclusion is consistent with those of other studies [22][23][24]45,56]. In previous studies based on ML models, inputs were determined by a trial-and-error method or the PACF method. Input selection is a vital task in the process of data-driven model development. The PACF method is a method of mining linear relationships. However, the correlations between streamflow and its potential forecasting factors are not always linear. The RF in this study used to determine the appropriate input combination reduced the workload and could deal with nonlinear relationships. Additionally, the RF-GPR-MV model developed in this study could provide both deterministic results and prediction intervals, which is beneficial for real water resources management.

Conclusions
In recent years, due to the impact of climate change, the non-stationary and nonlinear properties of monthly streamflow have been enhanced, which makes it difficult to improve forecasting accuracy. To improve the accuracy of monthly streamflow forecasting and provide better information for water resources management, a novel MVDSF framework was proposed and realized using RF-GPR-MV. It was used to develop a 1-to 12-month-ahead streamflow forecast model for two sites within the Jinsha River basin. Besides the commonly used local hydrometeorological factors (e.g., previous streamflow and antecedent precipitation), multiple large-scale climate and circulation factors were considered, in order to represent climate change. Therefore, about 400 candidate input factors were constructed. An RF was applied to determine the optimal input from the high-dimensional candidate inputs. Additionally, a GPR was employed to learn the nonlinear relationships between streamflow and its multi-scale predictors and to describe the forecast uncertainty.
The conclusions of the present study are as follows: (1) The PCC-GPR-QP model yielded better performance compared with the PCC-BP-QP and PCC-GRNN-QP models, which reveals the capability of GPR for dealing with highly nonlinear streamflow. The average forecasting accuracy was improved by 10~20% when using GPR as a forecast module. (2) The addition of large-scale climate factors significantly improved the long-lead-time forecasting, that is, in the flood season, with an average contribution rate of about 15%. (3) The RF applied to input selection was beneficial for improving the accuracy of forecasting. Compared with the model using the PCC method, the average forecasting accuracy was improved by 25~30%. The accurate and abundant forecast information provided by RF-GPR-MV could help water resources management departments to make scientific decisions to avoid risks. Therefore, this model could be valuable for popularizing this application. Meanwhile, it should be noted that the results of RF-GPR-MV are not ideal under some extremely high flow conditions, and this requires further attention in future studies. Furthermore, the main purpose of this paper was to show that by adding more climate factors and using a reasonable input selection method and a suitable ML model with appropriate parameters, the performance of the monthly streamflow forecasting model could be improved. In future research, other data processing techniques (wavelet transforms, empirical mode decomposition, variational mode decomposition, and so on) and evolutionary algorithms used for optimization of the hyperparameters of the GPR could be considered.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data supporting the findings of this paper are available from the corresponding author upon reasonable request.