Selection of Temporal Lags for Predicting Riverﬂow Series from Hydroelectric Plants Using Variable Selection Methods

: The forecasting of monthly seasonal streamﬂow time series is an important issue for countries where hydroelectric plants contribute signiﬁcantly to electric power generation. The main step in the planning of the electric sector’s operation is to predict such series to anticipate behaviors and issues. In general, several proposals of the literature focus just on the determination of the best forecasting models. However, the correct selection of input variables is an essential step for the forecasting accuracy, which in a univariate model is given by the lags of the time series to forecast. This task can be solved by variable selection methods since the performance of the predictors is directly related to this stage. In the present study, we investigate the performances of linear and non-linear ﬁlters, wrappers, and bio-inspired metaheuristics, totaling ten approaches. The addressed predictors are the extreme learning machine neural networks, representing the non-linear approaches, and the autoregressive linear models, from the Box and Jenkins methodology. The computational results regarding ﬁve series from hydroelectric plants indicate that the wrapper methodology is adequate for the non-linear method, and the linear approaches are better adjusted using ﬁlters.


Introduction
An essential task for countries where power generation is done using hydroelectric plants is the monthly seasonal streamflow series forecasting. This step is directly related to energetic planning, water availability, and pricing strategies. The last Hydropower Status Report from the International Hydropower Association reported that 4306 TWh of electricity was generated in the world employing hydroelectric plants in 2019, corresponding to a 2.5% annual increase [1]. This amount represents the single most significant contribution from a renewable energy source in history. The report addresses data from 13,000 stations in 150 countries. The leaders in hydropower installed capacity are China (356.40 GW), Brazil (109.06 GW), United States (102.75 GW), and Canada (81.39 GW). These figures also provide an idea about the productive chain regarding such power source. It is clear that the correct water management can bring substantial benefits in avoiding water or money waste [2].
Many researchers have addressed streamflow series tasks for different countries such as China, Canada, Ecuador, Iraq, Mozambique, United States, Serbia, Norway, Turkey, Sri-Lanka, and Brazil [2][3][4][5][6][7][8][9][10]. It highlights the importance of the problem to the global economy. These series present a particular seasonal component due to the periods of rainfall along the year, being non-stationary series [11,12]. However, most of such studies focus on determining just the best predictor, disregarding some other essential steps of the whole process.
Identifying a system is a task influenced by factors, such as prior knowledge of its characteristics, complexity, presence of noise, and performance metrics to be used [13,14]. The inputs must represent the dynamics of the system, which helps in choosing a forecasting model that is appropriate to the problem and is a fundamental step to obtain an efficient model for time series forecasting [15]. The structure of models also defined by the number of entries. For example, in the case of neural networks, the number of inputs impacts the determination of their structure, since the more entries there are, the more complex the neural network will be, as well as the more costly its training, without a guarantee of a performance improvement [16]. Additionally, the number of inputs influences the surface of the cost function, which tends to have more local minima [17,18].
Among potential benefits of inputs selection, we can mention the facilitation of visualization, data understanding, order reduction, memory requirements, reduction in training time, and computational effort [17].
Variable selection (VS) methods attempt to identify a subset of inputs that assist in forecasting, pattern recognition, and data regression, playing a significant role in the accuracy of the forecasting methods. Additionally, such approaches tend to simplify the final model, as well as to improve the stability of responses, and eliminate redundant inputs [19]. Harrell [20] stated that VS could be subjective because they use a conceptual understanding of the dependent variable to select independent variables. Many times, a small number of inputs is recommended for prediction purposes [21]. Guyon and Elisseeff [17] classified the selection methods as filters, wrappers, and embedded. We can also mention a new category, the bio-inspired metaheuristics for optimization.
Many research fields have addressed the importance of variable selection to improve the accuracy of models in different contexts, as presented in the works below: • Li et al. [22]-to increase the estimation of growing stem volume of pine using optical images; • Bonah et al. [23]-to quantitative tracking of foodborne pathogens; • Xiong et al. [24]-to increase near-infrared spectroscopy quality; • Speiser et al. [25]-an extensive investigation with 311 datasets to compare several random forest VS methods for classification; • Rendall et al. [26]-extensive comparison of large scale data driven prediction methods based on VS and machine learning; • Marcjasz et al. [27]-to electricity price forecasting; • Santi et al. [28]-to predict mathematics scores of students; • Karim et al. [29]-to predict post-operative outcomes of cardiac surgery patients; • Kim and Kang [30]-to faulty wafer detection in semiconductor manufacturing; • Furmańczyk and Rejchel [31]-to high-dimensional binary classification problems; • Fouad and Loáiciga [5]-to predict percentile flows using inflow duration curve and regression models; • Ata Tutkun and Kayhan Atilgan [32]-investigated VS models in Cox regression, a multivariate model; • Mehmood et al. [33]-compared several VS approaches in partial least-squares regression tasks; • McGee and Yaffee [34]-provided a study on short multivariate time series and many variations of Least Absolute Shrinkage and Selection Operator (LASSO) for VS; • Seo [35]-discussed the VS problem together with outlier detection, due to each input affecting the regression task; • Dong et al. [36]-to wind power generation prediction; • Sigauke et al. [37]-presented a probabilistic hourly load forecasting framework based on additive quantile regression models; • Wang et al. [38]-to short-term wind speed forecasting; • Taormina and Chau [39]-to rainfall-runoff modeling; • Taormina et al. [40]-to river flow forecasting; • Cui and Jiang [41]-to chaotic time series prediction; • Silva et al. [42]-to predict the price of sugarcane derivatives; • Siqueira et al. [12,[43][44][45]-applied the partial autocorrelation function linear filter to streamflow series forecasting; • Siqueira et al. [2]-use of VS methods, such as wrappers and filters to predict streamflow series; and • Kachba et al. [46]-application of wrapper and non-linear filters to estimate the impact of air pollution on human health.
Despite these methodologies not being universal or equally useful in various fields, the presentation of the studies above depicts the importance of variable selection for data processing in many different contexts, methods, and tasks. In streamflow series forecasting from hydroelectric plants, this is even more relevant due to the high magnitude of the energy generated. An increase of a single percentage point in the accuracy of such predictions represents an enormous amount of electric power. However, most of the literature focuses on the definition of the best forecasting model, neglecting a further investigation on the impact in adopting distinct VS approaches.
To fill this gap, we analyzed the use of ten VS methods in the autoregressive (AR) model from the Box and Jenkins methodology and the extreme learning machines neural network. The addressed VS approaches are linear filters (two manners of using the partial autocorrelation function, PACF), non-linear filters (three mutual information-based methods), wrappers (considering three evaluating metrics), and bio-inspired metaheuristics (genetic algorithm and particle swarm optimization).
The remainder of this study is organized as follows: Section 2 presents the main content of the variable selection procedure, as well as the main stages of the seasonal streamflow series forecasting; Section 3 discusses the filters; Section 4 the wrapper; Section 5 the bio-inspired metaheuristics-genetic algorithm and particle swarm optimization (PSO); Section 6 the case study, computational results, and the performance analysis; and Section 7 presents the conclusions.

Variable Selection
The variable selection methodologies can use information available a priori, through empirical tests of trial and error, or some information criterion. Puma-Villanueva et al. [47] describe a simple example of how the general process works. Consider set V that represents the space of input variables, here limited to 3. Thus, we define the vector of inputs V = [v 1 ,v 2 ,v 3 ], with which it is possible to form (2 3 -1 = 7) subsets of inputs, as depicted in Table 1. Table 1. Possible subsets for the input vector V.

Subsets
Selected Inputs The selection methods' role is to define which of these subsets is the most appropriate to represent the information in the data, possibly in contrast to the adoption of all inputs. In this case, selecting variables is choosing the subset that allows the best forecast of future values of a time series, that is, selecting the vectorV ∈ k among the possible combinations between the variables ofV ∈ l , such that k ≤ l. This set represents the dependence structure of a stochastic process over time. In Table 1, the goal of the VS methods is to choose one of seven possibilities.
Yu and Liu [48] present some criteria that relate to this procedure: • Relevance: the concept associated with the importance of a given variable may have to the problem, since the information it contains will be the basis of the selection process. The relevance is strong or weak depending on how much its removal degrades the performance of the predictor; • Redundancy: two or more variables are redundant if their observed values are highly correlated or dependent. The level of this correlation reveals the degree of redundancy; and • Optimality: a so-called optimal subset of input variables is when there is no other subset that produces better results.
However, these characteristics, if combined, have no direct implication. For example, a relevant variable does not mean that the optimal subset contains it. Likewise, the inputs that belong to the optimal subset are not necessarily appropriate [47]. Guyon and Elisseeff [17] classify the VS models into embedded, wrappers, and filters, each of them having its own advantages. The particularities for a given problem indicate which method is most appropriate.
It is essential to point out a difference between the variables and feature selection. The understanding of feature is linked to the idea of a set of inputs that is formed from a combination of the original variables or the extraction of some essential characteristics. An example is principal component analysis (PCA), which linearly combines the inputs [49]. Thus, there is a set of new variables, which are in a new space. It differs from as, in this case, the subset is formed by those variables of the original entries that do not undergo any type of transformation.

Variable Selection in Streamflow Series Forecasting
The predictions regarding the monthly seasonal streamflow series follow the stages shown in Figure 1 (adapted from [2]): variables, which are in a new space. It differs from as, in this case, the subset is formed by those variables of the original entries that do not undergo any type of transformation.

Variable Selection in Streamflow Series Forecasting
The predictions regarding the monthly seasonal streamflow series follow the stages shown in Figure 1 (adapted from [2]): The first stage is data acquisition. Sensors are spread to measure the volume of water that passes a transversal section of the rivers. Such water is that which moves the turbines of hydroelectric plants. The first stage is data acquisition. Sensors are spread to measure the volume of water that passes a transversal section of the rivers. Such water is that which moves the turbines of hydroelectric plants. Due to the nature of the rivers, containing many recesses along their way, the measurement's uncertainty incidence is inevitable.
The second stage is the pre-processing, which can be summarized in two steps: (a) the application of the deseasonalization procedure to remove the inherent seasonal component present in this kind of series. The transformed data are stationary series, with zero mean and variance equal to one. For linear models of the Box and Jenkins methodology, this step is mandatory, but some investigations have shown that the performance of non-linear methods also increases with this procedure; (b) the input selection step, to determine the lags that lead to the best performances of the predictors. The input selection step is the focus of this investigation.
The next stage is the definition of the forecasting model. Undoubtedly, this is the most usual theme addressed in streamflow series forecasting. Clearly, the definition of the adequate predictor is crucial regarding the accuracy of the estimation of future samples.
The last stage is post-processing, which involves three procedures. After making predictions, the output responses of the predictors are in the deseasonalized domain. Therefore, we reverse the deseasonalization to allow a performance evaluation in the real domain. Then, a statistical test is applied to verify if the results found by the various predictors are distinct, even if they present different numeric values. Finally, the last step is performance analysis.
Despite the specialized literature on this series is related to the forecasting model, it is necessary to elaborate a specific investigation focused on the input selection process due to the impact it presents in the performance of such models. Determining the most suitable set of lags may lead to distinct conclusions. Thus, this theme must be discussed.

Filters
The filter selection method is based only on the available data and does not depend on the predictor model. The variables are chosen through linear or non-linear correlation measures between the observations. The main advantage of this method is its generality, as it is not necessary to synthesize the predictor, which tends to make it computationally efficient [17].
However, as this is a previous step, the optimal set of inputs may not be selected, since there is no interaction with the forecasting model. Metrics based on dependency between samples can be useful, but insufficient to ensure that the chosen set is the best possible. Therefore, we recommend using it for problems with a large amount of available data because if the criterion of optimality is not met, the computational cost should be worthwhile. Figure 2 shows the scheme of the filter type method. Some of the inputs contained in the vector u t will belong to vector u t of smaller or equal dimension. The predictions are performed with u t .
Energies 2020, 13, 4236 6 of 35 no interaction with the forecasting model. Metrics based on dependency between samples can be useful, but insufficient to ensure that the chosen set is the best possible. Therefore, we recommend using it for problems with a large amount of available data because if the criterion of optimality is not met, the computational cost should be worthwhile. Figure 2 shows the scheme of the filter type method. Some of the inputs contained in the vector t u will belong to vector t ′ u of smaller or equal dimension. The predictions are performed with t ′ u .

Partial Autocorrelation Function
The partial autocorrelation function (PACF) is a widely used filter to identify the order of linear models [11]. The definition of the partial correlation coefficient is directly related to autoregressive models (AR) and the Yule-Walker equations.
The partial autocorrelation coefficient of order k is the last coefficient of an AR(k) model, adjusted for a time series x t and denoted by ϕ kk . This means that an AR process of order p is different from zero to k less than or equal to p, and zero for k > p. Based on this assumption and using the Yule-Walker equations, the relationship between the autocorrelation estimates of a time series in these terms obeys the set of equations described in (1): or, in a matrix form, we have (2) and (3): in which ρ are the coefficients of the autocorrelation [50]. Thus, the AR(p) model of order p = 1, 2, ..., k must be adjusted to find ϕ kk . Expression (4) shows the coefficients of the first two AR models: being ϕ 11 = ρ 1 the first partial autocorrelation coefficient. Similarly, we have (5): Isolating ϕ 21 and equating the equations, we have ϕ 22 = ρ 2 − ρ 2 1 / 1 − ρ 2 1 , the value admitted as the second partial autocorrelation coefficient.
It is noteworthy that the autocorrelation coefficients ρ p are problem-dependent. Then, the PACF of a series can be estimated through successive adjustments of the autoregressive models, determining Energies 2020, 13, 4236 7 of 35 the most appropriate orders of an AR model. In practice, the AR(1) is adjusted, from where we estimate the coefficient ϕ 11 . Following, we adjust the AR(2), and we have ϕ 21 and ϕ 22 , the latter being of interest. We continue in these systematic steps until the required k order is adjusted, from where the coefficients come out the desired ϕ kk .
For a time series, the highest order is sought such that all estimates ϕ kk for k > p are not significant. The order of the model is the value corresponding to the selected entry, that is, if the coefficients are selected ϕ 11 and ϕ 55 , lags 1 and 5 are part of the subset of inputs.
Quenouille [51] showed that for a AR(p) process, the coefficients ϕ kk estimated for orders greater than p have a Gaussian distribution with a mean equal to zero, variance equal to VAR[ϕ kk ] 1/N, being N the number of samples. Thus, the confidence threshold for the coefficients based on the standard deviation is ϕ kk 2/ √ N, considering that the estimate is different from zero in this interval. However, the method can select non-consecutive delays as model inputs. For example, , which means that ϕ 11 and ϕ 44 were significant, while ϕ 22 and ϕ 33 are not. For hydrologic time series, Stedinger [52] states that it makes no sense that a given sample is related to non-consecutive delays and that delays in a non-consecutive hydrological system selected by the PACF have no physical meaning, proposing the suppression of these entries. According to this work, some historical series have an autocorrelation structure relative to both the time between observations and the observed period.
Taking as an example an AR(6) model, if PACF defines that only the inputs weighted by the coefficients ϕ 11 and ϕ 44 are significant, the latter is considered as spurious data and must be discarded. This means that intermediate values should not be considered. Siqueira et al. [2] used bootstrapping techniques to evaluate the best order of periodic autoregressive models, and reached the same conclusion as Stedinger, with similar orders for streamflow series. Figure 3 is related to the calculation of the PACF of the monthly streamflow series from Furnas hydroelectric plant, located in Brazil. The data used are from January. The horizontal line is the confidence threshold calculated as a function of the standard deviation. Note that, with 12 delays, the method selected lags 1, 5, and 7. If Stedinger's proposal is taken into account, only delay 1 is chosen.  [52] states that it makes no sense that a given sample is related to non-consecutive delays and that delays in a non-consecutive hydrological system selected by the PACF have no physical meaning, proposing the suppression of these entries. According to this work, some historical series have an autocorrelation structure relative to both the time between observations and the observed period.
Taking as an example an AR(6) model, if PACF defines that only the inputs weighted by the coefficients 11 φ and 44 φ are significant, the latter is considered as spurious data and must be discarded. This means that intermediate values should not be considered. Siqueira et al. [2] used bootstrapping techniques to evaluate the best order of periodic autoregressive models, and reached the same conclusion as Stedinger, with similar orders for streamflow series. Figure 3 is related to the calculation of the PACF of the monthly streamflow series from Furnas hydroelectric plant, located in Brazil. The data used are from January. The horizontal line is the confidence threshold calculated as a function of the standard deviation. Note that, with 12 delays, the method selected lags 1, 5, and 7. If Stedinger's proposal is taken into account, only delay 1 is chosen. The technique is implemented for the selection of inputs in linear simulation models by the Brazilian National Electric System Operator [12].

Mutual Information
The dependency between two variables is an essential step for selecting model inputs. In this context, a reliable criterion belonging to the scope of information theory can be used, i.e. the mutual information (MI) [53]. The MI is a metric that provides a measure of the degree of dependence  The technique is implemented for the selection of inputs in linear simulation models by the Brazilian National Electric System Operator [12].

Mutual Information
The dependency between two variables is an essential step for selecting model inputs. In this context, a reliable criterion belonging to the scope of information theory can be used, i.e. The mutual information (MI) [53]. The MI is a metric that provides a measure of the degree of dependence between variables; it reflects the amount of information that links them. This filter can be applied as a criterion to select the inputs of forecasting models.
The definition of MI between two random variables can be interpreted as a measure of proximity between the joint probability distribution of the variables x and y, and the product of their marginal distributions. Mathematically, we have (6): in which f xy (x, y) is the joint probability density function (PDF), and f x (x) and f y (y) are the respective marginal density functions. The MI criterion presents zero as a result for independent variables, and greater than zero otherwise. If a representative sample of the data is available, one can estimate (6) using (7) [54]: where (x i , y i ) is the i-th pair of data from the sample with size N, being i = 1, 2,..., N.
The difficulty in this case is to estimate the probabilities since the distributions are often unknown in practice. Additionally, these estimates may require a large amount of data, which is not always available.
There are several ways to estimate PDFs in the literature. In this work, we use a non-parametric approach based on kernel functions, the absolute distance, or city-block type, which have already been applied in streamflow series forecasting [54]. The choice for this proposal is justified in terms of computational simplicity and absence of data distribution type assumption.
Consider the input and output dataset [X k , y k ], being k = 1, 2,..., N. The approximation of the probability densities of a one-dimensional x variable via non-parametric kernel estimators is given by (8):f in which K λ (t) is the kernel function, and λ the bandwidth or dispersion parameter. Therefore, the marginal approximate probability density function of x is given by (9): with p being the dimension of x.
Equation (9) arises from (8) as a case adapted to multidimensional x, and using the city-block function. The parameter λ is calculated by (10) [55]: Finally, the joint probability of (x − y), the latter being a one-dimensional output, is as in (11) [56]: Energies 2020, 13, 4236 9 of 35 or as in (12)f Therefore, s i is calculated by (13): An example that shows the approximation capability of this proposal is its use to build a bi-variable Gaussian distribution function. This function is defined by (14): where: with x and y being the variables used, µ x and µ y their respective averages, σ x and σ y the standard deviations, and ρ the correlation coefficient between them.
To exemplify the approximation capability using the city-block function, we generate 2000 samples of x and y with normal distribution and zero mean. With Equation (14), it is possible to plot f xy , represented graphically in Figure 4a together with its diagram in contour lines ( Figure 4b). In parallel, we present the approximations using the city-block kernel function of (13) in Figure 4c,d. The proximity of the curves is clear, which illustrates how this function approximates the distributions. The correlation coefficient between them is 0.9932, although the circles are not perfectly concentric.  The example in Figure 5 refers to the calculation of the MI coefficients related to Furnas hydroelectric plant streamflow series. As in Figure 3, the samples are from January. In this case, we adopted p = 100 sequences and α = 5%, and the respective MI values calculated. For 12 lags initially considered as possible explanatory variables, the selected ones are lags 1, 8, and 9.  The final step in applying the MI filter is to define a confidence threshold that determines if an input belongs to the selected subset of explanatory variables. One possibility is to establish a minimum value for the MI and reject entries with a lower MI score. Another option is to use a bootstrapping or resampling technique to test the hypothesis of independence between x and y. For this, several sequences other than x in relation to y are built, in which the independent variable is reordered, and a vector of MIs is obtained. If this value surpasses the threshold at a given level of significance α, x and y are considered dependent. Thus, an input variable x is identified [57]. This work adopts the latter approach.
The example in Figure 5 refers to the calculation of the MI coefficients related to Furnas hydroelectric plant streamflow series. As in Figure 3, the samples are from January. In this case, we adopted p = 100 sequences and α = 5%, and the respective MI values calculated. For 12 lags initially considered as possible explanatory variables, the selected ones are lags 1, 8, and 9. The example in Figure 5 refers to the calculation of the MI coefficients related to Furnas hydroelectric plant streamflow series. As in Figure 3, the samples are from January. In this case, we adopted p = 100 sequences and α = 5%, and the respective MI values calculated. For 12 lags initially considered as possible explanatory variables, the selected ones are lags 1, 8, and 9.

Partial Mutual Information
The MI criterion appears advantageous due to the non-parametric nature on identifying explanatory variables, regardless of the model's nature to adjust afterward. Notwithstanding, two or more explanatory variables may be highly correlated; thus the choice of those variables would insert redundancy and an unnecessary increase in the complexity of the model. It may occur because the

Partial Mutual Information
The MI criterion appears advantageous due to the non-parametric nature on identifying explanatory variables, regardless of the model's nature to adjust afterward. Notwithstanding, two or more explanatory variables may be highly correlated; thus the choice of those variables would insert redundancy and an unnecessary increase in the complexity of the model. It may occur because the criterion does not perform a joint evaluation of the whole set of potential input variables. One way to deal with this is the proposal of [58], who reformulates the MI into what is known as the partial mutual information criterion (PMI). The PMI criterion measures the mutual information between the independent variable x and the dependent variable y, conditioned to a set of inputs z previously selected.
Consider that this set z exists. Next, it is necessary to extract the influence of this set concerning the other potential inputs evaluated yet, to calculate their real contribution, a different from the one already given by z. Thus, following Luna et al. [54] and Sharma [58], Equation (7) can be reformulated as (16): where x = x − E(x |z ) and y = y − E(y |z ).
Here, x and y denote the residuals of x and y, respectively, after removing the conditional expectations given z, E(x |z ) and E(y |z ). With this transformation, x and y can be interpreted as remaining information in both variables: what is yet different from the information in z; and what has not been explained yet from y.
Several approaches have been proposed for estimating expected means [58][59][60]. We opted for simplicity by following the non-parametric approach based on kernel regressions by using the Nadaraya-Watson estimator [61]. According to this, given two variables a and b, a general expected value E(a|b ) is defined by (17) where: with K λa (a − a i ) denoting the kernel function for variable a. As before, we will use the city-block function for this purpose. Therefore, input selection, in this case, follows an iterative process. At the first iteration, MI scores are calculated for every potential input variable previously defined. The first input selected is the one with the higher MI score as long as its significance is validated, initiating z. In the following steps, PMI scores are calculated for all the potential input variables, updating z at each iteration, until the higher PMI is not statistically significant at all. The bootstrapping technique is once more used to verify the PMI scores significance at a 5% level.

Normalization of Maximum Relevance and Minimum Common Redundancy Mutual Information
Some studies use the principles of mutual information, extending it in different directions to increase the filers' selection capability. The work from Che et al. [62] proposed to expand the MI using the maximum relevance and minimum common redundancy (MRMCR) between the inputs of a model (lags). This framework intends to determine the best set of inputs, controlling the redundancy between them.
The first step of this method is to calculate the common redundancy to evaluate the inputs' common information. Following, one must apply the normalization of maximum relevance and minimum common redundancy (N-MRMCR-MI). The result presents values in the interval [0,1].
Let S be the subset of chosen inputs, and T the complementary non-selected group. Then, calculate the common mutual information (CI), using (19): where i is the index of the variables in T, j the index of the inputs in S, and MI is the mutual information (see (7)): The complete application of N-MRMCR-MI procedure is according to the following stages [62]: (1) Initialization: be T = x 1 , x 2 , . . . , x p the full set of inputs, and S = ∅ (empty); (2) First input selection: calculate F(x i ) using (20) for all i = 1, 2, . . . , p, and set the best on x * i applying (21): Energies 2020, 13, 4236 12 of 35 (3) Update the groups: T = T − x * i , and S = x * i ; (4) Greedy selection: repeat steps 1 and 2 until the desired number of features is determined; (5) Determine the N-MRMCR-MI considering the output variable using (22): In this work, we again address the bootstrapping to calculate the confidence level [57], and the city-block functions.

Wrappers
In the wrapper approach, the central aspect is the interaction between the variable selection mechanism and the forecasting model [63]. Once the model has already been adjusted, the wrapper will evaluate, through some performance criteria, each of the subsets to solve it [47]. However, the computational cost involved is high, as the model needs to be adjusted for each candidate subset [64]. The literature recommends using this method for cases where the number of samples is reduced [17]. Figure 6 shows the scheme of the method.
(3) Update the groups: (4) Greedy selection: repeat steps 1 and 2 until the desired number of features is determined; (5) Determine the N-MRMCR-MI considering the output variable using (22): (7) Output the subset S.
In this work, we again address the bootstrapping to calculate the confidence level [57], and the city-block functions.

Wrappers
In the wrapper approach, the central aspect is the interaction between the variable selection mechanism and the forecasting model [63]. Once the model has already been adjusted, the wrapper will evaluate, through some performance criteria, each of the subsets to solve it [47]. However, the computational cost involved is high, as the model needs to be adjusted for each candidate subset [64]. The literature recommends using this method for cases where the number of samples is reduced [17]. Figure 6 shows the scheme of the method.  As shown in Figure 6, the operation of the wrapper is as follows: firstly, the set of entries u t is divided into smaller subsets u t ; next, the predictor is trained and executed for each of subset; after the forecasting stage, we calculate a performance score from the evaluator block for each subset. The one with the best value is used.
It is possible not to use the last predictor shown in Figure 6, if the result of each assessment is stored. However, we presented the selection as a separate task of the forecasting process for the sake of simplicity of understanding.

Progressive Selection
The computational cost of performing an exhaustive search of all possible subsets can be impractical even a relatively small problem since the computational cost is factorial. A proposal to overcome this problem is the wrapper using the progressive selection method. This methodology establishes a manner to build subsets of entries considering each one individually.
The procedure initiates with an empty subset, and we compare each variable with all others. The one presenting best performance measured by the evaluation function is selected, either to improve the result or to least deterioration of this value. After choosing the first entry, fixed in the subset, Energies 2020, 13, 4236 13 of 35 the others are evaluated to ascend as the second entry. We repeat this procedure until the evaluation of all V variables. The final subset is the one with the best overall result. Figure 7 presents this idea, considering the Emborcação series, with the adjustment of an Extreme learning machines neural network (ELM) with a fixed number of 20 neurons in the hidden layer, and a maximum of 10 delays as input. As one can note, we selected three entries in this order: 4, 10, and 6, since this was the combination that had the lowest mean square error. In this case, it is noticeable that the selected entries are not consecutive and that the increase in the number of inputs does not necessarily improve performance. Figure 7 presents this idea, considering the Emborcação series, with the adjustment of an Extreme learning machines neural network (ELM) with a fixed number of 20 neurons in the hidden layer, and a maximum of 10 delays as input. As one can note, we selected three entries in this order: 4, 10, and 6, since this was the combination that had the lowest mean square error. In this case, it is noticeable that the selected entries are not consecutive and that the increase in the number of inputs does not necessarily improve performance.
The number of subsets formed in this case is equal to the number V of inputs, and the number of times the predictor needs training obeys ( ) In the example, above 55 ELMs were adjusted, since V = 10. It is also interesting to observe the behavior of the mean squared error (see Subsection 4.2) between iterations 2 and 4 in Figure 7. When adding the lag v6, the error decreased, improving the value of the objective function to be minimized. However, with the insertion of the variable v3, the error increased. This behavior occurs because the search may fall in local minima, which can circumvent at later iterations [47].

Evaluation Functions
After discussing how the wrapper method works, it is necessary to define a criterion for assessing the quality of forecasting using the previously defined subsets. This step corresponds to the Evaluator block in Figure 7.
The most straightforward criterion is to use some error metric that, for each adjusted set, shows the average of the differences between the desired and the predicted data. The proposal used by Puma-Villanueva et al. [47] is the mean absolute error (MAE), given by (23): where t x is the desired sample in time t, ˆt x is the prediction, and s N the number of predicted samples.
Another possibility is to address the mean squared error (MSE), the most common metric used as a cost function in the training of neural networks, and in the estimation of AR model parameters. This metric is defined by (24): The number of subsets formed in this case is equal to the number V of inputs, and the number of times the predictor needs training obeys V(V + 1)/2. In the example, above 55 ELMs were adjusted, since V = 10.
It is also interesting to observe the behavior of the mean squared error (see Section 4.2) between iterations 2 and 4 in Figure 7. When adding the lag v 6 , the error decreased, improving the value of the objective function to be minimized. However, with the insertion of the variable v 3 , the error increased. This behavior occurs because the search may fall in local minima, which can circumvent at later iterations [47].

Evaluation Functions
After discussing how the wrapper method works, it is necessary to define a criterion for assessing the quality of forecasting using the previously defined subsets. This step corresponds to the Evaluator block in Figure 7. The most straightforward criterion is to use some error metric that, for each adjusted set, shows the average of the differences between the desired and the predicted data. The proposal used by Puma-Villanueva et al. [47] is the mean absolute error (MAE), given by (23): where x t is the desired sample in time t,x t is the prediction, and N s the number of predicted samples. Another possibility is to address the mean squared error (MSE), the most common metric used as a cost function in the training of neural networks, and in the estimation of AR model parameters. This metric is defined by (24): Note that these criteria only consider the final result of the adjustment, regardless of the number of inputs. However, there are other types of evaluation functions seek to penalize the number of entries in order to select parsimonious subsets regarding the number of inputs. Criteria widely used are based on information measures [17].
Schwarz [65] proposed the Bayesian Information Criterion (BIC). It is based on linear correlation metrics, and is linked to the optimal orders of the forecasting model, as defined in (25): where N is the number of observations, p is the order or number of model entries andσ 2 a is estimated variance of white noise (or residue).
Thus, the wrapper chooses the set of inputs with the lowest BIC value. Similarly, the Akaike Information Criterion (AIC) [66] is defined by (26): In both cases, it is clear that there is a penalty concerning the number of entries so that the inclusion of more inputs depends not only on the performance improvement but also on how much it is increased. Thus, the selected subset must be efficient and parsimonious. The difference between the criteria lies in the fact that the BIC penalizes the inclusion more strongly than the AIC. Observe the BIC last term (25) is the natural logarithm of the number of observations, while the AIC (26) has a multiplication of the order or number of model entries by 2.

Bio-Inspired Metaheuristics
Bio-inspired metaheuristics for optimization have been widely applied in VS tasks, especially their binary versions. The genetic algorithm (GA) belongs to the field of evolutionary computation because it is inspired by Darwinian natural selection. Another class of bio-inspired methods is the swarm-based approaches, from which many algorithms were created. The main characteristic of this class is the inspiration based on the collective behavior of groups of animals. Its primary representative is the particle swarm optimization (PSO) [67]. Still, many other algorithms can be cited as artificial bee colony, cat swarm optimization, fish school search, ant colony optimization, among others [68,69].
In this section, we briefly explain the two metaheuristics widely used in the literature for optimization: genetic algorithm and particle swarm optimization. Both techniques simulate multiple agents that evolve/adapt depending on the environment to find better solutions to one fitness function. The flexibility, robustness, and scalability are key advantages of applying metaheuristics to real problems.

Genetic Algorithm
The genetic algorithm (GA) was introduced by John Holland [70], and the theory of natural evolution inspires it. The population which adapts better to the environment circumstances perpetuates the genes, reproducing other individuals with more likely useful genes. Therefore, GA has a population of individuals that will pass into three processes: selection, crossover, and mutation to reproduce a better population until an a priori constraint is reached. For the binary version, a vector of binary values represents each individual of the population.
In summary, the GA starts generating an initial (generally random) population of individuals with the respective fitness function of their genes. Until the population has not converged, the three processes will be executed, and the fitness function will be updated when the genes are updated. The first process, selection, chooses the individuals that will reproduce new individuals. Then, in the following procedure, crossover, the adopted parents (two) will mix their genes to create a new offspring. Thirdly, the mutation process randomly selects genes to be changed. In general, each process establishes the number of individuals that will be updated. The algorithm is usually greedy (only allowing the update of the new individuals that are better than the current ones).

Particle Swarm Optimization
Kennedy and Eberhart [67] developed particle swarm optimization (PSO). After two years, the discrete or binary version of PSO was also published by them [71]. The PSO algorithm mimics the behavior of a flock of birds where each bird is a candidate solution. Each candidate solution i is represented by a position x i and a velocity v i . For the binary version, binary values represent the position and the velocity is continuous values between 0 and 1.
In summary, PSO starts generating an initial (generally random) population of birds with their respective fitness function. Then, until reaching a priori condition, all the birds update the velocity and position. The update of the post (flip the position for binary optimization) is performed each time that random value is higher (or smaller) than a transformation function of the velocity F(v) (such as sigmoid or tangent sigmoid). Using the sigmoid function, we can express that the new position is updated by: Moreover, the new velocity is calculated based on the current velocity, and the delta displacement between the personal (p t best i ) and global (g t best i ) best positions. The p t best i is updated every time that a particle finds a better position, and the g t best i is the best position between the neighbors' particle (defined a priori by the topology). The parameters c 1 and c 2 are a priori constants that define how altruistic or selfish each particle is, and the parameters r 1 and r 2 are random values between 0 and 1. Equation (28) shows the update process in the velocity: Until the algorithm is converged, each particle updates the velocity and position. When the position is updated, the fitness function is also updated.

Case Study
In this section, we summarize the computational results of the linear model and the Extreme Learning Machines neural networks using the variable selection techniques discussed: filters, wrappers, and bio-inspired metaheuristics. As discussed in Section 1, seasonal streamflow series forecasting is essential for countries presenting hydroelectric plants to power generation. In the Brazilian case, 70% of the electric energy is hydroelectric generated [72]. Additionally, this task is vital to optimize the energetic planning [43][44][45]73]. First, it is paramount to mention the adopted assumptions. We performed the simulations with a maximum of 6 delays, following the literature [2,11,54]. We addressed two forecasting approaches: the use of one predictor to the whole series, and the monthly approach, in which we adjusted 12 different models, one for each month of the year.
Several investigations have shown that monthly streamflows present a seasonal behavior throughout the year, being non-stationary series. Linear models cannot be directly applied, being necessary to remove the seasonal component. In this work, we adopted the deseasonalization procedure to transform the series into stationary, with zero mean and variance equals one [2]. The process is reversed before the performance analysis. Equation (29) describes the deseasonalization procedure: where, s n is the original series formed by the samples s i,m , which is transformed into a new series z n ; µ m is the monthly mean;σ m the monthly standard deviation; and the month m = 1, 2 . . . , 12.
The series addressed are related to five important Brazilian hydroelectric plants: Furnas, Emborcação, Sobradinho, Agua Vermelha, and Passo Real. The datasets are available from 1931 to 2015, totaling 85 years or 1020 monthly samples. Each sample refers streamflow in m 3 /s. These data are public, being available on the website of the National Operator of the Electric System (ONS) [74]. We separated each series on three groups: The mean and standard deviation of all series are available in Table 2. Note their distinct statistical and consequent hydrological behavior, enabling a broader analysis of the results.

Predictors
In this section, we briefly describe the forecasting models used in this work. We consider as predictors two methods: the autoregressive linear model (AR) from the Box and Jenkins methodology [50], and the extreme learning machines neural network (ELM), as the nonlinear representative. Note that when using the approach with 12 predictors, the linear method is called the Periodic Autoregressive model (PAR).
The AR approach linearly weights p past values u t = u t−1 , u t−2 , . . . , u t−p of a time series to provide a future response y t . Considering that the values of vector u are stationary, (30) explicates such a process: where ϕ p are the free coefficients of the model. A significant advantage of this method is the possibility of calculating its coefficients using a close form approach named Yule-Walker equations. This means that, using the same set of inputs, the model always converges to the same output. This method guarantees the minimum MSE between the output and the desired response.
The standard AR considers just one model to predict all values of the time series. However, it is possible to extend the AR to series, which presents variations in its structure [75], using the periodic autoregressive model (PAR). According to Hippel and McLeod [76], some historical series, such as hydrological ones with seasonal behavior, present an autocorrelation structure linked to the time delay between observations, and the observed period. In this sense, we can address one predictor for each month, the core of the PAR model. For monthly streamflow forecasting, we use 12 predictors, each one adjusted to predict the samples for each month [11].
The second forecasting model addressed is the extreme learning machine (ELM). ELMs are feedforward neural networks like the traditional multilayer perceptron (MLP), with only one intermediate layer. However, the training process differentiates them since the weights of the neurons in the hidden layer are randomly and independently determined. The training process does not adjust the weights of this layer, but only those of the output. The optimal values of the weights are typically calculated analytically since the training involves solving a linear regression problem [77]. Thus, there is no need to calculate derivatives, back-propagate error signals, or use iterative algorithms, which reduces the computational cost involved in the training process.
Bartlett [78] obtained an important theoretical result. The author proved that controlling the norm of synaptic weights is more relevant in terms of the generalization capability of a neural model than controlling the number of neurons in the middle layer. This leads to important evidence that an improvement occurs when the parameter vector has a minimum norm, so the effective number of neurons in the intermediate layer will be defined by the configuration of the weights of the output layer.
Given this statement, ELM presents a guarantee of good generalization effectively given by the weights of the output layer, and the weights of the intermediate layer can be defined at random. Because of this, the network's training becomes linear in relation to the adjustable parameters for supervised training. The generalized Moore-Penrose operator is the most important candidate for solving this problem in the literature [79,80].
In this work, we address the neural network following the same premises of the AR and PAR models: just one ELM for the complete series (annual approach), and 12 ELMs, one adjusted for each month. We highlight that such method was chosen because it presented good results in monthly seasonal streamflow series forecasting, overcoming other neural models [2,12,[43][44][45]80].

Computational Results
This investigation aims to analyze the quality of the predictions regarding the use of the aforementioned variable selection techniques: filters, wrappers, and bio-inspired metaheuristics. We consider as predictor the autoregressive model (AR), periodic autoregressive model (PAR) [50], and the extreme learning machines neural network (ELM) considering the annual and monthly approaches.
The purpose of these simulations is to find the input selection model that is more suitable for a linear and a non-linear methodologies. Note that the wrappers for AR and PAR models take into account the assessment of the fit of the training set, while the training error of ELMs are not as important because we are interested in the smallest generalization error.
The maximum number of inputs or delays allowed is six, as models of higher orders increase the possibility of negative auto-regressive coefficients [11]. Siqueira et al. [2,12] and Stedinger [52] defend this premise.
The computational results regarding the mean square error (MSE) and mean absolute error (MAE) in the real and deseasonalized (MESd and MAEd) domains for one step ahead are in Tables 3-5. The acronym "Lf" means the linear filter approach based on the partial autocorrelation function (the traditional PACF and the Stedinger's approach PACF-Sted.), "Nf" corresponds to the non-linear filters developed using the mutual information principle (MI, PMI, and N-MRMCR-MI), "WR", the wrapper method considering as evaluation function the BIC, AIC, and MSE, and "M" the metaheuristics, GA and PSO.
In the AR (Table 3) and PAR (Table 4) cases, we presented the error for training and test sets, while for the single and monthly ELM (Table 5), we just show the errors for the test, because we are interested in analyzing the generalization capability of such response. To the ELM, the results are an average of 30 simulations. The best performances in the test set regarding the MSE in the real domain are highlighted using shades of gray. Additionally, in Appendix A we explicate the inputs selected for all case studies in Tables A1-A5. We also applied the Friedman's test to evaluate if the results are significantly distinct [57]. As expected, for the cases that the same set of inputs are selected, there is no statistical difference between the results for the same model. In almost all the other cases, the p-values achieved were close to zero, indicating that change the inputs leads to different conclusions. We discuss the exceptions below.  The critical analysis regarding the results achieved by the AR model reveals interesting behaviors (Table 3). For Furnas and Emborcação time series, there was no perfect correspondence between the best performance regarding the MSE and MAE in the real and deseasonalized domains' errors. In such cases, we assumed the best predictor with the smallest MSE in the real space, following the premises already stated in previous works [2,12,80]. The general analysis showed other relevant issues: for training or test, at least two variable methods led to the same performances since they selected the same subset of inputs, except for the Agua Vermelha training set. As the AR optimized by the Yule-Walker equations presents a closed-form solution, the same input vector necessarily leads to the same output responses. This behavior can occur since just one model is adjusted, and the number of inputs is limited to six.
Likewise, these draws happened for the same class of variable selection method. Note, for example, for Emborcação, Sobradinho, and Passo Real, the smallest training error in the MSE sense was related to the non-linear filters, while for Furnas were the wrappers based on BIC and AIC.
However, we observed an intriguing behavior: the best performances were related to distinct variable selection methods for training and test sets (Table 3). Following some literature regarding monthly seasonal streamflow series forecasting [2,12], one should state the best variable selection method related to the error in the test set. However, the analysis of the training set presented the search capability of the methods. Indeed, the ideal behavior would be the same VS approach for both. The training set was better adjusted for MI filters in 4 cases, and the wrapper (BIC and AIC) in one. PACF best fitted the test sets four of five times, and by the MI filters, once.
Analyzing the inputs selected for AR model in Tables A1-A5 in Appendix A, one can note that the models that fitted better in training set selected six lags for Emborcação, Sobradinho, and Passo Real. For Furnas and Agua Vermelha, two entries. Considering the test set, the PACF approaches, in general, selected three or four inputs. As expected, the MI methods usually chose more lags than the PACF, since they detect non-linear relations. As the linear approach achieved four of five best results, we can affirm that include all inputs in the AR model may tend to a configuration with less generalization capability.
Although the bio-inspired metaheuristics present an elevated search capability, for the AR neither, PSO nor GA achieved some of the best performances (Table 3).
Unlike the AR case, the PAR model's results presented a draw just for the training set of Emborcação (Table 4). It happened because both GA and wrapper-MSE found the same set of inputs (see Table A2). Considering the optimization of 12 models simultaneously, totaling up to 72 free parameters, it is more likely that the variable selection models achieve distinct configurations.
To the PAR model (Table 4), the best VS method presented homogeneity regarding the four error metrics. On the other hand, we noted again that the best performance for training did not present correspondence with the test set. In the training set, the GA stood out, achieving the smallest errors for Emborcação, Furnas, and together with wrapper-MSE for Agua Vermelha. For the other scenarios, we highlight the N-MRMCR-MI. In the test set, just filters reached the smallest errors: PACF-Sted. (Furnas and Agua Vermelha), N-MRMCR-MI (Emborcação), and MI for the others.
Tables A1-A5 reveal the GA and wrapper-MSE often selected five or six inputs for all months. In the comparison of wrapper methodologies, a pattern could be noticed, since the BIC selected fewer inputs than AIC. In practice, we observed the influence of each type of penalty regarding the insertion of new entries.
The results achieved by the ELM considering the annual approach, summarized in Table 5, must be discussed considering not just the numerical values of the errors, but also the statistical difference between them. Additionally, it is important to highlight that when applying neural networks, there is no interest in evaluating the training error, because we are looking for the configuration to achieve the highest generalization capability. Therefore, we discuss just the error in the test set.
In the annual models, we once again noted some ties for Furnas and Emborcação. In addition to Table 5 presenting distinct numerical values, the obtained p-value for the Friedman test was higher than 0.05, as expected, since the set of selected inputs was the same (see Tables A1-A5). As the initialization of the weights of an ELM is random, the outputs are distinct, but the results are close. It is the reason we must run the algorithm at least 30 times.
For Furnas, wrapper (AIC and MSE), and the non-linear filters (MI and PMI) presented the best results, selecting the same entries. For Emborcação, a further discussion must be done. Except for the wrapper-BIC, all methods led to the same performances statistically, according to the Friedman test. However, one can find four distinct sets of inputs in Table A2. It can happen since the approximation capability of a neural network is elevated, and these models are universal approximators. Regarding the inputs, note that lags 1 and 4 belong to all subsets.
For the other series, the AIC stood out for Agua Vermelha and Sobradinho, while the PACF and PACF-Sted., for Passo Real (same set of inputs). The AIC method selected two inputs for all cases.
As observed for the linear prediction models, the MI methods tended to select more lags than the linear filters (see Tables A1-A5). The PMI presented fewer inputs than their non-linear counterparts. In general, it seems clear that inserting too many inputs does not necessarily lead to an increase in performance, especially in the test set.
The monthly prediction using ELM, similar to the linear case, showed error values with a standard pattern. For all scenarios, the wrapper-MSE found the best performances, although the smallest MAE for Sobradinho is related to AIC. In a few cases, the best error metrics converge to the same predictor. Except for the tie in Emborcação, the metaheuristics did not achieve the best errors. Table 6 displays how many times each VS method led to the best performance, according to the models depicted in Tables 3-5: AR (training and test), PAR (training and test), annual ELM (test), and monthly ELM (test). Due to the statistical similarity between some of the best results, provided by Friedman test, we indicated the number of VS methods that achieved similar performance regarding each predictor between parentheses. For example, for the AR in the training set, the PMI reached the best result four times, once alone and three times together with two more VS methods. In bold and underlined it is highlighted the number of times some approach was the single best.
In most cases, the PACF and PACF -Sted. presented the same set of inputs (see Appendix A, Tables A1-A5). It indicates that, for hydrologic series, the hypothesis of dependency of consecutive delays makes sense. These methods were highlighted to the test set of the AR and PAR models ( Table 6).
Considering the non-linear filters for annual models, these approaches presented as inputs all the six lags, except for Agua Vermelha. In general, comparing the MI-methods to the monthly cases, the PMI tended to select fewer inputs, while the N-MRMCR-MI selected more inputs, except for Agua Vermelha (see Tables A1-A5). The MI-based approaches are avid for data since the estimation of the probability density function (PDF) gets better; the more data are available. That is why for annual models, such an approach tends to behave better than the monthly models. The AR was better adjusted for the training set using MI. Additionally, the N-MRMCR-MI won alone twice (Table 6).
Regarding the wrapper methodology, as expected, the MSE-based selected more inputs then BIC and AIC, since there is no penalty in introducing new lags. The BIC selected fewer entries than the AIC, due to its strong penalty function (see Tables A1-A5). However, The BIC did not overcome the others by itself, unlike the AIC, which was the winner twice for ELM annual model. In summary, the wrapper approach stood out for the neural networks (Table 6).
Regarding the metaheuristics, just for the PAR in the training set, the GA achieved most of the best performances. These are powerful methods to deal with binary optimization problems like variable selection, and we believe this approach could be competitive [68,69]. Additionally, the computational cost was higher than the wrappers (Table 6). Moreover, we can also notice that for four of five series, the monthly ELM achieved the best general performances ( Table 3, Table 4 and Table 5). It is particularly relevant since, in current days, one can still find the massive use of linear models in the literature [2,12].
Finally, we state that variable selection is a complex problem, little explored in the context of monthly seasonal streamflow forecasting. The variety among the responses proves the task's inherent difficulties, which present elevated economic and environmental impacts. As can be seen, the models' efficiency is greatly influenced by lags choice. In Figure 8, we present the best predictions achieved by the winner in each scenario: AR for Furnas and monthly ELM for the other cases.

Conclusions
This work performed an extensive investigation on the variable selection methods to determine the best subsets of inputs to increase the accuracy in the monthly seasonal streamflow series forecasting tasks. Most of the specialized literature focus on finding the best predictor, but the selection of the inputs presents an essential step for forecasting.
We addressed wrappers, linear and non-linear filters, and bio-inspired metaheuristics. The wrapper methodology can evaluate the quality of this subset under several criteria, including:

Conclusions
This work performed an extensive investigation on the variable selection methods to determine the best subsets of inputs to increase the accuracy in the monthly seasonal streamflow series forecasting tasks. Most of the specialized literature focus on finding the best predictor, but the selection of the inputs presents an essential step for forecasting.
We addressed wrappers, linear and non-linear filters, and bio-inspired metaheuristics. The wrapper methodology can evaluate the quality of this subset under several criteria, including: Regarding the metaheuristics, we used binary versions of the: • Particle swarm optimization (PSO); and • Genetic algorithm (GA).
We performed computational tests with the predictions made for five monthly series related to hydroelectric plants using the autoregressive linear model (AR), and extreme learning machine neural network (ELM) as predictors. We also addressed two forecasting approaches, the use of only one predictor for the whole series, and the use of 12 predictors, each one adjusted for each month.
The main findings of this investigation are: • The selected lags were very diverse depending on the method, especially for the monthly case; • For the annual approaches, some draws could be found; • The linear models perform better with filters; • The wrapper is the best choice for the neural network; and • Regarding the forecasting methods, the monthly ELM achieved the best error values.
These findings are especially important for countries where the power generation is mainly from hydroelectric plants, since this is the most important renewable power source in the world. Additionally, such investigation can contribute to energetic planning, water availability, and pricing strategies for the power productive chains.
Future works can be developed considering these approaches for other series of renewable inputs for power generation, like wind power [36,38]. Other problems related to energy generation as an estimation of methane production and biogas efficiency [81] could be treated and simulated using a similar methodology. Furthermore, other metaheuristics can be addressed, since there is a vast repertoire of possibilities being developed in the last years, such as differential evolution and artificial bee colony, among others. Funding: This work received funding and technical support from AES and Associated Companies (of the CPFL, Brookfield, and Global group) as part of the ANEEL PD-0610-1004/2015 project, "IRIS -Integration of intermittent renewables: A simulation model of the operation of the Brazilian electrical system to support planning, operation, commercialization, and regulation", which is part of an R and D program regulated by ANEEL, Brazil. The authors also thank the Advanced Institute of Technology and Innovation (IATI) for its support, the National Institute of Meteorology (INMET) for providing the data, and the Coordination for the Improvement of Higher Education Personnel -Brazil (CAPES)-Financing Code 001, for support.