A Hybrid LSTM-Based Genetic Programming Approach for Short-Term Prediction of Global Solar Radiation Using Weather Data

Abstract: The integration of solar energy in smart grids and other utilities is continuously increasing due to its economic and environmental benefits. However, the uncertainty of available solar energy creates challenges regarding the stability of the generated power and the consistency of the supply-demand balance. An accurate global solar radiation (GSR) prediction model can ensure overall system reliability and power generation scheduling. This article describes a nonlinear hybrid model based on Long Short-Term Memory (LSTM) models and the Genetic Programming (GP) technique for the short-term prediction of global solar radiation. LSTMs are Recurrent Neural Network (RNN) models that have been used successfully to predict time-series data. We use these models as base predictors of GSR using weather and solar radiation (SR) data. Genetic programming is an evolutionary heuristic computing technique that enables the automatic search for complex solution formulas. We use GP in a post-processing stage to combine the LSTM models' outputs to find the best prediction of the GSR. We have examined two versions of GP in the proposed model: a standard version and a boosted version that incorporates a local search technique, and we have shown the performance improvement provided by the proposed hybrid model. We have also compared its performance to machine-learning-based stacking techniques for combination. The results show that the suggested method provides a significant improvement in performance and consistency.


Introduction
With the increased interest in energy preservation and environmental protection, the world today is moving into a transition from almost total dependence on fossil fuel to the increased use of alternative sources of energy. Solar radiation (SR) is one of the most promising renewable energy sources [1,2]. Short-term forecasting of SR plays a significant role in assessing the feasibility of actual applications in smart grids that balance power production and consumption at a given interval of time [1][2][3][4][5]. Therefore, it plays an essential role in several aspects of smart grid scheduling and management and renewable energy engineering. This fact requires reliable techniques to predict the stochastic dependency between present and future values of SR using available observations.
Machine learning techniques have gained attention as good competitors to classical statistical models in prediction [5][6][7][8][9]. Machine learning models are considered as nonlinear data-driven models. They use prior data records to recognize the stochastic dependency between past and future data sequences.
In general, consistency and stability are the main challenges researchers encounter in designing reliable SR forecasting models. This is particularly true when dealing with stochastic time-series data.
Ensemble learning techniques that consist of aggregating the decisions of several individual models, also named weak learners, have been proposed for forecasting. In general, the main idea behind the ensemble technique is to combine the outcomes of multiple learners that could be complementary to each other, since it is unlikely that all predictors make the same prediction faults. The stacking ensemble technique is one of the contemporary ensemble techniques; it uses a machine learning model, also named the meta-model, to aggregate the decisions of the weak learners of an ensemble [4,7].
Lately, several machine learning ensemble techniques have been proposed to enhance the short-term prediction of SR power, SR energy strength, and wind energy [15][16][17][18][19][20][21][22][23]. Table 1 summarizes some ensemble models in the literature that have been proposed for the prediction of SR and wind energy, along with their base learners and ensemble types.

Table 1. Summary of reviewed ensembles and combinations for SR energy forecasting.

Authors | Base Learners | Ensemble Type | Combiner / Meta-Model
Al-Hajj et al. [4] | RNNs + SVRs | Parallel stacking | MLPs
Linares-Rodriguez et al. [17] | ANNs | Parallel structure | Simple averaging
Ahmed et al. [18] | DTs + RF regressors + Lasso regressors | Parallel with probabilistic combination | Normal distribution methods for probabilistic forecasts
Guo et al. [19] | ANNs, Linear Regression + RF | Parallel structure (weighted sum) | Linear weighted sum
Yeboah et al. [20] | DT | Bagging technique | RF
Pan et al. [21] | RF | Parallel stacking | Weighted average with Ridge Regression
Basaran et al. [22] | SVR, ANN, DT | Bagging technique | Boosting-Bagging
Abuella et al. [23] | SVRs | Parallel structure (stacking) | Random Forest (RF)

All the above-mentioned ensemble methods are based on probabilistic combination, weighted sums, or stacking techniques. The stacking techniques are worth noting: even though they provide smart combinations of base learners instead of a simple averaging of outputs, they still have some limitations. The main limitation is finding the optimal structure and parameters of the machine learning model designed as the stacking meta-model. This limitation becomes obvious when the training dataset dedicated to the meta-model is small and insufficient to train that model (under-fitting problem).
In this research, we suggest a framework based on a meta-heuristic optimization method, named Genetic Programming (GP), that aims to find the optimal combination of the outputs of the individual predictors, also called weak learners, of an ensemble. GP optimization is inspired by the natural selection of individuals that represent solution candidates, based on a fitness objective function. The fittest individuals are selected to survive into the next generations through a competitive evolutionary computation [24][25][26]. The individuals in our case are formulas for combining the base predictors' outputs. The GP module aims to find the best combination formula that minimizes the overall prediction error of the committee of base learners. Our experiments examined two versions of GP: a standard version and a hybrid one named Memetic Programming (MP), boosted by a local search technique [27,28]. The incorporated local search technique increases the exploitation capability of the GP module when searching for the optimal solution.
In the solar energy estimation and SR prediction literature, GP and other heuristic computation techniques have been applied directly to collected weather and solar measurement data to estimate or predict the SR strength [29][30][31][32][33][34][35].
In [29], the authors utilized Linear Genetic Programming (LGP) to predict the GSR using weather data. The GSR is formulated in terms of climatological and meteorological parameters in two specific areas of Iran. The authors elaborated and validated two separate models, one for each of the two regions, and they evaluated the significance of each parameter of the dataset by applying a sensitivity analysis. The authors claimed that their LGP-based models outperform those based on Angstrom models. In [30], the authors applied the GP technique to estimate the reference evapotranspiration in California, USA, using daily atmospheric parameters such as the daily mean temperature, daily SR, and average daily humidity. The authors' comparative experiments show that their proposed GP-based model outperformed other conventional models such as the Penman-Monteith, Hargreaves-Samani, Jensen-Haise, and Jones-Ritchie equations. In [31], the authors proposed a Multi-Gene Genetic Programming (MGGP) model to estimate SR in Turkey. The model accepts as inputs meteorological data such as extraterrestrial SR, sunshine duration, mean of monthly maximum sunny hours, long-term mean of monthly maximum air temperatures, and other parameters. The authors concluded that the MGGP model's results for SR prediction were more accurate than the values achieved by conventional empirical equations such as Angstrom and Hargreaves-Samani. The performance of MGGP was also assessed with single-data and multi-data models; the multi-data MGGP models and calibrated empirical equations outperform the single-data models for SR estimation. In [32], the researchers employed GP techniques on climatological data to elaborate the best function for predicting the solar panel temperature (Tpanel) for a product family of polycrystalline solar panels.
The recorded data include SR measurements, with ambient temperature, wind velocity, and ambient relative humidity as effective parameters. The authors claimed that the obtained function outperformed the models related to the Nominal Operating Cell Temperature (NOCT) and Nominal Module Operating Temperature (NMOT) approaches, which are considered the most common ways to obtain Tpanel. In [33], the authors presented a hybrid approach that combines the GP and Simulated Annealing (SA) heuristic optimization techniques for estimating the GSR. The researchers derived a prediction equation in which the SR is formulated in terms of several climatological and meteorological parameters in two cities of Iran.
In [34], the authors investigated the capacity of neural network models and the Gene Expression Programming (GEP) technique for estimating the daily dew point temperature using daily recorded data that include incoming SR, atmospheric pressure, dew point temperature, and other parameters. The researchers concluded that the GEP models outperform the ANN models in estimating daily dew point temperature values. In [35], the authors proposed a nonlinear regression model based on GP techniques for estimating SR. The best nonlinear modeling function obtained takes into account the measured meteorological data. An enhanced approach, consisting of multiple nonlinear regression functions in a parallel structure, is also presented; each function is designed to estimate the GSR in a specific seasonal period of the year.
The above-described methods, and most reported approaches in the literature that use GP-based techniques for predicting SR and other types of energy, apply the GP directly to the weather parameters. In this research, we propose a hybrid framework that uses the GP to find the optimal combination formula of the outputs of time-series data predictors of the Long Short-Term Memory (LSTM) type. We believe that applying the GP, or any other heuristic optimization technique, directly to weather and solar data cannot adequately handle this data type [32,33,35]. The proposed hybrid framework employs the GP in a post-processing phase to combine the outputs of base recurrent predictors of the LSTM type [36,37]. The LSTM is a gated machine learning model that is trained using the gradient-based Back-Propagation Through Time (BPTT) algorithm, a version of the classical back-propagation gradient descent algorithm suitable for recurrent models. Nowadays, LSTM models are considered among the most prominent Recurrent Neural Network (RNN) models for deep learning on time-series forecasting problems. They are suitable for modeling time-series data frames because they handle the temporal context among serial data records. The proposed method combines the strength of LSTM models in modeling time-series data with the capacity of heuristic models to explore the best nonlinear combination formulas. The base forecasters and the heuristic combination module were designed for the short-term prediction of SR intensity.
The main contribution of this research is a hybrid LSTM-based GP approach for the short-term prediction of GSR. The approach consists of an ensemble of homogeneous base models of the LSTM type combined by a weighted formula designed by a heuristic optimization technique of the GP type. This work also validates and compares two versions of the GP technique for our approach in terms of performance and consistency. In all the hybrid models considered in this research, the examined predictors aim to predict the amount of GSR fifteen minutes ahead using the measured amount of GSR and weather data for the previous n slots of 15 min. Moreover, the proposed approach has been compared against machine-learning-based stacking techniques. The performance of the considered models has been evaluated using statistical metrics and graphical methods. The rest of the paper is organized as follows: Section 2 introduces GP and LSTM and presents the GP versions and frameworks used in our experiments. Section 3 illustrates the structure and implementation of the proposed method. Section 4 presents and discusses the experimental results and their interpretation. Finally, Section 5 concludes the paper.

Genetic Programming
GP is an extension of the conventional Genetic Algorithm (GA), which is inspired by the biological processes of natural selection and survival of the fittest [24,38]. The main difference between GA and GP lies in the representation of a solution: GA usually uses an array of bits to represent a solution, whereas GP deals with functions (computer programs) represented as tree data structures [24][25][26]. Each tree in GP comprises sets of nodes, where internal nodes represent algebraic and/or logical operators, while the external ones, called terminals, represent the numbers and data variables of the given problem. Figure 1 depicts examples of binary trees representing mathematical functions, where each function can be seen as a computer program. The GP algorithm uses the so-called genetic operators (crossover, mutation, and cloning/reproduction) to generate a new population from the current one. This process is repeated until it finds an optimal solution or reaches the maximum number of iterations [38]. The algorithm starts with an initial population of trees, also named individuals, that are usually generated randomly. Then, in each generation of individuals, the GP algorithm performs the following steps:


- Measure the fitness value of each individual using a predefined fitness function. This value indicates how well the individual function (computer program) suits the problem under consideration.
- Select a set of individuals from the present population, based on their fitness values, using a suitable selection strategy.
- Generate the next population by applying the crossover and mutation operators to the selected individuals, based on predefined crossover and mutation probabilities.
- Clone the individuals with the best fitness values into the next population, based on a predefined reproduction probability.
These steps are repeated until an optimum solution is found or the maximum number of fitness evaluations is reached. Figure 2 illustrates an example of applying the crossover operator to two selected individuals. Several selection strategies for the GP algorithm have been introduced and studied extensively, all of which make use of the fitness values of the current population. In this paper, tournament selection of size t is used, where t individuals from the current population are chosen randomly, and the one with the best fitness is considered the winner. Moreover, the fitness value of an individual is calculated as the Root Mean Square Error (RMSE):

RMSE = √( (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)² ) (1)

where ŷ_i is the predicted output and y_i is the measured value related to that output. In the following sections, two modified versions of the GP algorithm are introduced.
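The fitness and selection steps above can be sketched in Python (a minimal illustration in which individuals are stood in for by simple callables, and the tournament size t is an assumed parameter name):

```python
import math
import random

def rmse_fitness(individual, inputs, targets):
    """RMSE of an individual (a callable program) over a dataset; lower is better."""
    errors = [(individual(x) - y) ** 2 for x, y in zip(inputs, targets)]
    return math.sqrt(sum(errors) / len(errors))

def tournament_select(population, fitnesses, t, rng):
    """Pick t individuals at random; the one with the lowest RMSE wins."""
    contenders = rng.sample(range(len(population)), t)
    winner = min(contenders, key=lambda i: fitnesses[i])
    return population[winner]

# Toy population: candidate programs approximating y = 2x.
population = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
inputs, targets = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
fitnesses = [rmse_fitness(p, inputs, targets) for p in population]

rng = random.Random(42)
best = tournament_select(population, fitnesses, t=3, rng=rng)
print(best(10.0))  # with t = len(population), the exact program y = 2x wins -> 20.0
```

In a full GP loop, the winners of repeated tournaments would then undergo crossover and mutation to produce the next generation.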

GPLearn Framework
The GPLearn framework implements GP in Python with a scikit-learn-compatible API [39]. GPLearn includes all the standard features needed to design and validate a GP optimization model. In addition, it provides three alternative mutation operations, namely sub-tree mutation, hoist mutation, and point mutation. These operators increase the exploration capability of the GP implemented in this framework. They operate as follows:
- Sub-tree mutation: takes the winner (a tree) of a random selection and then selects a random sub-tree from it to be replaced. A donor sub-tree is generated at random and inserted into the chosen winner tree to form offspring in the next generation. This is considered one of the more aggressive mutations: it allows big changes in an individual by replacing a complete sub-tree and all its descendants with totally naive random components.
- Hoist mutation: takes the winner of a random selection and selects a random sub-tree from it. Then, a random sub-tree of that sub-tree is selected and lifted (hoisted) into the original sub-tree's location to produce offspring in the next generation.
- Point mutation: takes the winner of a random selection and selects random nodes from the winner to be replaced. Functions are replaced by other functions that require the same number of arguments, and terminals are replaced by other terminals. The resulting tree forms offspring in the next generation.
Figure 3 depicts the three described types of mutation.
The GPLearn framework is purposefully constrained to solving symbolic regression problems. Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes the relationship between dependent and independent variables. It begins by building a population of simple random formulas relating the known independent variables to their dependent variable targets in order to predict new data.
Each successive generation of programs is evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations. GPLearn supports regression through the SymbolicRegressor class and binary classification through the SymbolicClassifier class. This open-source framework has evolved through many versions; we have built our GP model with the latest version, 0.4.1.
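The hoist and point mutations can be illustrated on a simple nested-tuple tree representation (an illustrative sketch of the idea, not GPLearn's internal implementation):

```python
import random

# Trees are nested tuples (op, left, right); terminals are strings or numbers.
OPS = ["+", "-", "*", "/"]

def point_mutation(tree, rng):
    """Replace the root operator with another operator of the same arity."""
    if not isinstance(tree, tuple):
        return tree                     # terminals are left unchanged here
    op, left, right = tree
    new_op = rng.choice([o for o in OPS if o != op])
    return (new_op, left, right)

def hoist_mutation(tree):
    """Lift a sub-tree into its parent's place (here: promote the left child),
    shrinking the tree -- the operator used to fight bloat."""
    if not isinstance(tree, tuple):
        return tree
    return tree[1]                      # hoist the left sub-tree into the root's place

tree = ("+", ("*", "x", "y"), 3)        # the tree for (x * y) + 3
rng = random.Random(0)
print(point_mutation(tree, rng))        # root '+' becomes some other operator
print(hoist_mutation(tree))             # -> ('*', 'x', 'y')
```

Sub-tree mutation would instead splice a freshly generated random tree into a randomly chosen position, which is why it is the most disruptive of the three.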

Memetic Programming
The Memetic Programming (MP) algorithm is a hybrid evolutionary algorithm that searches for suitable functions (computer programs) for a given problem. Each loop of the MP algorithm contains two different types of searches: a diversification search and an intensification search. The MP algorithm uses the GP operators, crossover and mutation, to guarantee diversity in the current population. Then, the algorithm employs a set of local search procedures to intensify the best individual programs of the current population. Specifically, these local search procedures generate new trees T′ from a given tree T by using shaking, grafting, and/or pruning operators [27,28]. The shaking operator generates a new tree from T by altering its nodes without changing its structure. On the other hand, the grafting and pruning operators change the structure of T by expanding its terminal nodes or cutting some of its sub-trees, respectively. Figure 4 shows three examples of generating the tree T′ from an initial tree T by applying the shaking, grafting, and pruning procedures. For more details about the MP algorithm, the reader can refer to [27,28], where reported results show that the MP algorithm outperforms the standard GP algorithm and recent versions of it.
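Two of these local search operators can be sketched on a nested-tuple tree representation, where trees are tuples (op, left, right) and terminals are variable names or constants (an illustrative simplification, not the MP authors' implementation):

```python
import random

def shake(tree, rng, constants=(1, 2, 3)):
    """Shaking: alter node contents without changing the tree's structure
    (here, numeric terminals are replaced by other constants)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return (op, shake(left, rng, constants), shake(right, rng, constants))
    if isinstance(tree, (int, float)):
        return rng.choice(constants)
    return tree                         # variable terminals are kept as-is

def prune(tree, replacement=1):
    """Pruning: cut a sub-tree and replace it with a terminal, shrinking the tree
    (here, always the right sub-tree of the root, for illustration)."""
    if not isinstance(tree, tuple):
        return tree
    op, left, _ = tree
    return (op, left, replacement)

tree = ("+", ("*", "x", 2), ("-", "y", 5))
print(prune(tree))  # -> ('+', ('*', 'x', 2), 1)
```

Grafting would do the opposite of pruning: expand a terminal node into a newly generated sub-tree, growing the candidate solution.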

Long Short-Term Memory (LSTM) Models
LSTMs are the most prominent RNN models used in deep learning to solve time-series data prediction problems [36,37,40]. They are suitable for modeling and handling the temporal context among time-series data records. These deep learning models, introduced in [36], have gained popularity in the domain of time-series data prediction (speech recognition, gesture recognition, natural language processing, automatic translation, market stock prediction, etc.). Figure 5 illustrates the main components of an LSTM unit as introduced originally by Hochreiter and Schmidhuber [36]. The main idea was to address the vanishing/exploding gradients facing researchers when applying the classical gradient-descent-based back-propagation training algorithm to deep learning models [36,37]. A single basic cell of the LSTM type, also named a vanilla cell, includes an input gate, a forget gate, and an output gate. In addition, it contains a specific internal structure of individual hidden units. Those units and gates help mitigate the vanishing/exploding of the back-propagated error over prolonged periods of time. The forget gate permits forgetting the non-important (non-discriminant) information, whereas the state of the cell permits remembering the most important information. An LSTM unit takes as input at instant t a signal vector x_t, the hidden state h_{t−1} that depends on the output of the unit at instant t − 1, and the cell state c_{t−1} of the LSTM unit at instant t − 1. The outputs of the cell consist of the current hidden state h_t, the output of the cell at instant t that will be used as an input at the next instant t + 1, and the cell state c_t. In general, an LSTM unit has four sets of gate parameters and one set of output-layer parameters.
Those parameters are defined by the following five equations:

forget gate parameters: W_f ∈ R^{H×(D+H)}, b_f ∈ R^H (2)
input gate parameters: W_i ∈ R^{H×(D+H)}, b_i ∈ R^H (3)
output gate parameters: W_o ∈ R^{H×(D+H)}, b_o ∈ R^H (4)
cell parameters: W_c ∈ R^{H×(D+H)}, b_c ∈ R^H (5)
output layer parameters: W_y ∈ R^{K×H}, b_y ∈ R^K (6)

where D is the dimension of the input vector, H is the number of LSTM cells in a layer, and K is the number of classes in the case of classification. W(·) denotes the weight matrices, and b(·) represents the bias vectors that affect the transformation of a particular gate. Given an input vector, the forward pass can be described by the following seven equations. The output of the forget gate is calculated using Equation (7). Then, the output of the input gate is calculated using Equation (8). The state of the cell is computed using Equations (9) and (10) consecutively. Finally, the output of the cell is found using Equations (11) and (12):

f_t = σ(W_f [x_t, h_{t−1}] + b_f) (7)
i_t = σ(W_i [x_t, h_{t−1}] + b_i) (8)
č_t = tanh(W_c [x_t, h_{t−1}] + b_c) (9)
c_t = f_t ∘ c_{t−1} + i_t ∘ č_t (10)
o_t = σ(W_o [x_t, h_{t−1}] + b_o) (11)
h_t = o_t ∘ tanh(c_t) (12)
σ(z) = 1 / (1 + e^{−z}) (13)

where f_t, i_t, o_t, and č_t denote the outputs of the forget, input, output, and cell memory gates, respectively. σ represents the sigmoid activation function defined by Equation (13), and ∘ stands for element-wise multiplication. The outcomes of the LSTM unit are c_t (the state memory of the cell) and h_t (the output value of the cell). These two outcomes are passed to the next time step. The reader can refer to [36] for more details about LSTM and other gated models.
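A single forward step through Equations (7)-(12) can be sketched with NumPy (a minimal vanilla cell; the weight shapes follow the D-input, H-cell convention, and the random initialization is purely illustrative):

```python
import numpy as np

def sigmoid(z):                                # Equation (13)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM forward step. Each W[k] has shape (H, D + H); each b[k] shape (H,)."""
    z = np.concatenate([x_t, h_prev])          # the concatenation [x_t, h_{t-1}]
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate,  Eq. (7)
    i = sigmoid(W["i"] @ z + b["i"])           # input gate,   Eq. (8)
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate,    Eq. (9)
    c = f * c_prev + i * c_tilde               # cell state,   Eq. (10)
    o = sigmoid(W["o"] @ z + b["o"])           # output gate,  Eq. (11)
    h = o * np.tanh(c)                         # hidden state, Eq. (12)
    return h, c

D, H = 7, 4                                    # e.g., 7 weather inputs, 4 cells
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(H, D + H)) * 0.1 for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Since o lies in (0, 1) and tanh(c) in (−1, 1), the hidden state h is always bounded, which is part of what keeps the back-propagated error well behaved.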

Building Base Models
The dataset used to design and validate the base predictors consists of weather and solar historical data provided by the AUMET solar and weather station at the American University of the Middle East (AUM), Kuwait. The dataset comprises daily data records with a resolution of 5 min. Each record (instance) includes up to 32 measures of weather and SR parameters. In this research, we have selected the seven most prominent parameters using the PCA technique [41,42]. We have built five individual LSTM models as base predictors of the GSR 15 min ahead. Each model is composed of a simple LSTM unit. All the base LSTM models have the same structure and activation functions. Moreover, we have used the same BPTT algorithm parameters to train the LSTM models in the first layer [35]. Each model receives as input a sequence of k + 1 successive input vectors. In the experimental trials, we chose k = 2 based on a sensitivity analysis. Each of these vectors includes a subset of the weather and solar parameters out of the total number of selected parameters; each subset includes the measured GSR and other selected weather parameters for the k previous 15-min slots (windows). The adopted combinations are the most performant in providing the best complementary information among the outputs of the individual predictors, which is a key factor in improving the overall prediction performance. In addition, we have designed and implemented a reference global LSTM model that receives all seven weather and solar radiation parameters as input to predict the solar radiation strength 15 min ahead.
LSTM_6 receives in each input vector all the adopted weather and solar parameters: Air Density, Average Air Temperature, Heat Index, Max Relative Wind Speed, Relative Humidity, Average Surface Temperature, and Solar Radiation for the current and the previous 15-min slots.
All the LSTM models use the same activation functions and the same parameter values in the BPTT training algorithm [36].

Building the Hybrid GP-LSTM Model
The hybrid LSTM-based GP approach consists of an ensemble of homogeneous LSTM-based predictors and a heuristic GP-based combiner. Two versions of the GP combiner have been examined, and they are set up as follows:


- A standard version consisting of the primary mutation and crossover operators as described in Section 2. The parameters of this version of the GP module are set up as follows:
o The population size is set to 200 individuals. Each individual represents a nonlinear combination formula of the outputs of the base predictors.
o The set of operators used in the internal nodes of the trees representing the individuals includes the four arithmetic operators: addition (+), subtraction (−), division (/), and multiplication (*). In addition, it includes the decimal logarithm (log(x)), the exponential function (exp(x)), and the square root (√x).
o The maximum depth allowed for the trees representing individuals is set to eight; it is one of the adjustable parameters in a GP evolution process. The depth of each binary tree is dynamic: it starts at an initial depth that can increase until the selected maximum value is reached.
o The fitness function evaluates the efficiency of each individual representing a candidate solution (chromosome). A fitness function based on the root mean square error (RMSE) has been used.
Moreover, some alternatives have been investigated and examined to set the following two parameters:
o The probability of applying each of the genetic operators (crossover and mutation).
o The sampling method used to select individuals from the current population to participate in generating new individuals for the next generation.
- A hybrid version of the GP algorithm, named MP, consisting of a local search module that permits selecting the best breeding among the possible children of two crossover parents. The parameters of this version of the GP module are set up almost the same as for the standard one, for comparison reasons: the same population size, set of operators, maximum depth, fitness function, and sampling method have been adopted for this version.
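An evolved combination formula can be read as a small program over the five base predictors' outputs. A minimal sketch of evaluating one such candidate with protected operators (the protected-operator convention, the helper names, and the example formula are illustrative, not the actual evolved solution):

```python
import math

# Protected versions of the non-arithmetic operators in the GP function set,
# a common convention so every candidate formula stays defined everywhere.
def plog(x):  return math.log10(abs(x)) if x != 0 else 0.0
def pexp(x):  return math.exp(min(x, 50.0))          # capped to avoid overflow
def psqrt(x): return math.sqrt(abs(x))

# One hypothetical candidate over the five base predictions p1..p5:
# (p1 + p2) * sqrt(p3) - log(p4) / (1 + exp(-p5))
def candidate(p):
    p1, p2, p3, p4, p5 = p
    return (p1 + p2) * psqrt(p3) - plog(p4) / (1.0 + pexp(-p5))

print(round(candidate([1.0, 2.0, 4.0, 10.0, 0.0]), 3))  # -> 5.5
```

The GP search explores the space of such formulas, scoring each by its RMSE against the hold-out targets and keeping the fittest for the next generation.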

Training and Testing Dataset
As we mentioned in Section 3.1, the dataset used to train and validate the base predictors consists of historical weather and solar data records. The dataset comprises daily data records with a resolution of 5 min. Our experiments used the data records for the period between September 2020 and March 2021 inclusive. We removed from this dataset the night hours, in which the SR strength is almost equal to 0 (absence of SR). The resolution of the data records was decreased to 15 min. The resulting dataset, which comprises 8854 data records, was divided into three subsets. The first subset contains the first 60% of the records and is used to train the base predictors. The second subset is a hold-out subset with 20% of the dataset; it is dedicated to validating the base predictors and designing the heuristic GP model, as we will describe in this section. The third subset contains the remaining 20% of the data records; it is used to test the base learners and the proposed GP-LSTM hybrid approach and has never been used in the training and validation phases.
The hold-out set is used as follows: after training the base predictors on the training subset (60% of the dataset), those predictors make predictions on the hold-out subset records (see Figure 6). We thus obtain five predicted values, produced by the five base predictors, for each instance in the hold-out subset. We then create a new training set by taking these predicted values as input parameters to the GP model and keeping the target values of the same hold-out subset. The GP heuristic model is designed on this new training set by finding the best formula that combines the base predictors' outputs to predict the target values of GSR 15 min ahead. To reduce the noticeable difference in the scales of the weather and SR parameters, we applied the MinMax scaling technique. The data values of each parameter p have been normalized to a standard scale using Equation (14):

x′ = (x − min(p)) / (max(p) − min(p)) (14)

where x is an original value sample of the parameter p, and min(p) and max(p) are the minimum and maximum measured values of the parameter p, respectively. Finally, we applied the sliding-window technique to prepare the sequential inputs for each LSTM model in a pre-processing phase. During the training phase, the sliding window prepares the sequence of input vectors to include the data records of the 15-min slots up to slot t in order to predict the GSR for slot t + 1. In this case, the input sequence contains the data records related to the 15-min slots t − k, ..., t, and the output target is the GSR for the (t + 1)-th period. The sliding-window technique is shown in Figure 7. The LSTM-based predictors in the first layer of the model do not receive all the data parameters in each data record. Instead, each LSTM predictor gets a combination of selected parameters out of the seven parameters described previously.
For example, one of the models receives the SR strength coupled with the Air Density and Heat Index measures for the current and the previous 15-min slots; another model receives the SR strength coupled with the Air Temperature and Relative Wind Speed measures for the current and previous 15-min slots, and so on. In our experiments, we have examined the effect of each combination of input parameters on the performance of the base predictors.
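The MinMax scaling of Equation (14) and the sliding-window preparation can be sketched with NumPy (a minimal illustration; the window length k and the column layout are assumptions):

```python
import numpy as np

def minmax_scale(x):
    """Equation (14): map each column (parameter) to [0, 1]."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def sliding_windows(data, target, k):
    """Build (window, target) pairs: slots t-k .. t predict the target at t+1."""
    X = np.stack([data[t - k : t + 1] for t in range(k, len(data) - 1)])
    y = np.array([target[t + 1] for t in range(k, len(data) - 1)])
    return X, y

data = np.arange(20.0).reshape(10, 2)   # 10 slots, 2 parameters (toy values)
gsr = np.arange(10.0)                   # toy target series
X, y = sliding_windows(minmax_scale(data), gsr, k=2)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

Each row of X is one sequence of k + 1 = 3 input vectors, matching the input shape the LSTM base predictors expect.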

Experiments
In our comparative analysis of the performance of the base predictors and the LSTM-GP hybrid approach, we examined two versions of the GP:


- A standard version of GP comprising the crossover operator with the advanced versions of the mutation operators described in Section 2.1. This version is implemented in the GPLearn framework, as described in Section 2.1.1.
- A standard version of MP that includes a local search sub-model that enhances the exploitation capability of the GP model [27,28].
In addition, we compared the performance of the GP-based combination in the post-processing module, for the two versions of GP mentioned above, against machine learning stacking models. For that purpose, we designed and trained four machine learning combiner models of the types Random Forest [14], Multi-Layer Perceptron (MLP) [40], SVR [12,13], and K-Nearest Neighbors (KNN) [43]. All the stacking models were trained on the same training set used to design the GP combiners. Each stacking model receives the prediction outputs of the LSTM predictors and combines those predictions using the experience it gains during the training phase. We also used a simple averaging operator as a baseline reference combiner; it receives the outputs of the base predictors and produces their average. Figure 8 shows the pictogram of the GP-based combination of the LSTM base predictors. In all experiments, we used the Scikit-learn framework, developed at the French Institute for Research in Computer Science and Automation (INRIA) [44], to implement, train, and validate all the examined predictors and combination models.
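The stacking stage can be sketched with scikit-learn (an illustrative sketch on fabricated base-predictor outputs; KNN stands in here for any of the four meta-models, and the synthetic data exist only to show the shapes involved):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
targets = rng.uniform(0, 1000, size=200)   # stand-in hold-out GSR values

# Five base predictors' outputs: the target plus individual noise.
base_outputs = np.stack(
    [targets + rng.normal(0, 50, 200) for _ in range(5)], axis=1
)

# Simple-averaging baseline combiner.
avg_pred = base_outputs.mean(axis=1)

# KNN meta-model trained to map the five base outputs to the target.
meta = KNeighborsRegressor(n_neighbors=5).fit(base_outputs, targets)
knn_pred = meta.predict(base_outputs)
print(avg_pred.shape, knn_pred.shape)  # (200,) (200,)
```

Even the averaging baseline already reduces the error of any single base predictor, since the independent noise terms partially cancel; a trained meta-model can do better by weighting complementary predictors unevenly.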

Evaluation Metrics
To evaluate and compare the performance of the base predictors and that of the proposed hybrid approach, we have used four statistical metrics, namely the root mean square error (RMSE), the mean absolute error (MAE), the mean bias error (MBE), and the mean absolute percentage error (MAPE) [7]. We have also evaluated the coefficient of determination between the observed and forecasted values of GSR energy.
Those metrics are computed as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(H_i - \hat{H}_i\right)^2}$

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|H_i - \hat{H}_i\right|$

$\mathrm{MBE} = \frac{1}{N}\sum_{i=1}^{N}\left(H_i - \hat{H}_i\right)$

$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{H_i - \hat{H}_i}{\hat{H}_i}\right|$

where $H_i$ is the predicted output and $\hat{H}_i$ is the measured value related to that output.
All the above-described metrics are negatively oriented scores, which means that lower values are favorable.
The correlation coefficient is a numerical measure of the strength of the correlation between two variables [45]. It is computed by Equation (19):

$r = \frac{\sum_{i=1}^{N}\left(H_i - \bar{H}\right)\left(\hat{H}_i - \bar{\hat{H}}\right)}{\sqrt{\sum_{i=1}^{N}\left(H_i - \bar{H}\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(\hat{H}_i - \bar{\hat{H}}\right)^2}} \quad (19)$

where $\bar{H}$ and $\bar{\hat{H}}$ represent the average values of the predicted and measured values, respectively.
Our experiments have used the coefficient of determination, $R^2$, computed as the square of the correlation coefficient. The coefficient of determination shows the percentage of the variation in a dependent variable $y$ that is explained by all the independent variables $x$ together. The higher the value of $R^2$, the better. Since it is a squared value, it always lies between 0 and 1 and can never be negative. It is easy to interpret in terms of regression, especially for multiple linear regression, where numerous variables are involved [46].
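The four error metrics and the coefficient of determination can be implemented directly with NumPy. This is a straightforward sketch of the standard definitions above; the toy arrays are invented for demonstration.

```python
import numpy as np

def evaluate(pred, meas):
    """Return RMSE, MAE, MBE, MAPE (%) and R^2 for predicted vs. measured GSR."""
    err = pred - meas
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mbe = np.mean(err)                      # sign shows over-/under-estimation
    mape = 100.0 * np.mean(np.abs(err / meas))
    r = np.corrcoef(pred, meas)[0, 1]       # Pearson correlation coefficient
    return rmse, mae, mbe, mape, r ** 2

# Toy example
meas = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])
scores = evaluate(pred, meas)
print(scores)
```

Note how the MBE cancels symmetric over- and under-estimation, which is exactly the property the paper later uses to explain the complementarity of the base predictors.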

Results and Discussions
In this work, we compared the performance of the heuristic GP techniques with that of stacking techniques in combining the outputs of base predictors of SR. The experimental study also investigated the improvement in the overall performance of those base predictors after employing the examined combination techniques. In the first stage, we trained and tested the LSTM base predictors described in Section 2.2. Then, we trained the GP-based combiners using the outputs of the base learners on the validation data subset. Finally, we tested the overall prediction performance of the LSTM-GP hybrid models on the testing dataset. The same hold-out dataset used to train the GP models has been used to train the stacking meta-models.

The first row in Table 2 shows the obtained scores of the evaluation metrics for the individual global LSTM model that receives all the weather and SR parameters as inputs. The following five rows show the scores of the same metrics obtained by the base predictors, where each of them receives a subset of the input parameters as described in Section 3.1. The best individual model in the ensemble is marked with '*'. The results in Table 2 indicate that the LSTM_3 model, which uses three parameters, namely Average Air Temperature, Heat Index, and SR, has outperformed all the individual models as well as the global model that uses all seven input parameters. The models LSTM_4 and LSTM_5 have also outperformed the individual global model, which means that each of these two individual models carries enough information to outperform the global model. Table 3 shows the obtained scores of the metrics for the combination models. The first row shows the scores for the simple averaging combiner.
The next four rows show the scores of the metrics for the stacking combiners that employ machine learning techniques, and the last two rows show the metric scores for the LSTM_GP and LSTM_MP models. The results in Table 3 indicate that the hybrid LSTM_GP models have outperformed all the stacking models and all the individual models reported in Table 2 in terms of all evaluation metrics. The LSTM_GP models have also outperformed the simple averaging model shown in the first line. The two versions of the GP hybrid models showed very close scores for the RMSE and MAE metrics. These two hybrid models have exploited well the complementary information among the individual predictors of the committee. This can be explained by the negative and positive values of MBE for the individual models, which indicate that some of them underestimate and others overestimate the exact values of GSR. We believe that the advanced mutation operators implemented in the GPLearn version have offset the advantage of the local search in the MP hybrid model for the examined combination task.
The obtained scores for RMSE in Table 3 indicate that most of the examined combination models, especially the LSTM_GP and LSTM_MP hybrid models, have outperformed the global model that uses all the weather and SR parameters as input. This means that dividing the input parameters of the base learners into subsets in a preprocessing phase, and then combining the complementary information of their outputs with a good combiner, eases the prediction task of those base learners and therefore improves the overall performance.
The sixth row in Table 3 shows that the standard GP combination model decreases the RMSE with respect to the simple averaging of the individual LSTM outputs by almost 46.27%, as computed in (21):

$\mathrm{Decrease} = \frac{\mathrm{RMSE}_{\mathrm{Avg}} - \mathrm{RMSE}_{\mathrm{GP}}}{\mathrm{RMSE}_{\mathrm{Avg}}} \times 100\% \approx 46.27\% \quad (21)$

The third, sixth, and seventh rows in Table 3 indicate that the two GP-based hybrid models, LSTM_GP and LSTM_MP, decrease the RMSE with respect to the best stacking combiner (LSTM_MLP, with RMSE = 0.0337) by almost 29.37% and 31.15%, respectively, as computed in (22).

Figure 9 depicts the binary trees associated with the best individual candidate solutions in the last populations of the standard GP and the MP evolutionary computations. Each of these two best individuals represents the best combination formula for the outputs of the individual LSTMs. Equation (24) represents the formula stored in the binary tree provided by the standard GP.

Figure 9. Binary tree representing the fittest individual in the last population of the standard GP evolutionary computation, i.e., the best combination formula adopted in the hybrid GP_LSTM model, where x1 = LSTM_1, x2 = LSTM_2, x3 = LSTM_3, x4 = LSTM_4, and x5 = LSTM_5.

Figure 10 shows the daily average prediction error of GSR, 15 min ahead, provided by the simple averaging LSTM_Avg model, the stacking LSTM_MLP model, and the examined hybrid LSTM_GP model. To keep the plot clear, we have considered the daily average errors of prediction for the days of March 2021. The chart in Figure 10 shows a significant outperformance of the LSTM_GP model with respect to the other models, in terms of a minimal deviation around an average prediction error that is the closest to 0.

Conclusions
In this study, we have discussed the short-term prediction of GSR strength on horizontal surfaces using hybrid solutions that combine the strength of LSTM models in modeling and predicting time-series data with the capability of GP techniques in finding nonlinear combination functions. For that purpose, we have proposed and examined two versions of the GP heuristic combination technique based on evolutionary computation: a standard version, and a hybrid version, called Memetic Programming, that incorporates a local search sub-model to enhance the exploitation quality of the standard GP. To evaluate the performance of the proposed hybrid LSTM-GP approaches, we have conducted an extensive experimental study comparing those approaches against stacking techniques. The two versions of the GP hybrid approach have outperformed all the individual base predictors and most of the stacking techniques. This means that the GP-based heuristic modules in the hybrid approach have outperformed the stacking models in finding the best combination maps among the base predictors and can therefore better exploit the complementary information among the outputs of those predictors. That is important, especially when the data available for designing and training the machine learning combiners (i.e., stacking) is small for practical reasons and therefore inadequate for adjusting the free parameters of the stacking models.
Extending this study to the prediction of other types of energy is one perspective of this work. Other future perspectives consist in applying the same hybrid techniques to ensembles of other gated recurrent models. In addition, because the prediction performance of any ensemble of models depends on the good handling of complementary information among the individual learners, finding convenient combination structures is one of our interests for future work.

Funding: This research received no external funding.

Data Availability Statement: The data used in this research is private and not publicly available yet. The reader may request sample data directly from the corresponding author by email.

Conflicts of Interest:
The authors declare no conflict of interest.