Forecasting Daily Solar Radiation Using CEEMDAN Decomposition-Based MARS Model Trained by Crow Search Algorithm

The precise forecasting of daily solar radiation (DSR) is receiving prominent attention among thriving solar energy studies. In this study, three standalone models, including gene expression programing (GEP), multivariate adaptive regression splines (MARS), and self-adaptive MARS (SaMARS), were evaluated to forecast DSR. A SaMARS model was classified as MARS model when using the crow search algorithm (CSA). In addition, to overcome the limitations of the standalone models, the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) was employed to enhance the accuracy of DSR forecasting. Therefore, three hybrid models including CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS were proposed to forecast DSR in Busan and Incheon stations in South Korea. The performance of proposed models were evaluated and affirmed that the accuracy of the CEEMDAN-SaMARS model (NSE = 0.878–0.883) outperformed CEEMDAN-MARS (NSE = 0.819–0.818), CEEMDAN-GEP (NSE = 0.873–0.789), SaMARS (NSE = 0.846–0.769), MARS (NSE = 0.819–0.758), and GEP (NSE = 0.814–0.755) models at both stations. Therefore, it can be concluded that the optimized CEEMDAN-SaMARS model significantly enhanced the accuracy of DSR forecasting compared to that of standalone models.


Introduction
Due to the negative impacts of fossil fuels on the environment, renewable energy resources have attracted the attention of governments, researchers and industries.Solar energy is a prominent infinite energy source because of its remarkably accessible properties such as radiant light and heat from the Sun, and its applications including electricity generation and air heating/cooling [1].Recently, solar energy exploitation has become cheaper and more efficient due to the development of various techniques and market competition.Comprehensive understatement of solar radiation is essential for different technological methods of solar energy production as well as the selection of feasible installations in the future [2][3][4].Solar radiation is spatially and temporarily variable, and therefore site measurements are necessary.However, due to various problems (e.g., lack of instruments and fiscal issues), the authentic DSR are scarce [5,6].Thus, it is clear that applying efficient methods are important for estimating DSR based on other input variables such as meteorological and geographical variables [7].
In recent decades, data-driven methods, such as artificial neural networks (ANNs) [8], adaptive neuro fuzzy inference system (ANFIS) [9], support vector machine (SVM) [10], M5-model tree [11], multivariate adaptive regression splines (MARS) [12], and gene expression programing (GEP) [13], have been widely applied for energy demand and solar radiation forecasting studies.For instance, Yadav and Chandel [14] used ANNs to perform an exhaustive review of the prediction of solar radiation.They showed the ability of ANNs for solar radiation forecasting based on different case studies.Sozen et al. [15] validated the performance of ANNs for solar radiation prediction based on geographic parameters of Turkey.Dorvlo et al. [16] evaluated two kinds of neural network namely multilayer perceptron (MLP) and radial basis function (RBF).Alsina et al. [17] evaluated the performance of ANNs for the prediction of monthly solar radiation using 45 meteorological stations over Italy.Lou et al. [18] applied a black box model using a boosted regression tree for predicting the diffusion of SR in Hong Kong and Denver.Moreover, Mohammadi et al. [19] employed wavelet transform (WT)-based SVM to enhance the accuracy of standalone SVM.The method combined a firefly meta-heuristic algorithm with support vector regression (SVR) and was designed by Olatomiwa et al. [9] to evaluate the accuracy of developed methods for solar radiation estimation in Nigeria.Antonanzas et al. [20] evaluated SVR performance for mapping solar irradiation using exogenous input variables, whereas Monteiro et al. [21] compared the ability of two models (e.g., ANNs and SVR) to generate photovoltaic power.Chen et al. [22] estimated a solar radiation problem using least-square SVR (LSSVR) based on the atmospheric data at Chongqing meteorological station, China.Salcedo-Sanz et al. [23] also developed an integrated neuro-evolutionary wrapper-based technique for estimating DSR in Queensland, Australia.They used coral reef optimization (CRO) for the feature selection process to access an optimal set of predictor variables using an extreme learning machine (ELM) method.
Nevertheless, various investigations have been accomplished for estimating solar radiation using empirical and conventional methods, but there is an essential challenge to develop a method to overcome non-stationary time series.To address non-stationary problems, several pre-processing approaches (e.g., the principal component analysis (PCA) [24,25], continuous wavelet transform (CWT) [26][27][28], moving average (MA) [29], wavelet multi-resolution analysis (WMRA) [30], maximum entropy spectral analysis (MESA) [31], singular spectrum analysis (SSA) [32,33], and empirical mode decomposition (EMD) [34]) have been used to decompose input/output variables.These techniques are useful tools to resolve the frequency components of input/output time series data by decomposing original datasets into several sub-series, before such datasets are applied in time series estimations.
More recently, complete enhanced EMD with adaptive noise (CEEMDAN) [35] was successfully applied to reconstruct the original input/output variables precisely.It gives a better spectral separation of the intrinsic mode functions at a lower computational cost.Few studies have been accomplished to enhance the model's performance using CEEMDAN for forecasting different types of data.Zhang et al. [36] investigated and forecast short-term wind speed on the eastern coast of China using CEEMDAN with the flower pollination algorithm (FPA).Prasad et al. [37] used CEEMDAN and EEMD integrated with ELM to improve models' performance for soil moisture forecasting.They provided that the CEEMDAN-ELM model outperformed the other models for upper layer soil moisture forecasting.Moreover, two-phase integration of ELM and CEEMDAN algorithm was investigated by Wen et al. [38] to predict real-time runoff.They designed a two-phase hybrid model, This paper presents an integrated model that is designed for coupling CEEMDAN decomposition with data-driven models for improving forecasting accuracy of DSR.In the present research, the original time series is first decomposed by CEEMDAN for better frequency resolution.The main contribution of this study is that, for the first time, an integrated CEEMDAN algorithm with data-driven models (e.g., GEP, MARS, and SaMARS) is proposed to supply prominent frequency-based input information to forecast DSR.The proposed integrated models are applied at Busan and Incheon meteorological stations in South Korea.Then, it is benchmarked and evaluated with standalone models (e.g., GEP, MARS, and SaMARS) using several statistical criteria.

Data Collection
In this study, meteorological data were collected from two weather stations, namely Busan (Longitude, 129  45 N; Altitude, 21.12 m), in South Korea (Figure 1).These stations are operated and maintained by the Korea Meteorological Administration (KMA).The weather data consist of 16 years (from 2000 to 2016) covering daily records of air temperature (TA), sunshine duration (SD), relative humidity (RH), vapor pressure (VP), sea-level pressure (SLP), pan evaporation (PE), and daily solar radiation (DSR).
Energies 2019, 12, x FOR PEER REVIEW 3 of 23 algorithm was investigated by Wen et al. [38] to predict real-time runoff.They designed a two-phase hybrid model, which utilized CEEMDAN combined with the variational mode decomposition (VMD) method to resolve the frequency of the original datasets in the Yingluoxia watershed, Northwestern China.This paper presents an integrated model that is designed for coupling CEEMDAN decomposition with data-driven models for improving forecasting accuracy of DSR.In the present research, the original time series is first decomposed by CEEMDAN for better frequency resolution.The main contribution of this study is that, for the first time, an integrated CEEMDAN algorithm with data-driven models (e.g., GEP, MARS, and SaMARS) is proposed to supply prominent frequency-based input information to forecast DSR.The proposed integrated models are applied at Busan and Incheon meteorological stations in South Korea.Then, it is benchmarked and evaluated with standalone models (e.g., GEP, MARS, and SaMARS) using several statistical criteria.

Data Collection
In this study, meteorological data were collected from two weather stations, namely Busan (Longitude, 129°03′E; Latitude, 35°10′N; Altitude, 69.2 m) and Incheon (Longitude, 126°70′E; Latitude, 37°45′N; Altitude, 21.12 m), in South Korea (Figure 1).These stations are operated and maintained by the Korea Meteorological Administration (KMA).The weather data consist of 16 years (from 2000 to 2016) covering daily records of air temperature (TA), sunshine duration (SD), relative humidity (RH), vapor pressure (VP), sea-level pressure (SLP), pan evaporation (PE), and daily solar radiation (DSR).Generally, the accessibility of a long-term measured dataset is of particular importance for an accurate estimation of DSR.Furthermore, the accuracy of models for DSR forecasting is affected by the quality of DSR time series data.In measured DSR data, there have been some contradictions and Generally, the accessibility of a long-term measured dataset is of particular importance for an accurate estimation of DSR.Furthermore, the accuracy of models for DSR forecasting is affected by the quality of DSR time series data.In measured DSR data, there have been some contradictions and abnormalities in the values mainly because of malfunctioning instruments [6].After the data sorting, every missing value was replaced with interpolated values by means of various approaches for time series analysis.In addition, this method provides that if there are some missing days or a datum is incorrect, it can be replaced with the means of two nearest non-missing values [39].
In the present study, the data were divided into two sub-sets: the first for calibration and the second for validation.Hence, the first data cover the period from 1 January 2000 to 31 December 2012, while the second data include the period from 1 January 2013 to 31 December 2016.The calibration dataset was applied for creating and statistically analyzing a suitable model, while the second data were used to validate the accuracy and efficiency of selected models.Table 1 presents the statistical parameters related to climatic variables in the selected stations.

Complete Ensemble EMD with Adaptive Noise (CEEMDAN)
The EEMD introduced by Wu and Huang [40] is an adaptive technique for representing non-linear and non-stationary signals as the sum of signal elements with modulated parameters of domain and frequency; a noise-assisted analysis method is given by the popular EMD [41].The EMD is as a self-adaptive decomposition model, without initial knowledge of the number and nature of intrinsic mode functions (IMFs) and is embedded in the data [40,41].However, research has shown that EMD has a limitation regarding mode mixing [40].Mode mixing is either a single IMF including vast disparate scale elements or a similar scale element, which exist in IMFs [42].EEMD is introduced as an enhanced procedure for overcoming the mode mixing problem in EMD.Even if EEMD solves this problem, a finite average, the Gaussian white noise added using the EEMD may not be canceled after resulting in an error of reconstruction.The complete ensemble EMD with adaptive noise (CEEMDAN) is defined as an enhanced EEMD approach.However, there is a high level of computational cost (associated with the exhaustive search) and includes residual noise in the EEMD method.An increase in the number of trials can potentially increase the number of sifting process.A CEEMDAN was employed for declining the trials number while keeping the capacity for resolving the problem of mode mixing [43].
In brief, the CEEMDAN steps are as follows: (1) This method applies for computing the first mode function as following Equation (1).
The first residue is also given as following Equation (2).
(2) Determine emd (t) as the kth IMF element using the EMD method and decompose the sequence r 1 (t) + p 1 emd 1 (n j (t)) to reach the second component of IMF.
A residual signal is provided.
(3) Likewise, in the previous step, the kth residual signal is estimated.
The component of k + 1th IMF is then derived.
(4) Iterate these steps while the residual signal is reached.Suppose, there are L components of IMF.Hence, the original sequence can be computed as: where r(t) is the final residual signal.In this study, the optimum standard deviation is equal to 0.5 and L = 500.According to Figure 2, the original DSR data series was decomposed using CEEMDAN.

Gene Expression Programing (GEP)
GEP model is a subset of GP introduced by Ferreira [44], and consists of five components namely the terminal set, terminal condition, function set, control parameters and fitness function.The GEP model applies character strings of fixed lengths for solving the problems, while parse structures of trees with various lengths are used.Moreover, the GEP model can develop complicated non-linear programs by several subprograms because of its multigenic system.A gene of GEP model is created by two symbol types: the first symbol is fixed length variables and another is constants which is known as terminal set (e.g.{a, b, c, 6}) and some operations as the function set (e.g.{+, −, log}) [45].The generated chromosomes by the GEP model indicate parse trees which are used for reading the encoded information at the strings using Karva language.After that, these chromosomes can be described as expression trees (branched structures).The nodes recorded between the root and the deepest layers can be utilized for inverse expression tree transformation to the Karva expression (Kexpression) for generating the primary string [46].In this way, the K-expression length should be either equal or less than the GEP gene [47].
A GEP model begins while the chromosomes related to fixed lengths are generated randomly in each.When these chromosomes are revealed, all individuals are investigated in case of fitness.The

Gene Expression Programing (GEP)
GEP model is a subset of GP introduced by Ferreira [44], and consists of five components namely the terminal set, terminal condition, function set, control parameters and fitness function.The GEP model applies character strings of fixed lengths for solving the problems, while parse structures of trees with various lengths are used.Moreover, the GEP model can develop complicated non-linear programs by several subprograms because of its multigenic system.A gene of GEP model is created by two symbol types: the first symbol is fixed length variables and another is constants which is known as terminal set (e.g., {a, b, c, 6}) and some operations as the function set (e.g., {+, −, log}) [45].The generated chromosomes by the GEP model indicate parse trees which are used for reading the encoded information at the strings using Karva language.After that, these chromosomes can be described as expression trees (branched structures).The nodes recorded between the root and the deepest layers can be utilized for inverse expression tree transformation to the Karva expression (K-expression) for generating the primary string [46].In this way, the K-expression length should be either equal or less than the GEP gene [47].
A GEP model begins while the chromosomes related to fixed lengths are generated randomly in each.When these chromosomes are revealed, all individuals are investigated in case of fitness.
The individuals are subsequently determined to reproduce based on their fitness.In every generation, this process is iterated until a solution is found.In this method, genetic operations (e.g., mutation and crossover) are also used to convert population [44,45].

Multivariate Adaptive Regression Splines (MARS)
A MARS model is a sort of non-linear model, which was introduced by Friedman [48].Regarding this approach, there is no presumption about basic function relations among input/output variables.The segment endpoints, which are defined as nodes, can determine the endpoint at each region [49,50].
The MARS model builds basis functions using the step searching means.In addition, the MARS model is built using a two-step method.At first, functions are added until probabilistic nodes are found (primary phase).The second step involves removing the minimum real terms (secondary phase).Suppose y is a deterministic output and X = (X 1 , . . ., X p ) is the input variable.Thus, it can be said that data are obtained from an unknown "real" model.Consequently, the response is as follows [51]: where, e is the error distribution.The MARS model is recruited to approximate function f by means of the basis functions (BFs).In fact, BFs are referred to splines (smooth polynomials) including piecewise-cubic and piecewise-linear functions.Equation ( 9) is extracted from the MARS model when alinear combination of BFs and their mutual relations is created [52], where, each λ m (x) is a BF which might be a spline function or product of two or more spline functions.Coefficients β are constant values and can be evaluated by least squares (LS) method.
The MARS model is known as one of data-driven models.Firstly, the primary method is accomplished for training data.By cutting off the β 0 and basis pair, one model is built that has the maximum reduction of training error.The next pair is added to the current model on the basis of the M BFs as follows (10) [53], where the LS technique is used for estimating β.In addition, mutual interactions among the BFs in the model are carefully considered when new BF is added to the model space.Then, BFs are added into the model to achieve the maximum specified number of terms that bring about a perfect fitness model.After that, a backward removal discipline is applied to decrease the number of terms.The main aim of this deletion approach is finding an optimal number of parameters (terms) by getting rid of the unessential variables.

Crow Search Algorithm (CSA)
Crows are capable of memorizing faces, communicating in sophisticated ways, as well as hiding and retrieving food during different seasons.These characteristics of crows allow them to discover where other crows hide their food and steal it when the owner leaves.Considering this, Askarzadeh [54] proposed a novel evolutionary algorithm, named crow search algorithm (CSA), to solve complex optimization issues.This algorithm follows given principles: (1) Crows live in the flock form; (2) They memorize the places that they hid their food; (3) They follow each other to conduct thievery, and (4) They preserve their caches from being pilfered using a probability.
Similar to other optimization-based algorithms, CSA starts the optimization process with a dimensional environment compared to the population of crows.The crow number is N and the position i of crow at each iteration in the search space is computed using a vector x i,iter = x i,iter 1 , x i,iter 2 , . . ., x i,iter d where i = (1, 2, . . ., N) and iter = (1, 2, . . ., iter max ).iter max is the number of maximum iterations.
There is a parameter in CSA named awareness probability which its role is to balance the intensification and diversification [55] to increase the intensification using small values for awareness probability by searching on a local space and by increasing the awareness probability value; CSA can tend to determine the searching space on the global scale.CSA implementation for optimization can be briefly explained as [54]: Defining the optimization problem with all of its constraints, determining the decision variables and setting the CSA parameters, flock size (N), the flight length (fl), maximizing the iteration number (iter max ), and the awareness probability (AP).2.
Initializing the position and memory in a d-dimensional search space randomly for crows according to Equations ( 11) and (12).Each crow can be suitable solution for the problem and d indicates the quantity of the decision variables.
Evaluating the fitness function for each crow by placing the decision variables into the objective function.

4.
Generating new positions as follows: crow i can generate a new situation and select ones among other cows (crow j) randomly and follows it to discover crow j's hidden food source.

5.
The possibility of the new positions for all crows is checked as follows: If new position of crows is possible, the position of that crow is updated.Else, the crow remains in the current situation and new position is not generated for that crow.6.
The fitness function for the new position of each crow is evaluated.7.
The crows update their memory by Equation ( 13): where f (.) is an objective function, x i,iter is the position of crow i in iteration, iter and m i,iter are the memory of crow i in iteration iter.The termination criterion is checked (repeat steps 4-7 until iter max ).Eventually, the best memory position based on the objective function value is considered as the optimum solution.

Self-Adaptive Multivariate Adaptive Regression Splines (SaMARS)
In computer science, a careful choice of parameters belonging to data-driven models, such as ANN, ANFIS, and SVM, is a crucial step to attain outstanding performance in modeling processes.For instance, the number of hidden nodes and layers (both discrete) or the weights and biases are the necessary parameters, which need to be optimized in an ANN model.Even if the model provides appropriate consequences to an addressed problem, its parameters due to incorrect choices may result in a worse performance than expected.The common technique to find desirable parameters is to Energies 2019, 12, 1416 9 of 23 combine prior experiences with a limited heuristic search of optimal solutions which takes a lot of time for the user and might not address the issue.
In addition, the MARS model depends on its parameters containing penalty parameter d, maximum BF M max , and interaction m i .Using MARS model, however, makes it difficult to find optimal parameters simultaneously due to its large range of choices, which appropriate parameters can significantly improve MARS model's forecasting accuracy, and also suitable values can still lie outside suggested ranges.Thus, this study presents SaMARS model as a helpful method to assist users encounter this challenging issue (Figure 3).Firstly, the MARS model is deployed for handling the underlying function.A new MARS method is created based on each set of values of the CSA-provided parameter.CSA's greedy selector compares the quality of proposed model quality regarding evaluation of fitness function.After the calibration processing, the MARS model is employed to verify the validation dataset.The following objective function is applied to check the fitness function of the model: where, E calibration and E validation indicate calibration and validation error, respectively.In Equation ( 14), root mean square error (RMSE) is applied as the estimation error index.It is worth pointing out that the fitness function in Equation ( 14) represents the trade-off between complexity complexity and generalization of model.Due to the fact that the over-fitting occurs more in the training stage, the combination of calibration and validation errors can build a model to balance optimally with minimum calibration error and model generalizability.In the second step, CSA conducts the search for selecting the best parameter values, containing: M max , m i , and d.Once the stop criterion is satisfied, the optimization process is terminated.In this work, a generator number is used as the stop criterion.Prior to reaching the certain generation number, the model is underway.Finally, the optimal forecasting model with the best parameter settings is found when the stop criterion is performed.In other words, the training process of SaMARS has been completed and is ready to forecast DSR using testing data).
Energies 2019, 12, x FOR PEER REVIEW 9 of 23 outside suggested ranges.Thus, this study presents SaMARS model as a helpful method to assist users encounter this challenging issue (Figure 3).Firstly, the MARS model is deployed for handling the underlying function.A new MARS method is created based on each set of values of the CSAprovided parameter.CSA's greedy selector compares the quality of proposed model quality regarding evaluation of fitness function.After the calibration processing, the MARS model is employed to verify the validation dataset.The following objective function is applied to check the fitness function of the model: where, Ecalibration and Evalidation indicate calibration and validation error, respectively.In Equation ( 14), root mean square error (RMSE) is applied as the estimation error index.It is worth pointing out that the fitness function in Equation ( 14) represents the trade-off between complexity complexity and generalization of model.Due to the fact that the over-fitting occurs more in the training stage, the combination of calibration and validation errors can build a model to balance optimally with minimum calibration error and model generalizability.In the second step, CSA conducts the search for selecting the best parameter values, containing: Mmax, m i , and d.Once the stop criterion is satisfied, the optimization process is terminated.In this work, a generator number is used as the stop criterion.Prior to reaching the certain generation number, the model is underway.Finally, the optimal forecasting model with the best parameter settings is found when the stop criterion is performed.In other words, the training process of SaMARS has been completed and is ready to forecast DSR using testing data).

Model Performance
The accuracy and reliability of the proposed models need to be evaluated and assessed.Therefore, several statistical criteria were employed as follows: 1.
Relative root mean square error (RRMSE) Mean absolute error (MAE) Nash-Sutcliffe efficiency (NSE) Willmott's index of agreement (WI) Legates-McCabe's index (LMI) where DSR obs and DSR f or are the measured and forecasted DSR values; DSR obs and DSR f or indicate the observed and forecasted mean values of DSR; and N is the number of DSR data points.The first criterion, RMSE [Range = (0, +∞); Ideal value = 0] provides the standard deviation of estimating errors; and MAE [Range = (0, +∞); Ideal value = 0] gives an information about the average discrepancies between observed and forecasted values.Both RMSE and MAE criteria are known as the absolute error measures [37].In addition, RRMSE [Range = (0, +∞); Ideal value = 0] is an adequate criterion when comparing models of different stations.NSE [Range = (−∞, 1); Ideal value = 1] can be applied for evaluating the capability of hydrological approaches.The highest value of NSE indicates a perfect fit between measured and forecasted DSR.A negative value of NSE presents that the proposed model can perform worse than the mean value of time series dataset [56].Moreover, WI [Range = (0, 1); Ideal value = 1] is a standardized error value of model prediction.Both criteria (e.g., NSE and WI) are sensitive to outliers because of the squaring of difference terms [37,57].However, LMI [Range = (−∞, 1); Ideal value = 1] is not inflated using the squared values and is not sensitive to outliers [58].

Results and Discussion
In this study, hybrid forecasting models (i.e., CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS) are designed using MARS model with CEEMDAN to eliminate non-stationary time series.The efficiency and capability of the proposed hybrid models are compared with three standalone models (e.g., GEP, MARS, and SaMARS) for DSR forecasting in Busan and Incheon stations, South Korea.Figure 3 indicates the main steps of proposed models for enhancing the model's performance.Also, the design parameters of GEP and MARS models for calibration at Busan and Incheon stations are presented in Table 2.

Implementation of CEEMDAN Based Models
In this study, DSR was selected as the target variable based on the different input variables including vapor pressure (VP), air temperature (TA), relative humidity (RH), sea-level pressure (SLP), pan evaporation (PE), and sunshine duration (SD).
Firstly, the meteorological input variables were classified as calibration (from 2000 to 2012) and validation (2013 to 2016) dataset.In the second step, the datasets of calibration and validation phases were separately decomposed into various components (sub-series) and one remaining value (residual).The decomposition of time series dataset into twelve IMFs (from IMF1 to IMF12) and one residual component is provided in Figure 2.Then, the GEP, MARS, and SaMARS models were employed as estimation approaches to estimate each decomposed IMF and residual component, and finally, the estimated decomposed IMFs and residual values using the proposed models were aggregated to generate DSR time series to calculate each component using the same sub-series (IMF1) of six input variables, respectively.
To summarize, the hybrid forecasting models (i.e., CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS) apply the conceptual idea of "decomposition and ensemble".The decomposition and ensemble features can simplify the estimation task and formulate a consensus predicting the original dataset, respectively.

Standalone and Hybrid GEP Models
Results of the GEP and CEEMDAN-GEP models for forecasting DSR in both stations are presented in Table 3.It can be found from Table 3 that    Scatterplots between observed and forecasted DSR values for both stations in the validation phase are presented in Figure 4.It can be seen from the scatterplots that the CEEMDAN-GEP model's best-fit lines in the validation phase between the estimated (y) and the observed (x) values are closer to the ideal line (y = x).Figure 5 shows that time-series plots between observed and forecasted DSR for calibration and validation phases in both stations, the CEEMDAN-GEP outperforms GEP model for DSR forecasting in both stations.It can be seen from the scatterplots that the CEEMDAN-GEP model's best-fit lines in the validation phase between the estimated (y) and the observed (x) values are closer to the ideal line (y = x).Figure 5 shows that time-series plots between observed and forecasted DSR for calibration and validation phases in both stations, the CEEMDAN-GEP outperforms GEP model for DSR forecasting in both stations.

Standalone and Hybrid MARS Models
Based on Table 4, the CEEMDAN-MARS model could forecast DSR with better precision compared with MARS model for calibration and validation phase at Busan station.To confirm the accuracy of MARS and CEEMDAN-MARS models, scatterplots between observed and forecasted DSR at both stations are provided in Figure 6.To confirm the accuracy of MARS and CEEMDAN-MARS models, scatterplots between observed and forecasted DSR at both stations are provided in Figure 6. Figure 6 illustrates that the forecasted DSR values using the CEEMDAN-MARS model were much closer to the corresponding observed DSR values than those of MARS model.Likewise, at Incheon, Figures 8 and 9 provide that the CEEMDAN-SaMARS model give fewer scattered estimates compared with SaMARS.In addition, the accuracy of the forecasted peak values from CEEMDAN-SaMARS is improved compared with SaMARS model.Therefore, the CEEMDAN-SaMARS model is found to be relatively more suitable for forecasting DSR peak values than SaMARS model in the validation phase in Incheon.

Comparison of Proposed Hybrid and Standalone Models
After the investigation of the standalone and hybrid models for DSR forecasting, the performance of the proposed models was generally compared to select the best for DSR forecasting of both stations.It can be seen from Tables 3-5 at Busan, that the values of forecasted DSR using CEEMDAN-MARS and CEEMDAN-SaMARS provided a reliable and efficient performances compared with CEEMDAN-GEP, GEP, MARS, and SaMARS.In a similar comparison at Incheon, the CEEMDAN-SaMARS produced the best results to forecast DSR compared with CEEMDAN-GEP, CEEMDAN-MARS, GEP, MARS, and SaMARS models.From the results at both stations, the conjunction of MARS model, CEEMDAN data pre-processing approach, and CSA algorithm produced the best results to forecast DSR.Therefore, it can be judged from this investigation that the application of CEEMDAN and CSA to standalone data-driven model can clearly enhance the model accuracy and efficiency.
In addition, the forecasting accuracy of the proposed models is shown using box-plots diagrams (Figure 10).Box-plots can be expressed as the spread of observed and forecasted DSR data according to their quartiles.The whiskers indicate the outside variation of 25 th (lower) and 75 th (upper) percentiles [37].Using the box-plots, the CEEMDAN-SaMARS model gave higher accuracy and ability to forecast DSR compared with the other models.
Furthermore, the percentage of absolute forecasted error value is represented using the empirical cumulative distribution function (ECDF) for the validation phase at both stations (Figure 11).Using the error percentage that indicates the minimum error (i.e., from 0 to ± 4 MJ/m 2 ), the CEEMDAN-SaMARS model gave the best performance compared with the other models for forecasting DSR values.

Comparison of Proposed Hybrid and Standalone Models
After the investigation of the standalone and hybrid models for DSR forecasting, the performance of the proposed models was generally compared to select the best for DSR forecasting of both stations.It can be seen from Tables 3-5 at Busan, that the values of forecasted DSR using CEEMDAN-MARS and CEEMDAN-SaMARS provided a reliable and efficient performances compared with CEEMDAN-GEP, GEP, MARS, and SaMARS.In a similar comparison at Incheon, the CEEMDAN-SaMARS produced the best results to forecast DSR compared with CEEMDAN-GEP, CEEMDAN-MARS, GEP, MARS, and SaMARS models.From the results at both stations, the conjunction of MARS model, CEEMDAN data pre-processing approach, and CSA algorithm produced the best results to forecast DSR.Therefore, it can be judged from this investigation that the application of CEEMDAN and CSA to standalone data-driven model can clearly enhance the model accuracy and efficiency.
In addition, the forecasting accuracy of the proposed models is shown using box-plots diagrams (Figure 10).Box-plots can be expressed as the spread of observed and forecasted DSR data according to their quartiles.The whiskers indicate the outside variation of 25th (lower) and 75th (upper) percentiles [37].Using the box-plots, the CEEMDAN-SaMARS model gave higher accuracy and ability to forecast DSR compared with the other models.
Furthermore, the percentage of absolute forecasted error value is represented using the empirical cumulative distribution function (ECDF) for the validation phase at both stations (Figure 11).Using the error percentage that indicates the minimum error (i.e., from 0 to ±4 MJ/m 2 ), the CEEMDAN-SaMARS model gave the best performance compared with the other models for forecasting DSR values.Overall, the results of this research indicate that the CEEMDAN-based GEP, MARS, and SaMARS models capture the non-linear and non-stationary dynamics of DSR time series datasets effectively and can provide optimal DSR forecasting.Although the GEP, MARS, SaMARS models could forecast DSR values, they were not efficient and accurate as CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS models.Though, this study proposed and assessed the hybrid models (i.e., CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS) for DSR forecasting, the identified limitations on DSR research must be taken into account for future studies.
This study explains that the GEP, MARS, and SaMARS models can improve the precision of DSR forecasting when the CEEMDAN method is integrated for decomposing data time-series into components.However, one of the main disadvantages of using the CEEMDAN data pre-processing approach would be classified as time-consumption.In addition, even if CSA algorithm shows excellent performance for determining three parameters of MARS, it has to develop and verify other meta-heuristic algorithms that help to identify the parameters of MARS.The reason can be expressed that CSA algorithm requires a long time to find the parameter of MARS model.Although the outcomes are suitable for this research, other alternative methods have to be found, and investigated to solve this issue.This study explains that the GEP, MARS, and SaMARS models can improve the precision of DSR forecasting when the CEEMDAN method is integrated for decomposing data time-series into components.However, one of the main disadvantages of using the CEEMDAN data pre-processing approach would be classified as time-consumption.In addition, even if CSA algorithm shows excellent performance for determining three parameters of MARS, it has to develop and verify other meta-heuristic algorithms that help to identify the adequate parameters of MARS.The reason can be expressed that CSA algorithm requires a long time to find the parameter of MARS model.Although the outcomes are suitable for this research, other alternative methods have to be found, and investigated to solve this issue.

Conclusions and Future Research
Solar radiation is one of the most renewable and accessible energy sources, and plays a crucial role in global energy demand.Hence, accurate forecasting of DSR is one of the major problems for scientists, engineers, and decision makers.This study attempts to investigate the efficiency of GEP, MARS, and SaMARS models for DSR forecasting at Busan and Incheon stations, South Korea.To overcome the non-stationary and non-linear characteristics of time-series values, the CEEMDAN is utilized as the pre-processing approach to decompose all input/output variables, which improves the accuracy of DSR forecasting.
Comparing the results of standalone models (e.g., GEP, MARS, and SaMARS) and hybrid models (e.g., CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS) confirms the accuracy and efficiency of the proposed models.It is shown that the CEEMDAN data pre-processing approach has a significant influence on models' accuracy and provides better results.At the Busan station, the forecasted DSR using CEEMDAN-GEP, CEEMDAN-MARS, and CEEMDAN-SaMARS models have a higher accuracy in term of NSE, 0.873, 0.879, and 0.878, compared to the standalone GEp, MARS, and SaMARS models.Furthermore, the results indicate that the CEEMDAN-SaMARS model is an effective tool and a promising method for DSR forecasting.The performance of the hybridized CEEMDAN-SaMARS model at both stations was the best, with RRMSE ≤ 18.1% at Busan and 19.3% at Incheon.In addition, this research reveals that the usage of a data pre-processing model (e.g., CEEMDAN) plays a main role in achieving an accurate forecast.Furthermore, future study is necessary to improve the model's accuracy using feature selection methods to attain optimal input variables.
In spite of some limitations, this study can provide the basic information for future work, particularly the use of a hybrid model which can be integrated with a physically-based approach to create a DSR simulation model.Moreover, it is suggested that a two-phase decomposition is utilized to increase IMF1 estimating values which have a high frequency.
combined with the variational mode decomposition (VMD) method to resolve the frequency of the original datasets in the Yingluoxia watershed, Northwestern China.

Figure 1 .
Figure 1.Map of study area.

Figure 1 .
Figure 1.Map of study area.

Figure 2 .
Figure 2. Intrinsic mode functions (IMFs) and residual constructed using CEEMDAN for daily solar radiation at Busan station (calibration phase).

Figure 2 .
Figure 2. Intrinsic mode functions (IMFs) and residual constructed using CEEMDAN for daily solar radiation at Busan station (calibration phase).

Figure 3 .
Figure 3.The main procedures of proposed standalone and hybrid models.

Figure 3 .
Figure 3.The main procedures of proposed standalone and hybrid models.

Figure 4 .
Figure 4. Scatterplots of observed and forecasted DSR using the GEP and CEEMDAN-GEP models in validation phase at Busan (blue color) and Incheon (Green color) stations.

Figure 4 .
Figure 4. Scatterplots of observed and forecasted DSR using the GEP and CEEMDAN-GEP models in validation phase at Busan (blue color) and Incheon (Green color) stations.

Energies 2019 , 23 Figure 5 .
Figure 5. Time-series plots between observed and forecasted DSR for GEP and CEEMDAN-GEP models for calibration and validation phases at Busan and Incheon stations.

Figure 5 .
Figure 5. Time-series plots between observed and forecasted DSR for GEP and CEEMDAN-GEP models for calibration and validation phases at Busan and Incheon stations.

5. 3 .
Standalone and Hybrid MARS Models Based on Table 4, the CEEMDAN-MARS model could forecast DSR with better precision compared with MARS model for calibration and validation phase at Busan station.For the validation phase, the CEEMDAN-MARS model showed optimal values with RMSE = 2.412 MJ/m 2 , RRMSE = 0.166, MAE = 1.983MJ/m 2 , NSE = 0.879, WI = 0.969, and LMI = 0.667 compared with the MARS model (RMSE = 2.951 MJ/m 2 , RRMSE = 0.207, MAE = 2.437 MJ/m 2 , NSE = 0.819, WI = 0.951, and LMI = 0.581) at the Busan station.This illustrates the ability of the CEEMDAN-MARS model for DSR forecasting at the Busan station.Results of statistical criterion at the Incheon station were completely similar to those of Busan.Comparison between MARS and CEEMDAN-MARS models indicated that the CEEMDAN-MARS model produces outstanding results (RMSE = 2.659 MJ/m 2 , RRMSE = 0.185, MAE = 2.221 MJ/m 2 , NSE = 0.818, WI = 0.957, and LMI = 0.582) compared with MARS model (RMSE = 3.066 MJ/m 2 , RRMSE = 0.214, MAE = 2.421 MJ/m 2 , NSE = 0.758, WI = 0.948, and LMI = 0.544) for the validation phase at the Incheon station.Therefore, the MARS model has not permissible result than the CEEMDAN-MARS model for DSR forecasting at the Incheon station.
Figure 6 illustrates that the forecasted DSR values using the CEEMDAN-MARS model were much closer to the corresponding observed DSR values than those of MARS model.Energies 2019, 12, x FOR PEER REVIEW 14 of 23 model for DSR forecasting at the Busan station.Results of statistical criterion at the Incheon station were completely similar to those of Busan.Comparison between MARS and CEEMDAN-MARS models indicated that the CEEMDAN-MARS model produces outstanding results (RMSE = 2.659 MJ/m 2 , RRMSE = 0.185, MAE = 2.221 MJ/m 2 , NSE = 0.818, WI = 0.957, and LMI = 0.582) compared with MARS model (RMSE = 3.066 MJ/m 2 , RRMSE = 0.214, MAE = 2.421 MJ/m 2 , NSE = 0.758, WI = 0.948, and LMI = 0.544) for the validation phase at the Incheon station.Therefore, the MARS model has not permissible result than the CEEMDAN-MARS model for DSR forecasting at the Incheon station.

Figure 6 .
Figure 6.Scatterplots of observed and forecasted DSR using the MARS and CEEMDAN-MARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.

Figure 6 . 23 Figure 7 .
Figure 6.Scatterplots of observed and forecasted DSR using the MARS and CEEMDAN-MARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.

Figure 8 .
Figure 8. Scatterplots of observed and forecasted DSR using the SaMARS and CEEMDAN-SaMARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.Figure 8. Scatterplots of observed and forecasted DSR using the SaMARS and CEEMDAN-SaMARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.

Figure 8 .
Figure 8. Scatterplots of observed and forecasted DSR using the SaMARS and CEEMDAN-SaMARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.Figure 8. Scatterplots of observed and forecasted DSR using the SaMARS and CEEMDAN-SaMARS models in the validation phase at Busan (blue color) and Incheon (Green color) stations.Likewise, at Incheon, Figures8 and 9provide that the CEEMDAN-SaMARS model give fewer scattered estimates compared with SaMARS.In addition, the accuracy of the forecasted peak values from CEEMDAN-SaMARS is improved compared with SaMARS model.Therefore, the CEEMDAN-SaMARS model is found to be relatively more suitable for forecasting DSR peak values than SaMARS model in the validation phase in Incheon.

Figure 9 .
Figure 9. Time-series plots between observed and forecasted DSR for SaMARS and SaMARS models for the calibration and validation phases at Busan and Incheon stations.

Figure 9 .
Figure 9. Time-series plots between observed and forecasted DSR for SaMARS and CEEMDAN-SaMARS models for the calibration and validation phases at Busan and Incheon stations.

Figure 10 .
Figure 10.Box-plots of relative error of forecasted of DSR (MJ /m 2 ) using the standalone and hybrid models at (a) Busan (b) Incheon stations in the validation phase.

Figure 10 .
Figure 10.Box-plots of relative error of forecasted values of DSR (MJ/m 2 ) using the standalone and hybrid models at (a) Busan (b) Incheon stations in the validation phase.

Energies 2019 , 23 Figure 11 .
Figure 11.Empirical cumulative distribution function of the absolute forecasted error (MJ/m 2 ) using the standalone and hybrid models at (a) Busan (b) Incheon stations in the validation phase.

2 Figure 11 .
Figure 11.Empirical cumulative distribution function of the absolute forecasted error (MJ/m 2 ) using the standalone and hybrid models at (a) Busan (b) Incheon stations in the validation phase.

Table 1 .
Descriptive statistics of solar radiation data for the Busan and Incheon stations located in South Korea.

Table 2 .
Design parameters of GEP and MARS models for calibration stage at Busan and Incheon stations.

Table 3 .
Statistical performance of standalone and hybrid GEP models at Busan and Incheon stations.
Scatterplots between observed and forecasted DSR values for both stations in the validation phase are presented in Figure4.

Table 4 .
Statistical performance of standalone and hybrid MARS models at Busan and Incheon stations.

Table 4 .
Statistical performance of standalone and hybrid MARS models at Busan and Incheon stations.

Table 5 .
Statistical performance of standalone and hybrid SaMARS models at Busan and Incheon stations.

Table 5 .
Statistical performance of standalone and hybrid SaMARS models at Busan and Incheon stations.