Enhanced Prediction of Solar Radiation Using NARX Models with Corrected Input Vectors

: The main objective of this work is to analyze and conﬁgure appropriately the input vectors to enhance the performance of NARX models to forecast solar radiation one hour ahead. For this study, Engle–Granger causality tests were implemented. Additionally, collinearity among the meteorological variables of the databases was examined. Different databases were used to test the contribution of these analyses in the improvement of the input vectors. For that, databases from three cities of Mexico with different climates were obtained, namely: Chihuahua, Temixco, and Zacatecas. These databases consisted of hourly measurements of the following variables: solar radiation (SR), wind speed (WS), relative humidity (RH), pressure (P), and temperature (T). Results showed that, in all three cases, proper NARX models were produced even when using input vectors formed only with solar radiation and temperature data. Consequently, it was inferred that pressure, wind speed, and relative humidity could be excluded from the input vectors of the forecasting models since, according to the causality tests, they did not provide relevant information to improve the solar radiation forecast in the studied cases. Conversely, these variables could generate spurious results. Forecasting results obtained with the NARX model were compared to the smart persistence model, commonly used to validate SR prediction. Error measures, such as mean absolute error (MAE) and root mean squared error (RMSE), were used to compare prediction results obtained from different models. In all cases, results obtained from the enhanced NARX model surpassed the results of the smart persistence, namely: in Chihuahua up to 11.5%, in Temixco up to 15.7%, and in Zacatecas up to 27.2%.


Introduction
Solar radiation (SR) is the driving force behind several solar energy devices, such as photovoltaic systems for generating electricity and solar collectors for water heating [1]. Therefore, SR prediction models are essential for different applications, from the operation of energy systems (PV and thermal systems) to meteorological applications [2,3].
Several research works' scope was to develop models that allowed predicting SR with acceptable precision. The techniques used in the forecast models varied according to the place and the available data, so it was difficult to classify them adequately. However, they could be divided into three main groups: clarity index models, hybrid models, and artificial neural network (ANN) models. In the first, the variable to model is the atmospheric transmittance or clarity index [4][5][6][7][8]. Hybrid models

Meteorological Variables' Databases
Data from three sites were analyzed, and at each site, there was an available weather station with specific characteristics. The studied sites were chosen based on their solar radiation potential, weather characteristics, and data availability. These meteorological stations were located in different cities of Mexico, namely: Chihuahua, Temixco, and Zacatecas. Figure 1 shows an approximate climate map of Mexico. This figure also shows the locations of the studied sites. Furthermore, Table 1 shows the precise location of the sites and their corresponding weather type. The main characteristics of measurement equipment of the meteorological stations are shown in Table 2.  Three databases containing 17,520 hourly measurements of each variable, that is two full years of data, were consulted; however, due to the databases' restrictions, different years were employed. In the case of Chihuahua, four meteorological variables were monitored: solar radiation, temperature, relative humidity, and pressure (SR, T, RH, and P); from January 2014 to December 2015. In the case of Temixco, five meteorological variables were measured: SR, T, RH, P, and wind speed (WS); from January 2013 to December 2014. Finally, for the case of Zacatecas, four meteorological variables were monitored: SR, T, RH, and WS; from January 2006 to December 2007.

Techniques and Methods Used in the Analysis
This section describes the applied techniques to select the data that formed the input vectors of the NARX models to forecast SR.

Collinearity Tests
Two variables are defined as collinear if the data vectors that represent them are on the same line (i.e., subspace of dimension one). Generally, the k variables are collinear if the vectors that represent them are in a subspace of dimension smaller than k, that is if one of the vectors is a linear combination of the other vectors. In practice, such "exact collinearity" rarely occurs, but this is certainly not necessary to consider that a collinearity problem exists [24].
Usually, an analysis of the principal components of the independent variables is performed to detect the collinearity. The principal components of a set of variables to other variables are referred to as the linear combination of the original variables. The principal components have three properties: (a) they are mutually independent (they are not correlated with each other); (b) they maintain the same information as the original variables; (c) they have the maximum variance with the above limitations. The variance of each principal component is an eigenvalue of the variance-covariance matrix of the original variables.
The number of null eigenvalues indicates the number of variables that are a linear combination of others (the number of exact collinearities). The eigenvalues near zero indicate serious collinearity problems.

Engle-Granger Causality Test
The fundamental aim of the co-integration tests is to know how much useful information the input variables provide to the model. This information is used to decide which variables should be part of the forecasting model. One of the most popular co-integration tests is the Engle-Granger causality method.
The causality test avoids the generation of spurious results, with a certain level of precision (in this study, a confidence level of 1% was used). By eliminating unnecessary variables from the forecast model, the inclusion of meteorological variables that do not provide any significant information could be avoided; such variables may affect the model performance in unexpected or unwanted ways. These results are the spurious phenomenon of the nonsense regression explained by Yule [25]. Yule proved that spurious correlation could persist in non-stationary time-series, even if the sample is very large.
The Engle-Granger causality tests propose a hypothesis diagnostic, where the null hypothesis rejects the causality approach of one variable against another variable, that is the independent variable does not have useful information to predict the dependent variable. The Engle-Granger causality tests first assumes that the variable X does not cause the variable Y. If this null hypothesis is rejected, then the variable X causes the variable Y, that is the variable X contains relevant information to predict the variable Y. The causality tests were carried out by solving Equations (1) and (2).
where X is, for example, the solar radiation, Y is the air temperature, u 1t is the uncorrelated white noise, and n is the number of lags. α i , β j , λ i , and δ j are parameters to be determined. Similarly, using the auto-regressive vector technique, it is possible to elaborate tests for several variables [25]; such as wind speed, relative humidity, etc.
Since the results of the causality test depend on the number of lags, these tests were conducted using 2, 3, ..., and 24 lags. The results were considered acceptable when some of the unnecessary variables were rejected after modifying the number of lags, due to the main objective of this work being fulfilled: to develop SR forecasting models with a reduced number of input variables.

Augmented Dickey-Fuller Test
Many statistical tests that help to determine if a time-series is or is not stationary exist. The augmented Dickey-Fuller (ADF) test was implemented in this work to identify the stationarity of the time-series. This test is based on the null hypothesis, which supposes that an analyzed time-series is non-stationary. Still, if this first hypothesis is rejected, then the time-series is considered stationary.
On the other hand, unit root tests are based on the study of stochastic time-series; in this case: pure random walk, random walk with drift, and random walk with drift and deterministic trend. Furthermore, the random walk with drift and deterministic trend model was proposed with an additional term, which considered whether the term u t was correlated or not [25], as can be seen in Equation (3).
where β 1 is the drift, β 2 is the deterministic trend term, δ is the term that indicates if the time-series is stationary, t is a pure error term, and

Non-Linear Auto-Regressive Model with Exogenous Inputs
Consider a recurrent neural network with only one input and one output, whose behavior is described by Equations (4) and (5). Given this state-space model, modifying it into an input-output model as an equivalent representation of the neural network is necessary.
Using Equations (4) and (5), it can be easily demonstrated that the output y(n + q) could be expressed in terms of the state x(n) and the input vector u q (n) as follows: where q is the dimensional form of the state-space model and Φ : 2q → . Provided that the recurrent network is observable, the local observability theorem can be used to write: where Ψ : 2q−1 → q . Hence, substituting Equation (7) in Equation (6), it is obtained: where u q−1 (n) is contained in u q (n) as its first elements (q − 1), and the nonlinear mapping F : 2q → takes care of both φ and ψ.
Using the definitions of y p (n) and u q (n) given in Equations (9) and (10), we may rewrite Equation (8) as the expanded form: Replacing n with n − q + 1, then: Expressed in words, some non-linear mapping F : 2q → exists whereby the present value of the output y(n + 1) is uniquely defined in terms of its past values y(n), · · · , y(n − q + 1) and the present and past values of the input u(n), · · · , u(n − q + 1). For this input-output representation to be equivalent to the state-space model of Equations (4) and (5), the recurrent network must be observable. The practical representation of this equivalence is that the NARX model (see Figure 2a, with its limited feedback towards the output neuron, is in fact able to simulate the corresponding fully recurrent state-space model of Figure 2b (assuming that m = 1 and p = 1), with no difference between their input-output behavior [26]. (a)

Smart Persistence Model
The smart persistence model was used as a benchmark. Despite the simplicity of this model, it can be very accurate for low variability periods. Therefore, the smart persistence model is useful to compare to other models. This model is defined as [27]: It is well known that solar radiation suffers losses due to dispersion and atmospheric absorption when it passes through the atmosphere [28]. After the atmospheric absorption phenomenon, the normal solar flow rate (solar radiation/normal irradiance) that reaches the Earth's surface is calculated with Equation (14), found in the Handbook of Solar Energy [28], where it is also mentioned that SR can be another form of normal irradiance.
and: cos θ z = cos φ cos δ cos ω + sin δ sin φ, where θ z is the solar zenith angle defined as the angle between the vertical and the line connecting to the Sun, I ext is the extraterrestrial irradiance per hour, T R is the turbidity factor, φ is the latitude of the site, δ is the declination angle of the Earth, and ω is the hour angle. The extraterrestrial hourly mean irradiance (W/m 2 ) is calculated by the following equation [29]: where E 0 is the eccentricity correction factor, I sc = 1367 W/m 2 is the solar constant, and ω 0.5 is the hour angle at the half hour.

Solar Radiation Estimation under Clear Sky Conditions, SR clr
To have a reference and a constraint for the calculation of smart persistence, a monthly ideal turbidity factor was calculated for each site by solving Equation (14) for T R : where (SR) month is the maximum value of the month of solar irradiance measured at each site, (I ext ) month is the maximum value of solar extraterrestrial radiation for each month at each site, and (cos θ z ) month is the monthly value used to calculate (I ext ) month according to [28]. Tables 3-5 show the obtained results from Equation (17) of the turbidity factor, T R , corresponding to Chihuahua, Temixco, and Zacatecas, respectively.  It is worth mentioning that it was only necessary to calculate T R once for each month since the maximum value measured in each month of both SR and I ext was used. The maximum ideal SR could be calculated using a single T R under optimal conditions at each site per month. The goal was to use that SR calculated value as a constraint for the smart persistence model so that it never exceeded such a value.
Note that the results for the turbidity factor were obtained under the following hypotheses: • at least one day of each month had clear sky conditions. • SR obtained under clear sky conditions was calculated by considering the maximum irradiance of every month. Figure 3 presents the synthetic series of maximum ideal SR clr generated using Equation (14) for a typical year, where the monthly T R presented in Tables 3-5 was used. It should be noticed that 8760 computed data of SR clr were plotted from hourly known values of I ext and θ z for each site, which were adjusted by T R for each month at each site, specifically. This figure also shows the first 744 h corresponding to January, where 31 peaks can be easily observed, one for a day. Several interesting phenomena were observed when reviewing the entire synthetic series: first, it seemed that a relatively continuous area was drawn; however, it was not: it was a line graph that, being a series of hourly SR clr , had a cyclical shape with very close waves; second, 12 steps appeared, due to the effect of T R , which scaled the results and changed every month; and finally, a trend was observed in the data that corresponded to the annual seasonal cycle where the summer months received more SR clr . It is important to highlight that T R produced a peculiar behavior in Zacatecas since there was a significant gap from June to August compared to the rest of the year.

Performance Tests
Two standard error metrics widely used in the solar prediction area were implemented: the root mean squared error (RMSE) and the mean absolute error (MAE).
where (SR) f is the forecasted solar radiation and (SR) m is the measured solar radiation. Additionally, to compare the relative improvement of the different models with respect to the smart persistence model, the forecast skill parameter was calculated as defined in [23].
where the subscripts N ARX and SP denote the NARX model and the smart persistence model, respectively.

Selection of Input Variables and the Structure of the NARX Model
This section explains how the collinearity and co-integration tests were applied to assemble the input vectors of the prediction model. Afterwards, the general architecture of a trained NARX model is presented.

Collinearity Tests
When a variable has a linear combination with another one, it means that both variables are related through a linear expression. When the linear correlation coefficient between the variables is equal to unity, the variables have a simple collinearity. In the case of multiple variables, collinearity occurs when some independent variables are correlated with each other. One way to detect collinearity between multiple variables is through principal component analysis of the independent variables. The variance of each principal component is an eigenvalue of the variance-covariance matrix of the original variables. The null eigenvalues indicate the number of variables that are linear combinations of others (exact collinearities). Once the presence and number of collinearities are determined in a multivariate time-series, it is convenient to find the involved variables through the calculation of their variance-decomposition proportion for each component; Belsley et al. [24] proposed using a proportion threshold of about 0.5.
Collinearity tests were used to assemble the input vectors of the NARX models correctly. Figure 4 shows the collinearity tests performed on the multivariate time series of the three studied sites, with the available variables in each site, namely:

•
In the case of Chihuahua, the temperature and the pressure were above the threshold of the variance-decomposition proportion; see Figure 4a. Thus, in the next stage, the collinearity tests were performed omitting one of these variables, obtaining two new combinations of input vectors. The first input vector was assembled using SR, RH, and T. The second vector was constructed using SR, RH, and P.

•
In the case of Temixco, temperature, relative humidity, and pressure showed collinearity; see Figure 4b. Therefore, three new input vectors were obtained formed as follows: SR, WS, and T variables for the first vector; SR, WS, and RH variables for the second vector; and SR, WS, and P variables for the third vector.

•
In the case of Zacatecas, the temperature significantly exceeded the threshold of 0.5; see Figure 4c. Therefore, two new input vectors were obtained in combination with the temperature, i.e., SR, RH, and T; and SR, WS, and T. Tables 6-8 show the proposed input vectors for Chihuahua, Temixco, and Zacatecas, respectively. In all cases, Vector 1 was formed using all available meteorological variables. Vectors 2 and 3 (and 4 for Temixco) were obtained from the results of the collinearity tests. Finally, Vector 4 (5 for Temixco) was the result of the application of collinearity and integration tests.

Engle-Granger Causality Tests
Cointegration tests to identify redundant variables of the input vector built from collinearity test were applied. In the case of Chihuahua, the collinearity test determined two different combinations of input vectors: V 2 = [SR, RH, T] and V 3 = [SR, RH, P]; which were formed using separately the variables that presented collinearity. Then, the causality test was applied to V 2 , identifying the RH as a redundant variable, so the RH was eliminated from the this input vector, obtaining V 4 = [SR, T] of Table 6. For V 3 , the test did not register any redundant variable, so the input variables remained the same.
In the case of Temixco, the causality tests were applied to three groups of input variables:  Table 7 was defined.
Finally, in the city of Zacatecas, it was found that only T showed collinearity, and two input vectors were formed: The causality tests determined the relative humidity as a redundant variable in V 2 . Therefore, the new input vector V 4 = [SR, T] was formed; see Table 8.

NARX Models Including Collinearity Tests
Once the input vectors were determined, the NARX models were developed for each of the cases shown in Tables 6-8. When all the meteorological variables were used (i.e., vectors V 1 ), as well as vectors obtained from the causality tests, the NARX models were performed using the diagram of Figure 5. The flowchart initiated when input variables were entered. Then, the NARX model was trained, and finally, the performance tests were calculated. The variables were normalized with values from zero to one, before the NARX model was trained. To normalize the variables, the subroutine shown in Figure 5b

NARX Models Including Collinearity and Engle-Granger Causality Tests
The main objective of this research work was to build the best NARX model, based on the proper input vectors. The flowchart of Figure 6 shows the steps that were followed to define the relevant variables in the solar radiation prediction using collinearity and causality tests. In the case of Chihuahua, V 1 included all the variables. Then, V 2 and V 3 were obtained, applying the collinearity test. Finally, the causality test was applied on both input vectors, showing a difference only in V 2 . This test indicated that the RH did not cause SR. Therefore, the new input vector was formed as V 4 = [SR, T]. In the case of Temixco, the causality test was used to build V 2 , V 3 , and V 4 . The causality test determined that the WS did not cause SR in V 2 , while in V 3 and V 4 , there was no change, that is the input vectors remained the same. Therefore, the new input vector was formed as V 5 = [SR, T]. For Zacatecas, the causality test was implemented in V 2 and V 3 . From the application of this test, only V 2 determined that the HR did not cause SR, while in Case 3, no change was registered in the input variables, resulting in According to the obtained results from the collinearity and causality tests' application, the proper input vector for each site only included SR and T, that is one non-collinear variable and one collinear variable.
Although preliminary results indicated that the best input vector only included SR and T, NARX models were trained for each of the cases from Tables 6-8 to compare all NARX models through the performance tests, in which it was expected that those NARX models with the best input vector would obtain the best forecast performance. The first year of the available data was used as the training data. For the performance tests, the SR forecast horizon was one hour ahead. These results were compared with the second year of the data measured in each of the sites, i.e., a blind forecast was calculated.

Number of Time Lags of the NARX Model
First, the variance and the mean were stabilized through the cubic root of the solar radiation, i.e., Y t = SR 1/3 . Then, it was differentiated using 24 lags, as dY t = (1 − B 24 )Y t . Figure 7a shows the transformation of the solar radiation time-series resulting from the application of these operations. After that, the autocorrelation function (ACF) and the partial autocorrelation function (PACF) were obtained from the stabilized solar radiation time-series. Figure 7b shows the sinusoidal behavior of the ACF, which was a clear sign of the presence of seasonality in the SR times-series. At the same time, the PACF displayed peaks every 24 h, which indicated that the time-series presented daily evolution; see Figure 7c. Therefore, the NARX models were tested with different numbers of time lags until reaching 24 lags. The results showed that the best predictions were obtained when 24 lags were used.   Figure 8a. Figure 8b,c shows how the input variables were previously treated with the collinearity test, resulting in two input vectors built with three variables each. Because the temperature and the pressure were defined as collinear, both variables were separated from the original model, and two new input vectors were generated: in the first one, the temperature was combined with solar radiation and relative humidity, x(t) = [SR, RH, T]; while in the second one, the pressure was combined with solar radiation and relative humidity, x(t) = [SR, RH, P]. In this case, Figure 8d was not used; it was only applicable in the case of Temixco. Finally, Figure 8e shows the performed procedure on the input variables, before applying the NARX model. The four original meteorological variables were treated with the collinearity and the causality tests. Therefore, the inlet vector was reduced from four variables to two variables, namely x(t) = [SR, T]. Similarly to the Chihuahua case, the NARX model that included all meteorological variables available in the site as inputs is presented in Figure 8a, where x(t) = [SR, WS, RH, T, P]. According to the collinearity tests, the temperature, the relative humidity, and the pressure showed collinearity. Consequently, these variables could not be combined to build the input vectors. Figure 8b-d shows how the input vectors were generated using the collinearity tests, namely WS, RH], and x(t) = [SR, WS, P], respectively. Finally, in Figure 8e, the solar radiation and temperature were considered as input variables to the NARX model. This vector was obtained from the application in the collinearity and causality tests. As can be seen, the results of the collinearity and the causality tests were not included for the cases V 3 = [SR, WS, RH] and V 4 = [SR, WS, P]; this was because, when applying the causality test to these cases, the results did not suggested any change. Nevertheless, the only change observed occurred when the causality test was applied to V 2 = [SR, WS, T]; where, according to the causality test, WS did not cause SR. Therefore, in this particular case, the input vector was composed as

Artificial Neural Network setup
According to the collinearity test, the only variable that presented collinearity was the temperature. Therefore, there was no restriction in combining T with another variable. Figure 8a shows the NARX architecture using all available variables for the site of Zacatecas, that is x(t) = [SR, T, RH, WS]. Figure 8b,c shows the inlet vectors resulting from the collinearity tests, namely x(y) = [SR, T, RH] and x(y) = [SR, T, WS], respectively. When applying the causality tests to the combination of variables [SR, T, RH], the results indicated that the three variables should be used for the SR prediction. Nevertheless, when applying the causality test to [SR, T, WS], the results indicated that WS did not cause SR, that is WS should not be used for the SR forecast. In Figure 8d, the architecture used for the forecast of solar radiation can be seen, which only had SR and T as input variables.
In all cases, the number of neurons used in the hidden layer of the NARX models was 10. The number of lags was obtained using the autocorrelation function and the partial autocorrelation function. The Levenberg-Marquardt algorithm was used as the training algorithm.

Results and Discussions
The results were obtained using two specialized software packages: EViews 9.0 [30] and MATLAB 2015a [31]. The first was used to perform the cointegration tests thanks to its versatility and the graphical user interface. The second was used for the normalization of the inlet vectors, the NARX models' generation, and the collinearity tests.
The forecast results were compared to each other using forecast error measures to verify which configuration of the NARX model had the best performance. Additionally, to have a different comparison pattern, a solar radiation forecast based on the smart persistence model was generated.

Performance tests
The results of the performance tests obtained from Equations (18)- (20) are shown in Tables 9-11.  Table 9 shows the results obtained for Chihuahua. In the first case, all the available variables in the site were included, this case being where the most significant errors were observed. Once the collinearity and causality techniques were applied, the input vector was reduced to only two variables: SR and T. The collinearity and causality tests restricted the inclusion of every variable that was correlated in the system; for example HR was correlated with T and P; therefore, in our study, only one variable should be incorporated to avoid using redundant information. Finally, it is highlighted that the NARX model that included the collinearity and causality tests showed a better fit than the intelligent persistence model. Tables 10 and 11 show the obtained results for Temixco and Zacatecas, respectively. In both cases, the best forecast result was obtained when only SR and T were used to form the input vectors. Therefore, it can be concluded that solar radiation forecasting improved when the collinearity and causality tests were applied to determine the input vector of the NARX model.
It should be noted that, since three study cases were a limited sample to have a conclusive result, the input vectors obtained with the procedure described in this work for other cases could not be SR and T. Thus, the results obtained in this work, concerning the delimitation of the input variables of the NARX models, did not represent a general rule. Furthermore, another determining factor in the conformation of the input vector, other than collinearity and causality tests, was the historical data, on which the training of the forecasting model depended.
Tables 9-11 also show the calculation of the forecast skill from Equation (20). It can be observed that the best result was obtained for Zacatecas, then for Temixco, and finally, for Chihuahua. They showed the forecast skill intervals of 27.2%, 15.7%, and 11.5%, respectively.

Linear Regression Analysis
Linear regression analysis was applied to better visualize forecast accuracy. The results of this analysis are shown in Figure 9, where the x-axis represents the solar radiation forecasting, the y-axis represents the real data, and the blue line is the regression line. This figure shows that the correlation between the real and the predicted data was between 0.94 and 0.97, which indicated that the models were very competitive.
For the first two sites, Chihuahua and Temixco, the regression line was almost 45 • , while the third site had a slightly more pronounced slope. These results indicated that there was a concordance between the SR forecast and the real data. Furthermore, in this figure, it can be appreciated that the data were not very dispersed in any of the graphs, another indication that the forecast results were good.  Figure 10 shows a comparison between the solar radiation measured data and the forecasting results obtained from the NARX model and the smart persistence model. Figure 10a presents the results for Chihuahua from 1 January 2015 to 7 January 2015. The RMSE for the NARX model for these seven days was 5.3 W/m 2 , while it was 7.3 W/m 2 for the smart persistence model. This chart shows a better fit of the NARX model, especially when the curves did not present a sudden change of direction. When this occurred, both models showed problems fitting to real data; therefore, it was necessary to calculate the prediction error to identify the best-fitting forecast model. Figure 10b shows the comparison of the last seven days of December 2014 of the solar radiation measured data, NARX model, and smart persistence model for Temixco. The RMSE of the NARX model was 4.5 W/m 2 , while 6.1 W/m 2 for the smart persistence model. In this case, both models presented a good performance, with the NARX model being slightly better.  Figure 10c shows the comparison between real and prediction data for 6 May 2007, to 12 May 2007, for the city of Zacatecas. In this case, the RMSE of the NARX model was 7.02 W/m 2 , while 10.16 W/m 2 for the smart persistence model. The measured data had an irregular behavior, which caused the NARX model to underestimate the values of SR. In contrast, the smart persistence model overestimated it. Consequently, once again, it was necessary to know the measurement of the prediction error to determine which was the best fit.

NARX Models versus the Smart Persistence Model
Although the graphs did not show much difference between the NARX model and the smart persistence model, the performance tests that were performed for the seven days analyzed in each case indicated that the NARX model was better than the smart persistence model in all cases.

Conclusions
Solar radiation is the driving force behind several solar energy devices, such as photovoltaic systems for generating electricity and solar collectors for water heating [1]. Many researchers are interested in proposing and improving forecasting results through new techniques of ANN. In this paper, two tests were applied to three databases to achieve the best variable selection of the input vector of NARX models to improve the SR prediction: the collinearity and causality tests.
In the Chihuahua and Zacatecas cases, there was a clear improvement when applying both tests, nevertheless, in Temixco, there was quite similar performance when the NARX model with V 2 = [SR, T, WS] and the NARX model with V 4 = [SR, T] were compared. However, the best model used fewer data by eliminating one variable to achieve similar results as the NARX model with vector V 2 . According to the obtained results, the NARX models that showed the best performance for the solar radiation forecasting one hour ahead were those in which the collinearity and causality tests were applied to determine the proper variables to form the input vectors. Additionally, the input vectors were reduced to use only the significant variables in the forecasting models. The results indicated that, in all studied cases, the solar radiation and temperature were the best combinations of the input variables, which provided better prediction results by using the NARX models.
The prediction results obtained from the NARX models were compared to a benchmark model, the smart persistence model. This model was used to forecast the solar radiation one hour ahead, establishing a relationship between the global solar radiation and solar radiation under clear sky conditions. The solar radiation calculation under clear sky conditions was obtained employing the monthly maximum solar radiation and the monthly maximum extraterrestrial solar radiation. The turbidity coefficient was determined with these monthly values, which is usually obtained through specialized measurement instruments. The solar radiation forecast skill, %Forecast skill , was calculated from the results of the NARX model and the smart persistence model, obtaining 11.5%, 15.7%, and 27.2% for Chihuahua, Temixco, and Zacatecas, respectively.
The next steps that this research envisages include the expansion of the forecast horizon, i.e., models capable of forecast solar radiation with the least uncertainty for 6, 12, 24, and 48 h ahead. Besides that, numerical models such as the Weather Research and Forecasting (WRF) model could be explored. Additionally, hybrid models involving this kind of numerical model with artificial intelligence and statistical models could be employed, which provide the numerical forecast with the linear and non-linear characteristics of the phenomenon.