Modelling and Prediction of Monthly Global Irradiation Using Different Prediction Models

Different prediction models (multiple linear regression, support vector machines, artificial neural networks and random forests) are applied to model the monthly global irradiation (MGI) from different input variables (latitude, longitude and altitude of the meteorological station, month, average temperatures, among others) for different areas of Galicia (Spain). The models were trained, validated and queried using data from three stations, and the best model of each type was checked at two independent stations. The results confirmed that the best methodology is the ANN model, which presents the lowest RMSE values in the validation and querying phases, 1226 kJ/(m²·day) and 1136 kJ/(m²·day), respectively, and predicts adequately for the independent stations, with 2013 kJ/(m²·day) and 2094 kJ/(m²·day), respectively. Given these good results, it is worthwhile to continue designing artificial neural networks for the analysis of monthly global irradiation.


Introduction
Solar radiation influences all of Earth's processes related to the environment, plant growth and even the development of human activities [1]. At ground level, solar radiation data are important for a large number of applications related to agricultural hydrology, plant growth and others [1]. In addition, global solar irradiation is a significant parameter in renewable energy applications (for example, to size and model photovoltaic systems) [2].
Global solar irradiation measurements can be obtained using specific devices [3], which are limited to a small number of meteorological stations, probably due to their high cost and other inconveniences such as the need for regular calibration and maintenance [3,4]. Besides this, these data may not be accessible because the meteorological observatories that record series of solar irradiation measurements are still sparsely distributed, and spatial interpolation of these data is sometimes problematic in areas of intricate orography [5]. According to Notton et al. [6], there are only some 1000 continental stations in the world that can measure solar radiation (a number reported by Notton et al. from the World Radiation Data Center (WRDC) [7]), although this number is probably higher at present.
According to different authors, the shortage, difficulties and uncertainties of these measurements can be overcome by estimating solar irradiation from other, more abundant variables (climatological properties) such as cloudiness, among others [3,5].
Taking into account the increase in energy demand and consumption worldwide, and the search for alternatives in the face of decreasing fossil fuel reserves [8], it can be understood that determining the global solar irradiation can be very important for solar energy conversion systems. Solar photovoltaic energy has grown significantly in recent years due to its cost reduction [9]. In this energetic context, the Spanish climatic

• Multiple linear regression can be used in environmental science to determine fine particle (PM1) concentration using environmental, meteorological and physical eventualities variables [14], or to determine diffuse pollutant discharge using different environmental parameters of the basin [15].
• Support vector machines can also be used in different fields such as bioinformatics to identify single-nucleotide polymorphisms [16], to predict dihedral angle regions [17] or to forecast interface residue pairs of protein trimers [18].
• Artificial neural networks can be used in food science to model and optimize the extraction of cashew apple juice [19], to optimize an enzymatic approach to obtain modified artichoke pectin and pectic oligosaccharides [20] or, combined with a hyperspectral camera, to determine the rate at which broccoli buds lose their green color [21].
• Finally, random forests can be used in economics for the early prediction of university dropouts [22] or to predict clean energy stock prices [23].
These types of approximations can be carried out independently or, as in some studies, simultaneously so that their predictive capacity can be compared. This is the case of the study carried out by Torkashvand et al. [24], who compared two of these models, MLR and ANN, for the prediction of kiwifruit firmness after six months using different input variables (nutrient concentrations alone, combinations of nutrient concentrations, among others) and showed that, in general, the ANN model had greater potential to determine six-month kiwifruit firmness [24]. On the other hand, Niu et al. [25] compared three of the kinds of models proposed in this research (multiple linear regression, artificial neural network and support vector machine) in the hydropower operation field and concluded that the artificial neural network and the support vector machine provide better performance than the MLR model [25]. A similar treatment can be seen in the research carried out by Lee et al. [26], who developed simple/multiple LR, RF and SVM models to predict canopy nitrogen weight in corn using multispectral images obtained by an unmanned aerial vehicle. The authors concluded that the RF models presented the best results for the validation set and verified that using more spectral variables improved the accuracy of the model but lengthened the overall processing time [26].
The aim of this research was to develop different prediction models (multiple linear regression, artificial neural networks, support vector machines and random forests) to model the monthly global irradiation (MGI) from three meteorological stations located in the Autonomous Community of Galicia (Spain) and then generalize the knowledge to two other nearby stations. These models will allow the monthly global irradiation to be determined in different areas of Galicia based on the meteorological variables used, thus making it possible to obtain the value of this variable in places where it has not previously been measured, which could facilitate the development of photovoltaic installations. This work is the beginning of a more ambitious study to model the monthly solar irradiation and subsequently predict it one month in advance. This work is a summary of the final degree project developed by the first author of this research [27].

Related Works
In this work, four models have been chosen to predict the monthly global irradiation. Multiple linear regression models have been chosen because they are probably the simplest and fastest models that can be developed to try to determine the monthly global irradiation. The models based on artificial neural networks have been chosen because our Department has several years of experience applying these models to different areas of science such as hydrology, palynology, etc., and, based on this accumulated experience, these models offer good results. Furthermore, Diez et al. [1] reported, according to the review of Qazi et al. [28], that this type of approach to determine solar radiation is usually accurate and shows errors of less than 20% (depending on the input data and the different architectures of the neural networks developed). It must also be taken into account that this error percentage differs depending on whether the goal is a solar irradiation model (which can be used to determine the MGI values in stations where this variable has not been recorded, as in our research) or a prediction model, for example, three weeks ahead. Nevertheless, ANNs have the disadvantage of their high computational cost and the time required to obtain the model. The other two models (SVM and RF) were developed because, in terms of computational time, they generally fall between the multiple linear regression model and the artificial neural network model, and they show relatively good results according to the literature.
Related to this study, these kinds of models can be used, together or separately, for different purposes.

• Multiple linear regression models can be used to predict the net radiation using meteorological data such as global solar radiation, temperature, relative humidity, etc. [29]. The researchers developed eight different equations to estimate the daily net radiation, and the results showed good adjustments and low errors on a daily scale, especially in the models that include relative air humidity, temperature, solar radiation and the inverse of the distance between the Earth and the Sun. Despite the simplicity of the multiple linear regression models, the authors reported good adjustments compared to the Rn FAO 56 OM model, which allows the conclusion that the MLR models developed are an alternative for improving the evapotranspiration estimation.
• According to Diez et al. [1], artificial neural networks have been used to predict the solar irradiation over different time windows (hourly, daily and monthly) from different meteorological variables (temperature, atmospheric pressure, among others), or even including geographical coordinates such as latitude, longitude and altitude. In this sense, these authors developed ANNs to predict the global solar irradiation of the following day using data from one agrometeorological station located in Mansilla Mayor (León, Castilla y León). The authors concluded that artificial neural network models provide better results than classical methods and require fewer input variables [1]. These kinds of models can also be used to determine the average monthly, the average weekly and the daily global solar radiation in Fortaleza (Brazilian Northeast region), using a 14-year-long data set to train three different ANN models [3]. ANNs can also be used to determine different parameters such as the global horizontal irradiation (from meteorological data) and the global tilted irradiation (from the horizontal global irradiation and others), and to forecast the hourly direct normal and global horizontal irradiation over horizons from one to six hours [6].
• Support vector machine models can be used to generate the daily global solar irradiation using a general (non-locally dependent) model [9]. The model (which used temperatures, wind speed, relative humidity and rainfall, among other variables) presented a high capacity of generalization for the different studied locations and improved, in terms of mean absolute error, on the locally trained models in some locations [9]. SVM models can even be used to forecast photovoltaic power (and be compared with other models) [30].
• Random forests can be used to estimate the solar radiation using the air pollution index at three different sites [31], or to forecast solar radiation, with their results compared with other methods such as multivariate adaptive regression splines (MARS), classification and regression tree (CART) and M5 [32].
In many research articles, it is also possible to see comparisons between these kinds of models to predict solar irradiation and even other interesting variables related to the subject under study.

• MLR and ANN models can be compared in the estimation of monthly-average daily solar radiation over different locations in Turkey [33]. Different variables (latitude, longitude, altitude, land surface temperature and month) were used as input variables. According to the authors, the results showed that the ANN model could obtain good performance compared to the multiple linear regression model.
• SVM and ANN models were used in a comparative study of different methods carried out by da Silva et al. [34] to estimate the daily global solar irradiation. Four different kinds of architecture combining different input parameters were studied. According to the authors, statistical indicators showed that the SVM technique performs better than ANN models for the study location (Botucatu/SP/Brazil). Neural models can be compared to random forest models (among other models) to forecast the normal beam, horizontal diffuse and global components [35]. SVM, ANN and deep neural network models can even be used to forecast photovoltaic power [30], to estimate electricity demand (using multiple linear regression, artificial neural network and support vector machine, among others) [12] or to estimate the surface downward longwave radiation (using ANN, SVR and RF, among others) [36].
• Random forests used to model the daily variability of solar irradiance can be compared to other methods such as multiple linear regression, obtaining the better results of the two [37].

Study Area
According to Vázquez [38], Galicia can be divided into four climatic zones based on their solar radiation. To carry out this research, five meteorological stations were selected, all of them belonging to climatic zone II. This zone is characterized by values between 13.7 MJ/m²·day (3.8 kWh/m²·day) and 15.1 MJ/m²·day (4.2 kWh/m²·day) [38]. The selected meteorological stations were: (i) Amiudal in the municipality of Avión, (ii) Serra do Faro in Rodeiro, (iii) Monte Medo in the municipality of Baños de Molgas, (iv) Ourense-Estacións in the city of Ourense and (v) Pazo de Fontefiz in Coles (Figure 1). The meteorological stations were selected taking into account the conditions and the quantity of available data, in order to create useful and accurate models for the prediction of MGI.
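The paired values quoted for climatic zone II can be checked with a short unit conversion. The sketch below assumes only the standard equivalence 1 kWh = 3.6 MJ; the function and variable names are illustrative and do not come from the original work.

```python
# Assumption: standard energy equivalence, 1 kWh = 3.6 MJ.
MJ_PER_KWH = 3.6


def mj_to_kwh(mj_per_m2_day: float) -> float:
    """Convert a daily irradiation from MJ/m2·day to kWh/m2·day."""
    return mj_per_m2_day / MJ_PER_KWH


def kj_to_mj(kj_per_m2_day: float) -> float:
    """Convert a daily irradiation from kJ/m2·day to MJ/m2·day."""
    return kj_per_m2_day / 1000.0


# Bounds of climatic zone II as quoted in the text (MJ/m2·day).
zone_ii_bounds_mj = (13.7, 15.1)
zone_ii_bounds_kwh = tuple(round(mj_to_kwh(v), 1) for v in zone_ii_bounds_mj)
print(zone_ii_bounds_kwh)
```

The same helpers relate the RMSE values reported later in kJ/(m²·day) to the MJ scale used here.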

Database
The database was obtained from the Meteogalicia website [40], which provides the meteorological data for the selected stations. The periodicity of the data was monthly, which reduces the volume of handled data and, therefore, the computational cost of modelling.
The selected variables, in addition to the MGI (10 kJ/(m²·day)), were: (i) latitude, (ii) longitude and (iii) altitude (m) of the station, (iv) month order, (v-vii) average, average of the maximum and average of the minimum temperatures (°C); (viii-x) average, average of the maximum and average of the minimum relative humidities (%) and (xi) precipitation (L/m²).
Three meteorological stations, Amiudal, Serra do Faro and Monte Medo, were used to train (2005-2012), validate (2013-2015) and query (2016-2018) the models. The other two stations, Ourense-Estacións and Pazo de Fontefiz, were used to check the behaviour of the models in locations other than the previous ones, that is, the knowledge generated at three stations is extrapolated to new locations. For these two stations, the data used cover the period between 2012 and 2018.
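The chronological split described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the record layout (dictionaries with hypothetical `station`, `year` and `month` keys) is an assumption, and the records are empty placeholders rather than Meteogalicia data. Only the year ranges come from the text.

```python
from typing import Dict, List, Optional, Tuple

# Inclusive year ranges for each phase, as stated in the text.
PHASES: Dict[str, Tuple[int, int]] = {
    "training": (2005, 2012),
    "validation": (2013, 2015),
    "querying": (2016, 2018),
}


def phase_of(year: int) -> Optional[str]:
    """Return the phase a year belongs to, or None if it is outside all ranges."""
    for name, (first, last) in PHASES.items():
        if first <= year <= last:
            return name
    return None


def split_by_phase(records: List[dict]) -> Dict[str, List[dict]]:
    """Group monthly records by phase using their 'year' field (hypothetical key)."""
    splits: Dict[str, List[dict]] = {name: [] for name in PHASES}
    for record in records:
        phase = phase_of(record["year"])
        if phase is not None:
            splits[phase].append(record)
    return splits


# Placeholder monthly records for one station (no measurements attached).
records = [
    {"station": "Amiudal", "year": year, "month": month}
    for year in range(2005, 2019)
    for month in range(1, 13)
]
splits = split_by_phase(records)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting by year rather than at random keeps the querying data strictly later than the training data, which matches the way the phases are described.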

Implementation of Models
As previously stated, four different kinds of models were developed: (i) multiple linear regressions, (ii) artificial neural networks, (iii) support vector machines and (iv) random forests. Different combinations of the available variables (Table 1) were used to determine the MGI and to study the influence of temperatures, humidities and precipitation. The geographic coordinates and the month of the year were selected for all the models.

Table 1. Variables, and their combinations, used to develop the different models: (i) latitude (Lat), (ii) longitude (Long), (iii) altitude (Alt), (iv) month, (v-vii) average (Tav), average of the maximum (Tav-max) and average of the minimum temperature (Tav-min); (viii-x) average (RHav), average of the maximum (RHav-max) and average of the minimum (RHav-min) relative humidity and (xi) precipitation (P).
Type 1  Type 2  Type 3  Type 4  Type 5  Type 6  Type 7


MLR Models
Multiple linear regression analysis is a conventional method that relates different independent variables to a dependent one [41]. This method provides a linear input-output model for a specific data set [42]. Unlike simple regression analysis, MLR analysis is closer to real situations because the phenomena are complex and must be explained using the different variables that intervene in them [43].
It can be expressed mathematically as follows (Equation (1)):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon \qquad (1)$$

where y is the desired variable, β0 the constant, β1-βn the regression coefficients, x1-xn the input variables and ε the error.
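As an illustration of how the coefficients of such a model can be obtained, the following sketch fits β0 to βn by ordinary least squares through the normal equations. It is an assumption-laden example, not the software used in this research: the helper names are hypothetical and the data are synthetic and noise-free.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_mlr(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Return [b0, b1, ..., bn] minimising the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 carries the constant b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(beta: Sequence[float], row: Sequence[float]) -> float:
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one case."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic data generated from y = 2 + 3*x1 - x2 (no noise).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [2.0 + 3.0 * x1 - x2 for x1, x2 in rows]
beta = fit_mlr(rows, targets)
print([round(b, 6) for b in beta])
```

With noise-free data the fit recovers the generating coefficients; with real measurements the residual ε absorbs whatever the linear form cannot explain.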

ANN Models
Artificial neural networks are a type of artificial intelligence (AI) model that simulates how the human brain processes information [44]. ANNs present different interesting aspects such as their fault tolerance or their generalization capabilities, among others [45,46].
The most widely used artificial neural models are multilayer feedforward neural networks, in which the artificial neurons (also called nodes) are distributed into three different layers: the input, hidden and output layers [47]. The optimum number of hidden-layer neurons, and the structure, can be defined by a trial-and-error procedure [48,49]. The input layer receives the data provided by the user (in our case, the different variables from the Meteogalicia meteorological stations). During model training, this information flows within the neural network in only one direction, in this case from the input neurons to the output neuron, passing through each node of the hidden layer. This flow is made up of two phases: the first, called propagation, in which the processed information is carried to the output layer, where it is compared with the expected values and the error is calculated; and the second, the weight update phase, in which the model uses the previous information to try to reduce the error made.
The ANN model implemented in this research was tested using different parameter combinations: (i) the number of cycles (from 1 to 524,288 in 19 steps with a logarithmic scale), (ii) learning rate (0.1, 0.2 and 0.3), (iii) momentum (0.1, 0.2 and 0.3) and (iv) decay (true or false).
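The propagation and weight update phases can be illustrated with a very small feedforward network. The sketch below is not the implementation used in this research: the architecture (one input, three sigmoid hidden neurons, one linear output), the synthetic seasonal target and all function names are assumptions, and the decay option is not implemented. The learning rate and momentum take one of the values listed above.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(n_inputs, n_hidden, seed=0):
    """Random initial weights; the last entry of each weight list is the bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    w_output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_output


def forward(w_hidden, w_output, x):
    """Propagation phase: carry the inputs through the hidden layer to the output."""
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    output = sum(w * v for w, v in zip(w_output, hidden + [1.0]))
    return xb, hidden, output


def network_rmse(w_hidden, w_output, samples):
    errors = [(forward(w_hidden, w_output, x)[2] - t) ** 2 for x, t in samples]
    return math.sqrt(sum(errors) / len(errors))


def train(w_hidden, w_output, samples, cycles, learning_rate, momentum):
    """Weight update phase: gradient descent with momentum on 0.5 * error**2."""
    step_hidden = [[0.0] * len(row) for row in w_hidden]
    step_output = [0.0] * len(w_output)
    for _ in range(cycles):
        for x, target in samples:
            xb, hidden, output = forward(w_hidden, w_output, x)
            error = output - target
            # Gradients are computed before any weight is changed.
            grad_output = [error * v for v in hidden + [1.0]]
            grad_hidden = [
                [error * w_output[j] * hidden[j] * (1.0 - hidden[j]) * v for v in xb]
                for j in range(len(w_hidden))
            ]
            for j in range(len(w_output)):
                step_output[j] = -learning_rate * grad_output[j] + momentum * step_output[j]
                w_output[j] += step_output[j]
            for j in range(len(w_hidden)):
                for i in range(len(xb)):
                    step_hidden[j][i] = -learning_rate * grad_hidden[j][i] + momentum * step_hidden[j][i]
                    w_hidden[j][i] += step_hidden[j][i]


# Synthetic seasonal target (not measured data): low in winter, high in summer.
samples = [((m / 12.0,), 3.0 - math.cos(2.0 * math.pi * (m - 0.5) / 12.0)) for m in range(1, 13)]
w_hidden, w_output = init_weights(n_inputs=1, n_hidden=3, seed=0)
rmse_before = network_rmse(w_hidden, w_output, samples)
train(w_hidden, w_output, samples, cycles=512, learning_rate=0.1, momentum=0.1)
rmse_after = network_rmse(w_hidden, w_output, samples)
print(rmse_after < rmse_before)
```

Each pass over the samples corresponds to one training cycle, which is why the number of cycles is one of the parameters explored in the search.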

SVM Models
Support vector machines were introduced in the 1990s by different authors [50] to solve classification problems and were widely adopted due to their capacity to deal with non-linear data [9]. This method can also be used for regression purposes [12,50]. These approximations are a type of linear classifier that induces linear or hyperplane separators using a kernel function [50]. Support vector machines minimize the error on the training data while trying to maximize the separation between classes; when used for regression, the goal is to find a function that approximates the nonlinear relationship between the variables used, that is, between inputs and output [30]. The basic mathematical ideas underlying SVM for function estimation are presented in Smola and Schölkopf [51], which is a good introduction to support vector regression models.
A large number of parameter combinations is possible when developing an SVM model. To facilitate the development of these models, the combination of γ (which represents the influence of a single training case) and C (which represents the penalty factor) must be studied [36]. Different ranges can be chosen; in this research, the ranges for γ and C were chosen taking into account "A Practical Guide to Support Vector Classification", proposed by Hsu et al. for classification problems [52]. Therefore, the combination of parameters used to develop the SVM models is (i) SVM type (ε-SVR and ν-SVR), (ii) γ (from 2⁻¹⁵ to 2³ in 18 steps, with a logarithmic scale) and (iii) C (from 2⁻⁵ to 2¹⁵ in 20 steps, with a logarithmic scale).
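The size of this search space can be made concrete by enumerating the grid. The sketch below assumes that "N steps" on a logarithmic scale means N + 1 values evenly spaced in the exponent with both end points included; this interpretation, and the helper name `log2_grid`, are assumptions rather than details given in the text.

```python
from itertools import product
from typing import List


def log2_grid(min_exp: float, max_exp: float, steps: int) -> List[float]:
    """Return steps + 1 values from 2**min_exp to 2**max_exp, evenly spaced in the exponent."""
    width = (max_exp - min_exp) / steps
    return [2.0 ** (min_exp + i * width) for i in range(steps + 1)]


gammas = log2_grid(-15, 3, 18)   # 2^-15, 2^-14, ..., 2^3
costs = log2_grid(-5, 15, 20)    # 2^-5, 2^-4, ..., 2^15
svm_types = ["epsilon-SVR", "nu-SVR"]

# Every (type, gamma, C) combination that would be trained and validated.
grid = list(product(svm_types, gammas, costs))
print(len(gammas), len(costs), len(grid))
```

Under this reading each SVM type is evaluated on 19 × 21 combinations of γ and C, and the combination with the lowest validation RMSE would be retained.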

RF Models
Random forests are a non-parametric method proposed in 2001 by Breiman [31,53]. A random forest model is a set of random trees that can be used for regression and classification [54].
The random forest method offers better regression accuracy than other methods such as MARS or M5 [32]. According to Srivastava et al., the RF model develops a large number of decorrelated decision trees, each of which generates an individual output, and the final output value is then obtained by averaging the individual output values.
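The averaging step can be sketched with a deliberately simplified forest. In the example below each "tree" is only a depth-1 regression stump fitted by least squares on a bootstrap sample of a single input; real random forests grow deeper trees and also randomise the candidate variables at each split. The data are synthetic and the function names are hypothetical.

```python
import math
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        mean_left, mean_right = mean(left), mean(right)
        sse = sum((y - mean_left) ** 2 for y in left) + sum((y - mean_right) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    if best is None:  # every x in the sample is identical: predict the mean
        constant = mean(ys)
        return lambda x: constant
    _, threshold, mean_left, mean_right = best
    return lambda x: mean_left if x < threshold else mean_right


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (drawn with replacement) of the cases."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        chosen = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in chosen], [ys[i] for i in chosen]))
    return trees


def forest_predict(trees, x):
    """Final output: the average of the individual tree outputs."""
    return mean(tree(x) for tree in trees)


# Synthetic seasonal values for months 1..12 (not measured data).
months = list(range(1, 13))
values = [3.0 - math.cos(2.0 * math.pi * (m - 0.5) / 12.0) for m in months]
trees = fit_forest(months, values, n_trees=25, seed=0)
print(round(forest_predict(trees, 6), 3))
```

Because every stump predicts a mean of observed values, the averaged output always stays within the range of the training targets.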
In this research, the RF models were implemented using combinations of (i) number of trees (from 1 to 100 in 99 steps with linear scale), (ii) criterion (least square), (iii) maximal depth (from −1 to 100 in 101 steps with linear scale) and (iv) apply prepruning (true or false).

Statistics of the Developed Models
The statistics used to analyze the models were the squared correlation coefficient (r²), the root mean square error (RMSE, Equation (2)) and the average absolute relative error (Error, Equation (3)). The best model was chosen according to the lowest RMSE in the validation phase.
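These three statistics can be computed as in the sketch below. The exact forms of Equations (2) and (3) are not reproduced in the text above, so the usual definitions are assumed here: RMSE as the square root of the mean squared difference, the average absolute relative error as a percentage of the observed value, and r² as the squared Pearson correlation. The sample values are illustrative only.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error (assumed standard definition)."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def average_absolute_relative_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean of |observed - predicted| / |observed|, in percent (assumed definition)."""
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


def squared_correlation(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Illustrative values only, not data from the study.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(rmse(observed, predicted))
print(average_absolute_relative_error(observed, predicted))
print(squared_correlation(observed, predicted))
```

The RMSE carries the units of the MGI, whereas the relative error is dimensionless, which is why both are reported side by side.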

Equipment and Software Used
The different models were implemented on the server available at the Department of Physical Chemistry of the University of Vigo, Campus of Ourense (Intel® Core™ i7-8700 processor at 3.20 GHz, with 16 GB of RAM). All models were run on the Windows 10 Pro 64-bit operating system. Data were collected and processed using the software Microsoft

Table 2 shows the best models for each model type and the combination of variables used for that model. Next, the best models obtained for each of the studied approaches will be described.

Table 2. Adjustment parameters for each best approximation model developed according to its selected input variables. Latitude (Lat), longitude (Long), altitude (Alt), month, average (Tav), average of the maximum (Tav-max) and average of the minimum temperature (Tav-min), average (RHav), average of the maximum (RHav-max) and average of the minimum (RHav-min) relative humidity and precipitation (P). RMSE is the root mean square error (10 kJ/(m²·day)) and r² is the squared correlation coefficient.

MLR Models
Of the seven MLR models with different combination types, the one that presented the worst adjustment, based on the RMSE for the validation phase, was the model with combination 7. This model presented an RMSE of 5674 kJ/(m²·day) for the validation phase, which corresponds to a low r² (0.468). These poor adjustments for the validation phase extend to the other phases of the model, training and querying. For these phases, the RMSE values are 5215 kJ/(m²·day) and 6300 kJ/(m²·day), which, together with the low squared correlations of 0.426 and 0.343, mean that this model cannot be used for modelling the MGI. The rest of the models present better adjustments than this model, with RMSE values for the validation phase between 2411 kJ/(m²·day) and 2924 kJ/(m²·day). For the querying phase, these models offer RMSE values similar to those obtained for the training and/or validation phases and an average absolute relative error between 18.1% and 19.6%. The best MLR model corresponds to a model with combination type 1 (Table 2), that is, an MLR that uses all the available input variables to model the behaviour of the MGI.

ANN Models
The worst ANN model developed was the model with combination type 7. This model presented an RMSE value for the validation phase of around 1526 kJ/(m²·day), which corresponds to an average absolute relative error of 12.1%. This value is close to the 10% that, to our understanding, is considered a good error percentage for this kind of modelling. Nevertheless, some authors suggest that a prediction error of less than 20% can be regarded as good accuracy in solar radiation prediction [28]. The training and querying phases present adjustments similar to those of the validation phase, with squared correlation coefficients of 0.943 and 0.953 for training and querying, respectively. These adjustments make the worst ANN an almost usable method to model the MGI; however, the other combination types improve on the worst ANN model, presenting RMSE values between 1225 kJ/(m²·day) and 1494 kJ/(m²·day) for the validation phase. The best ANN model (Table 2) corresponds to a model with combination type 4 (input variables: latitude, longitude, altitude, month and the three humidities).

SVM Models
Of the different SVM models developed, the model that presented the worst adjustment, based on the RMSE for the validation phase, was, again, the model with combination 7. It seems clear that, in all the model types seen so far, the models that only have the precipitation variable, in addition to the other four fixed variables, do not present good results. The combination type 7 SVM model presents an RMSE of 1704 kJ/(m²·day) for the validation phase, which corresponds to a good r² of 0.956. These adjustments for the validation phase extend to the training and querying phases, where the RMSE values are 1525 kJ/(m²·day) and 1743 kJ/(m²·day), with high squared correlation values of 0.951 and 0.962, which make this SVM a model that could be used for modelling the MGI. The rest of the models present better adjustments, with the RMSE value in the validation phase dropping to 1556 kJ/(m²·day) for the second-best model. The best SVM model corresponds to a model with combination type 5, that is, an SVM that uses eight input variables to model the MGI response (Table 2).

RF Models
Finally, the last kind of model is the RF. In this case, the worst model developed was, unlike for the other ML models, a model with combination type 2. This model presented an RMSE value for the validation phase of around 2124 kJ/(m²·day), with an average absolute relative error of 15.0%. During the training and querying phases, the model presents very different adjustments, 925 kJ/(m²·day) and 1651 kJ/(m²·day), respectively. The other combination types slightly improve on this model, and the best model is obtained when configuration 5 is used (Table 2).

Best Models Developed
Taking into account the previously chosen models (Table 2), we will now analyze them as a whole. It can be seen that the RMSE values obtained for the validation phase lie between 1226 kJ/(m²·day) and 2411 kJ/(m²·day).
Accordingly, the multiple linear regression model obtains the worst RMSE value in the validation phase, 2411 kJ/(m²·day), and the worst squared correlation coefficient (0.904). This model obtained an average absolute relative error of around 19.2%. In the training phase, the model presents a lower RMSE value (2263 kJ/(m²·day)) than in the validation phase; nevertheless, the r² is also lower (0.892). Figure 2A shows the experimental MGI values and those modelled by the MLR model. It can be seen that both the training and validation cases follow the line with slope one (red line); however, a large dispersion is observed in them, as suggested by the high values of the average absolute relative error for both phases (17.3% and 19.2% for training and validation, respectively). These high errors are increased by the existence of some points that are distant from the line with slope one.

Given the results shown for both phases, it is to be expected that the results for the querying phase will also be the worst compared to the rest of the models. The RMSE is greater than in the validation phase (2458 kJ/(m²·day)) and the adjustment, in terms of squared correlation, is the lowest of the three phases (0.885).
In Figure 2A it can be seen that the querying cases also follow the line with slope one; however, as happened with the training and validation cases, they do not fit the line closely, and some points lie far from it.
As expected, the MLR model is not capable of learning correctly and then generalizing that knowledge afterwards. A possible explanation for the poor adjustments of the MLR model may lie in the use of the month variable, which does not present a linear relationship with the MGI.
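The point about the month variable can be illustrated with a synthetic example. The sketch below uses an idealised seasonal curve that peaks in mid-year (an assumption, not measured MGI data) and computes its Pearson correlation with the month order; the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


months = list(range(1, 13))
# Idealised seasonal signal: low in January and December, high in June and July.
seasonal = [-math.cos(2.0 * math.pi * (m - 0.5) / 12.0) for m in months]

r_month = pearson_r(months, seasonal)
print(abs(r_month) < 1e-9)
```

Because this idealised curve is symmetric around mid-year, its linear correlation with the month order vanishes even though the value is fully determined by the month, so a single linear coefficient on the month cannot capture the seasonal cycle.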
Given the results shown in the three phases, it can be concluded that the MLR model is not a suitable model for MGI modelling: it presents a high percentage error in all phases (between 17.3% and 19.3%), although, since this error is less than 20%, it could still be considered good according to the literature reported above.
The next model in order of validation RMSE is the RF model, with a value of 1595 kJ/(m2·day). This value improves in the model's training phase (948 kJ/(m2·day)). In both phases, the RF model improves on the MLR model, both in RMSE and in squared correlation (0.982 and 0.962 vs. 0.892 and 0.904, for the training and validation phases, respectively). In addition, the model behaves well in terms of average absolute relative error. Figure 2B shows the experimental MGI values and those modelled by the RF model. In the training phase, the cases follow the line with slope one more closely than the cases predicted by the MLR model, and the validation cases behave similarly. The behaviour in both phases is good, with average absolute relative errors of 5.9% and 10.5% for training and validation, respectively.
In the querying phase, the RF model presents its worst fit in terms of RMSE (2279 kJ/(m2·day)), although the average absolute relative error remains at a level similar to that of the validation phase (10.7%).
In Figure 2B it can be seen that the querying cases also follow the line with slope one; however, the dispersion is similar to that of the MLR model. Some cases deviate more from the line with slope one, although in the area of low MGI values the RF model fits much better than the MLR model. For this reason, the average absolute relative error is good (around 10.7% for the querying phase).
Given the results, it can be concluded that the RF model is a suitable approach for MGI modelling due to the fact that its errors for the validation and querying phases remain close to 10%.
The second-best model, according to the RMSE value in the validation phase, is the model based on support vector machines. Its fit in the validation phase is close to that of the RF model: the RMSE value for the SVM model is 1531 kJ/(m2·day) compared to 1595 kJ/(m2·day) for the RF model, and the squared correlation values are practically the same (0.961 vs. 0.962 for SVM and RF, respectively). The same happens with the error, which remains around 11% for both. For the training phase, the fit of the SVM model is slightly worse, with an RMSE of 1056 kJ/(m2·day). Figure 2C shows the experimental MGI values and those modelled by the SVM model. The training and validation cases follow the line with slope one (red line); nevertheless, the model fits the training cases worse than the RF model. This may be due to some points in the middle area that move away from the line with slope one. Both phases showed a good fit in terms of average absolute relative error, with values of 4.9% and 11.0% for training and validation, respectively.
Given the results provided by the SVM model, it can be assumed that the model will work well in the querying phase. According to the adjustment parameters for the querying phase, both the RMSE and the r2 values remain close to those of the validation phase.
This can be seen in Figure 2C, where the querying cases are close to the line with slope one and fit better than those of the RF model, although some points in the lower and upper areas stray from the line.
Given that the model gives an average absolute relative error of 8.7% for the querying phase, it can be affirmed that the SVM model is suitable for MGI modelling.
Finally, of all the models developed, the best is the ANN model, according to the criterion of the lowest RMSE value in the validation phase. In the validation phase this model obtained an RMSE of 1226 kJ/(m2·day), together with the highest squared correlation coefficient (0.975) of all the validation phases. In the training phase, the RMSE is around 1271 kJ/(m2·day), with an average absolute relative error of 7.3%. Figure 2D shows the experimental MGI values and those modelled by the ANN model. In the training phase some points lie away from the line with slope one, which explains why the ANN model did not obtain the best training fit in comparison with the SVM and RF models. This behaviour is reversed in the validation phase, where this model is the one that best fits the line with slope one (with an average absolute relative error of 8.8%).
Given the good results provided by the ANN model in both phases, good results are expected for the querying phase. In this case, the RMSE value (1136 kJ/(m2·day)) is lower than in both the training and validation phases and corresponds to a squared correlation of 0.980 (the highest of all the models in this phase).
Figure 2D shows the querying behaviour of this model: the querying cases are very close to the line with slope one. Some small dispersion is observed in the area of high MGI values, but this behaviour is an exception in the model.
Finally, taking into account all the adjustments provided by the model and the low average absolute relative error for the querying phase (6.6%), it can be affirmed that the ANN model is suitable for MGI modelling.
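The selection criterion used in this section (lowest RMSE in the validation phase) can be sketched as follows. The RMSE values are those quoted above for the RF, SVM and ANN models; the MLR model is omitted because its validation RMSE is not quoted in this part of the text. The dictionary and variable names are illustrative.

```python
# Validation-phase RMSE values quoted in the text, in kJ/(m2·day).
validation_rmse = {
    "RF": 1595.0,
    "SVM": 1531.0,
    "ANN": 1226.0,
}

# Rank the models from lowest to highest validation RMSE.
ranking = sorted(validation_rmse, key=validation_rmse.get)
best_model = ranking[0]

print("Ranking by validation RMSE:", ranking)
print("Best model:", best_model)
```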
Regarding the variables used by each of the best models, Table 2 shows that all the selected models take as input variables (apart from latitude, longitude, altitude and month) all the humidity and precipitation variables. The only exception is the ANN model, which does not use precipitation. The MLR model also includes temperature variables among its inputs. This may be because the MLR model, being linear, does not work properly with variables that are non-linearly related to the MGI, as is the case of the month variable. The authors understand that the MLR model may compensate for this variable through the inclusion of the temperature variables.

ANN Generalization to Different Locations
After analyzing all the machine learning models developed in the previous section, we now check how the best models work at the two reserved stations (Pazo de Fontefiz in Coles and Ourense-Estacións in Ourense), which were not used in any of the previous phases. The adjustments for the best models applied to these stations are presented in Table 3.
The support vector machine model is the one that gives the worst modelling values for both stations; in fact, its RMSE values are much higher than those of the other selected models (Table 3). For the Pazo de Fontefiz station the root mean square error (4029 kJ/(m2·day)) is practically double that of the best selected model (ANN), while for the Ourense-Estacións station the error (8079 kJ/(m2·day)) is almost four times greater than that of the ANN model. As expected, these high errors affect the average absolute relative error at each station: the Pazo de Fontefiz station presents an error of 24.8%, which is exceeded by the error obtained at Ourense-Estacións, 47.2%. The SVM model presents good squared correlations (higher than 0.940); however, given its root mean square error and average absolute relative error, it can be concluded that the SVM model is not a suitable model for MGI modelling.
Table 3. Adjustment parameters for each of the best models applied to the stations of Pazo de Fontefiz and Ourense-Estacións. RMSE is the root mean square error (10 kJ/(m2·day)), Error is the average absolute relative error (%) and r2 is the squared correlation coefficient.
The remaining three models fit better than the SVM model. The one with the worst fit is the MLR model, with a root mean square error of 2852 kJ/(m2·day) and 2334 kJ/(m2·day) for the stations of Pazo de Fontefiz and Ourense-Estacións, respectively (Table 3). Compared to the SVM model, this model improves in terms of RMSE and error, although not in terms of squared correlation. The errors of this model are around 19% for each station. According to this error level, the model shows good behaviour, but its error percentage is higher than desired, especially for the Pazo de Fontefiz station.
The second-best model is the random forest model. This model improves on the previous models in terms of RMSE for each of the analyzed stations, but the percentage errors remain high (21.2% and 19.6% for Pazo de Fontefiz and Ourense-Estacións, respectively) (Table 3). Although the model presents high squared correlations (greater than 0.91), its use should be limited.
Finally, the ANN model, which had been chosen in the previous section as the model with the best fit in each development phase, has emerged as the model with the best predictions for these two independent stations (Table 3). The model improves each statistic (except r2 for the Pazo de Fontefiz station), with RMSE values of around 2013 kJ/(m2·day) and 2094 kJ/(m2·day) for the stations of Pazo de Fontefiz and Ourense-Estacións, respectively. Likewise, for this model, the squared correlation is high (0.935 and 0.971) and the average absolute relative error remains lower than 15% for each station (which is considered a good error percentage). These good adjustments are reflected in Figure 3.
The first thing to note is the different size of the database for the two stations. The Pazo de Fontefiz station has data from July 2012 to August 2018 (a total of 73 months), while the Ourense-Estacións station has data from June 2014 to August 2018 (a total of 49 monthly measurements). Figure 3 shows the time series for the real MGI values (olive colour) and the values predicted by the ANN model.
Figure 3A shows the time series for the Pazo de Fontefiz station. The MGI cycles can be seen, with their maximums in the summer months and their minimums in the winter months (range from 3340 kJ/(m2·day) to 26,050 kJ/(m2·day)). The ANN predictions are shown in Figure 3A as a black line. The modelled values fit the real time series almost perfectly, which means (as already seen in the adjustments) that the ANN model can accurately predict the behaviour of the MGI for the Pazo de Fontefiz station. For the low-value areas of MGI the prediction overestimates the values, while the model behaves, in general, well for the high-value areas of MGI (although some underestimation is also observed).
Given the good adjustments, and taking into account the Figure 3A modelling time series, it can be said that this model is capable of generalizing the knowledge of the previous phase to other nearby geographical stations (Pazo de Fontefiz).
Figure 3B shows the time series for the Ourense-Estacións station. In this case, the real time series ranges from 3930 kJ/(m2·day) to 26,470 kJ/(m2·day). The modelled values fit the real time series; however, in this case the fit is worse than for the Pazo de Fontefiz station. Again, for the low-value areas of MGI the prediction generally overestimates the MGI values (this behaviour is even observed in some measurements in the maximum area), although for high MGI values the ANN model generally predicts well. Taking into account the adjustments and Figure 3B, it can be said that the ANN model is usable at other nearby geographical stations.
Given the adjustments provided by the artificial neural network model both in the modelling phase and in the prediction phase (Pazo de Fontefiz and Ourense-Estacións), it can be concluded that the ANN model is a useful model for modelling and predicting the monthly global irradiation in areas bordering the studied stations. This statement is supported not only by the low root mean square errors, but also by the percentage error associated with these predictions, which remains around 15%, a level that we could consider acceptable.
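The three statistics reported in Table 3 can be computed as in the sketch below. It assumes the usual definitions (root mean square error, mean of the absolute relative errors in percent, and squared Pearson correlation); the function names and the four observed values are hypothetical and are not taken from the stations. The predictions carry a constant offset to illustrate how a model can keep a high squared correlation while its RMSE and percentage error are large, as reported above for the SVM model.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error, in the units of the data."""
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

def mean_abs_relative_error(observed, predicted):
    """Average absolute relative error, in percent."""
    return float(100.0 * np.mean(np.abs((predicted - observed) / observed)))

def squared_correlation(observed, predicted):
    """Square of the Pearson correlation coefficient."""
    return float(np.corrcoef(observed, predicted)[0, 1] ** 2)

# Hypothetical monthly values in kJ/(m2·day); not station data.
observed = np.array([4000.0, 10000.0, 16000.0, 22000.0])
# Predictions with a constant bias of +4000 kJ/(m2·day).
predicted = observed + 4000.0

rmse_value = rmse(observed, predicted)
error_value = mean_abs_relative_error(observed, predicted)
r2_value = squared_correlation(observed, predicted)

print(f"RMSE:  {rmse_value:.0f} kJ/(m2·day)")
print(f"Error: {error_value:.1f} %")
print(f"r2:    {r2_value:.3f}")
```

Because the squared correlation is insensitive to a constant bias, it is advisable to read it together with the RMSE and the percentage error, as is done in Table 3.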

Conclusions
Based on the goodness-of-fit statistics, the MLR, ANN, SVM and RF methodologies can, in general, model and predict the MGI appropriately for the stations used for their development (Amiudal in Avión, Serra do Faro in Rodeiro and Monte Medo in Baños de Molgas). The results vary when these models are applied to other locations, Pazo de Fontefiz in Coles and Ourense-Estacións in Ourense. In view of the adjustments obtained for each station, it can be affirmed that the best model is the ANN, which presents the lowest RMSE values in the validation and querying phases, 1226 kJ/(m2·day) and 1136 kJ/(m2·day), respectively, and predicts adequately for the Coles and Ourense stations, 2013 kJ/(m2·day) and 2094 kJ/(m2·day), respectively. These good RMSE values are reinforced by the low percentage error obtained during the prediction phase at the two stations reserved for this purpose.
In conclusion, given the good results obtained, it is advisable to continue with the design of artificial neural networks applied to the analysis of monthly global irradiation in different areas of the Autonomous Community of Galicia in order to obtain a general model for the entire region. This work may therefore be the beginning of a more ambitious global study to model the monthly solar irradiation and, subsequently, to predict it in advance in the Autonomous Community of Galicia (Spain).
All the models developed in this research could be improved by including more stations, using different random dataset splits and taking into account new meteorological input variables, among others.