A Novel Approach to Enhance the Generalization Capability of the Hourly Solar Diffuse Horizontal Irradiance Models on Diverse Climates

Solar radiation data is essential for the development of many solar energy applications ranging from thermal collectors to building simulation tools, but its availability is limited, especially for the diffuse radiation component. Several studies have aimed at predicting this value, but very few cover the generalizability of such models across varying climates. Our study investigates how well these models generalize and also shows how to enhance their generalizability on different climates. Since machine learning approaches are known to generalize well, we apply them to understand how well they perform on climates other than those on which they were originally trained. To this end, we trained them on datasets from the U.S. and tested them on several European climates. The machine learning model developed for U.S. climates not only showed a low mean absolute error (MAE) of 23 W/m², but also generalized very well on European climates, with MAE in the range of 20 to 27 W/m². Further investigation into the factors influencing the generalizability revealed that careful selection of the training data can improve the results significantly.


Introduction
Building performance simulation tools such as EnergyPlus [1] and TRNSYS [2] require detailed information on the magnitudes of the diffuse and direct components of solar radiation to predict energy consumption [3]. Knowledge of these components also enables us to simulate light behavior in complicated environments and render high-dynamic-range (HDR) photorealistic images using lighting visualization tools such as Radiance [4], which are invaluable for designers, architects and daylight simulation [5] researchers alike. Sizing and configuration of solar energy systems such as solar thermal collectors and photovoltaic cells also require reliable solar radiation measurements. However, measuring both components simultaneously can be expensive: measuring the direct component requires a pyrheliometer along with a solar tracker to follow the sun at all times, and the diffuse component similarly requires a pyranometer with a shadow band. Global solar radiation, which includes both components, is comparatively simple to measure and is commonly recorded not only at meteorological stations but also at smaller local weather stations. A cost-effective approach to obtaining the diffuse and direct components is therefore either to use various correlations or to develop prediction models based on global solar radiation along with other meteorological parameters.
Several studies evaluating the diffuse component started to appear in the 1960s based on the work of Liu and Jordan [6], which utilised curve fitting models with polynomial terms. Most of the models that build on Liu and Jordan [6], such as Orgill and Hollands [7], Erbs et al. [8] and Reindl et al. [9], rely on the relationship between the clearness index (k_t) and the diffuse fraction (F_d), where F_d is the ratio between diffuse (D_h) and global horizontal irradiance (G_h) and k_t is the ratio between global horizontal irradiance (G_h) and extraterrestrial solar irradiance (G_o). Although they are based only on this relationship, the models differ because they were developed on different datasets. These models therefore have to be re-fitted whenever they are applied to a new dataset, which makes them very difficult to generalize.
Recent studies have shown that machine learning approaches have made promising strides in this field. Boland et al. [10] and Ruiz-Arias et al. [11] developed logistic and regression models respectively for diffuse fraction prediction. Soares et al. [12] modeled a perceptron neural network for São Paulo City, Brazil, to estimate hourly values of diffuse solar radiation, which performed better than the existing polynomial models. Similar studies conducted by Elminir et al. [13] and Ihya et al. [14] in Egypt and Morocco respectively also support the finding that neural networks are better predictors than linear regression models. A recent review by Berrizbeitia et al. [15] summarized both empirical and machine learning based models, but many of these models have not been examined for their ability to generalize to different climates. In most of these studies, the training and testing data come from the same climate. Most regression models can show exceptional performance in this type of experimental setup, but they fail to show the same performance on different climates. Many climates lack the historical data needed to develop good regression models, although nearby climates may have such data in abundance. The idea of our research is to harvest this data to develop models and use them reliably to predict for climates that lack historical data. This motivated us to explore contemporary ways of exploiting the data using machine learning algorithms, as they learn to adapt and produce reliable and repeatable results. While experimenting with several models we also found that their performance is affected by their training datasets. Therefore, we present our observations and some recommendations to improve the generalizing capability of these models.
In Section 2, we first give a brief introduction of all the predictive models compared in this study along with finetuning techniques that improve the performance of these models. The polynomial model of Erbs et al. is chosen as a baseline to compare against other machine learning models (linear regression, decision tree, random forests, gradient boosting and XGBoost, and an artificial neural network). The workflow of this study is to first find the best performing approaches by comparing them on the same datasets and then check for their generalizability on a third unseen dataset. Secondly, the best performing approaches are tested with completely different datasets to see if they can reliably make similar predictions and analyze what factors would affect their performance. To make the results reproducible all datasets used in this study along with their locations and data preprocessing techniques are mentioned in Section 3.
The hyperparameter configurations explored for the machine learning models, and the best configurations obtained through finetuning, are described in Section 4. There we also present our findings in two parts. First, we present the performance of all the predictive models, based on the most commonly used metrics, for all the datasets considered in this study along with their generalization capability. Second, we select the best performing approaches from the first part, train them on U.S. climates and test them on European climates to check their reliability. Finally, for each part we discuss our findings and present our recommendations on how to select the training datasets to achieve better performing models.

Polynomial Model (Erbs et al. Model)
Dervishi and Mahdavi [16] studied eight polynomial models, selected on the basis of their previously reported performance, and concluded that the best performing polynomial models were those of [7][8][9]. As all three of these models perform equally well, we decided to compare our machine learning models with only the Erbs et al. model. Erbs et al. proposed a relation between the clearness index (k_t) and the diffuse fraction (F_d). This model is based on data from five U.S. weather stations. The diffuse fraction F_d, with model coefficients given in Table 1, is expressed as a piecewise function of k_t, where F_d = D_h / G_h and k_t = G_h / G_o.
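For orientation, the piecewise form of the Erbs et al. correlation can be sketched as follows. This is a minimal sketch that uses the coefficients from the original Erbs et al. publication, not the dataset-specific coefficients of Table 1, which are not reproduced here. The function names are hypothetical, and the conversion to diffuse horizontal irradiance simply applies the definition F_d = D_h / G_h.

```python
def erbs_diffuse_fraction(kt):
    """Diffuse fraction F_d from clearness index k_t.

    Coefficients are the originally published Erbs et al. values,
    NOT the re-fitted per-dataset coefficients of Table 1.
    """
    if kt <= 0.22:
        return 1.0 - 0.09 * kt
    if kt <= 0.80:
        return (0.9511 - 0.1604 * kt + 4.388 * kt ** 2
                - 16.638 * kt ** 3 + 12.336 * kt ** 4)
    return 0.165


def diffuse_horizontal_irradiance(gh, go):
    """D_h (W/m^2) from global (gh) and extraterrestrial (go) irradiance."""
    kt = gh / go                             # clearness index k_t = G_h / G_o
    return erbs_diffuse_fraction(kt) * gh    # D_h = F_d * G_h
```

Adapting the model to a new dataset, as done in this study, would amount to replacing the numeric coefficients with curve-fitted ones while keeping the same structure.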

Machine Learning Models
Machine learning algorithms are a collection of statistical methods that can be trained to predict and analyze trends. Applied to weather data, such statistical models have drawn good inferences and led to more generalizable predictive patterns. Machine learning models that find these patterns by learning from labeled data are called supervised learning models. In this paper we briefly discuss a few supervised learning models: linear regression, decision trees, random forests, gradient boosting and XGBoost.

Linear Regression
Linear regression [17] is used to establish the relationship between dependent and independent variables using a linear approach. When the independent variables are more than one, it is called multiple linear regression. It has been studied rigorously and used extensively in several practical applications. In this study, we adopted the least squares approach to fit the model.

Decision Trees
Decision trees [18] are non-parametric methods that build regression or classification models in the form of trees. Decision trees predict the target value by learning simple decision rules inferred from the data features. A decision tree is built top-down by recursively breaking a dataset down into smaller subsets based on decision rules, which form the decision nodes. The decisions on the target values are represented by leaf nodes. The final tree therefore appears as a set of decision nodes and leaf nodes.

Random Forests
Random forests [19] use decision trees as building blocks to build more powerful prediction models. The algorithm is an ensemble of decision trees with nearly the same parameters as a decision tree, but considers only a random subset of features for splitting a node, thus introducing additional randomness to the model, while growing the trees. All the decision trees are modeled based on a different subsample of data and each observation of the subsample is chosen with replacement. Thus, this technique takes many uncorrelated learners to make a final model with reduced variance and improved performance.

Gradient Boosting
The idea of combining several weak learners intelligently into a strong learner is called boosting. Gradient boosting [20] is an ensemble learning approach in which an ensemble of regressors is built incrementally in several steps. At each step a new sub-model is added that tries to compensate for the residuals (errors) made by the previous sub-models. These sub-models are fairly simple, and decision trees are the classic choice. The intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the residuals and so strengthen a model with weak predictions into a strong predictor. The training of the predictor stops at the stage where the residuals no longer contain any pattern that could be modeled.
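The residual-fitting loop described above can be illustrated with a deliberately small sketch. It assumes squared-error loss, one-dimensional inputs and single-split trees (stumps) as the weak learners; the function names, the toy data and the learning rate are illustrative choices, not the configuration used in this study.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_stages=50, learning_rate=0.1):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 20 for i in range(21)]     # toy inputs in [0, 1]
ys = [1.0 - x ** 2 for x in xs]      # toy target, not measured data
baseline_error = mse(gradient_boost(xs, ys, n_stages=0), xs, ys)
boosted_error = mse(gradient_boost(xs, ys, n_stages=50), xs, ys)
```

With squared-error loss, each added stump lowers the training error, which is why a stopping rule or validation set is needed in practice to avoid fitting noise.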

XGBoost
XGBoost ("Extreme Gradient Boosting"), a variant of gradient boosting implemented in the XGBoost library by Chen and Guestrin [21], is also used in this study. It is an optimized distributed gradient boosting implementation designed to be more efficient, flexible and portable than existing gradient boosting implementations. It has been recognized as the best approach by the European Organization for Nuclear Research (CERN) [22] for classifying signals from the Large Hadron Collider. Unlike standard gradient boosting, XGBoost uses a more regularized model formalization to control overfitting, which helps it achieve better performance.

Neural Networks
Artificial neural networks (ANNs) [23], a branch of machine learning, have attracted a great deal of attention in recent years. In this study, we discuss building a deep neural network, a more advanced form of neural network that is wider and deeper than traditional neural networks, to produce more accurate predictions of the hourly diffuse fraction from a variety of meteorological data. We show that by increasing the number of hidden layers, the need for feature extraction, which was a crucial step in previously developed neural network models, can be avoided.

Basics of Neural Networks
Neural networks (NNs) are inspired by biological neural networks and can perform certain tasks by learning from examples, without the need to program for each individual task. They are used in a wide range of applications such as speech and image recognition, computer vision, machine translation, medical diagnosis and even board games. Feed-forward multi-layered perceptrons (MLPs) are the most common neural networks which are simple but can solve complex problems. A deep neural network (DNN) is an MLP with multiple hidden layers, which avoids the manual process of handcrafted feature engineering by learning feature significance automatically. With the introduction of high power Graphical Processing Units (GPUs) and Compute Unified Device Architecture (CUDA) framework, neural networks can now be easily widened and deepened without much concern for the memory and processing power.
A neural network has several neurons in the input layer, each representing a feature (input), several hidden layers with a variable number of neurons, and an output layer whose neurons each represent an output. Every neural network has a set of weights, biases and activation functions. The input that enters a neuron is multiplied by weights, which are randomly initialized and then updated during the training process. At the end of the training, features with higher significance receive higher weights and vice versa. An activation function translates the linear combination of weighted inputs and bias into an output value. The bias added to the weighted inputs acts as a range shifter for the output of the activation function. The input data is transformed by passing through several hidden layers to predict an output as close as possible to the actual value. The objective is to reduce the error (the difference between output and actual value), measured by a cost function, thus increasing the accuracy of prediction. The randomly assigned weights are updated with backpropagation [24], where the error is fed back through the network along with the gradient of the cost function. To help the network generalize, the data is sent in chunks of equal size called batches. A single training iteration in which all the batches go through forward and backward propagation is called an epoch. To minimize the cost function of the network, an optimization algorithm such as gradient descent is used. It approaches a local minimum of the cost function by taking steps proportional to the gradient, scaled by the learning rate.
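The forward pass, the error gradient and the gradient descent update can be seen in miniature for a single sigmoid neuron. This is only a sketch under simplifying assumptions: one input, one neuron, full-batch updates, mean squared error as the cost function, and weights started at zero rather than randomly. The data values are illustrative and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(xs, ys, learning_rate=0.5, epochs=200):
    """Train one sigmoid neuron with full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0                 # zero start keeps the sketch deterministic
    history = []                    # cost recorded at the start of each epoch
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = loss = 0.0
        for x, y in zip(xs, ys):
            out = sigmoid(w * x + b)                 # forward pass
            err = out - y
            loss += err ** 2
            delta = 2.0 * err * out * (1.0 - out)    # chain rule through sigmoid
            grad_w += delta * x
            grad_b += delta
        history.append(loss / n)
        w -= learning_rate * grad_w / n              # gradient descent step
        b -= learning_rate * grad_b / n
    return w, b, history


kt = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
fd = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20]   # illustrative values only
w, b, history = train_neuron(kt, fd)
```

A multi-layer network repeats the same chain-rule step layer by layer, which is what backpropagation automates.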

Fine Tuning
To achieve the best results from the machine learning models and neural networks, fine tuning is essential. Fine tuning, or hyperparameter optimization, finds the right set of weights, activation functions, learning rates and other constraints that yields an optimal model, one that minimizes the cost function on given validation data. Even the Erbs et al. model is adapted to each dataset using curve-fitting methods (see Table 1).

Grid Search with Cross Validation
Grid search [25,26] or parameter sweep is traditionally used for optimizing hyperparameters. It exhaustively searches through a parameter space, a subset specified manually over the hyperparameter space of the learning algorithm. Usually a grid search algorithm requires a performance metric such as cross-validation, which estimates the accuracy of the prediction algorithm in practice and how well the algorithm generalizes on an independent data set. Cross-validation involves partitioning the dataset into k subsets, performing the training on any k-1 subsets and validating on the remaining subset. This is repeated with a different set of k-1 subsets in multiple rounds until all the combinations are exhausted. Multiple rounds of cross-validation with different partitions reduce the variability, and the results from all the rounds are combined or averaged to give an estimate of the model's predictive performance.
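The combination of a parameter grid with k-fold cross-validation can be sketched without any library. The example below is an assumption-laden illustration: it uses a one-dimensional k-nearest-neighbour regressor as a stand-in model and the number of neighbours as the only hyperparameter, whereas this study searches tree depths, numbers of trees and similar settings with scikit-learn. All function names and the toy data are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; each fold validates once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    chosen = order[:n_neighbors]
    return sum(train_y[j] for j in chosen) / len(chosen)


def cv_mae(xs, ys, n_neighbors, k=5):
    """Mean absolute error averaged over the k validation folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(xs), k):
        tx = [xs[j] for j in train]
        ty = [ys[j] for j in train]
        errs = [abs(knn_predict(tx, ty, xs[j], n_neighbors) - ys[j])
                for j in val]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k


def grid_search(xs, ys, grid, k=5):
    """Evaluate every grid value and return the one with the lowest CV error."""
    scores = {n: cv_mae(xs, ys, n, k) for n in grid}
    best = min(scores, key=scores.get)
    return best, scores


xs = [i / 40 for i in range(41)]     # toy inputs
ys = [1.0 - x ** 2 for x in xs]      # toy target, not measured data
best_k, scores = grid_search(xs, ys, grid=[1, 3, 5, 9])
```

With several hyperparameters the grid becomes the Cartesian product of the candidate values, so the cost grows multiplicatively with each added dimension.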

Overfitting
An overfitted model performs well on training data but performs poorly on validation data because the model begins to memorize training data after a certain number of epochs/passes and fails to generalize the trend when testing unseen data. This can happen for several reasons such as if the model contains more parameters than can be justified by the data, or non-conformity of the data shape with the model structure, or higher model loss when compared to the expected level of noise and sometimes even error-prone data. There are several techniques to reduce the chance of overfitting such as early stopping, dropout regularization and pruning.

Early Stopping
Early stopping is a regularization technique used in iterative training methods such as gradient descent to avoid overfitting. It allows the model to fit the training data better with each iteration up to the point where the model's performance on data outside the training set stops improving. Early stopping rules determine how many iterations can be run before the model starts to overfit. One such method is validation-based early stopping [27], where the original training set is split into a new training set and a validation set, and the loss on the validation set is used as a proxy to detect the beginning of overfitting. The validation loss is calculated at the end of every epoch; when it shows no improvement for a certain number of epochs, training stops and the weights of the network from the previous step are used.
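The stopping rule can be written as a short function over a sequence of per-epoch validation losses. This is a sketch of the generic patience-based rule, not the Keras callback itself; the function name is hypothetical, and the default patience of 10 mirrors the setting reported in the tuning results of this study.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Training stops once `patience` epochs have passed without a new
    minimum; the weights from `best_epoch` would then be restored.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # ran out of epochs first
```

For example, with losses `[1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]` and a patience of 3, the minimum is reached at epoch 2 and training stops at epoch 5.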

Dropout Regularization
Dropout regularization [28,29] reduces overfitting by randomly omitting a certain number of feature detectors (neurons) on each training case. As a result, each neuron learns to detect a feature that is generally useful across the vast range of contexts in which it operates, rather than developing complex co-adaptations with several other specific feature detectors.
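A minimal sketch of dropout applied to one layer's activations during training is given below. It assumes the "inverted" variant, in which the surviving activations are rescaled so that their expected value is unchanged; the function name is hypothetical, and the default rate of 0.3 matches the dropout fraction used for the networks in this study.

```python
import random


def dropout(activations, rate=0.3, rng=None):
    """Zero each activation with probability `rate`; rescale the survivors.

    Inverted dropout (an assumption of this sketch): dividing by the keep
    probability means no rescaling is needed at prediction time.
    """
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

At prediction time the layer is used unchanged, i.e., without dropping any neurons.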

Pruning
Pruning, a term often associated with decision trees, is a technique to reduce overfitting by reducing the size of the decision trees by removing the least powerful parts of the tree. The regression tree algorithm recursively partitions the data into smaller subsets until those final subsets are homogeneous in terms of the outcome variable. This often leads to final subsets (leaves) each consisting of only one or a few data points, which implies the data is learned exactly by the tree and any new data point that differs slightly might not be well predicted. Such instances can be avoided by pruning back the tree until the cross validated error is minimized.

Data Collection
In this paper, a total of fourteen datasets collected from three different networks of ground-based stations were used. The first network of stations is in Germany and is operated by Deutscher Wetterdienst (DWD) [30]. The weather stations considered in this network, Mannheim, Ensheim (Saarbrücken) and Weinbiet (Neustadt), are shown in Table 2. The samples in these datasets are at 10-min intervals and can be downloaded via the DWD FTP website. The second network of stations belongs to the Surface Radiation Budget Network (SURFRAD) [31] in the U.S., where the data is available at 1-min resolution for all seven stations shown in Table 3. The data from the network can be accessed via a simple Python script available on their website. The third set of stations is from the Baseline Surface Radiation Network (BSRN) [32], shown in Table 4, where the data is also available at 1-min intervals. The data can be accessed via both the FTP and PANGAEA websites [33]. Apart from these datasets, two publicly available satellite-derived diffuse fractions were used for prediction alongside the data from the SURFRAD and BSRN networks. The National Solar Radiation Database (NSRDB) [34], modeled using a physical solar model (PSM), is available at 30-min resolution and covers most of the U.S. with a spatial resolution of 0.04°. NSRDB data can be accessed via a simple Python script available at the NREL website. This data was interpolated to 1-min intervals and used along with the SURFRAD data for prediction. Similarly, the Copernicus Atmosphere Monitoring Service radiation service (CAMS-RAD) [35] was used along with the BSRN data. CAMS-RAD data is publicly available at their website with a spatial extent of −66° to 66° in both latitude and longitude.

Data Preparation
The data is preprocessed to improve its quality. Several criteria are enforced to remove samples that are erroneous or violate physical limits. Machine learning models usually need data scaled to a standard form to keep the variance of different features in the same range, which matters especially when validating on unseen data. Therefore, all the input features are transformed using the quantile transformation [26] available in the scikit-learn library for Python. This transforms the features to follow a uniform distribution and speeds up the convergence of optimization algorithms such as stochastic gradient descent [36].
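The idea behind a quantile transformation to a uniform distribution can be shown with a simplified, rank-based sketch. It is not the scikit-learn implementation, which interpolates between a fixed number of quantile landmarks; here each value is simply mapped to its mid-rank position in the empirical distribution of the training data. The function name is hypothetical.

```python
import bisect


def fit_uniform_quantile_transform(train_values):
    """Return a function mapping a value to its empirical CDF position in [0, 1].

    Simplified sketch: the mapping is learned from the training data only,
    so unseen values below or above the training range map to 0 or 1.
    """
    ordered = sorted(train_values)
    n = len(ordered)

    def transform(value):
        lo = bisect.bisect_left(ordered, value)    # training values < value
        hi = bisect.bisect_right(ordered, value)   # training values <= value
        return ((lo + hi) / 2.0) / n               # mid-rank scaled to [0, 1]

    return transform
```

Because the mapping depends only on ranks, features with very different units and ranges (for example pressure and relative humidity) end up on the same [0, 1] scale.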

Experimental Design
This study is divided into two parts:

• In the first part, all the predictive models described in Section 2 are first trained on the preprocessed datasets of Mannheim and Ensheim (both in Germany) with historical data from the years 2013-2016. During training, all the models are tuned for their best performance using the techniques described in Section 2.4. All the models are then validated against the respective climate datasets from 2017 to evaluate their performance using the metrics defined in Section 3.4. These models are then tested for their generalization capability on a third dataset, Weinbiet (Germany), using 2017 data. The aim here is to identify which approaches generalize well and to apply those methodologies in the next part.
• In the second part, the main goal is to ascertain that the best performing approaches from the first part are in fact generalizable to completely different training and test sets. To keep the experiment simple, only the two best performing approaches are considered here. The training datasets are chosen from 7 climates in the U.S. for the year 2015. A predictive model for the U.S. is developed with the consolidated data from all 7 climates. For comparison, individual predictive models are also developed for each climate separately. Fine tuning techniques for the respective approaches are applied to improve their performance before they are tested on the test datasets. The models are first validated on their respective climates with data from the year 2016. Once these models show good validation performance, they are tested against four climates from Europe to determine their extent of generalization. The test data chosen is from the year 2016.
• Finally, we derive conclusions based on model performance with different training datasets, especially how the models perform on unseen datasets in Germany when trained with local (Germany) and international (U.S.) training sets.

Evaluation Metrics
The performance of all the predictive models is assessed based on four metrics: normalized Root Mean Squared Error (nRMSE), normalized Mean Bias Error (nMBE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). In their definitions, m_i is the ith measured value, p_i is the corresponding predicted value, m̄ is the mean of the measured values and N is the number of data samples.
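The four metrics can be computed as in the following sketch. Two conventions are assumptions rather than statements of this study's exact definitions: the error is taken as predicted minus measured (so a positive nMBE means overprediction), and the normalized metrics are expressed in percent of the mean measured value m̄. The function name is hypothetical.

```python
import math


def evaluation_metrics(measured, predicted):
    """RMSE and MAE in the units of the data; nRMSE and nMBE in percent."""
    n = len(measured)
    mean_measured = sum(measured) / n                      # m-bar
    errors = [p - m for m, p in zip(measured, predicted)]  # assumed sign: p_i - m_i
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mbe = sum(errors) / n
    return {
        "RMSE": rmse,
        "MAE": mae,
        "nRMSE": 100.0 * rmse / mean_measured,
        "nMBE": 100.0 * mbe / mean_measured,
    }
```

For measured values of 100, 200 and 300 W/m² and predictions of 110, 190 and 330 W/m², the errors are 10, −10 and 30 W/m², giving an MAE of about 16.7 W/m² and an nMBE of 5%.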

Tuning Results
Prior to the test for generalizability, the models are finetuned to give improved validation performance. Grid search is performed on the machine learning models over several hyperparameters. Decision trees, random forests, gradient boosting trees, and XGBoost trees are each grid searched at depths from 4 to 15. Random forests are tested with numbers of trees from 50 to 250. The number of boosting stages of gradient boosting and XGBoost is likewise grid searched from 50 to 250. Additionally, a few further hyperparameters are searched in the XGBoost models: the learning rate (from 0.05 to 0.15), the subsample ratio (from 0.08 to 0.1) and the L2 regularization [37] of weights (from 0.80 to 1). The hyperparameters for the machine learning models can be seen in Table 5. The neural networks chosen for both parts of this study are similar. They are feed-forward MLPs with 11 input features (Mannheim and Ensheim) or 10 input features (U.S.), 6 hidden layers (shown in Table 6) and a single output neuron with a sigmoid activation function that gives the diffuse fraction (F_d). All hidden layers except the last have dropout regularization added, with the dropout fraction set to 0.3. Neural networks are implemented using the Keras library [38], and grid search from scikit-learn [26] is used to search for the best combination of hyperparameters.
The candidate activation functions for the hidden layers are ELU (Exponential Linear Unit) [39], sigmoid [40] and tanh. Early stopping is implemented with the validation loss as a proxy, so that training stops automatically after 10 epochs without improvement. A model checkpoint, evaluated at every epoch, is used to save the best model. The loss function is the mean squared error and the optimizer 'Adam' [41] is chosen, as it has shown good results with the default configuration of learning parameters. Table 6 shows the combinations of activation functions obtained by grid search for the hidden layers in the Ensheim, Mannheim and U.S. models.

First Study
Before the predictive models are trained, a small analysis of the data is performed to see how the input features influence the output, i.e., the diffuse fraction. Mutual information (MI) [42] and the R² correlation [43] of the input features with respect to the output F_d are used for this analysis. Figure 1 shows the significance of each input feature. MI is a measure of mutual dependence between the variables, whereas the R² correlation gives the proportion of variance in the dependent variable that is predictable from the independent variable. It can be observed that the diffuse fraction depends primarily on the global horizontal irradiance and the clearness index, but the other parameters also show considerable, though varying, influence in both climates.
Since the machine learning models can easily handle this small number of input features, we considered all the features without discrimination for the machine learning models. The polynomial models, in contrast, depend mostly on a few significant predictors, so their input features are chosen according to their requirements.
For machine learning models the following features are considered: day of the year (d), hour (h), solar altitude (α) (considering site elevation, T_a and P_a), clearness index (k_t), global horizontal irradiance (G_h), atmospheric pressure (P_a), air temperature (T_a), relative humidity (R_h), wind speed (W_s) and wind direction (W_d). k_t and F_d are calculated according to Equations (4) and (5).
Two sets of predictive models are developed, one each for the Mannheim and Ensheim climates. All the machine learning models in this part are trained on 2013-2016 data and validated on 2017 data for each of the datasets. Tables 7 and 8 show the comparison of the performance of each predictive model on the training and validation datasets for the Mannheim and Ensheim climates. The predictive models are abbreviated as NN (neural network), XGB (XGBoost), GB (gradient boosting), RF (random forests), DT (decision trees) and LR (linear regression). The configuration of all the machine learning models can be seen in Table 5 and that of the neural networks in Table 6. The coefficients for the Erbs et al. model can be seen in Table 1. All the models output the diffuse fraction, from which the diffuse horizontal irradiance is calculated using Equation (11).
The results in Table 7 show that NN, XGB, GB and RF performed well, with an nRMSE of 26-27% for Mannheim. They performed equally well with respect to the other metrics. In comparison, NN fared best, while the other three models showed similar performance. Linear regression and the Erbs et al. model performed worst. A similar trend can be seen for the Ensheim climate (see Table 8). Interestingly, the nRMSE differs between the climates by about 3-4% for all the models, while the other three metrics remain the same. In summary, the neural network (NN) showed the best performance on both climates, with an RMSE of 39 W/m² and an MAE of 22 W/m².

Test for Generalizability
A third climate, Weinbiet, for which measured diffuse irradiance values are available, is selected to test for generalizability. The Weinbiet weather station (as shown in Figure 2) is located between Mannheim and Ensheim but is considerably closer to the Mannheim weather station (30 km) than to Ensheim (75 km). The Weinbiet dataset is preprocessed in the same way as the Ensheim and Mannheim datasets. The true diffuse fraction values from the Weinbiet dataset are used only to measure the predictive performance of the models on the Weinbiet dataset.
Testing the above models on the Weinbiet data shows that the models trained on the Mannheim dataset perform better than those trained on the Ensheim dataset, as is evident from Table 9. Once again NN, XGB, GB and RF showed a similar nRMSE of 27% (trained on Mannheim), but their nMBEs differed significantly. In this case NN showed the lowest nMBE of just 0.33%, thus faring better than the other predictive models. It can also be observed that the neural network's nRMSE on its original dataset of Mannheim (see Table 7) and on the unseen dataset of Weinbiet is almost the same, at 26% and 27% respectively. Though the other models generalized similarly to NN in terms of nRMSE, their nMBE increased significantly on the unseen dataset, which makes their predictions less reliable. Models trained on Ensheim, on the other hand, seem even less reliable due to a high nMBE of around −7 to −8%. These results show that models trained on Mannheim are more favorable for predictions for Weinbiet. A closer look is taken at all three datasets in order to better understand this phenomenon. Figure 2 shows the correlation of the Weinbiet dataset with the Ensheim and Mannheim datasets. The bar plots show the correlation of each input feature from the Mannheim and Ensheim datasets with the corresponding feature of the Weinbiet dataset. The bar plots for the Mannheim input features show better correlation than those for Ensheim, which means that the Weinbiet dataset is more similar to the Mannheim dataset than to the Ensheim dataset. This explains why models trained on Mannheim perform better than those trained on Ensheim.
These results suggest that a pre-trained machine learning model, provided the training dataset is chosen carefully, can make really good predictions on unseen data.

Second Study
From the first part we concluded that NN performed best among the models, but we would like to ascertain whether such reliable predictions can be made on a global scale (at least on two different continents). Apart from NN, XGB is the other approach that consistently performed comparably to NN. So in this part, two approaches, namely NN and XGB, are used to train the models.
The training data is a consolidated dataset of seven U.S. climates collected from the SURFRAD network (see Table 3) for the year 2015. Both NN and XGB models are finetuned and then validated against 2016 data. To test these models, 2016 data from four European climates (in France, Spain and The Netherlands) are considered (see Table 4). Because they are unavailable, wind direction and wind speed are not considered; instead, a satellite-derived diffuse fraction, which is available for most places in the U.S. (NSRDB) and Europe (CAMS-RAD), is used.
Individual models for each of the seven U.S. climates are also developed to compare against the main model, i.e., the model based on the consolidated data of all seven climates. This gives us a better picture of how well the combined model performs with respect to each climate individually. For instance, the nRMSE of the individual models (NN) is in the range of 25 to 31%, whereas that of the combined model is 27% (see Table 10). This confirms that the combined model appropriately represents all the underlying climates without much disparity. Comparing the performance of the NN and XGB models, it can be seen that in almost all the climates the NN models have once again performed better.

Test for Generalizability
Four climates in Europe (two in France, one each in The Netherlands and Spain) are selected to test the generalizability of models that are trained on U.S. data. The bottom part of Table 10 shows the performance metrics.
The performance of NN and XGB is very similar on the European climates in most of the metrics. The nRMSE of the U.S.-trained model is between 23 and 29% on the European datasets and 27% on its original dataset. This range is slightly lower than that of the individual models evaluated on their original datasets (25 to 31%). This shows that these models generalize very well, in some cases even exceeding their original performance. This performance gain can be attributed to the fact that the model trained on the consolidated dataset has seen more patterns than the models trained on individual datasets.
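For readers who want to reproduce this kind of comparison, the three metrics quoted in this section can be computed as sketched below. The sketch assumes that nRMSE and nMBE are normalized by the mean of the measured values and expressed in percent, and that bias is taken as predicted minus measured; these conventions are common but are assumptions here, and the definitions given earlier in the article take precedence. The sample values are made up.

```python
import math


def mae(measured, predicted):
    """Mean absolute error, in the units of the data (e.g. W/m2)."""
    return sum(abs(p - m) for m, p in zip(measured, predicted)) / len(measured)


def nrmse(measured, predicted):
    """Root mean square error as a percentage of the mean measured value (assumed convention)."""
    mse = sum((p - m) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    return 100.0 * math.sqrt(mse) / (sum(measured) / len(measured))


def nmbe(measured, predicted):
    """Mean bias (predicted minus measured) as a percentage of the mean measured value (assumed convention)."""
    bias = sum(p - m for m, p in zip(measured, predicted)) / len(measured)
    return 100.0 * bias / (sum(measured) / len(measured))


# Made-up diffuse irradiance values in W/m2 (assumption: NOT data from the study).
measured = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]

print(f"MAE   = {mae(measured, predicted):.2f} W/m2")
print(f"nRMSE = {nrmse(measured, predicted):.2f} %")
print(f"nMBE  = {nmbe(measured, predicted):.2f} %")
```

Under these conventions, two models can share the same nRMSE while differing in nMBE, because positive and negative errors cancel in the bias but not in the squared error.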
The above results show that the right choice of training data can improve the predictive performance of these models.
A cross comparison of the predictive models from both studies shows a more interesting result. Table 11 shows the performance of all three NNs on the Weinbiet dataset. It can be observed that the model trained on the Mannheim dataset performs better than the models trained on the Ensheim dataset or on the combination of U.S. datasets. The nRMSE of NN (Mannheim) is 27%, whereas NN (Ensheim) and NN (U.S.) achieved around 29%. Though the model trained on the consolidated dataset of U.S. climates learned more weather patterns, it is outperformed by NN (Mannheim), which was trained on local data. Since the Mannheim data correlates well with Weinbiet (see Figure 2), NN (Mannheim) has the advantage here. Since measured diffuse irradiance data are scarce, this approach of using a pre-trained model along with generally available data from local weather stations gives easy access to accurate diffuse irradiance data.

Conclusions
The main goal of this article was to test the generalization capability of the hourly diffuse horizontal irradiance prediction models and also to understand the factors that can improve their performance. The experiments in this study were divided into two parts.
In the first part, seven predictive approaches (six machine learning and one polynomial) were each trained on two German climates (Mannheim and Ensheim). They were first fine-tuned and validated on their respective climates before they were tested on a third German climate (Weinbiet).
We observed that, among all the approaches considered, neural networks and XGBoost exhibited good generalization capabilities compared to the others. Neural networks consistently achieved a low mean absolute error (MAE) of 22 W/m2 on the validation datasets of both climates (Mannheim and Ensheim). When tested on a third climate (Weinbiet), the neural network model trained on Mannheim (24 W/m2) performed better than the one trained on Ensheim (28 W/m2). This was somewhat surprising, as both climates (Mannheim and Ensheim) are close to the third climate (Weinbiet). A closer look at the datasets of all three climates revealed that the Weinbiet dataset (test set) correlates better with Mannheim than with Ensheim, which explains the better performance of the model trained on the Mannheim dataset. Though the neural network performed better in all the tests at the local level (German climates), we wanted to confirm whether it can reliably make good predictions at a global level. Therefore, we shortlisted the neural network and XGBoost models for the next part, in order to test them on more climates.
In the second part, data from seven different climates in the U.S. were consolidated and then used to train neural network and XGBoost models. These models were then tested on four European climates (two from France and one each from The Netherlands and Spain). Once again, the neural network performed well, with an MAE of 23 W/m2 on its validation set, and generalized well to the European climates with MAE in the range of 20 to 27 W/m2. Individual models for each of the U.S. climates were also developed for comparison. These models achieved MAE in the range of 20 to 26 W/m2. This indicates that the developed neural network retained its original performance even on unseen European datasets. Such generalization capability is seldom achieved, but in this case it is possible because the model is trained on a consolidated training set. This enabled the model to learn varied weather patterns from seven different U.S. climates and make good predictions on European climates.
Both studies conclude that neural networks are excellent predictors and that careful selection of training data can improve their performance considerably. Although neural networks resemble a black-box model because of the complex relationships between neurons, they produced better predictions than their counterparts. In this regard, XGBoost fared surprisingly close to neural networks. Considering the time required to train and validate a neural network model, the XGB model did an excellent job, with performance on par with neural networks. This research has several practical applications, especially for simulation tools that need the diffuse and direct components of solar radiation. Researchers who use standard weather files in their simulations can benefit from these trained models by incorporating them directly into their tools.