3.1. Forecasting Performances
The comparison of the multi-step solar irradiance forecasts produced by the LSTM and FF networks was performed with the delay parameter d set to 24, even though the mutual information analysis indicated that d = 6 would suffice. This choice is motivated by the intrinsic periodicity of solar radiation at the hourly scale [54]. We experimentally observed that it yields more accurate estimates for all the models considered, probably because it helps the models account for the natural persistence of solar radiation (see the comments on the performance of the Pers24 model below). The length of the forecasting horizon h, i.e., the number of steps predicted ahead, was varied from 1 to 12. Data from 2014 to 2017 were used for training the networks, 2018 for validating the architecture, and 2019 for testing.
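The construction of the input–output pairs described above can be sketched as follows; this is a minimal illustration, assuming the hourly series is stored in a NumPy array (function and variable names are ours, not from the original implementation):

```python
import numpy as np

def make_supervised(series, d=24, h=12):
    """Build (input, target) pairs: d past hourly values -> next h values.

    series : 1-D array of hourly global solar radiation.
    Returns X of shape (n, d) and Y of shape (n, h).
    """
    X, Y = [], []
    for t in range(d, len(series) - h + 1):
        X.append(series[t - d:t])   # last d observations
        Y.append(series[t:t + h])   # next h observations
    return np.asarray(X), np.asarray(Y)

# toy example: 100 hours of synthetic data
rng = np.random.default_rng(0)
series = rng.random(100)
X, Y = make_supervised(series, d=24, h=12)
print(X.shape, Y.shape)  # (65, 24) (65, 12)
```

With d = 24, each input window covers one full day, which is what allows the networks to exploit the daily periodicity mentioned above.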
The average performances of the Pers, Pers24, and Clsky models, computed over the first 3 hours of the forecasting horizon for the test year 2019, are shown in Table 1 and Table 2. The performance of the Pers predictor deteriorates rapidly as the horizon increases and is acceptable only in the short term: after 1 h, the NSE is 0.79 (considering whole-day samples) and 0.59 (daytime samples only), but after two hours the NSE decreases to 0.54 (whole day) and 0.14 (daytime only), and at six steps ahead it drops to −0.93 (whole day) and −1.64 (daytime). The Pers24 and Clsky models maintain the same performance at every step ahead, since they are independent of the horizon. Such models inherently account for the daily pseudo-periodic component that characterizes hourly global solar radiation.
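The two persistence baselines can be summarized in a few lines; the sketch below follows their standard definitions (Pers repeats the last observed value, Pers24 repeats the value observed 24 h before each target hour), with illustrative function names:

```python
import numpy as np

def pers_forecast(series, t, h):
    # naive persistence: the last observed value is kept for every step ahead
    return np.full(h, series[t])

def pers24_forecast(series, t, h):
    # daily persistence: repeat the value observed 24 h before each target hour
    return np.array([series[t + k - 24] for k in range(1, h + 1)])

series = np.arange(48, dtype=float)   # toy hourly series
print(pers_forecast(series, 30, 3))   # three copies of series[30]
print(pers24_forecast(series, 30, 3)) # series[7], series[8], series[9]
```

This makes the horizon-independence of Pers24 explicit: its prediction for step k depends only on the value recorded 24 h before hour t + k, not on h.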
The Pers24 predictor appears superior to Clsky (lower error indicators, higher NSE), confirming that the information from the last 24 h is much more relevant for an accurate prediction than the long-term annual cycle. Indeed, the sun position changes little from one day to the next, and meteorological conditions have a statistically relevant tendency to persist. Additionally, the Pers24 predictor is the only one with practically zero bias: the small difference appearing in the first column of Table 1 is simply due to the differences between 31/12/2018 (used to compute the predicted values for 1/1/2019) and 31/12/2019. The clear sky model, which by definition assumes the absence of cloud cover, overestimates the values above the threshold by 89.86 Wm−2 on average.
From Table 2, it appears that Pers, Pers24, and Clsky are not, on average, reliable models, especially when the NSE is evaluated using daytime samples only.
Figure 4 reports the results obtained with the three different FF approaches (Figure 4a–c) and the LSTM forecasting model (Figure 4d) in terms of NSE (both for the whole day and for daytime samples only).
Generally speaking, Figure 4 shows that the NSE of all the considered neural predictors reaches an asymptotic value at around six steps ahead. This is consistent with the previous analysis of the mutual information (see Figure 3), which, at the hourly scale, is almost zero after six lags.
If the evaluation is carried out on whole-day samples, all the models would appear reliable enough, since the NSE remains only slightly below 0.8 even for prediction horizons of 12 h. On the contrary, if the evaluation considers daytime samples only, it clearly appears that the models are reliable for a maximum of 5 h ahead, as for longer horizons the NSE typically falls below 0.6. Removing the nighttime values of the time series is therefore decisive for a realistic assessment of a solar radiation forecasting model, which would otherwise be strongly biased.
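The effect of the nighttime zeros on the evaluation can be reproduced with a simple computation; the sketch assumes the standard Nash–Sutcliffe efficiency, NSE = 1 − SSE/SST, and uses an invented toy series for illustration:

```python
import numpy as np

def nse(obs, pred):
    # Nash-Sutcliffe efficiency: 1 - SSE / total variance of the observations
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs  = np.array([0., 0., 100., 400., 300., 0.])   # toy series with nighttime zeros
pred = np.array([0., 0., 150., 350., 250., 0.])

daytime = obs > 25.0                       # threshold on sunlight samples
print(nse(obs, pred))                      # whole-day NSE, inflated by the easy zeros
print(nse(obs[daytime], pred[daytime]))    # daytime-only NSE, more demanding
```

Even with identical errors on the daytime hours, the whole-day score is noticeably higher, because the trivially predicted nighttime zeros enlarge the total variance in the denominator.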
Going into deeper detail, the following considerations can be made.
The FF recursive approach performs slightly worse, particularly in terms of NSE and specifically beyond a forecasting horizon of 5 h. The FF multi-output and multi-model approaches show performances similar to the LSTM. One can also note that the performance decreases regularly with the length of the horizon for the FF recursive approach and for the LSTM network, since both explicitly account for the sequential nature of the task. Conversely, the FF multi-output and multi-model approaches show some irregularities, particularly the latter, since each predictor for a specific time horizon is completely independent of those for shorter horizons. If perfect training were possible, such irregularities might perhaps be reduced, but they cannot be completely avoided, particularly on the test dataset, because they are inherent to an approach that treats each predicted value as a separate task.
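The difference between the multi-step strategies lies only in how the h-step forecast is assembled from one-step components. A schematic sketch, with a placeholder one-step model standing in for a trained network (any fitted regressor with the same interface would do):

```python
import numpy as np

def recursive_forecast(model, last_d, h):
    """One model, h chained calls: each prediction is fed back as input."""
    window = list(last_d)
    out = []
    for _ in range(h):
        y = model(np.asarray(window))   # one-step-ahead prediction
        out.append(y)
        window = window[1:] + [y]       # slide the window over the forecast
    return np.asarray(out)

def multi_model_forecast(models, last_d, h):
    """One independent model per horizon: predictions need not be consistent."""
    x = np.asarray(last_d)
    return np.asarray([models[k](x) for k in range(h)])

# toy one-step model: persistence of the last input value
model = lambda x: x[-1]
models = [lambda x: x[-1] for _ in range(12)]
print(recursive_forecast(model, [1., 2., 3.], 4))
print(multi_model_forecast(models, [1., 2., 3.], 4))
```

The recursive scheme propagates its own errors but is sequential by construction, which matches the regular performance decay observed above; the multi-model scheme has no such coupling between horizons, hence the irregularities.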
For all the considered benchmarks and neural predictors, the difference between the whole time series (average value 140.37 Wm−2) and the thresholded, daytime-only case, which excludes nighttime values (average 328.62 Wm−2), emerges clearly: during the night the values are zero or close to zero, and the corresponding errors are therefore also low.
The FF networks and the LSTM also perform similarly in terms of the indices computed over the first 3 h of the forecasting horizon, as shown in Table 3 and Table 4 for the whole day and daytime only, respectively.
All the neural predictors provide a definite improvement over the Pers, Pers24, and Clsky models. Looking, for instance, at the NSE, the best baseline predictor is Pers24, scoring 0.63 (whole day) and 0.28 (daytime only); the corresponding values for the neural networks exceed 0.86 and 0.73, respectively.
An in-depth analysis should compare the performance of the neural predictors at each step with the best benchmark for that specific step. The latter can be considered an ensemble of benchmarks composed of the Pers model, the best performer one step ahead (NSE equal to 0.79), and the Pers24 model on the following steps (NSE equal to 0.63 for h from 2 to 12). From this perspective, the neural networks clearly outperform the considered baseline, since their NSE score varies from 0.90 to 0.75 (see the solid lines in Figure 4, referring to the whole day). The same analysis performed excluding nighttime values leads to quite similar results, confirming that the neural networks always perform much better than the benchmarks considered here.
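The stepwise benchmark ensemble just described amounts to switching baseline at k = 2; a minimal sketch, reusing the persistence definitions given earlier (names are illustrative):

```python
import numpy as np

def benchmark_ensemble(series, t, h):
    """Stepwise best baseline: Pers for k = 1, Pers24 for k = 2..h."""
    preds = [series[t]]  # one step ahead: plain persistence of the last value
    preds += [series[t + k - 24] for k in range(2, h + 1)]  # daily persistence
    return np.asarray(preds)

series = np.arange(48, dtype=float)  # toy hourly series
print(benchmark_ensemble(series, 30, 3))  # series[30], then series[8], series[9]
```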
An additional way to examine the model performances is presented in Table 5, which reports the NSE of the LSTM network predictions for three horizons, namely 1, 3, and 6 hours ahead. The sunlight (i.e., above 25 W/m2) test series is partitioned into three classes: cloudy, partly cloudy, and sunny days, which constitute about 30%, 30%, and 40% of the sample, respectively. More precisely, cloudy days are defined as those for which the daily average irradiance is below 60% of the clear sky index, and sunny days as those above 90% (recall that the clear sky index already accounts for the average sky cloudiness).
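A possible way to implement such a partition, assuming a clear sky irradiance profile is available for each day (the 60% and 90% thresholds are those stated above; array and function names are ours):

```python
import numpy as np

def classify_day(irradiance, clear_sky, low=0.60, high=0.90):
    """Label a day from the ratio of daily mean irradiance to clear sky mean."""
    ratio = np.mean(irradiance) / np.mean(clear_sky)
    if ratio < low:
        return "cloudy"
    if ratio > high:
        return "sunny"
    return "partly cloudy"

clear = np.array([0., 200., 600., 800., 600., 200.])  # toy clear sky profile
print(classify_day(0.50 * clear, clear))  # cloudy
print(classify_day(0.75 * clear, clear))  # partly cloudy
print(classify_day(0.95 * clear, clear))  # sunny
```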
It is quite apparent that the model performance decreases steadily from sunny, to partly cloudy, to cloudy days. This result is better illustrated in Figure 5, where the 3-hour-ahead predictions are shown for three typical days. On the sunny day, on the right, the process is almost deterministic (governed mainly by astronomical conditions), while the situation is completely different on a cloudy day. In the latter case, the forecasting error is of the same order of magnitude as the process itself (NSE close to zero), and it can be even larger at 6 hours ahead, which explains the negative NSE value shown in Table 5.
3.2. Domain Adaptation
Besides the accuracy of the forecasted values, another important characteristic of forecasting models is their generalization capability, often referred to as domain adaptation in the neural network literature [58]. This means the possibility of storing knowledge gained while solving one problem and applying it to different, though similar, datasets [59].
To test this feature, the FF and LSTM networks developed for the Como station (source domain) have been used, without retraining, at other sites (target domains) spanning more than one degree of latitude and representing quite different geographical settings, from the low, open plain at 35 m a.s.l. up to the mountains at 800 m a.s.l. In addition, the test year has been changed, because solar radiation is far from truly periodic and some years (e.g., 2017) show significantly higher values than others (e.g., 2011), with differences of about 25% between yearly average values.
Figure 6 shows the NSE of the multi-output FF and LSTM networks for three additional stations. All the curves reach a plateau after six steps ahead, as suggested by the mutual information computed on the Como station, and the differences between the FF and LSTM networks appear very small or even negligible at almost all the other stations. Six hours ahead, the difference in NSE between Como in the test year (2019), for which the networks were trained, and Bema in 2017, which appears to be the most different dataset, is only about 3% for both the FF models and the LSTM.
As a further trial, both the FF models and the LSTM have been tested on a slightly different process, i.e., the hourly average solar radiation recorded at the Como station. While this process has the same mean as the original dataset, its variability is different, since the standard deviation decreases by about 5%: the averaging indeed filters out the high frequencies. Forecasting results are shown in Figure 7. For this process too, the neural models perform more or less as for the hourly values on which they were trained. The accuracy of both the LSTM and the FF networks improves by about 0.02 (or 8%) in terms of MAE, and slightly less in terms of NSE, in comparison with the original process. For a correct comparison with
Figure 4, however, it is worth bearing in mind that the 1-hour-ahead prediction corresponds to h = 2 in the graph, since the average computed at hour t + 1 is that over the interval from t to t + 1 and, thus, includes values that are only 5 minutes ahead of the instant at which the prediction is formulated.
Ad hoc training on each sequence would undoubtedly improve the performance, but the purpose of this section is precisely to show the potential of networks calibrated on different stations, in order to evaluate the possibility of adopting a predictor developed elsewhere when a sufficiently long series of values is missing. The forecasting models we developed for a specific site could be used with acceptable accuracy at sites where recording stations are not available or where the existing time series are not long enough. This suggests the possibility of developing a unique forecasting model for the entire region.