Improving Prediction Intervals Using Measured Solar Power with a Multi-Objective Approach

Abstract: Prediction intervals are pairs of lower and upper bounds around point forecasts and are useful for taking into account the uncertainty of predictions. This article studies the influence of using measured solar power, available at prediction time, on the quality of prediction intervals. While previous studies have suggested that using measured variables can improve point forecasts, little research has been done on whether that additional information can also yield prediction intervals with less uncertainty. With this aim, a multi-objective particle swarm optimization method was used to train neural networks whose outputs are the interval bounds. The inputs to the networks were hourly meteorological forecasts, optionally complemented with measured solar power. This study was carried out on data from three different locations and for five forecast horizons, from 1 to 5 h. The results were compared with two benchmark methods (quantile regression and quantile regression forests), and the Wilcoxon test was used to assess statistical significance. The results show that using measured power reduces the uncertainty associated with the prediction intervals, but mainly for the short forecasting horizons.


Introduction
In the last decade, wind and solar energy production, and more specifically photovoltaic (PV) energy, has increased significantly, leading to a large and rapidly growing PV penetration in the electricity market [1]. To support this high penetration, it is important to have accurate point forecasts, and most of the research has focused on this issue. However, due to the high variability of several meteorological factors, solar power prediction is inherently uncertain and, therefore, it is also important to estimate the uncertainty around point forecasts. Accurate knowledge of the uncertainty in renewable energy production can provide both economic opportunities [2] and reliability improvements [3]. Recent literature shows that research into new uncertainty estimation methods, as well as their application to the renewable energy industry, is a field of great activity [4][5][6].
The most commonly used ways of representing uncertainty are quantiles and prediction intervals (PIs). The estimation of quantiles has been widely addressed in the literature, especially through methods that estimate each desired quantile independently. For example, quantile regression (QR) [7] constructs linear models by minimizing the quantile loss (or pinball score) for a particular quantile probability. Nonlinear models can also be built with this cost function, for example quantile regression neural networks (QRNN) [8] or tree-based ensembles such as gradient tree boosting (GB) [9,10]. There are also techniques that build a single model from which all quantiles can be extracted. For instance, quantile regression forests (QRF) [11] is another non-linear ensemble technique, and analog ensembles (AnEn) [12][13][14] generate probability distributions from a set of past analogs.
All methods were tested using as inputs only the meteorological variables, and using the meteorological variables together with the solar power measured at t0.
The study was done using data from the probabilistic solar power forecasting track of the Global Energy Forecasting Competition 2014 (GEFCom2014) [30]. To advance beyond the preliminary results [26], extensive experimentation was carried out in this study. Data from three different solar plants in Australia (Stations 1-3) were used, and the PIs were estimated for five intra-day forecasting horizons (1-5 h, UTC time). Given that MOPSO and QRF are stochastic techniques, they were run 90 times, and statistical comparisons were performed by means of the Wilcoxon test. The extensiveness of the experiments and the statistical tests made it possible to draw reliable conclusions.
The rest of the article is organized as follows. Section 2 describes the dataset used for experiments. Section 3 summarizes the evolutionary multi-objective approach for interval optimization. Section 4 describes the two baseline methods for comparison. Section 5 describes the experimental setup and the results, including statistical significance tests. Conclusions are finally drawn in Section 6.

Data Description
The data used in this work were obtained from Task 15 of the probabilistic solar power forecasting problem [30], one of the four tracks of the Global Energy Forecasting Competition 2014 (GEFCom2014). These data include measured solar power generation and meteorological forecasts for three solar stations (Stations 1-3) in Australia whose exact locations were never revealed. Measured solar power generation, expressed as a proportion (between 0 and 1) of the nominal power of the plant, was provided hourly, from 1 April 2012 01:00 to 1 July 2014 00:00 UTC. The meteorological forecasts were obtained from the European Centre for Medium-range Weather Forecasts (ECMWF) and include 12 weather variables. Forecasts for those weather variables are issued by ECMWF once per day, at midnight, with hourly values for the next 24 h. Thus, if t0 is the time at which forecasts are issued (00:00 UTC), each meteorological variable gets a forecast for t0 + 1, t0 + 2, . . . , t0 + 24.
The 12 weather variables are: total column liquid water, i.e., the vertical integral of cloud liquid water content (kg · m −2 ); total column ice water, i.e., the vertical integral of cloud ice water content (kg · m −2 ); surface pressure (Pa); relative humidity with respect to saturation at 1000 mbar; total cloud cover (0-1); 10-m U wind component (m · s −1 ); 10-m V wind component (m · s −1 ); 2-m temperature (K); surface solar radiation downward (J · m −2 ); surface thermal radiation downward (J · m −2 ); top net solar radiation, i.e., net solar radiation at the top of the atmosphere (J · m −2 ); and total precipitation (convective precipitation + stratiform precipitation) (m).
Figure 1 displays a box plot of the normalized measured power generation for each station over the whole period (2012-04-01 01:00 to 2014-07-01 00:00), broken down by UTC hour. It is a standard box plot where the lower and upper hinges correspond to the first and third quartiles and the central black horizontal bar is the median. Whiskers denote data within 1.5 times the inter-quartile range of the hinges, and black points mark outliers. It is important to remark that the hour on the x-axis corresponds to ECMWF time. Given that GEFCom2014 did not disclose the actual location of the three solar stations, it is not possible to know with certainty the local time of each station; noon can be estimated as the hour of maximum solar radiation (Hours 1-2 for Station 1, Hour 3 for Station 2, and Hour 2 for Station 3). Zero is the hour at which ECMWF forecasts are issued. Based on the box plot in Figure 1, and given that measured power should be expected to have influence mostly for short time horizons, it was decided to study Hours 1-5. This also excludes hours too close to the night period. Given that, for each horizon, there is a single forecast every day, the available data consist of 820 instances, corresponding to 820 days.
To ensure that both the train and test sets are representative, out of every 30 days, 20 consecutive days were chosen for training and the remaining 10 for testing. This yields 550 days for training and 270 days for testing. To perform hyper-parameter tuning tasks, such as choosing the best neural network architecture, the best number of optimization iterations, or the best configuration for QRF, it is necessary to split the available training data into two further sets: the training set and the validation set (also known as the development (dev) set). In this work, for every 20-day block of the original training partition, the first 14 days were used for training and the remaining six for validation.
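A minimal sketch of this block-wise split (illustrative Python; the function and variable names are ours, not from the paper):

```python
def split_days(n_days):
    """Assign day indices to train/validation/test following the paper's
    scheme: in every 30-day block, the first 14 days go to training,
    the next 6 to validation, and the last 10 to testing."""
    train, val, test = [], [], []
    for day in range(n_days):
        pos = day % 30
        if pos < 14:
            train.append(day)
        elif pos < 20:
            val.append(day)
        else:
            test.append(day)
    return train, val, test

# 820 available days -> 550 for model building (388 train + 162 validation)
# and 270 for testing, matching the counts reported in the text.
train, val, test = split_days(820)
```

Note that the last, incomplete 30-day block (10 days) falls entirely into the training portion, which is how 550 training days are obtained.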

Multi-Objective Optimization for Prediction Intervals
This section describes the multi-objective particle swarm optimization (MOPSO) algorithm applied to PIs, originally reported in [21]. The application was inspired by LUBE [16], which used an artificial neural network (ANN) with a single hidden layer to estimate the lower and upper bounds of the PIs, optimizing the ANN with multi-objective evolutionary techniques.
In this study, the inputs to the ANN are the meteorological forecast variables and, optionally, the measured solar output power. The outputs of the network are the lower and upper bounds forecasted for the given inputs. For each input in the dataset there is an observed power value, but the true interval bounds are never available; therefore, backpropagation against known targets is not suitable for this network. As the desired outputs are unknown, this is not a standard supervised regression problem, and an alternative optimization technique is required.
Therefore, an evolutionary optimization algorithm was chosen to optimize the weights of the network with two primary goals: PI coverage and interval width. Coverage is measured using the prediction interval coverage probability (PICP), which computes the frequency of observations lying within the interval. The PICP can be maximized trivially by making the intervals arbitrarily wide, so a second goal was set to minimize the width of the PIs. The following paragraphs formalize these goals. Let D = {(X_i, t_i), i = 1, ..., N} be a set of observations, where X_i is a vector with the input variables and t_i is the observed output variable. Let PI_i = [Low_i, Upp_i] be the prediction interval for observation X_i (Low_i and Upp_i are the outputs of the ANN). The PICP is computed as in Equation (1) and the average interval width (AIW) as in Equation (2):

PICP = (1/N) * sum_{i=1..N} c_i, (1)

AIW = (1/N) * sum_{i=1..N} (Upp_i - Low_i), (2)

where N is the number of samples, c_i = 1 if t_i ∈ [Low_i, Upp_i] (and 0 otherwise), and Upp_i and Low_i represent the PI boundaries.
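Under these definitions, the two objectives can be computed with a few lines of NumPy (an illustrative sketch, not the authors' code):

```python
import numpy as np

def picp(t, low, upp):
    """Eq. (1): fraction of observations t_i lying inside [Low_i, Upp_i]."""
    return np.mean((t >= low) & (t <= upp))

def aiw(low, upp):
    """Eq. (2): average width of the prediction intervals."""
    return np.mean(upp - low)

# Tiny example: two of the four observations fall inside their interval.
t   = np.array([0.2, 0.5, 0.9, 0.4])
low = np.array([0.1, 0.4, 0.5, 0.5])
upp = np.array([0.3, 0.7, 0.8, 0.9])
print(picp(t, low, upp))  # 0.5
print(aiw(low, upp))      # 0.3
```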
Maximizing the PICP pushes towards wider intervals (a higher AIW), while minimizing the AIW sacrifices coverage. Since the two goals are in conflict, a multi-objective approach was applied [21]. Every particle in the MOPSO represents a candidate network, i.e., a different configuration of weights. The objectives of the algorithm are 1 − PICP (Equation (1)) and AIW (Equation (2)). The network receives as inputs the meteorological data (from the dataset described in Section 2) and, optionally, the measured solar power at t0.
The MOPSO algorithm produces a set of ANNs forming a Pareto front of non-dominated solutions. Typically, the user requires PIs with a desired coverage, called the PI nominal coverage (PINC). It is often more convenient to refer to 1 − PINC, denoted alpha from now on. In this approach, the solution in the Pareto front whose alpha is closest to the one requested by the user is selected [21].
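The selection step then reduces to a nearest-miscoverage search over the front. The sketch below is illustrative; the (picp, aiw, model) tuple layout is our assumption, not the paper's data structure:

```python
def select_solution(front, alpha):
    """Pick the Pareto-front solution whose empirical miscoverage
    (1 - PICP) is closest to the alpha requested by the user."""
    return min(front, key=lambda sol: abs((1.0 - sol[0]) - alpha))

# Hypothetical front: (picp, aiw, model) triples.
front = [(0.99, 0.80, "net-a"), (0.95, 0.50, "net-b"), (0.90, 0.30, "net-c")]
best = select_solution(front, alpha=0.05)  # miscoverage 0.05 -> "net-b"
```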

Benchmark Methods
To provide baselines against which to compare the MOPSO results, linear QR [7] and non-linear QRF [11] were used.

Quantile Regression
The QR algorithm [7] forecasts quantiles using linear models. Whereas the least squares method predicts the conditional mean of the target, QR predicts the median or any other quantile, by minimizing the quantile loss instead of the quadratic loss. Quantiles can be used to obtain PIs. Let q1 and q2 be the (1 − PINC)/2 and (1 + PINC)/2 quantiles, respectively; they delimit the left and right tails of the distribution, each with probability (1 − PINC)/2. The coverage of the PI [q1, q2] is then equal to PINC. To summarize, QR builds a pair of linear models to estimate the q1 and q2 quantiles, which in turn form the PI [q1, q2].

Quantile Regression Forest
QRF [11] is an adaptation of the random forest (RF) algorithm [31] for estimating quantiles and follows a different strategy from linear quantile regression. RF is an ensemble learning algorithm whose individual models are regression trees. Each tree is built from a different subset of the training data obtained by bootstrap sampling. To obtain an output for an instance, the instance is dropped down each tree until it reaches a leaf, and the average response value of the training instances in that leaf is returned. RF becomes QRF by changing the information stored in the leaf nodes: instead of only the average response, the leaves store all the response values of the training instances that reached them. When an instance is dropped down each of the trees in the QRF ensemble, every regression tree returns a set of response values (the ones in the leaf reached by the instance). From the union of all these response values, quantiles can be estimated, and PIs can be computed from the quantiles, as was done with quantile regression.
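The leaf-pooling idea can be sketched on top of scikit-learn's RandomForestRegressor. Note this is a simplified approximation of QRF: Meinshausen's original formulation weights the pooled values, whereas this sketch pools them uniformly:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class SimpleQRF:
    """Minimal quantile-forest sketch: remember which leaf each training
    sample fell into, then pool leaf targets at prediction time."""

    def __init__(self, **rf_kwargs):
        self.rf = RandomForestRegressor(**rf_kwargs)

    def fit(self, X, y):
        self.rf.fit(X, y)
        self.y = np.asarray(y)
        self.train_leaves = self.rf.apply(X)  # shape (n_samples, n_trees)
        return self

    def predict_quantiles(self, X, quantiles):
        leaves = self.rf.apply(X)
        out = np.empty((len(X), len(quantiles)))
        for i, row in enumerate(leaves):
            # Union of training targets from the leaves this sample reaches.
            pooled = np.concatenate([
                self.y[self.train_leaves[:, j] == leaf]
                for j, leaf in enumerate(row)
            ])
            out[i] = np.quantile(pooled, quantiles)
        return out

X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = X[:, 0]
qrf = SimpleQRF(n_estimators=25, min_samples_leaf=5, random_state=0).fit(X, y)
q = qrf.predict_quantiles(np.array([[0.5]]), [0.05, 0.5, 0.95])
```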

Experimental Validation
MOPSO and the two benchmark approaches (QR and QRF) were evaluated for constructing PIs for the case study described in Section 2. The main comparison is between using and not using the power measured at t0; configurations that use it are identified with the suffix +Pt0. Therefore, six configurations were tested: MOPSO, MOPSO+Pt0, QR, QR+Pt0, QRF, and QRF+Pt0. The PIs obtained from the different approaches were evaluated using two metrics: the average interval width (AIW) and the ratio PICP/AIW. The larger the ratio, the better the trade-off between PICP and AIW; solutions that reach a large PICP with wide intervals are penalized with small ratio values.
PIs were obtained independently for each of the five forecast horizons (1-5 h). Several target miscoverage values (alpha) were considered within the range 0.02-0.20. The QR and QRF approaches must be run for each desired alpha value, whereas the MOPSO approach obtains PIs for all alphas in a single run (see Section 3). Ninety runs were executed for MOPSO (each run being valid for all alpha values) and for QRF, because they are stochastic methods. For QR, only one run per alpha was executed, because it is deterministic. The following three subsections describe the tuning of the hyper-parameters of the methods, the experimental results, and the statistical significance tests.

Hyper-Parameter Tuning
Two hyper-parameters were tuned for MOPSO: the number of hidden neurons and the number of PSO iterations. The tuning methodology used a training and validation set approach, where the validation set was used to compare and select the best hyper-parameter values. The numbers of hidden neurons explored were 2, 4, 6, 8, 10, 15, 20, 30, and 50, and the numbers of PSO iterations ranged from 1000 to 16,000 in steps of 1000. Ninety runs were carried out for each number of neurons and iterations, starting with different random seeds; therefore, each of the 90 runs had its own hyper-parameter tuning process, which resulted in an optimal number of neurons and iterations per run. The optimal hyper-parameters were chosen using the validation-set hypervolume [21]. It is important to remark that this optimization process was carried out independently for each of the five forecasting horizons. The 90-run average (and standard deviation) of the best hyper-parameters for each horizon and configuration (MOPSO and MOPSO+Pt0) is displayed in Table 1.
It is observed that the number of hidden neurons depends on the horizon: long horizons seem to require fewer hidden neurons than short horizons. This trend is not fully systematic, but for all stations more neurons are required for the first hour than for the fifth. A possible explanation is that, for longer horizons, predictions are noisier, and simpler models are required to avoid overfitting. In any case, it shows that tuning the number of neurons separately for each horizon is useful. The number of iterations is always between 12,000 and 14,000, below the maximum value of 16,000, which shows that the iteration limit was appropriate. Nevertheless, we tried increasing this limit for a few runs, and no significant changes in the Pareto fronts were observed.
For QR, no parameter tuning is required. However, the performance of the RF method depends on two main parameters: nodesize (the minimum number of samples in the tree leaves) and mtry (the number of input attributes considered in the trees). Therefore, for QRF, we also performed hyper-parameter tuning to select the most suitable configuration. The values explored for nodesize were 2, 5, 10, 20, 30, 40, 50, and 100, and for mtry 2, 4, 6, 8, 10, 12, and 14 (the last one only when the past value of power (+Pt0) was used as input). The measure used for selecting the best combination of parameters for QRF was the ratio on the validation set. The study of the best parameters was also carried out for each prediction horizon, as for the MOPSO approach. In this case, the parameter values are very similar for the three stations, the five forecasting horizons, and the different target alpha values. Nodesize values are around 5.6 ± 0.86 with no past power and around 5.6 ± 0.80 when +Pt0 was used. For mtry, the average values are around 11.5 ± 0.62 when no past power was used and 12.5 ± 0.78 when +Pt0 was used (bearing in mind that, when past power was used, an additional attribute was added to the set of inputs).
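The grid search over the two RF parameters can be sketched as follows. This is illustrative only: `evaluate` stands for a function returning the validation-set ratio for a given configuration, which is an assumption of ours, not code from the paper:

```python
from itertools import product

NODESIZES = (2, 5, 10, 20, 30, 40, 50, 100)
MTRYS = (2, 4, 6, 8, 10, 12)  # 14 would be added when +Pt0 is an input

def tune_qrf(evaluate, nodesizes=NODESIZES, mtrys=MTRYS):
    """Return the (nodesize, mtry) pair maximizing the validation ratio."""
    return max(product(nodesizes, mtrys), key=lambda p: evaluate(*p))

# Dummy evaluation function peaking at nodesize=5, mtry=12.
best = tune_qrf(lambda n, m: -((n - 5) ** 2 + (m - 12) ** 2))
```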

Experimental Results
To have a first view of the performance of the different methods, Table 2 shows the mean of the ratio values for each method and each horizon, separately for Stations 1-3. Each value is the mean of the ratios over all the alpha values (0.02, 0.05, 0.08, 0.1, 0.15, and 0.2).
The results show that MOPSO+Pt0 obtains better results than MOPSO for the nearest horizons (especially Horizons 1 and 2) and for the three locations. For the other methods (QR and QRF), the +Pt0 variant also behaves better than the original method, and this holds for all the prediction horizons and all the stations. Comparing MOPSO+Pt0 with the other methods, it obtains higher ratio values for all horizons and stations. When QR+Pt0 and QRF+Pt0 are compared, there is no clear predominance of either method: in some cases the ratio values of QRF+Pt0 are better (Station 1), but in others it is the opposite, as in Station 3.
Next, the results are analyzed in more detail. Figures 2-4 display the results for each of the three stations, respectively. Each figure contains two plots: the top one for the ratio (a) and the bottom one for the AIW (b). These figures compare MOPSO, QR, QRF, MOPSO+Pt0, QR+Pt0, and QRF+Pt0. The results for MOPSO, MOPSO+Pt0, QRF, and QRF+Pt0 are displayed as box plots, because they are a good summary of the 90 runs, whereas QR and QR+Pt0 are shown as horizontal lines, because QR is not stochastic. Plots are shown for some representative alpha values (0.02, 0.05, 0.08, 0.1, 0.15, and 0.2). Figure 2a displays the ratio of the six methods tested for Station 1. It is readily apparent that the ratio of MOPSO+Pt0 is larger than that of MOPSO for the first horizon and, to a lesser degree, for the second one. This is to be expected, because the influence of measurements at time t0 should decrease as the horizon moves farther away from t0. It can also be noticed that the difference between MOPSO+Pt0 and MOPSO decreases as alpha increases. For instance, for the second horizon, MOPSO+Pt0 is clearly better than MOPSO for alpha = 0.02 but quite similar for alpha = 0.2.
The second issue to notice in Figure 2 is that, although QR and QRF also benefit from the measured power at short horizons, the MOPSO configurations typically outperform their QR and QRF counterparts (i.e., the ratio of MOPSO+Pt0 is larger than those of QR+Pt0 and QRF+Pt0, and the same holds when the measured power is not used). This is true for all horizons and alphas. The AIW plot at the bottom of Figure 2b confirms the previous findings: MOPSO+Pt0 intervals are narrower than MOPSO intervals for the first two horizons (and the same is true for QR versus QR+Pt0 and QRF versus QRF+Pt0), highlighting the importance of the measured power for obtaining narrow intervals. Again, the MOPSO configurations always obtain narrower intervals than the QR and QRF baselines. Figure 3 supports the previous results for Station 2. In this case, higher ratios are obtained by MOPSO+Pt0 (vs. MOPSO, QR+Pt0, and QRF+Pt0) for longer horizons (from 1 to 4 h). Finally, those results are mirrored by Station 3, as shown in Figure 4, although in this case there is an anomaly for alpha = 0.1, where QR+Pt0 manages to obtain better ratios (and narrower intervals) than MOPSO+Pt0 for the first forecasting horizon. For this station, it is also observed that QRF+Pt0 obtains ratios similar to those of MOPSO+Pt0 for alpha > 0.05 and Horizon 1; for the remaining horizons, MOPSO+Pt0 outperforms QRF+Pt0. To show the PIs constructed for some specific days, Figure 5 displays a set of examples. The left column compares MOPSO+Pt0 (grey) with MOPSO (yellow) and the right column compares MOPSO+Pt0 (grey) with QRF+Pt0 (yellow). The x-axis displays the five forecasting horizons and the y-axis shows the PIs (in grey and yellow) and the actual normalized power (black line). Although the results depend on the particular day, in general, MOPSO+Pt0 provides narrower intervals than MOPSO (except at the last horizon (5 h) for some of the days) and than QRF+Pt0.

Statistical Significance Tests
The above figures offer a qualitative view of the results but, given that MOPSO and QRF were run 90 times, statistical tests could also be computed. In this work, the Wilcoxon test was used. The results are summarized in Figures 6-8 for Stations 1-3, respectively. Only the tests for the ratio are shown (the AIW follows a similar pattern and can be found in Appendix A). Part (a) (top) of Figures 6-8 compares MOPSO+Pt0 and MOPSO: a blue point is used when the ratio of MOPSO+Pt0 is significantly larger than that of MOPSO, and a red point for the opposite case; grey signals that MOPSO+Pt0 and MOPSO are not statistically different. The lower plots, (b) and (c), of Figures 6-8 perform the same comparison for MOPSO+Pt0 versus QR+Pt0 and for MOPSO+Pt0 versus QRF+Pt0. The results for Station 1 (see Figure 6a) show that, indeed, for the first two horizons, the ratios of MOPSO+Pt0 are better than those of MOPSO (blue points for all alphas). However, beyond the second hour, MOPSO+Pt0 is not necessarily significantly better (and, in very few cases, it is significantly worse: red points). The comparisons of MOPSO+Pt0 against QR+Pt0 and QRF+Pt0 offer more systematic results: almost all points are blue, which means that MOPSO+Pt0 is essentially always significantly better than QR+Pt0 and QRF+Pt0. The second station shows even cleaner results: Figure 7a displays blue points for the first four hours and mostly grey for the fifth horizon. Again, the blue-filled plots of Figure 7b,c mean that MOPSO+Pt0 is significantly better than QR+Pt0 and QRF+Pt0 in all cases. In summary, for Station 2, the measured power has a beneficial effect on MOPSO over even longer horizon spans (up to the fourth hour). Finally, Figure 8 plots the results for Station 3. In this case, the top plot shows blue points up to the third hour, all grey for the fourth horizon, but mostly red for the fifth one. With respect to MOPSO+Pt0 vs. QR+Pt0 (Figure 8b), points are again mostly blue, but the anomalies mentioned in the qualitative results appear again here (red points) for the first horizon and alpha = 0.09, 0.1, and 0.11. This figure also shows that the similarity between QRF+Pt0 and MOPSO+Pt0 previously observed for Station 3 at the first horizon manifests as differences that are not statistically significant (grey points).
In summary, MOPSO+Pt0 is significantly better than MOPSO for the short horizons for all alpha values. For long horizons, MOPSO+Pt0 is not significantly worse than MOPSO except for some alpha values (mostly in Station 1). MOPSO+Pt0 is significantly better than QR+Pt0 and QRF+Pt0 for all horizons, except for a few alpha values for the first horizon in Station 3.

Conclusions
The LUBE approach is an interesting alternative for estimating PIs in the context of probabilistic forecasting because it estimates the lower and upper bounds of the PIs directly. LUBE can be optimized by MOPSO, a multi-objective evolutionary technique, so that neural networks are trained to simultaneously optimize the two conflicting properties of PIs: coverage and width. The result of this optimization is a Pareto front, from which solutions can be extracted according to the desired coverage value. In this study, this approach was used to obtain PIs for solar power forecasting, addressing two issues. The first was to estimate PIs for five intra-day forecasting horizons, from 1 to 5 h. The second, and more important, was to study the influence of using the solar power measured at time t0 on the quality of the PIs. Previous work has shown that this information is useful for improving point forecasts, whereas here the issue was studied in the context of probabilistic forecasting.
The approach was applied to data from three solar stations in Australia, and experiments were carried out using two main configurations: (1) using the meteorological forecast variables as inputs to the methods; and (2) additionally using the measured solar output power as input. To analyze how far the influence of the measured output reaches, hourly forecast horizons from 1 to 5 h were used. The quality of the prediction intervals was estimated using the coverage/width ratio and the width of the intervals; high ratio values mean that the intervals have a good trade-off between coverage and width. This study was done for several desired miscoverage values (alpha) from 0.02 to 0.20.
The results show that the ratio is improved by using the additional measured information for the first two horizons at the three locations studied and for all the desired coverage values. However, although for one of the stations this beneficial influence reaches up to the fourth horizon, the ratio is not improved at any station for the farthest horizon tested (5 h). The same trend is observed for the interval width. Experiments were replicated 90 times with different random seeds and statistical significance tests were performed, which show that these results are statistically significant. Thus, it can be concluded that using measured solar power reduces the uncertainty of the intervals for short forecasting horizons.
The MOPSO approach was compared with QR and QRF as baseline methods. Both were tested in the same conditions as MOPSO, with and without the power measured at time t0. For all approaches, the use of measured power helped to obtain better PIs, especially at the first horizons. The comparison also shows that MOPSO+Pt0 is significantly better in all cases except for a few alpha values in one of the stations.
Author Contributions: All authors participated in all tasks related to this article (conceptualization, software, validation, data curation, writing original draft and review/editing).
Funding: This work was funded by the Spanish Ministry of Science under contract ENE2014-56126-C2-2-R (AOPRIN-SOL project).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
This appendix displays the statistical results for AIW.