Photovoltaic Power Forecasting: Assessment of the Impact of Multiple Sources of Spatio-Temporal Data on Forecast Accuracy

Abstract: The efficient integration of photovoltaic (PV) production in energy systems depends on the capacity to anticipate its variability, that is, the capacity to provide accurate forecasts. From the classical forecasting methods in the state of the art dealing with a single power plant, the focus has moved in recent years to spatio-temporal approaches, where geographically dispersed data are used as input to improve forecasts for a site for horizons up to 6 h ahead. These spatio-temporal approaches perform differently depending on the data sources available, but the impact of each source on the actual forecasting performance has not yet been evaluated. In this paper, we propose a flexible spatio-temporal model to generate PV production forecasts for horizons up to 6 h ahead and we use this model to evaluate the effect of different spatial and temporal data sources on the accuracy of the forecasts. The sources considered are measurements from neighboring PV plants, local meteorological stations, Numerical Weather Predictions (NWP), and satellite images. The performance is evaluated on a real-world test case featuring 136 PV plants. The forecasting error has been evaluated for each data source using the Mean Absolute Error and Root Mean Square Error. The results show that neighboring PV plants help to achieve around a 10% reduction in forecasting error for the first three hours, followed by satellite images, which help to gain an additional 3% across all horizons up to 6 h ahead. The NWP data show no improvement for horizons up to 6 h but are essential for longer horizons.


Introduction
The urgency of responding to climate change and the necessity to reduce the global carbon footprint have put renewable energy in the spotlight. Photovoltaic (PV) energy generation has grown in many countries as its costs have fallen. However, PV power generation is not controllable, as it depends on meteorological conditions. Increasing PV penetration in the grid therefore requires better control of the production variability. The ability to accurately forecast the future production of PV power plants is thus decisive for both power producers and network operators.
The literature features several methods to forecast PV production. Detailed reviews of the state of the art are provided in [1][2][3]. These methods can be classified according to the forecast horizon, the available data, and the type of approach, which may be based on statistics, physics or a hybrid combination [2]. Although early methods were deterministic, probabilistic approaches are increasingly popular since they provide additional information about the distribution of future production and thus about uncertainty in the forecasts. Some of these probabilistic approaches are based on Numerical Weather Predictions (NWP) issued by meteorological models or sky imaging, and provide ensemble forecasts of the future PV generation [4][5][6]. Analog ensembles [7], regression trees [8,9] and k-nearest neighbors (kNN) [10,11] are also found in the related literature on probabilistic PV forecasting. A wide range of models based on Artificial Neural Networks (ANN) also exists for short-term PV power production [12,13]. These models have evolved from simple neural networks to radial neural networks (more suitable for time series prediction) and more recently to deep learning methods [14].
Geostationary satellite imagery can be used to estimate ground irradiation. The literature mentions different methods to make this estimate. The main difference between these methods is the characterization of interactions between solar radiation and the atmosphere. Refs. [15,16] provide a review of the first methods used, classified according to whether they are physical or statistical. The various evolutions in the characterization of atmospheric phenomena and technological advances in the field of satellite imagery have led to increasingly efficient methods for deriving irradiation data from satellite images [17,18].
Satellite data can be coupled with ground irradiation measurements to improve the quality of the estimates provided; the site-adaptation method allows estimates to be compared with actual on-site measurements [19].
Ground-level irradiation data, produced either by satellite imagery alone or by coupling to ground measurements, are used to provide irradiation forecasts for horizons ranging from 0 (nowcasting) to 6 h, based on the physical, statistical or hybrid prediction methods presented in literature reviews [2,20]. Among other things, cloud motion vectors (CMV) determine the speed and direction of clouds by analyzing satellite images [21][22][23] to provide better forecasts of irradiation. Artificial neural networks [24,25], support vector machines (SVM) [26,27] and Bayesian estimation [28] are also used for irradiation forecasting from satellite images. Approaches also include spatio-temporal methods [29,30] and methods that combine satellite images, NWP forecasts and ground measurements [31,32].
Considering the state of the art, the key contributions of this paper can be summarized as follows: (1) we propose a spatio-temporal model which can extract and use both spatial and temporal data from the different available sources of data to improve forecast accuracy. The model follows a data-driven approach, where the available data are directly fed as input without any advanced pre-processing other than normalization (e.g., without deriving information such as cloud motion vectors); (2) we show that the large dimensionality of the model can be efficiently addressed by a Lasso approach that selects the most relevant inputs; (3) we provide a thorough quantitative comparison of the impact that the multiple heterogeneous sources of spatio-temporal data have on the forecasting performance. These data include measurements from neighboring PV plants, local meteorological stations, NWP forecasts and satellite images. Each new data source is added in relation to the forecasting horizon, making it possible to indicate which data are beneficial for which horizon. We also propose a method to define the radius of useful pixels around a PV plant, used to preselect the information fed as input to the model; (4) finally, an exhaustive validation of the proposed approach is made with a real-world case study comprising 136 PV installations in France. These contributions will help to build more efficient forecasting models, encourage data sharing, and contribute to cost-benefit analyses for new measuring infrastructures.
The paper is structured as follows: the PV data and other data sources are presented in Section 2; the proposed incremental spatio-temporal model is presented in Section 3, while Section 4 presents an evaluation and analysis of the performance of the forecasts. Finally, the conclusions of the study are discussed in Section 5.

PV Power Data and Weather Forecasts
The data set, denoted d, comprises 136 different PV power plants in mid-west France. Each power plant is an aggregation of power inverters with peak power ranging from 3.2 kWp to 58 kWp. The distance between the power plants varies from 1 km to 230 km and the available data cover November 2014 to March 2016 with a 15 min temporal resolution. The locations of the power plants are represented in Figure 1. In the following, the power plants are labeled P_i, 1 ≤ i ≤ 136. The production data have been normalized employing the same procedure as that proposed in [33]. This prevents the daily course of the sun from dominating the correlations estimated between two sites. The NWP predictions come from the European Centre for Medium-Range Weather Forecasts (ECMWF), using its HRES solution (https://www.ecmwf.int/en/forecasts/datasets/, accessed on April 2017). The local meteorological measurements are obtained from the closest meteorological station of the Meteo France network.

Satellite Images
The satellite images used in this paper are extracted from the Helioclim database [34,35]. This database was created using MFG EUMETSAT (European Organization for the Exploitation of Meteorological Satellites) satellite observations. The Helioclim3 version is one of the most efficient versions of the database, featuring improved spatial (3 km at nadir) and temporal (15 min) resolutions. The images are processed in nearly real time: there is no analysis time before the reception of the images like there is for NWP forecasts (with 2 h runtime for the fastest NWP models); the small delay is due to internet speed and is in the range of milliseconds. The pixels for low solar elevation are interpolated. An example of satellite data providing GHI over an area that covers the power plants of the test case is presented in Figure 2. The figure shows GHI values for two instants in January and July, representing respectively a spatial "screenshot" of the GHI intensity in winter and summer. The lower panel (July) presents a higher level of GHI than the upper panel (January). Moreover, in the lower panel, the majority of power plants fall in a region of high GHI with low variability among pixels compared to the upper panel, except for the power plants around 1 degree longitude. More details about the characteristics of the satellite image database can be found in the above-mentioned references. For the purpose of this study, we obtained the data in the form of data files resulting from the translation of the information in the pixels into numerical values. The pre-processing to obtain the numerical values is done by the service that operationally delivers the satellite images.
Here we consider that the satellite image information employed consists of the time series that can be generated from a sequence of images. Each pixel location corresponds to a time series. The resulting data are highly correlated. It is thus necessary to select the time series, and thus the pixels, that provide informative input for the forecasting model. A methodology to achieve this is proposed in Section 3.1. Note that we use the basic GHI information derived from the images, without pre-processing it to derive Cloud Motion Vectors [22]. This is set as a requirement for the data-driven approach of the proposed forecasting model. In other words, we expect that spatially distributed GHI time series, resulting from a series of past images up to the most recent one, are informative about the evolution of the clouds in time, and that this can be captured implicitly by the data-driven forecasting model.

Proposed Model
We present here the spatio-temporal model proposed to integrate the information from power measurements to satellite images in an incremental way in order to make possible the assessment of the impact of each data source. We have used satellite images in the form of maps that span the geographic area of the data set d. The first step in considering this type of data as input is to define the map points which are of interest for forecasting the power output of a specific PV plant and also the appropriate treatment to apply to these points. We then present the statistical forecast model that integrates the satellite data. Finally, we detail the results of the evaluation of the performances of the forecasts resulting from this model and its comparison with models of the state of the art.

Identifying the Pixels of Interest
Identifying, for each PV plant, the points of the satellite image that are of interest in the context of spatio-temporal forecasting has a double objective. The first is to determine the sub-part of the image whose pixels (and thus irradiation series) are the most related to the production of the site, and to quantify this link. Neighboring pixels carry very similar information, which can be redundant and increases the dimensionality of the model. The second objective is to evaluate the benefit itself of using satellite images. For each of the s = 1, ..., n PV installations, the identification of the points of the image that are of interest for the forecast is done in several steps. The first step is to choose the pixels of interest around the site. A correlation analysis between the production and the irradiation series for pixels located at 10, 20, ..., 100 km has been conducted. The results are presented in Figure 3 as boxplots of the correlation values between each production series and the pixels located from 10 km to 100 km from the power plants. The figure shows that there is no benefit in going further than 50 km, as the mean correlation values for distances greater than 50 km are lower than 0.5 and tend to zero for larger distances. Figure 4 shows, for three PV power plants, the 50 km area retained for 1 January 2015 at 12:00 UTC. The picture on the upper left side is slightly truncated because the power plant is close to the border of the provided satellite image (which was cropped to the area covering all power plants). We chose a fixed block size independent of the forecast horizon, although some methods choose scalable block sizes depending on the horizon, especially for motion detection applications of one-dimensional structures [36].
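The preselection of pixels within a fixed radius can be sketched as follows. This is a minimal illustration, assuming that pixel centres and plant locations are given as latitude/longitude pairs and that distances are measured along the great circle (haversine formula); the function names and coordinates are hypothetical and not taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def pixels_within_radius(plant, pixels, radius_km=50.0):
    """Return the ids of the pixels whose centre lies within radius_km of the plant."""
    plat, plon = plant
    return [pid for pid, (lat, lon) in pixels.items()
            if haversine_km(plat, plon, lat, lon) <= radius_km]

# Hypothetical coordinates, for illustration only.
plant = (47.0, 0.0)
pixels = {"near": (47.1, 0.1), "edge": (47.4, 0.0), "far": (48.0, 1.0)}
selected = pixels_within_radius(plant, pixels)
```

With the 50 km default, only the pixels labelled "near" and "edge" are retained for this hypothetical plant.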
The second step involves transforming the GHI irradiation series into production series assuming that the relation between the irradiation and the production is an efficiency factor.
Then, we evaluate the link between the production measurements on the PV site and the data derived from the satellite image. For this, we use a bivariate criterion of spatial association proposed by Wartenberg [37], which is a transformation of the Moran index.
Let X_{i,j}(t) be the estimate of the output provided by the satellite map at the point (i, j) at time t, Y_s(t) the production measured at site s at time t, and τ a time delay. The coefficient of spatial association of Wartenberg is written: (1)
The coefficient of Wartenberg allows the estimation of both spatial and temporal links. For a zero time offset of the measurement series (τ = 0), the association coefficient makes it possible to evaluate the correlation between the grid points retained and the measurement. Figure 5 presents the correlation values obtained between the production and the estimates for the pixels of the satellite image for three PV plants. Note that each grid point has a time series of GHI estimates, and that the correlations were calculated between the GHI and the on-site measurement. The highest correlations are observed for the points closest to the power plants, with correlation values that remain high over the entire area of interest.
The calculation of the association coefficient with non-zero time delay values (τ > 0) makes it possible to evaluate the benefit of using the satellite images for the horizons envisaged. In our case, the time delays considered are related to the forecast horizons envisaged, that is to say up to 6 h. We applied time offsets from 1 h to 6 h to the PV production series. The association coefficients between these series and the production estimates for the pixels of the satellite image are then calculated. They make it possible to determine the areas of interest of the satellite images for forecasts at horizons corresponding to the offset applied. Figure 6 shows, for a power plant in the west of the region covered by the data set, the values of the association coefficient for different time offsets. For small time offsets (or horizons), the area of interest that corresponds to the highest values of the association coefficient remains close to the site of interest.
This zone moves away progressively as the time offset increases. This translation can be explained by the advection of clouds. In addition, the association coefficient values decrease with the time offset, and the area of interest shifts to the northeast as the offset increases. As mentioned in Section 3.1, the area of interest for short-term forecasting is the 50 km around the power plant. This area represents the pixels which provide information that helps improve the forecasts. The area is not yet associated with a specific selection of pixels. It is based on the coefficient of spatial association and contains all the pixels for which the coefficient value is significant. It is also a first step towards reducing the dimensionality of the problem. The initial image has more than 2000 pixels, whereas restricting it to a 50 km radius leaves around 400 pixels. A further selection among the pixels is done later in Section 3.2.
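The role of the time offset τ can be illustrated with a minimal sketch. It uses a plain lagged Pearson correlation between a pixel series and a site series as a stand-in for the association coefficient; this is an assumption made for illustration and may differ from the exact form of Equation (1). The function name and the synthetic series are hypothetical.

```python
import math

def lagged_association(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag); lag is in time steps (>= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(x), len(y) - lag)
    if n < 2:
        raise ValueError("series too short for this lag")
    xs = x[:n]
    ys = y[lag:lag + n]
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no association
    return cov / math.sqrt(vx * vy)

# Synthetic example: the site series is the pixel series delayed by two steps.
pixel = [0.1, 0.5, 0.9, 0.4, 0.2, 0.7, 0.3, 0.8, 0.6, 0.0]
site = [0.0, 0.0] + pixel[:-2]
best_lag = max(range(0, 5), key=lambda k: lagged_association(pixel, site, k))
```

Computing this quantity for every pixel of the area and for each offset gives maps comparable in spirit to those described for Figure 6: the pixels with the highest values at a given offset are those whose past irradiation best anticipates the site production at that horizon.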

The Forecasting Model
The deterministic spatio-temporal model with Lasso variable selection proposed in [33] was used as basis and extended to integrate satellite image data. Let's recall that this model is defined by where X represents the set of all the neighboring plants.
The method we propose for integrating satellite image data into this model is to add this information as exogenous variables. To do so, it is necessary to select which pixels are the most informative to integrate into the model for the PV production forecast. Indeed, Figure 4 represents the pixels of interest around some plants of the data set. The 50 km zone defined around these plants represents a significant number of pixels, which could pose a dimensionality problem for the model. We therefore propose to further select the most pertinent pixels by applying the Lasso variable selection approach. This choice makes it possible to avoid the loss of information that could occur with an arbitrary choice of pixels, and helps reduce the dimension of the problem. The final model obtained is as follows: with P^sat_t the satellite data and L_s the maximal lag applied to the pixels. The penalties λ_1 and λ_2 are associated with the production data and with the data from satellite images, respectively.
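The effect of using two penalties can be sketched with a small Lasso solver. The sketch assumes a least-squares loss of the form (1/2n)||y − Xb||² plus λ_1 times the absolute values of the production coefficients and λ_2 times those of the satellite coefficients; the cyclic coordinate descent solver, the function names and the toy data are illustrative assumptions, not the authors' implementation.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_two_penalties(X, y, penalties, n_iter=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + sum_j penalties[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_norm = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_norm[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores column j
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, penalties[j]) / col_norm[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

# Toy design with orthogonal columns: the first two stand for production
# regressors (penalty lambda_1), the last two for satellite pixels (lambda_2).
X = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
y = [2.1, 2.0, 2.0, 1.9]  # = 2*col0 + 0.05*col1 + 0.05*col2
lambda_1, lambda_2 = 0.1, 0.01
beta = lasso_two_penalties(X, y, [lambda_1, lambda_1, lambda_2, lambda_2])
```

In this toy case the second and third columns have the same true weight, but only the third survives because it falls in the group with the smaller penalty. Separate penalties therefore let the two data sources be shrunk at different rates.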

Comparison of the Models
With the previously defined spatio-temporal model that integrates exogenous variables from satellite images, we propose here an incremental evaluation approach that aims to quantify the contribution of each source of data in terms of forecast performance.

1. The reference model is the autoregressive (AR) model, which exploits only the temporal dependencies of the production data from the site of interest.
2. The first model we evaluate is the spatio-temporal (ST) model, which exploits both the temporal dependencies in the measurements and the spatial correlation between measurements of different power plants (production data of neighboring sites and lags of all these production data). In this model, the Lasso variable selection procedure proposed in [33,38] is integrated, thus addressing the issues of parsimony and dimension.
3. The second model investigated is an enhancement of the "ST" model with the integration of local meteorological data. This new model is called a spatio-temporal model with conditioning, ST(Z), as the parameters are estimated according to the value of the local meteorological variable Z used.
4. The last models investigated are the spatio-temporal models that exploit satellite images, NWP forecasts, or both.
A visual synthesis of all these models is presented in Figure 7.

Variable Selection and Reduction of Dimension
The optimal AR model has been obtained by using the production data of the site of interest and trying different lag configurations. The optimal lag has been obtained by minimizing the AIC criterion. For most of the power plants of the data set, the optimal maximum lag is around 1 h (4 time steps). The only variables in this AR(4) model are then the respective lags of the production.
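Lag selection by AIC can be sketched as follows. The example is a minimal illustration under stated assumptions: the AR models are fitted by the Yule-Walker equations through the Levinson-Durbin recursion, the AIC is taken as n·log(σ²) + 2p with σ² the estimated innovation variance, and the input is a synthetic series rather than normalized PV production. None of these choices is confirmed by the paper.

```python
import math
import random

def autocovariances(series, max_lag):
    """Biased sample autocovariances for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [v - mean for v in series]
    return [sum(centred[t] * centred[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def ar_order_by_aic(series, max_order):
    """Return (best_order, aic_by_order) from Yule-Walker fits via Levinson-Durbin."""
    n = len(series)
    gamma = autocovariances(series, max_order)
    variance = gamma[0]  # innovation variance of the order-0 model
    phi = []             # AR coefficients of the current order
    aic = {0: n * math.log(variance)}
    for order in range(1, max_order + 1):
        # Reflection coefficient (partial autocorrelation) for the new order
        acc = gamma[order] - sum(phi[j] * gamma[order - 1 - j] for j in range(order - 1))
        kappa = acc / variance
        phi = [phi[j] - kappa * phi[order - 2 - j] for j in range(order - 1)] + [kappa]
        variance *= (1.0 - kappa ** 2)
        aic[order] = n * math.log(variance) + 2 * order
    best = min(aic, key=aic.get)
    return best, aic

# Synthetic AR(1) series, for illustration only.
random.seed(0)
series = [0.0]
for _ in range(499):
    series.append(0.8 * series[-1] + random.gauss(0.0, 1.0))
best_order, aic_table = ar_order_by_aic(series, max_order=8)
```

With 15 min data, an order of 4 selected this way would correspond to the 1 h maximum lag reported for most plants.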
The area of interest of the satellite image around each PV plant has a 50 km radius (see Section 3.1). This area contains approximately 400 pixels. All these 400 pixels are initially integrated into the spatio-temporal forecasting model. The initial number of input variables in the spatio-temporal model with satellite images for a given plant is therefore 2015, which corresponds to the pixels with their respective delays (400 × 3 (3 h)) and to the production series of neighboring plants with their respective delays (136 × 6 (number of lags) − 1). Table 1 shows, for one power plant, the number of variables selected according to the horizon. The numbers of pixels and of different PV units (without the delayed series) selected are also shown in the table. The small number of variables selected shows that the variable selection procedure is effective in reducing the size of the problem. In addition, we note that the variables selected are mainly variables related to the production of neighboring sites, followed by the pixels of satellite images. The number of selected pixels increases slowly with the forecast horizon: 4 pixels are selected for 15 min, whereas 7 are selected for 3 h. One might expect a larger increase, but it should be noted that adjacent neighboring sites concentrate most of the spatio-temporal information, and only the pixels that provide additional information are selected, which keeps the dimensionality of the model low.
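The count of 2015 initial input variables can be checked with a few lines of arithmetic. The variable names are hypothetical, and the 400 pixels are the approximate figure quoted in the text.

```python
n_pixels = 400        # approximate number of pixels in the 50 km area
pixel_lags = 3        # delays per pixel, the "400 x 3 (3 h)" term
n_plants = 136        # PV plants in the data set
production_lags = 6   # delays per production series

# Pixel terms plus production terms, minus one, as stated in the text
n_inputs = n_pixels * pixel_lags + n_plants * production_lags - 1
```

The pixel term contributes 1200 variables and the production term 815, so the satellite data account for more than half of the candidate inputs before selection.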

Forecasting Performances
The performance of the forecasting models presented previously is then evaluated using standard criteria such as the RMSE and the MAE [2], normalized by the maximum power observed for the power plant.
Tables 2 and 3 present, for a selected power plant (P10), the MAE and RMSE, respectively, for each of the five models at different horizons. These conclusions can be extended to the entire test case according to Table A1 in Appendix A, which presents the evaluation results (RMSE, MAE) for 9 other power plants for the 6-h horizon. An additional evaluation of the performance of each model compared to the reference AR model over all the power plants in Table A1 is presented in Figure 8. The figure is produced as follows:
• M_0 is the reference AR model.
• For each power plant P_i, i = 1, ..., 9 in Table A1
• Each line in Figure 8 represents the average improvement at each horizon, over all 9 power plants, of a model M_i (over M_0).
The figure shows that the spatio-temporal model allows an average RMSE improvement of 10% over the first 3 h. This improvement can reach 20% depending on the plant. This result is consistent with those presented in [38]. Using local wind speed measurements from weather stations near the power plants improves forecasting performance by an average of 2% for the first two hours of forecasting. Beyond 3 h, these measurements do not further improve the prediction performance compared to the basic spatio-temporal model. The spatio-temporal model that integrates the NWP forecasts shows no significant improvement over the initial spatio-temporal model over the 6 h of forecast. A slight improvement in the performance of this model should be noted, however, for horizons from 5 h onward. Integration of satellite images further reduces forecast errors. Indeed, the figure shows an RMSE improvement of the order of 3% on average for the model with integrated satellite images compared to the simple spatio-temporal model. The hierarchy in terms of global performance of these models is the ST + STAT model first, followed by the ST(Z) model, the ST model, the ST + NWP model and the AR model.
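The normalized criteria can be sketched as follows, assuming the normalization constant is the maximum of the observed production series passed to the function; the function name and the sample values are hypothetical.

```python
import math

def normalized_errors(observed, forecast):
    """Return (nMAE, nRMSE), both divided by the maximum observed power."""
    if len(observed) != len(forecast) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    p_max = max(observed)
    if p_max <= 0:
        raise ValueError("maximum observed power must be positive")
    errors = [f - o for o, f in zip(observed, forecast)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae / p_max, rmse / p_max

# Hypothetical production values (kW), for illustration only.
observed = [0.0, 10.0, 20.0, 10.0]
forecast = [2.0, 8.0, 20.0, 14.0]
nmae, nrmse = normalized_errors(observed, forecast)
```

Here the errors are 2, −2, 0 and 4 kW, so the nMAE is 2/20 = 0.1 and the nRMSE is √6/20. The RMSE weights the single 4 kW error more heavily than the MAE does.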
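The averaging behind each line of Figure 8 can be sketched as below. The sketch assumes that the improvement of a model over the reference is the relative RMSE reduction, expressed in percent and averaged over the plants at a given horizon; this definition, the function name and the values are assumptions for illustration.

```python
def average_improvement(rmse_reference, rmse_model):
    """Mean relative RMSE reduction (%) of a model over the reference, across plants."""
    if len(rmse_reference) != len(rmse_model) or not rmse_reference:
        raise ValueError("inputs must be non-empty and of equal length")
    gains = [100.0 * (ref - mod) / ref for ref, mod in zip(rmse_reference, rmse_model)]
    return sum(gains) / len(gains)

# Hypothetical normalized RMSE values for three plants at one horizon.
rmse_ar = [0.10, 0.20, 0.25]   # reference model M_0
rmse_st = [0.09, 0.18, 0.20]   # candidate model M_i
improvement = average_improvement(rmse_ar, rmse_st)
```

Repeating the computation for each horizon and each candidate model gives one curve per model. A positive value means the model has a lower RMSE than the AR reference on average.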

Conclusions
In this paper, we have proposed a spatio-temporal model which exploits not only the spatio-temporal information of the production measurements of the neighboring sites, but also satellite images and NWP predictions. Since satellite images are characterized by finer resolutions and faster update rates than NWP forecasts, they are a very interesting source of data for short-term PV prediction. We have presented a pixel selection procedure around the plants for which the PV production forecast is being considered. This procedure makes it possible to go from images covering all the power plants considered to a finer image that focuses on the area around one power plant. The relationship between the points of interest of the area around the power plants and the production of the site showed that the pixels closest to the plants are the most correlated with the production. We have quantified the contribution of each of the different sources of information, namely satellite images, measurements of neighboring power plants, NWP forecasts and local meteorological measurements, to forecast performance in comparison with an exclusively temporal reference model. The forecast horizons envisaged are up to 6 h. The biggest source of improvement comes from the use of measurements from neighboring power plants. Satellite images can further reduce forecast errors when they are combined with spatio-temporal models. The effect of the NWP forecasts is very small on the early horizons, in contrast to that of the local meteorological measurements. The NWP predictions, however, correct the poor performance of the spatio-temporal model for horizons greater than 12 h, thus confirming the importance of meteorology for these forecast horizons. Finally, the results obtained in this paper show the value of geographically distributed data, which motivates data sharing (as open data or monetised through data markets) as a good practice for the future.
Funding: This work was carried out within the research project entitled "Improvement of PV power forecasting and predictive management including storage solutions", funded by the company Coruscant SA in the framework of its participation in a tender of the French Energy Regulator CRE for the development of PV plants above 250 kWc.

Acknowledgments:
The authors would like to thank the French company Hespul for providing the PV data in the framework of the PhD thesis of the first author, as well as the European Centre for Medium-Range Weather Forecasts for providing the NWP data. We would also like to thank the partners of the European project Smart4RES (European Union's Horizon 2020, No. 864337) for their useful comments that helped improve the paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: