High-Resolution PV Forecasting from Imperfect Data: A Graph-Based Solution

Operating power systems with large amounts of renewables requires predicting future photovoltaic (PV) production with fine temporal and spatial resolution. State-of-the-art techniques combine numerical weather predictions with statistical post-processing, but their resolution is too coarse for applications such as local congestion management. In this paper we introduce computing methods for multi-site PV forecasting, which exploit the intuition that PV systems provide a dense network of simple weather stations. These methods rely entirely on production data and address the real-life challenges that come with them, such as noise and gaps. Our approach builds on graph signal processing for signal reconstruction and for forecasting with a linear, spatio-temporal autoregressive (ST-AR) model. It also introduces a data-driven clear-sky production estimation for normalization. The proposed framework was evaluated over one year on both 303 real PV systems under commercial monitoring across Switzerland, and 1000 simulated ones based on high-resolution weather data. The results demonstrate the performance and robustness of the approach: with gaps of four hours on average in the input data, the average daytime NRMSE over a six-hour forecasting horizon (in 15 min steps) and over all systems is 13.8% and 9% for the real and synthetic data sets, respectively.


Introduction
Production forecasting is a critical technology for enabling large-scale penetration of PV generation into the power grid [1]. In particular, improved forecasting leads to lower net generation costs in the power system and to reduced curtailment of PV production [2]. Forecasting accuracy is limited both by the chaotic nature of the weather and by the uncertainty in the physical response of PV systems to given weather conditions, caused by variable temperature coefficients, nonlinear behavior at low irradiance, or complex local shading.
To address this dual challenge, the state of the art for forecasting single PV systems is the combination of multiple machine learning approaches with numerical weather predictions (NWP) as inputs. In a recent benchmarking study on 152 PV systems in the Netherlands, such techniques yielded a daytime normalized root-mean-square error (NRMSE) between 9% and 17% of peak power, depending on the method and system [3]. Forecasts for grid operation purposes have focused on the regional level, at which they benefit from a strong smoothing effect: they yield an NRMSE from 5% (1 h ahead) to 7% (24 h ahead) of the daytime peak power in the North of Italy [4]. The authors of [4] estimate the current and near-future levels of regional PV power production using data from sampled systems, static information on all installed PV systems, NWP, and machine learning algorithms.
Many approaches have been proposed to improve PV forecasting accuracy. Since the high variability of solar irradiance mainly comes from cloud movement and its stochastic blocking of sunlight, the works in [5,6] use cloud motion vector information extracted from sky imagers for intra-hour irradiance forecasting. However, sky imagers are too expensive to be deployed at all PV sites, and moreover, they are only useful for intra-hour prediction horizons. The works in [7,8] use satellite images for hourly PV production forecasting. However, wide-area satellite images are not capable of capturing site-specific information, and thus not good for site-specific PV forecasting. The works in [9,10] use forecasted cloud information to improve forecasting accuracy. However, such data may be expensive and require heavy processing.
Most recently, several studies have proposed data-driven forecast methods that use multi-site spatio-temporal historical data for PV forecasting without requiring NWP or cloud movement information [11][12][13][14][15][16][17][18][19]. Simple and fast multi-site forecasting techniques using linear autoregressive (AR) models are proposed in [11][12][13]. The work in [12] proposes a lasso regularization in the AR model to automatically select the input variables, while the work in [13] further develops the AR model into a local vector AR model with ridge regularization that considers local weather changes. However, nonlinear methods outperform the linear techniques for forecasting horizons longer than an hour ahead [20], and recent works have used advanced deep learning techniques to forecast multi-site PV power production for hourly horizons. Feedforward neural networks (FNN) [14] and long short-term memory (LSTM) networks [15,21] are proposed for multi-site spatio-temporal PV forecasting. In particular, LSTMs have improved forecasting accuracy since they are well suited to time-series forecasting, processing sequence information using internal gating mechanisms [22,23]. In addition, recent works use convolutional neural networks (CNN) [17][18][19] to learn the complex spatial dependencies of the production data and indirectly track cloud motion. By considering spatio-temporal relations, they can capture cloud cover and cloud movement, since passing clouds influence neighboring PV sites sequentially, thus yielding an improved forecasting accuracy. A limitation of all these techniques is that they rely on the availability of clean data for both training and forecasting.
Yet in real life, validation and cleaning of data from grid sensors, which may be affected by measurement or transmission errors, is a prerequisite for any data-driven solution. Published solutions for PV rely on knowledge of the characteristics of the systems [24], which may itself be faulty or missing. The work in [25] proposed a processing pipeline that takes into account data cleaning and the treatment of missing data: the missing gaps are filled by finding a representative PV signal from the same geographical area and replacing the missing segment with it. The work in [26] proposed a low-rank tensor learning scheme to reconstruct the missing gaps; the low-rank tensor model implicitly considers the spatio-temporal correlations of multi-site PV production signals. However, none of the aforementioned works consider a spatial model for the non-regular spatial distribution of PV plants, and thus they do not fully exploit the spatio-temporal correlations of PV production signals to reconstruct the missing segments.
In this paper, we propose a robust framework for multi-site PV forecasting that can cope with incomplete and noisy data. The proposed forecasting approach is based on a linear spatio-temporal autoregressive (ST-AR) model that only uses past production data from neighboring systems to predict the production of a specific plant. The underlying idea is that the temporal correlation between the outputs of PV systems is linked with their spatial relationship, and that PV systems can effectively be used as ground-level weather stations to forecast the production of their neighbors. In order to learn the ST-AR model, we use a group LASSO estimator to automatically select the most informative nodes for prediction as well as to estimate their coefficients. To address the problem of incomplete and noisy data, we propose a graph-based method to preprocess the input data of the ST-AR forecast model. Our method is based on building a graph model to capture the spatial dependencies among the PV systems and exploiting the spatio-temporal relations to reconstruct the missing parts of the data. The ST-AR model is enhanced by a novel, data-driven method to calculate the clear-sky performance of PV systems, which takes into account the effects of their local environment and physics. The proposed framework was evaluated on two data sets for an entire year: (1) production data from 303 real PV systems, and (2) the simulated production of 1000 PV systems, in both cases distributed over Switzerland.
The organization of the remainder of the paper is as follows. Section 2 presents the proposed robust framework for PV production forecasting and details both the graph-based reconstruction and forecasting methods. Results for PV signal reconstruction and forecasting are presented in Section 3. Finally, we conclude in Section 4 with a discussion of the results and future directions.

Methods
In this section, we present the proposed robust framework for PV production forecasting. Figure 1 depicts a block diagram of the framework. The gap reconstruction and preprocessing module filters the data (i.e., fills the gaps from missing data as well as removes noise) and preprocesses the data (normalization and de-trending) such that its output is in the correct form for the learning or the forecast modules. The learning module learns the graph models used for forecasting from historical data. Finally, the forecast module provides a forecast of the PV production over each point in the area based on the learned models and the cleaned data.

Graph-Based Data Reconstruction: Filling the Gaps
In the following, we detail our proposed approach to fill the gaps in PV power production time series, which is inspired by the work in [27]. Our method is based on building a graph model to capture the spatial dependencies among the PV systems and exploit the spatio-temporal relations to reconstruct the missing parts of the data. The main assumption of our method is that the normalized production data (normalized by peak production or size) are smooth in the temporal as well as in the spatial domain.
Let us begin with some definitions. Let G = (V, E, A) denote an undirected weighted graph, where V denotes the set of vertices or nodes with |V| = N, E denotes the set of edges or links, and A denotes the (symmetric, N × N) weighted adjacency matrix, where A_ij > 0 if nodes i and j are connected (i ≠ j). In our case, each plant corresponds to a node of the graph G, and the edges model the spatial relations between the N PV plants. We construct the adjacency matrix A by placing edges between each plant and its 10 nearest geographical neighbors; the weights are computed as a function of the distance between plants using a Gaussian kernel. Figure 2 shows an example of such a graph. Each node has an associated time-series signal representing the power production over some time interval. Assuming the data is sampled at regular time intervals, the discrete time series is represented as an M-dimensional vector, where M is the number of time samples. Thus, we can model the graph time-series signal as the matrix X ∈ R^{N×M}, where each row contains the time-series data associated with one node. All time series are normalized by the peak production such that the maximum in each row (node) is one. The core assumption in our model is that the temporal derivative of the signal X (in our case, power production) is smooth in the graph (vertex) domain. This corresponds to the assumption that the effect of clouds on one system propagates mostly unchanged onto neighboring systems. The smoothness of graph signals is a qualitative characteristic that expresses how much the node samples vary with respect to the underlying graph [28]. It can be formalized as follows.
Let L ∈ R^{N×N} be the graph Laplacian matrix associated with the graph G, defined as L = D − A, where D is the diagonal degree matrix of the graph with main diagonal D_ii = ∑_{j=1}^{N} A_ij. The graph Laplacian can be interpreted as a difference operator for signals defined on the graph. Let x ∈ R^N be a column vector representing a column of X. A typical measure of graph smoothness for x is the Laplacian quadratic form, defined as:

x^T L x = (1/2) ∑_{i,j} A_ij (x_i − x_j)^2, (1)

which is a weighted sum of the neighborhood variation over all nodes. Let G ∈ R^{M×(M−1)} denote the temporal difference operator such that every row of Z = XG contains the time-difference signal of the corresponding node. Let z_i denote the i-th column of Z; we can then define the graph time-series smoothness function as:

F(Z) = ∑_{i=1}^{M−1} z_i^T L z_i. (2)

The function F(Z) = F(XG) sums over time the graph smoothness (Laplacian quadratic form) of the time-difference signals. The smaller F(XG) is, the smoother the graph time-series signal X is.
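As an illustration, the graph construction and the smoothness function F(XG) can be sketched in a few lines of Python with NumPy. The median-distance kernel bandwidth is our assumption, since the paper does not specify how the Gaussian kernel width is chosen:

```python
import numpy as np

def build_graph(coords, k=10, sigma=None):
    """Build a symmetric k-nearest-neighbour adjacency matrix with
    Gaussian-kernel weights from node coordinates (an N x 2 array)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    if sigma is None:
        sigma = np.median(d)                    # kernel width: our heuristic
    A = np.zeros_like(d)
    for i in range(len(coords)):
        nbrs = np.argsort(d[i])[1:k + 1]        # k nearest, skipping self
        A[i, nbrs] = np.exp(-d[i, nbrs] ** 2 / (2 * sigma ** 2))
    A = np.maximum(A, A.T)                      # symmetrise
    L = np.diag(A.sum(axis=1)) - A              # graph Laplacian L = D - A
    return A, L

def smoothness(L, X):
    """Graph time-series smoothness F(XG): Laplacian quadratic form of the
    temporal difference signal, summed over all time steps."""
    Z = np.diff(X, axis=1)                      # Z = XG, temporal differences
    return np.trace(Z.T @ L @ Z)
```

If all nodes carry the same time series, the temporal differences are constant across the graph and F(XG) vanishes, which is exactly the "clouds propagate unchanged" assumption.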
In addition, the corrupted signal with gaps can be modeled as:

Y = S ∘ (X_o + W), (3)

where S ∈ {0, 1}^{N×M} is a binary sampling operator that models the gaps in the signal, X_o denotes the original complete clean signal, W is a realization of additive white Gaussian noise, and ∘ denotes the Hadamard (element-wise) product. In order to reconstruct the signal X_o from the observed (incomplete) signal Y, we solve the following convex problem:

min_X F(XG) subject to ‖S ∘ X − Y‖_F ≤ ε, (4)

where ‖·‖_F denotes the Frobenius norm for matrices. The problem finds the smoothest graph time-series signal that is consistent with the available data. The constant ε defines the radius of the data-fidelity constraint and can be related to the measurement noise level, i.e., the closer ε is to zero, the more accurate we assume our measurements to be. In the case ε = 0, we assume our measurements are noise free.
We developed an efficient algorithm to solve this convex problem based on the projected gradient descent method [29]. The algorithm is implemented in Python and is capable of scaling to large networks of tens of thousands of nodes.
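A minimal sketch of such a projected gradient scheme is given below. The step size, iteration count, and exact projection details are our assumptions, not the authors' implementation; the gradient follows from F(XG) = tr((XG)^T L (XG)), and the projection maps the observed entries back onto the data-fidelity ball of Equation (4):

```python
import numpy as np

def reconstruct(Y, S, L, eps, lr=0.01, n_iter=3000):
    """Projected gradient descent sketch for Equation (4):
    minimise tr((XG)^T L (XG)) subject to ||S o X - Y||_F <= eps.
    Y: observed N x M signal (zero where missing); S: 0/1 sampling mask."""
    X = Y.astype(float).copy()
    for _ in range(n_iter):
        Z = np.diff(X, axis=1)                  # Z = XG
        LZ = 2.0 * L @ Z                        # gradient of tr(Z^T L Z) wrt Z
        grad = np.zeros_like(X)
        grad[:, :-1] -= LZ                      # chain rule through the
        grad[:, 1:] += LZ                       # temporal difference operator G
        X = X - lr * grad
        # project onto the data-fidelity ball, touching observed entries only
        R = S * X - Y
        norm = np.linalg.norm(R)
        if norm > eps:
            scale = eps / norm if eps > 0 else 0.0
            X = (1 - S) * X + S * (Y + scale * R)
    return X
```

On a toy chain graph with identical time series per node, the method fills a simulated gap almost exactly, since the smoothness prior forces the missing temporal differences to match those of the neighbors.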

Spatio-Temporal Forecast Method
The developed forecast method is based on the assumption that the current power production in a system can be modeled as a linear combination of the past production data of a subset of nodes over a predefined time interval. The rationale behind this model is that events measured in the past production of some nodes, for example, clouds or storms passing by, are informative to predict the production in other nodes. In the following, we describe the main steps in the proposed forecast method.

Normalization and De-Trending
Normalization and de-trending of time-series data are a key step because the main assumption of statistical linear methods is that the data are stationary. However, power production data are not stationary and have a strong dependency on time (both daily and seasonal). The normalization (and de-trending) method we chose is based on a data-driven computation of the clear-sky production of each individual node. We construct for each node and each day of the year a normalization profile based on historical data from the previous year as follows. We first compute, over a year of historical data, a maximal 24 h production profile at each node x, defined as:

P^x_max(t) = max_d P^x_d(t), (5)

where P^x_d(t) is the PV power produced at node x on day d, and t is the intra-day time step. The max-profile P^x_max undergoes three further transformations. It is first smoothed using the Savitzky-Golay filter. Then, the profile is stretched in a day-dependent fashion to reflect seasonal variations. Our stretching/squeezing technique takes into account historical sunrise and sunset times, defined as the first and last non-zero historical production times of the day (these values are inferred over an interval centered on the specific day, to avoid issues on very cloudy days). It aligns the normalization signal so as to ensure that the tails of the normalization profile are very close to the expected daily sunrise and sunset, see Figure 3. Finally, the profile is multiplied by a daily scaling factor derived from the clear-sky global irradiance GHI^x_t at node x and time t, computed with the Ineichen-Perez turbidity model and PV-lib [30]. This last step ensures that the normalized signals are in the same range, independently of the period of the year. By doing so, we end up with a normalization profile P̄^x_t for each time step t of the original (yearly) time series.
In the forecasting method, we consider the normalized data P^x_t / P̄^x_t, which captures the cloud motion without having a strong dependency on seasonal and daily patterns, and hence is expected to be relatively well described by linear autoregressive models, see Section 2.2.2. At night time both values are zero, so the ratio is undefined, yet it must have a value for forecasting to be possible in the first hours of the day. We decided to define the night-time normalized value as the average value of the ratio over the previous day, since this is an unbiased estimator, which assumes a form of persistence. When no confusion can arise, we denote for the rest of the paper the normalized signals by P^x_t. Figure 3 shows an example of a normalized signal for a particular node.
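The first two steps of the normalization-profile computation (the max profile and its Savitzky-Golay smoothing) can be sketched as follows. The filter parameters are illustrative assumptions, and the seasonal stretching and clear-sky scaling steps are omitted:

```python
import numpy as np
from scipy.signal import savgol_filter

def clear_sky_profile(P, steps_per_day=96, window=15, polyorder=3):
    """Sketch of the data-driven clear-sky proxy: element-wise maximum
    over the daily curves of one year of 15 min production data for one
    node, followed by Savitzky-Golay smoothing. The seasonal stretching
    and GHI-based scaling steps of the paper are not reproduced here."""
    n_days = len(P) // steps_per_day
    days = np.asarray(P[: n_days * steps_per_day]).reshape(n_days, steps_per_day)
    p_max = days.max(axis=0)            # max production per time of day
    return savgol_filter(p_max, window_length=window, polyorder=polyorder)
```

With at least one nearly clear day in the historical window, the element-wise maximum recovers the clear-sky envelope, and the smoothing removes residual spikes from isolated cloud-edge enhancement events.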

Spatio-Temporal Auto-Regressive Forecast Model
The linear spatio-temporal auto-regressive (ST-AR) model used for predicting the production at site x ∈ N, where N = {x_1, x_2, . . . , x_N} denotes the set of N nodes (sites), at time t is the following:

P^x_t = β^x_0 + ∑_{y∈N} ∑_{i=1}^{Q} β^{x,y}_i P^y_{t−i} + η^x_t, (7)

where P^x_t denotes the normalized PV power production at site x at time t, Q is the past history horizon (in discrete samples), i.e., the order of the AR model or the number of past samples we keep in the model, η^x_t is the error term, and β^{x,y}_i is the model coefficient for lag i and site y. Collecting T consecutive training samples, the model can be written in vectorial form as:

P^x = Φ β^x + η^x. (8)

In this vectorial form, the vector β^x ∈ R^{QN+1} is formed by grouping all coefficients β^{x,y}_i that belong to the same node y, i.e., β^x = [β^x_0, (β^{x,x_1})^T, . . . , (β^{x,x_N})^T]^T with β^{x,y} = [β^{x,y}_1, . . . , β^{x,y}_Q]^T. The measurement vector is defined as P^x = [P^x_{t+1}, P^x_{t+2}, . . . , P^x_{t+T}]^T, and the regressor or design matrix Φ ∈ R^{T×(QN+1)} stacks, for each of the T samples, a leading one (for the intercept) followed by the Q past production values of every node. The production is thus modeled as a linear combination of the Q past production values of all nodes in N, i.e., all available nodes. However, a large portion of the nodes might be redundant and thus not add valuable information for the prediction. Therefore, we need a tractable method to learn or infer which subset of nodes is the most informative for prediction. In order to select the most informative coefficients, we propose to use the group LASSO (Least Absolute Shrinkage and Selection Operator) estimator [31].
The group LASSO adds a regularization term to the classical least-squares problem to promote group-sparse solutions, i.e., it drives entire groups of coefficients belonging to the same node to zero, thus having the desired selection effect. The group LASSO estimator solves the convex problem:

min_{β^x} (1/2) ‖P^x − Φ β^x‖_2^2 + λ^x ∑_{y∈N} ‖β^y‖_2, (9)

where λ^x is a regularization parameter that controls the sparsity of the solution, i.e., the number of nodes selected in the model, and β^y ∈ R^Q denotes the subvector of coefficients associated with node y. In order to learn the model coefficients for all nodes, we need to solve the group LASSO problem for each node in N. By doing so, the problem can be interpreted as a graph learning problem [32] in which we learn the topology of a directed graph that minimizes the prediction error. To solve it, we implemented an efficient and scalable algorithm based on the proximal gradient descent method [29]. As is visible from the developed expression in (8), the vectors β^x depend on the past history horizon. Intuitively, the radius around site x within which nodes can provide valuable information depends on the distance over which clouds can travel. For a very short history horizon (e.g., Q = 1), the most informative nodes will therefore be close neighbors of site x. For longer horizons (e.g., Q = 12, which corresponds to the longest history horizon in our study), weather changes at some remote nodes will have enough time to propagate to x. The set of nodes selected by the group LASSO will therefore extend over a larger area. As an example, Figure 4 shows the set of edges obtained for a central node in Switzerland for Q = 12 (three hours of history).
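A minimal proximal gradient (ISTA) sketch of the group LASSO estimator is shown below. The fixed step size and iteration count are illustrative choices, not the authors' implementation; the proximal operator of the group penalty is block soft-thresholding:

```python
import numpy as np

def group_lasso(Phi, p, groups, lam, step=None, n_iter=2000):
    """Proximal gradient (ISTA) sketch of the group-LASSO problem:
    min_b 0.5 * ||p - Phi b||^2 + lam * sum_g ||b_g||_2.
    `groups` is a list of index arrays, one per node; an unpenalised
    intercept could be handled as an extra group excluded from the prox."""
    if step is None:
        step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant
    b = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        u = b - step * Phi.T @ (Phi @ b - p)       # gradient step on LS term
        for g in groups:                           # block soft-thresholding
            ng = np.linalg.norm(u[g])
            u[g] = 0.0 if ng <= step * lam else (1.0 - step * lam / ng) * u[g]
        b = u
    return b
```

The block soft-thresholding step is what drives entire node blocks exactly to zero, producing the node-selection effect described above.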

Prediction for Short-Term Horizon
Once the graph model is learned, we use the model coefficients to forecast the production of all nodes in N for a horizon of H time steps ahead. To do so, we first predict the production one time step ahead as:

P̂^x_{t+1} = β^x_0 + ∑_{y∈N} ∑_{i=1}^{Q} β^{x,y}_i P^y_{t+1−i}. (10)

For the following steps, the predictions already computed replace the not-yet-available measurements, and the model is applied recursively up to H steps ahead. Since the model coefficients were learned using normalized data, the same normalization is applied to the input data to produce normalized predictions. After the predictions are computed, a de-normalization step is needed, i.e., multiplication by the daily profiles of each node.
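The recursive application of the learned coefficients over the horizon can be sketched as follows. The coefficient layout mirrors the vector β^x described above, and the function name is ours:

```python
import numpy as np

def forecast(P_hist, betas, Q, H):
    """Recursive H-step forecast with learned ST-AR coefficients.
    P_hist: N x M normalised production, most recent sample in the last column.
    betas:  N x (Q*N + 1); row x is [intercept, node-by-node blocks of
            lag-1..lag-Q coefficients], i.e., the vector beta^x.
    Returns an N x H array of normalised predictions."""
    N = P_hist.shape[0]
    window = P_hist[:, -Q:].copy()                 # rolling history buffer
    preds = np.zeros((N, H))
    for h in range(H):
        # regressor [1, P^{x1}_t, ..., P^{x1}_{t-Q+1}, ..., P^{xN}_{t-Q+1}]
        phi = np.concatenate(([1.0], window[:, ::-1].ravel()))
        preds[:, h] = betas @ phi
        window = np.roll(window, -1, axis=1)       # slide the window and feed
        window[:, -1] = preds[:, h]                # the prediction back in
    return preds
```

As a sanity check, setting each site's own lag-1 coefficient to one and everything else to zero reduces the scheme to the persistence model.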

Evaluation Data and Metrics
We trained and evaluated our methods on two data sets: the "real database" and the "synthetic database". The "real database" consists of production data from 303 real PV systems distributed over Switzerland and monitored with Solar-Log devices from BKW's subsidiary Solare Datensysteme GmbH (SDS). All data were anonymized before the authors accessed them: fields containing personal information were excluded, and spatial coordinates were rounded to 0.01 degrees. All sites have uninterrupted data for 2016 and 2017 with a temporal resolution of 15 min.
The "synthetic database" consists of the simulated production of 1000 PV systems distributed over Switzerland. It is included to evaluate the scalability of the proposed methods to a larger number of nodes. We used the ModelChain class of the pvlib python library [30] to generate the power production time series of each system in the database. Using this class requires describing the location of the PV system and its configuration, and providing prepared weather data. To define the locations, we fitted a Gaussian mixture model (GMM) with 80 components on the locations of the PV plants in the real database and sampled the locations of the synthetic database from this GMM. Then we inferred the distributions of the size, orientation, and pitch angles of the systems in the real dataset and sampled from these distributions to define the configurations of the PV systems in the synthetic dataset. Finally, we used as inputs historical weather data at 15 min resolution from the HelioClim 3 database (http://www.soda-pro.com/help/helioclim/helioclim-3-overview) for the years 2016 to 2018. Figure 5a,b show the spatial distribution of the synthetic and real datasets, respectively. Colors indicate the peak production of each plant. In order to evaluate over a full year, we divided the test year (2017) into 24 test periods (batches) of two weeks each. The prediction models were trained with data from the two months prior to each test period. The performance metric for both reconstruction accuracy and forecasting accuracy is the daytime normalized root-mean-square error (NRMSE), defined at site x and forecasting step (horizon) i as:

NRMSE^x_i = sqrt( (1/T) ∑_{t∈S} (P^x_{t+i} − P̂^x_{t+i})^2 ) / P^x_max, (11)

where P^x_t and P̂^x_t are the ground-truth power and predicted (de-normalized) power, respectively, of site x at time t, P^x_max is the maximum power of site x over the evaluation period S, i.e., the year 2017, and T is the number of time steps in the evaluation interval S.
We exclude night periods from the evaluation period S.
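The daytime NRMSE is straightforward to compute; the following sketch assumes a boolean daytime mask per site (the mask construction itself is not detailed in the paper):

```python
import numpy as np

def daytime_nrmse(P_true, P_pred, daytime):
    """Daytime NRMSE for one site and one forecast step: RMSE over the
    daytime samples only, normalised by the site's peak power over the
    evaluation period."""
    err = P_true[daytime] - P_pred[daytime]
    return np.sqrt(np.mean(err ** 2)) / P_true.max()
```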

Results of the Graph-Based Algorithm for Data Reconstruction
First, we present the performance of the graph reconstruction method. We evaluated it by simulating gaps in the time series. The simulated gaps are drawn from a statistical model such that the expected length of the gaps per day can be varied. The spatial graphs for each data set (real and synthetic) were constructed using the 20 nearest neighbors and a Gaussian kernel to compute the weights of the adjacency matrix. We varied the expected length of gaps from 2 to 16 h per day to evaluate the performance of the algorithm over a period of one year in batches of two weeks. The upper bound ε for the residual norm constraint in Equation (4) was set as ε = 10^{-2} ‖Y‖_F, i.e., 1% of the Frobenius norm of the measurement matrix Y. Figure 6 shows the results of the tests. We compare our reconstruction algorithm to a simple linear interpolation to illustrate the effectiveness of the proposed method when large gaps are present in the data. The results show that the reconstruction error is lower for the synthetic dataset than for the real dataset. In the synthetic dataset, the proposed method achieves errors below 20% for gaps with expected lengths up to 8 h; in the real dataset, it yields errors below 20% for gaps with expected lengths up to 4 h. Figure 7 shows a visualization of a reconstructed signal and the original signal for three nodes over a window of six days. The signals come from the real dataset with simulated gaps of an expected length of 8 h per day.
Figure 6. Daytime normalized root-mean-square error (NRMSE) for signal reconstruction against the expected length of gaps (in hours) per day in the data, for both the real and synthetic datasets. "GSR" stands for the graph signal reconstruction technique introduced in this paper.
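For reference, a gap-simulation model with a controllable expected gap length per day could look as follows. This is purely an illustrative assumption, as the paper does not detail its statistical gap model:

```python
import numpy as np

def simulate_gaps(n_nodes, n_days, expected_gap_hours,
                  steps_per_day=96, step_minutes=15, seed=None):
    """Illustrative gap model (our assumption, not the paper's): one gap
    per node per day at a random position, with a geometrically
    distributed length whose mean matches the target expected length.
    Returns a 0/1 sampling mask S as in Equation (4)."""
    rng = np.random.default_rng(seed)
    mean_len = int(expected_gap_hours * 60 / step_minutes)
    S = np.ones((n_nodes, n_days * steps_per_day), dtype=int)
    for x in range(n_nodes):
        for d in range(n_days):
            length = min(rng.geometric(1.0 / mean_len), steps_per_day)
            start = d * steps_per_day + rng.integers(0, steps_per_day - length + 1)
            S[x, start:start + length] = 0
    return S
```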

Forecasting Results Using Uninterrupted Data
We first evaluated the performance of the proposed ST-AR model using clean and uninterrupted data. The forecast horizon was set to 6 h ahead, i.e., H = 24 samples ahead with a sampling time of 15 min. The order of the model was set to Q = 12, i.e., we used the past three hours to forecast. We first compared the ST-AR model against the persistence model and the single-site linear auto-regressive (AR) model, i.e., a model similar to the one defined in Equation (7) but using only information from the same node. The data for the AR model were normalized by the same data-driven clear-sky profiles described in Section 2.2.1. Figure 8a,b show the forecast error evolution over the prediction horizon (in discrete time steps) for the real dataset and the synthetic dataset, respectively. For each prediction step, the median value (solid line) and the interquartile distance (shaded bounds) over all sites are shown. For both datasets, the ST-AR model outperforms the persistence and single-site AR models over the entire forecasting horizon. The maximum errors were obtained for the last time step of the prediction horizon, i.e., the 6 h ahead prediction, as expected. For the synthetic dataset, all median errors are below 12%, whereas for the real dataset all median errors are below 17%, which compares favorably with state-of-the-art methods for the same horizon [1]. Moreover, for forecast horizons up to 16 steps ahead, i.e., 4 h, the median errors are below 10% for the synthetic dataset and 15% for the real dataset.
For the real dataset, we have also included forecasts for single sites and benchmarked the performance of the proposed method against a method that uses NWP. The forecasts were computed for two locations at which measured PV data and NWP were available, Bern and Bätterkinden. The two locations are about 25 km from each other. As comparison, we use forecasts based on support vector regressors (SVR) with NWP as inputs (global irradiance and temperature). As reported in [33], this technique outperforms several state-of-the-art methods such as SARIMA and two neural network models with different weather forecasts. It is therefore a representative benchmark. The forecasts for Bern were computed using historical NWP from the global forecast system (GFS) that have a temporal resolution of 3 h, thus only two points are available for the 6 h ahead horizon. The forecasts for Bätterkinden were computed using historical NWP from Meteotest (https://meteotest.ch/en/) with a temporal resolution of 1 h. Figure 9 presents the time series of actual PV production and six-hour-ahead forecasts using both the ST-AR approach and SVR with NWP. It illustrates the variability in production profiles and forecast errors, which explains the spread seen in Figure 8. Figure 10 shows the dependence of the corresponding NRMSE values on the forecast horizon. The results highlight the dependency of forecast models that use NWP on the temporal resolution of the available weather predictions, whereas the ST-AR model overcomes this issue by using past PV production data from other PV stations. The ST-AR approach outperforms the combination of NWP and SVR up to forecast horizons of four to five hours.

Forecasting Results Using Incomplete Data
Finally, we evaluated the complete forecast framework, which receives incomplete data, fills the gaps using the graph-based reconstruction method and then uses the cleaned data for forecasting. We evaluated the proposed approach using two different scenarios: (1) gaps with 4-h duration on average and (2) gaps with 8-h duration on average. The model learning phase of the ST-AR method, i.e., solving Equation (9), was performed using the reconstructed data. Figure 11a,b show the forecasting error results for the real and synthetic data sets, respectively, comparing against the results obtained using uninterrupted data. For each prediction step the median value (solid line) and the interquartile distance (shadow bounds) over all sites are shown.
The results show the robustness of the proposed approach, with the median NRMSE for both 4 and 8 h gaps staying only slightly above the median error obtained from uninterrupted data. However, the interquartile distance of the NRMSE increases for both datasets and scenarios. Moreover, in the case of the real dataset, the interquartile distance of the errors is closer to the one obtained using uninterrupted data in the first scenario. The average daytime NRMSE over the entire 6 h ahead horizon is 13.8% and 9% for the real and synthetic databases, respectively, in the first scenario, and 14.5% and 10%, respectively, in the second scenario. In comparison, the average NRMSE using uninterrupted data is 13.5% for the real data set and 8.5% for the synthetic data set, showing the effectiveness of the proposed framework in dealing with incomplete data.

Conclusions
We proposed a robust graph-based framework that exploits the spatio-temporal correlation between different systems to forecast local PV production based on imperfect production measurements. The proposed method achieves an average daytime NRMSE (over all sites) of 13.5% for the real data set and 8.5% for the synthetic data set for a forecasting horizon of 6 h ahead (15 min resolution). It outperforms methods based on numerical weather forecasts on time horizons between 0 h and 4 h, confirming the intuition that PV systems can effectively be used as distributed weather stations for forecasting purposes.
The proposed graph-based method proved very effective in reconstructing missing or faulty data. The NRMSE of this reconstruction is well below 20% for gaps of up to 4 h on the real dataset. Not only is the uncertainty of the reconstruction lower with the graph-based method than with conventional linear interpolation, it also grows more slowly as the duration of the gaps increases. As a result, the full forecasting framework proved very robust against faulty data: the median NRMSE across all systems with the ST-AR method is 17% for a 6 h horizon over one year and increases by less than one percentage point with gaps in the data of up to 8 h.
Finally, as compared to the state of the art, this work has significantly increased the confidence in the investigated algorithms since they have been tested on entire years and on uniquely large datasets: more than 300 real PV systems spread over Switzerland, and 1000 synthetic ones that reproduce the statistical distribution of installed PV in the country in terms of size, orientation and location. Future work will investigate more advanced graph-based machine learning models such as graph neural networks to learn models that overcome the non-stationarity of PV production signals. We will also investigate the generalization of the approach to other weather-dependent signals in the power system such as wind power production.