Comparison of Data-Driven Techniques for Nowcasting Applied to an Industrial-Scale Photovoltaic Plant

Abstract: The inherently non-dispatchable nature of renewable sources, such as solar photovoltaic, is regarded as one of the main challenges hindering their massive integration into existing electric grids. Accurate forecasting of the power output of the solar plant might therefore play a key role towards this goal. In this paper, we compare several machine learning and deep learning algorithms for intra-hour forecasting of the output power of a 1 MW photovoltaic plant, using meteorological data acquired in the field. With the best performing algorithms, our data-driven workflow provided prediction performance that compares well with the present state of the art and could be applied in an industrial setting.


Introduction
In recent years, among renewable energy sources, solar photovoltaic (PV) has experienced rapid growth. According to the most recent survey by the IEA [1], by the end of 2017, the total PV power capacity installed worldwide amounted to 398 GW, leading to the generation of over 460 TWh of energy (around 2% of global power output). Utility-scale projects account for about 60% of total PV installed capacity, with the rest in distributed applications (residential, commercial, and off-grid) [1]. It is well established that one of the main challenges posed by the massive integration of renewable energy sources, such as PV or wind power, into the national grids is their inherently non-programmable nature. Therefore, the capability to forecast the energy produced by a PV plant in a given time frame with good accuracy is a key ingredient in overcoming this challenge [2], and is also often connected with economic benefits [3]. The recent evolution of microgrid and smart grid concepts is also a significant driving force for the development of accurate forecasting techniques [4].
In this paper, we will compare several data-driven techniques, such as machine learning and deep learning, for intra-hour forecasting of the output power of an industrial-scale PV plant (1 MW nominal capacity) at 5 min temporal resolution. Meteorological data acquired from sensors in the field will be used as input data to perform the predictions. This framing of the problem belongs to the class of approaches referred to as nowcasting [2,5].
From an operational point of view, short forecast horizons are relevant for AC power quality and grid stability issues. At these time scales the main source of variability in the power output is cloud movement and generation on a local scale [6]. Although these phenomena could be modeled from a physical point of view, their complex and turbulent behavior is a challenge that can be tackled effectively with data-driven techniques. Some approaches make use of imaging techniques, either from ground-based [6] or satellite [7] cameras, while others only make use of endogenous data [8] (i.e., the power output is used as the only input feature for the forecasting). The advantages of a data-driven approach for solar power forecasting lie in its ability to learn, from historical data, the complex patterns that are difficult to model (for instance, shading phenomena). No knowledge of the specifications of the plant, such as the orientation and electrical characteristics of the panels, is required.
The predictive performance of machine learning-based methods has been shown to be superior to physical modeling in several cases [2,9,10]. In particular, our comparison includes several well-known regression algorithms that were tested on an extensive range of parameters.
The paper is structured as follows: Section 2 describes the available dataset and the performed preprocessing, Section 3 contains the description of the applied methods and the definition of the applied error metrics. The results are collected in Section 4, and the concluding remarks are drawn in Section 5.

Data Sources and Preprocessing
The input data for the present study were gathered from the continuous monitoring of a 1 MW nominal power photovoltaic plant operated by Eni, an Italian multinational energy company. The available time series covers approximately one year's worth of data, at 5 min time resolution. A list of the available variables is provided in Table 1. The panel temperature T_p is measured by a sensor installed on the back side of one reference module in the field. Meteorological data are monitored by a weather station installed next to the PV arrays. The PV modules are equipped with a 1-axis solar tracking device.

Data Preprocessing
An example of the P and T time series is shown in Figure 1. Data were collected between 3 August 2018 and 8 July 2019, and a total of 103,913 time steps are available.

Feature Selection
To find the best compromise between predictive performance and the computational load of the training phase, we have used several considerations to select the most relevant variables among those listed in Table 1.
The physics of the photovoltaic system represents the first source of intuition to guide the selection process. Sophisticated models for the power generated as a function of the environmental conditions have been developed [11]; in essence, the power output of a PV panel is proportional to the solar irradiance.
The proportionality coefficient contains information on the power conversion efficiency of the panel, and it is generally a decreasing function of the photovoltaic cell temperature [11]. Therefore, it is expected that I_GHI, followed by the panel temperature, should play a major role, as T_p is a good proxy for cell temperature. Air temperature and humidity are also generally expected to be useful predictors, as they are related to solar irradiance by well-known physical laws. Relative humidity, for instance, is intuitively higher during rainy days, which most likely feature heavy cloud cover, resulting in low ground-level solar irradiance and hence low power output. Nevertheless, the impact of these two variables on the forecasting prediction is highly site-dependent. For this specific dataset, we found that including both features did not increase predictive performance significantly; some algorithms actually performed worse when humidity was used as an input feature.
Weather parameters such as wind direction and wind speed at a given time instant t might influence the power generated at successive instants in time P(t + ∆t), for example by indicating the direction and speed of cloud movement. Wind speed could also influence panel temperature, albeit with a delayed response due to thermal capacitance effects [12]. For this specific dataset, testing the selected algorithms in an exploratory phase showed no significant performance gains when including W_s, W_d, and p in our case study. This conclusion was also supported by examining statistical indicators, such as the feature importance provided by the Random Forest algorithm [13].
The expectations derived from the physics are also corroborated by an analysis of the correlation of each variable in the dataset with P, reported in Table 2. The most highly correlated variables, as anticipated, are I_GHI and T_p. The behavior of I_GHI, as displayed in Figure 2, clearly has an envelope due to the day-night cycle [10]. This envelope also embeds the seasonal variations of peak irradiance during the year. Local cloud cover introduces a stochastic component in the measured irradiance, which constitutes the challenging aspect of PV power forecasting.
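As an illustration, with the monitored series in a pandas DataFrame, the correlation of each variable with P can be obtained directly. The toy data and column names below are assumptions for the sketch, not the actual dataset schema:

```python
import pandas as pd

# Toy monitoring data at 5 min resolution; column names are illustrative,
# not the actual schema of the plant dataset.
df = pd.DataFrame({
    "P":     [0.0, 0.2, 0.5, 0.9, 0.7, 0.1],  # output power (MW)
    "I_GHI": [0.0, 180, 450, 820, 640, 90],   # global horizontal irradiance (W/m^2)
    "T_p":   [15, 22, 30, 41, 38, 20],        # panel temperature (deg C)
    "RH":    [80, 65, 50, 35, 40, 75],        # relative humidity (%)
})

# Pearson correlation of every variable with the output power P
corr_with_P = df.corr()["P"].drop("P").sort_values(ascending=False)
print(corr_with_P)
```

On data of this shape, I_GHI comes out as the most correlated predictor, in line with the ranking of Table 2.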
Here, a synthetic feature has been used to separate these two contributions. To identify the envelope, the so-called clear sky irradiance [14] I_CS was computed (to simplify the notation, we drop the subscript GHI in the derived variables; the global horizontal component is considered unless otherwise stated). This is the solar irradiance that would be expected in the absence of cloud cover, but accounting for atmospheric turbidity (the effect of aerosols and water content on light scattering and absorption).
The clear sky irradiance takes into account the most common trends, like the day-night cycle and the seasonal variation. In this paper, no additional considerations have been introduced regarding trends, like the ones reported in [15], because such trends can be learned by the algorithms, given that the dataset used covers an entire year.
The clear-sky irradiance was computed using the implementation of the Ineichen model [16] provided by the pvlib library [17].
The decomposition used can be formalized as:

I_GHI(t) = I_CS(t) + I_stoc(t),    (1)

and is shown in Figure 2. Dropping the time dependence for simplicity, the following hypothesis can be made for the output power:

P = ε I_GHI = ε (I_CS + I_stoc) = P_CS + P_s,    (2)

where we have introduced ε(t) and have defined P_CS = ε I_CS and P_s = ε I_stoc. The coefficient ε(t) was estimated as the ratio P/I_GHI where I_GHI > 10 W/m², and set to zero otherwise.
To reduce the impact of inconsistent data, a rolling average with a window size of 4 samples was then applied to ε. This was found to improve performance; the window size was tuned manually to achieve this goal. The forecasting models were then trained to predict only the stochastic component of the power. The total power was then computed by inverting the transformations defined in Equations (1) and (2).
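A possible implementation of the decomposition and smoothing steps, assuming the series are held in numpy arrays, could look as follows (function and variable names are illustrative):

```python
import numpy as np
import pandas as pd

def stochastic_power(P, I_ghi, I_cs, window=4, thresh=10.0):
    """Decompose measured power into clear-sky and stochastic components.

    P, I_ghi, I_cs : 1-D arrays of power, measured GHI, and clear-sky GHI.
    Returns (P_cs, P_s) such that P ~ P_cs + P_s (Equations (1) and (2)).
    """
    # epsilon(t) = P / I_GHI where irradiance is significant, zero otherwise
    safe = np.where(I_ghi > thresh, I_ghi, 1.0)
    eps = np.where(I_ghi > thresh, P / safe, 0.0)
    # rolling average over 4 samples (20 min) to damp inconsistent data
    eps = pd.Series(eps).rolling(window, min_periods=1).mean().to_numpy()
    I_stoc = I_ghi - I_cs                # stochastic irradiance component
    return eps * I_cs, eps * I_stoc      # P_CS and P_s

# Toy series with a constant conversion coefficient of 0.5
I_ghi = np.array([200.0, 400.0, 600.0, 800.0])
I_cs = np.array([250.0, 450.0, 650.0, 850.0])
P = 0.5 * I_ghi
P_cs, P_s = stochastic_power(P, I_ghi, I_cs)
```

With a constant ε the two components reconstruct P exactly; on real data the rolling average makes the reconstruction approximate, which is the intended smoothing effect.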
Finally, all variables were rescaled to the interval [−1, 1]. Scaling all the features to the same interval is a standard practice to improve the performance of some regression algorithms, in particular those based on neural networks [18].
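A minimal sketch of the rescaling step; note that in practice the scaling parameters should be fitted on the training set only and then reused on the test set:

```python
import numpy as np

def rescale(x, lo=-1.0, hi=1.0):
    """Linearly rescale an array to the interval [lo, hi]."""
    xmin, xmax = x.min(), x.max()
    return lo + (x - xmin) * (hi - lo) / (xmax - xmin)

irradiance = np.array([0.0, 250.0, 500.0, 1000.0])
print(rescale(irradiance))  # maps to [-1, -0.5, 0, 1]
```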
To conclude this section, we remark that the main aim of the paper is not to demonstrate the best possible predictive performance, but rather to compare different models on equal grounds. The preprocessing steps outlined above were driven by a combination of physical reasoning, analysis of statistical indicators, and the necessity of limiting the computational load when working at 5 min time resolution. Appropriate synthetic features were defined, and the final list of physical features selected as the input and the output of the regression is summarized in Table 3. Finally, we stress that these choices were optimized for the case at hand and might not be as effective for different applications, as is natural to expect in any data-driven approach.

Table 3. Selected physical variables (features) for power forecasting.

Selected Physical Variables
I_stoc, T_p, T, P_CS, P_s

Methods
To apply machine learning techniques, forecasting was framed as a regression problem as follows:

(P̂_s(t_{i+1}), …, P̂_s(t_{i+12})) = f(x_{i−B+1}, …, x_i),    (3)

where x_i ∈ R^F is a vector containing the set of selected input features.
In the present case, F = 5 and x_i = (P_s, I_stoc, T_p, T, P_CS)(t_i). Discrete time steps are defined so that ∆t = t_i − t_{i−1} = 5 min, and a similar notation is adopted for the output of the model, P̂_s,j = P̂_s(t_j). We have introduced the parameter B, which controls how far back in time the regressor looks to perform the prediction. This lookback time, denoted as ∆T_B, was varied between 2, 4, and 8 h, corresponding to B = 24, 48, 96 at 5 min time resolution (see Figure 3). Generally, the goal of a regression problem is to find the function f : R^{B×F} → R^{12} which minimizes a suitable definition of the error between the predicted output P̂_s and the actual value P_s. To this end, a set of pairs of input-output variables on which to train the parameters (if any) of f was constructed and arranged as shown in Figure 3.
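The construction of the lagged input-output pairs arranged as in Figure 3 can be sketched, for instance, with a naive numpy loop (names and the random toy data are illustrative):

```python
import numpy as np

def make_samples(X, p_s, B=24, H=12):
    """Build lagged regression samples.

    X   : (N, F) array of the selected features at 5 min resolution.
    p_s : (N,) stochastic power component (the regression target).
    B   : lookback steps (24/48/96 for 2/4/8 h); H: forecast steps (12 = 1 h).
    Returns inputs of shape (M, B, F) and outputs of shape (M, H).
    """
    N = len(X)
    inputs = np.stack([X[i - B:i] for i in range(B, N - H + 1)])
    outputs = np.stack([p_s[i:i + H] for i in range(B, N - H + 1)])
    return inputs, outputs

X = np.random.rand(200, 5)   # 5 features, as in Table 3
p_s = np.random.rand(200)
xs, ys = make_samples(X, p_s, B=24, H=12)
print(xs.shape, ys.shape)    # (165, 24, 5) (165, 12)
```

Sequence-aware models such as the LSTM consume the (M, B, F) arrays directly, while the other regressors receive each sample flattened to B·F predictors, e.g. `xs.reshape(len(xs), -1)`.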
The dataset adopted is affected by some gaps (missing data) and some invalid data in the time series. The gaps were managed by splitting the dataset into subsets containing only consecutive observations: data within each subset were then cast into the desired form by constructing the time-lagged time series.
On the other hand, a first statistical analysis of the time series shows that the inconsistent data are a negligible percentage of the entire dataset, and the machine learning methods analyzed in this study are able to automatically detect and ignore these invalid data. The IEC 61724 recommendations for PV data monitoring and recording have nevertheless been carefully followed, even though the performance ratio of the considered PV plant is not shown here for non-disclosure reasons.
The final resulting dataset comprised N tot = 103,740 samples: the first 80% samples were used as a training set, with a dimension of N tr = 82,992, while the last 20% constituted a test set of N te = 20,748 samples.
Many regression algorithms to implement the function f exist and have been applied to solar irradiance or photovoltaic power forecasting: artificial neural networks [19], tree-based methods [20], and more recently deep learning approaches [21]. The algorithms tested in this study are summarized in Table 4, and Figure 4 shows the flow-chart followed in the current analysis of the above-listed algorithms. A detailed description of the working principles of each algorithm is beyond the scope of this work, and we refer the interested reader to the references provided. We therefore provide only the minimum of detail useful to motivate the choice of algorithms listed in Table 4. Linear regression was included as a representative example of a robust and computationally inexpensive algorithm. The Lasso regressor [22] implements variable selection [23] by penalizing the importance of features which have little impact on the prediction error. Therefore, it should represent an improvement over simple linear regression, at a comparable computational cost. The Random Forest [24] algorithm is based on an ensemble of decision trees aggregated together following the principle of bagging [18,23]. The multilayer perceptron (MLP) [23] is a type of artificial neural network with multiple hidden neural layers. The number of layers and the number of neurons in each layer are the main parameters that control the flexibility of the network, that is, its capability to learn complex patterns. Training of the MLP is based on the feed-forward and back-propagation paradigm.
In k-nearest-neighbor (KNN) regression [25], the output y for an input x is estimated as the average value of the k training outputs whose corresponding inputs are the first k nearest to input x. An appropriate notion of distance in the space of inputs must be defined, the choice of which impacts regression performance.
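A minimal illustration of KNN regression with Euclidean distance, using toy data rather than the plant dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict y for input x as the average target of the k training
    inputs nearest to x (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
# The 3 nearest neighbors of 1.2 are 1.0, 2.0 and 0.0, so the prediction
# is their targets' average:
print(knn_predict(X_train, y_train, np.array([1.2]), k=3))  # -> 1.0
```

The choice of distance (here Euclidean) is exactly the design decision mentioned above, and it directly shapes which neighbors are averaged.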
Long short-term memory (LSTM) neural networks [26] belong to the category of recurrent neural networks. The LSTM network structure can be described in terms of units communicating through input and output gates and endowed with a forget gate that regulates the flow of information between successive units. This enables the network to remember information for long time intervals. This architecture is well suited to processing data that have a sequential nature, as is the case for time series. This is reflected in how the input data must be structured (see Figure 5), either for training or to perform a prediction with an LSTM. The variables x_m^(k) are indexed so that k = 1, …, F spans the input features while m is the time index.
Given a set of N samples, F physical features and B lookback time steps, each sample is a two-dimensional array of dimension B × F, therefore a portion of the time series comprising B sequential steps is used to perform the prediction. On the contrary, each input sample to other algorithms is a flat array of B · F predictors (see Figure 1), which are all considered equally and independently. For instance, the order in which they are provided is completely irrelevant. The effectiveness of LSTM for time series forecasting has been demonstrated in several studies targeting different types of data [21,29].
All the algorithms listed in Table 4 (except for linear regression) have several parameters that can be tuned. An example is the number and size of the hidden layers in the MLP network. We refer to such variables as training parameters (especially in the context of deep learning, these are also often referred to as hyperparameters).
These can have a significant impact on the performance of the algorithm. A robust strategy to find the optimal configuration is to couple a search in training parameter space with k-fold cross-validation (CV). The latter consists of splitting the training set into k subsets (folds) of the same size; then, over the course of k iterations, one of the folds is selected as the test set, while the remaining k − 1 constitute the training set. At the end, k performance scores are obtained, for instance the coefficient of determination (R²) on the fold acting as the test set. An overall cross-validation score is then built as the average of the k scores. This procedure can, in turn, be iterated on a grid of possible values of the training parameters to find the combination yielding the highest cross-validation score. If m combinations of training parameters are tested, a total of mk fits of the model is performed. In this study, the general machine learning workflow used for all algorithms except for the LSTM was the following, for each value of the lookback time ∆T_B:

• Three-fold cross-validation and hyperparameter search on the training set (N_tr samples): the training parameters yielding the best cross-validation score were selected. This should select the algorithm with the best generalization performance, i.e., the one most likely to perform best on new, previously unseen input data.

• Evaluation of the performance of the algorithm on the test set, using common error metrics, after reconstruction of the actual power from the predicted stochastic component.
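The grid search coupled with 3-fold cross-validation can be sketched with scikit-learn as follows; the synthetic data and the parameter grid below are placeholders, not the actual grids of Appendix A:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 10))             # stand-in for the B*F flattened predictors
y = X[:, 0] + 0.1 * rng.random(300)   # synthetic target with a clear signal

# Illustrative grid: m = 4 parameter combinations, k = 3 folds -> mk = 12 fits
grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The best cross-validation score selects the parameter combination; the final evaluation is then performed once on the held-out test set.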
All algorithms were tested on the same CPU-based hardware. The training time of the LSTM network was considerably longer than that of the other algorithms. The full cross-validation and hyperparameter search for the MLP, totaling 2400 fits, required about 200 min for ∆T_B = 8 h, whereas one single fit of the LSTM required 400 min. Both computations were parallelized to make use of all the threads available on the machine. For this reason, it was not feasible to apply the same hyperparameter tuning approach to the LSTM. Hence, we restricted the search to a family of networks known as encoder-decoder models [30]. In particular, two architectures were tested, one with a fixed number of units, the other with network size proportional to the number of lookback time steps B. We will label the two models LSTM and LSTM2, respectively.
Training of the LSTM networks consists of finding the optimal weights of the neural connections that minimize a suitable error metric on the training set, for instance the mean square error (MSE). This was performed using the Adam [31] algorithm, a standard approach for deep neural networks. Training consists of several successive steps (epochs). The training MSE, being the target of the optimization, is by design expected to decrease through the epochs. This does not guarantee that the error on unseen data, for instance on the test set, will decrease as well. This is the well-known issue of overfitting [23], an example of which is shown in Figure 6. To prevent this occurrence, an early stopping strategy was implemented, keeping track of the MSE on the test set during training and halting the process when the test MSE stops decreasing. Early stopping was also enabled in the scikit-learn library for the MLP regressor. More technical details on the choice of parameters for all the algorithms are provided in Appendix A.

Figure 6. Overfitting can occur if training is extended for too long (left). A suitable early stopping strategy helps to prevent this occurrence (right).
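The early stopping logic described above can be sketched as a simple patience-based loop; this is a minimal illustration of the idea, deep learning frameworks provide it as a built-in callback or option:

```python
def early_stopping_index(val_errors, patience=3):
    """Return the epoch whose weights should be kept: the last epoch at
    which the validation error improved, stopping once it has failed to
    improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, wait = err, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # halt training: no improvement for `patience` epochs
    return best_epoch

# Validation MSE first decreases, then rises: the classic overfitting curve
val_mse = [0.9, 0.5, 0.3, 0.25, 0.27, 0.31, 0.40, 0.55]
print(early_stopping_index(val_mse, patience=3))  # -> 3
```

In the paper the monitored error is the test-set MSE; more commonly a separate validation split is used for this purpose.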

Error Metrics
We introduce the main error metrics that will be used in Section 4. Since the output of our intra-hour forecast is the 12-dimensional vector of the predicted power for each 5-min time step, we define an error metric for each forecast horizon. That is, for j = 1, …, 12:

MAE_j = (1/N) Σ_i |P_{i,j} − P̂_{i,j}|,    (4)

RMSE_j = sqrt( (1/N) Σ_i (P_{i,j} − P̂_{i,j})² ),    (5)

MBE_j = (1/N) Σ_i (P_{i,j} − P̂_{i,j}),    (6)

R²_j = 1 − Σ_i (P_{i,j} − P̂_{i,j})² / Σ_i (P_{i,j} − P̄_j)²,    (7)

where the index i spans over the samples and P̄_j is the average of the measured values. The performance metrics defined in Equations (4) to (7) are commonly referred to as the mean absolute error (MAE), the root mean square error (RMSE), the mean bias error (MBE), and the coefficient of determination (R²), respectively.
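For a single forecast horizon j, the four metrics can be computed as in the following minimal numpy sketch (the arrays are illustrative):

```python
import numpy as np

def metrics(P, P_hat):
    """MAE, RMSE, MBE, and R^2 for one forecast horizon, following the
    sign convention delta = P - P_hat used for the MBE."""
    d = P - P_hat
    mae = np.abs(d).mean()
    rmse = np.sqrt((d ** 2).mean())
    mbe = d.mean()
    r2 = 1.0 - (d ** 2).sum() / ((P - P.mean()) ** 2).sum()
    return mae, rmse, mbe, r2

P = np.array([0.0, 0.5, 1.0, 0.8])      # measured power (toy values)
P_hat = np.array([0.1, 0.4, 0.9, 0.8])  # predicted power
mae, rmse, mbe, r2 = metrics(P, P_hat)
print(mae, rmse, mbe, r2)
```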
It is also interesting to evaluate the mean performance of the algorithm in the forecasted hour with a single representative indicator. Several indicators could be built to characterize different qualitative aspects of the prediction. A possible choice is to define the average power in the hour as P_1h = (1/12) Σ_{j=1}^{12} P_j. With a similar definition for the predicted power (P̂_1h), the standard definitions of the performance metrics (MAE, RMSE, R²) can be applied to the hourly average power. We label these metrics MAE_1h, RMSE_1h, and R²_1h. A persistence forecast was defined as

(P̂_s,i, P̂_s,i+1, …, P̂_s,i+11) = (P_s,i−12, P_s,i−11, …, P_s,i−1),    (8)

where the 12 predicted time steps of P̂_s are equal to the measured values P_s observed in the previous 12 time intervals. The last error metric introduced is the skill score (SS) [32], which compares the error of the analyzed model with the error made by a persistence method on the same forecast horizon:

SS_j = 1 − RMSE_j(model) / RMSE_j(persistence).    (9)
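The persistence baseline and the resulting skill score can be sketched as follows (illustrative numbers, not results from the paper):

```python
import numpy as np

def persistence_forecast(p_s, i, H=12):
    """Predict the H steps ahead of t_i as the H measured steps before it."""
    return p_s[i - H:i]

def skill_score(rmse_model, rmse_persistence):
    """Skill score vs. persistence: 1 is a perfect forecast,
    0 means no improvement over persistence."""
    return 1.0 - rmse_model / rmse_persistence

p_s = np.arange(24, dtype=float)           # toy stochastic power series
print(persistence_forecast(p_s, 12))       # the 12 previous observations
print(round(skill_score(0.06, 0.10), 3))   # -> 0.4, a state-of-the-art level
```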

Results
The presented methods have been compared on the dataset described in Section 2. The number of samples in the dataset and their division for training and testing are summarized in Table 5.
In all the performed tests, three different lookback times were adopted. This corresponds to a different number of lookback time steps used in each sample of the dataset. The information on these tests is summarized in Table 6.

Cross-Validation Performance
We start the analysis from the cross-validation results. Table 7 summarizes the maximum, average, and standard deviation of the cross-validation score. The size of the population on which these quantities are calculated is the number of iterations in the hyperparameter search, i.e., the number of different combinations of parameters explored. Each quantity is informative about a different aspect of the algorithm performance. The maximum score is the criterion used to select the best set of training parameters; therefore, it can be expected that an increase of the best CV score with ∆T_B should correspond to higher performance on the test set as well. None of the tested algorithms showed such an increase. Results on the test set, reported in the next section, confirm the validity of this conclusion on our dataset. The number of predictors is already BF = 120 for ∆T_B = 2 h, which is rather large. Indicators such as the feature importance provided by the Random Forest algorithm suggest that only a small fraction of the inputs effectively play a role in the prediction outcome. Therefore, adding even more predictors by increasing ∆T_B has no benefit and was found to actually decrease the performance in some cases.
The average value and standard deviation of the CV scores are useful indicators of the robustness of the regression algorithms with respect to hyperparameter selection. The Lasso and MLP algorithms have a much greater variability compared to Random Forest. We remark that the cross-validation score can also assume negative values, which explains the low average value observed for the Lasso. KNN is also fairly robust but has poor prediction performance compared to the others. Note that in the case of linear regression there are no training parameters to be tuned, hence all the observed variability is due to the different splittings of cross-validation.

Test Set Performance
We now examine performance on the test set, also including the two LSTM architectures (see Appendix A), which were not subjected to cross-validation.
We focus on the RMSE metric, shown in Figure 7. At the shortest forecast horizon (f_h) of 5 min, Linear Regression and Lasso outperformed more flexible regression algorithms. The situation is reversed for f_h = 60 min. The KNN algorithm performed poorly, suggesting that it is not well-suited for this specific data and framing of the problem. Nevertheless, it shows more promising performance if the MAE metric is considered (see Figure 8). The LSTM algorithms did not offer any significant performance advantage over Random Forest and MLP.

It is worthwhile to consider additional error metrics, since they highlight different aspects of the regression performance. For instance, the RMSE penalizes large deviations much more than the MAE. The latter is shown in Figure 8 and suggests a different ranking between the algorithms. It is again apparent that no significant improvements were obtained by increasing ∆T_B.

The mean bias error (MBE), as defined in Equation (6), preserves the sign of the deviation between predicted and actual values. Therefore, it is informative about potential systematic prediction bias. In Figure 9 one can observe that, in contrast with other indicators, the MBE is almost constant as a function of the forecast horizon, except for a weak dependence in the case of Linear Regression. The two algorithms that perform worst in terms of RMSE (Linear Regression and KNN) are also those with the largest bias.

Figure 9. Plots of MBE_j with j = 1, …, 12, corresponding to forecast horizons from 5 to 60 min ahead, for ∆T_B = 2 h (left) and ∆T_B = 8 h (right). To preserve the readability of the plot, the legend is not reported; it follows the same convention as in Figure 7.
Finally, we examine the skill score (defined in Equation (9)) against the persistence model defined in Section 3. As pointed out in a recent review [32], typical skill scores obtained in the literature against persistence by state-of-the-art data-driven forecasts are in the range SS = 0.3-0.4. In Figure 10 we have thus superimposed a reference line at SS = 0.3 on the results to guide the eye. Unsurprisingly, considering its definition, the ranking in skill between the algorithms is the same as the one suggested by the RMSE metric. Overall, the results suggest that a hybrid algorithm, using a linear model for the first 5 min and a more flexible model for longer forecast horizons, could be considered to further improve performance. With respect to the peak plant power, the average RMSE reported is at the level of 5% for the best performing algorithms, with a MAE of the order of 3%. Table 8 summarizes the predictive performance on the hourly average power output. This is one of the possible ways to quantify the performance over the whole forecast interval. One can find confirmation of some of the trends highlighted by the cross-validation scores. For example, the Random Forest shows a decrease in performance as ∆T_B increases. On the other hand, the LSTM2 architecture is the only one to show an appreciable improvement. This behavior could have been expected from the design of the LSTM network, as described in Section 3. Nevertheless, we have verified that further increasing ∆T_B to 12 h did not increase the performance of LSTM2 and also produced a significant rise in training times.
To better analyze the connection between the error metrics of Table 8, it is useful to compare the full distribution of the residuals (δ = P_1h − P̂_1h) on the test set. The average of this distribution is, by definition, the MBE, whereas its variance σ² = Var(δ) is related to the RMSE as σ² = RMSE² − MBE². Therefore, qualitative analysis of the probability distributions can offer further insight on such metrics. For instance, Figure 11 (left panel) compares the two algorithms with the largest MBE and RMSE. The negative bias of the KNN mostly arises from the heavier tail of the distribution on the negative side. In the right panel, two of the best performing algorithms are compared and display very similar distributions. This qualitative outlook reinforces the conclusion that the residual bias is small compared to the spread of the predictions and should not play any major role in practical applications.

Figure 11. Distributions of the residuals of the actual and predicted average power over the hour, namely (P_1h − P̂_1h). Continuous probability density functions (PDF) were obtained from the histograms via kernel density estimation [33] to enhance the readability of the plots. For the same reason, only two pairs of algorithms were selected for comparison. Dashed lines represent the MBE, which is the average value of the PDF. ∆T_B = 8 h was set in all four cases.
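The relation σ² = RMSE² − MBE² follows directly from the definition of variance, Var(δ) = mean(δ²) − mean(δ)², and can be checked numerically on synthetic residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy residuals delta = P_1h - P_1h_hat: small positive bias, larger spread
delta = rng.normal(loc=0.02, scale=0.05, size=10_000)

mbe = delta.mean()                   # mean bias error
rmse = np.sqrt((delta ** 2).mean())  # root mean square error
var = delta.var()                    # ddof=0: mean(d^2) - mean(d)^2

print(np.isclose(var, rmse ** 2 - mbe ** 2))  # -> True
```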
To offer some more qualitative insight into the predictive performance, in Figure 12 we show scatter plots of the actual vs. predicted power at different forecast horizons. In keeping with the notation of Equation (3), examining a forecast horizon of 5 min corresponds to selecting P_1 and P̂_1, and similarly for the other horizons. Figure 13 shows a comparison of predicted and measured output on three full days, using the Random Forest algorithm with ∆T_B = 2 h. These days were selected by computing the RMSE on each day of the test set for the three selected values of the forecast horizon. The distribution of the error was then divided into three quantiles, and one day for each quantile was randomly selected, to represent cases where the prediction performs increasingly worse (from top to bottom in Figure 13). Prediction errors during the selected days are also reported in Table 9.

Table 9. Predictive performance during the day in terms of the indicators RMSE, MAE, and R², for the three representative days reported in Figure 13.

Plots of this kind must be interpreted assuming that the algorithm is in continuous operation, providing updated predictions based on new measurements every 5 min. One can then compare P with the value predicted 5, 30, and 60 min before, as shown in Figure 13. The regressor seems able to predict rather rapid oscillations in the power output in a short-term forecasting scenario (f_h = 5 min), while it has a tendency to overshoot for longer lead times. It is interesting to note that the magnitude of the rightmost peak in Day 2 was heavily underestimated even for f_h = 5 min. On the other hand, the small rise appearing after the power dropped to zero for several time steps in Day 3 was captured, in spite of the anomalous nature of the event. Note that the flat behavior of the power output in the central part of Day 1 (close to clear sky conditions) is due to the solar tracking system.
It is interesting to note that the algorithms provided satisfactory performance while only making use of the global horizontal irradiance. No information on the solar tracker (such as the time dependence of the panel angle) was explicitly fed to the models. The capability to learn complex patterns, reducing physical modeling efforts, is one of the main advantages of a data-driven approach.

Figure 13. Comparison of predictive performance on three selected days and three forecast horizons. The black line (measured) is the actual power, whereas the red line (predicted) is the power which was predicted 5, 30, or 60 min before, respectively from left to right.

Conclusions
In this study, several well-known machine learning and deep learning algorithms were compared for the purpose of intra-hour forecasting of the power output of a 1 MW industrial-scale photovoltaic plant; the dataset covers 1 year of measurements.
Each algorithm was tested over a large set of training parameters, and standard machine learning practices, such as cross-validation, were applied. We maintain that this approach was required for a fair assessment of the performance.
The results highlighted some peculiar features of each regression technique: the Random Forest algorithm, with appropriate selection of the lookback time, was generally the best performing according to all the figures of merit we considered. The strength of this algorithm also lies in its robustness with respect to the tuning of the training parameters. This is a particularly desirable feature for industrial applications, where the greater variability of the MLP would require fine-tuning of the parameters for each application on which the algorithm has to be deployed. On the other hand, the ultimate predictive performance of the two algorithms was similar. It must also be highlighted that for short forecast horizons Lasso and Linear Regression tend to match or outperform the more flexible algorithms; thus, a hybrid solution for prediction could be a promising development. For the case at hand, we did not find significant benefits in LSTM networks, although they were the only algorithms to show an improvement when increasing the lookback time. This suggests that more complex LSTM architectures could be able to outperform conventional algorithms, provided that suitable hardware to handle very deep neural networks is available in the target application. There are many variables characterizing PV power forecasting efforts [2]: forecast horizon, forecast interval, type of input data, time resolution, size of the target plant, and geographical region. Therefore, it is often impossible to find a benchmark that shares all these characteristics against which new results can be compared directly [32,34].
The results achieved in this paper are related to a single dataset; to reduce the site dependence of the obtained results, a comparison across different datasets should be performed. Future work will extend the analysis exposed here to datasets from different site conditions. Nevertheless, we have identified several studies relevant to contextualizing our findings within the present state of the art. For instance, LSTM and CNN algorithms were shown to outperform MLP in nowcasting (f_h = 1 min) using sky images as additional input; the maximum reported skill score was 0.21 against persistence [35]. Using sky camera features, predictions 10 to 30 min ahead were reported with a forecast skill of up to 0.43 [36]. A skill score of 0.1 was reported for up to 30 min-ahead forecasting of PV power using only the spatio-temporal correlation between neighboring PV plants [37]. A recent study [38] investigated several nowcasting algorithms in a microgrid setting (750 W PV power), finding Random Forest to be the best performing. Other studies aim at forecasting the solar irradiance rather than the PV power directly. In this setting, forecast skills of up to 0.3 [39] and 0.15 [40] were reported for intra-hour predictions.
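The skill scores quoted above are conventionally computed against a persistence reference as s = 1 − RMSE_model / RMSE_persistence, where persistence simply carries the last observation forward. A minimal sketch, with made-up toy values:

```python
import math

def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def skill_score(model_pred, persistence_pred, obs):
    """Forecast skill vs. the persistence reference:
    s = 1 - RMSE_model / RMSE_persistence (positive means improvement)."""
    return 1.0 - rmse(model_pred, obs) / rmse(persistence_pred, obs)

obs         = [0.5, 0.6, 0.8, 0.7]        # measured power (normalized)
model       = [0.52, 0.58, 0.75, 0.71]    # model forecast
persistence = [0.4, 0.5, 0.6, 0.8]        # last observed value carried forward
print(round(skill_score(model, persistence, obs), 2))  # → 0.78
```

A score of 0 means no improvement over persistence, and 1 would mean a perfect forecast; the literature values of 0.1–0.43 cited above sit in between.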
Our approach to data preprocessing prioritized simplicity of implementation and the identification of the smallest possible set of useful predictors. This is motivated by the need to minimize the number of sensors to be installed and monitored for effective forecasting. For example, it is remarkable that good-quality predictions were obtained for a solar-tracking plant using only the global horizontal irradiance. The capability to learn complex patterns without resorting to physical modeling is the main feature of data-driven approaches, and it was achieved by all the tested algorithms. Naturally, more complex regressors, such as Random Forest, should always be preferred for optimal performance. Deep learning tools such as LSTM networks are very promising but might require dedicated hardware (GPUs) or larger training datasets to show a significant performance advantage when deployed in industrial applications. This study suggests an effective data-driven workflow for intra-hour PV power forecasting, covering the steps from feature selection to the identification of the best performing algorithms. Such an approach could be of interest for practical applications.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Table A1 details the range of values used in the hyperparameter search. For each algorithm, m combinations of training parameters were randomly sampled from the available range in order to make the search more efficient. We stress that this sampling was performed only once; therefore, the same selection of parameters was used for all values of ∆T_B. The number m was tuned for each algorithm in order to maintain a manageable computing time for the cross-validation and search. Computation was performed on a dual Xeon E5-2680 machine, featuring 16 physical cores and 32 logical threads.
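The one-shot random sampling of the m parameter combinations could be sketched as follows. The grid values below are illustrative placeholders, not those of Table A1; fixing the seed is one way to guarantee that the same selection is reused for every ∆T_B:

```python
import itertools
import random

# Hypothetical grid for a Random Forest (values are placeholders only)
grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth":    [4, 8, 16, None],
    "max_features": ["sqrt", "log2", 1.0],
}

def sample_combinations(grid, m, seed=0):
    """Randomly draw m parameter combinations from the full Cartesian grid.
    Sampling once with a fixed seed makes the selection reproducible, so
    the same m combinations can be evaluated for every lookback window."""
    keys = list(grid)
    full = list(itertools.product(*(grid[k] for k in keys)))
    rng = random.Random(seed)
    return [dict(zip(keys, combo)) for combo in rng.sample(full, m)]

combos = sample_combinations(grid, m=5)
```

Randomly subsampling the grid trades exhaustive coverage for a search cost that scales with m rather than with the full grid size, which is what keeps the cross-validation time manageable.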