Application of Temporal Fusion Transformer for Day-Ahead PV Power Forecasting

Abstract: The energy generated by a solar photovoltaic (PV) system depends on uncontrollable factors, including weather conditions and solar irradiation, which leads to uncertainty in the power output. Forecasting PV power generation is vital to improve grid stability and balance the energy supply and demand. This study aims to predict hourly day-ahead PV power generation by applying the Temporal Fusion Transformer (TFT), a new attention-based architecture that incorporates an interpretable explanation of temporal dynamics and high-performance forecasting over multiple horizons. The proposed forecasting model has been trained and tested using data from six different facilities located in Germany and Australia. The results have been compared with other algorithms, namely Auto Regressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), and Extreme Gradient Boosting (XGBoost), using statistical error indicators. The use of TFT has been shown to be more accurate than the other algorithms for forecasting PV generation in the aforementioned facilities.


Introduction
Renewable energy is rapidly increasing worldwide since it became an economical alternative to conventional energy sources such as fossil fuels, which are responsible for greenhouse gas (GHG) emissions, have a limited supply, and have increasingly unpredictable prices. Furthermore, the European Commission targets aim to reduce emissions by 50% to 55% by 2030 compared with 1990 levels and to become climate neutral by 2050 [1].
Electricity generated by solar photovoltaic (PV) systems increased by 23% in 2020 to reach 821 TWh worldwide [2]. Projections indicate that the global installed capacity of this technology could rise by more than three times, reaching 2,840 GW in 2030 and 8,519 GW in 2050 [3]. The reason is that solar energy presents many advantages: it is an environmentally friendly renewable source, is abundant, has a long service life [4], and, as shown in Figure 1, it is becoming one of the most competitive power generation technologies after more than ten years of cost declines, with a promising future ahead. However, variations in solar irradiance and meteorological conditions cause fluctuations in solar power generation and lead to uncertainty in power output. This results in a power imbalance between the demand and the supply side of the grid [6]. Also, the unpredictable output significantly impacts the economic dispatch and the scheduling, stability, and reliability of the power system operation [7]. Precise forecasting of the energy produced is essential for grid operators, considering that deviations need to be compensated by the remaining generation technologies. Moreover, it is not only beneficial for system operators, as PV plant managers can prevent potential penalties arising from differences between produced and predicted energy [8,9]. Therefore, PV power forecasting contributes to stabilizing and optimizing the operation and performance of the grid and renewable energy microgrids [10-13] by means of the comparative analysis between required and predicted energy. It also positively affects the customer by reducing the uncertainty and the generated energy cost [8,14]. Solar power forecasting ranges from ultra-short-term to long-term forecasts. Long-term forecasts are especially relevant for long-term power system planning.
However, industry and researchers currently focus primarily on short-term or day-ahead predictions, as these can account for cloud cover variability and provide more accurate results [15]. There are several approaches to solar power forecasting: physical models [16], persistence forecast [17], statistical models [18], and artificial intelligence [19], including machine learning (ML) and deep learning (DL) [20,21], as well as ensemble and hybrid-based models [14,22-26]. Common ML methods applied to PV power forecasting in the literature are artificial neural networks (ANN) [20,27,28], support vector machines (SVM) [29,30], and extreme learning machines (ELM) [31,32]. Among all PV forecasting approaches, ANN have shown superior accuracy, being successfully applied in short-term forecast horizons [33-36]. In addition, machine learning methods such as ANN can handle large amounts of data and produce accurate predictions for short-term periods, without the complexity of mathematical and physical relationships.
Some methods used for PV power forecasting present certain limitations. Persistence and statistical methods cannot handle nonlinear data. Physical methods depend on local meteorological data to which access can be limited. Numerical weather prediction (NWP) models are widely used to provide forecasts of weather conditions, but they often lack sufficient spatial and temporal resolution. ANN suffer from problems of local minima, overfitting, and complex structure. DL methods have been developed to overcome these limitations [37,38]. DL is an advanced branch of machine learning with the ability to process non-linear and complex relationships between various inputs and the forecasted output. Examples of DL are Convolutional Neural Networks (CNN) [39-41], Recurrent Neural Networks (RNN) [42,43], Long Short-Term Memory (LSTM) [21,44], and Convolutional Self-Attention LSTM [45]. These methods have proven to achieve better prediction results when applied to solar power forecasting compared to other machine learning methods, especially those based on LSTM networks [46-49].
Time series can have seasonal patterns or increasing and decreasing trends. LSTM can learn to remember useful information from several time steps back that is relevant for producing the predicted output. Traditional recurrent networks do not have the ability to decide which information to remember. Nevertheless, LSTM has difficulty in determining long-term dependencies.
More complex deep learning architectures that outperform previous ones are constantly being created. For example, the Temporal Fusion Transformer (TFT) [50] is an attention-based deep neural network for multi-horizon time series forecasting that integrates LSTM in its architecture. TFT also uses the attention mechanism to learn long-term dependencies in time series data, which means that it can directly learn patterns during training. This makes TFT more advantageous than other time series methods in terms of interpretability. Moreover, TFT minimizes a quantile loss function, which enables it to generate a probabilistic forecast with a confidence interval.
This work aims to predict hourly day-ahead PV power generation by applying TFT as the forecasting method. TFT incorporates an interpretable explanation of temporal dynamics and high-performance forecasting over multiple horizons. It uses specialized components to choose important attributes and a series of gating layers to remove nonessential elements, resulting in high performance in a wide range of applications.
This paper contributes to extending the application of innovative DL methods to PV power forecasting. The novel aspect of this work is the use of TFT for PV power generation forecasting, as no previous literature was found that applies and compares this method for predicting PV output performance. TFT potentially enhances the accuracy of forecasts compared to other methods by learning short- and long-term temporal relationships. This not only benefits the management of PV production systems but can also influence the stability and operation of the power system.
The predicting method is evaluated with common datasets used in literature. They include real data from six photovoltaic systems located in Germany and Australia. The results of the proposed DL method are compared with the following methods: Auto Regressive Integrated Moving Average (ARIMA), Multi-Layer Perceptron (MLP), LSTM, and Extreme Gradient Boosting (XGBoost).

Materials and Methods
The methodology applied in this study includes the following steps. (1) Data gathering, where data from different sources is collected to define the final dataset. The electrical energy produced by a photovoltaic system depends on variables such as horizontal irradiation, temperature, humidity, and solar zenith and azimuth, among others. (2) Data pre-processing, to transform raw data into a form more suitable for modeling, including data cleaning, feature selection, and data transformation. (3) Model selection. (4) Training and tuning, where the selected model is trained for different combinations of hyperparameters, selecting the model with the best performance. (5) Results evaluation. Figure 2 summarizes the process carried out.

Type of Data
The dataset used in this study includes historical power generation data from six different facilities as the dependent variables, and meteorological data, solar angles, and calendar data as independent variables.
The six facilities are located in Germany and Australia. Those located in Germany consist of three roof-mounted systems situated on industrial and residential buildings in the city of Konstanz [51]. In this city, the average global horizontal irradiation is 1212 kWh/m² per year, the hottest month is July, with an average high of 25 °C and low of 15 °C, and the coldest month is January, with an average low of −2 °C and high of 4 °C. The data is provided with a 5-min resolution.
The facilities in Australia are located at the Desert Knowledge Australia Solar Centre (DKASC), in Alice Springs, which is a real-life demonstration of solar technologies in an area of high solar resources [52]. In this area, the average global horizontal irradiation is 2271 kWh/m² per year. The hottest month of the year in Alice Springs is January, with an average high of 35 °C and a low of 22 °C. The coldest month of the year in Alice Springs is July, with an average low of 5 °C and a high of 20 °C. The data is provided with a 5-min resolution. Table 1 shows the specific information of each PV system and its data.

The meteorological variables used include global horizontal irradiance, wind speed, precipitation, temperature, and humidity. These factors affect the PV energy output of the system to a greater or lesser extent. The DKASC dataset includes measurements of these variables, while for the systems in Konstanz this information is collected from a nearby weather station [53].
The solar angles, zenith and azimuth, were calculated based on the timestamp and the location of each PV system for each record of the dataset. The solar zenith angle is the angle between the sun's rays and the vertical direction, defining the sun's apparent altitude. The azimuth represents the sun's relative direction along the local horizon. Figures 3 and 4 show the aforementioned variables during a five-day period in one of the facilities from Germany and one from Australia.
Regarding the calendar data, cyclical calendar variables need to be transformed to represent the data sequentially. For example, December and January are 1 month apart, although those months will appear to be separated by 11 months. To avoid this, two new features were created, deriving a sine and a cosine transform of the original feature.
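The sine/cosine transform described above can be sketched as follows; the month values and the `encode_month` helper are illustrative, not the authors' code:

```python
import numpy as np

# Encode the month (1-12) on the unit circle so that December and January
# end up adjacent in feature space, as described in the text.
def encode_month(month):
    angle = 2 * np.pi * (month - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = encode_month(12), encode_month(1), encode_month(6)

# December-January are close in the encoded space, January-June are far,
# even though a raw month number would place December 11 "units" from January.
print(np.linalg.norm(dec - jan))  # small: adjacent months
print(np.linalg.norm(jan - jun))  # large: half a year apart
```

The same encoding applies to any cyclical calendar feature (hour of day, day of week) by replacing the period of 12 with the appropriate cycle length.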
Finally, the data was converted to hourly resolution and merged to create the final dataset, consisting of eleven independent variables: five meteorological variables, four calendar factors, and two variables representing the solar angles. Table 2 shows the variables initially considered.


Data Pre-Processing
Data pre-processing is defined as the transformation of raw data into a form more suitable for modeling. In this study, the following pre-processing tasks are included: data cleaning, feature selection, and data transformations.
Data cleaning: The raw dataset has to be cleaned to identify and correct errors that may negatively impact the performance of the predictive model. For that, one important task is the detection of outliers, taking into account the physical behavior of the system. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was used to identify outliers in the dataset. This algorithm creates groups depending on a distance measurement between points, usually the Euclidean distance. A point is included in a cluster if the distance between this point and another point of the cluster is less than a parameter eps, taking into account a minimum number of points for each cluster. As a result, it identifies outliers as the points that do not belong to a cluster, which are those located in low-density regions [54]. Considering that the PV energy production should fall within certain bounds for a given global horizontal irradiation, zenith angle, and azimuth angle, these four features were used when applying this algorithm.
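A minimal sketch of this outlier detection with scikit-learn's DBSCAN, on synthetic data; the eps and min_samples values, feature ranges, and the planted outlier are assumptions for illustration, not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy records of (irradiation, zenith, azimuth, pv_output); the last row is
# a physically implausible PV output for its irradiation level.
rng = np.random.default_rng(0)
irr = rng.uniform(200, 800, 200)
X = np.column_stack([irr, rng.uniform(20, 70, 200),
                     rng.uniform(90, 270, 200), 0.005 * irr])
X = np.vstack([X, [500.0, 45.0, 180.0, 50.0]])  # planted outlier

# Standardize so the Euclidean distance treats all four features comparably.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# DBSCAN labels low-density points as -1 (noise), i.e., outliers.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(Xs)
outlier_rows = np.where(labels == -1)[0]
```

Rows flagged as noise would then be handed to the replacement step described below, rather than dropped outright.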
Next, the missing values and the values of solar energy production identified as outliers were replaced taking into account the averaged coefficient between the PV generation and the horizontal global irradiation at the same hour on previous days. The new value for solar energy production is obtained by multiplying this coefficient by the horizontal global irradiation at the considered index.
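The replacement rule above can be sketched with pandas as follows; the column names (`hour`, `ghi`, `pv`) and the toy values are illustrative assumptions:

```python
import pandas as pd

# Toy records at the same hour on consecutive days; the last PV value is missing.
df = pd.DataFrame({
    "hour": [10, 10, 10, 10],
    "ghi":  [500.0, 520.0, 480.0, 510.0],   # horizontal global irradiation
    "pv":   [2.5, 2.6, 2.4, None],          # PV energy production
})

# Average the PV/irradiation coefficient per hour over the available days...
hist = df.dropna(subset=["pv"])
coef = (hist["pv"] / hist["ghi"]).groupby(hist["hour"]).mean()

# ...then impute: coefficient at that hour times the irradiation at the index.
missing = df["pv"].isna()
df.loc[missing, "pv"] = df.loc[missing, "hour"].map(coef) * df.loc[missing, "ghi"]
```

Here every historical ratio is 0.005, so the missing value becomes 0.005 × 510 = 2.55.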
Feature selection: Feature selection is applied using wrapper methods. These methods train models using a subset of the input variables, where the objective is to find the subset that builds the model that yields the best results. As this process is computationally very expensive, we use TFT, a model that already has interpretable feature selection within its architecture to guide it.
The wrapper method used is Backward Elimination [55]. Initially, a TFT model is trained with all the variables; then, at each iteration, the least significant variable in the TFT model is removed, which should improve the performance of the model. This process stops when no improvement is observed after discarding another feature.
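The elimination loop can be sketched generically as below. The paper wraps a TFT model; here a linear regressor and synthetic data stand in so the loop stays self-contained, and the scoring via cross-validation is an assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only features 0 and 1 actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

features = list(range(X.shape[1]))
best = cross_val_score(LinearRegression(), X[:, features], y, cv=3).mean()

# Backward elimination: drop a feature whenever doing so does not hurt the
# score; stop when no feature can be removed without degrading performance.
improved = True
while improved and len(features) > 1:
    improved = False
    for f in list(features):
        trial = [g for g in features if g != f]
        score = cross_val_score(LinearRegression(), X[:, trial], y, cv=3).mean()
        if score >= best:
            best, features, improved = score, trial, True
            break
```

Informative features survive because removing them sharply degrades the score, while uninformative ones are discarded.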
Data transforms: Continuous input variables have different scales, which may result in a slow or unstable learning process. For that reason, they need to be normalized so that every datapoint has the same importance. In this study, standardization (or z-score normalization) was used to rescale the data. This is a technique where the final values have a zero mean and a unit standard deviation. The formula of the z-score normalization is:

z_i = (x_i − µ) / σ

where x_i is the input data, µ is the average value of the feature, and σ is its standard deviation. For the target variable, different transformations are applied: first, the natural logarithm of the values is taken; then, the values are scaled by dividing each one by the median of its series.
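Both transforms can be sketched in a few lines; the sample values are arbitrary, and dividing by the median of the logged series is one reading of "the median in its series":

```python
import numpy as np

# z-score standardization of a continuous input feature.
x = np.array([3.0, 7.0, 11.0, 15.0, 19.0])
z = (x - x.mean()) / x.std()        # zero mean, unit standard deviation

# Target transform: natural log, then scale by the median of the series.
pv = np.array([1.0, 2.0, 4.0, 8.0])
pv_log = np.log(pv)
pv_scaled = pv_log / np.median(pv_log)
```

The inverse transforms (multiply by the median, exponentiate; or multiply by σ and add µ) are applied to model outputs before computing errors in the original units.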

Data Splitting
The dataset is divided into three sets for training purposes:
- Training set: the sample of the data used to fit the model.
- Validation set: the sample of data used to provide an unbiased evaluation of a fitted model while tuning its hyperparameters.
- Test set: the part of the data used to fairly evaluate the final model fit on the training set.

This division is not done evenly: each photovoltaic installation in the dataset is divided so that 70% of its randomly selected samples go into the training set, 20% into the validation set, and 10% into the test set.
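A minimal sketch of the per-facility 70/20/10 random split; the sample count and seed are illustrative, and a real pipeline would apply `split_indices` once per installation:

```python
import numpy as np

def split_indices(n, seed=0):
    """Randomly partition n sample indices into 70% train, 20% val, 10% test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(1000)  # e.g., one facility with 1000 hourly samples
```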

Forecasting Models
The solar energy production was predicted using TFT, and its performance was compared with other algorithms: ARIMA, MLP, LSTM, and XGBoost. The MLP, LSTM, and TFT models were trained using the implementation in PyTorch Forecasting [56], while ARIMA and XGBoost used their own Python packages.

TFT
TFT is an attention-based neural network architecture specially designed to combine multi-horizon forecasting with interpretable insights into temporal dynamics [50]. TFT utilizes specialized building blocks to select relevant features and leave out unused components, enabling high performance over a broad range of tasks. Its main components are:
- Temporal processing to learn both long- and short-term temporal patterns from both known and observed time-varying inputs. A temporal self-attention decoder, based on the Masked Multi-Head Attention layer, is employed to capture long-term dependencies, while a sequence-to-sequence layer is used to process local information.
- Quantile forecasts, which allow us to calculate prediction intervals to determine the likelihood of each target value.
TFT improves the interpretability of time series forecasting through the identification of the globally-important variables, the persistent temporal patterns, and the significant events for the prediction problem. It explains how and why the results are generated by the model, in contrast with the concept of a "black box" where it is difficult to explain how models arrive at their predictions. This makes the model's output more trustworthy and easier to work with, which is the objective of Explainable AI (XAI) [57].

ARIMA
ARIMA is a statistical model that predicts future values based on past time series data [58]. This algorithm consists of three components: autoregression (AR), which models each value as a linear combination of its past values; integration (I), which differences the series to make it stationary; and moving average (MA), which models each value as a linear combination of past forecast errors.

MLP
A MLP is a fully connected class of feedforward ANN [59]. It is made up of many interconnected computational units, called neurons. An artificial neuron receives inputs from other neurons, which are weighted and summed, establishing the impact of the information going through them. This weighted sum is then transformed by an activation function to generate the output of the neuron. These activation functions are needed to learn complex patterns and non-linear curves.
In a neural network, the neurons are arranged into multiple layers: an input layer, one or more hidden layers, and an output layer. The number of neurons of each layer and the number of hidden layers are the main parameters to optimize.
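The forward pass described above, weighted sums followed by an activation, can be sketched in plain numpy; the layer sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 11))          # 4 samples, 11 input features

# One hidden layer of 16 neurons and a single output neuron.
W1, b1 = rng.normal(size=(11, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# Each neuron computes a weighted sum of its inputs, then a nonlinearity.
hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU activation
output = hidden @ W2 + b2              # one forecast per sample
```

Training adjusts W1, b1, W2, b2 by backpropagation; the number of hidden neurons and layers are the main hyperparameters, as noted above.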

LSTM
LSTM networks are a special class of RNN, capable of learning long-term dependencies [60]. They can retain previous information and learn temporal correlations between consecutive data points. The memory block, its building structure, is composed of three interacting gates.
In LSTM, the core variable is the cell state, which allows information to be carried from previous steps throughout the memory block. Its interacting gates control the addition or removal of information from the cell state. In particular, the first gate, called the "forget gate", determines which information from previous steps to discard from the cell state. The second one, the "input gate", controls the updating of the cell state. Finally, the last interacting gate, called the "output gate", provides the final output, based on a filtered version of the updated cell state.

XGBoost
XGBoost is a supervised predictive algorithm that uses the boosting principle [61]. The idea behind boosting is to generate multiple "weak" prediction models sequentially, where each one takes the results of the previous model to generate a "stronger" model. It uses different types of weak decision trees as its models. To obtain a stronger model from these weak models, an optimization algorithm is used, in this case Gradient Descent.
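The boosting principle can be shown in miniature with a hand-rolled gradient boosting loop; this is an illustration of the idea, not the XGBoost implementation, and the data, tree depth, and learning rate are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Start from a constant prediction; each weak tree fits the residual
# (the negative gradient of squared error) of the running ensemble.
pred = np.full_like(y, y.mean())
learning_rate = 0.3
for _ in range(50):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += learning_rate * tree.predict(X)

mse = np.mean((y - pred) ** 2)  # far below the constant-model baseline
```

Each "weak" depth-2 tree is poor on its own; summed along the gradient direction, they form a "strong" model, which is the mechanism XGBoost industrializes.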

Hyperparameter Tuning
The different hyperparameters of the model are predefined with default values, although better performance is achieved if they are optimized for the problem under study. Neural Network Intelligence (NNI) is used to tune these hyperparameters [62]. This tool allows us to define a search space, generating different configurations of the parameters. Next, NNI performs training for each combination, in order to find the best outcome. NNI trains the model by generating its own random training, validation, and test sets from the dataset, and training it with a combination of the defined hyperparameters.

Evaluation Metrics
To evaluate and compare the forecasting performances, several statistical error metrics were employed: root mean square error (RMSE), mean absolute error (MAE), mean absolute scaled error (MASE), coefficient of determination (R²), and quantile loss. These evaluation metrics are defined as:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
MASE = MAE / MAE_naive
R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − y_avg)²
Quantile loss = (1/(n·|Q|)) Σ_{q∈Q} Σ_{i=1}^{n} max( q(y_i − ŷ_{i,q}), (q − 1)(y_i − ŷ_{i,q}) )

where y_i and ŷ_i represent the real and forecasted values, respectively, y_avg is the mean of the real values, MAE_naive is the MAE of a naïve model that predicts the values by shifting actual values into the future, ŷ_{i,q} is the forecast for quantile q, and Q is the set of quantile values fitted in our study, Q = {0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98}.
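These metrics translate directly into numpy; the one-step-shift naïve forecast in `mase` is one common reading of "shifting actual values into the future":

```python
import numpy as np

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def mase(y, yhat):
    # Denominator: MAE of a naive model that repeats the previous value.
    naive_mae = np.mean(np.abs(y[1:] - y[:-1]))
    return mae(y, yhat) / naive_mae

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def quantile_loss(y, yhat_q, q):
    # Pinball loss for a single quantile q; average over q in Q for the total.
    err = y - yhat_q
    return np.mean(np.maximum(q * err, (q - 1) * err))
```

A perfect forecast gives RMSE = MAE = MASE = 0 and R² = 1; quantile loss penalizes under-prediction with weight q and over-prediction with weight 1 − q.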

Outliers Detection
The percentage of missing values and outliers detected in each dataset are shown in Table 3. The number of outliers identified is greater in the facilities from Germany than in those from Australia. The reason for this behavior could be the distance between the PV facilities and the weather station in Konstanz. Although the tower is located in the same city not far from the facilities, the irradiance recorded at the same moment can vary significantly on partially cloudy days. The analysis of outliers is, therefore, of great importance for those facilities that do not have their own weather station.

Data Analysis
The dataset was analyzed once it was cleaned. First, Figure 5 shows the probability density function (PDF) of the PV energy production data, taking into account only values between sunrise and sunset. These datasets present a bimodal distribution for the Australian facilities, with a peak close to zero and another at high PV energy production. On the other hand, the distributions are left-skewed for the installations located in Germany, with their highest point at low PV energy production. In both cases, the first peak is derived from the hours near sunset and sunrise, where the global horizontal radiation is low.

Figure 6 shows the autocorrelation function (ACF) of PV energy production for two of the time series considered, one for each location. The ACF represents the similarity between two observations depending on the lag time between them. In this case, the results indicate that the PV energy production presents a clear periodic pattern with an interval of 24 h. These results are equivalent for the rest of the facilities.
The Pearson correlation coefficient was also calculated to measure the correlation between the PV energy production and both the meteorological variables and the solar angles (Table 4). The correlation between PV energy production and the different meteorological variables varies between the locations, showing a different influence of these features depending on the climatic conditions. With respect to the correlation with the horizontal global irradiation, it is stronger in the case of the facilities from Australia. Among the facilities in Germany, the highest correlation was found in the GE_2 facility.
The differences in the correlations obtained may be due to several factors. First, the sky conditions in Konstanz are much more diverse than the sky conditions in Alice Springs, which has an impact on the relation between horizontal irradiation and plane-of-array (POA) irradiation [63], and therefore on energy production. Secondly, the GE_1 and GE_3 installations are not oriented towards the south, but slightly towards the southwest, which produces a certain gap between the maximum generation and the maximum irradiation points, as shown in Figure 7. Thirdly, the tilt of the different systems is not the same: the lower the tilt, the greater the correlation. Finally, the facilities in Germany are located on roofs in urbanized areas, which results in a more obstructed horizon that also decreases the correlation.
Finally, Figure 8 represents the relation between the PV energy production and the horizontal global irradiation for the different facilities, highlighting the differences between the months of the year.

Feature Selection
From the dataset, solar horizontal irradiation, temperature, humidity, zenith angle, azimuth angle, and the sine and cosine transformation of the month were found to be valuable parameters for the model, whereas the rest of the variables considered were discarded. Although different feature sets were used, temperature and solar angles were also selected variables in [64], while precipitation was also discarded.
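The sine and cosine transformation of the month mentioned above is a standard cyclic encoding; a minimal sketch (the helper name is mine, not from the paper):

```python
import math

def month_cyclic(month):
    """Encode a month (1-12) on the unit circle so December and January stay adjacent."""
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))

# Adjacent months are equally spaced on the circle, unlike the raw 1-12 encoding,
# where the December-to-January step would look like a jump of 11
dist_dec_jan = math.dist(month_cyclic(12), month_cyclic(1))
dist_jun_jul = math.dist(month_cyclic(6), month_cyclic(7))
```

This lets the model treat the seasonal cycle as continuous across year boundaries.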

Forecasting Results
Using the variables selected, the hourly day-ahead PV power generation was predicted using different models for the two locations. The hyperparameters of MLP, LSTM, and XGBoost were established based on the best models in [28,48,65]. For TFT, the hyperparameters were tuned with NNI, and the best performance was obtained for the values in Table 5. Figure 9 shows the results for different combinations of hyperparameters.

The variables related to the complexity of the neural-network-based models are shown in Table 6. TFT is the most complex model considering the three variables: number of parameters, model size, and training time per epoch, showing an 82.6% and 86.6% increase in size and time complexity, respectively, compared to LSTM.
The forecasting errors obtained with these models are shown in Table 7. As can be seen, TFT outperforms the other models, with the lowest values for all the indicators in both locations. Comparing TFT with LSTM, the second-best model, the RMSE, MAE, and MASE for Australia are 46%, 48%, and 48% lower, respectively. The same behavior can be observed for Germany, with reductions of 20%, 26%, and 54% in RMSE, MAE, and MASE for TFT with respect to LSTM. These results may be explained by the different components integrated in TFT. This algorithm incorporates a temporal self-attention decoder to capture long-term dependencies, allowing it to learn relationships at different scales.
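For reference, the error indicators used in Table 7 can be computed as below. The paper does not spell out its exact MASE scaling, so a common seasonal-naive version with period m = 24 (an assumption for hourly data) is shown:

```python
import numpy as np

def rmse(y, yhat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2)))

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(yhat))))

def mase(y, yhat, y_train, m=24):
    """MAE scaled by the in-sample seasonal-naive error (period m)."""
    y_train = np.asarray(y_train, dtype=float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae(y, yhat) / scale
```

A MASE below 1 means the model beats the seasonal-naive forecast on average, which makes it a convenient scale-free indicator for comparing facilities.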
It also efficiently handles features of each input type: static (time-invariant) features, past-observed time-varying inputs, and future-known time-varying inputs. It has also been found to outperform other prediction models in other areas, such as the forecasting of wind speed or freeway traffic speed [66,67]. LSTM showed good accuracy and better performance than XGBoost, MLP, and ARIMA, a behavior also found in [47,48]. Unlike XGBoost and MLP, both LSTM and TFT can retain previous information and learn temporal correlations between consecutive data points, giving these models better performance. Besides, while the ARIMA model was developed considering only a linear relationship between the exogenous variables and the target variable, both LSTM and TFT can capture non-linear relationships. Therefore, these results indicate that both mechanisms are important for forecasting day-ahead solar production.
The performance of TFT on each facility was compared using MASE and R², the two scale-invariant metrics. The results for Germany are always worse than those for Australia, with a significant reduction in accuracy for both metrics. The forecasting errors with TFT for the different indicators and the different series considered are shown in Table 8. The highest accuracy in forecasting the day-ahead PV power generation is always achieved in AU_1, although the performance of TFT is very similar for the three Australian PV plants: the coefficient of variation, which establishes the extent of the variability in relation to the mean, is 3.73% and 0.05% for MASE and R², respectively. In the case of Germany, the coefficients of variation are 27.62% and 0.29% for MASE and R², showing great variability in the performance of the model across the German facilities.
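The coefficient of variation quoted above is simply the standard deviation relative to the mean, expressed as a percentage; a minimal sketch (population standard deviation assumed, since the paper does not state which estimator it uses):

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean (population std assumed)."""
    v = np.asarray(values, dtype=float)
    return float(100.0 * v.std() / v.mean())

# Identical per-facility scores give 0% (no variability between facilities)
cv_equal = coefficient_of_variation([0.5, 0.5, 0.5])
```

Applied to the per-facility MASE or R² values, it summarizes how consistently the model performs across the plants of each country.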
The forecast accuracy in the German facilities is still not as favorable as in the Australian ones, which may be due to the greater variability in the weather conditions experienced by the solar facilities in Konstanz.
However, the forecast in the facilities in Germany has improved drastically, especially in GE_1 and GE_3. As mentioned, the other factors that could be damaging the Pearson correlation between horizontal irradiation and energy production (the different orientation and tilt of the panels and the more obstructed horizon) may have been corrected by including in the model the solar angles (zenith and azimuth), the meteorological variables (temperature and humidity), and the calendar data.
The solar angles help to predict the relationship between the horizontal irradiation and the irradiation in the plane of the panels [68], which is reinforced by the meteorological variables [63] that also influence other factors, such as the efficiency of the panels [69]. Figures 10 and 11 show the PV power generation forecasts based on TFT and the real solar energy production for representative rainy and sunny days in both locations. These graphs represent the global horizontal irradiation and the observed values from the past 72 h up to the prediction length. The predicted value is represented together with the prediction intervals, showing the uncertainty of the forecasted values. This information is important when managing the operation of the facilities.
For AU_1 on a rainy day, the maximum solar energy production predicted is 3.50 kWh, with 3.22 kWh and 3.76 kWh as the 0.1 and 0.9 percentiles, respectively; the interval between the tenth and the ninetieth percentile is 0.54 kWh. Considering the hours with energy production, the mean and standard deviation of half of this interval are 0.19 kWh and 0.07 kWh, respectively, with a maximum value of 0.27 kWh and a minimum of 0.07 kWh. For the example of a sunny day, the value predicted for the hour with maximum solar output is 4.05 kWh. In this case, the mean and standard deviation of the mid-range between the considered percentiles during the day are 0.12 kWh and 0.03 kWh, respectively, indicating lower variation in these values. These results suggest that the uncertainty is higher and more varied on rainy days.
For GE_2, the maximum production in the rainy-day example is 1.39 kWh, and the mean and standard deviation of half of the interval between the 0.1 and 0.9 percentiles along the day are 0.20 kWh and 0.07 kWh, respectively; the maximum is 0.35 kWh and the minimum is 0.08 kWh. In the sunny-day case, the values are 0.17 kWh and 0.02 kWh for the mean and the standard deviation, respectively, with a solar energy production of 3.2 kWh. Here, the uncertainty is higher than in the case of Australia, although it is constant throughout the day. Comparing the rainy-day and sunny-day examples, the mean values are similar, although the rainy day has a peak solar energy production of 1.39 kWh, whereas on the sunny day the maximum output is 3.2 kWh. This indicates that the uncertainty is also higher on rainy days in the case of Germany.
One possible explanation for the lower accuracy during cloudy or rainy days could be the higher variability of the irradiance levels recorded during such days, which makes the training of the network, and therefore the prediction of PV energy production, less reliable under these circumstances.
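The uncertainty statistics quoted above (mean and standard deviation of half the 0.1-0.9 percentile interval, restricted to hours with production) can be reproduced with a small helper; the function name and the toy arrays below are hypothetical:

```python
import numpy as np

def half_interval_stats(p10, p90, production):
    """Mean and std of half the p10-p90 prediction interval, over producing hours."""
    p10, p90, production = (np.asarray(a, dtype=float) for a in (p10, p90, production))
    mask = production > 0            # keep only the hours with energy production
    half_width = (p90[mask] - p10[mask]) / 2.0
    return float(half_width.mean()), float(half_width.std())

# Toy example: the night hour (no production) is excluded from the statistics
mean_hw, std_hw = half_interval_stats([0.0, 1.0, 2.8], [0.0, 1.4, 3.2], [0.0, 1.2, 3.0])
```

Wider half-intervals on rainy days translate directly into the higher mean values reported for those examples.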

TFT Interpretability
TFT improves the interpretability of time series forecasting through the calculation of variable importance scores for the different types of features, and through the representation of the attention weight patterns. These results can be seen in Figure 12, which shows the importance of the past-observed time-varying inputs; Figure 13, representing the importance of the future-known time-varying features; and Figure 14, showing the attention weight patterns for one-step-ahead forecasts.

These datasets also have three static inputs: target center, target scale, and the identification of the facility, which provide additional information and context to the model. The last variable is needed to identify each series, since TFT allows all of them to be trained together, also learning from general patterns among series. The first two are related to the standardization of the target variable and are included as static variables in the model. For these series, target center has a value of 0 and target scale is the median of the time series. The importance of these variables is as follows: target center > series id > target scale.
The encoder variables contain the features for which past values are known at prediction time, and consist of the features selected previously in addition to the index indicating the relative time. In this case, the relative time takes values between −72 (input length) and 0. The importance of each of these variables can be seen in Figure 12. The horizontal solar irradiation received most of the attention (almost 25%), followed by the target variable and the solar zenith (around 12.5% each). Relative humidity and the sine and cosine transformation of the month are variables with a limited weight on the encoder side.
Decoder variables are those features for which future values are known at prediction time. For the decoder, the relative time index takes values between −72 and 24 (prediction length). The importance of each of these variables for the prediction model can be observed in Figure 13. Solar zenith and horizontal solar irradiation play a significant role, summing almost 60% of the weighted importance.
The results from both the encoder and decoder weighted importance highlight the necessity of having a good representation of solar zenith and horizontal solar irradiation to achieve high prediction performance.
Finally, Figure 14 represents the attention weight patterns for one-step-ahead forecasts, which can be used to understand the most important past time steps that the TFT model focused on. It shows that the attention displays a cyclic pattern, with clear attention peaks at daily intervals. Thus, the attention is focused on the closest values and on values at the same hour of previous days. This behavior is understandable, since the PV energy production follows the same periodic pattern, as can be seen in Figure 6.
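The cyclic attention pattern of Figure 14 can be illustrated with a synthetic weight profile; the numbers below are entirely made up for illustration and are not the model's actual weights:

```python
import numpy as np

# Hypothetical attention over a 72 h encoder window (lag 1 = most recent hour).
# Weight decays with distance but gets a bump at whole-day lags (24, 48, 72 h),
# mimicking the peaks at daily intervals seen in the attention plot.
lags = np.arange(1, 73)
weights = 1.0 / lags + 0.5 * (lags % 24 == 0)
weights /= weights.sum()

top4 = np.sort(lags[np.argsort(weights)[-4:]])   # the four most-attended lags
# Most attention goes to the closest hour and to the same hour on previous days
```

Under this profile, the most-attended lags are the most recent hour plus the daily lags, which matches the behavior described for the trained model.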

Conclusions
PV energy production varies significantly due to many factors, including weather conditions, time of day, season, solar irradiance, panel degradation, shading, and other uncontrollable factors. These variations create challenges for power systems, causing grid instability and imbalances between electricity supply and demand, among other complications. An accurate forecast of PV power generation helps to guarantee the reliable operation of the power system.
In this study, TFT, a novel deep learning algorithm, is tested on the hourly day-ahead forecasting of PV energy production using datasets from different facilities. This algorithm is able to learn both long-term and short-term temporal relationships, and it also efficiently builds feature representations of static variables and of observed and known time-varying inputs. These components allow the model to produce accurate forecasts, outperforming other advanced methods such as LSTM.
TFT outperformed the other models in predicting PV energy production in the six facilities analyzed, providing better values for all the accuracy indicators. Comparing TFT with LSTM, the second-best model, the RMSE, MAE, and MASE are 46%, 48%, and 48% lower, respectively, for the facilities in Australia, and 20%, 26%, and 54% lower for the facilities in Germany.
The TFT model has also performed well in facilities without an on-site weather station, where the meteorological variables had to be collected from the nearest meteorological station. This is of great importance, since a large number of small and medium solar PV facilities do not have a pyranometer on site. The study also shows the importance of detecting and replacing outliers in these types of facilities.
The importance of the decoder and encoder variables has also been calculated, revealing that solar horizontal irradiation and the zenith angle are the key variables for the model. The model can also calculate prediction intervals, estimating the uncertainty of the forecast, which facilitates the operation and management of the facilities and the optimization of economic dispatch.
Regarding the limitations and complexity of the model, it should be noted that TFT needs a larger amount of data to achieve good prediction results. It is also a complex model, requiring higher maintenance due to greater computational and storage requirements. When this is not an issue, the outcomes show that this research can be of great benefit to grid operators, plant managers, and customers: the use of more complex models such as TFT provides more accurate PV production predictions, which contributes to improving the management of the power system.