Short-Term Load Forecasting Using Encoder-Decoder WaveNet: Application to the French Grid

The prediction of time series data in the energy sector (prediction of renewable energy production, forecasting prosumers’ consumption/generation, forecasting country-level consumption, etc.) has numerous useful applications. Nevertheless, the complexity and non-linear behaviour of such energy systems hinder the development of accurate algorithms. In this context, this paper investigates the use of a state-of-the-art deep learning architecture to perform precise 24-h-ahead load demand forecasting for the whole of France using RTE data. To this end, the authors propose an encoder-decoder architecture inspired by WaveNet, a deep generative model initially designed by Google DeepMind for raw audio waveforms. WaveNet uses dilated causal convolutions and skip-connections to exploit long-term information. This kind of novel ML architecture presents several advantages over other statistical algorithms. On the one hand, the training process of the proposed deep learning model can be parallelised on GPUs, which is an advantage in terms of training time compared to recurrent networks. On the other hand, the model prevents degradation problems (exploding and vanishing gradients) thanks to its residual connections. In addition, this model can learn from an input sequence to produce a forecast sequence in a one-shot manner. For comparison purposes, a comparative analysis between the best-performing state-of-the-art deep learning models and traditional statistical approaches is presented: Autoregressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Multi-Layer Perceptron (MLP), causal 1D Convolutional Neural Networks (1D-CNN) and ConvLSTM (Encoder-Decoder). The values of the evaluation indicators reveal that WaveNet exhibits superior performance in both forecasting accuracy and robustness.


Introduction
Accurate load forecasting is the basis of energy investment planning. It plays a vital role in the decision making and operation of the energy market. While the overestimation of energy consumption leads to wasted financial resources, the underestimation causes power outages and failures throughout the electrical grid. The authors in [1] conclude that a 1% increase in load forecasting error can increase annual operating costs by about $10 million.
Given the previous context, the authors aim to introduce a novel Deep-Learning algorithm in order to improve on the current results achieved by other state-of-the-art solutions. Specifically, an encoder-decoder WaveNet model, usually applied to other time-series forecasting fields, is proposed. To such an end, this paper focuses on demonstrating the performance of the proposed approach at a macroscopic level (French national consumption) for short-term power forecasting (one day ahead).
Short-term load forecasting is attracting widespread interest among researchers due to its increased importance in smart grids and micro-grids [2]. Beyond classical short-term time series techniques, the authors in [14] present a seq2seq approach based on RNNs to make medium-to-long-term predictions of electricity consumption in a commercial and a residential building at hourly resolution. D. L. Marino et al. investigate in [15] two LSTM-based architectures (a standard and a seq2seq architecture) to predict the electricity consumption of one residential customer. The results show that the seq2seq approach performed well at both minutely and hourly resolution, while the standard LSTM showed poor performance on one-minute resolution data. Convolutional neural networks were employed for time series forecasting in [16] and time series classification in [17]. Moreover, thanks to advances in the Natural Language Processing (NLP) field, novel networks have appeared that can be successfully applied to time series forecasting, e.g., Temporal Dilated Causal Convolutional Neural Networks (TDCCNN). Causality means that there is no information "leakage" from the future to the past, and the output sequence has the same length as the input sequence. If an output length different from the input length is desired with this architecture, a seq2seq approach must be implemented. Dilated convolutions increase the receptive field by uniformly sampling examples from the receptive field, helping to deal with data sparsity, long-term sequence relationships, and multi-resolution [18]. Using an encoder and decoder to enhance performance is now a key trend in the design of CNN architectures [19].
In the last few years, the use of dilated temporal convolutional networks as forecast models has been increasing in the field of time series forecasting for numerous applications. However, the energy sector applications of such technology are still barely studied. R. Wan et al. presented in 2019 a novel method based on the Temporal Convolution Network (TCN) for time series forecasting, improving on the performance of LSTM and ConvLSTM on different datasets. S. Rizvi et al. presented in 2019 a time series forecasting method for petrol retail using Dilated Causal Convolutional Networks (DCCCN), outperforming the results of LSTM, GRU, and ARIMA [18]. O. Yazdanbakhsh et al. propose in [20] a dilated convolutional neural network for time series classification, showing that the architecture can be as effective as models working with hand-crafted features such as spectrograms and statistical features. Yan et al. presented in 2020 a TCN model for weather prediction tasks [21]. In that research, a comparative experiment was carried out between LSTM and TCN, where TCN outperforms LSTM on various time-series datasets.
Nevertheless, recent studies reveal promising results using WaveNets in the field of time series forecasting, where such novel ML algorithms are improving on the results obtained by the aforementioned Deep Learning methods. WaveNet is a complex deep convolutional architecture based on dilated temporal convolutions and skip-connections. A. Borovykh et al. presented in [22] an adaptation of WaveNet for multiple time series forecasting problems. WaveNet consistently outperforms the autoregressive model and LSTM in error metrics and training time. D. Impedovo presents TrafficWave, a generative Deep Learning architecture inspired by Google DeepMind's WaveNet network, for traffic flow prediction [23]. The proposed approach shows better results than other state-of-the-art techniques (LSTM, GRU, AutoEncoders). S. Pramono et al. presented a novel method for short-term load forecasting that combines a dilated causal residual convolutional neural network and an LSTM to forecast four hours ahead [24]. The long sequences are fed to the convolutional model while the short data sequences are fed to the LSTM layer.
Given the previous context and based on the promising results achieved in other areas, the authors propose to apply these novel techniques to the energy load forecasting field by developing a cutting-edge Deep-Learning architecture, for a forecast horizon of 24 h ahead, with the main objective of outperforming other state-of-the-art methods. Specifically, the proposed architecture consists of an encoder-decoder approach using WaveNets, which are based on dilated temporal convolutions. The advantages of this kind of algorithm are listed below:

•	There is no "leakage" from the future to the past.
•	Employing dilated convolutions enables an exponentially large receptive field to process long input sequences.
•	As a result of skip-connections, the proposed approach avoids degradation problems (exploding and vanishing gradients) when the depth of the network increases.
•	Parallelisation on GPU is also possible thanks to the use of convolutions instead of recurrent units.
•	Potential performance improvement.
The rest of this document is organised as follows: Section 2 details the elements that comprise the proposed method, the data preparation, and the steps performed during the training stage. Section 3 presents the results of the proposed approach and the comparison methods. Section 4 concludes this research work with future research directions.

Proposed Approach: Encoder-Decoder Approach Using WaveNets
WaveNet ( Figure 1) is a deep generative model originally proposed by DeepMind which was designed for generating raw audio waveforms [25]. The WaveNet model can be extrapolated beyond the audio in order to be applied to any other type of time series forecasting problem, providing an excellent structure for capturing the long-term dependencies without an excessive number of parameters.
WaveNet contains stacks of temporal dilated causal convolutions (TDCCN) that allow it to access a broad range of past values when forecasting. By using causal convolutions (TCNs), the model cannot violate the order in which the data are fed to the algorithm. One of the problems of the basic design of TCNs is that they require many layers or large filters to process long sequences. To address this problem, the authors of WaveNet suggest the use of dilated convolutions in combination with causal convolutions [25]. A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step.
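To make the mechanism concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) applies a 1-D causal convolution with a configurable dilation rate; the left-only zero padding guarantees that each output value depends only on present and past inputs:

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """1-D causal convolution of sequence x (shape (T,)) with filter w (shape (k,)).

    The output at time t depends only on inputs at times <= t because the
    input is padded with zeros on the left only, so no information "leaks"
    from the future to the past.
    """
    k = len(w)
    pad = (k - 1) * dilation                     # causal (left-only) padding
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

# A filter that only reads the value 'dilation' steps in the past
y = dilated_causal_conv1d(np.arange(6.0), np.array([0.0, 1.0]), dilation=2)
# y is the input shifted causally by two steps: [0, 0, 0, 1, 2, 3]
```

Stacking such layers with increasing dilation rates is what lets the receptive field grow exponentially with depth.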
This network architecture effectively allows the network to operate on a coarser scale than with traditional convolutions [25]. One of the advantages of dilated convolutions is that they reduce the number of parameters of the deep neural network. Furthermore, they enable a faster training process and, consequently, lower hardware and energy costs. ReLU is used as an activation function, and multiple convolutional filters are applied in parallel. The dilated convolution splits the input into two different branches, which are recombined later via element-wise multiplication. This process constitutes a gated activation unit, where the tanh activation branch acts as the learning filter and the sigmoid activation branch works as a learning gate that controls the information flow through the network. It is a process very similar to the one used in the internal gates of RNNs. Some of the biggest limitations of deep neural networks are degradation problems (exploding/vanishing gradients) during the training process. These occur as a consequence of either too small or too large derivatives in the first layers of the network, causing the gradient to decrease or increase exponentially, which leads to a rapidly declining performance of the network. In order to address the degradation problem, WaveNets make use of residual connections between layers. This type of deep neural network is based on short-cut connections, which turn the network into its residual counterpart. Two cases can be considered in Residual Networks (ResNets) [26]: if new layers are not useful, adding them does not impact the model's performance, because the short-cuts simply skip over them; and if the new layers are useful, their weights become non-zero and the model's performance can increase.
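As an illustration only (the filter weights below are placeholders, not learned parameters), the gated activation and residual connection described above can be sketched as:

```python
import numpy as np

def causal_conv(x, w, dilation):
    # 1-D dilated causal convolution: left zero-padding, no future leakage
    pad = (len(w) - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[pad + t - j * dilation] for j in range(len(w)))
                     for t in range(len(x))])

def gated_residual_block(x, w_filter, w_gate, dilation):
    """WaveNet-style block: z = tanh(filter branch) * sigmoid(gate branch),
    added back to the input through a residual (short-cut) connection."""
    f = np.tanh(causal_conv(x, w_filter, dilation))              # learning filter
    g = 1.0 / (1.0 + np.exp(-causal_conv(x, w_gate, dilation)))  # learning gate
    return x + f * g                                             # residual connection
```

Note that with zero filter weights the block reduces to the identity (tanh(0) · σ(0) = 0), which is precisely the property that lets extra layers be added without degrading the model's performance.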
Some modifications were performed in order to adapt the architecture, so that the solution is able to generate predictions for the desired forecast horizon and to reduce noise at the output. Originally, WaveNet was trained using next-step prediction, so errors accumulated as the model produced a long sequence in the absence of conditioning information [27]. The encoder performs dilated temporal convolutions with a filter bank to generate a set of features of the input sequences. These learned features are compressed into a fixed-length vector, also known as the context vector. The decoder network generates an output forecast sequence based on the features learned from the encoder's input sequence, allowing us to generate a direct output forecast for the desired forecast horizon. This modification permits the decoder to handle the accumulated noise when producing a long sequence as output. A generic scheme of the proposed architecture is shown in Figure 2:
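Shape-wise, the encoder-decoder flow can be sketched as follows (all dimensions and the random stand-in weights are illustrative assumptions, not the trained architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder_decoder_forecast(history, horizon=24, context_dim=32):
    """Shape-level sketch of the encoder-decoder idea: the encoder maps the
    input sequence to a fixed-length context vector, and the decoder emits
    the whole 24-h-ahead forecast in one shot (no step-by-step feedback).
    Random weights stand in for the learned dilated-convolution features."""
    T = len(history)
    W_enc = rng.normal(size=(context_dim, T)) / np.sqrt(T)
    context = np.tanh(W_enc @ history)            # fixed-length context vector
    W_dec = rng.normal(size=(horizon, context_dim)) / np.sqrt(context_dim)
    return W_dec @ context                        # direct multi-step output

forecast = encoder_decoder_forecast(np.ones(168))  # one week of hourly history in
```

Because the whole horizon is produced at once, forecast errors are not fed back as inputs, which is how the decoder avoids the accumulated noise of next-step prediction.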

Data Analysis
As mentioned above, the main objective of the proposed approach is to forecast the power consumption one day ahead by using historical past values. That is, both the input and the output are the same parameter: the power consumption. To this end, macroscopic power consumption data for France have been downloaded from the official RTE website [28]. The data used for this problem range from 2016 to 2020. The data analysis and visualisation were performed using the Pandas, NumPy, and Matplotlib Python libraries. A first visualisation of the available data is illustrated in Figure 3, which shows a seasonal pattern in power consumption.
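A loading-and-inspection step of the kind performed here can be sketched with Pandas (the series below is synthetic; the real data are downloaded from the RTE site, and the column name is an assumption):

```python
import numpy as np
import pandas as pd

# Synthetic hourly stand-in for the French national load series
idx = pd.date_range("2016-01-01", periods=4 * 8760, freq="h")
daily = 5000 * np.sin(2 * np.pi * idx.hour / 24)                   # intraday cycle
seasonal = 10000 * np.cos(2 * np.pi * (idx.dayofyear - 1) / 365)   # yearly cycle
load = pd.DataFrame({"consumption_mw": 50000 + daily + seasonal}, index=idx)

# Sanity checks analogous to the ones reported for the RTE data
assert load["consumption_mw"].isna().sum() == 0                    # no NaN values
month_mean = load["consumption_mw"].groupby(load.index.month).mean()
```

Plotting `load` (e.g., with Matplotlib) reveals the seasonal pattern; here `month_mean` already shows winter consumption above summer consumption, as in Figure 3.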

Data Preparation
The process of data preparation is a key aspect when training Deep-Learning models, since the quality of the data affects many things: it reduces modelling errors, speeds up the training process, and simplifies the whole system [29]. The data preparation for the proposed case involved the different steps detailed below.
First, the data were pre-processed in order to delete outliers, clean unwanted characters, and filter null values. This analysis found neither outliers nor NaN values in the RTE data, which indicates that RTE processes the data before publishing it. Regarding the size of the data, there are 26,280 rows (8,760 elements per year), but this data needs to be transformed into a sequence format.
It is also important to stress that the input and output data sequences have different dimensions. While the input covers a horizon of one week into the past, the prediction horizon for the output is 24 h ahead. Thus, the dimensions of each depend on the following parameters: [number of time steps, sequence length, number of features].
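The transformation into this [number of time steps, sequence length, number of features] format can be sketched as follows (window lengths follow the text; the helper name is ours):

```python
import numpy as np

def make_windows(series, input_len=168, output_len=24):
    """Slice a 1-D series into (input, target) sequence pairs: one week
    (168 h) of hourly history in, the next 24 h out. The returned arrays
    have shape [number of time steps, sequence length, number of features]."""
    X, Y = [], []
    for start in range(len(series) - input_len - output_len + 1):
        X.append(series[start:start + input_len])
        Y.append(series[start + input_len:start + input_len + output_len])
    X = np.array(X)[..., np.newaxis]   # add a singleton feature axis
    Y = np.array(Y)[..., np.newaxis]
    return X, Y

X, Y = make_windows(np.arange(200.0))  # X: (9, 168, 1), Y: (9, 24, 1)
```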
In addition, it is well known that machine learning algorithms usually benefit from scaling all input variables to the same range, e.g., normalising all variables into the range [0, 1]. As a result, the network has a more stable and faster learning process. Besides, it helps to avoid floating-point precision issues caused by the different scales of the input data. In a time series forecasting problem, where a numerical sequence must be predicted, it can also help to apply a transformation to the target variable. The method used to normalise the data is known as MinMax normalisation:
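In min-max normalisation each value is rescaled as x' = (x − x_min) / (x_max − x_min), which maps the data into [0, 1]. A minimal sketch, including the inverse transform needed to express forecasts back in MW:

```python
import numpy as np

def minmax_fit(x):
    """Return the parameters (minimum, maximum) of the [0, 1] min-max scaling."""
    return x.min(), x.max()

def minmax_transform(x, lo, hi):
    # x' = (x - min) / (max - min)
    return (x - lo) / (hi - lo)

def minmax_inverse(x_scaled, lo, hi):
    # undo the scaling to express forecasts back in the original units (MW)
    return x_scaled * (hi - lo) + lo

x = np.array([10.0, 20.0, 30.0])
lo, hi = minmax_fit(x)
x_scaled = minmax_transform(x, lo, hi)   # [0.0, 0.5, 1.0]
```

In practice the minimum and maximum are typically taken from the training set only, so that no information from the validation or test periods leaks into the scaling.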

Different statistical parameters were calculated to obtain more information about the given data. The data show a skewness of 0.53, which indicates that the bulk of the distribution lies to the left, with a longer right tail. Some authors suggest in [30] that better predictions are achieved if the target data are linearised through a logarithmic transform. While normalising the target data does not affect the model's performance, it helps to reduce the skewness. In the proposed approach, a log-transformation has been applied to the target data. The skewness of the dataset after the log-transformation is 0.08, which indicates that the data are now much closer to a normal distribution.
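The effect of the log-transform on skewness can be reproduced on synthetic right-skewed data (the sample below is illustrative, not the RTE series):

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

# Right-skewed synthetic sample standing in for the consumption data
rng = np.random.default_rng(42)
sample = np.exp(rng.normal(size=10_000))   # lognormal: strongly right-skewed
before = skewness(sample)
after = skewness(np.log(sample))           # the log-transform linearises it
```

As in the text, the transformed sample's skewness is close to zero, i.e., much closer to a normal distribution.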
In order to train the proposed model, a training, a validation, and a testing set have been extracted from the original data. The training set is the sample of data used to fit (adjust the weights of) the model. The validation set is the sample of data used to provide an unbiased evaluation of a model fitted on the training set while tuning the model hyperparameters. Finally, the testing set is the sample of data used to provide an unbiased evaluation of the final model configuration.
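Because the data are temporal, the three sets are taken as consecutive chronological slices rather than random samples. A minimal sketch (the split fractions are illustrative, not the paper's exact division dates):

```python
def sequential_split(data, train_frac=0.7, val_frac=0.15):
    """Chronological train/validation/test split: no shuffling, so the
    validation and test periods always lie after the data the model was
    fitted on, avoiding look-ahead leakage."""
    n = len(data)
    i = int(n * train_frac)
    j = i + int(n * val_frac)
    return data[:i], data[i:j], data[j:]

train, val, test = sequential_split(list(range(100)))  # 70 / 15 / 15 split
```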
The problem being modelled is a time series problem. Accordingly, the data must be divided sequentially. Under this criterion, a single test set only provides error metrics corresponding to a few weeks of a year; it does not provide metrics robust enough to draw conclusions about the model's performance. Thus, different training sessions have been performed, each with different training, validation, and test sets corresponding to different seasons of each year, in order to obtain representative and reliable metrics. The start and end dates for each training session are presented below. Figures 4-6 show the power consumption in France for data divisions 1, 2, and 3, respectively.


Deep Learning Architecture
Design of the Architecture
A grid search process was performed over different training sessions to find the optimal hyperparameter configuration (Table 1). To such an end, several network architecture configurations were trained using the same data division for training, validation, and testing. These experiments were performed using an i7-8770 processor with 16 GB RAM and an Nvidia 1080 RTX GPU, with Nvidia CUDA drivers and TensorFlow GPU (2.2.3) as the Deep-Learning library. As reported in [31], using a GPU to train models can be up to 4 times faster than using a CPU. For this reason, the different training stages for the proposed approach and the comparison techniques have been carried out on GPU. TensorFlow is an open-source software library developed by the Google Brain team for numerical computation [32]. Different models were trained repeatedly, changing the learning parameters (such as the learning rate) to find the optimal ones. Table 1 shows the different tests that have been performed to find the best hyperparameter configuration. It is important to remark that the dilation rates increase exponentially as the depth n of the network increases,
following d_i = 2^i, i = 0, ..., n − 1, where i represents the i-th layer and n represents the total number of dilated causal convolution layers. Both the encoder and the decoder share the same hyperparameter configuration. Nevertheless, while the encoder transforms the past time series values into a feature vector, the decoder uses the feature vector as an input to generate an output sequence for 24 h ahead. Once all the results were obtained, the objective was to find the model with the lowest bias error (error on the training set) and a low validation error. Accordingly, the selected model has the configuration listed below. In Section 3, the performance of this configuration on the testing set is presented and analysed.
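The exponential growth of the dilation rates, and the receptive field it produces, can be checked numerically (kernel size 2 is an assumption for illustration, not a value stated in this excerpt):

```python
def dilation_rates(n):
    """Dilation rate per layer, doubling with depth as in WaveNet: 1, 2, 4, ..., 2^(n-1)."""
    return [2 ** i for i in range(n)]

def receptive_field(kernel_size, dilations):
    """Number of past samples visible to the last layer of a stack of
    dilated causal convolutions: 1 + (k - 1) * sum(dilations)."""
    return 1 + (kernel_size - 1) * sum(dilations)

rates = dilation_rates(8)          # [1, 2, 4, 8, 16, 32, 64, 128]
rf = receptive_field(2, rates)     # 256 samples
```

With kernel size 2, eight layers already cover 256 past samples, i.e., more than the 168 hourly values of a one-week input window.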

Training
The training process consists of learning the best values of the weights and biases on the training set, so that the model is able to map its input sequence to the correct output sequence. What distinguishes a Deep-Learning time series approach from a traditional one is the handling of the generalisation error. This error is defined as the error obtained when new input sequences are fed into the trained model. Typically, the generalisation error is estimated by the error metric on the validation set, which is independent of the training set. Two factors determine how well the deep learning model performs: its ability to reduce the training error to the lowest level, and its ability to keep the gap between the training error and the validation error as small as possible.
The rest of the training parameters were selected as follows:
•	β1 parameter: 0.9. The exponential decay rate for the first moment estimates (momentum).
•	β2 parameter: 0.99. The exponential decay rate for the second moment estimates (RMSprop).
•	Loss function: Mean Squared Error (MSE).
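The β1/β2 parameters above are the decay rates of the two moment estimates of the Adam optimiser (the optimiser name is inferred from these parameters, as this excerpt does not state it explicitly). A minimal sketch of one such update, driving an MSE-style loss on a scalar toy problem:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.99, eps=1e-8):
    """One Adam update: b1 decays the momentum (first-moment) estimate and
    b2 the RMSprop-style second-moment estimate, as listed in the text."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)              # bias correction of the moments
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimise the quadratic loss (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * (w - 3.0), m, v, t, lr=0.05)
```

After the loop, `w` has settled close to the minimiser 3.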
The learning curves for the three data divisions are presented below. When analysing the learning curves (Figure 7), it can be appreciated that the error of the proposed solution is slightly higher for validation sets 1 and 3. These datasets correspond to the winter months, where the error is usually somewhat higher, as can be observed in the results presented below. Regarding the learning curve corresponding to data division 2, which covers the summer months, the validation error is even smaller than the training error, since the training set encompasses a lot of data combining several months.

Results
Different error metrics were computed in order to evaluate the model's performance for a forecast horizon of 24 h ahead. These metrics have been calculated by comparing the actual power consumed in France in MW with the power predicted by each of the proposed models. Training, validation, and testing sets have been extracted from the three data divisions. Specifically, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Bias Error (MBE), and Mean Bias Percentage Error (MBPE) were computed, where y is the real measurement of power consumption and ŷ is the estimated measurement of power consumption. Table 2 reports the results obtained on the three data divisions for the training, validation, and testing datasets. The results reveal similar error metrics for the first two tests, corresponding to data divisions 1 and 2. The difference between the testing set and the training set is most significant for data division 3, where the testing set corresponds to the month of December. This increase in error may have several causes: the data analysis showed that the standard deviation of consumption is higher in the winter months (Figure 3), which may be due to the multiple public holidays, cold days where heating consumption spikes, etc. Additionally, the error metrics of state-of-the-art deep learning architectures and other advanced statistical time-series methods: Autoregressive Integrated Moving Average (ARIMA) [32][33][34], Multilayer Perceptron Network (MLP) [35,36], Convolutional Neural Network (CNN) [37,38], Long Short-Term Memory (LSTM) [39][40][41][42][43][44][45][46], Stacked LSTM [47], Gated Recurrent Unit (GRU) [48], and Stacked GRU [47,49], are also presented in Table 3 in order to make a fair comparison between approaches. The criterion used to select each of the models is the same as the one previously described for the proposed approach. For each deep-learning architecture, different hyperparameter configurations have been tested.
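Using the conventional definitions (the paper's exact equations are not reproduced in this excerpt), these indicators can be computed as:

```python
import numpy as np

def forecast_metrics(y, y_hat):
    """Standard error metrics for comparing actual (y) and predicted (y_hat)
    power consumption. The sign convention for the bias metrics (y - y_hat)
    is an assumption."""
    err = y - y_hat
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100 * np.mean(np.abs(err / y)),   # y must be non-zero (true for load in MW)
        "MBE": np.mean(err),                      # signed bias
        "MBPE": 100 * np.mean(err / y),
    }

metrics = forecast_metrics(np.array([100.0, 200.0]), np.array([110.0, 190.0]))
```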
The model providing the lowest error on each training session's validation set was selected to compute the errors on the testing set. The results confirm that the proposed model outperforms the other deep learning techniques, achieving the lowest error on all three data divisions for a forecast horizon of 24 h ahead. It is worth stressing that the CNN-based methods yield better results than the recurrent ones, indicating that the causal convolution structure effectively extracts the features of ever-changing power demand patterns. Contrary to the outcomes reported by other research in this area, no recurrent-based model was found to come close to the convolutional approaches; only a more complex architecture stacking several Gated-Recurrent-Unit (GRU) layers yields results comparable to the rest of the methods.
The following figures depict the power prediction at instant k for the next 24 h. For better visualisation, only one forecast per day is presented. Additionally, Table 4 gives a statistical description of the absolute error on each test set.

Discussion
This research has applied a novel Deep-Learning structure, commonly used for audio synthesis, to time series forecasting. Despite this paper's focus on French load forecasting, the proposed approach could be replicated in a wide range of fields involving time series forecasting.
It was demonstrated that the proposed encoder-decoder WaveNet, based on dilated temporal causal convolutions, yields better results than the comparison techniques. The proposed model shows the best performance across the different test sets compared to other state-of-art Deep-Learning models in terms of MSE, RMSE, MAE, MAPE, MBE, and MBPE. Specifically, Multi-Layer Perceptron Network (MLP), causal 1-D Convolution (1D-CNN), Autoregressive-Integrated Moving Average (ARIMA), Long-Short-Term-Memory (LSTM), Gated-Recurrent-Unit (GRU), and some combinations of them were implemented to make a fair comparison between approaches.
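The dilated temporal causal convolutions underlying the proposed architecture can be illustrated with a minimal numpy sketch (not the authors' implementation): each layer's output at time t depends only on the input at t and at dilated past positions, and stacking layers with dilation rates 1, 2, 4, 8 and kernel size 2 grows the receptive field to 1 + (2 − 1)(1 + 2 + 4 + 8) = 16 time steps.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """1-D causal convolution: the output at t depends only on
    x[t], x[t-d], x[t-2d], ... Left-pads with zeros so the output
    has the same length as the input."""
    k = len(w)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(w[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# Stacking four layers (toy kernel summing current and dilated-past sample):
x = np.arange(16, dtype=float)
w = np.array([1.0, 1.0])
y = x
for d in (1, 2, 4, 8):          # receptive field after the loop: 16 steps
    y = causal_dilated_conv1d(y, w, d)
```

Because the convolution is causal, perturbing a future sample never changes past outputs, which is what allows the model to forecast a whole sequence in one shot without leaking future information.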
The errors were uniformly distributed across the different test sets, which correspond to different seasons in 2018-2019 (Figure 8). Although the error metrics are uniformly distributed, they are slightly higher on test set 3, corresponding to December 2019. This variation also occurs for the other comparison techniques, probably due to the holidays and non-school period of that month. Although the improvements of the proposed model over the other approaches may seem slight, they are more than enough to improve energy usage efficiency and allow significant savings in money and resources since, as D. Bunn et al. claim in [1], a 1% improvement of an electricity forecasting model can result in up to £10M of savings. The performance achieved in this research represents promising results for researchers interested in time series forecasting, and the study proved the good learning capabilities of dilated temporal convolution networks. One significant advantage of the proposed method with respect to traditional recurrent methods is the possibility of parallelising the training process on GPUs, keeping training times low even as the amount of data grows considerably.
Further research will focus on incorporating multiple sources of data to improve the proposed network's performance, e.g., weather data and calendar information. A finer adjustment of the hyperparameter configuration, with more training sessions, could also be helpful. Another possible direction is ensemble forecasting, where the final prediction is a linear or non-linear combination of the predictions from different models. A pre-processing stage could further be added to decompose the input sequence, e.g., using Wavelet Transforms (WT) or Empirical Mode Decomposition (EMD), where each decomposed sequence would be treated as a new feature of the model. Another possible neural network modification consists of replacing the context vector with an attention mechanism to focus on the most important parts of the input sequence with respect to the output sequence.
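The decomposition idea above can be sketched with a one-level Haar wavelet transform, the simplest member of the WT family (a toy illustration under that assumption, not the full WT/EMD pipelines): the series is split into a low-frequency approximation and a high-frequency detail, each of which could be fed to the model as an additional input feature.

```python
import numpy as np

def haar_decompose(x):
    """One-level Haar wavelet decomposition of an even-length series:
    returns (approximation, detail) half-length coefficient arrays."""
    pairs = np.asarray(x, dtype=float).reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)  # low-frequency trend
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high-frequency residual
    return approx, detail

def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose: interleave the reconstructed samples."""
    even = (approx + detail) / np.sqrt(2)
    odd = (approx - detail) / np.sqrt(2)
    return np.stack([even, odd], axis=1).ravel()

# Decompose a toy load series and verify the transform is lossless
load = np.array([48.0, 50.0, 53.0, 51.0, 47.0, 46.0, 49.0, 52.0])
a, d = haar_decompose(load)
```

Because the transform is invertible, no information is lost; the model simply receives the same signal separated by frequency band, which can make slow seasonal trends and fast daily spikes easier to learn separately.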
Readers interested in the code may request it by writing to the email addresses given on the first page.