Deep Learning-Based Short-Term Load Forecasting for Supporting Demand Response Program in Hybrid Energy System

Abstract: A novel method for short-term load forecasting (STLF) is proposed in this paper. The method utilizes both long and short data sequences, which are fed to a wavenet-based model that employs a dilated causal residual convolutional neural network (CNN) and a long short-term memory (LSTM) layer, respectively, to forecast future hourly load demand. This model is intended to support the demand response program in hybrid energy systems, especially systems using renewable and fossil sources. In order to prove the generality of our model, two different datasets are used: the ENTSO-E (European Network of Transmission System Operators for Electricity) dataset and the ISO-NE (Independent System Operator New England) dataset. Moreover, two different ways of testing the model are conducted. The first is testing with a dataset having an identical distribution to the validation data, while the second is testing with data having an unknown distribution. The results show that our proposed model outperforms other deep learning-based models in terms of root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). In detail, our model achieves an RMSE, MAE, and MAPE of 203.23, 142.23, and 2.02 for ENTSO-E testing dataset 1 and 292.07, 196.95, and 3.1 for ENTSO-E dataset 2. Meanwhile, on the ISO-NE dataset, the RMSE, MAE, and MAPE are 85.12, 58.96, and 0.4 for ISO-NE testing dataset 1 and 85.31, 62.23, and 0.46 for ISO-NE dataset 2.

Author Contributions: investigation, S.H.P., M.R. and E.M.; resources, S.H.P.; data curation, F.H.; writing—original draft preparation, M.R. and E.M.; writing—review and editing, S.H.P. and R.N.H.; visualization, M.R., R.N.H. and F.H.; supervision, S.H.P.; project administration, F.H.; funding acquisition, S.H.P.


Introduction
Nowadays, the hybrid energy system has become more popular in the electricity industry. The main reasons for this trend are the exponential reduction in energy storage cost and the development of digital connectivity, which enable real-time monitoring and the establishment of smarter grids. Moreover, the hybrid energy system is considered one of the best solutions for tackling the intermittency experienced by most renewable energy schemes, including solar- and wind-based energy. For example, in solar photovoltaics, energy is only delivered when sufficient solar irradiance is available. As a consequence, much research has been conducted in order to provide the best scheme for this hybrid system [1].
Among the proposed hybrid energy system schemes, the one most likely to be implemented in the majority of countries is the integration of renewable and fossil energy, because fossil energy infrastructure is already well established. In terms of sustainability, this scheme works well, because fossil energy is able to supply adequate power to the grid when the alternative sources cannot meet the demand. In order to prove the generality of our proposed model's performance, two different model-testing scenarios are conducted. The first scenario uses a testing dataset whose distribution is identical to that of the validation dataset, while the second uses a dataset with an unknown distribution. As a comparison, the results of our proposed model are compared with the performance of the models from [15,16] and the standard wavenet [21]. The simulation results show that our proposed model outperforms the other deep learning models in terms of root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
Therefore, given its accuracy, this model can be used to support utilities in applying a demand response program, since it helps utilities obtain accurate predictions of future load demand, which in turn provides users with precise information on whether the future load demand can be supplied by the renewable source or not.
The rest of the paper is organized as follows. Section 2 describes the datasets used in detail, including preliminary data analysis and data preprocessing. Section 3 explains the model architecture and its parameters, as well as the training and testing stages. Section 4 presents the results obtained using the proposed model compared with other deep learning-based models, along with an explanation of why the proposed model achieves these results. Lastly, Section 5 summarizes the findings discussed in this paper and possible future work.

Dataset
In order to prove the generality of our proposed method, two datasets are used as the model's input: the ENTSO-E (European Network of Transmission System Operators for Electricity) dataset [23] and the ISO-NE (Independent System Operator New England) dataset [24]. The ENTSO-E dataset contains load demand from every country in Europe; in this research, we only take into account data gathered from Switzerland. Meanwhile, the ISO-NE dataset contains hourly load demand in New England.
These datasets have different characteristics, especially in terms of load demand range and complexity. In the ENTSO-E dataset, the lowest and highest load values are 1483 kWh and 18,544 kWh, respectively, while in the ISO-NE dataset they are 9018 kWh and 26,416 kWh, respectively. Another differing characteristic is the fluctuation trend within a single day's load demand: in the ENTSO-E dataset, the load demand oscillates more than in the ISO-NE dataset. As evidence, Figures 1 and 2 show the average daily power consumption and an example of the daily load trend for each dataset.
In building a solution for load forecasting, a deep understanding of the load demand trend is necessary. Figure 3 shows the data in both datasets over a one-year period. From Figure 3, broadly speaking, the load demand trend has a periodicity that repeats over the following weeks. However, this trend is highly affected by many external factors that cause fluctuations over certain periods, for example, economic conditions, weather, and the time index. Unfortunately, obtaining most of these external factors is difficult; the easiest external data that can be gathered is time index data containing information about the date and clock. Therefore, in this research we feed the model not only the load demand trend but also time index data represented by one-hot vectors. A one-hot vector is a sparse vector that maps a feature with M categories into a vector with M elements, where only the corresponding element is one while the remaining elements are zero.
Before being fed to the model, the datasets must be pre-processed, which consists of checking for null values, splitting the data into training, validation, and testing datasets, and finally data normalization. The data used in this research are limited to the period from 1 January 2015 until 30 May 2017 for the ENTSO-E dataset and from 1 January 2004 until 30 May 2006 for the ISO-NE dataset.
The first two years of data are used for the training stage, while the remaining 3600 data points are split into 3000 and 600 points. The first 3000 points are randomly assigned to the validation and testing data in proportions of roughly 0.65 and 0.35, respectively. In other words, the validation data and the first testing data total 1900 and 1100 points, respectively, while the other 600 points are used as the second testing dataset. We conduct two testing stages, because the first testing data have a nearly identical distribution to the validation data, which clearly helps the model provide good accuracy in the testing stage. Meanwhile, the second testing data are clearly new data whose distribution is never experienced by the model in either the training or validation stage. This kind of testing process is appropriate for proving the generality of the model.
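The one-hot encoding of the time-index features can be sketched as follows (a minimal illustration; the exact feature set of hour, weekday, and month is our assumption based on the clock, day, and month information mentioned for the model input):

```python
# Sketch (assumed feature set): one-hot vectors for time-index features.
def one_hot(index, num_categories):
    """Map a category index into a sparse vector of length num_categories."""
    vec = [0] * num_categories
    vec[index] = 1
    return vec

def time_index_features(hour, weekday, month):
    # hour: 0-23, weekday: 0-6, month: 1-12
    return one_hot(hour, 24) + one_hot(weekday, 7) + one_hot(month - 1, 12)

# Example: 13:00 on a Wednesday in May -> 24 + 7 + 12 = 43 sparse elements
features = time_index_features(hour=13, weekday=2, month=5)
```

Each categorical feature contributes exactly one nonzero element, so the concatenated vector has exactly three ones.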
In the data normalization process, the min-max scaling method expressed in Equation (1) is implemented:

z_i = (x_i − min(x)) / (max(x) − min(x))    (1)

where x = x_1, . . . , x_n and z_i is the i-th normalized data point. The parameters in the normalization process must come from the training dataset only, because we assume that the future data (validation and test sets) have a different distribution than the training dataset.
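A minimal sketch of this normalization, with the scaling parameters fitted on the training split only (the numeric values are illustrative, loosely based on the ENTSO-E load range):

```python
# Min-max normalization: fit min/max on the training data only,
# then apply the same parameters to validation and test data.
def fit_min_max(train):
    return min(train), max(train)

def normalize(data, lo, hi):
    return [(x - lo) / (hi - lo) for x in data]

train = [1483.0, 9000.0, 18544.0]     # illustrative training loads
lo, hi = fit_min_max(train)
normalized = normalize([1483.0, 18544.0, 10013.5], lo, hi)
```

Values outside the training range would fall outside [0, 1], which is expected when the future distribution differs from the training one.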

Model Design
In this research, the core of the proposed model is a wavenet architecture implementing both a dilated causal CNN with a residual network and an LSTM layer, both of which have proven to work very well for time-series prediction.
Our proposed model consists of two stages whose function is to learn long and short data sequences. Inspired by the success of the wavenet architecture and LSTM in handling time-series data, the long sequence, taken from the 32 time steps before the target, is learned using the wavenet, while the short sequence, taken from the 4 time steps before the target, is learned using the LSTM.

Wavenet
Wavenet is a deep generative model utilizing dilated causal convolutions, originally proposed for audio waveforms. Causal convolution means that the output at the current time step is only affected by previous time steps. Meanwhile, dilated convolution is a modified convolution in which the filter weight alignment is expanded by a dilation factor, eventually resulting in a broader receptive field, and can be expressed as follows:

(x *_l f)(t) = Σ_{τ=0}^{k−1} f(τ) · x(t − l·τ)

while the standard convolution is expressed as follows:

(x * f)(t) = Σ_{τ=0}^{k−1} f(τ) · x(t − τ)

The dilated convolution is denoted by the *_l notation. The difference between dilated convolution and standard convolution is the dilation factor l, which causes the filter to skip one or several points during the convolution process. Figure 4 shows the dilated convolution applied to one-dimensional data. The blue, white, and orange circles are the input data, hidden layer outputs, and output layer outputs, respectively. There are 32 input data points, taken from t = 1 until t = 32, that are convolved with a filter of size two. The dilation rate is increased in every hidden layer, which yields a broader receptive field, and this dilation pattern is repeated twice. In the output layer, only the last value is taken, which we assume represents the feature of the load at t = 33.
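The dilated causal convolution above can be sketched in pure Python (a single-channel illustration; the filter weights are arbitrary, not learned values):

```python
# 1D dilated causal convolution: the output at time t depends only on
# x[t], x[t - l], x[t - 2l], ... (never on future values).
def dilated_causal_conv1d(x, weights, dilation):
    k = len(weights)
    out = []
    for t in range(len(x)):
        acc = 0.0
        for tau in range(k):
            idx = t - dilation * tau
            if idx >= 0:  # causal: ignore positions before the series starts
                acc += weights[tau] * x[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# Filter of size two with dilation 2: output[t] = 0.5*x[t] + 0.5*x[t-2]
y = dilated_causal_conv1d(x, weights=[0.5, 0.5], dilation=2)
```

With dilation 1 this reduces to the standard causal convolution; increasing the dilation widens the receptive field without adding parameters.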
In order to optimize the usage of the dilated convolutional neural network, the residual technique [22] is applied to the model. The implementation of the residual network takes into account the outputs of lower levels, whose features help in predicting the future power demand, especially for a network implementing a sparse filter, which has the potential to lose some information from the previous layers.
The residual model is a popular way to build deep neural networks that was first proposed for the image recognition task. In this approach, instead of only mapping the input data x to a function H(x) that outputs ŷ, the mapping from the previous residual block f(x, {W_i}), where W_i denotes the learned weights and biases of residual block i, is also considered. Therefore, the output of a residual block can be expressed as:

x_i = f(x_{i−1}, W_i) + x_{i−1}

Moreover, since we use stacked residual blocks, the output of the residual network can be represented as:

x_K = x_0 + Σ_{i=1}^{K} f(x_{i−1}, W_i)

where x_K is the output of residual block K, x_0 is the input of the residual network, and f(x_{i−1}, W_i) are the outputs and associated weights of the previous residual blocks. As a result of the summations between the previous and final residual blocks, the back-propagation of the network to x_0 can be calculated using the following equation:

∂L/∂x_0 = ∂L/∂x_K · (1 + ∂/∂x_0 Σ_{i=1}^{K} f(x_{i−1}, W_i))

where L is the total loss of the network and the constant 1 indicates that the gradient of the network output can be directly back-propagated without considering the layers' parameters (weights and biases). This formulation ensures that the layers do not suffer from vanishing gradients, even when the weights are small. Figure 5 shows the basic residual learning process.
Moreover, skip connections and gated activations are applied to the network to speed up convergence and avoid overfitting. The process of residual and skip connections with gated activation is shown in Figure 6. The gated activations are inspired by the LSTM layer, where tanh and sigmoid (σ) work as a learned filter and a learned gate, respectively. The use of gated activations has been proven to work better than the ReLU activation for time-series data [21]. The output of the dilated convolution with gated activation can be expressed as:

z = tanh(w_f * x) ⊙ σ(w_g * x)

where w_f and w_g are the learned filter and learned gate, respectively.
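A minimal sketch of the gated activation unit, applied element-wise to the outputs of the filter and gate convolution paths (the inputs here stand in for those two convolution outputs):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """Element-wise z = tanh(filter) * sigmoid(gate)."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

# The sigmoid gate scales the tanh "filter" signal between 0 and 1.
z = gated_activation([0.0, 1.0], [0.0, 0.0])
```

With a gate pre-activation of 0, the sigmoid passes exactly half of the tanh signal through, illustrating how the gate modulates information flow.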

LSTM
In the case of forecasting future data, knowledge of the recent trend is essential. As an illustration, in predicting future data, we mostly start by figuring out the long sequence trend. Once we know the pattern of the trend based on the long sequence of previous data, then, in order to provide a better forecast, we also try to relate our understanding of the long sequence data to the recent trend. The same concept is applied in our proposed method. We fine-tune the wavenet-based model with one LSTM layer assigned to help the network relate the output of the dilated CNN to the recent trend. This step can also be considered a correction step for the dilated CNN output, as we insert a fixed input to be concatenated with the dilated CNN output, which is then fed to the LSTM layer that also works as the output layer.
Brief Explanation of LSTM
LSTM is a developed version of the standard recurrent neural network (RNN) where, instead of only having a recurrent unit, LSTM has "LSTM cells" with an internal recurrence consisting of several gating units controlling the flow of information [25]. A comparison between the simple RNN and the LSTM layer using tanh as the activation function is depicted in Figure 7.
From Figure 7, the difference between the simple RNN and the LSTM layer is clear. The LSTM layer is more complex than the simple RNN, because LSTM takes into account not only the input (x) and hidden state (h) at a certain time step, but also the LSTM cell (c), which replaces the hidden state to prevent older signals from vanishing during the process. The three control gates ruling the LSTM cell, the forget gate, input gate, and output gate, are represented by f_t, i_t, and o_t, respectively. These gates use the sigmoid activation function, represented by σ, whose output ranges between 0 and 1. Meanwhile, c̃_t is the input node, which works identically to the simple RNN layer.
Mathematically, the forget gate and input gate can be expressed, respectively, as:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

where [h_{t−1}, x_t] is the concatenation of the hidden state and input values, while W and b are the weight matrices and biases, respectively. On the other hand, the cell state is updated with the formulation:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

where c̃_t is expressed as:

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)

Lastly, the output gate o_t and hidden state h_t are calculated using the following equations:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
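The gate equations above can be sketched as a single scalar LSTM step (the weights here are arbitrary illustrative values, not the trained model's parameters):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step with scalar states; W maps each gate
    ('f', 'i', 'c', 'o') to a (w_h, w_x) pair, b to a scalar bias."""
    concat = lambda wh, wx: wh * h_prev + wx * x_t  # [h_{t-1}, x_t] dot-product
    f_t = sigmoid(concat(*W["f"]) + b["f"])          # forget gate
    i_t = sigmoid(concat(*W["i"]) + b["i"])          # input gate
    c_tilde = math.tanh(concat(*W["c"]) + b["c"])    # input node
    c_t = f_t * c_prev + i_t * c_tilde               # cell state update
    o_t = sigmoid(concat(*W["o"]) + b["o"])          # output gate
    h_t = o_t * math.tanh(c_t)                       # new hidden state
    return h_t, c_t

W = {g: (0.5, 0.5) for g in "fico"}
b = {g: 0.0 for g in "fico"}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
```

In a real layer each scalar becomes a vector and each (w_h, w_x) pair a weight matrix, but the gating structure is identical.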


Detailed Model Setup
The complete representation of our model is depicted in Figure 8. The wavenet-based model is fed 32-step data sequences containing load demand and time index information (clock, day, and month). This wavenet model is used as the initial forecasting algorithm based on the long sequence data. Before being fed to the dilated CNN, the long sequence data are preprocessed using a standard 1D-CNN with a filter size equal to one. Next, the preprocessed data are convolved by a dilated causal CNN with a filter size of 2. All of the convolutional layers have the ReLU activation function, expressed as:

ReLU(x) = max(0, x)

In the case of the residual network, the total number of residual blocks in our model is 10, with the dilation rate set to be a repetition of the sequence dr = [1, 2, 4, 8, 16]. The dilation rate is a hyperparameter that represents how large the gap between the elements of the filter is, as shown in Figure 4 and indicated by the arrows. All of the residual blocks are summed, followed by the ReLU activation function. Post-processing is conducted before the last convolutional layer, which works similarly to a time-distributed fully connected layer in that it is assigned to normalize the residual output.
Because the lengths of the input and output of the dilated causal CNN are identical, a custom layer is built to take only the last output neuron, representing the load at t = 33. After completing the wavenet-based network training, the output of the last convolutional layer is concatenated with the recent data sequence (t = 29 until t = 32), which works as the LSTM input. Here, using the fine-tuning technique, an LSTM with a linear activation function is assigned to self-correct the convolutional output in order to provide a more accurate prediction based on the short sequence. This self-correction method is nearly identical to how humans make predictions based on data sequences. For example, in predicting the environmental temperature, humans relate their understanding of the temperature trend from the previous day to the recent temperature trend in order to make an accurate prediction of the next hour's temperature.
Table 1 shows the summary of our model's parameters, where f, ks, s, and dr are the number of filters, kernel size, stride, and dilation rate, respectively. The loss function and optimizer are the mean absolute error and adaptive moment estimation (ADAM) [26], respectively, and the batch size is 512. The model was trained using an Nvidia GTX 1070, Tensorflow 1.13.1 [27], CUDA 10, and CuDNN 7.6.2 for 500 epochs, and the final model is chosen based on the validation accuracy.
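As a sanity check (our own computation, based on the kernel size of 2 and the dilation sequence above, not a figure from the paper), the receptive field of the stacked dilated causal convolutions can be computed as:

```python
# Receptive field of stacked dilated convolutions:
# rf = 1 + sum((kernel_size - 1) * d for each layer's dilation d).
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

dilations = [1, 2, 4, 8, 16] * 2  # dr sequence repeated over 10 residual blocks
rf = receptive_field(kernel_size=2, dilations=dilations)
```

The result (63) exceeds the 32-step input length, so every point of the long input sequence can influence the final output neuron.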

Model Performance Evaluation
In order to evaluate our model's performance, we mainly compare it with two previous deep learning-based models that work very well for STLF. Those models are inspired by [15] (Model 1), which utilizes stacked LSTM, and [16] (Model 2), which combines stacked CNN and LSTM layers with a feature fusion layer. The configuration of each comparison model is identical to the published papers: Model 1 consists of two stacked LSTM layers of 20 units each, followed by a fully connected output layer; Model 2 is built as a combination of LSTM and CNN layers whose outputs are concatenated and eventually fed to a fully connected layer called a feature fusion layer. All of the models are trained with the same input data. Moreover, the original wavenet model is also used for comparison in order to demonstrate the benefit of the LSTM as a self-correction layer in our proposed model. This section reports the performance of hourly load forecasting using our proposed method compared to the other forecasting methods. In the testing stage, all of the models are evaluated with three different commonly used metrics: root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). MAE is the average of the absolute differences between the predicted and actual load consumption. MAPE is similar to MAE but uses the ratio between the difference and the actual load, while RMSE tends to have a higher value than the other metrics because the errors are squared before averaging. The higher the metric value, the worse the model's performance. The metrics are defined as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ) (15)

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (16)

MAPE = (100%/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i| (17)

where y_i and ŷ_i are the actual and predicted loads, respectively, and n is the number of samples. Tables 2 and 3 show all of the models' performance on dataset 1 and dataset 2, respectively. Overall, our proposed model outperforms the other models, especially on dataset 1, where the data distribution is nearly identical to the validation data, while the worst performance is shown by the [16]-based model.
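The three metrics are straightforward to compute; a minimal sketch in Python follows, where the load values are made up purely for illustration and do not come from either dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average absolute deviation."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error: MAE scaled by the actual load, in percent."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

actual    = [5000.0, 5200.0, 4800.0]  # hypothetical hourly loads (MW)
predicted = [5100.0, 5150.0, 4900.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Because RMSE squares the errors before averaging, a few large misses raise it faster than MAE, which is why RMSE typically reads higher than MAE on the same forecasts.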
In Table 2, our proposed model clearly shows its superiority over the other methods, demonstrating the success of combining the deep dilated causal residual CNN and the LSTM via the fine-tuning technique in understanding long-sequence and short-sequence data, respectively. Moreover, compared to the standard wavenet model, our model achieves better accuracy, indicating the usefulness of the LSTM layer in self-correcting the initial load forecast produced by the dilated causal residual CNN. However, all of the models exhibit degraded performance on dataset 2. This indicates that none of the models can yet handle data whose distribution differs slightly from that of the training, validation, and testing data 1. The failure of all of the models on testing dataset 2 is strongly affected by the quality of the ENTSO-E dataset, in which inconsistencies in hourly power usage or unpredicted trends occur several times. Figures 9 and 10 show the results of STLF on dataset 1 and dataset 2, respectively.
Tables 4 and 5 show all of the models' performance on ISO-NE dataset 1 and dataset 2, respectively. Unlike the performance shown in the previous subsection, all of the models perform very well on both the known and unknown data distributions. However, the model based on [16] still exhibits the worst accuracy, while our model still performs best, although its advantage over the model based on [15] is not absolute. Consistent with the results obtained using the ENTSO-E dataset, the implementation of the LSTM layer for tuning the wavenet-based model is proven to help the network make more accurate load predictions. The success of all models in understanding an unknown data distribution on this dataset is mainly due to the properties of the ISO-NE data, which are simpler than those of the ENTSO-E dataset, in which more fluctuations occur in certain time ranges due to external factors. This simplicity results in high accuracy on both the validation and testing data, enabling all models to handle forthcoming data sequences. Figures 11 and 12 show the forecasting results using input from dataset 1 and dataset 2, respectively.
Figure 11. Forecasting result using ISO-NE dataset 1.


Discussion
Overall, our proposed model shows the best performance compared to the other deep learning-based models, indicating that each network in the proposed model works very well: the wavenet-based network provides an understanding of long-sequence data, and the LSTM layer helps the model self-correct by relating the output of the first network to the recent, short-sequence data.
However, as suggested by [12,28,29], in order to demonstrate the significance of our proposed model over the other models, both the Wilcoxon signed-rank test [30] and the Friedman test [31] are conducted on all of the models' forecasting errors given input from the testing datasets. The Wilcoxon signed-rank test compares the W statistic with the Wilcoxon critical value Wcr, which are expressed in Equations (18) and (19) (the latter for a large number of samples), respectively:

W = min{r+, r−} (18)

Wcr = N(N + 1)/4 (19)
r+ and r− are the sums of the positive and negative ranks, respectively, while N is the number of data points. If the W statistic is less than Wcr and the p-value is less than α, then the null hypothesis is rejected, indicating the superiority of our model. On the other hand, the Friedman test is applied to show the significant differences between our proposed model and all comparison models. The Friedman statistic is expressed by:

χ²_F = (12N / (k(k + 1))) ( Σ_{j=1}^{k} R_j² − k(k + 1)²/4 ) (20)

where N is the number of forecasting results, k is the number of compared models, and R_j is the average rank based on the forecasting error r for the j-th compared model, expressed by:

R_j = (1/N) Σ_{i=1}^{N} r_i^j (21)

If the p-value of the statistic is less than the significance level, then the null hypothesis is rejected, indicating the superiority of our model. Tables 6-9 show the significance tests on ENTSO-E testing datasets 1 and 2 and ISO-NE testing datasets 1 and 2, respectively. In the Wilcoxon signed-rank test, the significance levels are set to α = 0.025 and α = 0.05, while the Friedman test is conducted only under α = 0.05. From the results shown in the tables, our proposed model makes a significant contribution to forecasting accuracy improvement and is superior to the other models, except on the ISO-NE dataset, where the Wilcoxon signed-rank test between our proposed model and the Kong et al. model indicates no significance, although the results in terms of RMSE, MAE, and MAPE show that our model performs better.
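Both tests are available off the shelf in SciPy; the sketch below applies them to synthetic forecast errors. The error magnitudes are invented for illustration and do not correspond to the models in the tables.

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

rng = np.random.default_rng(0)
actual = rng.uniform(4000, 6000, size=200)  # hypothetical hourly loads (MW)

# Absolute forecast errors for three hypothetical models with different noise levels
err_proposed = np.abs(rng.normal(0, 50, 200))
err_model1   = np.abs(rng.normal(0, 120, 200))
err_model2   = np.abs(rng.normal(0, 150, 200))

# Wilcoxon signed-rank test: pairwise comparison of proposed model vs. one baseline
stat, p = wilcoxon(err_proposed, err_model1)
print(f"Wilcoxon W = {stat:.1f}, p = {p:.2e}")

# Friedman test: omnibus comparison across all models at once
chi2, p_f = friedmanchisquare(err_proposed, err_model1, err_model2)
print(f"Friedman chi2 = {chi2:.1f}, p = {p_f:.2e}")
```

A p-value below the chosen α rejects the null hypothesis that the models' errors come from the same distribution; the Friedman test answers this for all models jointly, while Wilcoxon isolates one pairwise comparison.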

Conclusions and Future Works
This paper proposes a novel method for hourly load forecasting, which is very important for demand response in hybrid energy systems, especially systems using both renewable and fossil energy, in order to reduce fossil energy usage. The proposed method is mainly inspired by the wavenet-based model, utilizing a dilated causal residual CNN and LSTM. In this approach, two different data sequences are fed to the model: the long data sequences are fed to the wavenet-based model, while the short data sequences are fed to the LSTM layer, which is assigned to self-correct the model using a fine-tuning technique.
In order to prove the generality of our model, two different datasets, the ENTSO-E and ISO-NE datasets, are used with two different testing scenarios. The first testing scheme uses a dataset whose distribution is nearly identical to that of the validation dataset, while the second uses a dataset with a slightly different distribution. Based on the obtained results, our proposed model exhibits the best performance compared to the other deep learning-based models in terms of RMSE, MAE, and MAPE. In detail, our model achieves RMSE, MAE, and MAPE equal to 203.23, 142.23, and 2.02 for ENTSO-E testing dataset 1 and 292.07, 196.95, and 3.1 for ENTSO-E dataset 2. Meanwhile, on the ISO-NE dataset, the RMSE, MAE, and MAPE equal 85.12, 58.96, and 0.4 for ISO-NE testing dataset 1 and 85.31, 62.23, and 0.46 for ISO-NE dataset 2. However, there are several findings that can be improved in future work. The first concerns the ENTSO-E testing results: no model can provide a highly accurate forecast when fed data from a slightly different distribution, indicating that all of the models face difficulties in understanding fluctuating or unpredictable data like the ENTSO-E dataset. The second is that, although RMSE, MAE, and MAPE show our model to be more accurate than the Kong et al. model on the ISO-NE dataset, the improvement is not statistically significant.
Therefore, for future work, additional external-factor data, such as information on holidays and weather conditions, can be fed as model input in order to improve our findings. In addition, building a new model can also be considered, since this research area and artificial intelligence (AI) algorithms develop very quickly, so new ideas emerge rapidly. All of the code and datasets used in this research are available on Github.

Conflicts of Interest:
The authors declare no conflict of interest.