A Two-Stage Short-Term Load Forecasting Method Using Long Short-Term Memory and Multilayer Perceptron

Abstract: Load forecasting is an essential task in the operation management of a power system.


Introduction
Electricity is an essential guarantee for industrial production and social life. To meet the consumers' satisfaction and generate profit, electric power companies should balance the supply and need by scheduling a series of generators in the most efficient manner. In the long term, the investment in power grids and power plants also needs to keep pace with the increasing power demand. Both in short-term generation dispatch and long-term planning, load forecasting is indispensable for decision-making. Studies have shown that only a 1% decrease in the mean absolute percentage error (MAPE) of load forecasting has a consequential impact of 3∼5% on the supply side by reducing the cost of generation by about 0.1∼0.3% [1]. For instance, a 1% increase in load forecasting error resulted in incremental operating costs of up to 10 million pounds per year in the thermal British power system, reported in 1986 [2]. Load forecasting can be categorized into four types [3] lead to failure in training. A variant of RNN, long short-term memory (LSTM), was initially introduced by Hochreiter et al. [24] to tackle these problems. Applications of the LSTM networks have been reported in different areas, especially in natural language processing [20]. To develop the LSTM's potential, it is generally used in deep neural networks to solve practical problems. Marino D.L. et al. [21] investigated two deep neural networks (DNN) using the LSTM to forecast the next 60 h load demand for an individual building. The results showed that the LSTM-based sequence to sequence (seq2seq) architecture outperforms the standard LSTM network. Both models solely take the historical load as the inputs without considering temperature, holidays, and seasons. The accuracy of predictions over peaks and valleys is poor. In [22], the presented LSTM framework for the STLF of residential households. The results demonstrated that the proposed model is superior to other algorithms and performs better in aggregating load forecast than individual load forecast. Similarly to the former work, none of the temperature, holidays, and seasons are considered. The MAPE of aggregating load forecast reaches up to 8%. Another type of ANN, the convolutional neural networks (CNN), which is good at feature extraction and widely used for image recognition, was adopted to make load forecasts. Kuo P.H. et al. [23] presented a one-dimensional (1D) CNN with three convolution layers for STLF. The results show that the CNN model outperforms other single machine learning methods under the given dataset. The power consumption is highly related to the type of day, season, and temperature. Meanwhile, the power consumption is an inertial and non-linear system. The load itself is strongly correlated to the load of the previous hours. The MLP is good at modeling non-linear tasks. However, it is highly dependent on proper feature selection. Nevertheless, artificial feature selection will inevitably lead to information loss, especially temporal features. The LSTM and CNN models can extract temporal features from the inputs. However, limited to the network topologies in previous literature, none of those models considers all essential features.
To take advantage of different methods, researchers have tried to find proper combinations of them to make more precise predictions. A hybrid STLF model using the autoregressive integrated moving average (ARIMA) and the support vector machine (SVM) was proposed by Nie H. et al. [25]. It first uses ARIMA to obtain preliminary 24-h forecasts and then uses the SVM to correct the deviation of these predictions. The results show that the MAPE decreases from 4.5% to 3.85%. Because the ARIMA model can only handle one-sequence-to-one-sequence tasks, it takes the load itself as the input without considering other features; in addition, the feature selection for the SVM is not reported. In [26], Tian C. et al. presented a DNN model for STLF based on the LSTM and CNN. The CNN and LSTM are used to extract the local trend and the temporal features of the load, respectively. Temperature, holidays, and seasons are not considered. The results show that the CNN-LSTM model gives the lowest average MAPE but fails to show improvement over peaks and valleys. David K. et al. [27] presented a convolutional LSTM for short-term temperature forecasts. The results reveal that the univariate LSTM network performs well in the first few hours but is outperformed by the multivariate LSTM network. Moon J. et al. [28] proposed a hybrid STLF model for different buildings on a university campus combining random forest (RF) and the MLP. The RF is used to group the buildings into academic, engineering, and dormitory clusters. The results demonstrate that the proposed model is superior to other popular models; however, this method is inapplicable to regional load forecasting. In 2017, a Google team proclaimed that "attention is all you need" [29], making the attention mechanism (AM) prominent in deep learning (DL). Wang S.X. et al. [30] presented a bi-directional LSTM model with the AM. The results show that the MAPE decreases from 3.43% to 2.77% after adding an attention layer to the traditional Bi-LSTM model.
We found that LSTM-based hybrid models predominate in STLF. The LSTM-based model performs very well over the first few hours of the predictions but is unsatisfactory for the following hours. How to obtain preliminary predictions with high accuracy using an LSTM-based model and improve the predictions over the entire time horizon remains an open issue. Based on the outcomes of the previous works, we present a two-stage STLF method using the LSTM, AM, and MLP. An LSTM-based seq2seq forecasting module with the AM, which can process a multi-sequence input, is constructed to make 24-h predictions in the first stage. When training the seq2seq module, the LSTM can only learn from the information known before the target day. However, knowledge of the target day also plays an important role in the predictions. Therefore, in the second stage, the preliminary results obtained in the first stage are fed into the MLP network to make point-to-point (p2p) predictions, along with information about the temperature and holidays of the target day and the missing features that the seq2seq module cannot capture. We evaluate the performance of the seq2seq module under different lengths of the input window, namely the timestep. For example, when using the previous seven days' data to predict the next 24 h of load, the timestep is 168 h. We collected the power consumption and weather records of the Kanto region from 2016 to 2019 to train the proposed model. The contributions of this study are as follows:
• A two-stage method for STLF that consists of an LSTM-based seq2seq module and an MLP-based p2p module, which outperforms other popular models, is proposed. The seq2seq module is capable of handling a multi-sequence input, so the model can capture more features. In the second stage, given proper input features, the forecasting errors of the predictions obtained by the seq2seq module are reduced;
• The impact of the timestep on the seq2seq module is investigated. The experimental result shows that a timestep of 24 h is the best choice in this case. The hyperparameter tuning results are presented for reference;
• This study reveals that feeding a multi-sequence input to the LSTM makes it adapt to variations of seasons, holidays, and temperature, so the model does not need to be updated frequently. Moreover, the MLP module inherits the advantage of the seq2seq module and improves the residual distribution by learning the variation tendency of the load affected by holidays, seasons, and temperature;
• Research directions regarding the improvement and generalization of time series forecasting, using the MLP as a residual modifier, are highlighted.
The rest of this paper is organized as follows. Section 2 introduces the methodologies employed in this paper. In Section 3, we describe the proposed model and evaluation indices. Section 4 presents the dataset, data preprocessing, and experimental results. In Section 5, we discuss the results and point out future research directions.

Applied Methodologies
To better understand the framework of the proposed STLF model, this section introduces the relevant methods, including RNN, LSTM, AM, and MLP.

Recurrent Neural Network
The RNN is a generalization of the feedforward neural network that has an internal state (i.e., memory), making it applicable to processing sequences of inputs, such as in speech recognition [31], natural language processing [32], and time series prediction [18,19]. When processing a sequence of inputs, the RNN performs the same function for each element of the sequence. After producing the output for the current input, the output is duplicated and sent back into the RNN as a component of the next input. Figure 1 shows the structure of a simple RNN. The mapping from the input $X_t$ to the output $y_t$ can be described using the following equations [31]:

$$h_t = f(W_{xh} X_t + W_{hh} h_{t-1})$$
$$y_t = g(W_{hy} h_t)$$

where $h_t$ is the hidden state at time $t$; $W_{xh}$, $W_{hh}$, and $W_{hy}$ are the shared weight matrices at the current input state, the previous hidden state, and the output state, respectively; and $f(\cdot)$ and $g(\cdot)$ are the activation functions. Initially, the RNN takes $X_0$ from the input sequence and generates the hidden state $h_0$ and the output $y_0$. In the next step, $h_0$ and $X_1$ are the input. The RNN repeats this process until the end of the sequence. In this way, the RNN keeps remembering the previous information, so it is good at processing sequences whose contexts are intrinsically related. However, the RNN is trained using the backpropagation algorithm, and, therefore, the gradient vanishing problem may occur when the sequence becomes very long.
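As a concrete illustration, the following minimal NumPy sketch (not from the original paper; the dimensions and the tanh/identity activations are illustrative assumptions) unrolls the recurrence above over a short sequence:

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper).
d_in, d_hidden, d_out, T = 5, 8, 1, 24

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(d_hidden, d_in))
W_hh = rng.normal(size=(d_hidden, d_hidden))
W_hy = rng.normal(size=(d_out, d_hidden))

X = rng.normal(size=(T, d_in))   # input sequence X_0 ... X_{T-1}
h = np.zeros(d_hidden)           # initial hidden state

for t in range(T):
    # h_t = f(W_xh X_t + W_hh h_{t-1}), with f = tanh
    h = np.tanh(W_xh @ X[t] + W_hh @ h)
    # y_t = g(W_hy h_t), with g = identity for a regression output
    y = W_hy @ h
```

Note how the same weight matrices are reused at every step; only the hidden state carries information forward.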

Long Short-Term Memory
The LSTM network is a variant of the RNN that has several gates to control the input, memory (i.e., cell state), and output, making it remember past information more efficiently [26], so that the gradient vanishing problem is alleviated. The structure of an LSTM network is shown in Figure 2. When the LSTM processes a sequence, the first step is to decide what information in the memory will be thrown away. The decision is made by the forget gate via the Sigmoid function, whose output is a value between 0 and 1. The larger the output value, the more past information is kept in the memory. The calculation of the forget gate can be expressed as [27]:

$$f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)$$

where $W_f$ and $b_f$ are the weight matrix and bias of the forget layer, $h_{t-1}$ is the output (i.e., hidden state) at time $t-1$, and $X_t$ is the input of the current state.
The next step is to determine what new information will be stored in the memory by the input gate. The tanh layer creates a candidate of the input. Meanwhile, the input gate generates a value $i_t$ between 0 and 1 through the Sigmoid layer. The candidate, $\tilde{C}_t$, is scaled by $i_t$ and added to the memory to update the cell state $C_t$. The calculations of $\tilde{C}_t$, $i_t$, and $C_t$ are as follows [27]:

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where $W_C$ and $b_C$ are the weight matrix and bias of the tanh layer, and $W_i$ and $b_i$ are the weight and bias of the Sigmoid layer of the input gate, respectively. Finally, the LSTM gives the output controlled by the output gate. We put the current cell state $C_t$ through a tanh layer and multiply it by the scalar generated by the output gate. The scalar and the output of the LSTM can be computed by [27]

$$o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

where $W_o$ and $b_o$ are the weight and bias of the Sigmoid layer of the output gate, respectively.
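The gate equations above can be traced in a few lines of NumPy; the sketch below (an illustration with assumed dimensions, not the paper's implementation) performs one LSTM step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hidden = 5, 8                  # assumed dimensions
rng = np.random.default_rng(1)
# One weight matrix per gate, acting on the concatenation [h_{t-1}, X_t].
W_f, W_i, W_C, W_o = (rng.normal(size=(d_hidden, d_hidden + d_in)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(d_hidden)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    C_tilde = np.tanh(W_C @ z + b_C)    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde  # updated cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(C_t)            # hidden state / output
    return h_t, C_t
```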

Attention Mechanism
The AM in DL is based on the concept of directing the focus, making the network pay greater attention to certain factors when processing the input data. In this study, we employ a self-attention layer to enhance the performance of the seq2seq model [33]. It is used to manage and quantify the interdependence within the elements of the input sequence. Therefore, when generating the output at any timestep, this layer has viewed the whole input sequence and captured the relationships between any two timeslots.
A self-attention mechanism can be described as mapping a query and a set of key-value pairs to an output. The self-attention layer operates on an input sequence, $a = (a_1, a_2, \ldots, a_n)$ where $a_i \in \mathbb{R}^{d_a}$, and generates a new sequence $b = (b_1, b_2, \ldots, b_n)$ where $b_i \in \mathbb{R}^{d_b}$. The query, keys, and values are computed by [33]

$$q_i = a_i W^q, \quad k_i = a_i W^k, \quad v_i = a_i W^v$$

where $W^q, W^k, W^v \in \mathbb{R}^{d_a \times d_b}$ are the training parameters. The output $b_i$ is computed as a weighted sum of the linearly transformed input elements [33]:

$$b_i = \sum_{j=1}^{n} \alpha_{ij} v_j$$

The $\alpha_{ij}$ represents the weight coefficient between elements $a_i$ and $a_j$, which is computed using a softmax function [33]:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{n} \exp(e_{il})}$$

The attention score $e_{ij}$ can be calculated by [33]

$$e_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d_b}}$$
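To make the computation concrete, here is a minimal NumPy sketch of a single-head self-attention layer under the notation above (the dimensions are illustrative assumptions):

```python
import numpy as np

d_a, d_b, n = 6, 4, 24                       # assumed dimensions
rng = np.random.default_rng(2)
W_q, W_k, W_v = (rng.normal(size=(d_a, d_b)) for _ in range(3))

a = rng.normal(size=(n, d_a))                # input sequence, one row per timestep

q, k, v = a @ W_q, a @ W_k, a @ W_v          # queries, keys, values
e = (q @ k.T) / np.sqrt(d_b)                 # attention scores e_ij
alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)  # row-wise softmax
b = alpha @ v                                # outputs b_i = sum_j alpha_ij v_j
```

Each output row mixes information from every timestep, weighted by how strongly the timesteps attend to one another.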

Multilayer Perceptron
The MLP is a class of feedforward ANNs, consisting of a series of interconnected neurons (i.e., nodes). The structure of a simple MLP network [34] with one hidden layer is shown in Figure 3. The network consists of an input layer with $d$ units, one hidden layer with $m$ units, and one output layer with one node. Since MLP networks are fully connected, each node in one layer connects with a certain weight to every node in the following layer. The output of the MLP can be described by Equation (15) [34]:

$$y = \sum_{j=1}^{m} \lambda_j \, \Psi\!\left(\sum_{i=1}^{d} w_{ij} x_i\right) \quad (15)$$

where $w_{ij}$ denotes the weight from the $i$th neuron of the input layer to the $j$th neuron of the hidden layer, $\lambda_j$ denotes the weight from the $j$th neuron of the hidden layer to the output layer, and $\Psi$ is the activation function. One commonly used activation function is the rectified linear unit (ReLU), which can be written as:

$$\Psi(x) = \max(0, x)$$

The MLP can approximate highly non-linear functions between the input and output without any complex mathematical formula. It has been shown that the performance of the MLP in forecasting applications surpasses that of regression-based methods [35]. Depending on the case, the MLP may have more than one hidden layer and multiple units in the output layer.
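For illustration, Equation (15) amounts to the following one-hidden-layer forward pass (a sketch with assumed sizes, omitting biases as in the equation above):

```python
import numpy as np

d, m = 10, 100                        # assumed input and hidden sizes
rng = np.random.default_rng(3)
W = rng.normal(size=(d, m))           # w_ij: input -> hidden weights
lam = rng.normal(size=m)              # lambda_j: hidden -> output weights

def mlp_forward(x):
    hidden = np.maximum(0.0, x @ W)   # Psi = ReLU applied to each hidden unit
    return hidden @ lam               # weighted sum into the single output node
```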

The Proposed Model
This section presents the framework of the two-stage forecasting model. A seq2seq forecasting module is used to forecast the next 24 h of load demand in the first stage. In the second stage, the MLP module, which functions as a residual modifier, makes p2p predictions based on the outcomes of the seq2seq module. Each element of the output sequence obtained in the first stage is fed into the MLP along with holiday and temperature information, as well as lagged load demand, giving the final predictions. To evaluate the forecasting accuracy, several criteria are introduced.

The Framework of the Proposed Model
The framework of the two-stage forecasting model is shown in Figure 4. The proposed model consists of an LSTM-based module with the AM for multi-step forecasting and an MLP-based module for residual modification. The inputs of the seq2seq module are the records over the past $n$ hours, where $n$ is the length of the input window, including $MoY_t$ (month of year, 1 to 12 represent January to December), $DoW_t$ (day of week, 1 to 7 represent Monday to Sunday), $HoD_t$ (hour of day, 0 to 23 represent 0:00∼1:00 to 23:00∼24:00), $L_t$ (load), and $T_t$ (temperature). At time $t$, the seq2seq module processes the input and outputs the predictions of the next 24 h of load demand via the fully connected (FC) layer, $y = (y_{t+1}, y_{t+2}, \ldots, y_{t+24})$. We selected ten features as the input of the MLP network, including $DoW_t$, $HoD_t$, $H_t$ (holiday, 1 represents a holiday, 0 represents a workday), $H2W_t$ (holiday to workday, 1 represents that the previous day is a holiday and the target day is a workday, otherwise $H2W_t$ is 0), $L_{t-168}$ (168-h lagged load), $M3DL_t$ (mean load of the same hour of the previous three days), $y_t$ (predicted load in the first stage), $T_{t-168}$ (168-h lagged temperature), $T_{t-24}$ (24-h lagged temperature), and $T_t$ (predicted temperature by a trained ARIMA model [9]). The input vector, $X = (DoW_t, HoD_t, H_t, H2W_t, L_{t-168}, M3DL_t, y_t, T_{t-168}, T_{t-24}, T_t)$, is fed into the MLP module to modify the residual of the predicted load $y_t$, giving the final prediction.
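As a concrete illustration of this feature set, the following pandas sketch assembles the ten-feature input vector; the DataFrame layout and column names (load, temp, dow, hod, holiday, y_stage1, temp_pred) are our own assumptions for illustration, not the paper's code:

```python
import pandas as pd

def build_p2p_features(df: pd.DataFrame) -> pd.DataFrame:
    """df is assumed hourly-indexed with columns:
    'load', 'temp', 'dow', 'hod', 'holiday', 'y_stage1', 'temp_pred'."""
    X = pd.DataFrame(index=df.index)
    X["DoW"] = df["dow"]
    X["HoD"] = df["hod"]
    X["H"] = df["holiday"]
    # H2W: previous day was a holiday and the target day is a workday.
    X["H2W"] = ((df["holiday"].shift(24) == 1) & (df["holiday"] == 0)).astype(int)
    X["L_168"] = df["load"].shift(168)       # 168-h lagged load
    # M3DL: mean load of the same hour over the previous three days.
    X["M3DL"] = (df["load"].shift(24) + df["load"].shift(48) + df["load"].shift(72)) / 3
    X["y_stage1"] = df["y_stage1"]           # first-stage (seq2seq) prediction
    X["T_168"] = df["temp"].shift(168)       # 168-h lagged temperature
    X["T_24"] = df["temp"].shift(24)         # 24-h lagged temperature
    X["T_pred"] = df["temp_pred"]            # ARIMA temperature forecast
    return X
```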

Evaluation Indices
Several criteria, also called key performance indicators (KPI), are used as the evaluation indices to assess the performance of load forecasting models. In this paper, we choose the mean absolute percentage error (MAPE) and the root mean squared error (RMSE) as the indicators. The calculations of these two indices [25] are as follows:

$$\mathrm{MAPE} = \frac{100\%}{k} \sum_{i=1}^{k} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (\hat{y}_i - y_i)^2}$$

where $k$ is the size of the samples, and $\hat{y}_i$ and $y_i$ represent the predicted value and the observed value, respectively. The MAPE shows the mean dispersion between the predictions and the ground truth. The RMSE, on the other hand, describes the standard deviation of the differences between the predicted and observed values. The smaller the values of the above KPIs, the better the model performs.
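These two indices translate directly into a few lines of NumPy (a straightforward sketch of the definitions above):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```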

Experiments
The experiments are carried out in the environment of Python 3.6 with Keras (2.2.4) [36] on TensorFlow 2.1.0, using a computer with the following specification: Windows 10, 1.80 GHz 64-bit Intel Core i7-8565U processor.
The proposed model is trained and examined using the load record and temperature record of Tokyo. In this section, experiments regarding the selection of timestep are presented. The performance of the proposed model is compared with the conventional models.

Dataset Description
The dataset used in this paper consists of load records, temperature records, and calendar information. The load records of Tokyo are collected from the website of the Tokyo Electric Power Company (TEPCO), and the temperature records are obtained from the website of the Japan Meteorological Agency (JMA). The service area of TEPCO includes Tokyo and adjacent cities; we use the mean temperature of three major cities (Tokyo, Kanagawa, Gunma) for training. The interval of the load and temperature records is one hour. The information about holidays is extracted from the calendar of Japan. The period of the dataset is from 1 April 2016 to 31 December 2019. The dataset, which contains 32,880 samples, is split into a training set (80%) and a validation set (20%). We tested the losses of the seq2seq model under different splitting ratios; the results are listed in Table 2, and we chose the splitting ratio with the lower validation loss. Figure 5 shows the hourly load profile.

Data Cleaning
Neither the load dataset nor the temperature dataset has outliers, as examined by Tukey's test (i.e., box plot). In the temperature dataset, there are 30 discrete missing points in total. The missing values are filled using the value of the previous hour, considering the inertia of temperature.
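This previous-hour fill corresponds to a pandas forward fill; a self-contained sketch (the toy series is our own illustration):

```python
import numpy as np
import pandas as pd

# Toy hourly series with one missing point (illustrative).
temp = pd.Series([18.2, np.nan, 19.1],
                 index=pd.date_range("2016-04-01", periods=3, freq="H"))
temp = temp.fillna(method="ffill")  # the gap takes the previous hour's value: 18.2
```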

Temperature Adjustment
In the multiple linear regression (MLR) and MLP models, the temperature is adjusted by Equation (19), considering that the load is non-linear with respect to the temperature:

$$T_i' = \left| T_i - T_{base} \right| \quad (19)$$

where $T_i$ is the temperature at time $i$, and $T_{base}$ is the baseline, which is set to 18 °C. The load demand increases for cooling as the temperature rises above the baseline. Likewise, the load demand increases for heating as the temperature falls below it. The Pearson correlation between load and temperature increases from 0.09 to 0.56 after the adjustment.
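Assuming the absolute-deviation form of Equation (19) above, the adjustment is a one-liner (a sketch, not the paper's code):

```python
import numpy as np

T_BASE = 18.0  # baseline temperature in degrees Celsius

def adjust_temperature(temp_c):
    """Fold temperature around the baseline so that both heating and
    cooling demand map to larger adjusted values."""
    return np.abs(np.asarray(temp_c) - T_BASE)
```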

Feature Scaling
Feature scaling is an essential step in training neural networks. In this paper, standardization is applied, which transforms the data to have zero mean and unit variance using the following equation:

$$X' = \frac{X - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean value and the standard deviation of feature $X$, respectively.
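In practice, this is typically done per feature with scikit-learn's StandardScaler, fitted on the training split only (a usage sketch; scikit-learn itself is our assumption and is not named in the paper):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 5)         # placeholder training features
scaler = StandardScaler().fit(X_train)   # learns mu and sigma per column
X_train_std = scaler.transform(X_train)  # (X - mu) / sigma
```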

Experimental Setting
In the seq2seq module, we used an LSTM layer with 100 units. The hidden fully connected layer (FC1) has 64 nodes with the ReLU activation function. The loss is defined as the mean absolute error (MAE). The number of epochs and the batch size are 50 and 32, respectively. To select a proper length of the input window, we performed a series of experiments setting the length of the input window to n ∈ {24, 48, 72, 96, 120, 144, 168}.
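A minimal Keras sketch of this module under the settings above; the wiring of the attention layer (Keras's built-in Attention applied to the LSTM outputs as self-attention) is our assumption, not the paper's exact code:

```python
from tensorflow.keras import layers, models

timestep, n_features = 24, 5   # MoY, DoW, HoD, load, temperature

inputs = layers.Input(shape=(timestep, n_features))
h = layers.LSTM(100, return_sequences=True)(inputs)  # hidden states for all steps
h = layers.Attention()([h, h])                       # self-attention over LSTM outputs
h = layers.Flatten()(h)
h = layers.Dense(64, activation="relu")(h)           # FC1
outputs = layers.Dense(24)(h)                        # next-24-h load predictions

seq2seq = models.Model(inputs, outputs)
seq2seq.compile(optimizer="adam", loss="mae")
# seq2seq.fit(X_train, y_train, epochs=50, batch_size=32)
```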
In the p2p module, the numbers of nodes in the input layer and the output layer are ten and one, respectively. There are two hidden layers with 100 nodes each. The activation function of the hidden layers is the ReLU. The loss is defined as the MAE. The number of epochs is 100.
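The corresponding Keras sketch of the p2p module follows directly from these settings (the optimizer choice is our assumption):

```python
from tensorflow.keras import layers, models

p2p = models.Sequential([
    layers.Dense(100, activation="relu", input_shape=(10,)),  # hidden layer 1
    layers.Dense(100, activation="relu"),                     # hidden layer 2
    layers.Dense(1),                                          # final load prediction
])
p2p.compile(optimizer="adam", loss="mae")
# p2p.fit(X_train, y_train, epochs=100)
```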
The hyperparameters employed in the seq2seq module and the p2p module are determined through a series of tuning tests. We evaluated the validation losses of the models under different hyperparameters. Tables 3 and 4 show the hyperparameter tuning results for the seq2seq module; the results for the p2p module are listed in Tables 5 and 6. We chose the settings with the lowest validation loss.

Selection of TimeStep
We experimented with the seq2seq model by varying the timestep from 24 to 168 h. For instance, n = 168 means that we use the previous seven days' data to predict the load of the next day. The results are listed in Table 7. According to the results, we set the timestep to 24 h to construct the seq2seq module. From the previous literature, we know that the future load is strongly correlated with the load at the same hour of the previous three days and seven days. However, the seq2seq module with a timestep of 24 h misses this information. Therefore, in the second stage, we include the missing data in the MLP module.

Results and Comparison
In order to demonstrate the superiority of the proposed STLF model, the traditional MLR model, the MLP model, the CNN model, the standard LSTM model with a one-sequence input (LSTM(1seq)), the standard LSTM model without the AM, and the LSTM model with the AM are selected for comparison in terms of the MAPE and RMSE. Any LSTM model other than the LSTM(1seq) has a multi-sequence input by default. Both the MLR and MLP models use ten features, similar to the MLP module introduced in Section 3.1; the difference is that $y_t$ (the predicted load) is replaced by $L_{t-24}$ (the 24-h lagged load). In the CNN model, the kernel size of the Conv1D layer and the pool size of the MaxPooling1D layer are both 3. The settings of the LSTM-based and MLP-based models are exactly the same as those presented in Section 4.3.1.
As shown in Table 8, the evaluation indices of the proposed model are the smallest among the six models. The results reveal that the ANN-based models outperform the regression-based model. The MLP model gives a MAPE of 2.65%, which is lower than that of the other single ANN models. The results also show that the AM helps increase the forecasting accuracy of the LSTM-based model. By modifying the seq2seq module's residuals, the proposed model gives a MAPE and RMSE of 2% and 1, respectively. Compared to the LSTM-AM model and the MLP model, the performance of the proposed model improves by 29.1% and 24.5% in terms of the MAPE, respectively.

Figure 6 shows the forecasting results of the six models on a specific holiday and workday. As we can see, the results given by the LSTM-AM model are very good during the morning hours. This is because the users' behavior is inertial and less uncertain then, and LSTM-based models are proficient in the analysis of time series. However, the forecasting errors become larger after 8:00 a.m.: the predictions after 8:00 a.m. on the same day are either all underestimated or all overestimated. Fortunately, the trend is in line with the ground truth. Therefore, it is possible to calibrate the forecasting errors (i.e., residuals) following certain rules. Compared to the LSTM-AM model, the predictions of the proposed model change little before 8:00 a.m. and are pulled towards the ground truth after 8:00 a.m. On both the holiday and the workday, the proposed model is superior to the other models. We hypothesize that the MLP module improves the prediction accuracy because, in the second stage, it inherits the precise predictions of the seq2seq module in the morning hours and modifies the residuals thereafter.

To verify our hypothesis, we analyze the residuals of the LSTM-AM model and the LSTM-AM-MLP model. The residual is defined as

$$r_t = \hat{L}_t - L_t$$

where $\hat{L}_t$ is the predicted value and $L_t$ is the observed value. Thus, a positive residual represents an overestimate, and a negative residual represents an underestimate. As we can see in Figure 7, the residuals of the LSTM-AM model in the morning hours are relatively low and stable, which is consistent with the forecasting results on the specific holiday and workday. In the following hours of the day, the residuals gradually diverge. For the LSTM-AM-MLP model, the residual distributions in the morning hours are very close to those of the LSTM-AM model, while in the following hours the heights of the boxes are significantly shorter; in other words, the predictions during those hours converge to the ground truth. To show the overall distribution of the residuals, we plot histograms for both the LSTM-AM model and the LSTM-AM-MLP model in Figure 8. The proposed model gives 78.9% of the predictions with a residual between −1 and 1, which is 9.4% higher than that of the LSTM-AM model. These results support our explanation of why the MLP module helps improve the performance.

Discussion
To determine the length of the input window, a series of experiments was carried out. Usually, when increasing the input of a neural network, the model is expected to perform better. In this case, however, the error grows as the timestep increases. The result provides a reference for choosing an appropriate timestep, but it cannot explain this counterintuitive behavior, which remains an open issue for future work.
This study compared six independent models for STLF: the MLR, the MLP, the CNN, the LSTM(1seq) without the AM, the LSTM without the AM, and the LSTM with the AM. The MLR is a simple regression model, and the mapping from its input to its output is easy to understand. However, the MLR cannot deal with non-linear tasks, so its performance falls far short of the MLP's despite being given the same inputs. The MLP stands out among the six models because of its proper feature selection. However, its performance over the first few hours of the target day is not as good as that of the LSTM model, on both the holiday and the workday. The results also show that the LSTM(1seq) model performs poorly, while the LSTM model improves greatly upon it; in other words, feeding information about the month of year, day of week, hour of day, and temperature to the LSTM helps reduce the forecasting error. The CNN model performs much better on workdays than on holidays, whereas the LSTM model has a more balanced performance on both types of days, and the MAPE of the LSTM model is lower than that of the CNN model. The LSTM model with the AM is better than the one without it. To sum up, the LSTM-AM model with a multi-sequence input outperforms the other single models except for the MLP model, offering preferable preliminary predictions.
We notice that the LSTM-AM model gives stable and precise predictions during the morning hours and that the distribution of its residuals follows specific patterns. Therefore, we use the MLP to improve the preliminary predictions. The MLP module inherits the advantage of the seq2seq module while focusing on residual modification during peaks. The selected features include the day type change (from a holiday to a workday), the 168-h lagged load, the mean load of the same hour of the previous three days, the preliminary prediction, the 168-h lagged temperature, the 24-h lagged temperature, and the predicted temperature of the target day. The MLP learns the characteristics of the residual pattern by decoupling the impacts of holidays, day type changes, and temperature changes on electricity consumption. The results demonstrate that the MLP module improves the residuals of the predictions over the peaks from 9:00 to 22:00.
Compared to the hybrid LSTM-CNN model presented in [26], the proposed model can handle a multi-sequence input, so that information about temperature, holidays, hour of day, and month can be fed into the LSTM. Therefore, the proposed model adapts to variations in temperature, month, date, and time; that is to say, the model does not need to be updated frequently. We observed the daily MAPE for the validation dataset from April to December 2019 and found no significant change over time. We recommend updating the network every six months. The LSTM-CNN model, in contrast, can only learn the temporal features from the load of the previous 21 days and requires frequent updates to adapt to unseen conditions. More importantly, the LSTM-CNN model is a black box, whereas the proposed model with two independent neural networks is partially interpretable. We know that the LSTM-AM learns from the previous load and captures the impacts of season and temperature on power consumption, which is why the LSTM with a multi-sequence input performs much better than the one with a one-sequence input. The MLP module masters the variation tendency of the load affected by holidays, seasons, and temperature, which has been verified by feeding it different input features. Thus, the MLP module improves the residuals of the LSTM-AM's predictions.
The results show that the MAPE of the proposed model decreases from 2.65% to 2% in comparison with the MLP model, which can reduce the cost of generation by 0.065∼0.195%, according to the conclusion stated in [1]. It is reported that the average cost of generation in Japan in 2011 was JPY 11.6 per kWh [37], and the total power generated by TEPCO in 2011 was 249.2 billion kWh [38]. The cost savings would thus reach JPY 1.88∼5.64 billion with the improvement of the proposed approach. The approach can be implemented on a generic computer with simple maintenance, so electric power companies can benefit considerably from the presented method at a meager cost.
This paper provides an approach for making multi-step load forecasts based on the LSTM without frequent updates, together with a method to improve the preliminary predictions. We found that the information of the target day is essential for load forecasting. For the LSTM, we cannot construct an input matrix for the target day with the same shape as the historical data because the number of features does not match. If we try to feed the information of the target day to the LSTM, we need to extend the number of features of the input matrix rather than increase the length of the sequence. However, the LSTM is better at extracting the correlations between different time steps than at learning the intrinsic relationships of the features within the same time step. It might be better to employ a CNN to capture these features and obtain more precise preliminary predictions in the first stage. We will verify this idea in future work.
In addition, it is possible to enrich the features fed into the MLP module in the second stage. Nanae K. et al. [39] identified the dominant factors affecting hourly electricity demand, such as indices of tertiary industry activity, the producer index, and the number of internet searches for power-related words. These factors may be helpful for residual modification. In the second stage, the temperature of the target day is predicted by a simple ARIMA model; provided that temperature forecasts with a one-hour resolution from a weather forecast agency are available, which might be more reliable, the load forecasting errors could be lower.
Previously, researchers often improved the prediction accuracy of a model by increasing the number of layers within the neural network, which makes the model complex and hard to interpret. This paper offers a new way to deal with time series forecasting applications. The proposed two-stage model is understandable and easy to train and maintain. This method could be useful in solar irradiation forecasting, combining the LSTM and CNN: similarly, the MLP could be used to modify the residuals of the predictions obtained by a hybrid LSTM-CNN model with a multi-sequence input, with the mean solar irradiation at the same hour of similar days as one candidate input feature. In addition, the proposed approach can easily be applied to various types of load, such as regional, residential, and industrial; the indispensable data include the historical load, temperature, and information about holidays. The method can also be adapted to datasets with different resolutions.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data is collected from TEPCO (https://www.tepco.co.jp/en/forecast/html/download-e.html) and JMA (http://www.jma.go.jp/jma/menu/menureport.html) (accessed on 20 February 2020). The raw data can be downloaded via Google Drive (https://drive.google.com/file/d/1aq7393samv7HTMJwiayQIM5DvpJbmv-L/view?usp=sharing).

Conflicts of Interest:
The authors declare no conflict of interest.