An Attention-Based Multilayer GRU Model for Multistep-Ahead Short-Term Load Forecasting †

Recently, multistep-ahead prediction has attracted much attention in electric load forecasting because it can deal with sudden changes in power consumption caused by various events, such as fires and heat waves, for a day from the present time. Recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent unit (GRU) networks, can effectively use information from previous time steps when predicting the current one. Due to this property, they have been widely used for multistep-ahead prediction. The GRU model is simple and easy to implement; however, its prediction performance is limited because it considers all input variables equally. In this paper, we propose a short-term load forecasting model using an attention-based GRU to focus more on the crucial variables and demonstrate that this can achieve significant performance improvements, especially when the input sequence of the RNN is long. Through extensive experiments, we show that the proposed model outperforms other recent multistep-ahead prediction models in building-level power consumption forecasting.


Introduction
Smart grid technologies have attracted much attention because of their potential to cope with climate change and energy crises [1]. A smart grid optimizes energy efficiency through bidirectional interaction between suppliers and consumers and employs renewable energy (i.e., solar and wind power), combined cooling, heating, and power, and energy storage systems (ESSs) [1-3]. As a result, greenhouse gases can also be reduced during electricity generation. Short-term load forecasting (STLF) should be performed to determine the power supply required for the power system and to establish an operational plan for the smart grid [3,4]. STLF plays a crucial role in deciding when to run the power generation system from the next hour to one week ahead [4]. Diverse STLF models include daily peak load forecasting, total daily load forecasting, hourly electrical load forecasting, and very STLF (e.g., 10-min and 15-min interval electrical load forecasting). In particular, hourly electrical load forecasting and very STLF are popularly used for performing demand-side management or operating ESSs from the next hour up to one day [5]. STLF is a challenging task because electrical energy consumption exhibits complicated patterns and fluctuations due to diverse unexpected external factors [6]. Thus, effectively learning the relationship between electrical load and external factors when constructing a forecasting model is essential. Many studies have constructed STLF models using artificial intelligence (AI) techniques. For instance, support vector regression (SVR) [7], random forest (RF) [8], and extreme gradient boosting (XGB) [9] have demonstrated excellent prediction performance by considering the nonlinear relationship between input and output variables. However, these STLF models focus on day-ahead point forecasting; they cannot easily and properly cope with various unexpected events that could affect electrical energy consumption for a day from the present time.
Moreover, if the time difference between the training and testing sets is extensive, their prediction performance could deteriorate because they do not adequately reflect recent electrical load trends [10].
Recently, a gated recurrent unit (GRU) was used for STLF to solve these problems because it is simple and easy to implement and can carry out multistep-ahead forecasting [11]. However, the GRU has the disadvantage that its forecasting accuracy may deteriorate for a long input sequence because it concentrates on all variables equally. In this paper, we propose a GRU-based multistep-ahead STLF model and augment it using an attention mechanism to improve the forecasting performance by focusing more on crucial variables. To do this, we first collected hourly power consumption data of three office buildings. Then, we carried out preprocessing and exploratory data analysis to verify the relationship between various factors and the power consumption of buildings for model training. Lastly, we constructed an attention-augmented GRU network for multistep-ahead (24 points) forecasting for a day from the current time.
The main contributions of this paper are as follows:
1. We conducted multistep-ahead forecasting for the hourly power consumption of buildings to adequately cope with sudden changes in power consumption caused by various unexpected events, such as peaks and blackouts, instead of day-ahead point STLF.
2. We constructed an attention-based multilayered GRU model to achieve faster and more stable multistep-ahead STLF than other deep learning (DL) architectures.
3. We verified the superiority of the proposed model through extensive comparisons with several state-of-the-art forecasting models using the power consumption data of three office buildings.
The remainder of this paper is organized as follows. In Section 2, we review several related studies. Section 3 describes the input variable configuration. Section 4 presents the steps for constructing our attention-based GRU model. Section 5 shows the experimental results to demonstrate the superiority of the model. Finally, we conclude the study and present directions for future research in Section 6.

Related Studies
Many STLF models have been proposed to accurately predict energy consumption based on diverse AI techniques. In this section, we briefly introduce some recent STLF approaches based on machine learning (ML) and DL-based methods and summarize them in Table 1.
Lahouar and Slama [12] proposed a day-ahead hourly electrical load forecasting model based on the RF. They configured input variables by considering two cases: (1) exogenous and relative to the predicted day and (2) endogenous and relative to the previous days. They used the RF model's variable importance to explain the relationship between the input variables and electrical load. They demonstrated the superiority of the proposed model by comparing it with persistence analysis, an artificial neural network, and SVR. Moon et al. [13] developed an RF-based total daily load forecasting model for university campuses. They first used a moving average method to consider the electrical load patterns on the days of the week. They built an RF model using several input variables, such as the timestamp, temperature, academic year, historical load, and prediction value from the moving average method. They demonstrated that their proposed model outperformed other popular ML methods, such as the decision tree, multiple linear regression (MLR), gradient boosting machine, SVR, and artificial neural network. Park et al. [14] proposed a two-stage hourly electrical load forecasting model. In the first stage, they constructed two forecasting models using the XGB and RF methods. Then, they combined the prediction values using a sliding window-based MLR model in the second stage. They demonstrated that their proposed model outperforms several single ML methods. Ryu et al. [15] proposed two deep neural network-based STLF models for a demand-side empirical load database. They compared these STLF models with a shallow neural network, double-seasonal Holt-Winters, and autoregressive integrated moving average (ARIMA) models and demonstrated that their STLF models exhibited excellent performance. Izonin et al. [16] used non-iterative approaches based on a successive geometric transformation model (SGTM) to predict the net hourly electrical energy output of a combined cycle power plant.
The SGTM neural-like structure showed a better prediction performance than MLR, SVR, and general regression neural networks. Motepe et al. [17] developed an improved load forecasting process using hybrid AI and deep learning methods. They determined the optimal hyperparameters through several experiments to enhance the performance of long short-term memory (LSTM) networks. They demonstrated that their proposed model, the LSTM network, had lower prediction errors than an adaptive neuro-fuzzy inference system and optimally pruned extreme learning machine.
Kuan et al. [18] constructed a multilayered self-normalizing GRU model for STLF. They demonstrated that the multilayered self-normalizing technology could improve the prediction performance of the GRU and LSTM models through several experiments. Chitalia et al. [19] presented a short-term load forecasting framework that is robust regardless of building type or location. They collected data from five commercial buildings of five different building types at five different locations. They explored nine different deep learning models and determined the best model for load forecasting in each cluster obtained by unsupervised k-means clustering. Sehovac and Grolinger [20] proposed a recurrent neural network (RNN) with attention for load forecasting. Their RNN had the ability to model time dependencies. They used a sequence-to-sequence (S2S) approach with an encoder and decoder to strengthen this ability. Moreover, they added an attention mechanism to ease the connection between the encoder and decoder to further improve the performance. They proved the superiority of their S2S approach and attention mechanism through comparison with a non-S2S model and a vanilla RNN.

Data Collection
This section describes the data preprocessing and input variable configuration for the STLF model. For model construction and testing, we used one publicly available dataset and one confidential dataset. The former represents electrical energy consumption data collected from two office buildings in Richland, Washington, USA, from 2 January 2009 to 31 December 2011 [21]. These datasets consist of the timestamp information, temperature in Fahrenheit, and hourly electrical energy consumption. We handled the three missing values (2009/04/05 02:00, 2010/04/04 02:00, and 2011/04/03 02:00) and some anomalous data in the datasets using a linear interpolation method. The latter represents hourly electrical energy consumption data collected from the headquarters of a midsized company in Seoul, South Korea, from 1 January 2015 to 31 December 2017 [10]. We also collected the holiday information for Washington and South Korea from 'Time and Date' [22], which provides national public holiday information for many countries. Table 2 lists basic building information with some statistical analyses of the collected electrical energy consumption data. In the table, Holiday includes Saturday, Sunday, and national holidays. The minimum and count represent the lowest electrical energy consumption and the number of data points, respectively. In addition, the first quartile, median, and third quartile represent the values of the lower 25%, 50%, and 75% points, respectively, of the energy consumption. Figure 1 shows the box plots for the energy consumption of the three buildings by weekday, holiday, and total.

Feature Extraction
We considered the temperature, timestamp information, and historical load data to build the model and performed feature extraction for the given datasets to train the model as represented in Table 3. In particular, as electrical energy consumption tends to vary with date and time, we considered all variables that represent date and time such as month, day, hour, day of the week, and holiday. Month, day, and hour have a sequence format. When data with a sequence format are used directly in the AI techniques, it is difficult to reflect their periodicity [23]. For instance, 11 p.m. and midnight are temporally contiguous; however, the data difference in sequence format is 23.
Hence, we represented the month, day, hour, and day of the week as two-dimensional data using Equations (1)-(8) to reflect their periodicity [23,24]:

Month x = sin((360°/12) × Month) (1)
Month y = cos((360°/12) × Month) (2)
Day x = sin((360°/LDM) × Day) (3)
Day y = cos((360°/LDM) × Day) (4)
Hour x = sin((360°/24) × Hour) (5)
Hour y = cos((360°/24) × Hour) (6)
Day of the week x = sin((360°/7) × Day of the week) (7)
Day of the week y = cos((360°/7) × Day of the week) (8)

In Equations (3) and (4), LDM represents the last day of the month. For instance, if the month of the prediction is March, its LDM is 31. In addition, we used a variable indicating whether it is a holiday to consider the different electrical energy consumption patterns on weekdays and holidays. The temperature, which is closely related to power consumption, is also used for model training [13]. The datasets of Building 1 and Building 2 included the outdoor temperature in Fahrenheit, while the dataset of Building 3 included only power consumption data according to the timestamp information. Hence, we collected the temperature near Building 3 provided by the Korea Meteorological Administration (KMA) [25] and converted it from Celsius to Fahrenheit.
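As a minimal illustration (not the paper's implementation), the cyclical encoding of Equations (1)-(8) can be sketched in Python; the function name `cyclic_encode` is our own:

```python
import numpy as np

def cyclic_encode(value, period):
    """Map a cyclic quantity (e.g., hour 0-23) to a point on the unit
    circle: x = sin((360°/period) × value), y = cos((360°/period) × value)."""
    angle = 2.0 * np.pi * value / period
    return np.sin(angle), np.cos(angle)

# 11 p.m. and midnight differ by 23 as integers, but are adjacent on the circle.
x23, y23 = cyclic_encode(23, 24)
x0, y0 = cyclic_encode(0, 24)
dist = np.hypot(x23 - x0, y23 - y0)  # small Euclidean distance (~0.26)
```

For the day of the month, the period would be the LDM of the prediction month rather than a fixed constant.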
We also considered historical electricity load data, comprising the electrical load at the same hour from one day to one week earlier, as input variables to reflect the recent electricity consumption pattern [26,27]. In addition, we added a new historical electricity load value as an input variable to reflect the different electricity load patterns of holidays and weekdays [13]. When the prediction point was a weekday, we used the average weekday electricity consumption of the past week. Likewise, when the prediction point was a holiday, we used the average holiday electricity consumption of the past week.

Correlation Analysis
We configured various input variables, such as the timestamp, temperature, and historical load, to construct the forecasting model. We calculated the Pearson correlation coefficients with p-values for each dataset to confirm their relevance to the actual electrical energy consumption. Table 4 illustrates the results. In the table, the average load considering weekdays/holidays, the historical load one day before, and the historical load one week before exhibit a strong correlation with the actual electrical energy consumption. We also verified that the electric load pattern differs depending on the hour, day of the week, and holiday because Hour y, Day of the week y, and Holiday present a negative correlation with the actual electrical energy consumption.
The remaining variables are positively related to the building electrical energy consumption. Although they exhibited a nonlinear relationship, we can adequately handle this by applying a deep learning method, such as the LSTM and GRU networks. Overall, most input variables exhibit a meaningful correlation with the building electrical energy consumption because the p-values are less than 0.01.
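As a hedged sketch of the correlation check described above (the column values are synthetic, and `pearson_r` is our own helper; `scipy.stats.pearsonr` additionally returns the p-value used for the significance criterion):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two series."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))

# Synthetic example: the lagged load is, by construction, linearly
# related to the current load, so the correlation is perfect.
load = np.array([100.0, 120.0, 95.0, 130.0, 110.0])
lag24 = load * 0.9 + 5.0
r = pearson_r(load, lag24)
```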

Gated Recurrent Unit Model
An artificial neural network (ANN), also known as a multilayer perceptron (MLP), is a popular AI technique implemented based on human biological neurons that can process large amounts of data in parallel and learn efficiently [24,28]. The ANN is a static input-output mapping model that only considers the input and output and does not consider time.
It usually represents the input, weights, output, etc., in vector form, and the output is calculated internally from the input and weight vectors. The decision boundary is orthogonal to the weight vector [29]. In addition, the ANN presents various decision boundaries because an activation function determines the perceptron's response to its input [23].
In contrast, a recurrent neural network (RNN), although a kind of ANN, is a dynamic input-output mapping model that considers inputs across time. RNNs are well-suited to time-series data because they can process a time series step by step, maintaining an internal state from one time step to the next [30]. The LSTM is the most successful and widely used RNN [31]. The LSTM preserves the differential values of old inputs during backpropagation to solve the long-term dependency problem of the RNN. As a variant of the RNN architecture, the GRU has the advantage of simplifying the LSTM structure by reducing the computation required to update the hidden state while solving the long-term dependency problem and maintaining the performance of the LSTM [11,32]. Figure 2 illustrates a GRU cell architecture. In the figure, the GRU cell has input and forget gates, both controlled by a single gate controller, z. When z is 1, the forget gate is closed, and the input gate is open; when z is 0, the forget gate is open, and the input gate is closed. At each step, the previous (t − 1) memory is saved, and the input of the time step is cleared. The GRU cell is controlled according to Equations (9)-(12):

z_t = σ(W_z x_t + U_z h_(t−1) + b_z) (9)
r_t = σ(W_r x_t + U_r h_(t−1) + b_r) (10)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_(t−1)) + b_h) (11)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (12)

Here, x_t is the input vector, h_t is the hidden state, z_t and r_t are the update (gate controller) and reset gates, σ is the sigmoid function, and ⊙ denotes element-wise multiplication.
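A minimal NumPy sketch of a single GRU cell step following Equations (9)-(12) (illustrative only; the paper's models were built with TensorFlow, and the parameter layout here is our own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W, U, b):
    """One GRU step. W, U, b hold parameters for the update gate 'z',
    reset gate 'r', and candidate state 'h'."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])               # update gate, Eq. (9)
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])               # reset gate, Eq. (10)
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])   # candidate, Eq. (11)
    return (1.0 - z) * h_prev + z * h_tilde                            # new state, Eq. (12)

# Tiny example: 3 input features, 2 hidden units.
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(2, 3)) for k in 'zrh'}
U = {k: rng.normal(size=(2, 2)) for k in 'zrh'}
b = {k: np.zeros(2) for k in 'zrh'}
h = gru_cell(rng.normal(size=3), np.zeros(2), W, U, b)
```

Note that when z is 1, the new state is entirely the candidate h̃_t (forget gate closed, input gate open), matching the gate description above.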
We used a GRU network to predict the building electrical energy consumption at 24 points in time (from one hour later to one day later) and considered several hyperparameters to construct our GRU model. We set the number of hidden layers in the GRU model to two. The input layer of the GRU model consists of 18 nodes, and the hidden layer consists of 13 nodes per layer by applying two-thirds of the input layer, plus the size of the output layer [10,23].
We used the scaled exponential linear unit (SELU) as the activation function [33,34], defined by Equation (13):

SELU(x) = λx if x > 0; λα(e^x − 1) if x ≤ 0 (13)

Here, α is a stochastic variable sampled from a uniform distribution at training time and is fixed to 1.67326, the expectation value of the distribution, at testing time. Moreover, λ is an extra parameter for determining the slope and is set to 1.0507. We chose SELU because it can effectively train DL models due to its superior self-normalization quality and absence of the vanishing gradient problem [23,34].
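The SELU activation of Equation (13), with the constants stated above, can be sketched as follows (a minimal NumPy version, not the framework implementation used in the paper):

```python
import numpy as np

LAMBDA = 1.0507   # slope parameter λ
ALPHA = 1.67326   # α, fixed at test time per the paper

def selu(x):
    """Scaled exponential linear unit: λx for x > 0, λα(e^x − 1) otherwise."""
    x = np.asarray(x, float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))
```

For large negative inputs the output saturates near −λα ≈ −1.758, which is what keeps activations self-normalizing.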
In addition, we used the Huber loss [35] and adaptive moment estimation (Adam) [34] as the loss function and optimization algorithm, respectively. The Huber loss is calculated using Equation (14) with δ = 1:

L_δ(A_t, F_t) = (1/2)(A_t − F_t)² if |A_t − F_t| ≤ δ; δ(|A_t − F_t| − δ/2) otherwise (14)

We set the learning rate and the number of training epochs to 0.001 and 500, respectively.
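A minimal sketch of the Huber loss in Equation (14) (our own NumPy version; in practice a framework built-in such as a Keras Huber loss would be used):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic near zero, linear for large errors,
    which makes training less sensitive to occasional load spikes."""
    a = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    quadratic = 0.5 * a ** 2
    linear = delta * (a - 0.5 * delta)
    return float(np.mean(np.where(a <= delta, quadratic, linear)))
```

For an error of 0.5 (within δ = 1) the loss is 0.5 × 0.5² = 0.125; for an error of 3 it is 1 × (3 − 0.5) = 2.5.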

Attention Mechanism
A longer input sequence in the GRU network results in worse prediction accuracy of the output sequence because the network focuses on all input variables equally, even though they can have different relevance to the forecast. An attention mechanism can be used to alleviate this problem by focusing on the more relevant input variables.
An attention mechanism [36] consists of an encoder that generates hidden states from the input and a decoder that takes the encoder output as input. At each decoding step, an attention score is assigned to the hidden state of each encoder step using the decoder's previous hidden state, and an attention vector is created by applying a softmax operation to these scores. In this way, whenever the decoder predicts an output value, it concentrates on the input steps most relevant to the predicted value.
For instance, Kwon et al. [37] developed an attention-mechanism-based RNN model for electronic medical records. Hence, we constructed an attention mechanism to focus on input variables with high correlation to improve the model accuracy. We set the size of the attention window to 96. Figure 3 presents the overall attention-based GRU model architecture.
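The core attention computation can be sketched as follows. This is a simplified dot-product scoring over encoder hidden states (the paper's encoder-decoder attention may use a different, learned scoring function; all names here are our own):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attention_context(hidden_states, query):
    """Score each encoder step (rows of hidden_states, shape T x d) against
    the decoder query, normalize with softmax, and return the weighted
    context vector together with the attention weights."""
    scores = hidden_states @ query      # (T,)
    weights = softmax(scores)           # attention vector, sums to 1
    context = weights @ hidden_states   # (d,)
    return context, weights

# 96-step attention window, as in the paper; hidden size 8 is illustrative.
rng = np.random.default_rng(2)
H = rng.normal(size=(96, 8))
q = rng.normal(size=8)
ctx, w = attention_context(H, q)
```

Steps whose hidden states align with the query receive larger weights, so the context vector emphasizes the most relevant parts of the input window.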


Experimental Design
In the experiments, we used an Intel Core i7-8700 CPU (Santa Clara, CA, USA), 32 GB of Samsung DDR4 memory (Suwon, Korea), and an NVIDIA GeForce GTX 1080 Ti (Santa Clara, CA, USA) running Windows 10. For multistep-ahead hourly electricity load forecasting, we performed the experiments in Python 3.7.6, and the RNN-based models were constructed using TensorFlow 1.13.1 [38]. Table 5 presents the entire collection, training set, and testing set period for each building. In AI-based STLF model training, normalization is usually conducted to prevent large-scale features from having too much influence by adjusting all the input variables to a similar range. For this, we applied min-max normalization to all input variables using Equation (15):

X_norm = (X − X_min)/(X_max − X_min) (15)

We transformed all input variables of the training set and then applied the minimum and maximum values obtained from the training set to the testing set. Hence, the minimum and maximum values of an input variable in the training set become 0 and 1, respectively.
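The fit-on-train, apply-to-test normalization described above can be sketched as follows (the values are synthetic and the helper names are our own):

```python
import numpy as np

def fit_minmax(train):
    """Column-wise min and max learned from the training set only."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(X, x_min, x_max):
    """Equation (15): scale each column with the training-set statistics."""
    return (X - x_min) / (x_max - x_min)

train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
test = np.array([[25.0, 500.0]])
x_min, x_max = fit_minmax(train)
train_n = apply_minmax(train, x_min, x_max)
test_n = apply_minmax(test, x_min, x_max)  # may leave [0, 1] if test exceeds the training range
```

Reusing the training-set statistics on the testing set avoids leaking information about future data into the model.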
We used the mean absolute percentage error (MAPE) and the coefficient of variation of the root mean squared error (CVRMSE) to compare the forecasting performance, calculated using Equations (16) and (17), respectively:

MAPE = (100/n) × Σ |(A_t − F_t)/A_t| (16)
CVRMSE = (100/Ā) × sqrt((1/n) × Σ (A_t − F_t)²) (17)

Both metrics express the accuracy as a percentage error. Hence, they are more intuitive and easier to understand than other well-known metrics, such as the root mean squared error and mean squared error [23,26]. A lower value for these metrics indicates better prediction performance.
Here, Ā is the average of the actual values, A_t and F_t are the actual and predicted values at time t, respectively, and n is the number of predictions.
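The two metrics can be sketched directly from their definitions (a minimal NumPy version with illustrative numbers):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, Equation (16), in percent."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.mean(np.abs((a - f) / a))

def cvrmse(actual, forecast):
    """Coefficient of variation of the RMSE, Equation (17), in percent."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    return 100.0 * np.sqrt(np.mean((a - f) ** 2)) / a.mean()
```

For actuals [100, 200] and forecasts [110, 180], the MAPE is 10% (both absolute percentage errors are 10%), while the CVRMSE is sqrt(250)/150 × 100 ≈ 10.54%.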

Experimental Results and Discussion
We compared the proposed model with other state-of-the-art models, such as the multivariate RF (MRF), DNN, LSTM, ATT-LSTM, GRU, and ensemble models to evaluate its validity. We considered two ensemble models: Park's stacking ensemble model [14] and Moon's stacking ensemble model, called COSMOS [10]. We considered all the input variables for each prediction time point to construct the MRF and DNN models. Therefore, we used 432 input variables, that is, 18 (number of input variables) × 24 (number of prediction time points), for the multistep-ahead STLF [39]. The two stacking ensemble models consist of two stages; the second-stage model used the prediction results of the first stage and demonstrated better forecasting performance than many single ML models and existing forecasting models. We also compared the prediction performance of the proposed model with that of Kuan's GRU model [18]. We implemented an RF-based STLF model using the MRF [40] R package and the two stacking ensemble models using xgboost 1.3.0 [41] and scikit-learn [42] in the Python environment. The selected hyperparameters for each model are listed in Table 6. Tables 7-9 and Tables 10-12 present the MAPE and CVRMSE comparisons of the three buildings by month and date type, respectively. The values in bold font indicate the best performance for the month (or date type). When comparing the MAPE (CVRMSE) by month, the values from April to September, which includes the summer season, were the lowest. When comparing the forecasting performance of weekdays and holidays, the forecasting performance for weekdays was generally better than that for holidays. As listed in Table 2, the scale of power consumption of Building 3 is larger than that of Buildings 1 and 2, so the MAPE and CVRMSE values of Building 3 are generally smaller than those of Buildings 1 and 2.
Tables 7-12 indicate that the proposed model (ATT-GRU) showed the lowest MAPE among most STLF models in the comparisons by month and date type. In addition, from the results of ATT-LSTM and ATT-GRU, the attention mechanism could improve the prediction performance of the general LSTM and GRU models by 10% or more.
The GRU model not only performed better than the LSTM in most experiments but also took less time to construct because it is a lighter model than the LSTM model. The average elapsed training times were 10,937, 11,581, 7074, and 7263 s for LSTM, ATT-LSTM, GRU, and ATT-GRU, respectively, as depicted in Table 13. In terms of multistep-ahead STLF, the prediction accuracy of the ATT-GRU model did not reveal any significant decrease, even for the later points. Moreover, our model exhibits the lowest MAPE and CVRMSE values point by point; the point-by-point forecasting results are depicted in Appendix A. This is because the attention mechanism calculates the weights of the points and focuses on the points with large weights. Figures 4-6 represent scatter plots of the power consumption and temperature of the three buildings. According to the plots, the average power consumption of Buildings 1 and 2 tended to increase as the temperature approached 0 or 100 degrees Fahrenheit, whereas Building 3 showed an increase in power consumption only as the temperature approached 100 degrees Fahrenheit. In addition, Buildings 1 and 2 showed frequent power consumption peaks when the temperature got close to 0 degrees Fahrenheit. The power forecast accuracy for Buildings 1 and 2 in November and December deteriorated due to these unexpected peaks at low temperatures. Our proposed multistep-ahead forecasting method can be used to predict such unexpected peaks and handle them.

Conclusions
In this paper, we proposed a reliable hourly electrical energy consumption forecasting scheme using an attention mechanism and a GRU network. The GRU network performs multistep-ahead forecasting, and the attention mechanism enables the network to concentrate more on the input variables with a higher correlation to energy consumption. We collected publicly available datasets from the United States Department of Energy, constructed 18 variables related to electricity consumption, and built an STLF model using the attention-based GRU network to verify the performance of the proposed scheme. Through extensive experiments, we compared the proposed model with other popular state-of-the-art STLF models, including the stacking ensemble model, DNN, COSMOS, LSTM, ATT-LSTM, and GRU. We showed that our model outperformed the other models in terms of MAPE and CVRMSE. In addition, we demonstrated that the prediction performance of LSTM and GRU can be improved by more than 10% by using the attention mechanism. However, our model remains a black box; that is, like other deep learning models, it cannot explain the basis of its outputs. To overcome this, we plan to extract feature importance using layer-wise relevance back-propagation and implement explainable AI based on it in future work. Moreover, we will investigate how to further improve the prediction performance of multistep-ahead electricity load forecasting.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Tables A1-A6 present the MAPE and CVRMSE values of the forecasting methods for 24 forecasting points.