Short-Term Load Forecasting Using an Attended Sequential Encoder-Stacked Decoder Model with Online Training

The paper presents a new approach for the prediction of active load power 24 h ahead using an attended sequential encoder and stacked decoder model with Long Short-Term Memory cells. The load data are owned by the New York Independent System Operator (NYISO) and date from the years 2014–2017. Due to dynamics in the load patterns, multiple short training runs on pre-filtered data are executed in combination with the transfer learning concept. The evaluation is done by direct comparison with the results of the NYISO forecast and additionally under consideration of several benchmark methods. The results in terms of the Mean Absolute Percentage Error range from 1.5% for the highly loaded New York City zone to 3% for the Mohawk Valley zone with its rather small load consumption. The execution time of a day-ahead forecast, including the training, amounts to 10 s on average on a personal computer without a GPU.


Introduction
Load forecasts are essential in several areas of power network operation, independently of the voltage level. With the increasing number of renewables and thus more volatility and dynamics in the network, the task of load forecasting becomes even more important. Errors in forecasts have direct financial consequences for the utilities and, in the long term, for their customers. They may also lead to an avoidable waste of green power when it has to be curtailed due to network congestion.
Load forecasts are usually subdivided into three categories concerning the length of the prediction horizon:
• short-term: from a few minutes up to one week ahead
• mid-term: from one week up to one year ahead
• long-term: from one year up to several years ahead
Short-term load forecasts are used to guarantee safe and optimal real-time network operation (prevention of network violations, unit commitment, and economic dispatch). Mid-term load forecasts are more important for planning maintenance tasks, load redispatch, and securing a balance of load and generation. Long-term forecasts are mainly relevant for network restructuring and expansion.
In the area of short-term load forecasting, two basic groups of methods have been established, i.e., methods based on statistics and so-called intelligent approaches [1]. Statistical methods are usually easy to implement and provide quick results. A standard method from the group of statistical approaches is multiple regression [2,3]. It can cope with changes in the load data due to trending or seasonal impacts, and it can include different kinds of independent variables in the forecast model, such as weather and calendar data or the load data from previous time instances. To guarantee good prediction results, training data from at least one year before the forecast start is required. Other well-established statistical forecast methods are General Exponential Smoothing with inclusion of seasonality (Holt-Winters) [4], the autoregressive integrated moving average (ARIMA) [5,6], and combinations of ARIMA and Artificial Neural Networks [7]. They can consider influences resulting from changing trends, seasonal differences, and irregularities in load data and work well with a limited number of training samples. However, external variables such as weather cannot be included in their models.
Statistical methods expect an exact mathematical model of the load and its influencing factors; the parameters of the model are estimated from historical data samples. Fuzzy logic approaches from the class of intelligent techniques make do with a high-level model specification expressed with "IF"-"THEN" statements [8,9]. Another group of intelligent approaches are Artificial Neural Networks (ANNs) [10,11]. They require the definition of the neural network to be used and pairs of input and output data. Since they can approximate any function hidden in the data, they are very well suited for tasks where an explicit model description is too complicated or where the underlying function undergoes frequent changes which are difficult to capture, as for example in load forecasting in electric power networks with a high number of renewable sources.
With the advances in the research and application of neural networks, the classical multilayer perceptron networks (MPN) [12,13] are more and more replaced by recurrent or convolutional networks or combinations of these. Recurrent neural networks can represent time dependency by sharing the hidden layers of subsequent time steps. From this category, especially Gated Recurrent Units (GRU) [14], the even more powerful Long Short-Term Memory (LSTM) networks [15], their combined usage with convolutional networks [11,16], and the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) for hyperparameter search [17,18] shall be mentioned. Unlike simple recurrent networks, these architectures do not suffer from the so-called vanishing or exploding gradient problems [19], are to some extent able to memorize longer time series sequences, and therefore provide superior results over MPNs or simple recurrent architectures.
Lately, originating from research on machine translation problems, encoder-decoder architectures are also being applied to the load forecast problem. Ref. [20] used one for the prediction of the heat load. The basic idea is to encode the information required for the forecast execution before passing it to the actual forecast model. This decoupling is crucial for machine translation and nevertheless shows good effects when applied to time series prediction problems. An extension of this architecture is the incorporation of the attention mechanism as introduced by Bahdanau in [21] in the area of natural language processing. The attention approach allows for choosing those encoder states which may be most influential for the prediction of the next decoder state. Ref. [22] applied the attention model with Bahdanau attention [21] to load forecasting using Vanilla, GRU, and LSTM cells, achieving in general superior results over non-attended sequence-to-sequence models. In [23], multi-headed attention is used together with a seasonal decomposition and trend adjustment. Ref. [24] uses the classic combination of an attended encoder-decoder model with GRU cells as proposed by [21]. Additionally, Bayesian optimization is applied to simplify the choice of the hyperparameters.
The goal of the presented approach is the development of an improved attended encoder-decoder architecture and its application to the problem of short-term load forecasting considering the increasing number of renewable sources. Before being passed to the model, the inputs of the encoder and decoder are weighted using a one-dimensional convolutional neural network. This operation allows for filtering out features that have a temporarily lower correlation with the load power to predict. Additionally, a novel online training based at its core on the concept of transfer learning is presented [25–27].
The scientific contribution of the presented paper is therefore an improved attended sequential encoder-stacked decoder model applied to the problem of short-term load prediction with:
• a novel and simplified definition of the attention scoring function
• a novel online training procedure for sequence data on the basis of transfer learning; this training procedure is especially important in the field of very dynamically changing load patterns
• a high accuracy achieved on real data provided by NYISO
• an evaluation against different methods including linear regression, Hidden Markov Models, and different recurrent neural network architectures
In the next Section 2.1, the data set is discussed along with the feature selection. In Sections 2.2 and 2.3, the definitions of a recurrent network, an LSTM, the encoder-decoder model, and attention are compiled. Using these definitions, the method is described in Section 2.4. The results obtained by the proposed approach are evaluated together with the results available from NYISO and with additional benchmark methods in Section 3. The conclusions can be found in the final Section 4.

Data Used
The data used for training and evaluation of the approach are owned by the New York Independent System Operator (NYISO) and can be freely accessed [28]. They have already been used in [29], which makes it possible to compare the results of both approaches. NYISO's data consist of integrated hourly load forecasts and corresponding measurements of active load power. Besides the load information, NYISO additionally offers time series representing the price for power delivery, losses, and congestion in USD as well as forecasts of ambient and wet bulb temperature in Fahrenheit produced by different weather stations.
The NYISO data set is almost complete (with a few missing entries) for all eleven zones. Because it also includes the forecast results of the utility, it is very well suited for research purposes.
The training, test, and evaluation data used in the presented approach contain load, ambient temperature, and wet bulb time series from the years 2013–2015, 2016, and 2017, respectively. The decision not to consider price information was motivated by the very low correlation between load and price data. More details related to the data set can be found in [29].
The most strongly correlated features are the load power and the ambient temperature. However, this relationship differs with the season, as described in [29], and varies strongly depending on the temperature range. Figure 3 shows this relationship, which is non-linear and differs strongly for each day discussed. Because of this, the commonly used sliding window approach for the choice of the training data, as for example in [24], cannot be successful here: consecutive days can have strongly differing load patterns due to changing weather conditions. Instead, the choice of the training data is based on the similar-day approach [12,30]. Before the training, time series with missing entries or outlier values are excluded from the data set. The training, test, and evaluation are executed on z-score normalized data according to Equation (1), with µ the feature mean and σ the feature standard deviation. For the calculation of the accuracy of the method, the data are however transformed back to the original value space.
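The z-score normalization of Equation (1) and the back-transformation to the original value space can be sketched as follows; the load values are illustrative only:

```python
import numpy as np

def z_score(x):
    """Normalize a feature series to zero mean and unit variance (Equation (1))."""
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma, mu, sigma

def inverse_z_score(z, mu, sigma):
    """Map normalized values back to the original space for error calculation."""
    return z * sigma + mu

# illustrative hourly load values in MW
load = np.array([5200.0, 5350.0, 5100.0, 5600.0])
z, mu, sigma = z_score(load)
restored = inverse_z_score(z, mu, sigma)
```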

Recurrent Network with Long Short-Term Memory Cell
Recurrent networks are a group of neural networks developed to deal with temporal aspects in the data [31]. This is achieved by conditioning the hidden state at a time instance t not only on the input for time t but also on the hidden state at the time instance t − 1. In simple recurrent networks, also called Elman networks, three matrices are shared over all time instances: a matrix W which stores the weights connecting the input x_t and the hidden state h_t, a matrix U containing the weights between the hidden states of subsequent time instances, and a matrix V which transforms the hidden states to the network output. Accordingly, the hidden state h_t is obtained from the application of an activation function g on the weighted input and previous hidden state (2). The network output y_t is calculated using an activation function f on the weighted hidden state h_t as specified in (3).
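The Elman recurrence described above can be sketched numerically; the choices of tanh for g and an identity output activation for f, as well as the layer sizes, are illustrative:

```python
import numpy as np

def elman_step(x_t, h_prev, W, U, V):
    """One Elman step: h_t = g(W x_t + U h_{t-1}) and y_t = f(V h_t),
    i.e., Equations (2) and (3) with g = tanh and f = identity."""
    h_t = np.tanh(W @ x_t + U @ h_prev)  # hidden state, Equation (2)
    y_t = V @ h_t                        # network output, Equation (3)
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 1
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # unroll over five time instances
    h, y = elman_step(x, h, W, U, V)
```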
One of the problems of Elman networks is that information inserted at early time instances may be lost at later ones. A solution to that problem are Long Short-Term Memory (LSTM) cells, which can filter out information not required at further time steps and keep the information that may be needed later on. This is achieved by adding a recurrent context layer and several weight matrices and gates in combination with the sigmoid activation function, as presented in Figure 4. A Long Short-Term Memory cell consists of the forget gate (4), the input gate (5), the cell update gate (6), and the output gate (7). The context C_t is obtained as the sum of the pointwise multiplication (⊗) of the context from time instance t − 1 with the forget gate and of the cell update gate with the input gate (8). The output for the time instance, y_t, is calculated through the pointwise multiplication of the context at time instance t and the output gate (9).
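The gate equations (4)–(9) can be written out as a numerical sketch; the weight shapes, the concatenated input layout, and the tanh on the context in the output step follow the standard LSTM formulation rather than being taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, Wf, Wi, Wc, Wo):
    """One LSTM step over the concatenated input [y_{t-1}, x_t]."""
    z = np.concatenate([y_prev, x_t])
    f = sigmoid(Wf @ z)           # forget gate, Equation (4)
    i = sigmoid(Wi @ z)           # input gate, Equation (5)
    g = np.tanh(Wc @ z)           # cell update gate, Equation (6)
    o = sigmoid(Wo @ z)           # output gate, Equation (7)
    c_t = f * c_prev + i * g      # context, Equation (8): pointwise products
    y_t = o * np.tanh(c_t)        # output, Equation (9), standard formulation
    return y_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5
shape = (n_hid, n_hid + n_in)
Wf, Wi, Wc, Wo = (rng.normal(size=shape) for _ in range(4))
y, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):
    y, c = lstm_step(x, y, c, Wf, Wi, Wc, Wo)
```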

Encoder-Decoder and Attention
The encoder-decoder architecture developed in the domain of machine translation is constructed of two separate, mostly recurrent, networks called the encoder and the decoder. Figure 5 shows an example of such an architecture using LSTM networks. The goal of the encoder network is to provide a compressed representation of the input sequence, which is then passed to the decoder as the initial state. The decoder creates the output sequence one element at a time, using the output from the preceding time instance as the input at the following one. Because of this, the encoder-decoder architecture is very well suited for time series problems. The encoder-decoder model is very powerful but has one significant limitation: while encoding the input sequence, the information relevant for the creation of a correctly decoded output often becomes lost. This is solved using the attention mechanism. At its core, the attention concept evaluates a similarity score between each encoder output and the currently produced decoder output. The goal is to draw attention to those parts of the encoder sequence which are most significant for the current decoder output. In [21], first the similarity score between each encoder state stored in h_j and the previous decoder output s_{t−1} is calculated (10). The softmax function is applied to the similarity coefficients (11). From that, the context vector is calculated (12), which is then concatenated with the decoder hidden state of the time instance t − 1.
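Equations (10)–(12) can be sketched with the additive (Bahdanau) score; the matrix names and sizes are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bahdanau_context(H, s_prev, Wa, Ua, va):
    """Context vector over encoder states H (T x N) for decoder state s_{t-1}."""
    # additive score e_j = va^T tanh(Wa s_{t-1} + Ua h_j), Equation (10)
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])
    alpha = softmax(scores)      # attention weights, Equation (11)
    return alpha @ H, alpha      # context vector, Equation (12)

rng = np.random.default_rng(2)
T, N, M = 24, 6, 6               # 24 encoder time steps as in the paper
H = rng.normal(size=(T, N))
s_prev = rng.normal(size=M)
Wa = rng.normal(size=(N, M))
Ua = rng.normal(size=(N, N))
va = rng.normal(size=N)
context, alpha = bahdanau_context(H, s_prev, Wa, Ua, va)
```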
Using these equations, the attention is drawn to those parts of the encoder which are most significant while decoding the i-th element of the sequence. As mentioned in Section 2.1, the training data does not consist of the n most recent days, as it would with the sliding window approach; it contains the n days most similar to the day under forecast. The similarity of two days day_i and day_j is expressed by Equation (13) and considers the weighted Euclidean distance d(day_i, day_j) between the features measured on those days. The most similar days have the smallest distance between the respective features.
The weight allows for larger differences in features that are wider spread and for smaller differences in those which are contained within a narrow interval.
The feature set used for filtering the most similar days consists of:
• daily minimum ambient temperature and wet bulb
• daily maximum ambient temperature and wet bulb
• daily minimum next-day ambient temperature and wet bulb
• load power one hour before the intended forecast start
• the type of the day (working day, weekend, or holiday)
• the length of the day
• the type of the day concerning the ambient temperature (hot, cold, or regular day)
To predict the hourly load curve for the next day, data of the 96 most similar training days is used.
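The similar-day selection can be sketched as follows. The weighting by the inverse feature range is an assumption, chosen to match the description above (wider-spread features tolerate larger differences); Equation (13) may use a different weighting:

```python
import numpy as np

def similar_days(F, i, n):
    """Return the indices of the n days most similar to day i under a weighted
    Euclidean distance in the spirit of Equation (13). F is a (days x features)
    matrix of the daily features listed above."""
    spread = F.max(axis=0) - F.min(axis=0)
    w = 1.0 / np.where(spread > 0, spread, 1.0)       # narrow features weigh more
    d = np.sqrt((((F - F[i]) * w) ** 2).sum(axis=1))  # distance to every day
    order = np.argsort(d)
    return order[order != i][:n]                      # exclude the day itself

rng = np.random.default_rng(3)
F = rng.normal(size=(365, 9))          # 9 illustrative daily features
picked = similar_days(F, i=200, n=96)  # the 96 most similar training days
```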

Application of Encoder-Decoder Architecture
The encoder accepts the inputs for the last 24 h before the forecast day as shown in Figure 6. The decoder is fed with hourly chunks of the time series data for the forecast day (hourly ambient temperature, hourly wet bulb, the type of the day, etc.) and the output of the decoder cell at the previous time instance (except for the first decoded time step). Using the load power alone as encoder input, as done in [24], increased the prediction error; therefore it was not applied here. The encoder and decoder inputs are processed with a 1D convolutional filter to account for the changing relevance of the input data depending on temperature and day type.
The encoder is used as a sequential model, the decoder as a stacked architecture, as shown in Figure 6. During testing, it turned out to be beneficial with respect to the forecast error to apply a separate convolution and attention layer for each decoded instance. This might be related to the varying correlation between the encoder outputs and the decoder output at time t.
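A compressed Keras sketch of the overall layout follows. The layer sizes are assumptions, and the per-step convolution and attention layers of the actual model are collapsed here into shared ones, so this is an outline of the data flow rather than the paper's exact architecture:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

n_feat, units = 8, 32  # assumed feature count per hour and LSTM state size

# encoder: the 24 h before the forecast day, weighted by a 1D convolution
enc_in = layers.Input(shape=(24, n_feat), name="encoder_input")
enc_x = layers.Conv1D(n_feat, kernel_size=3, padding="same")(enc_in)
enc_seq, h, c = layers.LSTM(units, return_sequences=True,
                            return_state=True)(enc_x)

# decoder: hourly features of the forecast day, initialized with the encoder state
dec_in = layers.Input(shape=(24, n_feat), name="decoder_input")
dec_x = layers.Conv1D(n_feat, kernel_size=3, padding="same")(dec_in)
dec_seq = layers.LSTM(units, return_sequences=True)(dec_x,
                                                    initial_state=[h, c])

# attention over the encoder sequence, then a dense layer per decoded hour
ctx = layers.Attention()([dec_seq, enc_seq])
out = layers.Dense(1)(layers.Concatenate()([dec_seq, ctx]))

model = Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="mse")
pred = model.predict([np.zeros((1, 24, n_feat)), np.zeros((1, 24, n_feat))],
                     verbose=0)
```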

Attention Score Function
During the development of the algorithm, a simplified score function showed slightly better results. In Equation (15), the absolute difference between the encoder output h of dimension N and the last decoder output s_{t−1}, adjusted to the dimension of the encoder, is calculated. Afterwards, the softmax function is applied twice, (16) and (17): the first time to map the differences obtained from Equation (15) to the interval [0, 1], and the second time to assign small differences between the decoder and encoder outputs to the upper part of the interval [0, 1] and larger differences to the lower part (17). The final context vector is obtained in (18) as a sum over the time dimension. The decoder state for which the attention is applied is concatenated with the context vector. Finally, a dense layer is used to adjust the size of the concatenated vector to the output produced by the LSTM cell.
In this formulation, only one weight matrix is required instead of the three matrices specified in (10). The forecast error is however similar for both formulations of attention.
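A sketch of the simplified score of Equations (15)–(18) with a single weight matrix. The sign flip inside the second softmax is one plausible reading of Equation (17), chosen so that small encoder/decoder differences receive large attention weights; the exact inversion used in the paper may differ:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def simplified_context(H, s_prev, W):
    """Simplified attention with one weight matrix W adjusting s_{t-1}
    to the encoder dimension. H is the (T x N) encoder output."""
    diff = np.abs(H - W @ s_prev)    # Equation (15): |h_j - W s_{t-1}|
    a = softmax(diff, axis=0)        # Equation (16): map to [0, 1]
    alpha = softmax(-a, axis=0)      # Equation (17): invert the ordering (assumed)
    return (alpha * H).sum(axis=0)   # Equation (18): sum over the time dimension

rng = np.random.default_rng(4)
T, N, M = 24, 6, 6
H = rng.normal(size=(T, N))
s_prev = rng.normal(size=M)
W = rng.normal(size=(N, M))
context = simplified_context(H, s_prev, W)
```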

Online Training-A Piecewise Learning of the Underlying Function
The standard training approach as applied, for example, in natural language processing consists of a choice of representative training and validation data. The goal is to learn the approximation of the underlying function in one training procedure and to use the model for a longer period. The validation data are used to control the progress and the quality of the training. Such a holistic training procedure can be quite time-consuming depending on the complexity of the function to be learned, expressed in the network architecture, and the number of training samples required. Additional problems arise if not enough representative training data are available or if there is a sudden change in the underlying function which is not captured in the chosen training data. In load forecasting, holidays and abnormal (very cold or very hot) days are usually underrepresented, and there is no simple method to create artificial data without prior knowledge of the underlying function. Due to the increasing number of renewables and prosumers, load patterns are subject to further unexpected changes. Therefore, it must be possible to train the network fast with a limited amount of training data.
In the presented approach, transfer learning [25,26] in combination with online training is used to cope with the time-consuming training procedure, the insufficient amount of training data, especially for weekends and hot days, and changing load patterns. Figure 7 shows an overview of the training and inference procedure. The preprocessed training data are loaded. If the forecast for the next day's load power shall be executed, the most similar days are picked. If available, the weights from the previous day are loaded into the model. The encoder-decoder model is trained for 50 epochs in one batch using the ADAM optimizer. Each training sample related to the day day_j is weighted according to its Euclidean distance to the time series under prediction day_i as calculated by Equation (13), shown in (19):

w_train_j = 100 * |(max(d(day_i, day_{1:n})) − d(day_i, day_j)) / max(d(day_i, day_{1:n}))|   (19)

No validation data are used, but an early stopping criterion is applied if the value of the loss function falls below a threshold of 0.0001. The approximated piece of the underlying function follows the data in the training data set. The prediction results are decoded and returned. The weights are stored so that they can be reused as a starting point for the next training, and the data of the forecasted day are added to the training data set. Using this approach, the training time is distributed in small chunks over each day to be forecasted. Additionally, the most recent data can easily be included in the training procedure, capturing the most recent changes in the load pattern.
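The daily training loop can be sketched as follows. The tiny placeholder model, data, and file name are assumptions standing in for the encoder-decoder model, while the 50 epochs, the single batch, the loss threshold of 0.0001, and the sample weights of Equation (19) follow the text:

```python
import os
import numpy as np
import tensorflow as tf

def training_weights(d):
    """Sample weights per Equation (19): the closer a day, the larger its weight."""
    return 100.0 * np.abs((d.max() - d) / d.max())

class LossThresholdStop(tf.keras.callbacks.Callback):
    """Early stopping once the training loss falls below 0.0001."""
    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", np.inf) < 1e-4:
            self.model.stop_training = True

# placeholder model standing in for the encoder-decoder architecture
model = tf.keras.Sequential([tf.keras.layers.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

weights_file = "previous_day.weights.h5"    # placeholder file name
if os.path.exists(weights_file):
    model.load_weights(weights_file)        # transfer from the previous day

X = np.random.default_rng(5).normal(size=(96, 4))  # 96 most similar days
y = X.sum(axis=1, keepdims=True)
d = np.linalg.norm(X - X[0], axis=1) + 1.0         # stand-in for Equation (13)

model.fit(X, y, epochs=50, batch_size=96,          # 50 epochs, one batch, ADAM
          sample_weight=training_weights(d),
          callbacks=[LossThresholdStop()], verbose=0)
model.save_weights(weights_file)                   # starting point for tomorrow
```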

Results
The evaluation of the approach has been executed on the data from 2017, and it includes all NYISO zones. Table 1 compiles the name, the abbreviation, and the average load for each zone. The power consumption varies because the zones differ concerning the size of the area, the number of residents, and the type of housing (cities, villages, rural areas). New York City is the zone with the highest number of residents; the power consumption for that area is accordingly high. Mohawk Valley, on the other hand, has a relatively large area, a small number of residents, and therefore a significantly smaller load [28]. The results obtained from the sequential encoder-stacked decoder with attention (SESDA) are evaluated together with the results of three other methods. The first of them is the Hidden Markov Model approach from [29]; the Hidden Markov Models used there were created online directly from the data without any training procedure. Additionally, the results of NYISO are taken into account [28,32]. The third benchmark is a linear regression method from [3] named Tao's Vanilla Benchmark. It has already served as a benchmark for the GEFCom2012 load forecast competition and was among the best 25% of results of that competition [33].
The prediction error is measured by the Mean Absolute Percentage Error (MAPE) specified in Equation (20), with M the measured and P the predicted value. Figure 8 compiles the forecast error for 2017 delivered by all considered methods for each NYISO load zone. The attended sequential encoder-stacked decoder (SESDA) achieves the best results for all zones. It is, however, closely followed by the Hidden Markov Model without training. The NYISO approach, which combines regression with the usage of neural networks, outperforms Tao's Vanilla Benchmark but seems to have some problems with the low-load zone MHK VL. For this zone, all approaches deliver a relatively high error, which may be related to the quite wide area (15,230 square kilometers) and small population of that zone. Figure 9 compares the daily MAPE for 2017 in New York City calculated using the two best approaches: SESDA and HMM. The SESDA approach returns a smaller error. The largest MAPE value for HMM is around 20%, around the 139th day of the year. Generally, the highest HMM errors are concentrated between May and September. The highest attended sequential encoder-stacked decoder error is around 10%; it occurs on the holidays of 4 July 2017 and 25 December 2017. Besides the higher average forecast error, the HMM approach also delivers more and higher error peaks on problematic days. Table 2 shows the evaluation of different ANN approaches on the NYC data set. Among all approaches, the best results are achieved with the sequential encoder and stacked decoder architecture. The sequential encoder-decoder with attention performs only a little better than the sequential encoder-decoder without attention because it shares the attention layer across all time instances. The sequential encoder-decoder without attention performs only slightly better than the LSTM network; the reason is the overly strong compression of the context vector in a 24-instance encoder.
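The MAPE of Equation (20) in code, on illustrative numbers:

```python
import numpy as np

def mape(measured, predicted):
    """Mean Absolute Percentage Error, Equation (20), in percent."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((m - p) / m))

# illustrative hourly load values in MW
measured = [5200.0, 5400.0, 5000.0, 5250.0]
predicted = [5100.0, 5500.0, 5050.0, 5250.0]
err = mape(measured, predicted)
```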
The LSTM network used for the evaluation consists of 24 inputs, with the 24th input being the predicted one; the model is trained for each prediction hour separately. The Nonlinear Autoregressive Network with Exogenous inputs (NARX) [19] shows the worst results. It has been implemented using 24 dense networks, one for each prediction hour.
The encoder-decoder method has been evaluated on a personal computer with an Intel Core i5-6300U CPU@2.40 GHz and 32 GB RAM. For the implementation, the Python programming language along with the TensorFlow [34] and Keras [35] libraries has been used. The average execution time for a 24 h ahead forecast was around 10 s (including loading the test data, picking the most similar data, loading the weights into the model, training, and inference). The required amount of time is around 5 times larger than for the HMM approach [29].

Conclusions
The presented approach uses a sequential encoder-stacked decoder architecture in combination with attention (SESDA) to predict load power 24 h ahead. The training data are collected using the smallest Euclidean distance between the daily features of the forecasted day and the days in the past. For each forecast day, a fast online training is executed using the filtered most similar past data. The features included in the encoder-decoder model range from hourly weather parameters and calendar data to the load demand from the previous hour.
The algorithm achieves the best results in comparison with the benchmark methods, which include Linear Regression, a combination of Linear Regression with Neural Networks, and the Hidden Markov Model approach. Although the difference between the MAPE values of the HMM and the encoder-decoder model is not as large as the difference between the HMM and Linear Regression, the HMM seems less stable than the proposed architecture. However, the error reduction comes at the cost of increased forecast execution time.
One limitation of the algorithm is the required availability of pairs of consecutive daily data for the training, due to the network architecture (encoder-decoder model). For data with many gaps spanning whole days, the algorithm may perform less successfully. Additionally, the algorithm requires on average 96 historical time series, including the previous days, for the training. If this requirement is not fulfilled, the prediction error will increase depending on the amount of data provided. However, today's utilities mostly have access to the required amount of data.
The approach can be extended to support longer time horizons. However, in this case some modifications of the architecture must be applied. First of all, the length of the encoded sequence must be adjusted to the length of the forecast sequence. Additionally, a self-attention mechanism inside the decoder has to be used to consider the impact of the preceding predicted values on the following ones over the longer time horizon.
In future work, the authors will turn their attention to the application of reinforcement learning to the area of short-term load forecasting.

Conflicts of Interest:
The authors declare no conflict of interest.