Deep Learning for Short-Term Load Forecasting—Industrial Consumer Case Study

Abstract: In the current consumption trend, electricity will become a very high cost for end-users. Consumers acquire energy from suppliers who use short, medium


Introduction
Types of electric load forecasting techniques fall into three main categories according to the forecast horizon: STLF (short-term load forecasting), MTLF (medium-term load forecasting), and LTLF (long-term load forecasting). The authors in [1] present a classification based on the data frame used before the forecast, including very short-term load forecasting, and a framework based on several points predicted into the future. In our study, we implement short-term forecasting for 24 steps (hours). Load forecasting (LF) can offer great value if the process can be automated and operated without human intervention.
The approach presented in this article for 24 h ahead forecasting facilitates access to the DAM (day-ahead market) and the intra-day market to minimize the difference between real and forecasted values. This difference must be settled on the balancing market, which represents a financial problem for the supplier. Most large non-residential consumers have a single tariff (price/MWh); few electricity suppliers offer time-differentiated prices and time-of-use tariffs. This single tariff is calculated to cover the expenses with portfolio balancing. Every supplier needs to balance its portfolio of clients by being a balancing responsible party (BRP) or by submitting this responsibility to other parties. Either way, balancing a portfolio is a challenging task for every supplier.
In the literature, forecasting falls into two main categories, illustrated in Figure 1: qualitative and quantitative. The work developed in [2,3] is a comprehensive and detailed review on selecting forecasting methods from a business standpoint. Electricity load forecasting rests on the same foundational principles, with a few particularities. Qualitative techniques apply the empirical knowledge of LF experts to make a forecast. Quantitative techniques are better for short-term forecasting and consist of (i) time series forecasting (TSF) and (ii) causal forecasting (CF). In time series forecasting, the historical data is a set of chronologically ordered observation points X_t recorded in time, observed at times t = 1, 2, ..., n (X_1, X_2, ..., X_n). In contrast to CF, TSF exploits the natural ordering of the data points. TSF helps understand components such as patterns, trends, peak values, and any irregularity or variation in the series of data.
The importance of load forecasting is highlighted by many papers, such as [5,6], which present detailed literature reviews. Table 1 highlights the electricity market participants who can benefit from forecasting.

Literature Review
A plethora of approaches consisting of time series analysis, regression, smoothing techniques, artificial intelligence, artificial neural networks, machine learning, deep learning, reinforcement learning, and various hybrid methods can make this area of research overwhelming. Some authors suggest that established models are better [7][8][9], presenting evidence that complexity harms accuracy. The authors in [10] propose the Golden Rule to provide a unifying theory of forecasting, while others embed multiple algorithms to build hybrid methods combining characteristics of traditional statistics and machine learning. There is truth on both sides; some algorithms will work better or worse depending on the historical data or the period of application. The forecasting objectives are to minimize errors and improve economic activity: revenue, profit, and higher customer satisfaction. Low-error forecasts are of no inherent value if ignored by the industry or otherwise not used to improve organizational performance. The forecasting competitions presented in [11,12] are one of the best ways to compare algorithms on reliable historical data and point out the results. In multiple cases, the recurrent neural network (RNN) architecture stands out as a stable algorithm. The work done in [13] presents an extensive experimental study using seven popular DL architectures and found that LSTM is the most robust type of recurrent network; while LSTM provides the best forecasting accuracy, convolutional neural networks (CNN) are more efficient and suffer less variability of results. In this paper, various RNN networks are applied for industrial load forecasting and analyzed to establish the best architecture for deep recurrent neural networks. In article [14], the authors point out that moving from a simple RNN to GRU and LSTM increases the number of parameters, a conclusion also presented in our article.
For 24 h ahead forecasting of commercial building data, the authors concluded that the DNN model achieved worse results than the sequence-to-sequence RNN models. The authors in [15] present a simple recurrent neural network for the one-hour-ahead prediction of residential electric load. The model takes as inputs weather data as well as data related to electricity consumption. The percentage error calculated for a one-week test is 1.5% for the mean error and 4.6% for the maximum error. The difference between industrial load and residential usage is that the latter is highly dependent on weather data, and its daily patterns are more repetitive. In our article, exogenous variables such as temperature, humidity, and dew point are used in forecasting, because the industrial processes analyzed are influenced by these variables. Day-ahead forecasting of hourly large-city load based on deep learning is studied in [16] with a novel flexible architecture that integrates multiple input features processed by different types of neural network components according to their specific characteristics. The authors implemented multiple parallel CNN components with different filter sizes to introduce a parallel structure into the DNN model instead of stacking DNN layers. The proposed architecture (MAPE: 1.405%) outperformed the CNN-LSTM (MAPE: 1.475%) and the DNN (MAPE: 1.665%). Another approach based on RNN and CNN is proposed in [17], consisting of convolutional layers and bidirectional LSTM and GRU recurrent layers to predict the next-hour utility load. The results of experiments on two datasets (0.67% and 0.36% MAPE) demonstrate that the proposed model outperforms the conventional GRU and LSTM models. In this article, we found that the deep GRU network performs better than the combined GRU + LSTM network. A comprehensive comparison performed by the authors in [18] concludes that RNNs require more resources than traditional models, but perform better.
We reached similar findings in our article: the GRU unit is simpler than the LSTM unit, as well as faster in computations. That article presents that overall the LSTM performs better than the GRU, which contradicts the results for short-term load forecasting presented in our article. The authors in [19] compare different variations of the LSTM algorithm and conclude that the longer the historical data available for training, the better the load forecasting accuracy. For building loads, the day-ahead forecasting errors show up to 45% improvement using RNNs (LSTM, LSTM with attention, BiLSTM, BiLSTM with attention) in comparison with other state-of-the-art forecasting techniques.

Materials and Methods
The forecasting methods implemented in this paper use hourly data (Figure 2) from an industrial company active in the wood processing industry for an entire year (2019). The power supply for the factory is provided through twelve power transformers totaling 12.6 MVA. The following technological processes, machinery, and equipment determine the electricity consumption forecasted in this article:
• Installations that serve the equipment for cutting and exhaust;
• Installations that serve the cooling system to ensure the necessary cold to keep in optimal conditions the substances used in the foaming process;
• Installations that serve the processing and cutting of sponges.
For the implementation, Tensorflow [20] was used for the deep learning applications. Keras [21] is a high-level API, open-source library for machine learning that works on top of Tensorflow. For the data preparation and visualization of the results, Scikit-learn [22], Numpy [23], and Seaborn [24] were used. The simulations were computed on a PC with an Intel(R) Core(TM) i5-4690K CPU@3.5 GHz, 16 GB RAM, 64-bit operating system, x64-based processor.
The industrial consumer analyzed is a furniture factory comprising all the technological processes necessary to manufacture furniture starting from raw wood, mainly using electric drives. The consumer's energy needs are electricity and wood scraps. Production of heat and hot water relies on burning the wood remaining from the technological processes. The heating in the winter period for the office building and factory production facilities is achieved with electric heaters, which influence the consumption in the winter period together with the lighting systems (the work schedule is in three shifts). High electricity consumption is driven by large ventilated storage halls used for the thermal preparation of the raw wood. A correlation between the electric load and outdoor temperature, dew point, and humidity is observed. Working and non-working days do not share the same load patterns because factory planning is highly dependent on the production quota. A Dickey-Fuller test [25] applied to the yearly load time series supports the null hypothesis and hence the non-stationarity of the time series. Reliable linear dependencies between the exogenous variables and consumption could not be established, and deep learning became an option to explore for nonlinear dependencies. Of all the algorithms implemented in this article, variations of RNN (LSTM, GRU, GRU-LSTM), the GRU algorithm offered the best forecasting result. For this reason, we analyzed which GRU structure is best for our particular problem.

Deep Learning (DL)
There is a vast spectrum of terminology that tends to be confusing because the terms are used interchangeably: artificial intelligence, machine learning, deep learning, artificial neural networks, or reinforcement learning. Machine learning is considered a subdomain of artificial intelligence [26]. Deep learning is a subdomain of machine learning, and neural networks are at the core of deep learning algorithms. The dissimilarity between a simple neural network and a deep learning algorithm is the number of neurons and the structure of the hidden layers (deep learning must have more than two hidden layers). ML techniques can be broadly grouped into two large sets: supervised and unsupervised. The methods related to the supervised learning paradigm classify objects in a pool using a set of known annotations/attributes/features. The unsupervised learning techniques form groups among the objects in a batch by identifying similarities and then use them for classifying the unknowns. Reinforcement learning is a behavioral algorithm similar to supervised learning, but instead of using sample data for training it learns by trial and error. A sequence of successful outcomes develops the best recommendation or policy for a given problem.
DL models were developed to map a complex function between the last "n" hours (timesteps, also called lag) and predict how the time series can continue in the future, as presented in Figure 3. Most machine learning algorithms have hyperparameters; by setting these, the ML algorithm can offer the desired results. The values of hyperparameters should not be calculated in the learning stage (because of the overfitting problem). To evaluate the generalization of the DL methods beyond the training data, we use a testing set of time series that the network built in the training stage has not seen before. In our work we use deep recurrent neural networks (DRNN) and variations of the algorithm. An RNN is a neural network for processing sequential data because it has internal memory to update the state of each neuron in the network with the previous input. Because RNNs train with backpropagation, training can fail because of vanishing gradients. Deep networks combine multiple layers into the architecture and can provide more significant benefits.
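The lag-to-horizon mapping described above (the last n hours in, the next steps out) amounts to a sliding window over the series; a minimal sketch with illustrative sizes, not the article's actual preprocessing code:

```python
import numpy as np

def make_windows(series, lag, horizon):
    """Slice a 1-D series into (input, target) pairs:
    each sample maps the last `lag` points to the next `horizon` points."""
    X, y = [], []
    for i in range(len(series) - lag - horizon + 1):
        X.append(series[i : i + lag])
        y.append(series[i + lag : i + lag + horizon])
    return np.array(X), np.array(y)

# Toy example: 10 days of hourly data, 48-hour lag, 24-hour horizon
load = np.arange(240, dtype=float)
X, y = make_windows(load, lag=48, horizon=24)
print(X.shape, y.shape)  # (169, 48) (169, 24)
```

In the article the lag is two weeks of hourly data and the horizon is 24 h; the same slicing applies with `lag=336, horizon=24`.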
Neural networks build functions by multiplying a weight matrix with the input vector, adding a bias, and then applying the activation function to obtain non-linearity in the output. To calculate the current state, we can use the following Formula (1):

h_t = tanh(U x_t + W h_(t−1) + b) (1)

In the equation above, the parameters θ include W, U, and b. W and U are weight matrices, and b is the bias vector. The hyperbolic tangent tanh is the activation function for the hidden state; other activation functions could be used. The output of the RNN cell is:

y_t = g(V h_t + c) (2)

where V and c denote the weight and bias, the parameters θ of the output function g.
Matrix V and vector c allow multidimensional outputs. The same set of parameters is applied at each time step for every RNN cell [27]. LSTM was introduced by [28] to mitigate the vanishing or exploding gradient problem and has become one of the most popular RNN architectures to date. GRUs were later introduced by [29] as a simpler alternative and have also become quite popular. We will use both architectures in the context of the vanishing or exploding gradient problem. Many variants of LSTM and GRU exist in the literature, and even the default implementations in various deep learning frameworks often differ. Performance is often similar, but this can cause confusion when reproducing results. The study proposed by [30] ranked MLP first in terms of forecasting performance, better than Support Vector Regression, RF, ARIMA, and RNN. The RNN is among the less accurate ML-based methods in this study, but the authors did not try a variation of RNN.
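A minimal NumPy sketch of Formulas (1) and (2), with arbitrary dimensions and the output function g taken as the identity; the weights below are random illustrations, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 3, 5, 1  # input, hidden, and output sizes (arbitrary for the sketch)

U = rng.normal(size=(h, d))   # input-to-hidden weights
W = rng.normal(size=(h, h))   # hidden-to-hidden weights
V = rng.normal(size=(o, h))   # hidden-to-output weights
b = np.zeros(h)               # hidden bias
c = np.zeros(o)               # output bias

def rnn_step(x_t, h_prev):
    # Hidden-state update, Formula (1): tanh non-linearity
    h_t = np.tanh(U @ x_t + W @ h_prev + b)
    # Cell output, Formula (2); g is the identity here
    y_t = V @ h_t + c
    return h_t, y_t

h_t = np.zeros(h)
for x_t in rng.normal(size=(4, d)):  # unroll over 4 time steps, shared parameters
    h_t, y_t = rnn_step(x_t, h_t)
print(h_t.shape, y_t.shape)  # (5,) (1,)
```

The loop illustrates the point made in [27]: the same θ = (W, U, V, b, c) is reused at every time step.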
The results presented in [31] for air quality prediction show that the LSTM and the CNN-LSTM generally perform better in multi-hour forecasting than other ML algorithms. For residential load with spatial and temporal features, the authors in [32] obtained the best performance, an MSE (mean square error) of 0.37, with a CNN-LSTM, better than LSTM, GRU, Bi-LSTM, and Attention LSTM. In [33], the authors showcase that a CNN-LSTM model gives the lowest values of MAE, RMSE, and MAPE compared to LSTM, RBFN, and XGBoost models. An average MAPE of 3.22% is obtained for a 24 h ahead forecast of national consumption.

Gated Recurrent Units (GRU)
Many articles, such as [34], present a general literature review of ML, and the work done in [35][36][37] details comprehensive reviews of DL algorithms used for forecasting. The GRU combines the input gate and forget gate of the LSTM into an update gate Z_t, and the output gate of the LSTM becomes a reset gate R_t in the GRU [38], as shown in Figure 4. The difference between GRUs and a simple RNN is the implementation of gating for the hidden state, which uses a few steps to determine when the hidden state needs to be updated and when to be reset.
The input is a sequence of data: for a given time step t, X_t ∈ R^(n×d) (n is the number of sequences, d is the number of inputs), and the hidden state of the previous time step is H_(t−1) ∈ R^(n×h) (h is the number of hidden units). Then, the reset gate R_t ∈ R^(n×h) and the update gate Z_t ∈ R^(n×h) are implemented as follows in Equation (3):

R_t = σ(X_t W_xr + H_(t−1) W_hr + b_r), Z_t = σ(X_t W_xz + H_(t−1) W_hz + b_z) (3)

where W_xr, W_xz ∈ R^(d×h) and W_hr, W_hz ∈ R^(h×h) represent weight parameters and b_r, b_z ∈ R^(1×h) are biases. In Equation (4), the reset gate R_t updates the candidate hidden state H̃_t ∈ R^(n×h) at time step t:

H̃_t = tanh(X_t W_xh + (R_t ⊙ H_(t−1)) W_hh + b_h) (4)

where W_xh ∈ R^(d×h) and W_hh ∈ R^(h×h) are weight parameters, b_h ∈ R^(1×h) is the bias, and the symbol ⊙ is the Hadamard (elementwise) product operator. For the nonlinearity of the values in the candidate hidden state, H̃_t uses tanh to maintain the values in the interval (−1,1). The hidden state at time step t, H_t ∈ R^(n×h), is a combination of the previous hidden state H_(t−1) and the current candidate hidden state, as presented in Equation (5):

H_t = Z_t ⊙ H_(t−1) + (1 − Z_t) ⊙ H̃_t (5)

The activation functions used in the GRU cell are the sigmoid and the hyperbolic tangent, Equation (6):

σ(x) = 1/(1 + e^(−x)), tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)) (6)

An RNN model was implemented by [39] for forecasting non-residential loads (catering, electronics, and hotel industry); the authors concluded that predicting each consumer is not as accurate as forecasting substation loads, and obtained the best results using LSTM, with MAPE ranging from 15.45% to 19.57%. An extreme gradient boosting regressor (XGBoost) for STLF was implemented by [40] and achieved a MAPE of 3.74% for a horizon of one week for substation loads, with hourly steps, a total of 168 h. For one-step forecasting, [41] presents a model based on FFNN and LSTM for air compressor electricity usage. The authors in [42] have also used LSTM for a nonlinear, non-stationary, and non-seasonal univariate electric load time series over 96 steps ahead with a MAPE of 5.35%, but do not mention what type of load is forecasted.
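The gate equations above translate directly into NumPy; the dimensions and random weights below are illustrations of one step of an untrained cell, not the article's network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, H_prev, p):
    """One GRU step following Equations (3)-(5); * is the Hadamard product."""
    R_t = sigmoid(X_t @ p["Wxr"] + H_prev @ p["Whr"] + p["br"])   # reset gate
    Z_t = sigmoid(X_t @ p["Wxz"] + H_prev @ p["Whz"] + p["bz"])   # update gate
    H_cand = np.tanh(X_t @ p["Wxh"] + (R_t * H_prev) @ p["Whh"] + p["bh"])
    H_t = Z_t * H_prev + (1.0 - Z_t) * H_cand                      # Equation (5)
    return H_t

rng = np.random.default_rng(1)
n, d, h = 2, 4, 6  # sequences, inputs, hidden units (arbitrary)
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "Wxr": (d, h), "Whr": (h, h), "br": (h,),
    "Wxz": (d, h), "Whz": (h, h), "bz": (h,),
    "Wxh": (d, h), "Whh": (h, h), "bh": (h,)}.items()}

H = gru_step(rng.normal(size=(n, d)), np.zeros((n, h)), p)
print(H.shape)  # (2, 6)
```

Note how Z_t close to 1 copies the old state forward (long-term memory), while Z_t close to 0 overwrites it with the candidate state.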
As mentioned above, the RNN model fails due to vanishing gradients, so LSTM and GRU are designed to compensate for this failure by using gates such as the ones in Figure 4. The GRU is designed to provide a longer-term memory [43].
Multiple hidden RNN layers can be stacked on top of one another. The main reason for stacking is to allow for greater model complexity. Deep RNNs can work better than shallower networks: as presented in [44], a multiple-layer deep architecture was better for machine translation in an encoder-decoder framework, and the authors in [45] also showed improved results by using an architecture with several stacked recurrent layers for RNN.

Proposed Methodology
This work uses DRNNs to predict the hourly variations of the electricity consumption, using weather, type of day, day of week, and an autoregressive variable (AR) as inputs in the training and testing datasets. The historical data used as input in the networks was selected based on the observed results. The best results were obtained using the past two weeks of hourly consumption as input in the neural networks; shorter lag periods increased the MAPE. This means that daily patterns exist and repeat weekly. For the AR method, the lag was selected based on the p-value analysis.
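The p-value-based lag selection can be sketched with an ordinary-least-squares fit; the series, lag candidates, and 0.05 threshold below are illustrative (the article's actual regressors come from Table 2):

```python
import numpy as np
from scipy import stats

def ar_pvalues(y, lags):
    """OLS fit of y_t on the given lags; returns coefficients and p-values."""
    T, m = len(y), max(lags)
    X = np.column_stack([np.ones(T - m)] + [y[m - L : T - L] for L in lags])
    target = y[m:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    dof = len(target) - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), dof)
    return beta[1:], pvals[1:]  # drop the intercept

# Toy series with a strong 24-hour pattern; keep lags with p < 0.05
rng = np.random.default_rng(2)
t = np.arange(24 * 28, dtype=float)  # four weeks of hourly data
y = np.sin(2 * np.pi * t / 24) + 0.05 * rng.normal(size=t.size)
lags = [24, 7]
coef, pvals = ar_pvalues(y, lags)
kept = [L for L, p in zip(lags, pvals) if p < 0.05]
print(kept)
```

As in the article, any lag whose p-value exceeds 0.05 is dropped from the regression equation before the AR forecast is produced.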
The autoregressive method (AR) is a regression implementation for time series data that predicts future values based on past correlations. Two weeks of hourly data were analysed, and based on the p-values, the relevant past data were used in the forecast, Equation (7). To forecast hours (h_1..24) of day (d + 1), we consider the same hours (h_1..24) of the past 14 days. Based on the regression p-value scoring presented in Table 2, we keep the past hours that are relevant for the regression, Equation (7). All the previous steps with a p-value score greater than 0.05 are removed from the regression equation. The forecast Ŷ_t generated by AR(9) is given as input in the training of the GRU network. Recurrent neural networks consisting of more than two hidden layers are called deep recurrent neural networks (DRNN). The main feature of a DRNN is that each hidden state is continuously passed to the next time step of the current layer and to the next layer of the current time step. The hidden state of hidden layer l is H_t^(l) ∈ R^(n×h), and the output layer variable is Y_t ∈ R^(n×o). The hidden state of a hidden layer is computed with Equation (8) and the output of the DRNN with Equation (9):

H_t^(l) = φ(H_t^(l−1) W_xh^(l) + H_(t−1)^(l) W_hh^(l) + b_h^(l)) (8)

Y_t = H_t^(L) W_ho + b_o (9)
where:
o: number of outputs;
W_xh^(l), W_hh^(l) ∈ R^(h×h) and W_ho ∈ R^(h×o): weight parameters;
b_h ∈ R^(1×h) and b_o ∈ R^(1×o): bias parameters.
Considering the mentioned equations, we present in Figure 5 the implementation framework for the deep learning algorithms. All the training variables are used to learn long-term dependencies. In the testing phase, the DRNN model takes as inputs the past three days of hourly consumption, the AR(9) hourly forecast for day (d + 1), and the exogenous variables for day (d + 1). The output represents the hourly forecast for the next day (24 h).
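Equations (8) and (9) can be sketched as a plain NumPy forward pass of a stacked (deep) RNN, with tanh standing in for the activation φ; layer sizes and weights below are illustrative, not the trained network of the article:

```python
import numpy as np

def drnn_forward(X, params):
    """Forward pass of a deep RNN: each layer's hidden state feeds both the
    next time step of that layer (Eq. (8)) and the next layer up; the top
    layer produces the output (Eq. (9))."""
    n_layers = len(params["Wxh"])
    T, n, _ = X.shape
    H = [np.zeros((n, params["Whh"][l].shape[0])) for l in range(n_layers)]
    for t in range(T):
        inp = X[t]
        for l in range(n_layers):
            H[l] = np.tanh(inp @ params["Wxh"][l]
                           + H[l] @ params["Whh"][l] + params["bh"][l])
            inp = H[l]  # this layer's state is the next layer's input
    return H[-1] @ params["Who"] + params["bo"]  # Equation (9)

rng = np.random.default_rng(3)
d, h, o, L = 4, 8, 24, 3  # inputs, hidden units, outputs, stacked layers
params = {
    "Wxh": [rng.normal(scale=0.1, size=(d if l == 0 else h, h)) for l in range(L)],
    "Whh": [rng.normal(scale=0.1, size=(h, h)) for _ in range(L)],
    "bh":  [np.zeros(h) for _ in range(L)],
    "Who": rng.normal(scale=0.1, size=(h, o)),
    "bo":  np.zeros(o),
}
Y = drnn_forward(rng.normal(size=(48, 2, d)), params)  # 48 time steps, batch of 2
print(Y.shape)  # (2, 24)
```

The 24-dimensional output matches the next-day hourly horizon used in this work; in practice the recurrent cells are GRU or LSTM units rather than the simple tanh cell shown here.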
Forecasting on the test dataset is implemented in a day-ahead approach with the variables presented in Figure 5. We forecast each day once, for 24 h, and compare it with the actual consumption. This process is repeated for the entire test dataset without tuning the DRNN algorithms or the other ML algorithms used in this work.
The exogenous variables considered in forecasting have a direct influence on electricity consumption.
Figure 5. The proposed framework for the implementation of deep learning for hourly load curves.

Results
The scope of this work is to identify the best solution for industrial load forecasting. For this reason, several algorithms were analyzed and implemented using the tools mentioned in Section 3 to have a solid comparison standpoint. For the evaluation of the forecast, the metrics in Table 3 are applied to the results obtained on the test dataset.

Table 3. Forecast error metrics.

The forecasting methods used in this article are presented in Table 4, together with the hyperparameters for each ML algorithm. The AR method is implemented as well, to showcase the load forecast from a traditional perspective. The standalone AR method for forecasting has the same order as the one used in the hybrid method.

Table 4. Parameters used by the forecast methods implemented in this work.
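Table 4 is not reproduced here, but the error metrics used throughout the article (MAPE, MAE, RMSE) have standard definitions; a minimal sketch, assuming Table 3 uses these standard forms, with toy values for illustration:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in %."""
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

def mae(actual, forecast):
    """Mean absolute error, in the units of the load."""
    return np.mean(np.abs(actual - forecast))

def rmse(actual, forecast):
    """Root mean square error; penalizes large deviations more than MAE."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 190.0, 380.0])
print(mape(actual, forecast), mae(actual, forecast), rmse(actual, forecast))
```

For the day-ahead evaluation in this work, each metric is computed over the 24 hourly values of a day and then over the entire testing period.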

AR
Autoregressive prediction for each hour is based on the same hour in the past 14 days; the coefficients are presented in Table 2. MLP: Multi-Layer Perceptron, with input matrix [24,11].
The calculated errors according to Table 3 are presented in Table 5 for each algorithm implemented, highlighting the best results. The results represent the lowest error obtained from multiple runs for the ML architectures considered in this study. The errors are calculated on an hourly basis and for the entire testing period. The GRU model that uses the features mentioned in Table 4 as input obtains the best MAPE, of 4.82%, better than all the other DL algorithms. The proposed approach improved the GRU method by 6.48% and by 11.88% compared to the AR method. The LSTM method scored a higher error in comparison with the GRU, close to the AR method. The AR result of 5.53% MAPE illustrates that the hourly load curve has a strong repetitive pattern and that the LSTM is overfitting on the training data. The MLP and the simple RNN scored higher errors than the AR, and the LSTM encoder-decoder has the highest error, 6.28%. In Figure 6, a normal working week (Monday to Friday) is presented to highlight the improvement obtained by using the proposed methodology. It can be observed that the AR component helps when the behaviour of the consumer is repetitive and improves the forecast. Other ML algorithms have good results, but none scores better than the GRU. In Figure 7, high variations can be observed in the peak hours. The simple RNN scored the highest MAPE, 6.63%, even worse than the autoregressive model. The LSTM encoder-decoder, a sequence-to-sequence learning algorithm, performed worse than the LSTM by 14.89% because more trainable weights are required for each of the encoder's time steps. The consequence is a large number of parameters if the input data for the encoder is a long time series, even more parameters than the LSTM.
A complex network needs a longer time for training, and underfitting can cause the LSTM encoder-decoder to perform worse than the LSTM. Because we implemented a similar structure for all the RNN networks to offer a solid comparison benchmark, we can conclude that the LSTM encoder-decoder could improve the forecasting results if the complex network were trained for a longer period with more powerful hardware resources. In article [46], the authors compare LSTM with GRU on text datasets and conclude through empirical research that the advantage of GRU is relevant in the scenario of small datasets; in other scenarios, the performance gap between GRU and LSTM decreases. GRU can forget and choose memory with one gate and fewer parameters, while LSTM needs more gates and more parameters to complete the same task. For our scenario of short-term industrial load forecasting, the GRU network offers better results than the LSTM (Figure 8) and than the combined GRU and LSTM network (Figure 9). Because the same architecture was used for all the RNN networks in the training stage, the LSTM is underfitted in the training stage. Building a larger network and training for more epochs did not improve the errors, and the network failed due to overfitting on the training data. The LSTM-GRU network improves the overall errors because of the GRU layer, but there is a clear pattern of high variations for peak values and the off-peak period. The authors in [47] could not find a clear difference between LSTM and GRU and suggest that the selection of the type of gated recurrent unit depends on the dataset and the corresponding task. In our case of industrial load forecasting, the results are clearly in favor of GRU. For a better understanding of the evolution of the GRU results over the testing period, in Figure 10 the actual and forecasted load curves can be correlated with the daily MAPE. The best daily MAPE is 1.58%, but the worst error value is 25.38%.
The high error was caused by two legal holiday periods from 30 November to 1 December 2019. The GRU fails to correctly identify the evolution of the actual consumption because this period is not seen by the network in the training dataset, and the GRU cannot generalize to such a situation. On top of this, during the previous weekend an uncommon event occurred, due to probable losses in the compressed air system. The process of using DL algorithms is stochastic, and with different MAPE values scored using the same architecture, it is difficult to fine-tune the hyperparameters of the neural networks to generalize efficiently on the training data. For this reason, in Figure 11 various implemented architectures are analyzed for the best method implemented in this article. The results indicate that using a complex architecture can lead to higher errors due to overfitting the training data. The lowest forecasting error scored by the GRU model (MAPE 4.82%) was reached using a network with three hidden layers (GRU 24|3|100|100|48|24). It can be observed that while the training MAPE keeps decreasing, the test error increases, indicating that the network is overfitting and fails to generalize on the new data. This situation needs to be quantified to find the structure that offers the lowest errors.
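The train/test divergence described above can be monitored with a simple early-stopping rule; the sketch below picks the epoch with the lowest validation MAPE (the error values are illustrative, not the article's actual curves):

```python
def best_epoch(val_errors, patience=5):
    """Return (epoch, error) of the lowest validation error, stopping the
    search once `patience` epochs pass without improvement (early stopping)."""
    best, best_i, waited = float("inf"), 0, 0
    for i, e in enumerate(val_errors):
        if e < best:
            best, best_i, waited = e, i, 0
        else:
            waited += 1
            if waited >= patience:
                break  # training error may still fall, but we are overfitting
    return best_i, best

# Training error keeps falling, but validation MAPE turns upward after epoch 3
val_mape = [7.1, 6.0, 5.2, 4.82, 5.0, 5.3, 5.6, 5.9, 6.4, 6.8]
print(best_epoch(val_mape))  # (3, 4.82)
```

A rule of this kind makes the "stop before overfitting" decision explicit instead of relying on visual inspection of Figure 11.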
A simple indicator, DL_index, was defined in Equation (10) to quantify the complexity of the DL algorithm and to compare the results of the GRU algorithm, where i indexes the analyzed architectures, E is the number of epochs, and N_p is the total number of parameters used in the DL architecture (weights and biases). E_max is the maximum number of epochs used across all the simulations, and N_pmax is the maximum number of parameters across all the simulations. For each simulation, the DL_index is compared with the MAPE for both datasets (train and test) and with the training time, to make sure the DL does not overfit on the training data. In Figure 12, the evolution of the error with respect to epochs, training time, and the complexity of the DNN (DL_index) is highlighted. Training complex architectures with a high number of epochs, besides costing time and resources, leads to higher errors because of overfitting. A minimal error can determine the selection of the best algorithm. The DL_index highlights that increasing the number of hidden layers and neurons in each layer can negatively impact the performance of the DL algorithms. It becomes computationally harder to obtain lower errors, and the training time increases unjustifiably. Training a complex deep network allows each layer to describe precise features in the relation between input and output, but the neural network will fail to generalize to the new data needed for forecasting and, later, for electricity acquisition.

Discussion
Suppliers want clients to use more energy rather than less, but the recent development of demand-side management, demand response, and smart grids [48,49] will shift the status quo from irresponsible behavior towards indispensable accurate predictions. According to [50], the industrial and commercial sectors represent 63.46% of the world's total electricity consumption. The challenge is to anticipate the stochastic behavior of large consumers.
Industrial load forecasting plays an essential role in the cost of electricity, especially for a large consumer such as the one presented in this work. Load forecasting is important because planning for electricity supply depends on consumption forecasts. High imbalances between actual and forecast consumption create a higher risk for all the participants in the power market. This work proposes a methodology for industrial load forecasting based on DL and AR. The results and analysis indicate that DL has a high order of stochasticity, and careful tuning of the parameters is needed. This work proves that deep neural networks can successfully forecast hourly industrial load. The obstacle to this approach is the lack of smart metering and sensors for forecasting consumption in real time. Without relevant data, efficient and replicable forecasting models do not represent a reliable investment for the private sector.

Conclusions
A compromise is needed to find a practical solution that makes electric load forecasting more accessible to the industry sector, by implementing algorithms that learn directly from data with little human intervention. The novelty of the work is the proposed framework applied to industrial load curves, the analysis of the best architecture, and the scalability of the deep neural networks using a simple complexity index. The study compared the forecast performance of seven methods and tested various combinations of forecast variables and lag structures. Our test sample results across 1608 hourly values (15 October-20 December 2019) consistently indicate that: (i) deep recurrent neural networks are suitable for industrial load consumption; and (ii) the best model implemented is the GRU. The work highlights that increasing the number of hidden layers and neurons in each layer can negatively impact the performance of the DL algorithms.