Estimating Energy Forecasting Uncertainty for Reliable AI Autonomous Smart Grid Design

Abstract: Building safe, reliable, fully automated energy smart grid systems requires a trustworthy electric load forecasting system. Recent work has shown the efficacy of Long Short-Term Memory neural networks in energy load forecasting. However, such predictions do not come with an estimate of uncertainty, which can be dangerous when critical decisions are being made autonomously in energy production and distribution. In this paper, we present methods for evaluating the uncertainty in short-term electrical load predictions for both deep learning and gradient tree boosting. We train Bayesian deep learning and gradient boosting models with real electric load data and show that an uncertainty estimate may be obtained alongside the prediction itself with minimal loss of accuracy. We find that the uncertainty estimates obtained are robust to changes in the input features. This result is an important step in building reliable autonomous smart grids.


Introduction
Confident predictions are necessary to build reliable smart energy grid systems that are fully automated by Artificial Intelligence (AI) [1-6]. Load predictions produced without any information about their uncertainty are unsafe [7] when critical autonomous decisions are being made in energy production and distribution. Current forecasting models normally do not provide information about uncertainty in their predictions, which can lead to costly and risky decisions and hinder current efforts to produce a reliable smart energy grid system [8]. This is a problem for many AI applications, such as autonomous vehicles, where the lack of consideration for uncertainty can be dangerous [9]. Therefore, for safe, reliable, autonomous artificial intelligence systems and for safe prediction-based decisions, evaluating the degree of uncertainty is paramount [10].
At the same time, the efficient allocation of resources is a primary goal of the electric power industry, which includes the production, transportation, and distribution of energy. An accurate forecast of the future electric load is integral to achieving this, as it allows utility companies to make informed decisions for planning and operations [2,3,11]. Accurately predicting the hourly load, known as Short-Term Electric Load Forecasting (STELF), allows utilities to make these decisions in real-time and saves millions of dollars normally wasted due to overload or underload costs, as well as reducing CO2 emissions [4,12-14].
Obtaining confident and accurate estimates for learning models through data-driven methods has been the subject of intensive research [4]. A recent survey [13] showed that previous studies primarily used artificial neural networks (ANNs) and support vector machines (SVMs), with decision trees and various statistical methods making up the remainder. Of these, ANNs have become the most common choice due to advances in deep learning and the development of neural net architectures such as the long short-term memory (LSTM) network [15], designed for sequence modeling and time series tasks. LSTM networks have been employed successfully for a wide range of applications, including short-term electric load forecasting [13,16-19]. At the same time, tree-based algorithms such as gradient-boosted trees have become widely used in industry and machine learning competitions due to lower computation requirements, ease of interpretation and tuning, and better performance on smaller datasets, and have recently been applied to load forecasting as well [20-22].
In this paper, we present the implementation and testing of two methods for estimating the uncertainty of short-term electrical load predictions, based on two popular machine learning algorithms: LSTM and gradient boosting regression (GBR). The contributions are fourfold. First, we developed two energy forecasting computational models using deep learning techniques. LSTM represents a class of advanced artificial recurrent neural networks (RNN) that has been shown to be efficient in many applications. However, to our knowledge, this is the first time that LSTM has been applied together with Monte Carlo dropout regularization to quantify the prediction uncertainty for energy forecasting. Second, we applied the optimization algorithm GBR to energy forecasting. More importantly, using the ability of GBR to accept an arbitrary differentiable loss function, we were able to construct prediction intervals to measure the uncertainty of the prediction. Third, the applicability and effectiveness of the models are verified and analyzed using real-world data: a benchmark dataset on electricity load forecasting. The impact of implementing such methods on predictive accuracy, as well as the effect of input features on the produced uncertainty, are also examined. Lastly, experimental results show that the models achieved high prediction accuracies in terms of the Mean Absolute Percent Error (MAPE) and that uncertainty estimates can be obtained with minimal cost to accuracy.
The rest of the paper is organized as follows. In Section 2, we introduce the algorithms, uncertainty estimation, and the design of the learning models. Section 3 describes the processing of the dataset, implementation, and experimental results. Finally, some conclusions and future work are presented in Section 4.

Algorithms of Learning Models
We first discuss the two algorithms for STELF and then present techniques for producing a prediction interval for each.

Long Short-Term Memory
Long short-term memory (LSTM) is a recurrent neural network architecture that has gained widespread popularity due to its ability to model long-term dependencies and automatically extract features [15,16,23,24]. It has also been shown that the ANN vanishing gradient problem, which affects ordinary RNNs, does not occur for the LSTM cell [15,16,23,24]. This has made it a favoured choice for time series modeling [18,19], including, recently, electric load forecasting [5,17]. There are several variants in use today, but in this paper, we use the standard architecture presented by Graves and Schmidhuber [23].
The key idea behind LSTM is to introduce a memory cell to standard RNN architecture ( Figure 1). This memory cell, built out of simple nodes in a specific pattern, allows the LSTM module to retain information across many timesteps when needed [25]. The flow of information to and from the memory cell is controlled by three "gates", each of which performs a multiplication on the information passing through to determine how much to allow past.
We will describe the network using a similar presentation to those by Kong [26] and Lipton [25]. We denote the input to the LSTM module at timestep t with x t and its output with y t , in our case, both in MWh units. We represent the state of the memory cell with h t . Each gate itself is a neuron with a sigmoid activation function σ, and uses the input x t and previous memory cell state h t−1 to produce a set of values between 0 and 1 representing the amount of information let through. The W matrices are weight matrices for each neuron.
We first create a candidate state g_t with which to potentially update the memory cell, using a tanh activation function:

g_t = tanh(W_gx x_t + W_gh h_{t−1} + b_g),

where W_gx represents the weights between the input x and the state g, W_gh represents the weights between the previous state h and the state g, and b_g represents the bias. The first of the gates, the forget gate f_t, determines how much of the memory cell state we preserve between timesteps and is given by:

f_t = σ(W_fx x_t + W_fh h_{t−1} + b_f),

where W_fx and W_fh represent the weights between the gate neuron f and the input x and previous state h, respectively. As above, b_f represents the bias. The input gate i_t determines how much of the candidate state g_t will be added to the memory cell, and is given by:

i_t = σ(W_ix x_t + W_ih h_{t−1} + b_i),

where W_ix and W_ih represent similar weight matrices as above for the input x and the previous state h, and b_i represents the bias. The memory cell is then updated with the new values:

h_t = f_t ⊙ h_{t−1} + i_t ⊙ g_t,

where ⊙ denotes an elementwise or Hadamard product. The output gate is calculated in a similar manner to the other gates:

o_t = σ(W_ox x_t + W_oh h_{t−1} + b_o),

where W_ox and W_oh represent similar weight matrices as above for the input x and the previous state h, and b_o represents the bias. To produce the output y_t, the contents of the memory cell are passed through the output gate and a tanh function to constrain the values to the same range as an ordinary tanh hidden unit [25]:

y_t = o_t ⊙ tanh(h_t).
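For concreteness, the gate equations above can be sketched as a single NumPy timestep. This is a minimal illustration, not the paper's implementation: the weight-dictionary layout and parameter names are our own, and the memory-cell state is denoted h to match the paper's notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, params):
    """One LSTM timestep following the gate equations above.

    `params` holds hypothetical weight matrices/biases named after the
    text's symbols (W_gx, W_gh, b_g, and so on).
    """
    g = np.tanh(params["W_gx"] @ x_t + params["W_gh"] @ h_prev + params["b_g"])  # candidate state
    f = sigmoid(params["W_fx"] @ x_t + params["W_fh"] @ h_prev + params["b_f"])  # forget gate
    i = sigmoid(params["W_ix"] @ x_t + params["W_ih"] @ h_prev + params["b_i"])  # input gate
    o = sigmoid(params["W_ox"] @ x_t + params["W_oh"] @ h_prev + params["b_o"])  # output gate
    h = f * h_prev + i * g          # elementwise (Hadamard) cell update
    y = o * np.tanh(h)              # module output
    return y, h
```

Because the output gate lies in (0, 1) and tanh in (−1, 1), the output y_t is always bounded in (−1, 1), matching an ordinary tanh hidden unit.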

Uncertainty Estimation for LSTM
Our goal is to measure the uncertainty of the LSTM-based prediction. There may be multiple sources of uncertainty present as well as inherent noise, but in this work we restrict the discussion to the model-specific sources of uncertainty, resulting from uncertainty in our model parameters: the connection weights. This is usually known as model or epistemic uncertainty and is reduced as more data is collected [10,19,27].
To measure this type of uncertainty, we introduce a prior probability distribution, usually Gaussian, for each weight in the neural network and attempt to find a posterior distribution rather than a single estimate. This type of network is known as a Bayesian neural network (BNN), but due to the large number of parameters involved, it is difficult to solve traditionally [10,19,28,29]. However, it has been shown that dropout regularization (the technique of stochastically dropping out neurons to reduce overfitting) can also be used to approximate the results of a BNN without the difficulty found in the conventional approach [10,19,28]. In this regularization technique, a small fraction of units throughout the network is muted at every training stage [30,31]. This reduces codependencies between neurons that lead to overfitting and forces the model to learn more robust features.
Normally, in the prediction stage, no neurons are dropped and the entire network is used. However, by continuing to drop out neurons during the prediction stage and sampling multiple times (an algorithm called Monte Carlo dropout, or MC dropout), we exchange some of the predictive power of the full network in order to create a distribution of predictions. As shown in [28], this can be thought of as sampling from the posterior distribution of a Bayesian network; as a result, we can estimate the variance of the posterior and, therefore, the model uncertainty by means of the variance in these predictions [10,19,28,32].
Our algorithm for estimating model uncertainty using Monte Carlo dropout follows the steps outlined in [10]. A neural network can be represented by a function f_W(·), where f captures the network structure and W is the set of neural network weights. After training our neural network and feeding it with new data x*, our goal is to evaluate the uncertainty of our neural network prediction ŷ* = f_W(x*) by calculating the prediction standard error η, which allows us to create an α-level prediction interval

ŷ* ± z_{α/2} η

using the z score z_{α/2}. The detailed algorithm for MC dropout is as follows. Given a new input x* to a trained neural network, we compute the network output prediction while randomly dropping out each hidden unit with a certain probability p. The feed-forward stage is repeated with stochastic dropout B times, resulting in the predicted values {ŷ*_1, ..., ŷ*_B}. The model uncertainty can then be approximated by the sample variance [10]:

Var[f_W(x*)] ≈ (1/B) Σ_{b=1}^{B} (ŷ*_b − ŷ*_mc)², where ŷ*_mc = (1/B) Σ_{b=1}^{B} ŷ*_b.

The desired prediction standard error can now be calculated as η = sqrt(Var[f_W(x*)]).
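The sampling loop can be sketched in NumPy as follows. The toy one-hidden-layer network and all parameter names here are illustrative stand-ins for the trained LSTM, not the paper's implementation:

```python
import numpy as np

def mc_dropout_predict(x, W1, b1, w2, b2, p=0.1, B=200, alpha=0.10, rng=None):
    """Repeat the forward pass B times with stochastic dropout and return
    the mean prediction, its standard error, and an alpha-level prediction
    interval. The toy network (one tanh hidden layer with weights W1/b1
    and readout w2/b2) stands in for the trained LSTM of the paper."""
    rng = rng or np.random.default_rng(0)
    z_crit = 1.645 if abs(alpha - 0.10) < 1e-9 else 1.96  # z_{alpha/2}
    samples = []
    for _ in range(B):
        h = np.tanh(W1 @ x + b1)
        mask = rng.random(h.shape) >= p        # drop each hidden unit w.p. p
        h = h * mask / (1.0 - p)               # inverted-dropout rescaling
        samples.append(w2 @ h + b2)
    samples = np.array(samples)
    y_mc = samples.mean()                      # prediction
    eta = samples.std(ddof=1)                  # prediction standard error
    return y_mc, eta, (y_mc - z_crit * eta, y_mc + z_crit * eta)
```

In a Keras model, the same effect is obtained by keeping the dropout layers active at inference time and looping over repeated forward passes.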

Gradient Tree Boosting
For comparison, we also evaluate the performance of a second popular algorithm, gradient boosting regression (GBR). Boosting is a meta-algorithmic technique whereby a model is built from an ensemble of individually weak models, step by step. At each stage, the new model being added attempts to improve on the performance of the current ensemble. Implementations of gradient boosting such as XGBoost have proved extremely effective in real-world scenarios and machine learning competitions, offering both speed and accuracy.
We provide an overview of the technique, as described in [22]. Given a dataset with n training examples consisting of an input x_i and expected output y_i, a tree ensemble model φ(x_i) is defined as the sum of K regression trees f_1, f_2, ..., f_K:

φ(x_i) = Σ_{k=1}^{K} f_k(x_i).

To evaluate the performance of a given model, we choose a loss function l(ŷ_i, y_i) to measure the error between the predicted value and the target value, and optionally add a regularization function Ω(f_k) which operates on trees and calculates a penalty. The loss L(φ) of the ensemble is then

L(φ) = Σ_{i=1}^{n} l(ŷ_i, y_i) + Σ_{k=1}^{K} Ω(f_k).

The regularization function, if used, is designed to penalize more complex trees and acts as an "Occam's razor" term that encourages simpler models. Our goal is to minimize L(φ), and to do so, we introduce each f_k iteratively. Assume that the ensemble currently contains K trees. We add a new tree f_{K+1} that minimizes

Σ_{i=1}^{n} l(ŷ_i + f_{K+1}(x_i), y_i) + Ω(f_{K+1}),

or in other words, we greedily add the tree that most improves the current model as determined by L. We train the new tree using this objective function; this is often done by approximating the objective using the first- and second-order gradients of the loss function l(ŷ_i, y_i), following the method given by Friedman et al. [33].
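As a toy illustration of this stagewise idea (not the authors' implementation, and using squared error rather than the losses discussed later), the sketch below greedily adds depth-one regression trees, each fit to the current residuals, i.e. the negative gradient of the squared loss:

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split (depth-1) regression tree for residuals r, found
    by exhaustive threshold search; returns (threshold, left_val, right_val)."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def gradient_boost(x, y, n_trees=50, lr=0.1):
    """Stagewise boosting for squared loss: each new stump fits the
    negative gradient, which for squared error is the residual."""
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        t, lv, rv = fit_stump(x, y - pred)       # fit the residuals
        pred += lr * np.where(x <= t, lv, rv)    # shrunken update step
        stumps.append((t, lv, rv))
    return pred, stumps
```

Swapping in a different differentiable loss changes only the gradient being fitted at each stage; this is the property exploited in the next section for quantile prediction.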

Uncertainty Estimation for GBR
As the first few trees added dominate the model performance, we cannot randomly drop out trees during prediction in order to construct a prediction interval. Instead, we adapt a technique described in [34] whereby the loss function is altered, training the model to predict values other than the mean. By predicting an upper bound and a lower bound at a certain level of confidence, we can train the model to output the uncertainty in its own prediction. The loss function we choose is known as the quantile or pinball loss function, a generalization of the Mean Absolute Error (MAE) used for the LSTM model. For a given prediction ŷ_i, target value y_i, and desired quantile τ, it is defined as

L_τ(ŷ_i, y_i) = τ (y_i − ŷ_i) if y_i ≥ ŷ_i, and (1 − τ)(ŷ_i − y_i) otherwise.

As shown in Figure 2, this produces a tilted version of the MAE loss function determined by the parameter τ, with τ = 0.5 reducing to the standard mean absolute error (up to a constant factor). Minimizing this loss trains the model to produce a prediction of the τ-th quantile [35], and training two additional models (one to predict an upper quantile and one to predict a lower quantile) allows us to construct a prediction interval for the gradient-boosted prediction. For our experiments, we select τ = 0.95 and τ = 0.05 to produce a 90% prediction interval.
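The quantile loss itself is straightforward to compute; a small NumPy sketch:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss: under-predictions are penalized with
    weight tau, over-predictions with weight (1 - tau)."""
    diff = y_true - y_pred
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff))
```

With a high quantile such as τ = 0.95, under-predicting is penalized 19 times more heavily than over-predicting, which pushes the trained model's output up toward the 95th percentile of the target distribution.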

Experiment and Results
We now discuss our implementations of the methods described above, as well as empirical results.

Model Implementation
We use the Python-based scikit-learn package [36] and the TensorFlow and Keras libraries [16,24,37,38] to implement three learning models. The first model is the standard LSTM architecture as used in practice [17,26], consisting of the input layer, three LSTM hidden layers with 50 neurons each, and a single neuron in the output layer.
The second model implements uncertainty estimation as described above, by performing Monte Carlo sampling with dropout to approximate a Bayesian posterior distribution for the predictions (Algorithm 1). We then construct a 90% interval based on the sample variance as the uncertainty estimate. The Mean Absolute Error (MAE) loss function and the Adam version of stochastic gradient descent were used with Keras and Tensorflow to optimize the LSTM models.
The third model is a gradient boosting regressor implemented using the GradientBoostingRegressor function from the scikit-learn package [36] and optimized for Mean Absolute Error (MAE) in MWh units. Two auxiliary models are implemented the same way, but optimized using the quantile loss for the 95th and 5th quantiles. The 90% prediction interval is constructed using the outputs from the auxiliary models, while the main model provides the prediction. Table 1 lists the configuration parameters for the implementation.

Algorithm 1: MC dropout prediction.
Input: data x*, input LSTM layers a(·), trained LSTM network l(·), dropout probability p, number of iterations B
Output: prediction ŷ*_mc, prediction uncertainty η
for b = 1 to B do
    e*_(b) ← VariationalDropout(a(x*), p)
    ŷ*_(b) ← MCDropout(l(e*_(b)), p)
end
ŷ*_mc ← mean of {ŷ*_(1), ..., ŷ*_(B)}  // prediction
η ← sample standard deviation of {ŷ*_(1), ..., ŷ*_(B)}  // uncertainty in the prediction

Data and Feature Selection
The data used in this paper was obtained from the European Network on Intelligent Technologies (EUNITE) 2001 competition on electricity load forecasting [39]. The data was collected by the Eastern Slovakian Electricity Corporation over two years, from 1 January 1997 until 31 December 1998, and contains the following features:
- Half-hourly electricity load in MWh.
- Daily average temperature in Celsius.
The objective of the original competition was to predict the daily peak electricity load for the 31 days of January 1999 using the given historical data. For this paper, our objective is to instead predict the electricity load for each half-hour for the same time period and compare the forecast with actual consumption.
The EUNITE competition data has been studied by many authors [40,41] to investigate the relationship between load data and other features such as temperature, weekdays, and annual holidays.

Temperature Effects
References [40,41] show from the load and average daily temperature data that electricity consumption is high in the winter (September-March) due to heater usage and low in the summer (April-August). This behavior indicates a strong relationship between the electricity load and the weather conditions represented by temperature. A negative correlation with coefficient −0.8676 was found between the daily peak load and the daily average temperature over the two years [40,41], confirming this strong relation.

Weekday Effects
Examination of load data during the weekdays shows that a periodic weekly pattern exists in the consumption. References [40,41] show that on most Saturdays and Sundays (the weekends), the load is lower than on weekdays (Monday-Friday). In addition, electricity peak load usually occurs on a Wednesday.

Holiday Effects
The analysis of the two-year historical load data shows that the load usually drops on holidays [40,41]. The load demand also depends on the type of holiday; for example, on Christmas or New Year, electricity consumption is affected more than on other public holidays.

Feature Selection
For high-accuracy load forecasting, our model needs the most descriptive input features that have the highest correlation with the output forecasting variable and the highest degree of linear independence.
From the above data analysis, the best candidate input features are the past half-hourly loads, the average daily temperatures, and the type of day [40,41].

Half-Hourly Load
The past half-hourly load is a strong indicator of the next timestep's load, making its inclusion essential for high-accuracy forecasting.

Average Daily Temperature
From the above data analysis, electricity load and the average daily temperature have a high negative correlation. Therefore, the average daily temperature is a crucial feature for our forecasting model.

Weekday Indicator
The weekday effect on electricity load was shown in the data analysis as a pattern in references [40,41]. Therefore, the type of day can be encoded numerically as a feature in our forecasting model to enhance its performance. This can be done in Python using the date.weekday() method, which returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
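For illustration, both calendar indicators (the weekday integer here and the holiday flag of the next subsection) might be derived as follows; the holiday set shown is a hypothetical placeholder rather than the Slovak holiday calendar used with the dataset:

```python
from datetime import date

# Hypothetical sample of public holidays; the paper's dataset would use
# the actual Slovak holiday calendar for 1997-1999.
PUBLIC_HOLIDAYS = {date(1998, 12, 25), date(1999, 1, 1)}

def calendar_features(d):
    """Encode the day of week (0 = Monday .. 6 = Sunday) and a 0/1 holiday flag."""
    return d.weekday(), int(d in PUBLIC_HOLIDAYS)
```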

Public Holidays
Public holidays have a significant effect on users' behaviors for electricity consumption. Hence, inclusion of a binary indicator 0/1 for public holidays in our forecasting model will benefit its accuracy.

Time Series Modeling
After encoding indicators of both weekdays and public holidays as numerical values, the EUNITE competition data was framed as multivariate time series data consisting of sequences of records ordered by date and time, as shown in Table 2.

Data Normalization
The range of values for the EUNITE competition data varies widely by feature, which affects both the accuracy and the stability of our forecasting model [16,24,37,38]. Therefore, all features have been normalized by rescaling each to lie within [0, 1]. This scaling was done in Python using the MinMaxScaler class from the sklearn package [24,37].
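Min-max rescaling maps each feature x to (x − min)/(max − min); a minimal NumPy equivalent of the per-column computation that MinMaxScaler performs, keeping the statistics needed to invert the transform on predictions:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to [0, 1]. Returns the scaled data plus
    the (min, range) pair needed to map predictions back to MWh."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0          # guard against constant columns
    return (X - col_min) / col_range, (col_min, col_range)
```

Inverting with `scaled * col_range + col_min` recovers the original units, which is how forecasts in the [0, 1] space are converted back to MWh for evaluation.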

Convert Data to Input-Output Pairs
The main task for our models in a supervised machine learning setting is to learn the function that maps inputs to an output, based on examples from labeled training data consisting of known input-output pairs [16,24].
The EUNITE competition data must be reframed as a supervised machine learning problem in order to be handled by our models for forecasting electricity load.
Therefore, the multivariate time series was converted to input-output pairs using a sliding window method, in which prior time steps are used as features for predicting the next time step [38,42,43].
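The sliding-window conversion can be sketched as follows; the window length is an illustrative parameter, not the paper's exact configuration, and the load is assumed to be the first column:

```python
import numpy as np

def sliding_window(series, n_lags):
    """Convert a (timesteps, features) array into supervised pairs:
    each X sample holds the n_lags prior steps, and y holds the next
    step's load (assumed to be column 0)."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t, 0])
    return np.array(X), np.array(y)
```

Note that the resulting X is already in the 3D [samples, time steps, features] shape expected by recurrent models.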

Splitting Data into Train and Test Sets
The two years of historical data was split into training and testing sets, with 80% for training and 20% for testing. The training and testing sets were split into input and output variables, and finally, the inputs were reshaped into the 3D format expected by our models, namely, [samples, time steps, features] [16,24,37,38]. The training set was used first to train our models, then the testing set was used to evaluate it.
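The chronological split can be sketched as below (the 80/20 proportions follow the text; no shuffling is applied, so future data never leaks into training):

```python
import numpy as np

def train_test_split_sequential(X, y, train_frac=0.8):
    """Chronological split of the 3D [samples, time steps, features]
    inputs and their targets: the first train_frac of samples train
    the model, the remainder evaluate it."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]
```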

Experimental Results
The three models, namely the standard LSTM model, the Bayesian LSTM model, and the GBR model, were trained and used to forecast the electricity load for the month of January 1999. The accuracy of the load forecast is measured by way of two error metrics: the Root Mean Square Error (RMSE) in MWh units and the Mean Absolute Percent Error (MAPE). The RMSE is the standard deviation of the prediction errors and is defined as follows [40,41]:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ),

where y_i is the target value, ŷ_i is the predicted value, and N is the total number of values. The MAPE is calculated by taking the average of the unsigned percentage errors as follows [40,41]:

MAPE = (100/N) Σ_{i=1}^{N} |y_i − ŷ_i| / y_i.

The behavior of these metrics during the training sessions for each model is shown in Figures 3 and 4. We note that the loss drops with each epoch and stabilizes over time, indicating that the models are sufficiently trained without overfitting.
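The two metrics are simple to implement; a small NumPy sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error, in the units of the data (MWh here)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percent Error; assumes no zero target values."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)
```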
Results from the standard LSTM model are shown in Figure 5. We can see that the predicted mean tracks the actual values closely; indeed, the error percentage measured as MAPE is less than 3%, as shown in Table 3.

The January forecast obtained from the Bayesian LSTM model is shown for 50 timesteps in Figure 6. We note that the uncertainty band is narrow due to this only being a measure of the model uncertainty and not including other sources of error such as noise. Figure 7 shows the same prediction using quantile regression with the GBR model. As this method does not differentiate between sources of uncertainty but instead captures all types, we note that the prediction intervals are wider than those for the LSTM model.

Introducing MC dropout to the neural network reduces the accuracy of the LSTM model by a small amount. GBR achieves similar or slightly better accuracy compared to the LSTM model, and as producing a prediction interval does not require any modifications to the main predictive model itself, there is no loss in performance. Overall, the performance of all three models is fairly similar, showing that uncertainty estimates can be obtained with minimal cost to accuracy.

Uncertainty Analysis
We perform some analysis of the obtained uncertainty values. Figure 8 shows the frequency of the calculated uncertainty as a percentage of the actual load values in the prediction of the January 1999 loads. We observe that the frequency decreases for higher percentage values of uncertainty, and almost no uncertainty estimates rise above 10%. This behavior reflects the relatively high confidence and low error percentage of our forecasting model. Table 4 shows the correlation of the percentage uncertainty with the input features used in the prediction of the January 1999 load, namely the past half-hourly loads, the average daily temperatures, the type of weekday, and public holidays. From the table, the uncertainty has a −0.05 correlation with the past half-hourly loads, a 0.03 correlation with the average daily temperatures, and zero correlation with the type of weekday or public holidays. We see that the model uncertainty is largely independent of the input features, as it arises from the model itself.

Conclusions
With the fast development of machine learning algorithms and their extensive application, estimating model uncertainty has become a significant research area. We implemented and evaluated two methods for quantifying the uncertainty of two popular supervised learning algorithms. This work is one of the few that studies uncertainty for both RNN and tree-based algorithms in the area of energy forecasting. An uncertainty measure is critical for applications such as designing reliable AI-enabled autonomous energy smart grids [10]. In addition, uncertainty is also an important metric for energy production and distribution, as it allows utility companies to assess the trustworthiness of forecasts before making decisions. Our results demonstrate the effectiveness of the learning techniques, LSTM and GBR, and the uncertainty evaluation methods, using the EUNITE electricity load dataset. Both methods are shown to provide accurate predictions.
There remains much to be investigated in the field of uncertainty estimation and its application to electricity forecasting. Although we only described methods to produce uncertainty estimates for two algorithms used in STELF, these methods can also be extended to other forecasting algorithms. Monte Carlo methods can be used to estimate the degree of uncertainty that can be attributed to the model parameters, by estimating the variance of the posterior. Quantile methods which aim to train separate models to predict upper and lower bounds can provide an overall estimate of the expected distribution. Both techniques provide useful estimates of the risk in real-world scenarios and easy computational implementations for AI-enabled autonomous smart grids.