Big Data Analytics for Short and Medium-Term Electricity Load Forecasting Using an AI Techniques Ensembler

Abstract: Electrical load forecasting provides knowledge about the future consumption and generation of electricity. Energy generation and consumption fluctuate considerably relative to each other: sometimes consumer demand exceeds the energy already generated, and vice versa. Electricity load forecasting provides a framework for monitoring future energy generation and consumption and for keeping the two in balance. In this paper, we propose a framework in which deep learning and supervised machine learning techniques are implemented for electricity load forecasting. A three-step model is proposed, which includes feature selection, feature extraction, and classification. A hybrid of Random Forest (RF) and Extreme Gradient Boosting (XGB) is used to calculate feature importance. In the feature selection step, the average of the two importance scores is used to select the most relevant features. In the feature extraction step, the Recursive Feature Elimination (RFE) method is used to eliminate irrelevant features. Load forecasting is performed with Support Vector Machines (SVM) and a hybrid of Gated Recurrent Units (GRU) and Convolutional Neural Networks (CNN). The meta-heuristic algorithms Grey Wolf Optimization (GWO) and Earth Worm Optimization (EWO) are applied to tune the hyper-parameters of SVM and CNN-GRU, respectively. The accuracy of our enhanced techniques CNN-GRU-EWO and SVM-GWO is 96.33% and 90.67%, respectively, which is 7% and 3% better than the State-Of-The-Art (SOTA). Finally, a comparison with SOTA techniques shows that the proposed techniques achieve the lowest error rates and the highest accuracy rates among the compared techniques.


Introduction
The electrical industry plays a vital role in human life. The electricity demand is increasing day by day with the rapid increase in population [1]. The traditional power grid has become outdated and is no longer efficient enough, so an intelligent version of the power grid, known as the Smart Grid (SG), has been introduced. Through the SG, it has become much easier for utility companies to manage the distribution of electric load and to remain in touch with consumers. The SG also helps to reduce the variations between power demand and supply. The most important task of the SG is to effectively control the consumption, generation, and distribution of electricity. The utility supplies electricity to consumers according to their demand. Sometimes, the consumption of users increases and the utility does not have enough energy to supply. To overcome the problem of balancing consumption and utility supply, the utility uses an electricity load forecasting model, which is one aspect of the SG. The conceptual diagram of the SG is shown in Figure 1. Load forecasting predicts the approximate energy consumption pattern of users from their historical data. Generally, forecasting is of four types: Very Short-Term Forecasting (VSTF), Short-Term Forecasting (STF), Medium-Term Forecasting (MTF), and Long-Term Forecasting (LTF). VSTF covers predictions from some hours to one day ahead, STF predicts the load from one day to some weeks ahead, MTF covers one week to one year, and LTF forecasts the load one year to several years ahead [2][3][4]. In this paper, STF and MTF are performed on a large electricity dataset. Data analysis is the process of extracting useful information from hidden patterns in data.
Data analysts measure price and load consumption by using historical data, in the form of datasets, to obtain useful information [5]. A detailed review of data is available in [6]. The volume of real-world data is increasing rapidly, and such large volumes of data are referred to as big data. Through data analytics, useful information is extracted from massive quantities of historical power data, which helps to improve the management and planning of market operations.
Big data is multifaceted and very large in volume. The main issue in big data sets is redundant features, so traditional methods are not well suited to handling such a large amount of data.
Many techniques have been tested and applied to handle big data and extract useful information; nevertheless, handling big data remains an open issue. Many authors around the globe are working on handling big data using artificial intelligence techniques. The authors in [7,8] improved the price forecasting and load forecasting accuracy; however, the computational time is not considered. Similarly, the issue of load forecasting is addressed in [9]; however, the issue of overfitting is not addressed. Moreover, in [10], the author proposed the BPNN model to forecast the day-ahead electricity load; however, the complexity of the proposed model is increased. Additionally, the LSTM-RNN model in [11] is used to forecast hourly and monthly electricity load. Furthermore, in [12], a hybrid of SVM and non-linear regression is used to forecast load; unfortunately, the problem of over-fitting is increased. Hence, conventional simple techniques are not well suited to a varying electricity load, and a better framework and enhanced techniques are required to solve the load forecasting problem. Therefore, the main objective of this paper is to enhance the accuracy of electricity load forecasting by optimizing the parameters of machine learning and deep learning techniques on a large amount of electricity load data. Because a large amount of data contains redundant and irrelevant features, which increase the time complexity of training, RF and XGB are used as feature selection methods. The RFE technique is used as a feature extraction method to eliminate the redundancy, while SVM with the GWO algorithm and CNN-GRU with the EWO algorithm are used for the classification step of electricity load forecasting.
The main contributions of this paper are given below: The rest of the paper is organized as follows. Section 2 contains the literature review. Section 3 contains the proposed methodology (method). Section 4 contains the results. Section 5 contains the conclusion and policy implications.

Related Work
Many techniques and ideas have been used and tested, with successful results, to predict the power load and related quantities. In [13], the authors performed load forecasting with various smart home data by applying an analytical approach to the data; however, they were unable to manage a large amount of data properly. A Deep Long Short Term Memory (DLSTM) and machine learning-based model is proposed to forecast price and electricity load [14]. The proposed DLSTM outperforms the compared techniques in load forecasting accuracy. However, LSTM is costly to train because its computations are bound by memory bandwidth, which limits the applicability of neural network solutions.
In [15], the authors performed load forecasting with feature selection and classification models, taking the dataset as input. They used Mutual Information (MI) to select the best features and discard insignificant features. The authors proposed a three-step strategy for load forecasting in [16]. In the first step, they used Conditional Mutual Information (CMI) for feature selection. The second step consists of the NLSSVM and ARIMA techniques, which capture the nonlinear and linear correlations for load forecasting. In the third step, the parameters of NLSSVM are tuned with the ABC algorithm. However, reducing the features with the help of a feature selection method also reduces the forecasting accuracy rate.
For feature selection, IG and MI techniques have been used, which helped measure the redundancy and relevance of features [17]. The authors also proposed a hybrid wrapper filter-based approach, in which the filter part selects a small subset of the features based on redundancy and iteration over the inputs. They introduced a new feature selection method, but failed to maintain high accuracy rates.
The authors in [18] performed hourly forecasting and also quantified the uncertainty of the predictive method. They proposed Generalized ELM and Improved WNN techniques and applied them to the OAE electrical dataset. These techniques performed well in this setting; however, their parameters are manually tuned, and forecasting accuracy could be increased further with dynamically tuned hyper-parameters.
In [19], forecasting is performed using a combination of two deep learning techniques, CNN and LSTM. The proposed model is evaluated using the Mean Square Error (MSE). Further, the accuracy of the proposed technique is also compared with some benchmark techniques, and the results show that the proposed technique outperforms all of them in terms of accuracy. However, the authors did not consider feature redundancy, and redundant features can have a negative effect on model accuracy.
The authors in [20] performed day-ahead forecasting by increasing the layers of Artificial Neural Networks (ANN) and tuned the hyper-parameters with an optimization algorithm. To improve the accuracy rate, the authors in [21] increased the layers of a Neural Network (NN). The enhanced NN is compared with the conventional techniques ARIMA and SVR and shows higher accuracy, but it fails to avoid overfitting.
In [22], forecasting for each day of the week is performed by applying a deep CNN for classification. The dataset is taken from the Victorian electrical company, Australia. The authors forecast the one-day load and analyzed it by comparing it with the same day's load over the previous three months. However, the authors used a dataset with few records and were unable to train the CNN model properly.
A hybrid of CNN and LSTM achieved good accuracy in electricity load forecasting in [23]. The objective of that work is short-term forecasting, and the MAPE and MAE error metrics are used to evaluate the results. In [18], the hyperparameters of SVM are tuned with a random search algorithm to achieve improved accuracy and a lower error rate. The authors performed load forecasting and compared the results with manually tuned SVM and CNN; the results show the improvement of the enhanced technique. The authors used eight years of data for load forecasting; however, SVM is not well suited to classifying large datasets.
In [24], the authors proposed two techniques, named enhanced SVM and enhanced ELM, to perform short-term load forecasting. A grid search optimization algorithm is used to optimize the hyper-parameters of SVM, and the hyper-parameters of the Extreme Learning Machine (ELM) are tuned with the Genetic Algorithm (GA). The proposed techniques performed better, but the authors failed to avoid overfitting problems for SVM.
Short-term load forecasting is performed using NN and Levenberg Marquardt learning in [25]. The authors used a Tanzanian dataset covering 2000 to 2008 and evaluated the results with the MAE, MAPE, MSE, and MPE error metrics. The results measured through MPE and MAPE are good, but the error rates measured through MAE and MSE are very high.
The authors performed forecasting based on a feature selection technique and the least square SVM technique in [26]. They used ASF to select the most informative input values, and least square SVM is used for prediction. Results were evaluated with MAPE and MAE error metrics. The proposed model gave low error rates in terms of MAPE, but it showed the worst results in terms of MAE. The authors in [27] performed forecasting with a feature selection and classification model. Feature selection was based on MI and CA. The classification part was based on an iterative approach with two neural networks, in which the output of the first neural network is used as the input of the second.
The consumption data of homes are obtained through smart meters and used to perform forecasting with the help of the GRU deep learning technique in [28]. The authors did not preprocess the dataset and did not remove irrelevant features in their model. In [29], the authors used hourly and historical temperature data for forecasting. The SVM and ANN machine learning techniques are proposed for this purpose. However, the parameters of the proposed techniques have been tuned manually.
Improved kernel ELM and Cholesky decomposition techniques are used to forecast the electrical load in [30]. The proposed technique is further compared with conventional ELM and GNN. The RMSE error metric is used to evaluate the results. In that work, only one metric is used to demonstrate the superiority of the proposed model; other metrics such as MAPE, MSE, accuracy, F1-score, and precision are not included.
In [31,32], the authors used an enhanced CNN method to forecast the electricity load. Furthermore, the superiority of the proposed method is shown with different statistical tests. A composite method based on the optimal learning MLP technique is applied to forecast the mid-term electricity load [33,34]. An acceptable accuracy of 85% is achieved. However, the authors did not consider the overfitting problem of MLP and did not address the issue of disregarding spatial information.
The deep learning techniques CNN and ANN are used for forecasting in [35,36]. The authors tuned the parameters manually and did not eliminate the irrelevant features. A hybrid of CNN and GRU is applied to predict the electricity load in [37][38][39]. Results are evaluated with MAPE and RMSE values. A comparison of the hybrid technique with conventional techniques is also performed, and the results show that the hybrid technique outperforms all other techniques. The hybrid model performed well, but its parameters have been manually tuned. In our work, we use the latest heuristic algorithms to automatically assign optimum values to the parameters of our proposed techniques. In [40,41], the authors used a framework of feature selection, extraction, and classification for load forecasting. They used a hybrid of XGB and DTC techniques to select the most relevant features and eliminated the irrelevant features in the feature extraction step using the RFE technique. Finally, classification is performed using SVM. The proposed framework performed well, but the computational complexity of SVM is high and SVM is also not good at processing uncertain data [42][43][44]. The literature review shows that most authors performed forecasting with machine learning and deep learning techniques (Table 1). However, finding the optimal values for the hyperparameters of these techniques is difficult. Furthermore, the irrelevant features in electricity datasets also have a negative impact on model training. To solve some of the above-mentioned issues, we use heuristic algorithms to find optimal hyperparameter values automatically, and we propose a feature selection and extraction model to remove irrelevant features from the dataset.

Proposed System Model
After reviewing the literature and the aforementioned techniques for load forecasting, we propose a framework that is based on average feature selection, extraction, and forecasting. The machine learning techniques RF and XGB are used for feature selection, while RFE is used for feature extraction. For average feature selection, the average of the RF and XGB importance scores is used to select features, as described in Equation (1). Moreover, for classification, the machine learning-based technique SVM and the deep learning-based technique CNN-GRU are used. Furthermore, the parameters of CNN-GRU and SVM are tuned with the meta-heuristic algorithms EWO and GWO, respectively. Figure 2 shows the workflow of the proposed model.

Dataset Description
The latest electricity daily load dataset is used in this paper, which is downloaded from the ISONE website [34]. The columns in the dataset are referred to as "features" in our work. The dataset is organized according to a month-wise pattern, i.e., January 2012, January 2013 up to January 2019 and February 2012, February 2013 up to February 2019, and so on. The benefit of the month-wise organization is to improve the performance and learning rate of training activity on the dataset.
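The month-wise organization described above can be illustrated with a small sketch. It assumes each record carries a calendar date; the function name `month_wise_order` and the sample dates are hypothetical and are not taken from the paper's code.

```python
from datetime import date

def month_wise_order(days):
    """Sort dates so that all Januaries come first (oldest year first),
    then all Februaries, and so on."""
    return sorted(days, key=lambda d: (d.month, d.year, d.day))

# Hypothetical sample dates, used only to show the resulting order.
sample = [date(2013, 2, 1), date(2012, 1, 15), date(2013, 1, 15), date(2012, 2, 1)]
ordered = month_wise_order(sample)
print([d.isoformat() for d in ordered])
```

Sorting on the month before the year is what places January 2012 next to January 2013 rather than next to February 2012.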
The dataset contains 14 features. The feature named "System Load" is taken as the label, i.e., the target feature. We used 70% of the data for training and 30% for testing our proposed model. Afterwards, the dataset was divided again, with 90% for training and 10% for testing. The testing includes one-week, one-month, and four-month predictions, which are shown in the simulation section. The autocorrelation of the data is shown in Figure 3. An overview of the dataset is shown in Figure 4.
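A minimal sketch of the two train/test divisions follows. The text does not state whether the rows are shuffled before splitting, so this sketch assumes an ordered split without shuffling; `ordered_split` is a hypothetical helper and the 100 integer rows are placeholder data.

```python
def ordered_split(rows, train_fraction):
    """Split rows into a leading training part and a trailing test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(100))                       # placeholder for dataset rows
train70, test30 = ordered_split(rows, 0.7)    # first division: 70% / 30%
train90, test10 = ordered_split(rows, 0.9)    # second division: 90% / 10%
print(len(train70), len(test30), len(train90), len(test10))
```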

Feature Engineering
The machine learning techniques XGB and RF are used for the selection of relevant features. RF and XGB calculate the feature importance, i.e., the impact of each feature on the target feature. The values are decimals between 0 and 1. To improve the feature selection, the average of the two feature importance values is taken, as given in Equation (1). The feature engineering step removes unnecessary features and reduces the complexity of the proposed model by providing only relevant features for training.
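Written out, the averaging described above would take the following form. This is a reconstruction from the description and not a quotation of Equation (1); the superscripts RF and XGB are notation assumed here to distinguish the two importance scores.

```latex
F_s = \frac{F_i^{\mathrm{RF}} + F_i^{\mathrm{XGB}}}{2}
```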
In Equation (1), Fs denotes the feature selection score and Fi denotes the feature importance.
After the selection of relevant features, the most redundant features are removed using the RFE technique. The RFE technique calculates the dimension and priority of features in terms of true/false values and positive integers. After calculating the feature importance through feature selection and the dimensional conversion through feature extraction, the drop-out rate is set to eliminate unimportant features. According to Equation (2), features are selected (reserved) when their average feature importance (weight) is greater than the defined importance threshold and their priority is higher than the defined priority threshold. Features are rejected (dropped) when their weight is lower than the importance threshold and their priority is greater than the defined feature priority threshold. The selection threshold for the average feature importance is 0.6. Furthermore, features with a priority greater than 5 are considered for selection. The overall selection of features is carried out according to Equation (2).
In Equation (2), Fos denotes the overall feature selection and f indicates the feature; avgimp denotes the average feature importance, while pr represents the priority of the feature. The α and βpr denote the feature importance threshold and the feature priority threshold, respectively. After feature selection and extraction, the most relevant features are passed to the classifier for classification and forecasting.
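The combined selection rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the importance scores are assumed to be scaled to the range 0 to 1, the RFE result is represented only by its true/false flag, the scores are invented for the example, and "DryBulb" is a hypothetical feature name.

```python
def average_importance(rf_scores, xgb_scores):
    """Average the RF and XGB importance of each feature."""
    return {name: (rf_scores[name] + xgb_scores[name]) / 2.0 for name in rf_scores}

def select_features(avg_imp, rfe_flags, alpha=0.6):
    """Reserve a feature when its average importance exceeds alpha
    and its RFE flag is True; drop it otherwise."""
    return [name for name, score in avg_imp.items() if score > alpha and rfe_flags[name]]

# Invented scores for illustration only; they are not the paper's values.
rf_scores = {"Demand": 0.90, "Dewpoint": 0.70, "DryBulb": 0.20}
xgb_scores = {"Demand": 0.80, "Dewpoint": 0.60, "DryBulb": 0.30}
rfe_flags = {"Demand": True, "Dewpoint": True, "DryBulb": False}

avg = average_importance(rf_scores, xgb_scores)
selected = select_features(avg, rfe_flags)
print(selected)
```

With real models, the two score dictionaries would come from the fitted RF and XGB estimators and the flags from the fitted RFE selector.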

Classification and Forecasting
Classification is carried out using the machine learning technique SVM and the deep learning technique CNN-GRU, tuned with the optimization techniques GWO and EWO, respectively. The tuning parameters of SVM are the kernel coefficient (gamma), the cost parameter (C), and the kernel function. The tuning parameters of CNN-GRU are the number of hidden layers, the number of neurons in each layer, and the dropout value. The tuning step provides optimum values to the classifier, which results in better training of the model and reduces the chance of overfitting on a large amount of data.

CNN-GRU-EWO
A hybrid of CNN and GRU is used, and the parameters of this hybrid model are tuned with the EWO technique. The output shape of the tuned CNN-GRU layers is shown in Figure 5. A detailed description of CNN and GRU is given below.
Convolutional Neural Network: CNN is a type of deep learning algorithm. This technique is widely used in text and image recognition [29]. A CNN may have multiple hidden layers between the input layer and the output layer. The hidden layers are convolutional, dense, max-pooling, dropout, and flatten layers.
Input Layer: This layer is the starting point of the workflow of the proposed CNN and is the first layer of the network. It has no previous layer and no weighted input. The number of neurons equals the number of dataset features at this stage.
Hidden Layer: A CNN can have multiple hidden layers. The output of the first layer is given to these hidden layers. Each hidden layer can have a different number of neurons. The output of each of these layers is computed by multiplying its weight matrix with the previous layer's output.
Output Layer: The hidden layer's output becomes the input of this layer. A softmax or sigmoid (logistic) function is used to convert this input into a probability score.
Convolutional Layer: This layer has multiple filters and performs most of the computational work. The convolution operation is performed in this layer and the results are passed to the next layer.
Dense Layer: It acts as a conventional MLP layer, connecting each neuron of one layer with the neurons of the next layer.
Pooling Layer: This layer combines the outputs of neighboring neurons. It is of three types: average-pooling, max-pooling, and sum-pooling. In our model, max-pooling is used to reduce the number of parameters and the amount of computation. Generally, this layer is placed between the convolutional and dropout layers.
Activation Function: The Rectified Linear Unit (ReLU) activation function is used in the convolutional layer.
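To make the convolution, ReLU, and max-pooling steps described above concrete, here is a minimal one-dimensional sketch in plain Python. It is a toy illustration with an invented signal and a single hand-picked filter, not the trained network of the paper; the function names are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal without padding (cross-correlation,
    which is what deep learning libraries usually call convolution)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

signal = [1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 6.0]  # invented load-like sequence
kernel = [1.0, -1.0]                           # hand-picked difference filter

convolved = conv1d(signal, kernel)
activated = relu(convolved)
features = max_pool1d(activated, size=2)
print(convolved, activated, features)
```

Pooling halves the length of the activation sequence here, which is the reduction in parameters and computation that the pooling layer is used for.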

Gated Recurrent Unit (GRU)
GRU is an improved variant of the Recurrent Neural Network (RNN). A plain RNN has a problem with short-term memory; LSTM and GRU were proposed to address it. Both GRU and LSTM retain long-term information with the help of a gating mechanism. LSTM has three gates, whereas GRU has just two, named the update gate and the reset gate [30].
Update Gate: This gate determines how much of the previous information is passed on to future steps. The update gate helps to mitigate the vanishing gradient problem because it retains past information and decides which information is useful and which is not.
Reset Gate: This gate decides how much of the previous information to forget.
CNN-GRU: CNN is useful for handling high-dimensional datasets and GRU is useful for processing sequence data in minimal time. By hybridizing these techniques, we can obtain both qualities. In this paper, a hybrid of CNN and GRU is proposed. In the proposed model, the output of feature selection and extraction is given to the CNN. After the input layer of the CNN, a GRU is placed, followed by fully connected hidden layers. Finally, the model is compiled and trained to obtain the predicted results.
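For reference, the update and reset gates described above are commonly written as follows. This is the standard textbook GRU formulation, not notation taken from this paper: x_t is the input at step t, h_t the hidden state, W, U and b are learned weights and biases, σ is the logistic sigmoid, and ⊙ is element-wise multiplication. Some references swap the roles of z_t and 1 − z_t in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```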
EWO: To tune the hyperparameters of CNN-GRU, the EWO optimization technique is used, as shown in Figure 6. EWO is a nature-inspired heuristic algorithm that is used to solve optimization problems. In this technique, every earthworm can produce offspring of only two kinds. A child earthworm has genes of the same length as its parent earthworm. Some earthworms have the best fitness and are carried over to the next generation without any change.

SVM-GWO
To improve the performance of the machine learning technique SVM, the hyperparameters of SVM are tuned with the GWO optimization algorithm.
SVM: It is a supervised machine learning algorithm that is widely used to solve classification and regression problems. In SVM, a separating hyperplane is drawn, as in Figure 7, to divide the data points into classes; the separation can be linear or non-linear. In our proposed SVM, the RBF kernel is used, with the kernel coefficient "gamma" as a tuning parameter. After tuning, the optimization technique GWO yielded the following values for the SVM parameters: gamma is 0.1 and C is 1.0.
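The role of gamma in the RBF kernel can be seen in a short sketch. This uses the standard RBF definition, exp(-gamma * squared distance); the function name `rbf_kernel` and the sample points are hypothetical, and only the value gamma = 0.1 comes from the text.

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Similarity between two points under the RBF kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])                 # identical points
near = rbf_kernel([1.0, 2.0], [2.0, 4.0])                 # squared distance 5
sharper = rbf_kernel([1.0, 2.0], [2.0, 4.0], gamma=1.0)   # larger gamma
print(same, near, sharper)
```

A larger gamma makes the similarity fall off faster with distance, so the decision boundary follows individual training points more closely.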
GWO: It belongs to the family of metaheuristic and swarm optimization algorithms. GWO is used to tune the parameters of the SVM technique. This technique was developed in 2014 [31]; its authors were inspired by the social behavior of grey wolves and named the technique "Grey Wolf Optimization" accordingly. The hybrid of SVM and GWO is shown in Figure 8.
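A basic GWO loop can be sketched as below. This follows the commonly published update rule, in which the three best wolves guide the rest, and is not the authors' implementation. The objective is a stand-in quadratic whose minimum is placed at gamma = 0.1 and C = 1.0 purely for illustration; in practice it would be the validation error of an SVM trained with the candidate parameters. The search bounds, population size, and iteration count are assumptions.

```python
import random

def gwo_minimize(objective, bounds, n_wolves=20, n_iter=200, seed=0):
    """Minimize objective over box bounds with a basic Grey Wolf Optimizer."""
    if n_wolves < 3:
        raise ValueError("at least three wolves are needed for the leaders")
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    leaders = []  # three best (score, position) pairs found so far
    for it in range(n_iter):
        scored = [(objective(w), list(w)) for w in wolves]
        leaders = sorted(leaders + scored, key=lambda p: p[0])[:3]
        a = 2.0 * (1.0 - it / n_iter)  # decreases linearly from 2 towards 0
        for w in wolves:
            for d in range(dim):
                guided = 0.0
                for _, leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])
                    guided += leader[d] - A * D
                lo, hi = bounds[d]
                # Average the three guided positions and clip to the bounds.
                w[d] = min(hi, max(lo, guided / len(leaders)))
    final = [(objective(w), list(w)) for w in wolves]
    best_score, best_pos = min(leaders + final, key=lambda p: p[0])
    return best_pos, best_score

def stand_in_error(params):
    """Hypothetical surrogate for validation error, minimal at gamma=0.1, C=1.0."""
    gamma, c = params
    return (gamma - 0.1) ** 2 + (c - 1.0) ** 2

search_bounds = [(0.001, 1.0), (0.1, 10.0)]  # assumed ranges for gamma and C
best_params, best_value = gwo_minimize(stand_in_error, search_bounds)
print(best_params, best_value)
```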

Simulation Results
The complete implementation of our proposed framework was carried out on the system with specification; Intel Core i7, 8

Average Feature Selection Based on RF and XGB
The feature importance calculated by XGB and RF is shown in Figure 9. More important features have higher values and less important features have lower values. To make the selection effective, the average feature importance was calculated from the RF and XGB feature importance, as shown in Figure 10. Features whose importance was higher than the threshold were selected and low-importance features were rejected. The feature importance calculated by RF is shown in Figure 9a, which shows that the Demand and Dewpoint features were the most important in the dataset and had a high impact on the target. Figure 9b shows the feature scores calculated using XGB, which gives RT_CC and Demand as the most relevant features. The average of Figure 9a,b is taken to calculate the average importance, which is shown in Figure 10. According to the average importance, Demand, RT_MLC, DA_MLC and Dewpoint are the most relevant features, with a high influence on the target feature.
RFE calculates the dimensions of the features, i.e., true/false, and thus removes the redundant features from the dataset. Features with an average importance greater than 0.6 and a true dimension were selected as the best features, while features with an importance less than 0.6 and a false dimension were rejected. Table 2 shows the feature dimensions calculated by RFE, together with the abbreviations of the features.
Figure 11 shows the normal electricity load data from 1 January 2019 to 31 December 2019, which is provided by ISONE. The data is arranged month-wise, as the load patterns of the same month in different years are approximately the same, as shown in Figure 12. The monthly load of January 2018 and January 2019 is approximately the same. Similarly, the load patterns of December 2018 and December 2019 are almost the same. This similarity of load patterns helps to train our model better.
Figure 13 shows the weekly forecast and Figure 14 shows the monthly electricity load forecast. Our proposed algorithm achieved better forecasting accuracy. In Figures 13 and 14, the predicted values of our proposed techniques are nearly the same as the actual values of the electricity load.
The accuracy of our proposed techniques is higher than that of the SOTA techniques, as shown in Figure 16. Each enhanced technique outperformed its original version. The optimization techniques GWO and EWO found the best values for the hyperparameters of the techniques, which enhances the accuracy and reduces the time complexity of training the model. The accuracy of our proposed techniques CNN-GRU-EWO and SVM-GWO is 93% and 90%, respectively. The accuracy of the SOTA techniques SVM, CNN, LR and ELM is 87.98%, 89%, 78.34%, and 78.98%, respectively, as shown in Figure 16.

Performance Metrics
The performance of our proposed model and the SOTA techniques was evaluated using MAPE, MAE, RMSE, MSE, precision, recall, F-measure, and accuracy. The performance errors of our proposed techniques were much lower than those of the SOTA techniques, as shown in Figure 17. The accuracy of our proposed techniques was higher than that of the benchmark techniques, as shown in Figure 18.
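The four error metrics named above have standard definitions, sketched here in plain Python. The load values are invented for illustration and are not results from the paper; MAPE as written assumes that no actual value is zero.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0, 400.0]     # invented load values
predicted = [110.0, 190.0, 400.0]  # invented forecasts
print(mae(actual, predicted), mse(actual, predicted),
      rmse(actual, predicted), mape(actual, predicted))
```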
The performance values, i.e., F1-score, accuracy, precision, and recall, of CNN-GRU-EWO and SVM-GWO are greater than those of LR, CNN, SVM, and ELM. The MAPE of SVM-GWO and CNN-GRU-EWO is 1.33% and 6%, respectively. The LR technique has the highest performance error of 20%. A high performance error reduces the forecasting accuracy.
The training and testing accuracy of our proposed model is shown in Figures 19 and 20. The accuracy curve gradually increases as training proceeds on a large amount of data. The loss curve gradually decreases, which shows that our model is well trained and tested. The performance evaluation of our model, with the performance metric values, is given in Table 3.
Table 4 shows different correlation-based tests, parametric statistical hypothesis tests, and non-parametric statistical hypothesis tests of the proposed techniques and state-of-the-art techniques.

Conclusions
In this paper, a model based on deep learning, machine learning, and optimization techniques is used for short and medium-term electricity load forecasting. The eight-year electricity load dataset was downloaded from the ISONE website. ISONE operates the power grid for the New England region.
Normal forecasting models are unable to perform well on such a huge amount of data. Therefore, a framework consisting of feature selection, feature extraction, and classification is proposed to forecast the electricity load. The feature engineering process removes redundancy and selects the most relevant features, which have a high impact on the target feature. Furthermore, it reduces the complexity of the model by providing only the most important features to the classifiers. The RF and XGB techniques are used for feature selection and RFE is used for feature extraction. The feature engineering activity refines the data and passes it to the classifiers. The techniques CNN-GRU and SVM are used as classifiers. To enhance the performance of the classifiers, the parameters of CNN-GRU and SVM are tuned with the optimization algorithms EWO and GWO, respectively. The optimization algorithms find the best values for the hyperparameters of the techniques. Moreover, the tuned parameters reduce the chance of model overfitting and help to increase the accuracy of the model. Our proposed techniques, CNN-GRU-EWO and SVM-GWO, outperform the SOTA techniques. The accuracies of CNN-GRU-EWO and SVM-GWO are 96.33% and 93.99%, respectively. Our proposed techniques perform 7% and 3% better than the CNN and SVM classifiers. In the future, other optimization techniques will be applied to the machine learning classifiers to enhance the accuracy of electricity load forecasting.

Conflicts of Interest:
The authors declare no conflict of interest.
