Clustering Informed MLP Models for Fast and Accurate Short-Term Load Forecasting

Abstract: The stable and efficient operation of power systems requires them to be optimized, which, given the growing availability of load data, relies on load forecasting methods. Fast and highly accurate Short-Term Load Forecasting (STLF) is critical for the daily operation of power plants, and state-of-the-art approaches to it involve hybrid models that deploy regressive deep learning algorithms, such as neural networks, in conjunction with clustering techniques that pre-process the load data before they are fed to the neural network. This paper develops and evaluates four robust STLF models based on Multi-Layer Perceptrons (MLPs) coupled with the K-Means and Fuzzy C-Means clustering algorithms. The first set of two models clusters the data before feeding them to the MLPs; these models are directly comparable to similar existing approaches, yet yield better forecasting accuracy. They also serve as a common reference point for the evaluation of the second set of two models, which further enhance the input to the MLP by informing it explicitly with clustering information, which is a novel feature. All four models are designed, tested and evaluated using data from the Greek power system, although their development is generic and they could, in principle, be applied to any power system. The results obtained by the four models are compared to those of other STLF methods, using objective metrics, and the accuracy obtained, as well as convergence time, is in most cases improved.


Introduction
Two requirements of the Net Zero by 2050 initiative for "greener" power grids, namely the integration of renewable energy sources and the connection of volatile loads such as electric vehicles, dramatically affect the stable and efficient operation of power systems. Satisfying load needs instantly and at all times becomes a major challenge, and all aspects of the operation and management of power plants, such as economic dispatch [1], demand side management [2], price forecasting [3,4], maintenance scheduling and the formulation of an effective bidding strategy in power system markets [5], along with the financial viability of electrical companies themselves, increasingly rely on accurate load predictions [6].
Electric load forecasting has, justifiably, been the focus of much research, and work in the area is classified into three categories based on the time horizon and the operational choice that must be made, namely short-term, medium-term, and long-term forecasting. Long-term load forecasting generally spans 20 years and is required for planning purposes, such as the construction of new power plants and the upgrade of transmission system capacity. Medium-term load forecasting ranges from a few weeks to a year and is mostly used for scheduling maintenance and fuel supply [7]. The day-to-day functioning of the power system necessitates Short-Term Load Forecasting (STLF), which is primarily influenced by temporal factors (for example, weekly periodicity and seasonal fluctuations) and weather conditions (for example, humidity, temperature, wind speed, and cloud coverage) [8]. STLF is considered essential for the smooth and uninterrupted operation of a power system, because it enables load flow studies and contingency analysis on issues such as bus voltages, line currents, power generation, and line flows. Therefore, in order to achieve high accuracy in forecasting results, various load forecasting models have been developed and investigated [9].
In the context of these deep learning approaches, the demand for even more accurate predictions has led researchers to develop hybrid forecasting models, which integrate a clustering algorithm for the pre-processing of data before they are used to train the neural network. Typically, clustering methods are used to create clusters of the load data, which are first pre-processed using an enhanced min-max scaling method [13]. Short-term load forecasting coupled with a clustering strategy has been extensively studied using methods based on RNNs, Long Short-Term Memory (LSTM), CNNs, SVMs [19], ANNs, Simple Exponential Smoothing (SES) and Group Method of Data Handling (GMDH) algorithms [20].
Hernández et al. [21] use a hybrid clustering approach to evaluate short-term load forecasts on the Soria microgrid. First, a Self-Organizing Map (SOM) is used to categorize historical data, and the K-Means clustering method is then applied to group the data of each category. To achieve appropriate forecasts of the load curve, a separate MLP is trained for each data cluster. Despite its complexity, this method achieves a Mean Absolute Percentage Error (MAPE) near 2%. In a similar attempt, Farfar et al. [22] utilize a hybrid forecasting model based on clustering load profiles alongside a daily temperature estimator. In the regression phase, artificial neural networks are used for the daily load forecast of each cluster, with initial weights computed through stacked denoising autoencoders. Each cluster's MAPE does not decrease below 1.9%.
In [23], K-Means is applied to cluster load data, and CNNs are utilized to estimate the following day's load in conjunction with meteorological and consumer categorization data. The researchers report a MAPE of 7.41% for winter days and a MAPE close to 3.06% for summer days. Unlike prior studies, where K-Means was applied solely to load data, a novel clustering technique is introduced in [24]. The authors suggest applying the clustering technique to normalized input variables, which include weather and label data that reflect seasonal features in addition to load data. The MAPE of the predicted load values obtained using the suggested technique is about 2%.
In addition to K-Means, the Fuzzy C-Means (FCM) clustering method has been studied extensively in the literature, both for short-term load forecasting and for forecasting the power generated by solar systems [25] and wind turbines [26].
Bian et al. [27] propose grouping the data into clusters based on the strong or weak correlation at adjacent moments, rather than on similar load profiles. They apply FCM to further cluster the data so that each cluster contains similar values. Finally, the resulting sets are fed into neural networks, which are used for STLF and achieve a MAPE of about 2.01%. In [28], the FCM clustering algorithm based on Principal Component Analysis (PCA) is applied to cluster real-time load data of the power system in the state of NSW, Australia, at half-hourly intervals. The centers of an RBF neural network are determined using PCA. Although the forecasted values obtained through the proposed technique have a fairly high MAPE (specifically 5.1%), the technique is more accurate than simpler approaches, such as one based on an RBF neural network alone and one based on an RBF neural network in conjunction with the FCM algorithm. A similar attempt to utilize the FCM algorithm for load prediction, together with Self-Normalizing Gated Recurrent Units (GRU), is described in [29]. FCM is applied to normalized data in order to create clusters of data that belong to similar days. In this scenario, the MAPE does not drop below 2.6%.
This paper presents four generic, robust hybrid STLF models, which use MLP neural networks and the K-Means and Fuzzy C-Means clustering techniques. The models are designed, tested, and evaluated using data from the Greek power system; however, they are generic, in the sense that they can be applied to the load data of any power system. The first set of two models is developed by initially applying the K-Means and Fuzzy C-Means clustering techniques to the load data, in order to generate optimal clusters, and then feeding each cluster to an MLP to produce short-term load predictions. These two models are similar to existing methods, and they were developed in order to serve as a common reference point for comparison with the second set of two models.
However, although similar in spirit to existing methods, these first two models contribute some novelty, because they achieve a MAPE well below the current best of about 2% (around 1.70%, a relative improvement of roughly 15%), as indicated by the preceding discussion of related work. The second set of two models was developed in order to improve on the first set, by using a single MLP per clustering method, which is fed with the original load dataset and an additional input variable containing the cluster label of each point in the load dataset. Hence, one can think of these two models as improved versions of the first set. The labeling information that is used to extend the input to the MLP is produced by the K-Means and Fuzzy C-Means clustering techniques, with the number of clusters chosen by the elbow optimization method. The forecasting results obtained by the second set of models are also better than those of other approaches, with a MAPE well below the current best of 2%. Moreover, all four models are compared to other existing load forecasting approaches and to each other, and exhibit shorter convergence time than classical data pre-processing approaches [21-24,27].
In the remainder of this paper we explain the clustering algorithms that were employed, as well as the performance measures, before proceeding with the details of the four models that were developed. The results are shown and discussed for each model, in relation to the other models and related work.

Clustering Methods for Short-Term Load Forecasting
Clustering is an unsupervised machine learning approach that partitions a dataset into groups (clusters) so that data in the same cluster are close to one another and hence very similar. K-Means [30] and Fuzzy C-Means [31] are two of the most prevalent clustering algorithms used in STLF.

K-Means Clustering Algorithm
K-Means clustering begins with the selection of K representative points from the dataset as the initial centroids. Based on the Euclidean distance metric, each point in the dataset is then assigned to the nearest centroid. Once the clusters have been formed, the centroid of each cluster is updated. The algorithm iterates over these two steps until the centroids no longer change. The optimum number of clusters, indicated by the parameter K, is selected by means of suitable objective functions, the most important of which is the Sum of Squared Errors (SSE), which is defined by Equation (1) and must be minimized:

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2 \qquad (1)$$

where $C_k$ indicates the k-th cluster, $x_i$ is an instance of the given dataset that consists of N points and $c_k$ is the centroid of cluster $C_k$. The centroid of each cluster is updated iteratively through Equation (2):

$$c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i \qquad (2)$$

where $|C_k|$ is the total number of points in cluster k.
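The two alternating steps and the SSE of Equation (1) can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work (which relies on scikit-learn); the function name `kmeans`, the random initialization, and the toy dataset are assumptions made purely for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means on the rows of X (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initial centroids: k distinct points drawn from the dataset.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance of each point to each centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points (Equation (2)).
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and SSE (Equation (1)).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, sse


# Toy data (assumed): two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo, sse_demo = kmeans(X_demo, k=2)
print(labels_demo, sse_demo)
```

On this toy dataset the two groups end up in separate clusters, and the SSE is the sum of the squared distances of the points to their group means.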

Fuzzy C-Means Clustering Algorithm
Strict assignment of points to clusters is not possible in complex datasets with overlapping clusters (i.e., where the original dataset cannot be cleanly partitioned). As a result, K-Means would produce an inappropriate segmentation of such data into clusters. A fuzzy clustering approach (often called soft K-Means clustering) may be used to recover such overlapping structures. In the FCM technique, each data point is assigned a membership score for each cluster, which ranges from 0 to 1, with 0 denoting no membership, 1 denoting total membership, and intermediate values denoting varying degrees of membership. The memberships of a given point across all clusters must sum to 1.
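The purpose of FCM, as in the case of K-Means, is the minimization of the SSE. The membership weight of point $x_i$ in cluster $C_k$ is denoted by $w_{x_{ik}}$ and is recomputed in the FCM update step according to Equation (3):

$$w_{x_{ik}} = \frac{1}{\sum_{j=1}^{K} \left( \frac{\lVert x_i - c_k \rVert}{\lVert x_i - c_j \rVert} \right)^{\frac{2}{\beta - 1}}} \qquad (3)$$

where $x_i$ is an instance of the given dataset that consists of N points, $c_k$ is the centroid of cluster $C_k$, and $\beta > 1$ is a parameter that determines the fuzziness of the clusters. Equation (4) calculates the weighted centroid of $C_k$ based on the fuzzy weights, and Equation (5) provides the SSE function for the clusters defined by the FCM:

$$c_k = \frac{\sum_{i=1}^{N} w_{x_{ik}}^{\beta} \, x_i}{\sum_{i=1}^{N} w_{x_{ik}}^{\beta}} \qquad (4)$$

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{i=1}^{N} w_{x_{ik}}^{\beta} \, \lVert x_i - c_k \rVert^2 \qquad (5)$$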
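The alternation between membership updates and weighted centroid updates can be sketched in NumPy as follows. This is a minimal illustration of the standard Fuzzy C-Means iteration with fuzziness parameter β, not the code used in this work; the function name `fuzzy_c_means`, the default `beta=2.0`, the stopping tolerance, and the toy data are all assumptions.

```python
import numpy as np


def fuzzy_c_means(X, k, beta=2.0, n_iter=200, tol=1e-8, seed=0):
    """Standard Fuzzy C-Means on the rows of X (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row is normalised to sum to 1.
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Wb = W ** beta
        # Weighted centroids computed from the fuzzy weights.
        centroids = (Wb.T @ X) / Wb.sum(axis=0)[:, None]
        # Squared distances, floored to avoid division by zero.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Membership update: w_ik is proportional to d_ik^(-2 / (beta - 1)).
        inv = d2 ** (-1.0 / (beta - 1.0))
        W_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(W_new - W).max() < tol
        W = W_new
        if converged:
            break
    # Centroids and fuzzy SSE for the final memberships.
    Wb = W ** beta
    centroids = (Wb.T @ X) / Wb.sum(axis=0)[:, None]
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    sse = (Wb * d2).sum()
    return W, centroids, sse


# Toy data (assumed): two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
W_demo, centroids_fcm, sse_fcm = fuzzy_c_means(X_demo, k=2)
hard_labels = W_demo.argmax(axis=1)  # crisp label = cluster with highest membership
print(W_demo.round(3), hard_labels)
```

Taking the cluster with the highest membership for each point, as in `hard_labels`, is one simple way to obtain a crisp label from the fuzzy memberships when a single label per data point is needed.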

Elbow Optimization Method
The elbow method is a heuristic used in cluster analysis to determine the optimal number of clusters into which a given dataset may be segmented [32]. The technique plots the value of the cost function, generally the Sum of Squared Errors (SSE), against the number of clusters and then determines the optimal number of clusters (K) by picking the value of K at which the decrease in SSE first slows down markedly, thus forming an elbow in the curve, i.e., the point after which the distortion decreases only slowly and in a roughly linear fashion. As K increases, the SSE decreases, because each cluster contains fewer data points, which lie closer to their respective centroids. The value of K at which the improvement in distortion drops off the most is known as the elbow of the curve, and it is at this point that splitting the dataset into additional clusters should cease.
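A minimal sketch of how the SSE curve behind the elbow method might be computed with scikit-learn is given below. The synthetic dataset with four groups, the helper name `sse_curve`, and the settings `n_init=10` and `random_state=0` are assumptions for illustration; `inertia_` is scikit-learn's name for the K-Means SSE.

```python
import numpy as np
from sklearn.cluster import KMeans


def sse_curve(X, k_values, seed=0):
    """Return the K-Means SSE (scikit-learn's `inertia_`) for each K."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


# Synthetic data (assumed): four compact groups in three dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0, 0, 0], [5, 5, 5], [0, 5, 0], [5, 0, 5]], dtype=float)
X_demo = np.vstack([c + 0.3 * rng.standard_normal((50, 3)) for c in centres])

k_values = list(range(1, 11))
sse_values = sse_curve(X_demo, k_values)

# Decrease in SSE gained by each additional cluster; the elbow is where
# these decreases become small compared with the earlier ones.
decreases = -np.diff(sse_values)
for k, sse in zip(k_values, sse_values):
    print(f"K={k:2d}  SSE={sse:10.2f}")
```

The curve is normally inspected visually, for instance by plotting `sse_values` against `k_values`, since the location of the elbow is a judgment call rather than a uniquely defined quantity.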

Performance Metrics
Certain objective measures must be used to assess the predictive accuracy of a forecasting model, such as an MLP. The Mean Absolute Percentage Error (MAPE) and the coefficient of determination ($R^2$ score) are the two most commonly used metrics in the application of neural networks to regression problems such as STLF.
In statistics, MAPE is a measure of the predictive accuracy afforded by a forecasting method. Because of its intuitive definition in terms of relative error, it is often employed as a loss function for regression tasks and for model evaluation. MAPE is defined by Equation (6) as follows:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \qquad (6)$$

where n is the number of data points, $A_i$ is the actual value and $F_i$ is the forecasted value of each data point.
The $R^2$ score is an important metric for evaluating the performance of a regression-based machine learning model. It is the proportion of the variation in the dependent output attribute that is predictable from the independent input variables. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. An $R^2$ value of 1 means that the model fits the data perfectly, and a value of 0 means that the model explains none of the variation in the output, i.e., it predicts no better than the mean of the data and has very poor predictive power. This implies that the closer the value of the $R^2$ score is to 1, the better the model is trained. The $R^2$ score is calculated from Equation (7) as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - f_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (7)$$

where $y_i$ is the actual output value that is associated with each input instance $x_i$, $f_i$ is the forecasted value for input instance $x_i$, and $\bar{y}$ is the mean of the actual output values.
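Both metrics are straightforward to compute directly from their definitions. The NumPy sketch below is an illustration only; the function names `mape` and `r2_score` and the three-point example are assumptions.

```python
import numpy as np


def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


def r2_score(actual, forecast):
    """Coefficient of determination."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    ss_res = np.sum((actual - forecast) ** 2)       # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up example: relative errors of 10%, 5% and 0% average to a MAPE of 5%.
actual_demo = [100.0, 200.0, 400.0]
forecast_demo = [110.0, 190.0, 400.0]
mape_demo = mape(actual_demo, forecast_demo)
r2_demo = r2_score(actual_demo, forecast_demo)
print(mape_demo, r2_demo)
```

Note that MAPE is undefined when an actual value is zero, which is not a concern for system-level load data but matters for other series.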

Problem Formulation
This paper presents the development and evaluation of four hybrid STLF models that achieve high accuracy with fast convergence. The models employ MLP neural networks and optimized clustering methods, and they were developed and tested using historical hourly load data of the Greek power system from the period 2013-2017, obtained through the ENTSO-E platform [33]. Air temperature was included in the input data of the MLPs, in addition to load data, to increase the precision of the prediction. The data of the period 2013-2016 are used for training (a total of 35,040 data points), and the data of the year 2017 are used as the test set to assess the accuracy of the predictions (8,760 entries). The development and implementation of the K-Means and Fuzzy C-Means algorithms for generating the optimum number of clusters with the use of the elbow optimization method is discussed in what follows. In a nutshell, two experiments were conducted. In the first, each of the two clustering methods, namely K-Means and FCM, is applied to the input data, thus producing a set of clusters. For each cluster produced, a separate MLP is trained to produce STLF predictions, thus yielding two models. In the second experiment, the two clustering methods are again applied to the input data, each producing a set of clusters. The cluster labels generated by each clustering method are then fed, along with load and temperature data, to a single MLP per clustering method, thus resulting in two STLF models with faster convergence and improved accuracy compared to existing approaches. The MLP neural networks, the clustering algorithms, and the elbow optimization method were developed with Python's scikit-learn library [34] and run on a computer with an Intel Core i7-4510U CPU and 8 GB of installed RAM.

Calculation of Optimum Number of Clusters
Since STLF is inextricably linked to data concerning temperature, humidity, and historical load values, the algorithms are applied to datasets that include weather and load data. The load data are processed in order to form clusters based on the load profile of each data point using the K-Means and FCM techniques. The variables considered for the application of the clustering methods are the load value at the same time on the same day of the previous week (D-7 Load), the load value at the same time on the previous day (D-1 Load), and the load value at the previous hour (H-1 Load). The load data used for clustering are pre-processed through the enhanced min-max scaling method, which leads to improved forecasting results compared to the simple min-max scaling technique [13]. Figure 1 illustrates the process by which clusters are formed for each of the two clustering algorithms used. The optimal number of clusters for the data of the Greek power system is derived by using the elbow optimization heuristic, which plots the explained variation as a function of the number of clusters by calculating the SSE, and picks the elbow of the curve as the number of clusters to use. In order to generate the optimal number of clusters, the SSE is computed for a number of clusters in the interval [1,10], which is a common procedure. Figure 2 shows the SSE for values of the variable K in the interval [1,10]. After extensive experimentation and comparison of forecasting results, using MAPE as a metric, the optimal separation of data into clusters, based on the load profile for the specific dataset, was found to occur for K = 4. Since clustering is based solely on the three variables D-7 Load, D-1 Load, and H-1 Load, each data point may be represented as a point in a three-dimensional space with the values of these variables as coordinates. Figure 3 depicts the clusters of the dataset produced by the K-Means (on the left) and FCM (on the right) clustering algorithms.
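As an illustration of how the three clustering variables might be assembled from an hourly load series, the following NumPy sketch builds the D-7, D-1 and H-1 lags (168, 24 and 1 hours back, respectively) and rescales them. The helper names are hypothetical, the input is a made-up ramp rather than real load data, and plain min-max scaling is used only as a stand-in, since the enhanced min-max scaling of [13] is not specified here.

```python
import numpy as np


def build_lag_features(load):
    """Return (features, target) with columns D-7 Load, D-1 Load, H-1 Load."""
    load = np.asarray(load, dtype=float)
    lag_week, lag_day, lag_hour = 168, 24, 1
    # The first usable hour is the one with a full week of history behind it.
    t = np.arange(lag_week, len(load))
    features = np.column_stack([
        load[t - lag_week],  # same hour, same day of the previous week
        load[t - lag_day],   # same hour of the previous day
        load[t - lag_hour],  # previous hour
    ])
    return features, load[t]


def min_max_scale(features):
    """Plain column-wise min-max scaling to [0, 1] (stand-in for [13])."""
    lo = features.min(axis=0)
    hi = features.max(axis=0)
    return (features - lo) / (hi - lo)


# Made-up hourly series: a simple ramp of 200 values.
load_demo = np.arange(200.0)
features_demo, target_demo = build_lag_features(load_demo)
scaled_demo = min_max_scale(features_demo)
print(features_demo[0], target_demo[0])
```

Each row of `scaled_demo` is then a point in the three-dimensional space described above, ready to be passed to a clustering algorithm.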

Short-Term Load Forecasting Approaches in Conjunction with Clustering Techniques
Clustering algorithms have been used to generate input for MLPs in classical approaches to STLF, where a separate MLP neural network is used for each cluster. The hybrid models created in this way in the first experiment are shown schematically in Figure 4. The neural network input variables are:
In the second experiment, a single MLP is used for STLF, fed with an additional input variable containing the cluster label produced by either the K-Means or the Fuzzy C-Means clustering technique, with the number of clusters optimized via the elbow method. The STLF model for this approach is shown in Figure 5. The input variables of the neural network include, in addition to those of the first experiment, the label of each data point of the load data, as produced and optimized by each clustering algorithm. The variable Label for Cluster takes integer values from 0 to 3 (since the separation of the dataset into four clusters was determined to be optimal) and indicates the cluster to which each data point belongs. In total, four STLF models emerged: two from the first experiment, which follows classical hybrid approaches, and two from the second, which augments/informs the neural network input with clustering information. The four models were applied to a dataset of the Greek power system and the forecasting results were compared to each other and to other existing load forecasting approaches. Moreover, this comparison indicates which of the clustering algorithms is more appropriate for partitioning a dataset into clusters. Extensive testing and experimentation show that combining MLP neural networks with optimized dataset clustering improves both the accuracy and the convergence time of the forecasting model.
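The difference between the two experiments can be made concrete with a small scikit-learn sketch: one MLP per cluster versus a single MLP whose input is extended with the cluster label. Everything specific here is an assumption for illustration, including the synthetic data, the helper names, the single hidden layer of 16 units, and the use of K-Means labels only; it is not the configuration evaluated in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor


def make_mlp(seed=0):
    # Hypothetical architecture, chosen only to keep the example small.
    return MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=seed)


def fit_per_cluster_mlps(X, y, labels):
    """First experiment: train a separate MLP on the points of each cluster."""
    return {c: make_mlp().fit(X[labels == c], y[labels == c])
            for c in np.unique(labels)}


def predict_per_cluster(models, X, labels):
    """Route every point to the MLP of the cluster it belongs to."""
    y_hat = np.empty(len(X))
    for c, model in models.items():
        mask = labels == c
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat


def fit_cluster_informed_mlp(X, y, labels):
    """Second experiment: one MLP with the cluster label as an extra input."""
    return make_mlp().fit(np.column_stack([X, labels]), y)


# Synthetic stand-in for scaled input features and load targets.
rng = np.random.default_rng(0)
X_all = rng.random((400, 3))
y_all = X_all @ np.array([0.2, 0.3, 0.5]) + 0.01 * rng.standard_normal(400)
X_train, X_test = X_all[:300], X_all[300:]
y_train, y_test = y_all[:300], y_all[300:]

# Four clusters, matching the number of clusters reported in the text.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
labels_train, labels_test = km.labels_, km.predict(X_test)

separate_models = fit_per_cluster_mlps(X_train, y_train, labels_train)
y_hat_separate = predict_per_cluster(separate_models, X_test, labels_test)

informed_model = fit_cluster_informed_mlp(X_train, y_train, labels_train)
y_hat_informed = informed_model.predict(np.column_stack([X_test, labels_test]))
```

In the first variant, four networks are trained and a test point must be assigned to a cluster before it can be routed to the right one; in the second, a single network is trained once on the augmented input, which is consistent with the shorter convergence times reported for the second set of models.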

Results
This section presents the results obtained from the four STLF models that emerged from the two experiments. MAPE and $R^2$ score are used as metrics in order to evaluate the accuracy of the prediction of each model. Table 1 shows the total MAPE and $R^2$ score for the predictive method, and for each cluster individually, for the first model, where K-Means clustering followed by separate MLPs for each cluster was used. Table 2 provides the same information for the second model, where Fuzzy C-Means clustering followed by separate MLPs for each cluster was used. Figure 6 provides a graphical comparison of the actual load values and the prediction results obtained from these two models. Both approaches performed well, in terms of MAPE and $R^2$ score, compared with existing methods.
Figure 6. Actual and predicted load curves resulting from the method using a distinct MLP for each cluster.
Table 3 presents the MAPE and the $R^2$ score of the third and fourth models, where the MLP input is informed with labeling information acquired from the application of the K-Means and FCM clustering algorithms, respectively. Figure 7 provides a graphical comparison between actual load values and the forecasted load values, indicatively for some days in February 2017, calculated using the third and fourth models. Figure 8 focuses on the K-Means clustering method and graphically compares the results obtained from the first and third models. A similar graphical comparison of the results obtained with the Fuzzy C-Means clustering method, that is, from the second and fourth models, is presented in Figure 9.
Apart from the MAPE and $R^2$ score, the performance of STLF using MLPs in conjunction with K-Means and Fuzzy C-Means is also evaluated by measuring the execution time required for each approach. Table 4 provides the time (in seconds) needed for the load forecasting of the year 2017 by each of the four models.

Discussion
The use of a clustering algorithm that properly groups the data based on their load profile clearly improves the accuracy of STLF results, as acknowledged by several related works in this area. The current best MAPE obtained is around 2%, although it should be noted that different datasets from different power systems are used. In [13], which uses the same load data from the Greek power system as this work, the MAPE is equal to 1.80%. Tables 1 and 2 demonstrate that the first set of models that we developed is more accurate, achieving a lower MAPE for both clustering methods employed.
The results presented in Table 3 demonstrate that for the second set of models, where a single MLP is employed, informed explicitly with the clustering labels of the input data points, both K-Means and FCM improve the load prediction compared to [13], and the FCM specifically has the best overall accuracy. However, both models yield slightly lower accuracy than their counterparts from the first set, but converge faster than them. In fact, Table 4 demonstrates that the fourth model, which uses FCM, converges remarkably faster than the others.
A comparison of the results obtained from all four proposed models with similar STLF methods, which use neural network prediction techniques in conjunction with a clustering algorithm, reveals that the methods described in this paper perform similarly or better in most cases. However, note that an exact comparison requires evaluation on exactly the same dataset. In [21-24,27], which use similar techniques for short-term load forecasting, the MAPE takes values close to 2%, while in the present work the lowest MAPE is equal to 1.69%. Table 5 presents the MAPE achieved by the various techniques proposed by other researchers and considered in the related literature review. The comparison indicates that the models proposed here lead to an improved MAPE and therefore greater prediction accuracy for STLF.

Conclusions
This paper examines the integration of clustering algorithms with neural networks for the purposes of developing fast and accurate STLF models. Two ways in which such integration can be implemented were considered, and as a result two sets of models were designed, tested, and evaluated on the same dataset. The first set of models followed the standard approach to hybrid STLF model development, in which the dataset is first clustered and then each cluster is used to train an MLP. Since we experimented with two clustering algorithms, namely K-Means and Fuzzy C-Means, this first set produced two models, which were used as a reference point. These first two models do present an improvement on the current best score in the relevant literature, because the dataset is initially subjected to enhanced scaling, which has been evaluated in a separate paper [13].
The second way in which clustering algorithms can be integrated with neural networks is explored in the second set of models that were developed. In this case, first the dataset is clustered (using K-Means and Fuzzy C-Means, again), and then a single MLP is trained, whose input variables are augmented with the inclusion of the labeling information produced by the clustering.
All four models were evaluated using load data of the Greek power system as a common reference point. All models yielded better accuracy than other methods (as reflected by MAPE values below 2%). Moreover, the models of the second set, where the MLP is informed by clustering, converged significantly faster. The experiments suggest that the FCM-informed MLP is the fastest model; however, confirming this requires evaluation on other datasets as well, which is one direction for future work. A second direction for future work involves experimenting with other clustering algorithms to establish whether they might offer even better accuracy and convergence time.