Adaptive Clustering Long Short-Term Memory Network for Short-Term Power Load Forecasting

Abstract: Short-term load forecasting (STLF) plays an important role in facilitating efficient and reliable operations of power systems and optimizing energy planning in the electricity market. To improve the accuracy of power load prediction, an adaptive clustering long short-term memory network is proposed to effectively combine the clustering process and the prediction process. More specifically, the clustering process adopts the maximum deviation similarity criterion clustering algorithm (MDSC) as the clustering framework. A bee-foraging learning particle swarm optimization is further applied to realize the adaptive optimization of its hyperparameters. The prediction process consists of three parts: (i) a 9-dimensional load feature vector is proposed as the classification feature of SVM to obtain the load similarity cluster of the predicted day; (ii) the data of the same cluster are used as the training data of a long short-term memory network; (iii) the trained network is used to predict the power load curve of the predicted day. Finally, experimental results are presented to show that the proposed scheme achieves an advantage in prediction accuracy, where the mean absolute percentage error between the predicted and real values is only 8.05% for the first day.


Introduction
With the development of society, electricity plays a crucial role in industrial, commercial, and residential settings. It is essential to accurately predict variations in electricity load to ensure the stable operation of the power system [1]. Short-term load forecasting (STLF) is a vital aspect of energy forecasting that involves predicting instantaneous electricity load values at hourly intervals for the next day or several days [2][3][4]. STLF is critical for power system scheduling, energy optimization, and efficient operation of electricity markets [5]. However, STLF presents challenges due to its uncertainty and non-linear characteristics influenced by factors like weather conditions, holidays, seasons, and industrial production. This problem involves high randomness and difficulties in establishing mathematical models as well as selecting appropriate features [6]. Various algorithms and approaches have been proposed by scholars worldwide to improve the accuracy of load forecasting. These include time series forecasting (TSF) [7,8] and support vector machine (SVM) [9]. With the advancement of artificial intelligence, predictive methods based on deep learning have gained popularity due to their ability to approximate high-dimensional functions, uncover hidden information in data, and extract abstract features [10]. Deep learning is more robust compared to traditional TSF and SVM. Specifically, long short-term memory networks (LSTM) [10], a type of recurrent neural network, are preferred for addressing the vanishing gradient problem and improving performance in handling time series data.
For short-term residential load forecasting, Ref. [11] introduced an LSTM-based framework that outperforms conventional backpropagation neural networks in experiments. In Ref. [12], a hybrid method combining variational mode decomposition with LSTM is studied for processing STLF. Additionally, Ref. [13] proposes a hybrid approach using multivariable linear regression and LSTM for short-term load forecasting. To improve prediction accuracy, researchers have applied clustering analysis to the original load data, dividing the data into clusters before making predictions [14]. Common clustering algorithms for electricity load data include K-Means [15] and density-based spatial clustering of applications with noise (DBSCAN) [16]. For example, in Ref. [17], K-Means was used to classify users based on their electricity consumption patterns, and backpropagation (BP) neural networks were applied for short-term load forecasting. Ref. [18] combined deep learning with K-Means to extract similarities in residential load for accurate individual-level prediction. Additionally, a short-term support vector machine load forecasting method based on K-Means was proposed in Ref. [19], whereas a combination of K-Means and fuzzy processing techniques was used for short-term load prediction in Ref. [20]. Furthermore, in Ref. [21], ultra-short-term load forecasting employed K-Means to divide historical data into clusters and utilized long- and short-term time-series networks (LSTNet). In Ref. [22], DBSCAN was applied for cluster analysis, followed by multiple neural networks for load forecasting. Compared to methods such as K-Means and DBSCAN, our previous work introduced a maximum deviation similarity criterion (MDSC) clustering algorithm specifically designed for STLF, which demonstrated superior performance on high-dimensional electricity load data [23].
It is worth noting that clustering algorithms typically require the setting of one or more parameters to be effective [24]. For example, K-Means requires specifying the number of clusters, whereas DBSCAN needs two parameters: the neighborhood radius and the minimum density. Some researchers have used heuristic algorithms to automatically determine these parameter values for better clustering results. In Ref. [25], an automatic photon cloud filtering algorithm based on particle swarm optimization (PSO) was proposed to optimize the key parameters of DBSCAN instead of adjusting them manually. Another approach, in Ref. [26], used the nearest neighbor function and a genetic algorithm to automate DBSCAN's parameters. In our previous work, MDSC relied on extensive parameter experiments involving five parameters: the maximum deviation, the allowed deviation at the maximum deviation point, the similarity, the deviation, and the noise threshold [23]. However, this manual selection process is time-consuming and does not guarantee optimal parameter values. Therefore, it is crucial to achieve adaptive optimization of MDSC's parameters in order to obtain improved clustering results. To address this issue, this paper proposes using intelligent algorithms to adapt the settings of MDSC's parameters. The main contributions of this work are summarized as follows: (1) To enhance the accuracy of short-term load forecasting (STLF), we utilize a bee-foraging learning particle swarm optimization (BFLPSO) algorithm [27] to adaptively optimize the parameters of MDSC, thereby improving clustering performance. (2) We employ a 9-dimensional load feature vector as the SVM classification features to determine the similar cluster for the prediction day; subsequently, LSTM is utilized to generate the power load curve for the predicted day. The remaining sections of this paper are organized as follows: Section 2 discusses the clustering process, whereas Section 3 describes the prediction process. In Section 4, we present the experimental results that demonstrate the effectiveness of our proposed algorithm. Finally, in Section 5, we conclude and outline future work.

Clustering Process
This section introduces the principle of MDSC, describes the parameters of MDSC optimized by BFLPSO (denoted as BFLPSO-MDSC), and outlines the clustering process.

MDSC Clustering Algorithm
MDSC is a method that uses morphological similarity and maximum deviation similarity to analyze short-term power load data [23]. Assume a dataset with n power load profiles, in which each profile, denoted as x_i = (x_i1, x_i2, …, x_ik, …, x_im), represents the load values of x_i at m time points. Here, i ranges from 1 to n, and k ranges from 1 to m. The following definitions are then described:

Definition 1. The absolute difference s_ijk between the load data x_i and x_j at each time point is given by Equation (1):

$$s_{ijk} = \left| x_{ik} - x_{jk} \right| \quad (1)$$

Additionally, if the count of s_ijk instances that satisfy s_ijk ≤ γ is denoted as n_ij, then n_ij represents the number of time points at which x_i and x_j are similar.
where γ (0 ≤ γ ≤ 1) is a predetermined constant known as the maximum deviation; it serves as a threshold for assessing the similarity of the load values at two corresponding time points.

Definition 2.
If there exists a maximum number of consecutive s_ijk values satisfying γ < s_ijk < δ, denoted as m_ij, then m_ij corresponds to the number of time points of the longest consecutive deviation between x_i and x_j. The calculation of m_ij is given by Equation (2), where δ (0 ≤ δ ≤ 1) is a predetermined constant called the allowed deviation at the maximum deviation point.

Definition 3.
When s_ijk ≤ γ, x_ik and x_jk are considered similar; otherwise, they are not.

Definition 4.
With load data x_i as the comparison center, calculate n_ij and m_ij between x_j and x_i, where i, j = 1, 2, …, n. If n_ij and m_ij satisfy both Equations (3) and (4), i.e., n_ij ≥ n_0 and m_ij ≤ m_0, where the thresholds n_0 and m_0 are set via the similarity α and the deviation β, respectively, then x_j is said to be similar to x_i.
Following the above definitions, the MDSC clustering procedure and the way cluster centers are obtained are described below. For each i, j = 1, 2, …, n with i ≤ j, take x_i as the comparison center, compare n_ij and m_ij with n_0 and m_0, respectively, and classify every x_j satisfying the criterion into S(x_i); that is, let S(x_i) = S(x_i) ∪ {x_j} and remove x_j from the set of original load data U, where S(x_i) is the set of profiles similar to x_i. Finally, calculate D(x_i) according to Equation (5).
If x_i represents the load profile that minimizes the function D(x_i), it can be considered the cluster center of the cluster S(x_i).
However, it is important to note that when clustering the load data by MDSC, a small portion of the data may form separate clusters comprising only one or a few data points. These isolated data points, which do not belong to any of the main clusters, are considered noise and are excluded by the algorithm. The threshold λ for identifying noise data is determined from the cluster sizes, where C_i is the number of load profiles contained in the i-th cluster obtained by the MDSC algorithm.
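To make the two deviation counts concrete, the following is a minimal Python sketch of Definitions 1 and 2, assuming daily load profiles normalized to [0, 1] (consistent with 0 ≤ γ, δ ≤ 1):

```python
import numpy as np

def mdsc_similarity(x_i, x_j, gamma, delta):
    """Compute n_ij and m_ij between two normalized daily load profiles
    (Definitions 1 and 2)."""
    s = np.abs(x_i - x_j)            # s_ijk, Equation (1)
    n_ij = int(np.sum(s <= gamma))   # time points where the profiles are similar

    m_ij = run = 0                   # longest consecutive run with gamma < s < delta
    for dev in s:
        run = run + 1 if gamma < dev < delta else 0
        m_ij = max(m_ij, run)
    return n_ij, m_ij
```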

Optimizing the Parameters of MDSC with BFLPSO
Based on the MDSC algorithm described above, five parameters must be preconfigured: the maximum deviation γ, the allowed deviation at the maximum deviation point δ, the similarity α, the deviation β, and the noise threshold λ. These parameters significantly influence the clustering effect of MDSC. At present, they are set based on individual experience alone, resulting in unscientific values and suboptimal clustering outcomes. To address these challenges, this paper introduces the BFLPSO-MDSC algorithm, which utilizes BFLPSO to adaptively optimize the parameters of MDSC and thereby enhance overall clustering performance.

In PSO, each particle in the swarm has unique attributes: a position, a velocity, and a fitness value determined by the optimization objective function. The particles traverse the search space at their current velocities, and the algorithm seeks the global optimum by continuously updating the positions and velocities of the population based on the trajectory of the current optimal particle. However, traditional PSO often gets stuck in local optima [28]. To address this issue, Ref. [27] introduced BFLPSO in 2021 by integrating the bee-foraging learning (BFL) model into PSO. Whereas traditional PSO only has a stage similar to the employed-bee phase of the BFL model, BFLPSO incorporates two additional learning phases from the BFL model: the onlooker learning phase and the scout learning phase. The particle update formulas of BFLPSO are given by Equations (8) and (9):

$$q_k^{t+1} = w \, q_k^t + c \cdot \mathrm{rand}() \cdot \left( \mathrm{pbest}^{\tau_k} - \mathrm{pos}_k^t \right) \quad (8)$$

$$\mathrm{pos}_k^{t+1} = \mathrm{pos}_k^t + q_k^{t+1} \quad (9)$$

where pos_k^t and q_k^t are the position and velocity of particle k at moment t; pos_k^{t+1} and q_k^{t+1} are the position and velocity of particle k at moment t + 1; τ_k is the index of the bee-foraging learning paradigm applied to the personal historical optimal position of particle k; pbest^{τ_k} is constructed from the combination of all particles' personal best positions; w is the inertia weight; c is the learning factor; and rand() is a random number in [0, 1].
Ref. [27] has demonstrated the effectiveness of BFLPSO in solving nonlinear problems, surpassing PSO in performance. In addition, Ref. [29] suggests that PSO exhibits higher computational efficiency than other meta-heuristic algorithms. Hence, this paper adopts BFLPSO as the adaptive optimization framework. On the basis of BFLPSO and according to the parameter characteristics of MDSC, the coding of the particle pos_k^t designed in this paper is given by Equation (10):

$$\mathrm{pos}_k^t = \left( \gamma_k^t, \delta_k^t, \alpha_k^t, \beta_k^t, \lambda_k^t \right) \quad (10)$$

where γ_k^t, δ_k^t, α_k^t, β_k^t, and λ_k^t are a corresponding parameter set of MDSC. In addition, the fitness function of the particle pos_k^t is given by Equation (11):

$$f\!\left( \mathrm{pos}_k^t \right) = \chi \cdot I_{SSE} + I_{DBI} \quad (11)$$

where I_SSE is the sum of squared errors cluster validity index (SSE); I_DBI is the Davies-Bouldin cluster validity index (DBI); and χ is a scaling factor. As SSE and DBI are of different magnitudes, the scaling factor is needed to scale SSE so that SSE and DBI contribute comparably to the objective function and better parameter values are obtained. The calculation of I_SSE and I_DBI is briefly described below. I_SSE is the sum of the Euclidean distances from the intra-cluster elements of all clusters to the centers of the clusters in which they are located, that is:

$$I_{SSE} = \sum_{i=1}^{H} \sum_{x \in X_i} d\!\left( x, \mathrm{center}_i \right) \quad (12)$$

where H is the number of clusters after clustering; X_i is the data of the i-th cluster; d is the Euclidean distance; and center_i is the cluster center of the i-th cluster, defined as follows:

$$\mathrm{center}_i = \frac{1}{|X_i|} \sum_{x \in X_i} x \quad (13)$$

I_DBI combines the compactness within clusters and the dispersion between clusters, and is calculated according to:

$$I_{DBI} = \frac{1}{H} \sum_{i=1}^{H} \max_{j \neq i} \frac{\mathrm{avgD}_i + \mathrm{avgD}_j}{d\!\left( \mathrm{center}_i, \mathrm{center}_j \right)} \quad (14)$$

where avgD_i denotes the average distance from all data of the i-th cluster to the cluster center center_i, namely:

$$\mathrm{avgD}_i = \frac{1}{|X_i|} \sum_{x \in X_i} d\!\left( x, \mathrm{center}_i \right) \quad (15)$$

and X_j represents the data of the j-th cluster. A smaller DBI indicates higher data compactness within each cluster after clustering, which signifies an improved clustering effect.
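As an illustration of Equations (11)-(15), the following is a minimal sketch of the fitness computation, assuming each cluster is given as an array of load profiles and that there are at least two clusters:

```python
import numpy as np

def sse(clusters, centers):
    """I_SSE (Equation (12)): total Euclidean distance from every profile
    to the center of its own cluster."""
    return sum(np.linalg.norm(x - c)
               for X, c in zip(clusters, centers) for x in X)

def dbi(clusters, centers):
    """I_DBI (Equations (14) and (15)); assumes at least two clusters."""
    H = len(clusters)
    avg_d = [np.mean([np.linalg.norm(x - c) for x in X])
             for X, c in zip(clusters, centers)]
    score = 0.0
    for i in range(H):
        score += max((avg_d[i] + avg_d[j]) /
                     np.linalg.norm(centers[i] - centers[j])
                     for j in range(H) if j != i)
    return score / H

def fitness(clusters, centers, chi):
    """Equation (11): chi scales SSE onto the magnitude of DBI."""
    return chi * sse(clusters, centers) + dbi(clusters, centers)
```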

The Steps of the Clustering Process
BFLPSO-MDSC consists of six steps:
Step 1: Initialize the particles of BFLPSO with a total of 10 particles and set the maximum number of iterations to 50.
Step 2: Use the particles generated in Step 1 as parameters for MDSC. Perform MDSC to obtain SSE and DBI values corresponding to each particle. Calculate the fitness value of each particle using Equation (11).
Step 3: Update the positions and velocities of all particles using Equations (8) and (9). Compute the fitness value for each particle's new position using Equation (11).
Step 4: Check if the fitness value at the new position surpasses its own historical optimal value. If it does, update the historical optimal value for that particle. Additionally, evaluate if this fitness value is better than the global optimal value. If it is, update both the global optimal value and save the global particle.
Step 5: Repeat Steps 3 and 4 until reaching N, which represents the maximum number of iterations.
Step 6: Finally, output the optimal position of the global particle along with its corresponding fitness value, as well as provide information about the number of clusters H and present an optimized clustering result. Refer to Figure 1 for a visual representation of BFLPSO-MDSC's flowchart.
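The following sketch ties the six steps together. It assumes a hypothetical wrapper run_mdsc(params) that executes MDSC with a candidate parameter set and returns the clusters and their centers, and reuses the fitness() sketch above; for brevity it applies only the simplified update of Equations (8) and (9), without the onlooker and scout phases of the full BFLPSO of Ref. [27], and the values of w and c are assumptions:

```python
import numpy as np

N_PARTICLES, MAX_ITER, DIM = 10, 50, 5   # DIM: (gamma, delta, alpha, beta, lambda)

def bflpso_mdsc(run_mdsc, fitness, chi, w=0.7, c=1.5):
    """Sketch of the six clustering steps; run_mdsc(params) -> (clusters, centers)
    is a hypothetical wrapper around MDSC."""
    pos = np.random.rand(N_PARTICLES, DIM)                    # Step 1: initialize
    vel = np.zeros((N_PARTICLES, DIM))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(*run_mdsc(p), chi) for p in pos])   # Step 2
    g = int(np.argmin(pbest_fit))                             # global best index

    for _ in range(MAX_ITER):                                 # Steps 3-5
        for k in range(N_PARTICLES):
            # Simplified employed-bee style update (Equations (8) and (9)).
            exemplar = pbest[np.random.randint(N_PARTICLES)]  # stands in for pbest^tau_k
            vel[k] = w * vel[k] + c * np.random.rand(DIM) * (exemplar - pos[k])
            pos[k] = np.clip(pos[k] + vel[k], 0.0, 1.0)
            fit = fitness(*run_mdsc(pos[k]), chi)
            if fit < pbest_fit[k]:                            # Step 4: personal best
                pbest[k], pbest_fit[k] = pos[k].copy(), fit
                if fit < pbest_fit[g]:
                    g = k                                     # update global best
    return pbest[g], pbest_fit[g]                             # Step 6: best parameters
```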

Prediction Process
The prediction process consists of three main steps. First, load characteristic vectors are used to represent the load curves. Then, SVM-based similar clusters are selected. Finally, LSTM is employed for the prediction.

Load Characteristic Vector
To enhance the identification of similar clusters on the forecast day, this paper utilizes the 9-dimensional load characteristic vector FV as a representation of the daily load curve; its definition is given by Equation (16).
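For illustration only, the sketch below shows what such a 9-dimensional vector could look like; the components chosen here (peak, valley, period averages, and so on) are placeholders and are not the paper's Equation (16):

```python
import numpy as np

def feature_vector(day):
    """A hypothetical 9-dimensional daily load feature vector; the concrete
    components are defined by Equation (16), so the statistics below are
    illustrative placeholders only."""
    day = np.asarray(day)                 # 96 samples at 15-min intervals
    return np.array([
        day.max(), day.min(), day.mean(), day.std(),
        day.max() - day.min(),            # daily peak-valley difference
        day[:32].mean(),                  # 00:00-08:00 average
        day[32:72].mean(),                # 08:00-18:00 average
        day[72:].mean(),                  # 18:00-24:00 average
        day.min() / day.max(),            # daily load rate
    ])
```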

Similar Cluster Selection Based on SVM
This paper utilizes SVM to classify the load characteristic vector of the forecast day into its similarity cluster. The SVM training process is as follows: First, the original load dataset is labeled as SETA. Then, Equation (16) is applied to each day in SETA to obtain its load characteristic vector, and these vectors, together with the cluster labels produced by BFLPSO-MDSC, are used to train the SVM classifier.
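A minimal sketch of this classification step, assuming scikit-learn; the RBF kernel and default hyperparameters are assumptions, as the paper does not specify the SVM configuration:

```python
from sklearn.svm import SVC

def train_cluster_classifier(fv_train, labels):
    """Train an SVM that maps a day's 9-dimensional feature vector to its
    load-similarity cluster (labels 1..H from BFLPSO-MDSC). The RBF kernel
    is an assumption."""
    clf = SVC(kernel="rbf")
    clf.fit(fv_train, labels)
    return clf

# Usage: h = train_cluster_classifier(fv_train, labels).predict([fv_forecast])[0]
```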

LSTM Training
LSTM is selected as the primary forecasting framework in this paper due to its effectiveness in predicting time-series data [11]. To prepare for future predictions, two kinds of LSTM neural networks are trained. The original load data of the H clusters are used to train LSTM, resulting in H networks denoted LSTM_O,1, LSTM_O,2, …, LSTM_O,h, …, LSTM_O,H, respectively. Furthermore, the load characteristic vectors corresponding to the original load data are used to train a further network, denoted LSTM_V. In accordance with our previous work [30], the LSTM networks use the following structural hyperparameters: a sequence length of 12, two hidden layers, and a learning rate of 1.
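A sketch of one such network in TensorFlow/Keras follows; only the sequence length of 12 and the two-layer structure follow the text, whereas the layer width, optimizer settings, and output dimension are assumptions:

```python
import tensorflow as tf

SEQ_LEN = 12          # sequence length from Ref. [30]

def build_lstm(n_features=1, units=64):
    """A sketch of one cluster-specific network LSTM_O,h; hidden size and
    optimizer are assumptions."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, n_features)),
        tf.keras.layers.LSTM(units, return_sequences=True),   # hidden layer 1
        tf.keras.layers.LSTM(units),                          # hidden layer 2
        tf.keras.layers.Dense(n_features),                    # next load value
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```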

The Steps of Prediction Algorithm
The proposed algorithm, the adaptive clustering long short-term memory network (ACLSTM), consists of six steps:
Step 1: Employ BFLPSO-MDSC to cluster and analyze the original load data, obtaining the clustering outcome and the number of clusters H.
Step 2: Obtain the load characteristic vector data corresponding to the original load data. The load characteristic vector data are then divided into the H clusters, and the SVM is trained with labels [1, 2, …, H].
Step 3: Use the original data acquired from the clustering in Step 1 to train an individual LSTM neural network for each cluster, obtaining LSTM_O,1, LSTM_O,2, …, LSTM_O,h, …, LSTM_O,H. Similarly, LSTM_V is obtained from the corresponding load characteristic vector data.
Step 4: Utilize LSTM_V to derive the load characteristic vector FV for the forecast day.
Step 5: Input the FV into the SVM trained in Step 2 to obtain the similarity cluster h (h ∈ {1, 2, …, H}) for the forecast day.
Step 6: Based on the cluster h obtained in Step 5, choose the corresponding neural network LSTM_O,h for load prediction. Then, the load profile for the forecast day can be obtained. Please refer to Figure 2 for a visual representation of ACLSTM's flowchart.
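Once the models from Steps 1-3 are trained, Steps 4-6 reduce to three calls, as in the following sketch (model and variable names are illustrative):

```python
def predict_day(lstm_v, svm, lstm_o, recent_fv_seq, recent_load_seq):
    """Steps 4-6 of ACLSTM, assuming the models from Steps 1-3 are trained;
    lstm_o is a dict mapping cluster label h to the network LSTM_O,h."""
    fv = lstm_v.predict(recent_fv_seq)            # Step 4: forecast day's FV
    h = svm.predict(fv.reshape(1, -1))[0]         # Step 5: similar cluster h
    return lstm_o[h].predict(recent_load_seq)     # Step 6: 96-point load curve
```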


Experimental Environment
The experimental setup comprised 16 GB of memory and an Intel(R) Core(TM) i7-8750H processor, running Windows 10. C++ and Python were used for programming, and TensorFlow was employed as the neural network framework [22].

Experimental Data
In this study, one year's worth of historical power load data from a substation located in Foshan, Guangdong Province, China was used as the training dataset. Each load curve consists of 96 instantaneous sampling values representing the power load for a single day; the sampling time ranged from 00:00 to 23:45 with a sampling interval of 15 min, meaning that the power load values were recorded every 15 min throughout the day. To evaluate the prediction performance, six evaluation indices were employed in this paper: the mean absolute percentage error (MAPE), maximum error (EMAX), minimum error (EMIN), mean absolute error (MAE), mean square error (MSE), and coefficient of determination (R^2). With y_t the real load, ŷ_t the predicted load, ȳ the mean of the real loads, and N the number of predicted points, they take the standard forms:

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{\hat{y}_t - y_t}{y_t} \right|, \qquad E_{\mathrm{MAX}} = \max_{t} \left| \hat{y}_t - y_t \right|, \qquad E_{\mathrm{MIN}} = \min_{t} \left| \hat{y}_t - y_t \right|,$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| \hat{y}_t - y_t \right|, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{t=1}^{N} \left( \hat{y}_t - y_t \right)^2, \qquad R^2 = 1 - \frac{\sum_{t=1}^{N} \left( \hat{y}_t - y_t \right)^2}{\sum_{t=1}^{N} \left( y_t - \bar{y} \right)^2}.$$
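The six indices can be computed directly from the predicted and real load curves, as in this minimal sketch:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute MAPE, EMAX, EMIN, MAE, MSE, and R^2 in their standard forms."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    err = y_pred - y_true
    mape = 100.0 * np.mean(np.abs(err) / y_true)
    emax, emin = np.max(np.abs(err)), np.min(np.abs(err))
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mape, emax, emin, mae, mse, r2
```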

Experiment 1: Clustering Experiment
To evaluate the clustering effectiveness of BFLPSO-MDSC, a comparison is made between BFLPSO-MDSC and other clustering methods: PSO-MDSC, DBSCAN [16], and K-Means [15]. Specifically, PSO-MDSC replaces BFLPSO with traditional PSO while keeping everything else in BFLPSO-MDSC unchanged. The clustering results of the four approaches are shown in Figures 3-6, respectively.

In Figure 3, BFLPSO-MDSC divides the load data into three distinct clusters. The first and third clusters have values greater than 1.0 MW, whereas the second cluster has values below 0.5 MW. Similarly, Figure 4 shows that PSO-MDSC exhibits a clustering effect similar to BFLPSO-MDSC. Figure 5 demonstrates that DBSCAN also divides the data into three clusters; however, compared to BFLPSO-MDSC and PSO-MDSC, DBSCAN's second cluster contains more noise, and its third cluster holds very little data with low similarity. On the other hand, Figure 6 reveals that K-Means performs poorly in terms of clustering effectiveness, as it is difficult to distinguish between the second and third clusters. Based on Figures 3-6, all four algorithms (BFLPSO-MDSC, PSO-MDSC, DBSCAN, and K-Means) form three clusters; notably, both BFLPSO-MDSC and PSO-MDSC exhibit higher similarities within each cluster than DBSCAN and K-Means. Table 1 compares the performance of these algorithms in terms of cluster validity indices: BFLPSO-MDSC achieves a better validity index than PSO-MDSC (by 0.02), DBSCAN (by 1.34), and K-Means (by 0.62). These results confirm that BFLPSO-MDSC effectively clusters power data and outperforms the comparison algorithms by constructing a well-suited fitness function and using BFLPSO to optimize the MDSC parameters adaptively.

Experiment 2: Prediction Experiment
To evaluate the prediction performance of the proposed algorithm, an experiment is conducted using ACLSTM for power load forecasting over a two-day period. The results are compared with four other algorithms: PSO-MDSC-LSTM, LSTM [9], gated recurrent unit (GRU) [31], and recurrent neural network (RNN) [32]. PSO-MDSC-LSTM is a variant of ACLSTM that replaces BFLPSO with traditional PSO while keeping the framework and parameters unchanged. The prediction results are presented in Figure 7 and Table 2.

Based on Figure 7 and Table 2, the MAE, MSE, and R^2 values of ACLSTM and GRU show little difference. However, the MAPE of GRU is worse than that of ACLSTM: for the second-day prediction, the MAPE of GRU is 16.32%, whereas ACLSTM achieves only 11.45%, meaning that ACLSTM reduces the MAPE by 29.84% relative to GRU. Although LSTM and RNN yield acceptable results for the first-day prediction, the performance of RNN deteriorates significantly on the second day compared to ACLSTM, which is particularly evident in its second-day MAPE, MAE, MSE, and R^2 values. The results obtained from PSO-MDSC-LSTM are similar to those achieved by ACLSTM; however, for the second day, PSO-MDSC-LSTM performs much worse, with an R^2 value of −0.29 compared to ACLSTM's 0.12. It should be noted that a higher R^2 value indicates better prediction accuracy, whereas a negative value suggests poor prediction quality.
On the other hand, among the predicted load curves for the two days, ACLSTM demonstrates superior performance in terms of MAPE. Specifically, for the first-day prediction, ACLSTM achieves a MAPE of 8.05%, which is lower than PSO-MDSC-LSTM by 0.16%, LSTM by 0.22%, RNN by 0.15%, and GRU by 0.5%. The superiority of ACLSTM becomes even more apparent for the second-day prediction, with a MAPE that is lower than PSO-MDSC-LSTM by 0.16%, LSTM by 3.22%, RNN by 4.15%, and GRU by 4.87%.
ACLSTM outperforms the other four algorithms in terms of overall prediction stability, despite potentially having slightly larger errors for individual points. This demonstrates its effectiveness and superiority in power load forecasting, as it produces load curves that are much closer to the real data.

Conclusions
This paper presents ACLSTM, an algorithm for short-term load forecasting. ACLSTM combines BFLPSO and MDSC clustering to optimize parameters. BFLPSO's spatial searching ability is utilized to find the best combinations of MDSC parameters. The algorithm uses 9-dimensional load feature vectors as training features for SVM, whereas the clustering results are used as labels to determine similarity clusters for forecast days. Load curves of two days are obtained using the LSTM neural network with similar clusters serving as training data. Comparison experiments demonstrate that BFLPSO-MDSC performs well in clustering, and the ACLSTM achieves higher prediction accuracy. The mean absolute percentage error for the first day is only 8.05% compared to the real value. These experiments validate the effectiveness, rationality, and practicality of ACLSTM. In future work, additional method comparison experiments should be conducted to provide further optimization ideas for this algorithm.