1. Introduction
As China’s electrification level rises, so does demand for electrical energy [1]. Residential, commercial, and industrial electricity consumption all show year-on-year growth. However, the electricity load exhibits obvious randomness and volatility due to objective and social factors such as weather changes, holidays, and unexpected events [2,3]. This complicates load forecasting and affects the reliability and efficiency of power system operation [4,5,6,7]. Highly accurate load forecasting enables power dispatching authorities to develop more scientific and cost-effective generation plans, reducing fossil fuel consumption and slowing environmental degradation [8]. Improved forecasting accuracy also enables better peak regulation in power systems with energy storage. As a result, high-accuracy load forecasting has been a hot and difficult research topic in this field [9,10,11].
In terms of time scales, there are currently four main types of load forecasts: ultra-short-term [12], short-term [13,14], medium-term [15,16,17], and long-term [18,19,20,21]. Ultra-short-term forecasts (one hour or 10 min ahead) are mainly used for real-time security analysis, real-time economic dispatch, and automatic generation control. Short-term forecasts (one day or one week ahead) are used for scheduling daily start-up and shutdown plans and generation plans. Medium-term forecasts (up to one year ahead) are used for monthly maintenance plans, operation modes, and reservoir scheduling plans. Long-term forecasts (up to 10 years ahead) provide key basic data for grid planning and for determining annual maintenance plans, operation modes, etc.
Load forecasting models can be broadly classified as statistical [22,23], artificial intelligence [24,25], and combined models [26,27,28].
Statistical models mainly include the autoregressive integrated moving average, seasonal autoregressive integrated moving average, multiple linear regression, and exponential smoothing methods. Wang Bo et al. constructed an autoregressive moving average model with exogenous variables to achieve short-term load forecasting [29]. Luiz Felipe Amaral et al. developed a smooth transition periodic autoregressive model and evaluated its load forecasting performance [30]. However, traditional statistical methods are no longer applicable when forecasting complex non-linear trends.
In recent years, artificial intelligence models have been widely applied in load forecasting with excellent results. Common artificial intelligence models can be broadly classified into machine learning models [31,32,33] and deep learning models [34,35,36]. Support vector machines and neural networks are the most representative machine learning models. Jian Luo et al. [37] constructed a weighted quadratic surface support vector regression model to achieve efficient load prediction. Their results show that support vector functions handle non-linear time series well, but the method's parameter setting is cumbersome. Haoming Liu et al. [38] constructed a combined model based on support vector regression for short-term load forecasting of integrated energy systems. Pham et al. [39] used a back-propagation neural network (BPNN) as the core forecasting algorithm for load forecasting. Yusha Hu et al. [40] established a parameter-optimised BPNN to avoid prediction results falling into local optima. Deep learning models are a family of algorithms that has grown out of the neural network models of machine learning; they contain more complex structures and are suitable for processing large amounts of non-linear data. A stacked autoencoder structure based on a deep LSTM was proposed by Zahra Fazlipour et al. [41]; notably, that study shows that deep structures are useful for improving prediction accuracy. A stacked LSTM model was developed by Hongbo Ren et al. [42] for load forecasting. Temporal convolutional networks incorporating an attention mechanism were constructed by Xianlun Tang et al. [43].
Various single prediction models have their limitations, and their prediction accuracy often fails to meet production needs. Combined prediction methods have therefore emerged in recent years. Combined models fall into two main categories: combinations of optimisation algorithms with forecasting models, and integrations of multiple forecasting models. A combined forecasting model, i.e., an Elman neural network (ENN) optimised with the particle swarm optimisation algorithm, was proposed for load power forecasting by Kun Xie et al. [44]. Considerable work has also been performed on the integration of multiple predictive models. Xifeng Guo et al. [45] used a convolutional neural network to cascade features at four different scales to fully exploit the potential relationships between continuous and discontinuous data in the feature maps; the fused multi-scale feature vectors are fed into an LSTM network for short-term load prediction. Umar Javed et al. [46] combined a dilated causal convolutional network with short receptive fields and a bi-directional LSTM to build a new load forecasting architecture, achieving higher forecasting accuracy. Sana Arastehfar et al. [47] integrated a graph convolutional neural network and a long short-term memory network into a unified network that can extract both temporal and spatial information. In addition to the combined models above, combining decomposition methods with deep learning models is a new trend in the field of load forecasting. Weimin Yue et al. [48] combined ensemble empirical mode decomposition with long short-term memory neural networks to address poor load prediction accuracy. Qian Zhang et al. [49] combined a variational mode decomposition model with a stacked ensemble model for load prediction. These studies show the outstanding advantages of combined models in the field of load forecasting.
In everyday life and production processes, customers' electricity consumption behaviour is differentiated. Load characteristics therefore vary, making it difficult for generalised forecasts to meet accuracy requirements. When forecasting the load of an area, if each subarea is forecast individually and the results are then integrated, the forecast granularity is too fine and prone to over-fitting, and forecasting is more time-consuming. Conversely, if all customer loads are aggregated and then predicted, the granularity is too coarse, and the differentiated characteristics of customer electricity consumption cannot be captured. Therefore, the extraction of customer electricity-consumption characteristics is also a key technique affecting the accuracy of load forecasting. Koivisto et al. [50] used principal component analysis for dimensionality reduction and K-means for clustering. The method effectively clusters loads, but the stability of principal component analysis is poor and its accuracy is not high. Zhong et al. [51] used the Fourier transform as a dimensionality reduction method to extract the main load features and classify loads, but the method did not specify the weights of the dimensionality reduction indicators.
Based on previous research, an ensemble load forecasting algorithm based on load clustering and load decomposition strategies is proposed in this study. The improved prediction strategy includes three aspects. The first is a load clustering method based on principal characteristic extraction and dimensionality reduction. The extracted primary features reflect the fluctuation characteristics of various loads. The dimensionality reduction strategy can preserve important features while reducing the complexity of the clustering model, improving clustering efficiency and reducing memory requirements. The clustering algorithm groups loads with similar trends in variation. By predicting a class of loads with similar trends in a provincial region, the forecasting accuracy can be improved while reducing time costs. The second aspect is the load decomposition strategy. Multiple classes of loads with different characteristics are obtained after clustering. The sum of each type of load is decomposed into separate components with a single and uncoupled frequency, and then separate prediction models are constructed for the different frequency components. By decomposing each class of load, not only can the information contained in the data be fully explored, but also the interaction between different components at the characteristic scale can be reduced. Thirdly, an integrated deep learning model based on the LSTM and CNN-GRU model is constructed. In this model, the LSTM and CNN-GRU models are used for the prediction of low-frequency and high-frequency components, respectively. The LSTM can fully reflect the overall trend of the load and has high accuracy in predicting low-frequency time series. The CNN-GRU neural network, on the other hand, has a strong non-linear fitting capability and can achieve accurate prediction of high-frequency components with high randomness. The advantages of these two prediction methods complement each other. 
The prediction methods were validated by simulation, yielding desirable prediction results.
This paper is organised as follows.
Section 2 presents the novel load forecasting model and describes the relevant theoretical background of the algorithms involved in this paper.
Section 3 presents a case study of the proposed model.
Section 4 concludes the paper.
2. Methodology
2.1. An Ensemble Forecasting Model Based on an Improved Load Clustering and Decomposition Strategy
In this paper, a forecasting strategy based on load clustering and decomposition is designed to achieve accurate load forecasting for provincial areas containing multiple cities, and the following detailed steps are given.
Step 1: A load clustering method based on principal feature extraction is used to cluster the loads of multiple cities. The load characteristics are first calculated for all cities. Secondly, the SVD method is applied to reduce the dimensionality of the load characteristics and extract the main ones. Finally, the K-means algorithm is used to cluster the loads of the cities based on their main characteristics. In this study, the load data of one province (containing 10 cities) are used as the study data; the load data of the 10 cities are processed by dimensionality reduction and clustering.
Step 2: The total load of each category is obtained from the clustering results. The VMD algorithm is used to decompose the various types of loads obtained by clustering, yielding several components of different frequencies.
Step 3: Based on the improved prediction strategy, the ensemble prediction algorithm is proposed. The LSTM and CNN-GRU models are used to predict the low-frequency and high-frequency IMF components obtained using the VMD algorithm, respectively, and then the prediction results of each component are superimposed to obtain the final prediction results of each type of load. Finally, the forecast results of each type of load are superimposed to obtain the load forecast results of the province.
The flowchart of the ensemble model considering clustering and decomposition strategies to achieve provincial short-term load forecasting is given in Figure 1.
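The three steps above can be outlined in code as follows (an illustrative Python sketch, not the authors' implementation; `decompose`, `predict_low`, and `predict_high` are hypothetical stand-ins for the VMD, LSTM, and CNN-GRU components described in the rest of this section):

```python
import numpy as np

def forecast_province(city_loads, cluster_labels, decompose, predict_low, predict_high):
    """Cluster -> decompose -> per-band predict -> superimpose.

    city_loads: (n_cities, T) array of historical loads.
    cluster_labels: one cluster index per city (from Step 1).
    decompose: returns (low_components, high_components) for a series (Step 2).
    predict_low / predict_high: per-component forecasters (Step 3).
    """
    labels = np.asarray(cluster_labels)
    province_forecast = None
    for c in np.unique(labels):
        # Step 1 result: aggregate the cities of one cluster into a single series.
        cluster_load = np.asarray(city_loads)[labels == c].sum(axis=0)
        # Step 2: split the cluster load into low/high-frequency components.
        low_parts, high_parts = decompose(cluster_load)
        # Step 3: predict each component with the matching model and superimpose.
        cluster_forecast = sum(predict_low(p) for p in low_parts) + \
                           sum(predict_high(p) for p in high_parts)
        province_forecast = (cluster_forecast if province_forecast is None
                             else province_forecast + cluster_forecast)
    return province_forecast
```

With the real models plugged in, the final provincial forecast is simply the superposition of the per-cluster forecasts, mirroring Figure 1.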
2.2. Load Characteristic Dimensionality Reduction Based on Singular Value Decomposition (SVD)
As the number of dimensions of load data increases, the efficiency of load clustering decreases significantly. By reducing the dimensionality of the load characteristics, the efficiency of clustering can be improved and the memory requirements for data storage can be reduced. SVD, as a matrix decomposition method, enables the dimensionality reduction of matrices.
Assume that the n load characteristics of m users form a real matrix A of order m × n. The n load characteristics of the k-th user are denoted as ak = (ak1, ak2, …, akn). For matrix A, there exist orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n}, such that the following equation holds [52]:

$$A = U \Sigma V^{\mathrm{T}} \tag{1}$$

where Σ ∈ R^{m×n} is the diagonal matrix of singular values σ1 ≥ σ2 ≥ … ≥ 0.
The magnitude of the singular value σi indicates the importance of the corresponding load characteristic. A larger singular value indicates that the feature is more important, while a smaller value indicates that the feature is unimportant and can be ignored. In Formula (1), only the first r dominant singular values are retained. Then, the matrices U and V are reduced to

$$U_r = (u_1, u_2, \ldots, u_r) \in \mathbb{R}^{m \times r} \tag{2}$$

$$V_r = (v_1, v_2, \ldots, v_r) \in \mathbb{R}^{n \times r} \tag{3}$$

From Equations (2) and (3), it can be seen that, after neglecting the directions in which the variance of the data is small, the original coordinate system can be reduced to a low-dimensional one. Accordingly, the coordinate values of the load characteristic ak in the low-dimensional coordinate system can be used to reflect its main characteristics. In addition, the singular value corresponding to each axis describes the importance of that load characteristic: when clustering the coordinates, the higher the singular value, the more important the corresponding load characteristic. Therefore, the singular values of the axes are chosen as the weights of the dimensionality reduction indicators and are then normalised.
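As a concrete sketch, this reduction can be implemented with a standard SVD routine (a minimal illustration of the idea; the function and variable names are our own, not the paper's):

```python
import numpy as np

def reduce_load_features(A, r):
    """Project the m x n load-characteristic matrix A onto its first r
    singular directions and return the normalised singular values as
    indicator weights for the subsequent clustering."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coords = A @ Vt[:r].T          # low-dimensional coordinates (= U_r * diag(s_r))
    weights = s[:r] / s[:r].sum()  # normalised singular values as axis weights
    return coords, weights
```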
2.3. K-Means Clustering Algorithm
In this study, the K-means algorithm is used to cluster the loads of several cities in the province based on the main features after dimensionality reduction. The method needs the number of clusters k to be determined in advance; this paper adopts the sum of squared errors (SSE) as the criterion for evaluating the effectiveness of clustering and determines the number of clusters accordingly. The SSE metric is defined in the following equation [53]:

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - u_i \rVert^2 \tag{4}$$

where Ci is the set of samples of class i, ui is the cluster centre of class i, and ‖x − ui‖² is the squared Euclidean distance between ui and sample x. A smaller SSE means better clustering quality.
Suppose X is a set of n data samples with s dimensions, denoted X = {x1, x2, …, xn} ⊂ R^s. The steps for clustering the load data are as follows:
- (1) Determine the number of clusters k according to the clustering validity index SSE.
- (2) Randomly select the initial k cluster centres u1, u2, …, uk ∈ R^s, and calculate for each data sample the class label it belongs to.
- (3) For each class j, recalculate the cluster centre of that class:

$$u_j^{(t)} = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x \tag{5}$$

where u_j^{(t)} is the cluster centre recalculated at the t-th iteration for class j.
- (4) Update the class centre with the class mean.
- (5) Repeat (3) and (4) until the class centres no longer change.
- (6) Output the clustering results.
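Steps (2)–(6) can be sketched as follows (a minimal illustrative implementation; the random initialisation and naming are our own assumptions, and k is assumed to have been chosen beforehand via the SSE criterion of step (1)):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: assign samples to nearest centres, then move each
    centre to its class mean, until the centres stop changing."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # step (2): initial centres
    for _ in range(n_iter):
        # step (2): assign each sample the label of its nearest centre
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # steps (3)-(4): recompute each centre as the mean of its class
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        # step (5): stop when the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    sse = ((X - centres[labels]) ** 2).sum()  # clustering quality, Eq. for SSE
    return labels, centres, sse
```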
2.4. Load Decomposition Based on VMD Algorithm
Considering the non-linear and non-smooth nature of the load series, the decomposition method is used to decompose the total load for each category. Each type of load is decomposed into multiple IMFs of different frequencies, and then each load sequence component is predicted separately.
Variational mode decomposition (VMD) [54] is a new type of adaptive decomposition algorithm. It decomposes a complex time series into a number of single-frequency components based on a pre-determined number of modes M; the optimal solution of the model is obtained by the alternating direction method of multipliers with iterative updating.
Assume a signal to be decomposed:

$$f(t) = \sum_{m=1}^{M} v_m(t) = \sum_{m=1}^{M} A_m(t) \cos\!\big(\varphi_m(t)\big) \tag{6}$$

where f(t) is the original load signal to be decomposed, vm(t) (m = 1~M) is the single-frequency signal after load decomposition, M is the number of decompositions, Am(t) is the amplitude of the signal vm(t), and φm(t) is the phase angle of vm(t).
The VMD extracts M modal components from the non-smooth original signal such that the sum of the frequency bandwidths of the components is minimised and the sum of the modal components equals the original signal. The constrained model is

$$\min_{\{v_m\},\{\omega_m\}} \left\{ \sum_{m=1}^{M} \left\lVert \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * v_m(t) \right] e^{-j\omega_m t} \right\rVert_2^2 \right\} \quad \text{s.t.} \quad \sum_{m=1}^{M} v_m(t) = f(t) \tag{7}$$

where vm is the m-th decomposed modal component, ωm is its centre frequency, δ(t) is the impulse function, and f(t) is the original load signal.
The Lagrangian multiplier λ and the quadratic penalty factor α are used to construct the augmented Lagrangian function, and the alternating direction method of multipliers is then used to iteratively find the global optimal solution of the objective function. The augmented Lagrangian function is as follows:

$$L(\{v_m\},\{\omega_m\},\lambda) = \alpha \sum_{m=1}^{M} \left\lVert \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * v_m(t) \right] e^{-j\omega_m t} \right\rVert_2^2 + \left\lVert f(t) - \sum_{m=1}^{M} v_m(t) \right\rVert_2^2 + \left\langle \lambda(t),\; f(t) - \sum_{m=1}^{M} v_m(t) \right\rangle \tag{8}$$
The optimal solution of the above equation is found by alternately updating $\hat{u}_m^{n+1}$, $\omega_m^{n+1}$, and $\hat{\lambda}^{n+1}$. The value of $\hat{\lambda}^{n+1}$ is expressed as follows:

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{m=1}^{M} \hat{u}_m^{n+1}(\omega) \right) \tag{9}$$

The minimiser for each IMF component in the Fourier domain is as follows:

$$\hat{u}_m^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq m} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha \left( \omega - \omega_m \right)^2} \tag{10}$$

Similarly, the minimiser for the centre frequencies is

$$\omega_m^{n+1} = \frac{\int_0^{\infty} \omega \, \lvert \hat{u}_m(\omega) \rvert^2 \, d\omega}{\int_0^{\infty} \lvert \hat{u}_m(\omega) \rvert^2 \, d\omega} \tag{11}$$
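These alternating updates can be implemented compactly on the positive half-spectrum (a simplified re-implementation for illustration only; the initialisation, stopping rule, and default parameter values are our assumptions, not the authors'):

```python
import numpy as np

def vmd(f, M=3, alpha=2000.0, tau=0.0, n_iter=500, tol=1e-7):
    """Simplified VMD sketch: alternating updates of the modes, their
    centre frequencies, and the Lagrangian multiplier in the Fourier
    domain (tau = 0 disables the multiplier update)."""
    T = len(f)
    f_hat = np.fft.rfft(f)
    freqs = np.fft.rfftfreq(T)                 # normalised frequencies in [0, 0.5]
    u_hat = np.zeros((M, len(f_hat)), dtype=complex)
    omega = np.linspace(0.05, 0.45, M)         # initial centre frequencies (assumption)
    lam = np.zeros(len(f_hat), dtype=complex)  # multiplier in the frequency domain
    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for m in range(M):
            # mode update: Wiener-like filter of the residual around omega[m]
            residual = f_hat - u_hat.sum(axis=0) + u_hat[m]
            u_hat[m] = (residual + lam / 2) / (1 + 2 * alpha * (freqs - omega[m]) ** 2)
            # centre-frequency update: power-spectrum centroid of the mode
            power = np.abs(u_hat[m]) ** 2
            omega[m] = (freqs * power).sum() / (power.sum() + 1e-12)
        # multiplier update: dual ascent on the reconstruction constraint
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        if np.abs(u_hat - u_prev).max() < tol:
            break
    order = np.argsort(omega)                  # return modes from low to high frequency
    modes = np.array([np.fft.irfft(u_hat[m], n=T) for m in order])
    return modes, omega[order]
```

For well-separated frequency bands, the recovered centre frequencies converge to the true tone frequencies and the modes superimpose back to the original signal.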
2.5. LSTM
Long short-term memory (LSTM) [55] is a type of recurrent neural network used for processing sequence data such as text, speech, and time series. The advantage of LSTM models is that they can capture long-term dependencies, which traditional recurrent neural network models cannot. The memory cell in the LSTM model can remember information in the input sequence and pass it on to the next time step for better prediction of future values. In addition, the LSTM model can control the flow of information through gate mechanisms to mitigate the vanishing gradient problem. These advantages make LSTM models widely applicable in areas such as speech recognition and time series prediction. The low-frequency component of the load fluctuates smoothly, and accurate prediction can be achieved using the LSTM model alone; therefore, the LSTM model is used for low-frequency component prediction in this study.
The structure of the LSTM [54] consists of three gates: the forget gate, the input gate, and the output gate. According to the internal structure of the three gates, the value of the LSTM hidden layer at the current moment depends on the joint action of the current input and the previous moment's state. The three gates act as three control switches that extract and process information: they delete historical information that is not useful, retain information related to the output characteristics, update the state of the hidden layer, and improve the convergence of the model.
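The gate computations just described can be written out for a single time step (an illustrative NumPy sketch with our own stacked parameter layout; not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4H x D), U (4H x H), and b (4H,) stack the
    parameters of the forget, input, candidate, and output transforms."""
    z = W @ x + U @ h_prev + b
    H = len(h_prev)
    f = sigmoid(z[0:H])          # forget gate: drop useless history
    i = sigmoid(z[H:2*H])        # input gate: admit new information
    g = np.tanh(z[2*H:3*H])      # candidate cell state
    o = sigmoid(z[3*H:4*H])      # output gate
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # hidden state at the current moment
    return h, c
```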
2.6. CNN-GRU
The CNN-GRU model is a novel neural network architecture that combines the convolutional neural network (CNN) and gated recurrent unit (GRU) models. The CNN-GRU model is designed to process sequential data, such as time series data, and is particularly well-suited for high-frequency component prediction. The CNN-GRU model consists of two main components: a CNN and a GRU. The CNN is responsible for extracting local features from the input data, while the GRU is used to capture the temporal dependencies in the data. The output of the CNN is fed into the GRU, which then produces time series prediction results. This allows the model to accurately predict high-frequency components in the data, which are often difficult to predict using traditional time series models. Overall, the CNN-GRU model represents a significant advancement in the field of time series prediction and can greatly improve the accuracy and reliability of high-frequency component predictions.
2.6.1. CNN
The convolutional neural network (CNN) is a deep learning model used for image classification, feature extraction, target detection, speech recognition, etc. The CNN mainly consists of multiple convolutional layers, where the convolutional layers are used to extract potential features of the data and pooling layers are used to reduce the size of the feature map. Convolutional layers play a key role in feature extraction. They are responsible for detecting local features in the input data by sliding a set of learnable filters over the input volume and computing the dot product between the filter and the corresponding patch of the input. The result of this operation is a feature map that captures potential fluctuating features in the input data. Overall, the convolutional layer enables the neural network to learn hierarchical representations of the input data that are increasingly abstract and discriminative, thereby improving the accuracy and robustness of the model.
2.6.2. GRU
The GRU is a variant of the LSTM that uses its specific memory and forgetting structure to dynamically model time series. It mitigates gradient vanishing and gradient explosion during training on load time series. Compared with the LSTM, the GRU reduces the number of gates, which preserves load prediction accuracy while reducing training time.
The GRU has two gates. It merges the forget gate and the input gate of the LSTM into a new update gate, and replaces the LSTM's output gate with a reset gate, which selects the state at the previous moment and writes it into the candidate set at the current moment [54].
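A single GRU step with these two gates can be sketched as follows (parameter names and the z/(1 − z) mixing convention are our own assumptions; some formulations swap the roles of z and 1 − z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU time step: update gate z (the merged forget/input gate)
    and reset gate r, followed by the candidate state."""
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)   # candidate state
    return (1 - z) * h_prev + z * h_tilde                # new hidden state
```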
2.7. Performance Evaluation
In this paper, the error evaluation indices root mean square error (RMSE) [56] and mean absolute percentage error (MAPE) [57] are used to evaluate the accuracy of the prediction models. Their mathematical expressions are shown in Equations (13) and (14):

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{13}$$

$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \tag{14}$$

where N is the number of samples, yi is the true value of the load data, and ŷi is the load forecast value.
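Equations (13) and (14) translate directly into code (a routine sketch; note that MAPE assumes no true load value is zero):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Equation (13)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Equation (14)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))
```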