Short-Term Load Forecasting Based on VMD and Deep TCN-Based Hybrid Model with Self-Attention Mechanism

Abstract: Due to the difficulty of storing electric energy, balancing the supply and demand of the power grid is crucial for the stable operation of power systems. Short-term load forecasting can provide an early warning of excessive power consumption for utilities by enabling the generation, transmission and distribution of electric energy to be planned in advance. However, the nonlinear patterns and dynamics of load data still make accurate load forecasting a challenging task. To address this issue, a deep temporal convolutional network (TCN)-based hybrid model combined with variational mode decomposition (VMD) and a self-attention mechanism (SAM) is proposed in this study. Firstly, VMD is used to decompose the original load data into a series of intrinsic mode components, which are combined with other external factors to reconstruct a feature matrix. Secondly, a three-layer convolutional neural network is used as a deep network to extract in-depth features between adjacent time points from the feature matrix, and the TCN then captures the long-term temporal dependencies of the output matrix. Thirdly, long short-term memory (LSTM) is utilized to enhance the extraction of temporal features, and the correlation weights of the spatiotemporal features are further adjusted dynamically using SAM to retain important features during model training. Finally, the load forecasting results are obtained from the fully connected layer. The effectiveness and generalization of the proposed model were validated on two real-world public datasets, ISO-NE and GEFCom2012. Experimental results indicate that the proposed model significantly improves prediction accuracy in terms of the evaluation metrics, compared with other contrast models.


Introduction
Power load forecasting is the foundation of the operation and planning of a power system. The generation, transmission, distribution and consumption of electricity occur almost simultaneously. It is well known that massive storage of electric power is still difficult with current technologies. Thus, the supply and demand sides should be balanced dynamically to ensure the safety and reliability of the power grid [1]. However, the high penetration of renewable energy sources and electric devices in modern power systems results in significant uncertainties in the load demand. Motivated by this standpoint, power load forecasting has become increasingly important in modern power systems.
Power load forecasting involves forecasting the load demand for a future time span, using historical load data and other external factors such as meteorological conditions, seasonal effects, economic factors and social activities. Accurate load forecasting can help power companies to make efficient decisions, avoid resource waste and improve grid stability and reliability [2]. Therefore, load forecasting has attracted increasing attention from researchers all over the world. Short-term load forecasting (STLF) is an approach that predicts power load consumption over an interval of an hour to a week, based on historical load data and other external factors [3]. STLF provides a basis for generating units to schedule their start-up and shut-down times, prepare the rotating reserve and carry out in-depth analyses of the limitations in the transmission system [4]. It has been demonstrated that a 1% increase in STLF error increases the operating cost by GBP 17.7 million [5] or EUR 4.55 to 9.1 million [6]. In addition, the use of distributed generation and intelligent devices generates load data with more nonlinear patterns and dynamics [7]. Furthermore, the scale of the load data also increases sharply with the widespread use of smart meters [8]. Therefore, an accurate STLF model is a prerequisite for the safe, stable and economic operation of an electric grid.
In the early years, a variety of prediction models were developed based on statistical methods and artificial intelligence methods. Statistical methods usually include the autoregressive moving average [9], autoregressive integrated moving average [10], exponential smoothing [11], etc. Although these methods are easy to implement without additional feature inputs, they are not suitable for handling load data with nonlinear patterns and dynamics [12,13]. In recent years, artificial intelligence (AI) methods have been widely used in power load forecasting, along with advances in computer technology. As a typical AI algorithm, artificial neural networks (ANNs) have strong nonlinear modeling capabilities and can approximate any nonlinear function without prior information on the relationships between the training model and the data [14]. However, ANNs require hyperparameters to be set manually and are thus unable to effectively tackle large-scale load data, making it difficult to obtain higher prediction accuracy for load data in modern power systems [15]. Moreover, the output of the training model depends not only on the current input, but also on previous inputs [16]. Therefore, traditional artificial intelligence methods cannot fully extract the features of time series to achieve higher prediction accuracy.
Recently, deep learning methods have become increasingly popular in the field of time series. Long short-term memory (LSTM), developed from the recurrent neural network (RNN), can deal with long-term dependencies of time series while avoiding the gradient vanishing and exploding of the RNN [17]. A convolutional neural network (CNN) can effectively capture the local relationships among adjacent time points [18] and is widely used in many fields, such as image recognition [19], renewable energy forecasting [20] and load forecasting [21]. However, the large number of parameters in a CNN can easily result in overfitting of the training model. It should be pointed out that one-dimensional (1D) CNNs can mitigate the overfitting problem due to their fewer learnable parameters. Furthermore, the temporal convolutional network (TCN), developed from the CNN, has a unique dilated convolutional module that obtains a large receptive field with fewer layers [22], which is conducive to extracting nonlinear features of time series. Although the TCN has better predictive performance on time series than LSTM and CNN [23,24], it cannot learn the dependencies of long-range positions inside the sequence or extract the internal correlations of the input. In contrast, the self-attention mechanism (SAM) focuses on capturing the internal correlations of input features and can effectively deal with long time series, which helps the training model to identify key features. For instance, a novel model based on bidirectional LSTM, XGBoost and SAM was proposed for power load forecasting in [25]. An improved LSTM optimized with SAM was used to predict the concentration of air pollutants in [26]. A new model based on SAM and multi-task learning was developed to predict ultra-short-term photovoltaic power generation in [27]. These experimental results indicate that SAM has the capability to select key information from complex features and avoid irrelevant information during model training.
Although these deep learning methods improve the prediction accuracy, single models cannot fully extract the input features, especially in-depth features. Therefore, scholars are increasingly focusing on hybrid models, which combine the advantages of each single model [28]. For example, a hybrid approach combining singular spectrum analysis-based decomposition and ANN for day-ahead hourly load forecasting was proposed in [29]. A TCN-LSTM hybrid model was used to forecast weather on meteorological data in [30]. A day-ahead load forecasting model based on CNN and TCN achieved superior performance in [31]. A hybrid model based on TCN combined with an attention mechanism was proposed to fully exploit the nonlinear relationships between load data and external factors in [32]. Although the above hybrid models improved prediction performance in various aspects, the raw load data were fed directly into the training models without being smoothed and denoised in advance, which is not conducive to further enhancing the accuracy of STLF.
In order to avoid the effects of nonlinear patterns and dynamics of load data on the forecasting accuracy, decomposition methods have been used to decompose the load data into multiple smooth and stable sub-series at different frequency bands. For instance, Wang et al. [33] decomposed the load data into a series of sub-signals using wavelet decomposition (WD). The sub-signals were predicted by different models according to their frequencies. Liang et al. [34] used empirical mode decomposition (EMD) to decompose load signals into multiple intrinsic mode functions (IMFs) to weaken the nonlinearity and dynamics of the load data. Zhu et al. [35] proposed a hybrid model for carbon price forecasting based on EMD and evolutionary least squares support vector regression. An ensemble empirical mode decomposition (EEMD) was used to decompose load data into different frequency components in [36], where the low- and high-frequency components were predicted via multivariable linear regression and LSTM. However, WD cannot meet the requirements of complex and dynamic load series in time-frequency analysis [33]. In addition, EMD and EEMD have the disadvantages of mode aliasing and end effects, and lack the support of mathematical theory [37]. A variational mode decomposition (VMD) algorithm with a solid mathematical model was then used as an alternative to overcome the drawbacks mentioned above. VMD has good data decomposition accuracy and obtains a group of stable sub-signals without noise interference [38]. A hybrid model based on VMD and LSTM for short-term load forecasting was proposed in [39]. A CNN and TCN hybrid model combined with VMD-based data processing was developed for power load forecasting in [40]. A hybrid prediction model based on VMD, TCN and an error correction strategy for electricity load forecasting was presented in [41]. A hybrid model based on TCN and VMD for short-term wind power forecasting was proposed in [42]. A hybrid model based on GRU and TCN combined with VMD decomposition was developed for load forecasting in [43].
Although the studies mentioned above verified that deep learning methods based on VMD techniques can improve the prediction accuracy of load demand, some aspects still require further study and improvement. For example, many studies adopted deep learning models with shallow networks to improve training efficiency, and rarely considered using models with deep networks to learn full features from the decomposed sub-series. Moreover, most studies scarcely focused on the correlations between sub-series and external factors, such as temperature, seasons, holidays, etc. Deep learning models with multiple hidden layers can indeed improve feature extraction capabilities, but it is difficult to train these models with only a few sub-series and external factors. Therefore, it is necessary to construct a deep learning model with shallow networks that can extract in-depth features from several sub-series and external factors. As a result, a novel STLF model based on VMD and a deep TCN-based hybrid method with SAM is proposed to fully capture the in-depth features of multiple sub-series and external factors. The main advantages of this research are as follows: (1) The raw load data are decomposed into multiple stable sub-series using VMD. Furthermore, the external factors are also considered as input variables and reconstructed into a feature matrix, along with the sub-series, for the training model. (2) A three-layer 1D-CNN is constructed as a deep network that mitigates overfitting of the training model due to its fewer parameters, and reshapes the extracted features into time series for the TCN module. The TCN extracts the long-term temporal dependencies of the input matrix. Moreover, LSTM is used to further enhance the extraction of temporal features. (3) SAM amplifies important information and weakens irrelevant information in the feature matrix. Thus, SAM dynamically adjusts the correlation weights of complex features to obtain important information from the input data. (4) Compared with traditional models and other benchmark models, the novel hybrid model performs more effectively and achieves higher prediction accuracy.
The rest of this study is organized as follows. Section 2 introduces the basic methodologies of VMD, 1D-CNN, TCN, LSTM and SAM, as well as the framework of the proposed hybrid model. Section 3 describes the data preprocessing, evaluation criteria and experimental analysis. Finally, conclusions and future research are presented in Section 4.

Methodologies
In this section, the STLF integration framework is introduced in detail, including VMD, TCN, 1D-CNN, LSTM and SAM. The following subsections briefly describe each of the modules used in this study. Finally, the architecture of the proposed hybrid model is presented, and its main hyperparameters are listed.

VMD
VMD is an adaptive and fully non-recursive method of mode variation and signal processing [44], which decomposes time series data into multiple IMFs. It not only retains the inherent information of the original signals, but also avoids the overlap of mode information. Moreover, VMD has superior performance in sampling and denoising compared to WD, EMD and EEMD [45]. The process of decomposition using VMD can be described as follows: Step 1: For a given mode, the analytic signal of f(t) can be obtained via the Hilbert transform to produce a unilateral frequency spectrum: (δ(t) + j/(πt)) ∗ u_k(t). Step 2: Each mode is multiplied by an exponential term to shift its frequency spectrum to the baseband: [(δ(t) + j/(πt)) ∗ u_k(t)] e^(−jω_k t), where δ(t) is the Dirac distribution, u_k is the kth mode, ω_k is the center angular frequency of the kth mode, ∗ denotes convolution and j is the imaginary unit.
Step 3: The Gaussian smoothness, i.e., the squared norm of the gradient, is used to estimate the bandwidth of each mode. The constrained variational problem can then be written as: min over {u_k}, {ω_k} of Σ_k ‖∂_t[(δ(t) + j/(πt)) ∗ u_k(t)] e^(−jω_k t)‖₂², subject to Σ_k u_k = f, where ω_k is the center frequency of the kth mode.
Step 4: Lagrange multiplier theory is an effective method to enforce constraints. For convenience of analysis, the constrained variational problem is transformed into an unconstrained one via the augmented Lagrangian: L({u_k}, {ω_k}, λ) = α Σ_k ‖∂_t[(δ(t) + j/(πt)) ∗ u_k(t)] e^(−jω_k t)‖₂² + ‖f(t) − Σ_k u_k(t)‖₂² + ⟨λ(t), f(t) − Σ_k u_k(t)⟩, where α is a quadratic penalty factor and λ is the Lagrange multiplier.
Step 5: The saddle point of Equation (4) is easily found using the alternate direction method of multipliers, which updates each mode in the frequency domain. The expressions for the modal component and center frequency are û_k^(n+1)(ω) = (f̂(ω) − Σ_{i≠k} û_i(ω) + λ̂(ω)/2) / (1 + 2α(ω − ω_k)²) and ω_k^(n+1) = ∫₀^∞ ω |û_k^(n+1)(ω)|² dω / ∫₀^∞ |û_k^(n+1)(ω)|² dω, where n represents the number of iterations. Additionally, the Lagrange multiplier λ is updated as λ̂^(n+1)(ω) = λ̂^(n)(ω) + τ(f̂(ω) − Σ_k û_k^(n+1)(ω)). Step 6: The updates are repeated until the modal components satisfy the convergence criterion Σ_k ‖û_k^(n+1) − û_k^(n)‖₂² / ‖û_k^(n)‖₂² < ε, where ε is the convergence tolerance that determines the accuracy and the number of iterations.
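As an illustration, the frequency-domain updates of Steps 5 and 6 can be sketched in NumPy. This is a simplified sketch under stated assumptions (full-spectrum FFT rather than the mirrored analytic-signal formulation of the original algorithm, fixed initial center frequencies); the function name `vmd` and its defaults are illustrative, and production work would typically use a tested implementation such as the vmdpy package.

```python
import numpy as np

def vmd(signal, K, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=500):
    """Simplified VMD sketch: ADMM updates of Steps 5-6 in the frequency domain."""
    T = len(signal)
    f_hat = np.fft.fft(signal)
    freqs = np.fft.fftfreq(T)                 # normalized frequencies
    u_hat = np.zeros((K, T), dtype=complex)   # mode spectra
    omega = np.linspace(0.1, 0.4, K)          # initial center frequencies (assumption)
    lam = np.zeros(T, dtype=complex)          # Lagrange multiplier, frequency domain
    half = slice(0, T // 2)                   # positive half of the spectrum
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Step 5: Wiener-filter-like modal update
            u_hat[k] = (f_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Step 5: center frequency = power-weighted mean frequency
            power = np.abs(u_hat[k, half]) ** 2
            omega[k] = np.sum(freqs[half] * power) / (np.sum(power) + 1e-12)
        # dual ascent on the multiplier with update step tau
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        # Step 6: relative change of the modes as convergence criterion
        diff = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-12)
        if diff < tol:
            break
    modes = np.real(np.fft.ifft(u_hat, axis=1))
    return modes, np.sort(omega)
```

On a two-tone test signal, the recovered center frequencies land on the true tone frequencies, mirroring how VMD separates the load series into narrow-band IMFs.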

1D-CNN
The 1D-CNN is often used in signal processing and time series models [46]. It has translation and scaling capabilities due to its local connections and weight sharing. Compared with multi-dimensional CNNs, the 1D-CNN can change the number of channels while retaining the feature size to realize dimensionality reduction. Furthermore, it deepens the network structure and introduces more nonlinear computation without increasing the receptive field. The 1D convolution is calculated as y_t = Σ_{k=1}^{K} Z_k X_{t−k+1}, where Z_k is the convolution kernel and X_{t−k+1} is the time series.
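The 1-D convolution above can be written directly in NumPy. This minimal sketch (the function name `conv1d` is ours, not the paper's code) computes y_t = Σ_k Z_k X_{t−k+1} in "valid" mode and agrees with `np.convolve`:

```python
import numpy as np

def conv1d(x, z):
    """Valid 1-D convolution: y_t = sum_k z_k * x_{t-k+1} (0-indexed: x_{t-k})."""
    K = len(z)
    # slide the flipped window over the input; output starts once the kernel fits
    return np.array([np.dot(z, x[t - K + 1:t + 1][::-1]) for t in range(K - 1, len(x))])
```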

TCN
The TCN has been developed on the basis of the 1D-CNN and has shown excellent performance in many fields of time series processing [47]. It comprises causal convolution, dilated convolution and residual blocks, which are briefly introduced in the following.

Causal Convolution
Causal convolution has the capability to prevent the disclosure of future information. This means that an output y_t at moment t only depends on the inputs at previous moments, i.e., (x_t, x_{t−1}, x_{t−2}, x_{t−3}, ...). This reflects the strong causal relationship of time series [48]. The calculation process of causal convolution is presented in Figure 1.
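A common way to realize causal convolution, sketched below in NumPy (an illustrative assumption, not the paper's implementation), is to left-pad the input with K−1 zeros so that each output position sees only current and past samples:

```python
import numpy as np

def causal_conv1d(x, z):
    """Same-length causal convolution: pad K-1 zeros on the left so that
    y_t depends only on x_t, x_{t-1}, ..., never on future inputs."""
    K = len(z)
    xp = np.concatenate([np.zeros(K - 1), x])
    return np.array([np.dot(z, xp[t:t + K][::-1]) for t in range(len(x))])
```

Changing a future input leaves all earlier outputs untouched, which is exactly the causality property described above.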

Dilated Convolution
The expansion of the receptive field leads to an increase in the number of parameters, which can then cause gradient vanishing and exploding. Dilated convolution addresses this problem [49]. According to Figure 2, for an input series X = [x_0, ..., x_N] and a filter f of size k, the dilated convolution operation F on the sequence element N can be written as F(N) = Σ_{i=0}^{k−1} f(i) · x_{N−d·i}, where k is the filter size, d is the dilation factor and d·i indicates the orientation of the convolution. Increasing the dilation factor d expands the receptive field without increasing the computational cost, which significantly improves training efficiency.
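The dilated causal operation F(N) = Σ_i f(i)·x_{N−d·i} can be sketched in NumPy as follows (function and variable names are illustrative; out-of-range indices are treated as zeros, as with left padding):

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """F(n) = sum_{i=0}^{k-1} f(i) * x_{n - d*i}; indices before the start count as zero."""
    k = len(f)
    y = np.zeros(len(x))
    for n in range(len(x)):
        for i in range(k):
            j = n - d * i          # step back d samples per tap
            if j >= 0:
                y[n] += f[i] * x[j]
    return y
```

With kernel size k and dilation d, each output covers a receptive field of (k−1)·d + 1 samples, so stacking layers with growing d enlarges the field exponentially at constant cost per layer.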

Residual Block
Figure 3 illustrates the structure of the residual block [50]. The residual block includes two layers of dilated causal convolution and nonlinear units. The rectified linear unit (ReLU) is used as the activation function, with regularization applied after each dilated convolution. A 1 × 1 convolution connects back to the input to make the input and output compatible. Stacked residual blocks not only improve the training efficiency of the network, but also avoid gradient vanishing. Moreover, they continuously update the residual features to improve the transmission efficiency of the feature information.


LSTM
LSTM has superior performance in extracting the long-term dependencies of historical and future information for sequence data [51]. Figure 4 presents the structure of LSTM, which contains three gates: the input gate i_t, the forget gate f_t and the output gate o_t. This unique gating system is introduced in the LSTM model to control the information flow. Firstly, the input gate i_t decides how much the memory cell state is updated by the block input. Secondly, the forget gate f_t decides what information from the previous cell state should be forgotten. Finally, the output gate o_t controls which part of the current memory cell state is output. The formulas of the control parameters can be written as follows: i_t = σ(W_i · [h_{t−1}, x_t] + b_i), f_t = σ(W_f · [h_{t−1}, x_t] + b_f), o_t = σ(W_o · [h_{t−1}, x_t] + b_o), c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c), c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t and h_t = o_t ⊙ tanh(c_t), where x_t and h_t are the input and output at time step t, respectively, W and b are the weight and bias matrices, respectively, σ(·) and tanh(·) are the activation functions for the cell-state update and the selective output, respectively, and c_t and c̃_t denote the current and new candidate values of the cell state, respectively.
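The gate equations can be sketched as a single NumPy LSTM step; the dictionary layout and weight shapes below are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: gates i, f, o and candidate c~ computed from [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    i = sigmoid(W['i'] @ z + b['i'])        # input gate: how much to write
    f = sigmoid(W['f'] @ z + b['f'])        # forget gate: how much to keep
    o = sigmoid(W['o'] @ z + b['o'])        # output gate: how much to expose
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate cell state
    c_t = f * c_prev + i * c_tilde          # new cell state
    h_t = o * np.tanh(c_t)                  # new hidden state / output
    return h_t, c_t
```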


SAM
SAM pays attention to the internal correlations of input features and assigns different weights according to their importance [52]. Firstly, the input sequence X is linearly transformed into the query (Q), key (K) and value (V) using three different weight matrices W_Q, W_K and W_V, respectively, i.e., Q = XW_Q, K = XW_K and V = XW_V. Secondly, the similarity between Q and K is calculated and normalized by a softmax function to obtain the self-attention matrix W = Softmax(QK^T/√D_K). Finally, the output matrix H = WV is obtained by multiplying the matrix W with V. SAM can dynamically adjust the weights of different features and capture the long-range dependencies of the sequences. Figure 5 depicts the structure of SAM, where D_K is the dimension of the key K and Softmax(·) is the function of normalization by column.
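The scaled dot-product self-attention described above can be sketched in NumPy (illustrative names; the softmax here normalizes each query's scores over the keys, the standard formulation):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """H = softmax(Q K^T / sqrt(D_K)) V with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # similarity scaled by sqrt(D_K)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=1, keepdims=True)             # softmax: each row sums to 1
    return W @ V, W
```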


Proposed Model
The framework of the proposed model is illustrated in Figure 6 and mainly includes three parts: feature engineering, feature extraction and load forecasting. In the feature engineering stage, the load data are decomposed into multiple sub-series by VMD, and a new feature matrix is constructed from these sub-series and external factors. In the feature extraction stage, the 1D-CNN network is used as a deep network to extract in-depth spatial features, and the TCN extracts the long-term temporal dependencies of the input matrix. Furthermore, the temporal features hidden in the load series are further extracted using LSTM. Then, SAM dynamically adjusts the weights of the spatiotemporal features to obtain an important-feature matrix. Finally, the load forecast is obtained through the fully connected layer. Table 1 lists the hyperparameters of each algorithm used in the proposed model. The steps of the proposed model are described in the following: Step 1: VMD is employed to decompose the raw load data into eight IMFs. These IMFs are normalized along with temperature, while seasons, holidays and weekends are processed using a one-hot encoder. The feature matrix is reconstructed by combining the IMFs and external factors in parallel.
Step 2: The three-layer 1D-CNN is used as a deep network to extract deep spatial features between adjacent time points, and then the TCN is utilized to globally capture temporal features from the load data.
Step 3: LSTM can further enhance the extraction of long-term dependencies. Furthermore, the SAM dynamically adjusts the correlations of different features, strengthens long-range dependencies and obtains the important spatiotemporal features.
Step 4: The feature matrix processed by SAM is reshaped into time series that are used to predict the load through the fully connected layer.
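Step 1's feature-matrix construction might look as follows in NumPy. The feature layout (eight IMFs, temperature, four quarter one-hots, holiday and weekend flags) follows the text, while the function and argument names are our own; the min-max scaling assumes non-constant series:

```python
import numpy as np

def build_feature_matrix(imfs, temperature, season_idx, is_holiday, is_weekend):
    """Stack normalized IMFs and temperature with one-hot calendar features.
    imfs: (K, T) array of VMD modes; season_idx in {0..3}; flags are 0/1 of length T."""
    def minmax(v):
        return (v - v.min()) / (v.max() - v.min())
    # continuous features, min-max normalized to (0, 1)
    cont = np.vstack([minmax(m) for m in imfs] + [minmax(temperature)])  # (K+1, T)
    seasons = np.eye(4)[season_idx]                                      # (T, 4) one-hot
    flags = np.stack([is_holiday, is_weekend], axis=1)                   # (T, 2)
    return np.hstack([cont.T, seasons, flags])                           # (T, K+7)
```

For the eight IMFs used here, each time step carries 8 + 1 + 4 + 2 = 15 features.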

Experiments and Results Analysis
In this section, two real-world datasets collected from different regions are adopted to verify the effectiveness of the proposed model. The two datasets are described in the following.
The first dataset is the ISO-NE (New England) public dataset [53], which includes load data, temperatures and day types from 1 March 2003 to 31 December 2014. The dataset was collected every hour, and the data from 1 March 2003 to 8 October 2007 were selected for this study. The second selected dataset is the GEFCom2012 public dataset [54], including power load and temperature. The dataset was collected every hour, and a total of 38,065 sets of data from 1 January 2004 to 29 June 2008 were selected as the study sample. The two datasets were divided into a training set, a validation set and a test set at a ratio of 8:1:1.
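Because load data are ordered in time, the 8:1:1 split is chronological rather than shuffled; a minimal sketch (function name is ours):

```python
def chronological_split(data, ratios=(0.8, 0.1, 0.1)):
    """Split a time-ordered sequence 8:1:1 without shuffling, so the
    validation and test sets strictly follow the training set in time."""
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```

Keeping the split chronological avoids leaking future load values into the training set.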
Experiments on the two datasets were performed on a PC with an AMD 4800H CPU and an NVIDIA GeForce RTX GPU with 16 GB of memory. The software platform was Python 3.7 with PyTorch 1.7.

Data Preprocessing
In this subsection, the ISO-NE dataset is chosen as a typical case to elaborate the characteristics of the load data in detail. The GEFCom2012 dataset, which is similar to the ISO-NE dataset, is not discussed separately.
The first step was to determine the number of decomposition modes k, the penalty factor α and the updating step size τ using the central frequency observation and residual index minimization method [55]. In this research, these parameters were obtained as k = 8, α = 419 and τ = 0.19. Figure 7 shows the decomposition results of the load data using VMD. Firstly, the changing trend of the low-frequency component is smoother than that of the original signal, and is roughly consistent with the latter. This indicates that noise interference is effectively removed by VMD. Secondly, the influence of seasonal factors on load forecasting needed to be considered in the data preprocessing due to the smoothness and stability of the low-frequency component. Finally, all IMF components show good periodicity and stability without irregular information, which facilitates feature extraction in the subsequent stage.
Changing trends in electric demand usually present continuous and periodic characteristics without breaking. However, power consumption is easily influenced by external factors such as seasons, temperature and day types, which results in complex patterns and dynamics in the load data [56]. The relationship between power load and temperature is illustrated in Figure 8, which shows the distributions of power load and temperature from 1 March 2003 to 8 October 2007. One can clearly see that the load demand reaches a peak when the temperature becomes very high or very low; thus, temperature has a strong correlation with the power load data. Moreover, Figure 9 presents the evolutions of electric consumption over different time horizons. The time horizons of Figure 9a-c are one year from 1 January 2004 to 1 January 2005, one week from 7 to 13 April 2003, and the Christmas period from 22 to 28 December 2003, respectively. One can see from Figure 9a that power consumption is higher in January, June, July, August and December due to the summer and winter climates. Additionally, power consumption on weekdays fluctuates slightly and is higher than that on weekends, as shown in Figure 9b. As Figure 9c shows, power consumption during Christmas, an important holiday, is significantly lower than at other times. Therefore, day types such as weekdays and holidays also have an important influence on the prediction accuracy of the power load.
According to the discussions above, this study selected power demand, holidays, quarters, temperature and weekends as the input variables for load forecasting. One-hot encoding was performed on quarters, holidays, weekends and weekdays, while min-max normalization was used to scale the power load and temperature into the range (0, 1). The min-max normalization formula is

x' = (x − min(X)) / (max(X) − min(X))    (22)

where X is the input series, x indicates the data to be normalized, x' is the normalized data, and max(·) and min(·) represent the maximum and minimum of the input series, respectively.
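The preprocessing described above can be sketched in a few lines of numpy; the helper names `min_max_normalize` and `one_hot` are illustrative assumptions, not from the paper.

```python
import numpy as np

def min_max_normalize(x):
    # Eq. (22): scale each value of the series by its extrema.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def one_hot(labels, n_classes):
    # One-hot encode integer category labels,
    # e.g., quarter in {0..3} or day type in {weekday, weekend, holiday}.
    eye = np.eye(n_classes)
    return eye[np.asarray(labels, dtype=int)]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`, and the normalized load, normalized temperature and one-hot calendar columns are then stacked into the feature matrix.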

Evaluation Criteria
To evaluate the performance of the proposed model, the mean absolute percentage error (MAPE), root mean square error (RMSE) and R-squared (R2) were selected as evaluation indices. These statistical metrics are defined as follows:

MAPE = (100%/N) Σ_{t=1}^{N} |(y_t − ŷ_t)/y_t|

RMSE = sqrt((1/N) Σ_{t=1}^{N} (y_t − ŷ_t)^2)

R2 = 1 − Σ_{t=1}^{N} (y_t − ŷ_t)^2 / Σ_{t=1}^{N} (y_t − ȳ)^2

where N is the number of samples, and y_t and ŷ_t represent the actual value and the predicted value, respectively.
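The three metrics above can be computed directly; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * float(np.mean(np.abs((y_true - y_pred) / y_true)))

def rmse(y_true, y_pred):
    # Root mean square error, in the units of the load.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    # Coefficient of determination R2.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot
```

Note that MAPE assumes no zero actual values, which holds for aggregate power load data.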

Comparative Analysis of Experimental Results
Generally, the performances of single models, such as TCN, LSTM and CNN, are inferior to those of hybrid models. Moreover, the TCN-based hybrid models in STLF are comparatively inferior to other hybrid models. Thus, the experimental results of the proposed model were mainly compared with other TCN-based hybrid models with VMD decomposition, namely VMD-CNN-TCN-SAM (VCTA), VMD-TCN-LSTM-SAM (VTLA), VMD-CNN-LSTM-SAM (VCLA), VMD-CNN-TCN-LSTM (VCTL) and VMD-CNN-TCN-GRU-SAM (VCTGA). It must be stressed that the CNN in these hybrid models denotes a three-layer 1D-CNN network. Furthermore, in order to overcome the randomness of the AI algorithms, all of the models mentioned above were trained and tested enough times for the results to become stable and reliable.

ISO-NE Dataset
In order to further elaborate its performance for load forecasting, the proposed model was mainly compared to the hybrid models with VMD decomposition. Tables 2-4 show the maximum, minimum and average values of the MAPE, RMSE and R2, respectively. Compared with the VCLA, the VTLA reduces the MAPE by 13.6% and the RMSE by 10.4%, due to the large receptive field of the TCN. The VCTA model is conducive to the extraction of long-term in-depth features and improves the performance of load forecasting; for example, the VCTA decreases the MAPE by 48.6% and the RMSE by 48% compared to the VTLA. When the VCTA is further stacked with a GRU, i.e., the VCTGA, its MAPE and RMSE are further reduced by 20.8% and 30.8%, respectively, which means that the GRU can enhance the extraction of long-term dependencies. However, the proposed model outperforms even the VCTGA by 17.5% and 7% in terms of the MAPE and RMSE, respectively. This again demonstrates that LSTM has a stronger capability to capture temporal dependencies than GRU. The comparison with the VCTL shows that SAM is very important for globally enhancing the key features of the proposed model: compared with the VCTL, the proposed model decreases the MAPE by 39% and the RMSE by 49%. Therefore, SAM is extremely important for globally adjusting and retaining key features. Additionally, the R2 of the proposed model increases by 0.04% to 1.02% over that of the contrast models and reaches up to 99.9%, which suggests that VMD decomposition is indeed effective in improving prediction accuracy and that the results of the proposed model can be trusted.
To further examine the deviations between each model and the actual load data, Figure 10 shows the predicted load over 24 h and the actual load data when the MAPE is a maximum and a minimum, respectively. One can find that the VMD-CNN-TCN (VCT)-based hybrid models approximately capture the changing trends in the actual load, whereas the VCLA and VTLA models obviously deviate from it. This means that the stacked model based on CNN and TCN plays a critical role in feature extraction for the training model. Load forecasting during the peak or valley areas remains difficult for the models to predict accurately. However, it can be seen from the subplots that the hybrid model proposed in this article achieves the smallest deviation compared with the other VCT-based hybrid models, especially during the turning area. Furthermore, Figure 11 shows bar charts of the RMSE for each model. One can clearly see that the VCTA-based hybrid models are more stable than the other models, and the proposed model has the best stability among all of them. At the same time, SAM is also important for improving the stability of the VCT-based models.
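The large receptive field credited to the TCN above comes from dilated causal convolutions: the output at time t depends only on inputs at t, t−d, t−2d, and so on, so stacking layers with growing dilation d covers a long history without future leakage. A minimal single-channel numpy sketch (the name `causal_dilated_conv1d` is an illustrative assumption, not the paper's code):

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    # Causal dilated 1-D convolution over a single channel.
    # out[t] = sum_i kernel[k-1-i] * x[t - i*dilation], with zero padding
    # on the left so no future samples are used.
    x = np.asarray(x, dtype=float)
    k = len(kernel)
    pad = dilation * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            out[t] += kernel[k - 1 - i] * xp[pad + t - dilation * i]
    return out
```

With a pure delay kernel `[1.0, 0.0]` and dilation 2, each output equals the input two steps earlier, illustrating how dilation widens the look-back per layer.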


GEFCom2012 Dataset
To further validate the generalization of the proposed model, all of the contrast experiments mentioned above were repeated using the GEFCom2012 dataset. Tables 5-7 show the maximum, minimum and average values of the MAPE, RMSE and R2, respectively. One can see that, compared with the VCLA, the VCTA model significantly reduces the MAPE by 45.7% and the RMSE by 30%. This again demonstrates that the VCT-based models effectively extract the in-depth features from the load series. Furthermore, the proposed model, which stacks LSTM on the VCTA, outperforms the latter by 12.7% and 14.5% in terms of the MAPE and RMSE, respectively, which further proves that LSTM can enhance the capture of temporal dependencies. Additionally, the influence of SAM on the extraction of important features from the load data can also be demonstrated: the proposed model further decreases the MAPE by 15.2% and the RMSE by 12.2% compared with the VCTL. At the same time, the R2 results of all of the models are over 99%, which demonstrates the effectiveness of VMD decomposition; moreover, the R2 value of the proposed model is the highest among these models. Therefore, the above analyses show that the proposed model has good generalization capabilities in power load forecasting. Figure 12 shows the predicted load over 24 h and the actual load data when the MAPE is a maximum and a minimum, respectively. As shown in Figure 12, all of the models accord with the actual load during the rising or falling stages. However, the proposed model better captures the trends in the actual load than the other models, especially during the turning area. Furthermore, comparing Figures 10 and 12, one can see that the GEFCom2012 dataset has more peaks and valleys in its changing trends, indicating that it is more complex and volatile than the ISO-NE dataset. Thus, the evaluation indices of the load forecasting for all of the models on the GEFCom2012 dataset are higher than those of the corresponding models on the ISO-NE dataset. In addition, Figure 13 shows the maximum, minimum and average values of the RMSE for all of the models with the GEFCom2012 dataset. It can be seen that, although the GEFCom2012 dataset has obviously nonlinear and volatile characteristics, the proposed model still achieves the best reliability and the highest prediction accuracy and stability.
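The capture of temporal dependencies attributed to LSTM above rests on its gating mechanism. The following single-step numpy sketch assumes the common stacked-gate weight layout; the function names, gate ordering and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,).
    # Gates decide what to forget, what to write into the cell state
    # and what to expose, which preserves long-range dependencies.
    n = len(h)
    z = W @ x + U @ h + b        # stacked gate pre-activations
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate update
    c_new = f * c + i * g        # new cell state
    h_new = o * np.tanh(c_new)   # new hidden state
    return h_new, c_new
```

Iterating this step over the feature matrix, one time step at a time, yields the temporal features that are then passed to SAM.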

Discussion
After VMD decomposition of the two datasets, the predictive results of all of the VMD-based models in this research became stable, which underlines the importance of smoothing and stabilizing the load data for accurate prediction. Moreover, the combination of TCN and LSTM helps to fully extract the long-term temporal dependencies and local dependencies of the load data. Because multiple single modules are stacked, a lot of irrelevant feature information is extracted from the load data; SAM is therefore used to dynamically adjust the correlation weights of the different features to obtain the key information. Therefore, the novel architecture of the proposed model is very beneficial for feature extraction from large-scale load data with nonlinear patterns and dynamics. It should be pointed out that the computational cost of the proposed model increases greatly compared to single models, due to its complicated structure; however, this problem can be mitigated by the rapid development of computing power. In addition, all of the models were trained and tested several times in this study to obtain a series of stable results. Figure 14 shows a boxplot of the MAPE for all of the models based on the ISO-NE and GEFCom2012 datasets. One can see that the proposed model not only achieves the lowest MAPE value among all of the models for the two datasets, but also the smallest variation range in its MAPE value. This indicates that the novel model proposed in this study has excellent generalization and reliable prediction accuracy.
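The correlation re-weighting performed by SAM can be sketched as scaled dot-product self-attention over the extracted feature matrix. For brevity this minimal numpy version uses identity query/key/value projections, which is an assumption; a real SAM layer learns these projections.

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention over a (T, d) feature matrix.
    # The softmax rows re-score how much each time step's features
    # contribute, which is how SAM emphasizes important features.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                         # (T, T) similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights                           # re-weighted features
```

Each row of `weights` sums to one, so the output is a convex re-combination of the time steps' features rather than a hard selection.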

Comparison with State-of-the-Art Models
It is necessary to compare the proposed model with other state-of-the-art hybrid models, all of which are TCN-based or data decomposition models published in recent years. Table 8 lists the comparative results of the proposed model and other baseline models. It is obvious that the proposed model significantly outperforms the state-of-the-art models in terms of the MAPE. Even when the same datasets are used as in [14,54] for one-hour-ahead forecasting, the MAPE values of the proposed model are significantly reduced by 37% and 36%, respectively. It should be emphasized that [14] used an EMD decomposition method and achieved high prediction accuracy; however, compared with the results of [14], the proposed model in this research reduces the MAPE by 37% and the RMSE by 39%. This also demonstrates that VMD decomposition outperforms EMD decomposition in terms of data smoothing and linearization. Furthermore, the hybrid models with data decomposition all achieved high accuracies in load forecasting, which shows that smoothing and stabilizing the load data is a very important process for load forecasting. It should be noted that the RMSE values of some comparative models are lower than that of the proposed model due to their simpler structures and fewer hyperparameters. Overall, the proposed model can extract in-depth spatiotemporal features and then globally enhance key features to achieve reliable load forecasting with greater accuracy.

Conclusions
In order to globally extract in-depth features from load data with nonlinear patterns and dynamics, a novel model needs to be developed to achieve greater accuracy in load forecasting and to ensure the reliable and economic operation of an electric grid. This study proposed a new hybrid model based on VMD and deep TCN-based networks with SAM for STLF. The load data were decomposed into multiple sub-series that were reshaped into a feature matrix, along with external factors. A three-layer 1D-CNN network was used as a deep network to extract in-depth features, thus avoiding over-fitting and gradient vanishing during training. The TCN extracted the long-term temporal dependencies, and LSTM further enhanced the extraction of temporal features. These features were globally adjusted via SAM to retain the important ones. The load forecast was eventually obtained from the fully connected layer. The effectiveness and generalization of the proposed model were validated on two real-world datasets, ISO-NE and GEFCom2012. Compared with other benchmarking models, the experimental results demonstrated that the proposed model improved the prediction accuracy by 17% to 71% in terms of the MAPE and 7% to 70% in terms of the RMSE for the ISO-NE dataset, and by 15% to 53% in terms of the MAPE and 12% to 43% in terms of the RMSE for the GEFCom2012 dataset. Moreover, the R2 of the proposed model reached up to 99.9% for the ISO-NE dataset and 99.83% for the GEFCom2012 dataset. Therefore, the proposed model can effectively and globally extract the in-depth features from load data and external factors to achieve greater accuracy in load forecasting.
In future research, we would like to further optimize the deep learning techniques to reduce their computational costs and achieve greater prediction accuracies. Moreover, future research could improve the TCN-based models to better adapt to complex time series.

Figure 1 .
Figure 1. The structure of causal convolution.

Figure 3 .
Figure 3. Structure of the residual block.

Figure 5 .
Figure 5. The structure of SAM.

Figure 6 .
Figure 6. Flow chart of the proposed model.

Figure 7 .
Figure 7. Decomposition results of the load data using VMD for the ISO-NE dataset.

Figure 8 .
Figure 8. Evolutions in power load and temperature from 1 March 2003 to 8 October 2007 for ISO-NE dataset.

Figure 9 .
Figure 9. Evolutions in power load in different time horizons for ISO-NE dataset. (a) One year, (b) one week, (c) from 22 to 28 December 2003.

Figure 10 .
Figure 10. Load forecasting profiles over 24 h based on ISO-NE dataset when the MAPE is a maximum (a) and a minimum (b).

Figure 11 .
Figure 11. Maximum, minimum and average values of RMSE for each model based on ISO-NE dataset.

Figure 12 .
Figure 12. Load forecasting profiles over 24 h based on GEFCom2012 dataset when the MAPE is a maximum (a) and a minimum (b).

Figure 13 .
Figure 13. Maximum, minimum and average values of RMSE for each model based on GEFCom2012 dataset.

Figure 14 .
Figure 14. Boxplots of MAPE for each model based on (a) ISO-NE dataset and (b) GEFCom2012 dataset.

Table 1 .
The hyperparameters of each algorithm used in the proposed model.

Table 2 .
Maximum, minimum and average values of MAPE for ISO-NE dataset.

Table 3 .
Maximum, minimum and average values of RMSE for ISO-NE dataset.

Table 4 .
Maximum, minimum and average values of R2 for ISO-NE dataset.

Table 5 .
Maximum, minimum and average values of MAPE for GEFCom2012 dataset.

Table 6 .
Maximum, minimum and average values of RMSE for GEFCom2012 dataset.

Table 7 .
Maximum, minimum and average values of R2 for GEFCom2012 dataset.

Table 8 .
Comparison of the proposed model with state-of-the-art models.