1. Introduction
Power load forecasting is the foundation of the operation and planning of a power system. The generation, transmission, distribution and consumption of electricity occur almost simultaneously. It is well known that massive storage of electric power is still difficult for current technologies. Thus, power supply sides and demand sides should be balanced dynamically to ensure the safety and reliability of the power grid [
1]. However, the high penetration of renewable energy sources and electric devices in modern power systems results in significant uncertainties in the load demand. For this reason, power load forecasting has become increasingly important in modern power systems. 
Power load forecasting involves forecasting the load demand for a time span in the future, using historical load data and other external factors such as meteorological conditions, seasonal effects, economic factors and social activities. Accurate load forecasting can help power companies to make efficient decisions, avoid resource waste and improve grid stability and reliability [
2]. Therefore, load forecasting has attracted increasing attention from researchers all over the world. Short-term load forecasting (STLF) is an approach that predicts power load consumption within an interval of an hour to a week, based on historical load data and other external factors [
3]. STLF provides a basis for generating units to schedule their start-up and shut-down times, prepares the rotating reserve and carries out in-depth analyses of the limitations in the transmission system [
4]. It has been demonstrated that a 1% increase in STLF error increases the operating cost by GBP 17.7 million [
5] or EUR 4.55 to 9.1 million [
6]. In addition, the use of distributed generation and intelligent devices generates load data with more nonlinear patterns and dynamics [
7]. Furthermore, the scale of the load data also sharply increases with the widespread use of smart meters [
8]. Therefore, an accurate STLF model is a prerequisite for the safe, stable and economic operation of an electric grid. 
In the early years, a variety of prediction models were developed based on statistical methods and artificial intelligence methods. The statistical methods usually include autoregressive moving average [
9], autoregressive integrated moving average [
10], exponential smoothing [
11], etc. Although these methods are easy to implement without additional feature inputs, they are not suitable for handling load data with nonlinear patterns and dynamics [
12,
13]. In recent years, artificial intelligence (AI) methods have been widely used in power load forecasting, alongside advances in computing technology. As typical AI algorithms, artificial neural networks (ANNs) have strong nonlinear modeling capabilities, and can approximate any nonlinear function without prior knowledge of the relationship between the training model and the data [
14]. However, the hyperparameters of ANNs are set manually, so ANNs cannot effectively handle large-scale load data. It is therefore difficult for them to achieve high prediction accuracy on load data in modern power systems [
15]. Moreover, the output of the training model depends not only on the current input, but also on the previous input [
16]. Therefore, traditional artificial intelligence methods cannot fully extract the features of time series to achieve greater prediction accuracies.
Recently, deep learning methods have become increasingly popular in the field of time series. Long short-term memory (LSTM), developed from the recurrent neural network (RNN), can handle long-term dependencies in time series and mitigates the vanishing and exploding gradient problems of the RNN [
17]. A convolutional neural network (CNN) can effectively capture the local relationships among adjacent time points [
18] and is also widely used in many fields, such as image recognition [
19], renewable energy forecasting [
20], as well as load forecasting [
21]. However, the large number of parameters in a CNN can easily lead to overfitting of the training model. One-dimensional (1D) CNNs alleviate this problem because they have fewer learnable parameters. Furthermore, the temporal convolutional network (TCN), developed from the CNN, uses a dilated convolution module to obtain a large receptive field with fewer layers [
22], which is conducive to extracting nonlinear features of time series. Although the TCN has better predictive performance in time series compared with LSTM and CNN [
23,
24], it can neither learn dependencies between distant positions within the sequence nor extract internal correlations of the input. By contrast, the self-attention mechanism (SAM) focuses on capturing internal correlations of input features and can effectively deal with long time series, which helps the training model to identify key features. For instance, a novel model based on bidirectional LSTM, XGBoost and SAM was proposed for power load forecasting in [
25]. An improved LSTM optimized with SAM was used to predict the concentration of air pollutants in [
26]. A new model based on SAM and multi-task learning was developed to predict ultra-short-term photovoltaic power generation in [
27]. These results indicate that SAM can select key information from complex features and suppress irrelevant information during model training. 
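To make the receptive-field argument concrete, the following minimal sketch compares a stack of dilated causal convolutions with a stack of ordinary convolutions of the same depth. It assumes one convolution per layer, a kernel size of 3 and dilations that double at each layer; these settings are illustrative assumptions and are not taken from the proposed model.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated causal convolutions.

    Assumes one convolution per layer: a layer with dilation d widens the
    receptive field by (kernel_size - 1) * d time steps.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Illustrative settings (assumptions, not the paper's configuration).
kernel_size = 3
dilated = [2 ** i for i in range(4)]  # dilations 1, 2, 4, 8
plain = [1] * 4                       # ordinary convolutions, same depth

print(receptive_field(kernel_size, dilated))  # 31
print(receptive_field(kernel_size, plain))    # 9
```

With the same four layers, the dilated stack covers 31 time steps, whereas the undilated stack covers only 9, which is why dilation allows a large receptive field with few layers.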
Although these deep learning methods improve the prediction accuracy, single models cannot fully extract the input features, especially in-depth features. Therefore, scholars are increasingly focusing on hybrid models, which combine the advantages of each single model [
28]. For example, a hybrid approach combining singular spectrum analysis-based decomposition and ANN for day-ahead hourly load forecasting was proposed in [
29]. A TCN–LSTM hybrid model was used to forecast weather from meteorological data in [
30]. A day-ahead load forecasting model based on CNN and TCN was developed and achieved superior performance in [
31]. A hybrid model based on TCN combined with attention mechanism was proposed to fully exploit the nonlinear relationships between load data and external factors in [
32]. Although the above hybrid models improved the prediction performance in various respects, the raw load data were fed directly into the training models without being smoothed and denoised in advance, which is not conducive to further improving the accuracy of STLF.
To reduce the effects of nonlinear patterns and dynamics of load data on the forecasting accuracy, decomposition methods have been used to decompose the load data into multiple smooth and stable sub-series in different frequency bands. For instance, Wang et al. [
33] decomposed the load data into a series of sub-signals using wavelet decomposition (WD). The sub-signals were predicted by different models according to their frequencies. Liang et al. [
34] used empirical mode decomposition (EMD) to decompose load signals into multiple intrinsic mode functions (IMFs) to weaken the nonlinearity and dynamics of the load data. Zhu et al. [
35] proposed a hybrid model for carbon price forecasting based on EMD and evolutionary least squares support vector regression. Ensemble empirical mode decomposition (EEMD) was used to decompose load data into different frequency components in [
36]. The low- and high-frequency components were predicted via multivariable linear regression and LSTM, respectively. However, WD cannot meet the requirements of complex and dynamic load series in time-frequency analysis [
33]. In addition, EMD and EEMD have the disadvantages of mode aliasing and end effect, and lack the support of mathematical theory [
37]. Variational mode decomposition (VMD), which rests on a solid mathematical model, was therefore adopted as an alternative that overcomes these drawbacks. VMD has good data decomposition accuracy, and obtains a group of stable sub-signals without noise interference [
38]. A hybrid model based on VMD and LSTM for short-term load forecasting was proposed in [
39]. A CNN and TCN hybrid model combined with VMD-based data processing was developed for power load forecasting in [
40]. A hybrid prediction model based on VMD, TCN and an error correction strategy for electricity load forecasting was presented in [
41]. A hybrid model based on TCN and VMD for short-term wind power forecasting was proposed in [
42]. A hybrid model based on GRU and TCN combined with VMD decomposition was developed for load forecasting in [
43].
Although the studies mentioned above verified that deep learning methods based on VMD can improve the prediction accuracy of load demand, some aspects still require further study and improvement. For example, many studies adopted deep learning models with shallow networks to improve the training efficiency, and rarely considered using models with deep networks to learn full features from the decomposed sub-series. Moreover, most studies paid little attention to the correlations between the sub-series and external factors, such as temperature, seasons and holidays. Deep learning models with multiple hidden layers can indeed improve feature extraction, but such models are difficult to train on several sub-series together with external factors. Therefore, it is necessary to construct a deep learning model with shallow networks to extract in-depth features from several sub-series and external factors. To this end, a novel STLF model based on VMD and a deep TCN-based hybrid method with SAM is proposed to fully capture the in-depth features of multiple sub-series and external factors. The main advantages of this research are as follows:
- (1) The raw load data are decomposed into multiple stable sub-series using VMD. Furthermore, the external factors are also considered as input variables and are combined with the sub-series into a feature matrix for the training model.
- (2) A three-layer 1D-CNN network is constructed as a deep network; owing to its small number of learnable parameters, it limits overfitting of the training model, and its extracted features are reshaped into time series for the TCN module. The TCN extracts the long-term temporal dependencies of the input matrix. Moreover, LSTM is used to further enhance the extraction of temporal features.
- (3) SAM amplifies important information and weakens irrelevant information in the feature matrix. Thus, SAM dynamically adjusts the correlation weights of complex features to obtain important information from the input data.
- (4) Compared with traditional models and other benchmark models, the novel hybrid model performs more effectively and achieves greater prediction accuracy.
The rest of this study is organized as follows. 
Section 2 introduces the basic methodologies of VMD, 1D-CNN, TCN, LSTM, SAM, as well as the framework of the proposed hybrid model. 
Section 3 describes the data preprocessing, evaluation criteria and experimental analysis. Finally, conclusions and directions for future research are presented in 
Section 4.
  3. Experiments and Results Analysis
In this section, two real-world datasets collected from different regions are used to verify the effectiveness of the proposed model. The two datasets are described below.
The first dataset is the ISO-NE (New England) public dataset [
53], which includes load data, temperatures, as well as day types from 1 March 2003 to 31 December 2014. The dataset was collected every hour, and the data from 1 March 2003 to 8 October 2007 were selected for this study. The second selected dataset is the GEFCom2012 public dataset [
54], including power load and temperature. The dataset was collected every hour, and a total of 38,065 sets of data from 1 January 2004 to 29 June 2008 were selected as the study sample. The two datasets were divided into a training set, a validation set and a test set at a ratio of 8:1:1.
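The 8:1:1 division can be sketched as follows. The sketch assumes a chronological split (no shuffling), which is a common choice for time series but is not stated in the text; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, ratios=(8, 1, 1)):
    """Split a time-ordered sequence into train/validation/test parts.

    Assumption: the split is chronological, so the validation and test
    samples always lie after the training samples in time.
    """
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


# Stand-in for the 38,065 hourly GEFCom2012 samples mentioned above.
samples = list(range(38065))
train, val, test = chronological_split(samples)
print(len(train), len(val), len(test))  # 30452 3806 3807
```

Integer arithmetic is used for the boundaries so that the three parts always add up to the full series.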
Experiments on the two datasets were performed on a PC with AMD-4800H CPU and NVIDIA GeForce RTX GPU with 16 GB of memory. The software platform used was Python 3.7 with PyTorch 1.7.
  3.1. Data Preprocessing
In this subsection, the ISO-NE dataset is used as a representative case to describe the characteristics of the load data in detail. The GEFCom2012 dataset has similar characteristics and is therefore not discussed separately.
The first step was to determine the number of decomposition modes K, the penalty factor α and the updating step size τ using the central frequency observation and the residual index minimization method [
55]. In this research, it was convenient to obtain these parameters, i.e., 
, 
 and 
. 
Figure 7 shows the decomposition results of load data using VMD. Firstly, the trend of the low-frequency component is smoother than that of the original signal and roughly consistent with it, which indicates that noise interference is effectively removed by VMD. Secondly, because the low-frequency component is smooth and stable, the influence of seasonal factors on load forecasting needs to be considered in the data preprocessing. Finally, all IMF components show good periodicity and stability without irregular information, which facilitates the extraction of features in the subsequent stage.
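As a rough illustration of the residual-based part of the parameter selection, the sketch below scores candidate decompositions by a residual index. It assumes one definition found in the VMD literature, namely the mean absolute difference between the original signal and the sum of the modes; the exact index used in the cited method may differ. The modes are assumed to be precomputed by some VMD routine, and the function names and toy values are hypothetical.

```python
def residual_index(signal, modes):
    """Mean absolute difference between a signal and the sum of its modes.

    Assumed definition; the cited method may use a different index.
    """
    n = len(signal)
    reconstructed = [sum(mode[t] for mode in modes) for t in range(n)]
    return sum(abs(r - s) for r, s in zip(reconstructed, signal)) / n


def pick_mode_count(signal, candidates):
    """Return the mode count whose decomposition has the smallest residual.

    `candidates` maps a mode count K to a list of K precomputed modes.
    """
    return min(candidates, key=lambda k: residual_index(signal, candidates[k]))


# Toy data (hypothetical): K = 3 reconstructs the signal more closely than K = 2.
signal = [1.0, 2.0, 3.0, 4.0]
candidates = {
    2: [[0.5, 1.0, 1.5, 2.0], [0.4, 0.9, 1.4, 1.9]],
    3: [[0.5, 1.0, 1.5, 2.0], [0.3, 0.6, 0.9, 1.2], [0.2, 0.4, 0.6, 0.8]],
}
print(pick_mode_count(signal, candidates))
```

The residual alone tends to favor larger mode counts, which is why it is paired with the central frequency observation: modes whose central frequencies are nearly identical suggest over-decomposition.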
Changing trends in electric demand usually present continuous and periodic characteristics without breaking. However, the power consumption is easily influenced by external factors such as seasons, temperature, day types, etc. This results in complex patterns and dynamics in the load data [
56]. The relationship between power load and temperature is illustrated in 
Figure 8, which shows the distributions of power load and temperature from 1 March 2003 to 8 October 2007. One can clearly see that the power load demand peaks when the temperature becomes very high or very low. Thus, temperature has a strong correlation with power load data. Moreover, 
Figure 9 presents the evolutions of electric consumption in different time horizons. The time horizons of 
Figure 9a–c are one year from 1 January 2004 to 1 January 2005, one week from 7 to 13 April 2003, and the Christmas period from 22 to 28 December 2003, respectively. One can see from 
Figure 9a that the power consumption is higher in January, June, July, August and December due to the summer and winter climates. Additionally, the power consumption on weekdays slightly fluctuates and is higher than that on weekends, as shown in 
Figure 9b. As an important holiday, one can see from 
Figure 9c that the power consumption during Christmas is significantly lower than at other times. Therefore, the day types such as weekdays and holidays also have important influences on prediction accuracy of the power load.
According to the discussion above, this study selected power demand, holidays, quarters, temperature and weekends as the input variables for load forecasting. One-hot encoding was performed on quarters, holidays, weekends and weekdays, whereas min–max normalization was used to scale the power load and temperature to the range [0, 1]. The formula of min–max normalization is as follows:

$$x^{*} = \frac{x - \min(X)}{\max(X) - \min(X)}$$

where $X$ is the input series, $x$ indicates the data to be normalized, $x^{*}$ is the normalized data and max (·) and min (·) represent the maximum and minimum of the input series, respectively.
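The preprocessing described above can be sketched as follows: min–max normalization for the continuous variables and one-hot encoding for the categorical ones, combined into one feature row per hour. The sample values, the column order and the helper names are illustrative assumptions, not the paper's data or code.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical hourly samples (not taken from ISO-NE or GEFCom2012).
load = [12000.0, 15000.0, 18000.0]  # power load
temperature = [-5.0, 10.0, 25.0]    # temperature
quarter = [0, 1, 2]                 # zero-based quarter of the year
holiday = [1, 0, 0]                 # 1 = holiday
weekend = [0, 0, 1]                 # 1 = weekend, 0 = weekday

load_n = min_max_normalize(load)
temp_n = min_max_normalize(temperature)

# One feature row per hour: 2 continuous values + 4 + 2 + 2 one-hot columns.
features = [
    [load_n[t], temp_n[t]]
    + one_hot(quarter[t], 4)
    + one_hot(holiday[t], 2)
    + one_hot(weekend[t], 2)
    for t in range(len(load))
]
print(features[1])
```

In practice the minimum and maximum are usually computed on the training set only and reused for the validation and test sets, so that no information leaks from later data.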
  3.2. Evaluation Criteria
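To evaluate the performance of the proposed model, the mean absolute percentage error (MAPE), root mean square error (RMSE) and R-squared (R²) were selected as evaluation indices. These statistical metrics are defined as follows:

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2}$$

where $N$ is the number of samples, $y_i$ and $\hat{y}_i$ represent the actual value and the predicted value, respectively, and $\bar{y}$ is the mean of the actual values.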
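The three evaluation indices can be computed as in the following sketch, which assumes their standard definitions. The sample values are made up for illustration and are not results from this study.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up example values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(round(mape(actual, predicted), 3))
print(round(rmse(actual, predicted), 3))
print(round(r_squared(actual, predicted), 4))
```

MAPE is scale-free, whereas RMSE is expressed in the units of the load, so RMSE values are only comparable between models evaluated on the same dataset.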
  3.3. Comparative Analysis of Experimental Results
Generally, the performances of single models, such as TCN, LSTM and CNN, are inferior to those of hybrid models. Moreover, the TCN-based hybrid models in STLF are comparatively inferior to other hybrid models. Thus, the experimental results of the proposed model were mainly compared with other TCN-based hybrid models with VMD decomposition that include VMD–CNN–TCN–SAM (VCTA), VMD–TCN–LSTM–SAM (VTLA), VMD–CNN–LSTM–SAM (VCLA), VMD–CNN–TCN–LSTM (VCTL) and VMD–CNN–TCN–GRU–SAM (VCTGA). It must be stressed that the CNN in these hybrid models means a three-layer 1D-CNN network. Furthermore, to reduce the effect of the randomness of the AI algorithms, all of the models mentioned above were trained and tested repeatedly until the results became stable and reliable.
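Repeated runs of this kind are typically summarized by their maximum, minimum and average scores. The following sketch shows one way to do so; the model names and MAPE values are hypothetical placeholders, not results from this study.

```python
import statistics


def summarize_runs(scores):
    """Return the maximum, minimum and mean of a metric over repeated runs."""
    return {
        "max": max(scores),
        "min": min(scores),
        "mean": statistics.mean(scores),
    }


# Hypothetical MAPE values (%) from five repeated runs of two models.
mape_runs = {
    "model_a": [0.52, 0.48, 0.50, 0.55, 0.45],
    "model_b": [0.71, 0.64, 0.80, 0.66, 0.69],
}

for name, scores in mape_runs.items():
    summary = summarize_runs(scores)
    print(name, summary["max"], summary["min"], round(summary["mean"], 3))
```

The gap between the maximum and minimum gives a simple indication of how sensitive a model is to random initialization.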
  3.3.1. ISO-NE Dataset
In order to further elaborate its performance for load forecasting, the proposed model was mainly compared to the hybrid models with VMD decomposition. 
Table 2, 
Table 3 and 
Table 4 show the maximum, minimum and average values of the MAPE, RMSE and R², respectively. Compared with the VCLA, the VTLA reduces the MAPE by 13.6% and the RMSE by 10.4%, owing to the large receptive field of the TCN. The VCTA model is conducive to the extraction of long-term in-depth features and improves the performance of load forecasting. For example, the VCTA decreases the MAPE by 48.6% and the RMSE by 48% compared with the VTLA. When the VCTA is further stacked with GRU, i.e., VCTGA, its MAPE and RMSE are further reduced by 20.8% and 30.8%, respectively. This means that the GRU can enhance the extraction of long-term dependencies. However, the proposed model outperforms even the VCTGA, by 17.5% and 7% in terms of the MAPE and RMSE, respectively. This suggests that LSTM has a stronger capability to capture temporal dependencies than GRU. A comparison with the VCTL shows the contribution of SAM: the proposed model decreases the MAPE by 39% and the RMSE by 49%. Therefore, SAM is important for globally adjusting and retaining key features, which underpins the superior performance of the proposed model. Additionally, the R² of the proposed model is 0.04% to 1.02% higher than that of the contrast models and reaches up to 99.9%, which implies that VMD decomposition is indeed effective in improving prediction accuracy, and that the results of the proposed model can be trusted.
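The percentage reductions quoted in this comparison read as relative changes with respect to the baseline model. A minimal sketch of that calculation is given below; it assumes the usual relative-reduction formula, which the text does not state explicitly, and the numbers are made up.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline value."""
    return (baseline - improved) / baseline * 100.0


# Made-up values: an error falling from 2.0 to 1.5 is a 25% reduction.
print(relative_reduction(2.0, 1.5))  # 25.0
```

A relative reduction is not the same as a difference in percentage points: a MAPE falling from 2.0% to 1.5% is a 25% reduction but only a 0.5 percentage-point change.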
To further see the deviations between each model and the actual load data, 
Figure 10 shows the predicted load over 24 h and the actual load data when the MAPE is a maximum and a minimum, respectively. One can find that the VMD–CNN–TCN (VCT)-based hybrid models approximately capture the changing trends in the actual load. However, the VCLA and VTLA models obviously deviate from the actual load. This means that the stacked model based on CNN and TCN plays a critical role in feature extraction for the training model. Similarly, load forecasting during the peak or valley areas is still difficult for the models to accurately predict. However, it can be seen from the subplots that the hybrid model proposed in this article achieves the smallest deviation compared with other VCT-based hybrid models, especially during the turning area. Furthermore, 
Figure 11 shows bar charts of the RMSE for each model. One can clearly see that the VCTA-based hybrid models are more stable compared with other models, and the proposed model has the best stability among all of these models. At the same time, SAM is also important to improve the stability of the VCT-based models.
  3.3.2. GEFCom2012 Dataset
To further validate the generalizations of the proposed model, all of the contrast experiments mentioned above were performed using the GEFCom2012 dataset. 
Table 5, 
Table 6 and 
Table 7 show the maximum, minimum and average values of the MAPE, RMSE and R², respectively. Compared with the VCLA, the VCTA model reduces the MAPE by 45.7% and the RMSE by 30%. This again demonstrates that the VCT-based models effectively extract the in-depth features from the load series. Furthermore, the proposed model, which stacks LSTM on the VCTA, outperforms the latter by 12.7% and 14.5% in terms of the MAPE and RMSE, respectively, which further supports the view that LSTM can enhance the capture of temporal dependencies. Additionally, the influence of SAM on the extraction of important features from the load data can also be demonstrated: the proposed model decreases the MAPE by 15.2% and the RMSE by 12.2% compared with the VCTL. At the same time, the R² results of all of the models are over 99%, which demonstrates the effectiveness of VMD decomposition; moreover, the R² value of the proposed model is the highest among these models. Therefore, the above analyses show that the proposed model has good generalization capabilities in power load forecasting.
Figure 12 shows the predicted load over 24 h and the actual load data when the MAPE is a maximum and a minimum, respectively. As shown in 
Figure 12, all of the models are in accord with the trends in the actual load during the rising or falling stages. However, the proposed model is better able to capture trends in the actual load compared with the deviations of the other models, especially during the turning area. Furthermore, in comparing 
Figure 10 and 
Figure 12, one can see that the changing trends of the GEFCom2012 dataset have more peaks and valleys, which indicates that the GEFCom2012 dataset has more complexity and volatility compared to the ISO-NE dataset. Thus, the evaluation indices of the load forecasting for all of the models in the GEFCom2012 dataset are higher than those of the corresponding models with the ISO-NE dataset. In addition, 
Figure 13 shows the maximum, minimum and average values of the RMSE for all of the models with the GEFCom2012 dataset. It can be seen that although the GEFCom2012 dataset has obviously nonlinear and volatile characteristics, the proposed model still achieves the best reliability and the highest prediction accuracy and stability.
   3.3.3. Discussion
After VMD decomposition of the two datasets, the predictive results of all of the VMD-based models in this research became stable, which implies the importance of smoothing and stabilizing the load data for accurate prediction. Moreover, the combination of TCN and LSTM helps to fully extract the long-term temporal dependencies and local dependencies of the load data. Because multiple single modules are stacked, a large amount of irrelevant feature information is extracted from the load data. SAM is then used to dynamically adjust the correlation weights of the features to obtain the key information. Therefore, the novel architecture of the proposed model is very beneficial for feature extraction from large-scale load data with nonlinear patterns and dynamics. It should be pointed out that the computational cost of the proposed model increases greatly compared to other single models, due to its complicated structure. However, this cost is expected to become less of an obstacle with the rapid development of computing power. In addition, all of the models were trained and tested several times in this study to obtain a series of stable results. 
Figure 14 shows a boxplot of the MAPE for all of the models based on the ISO-NE and GEFCom2012 datasets. One can see that the proposed model not only achieves the lowest MAPE value among all of the models for the two datasets, but also achieves the smallest variation range in its MAPE value. This indicates that the novel model proposed in this study has excellent generalization and reliable prediction accuracy.
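The re-weighting role attributed to SAM in this discussion can be illustrated with a minimal scaled dot-product self-attention sketch. It is a simplification under stated assumptions: queries, keys and values are the input features themselves, with no learned projections, and the input values are made up. It is not the SAM implementation used in the proposed model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over T feature vectors of size d.

    Simplifying assumption: queries, keys and values are all `x` itself.
    Returns the re-weighted features and the T x T attention weights.
    """
    d = len(x[0])
    weights = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Made-up features for three time steps; steps 0 and 2 are identical.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(x)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to one, and time steps with similar features receive larger weights, so information from related positions is amplified while the rest is attenuated.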
  3.3.4. Comparison with State-of-the-Art Models
It is necessary to compare the proposed model with other state-of-the-art hybrid models, all of which are TCN-based or data decomposition models published in recent years. 
Table 8 lists the comparative results of the proposed model and other baseline models. It is obvious that the proposed model significantly outperforms state-of-the-art models in terms of the MAPE. Even if the same datasets are used in [
14,
54] for one-hour-ahead forecasting, the MAPE values of the proposed model are significantly reduced by 37% and 36%, respectively. It should be emphasized that [
14] used an EMD decomposition method, and achieved high prediction accuracy. However, compared with the results of [
14], the proposed model in this research reduces the MAPE by 37% and the RMSE by 39%. This also demonstrates that VMD decomposition outperforms EMD decomposition in terms of data smoothing and linearization. Furthermore, the hybrid models with data decomposition all achieved high accuracies in load forecasting, which proves that the smoothing and stabilizing of load data is a very important process for load forecasting. It should be noted that the RMSE values of some comparative models are lower than that of the proposed model due to their simpler structures and fewer hyperparameters. Overall, the proposed model can extract in-depth spatiotemporal features, and then globally enhance key features to achieve reliable load forecasting with greater accuracy.