1. Introduction
With the growing global demand for electricity and the continuous improvement of power systems, short-term power load forecasting is becoming increasingly important for optimizing energy distribution. By accurately forecasting future power demand, power companies can adjust energy supply efficiently and economically, keep supply and demand in balance, avoid energy waste, and reduce both cost and environmental impact through the intelligent dispatch of all types of power generation resources. In addition, load forecasting enhances grid stability and reliability, safeguards the quality of the power supply, and supports energy security in the face of natural disasters and other emergencies. Improving the accuracy of power load forecasting is therefore a primary focus of ongoing research [1,2]. However, because power loads exhibit complicated nonlinear and time-varying features under a multitude of external influences, short-term power load forecasting is a difficult undertaking, and this intricacy makes developing and applying forecasting models more challenging.
Several families of techniques are available today for forecasting power load, including traditional statistics, machine learning, and deep learning. Traditional statistical methods mainly include multiple linear regression [3], exponential smoothing [4], the autoregressive (AR) model [5], and the autoregressive moving average (ARMA) model [6]. Although traditional statistical methods are highly interpretable, they require large amounts of high-quality data. Because they also do not handle complex and variable patterns well, machine learning techniques have been used extensively for power load forecasting. These include the support vector machine (SVM) [7], artificial neural network [8], decision tree [9], and random forest [10]. Machine learning techniques can handle large-scale datasets; however, for nonlinear problems or noisy datasets [11,12], they may produce unstable or erroneous predictions. Deep learning is increasingly used for power load forecasting in place of these methods. Deep models such as convolutional neural networks (CNNs) [13], recurrent neural networks (RNNs) [14], and self-attention networks [15] exhibit better expressive and generalization abilities when working with large-scale and complicated datasets. Long short-term memory (LSTM) [16] is a recurrent architecture that can learn long-term dependencies, which makes it better suited to real-world problems. By using a gating mechanism, the gated recurrent unit (GRU) [17] mitigates the vanishing gradient problem and enhances the network’s capacity to track long-term relationships.
Long-term dependencies, however, remain challenging for conventional sequence models. The Transformer, which is based on the self-attention mechanism [18], can directly establish dependencies between arbitrary positions in a sequence, which is preferable when dealing with long-term dependencies and helps to overcome some of the shortcomings of classic sequence models. The embedding methods of earlier Transformer-based models for multivariate time series (MTS) prediction mainly capture cross-time dependencies [19,20], while cross-dimensional dependencies are not explicitly captured during embedding, which limits their predictive ability. Zhang et al. [21] proposed Crossformer, a Transformer model for MTS prediction that exploits cross-dimensional dependencies. Compared with existing Transformer-based models, Crossformer explicitly explores and utilizes the dependencies between different variables, which effectively improves the prediction ability.
For complex power load sequences, it is difficult to achieve high accuracy with a single prediction approach, and thus, combined models that integrate multiple methods have been widely used. Han et al. [22] developed a prediction model (EMDIA) that combines empirical mode decomposition (EMD), isometric mapping (Isomap), and Adaboost: after EMD, Isomap extracts the key feature sequences and key influencing factors, and Adaboost produces the prediction. The experimental results show that, compared with Adaboost alone, the EMDIA combined model exhibits superior prediction performance on the Hong Kong total electricity consumption dataset, with a reduction of 11.58% in the MAE, a reduction of 0.13% in the MAPE, a reduction of 49.93 in the RMSE, and an improvement of 0.04 in R2. However, EMD suffers from high computational complexity and sensitivity to feature alignment and outliers. Unlike EMD, variational mode decomposition (VMD) is designed to suppress modal aliasing: it reduces the mutual interference between different modal functions by introducing regularization terms and variational optimization, which makes the decomposition results more accurate [23]. Moreover, the original power load data are relatively complex, so inputting them directly into the prediction network yields low prediction accuracy, whereas VMD can decompose the complex power load data into simpler modal components. The submodes contain different features and serve as part of the input to the prediction network, which helps to improve the prediction accuracy. Nevertheless, the number of decomposition layers for VMD must be predefined, and selecting an incorrect number can easily lead to problems such as an incomplete decomposition or redundant components. Zhu et al. [24] proposed an improved VMD technique that adaptively adjusts the number of VMD modes according to the Pearson correlation coefficient (PCC) between each submode and the original data.
Wang et al. [25] used the VMD-CISSA-LSSVM model for prediction. First, VMD is used to process the data. Next, the chaotic sparrow search algorithm (CISSA) is used to find the optimal parameters of the least squares support vector machine (LSSVM). Finally, the LSSVM is used to construct the forecast. This approach employs VMD, which successfully circumvents the modal aliasing problem associated with EMD. The optimization algorithm determines the prediction network parameters, which improves the speed and accuracy of model training; however, only one dataset is used, so the generalization ability of the model is not fully assessed. The model was validated on a dataset from a region in Shandong, where the mean values of the MSE, MAPE, and MAE were 11.9133, 0.7512, and 67.9861, respectively. Zhuang et al. [26] verified the prediction performance of the VMD-IWOA-LSTM model, in which the original data are first decomposed by VMD, the decomposed components are reconstructed according to their Pearson correlation coefficient (PCC) similarity, and the reconstructed data are then fed into an LSTM network whose parameters are optimized by IWOA. Reconstructing the data before inputting them into the prediction network, rather than inputting them directly, is conducive to improving the prediction performance. Moreover, three evaluation indexes are used to assess the model, which reflect the prediction accuracy from different perspectives. For a tropical region dataset, the model had an MAPE, MAE, and RMSE of 0.6201, 39.9173, and 52.2262 during the rainy season and 0.3492, 20.7807, and 20.7807 during the dry season. The comparison and analysis of the literature mentioned above is shown in Table 1.
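For reference, the error metrics quoted in the studies above have standard definitions, sketched here in plain Python. The function names are illustrative, and whether each cited study reports the MAPE as a percentage or as a fraction is an assumption that should be checked against the original papers; this sketch returns a percentage.

```python
from math import sqrt
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in the same units as the load."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (assumed convention).

    Undefined when any actual value is zero.
    """
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, in the same units as the load."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because the MAE and RMSE carry the units of the load while the MAPE is relative, values from different datasets are only directly comparable through the MAPE.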
Based on the above analysis, this paper presents a VMD-Crossformer short-term power load forecasting model. First, a strategy for optimizing the VMD parameters is used, which automatically adjusts the number of modes based on the PCC. VMD then decomposes the original signal into modal components, which are combined with the original load data to form the reconstructed data. Finally, the reconstructed data are fed into the Crossformer network to obtain the final forecast. In particular, this paper makes the following contributions:
- (1) Adjusting the VMD parameter K with the help of the PCC makes it easier to find the ideal number of decomposition layers. This avoids problems caused by too many or too few VMD layers.
- (2) The original data are decomposed by VMD. When the submodal data are combined with the original data, they act as data augmentation that highlights features of the original data. The network can then be trained with this reconstructed dataset to improve its ability to grasp the relationships between the variables.
- (3) The Crossformer network, designed for MTS prediction, has a strong prediction capability. Its dimension-segment-wise (DSW) embedding mainly exploits cross-dimensional dependencies, and its two-stage attention (TSA) layer applies attention in a cross-time stage and a cross-dimension stage to capture both kinds of dependencies. The hierarchical encoder-decoder (HED) structure integrates information from different time scales for the final prediction.
- (4) The proposed VMD-Crossformer prediction model combines the strengths of each module. VMD allows the submodal data to be used as data augmentation, while the Crossformer network captures the relationships between the submodes and the original data across time and dimension. Compared with other models on two datasets, the proposed model achieved a higher prediction accuracy.
The rest of the paper is structured as follows: Section 2 describes the components of the model theoretically; Section 3 describes the structure of the power load forecasting model; Section 4 tests the model on two datasets; and Section 5 concludes and gives an outlook on the model proposed in this paper.
  3. Construction of the VMD-Crossformer Prediction Model
In this paper, power load forecasting is performed with the VMD-Crossformer model, whose framework is depicted in Figure 4.
In data preprocessing, this study adopted an improved VMD method to decompose the original signal. First, the number of decomposition layers K was initialized to K = 2. Second, the power load data were decomposed by VMD, and K modal components were obtained. Next, we calculated the correlation coefficient p between the original data and each modal component. Finally, we determined whether the smallest |p| was lower than 0.2; if not, we set K = K + 1, performed VMD, and calculated the correlation coefficients again; if yes, the current number of decomposition layers was taken as the optimal K. After determining K, VMD decomposed the raw load data into multiple submodes, each of which captured a different frequency component carrying part of the characteristics and information of the data. The submodes served as data augmentation: the original data were combined with each submode to form the reconstructed data, and inputting the reconstructed data into the prediction network could improve the prediction accuracy.
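The selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decomposition routine is passed in as a callable `decompose(signal, k)` so that any VMD implementation can be plugged in, the names `select_k` and `reconstruct` are hypothetical, the cap `k_max` is an added safety limit that the text does not mention, and stacking the original series and the modes as feature columns is one plausible reading of the data reconstruction step.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Series = List[float]
Decomposer = Callable[[Sequence[float], int], List[Series]]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_k(
    signal: Sequence[float],
    decompose: Decomposer,
    k_init: int = 2,
    threshold: float = 0.2,
    k_max: int = 15,  # assumed safety cap, not part of the described procedure
) -> Tuple[int, List[Series]]:
    """Increase K until the weakest mode correlates with the signal below threshold."""
    k = k_init
    while True:
        modes = decompose(signal, k)
        smallest = min(abs(pearson(signal, mode)) for mode in modes)
        # "Yes" branch of the text: smallest |p| < 0.2, so the current K is kept.
        if smallest < threshold or k >= k_max:
            return k, modes
        k += 1  # "No" branch: K = K + 1 and decompose again.


def reconstruct(signal: Sequence[float], modes: List[Series]) -> List[List[float]]:
    """Stack the original load and the K modes as columns: T rows, K + 1 features."""
    return [[signal[t]] + [mode[t] for mode in modes] for t in range(len(signal))]
```

In practice, `decompose` would wrap a real VMD routine (for example, one from a package such as vmdpy) and return the K modes as equal-length lists; the resulting T x (K + 1) array is what would be passed to the prediction network.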
In the prediction module, this study input the reconstructed data into the Crossformer network. First, DSW embedding processed the reconstructed data by splitting the sequence in each dimension into several segments and embedding each segment into a feature vector, which retained the time and dimension information. Next, a TSA layer captured the cross-time and cross-dimensional dependencies of the embedded array. Crossformer then used the DSW embedding and TSA layers to build the HED, which predicts using information from different scales. In the HED, each layer corresponded to a scale: an upper encoder layer merged the neighboring segments output by the layer below it to capture dependencies at a coarser scale. Finally, the decoder layers generated predictions at different scales, and these were summed to give the final power load prediction.
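To make the segmentation step concrete, the following plain-Python sketch splits a T x D array into per-dimension segments, projects one segment to a feature vector, and sums forecasts from several scales. It is only an illustration under assumptions: the function names are hypothetical, the projection weights are supplied by the caller rather than learned, position embeddings and the TSA layers are omitted, and the sequence length is assumed to be a multiple of the segment length.

```python
from typing import List, Sequence


def dsw_segments(data: Sequence[Sequence[float]], seg_len: int) -> List[List[List[float]]]:
    """Split each dimension of a T x D array into T // seg_len segments.

    The result is indexed as [dimension][segment][step].
    """
    t = len(data)
    if t == 0 or t % seg_len != 0:
        raise ValueError("sequence length must be a positive multiple of seg_len")
    d = len(data[0])
    return [
        [[data[s * seg_len + j][dim] for j in range(seg_len)]
         for s in range(t // seg_len)]
        for dim in range(d)
    ]


def embed_segment(segment: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Linearly project one segment; weights has shape d_model x seg_len."""
    return [sum(w * x for w, x in zip(row, segment)) for row in weights]


def sum_scale_predictions(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Add the forecasts produced at each scale element-wise."""
    return [sum(values) for values in zip(*predictions)]
```

Keeping the dimension and segment indices separate is what lets a later attention stage relate segments along time within one dimension and across dimensions at one time position.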
In parameter tuning, the segment length, the number of attention heads, the number of routers in the cross-dimension stage, and the batch size were all important parameters of the model in this study. Choosing an appropriate segment length can improve model performance and computational efficiency: a longer segment provides more information and helps the model learn long-range dependencies and semantic information, whereas an overly long segment may reduce the efficiency of training and inference and may even trigger vanishing or exploding gradients. Choosing an appropriate number of attention heads can improve the generalization ability of the model: more heads help the model learn richer and more complex information, improving generalization and prediction accuracy, whereas too many heads may lead to excessive computation and slow down training and inference. Choosing an appropriate number of routers can improve the information integration ability of the model: more routers may increase its representation ability so that it better captures the complex relationships in the data, whereas too many routers may cause overfitting to the training data, increase the computational cost, and make the model difficult to train. Choosing the right batch size can improve the training speed and convergence of the model: a larger batch size can accelerate training and improve convergence speed and generalization ability, whereas a smaller batch size helps the model learn the detailed information of the data.
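One simple way to explore these four parameters together is an exhaustive grid, sketched below. Everything specific here is an assumption for illustration: the key names, the candidate values (none of which come from this paper), and the filter that keeps only segment lengths dividing the input length evenly.

```python
from itertools import product
from typing import Dict, Iterator, List


def candidate_configs(input_len: int, grid: Dict[str, List[int]]) -> Iterator[Dict[str, int]]:
    """Yield every grid combination whose segment length divides input_len."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        if input_len % config["seg_len"] == 0:
            yield config


# Placeholder search space; these values are not taken from the paper.
EXAMPLE_GRID = {
    "seg_len": [6, 12, 24],
    "n_heads": [2, 4],
    "n_routers": [5, 10],
    "batch_size": [32, 64],
}
```

Each yielded configuration would then be trained and scored on a validation set; because the grid grows multiplicatively with every added parameter, a random or optimizer-driven search may be preferable for larger spaces.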
  5. Conclusions
The power system is subject to numerous influencing factors, so the load data exhibit randomness, variability, and nonlinearity. In this paper, a VMD-Crossformer forecasting model is proposed to improve the accuracy of power load forecasting. First, an improved optimization method for the VMD parameters is employed to adaptively adjust the number of VMD modes based on the PCC. Next, the original signal is decomposed into modal components using VMD, and these components are combined with the original load data to create the reconstructed data. Finally, the reconstructed data are input into the Crossformer network for prediction. The experimental study led to the following conclusions:
- (1) The optimal number of decomposition layers for the original signal was found using the PCC-based VMD parameter optimization approach. This method helped to prevent problems such as under-decomposition and over-decomposition of the signal caused by setting an inappropriate number of VMD modes.
- (2) The complex power load data were broken down into relatively simple submodal components using the VMD algorithm. Each modal component reflects the characteristics of the original signal in a particular frequency range; combining the modal components with the original signal and inputting the reconstructed data into the prediction network can greatly improve the prediction accuracy.
- (3) The Crossformer network utilized cross-dimensional dependencies and information at different scales to capture the relationships in the data more comprehensively and to predict the power load more accurately.
- (4) On the GEFCom2014 dataset and the load dataset of the Belgium Power Grid Company, the prediction based on VMD-Crossformer showed a higher prediction accuracy and better performance than the other models.
The VMD-Crossformer forecasting model proposed in this paper was applied to power load forecasting and can serve as a reference for practical applications. The performance was validated on two widely recognized datasets using three evaluation indexes (MAE, MAPE, and RMSE), and the experiments showed that the proposed model had the highest prediction accuracy among the compared models. However, the method still has shortcomings: no optimization algorithm is used to find the optimal network parameters, and the training speed and model performance can be further improved.