Short-Term Power Load Forecasting Using a VMD-Crossformer Model

There are several complex and unpredictable factors that affect the power grid. To make short-term power load forecasting more accurate, this paper proposes a short-term power load forecasting model based on the VMD-Crossformer. First, the ideal number of decomposition layers was determined using a variational mode decomposition (VMD) parameter optimization approach based on the Pearson correlation coefficient (PCC). Second, the original data were decomposed into multiple modal components using VMD, and the modal components were then combined with the original data to form the reconstructed data. Finally, the reconstructed data were input into the Crossformer network, which exploits cross-dimensional dependence for multivariate time series (MTS) prediction; that is, the dimension-segment-wise (DSW) embedding and the two-stage attention (TSA) layer were used to build a hierarchical encoder-decoder (HED), and the final prediction was made using information from different scales. The experimental results show that the method predicted the electricity load with high accuracy and reliability. The MAE, MAPE, and RMSE were 61.532 MW, 1.841%, and 84.486 MW, respectively, for dataset I, and 68.906 MW, 0.847%, and 89.209 MW, respectively, for dataset II. The proposed model achieved lower errors than the models it was compared with.


Introduction
With the growing global demand for electricity and the continuous improvement of power systems, short-term power load forecasting is becoming increasingly important for optimizing energy distribution. By accurately forecasting future power demand, power companies can adjust energy supply efficiently, economically, and in an environmentally sound way, ensuring a balance between supply and demand and avoiding energy waste, while minimizing cost and environmental impact through the intelligent dispatch of all types of power generation resources. In addition, load forecasting enhances grid stability and reliability, ensures the quality of power supply, and supports energy security in the face of natural disasters and other emergencies. Enhancing the power load forecasting accuracy is therefore a primary focus of ongoing research [1,2]. Because power loads exhibit complicated nonlinear and time-varying features due to a multitude of external influences, short-term power load forecasting is a difficult undertaking. This intricacy makes developing and utilizing forecasting models more challenging.
There are several techniques available today for forecasting power load, including traditional statistics, machine learning, and deep learning. Traditional statistical methods mainly include the multiple linear regression method [3], exponential smoothing method [4], autoregressive model (AR) [5], and autoregressive moving average (ARMA) [6]. Although traditional statistical methods are highly interpretable, they require large amounts of high-quality data. Because traditional statistical methods do not handle complex and variable patterns well, machine learning techniques are used extensively in power load forecasting. These include the support vector machine (SVM) [7], artificial neural network [8], decision tree method [9], and random forest method [10]. Machine learning techniques can handle large-scale datasets; however, for nonlinear problems or noisy datasets [11,12], they may produce unstable or erroneous predictions. Deep learning is increasingly used for power load forecasting in place of traditional methods. When working with large-scale and complicated datasets, deep models such as convolutional neural networks (CNNs) [13], recurrent neural networks (RNNs) [14], and self-attention networks [15] exhibit better expression and generalization abilities. Long short-term memory (LSTM) [16], a type of recurrent neural network, can learn long-term dependencies and is better suited to real-world problems. By adding a gating mechanism, the gated recurrent unit (GRU) [17] mitigates the vanishing gradient problem and enhances the network's capacity to track long-term relationships.
Long-term dependencies, however, remain challenging for conventional sequence models. The Transformer, which is based on the self-attention mechanism [18], can directly establish dependencies between arbitrary positions, which is preferable when dealing with long-term dependencies and helps to overcome some of the shortcomings of classic sequence models. The embedding methods of earlier Transformer-based models for MTS prediction mainly capture cross-time dependencies [19,20], while cross-dimensional dependencies are not explicitly captured during embedding, which limits their predictive ability. Zhang et al. [21] proposed a Transformer model called Crossformer that uses cross-dimensional dependence for MTS prediction. Compared with existing Transformer-based models, this model explicitly explores and utilizes the dependence between different variables, which effectively improves the prediction ability.
For complex power load sequences, it is difficult to achieve high accuracy using a single prediction approach, and thus, combined models of multiple methods have been widely used. Han et al. [22] developed a prediction model (EMDIA) that combines empirical mode decomposition (EMD), isometric mapping (Isomap), and Adaboost; it uses Isomap to extract the key feature sequences and key influencing factors and Adaboost for prediction, and it achieves higher prediction accuracy than a single model. The experimental results show that, compared with Adaboost alone, the EMDIA combined model exhibits superior prediction performance on the Hong Kong total electricity consumption dataset, with a reduction of 11.58% in the MAE, a reduction of 0.13% in the MAPE, a reduction of 49.93 in the RMSE, and an improvement of 0.04 in $R^2$. However, EMD suffers from high computational complexity and sensitivity to feature alignment and outliers. Unlike EMD, VMD is designed to suppress modal aliasing and reduces the mutual interference between different modal functions by introducing regularization terms and variational optimization, which makes the decomposition results more accurate [23]. Moreover, the original power load data are relatively complex, and inputting them directly into the prediction network yields low prediction accuracy; VMD, however, can decompose the complex power load data into simpler modal components. The submodalities contain different features, and using them as inputs to the prediction network helps to improve the prediction accuracy. Nevertheless, the number of decomposition layers for VMD must be predefined, and selecting an incorrect number can easily lead to problems such as incomplete or redundant decomposition. Zhu et al. [24] proposed an improved VMD technique that adaptively adjusts the number of VMD modes based on the PCC between each submode and the initial data, so that the number of modes follows the signal characteristics.
Wang et al. [25] used the VMD-CISSA-LSSVM model for prediction. First, VMD is used to process the data. Next, the chaotic sparrow search algorithm (CISSA) is used to find the optimal parameters of the least squares support vector machine (LSSVM). Finally, the LSSVM is used to produce the forecast. This approach employs VMD, which successfully circumvents the modal aliasing problem associated with EMD. The optimization algorithm is used to determine the prediction network parameters, which improves the speed and accuracy of the model training, but only one dataset is used, so the generalization ability of the model is not fully assessed. The model was validated on a dataset from a region in Shandong, where the mean values of the MSE, MAPE, and MAE were 11.9133, 0.7512, and 67.9861, respectively. Zhuang et al. [26] verified the prediction performance of the VMD-IWOA-LSTM model, in which the original data are first decomposed by VMD, the decomposed data are reconstructed according to their Pearson correlation coefficient (PCC) similarity, and the reconstructed data are then fed into an LSTM network whose parameters are optimized by the IWOA, which effectively improves the prediction accuracy. The method does not input the data directly into the prediction network but reconstructs the data first, which is conducive to improving the model prediction performance. Moreover, three evaluation indexes are used to assess the model performance, which reflects the accuracy of the model prediction from different perspectives. The results show that the model had an MAPE, MAE, and RMSE of 0.6201, 39.9173, and 52.2262 during the rainy season and 0.3492, 20.7807, and 20.7807 during the dry season for a tropical region dataset. The comparison and analysis of the literature mentioned above is shown in Table 1.
Based on the analysis above, a VMD-Crossformer short-term power load forecasting model is presented in this paper. First, a strategy for optimizing the VMD parameters was used, which automatically adjusted the number of modes based on the PCC. Following this, the VMD decomposed the initial signal into various modal components, which were then combined with the original load information to form the reconstructed data; finally, the reconstructed data were fed into the Crossformer network to obtain the final forecast. In particular, the following contributions are made by this paper:
(1) Finding the ideal number of decomposition layers is made easier by adjusting the VMD parameter $K$ with the help of the PCC. This avoids problems with too many or too few VMD layers.
(2) The original data are decomposed by the VMD. The submodal data are combined with the original data and act as data augmentation that highlights the features of the original data. The network can then be trained with this new set of data to improve its ability to grasp the relationships between the variables.
(3) The Crossformer network for MTS prediction has a strong prediction capability. Within it, the DSW embedding mainly exploits cross-dimensional dependencies, and the TSA layer applies attention in a cross-time stage and a cross-dimensional stage to capture both kinds of dependencies. The HED structure integrates information from different time scales for the final prediction.
(4) The proposed VMD-Crossformer prediction model combines the strengths of each module. The VMD allows the submodal data to be used as data augmentation, while the Crossformer network captures the relationship between the submodalities and the original data in time and dimension. On two datasets, this model had a higher prediction accuracy than the models it was compared with.
The rest of the paper is structured as follows: Section 2 describes the components of the model theoretically; Section 3 describes the structure of the electricity load forecasting; Section 4 tests the model on two datasets; and Section 5 concludes and gives an outlook on the model proposed in this paper.

Improved Variational Modal Decomposition
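The VMD algorithm is an adaptive signal decomposition method [23,27], which can automatically adjust the decomposition parameters according to the characteristics of the signal so as to obtain more accurate modal components. The original signal $f(t)$ to be processed is divided into $K$ variational modal components, each of which has a different center frequency and bandwidth, such that the sum of the estimated bandwidths of all modal components is minimized. The constrained variational problem is stated as follows:
$$\min_{\{u_k\},\{\omega_k\}} \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t)$$
where $u_k$ denotes the $K$ decomposed modal components, $\omega_k$ represents the center frequency of each mode, "$*$" denotes the convolution operation, $\partial_t$ is the time derivative of the function, and $\delta(t)$ is the unit pulse function.
The constrained problem is converted into an unconstrained optimization problem by adding a quadratic penalty factor $\alpha$ and a Lagrange multiplier $\lambda(t)$. The constraint is enforced with $\lambda$, while $\alpha$ balances the smoothness and the precise reconstruction of the signal. The augmented Lagrangian is
$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle$$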
The saddle point of the augmented Lagrangian is found by iteratively updating $u_k^{n+1}$, $\omega_k^{n+1}$, and $\lambda^{n+1}$ using the alternating direction method of multipliers. The iterative process is as follows:
$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha(\omega - \omega_k)^2} \tag{4}$$
$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega} \tag{5}$$
$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left[ \hat{f}(\omega) - \sum_{k=1}^{K} \hat{u}_k^{n+1}(\omega) \right] \tag{6}$$
where $\hat{u}_k(\omega)$, $\hat{f}(\omega)$, and $\hat{\lambda}(\omega)$ are the Fourier transforms of $u_k(t)$, $f(t)$, and $\lambda(t)$, respectively; $n$ is the number of iterations; and $\tau$ is the iteration step size. The process of Equations (4)-(6) is combined with the following stopping condition:
$$\sum_{k=1}^{K} \frac{\left\| \hat{u}_k^{n+1} - \hat{u}_k^{n} \right\|_2^2}{\left\| \hat{u}_k^{n} \right\|_2^2} < \varepsilon \tag{7}$$
where $\varepsilon$ is the convergence accuracy. If the stopping condition in Equation (7) is not satisfied, $n$ is raised to $n + 1$ and the process continues until either the maximum number of iterations is reached or the stopping condition is satisfied; finally, $K$ modal components with independent center frequencies and finite bandwidths are obtained.
An inaccurate setting of the number of decomposition layers $K$ significantly affects the signal's decomposition quality. To address this issue, the VMD parameter $K$ is adaptively adjusted using the PCC between each submodality and the original data. The PCC measures the linear correlation between two sets of data and is frequently employed in fault diagnostics [28], load forecasting [29], and other fields. The calculation formula is
$$\rho = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
where the variables $x$ and $y$ represent the original data and a modal component, respectively, $N$ is the number of samples, and $\bar{x}$ and $\bar{y}$ are the mean values of the items in $x$ and $y$, respectively. Table 2 displays the relevant measures [30].
After performing the VMD on the original data, $K$ modal components are obtained, and the PCC between the original signal and each component is computed to quantify their relationship. A PCC with an absolute value less than 0.2 indicates little to no relationship between the two variables. Conversely, if the absolute values for all components exceed 0.2, every modal component is valid and the signal remains under-decomposed; in this case, $K$ must be increased. When the smallest correlation coefficient falls below 0.2 for the first time, the important information in the signal is considered to have been sufficiently decomposed, and the $K$ at this point is taken as the optimal number of decomposition layers. In this way, the method determines the number of layers needed to capture all the crucial information in the signal.
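The adaptive choice of the decomposition level can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `decompose` is a hypothetical callable standing in for any VMD routine that returns a list of modes for a given signal and mode count, and `k_max` is an assumed safety cap that the text does not mention. The 0.2 threshold and the starting value of two follow the procedure used in this paper.

```python
from math import sqrt
from typing import Callable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant sequence has no defined correlation; treat it as uncorrelated.
    return cov / (std_x * std_y) if std_x > 0 and std_y > 0 else 0.0


def select_num_modes(
    signal: Sequence[float],
    decompose: Callable[[Sequence[float], int], List[Sequence[float]]],
    threshold: float = 0.2,
    k_start: int = 2,
    k_max: int = 12,  # assumed cap, not part of the described procedure
) -> int:
    """Increase K until the weakest mode's |PCC| with the signal drops below threshold."""
    for k in range(k_start, k_max + 1):
        modes = decompose(signal, k)
        if min(abs(pearson(signal, mode)) for mode in modes) < threshold:
            return k
    return k_max
```

In practice `decompose` would wrap a real VMD implementation with the penalty factor and tolerance fixed, so that only the number of modes changes between calls.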

Crossformer Network
Crossformer is a Transformer model that exploits cross-dimensional dependencies for multivariate time series (MTS) forecasting. It is one of the few Transformer models that explicitly explores and exploits cross-dimensional dependencies in MTS forecasting. First, the MTS is embedded using the dimension-segment-wise (DSW) method so that cross-dimensional dependence can be exploited. Second, the two-stage attention (TSA) layer captures the dependencies between embedded segments and leverages both cross-time and cross-dimensional dependencies efficiently. Finally, the DSW embedding and the TSA layer are used to construct a hierarchical encoder-decoder (HED) that captures information at different scales for the final prediction.

Dimensionally Segmented Embedding
Earlier Transformer-based models did not explicitly capture cross-dimensional dependencies in the embedding process, so their prediction capability was not fully utilized. DSW embedding mainly captures cross-dimensional dependencies, which helps the model to better learn the intrinsic structure of multivariate time series data and provides efficient inputs to the Crossformer network, enabling the model to process complex data more efficiently.
In earlier Transformer-based models, each embedding vector carries the information of a single time step. In the attention of the original Transformer applied to MTS prediction, neighboring data points tend to have similar attention weights. For an MTS, the values at individual steps provide little information, so the embedding vectors here represent continuous segments of individual dimensions rather than the values of all dimensions at a single step. As shown in Figure 1, in DSW embedding, nearby points along the time axis of each dimension are grouped into segments of equal length, and each segment is embedded into a vector using a linear projection, to which the positional embedding is added, producing a two-dimensional array of vectors.
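The segmentation and projection steps can be illustrated with a small pure-Python sketch. It is a simplified reading of the description above, not the Crossformer source code: the projection weights, bias, and positional embeddings are passed in as plain lists (in the real model they are learned), and the function name `dsw_embed` is chosen here for illustration only.

```python
from typing import List

Vector = List[float]


def dsw_embed(
    series: List[Vector],           # series[d][t]: value of dimension d at time t
    seg_len: int,                   # segment length along the time axis
    weight: List[Vector],           # d_model x seg_len projection matrix (assumed given)
    bias: Vector,                   # d_model bias vector (assumed given)
    pos: List[List[Vector]],        # pos[d][i]: positional embedding of segment i in dimension d
) -> List[List[Vector]]:
    """Return a 2D array h[d][i] of d_model-sized vectors, one per (dimension, segment)."""
    if any(len(row) % seg_len for row in series):
        raise ValueError("series length must be a multiple of seg_len")
    embedded = []
    for d, row in enumerate(series):
        embedded_row = []
        for i in range(len(row) // seg_len):
            segment = row[i * seg_len:(i + 1) * seg_len]
            # Linear projection of the segment: W @ segment + b
            vec = [sum(w * x for w, x in zip(w_row, segment)) + b
                   for w_row, b in zip(weight, bias)]
            # Add the positional embedding for this (dimension, segment) slot
            embedded_row.append([v + p for v, p in zip(vec, pos[d][i])])
        embedded.append(embedded_row)
    return embedded
```

With an input of 96 steps and a segment length of eight, each dimension would yield 12 embedded segments.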

Two-Stage Attention Layer
An MTS has time and dimension axes with different meanings, and the temporal and dimensional dependencies within the 2D vector array are captured by the TSA layer, as shown in Figure 2, where $D$ is the number of dimensions, $L$ is the number of segments, $d$ ($1 \le d \le D$) indexes the dimensions, $i$ ($1 \le i \le L$) indexes the time steps, and LayerNorm denotes layer normalization.
In the cross-time stage, the 2D array is used as the input to the TSA layer, and the dependencies between time segments of the same dimension are captured. Multi-head self-attention (MSA) is applied directly to each dimension, followed by a multilayer perceptron (MLP), to enable the model to better understand the temporal structure in the time series data. The MLP in this work uses two layers, and its output, which captures the dependencies between time segments of the same dimension, is fed into the cross-dimensional stage. In the cross-dimensional stage, a router mechanism establishes all-to-all connectivity between the dimensions. Full connectivity captures the cross-dimensional dependencies but also introduces noise; for high-dimensional datasets with sparse attributes, exploiting sparsity to reduce the noise can also improve the computational efficiency of the TSA layer. For each time step, a fixed number of learnable vectors are specified as routers. The routers are first used as queries in the MSA, with the vectors of all dimensions as keys and values, in order to aggregate the information from all dimensions. The routers then distribute the received information to the dimensions, using the dimension vectors as queries and the aggregated information as keys and values. In this way, a complete connection between the dimensions is established, and the output of the router mechanism is obtained.
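The two attention passes of the router mechanism can be sketched as follows for a single time step. This is a deliberately reduced illustration: it uses single-head scaled dot-product attention without learned query, key, or value projections, residual connections, or normalization, all of which a real TSA layer would include. The function names are chosen here for illustration.

```python
from math import exp, sqrt
from typing import List

Vector = List[float]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)                       # subtract the max for numerical stability
        weights = [exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs


def router_stage(dim_vectors: List[Vector], routers: List[Vector]) -> List[Vector]:
    """Cross-dimension pass for one time step: D vectors -> c routers -> D vectors."""
    # Step 1: routers act as queries and gather information from all dimensions.
    aggregated = attention(routers, dim_vectors, dim_vectors)
    # Step 2: each dimension queries the routers to receive the aggregated information.
    return attention(dim_vectors, aggregated, aggregated)
```

Because every dimension only attends to the small set of routers, the number of attention scores per time step grows with the product of the router count and the dimension count rather than with the square of the dimension count.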

Hierarchical Encoder-Decoder
Crossformer adopts a hierarchical encoder-decoder architecture, where the HED consists of segment merging, TSA layers, and DSW embedding, as shown in Figure 3. Using the HED architecture, Crossformer is able to capture and utilize information from different layers to better process data with complex dependencies and time series features. The encoder uses TSA layers and segment merging to capture dependencies at different scales: the vectors in the upper layers cover a wider range, and therefore, the dependencies are at a coarser scale. The decoder explores the different scales and produces the final prediction by making a prediction at each scale and summing them.
In an $N$-layer encoder, each layer except the first one merges every two neighboring vectors in the time domain to obtain a coarser level of representation and then applies a TSA layer to capture the dependencies at this level. The main job of the encoder layer is to extract features from the input data and convert them into a form that can be used for subsequent processing.
In the decoder, the $N + 1$ feature arrays output by the encoder are available, and $N + 1$ decoder layers are used for prediction. The design of the decoder layer takes into account the hierarchical structure of the data and the feature dependencies. The prediction result of each layer is obtained by linear projection. The final prediction is obtained by summing the prediction results of all layers.
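The two structural operations of the HED, merging neighboring segments and summing per-scale predictions, can be sketched as below. This is a simplified illustration under stated assumptions: merging is shown as plain concatenation of adjacent vectors (the actual model would follow it with a learned projection), an even number of segments is assumed, and the per-scale predictions are taken as given lists of equal length.

```python
from typing import List

Vector = List[float]


def merge_segments(vectors: List[Vector]) -> List[Vector]:
    """Concatenate every two neighboring vectors on the time axis (coarser scale)."""
    if len(vectors) % 2:
        raise ValueError("this sketch assumes an even number of segments")
    return [vectors[i] + vectors[i + 1] for i in range(0, len(vectors), 2)]


def sum_predictions(per_scale: List[Vector]) -> Vector:
    """Final forecast: element-wise sum of the predictions made at each scale."""
    horizon = len(per_scale[0])
    if any(len(p) != horizon for p in per_scale):
        raise ValueError("all scales must predict the same horizon")
    return [sum(p[t] for p in per_scale) for t in range(horizon)]
```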

Construction of the VMD-Crossformer Prediction Model
Power load forecasting in this paper was done using the VMD-Crossformer model, whose framework is depicted in Figure 4. In data preprocessing, this study adopted an improved VMD method to decompose the original signal. First, the number of decomposition layers $K$ was initialized to $K = 2$. Second, the power load data were decomposed by VMD and $K$ modal components were obtained. Next, the correlation coefficient $\rho$ between the original data and each modal component was calculated. Finally, it was determined whether the smallest $|\rho|$ was lower than 0.2; if not, $K$ was set to $K + 1$, and the VMD and the correlation calculation were performed again; if so, the current number of decomposition layers was taken as the optimal $K$. After determining $K$, the VMD decomposed the raw load data into multiple submodalities, each of which captured a different frequency component carrying important characteristics and information of the data. Using the submodalities as data augmentation, that is, combining the original data with each submodal component into reconstructed data and inputting the reconstructed data into the prediction network, could improve the prediction accuracy.
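One plausible reading of the data reconstruction step is that the original load and its modal components are stacked into a single multivariate series, so that the Crossformer sees the load and each mode as separate dimensions. The sketch below follows that reading. The column order (load first, then the modes) and the function name `reconstruct` are assumptions made here for illustration.

```python
from typing import List, Sequence


def reconstruct(load: Sequence[float], modes: List[Sequence[float]]) -> List[List[float]]:
    """Stack the original load with its K modes: one row per time step, K + 1 columns."""
    if any(len(mode) != len(load) for mode in modes):
        raise ValueError("every mode must have the same length as the load series")
    # Column 0 holds the original load (assumed layout); columns 1..K hold the modes.
    return [[value, *(mode[t] for mode in modes)] for t, value in enumerate(load)]
```

With four modes, each time step of the reconstructed data would then carry five values.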
In the prediction module, this study input the reconstructed data into the Crossformer network. First, DSW embedding was used to process the reconstructed data by splitting the sequence in each dimension into several segments and embedding them into feature vectors to retain the time and dimension information. Next, a TSA layer was used to capture the cross-time and cross-dimensional dependencies of the embedded arrays. Crossformer then employed DSW embeddings and TSA layers to build the HED for prediction using information from different scales. In the HED, each layer corresponded to a scale. The upper layer of the encoder merged the neighboring segments output by the lower layer to capture the dependencies at a coarser scale. Finally, the decoder layers generated predictions at different scales and summed them as the final power load prediction output.
In the process of parameter tuning, the segment length, the number of attention heads, the number of routers in the cross-dimensional stage, and the batch size were all important parameters of the model in this study. Choosing an appropriate segment length can improve the model performance and computational efficiency: a longer segment provides more information and helps the model learn long-range dependencies and semantic information, while a segment that is too long may reduce the efficiency of training and inference and even trigger vanishing or exploding gradients. Choosing an appropriate number of attention heads can improve the generalization ability of the model: more attention heads help the model learn richer and more complex information, improving its generalization ability and prediction accuracy, while too many attention heads may lead to excessive computation, affecting the efficiency of training and inference. Choosing an appropriate number of routers can improve the information integration ability of the model: more routers may increase the representation ability of the model so that it can better capture the complex relationships in the data, while too many routers may lead to overfitting of the training data, increase the computational cost, and make the model difficult to train. Choosing the right batch size can improve the training speed and convergence of the model: a larger batch size can accelerate the training process and improve the convergence speed and generalization ability, while a smaller batch size helps the model better learn the detailed information of the data.

Experimental Datasets
Two datasets were used in this experiment, both of which used the past 96 sampling points to predict the future electrical load values at 24 sampling points.
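The input and output windows can be produced with a simple sliding-window routine. This is a sketch under assumptions: the stride of one step and the choice of column 0 as the load to be predicted are not stated in the text, and `make_windows` is a name chosen here for illustration. The lengths 96 and 24 are those used in this experiment.

```python
from typing import List, Tuple

Row = List[float]


def make_windows(
    rows: List[Row],
    in_len: int = 96,      # past sampling points fed to the model
    out_len: int = 24,     # future sampling points to predict
    target_col: int = 0,   # assumed position of the load column
    stride: int = 1,       # assumed step between consecutive windows
) -> List[Tuple[List[Row], List[float]]]:
    """Return (input window, target values) pairs cut from a multivariate series."""
    samples = []
    for start in range(0, len(rows) - in_len - out_len + 1, stride):
        window = rows[start:start + in_len]
        future = rows[start + in_len:start + in_len + out_len]
        samples.append((window, [row[target_col] for row in future]))
    return samples
```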
Dataset I was the GEFCom2014 open dataset, which is a standard dataset widely used for power load forecasting. For this experiment, only the hourly load records from 1 January 2006 to 31 December 2014 were selected from the GEFCom2014 dataset, with a total of 78,888 sampling points, as shown in Figure 5a.
Dataset II came from the Belgian grid operator Elia and is publicly available; it is highly reliable, with a low error rate and few missing values. In this study, power loads at 15 min intervals were selected for the period 1 January 2019 to 31 December 2020, with a total of 70,176 sampling points, as shown in Figure 5b. A clear periodicity and symmetry of the original data could be observed in both datasets, which suggested that the original data included many submodalities, i.e., the data could be decomposed into several submodalities that represented different features, and that using the VMD-decomposed submodalities as data augmentation could improve the accuracy of the prediction model.
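The stated sample counts can be cross-checked against the sampling intervals with a short calculation. The sketch assumes that each range includes its final day in full; for dataset I, an hourly series running through 31 December 2014 is what yields 78,888 points.

```python
from datetime import datetime, timedelta


def count_samples(first_day: datetime, last_day: datetime, step: timedelta) -> int:
    """Number of samples from the start of first_day through the end of last_day."""
    end = last_day + timedelta(days=1)   # include the last day in full
    return (end - first_day) // step


# Dataset I: hourly load, 1 January 2006 to 31 December 2014
dataset_i = count_samples(datetime(2006, 1, 1), datetime(2014, 12, 31), timedelta(hours=1))
# Dataset II: 15 min load, 1 January 2019 to 31 December 2020
dataset_ii = count_samples(datetime(2019, 1, 1), datetime(2020, 12, 31), timedelta(minutes=15))
print(dataset_i, dataset_ii)
```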

Evaluation Criteria
The prediction performance was tested using three widely used indicators: mean absolute error (MAE), mean absolute percentage error (MAPE), and root-mean-square error (RMSE). This allowed for a thorough evaluation of the model's prediction accuracy. The MAE is simple to compute but is not very sensitive to outliers. The MAPE responds to errors in proportion to the true values. The RMSE is highly sensitive to spikes and outliers. When these three indicators are considered together, a more comprehensive and accurate evaluation of the model's predictive performance can be made: the smaller the values of the MAE, MAPE, and RMSE, the better the model's predictive performance. The formulas for the three evaluation indicators are as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \bar{y}_i - y_i \right|$$
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\bar{y}_i - y_i}{y_i} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \bar{y}_i - y_i \right)^2}$$
where $\bar{y}_i$ is the predicted value of sample $i$, $y_i$ is the true value of sample $i$, and $n$ is the total number of samples.
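For reference, the three indicators can be computed as in the following minimal sketch, which uses the standard definitions. The MAPE is returned in percent and assumes that no true value is zero, which holds for grid-level load data.

```python
from math import sqrt
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the unit of the data (MW here)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; true values must be non-zero."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error, in the unit of the data (MW here)."""
    return sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```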

VMD
Table 3 displays the PCC between the components and the original data for various modal numbers. The original data of the two datasets were decomposed using the improved VMD approach. For both datasets, the smallest PCC fell below 0.2 for the first time when $K = 4$, indicating that four was the ideal number of decomposition layers for both datasets.
According to the VMD parameter setting method, the modal number $K$ was set to four and the quadratic penalty factor to $\alpha = 2000$; the other parameters were left at the VMD default values.
The modal components obtained by VMD are shown in Figure 6. For each dataset, the modal component data and the original load data were combined into the reconstructed data and input into the Crossformer prediction network.

Parametric Optimization Results
The model was trained and tested on a Google Colab server equipped with a Tesla T4 graphics card with 16 GB of video memory, using Python 3.10.12. The important hyperparameters of the deep learning model in this paper included the segment length, number of attention heads, number of routers, and batch size, which directly affected the network prediction; their default values were 6, 4, 10, and 32, respectively. The number of training epochs during the experiments was 20, the learning rate was $10^{-4}$, the number of encoder layers was three, and the number of decoder layers was four. When searching for the optimal value of one parameter, the other parameters were kept constant, and the following experiments were conducted to determine the optimal parameters one by one.
As shown in Table 4, the model achieved the lowest MAE and MAPE values on both datasets when the segment length was eight. Therefore, the segment length for both datasets in this study was chosen as eight. As shown in Table 5, the MAE and MAPE of the model on dataset I reached their lowest values when the number of attention heads was four, and those on dataset II reached their lowest values when the number of attention heads was six. Therefore, the number of attention heads was chosen as four for dataset I and six for dataset II. As shown in Table 6, the model achieved the lowest MAE and MAPE values on both datasets when the number of routers was eight. Therefore, the number of routers was selected as eight for both datasets. As shown in Table 7, the model had the lowest MAE and MAPE values on dataset I when the batch size was four and on dataset II when the batch size was eight. Therefore, the batch size was four for dataset I and eight for dataset II. In summary, the experiments determined a segment length of eight, four attention heads, eight routers, and a batch size of four for dataset I, and a segment length of eight, six attention heads, eight routers, and a batch size of eight for dataset II; the defaults were used for all other parameters. The mean square error was used as the loss function, and the Adam (adaptive moment estimation) optimizer was used for training. With the parameters finalized by the experiments, the average time per epoch was 876.422 s for dataset I and 407.814 s for dataset II.
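The one-parameter-at-a-time search used in this section can be sketched as follows. The sketch assumes that, while one hyperparameter is varied, all others stay at their default values (the text says only that they were kept constant), and `evaluate` is a hypothetical callable that trains the model with the given settings and returns a validation error such as the MAE.

```python
from typing import Any, Callable, Dict, List


def tune_one_at_a_time(
    defaults: Dict[str, Any],
    grid: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Dict[str, Any]:
    """Vary one hyperparameter at a time around the defaults; keep the lowest-error value."""
    best: Dict[str, Any] = {}
    for name, candidates in grid.items():
        scores = {}
        for value in candidates:
            params = {**defaults, name: value}   # all other settings stay at their defaults
            scores[value] = evaluate(params)
        best[name] = min(scores, key=scores.get)
    return {**defaults, **best}
```

This costs one training run per candidate value rather than one per combination, at the price of ignoring interactions between hyperparameters.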

Comparison of Forecast Results
Under the same conditions, the model of this study was compared with other models, keeping the learning rate, the number of training epochs, the input sequence length, and the prediction sequence length at $10^{-4}$, 20, 96, and 24, respectively.
In order to verify the reasonableness of the combined VMD-Crossformer model, this study first demonstrated the role of each module through a comparative experiment with simpler models, as shown in Table 8. To verify the superiority of the combined VMD-Crossformer model, this study then demonstrated its predictive ability through comparison experiments with complex models, as shown in Table 9. The experimental results in Table 9 demonstrate that, compared with the other three prediction methods, the VMD-Crossformer prediction method had the lowest MAE, MAPE, and RMSE on datasets I and II; in other words, its prediction accuracy was higher. Specifically, compared with VMD-CNN-LSTM [31], VMD-SG-LSTM [32], and VMD-Pyraformer-Adan [33], this paper's model on dataset I reduced the MAE by 4.335 MW to 42.878 MW, the MAPE by 0.137% to 1.294%, and the RMSE by 8.527 MW to 65.863 MW. On dataset II, compared with the same three models, the model reduced the MAE by 7.648 MW to 53.911 MW, the MAPE by 0.106% to 0.694%, and the RMSE by 11.575 MW to 74.135 MW. The prediction method in this study achieved the best prediction accuracy on both datasets compared with several recent prediction methods. The prediction results of the comparison experiments are visualized in Figure 8.
Figure 8 shows 120 sample points, and it can be observed intuitively that the predictions of this study's model fit the real data curve more closely. For complex power load data, the model in this study captured more feature information than the other models when the data rose or fell suddenly, thus obtaining a higher accuracy. The figure also shows a certain delay in the LSTM-based predictions, which degraded the results and kept VMD-CNN-LSTM and VMD-SG-LSTM from reaching the highest prediction accuracy. The Transformer variants captured global information well through the self-attention mechanism, which made the prediction accuracy of the VMD-Pyraformer-Adan and VMD-Crossformer models better, and the Crossformer network captured cross-dimensional dependencies more directly than the Pyraformer network, resulting in a better prediction accuracy.

Conclusions
The power system is subject to numerous influencing factors, resulting in the load data exhibiting randomness, variability, and nonlinearity. In this paper, a VMD-Crossformer forecasting model is proposed to improve the accuracy of power load forecasting. Initially, an enhanced optimization method for VMD parameters is employed to dynamically adjust the modal number of the VMD based on the PCC. Afterward, the initial signal is divided into various modal components using VMD. These components are then combined with the original load data to create reconstructed data; finally, the reconstructed data are input into the Crossformer network for final prediction. Through the experimental study, the following conclusions were drawn:
(1) The optimal number of decomposition layers matching the original signal was found using the VMD parameter optimization approach, which was based on the PCC. This method helped to prevent problems such as under-decomposition and over-decomposition of the signal caused by setting an inappropriate number of VMD modes.
(2) The complex power load data were broken down into relatively simple submodal components using the VMD algorithm. Each modal component reflects the characteristics of the original signal in various frequency ranges, and each modal component is reconstructed with the original signal and then input into the prediction network, which can greatly improve the prediction accuracy.
(3) The Crossformer network utilized cross-dimensional dependencies and information at different scales to capture the relationship between data more comprehensively and accurately predict the power load data.
(4) Taking the GEFCom2014 dataset and the load dataset of the Belgium Power Grid Company as an example, the prediction based on VMD-Crossformer showed a higher prediction accuracy and better performance than other models.
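The PCC step and the reconstruction step summarized above can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the modes are synthetic sinusoids standing in for VMD output, the helper names are hypothetical, and stacking the original load with the modes as columns of a multivariate series is an assumed reading of "combined with the original load data". The rule that turns the PCC values into a final mode number is not restated in this section, so it is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mode_correlations(modes, signal):
    """PCC between each modal component and the original signal."""
    return [pearson(mode, signal) for mode in modes]


def stack_with_original(modes, signal):
    """Build one row per time step: [load, mode_1, ..., mode_K].

    Assumption: the reconstructed data is a multivariate series whose
    dimensions are the original load and the modal components.
    """
    return [[signal[t]] + [mode[t] for mode in modes] for t in range(len(signal))]


# Synthetic stand-ins for VMD modes (hypothetical, for illustration only).
n = 200
slow = [math.sin(2 * math.pi * t / 100) for t in range(n)]
fast = [0.2 * math.sin(2 * math.pi * t / 10) for t in range(n)]
signal = [s + f for s, f in zip(slow, fast)]
modes = [slow, fast]

pcc = mode_correlations(modes, signal)
features = stack_with_original(modes, signal)
print("PCC per mode:", [round(v, 3) for v in pcc])
print("rows:", len(features), "columns:", len(features[0]))
```

In this toy case the slow, high-amplitude component correlates much more strongly with the signal than the fast, low-amplitude one, which is the kind of contrast a PCC-based criterion can use to judge whether an additional mode still carries meaningful information.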
In this paper, a VMD-Crossformer forecasting model was proposed and applied to power load forecasting, which can serve as a reference for practical applications. The performance validation was carried out on two widely recognized datasets; three evaluation indexes, MAE, MAPE, and RMSE, were used to assess the model performance; and the experiments showed that the model in this paper had the highest prediction accuracy when compared with other models. However, the method in this paper still has shortcomings: no optimization algorithm was used to find the optimal network parameters, and the training speed and model performance can be further improved.

Figure 5. Original data: (a) I dataset; (b) II dataset.

The datasets were split in a 7:1:2 ratio among training, validation, and test sets. Training on the training set allowed the model to learn the patterns of the time series data; evaluating the model and adjusting its parameters on the validation set further improved its accuracy and stability; and assessing the predictions on the test set gave an objective measure of the model's generalizability and of its ability to forecast loads accurately in real-world applications.
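A 7:1:2 split can be sketched as below. The split is assumed to be chronological (earliest samples for training, latest for testing), which is common for time series but is not stated explicitly here; the function name and the toy series are hypothetical.

```python
def chronological_split(series, parts=(7, 1, 2)):
    """Split a sequence into train/validation/test blocks in time order.

    Integer arithmetic is used so the boundaries are exact.
    Assumption: no shuffling, so the test block is the most recent data.
    """
    total = sum(parts)
    n = len(series)
    train_end = n * parts[0] // total
    val_end = n * (parts[0] + parts[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


# Hypothetical series of 1000 time steps, for illustration only.
load = list(range(1000))
train, val, test = chronological_split(load)
print(len(train), len(val), len(test))
```

Keeping the blocks in time order avoids leaking future load values into training, which would otherwise make the test error look better than it would be in deployment.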

Table 1. Comparison and analysis of related literature.

Table 2. Measures of relevance.

Table 3. PCC between each modal component and the original data: (a) I dataset; (b) II dataset.

Table 4. Comparison of different segment lengths.

Table 5. Comparison of different numbers of attention heads.

Table 6. Comparison of different numbers of routers.

Table 7. Comparison of different batch sizes.

Table 8. Comparative experiments with simple models.

The comparison experiments with the simple models show that the inclusion of the modules in this study resulted in a significant improvement in prediction accuracy, with MAE, MAPE, and RMSE being the lowest on both the I and II datasets. Specifically, on the I dataset, this model reduced the MAE by 21.253 MW~76.729 MW, the MAPE by 0.661%~2.392%, and the RMSE by 28.171 MW~102.626 MW when compared with GRU, Informer, Crossformer, and EMD-Crossformer. On the II dataset, this paper's model reduced the MAE by 158.135 MW~392.807 MW, the MAPE by 1.919%~5.159%, and the RMSE by 210.187 MW~551.914 MW. Experiments ①, ②, and ③ showed the superiority of the Transformer variant models; experiments ② and ③ showed the better prediction ability of Crossformer; experiments ③ and ⑤ showed the importance of adding the VMD to the Crossformer network; and experiments ④ and ⑤ showed the better prediction effect of using VMD than EMD in combination with

Table 9. Comparative experiments with complex models.