4.1. Dataset and Metrics
The data analyzed in this study originate from three datasets. The first dataset is a publicly available wind power prediction dataset on the Kaggle website, the second dataset is from a wind turbine and a wind farm in western China, and the third dataset is from a wind farm in northwestern China. The public dataset contains a total of 21 variables, including various meteorological, turbine, and rotor-related features. The data were recorded between January 2018 and March 2020, with readings taken at ten-minute intervals. Private Dataset 1 includes 10 variables, such as wind speed and temperature, recorded from January 2021 to March 2023, comprising a total of 76,369 data points. Private Dataset 2 includes 12 variables, recorded from May 2023 to June 2024, comprising a total of 87,234 data points.
To minimize the influence of irrelevant variables on prediction accuracy, correlation analysis was performed separately for the three datasets, and the five variables most strongly correlated with wind power were retained. First, missing and anomalous values were addressed using the random forest imputation method. Subsequently, the Pearson correlation coefficient was employed to analyze the relationship between meteorological variables and wind power. The calculated results are presented in Table 1, Table 2 and Table 3. Since the private dataset also includes forecasted values for five variables from the meteorological bureau, these forecasts are used in conjunction with the corresponding actual values to perform power prediction.
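The correlation-based selection described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: it assumes imputation has already been applied, ranks variables by the absolute Pearson coefficient, and uses hypothetical column names and toy values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_x * var_y)

def top_k_features(features, target, k=5):
    """Rank feature columns by |r| with the target and keep the k strongest."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return [(name, scores[name]) for name in ranked[:k]]

# Hypothetical toy data; names and values are illustrative only.
power = [0.1, 0.4, 0.9, 1.6, 2.5, 3.6]
features = {
    "wind_speed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "temperature": [15.0, 14.0, 16.0, 15.0, 14.0, 16.0],
    "pressure": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
}
selected = top_k_features(features, power, k=2)
for name, r in selected:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by the absolute value keeps strongly negative correlations as well as positive ones; whether the original study ranked by signed or absolute correlation is not stated, so this choice is an assumption.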
This paper uses three metrics, mean absolute error (MAE), mean squared error (MSE), and coefficient of determination ($R^2$), to measure the differences between predicted and actual values.
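MAE is a metric that measures the average absolute error between predicted and actual values, reflecting the average magnitude of prediction errors. It assigns equal weight to each error and is calculated as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$

where $y_i$ represents the true value of the $i$-th sample, $\hat{y}_i$ denotes the predicted value of the $i$-th sample, $\bar{y}$ is the mean of the true values for all samples, and $n$ is the number of samples.
MSE evaluates the average squared error between predicted and actual values, placing greater emphasis on larger errors. A smaller MSE value indicates better model performance, and it is calculated as

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$

$R^2$ assesses the model’s goodness of fit to the data, representing the proportion of variation in the target variable that can be explained by the model. Its value typically lies in [0, 1], with values closer to 1 indicating a better fit, and it is calculated as

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.$$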
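As a minimal sketch, the three metrics named in this section can be computed with their standard definitions as below. MAPE and RMSE, which the results tables report, are not defined in this section; their textbook forms are included here as an assumption. The sample values are hypothetical.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error (assumed definition: square root of MSE)."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumed definition).

    Requires every true value to be non-zero.
    """
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [1.0, 5.0, 6.0, 10.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

Because MAPE divides by the true value, it is undefined for samples where the measured power is zero; how such samples were handled in the experiments is not stated.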
  4.3. Results
We conducted comparative experiments on three datasets against seven different baseline methods. The comparative experimental results on the public dataset are shown in Table 4. The proposed method achieves an MAPE of 1.325% ± 0.011 and an RMSE of 0.032 MW ± 0.001, which significantly outperforms other models. Compared to the next best model, SSA-VMD-INGO-RF, which has an MAPE of 1.750% ± 0.022, the proposed method reduces the error by 24.3%. Moreover, compared to the RF model (15.800% ± 0.033), there is a 91.6% reduction. Similarly, in terms of the coefficient of determination ($R^2$), the proposed method attains 0.991 ± 0.017, slightly surpassing SSA-VMD-INGO-RF (0.990 ± 0.023) and SSA-VMD-PSO-RF (0.985 ± 0.023), which demonstrates its robustness in explaining data variance while maintaining superior error minimization. A general trend observed in the table is that integrating decomposition techniques such as VMD and EMD with optimization algorithms like PSO and NGO consistently improves performance. Traditional RF models show relatively poor performance in terms of both error and fit, highlighting their limitations in handling complex prediction tasks. In conclusion, the proposed method stands out as the most effective in minimizing errors, making it highly suitable for applications requiring high prediction accuracy. It also achieves the smallest error margin, which indicates the best stability among the compared models.
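The relative reductions quoted for the public dataset are consistent with the usual formula (baseline − proposed) / baseline. The short check below reproduces them from the MAPE values stated in the text; the function name is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers an error metric relative to `baseline`."""
    return (baseline - proposed) / baseline * 100.0

proposed_mape = 1.325  # proposed method, public dataset
vs_next_best = relative_reduction(1.750, proposed_mape)  # SSA-VMD-INGO-RF
vs_rf = relative_reduction(15.800, proposed_mape)        # plain RF baseline

print(f"{vs_next_best:.1f}%")  # prints 24.3%
print(f"{vs_rf:.1f}%")         # prints 91.6%
```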
The comparative experimental results on Private Dataset 1 are shown in Table 5. The proposed method achieves an MAPE of 1.500% ± 0.012 and an RMSE of 0.040 MW ± 0.002, demonstrating statistically robust performance with the smallest uncertainty ranges among all models. Compared to the next best hybrid model, SSA-VMD-INGO-RF (MAPE: 1.950% ± 0.021, RMSE: 0.085 MW ± 0.004), the proposed method reduces the MAPE by 23.1% and the RMSE by 52.9%, while showing a 91.3% MAPE reduction and a 94.7% RMSE reduction over the baseline RF model (MAPE: 17.200% ± 0.035, RMSE: 0.750 MW ± 0.038). In terms of $R^2$, the proposed method achieves 0.990 ± 0.010, marginally surpassing SSA-VMD-INGO-RF (0.985 ± 0.015) and SSA-VMD-PSO-RF (0.975 ± 0.017), with tighter confidence intervals indicating more stable predictions despite the dataset’s inherent complexity. The integration of decomposition techniques (VMD/EMD) with optimization algorithms (PSO/NGO) again proves critical, as evidenced by VMD-RF (MAPE: 3.000% ± 0.028, RMSE: 0.160 MW ± 0.009), which outperforms EMD-RF (MAPE: 11.300% ± 0.034, $R^2$: 0.740 ± 0.022). Although the private dataset introduces heightened variability (reflected in slightly elevated MAPE/RMSE versus the public dataset results), the proposed method maintains narrower error margins (e.g., ±0.012 vs. ±0.021 MAPE for SSA-VMD-INGO-RF), supporting its cross-dataset reliability and superiority in error minimization while maintaining competitive explanatory power.
The comparative experimental results on Private Dataset 2 are shown in Table 6. The proposed method achieves an MAPE of 1.450% ± 0.011 and an RMSE of 0.035 MW ± 0.002. These metrics not only improve upon the Private Dataset 1 results (MAPE: 1.500%, RMSE: 0.040 MW) but also maintain the smallest error margins across all comparative models, including SSA-VMD-INGO-RF (MAPE: 1.800% ± 0.020, RMSE: 0.080 MW ± 0.004). The 19.4% MAPE reduction and 56.3% RMSE improvement over this closest competitor, coupled with a 91.4% MAPE reduction from the RF baseline (MAPE: 16.800% ± 0.036), align closely with performance gains observed in both previous datasets (public: 91.6% MAPE reduction, Private Dataset 1: 91.3%), confirming method stability under varying data conditions. The $R^2$ value of 0.992 ± 0.008 surpasses all variants, including SSA-VMD-INGO-RF (0.988 ± 0.014), while exhibiting the tightest confidence interval, a critical indicator of reliable generalization. Notably, the error ranges (MAPE ± 0.011 vs. ±0.012 in Private Dataset 1 and ±0.011 in public data) do not widen as datasets grow more complex, and they remain well below those of conventional models like VMD-RF (MAPE ± 0.027 in Private Dataset 2 and ±0.028 in Private Dataset 1). This apparent inverse relationship between dataset complexity and the proposed method’s uncertainty highlights its adaptability. Cross-dataset comparisons reveal sustained advantages: MAPE improvements of 91.6% (public), 91.3% (Private Dataset 1), and 91.4% (Private Dataset 2) over RF baselines, with $R^2$ values consistently at or above 0.990 in all scenarios. Together, these results support the method’s effectiveness across datasets.
Figure 6 illustrates the results of applying VMD to the public dataset, decomposing it into five intrinsic mode functions (IMFs) across different scales. The public dataset exhibits relatively smooth patterns across lower-frequency components, with clear long-term trends and moderate fluctuations in the higher-frequency components. The high-frequency IMFs capture rapid variations with consistent amplitude, indicating a relatively stable and less noisy dataset. The distribution of values suggests that the public dataset follows more predictable patterns with well-defined periodic trends, making it easier for forecasting models to achieve high accuracy.
The results of ablation experiments on the public dataset are shown in Table 7. Each row reports the value of each indicator for the existing structure after the removal of specific modules. The existing structure achieves superior performance with 1.325% ± 0.011 MAPE and 0.032 MW ± 0.001 RMSE, demonstrating its effectiveness through systematic module ablation. The removal of the MFFB (MAPE: 2.150% ± 0.015) results in a marked degradation (62.3% MAPE increase), confirming its pivotal role in multi-variate feature fusion. Subsequent removal of the AdpGLayer (MAPE: 5.600% ± 0.020) demonstrates 323% error growth, validating its criticality for cross-scale correlation modeling (Section 3.2). The VMDSGB (MAPE: 9.800% ± 0.030) and standalone VMD (14.200% ± 0.039) show progressively weaker impacts, aligning with their hierarchical roles: basic decomposition versus optimized multi-scale processing. Notably, the baseline MSGB (18.500% ± 0.048) exhibits the poorest performance, highlighting the necessity of VMD-enhanced architecture. Error margins reveal stability patterns: key modules like MFFB (±0.015) and AdpGLayer (±0.020) show tighter bounds than VMD (±0.039) and MSGB (±0.048). Notably, the ablation hierarchy (MFFB > AdpGLayer > VMDSGB > VMD > MSGB) remains stable. The $R^2$ progression (0.520 → 0.991) further quantifies each module’s cumulative explanatory power enhancement, while the proposed method maintains $R^2$ values of at least 0.990 and MAPE reduction > 91% over RF baselines in all scenarios, establishing its domain-agnostic effectiveness for wind power prediction tasks.
Figure 7 presents the results of applying VMD to the private dataset, decomposing it into five intrinsic mode functions (IMFs) across different scales. The high-frequency components display stronger fluctuations with irregular peaks, reflecting higher short-term variability and potential noise interference. This suggests that the private dataset contains more complex dynamics, possibly due to external influences or operational inconsistencies. The lower-frequency components in the private dataset also show greater variation compared to the public dataset, indicating a more intricate long-term trend with less consistency. These differences highlight the challenges posed by the private dataset, requiring more robust modeling techniques to handle the increased variability and complexity effectively. The contrast between the three datasets underscores the need for tailored feature extraction strategies to accommodate the unique properties of each dataset for accurate and reliable predictions.
To further demonstrate the effectiveness of the proposed method, the first 100 data points in Private Dataset 1 were selected and compared across different prediction models; the results are shown in Figure 8. The proposed method shows a consistently closer fit to the original data than most other methods, especially RF and NGO-RF, which exhibit greater deviations and higher variability. For instance, in the index range 20–40, the proposed method accurately captures the downward trend and aligns closely with the actual values, while methods such as RF and EMD-RF show significant overshooting. Furthermore, near index 60, the proposed method maintains stability with minimal error compared to methods like SSA-VMD-PSO-RF, which still display slight oscillations. These observations highlight the robustness and reliability of the proposed method in predicting values with higher precision and reduced fluctuation, making it a superior approach in capturing the true data patterns.
  4.4. Discussion
The experimental results across all three datasets consistently demonstrate the superiority of the proposed MSVMD-Informer in wind power prediction. On the public dataset (Table 4), our model achieves an MAPE of 1.325% ± 0.011 and an RMSE of 0.032 MW ± 0.001, outperforming even advanced hybrids like SSA-VMD-INGO-RF (24.3% MAPE reduction). This performance gain stems from the model’s ability to resolve multi-scale meteorological couplings: low-frequency IMFs capture seasonal trends linked to large-scale pressure systems, while high-frequency components model turbulence-induced volatility. The narrower error margins (±0.011 MAPE vs. ±0.022 for SSA-VMD-INGO-RF) reflect enhanced stability from adaptive cross-variable fusion, which dynamically weights humidity-to-wind-power interactions based on their thermodynamic significance.
Similar trends emerge in the private datasets (Table 5 and Table 6), where the method maintains MAPE reductions > 91% over RF baselines despite elevated complexity. Private Dataset 1’s high-frequency IMFs exhibit irregular peaks, indicative of terrain-induced turbulence. The MFFB’s parallel C2f modules address this challenge by preserving transient features through heterogeneous convolutional kernels. The inverse relationship between dataset complexity and the model’s uncertainty (e.g., ±0.011 MAPE in Private Dataset 2 vs. ±0.027 for VMD-RF) highlights its physics-aware design: adaptive graph convolution emulates energy cascade mechanisms, preventing high-frequency signal loss during decomposition.
Ablation studies (Table 7) quantify the hierarchical importance of model components. The 62.3% MAPE degradation upon removing the MFFB underscores its role in integrating humidity-induced air density effects with wind speed dynamics, a coupling often oversimplified in traditional hybrids. Meanwhile, the 323% error increase after excluding the AdpGLayer validates its simulation of nonlinear scale interactions, such as how turbulence (IMF4–5) modulates diurnal patterns (IMF1–2).
The model assumes localized stationarity within 1 h decomposition windows, a simplification justified by Kolmogorov’s turbulence theory but one that is potentially limited in hurricanes where multi-scale couplings become chaotic. Although linear cross-scale fusion reduces computational load (critical for edge deployment), it may underestimate nonlinear resonance effects during extreme weather—a tradeoff evidenced by <1% accuracy loss compared to nonlinear variants.
Future improvements can be made in the following aspects: introducing a meta-learning framework to achieve dynamic adaptive optimization of VMD parameters, compressing the number of parameters in the multi-scale global module (MSGB) through knowledge distillation to enhance inference speed, embedding anomaly-aware mechanisms in the feature fusion stage to improve responsiveness to unexpected events, and adopting a transfer learning strategy to improve the model’s generalization across geographic and climatic differences. In addition, exploring the combination of temporal causal inference with physical constraints may further strengthen the model’s ability to characterize complex meteorological coupling mechanisms.
  4.5. Contribution of the Method
The proposed MSVMD-Informer fundamentally advances hybrid models by integrating “multi-variate multi-scale decomposition”, “adaptive cross-variable interaction learning”, and “end-to-end joint optimization” into a unified framework. While existing hybrid models often treat signal decomposition and deep learning as isolated stages or focus on single-variable temporal patterns, our approach introduces three methodological innovations that collectively address the limitations of prior works:
First, the model explicitly bridges the gap between decomposition and feature fusion through “adaptive multi-scale dependency learning” [42]. Unlike conventional models that decompose variables independently and fuse features statically, our framework employs adaptive graph convolution to dynamically capture cross-scale and cross-variable correlations. For instance, high-frequency fluctuations in wind speed are linked to low-frequency temperature trends via learnable adjacency matrices, enabling the model to prioritize interactions that are critical for prediction [40]. This contrasts with fixed-weight fusion strategies in existing hybrids, which fail to adapt to variable-specific temporal dynamics [43].
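To make the idea of a learnable adjacency matrix concrete, the sketch below builds a row-normalised adjacency from two node-embedding tables and applies one graph-convolution step. It uses a common formulation from the adaptive graph convolution literature, A = softmax(ReLU(E_src · E_dstᵀ)), which is an assumption here and not necessarily the exact form used in the AdpGLayer. The node names, dimensions, and random initial values are hypothetical; in a trained model the embeddings and weights would be learned by gradient descent.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, so the weights each node places on all nodes sum to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def adaptive_adjacency(e_src, e_dst):
    """A = softmax(ReLU(E_src @ E_dst^T)); both embedding tables are trainable."""
    scores = matmul(e_src, transpose(e_dst))
    rectified = [[max(0.0, s) for s in row] for row in scores]
    return softmax_rows(rectified)

def graph_conv(adj, x, w):
    """One propagation step: mix node features via adj, then project with w."""
    return matmul(matmul(adj, x), w)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Each node is a (variable, scale) pair; the names are illustrative only.
nodes = ["wind_speed/high_freq", "wind_speed/low_freq", "temperature/low_freq"]
rng = random.Random(0)
emb_dim, in_dim, out_dim = 4, 8, 2
e_src = random_matrix(len(nodes), emb_dim, rng)
e_dst = random_matrix(len(nodes), emb_dim, rng)
x = random_matrix(len(nodes), in_dim, rng)   # per-node input features
w = random_matrix(in_dim, out_dim, rng)      # projection weights

adj = adaptive_adjacency(e_src, e_dst)
out = graph_conv(adj, x, w)
```

Because the adjacency is computed from embeddings and not fixed in advance, the weight linking, say, high-frequency wind speed to low-frequency temperature can change during training, which is the contrast with fixed-weight fusion drawn above.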
Second, the framework introduces “joint optimization of decomposition and prediction”. Traditional hybrid models typically predefine decomposition parameters (e.g., VMD modes) or optimize them separately from prediction objectives, leading to suboptimal alignment between decomposition granularity and downstream tasks [44]. In our model, both decomposition parameters and prediction layers are co-trained end-to-end, ensuring that decomposition adaptively enhances feature discriminability for wind power forecasting. This integration mitigates mode-mixing issues and refines multi-scale representations based on prediction feedback, a capability absent in sequential decomposition–prediction pipelines [41].
Third, the architecture uniquely addresses “multi-variable heterogeneity” through hierarchical attention mechanisms [45]. Whereas existing methods process variables in isolation or concatenate decomposed features naively, our multi-variate attention layer explicitly models dependencies between variables at different scales. For example, humidity-to-wind-power correlations are learned separately for high-frequency noise and seasonal trends, with adaptive weights assigned to each scale-variable pair [46]. This contrasts with single-scale attention mechanisms or uniform fusion approaches, which overlook the distinct contributions of variable-scale combinations [47].