1. Introduction
With the intensive exploitation of fossil fuels and other traditional non-renewable energy sources, global warming and energy overconsumption have become increasingly severe issues. Wind power, as a clean and green energy form, is widely distributed and produces no greenhouse gas emissions, minimizing negative environmental impacts. Consequently, it has gained extensive attention and application worldwide [1,2,3]. However, the wind power generation process is influenced by various factors, exhibiting strong volatility and randomness, which significantly affect grid stability during large-scale integration. The complex dynamic behaviours inherent in wind power time series essentially reflect the dynamic equilibrium between symmetry and symmetry breaking in natural systems. Examples include the recurring “stationary-disturbance” symmetric patterns in short-term fluctuations, the asymmetric characteristics of power responses under different wind conditions, and the hierarchical symmetric correlations among multi-scale fluctuations. Exploring these symmetry correlations is crucial for revealing the evolutionary laws of wind power and for moving beyond the passive fitting to “disorder” that characterizes traditional forecasting models. At the same time, identifying symmetric structures within the fluctuations can effectively reduce the modelling dimensionality of complex systems, thereby enhancing prediction accuracy. Accurate wind power prediction is of great significance for energy planning and grid dispatching. Furthermore, it provides technical support for the safe, stable, and economical operation of power systems and wind farms [4,5].
Currently, wind power prediction methods mainly include physical methods, statistical methods, and artificial intelligence (AI) methods [6]. Physical methods rely on specific geographic information and Numerical Weather Prediction (NWP) data to compute power forecasts through physical modelling. Although theoretically sound, these methods are complex to implement and often yield poor predictions with low robustness. Statistical methods predict future wind power from relationships in historical data, employing traditional models such as the Autoregressive Moving Average (ARMA) [7] and Moving Average (MA). These models capture linear characteristics within the data but struggle with the complexity of wind power data, resulting in limited applicability and mediocre performance. With the continuous advancement of AI technology, an increasing number of researchers are adopting AI-based approaches for wind power prediction. Compared with traditional statistical and physical methods, AI methods can automatically extract latent feature correlations, efficiently process high-dimensional and nonlinear data, and eliminate the need for complex modelling, offering greater flexibility and robustness [8].
Wind power prediction time scales are generally categorized into ultra-short-term, short-term, and medium- to long-term predictions [9]. Ultra-short-term prediction forecasts the next four hours at a 15 min resolution, primarily serving real-time dispatch in power systems. As the proportion of stochastic wind power increases, frequency fluctuations in power systems accelerate, making 15 min resolution predictions more challenging [10].
Current widely adopted wind power prediction methods predominantly employ ensemble approaches, integrating multiple models and techniques to comprehensively enhance prediction accuracy. Preprocessing input data through analysis and decomposition can significantly improve data reliability and prediction precision. Given the high volatility and noise levels in wind power data, signal decomposition techniques have attracted considerable attention in this field. Commonly used decomposition algorithms include VMD, EEMD, and CEEMDAN. VMD effectively addresses issues such as frequency overlap and large recursive errors found in Empirical Mode Decomposition (EMD). However, VMD requires parameter tuning, which researchers optimize using various algorithms to improve its handling of nonlinear and non-smooth signals [11]. For instance, Reference [12] utilized Grey Wolf Optimizer (GWO)-optimized VMD to decompose sequences, reducing non-stationarity and improving prediction accuracy when feeding the decomposed sequences into a constructed prediction model. Reference [13] applied the Sparrow Search Algorithm (SSA) to optimize VMD parameters, overcoming the limitations of manual parameter setting, and achieved enhanced prediction accuracy using LSTM. Reference [14] employed EEMD to decompose wind power sequences into subcomponents of different frequencies, mitigating the impact of non-smoothness on prediction accuracy. Reference [15] decomposed the original wind power sequence with CEEMDAN into multiple sub-modes and a residual, reconstructed the subsequences according to their sample entropy, and then predicted the high- and low-frequency sequences with a Transformer model and a BiGRU-Attention model, respectively, achieving favourable forecasting results. Reference [16] employed a combination of the Improved Northern Goshawk Optimization (INGO) algorithm and a subtractive optimizer to determine the number of modes in VMD, enhancing both accuracy and interpretability. Reference [17] first decomposed the wind power sequence by integrating CEEMDAN with a rolling decomposition strategy, reconstructed the sequences into high-, medium-, and low-frequency categories based on sample entropy, and made predictions using an Improved Dung Beetle Optimizer (IDBO)-optimized LSTM model. Reference [18] first applied CEEMDAN decomposition to extract Intrinsic Mode Functions (IMFs), addressing the nonlinearity and non-stationarity in the data; high-frequency and complex signals were then further refined through VMD before being fed into a parallel prediction model for forecasting.
VMD recasts signal decomposition as a constrained variational optimization problem, thereby avoiding the subjective shortcomings of EMD-type methods, which rely on the empirical selection of extreme points and interpolation to generate envelope curves. This leads to more stable and repeatable decomposition results. By presetting the number of modes K and optimizing the bandwidth constraints, VMD forces each IMF to focus on a single centre frequency, significantly reducing the risk of mode mixing. Wind power sequences are typical non-stationary, nonlinear stochastic processes. Their fluctuations are driven by the coupling of multiple factors, such as wind speed randomness, turbine mechanical inertia, and grid dispatching, and manifest as the superposition of multi-time-scale characteristics.
Traditional EMD-type methods are prone to confusing these features due to mode mixing, leading to error accumulation in subsequent modelling. In contrast, VMD proactively separates different frequency components through variational optimization, enabling a clearer extraction of dominant patterns (e.g., periodic trends, random fluctuations) in wind power sequences. However, after the initial VMD, two types of insufficiently captured components may remain: (1) minor fluctuations that are merged due to the limitation of the preset number of modes K; and (2) local high-frequency components caused by extreme fluctuations in wind power data. CEEMDAN, as a secondary decomposition tool, can both suppress boundary effect noise potentially introduced by VMD decomposition and further refine the multi-scale features, thereby avoiding the loss of effective information due to insufficient initial decomposition.
By leveraging the stability and anti-mode-mixing capability of variational optimization, VMD addresses the “coarse decomposition” challenge of multi-scale separation in wind power data. CEEMDAN achieves “fine decomposition” of residual sequences through adaptive noise suppression and complete component retention. The combination of the two methods avoids both the empirical dependence of traditional EMD-type methods and the limitations of a single decomposition algorithm. This integration facilitates the further extraction of hidden multi-scale features and ensures the completeness of the decomposition.
Furthermore, combining multiple machine learning techniques leverages their respective strengths while mitigating individual shortcomings. Reference [
19] proposed a VMD-CNN-GRU hybrid model, demonstrating that VMD reduces wind speed sequence volatility, CNN extracts complex spatial features, and GRU captures temporal features, collectively outperforming single models. Reference [
20] introduced a combined Temporal Convolutional Network (TCN) and Informer-based model, where TCN extracts hidden temporal features, and Informer encodes them for wind power prediction, achieving superior accuracy compared to standalone models. Reference [
21] explored two novel hybrid models: CNN-ABiLSTM and CNN-Transformer-MLP. In these models, CNN captures short-term patterns in solar and wind energy data, while ABiLSTM and Transformer-MLP handle long-term patterns, excelling in daily, weekly, and monthly predictions.
The selection of hyperparameters in prediction models significantly influences their accuracy and robustness. Passing optimal parameters to the model fully exploits its predictive potential, achieving the best results. Reference [
22] developed a photovoltaic power prediction method with PSO-optimized BiLSTM, yielding precise predictions. Reference [
23] applied CEEMDAN to extract local features and time-frequency characteristics, optimized by an Improved Whale Optimization Algorithm (IWOA) for BiLSTM network training, observing enhanced prediction performance. Reference [
24] utilized the RIME algorithm to optimize hyperparameters in an AM-TCN-BiLSTM model, achieving better prediction accuracy and lower error values.
Table 1 provides a survey and comparative synthesis of previous research in this field.
Current wind power prediction technologies primarily focus on improving prediction accuracy through model combinations or optimizations. However, existing research predominantly relies on a single signal decomposition technique (such as standalone VMD, CEEMDAN, or EEMD) to process wind power data. Because the strong non-stationarity of wind power arises from the superposition of multi-scale physical processes, a single decomposition struggles to fully separate fluctuation components from different sources, often producing mode mixing or information loss and leaving the decomposed subsequences with high residual complexity, which directly impairs the input quality for subsequent forecasting models. Performing secondary decomposition on certain sub-sequences can further reduce this complexity, ensuring input data accuracy and reliability. Additionally, most existing studies rely on manual parameter settings for VMD, such as centre-frequency observation methods, which require extensive experimentation and analysis; optimization algorithms that tune only a single metric, such as sample entropy, envelope entropy, or information entropy, yield uncertain decomposition results. Furthermore, most wind power prediction models employ unidirectional deep learning networks, limiting their ability to extract bidirectional hidden information. The optimization of model hyperparameters also remains an open issue: traditional optimization algorithms often exhibit slow convergence and limited exploration capability, necessitating improvements in algorithm performance.
Furthermore, most current studies employ unidirectional forecasting models, which cannot simultaneously exploit the temporal dependencies on both sides of abrupt changes. Regarding model parameter optimization, traditional heuristic algorithms are commonly used; these are prone to local optima and slow convergence, and are especially inefficient at global optimization in high-dimensional hyperparameter spaces. Together, these gaps lead to insufficient characterization of multi-scale wind power fluctuations and limited generalization capability in forecasting models, making it difficult to meet the ultra-short-term forecasting requirements of high accuracy and strong adaptability.
This study addresses the shortcomings in the entire workflow of “decomposition-reconstruction-modelling-optimization” for ultra-short-term wind power forecasting in existing research: residual complexity after single decomposition, empirical and single-metric dependence for VMD parameter determination, the neglect of bidirectional temporal dependencies in unidirectional models, and the limited optimization capability of hyperparameter tuning algorithms. Wind power inherently exhibits strong volatility, high nonlinearity, and non-stationarity, and is susceptible to instantaneous influences from complex meteorological factors. For ultra-short-term wind power forecasting, rapid analysis of high-frequency data is required. The proposed model employs a BiTCN-BiGRU bidirectional structure to simultaneously learn contextual information from both past and future, accurately capturing minute-level power fluctuations. The VMD-CEEMDAN signal decomposition preprocessing reduces data non-stationarity, and combined with sample entropy reconstruction, enables the model to focus on key short-term features while avoiding interference from redundant noise. The causal convolution in BiTCN relies solely on historical data, preventing information leakage from the future; its parallel convolution computation enhances feature extraction speed, meeting the demand for rapid response. The proposed hybrid modelling framework systematically addresses the high-dimensional non-stationary characteristics of wind power, fully exploits bidirectional temporal dependencies, and achieves optimized model parameters.
The main innovations of this study are summarized as follows:
(1) A novel ultra-short-term wind power prediction model is proposed, predicting future 10 min wind power outputs based on the past 150 min of wind power data. Compared with single models and benchmark models, the proposed model demonstrates superior predictive performance.
(2) An improved decomposition method is introduced based on SSA and VMD. The SSA is employed to optimize the parameters of VMD adaptively, including the number of decomposition modes and penalty parameters, enhancing the quality of input data for the prediction model.
(3) A multi-layer data decomposition and reconstruction technique is proposed. After VMD, sequences are reconstructed using sample entropy theory. Sequences containing more information are subjected to secondary CEEMDAN decomposition and reconstruction, followed by combining all reconstructed sequences as final inputs. This approach reduces noise while preserving data features.
(4) A BiTCN-BiGRU prediction model is proposed and constructed for wind power prediction. The BiTCN network integrates TCN with bidirectional processing mechanisms, capturing bidirectional temporal dependencies to enhance feature extraction from complex time-series data. The extracted information is then fed into the BiGRU network for prediction, significantly improving wind power prediction accuracy.
(5) The traditional GWO algorithm is improved by incorporating strategies such as Golden Sine, Opposition-Based Learning, and Lévy Flight. These enhancements accelerate convergence speed and optimization capability. The IMGWO is applied to optimize multiple key hyperparameters of the prediction model, ensuring optimal performance.
The remaining sections of this paper are organized as follows:
Section 2 describes the methods and relevant theories applied in this study.
Section 3 introduces the proposed methodology, model framework, and technical flowchart.
Section 4 verifies the algorithm using actual operational data from the Sotavento wind farm in Spain, analyzes the simulation results for each season, and evaluates the effectiveness of the wind power prediction model. Finally, Section 5 concludes the study.
2. Model Principles
2.1. VMD
VMD, proposed in 2014, is a non-recursive, adaptive signal decomposition method. Under specific constraints, it decomposes an original complex signal into multiple IMFs, and it particularly excels at processing non-stationary and nonlinear time series [25]. The specific steps are as follows:
(1) Construction of the Variational Model:
VMD aims to minimize the sum of the estimated bandwidths of the modal components. The corresponding constrained variational problem is:

$$\min_{\{u_k\},\{\omega_k\}}\left\{\sum_{k=1}^{K}\left\|\partial_t\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2\right\}\quad \text{s.t.}\quad \sum_{k=1}^{K}u_k(t)=f(t)$$

where $u_k(t)$—the $k$th modal component of the decomposition; $\omega_k$—the centre frequency of the corresponding component; $\partial_t$—the gradient operation; $*$—the convolution operation; $\delta(t)$—the Dirac distribution at time t; $f(t)$—the original signal; j—imaginary unit; t—time.
(2) Solution of Constrained Variational Problems:
The Lagrange multiplier $\lambda(t)$ and penalty coefficient $\alpha$ are introduced so that, in scenarios involving Gaussian noise interference, the constrained variational problem is converted into an unconstrained one. This simplifies the solving process and mitigates noise effects. The resulting augmented Lagrangian is:

$$L\left(\{u_k\},\{\omega_k\},\lambda\right)=\alpha\sum_{k=1}^{K}\left\|\partial_t\left[\left(\delta(t)+\frac{j}{\pi t}\right)*u_k(t)\right]e^{-j\omega_k t}\right\|_2^2+\left\|f(t)-\sum_{k=1}^{K}u_k(t)\right\|_2^2+\left\langle\lambda(t),f(t)-\sum_{k=1}^{K}u_k(t)\right\rangle$$

where $\alpha$—the penalty coefficient; $\lambda(t)$—the Lagrange multiplier; $\omega$—the centre frequency.
(3) The Fourier isometric transform is combined with the alternating direction method of multipliers to update $u_k$, $\omega_k$ and $\lambda$ alternately:

$$\hat{u}_k^{n+1}(\omega)=\frac{\hat{f}(\omega)-\sum_{i\neq k}\hat{u}_i(\omega)+\frac{\hat{\lambda}(\omega)}{2}}{1+2\alpha\left(\omega-\omega_k\right)^2}$$

$$\omega_k^{n+1}=\frac{\int_0^{\infty}\omega\left|\hat{u}_k(\omega)\right|^2 d\omega}{\int_0^{\infty}\left|\hat{u}_k(\omega)\right|^2 d\omega}$$

$$\hat{\lambda}^{n+1}(\omega)=\hat{\lambda}^{n}(\omega)+\tau\left(\hat{f}(\omega)-\sum_{k=1}^{K}\hat{u}_k^{n+1}(\omega)\right)$$

where $\tau$—the noise tolerance, which preserves decomposition fidelity; $\hat{u}_k^{n+1}(\omega)$—the Wiener filter of the current residual; $\omega_k^{n+1}$—the centre of gravity of the power spectrum of the modal function; $\hat{f}(\omega)$—the Fourier transform of $f(t)$.
2.2. Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN)
When processing complex signals, the EMD method may encounter mode mixing, in which different signal modes interfere with one another during decomposition, leading to inaccurate results. CEEMDAN integrates adaptive noise with ensemble decomposition concepts. By introducing noise adaptively at each decomposition step, CEEMDAN precisely separates the different frequency components of the signal, reducing computational load and accelerating decomposition. Adaptively adjusting the noise level mitigates the mode mixing problem. The CEEMDAN process involves the following steps [26]:
(1) White noise $\varepsilon_0\omega^{i}(t)$ ($i=1,2,\dots,M$), with zero mean and unit variance, is added to the original wind power signal L(t) to produce M distinct new sequences, each of which is decomposed by EMD. The first IMF and the first residual of CEEMDAN are then obtained as:

$$IMF_1(t)=\frac{1}{M}\sum_{i=1}^{M}E_1\left(L(t)+\varepsilon_0\omega^{i}(t)\right),\qquad r_1(t)=L(t)-IMF_1(t)$$

where the operator $E_1(\cdot)$ extracts the first IMF obtained through EMD.

(2) For the following N − 1 IMFs of CEEMDAN, the process is slightly different. First, white noise is added to the residual $r_{j-1}(t)$ to produce M distinct new residuals. These new residuals are then decomposed by EMD to obtain the jth IMF and the jth residual of CEEMDAN.

(3) Step (2) is repeated until no meaningful IMFs can be extracted from the residual signal, yielding the final residual term of CEEMDAN.
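The first-stage computation described in step (1) can be sketched in a few lines of Python. This is a toy illustration only: `first_imf` is a hypothetical stand-in that removes a moving-average trend instead of performing true cubic-spline EMD sifting, and the function names, ensemble size `M`, and noise amplitude `eps` are assumptions, not the implementation used in this study.

```python
import numpy as np

def first_imf(signal, win=5):
    """Toy stand-in for EMD's first IMF: the signal minus a moving-average
    trend. A real implementation would use envelope-based sifting."""
    kernel = np.ones(win) / win
    trend = np.convolve(signal, kernel, mode="same")
    return signal - trend

def ceemdan_first_stage(L, M=50, eps=0.05, seed=0):
    """Step (1) of CEEMDAN: average the first IMF over M noise-perturbed
    copies of the signal, then form the first residual."""
    rng = np.random.default_rng(seed)
    imf1 = np.mean(
        [first_imf(L + eps * rng.standard_normal(L.size)) for _ in range(M)],
        axis=0,
    )
    r1 = L - imf1          # first residual r1(t) = L(t) - IMF1(t)
    return imf1, r1

t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 40 * t)
imf1, r1 = ceemdan_first_stage(x)
```

By construction, the extracted IMF and the residual always sum back to the original signal, which is the completeness property that motivates using CEEMDAN here.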
2.3. Grey Wolf Optimizer (GWO) Algorithm
GWO is a swarm intelligence optimization algorithm inspired by the social hierarchy and hunting behaviour of grey wolves [27]. Its advantages include few tunable parameters, intuitive principles, and strong global search capability.
In the Grey Wolf Optimization algorithm, the wolf pack’s social structure is divided into four ranks: α, β, δ, and ω. The α wolf represents the leader and symbolizes the current best solution; the β wolf supports the α wolf’s decisions and represents the second-best solution; the δ wolf follows the commands of α and β wolves and represents the third-best solution; and the ω wolves rank lowest, obeying the commands of α, β, and δ wolves. The hunting behaviour of wolves consists of two phases: encircling prey and hunting. These behaviours can be mathematically modelled.
In step 1, after discovering the prey, the pack surrounds it. The distance $D$ between each grey wolf and the prey, and the updated grey wolf position, are, respectively:

$$D=\left|C\cdot X_p(t)-X(t)\right|,\qquad X(t+1)=X_p(t)-A\cdot D$$

where t denotes the current iteration number; $X(t)$ denotes the current grey wolf position; $X_p(t)$ denotes the prey position; and $A$ and $C$ denote the direction vectors, given by:

$$A=2a\cdot r_1-a,\qquad C=2\cdot r_2$$

where $r_1$ and $r_2$ denote random vectors in [0,1], and a denotes the decay factor, which decreases linearly from 2 to 0 as the number of iterations increases.
Step 2: After simulating the encircling behaviour, the α, β, and δ wolves guide the entire pack to gradually shrink the encirclement range, achieving the hunting objective. This process can be described mathematically.
$$D_\alpha=\left|C_1\cdot X_\alpha-X\right|,\quad D_\beta=\left|C_2\cdot X_\beta-X\right|,\quad D_\delta=\left|C_3\cdot X_\delta-X\right|$$

$$X_1=X_\alpha-A_1\cdot D_\alpha,\quad X_2=X_\beta-A_2\cdot D_\beta,\quad X_3=X_\delta-A_3\cdot D_\delta$$

$$X(t+1)=\frac{X_1+X_2+X_3}{3}$$

where $X_\alpha$, $X_\beta$, $X_\delta$ denote the positions of the α, β and δ wolves; $D_\alpha$, $D_\beta$, $D_\delta$ denote the distances of individual grey wolves from the α, β and δ wolves; $A_1$, $A_2$, $A_3$ and $C_1$, $C_2$, $C_3$ are directional random vectors; $X_1$, $X_2$, $X_3$ denote the positions updated from the α, β and δ wolves, respectively; and $X(t+1)$ denotes the final updated position of an individual grey wolf at the end of the tth iteration.
At the end of each iteration, the fitness values of all grey wolves are recalculated. Comparisons determine new α, β, and δ wolves, initiating the next round of iterations and progressively approaching the global optimum.
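The encircling and hunting behaviour described above can be sketched as a minimal numpy implementation. This is an illustrative sketch under simplifying assumptions (a sphere test function, fixed pack size, and helper names of our choosing), not the configuration used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def gwo_step(wolves, f, a):
    """One GWO iteration: rank the pack, then move each wolf to the mean of
    three candidate positions derived from the alpha, beta, delta leaders."""
    order = np.argsort([f(w) for w in wolves])        # minimization problem
    leaders = wolves[order[:3]]                       # alpha, beta, delta
    new = np.empty_like(wolves)
    for i, X in enumerate(wolves):
        candidates = []
        for X_l in leaders:
            A = 2 * a * rng.random(X.size) - a        # A = 2a*r1 - a
            C = 2 * rng.random(X.size)                # C = 2*r2
            D = np.abs(C * X_l - X)                   # distance to leader
            candidates.append(X_l - A * D)            # X1 / X2 / X3
        new[i] = np.mean(candidates, axis=0)          # X(t+1) = (X1+X2+X3)/3
    return new

f = lambda x: float(np.sum(x ** 2))                   # sphere benchmark
wolves = rng.uniform(-5, 5, (20, 2))
T = 60
for t in range(T):
    a = 2 - 2 * t / T                                 # a decays linearly 2 -> 0
    wolves = gwo_step(wolves, f, a)
best = min(f(w) for w in wolves)
```

As `a` decays, `|A|` shrinks below 1 and the pack contracts around the leaders, which is what drives late-stage convergence.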
2.4. Improved Grey Wolf Optimizer (IMGWO) Algorithm
(1) Golden Sine Strategy
To enhance exploration of the solution space, the Golden Sine strategy is introduced to update the positions of the grey wolves. Derived from the Golden Sine Algorithm (Golden-SA) proposed by Tanyildizi et al. in 2017 [28], it replaces constant parameters in the original algorithm with sine functions to thoroughly explore the neighbourhoods of local optima, improving exploration capability. Simultaneously, incorporating the Golden Ratio enhances the dynamic search and increases algorithm coverage [29]. After introducing the Golden Sine strategy, the position update of the grey wolves proceeds as follows.
First, let $p_0=0.5$ and let $p\in[0,1]$ be a random number. When $p<0.5$, the wolf is in the unobstructed state, and its position is updated as:

$$X(t+1)=X(t)\left|\sin r_1\right|+r_2\sin r_1\left|x_1 X_\alpha(t)-x_2 X(t)\right|$$

where $r_1\in[0,2\pi]$ and $r_2\in[0,2\pi]$ are random numbers, and $x_1$ and $x_2$ are the golden section coefficients, specified as:

$$x_1=a\tau+b(1-\tau),\qquad x_2=a(1-\tau)+b\tau$$
where $\tau=\frac{\sqrt{5}-1}{2}$ is the golden ratio, and the initial values of a and b are −π and π, respectively. The update rules of a and b are as follows: if the current solution is better than the optimal solution, the value of $x_2$ is assigned to b, the value of $x_1$ is assigned to $x_2$, and $x_1$ is recomputed from its golden-section formula; otherwise, the value of $x_1$ is assigned to a, the value of $x_2$ is assigned to $x_1$, and $x_2$ is recomputed from its golden-section formula.
When $p\geq 0.5$, the wolf is in the obstructed state, and its position is updated according to the corresponding obstacle-state form of the Golden Sine rule.
(2) Opposition-Based Learning Strategy
This strategy updates individual positions with a certain probability during the algorithm’s iterations, exploiting the information carried by opposite individuals. It not only enhances population randomness but also improves convergence. Opposition-based learning generates the opposite of the current solution within the same space, compares both solutions, and keeps the better one. Assuming $x_i$ is a solution within the interval $[a_i,b_i]$, its opposite solution is given by:

$$\hat{x}_i=a_i+b_i-x_i$$

where $a_i$ and $b_i$ represent the lower and upper bounds of the solution space.
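The opposite-solution construction is a one-liner; the sketch below (with illustrative bounds) shows that mirroring inside the bounds is an involution, so applying it twice recovers the original solution.

```python
import numpy as np

def opposite(x, lo, hi):
    """Opposition-based learning: mirror a solution inside its bounds,
    x_hat_i = a_i + b_i - x_i (element-wise)."""
    return lo + hi - x

x = np.array([1.0, 4.0])
lo, hi = np.array([0.0, 0.0]), np.array([5.0, 10.0])
x_opp = opposite(x, lo, hi)          # -> [4.0, 6.0]
```

In the optimizer, both `x` and `x_opp` would be evaluated and the fitter of the two retained.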
(3) Lévy Flight Strategy
Traditional GWO algorithms suffer from reduced population diversity when all grey wolves move toward the α wolf, especially if the α wolf does not represent the global optimum.
To address this issue, the Lévy Flight strategy is adopted to expand the search domain and increase solution diversity. Lévy Flight is a random walk process where step lengths follow a Lévy distribution, characterized by heavy tails, enabling long-distance jumps in the search space to discover new regions.
The Lévy Flight strategy updates the grey wolf position $X(t)$ as follows:

$$X(t+1)=X(t)+r\oplus \mathrm{Levy}(\beta)$$

where $X(t+1)$ denotes the individual solution after this iteration; $\oplus$ denotes element-wise multiplication; r denotes the step-control weight; and $\beta\in(1,3)$, taken as 1.5 in this paper. The stochastic search path $\mathrm{Levy}(\beta)$ is:

$$\mathrm{Levy}(\beta)=\frac{u}{|v|^{1/\beta}}\left(X(t)-X_{best}\right)$$

where $X_{best}$ denotes the historical optimal solution, and the variables u and v obey normal distributions:

$$u\sim N\left(0,\sigma_u^2\right),\qquad v\sim N\left(0,\sigma_v^2\right)$$

where $\sigma_u$ and $\sigma_v$ are the standard deviations of u and v, respectively:

$$\sigma_u=\left\{\frac{\Gamma(1+\beta)\sin\left(\frac{\pi\beta}{2}\right)}{\Gamma\left(\frac{1+\beta}{2}\right)\beta\,2^{\frac{\beta-1}{2}}}\right\}^{1/\beta},\qquad \sigma_v=1$$

where $\Gamma$ is the gamma function.
In this study, the Lévy Flight strategy is applied exclusively to the α wolf, guiding other wolves indirectly and influencing the entire group dynamics. This approach significantly reduces computational time.
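The heavy-tailed step can be generated with Mantegna's algorithm, which is the standard way to sample a Lévy-distributed step of exponent β. The sketch below applies such a step to a hypothetical α-wolf position; the step weight `0.01` and the toy positions are assumptions for illustration.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Mantegna's algorithm: a Levy-distributed step u / |v|^(1/beta)."""
    rng = rng or np.random.default_rng()
    sigma_u = (gamma(1 + beta) * sin(pi * beta / 2)
               / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
               ) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, dim)     # u ~ N(0, sigma_u^2)
    v = rng.normal(0.0, 1.0, dim)         # v ~ N(0, 1)
    return u / np.abs(v) ** (1 / beta)

rng = np.random.default_rng(0)
x_alpha = np.zeros(2)                     # current alpha-wolf position (toy)
x_best = np.ones(2)                       # historical best position (toy)
step = 0.01 * levy_step(2, rng=rng) * (x_alpha - x_best)
x_alpha_new = x_alpha + step
```

Most draws yield small steps, but the heavy tail occasionally produces a long jump, which is exactly the diversity-injection effect the strategy is meant to provide.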
The optimization-solving steps of the Improved Grey Wolf Algorithm are illustrated in Figure 1 [30].
The primary solving steps of the Improved Grey Wolf Algorithm are as follows:
(1) Initialize the positions of grey wolves (pos) and calculate initial fitness values. Ensure all grey wolves remain within the defined search space. At the start of each iteration, update the positions of α, β, and δ (the top three optimal solutions) based on the current population’s fitness values.
(2) Update each grey wolf’s position using linearly decreasing parameter a and random coefficients A and C. Each grey wolf updates its position based on weighted averages of α, β, and δ. When ∣A∣ ≥ 1, the grey wolf explores randomly; when ∣A∣ < 1, it gradually approaches the optimal solution.
(3) Golden Sine Strategy (GSS):
Apply this strategy during each iteration to enhance search capability and avoid falling into local optima.
(4) Opposition-Based Learning Strategy:
Apply this strategy every five iterations to further enhance exploration of the search space.
(5) Lévy Flight Strategy:
Randomly select 20% of grey wolves for Lévy Flight during each iteration to introduce long-distance jumps and enhance global search capability.
(6) Boundary Checks and Fitness Evaluation:
After each position update, ensure all grey wolves remain within the defined search space. Recalculate fitness values and update personal best (pBest) and global best (gBest). Terminate the process when the maximum number of iterations is reached, outputting the global optimum.
2.5. Bidirectional Temporal Convolutional Neural Network (BiTCN)
Temporal Convolutional Networks (TCNs) are architectures based on convolutional neural networks designed for time-series problems, effectively capturing long-term dependencies [31]. Compared with traditional convolutional networks, TCNs possess strong feature extraction capabilities, combining causal convolutions, dilated convolutions, and residual connections into a new network structure.
In TCNs, causal convolutions are specialized operations that consider only preceding moment features during convolution, avoiding leakage of future information and maintaining causality in time-series data processing. To capture both forward and backward features in the data, this study employs a BiTCN structure to extract bidirectional features, achieving higher training model precision. The architecture of the bidirectional dilated causal convolution network is shown in
Figure 2 [32].
However, simply increasing the number of network layers or enlarging the convolution kernel to broaden the receptive field significantly increases computational cost. To address this, blank values are inserted into the convolution kernels, a method known as dilated convolution. Dilated convolutions effectively expand the receptive field of the network layers, allowing higher-level nodes to cover broader input information without additional computational burden. For time-series data x ∈ Rⁿ and a convolution kernel f: {0, 1, …, k − 1} → R, the dilated convolution at position s is calculated as:

$$F(s)=\left(x*_d f\right)(s)=\sum_{i=0}^{k-1}f(i)\cdot x_{s-d\cdot i}$$

where d is the dilation factor, k is the size of the convolution kernel, and $s-d\cdot i$ indexes the past direction of the time series.
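A direct (unoptimized) implementation of the dilated causal convolution makes the causality explicit: every output sample draws only on current and past inputs, strided by the dilation factor. Names and the toy kernel are illustrative.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """F(s) = sum_i f(i) * x[s - d*i], zero-padded where s - d*i < 0,
    so each output depends only on current and past inputs (causal)."""
    k = len(f)
    out = np.zeros(len(x))
    for s in range(len(x)):
        for i in range(k):
            idx = s - d * i
            if idx >= 0:
                out[s] += f[i] * x[idx]
    return out

x = np.arange(8, dtype=float)           # inputs 0,1,...,7
f = np.array([1.0, 1.0])                # kernel of size k = 2
y = dilated_causal_conv(x, f, d=2)      # y[s] = x[s] + x[s-2]
```

With k = 2 and d = 2, each output sums the current sample and the one two steps back, so `y` is `[0, 1, 2, 4, 6, 8, 10, 12]`; stacking layers with d = 1, 2, 4, … grows the receptive field exponentially at constant cost per layer.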
The TCN neural network consists of stacked TCN residual blocks, each combining dilated causal convolutions and neural network processing layers. Residual blocks include dilated causal convolutions, weight normalization, ReLU activation functions, and Dropout, as shown in
Figure 3 [32].
2.6. Bidirectional GRU Neural Network
GRU can effectively learn and remember dependencies over long time steps [33]. The structure of the GRU is shown in Figure 4.
The GRU computations at time t are:

$$r_t=\sigma\left(W_r\cdot[h_{t-1},x_t]\right)$$

$$z_t=\sigma\left(W_z\cdot[h_{t-1},x_t]\right)$$

$$\tilde{h}_t=\tanh\left(W_{\tilde{h}}\cdot[r_t\odot h_{t-1},x_t]\right)$$

$$h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$$

where $r_t$ and $z_t$ are the reset gate and update gate, respectively; $\sigma$ is the Sigmoid activation function; W is the weight matrix; $\tilde{h}_t$ is the intermediate memory state; $x_t$ is the input at time t; and $h_t$ is the hidden state at time t.
The Bidirectional GRU (BiGRU) consists of two GRU hidden layers combined into a bidirectional neural network, allowing the current moment’s output to connect with both the previous and subsequent moments’ states, facilitating deeper feature extraction. The structure of BiGRU is shown in Figure 5.
The output at time t is the weighted sum of the forward and backward hidden layer outputs, calculated as follows:

$$\overrightarrow{h_t}=G\left(x_t,\overrightarrow{h_{t-1}}\right),\qquad \overleftarrow{h_t}=G\left(x_t,\overleftarrow{h_{t+1}}\right)$$

$$y_t=w_t\overrightarrow{h_t}+v_t\overleftarrow{h_t}+b_t$$

where G(·) is the state of the corresponding vector-encoded GRU hidden layer; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the forward and backward hidden layer output states, respectively; $w_t$ and $v_t$ are the output weights of the corresponding hidden layers; and $b_t$ is the bias of the hidden layer state at time t.
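A single GRU step and the bidirectional combination can be sketched in numpy. This is a simplified sketch: biases are omitted, the two directions share weights, and the per-step outputs are combined by plain addition rather than learned output weights, all of which are assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One GRU step on the concatenated input [h_prev, x_t]; biases omitted."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ hx)                                      # reset gate
    z = sigmoid(Wz @ hx)                                      # update gate
    h_cand = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_cand                      # new hidden state

rng = np.random.default_rng(0)
n_h, n_x, T = 4, 3, 5
Wr, Wz, Wh = [rng.standard_normal((n_h, n_h + n_x)) * 0.1 for _ in range(3)]
seq = rng.standard_normal((T, n_x))

fwd, bwd = [], []
h = np.zeros(n_h)
for t in range(T):                        # forward pass over the sequence
    h = gru_step(seq[t], h, Wr, Wz, Wh)
    fwd.append(h)
h = np.zeros(n_h)
for t in reversed(range(T)):              # backward pass over the sequence
    h = gru_step(seq[t], h, Wr, Wz, Wh)
    bwd.append(h)
y = np.array(fwd) + np.array(bwd[::-1])   # combine both directions per step
```

Because each time step's output mixes a state computed from the past with one computed from the future, features on both sides of an abrupt power change contribute to the prediction at that step.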
2.7. Improved Variational Modal Decomposition
When using VMD to decompose a time series, the parameters need to be manually set. Among these, the decomposition number K and the penalty factor α significantly influence the decomposition performance. If the decomposition number K is set too high, over-decomposition may occur, leading to redundant components. Conversely, if K is set too low, some signals may be lost. The penalty parameter α affects the bandwidth of each modal component; a smaller α results in wider bandwidths for the IMF components, while a larger α narrows the signal bandwidth of the IMF components. Therefore, it is necessary to use the SSA to adaptively search for the optimal parameter combination [K,α].
When optimizing parameters with the SSA method, an optimal solution must be selected. To achieve this, a fitness function tailored to wind power sequences needs to be constructed. A single metric cannot comprehensively and accurately reflect the characteristics of a time-series signal. Therefore, a composite index based on sample entropy and mutual information is established, and the fitness function is defined as the minimum value of this composite index [34].
Sample entropy is used to assess the complexity and variation in a time series, i.e., the probability that a new pattern may arise in the sequence given a change in dimension.
Given a time series of length N, the m-dimensional template vectors are formed as:

$$X(i)=\left[x(i),x(i+1),\dots,x(i+m-1)\right]$$

where i = 1, 2, …, N − m + 1.

Define d[X(i),X(j)] (i ≠ j) as the maximum distance between the corresponding elements of X(i) and X(j), i.e.,

$$d[X(i),X(j)]=\max_{k\in[0,m-1]}\left|x(i+k)-x(j+k)\right|$$

For a known threshold r (r > 0), count the number of pairs with d[X(i),X(j)] < r and take its ratio to the total number of vectors N − m, i.e.,

$$B_i^m(r)=\frac{\mathrm{num}\left\{d[X(i),X(j)]<r\right\}}{N-m},\qquad B^m(r)=\frac{1}{N-m+1}\sum_{i=1}^{N-m+1}B_i^m(r)$$

The sample entropy of the sequence is obtained by increasing the dimension to m + 1, computing $B^{m+1}(r)$ in the same way, and taking:

$$SampEn(m,r)=\lim_{N\to\infty}\left\{-\ln\frac{B^{m+1}(r)}{B^m(r)}\right\}$$

When N is a finite value, the estimate is:

$$SampEn(m,r,N)=-\ln\frac{B^{m+1}(r)}{B^m(r)}$$
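A compact sample entropy estimator can be written as follows. This sketch uses a slightly simplified template count (the same number of templates for m and m + 1) and an absolute threshold `r`; in practice r is often taken as a fraction of the series' standard deviation.

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """SampEn(m, r) = -ln(A/B): B and A count template matches of length
    m and m+1 under the Chebyshev distance, excluding self-matches."""
    x = np.asarray(x, dtype=float)
    N = len(x)

    def count(m):
        templates = np.array([x[i:i + m] for i in range(N - m)])
        c = 0
        for i in range(len(templates)):
            d = np.max(np.abs(templates - templates[i]), axis=1)
            c += np.sum(d < r) - 1          # exclude the self-match
        return c

    B, A = count(m), count(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

rng = np.random.default_rng(0)
regular = np.sin(np.linspace(0, 20 * np.pi, 500))   # predictable signal
noisy = rng.standard_normal(500)                    # irregular signal
```

A smooth periodic signal yields far more repeated patterns at length m + 1 than white noise does, so its sample entropy is much lower, which is exactly why SampEn ranks decomposed subsequences by complexity.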
Mutual information (MI) is mainly used in information theory to indicate the degree of correlation between two events [
35]. It is not easily disturbed by external factors, and its expression is as follows:
I(X; Y) = H(Y) − H(Y|X)
where X and Y are different events, H(Y) is the entropy of Y, and H(Y|X) is the conditional entropy of Y when X is known. The MI is then normalized to a common scale.
The larger the normalized MI value, the stronger the correlation between two events. As far as an IMF component is concerned, the richer the signal feature information it contains, the larger its MI value.
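A histogram-based sketch of mutual information between two equally long discrete sequences is shown below. Normalizing by H(Y) is one common choice adopted here for illustration; the paper's exact normalization formula is not reproduced in this text:

```python
import math
from collections import Counter

def normalized_mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X), divided by H(Y).

    Result lies in [0, 1] and grows with the correlation of x and y.
    """
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    mi = entropy(px) + entropy(py) - entropy(pxy)
    h_y = entropy(py)
    return mi / h_y if h_y > 0 else 0.0
```

Identical sequences give a value of 1, and statistically independent sequences give 0.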
A composite indicator SI is then established based on the sample entropy and the mutual information.
This indicator takes into account both the complexity of the IMF component and its feature information: when the IMF component contains richer feature information, the value of the composite indicator SI is smaller, so its minimum value is used as the fitness function of the SSA search.
4. Experimental Analysis
4.1. Data Source
Real operational data from the Sotavento Galicia wind farm in Spain during 2020 were used to validate the proposed method and model. The data used in this study can be obtained from the following website:
https://www.sotaventogalicia.com/en/technical-area/real-time-data/historical/ (accessed on 1 October 2024). To effectively verify the accuracy of the proposed model, four groups of wind power data from different seasons were selected, including data from March 1 to 31, May 1 to 31, August 1 to 31, and November 1 to 30. The data sampling interval was 10 min. The model in this study focuses on ultra-short-term wind power prediction. It inputs historical wind power data from the past 150 min to forecast wind power for the next 10 min, with the data organized and fed into the model using a sliding window approach.
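The sliding-window organization described above can be sketched as follows. With the 10-min sampling interval, a history of 15 points covers the past 150 min, and a horizon of 1 point is the next 10-min value; `make_windows` is an illustrative helper, not the paper's code:

```python
def make_windows(series, history=15, horizon=1):
    """Build (input, target) pairs with a sliding window over a power series."""
    inputs, targets = [], []
    for start in range(len(series) - history - horizon + 1):
        inputs.append(series[start:start + history])          # past 150 min
        targets.append(series[start + history + horizon - 1]) # next 10-min value
    return inputs, targets
```

For example, a 4019-point seasonal dataset yields 4004 input/target pairs with these defaults.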
During data collection and processing, some zero and missing data points may occur, significantly affecting the accuracy of wind power prediction. Therefore, it is necessary to preprocess the raw wind power data using zero removal and mean interpolation. After preprocessing, the spring dataset contained 4019 data points, the summer dataset contained 3930 data points, the autumn dataset contained 3413 data points due to the removal of numerous zero values, and the winter dataset contained 3980 data points.
To systematically validate the prediction performance of the proposed model, the study employed a train-validation-test three-way split of the dataset, strictly adhering to the principle of “independent partitioning, no crossover” to mitigate the risk of data leakage. Specifically: the training set was used for supervised learning of model parameters, the validation set for optimization and calibration of hyperparameters, and the test set served as independent, unseen data to evaluate the model’s generalization capability.
Considering the seasonal fluctuations characteristic of wind power, the study selected representative samples from the original data across four seasons, constructing four seasonal datasets (spring, summer, autumn, winter) to test the model’s adaptability under different climatic scenarios. For outliers in the data, negative power values were first uniformly replaced with 0. For other anomalies, namely points that deviated significantly from surrounding values or fell well outside the normal range, a correction was applied by substituting the average of the two adjacent data points.
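These cleaning rules can be sketched in a few lines. The deviation threshold used to flag an anomaly is a hypothetical value, since the text does not give a numeric criterion:

```python
def preprocess(power, threshold=3.0):
    """Clean a raw power sequence following the rules described in the text.

    Negative readings are replaced with 0. A point deviating from BOTH
    neighbours by more than `threshold` (a hypothetical limit, in the data's
    units) is replaced by the mean of the two adjacent points.
    """
    cleaned = [max(v, 0.0) for v in power]
    for i in range(1, len(cleaned) - 1):
        prev_v, next_v = cleaned[i - 1], cleaned[i + 1]
        if abs(cleaned[i] - prev_v) > threshold and abs(cleaned[i] - next_v) > threshold:
            cleaned[i] = (prev_v + next_v) / 2.0
    return cleaned
```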
For the preprocessed datasets, the partitioning rules for each seasonal dataset are as follows:
Spring, Summer, and Winter datasets: Divided into training and test sets in a 9:1 ratio based on the total sample size. Additionally, 10% of the training set was extracted to form the validation set for hyperparameter optimization and selection.
Autumn dataset: Due to a relatively smaller sample size after preprocessing, the training-to-validation ratio was adjusted to 7.5:2.5 (i.e., of the non-test samples, the training set accounts for approximately 75% and the validation set for about 25%), while the test set remained at 10% of the total samples.
The dataset division is shown in
Figure 8.
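The partitioning rules above can be sketched as a chronological split; contiguous blocks avoid leakage between adjacent sliding windows. The helper and its defaults (the spring/summer/winter ratios) are illustrative; the autumn dataset would use different fractions:

```python
def split_dataset(samples, test_frac=0.1, val_frac_of_train=0.1):
    """Chronological train/validation/test split ("independent, no crossover").

    Defaults match the spring/summer/winter rule: the final 10% of samples
    form the test set, and 10% of the remaining training portion is held
    out (at its end) as the validation set.
    """
    n = len(samples)
    n_test = int(n * test_frac)
    train_val = samples[:n - n_test]
    test = samples[n - n_test:]
    n_val = int(len(train_val) * val_frac_of_train)
    train = train_val[:len(train_val) - n_val]
    val = train_val[len(train_val) - n_val:]
    return train, val, test
```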
4.2. Data Preprocessing and Experimental Settings
First, the data were normalized. This process helps improve the convergence speed of the model. Our model adopts the min-max normalization method, defined by the following formula:
x_norm = (x − x_min) / (x_max − x_min)
where x represents a data point in the dataset, i.e., the value to be normalized; x_norm is the normalized value; and x_min and x_max represent the minimum and maximum values of the data to be normalized, respectively.
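A minimal sketch of this normalization, including the inverse mapping needed to report predictions back in the original power units (the helper names are illustrative):

```python
def minmax_fit(values):
    """Return (x_min, x_max); statistics should come from the training data only."""
    return min(values), max(values)

def minmax_transform(x, x_min, x_max):
    """Scale a value into [0, 1] using the fitted statistics."""
    return (x - x_min) / (x_max - x_min)

def minmax_inverse(x_norm, x_min, x_max):
    """Map a normalized prediction back to the original units."""
    return x_norm * (x_max - x_min) + x_min
```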
The experiments were conducted on a PC equipped with an AMD Ryzen 9 7950X 16-Core Processor (4.50 GHz) and 64.00 GB of RAM, using Python 3.9. The BiTCN-BiGRU module was developed with the TensorFlow/Keras framework and trained using the Adam optimizer to balance training efficiency and stability. The model comprises five trainable layers (including the output layer), arranged in the order of data flow as follows: a Bidirectional Temporal Convolutional Network Layer (BiTCN Layer) to extract local temporal features; a Bidirectional Gated Recurrent Unit Layer (BiGRU Layer) to capture bidirectional long-term temporal dependencies; a Dense Layer for feature dimensionality reduction and nonlinear transformation; a LeakyReLU activation layer to introduce nonlinearity and alleviate gradient vanishing; and an Output Layer to generate prediction results. The loss function is set to the Mean Squared Error (MSE) metric.
For all four seasons, uniform default parameters are configured as follows: nb_filters = 64, kernel_size = 2, BiGRU_units = 50, num_epochs = 30, and batch_size = 32. Subsequently, the IMGWO optimization algorithm is employed to optimize a subset of the model’s hyperparameters, aiming to achieve optimal prediction performance.
4.3. SSA-VMD-SE-CEEMDAN Multi-Layer Data Decomposition and Reconstruction
The number of modes determines the performance of VMD. When the number of modes is small, VMD tends to filter out important information in the wind power sequence, affecting the performance of the prediction model. Conversely, when the number of modes is large, centre frequencies of some IMFs may overlap, leading to mode mixing or generating additional noise [
36]. By using the SSA to solve and optimize the constructed composite optimization index, both sample entropy and mutual information are considered simultaneously, yielding the optimal VMD parameters.
The wind power sequence is decomposed into multiple subsequences to eliminate noise in the original data, extract main features, and perform adaptive decomposition. Through multiple SSA-VMD experiments, the optimal combination of the decomposition mode number K and the quadratic penalty term α was determined. For the data decomposition and reconstruction steps of the proposed method, all operations are uniformly performed using MW as the unit. The population size of the SSA (Sparrow Search Algorithm) solver is set to 30, and the maximum number of iterations is set to 20. The optimization objective of the SSA is the composite evaluation metric constructed in this study, with the search intervals for K and α set to [5, 10] and [1000, 2500], respectively. The SSA-optimized VMD parameter results are summarized in
Table 2, and the VMD decomposition results for each season are shown in
Figure 9.
After performing SSA-VMD on the wind power sequence, the sample entropy of each IMF for each season was calculated, as shown in
Table 3. By measuring the complexity of the decomposed sequences and reconstructing them, redundant information is reduced, further improving prediction performance. As shown in
Table 3, for the spring decomposition results, the complexity of the subsequences generally increases gradually, with IMF1 having the lowest SE value and complexity (0.0136) and IMF6 having the highest SE value and complexity (0.5565). Based on the similarity of SE values, IMF3 and IMF4 were merged into a new component IMF3. For the summer season, nine IMFs were obtained after SSA-VMD decomposition. Due to the similarity of SE values between IMF1 and IMF2, they were merged into a new component IMF1; similarly, IMF5 and IMF6 were merged into IMF4, and IMF7 and IMF8 were merged into IMF5. For the autumn season, seven IMFs were obtained after SSA-VMD decomposition, and IMF1 and IMF2 were merged into IMF1 due to their similar SE values. For the winter season, nine IMFs were obtained after SSA-VMD decomposition, with IMF1 and IMF2 merged into IMF1, IMF5 and IMF6 merged into IMF4, and IMF7 and IMF8 merged into IMF5. By merging VMD-decomposed components based on sample entropy similarity, prediction complexity is reduced, avoiding redundancy.
Through the reconstruction of VMD components with similar sample entropy values, six reconstructed components were obtained. The residual sequence IMF6 retains a high SE value after reconstruction, and, as can be seen from the VMD decomposition diagrams for each season, it exhibits complex sequence information and significant fluctuations. Directly predicting this high-complexity sequence may increase the error of the prediction model; therefore, CEEMDAN is used to perform a secondary decomposition on it to reduce its complexity.
Similarly, in order to save computational costs and reduce data redundancy, the sample entropy values of the sequences after CEEMDAN decomposition are calculated for component reconstruction. The statistics of sample entropy and the reconstructed sequences after secondary decomposition for each season are shown in
Table 4. The first few components after CEEMDAN decomposition have higher complexity, while the complexity of the subsequent components decreases gradually. Based on the theory of sample entropy, the decomposed components are reconstructed according to their similar SE values. For the spring season, nine decomposition components were obtained after CEEMDAN decomposition. IMF5~IMF9 have high similarity in SE values and low complexity; therefore, IMF5~IMF9 are merged into a new component IMF5. For the summer season, twelve decomposition components were obtained after CEEMDAN decomposition. IMF3~IMF5 have high similarity in SE values and similar complexity; thus, IMF3~IMF5 are merged into a new component IMF3. Similarly, since the SE values of IMF7~IMF12 are small and their complexity is low, IMF7~IMF12 are merged into a new component IMF5. By reconstructing components with similar SE values, five sequences are obtained after reconstruction. For the autumn season, nine decomposition components were obtained after CEEMDAN decomposition. IMF3 and IMF4 have high similarity in SE values and similar complexity; thus, IMF3 and IMF4 are merged into a new component IMF3. IMF5 and IMF6 also have high similarity in SE values and similar complexity; thus, IMF5 and IMF6 are merged into a new component IMF4. Similarly, since the SE values of IMF7~IMF9 are small and their complexity is low, IMF7~IMF9 are merged into a new component IMF5. After reconstruction based on components with similar SE values, five sequences are obtained. For the winter season, ten decomposition components were obtained after CEEMDAN decomposition. IMF3 and IMF4 have high similarity in SE values and similar complexity; thus, IMF3 and IMF4 are merged into a new component IMF3. IMF5 and IMF6 also have high similarity in SE values and similar complexity; thus, IMF5 and IMF6 are merged into a new component IMF4. 
Similarly, since the SE values of IMF7~IMF10 are small and their complexity is low, IMF7~IMF10 are merged into a new component IMF5. After reconstruction based on components with similar SE values, five sequences are obtained. The results of the multilayer decomposition of the sample entropy are shown in
Figure 10. The sequences after CEEMDAN decomposition and reconstruction for each season are shown in
Figure 11,
Figure 12,
Figure 13 and
Figure 14.
The components obtained after the first VMD, the secondary CEEMDAN decomposition, and the reconstruction based on similar sample entropy are combined, resulting in a total of 10 decomposed components. These components serve as the final input sequences for the prediction model. The final reconstructed and combined sequences for each season are shown in
Figure 15,
Figure 16,
Figure 17 and
Figure 18.
4.4. Verification of the Performance of the IMGWO Optimization Algorithm
To evaluate and measure the optimization performance and generalization ability of the proposed IMGWO algorithm, it was compared with several commonly used optimization algorithms. A comparative analysis was conducted using a control group of five optimization algorithms: PSO, DBO, GA, SSA, and GWO. To ensure a fair comparison, a uniform parameter set was applied to all optimizers, with the population size and maximum iterations fixed at 100 and 1000, respectively. Additionally, multiple test functions from CEC2005 were employed to further validate the optimization performance and convergence speed of the algorithms. The selected test functions are shown in
Table 5, and their function graphs are illustrated in
Figure 19.
After conducting multiple simulation optimization experiments, the performance of the proposed IMGWO optimization algorithm was further verified. Eight test functions were selected from the test function set for optimization testing. Optimization fitness iteration curves of the IMGWO algorithm and the control group algorithms are shown in
Figure 20. As can be observed from
Figure 20, the proposed IMGWO optimization algorithm demonstrates faster convergence speed compared to the control group algorithms and exhibits a stronger ability to explore global optima. Moreover, compared to the unimproved GWO algorithm, the improved IMGWO algorithm shows significantly better optimization performance across all test functions.
The IMGWO optimization algorithm proposed in this article was applied to optimize the hyperparameters of the constructed prediction model. By effectively optimizing the hyperparameters within the given range, the optimized parameters were assigned to the prediction model to enhance its predictive performance. This approach ensures the rationality and effectiveness of the model’s parameter selection, avoiding a decline in prediction performance caused by inappropriate hyperparameter choices. Thus, the proposed IMGWO algorithm was utilized to optimize the hyperparameters of the constructed BiTCN-BiGRU prediction model.
To achieve optimal performance for the BiTCN-BiGRU prediction model, this study employs an Improved Grey Wolf Optimization algorithm (GWO-GSS-Levy) for the automatic optimization of key hyperparameters. The validation set Root Mean Square Error (RMSE) is adopted as the fitness function, with the objective of minimizing the model’s prediction error. The hyperparameters to be optimized fall into two categories, model structure and training parameters, totaling four variables: (1) the number of TCN filters (nb_filters), which controls the feature extraction channels of the temporal convolutional network, with a range set to [32, 128]; (2) the number of BiGRU hidden units (BiGRU_units), which determines the nonlinear expressive capacity of the bidirectional gated recurrent unit, with a range of [30, 200]; (3) the number of training epochs (num_epochs), i.e., the number of model iteration rounds, with a range of [20, 50]; and (4) the batch size (batch_size), representing the number of samples per training batch, with a range of [8, 32]. The hyperparameter boundaries are explicitly defined in problem_dict as bounds = [(32, 128), (30, 200), (20, 50), (8, 32)], ensuring that the search space covers a reasonable parameter range.
An improved GWO algorithm incorporating multiple strategies is utilized to search for optimal hyperparameters. The process begins with population initialization: N = 10 individuals with dimension dim = 4 are randomly generated within the hyperparameter boundaries. In each generation, grey wolf positions are updated through the following mechanisms: basic GWO updating (with linearly decreasing parameter a balancing exploration and exploitation, guided by Alpha, Beta, and Delta positions), the Golden Sine Strategy (GSS, which introduces sine function periodicity to adjust positions and enhance local search), the Opposition-Based Learning Strategy (which generates opposition-based solutions for 20% of individuals every 5 generations to broaden the search range), and the Lévy Flight Strategy (which randomly selects 20% of individuals and superimposes heavy-tailed distribution step sizes to enhance global exploration). The fitness function of the new positions is computed in each generation, updating both the individual historical best and the global optimal positions.
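For illustration, the basic GWO update that these strategies extend can be sketched in pure Python (demonstrated on a sphere function; the Golden Sine, opposition-based learning, and Lévy flight steps of the full IMGWO are omitted, and all names here are illustrative):

```python
import random

def gwo_minimize(objective, bounds, n_wolves=10, max_iter=50, seed=0):
    """Minimal basic Grey Wolf Optimizer.

    The control parameter `a` decays linearly from 2 to 0 to shift from
    exploration to exploitation; each wolf moves toward positions proposed
    by the three best wolves (alpha, beta, delta).
    """
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]

    def clip(pos):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(pos, bounds)]

    for t in range(max_iter):
        ranked = sorted(wolves, key=objective)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        a = 2.0 * (1.0 - t / max_iter)  # linearly decreasing 2 -> 0
        new_wolves = []
        for w in wolves:
            pos = []
            for d in range(dim):
                pulls = []
                for leader in (alpha, beta, delta):
                    A = a * (2.0 * rng.random() - 1.0)
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])
                    pulls.append(leader[d] - A * D)
                pos.append(sum(pulls) / 3.0)  # average of the three pulls
            new_wolves.append(clip(pos))
        wolves = new_wolves
    return min(wolves, key=objective)
```

In hyperparameter tuning, `objective` would be the validation-set RMSE of a model trained with the candidate parameters, and `bounds` the hyperparameter ranges listed above.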
For the IMGWO optimization applied across all four seasons, the maximum number of iterations is set to 13 and the population size to 10; since the optimization is performed in a four-dimensional space, these settings were determined through multiple experimental trials. The optimization ranges for each parameter, as well as the optimized hyperparameters obtained through the IMGWO algorithm, are summarized in
Table 6.
4.5. Analysis of Wind Power Prediction Results
To further validate the effectiveness and correctness of the model presented in this article, multiple comparative models were selected to compare their performance with the proposed model. Additionally, predictions from models that did not undergo secondary decomposition or only underwent single decomposition were considered for performance comparison. These included models such as TCN-BiGRU, BiTCN-BiGRU, CNN-BiLSTM-AM, XGBOOST, VMD-BiTCN-BiGRU, CEEMDAN-BiTCN-BiGRU, and VMD-CEEMDAN-BiTCN-BiGRU. The proposed model was compared with these models on the test datasets used in this article for wind power prediction, which included data from spring, summer, autumn, and winter. By separately constructing models and making predictions for each season’s dataset, the effectiveness and generalization performance of the proposed method could be more effectively validated.
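The four evaluation metrics used to score the models can be computed as follows (a minimal sketch; `regression_metrics` is an illustrative helper, not the paper's code):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, RMSE, R^2) for a pair of equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(mse)
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return mse, mae, rmse, r2
```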
The prediction errors for the spring season are shown in
Table 7. Compared to other widely used models currently studied, the model proposed in this article demonstrated the best predictive performance. Specifically, the MSE, MAE, and RMSE metrics were 8.1, 69.3, and 89.9, respectively, while the R2 value reached 0.9905. To intuitively display the error performance of the model, bar charts and radar charts were used for error comparisons, as shown in
Figure 21 and
Figure 22. Compared to other single models, the proposed model in this article achieved the smallest MSE, MAE, and RMSE values and the highest R2 value, demonstrating the effectiveness of the proposed model.
Similarly, for the summer season, the prediction errors are shown in
Table 8. The proposed model in this article demonstrated the best predictive performance compared to other comparative models. Specifically, the MSE, MAE, and RMSE metrics were 6.8, 64.9, and 81.7, respectively, while the R2 value reached 0.9978. Compared to the control group models, all error metrics were better controlled, and prediction accuracy was effectively improved. To intuitively display the error performance of the model, bar charts and radar charts were used for error comparisons, as shown in
Figure 23 and
Figure 24. Compared to other single models, the proposed model in this article achieved the smallest MSE, MAE, and RMSE values and the highest R2 value, demonstrating the effectiveness of the proposed model.
For the wind power prediction in autumn, the prediction errors are shown in
Table 9. The proposed model in this article demonstrated the best predictive performance compared to other comparative models. Specifically, the MSE, MAE, and RMSE metrics were 14.9, 83.4, and 122.4, respectively, while the R2 value reached 0.9963. Compared to the control group models, all error metrics were better controlled, and prediction accuracy was effectively improved. To intuitively display the error performance of the model, bar charts and radar charts were used for error comparisons, as shown in
Figure 25 and
Figure 26. Compared to other single models, the proposed model in this article achieved the smallest MSE, MAE, and RMSE values and the highest R2 value, demonstrating the effectiveness of the proposed model.
For the wind power prediction in winter, the prediction errors are shown in
Table 10. The proposed model in this article demonstrated the best predictive performance compared to other comparative models. Specifically, the MSE, MAE, and RMSE metrics were 0.0116, 0.0755, and 0.1079, respectively, while the R2 value reached 0.9988. Compared to the control group models, all error metrics were better controlled, and prediction accuracy was effectively improved.
To intuitively display the error performance of the model, bar charts and radar charts were used for error comparisons, as shown in
Figure 27 and
Figure 28. Compared to other single models, the proposed model in this article achieved the smallest MSE, MAE, and RMSE values and the highest R2 value, demonstrating the effectiveness of the proposed model.
4.6. Ablation Study
To further validate the rationality of the proposed method and the effectiveness of interactions between the introduced modules, ablation experiments were conducted by designing baseline models and performing modular comparative tests. The aim is to verify the contribution of each component to the overall model performance. The specific experimental configurations are as follows: ① Baseline + VMD, ② Baseline + CEEMDAN decomposition, ③ Baseline + two-layer decomposition and reconstruction, ④ Baseline + IMGWO optimization algorithm. Here, the baseline model is defined as the BiTCN-BiGRU prediction model. The results of the ablation experiments are summarized in
Table 11.
The ablation study reveals that each added module plays a critical role in enhancing the overall prediction accuracy of the baseline BiTCN-BiGRU forecasting model for wind power. The baseline prediction model itself demonstrates high prediction accuracy across all four seasons. Furthermore, by incorporating a two-layer decomposition and reconstruction technique and using the reconstructed sequence data for prediction, the model’s accuracy is further improved compared to using a single decomposition method. Specifically, the R2 values for the baseline model combined with the two-layer decomposition and reconstruction technique reach 0.9791, 0.9976, 0.9949, and 0.9982 for the four seasons, respectively.
After optimizing the model’s hyperparameters using the IMGWO proposed in this study, the prediction errors for each season are further reduced. For spring, the MSE, MAE, and RMSE values are reduced to 8.1, 69.3, and 89.9, respectively, with an R2 of 0.9905. For summer, the MSE, MAE, and RMSE values are 6.8, 64.9, and 81.7, respectively, with an R2 of 0.9978. For autumn, the MSE, MAE, and RMSE values are 14.9, 83.4, and 122.4, respectively, with an R2 of 0.9963. For winter, the MSE, MAE, and RMSE values are 11.6, 75.5, and 107.9, respectively, with an R2 of 0.9988. Through the ablation experiments conducted for each module, the superiority of the proposed technical approach is further validated. The model and methods introduced in this study contribute to enhancing the prediction accuracy of wind power forecasting.
5. Conclusions
This study proposes a prediction method and model tailored for ultra-short-term wind power forecasting. By constructing a hybrid ultra-short-term wind power prediction model based on multi-layer data decomposition–reconstruction and IMGWO-optimized BiGRU-BiTCN, the accuracy of ultra-short-term wind power forecasting is effectively improved.
The VMD technique was employed to perform the primary decomposition of raw wind power data, reducing the noise interference present in the data. The SSA was used to optimize and solve for the optimal values of K and α. Subsequently, the sample entropy theory was applied to reconstruct the decomposed sequences. For the last decomposed sequence containing complex information, the CEEMDAN technique was used for secondary decomposition, followed by sequence reconstruction.
To prevent the reliability of the prediction model from decreasing due to improper parameter settings, an optimization algorithm was utilized to solve for multiple hyperparameters of the BiTCN-BiGRU prediction model. The GWO algorithm was systematically enhanced with multiple strategies to improve its global search performance and convergence rate, leading to a marked increase in optimization efficacy.
The evaluation employed actual operational data from a wind farm to verify the predictive performance of the proposed method through simulations. Four evaluation metrics were employed, and datasets covering four seasons—spring, summer, autumn, and winter—were constructed for model validation. For spring, the MSE, MAE, and RMSE prediction error metrics reached 8.1, 69.3, and 89.9, respectively, with an R2 value of 0.9905. For summer, the MSE, MAE, and RMSE values were 6.8, 64.9, and 81.7, respectively, with an R2 of 0.9978. For autumn, the MSE, MAE, and RMSE values were 14.9, 83.4, and 122.4, respectively, with an R2 of 0.9963. For winter, the MSE, MAE, and RMSE values were 11.6, 75.5, and 107.9, respectively, with an R2 of 0.9988.
Although the model proposed in this study demonstrates strong performance in wind power forecasting tasks, it also has certain limitations. The forecasting model constructed in this work relies heavily on computational resources. Its high performance depends on a complex network architecture and bidirectional feature fusion mechanisms, which significantly increase computational complexity. Additionally, the model training process requires high-quality time-series data. Moreover, due to the use of optimization algorithms for hyperparameter tuning—which necessitates repeated calls to the model training interface—parameter calibration on large-scale datasets can be time-consuming. The parameters obtained and the model constructed may also be susceptible to potential overfitting issues.
In the future, the forecasting method proposed in this study will be further applied to more wind farms to validate its generalization performance. Subsequent research can focus on addressing these challenges and limitations.