1. Introduction
In the context of global efforts to achieve carbon neutrality and peak carbon emissions, the development of sustainable and low-emission energy systems has gained widespread international attention. As a result, wind energy has seen a significant rise in its contribution to electricity generation networks [1]. Nevertheless, the extensive integration of wind-based generation poses considerable operational challenges for power grids [2]. Accurate wind power forecasting, which uses historical weather patterns and generation records to minimise the destabilising effects of renewable energy integration, enhances grid reliability and economic efficiency [3]. Such predictive capability plays a critical role in ensuring grid stability and optimising wind farm operations [4].
At this stage, wind power prediction faces two main challenges [5]: (1) historical wind farm data contain many interacting, redundant features, and feeding them directly into a prediction model inflates the prediction error and degrades model performance; (2) traditional prediction models lack the adaptability and robustness needed to fit the nonlinear relationship between wind power data and output power. According to the mathematical model employed, wind power prediction methods fall into three categories [6]: physical models, statistical models, and learning models. Physical models solve physics equations to simulate wind field behaviour and combine the results with power curves to make predictions; they are complex to build and of limited use for short-term prediction [7]. Statistical models analyse historical operation data to establish a mapping between historical data and power, and use this mapping to predict future power. However, these methods have a limited ability to capture the deep features of wind power data, and their generalisation performance needs improvement [8,9,10]. With the development of artificial intelligence, learning models based on deep learning have received wide attention in wind power prediction owing to their stronger data mining ability and their capacity for continuous learning and correction [11].
To address these two challenges, researchers have used various signal decomposition algorithms to analyse the relationships between signals in different frequency bands. Study [12] introduced a wind power forecasting approach that combines empirical mode decomposition (EMD) with radial basis function (RBF) neural networks. The experimental findings demonstrated that EMD preprocessing significantly enhances prediction performance: the RMSE of the traditional RBF neural network was 17.04, while that of the EMD-RBF network was 11.61, a 31.87% reduction after EMD decomposition. Nevertheless, empirical mode decomposition suffers from inherent mode-mixing limitations. To address this issue, reference [13] employed variational mode decomposition (VMD) for signal preprocessing before feeding the resulting subcomponents into an enhanced gated recurrent unit network for wind generation forecasting. The results showed that VMD effectively avoids modal overlap, with an average error reduction of over 50% compared to single LSTM and single GRU prediction models, and of approximately 40% compared to conventional multi-dimensional VMD-GRU models. However, because the preset VMD parameters strongly affect decomposition performance, the lack of sound evaluation criteria to guide parameter setting hinders the application of VMD in power prediction [14]. In addition, applying various deep learning algorithms to wind power prediction is a research hotspot. Study [15] used Long Short-Term Memory (LSTM) networks to address the inherent issues of neural networks, such as getting stuck in local minima and gradient vanishing, achieving an accuracy of 99.63% and reducing the RMSE by 26.17% compared to traditional backpropagation (BP) models. Study [16] used gated recurrent units (GRUs) to reduce the number of parameters and the computational cost by controlling information flow and state updates; the prediction interval coverage probability (PICP) improved to 96.40%, outperforming traditional models (SVM: 94.72%, KELM: 96.11%, ANN: 95.67%), while the prediction interval width was reduced by 58.5% to 73.9%. Convolutional neural networks (CNNs) have received increasing attention since the AlexNet model won the image classification competition in 2012 [17]. Study [18] implemented a CNN-based deep learning framework to forecast wind energy output, reducing prediction errors by 2% to 4% compared to traditional models. However, wind power is affected by several factors jointly, so the model inputs are multivariate time series; traditional CNNs handle such high-dimensional data well, but their parameter tuning is slow.
To address the shortcomings of existing methods in feature extraction and model parameter optimisation, and to provide a more efficient solution for ultra-short-term wind power forecasting, this study proposes a wind power forecasting method based on feature fusion and an improved convolutional neural network. First, kernel principal component analysis (KPCA) is applied to the historical meteorological dataset to extract effective meteorological kernel principal components. Then, dynamic mode decomposition (DMD) is employed to extract modal features from the historical power data. The meteorological kernel principal components are fused with the power modal features to form a new sample dataset. This combination incorporates the meteorological factors influencing power output and addresses the inability of single features to reflect the multifactor coupling relationships in wind power comprehensively. Subsequently, the snow ablation optimiser (SAO) is used to optimise the CNN hyperparameters, overcoming the reliance of traditional methods on empirically defined parameters. Additionally, a self-attention mechanism is introduced to enhance the global modelling capability of the CNN, significantly improving prediction accuracy.
2. Fundamental Theory
2.1. Kernel Principal Component Analysis
As an enhanced variant of conventional principal component analysis, kernel PCA (KPCA) effectively addresses nonlinear feature extraction challenges through kernel-based transformation. Its idea is to use a nonlinear mapping function to project the samples of the original data into a high-dimensional feature space, which is then analysed by PCA; finally, feature selection is achieved by transforming the dot-product operation into a kernel calculation in the original space [19,20]. The main operation formulas are as follows:

$$K_{ij} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$

$$\tilde{K} = K - \mathbf{1}_M K - K \mathbf{1}_M + \mathbf{1}_M K \mathbf{1}_M$$

where $M$ represents the number of samples, $\mathbf{1}_M$ denotes the $M \times M$ matrix whose entries all equal $1/M$, and $\tilde{K}$ represents the centred kernel matrix.
After the conversion of the above formulas, it is possible to extract the principal components using the general PCA method, after which the projection of a data point on the eigenvectors is calculated to obtain the kernel principal components of the point.
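To make the procedure concrete, the following is a minimal sketch in Python, assuming scikit-learn's KernelPCA with an RBF kernel. The paper's MATLAB implementation and its kernel parameter c = 20,000 do not map one-to-one onto the gamma used here; the placeholder data and all parameter values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA

# Stand-in for the M x 11 matrix of meteorological samples
X_met = np.random.rand(2976, 11)
X_std = StandardScaler().fit_transform(X_met)

# RBF kernel PCA; gamma plays the role of the kernel parameter (assumed value)
kpca = KernelPCA(kernel="rbf", gamma=1e-4, n_components=11)
scores = kpca.fit_transform(X_std)

# Keep the leading components whose cumulative contribution (relative to the
# retained components) exceeds 85%, as in Section 4.2
contrib = kpca.eigenvalues_ / kpca.eigenvalues_.sum()
n_keep = np.searchsorted(np.cumsum(contrib), 0.85) + 1
kernel_pcs = scores[:, :n_keep]   # e.g., y1..y4 in the paper
```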
2.2. Dynamic Mode Decomposition
Peter Schmid proposed dynamic mode decomposition (DMD) in 2008, and it has since been widely used in the study of various nonlinear dynamical systems, such as the construction and analysis of hydrodynamic systems. Because of its superior data mining capability, it has also been applied to data dimensionality reduction and time series prediction in recent years. It is based on the idea of fitting a nonlinear problem by a system of multiple linear equations [21,22,23].
Consider a set of observations containing m data points. In this paper, we use the Hankel matrix to transform the observations into a higher-dimensional space; it is constructed as follows:

$$H = \begin{bmatrix} y_1 & y_2 & \cdots & y_{m-d+1} \\ y_2 & y_3 & \cdots & y_{m-d+2} \\ \vdots & \vdots & \ddots & \vdots \\ y_d & y_{d+1} & \cdots & y_m \end{bmatrix}$$

where $m$ represents the number of power points and $d$ represents the dimension of the delay embedding: time-shifted copies of the scalar time series $y$ are stacked on top of each other to form the Hankel matrix $H$.
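A minimal NumPy sketch of this delay embedding (the function name and toy series are illustrative):

```python
import numpy as np

def hankel_embed(y: np.ndarray, d: int) -> np.ndarray:
    """Stack d time-shifted copies of the scalar series y into a d x (m-d+1) Hankel matrix."""
    m = len(y)
    return np.stack([y[i : m - d + 1 + i] for i in range(d)])

y = np.arange(8.0)          # toy power series with m = 8 points
H = hankel_embed(y, d=3)    # shape (3, 6)
```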
A snapshot of the data $X = [x_1, x_2, \ldots, x_m]$ is constructed, and the purpose of DMD is to extract important dynamic information from these data. By constructing a sliding matrix, the following two matrices can be defined from $X$:

$$X_1 = [x_1, x_2, \ldots, x_{m-1}], \qquad X_2 = [x_2, x_3, \ldots, x_m]$$

The Koopman operator idea is that there always exists a matrix $A$ such that the next-moment state can be represented by the previous-moment state, that is, $x_{k+1} = A x_k$. Thus, the relationship between $X_1$ and $X_2$ can be expressed as:

$$X_2 = A X_1$$

$x_m$ in $X_2$ can be represented by a weighted linear combination of the $x_i$ column vectors of $X_1$:

$$x_m = a_1 x_1 + a_2 x_2 + \cdots + a_{m-1} x_{m-1} + r = X_1 a + r$$

where $a$ represents the weight vector and $r$ represents the residual vector. The relationship between $X_1$ and $X_2$ can then be rewritten as:

$$X_2 = A X_1 = X_1 S + r e_{m-1}^{T}$$

where $e_{m-1}$ represents the unit vector; the optimal eigenvalues and eigenvectors are found by minimising the residual matrix $r$, so that the system prediction is closest to the target. $S$ represents the low-dimensional approximation (companion) matrix, whose eigenvalues approximate those of $A$. The weight vector $a$ is obtained by least squares:

$$a = \left(X_1^{T} X_1\right)^{-1} X_1^{T} x_m$$
Since the matrix $S$ may be ill-conditioned and difficult to solve directly, it is common to solve its similarity matrix $\tilde{S}$ instead, which is obtained by applying Singular Value Decomposition (SVD) to the data matrix $X_1$ and projecting $X_2$:

$$X_1 = U \Sigma V^{*}, \qquad \tilde{S} = U^{*} X_2 V \Sigma^{-1}$$

The eigenvalues $\lambda_k$ of the similarity matrix $\tilde{S}$ contain the dynamic characteristics of the system state evolution, and their magnitudes reflect how the system changes within the time step $\Delta t$. They satisfy:

$$\tilde{S} w_k = \lambda_k w_k$$

The dynamic modes of the system can be characterised by the $k$th eigenvector $\phi_k$ of the matrix $A$. Based on the equivalence property of similar matrices, the following relationship exists:

$$\phi_k = U w_k$$
A complete reconstruction of the system state is possible based on the eigenvectors $\phi_k$ and eigenvalues $\lambda_k$. The cross-section data of the system at any moment can be obtained by the eigenvalue transformation:

$$x_k = \Phi \Lambda^{k-1} b = \sum_{i} b_i \phi_i \lambda_i^{k-1}$$

where $b_i$ represents the initial amplitude of mode $i$; $\Phi$ represents the matrix whose columns are the eigenvectors $\phi_i$; $\Lambda$ represents the diagonal matrix with diagonal elements $\lambda_i$; and $b$ represents the vector consisting of the $b_i$.
DMD thus captures a nonlinear dynamic situation with low-dimensional dynamic features, and its dynamic modes reflect the state of the system at any given moment.
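The steps above condense into a short NumPy sketch — a reconstruction of the standard exact-DMD recipe under the notation of this section, with the truncation rank as an illustrative choice:

```python
import numpy as np

def dmd(X: np.ndarray, rank: int):
    """DMD of a snapshot matrix X = [x1 ... xm] (columns are states)."""
    X1, X2 = X[:, :-1], X[:, 1:]

    # SVD of X1 and rank-r truncation
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]

    # Low-dimensional similarity matrix S_tilde = U* X2 V Sigma^-1
    S_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)

    # Eigenvalues carry the dynamics; modes are lifted back via Phi = U W
    lam, W = np.linalg.eig(S_tilde)
    Phi = U @ W

    # Modal initial amplitudes b from the first snapshot: x1 ~ Phi b
    b = np.linalg.lstsq(Phi, X[:, 0], rcond=None)[0]
    return Phi, lam, b

def reconstruct(Phi, lam, b, k):
    """State at step k (0-based): x_k = Phi Lambda^k b."""
    return Phi @ (lam**k * b)
```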
2.3. Snow Ablation Optimisation Algorithm
The snow ablation optimiser (SAO) is a new physics-based optimisation algorithm inspired by the sublimation and melting behaviour of snow in nature [24,25]. Its algorithmic process is divided into the following four steps:
Initialisation phase: The population is initialised; a random batch of particles is generated and divided equally into two subpopulations:

$$Z_i = L + \theta \times (U - L), \qquad \theta \sim U(0, 1)$$

where $L$ and $U$ denote the lower and upper bounds of the search space.
Exploration phase: The first subpopulation performs the positional update of the exploration phase, i.e., the sublimation process, in which water molecules show a highly dispersed character; because of the irregularity of the motion, this stochastic process is modelled with Brownian motion, which searches for potential optimal solutions in the space with dynamic, uniform step sizes. At this stage, $N_a$ individuals of the population are randomly selected to form the subpopulation $Z(t)$, and each individual's position is updated as follows:

$$Z_i(t+1) = \mathrm{Elite}(t) + BM_i(t) \otimes \left[\theta_1 \times \left(G(t) - Z_i(t)\right) + (1 - \theta_1) \times \left(\bar{Z}(t) - Z_i(t)\right)\right]$$

where $\mathrm{Elite}(t)$ represents an elite individual, randomly selected from the set $\{G(t), Z_{\mathrm{second}}(t), Z_{\mathrm{third}}(t), \bar{Z}_{c}(t)\}$; $G(t)$, $Z_{\mathrm{second}}(t)$, and $Z_{\mathrm{third}}(t)$ represent the best, second-best, and third-best individuals in the current $Z(t)$; $\bar{Z}_{c}(t)$ represents the individual at the centre of mass of the top 50% of individuals by fitness value; $BM_i(t)$ is a vector of random numbers based on a Gaussian distribution, denoting Brownian motion; $\otimes$ denotes entry-wise multiplication; $\bar{Z}(t)$ denotes the individual at the position of the centre of mass of $Z(t)$; and $\theta_1$ denotes a random number in $(0, 1)$.
Exploitation phase: The second subpopulation performs the positional update of the exploitation phase: during melting, the water molecules no longer exhibit highly dispersed characteristics but instead explore around the current local optimal solution. The classical 'degree-day method' is used as the snow-melting model:

$$M = \left(0.35 + 0.25 \times \frac{e^{t/t_{\max}} - 1}{e - 1}\right) \times T(t), \qquad T(t) = e^{-t/t_{\max}}$$

At this stage, individuals are more likely to explore potential optimal regions based on the centre of mass of the population and the local optimal solution. In the exploitation phase, the remaining individuals in the population are reconstituted into a subpopulation $Z(t)$, which is updated by the following equation:

$$Z_i(t+1) = M \times G(t) + BM_i(t) \otimes \left[\theta_2 \times \left(G(t) - Z_i(t)\right) + (1 - \theta_2) \times \left(\bar{Z}(t) - Z_i(t)\right)\right]$$

where $M$ represents the snowmelt rate and $\theta_2$ represents a random number in $(-1, 1)$ used for inter-individual communication.
Dual-population mechanism: As the number of iterations increases, the size $N_a$ of the sublimation (exploration) subpopulation gradually increases, raising the weight of exploration, while the size $N_b$ of the melting (exploitation) subpopulation gradually decreases, avoiding over-localised exploitation and thereby promoting the search for the globally optimal solution. While the population sizes are positive and the iteration count has not reached its maximum, the sizes evolve as:

$$N_a \leftarrow N_a + 1, \qquad N_b \leftarrow N_b - 1, \qquad N_a + N_b = N$$
In meta-heuristic algorithms, it is important to balance global search against local exploitation; in the snow analogy, steam can come either from snow sublimating directly or from snow first melting into water that then vaporises. Over time, the algorithm gradually shifts from the irregular, highly dispersive motion of the early iterations to a deeper exploration of the solution space. The dual-population mechanism reflects this search strategy while guaranteeing both exploration and exploitation capability. The dual-population mechanism is given in Algorithm 1, and the pseudocode of the SAO algorithm in Algorithm 2.
Algorithm 1: Dual-population mechanism
1: Initialisation: t = 0, tmax, Na = Nb = N/2, where N denotes the population size
2: while (t < tmax) do
3:   if Na < N then
4:     Na = Na + 1, Nb = Nb − 1
5:   end if
6:   t = t + 1
7: end while
Algorithm 2: Snow ablation optimiser (SAO)
1: Initialisation: the swarm Zi (i = 1, 2, …, N), t = 0, tmax, Na = Nb = N/2
2: Fitness evaluation
3: Record the current best individual G(t)
4: while (t < tmax) do
5:   Calculate the snowmelt rate M through the degree-day model
6:   Randomly divide the whole population into two subpopulations of sizes Na and Nb
7:   for each individual do
8:     Update the individual's position through the exploration or exploitation rule
9:   end for
10:  Fitness evaluation
11:  Update G(t)
12:  t = t + 1
13: end while
14: Return G(t)
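Purely as an illustration, Algorithms 1 and 2 can be combined into the following Python sketch; the degree-day coefficients and update rules follow the published SAO formulation and should be read as assumptions rather than the authors' exact implementation.

```python
import numpy as np

def sao(f, lb, ub, N=50, t_max=200, seed=0):
    """Minimise f over [lb, ub] with a simplified snow ablation optimiser."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    Z = lb + rng.random((N, dim)) * (ub - lb)      # initialisation
    fit = np.apply_along_axis(f, 1, Z)
    Na = Nb = N // 2

    for t in range(t_max):
        # Degree-day snowmelt rate (assumed coefficients)
        T = np.exp(-(t + 1) / t_max)
        M = (0.35 + 0.25 * (np.exp((t + 1) / t_max) - 1) / (np.e - 1)) * T

        order = np.argsort(fit)
        G = Z[order[0]]                            # current best individual
        elite_pool = [Z[order[0]], Z[order[1]], Z[order[2]],
                      Z[order[: N // 2]].mean(axis=0)]
        centroid = Z.mean(axis=0)

        idx = rng.permutation(N)
        for i in idx[:Na]:                         # exploration: Brownian motion
            BM = rng.normal(size=dim)
            elite = elite_pool[rng.integers(4)]
            th1 = rng.random()
            Z[i] = elite + BM * (th1 * (G - Z[i]) + (1 - th1) * (centroid - Z[i]))
        for i in idx[Na:]:                         # exploitation: melting
            BM = rng.normal(size=dim)
            th2 = 2 * rng.random() - 1
            Z[i] = M * G + BM * (th2 * (G - Z[i]) + (1 - th2) * (centroid - Z[i]))

        Z = np.clip(Z, lb, ub)
        fit = np.apply_along_axis(f, 1, Z)
        if Na < N:                                 # dual-population schedule
            Na, Nb = Na + 1, Nb - 1

    return Z[np.argmin(fit)], fit.min()
```

For the model in this paper, the objective f would wrap a full CNN training and validation run, with the kernel size, kernel number, and learning rate as the decision variables.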
2.4. Convolutional Neural Network
The fundamental structure of convolutional neural networks comprises five primary components: input processing units, feature extraction layers (convolutional blocks), dimensionality reduction modules (pooling operations), classification networks (fully connected layers), and final output nodes. These architectures leverage three key principles—(1) localised receptive fields for spatial feature detection, (2) parameter sharing mechanisms for computational efficiency, and (3) subsampling techniques for hierarchical abstraction—collectively enabling parameter optimisation, accelerated training convergence, and reduced model complexity. These advancements have contributed substantially to improved forecasting capabilities in renewable energy systems, particularly for wind generation prediction tasks in recent research.
The convolutional layer is the most important part of a CNN, and its inputs and outputs are connected by weights and biases [26]. The input-output correspondence of the convolutional layer is as follows:

$$Y = f\left(W \otimes X + B\right)$$

where $X$ represents the input; $f$ represents the excitation (activation) function; $W$ represents the convolution kernel; $\otimes$ represents the convolution operation; and $B$ represents the output bias.
After the convolution operation, the feature map is pooled, taking the mean or maximum value over a certain range; pooling effectively reduces the model parameters and helps avoid overfitting.
The fully connected layer is located behind the convolution and pooling layers, and its function is to integrate the features extracted by them:

$$y_k = f\left(w_k x + b_k\right)$$

where $y_k$ is the output; $k$ denotes the $k$th fully connected layer; $w_k$ represents the connection weights; $x$ is the feature map unfolded into a one-dimensional vector; and $b_k$ is the bias.
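A toy NumPy forward pass may make the two formulas concrete (a 1-D convolution stands in for the 2-D case; all values are illustrative):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

x = np.array([0.2, 1.0, -0.3, 0.7, 0.5, -0.1])  # input sequence X
w = np.array([0.5, -0.25, 0.1])                 # convolution kernel W
b = 0.05                                        # bias B

# Convolutional layer: Y = f(W conv X + B); reversing w makes np.convolve a cross-correlation
conv = relu(np.convolve(x, w[::-1], mode="valid") + b)

# Max pooling over windows of size 2 with stride 2
pool = conv.reshape(-1, 2).max(axis=1)

# Fully connected layer: y_k = f(w_k x + b_k) on the flattened features
Wk = np.full((1, pool.size), 0.3)
yk = relu(Wk @ pool + 0.1)
```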
2.5. Improved Convolutional Neural Network
In this study, the improved CNN-based wind power prediction model was built in MATLAB R2023a, and the overall process can be divided into three parts:
(1) Input layer.
In the input layer, the dimension of the data input is defined, and the features are represented as an ‘image’ with a certain length and a height and width of 1. In essence, the sequence data are processed by a 1 × 1 convolution kernel; then, a sequence folding layer is established to convert the data into a pseudo two-dimensional form that is suitable for CNN processing, so as to facilitate subsequent 2D convolution operations.
(2) Convolution module.
This module contains a convolutional layer, a batch normalisation layer, an activation function layer, a Dropout layer, and a maximum pooling layer.
As shown in Figure 1, a hierarchical optimisation architecture is introduced to optimise the network parameters. SAO acts as an external optimiser during the optimisation of the model's structural parameters, with the convolution kernel size, kernel number, and learning rate as its objectives; it runs before model training to determine the network hyperparameters. Adam is specified as the optimisation algorithm in trainingOptions: starting from the SAO-determined initial learning rate, it updates the neuron weights and biases, optimising the specific weight values of the convolution kernels and the parameters of the fully connected layers through backpropagation. The synergy between the two both escapes local optima more efficiently than traditional grid search or successive parameter tuning and accelerates convergence through the adaptive learning rate.
A batch normalisation layer is added after the convolutional layer to accelerate training and stabilise the training gradient. A ReLU activation function follows to improve training efficiency and construct a nonlinear mapping between the input features and the output target. The nonlinearly transformed data then pass through a Dropout layer before the maximum pooling unit; Dropout regularisation temporarily discards neuron outputs at random with a 10% probability (p = 0.1) to prevent neurons from relying too heavily on local features. The maximum pooling window is (2, 1) with a stride of 2; placing this lightweight regularisation before pooling retains more valid features for downsampling while avoiding the amplification of noise by pooling.
This module gradually reduces the sequence length while extracting local temporal features to enhance the model’s robustness.
(3) Output module.
This module contains sequence unfolding, a flattening step, a fully connected layer, a self-attention layer, and an output layer. Sequence unfolding and flattening restore the collapsed 2D sequence to its original structure and flatten the multi-dimensional features into vectors. After a fully connected layer compresses the feature dimensions, the self-attention mechanism is introduced to capture the global dependencies of the sequence, enhance global interactions, and improve the model's ability to model long time-series relationships, compensating for the CNN's limited local receptive field. Finally, a fully connected layer maps the result to the target dimension, and the regression layer outputs the prediction and computes the loss.
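The paper builds this network in MATLAB R2023a; purely for illustration, an equivalent layer stack can be sketched in PyTorch as follows. Layer sizes follow the paper where stated (dropout p = 0.1, pooling stride 2) and are otherwise assumed.

```python
import torch
import torch.nn as nn

class ImprovedCNN(nn.Module):
    """Conv -> BN -> ReLU -> Dropout(0.1) -> MaxPool -> FC -> self-attention -> FC."""
    def __init__(self, n_features: int, n_filters: int = 32, k: int = 6, d_model: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, n_filters, kernel_size=k, padding="same"),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.Dropout(p=0.1),                     # drop 10% of activations
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        self.fc1 = nn.Linear(n_filters, d_model)   # compress the feature dimension
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, 1)           # regression output

    def forward(self, x):                           # x: (batch, seq_len, n_features)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, seq_len/2, n_filters)
        h = self.fc1(h)
        h, _ = self.attn(h, h, h)                   # self-attention over time steps
        return self.out(h[:, -1, :])                # predict from the last position

model = ImprovedCNN(n_features=7)                   # e.g., 4 kernel PCs + 3 DMD modes (assumed)
y_hat = model(torch.randn(8, 16, 7))                # (batch=8, window=16) -> (8, 1)
```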
4. Case Study Analysis
4.1. Data Source
This research utilised operational data collected from a 200 MW wind farm located in Hami, Xinjiang, comprising both power generation records and meteorological measurements. The dataset spans the entire month of January 2019, with 15 min resolution measurements yielding 2976 temporal samples. The facility consists of 133 wind turbines, each rated at 1.5 MW capacity.
The accompanying meteorological observations include eleven distinct parameters: wind velocity and direction recorded at four different heights (10 m, 30 m, 50 m, and 70 m) from the anemometer tower, along with atmospheric pressure, ambient temperature, and relative humidity measurements.
4.2. KPCA Dimensionality Reduction
Wind power prediction has many influencing factors. Besides the historical power data, which have the greatest influence, the meteorological data measured at wind farms also affect the power prediction; because these meteorological variables are correlated, their dimensionality must be reduced to remove redundant information. Using MATLAB R2023a, KPCA dimensionality reduction was performed on the meteorological data with the kernel parameter set to c = 20,000. The eleven meteorological features were fused, the kernel principal components with a cumulative contribution rate of more than 85% were selected, and, finally, four kernel principal components were extracted, denoted y1, y2, y3, and y4. The reduced data are shown in Table 1, and the cumulative contribution rates of the features are shown in Table 2.
From Table 2, it can be seen that the first four feature vectors constitute kernel principal components that reflect the meteorological information. Their cumulative contribution reaches 92.535% of the total variance, indicating that the four extracted kernel principal components capture the vast majority of the information in the 11 features.
4.3. DMD Decomposed Power Data
The Hankel matrix constructed from the historical power data was used to extract data snapshots with a sliding time window with a sliding step of 1. The sample data construction is shown in Figure 3.
The constructed data snapshots were modally decomposed to obtain the dynamic modes of the system. The order of decomposition corresponds to the amount of information contained in the data, and the first feature extracted is also called the dominant feature. The ten modes after decomposition are shown in Figure 4, and the calculated energy share of each modal signal is shown in Figure 5.
As can be seen from Figure 5, the first three of the ten modal signals decomposed by DMD account for a large proportion of the original signal energy, reaching a cumulative 85.83%.
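One common way to compute such an energy share, sketched below from the outputs of the dmd() function in Section 2.2, is via the mode amplitudes; the paper does not specify its exact energy metric, so this proxy is an assumption:

```python
import numpy as np

# Phi, lam, b as returned by the dmd() sketch; X_snapshots is a placeholder name
Phi, lam, b = dmd(X_snapshots, rank=10)

energy = (np.abs(b) * np.linalg.norm(Phi, axis=0)) ** 2   # per-mode energy proxy
share = energy / energy.sum()
top3 = np.sort(share)[::-1][:3].sum()                     # cumulative share of top 3 modes
```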
4.4. Experimental Comparison of Predictive Model Performance
To systematically evaluate the effectiveness of the proposed model, multiple performance metrics were used to compare the model with benchmark prediction methods. The experimental design employed a control variable method to assess the contribution of each module to the overall system performance.
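The metrics reported in the tables that follow (RMSE, MAE, MAPE, and R2) can be computed as below, assuming their standard definitions:

```python
import numpy as np

def metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    ss_res = np.sum(err**2)
    ss_tot = np.sum((y_true - y_true.mean())**2)
    return {
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAE": np.mean(np.abs(err)),
        "MAPE": np.mean(np.abs(err / y_true)),   # assumes no zero-power samples
        "R2": 1.0 - ss_res / ss_tot,
    }
```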
During model development and validation, the dataset was divided into a training set (80%) and a test set (20%). A rolling multi-step prediction method was used to generate wind power predictions for the next four hours, thereby enabling multi-step performance evaluation.
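A minimal sketch of this rolling scheme, assuming the 15-min resolution of Section 4.1 (so 4 h ahead corresponds to 16 steps); model.predict, the window length, and the handling of exogenous features are illustrative:

```python
import numpy as np

def rolling_forecast(model, history: np.ndarray, exog: np.ndarray, steps: int = 16):
    """Roll the model forward 16 steps (4 h at 15-min resolution),
    feeding each prediction back into the input window."""
    window = history.copy()
    preds = []
    for k in range(steps):
        x = np.concatenate([window[-16:], exog[k]])   # power window + meteorological PCs
        y_next = float(model.predict(x[None, :]))
        preds.append(y_next)
        window = np.append(window, y_next)            # feed back for the next step
    return np.array(preds)
```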
4.4.1. Prediction Under Different Decomposition Algorithms
To assess the effectiveness of the proposed decomposition method for power data preprocessing, a comparative analysis was conducted against alternative decomposition techniques. The evaluation employed a standard CNN architecture as the baseline prediction framework. As shown in Figure 6 and Table 3, the reported error metrics are averaged over five independent experimental runs.
The results show that, compared with the undecomposed method, the prediction errors of the models under the four decomposition methods are all reduced, which verifies the effectiveness of signal decomposition in improving prediction results. Among them, the DMD-based model outperforms the others on all indices; its prediction curve tracks the actual power fluctuations more closely, and its RMSE is 31.9% lower than that of the undecomposed model, indicating that the DMD algorithm effectively improves prediction accuracy.
4.4.2. Prediction Under Different Master Forecasting Models
In order to verify the influence of the main prediction model on prediction performance, RNN, LSTM, GRU, and CNN models were compared. The prediction results are shown in Figure 7, and Table 4 lists the prediction error metrics of the different models, again averaged over five trials. The results show that the CNN model performs best, with the highest R2 (0.98909), indicating the strongest fitting ability, and the lowest MAE (4.7742) and RMSE (7.9954), indicating the lowest prediction error and the best stability. Although its MAPE (0.3331) is not optimal, it remains within an acceptable range, and the CNN is still the best choice when all indicators are considered together.
4.4.3. Prediction Under Different Optimisation Algorithms
This study proposes an enhanced prediction framework integrating dynamic mode decomposition (DMD) with convolutional neural networks, augmented by several technical innovations. First, to overcome the challenges of manual hyperparameter optimisation in CNN architectures, the SAO algorithm automatically determines optimal configurations for (1) convolution kernel dimensions (range 2-6), (2) filter quantities (8-128 units), and (3) learning rates ($10^{-4}$ to $10^{-2}$). Second, a self-attention module is incorporated to improve temporal dependency modelling in extended sequence forecasting. Finally, a dropout layer (with probability 0.1) is implemented to enhance the model's generalisation capability by mitigating overfitting risks. The experiments show that SAO-CNN optimisation yields the following best parameters: 32 convolution kernels of size 6 and a learning rate of 0.00381847.
Figure 8 shows the fitness curves of the SAO optimisation process. Figure 9 and Table 5 show the prediction results and error comparison under the different optimisation algorithms; the results are averaged over five trials.
The experimental results demonstrate significant improvements in model performance through parameter optimisation. A comparative analysis reveals that the SAO algorithm outperforms alternative optimisation methods in identifying optimal CNN configurations. Specifically, the SAO-optimised CNN achieves remarkable predictive performance, exhibiting (1) the highest correspondence with actual power measurements (R2 = 99.358%), (2) a 39.8% reduction in the RMSE compared to the baseline CNN architecture, and (3) minimal deviation between predicted and observed power curves. These findings collectively indicate that the SAO-based optimisation approach substantially enhances forecasting precision.
4.4.4. Projections for Different Wind Farms
In order to show that the proposed prediction model generalises well, data from a wind farm in Inner Mongolia were used for prediction. The data processing, meteorological factor extraction, and historical power data decomposition were the same as in the preceding experiments. The input matrix was constructed and fed into the prediction model. The comparison shows that the proposed model is 2.7% more accurate than the basic CNN model, and its RMSE is 30.6% lower, demonstrating good prediction performance as well as generalisation ability. The specific results are shown in Figure 10 and Table 6, again averaged over five trials.
5. Conclusions
This research introduces a novel wind power forecasting system that integrates multi-source feature fusion with optimised CNN components to improve temporal prediction performance. The case study analysis reveals three principal conclusions:
(1) Decomposing the actual power data by DMD can effectively retain the information of the original power sequence and improve the accuracy of prediction.
(2) Optimising the CNN parameters by SAO can adaptively determine the optimal network parameter combination. Compared with the commonly used parameter setting methods, it overcomes the randomness of the empirical settings and has higher accuracy when applied to prediction.
(3) The constructed improved CNN prediction model, thanks to the fused features' ability to retain data information and the CNN's strong data mining ability, improves the fitting ability of the model, reduces prediction error, and demonstrates better prediction performance. Predicting the power of different wind farms further illustrates the model's good generalisation performance.
The predictive capability of wind power forecasting models is fundamentally dependent on both the quality of operational data from wind farms and the configuration of model parameters. Therefore, future research should focus on how to deeply mine the original data information and how to optimise the prediction model performance to improve stability and generalisability and better support the online prediction and application of wind power. In response to some of the issues identified in this study, we will conduct further research, specifically including the following:
(1) Conducting additional ablation experiments on the key components of the proposed model, removing the KPCA module, DMD module, SAO module, and self-attention mechanism module one by one to calculate the contribution of each module to the prediction results.
(2) Preprocessing the annual data (a total of 35,040 sampling points) from the two wind farms and extracting merged features. Seasonal predictions will be made for the entire year to validate the applicability of the proposed model under different seasonal conditions. Extreme conditions will be incorporated into the experiments to test model performance. Further model improvements will be made based on the characteristics of different seasons.
(3) Conducting a more detailed analysis of computational efficiency, including training time and resource usage. The model’s performance will be tested in real-world scenarios by integrating the prediction model into the wind farm’s SCADA system via a program, enabling real-time data reception and online predictions, and evaluating the prediction performance.