1. Introduction
Wind power has emerged as a cornerstone of the global transition toward “carbon neutrality”. However, its inherent intermittency and volatility pose significant challenges to the balance of supply and demand in power systems, and the integration of wind energy into power grids is further hindered by the stochastic nature of wind speeds. Accurate short-term wind power forecasting is therefore not merely a technical challenge but a prerequisite for grid stability, operational security, and economic dispatch [1].
Traditional statistical methods, such as ARIMA and Holt-Winters models [2,3,4], are interpretable but struggle to capture the highly nonlinear and dynamic characteristics of modern wind power data. Consequently, data-driven approaches, particularly deep learning, have emerged as the mainstream solution. Recurrent Neural Networks (RNNs) [5] and their variants (LSTM, GRU) have shown success in modeling temporal dependencies, but they often fail to capture the multi-scale characteristics of wind power, while convolutional networks (CNNs/TCNs) lack global memory [6].
To address these limitations, recent research has shifted towards hybrid architectures. However, three critical gaps remain in the current literature:
Methodological Trade-offs: Recent studies often rely on decomposition-based hybrids to handle non-stationarity. However, as shown in Table 1, these methods frequently introduce significant computational overhead and may inadvertently introduce data leakage during decomposition.
Lack of Physically Interpretable Architecture: Many existing hybrid models simply stack convolutional and recurrent layers without a clear functional logic aligned with the physical characteristics of wind data (e.g., separating turbulence from trends). This often leads to redundant computations or information loss.
Inadequate Hyperparameter Optimization: The performance of complex hybrid models is highly sensitive to hyperparameters. Most studies utilize manual tuning or basic algorithms like Particle Swarm Optimization (PSO), which easily fall into local optima on the complex, non-convex loss surfaces of deep networks.
To bridge these gaps, this paper proposes a CPO-BiTCN-BiGRU-Attention framework. We theoretically justify this combination: BiTCN acts as a “feature filter” to extract local patterns and reduce noise; BiGRU models the global sequence; and Attention highlights physically significant moments. Crucially, we employ the Crested Porcupine Optimizer (CPO). Unlike PSO [7], the Walrus Optimization Algorithm (WaOA) [8], or Genetic Algorithms (GA) [9,10], CPO simulates four distinct defense mechanisms (visual, acoustic, odor, and physical), providing a dynamic balance between exploration and exploitation and substantially improving the likelihood of converging to globally competitive hyperparameter configurations.
The main contributions of this study are:
A Physically Motivated Hybrid Architecture: We propose a serial “Filter–Memorize–Focus” framework that effectively integrates the local feature extraction of BiTCN with the global memory of BiGRU and the dynamic weighting of the attention mechanism, specifically designed to handle the multi-scale nature of wind power.
Adaptive Hyperparameter Optimization: The application of the CPO algorithm solves the “black-box” optimization problem of deep networks, demonstrating superior convergence speed and accuracy compared to TSA, SMA, and GWO.
Superior End-to-End Performance: Without relying on complex pre-decomposition techniques (like VMD or EMD), the proposed model achieves State-of-the-Art accuracy on real-world datasets, verified by comprehensive ablation studies and error distribution analysis.
3. Components of the Prediction Model
The proposed prediction framework is not a simple stacking of multiple modules but a physically motivated and data-driven architecture designed to reflect the intrinsic characteristics of wind power generation processes. Wind power time series are inherently nonlinear, non-stationary, and multi-scale, arising from the combined effects of atmospheric turbulence, wind gusts, and diurnal meteorological cycles. To effectively capture these complex dynamics, a carefully organized hierarchical modeling strategy is required.
3.1. Overall Framework: The Logic of Serial Processing
Wind power time series exhibit distinct patterns across different temporal scales, including high-frequency fluctuations caused by turbulence, medium-term local trends induced by wind gusts, and low-frequency periodic components associated with diurnal and seasonal variations. A single predictive model is generally insufficient to capture all these characteristics simultaneously. To address this challenge, as illustrated in Figure 1, a serial “Filter–Memorize–Focus–Optimize” modeling strategy is adopted, in which each component serves a clearly defined functional role:
Filter (BiTCN): Extracts multi-scale local features from raw wind power signals while suppressing high-frequency noise.
Memorize (BiGRU): Models the temporal evolution and long-term dependencies of the extracted features.
Focus (Attention): Assigns adaptive weights to critical time steps, emphasizing turning points and informative moments.
Optimize (CPO): Automatically tunes the hyperparameters of the hybrid deep learning model to adapt it to the characteristics of a specific dataset.
Through this serial processing pipeline, the framework progressively transforms raw wind power data into high-level, task-oriented representations, thereby enhancing predictive accuracy and robustness.
3.2. BiTCN: Multi-Scale Feature Extraction and Denoising
Temporal Convolutional Networks (TCNs) have been widely recognized for their effectiveness in modeling long-range temporal dependencies while maintaining efficient parallel computation capabilities [22,23]. However, conventional TCNs are inherently unidirectional, relying solely on past information, which may result in incomplete temporal feature representation, especially in complex and highly fluctuating wind power sequences.
To overcome this limitation, a Bidirectional Temporal Convolutional Network (BiTCN) is employed. By integrating forward and backward TCNs, the BiTCN is able to exploit contextual information from both historical and future time steps, leading to a more comprehensive temporal feature representation.
Moreover, wind power data are often contaminated by noise originating from sensor measurement errors and atmospheric turbulence. The dilated convolution layers in the BiTCN act as learnable nonlinear filters. By progressively increasing the dilation rate d, the receptive field of the network expands exponentially, enabling the model to capture short-term turbulence and medium-term trends simultaneously without sacrificing temporal resolution. This design allows the BiTCN to effectively perform feature extraction and denoising in a unified manner.
3.2.1. Structure of BiTCN
The BiTCN module consists of two symmetric sub-networks: a forward TCN, which processes the input sequence in chronological order, and a backward TCN, which processes the sequence in reverse order, as shown in Figure 2. Each sub-network is composed of stacked layers including a 1 × 1 convolution, dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout regularization. The outputs of the forward and backward TCNs are subsequently fused to form a bidirectional temporal representation.
In the diagram, the forward branch receives the input (X1, X2, …, Xt) and the reverse branch receives (Xt, …, X2, X1). Each branch applies a 1 × 1 convolution, dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout, with the final output denoted as XL+1.
The Dropout layers applied at the end of both the forward and backward TCN branches serve as independent regularization mechanisms rather than terminal outputs of the network. Specifically, Dropout is employed to mitigate overfitting by randomly deactivating a subset of neurons during training, thereby improving the robustness and generalization capability of each directional feature extractor. After Dropout, the outputs of the forward and backward TCNs are fused through feature concatenation to form a unified bidirectional temporal representation, which is then passed to the subsequent BiGRU module. This design ensures that regularization is applied symmetrically to both temporal directions while preserving complete bidirectional information for downstream sequence modeling.
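To make the dilated causal filtering concrete, the following NumPy sketch implements a single-channel causal dilated convolution and the forward/backward fusion described above. It is a simplified illustration with fixed, hand-chosen weights, not the trained BiTCN (which additionally includes 1 × 1 convolutions, batch normalization, Leaky ReLU, and dropout):

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """y[t] = sum_j kernel[j] * x[t - j*dilation]; the input is zero-padded
    on the left so each output depends only on current and past values."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def bitcn_features(x, kernel, dilation):
    """Forward branch on x, backward branch on the reversed sequence,
    fused into a bidirectional representation of shape (T, 2)."""
    fwd = causal_dilated_conv(x, kernel, dilation)
    bwd = causal_dilated_conv(x[::-1], kernel, dilation)[::-1]
    return np.stack([fwd, bwd], axis=-1)
```

An impulse input makes the causality visible: with kernel [1, 1, 1] and dilation d = 2, an impulse at t = 0 produces nonzero forward outputs only at t = 0, 2, 4. Stacking layers with dilations 1, 2, 4 and kernel size k = 3 expands the receptive field to 1 + (k − 1)(1 + 2 + 4) = 15 steps, which is the exponential growth the text refers to.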
3.2.2. Feature Calculation Formula of BiTCN
The feature extraction process is expressed by Equations (1) and (2):

$$\overrightarrow{h}_t = \mathrm{TCN}\left(X;\, \delta, \lambda, d, p\right) \tag{1}$$

$$\overleftarrow{h}_t = \mathrm{TCN}\left(\overleftarrow{X};\, \delta, \lambda, d, p\right) \tag{2}$$

where X denotes the input feature sequence of the BiTCN module and $\overleftarrow{X}$ its time-reversed copy; δ denotes the dimension of the temporal convolution kernel; λ denotes the parameter of the Leaky ReLU activation function; d denotes the dilation rate of the dilated convolution; p denotes the dropout regularization parameter; and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the forward and backward temporal features extracted by the BiTCN module at time t, respectively.
3.3. BiGRU: Temporal Context Learning
The Gated Recurrent Unit (GRU) is a streamlined variant of the Long Short-Term Memory (LSTM) network that simplifies the gating mechanism by employing only two gates: the reset gate and the update gate. Compared with LSTM, GRU significantly reduces model complexity and training time while maintaining comparable predictive performance, making it particularly suitable for large-scale time series forecasting tasks with limited computational resources.
To further capture bidirectional temporal dependencies in wind power sequences, this study adopts the Bidirectional GRU (BiGRU) architecture. In practical wind power generation systems, the current power output is not solely determined by instantaneous wind speed, but is also influenced by the mechanical inertia of wind turbines and the evolving meteorological conditions over preceding and subsequent time intervals. By processing temporal features in both forward (past-to-future) and backward (future-to-past) directions, the BiGRU is able to model such dynamic evolution more effectively.
In the proposed framework, the BiGRU takes the multi-scale feature representations extracted by the BiTCN as input and focuses on learning their temporal evolution, thereby serving as the memory module in the serial “Filter–Memorize–Focus” architecture.
3.3.1. Structure of BiGRU
The BiGRU consists of two parallel GRU layers: a forward GRU layer, which processes the input feature sequence in chronological order, and a backward GRU layer, which processes the same sequence in reverse order, as illustrated in Figure 3. At each time step, the hidden states generated by the forward and backward GRU layers are combined to form a bidirectional temporal representation. This structure enables the BiGRU to integrate information from both historical and future contexts, resulting in a more comprehensive modeling of temporal dependencies. Such bidirectional modeling is particularly important for wind power forecasting, where abrupt changes and delayed responses frequently occur.
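For readers who prefer code to equations, a minimal NumPy BiGRU follows. The cell weights here are randomly initialized for illustration only; in the actual model the dimensions come from the CPO search and the parameters are learned during training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, n_in, n_hid, rng):
        s = 1.0 / np.sqrt(n_hid)
        sx, sh = (n_hid, n_in), (n_hid, n_hid)
        self.Wz, self.Uz = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)
        self.Wr, self.Ur = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)
        self.Wh, self.Uh = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)          # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1.0 - z) * h + z * h_cand

def bigru(seq, fwd_cell, bwd_cell, n_hid):
    """Run one GRU forward and one backward over the same window,
    then concatenate the two hidden states at each time step."""
    hf, H_f = np.zeros(n_hid), []
    for x in seq:
        hf = fwd_cell.step(x, hf)
        H_f.append(hf)
    hb, H_b = np.zeros(n_hid), []
    for x in seq[::-1]:
        hb = bwd_cell.step(x, hb)
        H_b.append(hb)
    H_b.reverse()
    return np.stack([np.concatenate([f, b]) for f, b in zip(H_f, H_b)])
```

Note that "bidirectional" here means both passes run inside a single historical input window, consistent with the leakage discussion in Section 3.3.2.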
3.3.2. Mathematical Formulation of BiGRU
The state updates of the BiGRU can be expressed by Equations (3)–(5):

$$\overrightarrow{h}_t = \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right) \tag{3}$$

$$\overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right) \tag{4}$$

$$h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t \tag{5}$$

where GRU(·) represents the operational process of the traditional GRU network; $\overrightarrow{h}_t$ and $w_t$ denote the state and weight of the forward hidden layer at time t, respectively; $\overleftarrow{h}_t$ and $v_t$ denote the state and weight of the backward hidden layer at time t, respectively; and $b_t$ denotes the bias term of the hidden layer at time t.
The proposed bidirectional structures do not introduce data leakage during forecasting. In this study, both the BiTCN and BiGRU operate on fixed-length sliding windows that contain only historical observations available up to the prediction time. The term “bidirectional” refers to the internal feature extraction within each input window, where temporal dependencies are modeled in both forward and backward directions to enhance representation learning, rather than accessing any future unseen data beyond the forecasting horizon.
For multi-step forecasting, future ground-truth inputs are not available. Therefore, a recursive forecasting strategy is adopted, in which the model uses its own previous predictions as inputs for subsequent steps. At each prediction step, the bidirectional feature extraction is still confined to the historical window composed of observed or previously predicted values, ensuring that no future information is incorporated during inference.
3.4. Attention Mechanism: Capturing Ramping Events
The attention mechanism is introduced to selectively emphasize critical temporal features by assigning adaptive importance weights to different time steps. In wind power forecasting, adjacent observations generally exhibit stronger correlations with the target output, whereas distant time steps often contribute less. More importantly, rapid changes in wind power, commonly referred to as ramping events, carry substantially more predictive information than relatively stable periods.
Within a typical 24 h wind power sequence, turning points, where wind speed abruptly increases or decreases, reflect sudden meteorological changes and turbine response dynamics. By contrast, steady operating periods provide limited additional information. The attention mechanism enables the model to automatically focus on these informative moments, thereby enhancing its sensitivity to temporal variations and improving forecasting accuracy.
3.4.1. Working Principle of the Attention Mechanism
The attention mechanism operates by quantifying the relevance between each hidden state in the input sequence and the current prediction task. Specifically, it first computes a relevance score for each time step to measure its contribution to the forecasting objective. These scores are then normalized to obtain attention weights, which are used to perform a weighted aggregation of the temporal features.
As illustrated in Figure 4, the Attention module takes the bidirectional hidden representations generated by the BiGRU as input and outputs a context-aware feature representation that emphasizes critical time steps while suppressing redundant or less informative ones.
3.4.2. Mathematical Formulation of the Attention Mechanism
In the wind power series, not all time steps contribute equally: sudden changes (ramps) are more informative than steady states. The attention mechanism assigns adaptive weights αt to the hidden states ht, as given by Equations (6)–(8):

$$e_t = u \tanh\left(W h_t + b\right) \tag{6}$$

$$\alpha_t = \frac{\exp\left(e_t\right)}{\sum_{j=1}^{T} \exp\left(e_j\right)} \tag{7}$$

$$s = \sum_{t=1}^{T} \alpha_t h_t \tag{8}$$

where $e_t$ is the relevance score of hidden state $h_t$; W, b, and u are learnable parameters; and s is the context vector obtained by the weighted aggregation.
From a physical perspective, a higher attention weight αt indicates that the model has identified a significant meteorological transition or turbine response event at time step t that strongly influences future wind power output. This mechanism enables the model to focus on dynamic changes rather than steady-state conditions, thereby improving its capability to capture ramping behaviors.
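The scoring-normalization-aggregation pipeline can be sketched in a few lines of NumPy. This uses one common additive-attention parameterization (an assumption on our part; the paper's exact form may differ), with W, b, and u standing in for the learnable parameters:

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Additive attention over hidden states H of shape (T, d):
    score each time step, softmax-normalize, and aggregate."""
    e = np.tanh(H @ W.T + b) @ u             # relevance score per time step
    a = np.exp(e - e.max())                  # numerically stable softmax
    a = a / a.sum()                          # attention weights, sum to 1
    context = a @ H                          # weighted aggregation
    return a, context
```

A large weight a[t] flags the ramp-like, information-rich time steps the text describes; steady-state steps receive correspondingly small weights.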
3.5. Crested Porcupine Optimizer (CPO) for Hyperparameters
The combined BiTCN-BiGRU-Attention model has a complex, non-convex hyperparameter space. We utilize CPO to automate the tuning process.
Unlike PSO, which relies on a single velocity vector, CPO simulates four defense strategies. This allows the algorithm to switch between aggressive search (Visual/Sound) and precise local convergence (Physical/Odor), significantly reducing the risk of getting trapped in suboptimal hyperparameter configurations.
Time Complexity: The complexity is $O(T \times N \times D)$, where T is the number of iterations, N the population size, and D the search-space dimensionality. While training is computationally intensive, it is an offline process. The online prediction speed is unaffected and remains fast (milliseconds).
Specifically, CPO is employed to search for the optimal vector of five key hyperparameters: (1) The number of filters in BiTCN layers (Nc), (2) The kernel size (K), (3) The number of hidden units in BiGRU (Nh), (4) The initial learning rate (η), and (5) The regularization coefficient (Dropout rate).
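The search loop below is a schematic skeleton, not the full CPO: it shows how a continuous position vector can be decoded into this five-dimensional mixed search space and scored by a validation-loss fitness, but it replaces CPO's four defense-mechanism update rules with a simple drift-toward-best move. All bounds are illustrative, not the paper's actual ranges (those appear in Table 2):

```python
import numpy as np

# Illustrative bounds for the five optimized hyperparameters
BOUNDS = [
    ("n_filters", 16, 128, int),     # BiTCN filters (Nc)
    ("kernel",     2,   6, int),     # kernel size (K)
    ("n_hidden",  32, 256, int),     # BiGRU hidden units (Nh)
    ("lr",      1e-4, 1e-2, float),  # initial learning rate (eta)
    ("dropout", 0.0, 0.5, float),    # dropout rate
]

def decode(pos):
    """Map a position vector in [0, 1]^5 to concrete hyperparameters."""
    return {name: cast(lo + p * (hi - lo))
            for p, (name, lo, hi, cast) in zip(pos, BOUNDS)}

def search(fitness, n_pop=20, n_iter=30, seed=0):
    """Population-based search skeleton: evaluate candidates and keep the
    best (validation-RMSE) configuration seen so far."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(n_pop, len(BOUNDS)))
    best_pos, best_fit = None, np.inf
    for _ in range(n_iter):
        for pos in pop:
            f = fitness(decode(pos))
            if f < best_fit:
                best_fit, best_pos = f, pos.copy()
        # placeholder move: drift candidates toward the incumbent best
        pop = np.clip(pop + 0.1 * (best_pos - pop)
                      + 0.05 * rng.standard_normal(pop.shape), 0.0, 1.0)
    return decode(best_pos), best_fit
```

In the actual framework, `fitness` would train the BiTCN-BiGRU-Attention model with the decoded configuration and return its validation RMSE.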
The proposed Filter–Memorize–Focus paradigm provides an engineering-oriented interpretability rather than a strict theoretical interpretability. The interpretability arises from the functional decomposition of the model architecture, where each module is designed to correspond to a specific role consistent with the physical characteristics of wind power time series. Specifically, the convolutional filtering stage is associated with local fluctuation suppression, the recurrent memory stage captures temporal evolution, and the attention mechanism highlights critical ramp-related time steps. This interpretability is therefore qualitative and functional in nature, aiming to improve model transparency and engineering understanding, rather than to establish formal causal or theoretical guarantees.
4. Results and Discussion
4.1. Dataset and Preprocessing
The dataset used in this study was collected from a wind farm in Xinjiang, China, covering May to June 2021. It includes power load data along with related meteorological and temporal features, comprising a total of 3840 data points, each containing 15 distinct features. The dataset has undergone thorough preprocessing to ensure data integrity, containing no missing values or anomalous outliers. Measurements were taken at 15 min intervals, resulting in 96 data points per day. Each time point records multiple meteorological parameters, including wind speed, wind direction, atmospheric pressure, temperature, humidity, and others. All raw data were normalized using the min-max scaling method to eliminate the influence of differing units and scales. To strictly preserve the temporal continuity inherent in wind power data and prevent future data leakage, we adopted a chronological splitting strategy rather than random shuffling. The dataset was divided as follows: the first 70% of the time series was used for training, the subsequent 15% served as the validation set for the CPO process (to calculate the fitness function), and the final 15% was reserved for testing to evaluate the model’s generalization performance.
In this study, the wind power data are sampled at a 15 min interval, and the forecasting task is formulated as a short-term multi-step prediction problem. Specifically, a sliding input window of length L is constructed using historical wind power observations, where each window contains only past information available up to the prediction time. The input features consist of normalized wind power values, while the prediction target is defined as the wind power output for the subsequent H time steps.
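A minimal sketch of this preprocessing pipeline (chronological 70/15/15 split, min-max scaling, sliding windows) is given below. Fitting the scaler on the training segment only is our conservative reading of the normalization step, intended to avoid leakage; the paper does not state which segment the scaler was fitted on:

```python
import numpy as np

def chrono_split(n, train=0.70, val=0.15):
    """70/15/15 split that preserves temporal order (no shuffling)."""
    i, j = int(n * train), int(n * (train + val))
    return slice(0, i), slice(i, j), slice(j, n)

def fit_minmax(x):
    """Scaler statistics, computed on the training segment only."""
    return x.min(), x.max()

def apply_minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

def make_windows(series, L, H):
    """Sliding windows: L past points as input, next H points as target."""
    X, Y = [], []
    for i in range(len(series) - L - H + 1):
        X.append(series[i:i + L])
        Y.append(series[i + L:i + L + H])
    return np.array(X), np.array(Y)
```

With the paper's 3840 samples, the split boundaries fall at samples 2688 and 3264; a window length of L = 96 (one day at 15 min resolution, used here purely as an example) with H = 4 yields 3741 input-target pairs over the full series.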
The forecast horizon is set to H = 4, corresponding to a 1 h ahead prediction (4 × 15 min). During model training and inference, the proposed framework performs recursive multi-step forecasting, where predictions generated at earlier steps are iteratively fed back as inputs for subsequent steps.
The forecast horizon of 1 h (H = 4) is chosen from both practical and physical perspectives of wind power system operation. In real-world wind farm management and power system dispatching, short-term operational decisions, such as reserve allocation, unit commitment adjustment, and ramping control, are typically made within a time scale of 15–60 min. Therefore, a 1 h ahead prediction provides the most relevant information for operational planning and real-time control.
Moreover, wind power predictability decreases significantly as the forecasting horizon extends beyond one hour due to the increasing influence of large-scale meteorological variations. Longer multi-step horizons (e.g., 8 or 16 steps) tend to suffer from substantial error accumulation in recursive forecasting, which may reduce their practical value for real-time dispatching. Consequently, selecting a 4-step (1 h) horizon represents a balanced trade-off between predictive accuracy and operational applicability, making it particularly suitable for short-term wind power forecasting in practical power system scenarios.
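The recursive strategy described above can be sketched as follows, with a trivial persistence model standing in for the trained network:

```python
import numpy as np

def recursive_forecast(one_step_model, history, H):
    """Iterate a one-step-ahead model H times, feeding each prediction
    back into the input window; no future ground truth is used."""
    window = list(history)
    preds = []
    for _ in range(H):
        y_hat = one_step_model(np.asarray(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide the window forward
    return preds
```

Because each step consumes earlier predictions, errors compound with the horizon, which is exactly the error-accumulation effect cited above as a reason to cap H at 4.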
4.2. Evaluation Metrics
4.2.1. Selection of Evaluation Indicators
To quantitatively assess model performance, the following metrics defined in Equations (9)–(11) were employed:

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| \hat{y}_t - y_t \right| \tag{9}$$

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{\hat{y}_t - y_t}{y_t} \right| \tag{10}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{y}_t - y_t \right)^2} \tag{11}$$

where $\hat{y}_t$ denotes the predicted value at time t, $y_t$ represents the corresponding ground truth, and N is the length of the sequence.
It should be noted that the use of MAPE in wind power forecasting may be problematic when actual power values approach zero. In practical wind farm operation, exact zero power outputs are rare within normal operating periods, as data segments corresponding to turbine shutdowns or maintenance are excluded during preprocessing. To further avoid numerical instability, a small positive constant is added to the denominator when computing MAPE, ensuring that near-zero values do not lead to inflated errors. Under these conditions, MAPE remains a meaningful indicator of relative prediction accuracy for short-term wind power forecasting.
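A direct implementation of the three metrics, including the stabilized MAPE denominator, might look like this (the value of the small constant eps is illustrative; the paper does not report the constant it uses):

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mape(y, y_hat, eps=1e-6):
    """Percentage error with a small constant added to the denominator
    to guard against near-zero actual power values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs((y - y_hat) / (np.abs(y) + eps))) * 100.0)
```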
4.2.2. Setting of Model Parameters
The Crested Porcupine Optimizer was employed to automatically search for the optimal hyperparameters. The objective function for CPO was defined as the RMSE on the validation set. The CPO algorithm configuration and the resource environment are detailed below:
Population Size: 20
Maximum Iterations: 30
Search Dimensionality: 5 (corresponding to the 5 optimized hyperparameters)
Optimization Strategy: The CPO applies its four distinct defense mechanisms cyclically to balance exploration (global search) and exploitation (local convergence).
Computing Resources: The experiments were conducted on a workstation with an Intel Core i7-12700K CPU, 32 GB RAM, and an NVIDIA GeForce RTX 3080 Ti GPU, using Python 3.9 and PyTorch 1.12.
4.2.3. Training Protocols and Final Hyperparameters
The model was trained using the Adam optimizer with the MSE loss function. We implemented an Early Stopping mechanism with a patience of 15 epochs to prevent overfitting. The batch size was fixed at 64, and the maximum number of epochs was set to 100. The random seed was fixed at 42 to ensure reproducibility.
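The early-stopping logic can be expressed framework-agnostically as below; `train_one_epoch` and `validate` are hypothetical callbacks standing in for the actual Adam/MSE training and validation passes:

```python
def train_with_early_stopping(train_one_epoch, validate,
                              max_epochs=100, patience=15):
    """Stop once validation loss has not improved for `patience`
    consecutive epochs; return the best loss and its epoch."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_loss, best_epoch
```

In practice one would also checkpoint the model weights at each improvement and restore the best checkpoint after stopping.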
Table 2 lists the search space for CPO and the final optimal values obtained:
4.3. Analysis of Experimental Results
4.3.1. Training Set Prediction Results
The prediction results of the model on the training set are illustrated in Figure 5. The model achieved excellent performance on the training set, with a coefficient of determination R² = 0.97613 and an RMSE of 9.56 MW. As observed from the figure, the predicted values (orange line) closely tracked the actual values (blue dots), exhibiting a strong fitting effect throughout the training period. This indicates that the model successfully learned the temporal and nonlinear characteristics embedded in the wind power data.
4.3.2. Test Set Prediction Results
The prediction results of the model on the test set are illustrated in Figure 6. The model still maintained good performance on the test set, with R² = 0.95626 and an RMSE of 13.6094 MW. Although the prediction error was slightly higher than that on the training set—due to the test set containing more fluctuating and irregular samples caused by variable meteorological conditions—the predicted values still effectively captured the main trends of the actual wind power data. The slight decrease in R² and increase in RMSE were within a reasonable range, indicating that the model had good generalization ability and did not suffer from overfitting.
4.3.3. Prediction Error Distribution
Figure 7 illustrates the distribution of prediction residuals of the proposed model. It can be observed that most errors are concentrated in a narrow range around zero, and the distribution exhibits an approximately symmetric, bell-shaped form. This suggests that the model predictions are largely unbiased and consistent across different operating conditions.
4.3.4. Regression Analysis
The regression relationship between the model's predicted values and actual values is illustrated in Figure 8. The fitted regression line was Output ≈ 1 × Target + 4.1, indicating a strong linear correlation between the predicted and actual values and further verifying the high prediction accuracy of the model.
4.4. Ablation Experiment
To validate the necessity and effectiveness of each core component in the proposed CPO-BiTCN-BiGRU-Attention framework, a systematic ablation study was conducted. By progressively adding functional modules to the baseline model, the contribution of each component can be quantitatively assessed. The experimental results are summarized in Table 3, where lower values of MAE, MAPE, and RMSE indicate better forecasting performance.
The results in Table 3 reveal a clear performance hierarchy. The standalone BiGRU exhibits limited predictive capability (MAE = 23.01 MW, MAPE = 29.03%), primarily due to its sensitivity to the high-frequency noise and non-stationary fluctuations inherent in raw wind power data. The integration of the BiTCN module significantly mitigates this issue, reducing the MAE to 18.05 MW and the RMSE by 21.56%. This substantial improvement confirms that the dilated causal convolutions in BiTCN effectively function as a learnable denoising filter, extracting robust multi-scale temporal features that are more amenable to sequential modeling.
The further addition of the attention mechanism yields a marked reduction in MAPE to 17.71% (a 28.04% decrease relative to BiTCN-BiGRU), demonstrating the module’s ability to selectively emphasize critical “turning points” or ramp events while suppressing less informative steady-state periods. Crucially, the final introduction of the CPO algorithm for hyperparameter optimization unlocks the full potential of the hybrid architecture. The proposed CPO-BiTCN-BiGRU-Attention model achieves the best overall performance (MAE = 9.32 MW, MAPE = 8.41%), representing a 43.48% reduction in MAPE compared to the unoptimized variant. These results collectively validate the complementary nature of the framework: BiTCN denoises local features, BiGRU captures global evolution, Attention focuses on critical transitions, and CPO ensures the optimal structural configuration.
4.5. Comparison of Optimization Algorithms
To evaluate the superiority of the CPO algorithm in hyperparameter optimization (HPO) for the BiTCN-BiGRU-Attention framework, comparative experiments were conducted with two categories of benchmarks: standard search baselines (Random Search, RS; Bayesian Optimization, BO) and three mainstream meta-heuristic algorithms (Tunicate Swarm Algorithm, TSA; Slime Mould Algorithm, SMA; and Grey Wolf Optimizer, GWO). The performance was quantified using RMSE, MAPE, and MAE over 10 independent runs to ensure statistical significance. The results are summarized in Table 4.
As shown in Table 4, the performance of the proposed CPO is compared against both standard HPO baselines and meta-heuristic benchmarks. Expectedly, Random Search (RS) yields the highest error and variance (MAE = 16.25 ± 1.15 MW), highlighting the inefficiency of stochastic sampling in complex parameter spaces. Bayesian Optimization (BO) demonstrates a marked improvement over RS, yet it remains sub-optimal compared to CPO, as BO's surrogate-based approach may struggle with the highly non-convex loss landscape of the BiTCN-BiGRU-Attention architecture.
Quantitatively, strictly comparing the relative improvements among the meta-heuristic benchmarks, the CPO-optimized model significantly outperforms its competitors. Compared to the second-best performing optimization method in terms of MAE (GWO), the CPO-optimized model improved MAE by 35.55% and MAPE by 34.60%. Notably, CPO also outperforms the more sophisticated BO baseline by 14.10% in MAE. Although the RMSE improvement is marginal compared to SMA, the significantly lower Standard Deviation observed across 10 independent runs (e.g., ±0.31 MW for CPO vs. ±0.72 MW for BO) indicates that the CPO algorithm offers superior stability and robustness, effectively avoiding the local optima traps common in other swarm intelligence or surrogate-based methods.
In terms of RMSE, the CPO-optimized model achieves a value of 13.60 MW, which is slightly higher than that of the SMA-optimized model (12.48 MW) but marginally lower than the GWO-optimized model (13.66 MW). This difference can be attributed to the inherent sensitivity of RMSE to a small number of large deviations, as the squaring operation amplifies the influence of extreme prediction errors, such as those caused by sudden wind speed or wind direction changes.
By contrast, the CPO-optimized model consistently attains lower MAE and MAPE values, indicating more stable average prediction performance under normal operating conditions. From a practical wind power dispatching perspective, MAE and MAPE are often more representative of overall forecasting reliability, as they better reflect typical operational errors rather than being dominated by rare extreme events. Therefore, although the SMA-optimized model exhibits a slightly lower RMSE, the CPO-optimized model provides a more favorable balance between robustness and accuracy, rather than uniformly outperforming all alternative optimization strategies across every evaluation metric.
To comprehensively evaluate the practicality of the proposed method, we further compared the computational efficiency of CPO against TSA, SMA, and GWO. The comparison focuses on convergence speed (iterations to optimum) and total wall-clock tuning time.
As shown in Table 5, although CPO has a slightly higher average time per iteration (361 s) compared to GWO (338 s) due to the simulation of its four distinct defense mechanisms (visual, acoustic, odor, and physical), its convergence speed is significantly superior. CPO typically converges to the global optimum around the 12th iteration, whereas TSA and GWO require more than 20 iterations. Consequently, the total wall-clock tuning time for CPO is reduced by approximately 30–50% compared to the baselines. Furthermore, it is crucial to note that this hyperparameter optimization is an offline process. Once the optimal hyperparameters are determined, the trained model's online inference speed is in the millisecond range, fully satisfying the real-time requirements of grid dispatching.
4.6. Comparison with Classic Models
To further validate the advancement of the proposed CPO-BiTCN-BiGRU-Attention model, it was compared with five classic wind power prediction models: XGBoost (gradient-boosted trees), SVR (support vector regression), a BP neural network (backpropagation), Transformer, and CSDI. All models were trained and tested on the same dataset, with performance evaluated using RMSE, MAPE, and MAE.
The classical baseline models considered in this study, including XGBoost, SVR, BP, Transformer, and CSDI, were implemented without automated hyperparameter optimization. These models were configured using commonly adopted or recommended parameter settings reported in the wind power forecasting literature and are intended to serve as representative benchmark methods rather than fully optimized competitors. The primary purpose of this comparison is to highlight the performance gap between conventional forecasting models and the proposed CPO-optimized hybrid deep learning framework under practical modeling settings. The results are presented in Table 6.
As shown in Table 6, the CPO-BiTCN-BiGRU-Attention model exhibited significant performance advantages over all classic models. These results further confirm the advancement and efficiency of the CPO-BiTCN-BiGRU-Attention model in short-term wind power prediction.
To further validate the statistical significance of these results, a paired t-test was conducted between the proposed CPO-BiTCN-BiGRU-Attention model and the best-performing baseline (CSDI). The analysis yielded p-values < 0.05 for both MAE and RMSE metrics. This statistical evidence allows us to reject the null hypothesis, confirming that the performance improvements achieved by the proposed framework are statistically significant and not attributable to random stochasticity during the training process.
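For reference, the paired t statistic underlying such a test can be computed from per-run error differences as follows (the reported p-values would then come from the t distribution with n − 1 degrees of freedom, e.g. via a statistics library; this sketch is not necessarily the authors' exact procedure):

```python
import numpy as np

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic on per-run (or per-sample) error differences;
    compare against the t distribution with n - 1 degrees of freedom."""
    d = np.asarray(errors_a, dtype=float) - np.asarray(errors_b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1
```

The pairing matters: comparing the two models on identical runs (same splits, same seeds) removes run-to-run variance that an unpaired test would misattribute to the models themselves.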
4.7. Performance Analysis by Forecast Horizon
To further evaluate the model’s robustness in multi-step forecasting, we analyzed the error distribution across the prediction horizon (H = 4). In recursive forecasting, errors typically accumulate as the horizon extends.
Table 7 presents the detailed error metrics of the proposed model for each 15 min interval up to 1 h (t + 15 to t + 60 min).
5. Conclusions
This study proposes a CPO-optimized BiTCN-BiGRU-Attention model to address the volatility of short-term wind power forecasting. By designing a serial “Filter–Memorize–Focus” architecture, the model effectively decouples local noise from global temporal trends. The integration of BiTCN for feature extraction, BiGRU for sequence modeling, and Attention for event weighting creates a robust predictor. Furthermore, the CPO algorithm effectively solves the hyperparameter optimization problem, outperforming TSA, SMA, and GWO. Experimental results on real-world data confirm the model's superior accuracy and robustness (achieving the lowest MAE and MAPE among all compared methods, with average reductions of approximately 30–45% relative to classical benchmark models), while maintaining competitive RMSE performance compared with State-of-the-Art baselines. The experimental results on unseen test data demonstrate that the proposed CPO-BiTCN-BiGRU-Attention model exhibits strong generalization capability under varying operating conditions within the same wind farm. This indicates that the learned representations capture intrinsic temporal patterns of wind power generation rather than overfitting to specific samples. These advantages provide clear added value for practical short-term wind power forecasting, particularly for real-time grid dispatching applications. Future work will investigate the integration of transfer learning techniques to reduce data requirements and accelerate model adaptation when deploying the proposed framework across wind farms with significantly different geographic and meteorological characteristics, especially in data-scarce scenarios.