1. Introduction
Wind power has emerged as a cornerstone of the global transition toward “carbon neutrality”. However, its inherent intermittency and volatility pose significant challenges to the balance of supply and demand in power systems, and the integration of wind energy into power grids is further hindered by the stochastic nature of wind speeds. Accurate short-term wind power forecasting is therefore not merely a technical challenge but a prerequisite for grid stability, operational security, and economic dispatch [1].
Traditional statistical methods, such as ARIMA and Holt-Winters models [2,3,4], are interpretable but struggle to capture the highly nonlinear and dynamic characteristics of modern wind power data. Consequently, data-driven approaches, particularly deep learning, have emerged as the mainstream solution. Recurrent Neural Networks (RNNs) [5] and their variants (LSTM, GRU) have shown success in modeling temporal dependencies, but they often fail to capture the multi-scale characteristics of wind power, while convolutional networks (CNNs/TCNs) lack global memory [6].
To address these limitations, recent research has shifted towards hybrid architectures. However, three critical gaps remain in the current literature:
Methodological Trade-offs: Recent studies often rely on decomposition-based hybrids to handle non-stationarity. However, as shown in Table 1, these methods frequently introduce significant computational overhead and may inadvertently introduce data leakage during decomposition.
Lack of Physically Interpretable Architecture: Many existing hybrid models simply stack convolutional and recurrent layers without a clear functional logic aligned with the physical characteristics of wind data (e.g., separating turbulence from trends). This often leads to redundant computations or information loss.
Inadequate Hyperparameter Optimization: The performance of complex hybrid models is highly sensitive to hyperparameters. Most studies utilize manual tuning or basic algorithms like Particle Swarm Optimization (PSO), which easily fall into local optima on the complex, non-convex loss surfaces of deep networks.
To bridge these gaps, this paper proposes a CPO-BiTCN-BiGRU-Attention framework. We theoretically justify this combination: BiTCN acts as a “feature filter” to extract local patterns and reduce noise; BiGRU models the global sequence; and Attention highlights physically significant moments. Crucially, we employ the Crested Porcupine Optimizer (CPO). Unlike PSO [7], the Walrus Optimization Algorithm (WaOA) [8], or Genetic Algorithms (GA) [9,10], CPO simulates four distinct defense mechanisms (visual, acoustic, odor, and physical), providing a dynamic balance between exploration and exploitation and substantially improving the likelihood of converging to globally competitive hyperparameter configurations.
The main contributions of this study are:
A Physically Motivated Hybrid Architecture: We propose a serial “Filter–Memorize–Focus” framework that effectively integrates the local feature extraction of BiTCN with the global memory of BiGRU and the dynamic weighting of the attention mechanism, specifically designed to handle the multi-scale nature of wind power.
Adaptive Hyperparameter Optimization: The application of the CPO algorithm solves the “black-box” optimization problem of deep networks, demonstrating superior convergence speed and accuracy compared to TSA, SMA, and GWO.
Superior End-to-End Performance: Without relying on complex pre-decomposition techniques (like VMD or EMD), the proposed model achieves State-of-the-Art accuracy on real-world datasets, verified by comprehensive ablation studies and error distribution analysis.
3. Components of the Prediction Model
The proposed prediction framework is not a simple stacking of multiple modules but a physically motivated and data-driven architecture designed to reflect the intrinsic characteristics of wind power generation processes. Wind power time series are inherently nonlinear, non-stationary, and multi-scale, arising from the combined effects of atmospheric turbulence, wind gusts, and diurnal meteorological cycles. To effectively capture these complex dynamics, a carefully organized hierarchical modeling strategy is required.
3.1. Overall Framework: The Logic of Serial Processing
Wind power time series exhibit distinct patterns across different temporal scales, including high-frequency fluctuations caused by turbulence, medium-term local trends induced by wind gusts, and low-frequency periodic components associated with diurnal and seasonal variations. A single predictive model is generally insufficient to capture all these characteristics simultaneously. To address this challenge, as illustrated in Figure 1, a serial “Filter–Memorize–Focus–Optimize” modeling strategy is adopted, in which each component serves a clearly defined functional role:
Filter (BiTCN): Extracts multi-scale local features from raw wind power signals while suppressing high-frequency noise.
Memorize (BiGRU): Models the temporal evolution and long-term dependencies of the extracted features.
Focus (Attention): Assigns adaptive weights to critical time steps, emphasizing turning points and informative moments.
Optimize (CPO): Automatically tunes the hyperparameters of the hybrid deep learning model to adapt it to the characteristics of a specific dataset.
Through this serial processing pipeline, the framework progressively transforms raw wind power data into high-level, task-oriented representations, thereby enhancing predictive accuracy and robustness.
3.2. BiTCN: Multi-Scale Feature Extraction and Denoising
Temporal Convolutional Networks (TCNs) have been widely recognized for their effectiveness in modeling long-range temporal dependencies while maintaining efficient parallel computation capabilities [22,23]. However, conventional TCNs are inherently unidirectional, relying solely on past information, which may result in incomplete temporal feature representation, especially in complex and highly fluctuating wind power sequences.
To overcome this limitation, a Bidirectional Temporal Convolutional Network (BiTCN) is employed. By integrating forward and backward TCNs, the BiTCN is able to exploit contextual information from both historical and future time steps, leading to a more comprehensive temporal feature representation.
Moreover, wind power data are often contaminated by noise originating from sensor measurement errors and atmospheric turbulence. The dilated convolution layers in the BiTCN act as learnable nonlinear filters. By progressively increasing the dilation rate d, the receptive field of the network expands exponentially, enabling the model to capture short-term turbulence and medium-term trends simultaneously without sacrificing temporal resolution. This design allows the BiTCN to effectively perform feature extraction and denoising in a unified manner.
3.2.1. Structure of BiTCN
The BiTCN module consists of two symmetric sub-networks: a forward TCN, which processes the input sequence in chronological order, and a backward TCN, which processes the sequence in reverse order, as shown in Figure 2. Each sub-network is composed of stacked layers including a 1 × 1 convolution, dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout regularization. The outputs of the forward and backward TCNs are subsequently fused to form a bidirectional temporal representation.
In the diagram, the forward branch receives the input (X1, X2, …, Xt) and the reverse branch receives (Xt, …, X2, X1). Each branch applies a 1 × 1 convolution, dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout, with the final output denoted as XL+1.
The Dropout layers applied at the end of both the forward and backward TCN branches serve as independent regularization mechanisms rather than terminal outputs of the network. Specifically, Dropout is employed to mitigate overfitting by randomly deactivating a subset of neurons during training, thereby improving the robustness and generalization capability of each directional feature extractor. After Dropout, the outputs of the forward and backward TCNs are fused through feature concatenation to form a unified bidirectional temporal representation, which is then passed to the subsequent BiGRU module. This design ensures that regularization is applied symmetrically to both temporal directions while preserving complete bidirectional information for downstream sequence modeling.
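To make the dilated causal filtering concrete, the following NumPy sketch implements a single-channel causal dilated convolution and the forward/backward fusion described above. It is a simplified illustration with fixed, hand-chosen weights, not the trained BiTCN (which additionally includes 1 × 1 convolutions, batch normalization, Leaky ReLU, and dropout):

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation):
    """y[t] = sum_j kernel[j] * x[t - j*dilation]; the input is zero-padded
    on the left so each output depends only on current and past values."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

def bitcn_features(x, kernel, dilation):
    """Forward branch on x, backward branch on the reversed sequence,
    fused into a bidirectional representation of shape (T, 2)."""
    fwd = causal_dilated_conv(x, kernel, dilation)
    bwd = causal_dilated_conv(x[::-1], kernel, dilation)[::-1]
    return np.stack([fwd, bwd], axis=-1)
```

An impulse input makes the causality visible: with kernel [1, 1, 1] and dilation d = 2, an impulse at t = 0 produces nonzero forward outputs only at t = 0, 2, 4. Stacking layers with dilations 1, 2, 4 and kernel size k = 3 expands the receptive field to 1 + (k − 1)(1 + 2 + 4) = 15 steps, which is the exponential growth the text refers to.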
3.2.2. Feature Calculation Formula of BiTCN
The feature extraction process is expressed by Equations (1) and (2):

$$\overrightarrow{h}_t = \mathrm{TCN}\left(X;\, \delta, \lambda, d, p\right) \tag{1}$$

$$\overleftarrow{h}_t = \mathrm{TCN}\left(\overleftarrow{X};\, \delta, \lambda, d, p\right) \tag{2}$$

where X denotes the input feature sequence of the BiTCN module and $\overleftarrow{X}$ its time-reversed copy; δ denotes the dimension of the temporal convolution kernel; λ denotes the parameter of the Leaky ReLU activation function; d denotes the dilation rate of the dilated convolution; p denotes the dropout regularization parameter; and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the forward and backward temporal features extracted by the BiTCN module at time t, respectively.
3.3. BiGRU: Temporal Context Learning
The Gated Recurrent Unit (GRU) is a streamlined variant of the Long Short-Term Memory (LSTM) network that simplifies the gating mechanism by employing only two gates: the reset gate and the update gate. Compared with LSTM, GRU significantly reduces model complexity and training time while maintaining comparable predictive performance, making it particularly suitable for large-scale time series forecasting tasks with limited computational resources.
To further capture bidirectional temporal dependencies in wind power sequences, this study adopts the Bidirectional GRU (BiGRU) architecture. In practical wind power generation systems, the current power output is not solely determined by instantaneous wind speed, but is also influenced by the mechanical inertia of wind turbines and the evolving meteorological conditions over preceding and subsequent time intervals. By processing temporal features in both forward (past-to-future) and backward (future-to-past) directions, the BiGRU is able to model such dynamic evolution more effectively.
In the proposed framework, the BiGRU takes the multi-scale feature representations extracted by the BiTCN as input and focuses on learning their temporal evolution, thereby serving as the memory module in the serial “Filter–Memorize–Focus” architecture.
3.3.1. Structure of BiGRU
The BiGRU consists of two parallel GRU layers: a forward GRU layer, which processes the input feature sequence in chronological order, and a backward GRU layer, which processes the same sequence in reverse order, as illustrated in Figure 3. At each time step, the hidden states generated by the forward and backward GRU layers are combined to form a bidirectional temporal representation. This structure enables the BiGRU to integrate information from both historical and future contexts, resulting in a more comprehensive modeling of temporal dependencies. Such bidirectional modeling is particularly important for wind power forecasting, where abrupt changes and delayed responses frequently occur.
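For readers who prefer code to equations, a minimal NumPy BiGRU follows. The cell weights here are randomly initialized for illustration only; in the actual model the dimensions come from the CPO search and the parameters are learned during training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state."""
    def __init__(self, n_in, n_hid, rng):
        s = 1.0 / np.sqrt(n_hid)
        sx, sh = (n_hid, n_in), (n_hid, n_hid)
        self.Wz, self.Uz = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)
        self.Wr, self.Ur = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)
        self.Wh, self.Uh = rng.uniform(-s, s, sx), rng.uniform(-s, s, sh)

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)          # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)          # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return (1.0 - z) * h + z * h_cand

def bigru(seq, fwd_cell, bwd_cell, n_hid):
    """Run one GRU forward and one backward over the same window,
    then concatenate the two hidden states at each time step."""
    hf, H_f = np.zeros(n_hid), []
    for x in seq:
        hf = fwd_cell.step(x, hf)
        H_f.append(hf)
    hb, H_b = np.zeros(n_hid), []
    for x in seq[::-1]:
        hb = bwd_cell.step(x, hb)
        H_b.append(hb)
    H_b.reverse()
    return np.stack([np.concatenate([f, b]) for f, b in zip(H_f, H_b)])
```

Note that "bidirectional" here means both passes run inside a single historical input window, consistent with the leakage discussion in Section 3.3.2.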
3.3.2. Mathematical Formulation of BiGRU
The state updates of the BiGRU can be expressed by Equations (3)–(5):

$$\overrightarrow{h}_t = \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right) \tag{3}$$

$$\overleftarrow{h}_t = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right) \tag{4}$$

$$h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t \tag{5}$$

where GRU(·) represents the operational process of the traditional GRU network; $\overrightarrow{h}_t$ and $w_t$ denote the state and weight of the forward hidden layer at time t, respectively; $\overleftarrow{h}_t$ and $v_t$ denote the state and weight of the backward hidden layer at time t, respectively; and $b_t$ denotes the bias term of the hidden layer at time t.
The proposed bidirectional structures do not introduce data leakage during forecasting. In this study, both the BiTCN and BiGRU operate on fixed-length sliding windows that contain only historical observations available up to the prediction time. The term “bidirectional” refers to the internal feature extraction within each input window, where temporal dependencies are modeled in both forward and backward directions to enhance representation learning, rather than accessing any future unseen data beyond the forecasting horizon.
For multi-step forecasting, future ground-truth inputs are not available. Therefore, a recursive forecasting strategy is adopted, in which the model uses its own previous predictions as inputs for subsequent steps. At each prediction step, the bidirectional feature extraction is still confined to the historical window composed of observed or previously predicted values, ensuring that no future information is incorporated during inference.
3.4. Attention Mechanism: Capturing Ramping Events
The attention mechanism is introduced to selectively emphasize critical temporal features by assigning adaptive importance weights to different time steps. In wind power forecasting, adjacent observations generally exhibit stronger correlations with the target output, whereas distant time steps often contribute less. More importantly, rapid changes in wind power, commonly referred to as ramping events, carry substantially more predictive information than relatively stable periods.
Within a typical 24 h wind power sequence, turning points, where wind speed abruptly increases or decreases, reflect sudden meteorological changes and turbine response dynamics. By contrast, steady operating periods provide limited additional information. The attention mechanism enables the model to automatically focus on these informative moments, thereby enhancing its sensitivity to temporal variations and improving forecasting accuracy.
3.4.1. Working Principle of the Attention Mechanism
The attention mechanism operates by quantifying the relevance between each hidden state in the input sequence and the current prediction task. Specifically, it first computes a relevance score for each time step to measure its contribution to the forecasting objective. These scores are then normalized to obtain attention weights, which are used to perform a weighted aggregation of the temporal features.
As illustrated in Figure 4, the Attention module takes the bidirectional hidden representations generated by the BiGRU as input and outputs a context-aware feature representation that emphasizes critical time steps while suppressing redundant or less informative ones.
3.4.2. Mathematical Formulation of the Attention Mechanism
In the wind power series, not all time steps contribute equally: sudden changes (ramps) are more informative than steady states. The attention mechanism assigns adaptive weights αt to the hidden states ht, as given by Equations (6)–(8):

$$e_t = u \tanh\left(W h_t + b\right) \tag{6}$$

$$\alpha_t = \frac{\exp\left(e_t\right)}{\sum_{j=1}^{T} \exp\left(e_j\right)} \tag{7}$$

$$s = \sum_{t=1}^{T} \alpha_t h_t \tag{8}$$

where $e_t$ is the relevance score of hidden state $h_t$; W, b, and u are learnable parameters; and s is the context vector obtained by the weighted aggregation.
From a physical perspective, a higher attention weight αt indicates that the model has identified a significant meteorological transition or turbine response event at time step t that strongly influences future wind power output. This mechanism enables the model to focus on dynamic changes rather than steady-state conditions, thereby improving its capability to capture ramping behaviors.
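The scoring-normalization-aggregation pipeline can be sketched in a few lines of NumPy. This uses one common additive-attention parameterization (an assumption on our part; the paper's exact form may differ), with W, b, and u standing in for the learnable parameters:

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Additive attention over hidden states H of shape (T, d):
    score each time step, softmax-normalize, and aggregate."""
    e = np.tanh(H @ W.T + b) @ u             # relevance score per time step
    a = np.exp(e - e.max())                  # numerically stable softmax
    a = a / a.sum()                          # attention weights, sum to 1
    context = a @ H                          # weighted aggregation
    return a, context
```

A large weight a[t] flags the ramp-like, information-rich time steps the text describes; steady-state steps receive correspondingly small weights.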
3.5. Crested Porcupine Optimizer (CPO) for Hyperparameters
The combined BiTCN-BiGRU-Attention model has a complex, non-convex hyperparameter space. We utilize CPO to automate the tuning process.
Unlike PSO, which relies on a single velocity vector, CPO simulates four defense strategies. This allows the algorithm to switch between aggressive search (Visual/Sound) and precise local convergence (Physical/Odor), significantly reducing the risk of getting trapped in suboptimal hyperparameter configurations.
Time Complexity: The complexity is $O(T \times N \times D)$, where T is the number of iterations, N the population size, and D the search-space dimensionality. While training is computationally intensive, it is an offline process. The online prediction speed is unaffected and remains fast (milliseconds).
Specifically, CPO is employed to search for the optimal vector of five key hyperparameters: (1) The number of filters in BiTCN layers (Nc), (2) The kernel size (K), (3) The number of hidden units in BiGRU (Nh), (4) The initial learning rate (η), and (5) The regularization coefficient (Dropout rate).
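The search loop below is a schematic skeleton, not the full CPO: it shows how a continuous position vector can be decoded into this five-dimensional mixed search space and scored by a validation-loss fitness, but it replaces CPO's four defense-mechanism update rules with a simple drift-toward-best move. All bounds are illustrative, not the paper's actual ranges (those appear in Table 2):

```python
import numpy as np

# Illustrative bounds for the five optimized hyperparameters
BOUNDS = [
    ("n_filters", 16, 128, int),     # BiTCN filters (Nc)
    ("kernel",     2,   6, int),     # kernel size (K)
    ("n_hidden",  32, 256, int),     # BiGRU hidden units (Nh)
    ("lr",      1e-4, 1e-2, float),  # initial learning rate (eta)
    ("dropout", 0.0, 0.5, float),    # dropout rate
]

def decode(pos):
    """Map a position vector in [0, 1]^5 to concrete hyperparameters."""
    return {name: cast(lo + p * (hi - lo))
            for p, (name, lo, hi, cast) in zip(pos, BOUNDS)}

def search(fitness, n_pop=20, n_iter=30, seed=0):
    """Population-based search skeleton: evaluate candidates and keep the
    best (validation-RMSE) configuration seen so far."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(n_pop, len(BOUNDS)))
    best_pos, best_fit = None, np.inf
    for _ in range(n_iter):
        for pos in pop:
            f = fitness(decode(pos))
            if f < best_fit:
                best_fit, best_pos = f, pos.copy()
        # placeholder move: drift candidates toward the incumbent best
        pop = np.clip(pop + 0.1 * (best_pos - pop)
                      + 0.05 * rng.standard_normal(pop.shape), 0.0, 1.0)
    return decode(best_pos), best_fit
```

In the actual framework, `fitness` would train the BiTCN-BiGRU-Attention model with the decoded configuration and return its validation RMSE.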
The proposed Filter–Memorize–Focus paradigm provides an engineering-oriented interpretability rather than a strict theoretical interpretability. The interpretability arises from the functional decomposition of the model architecture, where each module is designed to correspond to a specific role consistent with the physical characteristics of wind power time series. Specifically, the convolutional filtering stage is associated with local fluctuation suppression, the recurrent memory stage captures temporal evolution, and the attention mechanism highlights critical ramp-related time steps. This interpretability is therefore qualitative and functional in nature, aiming to improve model transparency and engineering understanding, rather than to establish formal causal or theoretical guarantees.
4. Results and Discussion
4.1. Dataset and Preprocessing
The dataset used in this study was collected from a wind farm in Xinjiang, China, covering May to June 2021. It includes power load data along with related meteorological and temporal features, comprising a total of 3840 data points, each containing 15 distinct features. The dataset has undergone thorough preprocessing to ensure data integrity, containing no missing values or anomalous outliers. Measurements were taken at 15 min intervals, resulting in 96 data points per day. Each time point records multiple meteorological parameters, including wind speed, wind direction, atmospheric pressure, temperature, humidity, and others. All raw data were normalized using the min-max scaling method to eliminate the influence of differing units and scales. To strictly preserve the temporal continuity inherent in wind power data and prevent future data leakage, we adopted a chronological splitting strategy rather than random shuffling. The dataset was divided as follows: the first 70% of the time series was used for training, the subsequent 15% served as the validation set for the CPO process (to calculate the fitness function), and the final 15% was reserved for testing to evaluate the model’s generalization performance.
In this study, the wind power data are sampled at a 15 min interval, and the forecasting task is formulated as a short-term multi-step prediction problem. Specifically, a sliding input window of length L is constructed using historical wind power observations, where each window contains only past information available up to the prediction time. The input features consist of normalized wind power values, while the prediction target is defined as the wind power output for the subsequent H time steps.
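A minimal sketch of this preprocessing pipeline (chronological 70/15/15 split, min-max scaling, sliding windows) is given below. Fitting the scaler on the training segment only is our conservative reading of the normalization step, intended to avoid leakage; the paper does not state which segment the scaler was fitted on:

```python
import numpy as np

def chrono_split(n, train=0.70, val=0.15):
    """70/15/15 split that preserves temporal order (no shuffling)."""
    i, j = int(n * train), int(n * (train + val))
    return slice(0, i), slice(i, j), slice(j, n)

def fit_minmax(x):
    """Scaler statistics, computed on the training segment only."""
    return x.min(), x.max()

def apply_minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

def make_windows(series, L, H):
    """Sliding windows: L past points as input, next H points as target."""
    X, Y = [], []
    for i in range(len(series) - L - H + 1):
        X.append(series[i:i + L])
        Y.append(series[i + L:i + L + H])
    return np.array(X), np.array(Y)
```

With the paper's 3840 samples, the split boundaries fall at samples 2688 and 3264; a window length of L = 96 (one day at 15 min resolution, used here purely as an example) with H = 4 yields 3741 input-target pairs over the full series.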
The forecast horizon is set to H = 4, corresponding to a 1 h ahead prediction (4 × 15 min). During model training and inference, the proposed framework performs recursive multi-step forecasting, where predictions generated at earlier steps are iteratively fed back as inputs for subsequent steps.
The forecast horizon of 1 h (H = 4) is chosen from both practical and physical perspectives of wind power system operation. In real-world wind farm management and power system dispatching, short-term operational decisions, such as reserve allocation, unit commitment adjustment, and ramping control, are typically made within a time scale of 15–60 min. Therefore, a 1 h ahead prediction provides the most relevant information for operational planning and real-time control.
Moreover, wind power predictability decreases significantly as the forecasting horizon extends beyond one hour due to the increasing influence of large-scale meteorological variations. Longer multi-step horizons (e.g., 8 or 16 steps) tend to suffer from substantial error accumulation in recursive forecasting, which may reduce their practical value for real-time dispatching. Consequently, selecting a 4-step (1 h) horizon represents a balanced trade-off between predictive accuracy and operational applicability, making it particularly suitable for short-term wind power forecasting in practical power system scenarios.
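The recursive strategy described above can be sketched as follows, with a trivial persistence model standing in for the trained network:

```python
import numpy as np

def recursive_forecast(one_step_model, history, H):
    """Iterate a one-step-ahead model H times, feeding each prediction
    back into the input window; no future ground truth is used."""
    window = list(history)
    preds = []
    for _ in range(H):
        y_hat = one_step_model(np.asarray(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide the window forward
    return preds
```

Because each step consumes earlier predictions, errors compound with the horizon, which is exactly the error-accumulation effect cited above as a reason to cap H at 4.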
4.2. Evaluation Metrics
4.2.1. Selection of Evaluation Indicators
To quantitatively assess model performance, the following metrics defined in Equations (9)–(11) were employed:

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| \hat{y}_t - y_t \right| \tag{9}$$

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{t=1}^{N} \left| \frac{\hat{y}_t - y_t}{y_t} \right| \tag{10}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{y}_t - y_t \right)^2} \tag{11}$$

where $\hat{y}_t$ denotes the predicted value at time t, $y_t$ represents the corresponding ground truth, and N is the length of the sequence.
It should be noted that the use of MAPE in wind power forecasting may be problematic when actual power values approach zero. In practical wind farm operation, exact zero power outputs are rare within normal operating periods, as data segments corresponding to turbine shutdowns or maintenance are excluded during preprocessing. To further avoid numerical instability, a small positive constant is added to the denominator when computing MAPE, ensuring that near-zero values do not lead to inflated errors. Under these conditions, MAPE remains a meaningful indicator of relative prediction accuracy for short-term wind power forecasting.
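A direct implementation of the three metrics, including the stabilized MAPE denominator, might look like this (the value of the small constant eps is illustrative; the paper does not report the constant it uses):

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mape(y, y_hat, eps=1e-6):
    """Percentage error with a small constant added to the denominator
    to guard against near-zero actual power values."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean(np.abs((y - y_hat) / (np.abs(y) + eps))) * 100.0)
```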
4.2.2. Setting of Model Parameters
The Crested Porcupine Optimizer was employed to automatically search for the optimal hyperparameters. The objective function for CPO was defined as the RMSE on the validation set. The CPO algorithm configuration and the resource environment are detailed below:
Population Size: 20
Maximum Iterations: 30
Search Dimensionality: 5 (corresponding to the 5 optimized hyperparameters)
Optimization Strategy: The CPO applies its four distinct defense mechanisms cyclically to balance exploration (global search) and exploitation (local convergence).
Computing Resources: The experiments were conducted on a workstation with an Intel Core i7-12700K CPU, 32 GB RAM, and an NVIDIA GeForce RTX 3080 Ti GPU, using Python 3.9 and PyTorch 1.12.
4.2.3. Training Protocols and Final Hyperparameters
The model was trained using the Adam optimizer with the MSE loss function. We implemented an Early Stopping mechanism with a patience of 15 epochs to prevent overfitting. The batch size was fixed at 64, and the maximum number of epochs was set to 100. The random seed was fixed at 42 to ensure reproducibility.
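The early-stopping logic can be expressed framework-agnostically as below; `train_one_epoch` and `validate` are hypothetical callbacks standing in for the actual Adam/MSE training and validation passes:

```python
def train_with_early_stopping(train_one_epoch, validate,
                              max_epochs=100, patience=15):
    """Stop once validation loss has not improved for `patience`
    consecutive epochs; return the best loss and its epoch."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_loss = validate(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_loss, best_epoch
```

In practice one would also checkpoint the model weights at each improvement and restore the best checkpoint after stopping.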
Table 2 lists the search space for CPO and the final optimal values obtained:
4.3. Analysis of Experimental Results
4.3.1. Training Set Prediction Results
The prediction results of the model on the training set are illustrated in Figure 5. The model achieved excellent performance on the training set, with a coefficient of determination R² = 0.97613 and an RMSE of 9.56 MW. As observed from the figure, the predicted values (orange line) closely tracked the actual values (blue dots), exhibiting a strong fitting effect throughout the training period. This indicates that the model successfully learned the temporal and nonlinear characteristics embedded in the wind power data.
4.3.2. Test Set Prediction Results
The prediction results of the model on the test set are illustrated in Figure 6. The model still maintained good performance on the test set, with R² = 0.95626 and an RMSE of 13.6094 MW. Although the prediction error was slightly higher than that on the training set—due to the test set containing more fluctuating and irregular samples caused by variable meteorological conditions—the predicted values still effectively captured the main trends of the actual wind power data. The slight decrease in R² and increase in RMSE were within a reasonable range, indicating that the model had good generalization ability and did not suffer from overfitting.
4.3.3. Prediction Error Distribution
Figure 7 illustrates the distribution of prediction residuals of the proposed model. It can be observed that most errors are concentrated in a narrow range around zero, and the distribution exhibits an approximately symmetric, bell-shaped form. This suggests that the model predictions are largely unbiased and consistent across different operating conditions.
4.3.4. Regression Analysis
The regression relationship between the model's predicted values and actual values is illustrated in Figure 8. The fitted regression line was Output ≈ 1 × Target + 4.1, indicating a strong linear correlation between the predicted and actual values and further verifying the high prediction accuracy of the model.
4.4. Ablation Experiment
To validate the necessity and effectiveness of each core component in the proposed CPO-BiTCN-BiGRU-Attention framework, a systematic ablation study was conducted. By progressively adding functional modules to the baseline model, the contribution of each component can be quantitatively assessed. The experimental results are summarized in Table 3, where lower values of MAE, MAPE, and RMSE indicate better forecasting performance.
The results in Table 3 reveal a clear performance hierarchy. The standalone BiGRU exhibits limited predictive capability (MAE = 23.01 MW, MAPE = 29.03%), primarily due to its sensitivity to the high-frequency noise and non-stationary fluctuations inherent in raw wind power data. The integration of the BiTCN module significantly mitigates this issue, reducing the MAE to 18.05 MW and the RMSE by 21.56%. This substantial improvement confirms that the dilated causal convolutions in BiTCN effectively function as a learnable denoising filter, extracting robust multi-scale temporal features that are more amenable to sequential modeling.
The further addition of the attention mechanism yields a marked reduction in MAPE to 17.71% (a 28.04% decrease relative to BiTCN-BiGRU), demonstrating the module’s ability to selectively emphasize critical “turning points” or ramp events while suppressing less informative steady-state periods. Crucially, the final introduction of the CPO algorithm for hyperparameter optimization unlocks the full potential of the hybrid architecture. The proposed CPO-BiTCN-BiGRU-Attention model achieves the best overall performance (MAE = 9.32 MW, MAPE = 8.41%), representing a 43.48% reduction in MAPE compared to the unoptimized variant. These results collectively validate the complementary nature of the framework: BiTCN denoises local features, BiGRU captures global evolution, Attention focuses on critical transitions, and CPO ensures the optimal structural configuration.
4.5. Comparison of Optimization Algorithms
To evaluate the superiority of the CPO algorithm in hyperparameter optimization (HPO) for the BiTCN-BiGRU-Attention framework, comparative experiments were conducted with two categories of benchmarks: standard search baselines (Random Search, RS; Bayesian Optimization, BO) and three mainstream meta-heuristic algorithms (Tunicate Swarm Algorithm, TSA; Slime Mould Algorithm, SMA; and Grey Wolf Optimizer, GWO). The performance was quantified using RMSE, MAPE, and MAE over 10 independent runs to ensure statistical significance. The results are summarized in Table 4.
As shown in Table 4, the performance of the proposed CPO is compared against both standard HPO baselines and meta-heuristic benchmarks. Expectedly, Random Search (RS) yields the highest error and variance (MAE = 16.25 ± 1.15 MW), highlighting the inefficiency of stochastic sampling in complex parameter spaces. Bayesian Optimization (BO) demonstrates a marked improvement over RS, yet it remains sub-optimal compared to CPO, as BO's surrogate-based approach may struggle with the highly non-convex loss landscape of the BiTCN-BiGRU-Attention architecture.
Quantitatively, strictly comparing the relative improvements among the meta-heuristic benchmarks, the CPO-optimized model significantly outperforms its competitors. Compared to the second-best performing optimization method in terms of MAE (GWO), the CPO-optimized model improved MAE by 35.55% and MAPE by 34.60%. Notably, CPO also outperforms the more sophisticated BO baseline by 14.10% in MAE. Although the RMSE improvement is marginal compared to SMA, the significantly lower Standard Deviation observed across 10 independent runs (e.g., ±0.31 MW for CPO vs. ±0.72 MW for BO) indicates that the CPO algorithm offers superior stability and robustness, effectively avoiding the local optima traps common in other swarm intelligence or surrogate-based methods.
In terms of RMSE, the CPO-optimized model achieves a value of 13.60 MW, which is slightly higher than that of the SMA-optimized model (12.48 MW) but marginally lower than the GWO-optimized model (13.66 MW). This difference can be attributed to the inherent sensitivity of RMSE to a small number of large deviations, as the squaring operation amplifies the influence of extreme prediction errors, such as those caused by sudden wind speed or wind direction changes.
By contrast, the CPO-optimized model consistently attains lower MAE and MAPE values, indicating more stable average prediction performance under normal operating conditions. From a practical wind power dispatching perspective, MAE and MAPE are often more representative of overall forecasting reliability, as they better reflect typical operational errors rather than being dominated by rare extreme events. Therefore, although the SMA-optimized model exhibits a slightly lower RMSE, the CPO-optimized model provides a more favorable balance between robustness and accuracy, rather than uniformly outperforming all alternative optimization strategies across every evaluation metric.
To comprehensively evaluate the practicality of the proposed method, we further compared the computational efficiency of CPO against TSA, SMA, and GWO. The comparison focuses on convergence speed (iterations to optimum) and total wall-clock tuning time.
As shown in Table 5, although CPO has a slightly higher average time per iteration (361 s) compared to GWO (338 s) due to the simulation of its four distinct defense mechanisms (visual, acoustic, odor, and physical), its convergence speed is significantly superior. CPO typically converges to the global optimum around the 12th iteration, whereas TSA and GWO require more than 20 iterations. Consequently, the total wall-clock tuning time for CPO is reduced by approximately 30–50% compared to the baselines. Furthermore, it is crucial to note that this hyperparameter optimization is an offline process. Once the optimal hyperparameters are determined, the trained model's online inference speed is in the millisecond range, fully satisfying the real-time requirements of grid dispatching.
4.6. Comparison with Classic Models
To further validate the advancement of the proposed CPO-BiTCN-BiGRU-Attention model, it was compared with five classic wind power prediction models: XGBoost (gradient-boosted trees), SVR (support vector regression), a BP neural network (backpropagation), Transformer, and CSDI. All models were trained and tested on the same dataset, with performance evaluated using RMSE, MAPE, and MAE.
The classical baseline models considered in this study, including XGBoost, SVR, BP, Transformer, and CSDI, were implemented without automated hyperparameter optimization. These models were configured using commonly adopted or recommended parameter settings reported in the wind power forecasting literature and are intended to serve as representative benchmark methods rather than fully optimized competitors. The primary purpose of this comparison is to highlight the performance gap between conventional forecasting models and the proposed CPO-optimized hybrid deep learning framework under practical modeling settings. The results are presented in Table 6.
As shown in Table 6, the CPO-BiTCN-BiGRU-Attention model exhibited significant performance advantages over all classic models. These results further confirm the advancement and efficiency of the CPO-BiTCN-BiGRU-Attention model in short-term wind power prediction.
To further validate the statistical significance of these results, a paired t-test was conducted between the proposed CPO-BiTCN-BiGRU-Attention model and the best-performing baseline (CSDI). The analysis yielded p-values < 0.05 for both MAE and RMSE metrics. This statistical evidence allows us to reject the null hypothesis, confirming that the performance improvements achieved by the proposed framework are statistically significant and not attributable to random stochasticity during the training process.
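For reference, the paired t statistic underlying such a test can be computed from per-run error differences as follows (the reported p-values would then come from the t distribution with n − 1 degrees of freedom, e.g. via a statistics library; this sketch is not necessarily the authors' exact procedure):

```python
import numpy as np

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic on per-run (or per-sample) error differences;
    compare against the t distribution with n - 1 degrees of freedom."""
    d = np.asarray(errors_a, dtype=float) - np.asarray(errors_b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1
```

The pairing matters: comparing the two models on identical runs (same splits, same seeds) removes run-to-run variance that an unpaired test would misattribute to the models themselves.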
4.7. Performance Analysis by Forecast Horizon
To further evaluate the model’s robustness in multi-step forecasting, we analyzed the error distribution across the prediction horizon (H = 4). In recursive forecasting, errors typically accumulate as the horizon extends.
Table 7 presents the detailed error metrics of the proposed model for each 15 min interval up to 1 h (t + 15 to t + 60 min).
5. Conclusions
This study proposes a CPO-optimized BiTCN-BiGRU-Attention model to address the volatility of short-term wind power forecasting. By designing a serial “Filter–Memorize–Focus” architecture, the model effectively decouples local noise from global temporal trends. The integration of BiTCN for feature extraction, BiGRU for sequence modeling, and Attention for event weighting creates a robust predictor. Furthermore, the CPO algorithm effectively solves the hyperparameter optimization problem, outperforming TSA, SMA, and GWO. Experimental results on real-world data confirm the model's superior accuracy and robustness (achieving the lowest MAE and MAPE among all compared methods, with average reductions of approximately 30–45% relative to classical benchmark models), while maintaining competitive RMSE performance compared with State-of-the-Art baselines. The experimental results on unseen test data demonstrate that the proposed CPO-BiTCN-BiGRU-Attention model exhibits strong generalization capability under varying operating conditions within the same wind farm. This indicates that the learned representations capture intrinsic temporal patterns of wind power generation rather than overfitting to specific samples. These advantages provide clear added value for practical short-term wind power forecasting, particularly for real-time grid dispatching applications. Future work will investigate the integration of transfer learning techniques to reduce data requirements and accelerate model adaptation when deploying the proposed framework across wind farms with significantly different geographic and meteorological characteristics, especially in data-scarce scenarios.