1. Introduction
Wind energy has become a critical component of global sustainable energy strategies due to its abundance and low environmental impact. According to the Global Wind Energy Council’s (GWEC) “Global Wind Report 2025”, global cumulative wind power installed capacity reached 1136 GW in 2024 (11% year-on-year growth), with a projected 8.8% compound annual growth rate from 2025 to 2030 [1]. However, wind power’s inherent randomness and intermittency pose significant risks to grid stability, making high-precision prediction a prerequisite for efficient wind energy utilization.
Current wind power prediction methods fall into three categories: physical methods, data-driven methods, and hybrid methods. Physical methods rely on Numerical Weather Prediction and terrain data but suffer from poor portability and sensitivity to data noise.
In contrast, data-driven methods, dominated by deep learning models, generate forecasts from patterns learned in historical data. Given the dynamic characteristics of wind power time series, numerous studies have sought to improve forecasting performance. Successful prediction approaches include the Random Forest algorithm [2], a wind power prediction model combining principal component analysis and a BP neural network [3], a long short-term memory network combined with sliding-window technology [4], a prediction model using a convolutional neural network for feature extraction [5], and wavelet neural network models based on the wavelet transform [6]. However, these methods have inherent limitations in feature processing, such as the risk of losing key information and poor adaptability. To address these issues, Shi et al. [7] enhanced wind direction correlation by employing a wind direction enhancement algorithm to extract wind direction trend features. Li et al. [8] improved the Kernel Extreme Learning Machine by optimizing its kernel width and regularization coefficient with a differential algorithm. Meanwhile, given the substantial volatility of wind power and the high dimensionality and redundancy of the data, Wang et al. [9] incorporated the Temporal Pattern Attention mechanism into the feature extraction process, enhancing the accuracy of wind power predictions. Chen et al. [10] used Principal Component Analysis for dimensionality reduction when applying generative adversarial networks to wind power forecasting, further refining feature selection. However, few existing data-driven improvements specifically target data quality problems, leaving the bottleneck of data sensitivity unresolved.
Hybrid methods, a key research direction in wind power prediction, aim to integrate the advantages of multiple models to compensate for the shortcomings of single models. Existing hybrid studies include enhanced variational mode decomposition combined with LSTM networks for subsequence prediction [11] and an ultra-short-term wind power prediction framework integrating LSTM with the SARIMA model [12]. However, most of these methods adopt simple module concatenation rather than dynamic fusion of seasonal and temporal features, and their core modules exhibit strong interdependence, high parameter sensitivity, and low fault tolerance. To address these issues of module dependence and parameter optimization in traditional hybrid methods, integrating evolutionary computation with machine learning has become a mainstream research trend, since this combination helps systematically identify optimal parameter configurations that enhance model performance. For instance, researchers have explored weighted support vector machines optimized via genetic algorithms [13], least squares support vector machines [14], an improved fruit fly optimization algorithm tailored for support vector machine parameter tuning [15], a rich–poor optimization algorithm for fine-tuning outlier-robust extreme learning machine parameters [16], and a hybrid improved cuckoo search algorithm for optimizing support vector machine hyperparameters [17]. Despite progress in parameter optimization, these evolutionary-computation-based hybrid methods still fail to resolve the core limitation of traditional hybrid approaches: weak adaptation between long-term dependencies and dynamic seasonal trends in wind power time series.
For error correction and non-stationarity handling, two key challenges closely related to hybrid method performance, existing studies have made preliminary attempts, but with notable drawbacks. In terms of error correction, Liang et al. [18] combined predictions from a support vector machine (SVM) and an Elman neural network, while Shi et al. [19] employed a least squares support vector machine with a radial basis function for error correction. However, the accuracy of these error models depends heavily on the quality and representativeness of the training data, and they cannot compensate for the core defects of hybrid methods themselves. For time series non-stationarity, new frameworks have been proposed, such as the non-stationary Transformer [20], wind power prediction models enhanced by the improved African Vulture Optimization Algorithm [21], and the “swinLST” recurrent unit [22]. Nevertheless, these techniques either require excessive computational resources, are sensitive to algorithm parameter selection, or tend to misinterpret noise as valid spatial correlations, and none address the root issue of poor dynamic fusion of seasonal and temporal features.
To provide a clear overview of existing solutions and highlight the research gaps addressed by this work, Table 1 summarizes representative wind power forecasting methods, their core features, advantages, and limitations. This comparison emphasizes the trade-off between prediction accuracy, computational efficiency, and adaptability to complex temporal and seasonal features.
In summary, existing wind power prediction methods face three critical and interrelated gaps that restrict their accuracy and reliability: (1) data-driven methods are sensitive to data quality and prone to losing key features; (2) hybrid methods lack dynamic fusion of seasonal and temporal features, suffer from strong module dependence and high parameter sensitivity, and fail to effectively resolve the weak adaptation between long-term dependencies and dynamic seasonal trends; and (3) error correction and non-stationarity handling are either inefficient or disconnected from the core defects of hybrid methods. To address these gaps comprehensively, this paper proposes a T-LSTM-based hybrid framework with four key innovations:
- (1) We adopt the SARIMA method to analyze seasonal trends in wind power data, enhancing the model’s ability to integrate seasonal and external factors and reducing its parameter sensitivity.
- (2) The innovative T-LSTM recurrent unit is proposed, which efficiently extracts the core feature information from time series data to underpin reliable wind power prediction.
- (3) A wind-power-prediction-specific architecture is designed to accurately capture and reproduce temporal relationships, strengthening the model’s adaptability to complex trends for higher forecast precision.
- (4) Comparative analysis with prevalent methods is conducted, with the core mechanisms and correction strategies optimized simultaneously to verify the proposed approach’s efficacy.
The remainder of this paper is structured as follows: Section 2 provides an overview of previous research in related areas; Section 3 elaborates on the architectural design of the proposed model; Section 4 discusses the prediction results of the new model in comparison with other standard models; and Section 5 concludes the paper with a summary of the key findings.
2. Related Work
The prediction task in this article falls under the category of Short-Term Wind Power Prediction (STWPP). Specifically, we focus on single-point rolling prediction for a specific onshore wind farm located in northwest China, with the target variable being the total hourly average wind power output (kW) of the entire wind farm at future time steps.
The input data for the predictive model comes from the supervisory control and data acquisition (SCADA) system of the wind farm, covering the entire year from 1 January 2019 to 31 December 2019 (35,040 data points). Following the widely adopted sequence-to-point prediction paradigm, the input features include two types of information: (1) historical wind power data of the target wind farm; and (2) multi-source meteorological covariates synchronized with the historical power data, including wind speed (m/s), wind direction (°), air pressure (kPa), ambient temperature (°C), and relative humidity (%). The prediction horizon is set to 15, 30, 60, and 90 min ahead, corresponding to predicting wind power generation for the next 1, 2, 4, and 6 time intervals, respectively.
2.1. Transformer
In the context of wind power forecasting, the Transformer architecture’s ability to process data in parallel is of particular significance. Huang et al. [29] conducted in-depth experiments to adapt the Transformer model for wind power forecasting, but its more than 100 million parameters increased training costs. Similarly, Chen et al. [30] investigated different variants of the Transformer model and optimized it for wind power forecasting tasks. Their research demonstrated that, with proper adjustments, the Transformer model can be tailored to specific wind power forecasting scenarios, leading to enhanced performance and more reliable predictions. However, their optimization process overly focuses on local features within a given scene and does not establish a cross-scene feature transfer mechanism, which increases operational costs in practical applications.
To address the Transformer’s excessive parameter count and lack of cross-scenario adaptability, the proposed framework combines an optimized LSTM with multi-head attention and integrates SARIMA’s strength in seasonal pattern extraction. This design avoids the Transformer’s high parameter cost, enhances cross-scenario feature reuse through multi-feature fusion, and balances prediction accuracy, training efficiency, and practical operational feasibility.
2.2. LSTM
The LSTM network has emerged as a significant advancement in neural networks, particularly for processing sequential data with complex temporal patterns. At the core of the architecture are specialized LSTM units designed to selectively remember or forget information over long sequences. These units consist of input, output, and forget gates, which together control the flow of information into and out of the memory cells; however, maintaining three separate gates increases training latency. Despite its innovative design and potential benefits, the LSTM model has several limitations in practice. Chief among them is the time-consuming training process: due to the complexity of the architecture and the large number of parameters involved, training an LSTM can be computationally expensive and requires substantial data and computational resources. This makes the model difficult to deploy in real-time applications where quick and accurate predictions are needed.
To address LSTM’s high training latency and computational complexity, the proposed method streamlines the LSTM structure by optimizing the gate mechanism and removing redundant parameters while retaining the ability to capture long-term dependencies, thus lowering training and operational costs.
2.3. SARIMA
The SARIMA model is a classic statistical method widely used in wind power forecasting, specifically designed to capture seasonal trends and linear time patterns in time series data. Its advantages lie in clear interpretability, mature theoretical support, and good performance in short-term forecasting of stable seasonal sequences. However, SARIMA has significant limitations in wind power prediction scenarios. Firstly, it relies on strict assumptions of linearity and stationarity, which are difficult for wind power data to meet—wind power output has high volatility and nonlinearity, and is influenced by the coupling of multiple meteorological factors. Secondly, as a univariate model, it cannot effectively integrate and analyze multidimensional feature interactions and cannot capture the complex nonlinear relationship between meteorological conditions and wind power generation.
To compensate for SARIMA’s inability to model nonlinear relationships and multi-feature interactions, the proposed framework retains SARIMA’s advantage in capturing linear seasonal trends. By combining linear seasonal features and nonlinear complex features through weighted fusion, the framework improves resistance to data noise and adaptability to sudden meteorological changes.
3. The Proposed Approach
3.1. Overall Architecture
As shown in Figure 1, this prediction architecture combines two components through a weighted method to generate accurate wind energy predictions, aiming to improve prediction reliability for renewable energy applications. To specify the weighted fusion of the two models, we first establish the mathematical formulation of this integration.
Specifically, the prediction results of the LSTM network are corrected by the results of the SARIMA model. The final wind power prediction is given by Equation (1):
ŷ(t) = λ · ŷ_SARIMA(t) + (1 − λ) · ŷ_T-LSTM(t)(1)
where ŷ_SARIMA(t) represents the prediction result of the SARIMA model, ŷ_T-LSTM(t) represents the prediction output of the proposed T-LSTM model, and λ is a hyperparameter quantifying the contribution of the SARIMA model to the final prediction; it is tuned by minimizing the error between the fused prediction and the actual wind power value. The value selected in this paper is λ = 0.14.
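As a minimal illustration, this convex weighted fusion can be sketched in a few lines of Python. The weight 0.14 is the value reported in this paper; the prediction values themselves are hypothetical, and the convex-combination form is an assumption consistent with the weighted-fusion description:

```python
def fuse(sarima_pred, tlstm_pred, lam=0.14):
    """Weighted fusion of SARIMA and T-LSTM predictions (assumed convex form)."""
    return [lam * s + (1.0 - lam) * t for s, t in zip(sarima_pred, tlstm_pred)]

# Hypothetical one-step-ahead predictions (kW) from the two components
sarima = [120.0, 118.5]
tlstm = [125.0, 122.0]
fused = fuse(sarima, tlstm)  # e.g. 0.14 * 120 + 0.86 * 125 = 124.3
```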
The T-LSTM model first processes the data point x(t) at time step t through an input embedding layer, which projects the input into a hidden-dimensional space. The data then enters the simplified Transformer module, where long-term dependencies are captured and processed to produce x_tb. Next, the T-LSTM unit receives the transformed data block, the hidden state h(t − 1), and the memory cell state c(t − 1) from the previous time step, and computes the current hidden state h(t) and memory cell state c(t). One copy of h(t) is passed to the reconstruction layer, while the other, together with c(t), becomes the input to the T-LSTM unit at the next time step. Finally, the reconstruction layer maps h(t) back to the input dimensionality and predicts the wind power for the subsequent time step. By combining the Transformer module with the LSTM unit, the T-LSTM model captures long-term dependencies efficiently, making it a robust tool for time series forecasting tasks.
3.2. Seasonal Auxiliary Prediction Based on the SARIMA Model
The SARIMA model is an advanced version of the Autoregressive Integrated Moving Average (ARIMA) model, specifically designed to address the periodic characteristics seen in time series data. In practical applications, the T-LSTM network can be used in conjunction with SARIMA to refine predicted values and adjust wind power estimates. Traditional ARIMA models struggle to accurately capture both seasonal and non-seasonal components of wind power data, leading to potential errors in parameter selection. The SARIMA model is represented by Equation (2), encompassing the seasonal patterns within the data. By leveraging SARIMA’s capabilities, analysts can better model and predict fluctuations in wind power generation, enhancing the decision-making processes in the renewable energy sector.
Φ_P(B^s) φ_p(B) (1 − B)^d (1 − B^s)^D y_t = Θ_Q(B^s) θ_q(B) ε_t(2)
where the wind power series is denoted as y_t and ε_t stands for white noise; B is the lag (backshift) operator; d is the order of non-seasonal differencing used to remove non-stationarity, and D is the order of seasonal differencing for tackling seasonal patterns; p and q are the orders of the non-seasonal autoregressive and moving average terms, and P and Q are their seasonal counterparts; finally, s is the seasonal period, reflecting the periodicity of seasonal variations in wind power data.
Equations (3) and (4) give the non-seasonal autoregressive and moving average polynomials, which link future time series values to past values and past errors:
φ_p(B) = 1 − φ_1 B − φ_2 B^2 − … − φ_p B^p(3)
θ_q(B) = 1 + θ_1 B + θ_2 B^2 + … + θ_q B^q(4)
These polynomials expose the dependencies within a dataset, aiding forecasting and analysis.
Equations (5) and (6) describe the seasonal patterns within a time series using the seasonal autoregressive and moving average polynomials:
Φ_P(B^s) = 1 − Φ_1 B^s − Φ_2 B^{2s} − … − Φ_P B^{Ps}(5)
Θ_Q(B^s) = 1 + Θ_1 B^s + Θ_2 B^{2s} + … + Θ_Q B^{Qs}(6)
Integrating these polynomials into the ARIMA equation equips the model to identify and predict seasonal fluctuations, allowing more accurate analysis of time series data.
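The differencing operators (1 − B)^d and (1 − B^s)^D used above are straightforward to sketch in plain Python; the function names are illustrative, not from the paper:

```python
def difference(series, lag=1):
    """Apply (1 - B^lag) once: y_t - y_{t-lag}."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def sarima_difference(series, d=0, D=1, s=12):
    """Apply (1 - B)^d (1 - B^s)^D, as SARIMA does before fitting the ARMA part."""
    out = list(series)
    for _ in range(d):
        out = difference(out, 1)
    for _ in range(D):
        out = difference(out, s)
    return out

# A purely seasonal series with period 4 becomes all zeros
# after a single seasonal difference
y = [1, 2, 3, 4] * 3
print(sarima_difference(y, d=0, D=1, s=4))  # [0, 0, 0, 0, 0, 0, 0, 0]
```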
3.3. Transformer Block
The Transformer block (TB) serves as the core feature extraction unit of the proposed wind power prediction model, tasked with converting low-dimensional features from the input embedding layer into high-dimensional representations that capture multi-scale temporal correlations and key meteorological features (see Figure 2 for its structure and operation process).
The input to the TB is the wind power feature vector (integrating multi-source features such as wind speed, wind direction, and temperature) processed by the embedding layer, which undergoes two-step preprocessing to adapt to the dynamic-fluctuation-plus-periodicity characteristics of wind power time series. Positional encoding is first applied to inject temporal information; sine-cosine encoding is adopted instead of learnable positional encoding to reduce model parameters and mitigate overfitting in small-sample wind power scenarios. The encoding is given in Equations (7) and (8):
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})(7)
PE(pos, 2i + 1) = cos(pos / 10000^{2i/d_model})(8)
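The fixed sine-cosine encoding can be sketched directly (a plain-Python version of the standard Transformer formulation; the sequence length and dimension below simply mirror the 24-step lookback and 64-dimensional hidden size used later in the paper):

```python
import math

def positional_encoding(seq_len, d_model):
    """Fixed sine-cosine positional encoding: even dims use sin, odd dims cos."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=24, d_model=64)
# pe[0] alternates [sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
```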
Layer Normalization is then performed to unify feature scales, preventing subsequent attention layers from biasing towards large-value features and safeguarding the capture of key small-scale features.
The preprocessed features are fed into the Multi-Head Attention layer, the core of capturing complex correlations in wind power data. Input vectors are split into multiple attention heads for parallel calculation of dimension-specific attention weights, enabling multi-perspective feature extraction and avoiding one-sided correlation capture. Each head assigns dynamic weights based on query-key similarity, emphasizing critical predictive information and suppressing redundancy, with outputs concatenated into a unified feature vector to complete dynamic feature enhancement.
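The per-head weighting described above is scaled dot-product attention: each query is compared with every key, the similarities are normalized by softmax, and the values are averaged under those weights. A minimal single-head sketch in plain Python (toy two-step, two-dimensional inputs; variable names are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # attention weights over time steps
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Two time steps, two feature dimensions (toy values)
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

In a multi-head layer, this computation runs once per head on a separate projection of the inputs, and the per-head outputs are concatenated, as described above.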
A “residual connection + normalization” structure follows to address gradient vanishing in deep network training and stabilize data distribution. Residual connections allow shallow feature information to propagate directly to subsequent layers, which is essential for stacking multiple TB blocks, while a second Layer Normalization eliminates feature-value fluctuations after residual connections, ensuring stable data distribution for subsequent layers.
The final component is the Multi-Layer Perceptron (MLP), which transforms correlation features from the attention layer into temporal dependent features to meet wind power prediction requirements for long- and short-term temporal patterns. Through a two-layer fully connected structure with ReLU activation, the MLP integrates local correlations into global temporal dependencies and outputs the transformed feature block x_tb.
3.4. T-LSTM Cell
The innovative recurrent cell structure introduced in this research is a refined version of the LSTM network, demonstrated in Figure 3. To standardize notation and clarify variable meanings, the key variables are defined first: x_tb is the output processed by the Transformer block; h_prev (i.e., h(t − 1)) is the hidden state at the previous time step; h_t (i.e., h(t)) is the hidden state at the current time step; c_prev (i.e., c(t − 1)) is the memory cell at the previous time step; and c_t (i.e., c(t)) is the memory cell at the current time step. This structure excels at capturing both short-term and long-term relationships within time series data by updating the cell state c(t) and the hidden state h(t) at each time step.
In order to address the limitations of traditional LSTM while maintaining its advantages in sequential data modeling, the T-LSTM unit adopts a “single gate dual control” filtering gate which simultaneously adjusts the “fusion of historical memory and new features” and the “output of unit state to hidden state”. The theoretical and empirical basis for this specific design is as follows:
Theoretical basis for gate simplification: traditional LSTM relies on three separate gates to control new information input, historical information retention, and cell state output, but these functions exhibit strong correlation in wind power time series. The filter gate of T-LSTM integrates these correlated functions into a single control coefficient f(t).
Synergy with the Transformer block: the input of the filter gate directly contains x_tb, ensuring that the control coefficient f(t) is adapted to the fine-grained temporal correlations captured by the Transformer. This design avoids the traditional LSTM’s equal treatment of raw inputs and prioritizes the high-value features for wind power prediction.
The computation begins with the calculation of the gate value: the attention-weighted output x_tb is combined with the hidden state h(t − 1) from the preceding time step, Layer Normalization is applied, and the Sigmoid activation function determines the gate value f(t). The memory cell is then updated by processing the current input x_tb together with the memory cell from the previous time step and multiplying the outcome by the gate value f(t). Finally, the current hidden state is generated by element-wise multiplication of the cell state c(t) and the gate value f(t). The fundamental equations of the T-LSTM model are outlined in Equations (9)–(11), showcasing the effectiveness of this optimized cell structure in capturing complex dependencies within time series data.
The filter gate is a crucial element in the operation of the T-LSTM cell, receiving the inputs x_tb, h_prev, and c_prev to determine the flow of information within the network. It plays a vital role in controlling the information flow and the interactions between the hidden state and memory cell of the previous time step and the current input.
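Since Equations (9)–(11) are not reproduced here, the following is a speculative plain-Python sketch of the single-filter-gate update as described in the text: f(t) is computed from x_tb and h(t − 1) through a Sigmoid, the cell state is updated and scaled by f(t), and the hidden state is f(t) applied element-wise to c(t). The scalar form, the weights, and the exact cell-update expression are illustrative assumptions, not the paper’s specification:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tlstm_step(x_tb, h_prev, c_prev, w_x, w_h, b):
    """One T-LSTM step with a single filter gate (illustrative scalar form).

    f(t) gates both the memory update and the hidden-state output,
    replacing the three gates of a traditional LSTM.
    """
    # Gate value from the Transformer-block output and the previous hidden state
    f = sigmoid(w_x * x_tb + w_h * h_prev + b)
    # Cell update: blend previous memory with the new input, scaled by the gate
    c = f * (c_prev + x_tb)  # assumed form; Equation (10) may differ
    # Hidden state: gate applied element-wise to the cell state
    h = f * c
    return h, c

h, c = tlstm_step(x_tb=0.5, h_prev=0.0, c_prev=0.2, w_x=1.0, w_h=1.0, b=0.0)
```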
4. Numerical Experiment
In this section, we select a comprehensive wind farm SCADA dataset that accurately reflects real-life operating conditions, encompassing monitoring data across different climates and timeframes, including meteorological parameters and historical wind power output. Using this dataset, we compare our proposed model against several benchmark methods to thoroughly assess its practical effectiveness in wind power prediction and to identify its capabilities and limitations in real-world wind farm scenarios.
4.1. Data Description and Experiment Setup
Dataset: This study employs a comprehensive SCADA dataset collected from a wind farm in northwestern China. The dataset, sampled at 15 min intervals, covers the entire year from 1 January 2019 to 31 December 2019, comprising a total of 35,040 data points. Through the wind farm’s supervisory control and data acquisition (SCADA) system, various real-time meteorological parameters are captured, including wind direction (°), wind speed (m/s), historical wind power (kW), air pressure (kPa), temperature (°C), and relative humidity (%).
To verify the cross-seasonal generalization ability of the model, the entire dataset was divided into four seasonal subsets based on the meteorological seasons in the Northern Hemisphere: spring (8832 data points from March to May), summer (8832 data points from June to August), autumn (8736 data points from September to November), and winter (8640 data points from December to February).
During model development, the dataset was divided chronologically into training and testing subsets. The training subset comprises 90% of the data and is used to train the model, while the testing subset, the remaining 10%, is used to evaluate the model’s performance. The experiment adopts single-point prediction, specifically a rolling prediction strategy that forecasts wind power generation for the next 1, 2, 4, and 6 time intervals based on the preceding 24 time intervals. For cross-seasonal validation, the same training–test split ratio (9:1), rolling prediction strategy, and hyperparameters are applied to each seasonal subset to ensure consistency and avoid overfitting to specific seasonal patterns.
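The chronological split and rolling windows described above can be sketched as follows; this is a simplified univariate version, whereas in the paper each sample also carries the meteorological covariates:

```python
def chrono_split(series, train_frac=0.9):
    """Chronological 9:1 train/test split (no shuffling)."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

def make_windows(series, lookback=24, horizon=1):
    """Sliding windows: `lookback` past intervals -> value `horizon` steps ahead."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback + horizon - 1])
    return X, y

data = list(range(100))           # stand-in for 15 min power readings
train, test = chrono_split(data)  # 90 training points, 10 test points
X, y = make_windows(train, lookback=24, horizon=4)  # 60 min ahead
```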
Transformer block and T-LSTM unit parameter selection: two attention heads are used, with head 1 focusing on hourly-scale dependence and head 2 on daily-scale dependence. This design addresses the core challenge of wind power forecasting, capturing short-term dynamic fluctuations and long-term seasonal cyclicality, regardless of the forecast range. To validate this design, we conducted quantitative ablation experiments on the 90 min prediction scenario (the longest horizon in this study). This setting is the most challenging for multi-scale feature capture: it requires not only capturing short-term temporal correlations but also integrating long-term periodic patterns to avoid cumulative prediction errors. The results show that the single-head configuration achieved an MAE of 10.51 and an RMSE of 14.76, and the four-head configuration an MAE of 8.78 and an RMSE of 13.31, while the dual-head configuration balanced accuracy and efficiency with an MAE of 7.67 and an RMSE of 12.99, outperforming both alternatives.
The hidden layer dimension is 64. Sixty-four dimensions are sufficient to carry the key information of the time series, such as the nonlinear mapping between wind speed and wind power output, seasonal trends at the hourly/daily/weekly level, and the cross-correlation of multi-source features; because the feature correlations of the time series data are relatively concentrated, 64 dimensions already cover more than 90% of the effective information. Higher dimensions introduce redundant feature expression, which increases the model’s learning burden. The batch size is 32.
To enhance the reproducibility of the proposed T-LSTM model, detailed training configurations are provided in Table 2.
SARIMA parameter selection: In addition to the primary model, a SARIMA model was developed using the wind power time series. To assess and compare candidate models, the Akaike information criterion (AIC) was used as the evaluation metric. The AIC balances goodness of fit against model complexity; lower AIC values indicate models better suited to wind power data modeling. SARIMA(3, 0, 1)(3, 1, 1) with seasonal period s = 12 was selected in two steps: a grid search over candidate parameter combinations, followed by minimizing the AIC value, as detailed in Table 3, and validating against seasonal error.
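The AIC-based selection step can be sketched generically. The candidate orders below and their fit results are hypothetical stand-ins for the output of an actual maximum-likelihood fitting routine; the criterion itself is AIC = 2k − 2 ln L:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def select_by_aic(candidates):
    """Pick the candidate order with the lowest AIC.

    `candidates` maps an order tuple to (log_likelihood, n_params),
    as a hypothetical SARIMA fitting routine would return.
    """
    return min(candidates, key=lambda k: aic(*candidates[k]))

# Hypothetical fit results for three candidate orders
fits = {
    (1, 0, 1, 1, 1, 1, 12): (-5210.0, 5),
    (3, 0, 1, 3, 1, 1, 12): (-5150.0, 9),
    (3, 0, 3, 3, 1, 3, 12): (-5149.0, 13),
}
best = select_by_aic(fits)  # (3, 0, 1, 3, 1, 1, 12)
```

Note how the third candidate fits slightly better (higher log-likelihood) but loses on AIC because of its extra parameters, which is exactly the fit-versus-complexity trade-off described above.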
4.2. Performance Metrics and Benchmark Models
A comprehensive evaluation of the proposed wind power prediction method was conducted, comparing it with six other commonly used methods: long short-term memory (LSTM) [31], convolutional neural network-gated recurrent unit (CNN-GRU) [25], non-stationary Transformer (ns_Transformer) [20], Autoformer [32], Reformer [33], and a least squares support vector machine optimized with a specialized algorithm [21]. These methods were selected for their representativeness across model categories. Among classic deep learning models, GRU outperforms LSTM in computational efficiency thanks to its streamlined update gates while maintaining high accuracy. Among Transformer variants, ns_Transformer addresses time series non-stationarity, a key wind power prediction challenge, showing superior performance; Autoformer’s decomposition design combines autocorrelation mechanisms to extract temporal patterns efficiently without heavy computation; and Reformer, a benchmark for the efficiency–performance trade-off, uses LSH attention and reversible layers, so comparison with it highlights the proposed method’s innovation. The LSSVM, optimized via the African Vulture Optimization Algorithm, represents traditional machine learning. This comparative setup reveals each method’s unique strengths and emphasizes the need for tailored selection based on practical requirements.
To thoroughly assess the effectiveness of the various prediction methods, we focus on four key evaluation metrics: mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). These metrics quantify the difference between predicted and actual values. Lower values of MAE, MSE, and RMSE indicate higher prediction accuracy, while higher values of R² indicate a stronger correlation between predicted and actual values and thus more favorable prediction outcomes. Formulas (12)–(15) give the specifics for calculating these indicators. A detailed comparison using these metrics demonstrates the accuracy and dependability of each model on wind power prediction tasks.
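The four metrics in Formulas (12)–(15) can be implemented directly; a plain-Python sketch with toy values:

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return mse(y_true, y_pred) ** 0.5

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
print(mae(y_true, y_pred))  # 0.25
```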
At the same time, to verify the statistical reliability of the performance differences between T-LSTM and the baseline models, the normality of the sample-by-sample absolute error (AE) differences was first checked with the Shapiro–Wilk test. The results showed that the AE differences between all baseline models and T-LSTM did not follow a normal distribution (p < 0.0001). Therefore, this study used the Wilcoxon signed-rank test for paired comparisons, with Bonferroni correction (k = 6 comparisons, family-wise significance level α = 0.05) to control the type I error arising from multiple comparisons.
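The Bonferroni step itself is simple to sketch: each raw Wilcoxon p-value is multiplied by the number of comparisons k = 6 (capped at 1) before being compared with α = 0.05. The raw p-values below are hypothetical, for illustration only:

```python
def bonferroni(p_values):
    """Bonferroni correction: p_adj = min(1, k * p), with k = number of comparisons."""
    k = len(p_values)
    return [min(1.0, k * p) for p in p_values]

# Hypothetical raw Wilcoxon p-values for the 6 baseline comparisons
raw = [0.0001, 0.004, 0.012, 0.0005, 0.03, 0.2]
adj = bonferroni(raw)
significant = [p < 0.05 for p in adj]  # decisions at the corrected level
```

Equivalently, one can leave the p-values untouched and compare each against α/k ≈ 0.0083; the decisions are identical.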
4.3. Ablation Experiment Design and Results
To quantitatively isolate the contributions of three core components of the proposed framework—(i) Transformer block (TB), (ii) improved LSTM gate, and (iii) SARIMA seasonal fusion—we designed three ablation variants based on the full T-LSTM-SARIMA model:
Ablation 1 (T-LSTM w/o TB): Remove the Transformer block; the model uses only the improved LSTM (with Filter Gate) and SARIMA fusion. Input data is directly fed into the T-LSTM cell without multi-scale temporal feature extraction.
Ablation 2 (LSTM-TB-SARIMA): Replace the improved LSTM gate with the traditional LSTM three-gate mechanism; retain the Transformer block and SARIMA fusion.
Ablation 3 (T-LSTM w/o SARIMA): Remove the SARIMA model; the final prediction relies solely on the T-LSTM unit without seasonal trend correction.
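The three variants differ only in which component is disabled; a hypothetical configuration sketch (the flag names are ours, mirroring the component descriptions above rather than an actual codebase) makes the design explicit:

```python
# Full model: all three core components enabled.
FULL_MODEL = {"transformer_block": True, "filter_gate": True, "sarima_fusion": True}

# Each ablation disables exactly one component of the full model.
ABLATIONS = {
    "T-LSTM w/o TB":     {**FULL_MODEL, "transformer_block": False},
    "LSTM-TB-SARIMA":    {**FULL_MODEL, "filter_gate": False},  # revert to 3-gate LSTM
    "T-LSTM w/o SARIMA": {**FULL_MODEL, "sarima_fusion": False},
}
```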
All ablation experiments use the same dataset, hyperparameters (except for necessary adjustments to the traditional LSTM gate), and evaluation metrics as the full model. The results are shown in Table 4. The values in bold in the table represent the optimal performance indicator values.
Compared with the complete model, the MAE of Ablation 1 (without TB) increased by 19.5% (15 min) to 28.8% (60 min). The most significant performance decline occurred at the 60 min horizon, indicating that the TB’s multi-head attention mechanism is crucial for medium- and long-term prediction. Without the TB, the model cannot extract fine-grained temporal correlations, and errors accumulate as the prediction horizon expands. The MAE of Ablation 2 (traditional LSTM gate) is 11.9% (15 min) to 19.9% (60 min) higher than that of the complete model, verifying that the filter gate improves information retention over the traditional three-gate mechanism while preserving the ability to capture sequential dependencies. Ablation 3 (without SARIMA) showed a 5.7% (15 min) to 8.8% (90 min) increase in MAE, with the greatest degradation occurring at the 90 min horizon. This indicates that SARIMA effectively addresses the weakness of T-LSTM units in modeling linear seasonal trends, and that the weighted fusion of SARIMA seasonal features reduces prediction bias.
4.4. Comparison with Benchmark Models
In order to improve the readability of the performance indicators and avoid redundant numerical presentation, we first use line charts (Figure 4, Figure 5 and Figure 6) to visualize the core evaluation indicators (MAE, RMSE, R2) over the different prediction horizons, and provide the corresponding tables for quantitative reference.
The values of the four performance metrics for each method on the test set are shown in Table 5, Table 6, Table 7 and Table 8. The values in bold in the tables represent the optimal performance indicator values.
The Wilcoxon test results reflect the statistical reliability of the performance differences between T-LSTM and the benchmark models. We split the visualization into two parts, statistical significance (corrected p-values) and effect size (r-values), as the two indicators answer different research questions: whether the difference is significant, and how large it is.
Figure 7 uses color to represent the corrected p-values, with annotations indicating “significant” (p-corrected < 0.05) or “not significant” (p-corrected ≥ 0.05) to directly answer the core statistical question. Cells shown in red indicate significant differences, while cells shown in white indicate non-significant differences.
Figure 8 shows the magnitude of the effect size r that quantifies the performance difference (|r| < 0.1: small; 0.1 ≤ |r| < 0.3: medium; |r| ≥ 0.3: large). The heatmap uses a divergent color palette to distinguish positive and negative values and annotates the exact r value for clarity. A negative r indicates that T-LSTM outperforms the corresponding benchmark model, and the color intensity increases with |r| (darker = larger effect size).
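The effect size r = Z/√N can be recovered from the Wilcoxon p-value via the normal approximation; a sketch (our own helper, not the paper's code; note that |r| can exceed 1 at extreme p-values under this approximation):

```python
import numpy as np
from scipy import stats

def wilcoxon_effect_size(diff):
    """Effect size r = Z / sqrt(N) for a Wilcoxon signed-rank test,
    recovering |Z| from the two-sided p-value via the normal quantile.
    diff = AE(T-LSTM) - AE(baseline); a negative r therefore means
    T-LSTM's errors are smaller, i.e. T-LSTM outperforms the baseline."""
    diff = np.asarray(diff, dtype=float)
    diff = diff[diff != 0.0]               # zero differences are discarded
    n = diff.size
    p = stats.wilcoxon(diff).pvalue
    z = stats.norm.isf(p / 2.0)            # |Z| from the two-sided p-value
    sign = np.sign(np.median(diff))        # direction of the difference
    return float(sign * z / np.sqrt(n))
```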
To complement the visualization and provide precise numerical references, Table 9 summarizes the key Wilcoxon test results (significance and r-value magnitude).
Overall performance advantage: T-LSTM achieved the lowest MAE and RMSE and the highest R2 at the 60 min and 90 min horizons, confirming its advantages in medium- and long-term wind power prediction, consistent with the research objectives.
Among the benchmarks, Reformer and ns_Transformer exhibit the highest prediction errors across all horizons: their overly complex architectures lead to overfitting on the highly volatile wind power time series. CNN-GRU and LSSVM exhibit stable but poor performance, reflecting their limitations in balancing short-term fluctuations and long-term trends. LSTM performs relatively well in short-term (15 min) prediction, with a low MAE of 5.2717; however, its error increases sharply over time, indicating that it struggles to maintain accuracy at longer horizons. At the 30 min horizon, the MAE of LSTM (6.0777) is slightly lower than that of T-LSTM (6.6317). This is attributable to LSTM’s inherent advantage in capturing short-term real-time sequence dependencies: within 30 min, wind fluctuations are mainly driven by local, short-term meteorological changes, and LSTM’s classical three-gate mechanism models this short-term temporal correlation directly, without additional feature-fusion overhead. In contrast, T-LSTM integrates multi-scale feature extraction and seasonal trend fusion, which introduces moderate computational overhead but yields cumulative benefits as the prediction horizon expands. Autoformer is competitive in the short term (15–30 min) but lacks adaptability to seasonal changes in the long term, whereas T-LSTM maintains consistent accuracy through the fusion of linear and nonlinear features.
Because the original prediction curves cover a period of 36 days, they overlap severely and have limited interpretability. To address this, we plot the prediction curves over a shorter window (3 consecutive days, 288 data points) in Figure 9, Figure 10, Figure 11 and Figure 12, which clearly highlights the models’ performance differences in scenarios such as stable wind power periods, rapid fluctuations, and peak/valley moments. The wind power forecast results for the next 15 min, 30 min, 60 min, and 90 min are shown in Figure 9, Figure 10, Figure 11 and Figure 12, respectively. These figures show the actual wind power values and the predicted values of the different models, including T-LSTM, LSTM, GRU, Autoformer, Reformer, LSSVM, and ns_Transformer. The selected 3-day period (Days 10–12) is representative of typical wind power characteristics: it includes stable low-fluctuation phases, sudden power surges/drops, and extreme peak values, reflecting each model’s adaptability to diverse scenarios.
The above wind power prediction chart reveals three key findings:
T-LSTM exhibits a fast dynamic response in rapid-fluctuation scenarios, with a shorter time delay than the other models.
In extreme (peak/valley) scenarios, the prediction error of T-LSTM is smaller than that of the baseline models, reflecting its stronger robustness.
As the prediction time extends, the error accumulation rate of T-LSTM is the lowest, while the error accumulation rate of LSTM and Autoformer exceeds 20%.
These results supplement the quantitative indicators (MAE/RMSE/R2) and further validate that the integrated design of T-LSTM can effectively balance short-term volatility capture and long-term trend stability.
4.5. Cross-Seasonal Generalization Experiment
To verify whether the proposed model can maintain stable performance across different seasonal meteorological conditions, we conducted additional cross-seasonal experiments. The experiment focuses on the 60 min prediction horizon (representative of medium-term forecasting where seasonal effects are prominent).
Table 10 presents the key evaluation metrics (MAE, RMSE, R2) of T-LSTM and the benchmark models across the four seasons for the 60 min prediction horizon. The values in bold in the table represent the optimal performance indicator values.
As expected, all models showed higher errors in winter, but the T-LSTM model maintained good performance in all seasons.
During the spring season, the MAE of T-LSTM is 16.1% lower than that of Autoformer, 23.6% lower than that of LSTM, and 45.8% lower than that of Reformer. Its R2 remains above 0.97, indicating strong adaptability to gradual seasonal changes. During the summer, T-LSTM performed better than Autoformer by 16.0% in MAE, highlighting its ability to capture wind speed changes caused by convective weather. Compared with ns_Transformer and Reformer, T-LSTM reduced MAE by 45.5% and 63.5% respectively, avoiding overfitting to short-term noise. During the autumn season, T-LSTM had the lowest RMSE and the highest R2, outperforming all benchmarks. This confirms that the model does not overfit stable conditions and maintains its feature extraction ability. During winter, the MAE of T-LSTM is 15.7% lower than that of Autoformer and 19.8% lower than that of LSTM. Even under extreme conditions, its R2 is 3.2 percentage points higher than the Reformer, indicating its greater resistance to seasonal fluctuations.
The cross-seasonal results confirm that the advantages of T-LSTM are not limited to the annual dataset but also extend to the individual seasonal scenarios. It is worth noting, however, that this validation is based on a single geographic location, and the model’s generalizability to other regions or to climatically anomalous years has not been tested; we acknowledge this as a limitation.
5. Conclusions
This paper proposes a T-LSTM-SARIMA hybrid framework to address the limitations of existing wind power prediction models. The key findings are as follows:
The T-LSTM unit balances long-term dependency capture (simplified Transformer) and training efficiency (improved LSTM), offering a leaner alternative to standard Transformer-LSTM hybrids, and the weighted SARIMA fusion enhances seasonal adaptability: T-LSTM outperforms all benchmark methods in long-term prediction MAE. Future work will focus on refining the T-LSTM model to improve its robustness and on verifying its predictive capability over longer periods, with the goal of further elevating the accuracy and reliability of wind power forecasting methods.
Additionally, the proposed T-LSTM-SARIMA hybrid framework may hold potential for broader applicability beyond wind power prediction. Its core characteristics (effective capture of multi-scale temporal dependencies, dynamic fusion of linear seasonal trends with nonlinear complex features, and robustness to data volatility) may render it applicable to other time series or sequence-related tasks. For instance, in automated fault detection and diagnosis (AFDD) for air handling units, the framework could potentially leverage its temporal feature extraction capability to identify abnormal patterns from semi-labeled operational data [34]. In fire-door defect text classification, the multi-head attention mechanism and feature-fusion strategy might help improve the recognition of defect-related text sequences during pre-delivery inspections [35]. Such potential versatility suggests that the proposed method might offer a reference for addressing complex prediction and classification challenges across diverse domains.