STGCformer: Spatio-Temporal Graph Convolutional Transformer for Short-Term Wind Power Forecasting

Tian, Chenyu; Xia, Min; Yuan, Shi; Wang, Liwen; Zhuang, Wei

doi:10.3390/en19051214


by Chenyu Tian ¹, Min Xia ¹, Shi Yuan ¹, Liwen Wang ² and Wei Zhuang ¹,*

¹ Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, No. 219, Ningliu Road, Nanjing 210044, China

² China Electric Power Research Institute Co., Ltd., No. 8 Nanrui Road, Gulou District, Nanjing 210003, China

* Author to whom correspondence should be addressed.

Energies 2026, 19(5), 1214; https://doi.org/10.3390/en19051214

Submission received: 30 January 2026 / Revised: 20 February 2026 / Accepted: 26 February 2026 / Published: 28 February 2026

(This article belongs to the Special Issue Development of Artificial Intelligence in Green Buildings and Renewable Energy)


Abstract

The accuracy of short-term wind power forecasting (STWPF) is crucial for the stable operation of power systems. Existing models capture spatio-temporal dependencies insufficiently, which leads to low prediction accuracy. To address this issue, this paper proposes a novel Transformer-based spatio-temporal graph convolutional (STGCformer) model. The time series decomposition module (TSDM) captures periodic fluctuations and long-term variations within the data by performing seasonal trend decomposition. The spatio-temporal graph convolutional (STGC) architecture combines a Graph Attention Network (GAT) with convolutional layers (Convs) to capture both spatial and temporal dependencies, jointly processing the spatio-temporal characteristics inherent in wind power data. The Transformer’s attention mechanism simultaneously handles both short-term and long-term fluctuations. Extensive experimental results show that STGCformer achieves the best prediction accuracy across multiple time steps (24, 48, 72, and 96 h), with the mean absolute error (MAE) and mean absolute percentage error (MAPE) at 48 h being 41.383 and 3.862, respectively. This model provides a new methodological framework for STWPF.

Keywords:

short-term wind power prediction; spatio-temporal graph convolution; time series decomposition; transformer; graph attention network; hybrid models

1. Introduction

With the transformation of the global energy structure, wind power [1], with its large reserves and low cost, has become one of the most promising renewable energy sources today [2]. However, because wind speeds are intermittent and unpredictable, the continuous expansion of installed wind power capacity presents significant challenges to the stable and secure operation of grid systems [3]. Therefore, achieving high-precision [4] and reliable wind power forecasting is particularly important. It not only enables better management of power generation [5], reduces operational costs, and optimizes the integration of renewable energy into the grid [6], but it also mitigates the negative impacts that fluctuations in renewable energy may cause, thus ensuring system stability [7].

The traditional wind power forecasting methods primarily comprise physical approaches [8] and statistical methods [9]. Physical methods typically base power prediction on the physical characteristics of wind turbines and meteorological data. These methods rely on the patterns of variation in meteorological factors such as wind speed [10], wind direction, temperature, and atmospheric pressure. By establishing mathematical models (e.g., wind speed-power curves, dynamic models of wind turbines), they predict the output power of wind turbines. Physical methods offer high interpretability and strong theoretical foundations; however, their high computational complexity and cost make them challenging to apply for short-term wind power forecasting (STWPF) [11]. Statistical methods analyze historical wind power output data to establish forecasting models through statistical modeling [12]. These approaches do not require an in-depth understanding of turbine physics, instead relying on patterns within historical data for prediction. Common statistical models include ARIMA [13], Kalman filtering [14], and gray models [15]. Statistical approaches struggle to handle the complex nonlinear characteristics inherent in wind power forecasting, resulting in relatively poor prediction accuracy for STWPF.

In recent years, with the continuous advancement of artificial intelligence technology [16], AI methods have found widespread application in wind power forecasting, demonstrating superior predictive capabilities compared to traditional approaches. Unlike conventional statistical methods, AI techniques can autonomously learn complex patterns and relationships from data without relying on explicit physical models or assumptions [17]. They offer advantages such as rapid computation and robust handling of intricate nonlinear characteristics. Recurrent Neural Networks (RNNs) [18] represent a classic AI approach for addressing time series problems. Their core principle is ‘memory’, whereby outputs depend not only on current inputs but also on ‘memories’ derived from previous steps. However, they are prone to vanishing or exploding gradients when processing lengthy sequences. Long Short-Term Memory (LSTM) networks [19], a variant of RNNs, incorporate gating mechanisms to regulate information flow. This enables LSTMs to capture temporal dependencies [20] over specific time horizons, mitigating the impact of exploding or vanishing gradients on the model. Compared to the LSTM, the Gated Recurrent Unit (GRU) [21] has a simpler structure, with only two gates (the update gate and the reset gate). Because it has no separate cell state, it has fewer parameters and higher computational efficiency, but it is less accurate than LSTM in capturing temporal dependencies.

Wind power output, however, exhibits strong nonlinearity, intense volatility, and coexisting multi-scale characteristics. The Transformer [22] demonstrates exceptional modeling capabilities for complex sequential relationships through its attention mechanism. It can directly capture long-term dependencies [23] within input sequences without relying on progressively transmitted information, as required by traditional RNNs or LSTMs. It processes all sequence elements simultaneously within a single computation [24], granting Transformers a distinct advantage over RNNs and LSTMs when handling lengthy sequences. Li et al. [25] generated features across different time scales using the EMD algorithm, then employed a Transformer model with causal convolutions to forecast wind power output, achieving significantly enhanced prediction accuracy. Wu et al. [26] proposed an EEMD-Transformer-based model for wind speed prediction. Hybrid models combine the strengths of different architectures, enhancing feature learning capabilities, robustness, and prediction accuracy. Shi et al. [27] proposed a graph-based GSTAformer, which employs spatio-temporal joint modeling to enhance prediction accuracy.

The principal contributions of this paper are summarized as follows:

(1): An STGC module is proposed. This module constructs a graph structure using the geographical location information of wind turbines and historical power data. By integrating Graph Attention Network (GAT) and one-dimensional convolutional operations, it simultaneously addresses the spatial dependencies between turbine locations within a wind farm and the temporal dependencies of historical wind power data, thereby enhancing the model’s spatio-temporal feature learning capabilities.
(2): The time series decomposition module (TSDM) decomposes data into seasonal and trend components, enabling the model to better capture long-term trends and periodic fluctuations, thereby improving prediction accuracy.
(3): A spatio-temporal modeling module based on the External Attention (EA)-enhanced Transformer is designed. This module not only captures the spatio-temporal dependencies in the wind power data but also leverages the self-attention mechanism of the Transformer to provide global information, thereby improving the extraction of trend and seasonal features.
(4): The STGCformer model is validated at multiple forecasting horizons (24, 48, 72, and 96 h) on the ACM KDD Cup 2022 competition dataset, where it achieves the best accuracy among the compared models.

2. Feature Analysis

2.1. Problem Definition and Data Preprocessing

STWPF remains fundamentally a time series prediction problem. This paper builds upon this foundation by incorporating spatial dependencies between wind turbines within wind farms to enhance prediction accuracy, though time series forecasting remains the primary approach. In this study, we combined Interquartile Range (IQR) filtering and the local 3-sigma rule to ensure data quality and consistency. Specifically, IQR filtering was used to remove extreme values lying beyond the IQR-based bounds around the upper and lower quartiles, which are typically considered outliers. The local 3-sigma rule detects outliers by calculating the standard deviation of local data and marking values that deviate from the local mean by more than three standard deviations. By combining these two methods, we were able to effectively clean the data and remove outliers that did not conform to the overall trend. IQR filtering marked approximately 4% of the samples as outliers and removed them; the local 3-sigma rule removed a further 3% of the samples. Missing data was then filled using forward and backward linear interpolation. This approach allowed us to remove extreme values and anomalous fluctuations, ensuring data quality and the stability of model training. The data was then processed with reversible normalization (RevIN) as per Equation (1).

x_{\mathrm{norm}, i} = \frac{x_{i} - \frac{1}{n} \sum_{k = 1}^{n} x_{k}}{\sqrt{\frac{1}{n} \sum_{k = 1}^{n} {(x_{k} - \frac{1}{n} \sum_{m = 1}^{n} x_{m})}^{2}}}

(1)

In this process, $\frac{1}{n} \sum_{k = 1}^{n} x_{k}$ is the mean of the global data from the training set, and $\frac{1}{n} \sum_{k = 1}^{n} {(x_{k} - \frac{1}{n} \sum_{m = 1}^{n} x_{m})}^{2}$ is the variance of the global data from the training set. Through RevIN, we ensure that different features have the same scale, effectively preventing some features from affecting the model’s training performance due to scale differences.
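The cleaning and normalization steps of this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 1.5 × IQR fence, the 25-sample local window, and all function names are hypothetical choices that the text does not specify.

```python
import numpy as np

def iqr_mask(x, k=1.5):
    """True where x lies inside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed fence."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

def local_three_sigma_mask(x, window=25):
    """True where x is within 3 local standard deviations of the local mean."""
    half = window // 2
    keep = np.ones(len(x), dtype=bool)
    for t in range(len(x)):
        seg = x[max(0, t - half): t + half + 1]
        mu, sd = seg.mean(), seg.std()
        if sd > 0 and abs(x[t] - mu) > 3 * sd:
            keep[t] = False
    return keep

def clean_series(x, window=25):
    """Drop points flagged by either rule, then refill them by linear interpolation."""
    x = np.asarray(x, dtype=float)
    keep = iqr_mask(x) & local_three_sigma_mask(x, window)
    idx = np.arange(len(x))
    return np.interp(idx, idx[keep], x[keep]), keep

def fit_stats(train):
    """Mean and standard deviation of the training split only, as in Equation (1)."""
    return train.mean(), train.std()

def normalize(x, mu, sigma):
    return (x - mu) / sigma

def denormalize(z, mu, sigma):
    """Inverse of normalize, applied to predictions at the end of the pipeline."""
    return z * sigma + mu

# Toy series with one injected spike.
series = np.sin(np.linspace(0, 6, 200))
series[50] = 25.0
cleaned, keep = clean_series(series)
mu, sigma = fit_stats(cleaned[:160])
z = normalize(cleaned, mu, sigma)
```

Computing the statistics on the training portion only keeps information from the validation and test periods out of the scaling.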

This paper employs a regression-based version of the Synthetic Minority Over-Sampling Technique (SMOTE) to enhance dataset diversity and improve model generalization capability. It synthesizes samples by interpolating within the original dataset, thereby increasing the dataset size while helping to prevent overfitting. First, the data X is decomposed into $\mathrm{fix}_{X}$ and $X_{\mathrm{new}}$, where $\mathrm{fix}_{X}$ comprises fixed features requiring no augmentation, and $X_{\mathrm{new}}$ represents the variable feature component. A random number $p \in (0, 1)$ is then generated to partition the data into augmented components $X_{i}$ and original components $X_{j}$. Interpolation weights $\gamma$ are generated via a Beta distribution. The interpolated result $X_{D}$ is obtained by regression-based SMOTE interpolation and is then concatenated with the fixed features to yield $X_{\mathrm{final}}$. In this process, we use a sliding window to construct supervised learning samples, ensuring that interpolation is only performed within the window of historical data, thereby maintaining the temporal dependence of the time series. At the same time, we introduce temporal locality constraints during interpolation to ensure that the generated samples come from temporally close samples, and we add physical constraints to ensure that the generated samples adhere to physical laws, such as the relationship between wind speed and power. These measures preserve the temporal dependence and physical integrity of the data augmentation process, avoiding temporal leakage and unreasonable results in the interpolated samples. The computational process is illustrated in Equations (2) and (3).

X_{D} = \gamma X_{i} + (1 - \gamma) X_{j}

(2)

X_{\mathrm{final}} = \mathrm{Concat}(\mathrm{fix}_{X}, X_{D})

(3)
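The interpolation in Equations (2) and (3) can be illustrated with a short NumPy sketch. It is a simplified reading of the procedure rather than the authors' code: the Beta(0.5, 0.5) parameters, the `max_gap` locality window, the clipping bounds standing in for the physical constraints, and the function name are all assumptions, and the random partition by p is omitted.

```python
import numpy as np

def smote_regression(fix_x, x_new, rng, max_gap=3, lower=0.0, upper=None):
    """Interpolate each time step with a temporally close partner inside one window.

    fix_x: (L, F_fix) features left untouched; x_new: (L, F_var) features to augment.
    """
    n = x_new.shape[0]
    # Temporal locality: the partner index j is at most max_gap steps away.
    offsets = rng.integers(-max_gap, max_gap + 1, size=n)
    j = np.clip(np.arange(n) + offsets, 0, n - 1)
    gamma = rng.beta(0.5, 0.5, size=(n, 1))            # interpolation weights
    x_d = gamma * x_new + (1.0 - gamma) * x_new[j]     # Equation (2)
    x_d = np.clip(x_d, lower, upper)                   # crude stand-in for physical bounds
    return np.concatenate([fix_x, x_d], axis=1)        # Equation (3)

rng = np.random.default_rng(0)
fix_x = np.ones((10, 2))
x_new = np.arange(20.0).reshape(10, 2)
x_final = smote_regression(fix_x, x_new, rng)
```

Because each synthetic value is a convex combination of two real values from the same window, it always stays within the range already observed in that window.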

2.2. Wind Power Prediction Data Analysis

This paper utilizes the Spatial-Based Wind Power Dynamic Forecasting Dataset (SDWPF) provided by Baidu PaddlePaddle, encompassing over six months of historical data from 134 generators, collected at 10 min intervals. The dataset further incorporates relative positional information between wind turbines to establish spatial correlations. Figure 1 presents historical power output data for one turbine. The graph demonstrates the pronounced volatility and uncertainty inherent in wind power generation. In practical forecasting operations, the power output is influenced not only by internal factors such as reactive power, turbine internal temperature, and blade pitch angle, but it is also significantly driven by external variables including wind speed, wind direction, and ambient temperature. Consequently, this paper employs Pearson’s correlation coefficient (PCC) to quantify the linear dependency strength between internal/external factors and wind power output.

The feature PCC heatmap is depicted in Figure 2. As TurbID, Day, and Tmstamp are non-numeric features, their PCCs are not calculated. Similarly, Pab1, Pab2, and Pab3 take identical values, so only Pab1 is computed. Based on the PCC heatmap, features whose absolute correlation coefficient with wind power output satisfies $|\mathrm{CORR}| > 0.4$ are retained as auxiliary features. Consequently, Wspd, Etmp, Itmp, Pab1, and Prtv are selected as the model’s input variables.
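The thresholding rule can be sketched as follows. The data here are synthetic, and the target column name `Patv` and the helper name are assumptions made for illustration; only the 0.4 threshold comes from the text.

```python
import numpy as np

def select_features(data, names, target="Patv", threshold=0.4):
    """Keep features whose absolute Pearson correlation with the target exceeds threshold."""
    target_col = data[:, names.index(target)]
    selected = []
    for k, name in enumerate(names):
        if name == target:
            continue
        r = np.corrcoef(data[:, k], target_col)[0, 1]
        if abs(r) > threshold:
            selected.append(name)
    return selected

# Synthetic example: one informative feature, one unrelated feature.
rng = np.random.default_rng(0)
wspd = rng.normal(size=500)
noise = rng.normal(size=500)
patv = 2.0 * wspd + 0.1 * rng.normal(size=500)
names = ["Wspd", "Noise", "Patv"]
data = np.column_stack([wspd, noise, patv])
chosen = select_features(data, names)
```

Pearson correlation only measures linear dependence, so a feature with a strongly nonlinear relation to power can fall below the threshold.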

3. Methodology

This paper proposes a short-term wind power prediction model based on STGCformer. The model consists of three main modules: the TSDM, the STGC module, and the Transformer-based spatio-temporal modeling module. The input for wind power prediction has shape $X \in \mathbb{R}^{(B \times C \times L \times F)}$, where B is the batch size, C is the number of turbines, L is the number of time steps, and F is the number of features. The input first enters the data augmentation (DA) module, which generates synthetic data using SMOTE to enhance the diversity of the training dataset. The augmented input then passes through the TSDM to obtain the seasonal and trend components. Next, the STGC module performs spatio-temporal modeling to capture the spatio-temporal dependencies. The lightweight External Attention (EA) mechanism effectively captures the global dependencies between wind farms. The features are then encoded using the Transformer Encoder–Decoder and undergo multiple decoding steps. Finally, the linear fusion module performs feature concatenation and mapping to produce the final prediction. The overall model architecture is shown in Figure 3.

3.1. Time Series Decomposition Module

Since wind power prediction is a complex time series problem, this paper constructs a TSDM [28] to help extract information and understand the underlying structure of the data. A one-dimensional average pooling method is used to extract long-term trends. First, the input is reshaped to $X \in \mathbb{R}^{(B \times (C \times F) \times L)}$ to fit the one-dimensional average pooling. Then, padding is applied to $X^{(B \times (C \times F) \times L)}$ according to Equation (4) to ensure that no edge data is lost during the average pooling operation. The padded data $X_{\mathrm{pad}}^{(B \times (C \times F) \times L)}$ is then subjected to one-dimensional average pooling according to Equation (5) to obtain the trend $X_{\mathrm{trend}}^{(B \times (C \times F) \times L)}$ of the input data.

X_{\mathrm{pad}}^{(B \times (C \times F) \times L)} = \mathrm{Pad}(X^{(B \times (C \times F) \times L)})

(4)

X_{\mathrm{trend}}^{(B \times (C \times F) \times L)} = \mathrm{AvgPool1d}(X_{\mathrm{pad}}^{(B \times (C \times F) \times L)})

(5)

Then, according to Equation (6), the residual between the original data and the trend is computed to obtain the high-frequency seasonal component $X_{\mathrm{season}}^{(B \times (C \times F) \times L)}$, and $X_{\mathrm{season}}^{(B \times (C \times F) \times L)}$ and $X_{\mathrm{trend}}^{(B \times (C \times F) \times L)}$ are reshaped back to the original dimensions for output.

X_{\mathrm{season}}^{(B \times (C \times F) \times L)} = X^{(B \times (C \times F) \times L)} - X_{\mathrm{trend}}^{(B \times (C \times F) \times L)}

(6)

The TSDM separates long-term trends from short-term seasonal fluctuations, allowing the prediction model to focus on predicting short-term fluctuations (seasonality) and long-term changes (trends) separately. This helps improve the accuracy of the predictions.
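The decomposition in Equations (4)–(6) amounts to a moving average along the time axis. The sketch below is a NumPy illustration under assumptions: the kernel size of 25 and the edge-replication padding are hypothetical choices, since the text does not state either.

```python
import numpy as np

def decompose(x, kernel=25):
    """Split x of shape (..., L) into seasonal and trend parts of the same shape."""
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    pad_width = [(0, 0)] * (x.ndim - 1) + [(pad_left, pad_right)]
    x_pad = np.pad(x, pad_width, mode="edge")                     # Equation (4)
    # Moving average with stride 1, computed from a cumulative sum.
    csum = np.cumsum(x_pad, axis=-1)
    csum = np.concatenate([np.zeros_like(csum[..., :1]), csum], axis=-1)
    trend = (csum[..., kernel:] - csum[..., :-kernel]) / kernel   # Equation (5)
    seasonal = x - trend                                          # Equation (6)
    return seasonal, trend

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 144))           # (B, C*F, L)
seasonal, trend = decompose(x)
flat_seasonal, flat_trend = decompose(np.full((1, 1, 50), 5.0))
```

By construction the two components sum back to the input, so nothing is lost by predicting them separately and adding the predictions afterwards.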

3.2. Spatio-Temporal Graph Convolution Module

3.2.1. Construction of Spatio-Temporal Graph

In this module, we construct a graph structure based on the historical power output and physical locations of wind turbines. This approach accounts not only for the temporal correlations between different turbines but also for their spatial positional correlations. By combining these two dimensions, we effectively capture the spatio-temporal dependencies among wind turbines, thereby providing valuable input for wind power forecasting tasks. Based on the historical power generation of 134 wind turbines in the test set from the SDWPF data, we constructed a Pearson correlation coefficient heatmap between the wind turbines, as shown in Figure 4.

The physical location of wind turbines significantly influences wind power output. Adjacent turbines often share similar characteristics (such as wind speed and direction). Since the wind farm area studied here is relatively small and the geographical distance between the wind turbines is short, the effect of Earth’s curvature on the distance is very limited within this range. Using Euclidean distance therefore does not introduce significant error, and it is simpler and faster to compute, which makes it more efficient for handling wind turbine datasets and accelerates model training. For these reasons, this study uses Euclidean distance to quantify the proximity between wind turbines and capture their spatial dependencies. For two wind turbines, i and j, with latitude and longitude coordinates $(l_{i1}, l_{i2})$ and $(l_{j1}, l_{j2})$, respectively, the Euclidean distance $d(i, j)$ is calculated as follows:

d(i, j) = \sqrt{{(l_{i1} - l_{j1})}^{2} + {(l_{i2} - l_{j2})}^{2}}

(7)

where $l_{i1}$ and $l_{i2}$ represent the latitude and longitude of wind turbine i, respectively, and $l_{j1}$ and $l_{j2}$ represent the latitude and longitude of wind turbine j, respectively.

Since edge weights are typically taken to be inversely proportional to distance, the distance-based edge weight of the graph is defined as follows:

\omega_{d}(i, j) = \frac{1}{d(i, j) + \varepsilon}

(8)

where $\varepsilon$ is a small constant that prevents division by zero.

Finally, the edge weight $\omega(i, j)$ of the graph is the weighted sum of the distance-based weight and the Pearson correlation coefficient of the power generation between turbines i and j, as shown in Equation (9).

\omega(i, j) = \alpha \cdot \omega_{d}(i, j) + \beta \cdot r(i, j)

(9)

where $r(i, j)$ is the Pearson correlation coefficient of the power generation between wind turbines i and j, and $\alpha$ and $\beta$ are weight coefficients that adjust the contributions of physical distance and power correlation to the edge weight. Both $\alpha$ and $\beta$ are learned parameters: during training, they are optimized together with the rest of the model by gradient descent and updated through backpropagation.

For each node i, the five most relevant nodes are selected and connected to form the edge set $E_{i}$. The spatio-temporal graph $\mathrm{Graph}$ is then constructed from the edge information E and node information V, as shown in Equations (10) and (11).

E_{i} = \underset{j \neq i,\ |E_{i}| \leq 5}{\mathrm{argmax}}\ \omega(i, j)

(10)

\mathrm{Graph} = (V, E)

(11)

Here, the $\mathrm{argmax}$ operation selects the five neighbors with the largest edge weights $\omega(i, j)$, and V represents the set of wind turbine nodes. Through this fusion method, the graph neural network can simultaneously capture the physical dependencies in space and the power variation patterns over time, thereby improving the model’s prediction ability.
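The graph construction in Equations (7)–(11) can be sketched with NumPy as below. This is an illustration only: in the paper the coefficients α and β are learned during training, whereas here they are fixed constants, and the function name, the value of ε, and the toy inputs are assumptions.

```python
import numpy as np

def build_graph(coords, power, alpha=0.5, beta=0.5, k=5, eps=1e-6):
    """coords: (N, 2) turbine positions; power: (N, T) historical power per turbine."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))            # Equation (7)
    w_d = 1.0 / (dist + eps)                       # Equation (8)
    r = np.corrcoef(power)                         # (N, N) Pearson correlations
    w = alpha * w_d + beta * r                     # Equation (9), fixed alpha and beta
    np.fill_diagonal(w, -np.inf)                   # exclude j == i
    neighbors = np.argsort(-w, axis=1)[:, :k]      # Equation (10): top-k weights per node
    edges = [(i, int(j)) for i in range(len(coords)) for j in neighbors[i]]
    return w, edges

rng = np.random.default_rng(0)
coords = rng.uniform(size=(8, 2))
power = rng.normal(size=(8, 100))
weights, edges = build_graph(coords, power)
```

Note that the inverse-distance term and the correlation term live on different scales (the correlation is bounded by 1, the inverse distance is not), so the balance between them depends on the coordinate units as well as on α and β.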

3.2.2. Spatio-Temporal Graph-Based Convolution Operation

To effectively capture the complex spatial dependencies and temporal dynamics in spatio-temporal data, we adopted the STGC module. This module combines the advantages of the GAT [29] and one-dimensional convolution (Conv1D), enabling adaptive learning of relationships between nodes in spatio-temporal graph data and extracting key temporal features in time series data. In this way, we can capture not only the spatial correlations between nodes but also effectively uncover temporal variations in the sequential data, which helps enhance the model’s prediction ability.

First, the input data $X \in \mathbb{R}^{(B \times C \times L \times F)}$ is reshaped and transposed. Then, given the graph structure $G = (V, E)$, the time series features of each node are averaged to obtain the representative feature $X_{\mathrm{mean}}$ of each node across the entire time series. Next, linear transformations are applied to compute the query vector $q_{x}^{((B \times C) \times O)}$ and the key vector $k_{x}^{((B \times C) \times O)}$ for each node, preparing them for the attention computation. The formulas are as follows:

X^{((B \times C) \times F \times L)} = \mathrm{Transpose}(\mathrm{Reshape}(X^{(B \times C \times L \times F)}))

(12)

X_{\mathrm{mean}}^{((B \times C) \times L)} = \frac{\sum_{t = 1}^{L} X_{t}^{((B \times C) \times L)}}{L}

(13)

k_{x}^{((B \times C) \times O)} = \mathrm{Linear}_{L \to O}(X_{\mathrm{mean}}^{((B \times C) \times L)})

(14)

q_{x}^{((B \times C) \times O)} = \frac{\mathrm{Linear}_{L \to O}(X_{\mathrm{mean}}^{((B \times C) \times L)})}{\sqrt{O}}

(15)

where $X_{t}^{((B \times C) \times L)}$ represents the node features at time step t, and $X_{\mathrm{mean}}^{((B \times C) \times L)}$ represents the average features of each node across all time steps. O is the output dimension, and $\sqrt{O}$ is a scaling factor used to prevent numerical instability when computing attention scores.

The attention coefficient $a_{ij}$ is calculated from the query and key vectors of the nodes. The attention coefficients are then normalized over the neighbors of each node to obtain $a_{ij}^{\mathrm{norm}}$. Finally, the features of the neighboring nodes are weighted by the normalized attention coefficients and summed to obtain the output $X_{i}$. The outputs of all nodes are collected in a list to form $X_{\mathrm{gat}}$. The process is shown in Equations (16)–(19).

a_{ij} = \mathrm{LeakyReLU}(\alpha^{T} [q_{i} \| k_{j}])

(16)

a_{ij}^{\mathrm{norm}} = \mathrm{Softmax}_{j}(a_{ij})

(17)

X_{i} = \sum_{j \in N(i)} a_{ij}^{\mathrm{norm}} X_{j}

(18)

X_{\mathrm{gat}} = [X_{1}, X_{2}, X_{3}, \dots, X_{t}]

(19)

where $N(i)$ is the set of neighboring nodes for node i.
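The attention-weighted aggregation of Equations (16)–(18) can be illustrated with a small NumPy sketch. It assumes that the query and key vectors have already been computed, uses a LeakyReLU slope of 0.2 (a common default that the text does not state), and the function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_aggregate(x, q, k, a, neighbors):
    """x: (N, D) node features; q, k: (N, O) queries and keys; a: (2*O,) attention vector.

    neighbors[i] lists the indices of the nodes in N(i).
    """
    out = np.zeros_like(x, dtype=float)
    for i, nbrs in enumerate(neighbors):
        # Equation (16): one raw score per neighbor from the concatenated pair [q_i || k_j].
        scores = np.array([leaky_relu(a @ np.concatenate([q[i], k[j]])) for j in nbrs])
        # Equation (17): softmax over the neighborhood (shifted for numerical stability).
        scores = np.exp(scores - scores.max())
        attn = scores / scores.sum()
        # Equation (18): weighted sum of the neighbor features.
        out[i] = attn @ x[nbrs]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
q = rng.normal(size=(4, 2))
k = rng.normal(size=(4, 2))
neighbors = [[j for j in range(4) if j != i] for i in range(4)]
# With a zero attention vector every neighbor receives the same weight.
uniform_out = gat_aggregate(x, q, k, np.zeros(4), neighbors)
```

With a zero attention vector the layer reduces to a plain neighborhood average, which is a convenient sanity check before training.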

Finally, according to Equations (20) and (21), $X_{\mathrm{gat}}^{B \times C \times O \times F}$ is reshaped and transposed to obtain $X_{t}^{B \times O \times (F \times C)}$. Convolution is then applied to $X_{t}^{B \times O \times (F \times C)}$ to obtain the final output $Y_{\mathrm{Gat-Conv}}^{B \times O \times T}$.

X_{t}^{B \times O \times (F \times C)} = \mathrm{Transpose}(\mathrm{Reshape}(X_{\mathrm{gat}}^{B \times C \times O \times F}))

(20)

Y_{\mathrm{Gat-Conv}}^{B \times O \times T} = \mathrm{Conv}(X_{t}^{B \times O \times (F \times C)})

(21)

3.3. Transformer Encoder–Decoder Module

The attention mechanism of the Transformer incurs substantial computational and memory overhead when applied directly to high-dimensional spatio-temporal features. To address this, the paper introduces a lightweight External Attention (EA) mechanism [30] to perform dimensionality reduction on the data, thereby enhancing computational efficiency. EA also effectively captures key information within wind power data, enhancing the model’s comprehension capabilities. By remapping to the original dimensions, EA preserves the accuracy of Transformer predictions while reducing computational costs. This provides an efficient and effective solution, particularly in resource-constrained environments. The architecture diagram of EA is shown in Figure 5.

First, the input $Y_{\mathrm{Gat-Conv}}^{B \times O \times T}$ is linearly mapped to the query vector $Y_{q}^{B \times S \times T}$ according to Equation (22). Then, the attention coefficients $Y_{K}^{B \times S \times T}$ are generated by calculating the similarity between the query vector and the memory unit (key $M_{k}$). Next, the $\mathrm{Softmax}$ function is applied to normalize the attention coefficients of all neighboring nodes according to Equation (23), resulting in $Y_{N}^{B \times S \times T}$.

Y_{q}^{B \times S \times T} = \mathrm{Linear}_{O \to S}(Y_{\mathrm{Gat-Conv}}^{B \times O \times T})

(22)

Y_{N}^{B \times S \times T} = \mathrm{Softmax}(Y_{K}^{B \times S \times T})

(23)

The dimensionality-reduced and normalized $Y_{N}^{B \times S \times T}$ undergoes Transformer encoding and decoding, which significantly reduces computational complexity. The formulas for position encoding and the multi-head attention mechanism are as follows:

Y_{P} = Y_{N}^{B \times S \times T} + P^{T \times S}

(24)

Q, K, V = Y_{P} \times W_{Q, K, V}

(25)

Z_{i} = \mathrm{Softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{L}}\right) V_{i}

(26)

Y = \mathrm{Concat}(Z_{1}, Z_{2}, \dots, Z_{8}) \times W^{O}

(27)
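The multi-head attention of Equations (25)–(27) can be sketched in NumPy as follows. The sketch follows the standard scaled dot-product formulation, in which the softmax weights multiply the value matrix and the scores are scaled by the square root of the per-head dimension; that scaling, the omission of the positional encoding, the toy dimensions, and the function names are assumptions rather than details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(m, heads):
    """(T, S) -> (heads, T, S // heads)."""
    t, s = m.shape
    return m.reshape(t, heads, s // heads).transpose(1, 0, 2)

def multi_head_attention(y, w_q, w_k, w_v, w_o, heads=8):
    """y: (T, S) sequence for one batch element; w_*: (S, S) projection matrices."""
    t, s = y.shape
    d = s // heads
    q = split_heads(y @ w_q, heads)                 # Equation (25)
    k = split_heads(y @ w_k, heads)
    v = split_heads(y @ w_v, heads)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))   # attention weights per head
    z = attn @ v                                    # (heads, T, d), one Z_i per head
    concat = z.transpose(1, 0, 2).reshape(t, s)     # Concat(Z_1, ..., Z_heads)
    return concat @ w_o, attn                       # Equation (27)

rng = np.random.default_rng(0)
y = rng.normal(size=(12, 16))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(16, 16)) for _ in range(4))
out, attn = multi_head_attention(y, w_q, w_k, w_v, w_o)
```

The score matrix has one entry per pair of positions, which is the quadratic cost that motivates reducing the feature dimension before the Transformer.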

After each layer’s self-attention mechanism, the Transformer also uses a Feed-Forward Network (FFN) to further process the output of the self-attention. The FFN typically consists of two fully connected layers, with a ReLU activation function introducing nonlinear transformations. The computational formula for the FFN is as follows:

\mathrm{FFN}(Y) = \max(0, Y W_{1} + b_{1}) W_{2} + b_{2}

(28)

where $W_{1}$ and $W_{2}$ are the weight matrices, and $b_{1}$ and $b_{2}$ are the bias terms. The FFN processes the representation at each position individually, enhancing the model’s expressive power through these nonlinear transformations.

After each sublayer, residual connections and normalization operations are applied to train and stabilize the network. This structure helps the encoder capture dependencies at all positions in the input sequence. The formulas for the residual connection and normalization are as follows:

Y_{\mathrm{norm}} = \mathrm{LayerNorm}(x + z)

(29)

Here, x is the input, and z is the output of the sublayer. Residual connections help avoid the vanishing gradient problem, while layer normalization helps maintain the stability of the output at each layer, preventing information loss during the training process.
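The feed-forward sublayer with its residual connection and normalization, Equations (28) and (29), can be sketched as below. The layer normalization here has no learnable gain or bias, and the hidden width of 32 is an arbitrary illustrative choice; both are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over its feature dimension (no learnable scale or shift)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(y, w1, b1, w2, b2):
    """Equation (28): two linear maps with a ReLU in between, applied per position."""
    return np.maximum(0.0, y @ w1 + b1) @ w2 + b2

def ffn_sublayer(y, w1, b1, w2, b2):
    """Equation (29): residual connection followed by layer normalization."""
    return layer_norm(y + ffn(y, w1, b1, w2, b2))

rng = np.random.default_rng(0)
y = rng.normal(size=(5, 8))
w1, b1 = 0.1 * rng.normal(size=(8, 32)), np.zeros(32)
w2, b2 = 0.1 * rng.normal(size=(32, 8)), np.zeros(8)
out = ffn_sublayer(y, w1, b1, w2, b2)
```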

The decoder layer not only includes the above modules but also introduces a Cross-Attention mechanism. After Transformer encoding and decoding, the final output $Y_{\mathrm{transformer}}^{B \times S \times T}$ is obtained. Equations (30) and (31) perform linear mapping and transposition to adjust the dimensions to $Y_{N}^{B \times C \times O}$.

Y_{v}^{B \times O \times T} = \mathrm{Linear}_{S \to O}(Y_{\mathrm{transformer}}^{B \times S \times T})

(30)

Y_{N}^{B \times C \times O} = \mathrm{Transpose}(\mathrm{Linear}_{T \to C}(Y_{v}^{B \times O \times T}))

(31)

Finally, through Equations (32) and (33), the predicted trend and seasonal components are aggregated via additive fusion to obtain $Y_{\mathrm{fused}}^{B \times C \times O}$. Then, the final prediction output $Y_{\mathrm{final}}^{B \times C \times O}$ is obtained through the inverse $\mathrm{RevIN}$ operation.

Y_{\mathrm{fused}}^{B \times C \times O} = Y_{\mathrm{seasonal\_final}}^{B \times C \times O} + Y_{\mathrm{trend\_final}}^{B \times C \times O}

(32)

Y_{\mathrm{final}}^{B \times C \times O} = \mathrm{RevIN}^{-1}(Y_{\mathrm{fused}}^{B \times C \times O})

(33)

4. Experimental Results Analysis and Discussion

The dataset used in this study comes from the SDWPF provided by the ACM KDD Cup 2022 competition. It contains historical data from 134 turbines over more than six months. The dataset was split into training, validation, and test sets in an 8:1:1 ratio, based on time order. The validation set was used only for hyperparameter tuning and early stopping and did not participate in gradient updates. The model is built using the PyTorch 2.9 framework and trained and tested on an NVIDIA GeForce MX250 GPU platform. We trained the model using the Adam optimizer with FilterMSELoss as the loss function. The hyperparameter values were determined through a series of tuning experiments on the validation set. The learning rate was set to 0.00005, and we incorporated dropout and early stopping mechanisms to prevent overfitting. The dropout rate was set to 0.5, and the patience value for early stopping was set to 5. The batch size was 16, the number of training epochs was 50, and the input window length was 144. To verify the prediction performance of the model, this study compares it with the LSTM, Bidirectional Long Short-Term Memory (BiLSTM), GRU, Bidirectional Gated Recurrent Unit (BiGRU), Transformer, Informer, TCN, TimeXer, PatchTST, iTransformer, and Autoformer models. The model is used to predict wind power for the next 24, 48, 72, and 96 h. MAE, RMSE, and MAPE are chosen as evaluation metrics to assess the model’s performance. We ensured that all models were trained, validated, and tested under consistent conditions to guarantee the rigor of the comparative experiments.
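The chronological split and the three evaluation metrics can be written down compactly. The sketch below is illustrative: the function names are hypothetical, and the small `eps` guard in the MAPE is an assumption, since the text does not say how near-zero power values are handled.

```python
import numpy as np

def chronological_split(n, ratios=(8, 1, 1)):
    """Index ranges for train, validation, and test, in time order and without shuffling."""
    total = sum(ratios)
    a = n * ratios[0] // total
    b = n * (ratios[0] + ratios[1]) // total
    return slice(0, a), slice(a, b), slice(b, n)

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

def mape(y, p, eps=1e-8):
    """Mean absolute percentage error in percent; eps guards against division by zero."""
    return float(100.0 * np.mean(np.abs((y - p) / np.maximum(np.abs(y), eps))))

train_idx, val_idx, test_idx = chronological_split(100)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

Splitting by time order rather than at random keeps future observations out of the training set, which matters for a forecasting task.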

4.1. Analysis of Multi-Step Prediction Experimental Results

We conducted extensive comparative experiments on different models at the 24 h, 48 h, 72 h, and 96 h horizons to validate the models’ effectiveness. As shown in Table 1, the proposed model outperforms others in terms of MAE, RMSE, and MAPE. Figure 6 and Figure 7 show the performance improvement rates of STGCformer over different baseline models in terms of MAE and RMSE. For the 24 h, 48 h, 72 h, and 96 h cases, the MAE improvement rates are 2.26–9.17%, 2.36–5.90%, 1.88–5.91%, and 2.69–6.30%, respectively. The RMSE improvement rates are 2.15–7.18%, 1.33–5.36%, 2.10–7.30%, and 2.39–7.39%, respectively. Figure 8 presents the average MAPE improvement rates of the proposed model compared to the baseline models. Improvements are achieved over every model, with an average improvement rate of 10.34% compared to TCN. Figure 9 presents the comparative prediction curves of different models, clearly reflecting the inherent volatility of wind power data. By combining the spatio-temporal joint modeling capabilities of graph neural networks with the Transformer’s strength in capturing both short-term and long-term dependencies, the STGCformer model tracks data trends more accurately, particularly in detecting abrupt changes and responding to sudden fluctuations, thereby significantly enhancing prediction accuracy. This indicates that STGCformer not only captures long-term trends accurately but also responds sensitively to the rapid changes in wind power data, providing more precise support for wind power forecasting.

These experimental results demonstrate the superiority of STGCformer in multi-step wind power forecasting scenarios. Recurrent models such as LSTM and GRU excel at handling short-term temporal dependencies, yet struggle to capture long-term dependencies and nonlinear couplings within the data, rendering them ill-suited for complex datasets. Transformer-based models (such as Transformer and Informer) possess global modeling capabilities, leveraging attention mechanisms to capture long-term dependencies and effectively learn complex patterns when processing lengthy sequences.

However, attention mechanisms incur substantial computational and memory overhead, alongside significant data redundancy. Moreover, neither a purely temporal nor a purely spatial model can fully describe the dynamic characteristics of wind power output. The proposed STGCformer model enhances the processing of nonlinear data through DA and TSDM. The STGC module uses a graph structure constructed by jointly modeling physical location information and historical power data. This structure aggregates information from correlated neighboring turbines, capturing the spatial and temporal dependencies within the wind power data. The Transformer is then used for encoding and decoding, leveraging attention mechanisms to capture global dependencies. Furthermore, the lightweight EA module significantly reduces computational cost and complexity. The interaction of these modules collectively yields high accuracy.
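
To make the idea of a graph built from both location and historical power more concrete, the sketch below blends a distance kernel with the Pearson correlation between turbine power series. It is only one possible construction under stated assumptions: the Gaussian kernel, the blending weight `alpha`, the cut-off `threshold`, the scale `sigma`, and the function name `build_adjacency` are all hypothetical and are not taken from the paper, whose actual construction is given in Section 3.2.1.

```python
import numpy as np


def build_adjacency(coords, power, sigma=1.0, alpha=0.5, threshold=0.3):
    """Blend a distance kernel with power correlation into one adjacency matrix.

    coords: (N, 2) turbine positions; power: (T, N) historical power series.
    sigma, alpha and threshold are hypothetical illustration parameters.
    """
    # Pairwise Euclidean distances between turbines, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Gaussian kernel on scaled distance: nearby turbines get weights near 1.
    spatial = np.exp(-(dist / (sigma * dist.std() + 1e-12)) ** 2)
    # Absolute Pearson correlation between the turbines' power series.
    corr = np.abs(np.corrcoef(power, rowvar=False))
    # Weighted blend of the two similarity measures.
    adj = alpha * spatial + (1.0 - alpha) * corr
    adj[adj < threshold] = 0.0   # drop weak links to sparsify the graph
    np.fill_diagonal(adj, 1.0)   # keep self-loops
    return adj


# Synthetic stand-in data: 5 turbines, 200 time steps.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1000.0, size=(5, 2))
power = rng.normal(size=(200, 5))
adj = build_adjacency(coords, power)
print(adj.shape)
```

A blended matrix of this kind lets an attention-based graph layer restrict aggregation to turbines that are either physically close or historically correlated, which is the behaviour the STGC module is described as exploiting.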

4.2. Ablation Study

To validate the contribution of each module in the proposed STGCformer model, we conducted ablation experiments. To reduce the influence of run-to-run randomness, each ablation result is the average of ten runs, with the standard deviation provided (standard deviation ± 1.1). The experimental results are shown in Table 2 and Figure 10. Removing the Transformer module significantly reduces the model’s ability to handle complex temporal relationships, while removing the STGC module leaves the model unable to capture the spatio-temporal dependencies between turbines within the wind farm. These two modules have the greatest impact on prediction accuracy, especially for long-horizon forecasting at 96 h: removing the Transformer module or the STGC module increases the MAE from 42.015 to 44.230 or 43.764, respectively, which highlights that the combination of STGC and Transformer contributes most to stable long-horizon forecasting. The TSDM decomposes the data into seasonal and trend components, which helps handle periodic fluctuations and long-term variations in the data. As shown in Figure 10, it consistently improves the model’s MAE by around 3%, providing a stable contribution to the model.
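
The seasonal-trend split performed by the TSDM can be illustrated with a moving-average decomposition of the kind popularised by Autoformer. This is a minimal sketch, assuming that the trend is a centred moving average and the seasonal part is the residual; the function name `decompose` and the window length `kernel_size` are hypothetical, and the paper's own module (Section 3.1) may differ in detail.

```python
import numpy as np


def decompose(series, kernel_size=25):
    """Split a 1-D series into seasonal (residual) and trend (moving average) parts.

    kernel_size is a hypothetical odd window length.
    """
    pad = (kernel_size - 1) // 2
    # Repeat the edge values so the trend keeps the input length.
    padded = np.pad(series, pad, mode="edge")
    window = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, window, mode="valid")
    seasonal = series - trend
    return seasonal, trend


# Synthetic example: a linear drift plus a 24-step cycle over 96 steps.
t = np.arange(96)
series = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 24)
seasonal, trend = decompose(series, kernel_size=25)
print(seasonal.shape, trend.shape)
```

The two parts always sum back to the input, so downstream modules can process the slowly varying trend and the periodic residual separately without losing information.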

Table 3 presents the computational cost with and without the EA module at 24 h. With the EA module, the parameter count falls from 5.853 M to 2.855 M, the FLOPs from 105.835 GFLOPs to 37.726 GFLOPs, and the time per epoch from 149.33 s to 49.28 s, a substantial gain in computational efficiency. DA enhances the model’s generalization ability and helps prevent overfitting. Together, these modules enable the STGCformer model to achieve strong prediction accuracy.
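
One way to see why an external attention module can be cheaper than self-attention is the NumPy sketch below. It follows the general two-memory form with double normalisation described by Guo et al. (listed in the References) and is not the authors' implementation; the function name `external_attention`, the memory size of 8 slots, and the tensor shapes are assumptions chosen for illustration.

```python
import numpy as np


def external_attention(features, mem_key, mem_value):
    """External attention over two small shared memories.

    features: (N, d) token features; mem_key, mem_value: (S, d) memories.
    The cost grows as O(N * S * d), linear in the sequence length N,
    whereas self-attention builds an (N, N) matrix.
    """
    attn = features @ mem_key.T                        # (N, S) similarity to memory slots
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)      # softmax over the N tokens
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)  # l1-normalise over the S slots
    return attn @ mem_value                            # (N, d) output features


# Synthetic example: 96 time steps, 16 features, 8 memory slots.
rng = np.random.default_rng(0)
features = rng.normal(size=(96, 16))
mem_key = rng.normal(size=(8, 16))
mem_value = rng.normal(size=(8, 16))
out = external_attention(features, mem_key, mem_value)
print(out.shape)
```

Because the memories have a fixed, small number of slots, the attention map has shape (N, S) rather than (N, N), which is in line with the lower parameter count and FLOPs reported in Table 3.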

To assess the stability of the STGCformer architecture, hyperparameter sensitivity experiments were conducted on the SDWPF dataset at the 48 h forecast horizon. Table 4 presents the experimental results under different learning rates. The best results were achieved with the originally set learning rate of 0.00005, and the results for the other learning rates deviated only slightly from it (the MAE differs by less than 0.3). This indicates that STGCformer is not highly sensitive to the learning rate within the tested range.

5. Conclusions

Wind power output exhibits significant volatility and uncertainty, which makes accurate forecasting difficult. To enhance the accuracy and stability of wind power prediction, this paper proposes the STGCformer framework. First, feature selection is performed on the raw data through Pearson correlation coefficient analysis. The DA module then enhances data quality, improving both generalization capability and perceptual sensitivity. The designed STGC module constructs graph data by integrating geographical location information with historical power data, achieving joint spatio-temporal modeling and enhancing model interpretability and accuracy.

Finally, the computationally expensive yet effective Transformer module is combined with a lightweight EA module, maintaining prediction accuracy while keeping time complexity low. Multi-step experiments (24 h, 48 h, 72 h, 96 h) were conducted on the SDWPF dataset against several baseline models. STGCformer achieved a relative improvement of approximately 2% to 10% over these baselines. Notably, the MAE at 24 h and 48 h reached 37.586 and 41.383, respectively, demonstrating the strong performance of the model in short-term wind power forecasting. This model provides a new methodological framework for STWPF.

Author Contributions

Conceptualization, C.T., S.Y., L.W. and W.Z.; methodology, C.T., M.X. and W.Z.; software, C.T. and L.W.; validation, L.W., S.Y. and M.X.; formal analysis, C.T., S.Y. and L.W.; investigation, C.T. and W.Z.; resources, M.X.; data curation, C.T.; writing—original draft preparation, C.T.; writing—review and editing, W.Z.; visualization, M.X.; supervision, W.Z.; project administration, W.Z.; funding acquisition, W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the State Grid Corporation of China Project under Grant 52170025002A-380-ZN.

Data Availability Statement

The original SDWPF dataset used in this paper is publicly available on FigShare, with the DOI: https://doi.org/10.6084/m9.figshare.30787586.

Acknowledgments

The authors would like to thank the organisers of the KDD Cup 2022 competition for providing the SDWPF dataset.

Conflicts of Interest

Author Liwen Wang was employed by the company China Electric Power Research Institute. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from State Grid Corporation of China Project. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.

References

Bokde, N.; Feijóo, A.; Villanueva, D.; Kulat, K. A review on hybrid empirical mode decomposition models for wind speed and wind power prediction. Energies 2019, 12, 254.
Gao, L.; Kong, F.; Zhang, F.; Ren, X.Y.; Zhang, X.L.; Qin, L. Ultra-Short-Term Wind Power Prediction Method Based on IPSO–BiLSTM–AM Model. Smart Power 2022, 50, 27–34.
Liu, X.; Pu, X.; Li, J.; Zhang, J. Short-term wind power prediction of a VMD-GRU based on Bayesian optimization. Power Syst. Prot. Control 2023, 51, 158–165.
Ren, Z.; Weng, L.; Xia, M.; Lin, H. MCINet: Multi-attentive cross-level interaction network for cloud and snow segmentation. J. Appl. Remote Sens. 2026, 20, 021404.
Zhang, Y.; Wang, J.; Wang, X. Review on probabilistic forecasting of wind power generation. Renew. Sustain. Energy Rev. 2014, 32, 255–270.
Peng, S.; Peng, J.; Yang, Y.; Zhang, H.; Li, B.; Wang, G. Wind power probability density prediction based on time-variant deep feed-forward neural network. J. Electr. Power Sci. Technol. 2023, 38, 84–93.
Hanifi, S.; Liu, X.; Lin, Z.; Lotfian, S. A critical review of wind power forecasting methods—Past, present and future. Energies 2020, 13, 3764.
Singh, S.; Mohapatra, A. Repeated wavelet transform based ARIMA model for very short-term wind speed forecasting. Renew. Energy 2019, 136, 758–768.
Pearre, N.S.; Swan, L.G. Statistical approach for improved wind speed forecasting for wind power production. Sustain. Energy Technol. Assess. 2018, 27, 180–191.
Alexiadis, M.; Dokopoulos, P.; Sahsamanoglou, H.; Manousaridis, I. Short-term forecasting of wind speed and related electrical power. Sol. Energy 1998, 63, 61–68.
Zhang, J.; Yan, J.; Infield, D.; Liu, Y.; Lien, F.S. Short-term forecasting and uncertainty analysis of wind turbine power based on long short-term memory network and Gaussian mixture model. Appl. Energy 2019, 241, 229–244.
González-Sopeña, J.; Pakrashi, V.; Ghosh, B. An overview of performance evaluation metrics for short-term statistical wind power forecasting. Renew. Sustain. Energy Rev. 2021, 138, 110515.
Wang, Y. Short-term prediction of wind power based on Kalman filter tracking fusion. In Proceedings of the 2021 4th International Conference on Energy, Electrical and Power Engineering (CEEPE), Chongqing, China, 23–25 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 507–511.
Louka, P.; Galanis, G.; Siebert, N.; Kariniotakis, G.; Katsafados, P.; Pytharoulis, I.; Kallos, G. Improvements in wind speed forecasts for wind power prediction purposes using Kalman filtering. J. Wind. Eng. Ind. Aerodyn. 2008, 96, 2348–2362.
Zhang, Y.; Sun, H.; Guo, Y. Wind power prediction based on PSO-SVR and grey combination model. IEEE Access 2019, 7, 136254–136267.
Amjady, N.; Keynia, F.; Zareipour, H. Wind power prediction by a new forecast engine composed of modified hybrid neural network and enhanced particle swarm optimization. IEEE Trans. Sustain. Energy 2011, 2, 265–276.
Liu, H.; Chen, C. Data processing strategies in wind energy forecasting models and applications: A comprehensive review. Appl. Energy 2019, 249, 392–408.
Kisvari, A.; Lin, Z.; Liu, X. Wind power forecasting—A data-driven method along with gated recurrent neural network. Renew. Energy 2021, 163, 1895–1909.
Liu, X.; Zhou, J. Short-term wind power forecasting based on multivariate/multi-step LSTM with temporal feature attention mechanism. Appl. Soft Comput. 2024, 150, 111050.
Sun, Y.; Wang, X.; Yang, J. Modified particle swarm optimization with attention-based LSTM for wind power prediction. Energies 2022, 15, 4334.
Boucetta, L.N.; Amrane, Y.; Arezki, S. Wind power forecasting using a GRU attention model for efficient energy management systems. Electr. Eng. 2025, 107, 2595–2620.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
Pan, X.; Wang, L.; Wang, Z.; Huang, C. Short-term wind speed forecasting based on spatial-temporal graph transformer networks. Energy 2022, 253, 124095.
Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
Li, N.; Dong, J.; Liu, L.; Li, H.; Yan, J. A novel EMD and causal convolutional network integrated with Transformer for ultra short-term wind power forecasting. Int. J. Electr. Power Energy Syst. 2023, 154, 109470.
Wu, H.; Meng, K.; Fan, D.; Zhang, Z.; Liu, Q. Multistep short-term wind speed forecasting using transformer. Energy 2022, 261, 125231.
Yuan, S.; Mao, Y.; Tian, C.; Yu, F.; Guo, T.; Xia, M. GSTAformer: Graph-Guided Spatio-Temporal Autoformer for Mid-Term Wind Power Forecasting. Energies 2026, 19, 254.
Wei, W.W. Time series analysis. In The Oxford Handbook of Quantitative Methods in Psychology; Oxford University Press: Oxford, UK, 2013; Volume 2, pp. 458–485.
Yu, C.; Yan, G.; Yu, C.; Zhang, Y.; Mi, X. A multi-factor driven spatiotemporal wind power prediction model based on ensemble deep graph attention reinforcement learning networks. Energy 2023, 263, 126034.
Guo, M.H.; Liu, Z.N.; Mu, T.J.; Hu, S.M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447.

Figure 1. Historical wind power data.

Figure 2. Feature Pearson correlation coefficient heatmap.

Figure 3. Overall architecture of STGCformer.

Figure 4. Pearson correlation coefficient heatmap of turbine historical power generation.

Figure 5. External attention module.

Figure 6. Performance improvement of the model compared to baseline models based on MAE.

Figure 7. Performance improvement of the model compared to baseline models based on RMSE.

Figure 8. Average MAPE improvement of STGCformer compared to baseline models.

Figure 9. Prediction comparison curve chart based on the SDWPF dataset.

Figure 10. MAE-based improvement of the model over the ablated variants.

Table 1. The comparison experiment results of different models on the SDWPF dataset.

Model	24 h			48 h			72 h			96 h
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
LSTM	39.763	47.139	3.832	42.804	52.053	4.161	44.365	54.479	4.416	45.583	54.247	4.475
BiLSTM	39.869	47.217	3.898	42.630	51.274	4.186	44.821	54.315	4.393	45.627	54.486	4.502
GRU	39.056	47.393	3.953	42.739	51.660	4.251	44.863	54.129	4.395	46.179	54.982	4.592
BiGRU	40.493	47.827	3.983	43.320	51.081	4.278	44.987	54.976	4.382	45.793	54.469	4.537
Transformer	39.754	46.787	3.872	42.562	50.282	4.219	44.655	53.742	4.325	45.874	54.106	4.621
Informer	40.848	47.879	4.028	43.606	52.109	4.302	45.504	54.025	4.562	46.274	55.527	4.727
Autoformer	39.252	46.682	3.823	42.721	51.026	4.178	44.273	53.585	4.312	45.563	54.254	4.479
TCN	41.035	47.382	4.035	43.825	52.235	4.362	45.146	54.275	4.593	46.427	55.724	4.625
TimerXer	38.437	45.632	3.894	42.358	50.239	4.129	43.425	52.314	4.281	44.847	53.126	4.356
PatchTST	38.794	46.257	3.947	42.863	50.724	4.196	43.891	52.853	4.356	45.279	53.729	4.437
iTransformer	39.157	46.348	3.972	43.257	51.195	4.236	44.217	53.138	4.289	45.374	54.157	4.396
STGCformer	37.586	44.673	3.632	41.383	49.578	3.862	42.625	51.237	4.183	43.674	51.885	4.302

Table 2. Ablation study results.

Model	24 h			48 h			72 h			96 h
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
w/o STGC	37.122	44.711	3.995	40.385	50.514	4.194	42.127	52.475	4.349	43.764	54.258	4.725
w/o Transformer	37.479	45.392	4.117	40.708	51.324	4.324	42.962	53.585	4.625	44.230	55.315	4.839
w/o TSDM	36.825	44.483	4.015	39.753	49.873	4.274	42.452	51.952	4.386	43.319	53.414	4.631
w/o EA	36.257	43.992	3.974	39.236	48.718	4.134	42.032	51.407	4.289	43.071	52.715	4.481
w/o DA	36.145	44.079	3.926	38.962	48.580	4.087	42.193	51.217	4.247	42.614	52.296	4.401
STGCformer	35.682	43.774	3.872	38.625	48.013	4.024	41.427	50.783	4.167	42.015	51.871	4.305

Table 3. The computational cost with and without the EA module at 24 h.

Model	Params (M)	FLOPs (G)	Epoch Time (s)
w/o EA	5.853	105.835	149.33
STGCformer	2.855	37.726	49.28

Table 4. Sensitivity analysis of the model’s learning rate based on the SDWPF dataset at the 48 h forecast horizon.

Model (lr)	MAE	RMSE	MAPE
STGCformer (lr = 0.00002)	41.592	49.867	3.869
STGCformer (lr = 0.00005)	41.297	49.627	3.859
STGCformer (lr = 0.0001)	41.583	49.743	3.867
