Article

A Short-Term Wind Power Forecasting Method Based on Multi-Decoder and Multi-Task Learning

1 Inner Mongolia Power (Group) Co., Ltd., Hohhot 010010, China
2 College of Smart Energy, Shanghai Jiao Tong University, Shanghai 200240, China
3 Key Laboratory of Control of Power Transmission and Conversion, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China
4 Shanghai Non-Carbon Energy Conversion and Utilization Institute, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Energies 2026, 19(2), 349; https://doi.org/10.3390/en19020349
Submission received: 1 December 2025 / Revised: 27 December 2025 / Accepted: 29 December 2025 / Published: 10 January 2026
(This article belongs to the Special Issue Challenges and Research Trends of Integrated Zero-Carbon Power Plant)

Abstract

In short-term power forecasting for wind farms, weather conditions and geographic proximity make the power outputs of different wind farms correlated, creating complex coupling relationships between them. Traditional wind power forecasting methods often predict each wind farm independently, without considering these couplings. To address this issue, this paper proposes a multi-task Transformer model based on multiple decoders, which accounts for the intrinsic connections between different wind farms and enables joint power forecasting across multiple sites. The proposed model adopts a single-encoder, multi-decoder structure, in which a unified encoder processes all input data and multiple decoders perform the prediction task for each wind farm separately. Testing on actual wind farm data from the Inner Mongolia region of China shows that, compared with other forecasting models, the proposed model significantly improves the accuracy of power predictions for the different wind farms.

1. Introduction

Wind energy, as a clean, renewable, and abundant energy source [1], plays an important role in new energy power generation. However, owing to the pronounced randomness and volatility of wind [2], the power output of wind farms exhibits considerable uncertainty [3]. This uncertainty not only poses challenges to the stable operation of power systems but also affects grid dispatching and electricity market transactions. Therefore, accurate wind power forecasting is of great significance for improving the quality of wind power integration [4], optimizing power system operation [5], and enhancing the economic efficiency of electricity markets [6].
Conventional wind power forecasting methods typically perform independent forecasting for different wind farms. Typical methods for short-term wind power forecasting include time-series extrapolation [7], long short-term memory (LSTM) networks [8], extreme learning machines (ELM) [9], and others. In [10], a wind power forecasting model based on a convolutional neural network (CNN)–LSTM architecture is proposed, which improves the forecasting accuracy of wind farm power output in the planning stage when historical operational data are scarce. In [11], the forecasting results of multiple models, including support vector machine (SVM), multilayer perceptron (MLP), and recurrent neural network (RNN), are integrated to further enhance forecasting accuracy. However, due to factors such as geographical correlations, the power outputs of different wind farms exhibit complex coupling relationships. If such coupling can be further taken into account, the accuracy of wind power forecasting at the farm level can be improved to some extent.
The multi-output structure of artificial neural networks can meet the needs of multiple forecasting tasks. In [12], a multitask learning-based load forecasting method is proposed, in which CNN and gated recurrent unit (GRU) networks serve as shared layers to capture the coupling among multiple loads, while a gradient boosting regression tree (GBRT) model is employed to perform joint multivariate load forecasting. In [13], a short-term multivariate load forecasting model based on GRU networks is proposed. By modifying the gating structure of the GRU, a multi-level gated recurrent architecture is constructed to achieve multivariate load forecasting. However, these architectures have inherent limitations: recurrent networks such as the GRU are prone to vanishing gradients on long time series and struggle to capture very long-term dependencies, while CNNs only model local temporal patterns within their receptive fields.
The Transformer model, first proposed by Vaswani et al. in 2017 with attention mechanisms at its core [14], has achieved great success in fields such as natural language processing, image analysis, and video understanding. In the field of wind power forecasting, when dealing with highly complex and stochastic input sequences of renewable power data, the Transformer can dynamically adjust its focus on meteorological factors such as wind speed and wind direction through the self-attention mechanism. It can capture the correlations among key variables at any temporal and spatial positions and assign higher weights to more informative signals [15], thereby improving the accuracy of wind power forecasting. In recent years, Transformer models based on attention mechanisms have been shown to outperform recurrent neural network models on many tasks. Thanks to their parallel computation capability, Transformers exhibit better gradient flow during training and avoid the gradient vanishing problem commonly encountered in recurrent neural networks. Moreover, the self-attention mechanism can directly attend to the relationships between any two positions in a sequence, making Transformers more efficient in capturing long-range dependencies. Therefore, compared with LSTM and other recurrent neural networks, Transformer models offer a broader range of dependencies and stronger generalization capability.
Existing Transformer-based wind power forecasting approaches rarely adopt a multi-decoder/multi-task design to jointly model multiple spatially correlated wind farms. Instead, these models are typically built independently for each wind farm, with each Transformer model containing only a single decoder. As a result, they fail to explicitly account for the wake effects and spatial dependencies among different wind farms, which limits their forecasting accuracy. This paper addresses this gap by proposing a novel approach with the following key contributions.
(1) We propose a multi-decoder, multitask learning model for short-term wind power forecasting, based on the Transformer architecture. The proposed model adopts a single-encoder–parallel-decoder structure, enabling joint forecasting of power outputs for multiple wind farms.
(2) We design a unified encoder to map the input data into a latent representation matrix. Attention mechanisms are employed to extract high-dimensional interaction features from the correlated information across multiple farms. Each farm-specific decoder then combines the latent representation matrix with the input features specific to its forecasting task, enabling accurate predictions for each farm.
(3) We conduct extensive experiments, including case studies using real wind farm data from Inner Mongolia, China. These experiments demonstrate that the proposed model significantly improves the accuracy of wind power forecasting.

2. Multi-Decoder and Multi-Task Learning Architecture

The overall architecture of the multi-decoder, multitask learning Transformer is shown in Figure 1. It is characterized by a single encoder and multiple task-specific decoders.
In the Transformer model, meteorological data for all wind farms—including the wind speed components in the latitude and longitude directions at 100 m and 10 m, surface pressure, and 2 m temperature—are first passed through an embedding layer and then fed into the encoder. The embedding layer maps the input data into a new feature space via linear transformation or feature extraction, while positional encoding is used to add temporal information to the time-series data. The encoder mainly consists of a multi-head self-attention mechanism, a feedforward neural network, and residual connections with layer normalization. The multi-head self-attention mechanism mimics the resource allocation mechanism of human attention by assigning higher probabilities to important information in the sequence, thereby highlighting key information and reducing or even completely ignoring unimportant information. The feedforward neural network enhances the nonlinear fitting capability of the model, whereas normalization accelerates the convergence of model training.
The structure of each decoder is similar to that of the encoder and mainly includes an embedding layer, a multi-head self-attention mechanism, an encoder–decoder attention mechanism, a feedforward neural network, residual connections with layer normalization, and a final fully connected layer. The input data of each decoder consists of the meteorological variables corresponding to its associated wind farm, such as the wind speed components in the latitude and longitude directions at 100 m and 10 m, surface pressure, and 2 m temperature. The embedding layer maps the input data into a new feature space, while positional encoding adds temporal information to the time-series data. It is worth noting that the second multi-head attention module in the decoder is not a self-attention mechanism. In this module, the keys and values come from the encoder output, whereas the queries come from the output of the first multi-head attention module in the decoder. In this way, each decoder can fully exploit the encoded information during the decoding process. Finally, the fully connected layer transforms the decoder outputs into the final forecasting results.
The model performs a day-ahead short-term wind power forecasting task. At 08:00 on day D, the model takes the most recent available historical observations as inputs and generates the forecast profile for the entire next day, covering 00:00 to 24:00 on day D + 1 at a 15 min resolution, resulting in 96 forecasting points per day. The test period is evaluated under a static-origin day-ahead forecasting protocol, and direct multi-step forecasting is adopted.
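The single-encoder, multi-decoder structure described above can be sketched in a few lines. The following is a minimal illustrative skeleton, not the authors' implementation: the linear maps stand in for the full attention encoder and decoders, all weights are random placeholders, and the elementwise combination is a crude proxy for encoder–decoder attention.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, d_feat, n_farms, horizon = 96, 6, 3, 96  # 96 points at 15 min resolution

# Shared encoder: one linear projection standing in for the attention encoder.
W_enc = rng.normal(size=(d_feat, 16))

def encode(x):                      # x: (n_steps, d_feat), pooled inputs of all farms
    return np.tanh(x @ W_enc)      # latent representation matrix

# One lightweight decoder head per wind farm.
W_dec = [rng.normal(size=(16, 1)) for _ in range(n_farms)]

def forecast(x_shared, x_farms):
    memory = encode(x_shared)              # unified encoder output, shared by all tasks
    preds = []
    for W, x_f in zip(W_dec, x_farms):
        # Each decoder combines the shared memory with its own farm's inputs.
        h = memory * np.tanh(x_f @ W_enc)  # stand-in for encoder-decoder attention
        preds.append((h @ W).ravel())      # (horizon,) forecast for this farm
    return preds

x_shared = rng.normal(size=(n_steps, d_feat))
x_farms = [rng.normal(size=(n_steps, d_feat)) for _ in range(n_farms)]
preds = forecast(x_shared, x_farms)
```

The point of the sketch is the control flow: a single `encode` call feeds every farm-specific head, so the decoders share one latent representation while producing separate forecasts.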

3. Transformer-Based Wind Power Forecasting Model

The wind power forecasting model proposed in this paper is built on the Transformer architecture. This section presents in detail the basic principles and mathematical formulations of positional encoding, the multi-head attention mechanism, layer normalization, and the feedforward neural network used in the model.

3.1. Positional Encoding

In the multi-head self-attention mechanism, the attention scores are computed purely from the content of the input elements and are invariant to their order, so the positional information of the input sequence is lost. To address this issue, positional encoding is introduced to provide the Transformer with the position of each input element [16], thereby improving the prediction performance of the model when processing time-series data.
For the element at the k-th dimension and the p-th position in the input sequence, the positional encoding PE(p, k) is defined as follows:
PE(p, 2k) = sin( p / 10000^{2k/d} )    (1)
PE(p, 2k+1) = cos( p / 10000^{2k/d} )    (2)
where d is the embedding dimension.
The positional encoding not only provides the absolute positional information of different elements, but also exploits the periodicity of sine and cosine functions (i.e., PE(p + p′, k) can be expressed as a linear transformation of PE(p, k)) to encode the relative positional relationships of the input data.
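Equations (1) and (2) translate directly into code. The following NumPy sketch assumes an even embedding dimension and an array layout with positions along rows:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding, Eqs. (1)-(2); d_model must be even."""
    PE = np.zeros((n_positions, d_model))
    pos = np.arange(n_positions)[:, None]           # position index p
    two_k = np.arange(0, d_model, 2)                # even dimensions 2k
    angle = pos / np.power(10000.0, two_k / d_model)  # p / 10000^{2k/d}
    PE[:, 0::2] = np.sin(angle)                     # Eq. (1)
    PE[:, 1::2] = np.cos(angle)                     # Eq. (2)
    return PE

pe = positional_encoding(96, 8)   # one encoding per 15 min forecasting point
```

At position p = 0 the even dimensions are sin(0) = 0 and the odd dimensions are cos(0) = 1, which is a quick sanity check on the implementation.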

3.2. Attention Mechanism

In renewable energy power forecasting, it is necessary to extract both temporal and feature information of uncertain quantities such as wind speed, temperature and humidity, which makes the input information highly complex. The multi-head self-attention mechanism can adaptively focus on different parts of the input, capture rich feature information, and assign higher weights to key information. Therefore, in this paper, the encoder is built around the multi-head self-attention mechanism to extract the feature information of each uncertain variable and its related variables, ensuring that the model can focus on the most important parts of these uncertain features [17].
The multi-head self-attention mechanism characterizes the correlations between each data point and all other data points in the sequence by computing attention scores. In this paper, dot-product attention is adopted. Let the input data be denoted by x ∈ ℝ^{n×d}, where n is the number of samples and d is the feature dimension. First, x is linearly transformed to generate the query, key and value matrices. Using the corresponding weight matrices, the linear transformations of x are given by (3)–(5):
Q_i = x W_i^Q    (3)
K_i = x W_i^K    (4)
V_i = x W_i^V    (5)
where, for the i-th attention head, the query, key and value matrices are denoted by Q_i, K_i and V_i, respectively, each of dimension n × d_k. The matrices W_i^Q, W_i^K and W_i^V are the corresponding weight matrices of dimension d × d_k, which map the input features from an n × d-dimensional space to an n × d_k-dimensional space.
The correlation between each query in the query matrix and all keys in the key matrix is computed via the dot product, and the dot product results are scaled to avoid excessively large values caused by high dimensionality. A softmax function is then applied to the scaled dot products to obtain the attention weight matrix:
α_i = softmax( Q_i K_i^T / √(d_k) )    (6)
where α_i denotes the attention weight matrix of the i-th head.
By multiplying the attention weight matrix with the value matrix, the attention output matrix is obtained:
A_i = α_i V_i    (7)
where A_i denotes the attention output matrix of the i-th head.
By aggregating the outputs of multiple attention heads, the final output of the multi-head attention mechanism is obtained:
Attention(x) = [A_1  A_2  …  A_h] W^O    (8)
where Attention(x) is the final output of the multi-head attention module; h is the number of attention heads; and W^O is a linear projection matrix of dimension (h·d_k) × d that maps the concatenated head outputs back to the n × d-dimensional space.
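Equations (3)–(8) can be sketched compactly in NumPy. The weight shapes and random initialization below are illustrative, not the model's trained parameters:

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, WQ, WK, WV, WO):
    """Dot-product multi-head self-attention, Eqs. (3)-(8).
    x: (n, d); WQ/WK/WV: lists of (d, d_k) per head; WO: (h*d_k, d)."""
    heads = []
    for Wq, Wk, Wv in zip(WQ, WK, WV):
        Q, K, V = x @ Wq, x @ Wk, x @ Wv            # Eqs. (3)-(5)
        d_k = Q.shape[-1]
        alpha = softmax(Q @ K.T / np.sqrt(d_k))     # Eq. (6), scaled dot product
        heads.append(alpha @ V)                     # Eq. (7)
    return np.concatenate(heads, axis=-1) @ WO      # Eq. (8), concat + projection

rng = np.random.default_rng(1)
n, d, h, d_k = 10, 8, 4, 2
WQ = [rng.normal(size=(d, d_k)) for _ in range(h)]
WK = [rng.normal(size=(d, d_k)) for _ in range(h)]
WV = [rng.normal(size=(d, d_k)) for _ in range(h)]
WO = rng.normal(size=(h * d_k, d))
out = multi_head_attention(rng.normal(size=(n, d)), WQ, WK, WV, WO)
```

Each row of the attention weight matrix α_i sums to 1 by construction of the softmax, so the head output A_i is a convex combination of the value vectors.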
In the decoder, in addition to the multi-head self-attention mechanism, an encoder–decoder attention mechanism is used. The encoder–decoder attention mechanism utilizes the contextual information generated by the encoder to enable more accurate decoding. Compared with the multi-head self-attention mechanism, the query matrix of the encoder–decoder attention comes from the output of the self-attention layer in the decoder and represents the target sequence information, denoted by QD; the key and value matrices come from the intermediate representations output by the encoder and represent the feature sequence information, denoted by KE and VE, respectively, as shown in (9):
α_i^{E−D} = softmax( Q_i^D (K_i^E)^T / √(d_k) )    (9)
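The only structural difference from self-attention is the source of the inputs: queries come from the decoder, keys and values from the encoder output. A sketch of Eq. (9), with sequence lengths chosen purely for illustration:

```python
import numpy as np

def cross_attention(Q_dec, K_enc, V_enc):
    """Encoder-decoder attention, Eq. (9): queries from the decoder's
    self-attention output, keys/values from the encoder output."""
    d_k = Q_dec.shape[-1]
    s = Q_dec @ K_enc.T / np.sqrt(d_k)
    s = s - s.max(axis=-1, keepdims=True)               # stable softmax
    alpha = np.exp(s) / np.exp(s).sum(axis=-1, keepdims=True)
    return alpha @ V_enc                                 # one row per query

rng = np.random.default_rng(2)
out = cross_attention(rng.normal(size=(5, 4)),   # 5 decoder positions
                      rng.normal(size=(12, 4)),  # 12 encoder positions
                      rng.normal(size=(12, 4)))
```

Note that the output has one row per decoder query, regardless of the encoder sequence length, which is what lets each farm-specific decoder attend over the full shared encoder memory.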

3.3. Layer Normalization and Feedforward Neural Network

In the Transformer model, residual connections and layer normalization are typically used together. Residual connections directly add the input of a layer to its output so that the layer learns the residual with respect to the input, which helps mitigate the vanishing and exploding gradient problems in deep neural networks. Let x(0) and output(x(0)) denote the input and output of a layer before the residual connection, respectively. The output of the residual connection, x(1), is given by:
x^{(1)} = x^{(0)} + output(x^{(0)})    (10)
where + denotes element-wise addition.
The output of the residual connection then serves as the input to the layer normalization. Layer normalization normalizes all features of each sample so that the output of each layer has zero mean and unit variance, thereby accelerating model convergence. The computation of layer normalization is given by (11)–(13):
μ_i = (1/d) Σ_{j=1}^{d} x_{ij}^{(1)}    (11)
σ_i² = (1/d) Σ_{j=1}^{d} ( x_{ij}^{(1)} − μ_i )²    (12)
x_{ij}^{(2)} = γ_j · ( x_{ij}^{(1)} − μ_i ) / √(σ_i² + ε) + β_j    (13)
where μ_i and σ_i² denote the mean and variance of the i-th sample, respectively; ε is a small constant introduced to prevent division by zero; and γ_j and β_j are learnable parameters for each feature dimension. The output of layer normalization is denoted by x^{(2)} = { x_{ij}^{(2)} | i = 1, 2, …, n; j = 1, 2, …, d }.
The position-wise feedforward neural network is an important component of both the encoder and each decoder in the Transformer model. Its purpose is to process the input data through two linear transformations with a nonlinear activation function in between, thereby extracting more complex feature information. The input to the feedforward neural network is typically the output of layer normalization, denoted by x(2), and its output can be expressed as:
FFN(x^{(2)}) = ReLU( x^{(2)} W_1^L + b_1 ) W_2^L + b_2    (14)
where FFN(x^{(2)}) denotes the output of the feedforward neural network; ReLU(·) = max(0, ·); W_1^L and W_2^L are the weight matrices of the two linear transformations; and b_1 and b_2 are the corresponding bias vectors.
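The residual connection, layer normalization and feedforward network of Eqs. (10)–(14) can be sketched as follows; the toy sublayer and weight shapes are placeholders for illustration only:

```python
import numpy as np

def layer_norm(x1, gamma, beta, eps=1e-5):
    """Layer normalization over the feature axis, Eqs. (11)-(13)."""
    mu = x1.mean(axis=-1, keepdims=True)                 # Eq. (11)
    var = x1.var(axis=-1, keepdims=True)                 # Eq. (12)
    return gamma * (x1 - mu) / np.sqrt(var + eps) + beta # Eq. (13)

def ffn(x2, W1, b1, W2, b2):
    """Position-wise feedforward network, Eq. (14)."""
    return np.maximum(0.0, x2 @ W1 + b1) @ W2 + b2       # ReLU between two linears

rng = np.random.default_rng(3)
n, d, d_ff = 6, 8, 32
x0 = rng.normal(size=(n, d))
x1 = x0 + np.tanh(x0)                        # residual connection, Eq. (10),
                                             # with tanh as a toy sublayer
x2 = layer_norm(x1, np.ones(d), np.zeros(d))
y = ffn(x2, rng.normal(size=(d, d_ff)), np.zeros(d_ff),
        rng.normal(size=(d_ff, d)), np.zeros(d))
```

With γ = 1 and β = 0, each normalized row has zero mean and (up to ε) unit variance, which is the property that accelerates convergence.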

4. Case Study Analysis

4.1. Dataset Description

In this case study, the power outputs of three wind farms located in the Inner Mongolia region are forecasted. The historical operational data of the wind farms and the numerical weather prediction data used in this paper are all collected from wind farms in the Inner Mongolia Autonomous Region.
The meteorological input features of the forecasting model include wind speed, wind direction, temperature, humidity, and air pressure. The training dataset covers the period from May 2023 to November 2023, and the testing dataset spans December 2023, resulting in an 87%/13% split between training and testing. The temporal resolution of the data is 15 min. Prior to model training, abnormal samples are removed and missing values are imputed. Min–Max normalization is then applied during both training and inference.
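The Min–Max normalization step can be sketched as below. Fitting the scaler statistics on the training split only (to avoid test-set leakage) is our assumption about the pipeline; the function names are ours:

```python
import numpy as np

def minmax_fit(train):
    """Per-feature minima and maxima, computed on the training split only."""
    return train.min(axis=0), train.max(axis=0)

def minmax_apply(x, lo, hi, eps=1e-8):
    """Scale features into [0, 1] using training-set statistics."""
    return (x - lo) / (hi - lo + eps)

rng = np.random.default_rng(4)
train = rng.normal(size=(100, 5))   # placeholder for the May-Nov 2023 split
lo, hi = minmax_fit(train)
scaled = minmax_apply(train, lo, hi)
```

Applying the same `lo`/`hi` at inference time keeps the December test data on the scale the model was trained on.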
The model is trained using the Adam optimizer with mean squared error (MSE) as the training objective. For multi-task learning across multiple wind farms, the overall loss is formulated as a weighted sum of per-farm MSE losses; equal task weights are adopted in this work. The initial learning rate is 0.001 and is decayed using a step-decay schedule during training. The model is trained for 200 epochs with a batch size of 20, and early stopping is applied when the MSE does not improve for 10 consecutive epochs. The proposed model has 3.0 M trainable parameters, with an average training time of approximately 50 s per epoch on a single NVIDIA RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
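The equally weighted multi-task loss and the 10-epoch early-stopping rule described above can be sketched as follows; the class and function names are ours, and unit task weights are one plausible reading of "equal weights":

```python
import numpy as np

def multitask_mse(preds, targets, weights=None):
    """Overall loss: weighted sum of per-farm MSE losses (equal weights here)."""
    if weights is None:
        weights = [1.0] * len(preds)            # equal task weights
    return sum(w * np.mean((p - t) ** 2)
               for w, p, t in zip(weights, preds, targets))

class EarlyStopping:
    """Stop when the monitored MSE fails to improve for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.bad = loss, 0       # improvement: reset counter
        else:
            self.bad += 1                       # no improvement this epoch
        return self.bad >= self.patience        # True -> stop training
```

In a training loop, `EarlyStopping(patience=10).step(val_loss)` would be checked once per epoch, alongside the step-decay learning-rate schedule.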

4.2. Evaluation Metrics

The evaluation metrics adopted in this case study are R2 (R-squared), root mean square error (RMSE), mean absolute error (MAE), and normalized root mean square error (nRMSE). Their corresponding formulations are given in (15)–(18).
R² = 1 − [ Σ_{t=1}^{T} ( ŷ_t − y_t )² ] / [ Σ_{t=1}^{T} ( ȳ − y_t )² ],  where ȳ = (1/T) Σ_{t=1}^{T} y_t    (15)
RMSE = √( (1/T) Σ_{t=1}^{T} ( ŷ_t − y_t )² )    (16)
MAE = (1/T) Σ_{t=1}^{T} | ŷ_t − y_t |    (17)
nRMSE = √( (1/T) Σ_{t=1}^{T} ( ( ŷ_t − y_t ) / C_t )² )    (18)
where T is the total number of test samples; t is the sample index; ŷ_t is the predicted value of the t-th test sample; y_t is the true value of the t-th test sample; and C_t denotes the installed capacity of the wind farm associated with the t-th test sample.
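Equations (15)–(18) admit a direct NumPy transcription:

```python
import numpy as np

def metrics(y_hat, y, cap):
    """R2, RMSE, MAE and nRMSE as defined in Eqs. (15)-(18).
    y_hat, y: predicted and true values; cap: installed capacity per sample."""
    err = y_hat - y
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y.mean() - y) ** 2)  # Eq. (15)
    rmse = np.sqrt(np.mean(err ** 2))                          # Eq. (16)
    mae = np.mean(np.abs(err))                                 # Eq. (17)
    nrmse = np.sqrt(np.mean((err / cap) ** 2))                 # Eq. (18)
    return r2, rmse, mae, nrmse
```

A perfect forecast yields R² = 1 and zero error metrics, while nRMSE expresses the error as a fraction of installed capacity, making it comparable across farms of different sizes.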

4.3. Comparison with the Independently Trained Transformer Model

In the proposed multi-decoder, multi-task learning Transformer model, the number of decoders equals the number of wind farms (one decoder per farm). The number of attention heads is set to 4, a setting adopted throughout this work because it was found experimentally to offer a good accuracy–cost trade-off. Both the encoder and each decoder use a single layer with a hidden dimension of 256. To prevent overfitting, a dropout rate of 0.1 is applied to all layers.
The proposed model is compared with a Transformer model that independently forecasts each wind farm. The indicator analysis is conducted under four scenarios:
  • Scenario 1 evaluates the forecasting performance for wind farm 1,
  • Scenario 2 evaluates the forecasting performance for wind farm 2,
  • Scenario 3 evaluates the forecasting performance for wind farm 3, and
  • Scenario 4 evaluates the metrics based on the total power output of wind farms 1, 2, and 3.
Taking Scenario 4 as an example, the forecasting results of the proposed model and the independent Transformer model are shown in Figure 2.
The evaluation metrics R2, RMSE, MAE, and nRMSE for the proposed model and the independent Transformer model are shown in Table 1.
According to the data in Table 1, the proposed model achieves significant improvements over the independent Transformer model across all evaluation metrics. In terms of R2, in Scenario 1, the proposed model attains an R2 of 0.60, whereas the independent Transformer model achieves only 0.35, corresponding to an improvement of 71.4%. In Scenario 2, the R2 of the proposed model is 0.63, which is 80.0% higher than that of the independent Transformer model (0.35). In Scenario 3, the proposed model reaches an R2 of 0.86, compared with 0.55 for the independent Transformer model, representing an improvement of 56.4%. In Scenario 4, the proposed model obtains an R2 of 0.87, while the independent Transformer model achieves 0.63, yielding an improvement of 38.1%. These results indicate that the proposed model can better fit the data and enhance the reliability of the forecasts.
Regarding RMSE and MAE, the proposed model outperforms the independent Transformer model in all scenarios, indicating a substantial reduction in prediction errors. In Scenario 1, the RMSE of the proposed model is 23.76, compared with 30.51 for the independent Transformer model, corresponding to a 22.1% reduction in error; meanwhile, MAE decreases from 25.49 to 17.59, a reduction of 31.0%. In Scenario 2, the RMSE of the proposed model is 7.43, compared with 9.74 for the independent Transformer model, representing a 23.7% decrease; MAE decreases from 7.40 to 5.70, a reduction of 23.0%. In Scenario 3, the RMSE of the proposed model is 23.26, while that of the independent Transformer model is 41.14, corresponding to a 43.5% reduction; MAE decreases from 30.55 to 16.33, a reduction of 46.5%. In Scenario 4, the RMSE of the proposed model is 33.52, compared with 55.90 for the independent Transformer model, yielding a 40.0% reduction; MAE decreases from 44.00 to 25.25, a reduction of 42.6%.
In terms of nRMSE, which is a normalized metric, in Scenario 1, the nRMSE of the proposed model is 0.14, whereas the independent Transformer model achieves 0.19, corresponding to a 26.3% improvement. In Scenario 2, the nRMSE of the proposed model is 0.15, compared to 0.20 for the independent Transformer model, resulting in a 25.0% improvement. In Scenario 3, the proposed model achieves an nRMSE of 0.12, while the independent Transformer model reaches 0.20, representing a 40.0% improvement. In Scenario 4, the nRMSE of the proposed model is 0.09, whereas that of the independent Transformer model is 0.13, still yielding a 30.8% improvement.
Overall, the proposed model performs better than the independent Transformer model across all evaluation metrics: it improves R2 by 38.1–80.0%, reduces RMSE by 22.1–43.5%, reduces MAE by 23.0–46.5%, and reduces nRMSE by 25.0–40.0%. These results demonstrate that the proposed model surpasses the independent Transformer model in terms of fitting capability, prediction accuracy, and stability, and thus exhibits superior forecasting performance.
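The percentage figures quoted above follow the usual relative-change convention; for example, the Scenario 4 gains reproduce as:

```python
def rel_change(new, old):
    """Relative change of `new` versus baseline `old`, in percent."""
    return 100.0 * (new - old) / old

# Scenario 4 R2: 0.87 (proposed) vs. 0.63 (independent Transformer)
r2_gain = rel_change(0.87, 0.63)        # ~38.1% improvement (higher is better)

# Scenario 4 RMSE: 33.52 (proposed) vs. 55.90 (independent Transformer)
rmse_drop = -rel_change(33.52, 55.90)   # ~40.0% reduction (lower is better)
```

For error metrics the sign is flipped so that a positive number denotes a reduction, matching how the improvements are reported in the text.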

4.4. Comparison with Joint Forecasting Deep Learning Models

The proposed multi-decoder, multitask learning Transformer model is further compared with deep learning models that perform joint forecasting for all wind farms. The baseline models considered in this paper are an MLP model, an LSTM model, and a CNN model. All three baselines are implemented using standard Python libraries (Python 3.10.1) and trained with the Adam optimizer and MSE loss. Their architectures and hyperparameters are selected via grid search. The forecasting result curves of each model are shown in Figure 3.
The evaluation metrics R2, RMSE, MAE, and nRMSE for all models are shown in Table 2. These results indicate that the proposed model demonstrates clear advantages over the other joint forecasting models: across R2, RMSE, MAE, and nRMSE, it consistently outperforms the MLP, LSTM, and CNN models, highlighting its superiority in the forecasting task.
In terms of R2, the proposed model outperforms all other models in every scenario. In Scenario 1, the proposed model achieves an R2 of 0.60, which is 100% higher than that of the best-performing CNN model and 160.9% higher than that of the worst-performing MLP model. In Scenario 2, the R2 of the proposed model is 0.63, representing an improvement of 125.0% over the best-performing LSTM model and 350.0% over the worst-performing CNN model. In Scenario 3, the proposed model attains an R2 of 0.86, which is 68.6% higher than that of the best-performing CNN model and 115.0% higher than that of the worst-performing MLP model. In Scenario 4, the proposed model reaches an R2 of 0.87, exceeding the best-performing CNN model by 38.1% and the worst-performing MLP model by 70.6%.
With respect to RMSE and MAE, the proposed model also shows a clear advantage, effectively reducing prediction errors and improving forecasting accuracy. In Scenario 1, the RMSE of the proposed model is 23.76, which is 24.6% lower than that of the best-performing CNN model and 28.4% lower than that of the worst-performing MLP model. Meanwhile, the MAE is 17.59, representing reductions of 30.2% and 33.6% compared with the best-performing CNN model and the worst-performing MLP model, respectively. In Scenario 2, the proposed model achieves an RMSE of 7.43, which is 27.6% lower than that of the best-performing LSTM model and 52.2% lower than that of the worst-performing CNN model; the MAE is 5.70, 32.3% lower than that of the best-performing LSTM model and 51.2% lower than that of the worst-performing CNN model. In Scenario 3, the RMSE of the proposed model is 23.26, corresponding to reductions of 45.9% and 50.9% relative to the best-performing CNN model and the worst-performing MLP model, respectively; the MAE is 16.33, 48.6% lower than that of the best-performing LSTM model and 52.7% lower than that of the worst-performing MLP model. In Scenario 4, the proposed model attains an RMSE of 33.52, which is 40.0% lower than that of the best-performing CNN model and 47.7% lower than that of the worst-performing MLP model; the MAE is 25.25, representing reductions of 44.0% and 47.9% compared with the best-performing and worst-performing CNN and MLP models, respectively.
In terms of nRMSE, the proposed model consistently outperforms all other models in every scenario. In Scenario 1, the nRMSE of the proposed model is 0.14, which is 26.3% lower than that of the best-performing CNN model and 30.0% lower than that of the worst-performing MLP and LSTM models. In Scenario 2, the proposed model achieves an nRMSE of 0.15, improving upon the best-performing LSTM model by 28.6% and the worst-performing CNN model by 51.6%. In Scenario 3, the nRMSE of the proposed model is 0.12, which is 42.9% lower than that of the best-performing CNN model and 50.0% lower than that of the worst-performing MLP model. In Scenario 4, the proposed model reaches an nRMSE of 0.09, which is 30.8% lower than that of the best-performing CNN model and 40.0% lower than that of the worst-performing MLP and LSTM models.
Overall, the proposed model consistently outperforms other common models such as MLP, LSTM, and CNN in all scenarios, with significant improvements in the key metrics R2, RMSE, MAE, and nRMSE. The improvement in R2 ranges from 70.6% to 173.9%, with Scenario 2 showing the most pronounced gain and Scenario 4 exhibiting a relatively smaller increase. The reduction in RMSE lies between 28.4% and 49.0%, with the largest error reduction observed in Scenario 3 and the smallest in Scenario 1. The decrease in MAE ranges from 32.3% to 53.7%, where Scenario 3 achieves the greatest reduction and Scenario 2 the smallest. The reduction in nRMSE ranges from 26.3% to 51.6%, with the highest gain in Scenario 2 and a relatively smaller improvement in Scenario 1. These results demonstrate that the proposed model exhibits significant performance advantages across different scenarios.

4.5. Overall Comparison of All Models

To enable a horizontal comparison of all forecasting models, this section takes Scenario 4 as an example and compares the statistical forecasting metrics of different models, including R2, RMSE, MAE, and nRMSE, to provide a more intuitive view of their performance. In addition, to reflect the improvement brought by the proposed forecasting model in terms of power-system economic benefits, the daily operating cost is introduced as a cost metric. Based on the IEEE 30-bus test system, a power-system operation workflow is simulated [18], including day-ahead unit commitment and intra-day economic dispatch. The cost metric is defined as the daily average total operating cost of the system, i.e., the sum of the day-ahead cost and the intra-day cost. The thermal unit parameters are listed in Table 3.
As shown in Table 4, the overall comparison across all models indicates that the forecasting performance of the Independent Transformer model and other joint-forecasting models (MLP model, LSTM model, and CNN model) is broadly comparable, whereas the proposed model achieves a clear improvement in forecasting accuracy and yields enhanced economic benefits for power system operation. In terms of accuracy metrics, the proposed model improves R2 by 38.10% compared with the Independent Transformer model and by 70.59%, 58.18%, and 38.10% compared with the MLP, LSTM, and CNN models, respectively. For RMSE, the proposed model improves performance by 40.04% relative to the Independent Transformer model and by 47.68%, 45.17%, and 40.04% relative to the MLP, LSTM, and CNN models, respectively. For MAE, the corresponding improvements are 42.61% over the Independent Transformer model and 47.97%, 46.36%, and 44.03% over the MLP, LSTM, and CNN models, respectively. For the normalized metric nRMSE, the proposed model achieves improvements of 30.77% over the Independent Transformer model and 40.00%, 40.00%, and 30.77% over the MLP, LSTM, and CNN models, respectively. In terms of the cost metric, the proposed model reduces the daily average total operating cost by 5.54% compared with the Independent Transformer model and by 11.37%, 9.50%, and 7.29% compared with the MLP, LSTM, and CNN models, respectively, thereby effectively lowering system operating costs and improving economic efficiency.

5. Conclusions

Traditional wind power forecasting methods often ignore complex coupling relationships among different wind farms. To address this issue, this paper proposes a short-term wind power forecasting method based on multiple decoders and multi-task learning. The proposed model adopts a single-encoder–multi-decoder architecture, in which a unified encoder is used to encode the input data, while multiple decoders are employed to perform the forecasting tasks for individual wind farms, thereby accounting for the intrinsic relationships among them. Case studies based on real wind farm data from the Inner Mongolia Autonomous Region of China demonstrate that, compared with classical forecasting models, the proposed model can effectively improve forecasting performance across different wind farms. In terms of accuracy metrics, compared with the Independent Transformer, MLP, LSTM, and CNN models, the proposed model improves R2 by up to 70.59%, reduces RMSE by up to 47.68%, reduces MAE by up to 47.97%, and reduces nRMSE by up to 40.00%. In terms of the cost metric, the proposed model reduces the daily average total operating cost by up to 11.37%, thereby lowering system operating costs and improving economic efficiency.
Despite the favorable forecasting performance, several limitations remain. First, although the multi-decoder design improves forecasting accuracy, scaling to a larger number of wind farms increases the number of decoders, which in turn raises computational and memory requirements. Second, the current evaluation is based on a single-year dataset, with the test period limited to one month, which may not fully reflect the model’s generalization ability across seasons and interannual variability. Future work will focus on addressing both limitations to facilitate broader real-world deployment across different regions.

Author Contributions

Methodology, Q.L. and Y.L.; validation, X.Y.; formal analysis, S.W.; investigation, H.Z.; resources, Q.L.; data curation, X.Y.; writing—original draft preparation, Q.L., Y.L. and X.Y.; writing—review and editing, S.W. and R.L.; visualization, H.Z.; supervision, R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Program of Inner Mongolia Autonomous Region (Project No. 2022JBGS0044).

Data Availability Statement

The dataset presented in this study is available at https://github.com/768-lab/A-Short-Term-Wind-Power-Forecasting-Method-Based-on-Multi-Decoder-and-Multi-Task-Learning.git (accessed on 28 December 2025).

Conflicts of Interest

Authors Qiang Li, Yongzhi Liu and Siyu Wang were employed by the company Inner Mongolia Power (Group) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Architecture of the multi-decoder and multi-task learning model.
Figure 2. Forecasting results of the proposed model and the independent Transformer model. (a) Forecasting results of the proposed model, which performs joint forecasting for multiple wind farms using a multi-decoder, multitask learning architecture. (b) Forecasting results of the independent Transformer model, which builds separate Transformer models to forecast each wind farm individually.
Figure 3. Forecasting results of the proposed model, MLP model, LSTM model, and CNN model. (a) Forecasting results of the proposed model. (b) Forecasting results of the MLP model. (c) Forecasting results of the LSTM model. (d) Forecasting results of the CNN model.
Table 1. Comparison with the Independent Transformer model.

Scenario | Model                         | R2   | RMSE  | MAE   | nRMSE
1        | Proposed model                | 0.60 | 23.76 | 17.59 | 0.14
1        | Independent Transformer model | 0.35 | 30.51 | 25.49 | 0.19
2        | Proposed model                | 0.63 | 7.43  | 5.70  | 0.15
2        | Independent Transformer model | 0.35 | 9.74  | 7.40  | 0.20
3        | Proposed model                | 0.86 | 23.26 | 16.33 | 0.12
3        | Independent Transformer model | 0.55 | 41.14 | 30.55 | 0.20
4        | Proposed model                | 0.87 | 33.52 | 25.25 | 0.09
4        | Independent Transformer model | 0.63 | 55.90 | 44.00 | 0.17
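The accuracy metrics reported in Tables 1, 2 and 4 can be computed from the measured and forecast power series as follows. The normalization of nRMSE by rated capacity is a common convention but is an assumption here; the paper may normalize differently.

```python
import numpy as np

def forecast_metrics(y_true, y_pred, capacity):
    """R2, RMSE, MAE, and nRMSE for one wind farm's forecast series.

    nRMSE is shown normalized by rated capacity (an assumption; other
    normalizations, e.g. by the observed range, are also in use).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "nRMSE": rmse / capacity}

# Toy example with made-up power values (MW) and a 100 MW rated capacity.
m = forecast_metrics([10, 20, 30, 40], [12, 18, 33, 38], capacity=100)
print({k: round(v, 3) for k, v in m.items()})
```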
Table 2. Comparison with joint prediction machine learning models.

Scenario | Model          | R2   | RMSE  | MAE   | nRMSE
1        | Proposed model | 0.60 | 23.76 | 17.59 | 0.14
1        | MLP model      | 0.23 | 33.17 | 26.48 | 0.20
1        | LSTM model     | 0.25 | 32.82 | 26.31 | 0.20
1        | CNN model      | 0.30 | 31.50 | 25.21 | 0.19
2        | Proposed model | 0.63 | 7.43  | 5.70  | 0.15
2        | MLP model      | 0.23 | 12.35 | 9.29  | 0.25
2        | LSTM model     | 0.28 | 10.26 | 8.42  | 0.21
2        | CNN model      | 0.14 | 15.54 | 11.67 | 0.31
3        | Proposed model | 0.86 | 23.26 | 16.33 | 0.12
3        | MLP model      | 0.40 | 47.38 | 34.52 | 0.24
3        | LSTM model     | 0.48 | 44.06 | 31.75 | 0.22
3        | CNN model      | 0.51 | 42.96 | 34.08 | 0.21
4        | Proposed model | 0.87 | 33.52 | 25.25 | 0.09
4        | MLP model      | 0.51 | 64.07 | 48.53 | 0.15
4        | LSTM model     | 0.55 | 61.13 | 47.07 | 0.15
4        | CNN model      | 0.63 | 55.90 | 45.11 | 0.13
Table 3. Parameters of the thermal generating units.

Unit ID | Rated Capacity (MW) | Energy Price (CNY/MW) | Upward Reserve Capacity (MW) | Downward Reserve Capacity (MW) | Upward Reserve Price (CNY/MW) | Downward Reserve Price (CNY/MW)
1       | 10,205              | 500                   | 5000                         | 5000                          | 1200                          | 1200
2       | 6480                | 480                   | 2800                         | 2800                          | 1000                          | 1000
3       | 3888                | 550                   | 1600                         | 1600                          | 1250                          | 1250
4       | 5184                | 460                   | 2400                         | 2400                          | 1000                          | 1000
5       | 2592                | 510                   | 1200                         | 1200                          | 1200                          | 1200
6       | 2592                | 500                   | 1120                         | 1120                          | 1150                          | 1150
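The cost metric in Table 4 aggregates energy and reserve procurement from the thermal units in Table 3. The paper's full dispatch model is not reproduced here; as a hedged illustration under a simple linear-cost assumption, the reserve procurement cost of a single unit is:

```python
def unit_reserve_cost(up_mw: float, down_mw: float,
                      up_price: float, down_price: float) -> float:
    """Linear reserve procurement cost for one thermal unit, in CNY.

    An illustrative assumption: the actual cost model in the case study
    may include energy costs and scheduling constraints not shown here.
    """
    return up_mw * up_price + down_mw * down_price

# Unit 2 from Table 3 (reserve prices of 1000 CNY/MW in each direction),
# with hypothetical scheduled reserves of 50 MW up and 30 MW down.
cost = unit_reserve_cost(up_mw=50, down_mw=30, up_price=1000, down_price=1000)
print(cost)  # 80000
```

Because a more accurate forecast requires less scheduled reserve, this is the channel through which the forecasting improvements translate into the lower daily operating costs reported in Table 4.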
Table 4. Comparison of forecasting performance across all models under Scenario 4 in terms of R2, RMSE, MAE, nRMSE, and cost.

Scenario | Model                         | R2   | RMSE  | MAE   | nRMSE | Cost (CNY/Day)
4        | Proposed model                | 0.87 | 33.52 | 25.25 | 0.09  | 7968
4        | Independent Transformer model | 0.63 | 55.90 | 44.00 | 0.13  | 8435
4        | MLP model                     | 0.51 | 64.07 | 48.53 | 0.15  | 8990
4        | LSTM model                    | 0.55 | 61.13 | 47.07 | 0.15  | 8804
4        | CNN model                     | 0.63 | 55.90 | 45.11 | 0.13  | 8595
