Article

A Wind Power Forecasting Method Based on Lightweight Representation Learning and Multivariate Feature Mixing

Chudong Shan, Shuai Liu, Shuangjian Peng, Zhihong Huang, Yuanjun Zuo, Wenjing Zhang and Jian Xiao
1 State Grid Hunan Electric Power Company Limited Research Institute, Changsha 410000, China
2 Hunan Province Engineering Technology Research Center of Electric Power Multimodal Perception and Edge Intelligence, Changsha 410000, China
* Author to whom correspondence should be addressed.
Energies 2025, 18(11), 2902; https://doi.org/10.3390/en18112902
Submission received: 7 April 2025 / Revised: 20 May 2025 / Accepted: 30 May 2025 / Published: 1 June 2025
(This article belongs to the Special Issue Trends and Challenges in Power System Stability and Control)

Abstract

With the rapid development of renewable energy, wind power forecasting has become increasingly important in power system scheduling and management. However, wind power forecasting is subject to the complex influence of multiple variable features and their interrelationships, which poses challenges to traditional forecasting methods. As an effective feature extraction technique, representation learning can better capture complex feature relationships and improve forecasting performance. This paper proposes a two-stage forecasting framework based on lightweight representation learning and multivariate feature mixing. In the representation learning stage, an efficient spatial pyramid module is introduced to reconstruct the dilated convolution part of the original TS2Vec representation learning model, fusing multi-scale features and mitigating the gridding effect caused by dilated convolution while significantly reducing the number of parameters in the representation learning model. In the feature mixing stage, TSMixer is used as the base model to extract cross-dimensional interaction features through its multivariate linear mixing mechanism, and the lightweight SimAM attention mechanism is introduced to adaptively focus on the contribution of key time steps and optimize the allocation of forecasting weights. Experimental results on actual wind farm datasets show that the proposed model significantly improves the accuracy of wind power forecasting, providing new ideas and methods for the field of wind power forecasting.

1. Introduction

With the continuous growth of global energy demand and the increasing awareness of environmental protection, renewable energy has gradually become one of the focal points of international attention. Among renewable energy sources, wind energy, as a relatively mature and widely utilized form, has enormous development potential and economic benefits [1]. However, due to the complexity and randomness of wind speed, the volatility of wind power poses challenges for the operation and planning of wind power systems [2]. Therefore, accurately forecasting the time series of wind power is important. Time series forecasting of wind power can provide strong support for grid scheduling, new energy generation planning, and the operation and maintenance of wind turbines. An effective wind power forecasting model can help optimize the scheduling and regulation of wind power systems and improve system performance and reliability while also reducing energy costs and environmental pollution [3].
In wind power time series forecasting, researchers have conducted a substantial amount of research. Commonly used forecasting methods include statistical time series analysis methods, Artificial Intelligence (AI) methods, and physics-based models [4]. Among these, machine learning methods such as Support Vector Machines (SVM), Artificial Neural Networks (ANN), Deep Neural Networks (DNN), and other deep learning algorithms have been widely applied in wind power forecasting [5,6,7]. Karijadi et al. [8] proposed a hybrid CEEMDAN-EWT deep learning method for wind power forecasting. This method combines Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and the Empirical Wavelet Transform (EWT) as preprocessing techniques: CEEMDAN first decomposes the original wind power data into several sub-sequences, and EWT denoising is applied to the highest-frequency sequence generated by CEEMDAN. Long Short-Term Memory (LSTM) networks are then used to predict all sub-sequences from the CEEMDAN-EWT process, and the predictions of the sub-sequences are aggregated into the final result. However, this method does not effectively exploit the multivariate features that affect wind power generation. Chen et al. [9] proposed a CNN-BiLSTM short-term wind power forecasting method based on feature selection. Although this method utilizes multivariate features, it requires a feature correlation analysis of the dataset and weighting of the input data based on feature correlation to form a multidimensional feature dataset, which is a cumbersome process.
Currently, traditional wind power forecasting models face numerous issues when handling time series data, such as weak feature extraction capabilities and significant information loss. In existing wind power forecasting methods, the use of feature representation is relatively limited. There are currently many methods for feature selection, such as ElasticNet, Pearson’s correlation, and ReliefF algorithms [10,11], but using deep learning techniques for automatic feature extraction and representation would be more convenient and accurate. Feature representation can extract latent patterns and characteristics from time series data, thereby improving the performance of forecasting models. Traditional time series forecasting methods often rely on simple statistical features or directly use raw data, failing to fully mine the deep-seated information within the data. This leads to certain limitations for the models when confronted with the complex variations in wind power.
To address these issues, some researchers have proposed unsupervised representation learning models in recent years, among which the TS2Vec model is a typical example [12]. The TS2Vec model extracts features by mapping time series data to a low-dimensional vector space, using traditional dilated convolution networks for feature processing. Although TS2Vec has achieved good results on various downstream machine learning tasks, its dilated convolution module has limitations in capturing long-term dependencies, and its gridding effect may cause information loss or ambiguity. Additionally, the model has a large number of parameters and high computational complexity. The efficient spatial pyramid structure proposed in ESPNet [13] can better compensate for the gridding effect caused by dilated convolution while reducing model complexity.
It is equally challenging to select suitable downstream forecasting models that mix and extract multiple feature variables after representation learning. Many current time series forecasting models are based on univariate forecasting, while multivariate models appear prone to overfitting, especially when the target time series is not correlated with other covariates [14]. However, when the target wind power series is strongly correlated with various meteorological covariates, a model that can use information from covariates and other features for forecasting is valuable [15]. In recent years, scholars have proposed models such as PatchTST [16], DLinear [17], and CycleNet [18], which are implemented in a channel-independent fashion. These models usually do not consider potential interactions or correlations between channels; each channel is processed as a separate input without utilizing shared information or dependencies [19]. Various time series foundation models, such as LLM4TS [20], Time-LLM [21], and Chronos [22], are also channel-independent. Although LLM4TS and Time-LLM introduce pre-training techniques, their self-attention mechanisms remain limited to single-variable channels, making it difficult to establish dynamic weight allocation mechanisms across feature domains.
In response to the above bottlenecks, multivariate feature mixing has become a new direction to break through the limitations of channel independence [23]. TSMixer, as a novel architecture, effectively captures temporal patterns and cross-variable information through temporal and feature mixing operations, enabling dynamic fusion of multivariate time series [24]. However, existing research still faces challenges in designing downstream forecasting tasks based on feature representation, where redundancy and noise coexist in high-dimensional feature spaces, and the contribution of key covariates needs to be strengthened through adaptive weight allocation [25]. SimAM, as a lightweight and parameter-free attention mechanism, generates attention weights by calculating the local self-similarity of feature maps without introducing any additional parameters [26]. The introduction of a lightweight SimAM attention mechanism can evaluate feature importance through an energy function that does not require parameter optimization, and can highlight the weights of key segments in multivariate time series under controllable computational costs [27].
Therefore, this paper proposes a wind power forecasting method built on a two-stage framework of lightweight representation learning and multivariate feature mixing. In the representation learning stage, the efficient spatial pyramid module replaces the dilated convolution module in the original TS2Vec model, and the improved TS2Vec model encodes the matrix containing historical measured weather data, historical power generation data, and other wind farm features. The represented multivariate feature data then serves as input for the downstream forecasting model. In the multivariate feature mixing stage, a SimAM-TSMixer hybrid architecture is adopted to break through the inherent limitations of channel-independent models and provide a new solution for multivariate wind power forecasting. By efficiently combining SimAM with TSMixer, the model can dynamically calibrate the correlation strength between meteorological covariates and the target time series during feature mixing while mining spatiotemporal interaction patterns across variables.
The main contributions of this paper are as follows:
  • A two-stage forecasting framework based on lightweight representation learning and multivariate feature mixing is proposed, which can effectively extract potential patterns and features in wind power-related time series while efficiently realizing the dynamic fusion of multivariate features, enabling the model to better adapt to complex time series data.
  • In the lightweight representation learning stage, the dilated convolution part of the existing TS2Vec representation model is innovatively modified, and the efficient spatial pyramid structure is adopted, which better compensates for the gridding effect caused by the dilated convolution. This not only enhances the ability of the model to capture multi-scale features, but also improves the flexibility and adaptability of the model in dealing with complex time series data.
  • In the multivariate feature mixing stage, a multivariate mixing layer is constructed based on the TSMixer architecture, which utilizes its cross-dimensional interaction mechanism to extract implicit associations among features, and embeds the SimAM lightweight attention mechanism, which adaptively adjusts the weights of the time steps through parameter-free computation to suppress the noise interference and enhance the contribution of key features.

2. Data

To verify the effectiveness and practicality of the model proposed in this paper, the data used in the study consists of real historical data from a wind farm in Hunan Province, China, over the course of one year. The data sampling frequency is 15 min and includes measurements such as wind speed and direction at 10 m, wind speed and direction at 50 m, hub wind speed and direction, temperature, pressure, and historical actual power.
Because wind power generation systems and their supporting meteorological monitoring equipment operate over long periods, various problems can arise during data collection (such as equipment failures, signal interference, and network transmission interruptions), leaving noise, erroneous records, and missing values in the raw data. The main purpose of data cleaning is to eliminate invalid data, fill in missing values, and correct erroneous records, thereby ensuring the integrity and accuracy of the dataset. In this dataset, the longest continuous gap is approximately 3 h, i.e., 12 consecutive sampling points. The Isolation Forest algorithm [28] is used for outlier detection, and both missing values and detected outliers are repaired with the commonly used linear interpolation method [29] shown in Equation (1), where $y_2 > y_1$ and $x_1 < x < x_2$. In the equation, $x$ is the position of the unknown data point and $y$ is its interpolated value. The equation follows the principle of similar triangles, using the two known data points $(x_1, y_1)$ and $(x_2, y_2)$ to compute the interpolated point.
$$y = y_1 + (y_2 - y_1) \times \frac{x - x_1}{x_2 - x_1} \tag{1}$$
We use StandardScaler for data normalization [30], and the formula for data normalization is shown in Equation (2):
$$Z = \frac{x - \mu}{\sigma} \tag{2}$$
Here x is the original data value, μ is the mean of the feature, which is the average of all data points, and σ is the standard deviation of the feature, which indicates the dispersion of the data points.
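For concreteness, the cleaning and scaling pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' exact implementation: the contamination rate, the temporary gap-filling before fitting Isolation Forest, and the function name are our assumptions, since the text does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

def clean_and_scale(df: pd.DataFrame, contamination: float = 0.01):
    """Flag outliers with Isolation Forest, repair gaps with linear
    interpolation (Equation (1)), then standardize (Equation (2))."""
    df = df.copy()
    # Isolation Forest cannot handle NaNs, so fit it on a temporarily filled copy.
    filled = df.ffill().bfill()
    outlier = IsolationForest(contamination=contamination,
                              random_state=0).fit_predict(filled.values) == -1
    df[outlier] = np.nan  # treat detected outliers as missing
    # Linear interpolation; gaps in this dataset never exceed 12 samples (~3 h).
    df = df.interpolate(method="linear", limit=12, limit_direction="both")
    # Z-score normalization per feature column.
    scaler = StandardScaler()
    return pd.DataFrame(scaler.fit_transform(df), columns=df.columns,
                        index=df.index), scaler
```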

3. Methods and Results

3.1. Lightweight Representation Learning Model

3.1.1. Model Design

The lightweight representation learning model set out in the present paper enhances the dilated convolution in the encoder part of the original TS2Vec model. As illustrated in Figure 1, the original TS2Vec model employs a dilated convolution module with 11 residual blocks to extract contextual representations for each timestamp. Each residual block contains two one-dimensional dilated convolution layers with a kernel size of 3 and a dilation parameter of $2^{i-1}$, where i is the index of the residual block, so the dilation increases from 1 to 1024 across the blocks. However, because each residual block stacks two dilated convolutions with the same dilation parameter, some pixels in the feature map do not participate in the computation. Additionally, the gridding effect of dilated convolution weakens the interdependence between the outputs of successive convolution layers, leading to insufficient correlation between the convolution results of corresponding layers and causing local information loss. Although the original TS2Vec model uses a residual structure to counteract part of this gridding effect, each residual block convolves the output of the preceding block rather than separately extracting multi-scale features from the original input. Furthermore, the large number of convolution layers results in a significant number of parameters and a heavy computational load. Following the scheme proposed in ESPNet [13], this paper replaces the dilated convolution module in the original TS2Vec model with the efficient spatial pyramid module shown in Figure 2.
Figure 2 illustrates the structure of the lightweight representation learning model set out in the present paper, and its right side shows the specific implementation of the efficient spatial pyramid module. To reduce the channel number of the feature map output by the timestamp masking module of the original TS2Vec model, and thereby decrease the subsequent computational load, a one-dimensional pointwise convolution with a kernel size of 1 is applied first. Then, 11 parallel dilated convolution layers with dilation parameters ranging from 1 to 1024 operate on the reduced feature map; the i-th layer has a dilation parameter of $2^{i-1}$, with the padding value equal to that layer's dilation parameter. The feature map from the dilated convolution layer with the smallest dilation parameter is progressively summed with the outputs of the other dilated convolution layers, and the resulting stacked feature maps are concatenated. Finally, the concatenated feature map is added to the linearly projected input feature map to form a residual structure. This structure employs the concept of feature hierarchy: dilated convolutions with different dilation parameters directly perform multi-scale feature extraction on the pointwise-convolved input, and their outputs are layered and stacked. Essentially, this adds discrete receptive fields, effectively compensating for the gridding effect caused by dilated convolutions, preserving local details and global semantic features, and better capturing multi-scale temporal information. Meanwhile, the pointwise convolution and the reduced number of convolution layers decrease the overall computational load of the model.
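The block below is a minimal PyTorch sketch of this one-dimensional efficient spatial pyramid. The channel sizes (a 64-to-32 pointwise reduction, 11 parallel dilated branches of 32 channels each, and a 352-channel residual) follow the configuration detailed in Section 3.1.2; the class name and the use of a linear layer for the skip projection are our own illustrative choices.

```python
import torch
import torch.nn as nn

class EfficientSpatialPyramid1D(nn.Module):
    """ESP-style block: pointwise channel reduction, 11 parallel dilated
    convolutions (dilation 1..1024), hierarchical feature fusion, and a
    residual connection to the linearly projected input."""
    def __init__(self, in_ch=64, mid_ch=32, out_ch=352, n_branches=11):
        super().__init__()
        self.reduce = nn.Conv1d(in_ch, mid_ch, kernel_size=1)  # pointwise conv
        self.branches = nn.ModuleList([
            nn.Conv1d(mid_ch, mid_ch, kernel_size=3,
                      dilation=2 ** i, padding=2 ** i)  # padding = dilation keeps length
            for i in range(n_branches)                  # dilations 1, 2, ..., 1024
        ])
        self.skip = nn.Linear(in_ch, out_ch)            # projection for the residual

    def forward(self, x):                               # x: (batch, in_ch, length)
        r = self.reduce(x)
        outs = [branch(r) for branch in self.branches]
        # Hierarchical fusion: progressively sum the branch outputs so each
        # retained map accumulates all smaller receptive fields.
        fused, acc = [], 0
        for o in outs:
            acc = acc + o
            fused.append(acc)
        y = torch.cat(fused, dim=1)                     # 32 x 11 = 352 channels
        skip = self.skip(x.transpose(1, 2)).transpose(1, 2)
        return y + skip                                 # residual structure
```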
The hierarchical contrasting module in Figure 1 and Figure 2 uses a temporal contrastive loss and an instance-wise contrastive loss to capture the contextual representation of the time series. Representations of the same timestamp from the two views of the input time series are treated as positives, while representations of different timestamps from the same series are treated as negatives. Let i index the input time series sample, t the timestamp, $r_{i,t}$ and $r'_{i,t}$ the representations of timestamp t in the two views, and $\Omega$ the set of timestamps in the overlap of the two views. The temporal contrastive loss of the i-th time series at timestamp t is given in Equation (3).

$$\ell_{\mathrm{temp}}^{(i,t)} = -\log \frac{\exp(r_{i,t} \cdot r'_{i,t})}{\sum_{t' \in \Omega} \left( \exp(r_{i,t} \cdot r'_{i,t'}) + \mathbb{1}_{[t \neq t']} \exp(r_{i,t} \cdot r_{i,t'}) \right)} \tag{3}$$
The instance-wise contrastive loss indexed by (i, t) can be expressed as Equation (4).
$$\ell_{\mathrm{inst}}^{(i,t)} = -\log \frac{\exp(r_{i,t} \cdot r'_{i,t})}{\sum_{j=1}^{B} \left( \exp(r_{i,t} \cdot r'_{j,t}) + \mathbb{1}_{[i \neq j]} \exp(r_{i,t} \cdot r_{j,t}) \right)} \tag{4}$$

Here, B denotes the batch size.
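Both losses admit a compact vectorized implementation. The sketch below follows the public TS2Vec reference implementation: z1 and z2 are the (batch, time, channels) representations of the two views, and the lower/upper-triangle trick removes each element's self-similarity from the denominator.

```python
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(z1, z2):
    """Equation (3): positives are the same timestamp across views;
    negatives are other timestamps of the same series."""
    B, T, _ = z1.shape
    z = torch.cat([z1, z2], dim=1)                    # (B, 2T, C)
    sim = torch.matmul(z, z.transpose(1, 2))          # (B, 2T, 2T)
    logits = torch.tril(sim, diagonal=-1)[:, :, :-1] + \
             torch.triu(sim, diagonal=1)[:, :, 1:]    # drop self-similarity
    logits = -F.log_softmax(logits, dim=-1)
    t = torch.arange(T, device=z1.device)
    return (logits[:, t, T + t - 1].mean() + logits[:, T + t, t].mean()) / 2

def instance_contrastive_loss(z1, z2):
    """Equation (4): positives are the same timestamp across views;
    negatives are the same timestamp of other series in the batch."""
    B, T, _ = z1.shape
    z = torch.cat([z1, z2], dim=0).transpose(0, 1)    # (T, 2B, C)
    sim = torch.matmul(z, z.transpose(1, 2))          # (T, 2B, 2B)
    logits = torch.tril(sim, diagonal=-1)[:, :, :-1] + \
             torch.triu(sim, diagonal=1)[:, :, 1:]
    logits = -F.log_softmax(logits, dim=-1)
    i = torch.arange(B, device=z1.device)
    return (logits[:, i, B + i - 1].mean() + logits[:, B + i, i].mean()) / 2
```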

3.1.2. Model Testing and Result Analysis

To validate the performance of the lightweight representation learning model set out in the present paper, we use the feature matrix of wind farm historical data after feature representation, combined with the downstream time series forecasting model LSTM [31], to test the lightweight representation learning model. Figure 3 displays the construction of the LSTM model used in testing. The model consists of 5 layers: the first three LSTM layers each have 50 hidden units and return full sequences (return_sequences = True); the fourth LSTM layer has 50 units and returns only its final output (return_sequences = False); the fifth layer is a fully connected Dense layer with an output dimension of 96, corresponding to one day of data in the dataset.
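A minimal PyTorch sketch of this downstream model follows; four stacked LSTM layers with only the last time step fed to a fully connected head reproduce the return-sequences pattern described above. Training hyperparameters are not specified in the text, so none are assumed here.

```python
import torch
import torch.nn as nn

class DownstreamLSTM(nn.Module):
    """Figure 3: 4 stacked LSTM layers (50 units each) + Dense(96)."""
    def __init__(self, n_features=352, hidden=50, horizon=96):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=4, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):                  # x: (batch, 288, n_features)
        out, _ = self.lstm(x)              # (batch, 288, hidden)
        return self.head(out[:, -1])       # last step -> (batch, 96)
```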
We first use the LSTM model to perform downstream forecasting on the wind farm data represented by the original TS2Vec model and on the data represented by the improved TS2Vec model, i.e., our proposed lightweight representation learning model. The dataset includes wind speed at 10 m, wind direction at 10 m, wind speed at 50 m, wind direction at 50 m, hub-height wind speed, hub-height wind direction, temperature, air pressure, and historical actual power data from wind turbines over one year. The dataset size is (35,040, 9).
As shown in Figure 1, the dilated convolution part of the original TS2Vec model uses a kernel size of 3; the first 10 residual blocks have 64 input and output feature map channels, while the last residual block has 352 output channels. For the improved TS2Vec model, since the original model outputs a feature map with a feature dimension of 64 after the timestamp masking module, the feature map is transposed so that the channel dimension is 64. A one-dimensional pointwise convolution with a kernel size of 1 then reduces the number of channels to 32, decreasing the subsequent computational load. In the subsequent parallel dilated convolution section, each layer has a kernel size of 3 and outputs 32 channels. The feature maps output by the dilated convolution layers are stacked step by step, and the stacked feature maps are concatenated. Finally, the concatenated feature map is added to the input feature map, first projected to 352 channels through a linear connection layer, to obtain an output feature map with 352 channels. Using 70% of the dataset to train the original TS2Vec model and the improved TS2Vec model, and encoding the original dataset with the trained models, a representation dataset of size (35,040, 352) is obtained.
As for the LSTM model for downstream time series forecasting used for model testing, a historical data set with a length of 288, which corresponds to three days of data after representation learning, is set as input x. The actual power of the original data with a length of 96, which corresponds to one day, is taken as output y to train the LSTM model. Similarly, 70% of the original data set is used as the training set, 10% of the data set is used as the validation set, and 20% of the data set is used as the testing set.
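A minimal sketch of this windowing scheme, with illustrative names: rep is the (35,040, 352) representation matrix and power is the aligned measured power series.

```python
import numpy as np

def make_windows(rep, power, in_len=288, out_len=96):
    """Pair three days of represented features (x) with the next day's
    measured power (y)."""
    xs, ys = [], []
    for s in range(len(rep) - in_len - out_len + 1):
        xs.append(rep[s:s + in_len])                        # (288, 352)
        ys.append(power[s + in_len:s + in_len + out_len])   # (96,)
    return np.stack(xs), np.stack(ys)

# Chronological 70/10/20 split, as described in the text:
# n = len(x); train, val, test = x[:int(.7*n)], x[int(.7*n):int(.8*n)], x[int(.8*n):]
```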
The evaluation indicators for the original TS2Vec model and the improved TS2Vec model, each combined with LSTM for time series forecasting on the test set, are defined as follows [32]; MAE and RMSE have the same unit as y.
$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - \hat{y}_i \right| \tag{5}$$

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2 \tag{6}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2} \tag{7}$$
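For reference, a minimal NumPy sketch computing Equations (5) through (7) from test-set predictions:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE, MSE, and RMSE; MAE and RMSE share y's unit."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": np.sqrt(mse)}
```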
The training and testing experiments of the model were conducted on an x86-64 Linux operating system equipped with an Intel(R) Xeon(R) Gold 6430 CPU and an NVIDIA A800 SXM4 80 GB GPU, using PyTorch 2.1.0 and CUDA 12.1.
Table 1 displays the test results of the lightweight representation learning model proposed in this paper, namely the improved TS2Vec model and the original TS2Vec model, when combined with the same downstream forecasting model LSTM for time series forecasting on the same dataset.
We also calculated the parameter count and computational complexity of the encoder section of both the original and the improved TS2Vec models on the test set [33]. Computational complexity is measured in FLOPs (the number of floating-point operations), and the parameter count (Params) indicates the size of the model. The comparison results are presented in Table 2.
From Table 2, it can be observed that compared to the original TS2Vec model, the improved TS2Vec model combined with LSTM significantly reduces computational and parameter complexity while ensuring forecasting accuracy.
We also compared time series forecasting on the original dataset using the LSTM model and several other existing deep learning time series models, without any representation learning, against the improved TS2Vec model combined with LSTM. The results are presented in Table 3. Meanwhile, Figure 4, Figure 5, Figure 6 and Figure 7 show the performance of the improved TS2Vec combined with LSTM and other mainstream deep learning time series forecasting models on the same test set.
The comparison curves in Figure 4, Figure 5, Figure 6 and Figure 7, which plot the wind power forecasts of the improved TS2Vec combined with LSTM and of other mainstream deep learning models against actual wind power over randomly selected periods, show that the forecasts of the improved TS2Vec combined with LSTM are closest to the true curve. These results indicate that the representation learning model significantly improves wind power time series forecasting. At the same time, the improved representation model has far fewer parameters and much lower computational complexity than the original TS2Vec model, which demonstrates the advantage of the lightweight representation learning model proposed in this paper.

3.2. Multivariate Feature Mixing Model

3.2.1. Model Design

In the earlier part of the paper, we designed a lightweight representation learning model and verified experimentally that it achieves good representation performance. Since the original data dimension after feature representation becomes 352, we studied how to develop a better multivariate feature mixing downstream forecasting model that can fully integrate and extract the features of each dimension in multivariate time series data. As noted in Section 1, many current time series forecasting models adopt the idea of channel independence. Therefore, we draw inspiration from the Time Series Mixer (TSMixer) [24] model and use a stacked multilayer perceptron architecture to achieve cross-variable information mixing and feature extraction on the high-dimensional, multivariate historical weather and power data of wind farms after feature representation, thereby further improving the accuracy of wind power forecasting. In addition, we introduce the lightweight, parameter-free attention mechanism SimAM [26] into the downstream forecasting model; it generates attention weights by calculating the local self-similarity of feature maps without introducing any additional parameters, improving the performance of the downstream model.
The overall structure diagram of the multivariate feature mixing downstream forecasting model designed in this paper is presented in Figure 8. In terms of the dataset, we use the improved TS2Vec representation model to represent the data with a size of (35,040, 352) as the dataset. At the same time, to further highlight the impact of historical power generation on future power generation, we also concatenated historical power generation data in the last column based on 352-dimensional data, so the size of the dataset becomes (35,040, 353). The model uses historical data from the past three days to forecast the power generation for the next day. Here, a length of 288 is configured as the input, which corresponds to the size of (288, 353) after representation learning. The output size of the model is (96, 1), which is the power generation for the next day.
From Figure 8, it can be observed that the multivariate feature mixing model includes two stacked Mixer Layer modules, which are the core modules of the multivariate feature mixing model. Figure 9 shows the specific implementation details of the mixer layer in the multivariate feature mixing model. The mixer layer is mainly composed of time mixing and feature mixing, which can fuse the time and feature levels of input data to better forecast time series using more information from the input data. Both the time mixing and feature mixing modules include the MLP module. For time mixing, MLPtime consists of a fully connected layer, a ReLU activation function, and a dropout layer. The input feature map (where rows represent time and columns represent features) is first transposed, and MLPtime is implemented in the time domain and shared among all features. For feature mixing, MLPfeature consists of two fully connected layers, ReLU activation function, and dropout layer. MLPfeature applies to the feature domain and is shared across all time steps. The time mixing and feature mixing modules can automatically adapt to the use of time and cross-variable information, and both modules utilize residual connections to help the model learn deeper data representations while maintaining reasonable computational costs [37].
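A minimal PyTorch sketch of one mixer layer as described above; the hidden width of MLP_feature and the dropout rate are our assumptions, as the text does not state them.

```python
import torch.nn as nn

class MixerLayer(nn.Module):
    """Figure 9: time mixing shared across features, then feature mixing
    shared across time steps, each wrapped in a residual connection."""
    def __init__(self, seq_len=288, n_feats=353, hidden=256, dropout=0.1):
        super().__init__()
        self.time_mlp = nn.Sequential(        # MLP_time: FC + ReLU + dropout
            nn.Linear(seq_len, seq_len), nn.ReLU(), nn.Dropout(dropout))
        self.feat_mlp = nn.Sequential(        # MLP_feature: two FC layers
            nn.Linear(n_feats, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, n_feats), nn.Dropout(dropout))

    def forward(self, x):                     # x: (batch, time, features)
        # Time mixing: transpose so the MLP acts along the time axis.
        x = x + self.time_mlp(x.transpose(1, 2)).transpose(1, 2)
        # Feature mixing: the MLP acts along the feature axis.
        return x + self.feat_mlp(x)
```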
The attention mechanism can automatically identify important parts of the input data, highlight key features through weight allocation, and reduce the interference of redundant information with the subsequent MLP processing. This filtering significantly improves model efficiency, especially for complex inputs such as long sequences and multimodal data [38]. As shown in Figure 9, to enable the model to better extract features of the original data in both the time and feature dimensions, this paper introduces SimAM, a lightweight, parameter-free attention module that has proven effective in the image domain, into both the time mixing and feature mixing paths, adapting its structure to the one-dimensional time series domain.
As indicated in Figure 10, many current attention modules operate along the channel or spatial dimension, generating channel or spatial weights and treating the neurons in each channel or spatial position equally [39]. SimAM can generate channel and spatial weights simultaneously, producing attention weights by calculating the similarity between each pixel in the feature map and its adjacent pixels. Consider an input feature map $X \in \mathbb{R}^{B \times C \times L}$, where B is the batch size, C is the number of channels, and L is the length of the one-dimensional time series feature map. For each pixel $x_i$, as shown in Equation (8), SimAM first reflects similarity indirectly by calculating the mean squared difference between $x_i$ and its neighboring pixels, where $\Omega_i$ is the neighborhood of $x_i$ and N is the number of pixels in that neighborhood. To simplify the calculation, this paper uses the mean of the pixels in the feature map as $x_k$ when computing $s_i$.
$$s_i = \frac{1}{N} \sum_{k \in \Omega_i} \left| x_i - x_k \right|^2 \tag{8}$$
After obtaining $s_i$, the attention weight $w_i$ is calculated using Equation (9), which, like the sigmoid function, projects the weight values into the (0, 1) interval; here $\sigma_i^2$ is the variance used to normalize $s_i$, and $\epsilon$ is set to $1 \times 10^{-4}$. After obtaining the attention weight map, it is multiplied element by element with the original feature map to obtain the feature map output by the attention module.
$$w_i = \frac{1}{1 + \exp\left( -\left( \frac{1}{4} \cdot \frac{s_i}{\sigma_i^2 + \epsilon} - 1 \right) \right)} \tag{9}$$
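A sketch of the adapted one-dimensional SimAM module, assuming the simplified statistics described above (the feature-map mean standing in for $x_k$) and our reconstruction of Equation (9); the module introduces no learnable parameters.

```python
import torch
import torch.nn as nn

class SimAM1D(nn.Module):
    """Parameter-free attention for (batch, channels, length) feature maps."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x):                               # x: (B, C, L)
        mu = x.mean(dim=-1, keepdim=True)               # neighborhood mean as x_k
        s = (x - mu) ** 2                               # Eq. (8), simplified
        var = s.mean(dim=-1, keepdim=True)              # sigma_i^2
        w = torch.sigmoid(s / (4 * (var + self.eps)) - 1.0)  # Eq. (9)-style weight
        return x * w                                    # elementwise reweighting
```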
Figure 11 displays the temporal projection module in the multivariate feature mixing model of this paper. The module first takes the data from the target column of the feature map processed by the mixer layer. After various time mixing, feature mixing, and SimAM attention mechanism modules in the early stage, the target column not only retains the information of historical power generation, but also integrates numerous cross features from other dimensions and spaces. After being applied to the fully connected layer in the time domain, it can learn the time pattern and map the time series from the original input length of 288 to the target forecasting length of 96.
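A sketch of this temporal projection step, assuming the target power series occupies the last column of the mixed feature map, consistent with the dataset layout described in Section 3.2.1.

```python
import torch.nn as nn

class TemporalProjection(nn.Module):
    """Figure 11: map the mixed target column's 288-step history to the
    96-step forecast with a single fully connected layer."""
    def __init__(self, in_len=288, out_len=96):
        super().__init__()
        self.proj = nn.Linear(in_len, out_len)

    def forward(self, x):           # x: (batch, 288, 353) mixed feature map
        target = x[..., -1]         # target column carries the fused information
        return self.proj(target)    # (batch, 96) next-day power forecast
```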

3.2.2. Model Testing and Result Analysis

In order to verify the compatibility between the multivariate feature mixing model and the lightweight representation learning model proposed in this paper, as well as the forecasting performance of the model, we conducted extensive experiments. The training and testing experiments were conducted on an x86-64 Linux operating system equipped with an Intel(R) Xeon(R) Gold 6430 CPU and an NVIDIA A800 SXM4 80 GB GPU, using the PyTorch framework.
We applied the lightweight representation learning model for feature representation and used 70% of the 35,040-sample dataset as the training set, 10% as the validation set, and 20% as the test set to train and evaluate the multivariate feature mixing model and compare it with other mainstream models. At the same time, ablation experiments were designed to verify the contribution of the SimAM attention module to the forecasting performance of the multivariate feature mixing model. Using MAE, MSE, and RMSE as evaluation indicators, Table 4 shows the specific performance of the models.
From Table 4, it can be observed that across the various combinations tested, adding the SimAM attention module to the downstream forecasting model reduces the forecasting error. In particular, the comparison between the third and fourth rows of Table 4 shows that SimAM improves forecasting performance especially markedly when no representation model is applied to the original data; even on data after feature representation, SimAM still enhances forecasting performance to a certain extent. In addition, we compared the original TS2Vec model and the improved TS2Vec model, each combined with our proposed multivariate feature mixing downstream forecasting model, and found that the improved TS2Vec combination performs better, further confirming the effectiveness of the multivariate feature mixing model in this paper. We also list in Table 4 the results of the improved TS2Vec model combined with the SimAM attention module and the advanced lightweight time series forecasting model SparseTSF [40]; this comparison further verifies the advantages of the model scheme used throughout this paper. Table 3 likewise lists the performance of other models on the same dataset, all of which confirm the advantages of the model set out in the present paper. Figure 12 shows the comparison between the forecasted and actual power of the proposed wind power forecasting model on a long-term test set; the trend of the forecasts matches the true values in most intervals. The proposed model thus not only shows clear advantages on the MAE, MSE, and RMSE indicators in Table 4, but also, as Figure 12 shows, maintains a trend consistent with the actual values over a long period, demonstrating its effectiveness and accuracy in wind power forecasting.

4. Discussion and Future Work

This paper proposes an innovative two-stage forecasting framework for wind power forecasting tasks, which achieves improved forecasting performance through a collaborative optimization mechanism of lightweight representation learning and multivariate feature mixing. In the representation learning stage, by designing the efficient spatial pyramid module to reconstruct the multi-scale feature fusion path, the gridding effect caused by traditional dilated convolution is effectively alleviated, and the model parameters and computational complexity are compressed, significantly improving the efficiency of feature extraction. In the feature mixing stage, the TSMixer model achieves deep mining of cross-dimensional interactive features through a linear mixing layer, and combines the SimAM attention mechanism to construct a dynamic weight allocation network, enabling the model to adaptively capture the contribution differences of key time steps. The experimental results show the rationality and effectiveness of the model design. Future research will focus on the fusion modeling of multimodal meteorological data, exploring a hybrid modeling paradigm of physical constraints and data-driven approaches to enhance predictive robustness under extreme weather conditions, and attempting to extend the framework to a wider range of time series forecasting scenarios such as photovoltaic power forecasting and power load forecasting, promoting the development of intelligent scheduling technology for clean energy.

5. Conclusions

This study presents a two-stage wind power forecasting framework integrating lightweight representation learning and multivariate feature mixing, which significantly improves forecasting accuracy through the collaborative optimization of an improved TS2Vec model and a SimAM-TSMixer hybrid architecture. Experimental results show that the proposed model outperforms existing methods across key metrics (MAE, MSE, and RMSE), validating its effectiveness. The efficient spatial pyramid module in the representation learning stage mitigates the gridding effect of dilated convolution while reducing computational costs, whereas the feature mixing stage leverages TSMixer’s cross-dimensional interaction and SimAM’s parameter-free attention to dynamically weight critical time steps. However, limitations remain: (1) The model’s adaptability to high-frequency fluctuating data requires further validation. (2) The linear assumptions in feature mixing may constrain nonlinear modeling under extreme weather conditions. In realistic applications, ensuring data quality and timeliness is crucial, and site-specific parameter tuning is recommended to optimize performance. Additionally, the framework could be expanded to broader time-series forecasting scenarios, such as photovoltaic power and load forecasting, to advance intelligent scheduling technologies for clean energy.

Author Contributions

Conceptualization, C.S.; methodology, C.S.; software, C.S.; validation, C.S.; formal analysis, S.L. and Z.H.; investigation, S.L. and Z.H.; resources, S.L. and Y.Z.; data curation, S.P. and Y.Z.; writing—original draft preparation, S.P.; writing—review and editing, S.P.; visualization, W.Z.; supervision, W.Z.; project administration, J.X.; funding acquisition, J.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Project of State Grid Corporation of China Headquarters, titled “Key Technologies and Applications of Multi temporal and Spatial Scale Prediction of Power Source Load for Public Services Based on Pretrained Models”, grant number 5216A5240018.

Data Availability Statement

The datasets presented in this article are not readily available due to commercial restrictions and data privacy protection.

Conflicts of Interest

Authors Chudong Shan, Shuai Liu, Shuangjian Peng, Zhihong Huang, Yuanjun Zuo, Wenjing Zhang and Jian Xiao were employed by State Grid Hunan Electric Power Company Limited Research Institute.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
SVM: Support Vector Machine
ANN: Artificial Neural Network
DNN: Deep Neural Network
CEEMDAN: Complete Ensemble Empirical Mode Decomposition with Adaptive Noise
EWT: Empirical Wavelet Transform
LSTM: Long Short-Term Memory
ESPNet: Efficient Spatial Pyramid Net
PatchTST: Patch Time Series Transformer
SimAM: Similarity-Aware Activation Module
LLM4TS: Large Language Models for Time Series
LLM: Large Language Model
TSMixer: Time Series Mixer
MLP: Multilayer Perceptron
RNN: Recurrent Neural Network
FLOPs: Floating Point Operations
CPU: Central Processing Unit
GPU: Graphics Processing Unit
MAE: Mean Absolute Error
MSE: Mean Squared Error
RMSE: Root Mean Square Error

References

  1. Nazir, M.S.; Wang, Y.; Bilal, M.; Abdalla, A.N. Wind energy, its application, challenges, and potential environmental impact. In Handbook of Climate Change Mitigation and Adaptation; Springer: Cham, Switzerland, 2022; pp. 899–935. [Google Scholar]
  2. Ahmed, S.D.; Al-Ismail, F.S.; Shafiullah, M.; Al-Sulaiman, F.A.; El-Amin, I.M. Grid integration challenges of wind energy: A review. IEEE Access 2020, 8, 10857–10878. [Google Scholar] [CrossRef]
  3. Tsai, W.-C.; Hong, C.-M.; Tu, C.-S.; Lin, W.-M.; Chen, C.-H. A review of modern wind power generation forecasting technologies. Sustainability 2023, 15, 10757. [Google Scholar] [CrossRef]
  4. Qiao, Y.; Lu, Z.; Min, Y. Research & application of raising wind power prediction accuracy. Power Syst. Technol. 2017, 41, 3261–3268. [Google Scholar]
  5. Zheng, Y.; Ge, Y.; Muhsen, S.; Wang, S.; Elkamchouchi, D.H.; Ali, E.; Ali, H.E. New ridge regression, artificial neural networks and support vector machine for wind speed prediction. Adv. Eng. Softw. 2023, 179, 103426. [Google Scholar] [CrossRef]
  6. Ateş, K.T. Estimation of short-term power of wind turbines using artificial neural network (ANN) and swarm intelligence. Sustainability 2023, 15, 13572. [Google Scholar] [CrossRef]
  7. Tarek, Z.; Shams, M.Y.; Elshewey, A.M.; El-kenawy, E.-S.M.; Ibrahim, A.; Abdelhamid, A.A.; El-dosuky, M.A. Wind Power Prediction Based on Machine Learning and Deep Learning Models. Comput. Mater. Contin. 2022, 74, 715–732. [Google Scholar] [CrossRef]
  8. Karijadi, I.; Chou, S.-Y.; Dewabharata, A. Wind power forecasting based on hybrid CEEMDAN-EWT deep learning method. Renew. Energy 2023, 218, 119357. [Google Scholar] [CrossRef]
  9. Chen, Y.; Zhao, H.; Zhou, R.; Xu, P.; Zhang, K.; Dai, Y.; Zhang, H.; Zhang, J.; Gao, T. CNN-BiLSTM short-term wind power forecasting method based on feature selection. IEEE J. Radio Freq. Identif. 2022, 6, 922–927. [Google Scholar] [CrossRef]
  10. Yousaf, S.; Bradshaw, C.R.; Kamalapurkar, R.; San, O. Investigating critical model input features for unitary air conditioning equipment. Energy Build. 2023, 284, 112823. [Google Scholar] [CrossRef]
  11. Yousaf, S.; Bradshaw, C.R.; Kamalapurkar, R.; San, O. A gray-box model for unitary air conditioners developed with symbolic regression. Int. J. Refrig. 2024, 168, 696–707. [Google Scholar] [CrossRef]
  12. Yue, Z.; Wang, Y.; Duan, J.; Yang, T.; Huang, C.; Tong, Y.; Xu, B. TS2Vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; pp. 8980–8987. [Google Scholar]
  13. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 552–568. [Google Scholar]
  14. Qiu, X.; Cheng, H.; Wu, X.; Hu, J.; Guo, C.; Yang, B. A comprehensive survey of deep learning for multivariate time series forecasting: A channel strategy perspective. arXiv 2025, arXiv:2502.10721. [Google Scholar]
  15. Sørensen, M.L.; Nystrup, P.; Bjerregård, M.B.; Møller, J.K.; Bacher, P.; Madsen, H. Recent developments in multivariate wind and solar power forecasting. Wiley Interdiscip. Rev. Energy Environ. 2023, 12, e465. [Google Scholar] [CrossRef]
  16. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  17. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 11121–11128. [Google Scholar] [CrossRef]
  18. Lin, S.; Lin, W.; Hu, X.; Wu, W.; Mo, R.; Zhong, H. Cyclenet: Enhancing time series forecasting through modeling periodic patterns. Adv. Neural Inf. Process. Syst. 2024, 37, 106315–106345. [Google Scholar]
  19. Han, L.; Ye, H.-J.; Zhan, D.-C. The capacity and robustness trade-off: Revisiting the channel independent strategy for multivariate time series forecasting. IEEE Trans. Knowl. Data Eng. 2024, 36, 7129–7142. [Google Scholar] [CrossRef]
  20. Chang, C.; Wang, W.-Y.; Peng, W.-C.; Chen, T.-F. Llm4ts: Aligning pre-trained llms as data-efficient time-series forecasters. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–20. [Google Scholar] [CrossRef]
  21. Jin, M.; Wang, S.; Ma, L.; Chu, Z.; Zhang, J.Y.; Shi, X.; Chen, P.-Y.; Liang, Y.; Li, Y.-F.; Pan, S. Time-llm: Time series forecasting by reprogramming large language models. arXiv 2023, arXiv:2310.01728. [Google Scholar]
  22. Ansari, A.F.; Stella, L.; Turkmen, C.; Zhang, X.; Mercado, P.; Shen, H.; Shchur, O.; Rangapuram, S.S.; Arango, S.P.; Kapoor, S. Chronos: Learning the language of time series. arXiv 2024, arXiv:2403.07815. [Google Scholar]
  23. De Caro, F.; De Stefani, J.; Vaccaro, A.; Bontempi, G. DAFT-E: Feature-based multivariate and multi-step-ahead wind power forecasting. IEEE Trans. Sustain. Energy 2021, 13, 1199–1209. [Google Scholar] [CrossRef]
  24. Chen, S.-A.; Li, C.-L.; Yoder, N.; Arik, S.O.; Pfister, T. Tsmixer: An all-mlp architecture for time series forecasting. arXiv 2023, arXiv:2303.06053. [Google Scholar]
  25. Zheng, X.; Chen, X.; Schürch, M.; Mollaysa, A.; Allam, A.; Krauthammer, M. Simts: Rethinking contrastive representation learning for time series forecasting. arXiv 2023, arXiv:2303.18205. [Google Scholar]
  26. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  27. Shan, C.; Yuan, Z.; Qiu, Z.; He, Z.; An, P. A Dual Arrhythmia Classification Algorithm Based on Deep Learning and Attention Mechanism Incorporating Morphological-temporal Information. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
  28. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar]
  29. Blu, T.; Thévenaz, P.; Unser, M. Linear interpolation revitalized. IEEE Trans. Image Process. 2004, 13, 710–719. [Google Scholar] [CrossRef] [PubMed]
  30. Ahsan, M.M.; Mahmud, M.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of data scaling methods on machine learning algorithms and model performance. Technologies 2021, 9, 52. [Google Scholar] [CrossRef]
  31. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
  32. Cerqueira, V.; Torgo, L.; Mozetič, I. Evaluating time series forecasting models: An empirical study on performance estimation methods. Mach. Learn. 2020, 109, 1997–2028. [Google Scholar] [CrossRef]
  33. Abnar, S.; Shah, H.; Busbridge, D.; Ali, A.M.E.; Susskind, J.; Thilak, V. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models. arXiv 2025, arXiv:2501.12370. [Google Scholar]
  34. Popescu, M.-C.; Balas, V.E.; Perescu-Popescu, L.; Mastorakis, N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 2009, 8, 579–588. [Google Scholar]
  35. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2. [Google Scholar]
  36. Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437. [Google Scholar]
  37. Zhang, K.; Sun, M.; Han, T.X.; Yuan, X.; Guo, L.; Liu, T. Residual networks of residual networks: Multilevel residual networks. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 1303–1314. [Google Scholar] [CrossRef]
  38. Hollis, T.; Viscardi, A.; Yi, S.E. A comparison of LSTMs and attention mechanisms for forecasting financial time series. arXiv 2018, arXiv:1812.07699. [Google Scholar]
  39. Brauwers, G.; Frasincar, F. A general survey on attention mechanisms in deep learning. IEEE Trans. Knowl. Data Eng. 2021, 35, 3279–3298. [Google Scholar] [CrossRef]
  40. Lin, S.; Lin, W.; Wu, W.; Chen, H.; Yang, J. Sparsetsf: Modeling long-term time series forecasting with 1k parameters. arXiv 2024, arXiv:2405.00946. [Google Scholar]
Figure 1. Original TS2Vec model schemes.
Figure 2. Structure of the lightweight representation learning model.
Figure 3. Details of LSTM during model testing process.
Figure 4. Comparison between improved TS2Vec combined with LSTM and MLP.
Figure 5. Comparison between improved TS2Vec combined with LSTM and RNN.
Figure 6. Comparison between improved TS2Vec combined with LSTM and N-BEATS.
Figure 7. Comparison between improved TS2Vec combined with LSTM and original LSTM.
Figure 8. The overall structure of the multivariate feature mixing model.
Figure 9. The specific implementation details of the mixer layer.
Figure 10. Comparison of different types of attention mechanisms for time series.
Figure 11. The specific implementation details of the temporal projection module.
Figure 12. Comparison between the forecasted results and actual wind power of our proposed model on a long-term test set.
Table 1. Comparison of performance testing between the original TS2Vec model and the improved TS2Vec model.

Model | MAE | MSE | RMSE
Original TS2Vec + LSTM | 0.6438 | 0.7305 | 0.8547
Improved TS2Vec + LSTM | 0.6228 | 0.7268 | 0.8525
Table 2. Comparison of model size between the original TS2Vec model and the improved TS2Vec model.

Model | FLOPs | Params
Original TS2Vec + LSTM | 4.9659 G | 0.7110 M
Improved TS2Vec + LSTM | 0.4286 G | 0.0609 M
Table 3. Comparison of performance testing between the improved TS2Vec combined with LSTM model and other mainstream deep learning models.

Model | MAE | MSE | RMSE
MLP [34] | 0.7178 | 0.9589 | 0.9793
RNN [35] | 0.7538 | 0.8680 | 0.9317
N-BEATS [36] | 0.7001 | 0.8787 | 0.9374
LSTM | 0.7380 | 0.9790 | 0.9895
Original TS2Vec + LSTM | 0.6438 | 0.7305 | 0.8547
Improved TS2Vec + LSTM | 0.6228 | 0.7268 | 0.8525
Table 4. Comparison of wind power forecasting performance on various combinations.

Model | MAE | MSE | RMSE
Improved TS2Vec + LSTM | 0.6228 | 0.7268 | 0.8525
Improved TS2Vec + SimAM + SparseTSF [40] | 0.6674 | 0.7202 | 0.8486
TSMixer | 0.7053 | 0.8167 | 0.9037
SimAM + TSMixer | 0.6673 | 0.7423 | 0.8615
Original TS2Vec + TSMixer | 0.5404 | 0.4889 | 0.6992
Original TS2Vec + SimAM + TSMixer | 0.5334 | 0.4721 | 0.6871
Improved TS2Vec + TSMixer | 0.3780 | 0.2477 | 0.4977
Improved TS2Vec + SimAM + TSMixer | 0.3735 | 0.2434 | 0.4934
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

