1. Introduction
The issue of climate warming is currently a global concern, and one of the major contributors to the problem is the use of fossil fuels. In recent years, countries around the world have been actively promoting a clean and low-carbon energy transition, forming a new global energy pattern by formulating energy transition policies and increasing investment in renewable energy and other programs. In this context, China has taken a series of measures to promote a low-carbon transition in its energy supply system. It is foreseeable that as this energy transition deepens, wind power will keep developing rapidly and become an important part of China’s new power system construction. However, because wind power is affected by wind speed, temperature, and other environmental factors and exhibits high randomness and fluctuation, large-scale wind power installation presents great challenges to the safe and stable operation of the power system. Thus, in order to improve wind power consumption capacity and realize the optimal dispatch of the power system, it is important to improve the accuracy of short-term wind power forecasting. Current wind power forecasting models are, on the one hand, constrained by the source data of wind farms and tend to ignore the influence of various environmental factors on wind power, so that the multivariate series of environmental factors are not used effectively. On the other hand, due to the nonlinear variation of the wind power and multivariate environmental information series [1,2,3], the convergence of forecasting models gradually slows down, and over-fitting problems can occur as the number of input variables increases.
The main current wind power forecasting methods are the physical, statistical, and machine learning methods. The physical method involves the construction of a physical model to forecast from meteorological data and the surface information around the wind farm [4], represented by NWP (Numerical Weather Prediction). The main statistical methods are the time series method [5], the gray forecasting method [6], the autoregressive moving average method [7], and a few others; these construct a nonlinear relationship between historical data and wind power data to make forecasts. The main machine learning methods are neural networks [8], support vector machines [9], extreme learning machines [10], etc. Because neural networks can mine the nonlinear relationships and deep features in training data, they generally have better forecasting performance [11,12], and are now widely used in forecasting for wind power generation [13]. Comparisons in the literature [14,15,16,17] between LSTM (Long Short-Term Memory) networks and other forecasting models have shown LSTM models to be better for both long-term and short-term forecasting; however, the LSTM model suffers from high model complexity and long training time. Thus, proper data processing can enhance the learning effect of the LSTM model. In [18], the authors proposed a PCA (Principal Component Analysis)-based LSTM forecasting model which was able to effectively reduce the dimensionality of the input variables of the LSTM. In [19], the authors proposed an LSTM wind power forecasting method based on RR (Robust Regression) and VMD (Variational Mode Decomposition) which improved LSTM forecasting accuracy by decomposing the wind power series to eliminate noise. In [20], the authors proposed a multivariate LSTM algorithm, performed dimensionality reduction on the original data using wavelet noise reduction, then reconstructed the pre-selected input data using chaos analysis and the classification forecast tree method, showing that dimensionality reduction can effectively improve LSTM forecasting accuracy. Both [21,22] improved the LSTM model by introducing an attention mechanism, which allocates limited computational resources to the more important tasks while alleviating the information overload problem. In [23,24], the authors similarly introduced an attention mechanism into the CNN-LSTM forecasting model to improve forecasting accuracy. The above examples illustrate that improving the quality of input data is beneficial to the forecasting accuracy of the LSTM model.
To improve the accuracy of wind power forecasting, the data quality first needs to be improved from several perspectives, including data decomposition and dimensionality reduction. Wind power series and multivariate environmental factor series are often non-stationary [25], and forecasting the original series directly may result in large errors. Decomposing the signal of the original series can improve forecasting accuracy while reducing the complexity of the data [26,27]. The most commonly used signal decomposition methods are the FFT (Fast Fourier Transform), WT (Wavelet Transform), and EMD (Empirical Mode Decomposition). In [28,29], the authors optimized the signal time-domain waveform using the FFT, although they did not consider local time-frequency features. In [30,31], the authors used the Wavelet Transform to decompose the wind speed time series, which has the drawback of requiring manual parameter setting. Compared to the two methods mentioned above, EMD [32] can adaptively identify the features of the input data with multi-resolution. In [33], the authors proposed an EMD-MODBN (Multi-objective Deep Belief Network) model, and their experimental results proved that decomposing data with EMD helps neural networks capture the potential connections in the data. In [34], the authors decomposed the wind speed series using EMD to remove noise from the original data, leading to improved forecasting accuracy. In [35], the authors decomposed the residential load signal using EMD; their results showed that series with more peaks are decomposed more effectively by EMD. The above examples illustrate the superiority of the EMD algorithm in data decomposition. Thus, in this paper, we chose to decompose the environmental factor series using the EMD algorithm.
In addition to the accuracy and stationarity of the data, the dimensionality plays an important role in the forecasting accuracy of the model, and introducing a dimensionality reduction algorithm into the forecasting model can improve its computational efficiency. PCA (Principal Component Analysis) is an effective method for reducing the dimensionality of data: it analyzes the covariance structure of a multivariate data series, calculates the contribution of each series, and selects the primary series to be retained [36,37,38]. In [39], the authors used PCA based on high-frequency data to forecast the high-dimensional covariance matrix, which can well characterize the long-memory behavior of realized eigenvalue series and be easily estimated by OLS. In [40], the authors used the PCA-BP method to filter the model input data, eliminating redundant and irrelevant information, reducing the complexity of the model, and improving the forecasting performance of the subsequent model. In [41], the authors preprocessed the data using PCA in order to achieve dimensionality reduction, which reduced the computation time of the subsequent model. RF (Random Forest) is a supervised learning algorithm which likewise represents an effective method of dimensionality reduction. In [42], the authors used the RF algorithm to rank the importance of individual features based on the Gini index for feature filtering, and experimental results showed that this feature importance ranking was able to filter features and improve the accuracy of subsequent models. In [43], the authors compared a multiple linear regression model with a random forest model and found that the RF model was far superior in terms of both goodness-of-fit and other evaluation indicators. In [44], the authors selected features via the RF algorithm, and the experimental results showed that random forest feature selection can effectively reduce data redundancy, eliminate features with poor differentiability, and improve the recognition performance of the model. In [45,46,47,48], the authors computed feature importance with RF algorithms in different fields, in each case improving the subsequent model. Thus, in this paper, we chose to reduce the dimensionality of the EMD-decomposed series using a combined PCA-RF algorithm.
To date, wind power forecasting based on data decomposition and data dimensionality reduction has been studied by several scholars. In [49], the authors used CEEMDAN (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise) to divide the volatility into several fluctuation components with different frequency characteristics; their experimental results demonstrated the advantages of data decomposition. In [50], the authors used VMD to decompose wind speed into a nonlinear part, a linear part, and a noise part, and their experimental results showed that the proposed model accurately reflected the characteristics of the wind speed. In [51], the authors extracted principal components from high-dimensional raw data using PCA as input variables for subsequent models; their experimental results demonstrated improved forecasting accuracy. In [52], the authors processed the input variables through VMD decomposition and PCA dimensionality reduction, and the experimental results showed that the proposed method had higher forecasting accuracy than traditional methods. These examples show that both data decomposition and data dimensionality reduction are widely used in the field of wind power forecasting, although few studies combine the two.
In summary, to solve the problems of inaccurate feature identification and slow convergence in traditional wind power forecasting models, this paper makes full use of the environmental factor series that affect wind power, mines the features of wind power and environmental factors over time, and constructs an EMD-PCA-RF-LSTM forecasting model to improve the forecasting accuracy of wind power, thereby providing technical support for the safe dispatch of power systems and enhancing the power system’s capacity to consume wind power. First, the environmental factors are filtered by the LASSO (Least Absolute Shrinkage and Selection Operator) algorithm. Second, the environmental factor series are decomposed by the EMD algorithm to reduce non-stationarity. Third, the key influencing factor series are extracted and feature importance is calculated by the PCA-RF algorithm for further feature extraction. Finally, dynamic time modeling of the multivariate feature series is performed by the LSTM algorithm to forecast the wind power. A comparative analysis including a traditional BP neural network, SVM, EMD-PCA-RF-BP, EMD-PCA-RF-SVM, a single LSTM, EMD-PCA-LSTM, and EMD-RF-LSTM showed that the combined forecasting model proposed in this paper had the highest accuracy.
The rest of this paper is arranged as follows: the relevant theories of each algorithm are described in Section 2; data preprocessing, the design of the combined model, and the evaluation indicators are described in Section 3; in Section 4, the performance of the proposed model is analyzed according to the results of five combined models and three benchmark models on a case study; finally, the conclusions of this study are presented in Section 5.
2. Methods
2.1. LASSO Algorithm
The LASSO algorithm, first proposed by Robert Tibshirani [53], is a regression analysis method that enables simultaneous variable selection and parameter estimation. The LASSO method compresses the regression coefficients by constructing a penalty function; coefficients with small absolute values are compressed to zero, thus achieving variable selection. The process is as follows.
A general linear regression model can be expressed by the following formula:

${y}_{i}=\alpha +{\beta}_{1}{x}_{i1}+{\beta}_{2}{x}_{i2}+\cdots +{\beta}_{p}{x}_{ip}+{\epsilon}_{i}$ (1)

In Formula (1), ${y}_{i}$ is the dependent variable, $i=1,2,\cdots ,n$; $\alpha $ is a constant term; ${\beta}_{1},{\beta}_{2},\cdots ,{\beta}_{p}$ are the regression parameters; ${x}_{ij}$ is the independent variable that affects the dependent variable; and ${\epsilon}_{i}$ is a random error term following a normal distribution with zero mean.
The LASSO estimator $\left(\widehat{\alpha},\widehat{\beta}\right)$ can be expressed by the following formula:

$\left(\widehat{\alpha},\widehat{\beta}\right)=\mathrm{argmin}{\displaystyle \sum _{i=1}^{n}{\left({y}_{i}-\alpha -{\displaystyle \sum _{j=1}^{p}{\beta}_{j}{x}_{ij}}\right)}^{2}}\text{, subject to }{\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}\le t$ (2)

In Formula (2), $t\ge 0$ is an adjustable parameter. For all values of $t$, the estimate of $\alpha $ satisfies $\widehat{\alpha}=\overline{y}$. Without loss of generality, assume that $\overline{y}=0$; then, $\widehat{\alpha}=0$ and Formula (1) can be written as follows:

${y}_{i}={\displaystyle \sum _{j=1}^{p}{\beta}_{j}{x}_{ij}}+{\epsilon}_{i}$ (3)
Let ${t}_{0}={\displaystyle \sum _{j=1}^{p}\left|{\widehat{\beta}}_{j}^{0}\right|}$, where ${\widehat{\beta}}_{j}^{0}$ is the least squares estimate of the regression coefficient. When $t<{t}_{0}$, the regression coefficients are compressed, and some of them converge to 0 or even equal 0. The constrained problem in Formula (3) can then be written in the equivalent penalized form:

$\widehat{\beta}=\mathrm{argmin}\left\{{\displaystyle \sum _{i=1}^{n}{\left({y}_{i}-{\displaystyle \sum _{j=1}^{p}{\beta}_{j}{x}_{ij}}\right)}^{2}}+\lambda {\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}\right\}$ (4)
where $\sum _{i=1}^{n}{\left({y}_{i}-{\displaystyle \sum _{j=1}^{p}{\beta}_{j}{x}_{ij}}\right)}^{2}$ is the loss function reflecting the fitting effect of the model, $\lambda {\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}$ is the penalty function reflecting how strongly the model compresses the coefficients, and $\lambda \in \left[0,\infty \right)$ is the penalty parameter determining the compression strength. As $\lambda $ increases, the coefficients of the variables in the model are gradually compressed; when $\lambda $ reaches a certain value, the coefficients of certain variables are compressed to zero, achieving variable selection.
In this paper, K-fold Cross-Validation is used to determine the penalty parameter, $\lambda $. The sample datasets are divided into $K$ subsets. One subset is used as the dataset for validating the model, while the other $K-1$ subsets are used to construct the model. The cross-validation repeats $K$ times, with each subset validated once. The process is as follows.
Step 1. Divide the sample datasets $T$ into $K$ subsets, $T=\left\{{T}_{1},{T}_{2},\cdots ,{T}_{K}\right\}$.
Step 2. Use one of the subsets as the validation set and the remaining $K-1$ subsets as the training set, that is, train $K$ times and validate $K$ times.
Step 3. For each penalty parameter $\lambda $, use the training set to find the estimator ${\widehat{\beta}}^{\left(k\right)}\left(\lambda \right)$ of $\beta $ with the $k$-th subset held out; the statistic for cross-validation is

$CV\left(\lambda \right)={\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{i\in {T}_{k}}{\left({y}_{i}-{\displaystyle \sum _{j=1}^{p}{\widehat{\beta}}_{j}^{\left(k\right)}\left(\lambda \right){x}_{ij}}\right)}^{2}}}$
Step 4. The optimal penalty parameter is obtained by minimizing $CV\left(\lambda \right)$:

$\widehat{\lambda}=\underset{\lambda}{\mathrm{argmin}}\text{ }CV\left(\lambda \right)$
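As a concrete illustration of the shrinkage behavior described above, the following pure-Python sketch fits a LASSO model by cyclic coordinate descent with soft-thresholding. It is a minimal sketch under simplifying assumptions (standardized predictors, no intercept, toy data), not the implementation used in this paper; in practice, a library routine such as scikit-learn's `LassoCV` would combine this with the K-fold search for the penalty parameter.

```python
# Minimal LASSO sketch via cyclic coordinate descent with soft-thresholding.
# The toy data and helper names are illustrative, not from the paper.

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward 0 by lam, clipping to 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=200):
    """Minimize sum((y - X.beta)^2) + lam * sum(|beta_j|).
    Assumes the columns of X are standardized and the intercept is removed."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            z = sum(X[i][j] ** 2 for i in range(n))
            # Coordinate update: larger lam shrinks more coefficients to 0
            beta[j] = soft_threshold(rho, lam / 2) / z
    return beta
```

With `lam = 0` this reduces to ordinary least squares; as `lam` grows, coefficients of weakly informative variables hit exactly zero, which is the variable-selection effect described above.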
2.2. EMD Algorithm
In this paper, the EMD algorithm [34] is used to decompose the environmental factor series. The EMD algorithm can be used to obtain the local features of the environmental factor series that affect the wind power at different time scales, which results in a more detailed, though larger, set of data series.
EMD is a data-driven, adaptive decomposition method for nonlinear time-varying signals that was developed from the ideas of Fourier analysis and Wavelet analysis, and it is applicable for smoothing nonlinear, non-stationary signals in a step-by-step process. The filtering process of the EMD algorithm decomposes complex time series data into a finite number of Intrinsic Mode Functions (IMFs); these decomposed IMFs contain the fluctuation information of the original data on different time scales [54]. The process is as follows.
Step 1. For any signal $x\left(t\right)$ to be processed, determine its local maxima and minima, construct the upper and lower envelopes, and record the difference between the signal $x\left(t\right)$ and the mean ${m}_{1}\left(t\right)$ of the upper and lower envelopes as ${h}_{1}\left(t\right)$, which can be expressed as follows:

${h}_{1}\left(t\right)=x\left(t\right)-{m}_{1}\left(t\right)$
Step 2. Repeat the above process until ${h}_{1}\left(t\right)$ satisfies the IMF conditions; this first-stage IMF filtered from the original signal usually contains the highest-frequency component of the signal. Separate ${h}_{1}\left(t\right)$ from $x\left(t\right)$ to obtain the difference signal ${r}_{1}\left(t\right)$ with the high-frequency component removed:

${r}_{1}\left(t\right)=x\left(t\right)-{h}_{1}\left(t\right)$

Repeat the above filtering steps with ${r}_{1}\left(t\right)$ as the new signal until the residual signal ${r}_{n}\left(t\right)$ of the $n$-th stage is a monotonic function from which no further IMF can be filtered out.
According to the decomposition algorithm, $x\left(t\right)$ can be expressed as the sum of $n$ IMFs and a single residual, with the following expression:

$x\left(t\right)={\displaystyle \sum _{j=1}^{n}{h}_{j}\left(t\right)}+{r}_{n}\left(t\right)$ (9)
In Formula (9), ${r}_{n}\left(t\right)$ is the residual, indicating the average trend in the signal, while ${h}_{j}\left(t\right)$ is the j-th IMF, j = 1, 2,…, n, which denote the different components of the signal from high to low frequencies, respectively.
The end of the filtering process is mainly determined by a Cauchy-like convergence criterion [55]; the standard deviation (SD) between two consecutive sifting results is usually set from 0.2 to 0.3, with the following expression:

$SD={\displaystyle \sum _{t=0}^{T}\frac{{\left|{h}_{k-1}\left(t\right)-{h}_{k}\left(t\right)\right|}^{2}}{{h}_{k-1}^{2}\left(t\right)}}$
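The stopping rule above can be sketched directly: the following pure-Python snippet computes the SD between two consecutive sifting results and checks it against a threshold. This is an illustrative sketch only (the envelope-fitting part of sifting is omitted, and the small `eps` guard against division by zero is our addition), not a full EMD implementation.

```python
# Sketch of the Cauchy-like stopping criterion that ends EMD sifting:
# sifting stops once the standard deviation (SD) between two consecutive
# sifting results h_{k-1}(t) and h_k(t) falls below a threshold,
# typically chosen in the 0.2-0.3 range.

def sifting_sd(h_prev, h_curr, eps=1e-12):
    """SD = sum_t |h_{k-1}(t) - h_k(t)|^2 / h_{k-1}(t)^2 over the samples."""
    return sum((a - b) ** 2 / (a ** 2 + eps) for a, b in zip(h_prev, h_curr))

def should_stop(h_prev, h_curr, threshold=0.3):
    """Stop sifting once SD drops below the chosen threshold."""
    return sifting_sd(h_prev, h_curr) < threshold
```

When two successive sifting results barely change, the SD is near zero and sifting stops; a large change keeps the sifting loop running.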
2.3. PCA-RF Combined Algorithm
In this paper, the PCA-RF combined algorithm is used to reduce the feature dimensions without losing the original data information. The results are then used as the input of the LSTM forecasting model in order to improve the calculation efficiency and accuracy of feature extraction.
2.3.1. PCA Algorithm
The data series obtained by EMD decomposition enriches the set of feature series, although the dimensionality of the input variables increases as well. PCA, first proposed by Karl Pearson in 1901, has been widely used to reduce the number of feature vectors and thus achieve dimensionality reduction of the data. It can be used to reduce the computational load of artificial neural networks and increase their calculation speed. The process is as follows [56].
Assume an original dataset $X=\left\{{x}_{11},{x}_{12},\cdots ,{x}_{ij},\cdots ,{x}_{mn}\right\}$, where $i$ is the time observation point and $j$ indexes the environmental factor series constituting the dataset matrix.
Step 1. Normalize the data series to obtain the normalization matrix ${X}^{\ast}$.
Step 2. Use a linear transformation to obtain the covariance matrix $R$:

$R=\frac{1}{n-1}{\left({X}^{\ast}\right)}^{T}{X}^{\ast}$
Step 3. Solve $\left|\lambda I-R\right|=0$ to obtain the eigenvalues ${\lambda}_{1}\ge {\lambda}_{2}\ge \cdots \ge {\lambda}_{n}$; finally, calculate the accumulated contribution rate ${\beta}_{i}$:

${\beta}_{i}={\displaystyle \sum _{j=1}^{i}{\lambda}_{j}}\Big/{\displaystyle \sum _{j=1}^{n}{\lambda}_{j}}$
Usually, the first $k$ principal components, with an accumulated contribution of 75% to 95%, are able to contain most of the information that the $n$ original variables can provide.
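To make Steps 1–3 concrete, the following pure-Python sketch runs them for two series: it standardizes the data, forms the covariance (here correlation) matrix $R$, obtains its eigenvalues in closed form for the 2-by-2 case, and computes the accumulated contribution rates. The toy series and function names are illustrative assumptions, not data from this paper.

```python
import math

# PCA sketch for two environmental-factor series: standardize, build R,
# take eigenvalues (closed form for the 2x2 correlation matrix), and
# compute the accumulated contribution rates beta_i.

def standardize(series):
    """Center and scale a series (sample standard deviation)."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / (n - 1))
    return [(v - mean) / sd for v in series]

def pca_contributions(s1, s2):
    x1, x2 = standardize(s1), standardize(s2)
    n = len(x1)
    # Off-diagonal of R for standardized data is the correlation r;
    # the diagonal entries are 1.
    r = sum(a * b for a, b in zip(x1, x2)) / (n - 1)
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|.
    lams = sorted([1 + abs(r), 1 - abs(r)], reverse=True)
    total = sum(lams)
    # Accumulated contribution rate: beta_i = sum_{j<=i} lam_j / sum_j lam_j
    acc, betas = 0.0, []
    for lam in lams:
        acc += lam
        betas.append(acc / total)
    return lams, betas
```

For two perfectly correlated series, the first eigenvalue carries the full variance, so the first accumulated contribution rate is already 1.0 and a single principal component suffices, mirroring the 75–95% selection rule described above.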
2.3.2. RF Algorithm
Following PCA dimensionality reduction, the optimal feature series are selected via RF by evaluating the feature importance of each series. The RF algorithm, first proposed by Breiman in 2001 [57], is an ensemble algorithm based on decision trees. The sampling method of RF leaves about 1/3 of the samples unselected; this part of the sample becomes the out-of-bag (OOB) data.
RF obtains the feature importance by perturbing the out-of-bag data that are not involved in decision tree training and calculating the resulting difference in classification accuracy. The process is as follows [44].
Step 1. RF carries out Bootstrap sampling by taking $K$ sample datasets to generate $K$ decision trees, each of which is generated independently.
Step 2. Let $k=1$ to train decision tree ${T}_{k}$. The training input is the $k$-th dataset; calculate the accuracy, ${L}_{k}$, for the $k$-th out-of-bag data.
Step 3. Rearrange the feature, $f$, in the out-of-bag dataset and calculate the accuracy, ${L}_{k}^{f}$.
Step 4. Repeat Steps 2 and 3 for all sample datasets $k=2,3,\cdots ,K$.
Step 5. Calculate the classification accuracy error after feature rearrangement for each tree, which can be expressed as

${e}_{k}^{f}={L}_{k}-{L}_{k}^{f}$ (15)

Step 6. From Formula (15), we can obtain the average influence of feature $f$ on the accuracy of the out-of-bag data:

${e}^{f}=\frac{1}{K}{\displaystyle \sum _{k=1}^{K}{e}_{k}^{f}}$ (16)

where the variance of the ${e}_{k}^{f}$ is as follows:

${\left({\sigma}^{f}\right)}^{2}=\frac{1}{K-1}{\displaystyle \sum _{k=1}^{K}{\left({e}_{k}^{f}-{e}^{f}\right)}^{2}}$ (17)

Step 7. From Formulas (16) and (17), the importance of a feature $f$ can be calculated:

${f}_{VI}=\frac{{e}^{f}}{{\sigma}^{f}}$ (18)

Step 8. From Formula (18), the ${f}_{VI}$ of all the features can be obtained.
To select the optimal feature subset, candidate subsets are generated by removing one feature at a time from the sorted feature set; the accuracy of each subset is calculated, and the subset with the highest accuracy is finally selected as the optimal feature set.
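The out-of-bag permutation procedure in Steps 2–5 can be sketched as follows. Here the "trees" are stand-in predictor functions and the toy data are illustrative assumptions (a real Random Forest would use trained decision trees); the snippet only demonstrates the accuracy-drop computation that yields the average influence of a feature.

```python
import random

# Permutation-importance sketch: for each tree k, compare out-of-bag
# accuracy L_k before and after shuffling one feature column, then
# average the per-tree drops e_k^f = L_k - L_k^f.

def accuracy(model, rows, labels):
    """Fraction of out-of-bag rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(models, oob_sets, feature_idx, seed=0):
    rng = random.Random(seed)
    drops = []
    for model, (rows, labels) in zip(models, oob_sets):
        l_k = accuracy(model, rows, labels)           # L_k
        shuffled = [list(r) for r in rows]
        col = [r[feature_idx] for r in shuffled]
        rng.shuffle(col)                              # rearrange feature f
        for r, v in zip(shuffled, col):
            r[feature_idx] = v
        l_k_f = accuracy(model, shuffled, labels)     # L_k^f
        drops.append(l_k - l_k_f)                     # per-tree drop
    return sum(drops) / len(drops)                    # mean accuracy drop
```

Shuffling a feature the predictor ignores leaves accuracy unchanged (drop of 0), while shuffling an informative feature tends to degrade accuracy; ranking features by this mean drop gives the importance ordering used for feature elimination.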
2.4. LSTM Algorithm
A Long Short-Term Memory (LSTM) network is an extension of a Recurrent Neural Network (RNN) [58]. The LSTM network contains one input layer, one output layer, and several hidden layers; its structure is shown in Figure 1. The key to the LSTM network is the cell state; ${C}_{t-1}$ and ${C}_{t}$ are the old and new states of the cell, respectively. In an LSTM, the flow of information into and out of the cell state is controlled by the gate structure, which allows for selective passage of information. A gate consists of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid is a nonlinear activation function contained in the gate structure; the gate output ranges from 0 to 1 and defines the degree to which information passes through. The tanh layer in Figure 1 is an activation function that maps a real number input to the range [−1, 1].
The cell of an LSTM includes an input gate, an output gate, and a forget gate, which together control the information flow. In the following formulas, ${i}_{t}$, ${o}_{t}$, and ${f}_{t}$ denote the state values of the input gate, output gate, and forget gate, respectively [24].
Step 1. The sigmoid layer of the forget gate decides which information is forgotten from the old cell state ${C}_{t-1}$, based on the input ${x}_{t}$ of the current step and the output ${h}_{t-1}$ of the previous step; the forget gate output is as follows:

${f}_{t}=\sigma \left({W}_{1}^{f}{x}_{t}+{W}_{h}^{f}{h}_{t-1}+{b}_{f}\right)$ (19)
Step 2. To generate the information that needs to be updated and stored in the cell state, the input gate first produces the update signal ${i}_{t}$ through its sigmoid layer; the tanh layer then generates the new candidate value ${\tilde{C}}_{t}$ to be added to the cell state:

${i}_{t}=\sigma \left({W}_{1}^{i}{x}_{t}+{W}_{h}^{i}{h}_{t-1}+{b}_{i}\right)$ (20)

${\tilde{C}}_{t}=\mathrm{tanh}\left({W}_{1}^{C}{x}_{t}+{W}_{h}^{C}{h}_{t-1}+{b}_{C}\right)$ (21)

The new cell state ${C}_{t}$ is obtained by multiplying the old cell state by ${f}_{t}$ to forget the unwanted information and adding the new candidate information ${i}_{t}\cdot {\tilde{C}}_{t}$:

${C}_{t}={f}_{t}\cdot {C}_{t-1}+{i}_{t}\cdot {\tilde{C}}_{t}$ (22)
Step 3. To determine the information emitted by the output gate, the initial output is first obtained through its sigmoid layer; the cell state is then scaled to [−1, 1] by the tanh layer and multiplied pointwise with the sigmoid output to obtain the output ${h}_{t}$:

${o}_{t}=\sigma \left({W}_{1}^{o}{x}_{t}+{W}_{h}^{o}{h}_{t-1}+{b}_{o}\right),\text{ }{h}_{t}={o}_{t}\cdot \mathrm{tanh}\left({C}_{t}\right)$ (23)
In Formulas (19)–(23), ${W}_{1}^{i}$, ${W}_{1}^{f}$, ${W}_{1}^{o}$, ${W}_{1}^{C}$ are the weight matrices communicating ${x}_{t}$ with the input gate, forget gate, output gate, and cell input, respectively; ${W}_{h}^{i}$, ${W}_{h}^{f}$, ${W}_{h}^{o}$, ${W}_{h}^{C}$ are the weight matrices connecting ${h}_{t-1}$ with the input gate, forget gate, output gate, and cell input, respectively; ${b}_{i}$, ${b}_{f}$, ${b}_{o}$, ${b}_{C}$ are the biases of the input gate, forget gate, output gate, and cell input, respectively; and $\sigma $ is the sigmoid activation function.
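A single forward step of the cell described in Formulas (19)–(23) can be sketched in pure Python for the scalar case; the weight and bias containers are illustrative assumptions chosen for readability, not the network configuration used in this paper.

```python
import math

# One forward step of an LSTM cell (scalar states for readability).
# W maps a gate name to its (input weight W_1, recurrent weight W_h) pair;
# b maps a gate name to its bias.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    f_t = sigmoid(W["f"][0] * x_t + W["f"][1] * h_prev + b["f"])        # forget gate
    i_t = sigmoid(W["i"][0] * x_t + W["i"][1] * h_prev + b["i"])        # input gate
    c_tilde = math.tanh(W["C"][0] * x_t + W["C"][1] * h_prev + b["C"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    o_t = sigmoid(W["o"][0] * x_t + W["o"][1] * h_prev + b["o"])        # output gate
    h_t = o_t * math.tanh(c_t)                                          # hidden output
    return h_t, c_t
```

With all weights and biases at zero, each gate outputs sigmoid(0) = 0.5, so the step simply halves the old cell state and emits half of its tanh, which makes the gating arithmetic easy to verify by hand.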
5. Conclusions
In this paper, we have proposed a combined EMD-PCA-RF-LSTM forecasting model to address the fluctuation and randomness of wind power; the model fully considers the five main environmental factors that affect wind power generation, namely, temperature, air pressure, humidity, wind direction, and wind speed. Our main conclusions are as follows.
The environmental factors were filtered using the Least Absolute Shrinkage and Selection Operator (LASSO) method; variables with regression coefficients of 0 were eliminated in order to achieve variable selection. The time series of the five environmental factors were decomposed using the Empirical Mode Decomposition (EMD) method to obtain IMFs, and the key influencing series affecting wind power were filtered out by Principal Component Analysis (PCA) to reduce the dimensionality. The key influencing factor series after PCA dimensionality reduction were subjected to further feature extraction by Random Forest (RF): a certain percentage of features were eliminated according to the feature importance ranking, and a new feature set was ultimately obtained as the input of the LSTM model. Finally, the non-linear relationship between the environmental factor time series and the wind power series was modeled dynamically by a Long Short-Term Memory (LSTM) neural network to construct the wind power forecasting model.
By comparing SVM, BP, LSTM, EMD-PCA-RF-SVM, EMD-PCA-RF-BP, EMD-RF-LSTM, EMD-PCA-LSTM, and EMD-PCA-RF-LSTM, it was determined that the combined forecasting model proposed in this paper performs best. With the combined EMD-PCA-RF decomposition and dimensionality reduction method, the MSE, RMSE, and MAE of SVM, BP, and LSTM all decreased; the adj-R^{2} of SVM improved by 42.02%, the adj-R^{2} of BP improved by 22.99%, and the adj-R^{2} of LSTM improved by 2.24% to 0.9699203, the best result among all of the models.
This paper verifies the practicality of the EMD-PCA-RF-LSTM model in the field of wind power forecasting and extends the application scope of deep learning techniques. The proposed forecasting method provides a new perspective for in-depth study of the economic operation and scheduling of wind power grid-connected systems, and it has good application prospects and engineering value. Related research is already in progress.