Short-Term Wind Speed Prediction Based on Principal Component Analysis and LSTM

Abstract: An accurate prediction of wind speed is crucial for the economic and resilient operation of power systems with a high penetration level of wind power. Meteorological information such as temperature, humidity, air pressure, and wind level has a significant influence on wind speed, which makes it difficult to predict wind speed accurately. This paper proposes a wind speed prediction method through an effective combination of principal component analysis (PCA) and a long short-term memory (LSTM) network. Firstly, PCA is employed to reduce the dimensions of the original multidimensional meteorological data which affect the wind speed. Further, the differential evolution (DE) algorithm is used to optimize the learning rate, number of hidden layer nodes, and batch size of the LSTM network. Finally, the reduced feature data from PCA and the wind speed data are merged together as the input to the LSTM network for wind speed prediction. In order to show the merits of the proposed method, several prevailing prediction methods, such as Gaussian process regression (GPR), support vector regression (SVR), recurrent neural network (RNN), and other forecasting techniques, are introduced for comparison. Numerical results show that the proposed method performs best in prediction accuracy.


Introduction
As one of the clean and renewable energy sources, wind power has developed rapidly all over the world during the last decade. In 2018, the global installed wind power capacity was 592 GW, which is expected to increase to 800 GW by the end of 2021 [1]. Wind speed is the most important factor affecting wind power generation [2]. Variability and uncertainty stemming from noncontrollable and nonadjustable wind speed bring tremendous difficulties to large-scale wind power integration and operation in power systems. A more accurate wind speed prediction can help reduce the negative impact of wind power integration and improve the efficacy and stability of power system operations [3][4][5].
To date, there have been three noteworthy approaches to wind speed prediction. The first comprises physical methods such as numerical weather prediction (NWP) models [6,7], which primarily use mathematical models of the atmosphere and oceans to obtain wind speed forecasts. NWP models offer high precision and a strong physical basis. However, NWP faces a variety of challenges, such as the difficulty of collecting meteorological data and the requirement of large-scale computing resources.
The second approach is statistical methods, which make use of historical and measured data to establish input-output function models. These methods include the Kalman filter [8] and autoregressive models.

The main contributions of this paper are summarized as follows:

1. A wind speed prediction algorithm considering meteorological features based on PCA and LSTM networks is presented. DE as a hyperparameter selection method is also included in the proposed method.

2. The PCA preprocessing method can effectively reduce the dimensions and retain the features in the data, which lays an important foundation for more accurate prediction with improved computational efficiency.

3. The proposed method is validated on three different cases considering real-world data, and experimental results show that the proposed method outperforms other popular forecasting methods.
The rest of this paper is organized as follows. The proposed hybrid prediction approach is described in Section 2. The experimental design and numerical validation are shown in Section 3. Conclusions and future work are provided in Section 4.

Methodology
In this section, the underlying theories and developed method are described, including the PCA algorithm, LSTM networks, and the hyperparameter selection based on DE, as well as the prediction framework proposed in this paper.

PCA Algorithm
In many cases, there are correlated relationships among variables, which makes the problem under study very complicated. When there is a certain correlation between two variables, the information they convey about the problem overlaps to some extent. PCA is devised to delete redundant (closely related) variables and to establish as few new variables as possible, such that these new variables are mutually uncorrelated while retaining as much of the original information as possible [37].
In the PCA algorithm, an orthogonal transformation is used to convert a set of possibly linearly correlated variables into a set of linearly uncorrelated variables, which serve as the principal components. Take the M × N data matrix in Equation (1) as an example; it contains N samples, each with M features:

T = [T_1, T_2, ..., T_N], (1)

where each T_n is an M-dimensional sample vector.
The calculation process of the PCA algorithm to reduce the dimensions of T is as follows:

(1) Calculate the covariance matrix of T by Equation (2):

Q = (1/N) Σ_{n=1}^{N} (T_n − T̄)(T_n − T̄)^T, (2)

where Q is the covariance matrix, T_n represents the nth sample vector, T̄ denotes the mean of the sample vectors, and (T_n − T̄)^T is the transpose of (T_n − T̄).

(2) Calculate the eigenvalues and eigenvectors of Q by Equation (3):

Q v_m = λ_m v_m, m = 1, 2, ..., M, (3)

where v_m is the mth eigenvector of Q and λ_m is the mth eigenvalue of Q. V and Λ are composed of the eigenvectors and eigenvalues of Q, respectively.

(3) Arrange the eigenvalues from large to small, and then calculate the contribution rate of each component and the cumulative contribution rate by Equations (4) and (5):

p_l = λ_l / Σ_{m=1}^{M} λ_m, (4)

p = Σ_{l=1}^{I} p_l, (5)

where p_l is the contribution rate of the lth component, λ_l is the lth eigenvalue arranged from large to small, and p represents the cumulative contribution rate.

(4) According to step (3), select the I (I ≤ M) components which contain the most information of T from the M components. The eigenvectors corresponding to the selected components constitute the transformation matrix U. The reduced dimension matrix Z is obtained by multiplying the original data matrix T and the transformation matrix U, as described in Equation (6):

Z = T U, (6)

where Z is the reduced dimension matrix and U is the transformation matrix composed of the I selected eigenvectors.
In this paper, up to 11 meteorological characteristics affecting wind speed are collected, including air temperature, air pressure, humidity, and so forth. The PCA algorithm above is employed to reduce the dimensions of the meteorological data. The data after dimensionality reduction can effectively keep the original meteorological information as much as possible.
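The dimensionality-reduction steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code; the function name and the 0.95 cumulative-contribution threshold are assumptions (the paper selects the number of components from Table 2).

```python
import numpy as np

def pca_reduce(T, threshold=0.95):
    """Reduce an (N samples x M features) matrix following Equations (2)-(6).

    Steps: centre the data, compute the covariance matrix Q, eigendecompose,
    sort eigenvalues from large to small, keep the first I components whose
    cumulative contribution rate reaches `threshold`, and project T onto them.
    """
    T_centered = T - T.mean(axis=0)               # subtract per-feature mean
    Q = np.cov(T_centered, rowvar=False)          # covariance matrix, Eq (2)
    eigvals, eigvecs = np.linalg.eigh(Q)          # eigh: Q is symmetric, Eq (3)
    order = np.argsort(eigvals)[::-1]             # arrange large-to-small
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    contrib = eigvals / eigvals.sum()             # p_l, Eq (4)
    cum_contrib = np.cumsum(contrib)              # p, Eq (5)
    I = int(np.searchsorted(cum_contrib, threshold)) + 1
    U = eigvecs[:, :I]                            # transformation matrix
    Z = T_centered @ U                            # Eq (6)
    return Z, contrib, I
```

Applied to the paper's data, T would be the 744 × 11 matrix of hourly meteorological samples, and Z the reduced feature matrix fed to the LSTM.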

LSTM
As a special kind of recurrent neural network (RNN), the LSTM neural network was first proposed by Hochreiter and Schmidhuber [38]. According to the LSTM structure in Figure 1, the current state of a cell is affected by the previous cell state, which reflects the recurrent characteristics of LSTM. Building on the RNN, a candidate cell, forget gate, input gate, and output gate are added to the hidden layer of LSTM. This structure alleviates the vanishing and exploding gradient problems and allows LSTM to learn the information contained in time series data more effectively [39]. An LSTM unit composed of a candidate cell, an input gate, an output gate, and a forget gate is shown in Figure 2. The input gate controls the extent to which values flow into the cell. The forget gate controls the extent to which values remain in the cell. The output gate and the value in the cell determine the output of an LSTM unit [39].

The calculation process of LSTM in Figure 2 is as follows:

(1) Calculate the inputs for the three gate units and the candidate cell by Equations (7)-(10):

net_{i,t} = W_i [h_{t−1}, x_t] + b_i, (7)
net_{f,t} = W_f [h_{t−1}, x_t] + b_f, (8)
net_{o,t} = W_o [h_{t−1}, x_t] + b_o, (9)
net_{c,t} = W_c [h_{t−1}, x_t] + b_c, (10)

where x_t represents the inputs at time t and h_{t−1} represents the cell output at time t − 1. net_{i,t}, net_{f,t}, net_{o,t}, and net_{c,t} are the inputs of the input gate, forget gate, output gate, and candidate cell, respectively. W_i, W_f, W_o, and W_c are the weight matrices of the input gate, forget gate, output gate, and candidate cell, respectively. b_i, b_f, b_o, and b_c are the biases of the input gate, forget gate, output gate, and candidate cell, respectively.

(2) Calculate the three gate units and the cell state by Equations (11)-(15):

i_t = σ(net_{i,t}), (11)
f_t = σ(net_{f,t}), (12)
o_t = σ(net_{o,t}), (13)
c̃_t = tanh(net_{c,t}), (14)
C_t = f_t · C_{t−1} + i_t · c̃_t, (15)

where i_t, f_t, o_t, c̃_t, and C_t represent the input gate, forget gate, output gate, candidate cell output, and cell state at time t, respectively. σ(·) stands for the sigmoid activation function expressed by Equation (16):

σ(z) = 1 / (1 + e^{−z}), (16)

and tanh(·) stands for the tanh activation function expressed by Equation (17):

tanh(z) = (e^{z} − e^{−z}) / (e^{z} + e^{−z}). (17)

(3) Calculate the output by Equation (18):

h_t = o_t · tanh(C_t), (18)

where h_t is the unit output at time t.
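The forward pass in Equations (7)-(18) can be sketched in NumPy as follows. This is an illustrative single-step implementation, not the authors' code; the dictionary-based weight layout and function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, Equation (16)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM forward step following Equations (7)-(18).

    W and b are dicts with keys 'i', 'f', 'o', 'c'; each W[k] has shape
    (hidden, hidden + input) so it acts on the concatenation [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    net = {k: W[k] @ z + b[k] for k in ("i", "f", "o", "c")}  # Eqs (7)-(10)
    i_t = sigmoid(net["i"])              # input gate, Eq (11)
    f_t = sigmoid(net["f"])              # forget gate, Eq (12)
    o_t = sigmoid(net["o"])              # output gate, Eq (13)
    c_tilde = np.tanh(net["c"])          # candidate cell, Eq (14)
    C_t = f_t * C_prev + i_t * c_tilde   # cell state, Eq (15)
    h_t = o_t * np.tanh(C_t)             # unit output, Eq (18)
    return h_t, C_t
```

Unrolling this step over a wind speed sequence (feeding each h_t, C_t back in) reproduces the recurrent behaviour described above.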

Selection of Hyperparameters
Many parameters of LSTM affect its accuracy and performance. The hyperparameters selected here are the learning rate, the number of hidden layer units, and the batch size. If the learning rate is too small, convergence will be too slow; if it is too large, the cost function will oscillate. The number of hidden layer units influences the quality of the fit. If the batch size is too small, the training will be extremely difficult to converge, which leads to underfitting; if it is too large, the required memory increases significantly. For example, when the number of hidden layer units is specified within (1, 100) and the batch size within (1, 500), a total of 50,000 combinations is generated. Thus, to overcome the computational burden, a simple yet reliable algorithm should be utilized to select the optimal combination of parameters, balancing predictive performance and computational efficiency. In this paper, the hyperparameters of LSTM are determined by means of the DE algorithm, a heuristic random search algorithm based on group differences [40]. The objective function of the hyperparameter selection problem is the root mean square error (RMSE) between the predicted and true values, calculated by Equation (19):

RMSE = sqrt( (1/S) Σ_{s=1}^{S} (ỹ_s − y_s)² ), (19)

where ỹ_s is the sth predicted value and y_s is the sth true value. The process of the DE algorithm to select the LSTM hyperparameters is as follows:

(1) Initialization: Initialize the length of an individual D, the number of iterations G, the population size NP, the crossover rate CR, and the scaling (mutation) factor F. The population is randomly generated by Equation (20):

X_{ω,k}(0) = X^L_{ω,k} + rand(0, 1) · (X^U_{ω,k} − X^L_{ω,k}), (20)

where ω = 1, 2, ..., NP; k = 1, 2, ..., D; and X^L_{ω,k} and X^U_{ω,k} are the lower and upper bounds of the kth dimension, respectively.
(2) Mutation: The mutation operator generates the mutation vector H_ω for each individual of the population using Equation (21):

H_ω(g) = x_{r1}(g) + F · (x_{r2}(g) − x_{r3}(g)), (21)

where x_{r1}(g), x_{r2}(g), and x_{r3}(g) are distinct individuals randomly selected from the population (r1 ≠ r2 ≠ r3 ≠ ω), F is the scaling factor, and g denotes the gth generation.

(3) Crossover: The crossover operation mixes the mutation vector and the target vector using Equation (22):

U_{ω,k}(g + 1) = H_{ω,k}(g) if rand(0, 1) ≤ CR, and X_{ω,k}(g) otherwise, (22)

where U_{ω,k} is the new (trial) individual generated in the crossover operation and CR is the crossover rate.

(4) Selection: In the selection operator shown in Equation (23), for minimization problems, if the fitness value f(U_ω(g + 1)) of the trial vector U_ω(g + 1) is less than or equal to the fitness value f(X_ω(g)) of the target vector X_ω(g), the trial vector replaces the target vector in the population; otherwise, the target vector is retained:

X_ω(g + 1) = U_ω(g + 1) if f(U_ω(g + 1)) ≤ f(X_ω(g)), and X_ω(g) otherwise. (23)
The best individual found by DE encodes the three hyperparameters of the LSTM prediction network. In this paper, the best set of hyperparameters is selected by comparing the RMSE values obtained with different hyperparameter combinations during training.
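The DE loop described above can be sketched as follows. This is a self-contained illustration of the DE/rand/1/bin scheme under the stated equations, not the authors' code; in the paper the fitness function is the RMSE of an LSTM trained with the candidate hyperparameters, which the usage example replaces with a simple placeholder.

```python
import numpy as np

def differential_evolution(fitness, bounds, NP=10, G=20, F=0.6, CR=0.8, seed=0):
    """DE/rand/1/bin sketch following Equations (20)-(23).

    `bounds` is a list of (low, high) pairs, one per hyperparameter
    (e.g. learning rate, hidden units, batch size); `fitness` returns
    the objective value (RMSE in the paper) to be minimized.
    """
    rng = np.random.default_rng(seed)
    D = len(bounds)
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    X = lo + rng.random((NP, D)) * (hi - lo)           # initialization, Eq (20)
    fit = np.array([fitness(x) for x in X])
    for g in range(G):
        for w in range(NP):
            r1, r2, r3 = rng.choice([i for i in range(NP) if i != w],
                                    size=3, replace=False)
            H = X[r1] + F * (X[r2] - X[r3])            # mutation, Eq (21)
            H = np.clip(H, lo, hi)                     # keep within bounds
            mask = rng.random(D) < CR                  # binomial crossover, Eq (22)
            mask[rng.integers(D)] = True               # ensure >= 1 mutated gene
            U = np.where(mask, H, X[w])
            fU = fitness(U)
            if fU <= fit[w]:                           # greedy selection, Eq (23)
                X[w], fit[w] = U, fU
    best = int(np.argmin(fit))
    return X[best], fit[best]

# Placeholder fitness: minimize the sphere function instead of an LSTM's RMSE.
best, best_fit = differential_evolution(lambda x: float(np.sum(x ** 2)),
                                        [(-5.0, 5.0)] * 3, NP=20, G=60)
```

Swapping the placeholder for a function that trains an LSTM with the candidate learning rate, hidden units, and batch size, and returns its validation RMSE, yields the selection procedure described above.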

Proposed Prediction Framework
The framework of the proposed prediction algorithm, shown in Figure 3, includes three parts. They are Part A (Data processing), Part B (Hyperparameters optimization), and Part C (Forecasting).

In Part A, 11 types of meteorological data (the feature data in Part A), such as temperature, humidity, air pressure, and wind level, are dimensionally reduced by PCA. The data processed by the PCA algorithm and the historical wind speed data together form the input of Part B. Part B uses the hyperparameters selected by DE to obtain a new LSTM model. Part C first divides the processed data from Part A into a training set and a test set, and then applies the LSTM model from Part B to forecast wind speed.

Figure 3. The proposed prediction framework.

Evaluation Metric of Prediction
In order to evaluate the performance of the proposed model, four different indicators, including RMSE (see Equation (19)), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (denoted by R²), are adopted as evaluation metrics.
(1) MAE: MAE is the average of the absolute error and reflects the deviation between the predicted value and the actual value well. The smaller the MAE, the higher the prediction accuracy. The formula is as follows:

MAE = (1/S) Σ_{s=1}^{S} |ỹ_s − y_s|.
(2) MAPE: MAPE represents the ratio between the error and the true value. The smaller the MAPE, the closer the predicted value is to the true value. The formula is as follows:

MAPE = (1/S) Σ_{s=1}^{S} |(ỹ_s − y_s) / y_s|.
(3) Coefficient of determination: The coefficient of determination (R²) measures how well the predicted values fit the true values. The range of R² is specified within [0, 1]; the closer R² is to 1, the better the fit of the prediction model. Therefore, R² can be used as an important indicator. The formula is as follows:

R² = 1 − Σ_{s=1}^{S} (y_s − ỹ_s)² / Σ_{s=1}^{S} (y_s − Ȳ)²,

where Ȳ is the mean of the observed values.
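The four metrics can be computed together with a short NumPy sketch. The function name is illustrative, and the MAPE term assumes no zero true values (a safe assumption for non-calm wind speed records).

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute RMSE, MAE, MAPE, and R^2 as defined in Section 2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))                  # Equation (19)
    mae = float(np.mean(np.abs(err)))                         # mean absolute error
    mape = float(np.mean(np.abs(err / y_true)))               # assumes y_true != 0
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                                # coefficient of determination
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

A perfect forecast gives RMSE = MAE = MAPE = 0 and R² = 1, matching the "closer to 1, better fit" interpretation above.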

Case Study
In this section, the description of meteorological data is first introduced. Then, the experimental design and parameter settings are described. Finally, results and analysis are shown to validate the performance of the proposed method.

Data Description
The meteorological data came from Fuyun Meteorological Station (46.59° N, 89.31° E), located in Xinjiang province, China [33]. The time span was one month (from 00:00 on 15 July 2018 to 23:00 on 14 August 2018). The total sample length was 744 with an hourly resolution. The basic information of the meteorological factors is listed in Table 1. The hourly wind speed data shown in Figure 4 were divided into a training set (the first 624 points) and a test set (the last 120 points).
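The chronological split above can be expressed in a few lines (the function name is illustrative):

```python
def train_test_split_series(series, n_train=624):
    """Chronological split: the first n_train points form the training set,
    the remainder the test set (624 / 120 for the 744-point hourly sample)."""
    return series[:n_train], series[n_train:]

hourly_speed = list(range(744))  # stands in for the hourly wind speed record
train, test = train_test_split_series(hourly_speed)
```

A chronological (rather than random) split is essential here: shuffling would leak future observations into the training set of a time series forecaster.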

Experimental Design and Parameter Settings
We first used the PCA algorithm to calculate the contribution rate and cumulative contribution rate of each component, as shown in Table 2. It can be seen that as the number of features increased, the correlation became more and more obvious, which means that there was no need to measure all the features. The cumulative contribution rate of the first five principal components was 0.9653. Therefore, these five principal components and the wind speed were selected to form the input to the prediction models.

To validate the prediction performance of the proposed PCA-LSTM method, BPNN-, SVR-, GPR-, and RNN-based methods were selected for comparison.
The following three cases were studied:

Case I: LSTM model optimized by the DE algorithm, compared with BPNN, GPR, RNN, and SVR using only historical wind speed data.

Case II: Feature-LSTM model optimized by the DE algorithm, compared with Feature-BPNN, Feature-GPR, Feature-RNN, and Feature-SVR using all meteorological factors related to wind speed.

Case III: the proposed model, compared with PCA-BPNN, PCA-GPR, PCA-RNN, and PCA-SVR using meteorological factors processed by the PCA algorithm.
On the basis of several trials and similar works [40,41], the parameters of the DE algorithm were set as follows: population size NP = 10, number of iterations G = 20, scaling factor F = 0.6, and crossover rate CR = 0.8. For the LSTM network, the learning rate was specified within [0, 1], the number of hidden layer units within [1, 100], and the batch size within [1, 100]. Table 3 shows the parameter settings of the LSTM determined by the DE algorithm and those of the other models. Furthermore, the Adam algorithm was adopted to make training of the LSTM network more efficient [40]. For SVR and GPR, the results of each run were identical, so they were run once. For the network-based methods (BPNN, LSTM, and RNN), the results differed between runs; these models were run independently 10 times, and the average was recorded as the final result.

Prediction Result Analysis
Case I demonstrates the effectiveness of the LSTM model optimized by the DE algorithm compared with the BPNN, GPR, RNN, and SVR models. The forecasting indicators obtained from these models are shown in Table 4, where the best results for each model are in bold. For the LSTM model, the average values of RMSE, MAE, MAPE, and R² were 0.3327, 0.2598, 0.1004, and 0.9655, respectively. Compared with the other models, the LSTM model optimized by DE performed the best in RMSE, MAE, and R². To compare the five models more intuitively, the forecast wind speed and R² for each model are shown in Figure 5. The coefficient of determination of the LSTM model was the largest, so its fit and prediction accuracy were the best. To summarize, the LSTM model optimized by the DE algorithm provided better performance than the other four traditional models.

In Case II, we used the meteorological characteristics related to wind speed as the input of the five forecasting models. Table 5 and Figure 6 show the performance metrics of the forecasting results achieved by these models. The RMSE, MAPE, and MAE of the Feature-LSTM model were 0.1745, 0.0488, and 0.1212, respectively. The R² of the Feature-LSTM was 0.9749, which was slightly smaller than that of Feature-BPNN. Overall, the comprehensive prediction performance of the Feature-LSTM was the best among all the above models. Compared with Case I, all four indicators improved significantly.

The prediction results based on the PCA and LSTM methods of Case III are shown in Figure 7, which shows the true values and the predicted values of the five different models. Table 6 lists the evaluation index values of each prediction model. Comparing PCA-LSTM with the PCA-BPNN, PCA-GPR, PCA-RNN, and PCA-SVR models, it can be clearly observed that the combined methods had an apparent influence on forecasting performance. From Table 6 and Figure 7, the proposed model outperformed the other four competitors for short-term wind speed forecasting, with the smallest mean RMSE (0.1474), MAPE (0.0382), and MAE (0.1015), as well as the highest mean R² (0.9989). Therefore, compared with Case I and Case II, the four indicators of the proposed PCA-LSTM method were the best; the PCA-LSTM model achieved superiority over all 15 forecasting models across the cases.

Discussion
In the above experiments, the BPNN, GPR, LSTM, RNN, and SVR methods were selected in Case I to predict wind speed using only historical wind speed data. Among them, the comprehensive performance of LSTM was the best, which reflects the strong fitting ability of the LSTM model for nonlinear problems. The comparison between Case I and Case II shows that using meteorological characteristics related to wind speed improves the prediction results. The comparison between Case II and Case III shows that reducing the dimension of these meteorological characteristics with the PCA algorithm further improves the results for all prediction models.

Conclusions
In this paper, a hybrid PCA and LSTM prediction method is presented. PCA is used to process original meteorological data. LSTM is optimized by the DE algorithm to obtain the best prediction model. Combining PCA and LSTM shows great advantages. The proposed method is applied to predict wind speed and the results prove that the method has strong predictive ability for time series data. Based on the analyses of Cases I-III, the proposed model not only requires less data than other models, but also largely improves the accuracy of forecasting results.
In our future work, hybrid methods using different deep learning models will be considered for time series prediction. In addition, we will improve the PCA method to make it more applicable and efficient.