Short-Term Wind Power Prediction for Wind Farm Clusters Based on SFFS Feature Selection and BLSTM Deep Learning

Abstract: Wind power prediction (WPP) of wind farm clusters is important to the safe operation and economic dispatch of the power system, but it faces two challenges: (1) the input parameters for WPP of wind farm clusters are very high-dimensional, so they contain irrelevant or redundant features; (2) it is difficult to build a holistic WPP model with high-dimensional input parameters for wind farm clusters. To overcome these challenges, a novel short-term WPP model for wind farm clusters, based on sequential floating forward selection (SFFS) feature selection and bidirectional long short-term memory (BLSTM) deep learning, is proposed in this paper. First, more than 300,000 input features of the wind farm cluster are constructed. Second, the SFFS method is applied to rank the high-dimensional features and to analyze how the forecasting accuracy changes with the number of features, in order to obtain the optimal number of features and the optimal feature set. Finally, based on the results of feature selection, BLSTM is applied to build a WPP model for wind farm clusters that combines feature selection and deep learning. The case study shows that (1) SFFS is an effective method for selecting the core features for WPP of wind farm clusters; (2) BLSTM shows not only higher WPP accuracy than long short-term memory and backpropagation neural networks but also outstanding performance in reducing the phase errors of WPP.

Author Contributions: Conceptualization, methodology, X.P.; software, K.C.; validation, J.L.; formal analysis, investigation, T.C.; resources, X.P.; data curation, K.C.; writing - preparation, K.C.; writing - review and editing, X.P.; visualization, K.C.; supervision, S.D.; project administration, X.P.; X.P.


Introduction
According to the statistical results from the Global Wind Energy Council (GWEC), at the end of 2019, the total installed capacity of global wind farms reached 651 GW [1]. With the rapid development of the global wind power industry, the distribution of wind farms is changing from the decentralized and small-scale distribution of the early stages to clustered and large-scale distribution [2]. A wind farm is a power plant composed of multiple wind turbine units. A wind farm cluster is a group of multiple wind farms in a specific area, in which the number of wind farms is generally several or dozens, depending on the size of the geographical area [3]. Wind power prediction (WPP) is the estimation of the expected production of wind power for a period of future time based on the meteorological data and historical operating data of the wind farm [3,4]. Short-term WPP is usually considered to forecast wind power output in the next few days, usually 2-3 days, which is mainly applied to short-term power generation scheduling and power trading [3]. In this paper, the prediction of the future 96 h is studied. WPP for wind farm clusters is the forecasting of the overall output of a wind farm cluster composed of multiple wind farms in a large spatial range [3]. Wind power output is random and uncertain. The integration of large-scale wind farms into the power system has a significant impact on the safe operation and economic dispatch of the power system, making the WPP of wind farm clusters increasingly important [4].
WPP methods for wind farm clusters mainly include the accumulation method and the statistical upscaling method [5][6][7]. The basic principle of these methods is to predict each or part of the wind farms in the cluster first, based on which the forecasting of output power for the wind farm cluster is obtained through accumulating or upscaling the forecasting results of individual farms [8,9]. With the improvement of computing techniques and the development of deep learning theories [10], holistic WPP modeling for wind farm clusters is becoming possible.
The holistic WPP modeling for wind farm clusters faces the following two challenges: (1) Many features affect the power output of wind farm clusters, including uncorrelated and redundant features, which need to be optimized; (2) for WPP of wind farm clusters, it is difficult to establish a holistic high-precision prediction model with high-dimensional input parameters. To overcome these two challenges, both feature selection and deep learning are studied in this paper.
Feature selection removes irrelevant and redundant features and reduces the computational complexity of the algorithm by mining the intrinsic relationship between the features and the target sequences [11]. In past studies [12][13][14][15], random search methods, such as the simulated annealing algorithm, tabu search algorithm, genetic algorithm, and random sampling with replacement algorithm, have been applied for feature selection; they can quickly find a locally optimal solution that meets the requirements. However, the disadvantages of these methods are the uncertainty of the results and the need for a large amount of training data [12][13][14][15].
In reference [16], the heuristic search method was applied for feature selection in the siting and sizing of active and reactive power sources. The heuristic search method balances calculation time against the effectiveness of feature selection and is widely applied for feature selection in text analysis and image recognition. There are many common heuristic search methods, including the best individual feature (BIF) method, the sequential forward selection (SFS) method, the sequential floating forward selection (SFFS) method, and so on [17][18][19][20]. The BIF method was adopted in reference [18]; it ranks the features by the contribution value calculated when each feature is applied individually. BIF is fast to compute, but it does not consider the redundancy between features. In reference [19], the SFFS method was applied for feature selection in pattern recognition. Backtracking, a feature elimination mechanism, is applied in the feature selection process of the SFFS method, which reduces the redundancy between features [20]. Therefore, the SFFS method is chosen as the feature selection method for WPP of wind farm clusters in this paper.
Feature selection results will be used as the input parameters for the WPP model. Deep learning is one of the foremost methods of WPP [21][22][23][24][25][26][27]. Compared with traditional shallow artificial neural networks, deep learning neural networks, including convolutional neural networks [22], deep belief networks [23], long short-term memory (LSTM) networks [24], stacked denoising autoencoders [25], and so on, have excellent learning and generalization capabilities for massive data. Among these methods, LSTM networks show excellent characteristics for the forecasting of time series data [26]. In studies [24] and [27], LSTM was applied for WPP and was shown to offer significant advantages over the traditional forecasting methods of backpropagation neural network (BPNN) and autoregressive integrated moving average. LSTM has a memory function for historical data, so the trend of wind power changes (such as rise or fall) can be captured well by the LSTM network. However, when the power sequence changes from one trend to another, such as from rise to fall, certain phase delay errors might be caused by the LSTM network [28].
In recent years, the bidirectional long short-term memory (BLSTM) method has shown significant advantages in the areas of speech recognition [29,30], handwriting recognition [31], and protein structure prediction [32] compared with the LSTM method. For the BLSTM method, both historical and future time series are applied as the input of the model to offset the trend inertia appearing in a single time series direction and to effectively reduce the phase error of the forecasting. Therefore, BLSTM is chosen as the WPP method for wind farm clusters in this paper.
A novel SFFS-BLSTM WPP model for wind farm clusters is proposed in this paper. Specifically, we make the following two contributions: (1) A high-dimensional candidate feature set, with more than 300,000 features, for WPP of wind farm clusters is constructed based on feature transformations such as the wavelet transformation (WT) and the empirical mode decomposition (EMD) transformation, and a novel SFFS method is applied in feature selection for WPP of wind farm clusters; (2) based on the results of feature selection, a short-term WPP model for wind farm clusters, named SFFS-BLSTM, which combines SFFS feature selection and BLSTM deep learning, is proposed; it shows excellent characteristics in reducing prediction errors, especially phase errors.

The Overall Flowchart of SFFS-BLSTM
The overall flowchart of the WPP model for wind farm clusters based on SFFS-BLSTM is shown in Figure 1, which is divided into three stages: Stage 1, high-dimensional feature construction for wind farm clusters; Stage 2, feature selection of wind farm clusters based on SFFS; and Stage 3, short-term WPP for wind farm clusters based on BLSTM. These three stages are divided into nine work steps, which are described in detail as follows.
Step 1: Feature extraction of wind farm clusters. Based on the wind power data and numerical weather prediction (NWP) data of the wind farm cluster, different parameters such as wind speed, wind direction, temperature, pressure, and humidity of each wind farm are extracted, and time-series features and statistical features of wind farm clusters are also constructed.
Step 2: Feature transformation of wind farm clusters. Based on WT and EMD transformations, the time series features are decomposed into low-frequency and high-frequency components to obtain frequency-domain features. In total, more than 300,000 features are constructed in the paper.
Step 3: Initial feature ranking based on BIF. The BIF method based on mutual information (MI) is applied to initially rank over 300,000 features [33].
Step 4: Feature validity verification. Based on the results of the initial feature ranking, the number of input features of the LSTM WPP model is increased in increments of 500 to analyze the change in the WPP accuracy when the number of features increases with the feature ranking results and to initially determine the optimal number of features for WPP.
Step 5: Feature ranking based on SFFS. Based on the initial feature selection results, the SFFS method is applied to further rank the features selected in step 4.
Step 6: Feature validity verification. Based on the feature ranking results in step 5, the number of input features of the LSTM WPP model is increased in increments of 20, to analyze the change in the WPP accuracy when the number of features increases with the feature ranking results and to determine the optimal number of features and feature sets for WPP.
Step 7: Statistical analysis of the selected features. Based on the results of optimal feature selection, statistical analysis is applied to obtain the most important factors affecting the WPP accuracy of wind farm clusters.
Step 8: Deep learning-based WPP for wind farm clusters. Based on the results of feature selection, LSTM and BLSTM are comparatively applied to carry out WPP for wind farm clusters.
Step 9: WPP results and error analysis. Based on the WPP results obtained in step 8, the root mean square error (RMSE) of the WPPs and wind power outputs of the WPPs for LSTM and BLSTM are comparatively analyzed to assess the two methods.
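The error measure used in step 9 can be sketched as follows. This is a minimal illustration, assuming the RMSE is normalized by the installed capacity of the cluster so that it reads as a percentage; the normalization, the function name `normalized_rmse`, and the sample values are assumptions of this sketch, not details taken from the paper.

```python
import math


def normalized_rmse(predicted, actual, capacity):
    """RMSE between two power series, as a percentage of installed capacity.

    Normalizing by capacity is an assumption of this sketch.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("series must be non-empty and of equal length")
    # Mean of the squared point-wise errors.
    mean_sq = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return 100.0 * math.sqrt(mean_sq) / capacity


# Hypothetical values in MW for a hypothetical 100 MW cluster.
predicted = [10.0, 20.0, 30.0]
actual = [12.0, 18.0, 33.0]
print(round(normalized_rmse(predicted, actual, capacity=100.0), 2))
```

The same function can be applied to the LSTM and BLSTM forecasts of one test period to compare the two models on equal terms.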

Stage 1: Feature Construction for Wind Farm Clusters
In this paper, three kinds of features are applied as candidate features for feature selection of WPP for wind farm clusters, including original NWP features, frequency domain features, and time-series features.
(1) Original NWP features and corresponding statistical features The original NWP features of the wind farms applied in the paper are shown in Table 1. There are 11 NWP features for each wind farm, including wind speeds and wind directions at four different heights, atmospheric temperature, humidity, and sea-level pressure. Taking a wind farm cluster that contains 20 wind farms as an example, the number of original NWP features is 11 × 20 = 220. For each original NWP feature, statistical features over the 20 wind farms can be constructed, which reflect the overall output of the wind farm cluster. As shown in Table 2, the mean, mode, upper quartile, median, lower quartile, and interquartile range of each original NWP feature over the 20 wind farms were constructed. In total, there are 11 × 20 + 11 × 6 = 286 original NWP features and statistical features for the wind farm cluster containing 20 wind farms.
(2) Time series features As shown in Figure 2, in addition to the NWP data at the time of WPP, the NWP data 12 h before and after the time to be predicted might also be applied as valid input parameters [34]. If the time interval of the NWP data is 15 min, there are 96 moments within 24 h: 12 h plus 12 h. Therefore, for each moment to be predicted, the valid number of input features is 286 × 96 = 27,456.
(3) Frequency domain features WT and EMD transformations are applied to obtain 10 new features with different frequency components for each original NWP feature and corresponding statistical feature. "db9" is chosen as the mother wavelet. After the wavelet transform, the feature sequence is decomposed into four layers. The high-frequency components generated at each layer are named wavelet1, wavelet2, wavelet3, and wavelet4, and the low-frequency component generated at the fourth layer is named wavelet5. The features generated by the EMD transform are named emd1, emd2, emd3, emd4, and emd5 in order of frequency. The frequency ranges of the wavelet and EMD features are shown in Table 3.
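The feature counts for the 20-farm example can be checked with a few lines of arithmetic. The sketch below assumes that each time-series feature is kept alongside its 10 WT and EMD components; under that assumption the product equals the 302,016 features reported in the feature selection results, which suggests, but does not confirm, this way of counting. All variable names are illustrative.

```python
# Feature counts for the 20-farm example; variable names are illustrative.
n_farms = 20
nwp_per_farm = 11   # wind speed and direction at 4 heights, temperature, humidity, pressure
n_statistics = 6    # mean, mode, upper quartile, median, lower quartile, interquartile range

original = nwp_per_farm * n_farms            # 11 x 20
statistical = nwp_per_farm * n_statistics    # 11 x 6
base = original + statistical

steps_per_hour = 60 // 15                    # NWP data every 15 min
time_steps = (12 + 12) * steps_per_hour      # 12 h before plus 12 h after
time_series = base * time_steps

# Assumption: every series is kept and joined by its 5 wavelet and 5 EMD components.
components = 5 + 5
total = time_series * (1 + components)

print(base, time_series, total)
```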

Stage 2: Feature Selection Based on SFFS
If all of the more than 300,000 features were applied as input parameters for WPP, the WPP model would be difficult to train, and neither the computational efficiency nor the prediction accuracy would be satisfactory. Therefore, feature selection is applied to the high-dimensional input features. The heuristic feature selection method based on SFFS, which is developed from the SFS method, is adopted in this paper [35]. Unlike SFS, the SFFS method includes a feature elimination mechanism. Feature selection and feature elimination alternate within the SFFS process, which removes redundant features while selecting the effective features, essentially avoiding redundancy among features [20].
The flowchart of the SFFS method is shown in Figure 3, which is divided into five steps. The purpose of the SFFS method is to select a certain number of optimal features from the candidate features and add them to the target feature subset S. The number of target features is d, and the number of candidate features is m.
Step 1: The number of features to be added to the target feature subset in the current round, denoted L, is determined. L is set to the difference between the number of target features d and the number of already selected features n, multiplied by a coefficient. The recommended value of the coefficient is 10%, that is, L = (d − n) × 10%.
Step 2: According to the criterion function, which is presented in the second part of Section 2.3 of this paper, the L features that maximize the criterion function value are selected from the candidate features and added to the target feature subset S.
Step 3: The number of selected features n is compared with the target number of features d. If n reaches d, the loop is stopped, and the target feature subset that meets the requirements is obtained. Otherwise, step 4 is executed.
Step 4: The number of features to be removed, denoted R, is determined. R is set to the number of selected features multiplied by a coefficient. The recommended value of the coefficient is 10%, that is, R = n × 10% [36].
Step 5: The R features that minimize the criterion function are selected and removed from the target feature subset S; then step 1 is executed again, and the above steps are repeated.
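The alternation between adding and removing features can be illustrated with a short sketch. Note that this is the classic one-feature-at-a-time form of SFFS, in which a feature is removed only when the reduced subset beats the best subset of that size seen so far; it does not reproduce the 10% batch sizes of steps 1 and 4. The criterion is passed in as a function, and `toy_score`, `RELEVANCE`, and `PENALTY` are hypothetical stand-ins, not values from the paper.

```python
from itertools import combinations


def sffs(candidates, score, target_size):
    """Classic SFFS: add the best feature, then drop features while that helps."""
    target_size = min(target_size, len(candidates))
    selected = []
    best_score = {}  # best criterion value seen so far for each subset size
    while len(selected) < target_size:
        # Forward step: add the candidate that maximizes the criterion.
        remaining = [f for f in candidates if f not in selected]
        best_add = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_add)
        size = len(selected)
        best_score[size] = max(best_score.get(size, float("-inf")), score(selected))
        # Floating backward step: remove the least useful feature as long as the
        # reduced subset beats the best subset of that size seen so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if score(reduced) > best_score.get(len(reduced), float("-inf")):
                selected = reduced
                best_score[len(reduced)] = score(reduced)
            else:
                break
    return selected


# Hypothetical toy criterion: summed relevance minus a penalty for redundant pairs.
RELEVANCE = {"a": 0.9, "b": 0.85, "c": 0.6, "d": 0.1}
PENALTY = {frozenset(("a", "b")): 0.8}


def toy_score(subset):
    gain = sum(RELEVANCE[f] for f in subset)
    overlap = sum(PENALTY.get(frozenset(pair), 0.0) for pair in combinations(subset, 2))
    return gain - overlap


result = sffs(list(RELEVANCE), toy_score, target_size=2)
print(result)
```

In the toy data, feature "b" is individually strong but largely redundant with "a", so the search prefers the weaker but complementary feature "c".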
Energies 2021, 14, x FOR PEER REVIEW 7 of 18
The key points of the SFFS method are the evaluation index and the criterion function [37]. MI is applied as the evaluation index, and the minimum redundancy maximum relevance (mRMR) algorithm is applied as the criterion function, which is presented in detail as follows.
(1) Evaluation index As a result of the strong nonlinear relationship between the output power of the wind farm clusters and the NWP, the MI has a stronger ability to represent the nonlinear correlation than other indicators such as Euclidean distance and consistency measure, so it is selected as the evaluation index for the feature selection method in the paper.
MI is a correlation parameter in information theory [33], which is the amount of information of one random variable contained in another random variable [38]. In other words, MI is a reduction in the uncertainty of a random variable as a result of knowing the laws of another random variable [39].
For example, if there are two random variables, X and Y, the joint probability distribution of X and Y is p(x, y). The marginal probability distributions of X and Y are p(x) and p(y). The MI of X and Y, denoted I(X; Y), is defined as the relative entropy between the joint probability distribution p(x, y) and the product of the marginal probability distributions p(x)p(y), which is shown in Equation (1).

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{1}$$
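The definition of MI as the relative entropy between the joint distribution and the product of the marginals can be checked numerically on a small discrete example. The sketch below assumes a base-2 logarithm, so the result is in bits (the base is not stated in the text), and the two joint tables are hypothetical.

```python
import math


def mutual_information(joint):
    """MI in bits of two discrete variables, given their joint probability table."""
    p_x = [sum(row) for row in joint]        # marginal distribution of X (row sums)
    p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # terms with p(x, y) = 0 contribute nothing
                mi += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y
print(mutual_information(independent), mutual_information(dependent))
```

For continuous quantities such as wind speed and power, the probabilities have to be estimated first, for example by binning; the estimator used in the paper is not specified in this section.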
(2) Criterion function To make the selected features meet the requirements of both effectiveness and low redundancy, the criterion function of the mRMR is selected as the criterion function of the feature selection method in the paper.
Based on the maximum relevance principle, the average value of the MI between the features contained in the target feature subset S and the wind power P of the wind farm cluster is maximized. The constraint condition based on the maximum relevance is shown in Equation (2).

$$\max D(S, P), \quad D = \frac{1}{|S|} \sum_{v_i \in S} I(v_i; P) \tag{2}$$

where $I(v_i; P)$ in Equation (2) is the MI between $v_i$, the ith feature contained in the target feature subset S, and the wind power P of the wind farm cluster. A target feature subset S built on the maximum relevance principle alone will contain redundancy, because there will be a large degree of correlation between the features in S. Therefore, a constraint condition based on the minimum redundancy should be added to the criterion function. The constraint condition based on the minimum redundancy is shown in Equation (3).

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{v_i, v_j \in S} I(v_i; v_j) \tag{3}$$

where $I(v_i; v_j)$ in Equation (3) is the MI between features $v_i$ and $v_j$, the ith and jth features contained in S. Based on the aforementioned two constraint conditions, the criterion function for feature selection based on the mRMR principle is obtained, as shown in Equation (4), written here in the standard mRMR difference form.

$$\max \Phi(D, R), \quad \Phi = D - R \tag{4}$$
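The way relevance and redundancy are traded off can be sketched with precomputed MI values. The example assumes the common difference form of the mRMR criterion (average relevance minus average pairwise redundancy, with the average taken over all ordered pairs including a feature with itself); the function name and the MI values below are hypothetical.

```python
def mrmr_score(subset, relevance, pairwise):
    """Average relevance minus average pairwise redundancy of a feature subset."""
    size = len(subset)
    # Average MI between each selected feature and the target.
    avg_relevance = sum(relevance[f] for f in subset) / size
    # Average MI over all ordered feature pairs in the subset.
    avg_redundancy = sum(pairwise[f][g] for f in subset for g in subset) / size ** 2
    return avg_relevance - avg_redundancy


# Hypothetical MI values: "a" and "b" are both relevant but nearly redundant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
pairwise = {
    "a": {"a": 1.0, "b": 0.8, "c": 0.1},
    "b": {"a": 0.8, "b": 1.0, "c": 0.1},
    "c": {"a": 0.1, "b": 0.1, "c": 1.0},
}

score_ab = mrmr_score(["a", "b"], relevance, pairwise)
score_ac = mrmr_score(["a", "c"], relevance, pairwise)
print(round(score_ab, 3), round(score_ac, 3))
```

Although "b" is more relevant than "c" on its own, the pair ("a", "c") scores higher because the two features overlap far less.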

Stage 3: WPP for Wind Farm Clusters Based on BLSTM
The results of the SFFS feature selection will be applied as input parameters for the WPP model. The deep learning short-term WPP based on BLSTM for wind farm clusters is constructed in the paper. BLSTM networks not only have strong learning and generalization capabilities for massive data but also have strong mapping capabilities for time series data. BLSTM networks offer the advantage of eliminating phase errors [28]. The BLSTM network is developed from LSTM, and the two networks have similarities in structure.
(1) LSTM LSTM is a deep learning network widely applied in time series data prediction, which is mainly composed of an input layer, hidden layer, and output layer. The structure of LSTM is shown in Figure 4.
In Figure 4, each LSTM unit is a cell with a memory function, and the state of the cell at time t is recorded as $c_t$. The state of the cell at the previous moment, $c_{t-1}$, is inputted into each gate as internal information. The LSTM unit receives the current input $x_t$ and the output at the previous moment, $y_{t-1}$. $x_t$ and $y_{t-1}$ are applied as control information, and $c_{t-1}$ is modified and updated to obtain $c_t$ based on the values of $x_t$ and $y_{t-1}$. Finally, the state $c_t$ of the ...
(2) BLSTM One shortcoming of the LSTM model is that only the historical data transmitted from the forward sequence can be applied in the model. For WPP, the output at the time to be predicted is not only causally related to historical data but also correlated with future data. Therefore, the predicted value of the future output from the reverse sequence is also critical to the accuracy of WPP. In BLSTM, complementary information from the past and the future is integrated for prediction, based on two independent hidden layers containing the data from both the forward and reverse directions. The structure of BLSTM is shown in Figure 5.
(1) Training The forward and reverse LSTM networks are trained in alternating order. BLSTM can be trained with an algorithm similar to that of LSTM. The training process of BLSTM is as follows: In the forward pass, the forward and reverse states are processed first, and then the output is calculated. In the backward pass, the output is processed first, and then the forward and reverse states are processed. After both passes are completed, the weights are updated [40].
(2) Forecasting The forward and reverse LSTM networks are applied for WPP in parallel. The prediction flowchart of the BLSTM neural network is shown in Figure 6. In Figure 6, the input data of the testing data set are inputted to the forward and reverse LSTM networks in forward and reverse order, respectively. The results of the two sequences are averaged to obtain the hidden layer results of the BLSTM. The results of the BLSTM are inputted to the second hidden layer of the forward and reverse networks in forward and reverse order, respectively. Finally, the output results of each hidden layer are obtained, the results of the last hidden layer are inputted to the output layer, and the results of the BLSTM network are obtained [40].
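The effect of reading a sequence in both directions and averaging the two results can be illustrated without a trained network. In the sketch below, a simple exponential smoother stands in for one recurrent layer; it is not an LSTM and has no gates or learned weights, and it is used only because, like a one-directional recurrent layer, it lags behind a change in trend. The function names, the weight of 0.3, and the test signal are all assumptions of this illustration.

```python
def directional_pass(sequence, weight=0.3):
    """Stand-in for one recurrent layer: h_t = (1 - weight) * h_{t-1} + weight * x_t."""
    state, outputs = 0.0, []
    for x in sequence:
        state = (1.0 - weight) * state + weight * x
        outputs.append(state)
    return outputs


def bidirectional_pass(sequence, weight=0.3):
    """Run the stand-in layer in both directions and average the two outputs."""
    forward = directional_pass(sequence, weight)
    # Feed the sequence in reverse order, then flip the result back into time order.
    backward = directional_pass(sequence[::-1], weight)[::-1]
    averaged = [(f + b) / 2.0 for f, b in zip(forward, backward)]
    return forward, backward, averaged


def peak_index(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# A power ramp that rises and then falls, with its true peak at index 3.
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
forward, backward, averaged = bidirectional_pass(signal)
print(peak_index(forward), peak_index(backward), peak_index(averaged))
```

On this signal the forward pass peaks one step late and the backward pass one step early, while their average peaks at the true position, which mirrors the argument that combining both directions offsets the trend inertia of a single direction.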

Case Study
In this paper, based on the overall flowchart shown in Figure 1, feature construction, feature selection, and the prediction model based on deep learning are carried out and evaluated with data from industrial applications. The data are from a wind farm cluster in Ning Xia province of China, which contains 20 wind farms. The data cover 660 days, ranging from 1 January 2017 to 1 November 2018. The data for the first 440 days are selected as training data, and the data for the last 220 days are selected as testing data. NWP data are forecast for the next 4 days, with a time resolution of 15 min. The geographical distribution of the wind farms within the wind farm cluster is shown in Figure 7. The NWP numbers and the installed capacities of the 20 wind farms are shown in Table 4.

Results of Feature Selection
A total of 302,016 features are preliminarily ranked based on BIF. According to the ranking results, the number of input features of the LSTM prediction model is successively increased in increments of 500, and the change in the RMSE of the WPP model for the next 4 days with the number of features is analyzed; the results are shown in Figure 8a. As shown in Figure 8a, the WPP error of the wind farm cluster first drops sharply and then rises slowly as the number of features increases, and the optimal number of features is about 1000.

To determine the optimal number of features more accurately, the number of input features of the LSTM prediction model is successively increased in increments of 20 for the first 2000 features in the order ranked by BIF, and the change in the RMSE of the WPP model for the next 4 days with the number of features is shown in Figure 8b. As seen in Figure 8b, the optimal number of features is 980. The MI values of the first 2000 features are shown in Figure 8c. The MI of the 980th feature is 0.6891, so for the data of this case, features with an MI higher than 0.6891 are effective features that improve the WPP accuracy. The dotted green lines in Figure 8a,b show the RMSE of WPP of the wind farm cluster without feature construction and selection. No feature construction and selection means that the raw NWP data of the 20 wind farms are inputted into the WPP model, comprising 220 features: the wind speed and wind direction at 4 different altitudes, sea-level pressure, atmospheric moisture, and temperature of each wind farm.
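The incremental search for the optimal number of features can be sketched as a simple loop over the ranked list. In the paper, each evaluation means training the LSTM model on the top-k features and measuring its RMSE; in this sketch that step is replaced by a synthetic error curve, and the function names, the 200 feature labels, and the curve itself are hypothetical.

```python
def best_feature_count(ranked_features, evaluate, step):
    """Evaluate the top-k ranked features for k = step, 2*step, ... and keep the best k."""
    best_k, best_error = None, float("inf")
    for k in range(step, len(ranked_features) + 1, step):
        error = evaluate(ranked_features[:k])
        if error < best_error:
            best_k, best_error = k, error
    return best_k, best_error


def toy_error(subset):
    """Synthetic error curve with a minimum at 60 features (not data from the paper)."""
    return 12.0 + abs(len(subset) - 60) / 100.0


ranked = [f"f{i}" for i in range(200)]
print(best_feature_count(ranked, toy_error, step=20))
```

A coarse pass with a large step followed by a finer pass around the best coarse value, as done here with increments of 500 and then 20, keeps the number of model trainings manageable.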
Three feature selection methods (BIF, mRMR, and SFFS) are applied to rank the top 1000 features from the BIF ranking. Based on each ranking, the change in RMSE of the WPP model for the next 4 days as the number of features increases is analyzed; the results are shown in Figure 9.
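For readers unfamiliar with SFFS, the following is a minimal sketch of the generic sequential floating forward selection procedure: a forward step adds the best remaining feature, then conditional backward steps remove features for as long as doing so beats the best subset of that size found so far. It assumes a user-supplied `score` function where higher is better (for example, negative validation RMSE). The function names and the toy scoring rule are hypothetical, and this is not the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def sffs(
    features: Sequence[str],
    score: Callable[[List[str]], float],
    k_max: int,
) -> Dict[int, Tuple[float, List[str]]]:
    """Return the best (score, subset) found for each subset size up to k_max."""
    selected: List[str] = []
    best: Dict[int, Tuple[float, List[str]]] = {}
    while len(selected) < k_max:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Forward step: add the feature that gives the highest score.
        f_add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the recorded best
        # subset of the smaller size.
        while len(selected) > 2:
            f_rm = max(
                selected, key=lambda f: score([g for g in selected if g != f])
            )
            reduced = [g for g in selected if g != f_rm]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best


# Toy score: individual usefulness minus a penalty when the redundant
# pair "a" and "b" is selected together (values are made up).
WEIGHTS = {"a": 3.0, "b": 2.9, "c": 1.0}


def toy_score(subset: List[str]) -> float:
    total = sum(WEIGHTS[f] for f in subset)
    if "a" in subset and "b" in subset:
        total -= 2.5
    return total


result = sffs(["a", "b", "c"], toy_score, k_max=2)
print(result[2][1])
```

In the toy example, ranking by individual score alone would pick "a" and "b", whereas the subset-based search prefers "a" and "c" because "b" is largely redundant with "a".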
From Figure 9, the following conclusions can be drawn: (1) At the primary stage of feature selection, when fewer than 100 features are selected, the WPP error of the SFFS method declines more rapidly than that of the other two methods, so SFFS selects more effective features at this stage than the other two methods. (2) When about 130 features are selected by the SFFS method, the WPP accuracy already exceeds that obtained without feature construction and selection. When about 660 features are selected, SFFS achieves its optimal accuracy, with an RMSE 0.37 percentage points lower than that obtained without feature construction and selection. The optimal WPP accuracies of the mRMR and BIF methods are reached at about 780 and about 980 selected features, respectively. (3) At the optimum, the RMSE of the SFFS method is the lowest at 11.99%, followed by the mRMR method at 12.05%; the RMSE of the BIF method is the highest at 12.07%.
The RMSEs of WPP for each forecast day, as the number of features increases under the three feature selection methods, are shown in Figure 10.
From Figure 10, the following conclusions can be drawn: (1) Among the three methods, the WPP error of the SFFS method declines most rapidly as the number of features increases, and SFFS reaches its optimal WPP accuracy first, with only minor fluctuations of the curve, so very few redundant features are selected by this method. (2) The optimal number of features differs with the timescale of WPP. With the SFFS method, the optimal number of features is 720 for the first day, 700 for the second day, 560 for the third day, and 560 for the fourth day. (3) Compared with no feature construction and selection, feature construction and selection based on the SFFS method reduces the WPP error for every forecast day, but to different degrees: RMSE decreases by 0.33 percentage points for the first day, 0.52 for the second day, 0.35 for the third day, and 0.5 for the fourth day.
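Per-day error statistics of this kind can be computed by grouping forecast errors by lead time. The sketch below assumes hourly lead times from 1 to 96 and power values normalized by installed capacity; both are assumptions made for illustration, since the section does not state the resolution or the normalization. `rmse_by_forecast_day` and the sample values are hypothetical.

```python
import math
from collections import defaultdict


def rmse_by_forecast_day(samples):
    """samples: iterable of (lead_hour, predicted, actual), lead_hour in 1..96.

    Returns {forecast_day: RMSE}, with day 1 covering lead hours 1-24, etc.
    """
    squared = defaultdict(list)
    for lead_hour, predicted, actual in samples:
        day = (lead_hour - 1) // 24 + 1
        squared[day].append((predicted - actual) ** 2)
    return {
        day: math.sqrt(sum(values) / len(values))
        for day, values in sorted(squared.items())
    }


# Made-up normalized samples: (lead_hour, predicted, actual).
toy_samples = [
    (1, 0.6, 0.5),
    (24, 0.4, 0.5),
    (25, 0.7, 0.5),
    (96, 0.2, 0.5),
]
daily = rmse_by_forecast_day(toy_samples)
print(daily)
```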
Statistical analysis is carried out for the top 660 features selected by the SFFS method. Some aspects of these features, in terms of their frequency level and the wind farm to which they belong, are enumerated in Table 5. The statistical results of the valid features, grouped into four types (original features, statistical features, WT decomposition features, and EMD decomposition features), are shown in Figure 11.
From Figure 11, it can be concluded that the original NWP features make up the largest share of the selected features, with 283 selected, accounting for 43%, while the other features (WT and EMD decomposition features and statistical features) account for 57%.
The statistical results of the subdivided valid features are shown in Figure 12. Among the statistical features in Figure 12b, 37 are mean values and 27 are mode values, the two largest groups; both reflect the overall situation of the wind farm cluster. Thus, constructing statistical features that reflect the overall situation of the wind farm cluster helps to improve WPP accuracy. (4) Among the five bands of the WT and EMD decomposition features shown in Figure 12c,d, more features are selected from the lower-frequency bands and fewer from the higher-frequency bands.

Comparison of the WPP Results Based on BPNN, LSTM, and BLSTM
With the same input features, the WPP results of BPNN, LSTM, and BLSTM are compared. The change in the 4-day RMSEs of the three WPP models as the number of features increases is shown in Figure 13.
From Figure 13, the following conclusions can be drawn: (1) The optimal number of features is about 660 for each of the three WPP models. (2) As the number of input features increases from 20 to 1000, BLSTM shows higher WPP accuracy than LSTM at all but a few points. (3) With the 660 optimal features, the WPP RMSE of BLSTM is 11.80%, while those of LSTM and BPNN are 11.99% and 12.34%, respectively. BLSTM therefore improves on LSTM by 0.19 percentage points and on BPNN by 0.54 percentage points.
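The differences quoted above are absolute differences in RMSE (percentage points). A short calculation makes the distinction from relative improvement explicit; the RMSE values are taken from the text, while the `improvement` helper is introduced here for illustration.

```python
# RMSE values (in percent) at the 660-feature optimum, as reported in the text.
rmse = {"BLSTM": 11.80, "LSTM": 11.99, "BPNN": 12.34}


def improvement(baseline: float, new: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = baseline - new
    relative = 100.0 * absolute / baseline
    return absolute, relative


vs_lstm = improvement(rmse["LSTM"], rmse["BLSTM"])
vs_bpnn = improvement(rmse["BPNN"], rmse["BLSTM"])
print(f"vs LSTM: {vs_lstm[0]:.2f} points, {vs_lstm[1]:.1f}% relative")
print(f"vs BPNN: {vs_bpnn[0]:.2f} points, {vs_bpnn[1]:.1f}% relative")
```

In relative terms the gains are about 1.6% over LSTM and about 4.4% over BPNN.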
The one-day-ahead WPP results of LSTM and BLSTM over 750 h, in terms of the wind power curve, are shown in Figure 14. From the details of the WPP results shown in Figure 14b,d, the following conclusions can be drawn. For the prediction of a wave process of wind power, LSTM performs well during the uphill stage. After the crest, however, the actual wind power trend changes from uphill to downhill, while the prediction curve keeps its upward inertia because of LSTM's memory, so the predicted value is higher than the actual value during the downhill stage. Because BLSTM predicts from both the historical and the future directions, it does not suffer from an advanced or lagging wind power crest, so its WPP accuracy is higher than that of LSTM. In other words, BLSTM reduces the phase errors of WPP better than LSTM does.
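One simple way to put a number on a phase error of this kind is to find the time shift that best aligns the predicted curve with the actual one. The sketch below is a hypothetical diagnostic, not a metric used in the paper: `best_lag` and the sinusoidal toy series are introduced here purely for illustration.

```python
import math


def best_lag(actual, predicted, max_lag):
    """Return the shift (in samples) of `predicted` that minimizes RMSE.

    A positive result means the prediction lags the actual series, that is,
    its crest arrives later than the real one.
    """
    def rmse_at(lag):
        pairs = [
            (actual[t], predicted[t + lag])
            for t in range(len(actual))
            if 0 <= t + lag < len(predicted)
        ]
        return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

    return min(range(-max_lag, max_lag + 1), key=rmse_at)


# Toy wave process: a 24-sample period, with the prediction delayed by 3 samples.
actual_series = [math.sin(2 * math.pi * t / 24) for t in range(48)]
lagging_prediction = [math.sin(2 * math.pi * (t - 3) / 24) for t in range(48)]

lag = best_lag(actual_series, lagging_prediction, max_lag=6)
print(lag)
```

A model with no phase error would give a lag of 0 under this measure; `max_lag` should stay well below the dominant period so that the alignment is unambiguous.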

Conclusions
A short-term WPP method for wind farm clusters based on SFFS feature selection and BLSTM deep learning is presented in this paper and validated with data from 20 wind farms. The conclusions are summarized as follows:
(1) Based on the data of the wind farm cluster and the 302,016 features constructed in this paper, the feature selection and validation results show that the WPP error of the wind farm cluster first drops sharply and then rises slowly as the number of features increases (Figure 8a). The optimal number of features and the optimal feature set differ with the timescale of WPP (Figure 10).
(2) The comparison of BIF-, mRMR-, and SFFS-based feature selection shows that the SFFS method selects more effective features than the other two methods.
(3) When about 130 features are selected by the SFFS method, the WPP accuracy already exceeds that obtained without feature construction and selection. When about 660 features are selected, the optimal accuracy is achieved, with an RMSE 0.37 percentage points lower than that obtained without feature construction and selection (Figure 9). Compared with no feature construction and selection, feature construction and selection reduces the errors of the different prediction models to different degrees (Figures 8-10).
(4) The statistical analysis of the optimal feature set shows that the following features are effective for the overall WPP modeling of wind farm clusters: wind speed at the hub height of the wind turbines, statistical features reflecting the overall situation of the wind farm cluster, low-frequency components among the frequency decomposition features, and so on (Figure 12).
(5) Based on SFFS feature selection, a short-term WPP model for wind farm clusters based on BLSTM is presented in this paper. The case study demonstrates that BLSTM achieves higher WPP accuracy than LSTM (Figure 13). Unlike LSTM, BLSTM predicts from both the historical and the future directions, which contributes to its outstanding performance in reducing phase errors (Figure 14).
