Short-Term Forecasting of Photovoltaic Power Generation Based on Feature Selection and Bias Compensation–LSTM Network

Abstract: In this paper, a hybrid model that considers both accuracy and efficiency is proposed to predict photovoltaic (PV) power generation. To achieve this, improved forward feature selection is applied to obtain the optimal feature set; it removes redundant information and retains related features, resulting in a significant improvement in forecasting accuracy and efficiency. Because the prediction error is irregularly distributed, a bias compensation–long short-term memory (BC–LSTM) network is proposed to minimize the prediction error. The experimental results show that the new feature selection method can improve the prediction accuracy by 0.6% and the calculation efficiency by 20% compared to using feature importance identification based on LightGBM. The BC–LSTM network can improve accuracy by 0.3% using about twice the time compared with the LSTM network, and the hybrid model can further improve prediction accuracy and efficiency based on the BC–LSTM network.


Introduction
Currently, the consumption of fossil energy is increasing with societal development, and solar energy has received growing attention worldwide over the past decade [1]. As more PV power plants are connected to the grid, they pose great challenges to its security and stability [2,3]. When the PV power output is predicted poorly, the spinning reserve of conventional power supplies has to be increased to ensure the safety of the power system. As a result, the space for renewable energy consumption is crowded out, which increases the amount of abandoned solar power. Accurate forecasting of PV power generation can provide a reliable basis for peak load and frequency regulation, power flow optimization, and equipment maintenance, as well as technical support for complementary and coordinated control with wind power; it is one of the key technologies for improving the grid's ability to accept PV power. Therefore, accurate and reliable PV forecasting techniques are needed to optimize operation costs and reduce uncertainties in power systems [4].
Approaches to PV power prediction are generally divided into two categories: stepwise methods and direct methods [5]. A stepwise forecast consists of two steps. First, solar radiation intensity [6][7][8][9][10] and temperature [11][12][13] are forecasted; these forecasts are then used as inputs to forecast the PV power. Unlike the stepwise method, a direct forecast predicts PV power from historical PV power data and meteorological information. Direct approaches are divided into three categories: (1) statistical model methods; (2) physical methods; and (3) artificial intelligence learning methods [14][15][16].
One common method to improve prediction efficiency and accuracy is to select the optimal feature set from the original data set [17][18][19][20]. One way to do so is to use the Pearson correlation coefficient to analyze the influence of various meteorological factors on the output of PV power generation [21]. The Pearson coefficient can capture linear relationships between features, but its ability to handle nonlinear or nonstationary problems is limited. To better handle such problems, some researchers proposed an adaptive hybrid predictor subset selection strategy to obtain the most relevant and nonredundant predictors for enhanced short-term forecasting [22]. The strategy chooses the optimal feature set from two aspects: (1) the correlation between features and PV power, to choose the most relevant feature set, and (2) the correlation between features, to select a nonredundant feature set. However, according to our research, some features do not affect the prediction results on their own, whereas a combination of certain features does. We call this a combination relationship between features. Based on the existing research, we propose a new feature selection method designed in consideration of all three aspects: relevance, redundancy, and the combination relationship.
Another common method to improve prediction accuracy is to change the structure of the model [23][24][25][26][27][28]. The convolutional neural network (CNN) embodies powerful capabilities in image processing [29]. In recent years, more and more CNN-based models have been used to forecast PV power and have achieved good results. The CNN network can extract the features of the original data well, but it is not good at dealing with timing problems. A recurrent neural network (RNN) is considered to be a more effective tool for time series data prediction. In previous work, some researchers demonstrated that RNN has better prediction performance than backpropagation NN (BPNN) and radial basis function NN (RBFNN) [30,31]. Unlike the aforementioned studies, some researchers introduced the concept of residuals into the model design [32]. By establishing an additional error prediction model, the prediction error is added back to the prediction result to obtain the final prediction result. This method cleverly introduces error compensation terms to improve prediction accuracy. Based on this, and considering the superiority of LSTM networks in time series problems, a hybrid model, the bias compensation-long short-term memory (BC-LSTM) network, is used to perform the PV power forecasting.
However, traditional bias compensation networks have the following problems. The actual power value and the power error term are not generated in the same way, and their influencing factors also differ. When both networks use the same meteorological data as input, the prediction accuracy is poor and the computational complexity is high. Therefore, a framework with a hybrid method combining feature selection and the BC-LSTM network is proposed, where the feature selection method is applied to the two LSTM networks to improve prediction accuracy and calculation efficiency.
In this study, a new method based on feature selection and the BC-LSTM network is proposed to forecast PV power. The main contributions are outlined as follows:
(1) Unlike the existing research, improved forward feature selection is proposed to obtain the optimal feature set.
(2) An LSTM network that takes bias compensation into account, called BC-LSTM, is proposed to optimize the performance of the model.
(3) A framework with a hybrid method combining feature selection and the BC-LSTM network is used to perform the PV power forecasting.
(4) In addition to the RMSE and MAE indicators, the variance and skewness indicators are introduced to evaluate the prediction model.
The remainder of this paper is organized as follows: in Section 2, the related methods are introduced in detail; in Section 3, the hybrid method combining feature selection and the BC-LSTM network is explained; in Section 4, the performance of the five models is analyzed and compared; and in Section 5, the work of this paper is summarized and the conclusions are provided.

Improved Forward Feature Selection
In the collected data set, some features may be irrelevant to, or only weakly correlated with, the PV output. It is therefore necessary to select an optimal feature set from the whole data set. Based on the analysis of many existing studies, a new feature selection method called improved forward feature selection (IFFS) is proposed, which consists of two steps:
Step 1: Sort the importance of the original feature set using LightGBM.
Step 2: Select the optimal feature set according to the algorithm flow.
LightGBM is a machine-learning algorithm based on the gradient boosting decision tree (GBDT), released by Microsoft Corporation in 2017 [33].
Consider a feature set with $n$ instances {$x_1$, ..., $x_n$}, where each $x_i$ is a vector. In each iteration of gradient boosting, the negative gradients of the loss function with respect to the output of the model are denoted as {$g_1$, ..., $g_n$}. The specific training steps for ordering the feature importance using LightGBM are as follows:
(1) Discretize continuous features into $k$ values and then generate a histogram with $k$ bins.
(2) Calculate the initial gradients, sort them in descending order, select the top $a \times 100\%$ of samples with large gradients, and randomly select $b \times (1-a)$ of the remaining small-gradient samples to form a new sample set (the small-gradient samples are multiplied by the weight coefficient $(1-a)/b$).
(3) Calculate the split gain according to Equations (1) and (2),
where $V_j(d)$ represents the split gain of the $j$th feature at node $d$, $A$ represents the large-gradient sample set obtained in step (2), $B$ represents the small-gradient sample set obtained in step (2), $d$ represents the node, $a$ and $b$ both represent sampling rates, and $g_i$ represents the gradient.
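For reference, the variance gain used by gradient-based one-side sampling in the LightGBM paper [33] can be written as below. This is a sketch taken from that paper and is not necessarily identical in form to Equations (1) and (2); the subscripts $l$ and $r$ (left and right child of the split) and the instance counts $n_l^j(d)$ and $n_r^j(d)$ are notation assumed here.

```latex
\tilde{V}_j(d) = \frac{1}{n}\left(
  \frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)}
  + \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}
\right),
\qquad
\begin{aligned}
A_l &= \{x_i \in A : x_{ij} \le d\}, & A_r &= \{x_i \in A : x_{ij} > d\},\\
B_l &= \{x_i \in B : x_{ij} \le d\}, & B_r &= \{x_i \in B : x_{ij} > d\}.
\end{aligned}
```

The factor $(1-a)/b$ rescales the sampled small-gradient instances so that their summed gradients stand in for the full small-gradient population.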
The feature importance is calculated based on the number of times the feature splits. The more times a feature is used for node splitting, the more important the feature is.
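The split-count ranking described above can be sketched in a few lines of Python. The tree representation (a list of feature names, one per split node) and the function name `rank_by_split_count` are hypothetical and chosen only for illustration; in the LightGBM Python package the corresponding quantity is, to our understanding, the one reported with `importance_type="split"`.

```python
from collections import Counter

def rank_by_split_count(trees):
    """Rank features by how often they are used for node splitting.

    `trees` is a hypothetical representation: one list per tree holding the
    name of the feature used at each split node of that tree.
    """
    counts = Counter(name for tree in trees for name in tree)
    # most_common() sorts by count in descending order
    return counts.most_common()

# Toy ensemble of three trees (illustrative data only)
toy_trees = [["F2", "F5", "F2"], ["F2", "F7"], ["F5", "F2"]]
ranking = rank_by_split_count(toy_trees)
print(ranking)
```

Here `F2` is used for four splits, `F5` for two, and `F7` for one, so `F2` ranks first.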
Based on the sorted feature set, IFFS is proposed for feature selection. It considers the combination relationship between features and improves computing efficiency, mainly through the following:
a. $F$ is the feature set sorted using LightGBM, and the more important features are more likely to be useful. Thus, $n$ counts the consecutive feature additions that fail to improve the metric, rather than the cumulative number of such additions. This enables the more important features to be selected preferentially, which can effectively improve the calculation efficiency while ensuring the effectiveness of the feature set.
b. The two thresholds $N_{max}$ and $K_{max}$ are used to control the calculation efficiency of the algorithm. Reasonably selecting the thresholds can ensure the validity of the feature set while improving the calculation efficiency.
c. If the network's metrics have not improved after a new feature is added, this feature is not abandoned directly. Instead, it is saved to the useless feature set. The useless feature set and the unselected feature set then together form a new feature set and enter the next cycle, which ensures, to some extent, that the selected feature set is more useful.
d. When $K$ increases, the candidate feature set is randomly shuffled so that every feature has the same probability of being selected. This accounts for the combination relationship between strong and weak features, which further optimizes the feature selection result.
The pseudocode of IFFS is shown in Algorithm 1:
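A minimal Python sketch of one possible reading of points a to d follows. It is not the authors' Algorithm 1: the function names `iffs` and `toy_metric`, the treatment of $K_{max}$ as the number of passes over the candidates, and the treatment of $N_{max}$ as the limit on consecutive non-improving additions within a pass are all assumptions made for illustration. The `evaluate` callback stands in for training the evaluation network and returning a metric to minimize (such as MAE).

```python
import random

def iffs(sorted_features, evaluate, n_max=5, k_max=3, seed=0):
    """Hypothetical IFFS sketch; `evaluate(features)` returns a metric to minimize."""
    rng = random.Random(seed)
    selected, best = [], float("inf")
    candidates = list(sorted_features)      # first pass keeps the importance order
    for k in range(k_max):
        if k > 0:
            rng.shuffle(candidates)         # point d: shuffle when K increases
        useless, n, idx = [], 0, 0
        while idx < len(candidates) and n < n_max:
            feature = candidates[idx]
            idx += 1
            score = evaluate(selected + [feature])
            if score < best:                # metric improved: keep the feature
                selected.append(feature)
                best, n = score, 0          # point a: reset the consecutive counter
            else:                           # point c: park it, do not discard it
                useless.append(feature)
                n += 1
        # useless + untried features form the candidate set of the next cycle
        candidates = useless + candidates[idx:]
        if not candidates:
            break
    return selected, best

def toy_metric(features):
    """Toy stand-in for a validation MAE: 'hum' only helps once 'wind' is present."""
    s = set(features)
    score = 100
    if "ghi" in s:
        score -= 30
    if "temp" in s:
        score -= 20
    if "wind" in s:
        score -= 5
    if "hum" in s and "wind" in s:
        score -= 10
    return score

selected, best = iffs(["ghi", "temp", "hum", "noise", "wind"], toy_metric)
print(selected, best)
```

In this toy run, `hum` is rejected in the first pass because `wind` has not yet been selected, and it is picked up in the second pass. This is the kind of combination relationship that retaining the useless set is meant to capture.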

BC-LSTM Network
The LSTM network was proposed by Hochreiter and Schmidhuber in 1997 to address the long-term dependency problem through targeted design [34]. It is mainly composed of an input layer, a hidden layer, and an output layer. Its specific structure is shown in Figure 1.
The LSTM unit receives the current input $X_t$ together with the hidden state $h_{t-1}$ and the cell state $C_{t-1}$ from the previous time step. It first determines which information to clear through the forget gate, then adds new information to the cell state through the input gate and updates the cell state. This process is mainly controlled by the sigmoid function and the tanh function and forms the cell state $C_t$. Finally, the output gate determines which information in the cell is output: the memory cell state $C_t$ is passed through the activation function, and the output gate dynamically controls it to form the final output $h_t$ of the LSTM cell at time $t$.
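According to Figure 1, Equation (3) can be obtained:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t, \qquad h_t = O_t * \tanh(C_t) \tag{3}$$
where '*' denotes the Hadamard product and $\tilde{C}_t$ is the candidate cell state. Further considering the weights $W$ and the offsets $b$ of the input, output, and forget gates, Equation (4) can be obtained:
$$\begin{aligned}
i_t &= \sigma\left(W_{i,x} X_t + W_{i,h} h_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_{f,x} X_t + W_{f,h} h_{t-1} + b_f\right)\\
O_t &= \sigma\left(W_{o,x} X_t + W_{o,h} h_{t-1} + b_o\right)\\
\tilde{C}_t &= \tanh\left(W_{C,x} X_t + W_{C,h} h_{t-1} + b_C\right)
\end{aligned} \tag{4}$$
where $\sigma(\cdot)$ is the logistic sigmoid function; $W_{i,x}$, $W_{f,x}$, $W_{o,x}$, and $W_{C,x}$ are the four weight matrices applied to the input; and $W_{i,h}$, $W_{f,h}$, $W_{o,h}$, and $W_{C,h}$ are the four weight matrices applied to the output of the previous time step. Additionally, $b_i$, $b_f$, $b_o$, and $b_C$ are four bias vectors, and $f_t$, $i_t$, and $O_t$ are the activations of the forget, input, and output gates, respectively.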
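As an illustration of the gate sequence just described, the following sketch performs one time step of a single scalar LSTM unit using the standard LSTM formulation. The function name `lstm_step`, the dictionary keys, and the scalar (rather than matrix) weights are simplifying assumptions; a practical model would use a deep-learning library instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One step of a scalar LSTM unit; `w` holds hypothetical scalar weights and biases."""
    f_t = sigmoid(w["f_x"] * x_t + w["f_h"] * h_prev + w["b_f"])      # forget gate
    i_t = sigmoid(w["i_x"] * x_t + w["i_h"] * h_prev + w["b_i"])      # input gate
    o_t = sigmoid(w["o_x"] * x_t + w["o_h"] * h_prev + w["b_o"])      # output gate
    c_hat = math.tanh(w["C_x"] * x_t + w["C_h"] * h_prev + w["b_C"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat      # forget old information, add new information
    h_t = o_t * math.tanh(c_t)            # gated output of the cell
    return h_t, c_t

keys = ["f_x", "f_h", "b_f", "i_x", "i_h", "b_i",
        "o_x", "o_h", "b_o", "C_x", "C_h", "b_C"]
zero_weights = {key: 0.0 for key in keys}

# With all weights at zero every gate equals 0.5 and the candidate state is 0
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, w=zero_weights)
print(h1, c1)
```

With zero weights the forget gate halves the previous cell state, which makes the role of each gate easy to check by hand.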
Comparing the predicted PV power output and the actual power output, it can be found that the prediction error is irregularly distributed. Based on this fact, BC-LSTM is proposed, which builds an additional model (called the bias compensation network) to predict the prediction bias to minimize the overall prediction error. This method uses the error compensation term to improve the prediction accuracy. The structure of the BC-LSTM network is shown in Figure 2.

Performance of the Hybrid Method
This section presents the framework of the hybrid method, which combines IFFS and the BC-LSTM network for PV forecasting.
The BC-LSTM network uses two LSTM networks to predict the actual PV power and the PV power error, respectively. The biggest difference between the hybrid method and the traditional BC-LSTM network is that the new feature selection method is used to select separate feature sets for the prediction network and the bias compensation network. The actual power value and the power error term are not generated in the same way, and their influencing factors also differ. When both networks use the same meteorological data as input, the prediction accuracy is poor and the computational complexity is high. Therefore, a framework for the hybrid method combining feature selection and the BC-LSTM network is proposed, where IFFS is applied to the two LSTM networks to improve prediction accuracy and calculation efficiency. The specific implementation process is shown in Figure 3.
Step one: Collect raw data and perform the data processing. The data processing procedure is composed of two parts: data cleaning and data preparation. Data cleaning covers outlier detection and missing value filling: outliers are first detected and removed using isolation forests, and the missing values are then filled in using KNN. In data preparation, new features are constructed based on the original features.
Step two: Construct the regression model based on LightGBM to obtain the feature importance identification. Then, obtain the optimal feature set according to the algorithm flow in Section 2.1.
Step three: Split the preprocessed data set into training and validation sets. Construct the regression model based on the LSTM network and initialize the parameters. At time step $t$ of the LSTM, update the cell state based on the input at time $t$ and the state at the previous time $t-1$. The target at the $m$th iteration is to update the parameters so as to minimize the loss function over the training samples, where $x_i$ is the $i$th sample and $y_i$ is the expected result of the $i$th sample.
Step four: After k iterations, or after the loss function no longer decreases, output the final trained model, LSTM network 1, and the predicted results.
Step five: Subtract the predicted data from the original data to get the error set. The error set and the processed feature set form a new data set. Repeat steps two through four on the new data set to get the final trained model, LSTM network 2, and the predicted error results.
Step six: Add the predicted error results from step five to the predicted results from step four to form the final prediction results, and then calculate the metrics.
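The two-stage procedure in steps three through six can be sketched as follows. To keep the example self-contained, both LSTM networks are replaced by one-variable least-squares lines, so this shows only the bias compensation logic and not the authors' model. The feature names `irradiance` and `cloud` and the toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Least-squares line through (x, y); a stand-in for training one network."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda v: intercept + slope * v

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy data: power depends on two features (illustrative values only)
irradiance = [0.0, 1.0, 2.0, 3.0]   # feature selected for the prediction network
cloud = [1.0, -1.0, -1.0, 1.0]      # feature selected for the error network
power = [8.0, 4.0, 6.0, 14.0]

# Steps three and four: train network 1 and predict the power
net1 = fit_line(irradiance, power)
base = [net1(v) for v in irradiance]

# Step five: error set = actual - predicted; train network 2 on the errors
errors = [p - b for p, b in zip(power, base)]
net2 = fit_line(cloud, errors)
predicted_errors = [net2(v) for v in cloud]

# Step six: final prediction = power prediction + predicted error
final = [b + e for b, e in zip(base, predicted_errors)]
print(mae(power, base), mae(power, final))
```

In this toy case the error left by the first model is fully explained by the second feature, so the compensated MAE drops from 3 to 0. With real data the gain depends on how well the error network predicts the residuals.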

Data Description and Implementation Environment
Two years (1 January 2017 to 31 December 2018) of NWP data from the no. 24 PV power plant located in Ningxia Zhongwei, China, were selected for this study. The data resolution is 15 min, and the prediction horizon is one day ahead, with a total of 20-dimensional original features. All data have been preprocessed (the specific operations are shown in Section 3). The features after the data processing procedure are shown in Table 1. To better assess the generalization ability, the data from 2017 and from January, February, April, May, July, August, October, and November of 2018 are used as the training data set, and the data from March, June, September, and December of 2018 are used as the validation data set. Because power generation at night is almost zero, the night data are excluded from the training and validation data sets. All experiments were run in a Python 3.6 environment in Anaconda, accelerated by an NVIDIA GeForce RTX 2080 Ti GPU, with an Intel(R) Xeon(R) E5-2680 v3 CPU @ 2.50 GHz and 8 GB of memory. Three tasks need to be completed in this case study:
(1) Verify the effectiveness of IFFS.
(2) Verify the superiority of the BC-LSTM network compared to the traditional LSTM network.
(3) Verify the superiority of the hybrid method compared to the BC-LSTM network.
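The month-based split described above can be sketched as follows. The function name `split_by_month` is hypothetical, and daily timestamps are used instead of the 15 min resolution to keep the example small. The exclusion of night data is not shown because the criterion for it is not stated.

```python
from datetime import datetime, timedelta

VALIDATION_MONTHS = {3, 6, 9, 12}   # March, June, September, December of 2018

def split_by_month(timestamps):
    """Assign each timestamp to the training or validation set by year and month."""
    train, validation = [], []
    for ts in timestamps:
        if ts.year == 2018 and ts.month in VALIDATION_MONTHS:
            validation.append(ts)
        else:
            train.append(ts)
    return train, validation

# One timestamp per day from 1 January 2017 to 31 December 2018
start = datetime(2017, 1, 1)
stamps = [start + timedelta(days=d) for d in range(730)]
train, validation = split_by_month(stamps)
print(len(train), len(validation))
```

Holding out one month per season in this way keeps all four seasons represented in the validation set.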

Feature Selection Result-LightGBM
According to the ranking results for feature importance in Figure 4, 23 features are selected after excluding {$F_1$, $F_{16}$, $F_{22}$, $F_{23}$, $F_{25}$, $F_{26}$} to form a new feature set $X_I$ as the comparative feature set.

Feature Selection Result-IFFS
The features applied to the forecasting model are selected on the basis of the feature importance ranking. The evaluation network used is an LSTM network, and the evaluation metric is MAE. After many attempts, the prediction effect was found to be optimal when $N_{max}$ is 5 and $K_{max}$ is 3. The specific metric change results are shown in Figure 5. Sixteen features are selected after excluding {$F_1$, $F_9$, $F_{10}$, $F_{11}$, $F_{15}$, $F_{16}$, $F_{21}$, $F_{22}$, $F_{23}$, $F_{24}$, $F_{25}$, $F_{26}$, $F_{29}$} to form a new feature set $X_S$ as the optimal feature set.
To illustrate the effectiveness of IFFS, the following three experimental groups are set up for comparative analysis:
a. Feature set $X_O$ containing 29 original features.
b. Feature set $X_I$ containing 23 original features selected from the LightGBM feature importance sorting results.
c. Feature set $X_S$ obtained using IFFS.

Comparing Different Methods
To address the three tasks listed in Section 4.1, the following experimental schemes are designed for comparative analysis:
Scheme (1): The feature set $X_O$ is used as the input of a standard LSTM network, referred to as the LSTM ($X_O$) network.
Scheme (2): The feature set $X_I$ is used as the input of a standard LSTM network, referred to as the LSTM ($X_I$) network.
Scheme (3): The feature set $X_S$ is used as the input of a standard LSTM network, referred to as the LSTM ($X_S$) network.
Scheme (4): The feature set $X_S$ is used as the input of the BC-LSTM network, referred to as the BC-LSTM ($X_S$) network.
Scheme (5): The feature sets $X_S$ and $X_R$ are used as the inputs of the prediction network and the error network of the BC-LSTM network, respectively, referred to as the BC-LSTM ($X_S$ + $X_R$) network.
Note: $X_O$, $X_I$, and $X_S$ are the feature sets obtained in Section 4.2, and $X_R$ is the feature set selected for the error network using IFFS, with the result $X_R$: {$F_4$, $F_5$, $F_6$, $F_7$, $F_8$, $F_{10}$, $F_{14}$, $F_{18}$, $F_{19}$, $F_{20}$, $F_{21}$, $F_{25}$, $F_{29}$}. The evaluation metrics of the five schemes are RMSE, MAE, and $R^2_{adjusted}$, and the training and validation sets are divided as shown in Section 4.1. RMSE, MAE, and $R^2_{adjusted}$ are defined as follows.
The root-mean-square error (RMSE) measures the difference between the actual values and the forecast values. It is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{f,i} - y_{a,i}\right)^2}$$
The mean absolute error (MAE) is the average of the absolute errors. It is defined as
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_{f,i} - y_{a,i}\right|$$
The adjusted R-square ($R^2_{adjusted}$) measures how well the predictive model fits and reflects whether the prediction deviates from reality; it takes values in [0, 1]. If $R^2 = 0$, the model fits poorly, and if $R^2 = 1$, the model has no errors. However, as the number of independent variables increases, $R^2$ will continue to increase. To eliminate the impact of the number of features, we introduce the $R^2_{adjusted}$ indicator, which is defined as
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_{a,i} - y_{f,i}\right)^2}{\sum_{i=1}^{N}\left(y_{a,i} - y_m\right)^2}, \qquad R^2_{adjusted} = 1 - \frac{\left(1 - R^2\right)\left(N - 1\right)}{N - p - 1}$$
where $N$ is the number of observations, $p$ is the number of features, $y_f$ is the forecast value, $y_a$ is the actual value, and $y_m$ is the mean of the actual values. For fairness of comparison, when the evaluation metrics are RMSE and MAE, all network parameters are optimized by the GridSearch method, as shown in Table 2. When the evaluation metric is training time, the parameter settings of all the networks are the same.
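The three accuracy metrics can be computed as in the sketch below, which uses the standard textbook definitions of RMSE, MAE, and adjusted R-square (assumed here to match the ones used in the experiments). The function names and the toy values are illustrative only.

```python
import math

def rmse(y_a, y_f):
    """Root-mean-square error between actual (y_a) and forecast (y_f) values."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y_a, y_f)) / len(y_a))

def mae(y_a, y_f):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(y_a, y_f)) / len(y_a)

def r2_adjusted(y_a, y_f, p):
    """Adjusted R-square for N observations and p features."""
    n = len(y_a)
    y_m = sum(y_a) / n
    ss_res = sum((a - f) ** 2 for a, f in zip(y_a, y_f))
    ss_tot = sum((a - y_m) ** 2 for a in y_a)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

y_actual = [1.0, 2.0, 3.0, 4.0, 5.0]
y_forecast = [1.0, 2.0, 3.0, 4.0, 7.0]   # a single error of 2 on the last point
print(rmse(y_actual, y_forecast), mae(y_actual, y_forecast),
      r2_adjusted(y_actual, y_forecast, p=1))
```

For the same $R^2$, increasing `p` lowers the adjusted value, which is what allows feature sets of different sizes to be compared.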
To better illustrate the differences between the models, the results are presented in two ways:
(1) Data from different months are selected for display in Figure 6, where the x-axis represents the period and the y-axis denotes the normalized power values.
(2) A histogram is used to show the forecast error distribution, and its skewness and variance are calculated, as shown in Figure 7. They are calculated as follows:
$$\sigma_x = E\left[\left(x - E[x]\right)^2\right], \qquad S(x) = E\left[\left(\frac{x - E[x]}{\sqrt{\sigma_x}}\right)^3\right]$$
where $\sigma_x$ represents the variance of $x$, $E$ represents the expectation, and $S(x)$ represents the skewness.
Table 2. Feature selection result.
Comparing the experimental results in Figure 6, for dates with relatively stable output power, both the LSTM and BC-LSTM networks can achieve satisfactory prediction results. For dates with strong volatility or cloudy days, the BC-LSTM network has stronger prediction capabilities than the traditional LSTM network.
For weakly volatile dates, the power has a strong regularity and the prediction error is small, so the accuracy gain from the bias compensation network is limited. For dates with large power fluctuations, the difference between the prediction result and the actual power is large. Therefore, constructing a reasonable and effective error network to predict the error value can effectively improve the prediction accuracy, and this is why BC-LSTM performs better than traditional LSTM networks.
Energies 2021, 14, 3086
In power system dispatch, the reserve capacity needs to be reduced as much as possible. If the predicted power is greater than the true power, the reserve capacity must be increased; otherwise, some PV power could be abandoned. Therefore, for the same MAE and RMSE, a prediction result with skewness less than 0 is better. It can be seen from Figure 7 that the skewness of the LSTM network is about 0, while the skewness of the BC-LSTM network is less than 0, which means that the BC-LSTM network is more conducive to power system dispatch and reserve capacity reduction.
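The variance and skewness of an error sample can be computed as in the following sketch, which uses the population definitions (third central moment divided by the variance raised to the power 3/2). The function name and the toy error values are hypothetical, and no sign convention for the error (forecast minus actual, or the reverse) is assumed here.

```python
def variance_and_skewness(errors):
    """Population variance and skewness of a list of forecast errors."""
    n = len(errors)
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / n
    third_moment = sum((e - mean) ** 3 for e in errors) / n
    skewness = third_moment / variance ** 1.5
    return variance, skewness

# Toy error sample with a long left tail (illustrative values only)
toy_errors = [-3.0, 0.0, 1.0, 1.0, 1.0]
variance, skewness = variance_and_skewness(toy_errors)
print(variance, skewness)
```

The single large negative error dominates the third moment, so the skewness of this sample is negative even though most errors are positive.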
The evaluation metrics of each experimental group were obtained in Section 4.3. After further analysis of the experimental results (Table 3), the following conclusions can be drawn. Comparing the results of Schemes (1)-(3), using IFFS improves the MAE, the RMSE, and the time efficiency, which illustrates, to some extent, the effectiveness of IFFS. Compared with the LightGBM method, IFFS can improve the prediction accuracy by 0.67% and the calculation efficiency by 20%. The main difference between feature sets $X_S$ and $X_I$ is that more forward-difference feature terms are excluded from $X_S$. PV power is mainly related to the NWP data themselves, whereas the forward differences of the NWP data are mainly related to power fluctuations. Thus, excluding these features can improve both the calculation efficiency and the prediction accuracy. This is consistent with the result of IFFS, which, to a certain extent, further illustrates its effectiveness. Moreover, the number of features used in this paper is 29. It is expected that, when the number of features increases, the improvement in time efficiency and accuracy will be greater.
In comparing Schemes (3)-(5), Scheme (4) improves the accuracy slightly, by about 0.27% compared with Scheme (3), but its time efficiency is about half that of Scheme (3). Relative to Scheme (3), Scheme (5) achieves a greater improvement in accuracy than Scheme (4) while taking less time than Scheme (4). The strength of BC-LSTM is that it uses two networks to predict the power and the power error, respectively, and then combines them into the final prediction result. Thus, the accuracy of its prediction result depends on the accuracy of both networks. Compared with Scheme (3), Scheme (4) sacrifices a large amount of computational efficiency in exchange for only a slight improvement in accuracy, because the prediction accuracy of its error network is not good. The experimental results of Schemes (4) and (5) also show that the hybrid method, combining IFFS and the BC-LSTM network, can achieve higher accuracy and computational efficiency than the BC-LSTM network without the feature selection.

Conclusions
A framework of the hybrid method combining feature selection and the BC-LSTM network is used to perform PV power forecasting. The proposed methods are applied to solve the actual forecast case at the Ningxia Zhongwei no. 24 PV Power Plant in China. Three verification metrics, RMSE, MAE, and training time, are used to evaluate prediction accuracy and calculation efficiency.
The conclusions are summarized as follows:
(1) The optimal feature set for PV power prediction can be selected based on IFFS. Compared with LightGBM, IFFS can improve the prediction accuracy by 0.67% and the calculation efficiency by 20%.
(2) The BC-LSTM network is an effective method for predicting fluctuating PV power. Using the same feature set as input, the BC-LSTM network improves prediction accuracy by about 0.27% over the traditional LSTM network and is more conducive to power system dispatch and reserve capacity reduction, but it takes more time for training and prediction than the LSTM method.
(3) The hybrid method combining IFFS feature selection and the BC-LSTM network demonstrates significant advantages for PV power prediction: it improves the prediction accuracy by 0.9% and the calculation efficiency by 10% compared with the BC-LSTM network.
In summary, the hybrid method obtains high-precision prediction results with minimal training time. These results fully demonstrate that the hybrid method has the ability to obtain PV power prediction results with excellent performance for accuracy and calculation efficiency.
There are still many aspects that are worthy of further verification.
(1) In this article, IFFS is applied to feature selection for PV power prediction, and the prediction accuracy is improved. In the future, more transformation features for PV power prediction could be constructed, and further feature selection could be carried out based on the proposed method.

Conflicts of Interest:
The authors declare no conflict of interest.