1. Introduction
Climate change, air pollution, and the energy security problems caused by the use of traditional energy, together with the continuous reduction in fossil energy reserves such as coal, oil, and gas and the continuous rise in global power demand, have led many countries to vigorously promote the development of renewable energy [1]. Solar energy has many advantages, such as abundant reserves, wide distribution, and cleanliness, and its efficient utilization has received extensive attention [2]. As a clean and renewable form of energy, photovoltaic (PV) power generation has great development potential and application prospects. With the continuous advancement of technology and continued policy support, PV power generation will play an increasingly important role in the global energy transformation.
As more and more large-scale PV power stations are connected to the power grid, the intermittency and volatility of PV power generation pose a great threat to the stable balance and reliable operation of the regional power grid. Using meteorological data and historical PV operation data, the system output can be predicted in advance. On the basis of this prediction, advanced dispatching control technology can be used to stabilize or even eliminate the effects of the fluctuation and intermittency of PV power generation. Therefore, establishing an accurate output prediction model for large-scale PV power stations is of great significance for the wider adoption of PV power stations.
PV prediction models can be divided into direct prediction [3] and indirect prediction [4] according to the prediction object. Direct prediction predicts the output of PV power generation directly, while indirect prediction first predicts the meteorological factors related to PV power generation, such as irradiance, and then calculates the output of PV power generation according to a photoelectric conversion model.
In terms of the prediction time scale, PV power generation prediction can be divided into ultra-short-term prediction (0~6 h), short-term prediction (6 h~1 day), and long-term prediction (one month~one year) [5]. Ultra-short-term forecasting is critical for real-time system scheduling, rapid demand response, and grid stability [6,7,8]. Short-term prediction results are mainly used for load balancing and power dispatching [9]. Long-term prediction estimates the future long-term PV power generation from a statistical analysis of historical data, such as long-term meteorology and solar radiation in a given area. Its results can be applied to energy planning [10] and benefit evaluation.
In recent years, many PV power prediction models have been developed based on deep learning [11]. For short-term prediction problems, Wang et al. [12] conducted day-ahead PV power prediction based on the LSTM-RNN model and time correlation modification under the partial daily pattern prediction framework. Li et al. [13] proposed a method for centralized PV plants based on LSTNet-Attention. Wu et al. [14] predicted the output power of a PV station in Australia by combining a deep learning model with trend feature extraction and feature selection. Jakoplic et al. [15] realized short-term PV power plant output forecasting by using sky images and deep learning. In [16], a deep learning framework based on 7.5-min-ahead and 15-min-ahead approaches was introduced to predict short-term PV power. In addition to short-term prediction, Jung et al. [17] conducted long-term power forecasting for PV systems in South Korea based on long short-term memory recurrent neural networks. However, the processes of these deep learning models are relatively complicated, and the cost of practical application is high. Moreover, high-quality training data are required. In practical applications, when the data contain outliers, the accuracy of the prediction results is affected to some extent, even after complex data preprocessing.
As a highly effective and widely used machine learning algorithm, the gradient-boosting decision tree (GBDT) has been continuously researched and developed in recent years. The GBDT is based on ensemble learning. It has the advantages of strong prediction ability, good adaptability to data, strong interpretability, strong resistance to overfitting, and good scalability. The GBDT model can use parallel computation in the training stage and does not require a large number of parameters, so it has high training efficiency. However, it has rarely been studied in the field of PV power generation prediction. Compared with prediction methods based on deep learning, a prediction method based on the GBDT can accommodate some outliers, so overly complex data preprocessing is not needed. Model training is also more efficient and concise, so the prediction cost is lower, which is conducive to the promotion and application of the method in practice.
In this paper, the GBDT is introduced into the output prediction of large-scale PV power stations, and an output prediction method for large-scale PV power stations based on the GBDT is proposed. The method first collects the original data and then establishes the experimental sample set through data interpolation, data supplementation, and data integration. The data are then further preprocessed through data cleaning and normalization, and the preprocessed data are used to train the model so as to establish a PV output prediction model. Based on the test samples and the trained model, the PV output is predicted. Finally, the prediction results are imported into the error analysis module to quantitatively evaluate the model performance.
The remainder of this paper is organized as follows: The second section introduces the specific implementation steps of the proposed PV output prediction method. In the third section, the proposed method is analyzed and verified through a case study. The fourth section presents the conclusions of this study.
2. PV Output Prediction Method Based on GBDT
2.1. GBDT Model
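The GBDT regression algorithm is described as follows:
Let $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the training set, where $x_i \in X$, and $X$ is the input sample space; $x_i$ is the input feature; $y_i \in Y$, and $Y$ is the output feature.
1. Initialize the learner:
$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c) \tag{1}$$
where $L(y_i, c)$ is the loss function, $i = 1, 2, \ldots, N$; and $c$ is a constant that minimizes the loss function.
2. Establish $M$ classification and regression trees (CART), $m = 1, 2, \ldots, M$.
The pseudo residual corresponding to the $m$-th tree is calculated for sample $i$ as follows:
$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)} \tag{2}$$
By using a regression tree to fit the data $(x_i, r_{mi})$, the leaf node areas $R_{mj}$ corresponding to the $m$-th tree are obtained, where $j = 1, 2, \ldots, J_m$, and $J_m$ is the number of leaf nodes of the $m$-th regression tree.
Calculate the best fitting value for each leaf node area $j$:
$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c) \tag{3}$$
Update the learner $f_m(x)$:
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} c_{mj} I(x \in R_{mj}) \tag{4}$$
where $I(x \in R_{mj})$ is an indicator function: it equals 1 if the sample observation point falls into the area $R_{mj}$, and 0 otherwise.
3. The expression of the final learner is as follows:
$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J_m} c_{mj} I(x \in R_{mj}) \tag{5}$$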
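To make the boosting loop concrete, the following is a minimal sketch of the algorithm above, not the implementation used in this paper. It assumes the squared-error loss $L(y, f) = \frac{1}{2}(y - f)^2$, for which the initial constant is the mean of the outputs, the pseudo residual is $y_i - f_{m-1}(x_i)$, and the best fitting value of a leaf is the mean residual in that leaf. It also assumes a single input feature and two-leaf trees (stumps). The function names and the `learning_rate` parameter are illustrative assumptions; `learning_rate=1.0` corresponds to the update without shrinkage described above.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree to the pseudo residuals.

    Returns (threshold, left_value, right_value) with the lowest squared error.
    """
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        c_left, c_right = mean(left), mean(right)  # best leaf values for squared loss
        sse = sum((r - c_left) ** 2 for r in left) + sum((r - c_right) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, c_left, c_right)
    if best is None:  # all inputs identical: no split is possible
        return xs[0], mean(residuals), mean(residuals)
    return best[1:]


def fit_gbdt(xs, ys, n_trees=20, learning_rate=1.0):
    """Train the ensemble: initial constant, then n_trees residual-fitting stumps."""
    f0 = mean(ys)  # minimizer of the squared loss over constants
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # pseudo residuals
        threshold, c_left, c_right = fit_stump(xs, residuals)
        stumps.append((threshold, c_left, c_right))
        preds = [
            p + learning_rate * (c_left if x <= threshold else c_right)
            for x, p in zip(xs, preds)
        ]
    return f0, stumps, learning_rate


def predict_gbdt(model, x):
    """Evaluate the final learner: initial constant plus all leaf corrections."""
    f0, stumps, learning_rate = model
    return f0 + sum(learning_rate * (cl if x <= t else cr) for t, cl, cr in stumps)


# Toy data (hypothetical): a normalized input and a nonlinear normalized output.
xs = [i / 10 for i in range(11)]
ys = [x ** 2 for x in xs]
model = fit_gbdt(xs, ys)
print([round(predict_gbdt(model, x), 3) for x in xs])
```

Each stump is fitted to what the previous learner still gets wrong, so the training error cannot increase from one iteration to the next. In practice, deeper trees and several input features would be used.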
2.2. PV Output Prediction Method Based on GBDT
The specific steps of the PV output prediction method based on the GBDT model are shown in Figure 1 and described as follows:
1. Collect the PV output data and numerical weather forecast data.
2. Establish the experimental sample set.
The weather forecast data with a period of 15 min are obtained by linear interpolation. The total astronomical solar irradiance is calculated at the time points of the PV output sequence. Through time matching, the PV output data, numerical weather forecast data, and total astronomical solar irradiance data are integrated to establish the experimental sample set.
3. Data preprocessing.
Clean the abnormal samples and construct the sample set under the normal operation state of the system. Based on the following formula, the input and output eigenvalues of the prediction model are normalized by Max–Min normalization, and the sample set is then divided into the training sample set and the test sample set.
$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{6}$$
where $x^{*}$ is the normalized input or output eigenvalue, $x$ is the original eigenvalue, $x_{\max}$ is the maximum eigenvalue, and $x_{\min}$ is the minimum eigenvalue.
4. Model training.
The output of the PV system is used as the model output, and the numerical weather forecast characteristics and the total solar irradiance are used as the model input. The GBDT model is trained based on the training sample set, and the hyperparameters are adjusted to prevent overfitting.
5. Output sequence prediction.
Based on the test sample set and the GBDT model, the PV output sequence during the test period is predicted.
6. Model performance evaluation.
The test sample set and the PV output prediction results are imported into the error analysis module, and the model performance is quantitatively evaluated by comparison. In this paper, the error indicators are the normalized mean absolute error (nMAE) and the normalized root mean square error (nRMSE), as shown in Formula (7) and Formula (8).
$$\mathrm{nMAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \tag{7}$$
$$\mathrm{nRMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2} \tag{8}$$
where $N$ is the test sample size, $\hat{y}_i$ is the predicted value of the normalized output, and $y_i$ is the actual value of the normalized output.
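The data-handling steps above (linear interpolation to a 15 min period, Max–Min normalization, and the nMAE and nRMSE error indicators) can be sketched in a few lines. This is an illustrative sketch, not the code used in this study. It assumes that the weather forecast arrives at an hourly resolution, which the paper does not state, and that the error indicators are the mean absolute error and root mean square error computed directly on normalized values. All function names are hypothetical.

```python
import math


def interpolate_to_15min(hourly_values):
    """Linearly interpolate hourly values onto a 15-min grid (4 steps per hour).

    Assumption: the source series is hourly; the last point is kept as is.
    """
    out = []
    for a, b in zip(hourly_values, hourly_values[1:]):
        for k in range(4):
            out.append(a + (b - a) * k / 4.0)
    out.append(hourly_values[-1])
    return out


def min_max_normalize(values):
    """Scale values to [0, 1] using the minimum and maximum of the series."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def nmae(predicted, actual):
    """Mean absolute error between normalized predicted and actual outputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def nrmse(predicted, actual):
    """Root mean square error between normalized predicted and actual outputs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Hypothetical hourly irradiance forecast (W/m^2) and normalized outputs.
irradiance_15min = interpolate_to_15min([0.0, 200.0, 480.0])
irradiance_norm = min_max_normalize(irradiance_15min)
print(nmae([0.5, 1.0], [0.0, 1.0]), nrmse([0.5, 1.0], [0.0, 1.0]))
```

When this is applied to separate training and test sets, a common design choice is to take the minimum and maximum from the training set only, so that no information from the test period enters the model inputs.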
2.3. Reference Model
In this paper, output prediction methods based on the persistence prediction model, the random forest regression (RFR) model, and the support vector machine (SVM) model are compared with the proposed method so as to verify its feasibility and accuracy.
The persistence prediction model takes the historical output value at the same time point as the predicted output value. The prediction process is simple, but the error is large. The RFR model [18] is an improved bagged regression tree model [19,20,21,22], which is one of the most effective machine learning prediction algorithms [23]. The RFR model establishes a sample set through the Bootstrap sampling strategy, which generates a new sample set, called an in-bag sample set, from the existing training sample set by repeated random sampling with replacement. Each in-bag sample set is used to train one classification and regression tree (CART) of the RFR model by recursive binary splitting. These steps are repeated until the required number of CART regression trees has been generated, and together these trees form the trained RFR model. The prediction result of the RFR model is the average of the prediction results of these CART regression trees on the test data. The SVM is a supervised learning model that is widely used in classification and regression tasks. By selecting appropriate kernel functions and parameters, the SVM can deal with nonlinearly separable data in a high-dimensional space with high accuracy and strong generalization ability. In classification, its working principle is to find a hyperplane that separates the different types of data points so that the margin between the two types of data is maximized.
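Two of the reference mechanisms described above are simple enough to sketch directly: the persistence forecast, and the Bootstrap sampling and averaging used by the RFR model. The sketch below is illustrative only. It assumes that "the same time point" means the same time on the previous day and that one day contains 96 samples (a 15 min period); neither detail is specified for the persistence model in this paper. The function names are hypothetical.

```python
import random


def persistence_forecast(history, steps_per_day=96):
    """Predict the next day as a copy of the last measured day.

    Assumption: each predicted point equals the value at the same time
    on the previous day.
    """
    if len(history) < steps_per_day:
        raise ValueError("need at least one full day of history")
    return list(history[-steps_per_day:])


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns (in_bag, out_of_bag): the sampled indices, which may repeat,
    and the indices that were never drawn.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def bagged_prediction(tree_predictions):
    """Average the per-tree predictions, one list of predictions per tree."""
    n_trees = len(tree_predictions)
    return [sum(column) / n_trees for column in zip(*tree_predictions)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
print(len(in_bag), len(out_of_bag))
print(bagged_prediction([[0.2, 0.4], [0.4, 0.8]]))
```

Because sampling is done with replacement, some training samples appear several times in an in-bag set and others not at all, which is what makes the individual trees differ from one another.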
3. Results
In this paper, a PV power station in Guangxi, China, is selected for a case study. The data collection period is from 1 August 2023 to 29 February 2024. The data of the first 6 months were used as the training set, and the data of the last month were used as the test set. After the night-time data, when there is no solar radiation, are deleted from the total set, the meteorological data contain no null values, while the measured output data of the inverters of the PV power station contain some null values. The statistical results show that, across the 780 inverters, the proportion of null values ranges from a minimum of 6.3% to a maximum of 10.1%.
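The chronological split and the null-value statistics described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the boundary between the training and test periods is taken to be 1 February 2024 (inferred from "the last month" of the collection period), records are `(timestamp, output)` tuples, and a missing inverter measurement is represented by `None`. The names are hypothetical.

```python
from datetime import datetime

# Assumption: the test set is February 2024, so the split point is 1 February 2024.
TEST_START = datetime(2024, 2, 1)


def split_by_date(records, test_start=TEST_START):
    """Split (timestamp, output) records chronologically into train and test sets."""
    train = [r for r in records if r[0] < test_start]
    test = [r for r in records if r[0] >= test_start]
    return train, test


def null_proportion(values):
    """Fraction of missing measurements (None) in one inverter's output series."""
    return sum(v is None for v in values) / len(values)


# Hypothetical records for one inverter.
records = [
    (datetime(2023, 8, 1, 12, 0), 0.71),
    (datetime(2024, 1, 31, 12, 0), None),
    (datetime(2024, 2, 1, 12, 0), 0.64),
    (datetime(2024, 2, 29, 12, 0), 0.58),
]
train, test = split_by_date(records)
print(len(train), len(test), null_proportion([output for _, output in records]))
```

Splitting by time rather than at random keeps the test period strictly after the training period, which matches how a forecast would be used in operation.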
Figure 2 shows the comparison of the prediction results of the different prediction models on 11 February 2024. The results show that the persistence prediction model deviated more from the true value than the other methods.
In addition to the comparative analysis of the single-day prediction results, the effectiveness of the proposed model was further verified using the test samples from February 2024.
Figure 3 shows the correlation diagrams between the predicted and measured values of the output based on the different models. To allow a better comparison, the power values are normalized to the interval [0, 1]. Each blue point represents a normalized predicted output and the corresponding actual output. When the predicted value is equal to the measured value, the blue point falls on the red solid line, so the dispersion of the blue points around the red line reflects the error between the predicted output and the measured output. The more densely the blue points are concentrated around the red line, the smaller the prediction error of the corresponding model.
Figure 3 intuitively shows that the output prediction results based on the GBDT model are the most densely distributed around the red solid line, followed by those based on the SVM model and the RFR model, while the results obtained by the persistence prediction model are sparsely distributed around the red solid line.
As shown in Figure 3, the correlation diagram between the measured and the predicted output gives an intuitive comparison of the performance of the different prediction methods, but it is not reliable for quantitative evaluation. The statistical results of the prediction errors based on the different models are shown in Table 1. The quantitative comparison shows that the error indexes nMAE and nRMSE of the prediction results using the proposed method are lower than those of the other three reference models.
4. Conclusions and Discussion
On the basis of prediction, advanced dispatching control technology can be used to stabilize or even eliminate the effects of the fluctuation and intermittency of PV power generation. In this paper, the output prediction of large-scale PV power stations is studied. An output prediction method based on the GBDT is proposed, and the feasibility and accuracy of the method are verified. The proposed method can work effectively even when the data contain missing values or some outliers, so there is no need to fill in the missing values or preprocess the outliers extensively in the data preprocessing stage. In the model training stage, the proposed method gradually optimizes the model through iteration: each tree corrects the error of the previous trees, thereby enhancing the overall prediction ability of the model. It also automatically selects important features, which reduces the workload of feature engineering. The quantitative analysis shows that, compared with the traditional method, the prediction error nMAE of the proposed method is reduced by 4.36%, and the prediction error nRMSE is reduced by 8.61%. Therefore, for the output prediction of large-scale PV power stations, the proposed method has better prediction performance than the traditional method. The proposed method can be applied to other large-scale centralized PV power stations to improve the accuracy and efficiency of whole-station output prediction. The method can tolerate a small number of outliers, but when there are too many outliers, it needs to be improved through further research.