1. Introduction
In recent years, driven by problems such as energy shortages and environmental pollution, the development of renewable energy has become the main direction of the global energy revolution and a key response to climate change [1]. Solar energy, an efficient, renewable and clean energy source, has developed rapidly, and the global installed photovoltaic (PV) capacity has grown swiftly. According to the global PV report released by the International Energy Agency, the cumulative installed capacity reached 942 GW by the end of 2021, an increase of 22.8% over 2020, which indicates that PV energy has great development potential [2]. However, as the share of PV energy continues to increase, the randomness and volatility of PV output have become increasingly prominent, which complicates the operation of the power grid. Accurate prediction of PV power generation can therefore help the grid dispatching department to better avoid risks and improve the safety and economy of the power system, and it is of great significance to the stable operation of the power grid.
Scholars have carried out extensive research on PV power generation forecasting, and the resulting methods fall mainly into two categories: physical models and statistical models. Physical models combine forecast values of solar irradiance and geographic location information with the operation mode of the PV modules to build a mathematical model [3], and an energy storage system is used to mitigate the negative effects of unstable power generation and low power supply reliability. In practical applications, errors due to power loss and other issues inevitably occur when using PV power. Improving material properties is the most direct way to improve photoelectric conversion efficiency [4,5]. At present, scholars have studied the structural characteristics of composite materials to improve the status of PV applications [6,7]. The rapid development of the PV industry has brought broad application prospects to the research field of PV composite materials. Statistical models rely mainly on the historical data of PV power plants, and artificial intelligence algorithms have therefore been favored by scholars. These include machine learning algorithms such as artificial neural networks (ANNs) [8,9] and support vector machines (SVMs) [10,11], which have been widely used in PV power generation forecasting. For example, the study in [12] proposed an efficient ANN prediction model to study the relationship between meteorological data and PV power generation. The authors of [13] proposed an extended model based on an SVM to obtain a more accurate dataset. The prediction accuracy of machine learning models often depends on the quality of the given dataset and the settings of the internal hyperparameters, and small differences in the dataset can lead to significant changes in the prediction results [14]. Hybrid forecasting models have therefore appeared one after another: by optimizing the dataset and searching for the best hyperparameters, a forecasting model can achieve its best forecasting performance. Experiments have shown that an SVM whose parameters are tuned by particle swarm optimization (PSO) obtains more accurate prediction results [15]. Usman et al. [16] developed an evaluation framework for short-term PV power prediction and conducted a comparative analysis among various machine learning models and feature selection methods; the results showed that the extreme gradient boosting (XGBoost) method outperformed individual machine learning methods. According to the authors of [17], combining XGBoost with feature engineering allowed important information to be extracted from weather forecasts, which improved the prediction accuracy.
Compared with traditional machine learning techniques, deep learning models have better fitting performance and are able to discover intrinsic connections in high-dimensional data [18]. A PV prediction model based on deep learning can therefore better mine the intrinsic value of feature data. Deep learning models include convolutional neural networks (CNNs) [19], deep belief networks (DBNs) [20], recurrent neural networks (RNNs), generative adversarial networks (GANs) [21] and other classic models, as well as their variants and combinations. As a variant of the RNN, the long short-term memory (LSTM) network can effectively capture the long-term dependencies of time series and has become very popular in short-term PV output power prediction. For example, the experimental results in [22] showed that an LSTM-based PV power generation prediction method performs better than multilayer perceptrons (MLPs) and deep convolutional networks. The authors of [23] used an LSTM network to predict day-ahead solar irradiance, and its results were better than those of a backpropagation (BP) neural network and linear least-squares regression. The authors of [24] proposed a CNN-LSTM hybrid deep learning model, which uses a multilayer CNN for feature extraction and an LSTM layer for prediction, thereby improving on the prediction performance of the LSTM alone.
Regardless of the chosen prediction algorithm, data processing is a challenge that cannot be ignored. A PV output power sequence has nonlinear characteristics. Decomposing such a time series into multiple subsequences reduces the complexity of the data and is an effective means of improving the prediction accuracy of the model [25]. Common sequence decomposition methods include empirical mode decomposition (EMD), ensemble EMD (EEMD) and wavelet decomposition (WD) [26,27]. However, the results of these decomposition methods exhibit modal aliasing, which increases the difficulty of prediction. As a method that performs sequence decomposition and reconstruction [28], singular spectrum analysis (SSA) can effectively decompose a sequence into a trend sequence, a periodic sequence and a noise sequence without requiring an a priori basis function or a complex operation process, which makes it more objective and adaptable [29]. It is suitable for various engineering disciplines and has been widely used in wind power forecasting and power load forecasting [30,31]. For example, the authors of [32] decomposed a wind power series into two subsequences (a trend series and a noise series) through SSA and used a hybrid Laguerre neural network to predict the decomposed signals. In [33], a multistep-ahead wind speed prediction model was proposed by combining variational mode decomposition (VMD) and SSA with an LSTM model.
The processing of weather feature data is also an important part of PV power forecasting. Although the PV output power fluctuates, its fluctuation range is similar under the same weather type. Therefore, when constructing a dataset for PV forecasting, clustering the data into similar days according to weather type can reduce data redundancy and forecasting errors [34]. Commonly used clustering methods include K-nearest neighbors (KNN) [35] and K-means clustering (K-means) [36]. The authors of [37] used the fuzzy C-means (FCM) clustering algorithm to cluster and analyze historical meteorological data and weather forecast information, and used the whale optimization algorithm and a least-squares SVM (LSSVM) to make predictions. In [38], K-means clustering was used to select historical data similar to the forecast days as training samples, and complete EEMD with adaptive noise (CEEMDAN) and a gated recurrent unit (GRU) were then used to forecast PV power; the simulation results showed that the proposed model outperformed other models. These studies show that, when processing PV power generation datasets, establishing separate models for different types of data, whether by clustering weather types or by searching for similar days, can improve the prediction accuracy. However, such methods slice the entire dataset into many smaller datasets for training a prediction model. When the amount of data is insufficient, the resulting subsets may be very small, which can easily lead to too few training samples and to overfitting [39].
In summary, this paper proposes a hybrid forecasting model based on SSA-LSTM. SSA decomposition is performed on the highly volatile PV output power sequence; its trend, periodic and noise sequences are extracted; and principal component analysis is performed on the sequence. The important components are extracted for sequence reconstruction, and a separate LSTM prediction model is established for each reconstructed sequence. The purpose is to enable the LSTM to learn directly from regular sequence data, which reduces the complexity of the model and improves the prediction accuracy. Existing research lacks in-depth studies of feature information and of the law of PV output power. This paper therefore mines the characteristics of PV meteorological data, extracts high-quality features and improves the data quality.
To verify the validity of the model, this paper uses data from the Ningxia Wuzhong Sun Mountain PV power station [40]. Comparative experiments are conducted under two frameworks: model 1 is a time series prediction model, and model 2 incorporates weather features and the feature data constructed in this paper into the LSTM prediction. The purpose of this test is to gain insight into the impact of feature data on prediction performance and to verify the effectiveness of the developed method.
The contributions of this paper can be summarized as follows:
To improve the quality of the utilized dataset, the PV output power obtained under different weather conditions is analyzed, the law of PV output power is summarized, and a new feature is constructed by combining the PV output law and weather data. The aim is to achieve improved prediction accuracy by mining higher-quality feature data;
A short-term PV prediction model (SSA-LSTM) is proposed, in which SSA decomposes nonlinear PV sequences into more regular trend sequences, periodic sequences and noise sequences, reducing the learning complexity of LSTM; the model is combined with feature data to achieve improved prediction accuracy.
The rest of the paper is organized as follows: Section 2 analyzes the characteristics of PV output power and performs feature extraction;
Section 3 introduces the forecasting methods and technical descriptions used in this paper;
Section 4 presents a case study that verifies the prediction model proposed in this paper using data from the Sun Mountain PV power plant in Wuzhong, Ningxia, China; and
Section 5 draws conclusions.
2. PV Power Generation Feature Extraction
Many factors affect PV output power, and weather factors have a direct impact on it. This section divides weather conditions into four types (sunny, partly cloudy, cloudy and rainy), analyzes the PV output power law in detail under each weather type and extracts features according to that law.
2.1. Typical Form of PV Power Generation
Figure 1 depicts the PV output power over five days for each of four typical weather types: sunny, partly cloudy, cloudy and rainy. The daily comparison covers 5:00 to 18:45 with a sampling interval of 15 min, giving a total of 56 nodes per day. The PV output power curves on sunny days exhibit the highest similarity and nearly coincide. As weather conditions change, the fluctuation of PV output power on partly cloudy, cloudy and rainy days increases and becomes extremely irregular, and the maximum daily PV output power gradually decreases. When dealing with such problems, some scholars use algorithms to find historically similar days as a training set: the dataset is clustered according to its meteorological features, the weather types are divided on this basis, and the forecast days are predicted using data obtained under the same weather type. However, Figure 1 shows that even under the same weather type, the PV output law exhibits obvious differences. It is therefore difficult to capture the power fluctuation characteristics of a whole day based only on the daily matching of similar weather characteristics. Moreover, when the amount of data is small, dividing the dataset reduces the amount of training data, which reduces the prediction accuracy of the model to a certain extent. For these two reasons, it is necessary to analyze the characteristics of PV output power in more detail and to conduct feature screening and matching at a finer time granularity to improve the prediction accuracy.
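The node count quoted above follows directly from the sampling window. As a quick check, the sketch below enumerates the daily sampling times; the function name and defaults are illustrative and are not taken from the paper.

```python
from datetime import datetime, timedelta

def daily_sample_times(start="05:00", end="18:45", step_minutes=15):
    """Return the sampling timestamps (HH:MM) for one day, both endpoints included."""
    t = datetime.strptime(start, "%H:%M")
    stop = datetime.strptime(end, "%H:%M")
    times = []
    while t <= stop:
        times.append(t.strftime("%H:%M"))
        t += timedelta(minutes=step_minutes)
    return times

nodes = daily_sample_times()
print(len(nodes), nodes[0], nodes[-1])  # 56 05:00 18:45
```

The window spans 13 h 45 min, that is, 55 intervals of 15 min, which gives 56 sampling nodes once both endpoints are counted.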
As seen in Figure 1, the PV output power follows a strong trend on sunny days: it gradually increases to the peak output and then gradually decreases, showing a hemispherical shape. Although other weather types lack such an obvious feature, over short time spans the PV output power also forms a short-term increasing or decreasing trend after fluctuating. This feature is called the short-term trend of PV output power in this paper. Although it is difficult to find days with similar PV output power, under the same type of weather the PV output power fluctuates within roughly the same interval. It is therefore easier to find similar output points at the same time of day in history, which also improves the quality of the dataset (that is, makes it more accurate). Based on the above analysis, feature data are constructed for both the short-term trend of PV output power and the power similarity.
2.2. Short-Term Trend Correlation Analysis
Because PV output power forms an increasing or decreasing trend over a short period of time, this paper takes the power at the N moments preceding a PV output power point as features and conducts a correlation analysis on them. The purpose is to determine whether the PV output power at time t is strongly correlated with the outputs at the previous time points. P(t) represents the power at time t, and P(t−1) represents the power at the time node immediately before time t. Power features are constructed in turn from the historical measured data, and SPSS software is used to carry out a correlation analysis on the constructed dataset. The results are shown in Table 1.
In this paper, features with correlations exceeding 0.7 are retained. The power at time t is strongly correlated with the power at the previous seven time stamps, and the correlation strength decreases as the lag increases. This indicates that the change in the current power has a certain internal relationship with the power at the previous moments, which is in line with the hypothesis of this paper, so these values can be used as prediction features. The power levels at the three preceding moments with the strongest correlations are selected as the features.
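The lag analysis can be illustrated with a minimal Python sketch. It is a stand-in for the SPSS workflow described above, not the authors' procedure: it assumes the Pearson coefficient as the correlation measure, uses a synthetic sine-shaped daily curve instead of measured data, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lag_correlations(power, max_lag):
    """Correlation between P(t) and P(t-k) for k = 1..max_lag."""
    result = {}
    for k in range(1, max_lag + 1):
        current = power[k:]    # P(t)
        lagged = power[:-k]    # P(t-k), aligned with P(t)
        result[k] = pearson(current, lagged)
    return result

def select_lags(correlations, threshold=0.7, top_n=3):
    """Keep lags above the threshold, then take the top_n strongest."""
    kept = [k for k, r in correlations.items() if r > threshold]
    kept.sort(key=lambda k: correlations[k], reverse=True)
    return kept[:top_n]

# Synthetic sunny-day curve with 56 nodes (assumption, for illustration only).
power = [math.sin(math.pi * i / 55) for i in range(56)]
correlations = lag_correlations(power, max_lag=7)
print(select_lags(correlations))
```

With real data, `power` would be the measured output series, and the lags returned by `select_lags` would correspond to the P(t−1), P(t−2) and P(t−3) features retained in this paper.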
2.3. Power Similarity Matching at the Same Moment
The purpose of similarity matching is to find similar power points at the same moment in history. The output power is greatly affected by meteorological features such as global horizontal irradiance, ambient temperature and humidity, so these features are selected to calculate the grey correlation degree. Because the similarity between the forecast date and a historical date is affected by seasonality, the closer a historical date is to the forecast date, the higher the probability of finding similar outputs. Therefore, this paper only analyzes the grey correlation degree at the same time of day over the 30 days before the PV output power point and selects the three power values with the highest grey correlation degrees as the prediction features.
2.3.1. Relevance Calculation
The formula for calculating the correlation coefficients between the comparison sequence xi(k) and the reference sequence y(k) is shown in Equation (1).
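For reference, the grey relational coefficient is conventionally written in the form below. This is the standard textbook expression, and it is assumed here to correspond to Equation (1) because it uses the same minimum and maximum difference terms and the same resolution coefficient ρ.

```latex
\xi_i(k) = \frac{\min_i \min_k \left| y(k) - x_i(k) \right| + \rho \, \max_i \max_k \left| y(k) - x_i(k) \right|}
                {\left| y(k) - x_i(k) \right| + \rho \, \max_i \max_k \left| y(k) - x_i(k) \right|}
\tag{1}
```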
ξi(k) denotes the correlation coefficient of element k; min_i min_k |y(k) − xi(k)| is the minimum of the absolute differences between all comparison sequence values and the reference sequence values. Similarly, max_i max_k |y(k) − xi(k)| is the maximum of the absolute differences between the sequences; the resolution coefficient ρ is taken as 0.4 in this paper.
2.3.2. Grey Relation Analysis
After calculating the relation coefficient for each element in xi(k), the grey relation degree ri can be calculated by Equation (2).
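In its usual equal-weight form, the grey relation degree is the mean of the relation coefficients over all elements. The expression below is that standard form and is assumed to correspond to Equation (2); the symbol n, the number of elements in each sequence, is introduced here for illustration.

```latex
r_i = \frac{1}{n} \sum_{k=1}^{n} \xi_i(k)
\tag{2}
```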
ri > 0.7 indicates that the two datasets are strongly correlated; 0.5 < ri < 0.7 indicates some correlation; ri < 0.5 indicates little correlation.
2.3.3. Process of Feature Selection
The algorithmic flow is shown in Table 2. Equations (1) and (2) are used to calculate the correlations between the prediction point and the meteorological features at the same moment on each of the previous 30 days. The three power points with the highest correlations are selected as the prediction features and, in order of decreasing correlation, are denoted Pa, Pb and Pc. Constructing features for specific similar moments makes the training of the prediction model more targeted.
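The selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the procedure in Table 2: the meteorological features are assumed to be already normalized to a common scale, the grey relation degree is taken as the equal-weight mean of the coefficients, and the function names and sample values are hypothetical.

```python
def grey_relational_degrees(reference, candidates, rho=0.4):
    """Grey relation degree of each candidate sequence x_i against the reference y.

    reference:  feature values of the forecast point, y(k)
    candidates: list of feature sequences x_i(k), one per historical day
    rho:        resolution coefficient (0.4 in this paper)
    """
    diffs = [[abs(y - x) for y, x in zip(reference, cand)] for cand in candidates]
    d_min = min(min(row) for row in diffs)   # min_i min_k |y(k) - x_i(k)|
    d_max = max(max(row) for row in diffs)   # max_i max_k |y(k) - x_i(k)|
    if d_max == 0:                           # all candidates equal the reference
        return [1.0 for _ in candidates]
    degrees = []
    for row in diffs:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(coeffs) / len(coeffs))  # equal-weight mean (assumption)
    return degrees

def select_similar_powers(reference, history, top_n=3, rho=0.4):
    """history: list of (features, power) pairs for the same time on earlier days."""
    degrees = grey_relational_degrees(reference, [f for f, _ in history], rho)
    ranked = sorted(zip(degrees, (p for _, p in history)),
                    key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked[:top_n]]

# Hypothetical normalized (GHI, AT, RH) values; a real run would use 30 days.
reference = [0.80, 0.50, 0.30]
history = [
    ([0.80, 0.50, 0.30], 410.0),
    ([0.20, 0.40, 0.90], 120.0),
    ([0.75, 0.52, 0.33], 395.0),
    ([0.60, 0.45, 0.50], 300.0),
    ([0.10, 0.30, 0.95], 60.0),
]
pa, pb, pc = select_similar_powers(reference, history)
print(pa, pb, pc)  # 410.0 395.0 300.0
```

The three returned values play the role of Pa, Pb and Pc. Sorting on the degree alone keeps the original day order when two days tie.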
2.4. Optional Feature
To quantify the quality of the matched feature data constructed in this paper, the Pearson correlation coefficient is introduced to compare how strongly the matched features and the original data correlate with the actual power. The original data include the actual power, global horizontal irradiance (GHI), ambient temperature (AT), component temperature (CT) and relative humidity (RH). The matched features are the power at moments t−1, t−2 and t−3 and the similar powers Pa, Pb and Pc. There are 10 vectors. The specific results are shown in Table 3.
Among the meteorological data, global horizontal irradiance has the highest correlation with the actual power, followed by the ambient temperature. The component temperature and the relative humidity are only weakly correlated with the actual power. The short-term power trend was analyzed in Section 2.2 and is not repeated here. Among the similar powers, Pa has the strongest correlation with the actual power, exceeding the correlation coefficient of global horizontal irradiance. The correlations of Pb and Pc with the actual power are lower but still stronger than that of the ambient temperature. These results show that the feature data constructed in this paper can improve the quality of the dataset, and most of the constructed features fall in the strong correlation range.
According to the correlation calculation results in Table 3, all the above matched feature data can be used as prediction data, and the specific feature quantity selection results are shown in Table 4.