Photovoltaic Power Prediction Based on Similar Day Clustering Combined with CNN-GRU

Gao, Chao; Zhang, Shuai; Li, Zhiqin; Zhou, Bin; Guo, Dong; Shao, Wenqi; Li, Haowen

doi:10.3390/su17167383

by Chao Gao ¹, Shuai Zhang ¹, Zhiqin Li ¹, Bin Zhou ², Dong Guo ¹,*, Wenqi Shao ¹ and Haowen Li

¹ School of Transportation and Vehicle Engineering, Shandong University of Technology, Zibo 255049, China

² State Key Laboratory of Intelligent Transportation System, Beijing 210096, China

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(16), 7383; https://doi.org/10.3390/su17167383

Submission received: 7 July 2025 / Revised: 11 August 2025 / Accepted: 12 August 2025 / Published: 15 August 2025

Abstract

A single prediction model struggles to maintain high accuracy when meteorological conditions change across weather types. To address this, this paper proposes a photovoltaic (PV) power prediction method that combines similar day clustering with a convolutional neural network (CNN)-gated recurrent unit (GRU) model. The Pearson and Spearman correlation coefficients are used to select key features, such as total solar radiation and module temperature, and to construct a new input dataset. The K-means algorithm then clusters the data into sunny, cloudy, and rainy days. The CNN extracts the spatial correlation features of the meteorological factors and is combined with the GRU to form the CNN-GRU model. The method is tested on PV power data and the corresponding meteorological data collected at a site in Ningxia from June to August 2020. Validation results show that, compared to other models, the proposed model reduces MAE and RMSE by 66.1% and 65.7% on average across the three weather types and improves R² by 19.8% on average. These results indicate that the model has high prediction accuracy and generalization ability. The CNN-GRU model demonstrates superior capability in modeling short- and long-term dependencies compared to other deep learning hybrid approaches, while also achieving higher computational efficiency and faster training convergence.

Keywords:

photovoltaic power generation; power prediction; K-means; convolutional neural network; gated recurrent unit

1. Introduction

As society develops rapidly, the demand for energy keeps increasing, while the limitations and environmental hazards of traditional fossil energy sources such as coal, oil and natural gas are becoming more prominent [1]. In this context, adjusting the energy structure and accelerating the development and utilisation of new energy sources has become a key path to resolving the energy crisis and achieving sustainable development [2]. According to the latest report released by the International Energy Agency (IEA), global renewable energy capacity additions rose to 700 GW in 2024, setting a new record for the 22nd consecutive year. Among renewable sources, solar energy, as a representative clean energy source, has great potential and prospects [3]. In recent years, innovations in photovoltaic materials and device architectures have continuously driven industrial upgrades. Novel materials and structural designs have significantly improved the conversion efficiency and long-term stability of photovoltaic devices, providing a solid physical foundation for large-scale applications [4,5]. In 2024, global new PV installed capacity was expected to exceed 550 GW, with China accounting for more than 50% and an average growth rate of more than 35% [6]. This growth trend is shown in Figure 1. As the installed capacity of PV increases year by year, the proportion of PV power generation in the power system is rising. Accurate and effective PV prediction is therefore important for the safe and stable operation of the power system.

PV output is affected by many factors, resulting in strong volatility and stochasticity, which makes PV power prediction difficult. Current PV power prediction methods fall broadly into three types: physical models, statistical methods and deep learning algorithms.

The physical modelling approach uses meteorological conditions such as solar radiation and cloudiness provided by numerical weather prediction (NWP), combined with the characteristics of the PV module, to establish a mathematical relationship between environmental conditions and output power and thereby predict the output power [7]. Holland et al. [8] constructed a physical photovoltaic simulation model to predict photovoltaic power based on numerical weather forecasts and local irradiance measurements. Markovics et al. [9] tested and compared 24 physical models for predicting PV power based on NWP and found that hyper-parameter tuning significantly reduced the prediction error of the models. Zhi et al. [10] proposed a physical model with environmental parameter prediction and an improved maximum power point tracking algorithm to achieve PV power prediction under different weather conditions. To reduce the systematic error of the weather forecasting system and further improve prediction accuracy, scholars have proposed combining ensemble NWP with an ensemble physical model chain, which post-processes the weather data and significantly improves the prediction accuracy of PV power [11,12]. However, physical models depend heavily on the accuracy of weather forecast data, have high computational complexity, are best suited to short-term predictions under stable meteorological conditions, and generalise poorly across regions [13].

As research progressed, statistical methods were applied to forecasting by developing mathematical models that reveal the intrinsic relationship between power generation and key variables such as time of day and weather conditions. Atique et al. [14] applied an autoregressive integrated moving average (ARIMA) model to predict PV output power by transforming seasonal and non-stationary time series data into a stationary format. Jeong et al. [15] proposed using a seasonal autoregressive integrated moving average (SARIMA) model to predict PV output power and evaluated its prediction performance. Li et al. [16] proposed an adaptive seasonal autoregressive integrated moving average (ASARIMA) model to predict PV power generation, and the experimental results showed that the proposed model performed better than other existing power prediction algorithms. Jung et al. [17] proposed a regional PV power prediction method based on a vector autoregressive (VAR) model, and the validation results showed that the VAR model is more accurate than the ARIMA model. Wang et al. [18] proposed a short- and medium-term forecasting method for regional PV power generation based on a fuzzy support vector machine, and the experimental results showed that the method effectively shortens the forecasting time while maintaining high accuracy. Statistical methods have demonstrated significant advantages in dealing with linear relationships and trend prediction, but their prediction accuracy and generalization ability may be limited when faced with the complex, nonlinear effects of variable meteorological conditions on PV power.

With the continuous innovation of artificial intelligence and machine learning technologies, more and more research explores the application of these techniques to PV power generation prediction, aiming to obtain more accurate and stable results. Liu et al. [19] proposed an improved whale algorithm to optimise the support vector machine model, which adapts to different weather conditions while training on less data and achieves the desired prediction accuracy under each of them. Khan et al. [20] trained an Artificial Neural Network (ANN) on PV sample data to predict the PV output power, and the experimental results showed that the ANN was able to predict the PV output power accurately. Since all single models have certain limitations, recent studies have combined two or more prediction models to further improve prediction accuracy and robustness. Wang et al. [21] proposed a PV power prediction method based on frequency domain decomposition and a hybrid deep learning model, and the results show that the proposed model improves prediction accuracy and stability by about 15% on average for seven-day-ahead prediction compared to other prediction models. Hu et al. [22] proposed a novel model, CA-Transformer, which employs Copula functions to address the limitations of traditional methods in capturing nonlinear relationships in photovoltaic data and to improve prediction results. Wu et al. [23] proposed a combined IXGBoost-KELM short-term PV power prediction model consisting of multidimensional similar day clustering and pairwise decomposition to predict PV power generation under three weather conditions. Liu et al. [24] proposed an ultra-short-term PV power generation prediction model based on wavelet decomposition, a dual attention mechanism and a bidirectional long short-term memory network (W-DA-BiLSTM), and the experimental results show that their model predicts PV power generation with higher accuracy and efficiency and can effectively handle the stochastic fluctuations and nonlinearity common in PV power generation. Overall, combined models can avoid the limitations of a single model by leveraging the strengths of each component and compensating for their weaknesses, and they achieve better prediction results than single models.

In this paper, a PV power prediction method based on the combination of similar day clustering and CNN-GRU is proposed based on previous research. The main contributions of this study are as follows:

(1) A subtyping prediction framework incorporating similar-day clustering is constructed to improve the model’s adaptability to different weather conditions. By introducing K-means clustering to classify similar weather samples, the model can be targeted for training under relatively consistent meteorological feature scenarios, which improves the prediction stability and generalization ability.

(2) A hybrid deep learning model integrating Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU) is proposed. CNN is used to extract local features from high-dimensional meteorological data, while GRU captures both short-term and long-term dependencies in the time series, effectively leveraging the complementary strengths of both methods in feature extraction.

(3) The model’s performance is systematically evaluated based on real meteorological and power generation data across different weather types. Experimental results demonstrate that the proposed method outperforms baseline models in terms of accuracy and robustness. This study not only validates the model’s practicality in real-world scenarios but also provides a scalable modeling approach for high-precision, context-specific photovoltaic power forecasting.

2. Basic Algorithmic Principles

2.1. K-Means Clustering Algorithm

The K-means clustering algorithm is a partition-based unsupervised learning algorithm. Its core idea is to divide n data samples into K clusters through iterative optimisation so that the distance from each data point to the centroid of its cluster is minimised. The algorithm achieves this by minimising the squared error function, i.e., the sum of the squared distances from each data point to the centre of the cluster it belongs to, which is defined mathematically as follows:

J = \sum_{i = 1}^{K} \sum_{x_{j} \in C_{i}} {‖x_{j} - μ_{i}‖}^{2}

(1)

where J is the sum of squared errors within clusters, K is the number of clusters, C_i is the set of samples assigned to the ith cluster (the n samples are partitioned among the K clusters), x_j is the jth sample in the dataset, and μ_i is the centroid of the ith cluster.

The specific process of K-means clustering algorithm is as follows:

Step 1 Given n data samples, K objects are randomly selected as initial clustering centres.

Step 2 Calculate the distance of each sample point to each cluster centre separately and assign it to the cluster nearest to it one by one.

Step 3 Once all the objects have been assigned, update the K class centre locations, with the centroid defined as the mean value of all the objects in the cluster in each dimension.

Step 4 Compare with the K clustering centres obtained from the previous calculation, if the centroids have changed, return to step 2, otherwise, proceed to step 5.

Step 5 When the class centre no longer changes, stop and output the clustering results.
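Steps 1 to 5 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the study: the function name `kmeans`, the fixed random seed, the iteration cap and the toy two-feature samples are all assumptions introduced here.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (lists of floats) into k groups following Steps 1-5."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial cluster centres.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centre (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            for p in points
        ]
        # Step 3: move each centre to the per-dimension mean of its members.
        new_centres = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centres.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centres.append(centres[i])  # an empty cluster keeps its old centre
        # Steps 4-5: stop once the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels


# Toy two-feature samples; the values are made up for illustration.
toy = [[0.9, 0.1], [1.0, 0.2], [0.5, 0.5], [0.6, 0.6], [0.1, 0.9], [0.2, 1.0]]
centres, labels = kmeans(toy, k=3)
```

Because the initial centres are drawn at random, different seeds can converge to different partitions, which is one reason the number of clusters is usually validated with a separate quality index.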

2.2. Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a deep feed-forward neural network characterised by local connectivity and weight sharing. Local connectivity refers to spatially local receptive fields: each neuron is connected only to a small, localised region of the input. The CNN network structure is shown in Figure 2. The input layer receives the raw data. In the convolutional layer, the network extracts features from local regions of the input data using learnable convolutional kernels. The pooling layer reduces the feature map dimensions by spatial downsampling, enhancing the robustness of the model to local translations while preserving the main features. The fully connected layer flattens the multidimensional feature maps into a vector and establishes nonlinear associations between global features. The output layer outputs the prediction results.
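To make the convolution and pooling stages concrete, the following sketch applies one kernel, a ReLU activation and max pooling to a short sequence. It is a one-dimensional simplification under stated assumptions: the kernel here is fixed by hand rather than learned, the ReLU activation is assumed (the text does not name one), and the input values are made up.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the kernel over the signal."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses and set negative ones to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# Hypothetical irradiance-like sequence and a hand-set difference kernel.
sequence = [0.0, 1.0, 3.0, 6.0, 6.0, 5.0, 2.0, 0.0]
kernel = [-1.0, 1.0]  # responds to rising values
features = max_pool1d(relu(conv1d(sequence, kernel)), size=2)
```

The same kernel weights are reused at every position (weight sharing), and each output depends on only two neighbouring inputs (local connectivity).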

2.3. Gated Recurrent Unit (GRU)

The Gated Recurrent Unit (GRU) is a gated recurrent architecture designed to alleviate the vanishing gradient problem in Recurrent Neural Networks (RNNs). Learnable update and reset gates dynamically regulate how historical information and the current input are fused, enabling the model to capture both long- and short-term dependencies in sequence data. Compared to Long Short-Term Memory (LSTM) networks, the GRU reduces model complexity by simplifying the gating structure while maintaining efficient temporal feature extraction. The network structure of the GRU model is shown in Figure 3, and the computational equations are given in (2)–(5).

z_{t} = σ (W_{z} \cdot [h_{t - 1}, x_{t}])

(2)

r_{t} = σ (W_{r} \cdot [h_{t - 1}, x_{t}])

(3)

{\tilde{h}}_{t} = \tanh (W \cdot [r_{t} * h_{t - 1}, x_{t}])

(4)

h_{t} = (1 - z_{t}) * h_{t - 1} + z_{t} * {\tilde{h}}_{t}

(5)

where z_t is the update gate, r_t is the reset gate, {\tilde{h}}_{t} is the candidate hidden state, h_t is the hidden state passed to the next time step, x_t is the input at time t, h_{t−1} is the hidden state at time t − 1, σ is the sigmoid function, tanh is the hyperbolic tangent function, * denotes element-wise multiplication, [·, ·] denotes vector concatenation, and W_z, W_r and W are the weight matrices of the recurrent connections.
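Equations (2)–(5) can be traced step by step with a small sketch of a single GRU update. The helper names (`sigmoid`, `matvec`, `gru_step`), the hidden size of 2, the input size of 1 and all weight values are assumptions chosen for illustration; like the equations above, the sketch omits bias terms.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU update following Equations (2)-(5), without bias terms."""
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(W_z, concat)]           # Eq. (2): update gate
    r = [sigmoid(v) for v in matvec(W_r, concat)]           # Eq. (3): reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t    # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(v) for v in matvec(W, gated)]       # Eq. (4): candidate state
    # Eq. (5): blend the previous state and the candidate using the update gate.
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# Hidden size 2, input size 1; the weights are arbitrary illustrative values.
W_z = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_r = [[0.2, 0.1, -0.3], [-0.4, 0.0, 0.5]]
W = [[0.3, -0.1, 0.2], [0.1, 0.2, -0.2]]
h = [0.0, 0.0]
for x in ([0.5], [1.0], [0.2]):
    h = gru_step(h, x, W_z, W_r, W)
```

When z_t is close to 0 the previous state is carried forward almost unchanged, which is how the unit retains long-term information; when z_t is close to 1 the state is replaced by the candidate.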

3. Photovoltaic Forecasting Model Construction

The accuracy of PV power prediction is significantly affected by changes in climatic conditions. To improve the accuracy of PV power prediction, this paper proposes a PV power prediction model based on similar day clustering with CNN-GRU. Based on the historical photovoltaic power and climate conditions, the K-means algorithm classifies the data into three weather types, namely sunny, cloudy and rainy days. The data for each type are then fed separately into the CNN-GRU model for photovoltaic power prediction, and the prediction accuracy of the model is verified with a multi-indicator evaluation system.

3.1. Modelling

In this paper, we propose a PV power prediction model based on similar day clustering and CNN-GRU, which achieves scenario segmentation through weather clustering and combines feature selection with deep learning to achieve accurate predictions for different weather patterns. The model flow is shown in Figure 4, and the specific prediction steps are as follows.

(1) The raw PV power data are cleaned, and the input features that are strongly correlated with PV output power are selected using the Pearson and Spearman correlation coefficients; these features serve as the model inputs.

(2) Cluster analysis is performed based on key meteorological factors using the K-means algorithm to classify the data into three weather types: sunny, cloudy and rainy.

(3) The PV power generation data under different weather types are used to train and evaluate the CNN-GRU model on the training and testing sets, respectively.

(4) The prediction results under each weather type are evaluated using the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and coefficient of determination (R²), and these metrics are compared across the different models.

3.2. Evaluation Indicators

To ensure the reliability and practical applicability of the prediction results, this paper uses the following three evaluation indicators to assess the prediction accuracy of the model.

(1): Mean Absolute Error (MAE)

The mean absolute error is the average of the absolute errors between the predicted and actual values. It takes values in [0, +\infty); the smaller the value, the closer the predictions are to the actual values and the higher the prediction accuracy. It is calculated as:

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - {\hat{y}}_{i}|

(6)

where y_i is the true value, {\hat{y}}_{i} is the predicted value, and n is the number of samples.

(2): Root Mean Square Error (RMSE)

The root mean square error is the square root of the mean squared difference between the predicted and actual values, indicating the degree of dispersion of the prediction error. It takes values in [0, +\infty). The smaller the value, the smaller the error between the predicted and actual values and the higher the model accuracy.

RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(7)

(3): Coefficient of determination (R²)

The coefficient of determination indicates the proportion of the variance in the true values that is explained by the predictions. For a model that fits at least as well as predicting the mean, R² lies in [0, 1], and the closer its value is to 1, the better the model fit.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(8)

where \bar{y} is the average of the true values.
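As a minimal sketch of Equations (6)–(8), the three indicators can be computed directly from paired lists of true and predicted values. The function names and the four sample values are hypothetical and are not taken from the study's data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Equation (6)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, Equation (7)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (8)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up true and predicted power values, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

For these values the absolute errors are 2, 2, 3 and 1, so the MAE is 2.0, while the RMSE is slightly larger because squaring weights the largest error more heavily.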

4. Case Study

The experimental data used in this study were collected from a medium-sized photovoltaic (PV) power station located in Ningxia, China, with an installed capacity of 150 kW. PV generation data from June to August 2020 were used for model development and evaluation. Given the intermittent nature of solar power, PV output data were recorded daily from 08:15 to 20:15 at 15-min intervals, resulting in a total of 5303 data samples. The collected variables include PV output power, total solar irradiance, module temperature, air temperature, humidity, surface pressure and wind velocity. For each weather category, the data were divided into training and testing sets in a 7:3 ratio based on chronological order, ensuring that the test data occurred after the training data. The prediction performance of the proposed CNN-GRU model was compared with that of individual CNN, GRU, and Transformer models.
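The chronological 7:3 split described above can be sketched as follows. The function name and the integer stand-ins for time-ordered records are assumptions; the point is only that the split cuts the sequence once, so no test sample precedes a training sample.

```python
def chronological_split(samples, train_parts=7, test_parts=3):
    """Split time-ordered samples so every test sample follows every training sample."""
    cut = len(samples) * train_parts // (train_parts + test_parts)
    return samples[:cut], samples[cut:]


# Hypothetical time-ordered records for one weather category.
records = list(range(10))
train, test = chronological_split(records)
```

Unlike a shuffled split, this keeps future observations out of the training set, which avoids leaking information from the test period into the model.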

4.1. Data Preprocessing

Blank values often appear in PV data due to sensor failures, data transmission delays, etc. To address the problem of missing data in PV datasets and to ensure the integrity and continuity of the data, the mean supplementation method is used to fill in the blank values. Priority is given to using historical averages over the same time period for supplementation, maintaining the overall statistical properties of the data while reducing the bias introduced by the filling process.
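One possible reading of the mean supplementation step is sketched below: a blank value is replaced by the average of the observed values in the same time slot on other days. This interpretation, the function name and the toy data are assumptions; in particular, the sketch averages over all available days rather than only earlier ones.

```python
def fill_missing_by_slot(days):
    """Replace None entries with the mean of observed values in the same time slot."""
    n_slots = len(days[0])
    filled = [list(day) for day in days]
    for s in range(n_slots):
        observed = [day[s] for day in days if day[s] is not None]
        if not observed:
            continue  # no observations for this slot; leave the gaps untouched
        slot_mean = sum(observed) / len(observed)
        for day in filled:
            if day[s] is None:
                day[s] = slot_mean
    return filled


# Three hypothetical days with three time slots each; None marks a blank value.
days = [[10.0, 50.0, 20.0], [12.0, None, 22.0], [14.0, 54.0, None]]
filled = fill_missing_by_slot(days)
```

Filling within a time slot preserves the daily shape of the power curve better than filling with a single global mean would.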

The input features in the PV power generation data have different units and value ranges. To eliminate the effect of these scale differences between features, this paper standardises the data using the Z-score method, calculated as follows.

z = \frac{x - μ}{σ}

(9)

where z is the standardised data, x is the original data, µ is the data mean and σ is the data standard deviation.
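Equation (9) can be illustrated with the standard library alone. The sketch assumes the population standard deviation, since the text does not say whether the population or sample form was used, and the input values are made up.

```python
import statistics


def z_score(values):
    """Standardise values to zero mean and unit standard deviation, Equation (9)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(x - mu) / sigma for x in values]


# Made-up feature values with mean 5 and population standard deviation 2.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardised = z_score(raw)
```

In a forecasting setting, μ and σ are commonly estimated on the training set only and then reused for the test set, so that no test-period statistics influence training.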

4.2. Analysis of the Impact Factors of Photovoltaic Power Generation

The output power of photovoltaic power generation is affected by various environmental and system factors, such as solar irradiance, ambient temperature, module temperature, humidity, and wind velocity. To quantitatively assess the correlation between factors and PV power, the correlation coefficients between input features and output power were analyzed using Pearson correlation analysis, reflecting the degree of influence of different variables on PV power, with the following mathematical expressions.

ρ_{X, Y} = \frac{E (X Y) - E (X) E (Y)}{\sqrt{E (X^{2}) - {(E (X))}^{2}} \sqrt{E (Y^{2}) - {(E (Y))}^{2}}}

(10)

The Pearson correlation coefficient lies in the interval [−1, 1], with positive values indicating a positive correlation and negative values indicating a negative correlation. The closer the absolute value is to 1, the stronger the correlation. The heat map of Pearson’s correlation coefficient between PV power and the factors is shown in Figure 5.

From Figure 5, it can be seen that the PV power is positively correlated with module temperature, air temperature, total solar radiation, and wind velocity, with correlation values of 0.784, 0.422, 0.926, and 0.125, respectively. It shows a negative correlation with ground pressure and relative humidity, with values of −0.053 and −0.278, respectively.
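The expectation form of Equation (10) translates directly into code. The sketch below uses made-up irradiance and power readings, not the study's data, so the resulting coefficient is not expected to match the values in Figure 5.

```python
import math


def pearson(x, y):
    """Pearson correlation via the expectation form of Equation (10)."""
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    ex2 = sum(a * a for a in x) / n
    ey2 = sum(b * b for b in y) / n
    return (exy - ex * ey) / (math.sqrt(ex2 - ex ** 2) * math.sqrt(ey2 - ey ** 2))


# Made-up irradiance and power readings, for illustration only.
irradiance = [100.0, 300.0, 500.0, 700.0, 900.0]
power = [12.0, 35.0, 61.0, 80.0, 108.0]
rho = pearson(irradiance, power)
```

The numerator is the covariance of the two variables and the denominator is the product of their standard deviations, so the result is unit-free.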

The Pearson correlation coefficient primarily reflects linear dependencies between variables. However, in practical scenarios, many influencing factors may exhibit nonlinear characteristics, and relying solely on Pearson correlation may fail to capture such complex relationships. To provide a more comprehensive evaluation of the factors affecting photovoltaic (PV) power generation, this study further employs Spearman’s rank correlation coefficient to reveal potential monotonic relationships between variables. This allows for a more thorough analysis of the influence of each factor on PV output, thereby offering deeper support for model optimization and the improvement of prediction accuracy.

Spearman’s rank correlation coefficient is a non-parametric statistical method used to measure the strength of a monotonic relationship between two variables. Its calculation is given as follows:

r = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)}

(11)

where r denotes the Spearman’s rank correlation coefficient, n represents the sample size, and d_i = R(x_i) − R(y_i) is the rank difference for the i-th sample, where R(x_i) and R(y_i) denote the ranks of the variables x_i and y_i, respectively. \sum_{i = 1}^{n} d_{i}^{2} denotes the sum of the squares of the rank differences for all samples. A coefficient of r = 1 indicates a perfect positive correlation, r = 0 indicates no correlation, and r = −1 indicates a perfect negative correlation. The Spearman correlation heatmap between PV power output and the influencing factors is presented in Figure 6.

As shown in Figure 6, PV power output exhibits positive correlations with module temperature (0.692), ambient temperature (0.487), total solar radiation (0.983), and wind velocity (0.086). In contrast, it shows negative correlations with ground pressure (−0.021) and relative humidity (−0.243).
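A short sketch of Equation (11) shows why the Spearman coefficient captures monotonic but nonlinear relationships. The helper names and sample values are hypothetical, and the simple ranking used here assumes there are no tied values, which is also the condition under which Equation (11) is exact.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via Equation (11) (exact only without ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A monotonic but nonlinear relationship (made-up values): y = x cubed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]
r_monotonic = spearman(x, y)
```

Because only the ranks enter the formula, the cubic relationship above yields a coefficient of exactly 1 even though it is far from linear.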

Theoretically, an increase in module temperature typically leads to a reduction in PV conversion efficiency. However, in real-world operating conditions, higher module temperatures often occur during periods of peak irradiance. As a result, the observed positive correlation between temperature and output power is likely driven by the simultaneous effect of high solar radiation [25,26]. Moreover, wind velocity is often associated with weather phenomena such as increased cloud cover or precipitation, which may reduce solar irradiance. Nevertheless, under specific environmental conditions and operating states, moderate wind velocity can enhance heat dissipation from the PV modules, thereby reducing module temperature and improving power conversion efficiency [27].

To quantitatively evaluate the direct impact of wind velocity on photovoltaic (PV) power output and eliminate the confounding effect of global solar radiation, this study employs partial correlation analysis. Partial correlation is a statistical method used to measure the strength of the linear relationship between two target variables while controlling for one or more other variables. In the context of multiple meteorological factors, this method effectively reveals the net influence of each factor on PV system power output.

Assuming three variables x₁, x₂, x₃, the partial correlation coefficient r_{12,3} between x₁ and x₂, controlling for x₃, can be calculated using the following formula:

r_{12, 3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{1 - r_{13}^{2}} \cdot \sqrt{1 - r_{23}^{2}}}

(12)

where r₁₂, r₁₃ and r₂₃ represent the Pearson correlation coefficients between the respective variable pairs.

The results show that the partial correlation coefficient between wind velocity and PV power output is 0.0027, which is significantly lower than the Pearson (0.125) and Spearman (0.086) correlation coefficients. This indicates that the direct linear effect of wind velocity on PV power output is weak, and that the observed correlation is largely driven by shared variation with solar irradiance and other factors.
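The effect of controlling for a third variable in Equation (12) can be seen with a small numerical sketch. Here variable 1 is wind velocity, variable 2 is PV power and variable 3 is solar radiation. The values 0.125 and 0.926 are the Pearson coefficients quoted in the text, but the wind-radiation correlation of 0.13 is a hypothetical placeholder, since that value is not reported here, so the output is not expected to reproduce the 0.0027 stated above.

```python
import math


def partial_corr(r12, r13, r23):
    """First-order partial correlation r_{12,3} from Equation (12)."""
    return (r12 - r13 * r23) / (math.sqrt(1 - r13 ** 2) * math.sqrt(1 - r23 ** 2))


r_wind_power = 0.125       # Pearson coefficient quoted in the text
r_power_radiation = 0.926  # Pearson coefficient quoted in the text
r_wind_radiation = 0.13    # hypothetical value, not taken from the study
r_partial = partial_corr(r_wind_power, r_wind_radiation, r_power_radiation)
```

When the third variable is strongly correlated with one of the pair, as radiation is with power, even a modest correlation between it and the other variable can account for most of the raw correlation.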

The above analysis confirms that solar radiation is the most critical factor affecting PV power output, showing a strong correlation. In addition, module temperature, ambient temperature, and relative humidity also exhibit relatively strong correlations and should be considered key influencing variables in predictive modeling.

4.3. Similar Day Data Clustering

The pre-processed data were clustered with the K-means algorithm based on total solar radiation and cloud opacity. As described in Section 2.1, the algorithm randomly selects K data objects as the initial cluster centroids, calculates the distance between each data object and every centroid, and assigns each object to the closest centroid; each centroid and its assigned objects form a cluster. The centroids are then updated, and the process is iterated until they no longer change. To determine the optimal number of clusters, the value of K was varied from 2 to 6, and the corresponding silhouette coefficients were calculated. The silhouette coefficient curve is shown in Figure 7. A higher silhouette coefficient indicates better clustering performance. The results show that the silhouette coefficient reaches its maximum when K = 3, suggesting that three clusters provide the most effective separation.

Therefore, in this study, K is set to 3 to represent the three weather types: cloudy, rainy, and sunny days. Cloudy days account for 43% of the data, rainy days for 10%, and sunny days for 47%. Figure 8 visualises the clusters for the different weather types selected from June to August. The differently colored curves in the figure represent the trend of PV power on different dates.
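The silhouette-based choice of K can be illustrated with a minimal sketch. To keep it short, the sketch scores two hand-written labelings of a toy dataset instead of running K-means for each K from 2 to 6; the function name, the toy points and both labelings are assumptions, and the code assumes no duplicate points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the members of each cluster (excluding p itself).
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None
        a = mean_dist[labels[i]]        # cohesion within the point's own cluster
        if a is None:
            scores.append(0.0)          # a singleton cluster scores 0 by convention
            continue
        b = min(mean_dist[c] for c in clusters if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Three well-separated pairs of made-up points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]]
candidates = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

Merging two natural groups into one cluster raises the within-cluster distance a for their members, which lowers the score for the two-cluster labeling relative to the three-cluster one.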

4.4. Parameter Settings

In this paper, the CNN-GRU model is used to predict the photovoltaic output power, and a comparative analysis is conducted with the CNN, GRU, and Transformer models. The parameter settings of the prediction models are shown in Table 1.

4.5. Forecast Results and Analysis

To verify the prediction performance of the CNN-GRU model, this paper compares it with the CNN, GRU, and Transformer models. Each model is trained and tested on the clustered weather sample data, and the prediction results of each model under the three weather conditions are compared and analysed. The results are shown in Figure 9.

Figure 9a presents a comparison of the prediction results of various models under sunny day conditions. Overall, all four models effectively forecast the general trend of photovoltaic (PV) power output. As shown in the magnified view of the local peak area on the right, the CNN-GRU model demonstrates superior dynamic response speed and amplitude accuracy compared to the standalone CNN and GRU models. The Transformer model also exhibits strong tracking ability in terms of local prediction accuracy. Its prediction curve is relatively smooth and capable of accurately capturing power peaks, performing slightly better than the GRU model but slightly worse than the CNN-GRU and CNN models.

Figure 9b illustrates the prediction results under cloudy day conditions, where solar irradiance fluctuates significantly, leading to more pronounced differences in model performance. The GRU model exhibits a clear phase shift relative to the actual values and only captures the general trend of PV power variation, failing to reflect short-term fluctuations. The CNN model shows improved prediction accuracy due to its enhanced ability to extract local features through convolutional kernels. The CNN-GRU model, which integrates the strengths of both CNN and GRU, demonstrates a strong advantage in tracking sudden changes while maintaining the overall trend. Although the Transformer model can generally forecast the overall trend, its response to abrupt changes is slower than that of the CNN-GRU and CNN models, with accuracy slightly better than the GRU model.

Figure 9c compares the prediction performance under rainy day conditions, characterised by intense meteorological fluctuations and a pronounced non-stationary nature of PV power output. Under such conditions, the GRU, CNN, and Transformer models all struggle to accurately capture the rapid variations in PV power, exhibiting significant prediction delays. In contrast, the CNN-GRU model responds more promptly to abrupt changes in power, with prediction trends that align more closely with the actual power output curve.

In this study, the predictive performance of each model is evaluated using three metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the coefficient of determination (R²). Figure 10 and Table 2 present a comparison of the prediction errors of the models under sunny day conditions. It can be observed that the CNN-GRU model outperforms all other models across all three evaluation metrics. Specifically, the CNN-GRU model achieves an MAE of 2.9252, which is a reduction of 43.5%, 69.4%, and 55.7% compared to the CNN (5.1729), GRU (9.5627), and Transformer (6.5987) models, respectively. Its RMSE is 3.6364, representing a decrease of 40.93%, 66.67%, and 54.84% compared to the CNN (6.1559), GRU (10.9115), and Transformer (8.0516) models, respectively. Moreover, the CNN-GRU model achieves an R² value of 0.9971, significantly higher than those of the CNN (0.9763), GRU (0.9256), and Transformer (0.9593) models, with relative improvements of 2.13%, 7.72%, and 3.94%, respectively. This shows that the CNN-GRU model performs more accurately than the CNN model and the GRU model in predicting the sunny day conditions, and its prediction result curve has a higher goodness-of-fit with the measured data.
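The percentage figures quoted above follow from a simple relative-change calculation, sketched here with the sunny-day values from the text. The function names are hypothetical; the definition (difference divided by the baseline value) is inferred from the reported numbers rather than stated in the source.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which the proposed error is lower than the baseline error."""
    return (baseline - proposed) / baseline * 100.0


def relative_gain(baseline, proposed):
    """Percentage by which the proposed score exceeds the baseline score."""
    return (proposed - baseline) / baseline * 100.0


# Sunny-day values quoted in the text.
cnn_gru_mae = 2.9252
baseline_mae = {"CNN": 5.1729, "GRU": 9.5627, "Transformer": 6.5987}
mae_reductions = {
    name: round(relative_reduction(value, cnn_gru_mae), 1)
    for name, value in baseline_mae.items()
}
r2_gain_vs_cnn = round(relative_gain(0.9763, 0.9971), 2)
```

For example, the reduction relative to the CNN is (5.1729 − 2.9252) / 5.1729 ≈ 43.5%, which matches the figure reported for the sunny-day MAE.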

Compared to sunny day conditions, the nonlinear fluctuations in irradiance caused by cloud movement on cloudy days impose greater challenges on model robustness. The CNN-GRU model, by leveraging the convolutional neural network’s ability to deeply extract spatial correlations among meteorological factors and combining it with the gated recurrent unit’s strength in modeling temporal dynamics, is capable of effectively capturing the transient features of solar radiation under cloudy weather conditions. Figure 11 and Table 3 present a comparative analysis of prediction errors among the models under cloudy day scenarios. As shown in the table, the performance differences among the models become more pronounced under these conditions. Based on the three evaluation metrics (MAE, RMSE, and R²), the CNN-GRU model continues to demonstrate superior predictive performance. Specifically, the CNN-GRU model achieves an MAE of 3.2157, representing reductions of 47.9%, 75.2%, and 60.1% compared to the CNN (6.1688), GRU (12.9789), and Transformer (8.0500) models, respectively. Its RMSE is 3.9912, reflecting decreases of 44.0%, 74.2%, and 61.5% relative to the CNN (7.1338), GRU (15.4523), and Transformer (10.3716) models, respectively. Moreover, the CNN-GRU model achieves an R² of 0.9892, which is 2.47%, 18.1%, and 6.75% higher than the CNN (0.9654), GRU (0.8376), and Transformer (0.9266) models, respectively. These results indicate that the CNN-GRU model exhibits a significantly enhanced ability to capture PV power fluctuations under complex meteorological conditions, demonstrating strong robustness and predictive accuracy in the presence of cloudy weather.
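The percentage changes quoted for the cloudy scenario can be recomputed from the Table 3 entries. The sketch below assumes that each percentage is the change relative to the baseline model's value; the helper names `pct_reduction` and `pct_gain` are illustrative. Because the table entries are rounded to four decimals, a recomputed figure may differ from the quoted one in the last digit.

```python
def pct_reduction(baseline, proposed):
    """Relative reduction of an error metric (MAE, RMSE), in percent."""
    return (baseline - proposed) / baseline * 100.0


def pct_gain(baseline, proposed):
    """Relative improvement of a score metric (R2), in percent."""
    return (proposed - baseline) / baseline * 100.0


# Table 3 (cloudy days): model -> (MAE, RMSE, R2)
cloudy = {
    "CNN": (6.1688, 7.1338, 0.9654),
    "GRU": (12.9789, 15.4523, 0.8376),
    "Transformer": (8.0500, 10.3716, 0.9266),
    "CNN-GRU": (3.2157, 3.9912, 0.9892),
}

best_mae, best_rmse, best_r2 = cloudy["CNN-GRU"]
for name, (mae_, rmse_, r2_) in cloudy.items():
    if name == "CNN-GRU":
        continue
    print(
        f"{name}: MAE -{pct_reduction(mae_, best_mae):.1f}%, "
        f"RMSE -{pct_reduction(rmse_, best_rmse):.1f}%, "
        f"R2 +{pct_gain(r2_, best_r2):.2f}%"
    )
```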

Under rainy day conditions, photovoltaic (PV) power output is subject to more complex meteorological disturbances. Figure 12 and Table 4 present a comparison of prediction errors across different models under rainy weather scenarios. It can be observed that the CNN-GRU model continues to exhibit a significant advantage across all three evaluation metrics. Specifically, the CNN-GRU model achieves an MAE of 1.5746, which is substantially lower than that of the CNN (7.2725), GRU (7.5287), and Transformer (9.1324) models, representing reductions of 78.3%, 79.1%, and 82.8%, respectively. The RMSE of the CNN-GRU model is 2.0512, which corresponds to decreases of 77.4%, 75.6%, and 80.8% compared to the CNN (9.0641), GRU (8.4076), and Transformer (10.6653) models, respectively. In terms of the coefficient of determination, the CNN-GRU model attains an R² value of 0.9820, which is significantly higher than that of the CNN (0.6478), GRU (0.6970), and Transformer (0.6256) models, with improvements of 51.7%, 40.9%, and 56.9%, respectively. These results suggest that, although the Transformer model exhibits a certain level of predictive capability in some scenarios, its stability and accuracy remain inferior to those of the hybrid CNN-GRU architecture. The CNN-GRU model demonstrates generalizable value in enhancing the performance of PV power forecasting systems across varying weather conditions.

To evaluate the training efficiency of CNN, GRU, Transformer and CNN-GRU models under different weather conditions, this study recorded the training times of each model during sunny, cloudy, and rainy scenarios. Table 5 presents a comparison of training times across different weather conditions.

According to Table 5, the CNN model exhibits the shortest training time under sunny and cloudy conditions, which can be attributed to its relatively simple structure and lower computational overhead when processing time-series data. The GRU model, on the other hand, requires sequential processing to capture temporal dependencies, resulting in a longer training time. The CNN-GRU model leverages the CNN layers to extract local features and reduce the dimensionality of the input data, thereby alleviating the computational burden on the subsequent GRU layer. As a result, its training time falls between that of the CNN and GRU models in all three scenarios. The Transformer model, due to its multi-head attention mechanism and complex encoder architecture, demands significantly more computation during training. Consequently, it records the longest training times under sunny and cloudy conditions, 49.28 s and 43.89 s, respectively, highlighting its disadvantage in computational efficiency. Under rainy conditions, however, the Transformer achieves the shortest training time (13.64 s), slightly below that of the CNN model (14.91 s), possibly due to reduced data volume or improved architectural adaptability to this specific scenario.
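The ordering described above can be checked directly against Table 5. The following sketch only re-reads the tabulated training times (in seconds); the variable names are illustrative.

```python
# Table 5 training times in seconds: weather -> {model: time}
train_time = {
    "Sunny": {"CNN": 27.65, "GRU": 37.61, "Transformer": 49.28, "CNN-GRU": 33.96},
    "Cloudy": {"CNN": 25.55, "GRU": 36.86, "Transformer": 43.89, "CNN-GRU": 32.90},
    "Rainy": {"CNN": 14.91, "GRU": 25.50, "Transformer": 13.64, "CNN-GRU": 19.48},
}

# Fastest and slowest model per weather type.
fastest = {w: min(t, key=t.get) for w, t in train_time.items()}
slowest = {w: max(t, key=t.get) for w, t in train_time.items()}

# Does the hybrid model sit between CNN and GRU in every scenario?
hybrid_between = all(t["CNN"] < t["CNN-GRU"] < t["GRU"] for t in train_time.values())

print("fastest:", fastest)
print("slowest:", slowest)
print("CNN-GRU between CNN and GRU everywhere:", hybrid_between)
```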

While the CNN model trains quickly, its prediction performance is relatively limited. The Transformer model is more accurate than the GRU model under sunny and cloudy conditions but suffers from high computational cost in those scenarios. In contrast, the proposed CNN-GRU model achieves a good balance between predictive accuracy and training efficiency. It consistently delivers reliable performance across various weather conditions, confirming its practicality and robustness in photovoltaic power prediction tasks.

5. Conclusions

In this paper, we propose a photovoltaic power prediction model based on similar day clustering and CNN-GRU, validate its performance through a case study, and draw the following conclusions.

(1) Pearson and Spearman correlation coefficient analyses are applied to screen out the key factors affecting the PV output power, reduce the input dimensions of the model, and eliminate the interference of features with smaller influences.

(2) The K-means algorithm is used to classify the original data into three weather types (sunny, cloudy, and rainy), and the model is trained and evaluated separately for each weather type to further improve the prediction accuracy.

(3) Combining CNN feature extraction with the nonlinear fitting capability of the GRU, the CNN-GRU model is proposed to predict the PV output power under sunny, cloudy, and rainy weather scenarios, respectively. The case study shows that, compared with the other models, MAE and RMSE are reduced by 66.1% and 65.7% on average and R² is improved by 19.8% on average across the three weather type scenarios. This verifies that the model has high prediction accuracy and generalisation ability and achieves better results in PV output power prediction.
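The feature screening in conclusion (1) can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the threshold of 0.5, the rule that a feature must pass on both coefficients, the feature names, and the sample values are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def screen(features, target, threshold=0.5):
    """Keep features whose |Pearson| and |Spearman| both reach the threshold.

    The threshold and the 'both coefficients' rule are assumptions.
    """
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
        and abs(spearman(column, target)) >= threshold
    ]


# Hypothetical samples (illustrative only, not from the paper).
power = [5, 20, 40, 55, 30, 10]
features = {
    "total_radiation": [100, 300, 600, 800, 450, 150],
    "module_temp": [20, 26, 35, 41, 31, 22],
    "wind_direction": [180, 10, 200, 90, 350, 120],
}

print(screen(features, power))
```

Using both coefficients guards against keeping a feature that is only linearly related (Pearson) or only monotonically related (Spearman) to the PV output.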

Accurate photovoltaic (PV) power generation forecasting enables the effective optimization of PV generation and energy storage system scheduling, thereby improving the utilization of renewable energy and reducing system operating costs. Moreover, precise PV forecasting is essential for load management in power grids, as it provides critical data support for grid dispatching, enhances the grid’s adaptability to variable power output, and strengthens its overall stability and flexibility.

Nevertheless, this study still has several limitations in terms of practical application, which warrant further exploration and improvement in future research:

(1) In this study, the dataset was divided into three subsets—sunny, cloudy and rainy—using the k-means clustering algorithm. Separate training and testing procedures were then applied to each subset. However, real-world weather conditions are far more complex. In practical deployment, accurately classifying and predicting weather types based on weather forecast data remains a challenge. Reducing the impact of forecast inaccuracies on prediction performance and ensuring the model’s stability under dynamically changing weather conditions are key issues that need to be addressed in future work.

(2) The model was trained and tested using summer data from a PV plant located in Ningxia, China. Therefore, its applicability to other geographical regions and under extreme weather conditions remains limited. Future research should focus on collecting and integrating PV generation data from multiple regions, and developing and promoting more generalised PV forecasting models capable of adapting to region-specific climate characteristics.

(3) In practical applications, the deployment of the model may face limitations related to computational resources and real-time data processing. Future research should consider incorporating predictive input variables such as Numerical Weather Prediction (NWP) data to explore the model’s robustness and generalization under forecast-based conditions. This will further enhance the overall operational efficiency of the power grid and provide theoretical support and practical guidance for large-scale photovoltaic integration and the development of smart grids.
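Limitation (1) concerns assigning a new day to a weather type before the matching model is applied. One simple option, sketched here under stated assumptions, is to cluster historical days with K-means (k = 3) and then assign a forecast day to the nearest centroid. The two daily features (normalised mean irradiance and a variability index), the sample days, the seeding of the centroids, and the order of the labels are hypothetical and not taken from the paper.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))


def kmeans(points, init_centroids, iters=50):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group (keep it if empty).
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical daily features: (normalised mean irradiance, variability index).
days = [
    (0.90, 0.10), (0.85, 0.15), (0.95, 0.12),  # clear-sky-like days
    (0.55, 0.45), (0.60, 0.50), (0.50, 0.40),  # strongly fluctuating days
    (0.15, 0.20), (0.20, 0.25), (0.10, 0.15),  # low-irradiance days
]

# Assumption: one seed per expected weather type, so labels follow seed order.
labels = ["sunny", "cloudy", "rainy"]
centroids = kmeans(days, init_centroids=[days[0], days[3], days[6]])

# A day described by forecast data is routed to the model of its nearest cluster.
forecast_day = (0.58, 0.42)
weather_type = labels[nearest(forecast_day, centroids)]
print(weather_type)
```

In such a setup, forecast errors in the daily features translate directly into misrouted days, which is the sensitivity that limitation (1) points to.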

Author Contributions

Conceptualization, C.G. and S.Z.; methodology, C.G. and Z.L.; software, C.G., S.Z. and W.S.; validation, Z.L. and H.L.; formal analysis, W.S. and H.L.; investigation, B.Z. and D.G.; resources, B.Z. and D.G.; data curation, S.Z.; writing—original draft preparation, C.G.; writing—review and editing, S.Z.; visualization, Z.L.; supervision, D.G.; project administration, D.G.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Key Lab of Intelligent Transportation System under Project, grant number 2024-B009 and SDUT & Zibo City Integration Development Project, grant number 2022JS005.

Data Availability Statement

The data are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. New and total installed PV capacity in China.

Figure 2. CNN structure.

Figure 3. Internal structure of the GRU model.

Figure 4. Flowchart of photovoltaic forecasting.

Figure 5. Heatmap of Pearson correlation coefficients between PV power output and influencing factors.

Figure 6. Heatmap of Spearman correlation coefficients between PV power output and influencing factors.

Figure 7. Silhouette Plot for different values of K.

Figure 8. Clustering results for different weather types from June to August. (a) Sunny days; (b) Cloudy days; (c) Rainy days.

Figure 9. Prediction results of each model under different weather conditions. (a) Sunny days; (b) Cloudy days; (c) Rainy days.

Figure 10. Comparison of the errors of the models in the case of sunny days.

Figure 11. Comparison of the errors of the models under cloudy days.

Figure 12. Comparison of the errors of the models under rainy days.

Table 1. Parameter settings of prediction models.

Network Structure	Related Parameters	Parameter Values
Overall Structure	Optimiser	Adam
Overall Structure	Batch Size	256
Overall Structure	Training Epochs	500
Overall Structure	Initial Learning Rate	0.003
Overall Structure	Learning Rate Decay	0.1
CNN Module (2 Layers)	Kernel Size	[1, 3]
CNN Module (2 Layers)	Number of Kernels	[16, 32]
CNN Module (2 Layers)	Activation Function	ReLU
CNN Module (2 Layers)	Dropout	0.2
GRU Module (1 Layer)	Number of Neurons	20
GRU Module (1 Layer)	Dropout	0.2
Fully Connected Layer	Number of Neurons	1
Fully Connected Layer	Activation Function	ReLU
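To give a feel for the size of the network in Table 1, the following sketch counts trainable parameters from the tabulated layer settings. Several points are assumptions, since the table does not state them: the convolutions are taken to be one-dimensional with the input features as channels, the GRU is taken to receive the 32 convolution channels at each time step, the fully connected layer is taken to read the final GRU state, and `n_features = 4` is a hypothetical number of input features.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel of a 1-D convolution."""
    return kernel_size * in_channels * out_channels + out_channels


def gru_params(input_size, hidden_size):
    """Three gate blocks (update, reset, candidate), one bias vector each.

    Frameworks that keep separate input and recurrent biases add another
    3 * hidden_size parameters on top of this count.
    """
    return 3 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


n_features = 4          # hypothetical: number of screened input features
kernel_sizes = [1, 3]   # Table 1: CNN kernel sizes
n_kernels = [16, 32]    # Table 1: CNN kernel counts
gru_units = 20          # Table 1: GRU neurons

total = 0
channels = n_features
for kernel, out_channels in zip(kernel_sizes, n_kernels):
    total += conv1d_params(channels, out_channels, kernel)
    channels = out_channels
total += gru_params(channels, gru_units)
total += dense_params(gru_units, 1)

print(total)
```

Under these assumptions the model has only a few thousand parameters, most of them in the GRU layer.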

Table 2. Errors in the prediction results of each model under sunny days.

Model	MAE	RMSE	R²
CNN	5.1729	6.1559	0.9763
GRU	9.5627	10.9115	0.9256
Transformer	6.5987	8.0516	0.9593
CNN-GRU	2.9252	3.6364	0.9971

Table 3. Errors in the prediction results of each model under cloudy days.

Model	MAE	RMSE	R²
CNN	6.1688	7.1338	0.9654
GRU	12.9789	15.4523	0.8376
Transformer	8.0500	10.3716	0.9266
CNN-GRU	3.2157	3.9912	0.9892

Table 4. Errors in the prediction results of each model under rainy days.

Model	MAE	RMSE	R²
CNN	7.2725	9.0641	0.6478
GRU	7.5287	8.4076	0.6970
Transformer	9.1324	10.6653	0.6256
CNN-GRU	1.5746	2.0512	0.9820

Table 5. Training time (s) comparison of models under different weather conditions.

Weather Condition	CNN	GRU	Transformer	CNN-GRU
Sunny	27.65	37.61	49.28	33.96
Cloudy	25.55	36.86	43.89	32.90
Rainy	14.91	25.50	13.64	19.48


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
