1. Introduction
With rapid social and economic development, global demand for energy continues to rise, while the limitations and environmental hazards of traditional fossil energy sources such as coal, oil and natural gas are becoming increasingly prominent [1]. In this context, adjusting the energy structure and accelerating the development and utilisation of new energy sources has become a key path to resolving the energy crisis and achieving sustainable development [2]. According to the latest report released by the International Energy Agency (IEA), global renewable energy capacity additions rose to 700 GW in 2024, setting a new record for the 22nd consecutive year. Among these sources, solar energy, as a representative clean and renewable energy source, has great potential [3]. In recent years, innovations in photovoltaic materials and device architectures have continuously driven industrial upgrades. Novel materials and structural designs have significantly improved the conversion efficiency and long-term stability of photovoltaic devices, providing a solid physical foundation for large-scale applications [4,5]. In 2024, global new PV installed capacity is expected to exceed 550 GW, with China accounting for more than 50% and an average growth rate of more than 35% [6]. This growth trend is shown in Figure 1. As the installed capacity of PV increases year by year, the proportion of PV power generation in the power system is rising. Accurate and effective PV power prediction is therefore important for the safe and stable operation of the power system.
PV output is affected by many factors, resulting in strong volatility and stochasticity, which makes PV power prediction difficult. Current PV power prediction methods are broadly classified into three types: physical models, statistical methods and deep learning algorithms.
The physical modelling approach to PV power prediction uses meteorological conditions such as solar radiation and cloudiness provided by numerical weather prediction (NWP), combined with the characteristics of the PV module, to establish a mathematical relationship between environmental conditions and output power, from which the output power is predicted [7]. Holland et al. [8] constructed a physical PV simulation model to predict PV power based on numerical weather forecasts and local irradiance measurements. Markovics et al. [9] tested and compared 24 physical models for predicting PV power based on NWP and found that hyper-parameter tuning significantly reduced the prediction error of the models. Zhi et al. [10] proposed a physical model with environmental parameter prediction and an improved maximum power point tracking algorithm to achieve PV power prediction under different weather conditions. To reduce the systematic error of the weather forecasting system and further improve prediction accuracy, scholars have proposed combining ensemble NWP with an ensemble physical model chain, which post-processes the weather data and significantly improves the accuracy of PV power prediction [11,12]. However, physical models depend heavily on highly accurate weather forecast data, have high computational complexity, are mainly suited to short-term prediction under stable meteorological conditions, and generalise poorly to regional studies [13].
As research progressed, statistical methods were applied to forecasting by developing mathematical models that reveal the intrinsic relationship between power generation and key variables such as time of day and weather conditions. Atique et al. [14] applied an autoregressive integrated moving average (ARIMA) model to predict PV output power by transforming seasonal and non-stationary time series data into a stationary form. Jeong et al. [15] proposed using a seasonal autoregressive integrated moving average (SARIMA) model to predict PV output power and evaluated its prediction performance. Li et al. [16] proposed an adaptive seasonal autoregressive integrated moving average (ASARIMA) model to predict PV power generation, and the experimental results showed that the proposed model performed better than other existing power prediction algorithms. Jung et al. [17] proposed a regional PV power prediction method based on a vector autoregressive (VAR) model, and the validation results showed that the VAR model was more accurate than the ARIMA model. Wang et al. [18] proposed a short- and medium-term forecasting method for regional PV power generation based on a fuzzy support vector machine, and the experimental results showed that the method can effectively shorten the forecasting time while maintaining high accuracy. Statistical methods have demonstrated significant advantages in dealing with linear relationships and trend prediction, but their prediction accuracy and generalization ability may be limited when faced with the complex, non-linear effects of variable meteorological conditions on PV power.
With the continuous innovation of artificial intelligence and machine learning technologies, more and more research explores the application of these techniques to PV power generation prediction, aiming to obtain more accurate and stable results. Liu et al. [19] proposed an improved whale algorithm to optimise a support vector machine model, which is trained with less data and achieves the desired prediction accuracy under different weather conditions. Khan et al. [20] used an Artificial Neural Network (ANN) trained on PV sample data to predict PV output power, and the experimental results showed that the ANN was able to predict the PV output power accurately. Since every single model has certain limitations, recent studies have sought to further improve prediction accuracy and robustness by combining two or more prediction models. Wang et al. [21] proposed a PV power prediction method based on frequency domain decomposition and a hybrid deep learning model, and the results show that the proposed model improves prediction accuracy and stability by about 15% on average for seven-day-ahead prediction compared to other prediction models. Hu et al. [22] proposed a novel model, CA-Transformer, which employs Copula functions to address the limitations of traditional methods in capturing nonlinear relationships in photovoltaic data and to improve prediction results. Wu et al. [23] proposed a combined IXGBoost-KELM short-term PV power prediction model consisting of multidimensional similar day clustering and pairwise decomposition to predict PV power generation under three weather conditions. Liu et al. [24] proposed an ultra-short-term PV power generation prediction model based on wavelet decomposition, a dual attention mechanism and a bidirectional long short-term memory network (W-DA-BiLSTM), and the experimental results show that their model predicts PV power generation with higher accuracy and efficiency and can effectively handle the stochastic fluctuations and nonlinearity common in PV power generation. Overall, combined models can avoid the limitations of a single model by leveraging the complementary strengths of their components, and can therefore achieve better prediction results than a single model.
In this paper, a PV power prediction method based on the combination of similar day clustering and CNN-GRU is proposed based on previous research. The main contributions of this study are as follows:
(1) A subtyping prediction framework incorporating similar-day clustering is constructed to improve the model’s adaptability to different weather conditions. By introducing K-means clustering to classify similar weather samples, the model can be targeted for training under relatively consistent meteorological feature scenarios, which improves the prediction stability and generalization ability.
(2) A hybrid deep learning model integrating Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU) is proposed. CNN is used to extract local features from high-dimensional meteorological data, while GRU captures both short-term and long-term dependencies in the time series, effectively leveraging the complementary strengths of both methods in feature extraction.
(3) The model’s performance is systematically evaluated based on real meteorological and power generation data across different weather types. Experimental results demonstrate that the proposed method outperforms baseline models in terms of accuracy and robustness. This study not only validates the model’s practicality in real-world scenarios but also provides a scalable modeling approach for high-precision, context-specific photovoltaic power forecasting.
3. Photovoltaic Forecasting Model Construction
The accuracy of PV power prediction is significantly affected by changes in climatic conditions. To improve the accuracy of PV power prediction, this paper proposes a PV power prediction model based on similar day clustering with CNN-GRU. Based on historical photovoltaic power and climate conditions, the K-means algorithm classifies the data into three weather types, namely sunny, cloudy and rainy days. The data for each type are then fed into the CNN-GRU model for photovoltaic power prediction, and the prediction accuracy of the model is verified using a multi-indicator evaluation system.
3.1. Modelling
In this paper, we propose a PV power prediction model based on similar day clustering and CNN-GRU, which achieves scenario segmentation through weather clustering and combines feature selection with deep learning to achieve accurate predictions for different weather patterns. The model flow is shown in Figure 4, and the specific prediction steps are as follows.
(1) The raw PV power data are cleaned and processed, and the input features that are strongly correlated with PV output power are screened using the Pearson and Spearman correlation coefficients; the selected features are used as the model inputs.
(2) Cluster analysis is performed based on key meteorological factors using the K-means algorithm to classify the data into three weather types: sunny, cloudy and rainy.
(3) The PV power generation data under different weather types are used to train and evaluate the CNN-GRU model on the training and testing sets, respectively.
(4) The prediction results under each weather type are evaluated using the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and coefficient of determination (R2), which are compared across the different models.
3.2. Evaluation Indicators
To ensure the reliability and practical applicability of the prediction results, this paper selects the following three evaluation indicators to comprehensively assess the prediction accuracy of the model.
- (1) Mean Absolute Error (MAE)
The mean absolute error is the average of the absolute errors between the predicted and actual values. It takes values in the range $[0, +\infty)$; the smaller the value, the closer the predicted value is to the actual value and the higher the prediction accuracy of the model. It is calculated as:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
- (2) Root Mean Square Error (RMSE)
The root mean square error represents the sample standard deviation of the difference between the predicted value and the real value, indicating the degree of dispersion of the prediction error. It takes values in the range $[0, +\infty)$. The smaller the value, the smaller the error between the predicted value and the real value, and the higher the model accuracy.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
- (3) Coefficient of determination (R2)
The coefficient of determination indicates the extent to which the variable X explains Y. $R^2$ typically takes values in the range $[0, 1]$, and the closer its value is to 1, the better the model fit.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the average of the true values.
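The three indicators described in this subsection can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the sample values are hypothetical and are not taken from the paper's experiments.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted power values, for illustration only.
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print(mae(measured, predicted))
print(rmse(measured, predicted))
print(r2(measured, predicted))
```

Note that a value computed this way can fall below zero on a test set when a model fits worse than the mean of the measured values.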
4. Case Study
The experimental data used in this study were collected from a medium-sized photovoltaic (PV) power station located in Ningxia, China, with an installed capacity of 150 kW. PV generation data from June to August 2020 were used for model development and evaluation. Given the intermittent nature of solar power, PV output data were recorded daily from 08:15 to 20:15 at 15-min intervals, resulting in a total of 5303 data samples. The collected variables include PV output power, total solar irradiance, module temperature, air temperature, humidity, surface pressure and wind velocity. For each weather category, the data were divided into training and testing sets in a 7:3 ratio based on chronological order, ensuring that the test data occurred after the training data. The prediction performance of the proposed CNN-GRU model was compared with that of individual CNN, GRU, and Transformer models.
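The chronological 7:3 split per weather category can be sketched as follows. This is an illustrative sketch only: the record layout (dictionaries with `timestamp` and `weather` keys) and the function names are assumptions, not the authors' data format.

```python
def chronological_split(samples, train_ratio=0.7):
    """Sort samples by time and cut once, so every test sample is later
    than every training sample."""
    ordered = sorted(samples, key=lambda s: s["timestamp"])
    cut = round(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def split_by_weather(samples, train_ratio=0.7):
    """Apply the chronological split separately to each weather category."""
    groups = {}
    for sample in samples:
        groups.setdefault(sample["weather"], []).append(sample)
    return {w: chronological_split(g, train_ratio) for w, g in groups.items()}


# Hypothetical records: ISO timestamps sort correctly as plain strings.
records = [
    {"timestamp": f"2020-06-{day:02d}T12:00", "weather": "sunny", "power": 100.0 + day}
    for day in range(1, 11)
]
splits = split_by_weather(records)
train, test = splits["sunny"]
print(len(train), len(test))
```

Splitting by time rather than at random avoids using future observations to predict past ones.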
4.1. Data Preprocessing
Blank values often appear in PV data due to sensor failures, data transmission delays, etc. To address the problem of missing data in PV datasets and to ensure the integrity and continuity of the data, the mean supplementation method is used to fill in the blank values. Priority is given to using historical averages over the same time period for supplementation, maintaining the overall statistical properties of the data while reducing the bias introduced by the filling process.
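A minimal sketch of this kind of mean supplementation is given below. It assumes the data are arranged as one list per day with one entry per 15-min slot and `None` marking a blank; this layout, the function name, and the choice to average over all available days for the same slot are assumptions made for illustration.

```python
def fill_missing_by_slot_mean(days):
    """Replace None entries with the mean of the observed values recorded
    in the same time slot on the other days."""
    n_slots = len(days[0])
    filled = [list(day) for day in days]  # copy so the input is untouched
    for slot in range(n_slots):
        observed = [day[slot] for day in days if day[slot] is not None]
        if not observed:
            continue  # no history for this slot, leave the blanks as they are
        slot_mean = sum(observed) / len(observed)
        for day in filled:
            if day[slot] is None:
                day[slot] = slot_mean
    return filled


# Hypothetical three-day record with three time slots per day.
raw = [
    [1.0, 5.0, None],
    [3.0, None, 8.0],
    [None, 7.0, 10.0],
]
print(fill_missing_by_slot_mean(raw))
```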
The input features in the PV power generation data have different magnitudes and distribution ranges. To eliminate the effect of magnitude between different features, this paper adopts the Z-score normalisation method to normalise the data, and its calculation formula is as follows.
$$z = \frac{x - \mu}{\sigma}$$
where $z$ is the standardised data, $x$ is the original data, $\mu$ is the data mean and $\sigma$ is the data standard deviation.
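As a small illustration of Z-score normalisation, the sketch below estimates the mean and standard deviation from one set of values and applies them to the data. The function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator is used.

```python
import statistics


def fit_standardizer(values):
    """Return the mean and (population) standard deviation of the values."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column, for illustration only.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_standardizer(feature)
print(standardize(feature, mu, sigma))
```

In a forecasting setting it is common to estimate the mean and standard deviation on the training set only and reuse them for the test set, so that no test information leaks into training.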
4.2. Analysis of the Impact Factors of Photovoltaic Power Generation
The output power of photovoltaic power generation is affected by various environmental and system factors, such as solar irradiance, ambient temperature, module temperature, humidity, and wind velocity. To quantitatively assess the correlation between these factors and PV power, the correlation coefficients between the input features and the output power were analyzed using Pearson correlation analysis, which reflects the degree of influence of different variables on PV power. The coefficient is calculated as:
$$r_{xy} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}$$
where $x_i$ and $y_i$ are the $i$-th samples of the input feature and the PV power, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of samples.
The Pearson correlation coefficient lies in the range [−1, 1], with positive values indicating a positive correlation and negative values indicating a negative correlation. The closer the absolute value is to 1, the stronger the correlation. The heat map of Pearson’s correlation coefficient between PV power and the factors is shown in Figure 5.
From Figure 5, it can be seen that the PV power is positively correlated with module temperature, air temperature, total solar radiation, and wind velocity, with correlation values of 0.784, 0.422, 0.926, and 0.125, respectively. It shows a negative correlation with ground pressure and relative humidity, with values of −0.053 and −0.278, respectively.
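The Pearson coefficient used for this analysis can be computed directly from its definition. The sketch below is a plain standard-library implementation; the function name and the sample values are hypothetical and do not reproduce the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the common 1/n factors cancel)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


# Hypothetical irradiance and power readings, for illustration only.
irradiance = [100.0, 200.0, 300.0, 400.0, 500.0]
power = [12.0, 25.0, 33.0, 48.0, 55.0]
print(pearson(irradiance, power))
```

A feature-screening step could then keep only those inputs whose absolute coefficient exceeds a chosen threshold; any such threshold would be a modelling choice that the text does not specify.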
The Pearson correlation coefficient primarily reflects linear dependencies between variables. However, in practical scenarios, many influencing factors may exhibit nonlinear characteristics, and relying solely on Pearson correlation may fail to capture such complex relationships. To provide a more comprehensive evaluation of the factors affecting photovoltaic (PV) power generation, this study further employs Spearman’s rank correlation coefficient to reveal potential monotonic relationships between variables. This allows for a more thorough analysis of the influence of each factor on PV output, thereby offering deeper support for model optimization and the improvement of prediction accuracy.
Spearman’s rank correlation coefficient is a non-parametric statistical method used to measure the strength of a monotonic relationship between two variables. Its calculation is given as follows:
$$r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$$
where $r$ denotes the Spearman’s rank correlation coefficient, $n$ represents the sample size, and $d_i = R(x_i) - R(y_i)$ is the rank difference for the $i$-th sample, where $R(x_i)$ and $R(y_i)$ denote the ranks of the variables $x_i$ and $y_i$, respectively. $\sum_{i=1}^{n} d_i^2$ denotes the sum of the squares of the rank differences for all samples. A coefficient of $r = 1$ indicates a perfect positive correlation, $r = 0$ indicates no correlation, and $r = -1$ indicates a perfect negative correlation. The Spearman correlation heatmap between PV power output and the influencing factors is presented in Figure 6.
As shown in Figure 6, PV power output exhibits positive correlations with module temperature (0.692), ambient temperature (0.487), total solar radiation (0.983), and wind velocity (0.086). In contrast, it shows negative correlations with ground pressure (−0.021) and relative humidity (−0.243).
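Spearman’s coefficient can be sketched by ranking both variables and applying the rank-difference formula. The implementation below is illustrative only: the function names are hypothetical, tied values receive their average rank, and the rank-difference formula is exact only when there are no ties.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation via r = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A monotonic but non-linear relationship still gives a coefficient of 1.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```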
Theoretically, an increase in module temperature typically leads to a reduction in PV conversion efficiency. However, in real-world operating conditions, higher module temperatures often occur during periods of peak irradiance. As a result, the observed positive correlation between temperature and output power is likely driven by the simultaneous effect of high solar radiation [25,26]. Moreover, wind velocity is often associated with weather phenomena such as increased cloud cover or precipitation, which may reduce solar irradiance. Nevertheless, under specific environmental conditions and operating states, moderate wind velocity can enhance heat dissipation from the PV modules, thereby reducing module temperature and improving power conversion efficiency [27].
To quantitatively evaluate the direct impact of wind velocity on photovoltaic (PV) power output and eliminate the confounding effect of global solar radiation, this study employs partial correlation analysis. Partial correlation is a statistical method used to measure the strength of the linear relationship between two target variables while controlling for one or more other variables. In the context of multiple meteorological factors, this method effectively reveals the net influence of each factor on PV system power output.
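Assuming three variables $x_1$, $x_2$, $x_3$, the partial correlation coefficient $r_{12,3}$ between $x_1$ and $x_2$, controlling for $x_3$, can be calculated using the following formula:
$$r_{12,3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{\left(1 - r_{13}^2\right)\left(1 - r_{23}^2\right)}}$$
where $r_{12}$, $r_{13}$ and $r_{23}$ represent the Pearson correlation coefficients between the respective variable pairs.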
The results show that the partial correlation coefficient between wind velocity and PV power output is 0.0027, which is significantly lower than the Pearson (0.125) and Spearman (0.086) correlation coefficients. This indicates that the direct linear effect of wind velocity on PV power output is weak, and that the observed correlation is largely driven by shared variation with solar irradiance and other factors.
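The first-order partial correlation can be evaluated from the three pairwise Pearson coefficients. The sketch below uses hypothetical input values, because the pairwise coefficient between wind velocity and solar irradiance is not reported in the text; it is not an attempt to reproduce the 0.0027 result.

```python
import math


def partial_corr(r12, r13, r23):
    """Partial correlation between variables 1 and 2, controlling for 3."""
    numerator = r12 - r13 * r23
    denominator = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return numerator / denominator


# Hypothetical coefficients: a raw correlation of 0.5 between variables 1
# and 2 shrinks sharply once the shared dependence on variable 3 is removed.
print(partial_corr(r12=0.5, r13=0.6, r23=0.8))
```

When the controlled variable is uncorrelated with both targets (both of its coefficients equal to zero), the partial correlation reduces to the ordinary Pearson coefficient.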
The above analysis confirms that solar radiation is the most critical factor affecting PV power output, showing a strong correlation. In addition, module temperature, ambient temperature, and relative humidity also exhibit relatively strong correlations and should be considered key influencing variables in predictive modeling.
4.3. Similar Day Data Clustering
The pre-processed data were analysed using K-means clustering based on total solar radiation and cloud opacity. K-means is an iterative cluster analysis algorithm. Its basic principle is to divide the dataset into K groups, starting from K randomly selected data objects as the initial cluster centroids. The distance between each data object and every cluster centroid is calculated, and each data object is assigned to the closest centroid based on the minimum distance principle. Each centroid and its assigned data objects together form a cluster, and each centroid is then recomputed from the objects assigned to it. This process is iterated until the cluster centroids no longer change. To determine the optimal number of clusters, the value of K was varied from 2 to 6, and the corresponding silhouette coefficients were calculated. The silhouette coefficient curve is shown in Figure 7. A higher silhouette coefficient indicates better clustering performance. The results show that the silhouette coefficient reaches its maximum when K = 3, suggesting that three clusters provide the most effective separation.
Therefore, in this study, K is set to 3, representing the three weather types of cloudy, rainy, and sunny days. Cloudy, rainy and sunny days account for 43%, 10% and 47% of the data, respectively. Figure 8 shows the visualization of the clusters for the different weather types selected for the months of June–August. The differently colored curves in the figure represent the trend of PV power on different dates.
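The clustering and silhouette steps can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the two-dimensional points stand for hypothetical normalised (irradiance, cloud opacity) pairs, the function names are invented, and the initial centroids are passed in explicitly to keep the example reproducible, whereas the description above draws them at random.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute centroids, until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical daily features: (normalised irradiance, cloud opacity).
days = [
    (0.90, 0.10), (0.85, 0.15), (0.95, 0.05),  # sunny-like
    (0.50, 0.50), (0.55, 0.45), (0.45, 0.55),  # cloudy-like
    (0.10, 0.90), (0.15, 0.85), (0.05, 0.95),  # rainy-like
]
labels, centres = kmeans(days, centroids=[days[0], days[3], days[6]])
print(labels)
print(silhouette(days, labels))
```

Repeating the two calls for K = 2 to 6 and keeping the K with the highest mean silhouette coefficient mirrors the selection procedure described above.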
4.4. Parameter Settings
In this paper, the CNN-GRU model is used to predict the photovoltaic output power, and a comparative analysis is conducted with the CNN, GRU, and Transformer models. The parameter settings of the prediction models are shown in Table 1.
4.5. Forecast Results and Analysis
To verify the prediction performance of the CNN-GRU model, this paper compares it with the CNN, GRU, and Transformer models. Each model is trained and tested on the clustered weather sample data, and the prediction results of each model under the three weather conditions are compared and analysed. The results are shown in Figure 9.
Figure 9a presents a comparison of the prediction results of various models under sunny day conditions. Overall, all four models effectively forecast the general trend of photovoltaic (PV) power output. As shown in the magnified view of the local peak area on the right, the CNN-GRU model demonstrates superior dynamic response speed and amplitude accuracy compared to the standalone CNN and GRU models. The Transformer model also exhibits strong tracking ability in terms of local prediction accuracy. Its prediction curve is relatively smooth and capable of accurately capturing power peaks, performing slightly better than the GRU model but slightly worse than the CNN-GRU and CNN models.
Figure 9b illustrates the prediction results under cloudy day conditions, where solar irradiance fluctuates significantly, leading to more pronounced differences in model performance. The GRU model exhibits a clear phase shift relative to the actual values and only captures the general trend of PV power variation, failing to reflect short-term fluctuations. The CNN model shows improved prediction accuracy due to its enhanced ability to extract local features through convolutional kernels. The CNN-GRU model, which integrates the strengths of both CNN and GRU, demonstrates a strong advantage in tracking sudden changes while maintaining the overall trend. Although the Transformer model can generally forecast the overall trend, its response to abrupt changes is slower than that of the CNN-GRU and CNN models, with accuracy slightly better than the GRU model.
Figure 9c compares the prediction performance under rainy day conditions, characterised by intense meteorological fluctuations and a pronounced non-stationary nature of PV power output. Under such conditions, the GRU, CNN, and Transformer models all struggle to accurately capture the rapid variations in PV power, exhibiting significant prediction delays. In contrast, the CNN-GRU model responds more promptly to abrupt changes in power, with prediction trends that align more closely with the actual power output curve.
In this study, the predictive performance of each model is evaluated using three metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the coefficient of determination (R2). Figure 10 and Table 2 present a comparison of the prediction errors of the models under sunny day conditions. It can be observed that the CNN-GRU model outperforms all other models across all three evaluation metrics. Specifically, the CNN-GRU model achieves an MAE of 2.9252, which is a reduction of 43.5%, 69.4%, and 55.7% compared to the CNN (5.1729), GRU (9.5627), and Transformer (6.5987) models, respectively. Its RMSE is 3.6364, representing a decrease of 40.93%, 66.67%, and 54.84% compared to the CNN (6.1559), GRU (10.9115), and Transformer (8.0516) models, respectively. Moreover, the CNN-GRU model achieves an R2 value of 0.9971, higher than those of the CNN (0.9763), GRU (0.9256), and Transformer (0.9593) models, with relative improvements of 2.13%, 7.72%, and 3.94%, respectively. This shows that the CNN-GRU model predicts more accurately than the CNN, GRU, and Transformer models under sunny day conditions, and that its prediction curve fits the measured data more closely.
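The relative changes quoted for the sunny-day case follow from a simple ratio of the reported metric values. The short sketch below recomputes them from the numbers given in the text; the function and variable names are illustrative.

```python
def reduction_pct(baseline, proposed):
    """Percentage by which an error metric falls relative to the baseline."""
    return (baseline - proposed) / baseline * 100.0


def improvement_pct(baseline, proposed):
    """Percentage by which a score such as R2 rises relative to the baseline."""
    return (proposed - baseline) / baseline * 100.0


# Sunny-day values as reported in the text.
cnn_gru = {"MAE": 2.9252, "RMSE": 3.6364, "R2": 0.9971}
baselines = {
    "CNN": {"MAE": 5.1729, "RMSE": 6.1559, "R2": 0.9763},
    "GRU": {"MAE": 9.5627, "RMSE": 10.9115, "R2": 0.9256},
    "Transformer": {"MAE": 6.5987, "RMSE": 8.0516, "R2": 0.9593},
}

for name, scores in baselines.items():
    print(
        name,
        round(reduction_pct(scores["MAE"], cnn_gru["MAE"]), 1),
        round(reduction_pct(scores["RMSE"], cnn_gru["RMSE"]), 2),
        round(improvement_pct(scores["R2"], cnn_gru["R2"]), 2),
    )
```

The same two functions apply unchanged to the cloudy-day and rainy-day comparisons.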
Compared to sunny day conditions, the nonlinear fluctuations in irradiance caused by cloud movement on cloudy days impose greater challenges on model robustness. The CNN-GRU model, by leveraging the convolutional neural network’s ability to deeply extract spatial correlations among meteorological factors and combining it with the gated recurrent unit’s strength in modeling temporal dynamics, is capable of effectively capturing the transient features of solar radiation under cloudy weather conditions.
Figure 11 and Table 3 present a comparative analysis of prediction errors among the models under cloudy day scenarios. As shown in the table, the performance differences among the models become more pronounced under these conditions. Based on the three evaluation metrics (MAE, RMSE, and R2), the CNN-GRU model continues to demonstrate superior predictive performance. Specifically, the CNN-GRU model achieves an MAE of 3.2157, representing reductions of 47.9%, 75.2%, and 60.1% compared to the CNN (6.1688), GRU (12.9789), and Transformer (8.0500) models, respectively. Its RMSE is 3.9912, reflecting decreases of 44.0%, 74.2%, and 61.5% relative to the CNN (7.1338), GRU (15.4523), and Transformer (10.3716) models, respectively. Moreover, the CNN-GRU model achieves an R2 of 0.9892, which is 2.47%, 18.1%, and 6.75% higher than the CNN (0.9654), GRU (0.8376), and Transformer (0.9266) models, respectively. These results indicate that the CNN-GRU model exhibits a significantly enhanced ability to capture PV power fluctuations under complex meteorological conditions, demonstrating strong robustness and predictive accuracy in the presence of cloudy weather.
Under rainy day conditions, photovoltaic (PV) power output is subject to more complex meteorological disturbances.
Figure 12 and Table 4 present a comparison of prediction errors across different models under rainy weather scenarios. It can be observed that the CNN-GRU model continues to exhibit a significant advantage across all three evaluation metrics. Specifically, the CNN-GRU model achieves an MAE of 1.5746, which is substantially lower than that of the CNN (7.2725), GRU (7.5287), and Transformer (9.1324) models, representing reductions of 78.3%, 79.1%, and 82.8%, respectively. The RMSE of the CNN-GRU model is 2.0512, which corresponds to decreases of 77.4%, 75.6%, and 80.8% compared to the CNN (9.0641), GRU (8.4076), and Transformer (10.6653) models, respectively. In terms of the coefficient of determination, the CNN-GRU model attains an R2 value of 0.9820, which is significantly higher than that of the CNN (0.6478), GRU (0.6970), and Transformer (0.6256) models, with improvements of 51.7%, 40.9%, and 56.9%, respectively. These results suggest that, although the Transformer model exhibits a certain level of predictive capability in some scenarios, its stability and accuracy remain inferior to those of the hybrid CNN-GRU architecture. The CNN-GRU model demonstrates generalizable value in enhancing the performance of PV power forecasting systems across varying weather conditions.
To evaluate the training efficiency of the CNN, GRU, Transformer and CNN-GRU models under different weather conditions, this study recorded the training time of each model in the sunny, cloudy, and rainy scenarios. Table 5 presents a comparison of training times across the different weather conditions.
According to the comparative data presented in Table 5, the CNN model generally exhibits the shortest training time among the models, which can be attributed to its relatively simple structure and lower computational overhead when processing time-series data. The GRU model, on the other hand, requires sequential processing to capture temporal dependencies, resulting in a longer training time. The CNN-GRU model leverages the CNN layers to extract local features and effectively reduce the dimensionality of the input data, thereby alleviating the computational burden on the subsequent GRU layers. As a result, its training time falls between that of the CNN and GRU models. The Transformer model, due to its use of multi-head attention mechanisms and complex encoder architecture, demands significantly more computation during training. Consequently, it records the longest training times under sunny and cloudy conditions, reaching 49.28 s and 43.89 s, respectively, highlighting its disadvantage in terms of computational efficiency. Under rainy conditions, however, the Transformer achieves the shortest training time (13.64 s), possibly due to reduced data volume or improved architectural adaptability to this specific scenario.
While the CNN model demonstrates superior training efficiency, its prediction performance is relatively limited. The Transformer model shows certain advantages in specific evaluation metrics but suffers from high computational costs. In contrast, the proposed CNN-GRU model achieves a good balance between predictive accuracy and training efficiency. It consistently delivers reliable performance across various weather conditions, confirming its practicality and robustness in photovoltaic power prediction tasks.
5. Conclusions
In this paper, we propose a photovoltaic power prediction model based on similar day clustering and CNN-GRU, validate the model’s performance through example analysis, and draw the following conclusions.
(1) Pearson and Spearman correlation coefficient analyses are applied to screen out the key factors affecting the PV output power, reduce the input dimensions of the model, and eliminate the interference of features with smaller influences.
(2) The K-means algorithm is used to classify the original data into three weather types: sunny, cloudy, and rainy, and the predictions of the model for different weather types are discussed separately to further improve the prediction accuracy.
(3) Using CNN feature extraction and GRU nonlinear fitting capability, the CNN-GRU model is proposed to predict the PV output power under sunny, cloudy and rainy weather scenarios, respectively. The examples show that MAE and RMSE are reduced by 66.1% and 65.7% on average and R2 is improved by 19.8% on average in three different weather type scenarios. This verifies that the model has high prediction accuracy and generalisation ability, and better results in PV output power prediction.
Accurate photovoltaic (PV) power generation forecasting enables the effective optimization of PV generation and energy storage system scheduling, thereby improving the utilization of renewable energy and reducing system operating costs. Moreover, precise PV forecasting is essential for load management in power grids, as it provides critical data support for grid dispatching, enhances the grid’s adaptability to variable power output, and strengthens its overall stability and flexibility.
Nevertheless, this study still has several limitations in terms of practical application, which warrant further exploration and improvement in future research:
(1) In this study, the dataset was divided into three subsets—sunny, cloudy and rainy—using the k-means clustering algorithm. Separate training and testing procedures were then applied to each subset. However, real-world weather conditions are far more complex. In practical deployment, accurately classifying and predicting weather types based on weather forecast data remains a challenge. Reducing the impact of forecast inaccuracies on prediction performance and ensuring the model’s stability under dynamically changing weather conditions are key issues that need to be addressed in future work.
(2) The model was trained and tested using summer data from a PV plant located in Ningxia, China. Therefore, its applicability to other geographical regions and under extreme weather conditions remains limited. Future research should focus on collecting and integrating PV generation data from multiple regions, and developing and promoting more generalised PV forecasting models capable of adapting to region-specific climate characteristics.
(3) In practical applications, the deployment of the model may face limitations related to computational resources and real-time data processing. Future research should consider incorporating predictive input variables such as Numerical Weather Prediction (NWP) data to explore the model’s robustness and generalization under forecast-based conditions. This will further enhance the overall operational efficiency of the power grid and provide theoretical support and practical guidance for large-scale photovoltaic integration and the development of smart grids.