Quantitative Analysis of the Impact of Meteorological Environment on Photovoltaic System Feasibility

Abstract: The meteorological environment is a determining factor in photovoltaic (PV) system feasibility (PVSF). To evaluate this impact more accurately, this work proposes a quantitative analysis model based on multiple meteorological factors and the Random Forest Regression model. First, an evaluation system is established to assess the impact, and a parameter, the performance ratio in the sampling period, is defined to predict the indicators of the evaluation system. Second, a set of essential influences on the performance ratio in the sampling period is established by analyzing and reducing the known influences on PV system performance. Finally, data from the Desert Knowledge Australia Solar Centre (DKASC) website are used to conduct the experiment. In the experiment, the sample set is cleaned using a model based on the cosine of the zenith angle. The functional relationship between the performance ratio in the sampling period and its essential influences is established by training a Random Forest Regression model on the data of the modeling system, and the data of the test system are used to verify the forecast performance of the proposed model. Compared with the reference model, which is based on the traditional physical experiment, the results of the proposed model agree better with the measured values.


Introduction
It is expected that by 2088, the world will run out of fossil fuels [1]. To overcome the energy crisis, solar energy, along with other renewable energy sources, has attracted significant attention due to its abundance, universality, and cleanliness [2]. By the end of 2020, the global cumulative installed capacity of PV power plants amounted to 714 GW, with an additional installed capacity of 127 GW (+22%) [3]. Growth on this scale places higher demands on the study of the PVSF. Improving the accuracy of PVSF assessment can effectively promote scientific decisions concerning project approval by government departments and reduce the investment risk of power operators. In recent years, the rapid development of big data storage and data mining technology has promoted technological innovation in all walks of life. Establishing a big data system that integrates the data extracted from PVSF study reports with other useful data, and using data-driven methods to evaluate the PVSF rapidly and quantitatively, have attracted the interest of PV project evaluation departments. The meteorological environment plays an important role in the PVSF because it has a great influence on the generation potential of PV systems.
In the traditional method, the impact is analyzed by using solar resources and the estimated output energy. Solar energy resources are easy to calculate based on the dataset of horizontal irradiance measured at the site of the PV system to be evaluated. PV system output energy is generally predicted by using the model based on the system physical structure (SPS). The first step of this model is to determine the ideal energy conversion efficiency of the system according to its SPS. The second step is to create a data set of the incident solar irradiance on the module or array plane, which is denoted as plane of array irradiance (POAI). The third step is to establish a data set of the ideal output power (IOP) of the system based on the data set of the POAI and the ideal energy conversion efficiency of the system. The last step is to correct the IOPs and predict the system output energy.
The power output of PV system is not only dependent on the POAI, but is also affected by the system performance. The fluctuation of system performance is mainly caused by the change of PV module efficiency. The operating temperature of a PV module has a dramatic effect on its electrical efficiency [22], which is well documented [23]. Radziemska [24] experimentally investigated the effect of the operating temperature and found a decrease of about 0.65% electrical efficiency of the PV module for every 1 K increase in cell temperature. Teo et al. [25] observed that the operating temperature of PV modules attained a value as high as 68 °C, and the electrical efficiency dropped significantly to 8.6%. In the last step of the model based on the SPS, traditionally, the ideal power was corrected by using the module power temperature coefficient provided by the manufacturer and the operating temperature, calculated from the ambient temperature (AT). In this correction model, only the influence of AT on the operating temperature of PV modules is taken into account.
Many scholars have studied the PV module temperature and found that, in addition to the AT, the operating temperature of PV modules is also affected by other meteorological factors. By comparison, Ayompe, Duffy et al. [26] found that high average wind speed (WS) and low AT can improve PV module efficiency. According to some tests, Griffith et al. [27] found that the cell temperature is extremely sensitive to WS and moderately sensitive to wind direction. Teo et al. [25] experimentally found that the PV module temperature increases by 1.4 °C for every 100 W/m² increment of solar irradiance with active cooling, and by more, i.e., 1.8 °C, without active cooling. There are many correlations expressing the PV cell temperature as a function of weather variables such as AT, local WS, and solar irradiance, with material and system-dependent properties as parameters, e.g., glazing-cover transmittance, plate absorptance, etc. [23,28,29].
The PV module performance is affected by the operating temperature in the short term; however, in the long term, the degradation in the performance of a PV module due to its aging is also an influencing factor. Jordan and Kurtz [30] reviewed the degradation rates of flat-plate terrestrial modules and systems reported in the published literature from field testing from 1974 to 2013. Nearly 2000 degradation rates measured on individual modules or entire systems were assembled from the literature, showing a median value of 0.5% each year. Ndiaye et al. [31] reviewed the most frequent PV module degradation types of silicon PV modules according to the literature.
The impact of solar radiation and AT on the PVSF can be analyzed through the model based on the SPS, but it is difficult to analyze the comprehensive impact of more meteorological factors through this model. Ignoring other meteorological factors which have some impact on the PV system performance not only has a negative impact on the accuracy of prediction results, but also leads to a waste of data resources that can be extracted from the feasibility study report.
To analyze the impact of the meteorological environment on the PVSF, the main step is to predict the PV output energy. In addition to the PV output energy prediction method of the model based on the SPS, a variety of more advanced PV output energy prediction models have been proposed, as reviewed and compared in [42–47]. These models can be classified as statistical techniques, linear models or time series models, artificial intelligence techniques, and hybrid models [42]. Most of these models are designed to eliminate the negative impact created by the inherent variability of PV output power on the electrical performance of the grid. The purpose of these models is to predict the output energy of a built PV system in the short term. The PV system performance can be affected by the degradation in the performance of PV modules in the long term; this factor is generally ignored in short-term forecasting. However, in the PVSF study, the output energy of an unbuilt system must be predicted over the ultralong term, so the impact of the degradation in the performance of PV modules cannot be ignored. Each proposed prediction model has its specific applicable conditions, and few PV output energy prediction models have been designed based on the data available in the PVSF study report, so when using these models, it is difficult to make full use of the data resources in feasibility reports.
Therefore, in this study, we aimed to develop a model for analyzing the impact of the meteorological environment on the PVSF in a data-driven way. First, we sought to establish a more scientific indicator system to quantify the impact by optimizing the traditional indicators and introducing a new indicator. Then, to improve the indicator prediction accuracy by making the most of the data resources without incurring extra data measurement cost, we analyzed and reduced the influences on the system performance based on the data available in the PVSF study report and picked out the essential influences, which correspond to the inputs of the indicator prediction model. Finally, we used the Random Forest Regression (RFR) [48] model to develop an indicator prediction model. As inputs, we took not only the two parameters used by the model based on the SPS (POAI and AT), but also other meteorological parameters that correlate with the PV system performance, in order to quantify the comprehensive impact of multiple meteorological factors on the PVSF.

Development of the Indicator System
The indicators used in this article to analyze the impact of the meteorological environment on the PVSF are the final PV system yield ($Y_f$), the reference yield ($Y_r$), and the performance ratio (PR), which are defined in the IEC standard 61724 [49] as follows:

$$Y_f = \frac{E}{P_0} \tag{1}$$

$$Y_r = \frac{H}{G_R} \tag{2}$$

$$\mathrm{PR} = \frac{Y_f}{Y_r} \tag{3}$$

where $E$ is the output energy, $P_0$ is the system capacity, $H$ is the total in-plane irradiation per square meter, and $G_R$ is the reference irradiance at which $P_0$ is determined. The indicator $Y_f$ is used to quantify the impact of the meteorological environment on the system generation potential, because when the feasibilities of several PV systems with different capacities are compared, $Y_f$ is more suitable than $E$, which is one of the traditional indicators. The indicator $Y_r$ is used because, compared with the horizontal irradiation per square meter, which is the other traditional indicator, $Y_r$ is more appropriate for quantifying the impact of the meteorological environment on the solar resources in a PVSF study. We also include PR in the indicator system, because PR can be used to quantify the impact of the meteorological environment on the system performance. Thus, we establish an indicator system by optimizing the two traditional indicators and introducing a new one. A more scientific and reasonable analysis of the impact of the meteorological environment on the PVSF can be realized based on the proposed indicator system.
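As a minimal illustration of why the final yield is easier to compare across capacities than the raw output energy, the three indicators can be sketched in Python. The function names, units, and the two example systems are hypothetical and chosen only for this sketch; the reference irradiance defaults to 1 kW/m².

```python
def final_yield(energy_kwh: float, capacity_kw: float) -> float:
    """Final yield: output energy divided by system capacity (kWh/kW)."""
    return energy_kwh / capacity_kw


def reference_yield(irradiation_kwh_m2: float, ref_irradiance_kw_m2: float = 1.0) -> float:
    """Reference yield: in-plane irradiation divided by reference irradiance (h)."""
    return irradiation_kwh_m2 / ref_irradiance_kw_m2


def performance_ratio(energy_kwh: float, capacity_kw: float,
                      irradiation_kwh_m2: float,
                      ref_irradiance_kw_m2: float = 1.0) -> float:
    """Performance ratio: final yield divided by reference yield (dimensionless)."""
    return (final_yield(energy_kwh, capacity_kw)
            / reference_yield(irradiation_kwh_m2, ref_irradiance_kw_m2))


# Two hypothetical systems at the same site (same in-plane irradiation)
# whose capacities differ by a factor of 20.
pr_small = performance_ratio(8_000.0, 5.0, 2_000.0)
pr_large = performance_ratio(160_000.0, 100.0, 2_000.0)
```

Although the two example systems produce very different amounts of energy, they have the same final yield and performance ratio, which is the property that makes these indicators comparable across capacities.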

Essential Influences on the PV System Performance
To improve the indicator prediction accuracy by making the most of the data resources without incurring extra data measurement cost, the influences on the system performance were analyzed and reduced based on the data available in the PVSF study report, and several influences were taken as the essential ones, which correspond to the inputs of the indicator prediction model. In this study, the performance ratio in the sampling period (PRP), expressed by Equation (4), is taken as the parameter of the PV system performance:

$$\mathrm{PRP} = \frac{P / P_0}{\mathrm{POAI} / G_R} \tag{4}$$

where $P$ is the system power in the sampling period, $P_0$ is the system capacity, and $G_R$ is the reference irradiance. The solar radiation, WS, wind direction, AT, air humidity (AH), and degradation in the performance of the PV module due to aging are taken as the essential influences (Figure 1), because, according to the definition of the PRP, the determinants are POAI and system performance. Solar radiation, WS, wind direction, and AT can affect cell temperature [23,25,27–29], and thus indirectly affect the PRP. Solar radiation also affects the PRP through the POAI. The AH is taken as one of the essential influences because it has some effect on the heat dissipation of PV modules and hence on the PRP. Degradation in the performance of PV modules is also taken as an essential influence, because in the long run, the effect of this factor on the PRP is too significant to be ignored.

The data of the parameters which correspond to the essential influences are available in the PVSF study report, but the available data resources are limited at the PVSF study stage. To avoid extra data measurement cost, some other influences on the PRP, such as front surface soiling, mismatch losses, dust accumulation, and incident solar spectrum, were reduced, although this may introduce some errors into the prediction results. However, this error is minimized with the proposed indicator prediction model.

Proposed Indicator Prediction Model
The steps of the proposed indicator prediction model are:
(1) Use the RFR model to learn the functional relationship between the PRP and the parameters of the essential influences in Section 2.2 with the data of the modelling system.
(2) Predict the PRP of the system to be evaluated using the data measured at the site of the system and the functional relationship established in the previous step.
(3) Predict the indicators through statistics and further calculation with the predicted PRP dataset.

RFR Model
The RFR model is an improved bagging regression tree (RT) model [50–53] and one of the most effective machine learning models for forecasting [54]. Thus, the RFR model is used in this paper to forecast the indicators for analyzing the impact of the meteorological environment on the PVSF. The process can be divided into the following five essential steps, as shown in Figure 2.
(1) A sample set is obtained via the bootstrap strategy, which generates a dataset by resampling the existing training samples with replacement. About one-third of the training samples are left out of each bootstrap sample; these form the out-of-bag (OOB) data. The remaining samples, called in-bag data, are used to create an RT of the RFR.
(2) An RT is trained on the in-bag data by recursive binary splitting. More specifically, the training steps are: (a) If the total number of features is $M$, randomly select $m$ ($m \le M$) features from all the features as the feature subset of the current node. (b) Pick the optimal feature from the $m$ features to split the current node into two daughter nodes. (c) Repeat (a) and (b) until a leaf node is reached.
(3) Repeat (1) and (2) $n_{tree}$ times to generate $n_{tree}$ RTs, which make up the trained RFR model. Breiman [48] suggests setting the number of trees to 500 (default).
(4) The RFR model provides an important measure, the variable-importance (VI) measure, which quantifies the contribution of each input to the output. It is calculated as follows: (a) Calculate the prediction error of an RT on its corresponding OOB data. (b) Randomly permute the values of one input variable in the OOB data, and then calculate the increase in prediction error. (c) The VI measure of that input is the mean increase in prediction error, after versus before the permutation, over all RTs in the forest.
(5) After the required number of RTs is obtained, the final predicted response of the developed RFR model is the average of the RT predictions on the test data set.
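The bootstrap and variable-importance steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the implementation used in the paper: `toy_tree` is a hypothetical stand-in for one trained regression tree, the data are synthetic, and the error measure is taken to be the mean squared error.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in chosen]
    return in_bag, oob


def mse(predict, rows, targets):
    """Mean squared prediction error of `predict` on the given rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, column, rng):
    """Increase in MSE after shuffling one input column of the rows."""
    baseline = mse(predict, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - baseline


rng = random.Random(0)

# Step (1): the out-of-bag share of a bootstrap sample is close to 1/e.
in_bag, oob = bootstrap_split(10_000, rng)
oob_fraction = len(oob) / 10_000


# Step (4): hypothetical "tree" that uses input 0 and ignores input 1.
def toy_tree(row):
    return 2.0 * row[0]


rows = [[rng.random(), rng.random()] for _ in range(500)]
targets = [toy_tree(r) for r in rows]
vi_used = permutation_importance(toy_tree, rows, targets, 0, rng)
vi_unused = permutation_importance(toy_tree, rows, targets, 1, rng)
```

Permuting the input that the toy tree relies on raises the error, while permuting the ignored input leaves it unchanged. In a full forest, this increase would be computed per tree on that tree's own OOB data and then averaged over all trees.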

Model Training
A system with the same type of modules as the system to be evaluated was taken as the modelling system for two reasons. The first is that, under the same conditions, the degradation rates of different types of modules may differ. Bayandelger et al. [55] found that the annual degradation rates of multicrystalline silicon (mc-Si) and single-crystalline silicon (sc-Si) are −1.28% and −0.86%, respectively. The other reason is that the relationships between the electrical efficiency and the operating temperature of different types of modules are also different. Polysilicon PV modules are the most sensitive to module temperature variations, compared with amorphous silicon (a-Si) PV modules and stacked a-Si PV modules [39], and a-Si PV modules are more affected by temperature than monocrystalline PV modules [36]. Under different climatic conditions, the degradation rates of PV modules also vary for the same solar cell technology, which has a negative impact on the prediction performance of the proposed model. However, this impact can be minimized by selecting a PV system with climatic conditions similar to those of the system to be evaluated as the modelling system.
To obtain the essential influence set on the PRP, some influences were reduced for which data is not available in the PVSF study report. To minimize the error caused by this reduction, the system with similar geomorphic conditions and the closest distance to the system to be evaluated was selected as the modeling system, because we believe that the external environment of those reduced factors is very similar in the locations of the two systems when the modeling system is chosen in this way. Thus, the impact of the reduced factors on the functional relationship between the parameters which correspond to the essential influences and the PRP of the modeling system is similar to the system to be evaluated.
In training the RFR model, the hyper-parameters are selected as follows:
(1) Preset the hyper-parameters, which include the number of regression trees ($n_{tree}$), the number of random variables ($n_{var}$), and the number of observations in leaves ($n_{obs}$). Then train the model and calculate the OOB error (OOBE).
(2) Increase the value of $n_{tree}$ from 1 until the OOBE no longer decreases, then select the value at this point for $n_{tree}$.
(3) Increase the value of $n_{var}$ from 1 to the number of inputs of the RFR model, which is 9 in this article, then select the value with the minimum OOBE for $n_{var}$.
(4) Increase the value of $n_{obs}$ from 1 until the OOBE begins to increase significantly, then select the value with the minimum OOBE for $n_{obs}$.
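The selection procedure above is a one-parameter-at-a-time search, which can be sketched as follows. The sketch assumes a caller-supplied function `oob_error(n_tree, n_var, n_obs)` that trains a forest and returns its OOBE; that function, the tolerance `tol`, the threshold `rise` for a "significant" increase, and the analytic `toy_oob_error` surface are all hypothetical and serve only to make the example runnable.

```python
def select_hyperparameters(oob_error, n_inputs, n_var0=3, n_obs0=5,
                           max_tree=500, max_obs=50, tol=1e-3, rise=0.05):
    """Sequentially choose (n_tree, n_var, n_obs) by out-of-bag error."""
    # Step (2): grow the forest until the OOBE stops improving by more than tol.
    n_tree = 1
    err = oob_error(n_tree, n_var0, n_obs0)
    while n_tree < max_tree:
        nxt = oob_error(n_tree + 1, n_var0, n_obs0)
        if err - nxt <= tol:
            break
        n_tree, err = n_tree + 1, nxt

    # Step (3): scan n_var from 1 to the number of inputs, keep the minimum.
    n_var = min(range(1, n_inputs + 1),
                key=lambda v: oob_error(n_tree, v, n_obs0))

    # Step (4): raise n_obs until the OOBE clearly rises, keep the minimum seen.
    best_obs = 1
    best_err = oob_error(n_tree, n_var, best_obs)
    for n_obs in range(2, max_obs + 1):
        e = oob_error(n_tree, n_var, n_obs)
        if e < best_err:
            best_obs, best_err = n_obs, e
        elif e > best_err * (1.0 + rise):
            break
    return n_tree, n_var, best_obs


def toy_oob_error(n_tree, n_var, n_obs):
    """Hypothetical error surface standing in for real forest training."""
    return 1.0 / n_tree + 0.01 * (n_var - 4) ** 2 + 0.002 * (n_obs - 5) ** 2


chosen = select_hyperparameters(toy_oob_error, n_inputs=9)
```

Because each parameter is tuned with the others held fixed, the search is cheap but may miss interactions between parameters; a joint grid search would be the more exhaustive alternative.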

Inputs of the RFR Model
The PRP was taken as the output of the RFR model, and the inputs of the model were the WS, wind incidence angle on the module (WIAM), AT, AH, global horizontal irradiance (GHI), diffuse horizontal irradiance (DHI), direct normal irradiance (DNI), POAI, and the system operated days (SOD). The data of the WS, AT, AH, GHI, DHI, DNI, and SOD were derived from the measured historical dataset. The POAI data were obtained through the Hay model [10], and the WIAM data were acquired according to Formula (5):

$$\theta_{ws} = \arccos\left(\sin\beta \cdot \cos\theta_{wg}\right) \tag{5}$$

where $\theta_{ws}$ is the WIAM, $\beta$ is the inclined angle of the module, and $\theta_{wg}$ is the angle between the wind direction and the azimuth of the module, which was obtained through Formula (6) from the azimuth of the module, $\gamma_m$, and the wind direction, $\gamma_w$. The inputs of the RFR model correspond to the essential influences on the PRP. WS, AT, and AH correspond to the eponymous influences; GHI, DHI, DNI, and POAI correspond to the solar radiation; SOD corresponds to the module aging; WIAM corresponds to the wind direction. The WIAM was taken as one of the predictors instead of the wind direction, because the effective windward area of a PV module, which has some effect on the heat dissipation and thus affects the performance of the module, depends not only on the wind direction, but also on the module azimuth.
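The WIAM calculation can be sketched as below. The arccosine relation follows Formula (5); the helper `wind_module_angle`, however, is an assumption, since the body of Formula (6) is not reproduced in the text. It takes the smallest absolute difference between the two bearings, which is only one plausible reading. All function names are hypothetical.

```python
import math


def wind_module_angle(module_azimuth_deg: float, wind_direction_deg: float) -> float:
    """ASSUMPTION (not Formula (6) itself): smallest absolute difference
    between the module azimuth and the wind direction, in [0, 180] degrees."""
    diff = abs(module_azimuth_deg - wind_direction_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff


def wiam(tilt_deg: float, wind_module_angle_deg: float) -> float:
    """Wind incidence angle on the module, in degrees:
    arccos(sin(tilt) * cos(angle between wind direction and module azimuth))."""
    x = (math.sin(math.radians(tilt_deg))
         * math.cos(math.radians(wind_module_angle_deg)))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, x))))


# A vertical module facing straight into the wind: incidence angle of 0 degrees.
head_on = wiam(90.0, 0.0)
# Wind blowing along the module's edge: incidence angle of 90 degrees.
edge_on = wiam(20.0, 90.0)
```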

Data and Data Preprocessing
Both the modelling dataset [56] and the testing dataset [57] were derived from the Desert Knowledge Australia Solar Centre (Desert Knowledge Australia, the Australian Government, the Northern Territory Government and the project managers; Ekistica do not endorse, and accept no legal liability whatsoever arising from, or connected to, the outcomes and conclusions associated with the use of data from the Desert Knowledge Australia Solar Centre).
In the experiment, two systems (see Table 1) were selected as the modelling system and the system to be evaluated. In the selected period, the sample missing rates of the sample sets were so small (see Table 2) that the influence of missing samples on the experiment was ignored. The meteorological data measured in the data acquisition year was taken as the representative of the annual meteorological data in the lifecycle (25 years) of the system to be evaluated, since it is almost impossible to accurately predict the meteorological information in the next 25 years, and the interannual variation of meteorological information can be ignored in analyzing the impact of the meteorological environment on the PVSF.
Cleaning redundant data not only improves the efficiency of the prediction model, but also prevents redundant data from interfering with the prediction. In this paper, if the system power is almost zero, the relevant sample is considered a redundant sample. For the modeling system, the system power value can be used to determine whether a sample is redundant. However, for the system to be evaluated, there is no system power data, so the redundant samples cannot be cleaned according to the power value. When the cosine of the zenith angle is less than or equal to zero, the solar radiation is almost zero, so in this paper, the data were cleaned according to the cosine of the zenith angle.
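A minimal sketch of cleaning by the cosine of the zenith angle is given below. It uses the standard relation cos θz = sin φ sin δ + cos φ cos δ cos ω together with Cooper's approximation for the solar declination. The sample layout `(day_of_year, solar_hour)`, the latitude value (roughly that of Alice Springs), and the use of apparent solar time without any clock-time correction are assumptions made for this illustration.

```python
import math


def solar_declination_deg(day_of_year: int) -> float:
    """Cooper's approximation of the solar declination, in degrees."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))


def cos_zenith(latitude_deg: float, day_of_year: int, solar_hour: float) -> float:
    """Cosine of the solar zenith angle at a given apparent solar time."""
    phi = math.radians(latitude_deg)
    delta = math.radians(solar_declination_deg(day_of_year))
    omega = math.radians(15.0 * (solar_hour - 12.0))  # hour angle, 15 deg per hour
    return (math.sin(phi) * math.sin(delta)
            + math.cos(phi) * math.cos(delta) * math.cos(omega))


def clean_samples(samples, latitude_deg):
    """Keep only samples whose cosine of the zenith angle is positive."""
    return [s for s in samples if cos_zenith(latitude_deg, s[0], s[1]) > 0.0]


LATITUDE_DEG = -23.76  # assumed site latitude (southern hemisphere)

# Hypothetical samples: (day_of_year, solar_hour).
samples = [(172, 0.0), (172, 6.0), (172, 12.0), (172, 18.5), (355, 6.0)]
kept = clean_samples(samples, LATITUDE_DEG)
```

In this example, the midnight, early-morning, and evening samples of the winter day are dropped, while the noon sample and the summer-morning sample are kept. The criterion needs only time and location, so it can be applied to a system that has no power measurements yet.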

Statistics of Indicators
Due to the difference in the rated conversion efficiency of the modules between the modelling system and the system to be evaluated, the prediction results of the RFR model were modified by Formula (7), in which $\mathrm{PRP}_f$ is the PRP of the system to be evaluated, $\mathrm{PRP}'_f$ is the prediction result of the RFR model, $\eta_f$ is the rated conversion efficiency of the modules of the system to be evaluated, and $\eta_m$ is the rated conversion efficiency of the modules of the modelling system. Based on the $\mathrm{PRP}_f$ dataset, the values of each indicator in the first year of the lifecycle were obtained. The annual output energy values were obtained by Formula (8), in which $E_n$ is the output energy in year $n$, $E_1$ is the output energy in the first year, and $\Delta$ is the guaranteed annual PV module degradation rate. The annual values of each indicator were then obtained according to the definitions (Formulas (1)-(3)). The predicted annual average values of the final yield $Y_f$, PR, and the reference yield $Y_r$ over the lifecycle of the system are taken as the quantitative results of the indicators of the evaluation system.
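The projection of the first-year energy over the lifecycle can be sketched as follows. Since the body of Formula (8) is not reproduced in the text, the sketch assumes compound degradation, i.e., each year's energy is the previous year's multiplied by one minus the degradation rate; a linear form would be an equally plausible reading. The function name and the example values are hypothetical.

```python
def lifecycle_energies(first_year_energy: float, degradation_rate: float,
                       years: int = 25) -> list:
    """ASSUMPTION: compound degradation,
    energy in year n = first-year energy * (1 - rate) ** (n - 1)."""
    return [first_year_energy * (1.0 - degradation_rate) ** (n - 1)
            for n in range(1, years + 1)]


# Hypothetical system: 10,000 kWh in the first year, 0.5% degradation per year.
energies = lifecycle_energies(10_000.0, 0.005)
lifecycle_average = sum(energies) / len(energies)
```

The lifecycle average of the energy, and of any indicator proportional to it, then follows directly from the per-year list.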
According to the definition of the indicators, both the final yield $Y_f$ and PR are linearly related to the output energy $E$, so the percentage prediction errors of $E$, $Y_f$, and PR are the same. The errors of the results were analyzed through the Mean Absolute Percentage Error (MAPE) and Root Mean-Square Percentage Error (RMSPE) of the predicted monthly values of $E$ in the first year. There are two sources of error in the proposed model: (1) the intrinsic error of the RFR model; (2) the error caused by reducing the nonessential influences on the PRP in this paper. The errors of the proposed model and the reference model were obtained based on the predicted results and the measurements of the system to be evaluated. The intrinsic error of the RFR model was obtained based on the OOB prediction results and the measurements of the modelling system.
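For reference, the two error measures can be sketched with their common textbook definitions. The exact normalization used for the values reported in this paper is not shown in the text, so results from this sketch are not necessarily comparable with them. The monthly values below are made up for illustration.

```python
import math


def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - o) / abs(o)
                       for p, o in zip(predicted, observed)) / len(observed)


def rmspe(predicted, observed):
    """Root mean-square percentage error, in percent."""
    return 100.0 * math.sqrt(sum(((p - o) / o) ** 2
                                 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical monthly energies (kWh): relative errors of +10%, -5%, and 0%.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

example_mape = mape(predicted, observed)
example_rmspe = rmspe(predicted, observed)
```

Under these definitions the RMSPE weights large relative errors more heavily than the MAPE does.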

Reference Indicator Prediction Model
The main difference between the reference model used in this paper and the proposed model lies in how the PRP is obtained. In the proposed model, the PRP is predicted by the RFR model, with the WS, WIAM, AT, AH, GHI, DHI, DNI, POAI, and SOD as inputs. In the reference model, the PRP is predicted based on the SPS, with the GHI, POAI, and AT as inputs. Because the proposed model is based on the RFR model, the RFR model needs to be trained with the data of the modeling system before the indicators of the system to be evaluated can be predicted. Because the reference model is based on the SPS, the indicators of the system to be evaluated can be predicted directly with the data of this system.
The main steps of the reference model to predict the PRP are:
(1) Obtain the IOP of the system by Formula (9), where the POAI is obtained by using the Hay model [10], and $G_R$ is the Standard Test Irradiance (1000 W/m²).
(2) Estimate the operating temperature of the PV modules by Formula (10), where $T_O$ is the operating temperature and NOCT is the Normal Operating Cell Temperature.
(3) Correct the IOP for the operating temperature by Formula (11), where $P_C$ is the corrected power output and $\gamma_T$ is the Cell Temperature Coefficient.
(4) Predict the indicators according to the corresponding definitions.
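The chain of steps in the reference model can be sketched as below. The formula bodies are not reproduced in the text, so every relation here is an assumption based on commonly used forms: the ideal output power scales the capacity by POAI over the standard test irradiance, the operating temperature follows the standard NOCT model, and the power correction is linear about 25 °C. The function names, the choice of POAI as the irradiance in the temperature model, and the default NOCT and temperature coefficient values are all hypothetical.

```python
def ideal_output_power(capacity_kw: float, poai_w_m2: float,
                       ref_irradiance_w_m2: float = 1000.0) -> float:
    """ASSUMPTION: ideal output power = capacity * POAI / standard test irradiance."""
    return capacity_kw * poai_w_m2 / ref_irradiance_w_m2


def operating_temperature(ambient_c: float, irradiance_w_m2: float,
                          noct_c: float = 45.0) -> float:
    """ASSUMPTION: standard NOCT model,
    operating temperature = ambient + (NOCT - 20) / 800 * irradiance."""
    return ambient_c + (noct_c - 20.0) / 800.0 * irradiance_w_m2


def corrected_power(iop_kw: float, operating_c: float,
                    temp_coeff_per_c: float = -0.004) -> float:
    """ASSUMPTION: linear temperature correction about 25 degrees Celsius."""
    return iop_kw * (1.0 + temp_coeff_per_c * (operating_c - 25.0))


# Hypothetical sample: 5 kW system, POAI of 800 W/m2, ambient temperature 25 C.
iop = ideal_output_power(5.0, 800.0)
t_op = operating_temperature(25.0, 800.0)
p_corr = corrected_power(iop, t_op)

# The performance ratio in the sampling period is then the corrected power
# relative to the ideal output power.
prp = p_corr / iop
```

Any additional system loss factor, such as an empirical transmission efficiency, would enter as a further multiplier on the corrected power; it is omitted here because its value is not given in the text.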

Data Preprocessing Results
We found that, with the proposed sample cleaning method, the cleaning rates of the two systems were high (Table 3): a large number of redundant samples were removed, which improved the prediction efficiency and avoided the interference of redundant samples in the modelling and prediction process.

Hyper-Parameters Selection Results
In selecting the hyperparameters of the RFR model, the initial value of the number of regression trees ($n_{tree}$) was set at 100, the number of random variables ($n_{var}$) at 3, and the number of observations in leaves ($n_{obs}$) at 5. The OOBE no longer decreased once $n_{tree}$ exceeded 20 (Figure 3). As $n_{var}$ increased from 1 to 9, the OOBE first decreased and then increased, and the value of $n_{var}$ corresponding to the minimum OOBE was 4 (Figure 4). As $n_{obs}$ increased from 1, the OOBE likewise decreased at first and then increased, and the value of $n_{obs}$ corresponding to the minimum OOBE was 5 (Figure 5).

VIs Calculation Results
We found that the VIs of the inputs were positive and of the same order of magnitude (Figure 6), which suggests that the influences corresponding to the inputs are of comparable relevance to the PRP. Compared with the reference model, which can be used to analyze only the impacts of solar radiation and AT, the proposed model allows more essential influences on the PVSF to be analyzed quantitatively.
The most important input is AT, followed by AH, WS, etc., but it was quite unexpected that the inputs GHI, DHI, DNI, and POAI, which correspond to the influencing factor of solar radiation, would be of lower importance. We believe that the main reason for this is that the four variables were strongly correlated, so the importance of each individual variable was weakened.

Indicators Prediction Results
The monthly values of the reference yield $Y_r$ were statistically obtained from the measured dataset, so the predicted values are equal to the observed ones (Figure 7). The monthly output energy, final yield $Y_f$, and PR predicted by the proposed model are generally closer to the observed values than the predictions of the reference model (Figures 8-10), which indicates that the proposed model is more stable than the reference model. The monthly observed results are slightly lower than our predictions, most probably because some yield was inevitably lost to unexpected shutdowns of the system in practical operation. However, the predictions of the reference model are generally lower than the observed values, which may be because the empirical value of the transmission efficiency of the PV system commonly used in the reference model is lower than the real value.
The annual average values of $Y_f$, PR, and $Y_r$ (Table 4) over the lifecycle of the system, which were taken as the results of the indicators for analyzing the impact of the meteorological environment on the PVSF in this paper, were obtained based on the predicted results of the first year. The MAPE due to the intrinsic error of the RFR model is 0.26%, and the RMSPE is 0.21%. The MAPE of the result of the proposed prediction model is 3.06%, and the RMSPE is 0.9%. The MAPE of the result predicted by the reference model is 8.89%, and the RMSPE is 1.95%. The intrinsic error of the RFR model is smaller than the error caused by reducing the nonessential influences on the PRP, which is the other source of error in the proposed model. Although reducing the influences on the PRP has some negative impact on the accuracy and stability of the indicator predictions, the proposed model still performs better than the reference model.

Conclusions
In this study, a model to analyze the impact of the meteorological environment on the PVSF was proposed. To quantify this impact, a new indicator system was proposed by optimizing the two traditional indicators and introducing a new one. Compared with the traditional indicators, the proposed indicator system is more comprehensive and better suited to comparing the feasibilities of several PV systems with different capacities. It is hard to improve the indicator prediction accuracy by using the reference model. Moreover, since the input sets of existing short-term energy prediction models do not match the data available in the PVSF study report, the indicators cannot be predicted based on these models.
To predict the indicators, a prediction model based on multiple meteorological factors and the RFR model was proposed. In this model, to improve the indicator prediction accuracy by making the most of the data resources without incurring extra data measurement cost, the influences on the system performance were analyzed and reduced based on the data available in the PVSF study report, and several influences were taken as the essential ones, which correspond to the inputs of the indicator prediction model. According to the calculated VIs of the inputs, the set of essential influences and the corresponding inputs were found to adequately support the prediction of the indicators. To eliminate redundant data of the system to be evaluated, a data cleaning model applicable to the PVSF study stage was proposed. In the experiment, the original dataset was preprocessed by using the proposed data cleaning model, and a large number of redundant samples were cleaned. The error analysis demonstrates that the precision of the proposed prediction method is higher than that of the reference model.
In this paper, the indicators were predicted based on the data resources available in the PVSF study report. The accuracy of the prediction results depends not only on the prediction model, but also on the available data. Therefore, if the data resources in the PVSF study report are more abundant in future, the accuracy of the prediction results can be further improved.
Author Contributions: All of the authors have contributed to this research. F.X. and G.P. provided the research ideas and guided the revision of the draft; D.M. and H.S. performed the case studies and analysis and wrote and revised the paper. All authors have read and agreed to the published version of the manuscript.