Daily Photovoltaic Power Generation Forecasting Model Based on Random Forest Algorithm for North China in Winter

Sustainability 2020, 12, 2247

Abstract: North China is one of the country's most important socio-economic centers, but its severe air pollution is a huge concern. In this region, precisely forecasting the daily photovoltaic power generation in winter is essential to improve the equipment utilization rate and mitigate the effects of the power system on the environment. Considering the climatic characteristics of North China, the winter days are divided into three classifications. A forecasting model based on the random forest algorithm is then designed for each classification. To evaluate its performance, the proposed model and three other methods are separately used to forecast the daily power generation at the Zhonghe PV station, which is located in the center of North China. Empirical results show that, because of its ability to reduce the risk of overfitting by balancing decision trees, the proposed model obtains mean absolute percentage errors as low as 2.83% and 3.89% for clear and cloudy days, respectively. For days on which weather conditions are unusual, forecasting errors are relatively large. On these days, enlarging the training samples, performing subdivision, and imposing manual intervention can improve the forecasting precision. Generally, the proposed model is better than the other three methods for nearly all error evaluation indicators in each classification.


Introduction
North China, whose geographical boundary is shown in Figure 2 [1], is one of the country's most important socio-economic centers. In recent years, this region has accounted for over one quarter of China's population and has produced a similar share of the country's GDP [2].
In this region, vast amounts of electric power are consumed to meet the requirements of socio-economic development. In 2017, electricity consumption reached 1500.5 billion kWh, an increase of 70 billion kWh from 2016 [3]. To ensure the electric power supply, North China has built large-scale thermal power plants in the past several decades [4]. Given the region's fragile environment, the operation of thermal power plants further worsens the prevailing air pollution. Among China's 169 large- and medium-sized cities, the top 20 with the worst air conditions are all located in North China, and 24%-38% of different pollutants in this region are emitted from thermal power plants [5,6]. During winter, poor diffusion conditions of air pollutants, high power load, and the lack of other clean power sources significantly increase the effects of thermal power plants on air quality [7]. It is therefore necessary both to promote the development of photovoltaic (PV) power generation and to improve the utilization efficiency of the power generated by solar PV in North China.
North China has abundant solar resources. Its annual solar horizontal radiation is higher than 1050 kWh/m² [8]. With the advancement of technology and the decrease in electricity generation cost,

1. Topic: In this paper, a daily PV power generation forecasting model for North China in winter is proposed. The proposed forecasting model is based on the random forest (RF) algorithm and can obtain satisfactory forecasting results using small samples. The results of this study provide a reference for the sustainable development of PV generation in this area.
2. Influence factor selection: The unique winter climatic characteristics of North China were considered. In consideration of the serious air pollution in winter, "PM2.5" is specifically selected as one of the influence factors.
3. Weather classification: To ensure the operation of the model, weather classification analysis is used to handle the varied weather. By combining weather types with similar characteristics, the problem of balancing the number of categories against the sample number was solved.
4. Methodology: RF is applied to solar PV systems, whereas most previous research has focused on trend extrapolation methods, artificial intelligence methods, or support vector machines.
The remainder of this paper is organized as follows. Section 2 introduces the factors affecting PV power generation and weather classification. The principles of the method used for forecasting are described in Section 3. A case study of Zhonghe PV power station and result discussions are provided in Section 4. Conclusions and perspectives are summarized in the last section.

Influence Factor Selection
As mentioned in Section 1, weather conditions affect the electric power generated by PV stations to a large extent. Incorporating these key factors into the forecasting model is essential to improve the prediction precision. When selecting key factors, the technical requirements of PV stations and the unique climatic characteristics of North China in winter require consideration.
In this research, six factors are involved in the forecasting model.
(1) Total solar radiation. Total solar radiation is the sum of direct and diffuse radiation and is the basis of PV power generation. On this basis, this research takes total solar radiation as a selected factor.
(2) Mean atmospheric pressure and (3) wind speed. North China is located at the southeastern end of Eurasia. Winter in this area usually lasts from November to January, and weather conditions are significantly affected by the monsoon [29]. The winter monsoon, which is caused by the gradient between the Siberian high and Aleutian low pressures, is the most important atmospheric circulation in the Northern Hemisphere [30]. This research uses the daily mean atmospheric pressure and wind speed to describe the influence of the winter monsoon.
(4) Mean temperature. The winter monsoon brings strong cold air from Siberia to North China, and thus the winter temperature in this region is generally lower and more changeable than that in other areas of the same latitude [1]. Hence, the daily mean temperature is included in the forecasting model.
(5) Relative humidity. Most areas of North China belong to temperate and semi-humid zones. During winter, different levels of fog commonly appear due to vapor in the air. In this research, the daily mean of relative humidity is considered in the forecasting model.
(6) PM2.5 concentration. As mentioned in the preceding sections, air pollution further deteriorates in North China in winter. Air pollutants can affect the generation of PV power stations on the ground. PM2.5 concentration has considerable influence on the loss of power generation. This relationship has a coincidence level of over 90% [31]. In North China, PM2.5 is monitored once an hour in winter. This research excludes the nighttime monitoring results and only considers the daytime mean values of PM2.5 concentration as a variable in the forecasting model.
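The daytime averaging of hourly PM2.5 readings can be illustrated with a minimal Python sketch. The daytime window (08:00 to 17:00), the function name `daytime_mean_pm25`, and the sample readings are assumptions made for illustration only; the text does not state which hours count as daytime.

```python
from statistics import mean

# Hypothetical daytime window (hours 8 to 17 inclusive).
# The paper does not state the exact hours, so this is an assumption.
DAY_START_HOUR = 8
DAY_END_HOUR = 17


def daytime_mean_pm25(hourly_pm25):
    """Average the hourly readings that fall inside the daytime window.

    hourly_pm25 maps the hour of day (0-23) to a PM2.5 reading.
    Hours without a reading are simply absent from the mapping.
    """
    daytime = [value for hour, value in hourly_pm25.items()
               if DAY_START_HOUR <= hour <= DAY_END_HOUR]
    if not daytime:
        raise ValueError("no daytime PM2.5 readings available")
    return mean(daytime)


# Illustrative values only, not measurements from the paper:
# 100.0 at night, 40.0 during the assumed daytime window.
readings = {hour: 100.0 for hour in range(24)}
readings.update({hour: 40.0 for hour in range(8, 18)})
print(daytime_mean_pm25(readings))
```

Nighttime readings never enter the average, so a polluted night does not change the daily PM2.5 feature.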
This research selects the optimal values of the above parameters by comparing the performance of different parameter values. In addition, this research adopts the default values in the scikit-learn (Sklearn) module, which is written in the Python programming language [36]. Figure 1 shows the process of forecasting the daily PV power generation.
To build a forecast model, the collected raw data require initial cleaning and filtering. In other words, fragmentary data (mainly days with technical breakdowns) and evident outliers (mainly statistical errors) must be amended. The preprocessed observations are then classified in accordance with their weather conditions. In this research, an RF model is used to forecast PV power generation for each classification. After training, the forecasting results are obtained by inputting the weather conditions of the target days into the model. When evaluating the performance of the RF model, the forecasting results for the three classifications are considered comprehensively.

Performance Evaluation Indicators
To understand the performance of the forecasting model, error analysis indicators are necessary. This research uses the mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), and explained variance (EV).
MAE represents the mean of the absolute errors and is used to reflect the real situation of the forecast error. MAE is calculated as:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$
where $\hat{y}_i$ is the prediction, $y_i$ is the real value, and $n$ is the number of samples. The range of values for MAE is 0 to +∞. As MAE increases, the size of the error likewise increases.
MAPE is a straightforward measure of the prediction accuracy of a forecasting method, and is thus usually considered the fairest indicator [37]. On this basis, MAPE is usually considered the most important precision evaluation indicator. Dividing each error by the real value to provide an average result in percentage, MAPE is defined by:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$
RMSE is another common indicator used in the evaluation of the accuracy of the forecasting model. RMSE is the square root of the mean square error (MSE) and is more sensitive to large errors than MAPE. The formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
EV measures the part of the variation in a given data set that can be explained by a model. EV is calculated as:
$$\mathrm{EV} = 1 - \frac{\mathrm{Var}\left(y - \hat{y}\right)}{\mathrm{Var}\left(y\right)}$$
where Var is the variance of a sequence of values.
The largest value of EV is 1, which represents the best prediction.
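The four indicators can be sketched in plain Python as follows. The functions use the standard textbook definitions of MAE, MAPE, RMSE, and EV, with the population variance for EV; the function names and the sample values are illustrative assumptions, not results from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (real values must be nonzero)."""
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def _variance(values):
    """Population variance of a sequence of values."""
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """One minus the ratio of residual variance to the variance of the real values."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y_true)


# Illustrative daily generation values, not data from the paper.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
for name, func in [("MAE", mae), ("MAPE", mape), ("RMSE", rmse), ("EV", explained_variance)]:
    print(name, func(y_true, y_pred))
```

Because MAPE divides by the real value, a day with near-zero generation can dominate the average, which is one reason to read it alongside MAE and RMSE.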

Model Application
In this research, the power generation from the Zhonghe PV station is used to test the above forecasting method. The station is centered at 114.32°E, 37.45°N, in Xingtai city, Hebei province, in the center of North China (see Figure 2).
In addition to the daily power generation, meteorological data are also needed in the forecasting model. These data were obtained from the Xingtai Meteorological Bureau in Hebei province.
Following the process shown in Figure 1, before the raw data are classified and introduced into the forecasting model, they are preprocessed by cleaning and filtering to mend fragmentary information and eliminate statistical errors. In particular, the data on PM2.5 and total solar radiation are preprocessed. In the PM2.5 series, approximately 10% of the data were missing due to observation equipment failure and other reasons. To fill the missing values, the linear interpolation method is adopted in this research. Regarding the total solar radiation, despite the availability of automatic solar radiation recording, erroneous data remain because of the measuring instruments [38]. To replace problematic data, alternate values are obtained through different measures: (1) analysis of data on similar days; (2) calculation of mean values of observations in adjacent areas; and (3) linear interpolation.
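The linear interpolation step for missing values can be sketched as below. The function name `fill_missing_linear`, the use of `None` to mark missing readings, and the nearest-value fill for gaps at the ends of the series are assumptions of this sketch; the paper only states that linear interpolation is used.

```python
def fill_missing_linear(series):
    """Return a copy of series with None gaps filled by linear interpolation.

    Gaps at the start or end have only one known neighbor, so they are
    filled with the nearest known value (an assumption of this sketch).
    """
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series has no known values")
    filled = list(series)
    # Leading and trailing gaps: copy the nearest known value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: draw a straight line between the two known neighbors.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Illustrative PM2.5 values with missing readings, not data from the paper.
pm25 = [80.0, None, None, 110.0, None, 90.0]
print(fill_missing_linear(pm25))
```

Long runs of missing values are filled with a straight line, which smooths away any real variation in that interval, so the other two measures listed above may be preferable for extended outages.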
Finally, data from November to December 2016 and from January to November in 2017 are set as the training samples. The testing samples are composed of data from 1 November to 31 December 2018. The value distribution ranges of different indicators in the training and testing samples are shown in Table 1. As shown in Table 1, the training and testing samples have similar distribution ranges for most indicators. Thus, the RF model obtained from the training samples can be used for the forecasting analysis of the testing samples.
The next step is to classify the samples in both the training set and the test set. According to Section 2.2, both data sets are divided into three categories: clear days, cloudy days, and rainy or snowy days. The data in the same category of the training set and the test set correspond to each other. For instance, the data under the "clear days" classification in the test set are input into the RF model trained on the data under the same classification in the training set.
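The correspondence between categories in the two data sets amounts to a dispatch by weather label, sketched below. The label strings, the dictionary layout of a sample, and the constant-valued placeholder models are hypothetical and serve only to show the routing; they are not the trained RF models.

```python
from collections import defaultdict

# Hypothetical label strings for the three classifications.
WEATHER_CLASSES = ("clear", "cloudy", "rainy_or_snowy")


def split_by_class(samples):
    """Group samples by their weather class label."""
    groups = defaultdict(list)
    for sample in samples:
        if sample["weather"] not in WEATHER_CLASSES:
            raise ValueError(f"unknown weather class: {sample['weather']}")
        groups[sample["weather"]].append(sample)
    return dict(groups)


def forecast(models, test_samples):
    """Send each test sample to the model trained on the same class.

    models maps a class label to any callable that takes a feature list
    and returns a forecast of daily generation.
    """
    return [models[sample["weather"]](sample["features"]) for sample in test_samples]


# Placeholder models returning constants, standing in for trained RF models.
placeholder_models = {
    "clear": lambda features: 3.0,
    "cloudy": lambda features: 2.0,
    "rainy_or_snowy": lambda features: 1.0,
}
test_set = [
    {"weather": "clear", "features": [0.1, 0.2]},
    {"weather": "rainy_or_snowy", "features": [0.3, 0.4]},
    {"weather": "clear", "features": [0.5, 0.6]},
]
print(forecast(placeholder_models, test_set))
print({label: len(group) for label, group in split_by_class(test_set).items()})
```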
After classification, an RF model is established for each weather category. To ensure the good performance of the proposed model, ascertaining the optimal parameters is necessary after inputting the training samples into the corresponding model. As mentioned in Section 3.1, the proposed model can automatically search for the optimal parameters when given a rough range for the number of trees. On the basis of the sample size, this research limits the tree number to the range of 100-1000. The final optimal parameters under the different weather classifications are shown in Table 2. After the above steps are completed, the influencing factors in the test set are input into the corresponding trained RF model, and the prediction results are then obtained.
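As a rough sketch of the tree-number search for one weather category, assuming scikit-learn and NumPy are installed, the snippet below tries a few candidate values of `n_estimators` and keeps the best one. The use of `GridSearchCV`, three-fold cross-validation, MAE as the selection score, the candidate tree counts, and the synthetic data are all assumptions of this sketch; the paper only states that the tree number is searched within the range 100-1000 and that other parameters keep their defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def fit_rf_for_class(X_train, y_train, tree_counts=(100, 400, 700, 1000)):
    """Fit an RF regressor for one weather class, choosing the tree count by CV."""
    search = GridSearchCV(
        estimator=RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": list(tree_counts)},
        scoring="neg_mean_absolute_error",  # assumed selection criterion
        cv=3,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_["n_estimators"]


# Synthetic stand-in for one weather class: 60 days, six influence factors.
rng = np.random.default_rng(0)
X = rng.random((60, 6))
y = 50.0 + 100.0 * X[:, 0] + rng.normal(0.0, 2.0, size=60)

# A short candidate list keeps the example fast.
model, n_trees = fit_rf_for_class(X, y, tree_counts=(100, 200))
print(n_trees, model.predict(X[:3]))
```

One such model would be fitted per weather category, each on its own subset of the training samples.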

Forecasting Results Analysis
As mentioned in Section 4.1, the RF models are used to forecast the daily power generation of the Zhonghe PV station from 1 November to 31 December 2018. This period comprises 61 days. However, 11 days are discarded due to the serious loss of weather records and equipment maintenance, leaving only 50 days. To facilitate the following analysis, each day was assigned a code. These results are listed in Table 3. To understand the forecasting results, a line chart is used to demonstrate the forecasting performance, and a histogram is used to show the forecasting percentage errors. The line chart and histogram are shown in Figure 3a,b, respectively.
Figure 3a shows that, affected by weather conditions, the real power generation of the PV station changes sharply. In general, the forecasting results are close to the real values. Hence, the proposed model can be applied to this problem. Figure 3b shows that the three largest forecasting errors occur on the days coded 5, 14, and 46. Figure 3a shows that these days are all extremely rainy or snowy with the least power generation. As these weather conditions seldom appear, such training samples are limited and thereby produce the largest forecasting errors.
In fact, rainy or snowy days are rarer than other weather types in North China, and observations in this classification vary greatly. As a result, the forecasting error of rainy or snowy days is evidently larger than that of the other weather classifications. Enlarging the training samples and performing subdivision can improve the forecasting precision of this classification.
Figure 3b shows that, in addition to the rainy or snowy days, some other days also have large forecasting errors. The forecasting errors of the 11th, 25th, 30th, and several other days are larger than those of their neighbors. In accordance with Table 3, these days are all classification conversion days, e.g., the first day of a run of continuous cloudy days. Given that several weather indicators on these days are close to those of the neighboring classification, the trained RF model in this research cannot handle such a situation perfectly, which causes large forecasting errors. To improve the forecasting precision for these days, manual intervention is necessary. Specifically, the forecasting results of these days should be adjusted in accordance with their neighboring classifications.
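Candidate days for manual intervention could be flagged with a simple rule, sketched below: a day is treated as a classification conversion day when its weather class differs from that of the previous day in the sequence. This rule, the function name `conversion_days`, and the sample sequence are assumptions for illustration; the paper does not specify how conversion days are identified, and days removed from the record would need separate handling because coded days are not always consecutive dates.

```python
def conversion_days(day_classes):
    """Return the codes of days whose weather class differs from the previous day.

    day_classes is a list of (day_code, weather_class) pairs in date order.
    """
    flagged = []
    for (_, previous_class), (code, current_class) in zip(day_classes, day_classes[1:]):
        if current_class != previous_class:
            flagged.append(code)
    return flagged


# Illustrative sequence, not the classification from Table 3.
days = [
    (1, "clear"),
    (2, "clear"),
    (3, "cloudy"),
    (4, "cloudy"),
    (5, "rainy_or_snowy"),
    (6, "clear"),
]
print(conversion_days(days))
```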
According to statistics, the errors of PV power generation prediction using the random forest method are within the range of 8.5% to 12.3% [39]. Except for the three days explained above, the prediction errors of the random forest model established in this paper are almost all below this range.

Performance Evaluation
To evaluate the performance of the proposed model, the support vector regression (SVR) with a linear kernel, elastic net (EN), and gradient boosting decision tree (GBDT) models are also used to forecast the power generation of the Zhonghe PV station from 1 November to 31 December 2018. The training samples and their classifications are the same as those of the proposed model. The forecasting results of these three methods are listed in Appendix A. Figure 4 shows the comparison of the different methods for each classification.
Figure 4 shows that for clear and cloudy days, which have large samples, the forecasting results of RF, SVR, EN, and GBDT are relatively close. However, for rainy or snowy days with small samples, the forecasting performance of SVR and EN is less stable than that of the other models. Several results, e.g., those for the 41st, 43rd, and 46th days, have large errors. To evaluate the performance of the above models, the MAE, MAPE, RMSE, and EV are calculated for each classification and listed in Table 4.
As shown in Table 4, except for an EV slightly smaller than that of SVR on clear days, the RF-based method proposed in this research performs the best for all indicators in all classifications. The superiority of the proposed model is especially distinct for the MAPE on clear and cloudy days.

Conclusion
Better forecasting of the daily PV power generation in winter is essential to improve the equipment utilization rate and mitigate the effects of the power system on the environment in North China. Given that RF algorithms can lower the risk of overfitting by balancing decision trees in small samples, the winter days are divided into three classifications and an RF-based daily PV power generation forecasting model is built for each classification.
The proposed model and three other methods are separately used to forecast the daily power generation from 1 November to 31 December 2018 at the Zhonghe PV station, which is located in the center of North China. The empirical results lead to the following conclusions: (1) The values of MAPE, which is usually considered the most important precision evaluation indicator, of the proposed model for clear and cloudy days are as low as 2.83% and 3.89%, respectively. (2) On rainy or snowy days, which rarely appear in this region, the forecasting errors of the proposed model are larger than those under other weather conditions. On those days, the forecasting precision can be improved by enlarging the training samples and performing subdivision.

Conflicts of Interest:
The authors declare no conflict of interest.