Use of a Big Data Analysis in Regression of Solar Power Generation on Meteorological Variables for a Korean Solar Power Plant

Abstract: This study identified the meteorological variables that significantly affect the power generation of a solar power plant in Samcheonpo, Korea. To this end, multiple regression models were developed to estimate the plant's power generation under changing weather conditions. The regression models used daily meteorological data from January 2011 to December 2019. The dependent variable was the daily power generation of the solar power plant (kWh), and the independent variables were the insolation intensity during daylight hours (MJ/m²), daylight time (h), average relative humidity (%), minimum relative humidity (%), and quantity of evaporation (mm). One regression model for the entire dataset and 12 monthly regression models were constructed using R, a software environment for large-scale data analysis. The 12 monthly regression models estimated the solar power generation better than the model for the entire dataset. The variables with the greatest influence on solar power generation were the insolation intensity during daylight hours and the daylight time.


Introduction
Since 2017, the Korean government has pursued an energy policy that aims to raise the share of renewable energy to 20% of national gross power generation by 2030 [1]. Owing to this policy, the capacity of Korea's renewable energy power facilities increased approximately fivefold from 2010 to 2019. In 2019, renewable energy facilities accounted for 13% of the gross power generation facility capacity, and solar power facilities accounted for 67% of the gross renewable energy facility capacity [2].
Because it converts sunlight into electricity, solar power generation depends heavily on the weather conditions of the region where the facility is installed [3]. For example, insolation, which significantly influences solar power generation, fluctuates from month to month. Figure 1 shows the monthly insolation (MJ/m²) averaged from 2011 to 2019 over the entire region of Korea; it reaches its maximum in May and its minimum in December. Rain also affects solar power generation. The number of rainy days in September 2020 was approximately 1.7 times that in May 2020, the month in which the most solar power was generated that year [4].
Such irregular weather conditions make it difficult to ensure stable solar power generation. Because every power plant must respond in a timely manner to changing electricity demand, solar power plants should be able to predict the amount of power required in the near future and respond accordingly [5]. Accordingly, several studies have tried to predict the amount of solar power generation as accurately as possible. Solar power generation has been predicted using linear regression models [6,7], autoregressive models [8,9], and recurrent neural network models [10,11]. Prediction models can be divided into two categories, short-term and long-term, depending on whether the prediction period is longer than a day [12]. Short-term models can effectively predict near-future power generation, but long-term models are also needed to account for unexpected and extreme weather conditions such as a long rainy season [13].
Recently, researchers have adopted predictive modeling techniques such as "artificial neural networks," "fuzzy predictions," and "support vector regressions" [14]. However, most of these models have been unable to make accurate predictions because they did not have sufficient raw data, which means that the predictability of the models could be improved if more raw data are accumulated [13].
This study aimed to develop a model that easily predicts solar power generation from changes in meteorological variables and to identify the meteorological variables that significantly affect solar power generation in Korea. To achieve this objective, multiple regression analysis was applied to big data on solar power generation and on weather conditions around the area where the solar power plant was installed. Multiple regression analysis has the advantage that variables can easily be added to or removed from the model during the regression process, which allows quick calculation [15]. The regression analysis was performed with packages in R, a software environment for large-scale data analysis [16].
In this study, two types of regression model were developed. First, a regression model for the entire dataset was developed, irrespective of month. Second, because insolation intensity in Korea varies considerably from month to month, 12 monthly regression models were developed to increase the predictability of the model. For the regression, the dependent variable was the quantity of solar power generated by a solar power plant in Samcheonpo, Korea, and the independent variables were the meteorological data provided by the Korean Meteorological Administration, which were screened sequentially during the regression analysis.

Selection of a Solar Power Plant for Analysis
Insolation intensity is a key determinant in selecting the site for a solar power plant [17], and it fluctuates with location and time [18]. To select the solar power plant for our analyses, we first identified the areas in Korea with high insolation intensity, which are suitable for installing solar power plants. Analysis of the average insolation at 14 sites over 20 years (1988-2007) showed that Mokpo and Jinju, located on the southwest coast of the Korean Peninsula, had the highest insolation in Korea [4]. Second, we investigated the solar power plants installed in the areas with the highest insolation. The Samcheonpo solar power plants, which are operated by the Korea Southeast Power Co., were then chosen because the data on solar power generation and the meteorological variables needed for further analyses could be secured. The white star in the red cone in Figure 2 shows the location of the Samcheonpo solar power plants in Goseong-gun, Gyeongsangnam-do, Korea.

As shown in Table 1, there are five units with a facility capacity of 0.1 MW in the Samcheonpo solar plants. Among these five units, unit #1 was chosen for our analyses, and the data on its solar power generation, provided by the Korea Southeast Power Co. in Sacheon, Gyeongsangnam-do, Korea, were gathered through the official government portal for public information release.

Meteorological Data
To develop regression models between the solar power generation (Y) and the meteorological variables (X_i), meteorological data from around the Samcheonpo solar power plants were needed. As there is no official weather station at the site of the Samcheonpo plants, the meteorological data were taken from the Jinju weather station, which is 20 km away in a straight line from the Samcheonpo plants [26]. The data provided by the Jinju weather station were obtained through the Korea Meteorological Administration's website.
The meteorological variables considered for the analyses were the insolation intensity at the peak time (MJ/m²), insolation intensity during daylight hours (MJ/m²), daylight time (h), average relative humidity (%), minimum relative humidity (%), and amount of evaporation (mm). A total of 19,623 meteorological data points were collected from January 2011 to December 2019 [27]. The variables "daylight time", "average relative humidity", and "minimum relative humidity" each had 3285 data points; the variables "insolation intensity at the peak time" and "insolation intensity during daylight hours" each had 3281; and the variable "amount of evaporation" had 3206. Figure 3 shows the degrees of correlation between solar power generation and several selected independent variables [28,29]. Figure 4 shows the degrees of correlation between the power generation and the meteorological variables for the datasets collected over 9 years (2011-2019). As shown in Figure 4, three variables (the insolation intensity at the peak time, the insolation intensity during daylight hours, and the daylight time) were highly positively correlated with power generation.
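As a rough illustration of the degree of correlation discussed above, the following Python sketch computes a Pearson correlation coefficient. It is not the authors' R workflow, and it assumes that "degree of correlation" means the Pearson coefficient, which the text does not state. The function name `pearson_r` and all numeric values are made up for illustration and are not data from the plant or the Jinju weather station.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values (illustration only, not measured data).
insolation = [5.0, 10.0, 15.0, 20.0, 25.0]       # MJ/m^2
humidity = [80.0, 70.0, 75.0, 60.0, 55.0]        # %
generation = [110.0, 205.0, 290.0, 410.0, 495.0]  # kWh

r_insolation = pearson_r(insolation, generation)  # strongly positive here
r_humidity = pearson_r(humidity, generation)      # negative here
```

With real data, days with missing values would need to be dropped first, since the variables above have different numbers of observations.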


Analysis Method
In this paper, multilinear regression analysis was applied to determine the causal relationship between the independent variables (X_i; the meteorological data) and the dependent variable (Y; the solar power generation), because there were several independent variables. Regression analysis estimates the value of the dependent variable by substituting the values of the independent variables. Accordingly, solar power generation can be estimated using a multilinear regression equation of the multiple meteorological variables as follows [30,31]:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \qquad (1)$$

where β_0, β_1, β_2, ..., β_p are the regression coefficients and ε is the error term. Equation (1) estimates the value of the dependent variable once the values of the regression coefficients have been estimated. Each regression coefficient is interpreted as the extent to which the corresponding independent variable affects the dependent variable.
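The regression coefficients are estimated by minimizing the error sum of squares (SSE): the SSE is partially differentiated with respect to each regression coefficient, and the derivatives are set to zero. The SSE is represented as follows:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2)$$

where e_i = y_i − ŷ_i is the deviation of the i-th regression estimate from the measured value.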
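The least-squares estimation described above can be sketched in plain Python. Setting the partial derivatives of the SSE to zero yields the normal equations (XᵀX)β = Xᵀy, which the sketch solves by Gaussian elimination. This is an illustration under stated assumptions, not the authors' R code: the function names (`solve_linear_system`, `fit_ols`, `predict`, `sse`) are hypothetical, and the data are synthetic values generated from y = 10 + 20·x1 + 5·x2.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] minimizing the SSE of y ~ b0 + sum(bj * xj)."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Evaluate the fitted regression equation for one observation."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))


def sse(coeffs, rows, y):
    """Error sum of squares of the fitted model."""
    return sum((yi - predict(coeffs, r)) ** 2 for r, yi in zip(rows, y))


# Synthetic data built from y = 10 + 20*x1 + 5*x2 (not measured data).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [40.0, 55.0, 90.0, 105.0, 140.0]
coeffs = fit_ols(rows, y)  # approximately [10, 20, 5]
```

Solving the normal equations directly is fine for a handful of predictors; library routines based on QR decomposition are usually preferred when predictors are nearly collinear.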
In the regression analysis, the coefficient of determination (R²) is used to evaluate the goodness of fit, that is, the explanatory power of the independent variables for estimating the dependent variable. The coefficient of determination is given as follows:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{Variation explained by the regression line}}{\text{Total variation}} \qquad (3)$$

where ŷ_i − ȳ indicates the difference between the estimated dependent variable value and the sample mean. R² ranges from 0 to 1; the closer it is to 1, the larger the share of the total variation that the regression model explains. However, because multiple regression analysis has two or more independent variables, it is necessary to consider the adjusted coefficient of determination (adjusted R²), which compensates for the tendency of R² to increase as the number of independent variables increases. The formula for the adjusted coefficient of determination is as follows:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} \qquad (4)$$

where (n − p − 1) is the degrees of freedom, n is the number of samples, and p is the number of independent variables. In addition, the p-value and multicollinearity were diagnosed to confirm statistical significance. Finally, to evaluate the accuracy of the derived regression model, we used the R², adjusted R², and root mean square error (RMSE) values.
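The two goodness-of-fit measures can be sketched in Python as follows. The sketch assumes the usual definitions: R² as explained variation over total variation, and adjusted R² as 1 − (1 − R²)(n − 1)/(n − p − 1), with n and p as defined above. The function names are hypothetical. The sample size n = 3200 in the example is an assumed round figure, because the number of complete daily records used in the fit is not stated here; R² = 0.7738 is the value reported for the revised model, and p = 5 is the number of independent variables remaining in it.

```python
def r_squared(y, y_hat):
    """Explained variation divided by total variation."""
    mean_y = sum(y) / len(y)
    explained = sum((f - mean_y) ** 2 for f in y_hat)
    total = sum((v - mean_y) ** 2 for v in y)
    return explained / total


def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for the number of independent variables p."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than coefficients")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# n = 3200 is an ASSUMED sample size, used only for illustration.
adj = adjusted_r_squared(0.7738, n=3200, p=5)
```

With thousands of samples and only five predictors, the penalty is small, so the adjusted value falls only slightly below R². For an ordinary least-squares fit with an intercept, the ratio computed by `r_squared` equals 1 − SSE/SST; for other predictors the two quantities can differ.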
The p-value (probability of significance) determines whether the null hypothesis or the alternative hypothesis is adopted. To reject the null hypothesis that "the independent variables do not affect the dependent variable," the p-value must be less than 0.05; rejecting the null hypothesis allows the alternative hypothesis to be adopted.
Multicollinearity exists when independent variables are strongly correlated with one another, so that the variability they explain overlaps. This leads to poor interpretation of the regression analysis results and decreases their accuracy, and it therefore requires diagnosis. Multicollinearity can be diagnosed using variance inflation factors (VIFs), which are determined using Equation (5) [32]:

$$VIF_j = \frac{1}{1 - R_j^2} \qquad (5)$$

where R_j² is the coefficient of determination obtained by regressing the j-th independent variable on the remaining independent variables.
If the VIF is greater than 10, then the variable possesses multicollinearity and should be excluded from the regression analysis.
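The VIF screening can be sketched in plain Python. The sketch assumes the standard definition of the VIF: each independent variable is regressed on the remaining ones, and VIF = 1/(1 − R²) of that auxiliary regression, which equals SST/SSE for a least-squares fit with an intercept. The function names, the variable names, and the numbers are all hypothetical; the first two columns are constructed to be nearly proportional so that they trip the threshold of 10.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear predictors")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * v for d, v in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def vif(columns):
    """VIF of each predictor column, regressed on all the other columns."""
    result = []
    for j, target in enumerate(columns):
        others = [c for k, c in enumerate(columns) if k != j]
        rows = list(zip(*others))
        b = fit_ols(rows, target)
        fitted = [b[0] + sum(c * v for c, v in zip(b[1:], r)) for r in rows]
        mean_t = sum(target) / len(target)
        ss_total = sum((t - mean_t) ** 2 for t in target)
        ss_resid = sum((t - f) ** 2 for t, f in zip(target, fitted))
        # 1 / (1 - R_j^2) equals SST / SSE for an OLS fit with intercept.
        result.append(float("inf") if ss_resid == 0 else ss_total / ss_resid)
    return result


# Hypothetical predictors (illustration only, not measured data).
names = ["peak_insolation", "daylight_insolation", "evaporation"]
peak_insolation = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
daylight_insolation = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]  # about 2 x peak
evaporation = [3.0, 1.0, 2.0, 2.0, 1.0, 3.0]

vifs = vif([peak_insolation, daylight_insolation, evaporation])
flagged = [name for name, v in zip(names, vifs) if v > 10]
```

When two variables are flagged together, dropping one of them and recomputing the VIFs is the usual next step, rather than dropping both at once.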
The RMSE is commonly used when considering the difference between the estimated and measured values, and it is suitable for expressing precision. The smaller the error, the better the performance of the regression model.
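A minimal Python sketch of the RMSE follows. It assumes the common definition, the square root of the mean squared difference between measured and estimated values, with division by the number of samples n; the text does not say whether a degrees-of-freedom correction was used. The function name and the sample values are hypothetical.

```python
import math


def rmse(measured, estimated):
    """Root mean square error between measured and estimated values."""
    if len(measured) != len(estimated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(m - e) ** 2 for m, e in zip(measured, estimated)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical daily generation values in kWh (illustration only).
measured = [100.0, 200.0, 300.0]
estimated = [110.0, 190.0, 300.0]
error = rmse(measured, estimated)  # sqrt((100 + 100 + 0) / 3)
```

Because the RMSE carries the unit of the dependent variable (kWh here), it is only comparable across models fitted to the same quantity.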

Results and Discussion
Table 2 lists the coefficients of the regression model for the entire dataset, which relates the solar power generation over the past 9 years to the meteorological variables. As shown in Table 2, the VIF values of two independent variables, namely the insolation intensities at the peak time and during daylight hours, exceed 10, which indicates multicollinearity between these two variables. Thus, the insolation intensity at the peak time was excluded from the regression model. The revised regression model and its statistics, obtained after removing the insolation intensity at the peak time, are summarized in Tables 3 and 4, respectively. For the regression model given in Table 3, R² and adjusted R² were 0.7738 and 0.7735, respectively. A p-value of less than 0.05 is generally interpreted as statistically significant. As the p-value is less than 0.05, the null hypothesis can be rejected and the alternative hypothesis adopted [33]. In other words, the equation using the coefficients in Table 3 can be used as a multiple linear regression model.
To account for the monthly differences in insolation, we derived 12 monthly regression models. Table 5 lists the regression models for the monthly solar power generation, which was averaged from 2011 to 2019. The R² values of the monthly regression models in Table 5 are larger than that in Table 4, which means that the goodness of fit of the monthly models was better than that of the model for the entire dataset. Therefore, the monthly regression models estimated solar power generation better than the regression model for the entire dataset. The regression model with the highest accuracy was that for January, and the one with the lowest accuracy was that for December. Figure 5 compares the actual daily power generation of the Samcheonpo power plant in 2019 with the values predicted by the monthly regression models in Table 5. Table 5 also shows that two variables, the insolation intensity during daylight hours (X_1) and the daylight time (X_2), were the dominant variables affecting solar power generation. For January, June to September, and November to December, the insolation intensity during daylight hours (X_1) was the most dominant meteorological variable. For February to May and October, the daylight time (X_2) was the most dominant meteorological variable. Interestingly, the evaporation quantity had an appreciable impact on solar power generation in January and November.
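The idea of fitting one model per calendar month can be sketched in Python as follows. To keep the example short, it is deliberately simplified to a single predictor (insolation) with the closed-form least-squares solution, whereas the models in Table 5 use several meteorological variables. The function names and the records are hypothetical and are not plant data.

```python
from collections import defaultdict
from datetime import date


def fit_simple(x, y):
    """Least-squares (intercept, slope) for y ~ b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_monthly(records):
    """Group (date, insolation, generation) records by month and fit each group."""
    by_month = defaultdict(lambda: ([], []))
    for day, insolation, generation in records:
        xs, ys = by_month[day.month]
        xs.append(insolation)
        ys.append(generation)
    return {m: fit_simple(xs, ys) for m, (xs, ys) in sorted(by_month.items())}


# Hypothetical daily records: (date, insolation in MJ/m^2, generation in kWh).
records = [
    (date(2019, 1, 5), 8.0, 170.0),
    (date(2019, 1, 12), 10.0, 210.0),
    (date(2019, 1, 20), 12.0, 250.0),
    (date(2019, 7, 3), 15.0, 240.0),
    (date(2019, 7, 14), 20.0, 315.0),
    (date(2019, 7, 25), 25.0, 390.0),
]
models = fit_monthly(records)  # {1: (intercept, slope), 7: (intercept, slope)}
```

Grouping by `day.month` pools the same calendar month across all years, which mirrors how a monthly model would draw on nine years of daily data for that month.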

Conclusions
This study investigated the correlation between solar power generation and meteorological variables by deriving multiple linear regression models. For this purpose, R, a software environment for large-scale data analysis, was applied to the solar power generation and meteorological datasets. In the regression models, solar power generation was set as the dependent variable and the meteorological variables as the independent variables. The independent variables first considered were the insolation intensity at the peak time (MJ/m²), insolation intensity during daylight hours (MJ/m²), daylight time (h), average relative humidity (%), minimum relative humidity (%), and evaporation amount (mm). Through statistical analysis, the insolation intensity at the peak time was excluded from the further regression